prompt

"what is 17 × 24?"

generate

model produces a response

repeat

reward

correct → +1  ·  wrong → 0

update weights

correct paths → more likely