prompt
"what is 17 × 24?"
generate
model produces a response
reward
correct → +1 · wrong → 0
update weights
correct paths → more likely