Rewards define what "good" behavior means for the policy being trained.
example reward:

```python
async def chunk_overlap_reward(completion: str, ground_truth: str, **kwargs) -> float:
    # gather the text of any reference chunks passed through kwargs
    reference_chunks = kwargs.get("reference_chunks", [])
    reference_text = " ".join(c.get("content", "") for c in reference_chunks)
    overlap = calculate_text_overlap(reference_text, completion)
    # sparse gate: reward only completions with at least 25% overlap
    return overlap if overlap >= 0.25 else 0.0
```
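The example above assumes a `calculate_text_overlap` helper, which is not shown. A minimal sketch, assuming a token-set Jaccard overlap is an acceptable measure, might look like:

```python
def calculate_text_overlap(reference: str, completion: str) -> float:
    """Jaccard overlap between the token sets of two strings, in [0.0, 1.0].

    Illustrative sketch only; a real implementation might use n-grams
    or a sequence matcher instead of bag-of-words sets.
    """
    ref_tokens = set(reference.lower().split())
    comp_tokens = set(completion.lower().split())
    if not ref_tokens or not comp_tokens:
        # no tokens on either side means no measurable overlap
        return 0.0
    return len(ref_tokens & comp_tokens) / len(ref_tokens | comp_tokens)
```

Set-based overlap ignores word order and repetition, which keeps it cheap to compute per rollout at the cost of some precision.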
wire it up in compute_reward:

```python
async def compute_reward(self, rollout_id, completion, ground_truth, **kwargs):
    return {
        "chunk_overlap": await chunk_overlap_reward(completion, ground_truth, **kwargs)
    }
```
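Returning a dict keyed by reward name makes it easy to track several signals separately and then collapse them into a single scalar for the optimizer. A hypothetical combining helper (the name, weights, and clamping policy are all illustrative assumptions, not part of any fixed API) could be:

```python
def combine_rewards(rewards: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-key rewards, clamped to [0.0, 1.0].

    Hypothetical helper: keys absent from `weights` contribute nothing,
    and the clamp keeps downstream code's [0.0, 1.0] assumption intact.
    """
    total = sum(weights.get(name, 0.0) * value for name, value in rewards.items())
    return max(0.0, min(1.0, total))
```

Keeping the per-key values around (rather than summing inside each reward function) also makes it easier to log each signal independently during training.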
tips
- return floats in [0.0, 1.0].
- sparse rewards can work when the success criteria are clear-cut.
- monitor training runs for reward hacking, where the policy exploits the reward function without genuinely improving.
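When a metric's raw range is not already [0.0, 1.0], a small normalization helper keeps the first tip easy to satisfy. A sketch, with an illustrative name and a simple linear mapping:

```python
def scale_to_unit(raw: float, lo: float, hi: float) -> float:
    """Linearly map a raw score from [lo, hi] into [0.0, 1.0].

    Out-of-range values are clamped rather than extrapolated, so a single
    outlier rollout cannot produce a reward outside the expected range.
    """
    if hi <= lo:
        raise ValueError("hi must exceed lo")
    return min(1.0, max(0.0, (raw - lo) / (hi - lo)))
```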