rewards

environment Mar 19, 2026 1 min read

rewards define what “good” behavior means.

example reward

async def chunk_overlap_reward(completion: str, ground_truth: str, **kwargs) -> float:
    reference_chunks = kwargs.get("reference_chunks", [])
    reference_text = " ".join([c.get("content", "") for c in reference_chunks])
    overlap = calculate_text_overlap(reference_text, completion)
    return overlap if overlap >= 0.25 else 0.0

wire it in compute_reward

async def compute_reward(self, rollout_id, completion, ground_truth, **kwargs):
    return {
        "chunk_overlap": await chunk_overlap_reward(completion, ground_truth, **kwargs)
    }

tips

  • return floats in [0.0, 1.0].
  • sparse rewards can work when criteria are clear.
  • watch for reward hacking during training run monitoring.