evaluating your model

training-runs · Apr 8, 2026 · 4 min read

after training, check whether your model actually improved. the console gives you three ways to evaluate: the eval tab (held-out metrics), comparison evals (head-to-head against other models), and the playground (interactive testing).

eval tab

the eval tab shows the same metrics as the train tab, but computed on your held-out evaluation dataset. this is the primary way to check for overfitting.

what to look for:

  • train reward rising, eval reward flat or declining: the model is overfitting to the training distribution. try a larger or more diverse training set, or stop training earlier.
  • train and eval rewards rising together: healthy. the model is generalizing.
  • eval reward higher than train: unusual but possible if your eval set is easier than your training set. check your data split.
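these signals can also be checked programmatically if you export the reward curves. a minimal sketch — the window size and the simple end-minus-start trend test are illustrative choices, not what the console computes:

```python
def overfitting_signal(train_rewards, eval_rewards, window=5):
    """Classify the train/eval reward relationship over the last `window` steps."""
    def trend(series):
        recent = series[-window:]
        return recent[-1] - recent[0]

    train_rising = trend(train_rewards) > 0
    eval_rising = trend(eval_rewards) > 0

    if train_rising and eval_rising:
        return "healthy: train and eval rising together"
    if train_rising and not eval_rising:
        return "overfitting: train rising, eval flat or declining"
    return "inconclusive: inspect the curves directly"
```

a smoothed slope (e.g. a moving average) would be less noisy than the raw endpoint difference; this keeps the sketch short.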

metrics include average reward, response lengths, max reward, and solve rate in a 2x2 grid. sub-reward components (e.g. citation precision, judge scores) appear in collapsible sections below.

comparison evals

comparison evals let you run your evaluation dataset through an external model (GPT-4, Claude, etc.) and compare results side by side.

running a comparison

  1. go to the comp tab on your training run page.
  2. select a model from the dropdown.
  3. click start batch eval.

the batch runs asynchronously. progress updates every few seconds. once complete, you’ll see:

  • bar chart: your model (green) vs the comparison model (gray) on each reward component.
  • performance summary: a percentage delta with a label (“outperformed”, “closing in”, or “comparison model leads”).
  • per-rollout comparison table: expand any row to see the full conversation for both models side by side.

you can run multiple comparisons against different models. toggle between them using the chips above the chart.
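the performance summary boils down to a percentage delta plus a label. a sketch of that logic — the formula and the threshold for "closing in" are assumptions for illustration, not the console's exact rule:

```python
def performance_summary(my_reward, comparison_reward, closing_margin_pct=5.0):
    """Percentage delta of our model vs. the comparison model, with a summary label."""
    delta_pct = (my_reward - comparison_reward) / abs(comparison_reward) * 100
    if delta_pct > 0:
        label = "outperformed"
    elif delta_pct > -closing_margin_pct:
        label = "closing in"
    else:
        label = "comparison model leads"
    return round(delta_pct, 1), label
```

for example, 0.9 vs 0.8 average reward reads as a +12.5% delta with "outperformed".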

interpreting results

the comparison uses the same reward function as training. if your model scores higher than the base model on the eval set, it learned something useful. if it scores lower than an external model, try more training or a better reward signal.

reward scores are relative to your reward function, not absolute quality. a model scoring 0.8 on a judge-based reward may produce better answers than one scoring 0.9 on exact match. it depends on what your reward function measures.

playground

the playground lets you chat with your trained model interactively. useful for qualitative testing: checking tone, reasoning quality, and edge cases that metrics won’t capture.

model types

| type | what it connects to | when to use |
| --- | --- | --- |
| debug | live tunnel to in-progress training | test the model mid-training |
| eval | latest checkpoint on castform inference | test after training completes |
| external | GPT-4, Claude, etc. | manual comparison |

the playground supports up to 4 conversation turns and 8 tool calls per turn. responses stream in real time.

using the playground

  1. go to the playground tab on your training run page.
  2. select a model type from the dropdown.
  3. type a prompt and send.

if your environment has tools (e.g. a search tool for RAG), the model can call them during the conversation. tool calls and results appear inline in the chat.
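conceptually, one tool-enabled turn is a loop: the model either emits a tool call (which is executed and fed back) or a final answer, bounded by the per-turn tool-call limit. a sketch with hypothetical `call_model` and `run_tool` helpers — not console internals:

```python
MAX_TOOL_CALLS_PER_TURN = 8  # the playground's per-turn limit

def run_turn(messages, call_model, run_tool):
    """One conversation turn: let the model call tools until it gives a final answer."""
    for _ in range(MAX_TOOL_CALLS_PER_TURN):
        reply = call_model(messages)
        if reply.get("tool_call") is None:
            messages.append({"role": "assistant", "content": reply["content"]})
            return messages
        # tool call and its result appear inline in the transcript
        messages.append({"role": "assistant", "tool_call": reply["tool_call"]})
        messages.append({"role": "tool", "content": run_tool(reply["tool_call"])})
    return messages  # tool-call budget exhausted without a final answer
```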

what to evaluate

| question | where to look |
| --- | --- |
| is the model overfitting? | compare train tab vs eval tab metrics |
| is it better than the base model? | run a comparison eval against the base |
| is it better than GPT-4/Claude? | run a comparison eval against an external model |
| does it handle edge cases? | test manually in the playground |
| is it using tools correctly? | check tool calls in rollout details and playground |
| is the model gaming the reward? | inspect high-reward rollouts for reward hacking |
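for the reward-hacking check, the highest-reward rollouts are where exploits hide. a sketch of surfacing them for manual review — the rollout dict shape (`"id"`, `"reward"`) is an assumption for illustration:

```python
def top_rollouts_for_inspection(rollouts, k=5):
    """Return the k highest-reward rollouts for manual reward-hacking review."""
    return sorted(rollouts, key=lambda r: r["reward"], reverse=True)[:k]
```

read the transcripts of the top few: a near-perfect score paired with a degenerate answer (repetition, keyword stuffing, copied context) is the classic signature of reward hacking.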

next steps