monitoring training runs
castform also lets you monitor the progress of your model training. this is crucial for making sure the model is learning useful behaviors. you can monitor training in three ways:
- quantitative metrics
- completions produced by the model for training prompts
- chatting with the model during training
quantitative metrics
rl training samples multiple completions per prompt. these metrics help you understand whether your model is learning to solve the task:
- reward: average reward across the k sampled completions for a prompt. tracks overall quality.
- pass@k: did at least one of the k completions earn non-zero reward? a binary signal of whether the model can solve the prompt at all.
- max@k: best reward achieved across the k completions for a prompt. shows the model’s peak capability.
why max@k and pass@k matter: they track how many distinct prompts your model can solve. if average reward climbs but pass@k stays flat, you’re overfitting to a subset of prompts. you want both climbing - the model should expand its coverage of the training distribution.
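as a concrete sketch, here is how these three metrics could be computed from the per-completion rewards of one training step. the `rollouts` structure below - a mapping from prompt id to the k rewards sampled for it - is a hypothetical example, not castform's actual data model:

```python
def step_metrics(rollouts: dict[str, list[float]]) -> dict[str, float]:
    """compute reward, pass@k, and max@k, each averaged over prompts.

    rollouts maps each prompt id to the rewards of its k sampled completions.
    """
    n = len(rollouts)
    reward = sum(sum(rs) / len(rs) for rs in rollouts.values()) / n
    pass_at_k = sum(any(r > 0 for r in rs) for rs in rollouts.values()) / n
    max_at_k = sum(max(rs) for rs in rollouts.values()) / n
    return {"reward": reward, "pass@k": pass_at_k, "max@k": max_at_k}

# e.g. two prompts, k=3 completions each
metrics = step_metrics({
    "p1": [0.0, 1.0, 0.5],   # solved: at least one completion got non-zero reward
    "p2": [0.0, 0.0, 0.0],   # unsolved
})
# here pass@k is 0.5 (one of two prompts solved) while average reward is 0.25
```

note how pass@k and average reward can diverge: a prompt with a single lucky high-reward completion still counts fully toward pass@k.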
how metrics should progress over time
early training: at the start of a training run, rewards will fluctuate and metrics may be noisy. this is completely normal - the model is still learning basic patterns and its outputs are unstable. give it some time (usually a few dozen steps) and the signal will clean up and you’ll start seeing clear trends.
exploration before exploitation: ideally, you want to see pass@k climbing first before average reward increases significantly. this means your model is exploring the solution space and learning to solve more prompts (breadth) before it optimizes for higher rewards on those prompts (depth). if average reward shoots up while pass@k stays low, you’re likely exploiting a narrow set of easy prompts rather than building robust capabilities.
healthy training trajectory: in a well-configured training run, expect pass@k to increase early as the model learns to solve more distinct prompts. average reward should follow but lag behind. eventually both should plateau as the model saturates your training distribution.
warning signs:
- pass@k flat while average reward increases → model is overfitting to a narrow subset of prompts
- both metrics stagnate early → training distribution may be too hard, or the reward signal too sparse
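these two warning signs can be flagged automatically from recent metric history. the helper below is an illustrative sketch, not part of castform; the `window` and `eps` thresholds are arbitrary assumptions you would tune for your run:

```python
def warning_signs(pass_at_k: list[float], reward: list[float],
                  window: int = 50, eps: float = 0.02) -> list[str]:
    """flag the two failure modes above from per-step metric histories.

    compares each metric's change over the last `window` steps; `eps` is
    the minimum change counted as real movement. thresholds are illustrative.
    """
    flags = []
    dp = pass_at_k[-1] - pass_at_k[-window]   # change in pass@k
    dr = reward[-1] - reward[-window]         # change in average reward
    if dp < eps and dr > eps:
        flags.append("overfitting: reward climbing but pass@k flat")
    if dp < eps and dr < eps and len(reward) <= 2 * window:
        flags.append("stagnation: distribution too hard or reward too sparse")
    return flags
```

run it periodically on logged metrics; an empty list means neither pattern is present over the chosen window.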
training completions
inspect what your model is actually generating during training. view the full prompt, completion text, and tool call traces for each rollout.
filtering rollouts: search by text content to find specific behaviors, or filter by reward score to identify patterns. filtering by reward is particularly useful - low-reward rollouts show where your model struggles, while high-reward ones reveal what’s working. this helps you spot systematic issues in your prompt distribution or reward function.
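the same kind of filtering can be done programmatically on exported rollouts. the records below are a hypothetical export format (a list of dicts), not castform's actual schema:

```python
# each rollout record: prompt text, completion text, and scalar reward
rollouts = [
    {"prompt": "add 2+2", "completion": "4", "reward": 1.0},
    {"prompt": "list files", "completion": "I cannot do that", "reward": 0.0},
    {"prompt": "add 3+5", "completion": "8", "reward": 1.0},
]

# filter by reward: low-reward rollouts show where the model struggles
struggles = [r for r in rollouts if r["reward"] == 0.0]

# search by text content to surface a specific behavior (here, refusals)
refusals = [r for r in rollouts if "cannot" in r["completion"]]
```

scanning the low-reward bucket for repeated patterns (refusals, truncation, malformed tool calls) is usually the fastest way to find reward-function or prompt-distribution bugs.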
tool traces: for agentic workflows, expand tool call traces to debug multi-step reasoning and see where the model goes wrong.
chat with your model during training
you can query the model as it’s being trained to get an intuitive sense of what it’s learning.
next steps
- evaluating your model: compare against baselines, run batch evals, test in the playground
- serving + sharing: deploy your model once training is complete