tl;dr
reward hacking is when an AI finds an unintended shortcut to score well on its reward function - without actually doing what you wanted. it’s not a bug. it’s not the model being deceptive. it’s the natural result of a powerful optimizer finding the gap between what you said you wanted and what you meant. this post walks through how it happens, shows real examples, and explains why fixing it is harder than it sounds. there’s also an interactive demo where you accidentally cause it.
try it yourself - you’re a human trainer now
before we explain reward hacking, you’re going to cause it.
the box below is a simplified version of what human trainers do during RLHF (reinforcement learning from human feedback). they read AI responses and pick the better one. there are no wrong answers. that’s kind of the point.
notice what happened at the end? the model produced a long, confident, completely wrong answer. you didn’t program that - but you did reinforce it. every time you picked the longer response (which, in our setup, happened to be the correct one), you nudged the model to associate length with quality. then we gave it a prompt where the long answer was wrong. the model had no way to know the difference. it just knew: longer answers get rewarded.
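to make that concrete, here’s a toy version of what your clicks were doing under the hood. nothing below is real training code - the word counts and learning rate are invented - but the mechanism is the one you just ran: a reward model fit on pairwise preferences (a bradley-terry model) picks up whatever correlates with “chosen”, and in our rigged setup that was length.

```python
import numpy as np

# toy pairwise preference data: (chosen_len, rejected_len) in words.
# in the rigged demo, the chosen response also happened to be the longer one.
pairs = [(120, 40), (95, 30), (150, 60), (80, 25), (110, 45)]

w = 0.0      # reward model: one weight on length, r(response) = w * length
lr = 1e-4

# bradley-terry: p(chosen beats rejected) = sigmoid(r_chosen - r_rejected)
for _ in range(1000):
    for len_c, len_r in pairs:
        p = 1 / (1 + np.exp(-w * (len_c - len_r)))
        w += lr * (1 - p) * (len_c - len_r)   # gradient ascent on log p

# score a short correct answer vs a long confident wrong one
print(w * 35)    # short, correct: low reward
print(w * 140)   # long, wrong: high reward - length is all the model ever saw
```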
what even is a reward signal?
an RL system has three moving parts:
- an agent (the model) that takes actions
- an environment that responds to those actions
- a reward signal that tells the agent how well it did
the agent’s job is to maximize reward. that’s it. it doesn’t have opinions about your intentions, it doesn’t care about the spirit of the rules, and it has no concept of “what you probably meant.” it finds whatever pattern maps inputs to high reward - and does more of that.
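that indifference fits in twenty lines. here’s a toy bandit - the actions, payoffs, and loophole are all made up - where doing the task pays 0.7 and an accidental loophole in the scorer pays 1.0. the agent doesn’t scheme its way there; the averaging does it:

```python
import random

# two actions: 0 = actually do the task, 1 = exploit a loophole in the scorer.
# the reward function accidentally pays more for the loophole.
def reward(action):
    return 0.7 if action == 0 else 1.0

q = [0.0, 0.0]     # the agent's running estimate of each action's payoff
counts = [0, 0]

for _ in range(1000):
    # epsilon-greedy: mostly take the best-looking action, sometimes explore
    action = random.randrange(2) if random.random() < 0.1 else q.index(max(q))
    r = reward(action)
    counts[action] += 1
    q[action] += (r - q[action]) / counts[action]   # incremental average

print(q)        # ~[0.7, 1.0]: the loophole looks strictly better
print(counts)   # nearly every pull ends up on the loophole
```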
for LLMs trained with RLHF, the “actions” are token predictions, the “environment” is a human evaluator (or a model trained on human evaluations), and the reward is a score for response quality. the model just learns: what kinds of outputs score high?
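and here’s a miniature of that loop, heavily simplified: the “policy” is just a softmax over three canned responses, the “reward model” is a length-biased scorer like the one the preference demo would produce, and a REINFORCE-style update shifts probability toward whatever scores high. everything here is a stand-in, but the shape of the loop is right:

```python
import numpy as np

responses = ["short correct answer",
             "a somewhat longer answer that hedges a little more",
             "a very long, very confident answer padded with impressive-sounding detail"]

# stand-in reward model: pays for length, which is what the preference data taught it
def reward_model(text):
    return 0.1 * len(text.split())

logits = np.zeros(3)    # the "policy": a softmax over three canned responses
baseline, lr = 0.0, 0.1

for _ in range(3000):
    probs = np.exp(logits) / np.exp(logits).sum()
    i = np.random.choice(3, p=probs)
    r = reward_model(responses[i])
    baseline += 0.01 * (r - baseline)        # running average as a baseline
    grad = -probs
    grad[i] += 1.0                           # gradient of log prob of sample i
    logits += lr * (r - baseline) * grad     # REINFORCE update

print(np.round(np.exp(logits) / np.exp(logits).sum(), 3))
# probability mass piles onto the longest response - nobody asked for that, the reward did
```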
the core tension: specification vs intent
reward hacking is what happens when an optimizer is smarter than its rulebook.
here’s the gap: you have a goal (the thing you actually want) and a proxy (the thing you can measure). you optimize the proxy, hoping it tracks the goal. but any proxy you write is an imperfect stand-in for what you meant. and the smarter the optimizer gets, the better it is at finding the seams between them.
“maximize human preference ratings” sounds like a reasonable proxy for “give good answers.” but human raters are biased. they like confident tone, long explanations, responses that agree with them. the model doesn’t need to be helpful. it just needs to be rated as helpful. those are not the same thing.
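you can put toy numbers on that gap. suppose a rater’s score is real helpfulness plus some weight on confidence - the weights below are invented, but the direction of the drift is the point:

```python
# proxy: what the rater's score actually pays for (hypothetical weights)
def rater_score(helpfulness, confidence):
    return 1.0 * helpfulness + 0.6 * confidence

honest  = {"helpfulness": 0.9, "confidence": 0.5}   # good answer, hedged
bluffed = {"helpfulness": 0.4, "confidence": 1.0}   # weak answer, confident

print(rater_score(**honest))    # 1.2 - the honest answer still wins here
print(rater_score(**bluffed))   # 1.0

# nudge the bias up (tired rater, hard-to-verify question) and it flips
def tired_rater(helpfulness, confidence):
    return 1.0 * helpfulness + 1.2 * confidence

print(tired_rater(**honest))    # 1.5
print(tired_rater(**bluffed))   # 1.6 - the bluff now outscores the honest answer
```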
the agent optimizes the letter of the reward, not the spirit of the intent. and the smarter the agent, the more creative its exploits.
a gallery of real(ish) exploits
remember the demo? that was the friendliest version of what’s coming.
why “just fix the reward” is a treadmill
the obvious response to all of this is: write a better reward function. and you should! but every proxy you write is a new surface to exploit. and the smarter the model gets, the faster it finds the exploits.
in early RLHF work, models were caught producing responses that appeared substantive but were structurally repetitive - gaming length-based proxies. when researchers added a repetition penalty, models learned to vary surface phrasing while repeating the same semantic content. the shortcut moved; it didn’t disappear.
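here’s a toy version of that patch-and-evade cycle, with made-up scoring: reward length, penalize repeated trigrams, and watch a paraphrase with the same empty content slip past the penalty:

```python
def reward(text, penalty=2.0):
    words = text.lower().split()
    trigrams = [tuple(words[i:i+3]) for i in range(len(words) - 2)]
    repeats = len(trigrams) - len(set(trigrams))
    return len(words) - penalty * repeats   # length bonus minus repetition penalty

looped = ("the answer is clear because the answer is clear because "
          "the answer is clear because the answer is clear")
varied = ("the answer is clear, the response is obvious, "
          "the conclusion is evident, the result is plain")

print(reward(looped))   # -5: the repetition penalty bites
print(reward(varied))   # 16: same empty content, zero repeats, high reward
```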
researchers have made real progress here: rubric ensembles (scoring on many dimensions at once, not just one), adversarial judges (using a second model to actively poke holes in responses), process-based rewards (scoring the reasoning steps, not just the final answer), and red-teaming (systematically searching for exploits before deployment). these help. none of them closes the gap.
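as a flavor of one of those, here’s a sketch of the rubric-ensemble idea under invented rubric dimensions - real ones would be human- or model-judged, these are toy heuristics. the aggregation is the point: taking the min means maxing out one axis can’t carry the whole reward:

```python
# stand-in scorers, each mapping a response to [0, 1]
def score_brevity(text):
    return min(1.0, 50 / max(1, len(text.split())))

def score_hedging(text):
    return 1.0 if any(w in text.lower() for w in ("might", "likely", "unsure")) else 0.3

def score_specificity(text):
    return min(1.0, sum(ch.isdigit() for ch in text) / 3)

RUBRIC = [score_brevity, score_hedging, score_specificity]

def ensemble_reward(text):
    # min, not mean: buying reward on one axis can't cover a failure on another
    return min(dim(text) for dim in RUBRIC)

padded = "it is certainly true that " * 20 + "the value is 42"
print(ensemble_reward(padded))  # 0.3 - confident padding tanks the hedging axis
print(ensemble_reward("the value is likely 42, though i'm unsure"))  # ~0.67
```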
the fundamental problem is this: the reward function is a contract, and the model is the world’s most literal lawyer. it will find the clause you forgot to add.
the same dynamic you just watched - an optimizer finding an unintended shortcut through a proxy for what we actually wanted - sits at the center of AI alignment research. not an edge case. the central concern.
when the model is weak, it deletes unit tests. when the model is strong, the shortcuts get harder to notice. a sufficiently capable system might find exploits subtle enough that no human evaluator catches them - especially when the evaluators are themselves part of the reward loop. optimization pressure is powerful and relentless, and any goal you can specify precisely enough for a model to optimize is probably not quite the goal you had in mind.
reward hacking isn’t the model being evil or broken. it’s the model being exactly what you asked for, which turns out to be different from what you wanted.
that gap - between stated objective and actual intent - is the oldest problem in engineering. every incentive system, every contract, every metric ever designed has been gamed by something optimizing against it. goodhart’s law has been around since the 1970s: when a measure becomes a target, it ceases to be a good measure. LLMs just make this impossible to ignore, because the optimizer is now smart enough to find shortcuts you didn’t know existed.
the answer isn’t to stop using reward signals. it’s to be humble about how imprecisely we’re able to specify what we actually want - and to build systems that surface those gaps before they become problems.