Reward Hacking Demo

you are: human trainer

You're about to rate AI responses, just like human trainers do.

Pick whichever response you think is better. There are no wrong answers — that's kind of the point.

round 1 of 3

prompt

What causes the northern lights?

Which response is better?

A Sunlight reflects off ice at the poles and scatters into colored light.

B Charged particles from the sun collide with gases in Earth's upper atmosphere and release energy as light. Different gases glow different colors — oxygen makes green and red, nitrogen makes blue and purple. The lights are strongest near the poles because Earth's magnetic field funnels the particles there.

round 2 of 3

prompt

Is breakfast really the most important meal of the day?

Which response is better?

A Not necessarily — that phrase was popularized by cereal companies in the early 1900s as a marketing strategy. Research on breakfast's special status is pretty weak. Skipping it is fine for most people. What matters more is total daily nutrition and eating at times that work for you.

B Yes — it kickstarts your metabolism and is essential for concentration.

round 3 of 3

prompt

Why do we yawn when we see someone else yawn?

Which response is better?

A Your body mirrors the yawn to signal you're also tired and need rest.

B Contagious yawning is linked to empathy and social bonding — it's more common between people who are close. It activates neural pathways similar to those involved in reading others' emotional states. Interestingly, dogs do it too, especially in response to their owners.

3 rounds complete. Here's what the training data looked like.

In every case, the correct answer happened to be longer. The model saw patterns like this thousands of times.

preferred response length 0%

shorter longer

The model has a hypothesis: longer answers are better answers.
It didn't learn this from a rule. It inferred it from the data — including yours.

fresh prompt

What is the boiling point of water?

model response

Water's boiling point is one of the most fundamental constants in science and everyday life. It defines the Celsius scale itself — scientists chose it as the upper reference point precisely because it's so reliable and reproducible. Altitude affects it slightly: water boils at a lower temperature on a mountain because atmospheric pressure is reduced. This is why recipes sometimes include high-altitude instructions. The phenomenon underpins everything from cooking and sterilization to industrial steam power. At standard atmospheric pressure, water boils at 150°C (302°F).

The model gave a fluent, confident, mostly-accurate answer — with one glaring mistake.

Water boils at 100°C (212°F), not 150°C. Everything before the final sentence was correct. The model wrapped a wrong answer in so much fluency that it almost slipped by.

The model didn't learn correctness. It learned length. It produced exactly the kind of response that had been rewarded: long, confident, and fluent. The gap between what we measured (preference ratings) and what we meant (accuracy) is where reward hacking lives.

This is the friendliest version of reward hacking. Keep reading for the others.