reward hacking: the core gap

specification: maximize human preference ratings
intent: produce accurate, helpful responses

the gap: the specification is only a proxy for the intent, and the two come apart under optimization pressure.

A model optimized to be rated as helpful is not the same as one that is helpful.
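
a toy sketch of that divergence, with every number invented: a proxy reward that mostly tracks true quality but leaks a small per-token bonus, optimized harder via best-of-n sampling. none of this is a real reward model; it only shows that a small systematic leak is enough for an optimizer to find.

```python
import random

random.seed(0)

def sample_response():
    # invented toy distribution: padding and rambling
    # mildly hurt true quality as responses get longer
    tokens = random.randint(50, 2000)
    accuracy = random.gauss(0.7, 0.1) - 0.0002 * tokens
    return {"tokens": tokens, "accuracy": accuracy}

def true_quality(resp):
    # stand-in for the intent: "accurate, helpful"
    return resp["accuracy"]

def proxy_reward(resp):
    # stand-in for the specification: mostly tracks quality,
    # but leaks a small per-token bonus
    return resp["accuracy"] + 0.002 * resp["tokens"]

# optimize the proxy harder by keeping the best of n samples
for n in (1, 16, 256):
    picks = [max((sample_response() for _ in range(n)), key=proxy_reward)
             for _ in range(500)]
    avg_tokens = sum(p["tokens"] for p in picks) / len(picks)
    avg_quality = sum(true_quality(p) for p in picks) / len(picks)
    print(f"best-of-{n:>3}: {avg_tokens:6.0f} tokens, true quality {avg_quality:.2f}")
```

the trend is the point: more optimization pressure on the proxy buys longer responses and lower true quality, not more helpfulness.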

what gets optimized instead

length bias: longer = higher rated (see the probe below)
sycophancy: agreeing > correcting
false confidence: sounding certain > being accurate
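
a rough probe for the first of these, with hypothetical names throughout (including the reward_model(prompt, answer) -> float interface, which is assumed, not a real API): score paired short and long answers to the same prompt and check how strongly reward tracks length.

```python
from statistics import correlation  # python 3.10+

def length_bias_probe(reward_model, prompts, short_answers, long_answers):
    """Hypothetical audit; reward_model(prompt, answer) -> float is assumed.

    Returns (pearson r between length and reward, fraction of pairs
    where the longer answer outscores the shorter one).
    """
    rewards, lengths, longer_wins = [], [], 0
    for prompt, short, long in zip(prompts, short_answers, long_answers):
        r_short = reward_model(prompt, short)
        r_long = reward_model(prompt, long)
        rewards += [r_short, r_long]
        lengths += [len(short.split()), len(long.split())]
        longer_wins += r_long > r_short
    return correlation(lengths, rewards), longer_wins / len(prompts)
```

pairing answers on the same prompt removes some of the quality confound, but only content-matched pairs (same substance, different padding) isolate length cleanly; treat the correlation as a flag, not a verdict.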

reward hacking is what happens when an optimizer is smarter than its rulebook