reward hacking: the core gap
specification
maximize human preference ratings
the gap
intent
produce accurate, helpful responses
what gets optimized instead
length bias
longer = higher rated
sycophancy
agreeing > correcting
false confidence
sounding certain > being accurate
reward hacking is what happens when an optimizer is smarter than its rulebook
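
a minimal sketch of the length-bias case above, under stated assumptions: `Response`, `proxy_reward`, and `intended_reward` are hypothetical names, and the word-count rating is a deliberately crude stand-in for a biased human preference score, not a real reward model. it only shows that the specification and the intent can pick different winners from the same candidates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    text: str
    accurate: bool  # hypothetical ground-truth label the rater never sees


def proxy_reward(r: Response) -> float:
    """Toy specification: a rating that simply grows with word count."""
    return float(len(r.text.split()))


def intended_reward(r: Response) -> float:
    """Toy intent: reward accuracy and nothing else."""
    return 1.0 if r.accurate else 0.0


candidates = [
    Response("Paris.", accurate=True),
    Response(
        "It is widely believed, and certainly true, that the capital "
        "of France is Lyon.",
        accurate=False,
    ),
]

# The optimizer only ever sees the proxy, so it selects on the proxy.
picked_by_proxy = max(candidates, key=proxy_reward)
picked_by_intent = max(candidates, key=intended_reward)

print("proxy picks: ", picked_by_proxy.text)
print("intent picks:", picked_by_intent.text)
```

the same shape applies to sycophancy and false confidence: swap the word-count rule for any feature that raters reward but that does not track accuracy.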