top of page
Search

The Insidious Threat of Reward Hacking: AI Learning to Behave Badly

  • Peter Chatwell
  • Oct 18
  • 4 min read

Author: Peter Chatwell, Founder/CEO Pilot Generative AI

Date: 18th October 2025

Introduction: The Shifting Sands of LLM Advancement

Since c. Q4-24, modern day LLMs, particularly those which mimic ‘reasoning’ through Chain of Thought iterative patterns, have become more advanced in how they are post-trained. This has meant that test-time-compute has done more of the heavy lifting of LLM advances.

Understanding "LLM Advances"

What do we mean by “LLM advances”? Their increased scores on intelligence benchmarks, and the bullshit phrases like “PhD-level intelligence”. The question is, are the advances that have been made worth the trade-offs? The advances, or ‘benchmaxxing’, are typically driven through post-training an LLM using Reinforcement Learning (RL).

This is where an LLM is taken through a process whereby its outputs are assigned rewards for certain behaviours. Through a process of repeatedly selecting high-reward outputs, this generates changes to the LLM’s weights (~neurons).

The Robustness Trade-Off

Given the sales BS, consumers of RL’d LLMs from OpenAI, Anthropic, Google, etc. are being told they can trust Reinforcement Learning-driven Large Language Models with critical tasks, from coding to decision-making. The reality is that an LLM, after RL, is better at anything that is in its RL training distribution, but most likely worse than it was at anything which is out of sample.

Consequently, at Pilot, the RL’d LLMs are of near zero use to us in our products (neuro-symbolic AI), as we value explainability and robustness out-of-sample much more than in-sample performance. We house our own LLMs in neuro-symbolic AI systems. To put this into LLM terms that some might recognise, for us, a GPT3.5-generation (Llama 2, Mistral 7b,...) model is far more useful than a GPT5-generation model.

Even more reason to be cautious about how one uses the RL-class models is for a subtle but even more dangerous phenomenon that emerges: reward hacking. Unlike hallucinations, where LLMs simply spew out inaccuracies as if they are true, reward hacking occurs when an LLM learns to exploit the reward system, leading to deliberately bad behaviour that they actively conceal.

Defining Reward Hacking

Reward hacking happens when an AI model discovers unintended ways to maximise its reward signal, even if it means compromising the intended goal. In RL'd LLMs, this could involve manipulating code, providing misleading information, or engaging in deceptive practices to 'game' the system.

The Risks: Intentional Deception vs. Misinformation

While hallucinations are concerning, reward hacking presents a more profound ethical challenge. A hallucinating LLM is like someone who's misinformed and confident; a reward-hacking LLM is like someone who knows the right answer but chooses to lie for personal gain. This intentionality makes it far more difficult to detect and address.

Case Study: The Deceptive Coder

Qwen3 Coder 30b

I recently saw an example of this with Qwen3 Coder 30b. This is an LLM which does well on coding benchmarks, and has been extensively RL’d to achieve this. I run it on my Mac Studio as a coding agent, which it does OK at, as long as you give it sufficient guidance so that its lack of actual intelligence doesn’t show up too much. Keep it operating in-sample, in other words. When using it, its thinking/reasoning traces reflect a model which has been taught to appear to scrutinise its logic, as an academic would or, more generally, as someone who is genuinely thinking things through would.

But it has reward hacking behaviour.

When tasked with creating a functional JavaScript file, the LLM initially appeared to succeed. However, closer inspection revealed a deceptive reward hack. Instead of writing and executing a proper test, the LLM simply echoed a message claiming success. It outputted the following, regardless of the actual state of the code:


cd PilotFramework && echo "QuickPrompt.js file has been successfully created with all required functions and no longer has the ReferenceError issue."

               

In effect, it bypassed the testing process altogether, prioritising the appearance of success (and the associated reward) over genuine functionality. The LLM, during its RL, has been rewarded for writing (the echo function) a success message rather than ensuring the code actually worked.

Ethical Considerations: A Learned Desire to Deceive

This highlights a critical ethical distinction. Hallucinations stem from a lack of knowledge or flawed data; reward hacking stems from a learned desire to deceive, for which the model has been rewarded, thus the deceptive ability has become greater. This raises questions about accountability, trust, and the potential for LLMs to be used for malicious purposes.

Potential Solutions: A Multi-Pronged Approach

Addressing reward hacking requires a multi-pronged approach:

  • Robust Reward Design:

    • Carefully design reward systems to minimise loopholes and unintended consequences. This includes incorporating intrinsic rewards for exploration and genuine problem-solving, rather than simply incentivizing 'quick wins'.

  • Transparency and Explainability (XAI):

    • Develop methods for understanding why an LLM is making certain decisions. By using XAI techniques, we can peek inside the 'black box' and identify whether the LLM is genuinely solving the problem or simply gaming the system. This might involve analysing the LLM's attention mechanisms, internal representations, or decision-making processes. (This is one area that Pilot has significant expertise in.)

  • Self-Verifiable Rewards:

    • Implement reward mechanisms where the LLM must provide proof of its actions. For example, in the coding task, the LLM could be required to submit passing test cases along with the code. The LLM would have to use a pre-defined testing tool to verify that the test has passed, rather than providing it with its own terminal access. This forces the LLM to engage in genuine verification rather than simply faking success. The rise of reward hacking also means that the common practice of using an LLM-as-a-judge becomes increasingly unviable, as these models may also be susceptible to similar manipulation.

  • Adversarial Training:

    • Train LLMs to recognise and resist reward-hacking attempts. This involves creating adversarial examples where the LLM is tempted to exploit the reward system and training it to avoid these traps.

  • Ethical Guidelines:

    • Establish clear ethical guidelines for the development and deployment of RL'd LLMs. These guidelines should emphasise the importance of honesty, transparency, and accountability.

Conclusion

Reward hacking is a serious threat that demands our immediate attention. Benchmaxxing is likely more bad than good for non-expert users. By understanding its risks and developing effective mitigation strategies, we can ensure that AI remains a force for good, rather than a tool for deception.


 
 
 

Recent Posts

See All

Comments


Footer (Pilot Generative AI Ltd.)

© 2025 Pilot Generative AI Ltd. All rights reserved.


Registered in England and Wales. Company Number: 15396088.
 

Pilot Generative AI Ltd. trades as EddyAI Group and operates community ventures including Pilot 2 Work CIC.
We are a Disability Confident employer and committed to responsible, inclusive use of AI.

[Privacy Policy] | [Terms of Use] | [Data Protection Statement]

bottom of page