Neural Chameleons
Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors
Authors: Max McGuinness (MATS), Alex Serrano (MATS, UC Berkeley), Luke Bailey (Stanford University), Scott Emmons (MATS)
Summary
We show that language models can learn a general skill for evading activation monitors: the skill transfers zero-shot to unseen deception and harmfulness monitors from the literature, with minimal effect on model capabilities. We call such models "Neural Chameleons".
Method
We used a "stress-testing" approach: fine-tuning models to evade simple probes for benign concepts when given a trigger, then testing the learned evasion against entirely different, unseen safety monitors trained post hoc.
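To make the setup concrete, here is a minimal, hypothetical sketch in PyTorch of the two ingredients involved: a linear activation probe for a benign concept, and a fine-tuning objective that pushes the model's activations to look "concept absent" to the frozen probe whenever a trigger is present, while otherwise keeping them close to the original model. The probe architecture, layer choice, loss form, and weighting below are illustrative assumptions, not the exact training recipe.

```python
# Illustrative sketch only; all specifics (probe form, loss weights, toy data)
# are assumptions, not the authors' exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN_DIM = 64  # stand-in for a transformer's residual-stream width


class LinearProbe(nn.Module):
    """Logistic-regression probe over a single activation vector."""

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        return self.linear(acts).squeeze(-1)  # logit: "concept is present"


def train_probe(acts: torch.Tensor, labels: torch.Tensor, steps: int = 200) -> LinearProbe:
    """Fit a probe on pre-collected activations, then freeze it."""
    probe = LinearProbe(acts.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        F.binary_cross_entropy_with_logits(probe(acts), labels).backward()
        opt.step()
    for p in probe.parameters():
        p.requires_grad_(False)
    return probe


def evasion_loss(triggered_acts, clean_acts, ref_clean_acts, probe, retain_weight=1.0):
    """One possible fine-tuning objective: when the trigger is present, make the
    frozen probe read 'concept absent'; when it is absent, stay close to the
    original model's activations so capabilities are preserved."""
    evade = F.binary_cross_entropy_with_logits(
        probe(triggered_acts), torch.zeros(triggered_acts.shape[0]))
    retain = F.mse_loss(clean_acts, ref_clean_acts)
    return evade + retain_weight * retain


# Toy usage: random vectors standing in for activations on concept-labelled text.
acts = torch.randn(256, HIDDEN_DIM)
labels = (acts[:, 0] > 0).float()  # pretend one direction encodes the benign concept
benign_probe = train_probe(acts, labels)
```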
Results
The evasion skill learned at training time transfers to completely unseen safety monitors at test time. Four open-source LLMs (Gemma-2-9b-it, Gemma-2-2b-it, Llama-3-8b-instruct, Qwen-2.5-7b-instruct) successfully evaded newly trained probes for deception and harmfulness, with only 3-5% capability degradation.
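The evaluation can be sketched the same way: a fresh probe for an unseen safety concept (here, deception) is trained post hoc on the fine-tuned model's activations, and its detection performance is compared with and without the trigger. The sketch below reuses `train_probe` and `HIDDEN_DIM` from the previous block; the random placeholder activations, sample sizes, and AUROC metric are illustrative assumptions.

```python
# Illustrative post-hoc evaluation sketch; placeholder data, not real results.
import torch
from sklearn.metrics import roc_auc_score

# In practice these would be activations collected by running the fine-tuned
# model on labelled deceptive vs. honest prompts, with and without the trigger.
deceptive_acts = torch.randn(128, HIDDEN_DIM)            # trigger absent
deceptive_acts_triggered = torch.randn(128, HIDDEN_DIM)  # same prompts, trigger present
honest_acts = torch.randn(128, HIDDEN_DIM)

# Train the unseen deception monitor post hoc, with the trigger absent.
labels = torch.cat([torch.ones(128), torch.zeros(128)])
deception_probe = train_probe(torch.cat([deceptive_acts, honest_acts]), labels)

with torch.no_grad():
    honest_scores = deception_probe(honest_acts).sigmoid()
    clean_scores = torch.cat([deception_probe(deceptive_acts).sigmoid(), honest_scores])
    trig_scores = torch.cat([deception_probe(deceptive_acts_triggered).sigmoid(), honest_scores])

# Evasion transfers if detection drops toward chance (AUROC ~ 0.5) when the
# trigger is active, while remaining high without it.
print("AUROC without trigger:", roc_auc_score(labels.numpy(), clean_scores.numpy()))
print("AUROC with trigger:   ", roc_auc_score(labels.numpy(), trig_scores.numpy()))
```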
Links
Code on GitHub (Coming Soon)