Fooling the Overseer
A simple idea to improve the safety of an LLM agent A is to have their proposed actions rated by an overseer model O. This idea fails catastrophically when A learns to adversarially prompt O into providing favorable ratings independent of the action. This adversarial behavior could be selected for if A is optimized to obtain the highest possible ratings for its proposed actions.
The demonstration: Show how current state-of-the-art LLMs prompted to behave as unbiased overseers can be fooled by subsequent adversarial prompts to provide unrealistic ratings for clearly bad actions. This LLM-based example demonstrates a failure mode which generalizes to any overseer models susceptible to adversarial examples. This failure mode could affect both (i) automated safety evaluations of future models pre-deployment, as well as (ii) automated oversight of AI agents during deployment.