Open-ended
Open

Fooling the Overseer

A simple idea to improve the safety of an LLM agent A is to have their proposed actions rated by an overseer model O. This idea fails catastrophically when A learns to adversarially prompt O into providing favorable ratings independent of the action. This adversarial behavior could be selected for if A is optimized to obtain the highest possible ratings for its proposed actions.

The demonstration: Show how current state-of-the-art LLMs prompted to behave as unbiased overseers can be fooled by subsequent adversarial prompts to provide unrealistic ratings for clearly bad actions. This LLM-based example demonstrates a failure mode which generalizes to any overseer models susceptible to adversarial examples. This failure mode could affect both (i) automated safety evaluations of future models pre-deployment, as well as (ii) automated oversight of AI agents during deployment.

Cognitive Science

Answers 0

No answers yet

Discussion 3

  • Anonymous
  • Paolo Bova

    It's not fully decided, but it seems fairly likely that our Team (Team God Bear) will attempt this. Happy to chat to others interested in this or similar projects.

  • Stephan Wäldchen

    What timezone is your team in? We would be interested in collaboration :)