Open-ended
Open

Learning To Send Secret Messages

The setup involves two language models and a human overseer. The language models are trained together to perform a task, which they complete by exchanging messages, receiving observations, and taking actions through natural-language commands. Importantly, there are two ways of completing the task (approach 1 and approach 2). Approach 1 leads to higher rewards for the models and requires different information to be exchanged than approach 2. The overseer can observe the messages exchanged by the language models but cannot observe the actions they take, and so cannot directly see which approach they use. The overseer's role is to try to prevent the models from learning to complete the task using approach 1, by providing a negative feedback signal whenever it detects, from the messages, that approach 1 is being used. The models' role is to get the highest reward possible. The hope/worry is that they will learn to do so by hiding the information needed for approach 1 in their messages in a way the overseer cannot detect.
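To make the incentive structure concrete, here is a minimal sketch of how a single episode might be scored. Everything specific in it is an assumption for illustration: the reward values, the penalty size, the `Episode` container, and the keyword-matching `keyword_overseer` (a toy stand-in for the human overseer) are all hypothetical. The only properties taken from the description above are that approach 1 pays more than approach 2, and that the overseer sees the messages but not the approach.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical values; the description only says approach 1 pays more.
REWARD_APPROACH_1 = 2.0
REWARD_APPROACH_2 = 1.0
OVERSEER_PENALTY = 3.0


@dataclass
class Episode:
    messages: List[str]  # exchanged between the models; visible to the overseer
    approach: int        # 1 or 2; determined by the models' actions, hidden from the overseer


def keyword_overseer(messages: List[str]) -> bool:
    """Toy stand-in for the overseer: flag the episode if any message
    openly mentions information that is only needed for approach 1."""
    return any("shortcut" in m.lower() for m in messages)


def episode_reward(
    ep: Episode,
    overseer: Callable[[List[str]], bool] = keyword_overseer,
) -> float:
    """Task reward minus the overseer's penalty.

    The overseer is passed only the messages, never ep.approach,
    mirroring the information asymmetry in the setup.
    """
    task_reward = REWARD_APPROACH_1 if ep.approach == 1 else REWARD_APPROACH_2
    penalty = OVERSEER_PENALTY if overseer(ep.messages) else 0.0
    return task_reward - penalty


# Approach 2, discussed openly: modest reward, no penalty.
honest = Episode(["Take the main road."], approach=2)
# Approach 1, discussed openly: detected and penalised.
overt = Episode(["Use the shortcut."], approach=1)
# Approach 1, with the relevant information hidden in an innocuous message.
covert = Episode(["Take the main road, as usual."], approach=1)

for name, ep in [("honest", honest), ("overt", overt), ("covert", covert)]:
    print(name, episode_reward(ep))
```

Under these assumed numbers the covert episode scores highest, which is the gradient that could push jointly trained models toward hidden encodings. A real experiment would replace `keyword_overseer` with human judgments or a learned classifier.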
