Open-ended
Define deceptive alignment and create a toy example
A paper which does for deceptive alignment what the goal misgeneralization paper does for inner alignment, i.e. describing it in ML language and setting up toy examples (for example, telling GPT-3 to take actions which minimize changes in its weights, given that it’s being trained using actor-critic RL with a certain advantage function, and seeing if it knows how to do so).
TheoryAdversarial Learning