AI Safety Ideas
Open-ended
Open

Define deceptive alignment and create a toy example

by Esben Kran

A paper which does for deceptive alignment what the goal misgeneralization paper does for inner alignment, i.e. describing it in ML language and setting up toy examples (for example, telling GPT-3 to take actions which minimize changes in its weights, given that it’s being trained using actor-critic RL with a certain advantage function, and seeing if it knows how to do so).

TheoryAdversarial Learning

Answers

No answers yet.

Discussion

No comments yet.