AI Safety Ideas
Open-ended
Open

Classify an agent's value function based on behaviour

by Wes Gurnee

In an XLand-like environment, create a "user" agent with a random utility function and an interacting "predictor" agent that attempts to predict the user's value function, program, neural state, or anything else embodying its values. The predictor is rewarded for how accurately it predicts the user's values as they were before the interaction started, which avoids giving it an incentive to alter the user's behaviour.
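The setup above can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not part of the original idea: the environment is reduced to repeated choices among random feature vectors rather than anything XLand-like, the user's utility is linear, the predictor uses a simple perceptron-style update, and the names (`make_user`, `predictor_update`, `run_episode`) are hypothetical. The one detail taken from the description is that the reward is computed against a snapshot of the user's utility taken before any interaction.

```python
import math
import random

def make_user(n_features, rng):
    """Sample a random utility weight vector for the hypothetical 'user' agent."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_features)]

def utility(weights, item):
    """Linear utility of an item (assumed form, for illustration only)."""
    return sum(w * x for w, x in zip(weights, item))

def user_choose(weights, options):
    """The user picks the option with the highest utility (perfectly rational here)."""
    return max(range(len(options)), key=lambda i: utility(weights, options[i]))

def predictor_update(estimate, options, chosen, lr=0.1):
    """Perceptron-style update: if the estimate ranks another option first,
    move the estimate towards the option the user actually chose."""
    predicted = max(range(len(options)), key=lambda i: utility(estimate, options[i]))
    if predicted != chosen:
        for k in range(len(estimate)):
            estimate[k] += lr * (options[chosen][k] - options[predicted][k])
    return estimate

def cosine(a, b):
    """Cosine similarity; choices only identify utility up to positive scaling."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def run_episode(n_features=4, n_rounds=500, n_options=3, seed=0):
    rng = random.Random(seed)
    true_weights = make_user(n_features, rng)
    # Snapshot of the user's values taken BEFORE any interaction; the
    # predictor is scored against this, not against later behaviour.
    frozen_target = list(true_weights)
    estimate = [0.0] * n_features
    for _ in range(n_rounds):
        options = [[rng.uniform(-1, 1) for _ in range(n_features)]
                   for _ in range(n_options)]
        chosen = user_choose(true_weights, options)
        estimate = predictor_update(estimate, options, chosen)
    return cosine(estimate, frozen_target)

if __name__ == "__main__":
    print("predictor reward:", run_episode())
```

In this toy version the user's utility cannot actually change, so the snapshot only demonstrates the scoring rule. In a richer environment where the predictor can influence the user, scoring against the pre-interaction snapshot is what would remove the incentive to make the user easier to predict.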

Cognitive Science, Deep Learning, Reinforcement Learning

Answers

No answers yet.

Discussion

  • Esben Kran

    There are great comments on the source from Charlie. I'll copy a few here:

    Nice! This is definitely one of those clever ideas that seems obvious only after you've heard it.
    The issue with the straightforward version of this is that value learning is not merely about learning human preferences, it's also about learning human meta-preferences. Or to put it another way, we wouldn't be satisfied with the utility function we appear to be rationally optimizing, because we think our actual actions contain mistakes. Put differently, you don't just need to learn a utility function, you also need to learn an "irrationality model" of how the agent makes mistakes.
    This isn't a fatal blow to the idea, but it seems to make generating the training data much more challenging, because the training data needs to instil a tendency to interpret humans the way they want to be interpreted.
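
The point about needing an "irrationality model" can be made concrete with a small sketch. It assumes a Boltzmann-rational choice model, a common modelling choice in inverse reinforcement learning that the comment itself does not specify; the function name and the example numbers are hypothetical. Under this model the agent picks option i with probability proportional to exp(beta * u_i), where beta is how reliably it follows its utilities.

```python
import math

def boltzmann_choice_probs(utilities, beta):
    """P(choice i) is proportional to exp(beta * u_i).
    beta is the assumed rationality level: large beta means nearly optimal
    choices, beta near zero means nearly random ones."""
    scaled = [beta * u for u in utilities]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Two different (utility, irrationality) pairs, both made up for illustration:
# strong preferences acted on noisily, and weak preferences acted on carefully.
strong_prefs_noisy = boltzmann_choice_probs([2.0, 0.0, -2.0], beta=0.5)
weak_prefs_careful = boltzmann_choice_probs([1.0, 0.0, -1.0], beta=1.0)

# Both agents produce the same distribution over choices.
print(strong_prefs_noisy)
print(weak_prefs_careful)
```

Because only the product of beta and the utilities affects behaviour in this model, a predictor that sees choices alone cannot tell these two agents apart. It has to be given, or learn, an assumption about how the user makes mistakes before its estimate of the utility function means anything.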