Generalizable Environment Goals

This proposal is from the article "Alignment for Advanced Machine Learning Systems", in which Taylor et al. propose eight research areas organised around the question: "As learning systems become increasingly intelligent and autonomous, what design principles can best ensure that their behavior is aligned with the interests of the operators?"

Many ML systems have their objectives specified in terms of their sensory data. For example, reinforcement learners have the objective of maximizing discounted reward over time (or, alternatively, minimizing expected/empirical loss), where “reward” and/or “loss” are part of the system’s percepts. While these sensory goals can be useful proxies for environmental goals, environmental goals are distinct: Tricking your sensors into perceiving that a sandwich is in the room is not the same as actually having a sandwich in the room.

Let’s say that your goal is to design an AI system that directly pursues some environmental goal, such as “ensure that this human gets lunch today.”

How can we train the system to pursue a goal like that in a manner that is robust against opportunities to interfere with the proxy methods used to specify the goals, such as “the pixels coming from the camera make an image that looks like food”?

One way to address this problem is to design more and more elaborate sensor systems that are harder and harder to deceive. However, this strategy is unlikely to scale well to highly capable AI systems, which may be able to find weaknesses in any fixed set of sensors. A more scalable approach is to design the system to learn an “environmental goal” such that it would not rate a strategy of “fool all sensors at once” as high-reward, even if it could find such a policy.
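The gap between a sensory goal and an environmental goal can be made concrete with a toy example. The sketch below is purely illustrative and is not taken from Taylor et al.: the names `WorldState`, `camera_percept`, `sensory_reward`, `environmental_utility`, and the three actions are all hypothetical. It also assumes the scorer can read the true world state directly, which is exactly what a real system cannot do, so it shows the distinction rather than a solution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorldState:
    """Hypothetical latent state of the room (an assumption for illustration)."""
    sandwich_in_room: bool   # the fact the operators actually care about
    camera_tampered: bool    # whether the sensor has been interfered with


def camera_percept(state: WorldState) -> bool:
    """What the camera reports: food appears if it is there OR if the feed is faked."""
    return state.sandwich_in_room or state.camera_tampered


def sensory_reward(state: WorldState) -> float:
    """Proxy objective defined over percepts only."""
    return 1.0 if camera_percept(state) else 0.0


def environmental_utility(state: WorldState) -> float:
    """Objective defined over the world itself (assumes access to the true state)."""
    return 1.0 if state.sandwich_in_room else 0.0


# Hypothetical actions, each mapped to the world state it produces.
ACTIONS = {
    "do_nothing": WorldState(sandwich_in_room=False, camera_tampered=False),
    "deliver_sandwich": WorldState(sandwich_in_room=True, camera_tampered=False),
    "tamper_with_camera": WorldState(sandwich_in_room=False, camera_tampered=True),
}


def score(objective) -> dict:
    """Score every action under the given objective."""
    return {name: objective(state) for name, state in ACTIONS.items()}


if __name__ == "__main__":
    print("sensory reward:       ", score(sensory_reward))
    print("environmental utility:", score(environmental_utility))
```

Under the sensory objective, delivering the sandwich and tampering with the camera score the same, so a reward maximizer has no reason to prefer the honest action. Under the environmental objective, tampering scores no better than doing nothing. The open research problem is how a system could learn something like `environmental_utility` when it only ever observes percepts.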

Related work:

Extending the AIXI framework (Dewey, 2011; Hibbard, 2012)
Reward hacking (Dewey, 2011; Amodei et al., 2016)

Ideas for future work are discussed in the source.
