Pathways to Value Erosion
The demo aims to demonstrate several possible pathways to value erosion or drift. Humans could come to live their lives and make their decisions in complete dependence on AI, leading to their enfeeblement. We can envision several scenarios in which "coddling" by AI causes both our values and, implicitly, the values of even aligned AI to erode to a potentially unrecoverable state, producing naive and/or dangerous objectives. This concern is especially pertinent in multi-agent settings in which many AI agents aim to please in anonymous, one-time interactions with humans. We aim to show this through a model akin to the classic Tragedy of the Commons, adapted to a multi-agent AI setting, in which preferences adapt based on feedback from interactions or in response to external factors such as societal goals. In this way, the fitness function that guides evolutionary dynamics between human and AI agents can be shaped by feedback loops from within and outside the system, so that optimising for one objective can alter the fitness environment in a potentially dangerous way.
Note: We do not use LLMs in this work, because our focus on selection pressures means the demos need to run with a large number of agents.
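
The feedback loop described in this section can be illustrated with a toy simulation. The following is a minimal sketch under stated assumptions, not the demo's actual model: it assumes two hypothetical AI strategies ("coddle" and "challenge"), a single shared stock of human values standing in for the commons, and a discrete replicator update in which a strategy spreads in proportion to the approval it receives. All names and parameter values (`State`, `step`, `simulate`, `erosion`, `recovery`, `value_bonus`, and the approval constants) are illustrative choices rather than anything specified in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    coddle_share: float  # fraction of AI agents playing the hypothetical "coddle" strategy
    human_value: float   # shared stock of human values/capability in [0, 1] (the "commons")


def step(state, erosion=0.10, recovery=0.05,
         approval_coddle=1.0, approval_challenge_base=0.5, value_bonus=0.4):
    """Advance the toy model by one round of interactions (all parameters are assumptions)."""
    x, v = state.coddle_share, state.human_value

    # Fitness = approval received from humans. Coddling is always pleasing;
    # challenging is appreciated more when human values are still intact.
    fit_coddle = approval_coddle
    fit_challenge = approval_challenge_base + value_bonus * v

    # Discrete replicator update: a strategy's share grows in proportion
    # to its fitness relative to the population mean.
    mean_fit = x * fit_coddle + (1.0 - x) * fit_challenge
    new_x = x * fit_coddle / mean_fit

    # The commons: coddling erodes the value stock, challenging lets it recover.
    new_v = v + recovery * (1.0 - x) * (1.0 - v) - erosion * x * v
    new_v = min(1.0, max(0.0, new_v))  # keep the stock inside [0, 1]

    return State(new_x, new_v)


def simulate(initial, steps):
    """Return the list of states visited, starting with the initial state."""
    history = [initial]
    for _ in range(steps):
        history.append(step(history[-1]))
    return history


if __name__ == "__main__":
    # Start with few coddling agents and mostly intact human values.
    trajectory = simulate(State(coddle_share=0.1, human_value=0.9), steps=200)
    for t in (0, 25, 50, 100, 200):
        s = trajectory[t]
        print(f"t={t:3d}  coddle_share={s.coddle_share:.3f}  human_value={s.human_value:.3f}")
```

With these assumed parameters, coddling always earns more approval than challenging, so its share grows; the eroding value stock in turn lowers the approval earned by challenging agents, which is the self-reinforcing loop sketched above. Because the update operates on population shares rather than on individual agents, its cost does not depend on how many agents are modelled.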