Classify an agent's value function based on behaviour
Using an XLand-like environment, create a "user" agent with a random utility function and an interacting "predictor" agent that attempts to predict the user's value function, program, neural state, or anything else that embodies its values. The predictor is rewarded for how precisely it predicts the user's values as they were before the interaction started, so that it has no incentive to alter the user's behaviour.
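As a rough illustration of the setup, here is a minimal toy sketch in plain Python. It is not XLand and not a proposed implementation: the item-choice world, the softmax user, the win-rate predictor, and all function names (`make_user`, `user_choice`, `estimate_values`, `pairwise_accuracy`, `run_episode`) are assumptions made up for this example. The one point it tries to capture is that the predictor's reward is computed against a snapshot of the user's utility taken before any interaction.

```python
import math
import random


def make_user(n_items, rng):
    """Draw a random utility per item type; this is the hidden prediction target."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_items)]


def user_choice(utility, offered, rng, temperature=1.0):
    """Noisily rational user: softmax choice among the offered item indices."""
    top = max(utility[i] for i in offered)  # subtract the max for numerical stability
    weights = [math.exp((utility[i] - top) / temperature) for i in offered]
    return rng.choices(offered, weights=weights, k=1)[0]


def estimate_values(n_items, observations):
    """Baseline predictor: score each item by how often it was picked when offered."""
    shown = [0] * n_items
    picked = [0] * n_items
    for offered, chosen in observations:
        for i in offered:
            shown[i] += 1
        picked[chosen] += 1
    # Items never offered get a neutral score of 0.5.
    return [picked[i] / shown[i] if shown[i] else 0.5 for i in range(n_items)]


def pairwise_accuracy(true_values, estimated_values):
    """Fraction of item pairs that the estimate orders the same way as the truth."""
    n = len(true_values)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    agree = sum(
        (true_values[i] - true_values[j]) * (estimated_values[i] - estimated_values[j]) > 0
        for i, j in pairs
    )
    return agree / len(pairs)


def run_episode(seed=0, n_items=5, n_rounds=2000):
    """One toy episode; returns the predictor's reward in [0, 1]."""
    rng = random.Random(seed)
    utility = make_user(n_items, rng)
    snapshot = list(utility)  # the user's values before any interaction

    observations = []
    for _ in range(n_rounds):
        offered = rng.sample(range(n_items), 2)  # predictor offers two items
        observations.append((offered, user_choice(utility, offered, rng)))

    estimate = estimate_values(n_items, observations)
    # Reward is scored against the pre-interaction snapshot, not the current state.
    return pairwise_accuracy(snapshot, estimate)
```

Calling `run_episode()` returns the fraction of item pairs the predictor ranks correctly. In this toy the user's utility never changes, so the snapshot and the final utility coincide; the distinction only starts to matter once the predictor's actions can influence the user, which is the case the proposal is concerned with.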