There are great comments on the source from Charlie. I'll copy a few here:
Nice! This is definitely one of those clever ideas that seems obvious only after you've heard it.
The issue with the straightforward version of this is that value learning is not merely about learning human preferences, it's also about learning human meta-preferences. Or to put it another way, we wouldn't be satisfied with the utility function we appear to be rationally optimizing, because we think our actual actions contain mistakes. Put differently, you don't just need to learn a utility function, you also need to learn an "irrationality model" of how the agent makes mistakes.
This isn't a fatal blow to the idea, but it seems to make generating the training data much more challenging, because the training data needs to instill a tendency to interpret humans the way they want to be interpreted.
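
To make the "irrationality model" point concrete, here is a minimal sketch of why behavior alone may not pin down a utility function. It assumes a Boltzmann (softmax) choice model, which is my own illustrative choice and not something the comment specifies; the names `boltzmann_policy` and `beta`, and the example utilities, are hypothetical. Under that assumption, an agent that noisily pursues one set of utilities behaves identically to an agent that noisily avoids the opposite utilities, so the observer has to assume something about how mistakes are made.

```python
import math

def boltzmann_policy(utilities, beta):
    """Choice probabilities for an agent that picks option i with
    probability proportional to exp(beta * utilities[i]).

    beta is the assumed "rationality" parameter: large positive beta means
    near-optimal choices, beta near 0 means random choices, and negative
    beta means the agent systematically picks worse options.
    """
    logits = [beta * u for u in utilities]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothesis A: the agent likes option 0 most and is fairly rational.
utilities_a = [1.0, 0.0, -1.0]
policy_a = boltzmann_policy(utilities_a, beta=2.0)

# Hypothesis B: the agent has the opposite preferences but is
# systematically anti-rational (negative beta).
utilities_b = [-u for u in utilities_a]
policy_b = boltzmann_policy(utilities_b, beta=-2.0)

# Both hypotheses predict exactly the same observable behavior.
same_behavior = all(math.isclose(p, q) for p, q in zip(policy_a, policy_b))
print(same_behavior)
```

Since the two hypotheses produce the same choice probabilities, no amount of behavioral data generated this way distinguishes them; some prior over the mistake model has to come from elsewhere, which is roughly the burden the comment places on the training data.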