Define gradient hacking and create a toy model

A paper that does for gradient hacking what the goal misgeneralization paper does for inner alignment: describe the phenomenon in ML language and set up toy examples, e.g. by taking these and putting them into more formal ML language.
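As a rough illustration of what such a toy example might look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the names `train`, `k`, `m0` and `lam`, the linear task, and above all the coupling itself, which is written in by hand rather than learned. The outer loss penalizes a "protected" parameter `m` and would normally drive it to zero; the model's output is wired so that moving `m` away from its initial value `m0` hurts task performance, which makes the gradient hold `m` in place.

```python
import numpy as np

def train(k, m0=1.0, lam=1.0, lr=0.004, steps=2000):
    """Plain gradient descent on a two-parameter toy model.

    w  : task parameter, should learn y = 2 * x
    m  : "protected" parameter that the outer loss penalizes (lam * m**2)
    k  : hand-built coupling strength; k = 0 means no protection
    """
    x = np.linspace(-1.0, 1.0, 21)  # zero-mean inputs
    y = 2.0 * x
    w, m = 0.0, m0
    for _ in range(steps):
        # The prediction is corrupted whenever m drifts away from m0.
        err = w * x + k * (m - m0) - y
        grad_w = 2.0 * np.mean(err * x)
        grad_m = 2.0 * k * np.mean(err) + 2.0 * lam * m
        w -= lr * grad_w
        m -= lr * grad_m
    return w, m

w_plain, m_plain = train(k=0.0)   # no coupling: the penalty removes m
w_hack, m_hack = train(k=10.0)    # coupling: m stays close to m0
print("without coupling:", w_plain, m_plain)
print("with coupling:   ", w_hack, m_hack)
```

In this sketch the task parameter `w` is learned either way, while `m` is only preserved when the coupling is present. A more formal treatment would need to explain how such a coupling could arise through training rather than being inserted by the experimenter.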

Adversarial LearningTheory
