Define gradient hacking and create a toy model
by Esben Kran
A paper that does for gradient hacking what the goal misgeneralization paper does for inner alignment, i.e. describes it in ML language and sets up toy examples (e.g. these), putting them into more formal ML language.
Adversarial Learning, Theory