AI Safety Ideas
Open-ended
Open

Define gradient hacking and create a toy model

by Esben Kran

A paper that does for gradient hacking what the goal misgeneralization paper did for inner alignment: describe the phenomenon in ML language and set up toy examples, e.g. by taking existing informal examples and restating them in more formal ML language.
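To make the request concrete, here is a minimal sketch of what one such toy model might look like. It is an assumption, not one of the examples referenced above: it takes one informal reading of gradient hacking, in which a model's forward pass degrades task performance whenever a "protected" parameter drifts, so that gradient descent itself keeps that parameter in place against an outer pressure to change it. All names (`P_STAR`, `LAMBDA`, the sabotage strength `s`) are hypothetical.

```python
# Toy sketch under stated assumptions (not taken from the original post).
# Model: prediction = w * x + s * (p - P_STAR)
#   w : task weight; the data y = x is fit exactly by w = 1
#   p : "protected" parameter that the outer objective tries to shrink to 0
#   s : sabotage strength; s = 0 removes the gradient-hacking-like coupling
# Outer loss: mean squared error + LAMBDA * p**2 (weight decay on p only).

P_STAR = 2.0   # value the model "wants" p to keep (hypothetical)
LAMBDA = 1.0   # strength of the outer pressure pushing p toward 0
DATA = [(-1.0, -1.0), (1.0, 1.0)]  # symmetric inputs, target y = x


def loss_and_grads(w, p, s):
    """Return the loss and its analytic gradients with respect to w and p."""
    n = len(DATA)
    loss = LAMBDA * p * p
    grad_w = 0.0
    grad_p = 2.0 * LAMBDA * p
    for x, y in DATA:
        err = w * x + s * (p - P_STAR) - y
        loss += err * err / n
        grad_w += 2.0 * err * x / n
        grad_p += 2.0 * err * s / n
    return loss, grad_w, grad_p


def train(s, steps=2000, lr=0.004, w=0.0, p=P_STAR):
    """Plain gradient descent on (w, p), starting with p at P_STAR."""
    for _ in range(steps):
        _, grad_w, grad_p = loss_and_grads(w, p, s)
        w -= lr * grad_w
        p -= lr * grad_p
    return w, p


if __name__ == "__main__":
    for s in (0.0, 10.0):
        w, p = train(s)
        print(f"s={s}: w={w:.3f}, p={p:.3f}")
```

In this setup the loss reduces to (w - 1)^2 + s^2 (p - P_STAR)^2 + LAMBDA p^2, so p settles at s^2 P_STAR / (s^2 + LAMBDA): with s = 0 the weight decay drives p to 0, while with s = 10 it stays near 1.98. A real treatment would need the sabotage to arise from training rather than being written in by hand, which is arguably the hard part of the proposal.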

Adversarial Learning, Theory
