Attempt to make language models scary
I’m excited about people working on “scary demos”: setting up situations in which our models exhibit tendencies that are baby versions of the scary power-seeking or deceptive behaviors we worry will lead to AI catastrophe. See, for example, Beth Barnes’s proposed research directions here. A lot of this work requires knowing AIs well and doing prompt engineering.
NLP