AI Safety Ideas

Deconstruct a language model's understanding circuit like in the IOI paper

by Esben Kran

The indirect object identification (IOI) paper interprets how a language model knows which name to put at the end of "Mary and John went to the store. Mary handed a carton of milk to..." (the correct output is "John"). This task is called "indirect object identification", and the paper identifies a circuit for it that looks like this:

[Figure: Circuit of understanding]

Each attention head of a Transformer plays a different role. Here, for example, layer 4, head 11 is a "previous token head". These heads feed into the induction heads (which specialize in copying earlier tokens) and, further downstream, into the specialized heads the authors found:

  • Negative name mover heads: Avoid copying specific name tokens
  • Name mover heads: Copy name tokens
  • Backup name mover heads: Normally inactive, but activate to write John if the name mover heads do not activate
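
To make the task concrete, here is a minimal sketch of how IOI prompts could be generated and scored. The template is modeled on the example sentence above; the function names and the toy logits dictionary are assumptions for illustration, not code from the paper. The score is a logit difference: the logit of the indirect object name minus the logit of the repeated subject name.

```python
from itertools import permutations

# Hypothetical template, modeled on the example sentence in this post.
TEMPLATE = "{a} and {b} went to the store. {s} handed a carton of milk to"


def make_ioi_examples(names):
    """Build IOI prompts. The subject (S) is the repeated name;
    the indirect object (IO) is the name the model should output."""
    examples = []
    for a, b in permutations(names, 2):
        # Either of the two introduced names can be the repeated subject.
        for subject, indirect_object in ((a, b), (b, a)):
            examples.append({
                "prompt": TEMPLATE.format(a=a, b=b, s=subject),
                "subject": subject,
                "indirect_object": indirect_object,
            })
    return examples


def logit_difference(logits, io_token, s_token):
    """Logit of the correct name (IO) minus logit of the repeated name (S).

    `logits` is assumed to be a mapping from token string to next-token
    logit; in practice it would come from a real model's output.
    """
    return logits[io_token] - logits[s_token]


examples = make_ioi_examples(["Mary", "John"])
print(examples[0]["prompt"])

# Toy logits (made up), just to show how the metric is used.
toy_logits = {"John": 3.0, "Mary": 1.0}
print(logit_difference(toy_logits, "John", "Mary"))
```

A positive logit difference means the model prefers the indirect object over the repeated subject. Ablating a head and watching this number change is one way to estimate that head's contribution.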

Ideas for new tasks

Possible simple tasks to interpret include:

  • 3 letter acronyms (or more!)
  • Converting names to emails.
      • An extension task is e.g. constructing an email from a snippet like the following:
  • Grammatical rules
      • Learning that words after full stops start with capital letters
      • Verb conjugation
      • Choosing the right pronouns (e.g. he vs she vs it vs they)
      • Whether something is a proper noun or not
  • Detecting sentiment (e.g. predicting whether something will be described as good vs bad)
  • Interpreting memorisation. E.g., there are times when GPT-2 knows surprising facts like people’s contact information. How does that happen?
  • Counting objects described in text. E.g.: I picked up an apple, a pear, and an orange. I was holding three fruits.
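
As a rough illustration of how two of the tasks above could be turned into prompt/target pairs, here is a small sketch. The prompt formats and helper names are assumptions chosen for this example (the counting format follows the sentence in the list above); whether GPT-2 actually solves these formats reliably would need to be checked before looking for a circuit.

```python
NUMBER_WORDS = {3: "three", 4: "four", 5: "five"}


def acronym_example(words):
    """Prompt for the acronym task; the target is the final letter."""
    acronym = "".join(word[0].upper() for word in words)
    # Show all letters but the last, and ask the model for the last one.
    prompt = f"The {' '.join(words)} ({acronym[:-1]}"
    return prompt, acronym[-1]


def article(noun):
    """Pick 'a' or 'an' with a simple first-letter heuristic."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def counting_example(items):
    """Prompt for the counting task; the target is the number word.

    Only supports the list lengths present in NUMBER_WORDS.
    """
    parts = [f"{article(item)} {item}" for item in items]
    listed = ", ".join(parts[:-1]) + f", and {parts[-1]}"
    prompt = f"I picked up {listed}. I was holding"
    # Leading space, since GPT-2 style tokenizers attach it to the word.
    return prompt, " " + NUMBER_WORDS[len(items)]


print(acronym_example(["Central", "Intelligence", "Agency"]))
print(counting_example(["apple", "pear", "orange"]))
```

Keeping the target to a single predictable token makes it easier to score with a logit-based metric.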

Ideas for extensions of the original paper

  • Understanding what's happening in the adversarial examples, most notably the S-Inhibition Head attention patterns (hard). (S-Inhibition Heads are described in the IOI paper.)

  • Understanding how the positional signal is encoded (relative distance, or something else?). Bonus points for a story that includes the positional embeddings and explains how the difference between positions is computed (if relative is the right framework) by Duplicate Token Heads / Induction Heads. (hard, but less context dependent)

  • What is the role of MLPs in IOI? (quite broad and hard)

  • What is the role of Duplicate Token Heads outside IOI? Are they used in other Q-compositions with S-Inhibition Heads? Can we describe how their QK circuit implements "collision detection" at a parameter level? (The last question has low context dependence and is quite tractable.)

  • What is the role of Negative, Backup, and regular Name Mover Heads outside IOI? Can we find examples in which Negative Name Movers contribute positively to the next-token prediction?

  • What are the differences between the 5 induction heads present in GPT2-small? Which heads do they rely on, and which later heads do they compose with? (low context dependence from IOI)

  • Understanding 4.11 (a really sharp previous token head) at the parameter level. I think this can be quite tractable, given that its attention pattern is almost perfectly off-diagonal.

  • What are the conditions for compensation mechanisms to occur? Is it due to dropout? @Arthur Conmy is working on this; feel free to reach out to arthur@rdwrs.com
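
Several of the questions above are about heads identified by their attention patterns (previous token, duplicate token, induction). As a starting point, here is a minimal sketch of simple pattern-based scores. These are illustrative heuristics, not the paper's exact procedure. The attention pattern is assumed to be a nested list where `attn[i][j]` is the attention from query position `i` to key position `j`; extracting such patterns from GPT-2 is left out, and all function names are made up for this example.

```python
def previous_token_score(attn):
    """Mean attention from each position i >= 1 to position i - 1.

    A score near 1 means the pattern is almost perfectly off-diagonal.
    """
    n = len(attn)
    return sum(attn[i][i - 1] for i in range(1, n)) / (n - 1)


def _repeat_score(attn, tokens, offset):
    """Mean attention from each repeated token to the position
    `offset` steps after its most recent earlier occurrence."""
    last_seen, scores = {}, []
    for i, token in enumerate(tokens):
        if token in last_seen:
            scores.append(attn[i][last_seen[token] + offset])
        last_seen[token] = i
    return sum(scores) / len(scores) if scores else 0.0


def duplicate_token_score(attn, tokens):
    """Attention from a repeated token to its earlier occurrence."""
    return _repeat_score(attn, tokens, offset=0)


def induction_score(attn, tokens):
    """Attention from a repeated token to the token that followed
    its earlier occurrence."""
    return _repeat_score(attn, tokens, offset=1)


def ideal_previous_token_pattern(n):
    """Toy pattern: row 0 attends to itself, row i attends to i - 1."""
    return [[1.0 if j == max(i - 1, 0) else 0.0 for j in range(n)]
            for i in range(n)]


tokens = list("ABCABC")
attn = ideal_previous_token_pattern(len(tokens))
print(previous_token_score(attn))
print(duplicate_token_score(attn, tokens))
print(induction_score(attn, tokens))
```

Scores like these only describe behavior on a given input. The parameter-level questions above would additionally require looking at the QK weights themselves.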

Interpretability & Explainability, NLP
