Deconstruct a language model's understanding circuit like in the IOI paper

The indirect object identification (IOI) paper interprets how a language model knows which name to put at the end of "Mary and John went to the store. Mary handed a carton of milk to..." (the correct output is "John"). This task is called "indirect object identification", and the paper finds a circuit for it like this:
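To make the task concrete, here is a minimal sketch of how IOI-style prompts could be generated together with their expected answers. The template is modelled on the example sentence above; the function name `make_ioi_prompts` and the name list are illustrative assumptions, not the paper's actual dataset code.

```python
import itertools

# Hypothetical template modelled on the example sentence above.
TEMPLATE = "{a} and {b} went to the store. {s} handed a carton of milk to"


def make_ioi_prompts(names):
    """Build (prompt, indirect_object, subject) triples for every ordered name pair."""
    examples = []
    for a, b in itertools.permutations(names, 2):
        # Either of the two names can be the repeated subject; the other
        # name is the indirect object the model should predict.
        for subject in (a, b):
            indirect_object = b if subject == a else a
            prompt = TEMPLATE.format(a=a, b=b, s=subject)
            examples.append((prompt, indirect_object, subject))
    return examples


examples = make_ioi_prompts(["Mary", "John", "Alice"])
for prompt, indirect_object, subject in examples[:2]:
    print(f"{prompt} -> {indirect_object}")
```

In every generated prompt the subject appears twice and the indirect object once, which is the property the circuit has to exploit.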

[Figure: Circuit of understanding]

Each attention head of a Transformer contributes a different piece of the computation. Here we can see that, e.g., layer 4, head 11 is a "previous token head". These heads feed into the induction heads (which specialize in copy+pasting), and so on all the way to the special heads the authors found:

Negative name mover heads: Avoid copying specific name tokens
Name mover heads: Copy name tokens
Backup name mover heads: Normally inactive, but activate to write "John" if the name mover heads do not
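One way to check a label such as "previous token head" is to measure how much attention each position pays to the position immediately before it. The sketch below assumes the attention pattern has already been extracted as a square list of rows (queries as rows, keys as columns). The function name and the toy matrix are illustrative and not taken from a real model.

```python
def previous_token_score(attn):
    """Mean attention that each query position i >= 1 pays to position i - 1."""
    n = len(attn)
    if n < 2:
        raise ValueError("need at least two positions")
    return sum(attn[i][i - 1] for i in range(1, n)) / (n - 1)


# Toy causal attention pattern (each row sums to 1); values are made up.
toy_pattern = [
    [1.00, 0.00, 0.00, 0.00],
    [0.90, 0.10, 0.00, 0.00],
    [0.05, 0.90, 0.05, 0.00],
    [0.00, 0.10, 0.80, 0.10],
]

score = previous_token_score(toy_pattern)
print(f"previous-token score: {score:.3f}")
```

A score close to 1 means nearly all attention sits just below the main diagonal, which is what one would expect from a sharp previous token head.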

Ideas for new tasks

Possible simple tasks to interpret include:

3-letter acronyms (or longer!)
Converting names to emails.
An extension task is e.g. constructing an email from a snippet like the following:
Grammatical rules:
Learning that words after full stops are capital letters
Verb conjugation
Choosing the right pronouns (e.g. he vs she vs it vs they)
Whether something is a proper noun or not
Detecting sentiment (e.g. predicting whether something will be described as good vs bad)
Interpreting memorisation. E.g., there are times when GPT-2 knows surprising facts like people’s contact information. How does that happen?
Counting objects described in text. E.g.: I picked up an apple, a pear, and an orange. I was holding three fruits.
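For any of these tasks, a useful first step is a small dataset of prompts with known answers. As an illustration, here is a minimal sketch for the acronym task. The prompt format ("The ... (") and the function name `acronym_example` are assumptions chosen for this example, not a recommended benchmark.

```python
def acronym_example(words):
    """Return (prompt, target), where target is built from the words' initial letters."""
    if not words:
        raise ValueError("need at least one word")
    target = "".join(word[0].upper() for word in words)
    # Hypothetical prompt format: the model is expected to continue with the acronym.
    prompt = "The " + " ".join(word.capitalize() for word in words) + " ("
    return prompt, target


prompt, target = acronym_example(["chief", "executive", "officer"])
print(prompt, "->", target)
```

Because the function accepts any number of words, the same sketch covers acronyms longer than three letters.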

Ideas for extensions of the original paper

Understanding what's happening in the adversarial examples, most notably the S-Inhibition Heads' attention patterns (hard). (S-Inhibition Heads are described in the IOI paper.)
Understanding how the positional signal is encoded (relative distance, something else?). Bonus points for a story that includes the positional embeddings and explains how the difference between positions is computed (if relative is the right framework) by Duplicate Token Heads / Induction Heads. (Hard, but less context-dependent.)
What is the role of MLPs in IOI? (Quite broad and hard.)
What is the role of Duplicate Token Heads outside IOI? Are they used in other Q-compositions with S-Inhibition Heads? Can we describe how their QK circuit implements "collision detection" at the parameter level? (The last question has low context dependence and is quite tractable.)
What is the role of Negative / Backup / regular Name Mover Heads outside IOI? Can we find examples in which Negative Name Movers contribute positively to the next-token prediction?
What are the differences between the 5 induction heads present in GPT2-small? Which heads do they rely on, and which later heads do they compose with? (Low context dependence on IOI.)
Understanding 4.11 (a really sharp previous token head) at the parameter level. I think this can be quite tractable given that its attention pattern is almost perfectly off-diagonal.
What are the conditions for compensation mechanisms to occur? Is it due to dropout? @Arthur Conmy is working on this; feel free to reach out at arthur@rdwrs.com.
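As a reference point for the Duplicate Token Head question, the "collision detection" behaviour can be written down as a plain function: for each position, find the most recent earlier position holding the same token. This sketch describes only the target behaviour, not how the QK circuit implements it. It also works on whole words for readability, whereas GPT-2 operates on BPE tokens. The function name is an illustrative assumption.

```python
def earlier_duplicate_positions(tokens):
    """For each position, return the index of the most recent earlier identical token, or None."""
    last_seen = {}
    result = []
    for i, token in enumerate(tokens):
        result.append(last_seen.get(token))
        last_seen[token] = i
    return result


# Word-level stand-in for a tokenized IOI prompt.
tokens = ["Mary", "and", "John", "went", "to", "the", "store", ".", "Mary", "handed"]
duplicates = earlier_duplicate_positions(tokens)
for i, source in enumerate(duplicates):
    if source is not None:
        print(f"position {i} ({tokens[i]}) repeats position {source}")
```

Comparing a head's strongest attention target at each position against this reference is one possible way to test whether it behaves like a Duplicate Token Head on a given prompt.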
