Investigate how 3-layer and 4-layer attention-only models differ from 2L

How do 3-layer and 4-layer attention-only models differ from 2L?

Compute composition scores between pairs of heads in different layers
Look for evidence of composition, e.g. one head’s output making up a large fraction of the norm of another head’s query, key or value vector
Apply the “PCA of logits on a fixed set of random tokens” technique and look for more kinks.
Can you associate these kinks with circuits?
Ablate one head at a time and run the model on a lot of text. Look at the change in performance to find the most important heads. Do any heads that are not induction heads matter a lot?
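As a rough illustration of the composition-score idea, here is a minimal NumPy sketch. It assumes one common definition, the Frobenius-norm ratio ||A B|| / (||A|| ||B||) used in the transformer-circuits literature, and a column-vector convention in which W_Q, W_K, W_V have shape [d_head, d_model] and W_O has shape [d_model, d_head]. The names `composition_score`, `head_scores` and `random_head` are hypothetical, and the weights are random stand-ins rather than weights from a real model.

```python
import numpy as np

def composition_score(w_later, w_earlier):
    """Frobenius-norm ratio ||w_later @ w_earlier|| / (||w_later|| * ||w_earlier||)."""
    num = np.linalg.norm(w_later @ w_earlier)  # matrix norm defaults to Frobenius
    den = np.linalg.norm(w_later) * np.linalg.norm(w_earlier)
    return float(num / den)

def head_scores(earlier, later):
    """Q-, K- and V-composition of a later head with an earlier head.

    Each head is a tuple (W_Q, W_K, W_V, W_O) in the assumed convention:
    W_Q, W_K, W_V are [d_head, d_model]; W_O is [d_model, d_head].
    """
    _, _, W_V1, W_O1 = earlier
    W_Q2, W_K2, W_V2, W_O2 = later
    W_OV1 = W_O1 @ W_V1     # what the earlier head writes to the residual stream
    W_OV2 = W_O2 @ W_V2
    W_QK2 = W_Q2.T @ W_K2   # bilinear form: query side on the left, key side on the right
    return {
        "Q": composition_score(W_QK2.T, W_OV1),
        "K": composition_score(W_QK2, W_OV1),
        "V": composition_score(W_OV2, W_OV1),
    }

# Random stand-in weights (an assumption; load real model weights here instead).
rng = np.random.default_rng(0)
d_model, d_head = 64, 8

def random_head():
    W_Q, W_K, W_V = (rng.normal(size=(d_head, d_model)) for _ in range(3))
    W_O = rng.normal(size=(d_model, d_head))
    return W_Q, W_K, W_V, W_O

scores = head_scores(random_head(), random_head())
print(scores)
```

Scores for random matrices of the same shapes give a baseline to compare against: a pair of heads is only interesting if its score sits clearly above what unrelated weights produce.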
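The single-head ablation experiment can be sketched as a loop over (layer, head) pairs that records the change in next-token loss. The sketch below is a toy under stated assumptions: a tiny attention-only model in NumPy with random weights, no positional embeddings, a row-vector convention, zero ablation (mean ablation is another option), and random tokens in place of real text. All names (`init_params`, `forward`, `next_token_loss`, `rank_heads`) are hypothetical; with a trained model you would swap in its weights and a large text sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_heads, d_model, d_head, d_vocab = 3, 2, 16, 4, 20

def init_params(scale=0.1):
    """Random stand-in weights for a small attention-only transformer."""
    def head():
        return (rng.normal(0, scale, (d_model, d_head)),   # W_Q
                rng.normal(0, scale, (d_model, d_head)),   # W_K
                rng.normal(0, scale, (d_model, d_head)),   # W_V
                rng.normal(0, scale, (d_head, d_model)))   # W_O
    return {"embed": rng.normal(0, scale, (d_vocab, d_model)),
            "unembed": rng.normal(0, scale, (d_model, d_vocab)),
            "layers": [[head() for _ in range(n_heads)] for _ in range(n_layers)]}

def forward(params, tokens, ablate=None):
    """Return logits; `ablate` is an optional (layer, head) whose output is zeroed."""
    x = params["embed"][tokens]                          # [seq, d_model]
    seq = len(tokens)
    causal = np.tril(np.ones((seq, seq), dtype=bool))    # attend to self and earlier positions
    for l, layer in enumerate(params["layers"]):
        out = np.zeros_like(x)
        for h, (W_Q, W_K, W_V, W_O) in enumerate(layer):
            if ablate == (l, h):
                continue                                 # zero ablation: skip this head's output
            scores = (x @ W_Q) @ (x @ W_K).T / np.sqrt(d_head)
            scores = np.where(causal, scores, -np.inf)
            scores = scores - scores.max(axis=-1, keepdims=True)
            attn = np.exp(scores)
            attn = attn / attn.sum(axis=-1, keepdims=True)
            out = out + attn @ (x @ W_V) @ W_O
        x = x + out                                      # residual connection
    return x @ params["unembed"]

def next_token_loss(params, tokens, ablate=None):
    """Mean cross-entropy of predicting tokens[1:] from tokens[:-1]."""
    logits = forward(params, tokens[:-1], ablate)
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return float(-logp[np.arange(len(tokens) - 1), tokens[1:]].mean())

def rank_heads(params, tokens):
    """Sort heads by loss increase when ablated (largest increase first)."""
    base = next_token_loss(params, tokens)
    deltas = {(l, h): next_token_loss(params, tokens, ablate=(l, h)) - base
              for l in range(n_layers) for h in range(n_heads)}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

params = init_params()
tokens = rng.integers(0, d_vocab, size=64)   # random tokens as a stand-in for text
ranking = rank_heads(params, tokens)
for (layer, head), delta in ranking:
    print(f"L{layer}H{head}: loss change {delta:+.5f}")
```

Because the weights here are untrained, the printed numbers carry no meaning; only the procedure transfers. On a real 3L or 4L model, heads near the top of the ranking that do not show induction-style attention patterns are the ones worth a closer look.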

Interpretability & Explainability
