Investigate how 3-layer and 4-layer attention-only models differ from 2L
by Sabrina Zaki
How do 3-layer and 4-layer attention-only models differ from 2L?
- Look for composition scores
- Look for evidence of composition, e.g. one head’s output accounting for a large fraction of the norm of another head’s query, key or value vector.
- Apply the “PCA of logits on a fixed set of random tokens” technique and look for more kinks.
- Can you associate these with circuits?
- Ablate a single head and run the model on a lot of text. Look at the change in performance to find the most important heads. Do any heads that are not induction heads matter a lot?
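
As a starting point for the composition-score item above, here is a minimal sketch in Python with NumPy. It assumes the common Frobenius-norm ratio ||AB|| / (||A|| ||B||) as the score, a row-vector convention (the residual stream multiplies weights from the left), and per-head weights named W_Q, W_K, W_V (shape d_model x d_head) and W_O (shape d_head x d_model). Those names, the shapes, and the random weights are illustrative assumptions, not taken from any particular model or library.

```python
import numpy as np


def composition_score(a, b):
    """Frobenius-norm ratio ||a @ b|| / (||a|| * ||b||), which lies in [0, 1]."""
    return np.linalg.norm(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


def head_composition(early, late):
    """Q-, K- and V-composition of a later head with an earlier head.

    Assumed convention: a residual row vector x is mapped by a head's OV
    circuit as x @ OV, and attention scores are x_query @ QK @ x_key.T.
    """
    ov_early = early["W_V"] @ early["W_O"]  # (d_model, d_model)
    qk_late = late["W_Q"] @ late["W_K"].T   # (d_model, d_model)
    ov_late = late["W_V"] @ late["W_O"]     # (d_model, d_model)
    return {
        # Early output feeds the later head's query side.
        "Q": composition_score(ov_early, qk_late),
        # Early output feeds the later head's key side.
        "K": composition_score(qk_late, ov_early.T),
        # Early output feeds the later head's value side.
        "V": composition_score(ov_early, ov_late),
    }


def random_head(rng, d_model, d_head):
    """Hypothetical stand-in for a trained head: random Gaussian weights."""
    return {
        "W_Q": rng.normal(size=(d_model, d_head)),
        "W_K": rng.normal(size=(d_model, d_head)),
        "W_V": rng.normal(size=(d_model, d_head)),
        "W_O": rng.normal(size=(d_head, d_model)),
    }


rng = np.random.default_rng(0)
early_head = random_head(rng, d_model=64, d_head=16)
late_head = random_head(rng, d_model=64, d_head=16)
scores = head_composition(early_head, late_head)
for kind, value in scores.items():
    print(f"{kind}-composition: {value:.3f}")
```

With random weights the scores only provide a baseline. In a 3-layer or 4-layer model, one would compute them for every (earlier head, later head) pair across layers and look for pairs that stand well above that baseline.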
Interpretability & Explainability