Sabrina Zaki

Verified expert

@sabrina-zaki · Student

Ideas by Sabrina Zaki

Open-ended Open

Investigate relationship between double descent and grokking

What is the relationship between double descent and grokking?

- Double descent seems to be caused by polysemanticity phase transitions, while grokking seems to be a general effect of task learning.
- We see a slight decrease in performance over a few epochs, which then converges to an even lower equilibrium, indicating a new level of hyperdimensional encoding. See [example](https://transformer-circuits.pub/2022/toy_model/index.html#geometry:~:text=Let%27s%20look%20at%20the%20resulting%20plot%2C%20and%20then%20we%27ll%20try%20to%20figure%20out%20what%20it%27s%20showing%20us%3A).

Hypothesis Open

Investigate circuits: Compare an nL model to an (n+1)L model

Look for tasks that an nL model cannot do but an (n+1)L model can, and look for a circuit! Proposal:

- Build the infrastructure to do this: run two models over a lot of text and look for big log prob differences (maybe floor the log probs at e.g. 5, to avoid overfitting to times that one network was incredibly wrong).

Open-ended Open

Circuit investigation: Compare tasks for an nL model to an (n+1)L model

Look for tasks that an nL model cannot do but an (n+1)L model can, and look for a circuit! Proposal:

- Build the infrastructure to do this: run two models over a lot of text and look for big log prob differences (maybe floor the log probs at e.g. 5, to avoid overfitting to times that one network was incredibly wrong).
- Just take a bunch of text with interesting patterns and run the models over it, look for tokens they do really well on, and try to reverse engineer what’s going on. I expect there’s a lot of stuff in here!
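The log prob comparison in the proposal could be sketched roughly as below. This is a minimal illustration, not the author's infrastructure: the function name `top_logprob_gaps` and its parameters are hypothetical, the per-token log probs are assumed to have been computed already by running each model over the same text, and the "floor at e.g. 5" is read here as clamping log probs from below at -5, which is an assumption.

```python
def top_logprob_gaps(tokens, logprobs_small, logprobs_large, floor=-5.0, k=3):
    """Rank tokens by how much better the larger model predicts them.

    logprobs_small / logprobs_large hold each model's log probability of the
    correct token at every position. Both are clamped from below at `floor`
    (assumed reading of the source's "floor at e.g. 5"), so a position where
    one model is catastrophically wrong cannot dominate the ranking.
    Returns up to k tuples of (gap, position, token), largest gap first.
    """
    if not (len(tokens) == len(logprobs_small) == len(logprobs_large)):
        raise ValueError("tokens and both log prob lists must have the same length")
    gaps = []
    for position, (token, lp_small, lp_large) in enumerate(
        zip(tokens, logprobs_small, logprobs_large)
    ):
        gap = max(lp_large, floor) - max(lp_small, floor)
        gaps.append((gap, position, token))
    gaps.sort(key=lambda item: item[0], reverse=True)
    return gaps[:k]


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    tokens = ["The", " cat", " sat", " on"]
    small = [-1.0, -30.0, -4.0, -0.5]
    large = [-0.9, -0.2, -1.0, -0.6]
    for gap, position, token in top_logprob_gaps(tokens, small, large, k=2):
        print(position, repr(token), round(gap, 3))
```

Without the clamp, the position where the smaller model scores -30 would swamp everything else; with it, that position still ranks first but by a bounded margin, so other consistent gaps stay visible.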

Open-ended Open

Reverse engineering of a 1-layer SoLU model

How far can you get with really deeply reverse engineering a 1-layer SoLU model?

- Which directions correspond to features?
- Can you find any [polysemantic](https://transformer-circuits.pub/2022/toy_model/index.html) neurons?
- Can you fully reverse engineer a feature direction and compare it to a neuron direction?
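One crude way to compare a feature direction to a neuron direction is sketched below. It assumes you already have MLP activation vectors for text that contains the feature and for text that does not, estimates the feature direction as the difference of the two means (a stand-in technique chosen for this sketch, not something the idea prescribes), and then measures cosine similarity with each neuron basis vector. All function names are hypothetical.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def feature_direction(acts_with, acts_without):
    """Difference of means: a rough estimate of the feature's direction."""
    mu_with = mean_vector(acts_with)
    mu_without = mean_vector(acts_without)
    return [a - b for a, b in zip(mu_with, mu_without)]


def neuron_alignment(direction):
    """Cosine similarity between `direction` and each neuron basis vector.

    The cosine with the i-th standard basis vector is direction[i] / ||direction||.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0.0:
        return [0.0 for _ in direction]
    return [x / norm for x in direction]


if __name__ == "__main__":
    # Made-up 3-neuron activations purely for illustration.
    acts_with = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
    acts_without = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
    direction = feature_direction(acts_with, acts_without)
    print(neuron_alignment(direction))
```

An alignment close to 1 for a single neuron suggests the feature sits on that neuron; alignment spread over many neurons suggests the feature is not basis-aligned.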

Open-ended Open

Investigate SoLU lexoscope's neurons

Neel Nanda made the website [lexoscope.io](lexoscope.io), which shows the text that most activates each neuron in several SoLU language models he trained, including toy SoLU models. Problem ideas could be:

- Hunt through it and look for interesting neurons. Can you find weird and abstract ones?
- Can you find neuron families, à la [equivariance](https://distill.pub/2020/circuits/equivariance/)?
- Study a lot of neurons at different layers and look for patterns. What can we say about what the model is doing at different layers? What patterns are there?
- Can you find examples of neuron splitting? (A single high-level feature splits into several more specific features as you scale up.)
- Can you reverse engineer a neuron? Can you find a specific direction in activation space that is exactly that feature, and how aligned is it with the neuron basis?
- Can you find any highly non-monosemantic features?
  - A task where the entire MLP layer matters, but no one neuron activates much.
  - Or where the pre-layernorm activation is low but the post-layernorm activation is high, so the model “smuggles through” directions.
- Find polysemantic neurons and try to reverse engineer them: give the model a bunch of text containing each feature and average it or apply PCA. Can you find directions corresponding to each feature? How much do they align with that neuron?
- Replicate the part of Conjecture’s Polytopes paper where they look at the top e.g. 1000 dataset examples for a neuron across a ton of text and look for patterns in them. Is it the case that there are monosemantic bands in the neuron activation spectrum?
- Can you find a genuinely monosemantic neuron? Possible idea: look for algorithmic-flavoured neurons, e.g. one whose activation could be mimicking a regex, and use this to automatically test that it’s actually doing things.
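The regex idea in the last bullet could be tested along these lines. This is only a sketch under stated assumptions: the neuron's activations on each text snippet are assumed to be already recorded, "fires" is taken to mean "activation above a threshold" (the threshold value is arbitrary), and `regex_neuron_agreement` is a hypothetical helper name.

```python
import re


def regex_neuron_agreement(texts, activations, pattern, threshold=0.5):
    """Score a regex as a predictor of when a neuron fires.

    A snippet counts as "firing" if its activation exceeds `threshold`
    (an assumed definition), and as "matching" if the regex is found in it.
    Returns (precision, recall) of the regex against the firing labels.
    """
    regex = re.compile(pattern)
    true_pos = false_pos = false_neg = 0
    for text, activation in zip(texts, activations):
        fires = activation > threshold
        matches = regex.search(text) is not None
        if matches and fires:
            true_pos += 1
        elif matches:
            false_pos += 1
        elif fires:
            false_neg += 1
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up snippets and activations purely for illustration.
    texts = ["call 555-1234", "hello world", "room 42", "no digits here"]
    activations = [0.9, 0.1, 0.8, 0.7]
    print(regex_neuron_agreement(texts, activations, r"\d"))
```

High precision with low recall would hint that the neuron also fires on something the regex misses, which is one sign of polysemanticity; both near 1 across a lot of text is weak evidence the neuron really is doing what the regex describes.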

Open-ended Open

Investigate how 3-layer and 4-layer attention-only models differ from 2L

How do 3-layer and 4-layer attention-only models differ from 2L?

- Look for composition scores.
- Look for evidence of composition, e.g. one head’s output represents a big fraction of the norm of another head’s query, key or value vector.
- Do the “PCA of logits on a fixed set of random tokens” technique and look for more kinks.
  - Can you associate these with circuits?
- Ablate a single head and run the model on a lot of text. Look at the change in performance. Find the most important heads. Do any heads matter a lot that are not induction heads?
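A composition score between two heads can be sketched as a ratio of Frobenius norms, assuming the score takes the form ||A B|| / (||A|| ||B||), where B is the earlier head's output matrix and A is the matrix through which the later head reads it. Which matrix plays the role of A (query, key, or value side, possibly transposed) depends on the type of composition and the multiplication convention, so treat the argument names below as placeholders.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists-of-rows."""
    inner = len(b)
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def frobenius(matrix):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def composition_score(w_later, w_earlier):
    """||w_later @ w_earlier||_F / (||w_later||_F * ||w_earlier||_F).

    Near 0 when the later head reads from directions the earlier head
    does not write to.
    """
    denominator = frobenius(w_later) * frobenius(w_earlier)
    if denominator == 0.0:
        return 0.0
    return frobenius(matmul(w_later, w_earlier)) / denominator


if __name__ == "__main__":
    # Toy 2x2 matrices purely for illustration.
    writes_dim0 = [[1.0, 0.0], [0.0, 0.0]]
    reads_dim0 = [[1.0, 0.0], [0.0, 0.0]]
    reads_dim1 = [[0.0, 0.0], [0.0, 1.0]]
    print(composition_score(reads_dim0, writes_dim0))
    print(composition_score(reads_dim1, writes_dim0))
```

Computing this for every (earlier head, later head) pair in a 3L or 4L model gives a matrix of scores that can be scanned for unusually strong pairs.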

Hypothesis Open

Complicated models are harder to evaluate and analyze

As systems become more complicated, we expect that it will become harder to (1) aggregate and analyze the actual labels or rewards given during training, and (2) evaluate the relevant counterfactuals.