Sabrina Zaki
Verified expert · @sabrina-zaki · Student
What is the relationship between double descent and grokking? - Double descent seems to be caused by polysemanticity phase transitions, while grokking seems like a general effect of task learning. We see a slight decrease in performance over a few epochs, after which the loss converges to an even lower equilibrium, suggesting a new level of hyperdimensional encoding. See this [example](https://transformer-circuits.pub/2022/toy_model/index.html#geometry:~:text=Let%27s%20look%20at%20the%20resulting%20plot%2C%20and%20then%20we%27ll%20try%20to%20figure%20out%20what%20it%27s%20showing%20us%3A).
Look for tasks that an nL model cannot do but an (n+1)L model can, then look for a circuit! Proposal:
- Build the infrastructure to do this: run two models over a lot of text and look for big log prob differences (maybe floor the log probs at e.g. -5, i.e. cap the per-token loss at 5, to avoid overfitting to times that one network was incredibly wrong).
- Just take a bunch of text with interesting patterns and run the models over it, look for tokens they do really well on, and try to reverse engineer what’s going on. I expect there’s a lot of stuff in here!
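A minimal sketch of the log prob comparison, assuming you already have per-token log probs of the correct next token from both models. The function name `clipped_logprob_gaps`, the toy numbers, and the floor of -5 are illustrative assumptions, not part of any existing tooling.

```python
def clipped_logprob_gaps(tokens, logprobs_small, logprobs_large, floor=-5.0):
    """Rank tokens by how much better the larger model predicts them.

    logprobs_* hold one log probability (of the correct next token) per token.
    Values below `floor` are raised to `floor`, so a single catastrophic miss
    by one model cannot dominate the ranking.
    """
    if not (len(tokens) == len(logprobs_small) == len(logprobs_large)):
        raise ValueError("all inputs must have the same length")
    gaps = []
    for pos, (tok, lp_s, lp_l) in enumerate(zip(tokens, logprobs_small, logprobs_large)):
        gap = max(lp_l, floor) - max(lp_s, floor)
        gaps.append((gap, pos, tok))
    # Largest positive gap first: tokens the deeper model handles much better.
    return sorted(gaps, key=lambda g: g[0], reverse=True)


# Toy data (made up): the 1L model badly misses the repeated " cat".
tokens = ["The", " cat", " sat", " on", " the", " cat"]
lp_1l = [-2.0, -4.0, -3.0, -1.0, -0.5, -12.0]
lp_2l = [-2.1, -3.8, -2.9, -1.1, -0.5, -0.3]

ranked = clipped_logprob_gaps(tokens, lp_1l, lp_2l)
best_gap, best_pos, best_tok = ranked[0]
print(best_pos, repr(best_tok), round(best_gap, 2))
```

Without the floor the top gap here would be 11.7 rather than 4.7, so one extreme miss would swamp everything else in the ranking.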
How far can you get with really deeply reverse engineering a 1-layer SoLU model?
- Which directions correspond to features?
- Can you find any [polysemantic](https://transformer-circuits.pub/2022/toy_model/index.html) neurons?
- Can you fully reverse engineer a feature direction and compare it to a neuron direction?
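One simple way to compare a feature direction to a neuron direction is cosine similarity. The sketch below assumes the feature direction is estimated as a difference of mean MLP activations between text that contains the feature and text that does not. That estimator, the helper names, and the toy activations are all assumptions made for illustration.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def feature_direction(acts_with, acts_without):
    """Difference of mean activations: a crude estimate of a feature direction."""
    mu_with = mean_vector(acts_with)
    mu_without = mean_vector(acts_without)
    return [a - b for a, b in zip(mu_with, mu_without)]


# Toy activations over 3 neurons (made up for illustration).
acts_with = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
acts_without = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.0]]

direction = feature_direction(acts_with, acts_without)
neuron_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
alignment = [cosine(direction, basis) for basis in neuron_basis]
print([round(a, 3) for a in alignment])
```

A cosine near 1 for a single neuron suggests the feature is close to basis-aligned. Similar, moderate values spread across many neurons would suggest the feature is not neuron-aligned.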
Neel Nanda made the website [lexoscope.io](lexoscope.io). It shows the text that most activates each neuron in several SoLU language models he trained, including toy SoLU models. Problem ideas could be:
- Hunt through it and look for interesting neurons. Can you find weird and abstract ones?
- Can you find neuron families, à la [equivariance](https://distill.pub/2020/circuits/equivariance/)?
- Study a lot of neurons at different layers and look for patterns. What can we say about what the model is doing at different layers? What patterns are there?
- Can you find examples of neuron splitting? (A single high-level feature splits into several more specific features as you scale up.)
- Can you reverse engineer a neuron? Can you find a specific direction in activation space that is exactly that feature, and how aligned is it with the neuron basis?
- Can you find any highly non-monosemantic features? For example, a task where the entire MLP layer matters but no one neuron activates much, or where the pre-layernorm activation is low but the post-layernorm activation is high, so the model “smuggles through” directions.
- Find polysemantic neurons and try to reverse engineer them: give the model a bunch of text containing each feature and average the activations or apply PCA. Can you find directions corresponding to each feature? How much do they align with that neuron?
- Replicate the part of Conjecture’s Polytopes paper where they look at the top e.g. 1000 dataset examples for a neuron across a ton of text and look for patterns in them. Is it the case that there are monosemantic bands in the neuron activation spectrum?
- Can you find a genuinely monosemantic neuron? Possible idea: look for algorithmic-flavoured neurons, e.g. one whose activation could be mimicking a regex, and use the regex to automatically test that the neuron is actually doing that.
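The regex idea can be checked automatically once you have (text, activation) pairs for a neuron. This is a minimal sketch: the four-digit-year pattern, the firing threshold of 1.0, and the sample activations are hypothetical, chosen only to show the shape of the test.

```python
import re


def regex_agreement(pattern, samples, threshold):
    """Compare 'neuron fires' against 'regex matches' on labelled samples.

    samples is a list of (text, activation) pairs. Returns (precision, recall),
    treating the regex match as ground truth for the hypothesised feature.
    """
    regex = re.compile(pattern)
    true_pos = false_pos = false_neg = 0
    for text, activation in samples:
        fires = activation > threshold
        matches = regex.search(text) is not None
        if fires and matches:
            true_pos += 1
        elif fires:
            false_pos += 1
        elif matches:
            false_neg += 1
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else float("nan")
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else float("nan")
    return precision, recall


# Hypothesis (made up): the neuron fires on four-digit years.
year_pattern = r"\b(19|20)\d{2}\b"
samples = [
    ("born in 1987", 3.2),       # fires, matches
    ("call 555-0199", 0.1),      # silent, no match
    ("the year 2024 was", 2.7),  # fires, matches
    ("about 2000 people", 0.2),  # silent, but regex matches
    ("hello world", 1.5),        # fires, no match
]

precision, recall = regex_agreement(year_pattern, samples, threshold=1.0)
print(round(precision, 3), round(recall, 3))
```

Low precision means the neuron also fires on something the regex does not capture (a hint of polysemanticity). Low recall means the regex is broader than the neuron's actual feature.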
How do 3-layer and 4-layer attention-only models differ from 2L?
- Look for composition scores.
- Look for evidence of composition, e.g. one head’s output represents a big fraction of the norm of another head’s query, key or value vector.
- Do the “PCA of logits on a fixed set of random tokens” technique and look for more kinks. Can you associate these with circuits?
- Ablate a single head and run the model on a lot of text. Look at the change in performance and find the most important heads. Do any heads matter a lot that are not induction heads?
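A composition score can be sketched as a ratio of Frobenius norms, in the style of the score described in "A Mathematical Framework for Transformer Circuits". The code below assumes you have already extracted the relevant weight matrices as nested lists. Which matrices to plug in (e.g. an earlier head's OV matrix and a later head's QK or OV matrix, and in what orientation) depends on your model's conventions and is left as an assumption here, as are the toy 2x2 matrices.

```python
import math


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def frobenius(m):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in m for x in row))


def composition_score(w_later, w_earlier):
    """||W_later @ W_earlier||_F / (||W_later||_F * ||W_earlier||_F).

    Near 0: the later matrix ignores the subspace the earlier one writes to.
    Larger values: the later head reads what the earlier head wrote.
    """
    denom = frobenius(w_later) * frobenius(w_earlier)
    if denom == 0.0:
        raise ValueError("composition score is undefined for a zero matrix")
    return frobenius(matmul(w_later, w_earlier)) / denom


# Toy matrices (made up): the earlier head writes only to dimension 0.
w_ov_early = [[1.0, 0.0], [0.0, 0.0]]
w_reads_dim0 = [[1.0, 0.0], [0.0, 0.0]]  # later head reads dimension 0
w_reads_dim1 = [[0.0, 0.0], [0.0, 1.0]]  # later head reads dimension 1

score_composed = composition_score(w_reads_dim0, w_ov_early)
score_disjoint = composition_score(w_reads_dim1, w_ov_early)
print(score_composed, score_disjoint)
```

In practice it helps to compare each score against a baseline from random matrices of the same shape, since unrelated matrices still produce a nonzero score.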
As systems become more complicated we expect that it will become harder to (1) aggregate and analyze the actual labels or rewards given during training, and (2) evaluate the relevant counterfactuals.
Replacing stochastic gradient descent (SGD) with something that takes the shortest path rather than the steepest one should just about fix the whole inner-alignment problem.
Warning shots / slow takeoff might help reduce the probability P that AGI gets power-seeking motivations and escapes control, and/or slow the increase in the number of groups trying to train an AGI (let’s say from scratch, although fine-tuning is risky too)