Open-ended
Investigate SoLU lexoscope's neurons
Neel Nanda made the website lexoscope.io - it shows the text that most activates each neuron in several SoLU language models he trained, including toy SoLU models.
Problem ideas could be:
- Hunt through it and look for interesting neurons - can you find weird and abstract ones?
- Can you find neuron families, a la equivariance?
- Study a lot of neurons at different layers and look for patterns - what can we say about what the model is doing at different layers? What patterns are there?
- Can you find examples of neuron splitting? (A single high-level feature splits into several more specific features as you scale up)
- Can you reverse engineer a neuron? Can you find a specific direction in activation space that is exactly that feature, and how aligned is it with the neuron basis?
- Can you find any highly non-monosemantic features? A task where the entire MLP layer matters, but no one neuron activates much
- Or where the pre-layernorm activation is low but the post-layernorm activation is high, so the model “smuggles through” directions
- Find polysemantic neurons. Try to reverse engineer them - give the model a bunch of text containing each feature and average the activations or apply PCA. Can you find directions corresponding to each feature? How much do they align with that neuron?
- Replicate the part of Conjecture’s Polytopes paper where they look at the top (e.g.) 1000 dataset examples for a neuron across a ton of text and look for patterns in that
- Is it the case that there are monosemantic bands in the neuron activation spectrum? Can you find a genuinely monosemantic neuron?
- Possible idea - look for algorithmic-flavoured neurons, e.g. one whose activation could be mimicking a regex - use this to automatically test that it’s actually doing what you think
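
The regex idea can be checked mechanically once you have a guess for what a neuron detects. Below is a minimal sketch, assuming you already have per-token activations for one neuron; the tokens, activation values, threshold, and the helper name `regex_agreement` are all made up for illustration and do not come from any real model or library.

```python
import re

def regex_agreement(tokens, activations, pattern, threshold):
    """Compare 'neuron fires' (activation > threshold) with 'token matches regex'.

    Returns (precision, recall) of the regex as a predictor of firing:
    precision = fraction of regex-matching tokens on which the neuron fires,
    recall    = fraction of firing tokens that the regex matches.
    """
    if len(tokens) != len(activations):
        raise ValueError("tokens and activations must have the same length")
    regex = re.compile(pattern)
    fires = [a > threshold for a in activations]
    matches = [regex.fullmatch(t) is not None for t in tokens]
    true_pos = sum(f and m for f, m in zip(fires, matches))
    n_match = sum(matches)
    n_fire = sum(fires)
    precision = true_pos / n_match if n_match else 0.0
    recall = true_pos / n_fire if n_fire else 0.0
    return precision, recall

# Toy data (hypothetical): a neuron that appears to fire on four-digit years.
tokens = ["In", "1987", "there", "were", "42", "cats", "and", "2021", "dogs"]
activations = [0.0, 2.3, 0.1, 0.0, 0.2, 0.0, 0.1, 1.9, 0.0]

precision, recall = regex_agreement(tokens, activations, r"\d{4}", threshold=1.0)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision suggests the regex is too broad for the neuron; low recall suggests the neuron also fires on things the regex misses, which is one cheap signal of polysemanticity. On real data you would run this over a large corpus rather than a single sentence.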
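
For the polysemantic-neuron item, the "average it" step can be sketched as a difference of means: average the MLP activation vectors on text containing the feature, subtract the average on text without it, and measure how well the resulting direction lines up with a single neuron. This is only one simple way to do it (PCA is not shown), and it assumes you have already extracted activation vectors; the 3-neuron toy data and the helper names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def feature_direction(with_feature, without_feature):
    """Unit-norm difference of means between two sets of activation vectors."""
    mu_pos = mean_vector(with_feature)
    mu_neg = mean_vector(without_feature)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("the two sets have identical means")
    return [d / norm for d in diff]

def neuron_alignment(direction, neuron_index):
    """Cosine similarity between a unit direction and one neuron's basis vector.

    For a unit-norm direction this is just the corresponding coordinate.
    """
    return direction[neuron_index]

# Toy activations for a 3-neuron layer (made up for illustration).
with_feature = [[2.0, 4.0, 0.0], [4.0, 4.0, 0.0]]
without_feature = [[1.0, 0.0, 1.0], [-1.0, 0.0, -1.0]]

direction = feature_direction(with_feature, without_feature)
print("direction:", direction)
print("alignment with neuron 0:", neuron_alignment(direction, 0))
```

An alignment near 1 means the feature is close to living on that one neuron; a value well below 1, as in this toy case, means the feature is spread across several neurons.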
Interpretability & Explainability