Investigate SoLU lexoscope's neurons

Neel Nanda made the website lexoscope.io - it shows the text that most activates each neuron in several SoLU language models he trained, including toy SoLU models.

Problem ideas could be:

Hunt through it and look for interesting neurons - can you find weird and abstract ones?
Can you find neuron families, a la equivariance?
Study a lot of neurons at different layers and look for patterns - what can we say about what the model is doing at different layers? What patterns are there?
Can you find examples of neuron splitting? (A single high-level feature splits into several more specific features as you scale up)
Can you reverse engineer a neuron? Can you find a specific direction in activation space that is exactly that feature, and how aligned is it with the neuron basis?
Can you find any highly non-monosemantic features? For example, a task where the entire MLP layer matters, but no single neuron activates much
Or where the pre-layernorm activation is low but the post-layernorm activation is high, so the model “smuggles through” directions
Find polysemantic neurons. Try to reverse engineer them - give the model a bunch of text containing each feature, then average the activations or apply PCA. Can you find directions corresponding to each feature? How much do they align with that neuron?
Replicate the part of Conjecture’s Polytopes paper where they look at the top (e.g. 1000) dataset examples for a neuron across a ton of text and look for patterns in them
Is it the case that there are monosemantic bands in the neuron activation spectrum?
Can you find a genuinely monosemantic neuron?
Possible idea - look for algorithmic-flavoured neurons, e.g. one whose activation could be mimicking a regex - use this to automatically test that it’s actually doing what you think
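
One way the regex idea could be checked automatically is sketched below. It is only an illustration under assumptions: the function names (regex_token_mask, score_regex_hypothesis), the firing threshold, and the example tokens and activation values are all hypothetical, and in practice the per-token activations would come from running one of the SoLU models over real text. The sketch marks every token that overlaps a regex match, then compares that mask against where the neuron fires.

```python
import re


def regex_token_mask(tokens, pattern):
    """Return one bool per token: True if the token overlaps a regex match.

    The regex is run on the concatenated token strings, so matches that
    span several tokens mark all of them.
    """
    text = "".join(tokens)

    # Character span (start, end) of each token in the joined text.
    spans = []
    pos = 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)

    # Ignore empty matches, which would not cover any character.
    matches = [m.span() for m in re.finditer(pattern, text) if m.end() > m.start()]

    return [
        any(start < m_end and end > m_start for m_start, m_end in matches)
        for start, end in spans
    ]


def score_regex_hypothesis(tokens, activations, pattern, threshold):
    """Compare 'neuron fires' against 'regex matches' token by token.

    precision: of the tokens where the neuron fires, the fraction the regex marks.
    recall:    of the tokens the regex marks, the fraction where the neuron fires.
    """
    predicted = regex_token_mask(tokens, pattern)
    fired = [a > threshold for a in activations]
    both = sum(p and f for p, f in zip(predicted, fired))
    precision = both / sum(fired) if any(fired) else 0.0
    recall = both / sum(predicted) if any(predicted) else 0.0
    return precision, recall


# Made-up example: a hypothetical neuron that seems to fire on four-digit years.
tokens = ["The", " year", " 1999", " and", " 2024", "."]
activations = [0.0, 0.1, 2.3, 0.0, 1.9, 0.2]  # illustrative values, not real data
print(score_regex_hypothesis(tokens, activations, r"\d{4}", threshold=1.0))
```

Low precision would suggest the neuron also fires on things the regex misses (possibly polysemanticity), while low recall would suggest the regex is broader than whatever the neuron actually tracks.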

Interpretability & Explainability
