Open-ended
Open

Investigate SoLU lexoscope's neurons

Neel Nanda made the website lexoscope.io - it shows the text that most activates each neuron in several SoLU language models he trained, including toy SoLU models.

Problem ideas could be:

  • Hunt through it and look for interesting neurons - can you find weird and abstract ones?
  • Can you find neuron families, à la equivariance?
  • Study a lot of neurons at different layers and look for patterns - what can we say about what the model is doing at different layers? What patterns are there?
  • Can you find examples of neuron splitting? (A single high-level feature splits into several more specific features as you scale up)
  • Can you reverse engineer a neuron? Can you find a specific direction in activation space that is exactly that feature, and how aligned is it with the neuron basis?
  • Can you find any highly non-monosemantic features? E.g. a task where the entire MLP layer matters, but no single neuron activates much
  • Or a case where the pre-layernorm activation is low but the post-layernorm activation is high, so the model “smuggles through” directions
  • Find polysemantic neurons. Try to reverse engineer them - give the model a bunch of text containing each feature, then average the resulting activations or apply PCA to them. Can you find directions corresponding to each feature? How much do they align with that neuron?
  • Replicate the part of Conjecture’s Polytopes paper where they look at the top (e.g.) 1000 dataset examples for a neuron across a ton of text and look for patterns in them
    Is it the case that there are monosemantic bands in the neuron activation spectrum?
  • Can you find a genuinely monosemantic neuron?
    Possible idea - look for algorithmic-flavoured neurons, e.g. one whose activation could be mimicking a regex - use this to automatically test that it’s actually doing what you think
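
For the regex idea in the last bullet, here is a minimal sketch of what an automatic test could look like. It assumes you have already extracted one activation per token for the neuron you care about. The function names, the threshold, and the toy tokens and activations are all hypothetical and made up for illustration; nothing here comes from the website or from any particular library.

```python
import re
from typing import Sequence


def regex_labels(tokens: Sequence[str], pattern: str) -> list[bool]:
    """Label each token True if the whole token matches the regex."""
    compiled = re.compile(pattern)
    return [compiled.fullmatch(tok) is not None for tok in tokens]


def agreement(
    activations: Sequence[float],
    labels: Sequence[bool],
    threshold: float,
) -> tuple[float, float]:
    """Precision and recall of 'activation > threshold' against the regex labels."""
    fired = [a > threshold for a in activations]
    tp = sum(f and l for f, l in zip(fired, labels))        # fired, regex matched
    fp = sum(f and not l for f, l in zip(fired, labels))    # fired, regex did not match
    fn = sum(not f and l for f, l in zip(fired, labels))    # silent, regex matched
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy example with invented numbers: a hypothetical "four-digit year" neuron.
tokens = ["In", " 1999", ",", " the", " 2024", " cat"]
activations = [0.0, 2.1, 0.1, 0.0, 1.8, 0.9]  # made up, not real model output

labels = regex_labels(tokens, r"\s?\d{4}")
precision, recall = agreement(activations, labels, threshold=0.5)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision suggests the neuron also fires on things the regex does not cover (possible polysemanticity), and low recall suggests the regex describes more than the neuron actually tracks. Sweeping the threshold would also give a rough look at whether only the top band of activations is monosemantic.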
Interpretability & Explainability
