AI Safety Ideas

Investigate grokking: the effect where models suddenly learn different abilities

by Esben Kran

Neel Nanda reverse-engineered a network trained to do modular addition and showed that it performs the addition with an algorithm based on discrete Fourier transforms and trigonometric identities.
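To make the Fourier-style algorithm concrete, here is a minimal sketch in plain Python. It is not the trained network or Neel Nanda's code: the function name `fourier_mod_add`, the modulus 113, and the frequencies (1, 2, 5) are illustrative assumptions, and the sketch assumes addition modulo a prime. It only shows how angle-addition identities can produce the right answer without ever adding the inputs directly.

```python
import math

def fourier_mod_add(a: int, b: int, p: int = 113, freqs=(1, 2, 5)) -> int:
    """Compute (a + b) mod p using only cos/sin features and trig identities.

    p and freqs are illustrative choices, not values read from a trained model.
    """
    logits = [0.0] * p
    for k in freqs:
        w = 2 * math.pi * k / p
        # "Embedding": represent each input as a point on the unit circle.
        cos_a, sin_a = math.cos(w * a), math.sin(w * a)
        cos_b, sin_b = math.cos(w * b), math.sin(w * b)
        # Angle-addition identities give cos/sin of w*(a+b).
        cos_ab = cos_a * cos_b - sin_a * sin_b
        sin_ab = sin_a * cos_b + cos_a * sin_b
        for c in range(p):
            # This equals cos(w*(a+b-c)), which is 1 when c == (a+b) mod p
            # and strictly smaller for every other c (for prime p).
            logits[c] += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
    # "Unembedding": pick the candidate with the largest summed logit.
    return max(range(p), key=logits.__getitem__)
```

For example, `fourier_mod_add(100, 50)` returns 37, which is (100 + 50) mod 113. Summing over several frequencies makes the correct answer stand out more sharply from the wrong ones, which is one way to think about why the choice of frequencies matters.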

Use the accompanying Google Colab notebook to investigate further questions about grokking:

  • Understanding why the model chooses different frequencies (and why it switches mid-training sometimes!)

  • Understanding why 5 digit addition has a phase change per digit (so 6 total?!)

  • Can you find analytic arguments for why phase changes happen? Perhaps starting with a small model. What's the smallest model that exhibits phase changes? Smallest task? What are the most minimal requirements?
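A practical first step for the questions above is locating where a phase change happens in a recorded training curve. The following is a minimal sketch, assuming you have a list of per-step (or per-epoch) loss values; the function name `sharpest_drop` and the window size are hypothetical choices, and "fastest fall in log-loss" is only one possible working definition of a phase change.

```python
import math

def sharpest_drop(losses, window=5):
    """Return (index, drop): the start of the `window`-step span over which
    log-loss falls the most, and the size of that fall in nats."""
    if len(losses) <= window:
        raise ValueError("need more loss values than the window size")
    log_l = [math.log(x) for x in losses]  # losses must be positive
    drops = [log_l[i] - log_l[i + window] for i in range(len(log_l) - window)]
    i = max(range(len(drops)), key=drops.__getitem__)
    return i, drops[i]

# Synthetic curve: a plateau, a sudden halving at every step, then a plateau.
curve = [2.0] * 50 + [2.0 * 0.5 ** i for i in range(1, 9)] + [2.0 * 0.5 ** 8] * 50
step, drop = sharpest_drop(curve)
```

Running this on curves from progressively smaller models or tasks is one way to check whether a sharp drop still appears. Working in log-loss keeps a late drop from a small loss to a much smaller one visible.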

I recommend reading the summary of the research on Twitter.

Some further work mentioned in the Colab:

  • Modular addition
    • Interpreting the memorisation circuit, and figuring out how models memorise
    • Training on interpretability inspired metrics
      • Note that excluded loss is a somewhat dodgy metric to train on, as it involves computation over both the train and test data
  • Interpreting the five digit addition or predicting repeated subsequences examples
    • In particular, trying to map the many phase changes in 5 digit addition to circuits
  • Looking for other examples of phase changes
    • Toy problems
      • Something incentivising skip trigrams
      • Something incentivising virtual attention heads
    • Looking for curve detectors in a ConvNet
      • A dumb way to try this would be to train a model to imitate the actual curve detectors in Inception (eg minimising the squared error between the model's output and the curve detector activations)
    • Looking at the formation of interpretable neurons in a SoLU transformer
    • Looking inside an LLM with many checkpoints
      • Eleuther have many checkpoints of GPT-J and GPT-Neo, and will share if you ask
      • Mistral have public versions of GPT-2 small and medium, with 5 runs and many checkpoints
      • Possible capabilities to look for
        • Performance on benchmarks, or specific questions from benchmarks
        • Simple algorithmic tasks like addition, or sorting words into alphabetical order, or matching open and close brackets
        • Soft induction heads, eg translation
        • Look at attention heads on various text and see if any have recognisable attention patterns (eg start of word, adjective describing current word, syntactic features of code like indents or variable definitions, most recent open bracket, etc).
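For the last item, one simple way to look for recognisable attention patterns across checkpoints is to score each head against a hand-written template. This is a minimal sketch, assuming the attention weights for one head have already been extracted as a list of rows; `pattern_score`, `previous_token`, and `first_token` are hypothetical helper names, not part of any library.

```python
def pattern_score(attn, template):
    """Average attention weight a head puts on the key chosen by `template`.

    attn: list of rows, where attn[q][k] is the weight query q puts on key k
          (each row is assumed to sum to 1).
    template: function mapping a query index to the expected key index,
          or None if that query position should be skipped.
    """
    scores = []
    for q, row in enumerate(attn):
        k = template(q)
        if k is not None:
            scores.append(row[k])
    return sum(scores) / len(scores) if scores else 0.0

def previous_token(q):
    # The first position has no previous token, so skip it.
    return q - 1 if q > 0 else None

def first_token(q):
    return 0

# Toy head that always attends to the previous token.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
prev_score = pattern_score(attn, previous_token)
first_score = pattern_score(attn, first_token)
```

Tracking such a score for every head at each checkpoint gives a curve per head, and a sudden jump would suggest the pattern forms in a phase change. Templates for text-dependent patterns such as "most recent open bracket" would need the token sequence as an extra input.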
Deep Learning, Interpretability & Explainability
