Hypothesis
Fine-tuning just rewires circuits that already exist, upweighting some and downweighting others, rather than building new circuits.
E.g., fine-tune GPT-2 Small on Wikipedia, then compare the model's internal activations, attention patterns, etc. before and after.
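One way to make the before/after comparison concrete is to score each attention head by how similar its pattern is on the same prompt in the base and fine-tuned models. The sketch below is a minimal illustration, not a definitive method: the function names (`compare_heads`, `most_changed`) are hypothetical, cosine similarity is just one possible metric, and the toy 3x3 patterns stand in for real patterns that would have to be extracted from both copies of GPT-2 Small.

```python
import math


def flatten(pattern):
    """Flatten a [query][key] attention pattern into one list of weights."""
    return [weight for row in pattern for weight in row]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def compare_heads(before, after):
    """Map each (layer, head) key to the similarity of its pattern before vs. after."""
    return {
        head: cosine_similarity(flatten(before[head]), flatten(after[head]))
        for head in before
    }


def most_changed(similarities, k=1):
    """Return the k heads whose patterns changed most (lowest similarity first)."""
    return sorted(similarities, key=similarities.get)[:k]


# Toy data (assumed, not measured): causal attention patterns on a 3-token prompt.
before = {
    (0, 0): [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.25, 0.25, 0.5]],
    (0, 1): [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # previous-token-like
}
after = {
    (0, 0): [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.25, 0.25, 0.5]],  # unchanged
    (0, 1): [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],  # now attends to first token
}

similarities = compare_heads(before, after)
changed = most_changed(similarities, k=1)
```

If the hypothesis holds, most heads should score close to 1 (same pattern, possibly used more or less downstream), with only a few heads changing. A head with low similarity would be a candidate for closer inspection, though a changed pattern alone does not show that a new circuit was built.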
What happens when you fine-tune a model?
How does model performance change on other text? Are specific circuits harmed, or is performance worse across the board?
- A similar hard problem is examining what happens with chain-of-thought prompting. It is especially hard because chain-of-thought reasoning only appears in models of roughly GPT-3 size or larger.