Identify differences between models run on the same text (automated circuits identification)
by Esben Kran
Automated circuit identification is a way to find places where it is worth looking for circuits to analyze.
- Run two models on the same text, or on various benchmarks, and look for places where they differ.
- E.g. per-token losses are likely to show a phase change.
- Significant differences between the models are evidence for a circuit.
- Candidate pairs of models: same architecture at different scales (e.g. GPT-2 Small vs. Medium), different data distributions, different random seeds, or an earlier vs. a later training checkpoint.
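
A minimal sketch of the per-token comparison described in the list above, assuming both models share a tokenizer so that positions line up. The function names `per_token_loss` and `largest_loss_gaps` are hypothetical, and the toy logits are made-up numbers standing in for real model outputs; in practice the logits would come from the two models being compared.

```python
import math
from typing import List, Sequence, Tuple


def per_token_loss(logits: Sequence[Sequence[float]],
                   targets: Sequence[int]) -> List[float]:
    """Cross-entropy (in nats) of each target token under one model's logits."""
    losses = []
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the vocabulary.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_z - row[target])
    return losses


def largest_loss_gaps(loss_a: Sequence[float],
                      loss_b: Sequence[float],
                      top_k: int = 3) -> List[Tuple[int, float]]:
    """Return (position, loss_a - loss_b) for the top_k largest absolute gaps."""
    gaps = [(i, a - b) for i, (a, b) in enumerate(zip(loss_a, loss_b))]
    return sorted(gaps, key=lambda g: abs(g[1]), reverse=True)[:top_k]


# Toy data (assumption): 3 token positions, vocabulary of size 3.
targets = [0, 2, 1]
logits_small = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
logits_large = [[2.0, 0.0, 0.0], [0.0, 0.0, 4.0], [0.0, 1.0, 0.0]]

loss_small = per_token_loss(logits_small, targets)
loss_large = per_token_loss(logits_large, targets)

# Positions where the two models disagree most are candidates for closer study.
candidates = largest_loss_gaps(loss_small, loss_large, top_k=1)
print(candidates)
```

Here only position 1 is flagged, since it is the one token the larger toy model predicts much better than the smaller one. A large gap only marks a place to look; it does not by itself show which circuit is responsible.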
Related to the automated auditing agenda.
Interpretability & Explainability, Deep Learning, NLP