Identify differences between models run on the same text (automated circuits identification)

Automated circuits identification is a way to find places worth searching for circuits to analyze: run two models on the same text and look at where their outputs differ.

Alternatively, run the models on various benchmarks and look for places where they differ.
E.g. per-token losses are likely to show a phase change, i.e. a sharp gap in loss between the two models on particular tokens.
A significant difference is evidence for a circuit that one model has and the other lacks.
Candidate pairs of models: same architecture at different scales (e.g. GPT-2 Small vs. Medium), different data distributions, different random seeds, or an earlier vs. later checkpoint from training.
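
A minimal sketch of the per-token loss comparison, assuming you already have each model's next-token logits for the same tokenized text. The function names, the toy uniform logits, and the 1.0-nat threshold are illustrative assumptions, not part of the proposal itself; in practice the logits would come from the two real models.

```python
import numpy as np

def per_token_loss(logits, targets):
    """Cross-entropy (in nats) at each position.

    logits: array of shape (seq_len, vocab_size)
    targets: array of shape (seq_len,) holding the true next-token ids
    """
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets]

def flag_divergent_tokens(logits_a, logits_b, targets, threshold=1.0):
    """Return (loss_a - loss_b) per token and the positions where the
    absolute difference exceeds `threshold` (an arbitrary cutoff)."""
    diff = per_token_loss(logits_a, targets) - per_token_loss(logits_b, targets)
    flagged = np.flatnonzero(np.abs(diff) > threshold)
    return diff, flagged

# Toy stand-in for two models: model A is uniform over a 10-token vocabulary,
# model B is identical except that it is much more confident at position 3.
seq_len, vocab_size = 6, 10
targets = np.array([1, 4, 2, 7, 0, 3])
logits_a = np.zeros((seq_len, vocab_size))
logits_b = logits_a.copy()
logits_b[3, targets[3]] += 5.0  # pretend B has learned something A has not

diff, flagged = flag_divergent_tokens(logits_a, logits_b, targets)
print(flagged)  # [3]
```

The flagged positions are only candidates: each one marks a token where one model predicts markedly better than the other, which is where one would then look for a circuit.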

Related to the automated auditing agenda.

