AI Safety Ideas
Open-ended
Open

Identify differences between models run on the same text (automated circuits identification)

by Esben Kran

Automated circuit identification is a way to find places worth searching for circuits to analyze: run two models on the same input and look at where their behavior differs.

  • Run the models on the same text, or on various benchmarks, and look for places where they differ.
  • E.g. comparing per-token losses is likely to reveal a phase change.
  • A significant difference is evidence that a circuit is present in one model but not the other.
  • Candidate model pairs: same architecture at different scales (e.g. GPT-2 Small vs. Medium), different data distributions, different random seeds, or an earlier vs. later training checkpoint.
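
The per-token loss comparison could be sketched roughly as follows. This is only an illustration under stated assumptions, not the author's method: the function names (per_token_loss, largest_loss_gaps), the toy logits, and the choice to rank positions by raw loss gap are all hypothetical. In practice the logits would come from running the two models on the same tokenized text.

```python
import math


def per_token_loss(logits, targets):
    """Cross-entropy (in nats) of each target token under the given logits.

    logits: list of per-position score lists, one score per vocabulary entry.
    targets: list of target token ids, one per position.
    """
    losses = []
    for scores, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the vocabulary.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_z - scores[target])
    return losses


def largest_loss_gaps(logits_a, logits_b, targets, k=3):
    """Return the k positions where model A's loss exceeds model B's the most.

    Each result is (position, loss_a - loss_b), sorted by descending gap.
    """
    loss_a = per_token_loss(logits_a, targets)
    loss_b = per_token_loss(logits_b, targets)
    gaps = [(i, a - b) for i, (a, b) in enumerate(zip(loss_a, loss_b))]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data (hypothetical): 4 positions, vocabulary of 3 tokens.
targets = [0, 2, 1, 0]
logits_small = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
logits_large = [[2.0, 0.0, 0.0], [0.0, 0.0, 5.0], [0.0, 1.0, 0.0], [1.5, 0.0, 0.0]]

top_gaps = largest_loss_gaps(logits_small, logits_large, targets, k=2)
print(top_gaps)  # positions 1 and 3 show the largest gaps on this toy data
```

Positions with a large gap are only candidates for closer inspection. Whether a given gap counts as "significant" depends on a threshold or statistical test that this sketch does not specify.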

Related to the automated auditing agenda.

Interpretability & Explainability, Deep Learning, NLP
