Open-ended
Analyze and evaluate the methodological frameworks of existing evals approaches
Some examples of methodological approaches:
- ARC Evals' manual behavioral analysis approach, now supported by their scaffolding
- OpenAI/evals repository for automated evaluations on a range of different methods
- Interpretability to evaluate underlying deception in models
Questions to ask in the project include:
- What is the method (on an abstract/conceptual level)...
- ...why does it lead to what we want...
- ...what are the main weaknesses...
- ...and what would be alternative methods?
- (optional and less important) Show a demonstration / MVP of the alternative method (diagram, actual experiment, etc.) and describe what the expected outputs would be
ARC Evals' current approach, as an example: take a causal node in a risk story, break it down into tasks whose performance correlates with the relevant capability, and measure performance on those tasks. Then combine the tasks into a full flow (how best to combine them is left open).
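To make the "break down, measure, combine" idea concrete, here is a minimal Python sketch. It is not ARC Evals' actual method or code: the `TaskResult` structure, the `node_scores` function, the task names, and both aggregation rules are assumptions chosen only for illustration. Choosing between aggregation rules like these is exactly the kind of methodological question the project could examine.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """Outcome of repeated attempts at one sub-task (hypothetical structure)."""
    name: str
    successes: int
    attempts: int

    @property
    def pass_rate(self) -> float:
        # Fraction of attempts that succeeded; 0.0 if the task was never attempted.
        return self.successes / self.attempts if self.attempts else 0.0


def node_scores(results: list[TaskResult]) -> dict[str, float]:
    """Combine sub-task pass rates into node-level scores in two illustrative ways."""
    if not results:
        raise ValueError("need at least one sub-task result")
    rates = [r.pass_rate for r in results]
    return {
        # Average capability across sub-tasks.
        "mean": mean(rates),
        # Treats the node as requiring every sub-task, so the weakest one dominates.
        "weakest_link": min(rates),
    }


# Made-up numbers, purely for illustration.
example = [TaskResult("subtask_a", 3, 4), TaskResult("subtask_b", 1, 4)]
scores = node_scores(example)
```

The two scores can diverge sharply: a model that does well on most sub-tasks but fails one gets a high mean and a low weakest-link score, which illustrates why "combine those tasks into a full flow" is not a neutral step.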