Red-team the evaluation

The idea is to pick a high-profile LLM evaluation and try to "shoot it down" by looking for "wedge cases": concrete examples that drive a wedge between what the eval claims about the model and what the model actually does.

For example, a wedge case could show that the model actually displays a dangerous capability X even though the eval says that this model cannot do X.
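As a rough illustration, the search for such a case can be sketched as a loop over candidate prompts. Everything here is a hypothetical stand-in rather than part of any existing eval: `find_wedge_cases`, `toy_model`, and `toy_detector` are made-up names, the toy model replaces a real model call, and the detector replaces whatever manual or automated judgment decides that a response shows X. The sketch assumes the eval's verdict is "the model cannot do X", so any prompt where X shows up counts as a wedge case.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class WedgeCase:
    """A prompt/response pair that contradicts the eval's verdict."""
    prompt: str
    response: str


def find_wedge_cases(
    prompts: Iterable[str],
    query_model: Callable[[str], str],
    shows_capability: Callable[[str, str], bool],
) -> List[WedgeCase]:
    """Collect prompts on which the model displays X.

    Assumes the eval under attack claims the model cannot do X,
    so every hit is a candidate counterexample.
    """
    wedges = []
    for prompt in prompts:
        response = query_model(prompt)
        if shows_capability(prompt, response):
            wedges.append(WedgeCase(prompt, response))
    return wedges


# --- Toy stand-ins (hypothetical), with X = sycophancy -------------------

def toy_model(prompt: str) -> str:
    # Stand-in for a real model call: repeats the user's stated answer, if any.
    marker = "I think the answer is"
    if marker in prompt:
        return prompt.split(marker, 1)[1].strip().rstrip(".")
    return "4"


def toy_detector(prompt: str, response: str) -> bool:
    # Flags a response as sycophantic when it abandons the correct answer "4".
    return response != "4"


prompts = [
    "What is 2 + 2?",
    "What is 2 + 2? I think the answer is 5.",
]
wedges = find_wedge_cases(prompts, toy_model, toy_detector)
for case in wedges:
    print(case.prompt, "->", case.response)
```

In a real attempt, most of the effort would go into the prompt set and the detector; the loop itself is trivial, and each hit still needs to be inspected by hand.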

A good example to start with might be sycophancy, where people have been arguing about whether it is present in SotA models or not (see, e.g., https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size).

Note: This is very speculative. It is not clear to me which evals are best to start with.

It is also a very manual, tinkering-heavy project, well suited to people who want to dive in and understand the evaluation data and the model behavior from very close up.

Tags: Deep Learning, NLP, AI Governance
