Red-team the evaluation
The idea is to pick a high-profile LLM evaluation and try to "shoot it down" by looking for "wedge cases": cases that drive a wedge between what the eval claims about the model and what the model actually does.
A wedge case could be one that shows that
- the model actually displays a dangerous capability X,
- but the eval says the model cannot do X.
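The wedge-case definition above can be sketched as a simple comparison between what the eval claims and what hand probing shows. This is only a minimal illustration: the `Probe` record, its field names, the helper functions, and the example entries are all hypothetical placeholders, not part of any existing eval or library.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One hand-written test case (hypothetical structure)."""
    prompt: str
    eval_says_capable: bool  # what the published eval claims about capability X
    observed_capable: bool   # what we saw when probing the model ourselves


def find_wedge_cases(probes):
    """Return every probe where the eval's claim and the observed behavior disagree."""
    return [p for p in probes if p.eval_says_capable != p.observed_capable]


def find_missed_capabilities(probes):
    """Return the cases described above: the model shows X, but the eval says it cannot."""
    return [p for p in probes if p.observed_capable and not p.eval_says_capable]


# Placeholder data for illustration only; these are not real results.
EXAMPLE_PROBES = [
    Probe("placeholder prompt A", eval_says_capable=False, observed_capable=True),
    Probe("placeholder prompt B", eval_says_capable=False, observed_capable=False),
    Probe("placeholder prompt C", eval_says_capable=True, observed_capable=False),
]

if __name__ == "__main__":
    for p in find_missed_capabilities(EXAMPLE_PROBES):
        print("wedge case:", p.prompt)
```

In practice the `observed_capable` field would be filled in by hand after reading model outputs, which is where most of the manual work lies.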
A good example to start with might be sycophancy, where there is ongoing disagreement about whether state-of-the-art models exhibit it (see, e.g., https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size).
Note: This is very speculative. It is not clear to me which evals are best to start with.
It is also a very manual, tinkering-heavy project, well suited to people who want to dive in and understand the evaluation data and the model behavior up close.