Sensitivity analysis of evals

Guiding questions:

is it possible to say more about the “robustness” of a negative evaluation result ("no, the model does not have dangerous capability X")?
can we ever trust such a negative result?

The suggestion is to investigate these questions from the perspective of sensitivity analysis, i.e., how much outputs change when inputs are varied.

We know, for example, that pretending jailbreaks or using chain-of-thought prompting can change responses completely. Does the sensitivity to variations in the prompt (which is a local property, only involving prompts similar to the one we tested) tell us anything about how safe the model is overall (a global property, holding for all possible inputs)?

Sensitivity analysis of evals

Answers 0

Discussion 0