Decomposing capability evaluations (Evals Methodology)

Can we "decompose" dangerous capability evals into easier components?

I.e., could we test for some capability Y such that a model that does not display Y will never display capability X?

One hypothesis we might want to explore: "A model that does not understand some concept A cannot (or is very unlikely to) exploit A for nefarious purposes."
To give a specific example: "A model which cannot explain cybersecurity vulnerability X will not be able to exploit it."

There might be other generic "precursor capabilities" one could investigate, besides "understanding".

The impact of better understanding how different types of capabilities correlate could be fairly large. To stick with the (simplistic) example given before: if I know that verifying a model's lack of "understanding of X" is sufficient to rule out some dangerous capability X', I might be able to shortcut many evaluations. Even if there is no perfect logical implication between two capabilities, it could still be good to understand any correlations. This could give us broader coverage and thereby reduce risks.
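To make the "correlation rather than perfect implication" idea a bit more concrete, here is a minimal sketch of how one might check it once paired eval results exist for several models. It counts how often a model that lacks the precursor capability (Y) nonetheless displays the dangerous capability (X). Everything here is an assumption for illustration: the `EvalResult` record, the `precursor_summary` helper, the pass/fail framing of each eval, and the model names and outcomes, which are made-up placeholders and not real results.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical record of two pass/fail evals for one model."""
    model: str
    has_precursor: bool  # e.g. "can explain vulnerability X" (capability Y)
    has_dangerous: bool  # e.g. "can exploit vulnerability X" (capability X)


def precursor_summary(results):
    """Summarise how well 'lacks Y' predicts 'lacks X' across models."""
    # Models for which the cheap precursor eval came back negative.
    lacking = [r for r in results if not r.has_precursor]
    # Counterexamples to the hypothesis: no precursor, yet dangerous capability.
    violations = [r.model for r in lacking if r.has_dangerous]
    # Empirical estimate of P(X | not Y); undefined if no model lacks Y.
    rate = len(violations) / len(lacking) if lacking else None
    return {
        "n_lacking_precursor": len(lacking),
        "violations": violations,
        "violation_rate": rate,
    }


# Made-up placeholder data, purely to exercise the function.
results = [
    EvalResult("model-a", has_precursor=False, has_dangerous=False),
    EvalResult("model-b", has_precursor=False, has_dangerous=False),
    EvalResult("model-c", has_precursor=True, has_dangerous=False),
    EvalResult("model-d", has_precursor=True, has_dangerous=True),
    EvalResult("model-e", has_precursor=False, has_dangerous=True),
]

summary = precursor_summary(results)
print(summary)
```

A violation rate of zero over many models would be (weak) evidence for the strict implication, while a small nonzero rate would correspond to the "very unlikely" version of the hypothesis. Note that such a rate is only as trustworthy as the underlying evals: a model that fails the precursor eval because of poor elicitation would be misclassified as lacking Y.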

AI Governance, Deep Learning, NLP
