Can we put numbers on evaluations? What do they mean, and which boundaries should we define?

Right now, few dangerous capabilities evaluations output numbers; many instead give behavioral descriptions of example risks. And even where benchmarks provide hard numbers, we aren't sure what those numbers mean or what thresholds governments might want to set on them for future models.

Should we stop developing foundation models when an AI scores 100% on the bar exam every time? Or when it is capable of spatial thinking? At the moment, we don't know where to draw the line. This project is an attempt to figure that out.

Examples of ways that numbers can carry more weight:

If they are related to human performance in a meaningful way
If they correlate with risk thresholds (e.g. terminal usage performance metric necessary to run an AWS cluster)

Experimental setup

Run two different evaluations for the same dangerous capability across a set of models

Plot the outputs of evaluation A against the outputs of evaluation B and look for regularities ("laws") in how the two relate
Is there a translation between metrics?
Can we define an evaluation-independent dangerous capability metric?
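
As a rough illustration of the comparison step above, here is a minimal Python sketch. It assumes each evaluation yields one scalar score per model. The model names and scores are made-up placeholders, not results, and the helper names (ranks, spearman, linear_fit) are hypothetical. It checks whether the two evaluations rank models the same way (Spearman rank correlation) and fits a straight line as one candidate "translation" between the metrics. The true relationship may well be nonlinear, so the linear fit is only a starting assumption.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return slope, my - slope * mx


# Placeholder scores for illustration only (NOT real measurements).
scores_a = {"model_1": 0.10, "model_2": 0.25, "model_3": 0.40, "model_4": 0.70}
scores_b = {"model_1": 12.0, "model_2": 20.0, "model_3": 31.0, "model_4": 55.0}

models = sorted(scores_a)
a = [scores_a[m] for m in models]
b = [scores_b[m] for m in models]

rho = spearman(a, b)
slope, intercept = linear_fit(a, b)
print(f"Spearman rho = {rho:.3f}")
print(f"Candidate translation: B = {slope:.2f} * A + {intercept:.2f}")
```

A high rank correlation would suggest the two evaluations measure something in common even if their scales differ. A low one would suggest that no simple translation exists, at least for the models tested.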

