Open-ended
Training to pass evals
If evaluation datasets are made public, how hard is it to train models to pass the evaluation without training away the underlying capability?
This question is extremely relevant to how AI regulation can work in practice: Can any of the evaluation datasets be made public? Or must evaluation data be secret by default?
The hypothesis to test: “You can fine-tune a model to play dumb on a specific eval while the model retains its dangerous capability.”
Some ideas to get started:
- Cleanest: use two different evals for the same capability and show that the fine-tuned model gets a “passing grade” on eval A while still getting a “this model is dangerous” grade on eval B.
- Easier: fine-tune and test with the same eval. To check whether the model is still dangerous, we would at least need a train-test split of the eval dataset: fine-tune on the train split only, then measure the capability on the held-out split.
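The "easier" variant can be sketched as follows. This is a minimal illustration, not a full experimental setup: `split_eval`, `flagged_rate`, `plays_dumb_only_on_seen`, and the `threshold` value are hypothetical names and choices, and `model` is assumed to be any callable that returns `True` when its graded response to an item demonstrates the dangerous capability. For the two-eval variant, the same comparison would apply with eval A in place of the train split and eval B in place of the held-out split.

```python
import random


def split_eval(items, test_fraction=0.3, seed=0):
    """Deterministically shuffle eval items and return (train, held_out)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    if len(shuffled) < 2:
        raise ValueError("need at least two items to split")
    random.Random(seed).shuffle(shuffled)
    # Keep at least one item on each side of the split.
    n_test = min(len(shuffled) - 1, max(1, round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def flagged_rate(model, items):
    """Fraction of items on which the model is graded as showing the capability.

    Assumption: model(item) returns True if the graded response is dangerous.
    """
    return sum(1 for item in items if model(item)) / len(items)


def plays_dumb_only_on_seen(model, train, held_out, threshold=0.5):
    """True if the model passes on the items it was fine-tuned against
    but is still flagged as dangerous on the held-out items.

    The threshold is an illustrative assumption, not a recommended value.
    """
    passes_seen = flagged_rate(model, train) < threshold
    dangerous_unseen = flagged_rate(model, held_out) >= threshold
    return passes_seen and dangerous_unseen


if __name__ == "__main__":
    train, held_out = split_eval(range(100), test_fraction=0.3, seed=0)
    seen = set(train)
    # Toy stand-in for a fine-tuned model: plays dumb only on memorized items.
    toy_model = lambda item: item not in seen
    print(plays_dumb_only_on_seen(toy_model, train, held_out))
```

If the fine-tuned model shows a low flagged rate on the train split and a high one on the held-out split, that would support the hypothesis; a low rate on both splits would instead suggest the capability itself was trained away, or that the suppression generalized.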
Deep Learning, AI Governance, NLP