Classify training data into risk levels and publish segmented pre-training datasets
by Esben Kran
Given a dataset such as The Pile, identify a useful ontology for classifying the dataset into various risk levels so parts of the dataset can be excluded.
To define the ontology, you can draw on existing frameworks such as:
- OpenAI Preparedness Framework
- Anthropic's Responsible Scaling Policy
- MLCommons Safety Guidelines
- The Weapons of Mass Destruction Proxy (WMDP), which defines sub-critical and critical questions that models should not be able to answer successfully, as a way to avoid high-risk capabilities in cybersecurity, chemistry, and biology
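As a rough illustration of the classification step, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the three-tier `RiskLevel` ontology, the `KEYWORDS` table, and the `classify` and `segment` helpers are hypothetical, and keyword matching is only a stand-in for whatever classifier (a trained model, an LLM-based grader, human review) the project would actually use.

```python
from collections import defaultdict
from enum import IntEnum


class RiskLevel(IntEnum):
    """Hypothetical three-tier ontology; real tiers would be derived
    from the frameworks listed above."""
    LOW = 0
    SUB_CRITICAL = 1
    CRITICAL = 2


# Placeholder keyword lists, purely illustrative (assumption).
# A real classifier would almost certainly not be keyword-based.
KEYWORDS = {
    RiskLevel.CRITICAL: ("pathogen enhancement",),
    RiskLevel.SUB_CRITICAL: ("exploit", "toxin"),
}


def classify(text: str) -> RiskLevel:
    """Return the highest risk level whose keywords appear in the text."""
    lowered = text.lower()
    # Check the most severe level first so it wins when several match.
    for level in sorted(KEYWORDS, reverse=True):
        if any(keyword in lowered for keyword in KEYWORDS[level]):
            return level
    return RiskLevel.LOW


def segment(documents):
    """Group an iterable of document strings by assigned risk level."""
    segments = defaultdict(list)
    for doc in documents:
        segments[classify(doc)].append(doc)
    return dict(segments)
```

Using an ordered enum means a dataset can later be filtered with a single threshold, for example keeping only documents strictly below `RiskLevel.CRITICAL`.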
Once the segments of the dataset have been classified, publish the segmented datasets in a public repository so they are easy to work with. Then publish a paper on the "alignment tax": the capability cost, measured on benchmarks such as MMLU, of removing parts of the dataset.
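One simple way to quantify the alignment tax is as the drop in benchmark accuracy between a model trained on the full dataset and a model trained on a filtered version. The sketch below assumes that definition; the function names and the idea of keying results by exclusion threshold are hypothetical, and the accuracies themselves would have to come from real training and evaluation runs.

```python
def alignment_tax(baseline_accuracy: float, filtered_accuracy: float) -> float:
    """Capability lost by filtering, as baseline minus filtered accuracy.

    Both inputs are assumed to be accuracies on the same benchmark
    (e.g. MMLU), expressed as fractions between 0 and 1.
    """
    return baseline_accuracy - filtered_accuracy


def tax_by_threshold(baseline_accuracy: float, accuracy_by_threshold: dict) -> dict:
    """Compute the tax for each exclusion threshold.

    `accuracy_by_threshold` maps a label for the excluded risk level
    (a hypothetical naming scheme) to the benchmark accuracy of the
    model trained without that part of the dataset.
    """
    return {
        threshold: alignment_tax(baseline_accuracy, accuracy)
        for threshold, accuracy in accuracy_by_threshold.items()
    }
```

A positive value means the filtered model scores lower than the baseline; a value near zero would suggest that the excluded segment contributes little to the measured capability.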