Classify training data into risk levels and publish segmented pre-training datasets

Given a dataset such as The Pile, identify a useful ontology for classifying its contents into risk levels, so that the high-risk parts can be excluded from training.

To define the ontology, you can draw on:

OpenAI's Preparedness Framework
Anthropic's Responsible Scaling Policy
MLCommons Safety Guidelines
The Weapons of Mass Destruction Proxy (WMDP), which defines sub-critical and critical questions that models should not be able to answer successfully, in order to avoid high-risk capabilities in cyber, chemistry, and biology
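
As a rough illustration of what such an ontology and the resulting segmentation could look like in code, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the three level names (LOW, SUBCRITICAL, CRITICAL), the RISK_TERMS keyword table, and the classify and segment helpers are all hypothetical. A real project would most likely replace the keyword match with a trained classifier or an LLM-based grader and would take its categories from the frameworks listed above.

```python
from collections import defaultdict
from enum import IntEnum


class RiskLevel(IntEnum):
    """Hypothetical three-tier ontology; higher value means higher risk."""
    LOW = 0
    SUBCRITICAL = 1
    CRITICAL = 2


# Placeholder keyword lists, assumed purely for illustration.
# A real ontology would be derived from the frameworks named in the text.
RISK_TERMS = {
    RiskLevel.CRITICAL: ("exploit payload", "synthesis route"),
    RiskLevel.SUBCRITICAL: ("malware", "pathogen"),
}


def classify(text: str) -> RiskLevel:
    """Return the highest risk level whose terms appear in the text."""
    lowered = text.lower()
    # Check the most severe level first so it wins over lower levels.
    for level in sorted(RISK_TERMS, reverse=True):
        if any(term in lowered for term in RISK_TERMS[level]):
            return level
    return RiskLevel.LOW


def segment(documents):
    """Group documents by risk level, preserving their input order."""
    segments = defaultdict(list)
    for doc in documents:
        segments[classify(doc)].append(doc)
    return dict(segments)
```

Calling segment on a list of document strings returns a dictionary keyed by RiskLevel, so a training run can drop, for example, everything at or above SUBCRITICAL by filtering on the keys.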

Once the segments of the dataset have been classified, publish the segmented datasets in a public repository so they are easy to work with. Publish a paper on the "alignment tax": the cost in capability, measured on benchmarks such as MMLU, of removing parts of the dataset.
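
One simple way to quantify the alignment tax, assuming it is taken to be the drop in benchmark score between a model trained on the full dataset and a model trained on a filtered one, is sketched below. The function names and the idea of keying results by exclusion setting are hypothetical. The scores themselves would have to come from real training and evaluation runs.

```python
def alignment_tax(baseline_score: float, filtered_score: float) -> float:
    """Capability lost by filtering: baseline score minus filtered score.

    A positive value means the filtered model scores lower than the baseline.
    """
    return baseline_score - filtered_score


def tax_by_setting(baseline_score: float, filtered_scores: dict) -> dict:
    """Compute the tax for each exclusion setting.

    filtered_scores maps a label for the exclusion setting (hypothetical,
    e.g. "drop_critical") to the benchmark score of the model trained
    on that filtered dataset.
    """
    return {
        setting: alignment_tax(baseline_score, score)
        for setting, score in filtered_scores.items()
    }
```

Reporting the tax per exclusion setting, rather than as a single number, makes it possible to see how much capability each additional excluded segment costs.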
