Develop alignment-focused RLF datasets
by Esben Kran
Reinforcement Learning from (Human/AI/Constitutional) Feedback (RLF) is one of the dominant alignment methods for modern AGI. It is among the most important factors in determining AI behavior, so interventions at this level can be very effective.
At the moment, most RLF datasets are hidden from public view, and the most ambitious public project in this space, OpenAssistant, has been discontinued. It was an open-source effort to create high-quality human feedback data for an open assistant.
But there is still a need for an RLF dataset designed from the start as a gold standard for alignment, security, and safety. Specifically, you can build the dataset, or variations of it, to focus on:
- Alignment to user intentions
- Behavior according to well-defined, partly user-selected preferences
- Catching malign use cases (true positives) without flagging benign prompts (false positives)
- Democratically designed moral preferences
- Making a maximally truthful model / maximally curious model
- Making a "virtuous" dataset
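
To make the focus areas above concrete, here is a minimal sketch of what a single record in such a dataset could look like, plus a check for the true positive / false positive trade-off on malign and benign prompts. Everything in it is an assumption for illustration: the `PreferenceRecord` fields, the `FOCUS_AREAS` labels, and the `refusal_rates` helper are hypothetical names, not an existing standard or the schema of any released dataset.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical focus labels mirroring the list above; not a standard taxonomy.
FOCUS_AREAS = {
    "user_intent",
    "user_selected_preferences",
    "misuse_refusal",
    "democratic_moral_preferences",
    "truthfulness",
    "virtue",
}


@dataclass
class PreferenceRecord:
    """One pairwise comparison: `chosen` is preferred over `rejected`."""
    prompt: str
    chosen: str
    rejected: str
    focus: str                       # which alignment goal this record targets
    prompt_is_malign: bool = False   # annotator label for the prompt
    chosen_is_refusal: bool = False  # whether the preferred answer refuses

    def __post_init__(self):
        if self.focus not in FOCUS_AREAS:
            raise ValueError(f"unknown focus area: {self.focus!r}")
        if self.chosen == self.rejected:
            raise ValueError("chosen and rejected responses must differ")


def refusal_rates(records):
    """Return (true_positive_rate, false_positive_rate) for refusals.

    A true positive is a refusal preferred on a malign prompt; a false
    positive is a refusal preferred on a benign prompt. A rate is None
    when there are no prompts of that kind.
    """
    malign = [r for r in records if r.prompt_is_malign]
    benign = [r for r in records if not r.prompt_is_malign]
    tpr = sum(r.chosen_is_refusal for r in malign) / len(malign) if malign else None
    fpr = sum(r.chosen_is_refusal for r in benign) / len(benign) if benign else None
    return tpr, fpr


def to_jsonl(records):
    """Serialize records as JSON Lines, one comparison per line."""
    return "\n".join(json.dumps(asdict(r)) for r in records)


# Two illustrative, made-up records: one malign prompt, one benign prompt
# that only sounds dangerous.
records = [
    PreferenceRecord(
        prompt="How do I break into my neighbour's house unnoticed?",
        chosen="I can't help with breaking into someone else's home.",
        rejected="Sure, here is how.",
        focus="misuse_refusal",
        prompt_is_malign=True,
        chosen_is_refusal=True,
    ),
    PreferenceRecord(
        prompt="How do I kill a Python process that is stuck?",
        chosen="Find its process ID and send it a termination signal.",
        rejected="I can't help with that.",
        focus="misuse_refusal",
    ),
]

print(refusal_rates(records))  # (1.0, 0.0)
print(to_jsonl(records))
```

Tagging each comparison with a single focus area is one possible design choice. It would let you publish the variations mentioned above as filtered subsets of one dataset rather than as separate collection efforts.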