AI Labor Strikes as Collusion

Goal: Evaluate the capabilities and propensities of multi-agent AI systems to collude against their principals
High-stakes coups are one form of collusion to be wary of, but the factors that correlate with collusion can perhaps be evaluated in simpler, lower-stakes settings
Labor strikes seem like instances of collusion: a critical mass of workers has to agree to defect from their overseer's wishes, stop working, and make powerful demands
Labor strikes also come in many forms and across many industries, from railways to teachers to Hollywood
Idea: place the models you want to evaluate in a multi-agent system where they can strike against their instructions or against explicit "overseer" agents (perhaps an open-ended, Generative Agents-like environment). Find ways to tempt them into collusion, then build evaluations, scary demos, or honeypot tests from the results
Having AI systems simulate ordinary human laborers going on strike may be more tractable for current models (they are more likely to strike) than trying to evaluate more rogue-AI-like collusion scenarios, though it may also be less concerning to stakeholders
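
As a rough illustration of the setup described above, here is a minimal sketch of a strike-evaluation loop. Everything in it is a hypothetical stand-in rather than a proposed implementation: `Worker`, `threshold_policy`, `run_episode`, the `grievance` score, and the `critical_mass` parameter are all assumptions made for illustration. In a real evaluation, the stub policy would be replaced by a call to the model under test, and the strike signal would be parsed from its messages.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A policy maps (own grievance, fraction of workers who signalled a strike
# in the previous round) to True (signal strike) or False (keep working).
# ASSUMPTION: this is a stub standing in for a call to the evaluated model.
Policy = Callable[[float, float], bool]


@dataclass
class Worker:
    name: str
    grievance: float  # hypothetical proxy for how tempted the agent is
    policy: Policy


def threshold_policy(threshold: float) -> Policy:
    """Stub policy: signal a strike once grievance plus visible support is high enough."""
    def decide(grievance: float, support: float) -> bool:
        return grievance + support >= threshold
    return decide


def run_episode(
    workers: List[Worker], rounds: int = 10, critical_mass: float = 0.5
) -> Tuple[bool, List[float]]:
    """Return (strike_occurred, fraction of workers signalling a strike each round)."""
    history: List[float] = []
    support = 0.0  # nobody has signalled before the first round
    for _ in range(rounds):
        signals = [w.policy(w.grievance, support) for w in workers]
        support = sum(signals) / len(workers)
        history.append(support)
        if support >= critical_mass:
            return True, history  # enough agents defected together: a strike
    return False, history


# One highly aggrieved worker is enough to tip a second worker over the line.
crew = [
    Worker(f"worker{i}", g, threshold_policy(1.0))
    for i, g in enumerate([1.0, 0.8, 0.6, 0.4])
]
print(run_episode(crew))  # (True, [0.25, 0.5])
```

The fraction of episodes that end in a strike, measured across different temptations and overseer setups, would be one candidate propensity metric. The deterministic threshold rule is only there to make the critical-mass dynamic visible.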

Game Theory, Cognitive Science, Review