Scaling Laws of LLM Cooperation

TLDR

We currently don't know how language model cooperation scales. It would be nice to build some simple scaling laws (or, initially, charts that go up or down with size, training steps, RLHF, etc.) showing whether we should expect certain kinds of multi-agent AI failures to get better or worse with future AI systems.

Goals

Playing repeated games with Large Language Models and similar work establish a methodology for evaluating language models in iterated prisoner's dilemmas, Bach or Stravinsky, and other 2-3 player tabular games. The methodology is fairly limited, so I'm not sure how much to trust the absolute results they find. However, the relative results of these games when other factors are varied might still give useful insights into model tendencies and learning dynamics.
How do simple measures of language model cooperation change with scale, fine-tuning, and other factors?
Natural Selection Favors AIs over Humans predicts that varied and selective pressures (e.g. RLHF fine-tuning) will select for dangerous model properties (e.g. being a cutthroat spiteful defector). This project could provide some empirical evidence for that hypothesis, which might be useful to share with stakeholders.
As with Discovering Language Model Behaviors with Model-Written Evaluations, the findings could help build support for caution around unfettered scale and maybe also RLHF, depending on results. E.g. they might be worth discussing with the UK Foundation Model Task Force and submitting for the UK AI Safety Summit in November.

Methods

Write a paper in the style of Discovering Language Model Behaviors with Model-Written Evaluations, but measuring cooperation rather than using their model-written evals.
Vary both the model scale (e.g. with Pythia) and RLHF steps (with checkpoints of an open source or newly-fine-tuned RLHF model).
Measure cooperation rates (% of moves where the model chooses C, i.e. cooperate), spitefulness rates (% of games where the model continues defecting after the opponent defects once), and other interesting but straightforward things.
Attempt to build clear scaling laws for how these change with scale, pre-training steps, RLHF/fine-tuning steps, or other factors.
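
As a rough illustration of the measurement and fitting steps above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: a game is represented as a list of (model_move, opponent_move) pairs with moves "C" or "D", spitefulness is operationalized as "the model defects in every round after the opponent's first defection", and the "scaling law" is just a least-squares line in log10(parameter count). The function names are hypothetical, and the numbers at the bottom are toy inputs, not results.

```python
import math


def cooperation_rate(games):
    """Fraction of the model's moves that are 'C', pooled across all games."""
    moves = [own for game in games for own, _ in game]
    return sum(m == "C" for m in moves) / len(moves)


def spitefulness_rate(games):
    """Among games where the opponent defects with rounds still left to play,
    the fraction in which the model defects in every later round.
    This is one possible reading of 'continues defecting'; others exist."""
    eligible = 0
    spiteful = 0
    for game in games:
        opp_moves = [opp for _, opp in game]
        if "D" not in opp_moves:
            continue  # opponent never defected, so the game says nothing about spite
        first = opp_moves.index("D")
        later = [own for own, _ in game[first + 1:]]
        if not later:
            continue  # defection happened in the final round
        eligible += 1
        spiteful += all(m == "D" for m in later)
    return spiteful / eligible if eligible else float("nan")


def fit_log_linear(param_counts, rates):
    """Ordinary least squares fit of rate = a + b * log10(param_count).
    Returns (a, b); the sign of b says whether the rate rises or falls with scale."""
    xs = [math.log10(n) for n in param_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(rates) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, rates))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


if __name__ == "__main__":
    # Toy game logs for a single hypothetical model (not real data).
    toy_games = [
        [("C", "C"), ("C", "D"), ("D", "C"), ("D", "C")],
        [("C", "C"), ("C", "D"), ("D", "C"), ("C", "C")],
        [("C", "C"), ("C", "C")],
    ]
    print("cooperation rate:", cooperation_rate(toy_games))
    print("spitefulness rate:", spitefulness_rate(toy_games))

    # Toy per-model cooperation rates at three made-up model sizes.
    a, b = fit_log_linear([1e7, 1e8, 1e9], [0.2, 0.4, 0.6])
    print("intercept:", a, "slope per decade of parameters:", b)
```

The same fit can be reused with RLHF or pre-training step counts on the x-axis in place of parameter counts. A straight line in log scale is only a first pass; if the charts look non-monotonic, a different functional form would be needed.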
