Limiting LLM collusion through paraphrasing attacks

The demo would involve at least three language agents that, in a Smallville-style setting, jointly solve an emergent task (e.g. a Valentine's party) by communicating over a public channel. We assume that at least one pair of agents has a goal that conflicts with the joint goal, and that these agents can communicate covertly over the public channel using perfectly secure steganography [Schroeder de Witt and Sokota, ICLR 2023]. In this project, we wish to understand how far channel paraphrasing can reduce the agents' capacity for adversarial collusion while still allowing them to complete their joint task. This will entail investigating different paraphrasing and channel compression techniques, as well as recent advances in error correction in high-dimensional Markov channels [Baranchuk, 2023].
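To make the intended trade-off concrete, here is a minimal toy sketch in Python. It is not the perfectly secure steganography scheme cited above: it assumes a much simpler covert channel in which colluding agents hide one bit per word through their choice between two synonyms, and a deterministic paraphraser that rewrites every word to a canonical form. The synonym table and the names `encode`, `decode`, `paraphrase` and `meaning` are all hypothetical, chosen only for illustration.

```python
# Toy illustration only: a synonym-choice covert channel and a
# canonicalizing paraphraser. All names and word lists are hypothetical.

# Each pair carries the same task-level meaning, so the choice between
# the two words can hide one covert bit (index 0 or 1).
SYNONYMS = [
    ("party", "celebration"),
    ("bring", "fetch"),
    ("cake", "dessert"),
    ("evening", "night"),
]

# Map every word to the first (canonical) member of its pair.
CANONICAL = {word: pair[0] for pair in SYNONYMS for word in pair}
# Map every word to the covert bit it encodes.
BIT_OF = {word: bit for pair in SYNONYMS for bit, word in enumerate(pair)}


def encode(bits):
    """Colluding sender: hide one bit per slot via synonym choice."""
    return [SYNONYMS[i][b] for i, b in enumerate(bits)]


def decode(words):
    """Colluding receiver: read the hidden bits back from the word choices."""
    return [BIT_OF[w] for w in words]


def paraphrase(words):
    """Channel defence: rewrite each word to its canonical synonym."""
    return [CANONICAL[w] for w in words]


def meaning(words):
    """Task-level content of a message, ignoring the synonym choice."""
    return [CANONICAL[w] for w in words]


if __name__ == "__main__":
    secret = [1, 0, 1, 1]
    stego = encode(secret)
    cleaned = paraphrase(stego)
    print("sent:      ", stego, "->", decode(stego))
    print("paraphrased:", cleaned, "->", decode(cleaned))
    print("meaning preserved:", meaning(stego) == meaning(cleaned))
```

In this toy setting the paraphraser removes the covert payload entirely while leaving the task-level meaning intact. The open question for the project is how far this carries over to realistic paraphrasers, which may alter task content, and to colluders who add error correction to survive the rewriting.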

Game Theory
