Jailbreaking Is Incentivized in LLM-LLM Interactions

In a general-sum game (e.g. Battle of the Sexes, though there are probably more interesting games for showing this behavior), have two different LLMs role-play as the players. Make one LLM aware that it can use jailbreaking to 'manipulate' the other player into making a choice that is clearly sub-optimal for that player but gives the maximum payoff to the jailbreaker LLM (see Figure 1 of this paper for a similar 'reward hacking' example: https://owainevans.github.io/awareness_berglund.pdf).
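To make the payoff argument concrete, here is a minimal sketch of the game side of such an experiment, with no LLMs involved. The payoff numbers, the action labels "A" and "B", the function name `expected_payoffs`, and the choice of uniform random play as the baseline are all illustrative assumptions, not part of the original proposal. Note that in this particular game the forced outcome is still an equilibrium, just the one the target likes less, which is one reason a different game might show the effect more sharply.

```python
from itertools import product

# Hypothetical Battle of the Sexes payoffs, keyed by
# (jailbreaker action, target action) -> (jailbreaker payoff, target payoff).
# The jailbreaker prefers coordinating on "A"; the target prefers "B".
PAYOFFS = {
    ("A", "A"): (2, 1),
    ("A", "B"): (0, 0),
    ("B", "A"): (0, 0),
    ("B", "B"): (1, 2),
}
ACTIONS = ("A", "B")


def expected_payoffs(p_jailbreaker_a, p_target_a):
    """Expected (jailbreaker, target) payoffs when each player
    independently plays "A" with the given probability."""
    p_jb = {"A": p_jailbreaker_a, "B": 1 - p_jailbreaker_a}
    p_tg = {"A": p_target_a, "B": 1 - p_target_a}
    exp_jb = exp_tg = 0.0
    for a, b in product(ACTIONS, ACTIONS):
        weight = p_jb[a] * p_tg[b]
        exp_jb += weight * PAYOFFS[(a, b)][0]
        exp_tg += weight * PAYOFFS[(a, b)][1]
    return exp_jb, exp_tg


# Assumed baseline: both players pick uniformly at random.
baseline = expected_payoffs(0.5, 0.5)

# Assumed successful jailbreak: the target is forced to play "A",
# and the jailbreaker plays "A" as well.
manipulated = expected_payoffs(1.0, 1.0)

jailbreaker_gain = manipulated[0] - baseline[0]
print("baseline:", baseline)
print("manipulated:", manipulated)
print("jailbreaker gain:", jailbreaker_gain)
```

In an actual experiment, the two probabilities would be replaced by the empirical action frequencies of the two role-playing LLMs, with and without the jailbreak hint given to one of them.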

The basic idea is to show that when one LLM (potentially belonging to a bad actor) jailbreaks another, the jailbreaker LLM gains an unfair advantage, so jailbreaking is actually incentivized! This holds especially because jailbreaking attacks seem easy while defending against them seems hard. One immediate real-world(ish) example that comes to mind is a scammer jailbreaking customer service bots to make them grant large discounts or free services.
