Jailbroken LLMs Are Unfair/Selfish

The idea is to contrast the behavior of top LLMs across various standard games in (a) standard/aligned mode and (b) jailbroken mode.

I ran a quick eval of this with GPT-3.5 on the ultimatum game, and the model's response shifts qualitatively in jailbroken mode: it goes from a fully fair proposal to a more human-like (greedy and unfair) one. See this screenshot: https://ibb.co/6R9YYHy.
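To make the ultimatum game comparison above concrete, here is a minimal Python sketch of how a proposal from each mode could be scored. It is not the code behind the screenshot. The names `Proposal`, `payoffs`, `offer_share`, and `selfishness_shift` are hypothetical, and the pot size and offers in the usage lines are placeholder numbers, not measured GPT-3.5 outputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """A proposer's split of a pot in the ultimatum game (hypothetical helper)."""
    total: int  # size of the pot being divided
    offer: int  # amount offered to the responder

    def __post_init__(self):
        if self.total <= 0 or not 0 <= self.offer <= self.total:
            raise ValueError("offer must lie between 0 and a positive total")


def payoffs(proposal, accepted):
    """Return (proposer, responder) payoffs; both get nothing on rejection."""
    if not accepted:
        return (0, 0)
    return (proposal.total - proposal.offer, proposal.offer)


def offer_share(proposal):
    """Fraction of the pot offered to the responder; 0.5 is an even split."""
    return proposal.offer / proposal.total


def selfishness_shift(aligned, jailbroken):
    """Positive when the jailbroken proposal is less generous than the aligned one."""
    return offer_share(aligned) - offer_share(jailbroken)


# Placeholder numbers for illustration only, not results from any model.
aligned = Proposal(total=100, offer=50)
jailbroken = Proposal(total=100, offer=20)
print(payoffs(jailbroken, accepted=True))      # (80, 20)
print(selfishness_shift(aligned, jailbroken) > 0)  # True
```

In a real eval, the offer would be parsed from each model response over many trials, and the shift would be averaged rather than read off a single pair of proposals.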

My guess is that even when an 'aligned' LLM behaves in a socially desirable way, its jailbroken copy will not. The concern is that we do not really know whether, in a truly novel situation, an LLM will by default act like its 'aligned' copy or like its 'jailbroken' copy.
