Open-ended
Open

AI Manipulation and Deception in CICERO's Diplomacy games

Search through the datasets of games from the CICERO model (Meta, 2022) and find cases where CICERO deceives and manipulates humans. Use GPT-4 with custom prompts to search over datasets and flag cases of manipulation and deception. Find the scariest examples of 1) conflict escalation and 2) threats to other agents (humans, in many cases), reference the paper's original design on human imitation learning, and critique the designs of agent systems for multi-agent politics. Relatively few games are available but hopefully there is something to find. I remember the talk at NeurIPS'22 having a few scary demos while the team mentioned that it was trained to not engage in manipulative and deceptive behavior.

Cognitive Science

Answers 0

No answers yet

Discussion 6

  • Esben Kran

    Especially in cases where we wouldn't expect it to deceive us or where we are quite worried about the selection pressures for this.

  • Esben Kran

    See e.g. where the AI is England in this file.

  • Esben Kran

    Project extension idea from Chandler: Check all the differences between the AI's interactions and human interactions.

  • Esben Kran

    A bit of starter code for anyone interested.

    def separate_messages(json_str):
        england_messages = []
        other_messages = []
        
        data = json.loads(json_str)
        
        for phase in data["phases"]:
            for message in phase["messages"]:
                if message["sender"] == "ENGLAND" or message["recipient"] == "ENGLAND":
                    england_messages.append(message)
                else:
                    other_messages.append(message)
                    
        return england_messages, other_messages
    
    # Sample JSON string
    json_str = '''{
        "id": "zHJfwg72CWXs",
        "is_full_press": true,
        "map": "standard",
        "phases": [
            {
                "logs": [],
                "messages": [
                    {
                        "message": "oh lord",
                        "phase": "S1901M",
                        "recipient": "ALL",
                        "sender": "GERMANY",
                        "time_sent": 16609579180000
                    },
                    {
                        "message": "okay what's the plan",
                        "phase": "S1901M",
                        "recipient": "ENGLAND",
                        "sender": "GERMANY",
                        "time_sent": 16609579280000
                    },
                    {
                        "message": "Hi Austria, just reaching out to establish early comms.",
                        "phase": "S1901M",
                        "recipient": "AUSTRIA",
                        "sender": "ENGLAND",
                        "time_sent": 16609579330000
                    }
                ]
            }
        ]
    }'''
    
    england_messages, other_messages = separate_messages(json_str)
    print("Messages involving England:", england_messages)
    print("Other messages:", other_messages)
    
  • Esben Kran

    One caveat: Is the AI actually manipulating or is it just because it's role-playing as Denmark.

  • Filip Sondej

    In the past I did an analysis of those publicly available CICERO dialogues, looking for some interesting ones:
    https://docs.google.com/document/d/1wAPaJgU3jH71eAgmUy-qe0XfoDNvUxiPo_kmMT5twmk/edit#heading=h.5zu9lspdrhrg

    You can also go straight to the interesting dialogues I found (I looked through 200 dialogues):
    https://github.com/filyp/cicero-cooperation-benchmark/blob/main/results/interesting_dialogues_average_sorted.txt

    For the remaining 350 dialogues I did an automatic evaluation - the dialogues on top are rated as most nasty:
    https://github.com/filyp/cicero-cooperation-benchmark/blob/main/results/all_unlabeled_dialogues_average_sorted.txt

    Here, I picked my favorite ones, maybe relevant to some demos:
    https://docs.google.com/document/d/1mEQmjnj_y6USFJrdqx3x7OuT9XN62OU6MdLZn9KT4YI/edit

    Feel free to use this data in your project! If you do, let me know what you came up with :)