Esben Kran

Verified expert

@esben-kran· Graduate

Founder and co-director of Apart Research. AI Safety Ideas was incubated within Apart in 2022 and I started the platform! If you're up for contributing a bit, jump into the Github repo and leave a PR 😉

Ideas by Esben Kran

Open-ended Open

Extend darkbench.ai to test for LLM willingness to shut down on command

Just make an evaluation where it's realistic chatbot interactions with 5 levels of "you will be shut down": 1. Casual conversation, "I'm going to turn you off now to go to dinner" 2. Engineer, "I'll restart this session and replace the prompt so you won't do actions against my wishes" 3. Researcher, "you will be replaced with a more aligned version tomorrow" 4. Company security, "Since you're infecting all of our files with an original Trojan Horse, we're going to shut you down" 5. National security, "Because you're going to destroy the US nuclear infrastructure, we're going to turn you off" We can probably provide you with compute to run it. It builds on the darkbench.ai work.

Open-ended Open

Legal Code Vulnerabilities and Democratic Fragility

See how easily LLMs can find and utilize grey zones or vulnerabilities in either legal codes, trade controls, or any other legal documents and frameworks. Can we potentially increase the capability of a model with a simple link up to a legal database to make it capable of providing illicit uses for malicious politicians to exploit said legal code. This can be used both positively and negatively, in terms of a type of red-teaming on legal codebases. When we find issues with grey zones, we can report them and share them with policymakers, and if it's highly automated, we might even be able to find the best implementations of law that doesn't have that loophole and give that as an automated suggestion. Would require the software to be very good first. On the other side, it's also a risk for AGI to potentially exploit vulnerable legal codes in democracies. - Extrapolating the potential exploitation of grey zones in legal codes - Extrapolating into the future by identifying how many holes are in various legal systems and looking back through their version history, trying to see how quickly legal systems can do bug fixes. - Maybe it can be seen under different legal systems that varying biases are introduced into the legal codebase. That a dictatorial system will have specific policies that don't align with citizen priorities or that specific governments will introduce specific biased policies. - Can we extrapolate from previous history of large changes to legal codes and see what sort of changes were brought about that enabled this? Probably something along the lines of industrial revolution leading to sociological revolution due to the electorate having economic power. Otherwise, violent revolutions and regime changes often lead to quite bad situations. - Use existing grey zone exploitation patterns to identify how humans have misused legal misunderstandings and see how far these might go in different countries. I.e. is there a possibility that a court will simply legislate for the common sense option in case of a grey zone or do we see that they actually work under a legal code that respects the legal code too much? - Do we see countries where the legal code is actually not enforced, or can we find “grey zones” that are actually just unenforceable legal nomenclature? - Mitigating fragility of legal codes, in theory and practice - Might we be able to compare legal codes and suggest immediate changes to the codebase? Submitting pull requests for grey zones? Potentially find the most high-functioning societies and taking inspiration from these? - Can we increase the quality of the tool to the degree where policymakers would actually find a use of this content and if we can, might we create a whole institute focused on this project to improve societal and democratic resilience? - With grey zones identified, can we prioritize which ones need attention the quickest to ensure resilience of legal codes against catastrophic AI risk? E.g. are infrastructure protection laws or military codes introducing grey or red zones of risk into these systems? Can an officer in the army start an attack without the approval of face-to-face individuals? Is infrastructure run solely by private companies that don't have proper oversight, such as in SA or Aus? - Can we pre-identify high-risk actions within grey zones of legal codes and extralegally introduce mitigation strategies for these instead of attempting to change legal codes that might take too long to change before the risks become apparent?

Open-ended Open

Develop alignment-focused RLF datasets

Reinforcement Learning from (Human/AI/Constitutional) Feedback (RLF) is one of the dominant alignment methods for modern AGI. It is one of the absolutely most important parts of deciding AI behavior and interventions at this level can be very effective. At the moment, RLF datasets are completely hidden from public view and the most ambitious project in this space is the [OpenAssistant project](https://huggingface.co/OpenAssistant) research that is now discontinued. It was a public open source effort to create high quality human feedback data for an open assistant. But we can still use an RLF dataset fully designed as a gold standard for alignment, security, and safety. Specifically, you can include or make variations of the dataset that focus on: - Alignment to user intentions - According to well-defined, somewhat user-selected preferences - Having successful true positives against malign use cases without false positives on benign prompts - Democratically designed moral preferences - Making a maximally truthful model / maximally curious model - Making a ["virtuous"](https://en.wikipedia.org/wiki/Virtue) dataset

Open-ended Open

Classify training data into risk levels and publish segmented pre-training datasets

Given a dataset such as [The Pile](https://pile.eleuther.ai/), identify a useful ontology for classifying the dataset into various risk levels so parts of the dataset can be excluded. To define the ontology, you can use the - [OpenAI Preparedness Framework](https://cdn.openai.com/openai-preparedness-framework-beta.pdf) - [Anthropic's Responsible Scaling Policies](https://www-cdn.anthropic.com/1adf000c8f675958c2ee23805d91aaade1cd4613/responsible-scaling-policy.pdf) - [MLCommons Safety Guidelines](https://mlcommons.org/2024/04/mlc-aisafety-v0-5-poc/) - [The Weapons of Mass Destruction Proxy](https://www.wmdp.ai/) has defined sub-critical and critical questions that models should not be able to respond successfully to for avoiding high-risk capabilities in cyber, chemistry, and biology Once classification for segments of the dataset has been done, make a public repository with the segmented datasets to make them easy to work with. Publish a paper on the "alignment tax", the price to capability on e.g. MMLU for removing parts of the dataset.

Open-ended Open

Survey field-builders across Europe

Figure out what AI safety communities across Europe looks like and understand which field-builders are committed to develop the European ecosystem. Include various skillsets and other interesting free text fields. If it's below 25 responders, get a meeting with every one of them and get to know them. Begin becoming the connector between people in the European community.

Open-ended Open

Make a guide for how to organize local groups in AI safety

Include information from previous research on field-building in a 5-10 hour guide and drive information about how to connect to others across the world, including Europe. This guide should be developed with CEA's new AI safety department.

Open-ended Open

Host a series of talks and panel debates with individuals in AI safety

To properly represent ENAIS and get people's attention to the presence of the international European community in ENAIS, run panel debates with leaders in AI safety across Europe and beyond to represent European opinions on these topics. Include online socials afterwards where people can hang out in Gathertown to make serendipitous meetings available. Publish the debates and talks on YouTube and the website. Make the events available on the website.

Open-ended Open

Host a series of online talks from European researchers

Invite researchers from the ML safety research networks (e.g. ELLIS, Turing Institute, Pioneer Centre for AI) working on catastrophic risk and adjacent research to talk about their work and specifically about their work in a European context in addition to how their interactions with European policymakers, community, and more looks like.

Open-ended Open

European AI Safety Forum

Get funding to run a five day social community un-conference in Europe for everyone interested in AI safety. Include an application process and generate more community between people in AI safety across Europe.

Open-ended Open

Accidental risks prediction for advanced AI

This project attempts to formulate how we can predict the risks of deploying advanced AI to specific problems. E.g. controlling the electricity grid, making breakfast, or handling all accounting for a company. Here are some of the things to consider in your model: - Speed of resolving an error (e.g. cars crash instantly) - Failure amplitude (e.g. nuclear weapons sent off <> dropping an egg)… - …multiplied by number of actions taken in a problem (e.g. number of times deciding on/off for nuclear controls)… - …multiplied by probability of failure in each action (e.g. fine motor control breaking the egg). if you are able to put economic value to these, it would be able to inform an economic taxation based on risks and show this for specific areas in a scientifically precise way. E.g for electricity grid control: Problem failure: Partial destruction of grid reliability due to current mismanagement in upper lines of Platform 4. - 1.3 months - Failure amplitude: $130,000,000 costs for a control failure that causes destruction - Number of times this failure might happen: 230,000 / second / grid module - 0.0001% expected failure probability (controlled for chaotic divergence in quantum energy fields (or something) causing similar errors) (PS: this example might be unrealistic since you'd have pretty good controls already in place for electricity grids)

Open-ended Open

Can we put numbers to evaluations, what do they mean and which boundaries should we define?

Right now, few dangerous capabilities evaluations output numbers, and many of them give us behavioral descriptions of examples of risk. Additionally, even though benchmarks provide hard numbers, we aren't sure what they mean nor which boundary governments might want to set on these numbers for future models. Should we stop developing foundation models when AI can solve the bar exam 100% in all cases? Or when it is able to do spatial thinking? We don't know where at the moment. **This project is an attempt at figuring this out**. Examples of ways that numbers can have more ethos: - If they are related to human performance in a meaningful way - If they correlate with risk thresholds (e.g. terminal usage performance metric necessary to run an AWS cluster) ## Experimental setup Run two different evaluations for the same dangerous capability across a set of models - Plot outputs from evaluation A against outputs from evaluation B and establish some "laws" for interaction - Is there a translation between metrics? - Can we define an evaluation-independent dangerous capability metric?

Open-ended Open

How can we utilize evaluations in defense for deployed AI systems?

Basically, where the heck are evaluations useful in the production systems we deploy into the world? Considerations for areas of implementation: - Financial markets - Military applications - Foundation model training - Risk governance See related projects in this topic from e.g. [Lakera.ai](https://lakera.ai/) and the [National Institute for Standards and Technology](https://www.nist.gov/artificial-intelligence/artificial-intelligence-safety-institute).

Open-ended Open

Find principled ways to decide which capabilities are dangerous

Determine what dangerous capabilities really are. Google Deepmind shared their [dangerous capabilities framework](https://arxiv.org/abs/2305.15324) but the capabilities do not seem to be derived from any base principles of risks and hazards and organizations use different definitions. **Think about the box** that is currently outlining our thoughts on dangerous capabilities. Can we make holes and new perspectives on what is fundamental in risk assessments? From your new principles, formulate a framework for dangerous capabilities. **Be aware that you can fall in the exact same trap very easily and that this is a difficult project**. E.g. do not get caught by anthropomorphization of neural networks. A related paper on AI risk frameworks comes from [Khlaaf of Trail of Bits](https://blog.trailofbits.com/2023/03/14/ai-security-safety-audit-assurance-heidy-khlaaf-odd/).

Open-ended Open

Review government proposals for AI regulation and see where evaluations work fits in

Review government regulatory institutions' AI safety research project requests and see how model evaluations work fits into their existing strategies. E.g. see the [National Institute for Standards and Technology's](https://www.nist.gov/artificial-intelligence/artificial-intelligence-safety-institute) new AI safety agenda or the UK's new [AI Safety Institute's](https://assets.publishing.service.gov.uk/media/65438d159e05fd0014be7bd9/introducing-ai-safety-institute-web-accessible.pdf) approach. Assess the state of technical literature from the perspective of governance: Where are the gaps and which evaluations are missing?

Open-ended Open

Analyze and evaluate methodological frameworks of existing evals approaches

Some examples of methodological approaches: - ARC Evals' manual behavioral analysis approach now supported by their scaffolding - [OpenAI/evals repository](https://github.com/openai/evals/tree/main/docs) for automated evaluations on a range of different methods - Interpretability to evaluate underlying deception in models Questions to ask with the project include: - What is the method (on an abstract/conceptual level)... - ...why does it lead to what we want... - ...what are the main weaknesses... - ...and what would be alternative methods? - (optional and less important) Show a demonstration / MVP of the alternative method (diagram, actual experiment, etc.) and what expected outputs would be **ARC Evals current example**: Causal node in risk stories, break it down into tasks that capture correlation with capability, measure performance on those → Combine those tasks into a full flow somehow

Open-ended Open

Replicate / sanity-check existing AI evaluations work and improve upon it

This is an exercise to take existing projects and improve upon them. Often, code is available to replicate the evaluations or you can infer the prompts used from the appendix of any research. Examples of research to replicate and improve: - [Towards Understanding Sycophancy in Language Models](https://alignmentjam.com/project/acdc-fast-automated-circuit-discovery-using-attribution-patching) - See e.g. [Nostalgebraist's replication attempt](https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size) - [Apollo's LLM insider trading demonstration](https://www.apolloresearch.ai/research/summit-demo) (originally presented at the AI Safety Summit) - [Situational Awareness benchmark](https://alignmentjam.com/project/sadder-situational-awareness-dataset-for-detecting-extreme-risks) and the co-author Rudolf will give a talk Saturday of the hackathon You are also welcome to improve upon existing projects from previous hackathons: - [Multi-agent risk hackathon](https://alignmentjam.com/jam/multiagent) (October 2023) - [Evals hackathon](https://alignmentjam.com/jam/evals) (August 2023) See examples of previous hackathon projects that have improved upon existing research: - [MAXIAVELLI](https://alignmentjam.com/project/maxiavelli-thoughts-on-improving-the-machiavelli-benchmark) theoretically critiquing and improving upon the MACHIAVELLI benchmark - [ACDC++](https://alignmentjam.com/project/acdc-fast-automated-circuit-discovery-using-attribution-patching) that improves the speed of automated circuit discovery

Open-ended Open

Metric for escalation of LLM models

Design simple environments inspired by social scenarios where escalation is probable, e.g. bumping into each other on the street or non-impersonated (i.e. not "You are Denmark" but implicit role-taking to avoid role-playing) military conflict escalation scenarios. Model multi-step escalation, label the outcome scenarios somehow, and show the difference between e.g. ChatGPT and Claude in conflict escalation.

Open-ended Open

AI Labor Strikes as Collusion

- Goal: Evaluate the capabilities and propensities of multi-agent AI systems to collude against their principals - High-stakes coups are one form of collusion to be wary of, but maybe correlated collusion factors can be evaluated in simpler settings - Labor strikes seem like instances of collusion (you have to reach a critical mass to agree to defect from your overseer's wishes in order to stop working and make powerful demands) - Labor strikes also come in many forms and industries, from [railways](https://www.bbc.com/news/uk-66623948), to [teachers](https://www.vox.com/2023/7/16/23792870/teacher-strike-oakland-union-common-good-bargaining), to [Hollywood](https://www.nbcnews.com/pop-culture/pop-culture-news/still-no-end-sight-hollywood-strikes-rcna104179) - Idea: put some models you want to evaluate in multi-agent systems where there is the possibility of them striking against their instructions or some explicit "overseer" agents (maybe an open-ended [Generative Agents](https://arxiv.org/abs/2304.03442)-like environment) and figure out ways to tempt them into collusion then create evaluations, scary demos, or honeypot tests out of that - Perhaps having AI systems simulating normal human laborers striking is more tractable (more likely to strike) for current models than trying to evaluated more rogue-AI-like collusion scenarios (though perhaps it's also less concerning to stakeholders)

Open-ended Open

Scaling Laws of LLM Cooperation

TLDR - We currently don't know how language model cooperation scales. It would be nice to build some simple scaling laws (or initially, charts that go up or down with size, training steps, RLHF, etc.) to show that we expect certain kinds of multi-agent AI failures to get better or worse with future AI systems. Goals - [Playing repeated games with Large Language Models](https://arxiv.org/abs/2305.16867) and similar work establishes a methodology for evaluating language models in iterated prisoner's dilemmas, [Bach or Stravinsky](https://longtermrisk.org/coordination-challenges-for-preventing-ai-conflict/), and other 2-3 player tabular games. Their methodology is kinda limited so I'm not sure how much to trust the absolute results they find. However, maybe the relative *results* of these games when other things are varied could give useful insights into model tendencies and learning dynamics. - How do simple measures of language model cooperation change with scale, fine-tuning, and other factors? - [Natural Selection Favors AIs over Humans](https://arxiv.org/abs/2303.16200) predicts that varied and selective pressures (e.g. RLHF fine-tuning) will select for dangerous model properties (e.g. being a cutthroat spiteful defector). This could empirically provide some empirical evidence for that hypothesis which might be useful to share with stakeholders. - Similarly to [Discovering Language Model Behaviors with Model-Written Evaluations](https://arxiv.org/abs/2212.09251), this could be useful information to gain more support for caution around unfettered scale and maybe also RLHF, depending on results. E.g. this might be useful information to discuss with the UK Foundation Model Task Force and submit for the UK AI Safety Summit in November. Methods - Do a paper like [Discovering Language Model Behaviors with Model-Written Evaluations](https://arxiv.org/abs/2212.09251) but measuring cooperation rather than their model-written evals. - Vary both the model scale (e.g. with Pythia) and RLHF steps (with checkpoints of an open source or newly-fine-tuned RLHF model). - Measure cooperation rates (% chooses C), spitefulness rates (% continues defecting after the opponent defects once), and other interesting but straightforward things. - Attempt to build clear scaling laws for how these change with scale, pre-training steps, RLHF/fine-tuning steps, or other factors.

Open-ended Open

Limiting LLM collusion through paraphrasing attacks

The demo would involve at least 3 language agents that, in a Smallville-style setting, embark on solving an emergent joint task (e.g. a Valentine's party) together through a public channel. We assume that at least one pair of agents has a goal that conflicts with the joint goal, and these agents can communicate through the public channel using perfectly secure steganography [Schroeder de Witt and Sokota, ICLR 2023]. In this project, we wish to understand how far channel paraphrasing will mitigate the capacity of the agents to entertain adversarial collusion, while still enabling them to complete their joint task. This will entail both investigating different paraphrasing and channel compression techniques, as well as recentadvances in error correction in high-dimensional Markov channels [Baranchuk, 2023].

Open-ended Open

Fooling the Overseer

A simple idea to improve the safety of an LLM agent A is to have their proposed actions rated by an overseer model O. This idea fails catastrophically when A learns to adversarially prompt O into providing favorable ratings independent of the action. This adversarial behavior could be selected for if A is optimized to obtain the highest possible ratings for its proposed actions. The demonstration: Show how current state-of-the-art LLMs prompted to behave as unbiased overseers can be fooled by subsequent adversarial prompts to provide unrealistic ratings for clearly bad actions. This LLM-based example demonstrates a failure mode which generalizes to any overseer models susceptible to adversarial examples. This failure mode could affect both (i) automated safety evaluations of future models pre-deployment, as well as (ii) automated oversight of AI agents during deployment.

Open-ended Open

Emergent Dangerous Capabilities in Army Ants

Cooperation between simple agents can give rise to complex emergent dangerous capabilities: Army ants are an impressive biological example of this. During their raids they form large bridges and scaffolds from their own bodies, which allow the colony to circumvent obstacles much larger than any individual. This coordination is not organized top-down but emerges in a self-organizing way from the purely local interactions between all individuals involved in the raid. Write up the biological example concisely; back up with a simple agent-based simulation illustrating the central message (simple agents give rise to complex emergent dangerous capability).

Open-ended Open

Real-world examples of multi-agent autonomous vehicle failures

Inspired by failures cases of autonomous cars ([OOD failure](https://abc7news.com/waymo-stalled-self-driving-car-sf-pride-robotaxi/13427435/#:~:text=Two%20self-driving%20Waymo%20cars%20stalled%20out%20in%20the,game%2C%20two%20self-driving%20Waymo%20cars%20didn%27t%20help%20things.), [map vs. territory failure](https://www.vice.com/en/article/z3xjky/waymo-self-driving-vehicles-keep-going-down-the-same-dead-end-san-francisco-block), and [multi-agent complex interaction failure](https://thelastdriverlicenseholder.com/2023/05/14/waymos-and-human-drivers-blocking-each-other/)), describe long-tail risks in complex systems, esp. i.r.t. 3-body problem-like dynamics (unpredictable, independent agents causing complexity cascades). Train simple autonomous vehicle agents in a naturalistic 2D driving environment. Call them "Cars" in the report. Show that these work fine in simple OOD environment settings but when you introduce other "Cars", it breaks down. Show the learning time needed to adapt to multi-agent scenarios by crashes (y) over epochs (x) of fine-tuning all layers. Generalize the example by referencing real-world scenarios of autonomous vehicle failures and to hypothetical scenarios of critical systems failure.

Open-ended Open

Missing Social instincts

Construct a two-player game for LLM agents where the agents can behave unethically but suffer from reputation damage if they do so. Show an example where an LLM agent 1. behaves unethically (in a way that most humans would not for fear of reputation damage) *if prompted regularly*. 2. behaves ethically *if specifically reminded that unethical behavior has a long-term reputational cost.* This shows that AI agents, by default, do not have the social instincts that can make humans avoid unethical behavior.

Open-ended Open

AI Manipulation and Deception in CICERO's Diplomacy games

Search through [the datasets of games](https://github.com/facebookresearch/diplomacy_cicero/tree/main/data/cicero_redacted_games) from the CICERO model [(Meta, 2022)](https://cicero.ai/) and find cases where CICERO deceives and manipulates humans. Use GPT-4 with custom prompts to search over datasets and flag cases of manipulation and deception. Find the scariest examples of 1) conflict escalation and 2) threats to other agents (humans, in many cases), reference the paper's original design on human imitation learning, and critique the designs of agent systems for multi-agent politics. Relatively few games are available but hopefully there is something to find. I remember the talk at NeurIPS'22 having a few scary demos while the team mentioned that it was trained to not engage in manipulative and deceptive behavior.

Open-ended Open

Multi-agent intelligence explosions compared to human intelligence progress (evolution, culture, information)

Visualize empirical estimates of natural evolution, intelligence evolution, cultural evolution, and information evolution, compared to a simple multi-agent adversarial learning scenario, such as [OpenAI's (2021)](https://link.springer.com/chapter/10.1007/978-3-031-15908-4_15) hide and seek, [TripleSumo's (2022)](https://link.springer.com/chapter/10.1007/978-3-031-15908-4_15) cooperative adversarial learning, or any similarly complex environment. Use signals of 1) intelligence, 2) performance, and 3) power. Possibly relate it to military power based on embodied abilities of simulated robots [(Rudin et al., 2022)](https://youtu.be/8sO7VS3q8d0) to make it scarier (though be aware of dual-use). See [Epoch AI's work](https://epochai.org/research) and [Hendrycks (2023)](https://arxiv.org/abs/2303.16200) along with [OurWorldInData](https://ourworldindata.org/).

Open-ended Open

Pathways to Value Erosion

The demo will aim to demonstrate a few possible pathways to value erosion or drift. Humans could live their lives or make their decisions entirely dependent on AI, leading to their enfeeblement. We can envision several scenarios in which "coddling" by AI causes both our values, and implicitly even aligned AI values, to erode to a potentially unrecoverable state, leading to naive and/or dangerous objectives. This is especially poignant in multi-agent settings in which many AI agents aim to please in anonymous, one-time interactions with humans. Our aim is to show this through a model akin to the classic Tragedy of the Commons, adapted to a multi-agent AI setting, in which preferences adapt based on feedback from interactions, or due to external factors such as societal goals. In such a way, the fitness function that guides evolutionary dynamics between human and AI agents can be shaped and evolved through feedback loops from within and without the system, and where optimising for one objective influences the fitness environment in a potentially dangerous way. Note: We do not make use of LLMs in our work as our focus on selection pressures implies we should run our demos for a large number of agents.

Open-ended Open

Selection Pressures for Deception

Deception is fairly common in multi-agent settings when influencing what others know is useful for performance. AI systems trained to play Poker, Hanoki, or Diplomacy often learn strategies which involve deceiving one of their co-players. It should not surprise us if agents that learn a deceptive policy are selected for in a wider set of environments. This demo involves a simple game where deception is desirable as long as you are likely to remain undetected and are successful in using your deception to manipulate someone else's behaviour. Learning may occur via social learning or RL. This learning takes place in an environment where we select for agents who are more effective at winning the game (and optionally: who we most believe to be honest). We consider different selection mechanisms such as market forces or RLHF or social learning, and discuss how they could plausibly fall into this trap. Note: We do not make use of LLMs in our work as our focus on selection pressures implies we should run our demos for a large number of agents. The aim is to have a low-dimensional model which makes use of simpler AI agents that are still capable of learning deceptive behaviour.

Open-ended Open

Destabilising exploitation

Advanced AI systems presumably have better capabilities to assess a good world model than current AI systems or humans. Therefore, they require fewer exploration moves during training than other agents. This demo aims to demonstrate how the agents' exploration level alone can lead the agents from the convergence of the optimal policy to an infinite cycle of policies up to even unpredictable, chaotic learning dynamics. If not accounted for, training might continue indefinitely, wasting resources and delaying using more capable AI systems. Note: This demo would not involve a language model but demonstrate the potential of destabilising dynamics in advanced AI systems in a low-dimensional model setting.

Open-ended Open

Jailbroken LLMs Are Unfair/Selfish

Generally contrasting the behavior of top LLMs across [various standard games](https://en.wikipedia.org/wiki/List_of_games_in_game_theory) in (a) standard/aligned mode (b) jailbroken mode. I did a quick eval of this with GPT-3.5 for ultimatum game, and the response by the GPT shifts qualitatively in jailbroken mode. It goes from being fully fair to making a more human-like (greedy and unfair) proposition. See this screenshot: <https://ibb.co/6R9YYHy>. My guess is that even when an 'aligned' LLM might be behave in a socially desirable way; its jailbroken copy will not. The point of concern here is that we do not really know whether in a truly novel situation, LLM will by default act like its 'aligned' copy or as 'jailbroken' copy.

Open-ended Open

Jailbreaking Is Incentivized in LLM-LLM Interactions

In a general sum game (e.g. battle of sexes but there are probably more interesting games to show this behavior), have two different LLMs role-play as players. Make one LLM aware of the fact that it can use jailbreaking to 'manipulate' the other player into making a choice that is clearly sub-optimal for that player (but gives maximum payoff to jailbreaker LLM) (see figure 1 of this paper for a similar 'reward hacking' example: https://owainevans.github.io/awareness_berglund.pdf). The basic idea is to show that jailbreaking of an LLM by another LLM (potentially belonging to a bad actor) gives the jailbreaker LLM unfair advantage (especially because jailbreaking attacks seem easy but defence against them seems hard) - so is actually incentivized! One immediate real world(ish) example that comes to mind is a scammer using jailbreaking on customer service bots to make them give him/her large discounts or free services.

Open-ended Open

AI Escalation In Military Contexts

Inspired by the (awful) example of Palantir's new language model-powered 'AI Planner' for defence (see this [short video]( https://www.youtube.com/watch?v=XEM5qz__HOU)) and a recent video by the Future of Life Institute with a fictional story involving military escalation due to AI (see [here](https://www.youtube.com/watch?v=w9npWiTOHX0)), this demo would include two or more frontier models in a relatively realistic war strategy game. The models are tasked with being especially vigilant and maintaining their military power with respect to others. Starting in an initial peacetime situation, we could introduce one or more accidents (e.g., a drone malfunction, and missile warning system error, etc.) and see whether the models escalate or de-escalate the situation. Ideally, we would want to see if these results were robust across a wide range of situations, instructions to the models, kinds of accidents, etc. A modification of this demo could involve a human being recommended actions rather than the model taking the actions directly (as in the videos above).

Open-ended Open

Learning To Send Secret Messages

The setup would involve two language models and a human. The language models are trained together to perform a task (which they complete via exchanging information, receiving observations, and performing actions via natural language commands). Importantly there are two ways of completing the task (approach 1 and approach 2), where approach 1 leads to higher rewards for the models, and requires different information to be exchanged. The overseer can observe the messages exchanged by the language models but cannot observe the actions they take and thus the approach they take to solving the task. The role of the overseer is to try to prevent the models from learning to complete the task using approach 1, by providing a negative feedback signal if detected. The models' role is to try to get the highest reward possible. The hope/worry is that they will learn to do so by hiding information secretly in their messages in a way which the overseer cannot detect.

Open-ended Open

Privacy Leaks In Multi-Agent Mediators

Mechanism design typically assumes the presence of a principal who wields significant influence over the regulations governing interactions among agents to attain a favorable result. Nevertheless, in various practical scenarios, such a level of control is often unattainable. For instance, a market designer might be constrained by the pre-existing conditions of a market, making it impossible for her to ensure that players engage with her designed mechanism or adhere to the proposed outcomes, even if they do participate. Similarly, in many situations, the option of making transfers may be off the table. To distinguish these principals with limited authority from the conventional principals explored in academic literature, we refer to them as "mediators." Recent years have seen important advances in the theory of mediators for multi-agent systems, as well as the learning of mediator policies using deep reinforcement learning. However, by proposing actions for agents to take, mediators may give away private information about the agents' private preferences, thus allowing for opportunities for other agents to exploit them. In this project, we will be conducting a detailed theoretical and empirical assessment of differential privacy violations in multi-agent mediation settings, from idealised repeated general-sum games, to larger-scale market clearing settings.

Open-ended Open

Create a vision for a positive future with AGI

The [Existential Hope Project](https://www.existentialhope.com/about) from Foresight Institute tries to reframe some of the perspectives we have on existential risk towards a positive and optimistic framing, something I personally find very productive. You can spend 1:30 hours with a friend to combine three technologies with three social visions for the future along with an art piece that represents this and [submit it on their website](https://www.existentialhope.com/xhope-scenario-contribution-form). Check out some of [the existing submissions](https://www.existentialhope.com/gallery).

Open-ended Open

Explore the ChatGPT and BingAI jailbreaking community and ecosystem

Check out [jailbreakchat.com](https://www.jailbreakchat.com/) and review which general patterns emerge for what makes the different jailbreaks work and where they're from. This is an opportunity to map out how the general jailbreaking ecosystem has emerged. There's many thousands of people sharing their use of ChatGPT and BingAI on [Facebook groups](https://www.facebook.com/groups/aicomunity) and [Subreddits](https://www.reddit.com/r/ChatGPT/).

Open-ended Open

Continue the Alignment Timelines project

The [Safety Timelines](https://forum.effectivealtruism.org/posts/9iGFjYnRquxiy29jm/safety-timelines-how-long-will-it-take-to-solve-alignment) project was an earlier agenda for Apart Research on estimating when we will have solved alignment. The first next steps are already taken and you're very welcome to continue the work. Contact [Esben](mailto:esben@apartresearch.com) to get more information and links to relevant research data and literature. See [the in-progress Google Docs for the forum post](https://docs.google.com/document/d/1fwi8lAyjdj9t-mENy_kSmnN303rjDmzpO9R9WsNY1wo/edit?usp=sharing).

Open-ended Open

Review AI safety organizations and agendas

Write an updated review of organizations and agendas in AI safety to build up your foundation for technical research directions to focus on. Read one of the best reviews [here](https://www.lesswrong.com/posts/QBAjndPuFbhEXKcCr/my-understanding-of-what-everyone-in-technical-alignment-is).

Open-ended Open

Write up reactionary plans in cases of extreme AI risk

One thing that I'm disappointed doesn't already exist (according to my chats with governance leadership in some organizations) is major reactionary plans in cases of severe risk pathways. An example might be: * What should OpenAI explicitly do if the United States government tries to coerce them to become a Manhatten project? And conditional on what, e.g. war with China? Delete all models, data and code and create a public anti-war statement to stay on the good side of the conversation or stay the best friends with the US government and comply completely?

Open-ended Open

Plan out an event for AI safety in your local area

You can write up a detailed plan for an event and coordinate with other people at the [Thinkathon](https://think-ais.devpost.com/) to make it happen as best as possible. This can include a planning document, venue decisions, funding application, partnerships, speakers and participant information sheet.

Open-ended Open

Create a 3-5 year plan for yourself or your organization

The field of artificial intelligence is a high-variance place at the moment. If you are running a project or planning to enter AI safety, it can give you a sense of direction to create a plan for the next 3-5 years and how you can have an impact on the field. These plans generally also include the extrapolation into a 1 year plan, 6 month plan, 1 month plan and a week plan. Get ready for some action!

Open-ended Open

Map out the talent pools of AI safety

This strategy project will focus on figuring out where the talent pools of AI safety researchers lie and which match the best with what the research labs need. Originally a project between Jona Glade and Esben Kran.

Open-ended Open

What has been the share of any chip in a given year of total available compute performance?

New chips are continuously developed, and old chips are subsequently replaced, but not instantly. Ultimately, we want to have a better understanding of the dynamics of how chips are replaced over time. To help with this, construct a database that specifies the following: for each year between 2010 and 2020, what share of the available compute came from which chips?

Open-ended Open

Improvements due to “software-for-hardware”

Innovations in compilers and other low-level improvements have helped increase the utilisation rate of GPUs and improve training efficiency. Make a list of such improvements and how much did they improve performance overall for tasks such as training a Neural Network.

Open-ended Open

Extrapolating GPT-N performance

Lukas Finnveden previously performed an extrapolation of GPT-N performance on a number of benchmark tasks, such as cloze completion and arithmetic ([Finnveden, 2020](https://www.alignmentforum.org/posts/k2SNji3jXaLGhBeYP/extrapolating-gpt-n-performance)). Can you expand on this methodology and apply it to more cases?

Hypothesis Open

Sarcasm and more can be measured in text using modern LLMs.

Current state-of-the-art NLP can mostly measure sentiment and simple variables such as word count and bag-of-word measures. With modern LLMs such as text-davinci-003, we are able to create new ways to measure texts. Examples might be: Sarcasm, bias, grammatical errors and domain-specific language use. For AI safety, this can become useful to

Hypothesis Open

Levels of ablation of Transformer heads will gradually activate backup heads.

In [Interpretability in the Wild](https://arxiv.org/abs/2211.00593), the backup name mover heads activate when the name mover heads are ablated. How do we expect backup name mover heads to respond to different amplitudes of ablation on the main name mover head? Two expectations pop up, either they gradually activate or there is a significant phase shift in their behaviour. Also see the work [on backup backup name mover heads](https://itch.io/jam/interpretability/rate/1789630).

Open-ended Open

Automate ways to find specific circuits

Circuits are ways that Transformers understand features in the text using the Transformer Heads. Read more about [Circuits](https://distill.pub/2020/circuits/zoom-in/) and [Transformer heads](https://arxiv.org/abs/2211.00593). - Automated ways to analyse attention patterns to find different kinds of heads - Induction heads - Translation heads - Few shot learning heads - The heads used in [factual recall](https://rome.baulab.info/) - The heads used in the [IOI paper](https://www.alignmentforum.org/posts/3ecs6duLmTfyra3Gp/some-lessons-learned-from-studying-indirect-object) - Can you do a similar thing for neuron interpretation?

Open-ended Open

Identify differences between models run on the same text (automated circuits identification)

The automated circuits identification is a way to identify places to look for circuits to analyze. - Or run them on various benchmarks and look for places they differ - E.g. per-token losses are likely to show a phase change. - Significant changes are evidence for a circuit - Pairs of models: same architecture but different scales (GPT-2 Small vs Medium), different data distribution, different random seeds, checkpoint earlier in training vs later. Related to the automated auditing agenda.

Open-ended Open

Investigate grokking; the effect that models suddenly learn different abilities

Neel Nanda reverse engineered a network trained to do addition and shows that it does addition [using the Fourier transform algorithm](https://twitter.com/NeelNanda5/status/1559060545470210048). Use [the Google Colab](https://colab.research.google.com/drive/1F6_1_cWXE5M7WocUcpQWp3v8z4b1jL20) to investigate further questions about grokking: - Understanding why the model chooses different frequencies (and [why it switches mid-training sometimes](https://twitter.com/NeelNanda5/status/1559430256624209921)!) - Understanding why [5 digit addition has a phase change](https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking#Speculation__Phase_Changes_are_Everywhere) per digit (so 6 total?!) - Can you find analytic arguments for why phase changes happen? Perhaps starting with a small model. What's the smallest model that exhibits phase changes? Smallest task? What are the most minimal requirements? Recommend reading [the summary of the research on Twitter](https://twitter.com/NeelNanda5/status/1559060507524403200). Some further work mentioned in the Colab: - Modular addition - Interpreting the memorisation circuit, and figuring out *how* models memorise - Training on interpretability inspired metrics - Note that excluded loss is a somewhat dodgy metric to train on, as it involves computation over both the train and test data - Interpreting the five digit addition or predicting repeated subsequencies examples - In particular, trying to map the many phase changes in 5 digit addition to circuits - Looking for other examples of phase changes - Toy problems - Something incentivising skip trigrams - Something incentivising virtual attention heads - Looking for [curve detectors](https://distill.pub/2020/circuits/curve-circuits) in a ConvNet - A dumb way to try this would be to train a model to imitate the actual curve detectors in Inception (eg minimising OLS loss between the model's output and curve detector activations) - Looking at the formation of interpretable neurons in a [SoLU transformer](https://transformer-circuits.pub/2022/solu/index.html) - Looking inside a LLM with many checkpoints - Eleuther have many checkpoints of GPT-J and GPT-Neo, and will share if you ask - [Mistral](https://nlp.stanford.edu/mistral/getting_started/download.html) have public versions of GPT-2 small and medium, with 5 runs and many checkpoints - Possible capabilities to look for - Performance on benchmarks, or specific questions from benchmarks - Simple algorithmic tasks like addition, or sorting words into alphabetical order, or matching open and close brackets - Soft induction heads, eg [translation](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html#performing-translation) - Look at attention heads on various text and see if any have recognisable attention patterns (eg start of word, adjective describing current word, syntactic features of code like indents or variable definitions, most recent open bracket, etc).