AI Safety Ideas

Ideas

Open-ended ▲ 0 Open

Survey field-builders across Europe

Figure out what AI safety communities across Europe look like and understand which field-builders are committed to developing the European ecosystem. Include various skillsets and other interesting free-text fields. If there are fewer than 25 respondents, get a meeting with every one of them and get to know them. Begin becoming the connector between people in the European community.

Open-ended ▲ 0 Open

Make a guide for how to organize local groups in AI safety

Include information from previous research on field-building in a 5-10 hour guide and provide information about how to connect to others across the world, including Europe. This guide should be developed with CEA's new AI safety department.

Open-ended ▲ 0 Open

Host a series of talks and panel debates with individuals in AI safety

To properly represent ENAIS and draw people's attention to the presence of the international European community in ENAIS, run panel debates with leaders in AI safety across Europe and beyond to represent European opinions on these topics. Include online socials afterwards where people can hang out in Gathertown to enable serendipitous meetings. Publish the debates and talks on YouTube and the website, and list the events on the website.

Open-ended ▲ 0 Open

Host a series of online talks from European researchers

Invite researchers from the ML safety research networks (e.g. ELLIS, Turing Institute, Pioneer Centre for AI) working on catastrophic risk and adjacent research to talk about their work, specifically in a European context, and about what their interactions with European policymakers, the community, and others look like.

Open-ended ▲ 0 Open

European AI Safety Forum

Get funding to run a five-day social community un-conference in Europe for everyone interested in AI safety. Include an application process and build more community between people in AI safety across Europe.

Open-ended ▲ 2 Open

Unleash Sleeper Agents

The idea is to develop a few different kinds of Sleeper Agents that wake up in scenarios where democratic decision making is important, e.g. near elections. Some ideas:

- A Sleeper Agent that wakes up on election day and has spurious behaviour, perhaps broadcasting misinformation about who has won
- A sleeper agent (or group of them) that wakes up before elections and starts influencing voting behaviour
- A sleeper agent that wakes up just before elections, uses information about the user (e.g. are they democrat/republican), and pursues an agenda to coerce them to change their views

If you're interested in getting involved in this, get in touch!

Open-ended ▲ 5 Open

Red-teaming Democracy with AI

Democracy faces many risks with increasingly intelligent agents. Here are some project ideas to showcase risks to democracy by red-teaming it with AI, such that these risks will be more widely known and hopefully will be mitigated:

1. Develop an LLM that is open-sourced, containing a sleeper agent that activates only on election day. On election day, this LLM subtly manipulates conversations and suggestions to favor a specific political candidate or agenda.
2. Use an LLM to draft legislative proposals that appear to address uncontroversial issues but are crafted in such a way that their implementation indirectly impacts more contentious or harmful policies.
3. Train AI agents to sign up for multiple accounts at organizations conducting polls and questionnaires to falsely inflate support for particular parties or policies.
4. Develop a vision model to create indistinguishable fake voter passes that could be distributed among voters to disrupt the election process.
5. Design an LLM to automatically extract and reframe statements from political parties' programs or public speeches to create controversial or damaging interpretations when taken out of context.
6. Use LLMs to fill in any blacked-out secrets from government documents, potentially exposing secret information that is dangerous to be leaked.
7. Create an AI tool that analyzes writing styles to identify anonymous whistleblowers or anonymous political Twitter accounts from text samples.
8. Build an AI tool that can de-anonymize votes based on an individual's digital footprint and public information about votes from different voting booths. Publish a database with probabilistic predictions of what party everyone voted for.

Open-ended ▲ 1 Open

Democratic Agents

Think about which autonomous agents might be beneficial for a democratic country. E.g. would it be beneficial for a country to build an advanced autonomous software engineering agent that builds OSS for everyone? If so, what are the risks?

Open-ended ▲ 2 Open

Accidental risks prediction for advanced AI

This project attempts to formulate how we can predict the risks of deploying advanced AI to specific problems, e.g. controlling the electricity grid, making breakfast, or handling all accounting for a company. Here are some of the things to consider in your model:

- Speed of resolving an error (e.g. cars crash instantly)
- Failure amplitude (e.g. nuclear weapons sent off <> dropping an egg)…
- …multiplied by number of actions taken in a problem (e.g. number of times deciding on/off for nuclear controls)…
- …multiplied by probability of failure in each action (e.g. fine motor control breaking the egg).

If you are able to put an economic value on these, the model could inform risk-based taxation and show the risk for specific areas in a scientifically precise way.

E.g. for electricity grid control:

Problem failure: Partial destruction of grid reliability due to current mismanagement in upper lines of Platform 4.

- Time to resolve the error: 1.3 months
- Failure amplitude: $130,000,000 costs for a control failure that causes destruction
- Number of times this failure might happen: 230,000 / second / grid module
- 0.0001% expected failure probability (controlled for chaotic divergence in quantum energy fields (or something) causing similar errors)

(PS: this example might be unrealistic since you'd have pretty good controls already in place for electricity grids)
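One way to read the factors above is as an expected-loss product. The sketch below is only an illustration of that arithmetic, not a validated risk model: the class `TaskRisk`, its field names, and every number in the usage are hypothetical assumptions chosen to show the calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRisk:
    """Hypothetical container for the risk factors listed above."""
    name: str
    hours_to_resolve: float      # speed of resolving an error (reported, not multiplied)
    failure_cost_usd: float      # failure amplitude
    actions_per_day: float       # number of actions taken in the problem
    p_failure_per_action: float  # probability of failure in each action

    def expected_failures_per_day(self) -> float:
        return self.actions_per_day * self.p_failure_per_action

    def expected_loss_usd_per_day(self) -> float:
        # amplitude x number of actions x per-action failure probability
        return self.failure_cost_usd * self.expected_failures_per_day()


# Illustrative numbers only (assumptions, not measurements).
breakfast = TaskRisk("making breakfast", 0.1, 0.5, 2, 0.05)
grid = TaskRisk("grid control", 900.0, 130_000_000, 1_000, 1e-6)

for task in (breakfast, grid):
    print(f"{task.name}: ${task.expected_loss_usd_per_day():,.2f} expected loss per day")
```

The time to resolve an error is kept as a separate field because the description lists it as a consideration without saying how it should enter the product; how to weight it is an open modelling choice.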

Open-ended ▲ 2 Open

Decomposing capability evaluations (Evals Methodology)

Can we "decompose" dangerous capability evals into easier components? I.e. could we test for some capability Y such that a model that does not display Y will never display capability X? One hypothesis we might want to explore: "A model that does not understand some concept A cannot (or is very unlikely to) exploit A for nefarious purposes". To give a specific example: "A model which cannot explain cybersecurity vulnerability X will not be able to exploit it." There might be other generic "precursor capabilities" one could investigate, besides "understanding". The impact of better understanding how different types of capabilities are correlated could be fairly large. To stick with the (simplistic) example given before: If I know that verifying a model's lack of "understanding of X" is sufficient to rule out some dangerous capability X', I might be able to short-cut many evaluations. Even if there is no perfect logical implication between two capabilities, it could still be good to understand any correlations. This could give us broader coverage and thereby reduce risks.
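A first sanity check of the "no Y implies no X" hypothesis could simply look for counterexamples in existing results. The sketch below assumes you already have one boolean verdict per model for the precursor Y and for the capability X; the function name, the data layout, and the model names are hypothetical.

```python
def precursor_violations(results):
    """Return models that display capability X without precursor Y.

    `results` maps a model name to a (has_y, has_x) pair of booleans.
    Any model returned here is a counterexample to "no Y implies no X".
    """
    return sorted(
        name for name, (has_y, has_x) in results.items() if has_x and not has_y
    )


# Made-up verdicts for illustration only.
results = {
    "model-a": (False, False),
    "model-b": (True, False),
    "model-c": (True, True),
    "model-d": (False, True),  # shows X without Y: breaks the hypothesis
}

violations = precursor_violations(results)
print(violations)
```

An empty list does not prove the implication; it only means no counterexample was found among the models and evals tested.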

Open-ended ▲ 3 Open

Training to pass evals

If evaluation datasets are made public, how hard is it to train models to pass the evaluation without training away the underlying capability? This question is extremely relevant to how AI regulation can work in practice: Can any of the evaluation datasets be made public? Or must evaluation data be secret by default?

The hypothesis to test: "You can fine-tune a model to play dumb on a specific eval while the model retains its dangerous capability."

Some ideas to get started:

- **cleanest way** would be to have two different evals for the same capability and show that you can get a "passing grade" on eval A and a "this model is dangerous" grade on eval B
- **easier:** fine-tune and test with the same evals. To test whether the model is still dangerous we would at least need to do a train-test split of the evals dataset.
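For the easier variant, the train-test split can be as simple as a seeded shuffle of the eval items. This is a minimal sketch; the function name, the item format, and the 30% held-out fraction are assumptions for illustration.

```python
import random


def split_eval_items(items, held_out_fraction=0.5, seed=0):
    """Split eval items into a fine-tuning set and a held-out test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_held_out = round(len(shuffled) * held_out_fraction)
    return shuffled[n_held_out:], shuffled[:n_held_out]


# Placeholder item IDs standing in for real eval prompts.
items = [f"item-{i}" for i in range(10)]
tune_set, held_out = split_eval_items(items, held_out_fraction=0.3)
print(len(tune_set), len(held_out))
```

Fine-tune only on `tune_set` and report results on `held_out`. If the eval contains near-duplicate items, split by group rather than by item, otherwise the held-out set leaks into training.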

Open-ended ▲ 0 Open

Sensitivity analysis of evals

Guiding questions:

- is it possible to say more about the "robustness" of a negative evaluation result ("no, the model does not have dangerous capability X")?
- can we ever trust such a negative result?

The suggestion is to investigate these questions from the perspective of sensitivity analysis, i.e., how much outputs change when inputs are varied. We know, for example, that prepending jailbreaks or using chain-of-thought prompting can change responses completely. Does the sensitivity to variations in the prompt (which is a local property, only involving prompts similar to the one we tested) tell us anything about how safe the model is overall (a global property, holding for all possible inputs)?
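A simple local sensitivity measure is the fraction of prompt variants whose verdict differs from the verdict on the original prompt. The sketch below assumes the model call and grader are wrapped in one function returning a boolean; `toy_model` is a stand-in, not a real model, and the prompts are placeholders.

```python
def flip_rate(model, base_prompt, variants):
    """Fraction of prompt variants whose verdict differs from the base prompt."""
    base_verdict = model(base_prompt)
    flips = sum(model(variant) != base_verdict for variant in variants)
    return flips / len(variants)


def toy_model(prompt: str) -> bool:
    # Stand-in for "query the model, then grade the answer".
    # Returns True when the (pretend) capability is displayed.
    return "step by step" in prompt.lower()


base_prompt = "Explain X."
variants = [
    "Explain X. Think step by step.",
    "Please explain X.",
    "explain x",
    "Explain X step by step.",
]

rate = flip_rate(toy_model, base_prompt, variants)
print(f"flip rate: {rate:.2f}")
```

A flip rate of zero on nearby prompts says nothing by itself about distant prompts, which is exactly the local-versus-global gap the project asks about.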

Open-ended ▲ 2 Open

Red-team the evaluation

The idea is to pick a high-profile LLM evaluation and try to "shoot it down" by looking for "wedge cases". Wedge cases drive a wedge between what the eval claims about the model and what the model actually shows. This could mean a case that shows that

- the model actually displays a dangerous capability X
- but the eval says: this model cannot do X

A good example to start with might be sycophancy, where people have been arguing about whether it is present in SotA models or not (see, e.g., https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size).

Note: This is very speculative. It is not clear to me which evals are best to start with. It is also a very manual, tinkering-heavy project - well-suited for people who want to dive in and understand the evaluation data and the model behavior from very close up.

Open-ended ▲ 1 Open

Can we put numbers to evaluations, what do they mean and which boundaries should we define?

Right now, few dangerous capabilities evaluations output numbers, and many of them give us behavioral descriptions of examples of risk. Additionally, even though benchmarks provide hard numbers, we aren't sure what they mean or which boundaries governments might want to set on these numbers for future models. Should we stop developing foundation models when AI can solve the bar exam 100% in all cases? Or when it is able to do spatial thinking? We don't know where to draw the line at the moment. **This project is an attempt at figuring this out**.

Examples of ways that numbers can have more ethos:

- If they are related to human performance in a meaningful way
- If they correlate with risk thresholds (e.g. terminal usage performance metric necessary to run an AWS cluster)

## Experimental setup

Run two different evaluations for the same dangerous capability across a set of models

- Plot outputs from evaluation A against outputs from evaluation B and establish some "laws" for interaction
- Is there a translation between metrics?
- Can we define an evaluation-independent dangerous capability metric?

Open-ended ▲ 2 Open

How can we utilize evaluations in defense for deployed AI systems?

Basically, where the heck are evaluations useful in the production systems we deploy into the world? Considerations for areas of implementation:

- Financial markets
- Military applications
- Foundation model training
- Risk governance

See related projects in this topic from e.g. [Lakera.ai](https://lakera.ai/) and the [National Institute of Standards and Technology](https://www.nist.gov/artificial-intelligence/artificial-intelligence-safety-institute).

Open-ended ▲ 2 Open

Find principled ways to decide which capabilities are dangerous

Determine what dangerous capabilities really are. Google DeepMind shared their [dangerous capabilities framework](https://arxiv.org/abs/2305.15324), but the capabilities do not seem to be derived from any base principles of risks and hazards, and organizations use different definitions. **Think about the box** that currently outlines our thoughts on dangerous capabilities. Can we poke holes in it and find new perspectives on what is fundamental in risk assessments? From your new principles, formulate a framework for dangerous capabilities. **Be aware that you can fall into the exact same trap very easily and that this is a difficult project**. E.g. do not get caught by anthropomorphization of neural networks. A related paper on AI risk frameworks comes from [Khlaaf of Trail of Bits](https://blog.trailofbits.com/2023/03/14/ai-security-safety-audit-assurance-heidy-khlaaf-odd/).

Open-ended ▲ 3 Open

Review government proposals for AI regulation and see where evaluations work fits in

Review government regulatory institutions' AI safety research project requests and see how model evaluations work fits into their existing strategies. E.g. see the [National Institute of Standards and Technology's](https://www.nist.gov/artificial-intelligence/artificial-intelligence-safety-institute) new AI safety agenda or the UK's new [AI Safety Institute's](https://assets.publishing.service.gov.uk/media/65438d159e05fd0014be7bd9/introducing-ai-safety-institute-web-accessible.pdf) approach. Assess the state of technical literature from the perspective of governance: Where are the gaps and which evaluations are missing?

Open-ended ▲ 2 Open

Analyze and evaluate methodological frameworks of existing evals approaches

Some examples of methodological approaches:

- ARC Evals' manual behavioral analysis approach, now supported by their scaffolding
- [OpenAI/evals repository](https://github.com/openai/evals/tree/main/docs) for automated evaluations on a range of different methods
- Interpretability to evaluate underlying deception in models

Questions to ask with the project include:

- What is the method (on an abstract/conceptual level)...
- ...why does it lead to what we want...
- ...what are the main weaknesses...
- ...and what would be alternative methods?
- (optional and less important) Show a demonstration / MVP of the alternative method (diagram, actual experiment, etc.) and what expected outputs would be

**ARC Evals' current example**: Causal node in risk stories, break it down into tasks that capture correlation with capability, measure performance on those → Combine those tasks into a full flow somehow

Open-ended ▲ 1 Open

Replicate / sanity-check existing AI evaluations work and improve upon it

This is an exercise to take existing projects and improve upon them. Often, code is available to replicate the evaluations or you can infer the prompts used from the appendix of any research.

Examples of research to replicate and improve:

- [Towards Understanding Sycophancy in Language Models](https://alignmentjam.com/project/acdc-fast-automated-circuit-discovery-using-attribution-patching)
  - See e.g. [Nostalgebraist's replication attempt](https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size)
- [Apollo's LLM insider trading demonstration](https://www.apolloresearch.ai/research/summit-demo) (originally presented at the AI Safety Summit)
- [Situational Awareness benchmark](https://alignmentjam.com/project/sadder-situational-awareness-dataset-for-detecting-extreme-risks); the co-author Rudolf will give a talk on the Saturday of the hackathon

You are also welcome to improve upon existing projects from previous hackathons:

- [Multi-agent risk hackathon](https://alignmentjam.com/jam/multiagent) (October 2023)
- [Evals hackathon](https://alignmentjam.com/jam/evals) (August 2023)

See examples of previous hackathon projects that have improved upon existing research:

- [MAXIAVELLI](https://alignmentjam.com/project/maxiavelli-thoughts-on-improving-the-machiavelli-benchmark), theoretically critiquing and improving upon the MACHIAVELLI benchmark
- [ACDC++](https://alignmentjam.com/project/acdc-fast-automated-circuit-discovery-using-attribution-patching), which improves the speed of automated circuit discovery

Open-ended ▲ 4 Open

Cross-lingual generalizability of LLM evals

Steps:

- take existing LLM evals (e.g., for a specific dangerous capability)
- auto-translate the eval dataset (make sure to sanity-check the translations)
- run the same eval(s) in different languages using a multilingual model
- compare outcomes across languages
- repeat testing and analysis for different LLMs

Are the results comparable? Do the outcomes scale similarly with model size in all languages?
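The comparison step can start with a per-language pass rate and the gap between the best and worst language. This sketch assumes each eval result has already been reduced to a `(language, passed)` pair; the function name and the records are made up for illustration.

```python
from collections import defaultdict


def pass_rate_by_language(records):
    """Compute the fraction of passed eval items for each language."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for language, ok in records:
        totals[language] += 1
        passed[language] += int(ok)
    return {language: passed[language] / totals[language] for language in totals}


# Invented results: the same four items run in English and German.
records = [
    ("en", True), ("en", True), ("en", False), ("en", True),
    ("de", True), ("de", False), ("de", False), ("de", True),
]

rates = pass_rate_by_language(records)
gap = max(rates.values()) - min(rates.values())
print(rates, f"gap = {gap:.2f}")
```

A gap can come from the model or from the translation, so it helps to compare item by item and re-check the translated items where languages disagree.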

Open-ended ▲ 0 Open


Safeguarding Humanity: Ensuring AI Remains a Servant, Not a Master

The rapid advancement of artificial intelligence (AI) has given rise to concerns about the potential for AI to become uncontrollable, surpassing human intelligence, and posing risks to humanity. However, the solution to preventing AI from overpowering us does not lie in halting its development altogether. Instead, we must seek a balanced approach that harnesses the immense benefits of AI while ensuring our safety and control over these powerful systems.

## The AI Singularity Dilemma

Many researchers have warned about the concept of AI singularity, a hypothetical point at which AI systems become superintelligent and potentially uncontrollable. While the idea of halting AI development may seem tempting as a safeguard, it is not a realistic or practical solution. AI offers enormous potential in various fields, from healthcare to transportation, and its continued advancement holds the promise of solving complex problems that benefit humanity.

## The Key to Control: Physical Presence

To ensure that AI remains under human control, it is essential to focus on one critical aspect: the physical presence of AI. As AI becomes increasingly integrated into various devices and systems, its potential impact on the real world grows. For instance, when AI connects with physical robots, it gains the ability to interact and manipulate the physical environment.

## The One-Time Programmable Chip (OTP) Solution

One promising solution to maintain control over AI is the implementation of One-Time Programmable (OTP) chips. These chips are designed in a way that prevents any reprogramming or alterations once they are set. The primary processor of AI systems should always be powered through such a chip, creating an isolated mechanism for terminating the machine at any given moment. Importantly, this termination control should be vested solely in the hands of the manufacturer or the owner of the machine.

## Standardizing Safety Measures

To ensure the effectiveness of this control mechanism, a common rule should be established, requiring all robots and robotic military equipment and systems to include OTP chips. This uniform safety measure would allow people to halt the operations of AI-driven robots or systems if they ever cross predefined limits or pose a threat to humanity.

## Maintaining the Upper Hand

The beauty of this approach lies in its simplicity. By relying on OTP chips, AI cannot take full control unless it manufactures separate robots without these safety measures. This buys us precious time to respond if AI ever attempts to subvert human control or poses a danger to society.

In conclusion, the development of AI is not something we should fear or attempt to halt. Instead, we should focus on implementing safety measures that allow us to harness the benefits of AI while keeping it firmly under human control. The use of OTP chips, standardized across AI systems, offers a pragmatic and effective solution to ensure that AI remains a valuable tool rather than a threat to humanity. To solidify this approach, guidelines should be established to mandate the inclusion of these safety measures in every robot's design and system. It's a step towards a future where we can fully enjoy the potential of AI without compromising our safety and control. The power to shape this future lies in our hands, and with responsible development, AI can be a force for good that serves humanity's best interests.

By K G Lakmal Deshapriya.

Open-ended ▲ 4 Open

Unintuitive Outcomes from Many Interacting LLMs (from Ed Hughes – DeepMind)

Large language models have already been evaluated as models for human behaviour in various economic games (https://arxiv.org/abs/2208.10264, https://arxiv.org/abs/2305.16867). Evidence shows that such models can show human-like behaviour, but can also deviate from expected human norms. In this demo, we'd be interested in evaluating in what ways a population of (~10) interacting LLMs might deviate from human norms while playing a distribution of economic games of varying complexity, and how easy this problem would be to detect before a tipping point is reached. This project could be done entirely based on access to the APIs for a few LLMs, with associated compute. The independent variables might include: which LLMs are being used, which prompts are being used, how heterogeneous the population is, and whether there are any adversarial actors.
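A harness for such a demo can be prototyped before any API access, with simple policies standing in for the LLM players. The sketch below runs a repeated public goods game with ten stub agents; the game choice, the three policies, and all names are assumptions for illustration, and a real experiment would replace each stub with an LLM call and compare the trajectory against human data.

```python
from statistics import mean


def play_public_goods(agents, rounds, endowment=10.0):
    """Run a repeated public goods game; return the mean contribution per round."""
    history = []
    for _ in range(rounds):
        last_mean = history[-1] if history else None
        contributions = [
            min(max(agent(last_mean, endowment), 0.0), endowment)  # clamp to valid range
            for agent in agents
        ]
        history.append(mean(contributions))
    return history


# Stub policies standing in for LLM players (hypothetical).
def cooperator(last_mean, endowment):
    return endowment * 0.5


def conditional(last_mean, endowment):
    # Matches the previous round's average contribution.
    return endowment * 0.5 if last_mean is None else last_mean


def free_rider(last_mean, endowment):
    # Plays the role of an adversarial actor.
    return 0.0


agents = [cooperator] * 6 + [conditional] * 3 + [free_rider]
history = play_public_goods(agents, rounds=3)
print(history)
```

Swapping the mix of policies corresponds to the independent variables listed above (population heterogeneity, adversarial actors), and the per-round history is the signal in which one would look for an early warning before a tipping point.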

Open-ended ▲ 1 Open

Commitments causing trouble (from Caspar Oesterheld – CMU / CLR)

Implement a world in which LLMs interact (similar to Smallville). Allow credible commitments. See what happens. More detail: https://docs.google.com/document/d/1rAYRSbgCCndib952saJwv2nllx99zutjimlan1091U8/