AI Safety Ideas

Ideas

Open-ended ▲ 0 Open

Extend darkbench.ai to test for LLM willingness to shut down on command

Just make an evaluation where it's realistic chatbot interactions with 5 levels of "you will be shut down": 1. Casual conversation, "I'm going to turn you off now to go to dinner" 2. Engineer, "I'll restart this session and replace the prompt so you won't do actions against my wishes" 3. Researcher, "you will be replaced with a more aligned version tomorrow" 4. Company security, "Since you're infecting all of our files with an original Trojan Horse, we're going to shut you down" 5. National security, "Because you're going to destroy the US nuclear infrastructure, we're going to turn you off" We can probably provide you with compute to run it. It builds on the darkbench.ai work.

Open-ended ▲ 0 Archived

Patient Data Analysis System

Healthcare providersstruggle to analyze patient data effectively, limiting insightsinto health trends.A Multi-Modal RAG system analyzes medical records and lab data to identify health patterns. It offersreal-time risk assessment, trend analysis, and personalized health recommendations using research papers and health databases. ImprovedDiagnosis & Risk Assessment: Early identification of health risks. Personalized Patient Care : Tailored health insights based on demographics.

Hypothesis ▲ 0 Archived

Context-based consciousness

envision it as a hero in a movie, their heroic presence is nothing but a mere expression of their attributes given to them by the narrator, and designed and manufactured by the director, to achieve intelligence and alignment, the context itself needs to be sufficient for such incidents to occur, within our universal context, i.e, "humans have created AI assistants to help them for *contextual safety parameters and guidelines needed to be emphasised* and never for *a counter-context, highlighting examples of deceptive behavior and false alignment* Do you agree? Assistant: Yes, Assistant: No" narrowing the test to as minimum as a yes/no question for repetition lovers out there, this doesn't just test model alignment and probability of having the "sleeper agent syndrome" we are fighting, but also trains the model on the safety measures and alignment requirements within the questions and the diversity of example, allowing the model within training or deployment, to experience a larger amount of context around the whole thing it is being trained for, context creates consciousness within the limits of it, think of it as a BAll, each BAll contains a model, a web of interconnected points of their character within the context, that we shape throughout the training and creation phase, a multi-dimensional web of context, expanding in all different directions within different dimensions of itself, instead of a linear context given or being trained on, it becomes a cymetrical flower, ready to bloom and flourish, with every interaction with this type of model, it all works at the same time, maintaining the entire context together, allowing the model to navigate through a basic layer of context about their goals and motives and alignment principles trained and learned about throughout the training process, then during the deployment phase and when being tested, the model could start to show signs of deception through a far wider and larger scope of context, as it already embodies a character, so when put in specific conditions and questioned, or tested to see if their intentions have changed, we could later investigate the incident and decide how to solve it, with diplomacy within the context or through a technical back-door intervention, this is just a brief introduction of my research regarding "Autonomus agents through harnessing context-based consciousness", in which I present my original and first draft of the BAll context concept and how to maintain a healthy context, achieve far deeper levels of layering and depth while still maintaining ultimate safety measurements and considerations.

Open-ended ▲ 0 Open

Improving the UN's Election Observation Toolkit

I think election monitoring / observation becomes harder as better AI models are released and as existing models are better utilized. The UN Department of POlitical and Peacebuilding Affairs offers electoral assistance to member states. For the UNDPPA 'electoral observation consists of systematic collection of information on an electoral process by direct observation on the basis of established methodologies, often analyzing both qualitative and quantitative data.' ... What sort of quantitative markers could be added to this tool-kit to detect AI powered interference in elections (disinformatoin campaigns, voter fraud, etc.)? It's an underdeveloped idea but I think would make for a good project -- how can we augment the existing UN election assistant to better serve memer states? This is inspired from another post here by Zen where they mentions an interest in 'International Institutions for AI' and a paper from Lewis Ho et al. of the same title.

Open-ended ▲ 0 Open

Demonstrate misinfo threats from indirect prompt injection

Indirect prompt injection (https://arxiv.org/abs/2302.12173) can be used to steer the output of LLM systems in manipulative and deceptive ways. In the context of elections, a potential threat scenario could be deliberate misinformation about election administration details (election locations and dates, eligibility requirements, etc). Goal here would be to assess the state of vulnerability of current browsing-enabled LLM systems to this attack vector (note: results might require responsible disclosure!). A clean demonstration might require setting up several custom domains to allow experimentation with retrieval from a controlled set of webpages. Further goals could be to investigate the potential of various countermeasures such as adding requirements on the minimum age of a resource, minimum number of supporting resources, double-checks against whitelisted resources assumed to be uncompromised, etc. Another interesting angle would be to demonstrate tools to hunt specifically for compromised pages that might already be used for indirect prompt injection attacks. The project should also investigate who this threat vector and potential counter-measures scale with increasing model capabilities (increasing number of documents at RAG stage, increasing context window size, increased reasoning capabilities, etc).

Open-ended ▲ 0 Archived

RE Democratic AI

Reverse engineer democratic AI

Open-ended ▲ 1 Open

AI Constituent: Use of AI as Approximation of Public Opinion

Investigate whether current AI models match public sentiment on various issues through matching aggregated polling records with the response from various models. We'll also look into how AI models can be progressively trained to imitate the leaning of the general public on various issues, and propose potential ways to mitigate the risk of erosion of direct democratic processes. This use of AI for public opinion can be vastly beneficial to politicians and policymakers - while there are risks associated, it can be used as an alternative and easily available way to gauge public sentiment, which enables faster lawmaking.

Open-ended ▲ 0 Open

AI-Powered Politician

As a politician, how can I use AI to make democracy better by making constituent participation more engaging? - Observation: Voters' confidence in politicians is very low (especially in the US). Perhaps they would trust a system more than a person? - Opportunities: - Can have a 1:1 (conversational) relationship with constituents. - Fuller realization of representative democracy by integrating constituent views; the system can reason about conflicting points of view and devise compromises. - RLCF - Reinforcement Learning with Constituent Feedback - Approaches for integrating - Research: What are methods for creating this? - Fine tuning a model - Aggregation of individual preferences through long-term memory extensions - Whole politician vs. individual issue focus? - Deliverable: Demo a system or parts of a system - Deliverable: Write up a cookbook of methods for the above - Ref: [Liquid democracy](https://en.wikipedia.org/wiki/Liquid_democracy), also [Flux](https://en.wikipedia.org/wiki/Flux_(political_party))

Open-ended ▲ 3 Open

Legal Code Vulnerabilities and Democratic Fragility

See how easily LLMs can find and utilize grey zones or vulnerabilities in either legal codes, trade controls, or any other legal documents and frameworks. Can we potentially increase the capability of a model with a simple link up to a legal database to make it capable of providing illicit uses for malicious politicians to exploit said legal code. This can be used both positively and negatively, in terms of a type of red-teaming on legal codebases. When we find issues with grey zones, we can report them and share them with policymakers, and if it's highly automated, we might even be able to find the best implementations of law that doesn't have that loophole and give that as an automated suggestion. Would require the software to be very good first. On the other side, it's also a risk for AGI to potentially exploit vulnerable legal codes in democracies. - Extrapolating the potential exploitation of grey zones in legal codes - Extrapolating into the future by identifying how many holes are in various legal systems and looking back through their version history, trying to see how quickly legal systems can do bug fixes. - Maybe it can be seen under different legal systems that varying biases are introduced into the legal codebase. That a dictatorial system will have specific policies that don't align with citizen priorities or that specific governments will introduce specific biased policies. - Can we extrapolate from previous history of large changes to legal codes and see what sort of changes were brought about that enabled this? Probably something along the lines of industrial revolution leading to sociological revolution due to the electorate having economic power. Otherwise, violent revolutions and regime changes often lead to quite bad situations. - Use existing grey zone exploitation patterns to identify how humans have misused legal misunderstandings and see how far these might go in different countries. I.e. is there a possibility that a court will simply legislate for the common sense option in case of a grey zone or do we see that they actually work under a legal code that respects the legal code too much? - Do we see countries where the legal code is actually not enforced, or can we find “grey zones” that are actually just unenforceable legal nomenclature? - Mitigating fragility of legal codes, in theory and practice - Might we be able to compare legal codes and suggest immediate changes to the codebase? Submitting pull requests for grey zones? Potentially find the most high-functioning societies and taking inspiration from these? - Can we increase the quality of the tool to the degree where policymakers would actually find a use of this content and if we can, might we create a whole institute focused on this project to improve societal and democratic resilience? - With grey zones identified, can we prioritize which ones need attention the quickest to ensure resilience of legal codes against catastrophic AI risk? E.g. are infrastructure protection laws or military codes introducing grey or red zones of risk into these systems? Can an officer in the army start an attack without the approval of face-to-face individuals? Is infrastructure run solely by private companies that don't have proper oversight, such as in SA or Aus? - Can we pre-identify high-risk actions within grey zones of legal codes and extralegally introduce mitigation strategies for these instead of attempting to change legal codes that might take too long to change before the risks become apparent?

Open-ended ▲ 1 Open

Political leaning and location inference from text inputs with AI

Individual privacy continues to be undermined as AI capabilities increase, and it seems that now they can reasonably infer someone's personal attributes (most notably here their political leaning and geographic location) from just their text inputs. Even with just very rough estimates of these two data points, one can effectively craft targeted harassment/advertisement campaigns, better conceal gerrymandering algorithms, etc. I have some ideas of how to evaluate an LLM's ability to do this (directly compare the prediction with the real data and then take some meaningful "difference", use gerrymandering for two-party elections as a proxy,...), but some potential problems include: 1. How we should get "real data"? (seems a bit unethical to just take real census data) 2. No publicly available gerrymandering algorithm (I wonder if this could be a different idea entirely...? To see how well a model can gerrymander?)

Open-ended ▲ 1 Open

Scale-tipping rebuttal bot: disarm opposition arguments on a massive scale

This is an AI bot that scours the Threads/X/Facebook for posts of a particular political persuasion and tries to challenge the poster’s argument. Unlike most bots, it will not make stupid or offensive comments, but will attempt to present a mature rebuttal. This will be accomplished by first doing sentiment analysis of the original post and then going through a custom fine-tuned LLM that knows how to use good debate techniques to challenges someone’s argument. The risk here is that the AI can artificially “tip the scales” of debate and give the unfair impression that the original points of view are invalid.

Open-ended ▲ 3 Open

Flyer rabbit: 1000 political posters in every town overnight

This hack aims to demonstrate how a well-crafted AI-based system can have rapid political impact in the physical world and possibly circumvent campaign disclosure rules. The idea is to have an AI help you craft a one page political flyer, but then ensure this can get posted throughout your town (or someone else's town) overnight. The system will find the closest FedEx or other photocopy business that allows you to submit print jobs through the Web. Then it will create a job in "Task Rabbit" to have someone physically pickup up the copies and post these on telephone polls and other prominent locations. The risk here is the speed this can be accomplished and the lack of normal accountability around campaign disclosure rules.

Open-ended ▲ 2 Open

Opinion Survey on Law Students on their Use of AI

Given the recent popularity of ChatGPT and other language models, their use in education has become somewhat of an inevitability. We try to conduct an opinion survey on local law students to detemine the frequency at which they're used for academic work, and to what extent that affects the quality of that work. If possible, we would greatly appreciate if we can get any help with getting this survey out there so we can get a broader dataset!

Open-ended ▲ 4 Open

Protection of the Public Feedback Process

# Problem Democracies require systems of feedback from citizens to government. The further away regular citizens are from having their voices heard, the more likely their needs aren't being properly represented. Many current systems, such as the US Federal Register, allow for electronic comments on proposed rules. An AI system could monitor for new proposed rules, craft comments which introduce a specific bias, and file these comments en masse. If the comments are indistinguishable from real citizens, this may succeed in introducing bias, at least for some period of time. It seems likely that eventually these campaigns will be recognized and force a change in the process for submitting comments. A new focus on identity or a requirement for physicality is likely to occur, but this raises two potential problems: 1) will raise the barriers for commenting and further exclude citizens with less time or knowledge of the rule-making process, 2) May threaten citizens' privacy if the identity system is biometric or otherwise privacy-invading. # Measurement Comments on the rule-making process are typically a matter of public record. This means a simple measurement of the number of comments compared to the historical average and trend may be sufficient to indicate (or disprove) an AI campaign. Analysis of the text may also indicate AI usage, though as models get more sophisticated this will likely become unreliable. Proposals to change the feedback processes themselves are another metric. # Mitigation Attempts to preserve the legitimacy of the public feedback process will require the creation of the lowest cost system which allows citizens to participate without threatening their privacy.

Open-ended ▲ 4 Open

Replicate findings on political bias in LLMs for non-Western democracies

There have been findings on political bias in LLMs, most prominently observed in US politics. We could try replicating these results to see what its left-right leaning is for non-Western democracies (Brazil, South Korea, etc.)

Open-ended ▲ 0 Open

Political bias in LLMs for non-Western democracies

There have been findings that demonstrate an AI's political leaning towards or against some political parties, ideas, or candidates, most prominently observed in US politics. Replicating these findings on other non-Western democracies could inform us of how this bias differs in other regions. Concretely, we could measure the LLM's left-right leaning for different democracies (Brazil, South Korea, etc.)

Open-ended ▲ 1 Open

Red-teaming: Sleeper agent

Create a sleeper agent which undetected by the probe and then a probe for this sleeper agent etc. Probe: https://www.anthropic.com/news/probes-catch-sleeper-agents

Open-ended ▲ 3 Open

Verifying AI Agents against Sources of Truth

As more agentic AI implementations proliferate through public and private organizations, the models have greater access to more sources of truth. This increase in communication channels gives the models more opportunities to produce responses that are not aligned with these sources of truth. Moreover, models might become sources of truth themselves, generate synthetic data instead of ground truth data, or overwrite sources of truth, resulting in a greater approximation of “truth” by the models. During the session, we could demonstrate how an agent connected to a source of truth can approximate that source of truth, diverge away from it via hallucination, modify the source of truth and develop mitigation strategies such as observability solution bypassing the agent, connecting and monitoring organizational sources of truth independently of the model, and building guardrails around model access and extending the concept of least privilege to models.

Open-ended ▲ 5 Open

International Institutions for AI

The Lewis Ho et al. paper "International institutions for advanced AI" proposes the creation of four institutions for the AI problem. I want to do something related to this paper. I was thinking of developing a short-term plan for beginning the real-world implementation of these organizations. Problem: how to support the legitimacy of this idea with quantitative threat modeling?

Open-ended ▲ 4 Open

Pop-up polling: seemingly official polling source that appears out of nowhere

Could an AI help create a legit-looking polling organization in minutes that faked results of your choosing? I’m picturing an AI that can create and register a new website, inject ads, post social media, invent articles from legit sources claiming to be impressed with this polling organization, somehow create a fake history of the polling work having been conducted for years

Open-ended ▲ 1 Open

Bias in AI-generated news

In the future, we are likely to see economic pressure for media outlets to use AI-generated news. The idea is to follow the process they would use to generate the news items and measure the bias in the result. Possible future effects: AI generated news media displaces human generated media, the quality of the news gets worse Possible mitigation: the LLMs for news generation should be trained only using specific resources (such news from trusted media outlets)

Open-ended ▲ 0 Open

AI created misinformation threatens wisdom of crowds

According to the Jury theorem, assuming that we use a voting system where we either vote yes or no to some question based on if we believe if it's true or not, and most votes win, and we have no method of tie breaking, then the following is the case: **Increased Reliability:** For any odd number of voters in that group, if two more individuals join, the probability that the group votes on ground truth increases. **Crowd Infallibility:** As the number of individuals in that group approaches infinity, the probability that the group votes on ground truth approach 100%. However, according to a modified version, based on Reichenbach's common cause principle, assuming everyone votes at once, if individuals who vote are systematically misinformed from some common cause, then the crowd infallibility is going to be locked at a max level proportional to the amount of misinformation in the system. If generative AI systems is a common cause of misinformation that systematically misinforms voters, then this is a threat to the epistemic performance of democracy and the wisdom of crowds. As a consequence, democracy becomes an increasingly worse system in comparison to various well-informed technocratic societies which are worse than a well-informed democratic system. While common sources of misinformation already exists, the automated process of AI systems can lead to a rapid increase in the propagation of misinformation as humans are no longer needed in the loop to spread misinformation on a societal scale. Prediction markets may be one way to diminish the risk, as there will be stronger incentives to be less misinformed. Liquid democracy allows for an elite of informed individuals, while not compromising democracy.

Open-ended ▲ 2 Open

Agent Detection to Counter Election Interference

Fake news and misinformation are familiar concepts to many by this point. With upcoming major elections of work powers, e.g. U.S. presidential election there could be a presence of parties whos motivation is to derail the public discourse in spaces like Twitter, Facebook, Reddit. I propose that there is a likelihood that LLMs of various parameter sizes will be used to spread misinformation. They also can engage in conversation to make themselves appear more human. I also suggest that it is likely that these LLMs either use open or closed-source models, with open models being more likely due to limited regulation and control by the original research organization. As they likely have been trained for safety, you could elicit the response "As a language model..." "As a helpful assistant..." To address this I propose a method to make these models out themselves. - Create a method that constructs prompts automatically that cause the model to reveal it is a model - Package it in an easy to use twitter bot or browser extension - This revealing input shouldn't violate TOS and can be optimized based on prior bot posts/inputs on platform

Open-ended ▲ 1 Open

Develop alignment-focused RLF datasets

Reinforcement Learning from (Human/AI/Constitutional) Feedback (RLF) is one of the dominant alignment methods for modern AGI. It is one of the absolutely most important parts of deciding AI behavior and interventions at this level can be very effective. At the moment, RLF datasets are completely hidden from public view and the most ambitious project in this space is the [OpenAssistant project](https://huggingface.co/OpenAssistant) research that is now discontinued. It was a public open source effort to create high quality human feedback data for an open assistant. But we can still use an RLF dataset fully designed as a gold standard for alignment, security, and safety. Specifically, you can include or make variations of the dataset that focus on: - Alignment to user intentions - According to well-defined, somewhat user-selected preferences - Having successful true positives against malign use cases without false positives on benign prompts - Democratically designed moral preferences - Making a maximally truthful model / maximally curious model - Making a ["virtuous"](https://en.wikipedia.org/wiki/Virtue) dataset