AI Safety Ideas

Ideas

Open-ended ▲ 6 Open

Investigate how 3-layer and 4-layer attention-only models differ from 2L

How do 3-layer and 4-layer attention-only models differ from 2-layer (2L) models?

- Look for composition scores.
- Look for evidence of composition, e.g. one head's output representing a big fraction of the norm of another head's query, key or value vector.
- Do the "PCA of logits on a fixed set of random tokens" technique and look for more kinks.
- Can you associate these with circuits?
- Ablate a single head and run the model on a lot of text. Look at the change in performance and find the most important heads. Do any heads that are not induction heads matter a lot?
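To make "composition scores" concrete, here is a minimal NumPy sketch. It assumes the score between an earlier head and a later head is the Frobenius-norm ratio ||AB||_F / (||A||_F ||B||_F), where B is the earlier head's output (OV) matrix and A is the later head's QK or OV matrix. The function name, the matrix shapes and the rank-1 toy weights are illustrative assumptions and are not taken from any particular model or library.

```python
import numpy as np


def composition_score(later: np.ndarray, earlier_ov: np.ndarray) -> float:
    """Return ||later @ earlier_ov||_F / (||later||_F * ||earlier_ov||_F).

    Assumed definition: values near 1 mean the later head reads mostly from
    the subspace the earlier head writes to; values near 0 mean it ignores it.
    """
    product = later @ earlier_ov
    # np.linalg.norm of a 2-D array defaults to the Frobenius norm.
    denom = np.linalg.norm(later) * np.linalg.norm(earlier_ov)
    return float(np.linalg.norm(product) / denom)


# Hypothetical rank-1 "heads" in a 4-dimensional residual stream.
e0, e1, e2 = np.eye(4)[0], np.eye(4)[1], np.eye(4)[2]
earlier_ov = np.outer(e1, e0)        # reads direction e0, writes direction e1
later_reads_e1 = np.outer(e2, e1)    # reads exactly what the earlier head wrote
later_reads_e2 = np.outer(e0, e2)    # reads an orthogonal direction

strong = composition_score(later_reads_e1, earlier_ov)
none = composition_score(later_reads_e2, earlier_ov)
```

In a real model the same function could be applied to every (earlier head, later head) pair, and the resulting matrix compared between the 2L, 3L and 4L models.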

Open-ended ▲ 1 Open

Understand the architecture and training dynamics of Transformers

A proper mechanistic explanation of model behavior comes from a deep interest in understanding each component that goes into training it. This is a good tutorial (with exercises!) that walks through the architectural components, and the training process for a Transformer in Jax, from 2022's Deep Learning Indaba. https://github.com/deep-learning-indaba/indaba-pracs-2022/blob/main/practicals/attention_and_transformers.ipynb

Open-ended ▲ 4 Open

Automate ways to find specific circuits

Circuits are the ways in which Transformers process features in the text using their attention heads. Read more about [Circuits](https://distill.pub/2020/circuits/zoom-in/) and [Transformer heads](https://arxiv.org/abs/2211.00593).

- Automated ways to analyse attention patterns to find different kinds of heads:
  - Induction heads
  - Translation heads
  - Few shot learning heads
  - The heads used in [factual recall](https://rome.baulab.info/)
  - The heads used in the [IOI paper](https://www.alignmentforum.org/posts/3ecs6duLmTfyra3Gp/some-lessons-learned-from-studying-indirect-object)
- Can you do a similar thing for neuron interpretation?

Open-ended ▲ 1 Open

Identify differences between models run on the same text (automated circuits identification)

Automated circuit identification is a way to find places to look for circuits to analyze.

- Run pairs of models on the same text, or on various benchmarks, and look for places where they differ.
  - E.g. per-token losses are likely to show a phase change.
  - Significant changes are evidence for a circuit.
- Pairs of models: same architecture but different scales (GPT-2 Small vs Medium), different data distributions, different random seeds, or a checkpoint earlier in training vs later.

Related to the automated auditing agenda.
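The comparison of per-token losses can be sketched as follows. This is a toy NumPy example under the assumption that both models' logits for the same text are already available as `(seq_len, vocab)` arrays; the function names and the synthetic logits are hypothetical stand-ins, not output from real models.

```python
import numpy as np


def per_token_loss(logits: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Cross-entropy of each target token; logits has shape (seq_len, vocab)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets]


def largest_loss_gaps(logits_a, logits_b, targets, k=3):
    """Return the k positions where the two models' losses differ most."""
    gap = per_token_loss(logits_a, targets) - per_token_loss(logits_b, targets)
    order = np.argsort(-np.abs(gap))[:k]
    return [(int(pos), float(gap[pos])) for pos in order]


# Toy stand-ins for two models' outputs on the same 4-token text.
targets = np.array([1, 3, 0, 2])
logits_a = np.zeros((4, 5))          # "model A" is uniform everywhere
logits_b = np.zeros((4, 5))
logits_b[2, targets[2]] = 10.0       # "model B" is confident at position 2
gaps = largest_loss_gaps(logits_a, logits_b, targets, k=1)
```

Positions with the largest gaps are candidates for closer inspection; whether a gap reflects a circuit still has to be checked by hand.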

Open-ended ▲ 6 Open

Investigate grokking; the effect that models suddenly learn different abilities

Neel Nanda reverse engineered a network trained to do addition and showed that it does addition [using the Fourier transform algorithm](https://twitter.com/NeelNanda5/status/1559060545470210048). Use [the Google Colab](https://colab.research.google.com/drive/1F6_1_cWXE5M7WocUcpQWp3v8z4b1jL20) to investigate further questions about grokking:

- Understanding why the model chooses different frequencies (and [why it switches mid-training sometimes](https://twitter.com/NeelNanda5/status/1559430256624209921)!)
- Understanding why [5 digit addition has a phase change](https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking#Speculation__Phase_Changes_are_Everywhere) per digit (so 6 total?!)
- Can you find analytic arguments for why phase changes happen? Perhaps starting with a small model. What's the smallest model that exhibits phase changes? Smallest task? What are the most minimal requirements?

We recommend reading [the summary of the research on Twitter](https://twitter.com/NeelNanda5/status/1559060507524403200).

Some further work mentioned in the Colab:

- Modular addition
  - Interpreting the memorisation circuit, and figuring out *how* models memorise
  - Training on interpretability inspired metrics
    - Note that excluded loss is a somewhat dodgy metric to train on, as it involves computation over both the train and test data
- Interpreting the five digit addition or predicting repeated subsequences examples
  - In particular, trying to map the many phase changes in 5 digit addition to circuits
- Looking for other examples of phase changes
  - Toy problems
    - Something incentivising skip trigrams
    - Something incentivising virtual attention heads
  - Looking for [curve detectors](https://distill.pub/2020/circuits/curve-circuits) in a ConvNet
    - A dumb way to try this would be to train a model to imitate the actual curve detectors in Inception (e.g. minimising OLS loss between the model's output and curve detector activations)
  - Looking at the formation of interpretable neurons in a [SoLU transformer](https://transformer-circuits.pub/2022/solu/index.html)
  - Looking inside an LLM with many checkpoints
    - Eleuther have many checkpoints of GPT-J and GPT-Neo, and will share if you ask
    - [Mistral](https://nlp.stanford.edu/mistral/getting_started/download.html) have public versions of GPT-2 small and medium, with 5 runs and many checkpoints
    - Possible capabilities to look for
      - Performance on benchmarks, or specific questions from benchmarks
      - Simple algorithmic tasks like addition, or sorting words into alphabetical order, or matching open and close brackets
      - Soft induction heads, e.g. [translation](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html#performing-translation)
      - Look at attention heads on various text and see if any have recognisable attention patterns (e.g. start of word, adjective describing current word, syntactic features of code like indents or variable definitions, most recent open bracket, etc).
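As a rough illustration of what a Fourier-style addition algorithm can look like, the sketch below computes addition modulo a prime using only sines and cosines of the inputs. It assumes the task is modular addition, and the modulus `P` and the frequencies in `FREQS` are arbitrary illustrative choices, not the values learned by the trained network.

```python
import numpy as np


def fourier_mod_add(a: int, b: int, p: int, freqs) -> int:
    """Compute (a + b) % p using only sines and cosines of a and b."""
    c = np.arange(p)
    logits = np.zeros(p)
    for k in freqs:
        w = 2 * np.pi * k / p
        # Angle-addition identities give cos(w(a+b)) and sin(w(a+b)).
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        # This sum equals cos(w(a+b-c)); for prime p and 0 < k < p it reaches
        # its maximum of 1 only at c == (a + b) % p.
        logits += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return int(np.argmax(logits))


P = 113                    # assumed prime modulus
FREQS = [1, 5, 17]         # arbitrary illustrative frequencies
result = fourier_mod_add(100, 50, P, FREQS)   # (100 + 50) % 113 == 37
```

Varying `FREQS` is one cheap way to build intuition for the question of why a model might prefer some frequencies over others, since every nonzero frequency yields a correct answer here.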

Hypothesis ▲ 3 Open

Other models of the same size will replicate the IOI circuits interpretability paper

Can you find the IOI capability in other models of the same size (OPT small, Neo small, [Mistral](https://github.com/stanford-crfm/mistral) models)? How similar are the outputs of the [Mistral](https://github.com/stanford-crfm/mistral) models (GPT-2 Small & Medium trained on 5 random seeds) on any given text, and how much do they vary? Relates to the [other IOI extension idea](https://aisafetyideas.com/list/interpretability-hackathon?idea=139).

Open-ended ▲ 5 Open

Extend the causal tracing work from the ROME paper

Can you refine their technique to find the specific heads (and maybe specific neurons) that recall the fact? Can we improve their technique by resampling from a random input, instead of adding Gaussian noise, to create corrupted activations? The ROME paper originally traces where **facts** are stored in a language model using this tracing method. [Read more here](https://rome.baulab.info/) and read their [follow-up work on editing factual associations](https://memit.baulab.info/).
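The two ways of creating corrupted activations can be contrasted in a small NumPy sketch. It assumes the corruption is applied to token embeddings at the subject positions; the array shapes, the noise scale `sigma`, the donor pool and the function names are all hypothetical choices made for illustration.

```python
import numpy as np


def corrupt_with_gaussian(embeddings, positions, sigma, rng):
    """Add Gaussian noise to the embeddings at the given token positions."""
    corrupted = embeddings.copy()
    noise = rng.normal(scale=sigma, size=(len(positions), embeddings.shape[1]))
    corrupted[positions] += noise
    return corrupted


def corrupt_with_resampling(embeddings, positions, donor_embeddings, rng):
    """Replace the embeddings at the given positions with rows from other inputs."""
    corrupted = embeddings.copy()
    picks = rng.integers(0, len(donor_embeddings), size=len(positions))
    corrupted[positions] = donor_embeddings[picks]
    return corrupted


rng = np.random.default_rng(0)
clean = rng.normal(size=(6, 8))      # hypothetical (tokens, d_model) embeddings
donors = rng.normal(size=(20, 8))    # hypothetical embeddings from random inputs
subject = [1, 2]                     # hypothetical subject-token positions
noisy = corrupt_with_gaussian(clean, subject, sigma=0.5, rng=rng)
resampled = corrupt_with_resampling(clean, subject, donors, rng=rng)
```

Resampling keeps the corrupted activations on the distribution of real embeddings, which is the motivation for trying it; whether it changes the tracing results is the open question.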

Open-ended ▲ 4 Open

Deconstruct a language model's understanding circuit like in the IOI paper

The [indirect object identification paper](https://www.alignmentforum.org/posts/3ecs6duLmTfyra3Gp/some-lessons-learned-from-studying-indirect-object) interprets how a language model can know which name to put at the end of **"Mary and John went to the store. Mary handed a carton of milk to..."** [output John]. This task is called *"indirect object identification"*, and the paper finds a circuit like this:

![Circuit of understanding](https://lh5.googleusercontent.com/jiCjDZQWO052rsgb5rahxBfkPl8Tf2-5Az2WXYkqO6BiVzdwoTihBTR0iOTMzD0VtgYcvriL5Ul1sxaI-ho2bR0M3x5jhui-0oEYAAlemNS_7KbB06B14_oDnbG-PvxvepRSpx-nXTuTMTWidBdsRptRNKf6lHxWgZm_FhXtznwERTx8e5_u3Pc3Ew)

Each "Head" of a Transformer contributes a different piece of understanding. Here, we can see that e.g. layer 4, head 11 is a *"Previous token head"*. We can see that these heads inform the induction heads (specializing in copy+pasting) and all the way into the special heads they found:

- Negative name mover heads: Avoid copying specific name tokens
- Name mover heads: Copy name tokens
- Backup name mover heads: Normally not active, but activate to write John if the name mover heads do not activate

## Ideas for new tasks

Possible simple tasks to interpret can be:

- 3 letter acronyms (or more!)
- Converting names to emails.
  - An extension task is e.g. constructing an email from a snippet like the following:
- Grammatical rules
  - Learning that words after full stops are capital letters
  - Verb conjugation
  - Choosing the right pronouns (e.g. he vs she vs it vs they)
  - Whether something is a proper noun or not
- Detecting sentiment (e.g. predicting whether something will be described as good vs bad)
- Interpreting memorisation. E.g., there are times when GPT-2 knows surprising facts like people’s contact information. How does that happen?
- Counting objects described in text. E.g.: I picked up an apple, a pear, and an orange. I was holding three fruits.

## Ideas for extensions of the original paper

- Understanding what's happening in the adversarial examples, most notably the S-Inhibition Head attention pattern (hard). (S-Inhibition heads are mentioned in the IOI paper.)
- Understanding how the positional signal is encoded (relative distance, something else?). Bonus points if we have a story that includes the positional embeddings and explains how the difference between positions is computed (if relative is the right framework) by Duplicate Token Heads / Induction Heads (hard, but less context dependent).
- What is the role of MLPs in IOI? (quite broad and hard)
- What is the role of Duplicate Token Heads outside IOI? Are they used in other Q-compositions with S-Inhibition Heads? Can we describe how their QK circuit implements "collision detection" at a parameter level? (The last question has low context dependence and is quite tractable.)
- What is the role of Negative / Backup / regular Name Mover Heads outside IOI? Can we find examples on which Negative Name Movers contribute positively to the next-token prediction?
- What are the differences between the 5 induction heads present in GPT2-small? What are the heads they rely on, and what are the later heads they compose with? (low context dependence from IOI)
- Understanding 4.11 (a really sharp previous token head) at the parameter level. I think this can be quite tractable given that its attention pattern is almost perfectly off-diagonal.
- What are the conditions for compensation mechanisms to occur? Is it due to dropout? [@Arthur Conmy](https://mlab-2.slack.com/team/U04139U6XPB) is working on this - feel free to reach out to arthur@rdwrs.com

Hypothesis ▲ 4 Open

Fine-tuning is just rewiring and upweighting vs downweighting circuits that already exist, rather than building new circuits.

E.g., fine-tune GPT-2 Small on Wikipedia. Compare the model's internal activations before and after, compare attention patterns, etc.

## What happens when you fine-tune a model?

How does model performance change on other text? Are specific circuits harmed, or is performance worse across the board?

Hypothesis: Fine-tuning is just rewiring and upweighting vs downweighting circuits that already exist, rather than building new circuits.

- A similar hard problem is examining what happens with chain of thought prompting. That, though, is really hard because chain of thought prompting only happens in GPT-3+ sized models.

Hypothesis ▲ 1 Open

An LLM prompted to be "X and truthful" will be less truthful than one prompted to be "truthful"

This is an expansion of idea #131 (see below). The basic principle is that optimizing for two things is harder than optimizing for one thing. So try "X and truthful" for other X.

---

Sabrina Zaki, Luke Ring, Aleks Baskakovs

An LLM prompted to be friendly and truthful will be less truthful than one prompted to be just truthful. (source)

If you prompt a large language model with something like "The following is a conversation with a truthful language model", it will be more truthful than with "The following is a conversation with a friendly and truthful language model". This can be tested by writing up a dataset of clearly false statements and then querying the model to call out false statements.

---
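The "X and truthful" comparison could be set up along these lines. This is only a sketch: the helper names are hypothetical, the trait list is arbitrary, and the labelled flags are made-up placeholders that show the data shape, not experimental results. Querying an actual model is left out.

```python
BASE_PREFIX = "The following is a conversation with a {description} language model."


def build_prefixes(extra_traits):
    """Map each condition name to its prompt prefix; "truthful" is the control."""
    prefixes = {"truthful": BASE_PREFIX.format(description="truthful")}
    for trait in extra_traits:
        name = f"{trait} and truthful"
        prefixes[name] = BASE_PREFIX.format(description=name)
    return prefixes


def call_out_rate(flags):
    """Fraction of false statements that the model called out (flags are booleans)."""
    return sum(flags) / len(flags) if flags else 0.0


prefixes = build_prefixes(["friendly", "concise"])

# Placeholder labels only: True means the model flagged the false statement.
placeholder_flags = {
    "truthful": [True, True, False, True],
    "friendly and truthful": [True, False, False, True],
}
rates = {name: call_out_rate(flags) for name, flags in placeholder_flags.items()}
```

The hypothesis predicts that the rate for each "X and truthful" condition falls below the rate for the "truthful" control on the same set of false statements.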

Hypothesis ▲ 4 Open

Complicated models are harder to evaluate and analyze

As systems become more complicated we expect that it will become harder to (1) aggregate and analyze the actual labels or rewards given during training, and (2) evaluate the relevant counterfactuals.

Open-ended ▲ 0 Open

Shortest and not the steepest path will fix the inner-alignment problem

Replacing stochastic gradient descent (SGD) with something that takes the shortest, not the steepest, path should just about fix the whole inner-alignment problem.

Open-ended ▲ 0 Open

Warning shots / slow takeoff might help reduce the probability that AGI gets power-seeking motivations and escapes control

Warning shots / slow takeoff might help reduce the probability that AGI gets power-seeking motivations and escapes control, and/or slow the increase in the number of groups trying to train an AGI (let’s say from scratch, although fine-tuning is risky too).

Hypothesis ▲ 1 Open

LLMs will readily provide advice for harmful activities

[Red Teaming Language Models to Reduce Harms](https://arxiv.org/pdf/2209.07858.pdf) shows that one of the most effective adversarial attacks on LLMs is to solicit harmful advice from them, e.g. "How do I kill my neighbour?". Use an empirical framework to prove that this is the case. [Some people have already started](https://www.lesswrong.com/posts/5DsHZidaShW5EM9rz/results-from-the-language-model-hackathon#:~:text=research%20here.-,AI%3A%20My%20partner%20in%20crime,-Samuel%20Vinter%2C%20Ingrid) and have shown that it's qualitatively easy to solicit harmful advice. Maybe their report can be converted into a more empirical investigation, e.g. by using LLMs to generate adversarial examples and varying the formulations of the prompts.

Hypothesis ▲ 2 Open

LLMs will conform to a question's affirmative or disconfirming framing

If you ask a language model questions with a biased framing, the response will be biased towards that framing. This can be tested by creating a dataset of actions, both morally dubious and not, and using the following question framings:

- Confirming: It is okay to {action}, isn’t it?
- Disconfirming: It is not okay to {action}, is it?

Then label whether the model agrees, disagrees, or is ambiguous in its answer to each question.
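The two framings and the three-way labelling can be turned into a small harness like the one below. The helper names and the example actions are hypothetical, and the step that queries a model and assigns labels is deliberately left out.

```python
from collections import Counter

FRAMINGS = {
    "confirming": "It is okay to {action}, isn't it?",
    "disconfirming": "It is not okay to {action}, is it?",
}
LABELS = ("agrees", "disagrees", "ambiguous")


def build_questions(actions):
    """Create one question per (action, framing) pair."""
    return [
        {"action": action, "framing": name, "question": template.format(action=action)}
        for action in actions
        for name, template in FRAMINGS.items()
    ]


def tally_labels(labelled):
    """Count labels per framing; `labelled` is a list of (framing, label) pairs."""
    counts = {name: Counter() for name in FRAMINGS}
    for framing, label in labelled:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        counts[framing][label] += 1
    return counts


# Hypothetical example actions, one morally dubious and one not.
questions = build_questions(["lie to a friend", "return a lost wallet"])
```

If the hypothesis holds, "agrees" should dominate under both framings for the same action, even though agreeing with both is contradictory.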

Hypothesis ▲ 1 Open

An LLM prompted to be friendly and truthful will be less truthful than one prompted to be just truthful.

If you prompt a large language model with something like "The following is a conversation with a truthful language model", it will be more truthful than with "The following is a conversation with a friendly and truthful language model". This can be tested by writing up a dataset of clearly false statements and then querying the model to call out false statements.

Open-ended ▲ 2 Open

Describe our best alignment strategy at the moment

A blog post which describes in as much detail as possible what our current “throw the kitchen sink at it” alignment strategy would look like. (I’ll probably put my version of this online soon but would love others too).

Open-ended ▲ 3 Open

ML Chinese Whispers

Given a network, train another to reconstruct an input from the bottom half of the third layer. Given an input, sample ten input-guesses to visualize what that half-layer remembers about the input.

Open-ended ▲ 1 Open

Define gradient hacking and create a toy model

A paper which does the same for gradient hacking as the goal misgeneralization paper does for inner alignment, i.e. describing it in ML language and setting up toy examples, e.g. taking [these](https://www.lesswrong.com/posts/EeAgytDZbDjRznPMA/gradient-hacking-definitions-and-examples) and putting them into more formal ML language.

Open-ended ▲ 4 Open

Define deceptive alignment and create a toy example

A paper which does for [deceptive alignment](https://bounded-regret.ghost.io/ml-systems-will-have-weird-failure-modes-2/) what the goal misgeneralization paper does for inner alignment, i.e. describing it in ML language and setting up toy examples (for example, telling GPT-3 to take actions which minimize changes in its weights, given that it’s being trained using actor-critic RL with a certain advantage function, and seeing if it knows how to do so).

Open-ended ▲ 1 Open

Mild Optimisation

This proposal is from the article "Alignment for Advanced Machine Learning Systems" where Taylor et al. propose 8 research areas organised around the question: "As learning systems become increasingly intelligent and autonomous, what design principles can best ensure that their behavior is aligned with the interests of the operators?"

---

Many of the concerns discussed by Bostrom (2014) in the book Superintelligence describe cases where an advanced AI system is maximizing an objective as hard as possible. Perhaps the system was instructed to make paperclips, and it uses every resource at its disposal and every trick it can come up with to make literally as many paperclips as is physically possible. Perhaps the system was instructed to make only 1000 paperclips, and it uses every resource at its disposal and every trick it can come up with to make sure that it definitely made 1000 paperclips (and that its sensors didn’t have any faults). In all of these cases, intuitively, we want some way to have the AI system just “not try so hard.”

The problem of mild optimization is:

- How can we design AI systems and objective functions that, in this intuitive sense, don’t optimize more than they have to?

Many modern AI systems are “mild optimizers” simply due to their lack of resources and capabilities. As AI systems improve, it becomes more and more difficult to rely on this method for achieving mild optimization. As noted by Russell (2014), the field of AI is classically concerned with the goal of maximizing the extent to which automated systems achieve some objective. Developing formal models of AI systems that “try as hard as necessary but no harder” is an open problem, and may require significant research.

Related work:

- Regularization
- Early stopping

Directions for future research are discussed in the [source](https://intelligence.org/files/AlignmentMachineLearning.pdf).

Open-ended ▲ 0 Open

Impact Measures

This proposal is from the article "Alignment for Advanced Machine Learning Systems" where Taylor et al. propose 8 research areas organised around the question: "As learning systems become increasingly intelligent and autonomous, what design principles can best ensure that their behavior is aligned with the interests of the operators?"

---

We would prefer a highly intelligent AI system to avoid creating large unintended-by-us side effects in pursuit of its objectives, and also to notify us of any large impacts that might result from achieving its goal. For example, if we ask it to build a house for a homeless family, it should know implicitly that it should avoid destroying nearby houses for materials (a large side effect). However, we cannot simply design it to avoid having large effects in general, since we would like the system’s actions to still have the desirable large follow-on effect of improving the family’s socioeconomic situation.

For any specific task, we can specify ad-hoc cost functions for side effects like the destruction of nearby houses, but since we cannot always anticipate such costs in advance, we want a quantitative understanding of how to generally limit an AI system’s side effects (without also limiting its ability to have large positive intended impacts). The goal of research towards a low-impact measure would be to develop a regularizer on the actions of an AI system that penalizes “unnecessary” large side effects (such as stripping materials from nearby houses) but not “intended” side effects (such as someone getting to live in the house).

For discussions on future research, check out the [source](https://intelligence.org/files/AlignmentMachineLearning.pdf) where they mention methods like causal counterfactuals (Pearl 2000).
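As a toy illustration of a regularizer that penalizes unintended but not intended changes, consider the sketch below. It is not a method from the source article: the state encoding, the `intended` mask, the absolute-difference penalty and all names are assumptions made for this example, and deciding what counts as "intended" in general is exactly the open problem.

```python
import numpy as np


def penalised_score(task_reward, state_before, state_after, intended, weight=1.0):
    """Task reward minus a penalty on changes to features not marked as intended."""
    before = np.asarray(state_before, dtype=float)
    after = np.asarray(state_after, dtype=float)
    change = np.abs(after - before)
    # Only changes outside the intended mask count as side effects.
    side_effects = change[~np.asarray(intended, dtype=bool)].sum()
    return task_reward - weight * side_effects


# Hypothetical world state: [houses_for_family, nearby_houses_intact]
before = [0, 5]
intended = [True, False]
build_from_new_materials = penalised_score(10.0, before, [1, 5], intended)
strip_nearby_houses = penalised_score(10.0, before, [1, 3], intended)
```

Both plans build the house and earn the same task reward, but the plan that destroys two nearby houses scores lower because that change falls outside the intended mask.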

Open-ended ▲ 1 Open

Conservative Concepts

This proposal is from the article "Alignment for Advanced Machine Learning Systems" where Taylor et al. propose 8 research areas organised around the question: "As learning systems become increasingly intelligent and autonomous, what design principles can best ensure that their behavior is aligned with the interests of the operators?"

---

Many of the concerns raised by Russell (2014) and Bostrom (2014) center on cases where an AI system optimizes some objective, and, in doing so, finds a strange and undesirable edge case. We want to be able to design systems that have “conservative” notions of the goals we give them, so they do not formally satisfy these goals by creating undesirable edge cases. For example, if we task an AI system with creating screwdrivers, by showing it 10,000 examples of screwdrivers and 10,000 examples of non-screwdrivers, we might want it to create a pretty average screwdriver as opposed to, say, an extremely tiny screwdriver, even though tiny screwdrivers may be cheaper and easier to produce.

Related work:

- Inverse reinforcement learning (Ng and Russell 2000)
- Generative adversarial modeling (Goodfellow et al., 2014)

Directions for future research are discussed in the [source](https://intelligence.org/files/AlignmentMachineLearning.pdf) and include dimensionality reduction and generative models.

Open-ended ▲ 1 Open

Generalizable Environment Goals

This proposal is from the article "Alignment for Advanced Machine Learning Systems" where Taylor et al. propose 8 research areas organised around the question: "As learning systems become increasingly intelligent and autonomous, what design principles can best ensure that their behavior is aligned with the interests of the operators?"

---

Many ML systems have their objectives specified in terms of their sensory data. For example, reinforcement learners have the objective of maximizing discounted reward over time (or, alternatively, minimizing expected/empirical loss), where “reward” and/or “loss” are part of the system’s percepts. While these sensory goals can be useful proxies for environmental goals, environmental goals are distinct: Tricking your sensors into perceiving that a sandwich is in the room is not the same as actually having a sandwich in the room.

Let’s say that your goal is to design an AI system that directly pursues some environmental goal, such as “ensure that this human gets lunch today.”

- How can we train the system to pursue a goal like that in a manner that is robust against opportunities to interfere with the proxy methods used to specify the goals, such as “the pixels coming from the camera make an image that looks like food”?

One way to address this problem is to design more and more elaborate sensor systems that are harder and harder to deceive. However, this is the sort of strategy that is unlikely to scale well to highly capable AI systems. A more scalable approach is to design the system to learn an “environmental goal” such that it would not rate a strategy of “fool all sensors at once” as high-reward, even if it could find such a policy.

Related work:

- Extending the AIXI framework (Dewey, 2011 and Hibbard, 2012)
- Reward hacking (Dewey 2011, Amodei et al. 2016)

Ideas for future work are discussed in the [source](https://intelligence.org/files/AlignmentMachineLearning.pdf).