AI Safety Ideas

Ideas

Open-ended ▲ 1 Open

What has been the share of any chip in a given year of total available compute performance?

New chips are continuously developed, and old chips are subsequently replaced, but not instantly. Ultimately, we want to have a better understanding of the dynamics of how chips are replaced over time. To help with this, construct a database that specifies the following: for each year between 2010 and 2020, what share of the available compute came from which chips?
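As a rough sketch of what one computation over such a database could look like: the record layout below (year, chip, units in service, peak FLOP/s per unit) and all of the numbers are hypothetical placeholders, not real market data.

```python
from collections import defaultdict

# Hypothetical records: (year, chip name, units in service, peak FLOP/s per unit).
# All values are made up for illustration.
RECORDS = [
    (2019, "chip_a", 1000, 1.0e13),
    (2019, "chip_b", 500, 4.0e13),
    (2020, "chip_a", 800, 1.0e13),
    (2020, "chip_b", 1500, 4.0e13),
]


def compute_shares(records):
    """Return {year: {chip: share of that year's total available compute}}."""
    totals = defaultdict(float)
    per_chip = defaultdict(lambda: defaultdict(float))
    for year, chip, units, flops_per_unit in records:
        compute = units * flops_per_unit
        totals[year] += compute
        per_chip[year][chip] += compute
    return {
        year: {chip: c / totals[year] for chip, c in chips.items()}
        for year, chips in per_chip.items()
    }


shares = compute_shares(RECORDS)
for year in sorted(shares):
    print(year, {chip: round(s, 3) for chip, s in shares[year].items()})
```

Weighting by peak FLOP/s is only one possible definition of "available compute"; utilisation-adjusted or price-adjusted measures would need extra columns.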

Open-ended ▲ 1 Open

Improvements due to “software-for-hardware”

Innovations in compilers and other low-level improvements have helped increase the utilisation rate of GPUs and improve training efficiency. Make a list of such improvements and estimate how much each improved overall performance on tasks such as training a neural network.

Open-ended ▲ 1 Open

Do AI researchers train models using scaling laws?

Scaling laws have been proposed as ways to gather information about how to train large machine learning models efficiently ([Kaplan et al., 2020](https://arxiv.org/abs/2001.08361)), and this has been done in practice for training LLMs like Chinchilla ([Hoffmann et al., 2022](https://arxiv.org/abs/2203.15556)). But how broadly have scaling laws been used by AI researchers in general, and has there been a delay in the uptake of such an approach?
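For readers unfamiliar with how a scaling law is used in practice, here is a minimal sketch: fit a power law `loss = a * size ** (-alpha)` to a few small runs by least squares in log-log space, then extrapolate to a larger size. The data is synthetic and the function names are hypothetical; this is not the fitting procedure of either cited paper.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)


def predict(a, alpha, size):
    """Extrapolate the fitted power law to a new model size."""
    return a * size ** (-alpha)


# Synthetic "small runs" generated from loss = 10 * N^-0.1 (illustrative only).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [10.0 * n ** -0.1 for n in sizes]

a, alpha = fit_power_law(sizes, losses)
print(f"a={a:.3f}, alpha={alpha:.3f}, predicted loss at 1e10: {predict(a, alpha, 1e10):.3f}")
```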

Open-ended ▲ 1 Open

Test the bioanchors framework by retrodicting computer vision progress

The [bioanchors framework](https://epochai.org/blog/grokking-bioanchors) is one of the most detailed and widely used AI timelines models. However, many people don't trust the basic approach of using biological anchors to predict AI progress. Computer vision is already at the human- or superhuman-level for some tasks. Could we have predicted its progress by applying the bioanchors methodology, using the human visual cortex as an anchor?

Open-ended ▲ 1 Open

Extrapolating GPT-N performance

Lukas Finnveden previously performed an extrapolation of GPT-N performance on a number of benchmark tasks, such as cloze completion and arithmetic ([Finnveden, 2020](https://www.alignmentforum.org/posts/k2SNji3jXaLGhBeYP/extrapolating-gpt-n-performance)). Can you expand on this methodology and apply it to more cases?

Open-ended ▲ 1 Open

Qualitatively analysing language model / image generation improvements since ~2000

While we can plot graphs showing quantitative changes in language model / image generation performance over time (e.g. in terms of the perplexity), what does this actually mean in terms of model capabilities? Having a collection of samples from language models in the last two decades could help give a visceral sense of how much they have improved. The comparison could include a selection of the best output out of 10 prompts, a comparison of prompt completions, etc.

Open-ended ▲ 2 Open

London-based MATS clone

"A London-based MATS clone to build the AI safety research ecosystem there, leverage mentors in and around London (e.g., DeepMind, CLR, David Krueger, Aligned AI, Conjecture, etc.), and allow regional specialization. This project should probably only happen once MATS has ironed out the bugs in its beta versions and grown too large for one location (possibly by Winter 2023). Please contact the MATS team before starting something like this to ensure good coordination and to learn from our mistakes."

Open-ended ▲ 3 Open

Investigate relationship between double descent and grokking

What is the relationship between double descent and grokking? Double descent seems to be caused by polysemanticity phase transitions, while grokking seems like a general effect of task learning. We see a slight decrease in performance over a few epochs, which then converges to an even lower equilibrium, indicating a new level of hyperdimensional encoding. See [example](https://transformer-circuits.pub/2022/toy_model/index.html#geometry:~:text=Let%27s%20look%20at%20the%20resulting%20plot%2C%20and%20then%20we%27ll%20try%20to%20figure%20out%20what%20it%27s%20showing%20us%3A).

Open-ended ▲ 2 Open

Circuit investigation: Compare tasks for nL model to a (n+1)L model

Look for tasks that an nL model cannot do but an (n+1)L model can, then look for a circuit! Proposal:
- Build the infrastructure to do this: run two models over a lot of text and look for big log prob differences (maybe floor the log probs at e.g. 5, to avoid overfitting to times that one network was incredibly wrong).
- Just take a bunch of text with interesting patterns and run the models over it, look for tokens they do really well on, and try to reverse engineer what’s going on. I expect there’s a lot of stuff in here!
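The comparison step in the proposal could be sketched roughly as below. It assumes the per-token log probs of the correct token have already been extracted from each model, and it reads "floor at 5" as clamping log probs to be no lower than -5 nats; both the interpretation and the example numbers are assumptions.

```python
def clamp_log_probs(log_probs, floor=-5.0):
    """Clamp log probs from below so one catastrophic miss cannot dominate."""
    return [max(lp, floor) for lp in log_probs]


def biggest_gaps(tokens, small_lp, large_lp, floor=-5.0, top_k=3):
    """Rank tokens by how much better the larger model predicts them.

    Returns (gap, position, token) tuples, largest gap first.
    """
    small = clamp_log_probs(small_lp, floor)
    large = clamp_log_probs(large_lp, floor)
    gaps = [
        (l - s, i, tok)
        for i, (tok, s, l) in enumerate(zip(tokens, small, large))
    ]
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top_k]


# Made-up per-token log probs for an nL model and an (n+1)L model.
tokens = ["The", " cat", " sat", " on", " the", " mat"]
small_lp = [-2.0, -6.0, -1.0, -0.5, -0.4, -9.0]
large_lp = [-1.9, -1.0, -1.1, -0.5, -0.3, -8.0]

top = biggest_gaps(tokens, small_lp, large_lp)
print(top[0])
```

Note that " mat" contributes no gap after clamping: both models were very wrong there, which is exactly the case the floor is meant to discount.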

Open-ended ▲ 1 Open

Reverse engineering of 1 layer SoLU model

How far can you get with really deeply reverse engineering a 1 layer SoLU model?
- Which directions correspond to features?
- Can you find any [polysemantic](https://transformer-circuits.pub/2022/toy_model/index.html) neurons?
- Can you fully reverse a feature direction and compare it to a neuron direction?

Open-ended ▲ 1 Open

Investigate SoLU lexoscope's neurons

Neel Nanda made the website [lexoscope.io](lexoscope.io), which shows the text that most activates each neuron in several SoLU language models he trained, including toy SoLU models. Problem ideas could be:
- Hunt through it and look for interesting neurons. Can you find weird and abstract ones?
- Can you find neuron families? A la [equivariance](https://distill.pub/2020/circuits/equivariance/)
- Study a lot of neurons at different layers and look for patterns. What can we say about what the model is doing at different layers? What patterns are there?
- Can you find examples of neuron splitting? (A single high-level feature splits into several more specific features as you scale up.)
- Can you reverse engineer a neuron? Can you find a specific direction in activation space that is exactly that feature, and how aligned is it with the neuron basis?
- Can you find any highly non-monosemantic features?
  - A task where the entire MLP layer matters, but no one neuron activates much
  - Or where the pre-layernorm activation is low but post-layernorm is high, so the model “smuggles through” directions
- Find polysemantic neurons. Try to reverse engineer them: give the model a bunch of text containing each feature and average it/apply PCA. Can you find directions corresponding to each feature? How much do they align with that neuron?
- Replicate the part of Conjecture’s Polytopes paper where they look at the top e.g. 1000 dataset examples for a neuron across a ton of text and look for patterns in that. Is it the case that there are monosemantic bands in the neuron act spectrum?
- Can you find a genuinely monosemantic neuron? Possible idea: look for algorithmic flavoured neurons, e.g. one whose activation could be mimicking a regex, and use this to automatically test that it’s actually doing things.
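The "average it, then check alignment with the neuron basis" idea can be sketched as follows. This uses a difference of means rather than PCA, and the activation vectors are tiny made-up examples standing in for real MLP activations; the helper names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def feature_direction(with_feature, without_feature):
    """Unit vector from the mean 'feature absent' to the mean 'feature present' activation."""
    mu_on = mean_vector(with_feature)
    mu_off = mean_vector(without_feature)
    diff = [a - b for a, b in zip(mu_on, mu_off)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("the two activation sets have identical means")
    return [d / norm for d in diff]


def neuron_alignment(direction):
    """Absolute cosine similarity between a unit direction and each neuron axis.

    For a unit vector, the cosine with the i-th basis vector is its i-th component.
    """
    return [abs(d) for d in direction]


# Made-up activations over three neurons.
with_feature = [[3.0, 4.0, 0.0], [3.0, 4.0, 0.0]]
without_feature = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

direction = feature_direction(with_feature, without_feature)
alignment = neuron_alignment(direction)
print(alignment)
```

An alignment close to 1 for a single neuron would suggest the feature sits on that neuron; a spread like the one here suggests the feature is not aligned with any single neuron.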

Open-ended ▲ 6 Open

Investigate how 3-layer and 4-layer attention-only models differ from 2L

How do 3-layer and 4-layer attention-only models differ from 2L?
- Look for composition scores
- Look for evidence of composition, e.g. one head’s output represents a big fraction of the norm of another head’s query, key or value vector
- Do the “PCA of logits on a fixed set of random tokens” technique and look for more kinks.
  - Can you associate these with circuits?
- Ablate a single head and run the model on a lot of text. Look at the change in performance. Find the most important heads. Do any heads matter a lot that are not induction heads?
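A composition score can be sketched as a ratio of Frobenius norms, `||A B|| / (||A|| * ||B||)`, where `B` is an earlier head's output matrix and `A` is the matrix a later head uses to read from the residual stream. That this is the definition the idea intends is an assumption, and the 2x2 matrices below are toy stand-ins for real weights.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product a @ b."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def frobenius_norm(m):
    return math.sqrt(sum(x * x for row in m for x in row))


def composition_score(w_later, w_earlier):
    """||w_later @ w_earlier||_F / (||w_later||_F * ||w_earlier||_F)."""
    product = matmul(w_later, w_earlier)
    return frobenius_norm(product) / (
        frobenius_norm(w_later) * frobenius_norm(w_earlier)
    )


# Toy example: the earlier head writes only to residual dimension 0.
w_earlier = [[1.0, 0.0], [0.0, 0.0]]
reads_dim0 = [[1.0, 0.0], [0.0, 0.0]]  # later head reads dimension 0
reads_dim1 = [[0.0, 0.0], [0.0, 1.0]]  # later head reads dimension 1

print(composition_score(reads_dim0, w_earlier))
print(composition_score(reads_dim1, w_earlier))
```

A score near 0 means the later head ignores what the earlier head writes; higher scores flag head pairs worth inspecting for a circuit.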

Open-ended ▲ 1 Open

Understand the architecture and training dynamics of Transformers

A proper mechanistic explanation of model behavior comes from a deep interest in understanding each component that goes into training it. This is a good tutorial (with exercises!) that walks through the architectural components, and the training process for a Transformer in Jax, from 2022's Deep Learning Indaba. https://github.com/deep-learning-indaba/indaba-pracs-2022/blob/main/practicals/attention_and_transformers.ipynb

Open-ended ▲ 4 Open

Automate ways to find specific circuits

Circuits are the mechanisms by which Transformers compute features of the text, here using their attention heads. Read more about [Circuits](https://distill.pub/2020/circuits/zoom-in/) and [Transformer heads](https://arxiv.org/abs/2211.00593).
- Automated ways to analyse attention patterns to find different kinds of heads
  - Induction heads
  - Translation heads
  - Few shot learning heads
  - The heads used in [factual recall](https://rome.baulab.info/)
  - The heads used in the [IOI paper](https://www.alignmentforum.org/posts/3ecs6duLmTfyra3Gp/some-lessons-learned-from-studying-indirect-object)
- Can you do a similar thing for neuron interpretation?

Open-ended ▲ 1 Open

Identify differences between models run on the same text (automated circuits identification)

Automated circuits identification is a way to identify places to look for circuits to analyze. Run pairs of models on the same text, or on various benchmarks, and look for places they differ.
- E.g. per-token losses are likely to show a phase change.
- Significant changes are evidence for a circuit.
- Pairs of models: same architecture but different scales (GPT-2 Small vs Medium), different data distribution, different random seeds, checkpoint earlier in training vs later.

Related to the automated auditing agenda.

Open-ended ▲ 6 Open

Investigate grokking; the effect that models suddenly learn different abilities

Neel Nanda reverse engineered a network trained to do addition and shows that it does addition [using the Fourier transform algorithm](https://twitter.com/NeelNanda5/status/1559060545470210048). Use [the Google Colab](https://colab.research.google.com/drive/1F6_1_cWXE5M7WocUcpQWp3v8z4b1jL20) to investigate further questions about grokking:
- Understanding why the model chooses different frequencies (and [why it switches mid-training sometimes](https://twitter.com/NeelNanda5/status/1559430256624209921)!)
- Understanding why [5 digit addition has a phase change](https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking#Speculation__Phase_Changes_are_Everywhere) per digit (so 6 total?!)
- Can you find analytic arguments for why phase changes happen? Perhaps starting with a small model. What's the smallest model that exhibits phase changes? Smallest task? What are the most minimal requirements?

Recommend reading [the summary of the research on Twitter](https://twitter.com/NeelNanda5/status/1559060507524403200).

Some further work mentioned in the Colab:
- Modular addition
  - Interpreting the memorisation circuit, and figuring out *how* models memorise
  - Training on interpretability inspired metrics
    - Note that excluded loss is a somewhat dodgy metric to train on, as it involves computation over both the train and test data
- Interpreting the five digit addition or predicting repeated subsequences examples
  - In particular, trying to map the many phase changes in 5 digit addition to circuits
- Looking for other examples of phase changes
  - Toy problems
    - Something incentivising skip trigrams
    - Something incentivising virtual attention heads
  - Looking for [curve detectors](https://distill.pub/2020/circuits/curve-circuits) in a ConvNet
    - A dumb way to try this would be to train a model to imitate the actual curve detectors in Inception (eg minimising OLS loss between the model's output and curve detector activations)
  - Looking at the formation of interpretable neurons in a [SoLU transformer](https://transformer-circuits.pub/2022/solu/index.html)
  - Looking inside a LLM with many checkpoints
    - Eleuther have many checkpoints of GPT-J and GPT-Neo, and will share if you ask
    - [Mistral](https://nlp.stanford.edu/mistral/getting_started/download.html) have public versions of GPT-2 small and medium, with 5 runs and many checkpoints
    - Possible capabilities to look for
      - Performance on benchmarks, or specific questions from benchmarks
      - Simple algorithmic tasks like addition, or sorting words into alphabetical order, or matching open and close brackets
      - Soft induction heads, eg [translation](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html#performing-translation)
      - Look at attention heads on various text and see if any have recognisable attention patterns (eg start of word, adjective describing current word, syntactic features of code like indents or variable definitions, most recent open bracket, etc).
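To make the "Fourier transform algorithm" concrete for the modular addition case, here is a hand-written sketch of the idea rather than the trained network itself: represent `a` and `b` by sines and cosines at a few frequencies, combine them with the angle-addition identities, and score each candidate answer `c` by `cos(w * (a + b - c))`. The modulus `p = 113` and the frequency set are illustrative assumptions, not values read off any model.

```python
import math


def fourier_mod_add(a, b, p, frequencies):
    """Compute (a + b) mod p as the argmax over c of sum_k cos(2*pi*k*(a + b - c) / p)."""

    def logit(c):
        total = 0.0
        for k in frequencies:
            w = 2.0 * math.pi * k / p
            # cos and sin of w*(a+b), built only from per-input cos/sin terms.
            cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
            sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
            # cos(w*(a+b-c)) = cos(w*(a+b))*cos(w*c) + sin(w*(a+b))*sin(w*c)
            total += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
        return total

    return max(range(p), key=logit)


P = 113                   # illustrative prime modulus
FREQUENCIES = [1, 5, 17]  # illustrative frequency choices

print(fourier_mod_add(100, 50, P, FREQUENCIES), (100 + 50) % P)
```

Every term equals 1 exactly when `c` is the correct answer and is smaller otherwise, so the terms interfere constructively only at `(a + b) mod p`. That is why a handful of frequencies is enough.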

Open-ended ▲ 5 Open

Extend the causal tracing work from the ROME paper

Can you refine their technique to find the specific heads (and maybe specific neurons) that recall the fact? Can we improve their technique by resampling activations from a random input, instead of adding Gaussian noise, to create corrupted activations? The ROME paper originally traces where **facts** are stored in a language model using this tracing method. [Read more here](https://rome.baulab.info/) and read their [followup work on editing factual associations](https://memit.baulab.info/).

Open-ended ▲ 4 Open

Deconstruct a language model's understanding circuit like in the IOI paper

The [indirect object identification paper](https://www.alignmentforum.org/posts/3ecs6duLmTfyra3Gp/some-lessons-learned-from-studying-indirect-object) interprets how a language model can know which name to put at the end of **"Mary and John went to the store. Mary handed a carton of milk to..."** [output John]. This task is called *"indirect object identification"* and shows a circuit like this:

![Circuit of understanding](https://lh5.googleusercontent.com/jiCjDZQWO052rsgb5rahxBfkPl8Tf2-5Az2WXYkqO6BiVzdwoTihBTR0iOTMzD0VtgYcvriL5Ul1sxaI-ho2bR0M3x5jhui-0oEYAAlemNS_7KbB06B14_oDnbG-PvxvepRSpx-nXTuTMTWidBdsRptRNKf6lHxWgZm_FhXtznwERTx8e5_u3Pc3Ew)

Each attention head of a Transformer plays a different role. Here, we can see that e.g. layer 4, head 11 is a *"Previous token head"*. We can see that these heads inform the induction heads (specializing in copy+pasting) and all the way into the special heads they found:
- Negative name mover heads: Avoid copying specific name tokens
- Name mover heads: Copy name tokens
- Backup name mover heads: Normally not active but activate to write John if the name mover heads do not activate

## Ideas for new tasks

Possible simple tasks to interpret can be:
- 3 letter acronyms (or more!)
- Converting names to emails.
  - An extension task is e.g. constructing an email from a snippet like the following:
- Grammatical rules
  - Learning that words after full stops are capital letters
  - Verb conjugation
  - Choosing the right pronouns (e.g. he vs she vs it vs they)
  - Whether something is a proper noun or not
- Detecting sentiment (eg predicting whether something will be described as good vs bad)
- Interpreting memorisation. E.g., there are times when GPT-2 knows surprising facts like people’s contact information. How does that happen?
- Counting objects described in text. E.g.: I picked up an apple, a pear, and an orange. I was holding three fruits.

## Ideas for extensions of the original paper

- Understanding what's happening in the adversarial examples, most notably the S-Inhibition Head attention pattern (hard). (S-Inhibition heads are mentioned in the IOI paper.)
- Understanding how the positional signal is encoded (relative distance, something else?). Bonus points if we have a story that includes the positional embeddings and explains how the difference between positions is computed (if relative is the right framework) by Duplicate Token Heads / Induction Heads. (Hard, but less context dependent.)
- What is the role of MLPs in IOI? (Quite broad and hard.)
- What is the role of Duplicate Token Heads outside IOI? Are they used in other Q-compositions with S-Inhibition Heads? Can we describe how their QK circuit implements "collision detection" at a parameter level? (The last question has low context dependence and is quite tractable.)
- What is the role of Negative / Backup / regular Name Mover Heads outside IOI? Can we find examples on which Negative Name Movers contribute positively to the next-token prediction?
- What are the differences between the 5 induction heads present in GPT2-small? What are the heads they rely on, and what are the later heads they compose with? (Low context dependence from IOI.)
- Understanding 4.11 (a really sharp previous token head) at the parameter level. I think this can be quite tractable given that its attention pattern is almost perfectly off-diagonal.
- What are the conditions for compensation mechanisms to occur? Is it due to dropout? [@Arthur Conmy](https://mlab-2.slack.com/team/U04139U6XPB) is working on this; feel free to reach out to arthur@rdwrs.com

Open-ended ▲ 0 Open

Shortest and not the steepest path will fix the inner-alignment problem

Replacing stochastic gradient descent (SGD) with something that takes the shortest and not the steepest path should just about fix the whole inner-alignment problem.

Open-ended ▲ 0 Open

Warning shots / slow takeoff might help reduce the probability that AGI gets power-seeking motivations and escapes control

Warning shots / slow takeoff might help reduce the probability P that AGI gets power-seeking motivations and escapes control, and/or slow the increase in the number of groups trying to train an AGI (let’s say from scratch, although fine-tuning is risky too).

Open-ended ▲ 2 Open

Describe our best alignment strategy at the moment

A blog post which describes in as much detail as possible what our current “throw the kitchen sink at it” alignment strategy would look like. (I’ll probably put my version of this online soon but would love others too).

Open-ended ▲ 3 Open

ML Chinese Whispers

Given a network, train another to reconstruct an input from the bottom half of the third layer. Given an input, sample ten input-guesses to visualize what that half-layer remembers about the input.

Open-ended ▲ 1 Open

Define gradient hacking and create a toy model

A paper which does the same for gradient hacking as the goal misgeneralization paper does for inner alignment, i.e. describing it in ML language and setting up toy examples, e.g. taking [these](https://www.lesswrong.com/posts/EeAgytDZbDjRznPMA/gradient-hacking-definitions-and-examples) and putting them into more formal ML language.

Open-ended ▲ 4 Open

Define deceptive alignment and create a toy example

A paper which does for [deceptive alignment](https://bounded-regret.ghost.io/ml-systems-will-have-weird-failure-modes-2/) what the goal misgeneralization paper does for inner alignment, i.e. describing it in ML language and setting up toy examples (for example, telling GPT-3 to take actions which minimize changes in its weights, given that it’s being trained using actor-critic RL with a certain advantage function, and seeing if it knows how to do so).