AI Safety Ideas

Ideas

Open-ended ▲ 1 Open

Insights-based models of AI timelines

The Median Group previously proposed a model of [AI timelines based on key “insights”](http://mediangroup.org/insights) required on the way to AGI development. However, the current model is based on outdated and poorly curated data, and there are some questionable methodological choices. Collect data that is more up-to-date, and redo the model – how do your results compare to more well-known timelines models?

Open-ended ▲ 1 Open

Rethinking the evolutionary anchor

In Forecasting TAI with biological anchors, Ajeya Cotra proposes the “evolutionary anchor” as a hypothesis for the compute needed to train generally intelligent systems, based on “the total FLOP performed over the course of evolution, since the first neurons” ([Cotra, 2020](https://www.lesswrong.com/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines)). But there have been some concerns about whether this definition is appropriate: it does not account for the compute needed to simulate the environment ([Sempere, 2022](https://forum.effectivealtruism.org/posts/FHTyixYNnGaQfEexH/a-concern-about-the-evolutionary-anchor-of-ajeya-cotra-s)), and anthropic considerations might prove highly important ([Erdil, 2022](https://www.lesswrong.com/posts/NHvspuLiirJwiLtfg/do-anthropic-considerations-undercut-the-evolution-anchor)). Assess the significance of these concerns, and reassess the viability of the current definition of the anchor.

Open-ended ▲ 1 Open

Brain emulation development

Anders Sandberg looked into a Monte Carlo model of brain emulation development ([Sandberg, 2014](http://www.aleph.se/papers/Monte%20Carlo%20model%20of%20brain%20emulation%20development.pdf)). However, this paper is now old and has outdated estimates. Replicate the methodology of this paper – what are the new results?

Open-ended ▲ 1 Open

Profiler to measure compute

Compute is one of the key inputs in machine learning, very predictive of performance and relatively easy to measure. However, compute usage typically isn’t reported even in top journal articles. Part of the reason for this is the lack of good profiling tools in GPUs and/or machine learning frameworks. The task is thus to implement an open-source solution into a framework like PyTorch. This could help shift the community's norms towards more transparent reporting, which in turn would create a lever for AI governance interventions. Lennart Heim has an extensive draft on this issue he would be happy to share on request.
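Until such a profiler exists, training compute is usually estimated by hand. Below is a minimal sketch of two common back-of-the-envelope estimates. The function names are illustrative, the rule of roughly 6 FLOP per parameter per token is an approximation for dense transformers, and the utilization figure is something you would have to assume or measure yourself.

```python
def flop_from_params_and_tokens(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOP for a dense transformer.

    Uses the common rule of thumb of ~6 FLOP per parameter per training token
    (forward plus backward pass). This is an approximation, not a measurement.
    """
    return 6.0 * n_params * n_tokens


def flop_from_hardware(n_gpus: int, seconds: float,
                       peak_flop_per_s: float, utilization: float) -> float:
    """Approximate training FLOP from hardware usage.

    `utilization` is the assumed fraction of peak throughput actually achieved;
    it has to be guessed or measured, which is the gap a profiler would fill.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return n_gpus * seconds * peak_flop_per_s * utilization


if __name__ == "__main__":
    # 175e9 parameters trained on 300e9 tokens gives roughly 3.15e23 FLOP.
    print(f"{flop_from_params_and_tokens(175e9, 300e9):.2e}")
    # Hypothetical cluster: all four numbers below are placeholders.
    print(f"{flop_from_hardware(64, 7 * 24 * 3600, 1e14, 0.3):.2e}")
```

A real profiler would replace both estimates with operation counts measured during the run, so the two functions are mostly useful as sanity checks against reported numbers.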

Open-ended ▲ 1 Open

AI development vignettes

Write down qualitative and concrete stories about AI development, exploring the possible risks and societal consequences. The emphasis here should be on detail, and you should take potential hardware, algorithmic, and data constraints into account (e.g. what happens if Moore’s law ends in a few years?).

Open-ended ▲ 1 Open

Study training run lengths

Epoch worked out a theoretical upper bound to [training run clock length](https://epochai.org/blog/the-longest-training-run) of 14-15 months. Empirically investigate trends in training run lengths, and see how they compare to this theoretical upper bound. What are the reasons for any discrepancies? This would require building a dataset of training run lengths.

Open-ended ▲ 1 Open

Paradigm changes in AI

What were the major paradigm shifts in different domains of AI? By talking to domain experts, reading lit reviews and popular papers, discern what methods were popular at each point in time and compile a list of these domain-specific paradigm shifts. Such a list allows us to use [Laplace’s rule](https://www.lesswrong.com/posts/wE7SK8w8AixqknArs/a-time-invariant-version-of-laplace-s-rule) to estimate a base rate of paradigm changes in AI.
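As a rough illustration of how such a list could feed a base rate, the sketch below implements the classical rule of succession and one continuous-time variant. The variant assumes paradigm shifts arrive as a Poisson process with a scale-invariant prior on the rate, which may differ in detail from the linked post. The function names and the example numbers are hypothetical.

```python
def rule_of_succession(successes: int, trials: int) -> float:
    """Classical Laplace rule: P(success on the next trial) = (s + 1) / (n + 2)."""
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return (successes + 1) / (trials + 2)


def prob_no_shift(events: int, observed_years: float, horizon_years: float) -> float:
    """P(no paradigm shift in the next `horizon_years`).

    Assumes a Poisson process with prior density proportional to 1/rate.
    After seeing `events` shifts in `observed_years`, the posterior over the
    rate is Gamma(events, observed_years), and the predictive probability of
    zero events in a window of length t is (T / (T + t)) ** events.
    """
    if events < 1:
        raise ValueError("this prior needs at least one observed event")
    if observed_years <= 0 or horizon_years < 0:
        raise ValueError("durations must be positive")
    return (observed_years / (observed_years + horizon_years)) ** events


if __name__ == "__main__":
    # Hypothetical input: 2 paradigm shifts identified in a 10-year record.
    print(1 - prob_no_shift(events=2, observed_years=10, horizon_years=10))
```

Unlike the discrete rule, the continuous-time version does not depend on whether the record is sliced into years, months, or decades.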

Open-ended ▲ 1 Open

Algorithmic breakthroughs in machine learning history

What were the major algorithmic innovations in machine learning over the last two decades? This could be structured as a literature review or as a survey of experts, culminating in a big list of the key algorithmic advances over the last ~20 years. Such a database helps us understand the frequency and significance of algorithmic insights.

Open-ended ▲ 1 Open

Revisiting ‘Is AI Progress Impossible To Predict?’

Alyssa Vance argued that AI progress on a task from one model to the next was unpredictable ([Vance, 2022](https://www.lesswrong.com/posts/G993PFTwqqdQv4eTg/is-ai-progress-impossible-to-predict)). Can we investigate this in more detail? For instance, the authors of Beyond the Imitation Game (BIG-bench) find that for tasks where progress is “jumpy”, there are usually progress metrics that vary more smoothly ([Srivastava et al., 2022](https://arxiv.org/abs/2206.04615)). Can we use those metrics to predict progress?

Open-ended ▲ 1 Open

What share of total available compute performance did each chip account for in a given year?

New chips are continuously developed, and old chips are subsequently replaced, but not instantly. Ultimately, we want to have a better understanding of the dynamics of how chips are replaced over time. To help with this, construct a database that specifies the following: for each year between 2010 and 2020, what share of the available compute came from which chips?

Open-ended ▲ 1 Open

Improvements due to “software-for-hardware”

Innovations in compilers and other low-level improvements have helped increase the utilisation rate of GPUs and improve training efficiency. Make a list of such improvements and estimate how much each one improved overall performance on tasks such as training a neural network.

Open-ended ▲ 1 Open

Do AI researchers train models using scaling laws?

Scaling laws have been proposed as ways to gather information about how to train large machine learning models efficiently ([Kaplan et al., 2020](https://arxiv.org/abs/2001.08361)), and this has been done in practice for training LLMs like Chinchilla ([Hoffmann et al., 2022](https://arxiv.org/abs/2203.15556)). But how broadly have scaling laws been used by AI researchers in general, and has there been a delay in the uptake of such an approach?
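For readers unfamiliar with what "using a scaling law" involves in practice, here is a minimal sketch of the simplest case: fitting a pure power law to a few small runs and extrapolating to a larger one. The function names and the data are made up for illustration, and the irreducible-loss term that appears in the Chinchilla-style parametric fit is deliberately ignored.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares in log-log space. Returns (a, b)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                        # slope: the power-law exponent
    a = math.exp(mean_y - b * mean_x)    # intercept, mapped back from log space
    return a, b


def predict(a: float, b: float, x: float) -> float:
    """Evaluate the fitted power law at x."""
    return a * x ** b


if __name__ == "__main__":
    # Synthetic data: loss = 2 * compute ** -0.5 (placeholder values only).
    compute = [1e3, 1e4, 1e5, 1e6]
    loss = [2 * c ** -0.5 for c in compute]
    a, b = fit_power_law(compute, loss)
    print(a, b, predict(a, b, 1e8))
```

A survey of uptake could ask whether a paper reports a fit of this kind before its largest run, or only after the fact.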

Open-ended ▲ 1 Open

Test the bioanchors framework by retrodicting computer vision progress

The [bioanchors framework](https://epochai.org/blog/grokking-bioanchors) is one of the most detailed and widely used AI timelines models. However, many people don't trust the basic approach of using biological anchors to predict AI progress. Computer vision is already at the human- or superhuman-level for some tasks. Could we have predicted its progress by applying the bioanchors methodology, using the human visual cortex as an anchor?

Open-ended ▲ 1 Open

Extrapolating GPT-N performance

Lukas Finnveden previously performed an extrapolation of GPT-N performance on a number of benchmark tasks, such as cloze completion and arithmetic ([Finnveden, 2020](https://www.alignmentforum.org/posts/k2SNji3jXaLGhBeYP/extrapolating-gpt-n-performance)). Can you expand on this methodology and apply it to more cases?

Open-ended ▲ 1 Open

Qualitatively analysing language model / image generation improvements since ~2000

While we can plot graphs showing quantitative changes in language model / image generation performance over time (e.g. in terms of the perplexity), what does this actually mean in terms of model capabilities? Having a collection of samples from language models in the last two decades could help give a visceral sense of how much they have improved. The comparison could include a selection of the best output out of 10 prompts, a comparison of prompt completions, etc.

Hypothesis ▲ 1 Open

Sarcasm and more can be measured in text using modern LLMs.

Current state-of-the-art NLP can mostly measure sentiment and simple variables such as word count and bag-of-word measures. With modern LLMs such as text-davinci-003, we are able to create new ways to measure texts. Examples might be: Sarcasm, bias, grammatical errors and domain-specific language use. For AI safety, this can become useful to

Open-ended ▲ 2 Open

London-based MATS clone

"A London-based MATS clone to build the AI safety research ecosystem there, leverage mentors in and around London (e.g., DeepMind, CLR, David Krueger, Aligned AI, Conjecture, etc.), and allow regional specialization. This project should probably only happen once MATS has ironed out the bugs in its beta versions and grown too large for one location (possibly by Winter 2023). Please contact the MATS team before starting something like this to ensure good coordination and to learn from our mistakes."

Hypothesis ▲ 1 Open

Trap-Door Environments for MineRL Agents

Proposal: A "change everything" button in a MineRL environment that instantly changes the environment through Stable Diffusion or some other fast generative model, to observe the change in learned representations and goal generalization.

Hypothesis ▲ 1 Open

Levels of ablation of Transformer heads will gradually activate backup heads.

In [Interpretability in the Wild](https://arxiv.org/abs/2211.00593), the backup name mover heads activate when the name mover heads are ablated. How do we expect backup name mover heads to respond to different amplitudes of ablation on the main name mover head? Two expectations come to mind: either they activate gradually, or there is a sharp phase shift in their behaviour. Also see the work [on backup backup name mover heads](https://itch.io/jam/interpretability/rate/1789630).
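One way to make "amplitude of ablation" concrete is to interpolate between the head's clean output and an ablation baseline. The sketch below is a framework-free illustration of that idea only. The function name is hypothetical, and the choice of baseline (zeros, or a mean activation over some reference data) is an assumption you would need to make.

```python
def partial_ablation(head_output, baseline, amplitude):
    """Blend a head's output with an ablation baseline.

    amplitude = 0.0 leaves the head untouched; amplitude = 1.0 replaces its
    output with the baseline entirely. Values in between give a graded ablation.
    """
    if not 0.0 <= amplitude <= 1.0:
        raise ValueError("amplitude must be in [0, 1]")
    if len(head_output) != len(baseline):
        raise ValueError("head_output and baseline must have the same length")
    return [(1.0 - amplitude) * h + amplitude * b
            for h, b in zip(head_output, baseline)]


if __name__ == "__main__":
    clean = [2.0, 4.0, -1.0]       # placeholder head output
    zero_baseline = [0.0, 0.0, 0.0]
    for amp in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(amp, partial_ablation(clean, zero_baseline, amp))
```

Sweeping `amplitude` while recording the backup heads' contribution would distinguish the two expectations: a smooth curve suggests gradual activation, a sudden jump suggests a phase shift.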

Open-ended ▲ 3 Open

Investigate relationship between double descent and grokking

What is the relationship between double descent and grokking? Double descent seems to be caused by polysemanticity phase transitions, while grokking seems like a general effect of task learning. We see a slight decrease in performance over a few epochs before the loss converges to an even lower equilibrium, indicating a new level of hyperdimensional encoding. See [example](https://transformer-circuits.pub/2022/toy_model/index.html#geometry:~:text=Let%27s%20look%20at%20the%20resulting%20plot%2C%20and%20then%20we%27ll%20try%20to%20figure%20out%20what%20it%27s%20showing%20us%3A).

Hypothesis ▲ 2 Open

Investigate circuits: Compare a nL model to a (n+1)L

Look for tasks that an nL model cannot do but a (n+1)L model can, then look for a circuit! Proposal:

- Build the infrastructure to do this: run two models over a lot of text and look for big log prob differences (maybe floor the log probs at e.g. 5, to avoid overfitting to times that one network was incredibly wrong)

Open-ended ▲ 2 Open

Circuit investigation: Compare tasks for nL model to a (n+1)L model

Look for tasks that an nL model cannot do but a (n+1)L model can, then look for a circuit! Proposal:

- Build the infrastructure to do this: run two models over a lot of text and look for big log prob differences (maybe floor the log probs at e.g. 5, to avoid overfitting to times that one network was incredibly wrong)
- Just take a bunch of text with interesting patterns and run the models over it, look for tokens they do really well on, and try to reverse engineer what’s going on. I expect there’s a lot of stuff in here!
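The comparison step could be sketched as follows. This assumes you already have per-token log probs of the correct next token from both models. It also interprets "floor the log probs at 5" as clamping each log prob to be no lower than -5, which is a reading of the proposal rather than something it states. The function name is hypothetical.

```python
def clamped_logprob_gaps(tokens, logprobs_small, logprobs_large, floor=-5.0):
    """Rank tokens by how much better the larger model predicts them.

    Each log prob is clamped from below at `floor` before taking the
    difference, so a token where one model was catastrophically wrong cannot
    dominate the ranking. Returns (gap, token) pairs, largest gap first.
    """
    if not (len(tokens) == len(logprobs_small) == len(logprobs_large)):
        raise ValueError("all inputs must have the same length")
    gaps = []
    for tok, lp_small, lp_large in zip(tokens, logprobs_small, logprobs_large):
        gap = max(lp_large, floor) - max(lp_small, floor)
        gaps.append((gap, tok))
    return sorted(gaps, reverse=True)


if __name__ == "__main__":
    # Placeholder values; real inputs would come from running both models.
    tokens = ["a", "b", "c"]
    small = [-0.1, -12.0, -3.0]
    large = [-0.1, -0.5, -2.5]
    for gap, tok in clamped_logprob_gaps(tokens, small, large):
        print(f"{tok!r}: {gap:+.2f}")
```

Tokens at the top of the ranking are candidates for behaviour that only the deeper model implements, and so are natural starting points for circuit hunting.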

Open-ended ▲ 1 Open

Reverse engineering of 1 layer SoLU model

How far can you get with really deeply reverse engineering a 1 layer SoLU model?

- Which directions correspond to features?
- Can you find any [polysemantic](https://transformer-circuits.pub/2022/toy_model/index.html) neurons?
- Can you fully reverse a feature direction and compare it to a neuron direction?

Open-ended ▲ 1 Open

Investigate SoLU lexoscope's neurons

Neel Nanda made the website [lexoscope.io](lexoscope.io). It shows the text that most activates each neuron in several SoLU language models he trained, including toy SoLU models. Problem ideas could be:

- Hunt through it and look for interesting neurons. Can you find weird and abstract ones?
- Can you find neuron families? A la [equivariance](https://distill.pub/2020/circuits/equivariance/)
- Study a lot of neurons at different layers and look for patterns. What can we say about what the model is doing at different layers? What patterns are there?
- Can you find examples of neuron splitting? (A single high-level feature splits into several more specific features as you scale up)
- Can you reverse engineer a neuron? Can you find a specific direction in activation space that is exactly that feature, and how aligned is it with the neuron basis?
- Can you find any highly non-monosemantic features?
  - A task where the entire MLP layer matters, but no one neuron activates much
  - Or where the pre-layernorm activation is low but post-layernorm is high, so the model “smuggles through” directions
- Find polysemantic neurons. Try to reverse engineer them: give the model a bunch of text containing each feature and average it/apply PCA. Can you find directions corresponding to each feature? How much do they align with that neuron?
- Replicate the part of Conjecture’s Polytopes paper where they look at the top e.g. 1000 dataset examples for a neuron across a ton of text and look for patterns in that. Is it the case that there are monosemantic bands in the neuron activation spectrum?
- Can you find a genuinely monosemantic neuron? Possible idea: look for algorithmic-flavoured neurons, e.g. one whose activation could be mimicking a regex, and use this to automatically test that it’s actually doing things
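The regex idea above can be tested mechanically. Here is a minimal sketch, assuming you have already extracted a list of tokens and the neuron's activation on each one. The function name, the threshold, and the example values are all hypothetical, and the check is per token only, so it ignores any dependence on context.

```python
import re


def regex_agreement(tokens, activations, pattern, threshold):
    """Compare where a neuron fires with where a regex matches.

    A neuron "fires" on a token if its activation exceeds `threshold`; the
    regex "predicts" a token if it matches the whole token.
    Returns (precision, recall):
      precision = share of firing tokens that the regex also matches
      recall    = share of regex-matching tokens on which the neuron fires
    """
    if len(tokens) != len(activations):
        raise ValueError("tokens and activations must have the same length")
    regex = re.compile(pattern)
    both = fired_only = matched_only = 0
    for tok, act in zip(tokens, activations):
        matched = regex.fullmatch(tok) is not None
        fired = act > threshold
        if matched and fired:
            both += 1
        elif fired:
            fired_only += 1
        elif matched:
            matched_only += 1
    precision = both / (both + fired_only) if both + fired_only else 0.0
    recall = both / (both + matched_only) if both + matched_only else 0.0
    return precision, recall


if __name__ == "__main__":
    # Placeholder data for a neuron hypothesised to detect digit tokens.
    tokens = ["12", "cat", "7", "dog", "2023"]
    activations = [3.1, 0.0, 2.4, 1.5, 0.2]
    print(regex_agreement(tokens, activations, r"\d+", threshold=1.0))
```

High precision and recall over a large, varied corpus would be evidence that the neuron is close to monosemantic for that pattern; low precision points to polysemanticity, and low recall suggests the feature is spread across other neurons.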