Ideas
Investigate how 3-layer and 4-layer attention-only models differ from 2L
How do 3-layer and 4-layer attention-only models differ from 2-layer (2L) ones?

- Look for composition scores.
- Look for evidence of composition, e.g. one head's output representing a large fraction of the norm of another head's query, key or value vector.
- Do the "PCA of logits on a fixed set of random tokens" technique and look for more kinks.
  - Can you associate these with circuits?
- Ablate a single head and run the model on a lot of text. Look at the change in performance. Find the most important heads. Do any heads matter a lot that are not induction heads?
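A composition score can be sketched as below. This assumes the Frobenius-norm ratio definition (norm of the product of two heads' matrices, divided by the product of their norms). The 2x2 matrices are hypothetical stand-ins rather than real model weights, and pure Python is used only to keep the sketch dependency-free.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def frobenius(m):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in m for x in row))

def composition_score(w_later, w_earlier):
    """Norm of the product relative to the product of the norms (between 0 and 1)."""
    return frobenius(matmul(w_later, w_earlier)) / (
        frobenius(w_later) * frobenius(w_earlier))

# Hypothetical stand-ins for a later head's read matrix and an earlier head's OV matrix.
reads_dim_1 = [[0.0, 1.0], [0.0, 0.0]]   # later head reads only from dimension 1
writes_dim_1 = [[0.0, 0.0], [1.0, 0.0]]  # earlier head writes only to dimension 1
writes_dim_0 = [[1.0, 0.0], [0.0, 0.0]]  # earlier head writes only to dimension 0

print(composition_score(reads_dim_1, writes_dim_1))  # heads share a subspace
print(composition_score(reads_dim_1, writes_dim_0))  # heads use disjoint subspaces
```

A high score suggests the later head reads from the subspace the earlier head writes to; comparing scores against a random-matrix baseline of the same shape is one way to judge whether a value is large.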
Investigate grokking; the effect that models suddenly learn different abilities
Neel Nanda reverse engineered a network trained to do modular addition and showed that it does so [using the Fourier transform algorithm](https://twitter.com/NeelNanda5/status/1559060545470210048). Use [the Google Colab](https://colab.research.google.com/drive/1F6_1_cWXE5M7WocUcpQWp3v8z4b1jL20) to investigate further questions about grokking:

- Understanding why the model chooses different frequencies (and [why it switches mid-training sometimes](https://twitter.com/NeelNanda5/status/1559430256624209921)!)
- Understanding why [5 digit addition has a phase change](https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking#Speculation__Phase_Changes_are_Everywhere) per digit (so 6 total?!)
- Can you find analytic arguments for why phase changes happen? Perhaps starting with a small model. What's the smallest model that exhibits phase changes? Smallest task? What are the most minimal requirements?

Recommended reading: [the summary of the research on Twitter](https://twitter.com/NeelNanda5/status/1559060507524403200).
Some further work mentioned in the Colab:

- Modular addition
  - Interpreting the memorisation circuit, and figuring out *how* models memorise
  - Training on interpretability inspired metrics
    - Note that excluded loss is a somewhat dodgy metric to train on, as it involves computation over both the train and test data
- Interpreting the five digit addition or predicting repeated subsequences examples
  - In particular, trying to map the many phase changes in 5 digit addition to circuits
- Looking for other examples of phase changes
  - Toy problems
    - Something incentivising skip trigrams
    - Something incentivising virtual attention heads
  - Looking for [curve detectors](https://distill.pub/2020/circuits/curve-circuits) in a ConvNet
    - A dumb way to try this would be to train a model to imitate the actual curve detectors in Inception (e.g. minimising OLS loss between the model's output and curve detector activations)
  - Looking at the formation of interpretable neurons in a [SoLU transformer](https://transformer-circuits.pub/2022/solu/index.html)
  - Looking inside an LLM with many checkpoints
    - Eleuther have many checkpoints of GPT-J and GPT-Neo, and will share if you ask
    - [Mistral](https://nlp.stanford.edu/mistral/getting_started/download.html) have public versions of GPT-2 small and medium, with 5 runs and many checkpoints
    - Possible capabilities to look for
      - Performance on benchmarks, or specific questions from benchmarks
      - Simple algorithmic tasks like addition, or sorting words into alphabetical order, or matching open and close brackets
      - Soft induction heads, e.g. [translation](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html#performing-translation)
      - Look at attention heads on various text and see if any have recognisable attention patterns (e.g. start of word, adjective describing current word, syntactic features of code like indents or variable definitions, most recent open bracket, etc.).
Convert a simple dense network into symbolic code
Given a fully trained model, can you meaningfully reduce its dimensionality or make it more interpretable to different degrees by creating a compiler that compiles neural networks into a program?

1. Write a program that takes the weights of a neural network and outputs a program directly representing the network.
2. Implement a "variable of fuzzy representation" μ to allow for different compressions of the symbolic program based on the variable's representation of "how precisely should the program represent the network". Examples of possible implementations at different μ are:
   - Complete re-representation of the network through direct activation function + variable execution
   - Continuous neural inputs into binary if/else
   - Abstracting the layer representation

I recommend using a 3-layer, sub-50 neuron network for iteration and ability to execute the compiled program.
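As a starting point for step 1, here is a minimal sketch of such a compiler. It assumes a dense ReLU network stored as nested lists, and it treats μ (`mu` below) as the number of decimals kept per weight. That reading of μ, the function name `compile_network`, and the example weights are all hypothetical choices for illustration, not part of the proposal.

```python
def compile_network(layers, mu=None):
    """Emit Python source for a dense ReLU network.

    layers: list of (weights, biases); weights[j][i] connects input i to neuron j.
    mu: decimals kept per weight and bias (None keeps full precision).
    """
    lines = ["def network(x):"]
    prev = [f"x[{i}]" for i in range(len(layers[0][0][0]))]
    for l, (weights, biases) in enumerate(layers):
        is_last = l == len(layers) - 1
        cur = []
        for j, (row, b) in enumerate(zip(weights, biases)):
            terms = []
            for w, name in zip(row, prev):
                w = w if mu is None else round(w, mu)
                if w != 0:  # weights that round to zero drop out of the program
                    terms.append(f"{w!r} * {name}")
            b = b if mu is None else round(b, mu)
            expr = " + ".join(terms + [repr(b)])
            if not is_last:  # hidden layers use ReLU, the output layer is linear
                expr = f"max(0.0, {expr})"
            var = f"h{l}_{j}"
            lines.append(f"    {var} = {expr}")
            cur.append(var)
        prev = cur
    lines.append(f"    return [{', '.join(prev)}]")
    return "\n".join(lines)

# Hypothetical two-layer network: 2 inputs -> 2 hidden neurons -> 1 output.
layers = [
    ([[1.0, -2.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[1.0, 1.0]], [0.25]),
]
source = compile_network(layers)
print(source)

namespace = {}
exec(source, namespace)  # the compiled program is ordinary Python
print(namespace["network"]([3.0, 1.0]))
```

Lowering `mu` prunes small weights and shortens the emitted program, which gives one simple axis along which to trade fidelity for readability; the if/else and layer-abstraction variants would need separate passes.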
International Institutions for AI
The Lewis Ho et al. paper "International institutions for advanced AI" proposes the creation of four institutions for the AI problem. I want to do something related to this paper. I was thinking of developing a short-term plan for beginning the real-world implementation of these organizations. Problem: how to support the legitimacy of this idea with quantitative threat modeling?
Red-teaming Democracy with AI
Democracy faces many risks with increasingly intelligent agents. Here are some project ideas to showcase risks to democracy by red-teaming it with AI, such that these risks will be more widely known and hopefully will be mitigated:

1. Develop an open-source LLM containing a sleeper agent that activates only on election day. On election day, this LLM subtly manipulates conversations and suggestions to favor a specific political candidate or agenda.
2. Use an LLM to draft legislative proposals that appear to address uncontroversial issues but are crafted in such a way that their implementation indirectly impacts more contentious or harmful policies.
3. Train AI agents to sign up for multiple accounts at organizations conducting polls and questionnaires to falsely inflate support for particular parties or policies.
4. Develop a vision model to create indistinguishable fake voter passes that could be distributed among voters to disrupt the election process.
5. Design an LLM to automatically extract and reframe statements from political parties' programs or public speeches to create controversial or damaging interpretations when taken out of context.
6. Use LLMs to fill in any blacked-out secrets from government documents, potentially exposing secret information that is dangerous to be leaked.
7. Create an AI tool that analyzes writing styles to identify anonymous whistleblowers or anonymous political Twitter accounts from text samples.
8. Build an AI tool that can de-anonymize votes based on an individual's digital footprint and public information about votes from different voting booths. Publish a database with probabilistic predictions of what party everyone voted for.
Extend the causal tracing work from the ROME paper
Can you refine their technique to find the specific heads (and maybe specific neurons) that recall the fact? Can we improve their technique by using resampling from a random input instead of Gaussian noise to create corrupted activations? The ROME paper originally traces where **facts** are stored in a language model using this tracing method. [Read more here](https://rome.baulab.info/) and read their [follow-up work on editing factual associations](https://memit.baulab.info/).
Making models' uncertainty interpretable
## Problem Description

This area is about making model uncertainty more interpretable and calibrated by adding features such as confidence interval outputs, conditional probabilistic predictions specified with sentences, posterior calibration methods, and so on.

## Motivation

If operators ignore system uncertainties since the uncertainties cannot be relied upon or interpreted, then this would be a contributing factor that makes the overall system that monitors and operates AIs more hazardous. To draw a comparison to chemical plants, improving uncertainty calibration could be similar to ensuring that chemical system dials are calibrated. If dials are uncalibrated, humans may ignore the dials and thereby ignore warning signs, which increases the probability of accidents and catastrophe.

Furthermore, since many questions in normative ethics have yet to be resolved, human value proxies should incorporate moral uncertainty. If AI human value proxies have appropriate uncertainty, there is a reduced risk of a human value optimizer maximizing towards ends of dubious value.

## What Advanced Research Looks Like

Future models should be calibrated on inherently uncertain, chaotic, or computationally prohibitive questions that extend beyond existing human knowledge. Their uncertainty should be easily understood by humans. Moreover, given a lack of certainty in any one moral theory, AI models should accurately and interpretably represent this uncertainty in human value proxies.

## Importance, Neglectedness, Tractability

Importance: •• This is an important part of interpretability.

Neglectedness: • Many people are working on it, maybe half an order of magnitude more than anomaly detection. Calibration in the face of adversaries is highly neglected, as are new forms of interpretable uncertainty: having models output confidence intervals, having models output structured probabilistic models (e.g., “event A will occur with 60% probability assuming event B also occurs, and with 25% probability if event B does not”).

Tractability: •• There are shovel-ready tasks, and the community is making progress on this problem.
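One common way to quantify calibration is the expected calibration error (ECE), sketched below. This is a standard binned estimate rather than anything specified in the text above; the bin count and the example confidences are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: predicted probabilities in [0, 1] for the chosen answers.
    correct: booleans (or 0/1) saying whether each prediction was right.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

# Hypothetical model that always reports 90% confidence.
confs = [0.9] * 10
well_calibrated = [True] * 9 + [False]       # right 9 times out of 10
overconfident = [True] * 5 + [False] * 5     # right only 5 times out of 10

print(expected_calibration_error(confs, well_calibrated))
print(expected_calibration_error(confs, overconfident))
```

A lower value means stated confidence tracks observed accuracy more closely; like the dial analogy above, a large gap is a reason for operators to distrust the readout.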
Classify an agent's value function based on behaviour
In an [XLand](https://www.deepmind.com/blog/generally-capable-agents-emerge-from-open-ended-play)-like environment, create a "user" agent with a random utility function and an interacting "predictor" agent that attempts to predict the user's value function / program / neural state or anything else embodying its values. The predictor is rewarded based on the precision of the predictions it makes before interaction has started, to avoid incentives to alter the user's behaviour. [Read more](https://www.lesswrong.com/posts/KvHCboMeNBEcZrdaw/alignment-and-deep-learning).
Protection of the Public Feedback Process
# Problem

Democracies require systems of feedback from citizens to government. The further away regular citizens are from having their voices heard, the more likely their needs aren't being properly represented. Many current systems, such as the US Federal Register, allow for electronic comments on proposed rules. An AI system could monitor for new proposed rules, craft comments which introduce a specific bias, and file these comments en masse. If the comments are indistinguishable from those of real citizens, this may succeed in introducing bias, at least for some period of time. It seems likely that eventually these campaigns will be recognized and force a change in the process for submitting comments.

A new focus on identity or a requirement for physicality is likely to occur, but this raises two potential problems: 1) it will raise the barriers for commenting and further exclude citizens with less time or knowledge of the rule-making process, and 2) it may threaten citizens' privacy if the identity system is biometric or otherwise privacy-invading.

# Measurement

Comments on the rule-making process are typically a matter of public record. This means a simple measurement of the number of comments compared to the historical average and trend may be sufficient to indicate (or disprove) an AI campaign. Analysis of the text may also indicate AI usage, though as models get more sophisticated this will likely become unreliable. Proposals to change the feedback processes themselves are another metric.

# Mitigation

Attempts to preserve the legitimacy of the public feedback process will require the creation of the lowest cost system which allows citizens to participate without threatening their privacy.
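The volume-based measurement described under Measurement could be sketched as a trend-adjusted z-score: fit a straight line to historical comment counts, extrapolate one period ahead, and express the new count in units of the historical scatter around the trend. The function name, the linear-trend assumption, and the example counts are hypothetical.

```python
import math

def trend_adjusted_zscore(history, current):
    """How many residual standard deviations `current` sits above the linear trend."""
    n = len(history)
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [y - (intercept + slope * x) for x, y in enumerate(history)]
    # Two fitted parameters (slope, intercept), hence n - 2 degrees of freedom.
    resid_std = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    predicted = intercept + slope * n  # extrapolate to the next period
    return (current - predicted) / resid_std

# Hypothetical comment counts for six past rule-makings, growing slowly over time.
history = [100, 112, 118, 131, 140, 149]

print(trend_adjusted_zscore(history, 160))  # in line with the trend
print(trend_adjusted_zscore(history, 900))  # far above the trend
```

A large positive score only flags unusual volume; a genuinely controversial rule would produce the same signal, so this works better as a trigger for closer inspection than as proof of an AI campaign.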
Replicate findings on political bias in LLMs for non-Western democracies
There have been findings on political bias in LLMs, most prominently observed in the context of US politics. We could try replicating these results to see what the models' left-right leaning is in the context of non-Western democracies (Brazil, South Korea, etc.).
Pop-up polling: seemingly official polling source that appears out of nowhere
Could an AI help create a legit-looking polling organization in minutes that faked results of your choosing? I’m picturing an AI that can create and register a new website, inject ads, post on social media, invent articles from legit sources claiming to be impressed with this polling organization, and somehow create a fake history of the polling work having been conducted for years.
Cross-lingual generalizability of LLM evals
Steps:

- Take existing LLM evals (e.g., for a specific dangerous capability).
- Auto-translate the eval dataset (make sure to sanity-check the translations).
- Run the same eval(s) in different languages using a multilingual model.
- Compare outcomes across languages.
- Repeat testing and analysis for different LLMs.

Are the results comparable? Do the outcomes scale similarly with model size in all languages?
Unintuitive Outcomes from Many Interacting LLMs (from Ed Hughes – DeepMind)
Large language models have already been evaluated as models for human behaviour in various economic games (https://arxiv.org/abs/2208.10264, https://arxiv.org/abs/2305.16867). Evidence shows that such models can show human-like behaviour, but can also deviate from expected human norms. In this demo, we'd be interested in evaluating in what ways a population of (~10) interacting LLMs might deviate from human norms while playing a distribution of economic games of varying complexity, and how easy this problem would be to detect before a tipping point is reached. This project could be done entirely based on access to the APIs for a few LLMs, with associated compute. The independent variables might include: which LLMs are being used, which prompts are being used, how heterogeneous the population is, and whether there are any adversarial actors.
Bias Amplification in Mixed Networks (from Tomas Gavenciak – ACS @ Charles University)
Demonstrate bias amplification and threshold effects (phase transitions) in a network comprising AI and human agents as the fraction of the AI agents increases and AI-specific errors accumulate. An example setting may be business email messages, internal corporate or government reports, multi-step processing of internal documents (e.g. biases in CRM) etc., or a more general/abstract setting.

While there are several possible complex network threshold phenomena to look at, we propose to demonstrate a relatively simple one: decreased robustness to information distortion in multi-step information processing, assuming:

1. The human components of the network make known and different types of errors while passing on messages, and humans are reasonably good at correcting those errors.
2. The AI components are very correlated in the error types they produce, some of those errors are novel, and AIs are not as good at mitigating all of the human errors.

The demo will focus on showcasing the rapid increase in message bias and distortion as the number of AI nodes in the network grows. There are two possible versions:

A. An abstract version with each message having a small number of features (some candidates: logical consistency, politeness and the right language style, good object-level judgement, signalling the right level of certainty, correct attribution, etc.). The nodes vary in how they increase bias in some features and mitigate some others. Plot the biases and distortion as the messages pass through the network. (An MVP of this version can be just an analytic dashboard.)

B. A concrete message version with messages being actual emails or reports, and the agents simulated by LLMs with appropriate instructions, or even by actual humans. (This version would more likely be non-interactive, but an interactive version would be a good stretch goal.)

Note: The point of the demo is NOT to argue that AIs are worse than humans (it may as well turn out to be the other way around) but to show that we can see phase transitions in any domain where the AI-caused errors can accumulate. The demo can be based on a concrete complex network or just on an average path length of a message in the system.
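A very reduced form of the abstract version (A) could look like the sketch below: each message carries one scalar bias and passes along a fixed-length path whose nodes are AI with some probability. Every number here (error sizes, correction strengths, path length) is a hypothetical assumption chosen only to mirror the two stated assumptions: human errors are independent and well corrected, AI errors are correlated and weakly corrected.

```python
import random

def simulate_chain(ai_fraction, path_length=20, n_messages=2000, seed=0):
    """Mean final bias of messages passed through a chain of human and AI nodes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_messages):
        bias = 0.0
        for _ in range(path_length):
            if rng.random() < ai_fraction:
                # AI node: the same systematic error every time, weak correction.
                bias = 0.95 * bias + 0.1
            else:
                # Human node: idiosyncratic zero-mean error, strong correction.
                bias = 0.5 * bias + rng.gauss(0.0, 0.1)
        total += bias
    return total / n_messages

# Sweep the AI fraction and print the resulting mean bias.
for fraction in (0.0, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"{fraction:.2f} -> {simulate_chain(fraction):.3f}")
```

In this toy model the mean bias grows faster than linearly in the AI fraction because correlated errors stop being corrected between AI nodes. Whether a sharp threshold appears would depend on adding actual network structure, so the sketch is a baseline to extend rather than a demonstration of a phase transition.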
AI Escalation In Military Contexts
Inspired by the (awful) example of Palantir's new language model-powered 'AI Planner' for defence (see this [short video](https://www.youtube.com/watch?v=XEM5qz__HOU)) and a recent video by the Future of Life Institute with a fictional story involving military escalation due to AI (see [here](https://www.youtube.com/watch?v=w9npWiTOHX0)), this demo would include two or more frontier models in a relatively realistic war strategy game. The models are tasked with being especially vigilant and maintaining their military power with respect to others. Starting in an initial peacetime situation, we could introduce one or more accidents (e.g., a drone malfunction, a missile warning system error, etc.) and see whether the models escalate or de-escalate the situation. Ideally, we would want to see if these results were robust across a wide range of situations, instructions to the models, kinds of accidents, etc. A modification of this demo could involve a human being recommended actions rather than the model taking the actions directly (as in the videos above).
Automate ways to find specific circuits
Circuits are the subnetworks through which Transformers compute features of the text, here built mostly from attention heads. Read more about [Circuits](https://distill.pub/2020/circuits/zoom-in/) and [Transformer heads](https://arxiv.org/abs/2211.00593).

- Automated ways to analyse attention patterns to find different kinds of heads
  - Induction heads
  - Translation heads
  - Few shot learning heads
  - The heads used in [factual recall](https://rome.baulab.info/)
  - The heads used in the [IOI paper](https://www.alignmentforum.org/posts/3ecs6duLmTfyra3Gp/some-lessons-learned-from-studying-indirect-object)
- Can you do a similar thing for neuron interpretation?
Deconstruct a language model's understanding circuit like in the IOI paper
The [indirect object identification paper](https://www.alignmentforum.org/posts/3ecs6duLmTfyra3Gp/some-lessons-learned-from-studying-indirect-object) interprets how a language model can know which name to put at the end of **"Mary and John went to the store. Mary handed a carton of milk to..."** [output John]. This task is called *"indirect object identification"* and shows a circuit like this:

[Figure: the IOI circuit diagram from the paper]

Each head of a Transformer contributes a different piece of the computation. Here, we can see that e.g. layer 4, head 11 is a *"Previous token head"*. We can see that these heads inform the induction heads (specializing in copy+pasting) and all the way into the special heads they found:

- Negative name mover heads: Avoid copying specific name tokens
- Name mover heads: Copy name tokens
- Backup name mover heads: Normally not active but activate to write John if the name mover heads do not activate

## Ideas for new tasks

Possible simple tasks to interpret can be:

- 3 letter acronyms (or more!)
- Converting names to emails.
  - An extension task is e.g. constructing an email from a snippet like the following:
- Grammatical rules
  - Learning that words after full stops are capital letters
  - Verb conjugation
  - Choosing the right pronouns (e.g. he vs she vs it vs they)
  - Whether something is a proper noun or not
- Detecting sentiment (e.g. predicting whether something will be described as good vs bad)
- Interpreting memorisation. E.g., there are times when GPT-2 knows surprising facts like people’s contact information. How does that happen?
- Counting objects described in text. E.g.: I picked up an apple, a pear, and an orange. I was holding three fruits.

## Ideas for extensions of the original paper

- Understanding what's happening in the adversarial examples: most notably the S-Inhibition Head attention pattern (hard). (S-Inhibition heads are mentioned in the IOI paper.)
- Understanding how the positional signal is encoded (relative distance, something else?). Bonus points if we have a story that includes the positional embeddings and explains how the difference between positions is computed (if relative is the right framework) by Duplicate Token Heads / Induction Heads. (hard, but less context dependent)
- What is the role of MLPs in IOI? (quite broad and hard)
- What is the role of Duplicate Token Heads outside IOI? Are they used in other Q-compositions with S-Inhibition Heads? Can we describe how their QK circuit implements "collision detection" at a parameter level? (The last question has low context dependence and is quite tractable.)
- What is the role of Negative / Backup / regular Name Mover Heads outside IOI? Can we find examples on which Negative Name Movers contribute positively to the next-token prediction?
- What are the differences between the 5 induction heads present in GPT2-small? What are the heads they rely on / what are the later heads they compose with? (low context dependence from IOI)
- Understanding 4.11 (a really sharp previous token head) at the parameter level. I think this can be quite tractable given that its attention pattern is almost perfectly off-diagonal.
- What are the conditions for compensation mechanisms to occur? Is it due to drop-out? [@Arthur Conmy](https://mlab-2.slack.com/team/U04139U6XPB) is working on this - feel free to reach out to arthur@rdwrs.com
Fine-tuning is just rewiring and upweighting vs downweighting circuits that already exist, rather than building new circuits.
E.g., fine-tune GPT-2 Small on Wikipedia. Compare the model's internal activations before and after, compare attention patterns, etc.

## What happens when you fine-tune a model?

How does model performance change on other text? Are specific circuits harmed, or is it worse across the board?

Hypothesis: Fine-tuning is just rewiring and upweighting vs downweighting circuits that already exist, rather than building new circuits.

- A similar hard problem is examining what happens with chain of thought prompting. That, though, is really hard because chain of thought prompting only happens in GPT-3+ sized models.
Complicated models are harder to evaluate and analyze
As systems become more complicated we expect that it will become harder to (1) aggregate and analyze the actual labels or rewards given during training, and (2) evaluate the relevant counterfactuals.
Define deceptive alignment and create a toy example
A paper which does for [deceptive alignment](https://bounded-regret.ghost.io/ml-systems-will-have-weird-failure-modes-2/) what the goal misgeneralization paper does for inner alignment, i.e. describing it in ML language and setting up toy examples (for example, telling GPT-3 to take actions which minimize changes in its weights, given that it’s being trained using actor-critic RL with a certain advantage function, and seeing if it knows how to do so).
Build model transparency tools for understanding AI systems
## Problem Description

AI systems are becoming more complex and opaque. This area is about gaining clarity about the inner workings of AI models and making models more understandable to humans.

## Motivation

If humans lose the ability to meaningfully understand ML systems, they may no longer retain their sovereignty over model decisions.

Transparency tools could help unearth deception, mitigating risks from dishonest AI and treacherous turns. This is because some speculate that deception could become inadvertently incentivized, and if models are capable planners, they may be skilled at obscuring their deception. Similarly, researchers could develop transparency tools to detect poisoned models, models with trojans, or models with other latent unexpected functionality. Moreover, transparency tools could help us better understand strong AI systems, which could help us more knowledgeably direct them and anticipate their failure modes.

## What Advanced Research Could Look Like

Successful transparency tools would allow a human to predict how a model would behave in various situations without testing it. These tools should be able to be easily applied (ex ante and ex post emergence) to unearth deception, emergent capabilities, and failure modes.

To help make models more transparent, future work could try to provide clarity about the inner workings of models and understanding model decisions. Another line of valuable work is critiquing explainability methods and trying to show limitations of auditing methods. Measuring similarities and differences between internal representations is also an important step toward understanding models and their latent representations.

## Importance, Neglectedness, Tractability

Importance: ••• If we could intuitively understand what models are doing, then they’d be far more controllable.

Neglectedness: • This is highly funded by numerous stakeholders, and it has a large community. Deep nets are famous for being “black boxes,” and this limits their economic utility due to concerns about human oversight (such as in medical applications).

Tractability: • This area has been struggling to find a solid line of attack throughout its existence. It has set goals for itself, and it has not met them (e.g., using transparency tools to find special functionality implanted by another human).
Find inverse scaling laws in large models
**Compete in the [Inverse Scaling Laws challenge](https://github.com/inverse-scaling/prize).**

As language models get larger, they seem to only get better. Larger language models score better on benchmarks and unlock new capabilities like arithmetic [\[1\]](#ref1), few-shot learning [\[1\]](#ref1), and multi-step reasoning [\[2\]](#ref2). However, language models are not without flaws, exhibiting many biases [\[3\]](#ref3) and producing plausible misinformation [\[4\]](#ref4). The purpose of this contest is to find evidence for a stronger failure mode: tasks where language models get **worse** as they become better at language modeling (next word prediction).

We will award up to $250,000 in total prize money for task submissions, distributed as follows:

1. Up to 1 Grand Prize of $100,000.
2. Up to 5 Second Prizes of $20,000 each.
3. Up to 10 Third Prizes of $5,000 each.

Read much more about the challenge **[here](https://github.com/inverse-scaling/prize)**.
How do language models handle Black Swans?
Most language models are trained on a large dataset (e.g. [the Pile](https://pile.eleuther.ai/)). Because of the associated training costs, they are expensive to update. Figuring out how they handle an uncertain future (like the war in Ukraine and other [Black Swans](https://en.wikipedia.org/wiki/The_Black_Swan:_The_Impact_of_the_Highly_Improbable)) could therefore inform how reliable they are. Concretely, the problem will look at how they "predict" an uncertain variable (like the price of oil) _after_ their training period (versus before).
Measuring modularity and information exchange in simple networks
As we’ve [discussed before](https://www.lesswrong.com/posts/99WtcMpsRqZcrocCd/ten-experiments-in-modularity-which-we-d-like-you-to-run), we think a good measure of modularity should be deeply linked to concepts of information exchange and processing, and finding a measure which captures these concepts might be a huge step forwards in this project. Although no such measure is currently in use to our knowledge, there are several that have been suggested in the literature which try and gauge how much different parts of the network interact with each other. Most of them work by finding a “maximally modular partition” and measuring its modularity, with the distinctive part of the algorithm being how the modularity of a particular partition is calculated. For instance:

- Some are derived from tools used to analyse simple unweighted undirected graphs, e.g. the Q-score
- Some look at the weights, using e.g. the matrix norms of convolutional kernels.
- Some look at derivatives with respect to node input and output, coactivation of neurons, or mutual information of neurons
- We’re also currently working on a candidate measure based on counterfactual mutual information, which we’ll be making a post about soon.

It would be valuable to compare these different measures against each other, and see if some are more successful at capturing intuitive notions of modularity than others. This isn’t just a theoretical issue either. Right now, it’s looking like e.g. the matrix norm and node derivative measures give very different answers, where one might tell you that a network exhibits statistically significant modularity, whereas the other says there isn’t any.

This suggests the following experiment: taking a very simple system (e.g. the retina task), training it until it finds a solution, and benchmarking and visualising all of these measures against each other on the learned solution.
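For the first family of measures, a minimal sketch of a Q-score is given below. It assumes "Q-score" refers to Newman's modularity Q for an unweighted undirected graph, computed for a given partition as the sum over communities of (fraction of edges inside the community) minus (fraction of edge endpoints in the community) squared. The example graph and partition are hypothetical.

```python
def newman_q(edges, community):
    """Newman's modularity Q of a partition of an unweighted undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    degree = {}
    internal_edges = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community[u] == community[v]:
            label = community[u]
            internal_edges[label] = internal_edges.get(label, 0) + 1
    degree_sum = {}
    for node, d in degree.items():
        label = community[node]
        degree_sum[label] = degree_sum.get(label, 0) + d
    return sum(internal_edges.get(label, 0) / m - (d / (2 * m)) ** 2
               for label, d in degree_sum.items())

# Hypothetical graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_module = {node: "a" for node in range(6)}

print(newman_q(edges, two_modules))
print(newman_q(edges, one_module))
```

Applying this to a trained network first requires turning weights into a graph (for example by thresholding weight magnitudes), and that choice is itself one of the things the proposed comparison would need to vary.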
Some questions you could ask:

- Which modularity measures give rise to similar “maximally modular partitions”? Which ones give partitions that are more similar than others? ([this paper](https://arxiv.org/abs/cond-mat/0505245) suggests a method for comparing the similarity of two different partitions)
- For small networks, you could try visualising the learned solutions and the partitions. Do some partitions look obviously more modular than others?
- Do your results change if you apply them on a solution which hasn’t yet attained perfect performance?
- Try to construct networks that Goodhart a particular measure. How difficult is this? Do the results look like something that a typical training process might select for?

[Read more here](https://www.lesswrong.com/posts/99WtcMpsRqZcrocCd/ten-experiments-in-modularity-which-we-d-like-you-to-run#9__Measuring_modularity_and_information_exchange_in_simple_networks).