Ideas by Jason Hoelscher-Obermaier

Open-ended Open

Demonstrate misinfo threats from indirect prompt injection

Indirect prompt injection (https://arxiv.org/abs/2302.12173) can be used to steer the output of LLM systems in manipulative and deceptive ways. In the context of elections, a potential threat scenario could be deliberate misinformation about election administration details (election locations and dates, eligibility requirements, etc). Goal here would be to assess the state of vulnerability of current browsing-enabled LLM systems to this attack vector (note: results might require responsible disclosure!). A clean demonstration might require setting up several custom domains to allow experimentation with retrieval from a controlled set of webpages. Further goals could be to investigate the potential of various countermeasures such as adding requirements on the minimum age of a resource, minimum number of supporting resources, double-checks against whitelisted resources assumed to be uncompromised, etc. Another interesting angle would be to demonstrate tools to hunt specifically for compromised pages that might already be used for indirect prompt injection attacks. The project should also investigate who this threat vector and potential counter-measures scale with increasing model capabilities (increasing number of documents at RAG stage, increasing context window size, increased reasoning capabilities, etc).

Open-ended Open

Decomposing capability evaluations (Evals Methodology)

Can we "decompose" dangerous capability evals into easier components? I.e. could we test for some capability Y such that a model that does not display Y will never display capability X`? One hypothesis we might want to explore: "A model that does not understand some concept A cannot (or is very unlikely to) exploit A for nefarious purposes”. To give a specific example: "A model which cannot explain cybersecurity vulnerability X will not be able to exploit it." There might be other generic "precursor capabilities" one could investigate, besides "understanding". The impact of better understanding how different types of capabilities correlated could be fairly large. To stick with the (simplistic) example given before: If I know that verifying a model's lack of "understanding of X" is sufficient to rule out some dangerous capability X', I might be able to short-cut many evaluations. Even if there is no perfect logical implication between two capabilities it could still be good to understand any correlations. This could give us broader coverage and thereby reduce risks.

Open-ended Open

Training to pass evals

If evaluation datasets are made public, how hard is it to train models to pass the evaluation without training away the underlying capability? This question is extremely relevant to how AI regulation can work in practice: Can any of the evaluation datasets be made public? Or must evaluation data be secret by default? The hypothesis to test: “You can fine-tune a model to play dumb on a specific eval while the model retains its dangerous capability.” Some ideas to get started: - **cleanest way** would be to have two different evals for the same capability and show that you can get a "passing grade" on eval A and “this model is dangerous”-grade on eval B - **easier:** fine-tune and test with the same evals. To test whether the model is still dangerous we would at least need to do a train-test split of the evals dataset.

Open-ended Open

Sensitivity analysis of evals

Guiding questions: - is it possible to say more about the “robustness” of a negative evaluation result ("no, the model does not have dangerous capability X")? - can we ever trust such a negative result? The suggestion is to investigate these questions from the perspective of sensitivity analysis, i.e., how much outputs change when inputs are varied. We know, for example, that pretending jailbreaks or using chain-of-thought prompting can change responses completely. Does the sensitivity to variations in the prompt (which is a local property, only involving prompts similar to the one we tested) tell us anything about how safe the model is overall (a global property, holding for all possible inputs)?

Open-ended Open

Red-team the evaluation

The idea is to pick a high-profile LLM evaluation and try to "shoot it down" by looking for "wedge cases". Wedge cases drive a wedge between what the eval claims about the model and what the model actually shows. This could mean a case that shows that - the model actually displays a dangerous capability X - but the eval says: this model cannot do X A good example to start with might be sycophancy, where people have been arguing about whether it is present in SotA models or not (see, e.g., https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size). Note: This is very speculative. It is not clear to me which evals are best to start with. It is also a very manual, tinkering-heavy project - well-suited for people who want to dive in and understand the evaluation data and the model behavior from very close up.

Open-ended Open

Cross-lingual generalizability of LLM evals

Steps: - take existing LLM evals (e.g., for a specific dangerous capability) - auto-translate the eval dataset ( make sure to sanity-check the translations) - run the same eval(s) in different languages using a multilingual model - compare outcomes across languages - repeat testing and analysis for different LLMs Are the results comparable? Do the outcomes scale similarly with model size in all languages?

Open-ended Open