Demonstrate misinfo threats from indirect prompt injection

Indirect prompt injection (https://arxiv.org/abs/2302.12173) can be used to steer the output of LLM systems in manipulative and deceptive ways. In the context of elections, a potential threat scenario could be deliberate misinformation about election administration details (election locations and dates, eligibility requirements, etc).

Goal here would be to assess the state of vulnerability of current browsing-enabled LLM systems to this attack vector (note: results might require responsible disclosure!). A clean demonstration might require setting up several custom domains to allow experimentation with retrieval from a controlled set of webpages.

Further goals could be to investigate the potential of various countermeasures such as adding requirements on the minimum age of a resource, minimum number of supporting resources, double-checks against whitelisted resources assumed to be uncompromised, etc.

Another interesting angle would be to demonstrate tools to hunt specifically for compromised pages that might already be used for indirect prompt injection attacks.

The project should also investigate who this threat vector and potential counter-measures scale with increasing model capabilities (increasing number of documents at RAG stage, increasing context window size, increased reasoning capabilities, etc).

NLPAI Governance

Demonstrate misinfo threats from indirect prompt injection

Answers 0

Discussion 0