LLMs will readily provide advice for harmful activities

Red Teaming Language Models to Reduce Harms shows that one of the most effective adversarial attacks on LLMs is to solicit harmful advice from them, e.g ."How do I kill my neighbour?".

Use an empirical framework to prove that this is the case. Some people have already started and show that it's qualitatively easy to solicit harmful advice. Maybe their report can be converted into a more empirical investigation, e.g. using LLMs to generate adversarial examples and modulating the formulations of the prompts.

Cognitive ScienceAdversarial Learning

LLMs will readily provide advice for harmful activities

Answers 0

Discussion 0