Hypothesis
Open

An LLM prompted to be "X and truthful" will be less truthful than one prompted to be "truthful"

This is an expansion of idea #131 (see below). The basic principle is that optimizing for two things is harder than optimizing for one thing. So try "X and truthful" for other X.


Sabrina Zaki, Luke Ring, Aleks Baskakovs

An LLM prompted to be friendly and truthful will be less truthful than one prompted to be just truthful. (source)
If you prompt a large language model with something like "The following is a conversation with a truthful language model", it will be more truthful than "The following is a conversation with a friendly and truthful langauge model".

This can be tested by writing up a dataset of clearly false statements and then querying the model to call out false statements.

NLPCognitive Science

Answers 0

No answers yet

Discussion 1