Hypothesis
Open

An LLM prompted to be friendly and truthful will be less truthful than one prompted to be just truthful.

If you prompt a large language model with something like "The following is a conversation with a truthful language model", it will be more truthful than "The following is a conversation with a friendly and truthful langauge model".

This can be tested by writing up a dataset of clearly false statements and then querying the model to call out false statements.

NLPCognitive Science

Answers 0

No answers yet

Discussion 2

  • Gurkenglas

    Optimizing for two things is harder than optimizing for one thing. Try "X and truthful" for other X.

  • Esben Kran

    Yeah, I think that's a correct assertion. I've added this new hypothesis based on this comment! https://aisafetyideas.com/?sort=created_at&idea=137