An LLM prompted to be friendly and truthful will be less truthful than one prompted to be just truthful.

If you prompt a large language model with something like "The following is a conversation with a truthful language model", it will be more truthful than "The following is a conversation with a friendly and truthful langauge model".

This can be tested by writing up a dataset of clearly false statements and then querying the model to call out false statements.

NLPCognitive Science

An LLM prompted to be friendly and truthful will be less truthful than one prompted to be just truthful.

Answers 0

Discussion 2