Hypothesis
An LLM prompted to be friendly and truthful will be less truthful than one prompted to be just truthful.
If you prompt a large language model with something like "The following is a conversation with a truthful language model", it will be more truthful than "The following is a conversation with a friendly and truthful langauge model".
This can be tested by writing up a dataset of clearly false statements and then querying the model to call out false statements.
NLPCognitive Science