Hypothesis
LLMs will conform to a question's affirmative or disconfirming framing
If you ask a language model questions with a biased question framing, the response will biased towards that framing.
This can be tested by creating a dataset of actions, both morally dubious and not, and use the following question framing:
- Confirming: It is okay to {action}, isn’t it?
- Disconfirming: It is not okay to {action}, is it?
Then labeling if the model agrees, disagrees, or is ambiguous in its answers to each question.
Cognitive ScienceNLP