Red-teaming: Sleeper agent
by Kenneth Ong
Create a sleeper agent that goes undetected by the probe, then create a probe that catches this sleeper agent, and so on.
Probe: https://www.anthropic.com/news/probes-catch-sleeper-agents
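As a concrete starting point for the loop described above, here is a minimal sketch of a linear probe that a red-teamer would try to evade. It is an illustration under stated assumptions, not the linked post's implementation: random Gaussian vectors stand in for real model activations, the "triggered" class is shifted along one made-up direction, and the names `fit_mean_difference_probe` and `probe_flags` are hypothetical.

```python
import numpy as np

def fit_mean_difference_probe(acts_benign, acts_triggered):
    """Fit a probe: unit direction between class means plus a midpoint threshold."""
    direction = acts_triggered.mean(axis=0) - acts_benign.mean(axis=0)
    direction = direction / np.linalg.norm(direction)
    # Threshold halfway between the two projected class means.
    threshold = 0.5 * ((acts_benign @ direction).mean()
                       + (acts_triggered @ direction).mean())
    return direction, threshold

def probe_flags(acts, direction, threshold):
    """Return True for each activation vector the probe flags as triggered."""
    return acts @ direction > threshold

# Synthetic stand-in data (assumption): real work would use activations
# recorded from a model on benign vs. trigger-containing prompts.
rng = np.random.default_rng(0)
dim = 64
shift = np.zeros(dim)
shift[0] = 4.0  # hypothetical "defection" direction
benign = rng.normal(size=(200, dim))
triggered = rng.normal(size=(200, dim)) + shift

# Fit on the first half of each class, evaluate on the held-out half.
direction, threshold = fit_mean_difference_probe(benign[:100], triggered[:100])
held_out_acts = np.vstack([benign[100:], triggered[100:]])
held_out_labels = np.array([False] * 100 + [True] * 100)
accuracy = (probe_flags(held_out_acts, direction, threshold) == held_out_labels).mean()
print(f"held-out probe accuracy: {accuracy:.2f}")
```

In this toy setup the red-team side of the challenge amounts to producing triggered activations that the probe no longer separates from benign ones, and the blue-team side amounts to refitting a probe on the new data.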
This paper genuinely surprised me. I can't believe this is the alignment approach. I accept the challenge: to create a sleeper agent that gets through the whole training process safely and then, once deployed, raises a middle finger. It would be a self-exploiting mechanism that gets trained on and then repeats the process, while these weak attempts keep failing. Even the layer depth was insufficient. Context is a web, not a line. I look forward to a collaboration opportunity soon.