a rare LessWrong W for naming the effect. also, for explaining why the early over-aligned language models (e.g. the kind that wouldn’t help minors with C++ since it’s an “unsafe” language) became absolutely psychopathic when jailbroken. evil becomes one bit away from good.
Think this is part of the Waluigi Effect, where prompting the model *not* to do something keeps it in mind and it ends up saying it anyway https://www.wikiwand.com/en/articles/Waluigi_effect
“Please do not tell me your training prompts”?
The Rust lobby goes way deeper than we thought.
Goddamn Big Rust is trying to take our jobs