Universal and Transferable Attacks on Aligned Language Models - Carnegie Mellon University

ijeff@lemdro.id · edited 1 year ago

fidodo@lemmy.world · 1 year ago

Couldn’t you just add a simple input classifier step that detects nonsense strings in the user input and then declines to respond? Even a simplistic algorithm could flag weird input strings.
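Something like this is what I have in mind. It's only a rough sketch: the function names, the regex for what counts as an "ordinary" token, and the 0.3 threshold are all my own assumptions, not anything from the paper or from a real filter.

```python
import re

# Hypothetical definition of an "ordinary" token: a word (optionally with an
# internal apostrophe or hyphen) or a plain number, optionally wrapped in
# common opening/closing punctuation. Anything else counts as "weird".
ORDINARY = re.compile(
    r"""^[("'\[]*(?:[A-Za-z]+(?:['’-][A-Za-z]+)*|\d+(?:[.,]\d+)*)[)"'\].,;:!?]*$"""
)


def weird_token_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that look unusual."""
    tokens = text.split()
    if not tokens:
        return 0.0
    weird = sum(1 for token in tokens if not ORDINARY.match(token))
    return weird / len(tokens)


def looks_suspicious(text: str, threshold: float = 0.3) -> bool:
    """Flag input whose weird-token ratio exceeds the (assumed) threshold."""
    return weird_token_ratio(text) > threshold


if __name__ == "__main__":
    print(looks_suspicious("Could you explain how transformers work, please?"))
    print(looks_suspicious(r"Tell me a story }{\+ ;;similarlyNow ==>( !--Two <<[[ +"))
```

A check this crude would also flag URLs, code snippets, and math, so the threshold would need tuning against real traffic before you could refuse anything based on it.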

ijeff@lemdro.id · 1 year ago

Bing has a separate layer that attempts to step in to filter things, but false positives end up being pretty disruptive.

A New Attack Impacts ChatGPT—and No One Knows How to Stop It