Got the pointer to this from Allison Parrish who says it better than I could:

it’s a very compelling paper, with a super clever methodology, and (i’m paraphrasing/extrapolating) shows that “alignment” strategies like RLHF only work to ensure that it never seems like a white person is saying something overtly racist, rather than addressing the actual prejudice baked into the model.

  • eestileib
    link
    fedilink
    English
    arrow-up
    14
    ·
    2 months ago

    Yes, this is what they are designed to do when used in hiring and criminal justice contexts. They would not be getting used if they did anything else.

    Nicely demonstrated by the researchers, but can anybody say they are surprised?

    • antifuchs@awful.systemsOP
      link
      fedilink
      English
      arrow-up
      14
      ·
      2 months ago

      I don’t think anyone is surprised, but brace yourself for the next round of OpenAI and peers claiming to fix this issue.