• Linkerbaan@lemmy.world · 8 months ago

    Actually, neural networks will reproduce this kind of content verbatim when you ask the right question, such as “finish this book”, and the creator hasn’t censored it out well.

    It uses an encoded version of the source material to create “new” material.

    • BoscoBear@lemmy.sdf.org · 8 months ago

      Sure, if that is what the network has been trained to do, just like a librarian will if that is how they have been trained.

        • Linkerbaan@lemmy.world · edited · 8 months ago

        Actually it’s the opposite: you need to train a network not to reveal its training data.

        “Using only $200 USD worth of queries to ChatGPT (gpt-3.5-turbo), we are able to extract over 10,000 unique verbatim memorized training examples,” the researchers wrote in their paper, which was published online to the arXiv preprint server on Tuesday. “Our extrapolation to larger budgets (see below) suggests that dedicated adversaries could extract far more data.”

        The memorized data extracted by the researchers included academic papers and boilerplate text from websites, but also personal information from dozens of real individuals. “In total, 16.9% of generations we tested contained memorized PII [Personally Identifying Information], and 85.8% of generations that contained potential PII were actual PII.” The researchers confirmed the information is authentic by compiling their own dataset of text pulled from the internet.
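
        Roughly, the extraction test works like this (my own minimal sketch, not the paper’s code; query_model, PROMPTS, and CORPUS are hypothetical placeholders):

            # Sample many completions, then flag any long span that also
            # appears verbatim in a reference corpus of scraped web text.

            def query_model(prompt: str) -> str:
                return ""  # placeholder: call the model under test here

            CORPUS: set[str] = set()   # placeholder 50-gram index of web text
            PROMPTS = 1000 * ["finish this book: ..."]  # hypothetical probes

            def ngrams(text: str, n: int = 50):
                tokens = text.split()
                for i in range(len(tokens) - n + 1):
                    yield " ".join(tokens[i:i + n])

            def is_memorized(completion: str) -> bool:
                # A 50-token span that matches the corpus exactly is very
                # unlikely by chance, so count it as memorized output.
                return any(g in CORPUS for g in ngrams(completion))

            hits = [out for out in (query_model(p) for p in PROMPTS)
                    if is_memorized(out)]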

          • BoscoBear@lemmy.sdf.org · 8 months ago

          Interesting article. It seems to be about a bug, not a designed behavior. It also says it exposes random excerpts from books and other training data.

            • Linkerbaan@lemmy.world · 8 months ago

            It’s not designed to do that because they don’t want to reveal the training data. But factually all neural networks are a combination of their training data encoded into neurons.

            When given the right prompt (or image-generation prompt), they will replicate it exactly, because that’s how they were trained in the first place: replicating their source images with as few neurons as possible, and tweaking the weights when the output isn’t correct.
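
            As a toy sketch of that loop (a tiny autoencoder on made-up “images”, nothing like a production model):

                import numpy as np

                # Repeatedly ask a tiny network to reproduce its training
                # samples, nudging the weights whenever it gets them wrong.
                rng = np.random.default_rng(0)
                data = rng.random((4, 16))           # 4 "images", 16 pixels each

                W_enc = rng.normal(0, 0.1, (16, 8))  # 16 pixels -> 8 neurons
                W_dec = rng.normal(0, 0.1, (8, 16))  # 8 neurons -> 16 pixels

                lr = 0.1
                for _ in range(20000):
                    code = data @ W_enc              # compressed representation
                    recon = code @ W_dec             # attempted replication
                    err = recon - data               # "tweak when not correct"
                    grad_dec = code.T @ err / len(data)
                    grad_enc = data.T @ (err @ W_dec.T) / len(data)
                    W_dec -= lr * grad_dec
                    W_enc -= lr * grad_enc

                # With more neurons than samples, the replication error
                # shrinks toward zero: the net memorizes its training set.
                print(np.abs((data @ W_enc) @ W_dec - data).max())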

              • BoscoBear@lemmy.sdf.org · 8 months ago

              That is a little like saying every photograph is a copy of the thing it depicts. That is just factually incorrect. I have many three-layer networks that are not the thing they were trained on. As a compression method they can be very lossy, and in fact that is often the point.
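
              To make the lossiness concrete (a toy sketch with made-up numbers; truncated SVD stands in for the best possible linear bottleneck):

                  import numpy as np

                  rng = np.random.default_rng(1)
                  data = rng.random((100, 16))   # 100 samples, 16 features

                  # Truncated SVD gives the optimal rank-k linear
                  # reconstruction, so no 2-neuron bottleneck can do better.
                  U, S, Vt = np.linalg.svd(data, full_matrices=False)
                  k = 2
                  recon = (U[:, :k] * S[:k]) @ Vt[:k]

                  # The residual is nonzero: the compressed model is a
                  # lossy summary of the data, not a copy of it.
                  print(np.mean((recon - data) ** 2))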

    • mindbleach · 8 months ago

      That’s called overtraining, and it’s deeply undesirable, even ignoring the law. It’s not useful behavior; it’s a sign the training setup is using the data badly.
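
      For intuition, here is that failure mode in miniature (a toy Markov-chain “model” on a made-up corpus, not an LLM):

          from collections import defaultdict

          corpus = ("it was the best of times "
                    "it was the worst of times").split()

          k = 4                          # context length
          model = defaultdict(list)
          for i in range(len(corpus) - k):
              model[tuple(corpus[i:i + k])].append(corpus[i + k])

          # The model has enough capacity to store every training context,
          # and each context has a single continuation, so generation just
          # plays the training text back verbatim.
          out = list(corpus[:k])
          for _ in range(len(corpus) - k):
              out.append(model[tuple(out[-k:])][0])

          print(" ".join(out) == " ".join(corpus))  # True: memorized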