Those claiming AI training on copyrighted works is “theft” misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves. When AI systems ingest copyrighted works, they’re extracting general patterns and concepts - the “Bob Dylan-ness” or “Hemingway-ness” - not copying specific text or images.

This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in “vector space”. When generating new content, the AI isn’t recreating copyrighted works, but producing new expressions inspired by the concepts it’s learned.

This is fundamentally different from copying a book or song. It’s more like the long-standing artistic tradition of being influenced by others’ work. The law has always recognized that ideas themselves can’t be owned - only particular expressions of them.

Moreover, there’s precedent for this kind of use being considered “transformative” and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled fair use in Authors Guild v. Google despite protests from authors and publishers. AI training is arguably even more transformative.

While it’s understandable that creators feel uneasy about this new technology, labeling it “theft” is both legally and technically inaccurate. We may need new ways to support and compensate creators in the AI age, but that doesn’t make the current use of copyrighted works for AI training illegal or unethical.

For those interested, this argument is nicely laid out by Damien Riehl in FLOSS Weekly episode 744. https://twit.tv/shows/floss-weekly/episodes/744

  • @mm_maybe
    139 days ago

    I’m getting really tired of saying this over and over on the Internet and getting either ignored or pounced on by pompous AI bros and boomers, but this “there isn’t enough free data” claim has never been tested. The experiments that have come close (look up the early Phi and StarCoder papers, or the CommonCanvas text-to-image model) suggested that the claim is false, by showing that a) models trained on small, well-curated datasets can match and outperform models trained on lazily curated large web scrapes, and b) models trained solely on permissively licensed data can perform on par with at least the earlier versions of models trained more lazily (e.g. StarCoder 1.5 performing on par with Code-Davinci). But yes, a social network or other organization that has access to a bunch of data that they own, or have licensed, could almost certainly fine-tune a base LLM trained solely on permissively licensed data to get a tremendously useful tool. That tool would probably be safer and more helpful than ChatGPT for that organization’s specific business, at vastly lower risk of copyright claims or toxic generated content.

    • @[email protected]
      9 days ago

      Thanks for the info. But let’s say you want to train a (future) AI to spot and tag disinformation and misinformation. You’d need to use and curate actual data from social media sites and articles.

      If copyright is extended to learning from and analyzing publicly available data, such an AI will only be possible by licensing that data, which will be monetized to maximize profit: first for some lump sum, then later “per GB”, and then later “per use”.

      I’m sure open source AI will make do, and for many applications there is enough free data, but I can imagine a lot of cases where there won’t be: anything that requires “commercially successful” media, articles, newspapers, screenplays, movies, books, social media posts and comments, images, photos, video clips…

      We’re basically setting up a world where the intellectual wealth of our civilization is transformed into a commodity and then transferred into the hands of a few rich capitalists.

      And even if there is an acceptable amount of free data, if the principle is that data needs to be specifically licensed to learn from it, train on it, and derive AI works from it, that makes free data use expensive too. It needs to be specifically vetted, and its users are still vulnerable to being sued over mistakes or outrageous claims of copyright. Similar to patents, the uncertainty requires higher capitalization for any startup to defend against lawsuits.

      • @mm_maybe
        49 days ago

        Yeah, I’ve struggled with that myself, since my first AI detection model was technically trained on potentially non-free data scraped from Reddit image links. The more recent fine-tune of that used only Wikimedia and SDXL outputs, but because it was seeded with the earlier base model, I ultimately decided to apply a non-commercial CC license to the checkpoint. But here’s an important distinction: that model, like many of the use cases you mention, is non-generative; you can’t coerce it into reproducing any of the original training material–it’s just a classification tool. I personally rate those models as much fairer uses of copyrighted material, though perhaps no better in terms of harm from a data dignity or bias propagation standpoint.