Some of the world’s wealthiest companies, including Apple and Nvidia, are among the many parties that allegedly used scraped YouTube videos as AI training data. The YouTube transcripts were reportedly accumulated through means that violate YouTube’s Terms of Service, and the practice has some creators seeing red. The news was first reported in a joint investigation by Proof News and Wired.

While major AI companies and producers often keep their AI training data secret, heavyweights like Apple, Nvidia, and Salesforce have revealed their use of “The Pile”, an 800GB training dataset created by EleutherAI, and the YouTube Subtitles dataset within it. The YouTube Subtitles training data comprises 173,536 plaintext YouTube transcripts scraped from the site, including 12,000+ videos that have been removed since the dataset’s creation in 2020.
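As a rough illustration of how a subset like this is typically pulled out of The Pile: the dataset is distributed as JSON Lines, where each record carries its raw text plus a `meta` field naming the source subset. The sketch below assumes that layout and assumes the subset label is `"YoutubeSubtitles"`; both the field names and the label string are assumptions for illustration, not confirmed by the article.

```python
import json

def youtube_subtitle_records(lines):
    """Yield parsed records belonging to the (assumed) YouTube Subtitles subset."""
    for line in lines:
        record = json.loads(line)
        # Each Pile record is assumed to tag its origin in meta.pile_set_name.
        if record.get("meta", {}).get("pile_set_name") == "YoutubeSubtitles":
            yield record

# Tiny synthetic sample standing in for a real shard of the dataset.
sample = [
    json.dumps({"text": "welcome back to the channel...",
                "meta": {"pile_set_name": "YoutubeSubtitles"}}),
    json.dumps({"text": "def main(): ...",
                "meta": {"pile_set_name": "Github"}}),
]

matches = list(youtube_subtitle_records(sample))
print(len(matches))  # → 1
```

In a real pass you would stream the (zstd-compressed) shard files line by line instead of holding an in-memory list, but the filtering step is the same.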

Affected parties whose work was purportedly scraped for the training data include education channels like Crash Course (1,862 videos taken for training) and Philosophy Tube (146 videos taken), YouTube megastars like MrBeast (two videos) and PewDiePie (337 videos), and TechTubers like Marques Brownlee (seven videos) and Linus Tech Tips (90 videos). Proof News created a tool you can use to search the full list of YouTube videos allegedly used without consent.

  • BertramDitore@lemmy.world · 4 months ago

    Here’s what I don’t understand: these are the wealthiest corporations in the world. They literally have trillions of dollars at their disposal. Since they clearly believed there was value in the videos they stole, why could they not just ask the creators for permission and, if they consent, pay them a fair fee for access? If they don’t consent, why not just hire a creative to make some more content for them to use? I mean, Apple owns a massive production studio, for fuck’s sake. Tim Cook farts money; I don’t think a thousand-dollar investment in a real person is going to break the bank. They could even order up a whole new show just to train the model.

    Instead, they piss off creatives by stealing their work. Just use your money for once. Invest in content. Everybody would be happier, they’d garner some trust, and nobody’s livelihood would be harmed.

    But no, instead they choose the most devious, underhanded, selfishly shitty way to conduct their business. Fuck these evilcorps.

    • schizo@forum.uncomfortable.business · 4 months ago

      Because the techbros know that licensing is far more expensive than theft.

      Licensing all the content that the AI model they’re trying to shit out needs would cost so much that it’d literally never be profitable, so they’re doing that thing from Fight Club where they assume the cost of the times they’ll get sued and lose will be less than paying anyone reasonable license fees.

      The stupid thing is that, in the US at least, they’re not wrong: in a civil suit over this you have to pay your own lawyer fees, and since this would be a Federal case, that ends up being pretty expensive.

      And even if you win, you’re likely just going to get statutory damages, since proving real actual losses is probably impossible. So you’d be lucky, after a few years in court, to come out ahead at all, having fronted all the legal and other costs in the meantime. Why would you bother?

      It’s a pretty shitty situation that’s being exploited because the remedies are out of the reach of most people who’ve had their shit stolen so that OpenAI can suggest you cover your pizza with glue.

      • BertramDitore@lemmy.world · 4 months ago

        Thank you for the thoughtful answer. This is so frustrating, and is very similar to other situations where megacorps decide that paying fines is cheaper than following the law.

        Another terrible byproduct of all this is the false incentive structure it sets up. Rather than investing in people capable of producing unique and creative work, it incentivizes churning out a greater quantity of shitty content over high-quality stuff, and that will ultimately make the eventual consumer product that’s based on shitty stolen work, well, shitty.

        • schizo@forum.uncomfortable.business · 4 months ago

          It makes people who WANT to make creative content decide maybe they shouldn’t, or do things like disable subtitles so AI won’t steal content that way, which creates a usability issue.

          If you know that your photos, stories, videos, and whatever else are going to be slurped up so someone else can make money on them, sharing becomes less attractive.

          • rekorse@lemmy.world · 4 months ago

            What shocks me is that creative folk aren’t abandoning Google in droves. Even a boycott would make more sense than mildly complaining, posting more videos on YouTube, and waiting for your Google check to arrive.

            At some point the artists need to stand up for themselves; it can’t just be the tech bros on Lemmy shouting about it. It feels a lot like people who hate their jobs but do nothing to find a better one.

            • schizo@forum.uncomfortable.business · 4 months ago

              It’s all about monetization. YouTube is the only credible game in town, and I’m not sure how you fix that.

              The technical hurdles are largely solved: something like PeerTube is good enough, except there’s no clear path to monetization and no clear path to growing an audience.

              If the money and discoverability problems are solved, then sure, I bet a lot of creators will happily leave Google’s services, since it’s been an abusive relationship for a lot of them for some time.

              • rekorse@lemmy.world · 4 months ago (edited)

                Like I said, they should strike. Has everyone forgotten that strikes are almost always short-term sacrifices? Otherwise people would strike for fun…

                There won’t be another place to go until artists go there and build it themselves or demand it be built.

                There’s also a good chance that the monetization scheme people are used to under Google is fiscally irresponsible. People posting on YouTube might need to come to terms with their art being worth less outside Google’s system.

        • subignition@fedia.io · 4 months ago

          We need a corporate death penalty for flagrant and repeated disregard of the law like this.

          Oh you “moved fast and broke things?” Well that included the law, so now we’re liquidating your assets, compensating the injured parties to the fullest extent, and spending whatever’s left over paying to put homeless people in homes.

    • iAmTheTot · 4 months ago

      Rich people don’t become rich or stay rich by spending money they perceive they don’t have to.

    • Kairos@lemmy.today · 4 months ago

      Just because they’re worth a trillion dollars doesn’t mean they have it.

      Apple does apparently have something like $100–200 billion on hand in liquid assets, though.

  • Karyoplasma@discuss.tchncs.de · 4 months ago (edited)

    How this surprises anyone is beyond me. How can one grow up to be an adult and still believe that megacorporations have a sense of fairness and integrity?

  • mindbleach · 4 months ago

    I truly do not understand why anyone gives a shit.

    Someone downloaded subtitles from YouTube. Good, frankly. Fuck API TOS. People will save data that’s sent to them. You can’t serve files to any rando with a browser and pretend they’re a secret. I have used youtube-dl exclusively in lieu of the actual website.

    They compiled it for anyone to train models on. “Anyone” included the few giants who already have oodles of data… like Google, the owners of Youtube. And that’s a problem somehow? “However, this idyllic dream of supporting the little guy with The Pile has become another fuel source for major corporations to train AI, rather than DIYers.” You mean in addition to DIYers. It’s still a big open thing for anyone to use.

    Am I supposed to be mad because of copyright? I don’t even respect copyright for works of art that cost a billion dollars. I’m not getting excited over audience transcripts of some guy reviewing gizmos.

    Models will scan every book in the library, every movie that’s streaming, and every JPG on the internet. No kidding they might scan Youtube videos. Or in this case, the possibly-automated subtitles of Youtube videos.

  • ShadowRam@fedia.io · 4 months ago

    without consent,

    YouTubers still got paid for the AI views.

    What extra compensation do they think they are due?

    A dude learns plumbing from plumbing channels, then goes out and starts a plumbing business. I don’t see why the authors of those videos are entitled to prevent that, or why the plumber needs their permission to do so.

    I’m sure I’ll eat the downvotes from this group who’ll shout FUCK AI no matter what the context is, But in this particular case, I don’t see how this is a problem.

    An entity watched your video and then went and did something that made money. It didn’t copy your video; that’s not how AI works. So copyright doesn’t have a leg to stand on.

    You created a video to garner views and make money. This thing saw your video; you made your money.

    • subignition@fedia.io · 4 months ago

      Dude learns how to do plumbing from plumbing channels, makes his own shittier video series on how to do plumbing made out of clips he didn’t have the rights to from the plumbing channel

      Fixed that for you

      • ShadowRam@fedia.io · 4 months ago (edited)

        made out of clips he didn’t have the rights

        See, and this is where you’re showing your ignorance of how AI currently functions.

        Yes, it’s possible the AI could go and make shittier videos with its new knowledge, as could the novice plumber in the example I gave.

        But the AI isn’t copying clips of any videos.

        It’s not a repository of the videos, pictures, or words it was exposed to that it simply recalls.

        LLMs do not model the world - Sean Carroll

        • subignition@fedia.io · 4 months ago (edited)

          It generates new content that is based on patterns it has acquired from training data. The fact that you can’t readily trace/attribute output to specific parts of training data does not make it permissible for a human to cause the LLM to train on that data without permission of the rights holder, or in violation of the content provider’s ToS.

          I fear you are getting stuck nitpicking my analogy, which was a bit simplified.

          • ShadowRam@fedia.io · 4 months ago (edited)

            does not make it permissible for a human to cause the LLM to train on that data without permission of the rights holder

            Says who? These videos are out there for people (or things) to see.

            If someone was playing some videos to train their dog to respond to a noise, what business is that of the rights holder?

            Show me where in the ToS, as it stood over a year ago, it says you’re not allowed to train an AI on the videos.

            Rights holders can’t control what people use the video for. They can control when and how it’s delivered, but not who’s actually watching it.

            • subignition@fedia.io · 4 months ago

              Says who? These videos are out there for people (or things) to see.

              What an awful troll you are. You conveniently didn’t quote the remainder of the sentence so you could try to nitpick a part of my response out of context.

              Read the “Permissions and Restrictions” section of the YouTube terms of service.

    • Goun@lemmy.ml · 4 months ago

      YouTubers still got paid for the AI views.

      First, are you sure about that? I’m pretty sure they don’t get anything even when their videos are watched by real people using another frontend, like FreeTube, let alone by an automated scraper.

      Second, they’re violating YouTube’s terms, if I read correctly.

      • ShadowRam@fedia.io · 4 months ago

        If these companies used YouTube videos in a way that circumvented the revenue stream in any way, then yeah, that’s absolutely a problem. But that’s a completely different issue, unrelated to who or what is consuming the video.

    • GBU_28@lemm.ee · 4 months ago

      Yeah it’s not like mass web scraping is a new thing.