LLM scrapers are taking down FOSS projects’ infrastructure, and it’s getting worse.

  • sudo@programming.dev
    ↑ 1 · 2 hours ago

    What's confusing the hell out of me is: why are they bothering to scrape the git blame page? Just download the entire git repo and feed that into your LLM!

    9 times out of 10, the best solution is to block nonresidential IPs. Residential proxies exist, but they're far more expensive than cloud proxies and providers will ask questions. Residential proxies are sketch AF and basically guarded like munitions. Some rookie LLM maker isn't going to figure that out.
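    A minimal sketch of what "block nonresidential IPs" could look like, assuming you already have a list of datacenter CIDR ranges. The ranges below are reserved documentation prefixes used as stand-ins, and `DATACENTER_RANGES` and `is_datacenter_ip` are hypothetical names, not part of any particular tool:

```python
import ipaddress

# Placeholder ranges (reserved documentation prefixes), assumed for illustration.
# A real deployment would load published cloud-provider ranges or an ASN database.
DATACENTER_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_datacenter_ip(addr: str) -> bool:
    """Return True if addr falls inside any of the listed ranges."""
    ip = ipaddress.ip_address(addr)
    # Membership is simply False when the address and network differ in IP version.
    return any(ip in net for net in DATACENTER_RANGES)


if __name__ == "__main__":
    for candidate in ("203.0.113.7", "192.0.2.1", "2001:db8::1"):
        verdict = "block" if is_datacenter_ip(candidate) else "allow"
        print(candidate, verdict)
```

    The check itself is cheap; the hard part is keeping the range list accurate, and anything coming through a residential proxy passes untouched.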

    Anubis also sounds trivial to beat. If it's just crunching numbers and not attempting to fingerprint the browser, then it's just a case of feeding the page into Playwright and moving on.

    • refalo@programming.dev
      ↑ 1 · edited · 34 minutes ago

      I don’t like the approach of banning nonresidential IPs. I think it’s discriminatory and unfairly blocks out corporate/VPN users and others we might not even be thinking about. I realize there is a bot problem but I wish there was a better solution. Maybe purely proof-of-work solutions will get more popular or something.
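      For readers unfamiliar with the idea, here is a rough hashcash-style sketch of a proof-of-work check. It is a generic illustration under assumed parameters (SHA-256, a `challenge:nonce` string format, difficulty counted in leading zero bits) and is not how Anubis or any specific project implements it; all function names are hypothetical:

```python
import hashlib
import secrets


def make_challenge() -> str:
    """Server side: issue a random challenge string."""
    return secrets.token_hex(16)


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash confirms the client did the work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce, about 2**difficulty_bits hashes on average."""
    nonce = 0
    while not verify(challenge, nonce, difficulty_bits):
        nonce += 1
    return nonce


if __name__ == "__main__":
    challenge = make_challenge()
    nonce = solve(challenge, 16)
    print("nonce found:", nonce, "valid:", verify(challenge, nonce, 16))
```

      The asymmetry is the point: solving costs many hashes while verifying costs one. It does not stop a client that is willing to pay that cost, for example a headless browser that simply runs the challenge.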

  • grrgyle@slrpnk.net
    ↑ 44 · edited · 12 hours ago

    Wow, that was a frustrating read. I did not know it was quite that bad. Just to highlight one quote:

    they don’t just crawl a page once and then move on. Oh, no, they come back every 6 hours because lol why not. They also don’t give a single flying fuck about robots.txt, because why should they. […] If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet.
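    To make the rate-limiting point in that quote concrete, here is a small sliding-window limiter keyed by client IP. It is a sketch only: the class name, the limits, and the injectable `clock` parameter are assumptions for illustration, not taken from the article:

```python
import time
from collections import defaultdict, deque


class PerIPRateLimiter:
    """Allow at most max_requests per window_seconds for each client IP."""

    def __init__(self, max_requests: int, window_seconds: float, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self._hits = defaultdict(deque)

    def allow(self, ip: str) -> bool:
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = PerIPRateLimiter(max_requests=2, window_seconds=60.0)
    print([limiter.allow("198.51.100.1") for _ in range(3)])
    # A crawler that rotates addresses gets a fresh budget on every new IP.
    print([limiter.allow(f"198.51.100.{n}") for n in range(2, 5)])
```

    Because the budget is tracked per address, a crawler that spreads its requests across many IPs never trips the limit on any one of them, which is exactly the evasion described above.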

    • jatone@lemmy.dbzer0.com
      ↑ 19 ↓ 1 · 12 hours ago

      the solution here is to require logins. thems the breaks unfortunately. it’ll eventually pass as the novelty wears off.

      • nao
        ↑ 8 · 12 hours ago

        Next you’ll have to invest in preventing automated signups

        • jatone@lemmy.dbzer0.com
          ↑ 2 · 6 hours ago

          not really, just tie it to SMS-style 2FA and the hurdle is large enough that most companies won’t bother.

        • hisao@ani.social
          ↑ 4 ↓ 1 · 11 hours ago

          Signups on most platforms are already quite hard. You either hand over your phone number outright and do SMS verification, or you give an email address, and to register that email you had to provide a phone number anyway. Captchas have become so hard that even humans struggle with them, and it often takes multiple attempts to get one right.

          • nao
            ↑ 1 · 2 hours ago

            Providing a phone number just to look at this FOSS project’s website? Not too sure about that.

  • hisao@ani.social
    ↑ 14 · 11 hours ago

    This is the craziest read on the subject in a while. Most articles just talk about hypothetical issues of tomorrow, while this one is actually full of today’s problems, and even gives the costs of those issues in numbers and hours of pointless extra work. Had no idea it’s already this bad.

  • 4am@lemm.ee
    ↑ 3 ↓ 1 · 9 hours ago

    How much you wanna bet that at least part of this traffic is Microsoft just using other companies’ infrastructure to mask the fact that it’s them?

    • Possibly linux@lemmy.zip
      ↑ 2 · 4 hours ago

      I doubt it since Microsoft is big enough to be a little more responsible.

      What you should be worried about is the fresh college graduates with 200k of venture capital money.