• David Gerard@awful.systems · 7 days ago (edited)

    jwz gave the game away, so i’ll reveal:

    the One Weird Trick for this week is that the bots pretend to be an old version of Chrome. So you can block on user agent.

    so I blocked old Chrome from hitting the expensive mediawiki call on rationalwiki and took our load average from 35 (unusable) to 0.8 (schweeet)

    caution! this also blocks the archive sites, which pretend to be old chrome. I refined it to only block the expensive query on mediawiki; vary as appropriate.

    nginx code:

        # block some bot UAs for complex requests
        # nginx doesn't do nested if, so we set a test variable instead:
        # if $BOT ends up both Complex and Old, block as bot
        set $BOT "";

        # "C" = complex: the expensive MediaWiki entry point
        if ($uri ~* (/w/index.php)) {
            set $BOT "C";
        }

        # "O" = old browser UA (note these patterns also catch Chrome 20-99
        # and 100-129, all long out of date by now)
        if ($http_user_agent ~* (Chrome/[2-9])) {
            set $BOT "${BOT}O";
        }
        if ($http_user_agent ~* (Chrome/1[012])) {
            set $BOT "${BOT}O";
        }
        if ($http_user_agent ~* (Firefox/3)) {
            set $BOT "${BOT}O";
        }
        if ($http_user_agent ~* (MSIE)) {
            set $BOT "${BOT}O";
        }

        if ($BOT = "CO") {
            return 503;
        }

    You always return 503, not 403, because 403 says “fuck off” but the scrapers are used to seeing 503 from servers they’ve flattened.

    I give this trick at least another week.

  • Soyweiser@awful.systems · 7 days ago

    Re the blocking of fake user agents: what people could try is to see if there are things older user agents do (or do wrong) which these bots do not. I heard of some companies doing that. (Long ago I also heard of somebody using that to catch MMO bots in a specific game: there was a packet that, if the server sent it to a legit client, the client crashed, but a bot did not.) I’d assume the specifics are treated as secret just because you don’t want the scrapers to find out.
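
    For example (just a sketch of that kind of check, and the version cut-off here is from memory, so verify it before relying on it): real Chrome didn’t start advertising brotli compression (“br” in Accept-Encoding) until around version 50, so a request whose UA claims an ancient Chrome but still offers br is lying. In nginx, using the same two-flag trick as the config above ($LIAR is just my name for it):

        # sketch only: "ancient Chrome" UA + brotli support = inconsistent, treat as bot
        # (assumes brotli arrived around Chrome 50; check that before deploying)
        set $LIAR "";
        if ($http_user_agent ~* (Chrome/[2-4][0-9]\.)) {
            set $LIAR "old";
        }
        if ($http_accept_encoding ~* br) {
            set $LIAR "${LIAR}br";
        }
        if ($LIAR = "oldbr") {
            return 503;
        }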

    • YourNetworkIsHaunted@awful.systems · 2 days ago

      You could probably do something by getting into the weeds of browser updates, at least for web traffic. Like, if they’re showing themselves as an older version of Chrome, send a badly formatted cookie to crash it? Redirect to /%%30%30?
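
      A sketch of the redirect idea (assuming the old %%30%30 crash, which as far as I remember only hit real Chrome around v45 in 2015, so a scraper merely faking the UA string would just follow the redirect unharmed), reusing the $BOT flag from the config further up the thread:

          # sketch: bounce suspected old-Chrome clients at the crashy URL
          # instead of a plain 503 ($BOT = "CO" as set upthread)
          if ($BOT = "CO") {
              return 302 /%%30%30;
          }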

      • Soyweiser@awful.systems · 2 days ago

        Yes, I heard there is some JavaScript, for example, that various older versions of Chrome/Firefox don’t properly execute. So you can use that to determine which version they really are (as long as nobody shares that JavaScript with the public; it might not even be JavaScript, I honestly know nothing about it, just heard about it).

  • db0@lemmy.dbzer0.com · 8 days ago

    It’s a constant cat and mouse atm. Every week or so, we get another flood of scraping bots, which force us to triangulate which fucking DC IP range we need to start blocking now. If they ever start using residential proxies, we’re fucked.
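
    (The blunt nginx version of that kind of range block, for anyone following along, is just deny with the offending CIDRs; the ranges below are RFC 5737 documentation placeholders, not real datacentre space.)

        # placeholder ranges only, swap in the actual DC CIDRs you've identified
        deny 203.0.113.0/24;
        deny 198.51.100.0/24;
        allow all;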

    • irelephant [he/him]🍭@lemm.ee (OP) · 8 days ago

      I have a tiny neocities website which gets thousands of views a day; there is no way anyone is viewing it often enough for that to be organic.

    • self@awful.systems · 8 days ago

      at least OpenAI and probably others do currently use commercial residential proxying services, though reputedly only if you make it obvious you’re blocking their scrapers, presumably as an attempt on their end to limit operating costs