Right now, robots.txt on lemmy.ca is configured this way:

User-Agent: *
  Disallow: /login
  Disallow: /login_reset
  Disallow: /settings
  Disallow: /create_community
  Disallow: /create_post
  Disallow: /create_private_message
  Disallow: /inbox
  Disallow: /setup
  Disallow: /admin
  Disallow: /password_change
  Disallow: /search/
  Disallow: /modlog

Would it be a good idea, privacy-wise, to deny GPTBot from scraping content from the server?

User-agent: GPTBot
Disallow: /
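
If I understand the Robots Exclusion Protocol correctly, a crawler obeys only the most specific group that matches its user agent, so GPTBot would ignore the * rules entirely and see just the blanket Disallow. The combined file would simply append the new group after the existing rules:

User-Agent: *
  Disallow: /login
  # ...existing rules as above...
  Disallow: /modlog

User-agent: GPTBot
Disallow: /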

Thanks!

  • ono · 20 points · 11 months ago

    Yes, please.

    We can’t stop LLM developers from scraping our conversations if they’re determined to do so, but we can at least make our wishes clear. If they respect our wishes, then great. If they don’t, then they’ll be unable to plead ignorance, and our signpost in the road (along with those from other instances) might influence legislation as it’s drafted in the coming years.

  • ShadowM · 17 points · 11 months ago

    I’m on board for this, but I feel obliged to point out that it’s basically symbolic and won’t mean anything. Since all the data is federated out, they have a plethora of places to harvest it from, or more likely they’ll just run their own ActivityPub harvester.

    I’ve thrown a block into nginx so I don’t need to muck with robots.txt inside the lemmy-ui container.

    # curl -H 'User-agent: GPTBot' https://lemmy.ca/ -i
    HTTP/2 403
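
    For reference, a minimal sketch of one way such an nginx block can be written (this is an illustration, not necessarily the exact config used here; the upstream name and port are placeholders):

    # map lives in the http{} context: flag any request whose
    # User-Agent header matches "GPTBot" (case-insensitive regex)
    map $http_user_agent $blocked_agent {
        default   0;
        ~*GPTBot  1;
    }

    server {
        server_name lemmy.ca;

        # reject flagged agents before the request reaches lemmy-ui
        if ($blocked_agent) {
            return 403;
        }

        location / {
            proxy_pass http://lemmy-ui:1234;  # placeholder upstream
        }
    }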
    
    • @[email protected] · 2 points · 11 months ago

      I imagine they rate-limit their requests too, so I doubt you’ll notice any difference in resource usage. OVH is Unmetered* so bandwidth isn’t really a concern either.

      I don’t think it will hurt anything, but adding it is kind of pointless for the reasons you said.

    • m-p{3} (OP) · 13 points · 11 months ago

      You take action where you can ;)

  • @[email protected] · -2 points · edited · 11 months ago

    No, definitely not. Our work is posted in the open because we want it to be open!

    It is understandable that not all work wants to be open, but in those cases access would already be appropriately locked down for all robots (and humans!) who are not members of the secret club. There is no need for special treatment here.