• MataVatnik@lemmy.world
    English
    arrow-up
    30
    arrow-down
    1
    ·
    9 months ago

    People don’t realize how ephemeral information is. How much information from the internet do you think will survive 200 years from now? My guess is not very much. Also, all the digitized documents, which in an earlier age would have been on paper, are now magnetic bits on a hard drive that have to be refreshed and copied in order to survive.

    • HAL_9_TRILLION@lemmy.dbzer0.com
      English
      arrow-up
      15
      ·
      9 months ago

      People don’t realize how ephemeral information is. How much information from the internet do you think will survive 200 years from now?

      On the one hand, what a tragedy. On the other hand, thank fuck.

    • Car@lemmy.dbzer0.com
      English
      arrow-up
      8
      ·
      9 months ago

      It’s an interesting thought experiment. We could preserve specific data if we cared to. But as others have echoed, with dynamic content delivery systems, editable forum and social media posts, and in some cases, the ability to petition companies to delete your online persona… all of these mean that storing snapshots becomes a more complex problem.

      So far we have storage media which is probably good for 100 years or so before the physical medium begins to degrade. We then have to ensure that connections (physical plugs, protocols) are maintained or available 100 years from now. Offline cold storage sites exist but aren’t storing information to preserve human history. Any data that’s been overwritten or lost to dead links on the web may be sitting on a tape in a warehouse somewhere, but unless you know where to look and have the right credentials, it might as well be lost to time.

  • EmergMemeHologram@startrek.website
    English
    arrow-up
    16
    arrow-down
    1
    ·
    9 months ago

    While sucky, this feels inevitable.

    LLMs and the massive wave of spam coming out right now make caching content way more expensive, and Google gains no value from it. Long-tail spam attacks are already strangling Google lately.

    I think the only way to run a search engine in the mid-2020s is to download the data, process the page in memory, extract metadata and embeddings, and store only those. There’s no value in storing the rendered page offline for later analysis since you’re likely not doing that later analysis.
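
For anyone who wants to picture the "process in memory, store only metadata and embeddings" idea, here is a rough sketch. Everything in it is an assumption made for illustration: the names `PageRecord`, `toy_embedding`, and `index_page` are made up, the "embedding" is just a hashed bag of words standing in for a real model, and the page HTML is passed in as a string so the download step is left out. It is not a claim about how Google or any real search engine does this.

```python
import hashlib
from dataclasses import dataclass
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the <title> and visible text, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._skip = 0          # depth inside <script>/<style>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip:
            return
        if self._in_title:
            self.title += data.strip()
        elif data.strip():
            self.chunks.append(data.strip())


def toy_embedding(text, dims=16):
    """Hypothetical stand-in for a real embedding model: hashed bag of words."""
    vec = [0.0] * dims
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % dims] += 1.0
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm else vec


@dataclass
class PageRecord:
    """What gets stored: metadata and an embedding, never the page itself."""
    url: str
    title: str
    word_count: int
    content_hash: str
    embedding: list


def index_page(url, html, store):
    """Parse the page in memory, keep a small record, discard the HTML."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    record = PageRecord(
        url=url,
        title=parser.title,
        word_count=len(text.split()),
        content_hash=hashlib.sha256(text.encode("utf-8")).hexdigest(),
        embedding=toy_embedding(text),
    )
    store[url] = record
    return record
```

The trade-off is the one this thread is about: the record is tiny compared with a cached copy, but once the HTML goes out of scope there is nothing left to serve as a cached page.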

    Hopefully the Internet Archive can fare better by being curated by humans and storing data infrequently, when it matters, whereas Google needs to scan a lot of info frequently with nearly no human input.