It’s a bit of a weird shower thought, but I was wondering, hypothetically, if it would be possible to take data from a social media site like Reddit, map the most commonly used words to numbers starting at 1, and use a separate application to translate the text back and forth.

So if the word “because” were number 100, it would be stored with three characters instead of seven.

There could also be additions for suffixes, so “gardening” could be 5000+1, or a word like “hoped” could be 2000−2 because the “e” is already present.

Would this result in any kind of space savings if you were using larger amounts of text like a book series?
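The idea described above is essentially dictionary coding: rank words by frequency so the most common ones get the smallest (and therefore shortest) numbers. A minimal sketch, with all names made up for illustration:

```python
# Sketch of the word-to-number idea: rank words by frequency, then
# replace each word with its rank so common words get short codes.
from collections import Counter

def build_codebook(corpus: str) -> dict[str, int]:
    # Most frequent words get the smallest numbers.
    counts = Counter(corpus.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text: str, codebook: dict[str, int]) -> str:
    return " ".join(str(codebook[w]) for w in text.split())

def decode(nums: str, codebook: dict[str, int]) -> str:
    reverse = {i: w for w, i in codebook.items()}
    return " ".join(reverse[int(n)] for n in nums.split())

corpus = "the cat sat on the mat because the mat was warm"
book = build_codebook(corpus)
packed = encode(corpus, book)      # e.g. "1 3 4 5 1 2 ..."
assert decode(packed, book) == corpus
```

For real text this does save space, but general-purpose compressors (gzip, zstd, etc.) already do the same thing more thoroughly: they assign short codes to frequent substrings, not just whole words, so they would almost certainly beat a hand-built word table on a book series.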

  • xordos@lonestarlemmy.mooo.com

    There is simply no way to pre-store every possible long text, because the number of possible combinations is just too huge. It would be like drawing the bullseye around wherever your arrow landed: when you search for a paragraph, the site searches its existing index, and if the paragraph is not found, it saves it and assigns it a new index. Of course, it may use some algorithm to reduce the space needed.

    From its wiki page, https://en.m.wikipedia.org/wiki/The_Library_of_Babel_(website): “The website can generate all possible pages of 3200 characters and allows users to choose among about 10^4677 potential pages of books.”

    PS: thinking about it again, this actually does not need any storage at all. It can be a URL-to-text encoding/decoding scheme, where the URL itself determines/generates the actual text, and the same in the opposite direction.
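The "no storage" point can be made concrete: over a fixed alphabet, every text is just a number in that base, so an index carried in a URL can be converted to its text and back deterministically. A sketch, assuming a 29-symbol alphabet (lowercase letters, space, comma, period, as the Library of Babel site describes), using bijective base conversion so leading symbols are preserved:

```python
# Deterministic text <-> index mapping: no database needed, the index
# alone regenerates the text (bijective base-29 numeration).
ALPHABET = "abcdefghijklmnopqrstuvwxyz ,."  # assumed 29-symbol alphabet

def text_to_index(text: str) -> int:
    n = 0
    for ch in text:
        n = n * len(ALPHABET) + ALPHABET.index(ch) + 1  # +1: bijective base
    return n

def index_to_text(n: int) -> str:
    chars = []
    while n > 0:
        n, rem = divmod(n - 1, len(ALPHABET))
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

msg = "hello world"
assert index_to_text(text_to_index(msg)) == msg
```

The catch is that the index is not a shortcut: on average it takes about as many digits to write down as the text itself, which is why this scheme is a curiosity rather than compression.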