• Treczoks@lemmy.world
    1 hour ago

    wrecks the thing I care most about: copying and pasting details that I need to write articles. Instead, I often get garbled, shortened pieces of other parts of the document intermingled with the text I want—assuming I can even select it in the first place.

    Two things cause this: PDF optimisation and document obfuscation.

    The optimisation thing is something I’ve seen with many Asian PDFs. If they want to use a non-standard font, and want the document to actually display in it, they have to embed it into the PDF, potentially blowing up the file size. In comes the optimiser: it checks which of the thousands of glyphs of that Asian language are actually used in the document, and creates a new font with only those glyphs. This font has a totally different numbering scheme from the original font, so the optimiser also replaces the numbers in the document that represent those glyphs. Result: a much smaller PDF. It looks the same, it prints the same. You can still “copy” the characters, but as their only meaning is tied to the internal numbering of the new font, you cannot paste them into e.g. Google Translate. It’s just gibberish.

    Example: The text is “Jack and Jill”, and the numbers in the document representing those characters would be ASCII/Unicode: 74 97 99 107 32 97 110 100 32 74 105 108 108 (74 being ‘J’, 97 being ‘a’, etc.). This is standard and works basically everywhere. The optimiser sees the letters " Jacdikln" (sorted) and assigns them numbers starting with e.g. 0 for " " (space), 1 for “J”, etc. The images for all other characters are thrown away, as they are not needed. The internal numbers for the text are now 1 2 3 6 0 2 8 4 0 1 5 7 7, which are not standard ASCII/Unicode, so copying them to another application just produces garbage.
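
    The renumbering in that example can be reproduced with a few lines of Python. This is only a toy sketch of the idea, not how any real PDF optimiser is implemented; the function name `subset_encode` is made up for illustration.

```python
def subset_encode(text):
    """Toy model of font subsetting: keep only the glyphs that occur in
    the text and renumber them 0, 1, 2, ... in sorted order.
    (Hypothetical helper, not taken from any real PDF tool.)"""
    glyphs = sorted(set(text))                         # glyphs that survive
    new_code = {ch: i for i, ch in enumerate(glyphs)}  # old char -> new number
    return glyphs, [new_code[ch] for ch in text]


text = "Jack and Jill"

# Standard ASCII/Unicode code points, as in the first half of the example.
standard = [ord(ch) for ch in text]

# Codes after the "optimiser" has renumbered the subset font.
glyphs, subset_codes = subset_encode(text)

# What a program gets if it treats the new numbers as normal code points:
# control characters instead of letters.
naive_paste = "".join(chr(c) for c in subset_codes)

print(standard)
print(repr("".join(glyphs)))
print(subset_codes)
print(repr(naive_paste))
```

    The new numbers only mean something together with the subset font's own glyph table, which is exactly what the receiving application does not have.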

    The obfuscation is often done by putting additional text in the background color behind the main text. You cannot see it, and it does not show up in prints, but when you select a piece of text, it gets copied along, whether you like it or not.

    So you see “Jack and Jill” in black, but behind it is “went up the hill” in white, and you copy something like “Jacwentk upandth hiell”.
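
    A rough way to picture why the copy comes out interleaved: assume (and this is an assumption, real viewers differ) that text selection just collects every text run on the line in left-to-right order and ignores its color. The `Run` class, the positions and the fragment boundaries below are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Run:
    x: float     # horizontal start position on the line (made-up units)
    text: str    # characters drawn by this run
    color: str   # fill color of the run


def what_you_see(runs, background="white"):
    """Text a reader perceives: runs in the background color are invisible."""
    visible = [r for r in runs if r.color != background]
    return "".join(r.text for r in sorted(visible, key=lambda r: r.x))


def what_you_copy(runs):
    """Toy selection model: take every run in x order, regardless of color."""
    return "".join(r.text for r in sorted(runs, key=lambda r: r.x))


runs = [
    # Visible text, drawn in black.
    Run(0.0, "Jac", "black"), Run(3.0, "k ", "black"),
    Run(5.0, "and", "black"), Run(8.0, " Jill", "black"),
    # Hidden text, drawn in the background color at overlapping positions.
    Run(2.5, "went", "white"), Run(4.5, "up", "white"),
    Run(7.5, " the", "white"), Run(9.0, " hill", "white"),
]

print(what_you_see(runs))   # Jack and Jill
print(what_you_copy(runs))  # Jacwentk upand the Jill hill
```

    The exact garbling depends on where the hidden runs sit and how the viewer orders them, so the copied string will vary from document to document.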