The case:
- We have a dump of all messages an individual sent on Facebook, where they play it safe (thanks, Zucky).
- And also a dump from an anonymous user who commits some petty crime like piracy or wrongthink.
- We suspect they are the same person.
Probable thought process:
- Analyze sentence structure, punctuation patterns, and characteristic mistakes, and compare them.
- Compare both dumps against a larger corpus to see if they share words marked as rare.
- Thematically tag what they usually write about.
- Based on a weighted sum of these three signals (or more?), calculate a match score. A rough sketch of this scoring step follows below.
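For illustration, a minimal Python sketch of that scoring step. This is a toy under loud assumptions: plain-text dumps, a precomputed set of common words, arbitrary weights, and only the first two signals (the topic signal is omitted); nothing here is a real forensic pipeline.

```python
# Toy sketch of the scoring step above. Features, weights, and the
# common-word list are placeholders, not a real forensic pipeline.
import re
from collections import Counter
from math import sqrt

PUNCT = set(".,;:!?-'\"()")

def punct_profile(text: str) -> Counter:
    """Relative frequency of each punctuation mark (signal 1)."""
    counts = Counter(ch for ch in text if ch in PUNCT)
    total = sum(counts.values()) or 1
    return Counter({ch: n / total for ch, n in counts.items()})

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rare_word_overlap(t1: str, t2: str, common_vocab: set) -> float:
    """Jaccard overlap of words absent from the common vocabulary (signal 2)."""
    w1 = {w for w in re.findall(r"\w+", t1.lower()) if w not in common_vocab}
    w2 = {w for w in re.findall(r"\w+", t2.lower()) if w not in common_vocab}
    return len(w1 & w2) / len(w1 | w2) if w1 | w2 else 0.0

def match_score(t1: str, t2: str, common_vocab: set) -> float:
    # Arbitrary 50/50 weights; a real system would fit them on labeled
    # same-author / different-author pairs. Topic signal (3) is omitted.
    return 0.5 * cosine(punct_profile(t1), punct_profile(t2)) \
         + 0.5 * rare_word_overlap(t1, t2, common_vocab)
```

Calling match_score(facebook_dump, anon_dump, common_vocab) yields a value in [0, 1]; a real system would calibrate the decision threshold on known same-author and different-author pairs.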
Qs about the system:
- Would it be efficient to run these tests on a single pair, or even on whole groups of anonymous and public users to find correlations, like pulling c/Privacy against a Facebook group? For small, medium, and large amounts of computational power? (A back-of-the-envelope cost sketch follows this list.)
- Should we be worried that such a tool would arrive as a complement to the usual investigative work done by hand?
- Is there any sense in varying one's own patterns and behavior to fool something like this (assuming they don't have more solid data already)?
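On the scaling question: once every user is boiled down to a fixed-length feature vector, group-against-group matching is a single matrix product, so it is cheap even at community scale. A back-of-the-envelope sketch, with all sizes invented and random vectors standing in for real features:

```python
# Cost sketch: group-vs-group matching as one chunked matrix product.
# All sizes are invented; random vectors stand in for real features.
import numpy as np

n_anon, n_public, n_features = 5_000, 100_000, 300

rng = np.random.default_rng(0)
anon = rng.random((n_anon, n_features), dtype=np.float32)
public = rng.random((n_public, n_features), dtype=np.float32)

# L2-normalize rows so a dot product equals cosine similarity.
anon /= np.linalg.norm(anon, axis=1, keepdims=True)
public /= np.linalg.norm(public, axis=1, keepdims=True)

# 5k x 100k x 300 ~ 3e11 multiply-adds: a few seconds on a laptop.
# Chunking keeps the full similarity matrix from exhausting memory.
best = np.empty(n_anon, dtype=np.int64)
for i in range(0, n_anon, 500):
    sims = anon[i:i + 500] @ public.T          # one slab of the score matrix
    best[i:i + 500] = sims.argmax(axis=1)      # top public match per anon user
```

So raw compute is probably not the bottleneck; building good feature vectors and keeping the false-positive rate down is.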
Qs about the application:
- Are we in for cheap LLM solutions that automate user matching, and for other ways of breaching our privacy?
- Could they be used as an additional tool in investigations, or even as evidence in its own right, in EU or US courts?
- Would commercial companies be interested in scraping and matching data like this for profit? Something dumb like calculating your insurance premium by matching you to depression forums and boards.
What do you think?
I suppose it would depend on the size of each dataset for any given user. Such a system could produce a sizeable number of false positives in the statistically likely scenario where some users only ever appear in the anonymous category or only in the identified one. Accuracy would also suffer if one dataset for a given user were considerably smaller than the other.
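As a toy illustration of that small-sample point, here is a quick simulation of estimating one invented stylometric feature (comma rate per word) from dumps of different sizes; the model and numbers are made up:

```python
# Toy simulation: how noisy is a single stylometric feature estimate
# (comma rate per word) as the message dump shrinks? Numbers invented.
import random

random.seed(1)

def sampled_rate(true_rate: float, n_messages: int, words_per_msg: int = 20) -> float:
    """Empirical comma rate measured over n_messages simulated messages."""
    words = n_messages * words_per_msg
    commas = sum(random.random() < true_rate for _ in range(words))
    return commas / words

for n in (10, 100, 1_000):
    estimates = [sampled_rate(0.08, n) for _ in range(300)]
    mean = sum(estimates) / len(estimates)
    sd = (sum((e - mean) ** 2 for e in estimates) / len(estimates)) ** 0.5
    print(f"{n:>5} messages: comma rate = {mean:.3f} +/- {sd:.3f}")
# With only 10 messages the spread is wide enough that two different
# authors can look identical, and one author can look like two people.
```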
That’s reasonable. I wonder if someone would still buy it on the big data hype.