Generally speaking, uploading the original unaltered document is a great way of ...

izacus · on Nov 15, 2020

It's also an amazing way to provide context and sabotage the narrative agenda the news article is pushing. We can't have that kind of nuance in modern media.

bobthepanda · on Nov 15, 2020

You say this as if it was ever the case in pre-modern media either.

skinkestek · on Nov 15, 2020

Extremely valid point, I'd consider this close to table stakes.

Also, in some more "interesting" settings there are exploit-loaded documents to be aware of. Here's an example from a leak I've been following lately:

> If you really want to open a Word document from Psy-Group, please go ahead, knock yourself out. Below are the links to the Word document as well as the email. Just remember that Mr. [...] mistake might have been opening a Word file from Psy-Group in the first place…

I took the liberty of removing the name, as I guess that particular guy is probably suffering enough at the moment.

intricatedetail · on Nov 15, 2020

Documents may also have unique wording for each recipient, so whoever created the document could easily identify the source from that.

piaste · on Nov 15, 2020

The full English prose text, minus headers or footers, would still provide almost all required context to inform the reader without the fingerprinting.

ohgodplsno · on Nov 15, 2020

What if there are ten variants, all with slightly modified wording, so that the author knows immediately who leaked it?

bobthepanda · on Nov 15, 2020

You don't even need that. Documents have been identified before because some versions replace characters with nearly identical-looking but different Unicode characters (say, the various kinds of spaces, or the semicolon with the Greek question mark).

https://en.wikipedia.org/wiki/Whitespace_character

https://en.wikipedia.org/wiki/Question_mark#Greek_question_m...
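
For anyone who wants to check a text for this kind of thing, here is a minimal sketch using only Python's standard `unicodedata` module. The function names `find_suspicious_chars` and `sanitize` are made up for illustration, and the heuristic (flag non-ASCII space separators, invisible format characters, and anything that changes under NFKC normalization) is an assumption, not a complete defence. NFKC is aggressive: it also rewrites legitimate characters such as ligatures and the trademark sign.

```python
import unicodedata

def find_suspicious_chars(text):
    """List (index, code point, Unicode name) for each non-ASCII space,
    invisible format character, or character that NFKC would rewrite."""
    hits = []
    for i, ch in enumerate(text):
        if ch in "\n\t" or " " <= ch <= "~":
            continue  # plain printable ASCII, newline, or tab
        category = unicodedata.category(ch)
        rewritten = unicodedata.normalize("NFKC", ch) != ch
        if category in ("Zs", "Cf") or rewritten:
            name = unicodedata.name(ch, "UNKNOWN")
            hits.append((i, f"U+{ord(ch):04X}", name))
    return hits

def sanitize(text):
    """Fold lookalikes to their plain forms: NFKC first, then map every
    remaining space separator to U+0020 and drop format characters."""
    out = []
    for ch in unicodedata.normalize("NFKC", text):
        category = unicodedata.category(ch)
        if category == "Cf":
            continue  # e.g. ZERO WIDTH SPACE
        out.append(" " if category == "Zs" else ch)
    return "".join(out)

# Greek question mark, no-break space, thin space, zero width space
sample = "Wait\u037e this\u00a0is\u2009a\u200btest"
for hit in find_suspicious_chars(sample):
    print(hit)
print(repr(sanitize(sample)))
```

This only addresses character-level marks. It does nothing about altered wording, layout, or metadata.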

piaste · on Nov 15, 2020

Yes, I've seen that episode of Game of Thrones too :)

First, consider the requirements to set such a trap. The authors of the document need to be actively concerned about a leaker, and to be OK with the document itself being leaked as long as they catch the culprit - at the same time, they need the document to be juicy enough that it will be leaked. They need to share the document in such a way that no two of the suspects will be able to compare notes, otherwise the jig is up. So no putting the file on a common internal resource (unless the server can stealthily serve different versions based on the user's login data); no attachments, else a reply all / forward would reveal the trap; no collaboration; no physical office where two suspects may see each other's copy.

Is that still possible? Yes. But a _lot_ of times it won't be possible, and the would-be leaker will know it's not possible. It's much more likely, and makes much more sense, for critical documents to be shared in such a way that the users _know_ they are fingerprinted, and won't leak them. IIRC, major Hollywood studios do that with their film scripts.

Second, what if the _key phrases_ are slightly altered in each version? Or hell, if your bosses want to finger you so bad, what if they changed a small factual detail in each version? Then even the journalist quotes would reveal the leaker.

bobthepanda · on Nov 15, 2020

The not-so-great news is that common characters like spaces and semicolons have various similar-looking characters defined in Unicode, which would not be very noticeable to a human but would be noticeable to a machine.

So you just need to do random substitutions that uniquely identify the document and you'll have a fingerprint. It wouldn't be very challenging to do, and the record of which version went to whom wouldn't be very challenging to maintain.

You also don't need to uniquely identify it to a person; you just need to narrow the search space and then apply other techniques that would narrow it down. If it's a version of a document that leaked through an email chain then you've just limited the search space to the recipients, which is still plenty useful.
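
To make that concrete, here is a rough sketch of how little code it takes. Instead of random substitutions plus a lookup table, this variant encodes a numeric recipient ID directly into the first few spaces, which is a simplification I'm assuming for illustration. `embed_id`, `extract_id`, and the choice of THIN SPACE (U+2009) as the lookalike are all hypothetical, not taken from any real leak-tracing tool.

```python
THIN = "\u2009"  # THIN SPACE, assumed here to pass for a normal space

def embed_id(text, recipient_id, bits=8):
    """Mark the first `bits` spaces: thin space = 1, normal space = 0
    (least significant bit first)."""
    pattern = [(recipient_id >> i) & 1 for i in range(bits)]
    out = []
    slot = 0
    for ch in text:
        if ch == " " and slot < bits:
            out.append(THIN if pattern[slot] else " ")
            slot += 1
        else:
            out.append(ch)
    if slot < bits:
        raise ValueError("text has too few spaces to carry the ID")
    return "".join(out)

def extract_id(text, bits=8):
    """Read the ID back from the first `bits` spaces of a marked copy."""
    value = 0
    slot = 0
    for ch in text:
        if ch in (" ", THIN) and slot < bits:
            value |= (ch == THIN) << slot
            slot += 1
    return value

memo = "The quick brown fox jumps over the lazy dog near the river bank"
marked = embed_id(memo, 178)
print(marked == memo)      # the copies differ at the character level
print(extract_id(marked))  # recovers the recipient ID
```

With 8 bits you can tell 256 copies apart, and the "record" is nothing more than a table mapping IDs to recipients.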

darkwater · on Nov 15, 2020

Then inevitably somebody would complain that the original document wording might have been altered.

piaste · on Nov 15, 2020

As opposed to a PDF scan which can definitely not be forged at all? ;)

Nothing less than a digital signature can prove the integrity of a digital document, and even that is worthless unless the corresponding public key has been publicly made available via a separate and trusted channel, which is unlikely.

bobthepanda · on Nov 15, 2020

Anything that can be used to prove a document's integrity can generally also be used to identify where it came from and how it was produced, which is why we generally don't see any effort to do this at all.

In fact, plenty of things that can't prove a document's integrity can also be used to identify its source, which is why this isn't done; you can't be sure that you've sanitized the document enough to protect the leaker.