For some reason or other, people have been posting a lot of excerpts from old emails on Twitter over the last few days. The most vital question everybody’s asking themselves is: What’s up with all those equals signs?!
And that’s something I’m somewhat of an expert on. I mean, having written mail readers and stuff; not because I’ve been to Caribbean islands.
I’ve seen people confidently claim that it’s a code, or that it’s an artefact of scanning and then using OCR, but it’s neither — it’s just that whoever converted these emails to a readable format were morons.
What’s that you say? “Converted?! Surely emails are just text!!” Well, if you lived in the stone age (i.e., the 80s), they mostly were, but then people invented things like “long lines” and “rock döts”, and computers had to “encode” the mail before sending.
The artefact we see here is from something called “quoted printable”, or as we used to call it when it was introduced: “Quoted unreadable”.
To take the first line. Whoever wrote this, typed in the following in their mail reader:
we talked about designing a pig with different non- cloven hoofs in order to make kosher bacon
We see that that’s quite a long line. Mail servers don’t like that, so mail software will break it into two lines, like so:
we talked about designing a pig with different non- = cloven hoofs in order to make kosher bacon
See? There’s that equals sign! Yes, the equals sign is used to say “this should really be one single line, but I’ve broken it in two so that the mail server doesn’t get mad at me”.
The formal definition here is important, though, so I have to be a bit technical here: To say “this is a continuation line”, you insert an equals sign, then a carriage return, and then a line feed.
Or,
=CRLF
Three characters in total, i.e., :
... non- =CRLF cloven hoofs...
When displaying this, we remove all these three characters, and end up
with:
... non- cloven hoofs...
So what’s happened here? Well, whoever collected these emails first converted from CRLF (also known as the “Windows” line ending coding, but it’s the standard line ending in the SMTP standard) to “NL” (i.e., “Unix” line ending coding). This is pretty normal if you want to deal with email. But you then have one byte fewer:
... non- =NL cloven hoofs...
If your algorithm to decode this is, stupidly, “find equals signs at the end of the line, and then delete two characters, and then finally the equals sign”, you should end up with:
... non- loven hoofs...
I.e., you lose the “c”. That’s almost what happened here, but not quite: Why does the equals sign still remain?
This StackOverflow post from 14 years ago explains the phenomenon, sort of:
Obviously the client notices that = is not followed by a proper CR LF sequence, so it assumes that it is not a soft line break, but a character encoded in two hex digits, therefore it reads the next two bytes. It should notice that the next two bytes are not valid hex digits, so its behavior is wrong too, but we have to admit that at that point it does not have a chance to display something useful. They opted for the garbage in, garbage out approach.
That is, equals signs are also used for something else besides wrapping long lines, and that’s what we see later in the post:
=C2 please note
If the equals sign is not at the end of a line, it’s used to encode “funny characters”, like what you use with “rock döts”. =C2 is 194, which is a first character in a UTF-8 sequence, and the following char is most likely a =A0: =C2=A0 is “non-breakable space”, which is something people often use to indent text (and the “please note” is indented) and you see =A0 in many other places in these emails.
My guess is that whoever did this part just did a search-replace for =C2 and/or =A0 instead of using a proper decoder, but other explanations are certainly possible. Any ideas?
Anyway, that’s what’s up with those equals signs: 1) “it’s technical”, and 2) “it’s a combination of buggy continuation line decoding and buggy non-ASCII decoding”, and 3) “whoever processed these mails are incompetent”. I don’t think 3) should be very surprising at this point, do you?
(Edit a bit later: To nitpick a bit here: When the standard was written, people mostly envisioned that the quoted-printable content transport encoding would be unwound upon reception (note “transport”), and that you’d end up with “clean text” on disk after reception. This didn’t really happen, so all “real” implementations do the right thing with single-character (i.e., “unencoded”) newlines. For instance:
(quoted-printable-decode-string "he=\nllo") => "hello"
Which leads me to assume that they reused an algo that was usually run in an SMTP server context to do the line unfolding — in that context, you can safely assume that the line ending is a CRLF. And by chance, this algo also works fine if you’re working with a Windows-based file, but fails for a Unix-based file.)



A Jewish joke is such a nice surprise.
Thanks for the article
I enjoyed that too. Had a friend, a Philly Jewish guy, not observant, who up until age 12 didn’t realize bacon wasn’t kosher.(super non observant). When he figured it out, he challenged his mom about all the BLTs they’d been eating… “but Brian” she replied “weren’t they delicious?”
I wish I’d met his mom
Pingback: ycombinator.com
Reason 1) OCR error – they have to print emails “for courts”,
then someone is scanning those papers back into computer file, so that is why there are errors. (some office printers can scan direct to pdf that is OCR in action)
Reason 2) it is quite common for big enterprise / gov agency to have errors in documents to know who is leaking files. For example your document contains error in word number 137 and your colleague has error in word number 391. so if you leak file they know it was you who leaked because leaked document had error in word 137. This process is automatic, on server where documents are stored.
Did you even read the article? It clearly states that this has nothing to fo with OCR, and someone looking to leak big super secrets would probably manually check over the content for obvious purpose mistakes that can be used for fingerprinting
Pingback: reddit.com
Pingback: reddit.com
Pingback: reddit.com
Pingback: lobste.rs
You had the perfect chance to /s/hello/ehlo/g
Pingback: reddit.com
Excellent explanation, really enjoy your writing style
Pingback: ft.com
Pingback: reddit.com
Not so sure about you hypothesis, mostly because I’ve seen several emails where = are replacing one char and not a pair (and they are not on the old 72-bytes width limit), see for example here: https://www.justice.gov/epstein/files/DataSet%2011/EFTA02481681.pdf (_I l=ave for paris_ or seveal email addresses til Epstein)
Also some emails are twice, once with the = error and another without them.
Sketchy!
This article is about why only one character is missing per equals sign — I explain the possible algo in further depth in this followup article.
Pingback: bytes.dev
Pingback: youtube.com
Pingback: reddit.com
Pingback: reddit.com
Pingback: astralcodexten.com
Pingback: fileformat.info