About those equals signs again…

Well, whaddayou know — my post about an obscure en/decoding issue went totally viral yesterday! I had no idea that people were so interested in RFC2045, an email standard from 1996. Just goes to show…

Just kidding!

But I see there was some confusion about some details (mostly due to my unfortunate tendency to (try to) be funny), so I thought I’d bloviate a bit more in case anybody’s still confused. And! A bit about the server problems at the end, too.

OK?

So: Over the past few days, I’d been getting more and more annoyed with threads about the horrific Epstein emails that had been published. I mean, not about the horrificness itself, but because I’m a nerd. I was getting really fed up with all those people, so sure of themselves, declaring that one specific oddness — the presence of = characters sprinkled almost as if at random, replacing other characters — was due to OCR artefacts.

Now, I know OCR (I’ve done millions of pages), and let me tell you, sirrah, those are no OCR artefacts.

It was even more annoying when people were incorrecting each other with things like “yeah, you can see that it pops up more often when the letter should have been an h” and so on. NO YOU FOOLS! YOU FOOLS! IT’S QUOTED PRINTABLE!!! *storms out in a huff*

I did see other people point out the quoted printableness of it all, but I didn’t see anybody actually explain the specific oddities we were seeing. If somebody in charge of processing these emails, knowing nothing about email, had thought “huh, a lot of these lines end with =, let’s just remove those”, you wouldn’t get missing characters. You’d get oddly formatted lines, but in any case, you’d have gotten rid of those equal signs for sure.
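To make that concrete, here’s a sketch (in Python, purely as an illustration; nobody knows what tooling they actually used) of the “let’s just remove those” approach:

```python
# A quoted-printable soft line break, but with Unix line endings.
mangled = "please stop ke=\neping kosher bacon"

# Naively strip the "=" at the end of each line: the text stays broken
# across two lines, but every original character survives.
stripped = mangled.replace("=\n", "\n")

assert "=" not in stripped   # the equals signs are gone for sure
assert stripped.replace("\n", "") == "please stop keeping kosher bacon"
```

You’d get an odd mid-word line break, but no missing letters — so that’s clearly not the algorithm that was used.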

So I didn’t post anything, because I didn’t really have a good explanation other than “you’re all wrong! wrong I tells ya!” *stomps foot*

(The other popular hypothesis was that Epstein cleverly inserted = into random words, replacing other characters, because it was some sort of secret code. *sigh*)

But then it came to me… what if it’s not actually continuation handling that has gone wrong here, really? What if it’s that other thing in quoted printable — non-ASCII characters?

I did some googling… and I found somebody had experienced exactly that problem in the wild! 14 years ago! Yes!

So here’s the algorithm I think they used. The algorithm is buggy, but if you’ve dealt with encodings like this before, I think you can see how it ended up like this.

  • First you want to fix up continuation lines. You’re working on a Windows machine, or you’re actually reading the RFC (I know, don’t laugh — I know it’s unlikely). A continuation line on a Windows machine is =CRLF. So you write:
    (while (search-forward "=\r\n" nil t)
      (replace-match ""))

    (For the code examples, I’m assuming whoever was doing this conversion was using Emacs Lisp, because aren’t everybody? Ahem. Same thing in any language, really.)

    The problem here is that you’ve gotten all these files from Gmail, and they’re not a Windows shop — so the line endings are "=\n" instead.

    So your first pass over the file does absolutely nothing, but you don’t notice, because then your program hits pass two:

  • You look for = followed by two characters. Two hex digits after an = decode to one non-ASCII character. (Or octet, really.) Because you know that that’s the only thing left — you’ve already handled all the continuation lines, right? Right? Right. So you delete the two encoding characters, and then you replace the = with the new byte. But you have some sanity checks, of course!
    (while (re-search-forward "=\\(..\\)" nil t)
      (let ((code (match-string 1)))
        (delete-region (match-beginning 1) (match-end 1))
        (when-let ((new-byte (decode-hex code)))
          (subst-char-in-region (match-beginning 0) (1+ (match-beginning 0))
                                ?= new-byte))))

    The sanity check is, unfortunately, that if the thing doesn’t actually decode, you leave the = in place. And in this case, it never decodes, because =\nc is never valid.

    I can certainly say that I’ve written code along these lines before, unfortunately.
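For the non-Emacs crowd, here’s the same buggy two-pass algorithm sketched in Python (hypothetical, like the Lisp above, since nobody knows what they actually ran). It reproduces the artefact exactly:

```python
import re

HEX_PAIR = re.compile(r"[0-9A-Fa-f]{2}")

def buggy_decode(text):
    # Pass 1: remove soft line breaks -- but only Windows-style ones,
    # so on a file with Unix line endings this does nothing at all.
    text = text.replace("=\r\n", "")

    # Pass 2: every remaining = should be followed by two hex digits.
    # Delete those two characters, and replace the = with the decoded
    # byte -- but only if they really were hex digits (the "sanity check").
    out = []
    i = 0
    while i < len(text):
        if text[i] == "=" and i + 2 < len(text):
            code = text[i + 1:i + 3]
            if HEX_PAIR.fullmatch(code):
                out.append(chr(int(code, 16)))
            else:
                out.append("=")   # doesn't decode: leave the = in place...
            i += 3                # ...but the two characters are gone anyway
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

# On a Windows-style file, everything is fine:
assert buggy_decode("ke=\r\neping") == "keeping"
# On a Unix-style file, "=\ne" doesn't decode, the = stays, and the
# first character of the next line gets eaten:
assert buggy_decode("ke=\neping") == "ke=ping"
```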

So there you go. The mystery is why there’s still some =C2 things in there — but I think that can be explained by just having some off-by-one errors. That is, the mails originally had =C2=A0 (which is UTF-8 for NON BREAKING SPACE), and you often see either the =C2 or the =A0 missing. So my guess is that the algo just skipped ahead a bit too much — which would also be a typical error you wouldn’t find while doing some trivial testing under Windows, where your test files wouldn’t be in UTF-8, but in a code page like CP1252, where characters like NON BREAKING SPACE consist of one byte only.

So that’s my er “theory”, based on implementing mail standards for decades, and observing how sloppy people are when approaching this stuff. I mean, the RFCs are usually very straightforward and easy to follow, but people seem to wing it anyway.

Here’s a fun example — you see the character-replaced-by-equals effect you’d get with the algo described above (note ke=ping), but you also have the decode-half-the-non-ASCII-bytes thing — you probably had =C2=B7, which is supposed to be a · (a middle dot character) or something along those lines, but only the second byte has been converted, and not the first. (So you get an invalid character instead.) I don’t really have an explanation for how you’d mess up something like that, but us programmers are good at inventing new methods of doing the wrong thing, aren’t we? (If you have a theory for what the algo looked like, feel free to leave a comment.)

You can also see emails that haven’t been converted at all:

This is just a line-folded raw Quoted-Printable mail: Note the no= te, where neither the equals sign, the line ending nor the following character have been removed. (The undecoded =C2=A0s just decode to NON BREAKING SPACE.)

I’m also slightly curious about whether these emails were really multipart/alternative — that is, whether the person preparing these emails for printing just chose the text/plain parts to work on, since that would obviously be less work than printing out the text/html parts? That would also explain why all the images and stuff have gone missing…

Anyway.

How did the viralness go, I hear nobody ask? Well, my post was posted on Hacker News, and I was going “hah, once again, behold my wonderful WordPress installation, which has no problems dealing with any of this stuff; I wonder why so many sites go down when they land on Hacker News (I’ve been there several times and I’ve never seen a load over 0.5), all those other people must be amateurs. Not to mention those smug static site generator people… So unnecessary…”

Aaaargh! WHAT! WHAT’S HAPPENING! THE SITE IS DOWN!!!

But there’s still no load? What’s going on!!!!

So I started wondering whether I’d disabled WP Super Cache (which serves out cached HTML without hitting the database much) while doing some experimentation the other week, but no. So I started thinking… “what if the problem isn’t WordPress per se, but just a lot of traffic — it just runs out of connections (due to KeepAlive and stuff)”…

So I googled, and Gemini told me what to fix (YES, AN LLM TO THE RESCUE, I couldn’t believe it either). It first told me to find out what kind of setup I have:

root@blogs:~# apache2ctl -V | grep MPM
Server MPM:     prefork

It’s mpm_prefork, which means that the conf is in /etc/apache2/mods-available/mpm_prefork.conf, so I upped everything there. So there’s now:

StartServers            5
MinSpareServers         5
MaxSpareServers         40
ServerLimit 1000
MaxRequestWorkers       1000
MaxConnectionsPerChild  0

That’s up from a standard of:

StartServers            5
MinSpareServers         5
MaxSpareServers         10
MaxRequestWorkers       150
MaxConnectionsPerChild  0

So I increased ServerLimit from its default of 256 to 1000, and MaxRequestWorkers from 150 to 1000. I restarted Apache, and everything was suddenly hunky dory again.

The load, even with this configuration, is steadily less than 1, so it can probably be upped more.

But how much traffic was causing this problem? Well, in general, there were just two page views per second (e.g., 6738 per hour in the snapshot above), but:

Each page view entails about 50 resource hits, so it’s about 100 hits per second.

Which is nothing — you can see from the load that this Dropbox instance has no problem serving it, resource wise — but it’s just that the default connection parameters are really conservative.

Which may indeed make sense in other configurations. If each hit lands on a URL that requires a lot of processing, a limit of 1000 will soon land you with a server with a load of 1000, and you don’t want that. But a WordPress instance (with caching switched on) is 99% serving static resources only, which is *piffle*.

It’s deranged that we’re still doing stuff like this, thirty years after web servers got popular. There’s still no way to tell Apache “all of these resources are static, use how many connections you want — millions; I don’t care — to handle them. But these others (insert conf here) actually take CPU, so don’t handle more than 50 of them concurrently.”

So here we are… And now I can go back to rolling my eyes at those plebs that have server problems during hackernewsdotting. Hah!

On this screenshot, it’s been about 12 hours now since the virality started viraling, and as you can see, it makes the daily chart useless — all the other days are just a couple pixels. I guess I could switch to a logarithmic chart, but I just don’t want to.

Seems we’re gonna land at, like, 65K page views in total for Feb. 4th — the previous record was 20K, so that explains why I had to tweak the Apache settings. Let’s see where we’re at now, about 24 hours after virality:

73K. Yeah, things have tapered off pretty quickly, but apparently Twitter has gotten into the business now:

Oh, and Bluesky?

The referrers are fun — as usual whenever there’s a Hacker News post, there’s a large followup effect from Reddit etc. But the sheer variety… 250 different sites, and the list ends like:

It’s a lot of stuff… so many web sites are downstream of Hacker News.

This one’s funny.

The comments on Hacker News were many and varied, but it’s kinda fun to see how the stupidest ones get downvoted pretty fast… As for comments here on this blog, I usually have WordPress auto-approve them (the Akismet anti-spam is so effective that it’s usually no problem), but some actual nazis found my blog, so I had to switch that off, and now I have to approve comments manually. Them’s the breaks.

So there you go.

What’s up with all those equals signs anyway?

For some reason or other, people have been posting a lot of excerpts from old emails on Twitter over the last few days. The most vital question everybody’s asking themselves is: What’s up with all those equals signs?!

And that’s something I’m somewhat of an expert on. I mean, having written mail readers and stuff; not because I’ve been to Caribbean islands.

I’ve seen people confidently claim that it’s a code, or that it’s an artefact of scanning and then using OCR, but it’s neither — it’s just that whoever converted these emails to a readable format were morons.

What’s that you say? “Converted?! Surely emails are just text!!” Well, if you lived in the stone age (i.e., the 80s), they mostly were, but then people invented things like “long lines” and “rock döts”, and computers had to “encode” the mail before sending.

The artefact we see here is from something called “quoted printable”, or as we used to call it when it was introduced: “Quoted unreadable”.

To take the first line. Whoever wrote this, typed in the following in their mail reader:

we talked about designing a pig with different non- cloven hoofs in order  to make kosher bacon

We see that that’s quite a long line. Mail servers don’t like that, so mail software will break it into two lines, like so:

we talked about designing a pig with different non- =
cloven hoofs in order  to make kosher bacon

See? There’s that equals sign! Yes, the equals sign is used to say “this should really be one single line, but I’ve broken it in two so that the mail server doesn’t get mad at me”.

The formal definition here is important, though, so I have to be a bit technical here: To say “this is a continuation line”, you insert an equals sign, then a carriage return, and then a line feed.

Or,

=CRLF

Three characters in total, i.e.:

... non- =CRLF
cloven hoofs...

When displaying this, we remove all three of these characters, and end up with:

... non- cloven hoofs...

So what’s happened here? Well, whoever collected these emails first converted from CRLF (also known as the “Windows” line ending coding, but it’s the standard line ending in the SMTP standard) to “NL” (i.e., “Unix” line ending coding). This is pretty normal if you want to deal with email. But you then have one byte fewer:

... non- =NL
cloven hoofs...

If your algorithm to decode this is, stupidly, “find equals signs at the end of the line, and then delete two characters, and then finally the equals sign”, you should end up with:

... non- loven hoofs...

I.e., you lose the “c”. That’s almost what happened here, but not quite: Why does the equals sign still remain?
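A quick Python sketch of that stupid algorithm (purely hypothetical; we don’t know what they actually ran) confirms that the c goes missing — but note that it also removes the equals sign, which is not what we observe:

```python
import re

# "Find the equals sign, delete the two following characters, and then
# the equals sign itself" -- correct for =CRLF, one byte too greedy
# when the file has been converted to bare LF line endings.
def stupid_unwrap(text):
    return re.sub(r"=..", "", text, flags=re.DOTALL)

assert stupid_unwrap("non- =\r\ncloven hoofs") == "non- cloven hoofs"
assert stupid_unwrap("non- =\ncloven hoofs") == "non- loven hoofs"   # bye, "c"
```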

This StackOverflow post from 14 years ago explains the phenomenon, sort of:

Obviously the client notices that = is not followed by a proper CR LF sequence, so it assumes that it is not a soft line break, but a character encoded in two hex digits, therefore it reads the next two bytes. It should notice that the next two bytes are not valid hex digits, so its behavior is wrong too, but we have to admit that at that point it does not have a chance to display something useful. They opted for the garbage in, garbage out approach.

That is, equals signs are also used for something else besides wrapping long lines, and that’s what we see later in the post:

   =C2   please note

If the equals sign is not at the end of a line, it’s used to encode “funny characters”, like what you use with “rock döts”. =C2 is 194, which is the first byte of a two-byte UTF-8 sequence, and the following char is most likely an =A0: =C2=A0 is a “non-breaking space”, which is something people often use to indent text (and the “please note” is indented), and you see =A0 in many other places in these emails.
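Byte-wise, that looks like this (Python, just for illustration):

```python
# =C2=A0 spells out the two bytes 0xC2 0xA0, which together are the
# UTF-8 encoding of a single NO-BREAK SPACE (U+00A0).
raw = bytes([0xC2, 0xA0])
assert raw.decode("utf-8") == "\u00a0"

# Either byte on its own is invalid UTF-8: 0xC2 is a lead byte with no
# continuation, and 0xA0 is a continuation byte with no lead -- which
# is why half-decoded sequences render as garbage characters.
for lone in (b"\xc2", b"\xa0"):
    try:
        lone.decode("utf-8")
        raise AssertionError("should not decode")
    except UnicodeDecodeError:
        pass
```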

My guess is that whoever did this part just did a search-replace for =C2 and/or =A0 instead of using a proper decoder, but other explanations are certainly possible. Any ideas?

Anyway, that’s what’s up with those equals signs: 1) “it’s technical”, and 2) “it’s a combination of buggy continuation line decoding and buggy non-ASCII decoding”, and 3) “whoever processed these mails is incompetent”. I don’t think 2) should be very surprising at this point, do you?

(Edit a bit later: To nitpick a bit here: When the standard was written, people mostly envisioned that the quoted-printable content transfer encoding would be unwound upon reception (note “transfer”), and that you’d end up with “clean text” on disk after reception. This didn’t really happen, so all “real” implementations do the right thing with single-character (i.e., “unencoded”) newlines. For instance:

(quoted-printable-decode-string "he=\nllo")
=> "hello"

Which leads me to assume that they reused an algo that was usually run in an SMTP server context to do the line unfolding — in that context, you can safely assume that the line ending is a CRLF. And by chance, this algo also works fine if you’re working with a Windows-based file, but fails for a Unix-based file.)
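Python’s standard library behaves the same lenient way, for what it’s worth:

```python
import quopri

# A real-world decoder treats a bare "=\n" as a soft line break too,
# even though the RFC technically specifies =CRLF:
assert quopri.decodestring(b"he=\nllo") == b"hello"

# And ordinary =XX sequences decode to the raw bytes:
assert quopri.decodestring(b"=C2=A0") == b"\xc2\xa0"
```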

January Music

Music I’ve bought in January.

It’s not a lot, and most of what I’ve bought are box sets of old albums and stuff. TSK TSK

The Chicks on Speed box is a pretty odd one. It’s not really a “best of” box, and it’s not really a retrospective, either. Instead it’s half newish songs, half old ones, and half new versions/remixes of old ones. That’s a lot of halves, but it’s a big box.

I’m not altogether sure that it totally works… interweaving old and new songs like this mostly reminds you that the older songs were better?

But it’s a pretty fun box — there’s a very colourful booklet…

Tasteful cover designs.

And an Italian silk scarf! That’s not something you get in every box set, eh? Eh?

The 2xLP version of Desire by Tuxedomoon has The Worst Design Ever. But why — the original one was pretty spiffy.

But it does come with some pretty amusing texts about the recording of the album, and it has three previously unpublished outtakes. (And some alternative versions.)

This was released a decade ago, but I’d totally missed it — I just happened onto it on Discogs one day. I really love Lal & Mike Waterson’s album, and she never released anything after it got a pretty tepid (veering towards hostile) reception…

… and this release shows what a tragedy that was. The CD consists of home recordings she did over the years, and there’s some real gems in there.

The accompanying book has excerpts from her notebooks, so we get a lot of her lyrics, and also her artwork, and the artwork’s great, too.

She was just incredibly talented.

Yes! One can now be nostalgic for Electroclash!

I like buying compilations like this — I mean, I’ve got at least a third of the songs on it already, but the recontextualisation is sometimes fun, and I discover new old bands that I’d missed the first time around, so they often lead to shopping sprees. This one seems like a corker — I’ve only listened to the first CD (I like to “swap in” box sets slowly), but it’s all fun stuff.

Yes, a Welcome to the Pleasuredome box set. What on Earth could it contain?

Just seven CDs (and a bluray)? As a friend said when I told him about the set: “so it only has about half the remixes?”

I dunno, but I’ve only listened to the first CD, and this might be one of those “listen once and then forget” box sets… I mean, it’s fun to listen to the Relax demo the first time… but that’s perhaps sufficient, really.

The booklet is also pretty disappointing. It has no new info, no interviews, no nothing — instead we get a summary of how the album and the singles did chart-wise, so it’s like they only consulted public sources. Snores-ville.

And this isn’t a box set! It’s an early 80s album from an artist I’ve never heard about before, but I bought it because I went on a Bill Nelson Discogs shopping spree.

Look at all these people on this album! Steve Jansen, Mick Karn… that’s half of Japan. And produced by Steve Nye, who produced Japan. And you won’t believe this, but it kinda sounds like Japan! Yes! Whodathunk! (Masami Tsuchiya played with Japan as a touring guitarist.)

And also Ryuichi Sakamoto.

Is the album any good? I’ve only listened to it once so far, and I think… it’s kinda good? Not sure yet.

OK, that’s all I have to say.

Dry Cleaning - Joy

Except that the new Dry Cleaning album isn’t as good as the last one.

The Soft Pink Truth - Time Inside the Violet (official music video)

And that The Soft Pink Truth have a new album out. It’s very different, but they usually are…

And I bought the Jacques Dutronc album because somebody with good taste posted the grid above, but unfortunately, it’s not very good.

January Books

After I finished my Book Club 2025 blog series, I found that I missed blathering on about books I’ve read… so I thought I might perhaps start doing a “what I read last month” thing? We’ll see how it goes, because in January I managed the amazing feat of reading (almost) exclusively junk. And you’d think there’d be nothing to say about junk, but I find that the junkier, the more there is to blather about.

So I’m not going to say anything about the books I had nothing to say about.

First of all, I have to say that I really enjoyed spending the time with these 650 pages of Philip Pullman’s The Rose Field. On a page-to-page basis, it’s very exciting and a lot of fun. It’s well-written and propulsive. Whenever I sat down with it, I went “yay”.

But.

There’s so many problems. We follow a large number of characters on a “third person tight” basis, but out of the blue, Pullman would just drop into omniscient narrator and go “and of course, the man Alice is talking to is the same one that Lyra was running away from three years earlier”. Like he’s talking out loud to himself. It’s very disturbing, and I wondered whether this was a sign Pullman had incipient dementia while writing the book? It took six years to finish, apparently.

And oh, the plot… We’re introduced to so many new characters, and their stories go nowhere. We follow some older characters, too, like Alice. As the book was getting towards the end, I was scratching my head as to how Pullman could possibly pull all these threads together. “Er… perhaps… er… Alice will… uhm…” “Perhaps Oakley Street will regroup and… er…”

And then nothing happens to any of these plot strands. Lyra gets a sort of resolution, but not very satisfying. For the rest of the characters, the book just stops. The first thing I did after finishing it was to google whether there’s more volumes coming. And there’s not.

It’s the least satisfying end to a fantasy I can remember reading.

Pullman got a lot of pushback on the previous book’s romance plot — an older teacher (who cared for her when she was a baby) falls in love with her, and it’s all kinda yucky (but very typical for an older male author). It’s obvious that Pullman is pretty angry about that reaction, because he not only has a nazi-equivalent cop investigating that teacher for inappropriate sexual behaviour, but he also has a 400 year old witch telling him that his age gap is nothing — she often has sex with men that are 370 years younger than her. “Numbers have little to do with it.” *gack*

But then at the end, he and Lyra don’t end up together after all — the rumours on the interweb say that an editor convinced him to remove that from the ending. I find that pretty unlikely — I mean, that this book had an editor. If it had, surely that editor could have pointed out all the other problems the book has? Much more serious ones?

But then again, Pullman is 79 now.

Anyway: I liked reading the book, but it’s maddeningly frustrating in the end.

The people on Goodreads didn’t like it — you seldom see scores drop that much between volumes in a series.

With lots and lots of people — fans of the original trilogy — wishing they hadn’t read this one at all. (Because it retcons a lot of stuff in a way that makes no sense, and diminishes what happened in those books, really.) So I’m sympathetic to that.

In The Game is a pretty weird mystery. It’s obviously the author’s first book, because she uses it as a dumping ground for all her observations about Chicago and life in general. I guess she had a lot of things stored up she just had to get off her chest, and in 1991 people didn’t have blogs, so they dumped it in their mysteries instead.

So nothing happens until 50 pages in, and then things continue not to happen at least until page 80, when I gave up.

And I usually totally disagree with things like the above — I don’t care whether the protagonist is “likeable”. But in this instance, it was so egregious… She’s supposed to be an investment manager (!), but kinda happy-go-lucky anyway, and quite smart and stuff. So she seems like a wish fulfilment character in many ways… but then she goes and does the most horrendous and horrendously stupid things! Things that would only make sense in an over-the-top comedy, which this isn’t. So what we’re left with is in effect some kind of psycho that we’re supposed to really like?

It’s very strange, and not compelling at all.

I enjoyed reading this — I haven’t read many Gothic romances. However, it’s a bit on the long side? It’s one of those books where you can’t really see anything that can be cut, per se, but it still seems to take an excessive amount of time to read.

If there is a main problem, it’s that the “solution” to the problem is obvious from the start: Edmund is a total prat, and there’s no way out of the problems presented in the book other than to kill him. So I was just waiting for our heroines to pull themselves together — for hundreds of pages — and off him. Which is probably what the author intended the reader to feel, but perhaps she could have… had, like, a progression?

I like Star Trek as much as the next person, but I’m not a big enough fan to read the Star Trek novels. However, I saw somebody wax poetic about how great this particular one is, and since I remember liking Diane Duane’s other books, I thought “what the hey”.

And it’s… I wouldn’t say I was bored silly while reading this, but I felt such intense waves of disinterest. I should have ditched it, but I persevered, which was a bad decision. The book is like reading a novelisation of a pretty mid Star Trek episode — one where they didn’t have a very high budget, so much of the episode is about 4D chess and stuff. The emphasis is on Kirk/Spock/McCoy repartee, and Duane obviously had a lot of fun writing it.

It’s not badly written on a scene by scene basis at all — it’s just really, really hard to be interested in what’s going on on the pages. Perhaps if I were a bigger Star Trek fan, it’d be more fun?

As for the recommendation — it’s a common thing with older books like this: A person read the book when they were, say, eighteen, and then 43 years later, they remember quite liking it, and then say “it’s the best book ever!!!” Recommendation Strength should decay by the number of years since you read a book.

And… those were the only four books I wanted to mention, apparently? Okidoke.

Oh, yeah, I also wanted to mention:

I probably bought it because of the Pullman quote — “The best thriller I’ve ever read”. I mean, that’s really selling it.

Unfortunately, the book is shite, and I dropped out after about a hundred pages.

I wrote most of the preceding text after reading each book (and just added this postscript and the introduction now), but I note that I had a lot of stuff to say at the start of the month, and less and less as the days progressed, so perhaps this’ll be the first and last post in this blog series.

Or not.