There’s a soft copyright strike thing now?

Oh, some context — ten years ago I decided to watch as many Ingmar Bergman things as I could lay my hands on, which turned out to be 87 things, so I landed on this design for the blog series:

Most of those things I watched were his well-known and fully available films, but he’d also done a large number of stage plays, and there’d been a number of short films, TV plays, and short documentaries about Bergman, and these were really hard to come by.

I found a guy in France (I think; still not sure) that sold bootleg DVD-Rs of this stuff — he must have been collecting for decades, and that helped a lot. I also found stuff on the torrents, and on other blogs, and in the end, people who had recorded stuff from the TV in the 80s also sent me things.

Anyway, after watching all this marvellous stuff, I wondered whether I should do something with this treasure trove… but I didn’t. Then I saw that the bootleg guy had totally disappeared off the face of the internet, and his DVD-R burning business with it.

So I thought the time had probably come to just upload all the things that were not commercially available anywhere to Youtube, as “The Bergman Channel”. And so I did, and to my surprise the channel survived the copyright strikes — mainly because the main copyright holder of Bergman’s work is Svenska filmindustrier, and they’ve mostly just claimed copyright for the things in Sweden. So if you’re in Sweden, half of The Bergman Channel is blocked.

But today I got this:

The interesting thing is this:

What to do next
[…]
* Delete your video. If you remove your video before 7 days are up, your
video will be off the site, but your channel won’t get a copyright strike.

It’s a “soft” copyright strike? I’ve never seen one of those before. Are they new? Anyway, it’s nice, because if you get three normal copyright strikes, your entire channel disappears. So thanks — I’ve now deleted “Karins ansikte”. Is it available elsewhere now? I haven’t paid attention…

While I was logged into the channel again for the first time in at least four years, I had a look at the stats:

Hey, total view time is 17K hours! Nice. Looks like the most-watched Bergman thing is the long-lost made-for-TV film Rabies. I mean, you can understand why it’s long-lost — the video quality is kinda bad, and if they don’t have a better source than this, I guess they’d feel bad about releasing it commercially. But like I said, I haven’t really checked whether these things have become available by now…

Watch them all here before they disappear. At this rate (one going missing every four years), that’ll take only 128 years…

I’ve got original pages by Carol Swain!!!

Carol Swain started selling original art a couple of months ago, and I snapped up this three-page story, Jig and reel. Each page is about 30x42cm. They’re so gorgeous!

*calms down a bit*

I think I should get them framed… I wanna have them on the wall. Hm… one long frame or three separate frames and arrange them like a triptych? Hm…

(Wherein I blather on about stuff and I eventually apologise to new readers)

About a week ago, I posted about equals signs in some documents, and that totes went viral, so I thought I’d talk some about that? For no good reason, really — do you need a reason to do navel gazing? Do you?

The first day, the traffic was dominated by readers originating from Hacker News. Then all of a sudden, there was a huge influx of traffic from Twitter. But whyyyy!

It was because of this tweet that got community noted, I think:

It did the numbers, anyway… Somehow this (along with repostings of the original tweet) led people to my account:

*gulp* 600K views! And I don’t even have a blue check! Do you think if I sign up quickly, and then ask Elon to add the views, then I’ll get lots of $$$? No? Darn.

I guess in this situation the traditional thing to do is to post a reply going “wow this blew up listen to my podcast”, but I don’t really have a podcast, and besides, I like using commas too much.

And I looked at all these “… and 100 others followed you” notifications and didn’t understand that that, er, rolled over or something? So I didn’t check my profile until days later, and:

*gulp* I had no idea that so many people were interested in email transport standards! That’s an increase of, er, 20,000% or something. And they’re all going to be really disappointed when they get tweets about obscure 80s comics in their feed all of a sudden. I apologise in advance!

To make it up to these people, perhaps I can talk about RFC2231: “MIME Parameter Value and Encoded Word Extensions: Character Sets, Languages, and Continuations”. To the best of my knowledge, I was the only person who actually implemented this standard, so if you use it in your email, people who don’t use Gnus won’t be able to decode certain headers at all.

Look how pretty it is:

(rfc2231-encode-string "foo" "These are röck döts")
=> "foo*=utf-8''These%20are%20r%c3%b6ck%20d%c3%b6ts"

OK, perhaps there’s a reason it didn’t take off…

But that’s it! From now on there’ll be no more posts about email encoding! You’ll have to get your fix of that elsewhere!

But what you’re all wondering — were there any funny replies? Eh, some, but I liked this best:

Euhm, it’s really not, but other than that: A brilliant tweet.

And I think they should invite me to the influencer awards next year; the appropriately named “Vixen Awards”. There must be justice in this world!

After that, there was another bump originating on Youtube, but now it’s calmed down…

Oh, and I got a request to publish a translation of the post in a Norwegian magazine. So I translated it to Norwegian. The hardest thing to translate was “röck döts”. We workshopped the translation quickly on IRC and landed on “röcktödlar”…

… which is a word the world has never seen before! I’m shocked!

The magazine doesn’t seem to have published it, though? Oh well, the vagaries of publishing…

And… that’s enough blathering from me about this extremely important subject. I’ve got comics to read.

How patient are AI scrapers, anyway?

As I was falling asleep the other day, I wondered whether adding a simple delay would fix the problems I’ve been having with WordPress visitor statistics. (To recap, the problem is that “modern” scrapers run a full browser instance, and they don’t put anything saying that they’re a bot in the User-Agent.)

So: Instead of calling registerVisit immediately from Javascript, I delayed it for five seconds. (This has the added side effect of not counting people who just drop by and then close the page immediately — which is also a plus, I think?)

So I made the change and waited a bit and:

*sigh* Nope, doesn’t seem to help much at all. The people running these things just have way too many resources — they fetch a page, and then let the Javascript elements run for several seconds before grabbing the results.

(An additional amusing thing about the specific bot in this random example is that it’s really into using VPNs in really, really obscure countries…)

Which may be a testament to how awful many web pages are these days, really — you go to a page, and it takes ages for all the data to load, and then finally something renders and you can read it. For instance, here’s Facebook:

It takes about six seconds until it’s “reasonably” loaded… Pitiful.

Well, if Facebook takes six seconds, perhaps AI scrapers have settled on something along those lines?

What if I increase the period to ten seconds? I mean, ten seconds is longer than it takes to read an article like this, really, but win some, lose some.

*types a bit and then waits for another day*

Uhm… I think this basically fixed the issue? I can no longer detect any of the typical obvious scraper runs, and virtually no traffic from China/Singapore is being logged now.

I think I found a solution! Kinda! I guess I should let this run longer before drawing any conclusions, but that’s for cowards! Ten seconds is the magic number!

About those equals signs again…

Well, whaddayou know — my post about an obscure en/decoding issue went totally viral yesterday! I had no idea that people were so interested in RFC2045, an email standard from 1996. Just goes to show…

Just kidding!

But I see there was some confusion about some details (mostly due to my unfortunate tendency to (try to) be funny), so I thought I’d bloviate a bit more in case anybody’s still confused. And! A bit about the server problems at the end, too.

OK?

So: Over the past few days, I’d been getting more and more annoyed with threads about the horrific Epstein emails that had been published. I mean, not about the horrificness itself, but because I’m a nerd. I was getting really fed up with all those people, so sure of themselves, declaring that one specific oddness — the presence of = characters sprinkled almost at random, replacing other characters — was due to OCR artefacts.

Now, I know OCR (I’ve done millions of pages), and let me tell you, sirrah, those are no OCR artefacts.

It was even more annoying when people were incorrecting each other with things like “yeah, you can see that it pops up more often when the letter should have been an h” and so on. NO YOU FOOLS! YOU FOOLS! IT’S QUOTED PRINTABLE!!! *storms out in a huff*

I did see other people point out the quoted printableness of it all, but I didn’t see anybody actually explain the specific oddities we were seeing. If somebody in charge of processing these emails, knowing nothing about email, had gone “huh, a lot of these lines end with =, let’s just remove those”, you wouldn’t get missing characters. You’d get oddly formatted lines, but in any case, you’d have gotten rid of those equal signs for sure.

So I didn’t post anything, because I didn’t really have a good explanation other than “you’re all wrong! wrong I tells ya!” *stomps foot*

(The other popular hypothesis was that Epstein cleverly inserted = into random words, replacing other characters, because it was some sort of secret code. *sigh*)

But then it came to me… what if it’s not actually continuation handling that has gone wrong here, really? What if it’s that other thing in quoted printable — non-ASCII characters?

I did some googling… and I found somebody had experienced exactly that problem in the wild! 14 years ago! Yes!

So here’s the algorithm I think they used. The algorithm is buggy, but if you’ve dealt with encodings like this before, I think you can see how it ended up like this.

  • First you want to fix up continuation lines. You’re working on a Windows machine, or you’re actually reading the RFC (I know, don’t laugh — I know it’s unlikely). A continuation line on a Windows machine is =CRLF. So you write:
    (while (search-forward "=\r\n" nil t)
      (replace-match ""))

    (For the code examples, I’m assuming whoever was doing this conversion was using Emacs Lisp, because aren’t everybody? Ahem. Same thing in any language, really.)

    The problem here is that you’ve gotten all these files from Gmail, and they’re not a Windows shop — so the line endings are "=\n" instead.

    So your first pass over the file does absolutely nothing, but you don’t notice, because then your program hits pass two:

  • You look for = followed by two characters. This encodes to one non-ASCII character. (Or octet, really.) Because you know that that’s the only thing left — you’ve already handled all the continuation lines, right? Right? Right. So you delete the two encoding characters, and then you replace the = with the new byte. But you have some sanity checks, of course!
    ;; ("." doesn't match newlines in Emacs regexps, so this spells
    ;; out "any two characters, newlines included".)
    (while (re-search-forward "=\\(\\(?:.\\|\n\\)\\{2\\}\\)" nil t)
      (let ((code (match-string 1)))
        ;; Delete the two characters either way, but only replace the
        ;; = if they were actually valid hex.
        (delete-region (match-beginning 1) (match-end 1))
        (when-let ((new-byte (and (string-match-p "\\`[[:xdigit:]]\\{2\\}\\'" code)
                                  (string-to-number code 16))))
          (subst-char-in-region (match-beginning 0) (1+ (match-beginning 0))
                                ?= new-byte))))

    The sanity check is, unfortunately, that if the thing doesn’t actually decode, you leave the = in place. And in this case, it never decodes, because =\nc is never valid.

    I can certainly say that I’ve written code along these lines before, unfortunately. (A runnable version of the whole two-pass thing follows right after this list.)
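
Putting the two passes together into something you can actually evaluate (the sample string is mine, and this is a reconstruction of what I think they did, not their actual code, obviously), you get exactly the artefact from the dumps:

(with-temp-buffer
  ;; "keeping", wrapped with a quoted-printable soft line break after
  ;; "ke" -- and with a Unix line ending, not a Windows one.
  (insert "ke=\neping")
  ;; Pass one: finds nothing, since there's no =CRLF here.
  (goto-char (point-min))
  (while (search-forward "=\r\n" nil t)
    (replace-match ""))
  ;; Pass two: deletes the "\ne", fails the hex check, leaves the =.
  (goto-char (point-min))
  (while (re-search-forward "=\\(\\(?:.\\|\n\\)\\{2\\}\\)" nil t)
    (let ((code (match-string 1)))
      (delete-region (match-beginning 1) (match-end 1))
      (when-let ((new-byte (and (string-match-p "\\`[[:xdigit:]]\\{2\\}\\'" code)
                                (string-to-number code 16))))
        (subst-char-in-region (match-beginning 0) (1+ (match-beginning 0))
                              ?= new-byte))))
  (buffer-string))
=> "ke=ping"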

So there you go. The mystery is why there are still some =C2 things in there — but I think that can be explained by just having some off-by-one errors. That is, the mails originally had =C2=A0 (the UTF-8 encoding of NON BREAKING SPACE), and you often see either the =C2 or the =A0 missing, so my guess is that the algo just skipped ahead a bit too much. That’s also a typical error you wouldn’t find while doing some trivial testing under Windows, where your test files wouldn’t be in UTF-8, but in a code page like CP1252, where characters like NON BREAKING SPACE are just one byte.

So that’s my er “theory”, based on implementing mail standards for decades, and observing how sloppy people are when approaching this stuff. I mean, the RFCs are usually very straightforward and easy to follow, but people seem to wing it anyway.

Here’s a fun example — you see the character-replaced-by-equals effect you’d get with the algo described above (note ke=ping), but you also have the decode-half-the-non-ASCII-bytes thing — you probably had =C2=B7, which is supposed to be a · (a middle dot character) or something along those lines, but only the second byte has been converted, and not the first. (So you get an invalid character instead.) I don’t really have an explanation for how you’d mess up something like that, but us programmers are good at inventing new methods of doing the wrong thing, aren’t we? (If you have a theory for what the algo looked like, feel free to leave a comment.)

You can also see emails that haven’t been converted at all:

This is just a line-folded raw Quoted-Printable mail: Note the no= te, where neither the equals sign, the line ending, nor the following character has been removed. (The undecoded =C2=A0s would just decode to NON BREAKING SPACE.)
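
(For contrast, a non-buggy decoder handles that soft line break just fine; here, purely as a demonstration, is the one that ships with Emacs:)

(quoted-printable-decode-string "no=\nte")
=> "note"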

I’m also slightly curious about whether these emails were really multipart/alternative — that is, whether the person preparing these emails for printing just chose the text/plain parts to work on, since that would obviously be less work than printing out the text/html parts? That would also explain why all the images and stuff have gone missing…

Anyway.

How did the viralness go, I hear nobody ask? Well, my post was posted on Hacker News, and I was going “hah, once again, behold my wonderful WordPress installation, which has no problems dealing with any of this stuff; I wonder why so many sites go down when they land on Hacker News (I’ve been there several times and I’ve never seen a load over 0.5), all those other people must be amateurs. Not to mention those smug static site generator people… So unnecessary…”

Aaaargh! WHAT! WHAT’S HAPPENING! THE SITE IS DOWN!!!

But there’s still no load? What’s going on!!!!

So I started wondering whether I’d disabled WP Super Cache (which serves out cached HTML without hitting the database much) while doing some experimentation the other week, but no. So I started thinking… “what if the problem isn’t WordPress per se, but just a lot of traffic — it just runs out of connections (due to KeepAlive and stuff)”…

So I googled, and Gemini told me what to fix (YES, AN LLM TO THE RESCUE, I couldn’t believe it either). It first told me to find out what kind of setup I have:

root@blogs:~# apache2ctl -V | grep MPM
Server MPM:     prefork

It’s mpm_prefork, which means that the conf is in /etc/apache2/mods-available/mpm_prefork.conf, so I upped everything there. So there’s now:

StartServers            5
MinSpareServers         5
MaxSpareServers         40
ServerLimit             1000
MaxRequestWorkers       1000
MaxConnectionsPerChild  0

That’s up from a standard of:

StartServers            5
MinSpareServers         5
MaxSpareServers         10
MaxRequestWorkers       150
MaxConnectionsPerChild  0

So I increased ServerLimit from its default of 256 to 1000, and MaxRequestWorkers from 150 to 1000. (With prefork, MaxRequestWorkers can’t exceed ServerLimit, so both have to go up.) I restarted Apache, and everything was suddenly hunky dory again.
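
(And for completeness, since I glossed over the restart: it’s just the standard Debian-style commands, nothing Gemini-specific. Checking the syntax first is cheap insurance:)

root@blogs:~# apache2ctl configtest
Syntax OK
root@blogs:~# systemctl restart apache2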

The load, even with this configuration, is steadily less than 1, so it can probably be upped more.

But how much traffic was causing this problem? Well, in general, there were just two page views per second (i.e., 6738 per hour in the snapshot above), but:

Each page view entails about 50 resource hits, so it’s about 100 hits per second.

Which is nothing — you can see from the load that this Dropbox instance has no problem serving it, resource-wise — but it’s just that the default connection parameters are really conservative.

Which may indeed make sense in other configurations. If each hit lands on an URL that requires a lot of processing, a limit of 1000 will soon land you with a server with a load of 1000, and you don’t want that. But a WordPress instance (with caching switched on) is 99% serving static resources only, which is *piffle*.

It’s deranged that we’re still doing stuff like this, thirty years after web servers got popular. There’s still no way to tell Apache “all of these resources are static, use as many connections as you want — millions; I don’t care — to handle them. But these others (insert conf here) actually take CPU, so don’t handle more than 50 of them concurrently.”

So here we are… And now I can go back to rolling my eyes at those plebs that have server problems during hackernewsdotting. Hah!

In this screenshot, it’s been about 12 hours now since the virality started viraling, and as you can see, it makes the daily chart useless — all the other days are just a couple of pixels. I guess I could switch to a logarithmic chart, but I just don’t want to.

Seems we’re gonna land at, like, 65K page views in total for Feb. 4th — the previous record was 20K, so that explains why I had to tweak the Apache settings. Let’s see where we’re at now, about 24 hours after virality:

73K. Yeah, things have tapered off pretty quickly, but apparently Twitter has gotten into the business now:

Oh, and Bluesky?

The referrers are fun — as usual whenever there’s a Hacker News post, there’s a large followup effect from Reddit etc. But the sheer variety… 250 different sites, and the list ends like:

It’s a lot of stuff… so many web sites are downstream of Hacker News.

This one’s funny.

The comments on Hacker News were many and varied, but it’s kinda fun to see how the stupidest ones get downvoted pretty fast… As for comments here on this blog, I usually have WordPress auto-approve them (the Akismet anti-spam is so effective that it’s usually no problem), but some actual nazis found my blog, so I had to switch that off, and now I have to approve comments manually. Them’s the breaks.

So there you go.