The Mysteries of WordPress

I moved to a self-hosted WordPress last week, and importing the images failed, so I had to do that manually. (That is, with rsync.)

Everything seemed to work fine, but then I noticed that loading the images of some of the older pages seemed to take a long time. Like, downloading megabytes and megabytes of data.

Time to do some debuggin’.

I’ve been a WordPress.com user for almost a decade, and I have avoided actually looking at the WordPress mechanisms as much as I can. But I did know that WordPress rescales images when you add them to the media library:
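
(The exact sizes depend on the theme and the settings, so the dimensions below are just illustrative.)

dsc00863.jpg
dsc00863-150x150.jpg
dsc00863-300x200.jpg
dsc00863-768x511.jpg
dsc00863-1024x682.jpg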

So uploading that dsc00863.jpg file results in all those different files being made, and since that hadn’t happened during my migration, I tried the Media Sync plugin, which is apparently designed just for my use case. I let it loose on one of the directories, and all the scaled images were dutifully made, but… loading the page was just as slow.

*sigh*

I guess there’s really no avoiding it: I have to read some WordPress code, which I have never done, ever, in my life. And my initial reaction to looking at the code can best be described as:

AAAAARGH!!!! IT’S THE MOST HORRIBLE THING EVER IN THE HISTORY OF EVER!

It’s old-old-style PHP, which is an unholy mix of bad HTML and intermixed PHP, with 200-column-wide lines. I had imagined that WordPress was … you know, clever, or something. I mean, it’s what the internet is built on, and then it’s just… this?

Anyway, I started putting some debugging statements here and there, and after a surprisingly short time, I had narrowed down what adds srcset (with the scaled images) to the img elements:
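
I won’t paste the verbatim source, but paraphrased from memory (the function name here is mine and the details are approximate, not the real WordPress code), the content filter in wp-includes/media.php is shaped roughly like this:

function sketch_add_srcsets( $content ) {
    // Find every img tag in the post content.
    if ( ! preg_match_all( '/<img [^>]+>/', $content, $matches ) ) {
        return $content;
    }
    foreach ( $matches[0] as $image ) {
        // The attachment id is stashed in the class attribute, e.g. "wp-image-1234".
        if ( preg_match( '/wp-image-([0-9]+)/i', $image, $class_id ) ) {
            $attachment_id = absint( $class_id[1] );
            $image_meta    = wp_get_attachment_metadata( $attachment_id );
            if ( $image_meta ) {
                // This call computes and inserts the srcset/sizes attributes.
                $new_image = wp_image_add_srcset_and_sizes( $image, $image_meta, $attachment_id );
                $content   = str_replace( $image, $new_image, $content );
            }
        }
    }
    return $content;
}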

And I started to appreciate the WordPress code: Sure, it’s old-fashioned and really rubs me the wrong way with its whitespace choices, but it’s really readable. I mean, everything is just there: There’s no mysterious framework or redirection or abstract factories.

The code above looks at the HTML of an img tag, and if the class (!) of the img contains the string “wp-image-”, then that’s how it identifies the image in the database, and uses that to look up the metadata (sizes and stuff) to make the srcset attribute.

You may quibble and say that stashing that data in the class of the img is a hacky choice, but I just admire how the Automattic programmers didn’t do a standup where that went:

“well, what if we, in the future, change ‘wp-image-‘ to be something else? And then the regexp here will have to be updated in addition to the code that generates the data, so we need to encapsulate this in a factory factory that makes an object that can output both the regexp and the function to generate the string, otherwise we’re repeating ourselves; and then we need a configuration language to allow changing this on a per-site basis, and then we need a configuration language generator factory in case some people want to store the ‘wp-image-‘ conf in XML and some in YAML, and then”

No. They put this in the code:

preg_match( '/wp-image-([0-9]+)/i', $image, $class_id )

Which means that somebody like me, who’s never seen any WordPress code before, immediately knows what has to be changed: The HTML of the blog posts has to be changed when doing the media import, so that everything’s in sync. Using the Media Sync plugin is somewhat meaningless for my use case: It adds the images to the library, but doesn’t update the HTML that refers to the media.

So, what to do… I could write a WordPress plugin to do this the right way… but I don’t want to do that, because, well, I know nothing about WordPress internals, so I’d like to mess that up as little as possible.

But! I’ve got an Emacs library for editing WordPress articles. I could just extend that slightly to download the images, reupload them, and then alter the HTML? Hey? Simple!

And that bit was indeed trivial, but then I thought… “it would be nice if the URLs of the images didn’t change”. I mean, just for giggles.
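
Here’s a made-up example of what an image tag in a post looks like (the id, path and dimensions are arbitrary):

<img class="alignnone size-full wp-image-1234"
     src="https://example.com/wp-content/uploads/2017/06/dsc00863.jpg"
     alt="" width="1024" height="682" />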

This is basically what the image HTML in a WordPress source looks like. The images are in “wp-content/uploads/” and then a year/month thing. When uploading, your image lands in the current month’s directory. How difficult would it be to convince WordPress to save an upload into the original date’s directory via the API instead?

I grepped a bit, and landed on mw_newMediaObject() in class-wp-xmlrpc-server.php, and changed the upload call in there:
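
I won’t reproduce the exact diff, but the gist is something like this: wp_upload_bits() already takes an optional fourth argument, a “yyyy/mm” string that decides which uploads subdirectory the file lands in, so it’s mostly a matter of passing the wanted date along from the API call (the $data['date'] struct member below is my own invention for the sketch, not a standard field):

// Original call in mw_newMediaObject():
//   $upload = wp_upload_bits( $name, null, $bits );
// Sketch of the change:
$time   = ! empty( $data['date'] ) ? $data['date'] : null;  // e.g. "2017/06"
$upload = wp_upload_bits( $name, null, $bits, $time );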

And that’s it! Now the images go to whatever directory I specified in the API call, so I can control this from Emacs.

WordPress doesn’t like overwriting files, of course, so if asked to write 2017/06/foo.jpg, and that already exists, it writes to 2017/06/foo-1.jpg instead. Would it be difficult to convince WordPress otherwise?

No! A trivial substitution in wp_upload_bits() in functions.php was all that it took.
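
Again, a sketch rather than the literal diff: the -1, -2 suffixes come from wp_unique_filename(), so the substitution amounts to not calling it.

// In wp_upload_bits() (wp-includes/functions.php), the suffixing comes from:
//   $filename = wp_unique_filename( $upload['path'], $name );
// Using the raw name instead makes uploads overwrite existing files:
$filename = $name;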

With those in place, and running an Emacs on the same server that WordPress was running on (for lower latency), Emacs reuploaded (and edited) all the 2K posts and 30-40K images in a matter of hours. And all the old posts now have nice srcsets, which means that loading an old post doesn’t take 50MB, but instead, like… less than that…

It’s environmentally sound!

(Note: I’m not recommending that anybody do these alterations “for real”, because there are probably all kinds of security implications. I just did them, ran the reupload script, and then backed them out again toot sweet.)

Anyway, my point is that I really appreciate the simplicity and clarity of the WordPress code. It’s unusual to sit down with a project you’ve never seen before, and it turns out to be this trivial to whack it into doing your nefarious bidding.

This Is A Test

This blog has been hosted on WordPress.com for many a year. It has, all in all, been a very pleasant experience: It feels like the uptime has been at least 110%, and most everything just works.

The problem with that solution is that it’s very restrictive. There are so many little things you just can’t do, like adding Javascript code (for which I’m sure many people are grateful), or customising the CSS in a convenient way.

I’ve worked around the shortcomings of the platform, but the small annoyances have piled up, and this weekend I finally took the plunge.

The reason for doing it now instead of later was that WordPress.com seemed to experience a hiccup a couple of days ago, and I thought that instead of bugging support with the problem, I’d just take it as an opportunity to get moving. The problem was that the admin pages suddenly started taking 15 seconds to load. I checked it out in the browser debugger, and it was the initial “GET /” thing that took 15.something seconds, but only if I was logged in. So they obviously had an auth component that was timing out, and falling back to a backup thing (and it’s been fixed now).

But I clicked “export”, created a new VM at DigitalOcean, and got importing.

And… it failed. It got a bit further every time, downloading all the media from the old blog, but then failed with “There has been a critical error on your website. Please check your site admin email inbox for instructions.”.

After doing that about ten times (and no email arrived), I checked the export XML file, and what did I find?

*sigh*

So I got a new export file (after waiting 15 seconds), and ran the import again… and it failed again the same way. So that wasn’t the problem after all?

I blew the VM away, started from scratch again, and this time skipped doing the import of the media, and that worked perfectly.

To do the media, I just scripted something to download all the images, and then I rsynced them over to the new instance. Seems to work fine, even if the images aren’t in the “media library” of WordPress, but I never cared about that anyway…

It’s even possible to copy over subscribers and stats from the old WordPress.com instance, but that requires help from the Automattic support people. And I’m flabbergasted at how efficient they are: I had two requests, and each time it took them less than five minutes to do the request and get a response. I’ve never seen customer support, I mean Happiness Engineering, that efficient before; ever. It almost made me regret doing the entire move to self-hosted blogging…

Anyway. This is a test! If this post is posted, the new WordPress instance works.

Search Index Cleanliness Is Next To Something

Allegedly, 30% of all web pages are now WordPress. I’m guessing most of these WordPress sites aren’t typical blog sites, but there sure are many of them out there.

Which makes it so puzzling why Google and WordPress don’t really play together very well.

Lemme just use one of my own stupid hobby sites, Totally Epic, as an example:

OK, the first hit is nice, because it’s the front page. The rest of page one in the search results is all “page 14”, “category” pages and the like, none of which are pages that anybody searching for something is actually interested in.

The worst of these are the “page 14” links: WordPress, by default, does pagination by starting at the most recent article, and then counts backwards. So if you have a page length of five articles, the five most recent articles will be on the first page, then the next five articles are on “page 2”, and so on.

You know the problem with actually referring to these pages after the fact: What was once the final article on “page 2” will become the first article on “page 3” when the blog bloviator writes a new article: It pushes everything downwards.

So when you’re googling for whatever, and the answer is on a “page 14” link, it usually turns out not to be there, anyway. Instead it’s on “page 16”. Or “page 47”. Who knows?

Who can we blame for this sorry state of affairs? WordPress, sure; it’s sad that they don’t use some kind of permanent link structure for “pages”. Instead of https://totally-epic.kwakk.info/page/5/, the link could have been https://totally-epic.kwakk.info/articles/53-49/; i.e., the post numbers, or https://totally-epic.kwakk.info/date/20110424T042353-20110520T030245/ (a publication time range), or whatever. (This would mean that the pages could increase or shrink in size if the bloviator deletes or adds articles with a “fake” time stamp later, but whatevs?)

Can we also blame Google? Please? Can we?

Sure. There’s a gazillion blogs out there, and they basically all have this problem, and Google could have special-cased it for WordPress (remember that 30% thing? OK, it’s a dubious number) to rank these overview pages lower, and rank the individual articles higher. Because it’s those individual pages we’re interested in.

This brings us to a related thing we can blame Google for: They’re just not indexing obscure blogs as well as they used to. Many’s the time I’m looking for something I’m sure I’ve seen somewhere, and it doesn’t turn up anywhere on Google (not even on the Dark Web; i.e., page 2 of the search results). Here’s a case study.

But that’s an orthogonal issue: Is there something us blog bleeple can do to help with the situation, when both Google and WordPress are so uniquely useless in the area?

Uneducated as I am, I imagined that putting this in my robots.txt would help keep the useless results out of Google:

User-agent: *
Disallow: /author/
Disallow: /page/
Disallow: /category/

Instead this just made my Google Search Console give me an alert:

Er, OK. I blocked it, but you indexed it anyway, and that’s something you’re asking me to fix?

You go, Google.

Granted, adding the robots.txt does seem to help with the ranking a bit: If you actually search for something now, you do get “real” pages on the first page of results:

The very first link is one of the “denied” pages, though, so… it’s not… very confidence-inducing.

Googling (!) around shows that Google is mostly using the robots.txt as a sort of hand-wavy hint as to what it should do because the California DMV added a robots.txt file in 2006.

It … makes … some kind of sense? I mean, for Google.

Instead the edict from Google seems to be that we should use a robots.txt file that allows everything to be crawled, but include a

<meta name="robots" content="noindex,follow">

directive in the HTML to tell Google not to index the pages instead.

Fortunately, there’s a plugin for that. But googling for that isn’t easy, because whenever you’re googling for stuff like this you get a gazillion SEO pages about how to get more of your pages on Google, not less. Oh, and this plugin seems even better (that is, it gives you finer control over which pages to noindex).

So I added this to that WordPress site on March 5th, and I wonder how long it’ll take for the pages in question to disappear from Google (if ever). I’ll update when/if that happens.

Still, this future is pretty sad. Instead of flying cars we have the “Robots “noindex,follow” meta tag” WordPress plugin.

[Edit one week later: No changes in the Google index so far.]

[Edit four weeks later: All the pagination pages now no longer show up in Google if I search for something (like “site:totally-epic.kwakk.info epic”), so that’s definitely progress. If I just search for “site:totally-epic.kwakk.info” without any query items, then they’ll show up anyway, but I guess that doesn’t really matter much, because nobody does that.]

Parallax Error Beheads You

tl;dr: I made a silly 3D web page thing.

Yadda yadda:

For entirely nostalgic reasons, I’ve been buying a bunch of paperback books published by the largest Norwegian publishing house, Gyldendal, in the 60s and 70s. I guess these are the Norwegian equivalents of what Penguin was at the time: Cheap, but nice and with a nose for quality.

As a teenager (when the series had wound down), I used to walk around the library, looking at these artefacts and thinking “I should really read all of these”. I think one of the triggers for this weird desire is that they’re numbered, so it’s conceivable to read them all. Even in the correct order!

But the library didn’t have the oldest books in the series, so I never got started… But it’s a thought that has reoccurred to me over the years, and…

Look what happened:

I started buying them the other week. They’re still cheap: Inflation-adjusted they’re cheaper now than when they were published. Which is both nice and not-so-nice: It can be harder to find cheap books, because people don’t put them up for sale, as it’s not worth the bother. But I’ve bought 60% of them now (that’s 25% of them in the picture up there).

While doing this, I was also thinking about 3D. Perhaps because many of the covers are kinda pop-artey. And perhaps this nostalgia trip made me think about the demos I made as a teenager. And I’ve never done any 3D programming, ever, so I sat down and started typing some Clojurescript… and:

The stupid source code is here, and the live web site is here.

This is only the second Reagent thing I’ve written, and it’s… not a very Reagent-ey single page app. The main problem is that I have to do some low-level DOM fiddling, and I didn’t find a way to do that with Reagent’s “proper” way of doing things. For instance, when going from an animation to a transition, I have to stop the animation, query the 3D state of the object, copy that over to the object’s style as is, and then start the transition. Try as I might, I couldn’t figure out how to do that in Reagent without glitches, so I just resorted to altering the DOM directly (adding styles and stuff on the fly).

Working with CSS 3D, as a total novice, was pretty fun. You can play around with the 3D stuff in Emacs and see the changes immediately in the browser. Getting to grips with how to do perspective, or not, also took a few tries. For instance, when a book glides out of the library, the other faces haven’t loaded yet, so it glides straight out towards the viewer, hiding the other faces, and only starts to turn once the images have loaded. So that bit has a way-off perspective, while it’s more fun to have a closer perspective when the books are spinning…

Lots of trial and error. There’s 98 commits.

¯\_(ツ)_/¯

The annoying thing about CSS 3D is, of course, that there’s a number of browsers out there. The site looks somewhat choppy in Firefox, very smooth in Chrome, and there are some glitches in Safari, which seem to stem from Safari not being able to determine (fast enough?) what objects are behind what other objects when there are a lot of them, and the Z-axis difference between the objects is less than a couple of pixels.

Oh, and I got to use a new tool:

To measure the spines for scanning. Fun!

The confusing title of this post is from an album by Max Tundra:

Reagent is… Nice?

I’ve been procrastinating on writing a web-based admin interface for news.gmane.io… because I just haven’t been able to make up my mind as to what technologies to use.

I hate learning new stuff, but it feels pretty stagnant to tap away in Javascript (on the frontend) and PHP for whatever has to happen on the backend. I really just kinda dislike PHP for no particular reason, but it’s so convenient: If you need a simple API for doing whatever, you write a .php page and that’s it. No dependencies, no setup, pretty good error reporting: Everything’s built into Apache.

But I don’t liiiike iiiittt. (Imagine a whiny voice.)

For the Gmane stuff, I had the additional problem that I have a lot of admin logic in Emacs, and I want to keep that. Because it’s really convenient, especially when doing mass updates.

So… it took me weeks to accept it, but if I didn’t want to implement a lot of things twice, I had to use Emacs on the backend.

*insert shocked face here*

I wrote a very short PHP component that does the TLS and the auth, and just reads some JSON posted to it, sends it over to Emacs, reads the JSON from Emacs and spits it out to the client. Like… middleware.
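
Something along these lines (a sketch, not the actual component: the port number and the framing towards Emacs are made up here, and Apache is assumed to have already handled HTTPS and authentication):

<?php
// Read the JSON request that Apache has already authenticated.
$request = file_get_contents( 'php://input' );

// Hand it to an Emacs process assumed to be listening on a local port.
$emacs = fsockopen( '127.0.0.1', 8010, $errno, $errstr, 5 );
if ( ! $emacs ) {
    http_response_code( 502 );
    exit( json_encode( array( 'error' => "$errno $errstr" ) ) );
}
fwrite( $emacs, $request );
// Signal end-of-request so Emacs knows to reply.
stream_socket_shutdown( $emacs, STREAM_SHUT_WR );

// Relay whatever JSON Emacs sends back to the client.
header( 'Content-Type: application/json' );
fpassthru( $emacs );
fclose( $emacs );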

I could have done the HTTPS server in Emacs, too, but there’s just too many variables to get that solid, and Apache works.

So. That’s the backend, and what about the frontend?

I have no knowledge of Clojure, ClojureScript, Java or React, so I settled on Reagent.

My first stumbling block was Clojure, of course. I’ve been writing in Common Lisp for my day job for a couple of decades, and Clojure is… not Common Lisp? And I’m not sure what the design philosophy behind it is. Is it perhaps “make it look cool enough so that Java people won’t notice that it’s Lisp”?

Compared to Common Lisp, it’s terse and tends towards line noise, just like Perl. I made the following comment on irc:

(lambda (bar) (foo bar)) is the same as #(foo %)

And got the sarcastic response back:

And I guess (lambda (foo zoo) (bar foo zoo)) is the same as #(bar %1 %2)?

And it is! It was a joke, but that’s exactly what it is. When people are sarcastic, but happen onto the actual language design, that says… something?

And, oy, don’t get me started on the threading operators (-> and ->>). Perhaps designed to placate Java developers who can’t read anything but foo.bar().zot().foobar()? More than a few of the design decisions seem predicated on limitations of the Java language (which are then reflected in the JVM), like there not being “real” keyword arguments for functions.

My point is: My quibbles are irrelevant. Whatever the idea behind Clojure was — it worked. People love these tricks, because people love writing code that’s incomprehensible I mean clever.

We now have a Lisp that’s mainstream enough that you can do web development on it instead of writing Javascript. And for that we’re all grateful.

I have not learned Clojure in depth (to put it mildly), but learning enough to write a web page only took a day. I guess I’ll look back upon my first ClojureScript project in shame, but it, like, works, and it’s a lot more fun to add new stuff to it now than it would have been in Javascript.

My major problem with all this is… the tooling isn’t quite all there yet when developing. With lein and Figwheel, everything reloads nicely and magically in the browser while doing stuff, and when doing something egregiously wrong, I get nice error messages:

However, if the breakage isn’t during compilation, the error reporting is really, really bad:

That’s an error message from line 25173 in react-dom.js, and determining where the error in my .cljs file is is… difficult? I thought I must be doing something obviously wrong to not get better error reporting, but googling this stuff shows that people mostly are just putting a lot of prns everywhere, and that’s really primitive.

Even worse are Reagent errors that are less… errorey. I spent an hour on a problem with bind-fields because I thought it took a function parameter, but it wanted a vector. Absolutely no feedback whatsoever — nothing worked, but I didn’t see what the problem was until I googled "reagent-forms" "bind-keys" (with quotes), and the second answer was from somebody who’d done exactly what I’d done.

And some of the error messaging seems wilfully obtuse:

This was because of:

Yes, those should be square brackets. (And note: No reporting on what line the error was on.)

*sigh*

But Reagent feels quite nice, and the Hiccup HTML syntax is wonderful: The best I’ve seen in any language. Even real Lisps don’t have an HTML-generator syntax that’s that thought-through and regular. I mean… this makes me happy:

[:div
 [:h2 "New edit requests"]
 [:div#requests.log
  (map (fn [req]
         [:div.clickable {:on-click #(show-edit % req)
                          :key (:request-time req)}
          (:request-time req) " " (:newsgroup req)])
       (:ok data))]]

Here’s the live admin interface in action, handling an edit request: