On Caching

Have you ever thought about caching? I mean, for web pages?

This isn’t going to be a rant, and there’s certainly not going to be a “call to action” here, but this is just something I’ve been going “*sigh*” about for decades, and I thought perhaps it might be time to ruminate a bit. Don’t feel obligated to read this! If this is an area you work in, you already know all this, and if you don’t, it’s not going to be of interest.

OK?

OK.

Since the beginning of time (i.e., the mid 90s) web browsers have been caching data to make web page rendering faster. (And in those days, to keep your modem bill down, I guess.) Caching is a good idea, but it makes development slightly more complex.

Let’s take an example web page, index.html, from my web site Mrs. Kwakk Wakk, purveyor of purported particulars to prince & proletarian:

<html>
  <head>
    <meta charset='utf-8'>
    <link rel='stylesheet' type='text/css' href='kwakk.css'>
    <script type='text/javascript' src='kwakk.js'></script>
    <link rel="shortcut icon" type="image/png" href="favicon.png">
  </head>
  ...

After loading this page, your browser (and any caches in between, like Cloudflare) will cache the kwakk.css and kwakk.js files. If I make any changes to these files, a normal reload of the site won’t reload them: you’ll still be using the old versions of these files. Some browsers have a “hard reload” function (in Firefox, it’s Shift-Control-R), but some don’t (mostly mobile phone browsers). And if you have Cloudflare in front of the site, a hard reload won’t help: I’d have to empty the Cloudflare cache before you’d see any changes.

“*gasp*”, I’m guessing you’re saying ironically now. “How does the web even survive with those shortcomings!” Laying it on really thick, aren’t you.

Well, things fix themselves after a while — these resources are usually only cached for a couple of hours, so “eh, whatevs” is the approach taken by most people who make simple web pages.

People who make “real” sites don’t have these problems, because they have a build step somewhere in their deployment scripts: They probably have a JavaScript minifier, and they generate the CSS from whatever, and the HTML (if it’s even a static file) is generated from something else again, and the build script takes care to name each resource with a unique name, so that you get all the resources at the same time. This is why if you look at a major site, you’ll find that it loads JavaScript files called things like u8bp0J-3GNE.js.
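For the curious, here's a minimal sketch of how a build step might come up with those unique names. It assumes the common trick of putting a short hash of the file contents into the name; the function name `hashed_name`, the SHA-256 choice and the ten-character digest are all made up for illustration, not what any particular site or bundler does.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_name(name: str, content: bytes, digest_len: int = 10) -> str:
    """Return `name` with a short content hash inserted before the extension.

    Hypothetical helper: the same content always gives the same name, and
    any change to the content gives a different one.
    """
    path = PurePosixPath(name)
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    return f"{path.stem}.{digest}{path.suffix}"


if __name__ == "__main__":
    # Two versions of the same stylesheet end up with different names.
    print(hashed_name("kwakk.css", b"body { color: black; }"))
    print(hashed_name("kwakk.css", b"body { color: navy; }"))
```

Since the name only changes when the contents change, a cache can hold on to each file more or less forever, and the HTML just points at whatever the current name is.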

But! I still stumble upon web shops here and there where I go “well, that looks odd”, and then I hit Shift-Control-R and everything magically looks better. Admittedly, a lot less often than I used to in the olden days (a lot!), but it happens, and then you know that they don’t have a deployment script that takes care of these things.

And this stuff is confusing to people who are just tinkering with their own little sites, and it makes for a frustrating development situation. I’ve seen people actually do things like this manually:

<link rel='stylesheet' href='kwakk.css?ver=1034'>

That is, they bump a number in their HTML files when they push a new version. That’s a miserable thing to have to do, and it’s error-prone, and it makes your version control history sad.

So what most people end up with (except for the majority that just ignore the thing, thinking “well, I don’t update my hobby site that often anyway”) is some sort of deployment script that does this automatically. For instance, just a simple sed script that replaces ?ver= with a timestamp or something. And that’s OK, except for Cloudflare not wanting to cache URLs that have query parameters (at least not on the free plan), so you have to put the versioning string somewhere else. But if you put it somewhere else, you have to keep changing the file name, or… you have to do some URL rewriting on your web site.
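The sed-style step could look roughly like this in Python. This is only a sketch under a few assumptions: the version is an integer Unix timestamp, the marker is always spelled `?ver=` followed by digits, and `stamp_versions` is a name I made up for the example.

```python
import re
import time
from typing import Optional

# Matches the hand-maintained marker, e.g. "?ver=1034".
VER_RE = re.compile(r"\?ver=\d+")


def stamp_versions(html: str, version: Optional[int] = None) -> str:
    """Replace every ?ver=<digits> in `html` with ?ver=<version>.

    Falls back to the current Unix time when no version is given.
    """
    if version is None:
        version = int(time.time())
    return VER_RE.sub(f"?ver={version}", html)


if __name__ == "__main__":
    line = "<link rel='stylesheet' href='kwakk.css?ver=1034'>"
    print(stamp_versions(line))
```

The deployment script would run this over each HTML file right before uploading, so the number in version control never has to change.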

As an example, the kwakk.info site has:

<link href='/res/0/kwakk.css' rel='stylesheet'>

See that /res/0/? It’s rewritten by the deployment script to have a timestamp, and then I have an .htaccess file that says:

RewriteEngine On
# Versioned versions of JSON/JS/CSS.
RewriteRule res/([0-9]+)/([^.]+[.](json|js|css)) /$2 [L]

Now Cloudflare caches the files properly, and users get the proper versions of all the resource files.
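To make the round trip concrete, here's a small Python sketch of both halves: the deployment side that stamps /res/0/ with a version, and an approximation of what the RewriteRule then does to an incoming request path. The function names and the example timestamp are my own placeholders, and the second function only imitates the rule for illustration; it's not how Apache is implemented.

```python
import re

# Deployment side: any /res/<digits>/ prefix gets the new version.
RES_RE = re.compile(r"/res/\d+/")

# Same pattern as the RewriteRule, written as a Python regex.
RULE_RE = re.compile(r"res/([0-9]+)/([^.]+[.](json|js|css))")


def stamp_paths(html: str, version: int) -> str:
    """Rewrite /res/<digits>/ to /res/<version>/ throughout `html`."""
    return RES_RE.sub(f"/res/{version}/", html)


def rewrite(path: str) -> str:
    """Approximate the rule: drop the /res/<digits>/ part on a match."""
    # In .htaccess, the pattern is matched without the leading slash.
    match = RULE_RE.search(path.lstrip("/"))
    return "/" + match.group(2) if match else path


if __name__ == "__main__":
    html = "<link href='/res/0/kwakk.css' rel='stylesheet'>"
    print(stamp_paths(html, 1702749518))
    print(rewrite("/res/1702749518/kwakk.css"))
```

So every deploy produces a URL the caches have never seen before, while the file on disk keeps its plain old name.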

So why am I writing this blog post anyway? This caching behaviour has been a problem for almost three decades, and nobody has come up with anything better than to make each individual web tinkerer come up with their own hacked-up solution. Would it be possible to come up with something that would just, like, work?

Probably not. But here’s a thought I had while trying to fall asleep the other day — what if the web server and the browser could cooperate a bit more?

OK, back to the original HTML:

<html>
  <head>
    <meta charset='utf-8'>
    <link rel='stylesheet' type='text/css' href='kwakk.css'>
    <script type='text/javascript' src='kwakk.js'></script>
    <link rel="shortcut icon" type="image/png" href="favicon.png">
  </head>
  ...

Now, the web server could parse this HTML and check the timestamp of the included files. I mean, I say “could” in a kinda hand-wavy way — there are problems if the HTML is generated on-the-fly: The server would then have to do some caching before parsing, and… But if it’s an actual file, then certainly convenient, efficient parsers exist, so…
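As a rough idea of what that parsing could involve for a static file, here's a sketch using the HTML parser from Python's standard library. It assumes only stylesheet links and script tags matter and that the references are relative paths under one directory; `ResourceCollector`, `find_resources` and `resource_mtimes` are names invented for this example.

```python
from html.parser import HTMLParser
from pathlib import Path


class ResourceCollector(HTMLParser):
    """Collect stylesheet and script references from an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.resources: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if tag == "link" and "stylesheet" in rel and attr.get("href"):
            self.resources.append(attr["href"])
        elif tag == "script" and attr.get("src"):
            self.resources.append(attr["src"])


def find_resources(html: str) -> list[str]:
    """Return the referenced resources in document order."""
    collector = ResourceCollector()
    collector.feed(html)
    return collector.resources


def resource_mtimes(html: str, root: Path) -> dict[str, float]:
    """Map each relative resource that exists under `root` to its mtime."""
    mtimes = {}
    for ref in find_resources(html):
        if "://" in ref or ref.startswith("/"):
            continue  # this sketch only handles relative references
        path = root / ref
        if path.is_file():
            mtimes[ref] = path.stat().st_mtime
    return mtimes
```

For the example page, `find_resources` would pick up kwakk.css and kwakk.js and skip the favicon, since that link isn't a stylesheet.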

Anyway, my sleepy-time idea would then be to output the information as a series of HTTP response headers. Perhaps something like:

Resource-Last-Modified: kwakk.css;Sat, 16 Dec 2023 17:58:38 GMT
Resource-Last-Modified: kwakk.js;Sat, 16 Dec 2023 12:52:45 GMT

Or whatever. And then the browser (and Cloudflare) could use this information to decide whether to reload the resource or not? And for other resources that may be loaded later dynamically, you could declare them like:

<meta rel="resource" href="menu.svg">

And they’d get the same treatment…
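Just to show the header idea hangs together, here's a sketch of both ends in Python. Resource-Last-Modified is the made-up header from above, not a real one, and the function names and the "name;date" layout are simply what I typed in this post. The date handling uses the standard library's email.utils, which produces and parses the same date format that HTTP uses.

```python
from email.utils import formatdate, parsedate_to_datetime

HEADER = "Resource-Last-Modified"  # hypothetical header, not a standard


def format_header(name: str, mtime: float) -> str:
    """Server side: build one header line for a resource."""
    return f"{HEADER}: {name};{formatdate(mtime, usegmt=True)}"


def parse_header(line: str) -> tuple[str, float]:
    """Client side: split a header line into (name, Unix timestamp)."""
    _, _, value = line.partition(": ")
    name, _, stamp = value.partition(";")
    return name, parsedate_to_datetime(stamp).timestamp()


def needs_refetch(cached_at: float, header_line: str) -> bool:
    """True if the resource changed after the cached copy was stored."""
    _, modified = parse_header(header_line)
    return modified > cached_at


if __name__ == "__main__":
    line = format_header("kwakk.css", 1702749518)
    print(line)
    print(needs_refetch(cached_at=1702700000, header_line=line))
```

A browser (or Cloudflare) would only have to compare the time it stored its copy against the timestamp in the header, with no extra round trip per resource.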

I dunno. There’s probably a reason somebody hasn’t developed something like this during the previous 30 years. I’m just typing this here so I don’t start thinking about it again the next time I go to bed.

And like I said, it’s a problem only for those people who like to tinker with their personal sites: People who have a build step somewhere (writing their entire site in TypeScript with Tailwind, for instance) don’t have these issues that much. (Although I saw a newspaper site with an image of odd proportions the other week, due to somebody cropping an image after it was published or something… (Which reminds me of a thing I was thinking about back in the 90s: why don’t web servers include the sizes of included images automatically somehow? That could basically use the same mechanism as the one I sketched out here, but it would be more resource intensive. But just imagine: No more web pages where everything moves around after the browser has loaded the images! (Yes, yes, I know about height and width in image tags, but that’s apparently still too hard for about half the web sites out there to include…)))
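On the image size thing: for PNG files, at least, getting the dimensions is cheap, because they sit in the first 24 bytes of the file. Here's a minimal sketch, assuming a well-formed PNG whose first chunk is IHDR; `png_size` is a name I made up, and other image formats would each need their own handling.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple[int, int]:
    """Return (width, height) from the first 24 bytes of a PNG file."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    # Width and height are big-endian 32-bit integers right after "IHDR".
    width, height = struct.unpack(">II", data[16:24])
    return width, height
```

A server or deployment script could read just those bytes from each image and fill in the width and height attributes before the page goes out.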

All computer stuff consists of cobbled together things, and things still somehow kinda work.

But in annoying ways.

Anyway. There you go.

So here’s a GIF of a cat.
