Is it possible to take screenshots of web sites these days?

I was thinking about writing some code for older posts to check for broken links and do screenshots of where the links point to, as explained here. But before starting that, I thought it might make sense to explore a bit first to see what the options are. So I’ll just be nattering on randomly here and using this post as a log of various experiments…

So: My plan is to point a script towards old blog posts, and then for each link, do a screenshot of the web site. If it doesn’t exist any more, then substitute a Wayback Machine URL for the link instead.

By the way, did you know This One Weird Trick with Wayback Machine URLs? These look like https://web.archive.org/web/19991129021746/http://www13.google.com/. And that bit in the middle is indeed an ISO8601-like time stamp. But the Wayback Machine will give you the closest valid URL, so you can say https://web.archive.org/web/1/https://www.google.com to get the earliest version. Or for my use case, if I wrote the blog post on April 1st 2012, I can just say https://web.archive.org/web/20120401000000/https://www.google.com and it will give me https://web.archive.org/web/20120331235755/https://www.google.com/ instead. See? Easy peasy, and you don’t have to mess with an API to get at this stuff.
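Building such a URL from the date of a blog post is just string formatting. Here's a minimal sketch; the function name `wayback_url` is my own invention, and it relies on nothing but the URL pattern shown above (the redirect to the closest snapshot is done by the Wayback Machine, not by this code).

```python
from datetime import datetime
from typing import Optional


def wayback_url(original_url: str, when: Optional[datetime] = None) -> str:
    """Build a Wayback Machine URL asking for the snapshot closest to `when`.

    Without a date, ask for timestamp "1", which gives the earliest version.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S") if when else "1"
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


# A post written on April 1st 2012:
print(wayback_url("https://www.google.com", datetime(2012, 4, 1)))
# prints https://web.archive.org/web/20120401000000/https://www.google.com
```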

I’ve been using the cutycapt program for the actual screenshots for years. It’s a bit long in the tooth, so I thought I’d check out what the results were when using a more “real” browser for the shots. cutycapt doesn’t use a fully featured browser (it uses Qt HTML rendering, if I understand correctly), but tools like shot-scraper do use a full Chrome behind the scenes.
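For scripting purposes, both tools are command line programs, so driving them from code is mostly a matter of building an argument list. The sketch below assumes that `shot-scraper` and `cutycapt` are installed under those names and uses only the basic options (`-o` and `--browser` for shot-scraper, `--url=` and `--out=` for cutycapt). The function names and the two minute timeout are made up for illustration.

```python
import subprocess


def shot_scraper_command(url: str, output: str, browser: str = "chromium") -> list[str]:
    """Argument list for a shot-scraper screenshot with the given browser."""
    return ["shot-scraper", url, "-o", output, "--browser", browser]


def cutycapt_command(url: str, output: str) -> list[str]:
    """Argument list for a cutycapt screenshot."""
    return ["cutycapt", f"--url={url}", f"--out={output}"]


def take_screenshot(command: list[str], timeout: float = 120.0) -> bool:
    """Run a screenshot command; False if it is missing, fails or hangs."""
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


# Same page, three ways (nothing is executed until take_screenshot is called).
url = "https://example.com/"
commands = [
    shot_scraper_command(url, "chromium.png"),
    shot_scraper_command(url, "firefox.png", browser="firefox"),
    cutycapt_command(url, "cutycapt.png"),
]
for command in commands:
    print(" ".join(command))
```

The timeout is there so that a site that never finishes loading can't stall a whole batch of screenshots.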

So let’s look at some random sites and compare:

shot-scraper with Chromium.

Bookshop.org uses the Cloudflare “protection” against nefarious stuff like doing a screenshot if you’re using something that looks like a real browser…

… but not if you’re using shot-scraper with Firefox!

It allows the less real browser through — but as you can see, cutycapt renders in a quite different way.

comics.org blocks shot-scraper outright (with Chromium).

And again, with Firefox it allows it.

I’m gonna just throw out a wild thought I have no data for: Perhaps Cloudflare blocks Chromium but not Firefox in this case because they have a lot of data saying there’s legitimate Firefox-like traffic from my IP address? Because that’s what I normally use? They probably have a “fingerprint” of every IP address in the world by this point, so they can easily decide whether the traffic looks normal or is “an attack”. Sounds stupid, but you know… I use Cloudflare myself, and the report always says “and this month we protected your site against 4232 attacks!” and it was probably some poor soul trying to use curl.

comics.org works fine with cutycapt.

Slate slaps up a cookie blocker using shot-scraper

… but not with cutycapt.

Salon slaps up a cookie blocker both with shot-scraper

… as well as with cutycapt.

So… cutycapt actually performs better on many sites, but it’s not actually, you know, all that good because I’m getting all the helpful EU cookie banners that are there to protect my preciouses cookieses informationeses.

Here’s an idea! I’m a genius! What if I just get a Digitalocean VM and ssh over there to do the screenshots!

*time passes while I’m setting up a San Francisco VM and stuff*

Well, bookshop.org gives identical results to when I do it from Norway. When I try to access Slate.com, both from cutycapt and shot-scraper, they just hang indefinitely.

And Salon?

Yeah, blocked by Cloudfront.

I’m guessing lots of sites are blocking the Digitalocean IP range or something? Yeah, I’m getting the same with curl, so it doesn’t really seem to depend on the User-Agent or browser fingerprinting, unlike the Cloudflare blocks from home.

So! Nothing worked better from San Francisco, and many things worked even worse. Presumably if I were in a residential IP range, things wouldn’t fail quite as badly, but I hadn’t expected things to be this bad from Digitalocean.

I guess my best option here is to use shot-scraper with Firefox from home?

Thank you for coming to my Ted talk.

Book Club 2025: Tales from the Folly by Ben Aaronovitch

The worst mistake an author can make when writing short stories in between a series of novels is to try to “fill in” stuff from the backstory. That is, when creating a universe, good authors know a lot about their world that they never actually write (extensively) about. So for instance, if one of the characters has a classic car, the author may know that the original owner was, say, a pop star in the 60s, but would never mention it in the novel. It’s just background information.

When fishing about for material to use for short stories, the temptation is then to write a short story about that car and that pop star, and that’s always really, really tedious to read.

Fortunately, Aaronovitch avoids making that error. Mostly. Instead, most of these stories are just really entertaining, almost stand-alone pieces, and you can enjoy them without having read the novels, really. (Although some would be more puzzling than others.)

So… this is good fun, if a bit slight.

Tales from the Folly (2020) by Ben Aaronovitch (buy new, buy used, 4.03 on Goodreads)

Book Club 2025: The Unnamable by Samuel Beckett

I bought this (at a sale) back in 2009 along with either Malone Dies or Molloy. I read the other book at the time, but not this one.

I’m really culturemaxxing here — this edition was translated by Norway’s foremost poet, Jan Erik Vold, in the late 60s. And it flows really well; I wasn’t tempted for a second to seek out the English version of this. (Which was translated by Beckett himself from the original French.)

About 20 pages in, we drop paragraph markers, and the pages become Wall Of Text. But this isn’t hard hard to read — we’re not talking Lucy Church Amiably by Gertrude Stein here. There’s even a sort of narrative going on for the first half of the book.

My strategy for reading “difficult” books is to say to myself that I’m reading 20 pages in one sitting, no matter what. No breaks; no diversions; no “just look something up”, because I know that it can be hard to get back to a book like this if I’ve found something else to amuse me. (While I’m reading most novels, I don’t care — I read in a scatter-brained way when I’m reading prose, but when I’m reading comics or Difficult Books, I’m laser focused. For a period of time.)

So it took me more than a week to read this, 20 pages per day, more or less.

Towards the end, it becomes more dense — we get sentences going on for pages at a time.

It’s a pretty spiffy book. I’m guessing many people quote the final bit of the book — “if it opens, it will be I, it will be the silence, where I am, I don’t know, I’ll never know, in the silence you don’t know, you must go on, I can’t go on, I’ll go on.” — because they didn’t read the rest, but they should. It’s all good.

L’Innommable (1953) by Samuel Beckett (buy used, 4.0 on Goodreads)