I was thinking about writing some code for older posts to check for broken links and take screenshots of where the links point, as explained here. But before starting on that, I thought it might make sense to explore a bit first to see what the options are. So I’ll just be nattering on randomly here and using this post as a log of various experiments…
So: My plan is to point a script towards old blog posts, and then for each link, take a screenshot of the site it points to. If the site doesn’t exist any more, then substitute a Wayback Machine URL for the link instead.
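The “does it still exist” half of that plan is basically just an HTTP request per link. Here’s a minimal sketch using Python’s standard library (the helper names are my own invention, not from any existing script):

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError


def link_is_dead(status):
    """Treat 4xx/5xx responses as dead links, everything else as alive."""
    return status >= 400


def check_link(url, timeout=10):
    """Return the HTTP status for url, or None if the request failed outright."""
    try:
        # A HEAD request is enough to see whether the page still exists.
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code
    except (URLError, TimeoutError):
        return None
```

Some servers reject HEAD requests, so a real script would probably want to fall back to GET on a 405, but this shows the shape of it.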
By the way, did you know This One Weird Trick with Wayback Machine URLs? These look like https://web.archive.org/web/19991129021746/http://www13.google.com/. And that bit in the middle is indeed an ISO 8601-like timestamp. But the Wayback Machine will redirect you to the closest archived snapshot, so you can say https://web.archive.org/web/1/https://www.google.com to get the earliest version. Or for my use case, if I wrote the blog post on April 1st 2012, I can just say https://web.archive.org/web/20120401000000/https://www.google.com and it will give me https://web.archive.org/web/20120331235755/https://www.google.com/ instead. See? Easy peasy, and you don’t have to mess with an API to get at this stuff.
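Since the Wayback Machine does the closest-snapshot lookup itself, building a fallback URL is just string formatting. A sketch (the function name is mine):

```python
from datetime import datetime


def wayback_url(post_date, url):
    """Build a Wayback Machine URL anchored at the blog post's date.

    The Wayback Machine redirects this to the snapshot closest in time,
    so the timestamp doesn't have to match an actual capture.
    """
    return f"https://web.archive.org/web/{post_date:%Y%m%d%H%M%S}/{url}"
```

So `wayback_url(datetime(2012, 4, 1), "https://www.google.com")` gives the April 1st 2012 URL from the example above, and the archive takes it from there.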
I’ve been using the cutycapt program for the actual screenshots for years. It’s a bit long in the tooth, so I thought I’d check out what the results were when using a more “real” browser for the shots. cutycapt doesn’t use a fully featured browser (it uses Qt’s WebKit rendering engine, if I understand correctly), but things like shot-scraper do use a full Chrome behind the scenes.
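For reference, the two tools are invoked quite differently. Roughly like this from a script — flags as I understand them, so check `shot-scraper --help` and `cutycapt --help` before relying on these:

```python
import subprocess


def shot_scraper_cmd(url, outfile, browser="chromium"):
    """Command line for shot-scraper; -b picks the browser engine
    (chromium, firefox, etc.)."""
    return ["shot-scraper", url, "-o", outfile, "-b", browser]


def cutycapt_cmd(url, outfile):
    """Command line for cutycapt, which renders with Qt's engine."""
    return ["cutycapt", f"--url={url}", f"--out={outfile}"]


def take_shot(cmd):
    """Run a screenshot command; returns True if it exited cleanly."""
    return subprocess.run(cmd).returncode == 0
```

The `-b firefox` option is what makes the comparisons below possible — same tool, different engine.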
So let’s look at some random sites and compare:
shot-scraper with Chromium.
Bookshop.org uses the Cloudflare “protection” against nefarious stuff like taking a screenshot, if you’re using something that looks like a real browser…
… but not if you’re using shot-scraper with Firefox!
It allows the less real browser through — but as you can see, cutycapt renders in a quite different way.
comics.org blocks shot-scraper outright (with Chromium).
And again, with Firefox it allows it.
I’m gonna just throw out a wild thought I have no data for: Perhaps the Cloudflare blocks in this case happen because they have a lot of data saying there’s legitimate Firefox-like traffic from my IP address? Because that’s what I normally use? They probably have a “fingerprint” of every IP address in the world by this point, so they can easily decide whether the traffic looks normal or like “an attack”. Sounds stupid, but you know… I use Cloudflare myself, and the report always says “and this month we protected your site against 4232 attacks!” and it was probably some poor soul trying to use curl.
comics.org works fine with cutycapt.
Slate slaps up a cookie blocker when using shot-scraper…

… but not with cutycapt.
Salon slaps up a cookie blocker both with shot-scraper…
… as well as with cutycapt.
So… cutycapt actually performs better on many sites, but it’s not actually, you know, all that good, because I’m getting all the helpful EU cookie banners that are there to protect my preciouses cookieses informationeses.
Here’s an idea! I’m a genius! What if I just get a DigitalOcean VM and ssh over there to do the screenshots!
*time passes while I set up a San Francisco VM and stuff*
Well, bookshop.org gives identical results as when I do it from Norway. When I try to access Slate.com, both from cutycapt and shot-scraper, they just hang indefinitely.
And Salon?
Yeah, blocked by CloudFront.
I’m guessing lots of sites are blocking the DigitalOcean IP range or something? Yeah, I’m getting the same with curl, so it doesn’t really seem to depend on the User-Agent or browser fingerprinting, unlike the Cloudflare blocks from home.
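That curl check can be turned into a little probe: fetch the same URL with a plain User-Agent and with a browser-like one, and if both get rejected, the block is probably on the IP range rather than the client. A sketch (the UA string and helper names are just illustrative):

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError

# A plausible browser User-Agent string, for comparison against a bare one.
BROWSER_UA = ("Mozilla/5.0 (X11; Linux x86_64; rv:125.0) "
              "Gecko/20100101 Firefox/125.0")


def status_for(url, user_agent, timeout=10):
    """HTTP status when fetching url with a specific User-Agent."""
    request = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code


def probably_ip_blocked(plain_status, browser_status):
    """If both a curl-like and a browser-like User-Agent are rejected,
    the block is likely on the IP, not the client fingerprint."""
    return plain_status >= 400 and browser_status >= 400
```

Comparing `status_for(url, "curl/8.0")` against `status_for(url, BROWSER_UA)` is crude — real fingerprinting looks at TLS and header ordering too — but it matches the curl experiment above.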
So! Nothing worked better from San Francisco, and many things worked even worse. Presumably if I were in a residential IP range, things wouldn’t fail quite as badly, but I hadn’t expected things to be this bad from DigitalOcean.
I guess my best option here is to use shot-scraper with Firefox from home?
Thank you for coming to my TED talk.