Random Thoughts

Web scraping is getting harder all the time

And it’s understandable — things are getting worse and worse all the time, and anybody who is running a web site (that has interesting information) is under constant attack from badly programmed AI scrapers.

But where does that leave us li’l smol peeps who are just scrapin’ a li’l data for ourselves so that we don’t have to type as much?

I’ve got two small use cases that have been torpedoed by this arms race lately: I use the IMDb search to find the data on movies I’ve ripped from Blu-rays that I’ve bought. And I use the Goodreads search when I’m (manually) entering e-books that I’ve bought into the Emacs package for that. (Physical books have ISBNs printed in bar code form, so I can use various APIs for that and don’t need to resort to anything as tawdry as web scraping.)

These are just minor convenience things I’ve gotten used to over the years, so I could give them up… or I could go raging, raging against the dying of the open web.

Guess what I chose!

https://lars.ingebrigtsen.no/wp-content/uploads/2026/06/video-gRc920.mp4

The result is on Microsoft GitHub.

The idea is:

  1. First try to fetch the URL using the normal, fast method.
  2. If this fails, use Selenium headless. This involves spinning up a web browser and then dumping the resulting DOM.
  3. If this fails, spin up Selenium and a web browser window. This will allow the user to click around a bit, answering any challenges.

In 2) and 3), fetch-dom will save and reuse cookies, so that hopefully 3) doesn’t happen as much, and 1) and 2) will be successful more often.
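
To make the idea concrete, here’s a rough Python sketch of the same three-step fallback. This is not the actual fetch-dom code: the names `fetch_plain`, `fetch_with_browser` and `fetch_with_fallback` are made up for illustration, the “did it work?” check is left to a caller-supplied function (how fetch-dom decides that a fetch has failed isn’t described here), and the cookie handling is just one plausible way to do it with Selenium and a Chromium-compatible driver.

```python
import json
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, Optional


def fetch_plain(url: str, timeout: float = 10.0) -> str:
    """Step 1: the normal, fast method (no browser involved)."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_with_browser(url: str, headless: bool, cookie_file: Path) -> str:
    """Steps 2 and 3: load the page in a browser and dump the resulting DOM."""
    # Imported lazily so the plain path works without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import WebDriverException

    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        # Selenium only accepts cookies for the domain that is currently loaded.
        driver.get(url)
        if cookie_file.exists():
            for cookie in json.loads(cookie_file.read_text()):
                try:
                    driver.add_cookie(cookie)
                except WebDriverException:
                    pass  # Skip cookies the browser refuses.
            driver.get(url)
        if not headless:
            # Assumption: the user signals when they are done clicking.
            input("Answer any challenge in the browser, then press Enter: ")
        # Save cookies so later (headless) attempts have a better chance.
        cookie_file.write_text(json.dumps(driver.get_cookies()))
        return driver.page_source
    finally:
        driver.quit()


def fetch_with_fallback(
    url: str,
    fetchers: Iterable[Callable[[str], str]],
    looks_ok: Callable[[str], bool],
) -> Optional[str]:
    """Try each fetcher in order; return the first result that looks usable."""
    for fetch in fetchers:
        try:
            html = fetch(url)
        except Exception:
            continue  # Any error counts as "this method failed".
        if looks_ok(html):
            return html
    return None
```

You’d pass it something like `[fetch_plain, lambda u: fetch_with_browser(u, True, cookies), lambda u: fetch_with_browser(u, False, cookies)]`, where `cookies` is a `Path` to a cookie file of your choosing, plus a `looks_ok` function that knows what a challenge page looks like for the site in question.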

So this requires a working Python/Selenium installation, plus Chromium.

fetch-dom is synchronous by default, but is asynchronous if you give it the :callback keyword parameter.
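
The “synchronous unless you pass a callback” pattern looks roughly like this in Python. Again, this is only a sketch of the general shape, not fetch-dom’s implementation (which is Emacs Lisp); `fetch_maybe_async` and its `fetcher` argument are hypothetical names.

```python
import threading
from typing import Callable, Optional, Union


def fetch_maybe_async(
    url: str,
    fetcher: Callable[[str], str],
    callback: Optional[Callable[[str], None]] = None,
) -> Union[str, threading.Thread]:
    """Return the result directly, or hand it to `callback` from a thread."""
    if callback is None:
        # No callback: block until the fetch is done and return the result.
        return fetcher(url)

    def worker() -> None:
        callback(fetcher(url))

    # Callback given: start the fetch in the background and return at once.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```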

This seems to work for my use cases — things usually work automatically, but once in a while it pops up a browser window, and I click a bit, and then things work headlessly for a while again.

*sigh*

These are the days of your life…
