Searchable comics text pages?

After finding a collection of Marvel Bullpen Bulletins for the search engine for magazines about comics, I started to wonder whether it’d be useful (or fun) to include text pages from comics in general. I mean — editorials, letters pages, “hype pages”… There’s information there that’s not available anywhere else.

So… perhaps? Maybe? Peut-être?

The first question is, of course: Can I lay my hands on a huge collection of scanned comics in the first place? And the answer is: Of course, the pirates are still pirating out there.

So I hacked an old torrent client to be more handy for this project. Like, if the torrent seems to be dead (either not responding within a few minutes, or stalling for a long time), then just abandon it. I’m after quantity, not quality, after all.
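Roughly, the "is this torrent dead?" rule could look like the Python sketch below. It is not the actual client hack: the `TorrentSample` structure, the `should_abandon` name, and both timeouts (five minutes for "a few minutes", one hour for "a long time") are assumptions picked for illustration.

```python
from dataclasses import dataclass


@dataclass
class TorrentSample:
    """One observation of a torrent's state (hypothetical structure)."""
    timestamp: float  # seconds since the torrent was added
    bytes_done: int   # total bytes downloaded so far
    peers: int        # number of connected peers


def should_abandon(samples, connect_timeout=300.0, stall_timeout=3600.0):
    """Return True if the torrent looks dead.

    Rule 1: no peers and no data at all within connect_timeout seconds.
    Rule 2: no new bytes for stall_timeout seconds.
    Both thresholds are assumed values, not taken from any real client.
    """
    if not samples:
        return False
    now = samples[-1].timestamp

    # Rule 1: the torrent never responded.
    if all(s.peers == 0 and s.bytes_done == 0 for s in samples):
        return now >= connect_timeout

    # Rule 2: find the last time the byte count went up.
    last_progress = samples[0].timestamp
    for prev, cur in zip(samples, samples[1:]):
        if cur.bytes_done > prev.bytes_done:
            last_progress = cur.timestamp
    return now - last_progress >= stall_timeout
```

A client would call something like this once per polling interval and drop the torrent as soon as it returns True.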

And after downloading, just hang up without seeding, like an animal. (Yes, I know, it’s very antisocial towards the pirates, but I’m downloading things not to read, so er ok.)

After a few weeks, I had 1.7TB worth of comics CBR/CBZ files (about 28K files, which vaguely corresponds to 28K issues of random comics).

Now what.

Well, I unpacked them, and got 27K directories after removing duplicates. Then I had Claude write me a script to identify the text pages, and I deleted all the other pages. Then it wrote more scripts to deduplicate repeated pages (like company-wide editorials and the like). And then I ran the resulting directories through my OCR/indexing setup, and viola:

Now you can find out whether Hulk would beat Superman in a fight, for instance.
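As for how you might deduplicate repeated pages: the scripts themselves aren't shown here, but one common approach for scans is a perceptual "average hash" compared by Hamming distance. The sketch below is that swapped-in technique, not the actual scripts. It assumes each page has already been shrunk to a small grayscale grid by some image library, and the function names and the `max_distance` threshold are made up.

```python
def average_hash(pixels):
    """Hash a small grayscale grid (2D list of 0-255 values) into a tuple of bits.

    Each bit says whether that pixel is brighter than the grid's mean.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming(a, b):
    """Count the positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def deduplicate(pages, max_distance=5):
    """Keep the first page of every group of near-identical pages.

    pages: dict mapping a page name to its grayscale grid.
    max_distance: assumed threshold; hashes at most this far apart
    count as the same page.
    """
    kept = []  # list of (name, hash) for pages we decided to keep
    for name, pixels in pages.items():
        page_hash = average_hash(pixels)
        if any(hamming(page_hash, other) <= max_distance for _, other in kept):
            continue  # looks like a page we already have
        kept.append((name, page_hash))
    return [name for name, _ in kept]
```

Comparing every page against every kept page is quadratic, so for tens of thousands of issues you would probably want to bucket the hashes first.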

While this hasn’t been a lot of work (perhaps a couple days?), each step in this process has taken a lot of time because there’s a lot of data to process. And the comics that were downloaded leaned hard on recent comics, which isn’t really what’s interesting for research porpoises.

So I dunno… is this useful?

I haven’t included this data set in the “main” categories — you have to go to https://kwakk.info/pages/ explicitly to search these pages.

If I had a better way to search for torrents of older comics, that would make this more interesting, I think, but I haven’t really found anything like that.

Heh, while futzing around here, I came across this cover. Surely this has to be the most 90s comics cover ever? Behold the anatomy! Wince! Turn away!

Anyway, anyway… It’s a really random selection of comics, and there’s no quality control whatsoever, of course, but it seems like it might be vaguely useful. There’s no way to identify what issue each page is from, except by looking at the cover, so I’ve included covers, too.

Perhaps I’ll download some further terrorbytes of comics? Perhaps not? Time will tell, I guess.

Comics Daze

After being in the doldrums for half a year after the Diamond melt-down, comics are flooding the markets again, so I guess I have to read some more comics today. Darn!

Tujiko Noriko: PON

11:31: Purr Quarterly #1

But first an oldie.

I was scanning Comics International when I saw this article: “a UK Raw?” I know, I know, all headlines that end with a question mark have the answer “no”, but still. Sounds intriguing.

So I went on eBay, and now I’ve got a copy, so let’s get reading.

So it has all these people, and a mini-comic insert. Inserts are very Raw.

But… this is very un-Raw.

I mean, I guess you could see this mix of art features, comics and text and go “you know what else is all pretentious and stuff? Raw! It’s just like Raw!”

But it isn’t at all. It’s like if you were going “how do I create an anti-Raw?” This is pretty much it. I don’t want to use hate speech here, but it reminds me quite a bit of Juxtapoz.

All “psychological” dramas and “shocking” shit.

And then, randomly, they reprint the first issue of Metropol… but shrunk down like this, and in black and white, “as originally intended”.

It’s a pretty naff magazine, and the Law of Headlines that End with a Question Mark remains undefeated.

Fini Tribe: The Sheer Action of Fini Tribe (3)

11:54: All the Cameras in My Room by Michael DeForge (Drawn & Quarterly)

Hey! There’s a booklet in this one, too.

“Denied one less rotation around the sun”… “denied one less”… So the more he spins, the more rotations around the sun Earth gets? That’s a very nice demon.

Heh heh.

I get so bored reading plot recaps that whenever a review gets to the plot recap portion (which is usually two thirds of any review), I just skip past it.

The reason I’m mentioning this is that this book feels a lot like reading plot recaps, and I really had to force my eyes to stay reading where they were instead of doing the instinctual “skip past the recap” bit.

(I know, I know — many people prefer reading recaps to actually watching a movie or whatever.)

I guess this book isn’t bad or anything, but early DeForge was such a gripping read — all these strange themes and things that came out of nowhere. This book is a collection of short stories that are either 1) gags or 2) extremely straightforward metaphors or 3) both.

It’s just a bit disappointing.

12:41: True-Man The Maximortal #3 by Rick Veitch

The next volume is the final one, and that finishes Veitch’s entire King Hell Heroica thing.

So this is the third chapter of the third volume (Bratpack is chronologically later), and… that’s a bit what it feels like: Veitch is writing a bridge, filling in plot points, and there’s a lot of material to get through this issue. So unfortunately, this too begins to read a bit like a plot recap.

But while this isn’t the most gripping chapter ever, it’s still pretty spiffy. Lots of fun stuff.

And as usual, there’s an additional 50 pages of reprints of old stuff included. It’s a good package.

Joan as Police Woman: Real Life Evolution

13:26: Oracles by Olivia Sullivan (Avery Hill)

This is quite lovely — it’s got a mood going on, and the art is attractive (if tablet-ey). But I’ve got one problem with the book that is going to sound really stupid: I hate the typeface they’ve chosen, which made me go *gag* as soon as I opened the book. It’s an upper-case one (normal enough for comics but a bit odd for poetry), but it’s all in italics. Which means that everything reads like it has emphasis. Which is like listening to someone reciting poetry using the most insistent, poetic tone, which just gets on my tits.

But really, it’s a lovely book. Shame about the typeface.

13:42: Physical Education by Joana Mosi (Pow Pow Press)

This book is fantastic. It’s Portuguese and it’s about an almost-thirty-year-old woman who is both nostalgic and not — which sounds very typical and a bit clichéd…

… but the way it’s told is just fantastic. The way it slides between different eras and scenes is kinda magical. And it’s funny.

And interesting.

I’ve seen people discuss why movies/books/comics avoid depicting a large part of modern lives — being on the phone — and the reason is “well, it’s boring”. Mosi manages to incorporate that stuff in a fresh, intriguing way, too.

Anyway — great book.

And now I think I’m going to buy some groceries, because I need to eat. Be right back.

Is it a coincidence that the “so-called” “insect” “friendly” way they plant parks these days also means that they don’t have to spend any money on maintaining them? IT’S A CONSPIRACY! How do I create groups on Facebook? “Insect Realists”.

Richard Dawson & Circle: Henki

15:02: Kottivakkam by Silje Rønneberg Hogstad (Jippi forlag)

I got some tomatoes. Mm… tomatoes…

I’m guessing this is autobiography — it’s about an art student in the 90s who goes to Chennai as an au pair (to teach the kid in the household Norwegian).

It’s a lot of fun!

But I’m guessing that if this were to be published in India, it’d spark another one of those riots they seem to keep having, because the book is mostly about how weird, unjust and backwards they all are in India.

It’s a genre that has gone out of style for obvious reasons, so it’s a nostalgic read.

But very entertaining, and the storytelling is on point — while nothing major happens, it’s always interesting without devolving into a series of funny vignettes, which these kinds of things have a tendency to do unless in the hands of a capable author.

15:36: La morte aux mains vivantes by Lafcadio Hearn/Martes Bathori

This is a screen-printed fold-out extravaganza.

It’s a horrible story of horrible horror. Very well done.

Snapped Ankles: 21 Metres To Hebden Bridge

15:48: Night Chef by Mika Song (Random House)

Well, this is for children, but I’m actually finding it a bit hard to follow. That it’s hard to guess what these animals are supposed to be doesn’t help. This is a deformed chipmunk, I guess?

The story is cute and edumacational and stuff, and a lot happens, but…

Shearwater: The New World

16:04: Vad ska jag packa? by Tova Brodin (Lystring)

This is a fun book — it’s basically a handful of vignettes about interesting stuff that happened when the author was sixteen…

… and it’s engagingly told and really keeps your attention.

The artwork was done in acrylics on canvas — it must have taken forever to do. Very enjoyable.

Conducta: Soundboy Johnny EP

16:33: One Hundred Years of Reality by Kijitori Byu (Glacier Bay Books)

Hey, Glacier Bay… they used to publish so many books? But it’s been a while since I’ve seen anything from them, I think?

This is a collection of shorter pieces…

… and they’re enjoyable, and (like the artwork) they’re enjoyably vague.

Mix’Elle: Rage Days EP

But halfway through, I have to admit that I was getting kinda impatient with it all. It’s got one thing going throughout — a sort of half-dream, half-absurd thing that grows less interesting the more you read.

A shorter collection would have worked better, I think.

17:23: The End

OK, that’s enough comics for today.

Web scraping is getting harder all the time

And it’s understandable — things are getting worse and worse all the time, and anybody who is running a web site (that has interesting information) is under constant attack from badly programmed AI scrapers.

But where does that leave us li’l smol peeps who are just scrapin’ a li’l data for ourselves so that we don’t have to type as much?

I’ve got two small use cases that have been torpedoed by this arms race lately: I use the IMDb search to find the data on movies I’ve ripped from Blu-rays that I’ve bought. And I use the Goodreads search when I’m entering (manually) e-books that I’ve bought into the Emacs package for that. (Physical books have ISBNs printed in bar code form, so I can use various APIs for that and don’t need to resort to anything as tawdry as web scraping.)

These are just minor convenience things I’ve gotten used to over the years, so I could give them up… or I could go raging, raging against the dying of the open web.

Guess what I chose!

The result is on Microsoft Github.

The idea is:

  1. First try to fetch the URL using the normal, fast method.
  2. If this fails, use Selenium headless. This involves spinning up a web browser and then dumping the resulting DOM.
  3. If this fails, spin up Selenium and a web browser window. This will allow the user to click around a bit, answering any challenges.

In 2) and 3), fetch-dom will save and reuse cookies, so that hopefully 3) doesn’t happen as much, and 1) and 2) will be successful more often.

So this requires a Python/Selenium installation that works, and Chromium installed.
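The three-step idea above boils down to a fallback chain with a shared cookie jar. Here's a minimal Python sketch of that shape. It is not fetch-dom's actual code: `fetch_with_fallback`, `FetchFailed`, and the three stand-in fetchers are hypothetical, and the stand-ins only simulate a site that wants a clearance cookie, so no browser or network is involved.

```python
class FetchFailed(Exception):
    """Raised by a strategy when it could not get a usable page."""


def fetch_with_fallback(url, strategies, cookies):
    """Try each (name, fetcher) pair in order and return (name, html).

    Every fetcher is called as fetcher(url, cookies) and may update the
    shared cookies dict, so later calls can succeed at an earlier step.
    """
    errors = []
    for name, fetcher in strategies:
        try:
            return name, fetcher(url, cookies)
        except FetchFailed as err:
            errors.append(f"{name}: {err}")
    raise FetchFailed("all strategies failed: " + "; ".join(errors))


# Stand-ins for the three steps (assumptions, for illustration only).
def fast_fetch(url, cookies):
    if "clearance" not in cookies:
        raise FetchFailed("blocked")
    return "<html>fast</html>"


def headless_fetch(url, cookies):
    if "clearance" not in cookies:
        raise FetchFailed("challenge page")
    return "<html>headless</html>"


def visible_fetch(url, cookies):
    # Pretend the user clicked through the challenge in a real window.
    cookies["clearance"] = "ok"
    return "<html>visible</html>"


STRATEGIES = [
    ("fast", fast_fetch),
    ("headless", headless_fetch),
    ("visible", visible_fetch),
]
```

With an empty cookie dict, the first call falls all the way through to the visible step. A second call with the same dict succeeds at the fast step, which is the whole point of saving the cookies.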

fetch-dom is synchronous by default, but is asynchronous if you give it the :callback keyword parameter.
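The same "synchronous unless you pass a callback" convention can be sketched in Python. This is only an analogy, using a thread for the asynchronous path, and says nothing about how fetch-dom does it internally. The `fetch` function and its `fetcher` parameter are hypothetical.

```python
import threading


def fetch(url, fetcher, callback=None):
    """Fetch url with fetcher(url).

    Without a callback, block and return the result.
    With a callback, start a background thread, hand the result to
    callback when it is ready, and return the thread immediately.
    """
    if callback is None:
        return fetcher(url)

    def worker():
        callback(fetcher(url))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

The caller picks the mode per call: `fetch(url, fetcher)` blocks and returns the page, while `fetch(url, fetcher, callback=handle_page)` returns right away and calls `handle_page` later.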

This seems to work for my use cases — things usually work automatically, but once in a while it pops up a browser window, and I click a bit, and then things work headlessly for a while again.

*sigh*

These are the days of your life…