Random Thoughts

Searchable comics text pages?

After finding a collection of Marvel Bullpen Bulletins for the search engine for magazines about comics, I started to wonder whether it’d be useful (or fun) to include text pages from comics in general. I mean — editorials, letters pages, “hype pages”… There’s information there that’s not available anywhere else.

So… perhaps? Maybe? Peut-être?

The first question is, of course: Can I lay my hands on a huge collection of scanned comics in the first place? And the answer is: Of course, the pirates are still pirating out there.

So I hacked an old torrent client to be more handy for this project. Like, if the torrent seems to be dead (either not responding within a few minutes, or stalling for a long time), then just abandon it. I’m after quantity, not quality, after all.
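
The abandon-if-dead rule could be sketched roughly like this. This isn't the actual patch to the torrent client: `TorrentStatus`, `should_abandon` and both timeout values are made up for illustration, and a real client would have to feed in its own timestamps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TorrentStatus:
    """Hypothetical bookkeeping for one torrent (times in seconds)."""
    started_at: float                  # when the torrent was added
    last_progress_at: Optional[float]  # when data last arrived; None if never


def should_abandon(status: TorrentStatus, now: float,
                   first_response_timeout: float = 300.0,
                   stall_timeout: float = 3600.0) -> bool:
    """Return True if the torrent looks dead and should be dropped.

    Both timeouts are assumed values, not the ones actually used.
    """
    if status.last_progress_at is None:
        # Nothing has ever arrived: give up after a few minutes.
        return now - status.started_at > first_response_timeout
    # Data arrived at some point: give up only after a long stall.
    return now - status.last_progress_at > stall_timeout
```

Keeping the two timeouts separate means a torrent that never responds gets dropped quickly, while one that has already delivered data gets more patience.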

And after downloading, just hang up, like an animal. (Yes, I know, it’s very antisocial towards the pirates, but I’m downloading things not to read, so er ok.)

After a few weeks, I had 1.7TB worth of comics CBR/CBZ files (about 28K files, which vaguely corresponds to 28K issues of random comics).

Now what.

Well, I unpacked them, and got 27K directories after removing duplicates. Then I had Claude write me a script to identify the text pages, and I deleted all the other pages. Then it wrote more scripts to deduplicate repeated pages (like company-wide editorials and the like). And then I ran the resulting directories through my OCR/indexing setup, and voilà:
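
For the simplest version of the deduplication step, one option is to hash every page file and keep only the first copy of each hash. This is only a sketch and not the scripts mentioned above: `file_digest` and `find_duplicate_pages` are hypothetical names, and it only catches byte-identical files. Different scans of the same editorial would need something fuzzier, such as perceptual hashing.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicate_pages(root):
    """Return paths under root whose content was already seen earlier.

    Files are visited in sorted order, so the first copy (by path)
    is kept and later byte-identical copies are reported.
    """
    seen = {}        # digest -> first path with that content
    duplicates = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        digest = file_digest(path)
        if digest in seen:
            duplicates.append(path)
        else:
            seen[digest] = path
    return duplicates
```

The function only reports duplicates; deleting them is left to the caller, which makes a dry run easy.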

Now you can find out whether Hulk would beat Superman in a fight, for instance.

While this hasn’t been a lot of work (perhaps a couple days?), each step in this process has taken a lot of time because there’s a lot of data to process. And the comics that were downloaded leaned hard on recent comics, which isn’t really what’s interesting for research porpoises.

So I dunno… is this useful?

I haven’t included this data set in the “main” categories — you have to go to https://kwakk.info/pages/ explicitly to search these pages.

If I had a better way to search for torrents of older comics, that would make this more interesting, I think, but I haven’t really found anything like that.

Heh, while futzing around here, I came across this cover. Surely this has to be the most 90s comics cover ever? Behold the anatomy! Wince! Turn away!

Anyway, anyway… It’s a really random selection of comics, and there’s no quality control whatsoever, of course, but it seems like it might be vaguely useful. There’s no way to identify what issue each page is from, except by looking at the cover, so I’ve included covers, too.

Perhaps I’ll download some further terrorbytes of comics? Perhaps not? Time will tell, I guess.
