Text Pages From 100,000 Comics

larsmagne23

3 weeks ago

I did a test run of adding text pages from comics to the comics magazine search engine a couple weeks ago, but I wasn’t quite sure whether it was a good idea. Or whether my pipeline was good enough.

And after getting some feedback, it seems like the answers are “yes” and “not quite”.

So I’ve fixed some problems: The detection of duplicate pages was really bad, but it’s now better. There are still duplicate pages in there, but getting it totally right just takes a lot of processing. The current algo uses Tesseract to do a “quick” OCR, and then it compares the texts to find the distances, and then excludes the one that are too similar.

The problem here is that Tesseract, while being a good OCR as far as traditional OCRs go, it does get a goodly percentage of the words wrong, so if the scans aren’t pristine, you’ll get “different words” even if it’s the same page that’s been scanned twice.

For the search engine on kwakk.info, I’m using the Surya OCR engine, which is much, much more precise, but is also much, much slower. I could then feed the data from that back into the duplicate detection thing, but hiudhfiudsahgiushguishfiudsahf. Not worth it.

Other improvements: I’m removing pages that have less than a hundred words, because those are usually just credits pages and the like.

One thing I don’t have a fix for: I want editorials and letters pages and perhaps text-filled ads, but I don’t really want text pages that are part of the comics — they happen sometimes because some comics writers are smartypants (like the classic Howard the Duck issues by Steve Gerber). But the only way to weed those out would be to use an LLM classifier, and that just seems like it would be too slow. And too much work to set up.

So the upshot here is that I think the data set now is pretty usable, so I’m including it in the magazine and fanzine and “everything” searches.

Oh, and while the vast majority of the comics included so far seem to be mainstream books, it looks like a collection of Cerebus comics has snuck in, at least.

You’re welcome I’m sure.

Oh, and about that number… 100K comics. I wondered — is that a lot or is it a little? How many comic books have been published in the US, anyway? (The data set is 90% American comics.) After googling a bit, it seems that nobody knows, of course, but a ballpark number may be between 1M and 2M books… which means that 100K is much more than I thought! It’s between 5 and 10%! Huh.

Possibly.

Anybody have a better ballpark number?