Even more deduplication on kwakk.info

As I’ve nattered on about before, I started including text pages from comics in the comics ‘zine search engine, and this has some unique problems. I’m just dumping hundreds of thousands of comics into the grinder, picking out the text pages, and then OCR-ing them. But many comics have been scanned several times, and many include editorial pages that are more or less identical across several titles.

So I’m now running the pages through a text-based deduplicator… but it used a “quick” (FSVO) OCR, Tesseract, which behaves horribly on non-standard pages like the above, and there are now five copies of that page in the search engine. Which is just so annoying when trying to actually find stuff — you have to wade through duplicates.

Which means that I had to do something more, and that’s now in production: After doing the real OCR, with Surya, which is great even with 30 degree text and badly scanned pages on colourful backgrounds, I’m running an extra step and deduplicating again based on the output from that.

*phew*

The pipeline is basically:

1) First get rid of byte-identical CBX files. This gets rid of about 20% of the comics.

2) Identify text pages and delete all other pages. This gets rid of about 98% of pages.

c) Identify byte-identical pages and delete them. This gets rid of 15% of the text pages.

e) Run Tesseract OCR over all the pages and keep only the first “instance” of each duplicate page. This gets rid of a further 20% of the pages.

ii) Remove credits pages and the like — i.e., every page that has less than a hundred words. This gets rid of about 15% of the remaining pages.

II) Finally, do the new Surya-based OCR deduplication, which gets rid of about 5% more of the pages.

Of course, all of these numbers are trending upwards as more and more comics are added. And most of these steps may have false positives — it’s mostly probabilistic, but I’ve tried to be on the lookout for false positives. *crosses fingers*

There are also so many complete runs of classic comics in the index now that I think there’s probably a complete coverage of classic “hype” pages now, for those interested in that. For instance, if you want to do research on Wally Wood mentions on the Bullpen Bulletin pages, that should be possible.

Annoyingly enough, Surya isn’t able to parse that “On the Ledge” logo… Vertigo is so edgy…

Anyway. I think I’ve futzed around with this thing enough for a while.

Even more deduplication on kwakk.info

Like this:

Related Articles

Leave a ReplyCancel reply