Comics Fanzines OCR Reloaded Redone

The GPU-assisted OCR run for kwakk.info has now finished — it took about three weeks to chew through 10,000 fanzine/magazine issues. So what are the results?

Well, I found that sections with no text would often end up with hallucinated kanji. (Because LLM.) It doesn’t really matter much — you wouldn’t be searching for these things anyway. But if you want to quote a text, it’s annoying if there’s a bunch of kanji in the middle of whatever you’re quoting, so I wondered whether there was a way to filter that junk out.

And there is!

Because surya, the LLM OCR software, assigns a confidence level to every line (and even every character). And “actual text” turns out to have a high confidence — the sentence above has a 99% confidence…

… while surya’s 14% sure about this kanji. (I think that’s way more sure than it should be about something it’s so wrong about, but that’s just me.)

So I just filtered out everything with a confidence level below 60%, and that fixed the vast, vast majority of these hallucinations.
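For the curious, a filter like that can be sketched in a few lines of Python. This is not the code actually running on kwakk.info. It assumes each recognized line arrives as a dict with `text` and `confidence` keys, with the confidence on a 0 to 1 scale, and the function names are made up for illustration.

```python
# Minimal sketch: drop OCR lines whose confidence is below a cutoff.
# Assumption: each line is a dict with "text" and "confidence" (0.0 to 1.0).

MIN_CONFIDENCE = 0.6  # the 60% cutoff mentioned above


def keep_confident_lines(lines, min_confidence=MIN_CONFIDENCE):
    """Return only the lines at or above the confidence cutoff."""
    return [
        line
        for line in lines
        # A line without a confidence value is treated as untrustworthy.
        if line.get("confidence", 0.0) >= min_confidence
    ]


def page_text(lines, min_confidence=MIN_CONFIDENCE):
    """Join the surviving lines into plain text, one OCR line per row."""
    kept = keep_confident_lines(lines, min_confidence)
    return "\n".join(line["text"] for line in kept)


if __name__ == "__main__":
    sample = [
        {"text": "A perfectly normal sentence.", "confidence": 0.99},
        {"text": "漢", "confidence": 0.14},
    ]
    print(page_text(sample))
```

The cutoff is a blunt instrument: a genuinely hard-to-read line can fall below it too, so the threshold is worth tuning against a few known pages.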

But otherwise… how’s the quality now compared to what it was before? Let’s look at a couple examples. The first is a normal, well-scanned text page from RBCC (Rocket’s Blast & Comicollector):

Should be easy to OCR, right? Here’s the results from the traditional OCR:

And indeed, it looks pretty good. It’s perhaps 90% correct? Although some lines are in the wrong order. But "EC" has become NEC", for instance, and their "New Trend" comics has become their "New Trend W in comics.

So if you want to quote text from this page, you’ll have to spend some time copy-editing it.

Here’s the results from surya:

First of all, the lines are in correct order. But there’s also a lot fewer word recognition errors — I haven’t counted, but I think this is more than 99% correct? So if you want to quote an article from kwakk.info now, you won’t have to spend the rest of the afternoon doing copy edits.

But what about harder stuff? Here’s another RBCC example, but it’s a much worse scan, and it’s a “catalogue” page, which are just harder to parse in general:

Here’s the results from the traditional OCR:

Uhm… er… well, that resembles line noise — you can’t use that; it would be easier to just transcribe that page by hand than trying to “fix” this by editing.

Here’s the surya output:

It’s pretty good! There’s still errors like MAGAZIKES and Jomen, but it’s mostly correct, as far as I can tell.

So there you go. Quoting text should now require a lot less editing afterwards, and, of course, the search itself should be more precise (and find more instances of whatever you’re searching for than previously).

The GPU has kept my home office nice and warm while it’s been OCR-ing these past weeks, which is also a plus.

And dealing with 4GB JSON files (the surya output is very verbose) has been fun — can’t really parse them normally, but whatevs — Emacs is pretty good at dealing with big files, so I just wrote something to plop them into a buffer and parse them gingerly.
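The same "load it into a buffer, then parse a bit at a time" idea can be sketched in Python. This is an illustration, not the Emacs code mentioned above. It assumes the top level of the file is one big JSON object, which is a guess about the layout, and `iter_top_level_items` is a hypothetical name. It uses the standard library's `json.JSONDecoder.raw_decode` to decode one top-level value at a time, so the fully parsed tree never has to exist all at once.

```python
import json

# Sketch: walk a huge top-level JSON object one entry at a time.
# Assumption: the document looks like {"key": value, "key2": value2, ...}.

WHITESPACE = " \t\r\n"


def iter_top_level_items(text):
    """Yield (key, value) pairs of a top-level JSON object, one at a time."""
    decoder = json.JSONDecoder()
    n = len(text)

    def skip_ws(i):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while i < n and text[i] in WHITESPACE:
            i += 1
        return i

    i = skip_ws(0)
    if i >= n or text[i] != "{":
        raise ValueError("expected a top-level JSON object")
    i = skip_ws(i + 1)
    if i < n and text[i] == "}":
        return  # empty object

    while True:
        key, i = decoder.raw_decode(text, i)  # decode the key string
        if not isinstance(key, str):
            raise ValueError("object keys must be strings")
        i = skip_ws(i)
        if i >= n or text[i] != ":":
            raise ValueError("expected ':' after key")
        i = skip_ws(i + 1)
        value, i = decoder.raw_decode(text, i)  # decode just this one value
        yield key, value
        i = skip_ws(i)
        if i < n and text[i] == ",":
            i = skip_ws(i + 1)
            continue
        if i < n and text[i] == "}":
            return
        raise ValueError("expected ',' or '}' after value")


if __name__ == "__main__":
    demo = '{"issue-1": {"pages": 2}, "issue-2": {"pages": 5}}'
    for name, data in iter_top_level_items(demo):
        print(name, data["pages"])
```

The raw text still has to fit in memory, just like in an Emacs buffer. What this avoids is holding the parsed structure for every issue at the same time.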

If you see any major oddities, let me know.

Book Club 2025: Serier för vuxna by Robert Aman

I’ve been dipping into this book now and then over the past month or so.

It’s a book about a Swedish 80s comics publisher. The publisher was called Epix, and published a bewildering array of anthologies with names like Pox, Maxx and Tung Metall. They mostly reprinted stuff from 70s and 80s French/Italian/Spanish anthologies, and also American indie/underground stuff.

So kinda like if… somebody like Fantagraphics published mostly anthologies like Weirdo and Heavy Metal? But a lot of them. Something like that.

They were controversial at the time, and were finally taken to court for publishing too many sexually violent things. Among the evidence was a strip by Dori Seda. So… normal 80s stuff.

But here’s the twist: They were acquitted, but still went under, because the distribution monopoly at the time did one of those wonderful late-yuppie post-deregulation things: It split into two separate companies, where one would distribute all indie things, and the other company would distribute all major publisher things (and the latter company would also be owned by those companies).

So what do you think happened? Yes, of course — the small press distributor went bankrupt, and the other distributor didn’t have anything to do with them, so Epix (among many other smaller magazine publishers) had to stop publishing anyway.

That’s a very creative solution to getting rid of competition. If the former distributor monopoly had said “no, we’re dropping all these smaller companies”, that would have been an outrage. Instead, by using these corporate actions, they could do exactly that, and nobody could object. Much. (Although Epix tried to sue them.)

Shades of what would later happen in the US with Diamond/Heroes World.

Anyway, this book consists mostly of interview snippets edited together. I think they call these things “oral histories”? I’m not fond of the genre, but it works quite well here. The main problem with this approach is usually that a person will say something interesting, but then there’s no followup because they have nobody else talking about the same thing.

This book does not have that problem: Even if it’s formatted as an oral history, and there’s no trace of the interviewer(s) here, whenever somebody says something interesting, they ask other people about what’s been said and get a response.

So it makes for an entertaining read, especially since the owner and boss of Epix got into fights with absolutely everybody. It sounds like every day at the crowded office was a shouting match.

Other things happened — his ex-wife took (he says “kidnapped”, which may be formally correct) their son off to Trinidad, and he used the cover of an issue to announce a reward for information about his whereabouts, for instance.

They were also prosecuted for distributing violent porn (Dori Seda and… Neil Gaiman! (a retelling of a Bible story)), and while acquitted, this led to a 30% reduction in sales.

And Gary Groth claimed that Epix never paid Fantagraphics for using their material. Epix claims that Kim Thompson found out that Groth had just put the money into an account he’d forgotten about, but the situation led to threats from Mary Fleener, anyway. Peter Bagge comments laconically that it wouldn’t be the first time that Fantagraphics put foreign royalties into some “wrong” bank account.

I’m not Swedish, so I’ve barely read any of what Epix published. I’ve been trying to buy issues of Pox and Epix, but it’s not easy.

Looking over what I have managed to find, these seem like really good magazines. It looks like they basically have everything that’s good from 70s and 80s European and American underground/indie/art/etc comics. Their magazines were monthly, and usually around 100 pages each, so they just published a lot of stuff.

And the reproduction seems nice, and the hand lettering is fantastic.

I guess I should just start reading these comics, even if I don’t have a complete set. I’m kinda raring to go after reading this book about Epix. Hm!

Serier för vuxna – Epix och den svenska serierevolutionen (2024) by Robert Aman (4.16 on Goodreads)

Book Club 2025: Alternatives to Sex by Stephen McCauley

I bought this around 2008, but then never read it. (No particular reason — I buy, like, er, 10% more books than I have time to read, so some inevitably have to remain on the shelves.)

It’s pretty good? I like the formless quality it has — so many books I read are very plot heavy, and have a strict focus on getting somewhere, but this one feels more like a steady state object. It’s relaxing. I guess that’s not unusual for comedic books — more interested in character than structure…

But I think it’s a bit too long. If it had been 200 pages, it would have been a cute little book, but instead it’s 280 and it’s not. And while at no point in actually reading a page of this I said to myself “bored now”, it just felt like slightly too much. I can totally understand why McCauley would keep typing at this — he’d set up some characters you want to spend time with, but c’mon.

Hm… I see that the book has a low Goodreads score — only 3.41, which is way lower than I would have guessed.

Harsh! But nothing really stands out, except:

Few people really loved it — most thought it was pretty middling.

Alternatives to Sex (2006) by Stephen McCauley (buy new, buy used, 3.41 on Goodreads)

Book Club 2025: Det som aldri skjer by Anne Holt

Hey, another mystery. Yes, it’s been that kind of week.

This book is almost a parody of these kinds of books. It’s a mystery where the two protagonists both have deep trauma backgrounds, and the murders are over-the-top gruesome. But worst of all is that apparently Holt must have read How To Write A Damn Good Novel before writing this, because she follows the main tenet of that how-to manual faithfully: Every scene has to have both a primary and a secondary conflict.

So a typical scene is that the profiler wakes up at night and pokes her detective husband and tries to say something Earth-shatteringly insightful, and he’ll bellow at her I’M TRYING TO SLEEP HERE, before trying to take a sip from a glass of water and then tipping it into the bed and then roaring out of the room.

Almost. Every. Damn. Scene.

So since everything is interrupted all the time, the book goes on for an unnecessary 440 pages.

On the plus side, the book is about Holt killing everybody she finds annoying, so we get a TV show host killed, a right-wing politician, a book critic who dares to dislike mysteries, and finally a sports person. It’s fun to read somebody who’s obviously enjoying their work.

It’s OK, I guess?

Det som aldri skjer (2004) by Anne Holt (buy used, 3.56 on Goodreads)

Book Club 2025: Three at Wolfe’s Door by Rex Stout

I’ve been reading this book on my phone while waiting for things over the past month or so. So not very concentrated reading but…

… this book really isn’t very good, is it? It’s a collection of three short stories, and while I haven’t read many books by Rex Stout, the condensed form seems to bring out all of Stout’s most annoying tics? By the end of the third story, I found myself skipping past the unfunny hard-boiled repartee.

I think I’ll give Stout one more go, but with a novel instead. From the 40s, perhaps?

Three at Wolfe’s Door (1960) by Rex Stout (buy used, 4.1 on Goodreads)