I forgot to switch on advanced searches on kwakk.info

The other day, I was wondering whether there was a way to improve the search quality on kwakk.info, the comics fanzine search engine. And it turns out that the search engine I’m using in the background, Xapian, has a lot of operators like NEAR and ADJ that sound interesting, but they didn’t work on kwakk.info.

And it turns out that I was just calling the engine with a set of options that inadvertently switched this off. I’ve now fixed that, so you can drill further into the data.

For instance, take a search for crumb herriman. The first result for this is above — but as you can see, it doesn’t really deal with them in relation to each other; they’re just mentioned on the same page.

With crumb NEAR herriman, you can ensure that the words are close to each other, which will give you more relevant results.

There’s also crumb ADJ herriman, which means the same as NEAR, but crumb has to come before herriman.

And there’s other things in there — you can group expressions and all sorts of things, and you can say ADJ/4 to say that the words have to be within four words of each other, etc. Nerd out.
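
To make the difference between the two operators concrete, here's a toy sketch in plain Python. It is not how Xapian implements them: the tokenizer, the function names and the exact "within N words" boundary rule are all assumptions made for illustration, and the default window of 10 is likewise just an assumed default.

```python
import re

def positions(text, word):
    """Return the 0-based word positions where `word` occurs in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return [i for i, tok in enumerate(tokens) if tok == word.lower()]

def near(text, a, b, window=10):
    """True if `a` and `b` occur within `window` words of each other, in any order."""
    return any(i != j and abs(i - j) <= window
               for i in positions(text, a) for j in positions(text, b))

def adj(text, a, b, window=10):
    """Like near(), but `a` has to come before `b`."""
    return any(0 < j - i <= window
               for i in positions(text, a) for j in positions(text, b))

page = "Herriman was a huge influence on Crumb and many others"
print(near(page, "crumb", "herriman"))            # True: six words apart
print(adj(page, "crumb", "herriman"))             # False: herriman comes first here
print(adj(page, "herriman", "crumb", window=4))   # False: too far apart for a window of 4
```

The last call is the toy equivalent of herriman ADJ/4 crumb: same words, same order, but a tighter window.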

I’ve also added a short help text to the site that you can reach from the menu, so hopefully that will… help.

Hm… perhaps NEAR should be the default boolean operator instead of AND? Hm… no, looking over the logs, that doesn’t seem to work well. For instance, people do searches like Arnold Drake interview, and in that case, you often have interview in the heading and stuff, but not necessarily the name.

Ideally, what we’d want is an AND search, but ranked by nearness? Xapian doesn’t allow that… but I guess it could be done by running the search twice — one with NEAR and one with AND, and then smushing the results together in a good way.
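
One simple way of "smushing" could look something like this. It's only a sketch under assumptions: the function name and the page IDs are made up, both searches are assumed to return ranked lists of IDs, and "NEAR hits first, then whatever else AND found" is just one possible merge rule.

```python
def merge_by_nearness(near_hits, and_hits):
    """Put the NEAR hits first, then the remaining AND hits, keeping each list's order."""
    seen = set()
    merged = []
    for doc_id in list(near_hits) + list(and_hits):
        if doc_id not in seen:  # skip pages already placed via the NEAR list
            seen.add(doc_id)
            merged.append(doc_id)
    return merged

# Hypothetical ranked results from the two searches.
near_hits = ["p7", "p2"]
and_hits = ["p2", "p9", "p7", "p4"]
print(merge_by_nearness(near_hits, and_hits))  # ['p7', 'p2', 'p9', 'p4']
```

That way nothing the AND search found gets lost, but pages where the words are actually close together float to the top.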

Or… a checkbox to toggle between NEAR and AND.

Well, we’ll see.

Book Club 2025: Meat Is Murder by Joe Pernice

Back in 2009, I sorta finally became aware of all of these 33⅓ books, and I went “ooh, I want that one, and that one, and that one”, and before you knew it, I had a whole stack of them. And then I started reading.

Some of them are really good. Drew Daniel’s book on Throbbing Gristle’s Twenty Jazz Funk Greats is brilliant — it’s not just about that single album, but also encompasses the whole transgressive art thing. And Jonathan Lethem’s book on Talking Heads’ Fear Of Music explained so much, while opening up lots of avenues of interpretation at the same time.

And then there were the rest. Many of them seemed to be written by neurotic nerds with an overwhelming need to pin things down. “No! This song is about one thing only! It’s about that time the vocalist fainted in the bathroom!” Which may or may not be true, but it doesn’t make for interesting reading, and makes the album you’re reading about seem less interesting than you thought it was to begin with.

So I rapidly lost my enthusiasm for these books, and I haven’t bought any since. But I’ve still got more than half a dozen left unread, so why not give one of them a go?

OK, I’m putting the album on the stereo, and here we go…

And this book turns out to be fiction, and not about the album by The Smiths at all. Well, that’s OK, but it’s not actually very good.

The protagonist listens to Smiths albums, I guess, and perhaps there’s more of a connection later in the book. But even if this is a very short book, I found the prose so uninspiring that I rapidly found myself growing impatient, and after 25 pages I thought “well, I don’t care” and so I ditched it.

What does Goodreads say?

Heh heh.

Meat Is Murder (2003) by Joe Pernice (buy used, 3.41 on Goodreads)

Comics Fanzines OCR Reloaded Redone

The GPU-assisted OCR run for kwakk.info has now finished — it took about three weeks to chew through 10,000 fanzine/magazine issues. So what are the results?

Well, I found that sections with no text would often end up with hallucinated kanji. (Because LLM.) It doesn’t really matter much — you wouldn’t be searching for these things anyway. But if you want to quote a text, it’s annoying if there’s a bunch of kanji in the middle of whatever you’re quoting, so I wondered whether there was a way to filter that junk out.

And there is!

Because surya, the LLM OCR software, assigns a confidence level to every line (and even every character). And “actual text” turns out to have a high confidence — the sentence above has a 99% confidence…

… while surya’s 14% sure about this kanji. (I think that’s way more sure than it should be about something it’s so wrong about, but that’s just me.)

So I just filtered out everything with a confidence level below 60%, and that fixed the vast, vast majority of these hallucinations.
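
The filtering itself is about as simple as it sounds. Here's a minimal sketch, assuming each OCR line has been boiled down to a dict with a "text" and a "confidence" key (a simplified stand-in, not surya's full output format), with confidence on a 0 to 1 scale:

```python
def drop_low_confidence(lines, threshold=0.6):
    """Keep only the OCR lines whose confidence is at or above `threshold`."""
    return [line for line in lines if line["confidence"] >= threshold]

# Made-up sample data mirroring the two cases above.
page = [
    {"text": "Because surya, the LLM OCR software,", "confidence": 0.99},
    {"text": "漢字", "confidence": 0.14},
]
kept = drop_low_confidence(page)
print([line["text"] for line in kept])
```

Only the first line survives; the 14% kanji line is dropped.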

But otherwise… how’s the quality now compared to what it was before? Let’s look at a couple examples. The first is a normal, well-scanned text page from RBCC (Rocket’s Blast & Comicollector):

Should be easy to OCR, right? Here’s the results from the traditional OCR:

And indeed, it looks pretty good. It’s perhaps 90% correct? Although some lines are in the wrong order. But “EC” has become “NEC”, for instance, and “New Trend” comics has become “New Trend W in comics”.

So if you want to quote text from this page, you’ll have to spend some time copy-editing it.

Here’s the results from surya:

First of all, the lines are in correct order. But there’s also a lot fewer word recognition errors — I haven’t counted, but I think this is more than 99% correct? So if you want to quote an article from kwakk.info now, you won’t have to spend the rest of the afternoon doing copy edits.

But what about harder stuff? Here’s another RBCC example, but it’s a much worse scan, and it’s a “catalogue” page, which are just harder to parse in general:

Here’s the results from the traditional OCR:

Uhm… er… well, that resembles line noise — you can’t use that; it would be easier to just transcribe that page by hand than trying to “fix” this by editing.

Here’s the surya output:

It’s pretty good! There’s still errors like MAGAZIKES and Jomen, but it’s mostly correct, as far as I can tell.

So there you go. Quoting text should now require a lot less editing afterwards, and, of course, the search itself should be more precise (and find more instances of whatever you’re searching for than previously).

The GPU has kept my home office nice and warm while it’s been OCR-ing these past weeks, which is also a plus.

And dealing with 4GB JSON files (the surya output is very verbose) has been fun — can’t really parse them normally, but whatevs — Emacs is pretty good at dealing with big files, so I just wrote something to plop them into a buffer and parse them gingerly.

If you see any major oddities, let me know.

Book Club 2025: Serier för vuxna by Robert Aman

I’ve been dipping into this book now and then over the past month or so.

It’s a book about a Swedish 80s comics publisher. The publisher was called Epix, and published a bewildering array of anthologies with names like Pox, Maxx and Tung Metall. They mostly reprinted stuff from 70s and 80s French/Italian/Spanish anthologies, and also American indie/underground stuff.

So kinda like if… somebody like Fantagraphics published mostly anthologies like Weirdo and Heavy Metal? But a lot of them. Something like that.

They were controversial at the time, and were finally taken to court for publishing too many sexually violent things. Among the evidence was a strip by Dori Seda. So… normal 80s stuff.

But here’s the twist: They were acquitted, but still went under, because the distribution monopoly at the time did one of those wonderful late-yuppie post-deregulation things: It split into two separate companies, where one would distribute all indie things, and the other company would distribute all major publisher things (and the latter company would also be owned by those companies).

So what do you think happened? Yes, of course — the small press distributor went bankrupt, and the other distributor didn’t have anything to do with them, so Epix (among many other smaller magazine publishers) had to stop publishing anyway.

That’s a very creative solution to getting rid of competition. If the former distributor monopoly had said “no, we’re dropping all these smaller companies”, that would have been an outrage. Instead, by using these corporate actions, they could do exactly that, and nobody could object. Much. (Although Epix tried to sue them.)

Shades of what would later happen in the US with Diamond/Heroes World.

Anyway, this book consists mostly of interview snippets edited together. I think they call these things “oral histories”? I’m not fond of the genre, but it works quite well here. The main problem with this approach is usually that a person will say something interesting, but then there’s no followup because they have nobody else talking about the same thing.

This book does not have that problem: Even if it’s formatted as an oral history, and there’s no trace of the interviewer(s) here, whenever somebody says something interesting, they ask other people about what’s been said and get a response.

So it makes for an entertaining read, especially since the owner and boss of Epix got into fights with absolutely everybody. It sounds like every day at the crowded office was a shouting match.

Other things happened — his ex-wife took (he says “kidnapped”, which may be formally correct) their son off to Trinidad, and he used the cover of an issue to announce a reward for information about his whereabouts, for instance.

They were also prosecuted for distributing violent porn (Dori Seda and… Neil Gaiman! (a retelling of a Bible story)), and while acquitted, this led to a 30% reduction in sales.

And Gary Groth claimed that Epix never paid Fantagraphics for using their material. Epix claims that Kim Thompson found out that Groth had just put the money into an account he’d forgotten about, but the situation led to threats from Mary Fleener, anyway. Peter Bagge comments laconically that it wouldn’t be the first time that Fantagraphics put foreign royalties into some “wrong” bank account.

I’m not Swedish, so I’ve barely read any of what Epix published. I’ve been trying to buy issues of Pox and Epix, but it’s not easy.

Looking over what I have managed to find, these seem like really good magazines. It looks like they basically have everything that’s good from 70s and 80s European and American underground/indie/art/etc comics. Their magazines were monthly, and usually around 100 pages each, so they just published a lot of stuff.

And the reproduction seems nice, and the hand lettering is fantastic.

I guess I should just start reading these comics, even if I don’t have a complete set. I’m kinda raring to go after reading this book about Epix. Hm!

Serier för vuxna – Epix och den svenska serierevolutionen (2024) by Robert Aman (4.16 on Goodreads)

Book Club 2025: Alternatives to Sex by Stephen McCauley

I bought this around 2008, but then never read it. (No particular reason — I buy, like, er, 10% more books than I have time to read, so some inevitably have to remain on the shelves.)

It’s pretty good? I like the formless quality it has — so many books I read are very plot heavy, and have a strict focus on getting somewhere, but this one feels more like a steady state object. It’s relaxing. I guess that’s not unusual for comedic books — more interested in character than structure…

But I think it’s a bit too long. If it had been 200 pages, it would have been a cute little book, but instead it’s 280 and it’s not. And while at no point while actually reading this did I say to myself “bored now”, it just felt like slightly too much. I can totally understand why McCauley would keep typing at this — he’d set up some characters you want to spend time with, but c’mon.

Hm… I see that the book has a low Goodreads score — only 3.41, which is way lower than I would have guessed.

Harsh! But nothing really stands out, except:

Few people really loved it — most thought it was pretty middling.

Alternatives to Sex (2006) by Stephen McCauley (buy new, buy used, 3.41 on Goodreads)