I forgot to switch on advanced searches on kwakk.info

The other day, I was wondering whether there was a way to improve the search quality on kwakk.info, the comics fanzine search engine. It turns out that the search engine I’m using in the background, Xapian, has a lot of operators like NEAR and ADJ that sound interesting, but they didn’t work on kwakk.info.

The cause was that I was calling the engine with a set of options that inadvertently switched these operators off. I’ve fixed that, so you can now drill further into the data.

For instance, take a search for crumb herriman. The first result for this is above — but as you can see, it doesn’t really deal with them in relation to each other; they’re just mentioned on the same page.

With crumb NEAR herriman, you can ensure that the words are close to each other, which will give you more relevant results.

There’s also crumb ADJ herriman, which means the same as NEAR, but crumb has to come before herriman.

And there are other things in there: you can group expressions and all sorts of things, and you can say ADJ/4 to say that the words have to be within four words of each other, etc. Nerd out.
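To make the difference between NEAR and ADJ concrete, here is a small Python sketch of the semantics as described above. It is not how Xapian implements these operators: the function names are made up for illustration, "within N words" is assumed to mean a position difference of at most N, and the default window of 10 is an assumption based on my reading of the Xapian documentation.

```python
def positions(tokens, word):
    """Return every index at which `word` occurs in the token list."""
    return [i for i, token in enumerate(tokens) if token == word]


def near(tokens, a, b, window=10):
    """True if `a` and `b` occur within `window` words of each other, in any order."""
    return any(
        0 < abs(i - j) <= window
        for i in positions(tokens, a)
        for j in positions(tokens, b)
    )


def adj(tokens, a, b, window=10):
    """Like near(), but `a` has to come before `b`."""
    return any(
        0 < j - i <= window
        for i in positions(tokens, a)
        for j in positions(tokens, b)
    )


# A made-up sentence standing in for a page of OCR-ed text.
page = "crumb has often cited herriman as an influence".split()

crumb_near_herriman = near(page, "crumb", "herriman", window=4)
crumb_adj_herriman = adj(page, "crumb", "herriman", window=4)
herriman_adj_crumb = adj(page, "herriman", "crumb", window=4)
```

With this sample page, both crumb NEAR/4 herriman and crumb ADJ/4 herriman match, while herriman ADJ/4 crumb does not, because the words appear in the opposite order.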

I’ve also added a short help text to the site that you can reach from the menu, so hopefully that will… help.

Hm… perhaps NEAR should be the default boolean operator instead of AND? Hm… no, looking over the logs, that doesn’t seem to work well. For instance, people do searches like Arnold Drake interview, and in that case, you often have interview in the heading and stuff, but not necessarily the name.

Ideally, what we’d want is an AND search, but ranked by nearness? Xapian doesn’t allow that… but I guess it could be done by running the search twice, once with NEAR and once with AND, and then smushing the results together in a good way.
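The run-it-twice idea could look something like the Python sketch below. It only shows one possible way of combining the two result lists (NEAR hits first, then whatever AND found in addition); the function name and the document IDs are hypothetical, and the searches themselves are left out.

```python
def merge_results(near_hits, and_hits):
    """Put the NEAR hits first, then the remaining AND hits.

    Both arguments are lists of document IDs in ranked order.
    The order within each list is kept, and duplicates are dropped.
    """
    seen = set()
    merged = []
    for doc_id in list(near_hits) + list(and_hits):
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged


# Made-up result lists for illustration.
near_hits = ["issue-12", "issue-3"]
and_hits = ["issue-7", "issue-12", "issue-3", "issue-9"]

merged = merge_results(near_hits, and_hits)
```

Here the pages where the words are close together end up on top, and the pages that merely contain all the words follow in their original order. Whether that counts as a good way of smushing is a separate question.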

Or… a checkbox to toggle between NEAR and AND.

Well, we’ll see.

Book Club 2025: Meat Is Murder by Joe Pernice

Back in 2009, I sorta finally became aware of all of these 33⅓ books, and I went “ooh, I want that one, and that one, and that one”, and before you knew it, I had a whole stack of them. And then I started reading.

Some of them are really good. Drew Daniel’s book on Throbbing Gristle’s Twenty Jazz Funk Greats is brilliant: it’s not just about that single album, but also encompasses the whole transgressive art thing. And Jonathan Lethem’s book on Talking Heads’ Fear Of Music explained so much, while opening up lots of avenues of interpretation at the same time.

And then there were the rest. Many of them seemed to be written by neurotic nerds with an overwhelming need to pin things down. “No! This song is about one thing only! It’s about that time the vocalist fainted in the bathroom!” Which may or may not be true, but it doesn’t make for interesting reading, and makes the album you’re reading about seem less interesting than you thought it was to begin with.

So I rapidly lost my enthusiasm for these books, and I haven’t bought any since. But I’ve still got more than half a dozen left unread, so why not give one of them a go?

OK, I’m putting the album on the stereo, and here we go…

And this book turns out to be fiction, and not about the album by The Smiths at all. Well, that’s OK, but it’s not actually very good.

The protagonist listens to Smiths albums, I guess, and perhaps there’s more of a connection later in the book. But even though this is a very short book, I found the prose so uninspiring that I rapidly grew impatient, and after 25 pages I thought “well, I don’t care” and ditched it.

What does Goodreads say?

Heh heh.

Meat Is Murder (2003) by Joe Pernice (buy used, 3.41 on Goodreads)

Comics Fanzines OCR Reloaded Redone

The GPU-assisted OCR run for kwakk.info has now finished — it took about three weeks to chew through 10,000 fanzine/magazine issues. So what are the results?

Well, I found that sections with no text would often end up with hallucinated kanji. (Because LLM.) It doesn’t really matter much — you wouldn’t be searching for these things anyway. But if you want to quote a text, it’s annoying if there’s a bunch of kanji in the middle of whatever you’re quoting, so I wondered whether there was a way to filter that junk out.

And there is!

Because surya, the LLM OCR software, assigns a confidence level to every line (and even every character). And “actual text” turns out to have a high confidence — the sentence above has a 99% confidence…

… while surya’s 14% sure about this kanji. (I think that’s way more sure than it should be about something it’s so wrong about, but that’s just me.)

So I just filtered out everything with a confidence level below 60%, and that fixed the vast, vast majority of these hallucinations.
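As a rough illustration, that filtering step could be sketched in Python like this. The data layout is an assumption: each line is taken to be a dict with a "text" field and a "confidence" field between 0 and 1, and the sample lines are made up. The real surya output has more fields and may be structured differently.

```python
MIN_CONFIDENCE = 0.6  # the 60% cutoff mentioned above


def keep_confident(lines, threshold=MIN_CONFIDENCE):
    """Return only the lines whose confidence is at or above the threshold."""
    return [line for line in lines if line["confidence"] >= threshold]


# Made-up sample lines (assumed layout, see above).
sample_lines = [
    {"text": "The first issue appeared in 1961.", "confidence": 0.99},
    {"text": "漫画", "confidence": 0.14},
    {"text": "Send all orders to the address below.", "confidence": 0.93},
]

kept_text = [line["text"] for line in keep_confident(sample_lines)]
```

In this sample, the low-confidence kanji line is dropped and the two ordinary text lines survive.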

But otherwise… how’s the quality now compared to what it was before? Let’s look at a couple examples. The first is a normal, well-scanned text page from RBCC (Rocket’s Blast & Comicollector):

Should be easy to OCR, right? Here’s the results from the traditional OCR:

And indeed, it looks pretty good. It’s perhaps 90% correct? Although some lines are in the wrong order. But "EC" has become NEC", for instance, and their "New Trend" comics has become their "New Trend W in comics.

So if you want to quote text from this page, you’ll have to spend some time copy-editing it.

Here’s the results from surya:

First of all, the lines are in correct order. But there’s also a lot fewer word recognition errors — I haven’t counted, but I think this is more than 99% correct? So if you want to quote an article from kwakk.info now, you won’t have to spend the rest of the afternoon doing copy edits.

But what about harder stuff? Here’s another RBCC example, but it’s a much worse scan, and it’s a “catalogue” page, which are just harder to parse in general:

Here’s the results from the traditional OCR:

Uhm… er… well, that resembles line noise — you can’t use that; it would be easier to just transcribe that page by hand than trying to “fix” this by editing.

Here’s the surya output:

It’s pretty good! There are still errors like MAGAZIKES and Jomen, but it’s mostly correct, as far as I can tell.

So there you go. Quoting text should now require a lot less editing afterwards, and, of course, the search itself should be more precise (and find more instances of whatever you’re searching for than previously).

The GPU has kept my home office nice and warm while it’s been OCR-ing these past weeks, which is also a plus.

And dealing with 4GB JSON files (the surya output is very verbose) has been fun — can’t really parse them normally, but whatevs — Emacs is pretty good at dealing with big files, so I just wrote something to plop them into a buffer and parse them gingerly.

If you see any major oddities, let me know.