Ununicode

I’ve been messing around to see whether running a WordPress installation is fun or not (spoilers: it’s really not), and all of a sudden my test blog articles had turned a strange shade of non-UTF-8.

For instance, some texts I had quoted used that strange apostrophe in “it’s”, and that had turned into “it’s”.

Now, that sequence of characters (â, € and ™, which are Unicode code points 0xE2, 0x20AC and 0x2122) bears no resemblance to the code point for ’, which is 0x2019. But the UTF-8 for ’ is #xE2 #x80 #x99, and that’s the clue: in Windows Code Page 1252, the Euro sign is #x80 and the TM sign is #x99, so what I had on my hands was UTF-8 interpreted as CP1252, and then output as UTF-8 by WordPress.
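To see the mangling byte by byte, here is a small Python sketch (Python rather than Emacs Lisp only because it makes the bytes easy to print). It assumes Python's built-in "cp1252" codec stands in for whatever misread the text; where exactly WordPress went wrong is not something this demonstrates.

```python
# Reproduce the damage: UTF-8 bytes misread as Windows-1252.
apostrophe = "\u2019"                      # ’ RIGHT SINGLE QUOTATION MARK

utf8_bytes = apostrophe.encode("utf-8")    # the three bytes E2 80 99
misread = utf8_bytes.decode("cp1252")      # each byte taken as one CP1252 character

print(utf8_bytes.hex(" "))                 # e2 80 99
print([hex(ord(c)) for c in misread])      # ['0xe2', '0x20ac', '0x2122']
```

Those three code points are exactly â, € and ™, so one apostrophe becomes three characters.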

*phew*

I wondered whether any series of calls to `{en,de}code-coding-region’ coupled with `string-{as,to}-unibyte’ would possibly allow me to un-destroy the text, but that made my head hurt, so I wrote undecodify.el and put it on Microsoft Github.

(undecodify "it’s") => "it’s"

It’s trivial, but at least that fixed the blog articles.
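For anyone hitting the same thing outside Emacs, the reverse round trip can be sketched in Python. This is not how undecodify.el works internally; `unmangle` is a hypothetical helper name, and the sketch assumes exactly one round of UTF-8-read-as-CP1252 damage.

```python
def unmangle(text: str) -> str:
    """Undo one round of UTF-8-read-as-CP1252 damage (hypothetical helper)."""
    try:
        # Turn the characters back into the original bytes, then read them as UTF-8.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake of this kind (or already clean): leave it untouched.
        return text

print(unmangle("it\u00e2\u20ac\u2122s"))   # it’s
```

Falling back to the input on errors means already-clean text passes through unchanged, which matters if only some of the articles were damaged.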

Now I just have to wait for the next thing to go wrong with WordPress…

One thought on “Ununicode”

  1. > I wondered whether any series of calls to `{en,de}code-coding-region’ coupled with `string-{as,to}-unibyte’ would possibly allow me to un-destroy the text

    This seems to work:

    (decode-coding-string (encode-coding-string "it’s" 'windows-1252) 'utf-8)
