Ununicode

I’ve been messing around to see whether running a WordPress installation is fun or not (spoilers: it’s really not), and all of a sudden my test blog articles had turned a strange shade of non-UTF-8.

For instance, some texts I had quoted used that strange apostrophe in “it’s”, and that had turned into “itâ€™s”.

Now, that sequence of characters (Unicode code points #xE2, #x20AC and #x2122) bears no resemblance to the code point for ’, which is #x2019. But the UTF-8 encoding of ’ is #xE2 #x80 #x99, and that’s the clue: in Windows Code Page 1252, â is #xE2, the Euro sign is #x80 and the TM sign is #x99, so what I had on my hands was UTF-8 interpreted as CP1252, and then output as UTF-8 by WordPress.
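The mix-up is easy to reproduce outside Emacs. Here is a minimal Python sketch, assuming Python's `cp1252` codec is a fair stand-in for whatever did the misreading on the server; the variable names are made up for illustration.

```python
# Sketch of how the damage arises; all names here are illustrative only.
original = "\u2019"                     # the curly apostrophe, U+2019

utf8_bytes = original.encode("utf-8")   # three bytes: E2 80 99
misread = utf8_bytes.decode("cp1252")   # each byte read as one CP1252 character

# Code points of the three characters that end up on the page.
code_points = [f"U+{ord(c):04X}" for c in misread]

print(" ".join(f"{b:02X}" for b in utf8_bytes))
print(code_points)
```

The three code points it prints are the ones for â, € and ™, which is the sequence that showed up in the articles.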

*phew*

I wondered whether any series of calls to `{en,de}code-coding-region' coupled with `string-{as,to}-unibyte' would possibly allow me to un-destroy the text, but that made my head hurt, so I wrote undecodify.el and put it on Microsoft GitHub.

(undecodify "itâ€™s") => "it’s"

It’s trivial, but at least that fixed the blog articles.
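For comparison outside Emacs, the repair is the damage run backwards: turn the characters back into CP1252 bytes, then read those bytes as UTF-8. The Python sketch below is not how undecodify.el works. The function name `undo_mojibake` is hypothetical, and the fallback of returning the input untouched is an assumption about what is safest for text that was never damaged.

```python
def undo_mojibake(text: str) -> str:
    """Try to reverse UTF-8 that was misread as CP1252 (hypothetical helper)."""
    try:
        # Turn each character back into the single CP1252 byte it was read from,
        # then decode the resulting byte string as the UTF-8 it originally was.
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Either a character has no CP1252 byte, or the bytes are not valid
        # UTF-8: assume the text was not damaged this way and leave it alone.
        return text


# The damaged "it's" from the post, written with escapes to keep this file ASCII.
print(undo_mojibake("it\u00e2\u20ac\u2122s"))
```

One caveat: Python's `cp1252` codec leaves five byte values (#x81, #x8D, #x8F, #x90 and #x9D) unassigned, so damaged text that went through one of those bytes comes back unchanged rather than repaired.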

Now I just have to wait for the next thing to go wrong with WordPress…

2 thoughts on “Ununicode”


npostavs says:
June 20, 2019 at 19:00

> I wondered whether any series of calls to `{en,de}code-coding-region' coupled with `string-{as,to}-unibyte' would possibly allow me to un-destroy the text

This seems to work:

(decode-coding-string (encode-coding-string "itâ€™s" 'windows-1252) 'utf-8)

larsmagne23 says:
January 19, 2026 at 20:15

Pingback: iwasno.net
