Encoding problems identified and corrected easily

So one customer's webpage showed odd characters in otherwise normal text, like “we’re”. This looked very much like an encoding mix-up between ISO-8859 and UTF-8: three characters in place of one, all of them outside the ASCII range, is a strong hint.

(The byte sequence was C3 A2 E2 82 AC E2 84 A2.)

You can see how these characters spring into existence. Let's take a German A-with-diaeresis (aka A-umlaut). A simple way to make the UTF-8 byte sequence visible is using xterm in UTF-8 mode (I'll skip the details of input and output processing here):

$ echo -en 'Ä' | hexdump -C
c3 84

The conversion errors stem from some piece of software that assumes its input is ISO-8859 encoded (this is the bug) and then converts it to another encoding as it passes the data on to the next component (no bug here). The iconv(1) program can be used to show how this happens:

$ echo -en 'Ä' | iconv -f iso-8859-1 -t utf-8 | hexdump -C
c3 83 c2 84

So what we end up with here is actually some “utf-8-utf-8”. This explains how the faulty characters came onto the webpage. The naïve approach to our problem would now be to reverse this conversion, that is, convert from “utf-8-utf-8” to “utf-8” by using:

$ echo -en "\xc3\xa2\xe2\x82\xac\xe2\x84\xa2" | iconv -f utf-8 -t iso8859-1
iconv: illegal input sequence at position 2

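However, iconv tells us that our gunk data is not that easily convertible: the character starting at byte position 2 is the euro sign (U+20AC), which has no ISO-8859-1 equivalent. So there must have been data corruption, or a translation involving some other encoding, at some point, and we now have to deal with it.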
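The failure can be narrowed down with a small experiment. The euro sign (U+20AC) and the trademark sign (U+2122) do not exist in ISO-8859-1, but both exist in Windows-1252 (at 0x80 and 0x99). The sketch below assumes, without any confirmation from the customer's system, that the misbehaving software treated the data as Windows-1252 rather than ISO-8859-1. The helper name hexbytes is made up for this example, and the sketch relies on your iconv knowing the WINDOWS-1252 charset name.

```shell
# Hypothetical helper: print stdin as space-separated hex bytes on one line.
hexbytes() {
    od -An -v -tx1 | xargs
}

# Sanity check: the double-encoded A-umlaut (c3 83 c2 84) reverses cleanly,
# because both of its characters exist in ISO-8859-1.
printf '\303\203\302\204' | iconv -f UTF-8 -t ISO-8859-1 | hexbytes

# The customer's bytes (c3 a2 e2 82 ac e2 84 a2), reversed under the
# ASSUMPTION that the wrongly assumed source charset was Windows-1252.
printf '\303\242\342\202\254\342\204\242' | iconv -f UTF-8 -t WINDOWS-1252 | hexbytes
```

If the assumption holds, the first command prints c3 84 and the second prints e2 80 99, which is the UTF-8 encoding of U+2019. This does not rescue every case, though: 0x9D is unassigned in Windows-1252, so a strict converter will typically reject a sequence containing U+009D, and some manual mapping remains necessary.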

The heuristic is to take an editor, locate the offending patterns, create a mapping (by hand) to their desired counterparts, and do the substitution across the file. Usually, searching for bytes with the high bit set (0x80 and above) will turn something up, as will searching for part of the offending character group, such as “€” in our case.
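The search for high-bit bytes can also be done outside the editor. The following is a minimal sketch, assuming a grep that matches byte-wise in the C locale (GNU grep does); the function name high_bit_lines is invented for this example.

```shell
# Hypothetical helper: list lines (with line numbers) of the given files,
# or of stdin, that contain at least one byte in the range 0x80-0xFF.
high_bit_lines() {
    # LC_ALL=C makes grep compare single bytes instead of characters;
    # printf builds the bracket expression from octal escapes so that
    # this script itself stays plain ASCII.
    LC_ALL=C grep -n "$(printf '[\200-\377]')" "$@"
}

# Usage: only the second input line contains non-ASCII bytes.
printf 'plain text\nwe\303\242\342\202\254\342\204\242re\n' | high_bit_lines
```

Every reported line is only a candidate: legitimate non-ASCII text shows up as well, so the hits still need to be inspected by eye.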

’ (U+00E2 U+20AC U+2122) → ’ (U+2019)
â€” (U+00E2 U+20AC U+201D) → — (U+2014)
“ (U+00E2 U+20AC U+0153) → “ (U+201C)
â€<9D> (U+00E2 U+20AC U+009D) → ” (U+201D)

This was the substitution map required in our case. (The left-hand side characters may be different in yours, so don't take this mapping as authoritative!) As there was also a 0x9D byte present (represented here as <9D>), it is advisable to use an editor (recommendation inside) that can cope with Unicode characters that have no associated glyph.
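Once the map is settled, the substitution can be applied non-interactively. Here is a sketch of the four code point sequences listed above as a sed filter; the name fix_mojibake is made up, and the map is the one from this particular case, not a general-purpose repair.

```shell
# Hypothetical filter: apply the four substitutions from the map above.
# All byte strings are written as octal escapes of their UTF-8 encoding,
# so the script is plain ASCII and can carry U+009D without trouble.
fix_mojibake() {
    # Common prefix of all four patterns: U+00E2 U+20AC
    pfx=$(printf '\303\242\342\202\254')
    # Rule 1: prefix + U+2122 -> U+2019 (right single quotation mark)
    # Rule 2: prefix + U+201D -> U+2014 (em dash)
    # Rule 3: prefix + U+0153 -> U+201C (left double quotation mark)
    # Rule 4: prefix + U+009D -> U+201D (right double quotation mark)
    LC_ALL=C sed \
        -e "s/${pfx}$(printf '\342\204\242')/$(printf '\342\200\231')/g" \
        -e "s/${pfx}$(printf '\342\200\235')/$(printf '\342\200\224')/g" \
        -e "s/${pfx}$(printf '\305\223')/$(printf '\342\200\234')/g" \
        -e "s/${pfx}$(printf '\302\235')/$(printf '\342\200\235')/g"
}

# Usage: the damaged word from the customer's page comes out repaired.
printf 'we\303\242\342\202\254\342\204\242re\n' | fix_mojibake
```

Running sed under LC_ALL=C keeps the matching purely byte-based, so the result does not depend on the locale of the terminal. Work on a copy of the file and diff the result before replacing the original.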

Posted 2010-03-07 23:03 / Tags: Encoding, Sql.