Unicode, UTF-8 and you

Unicode is not a panacea. Some people’s names can’t even be written in Unicode. However, as far as universal encodings go, it is the best we have got, warts and all. It is the only reasonable way to represent text inside programs, except for very specialized needs (no, you don’t qualify).

Now, programs are made of libraries, and often there are several layers of abstraction between the library and the program. Sometimes, some weird abstraction layer in the middle will make it hard to convey user configuration into the library’s guts. Code should figure things out itself, most of the time.

So, there are several ways to make dealing with Unicode not-horrible.

Unicode internally

I’ve already mentioned it, but it bears repeating. Internal representation should use the language’s built-in Unicode text type (str in Python 3, String in Java, unicode in Python 2). All formatting, templating, etc. should, internally, take Unicode parameters and return Unicode results.
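A minimal Python 3 sketch of that rule: bytes are decoded once at the input edge, everything in the middle works on str, and the result is encoded once at the output edge. The function names and the choice of UTF-8 at both edges are assumptions made for the illustration.

```python
def render_greeting(template: str, name: str) -> str:
    # Internal code: takes text, returns text. No bytes in sight.
    return template.format(name=name)


def handle_request(raw_name: bytes) -> bytes:
    # Input edge: bytes -> str (UTF-8 assumed for this sketch).
    name = raw_name.decode("utf-8")
    text = render_greeting("Hello, {name}!", name)
    # Output edge: str -> bytes.
    return text.encode("utf-8")
```

Keeping the decode and encode calls at the edges means the templating code never has to know which encoding the outside world used.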

Standards

Obviously, when interacting with an external protocol that allows the other side to specify an encoding, follow the encoding it specifies. Your program should support, at least, UTF-8, UTF-16, UTF-32 and Latin-1 through Latin-9. When choosing an output encoding, choose UTF-8 by default. If there is some way for the user to specify an encoding, offer a choice between UTF-8 and UTF-16. Anything else should be under “Advanced…” or, possibly, not offered at all.
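As a rough illustration of following a declared encoding, here is a sketch that reads the charset parameter from an HTTP-style Content-Type value using the standard library's email.message.Message. The decode_body helper and the UTF-8 fallback for a missing charset are assumptions of this sketch, not part of any particular protocol.

```python
from email.message import Message


def decode_body(content_type: str, body: bytes) -> str:
    # Let the stdlib parse "type/subtype; charset=..." for us.
    msg = Message()
    msg["Content-Type"] = content_type
    # Fall back to UTF-8 when the other side declared nothing (assumption).
    charset = msg.get_content_charset(failobj="utf-8")
    # Raises LookupError for an unknown charset name and
    # UnicodeDecodeError if the body does not match the declaration.
    return body.decode(charset)
```

For example, decode_body("text/plain; charset=ISO-8859-1", b"caf\xe9") yields "café", while the same call with plain "text/plain" would treat the body as UTF-8.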

Non-standards

When reading input that is not marked with an encoding, attempt to decode it as UTF-8, then as UTF-16 (most UTF-16 decoders will auto-detect endianness, but it is pretty easy to hand-hack if people put in the BOM). UTF-8 is unlikely to produce false positives, and neither is UTF-16 when a BOM is present, so if either succeeds, it’s likely correct. Otherwise, decoding as ASCII and ignoring high-order bytes is often the best that can be done. If it is reasonable, allow the user to intervene and specify the encoding.
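The fallback chain for unmarked input could look something like the sketch below. Two choices here are assumptions rather than anything stated above: UTF-16 is only attempted when a BOM is present (Python's UTF-16 decoder accepts almost any even-length input, so a BOM-less attempt would rarely fail), and "ignore high-order" is interpreted as dropping bytes above 0x7F. The name decode_unmarked is hypothetical.

```python
import codecs


def decode_unmarked(data: bytes) -> str:
    # 1. Strict UTF-8: invalid sequences raise, so success is meaningful.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # 2. UTF-16, but only when a BOM tells us the byte order.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        try:
            return data.decode("utf-16")  # consumes the BOM
        except UnicodeDecodeError:
            pass
    # 3. Last resort: keep the ASCII bytes, drop everything else.
    return data.decode("ascii", errors="ignore")
```

The last step loses information, which is why a user override is worth offering wherever it is practical.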

When writing output, the default should be UTF-8. If letting the user specify the encoding is non-trivial, it is fine not to offer a choice. If it is possible, UTF-16 should be offered (with a BOM prepended at the start of the output). Other encodings are not recommended when the output has no way to declare them: the reader will have to guess correctly. At the least, such options should be hidden behind an “Advanced…” option.
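A small sketch of that output policy, assuming a hypothetical encode_output helper that only accepts the two recommended encodings. It relies on the fact that Python's "utf-16" codec prepends a BOM in the platform's native byte order when encoding.

```python
def encode_output(text: str, encoding: str = "utf-8") -> bytes:
    # Only the two recommended choices; anything else would belong
    # behind an "Advanced..." switch (not modelled in this sketch).
    if encoding not in ("utf-8", "utf-16"):
        raise ValueError("unsupported output encoding: " + encoding)
    # The "utf-16" codec writes a BOM first; "utf-8" writes none.
    return text.encode(encoding)
```

A reader that sees the BOM can pick the right byte order without any out-of-band hint.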

The most popular I/O that does not have explicit encoding, or any way to specify one, is file names on UNIX systems. UTF-8 should be assumed, and reasonably recovered from when it proves false. No other encoding is reasonable (UTF-16 is uniquely unsuitable since UNIX filenames cannot have NULs, and other encodings cannot encode some characters).
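One possible way to "assume UTF-8 and recover" in Python 3 is the surrogateescape error handler, which smuggles undecodable bytes through str so that the original name can be reproduced exactly. The helper names below are hypothetical; on most POSIX systems os.fsdecode and os.fsencode do something similar using the filesystem encoding.

```python
def filename_to_text(raw: bytes) -> str:
    # Valid UTF-8 decodes normally; stray bytes become lone surrogates
    # instead of raising, so nothing is lost.
    return raw.decode("utf-8", errors="surrogateescape")


def text_to_filename(name: str) -> bytes:
    # Inverse of the above: escaped surrogates turn back into the
    # original bytes, so the file can still be opened or renamed.
    return name.encode("utf-8", errors="surrogateescape")
```

A name decoded this way may not print cleanly, but it round-trips, which is usually what matters for file operations.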
