Today, in things you may not have known that library programmers have to think about:
Almost all text everywhere is encoded as either ASCII, Latin-1, or UTF-8.
ASCII is pitifully limited and, though simple and convenient, can't even fully represent American English, thanks to unsupported punctuation and words borrowed from other languages with differing alphabets. But because of that simplicity and convenience, we do still use it a lot.
Latin-1 is better, and can adequately represent a few dozen European languages.
And Unicode, of which UTF-8 is one encoding, is able to fully represent an unthinkable number of written languages, with additional scripts still being considered for inclusion in the standard. (Also, all valid ASCII is also valid UTF-8, which is nice.)
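That ASCII compatibility is easy to see in practice. A quick sketch (in Python, my choice for illustration; the post itself contains no code):

```python
# Every ASCII byte sequence is already valid UTF-8, byte for byte:
s = "plain ASCII text"
assert s.encode("ascii") == s.encode("utf-8")

# ...but it is NOT valid UTF-16, which spends a 16-bit code unit
# even on plain ASCII characters:
assert s.encode("utf-16-be") != s.encode("ascii")
assert len(s.encode("utf-16-be")) == 2 * len(s)
```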
But, sometimes, someone goes and represents text in some other way. UTF-16 is a particularly problematic offender.
The difference between UTF-8 and UTF-16 is subtle, but very important. They are both Unicode encodings, meaning that they map to the same very large set of characters. They are both variable-length encodings, meaning that the number of code units it takes to represent a character can vary. The defining difference is the code unit itself: UTF-8 is a series of 8-bit units (bytes, or C chars), while UTF-16 is a series of 16-bit units (shorts, or the wide chars that Windows calls wchar_t).
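To make that code-unit distinction concrete, here's a small Python sketch comparing how many units each encoding spends per character:

```python
# How many code units does each encoding need per character?
# UTF-8 uses 1-4 bytes; UTF-16 uses one or two 16-bit units.
for ch in ["A", "é", "€", "\U0001D11E"]:  # the last is 𝄞, outside the BMP
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-be")) // 2  # 2 bytes per unit
    print(f"U+{ord(ch):04X}: {utf8_bytes} UTF-8 byte(s), {utf16_units} UTF-16 unit(s)")
# U+0041: 1 UTF-8 byte(s), 1 UTF-16 unit(s)
# U+00E9: 2 UTF-8 byte(s), 1 UTF-16 unit(s)
# U+20AC: 3 UTF-8 byte(s), 1 UTF-16 unit(s)
# U+1D11E: 4 UTF-8 byte(s), 2 UTF-16 unit(s)
```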
The important difference is that, despite a rigorously defined standard, everyone seems to have their own idea about how to encode and decode UTF-16 text. Some applications acknowledge that UTF-16 is a variable-length encoding; others handle surrogate pairs (pairs of 16-bit code units that together represent a single character outside the Basic Multilingual Plane) incorrectly, or don't handle them at all. Decoding is further complicated by endianness: a system's byte order affects how UTF-16 is laid out, so a byte order mark, or BOM (the code point U+FEFF), is often placed at the start of the text to help decoding algorithms recognize whether the encoding is in fact UTF-16BE (big endian) or UTF-16LE (little endian). But not everyone bothers to include a BOM.
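Both pitfalls are easy to demonstrate in a few lines of Python (illustrative only; the exact byte values follow from the UTF-16 surrogate rules):

```python
# Surrogate pairs and the BOM, the two pitfalls described above.
clef = "\U0001D11E"  # 𝄞 (MUSICAL SYMBOL G CLEF), outside the BMP

# One character, but two 16-bit code units — a surrogate pair:
be = clef.encode("utf-16-be")
assert be == b"\xd8\x34\xdd\x1e"   # high surrogate D834, low surrogate DD1E
le = clef.encode("utf-16-le")
assert le == b"\x34\xd8\x1e\xdd"   # same units, bytes swapped

# Python's plain "utf-16" codec prepends a BOM (U+FEFF) so decoders
# can detect the byte order; without one, they have to guess:
with_bom = clef.encode("utf-16")
assert with_bom[:2] in (b"\xff\xfe", b"\xfe\xff")
assert with_bom.decode("utf-16") == clef
```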
And all these programmer errors, inconsistencies, and general deficiencies have led to discussions such as
this one on Stack Overflow, in which the accepted answer explains that, yes, UTF-16 should be considered harmful.
What a big joke, eh? And so we should all be relieved that there's no good reason to worry about UTF-16 or to have our code support it, right? Well, no. It turns out the NTFS file system, and by extension Windows, still represents file names as UTF-16. If you want your application to support multiple languages on Windows, it has to know how to encode and decode UTF-16 file paths.
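As a minimal sketch of what that entails (in Python, not the actual mach code, and with a made-up path), a multilingual file name has to round-trip cleanly through 16-bit little-endian units, the form in which the Windows wide-character APIs store it:

```python
# Round-trip a multilingual file path through UTF-16, the form in
# which NTFS and the Windows wide-character APIs store file names.
path = "C:\\Users\\José\\音楽\\\U0001D11E.txt"  # hypothetical path

wide = path.encode("utf-16-le")  # Windows is little-endian in memory
assert len(wide) % 2 == 0        # always a whole number of 16-bit units

# A correct decoder must reassemble the surrogate pair back into 𝄞:
assert wide.decode("utf-16-le") == path
```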
Which is why I was recently working on adding UTF-16 support to mach, and you can see that code
here.