Monday 30 April 2012

Practical Unicode–Part 1

This post is not intended to be an exhaustive essay on Unicode, the standards behind it and how it works; it is the result of working with multiple systems transferring data in various formats and trying to get it into SQL Server without corruption.
Environment Background
As previously mentioned, this is based on my experience working with multiple source systems and the following tools in a Windows environment.
Overview of Unicode and Why We Have It
When we first started storing text (ignoring EBCDIC), the general convention was that one byte (8 bits) represented a single character, with the alphabet and common symbols mapped to values 32 to 127 (see the ASCII table); values 128 to 255 were considered the extended character set. To represent different alphabets and common accented characters we ended up with code pages, meaning that value x could represent character y in one code page but character z in another. This was a pain with websites, as unless the code page was declared you could not be confident that you were displaying the characters correctly.
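To see the code page problem in action, here is a minimal sketch in Python 3 (the byte value 0xE9 is just an illustrative example):

# One byte, three different characters depending on the code page.
data = b"\xe9"
print(data.decode("cp1252"))  # é (Western European, Windows-1252)
print(data.decode("cp1251"))  # й (Cyrillic, Windows-1251)
print(data.decode("cp1253"))  # ι (Greek, Windows-1253)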
As a result we now have Unicode, where it is possible for a character to be represented by multiple bytes (a simplification: glyphs and combining characters are not covered here). As with all things standards-based, Unicode isn't just a monolithic entity; there are various flavours and things to note.
Common ‘Flavours’
UTF-8
This is pretty much the de facto standard for storing text in a manner which can be read by almost everyone. Where a character can be represented in ASCII it is held as a single byte; other characters take 2, 3 or 4 bytes.
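The variable width is easy to demonstrate; a minimal sketch in Python 3:

# Byte length grows with the character's code point.
for ch in ("A", "é", "€", "𝄞"):
    print(ch, len(ch.encode("utf-8")))
# A 1, é 2, € 3, 𝄞 4 (U+1D11E, outside the Basic Multilingual Plane)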
While there is only one correct way to read the bytes, and thus no need for a Byte Order Mark (BOM), it is common to find the bytes 0xEF 0xBB 0xBF at the beginning of some files. These can be removed using the following binmay command:
binmay -s "EF BB BF" -r ""
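If binmay is not available, the same strip can be done with a short Python 3 sketch (the file names here are placeholders):

import codecs

with open("input.txt", "rb") as f:
    data = f.read()
# Drop the UTF-8 BOM (0xEF 0xBB 0xBF) if present.
if data.startswith(codecs.BOM_UTF8):
    data = data[len(codecs.BOM_UTF8):]
with open("output.txt", "wb") as f:
    f.write(data)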
UCS2 Little Endian
This is the only format SQL Server can use for input via BCP or BULK INSERT; any data exported using BCP where there are NVARCHAR fields will be in this format.
UTF-16 stores a character as either one or two 16 bit words (using surrogate pairs), which makes it a variable-length encoding; UCS2 is its fixed-length predecessor, where every character is a single 16 bit word, so it can only represent the Basic Multilingual Plane.
A UCS2 Little Endian file will have a two byte Byte Order Mark consisting of 0xFF 0xFE; if it were a UCS2 Big Endian file then the BOM would be 0xFE 0xFF.
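To prepare a UTF-8 file for BCP, re-encoding it as little endian with the 0xFF 0xFE BOM can be done with a Python 3 sketch like the following (file names are placeholders; strictly this produces UTF-16LE, which matches UCS2 for Basic Multilingual Plane text):

import codecs

with open("input.txt", "r", encoding="utf-8") as src:
    text = src.read()
with open("output.txt", "wb") as dst:
    dst.write(codecs.BOM_UTF16_LE)       # 0xFF 0xFE
    dst.write(text.encode("utf-16-le"))  # little endian, no BOM added by the codec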
UTF-32
In UTF-32 every character is stored as a single 32 bit word (four bytes), which means it is not space efficient but is very simple to parse. At the moment very few systems support it, and with Python you have to choose the option (a 'wide' Unicode build) when the interpreter is compiled.
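The fixed width can be checked with a quick Python 3 sketch ("utf-32-le" is used here to avoid the BOM the plain "utf-32" codec prepends):

# Every character is four bytes, regardless of code point.
for ch in ("A", "é", "𝄞"):
    print(ch, len(ch.encode("utf-32-le")))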