Monday, 21 May 2012

Practical Unicode–Part 4

This post covers how to automate the conversion done in part 3, and some of the things which caused problems.
Starting with the file in UTF-8 without a Byte Order Marker (BOM):
image
Using iconv to convert the file from UTF-8 to UCS2 Little Endian with the command below results in the following useless file.
iconv -f UTF-8 -t UCS-2LE input.txt > output.txt
image
As you can see, every character is represented by two bytes, but without the BOM 0xFF 0xFE Notepad++ tries to display it as ANSI, and SQL Server gives the following error on bulk insert:
Msg 4832, Level 16, State 1, Line 1
Bulk load: An unexpected end of file was encountered in the data file.

If we insert a BOM we then get the following gibberish:

image
This is caused by the CRLF not being handled correctly.
image
The CRLF is represented as 0x0D 0x0A 0x00, which means Notepad++ can’t interpret it properly.
If we change these bytes to 0x0D 0x00 0x0A 0x00 then the file displays correctly and also loads into SQL Server correctly.
image
This can be done automatically using binmay and the following command
binmay -i input.txt -o output.txt -s "0D 0A 00" -r "0D 00 0A 00"
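If binmay is not available, the same byte-level fix can be scripted; below is a minimal Python sketch of the idea (the file names are placeholders, not part of the original process).

# Read the raw bytes of the converted file.
with open("output.txt", "rb") as f:
    data = f.read()

# Insert the missing 0x00 after the carriage return so each line ending
# becomes 0x0D 0x00 0x0A 0x00, a valid UTF-16LE CRLF.
data = data.replace(b"\x0d\x0a\x00", b"\x0d\x00\x0a\x00")

with open("output.txt", "wb") as f:
    f.write(data)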
It is worth noting that at this point there is no BOM, so the file displays as UCS-2 LE w/o BOM.
image
You can then add the BOM using a program called FFFE_ADD
FFFE_add output.txt
This program is based on UTF-BOM-UTILS, which is published under a BSD licence; FFFE_ADD can be found in my BitBucket repo.
image
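For completeness, the BOM step can also be scripted rather than using FFFE_add; a rough Python sketch of the same idea (the file name is a placeholder):

# Prepend the UTF-16LE Byte Order Marker 0xFF 0xFE if it is not already present.
with open("output.txt", "rb") as f:
    data = f.read()

if not data.startswith(b"\xff\xfe"):
    with open("output.txt", "wb") as f:
        f.write(b"\xff\xfe" + data)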

Monday, 14 May 2012

Practical Unicode Part 3

This post is dedicated to the following problem.
Data was being received from a system as UTF-8 without a Byte Order Marker (BOM) and contained accented characters; this data had to be loaded into SQL Server via BULK INSERT without any data corruption.
As I’m not planning on demonstrating this using customer data I’m going to set up the following example.
Create a table with three fields (as I want to show the column separator), all of them NVARCHAR.
create table Test
      (FieldOne nvarchar(50)
      ,FieldTwo nvarchar(50)
      ,FieldThree nvarchar(50))

Then create a data file which is tab-delimited and saved as UTF-8 without a BOM.

Symbols have been turned on to make it easier to read
image
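The data file in the screenshot was created by hand; the Python sketch below shows one way to produce an equivalent test file. The values and path are illustrative placeholders, not the original data.

# Three tab-separated columns containing accented characters, written as
# UTF-8 without a BOM and with CRLF line endings.
rows = [("café", "naïve", "Zürich")]

with open(r"C:\Scratch\input.ansi.utf8.txt", "w", encoding="utf-8", newline="") as f:
    for row in rows:
        f.write("\t".join(row) + "\r\n")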
Then bulk load the data
bulk insert Test From 'C:\Scratch\input.ansi.utf8.txt'
      with (codepage = 'ACP', FieldTerminator = '\t', RowTerminator='\n')

It can be seen in the screenshot below that the data was corrupted.

image
If the file is converted to UCS2 Little Endian (using Notepad++) and the same load process executed, we see the following:
1) We should really specify the data file type (DATAFILETYPE = 'widechar') to remove the following message:
Bulk load: DataFileType was incorrectly specified as char. DataFileType will be assumed to be widechar because the data file has a Unicode signature.

2) The data has loaded correctly.

image
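For reference, the manual Notepad++ conversion can also be expressed as a small script; this is only a Python sketch (the paths are placeholders), and the next post covers the approach actually used.

# Re-encode the UTF-8 file as UCS-2/UTF-16 Little Endian with a BOM so
# BULK INSERT treats it as widechar data.
with open(r"C:\Scratch\input.ansi.utf8.txt", "r", encoding="utf-8", newline="") as f:
    text = f.read()

with open(r"C:\Scratch\input.ucs2le.txt", "w", encoding="utf-16-le", newline="") as f:
    f.write("\ufeff" + text)   # write the BOM explicitly, then the text unchanged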
The next post will cover how to automate this process and some of the pitfalls.

Monday, 7 May 2012

Practical Unicode–Part 2

This is a follow-on to Practical Unicode Part 1 and covers what Notepad++ means when it describes a file’s encoding.
The hex editor used in this post is HxD.

ANSI as UTF-8

This means that the file is UTF-8 encoded without a Byte Order Marker (BOM)
image
image

UTF-8

In this example I’ve added an extended character to show how ‘normal chars’ are a single byte while others are multiple bytes.
image
Here you can see the BOM 0xEF 0xBB 0xBF, then the bulk of the text stored as a single byte per character, and finally the extended character at the end stored as three bytes: 0xC2 0x81 0x42.
image

UCS2 Little Endian

image
In this example you can see the following:
  • The endianness (byte order) indicated by the BOM 0xFF 0xFE.
  • Characters being stored as two bytes (a 16-bit word), e.g. U is stored as 0x55 0x00.
  • The extended character at the end being stored as two words (4 bytes): 0x81 0x00 0x42 0x00.
image
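The byte patterns above can also be reproduced without a hex editor; a small Python sketch (the test string is only a stand-in for the text in the screenshots, ending with the same extended character U+0081 and a 'B'):

text = "Unicode\u0081B"

def show(raw):
    # print the bytes as space-separated hex, much as HxD displays them
    print(" ".join(f"{b:02x}" for b in raw))

show(text.encode("utf-8"))                    # ASCII stays one byte each; U+0081 becomes c2 81
show(text.encode("utf-16-le"))                # every character is a 16-bit word, e.g. 'U' -> 55 00
show(b"\xff\xfe" + text.encode("utf-16-le"))  # the same bytes with the UCS-2 LE BOM at the front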
If we delete the BOM then you can see Notepad++ displays the encoding as UCS2 Little Endian without BOM
image
image

Monday, 30 April 2012

Practical Unicode–Part 1

This post is not intended to be an exhaustive essay on Unicode, the standards and how it works; it is the result of working with multiple systems transferring data in various formats and trying to get it into SQL Server without corruption.
Environment Background
As previously mentioned this is based on my experience working with multiple source systems and the following tools in a Windows environment.
Overview of Unicode and Why We Have It
When we first started storing text (ignoring EBCDIC), the general convention was that one byte (8 bits) represented a single character, with the alphabet and common symbols mapped to values 32 to 127 (see the ASCII table); the values 128 to 255 were considered the extended character set. To represent different alphabets and common accented characters we therefore ended up with code pages, meaning that value x could represent character y in one code page but character z in another. This was a pain with websites: unless the code page was declared, you could not be confident that you were displaying the characters correctly.
As a result we now have Unicode, where it is possible for a character to be represented by multiple bytes (a simplification: glyphs and combining characters are not covered here). As with all things standards-based, Unicode isn’t just a monolithic entity; there are various flavours and things to note.
Common ‘Flavours’
UTF-8
This is pretty much the de facto standard for storing text in a manner which can be read by almost everyone. Where a character can be represented by ASCII it is held as a single byte; other characters can be 2, 3 or 4 bytes.
While there is only one correct way to read the bytes, and thus no need for a Byte Order Marker (BOM), it is common to find the bytes 0xEF 0xBB 0xBF at the beginning of some files. This can be removed using the following binmay command:
binmay -s "EF BB BF" -r ""
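If binmay is not to hand, a small Python sketch of the same clean-up (the file name is a placeholder):

# Strip a leading UTF-8 BOM (0xEF 0xBB 0xBF) if one is present.
with open("input.txt", "rb") as f:
    data = f.read()

if data.startswith(b"\xef\xbb\xbf"):
    with open("input.txt", "wb") as f:
        f.write(data[3:])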
UCS2 Little Endian
This is the only Unicode format SQL Server can use for input via BCP or BULK INSERT; any data exported using BCP where there are NVARCHAR fields will be in this format.
UTF-16 stores a character as either one or two 16-bit words; UCS2 is the fixed-width subset which covers only the Basic Multilingual Plane, so every character is a single 16-bit word.
A UCS2 Little Endian file will have a two-byte Byte Order Marker consisting of 0xFF 0xFE; if it were a UCS2 Big Endian file then the BOM would be 0xFE 0xFF.
For the characters most systems exchange the two look identical on disk, but strictly UTF-16 is a variable-length encoding while UCS2 is fixed-length.
UTF-32
In UTF-32 every character is stored as four bytes (two 16-bit words); this means it is not space-efficient, but it is very simple to parse. At the moment very few systems support it, and with Python you have to choose the option at compile time.
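As a rough illustration of the space trade-off described above (not from the original post), the same seven-character ASCII string takes progressively more room in each encoding:

# Storage size of the same string in each encoding.
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, len("Unicode".encode(enc)))   # 7, 14 and 28 bytes respectively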