Monday, 21 May 2012

Practical Unicode–Part 4

This post is how to automate the conversion done in part 3 and some of the things which caused problems.
Starting with the file in UTF-8 without a Byte Order Marker (BOM)
image
Using ICONV to covert the file from UTF-8 to UCS2 Little Endian with the command below results in the following useless file.
iconv -f UTF-8 -t UCS-2LE input.txt > output.txt
image
As you can see every character is represented by two bytes but without the BOM 0xFF 0xFE notepad++ tries to display it as ANSI and SQL Server gives the following error on bulk insert
Msg 4832, Level 16, State 1, Line 1
Bulk load: An unexpected end of file was encountered in the data file.

If we insert a BOM we then get the following giberish

image
This is caused by the CRLF not being handled correctly.
image
The CRLF is represented as 0x0D 0x0A 0x00 and this means notepad++ can’t interpret it properly.
If we change these bytes to 0x0D 0x00 0x0A 0x00 then the file displays correctly and also loads into SQL server correctly.
image
This can be done automatically using binmay and the following command
binmay -i input.txt -o output.txt -s "0D 0A 00" -r "0D 00 0A 00"
It is worth noting at this point there is not a BOM so the file displays as UCS-2 LE w/o BOM
image
You can then add the BOM using a program called FFFE_ADD
FFFE_add output.txt
This program is based on UTF-BOM-UTILS which is published under a BSD Licence, FFFE_ADD can be found at my BitBucket Repo
image

Monday, 14 May 2012

Practical Unicode Part 3

This post is dedicated to the following problem.
Data was being received from a system as UTF-8 without a Byte Order Marker (BOM) and contained accented characters, this data had to be loaded into SQL Server via BULK INSERT without any data corruption.
As I’m not planning on demonstrating this using customer data I’m going to set up the following example.
Create a table with three fields (as I want to show the column separator) with all three fields being NVARCHAR.
create table Test
      (FieldOne nvarchar(50)
      ,FieldTwo nvarchar(50)
      ,FieldThree nvarchar(50))

Then create a datafile which is tab delimited and saved as UTF-8 without a BOM

Symbols have been turned on to make it easier to read
image
Then bulk load the data
bulk insert Test From 'C:\Scratch\input.ansi.utf8.txt'
      with (codepage = 'ACP', FieldTerminator = '\t', RowTerminator='\n')

It can be seen in the screenshot below that the data was corrupted.

image
If the file is converted to UCS2 Little Endian (using Notepad++) and the same load process executed we see the following
1) We should really specify the datatype to remove the following message
Bulk load: DataFileType was incorrectly specified as char. DataFileType will be assumed to be widechar because the data file has a Unicode signature.

2) The data has loaded correctly.

image
The next post will be how to automate this process and some of the pitfalls.

Monday, 7 May 2012

Practical Unicode–Part 2

This is a follow on to Practical Unicode Part 1 and covers what Notepad++ means when it describes a file.
The hex editor used in this is HXD.

Ansi as UTF-8

This means that the file is UTF-8 encoded without a Byte Order Marker (BOM)
image
image

UTF-8

In this example I’ve added an extended character to show how ‘normal chars’ are a single byte while others are multiple bytes.
image
Here you can see the BOM 0xEF 0xBB 0xBF, then the bulk of the text being stored as a single byte and finally the final character being stored as three bytes 0xC2 0x81 0x42
image

UCS2 Little Endian

image
In this example you can see the following
  • The endianess of the word represented by the BOM 0xFF 0xFE.
  • Characters being stored as two bytes (16 bit word) e.g. U is stored as 0x55 0x00
  • The extended character at the end being stored as two words ( 8 bytes) 0x81 0x00 0x42 0x00
image
If we delete the BOM then you can see Notepad++ displays the encoding as UCS2 Little Endian without BOM
image
image