Monday, 21 May 2012

Practical Unicode–Part 4

This post covers how to automate the conversion done in Part 3, and some of the things that caused problems.
Starting with the file in UTF-8 without a Byte Order Marker (BOM):
[image: the starting UTF-8 file]
Using iconv to convert the file from UTF-8 to UCS-2 Little Endian with the command below results in the following useless file:
iconv -f UTF-8 -t UCS-2LE input.txt > output.txt
[image: hex view of the converted file]
As you can see, every character is now represented by two bytes, but without the BOM (0xFF 0xFE) Notepad++ tries to display the file as ANSI, and SQL Server gives the following error on bulk insert:
Msg 4832, Level 16, State 1, Line 1
Bulk load: An unexpected end of file was encountered in the data file.
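A quick way to see what you are dealing with, without opening a hex editor, is to dump the first few bytes and check for the 0xFF 0xFE marker. This is just a small Python sketch added here for illustration (the file name is taken from the command line), not part of the original toolchain:

# check_bom.py - print the first bytes of a file and report whether the UCS-2 LE BOM is present
import sys

with open(sys.argv[1], "rb") as f:
    head = f.read(4)

print("First bytes:", " ".join(f"{b:02X}" for b in head))
if head.startswith(b"\xff\xfe"):
    print("UCS-2 LE (0xFF 0xFE) BOM present")
else:
    print("No UCS-2 LE BOM")

Run it as python check_bom.py output.txt to see why the file above is being misread.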

If we insert a BOM we then get the following gibberish:

[image: the file displayed as gibberish in Notepad++]
This is caused by the CRLF not being handled correctly.
[image: hex view of the line ending bytes]
The CRLF is represented as 0x0D 0x0A 0x00. Because this sequence is an odd number of bytes, everything after it is shifted by one byte and no longer lines up on two-byte boundaries, which is why Notepad++ can't interpret the file properly.
If we change these bytes to 0x0D 0x00 0x0A 0x00, the file displays correctly and also loads into SQL Server correctly.
[image: the corrected file displaying properly]
This can be done automatically using binmay and the following command:
binmay -i input.txt -o output.txt -s "0D 0A 00" -r "0D 00 0A 00"
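If binmay isn't available, the same byte-level search and replace is only a few lines of Python. This is just a sketch along the same lines as the binmay command above (input.txt and output.txt are placeholders for whatever files you are using):

# fix_crlf.py - replace the 3-byte line ending 0x0D 0x0A 0x00 with a proper UCS-2 LE CRLF 0x0D 0x00 0x0A 0x00
with open("input.txt", "rb") as f:
    data = f.read()

data = data.replace(b"\x0d\x0a\x00", b"\x0d\x00\x0a\x00")

with open("output.txt", "wb") as f:
    f.write(data)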
It is worth noting that at this point there is still no BOM, so the file displays as UCS-2 LE w/o BOM.
[image: Notepad++ reporting the encoding as UCS-2 LE w/o BOM]
You can then add the BOM using a program called FFFE_ADD:
FFFE_add output.txt
This program is based on UTF-BOM-UTILS, which is published under a BSD licence; FFFE_ADD can be found in my BitBucket repo.
[image: the file with the BOM added]
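If you would rather not build another utility, adding the BOM is just a matter of writing the two bytes 0xFF 0xFE to the front of the file. A minimal Python sketch that does the same job (again, the file name is a placeholder, and this is not how FFFE_ADD itself is implemented) might look like this:

# add_bom.py - prepend the UCS-2 LE BOM (0xFF 0xFE) if the file does not already start with it
with open("output.txt", "rb") as f:
    data = f.read()

if not data.startswith(b"\xff\xfe"):
    with open("output.txt", "wb") as f:
        f.write(b"\xff\xfe" + data)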