Monday, 14 May 2012

Practical Unicode Part 3

This post is dedicated to the following problem.
Data was being received from a system as UTF-8 without a Byte Order Mark (BOM) and contained accented characters. This data had to be loaded into SQL Server via BULK INSERT without any corruption.
As I’m not planning on demonstrating this using customer data, I’m going to set up the following example.
First, create a table with three fields (three so that the column separator is visible), all of type NVARCHAR.
create table Test
      (FieldOne nvarchar(50)
      ,FieldTwo nvarchar(50)
      ,FieldThree nvarchar(50))

Then create a data file that is tab-delimited and saved as UTF-8 without a BOM.
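For illustration, the file is assumed to contain rows like the ones below, where → marks a tab character (this is how Notepad++ renders tabs once symbols are shown):

      café→naïve→über
      résumé→jalapeño→façade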

Symbols have been turned on to make the separators easier to read.
[Screenshot: the data file open in Notepad++ with tab and end-of-line symbols visible]
Then bulk load the data:
bulk insert Test from 'C:\Scratch\input.ansi.utf8.txt'
      with (codepage = 'ACP', FieldTerminator = '\t', RowTerminator = '\n')

As the screenshot below shows, the accented characters have been corrupted.

[Screenshot: query results showing the accented characters corrupted after the load]
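The corruption is easy to explain: with codepage = 'ACP', SQL Server decodes the file using the client's ANSI code page (typically Windows-1252), so every multi-byte UTF-8 sequence is read as two or more single-byte characters. A minimal sketch of the effect, assuming a database whose default collation uses code page 1252:

      -- The UTF-8 encoding of é is the byte pair 0xC3 0xA9.
      -- Decoded as Windows-1252, those two bytes become two characters:
      select cast(0xC3A9 as varchar(2)) collate Latin1_General_CI_AS
      -- returns 'é' instead of 'é'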
If the file is converted to UCS-2 Little Endian (using Notepad++) and the same load process is executed, we see the following:
1) We should really specify the data file type to remove the following message (a corrected statement is sketched after the screenshot):
Bulk load: DataFileType was incorrectly specified as char. DataFileType will be assumed to be widechar because the data file has a Unicode signature.

2) The data has loaded correctly.

[Screenshot: query results showing the accented characters loaded correctly]
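To silence the message, DATAFILETYPE can be specified in the statement itself. A sketch, assuming the converted file has been saved as input.ucs2le.txt:

      bulk insert Test from 'C:\Scratch\input.ucs2le.txt'
            with (datafiletype = 'widechar', FieldTerminator = '\t', RowTerminator = '\n')

With a widechar file the codepage option is no longer needed, as the bytes are read as UCS-2 directly.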
The next post will cover how to automate this process and some of its pitfalls.