Data was being received from a system as UTF-8 without a Byte Order Marker (BOM) and contained accented characters, this data had to be loaded into SQL Server via BULK INSERT without any data corruption.
As I’m not planning on demonstrating this using customer data I’m going to set up the following example.
Create a table with three fields (as I want to show the column separator) with all three fields being NVARCHAR.
create table Test
(FieldOne nvarchar(50)
,FieldTwo nvarchar(50)
,FieldThree nvarchar(50))
Then create a datafile which is tab delimited and saved as UTF-8 without a BOM
Symbols have been turned on to make it easier to read
Then bulk load the data
bulk insert Test From 'C:\Scratch\input.ansi.utf8.txt'
with (codepage = 'ACP', FieldTerminator = '\t', RowTerminator='\n')
It can be seen in the screenshot below that the data was corrupted.
1) We should really specify the datatype to remove the following message
Bulk load: DataFileType was incorrectly specified as char. DataFileType will be assumed to be widechar because the data file has a Unicode signature.
2) The data has loaded correctly.