Jamie Dainton's Blog: Practical Unicode

Monday, 7 May 2012

Practical Unicode–Part 2

This is a follow-up to Practical Unicode Part 1 and covers what Notepad++ means when it describes a file's encoding.
The hex editor used here is HxD.

ANSI as UTF-8

This means that the file is UTF-8 encoded without a Byte Order Mark (BOM).
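As a rough illustration of the difference, here is a minimal Python sketch. The sample text is my own assumption, not the file from the screenshots. Python's `utf-8` codec writes no BOM, while `utf-8-sig` prepends one.

```python
import codecs

text = "Unicode"  # hypothetical sample text, not the file shown in the post

no_bom = text.encode("utf-8")        # UTF-8 without a BOM
with_bom = text.encode("utf-8-sig")  # UTF-8 with the BOM EF BB BF in front

print(no_bom.hex(" "))
print(with_bom.hex(" "))

# The only difference between the two is the three BOM bytes at the start.
print(with_bom == codecs.BOM_UTF8 + no_bom)
```

Apart from the three leading bytes, the two encodings are byte-for-byte identical.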

UTF-8

In this example I’ve added an extended character to show how ‘normal’ (ASCII) characters take a single byte while others take multiple bytes.

Here you can see the BOM 0xEF 0xBB 0xBF, then the bulk of the text stored as one byte per character, and finally the last three bytes 0xC2 0x81 0x42: the extended character U+0081 takes two bytes (0xC2 0x81), and 0x42 is a single-byte ‘B’.
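To check how many bytes each character takes in UTF-8 without a hex editor, something like the following sketch should work. The helper name `utf8_bytes_per_char` and the sample string are assumptions for illustration. I've added a euro sign as an example of a character that really does need three bytes.

```python
def utf8_bytes_per_char(text):
    """Return (code point, UTF-8 bytes as hex) for every character in text."""
    return [(f"U+{ord(ch):04X}", ch.encode("utf-8").hex(" ")) for ch in text]


# Hypothetical sample: 'U', the character U+0081, 'B', and a euro sign.
sample = "U\u0081B\u20ac"

for code_point, hex_bytes in utf8_bytes_per_char(sample):
    print(code_point, hex_bytes)
```

ASCII characters come out as one byte, U+0081 as two (c2 81), and the euro sign as three (e2 82 ac).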

UCS-2 Little Endian

In this example you can see the following:

The endianness of the words, indicated by the BOM 0xFF 0xFE (little endian).
Each character being stored as two bytes (one 16-bit word), e.g. U is stored as 0x55 0x00.
The text at the end being stored as two words (4 bytes), 0x81 0x00 0x42 0x00: the extended character U+0081 followed by ‘B’ (U+0042).
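The same layout can be reproduced in Python as a quick sanity check. This is only a sketch with a made-up three-character sample. It uses the `utf-16-le` codec, which produces the same bytes as UCS-2 for characters in the Basic Multilingual Plane, and adds the BOM by hand because that codec does not write one.

```python
import codecs

text = "U\u0081B"  # hypothetical sample, not the file shown in the post

# "utf-16-le" emits no BOM itself, so prepend the little-endian BOM FF FE.
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
print(data.hex(" "))  # ff fe 55 00 81 00 42 00

# Each character is one 16-bit word with the low byte stored first.
for ch in text:
    print(f"U+{ord(ch):04X}", ch.encode("utf-16-le").hex(" "))
```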

If we delete the BOM, Notepad++ displays the encoding as UCS-2 Little Endian without BOM.
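For completeness, here is a small sketch of how a program might sniff the BOMs covered in this post. The function name `detect_bom` and the labels it returns are my own, and this is not a description of how Notepad++ itself works. When there is no BOM the function simply gives up and returns `None`; an editor then has to guess the encoding from the content.

```python
import codecs

# Only the BOMs discussed in this post; UTF-32 is deliberately left out.
_BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16/UCS-2 little endian"),
    (codecs.BOM_UTF16_BE, "UTF-16/UCS-2 big endian"),
]


def detect_bom(data):
    """Return a label for the BOM at the start of data, or None if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbfUnicode"))  # BOM present: UTF-8
print(detect_bom(b"\xff\xfe\x55\x00"))     # BOM present: little endian
print(detect_bom(b"\x55\x00\x6e\x00"))     # BOM deleted: None
```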