Monday 7 May 2012

Practical Unicode–Part 2

This is a follow-up to Practical Unicode Part 1 and covers what Notepad++ means when it describes a file's encoding.
The hex editor used in this post is HxD.

ANSI as UTF-8

This means that the file is UTF-8 encoded without a Byte Order Mark (BOM).
image
image
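
As a rough way to reproduce what the hex editor shows, here is a minimal Python sketch. The sample text "Unicode" is an assumption (the actual file contents are only visible in the screenshots), and hex_dump is a hypothetical helper, not part of any tool mentioned here. It encodes the text with the plain "utf-8" codec, which writes no BOM, and prints the bytes.

```python
def hex_dump(data: bytes) -> str:
    """Return the bytes as space-separated uppercase hex pairs, like one row of a hex editor."""
    return " ".join(f"{b:02X}" for b in data)


# Hypothetical sample text; the real file's contents are not known.
sample = "Unicode"

# The plain "utf-8" codec does not prepend a BOM.
encoded = sample.encode("utf-8")

print(hex_dump(encoded))                     # one byte per ASCII character
print(encoded.startswith(b"\xef\xbb\xbf"))   # False: no UTF-8 BOM at the start
```

Because every character here is ASCII, these bytes are identical to what an ANSI code page would produce, which is presumably why the editor's label mentions both.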

UTF-8

In this example I’ve added an extended character to show how ‘normal’ (ASCII) characters are stored as a single byte while others take multiple bytes.
image
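Here you can see the BOM 0xEF 0xBB 0xBF, then the bulk of the text stored as one byte per character, and finally the last three bytes 0xC2 0x81 0x42. Of these, 0xC2 0x81 is a two-byte sequence (the UTF-8 encoding of U+0081), while 0x42 is a single-byte character (‘B’); a byte below 0x80 is never part of a multi-byte sequence.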
image
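
The byte layout described above can be checked with a short Python sketch. The sample string is an assumption: it is built from the code points that the bytes 0xC2 0x81 0x42 decode to (U+0081 followed by 'B'), with "Unicode" as a hypothetical stand-in for the rest of the text. Python's "utf-8-sig" codec is used because it prepends the UTF-8 BOM when encoding.

```python
import codecs

# Hypothetical sample: "Unicode" is a stand-in; U+0081 and 'B' match the bytes 0xC2 0x81 0x42.
text = "Unicode\u0081B"

# "utf-8-sig" writes the 3-byte BOM followed by the UTF-8 encoded text.
encoded = text.encode("utf-8-sig")
bom, body = encoded[:3], encoded[3:]

print("BOM :", " ".join(f"{b:02X}" for b in bom))

# Show how many bytes each character occupies in UTF-8.
per_char = [(ch, ch.encode("utf-8")) for ch in text]
for ch, raw in per_char:
    hex_bytes = " ".join(f"{b:02X}" for b in raw)
    print(f"U+{ord(ch):04X} -> {hex_bytes} ({len(raw)} byte(s))")
```

Only U+0081 needs two bytes; every ASCII character, including the trailing 'B', stays at one.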

UCS-2 Little Endian

image
In this example you can see the following:
  • The endianness of the words, indicated by the BOM 0xFF 0xFE.
  • Each character being stored as two bytes (one 16-bit word), e.g. U is stored as 0x55 0x00.
  • The final bytes 0x81 0x00 0x42 0x00, which are two words (4 bytes): U+0081 followed by U+0042 (‘B’).
image
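
The points above can be reproduced with a small Python sketch. The sample string is again a hypothetical stand-in built from the code points the listed bytes imply. The BOM is added explicitly with codecs.BOM_UTF16_LE, because the "utf-16-le" codec on its own writes none. For characters in the Basic Multilingual Plane, such as these, UTF-16 and UCS-2 produce the same bytes.

```python
import codecs

# Hypothetical sample: "Unicode" is a stand-in; U+0081 and 'B' match the bytes listed above.
text = "Unicode\u0081B"

# "utf-16-le" writes no BOM by itself, so prepend the little-endian BOM (FF FE) explicitly.
encoded = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Split the byte stream into 16-bit words (2 bytes each).
words = [encoded[i:i + 2] for i in range(0, len(encoded), 2)]

for word in words:
    hex_bytes = " ".join(f"{b:02X}" for b in word)
    # Little endian: the low-order byte comes first.
    value = int.from_bytes(word, "little")
    print(f"{hex_bytes} -> 0x{value:04X}")
```

Reading the first word as little endian gives 0xFEFF, the BOM code point; a reader that sees 0xFFFE instead knows it has the byte order the wrong way round.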
If we delete the BOM, Notepad++ displays the encoding as UCS-2 Little Endian without BOM.
image
image
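
To illustrate why the BOM matters to an editor, here is a minimal sketch of BOM sniffing in Python. The function sniff_bom and its labels are hypothetical and are not how Notepad++ is implemented; without a BOM, an editor has to fall back on guessing from the content, which this sketch does not attempt.

```python
import codecs

# BOM signatures and the (hypothetical) labels this sketch reports for them.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8 with BOM"),
    (codecs.BOM_UTF16_LE, "UTF-16/UCS-2 little endian"),
    (codecs.BOM_UTF16_BE, "UTF-16/UCS-2 big endian"),
]


def sniff_bom(data: bytes):
    """Return a label if the data starts with a known BOM, otherwise None."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return None  # no BOM: the encoding would have to be guessed from the content


with_bom = codecs.BOM_UTF16_LE + "Unicode".encode("utf-16-le")
without_bom = with_bom[2:]  # same file with the first two bytes deleted

print(sniff_bom(with_bom))
print(sniff_bom(without_bom))
```

One limitation of this sketch: the UTF-32 little-endian BOM also begins with 0xFF 0xFE, so a more careful detector would check for that longer signature first.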