Tuesday 15 February 2011

Byte Order Mark found using .NET BinaryReader class

Bug: Using .NET's BinaryReader class to read a file's contents as raw bytes also reads the Byte Order Mark (BOM) if the file was saved with one, as is common for files encoded in UTF-8 or UTF-16 (the encoding .NET calls "Unicode").
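byte[] data; // no need to pre-allocate: ReadBytes returns a new array
using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
{
     using (BinaryReader br = new BinaryReader(fs))
     {
          // Returns the bytes exactly as stored on disk, including any leading BOM.
          data = br.ReadBytes(size);
     }
}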

When reading in one of our UTF-8 encoded files using the above code, the first 3 bytes of the data byte array were the UTF-8 encoding of the BOM character.

Solution: Read in the file's contents as text using the .NET StreamReader class, which detects the BOM and leaves it out of the text it returns.

Detailed Explanation:
A tester on my team discovered a bug after writing some integration tests that ingested multiple CSV files. The code ingests the data within these files as bytes using the .NET BinaryReader class (see above code).

Upon investigation I noticed that the byte array began with 3 additional bytes, which had the integer values 239, 187 and 191 (0xEF, 0xBB, 0xBF) respectively. After reading Wikipedia I discovered that these 3 bytes are the UTF-8 encoding of the BOM Unicode character (code point U+FEFF, written '\uFEFF' as a C# char). The purpose of this character, according to Wikipedia, is to signal the endianness (byte order) of a text file or stream. This matters for UTF-16 and UTF-32, whose code units are 16-bit and 32-bit integers: the machine reading the encoded data needs to know its byte order so that it can decode the data correctly. UTF-8 is byte-oriented and has no byte-order ambiguity, so there the BOM serves only as a signature marking the file as UTF-8.
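To see where those three bytes come from, here is a minimal sketch that prints the BOM (which .NET calls the "preamble") for a few of the built-in encodings. The class name BomBytesDemo and the helper Utf8Bom are made up for illustration; Encoding.GetPreamble and BitConverter.ToString are standard .NET calls.

```csharp
using System;
using System.Text;

public static class BomBytesDemo
{
    // Hypothetical helper: the byte sequence .NET uses as the UTF-8 BOM.
    public static byte[] Utf8Bom()
    {
        return Encoding.UTF8.GetPreamble();
    }

    public static void Main()
    {
        // UTF-8: the three bytes 239, 187, 191 seen in the byte array.
        Console.WriteLine(BitConverter.ToString(Utf8Bom()));                                    // EF-BB-BF
        // UTF-16 little-endian (named "Unicode" in .NET).
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble()));               // FF-FE
        // UTF-16 big-endian: the same character with its two bytes swapped.
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble()));      // FE-FF
    }
}
```

The two UTF-16 lines show the byte-order signalling in action: a reader that sees FF FE knows the data is little-endian, while FE FF means big-endian.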

While thinking about and looking for a solution, I came across this Stack Overflow discussion and this one, which talked about simply stripping the BOM from the data if it's present. This didn't seem like the best approach, so I continued thinking and came across a pretty simple solution.
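For comparison, the stripping approach from those discussions might look roughly like the sketch below. It is not taken from either discussion: BomStripper and StripUtf8Bom are hypothetical names, and the sketch only handles the UTF-8 BOM.

```csharp
using System;
using System.Text;

public static class BomStripper
{
    // Hypothetical helper: if data starts with the UTF-8 BOM, returns a new
    // array without it; otherwise returns the original array unchanged.
    public static byte[] StripUtf8Bom(byte[] data)
    {
        byte[] bom = Encoding.UTF8.GetPreamble(); // EF BB BF

        if (data.Length < bom.Length)
        {
            return data; // too short to contain a BOM
        }

        for (int i = 0; i < bom.Length; i++)
        {
            if (data[i] != bom[i])
            {
                return data; // does not start with the BOM
            }
        }

        // Copy everything after the BOM into a new, shorter array.
        byte[] result = new byte[data.Length - bom.Length];
        Buffer.BlockCopy(data, bom.Length, result, 0, result.Length);
        return result;
    }
}
```

A version that also covered UTF-16 or UTF-32 would need one comparison per encoding, which is part of why this approach feels more fragile than letting the framework handle it.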

Instead of reading in the raw bytes from the file, I'd read in the actual text using the .NET StreamReader class. An added benefit of StreamReader is that several of its constructors take a bool parameter called detectEncodingFromByteOrderMarks. This highlighted the fact that the class handles the BOM when reading text from a stream, so if I know the encoding of the file I am reading, I can use the code below:
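string data;
using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
{
     // Arguments: stream, encoding, detectEncodingFromByteOrderMarks, bufferSize.
     // Note that size is only the reader's internal buffer size here, not a read limit.
     using (StreamReader sr = new StreamReader(fs, encoding, true, size))
     {
          data = sr.ReadToEnd();
     }
}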

The data string now contains only the text data from the file that was read in and not the BOM. Therefore, even if a file starts with the BOM character because its contents were saved as UTF-8, the StreamReader class (constructed with the parameters above) detects the BOM and omits it from the string read in.
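A quick way to check this behaviour is to write a small file with a BOM and read it back both ways. This is a minimal sketch, not our production code: the class name StreamReaderBomDemo, the ReadText helper, the temp file and the sample text "a,b,c" are all made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamReaderBomDemo
{
    // Hypothetical helper: reads the whole file as text, letting StreamReader
    // detect and skip a leading BOM.
    public static string ReadText(string path, Encoding encoding)
    {
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (StreamReader sr = new StreamReader(fs, encoding, true))
        {
            return sr.ReadToEnd();
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // UTF8Encoding(true) writes the 3-byte BOM at the start of the file.
            File.WriteAllText(path, "a,b,c", new UTF8Encoding(true));

            byte[] raw = File.ReadAllBytes(path);
            string text = ReadText(path, Encoding.UTF8);

            Console.WriteLine(raw.Length);  // 8: 3 BOM bytes + 5 text bytes
            Console.WriteLine(text.Length); // 5: the BOM is not part of the string
            Console.WriteLine(text);        // a,b,c
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

The raw byte count includes the BOM while the decoded string does not, which matches what we saw with our CSV files.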

2 comments:

Andrew said...

Joel on Software has a great article about what every software developer should know about encoding: http://www.joelonsoftware.com/articles/Unicode.html

Shabir said...

useful information

thanks
shabir hakim