Monday, December 14, 2009

Line endings and why you shouldn’t trust Nokia (Part 2)

So you’re using the Qt framework. You want to read an HTML file that probably has its encoding specified, but then again, maybe it doesn’t.

You’ll start off with something like this:

QString fullpath = QFileInfo( "test.xhtml" ).absoluteFilePath();
QFile file( fullpath );
file.open( QFile::ReadOnly | QFile::Text );

QByteArray data = file.readAll();
QString html_source = QTextCodec::codecForHtml( data )->toUnicode( data );

// This is here only for debug output
qDebug() << html_source;

Sounds reasonable, doesn’t it? And it looks reasonable too. But as I’ve said before, the QTextCodec::codecForHtml() function is horribly lacking. You’ll have to either write your own or at least augment this in some way.

But let’s say you have. Let’s say you called Magic, Inc.® and bought a replacement function that always returns the correct codec for a file. It doesn’t matter for this example, since for what I’m about to show you codecForHtml() is perfectly alright.

You now have this HTML file, encoded as UTF-16BE with Windows-style line endings (CRLF):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
<html xmlns="">
    <title>Test Page</title>
    <p>This is some text.</p>
    <p>This is some more text.</p>

You try to load this file with the above code. What does qDebug() print out?

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"

Yeah. That’s it.

Actually there’s quite a bit more data in html_source, but it’s all junky, garbled text after that line so it doesn’t even print out. So your file doesn’t load. You check the return of codecForHtml(), and by George, it actually got it right (it saw the BOM). It says the codec is UTF-16BE, which is correct. So what happened then?

Every experienced developer’s initial response is “OK so I screwed something up”. No one would ever initially doubt a mature, stable and popular framework like Qt. Blaming your framework is like blaming your compiler: you grow out of it when you realize it’s always you. But this time…

QFile has this great “feature” when you use QFile::Text as an OpenMode flag: if your file contains “CRLF” substrings (that is, Windows-style line breaks), it “automagically” converts them to “LF”, AKA Unix-style line breaks. Sadly, this doesn’t work so hot when the file is UTF-16 encoded, since the conversion operates on raw bytes and every character in UTF-16 takes (at least) two of them. Mess with a single byte and everything after it is misaligned. And now your file is unreadable. Dare I mention it all works perfectly with Unix newlines? Tends to show Qt’s Unix roots and focus…
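To see why a byte-level newline “fix” wrecks UTF-16, here’s a minimal sketch in plain standard C++ (no Qt involved). It assumes the text-mode conversion boils down to dropping every 0x0D byte from the stream. That is my guess at the mechanism, not something lifted from Qt’s source, and the helper names (encode_utf16be_ascii, strip_cr_bytes, decode_utf16be) are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Bytes = std::vector<unsigned char>;

// Encode an ASCII-only string as UTF-16BE: every character becomes
// a 0x00 byte followed by the character's own byte.
Bytes encode_utf16be_ascii(const std::string& text) {
    Bytes out;
    for (unsigned char c : text) {
        assert(c < 0x80);  // this sketch only handles ASCII input
        out.push_back(0x00);
        out.push_back(c);
    }
    return out;
}

// Hypothetical stand-in for the text-mode read: drop every 0x0D byte,
// with no idea that the stream is made of two-byte code units.
Bytes strip_cr_bytes(const Bytes& in) {
    Bytes out;
    for (unsigned char b : in) {
        if (b != 0x0D) {
            out.push_back(b);
        }
    }
    return out;
}

// Decode UTF-16BE bytes into code units. A trailing odd byte is ignored.
std::u16string decode_utf16be(const Bytes& in) {
    std::u16string out;
    for (std::size_t i = 0; i + 1 < in.size(); i += 2) {
        out.push_back(static_cast<char16_t>((in[i] << 8) | in[i + 1]));
    }
    return out;
}
```

Feed it "a\r\nb" and the eight bytes 00 61 00 0D 00 0A 00 62 shrink to seven. The “a” survives, but then the decoder pairs up 00 00 and 0A 00, and the rest of the stream is off by one byte from there on. With LF-only input nothing gets removed, so the text round-trips untouched.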

Sure, if you use QTextStream, things somehow manage to work themselves out. But the point is QFile shouldn’t be doing this. The other problem caused by this line ending mangling nonsense is that files with Mac-style terminators (CR) get read as one large line of text. All of those “CR” characters go puff, and they’re gone! They’re not converted, they just disappear.

So yesterday Sigil switched to binary file reading and manual conversion of line endings. Importing old Mac files is now supported, as well as HTML files with UTF-16 encoding (which I assumed worked before… so much for assumptions).
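The “manual conversion” part is simple once the bytes have been decoded into characters. Here’s a rough sketch in plain standard C++. It is not Sigil’s actual code, and normalize_newlines is a name I made up for the example. The idea is to read the file in binary mode, decode it with the right codec, and only then turn CRLF and lone CR into LF.

```cpp
#include <cstddef>
#include <string>

// Convert Windows (CRLF) and old Mac (CR) line endings to Unix (LF).
// Works on decoded UTF-16 code units, never on raw bytes.
std::u16string normalize_newlines(const std::u16string& text) {
    std::u16string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == u'\r') {
            out.push_back(u'\n');
            // Swallow the LF of a CRLF pair so it yields only one newline.
            if (i + 1 < text.size() && text[i + 1] == u'\n') {
                ++i;
            }
        } else {
            out.push_back(text[i]);
        }
    }
    return out;
}
```

In Qt terms this would presumably mean opening the file with just QFile::ReadOnly, running the bytes through the codec’s toUnicode(), and then doing the same replacement on the resulting QString. Because the conversion happens after decoding, the file’s encoding no longer matters.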