Sunday, December 13, 2009

Encodings and why you shouldn't trust Nokia (Part 1)

So you have an HTML file, and you want to import it into Sigil. That's great. I hope your file has an encoding specified. You need to have an encoding specified. An encoding tells Sigil (and anything else that wants to read your document) how your characters are represented in the file.

Why? Well, computers work with numbers, not characters. You can only ever store numbers on a computer. So to store a character, we assign it a number, and those numbers are then stored in the files. But how do we know which number represents which character? We don't. We have to agree on a mapping of numbers to characters and vice versa. This mapping is called a character encoding.
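For example, the exact same bytes mean completely different things depending on the encoding used to read them. Here's a tiny demonstration using Qt's "QTextCodec" (this isn't Sigil code, just an illustration):

#include <QByteArray>
#include <QTextCodec>
#include <QtDebug>

int main()
{
    // Two bytes: 0xC3 0xA9
    QByteArray bytes( "\xC3\xA9" );

    // Read as UTF-8, they form a single character: "é"
    qDebug() << QTextCodec::codecForName( "UTF-8" )->toUnicode( bytes );

    // Read as Latin-1, they form two characters: "Ã©"
    qDebug() << QTextCodec::codecForName( "ISO 8859-1" )->toUnicode( bytes );

    return 0;
}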

So when your file doesn't have an encoding specified, Sigil doesn't know how to interpret the numbers in the file and translate them into the correct characters. And then your text is garbled.

So how do you specify an encoding in an HTML file? It's really simple:

<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />

Just put that line of code into the <head> element. That's the XHTML way. Be sure that you're using the correct encoding! Using the wrong one is worse than using none at all; if you don't specify an encoding, a smart application may be able to guess correctly (we'll get to that in a moment).

You can also use the XML way, if your file is XHTML:

<?xml version="1.0" encoding="UTF-8"?>

Sometimes though, even specifying the correct encoding doesn't seem to be enough. As the linked issue illustrates, Sigil inherited a bug from Nokia's "QTextCodec::codecForHtml()" function, which in all its brilliant wisdom chooses to look for the meta tag only in the first 512 characters of the file (if I recall correctly). If the encoding is declared further down than that… well, it pretends it isn't there. So I worked around that.
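The workaround boils down to searching the whole file for the charset declaration instead of trusting "codecForHtml()" to find it. Something along these lines (the regex and function name here are my own, for illustration, not necessarily what's in the Sigil source):

#include <QByteArray>
#include <QRegExp>
#include <QString>

// Searches the *entire* file for a charset declaration,
// with no arbitrary 512-character limit.
QString GetDeclaredCharset( const QByteArray &data )
{
    // The declaration itself is plain ASCII,
    // so decoding as Latin-1 is safe for searching
    QString text = QString::fromLatin1( data.constData(), data.size() );

    QRegExp charset( "charset\\s*=\\s*[\"']?([A-Za-z0-9_-]+)" );
    charset.setCaseSensitivity( Qt::CaseInsensitive );

    if ( charset.indexIn( text ) != -1 )

        return charset.cap( 1 );

    return QString();
}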

And then we have the HTML files with no encoding whatsoever. There's a lot of them. So what should an application do then?

Guess, basically.

We can try to make it an educated guess, but let's not kid ourselves, it's still a guess.

Since v0.1.0, Sigil would fall back to UTF-8 when it couldn't detect the encoding of your HTML file. No guessing, it just said: "what the hell, let's read this like it's UTF-8". It was a fairly good assumption. UTF-8 is the most widely used Unicode character encoding, and it's backward compatible with ASCII (which in layman's terms means "plain text"). A lot of people use it today as their default. It's as good a fallback as any.

But then three months ago, issue #133 made its way to the tracker. Suddenly, the UTF-8 fallback didn't seem so hot, and it made me rethink my original decision. So I changed the code to use the default encoding of the computer running Sigil, AKA the "locale-aware" fallback. This seemed to work better, and it landed in v0.1.4.
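In Qt terms, the locale-aware fallback is essentially a one-liner (the wrapper function is mine, just to show the call):

#include <QByteArray>
#include <QString>
#include <QTextCodec>

// Decodes the data with the computer's default encoding:
// Windows-1252 on a typical Western European machine,
// Windows-1251 on a Russian one, and so on.
QString DecodeWithLocaleFallback( const QByteArray &data )
{
    return QTextCodec::codecForLocale()->toUnicode( data );
}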

But still… every time anyone tried to load a UTF-8 encoded document (with no encoding specified), it would get garbled like crazy, since no locale-specific encoding matches UTF-8.

After this bit me today (for the last time!), I decided it was time to revisit the problem. After some searching around on the web, I found this little nugget of Perl code for checking whether a string is UTF-8 encoded. I also found a C translation of that code somewhere (can't find the link now), which I modified for use in Sigil. Here it is:

// This function goes through the entire byte array 
// and tries to see whether this is a valid UTF-8 sequence.
// If it's valid, this is probably a UTF-8 string.
bool Utility::IsValidUtf8( const QByteArray &string )
{
    // This is an implementation of the Perl code written here:
    //   http://www.w3.org/International/questions/qa-forms-utf-8
    //
    // Basically, UTF-8 has a very specific byte-pattern. This function
    // checks if the sent byte-sequence conforms to this pattern.
    // If it does, chances are *very* high that this is UTF-8.
    //
    // This function is written to be fast, not pretty.    

    if ( string.isNull() )

        return false;

    int index = 0;
    const unsigned char *bytes = NULL;

    while ( index < string.size() )
    {
        QByteArray dword = string.mid( index, 4 );

        if ( dword.size() < 4 )

            dword = dword.leftJustified( 4, '\0' );

        bytes = (const unsigned char *) dword.constData();

        // ASCII
        if (   bytes[0] == 0x09 ||
               bytes[0] == 0x0A ||
               bytes[0] == 0x0D ||
               ( 0x20 <= bytes[0] && bytes[0] <= 0x7E )                    
           ) 
        {
            index += 1;
        }

        // non-overlong 2-byte
        else if (  ( 0xC2 <= bytes[0] && bytes[0] <= 0xDF ) &&
                   ( 0x80 <= bytes[1] && bytes[1] <= 0xBF )            
                ) 
        {
            index += 2;
        }
           
        else if (  (     bytes[0] == 0xE0                         &&         // excluding overlongs 
                         ( 0xA0 <= bytes[1] && bytes[1] <= 0xBF ) &&
                         ( 0x80 <= bytes[2] && bytes[2] <= 0xBF )        ) || 
                  
                   (     (   ( 0xE1 <= bytes[0] && bytes[0] <= 0xEC ) ||     // straight 3-byte
                             bytes[0] == 0xEE                         ||
                             bytes[0] == 0xEF                     ) &&
                    
                         ( 0x80 <= bytes[1] && bytes[1] <= 0xBF )   &&
                         ( 0x80 <= bytes[2] && bytes[2] <= 0xBF )        ) ||

                   (     bytes[0] == 0xED                         &&         // excluding surrogates
                         ( 0x80 <= bytes[1] && bytes[1] <= 0x9F ) &&
                         ( 0x80 <= bytes[2] && bytes[2] <= 0xBF )        )
                 ) 
        {
            index += 3;
        }

        else if (    (   bytes[0] == 0xF0                         &&         // planes 1-3
                         ( 0x90 <= bytes[1] && bytes[1] <= 0xBF ) &&
                         ( 0x80 <= bytes[2] && bytes[2] <= 0xBF ) &&
                         ( 0x80 <= bytes[3] && bytes[3] <= 0xBF )      ) ||

                     (   ( 0xF1 <= bytes[0] && bytes[0] <= 0xF3 ) &&         // planes 4-15
                         ( 0x80 <= bytes[1] && bytes[1] <= 0xBF ) &&
                         ( 0x80 <= bytes[2] && bytes[2] <= 0xBF ) &&
                         ( 0x80 <= bytes[3] && bytes[3] <= 0xBF )      ) ||
                
                     (   bytes[0] == 0xF4                         &&         // plane 16
                         ( 0x80 <= bytes[1] && bytes[1] <= 0x8F ) &&
                         ( 0x80 <= bytes[2] && bytes[2] <= 0xBF ) &&
                         ( 0x80 <= bytes[3] && bytes[3] <= 0xBF )      )
                ) 
        {
            index += 4;
        }

        else
        {
            return false;
        }
    }

    return true;
}

This code is currently used as a last resort when the encoding isn't found. If this function says there's a high probability of the text being UTF-8, then it's read as such. Instant success with the test files I had on hand.
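If you're wondering how all the pieces fit together, the decoding logic now looks roughly like this (a sketch, not the literal Sigil source; the function name is mine):

#include <QByteArray>
#include <QString>
#include <QTextCodec>

QString DecodeHtml( const QByteArray &data )
{
    // First, try to find a declared encoding
    // (with the meta tag workaround described above)
    QTextCodec *codec = QTextCodec::codecForHtml( data, NULL );

    if ( codec != NULL )

        return codec->toUnicode( data );

    // No declared encoding, but the bytes form a valid UTF-8
    // sequence? Then it's almost certainly UTF-8.
    if ( Utility::IsValidUtf8( data ) )

        return QTextCodec::codecForName( "UTF-8" )->toUnicode( data );

    // Last resort: the computer's default encoding
    return QTextCodec::codecForLocale()->toUnicode( data );
}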

I also added an explicit check for the "encoding" XML attribute, which the "QTextCodec::codecForHtml()" function apparently ignores completely. Don't ask me why, ask Nokia.
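For completeness, that explicit check looks something like this (again illustrative; the regex and function name are mine):

#include <QRegExp>
#include <QString>

// Looks for the encoding="..." attribute in the XML
// declaration and returns its value, if present.
QString GetXmlEncoding( const QString &text )
{
    QRegExp encoding( "<\\?xml[^>]*encoding\\s*=\\s*[\"']([^\"']+)[\"']" );
    encoding.setCaseSensitivity( Qt::CaseInsensitive );

    if ( encoding.indexIn( text ) != -1 )

        return encoding.cap( 1 );

    return QString();
}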

This, along with the UTF-16 auto-detection and the line-ending fix (more on that tomorrow), makes Sigil very robust when it comes to files with unspecified encodings. For the user, things should "just work" more often.