Wednesday, December 30, 2009

Dublin Core in HTML and new releases

OK, so Sigil 0.1.7 was released a couple of days ago. It’s a bugfix release for the 0.1.x branch while work on 0.2.0 is underway… and then 0.1.8 was released about an hour ago.

It seems that lately I can’t make a release without breaking something. This time it was (ironically) a fix for issue #139 that was the problem: CSS was being jumbled up into one line. Everything still worked, but if you are importing an epub book with hand-coded CSS, you want it to stay human-readable. This fix has been removed and will be looked at after the redesign.

Normally I’d release a “b” version, but this was major enough to warrant a whole new release treatment with a notification update pushed to everyone using 0.1.7.

But onto lighter topics… what did 0.1.7 bring? Mostly fixes, as I’ve said. View switching used to be less than reliable and could occasionally cause Sigil to crash. This was caused by differences in what WebKit thinks is a node, and what Qt’s QDom implementation thinks is one. Apparently they can’t agree whether continuous whitespace is just one text node or several, even after you tell both of them to normalize the tree. Or what is a child node, and whose child is it anyway. Plus a few other disagreements.

The net effect was that Sigil would crash if the tree-descending instructions it created from WebKit’s DOM couldn’t be executed upon the QDom DOM. This should be remedied now. If Sigil can’t quite figure out where it needs to scroll the View, it will scroll to the place before the ambiguity begins. In other words, as close as it can get. Still, this is only needed in rare cases.
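
Conceptually, that fallback works something like this (a hypothetical sketch with made-up names, not the actual Sigil code): the target location is recorded as a list of child indices leading down from the root, and the descent simply stops at the first index that can’t be followed on the QDom side.

QDomNode DescendAsFarAsPossible( const QDomNode &root,
                                 const QList< int > &child_indices )
{
    QDomNode node = root;

    foreach( int index, child_indices )
    {
        QDomNodeList children = node.childNodes();

        // WebKit and QDom can disagree on how many children a node
        // has (continuous whitespace, for instance), so we stop
        // descending at the first instruction that can't be
        // executed on this tree.
        if ( index >= (int) children.count() )

            break;

        node = children.at( index );
    }

    return node;
}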

The other important fixes are the line ending issue and the encoding detection improvements I talked about in previous posts. For a complete list of what was fixed, refer to the changelog.

But there was also one new feature: Sigil now imports HTML metadata if that metadata conforms to the Dublin Core standard. It’s been requested, and Kevin Hendricks kindly provided the code that implements this.
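
If you want to try it out, Dublin Core metadata embedded in HTML looks something like this (example values, obviously); these lines go in the <head> element:

<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
<meta name="DC.title" content="Sylvie and Bruno" />
<meta name="DC.creator" content="Lewis Carroll" />
<meta name="DC.language" content="en" />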

An interesting discussion on what is currently supported and what can still be added has started on MobileRead. Anyone interested in this feature should check out that thread to see examples of how you can use this new functionality, and if you happen to have any ideas for its improvement, we’d all love to hear them.

Wednesday, December 23, 2009

The Redesign: New GUI

So how is the oft-touted redesign progressing? Well, it’s coming along nicely. Here’s what the new GUI looks like as of 5 minutes ago:

[Screenshot: “Sylvie and Bruno.epub - Sigil”]

It should mostly stay that way.

So you have a fancy new “Book Browser” docked to the left by default; you can drag it to the right, or pull it out into its own window if you want. It lists all the files that make up your book (notice how the files are not renamed anymore), in neat little folders:

  • Text
  • Styles
  • Images
  • Fonts
  • Misc

You will be able to right-click in the Book Browser to add/remove files, rename them (F2 also works), open file information and the like. The file icons are queried from the OS, so whatever is the default for your system is what you’ll see in Sigil too.

The files are sorted alphabetically in all folders except the Text folder, where they are sorted according to the (imported) reading order. It’s also the only folder where you can rearrange the files, as the order determines the aforementioned reading order on export.

CSS files are now also preserved, and not loaded as style tags. Double-clicking one opens it in a special CSS editor with CSS syntax highlighting. Sigil 0.2.0 should also ship with an image viewing (but not editing) tab. Somewhere down the line, when font embedding makes an appearance, there will be a font viewing tab as well. The “quick brown fox jumps over the lazy dog” kind. Double-click a font file in Windows and you’ll know what I mean.

The tabbed interface is also finally here. Very firefoxy. Yes, you can reorder the tabs. I hate it when applications don’t let me do that (like Foxit Reader), so Sigil is trying to be a bit smarter.

Before anyone goes all “ZOMG I CAN HAS NOW? MOAR!”, I’m merely 10% of the way through. This is just the UI, and its functionality. That’s the easy part. The very, very hard part was designing a new architecture for the back-end. But now comes the long and arduous task of integrating the old components into this new system. By “integrating” I mean rewriting. Not the Meta Editor, but the TOC editor definitely. All the importers too, plus the exporters. And a million other things. For instance, CSS and images are not currently loaded correctly in Book View.

After all of that is implemented, then of course I have to test everything.

When I feel I’ve gotten it to a point where others can try it, there will be a series of Release Candidates with the goal of shaking out any major bugs (and hopefully most of the minor ones). If I weren’t so terribly bogged down with university work, I could do all this in a few weeks. But since I am, it will probably be February before you see something you can install.

Yeah, it sucks.

Monday, December 14, 2009

Line endings and why you shouldn’t trust Nokia (Part 2)

So you’re using the Qt framework. You want to read an HTML file that probably has its encoding specified, but then again, maybe it doesn’t.

You’ll start off with something like this:

// Resolve the full path and open the file in text mode
QString fullpath = QFileInfo( "test.xhtml" ).absoluteFilePath();
QFile file( fullpath );
file.open( QFile::ReadOnly | QFile::Text );

// Read all the bytes, let Qt detect the codec, and decode
QByteArray data = file.readAll();
QString html_source = QTextCodec::codecForHtml( data )->toUnicode( data );

// This is here only for debug output
qDebug() << html_source;

Sounds reasonable, doesn’t it? And it looks reasonable too. But as I’ve said before, the QTextCodec::codecForHtml() function is horribly lacking. You’ll have to either write your own or at least augment this in some way.

But let’s say you have. Let’s say you called Magic, Inc.® and bought a replacement function that always returns the correct codec for a file. It doesn’t matter for this example; for what I’m about to show you, codecForHtml() is perfectly alright.

You now have this HTML file, encoded as UTF-16BE with Windows-style line endings (CRLF):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Test Page</title>
</head>
<body>
    <p>This is some text.</p>
    <p>This is some more text.</p>
</body>
</html>

You try to load this file with the above code. What does qDebug() print out?

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"

Yeah. That’s it.

Actually, there’s quite a bit more data in html_source, but it’s all junky, garbled text after that line, so it doesn’t even print out. So your file doesn’t load. You check the return of codecForHtml(), and by George, it actually got it right (it saw the BOM). It says the codec is UTF-16BE, which is correct. So what happened then?

Every experienced developer’s initial response is “OK so I screwed something up”. No one would ever initially doubt a mature, stable and popular framework like Qt. Blaming your framework is like blaming your compiler: you grow out of it when you realize it’s always you. But this time…

QFile has this great “feature” when you use QFile::Text as an OpenMode flag: if your file contains “CRLF” substrings (that is, Windows-style line breaks), it “automagically” converts them to “LF”, AKA Unix-style line breaks. Sadly, this doesn’t work so hot when the file is UTF-16 encoded: the conversion happens on the raw bytes, before any decoding, and since every UTF-16 character is at least two bytes wide, mangling individual bytes throws the rest of the stream out of alignment. And now your file is unreadable. Dare I mention it all works perfectly with Unix newlines? Tends to show Qt’s Unix roots and focus…

Sure, if you use QTextStream, things somehow manage to work themselves out. But the point is that QFile shouldn’t be doing this. The other problem caused by this line ending mangling nonsense is that files with Mac-style terminators (CR) get read as one large line of text. All of those “CR” characters go poof, and they’re gone! They’re not converted, they just disappear.

So Sigil yesterday switched to binary file reading and manual conversion of line endings. Importing old Mac files is now supported, as well as HTML files with UTF-16 encoding (which I assumed worked before… so much for assumptions).
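
The new approach, roughly (a sketch of the idea, not the literal Sigil source):

// Open in binary mode so Qt leaves the bytes alone...
QFile file( fullpath );
file.open( QFile::ReadOnly ); // note: no QFile::Text flag

// ...decode first, while the byte stream is still intact...
QByteArray data = file.readAll();
QString source = QTextCodec::codecForHtml( data )->toUnicode( data );

// ...and only then normalize line endings, on the decoded text.
// The order matters: CRLF has to go first, or every Windows line
// break would end up as two newlines.
source.replace( "\r\n", "\n" ); // Windows -> Unix
source.replace( '\r', '\n' );   // old Mac  -> Unix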

Sunday, December 13, 2009

Encodings and why you shouldn't trust Nokia (Part 1)

So you have an HTML file, and you want to import it into Sigil. That's great. I hope your file has an encoding specified. You need to have an encoding specified. An encoding tells Sigil (and anything else that wants to read your document) how your characters are represented in the file.

Why? Well, computers work with numbers, not characters. You can only ever store numbers on a computer. So to store a character, we assign it a number, and those numbers are then stored in the files. But how do we know which number represents which character? We don't. We have to agree on a mapping of numbers to characters and vice versa. This mapping is called a character encoding.
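
For instance, in ASCII (and UTF-8) the letter “A” is stored as the number 65. And to see why agreeing on the mapping matters, here’s what the same two bytes decode to under two different (real) encodings:

QByteArray bytes;
bytes.append( char( 0xC3 ) );
bytes.append( char( 0xA9 ) );

// With UTF-8 as the agreed mapping, these two bytes are the
// single character "é"...
QString as_utf8 = QTextCodec::codecForName( "UTF-8" )->toUnicode( bytes );

// ...but with Latin-1, the *same* bytes are the two characters "Ã©".
QString as_latin1 = QTextCodec::codecForName( "ISO 8859-1" )->toUnicode( bytes );

Same numbers, different characters. If you’ve ever seen “Ã©” somewhere an “é” should be, now you know why.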

So when your file doesn't have an encoding specified, Sigil doesn't know how to interpret the numbers in the file and translate them into the correct characters. And then your text is garbled.

So how do you specify an encoding in an HTML file? It's really simple:

<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />

Just put that line of code into the <head> element. That's the XHTML way. Be sure that you're using the correct encoding! Using the wrong one is worse than using none at all; if you don't specify an encoding, a smart application may be able to guess correctly (we'll get to that in a moment).

You can also use the XML way, if your file is XHTML:

<?xml version="1.0" encoding="UTF-8"?>

Sometimes though, even specifying the correct encoding doesn’t seem to be enough. As the linked issue illustrates, Sigil inherited a bug from Nokia’s "QTextCodec::codecForHtml()" function, which in all its brilliant wisdom chooses to only look for the meta tag in the first 512 characters (if I recall correctly). If the encoding is specified further down than that… well, it pretends it isn’t there. So I worked around that.
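
The workaround boils down to searching the entire file for the charset declaration instead of just the beginning. A simplified sketch of the idea (assuming the raw bytes are in a QByteArray called data; the real code does more checking than this):

QTextCodec *codec = NULL;

// Look for charset=... anywhere in the file, not just the start
QRegExp charset_re( "charset\\s*=\\s*[\"']?([^\"'\\s;>]+)" );

if ( charset_re.indexIn( QString::fromAscii( data.constData(), data.size() ) ) != -1 )

    codec = QTextCodec::codecForName( charset_re.cap( 1 ).toAscii() );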

And then we have the HTML files with no encoding whatsoever. There's a lot of them. So what should an application do then?

Guess, basically.

We can try to make it an educated guess, but let's not kid ourselves, it's still a guess.

Since v0.1.0, Sigil would fall back to UTF-8 when it couldn’t detect the encoding of your HTML file. No guessing, it just said: "what the hell, let’s read this like it’s UTF-8". It was a fairly good assumption. UTF-8 is the most widely used Unicode character encoding, and it’s backward compatible with ASCII (which in layman’s terms means "plain text"). A lot of people use it today as their default. It’s as good a fallback as any.

But then three months ago, issue #133 made its way to the tracker. Suddenly, the UTF-8 fallback didn't seem so hot. Made me rethink my original decision. So I changed the code to use the default encoding of the computer running Sigil, AKA the "locale-aware" fallback. This seemed to work better, and landed in v0.1.4.

But still… every time anyone tried to load a UTF-8 encoded document (with no encoding specified), it would get garbled like crazy, since no locale-specific encoding matches it.

After this bit me today (for the last time!), I decided it was time to revisit the problem. After some searching around on the web, I found this little nugget of Perl code for checking whether a string is UTF-8 encoded. I also found a C translation of that code somewhere (can’t find the link now), which I modified for use in Sigil. Here it is:

// This function goes through the entire byte array 
// and tries to see whether this is a valid UTF-8 sequence.
// If it's valid, this is probably a UTF-8 string.
bool Utility::IsValidUtf8( const QByteArray &string )
{
    // This is an implementation of the Perl code written here:
    //   http://www.w3.org/International/questions/qa-forms-utf-8
    //
    // Basically, UTF-8 has a very specific byte-pattern. This function
    // checks if the sent byte-sequence conforms to this pattern.
    // If it does, chances are *very* high that this is UTF-8.
    //
    // This function is written to be fast, not pretty.    

    if ( string.isNull() )

        return false;

    int index = 0;
    const unsigned char *bytes = NULL;

    while ( index < string.size() )
    {
        QByteArray dword = string.mid( index, 4 );

        if ( dword.size() < 4 )

            dword = dword.leftJustified( 4, '\0' );

        bytes = (const unsigned char *) dword.constData();

        // ASCII
        if (   bytes[0] == 0x09 ||
               bytes[0] == 0x0A ||
               bytes[0] == 0x0D ||
               ( 0x20 <= bytes[0] && bytes[0] <= 0x7E )                    
           ) 
        {
            index += 1;
        }

        // non-overlong 2-byte
        else if (  ( 0xC2 <= bytes[0] && bytes[0] <= 0xDF ) &&
                   ( 0x80 <= bytes[1] && bytes[1] <= 0xBF )            
                ) 
        {
            index += 2;
        }
           
        else if (  (     bytes[0] == 0xE0                         &&         // excluding overlongs 
                         ( 0xA0 <= bytes[1] && bytes[1] <= 0xBF ) &&
                         ( 0x80 <= bytes[2] && bytes[2] <= 0xBF )        ) || 
                  
                   (     (   ( 0xE1 <= bytes[0] && bytes[0] <= 0xEC ) ||     // straight 3-byte
                             bytes[0] == 0xEE                         ||
                             bytes[0] == 0xEF                     ) &&
                    
                         ( 0x80 <= bytes[1] && bytes[1] <= 0xBF )   &&
                         ( 0x80 <= bytes[2] && bytes[2] <= 0xBF )        ) ||

                   (     bytes[0] == 0xED                         &&         // excluding surrogates
                         ( 0x80 <= bytes[1] && bytes[1] <= 0x9F ) &&
                         ( 0x80 <= bytes[2] && bytes[2] <= 0xBF )        )
                 ) 
        {
            index += 3;
        }
 
          
        else if (    (   bytes[0] == 0xF0                         &&         // planes 1-3
                         ( 0x90 <= bytes[1] && bytes[1] <= 0xBF ) &&
                         ( 0x80 <= bytes[2] && bytes[2] <= 0xBF ) &&
                         ( 0x80 <= bytes[3] && bytes[3] <= 0xBF )      ) ||

                     (   ( 0xF1 <= bytes[0] && bytes[0] <= 0xF3 ) &&         // planes 4-15
                         ( 0x80 <= bytes[1] && bytes[1] <= 0xBF ) &&
                         ( 0x80 <= bytes[2] && bytes[2] <= 0xBF ) &&
                         ( 0x80 <= bytes[3] && bytes[3] <= 0xBF )      ) ||
                
                     (   bytes[0] == 0xF4                         &&         // plane 16
                         ( 0x80 <= bytes[1] && bytes[1] <= 0x8F ) &&
                         ( 0x80 <= bytes[2] && bytes[2] <= 0xBF ) &&
                         ( 0x80 <= bytes[3] && bytes[3] <= 0xBF )      )
                ) 
        {
            index += 4;
        }

        else
        {
            return false;
        }
    }

    return true;
}

This code is currently used as a last resort when the encoding isn't found. If this function says there's a high probability of the text being UTF-8, then it's read as such. Instant success with the test files I had on hand.
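
So the detection chain now goes roughly like this (a simplified sketch of the order, not the exact Sigil code):

// Try Qt's own detection first (BOM sniffing, the meta tag)...
QTextCodec *codec = QTextCodec::codecForHtml( data, NULL );

// ...fall back to the UTF-8 validity check...
if ( !codec && Utility::IsValidUtf8( data ) )

    codec = QTextCodec::codecForName( "UTF-8" );

// ...and only then give up and use the locale's default encoding.
if ( !codec )

    codec = QTextCodec::codecForLocale();

QString text = codec->toUnicode( data );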

I also added an explicit check for the "encoding" XML attribute which the "QTextCodec::codecForHtml()" function apparently doesn't check for at all. Don't ask me why, ask Nokia.

This, along with the UTF-16 auto-detection and the line-ending fix (more on that tomorrow), makes Sigil very robust when it comes to files with unspecified encodings. For the user, things should "just work" more often.