Saturday, October 2, 2010

Introducing FlightCrew, the epub validator

I’ve been talking about this for a while under the name of “that epub validating library”, and now it has a name. That name is FlightCrew.

It’s a C++, cross-platform, native code epub validator (it’s also open source). The project is composed of three parts:

  1. FlightCrew, the validation library;
  2. FlightCrew-cli, the command-line front-end to the FlightCrew library;
  3. FlightCrew-gui, the GUI front-end to the FlightCrew library.

There are installers and DMG’s for download that package FlightCrew-gui, which provides a nice GUI interface to the underlying library. Errors have a reddish background (ok, it’s pink), while warnings have a yellow one. Here’s a screenshot:

I’ve kept the interface to a minimum on purpose. There’s something to be said about simplicity. As Antoine de Saint-Exupéry said: “Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”

You can also drag files from your desktop or file browser and drop them on the FlightCrew-gui window. This will instantly run validation checks on it. This drag-and-drop interface works on all platforms.

FlightCrew-cli is included in all packages of FlightCrew-gui, and does the same thing as that application only from a command-line interface. It works the way epubcheck works—feed it a file, it spews out warnings and errors if necessary.

The current version number for all this is 0.7.0, which I’m using as a sort-of indication of it’s completness. I started working on this back in July, but since there was a rather lovely summer between now and then, it has only had about a month and a half’s work put into it. It’s still roughly 20 KLOC with a complete test suite, so it’s no slouch.

Why FlightCrew is better than epubcheck

First off, “better” is a dirty word. Each tool has its pros and cons. Epubcheck’s (EC) advantage is that it checks for a few things FlightCrew (FC) doesn’t (yet). But the reverse is also true: FC checks for a lot of things EC doesn’t. Off the top of my head, FC performs an extensive reachability analysis and will warn you if you have some resources listed in the manifest that are not used anywhere. It will also report an error if you have an OPS1 document that the user can reach—through the <guide> or <tours> element, the NCX or just normal links in the text—but that is not listed in the <spine>. This is one crucial mistake that can now be caught. Reachability analysis also catches files that are used but not present in the manifest.

There are many other things that FC will check for that EC will not, and most of those you care about deeply and just don’t know it. The things EC checks for that FC doesn’t? Two “big ones”: OPF-listed fallbacks and DTBook syntax verification. If you haven’t heard of either, then you’ve never used them, never will and probably shouldn’t. These are very rarely used features of epub that I have personally never seen used in practice. But they’re big parts of the epub specifications so FC should check for them (and will, fairly soon) for the sake of completeness. There are a few other odds and ends that EC looks for but FC doesn’t.

But here’s where FC blows EC out of the water…

Error reporting done right

Let’s pretend I don’t know most of the epub specs by heart and that I’m a newcomer to epub. I made my first epub book, and I’ve heard that I should validate it. I’ve downloaded both FC and EC and now I’m going to use both. I’m going to use EC first because I’ve heard it’s “what the pros use”. Note that the a pair of EC/FC examples refers to the exact same problem with the file, and the messages usually come with line numbers (unless otherwise specified) which have been omitted. Commentary has been added for the sake of ridicule.

EC: length of first filename in archive must be 8, but was 19 [no line number, ed.]

Um… what? WTF is that supposed to mean? What filename? And why must it be exactly 8? What the hell are you talking about?

FC: Bytes 30-60 of your epub file are invalid. This means that one or more of the following rules are not satisfied:
  1. There needs to be a "mimetype" file in the root folder.
  2. Its content needs to be *exactly* "application/epub+zip".
  3. It needs to be the first file in the epub zip archive.
  4. It needs to be uncompressed. [no line number, ed.]

Ah… not only does this point out the problem (correctly!), it also tells me how to fix it. Nice.

EC: required attributes missing

Huh? I understand that you’re trying to tell me that some required attributes are missing (one or more? you haven’t said), but how about telling me which ones you frigging bastard. Am I supposed to read through the entire XHTML specification, hunting down which attributes this element should declare or even know that I’m supposed to do exactly that?

FC: missing required attribute 'alt'

Thanks! That was awesome. Saved me a ton o’ hassle.

EC: unfinished element

Exceedingly useful, that. Mind telling me how it’s unfinished?

FC: The <title> element is missing.

Now that’s more like it. I’d kiss you if I could.

EC: unfinished element

Didn’t you just say this? And on the exact same element? Why am I getting this again, I thought I fixed this…

FC: The <identifier> element is missing.

You just keep getting better and better!

EC: unfinished element

Fuck you, epubcheck. Fuck you.

FC: The <language> element is missing.

Want to pet my cat? ‘Cause you’re awesome, and I only let awesome things pet my cat.

Back to reality. I hope you got the point, cause if you haven’t, I can pull out tons of other examples.

Now I know that most of the error messages from EC are actually coming from an internal component of it called Jing and that has crappy error messages, but as a user I don’t care. Adobe should use something better instead, or fix Jing. And lots of the crappy messages come from EC core; that little “length of filename” gem was all theirs.

Have I also mentioned that EC development is pretty much dead? From it’s public source repository, it has had a whopping one source code commit in the last ten months, and that commit was four days ago.

In short, use FC first, then EC to get some of the checks FC doesn’t (yet) perform. After FC becomes a strict superset of all EC functionality (roughly a couple of months), drag epubcheck down to the cellar and shoot it in the back of the head.

Or just stop using it, if you prefer.

Footnotes

[1] A funny way of saying HTML document. It’s more than that of course, but for now mentally replace “OPS” with “HTML” .