Saturday, October 2, 2010

Introducing FlightCrew, the epub validator

I’ve been talking about this for a while under the name of “that epub validating library”, and now it has a name. That name is FlightCrew.

It’s a C++, cross-platform, native code epub validator (it’s also open source). The project is composed of three parts:

  1. FlightCrew, the validation library;
  2. FlightCrew-cli, the command-line front-end to the FlightCrew library;
  3. FlightCrew-gui, the GUI front-end to the FlightCrew library.

There are installers and DMGs for download that package FlightCrew-gui, which provides a nice graphical front-end to the underlying library. Errors have a reddish background (ok, it’s pink), while warnings have a yellow one. Here’s a screenshot:

I’ve kept the interface to a minimum on purpose. There’s something to be said about simplicity. As Antoine de Saint-Exupéry said: “Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”

You can also drag files from your desktop or file browser and drop them on the FlightCrew-gui window. This will instantly run validation checks on them. This drag-and-drop interface works on all platforms.

FlightCrew-cli is included in all packages of FlightCrew-gui, and does the same thing as that application only from a command-line interface. It works the way epubcheck works—feed it a file, it spews out warnings and errors if necessary.

The current version number for all this is 0.7.0, which I’m using as a sort-of indication of its completeness. I started working on this back in July, but since there was a rather lovely summer between now and then, it has only had about a month and a half’s work put into it. It’s still roughly 20 KLOC with a complete test suite, so it’s no slouch.

Why FlightCrew is better than epubcheck

First off, “better” is a dirty word. Each tool has its pros and cons. Epubcheck’s (EC) advantage is that it checks for a few things FlightCrew (FC) doesn’t (yet). But the reverse is also true: FC checks for a lot of things EC doesn’t. Off the top of my head, FC performs an extensive reachability analysis and will warn you if you have some resources listed in the manifest that are not used anywhere. It will also report an error if you have an OPS[1] document that the user can reach (through the <guide> or <tours> element, the NCX, or just normal links in the text) but that is not listed in the <spine>. This is one crucial mistake that can now be caught. Reachability analysis also catches files that are used but not present in the manifest.
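To make the three reachability checks concrete, here is a minimal sketch in Python. It is not FlightCrew's actual code (that is C++), and every name in it is hypothetical: `reachability_report`, the shape of its arguments, and the simplification that only `application/xhtml+xml` counts as an OPS document are all assumptions made for illustration. The idea is a plain graph walk: start from the spine and the entry points (guide, tours, NCX), follow links, then compare what was reached with the manifest and the spine.

```python
# Simplifying assumption: only XHTML content documents count as OPS documents.
OPS_MEDIA_TYPES = {"application/xhtml+xml"}


def reachability_report(manifest, spine, links, entry_points):
    """Sketch of a reachability check over an epub's resources.

    manifest:     dict mapping href -> media type (the OPF <manifest>)
    spine:        list of hrefs in reading order (the OPF <spine>)
    links:        dict mapping href -> set of hrefs that file references
    entry_points: hrefs referenced from the <guide>, <tours> or the NCX
    """
    reachable = set()
    pending = list(spine) + list(entry_points)
    while pending:
        href = pending.pop()
        if href in reachable:
            continue
        reachable.add(href)
        # Follow every reference made by this file (empty if none recorded).
        pending.extend(links.get(href, ()))

    in_spine = set(spine)
    return {
        # Listed in the manifest but never used: a warning.
        "unused": sorted(set(manifest) - reachable),
        # Used somewhere but not listed in the manifest: an error.
        "unmanifested": sorted(reachable - set(manifest)),
        # OPS documents the reader can reach that are not in the spine: an error.
        "not_in_spine": sorted(
            href for href in reachable
            if manifest.get(href) in OPS_MEDIA_TYPES and href not in in_spine
        ),
    }


if __name__ == "__main__":
    xhtml = "application/xhtml+xml"
    manifest = {
        "ch1.xhtml": xhtml,
        "ch2.xhtml": xhtml,
        "notes.xhtml": xhtml,
        "style.css": "text/css",
        "orphan.png": "image/png",
    }
    spine = ["ch1.xhtml", "ch2.xhtml"]
    links = {
        "ch1.xhtml": {"style.css", "notes.xhtml"},
        "ch2.xhtml": {"cover.jpg"},
    }
    print(reachability_report(manifest, spine, links, entry_points={"ch1.xhtml"}))
```

In this made-up book, `orphan.png` would be reported as unused, `cover.jpg` as used but missing from the manifest, and `notes.xhtml` as a reachable document that is not in the spine. A real validator would also have to extract the links from the XHTML, CSS and NCX files, which this sketch takes as given.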

There are many other things that FC will check for that EC will not, and most of those you care about deeply and just don’t know it. The things EC checks for that FC doesn’t? Two “big ones”: OPF-listed fallbacks and DTBook syntax verification. If you haven’t heard of either, then you’ve never used them, never will and probably shouldn’t. These are very rarely used features of epub that I have personally never seen used in practice. But they’re big parts of the epub specifications so FC should check for them (and will, fairly soon) for the sake of completeness. There are a few other odds and ends that EC looks for but FC doesn’t.

But here’s where FC blows EC out of the water…

Error reporting done right

Let’s pretend I don’t know most of the epub specs by heart and that I’m a newcomer to epub. I made my first epub book, and I’ve heard that I should validate it. I’ve downloaded both FC and EC and now I’m going to use both. I’m going to use EC first because I’ve heard it’s “what the pros use”. Note that each pair of EC/FC examples refers to the exact same problem with the file, and that the messages usually come with line numbers (unless otherwise specified), which have been omitted. Commentary has been added for the sake of ridicule.

EC: length of first filename in archive must be 8, but was 19 [no line number, ed.]

Um… what? WTF is that supposed to mean? What filename? And why must it be exactly 8? What the hell are you talking about?

FC: Bytes 30-60 of your epub file are invalid. This means that one or more of the following rules are not satisfied:
  1. There needs to be a "mimetype" file in the root folder.
  2. Its content needs to be *exactly* "application/epub+zip".
  3. It needs to be the first file in the epub zip archive.
  4. It needs to be uncompressed. [no line number, ed.]

Ah… not only does this point out the problem (correctly!), it also tells me how to fix it. Nice.
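For the curious, the four rules in that message can be checked with very little code. The sketch below is Python using only the standard `zipfile` module; it is not how FlightCrew does it internally, and the function name `mimetype_problems` is made up for illustration. It also assumes the archive is otherwise a readable zip file.

```python
import io
import zipfile

EPUB_MIMETYPE = b"application/epub+zip"


def mimetype_problems(epub_bytes):
    """Return a list of violated "mimetype" rules (empty if all four hold)."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        entries = {info.filename: info for info in archive.infolist()}
        info = entries.get("mimetype")
        if info is None:
            # Rule 1: the file has to exist in the root folder.
            return ['There is no "mimetype" file in the root folder.']
        if archive.read(info) != EPUB_MIMETYPE:
            # Rule 2: exact content, no trailing newline or whitespace.
            problems.append('Its content is not exactly "application/epub+zip".')
        if info.header_offset != 0:
            # Rule 3: its local header must start at byte 0 of the archive.
            problems.append("It is not the first file in the archive.")
        if info.compress_type != zipfile.ZIP_STORED:
            # Rule 4: it must be stored, not deflated.
            problems.append("It is compressed.")
    return problems
```

Pass it the raw bytes of an epub, for example `mimetype_problems(open("book.epub", "rb").read())` where `book.epub` is whatever file you want to check. An empty list means the start of the archive looks right.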

EC: required attributes missing

Huh? I understand that you’re trying to tell me that some required attributes are missing (one or more? you haven’t said), but how about telling me which ones you frigging bastard. Am I supposed to read through the entire XHTML specification, hunting down which attributes this element should declare or even know that I’m supposed to do exactly that?

FC: missing required attribute 'alt'

Thanks! That was awesome. Saved me a ton o’ hassle.

EC: unfinished element

Exceedingly useful, that. Mind telling me how it’s unfinished?

FC: The <title> element is missing.

Now that’s more like it. I’d kiss you if I could.

EC: unfinished element

Didn’t you just say this? And on the exact same element? Why am I getting this again, I thought I fixed this…

FC: The <identifier> element is missing.

You just keep getting better and better!

EC: unfinished element

Fuck you, epubcheck. Fuck you.

FC: The <language> element is missing.

Want to pet my cat? ‘Cause you’re awesome, and I only let awesome things pet my cat.

Back to reality. I hope you got the point, cause if you haven’t, I can pull out tons of other examples.
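Naming the missing element is not hard, either. Here is a rough Python sketch of the idea for the three metadata elements from the examples above. It is an illustration only, not FlightCrew's implementation: the function name `missing_metadata` is hypothetical, and the sketch assumes an OPF 2.0 package document that uses the standard OPF and Dublin Core namespaces.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
DC_NS = "http://purl.org/dc/elements/1.1/"
REQUIRED = ("title", "identifier", "language")


def missing_metadata(opf_xml):
    """Return one message per required metadata element that is absent."""
    root = ET.fromstring(opf_xml)
    metadata = root.find(f"{{{OPF_NS}}}metadata")
    if metadata is None:
        return ["The <metadata> element is missing."]
    return [
        f"The <{name}> element is missing."
        for name in REQUIRED
        # Search all descendants so a nested <dc-metadata> wrapper is covered too.
        if metadata.find(f".//{{{DC_NS}}}{name}") is None
    ]


if __name__ == "__main__":
    opf = f"""<package xmlns="{OPF_NS}" version="2.0">
      <metadata xmlns:dc="{DC_NS}">
        <dc:title>My First Book</dc:title>
      </metadata>
    </package>"""
    for message in missing_metadata(opf):
        print(message)
```

All missing elements come back in one pass, each one named, so nobody has to fix one, re-run the validator and get the same vague message again.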

Now I know that most of the error messages from EC are actually coming from an internal component of it called Jing, which has crappy error messages, but as a user I don’t care. Adobe should use something better instead, or fix Jing. And lots of the crappy messages come from EC core; that little “length of filename” gem was all theirs.

Have I also mentioned that EC development is pretty much dead? Its public source repository shows a whopping one source code commit in the last ten months, and that commit was four days ago.

In short, use FC first, then EC to get some of the checks FC doesn’t (yet) perform. After FC becomes a strict superset of all EC functionality (roughly a couple of months), drag epubcheck down to the cellar and shoot it in the back of the head.

Or just stop using it, if you prefer.

Footnotes

[1] A funny way of saying HTML document. It’s more than that of course, but for now mentally replace “OPS” with “HTML”.

10 comments:

  1. While FlightCrew sounds great from a user perspective, it's a nightmare from a packager's one: You're bundling third party libraries (and sometimes only parts of them) in the source code. That's really, truly bad. Most distros have explicit rules against packaging such software because it leads to countless problems.
    You can find the reasons in detail here: http://blog.flameeyes.eu/2009/01/02/bundling-libraries-for-despair-and-insecurity

  2. I'm aware that Linux packagers have problems with projects that bundle other libraries in their source. I have been aware of this from the start.

    That "don't bundle other libs" rule favors the Linux OS, but creates *massive* problems for people who use other OSs and who want to build a project from source. On Linux, it usually boils down to "type this line into your terminal to instruct your package manager to install all the required dependencies". On Windows? There's no package manager there. People who want to build a project from source have to track down and build all the deps *by hand*. And when those deps have other deps, and when every project is built differently (and often has no explicit Windows support) this quickly makes the entire process impossible to finish by anyone sane. I know this from first hand experience, and I vowed not to do something similar myself.

    So I made a decision early on--for Sigil as well as for FlightCrew--that I choose the ease of building *for users and developers* over the ease of packaging for Linux repo maintainers. If that means my projects are left out of the repos, so be it. I make sure to compensate for this by providing installers for all the platforms and a trivial source build procedure.

    All that being said, Sigil uses the same system as FlightCrew does, and it's already in a few repos.

    I make sure not to modify the libs unless I explicitly need it, and even then I document it.

    For FlightCrew, there are a few bundled libs:

    BoostParts are the parts of boost that FC uses. Boost is modular; you can link FC to the full, system boost libs without problems.

    Utf8-cpp is header only. It requires no building.

    Xerces can be linked to the system version. It's used unmodified, except for a custom build system. Again, this is just for ease of developer use. The system version of Xerces will work just fine.

    XercesExtensions is part of FlightCrew; I made it specifically for this project.

    Zipios is a custom version of the upstream, since upstream is dead for this project and has been for many years. I added several features (like x64 support) and fixed many bugs. For that I will not apologize.

    Zlib is unchanged except for the build system. Like with Xerces, FC will work with system zlib.

    So aside from Zipios, system versions can be used for every lib.

  3. Thanks for the reply!

    In my experience, on Windows and MacOS, people want to click an icon and start whatever the icon stands for. Even I (I'm an Exherbo Linux developer and we're a source-based distro) don't build *anything* from source on either of those OSes. :-)

    Anyway, thanks for that documentation (did I miss it on the website? Would be most useful there so that we packagers can rip out the bundled stuff.) as it makes my life easier and will enable me to package FlightCrew properly.

  4. >>In my experience, on Windows and MacOS, people want to click an icon and start whatever the icon stands for. Even I (I'm an Exherbo Linux developer and we're a source-based distro) don't build *anything* from source on either of those OSes. :-)

    A common assumption that has caused me truly great amounts of pain. :)

    I too avoid building from source whenever I can, but when I need a lib that doesn't provide Windows binaries (and the vast majority don't), then I have no choice.

    The byzantine dependencies between OSS libs primarily created for Linux are the main reason you don't see them used more in Windows software. I know many, *many* people who have had the same experiences I've had and who complain about the same problem. Linux developers seem to take package managers for granted.

  5. Oh, and if you're about to include FC in your repo, do wait about a month. :)

    The first release was very quick-and-dirty since I was under certain deadlines. There will be a new release shortly that will have proper API and project documentation and a few extra features.

  6. Valloric: Do I take your last comment of 10 days or so ago to mean I should wait to dl and install FC? - Hitch

  7. My last comment pertained to packaging FC in a repository for use as a library for developers. For such uses, it's best to package a version with proper API and other docs.

    For everyday users of FC, it's good to go now. A new version will be arriving soon though.

  8. Hi Strahinja,
    Love sigil and flightcrew! Is there a binary built for the cli?

  9. Yes, a binary build is distributed in all the installers and packages of FlightCrew. For Windows, you'll find it in the same installation folder as flightcrew-gui. For Macs, it's in the DMG.
