The HDF5 File Format for Data
Keep moving right along if you’re not a total geek, because we’re about to get a little too geeky even for most geeks. I’m going to get excited about file I/O libraries, and one in particular called HDF5. 99.99% of the time people don’t worry about file formats at all, except when a file doesn’t automatically open with the correct application in their web browser.
For us coders, though, file formats and file I/O in general almost always become a pain point as we store more and more data. An application typically starts off with some very simple data format that somebody whips up quickly, and things are fine until success hits and suddenly your simple data assumptions are blown because your application has to scale up by an order of magnitude or two. There are a number of libraries to help with serializing object-type data, like protobuf and Cap’n Proto, but for large numeric matrix-type data the choices are sparser. Some of the key features that never seem important at the start, but that crop up over time, are:
- Data integrity - Disk drives work most of the time, but every once in a while they don’t, and then things get interesting. Does your application notice when the data has been corrupted or truncated by a bad drive? Or a bad network connection? Or does it just segfault without any notice at all?
HDF5 writes out files in chunks of data, and each chunk carries its own checksum, so you know when your file has been corrupted or truncated. Chunks are also useful when you only want to read a portion of the data and don’t want to parse the entire file: just jump to the chunk with your data and read it in.
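Here’s a minimal sketch using the h5py Python bindings (just one of the many HDF5 APIs; the file name and dataset path are made up for illustration). Each chunk gets a Fletcher-32 checksum, and a slice read only touches the chunks it needs:

```python
import h5py
import numpy as np

data = np.random.rand(1_000_000)

with h5py.File("example.h5", "w") as f:
    # Store the array in 10,000-element chunks, each protected by a
    # Fletcher-32 checksum written alongside the chunk.
    f.create_dataset("measurements", data=data,
                     chunks=(10_000,), fletcher32=True)

with h5py.File("example.h5", "r") as f:
    # Only the chunks covering this slice are read (and verified) from disk.
    subset = f["measurements"][500_000:510_000]
```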
- Extensible - Inevitably, as people start using your application they have great new ideas about your features. Can your files easily incorporate the data for these new features, or do you have to revise your file format every few months, which drives both you and your API users crazy?
HDF5 is like its own little filesystem, so adding a new table or new data to an existing file is easy. It works just like any other filesystem: you can create a group (like a directory) and collect datasets inside of that group. For example, your HDF5 file can have a dataset at “/hdf5/path/to/dataset”, and when you add a new feature you just open another dataset within the same file, like “/hdf5/path/to/other/dataset”. You don’t need to rewrite the entire file or change the file format; all of the old code just works and ignores the new group, and only the new code has to do anything with the new dataset.
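A quick sketch of that workflow, again with h5py (the group and dataset names here are hypothetical):

```python
import h5py
import numpy as np

# Version 1 of the application wrote this dataset.
with h5py.File("results.h5", "w") as f:
    f.create_dataset("/experiment/raw", data=np.arange(100))

# Months later, a new feature opens the same file in append mode and adds
# a new dataset under the same group. Old readers keep working; they simply
# never look at the new path.
with h5py.File("results.h5", "a") as f:
    f.create_dataset("/experiment/normalized", data=np.linspace(0.0, 1.0, 100))
```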
- Compression - Sooner or later you’ll want to compress your data. So many projects are drowning in data these days: web analytics, next-generation DNA sequencing, hospitals, Walmart, the list goes on and on. But if you just gzip a whole file it is very hard to get random access without unzipping the whole thing. By gzipping at the chunk level like HDF5 does, you get the best of both worlds: just look up the chunk that you want and unzip only that.
HDF5 compresses at the chunk level, getting the most out of your hard drive and your network.
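A sketch of chunk-level gzip compression with h5py (file and dataset names are illustrative):

```python
import h5py
import numpy as np

data = np.zeros((10_000, 1_000))  # highly compressible example data

with h5py.File("compressed.h5", "w") as f:
    # Each (100 x 1000) chunk is gzip-compressed independently on disk.
    f.create_dataset("matrix", data=data,
                     chunks=(100, 1_000),
                     compression="gzip", compression_opts=4)

with h5py.File("compressed.h5", "r") as f:
    # Random access: only the chunks holding rows 4200-4300 are decompressed.
    rows = f["matrix"][4200:4300, :]
```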
- Supported by different languages - The best thing for a project is when other developers want to pitch in and help out. People have amazing ideas and insights, and you want to make it as easy as possible for them to contribute. They don’t want to learn a new language or a bunch of new tools. Wouldn’t it be great if your file format were accessible from Python, Perl, MATLAB, R, C, Java, etc. without any work on your part or crazy SWIG wrappers?
HDF5 is supported by all of the above languages out of the box. If you can fit your data into its format, then developers can use their favorite tool to analyze, edit, view, and contribute to it. What could be better than that?
Check it out - I think it is one of the best libraries since sliced bread.
Some related projects: