HDF5 is a data format designed by the National Center for Supercomputing Applications (NCSA) at UIUC for rapid access to large scientific data sets. Long story short, our PacBio raw reads came in HDF5 format, so we had no option but to start looking for the best tools to process them. We will briefly write down what we found (without too much explanation) so that you can save some time searching the web.
Q. What is HDF?
HDF stands for Hierarchical Data Format. If those three words do not mean much to you, the wiki has the best explanation of what they mean.
Q. Can we not just stick the data into an SQL database instead of going through all the trouble of learning a new format?
An SQL database and HDF solve two different problems. HDF is for very large data sets that have some kind of uniform structure. For example, say you have 500 GB of sequences in FASTA format with an average sequence length of 5,000 nucleotides. Storing the data in an SQL database and running access commands in client-server mode slows down access. So it is preferable to store the entire data set on the hard drive and access it locally. HDF works well on such data; it uses B-trees to index the data sets.
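To make the "uniform structure" point concrete, here is a minimal sketch using the h5py Python package (assuming h5py and NumPy are installed; the file name and dataset name are made up for illustration). It writes a few fixed-length sequences into a chunked HDF5 dataset, then slices a single record back out without loading the whole data set:

```python
import os
import tempfile

import h5py
import numpy as np

def write_and_slice():
    # Hypothetical file name, just for illustration.
    path = os.path.join(tempfile.mkdtemp(), "reads.h5")

    # Fixed-length byte strings: a uniform layout HDF5 handles well.
    seqs = np.array([b"ACGT" * 5, b"TTGCA" * 4, b"GGGCC" * 4], dtype="S25")

    with h5py.File(path, "w") as f:
        # chunks=True lets the library pick a chunk shape; chunked
        # datasets are what HDF5's B-tree index points into.
        f.create_dataset("reads", data=seqs, chunks=True)

    with h5py.File(path, "r") as f:
        # Reads only the requested element, not the whole data set.
        return f["reads"][0]

print(write_and_slice())
```

The point of the slice at the end is that access cost scales with what you ask for, not with the total file size.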
On the other hand, say you have 20 different data sets with genes, expression information in five tissues, functional annotation, homology with other organisms, etc., and you would like to ask questions such as 'which gene is present in human and mouse, expressed in human liver but not expressed in mouse brain, and has some keyword match with cancer?' SQL works better for such complex queries.
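That example question maps naturally onto SQL joins. Here is a self-contained sketch using Python's built-in sqlite3 module; the schema and sample rows are invented purely to illustrate the shape of the query:

```python
import sqlite3

def cancer_genes():
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE genes(id INTEGER PRIMARY KEY, symbol TEXT,
                           organism TEXT, annotation TEXT);
        CREATE TABLE expression(gene_id INTEGER, tissue TEXT,
                                expressed INTEGER);
        -- Invented sample data, purely for illustration.
        INSERT INTO genes VALUES
            (1, 'TP53', 'human', 'tumor suppressor, cancer'),
            (2, 'TP53', 'mouse', 'tumor suppressor, cancer'),
            (3, 'ALB',  'human', 'serum albumin');
        INSERT INTO expression VALUES
            (1, 'liver', 1),   -- expressed in human liver
            (2, 'brain', 0);   -- not expressed in mouse brain
    """)
    # Present in human AND mouse, expressed in human liver,
    # not expressed in mouse brain, annotation mentions cancer.
    rows = con.execute("""
        SELECT h.symbol
        FROM genes h
        JOIN genes m ON m.symbol = h.symbol AND m.organism = 'mouse'
        WHERE h.organism = 'human'
          AND h.annotation LIKE '%cancer%'
          AND EXISTS (SELECT 1 FROM expression e
                      WHERE e.gene_id = h.id
                        AND e.tissue = 'liver' AND e.expressed = 1)
          AND NOT EXISTS (SELECT 1 FROM expression e
                          WHERE e.gene_id = m.id
                            AND e.tissue = 'brain' AND e.expressed = 1)
    """).fetchall()
    con.close()
    return [r[0] for r in rows]

print(cancer_genes())  # ['TP53']
```

Expressing the same question against raw HDF5 datasets would mean writing the join logic yourself, which is exactly why SQL wins here.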
Q. What is HDF5?
It is a much-improved version of the original HDF format.
1. The best source to learn about the format is the HDF Group.
2. The user guide located here describes the C APIs for reading HDF5 files.
3. If you use R, here is a package to read HDF5 files.
4. If you use Python, please try this source.
We also found this source helpful.
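When you first open an unfamiliar HDF5 file (such as a PacBio bas.h5), the most useful first step is to walk its hierarchy and see which groups and datasets it contains. A small sketch with h5py (assuming h5py is installed; the group and dataset names in the demo file are invented for illustration):

```python
import os
import tempfile

import h5py

def list_datasets(path):
    """Return the full paths of every dataset in an HDF5 file."""
    names = []
    def visit(name, obj):
        # visititems calls this for every group and dataset;
        # keep only the datasets (the leaves holding actual data).
        if isinstance(obj, h5py.Dataset):
            names.append(name)
    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return sorted(names)

# Build a tiny demo file (hierarchy invented for illustration).
demo = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(demo, "w") as f:
    # Intermediate groups are created automatically.
    f.create_dataset("PulseData/BaseCalls/Basecall", data=[1, 2, 3])
    f.create_dataset("PulseData/BaseCalls/QualityValue", data=[30, 30, 30])

print(list_datasets(demo))
```

Once you know the dataset paths, you can slice into any of them directly with `f[path][start:stop]`.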
Q. I am a Perl user. What should I do?
Please google ‘go hang yourself’ and follow any of the links.