Sunday, November 10, 2013

Annoying bug in PyTables (affects Big Data analysis with Pandas)

Pandas HDF5 interface through PyTables is awesome because it allows you to select and process small chunks of data from a much larger data file stored on disk. PyTables, however, has an annoying and subtle bug and I just wanted to point you to it so that you don't have to spend hours debugging code like I did.

In short, if you have a DataFrame, and a column of that DF starts with a NaN, any select statements that you run with that conditions on that column will return empty (you won't get any results back, ever). There is a work around, but I chose to use a dummy value instead.

This shook my confidence in Pandas as an analysis platform a bit (though it is really PyTable's fault).

No comments:

Post a Comment