An efficient way to store pandas data

OK, after much bellyaching I have a decent workflow for using Pandas that is actually quite convenient. First, Pandas shines when I have heterogeneous data (mixed types) that fits nicely into columns, and when I need to select a subset of rows that satisfy certain conditions.

UPDATE: Fixed confusion between 'table' and 'store'
UPDATE: Include note about how to set data columns

The basic steps are these:

Use table=True in .put or .to_hdf (newer pandas releases spell this format='table') to indicate that you want the data stored as a frame_table, which allows on-disk selection and partial retrieval
Use data_columns=[...] when saving to identify which columns can be used to select data

You need to do BOTH steps to have a working selectable-table-on-disk.

If you do not use table=True, you will get: TypeError: cannot pass a where specification when reading from a non-table this store must be selected in its entirety
If you do not declare data_columns, you will get: ValueError: query term is not valid [field->...,op->...,value->...]
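To see both failure modes side by side, here is a minimal sketch. It assumes a recent pandas with PyTables installed, where the keyword is format="table" rather than table=True, and where the exact error text may differ from the messages quoted above. The column names "a" and "b", the node names and the file name are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data; the column names "a" and "b" are made up.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})
path = os.path.join(tempfile.mkdtemp(), "errors.h5")

errors = {}
with pd.HDFStore(path) as store:
    # Default (fixed) format: cannot be queried on disk.
    store.put("fixed", df)
    try:
        store.select("fixed", where="a > 1")
    except TypeError as exc:
        errors["fixed"] = exc

    # Table format, but "a" was never declared as a data column.
    store.put("table_no_dc", df, format="table")
    try:
        store.select("table_no_dc", where="a > 1")
    except ValueError as exc:
        errors["no_data_columns"] = exc

    # Table format plus data_columns: the query is accepted.
    store.put("ok", df, format="table", data_columns=["a"])
    selected = store.select("ok", where="a > 1")

for name, exc in errors.items():
    print(name, "->", type(exc).__name__)
print(selected)
```

Only the third node satisfies both steps, so only that select call returns rows.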

import pandas as pd

store = pd.HDFStore('filename.h5')

df = pd.DataFrame( ... )  # Construct some dataframe

# Save as a frame_table in filename.h5 and declare some data columns.
# append creates a table automatically.
store.append('data1', df, data_columns=[...])

df = pd.DataFrame( ... )  # Construct another dataframe

# put requires an explicit instruction to create a table
# (table=True here; newer pandas releases spell this format='table').
# Conveniently, this adds a second node to the same file.
store.put('data2', df, table=True, data_columns=[...])

Now you can use the battery of select methods (outlined here) to load just selected parts of the data structures.
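As a rough illustration of what such selections can look like, the sketch below writes a small table and reads parts of it back. It assumes a recent pandas with PyTables installed. The data, the column names (city, temp, visits) and the file name are hypothetical and do not come from the workflow above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical mixed-type data; names and values are made up.
df = pd.DataFrame(
    {
        "city": ["Boston", "Austin", "Boston", "Denver"],
        "temp": [12.5, 30.1, 15.0, 8.2],
        "visits": [3, 7, 1, 4],
    }
)
path = os.path.join(tempfile.mkdtemp(), "example.h5")

with pd.HDFStore(path) as store:
    # append always writes a table; data_columns makes these columns queryable
    store.append("data1", df, data_columns=["city", "temp"])

    # The condition is evaluated against the file; only matching rows come back
    warm = store.select("data1", where="temp > 12")

    # Conditions can be combined, and columns= limits which columns are returned
    boston = store.select(
        "data1",
        where="(city == 'Boston') & (temp > 12)",
        columns=["temp", "visits"],
    )

    # start/stop retrieve a slice of rows by position
    first_two = store.select("data1", start=0, stop=2)

print(warm)
print(boston)
print(first_two)
```

Note that "visits" can be returned through columns= but cannot appear in a where clause here, because it was not listed in data_columns.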

