Wednesday, August 28, 2013

An efficient way to store pandas data

OK, after much bellyaching I have a decent workflow for using Pandas, and it is actually quite convenient. First, Pandas shines when I have heterogeneous data (mixed types) that fits naturally into columns and I need to select a subset of rows that satisfy certain conditions.

UPDATE: Fixed confusion between 'table' and 'store'
UPDATE: Include note about how to set data columns

The basic steps are these:
  1. Use table=True in .put or .to_hdf to indicate that you want the data stored as a frame_table, which allows on-disk selection and partial retrieval. (Later pandas releases spell this option format='table'.)
  2. Use data_columns=[...] when saving to identify which columns can be used to select data.
You need to do BOTH steps to have a working selectable table on disk.
  • If you do not use table=True, you will get "TypeError: cannot pass a where specification when reading from a non-table this store must be selected in its entirety".
  • If you do not declare data_columns, you will get "ValueError: query term is not valid [field->...,op->...,value->...]".
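
The two failure modes above can be reproduced with a small sketch. It assumes PyTables is installed and a recent pandas release, where the table option is written format='table' rather than table=True. The column names, keys, and temporary file path are made up for illustration, and the exact error text may differ between pandas versions.

```python
import os
import tempfile

import pandas as pd

# Hypothetical example data with mixed column types.
df = pd.DataFrame({
    "label": ["a", "b", "c", "d"],
    "qty": [1, 5, 10, 20],
    "score": [0.5, 1.5, 2.5, 3.5],
})

path = os.path.join(tempfile.mkdtemp(), "example.h5")  # throwaway file

# Both steps done: table format plus a declared data column.
df.to_hdf(path, key="good", format="table", data_columns=["qty"])
subset = pd.read_hdf(path, key="good", where="qty > 4")

# Step 1 skipped: the default fixed format cannot be queried on disk.
# (Only the numeric columns are written here to keep the write simple.)
df[["qty", "score"]].to_hdf(path, key="fixed")
try:
    pd.read_hdf(path, key="fixed", where="qty > 4")
    fixed_error = None
except TypeError as exc:
    fixed_error = exc

# Step 2 skipped: 'score' was never declared as a data column.
try:
    pd.read_hdf(path, key="good", where="score > 1")
    column_error = None
except ValueError as exc:
    column_error = exc

print(subset)
print(type(fixed_error).__name__, type(column_error).__name__)
```

Only the query on the declared column of the table-format node succeeds; the other two reads fail before any data comes back.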

import pandas as pd

store = pd.HDFStore('filename.h5')

df = pd.DataFrame( ... )  # Construct some dataframe
# Save as a frame_table in filename.h5 and declare some data columns.
# append creates a table automatically.
store.append('data1', df, data_columns=[...])

df = pd.DataFrame( ... )  # Construct another dataframe
# put requires an explicit instruction to create a table.
# This is convenient: it adds a second node to the same file.
store.put('data2', df, table=True, data_columns=[...])
 
 
Now you can use the battery of select methods (outlined here) to load just selected parts of the data structures.
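
As a rough illustration of what those select calls look like, here is a minimal sketch. It assumes PyTables is installed and a reasonably recent pandas; the data, the key 'data1', and the temporary file are hypothetical stand-ins for your own.

```python
import os
import tempfile

import pandas as pd

# Hypothetical example data with mixed column types.
df = pd.DataFrame({
    "label": ["a", "b", "c", "d"],
    "qty": [1, 5, 10, 20],
    "score": [0.5, 1.5, 2.5, 3.5],
})

path = os.path.join(tempfile.mkdtemp(), "example.h5")  # throwaway file

with pd.HDFStore(path) as store:
    # append writes a table; qty and score become queryable columns.
    store.append("data1", df, data_columns=["qty", "score"])

    # Rows matching a condition on one data column.
    big = store.select("data1", where="qty >= 5")

    # Combine conditions and load only some of the columns.
    picked = store.select(
        "data1",
        where="(qty >= 5) & (score < 3.0)",
        columns=["label", "score"],
    )

    # Read a slice of rows by position instead of by condition.
    first_two = store.select("data1", start=0, stop=2)

print(big)
print(picked)
print(first_two)
```

The where string is evaluated against the file on disk, so only the matching rows are loaded into memory.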
