OK, after much bellyaching I have a decent workflow for using Pandas, and it is actually quite convenient. First, Pandas shines when I have heterogeneous data (mixed types) that fits naturally into columns, and when I need to select a subset of rows that satisfy certain conditions.
UPDATE: Fixed confusion between 'table' and 'store'
UPDATE: Include note about how to set data columns
The basic steps are these:
- Use table=True in .put or .to_hdf to indicate that you want the data stored as a frame_table, which allows on-disk selection and partial retrieval (newer pandas versions spell this option format='table')
- Use data_columns=[...] when saving to identify which columns can be used to select data
- If you do not use table=True, any select with a where clause fails with:
    TypeError: cannot pass a where specification when reading from a non-table this store must be selected in its entirety
- If you do not declare data_columns, querying on a column fails with:
    ValueError: query term is not valid [field->...,op->...,value->...]
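Here is a minimal sketch of those two failure modes and the working case. It assumes a recent pandas (where the option is spelled format="table" rather than table=True) with PyTables installed. The file name, store keys, and column names are made up for illustration, and the exact error text may differ from the messages quoted above depending on your pandas version.

```python
import os
import tempfile

import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame(
    {"name": ["a", "b", "c", "d"], "value": [1.0, 2.5, 3.5, 4.0]}
)

# Write to a throwaway file so the sketch leaves nothing behind in the cwd.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
errors = {}

with pd.HDFStore(path) as store:
    # 1) Default (fixed) format: no on-disk selection is possible.
    store.put("fixed", df)
    try:
        store.select("fixed", where="value > 2")
    except TypeError as exc:
        errors["fixed"] = type(exc).__name__

    # 2) Table format, but 'value' was never declared as a data column.
    store.put("no_cols", df, format="table")
    try:
        store.select("no_cols", where="value > 2")
    except ValueError as exc:
        errors["no_cols"] = type(exc).__name__

    # 3) Table format with 'value' declared as a data column: the query works.
    store.put("with_cols", df, format="table", data_columns=["value"])
    subset = store.select("with_cols", where="value > 2")

print(errors)
print(subset)
```

The third case is the one you want: only the rows matching the condition are read back from disk.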
    import pandas as pd

    store = pd.HDFStore('filename.h5')
    df = pd.DataFrame( ... )  # Construct some dataframe

    # Save as a frame_table in filename.h5 and declare some data columns
    # append creates a table automatically
    store.append('data1', df, data_columns=[...])

    df = pd.DataFrame( ... )  # Construct another dataframe

    # put requires an explicit instruction to create a table
    store.put('data2', df, table=True, data_columns=[...])
    # This is convenient - it now adds a second node to the file
Now you can use the battery of select methods (outlined here) to load just selected parts of the data structures.
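As a rough illustration of what those select calls can look like, here is a small sketch, again assuming a recent pandas with PyTables installed. The 'weather' key and the city/temp/note columns are hypothetical and only there to show the shape of the calls.

```python
import os
import tempfile

import pandas as pd

# Hypothetical mixed-type data; names and values are illustrative only.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Oslo", "Pune"],
        "temp": [3.0, 19.5, -1.0, 31.0],
        "note": ["cold", "mild", "icy", "hot"],
    }
)

path = os.path.join(tempfile.mkdtemp(), "select_demo.h5")

with pd.HDFStore(path) as store:
    # append always writes a table; only city and temp become queryable.
    store.append("weather", df, data_columns=["city", "temp"])

    # Select rows by a condition on a string data column.
    oslo = store.select("weather", where="city == 'Oslo'")

    # Combine a row condition with a subset of columns.
    warm = store.select("weather", where="temp > 10", columns=["city", "temp"])

    # Read a slice of rows by position, without any condition.
    first_two = store.select("weather", start=0, stop=2)

    # Pull a single data column back as a Series.
    temps = store.select_column("weather", "temp")

print(oslo)
print(warm)
print(first_two)
print(temps)
```

Note that the 'note' column is still stored and returned, it just cannot appear in a where clause because it was not listed in data_columns.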