Friday, October 25, 2013

Pandas: the frame_table disk space overhead

When a Pandas DataFrame is saved via PyTables to HDF5 as a frame_table, the disk space overhead varies with how many columns are declared as data_columns (columns that can be used in on-disk queries to select rows). This overhead can be rather high.


import pandas as pd, numpy

df = pd.DataFrame(numpy.random.randn(1000000,3),columns=['a','b','c'])
df.to_hdf('data_table_nocomp.h5','data') #-> 32 MB
df.to_hdf('data_normal.h5','data',complevel=9,complib='bzip2') #-> 21.9 MB
df.to_hdf('data_table.h5','data',complevel=9,complib='bzip2',table=True) #-> 22.5 MB
df.to_hdf('data_table_columns1.h5','data',complevel=9,complib='bzip2',table=True,data_columns=['a']) #-> 29.1 MB
df.to_hdf('data_table_columns2.h5','data',complevel=9,complib='bzip2',table=True,data_columns=['a','b']) #-> 35.8 MB
df.to_hdf('data_table_columns3.h5','data',complevel=9,complib='bzip2',table=True,data_columns=['a','b','c']) #-> 42.4 MB
df.to_hdf('data_table_columns3_nocomp.h5','data',table=True,data_columns=['a','b','c']) #-> 52.4 MB
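
The reason to pay this overhead is that data_columns can be queried on disk, so you can pull out matching rows without loading the whole frame. A minimal sketch (note that newer pandas versions spell table=True as format='table'; the file name here is arbitrary):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Arbitrary temporary path for the demo file.
path = os.path.join(tempfile.mkdtemp(), 'query_demo.h5')

df = pd.DataFrame(np.random.randn(100000, 3), columns=['a', 'b', 'c'])

# Store as a frame_table with 'a' declared as a data column,
# which makes it individually queryable on disk.
df.to_hdf(path, 'data', format='table', data_columns=['a'])

# Select only the rows where a > 1; the condition is evaluated
# by PyTables on disk, not after loading everything into memory.
subset = pd.read_hdf(path, 'data', where='a > 1')
print(len(subset), 'of', len(df), 'rows selected')
```

Columns not listed in data_columns (here 'b' and 'c') cannot appear in the where clause, which is the trade-off the sizes above are measuring.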
