When a pandas DataFrame is saved (via PyTables) to HDF5 as a frame_table, the disk-space overhead varies with how many columns are declared as data_columns (i.e. columns you can use to select rows by). This overhead can be considerable, as the measurements below show.
import numpy
import pandas as pd

df = pd.DataFrame(numpy.random.randn(1000000, 3), columns=['a', 'b', 'c'])

# Fixed format (not queryable), with and without compression
df.to_hdf('data_table_nocomp.h5', 'data')                          # -> 32 MB
df.to_hdf('data_normal.h5', 'data', complevel=9, complib='bzip2')  # -> 21.9 MB

# Table format (queryable); format='table' replaces the older table=True keyword.
# Each column added to data_columns is stored (and indexed) separately.
df.to_hdf('data_table.h5', 'data', complevel=9, complib='bzip2', format='table')                                         # -> 22.5 MB
df.to_hdf('data_table_columns1.h5', 'data', complevel=9, complib='bzip2', format='table', data_columns=['a'])            # -> 29.1 MB
df.to_hdf('data_table_columns2.h5', 'data', complevel=9, complib='bzip2', format='table', data_columns=['a', 'b'])       # -> 35.8 MB
df.to_hdf('data_table_columns3.h5', 'data', complevel=9, complib='bzip2', format='table', data_columns=['a', 'b', 'c'])  # -> 42.4 MB
df.to_hdf('data_table_columns3_nocomp.h5', 'data', format='table', data_columns=['a', 'b', 'c'])                         # -> 52.4 MB
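The payoff for that overhead is queryability: only columns declared as data_columns (plus the index) can appear in a where clause when selecting rows from the file. A minimal sketch, reusing one of the files written above:

import pandas as pd

# 'a' was declared as a data_column when data_table_columns1.h5 was written,
# so we can filter on it without loading the whole table into memory.
subset = pd.read_hdf('data_table_columns1.h5', 'data', where='a > 0.5')
print(len(subset))

So the practical rule is to declare only the columns you actually intend to query on, since each one adds its own separately stored, indexed column to the file.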