Skip to main content

HDF5 is not for fast access

HDF5 is a good solution for storing large datasets on disk. Python's h5py library makes it possible to pretend that data stored on disk is just like an in memory array. It is important to keep in mind that the data is really stored on disk and is read in every time a slice or index into the data is taken.

import numpy
import h5py


def create_data(length=1e4):
  data = numpy.random.rand(length)
  with h5py.File('test.h5', 'w') as fp:
    fp.create_dataset('test', data=data)
  return data


def access_each_h5():
  y = 0
  with h5py.File('test.h5', 'r') as fp:
    for n in range(fp['test'].size):
      y += fp['test'][n]
  return y

def access_each_array(data):
  y = 0
  for n in range(data.size):
    y += data[n]
  return y


d = create_data()

>>> run test.py
>>> %timeit access_each_array(d)
100 loops, best of 3: 4.14 ms per loop
>>> %timeit access_each_h5()
1 loops, best of 3: 1.9 s per loop
That sobering difference in performance reminds us that we can't - performance wise - equate the two. When processing data from an hdf5 file, it is best to read in as large chunks as your memory will allow and do the heavy lifting in memory.

Comments

Popular posts from this blog

Flowing text in inkscape (Poster making)

You can flow text into arbitrary shapes in inkscape. (From a hint here).

You simply create a text box, type your text into it, create a frame with some drawing tool, select both the text box and the frame (click and shift) and then go to text->flow into frame.

UPDATE:

The omnipresent anonymous asked:
Trying to enter sentence so that text forms the number three...any ideas?
The solution:
Type '3' using the text toolConvert to path using object->pathSize as necessaryRemove fillUngroupType in actual text in new text boxSelect the text and the '3' pathFlow the text

Drawing circles using matplotlib

Use the pylab.Circle command

import pylab #Imports matplotlib and a host of other useful modules cir1 = pylab.Circle((0,0), radius=0.75, fc='y') #Creates a patch that looks like a circle (fc= face color) cir2 = pylab.Circle((.5,.5), radius=0.25, alpha =.2, fc='b') #Repeat (alpha=.2 means make it very translucent) ax = pylab.axes(aspect=1) #Creates empty axes (aspect=1 means scale things so that circles look like circles) ax.add_patch(cir1) #Grab the current axes, add the patch to it ax.add_patch(cir2) #Repeat pylab.show()

Pandas panel = collection of tables/data frames aligned by index and column

Pandas panel provides a nice way to collect related data frames together while maintaining correspondence between the index and column values:


import pandas as pd, pylab #Full dimensions of a slice of our panel index = ['1','2','3','4'] #major_index columns = ['a','b','c'] #minor_index df = pd.DataFrame(pylab.randn(4,3),columns=columns,index=index) #A full slice of the panel df2 = pd.DataFrame(pylab.randn(3,2),columns=['a','c'],index=['1','3','4']) #A partial slice df3 = pd.DataFrame(pylab.randn(2,2),columns=['a','b'],index=['2','4']) #Another partial slice df4 = pd.DataFrame(pylab.randn(2,2),columns=['d','e'],index=['5','6']) #Partial slice with a new column and index pn = pd.Panel({'A': df}) pn['B'] = df2 pn['C'] = df3 pn['D'] = df4 for key in pn.items: print pn[key] -> output …