
HDF5 is not for fast access

HDF5 is a good solution for storing large datasets on disk. Python's h5py library makes it possible to treat data stored on disk as if it were an in-memory array. It is important to keep in mind, though, that the data really lives on disk and is read from the file every time you slice or index into it.

import numpy
import h5py


def create_data(length=10000):
  # size must be an int; newer NumPy versions reject a float such as 1e4
  data = numpy.random.rand(length)
  with h5py.File('test.h5', 'w') as fp:
    fp.create_dataset('test', data=data)
  return data


def access_each_h5():
  y = 0
  with h5py.File('test.h5', 'r') as fp:
    for n in range(fp['test'].size):
      y += fp['test'][n]  # every index is a separate read call through h5py
  return y

def access_each_array(data):
  y = 0
  for n in range(data.size):
    y += data[n]
  return y


d = create_data()

>>> run test.py
>>> %timeit access_each_array(d)
100 loops, best of 3: 4.14 ms per loop
>>> %timeit access_each_h5()
1 loops, best of 3: 1.9 s per loop
That sobering difference in performance (roughly 4 ms versus 1.9 s for the same 10,000 elements) reminds us that, performance-wise, we can't equate the two. When processing data from an HDF5 file, it is best to read in chunks as large as your memory will allow and do the heavy lifting in memory.
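As a rough sketch of that advice, the helper below sums a 1-D dataset by pulling one large slice at a time instead of one element at a time. The function name sum_in_chunks and the default chunk_size are made up for illustration and are not part of h5py; the function only assumes that the object it receives has a shape and returns a NumPy array when sliced, which holds for both an h5py dataset and a plain NumPy array.

```python
import numpy


def sum_in_chunks(dataset, chunk_size=100000):
  """Sum a 1-D dataset, reading chunk_size elements per slice.

  sum_in_chunks and chunk_size are hypothetical names for this sketch.
  """
  n = dataset.shape[0]
  total = 0.0
  for start in range(0, n, chunk_size):
    stop = min(start + chunk_size, n)
    block = dataset[start:stop]  # one read per chunk, not one per element
    total += float(numpy.sum(block))  # the arithmetic happens in memory
  return total
```

With the file from the example above, this would be called as sum_in_chunks(fp['test']) inside the with h5py.File('test.h5', 'r') as fp: block. If the whole dataset fits in memory, fp['test'][:] reads it into a NumPy array in a single step, and that is the simplest option.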
