Thursday, August 15, 2013

Pandas and PyTables: Variable assignment forces copy

I wish to report Pandas to the house unPythonic activities committee. Remember how in Python assignment binds a name to an object rather than copying its value, i.e. when you do something like:
a = b
what Python does is make a refer to the same object as b. (With immutable objects like integers you never notice the difference, because they cannot be changed in place.)

This is what tripped me up when I was learning Python. For example:
In [2]: a = {'age': 90, 'weight': 400}

In [3]: b = a

In [4]: a
Out[4]: {'age': 90, 'weight': 400}

In [5]: b
Out[5]: {'age': 90, 'weight': 400}

In [6]: b['age'] = 20

In [7]: b
Out[7]: {'age': 20, 'weight': 400}

In [8]: a
Out[8]: {'age': 20, 'weight': 400}
As you can see, changing b also changes a, because the assignment creates a reference rather than a copy.
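
A quick way to check whether two names refer to one object is the `is` operator, and an explicit copy breaks the link. A minimal sketch (the name `c` is only there for illustration):

```python
import copy

a = {'age': 90, 'weight': 400}
b = a                 # b is another name for the same dict
c = copy.copy(a)      # c is a new dict with the same contents (shallow copy)

b['age'] = 20         # visible through a, because a and b share one object
c['weight'] = 100     # not visible through a, because c is a separate dict

print(a is b)         # True
print(a is c)         # False
print(a)              # {'age': 20, 'weight': 400}
```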

Now, when I was working with Pandas and its built-in PyTables interface I learned the hard way that when you assign a variable to an element of an HDF5 store, it copies the data from the HDF5 store into the variable, rather than creating a reference.
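
One way to picture this: the cost most likely comes from the subscript lookup (`store[key]`), not from the `=` itself. The sketch below uses a toy, hypothetical `ToyStore` class backed by pickle files. It is not how pandas implements its HDF5 store, only an illustration of a container whose item lookup reads the whole object from disk and returns a new object every time.

```python
import os
import pickle
import tempfile

class ToyStore:
    """Hypothetical stand-in for a disk-backed store; not pandas code."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, key):
        return os.path.join(self.directory, key + '.pkl')

    def __setitem__(self, key, value):
        # Writing serializes the whole object to disk
        with open(self._path(key), 'wb') as f:
            pickle.dump(value, f)

    def __getitem__(self, key):
        # Reading deserializes the whole object into a brand-new object
        with open(self._path(key), 'rb') as f:
            return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    store = ToyStore(tmp)
    store['data1'] = list(range(5))

    p1 = store['data1']   # full read from disk happens here, in __getitem__
    q1 = p1               # plain assignment: no read, no copy
    p2 = store['data1']   # a second full read, giving a second object

print(q1 is p1)   # True
print(p2 is p1)   # False
print(p2 == p1)   # True
```

Seen this way, every `store[key]` on the right-hand side of an assignment pays for a complete read, whether or not the result is used afterwards.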

If you run the following code you will find that each assignment reads the full dataset from disk into the variable, rather than handing out a reference to the data in the HDF5 file.
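import cProfile  # presumably used by the profiling driver, which is not shown
import numpy as np
import pandas as pd

def create_file():
  # 10000 x 1000 random floats, stored three times under different keys
  r = np.random.randn(10000, 1000)
  p = pd.DataFrame(r)

  with pd.HDFStore('test.h5', 'w') as store:
    store['data1'] = p
    store['data2'] = p
    store['data3'] = p

def load_file():
  print('Working on copy of data')
  with pd.HDFStore('test.h5', 'r') as store:
    # Each lookup reads a full DataFrame from disk into memory
    p1 = store['data1']
    p2 = store['data2']
    p3 = store['data3']
    print(p1[10])  # only column 10 of p1 is actually used

def get_file():
  print('Working on hdf5 store reference')
  with pd.HDFStore('test.h5', 'r') as store:
    print(store['data1'][10])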

A sample output is:
python | grep 'function calls'
         11109 function calls (10989 primitive calls) in 0.329 seconds
         7278 function calls (7238 primitive calls) in 0.053 seconds
         9540 function calls (9420 primitive calls) in 0.138 seconds
         7278 function calls (7238 primitive calls) in 0.054 seconds
Disregarding the first call, which includes some one-time startup overhead, we see that load_file, which assigns the variables p1, p2 and p3 from the HDF5 nodes, ends up reading all three datasets into memory. That is why it takes so much longer to execute (0.138 seconds against roughly 0.054 seconds for get_file), even though p2 and p3 are never actually used.
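
The driver that produced these summary lines is not shown in the post. One way such lines could be generated is sketched below, using only the standard library; the helper `profile_summary` and the stand-in workload `busy_work` are hypothetical names, not part of the original script.

```python
import cProfile
import io
import pstats

def profile_summary(func):
    """Run func under cProfile and return the 'function calls' summary line."""
    profiler = cProfile.Profile()
    profiler.runcall(func)

    # Print the stats into a string buffer instead of stdout
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).print_stats()

    # Keep only the one-line summary, like the grep in the sample output
    for line in buffer.getvalue().splitlines():
        if 'function calls' in line:
            return line.strip()
    return ''

def busy_work():
    # Stand-in workload; swap in load_file or get_file to repeat the comparison
    return sum(i * i for i in range(100000))

print(profile_summary(busy_work))
```

Calling the helper on each function in turn, and more than once, makes it easier to separate startup costs from the cost of the reads themselves.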
