Pandas and PyTables: Variable assignment forces copy

I wish to report Pandas to the house unPythonic activities committee. Remember how in Python assignment binds names to objects rather than copying values, i.e. when you do something like:
a = b
what Python does is make the name a refer to the same object that b refers to; no data is copied (this holds even for simple objects like integers, though with those you can't observe it because they are immutable).

This is what tripped me up when I was learning Python. For example:
In [2]: a = {'age': 90, 'weight': 400}

In [3]: b = a

In [4]: a
Out[4]: {'age': 90, 'weight': 400}

In [5]: b
Out[5]: {'age': 90, 'weight': 400}

In [6]: b['age'] = 20

In [7]: b
Out[7]: {'age': 20, 'weight': 400}

In [8]: a
Out[8]: {'age': 20, 'weight': 400}
As you can see, changing b changes a, because the assignment made both names refer to the same dictionary.
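
If you actually want two independent objects you have to ask for a copy explicitly. A minimal sketch (dict() makes a shallow copy; copy.deepcopy also copies nested containers):

import copy

a = {'age': 90, 'weight': 400}
b = dict(a)               # shallow copy: b is now a distinct dict
b['age'] = 20
print(a['age'])           # 90 -- a is untouched

c = copy.deepcopy(a)      # deep copy, for when the values are themselves containers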

Now, when I was working with Pandas and its built-in PyTables interface, I learned the hard way that assigning a variable to an element of an HDF5 store copies the data from disk into the variable, rather than creating a reference to the data in the store.
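
To see that the assignment really is a copy, it's enough to mutate the loaded frame and then re-read the node; the file is unaffected. A minimal sketch (the file name demo.h5 is just for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6.0).reshape(3, 2))
with pd.HDFStore('demo.h5', mode='w') as store:
    store['data'] = df

with pd.HDFStore('demo.h5', mode='r') as store:
    p = store['data']        # pulls the node from disk into memory

p.iloc[0, 0] = -999.0        # mutate the in-memory copy only

with pd.HDFStore('demo.h5', mode='r') as store:
    print(store['data'].iloc[0, 0])   # 0.0 -- the file is unchanged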

If you run the following code you will find that the assignments actually copy the data from disk into the variables, rather than handing out references to the data in the HDF5 file.
import cProfile

import numpy as np
import pandas as pd

def create_file():
    # Write three copies of a 10000 x 1000 random frame to one HDF5 file
    r = np.random.randn(10000, 1000)
    p = pd.DataFrame(r)

    with pd.HDFStore('test.h5', mode='w') as store:
        store['data1'] = p
        store['data2'] = p
        store['data3'] = p

def load_file():
    print('Working on copy of data')
    with pd.HDFStore('test.h5', mode='r') as store:
        # Each assignment reads the whole node from disk into memory,
        # even though data2 and data3 are never used afterwards
        p1 = store['data1']
        p2 = store['data2']
        p3 = store['data3']
        print(p1[10])

def get_file():
    print('Working on hdf5 store reference')
    with pd.HDFStore('test.h5', mode='r') as store:
        # Only the one node we index into is read
        print(store['data1'][10])

create_file()
cProfile.run('load_file()')
cProfile.run('get_file()')
cProfile.run('load_file()')
cProfile.run('get_file()')
A sample output is:
python test.py | grep 'function calls'
         11109 function calls (10989 primitive calls) in 0.329 seconds
         7278 function calls (7238 primitive calls) in 0.053 seconds
         9540 function calls (9420 primitive calls) in 0.138 seconds
         7278 function calls (7238 primitive calls) in 0.054 seconds
Disregarding the first call, which includes one-time startup and import overhead, we see that load_file, which assigns the variables p1, p2, and p3 to the HDF5 nodes, ends up copying all three nodes from disk. That is why it takes roughly three times as long as get_file, even though data2 and data3 are never accessed.
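
One consequence: if you only need part of a large node, you can avoid paying for the full copy by writing it in table format, which supports partial reads. A sketch using HDFStore.put with format='table' and HDFStore.select, where start/stop pick out rows (the file name is illustrative):

import numpy as np
import pandas as pd

p = pd.DataFrame(np.random.randn(10000, 1000))
with pd.HDFStore('test_table.h5', mode='w') as store:
    # format='table' writes a queryable PyTables Table instead of a
    # fixed-format array, trading some write speed for partial reads
    store.put('data1', p, format='table')

with pd.HDFStore('test_table.h5', mode='r') as store:
    head = store.select('data1', start=0, stop=100)   # only 100 rows are read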
