Pandas and PyTables: Variable assignment forces copy

I wish to report Pandas to the house unPythonic activities committee. Remember how in Python assignments are by reference rather than by value, i.e. when you do something like:
a = b
Python does not copy anything; it just makes a refer to the same object as b (with immutable objects like integers you simply never notice the difference).

This is what tripped me up when I was learning Python. For example:
In [2]: a = {'age': 90, 'weight': 400}

In [3]: b = a

In [4]: a
Out[4]: {'age': 90, 'weight': 400}

In [5]: b
Out[5]: {'age': 90, 'weight': 400}

In [6]: b['age'] = 20

In [7]: b
Out[7]: {'age': 20, 'weight': 400}

In [8]: a
Out[8]: {'age': 20, 'weight': 400}
As you can see, changing b also changes a, because the assignment creates a reference.
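
A quick way to check this for yourself is the `is` operator, which tests whether two names point at the same object. The sketch below repeats the session above as a script; the extra name `c` and the use of `copy.copy` are additions for illustration, to show what an actual copy looks like by comparison.

```python
import copy

a = {'age': 90, 'weight': 400}
b = a              # b is a second name for the same dict
c = copy.copy(a)   # c is a new dict with the same contents (shallow copy)

b['age'] = 20      # mutate the dict through b

print(a is b)      # True: a and b are the same object
print(a is c)      # False: c is a separate object
print(a['age'])    # 20: the change made through b is visible through a
print(c['age'])    # 90: the copy is unaffected
```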

Now, when I was working with Pandas and its built-in PyTables interface I learned the hard way that when you assign an element of an hdf5 store to a variable, it copies the data from the hdf5 store into the variable, rather than creating a reference.

If you run the following code you will find that the assignments are actually copying the data from disk into the variables, rather than handing out references to the data in the hdf5 file:
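import pandas as pd, pylab, cProfile

def create_file():
  # 10000 x 1000 array of random floats, wrapped in a DataFrame
  r = pylab.randn(10000,1000)
  p = pd.DataFrame(r)

  # Write the same frame to three separate nodes of the hdf5 file
  # (newer pandas releases use pd.HDFStore in place of pd.get_store)
  with pd.get_store('test.h5','w') as store:
    store['data1'] = p
    store['data2'] = p
    store['data3'] = p

def load_file():
  print 'Working on copy of data'
  with pd.get_store('test.h5','r') as store:
    # Each assignment reads the whole node from disk into memory
    p1 = store['data1']
    p2 = store['data2']
    p3 = store['data3']
    print p1[10]  # only p1 is ever used

def get_file():
  print 'Working on hdf5 store reference'
  with pd.get_store('test.h5','r') as store:
    # Read a single node and index it directly
    print store['data1'][10]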

A sample output is:
python | grep 'function calls'
         11109 function calls (10989 primitive calls) in 0.329 seconds
         7278 function calls (7238 primitive calls) in 0.053 seconds
         9540 function calls (9420 primitive calls) in 0.138 seconds
         7278 function calls (7238 primitive calls) in 0.054 seconds
Disregarding the first call, which includes some strange startup code, we see that load_file, which assigns the variables p1, p2 and p3 to the hdf5 nodes, ends up copying all of the data over. That is why it takes so long to execute, even though p2 and p3 are never actually used.
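
Part of the explanation is probably that `store['data1']` is not a plain name lookup: indexing calls the store's `__getitem__`, which is free to do real work (here, read from disk) before the assignment binds a name to whatever it returns. Below is a minimal sketch of that mechanism using only the standard library. `DiskStore` is a hypothetical stand-in, not the actual pandas implementation, and it uses pickle rather than hdf5.

```python
import os
import pickle
import tempfile


class DiskStore(object):
    """Hypothetical stand-in for an on-disk store (not the pandas API)."""

    def __init__(self, path):
        self.path = path
        self.reads = 0  # how many lookups have gone to disk

    def _load(self):
        with open(self.path, 'rb') as f:
            return pickle.load(f)

    def __setitem__(self, key, value):
        # Load existing contents (if any), add the key, write everything back.
        data = self._load() if os.path.exists(self.path) else {}
        data[key] = value
        with open(self.path, 'wb') as f:
            pickle.dump(data, f)

    def __getitem__(self, key):
        # Every lookup deserializes from disk and returns a fresh object.
        self.reads += 1
        return self._load()[key]


tmpdir = tempfile.mkdtemp()
store = DiskStore(os.path.join(tmpdir, 'test.pkl'))
store['data1'] = list(range(5))

p1 = store['data1']   # triggers a full read
p2 = store['data1']   # triggers another full read

print(store.reads)    # 2
print(p1 == p2)       # True: same contents
print(p1 is p2)       # False: two independent in-memory objects
```

So p1 is still just a reference, but it refers to a new in-memory object that the lookup built. If you only need one node, index the store once and reuse the result, as get_file does.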

