Friday, August 30, 2013

Pandas: Hierarchical index and frame_table

https://github.com/pydata/pandas/issues/4710

UPDATE: I have to hand it to @jreback, who is a bug-fixing ninja. The bug has been fixed for 0.13, though data_columns will not handle a multi-index.

Wednesday, August 28, 2013

HDF5 (and Pandas using HDF5) is row oriented

From a nice hint here, and the docs here:

When you use pandas + HDF5 storage it is convenient to generate one small 'selector' table that you query to decide which rows you want. You then use that to retrieve the bulk data from separate tables which share the same index.

Originally I was appending columns to the main table, but there is no efficient way of doing that when using HDF5 (appending rows is efficient). Now I'm just creating new tables for the data, keeping the original index.
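
A minimal sketch of that layout (the file, node and column names below are all made up, and the string form of 'where' needs a newer pandas; older versions used Term objects). Pandas also ships helpers for exactly this pattern, append_to_multiple and select_as_multiple.

import pandas as pd, numpy as np

df = pd.DataFrame(np.random.randn(8,4), columns=['c0','c1','c2','c3'])
df['label'] = list('aabbccdd') #made-up column used for selection

with pd.HDFStore('experiment.h5', 'w') as store:
  #Small 'selector' table: just the column(s) you query on
  store.append('selector', df[['label']], data_columns=['label'])
  #Bulk table: the heavy data, appended separately but sharing the same row index
  store.append('bulk', df[['c0','c1','c2','c3']])

  #Query the cheap selector table for the matching row coordinates...
  rows = store.select_as_coordinates('selector', "label == 'b'")
  #...then pull only those rows from the bulk table
  subset = store.select('bulk', where=rows)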

An efficient way to store pandas data

OK, after much bellyaching I have a decent workflow for when I want to use Pandas, and it is actually quite convenient. Firstly, Pandas shines when I have heterogeneous data (mixed types) that forms nicely into columns and where I need to select a subset of rows because they satisfy certain conditions.

UPDATE: Fixed confusion between 'table' and 'store'
UPDATE: Include note about how to set data columns

The basic steps are these:
  1. Use table=True in .put or .to_hdf to indicate that you want the data stored as a frame_table that allows on-disk selection and partial retrieval
  2. Use data_columns= [...] during saving to identify which columns should be used to select data
You need to do BOTH steps to have a working selectable-table-on-disk.
  • If you do not use table=True you will get TypeError: cannot pass a where specification when reading from a non-table this store must be selected in its entirety
  • If you do not declare data_columns you will get ValueError: query term is not valid [field->...,op->...,value->...]

import pandas as pd

store = pd.HDFStore('filename.h5')

df = pd.DataFrame( ... ) #Construct some dataframe
#Save as a frame_table in filename.h5 and declare some data columns 
#append creates a table automatically 
store.append('data1', df, data_columns=[...]) 

df = pd.DataFrame( ... ) #Construct another dataframe 
#Put requires an explicit instruction to create a table
store.put('data2', df, table=True, data_columns=[...]) #This is convenient - it now adds a second node to the file 
 
 
Now you can use the battery of select methods (outlined here) to load just selected parts of the data structures.
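
For example, assuming one of the declared data columns is called 'foo' (a made-up name; the string form of 'where' needs a newer pandas, older versions used Term objects), a conditional read looks something like this:

#Pull only the rows of data1 where a declared data column satisfies a condition
subset = store.select('data1', where="foo > 5")

#Or walk over the matching rows in chunks to keep memory use down
for chunk in store.select('data1', where="foo > 5", chunksize=10000):
  do_something(chunk) #do_something is a stand-in for whatever processing you need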

h5py and pandas for large array storage

I've actually gone back to pure HDF5 (via the h5py interface) for storing and accessing numerical data. Pandas via PyTables got too complicated and started to get in the way of my analysis (I was spending too much time in the docs, testing out cases, etc.).

My application is simple. There is a rather large array of numbers that I would like to store on disk and load subsets of to perform operations on individual cells or sub-arrays. For this I found pandas to be a bad compromise: either I had to load all the data into memory at once, or I had to go through a really slow disk interface (which probably WAS loading everything into memory anyway). I just don't have the luxury of fighting with it for that long.

I'm seeing that pandas has a (kind of) proper way of doing what I'm doing, but in h5py it just seems more natural and less encumbering :(

UPDATE: So, as previously mentioned, Pandas shines as a database substitute, where you want to select subsets of data based on some criterion. Pandas has a method (to_hdf) that will save a dataframe as a PyTables table, which DOES allow you to do efficient sub-sampling via the 'select' method without loading everything into memory, but even that is pretty slow (and cumbersome) compared to directly pulling values with h5py. It works really nicely for actual conditional select statements, though. Code updated to reflect this.

Timing information for randomly accessing 1000 individual cells from a 1000x1000 array of floats
h5py                - 0.295 s
pandas frame        - 14.8 s   (f['data'] re-reads the whole table on each lookup)
pandas frame_table  - 3.943 s  
python test.py | grep 'function calls'
         95023 function calls in 0.295 seconds
         711312 function calls (707157 primitive calls) in 14.808 seconds
         1331709 function calls (1269472 primitive calls) in 3.943 seconds

import h5py, pandas as pd, numpy, cProfile

def create_data_files():
  #Store the same 1000x1000 array once via pandas/PyTables and once via h5py
  r = numpy.empty((1000,1000), dtype=float)
  df = pd.DataFrame(r)

  with pd.get_store('pandas.h5', mode='w') as f:
    f.append('data', df) #append stores it as a frame_table

  with h5py.File('h5py.h5','w') as f:
    f.create_dataset('data', data=r)

def access_h5py(idx):
  #Read individual cells straight from the dataset on disk
  with h5py.File('h5py.h5', 'r') as f:
    for n in range(idx.shape[0]):
      f['/data'][idx[n][0], idx[n][1]]

def access_pandas(idx):
  #f['data'] re-reads the whole frame from disk on every pass through the loop
  with pd.get_store('pandas.h5', mode='r') as f:
    for n in range(idx.shape[0]):
      f['data'].iloc[idx[n][0], idx[n][1]]

def slice_pandas(idx):
  #Use the frame_table select interface to pull one cell at a time
  with pd.get_store('pandas.h5', mode='r') as f:
    for n in range(idx.shape[0]):
      f.select('data', [('index', '=', idx[n][0]), ('columns', '=', idx[n][1])])

#create_data_files() #uncomment on the first run to generate the two test files
idx = numpy.random.randint(1000, size=(1000,2)) #1000 random (row, col) pairs
cProfile.run('access_h5py(idx)')
cProfile.run('access_pandas(idx)')
cProfile.run('slice_pandas(idx)')

Use Enthought Canopy outside of their environment

From hints on their blog and other places:

Canopy installs a virtual environment. The environment activate command is located at ~/Library/Enthought/Canopy_64bit/User/bin/activate. An easy way to access this environment is to alias the activate command in your startup file, e.g.:

# For canopy ipython
alias canpy='source ~/Library/Enthought/Canopy_64bit/User/bin/activate'

When in the environment use deactivate to exit.

I'm using Canopy because I found it insanely annoying to install HDF5 and h5py on Mac OS X 10.7.5.
I think my next laptop will be Linux...

Tuesday, August 27, 2013

Pandas: brief observations

After using Pandas for a little bit, I have a few observations:
  1. Pandas is great for database-like use. When you have tabular data from which you would like to efficiently select sub-tables based on criteria, Pandas is great.
  2. Pandas is great for time-series-like data, where the rows are ordered. In such cases pandas allows you to combine multiple tables, or plot, or do analyses based on the time-series nature of the rows.
  3. Pandas, however, is a little unwieldy when you wish to add rows (adding columns is very easy), and for data manipulation in general (see the sketch below).
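
To make point 3 concrete, here is a small sketch (the column names and values are made up): adding a column is a single assignment, while adding a row means building another frame and concatenating.

import pandas as pd

df = pd.DataFrame({'a': [1,2,3]})

#Adding a column is a one-line assignment
df['b'] = df['a']*10

#Adding a row means constructing another frame and gluing it on
new_row = pd.DataFrame({'a': [4], 'b': [40]})
df = pd.concat([df, new_row], ignore_index=True)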

Monday, August 19, 2013

Each access of a Pandas hdf5 store node is a re-copy from the file

This is obvious, but it is important to remember.
import pandas as pd, pylab, cProfile

def create_file():
  r = pylab.randn(10000,1000)
  p = pd.DataFrame(r)

  with pd.get_store('test.h5', mode='w') as store:
    store['data'] = p

def analyze(p):
  #Touch ten columns; the computed values themselves don't matter here
  return [(p[c] > 0).size for c in [0,1,2,3,4,5,6,7,8,9]]


def copy1():
  print 'Working on copy of data'
  with pd.get_store('test.h5', mode='r') as store:
    p = store['data']
    idx = analyze(p)
    print idx

def copy2():
  print 'Working on copy of data'
  with pd.get_store('test.h5', mode='r') as store:
    idx = analyze(store['data'])
    print idx

def ref():
  print 'Working on hdf5 store reference'
  with pd.get_store('test.h5', mode='r') as store:
    idx = [(store['data'][c] > 0).size for c in [0,1,2,3,4,5,6,7,8,9]]
    print idx

#create_file() #uncomment on the first run to generate test.h5
cProfile.run('copy1()')
cProfile.run('copy1()')
cProfile.run('copy2()')
cProfile.run('ref()')
Running this with python test.py | grep "function calls" gives us:
         5340 function calls (5256 primitive calls) in 0.094 seconds
         2080 function calls (2040 primitive calls) in 0.048 seconds
         2080 function calls (2040 primitive calls) in 0.050 seconds
         5661 function calls (5621 primitive calls) in 0.402 seconds
So, if you are going to do multiple operations on the data in a node, it is better to copy it over once (if you have the memory).

Friday, August 16, 2013

Pandas: presence of a NaN/None in a DataFrame forces column to float

import pandas as pd
a = [[1,2],[3,4]]
df = pd.DataFrame(a)

df-> 
   0  1
0  1  2
1  3  4

df.values ->
array([[1, 2],
       [3, 4]])

df.ix[1].values ->
array([3, 4])

a = [[1,None],[3,4]]
df = pd.DataFrame(a)

df->
   0   1
0  1 NaN
1  3   4

df.values ->
array([[  1.,  nan],
       [  3.,   4.]])

df[0].values ->
array([1, 3])

df[1].values ->
array([ nan,   4.])

df.ix[1].values ->
array([ 3.,  4.])

df[0][1] -> 3
df[1][1] -> 4.0
This threw me because I have a data structure that is all ints, but there are a few Nones in one column and that column was suddenly returned as floats.
As you can see, it is just the relevant column that is forced to float.

Thursday, August 15, 2013

Pandas and PyTables: Variable assignment forces copy

I wish to report Pandas to the House unPythonic Activities Committee. Remember how in Python assignments are by reference rather than by value, i.e. when you do something like:
a = b
what Python does is make a refer to the same object as b (the distinction only really matters for mutable objects; simple immutable values like integers behave as you would expect).

This is what tripped me up when I was learning Python. For example
In [2]: a = {'age': 90, 'weight': 400}

In [3]: b = a

In [4]: a
Out[4]: {'age': 90, 'weight': 400}

In [5]: b
Out[5]: {'age': 90, 'weight': 400}

In [6]: b['age'] = 20

In [7]: b
Out[7]: {'age': 20, 'weight': 400}

In [8]: a
Out[8]: {'age': 20, 'weight': 400}
As you can see, changing b changes a, because the assignment creates a reference.

Now, when I was working with Pandas and its built-in PyTables interface I learned the hard way that when you assign a variable to an element of an HDF5 store, it copies the data from the HDF5 store into the variable rather than creating a reference.

If you run the following code you will find that the assignments are actually copying the data from disk into the variables, rather than passing out references to the data in the hdf5 file.
import pandas as pd, pylab, cProfile

def create_file():
  r = pylab.randn(10000,1000)
  p = pd.DataFrame(r)

  with pd.get_store('test.h5', mode='w') as store:
    store['data1'] = p
    store['data2'] = p
    store['data3'] = p

def load_file():
  print 'Working on copy of data'
  with pd.get_store('test.h5', mode='r') as store:
    p1 = store['data1'] #each of these assignments copies a full frame from disk
    p2 = store['data2']
    p3 = store['data3']
    print p1[10]

def get_file():
  print 'Working on hdf5 store reference'
  with pd.get_store('test.h5', mode='r') as store:
    print store['data1'][10]

create_file()
cProfile.run('load_file()')
cProfile.run('get_file()')
cProfile.run('load_file()')
cProfile.run('get_file()')
A sample output is:
python test.py | grep 'function calls'
         11109 function calls (10989 primitive calls) in 0.329 seconds
         7278 function calls (7238 primitive calls) in 0.053 seconds
         9540 function calls (9420 primitive calls) in 0.138 seconds
         7278 function calls (7238 primitive calls) in 0.054 seconds
Disregarding the first call, which includes some strange startup code, we see that load_file, which assigns the variables p1, p2 and p3 to the hdf5 nodes, ends up copying all the data over, which is why it takes so long to execute, even though p2 and p3 are never actually accessed.

Monday, August 12, 2013

Sleep number bed LCDs are defective

We bought a Sleep Number bed about 5 years ago. These beds come with a '20 year' warranty, which sounds awesome because it makes one think that a) the beds are made well enough for the company to offer such a warranty and b) it's a nice warranty to have.

Well, it's not THAT great. About 2 years ago the LCD display on the controller started to go on the fritz. It started with one segment of one digit and then progressed until a few weeks ago the display was simply blank. I did a quick search on the internet and it turns out that this is a very common problem.

We have a wired controller (because it was cheaper, I guess, it was a while ago). The refurbished replacement is going to cost us $60 with shipping and the original one would have cost us $140 or so. It does seem that we are getting a nice discount on their catalog price, but I don't think this is such a good deal.

Anyhow, the pump is working fine, so the actual cost of the controller was probably $10 or so. I'm not entirely happy, unless it turns out the replacement is a double controller (it says Dual I-Series, I don't know what that means) and is wireless.

Saturday, August 10, 2013

Manipulating pandas data structures

I really enjoy using the Pandas Series and DataFrame objects. I find, however, that the methods for updating a series/frame are clunky. For a DataFrame it's pretty easy to add columns: you create a DataFrame or a Series and just assign it. But adding rows to a Series or DataFrame is awkward.

I sometimes have the need to modify a certain row with new data or add that row if it does not exist, which in a database would be a 'replace or insert' operation. You can concat or append another Series or DataFrame but I have not found a nice way of handling the 'replace or insert' case.

If the structure is small I simply convert it into a dictionary and manipulate the structure using the dictionary keys and then recreate the pandas structure.

If the structure is large I do an explicit test for the index (row) and then decide whether to append or replace, roughly as sketched below.
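
This is a minimal sketch of that approach (upsert_row is just a name I made up, and the sample frame is made up too); recent pandas will also enlarge the frame when you assign to a new label with df.loc[key] = row, which covers the same 'replace or insert' case.

import pandas as pd

def upsert_row(df, key, row):
  #Replace the row at `key` if it exists, otherwise append it
  if key in df.index:
    df.loc[key] = row
  else:
    df = pd.concat([df, pd.DataFrame([row], index=[key], columns=df.columns)])
  return df

df = pd.DataFrame({'a': [1,2], 'b': [3,4]}, index=['x','y'])
df = upsert_row(df, 'y', [20, 40]) #replaces the existing row 'y'
df = upsert_row(df, 'z', [5, 6])   #inserts a new row 'z'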


Thursday, August 8, 2013

DSLR vs compacts/micro four thirds

I'm what the marketing department at camera companies calls an 'enthusiast'. Previously I would have been called an amateur, but I guess 'enthusiast' doesn't have the stigma of 'clueless' that 'amateur' now has. I don't make money off photos; I take photos for pleasure and for memories.

I bought my DSLR when DSLR prices were plunging off a cliff, that is after all the professionals had subsidized sensor and lens development. I bought the D40. I got a DSLR for the following characteristics:
  1. Low shutter lag. This was probably the biggest deal for me. I like to capture the fleeting expressions on human faces and the compact was very frustrating with the long lag between focusing and then taking the picture.
  2. Good low light performance. The D40 works just fine for me up to ISO 1600. ISO 3200 is very noisy, and adding a nice prime lens that opens up to f/1.8 added a lot of artistic scope and improved low light performance.
The downside of even a small DSLR like the D40 is that it is large and conspicuous and not that quick to whip out when you need it.

This has turned my attention to the micro four thirds family. The larger sensor sizes are a great step up from compacts, but the form factors are so small! They also have interchangeable lenses.

Shutter lag is still a concern, but one thing I realised after using the D40 is that in low light (when a lot of my people portraits are done, round dinner tables and indoors) I have a long effective shutter lag because the focusing in low light is an issue.

What I depend a lot on in such situations is to focus on a sharp edge and then shoot a burst. Instead of waiting for the right moment, I estimate when the moment is going to come up and then hope that one of the images in the burst will carry the hidden expression.

The new 4/3 cameras I am seeing do bursts, do better ISO than the D40, are smaller/lighter AND they do movies, so I'm pretty sure my next camera is not going to be the D5100 (I was waiting for the price to drop steeply, or to find a refurbed one) but rather one of the 4/3s family.

UPDATE: I just found Thom Hogan's guide to m4/3. The guide is very useful.