
Posts

Showing posts from August, 2013

HDF5 (and Pandas using HDF5) is row oriented

From a nice hint here, and the docs here: When you use pandas + HDF5 storage it is convenient to generate one table that is the 'selector' table that you use to index which rows you will select. Then you use that to retrieve the bulk data from separate tables which have the same index. Originally I was appending columns to the main table, but there is no efficient way of doing that when using HDF5 (appending rows is efficient). Now I'm just creating new tables for the data, keeping the original index.
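A minimal sketch of that selector/bulk-table pattern, assuming a small metadata table and a larger numeric table that share an index (file name, column names and the exact calls are illustrative; the 2013-era API spelled some of this differently):

import numpy as np
import pandas as pd

n = 1000
index = np.arange(n)
# Small, queryable 'selector' table of per-row metadata.
selector = pd.DataFrame({'trial_type': np.random.randint(0, 3, n)}, index=index)
# Bulk numeric data in a separate table, same index and row order.
bulk = pd.DataFrame(np.random.randn(n, 50),
                    index=index,
                    columns=['c%d' % i for i in range(50)])

with pd.HDFStore('experiment.h5', 'w') as store:
    store.append('selector', selector, data_columns=['trial_type'])
    store.append('bulk', bulk)

with pd.HDFStore('experiment.h5', 'r') as store:
    # Query the small selector table for the rows we want...
    coords = store.select_as_coordinates('selector', 'trial_type == 1')
    # ...then pull only those rows from the bulk table.
    subset = store.select('bulk', where=coords)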

An efficient way to store pandas data

OK, after much belly aching I have a decent work flow for when I want to use Pandas which is actually quite convenient. Firstly, Pandas shines for when I have heterogeneous data (mixed types) that form nicely into columns and where I need to select out a subset of rows because they satisfy certain conditions.

UPDATE: Fixed confusion between 'table' and 'store'
UPDATE: Include note about how to set data columns

The basic steps are these (a short sketch follows this excerpt):

- Use table=True in .put or .to_hdf to indicate that you want the data stored as a frame_table that allows on-disk selection and partial retrieval
- Use data_columns=[...] during saving to identify which columns should be used to select data

You need to do BOTH steps to have a working selectable-table-on-disk. If you do not use table=True you will get: TypeError: cannot pass a where specification when reading from a non-table this store must be selected in its entirety. If you do not declare data_columns you will get: ValueError: q
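A minimal sketch of those two steps together, assuming a toy frame with a 'species' column (the names are made up; format='table' is the newer spelling of table=True):

import pandas as pd

df = pd.DataFrame({'species': ['cat', 'dog', 'cat', 'fox'],
                   'weight': [3.2, 9.1, 4.0, 5.5]})

with pd.HDFStore('animals.h5', 'w') as store:
    # Step 1: store as a frame_table; Step 2: declare the query column.
    store.put('df', df, format='table', data_columns=['species'])

with pd.HDFStore('animals.h5', 'r') as store:
    # Works only because both steps above were done.
    cats = store.select('df', where="species == 'cat'")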

h5py and pandas for large array storage

I've actually gone back to pure hdf5 (via the h5py interface) for storing and accessing numerical data. Pandas via PyTables started to get too complicated and started to get in the way of my analysis (I was spending too much time on the docs, and testing out cases etc.). My application is simple. There is a rather large array of numbers that I would like to store on disk and load subsets of it to perform operations on cells/subsets. For this I found pandas to be a bad compromise. Either I had to load all the data all at once into memory, or I had to go through a really slow disk interface (which probably WAS loading everything into memory at the same time). I just don't have the luxury to fight with it so long. I'm seeing that pandas has a (kind of) proper way of doing what I'm doing, but in h5py it just seems more natural and less encumbering :( UPDATE: So, as previously mentioned, Pandas shines as a database substitute, where you want to select subsets of data bas
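For comparison, a minimal h5py sketch of that workflow, assuming one large 2-D array (names and shapes are made up); slicing the dataset reads only the requested block from disk:

import numpy as np
import h5py

with h5py.File('measurements.h5', 'w') as f:
    f.create_dataset('raw', data=np.random.randn(100000, 64), chunks=True)

with h5py.File('measurements.h5', 'r') as f:
    # Only rows 1000-1999 are read into memory.
    block = f['raw'][1000:2000, :]
    col_means = block.mean(axis=0)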

Use Enthought Canopy outside of their environment

From hints on their blog and other places: Canopy installs a virtual environment. The environment activate command is located at ~/Library/Enthought/Canopy_64bit/User/bin/activate. An easy way to access this environment is to alias it in your startup file, e.g.:

# For canopy ipython
alias canpy='source ~/Library/Enthought/Canopy_64bit/User/bin/activate'

When in the environment, use deactivate to exit. I'm using Canopy because I found it insanely annoying to install hdf5 and h5py on Mac 10.7.5. I think my next laptop will be Linux...

Pandas: brief observations

After using Pandas for a little bit, I have a few observations (a small selection example follows the list):

- Pandas is great for database-like use. When you have tabular data from which you would like to efficiently select sub-tables based on criteria, Pandas is great.
- Pandas is great for time-series like data, where the rows are ordered. In such cases pandas allows you to combine multiple tables, or plot, or do analyses based on the time series nature of the rows.
- Pandas, however, is a little unwieldy when you wish to add rows (adding columns is very easy) and in data manipulation in general.
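A minimal sketch of the database-like selection that works so well (the frame and the criteria are made up):

import pandas as pd

df = pd.DataFrame({'subject': ['s1', 's2', 's3', 's4'],
                   'age': [23, 31, 45, 29],
                   'score': [0.7, 0.9, 0.4, 0.8]})

# Select the sub-table of rows satisfying the criteria.
good = df[(df['age'] > 25) & (df['score'] > 0.5)]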

Each access of a Pandas hdf5 store node is a re-copy from the file

This is obvious, but it is important to remember.

import pandas as pd, pylab, cProfile

def create_file():
    r = pylab.randn(10000,1000)
    p = pd.DataFrame(r)
    with pd.get_store('test.h5','w') as store:
        store['data'] = p

def analyze(p):
    return [(p[c] > 0).size for c in [0,1,2,3,4,5,6,7,8,9]]

def copy1():
    print 'Working on copy of data'
    with pd.get_store('test.h5','r') as store:
        p = store['data']
        idx = analyze(p)
        print idx

def copy2():
    print 'Working on copy of data'
    with pd.get_store('test.h5','r') as store:
        idx = analyze(store['data'])
        print idx

def ref():
    print 'Working on hdf5 store reference'
    with pd.get_store('test.h5','r') as store:
        idx = [(store['data'][c] > 0).size for c in [0,1,2,3,4,5,6,7,8,9]]
        print idx

#create_file()
cProfile.run('copy1()')
cProfile.run('copy1()')
cProfile.run('

Pandas: presence of a NaN/None in a DataFrame forces column to float

import pandas as pd

a = [[1,2],[3,4]]
df = pd.DataFrame(a)
df ->
   0  1
0  1  2
1  3  4
df.values -> array([[1, 2], [3, 4]])
df.ix[1].values -> array([3, 4])

a = [[1,None],[3,4]]
df = pd.DataFrame(a)
df ->
    0   1
0   1 NaN
1   3   4
df.values -> array([[ 1., nan], [ 3., 4.]])
df[0].values -> array([1, 3])
df[1].values -> array([ nan, 4.])
df.ix[1].values -> array([ 3., 4.])
df[0][1] -> 3
df[1][1] -> 4.0

This threw me because I have a data structure that is all ints, but I have a few Nones in one column and that column was suddenly returned as floats. As you can see it's just the relevant column that is forced to float.

Pandas and PyTables: Variable assignment forces copy

I wish to report Pandas to the house unPythonic activities committee. Remember how in Python assignments are by reference rather than by value, i.e. when you do something like a = b, what Python does is create a reference from a to b (except for very simple objects like integers). This is what tripped me up when I was learning Python. For example:

In [2]: a = {'age': 90, 'weight': 400}
In [3]: b = a
In [4]: a
Out[4]: {'age': 90, 'weight': 400}
In [5]: b
Out[5]: {'age': 90, 'weight': 400}
In [6]: b['age'] = 20
In [7]: b
Out[7]: {'age': 20, 'weight': 400}
In [8]: a
Out[8]: {'age': 20, 'weight': 400}

As you can see, changing b changes a because the assignment creates a reference. Now, when I was working with Pandas and its built in PyTables interface I learned the hard way that when you assign a variable to an element of a hdf5 store it copies the data from the hdf5 store into the varia
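A minimal sketch of that copy-on-access behaviour, using pd.HDFStore (the file and column names are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])

with pd.HDFStore('copytest.h5', 'w') as store:
    store['data'] = df

with pd.HDFStore('copytest.h5', 'r') as store:
    p = store['data']        # this is a copy, not a reference
    p['a'] = -1              # changes only the in-memory copy
    q = store['data']        # a second, independent copy read from disk
    print(q['a'].tolist())   # [0, 2, 4] -- the stored data is untouched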

Sleep number bed LCDs are defective

We bought a Sleep Number bed about 5 years ago. These beds come with a '20 year' warranty which sounds awesome, because it makes one think that a) the beds are made well for the company to give such a warranty and b) it's a nice warranty. Well, it's not THAT great. About 2 years ago the LCD display on the controller started to go on the fritz. It started with one segment of one digit and then progressed until a few weeks ago the display was simply blank. I did a quick search on the internet and it turns out that this is a very common problem. We have a wired controller (because it was cheaper, I guess, it was a while ago). The refurbished replacement is going to cost us $60 with shipping and the original one would have cost us $140 or so. It does seem that we are getting a nice discount on their catalog price, but I don't think this is such a good deal. Anyhow, the pump is working fine, so the actual cost of the controller was probably $10 or so, so I

Manipulating pandas data structures

I really enjoy using the Pandas Series and DataFrame objects. I find, however, that methods to update the series/frame are clunky. For a DataFrame it's pretty easy to add columns - you create a DataFrame or a Series and you just assign it. But adding rows to a Series or DataFrame is a bit clunky. I sometimes have the need to modify a certain row with new data or add that row if it does not exist, which in a database would be a 'replace or insert' operation. You can concat or append another Series or DataFrame but I have not found a nice way of handling the 'replace or insert' case. If the structure is small I simply convert it into a dictionary and manipulate the structure using the dictionary keys and then recreate the pandas structure. If the structure is large I do an explicit test for the index (row) and then decide whether to append or replace.
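In newer pandas, .loc 'setting with enlargement' gives a compact 'replace or insert' by index label; a minimal sketch (labels and values are made up, and this may not apply to the older versions discussed here):

import pandas as pd

df = pd.DataFrame({'value': [1.0, 2.0]}, index=['a', 'b'])

# Replaces the row if the label already exists...
df.loc['b'] = [5.0]
# ...and inserts a new row if it does not.
df.loc['c'] = [7.0]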

DSLR vs compacts/micro four thirds

I'm what the marketing department at camera companies call an 'enthusiast'. Previously I would be called an amateur, but I guess 'enthusiast' doesn't have the stigma of 'clueless' that amateur now has. I don't make money off photos and I take photos for pleasure and for memories. I bought my DSLR when DSLR prices were plunging off a cliff, that is, after all the professionals had subsidized sensor and lens development. I bought the D40. I got a DSLR for the following characteristics:

- Low shutter lag. This was probably the biggest deal for me. I like to capture the fleeting expressions on human faces and the compact was very frustrating with the long lag between focusing and then taking the picture.
- Good low light performance. The D40 works just fine for me up to ISO 1600. ISO 3200 is very noisy, and adding a nice prime lens that goes out to f/1.8 added a lot of artistic scope and improved low light performance.

The downside of even a small DSLR like