Friday, December 20, 2013

Quitclaim deeds, land records and all THAT

Looking into some legal stuff I noticed that a deed was labelled 'quitclaim'. I was puzzled by what this meant (it sounded a little shady to me). From the page here it seems like a quitclaim deed is weaker than a warranty deed. A warranty deed states that whoever is giving you the deed is legally obliged to defend any challenges to ownership that arise on the land regardless of how far back in time this challenge originates. The quitclaim deed obliges the grantor to defend any challenges to ownership that arose only while they were owning the property. Any challenges that arise before are excluded. This seems a little shady, because if it is your land, and you are selling it, why would you NOT give the full support of ownership as a warranty deed promises? I started to look into land records (For Massachusetts you can go to and do a search based on county) and every transfer of that particular piece of land was quitclaim, going back as far as I could trace it. From a legal blog it seems that the quit claim deed is very commonly used in Massachusetts but I don't know exactly why.

Friday, December 13, 2013

Massachusetts State Government registry for lead in rentals/homes

If you are concerned about lead paint in a building for living in go to By entering a complete address you can find out if the dwelling has any violations. If you enter just a street name you can find all the violations recorded.

Sunday, December 8, 2013

Docopt is amazing

I love the command line and I love Python. So, naturally, I am an avid user of the argparse module bundled with Python. Today I discovered docopt and I am so totally converted. argparse is great but there is a bunch of setup code that you have to write and often things look very boilerplate-y and messy and it just looks like there should be a more concise way of expressing the command line interface to a program. Enter docopt

docopt allows you to describe your commandline interface in your doc string and then it parses this description and creates a command line parser that returns a dictionary with the values for all the options filled in. Just like that.

So, for example, one of my scripts has a docstring that looks like

  compute_eye_epoch [-R DATAROOT] [-x EXCEL] [-d DATABASE] [-e EPOCH] [-f|-F] [-q]

  -h --help     Show this screen and exit.
  -R DATAROOT   Root of data directory [default: ../../Data]
  -x EXCEL      Spreadsheet with sessions/trials etc [default: ../../Notes/sessions_and_neurons.xlsx]
  -d DATABASE   sqlite3 database we write to [default: test.sqlite3]
  -e EPOCH      Name of epoch we want to process
  -f            Force recomputation of all entries for this epoch
  -F            Force storing of epoch (automatically forces recomputation)
  -q            Quiet mode (print only ERROR level logger messages)

And the __main__ part of the code is

import docopt

if __name__ == '__main__':
  arguments = docopt.docopt(__doc__, version='v1')
  print arguments

If I call the program with -h then I get the usage information printed and the program exits. If I call it with other options args will be filled out, for example:

{'-F': False,
 '-R': '../../Data',
 '-d': 'test.sqlite3',
 '-e': None,
 '-f': False,
 '-q': False,
 '-x': '../../Notes/sessions_and_neurons.xlsx'}

This totally removes the barrier to creating command line interfaces and removes clutter from the __main__ section of the code! Amazing! Give it a try!

Thursday, December 5, 2013

Database diagrams and sqlite on the cheap

Those diagrams that show you your database tables and the links between them through foreign keys are apparently called Entity Relationship Diagrams (ERDs). I wanted to create one for my sqlite database to keep track of everything but I'm a cheapskate and didn't want to pay anything.

It turns out MySQL WorkBench is great for this. You don't need to register with them to download the program. You don't need a MySQL database running for this. I simply followed these steps:

  1. From the sqlite3 commandline I types .schema which printed the database schema to the console.
  2. I pasted the schema into a file and saved it
  3. I used Import from MySQL Workbench to parse the schema and place it on a diagram. 
The Autolayout feature is pretty good and probably optimizes for visual appeal, but I spent a few minutes changing the layout to  what I think worked logically in my head and also minimized connection overlaps. The translation from sqlite3 to MySQL dialects is smooth.

My only complaint with this tool on Mac is that is pretty unstable and crashes on every whim. Save often.

Store numpy arrays in sqlite

Use numpy.getbuffer (or sqlite3.Binary) in combination with numpy.frombuffer to lug numpy data in and out of the sqlite3 database:
import sqlite3, numpy

r1d = numpy.random.randn(10)

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE eye(id INTEGER PRIMARY KEY, desc TEXT, data BLOB)")
con.execute("INSERT INTO eye(desc,data) VALUES(?,?)", ("1d", sqlite3.Binary(r1d)))
con.execute("INSERT INTO eye(desc,data) VALUES(?,?)", ("1d", numpy.getbuffer(r1d)))
res = con.execute("SELECT * FROM eye").fetchall()

#res ->
#[(1, u'1d', <read-write buffer ptr 0x10371b220, size 80 at 0x10371b1e0>),
# (2, u'1d', <read-write buffer ptr 0x10371b190, size 80 at 0x10371b150>)]

print r1d - numpy.frombuffer(res[0][2])
#->[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

print r1d - numpy.frombuffer(res[1][2])
#->[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

Note that for work where data types other than float and shapes other than 1d are needed you will also need to keep track of dtype and shape and store those in the database so you can pass them to numpy.frombuffer

Wednesday, December 4, 2013

Monday, December 2, 2013

Pandas, multiindex, date, HDFstore and frame_tables

Currently if you have a dataframe with a multiindex with a date as one of the indexers you can not save it as a frame_table. Use datetime instead.
import pandas as pd, numpy, datetime

print pd.__version__ #-> 0.13.0rc1

idx1 = pd.MultiIndex.from_tuples([(,12,d), s, t) for d in range(1,3) for s in range(2) for t in range(3)])
df1 = pd.DataFrame(data=numpy.zeros((len(idx1),2)), columns=['a','b'], index=idx1)
#-> If you want to save as a table in HDF5 use datetime rather than date

with pd.get_store('test1.h5') as f:
  f.put('trials',df1) #-> OK

with pd.get_store('test2.h5') as f:
  f.put('trials',df1,data_colums=True,format='t') #-> TypeError: [date] is not implemented as a table column

#-> Solution is to use datetime

Update: Thanks to Jeff again for the solution

Saturday, November 30, 2013

Setting up a virtualenv to run pandas-dev

virtualenv pandas
cd pandas
source bin/activate
git clone
pip install cython #Didn't have this
python install
pip install tables

Friday, November 22, 2013

h5py and multiprocessing

The HDF5 format has been working awesome for me, but I ran into danger when I started to mix it with multiprocessing. It was the worst kind of danger: the intermittent error.

Here are the dangers/issues in order of escalation

(TL;DR is use a generator to feed data from your file into the child processes as they spawn. It's the easiest way. Read on for harder ways.)

  1. An h5py file handle can't be pickled and therefore can't be passed as an argument using
  2. If you set the handle as a global and access it from the child processes you run the risk of racing which leads to corrupted reads. My personal runin was that my code sometimes ran fine but sometimes would complain that there are NaNs or Infinity in the data. This wasted some time tracking down. Other people have had this kind of problem [1].
  3. Same problem if you pass the filename and have the different processes open individual instances of the file separately.
  4. The hard way to solve this problem is to switch your workflow over to MPI and use mpi4py or somesuch. The details are here. My problem is embarrassingly parallel and the computations result in heavy data reduction, so I ended up simply doing ...
  5. The easy way to solve the problem is to have a generator in the parent thread that reads in the data from the file and passeses it to the child process in a JIT manner.

fin = h5py.File(fname,'r')
X = (read_chunk(fin, [n,n+Nstep]) for n in range(0,N,Nstep))
pool = mp.Pool()
result =, X)

Thursday, November 21, 2013

Daemonic processes are not allowed to have children

You can use the multiprocessing module to 'farm' out a function to multiple cores using the function. I was wondering idly if you can nest these farming operation to quickly build an exponentially growing army of processes to, you know, take over the world. It turns out that the universe has a failsafe:

import multiprocessing as mp

def compute(i):
  return i

def inner_pool(args):
  n0, N = args
  pool = mp.Pool()
  return, range(n0,n0+N))

pool = mp.Pool()
print, [(n,n+10) for n in range(10)])

# -> AssertionError: daemonic processes are not allowed to have children

Saturday, November 16, 2013

h5py: The resizing overhead

A cool thing about data arrays stored in HDF5 via h5py is that you can incremental add data to them. This is the practical way of processing large data sets: you read in large datasets piece by piece, process them and then append the processed data to an array on disk. An interesting thing about this is that there is a small size overhead in the saved file associated with this resizing compared to the same data saved all at once with no resizing of the HDF5 datasets.
I did the computations by using block sizes of [10, 100, 1000, 10000] elements and block counts of [10,100,1000, 10000] The corresponding matrix of overhead (excess bytes needed for the resized version over the directly saved version) looks like this (rows are for the block sizes, columns are for the block counts):
overhead ->
array([[   9264,    2064,    3792,    9920],
       [   2064,    3792,    9920,   52544],
       [   3792,    9920,   52544,  462744],
       [   9920,   52544,  462744, 4570320]])
As you can see the matrix is symmetric, indicating that block size and resize counts (number of blocks) don't interact and it is only the total final size of the data that matters. Someone with knowledge of the internal of HDF5 and h5py will probably be able to explain why this pattern of overhead occurs and why it only depends on the total size of the data and not on the number of resizes performed. Also, interestingly, the simple act of marking a dataset as resizable adds space overhead to the storage (Third plot). This is a plot of the additional overhead incurred as a result of saving the same sized data set all at once, but marking one as resizable. Sidenote: If you let h5py choose the data type on its own during a dummy initialization it will pick float32 (e.g. f.create_dataset('data',(0,1), maxshape=(None,1) ) Code:
import h5py, numpy, os

def resize_test(block_sizes=[10, 100, 1000, 10000], n_blocks=[10,100,1000, 10000], dtyp=float):
  file_size = numpy.empty((len(block_sizes),len(n_blocks),2),dtype=int)
  for i,bl_sz in enumerate(block_sizes):
    for j,n_bl in enumerate(n_blocks):
      with h5py.File('no_resize.h5','w') as f:

      with h5py.File('yes_resize.h5','w') as f:
        f.create_dataset('data',(0,1), maxshape=(None,1),dtype=dtyp)
        for n in range(n_bl):
          f['data'][-bl_sz:,:] = numpy.empty((bl_sz,1),dtype=dtyp)

      with h5py.File('no_resize.h5') as f:
        print f['data'].dtype
        print f['data'].shape
        file_size[i,j,0] = os.fstat(f.fid.get_vfd_handle()).st_size

      with h5py.File('yes_resize.h5') as f:
        print f['data'].dtype
        print f['data'].shape
        file_size[i,j,1] = os.fstat(f.fid.get_vfd_handle()).st_size
  return file_size

def default_test(data_sizes=[100,1000,10000,100000,1000000,10000000,100000000], dtyp=float):
  file_size = numpy.empty((len(data_sizes),2),dtype=int)
  for i,dt_sz in enumerate(data_sizes):
    with h5py.File('no_resize.h5','w') as f:

    with h5py.File('yes_resize.h5','w') as f:
      f.create_dataset('data',data=numpy.empty((dt_sz,1)), maxshape=(None,1),dtype=dtyp)

    with h5py.File('no_resize.h5') as f:
      print f['data'].dtype
      print f['data'].shape
      file_size[i,0] = os.fstat(f.fid.get_vfd_handle()).st_size

    with h5py.File('yes_resize.h5') as f:
      print f['data'].dtype
      print f['data'].shape
      file_size[i,1] = os.fstat(f.fid.get_vfd_handle()).st_size
  return file_size

data = resize_test()
overhead = data[:,:,1] - data[:,:,0]

data2 = default_test()
overhead2 = data2[:,1] - data2[:,0]
And, for those interested, here is the full data matrix (which you can get simply by running the code above):
In [170]: data
array([[[     2944,     12208],
        [    10144,     12208],
        [    82144,     85936],
        [   802144,    812064]],

       [[    10144,     12208],
        [    82144,     85936],
        [   802144,    812064],
        [  8002144,   8054688]],

       [[    82144,     85936],
        [   802144,    812064],
        [  8002144,   8054688],
        [ 80002144,  80464888]],

       [[   802144,    812064],
        [  8002144,   8054688],
        [ 80002144,  80464888],
        [800002144, 804572464]]])
And I took a short cut to making the plots by pasting the computed data in:
x1 = [100,1000,10000,100000,1000000,10000000,100000000]
x2 = [2944,     10144,     82144,    802144, 8002144,  80002144, 800002144]
y = [9265,2064,3792,9920,52544,462744,4570320]

x3 = [100,1000,10000,100000,1000000,10000000,100000000]
y3 = overhead2


pylab.ylabel('Overhead (bytes)')

pylab.xlabel('File size (bytes)')
pylab.ylabel('Overhead (bytes)')

pylab.ylabel('Overhead (bytes)')
pylab.title('No actual resizing')

Friday, November 15, 2013

h5py: the HDF file indexing overhead

Storing numpy arrays in hdf5 files using h5py is great, because you can load parts of the array from disk. One thing to note is that there is a varying amount of time overhead depending on the kind of indexing you use.

It turns out that it is fastest to use standard python slicing terminology - [:20,:] - which grabs well defined contiguous sections of the array.

If we use an array of consecutive numbers as an index we get an additional time overhead simply for using this kind of index.

If we use an array of non-consecutive numbers (note that the indecies have to be monotonic and non-repeating) we get yet another time overhead even above the array with consecutive indexes.

Just something to keep in mind when implementing algorithms.

import numpy, h5py

N = 1000
m = 50
f = h5py.File('index_test.h5','w')
f.create_dataset('data', data=numpy.random.randn(N,1000))
idx1 = numpy.array(range(m))
idx2 = numpy.array(range(N-m,N))
idx3 = numpy.random.choice(N,size=m,replace=False)

timeit f['data'][:m,:]
timeit f['data'][-m:,:]
timeit f['data'][idx1,:]
timeit f['data'][idx2,:]
timeit f['data'][idx3,:]

# N = 1000
#-> 1000 loops, best of 3: 279 µs per loop
#-> 1000 loops, best of 3: 281 µs per loop
#-> 1000 loops, best of 3: 888 µs per loop
#-> 1000 loops, best of 3: 891 µs per loop
#-> 1000 loops, best of 3: 1.27 ms per loop

# N = 10000
#-> 1000 loops, best of 3: 258 µs per loop
#-> 1000 loops, best of 3: 258 µs per loop
#-> 1000 loops, best of 3: 893 µs per loop
#-> 1000 loops, best of 3: 892 µs per loop
#-> 1000 loops, best of 3: 1.3 ms per loop

Octoshape (octoshapepm)

I was first alerted to this when my mac asked me if I wanted to allow octoshapepm to accept incoming connections. This led me to a web search which led me to an interesting finding that CNN is basically installing a program that uses your computer to redistribute your content, but not really telling you that it is doing it.

The program itself is made by this company. This article gives a brief non-marketing overview of what the program actually does and how to get rid of it if you wish.

In short, as installed by CNN, the program acts as a realtime file distribution system, like bittorrent, except that its probably running without your permission, helping CNN deliver content using part of your bandwidth (you are uploading video data just as you are downloading it). There are security issues with this in addition to an issue of principle, where you are most likely being tricked into giving up part of your bandwidth to save CNN some money as well as exposing a new security hole.

I note that on my mac the program exits when I'm not watching a CNN stream, but I don't know if that is because I denied the program access to the internet (i.e I don't know if octoshape would continue to run after I quit if there were other clients pulling data from my computer).

Any how, the bottom line is that you should be aware that there is this program that is running when you watch certain CNN streams that has access to your computer and is uploading data from your computer to other users.

Sunday, November 10, 2013

Annoying bug in PyTables (affects Big Data analysis with Pandas)

Pandas HDF5 interface through PyTables is awesome because it allows you to select and process small chunks of data from a much larger data file stored on disk. PyTables, however, has an annoying and subtle bug and I just wanted to point you to it so that you don't have to spend hours debugging code like I did.

In short, if you have a DataFrame, and a column of that DF starts with a NaN, any select statements that you run with that conditions on that column will return empty (you won't get any results back, ever). There is a work around, but I chose to use a dummy value instead.

This shook my confidence in Pandas as an analysis platform a bit (though it is really PyTable's fault).


I like learning languages and after a little kerfuffle with a Python package I was wondering if there was anything out there for statistical data analysis that might not have so many hidden pitfalls in ordinary places. I knew about R from colleagues but I never payed much attention to it, but I decided to give it a whirl. Here are some brief preliminary notes in no particular order

  • Keyword arguments!
  • Gorgeous plotting
  • Integrated workspace (including GUI package manager)
  • Very good documentation and help
  • NaN different from NA
  • They have their own Journal. But what do you expect from a bunch of mathematicians?
  • Prints large arrays on multiple lines with index number of first element on each line on left gutter
  • Parenthesis autocomplete on command line
  • RStudio, though the base distribution is pretty complete, with package manager, editor and console.

  • Everything is a function. I love this, but it means commands in the interpreter always need parentheses. I'd gotten used to the Python REPL not requiring parentheses.
  • The assignment operator is two characters rather than one
  • Indexing starts from 1. Oh god, could we PLEASE standardize this either way?
  • Not clear how well R handles "big data" (Data that can't be loaded into memory at once) or parallelization. (To look up: bigmemory)

Tuesday, November 5, 2013

Big Data, Small Data and Pandas

Welp, I finally got this through my thick head, thanks to a hint by Jeff who answered my cry for help on stack overflow, and pointed me to this thread on the pandas issues list.

So here's my use case again: I have small data and big data. Small data is relatively lightweight heterogeneous table-type data. Big data is potentially gigabytes in size, homogenous data. Conditionals on the small data table are used to select out rows which then indicate to us the subset of the big data needed for further processing.

Here's one way to do things: (Things to note: saving in frame_table format, common indexing, use of 'where' to select the big data)
import pandas as pd, numpy

df = pd.DataFrame(data=numpy.random.randint(10,size=(8,4)),columns=['a','b','c','d'])
df1 = pd.DataFrame(data=numpy.random.randint(10,size=(8,20)),index=df.index)
df2 = pd.Panel(data=numpy.random.randint(10,size=(8,20,5)),items=df.index)
df3 = pd.Panel4D(data=numpy.random.randint(10,size=(8,20,5,5)),labels=df.index)

store = pd.HDFStore('data.h5')
print store

row ='small',where=['a>2','b<5'],columns=['a','b'])
print 'Small data:'
print row

da ='big',pd.Term(['index',row.index]))
print 'Big data:' 
print da

da ='big_panel',pd.Term(['items',row.index]))
print 'Big data (Panel):'
print da.items

da ='big_panel4',pd.Term(['labels',row.index]))
print 'Big data (Panel4d):'
print da.labels
With a sample output of:
<class ''>
File path: data.h5
/big                   frame_table  (typ->appendable,nrows->8,ncols->20,indexers->[index])                       
/big_panel             wide_table   (typ->appendable,nrows->100,ncols->8,indexers->[major_axis,minor_axis])      
/big_panel4            wide_table   (typ->appendable,nrows->500,ncols->8,indexers->[items,major_axis,minor_axis])
/small                 frame_table  (typ->appendable,nrows->8,ncols->4,indexers->[index],dc->[a,b])              
Small data:
   a  b
3  6  2
5  9  4
Big data:
   0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19
3   9   1   1   4   0   3   2   8   3   4   2   9   9   7   0   4   5   2   5   0
5   0   5   3   5   4   3   4   5   5   9   9   8   6   3   8   0   5   8   8   4
Big data (Panel):
Int64Index([3, 5], dtype=int64)
Big data (Panel4d):
Int64Index([3, 5], dtype=int64)

In the Pandas issue thread there is a very interesting monkey patch that can return arbitrary data based on our small data selection, but I generally tend to shy away from monkey patches as being hard to debug after a few months have passed since the code was written.

Linux: install pytables

pip install numexpr --user
pip install cython
pip install tables

Monday, November 4, 2013

Pulling/pushing data to a samba server from Linux

smbclient (It's atrociously formatted man page is here) will let you do what ftp let you do which is to get and put files from you local machine to a samba server.

My use case is that I have a high performance cluster (Partners' Linux High Performance Computing cluster) that I want to run my code on (remoteA) while my data is on another server (remoteB) that seems to only allow access through samba and refuses ssh and scp requests.

The solution turns out to be to use smbclient, which seems to behave just like the ftp clients of old.

ssh into remoteA

smbclient \\\\{machine name}\\{share name} -D {my directory} -W{domain}

(The multiple backslashes turn out to be vital)
You'll end up with a smbc prompt. At the prompt type

prompt  (Gets rid of the prompt asking you if you are sure you want to copy or EVERY file)
recurse  (I wanted to copy a whole directory, so I needed this)
mget <my dir>\  (this is my directory)

A useful command is

smbclient -L {machine name} -W {domain}

which lists the share names available on the machine

Sunday, October 27, 2013

Posting a link to LinkedIn

Sometimes LinkedIn will not pull the metadata/images from a page you link to in a status update. In my case I was trying to link to a page on latimes. I found that if you get a tinyurl to the page, that works. I suspect that the url parser LinkedIn uses can not handle 'weird' characters in an url, like commas (this url had a comma) or else, can't handle urls beyond a certain length.

Saturday, October 26, 2013

Plotting state boundary data from shapefiles using Python

The great folks at have put up some of the data they collect so we can download and use it. On this page they have data relating to state boundaries. The files are available as zipped directories containing a shapefile and other metadata information. If you want to plot state boundaries and some state metadata (like zip code, state name) the .shp shapefile is sufficient. Assuming that the shape file is 'tl_2010_us_state10/tl_2010_us_state10.shp', some sample code using the pyshp package is:
import shapefile as sf, pylab

map_f = sf.Reader('tl_2010_us_state10/tl_2010_us_state10.shp')
state_metadata = map_f.records()
state_shapes = map_f.shapes()

for n in range(len(state_metadata)):
  pylab.plot([px[0] if px[0] <0 else px[0]-360 for px in state_shapes[n].points],[px[1] for px in state_shapes[n].points],'k.',ms=2)

for n in range(len(state_metadata)):
The pyshp package makes things so easy! Note that you can plot continuous lines instead of dots for the state boundaries, however, for some states like Alaska and Florida with islands, where the boundaries are not contiguous, you get nasty disjoint lines. Removing this requires much more processing (unless you do it by hand and break down states into "sub-states".

Friday, October 25, 2013

Pandas: the frame_table disk space overhead

When a Pandas DataFrame is saved (via PyTables) to hdf5 as a frame_table there is a varying amount of disk space overhead depending on how many columns are declared as data_columns (i.e. columns you can use to select rows by). This overhead can be rather high.

import pandas as pd, numpy

df = pd.DataFrame(numpy.random.randn(1000000,3),columns=['a','b','c'])
df.to_hdf('data_table_nocomp.h5','data') #-> 32 MB
df.to_hdf('data_normal.h5','data',complevel=9,complib='bzip2') #-> 21.9 MB
df.to_hdf('data_table.h5','data',complevel=9,complib='bzip2',table=True) #-> 22.5 MB
df.to_hdf('data_table_columns1.h5','data',complevel=9,complib='bzip2',table=True,data_columns=['a']) #-> 29.1 MB
df.to_hdf('data_table_columns2.h5','data',complevel=9,complib='bzip2',table=True,data_columns=['a','b']) #-> 35.8 MB
df.to_hdf('data_table_columns3.h5','data',complevel=9,complib='bzip2',table=True,data_columns=['a','b','c']) #-> 42.4 MB
df.to_hdf('data_table_columns3_nocomp.h5','data',table=True,data_columns=['a','b','c']) #-> 52.4 MB

Monday, October 14, 2013

The one thing I would have changed in 'Gravity'

Gravity is a great movie on many levels. It can't quite beat 2001 for solitude, desolation and a tiny cast, but its good. The three actors, Clooney, Bullock and Sir Newton do a great job and work well together, though there is not much by way of character development.

There is one raging issue that I have though. It only lasts 20 seconds in the movie and I don't quite know why its there. 

So here are Clooney and Bullock drifting towards the ISS. They get entangled in the parachute cords which stops their momentum relative to the ISS. Then, for some inexplicably reason, for 20 seconds Sir Isaac Newton goes on coffee break but the crew keep filming! Clooney is pulled by some mysterious phantom force that affects only him and Bullock but not the ISS. Clooney cuts himself loose and slingshots outward. Bullock kind of drifts back, so you know Sir Newton is slowly waking up from the coffee, but not quite, so it's not really clear what's going on.

Here's a tweak I would make to those 20 seconds that would keep Ike in the cast and keep everything else intact:
Clooney and Bullock are drifting towards the ISS. Clooney's jet pack runs out of gas, but their vector isn't close enough. They are going to miss the ISS by 20m or so and they will both die. Clooney unhooks himself from a semi-conscious Bullock, hurls her in the right vector for an ISS intercept and directs her into the airlock as he goes spinning into space.

Personally, as Bullock, I would hurry as fast as I could to get into the Soyuz to rescue Clooney, but whatever, fire in space is cool and I would have stopped and stared too.

I really can't understand why this discrepancy is there in movie where the folks seem to have worked somewhat hard to bring us the wonderful world of Newton's laws with no net external force.

Sunday, October 13, 2013

Reversed 28mm on a D5100 is 2:1 macro

Wooden ruler, mm grading
D5100 sensor size 23.6 mm × 15.6 mm
8mm of wooden ruler spans height of sensor
Magnification is 15.6mm : 8mm

About 2:1 which is great, but the focus distance is insanely close.

Saturday, October 12, 2013

Wednesday, October 9, 2013

Adventures with a Nikkor H 28mm f3.5

I wanted to find out what all this hoopla about old manual lenses was about, so I went looking for one. Apparently old manual lenses aren't THAT cheap, or else my definition of cheap is about a standard deviation below the mean. However, I did find a manual lens that fit my budget. Apparently the Nikon 28mm f3.5 isn't as hot an item as some other lenses.

The lens I got is older than I am, but in better condition. Nikon ended the run of this version of the lens in 1971. It's a non-Ai lens with the metering prong taken off (which makes it worthless for a collector, I guess). This suited me for two reasons: it made it cheap and it meant I could fit it on my D5100 (I read that you can fit the lens on the camera even with the prongs, but I don't believe it - the flash housing juts over the camera mount pretty closely, and I suspect the prongs would foul verified- the flash housing is JUST high enough that the prongs don't foul.). I inspected the lens for fungus and dust by shining a flashlight through the front and rear elements, and found the lens to be remarkably clean. There were no streaks on the front element.

The D5100 is a 'entry level camera' so it doesn't have a body motor. This turns a range of AF-D lenses into badly designed manual focus lenses. However, it also does not have the Ai-tab that allows higher end Nikons, like the D7100, to meter with Ai lenses. This allows the D5100 to accept non-Ai (or pre-Ai) lenses.

Now I understand what those coots mean when they rave about old manual lenses. AF-S and AF-D lenses are designed around an internal or body motor. You want the lens focus to zip from one end to the other as fast as possible. So you design the focus system to go from infinity to closest focal point with as little twist of the barrel as possible (called throw). Also, you want the motor to have as little resistance as possible, so you make the focusing ring as loose and light as possible. This sucks for manual focus.

Firstly, you have to be very delicate when you turn the focusing ring because you zip past the focus point really quickly. Secondly, if you breathe at the wrong moment, you lose your focus, because the ring is so loose (I'm looking at you 50mm f/1.8 AF-D). I use the red dot to confirm focus and I have to keep twiddling the ring to settle on the correct focus point, as if I'm some kind of CDAF algorithm.

(As a side note, I'm annoyed that the D5100 doesn't show the rangefinder when in full manual mode. You can't shoot a non-Ai lens in S or A mode, fair enough. But you get the rangefinder in those modes! I just can't understand this design decision).

A true manual focus lens, like this one, is very pleasant. The focus ring is damped. You have to push a little to move it, but it feels good and solid. Next, the throw is huge: a full 180 degrees. I was focusing and brought a subject into focus, and I tried to move beyond the focus point as I was used to, but I kept turning and turning and the subject did not go out of focus for what seemed like an eternity.

I kind of knew about these two quantities but I did not realize how big a deal they would be for shooting and getting in-focus shots. I've had disheartening experiences with the 50/1.8 AF-D and my experience with this 28mm is very different, where I always nailed the focus and got really bright, contrasty shots.

However, though I began to rave like the other coots, technically speaking, the better results I got from the manual focus lens is due to a complex mixture of factors and the comparison is not exactly fair. The main thing is that, once you have f1.8 you want to use it ALL THE TIME. So naturally, with the 50mm 1.8 wide open and trying to manual focus, I get a bunch of out of focus shots. I have much better images when I stop down to 3.5 or so. The 28mm lens starts out at 3.4 and is much wider, so that itself gives me a large DoF that masks any slop in focusing.

I was wowed by how sharp and contrasty the images from this lens were wide open. I'm used to hearing that one needs to stop down the lens a bit, and the largest aperture is really just there for low-light situations. However, a factor to consider here is that this is a 'FX' lens (from a time when everything was FX) used on a DX body. The engineers worked hard to get the center of the lens sharp and contrasty, the edges are probably more flawed, but I don't get to see that. Another score for the 'entry-level' camera...

I guess a real test would be for me to find a manual focus 50mm f1.8 or 1.4 and then compare my 50mm AF-D with that. Hmm, where is that piggy-bank?

Friday, October 4, 2013

Some notes from an ebay newbie

I was always suspicious of eBay (mostly because of Paypal). But I decided to jump in (like, what, about a few decades behind the curve) and try it out. I have fairly specific things I look for on eBay: photo stuff, my idea is that eBay is the giant garage sale in the ether and sometimes you can find exactly what you want for exactly what you want to pay for.

I don't have any deep observations, but I think one simple rule is important. I saw a lot of advice about sniping (bidding at the last second) and I think eBay does a very fair thing, in the true spirit of trying to find the appropriate price for an item.

Ebay allows you to set a maximum bid and will automatically bid just enough to keep you ahead up to the maximum. If someone comes in after you with a series of bids your bid always has precedence until your limit. I think this, if you are using eBay as a garage sale to find cheap items for a hobby, is the proper way to go.

When you first see an item, decide how much at most you want to pay for it, set that as your maximum bid AND THEN FORGET ABOUT IT.

EDIT: Now, however, I think this not literally how you should do it. You should write your maximum bid out, to remind yourself of what the real valuation of the item is. However, if everyone entered their max bid into eBay right away, all the prices would instantly saturate out, leaving the price at some small increment above the second highest bid. I think I understand "sniping" a little better - you should slowly work your way up to your max bid in relation to other bids, and as late as possible. However, the crucial key here is to remember your own maximum valuation and not budge from that, otherwise you will overpay.

eBay, of course, makes money off commissions, so it has an interest in you bidding on items and keeping increasing your bid. It's site is tuned to competitiveness in bidding, and I can see how people with a weakness for gambling or a misplaced competitive streak would get into a bidding war forgetting what their original base valuation for an item was. eBay is not really at fault here.

In general, item prices on open auction (where the lowest bid is very low) tend to be exponential, with prices racing up in the last 15min or so. I can see how, for expensive items, you can occasionally get a deal - there is probably a lot of noise in the price curve and since the growth is exponential the last few minutes/seconds of a bidding war can lead to large fluctuations in final price. The danger is that even with fluctuations, if you forget your original valuation, you are ending up paying more than the item is worth to you if you were not in a 'war': the only difference is in how much more you are paying than what the item was really worth to you.

One set of items I had been looking at were old 85mm primes. I was surprised at how much money people were paying for these old lenses. I can understand that a lens that has never been used, with its original packing and caps etc, would have a collector value. I could not understand how a lens with bits and pieces missing could sell for nearly $300. In all of these auctions I could see this exponential bid growth in the last hour. I guess there is a craze for Nikon 85mm f/1.8 D lenses well beyond their actual value as a photographic instrument!

Any how, it is possible to get deals from eBay, but it can't be something that there is a craze for, otherwise the item becomes over priced.

Wednesday, October 2, 2013

A simple exchange on eBay

I bought a 52mm-52mm coupler from a HK supplier (goes by the name of william-s-home). After I paid for the item, I noticed that the seller had a warning that the shipping could take 20-30 days and to email them if I wanted to cancel because I was just reading this note.

I emailed him and requested a cancellation. The seller was SO polite. We had a few exchanges and he/she was always extremely respectful.

I now have this image in my head of a venerable old Chinese trader who takes his business and reputation very seriously. For him, this is not just a way to earn money. It is a way of life, a principle, and things must be done correctly. The item cost $4.00 with shipping. It probably cost more than that for both of us in terms of the time spent emailing and completing the formalities for cancelling the transaction.

It was all very civilized and suddenly made me want to be a global trader, exchanging emails with people from far flung places in the globe, because life is too short and the world is too big and there are too many good people out there to not get to know them.

Monday, September 30, 2013

Initializing a Pandas panel

Sometimes there are multiple tables of data that should be stored in an aligned manner. Pandas Panel is great for this. Panels can not expand along the major and minor axis after they are created (at least in a painless manner). If you know the maximum size of the tabular data it is convenient to initialize the panel to this maximum size before inserting any data. For example:
import numpy, pandas as pd

pn = pd.Panel(major_axis=['1','2','3','4','5','6'], minor_axis=['a','b'])
pn['A'] = pd.DataFrame(numpy.random.randn(3,2), index=['2','3','5'], columns=['a','b'])
print pn['A']

Which gives:
          a         b
1       NaN       NaN
2  1.862536 -0.966010
3 -0.214348 -0.882993
4       NaN       NaN
5 -1.266505  1.248311
6       NaN       NaN

Edit: Don't need a default item - an empty panel can be created

Saturday, September 28, 2013

Macro photography with reversed lens

I had forgotten the simple joys of experimenting with cameras. Some of you will recall the old trick of reversing your lens to obtain macro photos. Here I simply took my 18-55 kit lens, reversed it, set it to 18mm and took a photo of my laptop monitor. I aimed it at a white part of the screen and you can see the three sub pixels per real pixel which combine together to give the illusion of white.

Monday, September 23, 2013

Pandas panel = collection of tables/data frames aligned by index and column

Pandas panel provides a nice way to collect related data frames together while maintaining correspondence between the index and column values:

import pandas as pd, pylab

#Full dimensions of a slice of our panel
index = ['1','2','3','4'] #major_index
columns = ['a','b','c'] #minor_index

df = pd.DataFrame(pylab.randn(4,3),columns=columns,index=index) #A full slice of the panel
df2 = pd.DataFrame(pylab.randn(3,2),columns=['a','c'],index=['1','3','4']) #A partial slice
df3 = pd.DataFrame(pylab.randn(2,2),columns=['a','b'],index=['2','4']) #Another partial slice
df4 = pd.DataFrame(pylab.randn(2,2),columns=['d','e'],index=['5','6']) #Partial slice with a new column and index

pn = pd.Panel({'A': df})
pn['B'] = df2
pn['C'] = df3
pn['D'] = df4

for key in pn.items:
  print pn[key]

-> output

          a         b         c
1  0.243221 -0.142410  1.228757
2 -0.748140 -0.780719  0.644401
3  0.161369 -0.001034 -0.278070
4 -1.143613 -1.547082  0.025639
          a   b         c
1  1.165219 NaN  1.391501
2       NaN NaN       NaN
3 -1.484183 NaN  0.541619
4  0.810439 NaN -0.848142
          a         b   c
1       NaN       NaN NaN
2  1.310740  1.278829 NaN
3       NaN       NaN NaN
4  0.042748 -0.464065 NaN
    a   b   c
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN

One thing to note is that Panel does not expand along the minor and major axes.

Saturday, September 21, 2013

Wordpress renders LaTeX

I was so pleasantly surprised to learn that wordpress blogs will render latex. The tags are simply $latex and $.
So $latex e^{ix} = \cos(x) + i\sin(x)$ will render as

There are some cool parameters that you can set (from hints here and here):
  1. increase size by adding &s=X where X is an integer [-4,4]: $latex x^2 &s=2$  
  2. Instead of inline equtions (default) display as block (bigger): $latex \displaystyle x^2$

Thursday, September 19, 2013

Python: Multiprocessing: xlrd workbook can't be passed as argument

import multiprocessing as mp, xlrd

def myfun(b):
  print b.sheet_names()

p = mp.Pool(4), [b,b,b,b])
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/Applications/", line 551, in __bootstrap_inner
  File "/Applications/", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/Applications/", line 319, in _handle_tasks
PicklingError: Can't pickle : attribute lookup __builtin__.instancemethod failed

Python: Multiprocessing: passing multiple arguments to a function

Write a wrapper function to unpack the arguments before calling the real function. Lambda won't work, for some strange un-Pythonic reason.

import multiprocessing as mp

def myfun(a,b):
  print a + b

def mf_wrap(args):
  return myfun(*args)

p = mp.Pool(4)

fl = [(a,b) for a in range(3) for b in range(2)]
#mf_wrap = lambda args: myfun(*args) -> this sucker, though more pythonic and compact, won't work, fl)

Tuesday, September 17, 2013

Calculating confidence intervals: straight Python is as good as scipy.stats.scoreatpercentile

I would say the most efficient AND readable way of working out confidence intervals from bootstraps is:


Where r is a n x b array where n are different runs (e.g different data sets) and b are the individual bootstraps within a run. This code returns the 95% CIs as three numpy arrays.

Confidence intervals can be computed by bootstrapping the calculation of a descriptive statistic and then finding the appropriate percentiles of the data. I saw that scipy.stats has a built in percentile function and assumed that it would work really fast because (presumably) the code is in C. I was using a simple minded Python/Numpy implementation by first sorting and then picking the appropriate percentile data. I thought this was going to be inefficient timewise and decided that using scipy.stats.scoreatpercentile was going to be blazing fast because
  1. It was native C
  2. It was vectorized - I could compute the CIs for multiple bootstrap runs at the same time
  3. It could pick out multiple percentiles (low and high ends) at the same time.
Funnily enough, my crude measurements showed that the dumb implementation using numpy.sort is just as fast as the builtin one. Well, silly me: it turns out that scipy.stats.scoreatpercentile calls scipy.stats.mquantiles which simply does numpy.sort. I guess I should have thought of that, since sorting is the real bottle neck in this operation and numpy.sort is as efficient as you can get since that's implemented in C.

python | grep 'function calls'
         38 function calls (36 primitive calls) in 0.001 seconds
         12 function calls in 0.001 seconds
         38 function calls (36 primitive calls) in 0.001 seconds
         12 function calls in 0.001 seconds
         38 function calls (36 primitive calls) in 0.876 seconds
         17 function calls in 0.705 seconds
         38 function calls (36 primitive calls) in 0.814 seconds
         17 function calls in 0.704 seconds

import pylab, cProfile, scipy.stats as ss

def conf_int_scipy(x, ci=0.95):
  low_per = 100*(1-ci)/2.
  high_per = 100*ci + low_per
  mn = x.mean()
  cis = ss.scoreatpercentile(x, [low_per, high_per])
  return mn, cis

def conf_int_native(x, ci=0.95):
  ci2 = (1-ci)*.5
  low_idx = int(ci2*x.size)
  high_idx = int((1-ci2)*x.size)
  return x.mean(), x[low_idx], x[high_idx]

def conf_int_scipy_multi(x, ci=0.95):
  low_per = 100*(1-ci)/2.
  high_per = 100*ci + low_per
  mn = x.mean(axis=0)
  cis = ss.scoreatpercentile(x, [low_per, high_per],axis=0)
  return mn, cis

def conf_int_native_multi(x, ci=0.95):
  ci2 = (1-ci)*.5
  low_idx = int(ci2*x.shape[1])
  high_idx = int((1-ci2)*x.shape[1])
  mn = x.mean(axis=1)
  xs = pylab.sort(x)
  cis = pylab.empty((mn.size,2),dtype=float)
  cis[:,0] = xs[:,low_idx]
  cis[:,1] = xs[:,high_idx]
  return mn, cis

r = pylab.randn(10000)'conf_int_scipy(r)')'conf_int_native(r)')

r = pylab.randn(10000)'conf_int_scipy(r)')'conf_int_native(r)')

r = pylab.randn(1000,10000)'conf_int_scipy_multi(r)')'conf_int_native_multi(r)')

r = pylab.randn(1000,10000)'conf_int_scipy_multi(r)')'conf_int_native_multi(r)')

Sunday, September 15, 2013

Three coding fonts

Coding fonts should:
  1. Look good at small sizes, (10-11 pt) - you can see more code in your window
  2. Have good distinction between characters, especially (O,0), (i,l), (l,1)(`,') - your programs have enough bugs already
Three fonts that I have tried out and that work for me are, in order
  1. Anonymous Pro - Looks good even at 10pt
  2. Monaco
  3. Consolas

Anonymous Pro 11pt

Monaco 11pt

Consolas 11pt

Saturday, September 14, 2013

D5100: More notes

It took me a little bit to get warmed up to the concept, but now I definitely see the potential for using DSLRs for movie making. Camcorders (in the price range I would consider) are fitted with single lenses (probably a superzoom) with average optical quality. Their smaller sensor size means a much noisier low light performance.

With this cheap DSLR I can put on my cheap 50mm/1.8 and get HD movies that look 'arty' because I opened the lens up wide. I can take movies in indoor lighting. I can take videos of my cat that look like something showing at Sundance. It really opens up for creativity.

My only gripe is the auto focus. It's not that it is slow, it's that I can't get it to do what I want, but perhaps I want too much. The AF, with a decent lens, like the 35mm/1.8 AF-S, is fast enough and silent enough. The kit lens is atrocious in this department. My gripe is that I just could not figure out how to efficiently get it to track my subject (my cat).

My assumption was that with AF-F and subject track I would be able to focus on the cat and then when he moved focus would follow. Well, not quite. The logic seemed to be to focus on what ever was in the focus box. I need more practice with this to figure out what to do.

Ok, you know how guys get a bad rap because they don't read the manual? I'm one of those guys. This ain't your grandpapy's SLR. You flick the switch to live view. You set to AF-C and select 'Tracking' for mode. You use the eight-way switch to place the green cursor over your subject. Then you hit OK. This registers the pattern in the targeting computer and the square will follow the subject around the screen and keep focus on it. I think this is what fighter pilots use to target their AGMs.

In the meanwhile, manual focus works a blast, and I can see this working just fine stopped down to f3.4 or so for causal home movies.

I read somewhere that movie makers actually use manual focus. I remember seeing documentaries about film making and the camera-man's assistant was always scuttling around with a tape measure and the actors had lines on the floor indicating where they were to stand for the low DoF shots, so perhaps no AF logic can really understand what YOU want in focus in a moving scene and so you need to do it manually.

Battery life
Normal battery life seems to be fine. I say seems because I have been using the video a bit and, man, does this chew through the battery fast. I had a fully charged battery and I had it down to the last bar within 30min of playing with video, making about 4 videos in this time and having live view on. I suspect that if you are using live view a bunch/making videos you need to find a very high capacity battery or have a bunch of spares. On the days I went out and shot mostly stills, I did not have any problems, the battery went through 100 shots without changing a bar.

I've been using a cheap(er) Wasabi battery and it has worked fine so far. Some people complained that their D5100 did not recognize the battery, or did not recognize it after the first charge, but so far, so good.

The view finder now carries complete exposure information. You see shutter speed, f-number AND ISO in the viewfinder. This is also customizable, which makes it a nice step up from the D40's viewfinder. For some reason, I find it easier to manually focus my 50mm/1.8 on the D5100. I suspect that it is the rangefinder display that is aiding me, though the added brightness may be helping.

My use of the viewfinder is primarily to check focus and secondarily to compose. I'm OK with a little composition slop because I feel I can crop later (I don't worry too much about sticking to standard aspects). So if the coverage (how much of the scene is visible in the viewfinder) is less than 100% I don't care - I'm possibly getting some junk on the edges which I will crop away if needed. But I worry about focus and this matters a lot in low-light. So my intuition was that viewfinders with more magnification are better to see details which are important for focus.

Then I went through some specs on film and digital cameras:

F65 (Film camera, entry level) - 0.68x, 89% (Pg 106 of manual)
F4 (Film, top end, interchangeable viewfinders) -  0.7x, 100% (stock), 6x (special) (Pg 107 of manual)
D40 (Digital, entry) - 0.80x, 95%
D5100 (Digital, entry) - 0.78x, 95%
D4 (Digital, top end)- 0.70x, 100%

That's odd. Why would the top camera have LESS magnification than the entry level ones?

I read a nice description of basic viewfinder specs on luminous landscape and on stack exchange photography. I always thought that a larger magnification was better, but Stan Roger's answer on stack exchange made me realize that as the magnification gets larger the image gets dimmer, so there is a tradeoff between detail and low light visibility (other trade-offs are mentioned on the luminous landscape article).

This is a double hit: it is in low-light, when your f-number is low that accurate focus is most important (and your AF is likely to give up) but your viewfinder has to tradeoff between being visible and giving you detail.

So, I guess, in the D4 and other FF cameras, the designers went for better visibility (users often describe the viewfinders as brighter) deciding that that would help manual focus with 0.7x mag giving enough acuity for focus.

As a sidenote - where are our interchangeable viewfinders for digital cameras? I don't think even the D4 "flagship" has the ability to pop-off the stock pentaprism assembly and place another viewfinder.

Wednesday, September 11, 2013

A script to clear and disable recent items in Mac OS X doc

From a hint here.

Mac OS X has the annoying feature of remembering your application history in the dock and not erasing the history when you erase it from the application preferences.

The following is a little bash script that does this for you provided you pass the name of the application (e.g. to it.

#!/bin/bash -x
BUNDLEID=$(defaults read "/Applications/$1/Contents/Info" CFBundleIdentifier)
defaults delete "$BUNDLEID.LSSharedFileList" RecentDocuments
defaults write "$BUNDLEID" NSRecentDocumentsLimit 0
defaults write "$BUNDLEID.LSSharedFileList" RecentDocuments -dict-add MaxAmount 0

You need to run killall Dock after this to restart the dock for the changes to take effect.

Sunday, September 1, 2013

The nikon D5100 (as seen by a D40 shooter)

The D5100 is rather old news now. People are either ogling the m4/3 cameras (I know I am) or looking at Nikon's new models such as the D5200. However, I recall, when the D5100 first came out, and I was the owner of a D40, I badly wanted the high ISO performance and the video.

Well, enough time has passed that the D5100 is now at a sweet price point (especially the refurbished ones) that I did get myself one. There are tons of comprehensive D5100 reviews out there, this will be a short collection of very subjective thoughts from a D40 owner.

What kind of photographer am I?
Well, I'm a casual shooter. A few pics are up on flickr, but I mostly shoot family and don't really put up pictures on web galleries. My favorite subject is the human face in the middle of its many fleeting expressions.

High ISO performance
I'm very happy. Experts on sites such as dpreview complained that noise rendered D5100 photos above 1600 unusable. I was already impressed by the D40's ISO 1600 performance so I didn't pay any attention to that.

I'm extremely happy with Hi-1 (ISO 12800) and I think even ISO 25600 is better than the D40's Hi-1 (3200). This is candle light shooting, if you have an f1.8 lens. I think I'm going to have a lot of fun with the 35mm f1.8 mounted on the D5100. In addition, Nikon's auto-ISO logic will extend up to 25600 (In the D40 you would have to manually go to 3200 if you wanted that). The night shot (ISO 102400) is fun but I think the lens focuses at infinity.

Digital camera ISOs are an example of how easily we become spoilt. When I shot film the highest I ever shot was 800. I'd see 400 ISO film and salivate. Now I'm like, man, pictures aren't perfect at 25600, things could be better. WHAT MORE DO I WANT?!

Focus points
I used to focus and recompose but then for some reason I got it into my head that this is inaccurate.  While technically true, this only becomes an issue when shooting wide-open and medium to close distance subjects (relative to lens reach).

If you shoot wide open (say f1.8) hand held with the 35mm, trying for a subject at 1m, three things work against focus and recompose: a) The actual geometrical theory: the focal surface is actually a bowl and not flat like the sensor b) You don't actually swivel the camera in place - you shift a little. c) The subject moves during this time.

Once you stop down a little bit, or the subject is further off, the DOF is enough to compensate for focus-recompose

The D5100 has more focus points, but far fewer than the D5200. It's enough for me. I wasted enough time jogging the 4-way switch with the D40's three focus points. Sometimes, you focus-recompose and just shoot. It's better to get a little softness than miss the moment. Interestingly, cycling through the D5100's 11 points is not slower than the D40's, so I appreciate that.

The D5100 is really a step up in this department because now I have focus points above and below the midline (the D40 has three points horizontally along the midline) and this is perfect for both portrait and landscape orientation.

I am told the real utility of denser focal points is during shooting moving subjects and using AF-C, where the focus will predictively follow your subject. I'm impressed, but skeptical. Need to try this out.

Manual focus
I had forgotten this, but another reason I wanted the more advanced Nikons is that they had a focus rangefinder that you can use with manual focus lenses. I've tried this with my 50mm f1.8 and it works pretty well. This is a nice bonus I had not thought about when stepping up.

I find the video quality very good. My other alternative is a 8 year old Canon powershot, so my standards are modest. The video fills the screen of my laptop, and is sharp and saturated. I can't ask for more. What I was a little disappointed in at first was the focus. The hunting was a bad throwback to my compact camera days (which was a big reason why I saved up for a DSLR). However, in decent light (where our criterion for decent is now bathroom or bedroom illumination) the focus is OK. It takes about 1s for the lens (35mm f1.8) to settle down but I can see fast moving objects being a problem on a wide open lens.

This bothered me a bit. The mic, being in the body, picks up the sounds from dial adjustment, button presses and of course the focus motor on the lens. This I would definitely say, for even casual shooters: get an external mike. I was pleasantly surprised to find a speaker in the camera for playback.

Compared to m4/3
For the longest time I was looking at the m4/3 cameras, and I was looking very closely at the Olympus line, especially the E-PM-2 and the E-PL-5. High ISO, tiny size, HD video and fast AF, what's not to like?

There were two factors that led me back to Nikon DSLR: a love of shooting through the viewfinder and a lack of familiarity with the m4/3 system.

I have discovered that while I'm fine shooting with an LCD screen, I'm more comfortable using a viewfinder. I'm not sure how I would enjoy using an EVF, and that would add greatly to the cost and a little to the size of the Olympus.

I wasn't sure how quickly you could start up an E-PM-2 and shoot from standby mode. Can you leave it for ever in standby, like you can leave a DSLR?

I have a very small set of F-mount lenses which I really like, I'm not demanding at all, but I really love the 35mm f1.8 AF-S and the D40 kit lens is just fine for well lit situations where the sun is not in the front. In addition I have a 50mm f1.8 with no focus motor. It wasn't clear to me how expensive it would be for me to get similar quality lenses for the m4/3.

Lastly, while the photo experts were raving about the speed of the Olympus AF system and in principle it seemed awesome to be able to hold the camera out and then tap on the screen to focus and shoot, I just wasn't sure. If I was motivated enough I would have gone into a camera store and played with the camera for a while.

It's possible I'm passing up an opportunity to get a tiny camera system with great optics and superb video for a bulkier system with poorer video, but I'm comfortable with the price-performance point of the D5100.

UPDATE: I really liked this review at Andy's blog.

Pandas: HDF: Strange corruption with tables wider than 1306 columns

Friday, August 30, 2013

Pandas: Hierarchical index and frame_table

UPDATE: I have to hand it to @jreback who is a bug fixing ninja. The bug has been fixed for 0.13, though data_columns will not handle multi index.

Wednesday, August 28, 2013

HDF5 (and Pandas using HDF5) is row oriented

From a nice hint here, and the docs here:

When you use pandas + HDF5 storage it is convenient to generate one table that is the 'selector' table that you use to index which rows you will select. Then you use that to retrieve the bulk data from separate tables which have the same index.

Originally I was appending columns to the main table, but there is no efficient way of doing that when using HDF5 (appending rows is efficient). Now I'm just creating new tables for the data, keeping the original index.

An efficient way to store pandas data

OK, after much belly aching I have a decent work flow for when I want to use Pandas which is actually quite convenient. Firstly, Pandas shines for when I have heterogeneous data (mixed types) that form nicely into columns and where I need to select out a subset of rows because they satisfy certain conditions.

UPDATE: Fixed confusion between 'table' and 'store'
UPDATE: Include note about how to set data columns

The basic steps are these
  1. Use table=True in .put or .to_hdf to indicate that you want the data stored as a frame_table that allows on-disk selection and partial retrieval
  2. Use data_columns= [...] during saving to identify which columns should be used to select data
You need to do BOTH steps to have a working selectable-table-on-disk.
  • If you do not use table=True you will get TypeError: cannot pass a where specification when reading from a non-table this store must be selected in its entirety
  • If you do not declare data_columns you will get ValueError: query term is not valid [field->...,op->...,value->...]

import pandas as pd

store = pd.HDFStore('filename.h5')

df = pd.DataFrame( ... ) #Construct some dataframe
#Save as a frame_table in filename.h5 and declare some data columns 
#append creates a table automatically 
store.append('data1', df, data_columns=[...]) 

df = pd.DataFrame( ... ) #Construct another dataframe 
#Put requires an explicit instruction to create a table
store.put('data2', df, table=True, data_columns=[...]) #This is convenient - it now adds a second node to the file 
Now you can use the battery of select methods (outlined here) to load just selected parts of the data structures.

h5py and pandas for large array storage

I've actually gone back to pure hdf5 (via the h5py interface) for storing and accessing numerical data. Pandas via PyTables started to get too complicated and started to get in the way of my analysis (I was spending too much time on the docs, and testing out cases etc.).

My application is simple. There is a rather large array of numbers that I would like to store on disk and load subsets of to perform operations on cells/subsets. For this I found pandas to be a bad compromise. Either I had to load all the data all at once into memory, or I had to go through a really slow disk interface (which probably WAS loading everything into memory at the same time). I just don't have the luxury to fight with it so long.

I'm seeing that pandas has a (kind of) proper way of doing what I'm doing, but in h5py it just seems more natural and less encumbering :(

UPDATE: So, as previously mentioned, Pandas shines as a database substitute, where you want to select subsets of data based on some criterion. Pandas has a method (to_hdf) that will save a dataframe as a PyTables table that DOES allow you to do efficient sub-sampling without loading everything onto disk using the 'select' method, but even that is pretty slow compared to directly pulling things using h5py (and cumbersome). But it works really nicely for actual conditional select statements. Code updated to reflect this.

Timing information for randomly accessing 1000 individual cells from a 1000x1000 array of floats
h5py                - 0.295 s
pandas frame        - 14.8 s   (reloaded table on each lookup probably)
pandas frame_table  - 3.943 s  
python | grep 'function calls'
         95023 function calls in 0.295 seconds
         711312 function calls (707157 primitive calls) in 14.808 seconds
         1331709 function calls (1269472 primitive calls) in 3.943 seconds

import h5py, pandas as pd, numpy, cProfile

def create_data_files():
  r = numpy.empty((1000,1000),dtype=float)
  df = pd.DataFrame(r)

  with pd.get_store('pandas.h5','w') as f:
    f.append('data', df)

  with h5py.File('h5py.h5','w') as f:
    f.create_dataset('data', data=r)

def access_h5py(idx):
  with h5py.File('h5py.h5') as f:
    for n in range(idx.shape[0]):

def access_pandas(idx):
  with pd.get_store('pandas.h5') as f:
    for n in range(idx.shape[0]):

def slice_pandas(idx):
  with pd.get_store('pandas.h5') as f:
    for n in range(idx.shape[0]):'data', [('index', '=', idx[n][0]), ('columns', '=', idx[n][1])])

idx = numpy.random.randint(1000,size=(1000,2))'access_h5py(idx)')'access_pandas(idx)')'slice_pandas(idx)')

Use Enthought Canopy outside of their environment

From hints on their blog and other places:

Canopy installs a virtual environment. The environment activate command is located at ~/Library/Enthought/Canopy_64bit/User/bin/activate. An easy way to access this environment is to alias it in your start up file eg:

# For canopy ipython
alias canpy='source ~/Library/Enthought/Canopy_64bit/User/bin/activate'

When in the environment use deactivate to exit.

I'm using Canopy because I found it insanely annoying to install hdf5 and h5py on Mac 10.7.5
I think my next laptop will be linux...

Tuesday, August 27, 2013

Pandas: brief observations

After using Pandas for a little bit, I have a few observations:
  1. Pandas is great for database like use. When you have tabular data from which you would like to efficiently  select sub-tables based on critera, Pandas is great.
  2. Pandas is great for time-series like data, where the rows are ordered. In such cases pandas allows you to combine multiple tables, or plot, or do analyses based on the time series nature of the rows.
  3. Pandas, however, is a little unwieldy when you wish to add rows (adding columns is very easy) and in data manipulation in general

Monday, August 19, 2013

Each access of a Pandas hdf5 store node is a re-copy from the file

This is obvious, but it is important to remember.
import pandas as pd, pylab, cProfile

def create_file():
  r = pylab.randn(10000,1000)
  p = pd.DataFrame(r)

  with pd.get_store('test.h5','w') as store:
    store['data'] = p

def analyze(p):
  return [(p[c] > 0).size for c in [0,1,2,3,4,5,6,7,8,9]]

def copy1():
  print 'Working on copy of data'
  with pd.get_store('test.h5','r') as store:
    p = store['data']
    idx = analyze(p)
    print idx

def copy2():
  print 'Working on copy of data'
  with pd.get_store('test.h5','r') as store:
    idx = analyze(store['data'])
    print idx

def ref():
  print 'Working on hdf5 store reference'
  with pd.get_store('test.h5','r') as store:
    idx = [(store['data'][c] > 0).size for c in [0,1,2,3,4,5,6,7,8,9]]
    print idx

When run with python | grep "function calls" gives us
         5340 function calls (5256 primitive calls) in 0.094 seconds
         2080 function calls (2040 primitive calls) in 0.048 seconds
         2080 function calls (2040 primitive calls) in 0.050 seconds
         5661 function calls (5621 primitive calls) in 0.402 seconds
So, if you are going to do multiple operations on the data in a node it is better to copy it over once (if you have the memory).

Friday, August 16, 2013

Pandas: presence of a NaN/None in a DataFrame forces column to float

import pandas as pd
a = [[1,2],[3,4]]
df = pd.DataFrame(a)

   0  1
0  1  2
1  3  4

df.values ->
array([[1, 2],
       [3, 4]])

df.ix[1].values ->
array([3, 4])

a = [[1,None],[3,4]]
df = pd.DataFrame(a)

   0   1
0  1 NaN
1  3   4

df.values ->
array([[  1.,  nan],
       [  3.,   4.]])

df[0].values ->
array([1, 3])

df[1].values ->
array([ nan,   4.])

df.ix[1].values ->
array([ 3.,  4.])

df[0][1] -> 3
df[1][1] -> 4.0
This threw me because I have a data structure that is all ints, but I have a few Nones on one column and that column was suddenly returned as floats.
As you can see it's just the relevant column that is forced to float.

Thursday, August 15, 2013

Pandas and PyTables: Variable assignment forces copy

I wish to report Pandas to the house unPythonic activities committee. Remember how in Python assignments are by reference rather than value i.e. when you do something like:
a = b
what python does is it creates a reference from a to b (except for very simple objects like integer).

This is what tripped me up when I was learning Python. For example
In [2]: a = {'age': 90, 'weight': 400}

In [3]: b = a

In [4]: a
Out[4]: {'age': 90, 'weight': 400}

In [5]: b
Out[5]: {'age': 90, 'weight': 400}

In [6]: b['age'] = 20

In [7]: b
Out[7]: {'age': 20, 'weight': 400}

In [8]: a
Out[8]: {'age': 20, 'weight': 400}
As you can see, changing
because the assignment creates a reference.

Now, when I was working with Pandas and its built in PyTables interface I learned the hard way that when you assign a variable to an element of a hdf5 store it copies the data from the hdf5 store into the variable, rather than creating an assignment.

If you run the following code you will find that the assignments are actually copying the data from disk into the variables, rather than passing out references to the data in the hdf5 file.
import pandas as pd, pylab, cProfile

def create_file():
  r = pylab.randn(10000,1000)
  p = pd.DataFrame(r)

  with pd.get_store('test.h5','w') as store:
    store['data1'] = p
    store['data2'] = p
    store['data3'] = p

def load_file():
  print 'Working on copy of data'
  with pd.get_store('test.h5','r') as store:
    p1 = store['data1']
    p2 = store['data2']
    p3 = store['data3']
    print p1[10]

def get_file():
  print 'Working on hdf5 store reference'
  with pd.get_store('test.h5','r') as store:
    print store['data1'][10]

A sample output is:
python | grep 'function calls'
         11109 function calls (10989 primitive calls) in 0.329 seconds
         7278 function calls (7238 primitive calls) in 0.053 seconds
         9540 function calls (9420 primitive calls) in 0.138 seconds
         7278 function calls (7238 primitive calls) in 0.054 seconds
Disregarding the first call, which includes some strange startup code, we see that load_file that assigns variables p1,p2,p3 to the hdf5 nodes ends up copying the whole data over, which is why it takes so long to execute, even though those nodes are actually not accessed.

Monday, August 12, 2013

Sleep number bed LCDs are defective

We bought a sleep number bed about 5 years ago. These beds come with a '20 year' warranty which sounds awesome, because it makes one think that a) The bed's are made well for the company to give such a warranty and b) It's a nice warranty.

Well, it's not THAT great. About 2 years ago the LCD display on the controller started to go on the fritz. It started with one segment of one digit and then progressed until a few weeks ago the display was simply blank. I did a quick search on the internet and it turns out that this is a very common problem.

We have a wired controller (because it was cheaper, I guess, it was a while ago). The refurbished replacement is going to cost us $60 with shipping and the original one would have cost us $140 or so. It does seem that we are getting a nice discount on their catalog price, but I don't think this is such a good deal.

Any how, the pump is working fine, so the actual cost of the controller was probably $10 or so, so I'm not entirely happy, unless it turns out the replacement is a double controller (It says Dual I-Series, I don't know what that means) and is wireless.

Saturday, August 10, 2013

Manipulating pandas data structures

I really enjoy using the Pandas Series and DataFrame objects. I find, however, that methods to update the series/frame are clunky. For a DataFrame it's pretty easy to add columns - you create a DataFrame or a Series and you just assign it. But adding rows to a Series or DataFrame is a bit clunky.

I sometimes have the need to modify a certain row with new data or add that row if it does not exist, which in a database would be a 'replace or insert' operation. You can concat or append another Series or DataFrame but I have not found a nice way of handling the 'replace or insert' case.

If the structure is small I simply convert it into a dictionary and manipulate the structure using the dictionary keys and then recreate the pandas structure.

If the structure is large I do an explicit test for the index (row) and then decide whether to append or replace.

Thursday, August 8, 2013

DSLR vs compacts/micro four thirds

I'm what the marketing department at camera companies call an 'enthusiast'. Previously I would be called an amateur, but I guess 'enthusiast' doesn't have the stigma of 'clueless' that amateur now has. I don't make money of photos and I take photos for pleasure and for memories.

I bought my DSLR when DSLR prices were plunging off a cliff, that is after all the professionals had subsidized sensor and lens development. I bought the D40. I got a DSLR for the following characteristics:
  1. Low shutter lag. This was probably the biggest deal for me. I like to capture the fleeting expressions on human faces and the compact was very frustrating with the long lag between focusing and then taking the picture.
  2. Good low light performance. The D40 works just fine for me upto 1600 ISO. ISO 3200 is very noisy and adding a nice prime lens that goes out to f1.8 added a lot of artistic scope and improved low light performance.
The downside of even a small DSLR like the D40 is that it is large and conspicuous and not that quick to whip out when you need it.

This has turned my attention to the micro four thirds family. The larger sensor sizes are a great step up from compacts, but the form factors are so small! They also have interchangeable lenses.

Shutter lag is still a concern, but one thing I realised after using the D40 is that in low light (when a lot of my people portraits are done, round dinner tables and indoors) I have a long effective shutter lag because the focusing in low light is an issue.

What I depend a lot on in such situations is to focus on a sharp edge and then shoot a burst. Instead of waiting for the right moment, I estimate when the moment is going to come up and then hope that one of the images in the burst will carry the hidden expression.

The new 4/3 cameras I am seeing do bursts, do better ISO than the D40, are smaller/lighter AND they do movies, so I'm pretty sure my next camera is not going to be the D5100 (I was waiting for the price to drop steeply, or to find a refurbed one) but rather one of the 4/3s family.

UPDATE: I just found Thom Hogan's guide to m4/3. The guide is very useful.

Wednesday, July 24, 2013

Closing ssh tunnel

ps aux | grep ssh
to get list of processes matching ssh
kill [proc id]
matching the tunneling

This frees up the port.

Friday, July 19, 2013

Bug in Tk Listbox

Run the script below. Clicking on the window will fire off <<ListboxSelect>> events even though the widget is disabled. Keyboard actions do not fire this event.

import Tkinter as tki

def selection_changed(event):
  print 'Selection changed'

root = tki.Tk()
listbox = tki.Listbox(root, selectmode=tki.BROWSE)
listbox.pack(side='left', fill='both', expand=True)
listbox.bind('<>', selection_changed)

Python bug ticket:  18506  (The python guys bounced me to the Tcl/Tk guys)
Tcl/Tk bug ticket: 67c8e8bd71

UPDATE: Is the same as a previously reported bug (1288433). Given that is has been open for 7 years, with a seemingly minor fix, it would appear that it won't be fixed any time soon. The work around is to check to see if the listbox is disabled before processing the event.

Friday, July 12, 2013

Run IPython notebook on remote server

This comes in very useful if you want to run your notebook on a remote machine (e.g. your data is on that machine, or the machine is a lot faster than your own). From hints here.
  1. Start ipython notebook on remote machine: ipython notebook --pylab inline --no-browser --port=7000 
  2. Setup tunneling on local machine: ssh -N -f -L localhost:7000:localhost:7000 login@the.remote.machine 
  3. Open up localhost:7000 on your browser

Thursday, July 11, 2013

Install latest Ipython

git clone
cd ipython
python install --user
Don't forget to get rid of the other two or three installations of Ipython, as the case may be (It was three in my case). Also, don't forget to get latest dependencies

Python subprocess, Popen and PIPE

Typically when using Python's subprocess we use PIPEs to communicate with the process. However, it turns out, PIPEs suck when the data gets even slightly large (somewhere in the vicinity of 16K). You can verify this by running the following test code:
from subprocess import Popen, PIPE
import argparse, time

def execute(n):
  p = Popen(['python', '', '-n', str(n)], stdin=PIPE, stdout=PIPE, stderr=PIPE)

if __name__ == "__main__":
  parser = argparse.ArgumentParser()
  parser.add_argument('-n', type=int)
  args = parser.parse_args()
  if args.n is not None:
    print '0'*args.n
    for n in [10,100,1000,10000,12000,16000,16200, 16500]:
      t0 = time.clock()
      print n, time.clock() - t0
The output is
10 0.001219
100 0.001254
1000 0.001162
10000 0.001362
12000 0.001429
16000 0.001305
16200 0.00121
(Hangs after this)
The way to handle this is to generate a temporary file and let the process write to the file. A detailed note can be found here.
def execute_long(n):
  with open('query.txt','w') as stdout:
    p = Popen(['python', '', '-n', str(n)], stdin=PIPE, stdout=stdout, stderr=PIPE)
  with open('query.txt','r') as stdout:

if __name__ == "__main__":
  parser = argparse.ArgumentParser()
  parser.add_argument('-n', type=int)
  args = parser.parse_args()
  if args.n is not None:
    print '0'*args.n
    for n in [10,100,1000,10000,12000,16000,16200, 16500]:
      t0 = time.clock()
      print n, time.clock() - t0
10 0.001601
100 0.001263
1000 0.001272
10000 0.001404
12000 0.001419
16000 0.001333
32000 0.001445
64000 0.001692
128000 0.001763

Saturday, July 6, 2013

github gh-pages original markdown is stored in params.json

As you know github lets you put up webpages for your projects and these are stored in branch of your repository called 'gh-pages'.

Github also lets you write the page in Markdown and then converts it into html automatically. I am thrilled by this as you can also import your file from your main project. I was also impressed by the fact that you can go back to the automatic page generator and reload the page as markdown and edit it. But I could not find the markdown source - all I saw was index.html and I wondered what magic github did to reverse convert html to markdown. This puzzled me because it did not look like a reversible operation.

Well, the secret is in the params.json file. The markdown, site title and tagline are in this file!

Friday, July 5, 2013

Mac OS X: make a screen cast with no additional software

I recently learned that on Mac OS X (> 10.6) it is possible to create decent screen casts using only the built in utilities (From hints here and here):

For the basic screen Quicktime is sufficient. Open up Quicktime and go to File->New Screen Recording. A small control panel will open that allows you to control recording. The small dropdown arrow on the right gives access to various recording options, including sound. When you are ready hit the record button. QT will tell you to click to start right away recording the whole screen, or drag your mouse and select a part of the screen to record from. Then you get a 'start' button which you should click to start recording. If you have activated voice recording you can see your voice level in the control panel.

If you want to visualize your keystrokes on the screen (and don't want to spend money on separate software that does this in a fancy way) you can do the following: Go to System Preferences->Keyboard. Check 'Show keyboard and character viewers in menu bar'. This shows a flag on your menu bar corresponding to the current language. Click the flag and check 'Show keyboard viewer'. This will show a schematic keyboard on your screen which lights up with the keys you press. There is no persistence, so it is not as apparent as specialized software made for screen casts.

Additionally, if you have a cluttered desktop, from a hint here, you can hide all the icons as follows: In a terminal type defaults write CreateDesktop -bool false And then restart Finder. To unhide the icons type defaults write CreateDesktop -bool true And then restart Finder

Monday, July 1, 2013

Stacking event bindings in tkinter

To stack events in Tkinter one should use the add='+' keyword (from here):
widget.bind("", callback1)
widget.bind("", callback2, add="+")
If you don't use the add='+' in the second call, then the first callback binding gets overwritten

Thursday, June 27, 2013

Oracle's VirtualBox in Mac OS X: "VMMR0.r0 (VERR_SUPLIB_WORLD_WRITABLE)" error

Problem: VirtualBox VM does not start after updating VirtualBox. Says "VMMR0.r0 (VERR_SUPLIB_WORLD_WRITABLE)".

Solution: From here: For some reason the permissions under /Applications/ is changed to world writable. You can fix the permission by running Disc Utility and Repair Permissions. This fixes this problem.

Sunday, June 23, 2013

exiftool batch mode

exiftool has a batch mode. If you pass the argument -stay_open True, exiftool accepts multiple commands. This is invaluable if you call exiftool from another program because you avoid the overhead of loading/unloading the program everytime. exiftool can also return data formatted as JSON, which python knows how to handle, allowing us to pass formatted data back and forth rather easily. An example of this all working together nicely is here.

Running a task in a separate thread in a Tkinter app.

  1. Use Queues to communicate between main thread and sub-thread
  2. Use wm_protocol/protocol to handle quit event
  3. Use Event to pass a message to sub-thread

import Tkinter as tki, threading, Queue, time

def thread(q, stop_event):
  """q is a Queue object, stop_event is an Event.
  stop_event from
  while(not stop_event.is_set()):
    if q.empty():

class App(object):

  def __init__(self):
    self.root = tki.Tk() = tki.Text(self.root, undo=True, width=10, height=1)'left')

    self.queue = Queue.Queue(maxsize=1)
    self.poll_thread_stop_event = threading.Event()
    self.poll_thread = threading.Thread(target=thread, name='Thread', args=(self.queue,self.poll_thread_stop_event))

    self.poll_interval = 250

    self.root.wm_protocol("WM_DELETE_WINDOW", self.cleanup_on_exit)

  def cleanup_on_exit(self):
    """Needed to shutdown the polling thread."""
    print 'Window closed. Cleaning up and quitting'
    self.root.quit() #Allow the rest of the quit process to continue

  def poll(self):
    if self.queue.qsize():
      self.selected_files = self.queue.get(),tki.END), self.selected_files)
    self._poll_job_id = self.root.after(self.poll_interval, self.poll)

app = App()

Saturday, June 22, 2013

Calling a function periodically in Tkinter (Polling)

Use the

method. From a discussion here.

import Tkinter as tki, time

class App(object):

  def __init__(self):
    self.root = tki.Tk() = tki.Text(self.root, undo=True, width=20, height=5)'top', expand=True, fill='both')

  def poll(self):
    print time.strftime('%H:%M:%S'), time.strftime('%H:%M:%S') + '\n')
    self.root.after(1000, self.poll)

app = App()

From a further discussion here, use the after_cancel method to cancel the polling. after returns a job id that you need for the cancellation

import Tkinter as tki, time

class App(object):

  def __init__(self):
    self.root = tki.Tk() = tki.Text(self.root, undo=True, width=10, height=1)'left')
    self.cancel = tki.Button(self.root,text='Stop',command=self.stop)
    self._poll_job_id = self.poll()

  def poll(self):
    print time.strftime('%H:%M:%S'),tki.END), time.strftime('%H:%M:%S') + '\n')
    self._poll_job_id = self.root.after(1000, self.poll)

  def stop(self):

  def go(self):
    self._poll_job_id = self.poll()

app = App()