
Posts

Showing posts from 2013

Quitclaim deeds, land records and all THAT

Looking into some legal stuff I noticed that a deed was labelled 'quitclaim'. I was puzzled by what this meant (it sounded a little shady to me). From the page here it seems that a quitclaim deed is weaker than a warranty deed. A warranty deed states that whoever is giving you the deed is legally obliged to defend any challenges to ownership of the land, regardless of how far back in time the challenge originates. A quitclaim deed obliges the grantor to defend only those challenges to ownership that arose while they owned the property; any challenges that arose before then are excluded. This seems a little shady, because if it is your land and you are selling it, why would you NOT give the full support of ownership that a warranty deed promises? I started to look into land records (for Massachusetts you can go to http://www.masslandrecords.com/ and do a search based on county) and every transfer of that particular piece of land was quitclaim, going back as far as

Docopt is amazing

I love the command line and I love Python. So, naturally, I am an avid user of the argparse module bundled with Python. Today I discovered docopt and I am so totally converted. argparse is great, but there is a bunch of setup code you have to write, things often look boilerplate-y and messy, and it just seems like there should be a more concise way of expressing the command line interface to a program. Enter docopt. docopt allows you to describe your command line interface in your docstring; it then parses this description and creates a command line parser that returns a dictionary with the values for all the options filled in. Just like that. So, for example, one of my scripts has a docstring that looks like:

Usage: compute_eye_epoch [-R DATAROOT] [-x EXCEL] [-d DATABASE] [-e EPOCH] [-f|-F] [-q]

Options:
  -h --help    Show this screen and exit.
  -R DATAROOT  Root of data directory [default: ../../Data]
  -x EXCEL     Spreadsheet with sessions/trials etc
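To make the pattern concrete, here is a minimal, self-contained sketch (the script and its options are made up for illustration, not the compute_eye_epoch script above):

"""Usage: greet.py [-n NAME] [-q]

Options:
  -h --help  Show this screen and exit.
  -n NAME    Name to greet [default: world]
  -q         Be quiet.
"""
from docopt import docopt

if __name__ == '__main__':
    args = docopt(__doc__)        # parse sys.argv against the usage pattern above
    if not args['-q']:
        print('Hello, {}!'.format(args['-n']))

docopt(__doc__) returns a plain dictionary, e.g. {'-n': 'world', '-q': False, '--help': False}, which you then use directly in your program.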

Database diagrams and sqlite on the cheap

Those diagrams that show you your database tables and the links between them through foreign keys are apparently called Entity Relationship Diagrams (ERDs). I wanted to create one for my sqlite database to keep track of everything, but I'm a cheapskate and didn't want to pay anything. It turns out MySQL Workbench is great for this. You don't need to register with them to download the program. You don't need a MySQL database running for this. I simply followed these steps:
- From the sqlite3 command line I typed .schema, which printed the database schema to the console.
- I pasted the schema into a file and saved it.
- I used Import from MySQL Workbench to parse the schema and place it on a diagram.

The Autolayout feature is pretty good and probably optimizes for visual appeal, but I spent a few minutes changing the layout to what I think worked logically in my head and also minimized connection overlaps. The translation from sqlite3 to MySQL dialects is smooth. My
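If you would rather script the schema dump than copy it off the sqlite3 prompt, something like this writes the same CREATE statements that .schema prints (a sketch; the database filename is made up):

import sqlite3

con = sqlite3.connect('my_database.sqlite')
with open('schema.sql', 'w') as f:
    # sqlite_master holds the original CREATE statements for tables and indices
    for (sql,) in con.execute("SELECT sql FROM sqlite_master WHERE sql IS NOT NULL"):
        f.write(sql + ';\n')
con.close()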

Store numpy arrays in sqlite

Use numpy.getbuffer (or sqlite3.Binary) in combination with numpy.frombuffer to lug numpy data in and out of the sqlite3 database:

import sqlite3, numpy
r1d = numpy.random.randn(10)
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE eye(id INTEGER PRIMARY KEY, desc TEXT, data BLOB)")
con.execute("INSERT INTO eye(desc,data) VALUES(?,?)", ("1d", sqlite3.Binary(r1d)))
con.execute("INSERT INTO eye(desc,data) VALUES(?,?)", ("1d", numpy.getbuffer(r1d)))
res = con.execute("SELECT * FROM eye").fetchall()
con.close()
#res ->
#[(1, u'1d', <read-write buffer ptr 0x10371b220, size 80 at 0x10371b1e0>),
# (2, u'1d', <read-write buffer ptr 0x10371b190, size 80 at 0x10371b150>)]
print r1d - numpy.frombuffer(res[0][2])
#->[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
print r1d - numpy.frombuffer(res[1][2])
#->[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

Note that for work where data ty
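The snippet above is Python 2 (print statements, numpy.getbuffer). A rough Python 3 equivalent of the same round trip, using tobytes() and frombuffer() - a sketch, not the post's code - would be:

import sqlite3, numpy

r1d = numpy.random.randn(10)
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE eye(id INTEGER PRIMARY KEY, desc TEXT, data BLOB)")
# store the raw bytes of the array as a BLOB
con.execute("INSERT INTO eye(desc,data) VALUES(?,?)", ("1d", sqlite3.Binary(r1d.tobytes())))
blob = con.execute("SELECT data FROM eye").fetchone()[0]
con.close()
# rebuild the array; you must know the dtype (and, for >1-D arrays, the shape)
restored = numpy.frombuffer(blob, dtype=r1d.dtype)
print(r1d - restored)  # -> all zeros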

Pandas, multiindex, date, HDFstore and frame_tables

Currently, if you have a dataframe with a multiindex with a date as one of the indexers you can not save it as a frame_table. Use datetime instead.

import pandas as pd, numpy, datetime
print pd.__version__ #-> 0.13.0rc1

idx1 = pd.MultiIndex.from_tuples([(datetime.date(2013,12,d), s, t) for d in range(1,3) for s in range(2) for t in range(3)])
df1 = pd.DataFrame(data=numpy.zeros((len(idx1),2)), columns=['a','b'], index=idx1)
#-> If you want to save as a table in HDF5 use datetime rather than date

with pd.get_store('test1.h5') as f:
    f.put('trials',df1) #-> OK

with pd.get_store('test2.h5') as f:
    f.put('trials',df1,data_columns=True,format='t')
#-> TypeError: [date] is not implemented as a table column
#-> Solution is to use datetime

Update: Thanks to Jeff again for the solution
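A sketch of the fix with a more recent pandas (to_hdf with format='table' plays the role of the get_store/put calls above); the point is simply that the index level is a datetime.datetime rather than a datetime.date:

import pandas as pd, numpy, datetime

idx = pd.MultiIndex.from_tuples(
    [(datetime.datetime(2013, 12, d), s, t)
     for d in range(1, 3) for s in range(2) for t in range(3)])
df = pd.DataFrame(numpy.zeros((len(idx), 2)), columns=['a', 'b'], index=idx)
df.to_hdf('test1.h5', 'trials', format='table')  # works: datetime maps to a table column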

h5py and multiprocessing

The HDF5 format has been working awesome for me, but I ran into danger when I started to mix it with multiprocessing. It was the worst kind of danger: the intermittent error. Here are the dangers/issues in order of escalation (TL;DR: use a generator to feed data from your file into the child processes as they spawn. It's the easiest way. Read on for harder ways.)
- An h5py file handle can't be pickled and therefore can't be passed as an argument using pool.map().
- If you set the handle as a global and access it from the child processes you run the risk of racing, which leads to corrupted reads. My personal run-in was that my code sometimes ran fine but sometimes would complain that there are NaNs or Infinity in the data. This wasted some time to track down. Other people have had this kind of problem [1].
- Same problem if you pass the filename and have the different processes open individual instances of the file separately.

The hard way to solve this problem is to sw
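A minimal sketch of the generator idea from the TL;DR - only the parent process touches the HDF5 file, and the workers only ever see plain numpy arrays (the file and dataset names here are made up):

import h5py, numpy
import multiprocessing as mp

def process(chunk):
    # workers receive ordinary arrays, never an h5py handle
    return float(chunk.sum())

def chunks(fname, dset, size=1000):
    # only the parent reads the file, one block at a time
    with h5py.File(fname, 'r') as f:
        n = f[dset].shape[0]
        for i in range(0, n, size):
            yield f[dset][i:i + size]

if __name__ == '__main__':
    with h5py.File('data.h5', 'w') as f:          # toy file for the example
        f.create_dataset('x', data=numpy.random.randn(10000))
    pool = mp.Pool()
    print(sum(pool.map(process, chunks('data.h5', 'x'))))
    pool.close()
    pool.join()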

Daemonic processes are not allowed to have children

You can use the multiprocessing module to 'farm' out a function to multiple cores using the Pool.map function. I was wondering idly if you can nest these farming operations to quickly build an exponentially growing army of processes to, you know, take over the world. It turns out that the universe has a failsafe:

import multiprocessing as mp

def compute(i):
    return i

def inner_pool(args):
    n0, N = args
    pool = mp.Pool()
    return pool.map(compute, range(n0,n0+N))

pool = mp.Pool()
print pool.map(inner_pool, [(n,n+10) for n in range(10)])
# -> AssertionError: daemonic processes are not allowed to have children

h5py: The resizing overhead

A cool thing about data arrays stored in HDF5 via h5py is that you can incrementally add data to them. This is the practical way of processing large data sets: you read in large datasets piece by piece, process them and then append the processed data to an array on disk. An interesting thing about this is that there is a small size overhead in the saved file associated with this resizing, compared to the same data saved all at once with no resizing of the HDF5 datasets. I did the computations using block sizes of [10, 100, 1000, 10000] elements and block counts of [10, 100, 1000, 10000]. The corresponding matrix of overhead (excess bytes needed for the resized version over the directly saved version) looks like this (rows are for the block sizes, columns are for the block counts):

overhead -> array([[   9264,    2064,    3792,     9920],
                   [   2064,    3792,    9920,    52544],
                   [   3792,    9920,   52544,   462744],
                   [   9920,   52544,  462744,  4570320]])

As
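For reference, incremental appending with h5py looks roughly like this (a sketch; the overhead numbers above came from comparing files written this way against the same data written in one shot):

import h5py, numpy

block_size, block_count = 100, 10
with h5py.File('resized.h5', 'w') as f:
    # maxshape=(None,) makes the dataset resizable along its first axis
    dset = f.create_dataset('data', shape=(0,), maxshape=(None,), dtype='f8')
    for i in range(block_count):
        block = numpy.random.randn(block_size)
        dset.resize((dset.shape[0] + block_size,))
        dset[-block_size:] = block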

h5py: the HDF file indexing overhead

Storing numpy arrays in hdf5 files using h5py is great, because you can load parts of the array from disk. One thing to note is that there is a varying amount of time overhead depending on the kind of indexing you use. It turns out that it is fastest to use standard python slicing terminology - [:20,:] - which grabs well-defined contiguous sections of the array. If we use an array of consecutive numbers as an index we get an additional time overhead simply for using this kind of index. If we use an array of non-consecutive numbers (note that the indices have to be monotonic and non-repeating) we get yet another time overhead even above the array with consecutive indexes. Just something to keep in mind when implementing algorithms.

import numpy, h5py
N = 1000
m = 50
f = h5py.File('index_test.h5','w')
f.create_dataset('data', data=numpy.random.randn(N,1000))
idx1 = numpy.array(range(m))
idx2 = numpy.array(range(N-m,N))
idx3 = numpy.random.choice(N,siz
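The setup above is cut off; a self-contained sketch that times the three kinds of indexing being compared could look like this:

import numpy, h5py, timeit

N, m = 1000, 50
with h5py.File('index_test.h5', 'w') as f:
    f.create_dataset('data', data=numpy.random.randn(N, 1000))

f = h5py.File('index_test.h5', 'r')
idx_consec = numpy.arange(m)                                            # consecutive fancy index
idx_spread = numpy.sort(numpy.random.choice(N, size=m, replace=False))  # non-consecutive, but increasing
print(timeit.timeit(lambda: f['data'][:m, :], number=100))              # plain slice: fastest
print(timeit.timeit(lambda: f['data'][idx_consec, :], number=100))      # fancy index, consecutive
print(timeit.timeit(lambda: f['data'][idx_spread, :], number=100))      # fancy index, non-consecutive
f.close()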

Octoshape (octoshapepm)

I was first alerted to this when my mac asked me if I wanted to allow octoshapepm to accept incoming connections. This led me to a web search, which led to the interesting finding that CNN is basically installing a program that uses your computer to redistribute its content, without really telling you that it is doing so. The program itself is made by this company. This article gives a brief non-marketing overview of what the program actually does and how to get rid of it if you wish. In short, as installed by CNN, the program acts as a realtime file distribution system, like bittorrent, except that it's probably running without your permission, helping CNN deliver content using part of your bandwidth (you are uploading video data just as you are downloading it). There are security issues with this in addition to an issue of principle, where you are most likely being tricked into giving up part of your bandwidth to save CNN some money as well as exposing a new security hole. I

Annoying bug in PyTables (affects Big Data analysis with Pandas)

Pandas' HDF5 interface through PyTables is awesome because it allows you to select and process small chunks of data from a much larger data file stored on disk. PyTables, however, has an annoying and subtle bug and I just wanted to point you to it so that you don't have to spend hours debugging code like I did. In short, if you have a DataFrame, and a column of that DF starts with a NaN, any select statements that you run with conditions on that column will return empty (you won't get any results back, ever). There is a workaround, but I chose to use a dummy value instead. This shook my confidence in Pandas as an analysis platform a bit (though it is really PyTables' fault).

R

I like learning languages, and after a little kerfuffle with a Python package I was wondering if there was anything out there for statistical data analysis that might not have so many hidden pitfalls in ordinary places. I knew about R from colleagues but never paid much attention to it, but I decided to give it a whirl. Here are some brief preliminary notes, in no particular order.

PLUS
- Keyword arguments!
- Gorgeous plotting
- Integrated workspace (including GUI package manager)
- Very good documentation and help
- NaN different from NA
- They have their own Journal. But what do you expect from a bunch of mathematicians?
- Prints large arrays on multiple lines with the index number of the first element on each line in the left gutter
- Parenthesis autocomplete on the command line
- RStudio, though the base distribution is pretty complete, with package manager, editor and console.

MINUS
- Everything is a function. I love this, but it means commands in the interpreter always need parentheses. I'd go

Big Data, Small Data and Pandas

Welp, I finally got this through my thick head, thanks to a hint by Jeff who answered my cry for help on stack overflow, and pointed me to this thread on the pandas issues list. So here's my use case again: I have small data and big data. Small data is relatively lightweight heterogeneous table-type data. Big data is potentially gigabytes in size, homogeneous data. Conditionals on the small data table are used to select out rows which then indicate to us the subset of the big data needed for further processing. Here's one way to do things (things to note: saving in frame_table format, common indexing, use of 'where' to select the big data):

import pandas as pd, numpy
df = pd.DataFrame(data=numpy.random.randint(10,size=(8,4)),columns=['a','b','c','d'])
df.to_hdf('data.h5','small',table=True,data_columns=['a','b'])
df1 = pd.DataFrame(data=numpy.random.randint(10,size=(8,20)),index=df.index)
df1.to_h
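The code above is cut off; a sketch of the same selector-table pattern with the current pandas spelling (format='table' instead of table=True, HDFStore instead of get_store) might look like this:

import pandas as pd, numpy

small = pd.DataFrame(numpy.random.randint(10, size=(8, 4)), columns=['a', 'b', 'c', 'd'])
big = pd.DataFrame(numpy.random.randn(8, 20), index=small.index)

small.to_hdf('data.h5', 'small', format='table', data_columns=['a', 'b'])
big.to_hdf('data.h5', 'big', format='table')

with pd.HDFStore('data.h5', 'r') as store:
    # condition on the small table, then pull only those rows of the big table;
    # both tables share the same index/row order, so the coordinates line up
    coords = store.select_as_coordinates('small', 'a > 5')
    subset = store.select('big', where=coords)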

Pulling/pushing data to a samba server from Linux

smbclient (its atrociously formatted man page is here) will let you do what ftp let you do, which is to get and put files from your local machine to a samba server. My use case is that I have a high performance cluster (Partners' Linux High Performance Computing cluster) that I want to run my code on (remoteA) while my data is on another server (remoteB) that seems to only allow access through samba and refuses ssh and scp requests. The solution turns out to be to use smbclient, which seems to behave just like the ftp clients of old.

- ssh into remoteA
- smbclient \\\\{machine name}\\{share name} -D {my directory} -W{domain}   (The multiple backslashes turn out to be vital.) You'll end up with a smbc prompt.
- At the prompt type:
  prompt   (gets rid of the prompt asking you if you are sure you want to copy EVERY file)
  recurse  (I wanted to copy a whole directory, so I needed this)
  mget <my dir>\  (this is my directory)

A useful command is smbclient -L {m

How to Preserve a Snowflake Forever (as mentioned in The Big Bang Theory)

Posting a link to LinkedIn

Sometimes LinkedIn will not pull the metadata/images from a page you link to in a status update. In my case I was trying to link to a page on latimes . I found that if you get a tinyurl to the page, that works. I suspect that the url parser LinkedIn uses can not handle 'weird' characters in an url, like commas (this url had a comma) or else, can't handle urls beyond a certain length.

Plotting state boundary data from shapefiles using Python

The great folks at census.gov have put up some of the data they collect so we can download and use it. On this page they have data relating to state boundaries. The files are available as zipped directories containing a shapefile and other metadata information. If you want to plot state boundaries and some state metadata (like zip code, state name) the .shp shapefile is sufficient. Assuming that the shape file is 'tl_2010_us_state10/tl_2010_us_state10.shp', some sample code using the pyshp package is:

#http://stackoverflow.com/questions/10871085/viewing-a-polygon-read-from-shapefile-with-matplotlib
#http://stackoverflow.com/questions/1441717/plotting-color-map-with-zip-codes-in-r-or-python
import shapefile as sf, pylab
map_f = sf.Reader('tl_2010_us_state10/tl_2010_us_state10.shp')
state_metadata = map_f.records()
state_shapes = map_f.shapes()
for n in range(len(state_metadata)):
    pylab.plot([px[0] if px[0] <0 else px[0]-360 for px in state_shapes[n].points],[p
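The loop above is cut off; a completed version of the same idea (same shapefile, pyshp + pylab only; the longitude wrap keeps far-eastern points from stretching the plot, as in the original snippet) could be:

import shapefile as sf, pylab

map_f = sf.Reader('tl_2010_us_state10/tl_2010_us_state10.shp')
for shp in map_f.shapes():
    # wrap positive longitudes so everything sits on one side of the plot
    x = [px[0] if px[0] < 0 else px[0] - 360 for px in shp.points]
    y = [px[1] for px in shp.points]
    pylab.plot(x, y, 'k', lw=0.5)
pylab.show()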

Pandas: the frame_table disk space overhead

When a Pandas DataFrame is saved (via PyTables) to hdf5 as a frame_table there is a varying amount of disk space overhead depending on how many columns are declared as data_columns (i.e. columns you can use to select rows by). This overhead can be rather high.

import pandas as pd, numpy
df = pd.DataFrame(numpy.random.randn(1000000,3),columns=['a','b','c'])
df.to_hdf('data_table_nocomp.h5','data') #-> 32 MB
df.to_hdf('data_normal.h5','data',complevel=9,complib='bzip2') #-> 21.9 MB
df.to_hdf('data_table.h5','data',complevel=9,complib='bzip2',table=True) #-> 22.5 MB
df.to_hdf('data_table_columns1.h5','data',complevel=9,complib='bzip2',table=True,data_columns=['a']) #-> 29.1 MB
df.to_hdf('data_table_columns2.h5','data',complevel=9,complib='bzip2',table=True,data_columns=['a','b']) #-> 35.8 MB
df.to_hdf('data_

The one thing I would have changed in 'Gravity'

Gravity is a great movie on many levels. It can't quite beat 2001 for solitude, desolation and a tiny cast, but it's good. The three actors, Clooney, Bullock and Sir Newton, do a great job and work well together, though there is not much by way of character development. There is one raging issue that I have, though. It only lasts 20 seconds in the movie and I don't quite know why it's there. So here are Clooney and Bullock drifting towards the ISS. They get entangled in the parachute cords, which stops their momentum relative to the ISS. Then, for some inexplicable reason, for 20 seconds Sir Isaac Newton goes on a coffee break but the crew keep filming! Clooney is pulled by some mysterious phantom force that affects only him and Bullock but not the ISS. Clooney cuts himself loose and slingshots outward. Bullock kind of drifts back, so you know Sir Newton is slowly waking up from the coffee, but not quite, so it's not really clear what's going on. Here's a tweak I wo

Reversed 28mm on a D5100 is 2:1 macro

D5100 sensor size: 23.6 mm × 15.6 mm. 8 mm of wooden ruler spans the height of the sensor, so the magnification is 15.6 mm : 8 mm, about 2:1, which is great, but the focus distance is insanely close.

Nikkor-H 28m f3.5

Nikkor-H 28mm f3.5, a set on Flickr. Sample images from my experiments with an old manual lens.

Adventures with a Nikkor H 28mm f3.5

I wanted to find out what all this hoopla about old manual lenses was about, so I went looking for one. Apparently old manual lenses aren't THAT cheap, or else my definition of cheap is about a standard deviation below the mean. However, I did find a manual lens that fit my budget. Apparently the Nikon 28mm f3.5 isn't as hot an item as some other lenses. The lens I got is older than I am, but in better condition. Nikon ended the run of this version of the lens in 1971. It's a non-Ai lens with the metering prong taken off (which makes it worthless for a collector, I guess). This suited me for two reasons: it made it cheap and it meant I could fit it on my D5100 (I read that you can fit the lens on the camera even with the prongs, but I don't believe it - the flash housing juts over the camera mount pretty closely, and I suspect the prongs would foul. Verified: the flash housing is JUST high enough that the prongs don't foul.). I inspected the lens for fungus and d

Some notes from an ebay newbie

I was always suspicious of eBay (mostly because of Paypal). But I decided to jump in (like, what, about a few decades behind the curve) and try it out. I have fairly specific things I look for on eBay: photo stuff, my idea is that eBay is the giant garage sale in the ether and sometimes you can find exactly what you want for exactly what you want to pay for. I don't have any deep observations, but I think one simple rule is important. I saw a lot of advice about sniping (bidding at the last second) and I think eBay does a very fair thing, in the true spirit of trying to find the appropriate price for an item. Ebay allows you to set a maximum bid and will automatically bid just enough to keep you ahead up to the maximum. If someone comes in after you with a series of bids your bid always has precedence until your limit. I think this, if you are using eBay as a garage sale to find cheap items for a hobby, is the proper way to go. When you first see an item, decide how much at

A simple exchange on eBay

I bought a 52mm-52mm coupler from a HK supplier (goes by the name of william-s-home). After I paid for the item, I noticed that the seller had a warning that the shipping could take 20-30 days and to email them if I wanted to cancel because I was just reading this note. I emailed him and requested a cancellation. The seller was SO polite. We had a few exchanges and he/she was always extremely respectful. I now have this image in my head of a venerable old Chinese trader who takes his business and reputation very seriously. For him, this is not just a way to earn money. It is a way of life, a principle, and things must be done correctly. The item cost $4.00 with shipping. It probably cost more than that for both of us in terms of the time spent emailing and completing the formalities for cancelling the transaction. It was all very civilized and suddenly made me want to be a global trader, exchanging emails with people from far flung places in the globe, because life is too short a

Initializing a Pandas panel

Sometimes there are multiple tables of data that should be stored in an aligned manner. Pandas Panel is great for this. Panels can not expand along the major and minor axis after they are created (at least in a painless manner). If you know the maximum size of the tabular data it is convenient to initialize the panel to this maximum size before inserting any data. For example:

import numpy, pandas as pd
pn = pd.Panel(major_axis=['1','2','3','4','5','6'], minor_axis=['a','b'])
pn['A'] = pd.DataFrame(numpy.random.randn(3,2), index=['2','3','5'], columns=['a','b'])
print pn['A']

Which gives:

          a         b
1       NaN       NaN
2  1.862536 -0.966010
3 -0.214348 -0.882993
4       NaN       NaN
5 -1.266505  1.248311
6       NaN       NaN

Edit: Don't need a default item - an empty panel can be created

Macro photography with reversed lens

I had forgotten the simple joys of experimenting with cameras. Some of you will recall the old trick of reversing your lens to obtain macro photos. Here I simply took my 18-55 kit lens, reversed it, set it to 18mm and took a photo of my laptop monitor. I aimed it at a white part of the screen and you can see the three sub pixels per real pixel which combine together to give the illusion of white.

Pandas panel = collection of tables/data frames aligned by index and column

Pandas panel provides a nice way to collect related data frames together while maintaining correspondence between the index and column values:

import pandas as pd, pylab

#Full dimensions of a slice of our panel
index = ['1','2','3','4'] #major_index
columns = ['a','b','c'] #minor_index
df = pd.DataFrame(pylab.randn(4,3),columns=columns,index=index) #A full slice of the panel
df2 = pd.DataFrame(pylab.randn(3,2),columns=['a','c'],index=['1','3','4']) #A partial slice
df3 = pd.DataFrame(pylab.randn(2,2),columns=['a','b'],index=['2','4']) #Another partial slice
df4 = pd.DataFrame(pylab.randn(2,2),columns=['d','e'],index=['5','6']) #Partial slice with a new column and index

pn = pd.Panel({'A': df})
pn['B'] = df2
pn['C'] = df3
pn['D'] = df4

for key in pn.items:
    print pn[key]

-> output

Wordpress renders LaTeX

I was so pleasantly surprised to learn that wordpress blogs will render latex. The tags are simply $latex and $. So $latex e^{ix} = \cos(x) + i\sin(x)$ will render as the typeset equation. There are some cool parameters that you can set (from hints here and here):
- Increase size by adding &s=X where X is an integer in [-4,4]: $latex x^2 &s=2$
- Instead of inline equations (default), display as block (bigger): $latex \displaystyle x^2$

Python: Multiprocessing: xlrd workbook can't be passed as argument

import multiprocessing as mp, xlrd

def myfun(b):
    print b.sheet_names()

b = xlrd.open_workbook('../../Notes/sessions_and_neurons.xlsx')
p = mp.Pool(4)
p.map(myfun, [b,b,b,b])

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/Applications/Canopy.app/appdata/canopy-1.1.0.1371.macosx-x86_64/Canopy.app/Contents/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/Applications/Canopy.app/appdata/canopy-1.1.0.1371.macosx-x86_64/Canopy.app/Contents/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/Applications/Canopy.app/appdata/canopy-1.1.0.1371.macosx-x86_64/Canopy.app/Contents/lib/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
    put(task)
PicklingError: Can't pickle : attribute lookup __builtin__.instancemethod failed
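One workaround (not necessarily what the post settled on) is to pass the filename instead and let each worker open its own workbook:

import multiprocessing as mp
import xlrd

def myfun(fname):
    b = xlrd.open_workbook(fname)   # each process opens its own handle
    return b.sheet_names()

if __name__ == '__main__':
    fname = 'sessions_and_neurons.xlsx'   # illustrative path; note that recent xlrd versions only read .xls
    p = mp.Pool(4)
    print(p.map(myfun, [fname] * 4))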

Python: Multiprocessing: passing multiple arguments to a function

Write a wrapper function to unpack the arguments before calling the real function. Lambda won't work, for some strange un-Pythonic reason.

import multiprocessing as mp

def myfun(a,b):
    print a + b

def mf_wrap(args):
    return myfun(*args)

p = mp.Pool(4)
fl = [(a,b) for a in range(3) for b in range(2)]
#mf_wrap = lambda args: myfun(*args) -> this sucker, though more pythonic and compact, won't work
p.map(mf_wrap, fl)
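(The lambda fails because multiprocessing pickles the function it ships to the workers, and lambdas can't be pickled.) As an aside, on Python 3 the wrapper isn't needed at all - Pool.starmap unpacks the argument tuples for you:

import multiprocessing as mp

def myfun(a, b):
    return a + b

if __name__ == '__main__':
    fl = [(a, b) for a in range(3) for b in range(2)]
    with mp.Pool(4) as p:
        print(p.starmap(myfun, fl))   # calls myfun(a, b) for each tuple in fl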

Calculating confidence intervals: straight Python is as good as scipy.stats.scoreatpercentile

UPDATE: I would say the most efficient AND readable way of working out confidence intervals from bootstraps is:

numpy.percentile(r,[2.5,50,97.5],axis=1)

where r is an n x b array in which n are different runs (e.g. different data sets) and b are the individual bootstraps within a run. This code returns the 95% CIs as three numpy arrays.

Confidence intervals can be computed by bootstrapping the calculation of a descriptive statistic and then finding the appropriate percentiles of the data. I saw that scipy.stats has a built-in percentile function and assumed that it would work really fast because (presumably) the code is in C. I was using a simple-minded Python/Numpy implementation that first sorts and then picks the appropriate percentile data. I thought this was going to be inefficient timewise and decided that using scipy.stats.scoreatpercentile was going to be blazing fast because: it was native C, and it was vectorized - I could compute the CIs for multiple bootstrap runs a
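A small end-to-end sketch of the recipe described above (bootstrap a statistic, then take percentiles across the bootstrap axis; the data here are made up):

import numpy

data = numpy.random.randn(5, 200)        # 5 runs/datasets, 200 samples each
b = 1000                                 # number of bootstraps per run
boot = numpy.empty((data.shape[0], b))
for i in range(b):
    cols = numpy.random.randint(0, data.shape[1], data.shape[1])  # resample with replacement
    boot[:, i] = data[:, cols].mean(axis=1)                       # the bootstrapped statistic
lo, med, hi = numpy.percentile(boot, [2.5, 50, 97.5], axis=1)     # 95% CI bounds and median, per run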

Three coding fonts

Coding fonts should:
- Look good at small sizes (10-11 pt) - you can see more code in your window
- Have good distinction between characters, especially (O,0), (i,l), (l,1), (`,') - your programs have enough bugs already

Three fonts that I have tried out and that work for me are, in order:
- Anonymous Pro - looks good even at 10pt
- Monaco
- Consolas

(Font samples in the original post: Anonymous Pro 11pt, Monaco 11pt, Consolas 11pt.)

D5100: More notes

Video It took me a little bit to get warmed up to the concept, but now I definitely see the potential for using DSLRs for movie making. Camcorders (in the price range I would consider) are fitted with single lenses (probably a superzoom) with average optical quality. Their smaller sensor size means a much noisier low light performance. With this cheap DSLR I can put on my cheap 50mm/1.8 and get HD movies that look 'arty' because I opened the lens up wide. I can take movies in indoor lighting. I can take videos of my cat that look like something showing at Sundance. It really opens up for creativity. My only gripe is the auto focus. It's not that it is slow, it's that I can't get it to do what I want, but perhaps I want too much. The AF, with a decent lens, like the 35mm/1.8 AF-S, is fast enough and silent enough. The kit lens is atrocious in this department. My gripe is that I just could not figure out how to efficiently get it to track my subject (my cat). M

A script to clear and disable recent items in the Mac OS X dock

From a hint here. Mac OS X has the annoying feature of remembering your application history in the dock and not erasing the history when you erase it from the application preferences. The following is a little bash script that does this for you provided you pass the name of the application (e.g. vlc.app) to it.

#!/bin/bash -x
BUNDLEID=$(defaults read "/Applications/$1/Contents/Info" CFBundleIdentifier)
defaults delete "$BUNDLEID.LSSharedFileList" RecentDocuments
defaults write "$BUNDLEID" NSRecentDocumentsLimit 0
defaults write "$BUNDLEID.LSSharedFileList" RecentDocuments -dict-add MaxAmount 0

You need to run killall Dock after this to restart the dock for the changes to take effect.

The nikon D5100 (as seen by a D40 shooter)

The D5100 is rather old news now. People are either ogling the m4/3 cameras (I know I am) or looking at Nikon's new models such as the D5200. However, I recall, when the D5100 first came out, and I was the owner of a D40, I badly wanted the high ISO performance and the video. Well, enough time has passed that the D5100 is now at a sweet price point (especially the refurbished ones) that I did get myself one. There are tons of comprehensive D5100 reviews out there, this will be a short collection of very subjective thoughts from a D40 owner. What kind of photographer am I? Well, I'm a casual shooter. A few pics are up on flickr , but I mostly shoot family and don't really put up pictures on web galleries. My favorite subject is the human face in the middle of its many fleeting expressions. High ISO performance I'm very happy. Experts on sites such as dpreview complained that noise rendered D5100 photos above 1600 unusable. I was already impressed by the D40'

HDF5 (and Pandas using HDF5) is row oriented

From a nice hint here , and the docs here : When you use pandas + HDF5 storage it is convenient to generate one table that is the 'selector' table that you use to index which rows you will select. Then you use that to retrieve the bulk data from separate tables which have the same index. Originally I was appending columns to the main table, but there is no efficient way of doing that when using HDF5 (appending rows is efficient). Now I'm just creating new tables for the data, keeping the original index.

An efficient way to store pandas data

OK, after much belly aching I have a decent work flow for when I want to use Pandas which is actually quite convenient. Firstly, Pandas shines for when I have heterogeneous data (mixed types) that form nicely into columns and where I need to select out a subset of rows because they satisfy certain conditions.

UPDATE: Fixed confusion between 'table' and 'store'
UPDATE: Include note about how to set data columns

The basic steps are these:
- Use table=True in .put or .to_hdf to indicate that you want the data stored as a frame_table that allows on-disk selection and partial retrieval
- Use data_columns=[...] during saving to identify which columns should be used to select data

You need to do BOTH steps to have a working selectable-table-on-disk. If you do not use table=True you will get TypeError: cannot pass a where specification when reading from a non-table this store must be selected in its entirety. If you do not declare data_columns you will get ValueError: q
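A compact sketch of both steps with a current pandas (format='table' is today's spelling of the old table=True flag):

import pandas as pd, numpy

df = pd.DataFrame(numpy.random.randn(100, 3), columns=['a', 'b', 'c'])

# step 1: store as a frame_table; step 2: declare the columns you will select on
df.to_hdf('store.h5', 'df', format='table', data_columns=['a', 'b'])

# on-disk selection now works without reading the whole frame
subset = pd.read_hdf('store.h5', 'df', where='a > 0 & b < 0')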

h5py and pandas for large array storage

I've actually gone back to pure hdf5 (via the h5py interface) for storing and accessing numerical data. Pandas via PyTables started to get too complicated and started to get in the way of my analysis (I was spending too much time on the docs, and testing out cases etc.). My application is simple. There is a rather large array of numbers that I would like to store on disk and load subsets of to perform operations on cells/subsets. For this I found pandas to be a bad compromise. Either I had to load all the data all at once into memory, or I had to go through a really slow disk interface (which probably WAS loading everything into memory at the same time). I just don't have the luxury to fight with it so long. I'm seeing that pandas has a (kind of) proper way of doing what I'm doing , but in h5py it just seems more natural and less encumbering :( UPDATE: So, as previously mentioned, Pandas shines as a database substitute, where you want to select subsets of data bas
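For contrast, the h5py version of "store a big array, read back a small piece" is about as minimal as it gets (a sketch with made-up sizes and names):

import h5py, numpy

with h5py.File('big.h5', 'w') as f:
    f.create_dataset('data', data=numpy.random.randn(10000, 100))

with h5py.File('big.h5', 'r') as f:
    chunk = f['data'][100:200, :]   # only this slice is read from disk
    print(chunk.mean())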

Use Enthought Canopy outside of their environment

From hints on their blog and other places: Canopy installs a virtual environment. The environment activate command is located at ~/Library/Enthought/Canopy_64bit/User/bin/activate . An easy way to access this environment is to alias it in your start up file eg: # For canopy ipython alias canpy='source ~/Library/Enthought/Canopy_64bit/User/bin/activate' When in the environment use deactivate to exit. I'm using Canopy because I found it insanely annoying to install hdf5 and h5py on Mac 10.7.5 I think my next laptop will be linux...

Pandas: brief observations

After using Pandas for a little bit, I have a few observations:
- Pandas is great for database-like use. When you have tabular data from which you would like to efficiently select sub-tables based on criteria, Pandas is great.
- Pandas is great for time-series-like data, where the rows are ordered. In such cases pandas allows you to combine multiple tables, or plot, or do analyses based on the time-series nature of the rows.
- Pandas, however, is a little unwieldy when you wish to add rows (adding columns is very easy) and in data manipulation in general

Each access of a Pandas hdf5 store node is a re-copy from the file

This is obvious, but it is important to remember.

import pandas as pd, pylab, cProfile

def create_file():
    r = pylab.randn(10000,1000)
    p = pd.DataFrame(r)
    with pd.get_store('test.h5','w') as store:
        store['data'] = p

def analyze(p):
    return [(p[c] > 0).size for c in [0,1,2,3,4,5,6,7,8,9]]

def copy1():
    print 'Working on copy of data'
    with pd.get_store('test.h5','r') as store:
        p = store['data']
        idx = analyze(p)
    print idx

def copy2():
    print 'Working on copy of data'
    with pd.get_store('test.h5','r') as store:
        idx = analyze(store['data'])
    print idx

def ref():
    print 'Working on hdf5 store reference'
    with pd.get_store('test.h5','r') as store:
        idx = [(store['data'][c] > 0).size for c in [0,1,2,3,4,5,6,7,8,9]]
    print idx

#create_file()
cProfile.run('copy1()')
cProfile.run('copy1()')
cProfile.run(

Pandas: presence of a NaN/None in a DataFrame forces column to float

import pandas as pd

a = [[1,2],[3,4]]
df = pd.DataFrame(a)
df ->
   0  1
0  1  2
1  3  4
df.values -> array([[1, 2], [3, 4]])
df.ix[1].values -> array([3, 4])

a = [[1,None],[3,4]]
df = pd.DataFrame(a)
df ->
   0   1
0  1 NaN
1  3   4
df.values -> array([[ 1., nan], [ 3., 4.]])
df[0].values -> array([1, 3])
df[1].values -> array([ nan, 4.])
df.ix[1].values -> array([ 3., 4.])
df[0][1] -> 3
df[1][1] -> 4.0

This threw me because I have a data structure that is all ints, but I have a few Nones on one column and that column was suddenly returned as floats. As you can see it's just the relevant column that is forced to float.

Pandas and PyTables: Variable assignment forces copy

I wish to report Pandas to the house unPythonic activities committee. Remember how in Python assignments are by reference rather than value, i.e. when you do something like:

a = b

what python does is create a reference from a to b (except for very simple objects like integers). This is what tripped me up when I was learning Python. For example:

In [2]: a = {'age': 90, 'weight': 400}
In [3]: b = a
In [4]: a
Out[4]: {'age': 90, 'weight': 400}
In [5]: b
Out[5]: {'age': 90, 'weight': 400}
In [6]: b['age'] = 20
In [7]: b
Out[7]: {'age': 20, 'weight': 400}
In [8]: a
Out[8]: {'age': 20, 'weight': 400}

As you can see, changing b changes a because the assignment creates a reference. Now, when I was working with Pandas and its built-in PyTables interface I learned the hard way that when you assign a variable to an element of a hdf5 store it copies the data from the hdf5 store into the varia

Sleep number bed LCDs are defective

We bought a sleep number bed about 5 years ago. These beds come with a '20 year' warranty, which sounds awesome because it makes one think that a) the beds are made well enough for the company to give such a warranty and b) it's a nice warranty. Well, it's not THAT great. About 2 years ago the LCD display on the controller started to go on the fritz. It started with one segment of one digit and then progressed until, a few weeks ago, the display was simply blank. A quick search on the internet turned up that this is a very common problem. We have a wired controller (because it was cheaper, I guess, it was a while ago). The refurbished replacement is going to cost us $60 with shipping; the original would have cost us $140 or so. It does seem that we are getting a nice discount on their catalog price, but I don't think this is such a good deal. Anyhow, the pump is working fine, so the actual cost of the controller was probably $10 or so, so I

Manipulating pandas data structures

I really enjoy using the Pandas Series and DataFrame objects. I find, however, that methods to update the series/frame are clunky. For a DataFrame it's pretty easy to add columns - you create a DataFrame or a Series and you just assign it. But adding rows to a Series or DataFrame is a bit clunky. I sometimes have the need to modify a certain row with new data or add that row if it does not exist, which in a database would be a 'replace or insert' operation. You can concat or append another Series or DataFrame but I have not found a nice way of handling the 'replace or insert' case. If the structure is small I simply convert it into a dictionary and manipulate the structure using the dictionary keys and then recreate the pandas structure. If the structure is large I do an explicit test for the index (row) and then decide whether to append or replace.
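A sketch of the dictionary round trip described above for the small-structure case (the data are illustrative):

import pandas as pd

s = pd.Series({'a': 1, 'b': 2})

d = s.to_dict()      # pandas -> plain dict
d['b'] = 20          # replace an existing row
d['c'] = 3           # insert a new row
s = pd.Series(d)     # dict -> pandas again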

DSLR vs compacts/micro four thirds

I'm what the marketing department at camera companies call an 'enthusiast'. Previously I would be called an amateur, but I guess 'enthusiast' doesn't have the stigma of 'clueless' that amateur now has. I don't make money of photos and I take photos for pleasure and for memories. I bought my DSLR when DSLR prices were plunging off a cliff, that is after all the professionals had subsidized sensor and lens development. I bought the D40. I got a DSLR for the following characteristics: Low shutter lag. This was probably the biggest deal for me. I like to capture the fleeting expressions on human faces and the compact was very frustrating with the long lag between focusing and then taking the picture. Good low light performance. The D40 works just fine for me upto 1600 ISO. ISO 3200 is very noisy and adding a nice prime lens that goes out to f1.8 added a lot of artistic scope and improved low light performance. The downside of even a small DSLR like

Bug in Tk Listbox

Run the script below. Clicking on the window will fire off <<ListboxSelect>> events even though the widget is disabled. Keyboard actions do not fire this event.

import Tkinter as tki

def selection_changed(event):
    print 'Selection changed'

root = tki.Tk()
listbox = tki.Listbox(root, selectmode=tki.BROWSE)
listbox.pack(side='left', fill='both', expand=True)
listbox.bind('<<ListboxSelect>>', selection_changed)
listbox.config(state=tki.DISABLED)
root.mainloop()

Python bug ticket: 18506 (The python guys bounced me to the Tcl/Tk guys)
Tcl/Tk bug ticket: 67c8e8bd71

UPDATE: This is the same as a previously reported bug (1288433). Given that it has been open for 7 years, with a seemingly minor fix, it would appear that it won't be fixed any time soon. The workaround is to check whether the listbox is disabled before processing the event.

Run IPython notebook on remote server

This comes in very useful if you want to run your notebook on a remote machine (e.g. your data is on that machine, or the machine is a lot faster than your own). From hints here.

- Start ipython notebook on the remote machine: ipython notebook --pylab inline --no-browser --port=7000
- Set up tunneling on the local machine: ssh -N -f -L localhost:7000:localhost:7000 login@the.remote.machine
- Open up localhost:7000 in your browser

Install latest Ipython

git clone https://github.com/ipython/ipython.git
cd ipython
python setup.py install --user

Don't forget to get rid of the other two or three installations of IPython, as the case may be (it was three in my case). Also, don't forget to get the latest dependencies.

Python subprocess, Popen and PIPE

Typically when using Python's subprocess we use PIPEs to communicate with the process. However, it turns out, PIPEs suck when the data gets even slightly large (somewhere in the vicinity of 16K). You can verify this by running the following test code:

from subprocess import Popen, PIPE
import argparse, time

def execute(n):
    p = Popen(['python', 'test.py', '-n', str(n)], stdin=PIPE, stdout=PIPE, stderr=PIPE)
    p.wait()
    return p.stdout.read().splitlines()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-n', type=int)
    args = parser.parse_args()
    if args.n is not None:
        print '0'*args.n
    else:
        for n in [10,100,1000,10000,12000,16000,16200,16500]:
            t0 = time.clock()
            execute(n)
            print n, time.clock() - t0

The output is

10 0.001219
100 0.001254
1000 0.001162
10000 0.001362
12000 0.001429
16000 0.001305
16200 0.00121
(Hangs after this)

The way to handle this is
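The excerpt cuts off before the fix. For reference - and not necessarily what the post goes on to recommend - the usual way to avoid this deadlock (the child blocks writing to a full pipe while the parent blocks in wait()) is to let communicate() drain the pipes:

from subprocess import Popen, PIPE

def execute(n):
    p = Popen(['python', 'test.py', '-n', str(n)], stdin=PIPE, stdout=PIPE, stderr=PIPE)
    out, err = p.communicate()   # reads stdout/stderr while the child runs, then waits for it
    return out.splitlines()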

github gh-pages original markdown is stored in params.json

As you know github lets you put up webpages for your projects and these are stored in branch of your repository called 'gh-pages'. Github also lets you write the page in Markdown and then converts it into html automatically. I am thrilled by this as you can also import your Readme.md file from your main project. I was also impressed by the fact that you can go back to the automatic page generator and reload the page as markdown and edit it. But I could not find the markdown source - all I saw was index.html and I wondered what magic github did to reverse convert html to markdown. This puzzled me because it did not look like a reversible operation. Well, the secret is in the params.json file. The markdown, site title and tagline are in this file!

Mac OS X: make a screen cast with no additional software

I recently learned that on Mac OS X (> 10.6) it is possible to create decent screen casts using only the built in utilities (From hints here and here ): For the basic screen Quicktime is sufficient. Open up Quicktime and go to File->New Screen Recording. A small control panel will open that allows you to control recording. The small dropdown arrow on the right gives access to various recording options, including sound. When you are ready hit the record button. QT will tell you to click to start right away recording the whole screen, or drag your mouse and select a part of the screen to record from. Then you get a 'start' button which you should click to start recording. If you have activated voice recording you can see your voice level in the control panel. If you want to visualize your keystrokes on the screen (and don't want to spend money on separate software that does this in a fancy way) you can do the following: Go to System Preferences->Keyboard. Check

exiftool batch mode

exiftool has a batch mode. If you pass the argument -stay_open True, exiftool accepts multiple commands. This is invaluable if you call exiftool from another program, because you avoid the overhead of loading/unloading the program every time. exiftool can also return data formatted as JSON, which python knows how to handle, allowing us to pass formatted data back and forth rather easily. An example of this all working together nicely is here.
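A rough sketch of driving this batch mode from Python (this is not the linked example; the -stay_open/-@/-execute/'{ready}' protocol is taken from exiftool's documentation, and the filename is made up):

import subprocess, json

# Start one long-lived exiftool process; '-@ -' tells it to read arguments from stdin.
p = subprocess.Popen(['exiftool', '-stay_open', 'True', '-@', '-'],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE)

def run(args):
    # One command = the arguments, one per line, terminated by -execute;
    # exiftool prints '{ready}' when the command has finished.
    p.stdin.write(('\n'.join(args) + '\n-execute\n').encode())
    p.stdin.flush()
    out = b''
    while not out.rstrip().endswith(b'{ready}'):
        out += p.stdout.read1(4096)
    return out.rsplit(b'{ready}', 1)[0]

meta = json.loads(run(['-json', 'photo.jpg']))   # 'photo.jpg' is a made-up filename
p.stdin.write(b'-stay_open\nFalse\n')            # shut the batch process down
p.stdin.flush()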

Running a task in a separate thread in a Tkinter app.

- Use Queues to communicate between main thread and sub-thread
- Use wm_protocol/protocol to handle quit event
- Use Event to pass a message to sub-thread

import Tkinter as tki, threading, Queue, time

def thread(q, stop_event):
    """q is a Queue object, stop_event is an Event.
    stop_event from http://stackoverflow.com/questions/6524459/stopping-a-thread-python
    """
    while(not stop_event.is_set()):
        if q.empty():
            q.put(time.strftime('%H:%M:%S'))

class App(object):
    def __init__(self):
        self.root = tki.Tk()
        self.win = tki.Text(self.root, undo=True, width=10, height=1)
        self.win.pack(side='left')
        self.queue = Queue.Queue(maxsize=1)
        self.poll_thread_stop_event = threading.Event()
        self.poll_thread = threading.Thread(target=thread, name='Thread', args=(self.queue,self.poll_thread_stop_event))
        self.poll_thread.start()
        self.poll_interval = 250
        self.poll()
        self.root.wm_protocol("