Skip to main content


Showing posts from August, 2014

A funny thing with Python __getitem__()

So, Python has the 'in' operator which we are encouraged to use to test membership of a key in a dictionary, for example. Like

a = {'A': 22, 'B': 45} if 'A' in a: print 'Yes' --> 'Yes'
Python has a index/slice/key notation which consists of square brackets ([]):
a['A'] --> 22
Because Python is so awesome it's slice notation is available to us for any classes we define.
class A(): def __getitem__(self, item): print item return item a = A() a[23] --> 23 b['hello'] --> 'hello'
(In our real class, of course, we would be doing something interesting in the __getitem__ method, like returning a data structure indexed by the key)

What happens when we put the two together, you know, use the in idiom with a user defined class?

It's so awesome, it needs its own screencast.

So Python goes through indexes sequentially testing membership! Returning None does not help either. You have to retur…

Big numbers in Python

Numbers in Python can get arbitrarily large. This is pretty fun for things like doing factorials

>>> def fac(n): return reduce(lambda x,y: x*y, range(1,n+1)) ... >>> fac(1) 1 >>> fac(2) 2 >>> fac(3) 6 >>> fac(4) 24 >>> fac(5) 120 >>> fac(500) 12201368259911100687012387854230469262535743428031928421924135883858453731538819976054964475022032818630136164771482035841633787220781772004807852051593292854779075719393306037729608590862704291745478824249127263443056701732707694610628023104526442188787894657547771498634943677810376442740338273653974713864778784954384895955375379904232410612713269843277457155463099772027810145610811883737095310163563244329870295638966289116589747695720879269288712817800702651745077684107196243903943225364226052349458501299185715012487069615681416253590566934238130088562492468915641267756544818865065938479517753608940057452389403357984763639449053130623237490664450488246650759467358620746379251842004…

Numpy: a note on slicing efficiency

In my application I have three arrays of the same length (and datatype) (say p1, p2 and p3) that have a one to one correspondence with each other. My code works on the arrays in blocks and has to concatenate fragments of these three arrays together. In the rest of my code I take matched slices through these arrays (say p1[0:100], p2[0:100], p3[0:100]) and pass them onto functions for processing,

My first instinct was to create a two-dimensional array since this encapsulates the data nicely, even though the data do not, semantically, constitute a matrix. It was at this point I came across a fact that I had not considered before: taking slices from a 2D array is more time expensive than taking slices from a 1D array.

For example:

In [397]: %%timeit p = numpy.empty((3, 1e6)) .....: x = p[1, 0:2] .....: 100000 loops, best of 3: 2.18 ┬Ás per loop

In [398]: %%timeit p = numpy.empty(1e6) .....: x = p[0:2] .....: 1000000 loops, best of 3: 512 ns per loop
Row-column shape does …

Python blist.sortedlist

The blist package is pretty good. The documentation and care the author has put in is very impressive. It's a shame that it wasn't included in the standard Python distribution, and the reasons given seem to indicate to me that there are not enough people working on large datasets on the Python committee. Any how, I wanted to try out for myself how the sorted list stacks up against a regular python list that is sorted every time after and insertion:

As can be seen, blist.sorted insertion is O(1) and compares quite favorably to the regular Python list.

Timing code follows:

import pylab from timeit import timeit l = [1,2,3,4,5,6,7,8,9,10] def time_things(): n_loops = 1000 s = [10, 100, 1000, 10000, 100000] blist_time = [timeit("a.add(100)", "import blist; a=blist.sortedlist(range({:d},0,-1))".format(n), number=n_loops)/n_loops for n in s] list_time = [timeit("a.append(100); a.sort()", "a=range({:d})".format(n), number=n_loops)/n_lo…

bcolz, numpy and plain Python

So we have these big data sets that people say are the code of life. In a simple representation they add up to about 3 GB of space. While it is quite possible to load up the whole human genome into RAM and have capacity to spare, I was wondering if there were painless ways to access compressed forms of the data. This lead me to Mikhail Korobov's blog and to bcolz. Here's a quick shoot out between plain Python strings, numpy and bcolz as used to store and access human chromosome 11 which is about 135 million base pairs long:

My take away, for my particular application is that I will stick to plain old Pythons strings - I tend to take short slices of the data (about 100 basepairs long) and that puts me in the range where Python strings are faster than numpy arrays. Memory is cheap.

Code follows:

"""Access times for single characters along the length of the string and for 100 item slices along the length of the string.""" import pylab import numpy imp…

Nosetests and pdb

I sat there looking at a testing error and thought "Can't we automatically invoke the debugger when a test fails". This is Python, of course someone has a solution for that.

nosetests --pdb
will invoke the debugger .when one of your tests goes south.