Monday, December 22, 2014

Retrieve a single file from a git repo.

The answer to this was more difficult to find than it should have been. The best answer is on stackoverflow here.

In short, use the git archive command:

git archive --prefix=path/to/ HEAD:path/to/ |  tar xvf -

  • For this to work the owner of the git server needs to have enabled upload-archive (git config daemon.uploadarch true)
  • You need tar -xvf because the output is in tar form

Thursday, December 4, 2014

Using user classes with Python's set operations

Python has an efficient, and built in, implementation of set operations. With a tiny bit of work you can make your user defined classes work nicely with set operations (as well as dictionaries).

Say you have a class:

class A:
  def __init__(self, x, s):
    self.x, self.s = x, s

Python actually doesn't have a way of knowing if the objects are the same or not:

a1 = A(1, 'Hi')
a2 = A(1, 'Hi')
a = set([a1, a2])
-> a1, a2

and by default assumes they are different.

All you have have to do, for things to work properly, is to define a hash function (__hash__()) and an equality (__eq__()) operator for your class. The two functions need to be congruent i.e. if the hash value for two objects is the same, they should also register as equal. It is possible to define the functions differently but this will lead to funny things happening with set operations.

The hash value is an integer. If your class has several attributes that you consider important for distinguishing between objects you can hash each attribute separately and then XOR the individual hashes.

class A:
  def __init__(self, x, s):
    self.x, self.s = x, s

  def __hash__(self):
    return self.x ^ self.s.__hash__()

  def __eq__(self, other):
    return self.__hash__() == other.__hash__()

And now

a1 = A(1, 'Hi')
a2 = A(1, 'Hi')
a = set([a1, a2])
-> just a1 as expected

Tuesday, December 2, 2014

Sphinx + Cython

It's pretty awesome that Sphinx will work with Cython code files (.pyx) but there are currently a few wrinkles:
  1. You need to rerun the cython compiler before the documentation changes take since sphinx reads this off the compiled library
  2. Function signatures are omitted in the documentation and the compiler directive # cython: embedsignature=True does not work either. The workaround is to repeat the signature as the first line of the docstring
def merge_variants(c1, c2):
  merge_variants(c1, c2)
  ... rest of docstring ...

Monday, November 24, 2014

A small caution when using virtualenvs

Virtualenvs are a great way to test and develop code in insulated containers. My main use case is to have an insulated environment where I can mock install a package I'm developing and ensure that I've taken care of all the dependencies so that a stranger can use the package as is. The virtualenvwrapper package is a great utility that simplifies managing multiple virtual environments.

One caution I have to observe is that packages installed out side the virtual environment can interfere in ways that make behaviors inside the virtual env very mysterious. For example, before I started using virtual environments seriously I had installed the nose and tox modules in my base python install. A month or so afterwards I had created a new test environment and was doing pip install -e . to test whether a package I was writing would install correctly on a fresh environment.

Everything installed fine, including an external package A my code needed. But, when I went to run nosetests or tox my code errored out, claiming package A had not been installed. BUT I CAN $%#@ SEE IT. LOOK, I TYPE pip list AND THERE IT IS!

What was happening, is that I was picking up both nose and tox from the base environment, which did not have A installed. I lost some more hair over this. People who know me will confirm that I can't afford such hair loss.

As a rule, if you want to use virtual envs, it is best to have a bare install of Python and then not install anything on top of that outside of a virtual environment.

git merge

I'm so used to git merge doing the right thing re: merging files - and possibly used to working mostly by myself - that when git merge fails I always expect something messy has happened. However, just now, I got a merge complain that amounted to this:

from setuptools import setup, find_packages

      ... some things ...

<<<<<<< HEAD
      # Register the built in plugins
      # Command line scripts
>>>>>>> origin/fix_setup

I was surprised since the changes I made and my colleague made were simple and non-overlapping. They just need to come in sequence.

I took a look at the git merge documentation which said

When both sides made changes to the same area, however, Git cannot randomly pick one side over the other, and asks you to resolve it by leaving what both sides did to that area.

What happened here was that there were no intervening lines. We both appended our changes below "... some things ..." and above ")".

git merge's algorithm can't assume (rightly so) that these are tandem changes. Which should come first? Should they both be kept? And so on. It's not a mess, but it still takes humans - actually people working on the code - to figure out that, in this case, both should be kept, and it does not matter which comes first.

Wednesday, November 12, 2014

cdef vs cpdef

In a .pyx file if you want a function that you declared as cdef (e.g. because you want inlining) to be visible to the Python interpreter, you need to use cpdef instead. However, it seems the cpdef version is a lot less efficient than cdef, and it might be better to have a wrapper of some sort instead.

Monday, September 8, 2014

An annoying thing with Python slices

You of course know that Python slices are awesome:

a[:3] -> 'ABC'
a[2:5] -> 'CDE'

And more interestingly:

a[-3:] -> 'EFG'


a[6:4:-1] -> 'GF'

But you can see that the reverse slicing is starting to stretch the fence-post we are familiar with. Python uses zero based, inclusive-exclusive indexing. This corresponds to a C syntax of (for i = n; i < m; i++). When you reverse it the slice goes (for i = m - 1; i > n - 1; i--).

As you can imagine this starts to get ugly and at one point it gets to be wrong:

Say, as is often the case, you are not taking static, pre-determined slices but rather slices determined at runtime. Say you are taking slices between n and m or [n, m).

The forward slice is a[n:m]
The backward slice is a[m-1:n-1:-1] right? Because of the fence posts?

Well yes, except what happens when n = 0? The forward slice is fine but the reverse slice resolves to a[m-1:-1:-1]

This is where Python becomes a little too clever. As you will recall from our earlier examples, negative indices indicate offsets from the end of the object. So, the last slice returns empty.

The correct slice is a[m-1:None:-1] or a[m-1::-1] and the logic for this is cumbersome:

a[m-1:n-1 if n > 0 else None:-1]

The simpler way is to do a[n:m][::-1].

Saturday, September 6, 2014

Mac OS + 'cat' + 'sed' + \n = half-assed

You guys all know how Mac OS darwin does everything JUST a little differently from the *nixes. It's close enough to draw you in, and different enough to stab you in the back. Today's case sed and \n

I needed to cat some files together but I needed a newline between them. I asked my colleague Wan-Ping for a command that would do this and she suggested

cat 1.fa 2.fa 3.fa | sed 's/^>/\n>/g'

So I did this and the sucker added the character 'n' wherever I expected a newline.

It turns out that Macs are special little snow flakes and need a special little command:

cat chr*.fa | sed 's/^>/\'$'\n>/g' > hg38.fa

The magic sauce is the '$' that escapes the '\n'


Wednesday, September 3, 2014

Fixing the big mess with git and case insensitive filesystems

Mac OS X by default is a case insensitive file system. But Mac OS, as in a lot of other things, makes a half-assed job of this. In addition to causing various bits of confusion when creating directories it also leads to a potentially messy situation with git. This is how things happen:

1. You create a directory in your source tree called, say, Plugins with a capital "P".
2. After a few commits you decide that it's better to change this to a lower case "p": plugins
3. When you go to commit this rename (perhaps with a few other changes you implemented) git throws a hissy fit.

After a bit of searching on stack overflow it turns out that this is all related to Mac OS's case-insensitivity.

The cleanest fix I found on stack overflow was:

git mv Plugins temp00
git mv temp00 plugins
git commit

Apparently this fools git's index into doing the change, where as git mv Plugins plugins - because the underlying file system does not recognize the difference - tells git nothing has changed but it has and leaves it in some sort of half way state that messes it up.

Friday, August 29, 2014

A funny thing with Python __getitem__()

So, Python has the 'in' operator which we are encouraged to use to test membership of a key in a dictionary, for example. Like

a = {'A': 22, 'B': 45}
if 'A' in a:
  print 'Yes' 

--> 'Yes'

Python has a index/slice/key notation which consists of square brackets ([]):

--> 22

Because Python is so awesome it's slice notation is available to us for any classes we define.
class A():
  def __getitem__(self, item):
    print item
    return item

a = A()
a[23] --> 23
b['hello']  --> 'hello'

(In our real class, of course, we would be doing something interesting in the __getitem__ method, like returning a data structure indexed by the key)

What happens when we put the two together, you know, use the in idiom with a user defined class?

It's so awesome, it needs its own screencast.

So Python goes through indexes sequentially testing membership! Returning None does not help either. You have to return True or 1 to indicate membership or raise IndexError for non-mmebership. Otherwise this madness will never stop!

Monday, August 25, 2014

Big numbers in Python

Numbers in Python can get arbitrarily large. This is pretty fun for things like doing factorials

>>> def fac(n): return reduce(lambda x,y: x*y, range(1,n+1))
>>> fac(1)
>>> fac(2)
>>> fac(3)
>>> fac(4)
>>> fac(5)
>>> fac(500)
You're like Ooooh! I'm a gonna break Python!

>>> fac(5000)

And Python calmly gives you


Another fun thing about such arbitrary length integers in Python is that you can do weird, and fast, bit shifts.

>>> n = 1 << int(250e6)
You just created a 250 million bit array with the last bit set to one! And it's fast too:
In [407]: %timeit n = 1 << int(250e6)
100 loops, best of 3: 12.4 ms per loop

In [408]: %%timeit n = 1 << int(250e6)
   .....: n &= 1
10000000 loops, best of 3: 159 ns per loop

In [409]: %%timeit n = 1 << int(250e6)
n &= 1 << int(200e6)
100 loops, best of 3: 10.5 ms per loop

Numpy: a note on slicing efficiency

In my application I have three arrays of the same length (and datatype) (say p1, p2 and p3) that have a one to one correspondence with each other. My code works on the arrays in blocks and has to concatenate fragments of these three arrays together. In the rest of my code I take matched slices through these arrays (say p1[0:100], p2[0:100], p3[0:100]) and pass them onto functions for processing,

My first instinct was to create a two-dimensional array since this encapsulates the data nicely, even though the data do not, semantically, constitute a matrix. It was at this point I came across a fact that I had not considered before: taking slices from a 2D array is more time expensive than taking slices from a 1D array.

For example:

In [397]: %%timeit p = numpy.empty((3, 1e6))
   .....: x = p[1, 0:2]
100000 loops, best of 3: 2.18 µs per loop


In [398]: %%timeit p = numpy.empty(1e6)
   .....: x = p[0:2]
1000000 loops, best of 3: 512 ns per loop

Row-column shape does not make a difference:

In [399]: %%timeit p = numpy.empty((1e6, 3))
x = p[0:2, 1]
100000 loops, best of 3: 1.97 µs per loop

Additionally, A bit of parametric investigation reveals that, interestingly, the size of the slice does not matter:

In [402]: %%timeit p = numpy.empty((3, 1e6))
x = p[1, 0:100000]
100000 loops, best of 3: 2.17 µs per loop


In [401]: %%timeit p = numpy.empty(1e6)
x = p[0:100000]
1000000 loops, best of 3: 507 ns per loop

I think the most likely reason for this is that we are no longer pulling contiguous memory locations but rather having to pull together fragments from different memory locations and do proper offset computations while we do this. I can't understand why the time is independent of size for the 2D array, though!

So these consideration led me to the following formulation. I first verified that a concat of 2D fragments was not faster than repeated concats of separate 1-D fragments:

In [404]: %%timeit p1 = numpy.empty((3, 1e6)); p2=numpy.empty((3,1e2))
   .....: p = numpy.concatenate((p1, p2),axis=1)
100 loops, best of 3: 8.15 ms per loop


In [405]: %%timeit p1 = [numpy.empty(1e6) for _ in range(3)]; p2=[numpy.empty(1e2) for _ in range(3)]
   .....: p = [numpy.concatenate((pa, pb)) for pa, pb in zip(p1, p2)]
100 loops, best of 3: 9.34 ms per loop

And then used a list of arrays instead of a 2D array. This works for me because I do a lot of slicing of the resultant arrays and I never need to do any matrix operations with the three rows arranged as a matrix.

Saturday, August 16, 2014

Python blist.sortedlist

The blist package is pretty good. The documentation and care the author has put in is very impressive. It's a shame that it wasn't included in the standard Python distribution, and the reasons given seem to indicate to me that there are not enough people working on large datasets on the Python committee. Any how, I wanted to try out for myself how the sorted list stacks up against a regular python list that is sorted every time after and insertion:

As can be seen, blist.sorted insertion is O(1) and compares quite favorably to the regular Python list.

Timing code follows:

import pylab
from timeit import timeit

l = [1,2,3,4,5,6,7,8,9,10]

def time_things():
  n_loops = 1000
  s = [10, 100, 1000, 10000, 100000]
  blist_time = [timeit("a.add(100)", "import blist; a=blist.sortedlist(range({:d},0,-1))".format(n), number=n_loops)/n_loops for n in s]
  list_time = [timeit("a.append(100); a.sort()", "a=range({:d})".format(n), number=n_loops)/n_loops for n in s]
  return list_time, blist_time

def plot_things(list_time, blist_time):
  s = [10, 100, 1000, 10000, 100000]
  pylab.plot(s, list_time, label='Python list')
  pylab.plot(s, blist_time, label='blist.sortedlist')
  pylab.setp(pylab.gca(), xlabel='List size', ylabel='Time (s)', xscale='log', yscale='log') #, ylim=[1e-9, 1e-3])
  pylab.legend(loc='lower right')

Thursday, August 14, 2014

bcolz, numpy and plain Python

So we have these big data sets that people say are the code of life. In a simple representation they add up to about 3 GB of space. While it is quite possible to load up the whole human genome into RAM and have capacity to spare, I was wondering if there were painless ways to access compressed forms of the data. This lead me to Mikhail Korobov's blog and to bcolz. Here's a quick shoot out between plain Python strings, numpy and bcolz as used to store and access human chromosome 11 which is about 135 million base pairs long:

My take away, for my particular application is that I will stick to plain old Pythons strings - I tend to take short slices of the data (about 100 basepairs long) and that puts me in the range where Python strings are faster than numpy arrays. Memory is cheap.

Code follows:

"""Access times for single characters along the length of the string and for 100 item slices along the length of the
import pylab
import numpy
import bcolz

chrom_11_str = None  # Ugh, globals for timeit
chrom_11_bcolz = None
chrom_11_numpy = None

def load_data():
  global chrom_11_str, chrom_11_bcolz, chrom_11_numpy
  with open('chr11.fa', 'r') as fp:
    _ = fp.readline()  # The header
    chrom_11_str = ''.join([ln.strip() for ln in fp.readlines()])
  chrom_11_numpy = numpy.fromstring(chrom_11_str, dtype='c')
  chrom_11_bcolz = bcolz.carray(chrom_11_numpy)

def time_things():
  from timeit import timeit
  n_loops = 1000
  return [[[timeit("a = {:s}[0:{:d}]".format(dta, n), "from __main__ import {:s}".format(dta), number=n_loops) / n_loops
           for n in [10 ** m for m in range(6)]],
          [timeit("a = {:s}[100000000:100000000+{:d}]".format(dta, n), "from __main__ import {:s}".format(dta), number=n_loops) / n_loops
           for n in [10 ** m for m in range(6)]]] for dta in ['chrom_11_str', 'chrom_11_bcolz', 'chrom_11_numpy']]

def plot_things(t):
  #t_str, t_bcolz, t_numpy = t
  sl_size = [10 ** m for m in range(6)]
  location = ['(start)', '(end)']
  for n in [0, 1]:
    for t_value, t_name in zip(t, ['str', 'bcolz', 'numpy']):
      pylab.plot(sl_size, t_value[n], label=t_name + location[n])
      #pylab.plot(t_value[1], 'b', label=t_name + '(finish)')
    pylab.setp(pylab.gca(), xlabel='Slice size', ylabel='Time (s)', xscale='log', yscale='log', ylim=[1e-9, 1e-3])
    pylab.legend(loc='lower right')

t = time_things()

Wednesday, August 13, 2014

Looping over a list of files in Bash

Really, bash was meant for hackers:

FILES=*.txt  # What ever filter you want
for f in ${FILES}
  # Your commands here

Saturday, August 9, 2014

Nosetests and pdb

I sat there looking at a testing error and thought "Can't we automatically invoke the debugger when a test fails". This is Python, of course someone has a solution for that.

nosetests --pdb

will invoke the debugger .when one of your tests goes south.

Tuesday, July 29, 2014

Nose and test generators

In my code I have a directory of small python modules that act as plugins. They can be called by a main program as directed. Each plugin, therefore, has a standard interface and needs to work with the main program. I wanted to setup an automated test that would run the main program, load each plugin in turn and run to make sure there was no crash. It turns out that nose has a very elegant way of doing this.

My test module for this turns out to be (with a little paraphrasing and removal of details):

from nose.plugins.skip import SkipTest

def find_plugin_models():
  return a_list_of_the_modules_loaded

def check_plugin(model):
  if an_important_precondition_is_not_met:
    raise SkipTest('I can not run this test')
  assert thing_is_ok()

def test_all_found_plugins():
  """Integration test on automatically found mutation plugin"""
  for model in find_plugin_models():
    check_plugin.description = 'Testing {:s} automatically'.format(model.__name__)
    yield check_plugin, model

Note the cool ability to set a .description attribute that allows you to print what you like when the test runs.

Friday, July 25, 2014

numpy and testing

Numpy has some great builtins for running unit tests under
. There are routines for comparing arrays that come in handy for tests. Many of these have verbose settings that give useful details in a thoughtful format for chasing down why a test failed.

Making docopt and Sphinx work together

I use docopt to build simple, self-documenting command line interfaces to scripts. I was delighted when sphinx's outdoc features seemed to interpret docopt's options well:

But it is apparent that the formatting of the usage part is a bit off and any option that does not begin with -- or - messes up the options formatting completely.

It turns out that Sphinx has a bunch of plugins for documenting command line options with sophisticated ones able to run the code or analyze the argparse object itself but nothing for docopt. One code sample I found wrote out the command line description in restructured text and then stripped out parts of it to make docopt parse it properly

The real solution is, of course, to write a Sphinx extension for docopt, but I ended up using a slight modification that doesn't confuse docopt and lets SPhinx make decent (not great) autodoc out of it.

My docstring looks like this:

Command line::

    fasta2wg  --index=IDX  --wg=WG  [--fa=FA] [-v]
    fasta2wg  describe  --wg=WG

    --index=IDX       Index file (.json) listing fasta files to be inserted into whole genome
    --wg=WG           Name of whole genome file
    --fa=FA           If set, will also dump a fa.gz file (by simply concatenating the files together) for use by BWA
    -v                Be verbose when you do things
    describe          Take the indicated whole genome file as input and print a summary of what it contains

And the Sphinx documentation looks like:

Wednesday, July 23, 2014

Some notes on Python imports and project organization

One majorly annoying thing for Pythonistas is the search path that the import command uses. When I was a beginner and wanted to get cracking I'd simply add whatever I was working on to PYTHONPATH, not wanting to get bogged down in silly details (that's what we were using Python for, right?).

I learned my lesson when I "forked" one of my projects (by duplicating the directory - give me a break, I wasn't born learning version control systems) and spent several hours trying to figure out why changes to a file I was making did not show up when I ran the code - Python, of course, was picking up the old version of the module in the original directory.

My current strategies, in somewhat random order, for project organization and testing are:

Project organization

   |             |------->  (empty)
   |             |-------> root level modules 
   |             |-------> submodule1
   |             |    |---------------> (empty)
   |             |    |----> sub modules
   |             |-------> submodule2
   |                  |---------------> (empty)
   |                  |----> sub modules
                 |------->  import sys; sys.path.append('../mainmodule')
                 |-------> root level module tests 
                 |-------> submodule1
                 |    |---------------> (empty)
                 |    |----> sub module tests
                 |-------> submodule2
                      |---------------> (empty)
                      |----> sub module tests

An important lesson is a) not to forget the files and b) to add relevant paths to sys.path in when needed.

Using this directory structure allows me to use sphinx for the documentation (the source root goes in in docs) and allows me to invoke nosetests by doing

nosetests tests

and the path appending going on in tests/ allows me to import any part of the module in any test script just as I would from a different application after the application has been installed using

Tuesday, July 22, 2014

Use numpydoc + sphinx

An ideal code documentation system should allow you to write documentation once (when you are writing the code) and then allow you to display the documentation in different contexts, such as project manuals, inline help, command line help and so on. You shouldn't need to duplicate docs - it is a waste of effort and leads to errors when the code is updated. The documentation you write should be easily readable by human users as they peruse the code, as well as by users as they run your code.

Sphinx is an awesome documentation system for Python. Sphinx can take descriptions in docstrings and embed them in documentation so that we can approach the goals of an ideal documentation system.

However, Sphinx violates the human readability part when it comes to function parameter descriptions. Take the following function, for example.

def _repeat_sequence(seq_len=100, subseq=None, subseq_len=10, base_sel_rng=None, alphabet=['A', 'C', 'T', 'G']):
Create a sequence by repeating a sub-sequence
subseq = base_sel_rng.choice(alphabet, size=subseq_len, replace=True, p=[.3, .2, .2, .3]).tostring()
return subseq * (seq_len / subseq_len) + subseq[:seq_len % subseq_len]

The Sphinx-parsable way of writing the docstring is:

def _repeat_sequence(seq_len=100, subseq=None, subseq_len=10, base_sel_rng=None, alphabet=['A', 'C', 'T', 'G']):
  """Create a sequence by repeating a sub-sequence

  :param seq_len: Length of sequence.
  :type seq_len: int
  :param subseq: Sub-sequence to use as repeat block. Omit to generate a random sub-sequence.
  :type subseq: str
  :param subseq_len: If subseq is omitted this must be provided to indicate desired length of random sub-sequence
  :type subseq_len: Length of random sub-sequence
  :param base_sel_rng: Random number generator e.g. numpy.random
  :param alphabet: List of characters constituting the alphabet
  :type alphabet: list

  :returns:  str -- the sequence.
  subseq = base_sel_rng.choice(alphabet, size=subseq_len, replace=True, p=[.3, .2, .2, .3]).tostring()
  return subseq * (seq_len / subseq_len) + subseq[:seq_len % subseq_len]

It does not matter that the Sphinx output is well formatted: for humans it's rather yucky to look at which defeats the purpose of docstrings.

It turns out that, as is common in Python, this problem has been noted and rectified. In this case the correction comes from the great folks at numpy who have a nice extension called numpydoc. The numpydoc version of this docstring is:

def repeat_sequence(seq_len=100, subseq=None, subseq_len=10, base_sel_rng=None, alphabet=['A', 'C', 'T', 'G']):
  """Create a sequence by repeating a sub-sequence

  seq_len : int
                              Length of sequence.
  subseq : str, optional
                              Sub-sequence to use as repeat block. Omit to generate a random sub-sequence.
  subseq_len : int, optional
                              If subseq is omitted this must be provided to indicate desired length of random
  subseq_len : int
                              Length of random sub-sequence
  base_sel_rng : object
                              Random number generator e.g. numpy.random
  alphabet : list, optional
                              List of characters constituting the alphabet

      The sequence.

  .. note:: This is meant to be used internally

  subseq = base_sel_rng.choice(alphabet, size=subseq_len, replace=True, p=[.3, .2, .2, .3]).tostring()
  return subseq * (seq_len / subseq_len) + subseq[:seq_len % subseq_len]

Note: The only glitchy thing is that when you add numpydoc to the list of extensions in Sphinx's it can't be ahead of standard sphinx extensions - the order matters. For example, my extensions list is:

extensions = [

Putting numpydoc first leads to the error:
Extension error:
Unknown event name: autodoc-process-docstring

Thursday, July 10, 2014

Recap: Python list indexing is O(1)

Just in case any of you, hearing about how slow and ponderous Python is, were worried about indexing into a list - it's a O(1) operation:

In [936]: big_list = range(int(1e8))
In [937]: len(big_list)
Out[937]: 100000000

In [938]: %timeint big_list[-1]
ERROR: Line magic function `%timeint` not found.

In [939]: %timeit big_list[-1]
10000000 loops, best of 3: 66.1 ns per loop

In [940]: %timeit big_list[0]
10000000 loops, best of 3: 65.2 ns per loop

In [941]: small_list = range(1000)

In [942]: %timeit small_list[0]
10000000 loops, best of 3: 70.6 ns per loop

In [943]: %timeit small_list[-1]
10000000 loops, best of 3: 69.7 ns per loop

In [944]: %timeit big_list[10000]
10000000 loops, best of 3: 68.2 ns per loop

In [945]: %timeit small_list[100]
10000000 loops, best of 3: 70.2 ns per loop

Wednesday, July 9, 2014

Mac OS X: Coding in the dark

So, sometimes you need to work at night and not light up the whole room (e.g. when there is a sleeping baby next to you). The dimmest setting on the mac screen is still annoyingly bright, and indeed after a while, begins to hurt my eyes. One native thing to try is "invert colors" under "screen" in "accessibility". It makes the screen look like a negative image. It looks very cool, a bit annoying if you have images, but for text it is a very good counterpart to bright backgrounds during the day.

Monday, July 7, 2014

Python: To format or to concatenate

A while ago a kindly reader pointed out that Python's string .format method is, after all, a function, and carries with it some over head. I have some inner loop code that I could stand to run a little faster and I was looking for a way to speed things up without losing readability. In one part of the loop I was creating a string using the format statement. I wondered if I could speed things up by changing that to a concat.

So I first tested it out on some toy code:

def str_format(size=int(1e6)):
  for n in range(size):
    a = 'hi {:d} {:d} {:d}'.format(n, n+1, n+2)
  return a

def str_concat(size=int(1e6)):
  for n in range(size):
    a = 'hi ' + str(n) + str(n+1) + str(n+2)
  return a

In [448]: %timeit str_concat()
1 loops, best of 3: 996 ms per loop

In [449]: %timeit str_format()
1 loops, best of 3: 1.26 s per loop

So, the plain python concat is faster than the elegant way. This held with one variable or two variables too. This probably has to do with the complexity of the .format function, which offsets even the extra calls to str

Sunday, July 6, 2014

Doctests or nosetests?

I had a torrid love affair with doctests. It was AMAZING! You mean you can write documentation AND test your code at the same time!? You can write a line of code, put down the answer next to it and the computer WILL CHECK IT FOR YOU?

Well, this went on for several months. Then one morning I woke up next to this:

def handle_variant(variant, f_vseq, f_vseq_pos, strand_no=None):
  """Write the variant to the mutated sequence and fill out the pos array

  Some setup
  >>> import io, vcf, struct
  >>> def add_GT(v, gt):
  ...   md=vcf.model.make_calldata_tuple('GT')(gt)
  ...   call=vcf.model._Call(v,'sample',md)
  ...   v.samples = [call]
  ...   v._sample_indexes = {'sample': 0}

  Test with SNP: ignore zygosity
  >>> ref_seq = 'ACTGACTG'; \
  pos = 2; \
  ref = 'C'; \
  alt = [vcf.model._Substitution('T')]; \
  variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, None, None, None); \
  add_GT(variant, '1/1'); \
  f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos)
  >>> _ =; print
  >>> _ =; print struct.unpack('I',

  Test with SNP: homo, strand 0
  >>> f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos, 0)
  >>> _ =; print
  >>> _ =; print struct.unpack('I',

  Test with SNP: homo, strand 1
  >>> f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos, 1)
  >>> _ =; print
  >>> _ =; print struct.unpack('I',

  Test with SNP: het, strand 0
  >>> add_GT(variant, '0/1'); \
  f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos, 0)
  >>> _ =; print
  >>> _ =,2); print f_vseq.tell()

  Test with SNP: het, strand 1
  >>> f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos, 1)
  >>> _ =; print
  >>> _ =; print struct.unpack('I',

  Test with delete: ignore zygosity
  >>> ref_seq = 'ACTGACTG'; \
  pos = 2; \
  ref = 'CTG'; \
  alt = [vcf.model._Substitution('C')]; \
  variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, None, None, None); \
  add_GT(variant, '1/0'); \
  f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos)
  >>> _ =; print
  >>> _ =; print struct.unpack('I',

  Test with same delete, strand 1 (same as REF)
  >>> f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos, 1)
  >>> _ =; print
  >>> _ =,2); print f_vseq.tell()

  Test with delete and .
  >>> ref_seq = 'ACTGACTG'; \
  pos = 8; \
  ref = 'G'; \
  alt = [vcf.model._Substitution('.')]; \
  variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, None, None, None); \
  f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos)
  >>> _ =; print
  >>> _ =,2); print f_vseq.tell()

  Test with delete and .
  >>> ref_seq = 'ACTGACTG'; \
  pos = 4; \
  ref = 'GACTG'; \
  alt = [vcf.model._Substitution('.')]; \
  variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, None, None, None); \
  f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos)
  >>> _ =; print
  >>> _ =,2); print f_vseq.tell()

  Test with insert
  >>> ref_seq = 'ACTGACTG'; \
  pos = 2; \
  ref = 'C'; \
  alt = [vcf.model._Substitution('CGGG')]; \
  variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, None, None, None); \
  f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos)
  >>> _ =; print
  >>> _ =; print struct.unpack('4I',*4))
  (2, 3, 3, 3)

  Test with insert
  >>> ref_seq = 'ACTGACTG'; \
  pos = 8; \
  ref = 'G'; \
  alt = [vcf.model._Substitution('GTTT')]; \
  variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, None, None, None); \
  f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos)
  >>> _ =; print
  >>> _ =; print struct.unpack('4I',*4))
  (8, 9, 9, 9)

  See for details of algorithm
  var_type = '1' if strand_no is None else variant.genotype('sample').data.GT.split('/')[strand_no]
  # 0 means REF 1 means ALT. If we don't specify a strand number it means we don't care about Zygosity and will always
  # take the ALT
  if var_type == '1':
    alt = variant.ALT[0].sequence
    if alt == '.': alt = ''
    ref = variant.REF
    if ref == '.': ref = ''
    f_vseq.write(alt)    # Copy over ALT
    new_ref_pos = len(ref) + variant.POS - 1  # Advance along ref_seq
                                              # -1 because POS is 1-indexed, we are 0-indexed internally
    if len(alt) > 0:
      if len(ref) > 0:
        f_vseq_pos.write(struct.pack('I', variant.POS))  # The original base
      if len(alt) > 1:
        f_vseq_pos.write(struct.pack('{:d}I'.format(len(alt) - len(ref)), *[new_ref_pos + 1] * (len(alt) - len(ref))))
    new_ref_pos = variant.POS - 1  # Keep us here - we didn't implement this variant and we should keep copying
                                   # -1 because POS is 1-indexed, we are 0-indexed internally
  return new_ref_pos

There were several functions like this. I was refactoring this code and when I would work on these functions I would spend minutes trying to find the actual body of the function, often puzzling over what turned out to be a doctest and not the actual code. I had been warned about this by well meaning relatives.

Doctests are awesome when you can combine a test and example and it is pithy. When you need to setup complex data structures or you need to test lots of use cases, it is better to put tests in a separate file, in a separate directory and run them with nose.

Doctests can still be used in the documentation as examples of using functions with the added benefit that the examples are kept upto date by being tested.

Saturday, July 5, 2014

Python string concatenation

Continuing with our timing and efficiency theme: What is the more efficient way of building up a string?

from cStringIO import StringIO

def string_concat(size=int(1e6)):
  repeat = 'A'
  st = ''
  for n in range(size):
    st += repeat
  return st

def list_concat(size=int(1e6)):
  repeat = 'A'
  st = []
  for n in range(size):
    st += repeat
  return ''.join(st)

def stringIO_concat(size=int(1e6)):
  repeat = 'A'
  buf = StringIO()
  for n in range(size):
  return buf.getvalue()

Turns out the simple way is best:

In [91]: %timeit string_concat()
10 loops, best of 3: 121 ms per loop

In [92]: %timeit list_concat()
1 loops, best of 3: 293 ms per loop

In [93]: %timeit stringIO_concat()
1 loops, best of 3: 336 ms per loop

Friday, July 4, 2014

numpy arrays and iterators

Is it better to access a numpy array with an iterator or indexing?

import numpy

# This is how I've always done it - straight python idiomatic usage
def iterate1(m=100000):
  a = numpy.empty(m)
  sum = 0
  for n in a:
    sum += n
  return sum

# Didn't know about this one, but it is the recommended one in the docs
def iterate2(m=100000):
  a = numpy.empty(m)
  sum = 0
  for n in numpy.nditer(a):
    sum += n
  return sum

#This is C-like 
def loop(m=100000):
  a = numpy.empty(m)
  sum = 0
  for n in range(a.size):
    sum += a[n]
  return sum

And the shootout?

>>> %timeit iterate1()
10 loops, best of 3: 35.4 ms per loop
>>> %timeit iterate2()
10 loops, best of 3: 63.7 ms per loop
>>> %timeit loop()
10 loops, best of 3: 45.1 ms per loop

>>> %timeit iterate1(1e6)
1 loops, best of 3: 368 ms per loop
>>> %timeit iterate2(1e6)
1 loops, best of 3: 605 ms per loop
>>> %timeit loop(1e6)
1 loops, best of 3: 443 ms per loop

So the pythonic one seems to be the best

Thursday, July 3, 2014

HDF5 is not for fast access

HDF5 is a good solution for storing large datasets on disk. Python's h5py library makes it possible to pretend that data stored on disk is just like an in memory array. It is important to keep in mind that the data is really stored on disk and is read in every time a slice or index into the data is taken.

import numpy
import h5py

def create_data(length=1e4):
  data = numpy.random.rand(length)
  with h5py.File('test.h5', 'w') as fp:
    fp.create_dataset('test', data=data)
  return data

def access_each_h5():
  y = 0
  with h5py.File('test.h5', 'r') as fp:
    for n in range(fp['test'].size):
      y += fp['test'][n]
  return y

def access_each_array(data):
  y = 0
  for n in range(data.size):
    y += data[n]
  return y

d = create_data()

>>> run
>>> %timeit access_each_array(d)
100 loops, best of 3: 4.14 ms per loop
>>> %timeit access_each_h5()
1 loops, best of 3: 1.9 s per loop
That sobering difference in performance reminds us that we can't - performance wise - equate the two. When processing data from an hdf5 file, it is best to read in as large chunks as your memory will allow and do the heavy lifting in memory.

Friday, June 20, 2014

A note on Python's __exit__() and errors

Python's context managers are a very neat way of handling code that needs a teardown once you are done. Python objects have do have a destructor method (__del__) called right before the last instance of the object is about to be destroyed. You can do a teardown there. However there is a lot of fine print to the __del__ method. A cleaner way of doing tear-downs is through Python's context manager, manifested as the with keyword.

class CrushMe:
  def __init__(self):
    self.f = open('test.txt', 'w')

  def foo(self, a, b):
    self.f.write(str(a - b))

  def __enter__(self):
    return self

  def __exit__(self, exc_type, exc_val, exc_tb):
    return True

with CrushMe() as c:, 3)

One thing that is important, and that got me just now, is error handling. I made the mistake of ignoring all those 'junk' arguments (exc_type, exc_val, exc_tb). I just skimmed the docs and what popped out is that you need to return True or False depending on whether there was an error. So I wrote my code to return False if there was a problem in saving the file (My actual code is a little more involved, but the same in spirit) and True otherwise.

So what happens now if you do
with CrushMe() as c:'2', '3')

Of course, YOU know that this code will error out - you can't subtract strings. But if you run this code, it will FAIL SILENTLY. This is because I was careless and did not consider what happens if there is an error SOMEWHERE ELSE.

The proper way to do this, as a minimum, is to change the code to

def __exit__(self, exc_type, exc_val, exc_tb):
    return True if exc_type is None else False

Another note: defining both a __del__ method and an __exit__ method can lead to tricky situations. The following code, for, instance, calls the close method twice.

class CrushMe:
  def __init__(self):
    self.f = open('test.txt', 'w')

  def foo(self, a, b):
    self.f.write(str(a - b))

  def __del__(self):

  def __enter__(self):
    return self

  def __exit__(self, exc_type, exc_val, exc_tb):
    return True if exc_type is None else False

  def close(self):
    print 'Called!'

with CrushMe() as c:, 3)

close gets called first by __exit__ when we exit the context and then when we exit the interpreter and the object is deleted.

Sunday, June 1, 2014

Feedburner and the fragmentation of google

Wordpress has a plugin for sending your post to twitter or linkedin (and linkedin has a builtin way of sending your updates to twitter) but I didn't see anything like that for blogger (except a G+ button for publicizing to that thriving community). Whenever I did a search I came up with some third party service. Finally I ran into Feedburner, which turns out to be Google's own service for publicizing things on social media. This just led me to an observation that google has so many services and they CAN talk to each other, but there are complicated layers between them.

Wednesday, May 21, 2014

Python 'or'

I've been trying to write more Pythonic code. I'm sure everyone has their own definition of what Pythonic is, but everyone will agree on things like using list/dict comprehensions where possible and so on and so forth.

I was perusing some code written by some colleagues and found this:

    push_data = {'message': container.message or 'NO COMMIT MESSAGE',

This took me to the basic definition of Python's or keyword and, notably, this clause: "These only evaluate their second argument if needed for their outcome."

Way cool! I would previously have used a construction like

    push_data = {'message': container.message if container.message else 'NO COMMIT MESSAGE',

and patted myself on the back for being Pythonic, but there is an even better way to do it.

I must say, though, that I would not use the corresponding and construction because it is not so intuitive to me.

0 and 56

Tuesday, May 20, 2014

PyVCF and header lines

PyCVF which I use for loading VCF files ignores* all lines of the header (## lines) but needs the column header line - if you omit this than PyVCF will skip the first variant line.

*And by ignores I really meant - interprets such lines as ##key=value pairs which are then found in vcf_reader.metadata.

Tuesday, May 13, 2014

Finder on Mavericks slowing down?

If you find that Finder on Mavericks has gotten obscenely slow you can try resetting the finder's .plist and restarting the finder (From the solution here):

rm ~/Library/Preferences/ Finder

Tuesday, May 6, 2014

Darwin and case sensitivity in filenames

I normally use a Mac as a pretty interface to a unix (POSIX) system and don't run into problems, but every now and then my whole workflow will come to a stop because Darwin did that one thing slightly different from everyone else and I have to run that down. This time it was case sensitive file names.

POSIX file systems are case sensitive. So a file called Test is different from TEST is different from tesT (Though one is recommended not to torture users in that fashion).

So under a proper POSIX system you can do

mkdir Test
mkdir tesT
mkdir TEST

and end up with three different directories.

Darwin lulls you into a false sense of security by accepting all three forms but behind the scenes it simply maps it all to the same form. So you get a nonsensical complaint as you do this

mkdir Test  # OK
mkdir tesT  # Directory already exists? WTF?
mkdir TEST  # Directory already exists? Whaaa?


Monday, May 5, 2014

Doctesting Python command line scripts

The purists will hate this one, but the pragmatists may not condemn me so much. I'm writing Python because I want compact, easy to understand and maintain code. If I was into writing a ton of complicated boilerplate I would have used C.

In keeping with this philosophy I love doctest for testing Python code. Many a time writing code with an eye to doctesting it has led me to more modular and functional code. Perhaps this code was a little slower than it could be, but boy was it fast to write and debug and that's the reason we are in Python, right?

One bump in the road is how do you doctest whole command line applications? Vital parts of my code consist of Python scripts that are meant to be called as command line tools. The individual parts have been tested with doctest, but there is often a need for a gestalt test, where several tools are chained and the output needs to be tested.

There is something called shell-doctest which seems to do exactly this, but it is yet another third party package and not very recently maintained, or very widely known. I'm not sure I should use it.

The creator of Python has recommended creating a main function that can be then tested with forced arguments. I like this, and may convert my 'main' sections this way, but there is something to be said to maintaing the 'conversational' style of doctest, where you simply plop down python commands as you would normally and have the correct answers as if you were in an interactive programming environment.

My current experiment is to merge the gestalt test with my instruction manual so that I'm writing the command line examples into my file which then doubles as a test file which can be executed by running python -m doctest -v

In fact, if you are in IPython, the magic function run does exactly what I want - which, is to run a complete shell command, like run arg1 arg2 arg3. Unfortunately doctest uses a vanilla python shell which does not have this facility.

My solution is to define a function I call shell early on in the document.

>>> import shlex, subprocess
>>> def shell(command):

Now, in my file I have things like:

You can run the data processing program as

>>> shell('python infile.txt outfile.txt')

Which will result in this processed data

>>> with open('outfile.txt','r') as f: print
Output Data

Which serves both as do by example documentation as well as test.

Sunday, May 4, 2014

Storing state in a Python function

This one blew me away. You can store state in a function, just like you would any object. These are called function attributes. As an aside, I also learned that using a try: except clause is ever slightly so faster than an if.

def foo(a):
    foo.b += a
  except AttributeError:
    foo.b = a
  return foo.b

def foo2(a):
  if hasattr(foo2, 'b'):
    foo2.b += a
    foo2.b = a
  return foo2.b

if __name__ == '__main__':
  print [foo(x) for x in range(10)]
  print [foo2(x) for x in range(10)]

python -mtimeit -s'import test' '[ for x in range(100)]'
python -mtimeit -s'import test' '[test.foo2(x) for x in range(100)]'

Giving us:

[0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
[0, 1, 3, 6, 10, 15, 21, 28, 36, 45]

python -mtimeit -s'import test' '[ for x in range(100)]'  ->  10000 loops, best of 3: 29.4 usec per loop
python -mtimeit -s'import test' '[test.foo2(x) for x in range(100)]' -> 10000 loops, best of 3: 39.1 usec per loop

Thursday, May 1, 2014

Python: Maps, Comprehensions and Loops

Most of you will have seen this already but:

c = range(1000)
d = range(1,1001)

def foo(a,b):
  return b - a

def map_foo():
  e = map(foo, c, d)
  return e

def comprehend_foo():
  e = [foo(a, b) for (a,b) in zip(c,d)]
  return e

def loop_foo():
  e = []
  for (a,b) in zip(c,d):
    e.append(foo(a, b))
  return e

def bare_loop():
  e = []
  for (a,b) in zip(c,d):
    e.append(b - a)
  return e

def bare_comprehension():
  e = [b - a for (a,b) in zip(c,d)]
  return e

python -mtimeit -s'import test' 'test.map_foo()'
python -mtimeit -s'import test' 'test.comprehend_foo()'
python -mtimeit -s'import test' 'test.loop_foo()'
python -mtimeit -s'import test' 'test.bare_loop()'
python -mtimeit -s'import test' 'test.bare_comprehension()'

In order of speediness:

test.bare_comprehension() -> 10000 loops, best of 3: 97.9 usec per loop
test.map_foo() -> 10000 loops, best of 3: 125 usec per loop
test.bare_loop() -> 10000 loops, best of 3: 135 usec per loop
test.comprehend_foo() -> 10000 loops, best of 3: 159 usec per loop
test.loop_foo() -> 1000 loops, best of 3: 202 usec per loop

So, where-ever you can, use map. The reason map_foo is slower than bare_comprehension is the function overhead.

Python doctest and functions that do file I/O

Some of my code contains functions that read data from files in chunks, process them, and then write data out to other files. I thought it was silly and expensive to first read the data into a list/array, work on them, write out the results to another list and then write the list out to files. But how do you write doctests for such functions?

One way would be to explicitly open files for reading/writing but this is a pain because you have to first create the file, process them, check the answers and then remember to clean up (delete) the files afterwards. You could use the wonderful Python tempfile module, but ...

In my case I was passing file handles to the functions. One convenient way to test such functions without creating physical files on disk is to use io.BytesIO(). This function returns a data structure that has all the characteristics of a file but resides in memory.

For example, suppose you have a cool function that reads in a file and writes out every second byte.

def skip_a_byte(fin, fout):
  b =
  while b:
    if fin.tell() % 2:
    b =

We could outfit with a doctest harness as follows:

def skip_a_byte(fin, fout):
  >>> import io; \
  data = 'Haopwf gNqojwj jBlrmowwjnl BCloRwT'; \
  f1 = io.BytesIO(data); \
  f2 = io.BytesIO(); \
  skip_a_byte(f1, f2); \
  _ =; \
  How Now Brown Cow
  b =
  while b:
    if fin.tell() % 2:
    b =

if __name__ == "__main__":
  import doctest

Oh, BTW, I would like to give a shoutout to PyCharm for being very cool: they recognize the code in the docstring, and can do code analysis on it!

Wednesday, April 30, 2014

How to tell if a property is a rental

From a thread here:

  1. The most accurate is to physically go to the place and politely ask the residents - making clear that you are considering buying and are trying to determine how many units are rentals.
  2. Online, if you go to the assessor's database if the owner's mailing address is different from the house address it's most certainly not a primary residence. This may mean it's rented out.
  3. Online, if you can see the property tax amount, many towns in Mass. have a residential exemption which applies to primary residences only. If the tax seems lower, then this is a primary residence.

Tuesday, April 29, 2014

What makes marine grade wire special?

I was in the market for pain old 14 AWG wire for a project when I came across spools of wire being sold as "tinned marine grade wire". I was curious to know what made this wire "marine grade". From this page, it turns out, marine grade wire is

  1. 12% bigger for a given wire gauge. So 14G marine wire is thicker than regular 14G wire
  2. Each strand of the wire is finer and individually coated with tin, making it resistant to corrosion
  3. The insulation on the wire is oil, moisture and heat resistant.

Monday, April 28, 2014

Breaking BAM

It's fun to mess with our bioinformatics tools and laugh at ourselves. The BAM format carries a query name string. In an idle moment I wondered, how long a string can I put here before bad things happen to the BAM?

Generally this string carries information related to the device that produced the read, lane number, run number and so and so forth. It's completely specification free and everyone encodes information here in their own way, so you would think it's an arbitrary length string. Almost.

Try out the short Python snippet below. It generates a dummy fasta file (test.fa) and a dummy BAM file (test.bam) with one 'aligned' read. The only funky thing is that you can make the qname field of the read as long as you want (by varying the lone input parameter to the script).

import sys
import pysam

  qnamelen = int(sys.argv[1])
  qnamelen = 255

with open('test.fa', 'w') as f:

bam_hdr = {'HD': {'VN': '1.4'},
           'SQ': [{'LN': 100, 'SN': 'test'}]}
bf = pysam.Samfile('test2.bam', 'wb', header=bam_hdr)  # Write binary BAM with header
ar = pysam.AlignedRead()
ar.qname = 'Q'*qnamelen  # Gives Tablet headaches when str len > 254
ar.seq = 'A'*50
ar.qual = '~'*50
ar.mapq = 100
ar.pos = 0
ar.cigarstring = '50M'
pysam.sort('test2.bam', 'test')

Now open the pair of fasta and BAM files in samtools tview, Tablet and IGV (and indeed in anything else). Laugh loudly until your office mates banish you.

Only samtools tview - the least fancy one - loads the alignment without batting an eyelid.

The specs allow an arbitrary length null terminated string.

(Oh, as an aside, try this. Create a .fasta file with this simple corruption: Remove the newline from the sequence id in the fasta file (so that the sequence looks something like 'testAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'. Load the sequence in IGV. Now fix the sequence and try and convince IGV to load the new corrected sequence. Hint: it won't go away until you use igvtools to reindex the fasta file. running bwa index won't help)

Saturday, April 26, 2014

Adjusting sticking doors

Several of the doors in ye old house were sticking at the top at the latch end. The door was angled too high (or the door frame top was drooping). My solution, in each case, was to use a chisel to deepen the lower hinge seat on the door frame. This allowed the bottom of the door to swing towards the hinge side, bringing the top away from the frame. I could have added shims to the top hinge too. Indeed, for one of the doors I added shims to the top hinge, chiseled away the lower hinge AND chiseled away a bit of the top corner.

Wednesday, April 23, 2014

Bash conditionals

set -x
if [ ${MYVAR} = 1 ]; then
    : First option
elif [ ${MYVAR} = 2 ]; then
    : Second option
elif [ ${MYVAR} = 3 ]; then
    : Third option


+ '[' 3 = 1 ']'
+ '[' 3 = 2 ']'
+ '[' 3 = 3 ']'
+ : Third option

You need to watch out for whitespace, as bash is sensitive to whitespace

set -x
if [${MYVAR} = 1 ]; then
    : First option
elif [ ${MYVAR} = 2 ]; then
    : Second option
elif [ ${MYVAR} = 3 ]; then
    : Third option


+ '[3' = 1 ']'
./ line 4: [3: command not found
+ '[' 3 = 2 ']'
+ '[' 3 = 3 ']'
+ : Third option

Very, very sensitive to whitespace

set -x
if [ ${MYVAR}=1 ]; then
    : First option
elif [ ${MYVAR} = 2 ]; then
    : Second option
elif [ ${MYVAR} = 3 ]; then
    : Third option


+ '[' 3=1 ']'
+ : First option

Monday, April 14, 2014

The magic of mmap

Big data is sometimes described as data whose size is larger than your available RAM. I think that this is a good criterion because once the size of your data (or the size of any results of computing on your data) start to approach your RAM size you have to start worrying about how you are going to manage memory. If you leave it up to your OS you are going to be writing and reading to disk in somewhat unpredictable ways and depending on the software you use, your program might just quit with no warning or with a courtesy 'Out of memory' message. The fun challenge of "Big Data" is, of course, how to keep doing computations regardless of the size of your data and not have your computer quit on you. Some calculations can be done in a blocked fashion but some calculations require you to access different parts of the data all at once.

Python's mmap module is an excellent way to let someone else do the dirty work of handling data files that are comparable or larger than available memory.

import mmap
import numpy

def load_data():
  fin = open('../Data/human_chrom_11.smalla', 'r+b')
  x =
  y = x[numpy.random.randint(0,len(x))]
  print y

def map_data():
  fin = open('../Data/human_chrom_11.smalla', 'r+b')
  x = mmap.mmap(fin.fileno(), 0)
  y = x[numpy.random.randint(0,len(x))]
  print y


Here the .smalla data files are simply somewhat large files that can be loaded into memory (which we do for illustration purposes) but which we'd rather not. Running this code with memory_profiler

python -m memory_profiler

tells us:


Line #    Mem usage    Increment   Line Contents
     5   16.922 MiB    0.000 MiB   @profile
     6                             def load_data():
     7   16.926 MiB    0.004 MiB     fin = open('../Data/human_chrom_11.smalla', 'r+b')
     8  145.680 MiB  128.754 MiB     x =
     9  145.691 MiB    0.012 MiB     y = x[numpy.random.randint(0,len(x))]
    10  145.691 MiB    0.000 MiB     print y


Line #    Mem usage    Increment   Line Contents
    12   16.941 MiB    0.000 MiB   @profile
    13                             def map_data():
    14   16.941 MiB    0.000 MiB     fin = open('../Data/human_chrom_11.smalla', 'r+b')
    15   16.945 MiB    0.004 MiB     x = mmap.mmap(fin.fileno(), 0)
    16   16.953 MiB    0.008 MiB     y = x[numpy.random.randint(0,len(x))]
    17   16.953 MiB    0.000 MiB     print y

As we can see from the 'increment' column, when we map the data we hardly use any memory at all compared to the 128 MB that we heat up when we load the data into memory at once.

We should keep in mind that we have traded off space for time here. Even with a SSD operating on data from disk is going to take much longer than operating on data that is all in memory, but at least we are able to do it.

Now, if you want both fast access and operability with limited RAM, you need a much larger bag of tricks which I don't have and which often heavily depends on tricks you can do with your particular data.

The logging overhead.

Python makes printing logger messages, and adjusting the logger level (which messages to print when) very easy. However, it seems, that the logger code comes with a higher overhead than if you used simple 'if' statements to control what messages to print.

Logger messages are very, very useful. Two typical uses of logger messages is to debug things when a program goes awry and to print out the progress of a program (if the user wishes it) to reassure us that the program is running and inform us of how much further the program has to go.

Python's logger is a lot of fun because it is very easy to set up logging messages at different levels and then filter which messages you will actually show. For example you can code the program to print out values of some variables at the  'DEBUG' level and print out the progress of the program at the 'INFO' level. Then you can instruct the code to print out only the INFO messages or both the DEBUG and INFO messages.

It's very likely that actually printing logging messages slows down the program, but I began to wonder what kind of overhead even the ignored logger messages were introducing to the program. One could think that placing a debug logger message in a loop is a good debug tool which can be conveniently switched off when in production with little effect, but how true is that?

import logging
logger = logging.getLogger("Meta")

def foo(N):
  sum = 0
  for n in range(N):
    sum += n
    logger.debug('This is message {:d}'.format(n))
  x = sum**2

def bar(N):
  sum = 0
  for n in range(N):
    sum += n
  x = sum**2

def foobar(N, flag=False):
  sum = 0
  for n in range(N):
    sum += n
    if flag:
      logger.debug('This is message {:d}'.format(n))
  x = sum**2

if __name__ == "__main__":
  import timeit
  print 'No logger message: ', timeit.timeit('bar(100)', setup='from __main__ import logging, bar; logging.basicConfig(level=logging.ERROR)', number=1000)
  print 'Suppressed logger message: ', timeit.timeit('foo(100)', setup='from __main__ import logging, foo; logging.basicConfig(level=logging.ERROR)', number=1000)
  print 'Logger message behind false if statement: ', timeit.timeit('foobar(100)', setup='from __main__ import logging, foobar; logging.basicConfig(level=logging.ERROR)', number=1000)
  print 'Logger message behind true if statement: ', timeit.timeit('foobar(100, True)', setup='from __main__ import logging, foobar; logging.basicConfig(level=logging.ERROR)', number=1000)

Which gives us

No logger message:  0.00471091270447
Suppressed logger message:  0.140299081802
Logger message behind false if statement:  0.00491690635681
Logger message behind true if statement:  0.133023023605

This is interesting. It shows that the overhead for the suppressed logging statement is very high, and is in fact higher than that for a simple if statement - that is if you had code where you suppressed printing of a log message using an if statement you would waste less time than if you relied on logger's own logic to handle log levels.

This is a little disappointing.

UPDATE: Following Anon's suggestion to use
logger.debug('This is message %d', n)
I found that it did indeed reduce the logger overhead, but not by much. The new times are:

No logger message:  0.0054669380188
Suppressed logger message:  0.0824329853058
Logger message behind false if statement:  0.00509595870972
Logger message behind true if statement:  0.0940999984741

Tuesday, April 8, 2014

Bash: print commands as they execute

I'm making a demo script that calls programs from a toolchain I am developing. The idea of the script is that it runs a set of commands in sequence and also prints what the command will be doing. Say the base script was:


I first put in echo to describe the command

echo 'Now we do blah blah'

But, I wanted to see the commands too. I discovered the set -x command, which starts the debug mode and prints out every command as it is executed.

set -x
echo 'Now we do blah blah'

But, this now, effectively, printed out the description twice: once when printing the echo statement and then again when actually executing the echo statement. Then I discovered the No OPeration (NOP) character for bash ":".

set -x
: Now we do blah blah

This prints the line but does not try to execute it.

Thursday, April 3, 2014

Python: passing a mix of keyword arguments and dictionary arguments to a function

So Python is cool because of keyword arguments:

def foo(a=1,b=2,c=3):
  print a,b,c

foo(a=1) # -> 1 2 3

Python is cool because you can pass a dictionary whose keys match the argument names:

def foo(a=1,b=2,c=3):
  print a,b,c

args = {'a': 1, 'b':2}
foo(**args) # -> 1 2 3

But, can you mix the two? Yes, yes you can!

def foo(a=1,b=2,c=3):
  print a,b,c

args = {'a': 1, 'b':2}
foo(c=3, **args) # -> 1 2 3

Hmm, can we screw up the interpreter? What happens if we send the same argument as a keyword AND a dictionary?

def foo(a=1,b=2,c=3):
  print a,b,c

args = {'a': 1, 'b':2}
foo(a=4, **args) # -> TypeError: foo() got multiple values for keyword argument 'a'

Nothing gets past Python, eh?

Monday, March 31, 2014

Washing machine hoses

When our home inspector went through he mentioned to me that I should replace the existing rubber hoses on the washing machine with steel-reinforced ones. I wondered if washing machine hoses were specially prone to fail, perhaps something to do with the intermittent nature of the load (if you look at the hoses when the washing machine runs, they jerk as the machine draws water in phases). I also began to wonder if dishwasher hoses suffered from the same problem.

A quick search brought up two documents, one from a community association underwriter - which I assume insures people with rental properties. The other document I found was from an insurance association serving hotels and inns. The first one claims that steel-reinforced hoses are no better than rubber hoses since the hoses are damaged at the connection. The other one suggests that steel-reinforced hoses may be better than regular rubber hoses but all hoses should be replaced every five years or so.

I could not find reasons why these hoses were more susceptible to rupturing and why the dish washer water connection does not suffer from the same maintenance issue.

execfile and imported modules

I was given to believe that Python's execfile statement simply executed all the commands in a file and continued on its way. This is not entirely true, at least in Python 2.7: imported modules seem not to be handled by the execfile statement, which seems to be rather odd to me.

import numpy

def gen(n=100):
  return numpy.arange(n)

This code does what you expect when you import it as a module:

In [1]: import test

In [2]: test.gen(10)
Out[2]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

And when you run it as a script to incorporate it into your workspace:

In [3]: run test

In [4]: gen(10)
Out[4]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

However, when you use execfile to run the script you run into a snag:

In [5]: pars = {}; execfile('', {}, pars); pars['gen'](10)
NameError                                 Traceback (most recent call last)
<ipython-input-5-b06061c74d2b> in <module>()
----> 1 pars = {}; execfile('', {}, pars); pars['gen'](10)
/Users/kghose/Code/Mitty/Test/ in gen(n)
      4 def gen(n=100):
----> 5   return numpy.arange(n)
NameError: global name 'numpy' is not defined


If, instead, you use the fascinating standard Python imp module, you get:

In [7]: import imp; mod = imp.load_source('test','./', open('','r'))
In [8]: mod
Out[8]: <module 'test' from './test.pyc'>
In [9]: mod.gen(10)
Out[9]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Things run as they should.

Tuesday, March 25, 2014

Parameter files in Python for Python programs

I often write and use Python programs that require input parameters to tell them what to do. When there are a few parameters I usually pass them through the command line (my current favorite framework for command line argument parsing is the amazing docopt module). When there are many parameters, however, command line arguments can get cumbersome.

In the past I have looked to configuration markup languages (YAML comes to mind) and have also used microsoft excel format files (using the excellent xlrd package). Using proprietary formats makes my stomach upset, but there are no good solutions for the odf counterparts.

Recently I have started to experiment with storing parameter files in python and reading them into my code.

__import__(filename) works and puts the parameters into a separate name space, so you can access them as filename.x, filename.y etc. However, the file needs to be in the module search path, and it feels a bit of a misuse.

A better solution, I find, is to use execfile to read the parameter file into the code. There are no path restrictions, execfile IS meant to execute code (so no misuse there) and the variables from the Python file are loaded into a separate dictionary so they are contained and do not spill all over your code.

#Main file
pars = {}
execfile('', {}, pars)

With looking like

a = 43
b = 'a string'
c = {'d': 23}

We get:

pars --> {'a': 43, 'b': 'a string', 'c': {'d': 23}}

We can put in comments in the file and structure however we need which I find very convenient.

execfile has disappeared in Python 3 but you can use

with open(filename) as f:
    exec(compile(, filename, "exec"))

UPDATE: However, see a funny issue with execfile here.

Monday, March 24, 2014

git subtree split

Problem: You have a git hub repository with code in several folders. You want to move one of the sub folders into its own separate repository (for example, it was experimental code you were working on, which you now want to spin off into its own life)

Solution: Use git subtree split as detailed here (look for the answer that says The Easy Way™)

Wednesday, March 19, 2014

Another reason to love PyCharm

The latest version of PyCharm can run the tool on your code to flag coding style violations on the fly. I thought this was cool, but it can be annoying sometimes. This is because pep8 consists of coding guidelines and some of them are a matter of taste. My personal quirk is that I like to use 2 spaces for indents instead of 4. 

When I turn on PyCharm’s pep8 checking all my code gets underlined because I’m using two spaces for indents. However, I discovered that under Preferences->Inspections->PEP 8 Coding style violation and there is a properties list labelled “Ignore” where you can type in pep8 violations to be ignored.

UPDATE: From this very informative post here and a response from one of PyCharm's programmers there: Go to the error, click on the light bulb icon and then select 'Ignore errors like this'. This will automatically add a proper exception to this list. For those interested the list of error codes is here.

I typed in ‘indentation’ and was delighted to note that this made those annoying flags disappear (This should not have worked). Also, you can change the severity of the violation flag - I changed it from warning to typo, which is less visually distracting.