Skip to main content


Showing posts from July, 2014

Nose and test generators

In my code I have a directory of small python modules that act as plugins. They can be called by a main program as directed. Each plugin, therefore, has a standard interface and needs to work with the main program. I wanted to setup an automated test that would run the main program, load each plugin in turn and run to make sure there was no crash. It turns out that nose has a very elegant way of doing this.

My test module for this turns out to be (with a little paraphrasing and removal of details):

from nose.plugins.skip import SkipTest def find_plugin_models(): return a_list_of_the_modules_loaded def check_plugin(model): if an_important_precondition_is_not_met: # raise SkipTest('I can not run this test') run_a_test() assert thing_is_ok() # def test_all_found_plugins(): """Integrat…

numpy and testing

Numpy has some great builtins for running unit tests under numpy.testing. There are routines for comparing arrays that come in handy for tests. Many of these have verbose settings that give useful details in a thoughtful format for chasing down why a test failed.

Making docopt and Sphinx work together

I use docopt to build simple, self-documenting command line interfaces to scripts. I was delighted when sphinx's outdoc features seemed to interpret docopt's options well:

But it is apparent that the formatting of the usage part is a bit off and any option that does not begin with -- or - messes up the options formatting completely.

It turns out that Sphinx has a bunch of plugins for documenting command line options with sophisticated ones able to run the code or analyze the argparse object itself but nothing for docopt. One code sample I found wrote out the command line description in restructured text and then stripped out parts of it to make docopt parse it properly

The real solution is, of course, to write a Sphinx extension for docopt, but I ended up using a slight modification that doesn't confuse docopt and lets SPhinx make decent (not great) autodoc out of it.

My docstring looks like this:

Command line:: Usage: fasta2wg --index=IDX --wg=WG [--fa=FA] [-v] …

Some notes on Python imports and project organization

One majorly annoying thing for Pythonistas is the search path that the import command uses. When I was a beginner and wanted to get cracking I'd simply add whatever I was working on to PYTHONPATH, not wanting to get bogged down in silly details (that's what we were using Python for, right?).

I learned my lesson when I "forked" one of my projects (by duplicating the directory - give me a break, I wasn't born learning version control systems) and spent several hours trying to figure out why changes to a file I was making did not show up when I ran the code - Python, of course, was picking up the old version of the module in the original directory.

My current strategies, in somewhat random order, for project organization and testing are:

Project organization
ProjectName |--------->Readme |---------> |--------->docs |--------->mainmodule | |-------> (empty) | |-------> root level modules …

Use numpydoc + sphinx

An ideal code documentation system should allow you to write documentation once (when you are writing the code) and then allow you to display the documentation in different contexts, such as project manuals, inline help, command line help and so on. You shouldn't need to duplicate docs - it is a waste of effort and leads to errors when the code is updated. The documentation you write should be easily readable by human users as they peruse the code, as well as by users as they run your code.

Sphinx is an awesome documentation system for Python. Sphinx can take descriptions in docstrings and embed them in documentation so that we can approach the goals of an ideal documentation system.

However, Sphinx violates the human readability part when it comes to function parameter descriptions. Take the following function, for example.

def _repeat_sequence(seq_len=100, subseq=None, subseq_len=10, base_sel_rng=None, alphabet=['A', 'C', 'T', 'G']):

Recap: Python list indexing is O(1)

Just in case any of you, hearing about how slow and ponderous Python is, were worried about indexing into a list - it's a O(1) operation:

In [936]: big_list = range(int(1e8)) len(b In [937]: len(big_list) Out[937]: 100000000 In [938]: %timeint big_list[-1] ERROR: Line magic function `%timeint` not found. In [939]: %timeit big_list[-1] 10000000 loops, best of 3: 66.1 ns per loop In [940]: %timeit big_list[0] 10000000 loops, best of 3: 65.2 ns per loop In [941]: small_list = range(1000) In [942]: %timeit small_list[0] 10000000 loops, best of 3: 70.6 ns per loop In [943]: %timeit small_list[-1] 10000000 loops, best of 3: 69.7 ns per loop In [944]: %timeit big_list[10000] 10000000 loops, best of 3: 68.2 ns per loop In [945]: %timeit small_list[100] 10000000 loops, best of 3: 70.2 ns per loop

Mac OS X: Coding in the dark

So, sometimes you need to work at night and not light up the whole room (e.g. when there is a sleeping baby next to you). The dimmest setting on the mac screen is still annoyingly bright, and indeed after a while, begins to hurt my eyes. One native thing to try is "invert colors" under "screen" in "accessibility". It makes the screen look like a negative image. It looks very cool, a bit annoying if you have images, but for text it is a very good counterpart to bright backgrounds during the day.

Python: To format or to concatenate

A while ago a kindly reader pointed out that Python's string .format method is, after all, a function, and carries with it some over head. I have some inner loop code that I could stand to run a little faster and I was looking for a way to speed things up without losing readability. In one part of the loop I was creating a string using the format statement. I wondered if I could speed things up by changing that to a concat.

So I first tested it out on some toy code:

def str_format(size=int(1e6)): for n in range(size): a = 'hi {:d} {:d} {:d}'.format(n, n+1, n+2) return a def str_concat(size=int(1e6)): for n in range(size): a = 'hi ' + str(n) + str(n+1) + str(n+2) return a In [448]: %timeit str_concat() 1 loops, best of 3: 996 ms per loop In [449]: %timeit str_format() 1 loops, best of 3: 1.26 s per loop
So, the plain python concat is faster than the elegant way. This held with one variable or two variables too. This probably has to do with the co…

Doctests or nosetests?

I had a torrid love affair with doctests. It was AMAZING! You mean you can write documentation AND test your code at the same time!? You can write a line of code, put down the answer next to it and the computer WILL CHECK IT FOR YOU?

Well, this went on for several months. Then one morning I woke up next to this:

def handle_variant(variant, f_vseq, f_vseq_pos, strand_no=None): """Write the variant to the mutated sequence and fill out the pos array Some setup >>> import io, vcf, struct >>> def add_GT(v, gt): ... md=vcf.model.make_calldata_tuple('GT')(gt) ... call=vcf.model._Call(v,'sample',md) ... v.samples = [call] ... v._sample_indexes = {'sample': 0} Test with SNP: ignore zygosity >>> ref_seq = 'ACTGACTG'; \ pos = 2; \ ref = 'C'; \ alt = [vcf.model._Substitution('T')]; \ variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, Non…

Python string concatenation

Continuing with our timing and efficiency theme: What is the more efficient way of building up a string?

from cStringIO import StringIO def string_concat(size=int(1e6)): repeat = 'A' st = '' for n in range(size): st += repeat return st def list_concat(size=int(1e6)): repeat = 'A' st = [] for n in range(size): st += repeat return ''.join(st) def stringIO_concat(size=int(1e6)): repeat = 'A' buf = StringIO() for n in range(size): buf.write(repeat) return buf.getvalue()
Turns out the simple way is best:

In [91]: %timeit string_concat() 10 loops, best of 3: 121 ms per loop In [92]: %timeit list_concat() 1 loops, best of 3: 293 ms per loop In [93]: %timeit stringIO_concat() 1 loops, best of 3: 336 ms per loop

numpy arrays and iterators

Is it better to access a numpy array with an iterator or indexing?

import numpy # This is how I've always done it - straight python idiomatic usage def iterate1(m=100000): a = numpy.empty(m) sum = 0 for n in a: sum += n return sum # Didn't know about this one, but it is the recommended one in the docs def iterate2(m=100000): a = numpy.empty(m) sum = 0 for n in numpy.nditer(a): sum += n return sum #This is C-like def loop(m=100000): a = numpy.empty(m) sum = 0 for n in range(a.size): sum += a[n] return sum
And the shootout?

>>> %timeit iterate1() 10 loops, best of 3: 35.4 ms per loop >>> %timeit iterate2() 10 loops, best of 3: 63.7 ms per loop >>> %timeit loop() 10 loops, best of 3: 45.1 ms per loop >>> %timeit iterate1(1e6) 1 loops, best of 3: 368 ms per loop >>> %timeit iterate2(1e6) 1 loops, best of 3: 605 ms per loop >>> %timeit loop(1e6) 1 loops, best of 3: 443 ms per loop
So the…

HDF5 is not for fast access

HDF5 is a good solution for storing large datasets on disk. Python's h5py library makes it possible to pretend that data stored on disk is just like an in memory array. It is important to keep in mind that the data is really stored on disk and is read in every time a slice or index into the data is taken.

import numpy import h5py def create_data(length=1e4): data = numpy.random.rand(length) with h5py.File('test.h5', 'w') as fp: fp.create_dataset('test', data=data) return data def access_each_h5(): y = 0 with h5py.File('test.h5', 'r') as fp: for n in range(fp['test'].size): y += fp['test'][n] return y def access_each_array(data): y = 0 for n in range(data.size): y += data[n] return y d = create_data() >>> run >>> %timeit access_each_array(d) 100 loops, best of 3: 4.14 ms per loop >>> %timeit access_each_h5() 1 loops, best of 3: 1.9 s per loop That so…