Tuesday, July 29, 2014

Nose and test generators

In my code I have a directory of small python modules that act as plugins. They can be called by a main program as directed. Each plugin, therefore, has a standard interface and needs to work with the main program. I wanted to setup an automated test that would run the main program, load each plugin in turn and run to make sure there was no crash. It turns out that nose has a very elegant way of doing this.

My test module for this turns out to be (with a little paraphrasing and removal of details):

from nose.plugins.skip import SkipTest


def find_plugin_models():
  return a_list_of_the_modules_loaded

def check_plugin(model):
  if an_important_precondition_is_not_met:
    #http://stackoverflow.com/questions/1120148/disabling-python-nosetests
    raise SkipTest('I can not run this test')
  run_a_test()
  assert thing_is_ok()


#http://stackoverflow.com/questions/19071601/how-do-i-run-multiple-python-test-cases-in-a-loop
def test_all_found_plugins():
  """Integration test on automatically found mutation plugin"""
  for model in find_plugin_models():
    check_plugin.description = 'Testing {:s} automatically'.format(model.__name__)
    yield check_plugin, model

Note the cool ability to set a .description attribute that allows you to print what you like when the test runs.

Friday, July 25, 2014

numpy and testing

Numpy has some great builtins for running unit tests under
numpy.testing
. There are routines for comparing arrays that come in handy for tests. Many of these have verbose settings that give useful details in a thoughtful format for chasing down why a test failed.

Making docopt and Sphinx work together

I use docopt to build simple, self-documenting command line interfaces to scripts. I was delighted when sphinx's outdoc features seemed to interpret docopt's options well:

But it is apparent that the formatting of the usage part is a bit off and any option that does not begin with -- or - messes up the options formatting completely.

It turns out that Sphinx has a bunch of plugins for documenting command line options with sophisticated ones able to run the code or analyze the argparse object itself but nothing for docopt. One code sample I found wrote out the command line description in restructured text and then stripped out parts of it to make docopt parse it properly

The real solution is, of course, to write a Sphinx extension for docopt, but I ended up using a slight modification that doesn't confuse docopt and lets SPhinx make decent (not great) autodoc out of it.

My docstring looks like this:

Command line::

  Usage:
    fasta2wg  --index=IDX  --wg=WG  [--fa=FA] [-v]
    fasta2wg  describe  --wg=WG

  Options:
    --index=IDX       Index file (.json) listing fasta files to be inserted into whole genome
    --wg=WG           Name of whole genome file
    --fa=FA           If set, will also dump a fa.gz file (by simply concatenating the files together) for use by BWA
    -v                Be verbose when you do things
    describe          Take the indicated whole genome file as input and print a summary of what it contains

And the Sphinx documentation looks like:



Wednesday, July 23, 2014

Some notes on Python imports and project organization

One majorly annoying thing for Pythonistas is the search path that the import command uses. When I was a beginner and wanted to get cracking I'd simply add whatever I was working on to PYTHONPATH, not wanting to get bogged down in silly details (that's what we were using Python for, right?).

I learned my lesson when I "forked" one of my projects (by duplicating the directory - give me a break, I wasn't born learning version control systems) and spent several hours trying to figure out why changes to a file I was making did not show up when I ran the code - Python, of course, was picking up the old version of the module in the original directory.

My current strategies, in somewhat random order, for project organization and testing are:

Project organization


ProjectName
   |--------->Readme
   |--------->setup.py
   |--------->docs
   |--------->mainmodule
   |             |------->__init__.py  (empty)
   |             |-------> root level modules 
   |             |-------> submodule1
   |             |    |---------------> __init__.py (empty)
   |             |    |----> sub modules
   |             |-------> submodule2
   |                  |---------------> __init__.py (empty)
   |                  |----> sub modules
   |
   |--------->tests
                 |------->__init__.py  import sys; sys.path.append('../mainmodule')
                 |-------> root level module tests 
                 |-------> submodule1
                 |    |---------------> __init__.py (empty)
                 |    |----> sub module tests
                 |-------> submodule2
                      |---------------> __init__.py (empty)
                      |----> sub module tests

An important lesson is a) not to forget the __init__.py files and b) to add relevant paths to sys.path in __init__.py when needed.

Using this directory structure allows me to use sphinx for the documentation (the source root goes in conf.py in docs) and allows me to invoke nosetests by doing

nosetests tests

and the path appending going on in tests/__init__.py allows me to import any part of the module in any test script just as I would from a different application after the application has been installed using setup.py

Tuesday, July 22, 2014

Use numpydoc + sphinx

An ideal code documentation system should allow you to write documentation once (when you are writing the code) and then allow you to display the documentation in different contexts, such as project manuals, inline help, command line help and so on. You shouldn't need to duplicate docs - it is a waste of effort and leads to errors when the code is updated. The documentation you write should be easily readable by human users as they peruse the code, as well as by users as they run your code.

Sphinx is an awesome documentation system for Python. Sphinx can take descriptions in docstrings and embed them in documentation so that we can approach the goals of an ideal documentation system.

However, Sphinx violates the human readability part when it comes to function parameter descriptions. Take the following function, for example.


def _repeat_sequence(seq_len=100, subseq=None, subseq_len=10, base_sel_rng=None, alphabet=['A', 'C', 'T', 'G']):
"""
Create a sequence by repeating a sub-sequence
"""
subseq = base_sel_rng.choice(alphabet, size=subseq_len, replace=True, p=[.3, .2, .2, .3]).tostring()
return subseq * (seq_len / subseq_len) + subseq[:seq_len % subseq_len]


The Sphinx-parsable way of writing the docstring is:

def _repeat_sequence(seq_len=100, subseq=None, subseq_len=10, base_sel_rng=None, alphabet=['A', 'C', 'T', 'G']):
  """Create a sequence by repeating a sub-sequence

  :param seq_len: Length of sequence.
  :type seq_len: int
  :param subseq: Sub-sequence to use as repeat block. Omit to generate a random sub-sequence.
  :type subseq: str
  :param subseq_len: If subseq is omitted this must be provided to indicate desired length of random sub-sequence
  :type subseq_len: Length of random sub-sequence
  :param base_sel_rng: Random number generator e.g. numpy.random
  :param alphabet: List of characters constituting the alphabet
  :type alphabet: list

  :returns:  str -- the sequence.
  """
  subseq = base_sel_rng.choice(alphabet, size=subseq_len, replace=True, p=[.3, .2, .2, .3]).tostring()
  return subseq * (seq_len / subseq_len) + subseq[:seq_len % subseq_len]

It does not matter that the Sphinx output is well formatted: for humans it's rather yucky to look at which defeats the purpose of docstrings.

It turns out that, as is common in Python, this problem has been noted and rectified. In this case the correction comes from the great folks at numpy who have a nice extension called numpydoc. The numpydoc version of this docstring is:

def repeat_sequence(seq_len=100, subseq=None, subseq_len=10, base_sel_rng=None, alphabet=['A', 'C', 'T', 'G']):
  """Create a sequence by repeating a sub-sequence

  Parameters
  ----------
  seq_len : int
                              Length of sequence.
  subseq : str, optional
                              Sub-sequence to use as repeat block. Omit to generate a random sub-sequence.
  subseq_len : int, optional
                              If subseq is omitted this must be provided to indicate desired length of random
                              sub-sequence
  subseq_len : int
                              Length of random sub-sequence
  base_sel_rng : object
                              Random number generator e.g. numpy.random
  alphabet : list, optional
                              List of characters constituting the alphabet

  Returns
  -------
  str
      The sequence.

  .. note:: This is meant to be used internally

  """
  subseq = base_sel_rng.choice(alphabet, size=subseq_len, replace=True, p=[.3, .2, .2, .3]).tostring()
  return subseq * (seq_len / subseq_len) + subseq[:seq_len % subseq_len]

Note: The only glitchy thing is that when you add numpydoc to the list of extensions in Sphinx's conf.py it can't be ahead of standard sphinx extensions - the order matters. For example, my extensions list is:

extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.doctest',
    'sphinx.ext.todo',
    'sphinx.ext.coverage',
    'sphinx.ext.pngmath',
    'sphinx.ext.mathjax',
    'sphinx.ext.viewcode',
    'numpydoc.numpydoc'
]

Putting numpydoc first leads to the error:
Extension error:
Unknown event name: autodoc-process-docstring

Thursday, July 10, 2014

Recap: Python list indexing is O(1)

Just in case any of you, hearing about how slow and ponderous Python is, were worried about indexing into a list - it's a O(1) operation:

In [936]: big_list = range(int(1e8))
len(b
In [937]: len(big_list)
Out[937]: 100000000

In [938]: %timeint big_list[-1]
ERROR: Line magic function `%timeint` not found.

In [939]: %timeit big_list[-1]
10000000 loops, best of 3: 66.1 ns per loop

In [940]: %timeit big_list[0]
10000000 loops, best of 3: 65.2 ns per loop

In [941]: small_list = range(1000)

In [942]: %timeit small_list[0]
10000000 loops, best of 3: 70.6 ns per loop

In [943]: %timeit small_list[-1]
10000000 loops, best of 3: 69.7 ns per loop

In [944]: %timeit big_list[10000]
10000000 loops, best of 3: 68.2 ns per loop

In [945]: %timeit small_list[100]
10000000 loops, best of 3: 70.2 ns per loop

Wednesday, July 9, 2014

Mac OS X: Coding in the dark

So, sometimes you need to work at night and not light up the whole room (e.g. when there is a sleeping baby next to you). The dimmest setting on the mac screen is still annoyingly bright, and indeed after a while, begins to hurt my eyes. One native thing to try is "invert colors" under "screen" in "accessibility". It makes the screen look like a negative image. It looks very cool, a bit annoying if you have images, but for text it is a very good counterpart to bright backgrounds during the day.

Monday, July 7, 2014

Python: To format or to concatenate

A while ago a kindly reader pointed out that Python's string .format method is, after all, a function, and carries with it some over head. I have some inner loop code that I could stand to run a little faster and I was looking for a way to speed things up without losing readability. In one part of the loop I was creating a string using the format statement. I wondered if I could speed things up by changing that to a concat.

So I first tested it out on some toy code:

def str_format(size=int(1e6)):
  for n in range(size):
    a = 'hi {:d} {:d} {:d}'.format(n, n+1, n+2)
  return a


def str_concat(size=int(1e6)):
  for n in range(size):
    a = 'hi ' + str(n) + str(n+1) + str(n+2)
  return a

In [448]: %timeit str_concat()
1 loops, best of 3: 996 ms per loop

In [449]: %timeit str_format()
1 loops, best of 3: 1.26 s per loop

So, the plain python concat is faster than the elegant way. This held with one variable or two variables too. This probably has to do with the complexity of the .format function, which offsets even the extra calls to str







Sunday, July 6, 2014

Doctests or nosetests?

I had a torrid love affair with doctests. It was AMAZING! You mean you can write documentation AND test your code at the same time!? You can write a line of code, put down the answer next to it and the computer WILL CHECK IT FOR YOU?

Well, this went on for several months. Then one morning I woke up next to this:

def handle_variant(variant, f_vseq, f_vseq_pos, strand_no=None):
  """Write the variant to the mutated sequence and fill out the pos array

  Some setup
  >>> import io, vcf, struct
  >>> def add_GT(v, gt):
  ...   md=vcf.model.make_calldata_tuple('GT')(gt)
  ...   call=vcf.model._Call(v,'sample',md)
  ...   v.samples = [call]
  ...   v._sample_indexes = {'sample': 0}

  Test with SNP: ignore zygosity
  >>> ref_seq = 'ACTGACTG'; \
  pos = 2; \
  ref = 'C'; \
  alt = [vcf.model._Substitution('T')]; \
  variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, None, None, None); \
  add_GT(variant, '1/1'); \
  f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos)
  2
  >>> _ = f_vseq.seek(0); print f_vseq.read()
  T
  >>> _ = f_vseq_pos.seek(0); print struct.unpack('I',f_vseq_pos.read(4))
  (2,)

  Test with SNP: homo, strand 0
  >>> f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos, 0)
  2
  >>> _ = f_vseq.seek(0); print f_vseq.read()
  T
  >>> _ = f_vseq_pos.seek(0); print struct.unpack('I',f_vseq_pos.read(4))
  (2,)


  Test with SNP: homo, strand 1
  >>> f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos, 1)
  2
  >>> _ = f_vseq.seek(0); print f_vseq.read()
  T
  >>> _ = f_vseq_pos.seek(0); print struct.unpack('I',f_vseq_pos.read(4))
  (2,)



  Test with SNP: het, strand 0
  >>> add_GT(variant, '0/1'); \
  f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos, 0)
  1
  >>> _ = f_vseq.seek(0); print f_vseq.read()
  
  >>> _ = f_vseq_pos.seek(0,2); print f_vseq.tell()
  0

  Test with SNP: het, strand 1
  >>> f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos, 1)
  2
  >>> _ = f_vseq.seek(0); print f_vseq.read()
  T
  >>> _ = f_vseq_pos.seek(0); print struct.unpack('I',f_vseq_pos.read(4))
  (2,)


  Test with delete: ignore zygosity
  >>> ref_seq = 'ACTGACTG'; \
  pos = 2; \
  ref = 'CTG'; \
  alt = [vcf.model._Substitution('C')]; \
  variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, None, None, None); \
  add_GT(variant, '1/0'); \
  f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos)
  4
  >>> _ = f_vseq.seek(0); print f_vseq.read()
  C
  >>> _ = f_vseq_pos.seek(0); print struct.unpack('I',f_vseq_pos.read(4))
  (2,)

  Test with same delete, strand 1 (same as REF)
  >>> f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos, 1)
  1
  >>> _ = f_vseq.seek(0); print f_vseq.read()
  
  >>> _ = f_vseq_pos.seek(0,2); print f_vseq.tell()
  0


  Test with delete and .
  >>> ref_seq = 'ACTGACTG'; \
  pos = 8; \
  ref = 'G'; \
  alt = [vcf.model._Substitution('.')]; \
  variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, None, None, None); \
  f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos)
  8
  >>> _ = f_vseq.seek(0); print f_vseq.read()
  
  >>> _ = f_vseq_pos.seek(0,2); print f_vseq.tell()
  0


  Test with delete and .
  >>> ref_seq = 'ACTGACTG'; \
  pos = 4; \
  ref = 'GACTG'; \
  alt = [vcf.model._Substitution('.')]; \
  variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, None, None, None); \
  f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos)
  8
  >>> _ = f_vseq.seek(0); print f_vseq.read()
  
  >>> _ = f_vseq_pos.seek(0,2); print f_vseq.tell()
  0


  Test with insert
  >>> ref_seq = 'ACTGACTG'; \
  pos = 2; \
  ref = 'C'; \
  alt = [vcf.model._Substitution('CGGG')]; \
  variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, None, None, None); \
  f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos)
  2
  >>> _ = f_vseq.seek(0); print f_vseq.read()
  CGGG
  >>> _ = f_vseq_pos.seek(0); print struct.unpack('4I',f_vseq_pos.read(4*4))
  (2, 3, 3, 3)


  Test with insert
  >>> ref_seq = 'ACTGACTG'; \
  pos = 8; \
  ref = 'G'; \
  alt = [vcf.model._Substitution('GTTT')]; \
  variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, None, None, None); \
  f_vseq = io.BytesIO(); \
  f_vseq_pos = io.BytesIO(); \
  print handle_variant(variant, f_vseq, f_vseq_pos)
  8
  >>> _ = f_vseq.seek(0); print f_vseq.read()
  GTTT
  >>> _ = f_vseq_pos.seek(0); print struct.unpack('4I',f_vseq_pos.read(4*4))
  (8, 9, 9, 9)

  See Readme.md for details of algorithm
  """
  var_type = '1' if strand_no is None else variant.genotype('sample').data.GT.split('/')[strand_no]
  # 0 means REF 1 means ALT. If we don't specify a strand number it means we don't care about Zygosity and will always
  # take the ALT
  if var_type == '1':
    alt = variant.ALT[0].sequence
    if alt == '.': alt = ''
    ref = variant.REF
    if ref == '.': ref = ''
    f_vseq.write(alt)    # Copy over ALT
    new_ref_pos = len(ref) + variant.POS - 1  # Advance along ref_seq
                                              # -1 because POS is 1-indexed, we are 0-indexed internally
    if len(alt) > 0:
      if len(ref) > 0:
        f_vseq_pos.write(struct.pack('I', variant.POS))  # The original base
      if len(alt) > 1:
        f_vseq_pos.write(struct.pack('{:d}I'.format(len(alt) - len(ref)), *[new_ref_pos + 1] * (len(alt) - len(ref))))
  else:
    new_ref_pos = variant.POS - 1  # Keep us here - we didn't implement this variant and we should keep copying
                                   # -1 because POS is 1-indexed, we are 0-indexed internally
  return new_ref_pos

There were several functions like this. I was refactoring this code and when I would work on these functions I would spend minutes trying to find the actual body of the function, often puzzling over what turned out to be a doctest and not the actual code. I had been warned about this by well meaning relatives.

Doctests are awesome when you can combine a test and example and it is pithy. When you need to setup complex data structures or you need to test lots of use cases, it is better to put tests in a separate file, in a separate directory and run them with nose.

Doctests can still be used in the documentation as examples of using functions with the added benefit that the examples are kept upto date by being tested.







Saturday, July 5, 2014

Python string concatenation

Continuing with our timing and efficiency theme: What is the more efficient way of building up a string?

from cStringIO import StringIO

def string_concat(size=int(1e6)):
  repeat = 'A'
  st = ''
  for n in range(size):
    st += repeat
  return st

def list_concat(size=int(1e6)):
  repeat = 'A'
  st = []
  for n in range(size):
    st += repeat
  return ''.join(st)

def stringIO_concat(size=int(1e6)):
  repeat = 'A'
  buf = StringIO()
  for n in range(size):
    buf.write(repeat)
  return buf.getvalue()

Turns out the simple way is best:

In [91]: %timeit string_concat()
10 loops, best of 3: 121 ms per loop

In [92]: %timeit list_concat()
1 loops, best of 3: 293 ms per loop

In [93]: %timeit stringIO_concat()
1 loops, best of 3: 336 ms per loop

Friday, July 4, 2014

numpy arrays and iterators

Is it better to access a numpy array with an iterator or indexing?

import numpy

# This is how I've always done it - straight python idiomatic usage
def iterate1(m=100000):
  a = numpy.empty(m)
  sum = 0
  for n in a:
    sum += n
  return sum

# Didn't know about this one, but it is the recommended one in the docs
def iterate2(m=100000):
  a = numpy.empty(m)
  sum = 0
  for n in numpy.nditer(a):
    sum += n
  return sum

#This is C-like 
def loop(m=100000):
  a = numpy.empty(m)
  sum = 0
  for n in range(a.size):
    sum += a[n]
  return sum

And the shootout?

>>> %timeit iterate1()
10 loops, best of 3: 35.4 ms per loop
>>> %timeit iterate2()
10 loops, best of 3: 63.7 ms per loop
>>> %timeit loop()
10 loops, best of 3: 45.1 ms per loop


>>> %timeit iterate1(1e6)
1 loops, best of 3: 368 ms per loop
>>> %timeit iterate2(1e6)
1 loops, best of 3: 605 ms per loop
>>> %timeit loop(1e6)
1 loops, best of 3: 443 ms per loop

So the pythonic one seems to be the best

Thursday, July 3, 2014

HDF5 is not for fast access

HDF5 is a good solution for storing large datasets on disk. Python's h5py library makes it possible to pretend that data stored on disk is just like an in memory array. It is important to keep in mind that the data is really stored on disk and is read in every time a slice or index into the data is taken.

import numpy
import h5py


def create_data(length=1e4):
  data = numpy.random.rand(length)
  with h5py.File('test.h5', 'w') as fp:
    fp.create_dataset('test', data=data)
  return data


def access_each_h5():
  y = 0
  with h5py.File('test.h5', 'r') as fp:
    for n in range(fp['test'].size):
      y += fp['test'][n]
  return y

def access_each_array(data):
  y = 0
  for n in range(data.size):
    y += data[n]
  return y


d = create_data()

>>> run test.py
>>> %timeit access_each_array(d)
100 loops, best of 3: 4.14 ms per loop
>>> %timeit access_each_h5()
1 loops, best of 3: 1.9 s per loop
That sobering difference in performance reminds us that we can't - performance wise - equate the two. When processing data from an hdf5 file, it is best to read in as large chunks as your memory will allow and do the heavy lifting in memory.