Skip to main content

Posts

Showing posts from 2014

Retrieve a single file from a git repo.

The answer to this was more difficult to find than it should have been. The best answer is on stackoverflow here.

In short, use the git archive command:

git archive --remote=git@github.com:foo/bar.git --prefix=path/to/ HEAD:path/to/ | tar xvf -

For this to work the owner of the git server needs to have enabled upload-archive (git config daemon.uploadarch true)You need tar -xvf because the output is in tar form

Using user classes with Python's set operations

Python has an efficient, and built in, implementation of set operations. With a tiny bit of work you can make your user defined classes work nicely with set operations (as well as dictionaries).

Say you have a class:

class A: def __init__(self, x, s): self.x, self.s = x, s
Python actually doesn't have a way of knowing if the objects are the same or not:

a1 = A(1, 'Hi') a2 = A(1, 'Hi') a = set([a1, a2]) -> a1, a2
and by default assumes they are different.

All you have have to do, for things to work properly, is to define a hash function (__hash__()) and an equality (__eq__()) operator for your class. The two functions need to be congruent i.e. if the hash value for two objects is the same, they should also register as equal. It is possible to define the functions differently but this will lead to funny things happening with set operations.

The hash value is an integer. If your class has several attributes that you consider important for distinguishing betwe…

Sphinx + Cython

It's pretty awesome that Sphinx will work with Cython code files (.pyx) but there are currently a few wrinkles:
You need to rerun the cython compiler before the documentation changes take since sphinx reads this off the compiled libraryFunction signatures are omitted in the documentation and the compiler directive # cython: embedsignature=True does not work either. The workaround is to repeat the signature as the first line of the docstringdef merge_variants(c1, c2): """ merge_variants(c1, c2) ... rest of docstring ... """

A small caution when using virtualenvs

Virtualenvs are a great way to test and develop code in insulated containers. My main use case is to have an insulated environment where I can mock install a package I'm developing and ensure that I've taken care of all the dependencies so that a stranger can use the package as is. The virtualenvwrapper package is a great utility that simplifies managing multiple virtual environments.

One caution I have to observe is that packages installed out side the virtual environment can interfere in ways that make behaviors inside the virtual env very mysterious. For example, before I started using virtual environments seriously I had installed the nose and tox modules in my base python install. A month or so afterwards I had created a new test environment and was doing pip install -e . to test whether a package I was writing would install correctly on a fresh environment.

Everything installed fine, including an external package A my code needed. But, when I went to run nosetests or t…

git merge

I'm so used to git merge doing the right thing re: merging files - and possibly used to working mostly by myself - that when git merge fails I always expect something messy has happened. However, just now, I got a merge complain that amounted to this:

from setuptools import setup, find_packages setup( ... some things ... <<<<<<< HEAD ext_modules=cythonize('lib/*.pyx'), entry_points={ # Register the built in plugins .... # Command line scripts .... } ) ======= install_requires= .... cython_ext='lib/*.pyx' ) >>>>>>> origin/fix_setup
I was surprised since the changes I made and my colleague made were simple and non-overlapping. They just need to come in sequence.

I took a look at the git merge documentation which said

When both sides made changes to the same area, however, Git cannot randomly pick one side over the other, and asks you to resolve i…

An annoying thing with Python slices

You of course know that Python slices are awesome:

a = 'ABCDEFG' a[:3] -> 'ABC' a[2:5] -> 'CDE'
And more interestingly:

a[-3:] -> 'EFG'
and

a[6:4:-1] -> 'GF'
But you can see that the reverse slicing is starting to stretch the fence-post we are familiar with. Python uses zero based, inclusive-exclusive indexing. This corresponds to a C syntax of (for i = n; i < m; i++). When you reverse it the slice goes (for i = m - 1; i > n - 1; i--).

As you can imagine this starts to get ugly and at one point it gets to be wrong:

Say, as is often the case, you are not taking static, pre-determined slices but rather slices determined at runtime. Say you are taking slices between n and m or [n, m).

The forward slice is a[n:m]
The backward slice is a[m-1:n-1:-1] right? Because of the fence posts?

Well yes, except what happens when n = 0? The forward slice is fine but the reverse slice resolves to a[m-1:-1:-1]

This is where Python becomes a littl…

Mac OS + 'cat' + 'sed' + \n = half-assed

You guys all know how Mac OS darwin does everything JUST a little differently from the *nixes. It's close enough to draw you in, and different enough to stab you in the back. Today's case sed and \n

I needed to cat some files together but I needed a newline between them. I asked my colleague Wan-Ping for a command that would do this and she suggested

cat 1.fa 2.fa 3.fa | sed 's/^>/\n>/g'
So I did this and the sucker added the character 'n' wherever I expected a newline.

It turns out that Macs are special little snow flakes and need a special little command:

cat chr*.fa | sed 's/^>/\'$'\n>/g' > hg38.fa
The magic sauce is the '$' that escapes the '\n'

Whatever.






Fixing the big mess with git and case insensitive filesystems

Mac OS X by default is a case insensitive file system. But Mac OS, as in a lot of other things, makes a half-assed job of this. In addition to causing various bits of confusion when creating directories it also leads to a potentially messy situation with git. This is how things happen:

1. You create a directory in your source tree called, say, Plugins with a capital "P".
2. After a few commits you decide that it's better to change this to a lower case "p": plugins
3. When you go to commit this rename (perhaps with a few other changes you implemented) git throws a hissy fit.

After a bit of searching on stack overflow it turns out that this is all related to Mac OS's case-insensitivity.

The cleanest fix I found on stack overflow was:

git mv Plugins temp00 git mv temp00 plugins git commit
Apparently this fools git's index into doing the change, where as git mv Plugins plugins - because the underlying file system does not recognize the difference - tells gi…

A funny thing with Python __getitem__()

So, Python has the 'in' operator which we are encouraged to use to test membership of a key in a dictionary, for example. Like

a = {'A': 22, 'B': 45} if 'A' in a: print 'Yes' --> 'Yes'
Python has a index/slice/key notation which consists of square brackets ([]):
a['A'] --> 22
Because Python is so awesome it's slice notation is available to us for any classes we define.
class A(): def __getitem__(self, item): print item return item a = A() a[23] --> 23 b['hello'] --> 'hello'
(In our real class, of course, we would be doing something interesting in the __getitem__ method, like returning a data structure indexed by the key)

What happens when we put the two together, you know, use the in idiom with a user defined class?

It's so awesome, it needs its own screencast.


So Python goes through indexes sequentially testing membership! Returning None does not help either. You have to retur…

Big numbers in Python

Numbers in Python can get arbitrarily large. This is pretty fun for things like doing factorials

>>> def fac(n): return reduce(lambda x,y: x*y, range(1,n+1)) ... >>> fac(1) 1 >>> fac(2) 2 >>> fac(3) 6 >>> fac(4) 24 >>> fac(5) 120 >>> fac(500) 12201368259911100687012387854230469262535743428031928421924135883858453731538819976054964475022032818630136164771482035841633787220781772004807852051593292854779075719393306037729608590862704291745478824249127263443056701732707694610628023104526442188787894657547771498634943677810376442740338273653974713864778784954384895955375379904232410612713269843277457155463099772027810145610811883737095310163563244329870295638966289116589747695720879269288712817800702651745077684107196243903943225364226052349458501299185715012487069615681416253590566934238130088562492468915641267756544818865065938479517753608940057452389403357984763639449053130623237490664450488246650759467358620746379251842004…

Numpy: a note on slicing efficiency

In my application I have three arrays of the same length (and datatype) (say p1, p2 and p3) that have a one to one correspondence with each other. My code works on the arrays in blocks and has to concatenate fragments of these three arrays together. In the rest of my code I take matched slices through these arrays (say p1[0:100], p2[0:100], p3[0:100]) and pass them onto functions for processing,

My first instinct was to create a two-dimensional array since this encapsulates the data nicely, even though the data do not, semantically, constitute a matrix. It was at this point I came across a fact that I had not considered before: taking slices from a 2D array is more time expensive than taking slices from a 1D array.

For example:

In [397]: %%timeit p = numpy.empty((3, 1e6)) .....: x = p[1, 0:2] .....: 100000 loops, best of 3: 2.18 ┬Ás per loop
Vs

In [398]: %%timeit p = numpy.empty(1e6) .....: x = p[0:2] .....: 1000000 loops, best of 3: 512 ns per loop
Row-column shape does …

Python blist.sortedlist

The blist package is pretty good. The documentation and care the author has put in is very impressive. It's a shame that it wasn't included in the standard Python distribution, and the reasons given seem to indicate to me that there are not enough people working on large datasets on the Python committee. Any how, I wanted to try out for myself how the sorted list stacks up against a regular python list that is sorted every time after and insertion:


As can be seen, blist.sorted insertion is O(1) and compares quite favorably to the regular Python list.

Timing code follows:

import pylab from timeit import timeit l = [1,2,3,4,5,6,7,8,9,10] def time_things(): n_loops = 1000 s = [10, 100, 1000, 10000, 100000] blist_time = [timeit("a.add(100)", "import blist; a=blist.sortedlist(range({:d},0,-1))".format(n), number=n_loops)/n_loops for n in s] list_time = [timeit("a.append(100); a.sort()", "a=range({:d})".format(n), number=n_loops)/n_lo…

bcolz, numpy and plain Python

So we have these big data sets that people say are the code of life. In a simple representation they add up to about 3 GB of space. While it is quite possible to load up the whole human genome into RAM and have capacity to spare, I was wondering if there were painless ways to access compressed forms of the data. This lead me to Mikhail Korobov's blog and to bcolz. Here's a quick shoot out between plain Python strings, numpy and bcolz as used to store and access human chromosome 11 which is about 135 million base pairs long:




My take away, for my particular application is that I will stick to plain old Pythons strings - I tend to take short slices of the data (about 100 basepairs long) and that puts me in the range where Python strings are faster than numpy arrays. Memory is cheap.

Code follows:

"""Access times for single characters along the length of the string and for 100 item slices along the length of the string.""" import pylab import numpy imp…

Nosetests and pdb

I sat there looking at a testing error and thought "Can't we automatically invoke the debugger when a test fails". This is Python, of course someone has a solution for that.

nosetests --pdb
will invoke the debugger .when one of your tests goes south.

Nose and test generators

In my code I have a directory of small python modules that act as plugins. They can be called by a main program as directed. Each plugin, therefore, has a standard interface and needs to work with the main program. I wanted to setup an automated test that would run the main program, load each plugin in turn and run to make sure there was no crash. It turns out that nose has a very elegant way of doing this.

My test module for this turns out to be (with a little paraphrasing and removal of details):

from nose.plugins.skip import SkipTest def find_plugin_models(): return a_list_of_the_modules_loaded def check_plugin(model): if an_important_precondition_is_not_met: #http://stackoverflow.com/questions/1120148/disabling-python-nosetests raise SkipTest('I can not run this test') run_a_test() assert thing_is_ok() #http://stackoverflow.com/questions/19071601/how-do-i-run-multiple-python-test-cases-in-a-loop def test_all_found_plugins(): """Integrat…

numpy and testing

Numpy has some great builtins for running unit tests under numpy.testing. There are routines for comparing arrays that come in handy for tests. Many of these have verbose settings that give useful details in a thoughtful format for chasing down why a test failed.

Making docopt and Sphinx work together

I use docopt to build simple, self-documenting command line interfaces to scripts. I was delighted when sphinx's outdoc features seemed to interpret docopt's options well:

But it is apparent that the formatting of the usage part is a bit off and any option that does not begin with -- or - messes up the options formatting completely.

It turns out that Sphinx has a bunch of plugins for documenting command line options with sophisticated ones able to run the code or analyze the argparse object itself but nothing for docopt. One code sample I found wrote out the command line description in restructured text and then stripped out parts of it to make docopt parse it properly

The real solution is, of course, to write a Sphinx extension for docopt, but I ended up using a slight modification that doesn't confuse docopt and lets SPhinx make decent (not great) autodoc out of it.

My docstring looks like this:

Command line:: Usage: fasta2wg --index=IDX --wg=WG [--fa=FA] [-v] …

Some notes on Python imports and project organization

One majorly annoying thing for Pythonistas is the search path that the import command uses. When I was a beginner and wanted to get cracking I'd simply add whatever I was working on to PYTHONPATH, not wanting to get bogged down in silly details (that's what we were using Python for, right?).

I learned my lesson when I "forked" one of my projects (by duplicating the directory - give me a break, I wasn't born learning version control systems) and spent several hours trying to figure out why changes to a file I was making did not show up when I ran the code - Python, of course, was picking up the old version of the module in the original directory.

My current strategies, in somewhat random order, for project organization and testing are:

Project organization
ProjectName |--------->Readme |--------->setup.py |--------->docs |--------->mainmodule | |------->__init__.py (empty) | |-------> root level modules …

Use numpydoc + sphinx

An ideal code documentation system should allow you to write documentation once (when you are writing the code) and then allow you to display the documentation in different contexts, such as project manuals, inline help, command line help and so on. You shouldn't need to duplicate docs - it is a waste of effort and leads to errors when the code is updated. The documentation you write should be easily readable by human users as they peruse the code, as well as by users as they run your code.

Sphinx is an awesome documentation system for Python. Sphinx can take descriptions in docstrings and embed them in documentation so that we can approach the goals of an ideal documentation system.

However, Sphinx violates the human readability part when it comes to function parameter descriptions. Take the following function, for example.


def _repeat_sequence(seq_len=100, subseq=None, subseq_len=10, base_sel_rng=None, alphabet=['A', 'C', 'T', 'G']):
""…

Recap: Python list indexing is O(1)

Just in case any of you, hearing about how slow and ponderous Python is, were worried about indexing into a list - it's a O(1) operation:

In [936]: big_list = range(int(1e8)) len(b In [937]: len(big_list) Out[937]: 100000000 In [938]: %timeint big_list[-1] ERROR: Line magic function `%timeint` not found. In [939]: %timeit big_list[-1] 10000000 loops, best of 3: 66.1 ns per loop In [940]: %timeit big_list[0] 10000000 loops, best of 3: 65.2 ns per loop In [941]: small_list = range(1000) In [942]: %timeit small_list[0] 10000000 loops, best of 3: 70.6 ns per loop In [943]: %timeit small_list[-1] 10000000 loops, best of 3: 69.7 ns per loop In [944]: %timeit big_list[10000] 10000000 loops, best of 3: 68.2 ns per loop In [945]: %timeit small_list[100] 10000000 loops, best of 3: 70.2 ns per loop

Mac OS X: Coding in the dark

So, sometimes you need to work at night and not light up the whole room (e.g. when there is a sleeping baby next to you). The dimmest setting on the mac screen is still annoyingly bright, and indeed after a while, begins to hurt my eyes. One native thing to try is "invert colors" under "screen" in "accessibility". It makes the screen look like a negative image. It looks very cool, a bit annoying if you have images, but for text it is a very good counterpart to bright backgrounds during the day.

Python: To format or to concatenate

A while ago a kindly reader pointed out that Python's string .format method is, after all, a function, and carries with it some over head. I have some inner loop code that I could stand to run a little faster and I was looking for a way to speed things up without losing readability. In one part of the loop I was creating a string using the format statement. I wondered if I could speed things up by changing that to a concat.

So I first tested it out on some toy code:

def str_format(size=int(1e6)): for n in range(size): a = 'hi {:d} {:d} {:d}'.format(n, n+1, n+2) return a def str_concat(size=int(1e6)): for n in range(size): a = 'hi ' + str(n) + str(n+1) + str(n+2) return a In [448]: %timeit str_concat() 1 loops, best of 3: 996 ms per loop In [449]: %timeit str_format() 1 loops, best of 3: 1.26 s per loop
So, the plain python concat is faster than the elegant way. This held with one variable or two variables too. This probably has to do with the co…

Doctests or nosetests?

I had a torrid love affair with doctests. It was AMAZING! You mean you can write documentation AND test your code at the same time!? You can write a line of code, put down the answer next to it and the computer WILL CHECK IT FOR YOU?

Well, this went on for several months. Then one morning I woke up next to this:

def handle_variant(variant, f_vseq, f_vseq_pos, strand_no=None): """Write the variant to the mutated sequence and fill out the pos array Some setup >>> import io, vcf, struct >>> def add_GT(v, gt): ... md=vcf.model.make_calldata_tuple('GT')(gt) ... call=vcf.model._Call(v,'sample',md) ... v.samples = [call] ... v._sample_indexes = {'sample': 0} Test with SNP: ignore zygosity >>> ref_seq = 'ACTGACTG'; \ pos = 2; \ ref = 'C'; \ alt = [vcf.model._Substitution('T')]; \ variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, Non…

Python string concatenation

Continuing with our timing and efficiency theme: What is the more efficient way of building up a string?

from cStringIO import StringIO def string_concat(size=int(1e6)): repeat = 'A' st = '' for n in range(size): st += repeat return st def list_concat(size=int(1e6)): repeat = 'A' st = [] for n in range(size): st += repeat return ''.join(st) def stringIO_concat(size=int(1e6)): repeat = 'A' buf = StringIO() for n in range(size): buf.write(repeat) return buf.getvalue()
Turns out the simple way is best:

In [91]: %timeit string_concat() 10 loops, best of 3: 121 ms per loop In [92]: %timeit list_concat() 1 loops, best of 3: 293 ms per loop In [93]: %timeit stringIO_concat() 1 loops, best of 3: 336 ms per loop

numpy arrays and iterators

Is it better to access a numpy array with an iterator or indexing?

import numpy # This is how I've always done it - straight python idiomatic usage def iterate1(m=100000): a = numpy.empty(m) sum = 0 for n in a: sum += n return sum # Didn't know about this one, but it is the recommended one in the docs def iterate2(m=100000): a = numpy.empty(m) sum = 0 for n in numpy.nditer(a): sum += n return sum #This is C-like def loop(m=100000): a = numpy.empty(m) sum = 0 for n in range(a.size): sum += a[n] return sum
And the shootout?

>>> %timeit iterate1() 10 loops, best of 3: 35.4 ms per loop >>> %timeit iterate2() 10 loops, best of 3: 63.7 ms per loop >>> %timeit loop() 10 loops, best of 3: 45.1 ms per loop >>> %timeit iterate1(1e6) 1 loops, best of 3: 368 ms per loop >>> %timeit iterate2(1e6) 1 loops, best of 3: 605 ms per loop >>> %timeit loop(1e6) 1 loops, best of 3: 443 ms per loop
So the…

HDF5 is not for fast access

HDF5 is a good solution for storing large datasets on disk. Python's h5py library makes it possible to pretend that data stored on disk is just like an in memory array. It is important to keep in mind that the data is really stored on disk and is read in every time a slice or index into the data is taken.

import numpy import h5py def create_data(length=1e4): data = numpy.random.rand(length) with h5py.File('test.h5', 'w') as fp: fp.create_dataset('test', data=data) return data def access_each_h5(): y = 0 with h5py.File('test.h5', 'r') as fp: for n in range(fp['test'].size): y += fp['test'][n] return y def access_each_array(data): y = 0 for n in range(data.size): y += data[n] return y d = create_data() >>> run test.py >>> %timeit access_each_array(d) 100 loops, best of 3: 4.14 ms per loop >>> %timeit access_each_h5() 1 loops, best of 3: 1.9 s per loop That so…

A note on Python's __exit__() and errors

Python's context managers are a very neat way of handling code that needs a teardown once you are done. Python objects have do have a destructor method (__del__) called right before the last instance of the object is about to be destroyed. You can do a teardown there. However there is a lot of fine print to the __del__ method. A cleaner way of doing tear-downs is through Python's context manager, manifested as the with keyword.

class CrushMe: def __init__(self): self.f = open('test.txt', 'w') def foo(self, a, b): self.f.write(str(a - b)) def __enter__(self): return self def __exit__(self, exc_type, exc_val, exc_tb): self.f.close() return True with CrushMe() as c: c.foo(2, 3)
One thing that is important, and that got me just now, is error handling. I made the mistake of ignoring all those 'junk' arguments (exc_type, exc_val, exc_tb). I just skimmed the docs and what popped out is that you need to return True or False dep…

Feedburner and the fragmentation of google

Wordpress has a plugin for sending your post to twitter or linkedin (and linkedin has a builtin way of sending your updates to twitter) but I didn't see anything like that for blogger (except a G+ button for publicizing to that thriving community). Whenever I did a search I came up with some third party service. Finally I ran into Feedburner, which turns out to be Google's own service for publicizing things on social media. This just led me to an observation that google has so many services and they CAN talk to each other, but there are complicated layers between them.

Python 'or'

I've been trying to write more Pythonic code. I'm sure everyone has their own definition of what Pythonic is, but everyone will agree on things like using list/dict comprehensions where possible and so on and so forth.

I was perusing some code written by some colleagues and found this:

push_data = {'message': container.message or 'NO COMMIT MESSAGE',
This took me to the basic definition of Python's or keyword and, notably, this clause: "These only evaluate their second argument if needed for their outcome."

Way cool! I would previously have used a construction like

push_data = {'message': container.message if container.message else 'NO COMMIT MESSAGE',
and patted myself on the back for being Pythonic, but there is an even better way to do it.

I must say, though, that I would not use the corresponding and construction because it is not so intuitive to me.

0 and 56

PyVCF and header lines

PyCVF which I use for loading VCF files ignores* all lines of the header (## lines) but needs the column header line - if you omit this than PyVCF will skip the first variant line.

*And by ignores I really meant - interprets such lines as ##key=value pairs which are then found in vcf_reader.metadata.

Darwin and case sensitivity in filenames

I normally use a Mac as a pretty interface to a unix (POSIX) system and don't run into problems, but every now and then my whole workflow will come to a stop because Darwin did that one thing slightly different from everyone else and I have to run that down. This time it was case sensitive file names.

POSIX file systems are case sensitive. So a file called Test is different from TEST is different from tesT (Though one is recommended not to torture users in that fashion).

So under a proper POSIX system you can do

mkdir Test mkdir tesT mkdir TEST
and end up with three different directories.

Darwin lulls you into a false sense of security by accepting all three forms but behind the scenes it simply maps it all to the same form. So you get a nonsensical complaint as you do this

mkdir Test # OK mkdir tesT # Directory already exists? WTF? mkdir TEST # Directory already exists? Whaaa?
Sigh.

Doctesting Python command line scripts

The purists will hate this one, but the pragmatists may not condemn me so much. I'm writing Python because I want compact, easy to understand and maintain code. If I was into writing a ton of complicated boilerplate I would have used C.

In keeping with this philosophy I love doctest for testing Python code. Many a time writing code with an eye to doctesting it has led me to more modular and functional code. Perhaps this code was a little slower than it could be, but boy was it fast to write and debug and that's the reason we are in Python, right?

One bump in the road is how do you doctest whole command line applications? Vital parts of my code consist of Python scripts that are meant to be called as command line tools. The individual parts have been tested with doctest, but there is often a need for a gestalt test, where several tools are chained and the output needs to be tested.

There is something called shell-doctest which seems to do exactly this, but it is yet another thi…

Storing state in a Python function

This one blew me away. You can store state in a function, just like you would any object. These are called function attributes. As an aside, I also learned that using a try: except clause is ever slightly so faster than an if.

def foo(a): try: foo.b += a except AttributeError: foo.b = a return foo.b def foo2(a): if hasattr(foo2, 'b'): foo2.b += a else: foo2.b = a return foo2.b if __name__ == '__main__': print [foo(x) for x in range(10)] print [foo2(x) for x in range(10)] """ python -mtimeit -s'import test' '[test.foo(x) for x in range(100)]' python -mtimeit -s'import test' '[test.foo2(x) for x in range(100)]' """
Giving us:

python test.py [0, 1, 3, 6, 10, 15, 21, 28, 36, 45] [0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
python -mtimeit -s'import test' '[test.foo(x) for x in range(100)]' -> 10000 loops, best of 3: 29.4 usec per loop python -mtimeit -s'import t…

Python: Maps, Comprehensions and Loops

Most of you will have seen this already but:

c = range(1000) d = range(1,1001) def foo(a,b): return b - a def map_foo(): e = map(foo, c, d) return e def comprehend_foo(): e = [foo(a, b) for (a,b) in zip(c,d)] return e def loop_foo(): e = [] for (a,b) in zip(c,d): e.append(foo(a, b)) return e def bare_loop(): e = [] for (a,b) in zip(c,d): e.append(b - a) return e def bare_comprehension(): e = [b - a for (a,b) in zip(c,d)] return e """ python -mtimeit -s'import test' 'test.map_foo()' python -mtimeit -s'import test' 'test.comprehend_foo()' python -mtimeit -s'import test' 'test.loop_foo()' python -mtimeit -s'import test' 'test.bare_loop()' python -mtimeit -s'import test' 'test.bare_comprehension()' """
In order of speediness:

test.bare_comprehension() -> 10000 loops, best of 3: 97.9 usec per loop test.map_foo() -> 10000 loops, best …

Python doctest and functions that do file I/O

Some of my code contains functions that read data from files in chunks, process them, and then write data out to other files. I thought it was silly and expensive to first read the data into a list/array, work on them, write out the results to another list and then write the list out to files. But how do you write doctests for such functions?

One way would be to explicitly open files for reading/writing but this is a pain because you have to first create the file, process them, check the answers and then remember to clean up (delete) the files afterwards. You could use the wonderful Python tempfile module, but ...

In my case I was passing file handles to the functions. One convenient way to test such functions without creating physical files on disk is to use io.BytesIO(). This function returns a data structure that has all the characteristics of a file but resides in memory.

For example, suppose you have a cool function that reads in a file and writes out every second byte.

def skip_a…

How to tell if a property is a rental

From a thread here:

The most accurate is to physically go to the place and politely ask the residents - making clear that you are considering buying and are trying to determine how many units are rentals.Online, if you go to the assessor's database if the owner's mailing address is different from the house address it's most certainly not a primary residence. This may mean it's rented out.Online, if you can see the property tax amount, many towns in Mass. have a residential exemption which applies to primary residences only. If the tax seems lower, then this is a primary residence.

What makes marine grade wire special?

I was in the market for pain old 14 AWG wire for a project when I came across spools of wire being sold as "tinned marine grade wire". I was curious to know what made this wire "marine grade". From this page, it turns out, marine grade wire is

12% bigger for a given wire gauge. So 14G marine wire is thicker than regular 14G wireEach strand of the wire is finer and individually coated with tin, making it resistant to corrosionThe insulation on the wire is oil, moisture and heat resistant.

Breaking BAM

It's fun to mess with our bioinformatics tools and laugh at ourselves. The BAM format carries a query name string. In an idle moment I wondered, how long a string can I put here before bad things happen to the BAM?

Generally this string carries information related to the device that produced the read, lane number, run number and so and so forth. It's completely specification free and everyone encodes information here in their own way, so you would think it's an arbitrary length string. Almost.

Try out the short Python snippet below. It generates a dummy fasta file (test.fa) and a dummy BAM file (test.bam) with one 'aligned' read. The only funky thing is that you can make the qname field of the read as long as you want (by varying the lone input parameter to the script).

import sys import pysam try: qnamelen = int(sys.argv[1]) except: qnamelen = 255 with open('test.fa', 'w') as f: f.write('>test\n') f.write('A'*100) bam_…

Adjusting sticking doors

Several of the doors in ye old house were sticking at the top at the latch end. The door was angled too high (or the door frame top was drooping). My solution, in each case, was to use a chisel to deepen the lower hinge seat on the door frame. This allowed the bottom of the door to swing towards the hinge side, bringing the top away from the frame. I could have added shims to the top hinge too. Indeed, for one of the doors I added shims to the top hinge, chiseled away the lower hinge AND chiseled away a bit of the top corner.

Bash conditionals

#!/bin/bash set -x MYVAR=3 if [ ${MYVAR} = 1 ]; then : First option elif [ ${MYVAR} = 2 ]; then : Second option elif [ ${MYVAR} = 3 ]; then : Third option fi -> + MYVAR=3 + '[' 3 = 1 ']' + '[' 3 = 2 ']' + '[' 3 = 3 ']' + : Third option
You need to watch out for whitespace, as bash is sensitive to whitespace

#!/bin/bash set -x MYVAR=3 if [${MYVAR} = 1 ]; then : First option elif [ ${MYVAR} = 2 ]; then : Second option elif [ ${MYVAR} = 3 ]; then : Third option fi -> + MYVAR=3 + '[3' = 1 ']' ./test.sh: line 4: [3: command not found + '[' 3 = 2 ']' + '[' 3 = 3 ']' + : Third option
Very, very sensitive to whitespace

#!/bin/bash set -x MYVAR=3 if [ ${MYVAR}=1 ]; then : First option elif [ ${MYVAR} = 2 ]; then : Second option elif [ ${MYVAR} = 3 ]; then : Third option fi -> + MYVAR=3 + '[' 3=1 ']' + : First option

The magic of mmap

Big data is sometimes described as data whose size is larger than your available RAM. I think that this is a good criterion because once the size of your data (or the size of any results of computing on your data) start to approach your RAM size you have to start worrying about how you are going to manage memory. If you leave it up to your OS you are going to be writing and reading to disk in somewhat unpredictable ways and depending on the software you use, your program might just quit with no warning or with a courtesy 'Out of memory' message. The fun challenge of "Big Data" is, of course, how to keep doing computations regardless of the size of your data and not have your computer quit on you. Some calculations can be done in a blocked fashion but some calculations require you to access different parts of the data all at once.

Python's mmap module is an excellent way to let someone else do the dirty work of handling data files that are comparable or larger th…

The logging overhead.

Python makes printing logger messages, and adjusting the logger level (which messages to print when) very easy. However, it seems, that the logger code comes with a higher overhead than if you used simple 'if' statements to control what messages to print.

Logger messages are very, very useful. Two typical uses of logger messages is to debug things when a program goes awry and to print out the progress of a program (if the user wishes it) to reassure us that the program is running and inform us of how much further the program has to go.

Python's logger is a lot of fun because it is very easy to set up logging messages at different levels and then filter which messages you will actually show. For example you can code the program to print out values of some variables at the  'DEBUG' level and print out the progress of the program at the 'INFO' level. Then you can instruct the code to print out only the INFO messages or both the DEBUG and INFO messages.

It'…

Bash: print commands as they execute

I'm making a demo script that calls programs from a toolchain I am developing. The idea of the script is that it runs a set of commands in sequence and also prints what the command will be doing. Say the base script was:

python do_this.py
I first put in echo to describe the command

echo 'Now we do blah blah' python do_this.py

But, I wanted to see the commands too. I discovered the set -x command, which starts the debug mode and prints out every command as it is executed.

set -x echo 'Now we do blah blah' python do_this.py

But, this now, effectively, printed out the description twice: once when printing the echo statement and then again when actually executing the echo statement. Then I discovered the No OPeration (NOP) character for bash ":".

set -x : Now we do blah blah python do_this.py
This prints the line but does not try to execute it.

Python: passing a mix of keyword arguments and dictionary arguments to a function

So Python is cool because of keyword arguments:

def foo(a=1,b=2,c=3): print a,b,c foo(a=1) # -> 1 2 3
Python is cool because you can pass a dictionary whose keys match the argument names:

def foo(a=1,b=2,c=3): print a,b,c args = {'a': 1, 'b':2} foo(**args) # -> 1 2 3
But, can you mix the two? Yes, yes you can!

def foo(a=1,b=2,c=3): print a,b,c args = {'a': 1, 'b':2} foo(c=3, **args) # -> 1 2 3
Hmm, can we screw up the interpreter? What happens if we send the same argument as a keyword AND a dictionary?

def foo(a=1,b=2,c=3): print a,b,c args = {'a': 1, 'b':2} foo(a=4, **args) # -> TypeError: foo() got multiple values for keyword argument 'a'
Nothing gets past Python, eh?

Washing machine hoses

When our home inspector went through he mentioned to me that I should replace the existing rubber hoses on the washing machine with steel-reinforced ones. I wondered if washing machine hoses were specially prone to fail, perhaps something to do with the intermittent nature of the load (if you look at the hoses when the washing machine runs, they jerk as the machine draws water in phases). I also began to wonder if dishwasher hoses suffered from the same problem.

A quick search brought up two documents, one from a community association underwriter - which I assume insures people with rental properties. The other document I found was from an insurance association serving hotels and inns. The first one claims that steel-reinforced hoses are no better than rubber hoses since the hoses are damaged at the connection. The other one suggests that steel-reinforced hoses may be better than regular rubber hoses but all hoses should be replaced every five years or so.

I could not find reasons why…

execfile and imported modules

I was given to believe that Python's execfile statement simply executed all the commands in a file and continued on its way. This is not entirely true, at least in Python 2.7: imported modules seem not to be handled by the execfile statement, which seems to be rather odd to me.

import numpy def gen(n=100): return numpy.arange(n)
This code does what you expect when you import it as a module:

In [1]: import test In [2]: test.gen(10) Out[2]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
And when you run it as a script to incorporate it into your workspace:

In [3]: run test In [4]: gen(10) Out[4]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
However, when you use execfile to run the script you run into a snag:

In [5]: pars = {}; execfile('test.py', {}, pars); pars['gen'](10) --------------------------------------------------------------------------- NameError                                 Traceback (most recent call last) <ipython-input-5-b06061c74d2b> in <module>() ---…

Parameter files in Python for Python programs

I often write and use Python programs that require input parameters to tell them what to do. When there are a few parameters I usually pass them through the command line (my current favorite framework for command line argument parsing is the amazing docopt module). When there are many parameters, however, command line arguments can get cumbersome.

In the past I have looked to configuration markup languages (YAML comes to mind) and have also used microsoft excel format files (using the excellent xlrd package). Using proprietary formats makes my stomach upset, but there are no good solutions for the odf counterparts.

Recently I have started to experiment with storing parameter files in python and reading them into my code.

__import__(filename) works and puts the parameters into a separate name space, so you can access them as filename.x, filename.y etc. However, the file needs to be in the module search path, and it feels a bit of a misuse.

A better solution, I find, is to use execfile …

git subtree split

Problem: You have a git hub repository with code in several folders. You want to move one of the sub folders into its own separate repository (for example, it was experimental code you were working on, which you now want to spin off into its own life)

Solution: Use git subtree split as detailed here (look for the answer that says The Easy Way™)

Another reason to love PyCharm

The latest version of PyCharm can run the pep8.py tool on your code to flag coding style violations on the fly. I thought this was cool, but it can be annoying sometimes. This is because pep8 consists of coding guidelines and some of them are a matter of taste. My personal quirk is that I like to use 2 spaces for indents instead of 4. 
When I turn on PyCharm’s pep8 checking all my code gets underlined because I’m using two spaces for indents. However, I discovered that under Preferences->Inspections->PEP 8 Coding style violation and there is a properties list labelled “Ignore” where you can type in pep8 violations to be ignored.
UPDATE: From this very informative post here and a response from one of PyCharm's programmers there: Go to the error, click on the light bulb icon and then select 'Ignore errors like this'. This will automatically add a proper exception to this list. For those interested the list of error codes is here.

I typed in ‘indentation’ and was delight…