Showing posts from May, 2014

Python 'or'

I've been trying to write more Pythonic code. I'm sure everyone has their own definition of what Pythonic is, but most will agree on things like using list/dict comprehensions where possible.

I was perusing some code written by some colleagues and found this:

push_data = {'message': container.message or 'NO COMMIT MESSAGE',
This took me to the basic definition of Python's or keyword and, notably, this clause: "These only evaluate their second argument if needed for their outcome."

Way cool! I would previously have used a construction like

push_data = {'message': container.message if container.message else 'NO COMMIT MESSAGE',
and patted myself on the back for being Pythonic, but there is an even better way to do it.

I must say, though, that I would not use the corresponding and construction because it is not so intuitive to me.

0 and 56
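To make the short-circuit rules concrete, here is a minimal sketch (the message variable is just an illustration, not the original push_data code):

```python
# 'x or y' returns x if x is truthy, otherwise y.
# 'x and y' returns x if x is falsy, otherwise y.
# In both cases the second operand is only evaluated if needed.

message = "" or "NO COMMIT MESSAGE"  # empty string is falsy
print(message)   # NO COMMIT MESSAGE

print(0 and 56)  # 0  -- 'and' stops at the first falsy operand
print(3 and 56)  # 56 -- both operands truthy, so the second is returned
```

One caveat with the or idiom: it substitutes the default for any falsy value (empty string, 0, None), which is exactly right for a missing commit message but not always what you want.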

PyVCF and header lines

PyVCF, which I use for loading VCF files, ignores* all the ## lines of the header but needs the column header line (the single # line) - if you omit it then PyVCF will skip the first variant line.

*And by ignores I really mean: it interprets such lines as ##key=value pairs, which are then found in vcf_reader.metadata.
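A rough sketch of what that amounts to, using only the standard library (this is an imitation for illustration, not PyVCF's actual parsing code):

```python
import io

def read_metadata(handle):
    # Collect ##key=value header lines into a dict, stopping at the
    # column header line (the single # line), roughly as PyVCF does.
    metadata = {}
    for line in handle:
        if line.startswith("##"):
            key, _, value = line[2:].rstrip("\n").partition("=")
            metadata[key] = value
        elif line.startswith("#"):
            break  # column header line: variant records follow
    return metadata

toy_vcf = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "##source=toy\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "1\t100\t.\tA\tT\n"
)
print(read_metadata(toy_vcf))  # {'fileformat': 'VCFv4.1', 'source': 'toy'}
```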

Darwin and case sensitivity in filenames

I normally use a Mac as a pretty interface to a unix (POSIX) system and don't run into problems, but every now and then my whole workflow will come to a stop because Darwin did that one thing slightly differently from everyone else and I have to run that down. This time it was case-sensitive file names.

POSIX file systems are case sensitive, so a file called Test is different from TEST, which is different from tesT (though one is advised not to torture users in that fashion).

So under a proper POSIX system you can do

mkdir Test
mkdir tesT
mkdir TEST
and end up with three different directories.

Darwin lulls you into a false sense of security by accepting all three forms, but its default file system (HFS+) is case-insensitive (though case-preserving): behind the scenes all three names resolve to the same directory. So you get a nonsensical complaint as you do this

mkdir Test  # OK
mkdir tesT  # Directory already exists? WTF?
mkdir TEST  # Directory already exists? Whaaa?
Sigh.
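If you want to check programmatically which behaviour you are getting, a throwaway probe like this works (fs_is_case_sensitive is a made-up helper, not a system API):

```python
import os
import tempfile

def fs_is_case_sensitive(path="."):
    # Create a file called 'Test' in a scratch directory and ask
    # whether 'TEST' also appears to exist.
    with tempfile.TemporaryDirectory(dir=path) as scratch:
        open(os.path.join(scratch, "Test"), "w").close()
        # On a case-insensitive file system 'TEST' resolves to 'Test'.
        return not os.path.exists(os.path.join(scratch, "TEST"))

print(fs_is_case_sensitive())  # False on a default Darwin/HFS+ volume
```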

Doctesting Python command line scripts

The purists will hate this one, but the pragmatists may not condemn me so much. I'm writing Python because I want compact, easy to understand and maintain code. If I was into writing a ton of complicated boilerplate I would have used C.

In keeping with this philosophy I love doctest for testing Python code. Many a time writing code with an eye to doctesting it has led me to more modular and functional code. Perhaps this code was a little slower than it could be, but boy was it fast to write and debug and that's the reason we are in Python, right?

One bump in the road: how do you doctest whole command line applications? Vital parts of my code consist of Python scripts that are meant to be called as command line tools. The individual parts have been tested with doctest, but there is often a need for a gestalt test, where several tools are chained and the output needs to be tested.

There is something called shell-doctest which seems to do exactly this, but it is yet another thi…
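Without pulling in another dependency, one pragmatic sketch is a helper whose doctest shells out to the tool under test (run_tool and the inline python -c command are stand-ins for your actual scripts):

```python
import subprocess
import sys

def run_tool(args):
    """Run a command line tool and return its stripped stdout.

    >>> run_tool([sys.executable, "-c", "print(6 * 7)"])
    '42'
    """
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    import doctest
    doctest.testmod()
```

Chaining several tools is then just a matter of feeding one call's output into the next inside the doctest.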

Storing state in a Python function

This one blew me away. You can store state in a function, just as you would in any object. These are called function attributes. As an aside, I also learned that using a try: except clause is ever so slightly faster than an if.

def foo(a):
    try:
        foo.b += a
    except AttributeError:
        foo.b = a
    return foo.b

def foo2(a):
    if hasattr(foo2, 'b'):
        foo2.b += a
    else:
        foo2.b = a
    return foo2.b

if __name__ == '__main__':
    print [foo(x) for x in range(10)]
    print [foo2(x) for x in range(10)]

"""
python -mtimeit -s'import test' '[test.foo(x) for x in range(100)]'
python -mtimeit -s'import test' '[test.foo2(x) for x in range(100)]'
"""
Giving us:

python test.py
[0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
[0, 1, 3, 6, 10, 15, 21, 28, 36, 45]

python -mtimeit -s'import test' '[test.foo(x) for x in range(100)]' -> 10000 loops, best of 3: 29.4 usec per loop
python -mtimeit -s'import t…
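Under Python 3 the same trick looks like this (accumulate is a made-up name for illustration; print is a function now):

```python
def accumulate(a):
    # Keep a running total as an attribute on the function object itself.
    try:
        accumulate.total += a
    except AttributeError:  # first call: the attribute does not exist yet
        accumulate.total = a
    return accumulate.total

print([accumulate(x) for x in range(10)])
# [0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
```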

Python: Maps, Comprehensions and Loops

Most of you will have seen this already but:

c = range(1000)
d = range(1, 1001)

def foo(a, b):
    return b - a

def map_foo():
    e = map(foo, c, d)
    return e

def comprehend_foo():
    e = [foo(a, b) for (a, b) in zip(c, d)]
    return e

def loop_foo():
    e = []
    for (a, b) in zip(c, d):
        e.append(foo(a, b))
    return e

def bare_loop():
    e = []
    for (a, b) in zip(c, d):
        e.append(b - a)
    return e

def bare_comprehension():
    e = [b - a for (a, b) in zip(c, d)]
    return e

"""
python -mtimeit -s'import test' 'test.map_foo()'
python -mtimeit -s'import test' 'test.comprehend_foo()'
python -mtimeit -s'import test' 'test.loop_foo()'
python -mtimeit -s'import test' 'test.bare_loop()'
python -mtimeit -s'import test' 'test.bare_comprehension()'
"""
In order of speediness:

test.bare_comprehension() -> 10000 loops, best of 3: 97.9 usec per loop
test.map_foo() -> 10000 loops, best …
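The same comparison can also be run from inside Python with the timeit module (absolute numbers will vary by machine; these statements just mirror the functions above, under Python 3 where map is lazy):

```python
import timeit

setup = """
c = range(1000)
d = range(1, 1001)
def foo(a, b):
    return b - a
"""

candidates = {
    "map": "list(map(foo, c, d))",  # list() forces Python 3's lazy map
    "comprehension": "[foo(a, b) for a, b in zip(c, d)]",
    "bare comprehension": "[b - a for a, b in zip(c, d)]",
}

for name, stmt in candidates.items():
    t = timeit.timeit(stmt, setup=setup, number=1000)
    print("%-20s %.1f usec per call" % (name, t * 1000))
```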

Python doctest and functions that do file I/O

Some of my code contains functions that read data from files in chunks, process them, and then write data out to other files. I thought it was silly and expensive to first read the data into a list/array, work on them, write out the results to another list and then write the list out to files. But how do you write doctests for such functions?

One way would be to explicitly open files for reading/writing but this is a pain because you have to first create the file, process them, check the answers and then remember to clean up (delete) the files afterwards. You could use the wonderful Python tempfile module, but ...

In my case I was passing file handles to the functions. One convenient way to test such functions without creating physical files on disk is io.BytesIO (or io.StringIO for text). These classes give you an object that has all the characteristics of a file but resides in memory.

For example, suppose you have a cool function that reads in a file and writes out every second byte.

def skip_a…
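The original snippet is cut off above, but the shape of such a test can be sketched with in-memory streams (every_other_byte is a hypothetical stand-in, not the function from the original post):

```python
import io

def every_other_byte(fin, fout):
    """Copy every second byte from one file-like object to another.

    >>> src, dst = io.BytesIO(b'abcdef'), io.BytesIO()
    >>> every_other_byte(src, dst)
    >>> dst.getvalue()
    b'ace'
    """
    fout.write(fin.read()[::2])

if __name__ == "__main__":
    import doctest
    doctest.testmod()
```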