## Monday, December 22, 2014

### Retrieve a single file from a git repo.

The answer to this was more difficult to find than it should have been. The best answer is on stackoverflow here.

In short, use the git archive command:

git archive --remote=git@github.com:foo/bar.git --prefix=path/to/ HEAD:path/to/ |  tar xvf -


• For this to work the owner of the git server needs to have enabled upload-archive (git config daemon.uploadarch true)
• You need tar -xvf because the output is in tar form

## Thursday, December 4, 2014

### Using user classes with Python's set operations

Python has an efficient, and built in, implementation of set operations. With a tiny bit of work you can make your user defined classes work nicely with set operations (as well as dictionaries).

Say you have a class:

class A:
def __init__(self, x, s):
self.x, self.s = x, s


Python actually doesn't have a way of knowing if the objects are the same or not:

a1 = A(1, 'Hi')
a2 = A(1, 'Hi')
a = set([a1, a2])
-> a1, a2


and by default assumes they are different.

All you have have to do, for things to work properly, is to define a hash function (__hash__()) and an equality (__eq__()) operator for your class. The two functions need to be congruent i.e. if the hash value for two objects is the same, they should also register as equal. It is possible to define the functions differently but this will lead to funny things happening with set operations.

The hash value is an integer. If your class has several attributes that you consider important for distinguishing between objects you can hash each attribute separately and then XOR the individual hashes.

class A:
def __init__(self, x, s):
self.x, self.s = x, s

def __hash__(self):
return self.x ^ self.s.__hash__()

def __eq__(self, other):
return self.__hash__() == other.__hash__()


And now

a1 = A(1, 'Hi')
a2 = A(1, 'Hi')
a = set([a1, a2])
-> just a1 as expected


## Tuesday, December 2, 2014

### Sphinx + Cython

It's pretty awesome that Sphinx will work with Cython code files (.pyx) but there are currently a few wrinkles:
1. You need to rerun the cython compiler before the documentation changes take since sphinx reads this off the compiled library
2. Function signatures are omitted in the documentation and the compiler directive # cython: embedsignature=True does not work either. The workaround is to repeat the signature as the first line of the docstring
def merge_variants(c1, c2):
"""
merge_variants(c1, c2)
... rest of docstring ...
"""


## Monday, November 24, 2014

### A small caution when using virtualenvs

Virtualenvs are a great way to test and develop code in insulated containers. My main use case is to have an insulated environment where I can mock install a package I'm developing and ensure that I've taken care of all the dependencies so that a stranger can use the package as is. The virtualenvwrapper package is a great utility that simplifies managing multiple virtual environments.

One caution I have to observe is that packages installed out side the virtual environment can interfere in ways that make behaviors inside the virtual env very mysterious. For example, before I started using virtual environments seriously I had installed the nose and tox modules in my base python install. A month or so afterwards I had created a new test environment and was doing pip install -e . to test whether a package I was writing would install correctly on a fresh environment.



do
done


## Saturday, August 9, 2014

### Nosetests and pdb

I sat there looking at a testing error and thought "Can't we automatically invoke the debugger when a test fails". This is Python, of course someone has a solution for that.

nosetests --pdb


will invoke the debugger .when one of your tests goes south.

## Tuesday, July 29, 2014

### Nose and test generators

In my code I have a directory of small python modules that act as plugins. They can be called by a main program as directed. Each plugin, therefore, has a standard interface and needs to work with the main program. I wanted to setup an automated test that would run the main program, load each plugin in turn and run to make sure there was no crash. It turns out that nose has a very elegant way of doing this.

My test module for this turns out to be (with a little paraphrasing and removal of details):

from nose.plugins.skip import SkipTest

def find_plugin_models():

def check_plugin(model):
if an_important_precondition_is_not_met:
#http://stackoverflow.com/questions/1120148/disabling-python-nosetests
raise SkipTest('I can not run this test')
run_a_test()
assert thing_is_ok()

#http://stackoverflow.com/questions/19071601/how-do-i-run-multiple-python-test-cases-in-a-loop
def test_all_found_plugins():
"""Integration test on automatically found mutation plugin"""
for model in find_plugin_models():
check_plugin.description = 'Testing {:s} automatically'.format(model.__name__)
yield check_plugin, model


Note the cool ability to set a .description attribute that allows you to print what you like when the test runs.

## Friday, July 25, 2014

### numpy and testing

Numpy has some great builtins for running unit tests under
numpy.testing
. There are routines for comparing arrays that come in handy for tests. Many of these have verbose settings that give useful details in a thoughtful format for chasing down why a test failed.

### Making docopt and Sphinx work together

I use docopt to build simple, self-documenting command line interfaces to scripts. I was delighted when sphinx's outdoc features seemed to interpret docopt's options well:

But it is apparent that the formatting of the usage part is a bit off and any option that does not begin with -- or - messes up the options formatting completely.

It turns out that Sphinx has a bunch of plugins for documenting command line options with sophisticated ones able to run the code or analyze the argparse object itself but nothing for docopt. One code sample I found wrote out the command line description in restructured text and then stripped out parts of it to make docopt parse it properly

The real solution is, of course, to write a Sphinx extension for docopt, but I ended up using a slight modification that doesn't confuse docopt and lets SPhinx make decent (not great) autodoc out of it.

My docstring looks like this:

Command line::

Usage:
fasta2wg  --index=IDX  --wg=WG  [--fa=FA] [-v]
fasta2wg  describe  --wg=WG

Options:
--index=IDX       Index file (.json) listing fasta files to be inserted into whole genome
--wg=WG           Name of whole genome file
--fa=FA           If set, will also dump a fa.gz file (by simply concatenating the files together) for use by BWA
-v                Be verbose when you do things
describe          Take the indicated whole genome file as input and print a summary of what it contains


And the Sphinx documentation looks like:

## Wednesday, July 23, 2014

### Some notes on Python imports and project organization

One majorly annoying thing for Pythonistas is the search path that the import command uses. When I was a beginner and wanted to get cracking I'd simply add whatever I was working on to PYTHONPATH, not wanting to get bogged down in silly details (that's what we were using Python for, right?).

I learned my lesson when I "forked" one of my projects (by duplicating the directory - give me a break, I wasn't born learning version control systems) and spent several hours trying to figure out why changes to a file I was making did not show up when I ran the code - Python, of course, was picking up the old version of the module in the original directory.

My current strategies, in somewhat random order, for project organization and testing are:

### Project organization

ProjectName
|--------->setup.py
|--------->docs
|--------->mainmodule
|             |------->__init__.py  (empty)
|             |-------> root level modules
|             |-------> submodule1
|             |    |---------------> __init__.py (empty)
|             |    |----> sub modules
|             |-------> submodule2
|                  |---------------> __init__.py (empty)
|                  |----> sub modules
|
|--------->tests
|------->__init__.py  import sys; sys.path.append('../mainmodule')
|-------> root level module tests
|-------> submodule1
|    |---------------> __init__.py (empty)
|    |----> sub module tests
|-------> submodule2
|---------------> __init__.py (empty)
|----> sub module tests


An important lesson is a) not to forget the __init__.py files and b) to add relevant paths to sys.path in __init__.py when needed.

Using this directory structure allows me to use sphinx for the documentation (the source root goes in conf.py in docs) and allows me to invoke nosetests by doing

nosetests tests


and the path appending going on in tests/__init__.py allows me to import any part of the module in any test script just as I would from a different application after the application has been installed using setup.py

## Tuesday, July 22, 2014

### Use numpydoc + sphinx

An ideal code documentation system should allow you to write documentation once (when you are writing the code) and then allow you to display the documentation in different contexts, such as project manuals, inline help, command line help and so on. You shouldn't need to duplicate docs - it is a waste of effort and leads to errors when the code is updated. The documentation you write should be easily readable by human users as they peruse the code, as well as by users as they run your code.

Sphinx is an awesome documentation system for Python. Sphinx can take descriptions in docstrings and embed them in documentation so that we can approach the goals of an ideal documentation system.

However, Sphinx violates the human readability part when it comes to function parameter descriptions. Take the following function, for example.

 def _repeat_sequence(seq_len=100, subseq=None, subseq_len=10, base_sel_rng=None, alphabet=['A', 'C', 'T', 'G']): """ Create a sequence by repeating a sub-sequence """ subseq = base_sel_rng.choice(alphabet, size=subseq_len, replace=True, p=[.3, .2, .2, .3]).tostring() return subseq * (seq_len / subseq_len) + subseq[:seq_len % subseq_len] 

The Sphinx-parsable way of writing the docstring is:

def _repeat_sequence(seq_len=100, subseq=None, subseq_len=10, base_sel_rng=None, alphabet=['A', 'C', 'T', 'G']):
"""Create a sequence by repeating a sub-sequence

:param seq_len: Length of sequence.
:type seq_len: int
:param subseq: Sub-sequence to use as repeat block. Omit to generate a random sub-sequence.
:type subseq: str
:param subseq_len: If subseq is omitted this must be provided to indicate desired length of random sub-sequence
:type subseq_len: Length of random sub-sequence
:param base_sel_rng: Random number generator e.g. numpy.random
:param alphabet: List of characters constituting the alphabet
:type alphabet: list

:returns:  str -- the sequence.
"""
subseq = base_sel_rng.choice(alphabet, size=subseq_len, replace=True, p=[.3, .2, .2, .3]).tostring()
return subseq * (seq_len / subseq_len) + subseq[:seq_len % subseq_len]


It does not matter that the Sphinx output is well formatted: for humans it's rather yucky to look at which defeats the purpose of docstrings.

It turns out that, as is common in Python, this problem has been noted and rectified. In this case the correction comes from the great folks at numpy who have a nice extension called numpydoc. The numpydoc version of this docstring is:

def repeat_sequence(seq_len=100, subseq=None, subseq_len=10, base_sel_rng=None, alphabet=['A', 'C', 'T', 'G']):
"""Create a sequence by repeating a sub-sequence

Parameters
----------
seq_len : int
Length of sequence.
subseq : str, optional
Sub-sequence to use as repeat block. Omit to generate a random sub-sequence.
subseq_len : int, optional
If subseq is omitted this must be provided to indicate desired length of random
sub-sequence
subseq_len : int
Length of random sub-sequence
base_sel_rng : object
Random number generator e.g. numpy.random
alphabet : list, optional
List of characters constituting the alphabet

Returns
-------
str
The sequence.

.. note:: This is meant to be used internally

"""
subseq = base_sel_rng.choice(alphabet, size=subseq_len, replace=True, p=[.3, .2, .2, .3]).tostring()
return subseq * (seq_len / subseq_len) + subseq[:seq_len % subseq_len]


Note: The only glitchy thing is that when you add numpydoc to the list of extensions in Sphinx's conf.py it can't be ahead of standard sphinx extensions - the order matters. For example, my extensions list is:

extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.doctest',
'sphinx.ext.todo',
'sphinx.ext.coverage',
'sphinx.ext.pngmath',
'sphinx.ext.mathjax',
'sphinx.ext.viewcode',
'numpydoc.numpydoc'
]


Putting numpydoc first leads to the error:
Extension error:
Unknown event name: autodoc-process-docstring


## Thursday, July 10, 2014

### Recap: Python list indexing is O(1)

Just in case any of you, hearing about how slow and ponderous Python is, were worried about indexing into a list - it's a O(1) operation:

In [936]: big_list = range(int(1e8))
len(b
In [937]: len(big_list)
Out[937]: 100000000

In [938]: %timeint big_list[-1]
ERROR: Line magic function %timeint not found.

In [939]: %timeit big_list[-1]
10000000 loops, best of 3: 66.1 ns per loop

In [940]: %timeit big_list[0]
10000000 loops, best of 3: 65.2 ns per loop

In [941]: small_list = range(1000)

In [942]: %timeit small_list[0]
10000000 loops, best of 3: 70.6 ns per loop

In [943]: %timeit small_list[-1]
10000000 loops, best of 3: 69.7 ns per loop

In [944]: %timeit big_list[10000]
10000000 loops, best of 3: 68.2 ns per loop

In [945]: %timeit small_list[100]
10000000 loops, best of 3: 70.2 ns per loop


## Wednesday, July 9, 2014

### Mac OS X: Coding in the dark

So, sometimes you need to work at night and not light up the whole room (e.g. when there is a sleeping baby next to you). The dimmest setting on the mac screen is still annoyingly bright, and indeed after a while, begins to hurt my eyes. One native thing to try is "invert colors" under "screen" in "accessibility". It makes the screen look like a negative image. It looks very cool, a bit annoying if you have images, but for text it is a very good counterpart to bright backgrounds during the day.

## Monday, July 7, 2014

### Python: To format or to concatenate

A while ago a kindly reader pointed out that Python's string .format method is, after all, a function, and carries with it some over head. I have some inner loop code that I could stand to run a little faster and I was looking for a way to speed things up without losing readability. In one part of the loop I was creating a string using the format statement. I wondered if I could speed things up by changing that to a concat.

So I first tested it out on some toy code:

def str_format(size=int(1e6)):
for n in range(size):
a = 'hi {:d} {:d} {:d}'.format(n, n+1, n+2)
return a

def str_concat(size=int(1e6)):
for n in range(size):
a = 'hi ' + str(n) + str(n+1) + str(n+2)
return a

In [448]: %timeit str_concat()
1 loops, best of 3: 996 ms per loop

In [449]: %timeit str_format()
1 loops, best of 3: 1.26 s per loop


So, the plain python concat is faster than the elegant way. This held with one variable or two variables too. This probably has to do with the complexity of the .format function, which offsets even the extra calls to str

## Sunday, July 6, 2014

### Doctests or nosetests?

I had a torrid love affair with doctests. It was AMAZING! You mean you can write documentation AND test your code at the same time!? You can write a line of code, put down the answer next to it and the computer WILL CHECK IT FOR YOU?

Well, this went on for several months. Then one morning I woke up next to this:

def handle_variant(variant, f_vseq, f_vseq_pos, strand_no=None):
"""Write the variant to the mutated sequence and fill out the pos array

Some setup
>>> import io, vcf, struct
...   md=vcf.model.make_calldata_tuple('GT')(gt)
...   call=vcf.model._Call(v,'sample',md)
...   v.samples = [call]
...   v._sample_indexes = {'sample': 0}

Test with SNP: ignore zygosity
>>> ref_seq = 'ACTGACTG'; \
pos = 2; \
ref = 'C'; \
alt = [vcf.model._Substitution('T')]; \
variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, None, None, None); \
f_vseq = io.BytesIO(); \
f_vseq_pos = io.BytesIO(); \
print handle_variant(variant, f_vseq, f_vseq_pos)
2
>>> _ = f_vseq.seek(0); print f_vseq.read()
T
>>> _ = f_vseq_pos.seek(0); print struct.unpack('I',f_vseq_pos.read(4))
(2,)

Test with SNP: homo, strand 0
>>> f_vseq = io.BytesIO(); \
f_vseq_pos = io.BytesIO(); \
print handle_variant(variant, f_vseq, f_vseq_pos, 0)
2
>>> _ = f_vseq.seek(0); print f_vseq.read()
T
>>> _ = f_vseq_pos.seek(0); print struct.unpack('I',f_vseq_pos.read(4))
(2,)

Test with SNP: homo, strand 1
>>> f_vseq = io.BytesIO(); \
f_vseq_pos = io.BytesIO(); \
print handle_variant(variant, f_vseq, f_vseq_pos, 1)
2
>>> _ = f_vseq.seek(0); print f_vseq.read()
T
>>> _ = f_vseq_pos.seek(0); print struct.unpack('I',f_vseq_pos.read(4))
(2,)

Test with SNP: het, strand 0
f_vseq = io.BytesIO(); \
f_vseq_pos = io.BytesIO(); \
print handle_variant(variant, f_vseq, f_vseq_pos, 0)
1
>>> _ = f_vseq.seek(0); print f_vseq.read()

>>> _ = f_vseq_pos.seek(0,2); print f_vseq.tell()
0

Test with SNP: het, strand 1
>>> f_vseq = io.BytesIO(); \
f_vseq_pos = io.BytesIO(); \
print handle_variant(variant, f_vseq, f_vseq_pos, 1)
2
>>> _ = f_vseq.seek(0); print f_vseq.read()
T
>>> _ = f_vseq_pos.seek(0); print struct.unpack('I',f_vseq_pos.read(4))
(2,)

Test with delete: ignore zygosity
>>> ref_seq = 'ACTGACTG'; \
pos = 2; \
ref = 'CTG'; \
alt = [vcf.model._Substitution('C')]; \
variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, None, None, None); \
f_vseq = io.BytesIO(); \
f_vseq_pos = io.BytesIO(); \
print handle_variant(variant, f_vseq, f_vseq_pos)
4
>>> _ = f_vseq.seek(0); print f_vseq.read()
C
>>> _ = f_vseq_pos.seek(0); print struct.unpack('I',f_vseq_pos.read(4))
(2,)

Test with same delete, strand 1 (same as REF)
>>> f_vseq = io.BytesIO(); \
f_vseq_pos = io.BytesIO(); \
print handle_variant(variant, f_vseq, f_vseq_pos, 1)
1
>>> _ = f_vseq.seek(0); print f_vseq.read()

>>> _ = f_vseq_pos.seek(0,2); print f_vseq.tell()
0

Test with delete and .
>>> ref_seq = 'ACTGACTG'; \
pos = 8; \
ref = 'G'; \
alt = [vcf.model._Substitution('.')]; \
variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, None, None, None); \
f_vseq = io.BytesIO(); \
f_vseq_pos = io.BytesIO(); \
print handle_variant(variant, f_vseq, f_vseq_pos)
8
>>> _ = f_vseq.seek(0); print f_vseq.read()

>>> _ = f_vseq_pos.seek(0,2); print f_vseq.tell()
0

Test with delete and .
>>> ref_seq = 'ACTGACTG'; \
pos = 4; \
ref = 'GACTG'; \
alt = [vcf.model._Substitution('.')]; \
variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, None, None, None); \
f_vseq = io.BytesIO(); \
f_vseq_pos = io.BytesIO(); \
print handle_variant(variant, f_vseq, f_vseq_pos)
8
>>> _ = f_vseq.seek(0); print f_vseq.read()

>>> _ = f_vseq_pos.seek(0,2); print f_vseq.tell()
0

Test with insert
>>> ref_seq = 'ACTGACTG'; \
pos = 2; \
ref = 'C'; \
alt = [vcf.model._Substitution('CGGG')]; \
variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, None, None, None); \
f_vseq = io.BytesIO(); \
f_vseq_pos = io.BytesIO(); \
print handle_variant(variant, f_vseq, f_vseq_pos)
2
>>> _ = f_vseq.seek(0); print f_vseq.read()
CGGG
>>> _ = f_vseq_pos.seek(0); print struct.unpack('4I',f_vseq_pos.read(4*4))
(2, 3, 3, 3)

Test with insert
>>> ref_seq = 'ACTGACTG'; \
pos = 8; \
ref = 'G'; \
alt = [vcf.model._Substitution('GTTT')]; \
variant = vcf.model._Record('1', pos, '.', ref, alt, 100, None, None, None, None, None); \
f_vseq = io.BytesIO(); \
f_vseq_pos = io.BytesIO(); \
print handle_variant(variant, f_vseq, f_vseq_pos)
8
>>> _ = f_vseq.seek(0); print f_vseq.read()
GTTT
>>> _ = f_vseq_pos.seek(0); print struct.unpack('4I',f_vseq_pos.read(4*4))
(8, 9, 9, 9)

See Readme.md for details of algorithm
"""
var_type = '1' if strand_no is None else variant.genotype('sample').data.GT.split('/')[strand_no]
# 0 means REF 1 means ALT. If we don't specify a strand number it means we don't care about Zygosity and will always
# take the ALT
if var_type == '1':
alt = variant.ALT[0].sequence
if alt == '.': alt = ''
ref = variant.REF
if ref == '.': ref = ''
f_vseq.write(alt)    # Copy over ALT
new_ref_pos = len(ref) + variant.POS - 1  # Advance along ref_seq
# -1 because POS is 1-indexed, we are 0-indexed internally
if len(alt) > 0:
if len(ref) > 0:
f_vseq_pos.write(struct.pack('I', variant.POS))  # The original base
if len(alt) > 1:
f_vseq_pos.write(struct.pack('{:d}I'.format(len(alt) - len(ref)), *[new_ref_pos + 1] * (len(alt) - len(ref))))
else:
new_ref_pos = variant.POS - 1  # Keep us here - we didn't implement this variant and we should keep copying
# -1 because POS is 1-indexed, we are 0-indexed internally
return new_ref_pos


There were several functions like this. I was refactoring this code and when I would work on these functions I would spend minutes trying to find the actual body of the function, often puzzling over what turned out to be a doctest and not the actual code. I had been warned about this by well meaning relatives.

Doctests are awesome when you can combine a test and example and it is pithy. When you need to setup complex data structures or you need to test lots of use cases, it is better to put tests in a separate file, in a separate directory and run them with nose.

Doctests can still be used in the documentation as examples of using functions with the added benefit that the examples are kept upto date by being tested.

## Saturday, July 5, 2014

### Python string concatenation

Continuing with our timing and efficiency theme: What is the more efficient way of building up a string?

from cStringIO import StringIO

def string_concat(size=int(1e6)):
repeat = 'A'
st = ''
for n in range(size):
st += repeat
return st

def list_concat(size=int(1e6)):
repeat = 'A'
st = []
for n in range(size):
st += repeat
return ''.join(st)

def stringIO_concat(size=int(1e6)):
repeat = 'A'
buf = StringIO()
for n in range(size):
buf.write(repeat)
return buf.getvalue()


Turns out the simple way is best:

In [91]: %timeit string_concat()
10 loops, best of 3: 121 ms per loop

In [92]: %timeit list_concat()
1 loops, best of 3: 293 ms per loop

In [93]: %timeit stringIO_concat()
1 loops, best of 3: 336 ms per loop


## Friday, July 4, 2014

### numpy arrays and iterators

Is it better to access a numpy array with an iterator or indexing?

import numpy

# This is how I've always done it - straight python idiomatic usage
def iterate1(m=100000):
a = numpy.empty(m)
sum = 0
for n in a:
sum += n
return sum

def iterate2(m=100000):
a = numpy.empty(m)
sum = 0
for n in numpy.nditer(a):
sum += n
return sum

#This is C-like
def loop(m=100000):
a = numpy.empty(m)
sum = 0
for n in range(a.size):
sum += a[n]
return sum


And the shootout?

>>> %timeit iterate1()
10 loops, best of 3: 35.4 ms per loop
>>> %timeit iterate2()
10 loops, best of 3: 63.7 ms per loop
>>> %timeit loop()
10 loops, best of 3: 45.1 ms per loop

>>> %timeit iterate1(1e6)
1 loops, best of 3: 368 ms per loop
>>> %timeit iterate2(1e6)
1 loops, best of 3: 605 ms per loop
>>> %timeit loop(1e6)
1 loops, best of 3: 443 ms per loop


So the pythonic one seems to be the best

## Thursday, July 3, 2014

### HDF5 is not for fast access

HDF5 is a good solution for storing large datasets on disk. Python's h5py library makes it possible to pretend that data stored on disk is just like an in memory array. It is important to keep in mind that the data is really stored on disk and is read in every time a slice or index into the data is taken.

import numpy
import h5py

def create_data(length=1e4):
data = numpy.random.rand(length)
with h5py.File('test.h5', 'w') as fp:
fp.create_dataset('test', data=data)
return data

def access_each_h5():
y = 0
with h5py.File('test.h5', 'r') as fp:
for n in range(fp['test'].size):
y += fp['test'][n]
return y

def access_each_array(data):
y = 0
for n in range(data.size):
y += data[n]
return y

d = create_data()

>>> run test.py
>>> %timeit access_each_array(d)
100 loops, best of 3: 4.14 ms per loop
>>> %timeit access_each_h5()
1 loops, best of 3: 1.9 s per loop

That sobering difference in performance reminds us that we can't - performance wise - equate the two. When processing data from an hdf5 file, it is best to read in as large chunks as your memory will allow and do the heavy lifting in memory.

## Friday, June 20, 2014

### A note on Python's __exit__() and errors

Python's context managers are a very neat way of handling code that needs a teardown once you are done. Python objects have do have a destructor method (__del__) called right before the last instance of the object is about to be destroyed. You can do a teardown there. However there is a lot of fine print to the __del__ method. A cleaner way of doing tear-downs is through Python's context manager, manifested as the with keyword.

class CrushMe:
def __init__(self):
self.f = open('test.txt', 'w')

def foo(self, a, b):
self.f.write(str(a - b))

def __enter__(self):
return self

def __exit__(self, exc_type, exc_val, exc_tb):
self.f.close()
return True

with CrushMe() as c:
c.foo(2, 3)


One thing that is important, and that got me just now, is error handling. I made the mistake of ignoring all those 'junk' arguments (exc_type, exc_val, exc_tb). I just skimmed the docs and what popped out is that you need to return True or False depending on whether there was an error. So I wrote my code to return False if there was a problem in saving the file (My actual code is a little more involved, but the same in spirit) and True otherwise.

So what happens now if you do
with CrushMe() as c:
c.foo('2', '3')


Of course, YOU know that this code will error out - you can't subtract strings. But if you run this code, it will FAIL SILENTLY. This is because I was careless and did not consider what happens if there is an error SOMEWHERE ELSE.

The proper way to do this, as a minimum, is to change the code to

def __exit__(self, exc_type, exc_val, exc_tb):
self.f.close()
return True if exc_type is None else False


Another note: defining both a __del__ method and an __exit__ method can lead to tricky situations. The following code, for, instance, calls the close method twice.

class CrushMe:
def __init__(self):
self.f = open('test.txt', 'w')

def foo(self, a, b):
self.f.write(str(a - b))

def __del__(self):
self.close()

def __enter__(self):
return self

def __exit__(self, exc_type, exc_val, exc_tb):
self.close()
return True if exc_type is None else False

def close(self):
print 'Called!'
self.f.close()

with CrushMe() as c:
c.foo(2, 3)


close gets called first by __exit__ when we exit the context and then when we exit the interpreter and the object is deleted.

## Sunday, June 1, 2014

### Feedburner and the fragmentation of google

Wordpress has a plugin for sending your post to twitter or linkedin (and linkedin has a builtin way of sending your updates to twitter) but I didn't see anything like that for blogger (except a G+ button for publicizing to that thriving community). Whenever I did a search I came up with some third party service. Finally I ran into Feedburner, which turns out to be Google's own service for publicizing things on social media. This just led me to an observation that google has so many services and they CAN talk to each other, but there are complicated layers between them.

## Wednesday, May 21, 2014

### Python 'or'

I've been trying to write more Pythonic code. I'm sure everyone has their own definition of what Pythonic is, but everyone will agree on things like using list/dict comprehensions where possible and so on and so forth.

I was perusing some code written by some colleagues and found this:

    push_data = {'message': container.message or 'NO COMMIT MESSAGE',


This took me to the basic definition of Python's or keyword and, notably, this clause: "These only evaluate their second argument if needed for their outcome."

Way cool! I would previously have used a construction like

    push_data = {'message': container.message if container.message else 'NO COMMIT MESSAGE',


and patted myself on the back for being Pythonic, but there is an even better way to do it.

I must say, though, that I would not use the corresponding and construction because it is not so intuitive to me.

0 and 56


## Tuesday, May 20, 2014

PyCVF which I use for loading VCF files ignores* all lines of the header (## lines) but needs the column header line - if you omit this than PyVCF will skip the first variant line.

*And by ignores I really meant - interprets such lines as ##key=value pairs which are then found in vcf_reader.metadata.

## Tuesday, May 13, 2014

### Finder on Mavericks slowing down?

If you find that Finder on Mavericks has gotten obscenely slow you can try resetting the finder's .plist and restarting the finder (From the solution here):

rm ~/Library/Preferences/com.apple.finder.plist&&killall Finder


## Tuesday, May 6, 2014

### Darwin and case sensitivity in filenames

I normally use a Mac as a pretty interface to a unix (POSIX) system and don't run into problems, but every now and then my whole workflow will come to a stop because Darwin did that one thing slightly different from everyone else and I have to run that down. This time it was case sensitive file names.

POSIX file systems are case sensitive. So a file called Test is different from TEST is different from tesT (Though one is recommended not to torture users in that fashion).

So under a proper POSIX system you can do

mkdir Test
mkdir tesT
mkdir TEST


and end up with three different directories.

Darwin lulls you into a false sense of security by accepting all three forms but behind the scenes it simply maps it all to the same form. So you get a nonsensical complaint as you do this

mkdir Test  # OK
mkdir tesT  # Directory already exists? WTF?
mkdir TEST  # Directory already exists? Whaaa?


Sigh.

## Monday, May 5, 2014

### Doctesting Python command line scripts

The purists will hate this one, but the pragmatists may not condemn me so much. I'm writing Python because I want compact, easy to understand and maintain code. If I was into writing a ton of complicated boilerplate I would have used C.

In keeping with this philosophy I love doctest for testing Python code. Many a time writing code with an eye to doctesting it has led me to more modular and functional code. Perhaps this code was a little slower than it could be, but boy was it fast to write and debug and that's the reason we are in Python, right?

One bump in the road is how do you doctest whole command line applications? Vital parts of my code consist of Python scripts that are meant to be called as command line tools. The individual parts have been tested with doctest, but there is often a need for a gestalt test, where several tools are chained and the output needs to be tested.

There is something called shell-doctest which seems to do exactly this, but it is yet another third party package and not very recently maintained, or very widely known. I'm not sure I should use it.

The creator of Python has recommended creating a main function that can be then tested with forced arguments. I like this, and may convert my 'main' sections this way, but there is something to be said to maintaing the 'conversational' style of doctest, where you simply plop down python commands as you would normally and have the correct answers as if you were in an interactive programming environment.

My current experiment is to merge the gestalt test with my instruction manual so that I'm writing the command line examples into my Readme.md file which then doubles as a test file which can be executed by running python -m doctest -v Readme.md

In fact, if you are in IPython, the magic function run does exactly what I want - which, is to run a complete shell command, like run process_data.py arg1 arg2 arg3. Unfortunately doctest uses a vanilla python shell which does not have this facility.

My solution is to define a function I call shell early on in the document.

>>> import shlex, subprocess
>>> def shell(command): subprocess.call(shlex.split(command))


Now, in my Readme.md file I have things like:

You can run the data processing program as

>>> shell('python process_data.py infile.txt outfile.txt')

Which will result in this processed data

>>> with open('outfile.txt','r') as f: print f.read()
Output Data


Which serves both as do by example documentation as well as test.

## Sunday, May 4, 2014

### Storing state in a Python function

This one blew me away. You can store state in a function, just like you would any object. These are called function attributes. As an aside, I also learned that using a try: except clause is ever slightly so faster than an if.

def foo(a):
try:
foo.b += a
except AttributeError:
foo.b = a
return foo.b

def foo2(a):
if hasattr(foo2, 'b'):
foo2.b += a
else:
foo2.b = a
return foo2.b

if __name__ == '__main__':
print [foo(x) for x in range(10)]
print [foo2(x) for x in range(10)]

"""
python -mtimeit -s'import test' '[test.foo(x) for x in range(100)]'
python -mtimeit -s'import test' '[test.foo2(x) for x in range(100)]'
"""


Giving us:

python test.py
[0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
[0, 1, 3, 6, 10, 15, 21, 28, 36, 45]


python -mtimeit -s'import test' '[test.foo(x) for x in range(100)]'  ->  10000 loops, best of 3: 29.4 usec per loop
python -mtimeit -s'import test' '[test.foo2(x) for x in range(100)]' -> 10000 loops, best of 3: 39.1 usec per loop


## Thursday, May 1, 2014

### Python: Maps, Comprehensions and Loops

Most of you will have seen this already but:

c = range(1000)
d = range(1,1001)

def foo(a,b):
return b - a

def map_foo():
e = map(foo, c, d)
return e

def comprehend_foo():
e = [foo(a, b) for (a,b) in zip(c,d)]
return e

def loop_foo():
e = []
for (a,b) in zip(c,d):
e.append(foo(a, b))
return e

def bare_loop():
e = []
for (a,b) in zip(c,d):
e.append(b - a)
return e

def bare_comprehension():
e = [b - a for (a,b) in zip(c,d)]
return e

"""
python -mtimeit -s'import test' 'test.map_foo()'
python -mtimeit -s'import test' 'test.comprehend_foo()'
python -mtimeit -s'import test' 'test.loop_foo()'
python -mtimeit -s'import test' 'test.bare_loop()'
python -mtimeit -s'import test' 'test.bare_comprehension()'
"""


In order of speediness:

test.bare_comprehension() -> 10000 loops, best of 3: 97.9 usec per loop
test.map_foo() -> 10000 loops, best of 3: 125 usec per loop
test.bare_loop() -> 10000 loops, best of 3: 135 usec per loop
test.comprehend_foo() -> 10000 loops, best of 3: 159 usec per loop
test.loop_foo() -> 1000 loops, best of 3: 202 usec per loop


So, where-ever you can, use map. The reason map_foo is slower than bare_comprehension is the function overhead.

### Python doctest and functions that do file I/O

Some of my code contains functions that read data from files in chunks, process them, and then write data out to other files. I thought it was silly and expensive to first read the data into a list/array, work on them, write out the results to another list and then write the list out to files. But how do you write doctests for such functions?

One way would be to explicitly open files for reading/writing but this is a pain because you have to first create the file, process them, check the answers and then remember to clean up (delete) the files afterwards. You could use the wonderful Python tempfile module, but ...

In my case I was passing file handles to the functions. One convenient way to test such functions without creating physical files on disk is to use io.BytesIO(). This function returns a data structure that has all the characteristics of a file but resides in memory.

For example, suppose you have a cool function that reads in a file and writes out every second byte.

def skip_a_byte(fin, fout):
while b:
if fin.tell() % 2:
fout.write(b)


We could outfit with a doctest harness as follows:

def skip_a_byte(fin, fout):
"""
>>> import io; \
data = 'Haopwf gNqojwj jBlrmowwjnl BCloRwT'; \
f1 = io.BytesIO(data); \
f2 = io.BytesIO(); \
skip_a_byte(f1, f2); \
_ = f2.seek(0); \
How Now Brown Cow
"""
while b:
if fin.tell() % 2:
fout.write(b)

if __name__ == "__main__":
import doctest
doctest.testmod()


Oh, BTW, I would like to give a shoutout to PyCharm for being very cool: they recognize the code in the docstring, and can do code analysis on it!

## Wednesday, April 30, 2014

### How to tell if a property is a rental

1. The most accurate is to physically go to the place and politely ask the residents - making clear that you are considering buying and are trying to determine how many units are rentals.
2. Online, if you go to the assessor's database if the owner's mailing address is different from the house address it's most certainly not a primary residence. This may mean it's rented out.
3. Online, if you can see the property tax amount, many towns in Mass. have a residential exemption which applies to primary residences only. If the tax seems lower, then this is a primary residence.

## Tuesday, April 29, 2014

### What makes marine grade wire special?

I was in the market for pain old 14 AWG wire for a project when I came across spools of wire being sold as "tinned marine grade wire". I was curious to know what made this wire "marine grade". From this page, it turns out, marine grade wire is

1. 12% bigger for a given wire gauge. So 14G marine wire is thicker than regular 14G wire
2. Each strand of the wire is finer and individually coated with tin, making it resistant to corrosion
3. The insulation on the wire is oil, moisture and heat resistant.

## Monday, April 28, 2014

### Breaking BAM

It's fun to mess with our bioinformatics tools and laugh at ourselves. The BAM format carries a query name string. In an idle moment I wondered, how long a string can I put here before bad things happen to the BAM?

Generally this string carries information related to the device that produced the read, lane number, run number and so and so forth. It's completely specification free and everyone encodes information here in their own way, so you would think it's an arbitrary length string. Almost.

Try out the short Python snippet below. It generates a dummy fasta file (test.fa) and a dummy BAM file (test.bam) with one 'aligned' read. The only funky thing is that you can make the qname field of the read as long as you want (by varying the lone input parameter to the script).

import sys
import pysam

try:
qnamelen = int(sys.argv[1])
except:
qnamelen = 255

with open('test.fa', 'w') as f:
f.write('>test\n')
f.write('A'*100)

bam_hdr = {'HD': {'VN': '1.4'},
'SQ': [{'LN': 100, 'SN': 'test'}]}
ar.qname = 'Q'*qnamelen  # Gives Tablet headaches when str len > 254
ar.seq = 'A'*50
ar.qual = '~'*50
ar.mapq = 100
ar.pos = 0
ar.cigarstring = '50M'
bf.write(ar)
bf.close()
pysam.sort('test2.bam', 'test')
pysam.index('test.bam')


Now open the pair of fasta and BAM files in samtools tview, Tablet and IGV (and indeed in anything else). Laugh loudly until your office mates banish you.

Only samtools tview - the least fancy one - loads the alignment without batting an eyelid.

The specs allow an arbitrary length null terminated string.

(Oh, as an aside, try this. Create a .fasta file with this simple corruption: Remove the newline from the sequence id in the fasta file (so that the sequence looks something like 'testAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'. Load the sequence in IGV. Now fix the sequence and try and convince IGV to load the new corrected sequence. Hint: it won't go away until you use igvtools to reindex the fasta file. running bwa index won't help)

## Saturday, April 26, 2014

Several of the doors in ye old house were sticking at the top at the latch end. The door was angled too high (or the door frame top was drooping). My solution, in each case, was to use a chisel to deepen the lower hinge seat on the door frame. This allowed the bottom of the door to swing towards the hinge side, bringing the top away from the frame. I could have added shims to the top hinge too. Indeed, for one of the doors I added shims to the top hinge, chiseled away the lower hinge AND chiseled away a bit of the top corner.

## Wednesday, April 23, 2014

### Bash conditionals

#!/bin/bash
set -x
MYVAR=3
if [ ${MYVAR} = 1 ]; then : First option elif [${MYVAR} = 2 ]; then
: Second option
elif [ ${MYVAR} = 3 ]; then : Third option fi -> + MYVAR=3 + '[' 3 = 1 ']' + '[' 3 = 2 ']' + '[' 3 = 3 ']' + : Third option  You need to watch out for whitespace, as bash is sensitive to whitespace #!/bin/bash set -x MYVAR=3 if [${MYVAR} = 1 ]; then
: First option
elif [ ${MYVAR} = 2 ]; then : Second option elif [${MYVAR} = 3 ]; then
: Third option
fi

->

+ MYVAR=3
+ '[3' = 1 ']'
+ '[' 3 = 2 ']'
+ '[' 3 = 3 ']'
+ : Third option



Very, very sensitive to whitespace

#!/bin/bash
set -x
MYVAR=3
if [ ${MYVAR}=1 ]; then : First option elif [${MYVAR} = 2 ]; then
: Second option
elif [ \${MYVAR} = 3 ]; then
: Third option
fi

->

+ MYVAR=3
+ '[' 3=1 ']'
+ : First option



## Monday, April 14, 2014

### The magic of mmap

Big data is sometimes described as data whose size is larger than your available RAM. I think that this is a good criterion because once the size of your data (or the size of any results of computing on your data) start to approach your RAM size you have to start worrying about how you are going to manage memory. If you leave it up to your OS you are going to be writing and reading to disk in somewhat unpredictable ways and depending on the software you use, your program might just quit with no warning or with a courtesy 'Out of memory' message. The fun challenge of "Big Data" is, of course, how to keep doing computations regardless of the size of your data and not have your computer quit on you. Some calculations can be done in a blocked fashion but some calculations require you to access different parts of the data all at once.

Python's mmap module is an excellent way to let someone else do the dirty work of handling data files that are comparable or larger than available memory.

import mmap
import numpy

@profile
fin = open('../Data/human_chrom_11.smalla', 'r+b')
y = x[numpy.random.randint(0,len(x))]
print y

@profile
def map_data():
fin = open('../Data/human_chrom_11.smalla', 'r+b')
x = mmap.mmap(fin.fileno(), 0)
y = x[numpy.random.randint(0,len(x))]
print y

map_data()


Here the .smalla data files are simply somewhat large files that can be loaded into memory (which we do for illustration purposes) but which we'd rather not. Running this code with memory_profiler

python -m memory_profiler test.py


tells us:

Filename: test.py

Line #    Mem usage    Increment   Line Contents
================================================
5   16.922 MiB    0.000 MiB   @profile
7   16.926 MiB    0.004 MiB     fin = open('../Data/human_chrom_11.smalla', 'r+b')
8  145.680 MiB  128.754 MiB     x = fin.read()
9  145.691 MiB    0.012 MiB     y = x[numpy.random.randint(0,len(x))]
10  145.691 MiB    0.000 MiB     print y

Filename: test.py

Line #    Mem usage    Increment   Line Contents
================================================
12   16.941 MiB    0.000 MiB   @profile
13                             def map_data():
14   16.941 MiB    0.000 MiB     fin = open('../Data/human_chrom_11.smalla', 'r+b')
15   16.945 MiB    0.004 MiB     x = mmap.mmap(fin.fileno(), 0)
16   16.953 MiB    0.008 MiB     y = x[numpy.random.randint(0,len(x))]
17   16.953 MiB    0.000 MiB     print y


As we can see from the 'increment' column, when we map the data we hardly use any memory at all compared to the 128 MB that we heat up when we load the data into memory at once.

We should keep in mind that we have traded off space for time here. Even with a SSD operating on data from disk is going to take much longer than operating on data that is all in memory, but at least we are able to do it.

Now, if you want both fast access and operability with limited RAM, you need a much larger bag of tricks which I don't have and which often heavily depends on tricks you can do with your particular data.

Python makes printing logger messages, and adjusting the logger level (which messages to print when) very easy. However, it seems, that the logger code comes with a higher overhead than if you used simple 'if' statements to control what messages to print.

Logger messages are very, very useful. Two typical uses of logger messages is to debug things when a program goes awry and to print out the progress of a program (if the user wishes it) to reassure us that the program is running and inform us of how much further the program has to go.

Python's logger is a lot of fun because it is very easy to set up logging messages at different levels and then filter which messages you will actually show. For example you can code the program to print out values of some variables at the  'DEBUG' level and print out the progress of the program at the 'INFO' level. Then you can instruct the code to print out only the INFO messages or both the DEBUG and INFO messages.

It's very likely that actually printing logging messages slows down the program, but I began to wonder what kind of overhead even the ignored logger messages were introducing to the program. One could think that placing a debug logger message in a loop is a good debug tool which can be conveniently switched off when in production with little effect, but how true is that?

import logging
logger = logging.getLogger("Meta")

def foo(N):
sum = 0
for n in range(N):
sum += n
logger.debug('This is message {:d}'.format(n))
x = sum**2

def bar(N):
sum = 0
for n in range(N):
sum += n
x = sum**2

def foobar(N, flag=False):
sum = 0
for n in range(N):
sum += n
if flag:
logger.debug('This is message {:d}'.format(n))
x = sum**2

if __name__ == "__main__":
import timeit
print 'No logger message: ', timeit.timeit('bar(100)', setup='from __main__ import logging, bar; logging.basicConfig(level=logging.ERROR)', number=1000)
print 'Suppressed logger message: ', timeit.timeit('foo(100)', setup='from __main__ import logging, foo; logging.basicConfig(level=logging.ERROR)', number=1000)
print 'Logger message behind false if statement: ', timeit.timeit('foobar(100)', setup='from __main__ import logging, foobar; logging.basicConfig(level=logging.ERROR)', number=1000)
print 'Logger message behind true if statement: ', timeit.timeit('foobar(100, True)', setup='from __main__ import logging, foobar; logging.basicConfig(level=logging.ERROR)', number=1000)



Which gives us

No logger message:  0.00471091270447
Suppressed logger message:  0.140299081802
Logger message behind false if statement:  0.00491690635681
Logger message behind true if statement:  0.133023023605


This is interesting. It shows that the overhead for the suppressed logging statement is very high, and is in fact higher than that for a simple if statement - that is if you had code where you suppressed printing of a log message using an if statement you would waste less time than if you relied on logger's own logic to handle log levels.

This is a little disappointing.

UPDATE: Following Anon's suggestion to use
logger.debug('This is message %d', n)
I found that it did indeed reduce the logger overhead, but not by much. The new times are:

No logger message:  0.0054669380188
Suppressed logger message:  0.0824329853058
Logger message behind false if statement:  0.00509595870972
Logger message behind true if statement:  0.0940999984741


## Tuesday, April 8, 2014

### Bash: print commands as they execute

I'm making a demo script that calls programs from a toolchain I am developing. The idea of the script is that it runs a set of commands in sequence and also prints what the command will be doing. Say the base script was:

python do_this.py

I first put in echo to describe the command

echo 'Now we do blah blah'
python do_this.py


But, I wanted to see the commands too. I discovered the set -x command, which starts the debug mode and prints out every command as it is executed.

set -x
echo 'Now we do blah blah'
python do_this.py


But, this now, effectively, printed out the description twice: once when printing the echo statement and then again when actually executing the echo statement. Then I discovered the No OPeration (NOP) character for bash ":".

set -x
: Now we do blah blah
python do_this.py


This prints the line but does not try to execute it.

## Thursday, April 3, 2014

### Python: passing a mix of keyword arguments and dictionary arguments to a function

So Python is cool because of keyword arguments:

def foo(a=1,b=2,c=3):
print a,b,c

foo(a=1) # -> 1 2 3


Python is cool because you can pass a dictionary whose keys match the argument names:

def foo(a=1,b=2,c=3):
print a,b,c

args = {'a': 1, 'b':2}
foo(**args) # -> 1 2 3


But, can you mix the two? Yes, yes you can!

def foo(a=1,b=2,c=3):
print a,b,c

args = {'a': 1, 'b':2}
foo(c=3, **args) # -> 1 2 3


Hmm, can we screw up the interpreter? What happens if we send the same argument as a keyword AND a dictionary?

def foo(a=1,b=2,c=3):
print a,b,c

args = {'a': 1, 'b':2}
foo(a=4, **args) # -> TypeError: foo() got multiple values for keyword argument 'a'


Nothing gets past Python, eh?

## Monday, March 31, 2014

### Washing machine hoses

When our home inspector went through he mentioned to me that I should replace the existing rubber hoses on the washing machine with steel-reinforced ones. I wondered if washing machine hoses were specially prone to fail, perhaps something to do with the intermittent nature of the load (if you look at the hoses when the washing machine runs, they jerk as the machine draws water in phases). I also began to wonder if dishwasher hoses suffered from the same problem.

A quick search brought up two documents, one from a community association underwriter - which I assume insures people with rental properties. The other document I found was from an insurance association serving hotels and inns. The first one claims that steel-reinforced hoses are no better than rubber hoses since the hoses are damaged at the connection. The other one suggests that steel-reinforced hoses may be better than regular rubber hoses but all hoses should be replaced every five years or so.

I could not find reasons why these hoses were more susceptible to rupturing and why the dish washer water connection does not suffer from the same maintenance issue.

### execfile and imported modules

I was given to believe that Python's execfile statement simply executed all the commands in a file and continued on its way. This is not entirely true, at least in Python 2.7: imported modules seem not to be handled by the execfile statement, which seems to be rather odd to me.

import numpy

def gen(n=100):
return numpy.arange(n)


This code does what you expect when you import it as a module:

In [1]: import test

In [2]: test.gen(10)
Out[2]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

And when you run it as a script to incorporate it into your workspace:

In [3]: run test

In [4]: gen(10)
Out[4]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])


However, when you use execfile to run the script you run into a snag:


In [5]: pars = {}; execfile('test.py', {}, pars); pars['gen'](10)---------------------------------------------------------------------------NameError                                 Traceback (most recent call last)<ipython-input-5-b06061c74d2b> in <module>()----> 1 pars = {}; execfile('test.py', {}, pars); pars['gen'](10)/Users/kghose/Code/Mitty/Test/test.py in gen(n)      3       4 def gen(n=100):----> 5   return numpy.arange(n)NameError: global name 'numpy' is not defined

Whaaaaa?

If, instead, you use the fascinating standard Python imp module, you get:


In [7]: import imp; mod = imp.load_source('test','./test.py', open('test.py','r'))In [8]: modOut[8]: <module 'test' from './test.pyc'>In [9]: mod.gen(10)Out[9]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Things run as they should.

## Tuesday, March 25, 2014

### Parameter files in Python for Python programs

I often write and use Python programs that require input parameters to tell them what to do. When there are a few parameters I usually pass them through the command line (my current favorite framework for command line argument parsing is the amazing docopt module). When there are many parameters, however, command line arguments can get cumbersome.

In the past I have looked to configuration markup languages (YAML comes to mind) and have also used microsoft excel format files (using the excellent xlrd package). Using proprietary formats makes my stomach upset, but there are no good solutions for the odf counterparts.

Recently I have started to experiment with storing parameter files in python and reading them into my code.

__import__(filename) works and puts the parameters into a separate name space, so you can access them as filename.x, filename.y etc. However, the file needs to be in the module search path, and it feels a bit of a misuse.

A better solution, I find, is to use execfile to read the parameter file into the code. There are no path restrictions, execfile IS meant to execute code (so no misuse there) and the variables from the Python file are loaded into a separate dictionary so they are contained and do not spill all over your code.

#Main file
pars = {}
execfile('params.py', {}, pars)


With params.py looking like

a = 43
b = 'a string'
c = {'d': 23}


We get:

pars --> {'a': 43, 'b': 'a string', 'c': {'d': 23}}


We can put in comments in the file and structure however we need which I find very convenient.

execfile has disappeared in Python 3 but you can use

with open(filename) as f:


UPDATE: However, see a funny issue with execfile here.

## Monday, March 24, 2014

### git subtree split

Problem: You have a git hub repository with code in several folders. You want to move one of the sub folders into its own separate repository (for example, it was experimental code you were working on, which you now want to spin off into its own life)

Solution: Use git subtree split as detailed here (look for the answer that says The Easy Way™)

## Wednesday, March 19, 2014

### Another reason to love PyCharm

The latest version of PyCharm can run the pep8.py tool on your code to flag coding style violations on the fly. I thought this was cool, but it can be annoying sometimes. This is because pep8 consists of coding guidelines and some of them are a matter of taste. My personal quirk is that I like to use 2 spaces for indents instead of 4.

When I turn on PyCharm’s pep8 checking all my code gets underlined because I’m using two spaces for indents. However, I discovered that under Preferences->Inspections->PEP 8 Coding style violation and there is a properties list labelled “Ignore” where you can type in pep8 violations to be ignored.

UPDATE: From this very informative post here and a response from one of PyCharm's programmers there: Go to the error, click on the light bulb icon and then select 'Ignore errors like this'. This will automatically add a proper exception to this list. For those interested the list of error codes is here.

I typed in ‘indentation’ and was delighted to note that this made those annoying flags disappear (This should not have worked). Also, you can change the severity of the violation flag - I changed it from warning to typo, which is less visually distracting.