Wednesday, April 30, 2014

How to tell if a property is a rental

From a thread here:

  1. The most accurate way is to physically go to the place and politely ask the residents, making clear that you are considering buying and are trying to determine how many units are rentals.
  2. Online, look up the property in the assessor's database: if the owner's mailing address is different from the house address, it is almost certainly not the owner's primary residence, which may mean it is rented out.
  3. Online, if you can see the property tax amount: many towns in Mass. have a residential exemption that applies to primary residences only, so if the tax seems low for the assessed value, the property is likely a primary residence.

Tuesday, April 29, 2014

What makes marine grade wire special?

I was in the market for plain old 14 AWG wire for a project when I came across spools of wire being sold as "tinned marine grade wire". I was curious to know what made this wire "marine grade". From this page, it turns out, marine grade wire is:

  1. About 12% bigger for a given wire gauge, so 14 AWG marine wire is thicker than regular 14 AWG wire.
  2. Made of finer strands, each individually coated with tin, making it resistant to corrosion.
  3. Insulated with a jacket that is oil, moisture and heat resistant.
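To put point 1 in numbers, here is a quick sketch of my own (not from the linked page): wire diameter follows the standard AWG formula, and if we read "12% bigger" as cross-sectional area we can see how much extra copper that buys.

```python
import math

def awg_diameter_mm(n):
    # Standard AWG formula: d = 0.127 mm * 92^((36 - n) / 39)
    return 0.127 * 92 ** ((36 - n) / 39.0)

d14 = awg_diameter_mm(14)
area14 = math.pi * (d14 / 2) ** 2

print(round(d14, 3))            # regular 14 AWG diameter, ~1.628 mm
print(round(area14 * 1.12, 3))  # a 12% larger cross-section, in mm^2
```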

Monday, April 28, 2014

Breaking BAM

It's fun to mess with our bioinformatics tools and laugh at ourselves. The BAM format carries a query name string. In an idle moment I wondered, how long a string can I put here before bad things happen to the BAM?

Generally this string carries information related to the device that produced the read, the lane number, the run number and so on and so forth. It's completely specification free and everyone encodes information here in their own way, so you would think it could be an arbitrary length string. Almost.

Try out the short Python snippet below. It generates a dummy fasta file (test.fa) and a dummy BAM file (test.bam) with one 'aligned' read. The only funky thing is that you can make the qname field of the read as long as you want (by varying the lone input parameter to the script).

import sys
import pysam

try:
  qnamelen = int(sys.argv[1])
except (IndexError, ValueError):
  qnamelen = 255

# A dummy one-sequence reference
with open('test.fa', 'w') as f:
  f.write('>test\n')
  f.write('A'*100)

bam_hdr = {'HD': {'VN': '1.4'},
           'SQ': [{'LN': 100, 'SN': 'test'}]}
bf = pysam.Samfile('test2.bam', 'wb', header=bam_hdr)  # Write binary BAM with header
ar = pysam.AlignedRead()
ar.qname = 'Q'*qnamelen  # Gives Tablet headaches when str len > 254
ar.seq = 'A'*50
ar.qual = '~'*50
ar.mapq = 100
ar.pos = 0
ar.cigarstring = '50M'
bf.write(ar)
bf.close()
pysam.sort('test2.bam', 'test')  # Writes the sorted file as test.bam
pysam.index('test.bam')

Now open the pair of fasta and BAM files in samtools tview, Tablet and IGV (and indeed in anything else). Laugh loudly until your office mates banish you.


Only samtools tview - the least fancy one - loads the alignment without batting an eyelid.

The SAM specs allow an arbitrary length null terminated string for the query name, though the BAM encoding stores the name length (terminating NUL included) in a single byte, which is where the 254-character limit comes from.

(Oh, as an aside, try this. Create a .fasta file with this simple corruption: remove the newline after the sequence id, so that the sequence looks something like 'testAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'. Load the sequence in IGV. Now fix the sequence and try to convince IGV to load the corrected version. Hint: it won't go away until you use igvtools to reindex the fasta file; running bwa index won't help.)

Saturday, April 26, 2014

Adjusting sticking doors

Several of the doors in ye old house were sticking at the top on the latch side: the door sat too high, or the top of the frame was drooping. My solution, in each case, was to use a chisel to deepen the lower hinge mortise on the door frame. This lets the bottom of the door move toward the hinge side, which swings the top away from the frame. I could also have shimmed the top hinge. Indeed, for one of the doors I shimmed the top hinge, deepened the lower hinge mortise AND chiseled away a bit of the top corner.

Wednesday, April 23, 2014

Bash conditionals

#!/bin/bash
set -x
MYVAR=3
if [ ${MYVAR} = 1 ]; then
    : First option
elif [ ${MYVAR} = 2 ]; then
    : Second option
elif [ ${MYVAR} = 3 ]; then
    : Third option
fi

->

+ MYVAR=3
+ '[' 3 = 1 ']'
+ '[' 3 = 2 ']'
+ '[' 3 = 3 ']'
+ : Third option

You need to watch out for whitespace: '[' is a command, so bash requires a space between it and its arguments.

#!/bin/bash
set -x
MYVAR=3
if [${MYVAR} = 1 ]; then
    : First option
elif [ ${MYVAR} = 2 ]; then
    : Second option
elif [ ${MYVAR} = 3 ]; then
    : Third option
fi

->

+ MYVAR=3
+ '[3' = 1 ']'
./test.sh: line 4: [3: command not found
+ '[' 3 = 2 ']'
+ '[' 3 = 3 ']'
+ : Third option


Very, very sensitive to whitespace

#!/bin/bash
set -x
MYVAR=3
if [ ${MYVAR}=1 ]; then
    : First option
elif [ ${MYVAR} = 2 ]; then
    : Second option
elif [ ${MYVAR} = 3 ]; then
    : Third option
fi

->

+ MYVAR=3
+ '[' 3=1 ']'
+ : First option

No error this time, but the result is wrong: without the spaces, ${MYVAR}=1 expands to the single word 3=1, and a one-argument [ ... ] test succeeds whenever that string is non-empty, so the first branch is taken no matter what MYVAR holds.

Monday, April 14, 2014

The magic of mmap

Big data is sometimes described as data whose size is larger than your available RAM. I think this is a good criterion because once the size of your data (or the size of any results of computing on it) starts to approach your RAM size, you have to start worrying about how you are going to manage memory. If you leave it up to your OS, you are going to be reading from and writing to disk in somewhat unpredictable ways and, depending on the software you use, your program might just quit with no warning or with a courtesy 'Out of memory' message. The fun challenge of "Big Data" is, of course, how to keep doing computations regardless of the size of your data and not have your computer quit on you. Some calculations can be done in a blocked fashion, but others require you to access different parts of the data all at once.

Python's mmap module is an excellent way to let someone else do the dirty work of handling data files that are comparable or larger than available memory.

import mmap
import numpy


@profile
def load_data():
  fin = open('../Data/human_chrom_11.smalla', 'r+b')
  x = fin.read()  # Slurps the whole file into memory
  y = x[numpy.random.randint(0, len(x))]
  print y

@profile
def map_data():
  fin = open('../Data/human_chrom_11.smalla', 'r+b')
  x = mmap.mmap(fin.fileno(), 0)  # Maps the file; pages are loaded on demand
  y = x[numpy.random.randint(0, len(x))]
  print y

load_data()
map_data()

Here the .smalla data files are simply somewhat large files that can be loaded into memory (which we do purely for illustration) but that we would rather not load. Running this code with memory_profiler

python -m memory_profiler test.py

tells us:

Filename: test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5   16.922 MiB    0.000 MiB   @profile
     6                             def load_data():
     7   16.926 MiB    0.004 MiB     fin = open('../Data/human_chrom_11.smalla', 'r+b')
     8  145.680 MiB  128.754 MiB     x = fin.read()
     9  145.691 MiB    0.012 MiB     y = x[numpy.random.randint(0,len(x))]
    10  145.691 MiB    0.000 MiB     print y


Filename: test.py

Line #    Mem usage    Increment   Line Contents
================================================
    12   16.941 MiB    0.000 MiB   @profile
    13                             def map_data():
    14   16.941 MiB    0.000 MiB     fin = open('../Data/human_chrom_11.smalla', 'r+b')
    15   16.945 MiB    0.004 MiB     x = mmap.mmap(fin.fileno(), 0)
    16   16.953 MiB    0.008 MiB     y = x[numpy.random.randint(0,len(x))]
    17   16.953 MiB    0.000 MiB     print y

As we can see from the 'increment' column, when we map the data we hardly use any memory at all, compared to the 128 MB that we eat up when we load the whole file into memory at once.

We should keep in mind that we have traded speed for memory here. Even with an SSD, operating on data from disk is going to take much longer than operating on data that is all in memory, but at least we are able to do it.
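A related trick (my addition, not from the original post): numpy.memmap layers the ndarray interface over mmap, so you can slice and reduce a file-backed array without ever loading it. A minimal sketch with a small stand-in file:

```python
import numpy as np

# A small binary file standing in for a big data file
np.arange(1000, dtype=np.float64).tofile('demo.dat')

# Map the file instead of loading it; pages are read on demand
m = np.memmap('demo.dat', dtype=np.float64, mode='r')

print(m[500])        # indexing touches only the pages it needs -> 500.0
print(m[:10].sum())  # reductions work like on a normal ndarray -> 45.0
```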


Now, if you want both fast access and operability with limited RAM, you need a much larger bag of tricks, which I don't have, and which often depends heavily on what you can do with your particular data.



The logging overhead

Python makes printing logger messages, and adjusting the logger level (which messages to print when), very easy. However, it seems that the logger machinery carries a higher overhead than simple 'if' statements used to control which messages to print.

Logger messages are very, very useful. Two typical uses are to debug things when a program goes awry, and to print out the progress of a program (if the user wishes it), reassuring us that the program is running and telling us how much further it has to go.

Python's logger is a lot of fun because it is very easy to set up logging messages at different levels and then filter which messages are actually shown. For example, you can code the program to print out the values of some variables at the 'DEBUG' level and the progress of the program at the 'INFO' level. Then you can instruct the code to print out only the INFO messages, or both the DEBUG and INFO messages.

It's very likely that actually printing logging messages slows down the program, but I began to wonder what kind of overhead even the ignored logger messages introduce. One could think that placing a debug logger message in a loop is a convenient debug tool that can be switched off in production with little effect, but how true is that?


import logging
logger = logging.getLogger("Meta")


def foo(N):
  # Loop with a logger call on every iteration
  total = 0
  for n in range(N):
    total += n
    logger.debug('This is message {:d}'.format(n))
  x = total**2

def bar(N):
  # Same loop, no logging at all
  total = 0
  for n in range(N):
    total += n
  x = total**2

def foobar(N, flag=False):
  # Same loop, logger call guarded by a plain if
  total = 0
  for n in range(N):
    total += n
    if flag:
      logger.debug('This is message {:d}'.format(n))
  x = total**2


if __name__ == "__main__":
  import timeit
  print 'No logger message: ', timeit.timeit('bar(100)', setup='from __main__ import logging, bar; logging.basicConfig(level=logging.ERROR)', number=1000)
  print 'Suppressed logger message: ', timeit.timeit('foo(100)', setup='from __main__ import logging, foo; logging.basicConfig(level=logging.ERROR)', number=1000)
  print 'Logger message behind false if statement: ', timeit.timeit('foobar(100)', setup='from __main__ import logging, foobar; logging.basicConfig(level=logging.ERROR)', number=1000)
  print 'Logger message behind true if statement: ', timeit.timeit('foobar(100, True)', setup='from __main__ import logging, foobar; logging.basicConfig(level=logging.ERROR)', number=1000)


Which gives us

No logger message:  0.00471091270447
Suppressed logger message:  0.140299081802
Logger message behind false if statement:  0.00491690635681
Logger message behind true if statement:  0.133023023605

This is interesting. It shows that the overhead of a suppressed logging statement is very high, and in fact higher than that of a simple if statement: if you suppressed the printing of a log message with your own if statement, you would waste far less time than if you relied on the logger's own logic to handle log levels.

This is a little disappointing.

UPDATE: Following Anon's suggestion to use
logger.debug('This is message %d', n)
(which defers the string formatting until the message is actually emitted) I found that it did indeed reduce the logger overhead, though the suppressed call is still an order of magnitude slower than the plain if statement. The new times are:

No logger message:  0.0054669380188
Suppressed logger message:  0.0824329853058
Logger message behind false if statement:  0.00509595870972
Logger message behind true if statement:  0.0940999984741
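For what it's worth, the standard library offers a middle road: Logger.isEnabledFor performs just the level check, so you can guard the debug call yourself without inventing your own flag. A sketch using the same loop shape as above:

```python
import logging

logger = logging.getLogger("Meta")
logging.basicConfig(level=logging.ERROR)  # DEBUG messages are suppressed

def guarded(N):
  total = 0
  for n in range(N):
    total += n
    # Pay for the debug call (and its formatting) only when DEBUG is active
    if logger.isEnabledFor(logging.DEBUG):
      logger.debug('This is message %d', n)
  return total**2

print(guarded(100))  # -> 24502500
```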

Tuesday, April 8, 2014

Bash: print commands as they execute

I'm making a demo script that calls programs from a toolchain I am developing. The idea is that the script runs a set of commands in sequence and also prints what each command will be doing. Say the base script was:

python do_this.py

I first put in an echo to describe the command:

echo 'Now we do blah blah'
python do_this.py


But I wanted to see the commands too. I discovered set -x, which turns on execution tracing and prints every command as it is executed.

set -x
echo 'Now we do blah blah'
python do_this.py


But this, effectively, printed the description twice: once when set -x traced the echo command and again when the echo actually executed. Then I discovered bash's no-operation (NOP) command, ":".

set -x
: Now we do blah blah
python do_this.py

set -x traces the line, but the ':' command itself does nothing with its arguments.

Thursday, April 3, 2014

Python: passing a mix of keyword arguments and dictionary arguments to a function

So Python is cool because of keyword arguments:

def foo(a=1,b=2,c=3):
  print a,b,c

foo(a=1) # -> 1 2 3

Python is cool because you can pass a dictionary whose keys match the argument names:

def foo(a=1,b=2,c=3):
  print a,b,c

args = {'a': 1, 'b':2}
foo(**args) # -> 1 2 3

But, can you mix the two? Yes, yes you can!

def foo(a=1,b=2,c=3):
  print a,b,c

args = {'a': 1, 'b':2}
foo(c=3, **args) # -> 1 2 3

Hmm, can we screw up the interpreter? What happens if we send the same argument as a keyword AND in a dictionary?

def foo(a=1,b=2,c=3):
  print a,b,c

args = {'a': 1, 'b':2}
foo(a=4, **args) # -> TypeError: foo() got multiple values for keyword argument 'a'

Nothing gets past Python, eh?
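One more wrinkle, my addition in Python 3 syntax: positional arguments mix with an unpacked dictionary the same way, and the same 'multiple values' error fires when a positional argument and a dictionary key collide:

```python
def foo(a=1, b=2, c=3):
    return (a, b, c)

print(foo(10, **{'b': 20}))  # positional a, dict b -> (10, 20, 3)

try:
    foo(10, **{'a': 99})     # 'a' arrives both positionally and via the dict
except TypeError as e:
    print('TypeError:', e)
```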