Skip to main content

bcolz, numpy and plain Python

So we have these big data sets that people say are the code of life. In a simple representation they add up to about 3 GB of space. While it is quite possible to load up the whole human genome into RAM and have capacity to spare, I was wondering if there were painless ways to access compressed forms of the data. This lead me to Mikhail Korobov's blog and to bcolz. Here's a quick shoot out between plain Python strings, numpy and bcolz as used to store and access human chromosome 11 which is about 135 million base pairs long:




My take away, for my particular application is that I will stick to plain old Pythons strings - I tend to take short slices of the data (about 100 basepairs long) and that puts me in the range where Python strings are faster than numpy arrays. Memory is cheap.

Code follows:

"""Access times for single characters along the length of the string and for 100 item slices along the length of the
string."""
import pylab
import numpy
import bcolz

chrom_11_str = None  # Ugh, globals for timeit
chrom_11_bcolz = None
chrom_11_numpy = None


def load_data():
  global chrom_11_str, chrom_11_bcolz, chrom_11_numpy
  with open('chr11.fa', 'r') as fp:
    _ = fp.readline()  # The header
    chrom_11_str = ''.join([ln.strip() for ln in fp.readlines()])
  chrom_11_numpy = numpy.fromstring(chrom_11_str, dtype='c')
  chrom_11_bcolz = bcolz.carray(chrom_11_numpy)


def time_things():
  from timeit import timeit
  n_loops = 1000
  return [[[timeit("a = {:s}[0:{:d}]".format(dta, n), "from __main__ import {:s}".format(dta), number=n_loops) / n_loops
           for n in [10 ** m for m in range(6)]],
          [timeit("a = {:s}[100000000:100000000+{:d}]".format(dta, n), "from __main__ import {:s}".format(dta), number=n_loops) / n_loops
           for n in [10 ** m for m in range(6)]]] for dta in ['chrom_11_str', 'chrom_11_bcolz', 'chrom_11_numpy']]


def plot_things(t):
  #t_str, t_bcolz, t_numpy = t
  sl_size = [10 ** m for m in range(6)]
  location = ['(start)', '(end)']
  for n in [0, 1]:
    pylab.subplot(1,2,n+1)
    for t_value, t_name in zip(t, ['str', 'bcolz', 'numpy']):
      pylab.plot(sl_size, t_value[n], label=t_name + location[n])
      #pylab.plot(t_value[1], 'b', label=t_name + '(finish)')
    pylab.setp(pylab.gca(), xlabel='Slice size', ylabel='Time (s)', xscale='log', yscale='log', ylim=[1e-9, 1e-3])
    pylab.legend(loc='lower right')

load_data()
t = time_things()
plot_things(t)

Comments

Popular posts from this blog

Flowing text in inkscape (Poster making)

You can flow text into arbitrary shapes in inkscape. (From a hint here).

You simply create a text box, type your text into it, create a frame with some drawing tool, select both the text box and the frame (click and shift) and then go to text->flow into frame.

UPDATE:

The omnipresent anonymous asked:
Trying to enter sentence so that text forms the number three...any ideas?
The solution:
Type '3' using the text toolConvert to path using object->pathSize as necessaryRemove fillUngroupType in actual text in new text boxSelect the text and the '3' pathFlow the text

Pandas panel = collection of tables/data frames aligned by index and column

Pandas panel provides a nice way to collect related data frames together while maintaining correspondence between the index and column values:


import pandas as pd, pylab #Full dimensions of a slice of our panel index = ['1','2','3','4'] #major_index columns = ['a','b','c'] #minor_index df = pd.DataFrame(pylab.randn(4,3),columns=columns,index=index) #A full slice of the panel df2 = pd.DataFrame(pylab.randn(3,2),columns=['a','c'],index=['1','3','4']) #A partial slice df3 = pd.DataFrame(pylab.randn(2,2),columns=['a','b'],index=['2','4']) #Another partial slice df4 = pd.DataFrame(pylab.randn(2,2),columns=['d','e'],index=['5','6']) #Partial slice with a new column and index pn = pd.Panel({'A': df}) pn['B'] = df2 pn['C'] = df3 pn['D'] = df4 for key in pn.items: print pn[key] -> output …

Drawing circles using matplotlib

Use the pylab.Circle command

import pylab #Imports matplotlib and a host of other useful modules cir1 = pylab.Circle((0,0), radius=0.75, fc='y') #Creates a patch that looks like a circle (fc= face color) cir2 = pylab.Circle((.5,.5), radius=0.25, alpha =.2, fc='b') #Repeat (alpha=.2 means make it very translucent) ax = pylab.axes(aspect=1) #Creates empty axes (aspect=1 means scale things so that circles look like circles) ax.add_patch(cir1) #Grab the current axes, add the patch to it ax.add_patch(cir2) #Repeat pylab.show()