Saturday, November 30, 2013

Setting up a virtualenv to run pandas-dev

virtualenv pandas
cd pandas
source bin/activate
git clone https://github.com/pydata/pandas.git
pip install cython  # wasn't installed in the fresh virtualenv
cd pandas           # move into the cloned repo before building
python setup.py install
pip install tables
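
A quick sanity check (not part of the original notes) is to make sure the interpreter inside the virtualenv picks up the freshly built pandas rather than a system copy:

# run inside the activated virtualenv; the printed path should point
# into the virtualenv, not a system-wide site-packages
import pandas
print(pandas.__version__)
print(pandas.__file__)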

Friday, November 22, 2013

h5py and multiprocessing

The HDF5 format has been working great for me, but I ran into danger when I started to mix it with multiprocessing. It was the worst kind of danger: the intermittent error.

Here are the dangers/issues, in order of escalation:

(TL;DR: use a generator to feed data from your file into the child processes as they spawn. It's the easiest way. Read on for harder ways.)

  1. An h5py file handle can't be pickled and therefore can't be passed as an argument using pool.map()
  2. If you set the handle as a global and access it from the child processes, you run the risk of a race, which leads to corrupted reads. My personal run-in was that my code sometimes ran fine but sometimes complained that there were NaNs or Infinity in the data. That wasted some time to track down. Other people have had this kind of problem [1].
  3. You get the same problem if you pass the filename and have the different processes open individual instances of the file separately.
  4. The hard way to solve this problem is to switch your workflow over to MPI and use mpi4py or some such. The details are here. My problem is embarrassingly parallel and the computations result in heavy data reduction, so I ended up simply doing ...
  5. The easy way to solve the problem is to have a generator in the parent process that reads in the data from the file and passes it to the child processes in a just-in-time manner.
Pseudocode (read_chunk, child_process, fname, N and Nstep are assumed to be defined elsewhere):

import h5py
import multiprocessing as mp

fin = h5py.File(fname, 'r')
# the generator runs in the parent, so only plain numpy chunks get pickled
# and shipped to the workers -- the h5py handle never crosses process boundaries
X = (read_chunk(fin, [n, n + Nstep]) for n in range(0, N, Nstep))
pool = mp.Pool()
result = pool.map(child_process, X)
fin.close()
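
For completeness, here is a self-contained toy version of the same pattern; the file name, chunk size and the trivial read_chunk/child_process bodies are made up for illustration:

import h5py
import multiprocessing as mp
import numpy

def child_process(chunk):
  # stand-in for the heavy, data-reducing computation
  return chunk.mean()

def read_chunk(fin, lo, hi):
  # runs in the parent, so only a plain numpy array gets pickled to the worker
  return fin['data'][lo:hi, :]

if __name__ == '__main__':
  N, Nstep = 1000, 100
  with h5py.File('mp_demo.h5', 'w') as f:
    f.create_dataset('data', data=numpy.random.randn(N, 10))

  fin = h5py.File('mp_demo.h5', 'r')
  chunks = (read_chunk(fin, n, n + Nstep) for n in range(0, N, Nstep))
  pool = mp.Pool()
  print(pool.map(child_process, chunks))
  pool.close()
  fin.close()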

Thursday, November 21, 2013

Daemonic processes are not allowed to have children

You can use the multiprocessing module to 'farm' out a function to multiple cores using the Pool.map function. I was wondering idly if you can nest these farming operations to quickly build an exponentially growing army of processes to, you know, take over the world. It turns out that the universe has a failsafe:

import multiprocessing as mp

def compute(i):
  return i

def inner_pool(args):
  # this runs inside a Pool worker, which is a daemonic process,
  # so the mp.Pool() call below trips multiprocessing's failsafe
  n0, N = args
  pool = mp.Pool()
  return pool.map(compute, range(n0,n0+N))

pool = mp.Pool()
print pool.map(inner_pool, [(n,n+10) for n in range(10)])

# -> AssertionError: daemonic processes are not allowed to have children
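
As an aside (not something the original post suggests): if you really do want two levels of fan-out, one hedged workaround is to make the outer layer a thread pool via multiprocessing.dummy, since threads of the main (non-daemonic) process are allowed to start process Pools of their own:

import multiprocessing as mp
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool, same API

def compute(i):
  return i

def inner_pool(args):
  # runs in a thread of the main process, which is not daemonic,
  # so starting a real process Pool here is allowed
  n0, N = args
  pool = mp.Pool()
  result = pool.map(compute, range(n0, n0 + N))
  pool.close()
  return result

if __name__ == '__main__':
  outer = ThreadPool(4)
  print(outer.map(inner_pool, [(n, n + 10) for n in range(10)]))
  outer.close()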

Saturday, November 16, 2013

h5py: The resizing overhead

A cool thing about data arrays stored in HDF5 via h5py is that you can incrementally add data to them. This is the practical way of processing large data sets: you read the data in piece by piece, process it and then append the processed data to an array on disk. An interesting thing is that there is a small size overhead in the saved file associated with this resizing, compared to the same data saved all at once with no resizing of the HDF5 datasets.

I did the computations using block sizes of [10, 100, 1000, 10000] elements and block counts of [10, 100, 1000, 10000]. The corresponding matrix of overhead (excess bytes needed for the resized version over the directly saved version) looks like this (rows are block sizes, columns are block counts):
overhead ->
array([[   9264,    2064,    3792,    9920],
       [   2064,    3792,    9920,   52544],
       [   3792,    9920,   52544,  462744],
       [   9920,   52544,  462744, 4570320]])
As you can see, the matrix is symmetric, indicating that block size and resize count (number of blocks) don't interact; only the total final size of the data matters. Someone with knowledge of the internals of HDF5 and h5py will probably be able to explain why this pattern of overhead occurs and why it depends only on the total size of the data and not on the number of resizes performed (a small check of the storage layout is at the end of this post).

Also, interestingly, the simple act of marking a dataset as resizable adds space overhead to the storage (third plot). That plot shows the additional overhead incurred when saving the same sized dataset all at once, but marking one of the two as resizable.

Sidenote: if you let h5py choose the data type on its own during a dummy initialization it will pick float32 (e.g. f.create_dataset('data',(0,1), maxshape=(None,1))).

Code:
import h5py, numpy, os

def resize_test(block_sizes=[10, 100, 1000, 10000], n_blocks=[10,100,1000, 10000], dtyp=float):
  file_size = numpy.empty((len(block_sizes),len(n_blocks),2),dtype=int)
  for i,bl_sz in enumerate(block_sizes):
    for j,n_bl in enumerate(n_blocks):
      with h5py.File('no_resize.h5','w') as f:
        f.create_dataset('data',data=numpy.empty((bl_sz*n_bl,1)),dtype=dtyp)

      with h5py.File('yes_resize.h5','w') as f:
        f.create_dataset('data',(0,1), maxshape=(None,1),dtype=dtyp)
        for n in range(n_bl):
          f['data'].resize(((n+1)*bl_sz,1))
          f['data'][-bl_sz:,:] = numpy.empty((bl_sz,1),dtype=dtyp)

      with h5py.File('no_resize.h5') as f:
        print f['data'].dtype
        print f['data'].shape
        file_size[i,j,0] = os.fstat(f.fid.get_vfd_handle()).st_size

      with h5py.File('yes_resize.h5') as f:
        print f['data'].dtype
        print f['data'].shape
        file_size[i,j,1] = os.fstat(f.fid.get_vfd_handle()).st_size
  return file_size

def default_test(data_sizes=[100,1000,10000,100000,1000000,10000000,100000000], dtyp=float):
  file_size = numpy.empty((len(data_sizes),2),dtype=int)
  for i,dt_sz in enumerate(data_sizes):
    with h5py.File('no_resize.h5','w') as f:
      f.create_dataset('data',data=numpy.empty((dt_sz,1)),dtype=dtyp)

    with h5py.File('yes_resize.h5','w') as f:
      f.create_dataset('data',data=numpy.empty((dt_sz,1)), maxshape=(None,1),dtype=dtyp)

    with h5py.File('no_resize.h5') as f:
      print f['data'].dtype
      print f['data'].shape
      file_size[i,0] = os.fstat(f.fid.get_vfd_handle()).st_size

    with h5py.File('yes_resize.h5') as f:
      print f['data'].dtype
      print f['data'].shape
      file_size[i,1] = os.fstat(f.fid.get_vfd_handle()).st_size
  return file_size



data = resize_test()
overhead = data[:,:,1] - data[:,:,0]

data2 = default_test()
overhead2 = data2[:,1] - data2[:,0]
And, for those interested, here is the full data matrix (which you can get simply by running the code above):
In [170]: data
Out[170]: 
array([[[     2944,     12208],
        [    10144,     12208],
        [    82144,     85936],
        [   802144,    812064]],

       [[    10144,     12208],
        [    82144,     85936],
        [   802144,    812064],
        [  8002144,   8054688]],

       [[    82144,     85936],
        [   802144,    812064],
        [  8002144,   8054688],
        [ 80002144,  80464888]],

       [[   802144,    812064],
        [  8002144,   8054688],
        [ 80002144,  80464888],
        [800002144, 804572464]]])
And I took a shortcut to making the plots by pasting the computed data in:
x1 = [100, 1000, 10000, 100000, 1000000, 10000000, 100000000]
x2 = [2944, 10144, 82144, 802144, 8002144, 80002144, 800002144]
y = [9264, 2064, 3792, 9920, 52544, 462744, 4570320]

x3 = [100,1000,10000,100000,1000000,10000000,100000000]
y3 = overhead2

pylab.figure(figsize=(12,4))
pylab.subplots_adjust(bottom=0.15,left=0.1,right=0.97,top=0.9)

pylab.subplot(1,3,1)
pylab.loglog(x1,y,'k.-',lw=5)
pylab.xlabel('Elements')
pylab.ylabel('Overhead (bytes)')
pylab.title('Resized')
pylab.setp(pylab.gca(),ylim=[1e3,1e7])

pylab.subplot(1,3,2)
pylab.loglog(x2,y,'k.-',lw=5)
pylab.xlabel('File size (bytes)')
pylab.ylabel('Overhead (bytes)')
pylab.title('Resized')
pylab.setp(pylab.gca(),ylim=[1e3,1e7])

pylab.subplot(1,3,3)
pylab.loglog(x3,y3,'k.-',lw=5)
pylab.xlabel('Elements')
pylab.ylabel('Overhead (bytes)')
pylab.title('No actual resizing')
pylab.setp(pylab.gca(),ylim=[1e3,1e7])
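
As a closing, hedged aside on the question raised above: a resizable dataset has to be stored in chunked rather than contiguous layout, and the chunk index presumably accounts for at least part of the overhead. The layout difference itself is easy to inspect (file names reused from the code above):

import h5py, numpy

with h5py.File('no_resize.h5', 'w') as f:
  f.create_dataset('data', data=numpy.empty((10000, 1)))
with h5py.File('yes_resize.h5', 'w') as f:
  f.create_dataset('data', (0, 1), maxshape=(None, 1))

with h5py.File('no_resize.h5', 'r') as f:
  print(f['data'].chunks)  # None -> contiguous layout
with h5py.File('yes_resize.h5', 'r') as f:
  print(f['data'].chunks)  # a chunk shape -> chunked layout, which needs an index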

Friday, November 15, 2013

h5py: the HDF file indexing overhead

Storing numpy arrays in hdf5 files using h5py is great, because you can load parts of the array from disk. One thing to note is that there is a varying amount of time overhead depending on the kind of indexing you use.

It turns out that it is fastest to use standard Python slice notation - [:20,:] - which grabs well defined contiguous sections of the array.

If we use an array of consecutive numbers as an index we get an additional time overhead simply for using this kind of index.

If we use an array of non-consecutive numbers (note that the indices have to be monotonic and non-repeating) we get yet another time overhead, even above the array of consecutive indices.

Just something to keep in mind when implementing algorithms (a possible workaround is sketched after the timings below).


import numpy, h5py

N = 1000
m = 50
f = h5py.File('index_test.h5','w')
f.create_dataset('data', data=numpy.random.randn(N,1000))
idx1 = numpy.array(range(m))
idx2 = numpy.array(range(N-m,N))
idx3 = numpy.random.choice(N,size=m,replace=False)
idx3.sort()

# (IPython) timings of the different indexing styles
timeit f['data'][:m,:]
timeit f['data'][-m:,:]
timeit f['data'][idx1,:]
timeit f['data'][idx2,:]
timeit f['data'][idx3,:]
f.close()


# N = 1000
#-> 1000 loops, best of 3: 279 µs per loop
#-> 1000 loops, best of 3: 281 µs per loop
#-> 1000 loops, best of 3: 888 µs per loop
#-> 1000 loops, best of 3: 891 µs per loop
#-> 1000 loops, best of 3: 1.27 ms per loop

# N = 10000
#-> 1000 loops, best of 3: 258 µs per loop
#-> 1000 loops, best of 3: 258 µs per loop
#-> 1000 loops, best of 3: 893 µs per loop
#-> 1000 loops, best of 3: 892 µs per loop
#-> 1000 loops, best of 3: 1.3 ms per loop
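
One possible workaround (a sketch, not something timed in the original post): if the non-consecutive indices span a reasonably small range, read the covering contiguous slice once and do the fancy indexing on the in-memory array:

import numpy, h5py

with h5py.File('index_test.h5', 'r') as f:
  idx3 = numpy.sort(numpy.random.choice(1000, size=50, replace=False))
  lo, hi = idx3[0], idx3[-1] + 1
  # one contiguous read from disk, then cheap numpy fancy indexing in memory
  subset = f['data'][lo:hi, :][idx3 - lo, :]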

Octoshape (octoshapepm)

I was first alerted to this when my mac asked me if I wanted to allow octoshapepm to accept incoming connections. A web search led me to the interesting finding that CNN is basically installing a program that uses your computer to redistribute its content, without really telling you that it is doing so.

The program itself is made by this company. This article gives a brief non-marketing overview of what the program actually does and how to get rid of it if you wish.

In short, as installed by CNN, the program acts as a realtime file distribution system, like bittorrent, except that it's probably running without your permission, helping CNN deliver content using part of your bandwidth (you are uploading video data just as you are downloading it). There are security issues with this, in addition to the issue of principle: you are most likely being tricked into giving up part of your bandwidth to save CNN some money, while also exposing a new security hole.

I note that on my mac the program exits when I'm not watching a CNN stream, but I don't know if that is because I denied the program access to the internet (i.e. I don't know whether octoshape would continue to run after I quit if there were other clients pulling data from my computer).

Anyhow, the bottom line is that you should be aware that this program runs when you watch certain CNN streams, has access to your computer, and uploads data from your computer to other users.

Sunday, November 10, 2013

Annoying bug in PyTables (affects Big Data analysis with Pandas)

Pandas HDF5 interface through PyTables is awesome because it allows you to select and process small chunks of data from a much larger data file stored on disk. PyTables, however, has an annoying and subtle bug and I just wanted to point you to it so that you don't have to spend hours debugging code like I did.

In short, if you have a DataFrame and a column of that DF starts with a NaN, any select statements that you run with conditions on that column will return empty (you won't get any results back, ever). There is a workaround, but I chose to use a dummy value instead.
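
A minimal sketch of the symptom as I understand it (hypothetical file and frame, using the same old to_hdf/select style as the example further down):

import pandas as pd, numpy

df = pd.DataFrame({'a': [numpy.nan, 1.0, 2.0, 3.0], 'b': [0, 1, 2, 3]})
df.to_hdf('nan_test.h5', 'df', table=True, data_columns=['a'])

store = pd.HDFStore('nan_test.h5')
# 'a' starts with a NaN, so this select reportedly comes back empty
# even though rows with a > 1 clearly exist
print(store.select('df', where=['a>1']))
store.close()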

This shook my confidence in Pandas as an analysis platform a bit (though it is really PyTables' fault).

R

I like learning languages, and after a little kerfuffle with a Python package I was wondering if there was anything out there for statistical data analysis that might not have so many hidden pitfalls in ordinary places. I knew about R from colleagues but had never paid much attention to it; I decided to give it a whirl. Here are some brief preliminary notes, in no particular order.

PLUS
  • Keyword arguments!
  • Gorgeous plotting
  • Integrated workspace (including GUI package manager)
  • Very good documentation and help
  • NaN different from NA
  • They have their own Journal. But what do you expect from a bunch of mathematicians?
  • Prints large arrays on multiple lines, with the index of the first element on each line shown in the left gutter
  • Parenthesis autocomplete on command line
  • RStudio, though the base distribution is pretty complete, with package manager, editor and console.

MINUS
  • Everything is a function. I love this, but it means commands in the interpreter always need parentheses. I'd gotten used to the Python REPL not requiring parentheses.
  • The assignment operator is two characters rather than one
  • Indexing starts from 1. Oh god, could we PLEASE standardize this either way?
  • Not clear how well R handles "big data" (Data that can't be loaded into memory at once) or parallelization. (To look up: bigmemory)

Tuesday, November 5, 2013

Big Data, Small Data and Pandas

Welp, I finally got this through my thick head, thanks to a hint by Jeff who answered my cry for help on stack overflow, and pointed me to this thread on the pandas issues list.

So here's my use case again: I have small data and big data. Small data is relatively lightweight, heterogeneous, table-type data. Big data is potentially gigabytes in size, homogeneous data. Conditions on the small data table are used to select rows, which then indicate the subset of the big data needed for further processing.

Here's one way to do things: (Things to note: saving in frame_table format, common indexing, use of 'where' to select the big data)
import pandas as pd, numpy

df = pd.DataFrame(data=numpy.random.randint(10,size=(8,4)),columns=['a','b','c','d'])
df.to_hdf('data.h5','small',table=True,data_columns=['a','b'])
df1 = pd.DataFrame(data=numpy.random.randint(10,size=(8,20)),index=df.index)
df1.to_hdf('data.h5','big',table=True)
df2 = pd.Panel(data=numpy.random.randint(10,size=(8,20,5)),items=df.index)
df2.to_hdf('data.h5','big_panel',table=True)
df3 = pd.Panel4D(data=numpy.random.randint(10,size=(8,20,5,5)),labels=df.index)
df3.to_hdf('data.h5','big_panel4',table=True)

store = pd.HDFStore('data.h5')
print store

row = store.select('small',where=['a>2','b<5'],columns=['a','b'])
print 'Small data:'
print row

da = store.select('big',pd.Term(['index',row.index]))
print 'Big data:' 
print da

da = store.select('big_panel',pd.Term(['items',row.index]))
print 'Big data (Panel):'
print da.items

da = store.select('big_panel4',pd.Term(['labels',row.index]))
print 'Big data (Panel4d):'
print da.labels
store.close()
With a sample output of:
<class 'pandas.io.pytables.HDFStore'>
File path: data.h5
/big                   frame_table  (typ->appendable,nrows->8,ncols->20,indexers->[index])                       
/big_panel             wide_table   (typ->appendable,nrows->100,ncols->8,indexers->[major_axis,minor_axis])      
/big_panel4            wide_table   (typ->appendable,nrows->500,ncols->8,indexers->[items,major_axis,minor_axis])
/small                 frame_table  (typ->appendable,nrows->8,ncols->4,indexers->[index],dc->[a,b])              
Small data:
   a  b
3  6  2
5  9  4
Big data:
   0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19
3   9   1   1   4   0   3   2   8   3   4   2   9   9   7   0   4   5   2   5   0
5   0   5   3   5   4   3   4   5   5   9   9   8   6   3   8   0   5   8   8   4
Big data (Panel):
Int64Index([3, 5], dtype=int64)
Big data (Panel4d):
Int64Index([3, 5], dtype=int64)

In the Pandas issue thread there is a very interesting monkey patch that can return arbitrary data based on our small data selection, but I generally shy away from monkey patches as they are hard to debug a few months after the code was written.

Linux: install pytables

pip install numexpr --user
pip install cython
pip install tables

Monday, November 4, 2013

Pulling/pushing data to a samba server from Linux

smbclient (its atrociously formatted man page is here) will let you do what ftp let you do, which is to get and put files between your local machine and a samba server.

My use case is that I have a high performance cluster (Partners' Linux High Performance Computing cluster) that I want to run my code on (remoteA) while my data is on another server (remoteB) that seems to only allow access through samba and refuses ssh and scp requests.

The solution turns out to be to use smbclient, which seems to behave just like the ftp clients of old.

ssh into remoteA

smbclient \\\\{machine name}\\{share name} -D {my directory} -W {domain}

(The multiple backslashes turn out to be vital)
You'll end up at an smbclient prompt. At the prompt type

prompt  (turns off the prompt asking whether you are sure you want to copy, for EVERY file)
recurse  (I wanted to copy a whole directory, so I needed this)
mget <my dir>\  (this is the directory I wanted to copy)

A useful command is

smbclient -L {machine name} -W {domain}

which lists the share names available on the machine