The HDF5 format has been working really well for me, but I ran into danger when I started to mix it with multiprocessing. It was the worst kind of danger: the intermittent error.
Here are the dangers/issues in order of escalation.
(TL;DR: use a generator to feed data from your file into the child processes as they spawn. It's the easiest way. Read on for harder ways.)
- An h5py file handle can't be pickled and therefore can't be passed as an argument using pool.map().
- If you set the handle as a global and access it from the child processes, you run the risk of races, which lead to corrupted reads. My personal run-in was that my code sometimes ran fine but sometimes complained that there were NaNs or Infinity in the data. I wasted some time tracking this down. Other people have had this kind of problem.
- You get the same problem if you pass the filename and have the different processes open individual instances of the file separately.
- The hard way to solve this problem is to switch your workfl…
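The generator approach from the TL;DR can be sketched roughly as follows. This is a minimal sketch, not the exact code from the post: the dataset name "data", the 2-D layout, and the helper names iter_rows, row_sum and process_file are all assumptions made for illustration.

```python
import multiprocessing as mp

import h5py
import numpy as np


def iter_rows(path, dataset="data"):
    """Yield rows one at a time; only the parent process opens the file."""
    with h5py.File(path, "r") as f:
        dset = f[dataset]  # "data" is an assumed dataset name
        for i in range(dset.shape[0]):
            # Slicing returns a plain NumPy array, which pickles fine.
            yield dset[i, :]


def row_sum(row):
    """Stand-in for the real per-row computation (hypothetical)."""
    return float(np.sum(row))


def process_file(path, workers=2):
    """Feed rows from the parent process to a pool of workers."""
    with mp.Pool(workers) as pool:
        # imap consumes the generator in the parent and ships arrays to
        # the children, so no child ever touches the HDF5 file.
        return list(pool.imap(row_sum, iter_rows(path), chunksize=16))
```

Call process_file from under an `if __name__ == "__main__":` guard. The children only ever receive NumPy arrays, so neither the handle nor the filename crosses the process boundary.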
You can use the multiprocessing module to 'farm' out a function to multiple cores using the Pool.map function. I was idly wondering whether you can nest these farming operations to quickly build an exponentially growing army of processes to, you know, take over the world. It turns out that the universe has a failsafe:
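import multiprocessing as mp

def compute(n):
    # Placeholder worker; the original body is not shown in this excerpt.
    return n

def inner_pool(args):
    n0, N = args
    pool = mp.Pool()  # a second pool, created inside a pool worker
    return pool.map(compute, range(n0, n0 + N))

if __name__ == '__main__':
    pool = mp.Pool()
    print(pool.map(inner_pool, [(n, n + 10) for n in range(10)]))
    # -> AssertionError: daemonic processes are not allowed to have children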
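One way around the failsafe, if the goal is just to get the nested work done, is to flatten both levels into a single list of tasks and hand that to one pool. This is a sketch under that assumption; compute here is a stand-in, and flatten_tasks, regroup and run are hypothetical helper names.

```python
import multiprocessing as mp


def compute(n):
    """Stand-in for the real work on a single item (hypothetical)."""
    return n * n


def flatten_tasks(ranges):
    """Expand (start, count) pairs into one flat list of work items."""
    return [n for n0, count in ranges for n in range(n0, n0 + count)]


def regroup(flat_results, ranges):
    """Cut the flat result list back into one list per (start, count)."""
    grouped, pos = [], 0
    for _, count in ranges:
        grouped.append(flat_results[pos:pos + count])
        pos += count
    return grouped


def run(ranges, workers=2):
    """One pool, one level: no worker ever starts a pool of its own."""
    with mp.Pool(workers) as pool:
        flat = pool.map(compute, flatten_tasks(ranges))
    return regroup(flat, ranges)
```

Calling run([(n, n + 10) for n in range(10)]) under an `if __name__ == "__main__":` guard gives the same nested shape of results as the two-level version was after, with only one pool in play.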
A cool thing about data arrays stored in HDF5 via h5py is that you can incrementally add data to them. This is the practical way of processing large data sets: you read in a large dataset piece by piece, process each piece, and then append the processed data to an array on disk.
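The append pattern can be sketched like this. It is a minimal example assuming a 1-D float64 dataset; the dataset name "processed", the doubling used as the "processing" step, and the helper names are illustrative only.

```python
import h5py
import numpy as np


def append_block(dset, block):
    """Grow a 1-D resizable dataset and write `block` at the end."""
    old = dset.shape[0]
    dset.resize(old + len(block), axis=0)
    dset[old:] = block


def process_in_blocks(path, blocks, name="processed"):
    """Write each processed block to disk as soon as it is ready."""
    with h5py.File(path, "w") as f:
        # maxshape=(None,) makes the first axis unlimited; resizable
        # datasets need chunked storage, requested here with chunks=True.
        dset = f.create_dataset(name, shape=(0,), maxshape=(None,),
                                dtype="f8", chunks=True)
        for block in blocks:
            # Doubling stands in for the real processing step.
            append_block(dset, np.asarray(block, dtype="f8") * 2.0)
```

Because `blocks` can be any iterable, including a generator that reads from another file, only one block needs to be in memory at a time.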
An interesting thing about this is that there is a small size overhead in the saved file associated with this resizing compared to the same data saved all at once with no resizing of the HDF5 datasets.
I did the computations using block sizes of [10, 100, 1000, 10000] elements and block counts of [10, 100, 1000, 10000].
The corresponding matrix of overhead (excess bytes needed for the resized version over the directly saved version) looks like this (rows are for the block sizes, columns are for the block counts):
array([[ 9264, 2064, 3792, 9920],
[ 2064, 3792, 9920, 52544],
[ 3792, 9920, 52544, 462744],
[ 9920, 52544, 462744, 4570320]])
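A measurement along these lines could be set up as below. This is a sketch, not the script that produced the matrix above: the float64 dtype, the dataset name "x", and the file names are assumptions, so the byte counts it returns need not match those numbers.

```python
import os

import h5py
import numpy as np


def size_overhead(block_size, n_blocks, workdir):
    """Return bytes(resized file) - bytes(directly saved file)."""
    data = np.arange(block_size * n_blocks, dtype="f8")  # assumed dtype
    direct = os.path.join(workdir, "direct.h5")
    resized = os.path.join(workdir, "resized.h5")

    # Version 1: save everything in one go, no resizing.
    with h5py.File(direct, "w") as f:
        f.create_dataset("x", data=data)

    # Version 2: start empty and grow the dataset block by block.
    with h5py.File(resized, "w") as f:
        dset = f.create_dataset("x", shape=(0,), maxshape=(None,),
                                dtype="f8", chunks=True)
        for i in range(n_blocks):
            start = i * block_size
            dset.resize(start + block_size, axis=0)
            dset[start:start + block_size] = data[start:start + block_size]

    return os.path.getsize(resized) - os.path.getsize(direct)
```

Looping size_overhead over the block sizes and block counts listed above would fill in a matrix of the same shape.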
Storing numpy arrays in hdf5 files using h5py is great, because you can load parts of the array from disk. One thing to note is that there is a varying amount of time overhead depending on the kind of indexing you use.
It turns out that it is fastest to use standard Python slice notation - [:20,:] - which grabs well-defined contiguous sections of the array.
If we use an array of consecutive numbers as an index, we pay an additional time overhead simply for using this kind of index.
If we use an array of non-consecutive numbers (note that the indices have to be monotonically increasing and non-repeating), we pay yet another time overhead on top of that for the array of consecutive indices.
Just something to keep in mind when implementing algorithms.
import numpy, h5py
N = 1000
m = 50
f = h5py.File('index_test.h5','w')
idx1 = numpy.array(range(m))
idx2 = numpy.array(range(N-m,N))
idx3 = numpy.random.choice(N,size=m,replace…
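Since the listing above is cut off, here is a separate, self-contained sketch of the same kind of comparison. It is not the author's code: the N x N dataset of random floats, the dataset name "data", and the helpers make_indices and time_reads are assumptions.

```python
import timeit

import h5py
import numpy as np


def make_indices(N, m, seed=0):
    """Build the three kinds of row index discussed in the text."""
    rng = np.random.default_rng(seed)
    return {
        "slice": slice(0, m),
        "consecutive array": np.arange(m),
        # h5py needs increasing, non-repeating indices, hence the sort.
        "scattered array": np.sort(rng.choice(N, size=m, replace=False)),
    }


def time_reads(path, N=1000, m=50, repeats=20):
    """Return the mean seconds per read for each kind of index."""
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=np.random.rand(N, N))

    timings = {}
    with h5py.File(path, "r") as f:
        dset = f["data"]
        for label, idx in make_indices(N, m).items():
            total = timeit.timeit(lambda: dset[idx, :], number=repeats)
            timings[label] = total / repeats
    return timings
```

Printing time_reads("index_test.h5") gives one number per index type; the absolute values depend on the machine and on disk caching, so only the relative ordering is worth reading into.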
I was first alerted to this when my Mac asked me if I wanted to allow octoshapepm to accept incoming connections. A web search led me to an interesting finding: CNN is basically installing a program that uses your computer to redistribute its content, without really telling you that it is doing so.
The program itself is made by this company. This article gives a brief non-marketing overview of what the program actually does and how to get rid of it if you wish.
In short, as installed by CNN, the program acts as a realtime file distribution system, like BitTorrent, except that it's probably running without your permission, helping CNN deliver content using part of your bandwidth (you are uploading video data just as you are downloading it). Beyond the security issues, there is an issue of principle: you are most likely being tricked into giving up part of your bandwidth to save CNN some money, and into opening a new security hole while you are at it.
Pandas' HDF5 interface through PyTables is awesome because it allows you to select and process small chunks of data from a much larger data file stored on disk. PyTables, however, has an annoying and subtle bug, and I just wanted to point you to it so that you don't have to spend hours debugging code like I did.
In short, if you have a DataFrame and a column of that DataFrame starts with a NaN, any select statement that conditions on that column will return empty (you won't get any results back, ever). There is a workaround, but I chose to use a dummy value instead.
This shook my confidence in Pandas as an analysis platform a bit (though it is really PyTables' fault).
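The dummy-value approach might look something like this. It is only a sketch: the sentinel -9999.0 and the helper names hide_nans and restore_nans are hypothetical, and the sentinel must be a value that cannot occur in the real data.

```python
import numpy as np
import pandas as pd

SENTINEL = -9999.0  # hypothetical dummy value; must not occur in real data


def hide_nans(df, column, sentinel=SENTINEL):
    """Return a copy with NaNs in `column` replaced by the sentinel."""
    out = df.copy()
    out[column] = out[column].fillna(sentinel)
    return out


def restore_nans(df, column, sentinel=SENTINEL):
    """Return a copy with the sentinel in `column` turned back into NaN."""
    out = df.copy()
    out[column] = out[column].replace(sentinel, np.nan)
    return out
```

The idea is to call hide_nans just before writing the table to the HDF5 store and restore_nans on whatever a select returns, so the stored column never starts with a NaN.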
I like learning languages and after a little kerfuffle with a Python package I was wondering if there was anything out there for statistical data analysis that might not have so many hidden pitfalls in ordinary places.
I knew about R from colleagues but had never paid much attention to it, so I decided to give it a whirl. Here are some brief preliminary notes, in no particular order:
PLUS
- Keyword arguments!
- Gorgeous plotting
- Integrated workspace (including GUI package manager)
- Very good documentation and help
- NaN different from NA
- They have their own Journal. But what do you expect from a bunch of mathematicians?
- Prints large arrays on multiple lines with index number of first element on each line on left gutter
- Parenthesis autocomplete on command line
- RStudio, though the base distribution is pretty complete, with package manager, editor and console.
MINUS
- Everything is a function. I love this, but it means commands in the interpreter always need parentheses. I'd gotten used to the Python RE…
Welp, I finally got this through my thick head, thanks to a hint by Jeff, who answered my cry for help on Stack Overflow and pointed me to this thread on the pandas issues list.
So here's my use case again: I have small data and big data. Small data is relatively lightweight, heterogeneous table-type data. Big data is potentially gigabytes in size and homogeneous. Conditionals on the small data table are used to select rows, which in turn tell us which subset of the big data is needed for further processing.
Here's one way to do things:
(Things to note: saving in frame_table format, common indexing, use of 'where' to select the big data)
import pandas as pd, numpy
df = pd.DataFrame(data=numpy.random.randint(10,size=(8,4)),columns=['a','b','c','d'])
df1 = pd.DataFrame(data=numpy.random.randint(10,size=(8,20)),index=df.index)
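The listing above is cut off, so what follows is a separate sketch of the pattern described in the notes rather than the author's code. It assumes both tables are written in the same row order, uses the store keys "small" and "big" and the helpers save_pair and select_big purely for illustration, and needs PyTables installed.

```python
import numpy as np
import pandas as pd


def save_pair(path, small, big):
    """Store both frames in table format so they can be queried on disk."""
    with pd.HDFStore(path, mode="w") as store:
        # data_columns=True makes every column of `small` usable in `where`.
        store.put("small", small, format="table", data_columns=True)
        store.put("big", big, format="table")


def select_big(path, condition):
    """Use a condition on the small table to pull rows of the big one."""
    with pd.HDFStore(path, mode="r") as store:
        # Row positions in `small` that satisfy the condition ...
        rows = store.select_as_coordinates("small", where=condition)
        # ... reused as row positions in `big`, which shares its row order.
        return store.select("big", where=rows)
```

Only the matching rows of the big table are read from disk, which is the point when the big table runs to gigabytes.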
smbclient (its atrociously formatted man page is here) will let you do what ftp let you do, which is to get and put files between your local machine and a samba server.
My use case is that I have a high performance cluster (Partners' Linux High Performance Computing cluster) that I want to run my code on (remoteA) while my data is on another server (remoteB) that seems to only allow access through samba and refuses ssh and scp requests.
The solution turns out to be to use smbclient, which seems to behave just like the ftp clients of old.