Tuesday, June 23, 2015

h5py in memory files

The h5py library is a very nice wrapper around the HDF5 data storage format/library. The authors of h5py have done a super job of aligning HDF5 data types with numpy data types, including structured arrays, which means you can store variable-length strings and jagged arrays. One of the advantages of HDF5 for large datasets is that you can load slices of the data into memory very easily and transparently: h5py and HDF5 take care of everything (compression, chunking, buffering) for you.

As I was playing around with h5py one thing tripped me up. h5py has an "in memory" mode (the driver='core' option) where you can create HDF5 files in memory, which is great when prototyping or writing tests, since you don't have to clean up files after you are done.

The documentation says that even if you have an in-memory file, you need to give it a name. I found this requirement funny, because I assumed that the fake file was being created in a throwaway memory buffer attached to the HDF5 object and would disappear once the object was closed or went out of scope.

So I did things like this:

import h5py
fp = h5py.File(name='f1', driver='core') # driver='core' is the incantation for creating an in memory HDF5 file
grp = fp.create_group('/grp1')
fp.keys()  # -> [u'grp1']

Great, things work as expected.

fp.close()
fp1 = h5py.File(name='f1', driver='core')  # reopen under the same name
fp1.keys()  # -> [u'grp1']

Whaaaa?!?!

Closing the file didn't get rid of it! I have the data still!

del fp
fp2 = h5py.File(name='f1', driver='core')  # same name again
fp2.keys()  # -> [u'grp1']

Whoah! Deleting the parent object doesn't get rid of it either!!!

fp3 = h5py.File(name='f2', driver='core')  # a different name starts out empty
fp3.keys()  # -> []

This surprised me a great deal. I had assumed the name was a dummy item, perhaps in order to keep some of their internal code consistent, but I did not ever expect a persistent memory store.

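Welp, it turns out that the 'core' driver keeps the file in memory only while it is open. By default, the contents are written out to a real file when it is closed, so there are now actual files called f1 and f2 on the file system. To make a file that truly lives only in memory, you have to use an additional option, backing_store=False: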

fp = h5py.File(name='f1', driver='core', backing_store=False)
grp = fp.create_group('/grp1')
fp.keys()  # -> [u'grp1']
fp.close()

fp = h5py.File(name='f1', driver='core', backing_store=False)
fp.keys()  # -> []
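
If you want to check the difference for yourself, here is a minimal sketch. It works in a throwaway temporary directory so nothing is left over from earlier experiments. The helper name leaves_file_on_disk is made up for this example, and I pass mode='w' explicitly on the assumption that recent h5py versions no longer create files by default.

```python
import os
import tempfile

import h5py


def leaves_file_on_disk(backing_store):
    """Create an in-memory HDF5 file, close it, and report whether a
    real file shows up at the name we gave it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'f1')
        fp = h5py.File(path, mode='w', driver='core',
                       backing_store=backing_store)
        fp.create_group('/grp1')
        fp.close()
        # Check before the temporary directory is cleaned up
        return os.path.exists(path)


print(leaves_file_on_disk(True))   # default behaviour of driver='core'
print(leaves_file_on_disk(False))  # truly in-memory
```

I would expect the first call to report a file on disk and the second to report none, which is the behaviour that tripped me up above.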

Wednesday, May 13, 2015

Sample randomly from an sqlite database

If you have a large database you often want to sample rows from it, and for many uses the sample should be random. A reasonably fast way to do this completely in SQL is the following:

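SELECT * FROM my_table WHERE rowid IN (SELECT abs(random()) % N + 1 FROM my_table LIMIT k);

where N is the largest rowid in the table (the number of rows, if none have been deleted) and k is the number of samples we want. The + 1 is there because rowids normally start at 1, not 0. Since the same rowid can be drawn more than once, you may occasionally get slightly fewer than k rows.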
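
Here is a minimal sketch of running this kind of query from Python's built-in sqlite3 module. The toy table contents and the helper name sample_rows are made up for illustration. I take N from max(rowid) and add 1 to the drawn value so the ids land in 1..N, which assumes the usual rowids starting at 1.

```python
import sqlite3


def sample_rows(conn, k):
    """Return up to k randomly chosen rows from my_table."""
    # N: the largest rowid currently in the table
    n = conn.execute("SELECT max(rowid) FROM my_table").fetchone()[0]
    query = (
        "SELECT * FROM my_table WHERE rowid IN "
        "(SELECT abs(random()) % ? + 1 FROM my_table LIMIT ?)"
    )
    return conn.execute(query, (n, k)).fetchall()


# Hypothetical toy table, just so there is something to sample from
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (value INTEGER)")
conn.executemany("INSERT INTO my_table VALUES (?)",
                 [(i,) for i in range(1000)])

rows = sample_rows(conn, 10)
print(len(rows), rows)
```

Because duplicate rowids collapse, the number of rows printed can be a little under 10.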

There is an alternative form popular on the internet

SELECT * FROM my_table WHERE random() % L = 0 LIMIT k;

where L is the factor deciding how likely it is we pick a particular row.

This form is slower AND is biased towards picking samples from the beginning of the database.

With this method we go through the rows sequentially, deciding for each one whether to include it in the sample, and stop as soon as we have k rows. For large L few rows pass the test, so we have to scan a lot of the table, which makes things slow. If we make L small we finish faster, but our sample is biased towards the start of the database.

Thursday, April 23, 2015

rsync and FAT vs EXT file systems

NTFS/FAT timestamps can differ slightly from those on Unix-style file systems such as ext (Linux) or the ones Mac OS uses. FAT, for example, only stores modification times to the nearest 2 seconds. This causes issues when using rsync, where every file is copied over even though nothing has really changed. Use the --modify-window=1 flag to tell rsync to treat timestamps that differ by up to 1 second as equal.

So a typical incremental backup command will be

rsync -avz --modify-window=1 /src /dst
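
To make the idea concrete: by default rsync decides whether a file can be skipped by comparing its size and modification time on both sides. The sketch below mimics that check in Python with a tolerance window. It is only an illustration of the idea, not rsync's actual code, and the function name and sample numbers are made up.

```python
def looks_unchanged(src_size, src_mtime, dst_size, dst_mtime, window=0):
    """Quick-check style comparison: same size, and modification times
    no more than `window` seconds apart."""
    return src_size == dst_size and abs(src_mtime - dst_mtime) <= window


# A file whose timestamp came back one second off after a round trip
print(looks_unchanged(2048, 1429800001, 2048, 1429800002))            # strict
print(looks_unchanged(2048, 1429800001, 2048, 1429800002, window=1))  # tolerant
```

With the strict comparison the file looks modified and would be copied again; with a one second window it is left alone.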

Wednesday, March 25, 2015

Re-printing a line (in Python)

I love those progress bars in the command line. You know, like when you install Linux (or rather when I used to install Linux, nowadays the cool kids have full graphical interface installers).

Something like this:


I always thought you had to use curses or something equally magical to do this. Then I ran into this post. It turns out the character '\r' moves the cursor to the beginning of the line and you can use that, for example in Python, to create an animated progress bar.

import sys


def progress_bar(title, f, cols):
  """Draw a nifty progress bar.
  '\r' trick from http://stackoverflow.com/questions/15685063/print-a-progress-bar-processing-in-python

  :param title: leading text to print
  :param f:     fraction completed
  :param cols:  how many columns wide should the bar be
  """
  x = int(f * cols + 0.5)
  sys.stdout.write('\r' + title + '[' + '.' * x + ' ' * (cols - x) + ']\r')
  sys.stdout.flush()
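
To actually see it animate you call it in a loop. Here is a minimal sketch: bar_text and demo are names I made up for this example, and bar_text builds the same string that progress_bar writes, so it can be checked without a terminal.

```python
import sys
import time


def bar_text(title, f, cols):
    """Build the string that redraws the bar in place."""
    x = int(f * cols + 0.5)  # number of filled columns, rounded
    return '\r' + title + '[' + '.' * x + ' ' * (cols - x) + ']\r'


def demo(steps=20, delay=0.01):
    """Redraw the bar once per step, then move to a fresh line."""
    for i in range(steps + 1):
        sys.stdout.write(bar_text('Working ', float(i) / steps, 40))
        sys.stdout.flush()
        time.sleep(delay)  # stand-in for real work
    sys.stdout.write('\n')


if __name__ == '__main__':
    demo()
```

The final newline matters: without it, the shell prompt is printed on top of the finished bar.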


(Oh yeah, I use animated .gifs. I'm THAT old)

Tuesday, January 27, 2015

Notes on distributing cython code

One of the conveniences of Python is the package system, which allows you to install your program and any dependencies smoothly. The package system works very well when the code is pure Python, but can run into trouble when code written in Cython or C is part of the program.

I will illustrate some missteps I made while writing an install script for an example program that is a mixture of Python and Cython. I've put the code up on github and each step has its own tag. You can follow along by setting up a virtual environment using virtualenvwrapper:

mkvirtualenv cy-test

And then trying to install the appropriate tag, e.g:

git clone git@github.com:kghose/cython-example.git
cd cython-example
git checkout ex2

ex1

The module installs without errors, but because I did not indicate the paths of the Cython files properly (I omitted the kgcyex directory in the path) the Cython files do not compile. You can tell because there are no compilation messages during the install, though the failure is otherwise silent until you run the program:

kghose$ kgcyex
Traceback (most recent call last):
  File "/Users/kghose/.venvs/blog/bin/kgcyex", line 9, in <module>
    load_entry_point('kgcyex==1.0.0', 'console_scripts', 'kgcyex')()
  File "/Users/kghose/.venvs/blog/lib/python2.7/site-packages/pkg_resources.py", line 356, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/Users/kghose/.venvs/blog/lib/python2.7/site-packages/pkg_resources.py", line 2431, in load_entry_point
    return ep.load()
  File "/Users/kghose/.venvs/blog/lib/python2.7/site-packages/pkg_resources.py", line 2147, in load
    ['__name__'])
  File "/Users/kghose/.venvs/blog/lib/python2.7/site-packages/kgcyex/main.py", line 2, in <module>
    import kgcyex.cy1 as cy1
ImportError: No module named cy1

ex2

I correctly write out the full paths of the Cython modules, and everything installs and runs fine.

kghose$ kgcyex
foo from kgcyex.mod1
foo from kgcyex.cy1
foo from kgcyex.lib.mod2
foo from kgcyex.lib.cy2

ex3

Suppose the other user does not have Cython? The Cython documentation suggests that we distribute the generated C code with the source. There is some debate as to whether this is "proper", since the .c files are actually generated from the .pyx files and in principle we should only really be distributing files which cannot be auto-generated from the "real" source. For now, we put pragmatism over principle. Note that the setup.py changes a bit.

If you read the setup.py you will note that I have used a check to test if the user has Cython or not. This check then tells setup to use either the .pyx files or the .c files. This is standard stuff recommended by the Cython folks. Look carefully at the line in setup.py where I add the extensions:

extensions = [Extension("cy1", ["kgcyex/cy1"+ext]), Extension("cy2", ["kgcyex/lib/cy2"+ext])]

Things compile properly because I've remembered to indicate the proper path to the .pyx (or .c) files. When we run setup.py we can see the modules being compiled. But what the #$%@! When we go to run the code it again complains that it can't find the compiled modules! In real life this error caused me to lose about an hour :(

My error was that though I had correctly indicated the path to the source (the second parameter for Extension) I had not given the proper dotted path for the modules themselves. If you look under site-packages of your installation you will note that there are two compiled modules, cy1.so and cy2.so, directly under site-packages rather than in their proper places under kgcyex and kgcyex/lib. The correct form of this line is ...

ex4

extensions = [Extension("kgcyex.cy1", ["kgcyex/cy1"+ext]), Extension("kgcyex.lib.cy2", ["kgcyex/lib/cy2"+ext])]
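
For reference, the Cython check and the corrected extension names could be put together roughly like this. This is a sketch, not the actual setup.py from the repository: the helper names (have_cython, extension_specs, build_extensions) are mine, and I'm assuming setuptools here rather than whatever the original script imports.

```python
def have_cython():
    """True if Cython can be imported on this machine."""
    try:
        import Cython  # noqa: F401
        return True
    except ImportError:
        return False


def extension_specs(use_cython):
    """(dotted module name, [source path]) pairs for each extension.

    The dotted name decides where the compiled module is installed;
    the source path only tells the compiler where to find the code."""
    ext = '.pyx' if use_cython else '.c'
    return [
        ("kgcyex.cy1", ["kgcyex/cy1" + ext]),
        ("kgcyex.lib.cy2", ["kgcyex/lib/cy2" + ext]),
    ]


def build_extensions(use_cython):
    """Turn the specs into Extension objects, cythonizing if possible."""
    from setuptools import Extension  # assumption: setuptools is available
    extensions = [Extension(name, sources)
                  for name, sources in extension_specs(use_cython)]
    if use_cython:
        from Cython.Build import cythonize
        extensions = cythonize(extensions)
    return extensions


# In setup.py this would be used along the lines of:
#   setup(..., ext_modules=build_extensions(have_cython()))
```

Keeping the names and paths in one small function makes the two easy to compare side by side, which is exactly where I went wrong in ex3.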





Friday, January 23, 2015

Running bash functions in parallel

I was blown away when I learned this. From this thread on stackoverflow it turns out that by simply adding an ampersand to a line containing a function call you can send it to run in the background!

#!/bin/bash
function foo {
  echo $1
  sleep $1
  date
}

for i in `seq 1 10`; do
  foo $i &
done
wait  # block until all the background jobs have finished

I always thought that this was restricted to programs/scripts you can call from the command line!

Sunday, January 4, 2015

Electricity choice in Massachusetts

I've lived in Massachusetts for some years now and I've noticed that my electric bill is split into two parts: Delivery services and Supply services. I always thought that that was some itemizing detail, like the forty items I used to have on my phone bill and I ignored it. This month's bill was higher than expected and I took a closer look.

After a little inspection I noted that the Supply services rate was higher than before. It said "Basic Fixed Service". After some messing around on the National Grid Website, I came to this page. The important information there is this:
"National Grid separates your bill into two services: supply and delivery. Supply Services is the portion of your electric service for which you can shop for your electricity supply from a supplier other than National Grid. These suppliers, often referred to as competitive suppliers, can be companies that produce or generate electricity or are brokers that buy electricity in the wholesale market and sell it to residents and businesses. National Grid is a delivery company, which means we will deliver electricity to you regardless of your choice of supplier. We encourage you to shop and compare the prices of competitive suppliers. Find out more about choosing your supply of electricity from a competitive supplier by visiting our Energy Choice area."
Wow. It goes on to say that by default you are signed on to a National Grid brokered plan where they buy electricity at wholesale rates and sell it to you for no profit and with some administrative costs added.

I went to the list of energy suppliers and browsed many of the companies. It does not take much time, and I would encourage you to do the same. It was interesting to me that most of these companies were offering rates lower than what I have from National Grid. I was not expecting that, given that National Grid is a near-monopoly buyer of electricity.

Some of the companies looked shady - the website had no upfront way to find out the electric supply cost, and they were offering incentives like gift cards and so on. The companies I favored were those that had a nice, easy interface for signing up and a clearly marked price per kWh.

Some of the companies did not serve Massachusetts, so I was surprised that a Mass customer was linked to them, but National Grid does serve many areas, so perhaps this is a country-wide list.

Some companies offer the choice of getting electricity from renewable sources, which, if you have the budget for it, seems a good way to go. One company was offering renewable at about 20% more than regular, which isn't so bad if your electric bills are about $100 a month.