
Posts

Showing posts from 2015

h5py in memory files

The h5py library is a very nice wrapper around the HDF5 data storage format/library. The authors of h5py have done a super job of aligning HDF5 data types with numpy data types, including structured arrays, which means you can store variable-length strings and jagged arrays. One of the advantages of HDF5 for large datasets is that you can load slices of the data into memory very easily and transparently: h5py and HDF5 take care of everything (compression, chunking, buffering) for you. As I was playing around with h5py one thing tripped me up. h5py has an "in memory" mode where you can create HDF5 files in memory (the driver='core' option), which is great when prototyping or writing tests, since you don't have to clean up files after you are done. The documentation says that even if you have an in-memory file, you need to give it a name. I found this requirement funny, because I assumed that the fake file was being created in a throw away memory buffer attached ...

Sample randomly from an sqlite database

If you have a large database you often want to sample rows from it, and for many uses the sample should be random. A reasonably fast way to do this entirely in SQL is the following:

SELECT * FROM my_table WHERE rowid IN (SELECT abs(random()) % N + 1 FROM my_table LIMIT k);

where N is the largest rowid in the table (equal to the number of rows if none have been deleted) and k is the number of samples we want. The + 1 is needed because rowids normally start at 1 while the modulo yields values from 0 to N - 1. The inner query can produce duplicate rowids, so the result may contain slightly fewer than k rows.

There is an alternative form popular on the internet:

SELECT * FROM my_table WHERE random() % L = 0 LIMIT k;

where L is the factor deciding how likely it is that we pick a particular row (roughly 1 in L). This form is slower AND is biased towards picking samples from the beginning of the database. In this method we go through the rows sequentially, deciding for each one whether to select it for the sample, and stop once k rows have been chosen, which makes things slow for large L. If we make L small we pick faster, but our sample is biased towards the start of the database.
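As a rough sketch, the rowid-based query can be driven from Python's built-in sqlite3 module. The helper name sample_rows and the toy in-memory table are made up for illustration. The sketch assumes rowids are mostly contiguous, uses the largest rowid as N, and adds 1 to the modulo result because rowids normally start at 1.

```python
import sqlite3


def sample_rows(conn, table, k):
    """Return up to k randomly chosen rows from `table` (hypothetical helper).

    The table name is interpolated directly into the SQL, so only pass
    trusted names.
    """
    # Use the largest rowid as N; assumes few gaps from deleted rows.
    (n,) = conn.execute(f"SELECT max(rowid) FROM {table}").fetchone()
    query = (
        f"SELECT * FROM {table} WHERE rowid IN "
        f"(SELECT abs(random()) % ? + 1 FROM {table} LIMIT ?)"
    )
    return conn.execute(query, (n, k)).fetchall()


# Toy data: 1000 rows in a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (value INTEGER)")
conn.executemany(
    "INSERT INTO my_table (value) VALUES (?)", [(i,) for i in range(1000)]
)

rows = sample_rows(conn, "my_table", 10)
print(len(rows), rows[:3])
```

Because the inner query can pick the same rowid twice, the call may return fewer than k rows; if an exact count matters, one option is to ask for a few extra and trim.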

rsync and FAT vs EXT file systems

Timestamps on NTFS/FAT file systems can differ slightly from those on Unix-style file systems such as EXT on Linux or the ones used by Mac OS. FAT, for example, stores modification times with only 2-second resolution. This causes issues when using rsync: every file is copied over even though nothing has really changed. Use the --modify-window=1 flag to tell rsync to treat timestamps that differ by up to 1 second as equal. A typical incremental backup command is then:

rsync -avz --modify-window=1 /src /dst
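To see why a 1-second window is enough, here is a toy Python model of the timestamp comparison. It is not rsync's code: fat_mtime and same_mtime are hypothetical helpers, and rounding down to an even second is only an assumption about how a FAT volume might store the time.

```python
def fat_mtime(mtime):
    """Model a FAT timestamp: keep only even whole seconds (assumed rounding)."""
    return int(mtime) // 2 * 2


def same_mtime(src_mtime, dst_mtime, window=0):
    """Treat two modification times as equal if they differ by <= window seconds."""
    return abs(int(src_mtime) - int(dst_mtime)) <= window


src = 1431012345            # an odd second on the EXT side
dst = fat_mtime(src)        # what the FAT copy would report in this model

print(same_mtime(src, dst))             # exact comparison
print(same_mtime(src, dst, window=1))   # comparison with a 1-second window
```

In this model the exact comparison fails for any file whose modification time falls on an odd second, which is why unchanged files keep getting copied without the flag.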

Re-printing a line (in Python)

I love those progress bars in the command line. You know, like when you install Linux (or rather when I used to install Linux; nowadays the cool kids have full graphical installers). I always thought you had to use curses or something equally magical to do this. Then I ran into this post. It turns out the character '\r' moves the cursor to the beginning of the line and you can use that, for example in Python, to create an animated progress bar.

import sys

def progress_bar(title, f, cols):
    """Draw a nifty progress bar.
    '\r' trick from http://stackoverflow.com/questions/15685063/print-a-progress-bar-processing-in-python

    :param title: leading text to print
    :param f: fraction completed
    :param cols: how many columns wide should the bar be
    """
    x = int(f * cols + 0.5)
    sys.stdout.write('\r' + title + '[' + '.' * x + ' ' * (cols - x) + ']\r')
    sys.stdout.flu...
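A self-contained variation on the same carriage-return trick might look like the following. The names render_bar and demo, the step count, and the delay are all illustrative choices rather than part of the original post; building the bar text in a separate function just makes it easy to check.

```python
import sys
import time


def render_bar(title, fraction, cols):
    """Return one frame of the bar, led by a carriage return so that
    writing it overwrites whatever is on the current terminal line."""
    filled = int(fraction * cols + 0.5)
    return '\r' + title + '[' + '.' * filled + ' ' * (cols - filled) + ']'


def demo(steps=20, delay=0.02, out=sys.stdout):
    """Animate the bar from 0% to 100% (illustrative values)."""
    for i in range(steps + 1):
        out.write(render_bar('Working ', i / steps, 30))
        out.flush()  # without flush the frames may be buffered and appear at once
        time.sleep(delay)
    out.write('\n')  # move to a fresh line once the bar is complete


if __name__ == '__main__':
    demo()
```

Each frame has the same width, so a new frame fully covers the previous one; if the text could shrink between frames, padding with spaces would be needed to erase leftovers.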

Notes on distributing cython code

One of the conveniences of Python is the package system, which allows you to install your program and any dependencies smoothly. The package system works very well when the code is pure Python, but can run into trouble when code written in Cython or C is part of the program. I will illustrate some missteps I made while writing an install script for an example program that is a mixture of Python and Cython. I've put the code up on GitHub and each step is a commit tag. You can follow along by setting up a virtual environment using virtualenvwrapper:

mkvirtualenv cy-test

And then trying to install the appropriate tag, e.g.:

git clone git@github.com:kghose/cython-example.git
cd cython-example
git checkout ex2

ex1

The module installs without errors, but because I did not indicate the paths of the Cython files properly (I omitted the kgcyex directory in the path) the Cython files do not compile. You will notice this because there are no compilation messages during th...

Running bash functions in parallel

I was blown away when I learned this. From this thread on Stack Overflow it turns out that by simply adding an ampersand to a line containing a function call you can send it to run in the background!

#!/bin/bash
function foo {
    echo $1
    sleep $1
    date
}

for i in `seq 1 10`; do
    foo $i &
done

I always thought that this was restricted to programs/scripts you can call from the command line!

Electricity choice in Massachusetts

I've lived in Massachusetts for some years now and I've noticed that my electric bill is split into two parts: Delivery services and Supply services. I always thought that this was some itemizing detail, like the forty items I used to have on my phone bill, and I ignored it. This month's bill was higher than expected and I took a closer look. After a little inspection I noted that the Supply services rate was higher than before. It said "Basic Fixed Service". After some messing around on the National Grid website, I came to this page. The important information there is this: National Grid separates your bill into two services: supply and delivery. Supply Services is the portion of your electric service for which you can shop for your electricity supply from a supplier other than National Grid. These suppliers, often referred to as competitive suppliers, can be companies that produce or generate electricity or are brokers that buy electricity in the wholesale m...