Monday, August 25, 2014

Numpy: a note on slicing efficiency

In my application I have three arrays of the same length (and datatype) (say p1, p2 and p3) that have a one to one correspondence with each other. My code works on the arrays in blocks and has to concatenate fragments of these three arrays together. In the rest of my code I take matched slices through these arrays (say p1[0:100], p2[0:100], p3[0:100]) and pass them onto functions for processing,

My first instinct was to create a two-dimensional array since this encapsulates the data nicely, even though the data do not, semantically, constitute a matrix. It was at this point I came across a fact that I had not considered before: taking slices from a 2D array is more time expensive than taking slices from a 1D array.

For example:

In [397]: %%timeit p = numpy.empty((3, 1e6))
   .....: x = p[1, 0:2]
   .....: 
100000 loops, best of 3: 2.18 µs per loop

Vs

In [398]: %%timeit p = numpy.empty(1e6)
   .....: x = p[0:2]
   .....: 
1000000 loops, best of 3: 512 ns per loop

Row-column shape does not make a difference:

In [399]: %%timeit p = numpy.empty((1e6, 3))
x = p[0:2, 1]
   .....: 
100000 loops, best of 3: 1.97 µs per loop

Additionally, A bit of parametric investigation reveals that, interestingly, the size of the slice does not matter:

In [402]: %%timeit p = numpy.empty((3, 1e6))
x = p[1, 0:100000]
   .....: 
100000 loops, best of 3: 2.17 µs per loop

Vs

In [401]: %%timeit p = numpy.empty(1e6)
x = p[0:100000]
   .....: 
1000000 loops, best of 3: 507 ns per loop

I think the most likely reason for this is that we are no longer pulling contiguous memory locations but rather having to pull together fragments from different memory locations and do proper offset computations while we do this. I can't understand why the time is independent of size for the 2D array, though!

So these consideration led me to the following formulation. I first verified that a concat of 2D fragments was not faster than repeated concats of separate 1-D fragments:

In [404]: %%timeit p1 = numpy.empty((3, 1e6)); p2=numpy.empty((3,1e2))
   .....: p = numpy.concatenate((p1, p2),axis=1)
   .....: 
100 loops, best of 3: 8.15 ms per loop

Vs

In [405]: %%timeit p1 = [numpy.empty(1e6) for _ in range(3)]; p2=[numpy.empty(1e2) for _ in range(3)]
   .....: p = [numpy.concatenate((pa, pb)) for pa, pb in zip(p1, p2)]
   .....: 
100 loops, best of 3: 9.34 ms per loop

And then used a list of arrays instead of a 2D array. This works for me because I do a lot of slicing of the resultant arrays and I never need to do any matrix operations with the three rows arranged as a matrix.

3 comments:

  1. Since my previous comment didn't seem to get posted... here's my question. If efficiency / speed is a concern, why are you using Python?

    ReplyDelete
  2. Hey Ben, I don't see any comment in the moderation list - did you get any acknowledgement that the comment went through?

    Ok, to answer your question:

    Python shines as maintainable code when written idiomatically but Python can't get anywhere near C speeds unless you do some tricks (like write parts in Cython) which makes the code unreadable and hard to maintain.

    However, this does not mean you shouldn't profile your code and remove bottlenecks, especially if the modified code is still idiomatic Python.

    ReplyDelete
  3. Also, the blog was a getting a lot of spam which cluttered up the moderation queue so I think I set it so that only registered users can comment. I will check to make sure folks with blogger accounts/openid can comment.

    ReplyDelete