
Multi-indexing with Pandas DataFrame

One of the features of the Pandas library that I like the most is hierarchical indexing. The use of hierarchical indexing is illustrated by the following examples:


import pandas as pd
import numpy as np

# Hierarchical (MultiIndex) row index built from a list of tuples.
idx = pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y'), ('b', 'x'), ('b', 'y')])
# Hierarchical column index; note the second level is integer (0, 1).
col = pd.MultiIndex.from_tuples([('c1', 0), ('c1', 1), ('c2', 0)], names=['f', 's'])
dat = np.random.randn(len(idx), len(col))
df1 = pd.DataFrame(dat, index=idx, columns=col)
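As an aside (not part of the original post), the same row index can also be built from the cartesian product of per-level values with MultiIndex.from_product; note that from_product arrived after the pandas version used in this post, so this assumes a newer pandas:

# Equivalent construction of idx via the cartesian product of level values
# (MultiIndex.from_product; requires a newer pandas than 0.11.0).
idx2 = pd.MultiIndex.from_product([['a', 'b'], ['x', 'y']])
assert idx.equals(idx2)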
In [6]: df1
Out[6]:
f c1 c2
s 0 1 0
a x 0.091242 0.700668 1.755267
y 0.933405 0.129897 -0.082570
b x -0.828174 -0.614804 0.299050
y -2.034121 -0.560468 -0.443721
As can be seen, MultiIndex.from_tuples turns a list of tuples into a hierarchical index. The cool thing about the hierarchical index is that we can now select out groups of columns/rows like so:
In [13]: df1.c1
Out[13]:
s 0 1
a x 0.091242 0.700668
y 0.933405 0.129897
b x -0.828174 -0.614804
y -2.034121 -0.560468
Or like so:
In [17]: df1.loc['b']
Out[17]:
f c1 c2
s 0 1 0
x -0.828174 -0.614804 0.299050
y -2.034121 -0.560468 -0.443721
A more powerful example of label-based selection over a range is sketched below.
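Here is a sketch of such range selection using pd.IndexSlice. IndexSlice was added in pandas 0.14, i.e. after the 0.11.0 used in this post, so treat this as an illustration for newer versions:

# Label-based range selection on the MultiIndex (needs pandas >= 0.14).
ix = pd.IndexSlice
# Rows from ('a', 'y') through ('b', 'x'), inclusive:
print(df1.loc[('a', 'y'):('b', 'x')])
# All rows whose second level is 'x', restricted to the 'c1' column group:
print(df1.loc[ix[:, 'x'], ix['c1', :]])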
# Second frame: same rows, but the column subindex is a string ('x', 'y').
col = pd.MultiIndex.from_tuples([('c3', 'x'), ('c3', 'y')])
dat = np.random.randn(len(idx), len(col))
df2 = pd.DataFrame(dat, index=idx, columns=col)
In [7]: df2
Out[7]:
c3
x y
a x -0.524225 0.245902
y 0.286759 0.365465
b x 0.542192 1.605373
y -0.554882 0.124332
# Third frame: string subindices again ('x', 'y'), this time under 'c4'.
col = pd.MultiIndex.from_tuples([('c4', 'x'), ('c4', 'y')])
dat = np.random.randn(len(idx), len(col))
df3 = pd.DataFrame(dat, index=idx, columns=col)
In [8]: df3
Out[8]:
c4
x y
a x -0.339814 -0.399197
y 0.053150 0.368418
b x 0.314094 0.228735
y 0.030094 0.005102
# Stack the three frames vertically (concat defaults to axis=0).
df_a = pd.concat([df1, df2, df3])
In [9]: df_a
Out[9]:
c1 c2 c3 c4
0 1 0 x y x y
a x NaN NaN NaN NaN NaN NaN NaN
y NaN NaN NaN NaN NaN NaN NaN
b x NaN NaN NaN NaN NaN NaN NaN
y NaN NaN NaN NaN NaN NaN NaN
a x NaN NaN NaN -0.524225 0.245902 NaN NaN
y NaN NaN NaN 0.286759 0.365465 NaN NaN
b x NaN NaN NaN 0.542192 1.605373 NaN NaN
y NaN NaN NaN -0.554882 0.124332 NaN NaN
a x NaN NaN NaN NaN NaN -0.339814 -0.399197
y NaN NaN NaN NaN NaN 0.053150 0.368418
b x NaN NaN NaN NaN NaN 0.314094 0.228735
y NaN NaN NaN NaN NaN 0.030094 0.005102
If you find those NaNs in the topmost rows and leftmost columns odd, I agree with you. What has happened here is that the column subindex for df1 is an integer (0, 1) while for df2 and df3 it is a string ('x', 'y'). Pandas (I used v0.11.0) for some reason does not combine those indices consistently; for example, below you see that a join on axis=1 goes smoothly. And if you skip down a bit you will see that if we set the indices as strings ('0', '1') the combination goes fine.
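One way around this (a minimal sketch, not from the original post) is to cast df1's integer sublevel to strings with MultiIndex.set_levels before concatenating, so all three frames carry string subindices:

# Workaround sketch: make the second column level of df1 a string level.
df1s = df1.copy()
df1s.columns = df1s.columns.set_levels(
    df1s.columns.levels[1].astype(str), level=1)
df_ok = pd.concat([df1s, df2, df3])  # combines without dropping df1's data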
# Join the three frames side by side instead (axis=1).
df_b = pd.concat([df1, df2, df3], axis=1)
In [10]: df_b
Out[10]:
f c1 c2 c3 c4
s 0 1 0 x y x y
a x 0.091242 0.700668 1.755267 -0.524225 0.245902 -0.339814 -0.399197
y 0.933405 0.129897 -0.082570 0.286759 0.365465 0.053150 0.368418
b x -0.828174 -0.614804 0.299050 0.542192 1.605373 0.314094 0.228735
y -2.034121 -0.560468 -0.443721 -0.554882 0.124332 0.030094 0.005102
Note how we don't have an issue here, even though some of the subindices start out as numbers.
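As a quick check (a small sketch using the frames above), selecting one top-level column group out of df_b recovers the corresponding original frame:

# Pull a top-level column group back out of the joined frame; this is
# df2's data with the top level of the columns stripped off.
print(df_b['c3'])
# An equivalent cross-section via .xs:
print(df_b.xs('c3', axis=1))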
# Explicit axis=0; same (problematic) result as df_a above.
df_c = pd.concat([df1, df2, df3], axis=0)
In [11]: df_c
Out[11]:
c1 c2 c3 c4
0 1 0 x y x y
a x NaN NaN NaN NaN NaN NaN NaN
y NaN NaN NaN NaN NaN NaN NaN
b x NaN NaN NaN NaN NaN NaN NaN
y NaN NaN NaN NaN NaN NaN NaN
a x NaN NaN NaN -0.524225 0.245902 NaN NaN
y NaN NaN NaN 0.286759 0.365465 NaN NaN
b x NaN NaN NaN 0.542192 1.605373 NaN NaN
y NaN NaN NaN -0.554882 0.124332 NaN NaN
a x NaN NaN NaN NaN NaN -0.339814 -0.399197
y NaN NaN NaN NaN NaN 0.053150 0.368418
b x NaN NaN NaN NaN NaN 0.314094 0.228735
y NaN NaN NaN NaN NaN 0.030094 0.005102
This is the same as df_a, since axis=0 is the default for concat.
# Rebuild df1 with string subindices ('0', '1') instead of integers.
col = pd.MultiIndex.from_tuples([('c1', '0'), ('c1', '1'), ('c2', '0')], names=['f', 's'])
dat = np.random.randn(len(idx), len(col))
df4 = pd.DataFrame(dat, index=idx, columns=col)
# With all subindices the same type, the vertical concat behaves.
df_d = pd.concat([df4, df2, df3], axis=0)
In [12]: df_d
Out[12]:
c1 c2 c3 c4
0 1 0 x y x y
a x 0.337708 -0.677786 2.774935 NaN NaN NaN NaN
y 1.661593 -1.035824 -1.680486 NaN NaN NaN NaN
b x -0.797928 -1.212866 -1.161657 NaN NaN NaN NaN
y -0.750026 0.442908 -0.086385 NaN NaN NaN NaN
a x NaN NaN NaN -0.524225 0.245902 NaN NaN
y NaN NaN NaN 0.286759 0.365465 NaN NaN
b x NaN NaN NaN 0.542192 1.605373 NaN NaN
y NaN NaN NaN -0.554882 0.124332 NaN NaN
a x NaN NaN NaN NaN NaN -0.339814 -0.399197
y NaN NaN NaN NaN NaN 0.053150 0.368418
b x NaN NaN NaN NaN NaN 0.314094 0.228735
y NaN NaN NaN NaN NaN 0.030094 0.005102
Note how the subindices are not a problem anymore, since they are all of the same type (string).
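To double-check (a small sketch), inspect the levels of the combined column index; the second level now holds strings only ('0', '1', 'x', 'y'):

# The levels of the combined column index; every entry in the second
# level is a string now, so nothing gets dropped on concat.
print(df_d.columns.levels)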