One of my favorite features of the pandas library is hierarchical indexing. The following examples illustrate how it works:
import pandas as pd
import pylab

# Row index with two levels: ('a' or 'b') and ('x' or 'y')
idx = pd.MultiIndex.from_tuples([('a','x'),('a','y'),('b','x'),('b','y')])
# Column index with named levels 'f' and 's'; the second level holds integers
col = pd.MultiIndex.from_tuples([('c1',0),('c1',1),('c2',0)],names=['f','s'])
dat = pylab.randn(len(idx),len(col))
df1 = pd.DataFrame(dat, index=idx, columns=col)
In [6]: df1
Out[6]:
f          c1                  c2
s           0         1         0
a x  0.091242  0.700668  1.755267
  y  0.933405  0.129897 -0.082570
b x -0.828174 -0.614804  0.299050
  y -2.034121 -0.560468 -0.443721
As can be seen, MultiIndex.from_tuples turns a list of
tuples into a hierarchical index. The cool thing about
the hierarchical index is that we can now select groups
of columns or rows like so:
In [13]: df1.c1
Out[13]:
s           0         1
a x  0.091242  0.700668
  y  0.933405  0.129897
b x -0.828174 -0.614804
  y -2.034121 -0.560468
Or like so:
In [17]: df1.loc['b']
Out[17]:
f        c1                  c2
s         0         1         0
x -0.828174 -0.614804  0.299050
y -2.034121 -0.560468 -0.443721
A more powerful example of label-based selection over a range is given later.
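The two selections above can also be spelled out with full tuples or with xs. The following is only a minimal sketch, assuming a reasonably recent pandas; it uses NumPy's random generator instead of pylab (my substitution, not the original setup) and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same index layout as df1 above; the data itself is random (assumed seed).
idx = pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y'), ('b', 'x'), ('b', 'y')])
col = pd.MultiIndex.from_tuples([('c1', 0), ('c1', 1), ('c2', 0)], names=['f', 's'])
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((len(idx), len(col))), index=idx, columns=col)

top = df['c1']                        # every column under 'c1' (like df.c1)
one_col = df[('c1', 1)]               # a single column, returned as a Series
rows_b = df.loc['b']                  # every row under 'b'
cell = df.loc[('b', 'y'), ('c2', 0)]  # one cell, addressed by full tuples
cross = df.xs('x', level=1)           # rows whose second index level is 'x'
```

Selecting with a partial key drops the level that was matched, which is why rows_b is indexed only by 'x' and 'y'.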
col = pd.MultiIndex.from_tuples([('c3','x'),('c3','y')])
dat = pylab.randn(len(idx),len(col))
df2 = pd.DataFrame(dat, index=idx, columns=col)
In [7]: df2
Out[7]:
           c3
            x         y
a x -0.524225  0.245902
  y  0.286759  0.365465
b x  0.542192  1.605373
  y -0.554882  0.124332
col = pd.MultiIndex.from_tuples([('c4','x'),('c4','y')])
dat = pylab.randn(len(idx),len(col))
df3 = pd.DataFrame(dat, index=idx, columns=col)
In [8]: df3
Out[8]:
           c4
            x         y
a x -0.339814 -0.399197
  y  0.053150  0.368418
b x  0.314094  0.228735
  y  0.030094  0.005102
df_a = pd.concat([df1,df2,df3])
In [9]: df_a
Out[9]:
     c1      c2        c3                  c4
      0   1   0         x         y         x         y
a x NaN NaN NaN       NaN       NaN       NaN       NaN
  y NaN NaN NaN       NaN       NaN       NaN       NaN
b x NaN NaN NaN       NaN       NaN       NaN       NaN
  y NaN NaN NaN       NaN       NaN       NaN       NaN
a x NaN NaN NaN -0.524225  0.245902       NaN       NaN
  y NaN NaN NaN  0.286759  0.365465       NaN       NaN
b x NaN NaN NaN  0.542192  1.605373       NaN       NaN
  y NaN NaN NaN -0.554882  0.124332       NaN       NaN
a x NaN NaN NaN       NaN       NaN -0.339814 -0.399197
  y NaN NaN NaN       NaN       NaN  0.053150  0.368418
b x NaN NaN NaN       NaN       NaN  0.314094  0.228735
  y NaN NaN NaN       NaN       NaN  0.030094  0.005102
If you find those NaNs in the topmost rows and leftmost
columns odd, I agree with you. What has happened here is
that the column subindex for df1 is an integer (0, 1), while
for df2 and df3 it is a string ('x', 'y').
Pandas (I used v0.11.0) for some reason does not combine
those indices consistently: for example, below you will see
that a concat on axis=1 goes smoothly.
And if you skip down a bit you will see that if we set the
subindices as strings ('0', '1') the combination goes fine.
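To see which type each column sub-level actually holds before concatenating, the levels of the MultiIndex can be inspected. This is a small sketch under the assumption of a recent pandas; the variable names are mine and only mirror the column layouts used above.

```python
import pandas as pd

# Column indices shaped like those of df1 and df2 above.
int_cols = pd.MultiIndex.from_tuples([('c1', 0), ('c1', 1), ('c2', 0)], names=['f', 's'])
str_cols = pd.MultiIndex.from_tuples([('c3', 'x'), ('c3', 'y')])

# levels[1] is the set of distinct labels in the second (sub) level.
int_sub_is_integer = pd.api.types.is_integer_dtype(int_cols.levels[1].dtype)
str_sub_is_integer = pd.api.types.is_integer_dtype(str_cols.levels[1].dtype)

print(int_cols.levels[1].dtype, str_cols.levels[1].dtype)
```

A mismatch between the two flags is the situation described above: one frame labels its sub-columns with integers, the other with strings.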
df_b = pd.concat([df1,df2,df3], axis=1)
In [10]: df_b
Out[10]:
f          c1                  c2        c3                  c4
s           0         1         0         x         y         x         y
a x  0.091242  0.700668  1.755267 -0.524225  0.245902 -0.339814 -0.399197
  y  0.933405  0.129897 -0.082570  0.286759  0.365465  0.053150  0.368418
b x -0.828174 -0.614804  0.299050  0.542192  1.605373  0.314094  0.228735
  y -2.034121 -0.560468 -0.443721 -0.554882  0.124332  0.030094  0.005102
Note how we don't have an issue here, even though some of the
subindices are numbers.
df_c = pd.concat([df1,df2,df3], axis=0)
In [11]: df_c
Out[11]:
     c1      c2        c3                  c4
      0   1   0         x         y         x         y
a x NaN NaN NaN       NaN       NaN       NaN       NaN
  y NaN NaN NaN       NaN       NaN       NaN       NaN
b x NaN NaN NaN       NaN       NaN       NaN       NaN
  y NaN NaN NaN       NaN       NaN       NaN       NaN
a x NaN NaN NaN -0.524225  0.245902       NaN       NaN
  y NaN NaN NaN  0.286759  0.365465       NaN       NaN
b x NaN NaN NaN  0.542192  1.605373       NaN       NaN
  y NaN NaN NaN -0.554882  0.124332       NaN       NaN
a x NaN NaN NaN       NaN       NaN -0.339814 -0.399197
  y NaN NaN NaN       NaN       NaN  0.053150  0.368418
b x NaN NaN NaN       NaN       NaN  0.314094  0.228735
  y NaN NaN NaN       NaN       NaN  0.030094  0.005102
This is the same as the first concat without an axis argument,
since axis=0 is the default.
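The difference between the two axes is easy to check on a smaller pair of frames. A minimal sketch, assuming a recent pandas and using NumPy's random generator instead of pylab; make is a hypothetical helper that exists only for this illustration.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y'), ('b', 'x'), ('b', 'y')])
rng = np.random.default_rng(0)

def make(top):
    """Build a 4x2 frame whose columns are (top, 'x') and (top, 'y')."""
    col = pd.MultiIndex.from_tuples([(top, 'x'), (top, 'y')])
    return pd.DataFrame(rng.standard_normal((len(idx), len(col))), index=idx, columns=col)

left, right = make('c3'), make('c4')

stacked_default = pd.concat([left, right])        # no axis given
stacked_axis0 = pd.concat([left, right], axis=0)  # rows appended, columns unioned
side_by_side = pd.concat([left, right], axis=1)   # columns appended, rows aligned

print(stacked_default.shape, side_by_side.shape)
```

With axis=0 the frames are stacked and every frame gets NaN in the columns it does not own; with axis=1 the rows are aligned on the shared index, so no NaNs appear here.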
col = pd.MultiIndex.from_tuples([('c1','0'),('c1','1'),('c2','0')],names=['f','s'])
dat = pylab.randn(len(idx),len(col))
df4 = pd.DataFrame(dat, index=idx, columns=col)
df_d = pd.concat([df4,df2,df3], axis=0)
In [12]: df_d
Out[12]:
           c1                  c2        c3                  c4
            0         1         0         x         y         x         y
a x  0.337708 -0.677786  2.774935       NaN       NaN       NaN       NaN
  y  1.661593 -1.035824 -1.680486       NaN       NaN       NaN       NaN
b x -0.797928 -1.212866 -1.161657       NaN       NaN       NaN       NaN
  y -0.750026  0.442908 -0.086385       NaN       NaN       NaN       NaN
a x       NaN       NaN       NaN -0.524225  0.245902       NaN       NaN
  y       NaN       NaN       NaN  0.286759  0.365465       NaN       NaN
b x       NaN       NaN       NaN  0.542192  1.605373       NaN       NaN
  y       NaN       NaN       NaN -0.554882  0.124332       NaN       NaN
a x       NaN       NaN       NaN       NaN       NaN -0.339814 -0.399197
  y       NaN       NaN       NaN       NaN       NaN  0.053150  0.368418
b x       NaN       NaN       NaN       NaN       NaN  0.314094  0.228735
  y       NaN       NaN       NaN       NaN       NaN  0.030094  0.005102
Note how the subindices are not a problem anymore, since they are all
of the same type (string).
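If a frame already exists with integer subindices, one way to get to the same place is to rebuild its column index with the sub-level cast to strings before concatenating. This is only a sketch of that idea, assuming a recent pandas; the frame names are made up, and NumPy's random generator stands in for pylab.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y'), ('b', 'x'), ('b', 'y')])
rng = np.random.default_rng(0)

int_cols = pd.MultiIndex.from_tuples([('c1', 0), ('c1', 1), ('c2', 0)], names=['f', 's'])
str_cols = pd.MultiIndex.from_tuples([('c3', 'x'), ('c3', 'y')])
df_int = pd.DataFrame(rng.standard_normal((4, 3)), index=idx, columns=int_cols)
df_str = pd.DataFrame(rng.standard_normal((4, 2)), index=idx, columns=str_cols)

# Rebuild the column index, turning each integer sub-label into a string.
fixed = df_int.copy()
fixed.columns = pd.MultiIndex.from_tuples(
    [(f, str(s)) for f, s in df_int.columns], names=df_int.columns.names
)

combined = pd.concat([fixed, df_str], axis=0)
print(combined)
```

After the cast, the columns are addressed with string keys such as ('c1', '0') rather than ('c1', 0).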