A Trip Through Pandas Part Two

A Trip Through Pandas Part Two

Advanced pandas: part two of the pandas library tour

2020-11-23

In the previous article we started learning some important stuff about Pandas, this great library of Data Science. From the creation of DataFrames to the description and alteration of them, a lot of topics were covered. But Pandas is not done - not by a long shot. In this article we will continue with several functions of Pandas that may not be so well known, but are certainly useful.

Grouping and Counting

We are going to continue with the wine dataset that we used in the previous article. There, it is explained how to load it, as well as the basics. This time, however, we are going to study our dataset. Let’s say we want to know how many unique wines we have.

We can do:


df.Wine.unique()

 array([1, 2, 3], dtype=int64)

There are three different wines. But how many registers we have of each?
We can group by the Wine column in the DataFrame:


df.groupby('Wine').Wine.count()

Wine
1    59
2    71
3    48
Name: Wine, dtype: int64

What is going on here? With the groupby function we can select a set of columns (or just one) by which to group the DataFrame. We got what is called a DataFrameGroupBy object. It has the same columns as the original DataFrame, but now grouped. And with these groups, we can perform operations such as count, min, or even apply lambda functions:


df.groupby('Wine').apply(lambda df: df.loc[df.Alcohol == df.Alcohol.max()])

          Wine  Alcohol  Malic.acid   Ash   Acl  Mg  ...  Nonflavanoid.phenols  Proanth  Color.int   Hue    OD  Proline
Wine                                                 ...
1    8       1    14.83        1.64  2.17  14.0  97  ...                  0.29     1.98       5.20  1.08  2.85     1045
2    71      2    13.86        1.51  2.67  25.0  86  ...                  0.21     1.87       3.38  1.36  3.16      410
3    158     3    14.34        1.68  2.70  25.0  98  ...                  0.53     2.70      13.00  0.57  1.96      660

[3 rows x 14 columns]

A very long line, but let’s break it down. It groups by wine, that is clear. And we apply a lambda to the resulting DataFrame. It will take the register where Alcohol is maximum for each of the wines. And this just opens endless possibilities, such as the .agg() function.

.agg()

What is this function? We have shown that we can group by elements and apply functions. But sometimes, when we want to perform an analysis we want a lot of data. And this is where .agg() comes in. Instead of calling the functions one by one, we can aggregate all the data we want:


 wine_mg = df.groupby(['Wine','Mg']).Alcohol.agg([len, min, max])

          len    min    max
Wine Mg
1    89   1.0  13.75  13.75
     90   1.0  13.41  13.41
     91   1.0  14.75  14.75
     92   1.0  13.05  13.05
     93   1.0  13.39  13.39
...       ...    ...    ...
3    113  1.0  13.08  13.08
     116  1.0  13.11  13.11
     120  2.0  13.17  13.27
     122  1.0  12.86  12.86
     123  1.0  13.50  13.50

[94 rows x 3 columns]

What we have done is grouping first by Wine and then by Mg, and getting the length, the maximum and the minimum. Depending on the DataFrame, we can do more or less complex groupings, such as grouping by country and region. And this DataFrame ties in with the next topic perfectly.

Multi-index

The keen eyed will have noticed an important difference in this new DataFrame with respect to the ones we have been working with: it is using a Multi-index. This means that it is indexed by two different values. Usually, this grants a fine granularity for a more sophisticated analysis.
When you have high dimensionality, or a hierarchical set of values, this can be a very interesting tool. In this tutorial, we are going to see the basics of indexing, and how to return to a simple index. Given that for a lot of applications it will be easier to convert a Multi-index to a regular indexed DataFrame and use the methods you already know.

The Multi-index is hierarchical, which means that it is ordered. This way, if you do:


wine_mg.loc[1]

     len    min    max
Mg
89   1.0  13.75  13.75
90   1.0  13.41  13.41
91   1.0  14.75  14.75
...  ...  ...    ...
127  1.0  14.23  14.23
128  1.0  14.22  14.22
132  1.0  13.76  13.76

You get the DataFrame where Wine is 1, with the single remaining index. You can manipulate that new DataFrame to get specific values, or you can do:


 wine_mg.loc[(1,95):(1,100)]

          len    min    max
Wine Mg
1    95   2.0  12.85  14.12
     96   3.0  13.50  14.39
     97   1.0  14.83  14.83
     98   3.0  13.05  13.86
     100  2.0  13.20  13.48

To get the values that go from Wine == 1 and Mg == 95 to Wine == 1 and Mg == 100.

Indexing with multiple indices can get complicated very quickly and it depends on each analysis if it is the proper tool. In case you want to go back to single indices, you can just do:


wine_mg.reset_index()

   Wine   Mg  len    min    max
0      1   89  1.0  13.75  13.75
1      1   90  1.0  13.41  13.41
2      1   91  1.0  14.75  14.75
3      1   92  1.0  13.05  13.05
4      1   93  1.0  13.39  13.39
..   ...  ...  ...    ...    ...
89     3  113  1.0  13.08  13.08
90     3  116  1.0  13.11  13.11
91     3  120  2.0  13.17  13.27
92     3  122  1.0  12.86  12.86
93     3  123  1.0  13.50  13.50

[94 rows x 5 columns]

That way, Wine and Mg are independent columns and you can use the methods explained previously. That way, you decide how to work your data.

Sorting

Finally in this section, we are going to see how to sort the data.
It is fairly simple: just calling the function sort_values:


 wine_mg.sort_values(by='len')

          len    min    max
Wine Mg
1    89   1.0  13.75  13.75
3    122  1.0  12.86  12.86
2    96   1.0  11.45  11.45
     99   1.0  12.67  12.67
     100  1.0  12.64  12.64
...       ...    ...    ...
3    96   4.0  12.20  14.13
2    85   5.0  11.03  13.05
1    101  5.0  13.16  13.90
2    88   9.0  11.41  12.60
     86   9.0  11.82  13.86

[94 rows x 3 columns]

You just have to set the column (again, or columns) by which you want to sort and that’s it. Take note of the parameter ascending to have an ascending or descending sorting.

Combining DataFrames

In this section, we are going to see how to combine several related DataFrames for their study. First, let’s divide our DataFrame:


 df_one = df.loc[df.Wine == 1]
 df_two = df.loc[df.Wine == 2]
 df_three = df.loc[df.Wine == 3]

Now, there are two main methods of combining DataFrames. The first is to put one just underneath the other. Let’s say we have two DataFrames of the same thing but obtained at different dates and we want to put them together to have one big DataFrame:


df_one_two = pd.concat([df_one, df_two])

     Wine  Alcohol  Malic.acid   Ash   Acl  ...  Proanth  Color.int   Hue    OD  Proline
0       1    14.23        1.71  2.43  15.6  ...     2.29       5.64  1.04  3.92     1065
1       1    13.20        1.78  2.14  11.2  ...     1.28       4.38  1.05  3.40     1050
2       1    13.16        2.36  2.67  18.6  ...     2.81       5.68  1.03  3.17     1185
3       1    14.37        1.95  2.50  16.8  ...     2.18       7.80  0.86  3.45     1480
4       1    13.24        2.59  2.87  21.0  ...     1.82       4.32  1.04  2.93      735
..    ...      ...         ...   ...   ...  ...      ...        ...   ...   ...      ...
125     2    12.07        2.16  2.17  21.0  ...     1.35       2.76  0.86  3.28      378
126     2    12.43        1.53  2.29  21.5  ...     1.77       3.94  0.69  2.84      352
127     2    11.79        2.13  2.78  28.5  ...     1.76       3.00  0.97  2.44      466
128     2    12.37        1.63  2.30  24.5  ...     1.90       2.12  0.89  2.78      342
129     2    12.04        4.30  2.38  22.0  ...     1.35       2.60  0.79  2.57      580

[130 rows x 14 columns]

The .concat() function from Pandas takes an array of DataFrames and pastes them together. The columns named the same will combine, while the columns with different names will be created anew; and the DataFrame without said column will be filled with NaN.

If, on the contrary, we want to compare both DataFrames, we can use the join function. That function will take two DataFrames with the same index and join them together. Let’s say we want to get all the registries joined by the same value of Mg.

We do the following:


left = df_one.set_index(['Mg'])
right = df_two.set_index(['Mg'])
mg_join = left.join(right, lsuffix='_one', rsuffix='_two')

     Wine_one  Alcohol_one  Malic.acid_one  Ash_one  Acl_one  Phenols_one  Flavanoids_one  ...  Flavanoids_two  Nonflavanoid.phenols_two  Proanth_two  Color.int_two  Hue_two  OD_two  Proline_two
Mg                                                                                         ...
89          1        13.75            1.73     2.41     16.0         2.60            2.76  ...             NaN                       NaN          NaN            NaN      NaN     NaN          NaN
90          1        13.41            3.84     2.12     18.8         2.45            2.68  ...            1.69                      0.43         1.56           2.45     1.33    2.26        495.0
90          1        13.41            3.84     2.12     18.8         2.45            2.68  ...            1.84                      0.66         1.42           2.70     0.86    3.30        315.0
91          1        14.75            1.73     2.39     11.4         3.10            3.69  ...             NaN                       NaN          NaN            NaN      NaN     NaN          NaN
92          1        13.05            1.73     2.04     12.4         2.72            3.27  ...            2.04                      0.39         2.08           2.70     0.86    3.02        312.0
..        ...          ...             ...      ...      ...          ...             ...  ...             ...                       ...          ...            ...      ...     ...          ...
124         1        13.05            2.05     3.22     25.0         2.63            2.68  ...             NaN                       NaN          NaN            NaN      NaN     NaN          NaN
126         1        14.06            1.63     2.28     16.0         3.00            3.17  ...             NaN                       NaN          NaN            NaN      NaN     NaN          NaN
127         1        14.23            1.71     2.43     15.6         2.80            3.06  ...             NaN                       NaN          NaN            NaN      NaN     NaN          NaN
128         1        14.22            3.99     2.51     13.2         3.00            3.04  ...             NaN                       NaN          NaN            NaN      NaN     NaN          NaN
132         1        13.76            1.53     2.70     19.5         2.95            2.74  ...             NaN                       NaN          NaN            NaN      NaN     NaN          NaN

[84 rows x 26 columns]

The values with the same Mg will appear as one row, and the rest will have NaN in the corresponding fields.

Pipelines and Method Chaining

The last part of this tutorial will be a way to create pipelines. All these methods can be applied step by step, but they can also be applied together or one after another automatically. This is particularly useful when wanting to do the same processing to several similar DataFrames, or when you want to create a clean structure to understand later what is the process.

Let’s say we want to get a random sample of the Ash of those elements with the same Mg value, but without the NaN.

We can do:


(mg_join
    .dropna()
    .sample(5)
    [['Ash_one', 'Ash_two']])

     Ash_one  Ash_two
Mg
101     2.67     2.40
97      2.17     1.70
92      2.04     1.94
94      2.60     2.70
112     2.70     2.23

Chaining methods like this can create a simple and readable way to perform multiple operations to a DataFrame.

``.assign()`

With .assign(), you can modify or create columns inline.
To create a column called Ash_Greater_Two we can do:


mg_join.assign(
    Ash_Greater_Two=mg_join.apply(lambda df: df.Ash_one > 2, axis='columns')
    )

     Wine_one  Alcohol_one  Malic.acid_one  Ash_one  ...  Hue_two  OD_two  Proline_two  Ash_Greater_Two
Mg                                                   ...
89          1        13.75            1.73     2.41  ...      NaN     NaN          NaN             True
90          1        13.41            3.84     2.12  ...     1.33    2.26        495.0             True
90          1        13.41            3.84     2.12  ...     0.86    3.30        315.0             True
91          1        14.75            1.73     2.39  ...      NaN     NaN          NaN             True
92          1        13.05            1.73     2.04  ...     0.86    3.02        312.0             True
..        ...          ...             ...      ...  ...      ...     ...          ...              ...
124         1        13.05            2.05     3.22  ...      NaN     NaN          NaN             True
126         1        14.06            1.63     2.28  ...      NaN     NaN          NaN             True
127         1        14.23            1.71     2.43  ...      NaN     NaN          NaN             True
128         1        14.22            3.99     2.51  ...      NaN     NaN          NaN             True
132         1        13.76            1.53     2.70  ...      NaN     NaN          NaN             True

[84 rows x 27 columns]

.pipe()

Finally, the method .pipe() allows for a modification of the whole DataFrame inline, given a passed function. This is a very specific function that can lead to interesting results, when needed. In our case let’s say we want to remove all the NaN from the DataFrame.
We do the following:


mg_join.pipe(lambda df: df.dropna())

     Wine_one  Alcohol_one  Malic.acid_one  Ash_one  ...  Color.int_two  Hue_two  OD_two  Proline_two
Mg                                                   ...
90          1        13.41            3.84     2.12  ...           2.45    1.330    2.26        495.0
90          1        13.41            3.84     2.12  ...           2.70    0.860    3.30        315.0
92          1        13.05            1.73     2.04  ...           2.70    0.860    3.02        312.0
...         ...      ...              ...      ...   ...           ...     ...      ...         ...
110         1        13.51            1.80     2.65  ...           3.17    1.020    1.93        750.0
110         1        13.28            1.64     2.84  ...           3.17    1.020    1.93        750.0
112         1        14.20            1.76     2.45  ...           2.65    0.960    2.52        500.0
112         1        13.63            1.81     2.70  ...           2.65    0.960    2.52        500.0

[59 rows x 26 columns]

This function is a not-so-well-known one, but there are specific pipelines that can greatly benefit from it, when the modification of a DataFrame is very complex and needs a multiline function. Or sometimes, just the clarity of a short lambda can be decisive to choose this method. Anyway, now it is also in your toolbox.

Conclusion

In this second part of the tutorial, we have seen more obscure Pandas function. They are not usually seen in regular tutorials because the use-cases are more limited; but that can also be seen as more specific tools for more fine-grained jobs. With this, you have now the ability to process the dataset of your project with the proper functions, and to reach the goals writing clean and concise code that does what it needs to do.