5 Useful Pandas Methods That You May Not Know Existed - Part II

# 5 Useful Pandas Methods That You May Not Know Existed - Part II

This is Part II of the 5 Useful Pandas Methods That You May Not Know Existedseries. You can find Part I here. In this blog post, we'll go over another 5 useful but lesser known pandas methods.

Without further ado, let's dive in!

## .agg()

The .agg() method makes it super easy to calculate some aggregate statistics on our Pandas DataFrame.

Suppose we want to calculate some statistics on the seaborn planets dataset:

import seaborn as sns
import pandas as pd

planets = planets.dropna() # for simplicity

If we call planets.head(2) we get the following:

You can get started pretty fast with .agg(). All you need to run is:

planets.agg(['sum', 'max', 'min'])

Here, I've specified sum, max and min, but you could pick other statistics too. Its output will look like the following:

Now, the first thing that comes to mind when reading this, is the 'method' column. It really doesn't make sense to calculate these statistics for this column. Neither is taking the sum of 'year'...

So, really what we want is to specify specific columns and for each specify what aggregates to let loose on it. Luckily, that's easy too.

import statistics
import numpy as np

planets.agg({"mass": ['sum', 'min', 'max', 'mean', np.median],
"distance": ['sum', 'min', 'max', 'mean', np.median],
"year": ['min', 'max', statistics.mode]})

Here, within .agg() we specify which columns to calculate aggregates for and which aggregates we need per column.

This gives us the following:

Note here, that NaN denotes that we haven't specified that aggregate for that particular column.

Now, remember, we also had a categorical column. It is also possible to first group by a categorical column and then calculate the stats per group.

(planets.groupby('method')
.agg({"mass": ['mean', np.median],
"distance": ["mean", np.median],
"orbital_period":["mean", np.median]}
)
)

Note that planets.describe() also gives you some of these stats, but it doesn't provide you with all of the flexibility that agg() has to offer.

## .nlargest() / .nsmallest()

Do you often find yourself running the following to find the largest 5 items:

planets.sort_values('mass', ascending=False).head(5)

Did you know that there's a single command for that? It's called .nlargest() (or .nsmallest() for the smallest n items).

Simply run:

planets.nlargest(5, 'mass')

You will get the same result in a much more elegant way! A nice benefit is that it is also more performant than the former approach!

You can also sort using multiple columns:

planets.nsmallest(3, ['distance', 'mass'])

The optional 'keep' parameter specifies what you want to do with duplicate (sorting) values. So if within your N largest items, there are a couple of duplicate values for 'mass', you can specify whether you only want to select the first, the last or keep all of them (even if that means you will get more results than the specified N). Keep defaults to 'first'.

## .query()

Suppose we have a dataset of flights data:

flights = sns.load_dataset('flights')
flights.head(2)

.query() allows you to easily and verbosely filter your data much like you would in sql.

So if you want to only have flights from before 1950, where there were less than 150 passengers, you could simply run:

flights.query("year < 1950 and passengers < 150")

This will give you the following output:

## .assign()

The .assign() method allows you to add additional variables to your dataframe.

flights.assign(bad_month = flights['passengers'] < 150)

This can easily be used in combination with other statements, such as pipe(), groupby(),  .value_counts() etc. This is where assign() truly comes in handy.

For instance, in the below code snippet we group by year, take the sum of the passengers per year and create a new column 'bad_year' using assign() where the number of passengers in that year is below 2500.

(flights
.groupby('year')['passengers']
.sum()
.reset_index()
.assign(bad_year = lambda x: x.passengers < 2500)

)

## .rank() / .pct_change()

With the .rank() method you can rank items in your dataframe or across groups. So in the flights dataset, you could for instance rank the number of passengers from highest to lowest per year.

flights['rank'] = flights.groupby('year')['passengers'].rank(method='first', ascending=False)

Note, we're using method = first, which means that when there are multiple equal values, the one that came first in the dataframe will get the highest rank (where the highest rank to be had in this case is 1.0).

In the dataframe below you can see this in action for the months of July and August where the number of passengers are the same, and our earliest month (lowest index in dataframe) gets the highest rank.

You can also use different settings, such as min, which will give each of the 'duplicates' the lowest rank of the group or max, which will give them the highest rank of the group.

For this simple exercise, we're mainly interested in the passenger frequencies of the top month, we're not really interested in which exact month of the year it was, so we'll leave it like this.

So if we want to inspect the ranks for the year 1949, it would look like this:

flights.query('year == 1949').sort_values(by='rank')

Suppose then that you'd like to know how the highest ranked month each year performed over time. You can use rank in combination with .pct_change().

flights['pct_change'] = flights.groupby('rank')['passengers'].pct_change() * 100

flights.query('rank == 1')

This will first group your dataset by rank (e.g. the months ranked first will all be together in one group). Afterwards, it will calculate the percentage difference in number of passengers between each row (year) and its previous row (year) within that rank group.

Finally, we will filter by rank = 1, to inspect our percentage difference in passenger count for the busiest month for each year.

Here, you will see that the busiest months (rank = 1) have always increased over time (pct_change is positive), but the size of the increase fluctuates over time. Note, the pct_change()of 1949 is NaN because it doesn't have a previous value to make a comparison with.

## Concluding remarks

That's it for today! I hope you discovered at least one niche Pandas trick that you can use in your data projects. Let me know on Twitter which of these tricks you didn't know yet. If you have other tricks that I should know about, also let me know of course! :)

## Let's keep in touch! 📫

If you would like to be notified whenever I post a new article, you can sign up for my email newsletter here.

If you have any comments, questions or want to collaborate, please email me at lucy@lucytalksdata.com or drop me a message on Twitter.