5 Useful Pandas Methods You May Not Know Existed - Part I
I use Pandas almost every day for work and side projects. I probably don't need to convince you that Pandas is one of the most powerful libraries out there for data scientists and the like.
What I will say is this: most people learn the basics, and most of the time that is enough. Yet some features are simply too good not to know about. This post walks you through 5 useful but lesser-known Pandas methods. In Part II, we will cover 5 more, so stay tuned for that.
.pipe()
The .pipe() method makes it easy to apply multiple functions to a DataFrame object. Think of all the times you had to run various preprocessing functions and ended up writing something like this:
do_last(do_third(do_second(do_first(df))))
.pipe() allows you to take the output from the first operation and pipe it through to the next operation, and so on.
It's easiest to demonstrate this with an example.
Suppose you have some dataset that you want to analyse, like this Reddit Jokes dataset:
[Preview of the Reddit Jokes dataset loaded as a DataFrame]
Before you run your analysis, you would like to preprocess the data. You write a couple of simple utility functions for this:
import pandas as pd

def load_dataset():
    df = pd.read_json('../data/reddit_jokes.json')
    return df

def drop_na(df):
    return df.dropna()

def select_relevant(df, cols):
    return df[cols]

def to_lower(df, cols):
    for col in cols:
        df[col] = df[col].str.lower()
    return df

def strip_whitespace(df, cols):
    for col in cols:
        df[col] = df[col].str.replace('\n', ' ')
        df[col] = df[col].str.strip()
    return df

def remove_punctuation(df, cols):
    for col in cols:
        df[col] = df[col].str.replace(r'[^\w\s]', '', regex=True)
    return df
Now, instead of writing this:
df = load_dataset()
df = drop_na(df)
df = select_relevant(df, ['title', 'body', 'score'])
df = to_lower(df, ['title', 'body'])
df = strip_whitespace(df, ['title', 'body'])
df = remove_punctuation(df, ['title', 'body'])
Or, god forbid, even this:
df = select_relevant(drop_na(load_dataset()), ['title', 'body', 'score'])
# omitted the last few, you get the point
We can simply write:
df = (load_dataset()
      .pipe(drop_na)
      .pipe(select_relevant, ['title', 'body', 'score'])
      .pipe(to_lower, ['title', 'body'])
      .pipe(strip_whitespace, ['title', 'body'])
      .pipe(remove_punctuation, ['title', 'body'])
)
That's a lot cleaner, right? It reads very naturally, going from the first step to the last in simple, clearly named units.
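A bonus tip: .pipe() also works when the DataFrame isn't the function's first argument. Passing a (function, keyword) tuple tells .pipe() which keyword argument should receive the DataFrame. A minimal sketch, where filter_top and min_score are hypothetical names and a 'score' column (as in the jokes dataset) is assumed:

def filter_top(min_score, df):
    # Note: df is not the first argument here (hypothetical helper)
    return df[df['score'] >= min_score]

# Equivalent to calling filter_top(100, df=df):
df = df.pipe((filter_top, 'df'), 100)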
.read_clipboard()
As a data scientist, you will often get your data from databases, .csv files, or .xlsx files. Sometimes, however, you will see a nice snippet of data online, and copying and pasting it into a usable format can be really tedious.
Enter .read_clipboard().
You just copy the snippet and call the pd.read_clipboard() function. It parses your clipboard data and calls .read_csv() on it.
Let's illustrate this with an example.
Suppose we are looking at an online tutorial, where someone runs the following code (in the background):
df = pd.DataFrame([[1,2], [3,4], [5,6]], columns=["A", "B"])
The output will look something like this:
   A  B
0  1  2
1  3  4
2  5  6
Now simply copy the table.
Run df_copied = pd.read_clipboard() and, lo and behold, if you print df_copied, you will once again see your table, this time loaded nicely into a Pandas DataFrame, ready to be wrangled or even saved to disk.
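In code, the round trip looks something like this (a sketch; .read_clipboard() splits on whitespace by default and forwards any keyword arguments to .read_csv()):

import pandas as pd

# With the table above copied to the clipboard:
df_copied = pd.read_clipboard()

# Keyword arguments are passed on to read_csv(), e.g. for comma-separated data:
# df_copied = pd.read_clipboard(sep=',')

print(df_copied)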
.melt()
.melt() is useful when you want to wrangle your data into a different format. More specifically, you can use it to change your data from wide format to long format.
With .melt(), we identify one or more identifier columns, which stay in place. The remaining columns are value columns: they are unpivoted so that we are left with two new columns, one denoting the variable (previously the column name) and the other denoting the value.
This method can be tricky to grasp, so it's best illustrated with an example.
Suppose we have the following dataset:
df = pd.DataFrame({'Player': ['messi', 'zidane', 'ronaldo'],
                   'Goals': [10, 12, 15],
                   'Assists': [5, 3, 10]})
    Player  Goals  Assists
0    messi     10        5
1   zidane     12        3
2  ronaldo     15       10
We can now melt the Goals and Assists value columns into long format using .melt().
df_melted = pd.melt(df, id_vars=['Player'], value_vars=['Goals', 'Assists'])
Here, we first specify the identifier column 'Player', then the value variables 'Goals' and 'Assists'. The latter two are transformed into the variable/value format.
When you run the above snippet, you will get the following:
    Player variable  value
0    messi    Goals     10
1   zidane    Goals     12
2  ronaldo    Goals     15
3    messi  Assists      5
4   zidane  Assists      3
5  ronaldo  Assists     10
This can be pretty useful when you have multiple related columns and want to look at all the values together, or when you want to facet your plots on the variable column instead of creating multiple separate plots.
Indeed, we can now easily visualise the melted data using Seaborn.
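The exact chart from the original post isn't reproduced here, but a grouped bar chart is a natural fit for the melted data. A minimal sketch, assuming Seaborn and Matplotlib are installed:

import seaborn as sns
import matplotlib.pyplot as plt

# One group of bars per player, split by variable (Goals vs Assists):
sns.barplot(data=df_melted, x='Player', y='value', hue='variable')
plt.show()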

.eval()
This method evaluates a string expression that describes an operation on the DataFrame's columns. Let me explain with an example.
Suppose we again use the df_melted dataset from before.
We can use .eval() to create a new 'dummy' column expected_value like so:
df_melted.eval("expected_value = value * .8")
    Player variable  value  expected_value
0    messi    Goals     10             8.0
1   zidane    Goals     12             9.6
2  ronaldo    Goals     15            12.0
3    messi  Assists      5             4.0
4   zidane  Assists      3             2.4
5  ronaldo  Assists     10             8.0
Pretty sweet, right? It makes it very explicit what variable is being created. Note that .eval() returns a new DataFrame by default, so assign the result if you want to keep the column.
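.eval() can also reference local Python variables inside the expression using the @ prefix. A quick sketch (the multiplier variable is just an illustrative name):

multiplier = 0.8

# @multiplier is looked up in the local Python scope:
df_melted = df_melted.eval("expected_value = value * @multiplier")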
.read_gbq()
The .read_gbq() method makes it easy to read data from Google BigQuery using this one-liner:
df = pd.read_gbq("SELECT my_col FROM `my_dataset.my_table`", project_id='my_project')
There are a couple of things you need to do first, though:
- Make sure you're properly authenticated. The pandas-gbq documentation explains this well, and one approach is sketched below.
- Install pandas-gbq as an additional dependency alongside pandas.
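For instance, one common way to authenticate is with a service-account key file. A sketch, assuming the google-auth package is installed; the file path, project, dataset, and table names are placeholders:

from google.oauth2 import service_account
import pandas as pd

# Load credentials from a service-account key file (placeholder path):
credentials = service_account.Credentials.from_service_account_file('path/to/key.json')

df = pd.read_gbq(
    "SELECT my_col FROM `my_dataset.my_table`",
    project_id='my_project',
    credentials=credentials,
)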
Final thoughts
That's it! I hope you discovered at least one new method to add to your Pandas toolkit. :) I've been using these quite a bit over the past few months, and they've definitely helped me out.
Stay tuned for Part II of this series!
Let's keep in touch! 📫
If you would like to be notified whenever I post a new article, you can sign up for my email newsletter here.
If you have any comments, questions or want to collaborate, please email me at lucy@lucytalksdata.com or drop me a message on Twitter.