Pandas DataFrame Indexing Explained: from .loc to .iloc and beyond
9 min read

Pandas DataFrame Indexing Explained: from .loc to .iloc and beyond

Pandas DataFrame Indexing Explained: from .loc to .iloc and beyond
Photo by Pascal Müller / Unsplash

Pandas makes it easy to wrangle your data. From BigQuery data,  to csv or even Excel, with Pandas you can easily load it into a DataFrame.

Once your data is loaded into a DataFrame, there are many ways to index it but the syntax for iloc, loc etc, can be easily confused.

That's why today, we'll take a deep dive through Pandas indexing land.

I'll take you through some examples and explain which method to use when for effective data selection.

What is indexing?

Indexing can be described as selecting values from specific rows and columns in a dataframe. The row labels (the dataframe index) can be integer or string values, the column labels are usually strings.

By indexing, we can be very particular about the selections we make, zooming in on the exact data that we need.

We'll go over the following approaches:

  1. the vanilla way: using standard brackets []
  2. .loc
  3. .iloc
  4. Bonus approaches: df.where(), df.query(), df.get(), df.at(), df.iat()

Install dependencies & load some data

Make sure to pip install pandas and seaborn and then run the following code to load in some exercise data.

import seaborn as sns
import pandas as pd
df = sns.load_dataset('exercise', index_col=[0])
df.head()
The first 5 rows of our dataset

The vanilla way: indexing with regular brackets []

It's fast, it's easy: quering Pandas DataFrames with regular brackets is very common. I myself use it all the time for quick analyses in Jupyter Notebooks.

Below I'll describe some example usages:

Selecting one column:

df['diet'].head()
Selecting one column using brackets

Selecting multiple columns:

df[['diet', 'pulse']].head()
Selecting multiple columns using brackets

Selecting one row:

df[2:3]
Selecting one row using brackets
💡
Note that if you use df[3] it will return not row 3 but a column named 3. Which is likely not what you want.

Selecting multiple rows:

df[:3]
Selecting multiple rows using brackets

Although it will mostly work, it will not be the most efficient way to query your data. It is not as optimised as .loc and .iloc.

Why? Well, it is more implicit, so Pandas has to figure out stuff like 'are we indexing rows or columns?' and more, behind the scenes.

Therefore, it is not the recommended way to select data from your dataframe in production environments.

Also, if you want to filter on both rows and columns, it is much easier to use .loc. .loc is also much more flexible and explicit than regular brackets so keep reading to learn more about when and how to use it.

.loc: label-based indexing on columns and/or rows

We use .loc for the following use cases:

  • Selecting columns based on their names (labels)
  • Selecting rows based on their index labels (could also be numerical)
  • Boolean masking

Furthermore, the following is true about the .loc indexer:

  • Our selection is inclusive, we slice from the start label up to and including the stop label
  • Integers are accepted as labels, but they will be seen as labels, not as the position in the dataframe!
  • The syntax is df.loc[[row_indexer], [col_indexer]]

Selecting columns by their names

df.loc[:, ['diet', 'pulse']]
Selecting columns using .loc
💡
Note that the : operator here means that we select all of the rows.

Selecting a range of columns

df.loc[:, 'diet': 'time']
Selecting a range of columns using .loc

Here we select the range of columns from diet up to and including time, which in this case also includes pulse.

Selecting rows based on their labels (could also be numerical labels)

.loc can also be used to select rows based on their index labels. For example:

df.loc[1]
Selecting a row using .loc
df.loc[[1,2]]
Selecting multiple rows using .loc
df.loc[10:15]
Selecting a range of rows using .loc

We are implicitly selecting all columns here. If we want to be more explicit we could add the : operator. See:

df.loc[10:15, :]
💡
Be careful: .loc will select everything from your start up to and including your stop label, no matter whether the range is neatly ordered. This may or may not be what you want. Just know that this is how it works, to avoid getting unexpected results.

As an example, what if we sort by id, and select a range of row labels?

df.sort_values(by='id').loc[25:30]
'Unexpected' result of using numerical row indexer with .loc on data with unchronological index

As you can see, it didn't just get rows with index labels in the range 25-30, but also 31 and 32, simply because they were in between the start and the stop label.

So regard it more as a categorical label, it will still work with integers but it will see them as categories. Sorting your index chronologically will help to avoid these kinds of silent issues.

Indexing both rows and columns

Of course, you can also make a selection based on a row and column filter. For example:

df.loc[1:3, 'diet': 'time']
Indexing both rows and columns with .loc
df.loc[5, "diet"]
Indexing both rows and columns with .loc

Indexing with categorical row labels

In order to demonstrate how to index using categorical row labels, I've created this mini dataframe called small_df:

small_df = pd.DataFrame({"foo": [1,2], "bar": [3,4]}, index=['a', 'b'])
small_df

Now, in order to select the value at row 'a', column 'bar', we do:

small_df.loc['a', 'bar']
Indexing using categorical row and column labels

If we just want to retrieve the row 'b', we run:

small_df.loc['b']
Indexing rows with categorical labels

Boolean masks

.loc is also really helpful for using boolean masks.

Boolean masks are conditional statements in Pandas that return True or False for each row in your dataframe.

You can use one or more of these conditional 'boolean masks' to make a neat selection from your dataframe.

This is especially nice if you have multiple complex conditions and don't want to clutter up your selection statement.

See for example:

low_fat = df['diet'] == 'low fat'
pulse_above_100 = df['pulse'] > 100

df.loc[low_fat & pulse_above_100]
Pandas indexing with boolean masks
💡
Note: NAs will be converted to False in your boolean mask.

Now, we can also do boolean masking in combination with a column selection. That would look a little like this:

columns_to_select = ['time', 'kind']
df.loc[low_fat & pulse_above_100, columns_to_select]
Pandas indexing with boolean masks and column selection

Setting values using .loc

Sometimes you want to update values in a dataframe depending on a particular condition. You can use .loc to make that happen. See:

running = df['kind'] == 'running'

df.loc[running, "kind"] = "walking"
Result of setting values using .loc

If you now query on df.loc[running] you will see that 'running' was replaced by 'walking' in the kind column.

iloc: Indexing based on position

We use .iloc for the following use cases:

  • Selecting rows (and potentially columns) based on their integer positions in the DataFrame

What's also important to note is that contrary to .loc, .iloc selection is exclusive, meaning that the stop position is not included in the final result.

Selecting rows based on their integer row position

With .iloc we can easily select rows based on the integer position of the rows in the DataFrame.

This doesn't necessarily mean that the integer position corresponds to the index 'label value' of the row, as we shall see soon.

First some examples to get us started:

df.iloc[3]
Selecting the 4th row using .iloc
df.iloc[[2,4,6]]
Selecting multiple rows with .iloc
df.iloc[:3]
Selecting a range of rows using .iloc

Here we see it fetches the first three rows of the dataframe, which are rows 0, 1 and 2.

But what happens if we jumble the index?

For instance we can run the following:

sample = df.sample(10)
sample
sample.iloc[:3]
sample.iloc[:3] will fetch the first 3 rows regardless of index 'name'

Here you can really see that it is position based and doesn't care about the index at all.

So in this case we get the 0th up to and including the 2nd element of the DataFrame that we are selecting from.

You could also select columns using this syntax, but again using index values (exclusive):

df.iloc[1:5, 1:3]
Integer based row AND column selection using .iloc

So this gives you the second to the fifth row and the second and third columns. Again note that this is exclusive.

Bonus: df.where(), df.query(), df.get(), df.at(), df.iat()

Because you made it all the way to the bottom of this article, I've added some quick summaries of more filtering options that may be useful.

df.where(): Replaces every row that doesn't match the filter with NAs

low_fat = df['diet'] == 'low fat'
df_lf = df.where(low_fat)
df.where() replaces non-matching rows with NaNs

df.query(): quickly select rows based on explicit SQL-like filters

df.query("pulse < 85")
df.query() for SQL-like querying of your dataframe

df.get(): select a column, dictionary-style

df.get('diet')
df.get() for dictionary-style column selection

df.at(): access single value for row, column label pair

df.at[0, 'diet']
Accessing a single value for a row/column pair with df.at()

df.iat(): access single value for row, column position pair

df.iat[0, 2]
Accessing a single value for a row/column position pair with df.iat()

Summary

That's it! In this blog post, I've explained that vanilla brackets are to be used sparingly but can be handy for quick and simple Pandas indexing.

.loc is a label-based indexer that makes it easy to filter on rows, columns or both. It also supports boolean masking. Be aware that it sees integers as categorical labels, too!

.iloc is an integer based indexer that makes it easy to select a bunch of rows at a specific (integer) position in the dataframe. It is less helpful in selecting columns.

Finally, I've given you some alternative methods that you can use to spice up your Pandas indexing game: df.where(), df.query(), df.get(), df.at() and df.iat().

Hope you have learned something useful, let me know if you would like more of this type of content!

Let's keep in touch! 📫

If you would like to be notified whenever I post a new article, you can sign up for my email newsletter here.

If you have any comments, questions or want to collaborate, please email me at lucy@lucytalksdata.com or drop me a message on Twitter.