Pandas DataFrame Indexing Explained: from .loc to .iloc and beyond
Pandas makes it easy to wrangle your data. From BigQuery data, to csv or even Excel, with Pandas you can easily load it into a DataFrame.
Once your data is loaded into a DataFrame, there are many ways to index it but the syntax for iloc
, loc
etc, can be easily confused.
That's why today, we'll take a deep dive through Pandas indexing land.
I'll take you through some examples and explain which method to use when for effective data selection.
What is indexing?
Indexing can be described as selecting values from specific rows and columns in a dataframe. The row labels (the dataframe index) can be integer or string values, the column labels are usually strings.
By indexing, we can be very particular about the selections we make, zooming in on the exact data that we need.
We'll go over the following approaches:
- the vanilla way: using standard brackets
[]
.loc
.iloc
- Bonus approaches:
df.where(), df.query(), df.get(), df.at(), df.iat()
Install dependencies & load some data
Make sure to pip install
pandas and seaborn and then run the following code to load in some exercise data.
import seaborn as sns
import pandas as pd
df = sns.load_dataset('exercise', index_col=[0])
df.head()

The vanilla way: indexing with regular brackets []
It's fast, it's easy: quering Pandas DataFrames with regular brackets is very common. I myself use it all the time for quick analyses in Jupyter Notebooks.
Below I'll describe some example usages:
Selecting one column:
df['diet'].head()

Selecting multiple columns:
df[['diet', 'pulse']].head()

Selecting one row:
df[2:3]

df[3]
it will return not row 3 but a column named 3. Which is likely not what you want.Selecting multiple rows:
df[:3]

Although it will mostly work, it will not be the most efficient way to query your data. It is not as optimised as .loc
and .iloc
.
Why? Well, it is more implicit, so Pandas has to figure out stuff like 'are we indexing rows or columns?' and more, behind the scenes.
Therefore, it is not the recommended way to select data from your dataframe in production environments.
Also, if you want to filter on both rows and columns, it is much easier to use .loc
. .loc
is also much more flexible and explicit than regular brackets so keep reading to learn more about when and how to use it.
.loc: label-based indexing on columns and/or rows
We use .loc
for the following use cases:
- Selecting columns based on their names (labels)
- Selecting rows based on their index labels (could also be numerical)
- Boolean masking
Furthermore, the following is true about the .loc
indexer:
- Our selection is inclusive, we slice from the start label up to and including the stop label
- Integers are accepted as labels, but they will be seen as labels, not as the position in the dataframe!
- The syntax is
df.loc[[row_indexer], [col_indexer]]
Selecting columns by their names
df.loc[:, ['diet', 'pulse']]

:
operator here means that we select all of the rows.Selecting a range of columns
df.loc[:, 'diet': 'time']

Here we select the range of columns from diet
up to and including time
, which in this case also includes pulse
.
Selecting rows based on their labels (could also be numerical labels)
.loc
can also be used to select rows based on their index labels. For example:
df.loc[1]

df.loc[[1,2]]

df.loc[10:15]

We are implicitly selecting all columns here. If we want to be more explicit we could add the :
operator. See:
df.loc[10:15, :]
.loc
will select everything from your start up to and including your stop label, no matter whether the range is neatly ordered. This may or may not be what you want. Just know that this is how it works, to avoid getting unexpected results.As an example, what if we sort by id, and select a range of row labels?
df.sort_values(by='id').loc[25:30]

As you can see, it didn't just get rows with index labels in the range 25-30, but also 31 and 32, simply because they were in between the start and the stop label.
So regard it more as a categorical label, it will still work with integers but it will see them as categories. Sorting your index chronologically will help to avoid these kinds of silent issues.
Indexing both rows and columns
Of course, you can also make a selection based on a row and column filter. For example:
df.loc[1:3, 'diet': 'time']

df.loc[5, "diet"]

Indexing with categorical row labels
In order to demonstrate how to index using categorical row labels, I've created this mini dataframe called small_df
:
small_df = pd.DataFrame({"foo": [1,2], "bar": [3,4]}, index=['a', 'b'])
small_df

Now, in order to select the value at row 'a', column 'bar', we do:
small_df.loc['a', 'bar']

If we just want to retrieve the row 'b', we run:
small_df.loc['b']

Boolean masks
.loc
is also really helpful for using boolean masks.
Boolean masks are conditional statements in Pandas that return True or False for each row in your dataframe.
You can use one or more of these conditional 'boolean masks' to make a neat selection from your dataframe.
This is especially nice if you have multiple complex conditions and don't want to clutter up your selection statement.
See for example:
low_fat = df['diet'] == 'low fat'
pulse_above_100 = df['pulse'] > 100
df.loc[low_fat & pulse_above_100]

Now, we can also do boolean masking in combination with a column selection. That would look a little like this:
columns_to_select = ['time', 'kind']
df.loc[low_fat & pulse_above_100, columns_to_select]

Setting values using .loc
Sometimes you want to update values in a dataframe depending on a particular condition. You can use .loc
to make that happen. See:
running = df['kind'] == 'running'
df.loc[running, "kind"] = "walking"

If you now query on df.loc[running]
you will see that 'running' was replaced by 'walking' in the kind
column.
iloc: Indexing based on position
We use .iloc
for the following use cases:
- Selecting rows (and potentially columns) based on their integer positions in the DataFrame
What's also important to note is that contrary to .loc
, .iloc
selection is exclusive, meaning that the stop position is not included in the final result.
Selecting rows based on their integer row position
With .iloc
we can easily select rows based on the integer position of the rows in the DataFrame.
This doesn't necessarily mean that the integer position corresponds to the index 'label value' of the row, as we shall see soon.
First some examples to get us started:
df.iloc[3]

df.iloc[[2,4,6]]

df.iloc[:3]

Here we see it fetches the first three rows of the dataframe, which are rows 0, 1 and 2.
But what happens if we jumble the index?
For instance we can run the following:
sample = df.sample(10)
sample

sample.iloc[:3]

Here you can really see that it is position based and doesn't care about the index at all.
So in this case we get the 0th up to and including the 2nd element of the DataFrame that we are selecting from.
You could also select columns using this syntax, but again using index values (exclusive):
df.iloc[1:5, 1:3]

So this gives you the second to the fifth row and the second and third columns. Again note that this is exclusive.
Bonus: df.where(), df.query(), df.get(), df.at(), df.iat()
Because you made it all the way to the bottom of this article, I've added some quick summaries of more filtering options that may be useful.
df.where(): Replaces every row that doesn't match the filter with NAs
low_fat = df['diet'] == 'low fat'
df_lf = df.where(low_fat)

df.query(): quickly select rows based on explicit SQL-like filters
df.query("pulse < 85")

df.get(): select a column, dictionary-style
df.get('diet')

df.at(): access single value for row, column label pair
df.at[0, 'diet']

df.iat(): access single value for row, column position pair
df.iat[0, 2]

Summary
That's it! In this blog post, I've explained that vanilla brackets are to be used sparingly but can be handy for quick and simple Pandas indexing.
.loc
is a label-based indexer that makes it easy to filter on rows, columns or both. It also supports boolean masking. Be aware that it sees integers as categorical labels, too!
.iloc
is an integer based indexer that makes it easy to select a bunch of rows at a specific (integer) position in the dataframe. It is less helpful in selecting columns.
Finally, I've given you some alternative methods that you can use to spice up your Pandas indexing game: df.where()
, df.query()
, df.get()
, df.at()
and df.iat()
.
Hope you have learned something useful, let me know if you would like more of this type of content!
Let's keep in touch! 📫
If you would like to be notified whenever I post a new article, you can sign up for my email newsletter here.
If you have any comments, questions or want to collaborate, please email me at lucy@lucytalksdata.com or drop me a message on Twitter.