Introduction to GeoPandas: Analysing 311 call data


Pandas and NumPy are a data scientist's best friends. They help clean up our data and prepare it for modelling, or make it easy to put together some awesome insights and stats. They even display our data in nice graphs for us, without much effort.

Yet, sometimes these friends need the help of Pandas' useful sidekick GeoPandas.

GeoPandas makes it easier to work with geospatial data in Pandas. It can help you do geospatial operations in pandas, as well as visualise your maps in a pretty way.

Today we'll demonstrate how easy it is to get started with GeoPandas by visualising 311 service calls in San Francisco.

The data

For this tutorial, I've used the San Francisco 311 calls dataset made available through Google BigQuery. I've kept the dataset nice and small for this demonstration, using data from just one day (11th of June 2021). In this tutorial I'm loading the data from CSV, but if you want to code along, you can fetch the same data with the following query:

SELECT
  *
FROM
  `bigquery-public-data.san_francisco_311.311_service_requests`
WHERE DATE(created_date) = '2021-06-11'

In order to plot shapes using GeoPandas we also need a shapefile covering the same area as the dataset: in this case, a shapefile of San Francisco's neighbourhoods. This can be found here.

Importing dependencies

import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Loading in the data

We'll start off by loading the 311 calls dataset into pandas.

df_calls = pd.read_csv('data/san_francisco_calls.csv')
df_calls = df_calls[~df_calls['neighborhood_geom'].isna()]

Note that in the second line, we filter out any rows where the neighborhood geometry (coordinates) is not available.

A quick call to .info() gives us the following overview of our columns:

Then we'll load in the shapefile using GeoPandas.

sf_map = gpd.read_file('./data/SF_Find_Neighborhoods/geo_export.shp')
sf_map.plot(figsize=(10,7))

The plot command plots the shapefile data and returns the following plot:

Note that the y-axis displays latitude and the x-axis longitude.

Plotting 311 calls on a map

Now, in order to plot our data on the map, we first have to make our original pandas dataframe a GeoPandas dataframe:

geo_df = gpd.GeoDataFrame(df_calls,
                          geometry=gpd.points_from_xy(df_calls.longitude, 
                                                      df_calls.latitude),
                          crs="EPSG:4326")

The crs (coordinate reference system) parameter specifies the projection of the coordinates. In this case, we want to use latitudes and longitudes so we use EPSG:4326.
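To make the CRS idea concrete, here's a minimal sketch with two made-up points near San Francisco (the coordinates are illustrative, not from the dataset). It also shows `to_crs`, which reprojects to a different coordinate system, e.g. a metric one if you ever need to measure distances in metres:

```python
import geopandas as gpd
import pandas as pd

# Toy lon/lat points (made-up coordinates near San Francisco)
toy = pd.DataFrame({'longitude': [-122.42, -122.41],
                    'latitude': [37.77, 37.78]})

# EPSG:4326 tells GeoPandas these numbers are plain latitudes/longitudes
gdf = gpd.GeoDataFrame(toy,
                       geometry=gpd.points_from_xy(toy.longitude, toy.latitude),
                       crs='EPSG:4326')

# Reproject to Web Mercator (a metric CRS) if you need distances in metres
gdf_m = gdf.to_crs('EPSG:3857')
```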

Now, we can plot the data:

fig, ax = plt.subplots(figsize=(10,10))
sf_map['geometry'].plot(ax=ax, zorder=1)
geo_df.plot(ax=ax, markersize=25, marker='o', 
            c='orange', alpha=.2, zorder=2)
ax.set_title('San Francisco 311 Calls on 11th of June 2021');

Here the main points to note are that we first plot the map, and subsequently plot our 311 calls data on top of that map. plt.subplots() is used to make sure that we plot both the map and the data in one figure.

We specify the zorder parameter, just to tell matplotlib which plot to build on top of which. In this case, we want the dots to be visible on top of the map, so we specify the data with a higher zorder.
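If zorder is new to you, here's a tiny standalone sketch of the idea (toy data, nothing from the dataset): whichever artist has the higher zorder is drawn on top.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, in case there's no display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# The base layer (the map, in our case) gets the lower zorder...
base, = ax.plot([0, 1], [0, 1], zorder=1)
# ...and the layer that should sit on top gets a higher one.
points = ax.scatter([0.2, 0.8], [0.5, 0.5], zorder=2)
```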

The output of the plot looks like this:

Now that we've seen how to plot our data on a map, it's time to answer some questions about our data using GeoPandas. For example...

Which neighbourhoods get the most calls?

In order to answer (and visualise) that question, we need to do some data wrangling first.

neighborhood_df = geo_df.copy()
neighborhood_df = neighborhood_df[['neighborhood', 
                                   'neighborhood_geom']]
neighborhood_calls = (neighborhood_df.groupby(['neighborhood', 
                                               'neighborhood_geom'])
                .size()
                .reset_index()
                .rename(columns={0: 'count'})
)

Here, we group our data by neighborhood, we call size() in order to get the number of calls per neighborhood, and rename that column to count.
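The same pattern in miniature, with made-up neighbourhood names (note that `size()` produces an unnamed column `0`, which is why the rename is needed):

```python
import pandas as pd

# Toy version of the groupby / size / rename pattern
toy = pd.DataFrame({'neighborhood': ['Mission', 'Mission', 'Sunset']})
counts = (toy.groupby('neighborhood')
             .size()                          # number of rows per group
             .reset_index()
             .rename(columns={0: 'count'}))   # size() names its column 0
# counts now has one row per neighborhood: Mission -> 2, Sunset -> 1
```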

If we then run:

neighborhood_calls.sort_values(by='count', ascending=False).head()

We get the following list of neighbourhoods ranked by calls (count):

Now, we again make the neighborhood_calls df into a GeoPandas one. The dataset has a very nice column, namely neighborhood_geom which contains the neighborhood shape data. All we have to do is specify that column as our geometry in the GeoDataFrame, again with EPSG:4326 and we're good to go:

neighborhood_calls['geometry'] = gpd.GeoSeries.from_wkt(neighborhood_calls['neighborhood_geom'])
geo_neighborhood_df = gpd.GeoDataFrame(neighborhood_calls, 
                          crs = 'EPSG:4326', 
                          geometry = 'geometry')
fig, ax = plt.subplots(figsize=(10,10))
sf_map['geometry'].plot(ax=ax, zorder=1, alpha=.95, cmap='YlOrRd')
geo_neighborhood_df[['geometry', 
                     'count']].plot(column = 'count', 
                                    ax=ax, 
                                    cmap = 'YlOrRd',
                                    legend = True, legend_kwds={'shrink': 0.3}, 
                                    markersize = 10)
ax.set_title('San Francisco 311 Call Counts per Neighborhood on 11th of June 2021');
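As a quick aside, `GeoSeries.from_wkt` parses well-known-text strings into geometry objects. A minimal sketch with a made-up unit square (not a real neighborhood shape):

```python
import geopandas as gpd

# A 1x1 square written as WKT (toy geometry, not from the dataset)
wkt_strings = ['POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))']
shapes = gpd.GeoSeries.from_wkt(wkt_strings)
# shapes.area gives the area of each geometry: 1.0 for the unit square
```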

From here it is clear that the neighborhoods in red in the northeast of San Francisco receive the most calls. From the list above we know these to be Mission and South of Market. But wouldn't it be nice to see this in the plot?

Now we could plot these as annotations into the plot, but since the text length varies, this could become really ugly really fast. Luckily, GeoPandas has some built-ins to make our lives a lot easier!

Making interactive plots in GeoPandas

First things first: not every area has 311 calls, but we still want to show those areas on the map. To check the discrepancy we run:

geo_neighborhood_df['neighborhood'].nunique(), sf_map['name'].nunique()

This outputs: (110, 117).

Meaning the geo_neighborhood_df dataframe (with our data) has 110 neighborhood names (with counts), while the San Francisco map has 7 more.

Let's merge the two dataframes and fill in the missing counts with 0:

sf_map_sm = sf_map[['name', 'geometry']]
geo_neigh_sm = geo_neighborhood_df[['neighborhood', 'count']]

full_sf_df = sf_map_sm.merge(geo_neigh_sm, 
                             left_on='name', 
                             right_on='neighborhood', 
                             how='left')

full_sf_df.fillna({"count": 0}, inplace=True)

First we only select the columns that we need. Then we merge the two datasets, making sure to left join because the sf_map_sm dataset has all of the neighborhood data, and therefore should always be 'kept'.

Finally, we fill all missing counts for those areas that don't have a count with a value of 0.
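The same left-join-then-fill pattern in miniature, with made-up neighbourhood names, so you can see why the `how='left'` matters:

```python
import pandas as pd

# Toy map (3 areas) and toy call counts (only 2 areas have calls)
sf_map_sm = pd.DataFrame({'name': ['Mission', 'Sunset', 'Presidio']})
geo_neigh_sm = pd.DataFrame({'neighborhood': ['Mission', 'Sunset'],
                             'count': [5, 2]})

# Left join: every map row is kept, even areas without calls
full = sf_map_sm.merge(geo_neigh_sm,
                       left_on='name', right_on='neighborhood',
                       how='left')
# 'Presidio' had no calls, so its count is NaN until we fill it with 0
full = full.fillna({'count': 0})
```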

Now if we run the following command, we get a basic map with our areas and counts:

full_sf_df.explore()

But this doesn't do the tool's full functionality justice. With some small tweaks we can reproduce the plot above, interactive style!

full_sf_df.explore(
     column="count", 
     tooltip=["name", "count"],
     tiles="CartoDB positron",
     cmap="YlOrRd",
     style_kwds=dict(color="black")
    )

Awesome, right? No need to switch to another tool to get interactive plots; GeoPandas can do this out of the box!

Investigating complaint types

There are 128 different complaint types, with many related to each other. So first thing we'll do is clean these up a bit.

geo_df['complaint_type_cleaned'] = geo_df['complaint_type'].copy()
geo_df['complaint_type_cleaned'] = geo_df['complaint_type_cleaned'].str.lower()

def clean_complaint(complaint):
    if 'graffiti' in complaint:
        return 'graffiti'
    if 'blocking' in complaint or 'blocked' in complaint:
        return 'blocking part road'
    if 'illegal postings' in complaint:
        return 'illegal postings'
    if 'encampment' in complaint:
        return 'encampment related'
    if 'pavement_defect' in complaint or 'sidewalk_defect' in complaint:
        return 'pavement or sidewalk defect'
    if 'trees' in complaint:
        return 'trees related'
    if 'park' in complaint:
        return 'park related'
    if 'garbage_can' in complaint:
        return 'garbage can related'
    if 'abandoned vehicle' in complaint:
        return 'abandoned vehicle related'
    if 'muni' in complaint:
        return 'MUNI related'
    if 'toter' in complaint:
        return 'toter related'
    if 'sign repair' in complaint:
        return 'sign repair related'
    if 'streetlight' in complaint:
        return 'streetlight related'
    if 'noise' in complaint:
        return 'noise related'
    if 'damage' in complaint:
        return 'damage related'
    
    return complaint

geo_df['complaint_type_cleaned'] = geo_df['complaint_type_cleaned'].apply(clean_complaint)

Note that this method is super coarse, but it will do for quick and dirty data analysis!

So we've grouped various complaints together using the function above. Now we take the top 20 complaint types and group the rest as 'other'. Finally, we drop some unnecessary columns:

# Note: in pandas >= 2.0, value_counts().reset_index() already names the
# columns 'complaint_type_cleaned' and 'count', so adjust the rename if needed.
complaint_counts = (geo_df['complaint_type_cleaned'].value_counts()
                        .reset_index()
                        .rename(columns={'index': 'complaint_type_cleaned',
                                         'complaint_type_cleaned': 'complaint_count'})
                        .head(20)  # value_counts sorts descending, so this keeps the top 20
)

geo_df_merged = geo_df.merge(complaint_counts, how='left')
geo_df_merged.loc[geo_df_merged.complaint_count.isna(), 'complaint_type_cleaned'] = 'other'
geo_df_complaint_types = geo_df_merged[['created_date', 'complaint_type_cleaned', 'geometry']]
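If the merge-based approach feels heavy, here's a simpler sketch of the same 'top N plus other' idea on a toy series (with N=2 and made-up complaint types):

```python
import pandas as pd

# Toy complaint types; 'noise' appears 3 times, 'graffiti' 2, 'trees' 1
s = pd.Series(['graffiti', 'graffiti', 'noise', 'noise', 'noise', 'trees'])

top_n = s.value_counts().head(2).index     # the 2 most frequent types
cleaned = s.where(s.isin(top_n), 'other')  # everything else becomes 'other'
# cleaned: ['graffiti', 'graffiti', 'noise', 'noise', 'noise', 'other']
```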

Now, we can again use the .explore() function on our GeoPandas dataframe. I highly recommend it for initial data analysis just to give you some impression of the data and to generate some hypotheses about the data you're working with.

geo_df_complaint_types.explore(
     column="complaint_type_cleaned", 
     tooltip=["complaint_type_cleaned", "created_date"],
     tiles="CartoDB positron",
     cmap="rainbow",
    )

Investigating specific complaint types

For this section I want to dive into two hypotheses. First, are certain complaint types related to each other? Second, do multiple people call in the same complaint about the same issue?

complaint_types = ['blocking part road', 'pavement or sidewalk defect']

fig, ax = plt.subplots(figsize=(10,10))
sf_map.plot(alpha=.95, ax=ax)
geo_df_complaint_types[geo_df_complaint_types['complaint_type_cleaned'].isin(complaint_types)] \
                        .plot(column='complaint_type_cleaned', 
                              cmap = 'rainbow', 
                              ax=ax,  
                              legend = True, 
                              markersize = 25,
                              alpha=.7)

This snippet plots the two selected complaint types on the map:

My hypothesis was that pavement or sidewalk defects may lead to blocking of the road. Looking at this plot, this could be the case in some areas, but this is certainly not often the case.

My second hypothesis was that multiple people may call in a complaint. From the plot, we can see that this is definitely true. We see many clusters of the same complaint popping up in the same streets or even reaching all the way into neighbouring streets.

You see how useful these plots are to investigate your data. With this knowledge, you could for instance build a new feature which counts the number of identical or similar complaints occurring in the same area on the same day. This could give 311 responders a sense of urgency of which requests to send responders to first.
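A hypothetical sketch of such a feature (column names and the `similar_calls_same_day` feature name are assumptions for illustration, not from the dataset): count how often the same complaint type occurs in the same neighbourhood on the same day.

```python
import pandas as pd

# Toy data: two identical Mission noise complaints on the same day
toy = pd.DataFrame({
    'neighborhood': ['Mission', 'Mission', 'Sunset'],
    'complaint_type_cleaned': ['noise related', 'noise related', 'graffiti'],
    'date': ['2021-06-11', '2021-06-11', '2021-06-11'],
})

# transform('size') broadcasts each group's size back onto its rows
toy['similar_calls_same_day'] = (toy
    .groupby(['neighborhood', 'complaint_type_cleaned', 'date'])['date']
    .transform('size'))
# The two Mission noise complaints each get a count of 2; the graffiti call gets 1
```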

Final thoughts

I hope this article gave you some inspiration on how GeoPandas can help you create awesome visuals of your geospatial data. It is quite easy to use once you get used to the syntax and a very useful library to have in your toolkit.

In a later blogpost I will dive into how to create timelapses with GeoPandas, Matplotlib and ImageIO. So stay tuned for that! :)

Let's keep in touch! 📫

If you would like to be notified whenever I post a new article, you can sign up for my email newsletter here.

If you have any comments, questions or want to collaborate, please email me at lucy@lucytalksdata.com or drop me a message on Twitter.