SQL for Data Science - Quest #3
2 min read

SQL for Data Science - Quest #3

It's time for another SQL for Data Science Quest and today we're gonna look at a rather grim dataset, namely the traffic fatalities dataset for the US in 2016. The dataset is publicly available in BigQuery as bigquery-public-data.nhtsa_traffic_fatalities.accident_2016.

This post is part of the series 'SQL for Data Science'. In this series I take you through SQL for Data Science quests, not just to prepare you for interviews but also to make you more aware of some of the SQL tricks that are out there, and have you practice along in the process. Quest 1 and Quest 2 can be found here and here respectively.

Problem Statement

Suppose we have this traffic fatalities dataset which lists the fatalities that were recorded for 2016 in the US, one fatality per row.

The data looks something like this:

Can you now find the state_names and their accident count, where the accident count is below the average number of accidents per state?

Step-by-Step Solution

You know the drill, let's break it down!👇

Get the accident counts per state

We group by state_name and then take a COUNT(*), to count the number of accidents per state. Note that we count the number of accidents, not the number of fatalities (we leave that as an exercise to the reader  😁).

SELECT
  state_name,
  COUNT(*) AS accident_count
FROM
  `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
GROUP BY
  state_name

Wrap it in a nice WITH statement to use in our next query

With the accident counts per state in our WITH Statement, we can use the average accident_count as a filter in our next SELECT statement:

WITH
  accident_counts AS (
  SELECT
    state_name,
    COUNT(*) AS accident_count
  FROM
    `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
  GROUP BY
    state_name)

SELECT
  *
FROM
  accident_counts
WHERE
  accident_count < (
  SELECT
    AVG(accident_count)
  FROM
    accident_counts)
ORDER BY accident_count

So basically all the data is in the accident_counts table. We leverage that table not only to select rows, but also to filter on the AVG(accident_count) from that table. Pretty nice, huh!?

Results

After running the query, you will see something like this:

As you can see, District of Columbia, Rhode Island and Vermont have the lowest number of accidents in 2016. In total, there are 32 states with a lower than average accident count in 2016. If you dive a bit deeper and run: SELECT AVG(accident_count) FROM accident_counts you will find that the average number of accidents in 2016 was ~681.

Next steps

That's it for today. Hope you learned something from this Quest. Check out the other quests here.

If you want to practice some more, you can calculate the number of fatalities (not accidents) per state in 2016, and find those states with higher than average fatalities. You could also try to find the states that have the highest number of multiple fatality accidents.

Let's keep in touch! 📫

If you would like to be notified whenever I post a new article, you can sign up for my email newsletter here.

If you have any comments, questions or want to collaborate, please email me at lucy@lucytalksdata.com or drop me a message on Twitter.