SQL for Data Science - Quest #3
It's time for another SQL for Data Science Quest and today we're gonna look at a rather grim dataset, namely the traffic fatalities dataset for the US in 2016. The dataset is publicly available in BigQuery as bigquery-public-data.nhtsa_traffic_fatalities.accident_2016
.
This post is part of the series 'SQL for Data Science'. In this series I take you through SQL for Data Science quests, not just to prepare you for interviews but also to make you more aware of some of the SQL tricks that are out there, and have you practice along in the process. Quest 1 and Quest 2 can be found here and here respectively.
Problem Statement
Suppose we have this traffic fatalities dataset which lists the fatalities that were recorded for 2016 in the US, one fatality per row.
The data looks something like this:

Can you now find the state_names
and their accident count
, where the accident count is below the average number of accidents per state?
Step-by-Step Solution
You know the drill, let's break it down!👇
Get the accident counts per state
We group by state_name and then take a COUNT(*), to count the number of accidents per state. Note that we count the number of accidents, not the number of fatalities (we leave that as an exercise to the reader 😁).
SELECT
state_name,
COUNT(*) AS accident_count
FROM
`bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
GROUP BY
state_name
Wrap it in a nice WITH statement to use in our next query
With the accident counts per state in our WITH Statement, we can use the average accident_count
as a filter in our next SELECT statement:
WITH
accident_counts AS (
SELECT
state_name,
COUNT(*) AS accident_count
FROM
`bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
GROUP BY
state_name)
SELECT
*
FROM
accident_counts
WHERE
accident_count < (
SELECT
AVG(accident_count)
FROM
accident_counts)
ORDER BY accident_count
So basically all the data is in the accident_counts table. We leverage that table not only to select rows, but also to filter on the AVG(accident_count)
from that table. Pretty nice, huh!?
Results
After running the query, you will see something like this:

As you can see, District of Columbia, Rhode Island and Vermont have the lowest number of accidents in 2016. In total, there are 32 states with a lower than average accident count in 2016. If you dive a bit deeper and run: SELECT AVG(accident_count) FROM accident_counts
you will find that the average number of accidents in 2016 was ~681.
Next steps
That's it for today. Hope you learned something from this Quest. Check out the other quests here.
If you want to practice some more, you can calculate the number of fatalities (not accidents) per state in 2016, and find those states with higher than average fatalities. You could also try to find the states that have the highest number of multiple fatality accidents.
Let's keep in touch! 📫
If you would like to be notified whenever I post a new article, you can sign up for my email newsletter here.
If you have any comments, questions or want to collaborate, please email me at lucy@lucytalksdata.com or drop me a message on Twitter.