It's time for another SQL for Data Science Quest and today we're gonna look at a rather grim dataset, namely the traffic fatalities dataset for the US in 2016. The dataset is publicly available in BigQuery as
This post is part of the series 'SQL for Data Science'. In this series I take you through SQL for Data Science quests, not just to prepare you for interviews but also to make you more aware of some of the SQL tricks that are out there, and have you practice along in the process. Quest 1 and Quest 2 can be found here and here respectively.
Suppose we have this traffic fatalities dataset which lists the fatalities that were recorded for 2016 in the US, one fatality per row.
The data looks something like this:
Can you now find the
state_names and their
accident count, where the accident count is below the average number of accidents per state?
You know the drill, let's break it down!👇
Get the accident counts per state
We group by state_name and then take a COUNT(*), to count the number of accidents per state. Note that we count the number of accidents, not the number of fatalities (we leave that as an exercise to the reader 😁).
SELECT state_name, COUNT(*) AS accident_count FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016` GROUP BY state_name
Wrap it in a nice WITH statement to use in our next query
With the accident counts per state in our WITH Statement, we can use the average
accident_count as a filter in our next SELECT statement:
WITH accident_counts AS ( SELECT state_name, COUNT(*) AS accident_count FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016` GROUP BY state_name) SELECT * FROM accident_counts WHERE accident_count < ( SELECT AVG(accident_count) FROM accident_counts) ORDER BY accident_count
So basically all the data is in the accident_counts table. We leverage that table not only to select rows, but also to filter on the
AVG(accident_count) from that table. Pretty nice, huh!?
After running the query, you will see something like this:
As you can see, District of Columbia, Rhode Island and Vermont have the lowest number of accidents in 2016. In total, there are 32 states with a lower than average accident count in 2016. If you dive a bit deeper and run:
SELECT AVG(accident_count) FROM accident_counts you will find that the average number of accidents in 2016 was ~681.
That's it for today. Hope you learned something from this Quest. Check out the other quests here.
If you want to practice some more, you can calculate the number of fatalities (not accidents) per state in 2016, and find those states with higher than average fatalities. You could also try to find the states that have the highest number of multiple fatality accidents.
Let's keep in touch! 📫
If you would like to be notified whenever I post a new article, you can sign up for my email newsletter here.
If you have any comments, questions or want to collaborate, please email me at email@example.com or drop me a message on Twitter.