SQL for Data Science - Quest #1
In my day job, I use SQL a lot. And it goes without saying that SQL is hard.
In this series, I will take you through a couple of possible SQL for Data Science quests, not just to prepare you for interviews, but also to make you more aware of some of the SQL tricks out there and to have you practice along the way.
Problem Statement
Suppose you have a liquor sales dataset (available through BigQuery), which looks like the following:

Write a query that finds, for each city, the store with the highest sales and the store with the lowest sales.
Step-by-Step Solution
Group by relevant columns and aggregate the sales
First, we want to group by city and store_name and sum the sale_dollars to get the total sales per store, per city.
This will look something like this:
SELECT
  city,
  store_name,
  SUM(sale_dollars) AS total_sales
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE city IS NOT NULL -- for simplicity
GROUP BY city, store_name
Rank the stores per city
Now it gets a little bit tricky. We want to rank stores, but not over the whole dataset: we want to rank them within each city. We can achieve this by partitioning our data by city and computing a ranking over each partition. Luckily, SQL has a function that does just that: RANK() with an OVER (PARTITION BY ...) clause.
RANK() OVER (PARTITION BY city ORDER BY SUM(sale_dollars) DESC) AS top_sales_rank
Let's go over each component of this part of the query.
RANK() ranks our data within each group specified by the PARTITION BY clause (city in this case). Note that RANK() will assign equal ranks to equal sales, so if you want every row to get a distinct rank, you can use ROW_NUMBER() instead of RANK(); the snippet below illustrates the difference.
PARTITION BY city will split up the data by city and compute a ranking over each of those splits.
ORDER BY determines how to rank our data. In this case we want to rank our data per city with the highest sales first (SUM(sale_dollars) descending).
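To make the tie-handling concrete, here is a minimal, self-contained sketch. The toy CTE and its values are made up for illustration, not taken from the dataset:
WITH toy AS (
  SELECT 'Ames' AS city, 'Store A' AS store_name, 100 AS total_sales UNION ALL
  SELECT 'Ames', 'Store B', 100 UNION ALL
  SELECT 'Ames', 'Store C', 50
)
SELECT
  store_name,
  -- ties share a rank and the next rank is skipped: 1, 1, 3
  RANK() OVER (PARTITION BY city ORDER BY total_sales DESC) AS sales_rank,
  -- every row gets a distinct number, ties broken arbitrarily: 1, 2, 3
  ROW_NUMBER() OVER (PARTITION BY city ORDER BY total_sales DESC) AS sales_row_number
FROM toy
Store A and Store B tie on sales, so RANK() gives both a 1 and skips rank 2, while ROW_NUMBER() picks an arbitrary winner among them.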
Note that the problem statement also asks us to compute the lowest sales. So how do we do this? Exactly: we compute another RANK(), but order it with the lowest sales first (ascending).
RANK() OVER (PARTITION BY city ORDER BY SUM(sale_dollars) ASC) AS bottom_sales_rank
Putting it all together
SELECT
  city,
  store_name,
  SUM(sale_dollars) AS total_sales,
  RANK() OVER (PARTITION BY city ORDER BY SUM(sale_dollars) DESC) AS top_sales_rank,
  RANK() OVER (PARTITION BY city ORDER BY SUM(sale_dollars) ASC) AS bottom_sales_rank
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE city IS NOT NULL -- for simplicity
GROUP BY city, store_name
Now, this query returns all the ranks, not just the top and bottom ones. Since RANK() is a window function, and SQL evaluates window functions after the WHERE, GROUP BY, and HAVING clauses, we can't filter out the irrelevant ranks with a WHERE or HAVING clause in the same SELECT statement.
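To see why, here is the kind of filter you might be tempted to write. BigQuery will reject it, because the alias doesn't exist yet when the WHERE clause is evaluated (and window functions aren't allowed inside WHERE either):
-- This does NOT work: top_sales_rank is computed after WHERE runs
SELECT
  city,
  store_name,
  SUM(sale_dollars) AS total_sales,
  RANK() OVER (PARTITION BY city ORDER BY SUM(sale_dollars) DESC) AS top_sales_rank
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE city IS NOT NULL
  AND top_sales_rank = 1 -- error: the alias isn't visible here
GROUP BY city, store_name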
What I usually do is wrap everything we've written so far in a WITH statement (a common table expression) and apply the filters in the final query.
WITH sales_ranked AS (
  SELECT
    city,
    store_name,
    SUM(sale_dollars) AS total_sales,
    RANK() OVER (PARTITION BY city ORDER BY SUM(sale_dollars) DESC) AS top_sales_rank,
    RANK() OVER (PARTITION BY city ORDER BY SUM(sale_dollars) ASC) AS bottom_sales_rank
  FROM `bigquery-public-data.iowa_liquor_sales.sales`
  WHERE city IS NOT NULL -- for simplicity
  GROUP BY city, store_name
)
SELECT
  city,
  store_name,
  total_sales
FROM sales_ranked
WHERE top_sales_rank = 1 OR bottom_sales_rank = 1
And that will give you the following results:

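As an aside: since we're on BigQuery, its QUALIFY clause can filter on window functions directly, without the CTE. A minimal sketch (the WITH version above is more portable across SQL dialects):
SELECT
  city,
  store_name,
  SUM(sale_dollars) AS total_sales
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE city IS NOT NULL -- for simplicity
GROUP BY city, store_name
QUALIFY RANK() OVER (PARTITION BY city ORDER BY SUM(sale_dollars) DESC) = 1
  OR RANK() OVER (PARTITION BY city ORDER BY SUM(sale_dollars) ASC) = 1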
That's it! I highly encourage you to play around with this some more. If you want extra practice, you could try to answer the following question: find the top-performing store for each date in the dataset.
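If you get stuck, here is one possible solution sketch (spoiler!). It mirrors the query above, with the dataset's date column taking the place of city:
WITH daily_ranked AS (
  SELECT
    date,
    store_name,
    SUM(sale_dollars) AS total_sales,
    RANK() OVER (PARTITION BY date ORDER BY SUM(sale_dollars) DESC) AS top_sales_rank
  FROM `bigquery-public-data.iowa_liquor_sales.sales`
  WHERE date IS NOT NULL
  GROUP BY date, store_name
)
SELECT
  date,
  store_name,
  total_sales
FROM daily_ranked
WHERE top_sales_rank = 1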
I hope you learned something from this tutorial. There are tons of applications for this functionality, so I can assure you that you will run into it at some point in your career. 😉
Let's keep in touch! 📫
If you would like to be notified whenever I post a new article, you can sign up for my email newsletter here.
If you have any comments, questions or want to collaborate, please email me at lucy@lucytalksdata.com or drop me a message on Twitter.