SQL for Data Science - Quest #1
3 min read

SQL for Data Science - Quest #1

In my day job, I use SQL a lot. And it goes without saying that SQL is hard.

In this series I will take you through a couple of possible SQL for Data Science quests, not just to prepare you for interviews but also to make you more aware of some of the SQL tricks that are out there, and have you practice along in the process.

Problem Statement

Suppose you have a liquor sales dataset (available through BigQuery), which looks like the following:

Untitled

Write a query that calculates the store that has the highest sales and the store that has the lowest sales for each city

Step-by-Step Solution

Group by relevant columns and aggregate the sales

First, we want to group by city and store_name and sum the sale_dollars to get the total sales per store, per city.

This will look something like this:

SELECT
  city,
  store_name,
  SUM(sale_dollars) as total_sales
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE city is not null # for simplicity
GROUP BY city, store_name

Rank the stores per city

Now it gets a little bit tricky. We want to rank stores, but not just over the whole dataset. No, we want to rank them by city. We can achieve this by partitioning our data per city, and making a ranking over that. Luckily SQL has a function that will achieve just that. It's called RANK() over PARTITION.

RANK() over (PARTITION BY city ORDER BY SUM(sale_dollars) desc) as top_sales_rank

Let's go over each component of this part of the query.

The RANK()ranks our data as specified by the  PARTITION BY clause (city in this case). Note, RANK() will assign equal ranks to equal sales. So if you want to assign different ranks, you can use ROW NUMBER() instead of RANK().

PARTITION BY city will split up the data by city and compute a ranking over each of those splits.

ORDER BY determines how to rank our data. In this case we want to rank our data per city with the highest sales first (SUM(sale_dollars) descending).

Note that in the problem statement, it said we also need to compute the lowest sales. So how do we do this? Exactly, we compute another RANK() but order it by lowest sales.

RANK() over (PARTITION BY city ORDER BY SUM(sale_dollars) asc) as bottom_sales_rank

Putting it all together

SELECT
  city,
  store_name,
  SUM(sale_dollars) as total_sales,
  RANK() over (PARTITION BY city ORDER BY SUM(sale_dollars) desc) as top_sales_rank,
  RANK() over (PARTITION BY city ORDER BY SUM(sale_dollars) asc) as bottom_sales_rank
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE city is not null # for simplicity
GROUP BY city, store_name

Now this SQL query will have all the ranks, not just the top and bottom ranks.

Since we're using RANK() statement, which SQL executes after the WHERE statement, we can't filter out the irrelevant ranks with a WHERE clause or a HAVING clause in the same SELECT statement.

What I usually do is create a WITH statement for all that we've written so far and compute the filters in the final query.

WITH sales_ranked as (
SELECT
  city,
  store_name,
  SUM(sale_dollars) as total_sales,
  RANK() over (PARTITION BY city ORDER BY SUM(sale_dollars) desc) as top_sales_rank,
  RANK() over (PARTITION BY city ORDER BY SUM(sale_dollars) asc) as bottom_sales_rank
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE city is not null # for simplicity
GROUP BY city, store_name
)

SELECT
  city,
  store_name,
  total_sales
FROM sales_ranked
WHERE top_sales_rank = 1 or bottom_sales_rank = 1

And that will give you the following results:

Untitled

That's it! I highly encourage you to play around with this some more. If you want the practice. You could try to answer the following question: find the top performing store per date in the dataset.

I hope you learned something from this tutorial. There's tons of applications for this functionality, so I can assure you, you will run into it at some point in your career. 😉

Let's keep in touch! 📫

If you would like to be notified whenever I post a new article, you can sign up for my email newsletter here.

If you have any comments, questions or want to collaborate, please email me at lucy@lucytalksdata.com or drop me a message on Twitter.