SQL for Data Science - Quest #2
2 min read

SQL for Data Science - Quest #2

It's time for another SQL for Data Science Quest! Today we'll be analyzing the repositories dataset from libraries_io!

In this series, I will walk you through a couple of SQL for Data Science quests, not just to prepare you for interviews but also to make you more aware of some of the SQL tricks that are out there, and have you gain practical experience in the process. For Quest 1, go here.

Problem Statement

Suppose we have a dataset with code repositories like this one:

Dataset
Dataset

The following column names are present:

Dataset column names
Dataset column names

Can you calculate, for just the Github repositories, the average number of stars, the average number of watchers and the total number of forks per programming language?

Step-by-Step Solution

Let's break it up into pieces.πŸ‘‡

Filter out everything that's not on Github

First, we only want the Github repositories, so let's filter out all Gitlab and Bitbucket repositories.

SELECT 
  *
FROM `bigquery-public-data.libraries_io.repositories`
WHERE host_type = 'GitHub'

Second, let's filter out forks, for simplicity's sake.

SELECT 
  *
FROM `bigquery-public-data.libraries_io.repositories`
WHERE host_type = 'GitHub' and fork is false

Group by language and calculate stats

Finally, let's group by programming language. After that, it's just a matter of calculating stats for each language using the AVG() and SUM() operators.

SELECT 
  language,
  AVG(stars_count) as avg_stars,
  SUM(forks_count) total_forks,
  AVG(watchers_count) as avg_watchers
FROM `bigquery-public-data.libraries_io.repositories`
WHERE host_type = 'GitHub' and fork is false
GROUP BY language
ORDER BY avg_stars desc

Results

That's all. The result will look something like this:

Notably, my favourite language (Python) is not in the top 5 of programming languages in this list. Also, I admit I had to look up the Rascal language πŸ˜…. It might be a more niche language, with fewer but some very popular repos.

Doing a quick COUNT(*) confirms this hunch.

Next steps

Hope you learned something from this Quest. Check out the other quests here. If you want to practice some more, the (more detailed) dataset is on bigquery.

As an additional exercise, you could take the results from above and filter out languages with less than 500 repos, to see if more well-established popular languages come to the surface.

In addition, you could for instance calculate the language with the lowest contributor count, or you could take the owner of the repo and calculate some aggregates per repo owner.

As long as you have fun with it. 😁

Let's keep in touch! πŸ“«

If you would like to be notified whenever I post a new article, you can sign up for my email newsletter here.

If you have any comments, questions or want to collaborate, please email me at lucy@lucytalksdata.com or drop me a message on Twitter.