SQL for Data Science - Quest #2
It's time for another SQL for Data Science Quest! Today we'll be analyzing the repositories dataset from libraries_io!
In this series, I will walk you through a couple of SQL for Data Science quests, not just to prepare you for interviews but also to make you more aware of some of the SQL tricks that are out there, and have you gain practical experience in the process. For Quest 1, go here.
Suppose we have a dataset with code repositories like this one:
The following column names are present:
Can you calculate, for just the Github repositories, the average number of stars, the average number of watchers and the total number of forks per programming language?
Let's break it up into pieces.👇
Filter out everything that's not on Github
First, we only want the Github repositories, so let's filter out all Gitlab and Bitbucket repositories.
SELECT * FROM `bigquery-public-data.libraries_io.repositories` WHERE host_type = 'GitHub'
Second, let's filter out forks, for simplicity's sake.
SELECT * FROM `bigquery-public-data.libraries_io.repositories` WHERE host_type = 'GitHub' and fork is false
Group by language and calculate stats
Finally, let's group by programming language. After that, it's just a matter of calculating stats for each language using the AVG() and SUM() operators.
SELECT language, AVG(stars_count) as avg_stars, SUM(forks_count) total_forks, AVG(watchers_count) as avg_watchers FROM `bigquery-public-data.libraries_io.repositories` WHERE host_type = 'GitHub' and fork is false GROUP BY language ORDER BY avg_stars desc
That's all. The result will look something like this:
Notably, my favourite language (Python) is not in the top 5 of programming languages in this list. Also, I admit I had to look up the Rascal language 😅. It might be a more niche language, with fewer but some very popular repos.
Doing a quick COUNT(*) confirms this hunch.
Hope you learned something from this Quest. Check out the other quests here. If you want to practice some more, the (more detailed) dataset is on bigquery.
As an additional exercise, you could take the results from above and filter out languages with less than 500 repos, to see if more well-established popular languages come to the surface.
In addition, you could for instance calculate the language with the lowest contributor count, or you could take the owner of the repo and calculate some aggregates per repo owner.
As long as you have fun with it. 😁
Let's keep in touch! 📫
If you would like to be notified whenever I post a new article, you can sign up for my email newsletter here.
If you have any comments, questions or want to collaborate, please email me at email@example.com or drop me a message on Twitter.