Data Analysis at the Command Line💻
15 min read

Data Analysis at the Command Line💻

Analysing the lyrics of The Beatles and The Rolling Stones at the command line🎵
Data Analysis at the Command Line💻
Photo by Kevin Ku / Unsplash

As you may know, I'm a data scientist so data analysis is kind of my cup of tea. 🫖

However, I tend to do my analysis work in Python. You know.. Because it's fast, easy and the ecosystem for data analysis and data science is great.

But sometimes you just want to have some fun. Sometimes, you want to challenge yourself to do things a little bit different.

So today we're going to do some data analysis on lyrics from The Beatles and The Rolling Stones. Not in Python, but at the command line!

Let's dive in!

Requirements

Before we get started, you need to install the csvkit Python package. You can easily install it using pip or poetry.

Though it may seem obvious, you should keep your favorite terminal application ready as well. :)

💡
If you're on Windows, I recommend using Git Bash to code along with this tutorial.

The Data

For this tutorial, we're using the Kaggle song lyrics from 79 musical genres dataset. Specifically, the aptly named file lyrics-data.csv.

The columns

Using csvcut -n we can get the column names and indices:

csvcut -n lyrics-data.csv
  1: ALink
  2: SName
  3: SLink
  4: Lyric
  5: language

What does the data look like?

The Lyric column is huge and we'll come back to that later, but for now let's visualise the other columns using csvcut.

We can use csvcut -c to select specific columns in our dataset. Here we select columns ALink, SName, SLink and language.

csvcut -c ALink,SName,SLink,language lyrics-data.csv | csvlook --max-column-width 30 --max-rows 5

If you want to select many columns, you can also use column numbers (1 = first column, 2 = second column etc.) to make your selection more succinct.

We then pipe the output to csvlook. csvlook pretty prints the output. Here, we also cap the rows to 5 and column width to 30. This is purely to make the output look pretty and avoid having columns overflow.

| ALink           | SName                          | SLink                          | language |
| --------------- | ------------------------------ | ------------------------------ | -------- |
| /ivete-sangalo/ | Arerê                          | /ivete-sangalo/arere.html      | pt       |
| /ivete-sangalo/ | Se Eu Não Te Amasse Tanto A... | /ivete-sangalo/se-eu-nao-te... | pt       |
| /ivete-sangalo/ | Céu da Boca                    | /ivete-sangalo/chupa-toda.html | pt       |
| /ivete-sangalo/ | Quando A Chuva Passar          | /ivete-sangalo/quando-a-chu... | pt       |
| /ivete-sangalo/ | Sorte Grande                   | /ivete-sangalo/sorte-grande... | pt       |
| ...             | ...                            | ...                            | ...      |

You see we've got quite a bit of Portuguese music in this list.

Select data

For this tutorial, I'd like to select 2 songs and do a small comparative analysis between them.

Let's go with a Beatles song and a Rolling Stones song, to spice up the ol' debate between the two bands once again. :')

But first let's verify whether these artists are actually in the dataset.

We verify whether there are any songs by running the following commands:

  1. We use csvgrep to search for specific values in one or more columns. In this case we search in the column (-c) ALink for the regular expression (-r) "(?i)/the-beatles/". The (?i) flag stands for case insensitive search.
  2. We are only interested (for now) in the first two columns to check out the songs, so we select only the first two using csvcut.
  3. We use csvlook again to pretty print our data and select 15 rows.
csvgrep -c ALink -r "(?i)/the-beatles/" lyrics-data.csv | csvcut -c 1,2 | csvlook --max-rows 15
| ALink         | SName                        |
| ------------- | ---------------------------- |
| /the-beatles/ | Yesterday                    |
| /the-beatles/ | Let It Be                    |
| /the-beatles/ | Hey Jude                     |
| /the-beatles/ | Here Comes The Sun           |
| /the-beatles/ | Don't Let Me Down            |
| /the-beatles/ | Help!                        |
| /the-beatles/ | Something                    |
| /the-beatles/ | Come Together                |
| /the-beatles/ | In My Life                   |
| /the-beatles/ | Hello, Goodbye               |
| /the-beatles/ | I Want To Hold Your Hand     |
| /the-beatles/ | Get Back                     |
| /the-beatles/ | All My Loving                |
| /the-beatles/ | All You Need Is Love         |
| /the-beatles/ | While My Guitar Gently Weeps |
| ...           | ...                          |

Well, the good news is that there are plenty of Beatles songs. Let's go for 'Here Comes The Sun'.

Now, let's do the same for The Rolling Stones.

csvgrep -c ALink -r "(?i)/the-rolling-stones/" lyrics-data.csv | csvcut -c 1,2 | csvlook --max-rows 15
| ALink                | SName                              |
| -------------------- | ---------------------------------- |
| /the-rolling-stones/ | Sympathy For The Devil             |
| /the-rolling-stones/ | (I Can't Get No) Satisfaction      |
| /the-rolling-stones/ | Start Me Up                        |
| /the-rolling-stones/ | Angie                              |
| /the-rolling-stones/ | Paint It Black                     |
| /the-rolling-stones/ | Miss You                           |
| /the-rolling-stones/ | You Can't Always Get What You Want |
| /the-rolling-stones/ | Wild Horses                        |
| /the-rolling-stones/ | Gimme Shelter                      |
| /the-rolling-stones/ | She's A Rainbow                    |
| /the-rolling-stones/ | Brown Sugar                        |
| /the-rolling-stones/ | Like A Rolling Stone               |
| /the-rolling-stones/ | As Tears Go By                     |
| /the-rolling-stones/ | Jumpin' Jack Flash                 |
| /the-rolling-stones/ | Honky Tonk Women                   |
| ...                  | ...                                |

Let's go for 'Paint it Black'.

Selecting the Lyrics

Now it's time to select the specific rows for these two songs. Let's start with The Beatles. We can use the following commands for that:

csvgrep -c SName -r "(?i)^here comes the sun$" lyrics-data.csv | csvcut -c 1,2 | csvlook --max-rows 15
| ALink                     | SName              |
| ------------------------- | ------------------ |
| /nick-cave-the-bad-seeds/ | Here Comes The Sun |
| /george-harrison/         | Here Comes The Sun |
| /jewel/                   | Here Comes The Sun |
| /bon-jovi/                | Here Comes The Sun |
| /ghost/                   | Here Comes the Sun |
| /runaways/                | Here Comes The Sun |
| /coldplay/                | Here Comes The Sun |
| /christina-perri/         | Here comes the sun |
| /colbie-caillat/          | Here Comes The Sun |
| /glee/                    | Here Comes the Sun |
| /the-beatles/             | Here Comes The Sun |
| /ben-harper/              | Here Comes The Sun |
| /sheryl-crow/             | Here Comes The Sun |

And for The Rolling Stones:

csvgrep -c SName -r "(?i)^paint it black$" lyrics-data.csv | csvcut -c 1,2 | csvlook --max-rows 15
| ALink                | SName          |
| -------------------- | -------------- |
| /the-rolling-stones/ | Paint It Black |
| /jonny-lang/         | Paint It Black |
| /deep-purple/        | Paint It Black |
| /face-to-face/       | Paint it Black |
| /wasp/               | Paint It Black |
| /rage/               | Paint It Black |
| /anvil/              | Paint It Black |
| /ministry/           | Paint It Black |
| /vanessa-carlton/    | Paint It Black |
| /u2/                 | Paint It Black |
| /the-animals/        | Paint it Black |
| /the-lemonheads/     | Paint It Black |
| /the-vines/          | Paint It Black |

Now, we quickly notice that just having the exact titles of the songs in the grep is not enough. We have to be more specific.

So to be more specific we first get all songs with those exact titles, then we pipe that to our next expression in which we filter for the name of the artist. This gives us the exact (unique) results that we're looking for:

csvgrep -c SName -r "(?i)^here comes the sun$" lyrics-data.csv | csvgrep -c ALink -r "(?i)/the-beatles/" | csvcut -c 1,2 | csvlook --max-rows 5
| ALink         | SName              |
| ------------- | ------------------ |
| /the-beatles/ | Here Comes The Sun |
csvgrep -c SName -r "(?i)^paint it black$" lyrics-data.csv | csvgrep -c ALink -r "(?i)/the-rolling-stones/" | csvcut -c 1,2 | csvlook --max-rows 15
| ALink                | SName          |
| -------------------- | -------------- |
| /the-rolling-stones/ | Paint It Black |

Write lyrics to text files

Now, we can't really do any analysis on just these two titles, we're actually interested in the Lyric column. So finally, let's get the lyrics for each of the songs and write them to <song>.txt files.

csvgrep -c SName -r "(?i)^here comes the sun$" lyrics-data.csv | csvgrep -c ALink -r "(?i)/the-beatles/" | csvcut -c Lyric
Lyric
"Here comes the sun, here comes the sun
And I say it's all right

Little darling, it's been a long cold lonely winter
Little darling, it feels like years since it's been here
Here comes the sun, here comes the sun
And I say it's all right

Little darling, the smiles returning to the faces
Little darling, it seems like years since it's been here
Here comes the sun, here comes the sun
And I say it's all right
...

Here we see that the column name Lyric still shows up at the top of the text. We don't need this column name. Luckily, there's a command we can use to remove it. We can use the command sed 1d to give us everything from the second line onwards.

csvgrep -c SName -r "(?i)^here comes the sun$" lyrics-data.csv | csvgrep -c ALink -r "(?i)/the-beatles/" | csvcut -c Lyric | sed 1d
"Here comes the sun, here comes the sun
And I say it's all right

Little darling, it's been a long cold lonely winter
Little darling, it feels like years since it's been here
Here comes the sun, here comes the sun
And I say it's all right

Little darling, the smiles returning to the faces
Little darling, it seems like years since it's been here
Here comes the sun, here comes the sun
And I say it's all right
...

So the final two full commands are the following:

csvgrep -c SName -r "(?i)^here comes the sun$" lyrics-data.csv | csvgrep -c ALink -r "(?i)/the-beatles/" | csvcut -c Lyric | sed 1d > herecomesthesun.txt
csvgrep -c SName -r "(?i)^paint it black$" lyrics-data.csv | csvgrep -c ALink -r "(?i)/the-rolling-stones/" | csvcut -c Lyric | sed 1d > paintitblack.txt

Preprocessing

In order to effectively analyse our data using the command line, we're first going to do some preprocessing steps.

Remove punctuation

We can remove punctuation using the tr command, which stands for translate. In this case we use it to delete -d all punctuation '[:punct:]'.

cat paintitblack.txt | tr -d '[:punct:]' | head -10
I see a red door and I want it painted black
No colors anymore I want them to turn black
I see the girls walk by dressed in their summer clothes
I have to turn my head until my darkness goes

I see a line of cars and theyre all painted black
With flowers and my love both never to come back
I see people turn their heads and quickly look away
Like a new born baby it just happens every day
cat herecomesthesun.txt | tr -d '[:punct:]' | head -10

Put each word on a new line

After deleting the punctuation, we can put each word on a new line.

We do this by translating any space between words to a new line character using tr ' ' '\n'. This essentially shifts each next word one line over.

cat herecomesthesun.txt | tr -d '[:punct:]' | tr  ' ' '\n' | head -15
Here
comes
the
sun
here
comes
the
sun
And
I
say
its
all
right

Lowercase

Next, we lowercase our words by translating all uppercase letters to lowercase ones using tr A-Z a-z.

cat paintitblack.txt | tr -d '[:punct:]' | tr  ' ' '\n' | tr A-Z a-z | head -10
i
see
a
red
door
and
i
want
it
painted

Remove blank lines

If we look at the last 15 lines of Here Comes The Sun, we see that there's still blank lines in our 'word list':

cat herecomesthesun.txt | tr -d '[:punct:]' | tr  ' ' '\n' | tr A-Z a-z | tail -15
its
all
right

here
comes
the
sun
here
comes
the
sun
its
all
right

We'd like to remove those blank lines. We do so using the following command: awk NF.

✍️A little note about awk NF.

Whenever you use awk we use the following syntax: awk condition [statement]. The statement is optional.

If we don't use a statement, awk will print the line that it read if the condition is not equal to zero.

In our case, we don't use a [statement] and the condition 'NF' records the number of fields per line (in our case this is equal to 1 word or 0 words).

So if the line contains a word, it will print the line, and if it doesn't contain a word, it will not print anything.  That's exactly what we need. :)

cat herecomesthesun.txt | tr -d '[:punct:]' | tr  ' ' '\n' | tr A-Z a-z | awk NF | tail -15
say
its
all
right
here
comes
the
sun
here
comes
the
sun
its
all
right

Basic stats

Now onto the fun part. Let's calculate some basic stats on our preprocessed datasets!

How many words are in the vocabulary?

We can calculate this using wc -l. wc -l counts the number of lines. In this case each line is a word, so it counts the number of words in our songs.

cat herecomesthesun.txt | tr -d '[:punct:]' | tr  ' ' '\n' | tr A-Z a-z | awk NF | wc -l

For the Beatles' Here Comes The Sun:

  • 153 words

For The Rolling Stones' Paint it Black:

  • 241 words

How many unique words?

We can calculate this with the following chained commands:

  • sort: sorts your words alphabetically
  • uniq -c: aggregates the 'duplicates' and counts how many times they occur. Returning an overview of <wordcount> <word>
  • wc -l: counts the number of lines. In this case each line is a unique word, thus it returns the unique word count.
cat paintitblack.txt | tr -d '[:punct:]' | tr  ' ' '\n' | tr A-Z a-z | awk NF | sort | uniq -c | wc -l

For The Beatles:

  • 35 words

For The Rolling Stones:

  • 110 words

What's the type / token ratio?

The type token ratio is the number of unique words over the total number of words:

So for Here Comes The Sun this can be calculated as:

  • 35 / 153 words = ~0.23

For Paint it Black, this is:

  • 110 / 241 words = ~0.46

It appears that The Rolling Stones use a wider variety of words in their lyrics (of this song at least). The Beatles seem to be more repetitive. Who would have thought, huh?!

Which words occur a lot?

For this we use the same first steps (sorting and counting unique tokens). This gives us the lines with <frequency word> combinations. But then we:

  • Sort again not in alphabetical order but by numeric counts (-n)
  • Also, we'd like to sort in reverse order (-r), so we get the highest counts at the top.
  • And then we take the top 20:
cat herecomesthesun.txt | tr -d '[:punct:]' | tr  ' ' '\n' | tr A-Z a-z | awk NF | sort | uniq -c | sort -nr | head -n 20

So sort -nr means:

  • We sort by the numeric values (n)
  • We sort in reverse order (r)

For Here Comes The Sun, these are the top 20 words:

  25 sun
  17 here
  15 comes
  12 the
   9 its
   8 it
   6 little
   6 darling
   5 right
   5 i
   5 all
   4 say
   4 been
   4 and
   3 years
   3 since
   3 like
   2 seems
   1 winter
   1 to

Unsurprisingly, the word 'sun' tops the list. 🤣

For Paint it Black, the list looks like this:

  19 i
  12 black
  11 see
  10 painted
   9 my
   8 and
   7 to
   7 the
   6 turn
   6 it
   5 a
   4 want
   3 wanna
   3 their
   3 red
   3 not
   3 no
   3 look
   3 hmm
   3 have

For Paint it Black, the word black is the second most frequent word in the song, after I.

Note that many of these words are stopwords, in both songs actually. For those of you who are not familiar with stopwords: stopwords are a collection of the most common words in a language that don't add a lot of information to a text. Examples are articles, pronouns and prepositions.

We'll remove these later on, but first let's look at co-occurring words.

Co-occurring words

In order to find the co-occurring words we're going to do the following:

  • Get the word list like we had before (one word per line, lowercase, no punctuation, no empty lines)
  • Save it in tmp.txt
  • Get a copy of tmp.txt except for the first word and save that in tmp2.txt
  • Then we're basically going to smash the two word lists together, so that the first word of the first file matches the first word of the second file (so actually, we're matching the first word with the second, the second with the third and so on)
  • Finally, we sort the combinations, count the unique combinations, and sort by reverse numeric count and print the top 20

So here we have our 'original' command, that we save to tmp.txt:

cat herecomesthesun.txt | tr -d '[:punct:]' | tr ' ' '\n' | tr A-Z a-z | awk NF > tmp.txt

Then we use tail -n+2 to save all lines except the first (basically starting at line 2) to tmp2.txt.

tail -n+2 tmp.txt > tmp2.txt

We use paste -d ','  <txt1> <txt2> to combine the two lists. We paste them next to each other separated by a comma.

We then sort, count unique combinations, sort by the numeric counts in reverse order and print the top 20.

paste -d ',' tmp.txt tmp2.txt | sort | uniq -c | sort -rn | head -n20

This is the outcome for Here Comes The Sun:

  10 the,sun
  10 sun,sun
  10 sun,here
  10 here,comes
  10 comes,the
   6 little,darling
   5 its,all
   5 it,comes
   5 here,it
   5 all,right
   4 sun,and
   4 say,its
   4 its,been
   4 i,say
   4 comes,sun
   4 and,i
   3 years,since
   3 since,its
   3 like,years
   3 darling,it

Again, the sun plays a big part in the top combinations, unsurprisingly, as well as the famous 'little darling'.

This is the command for Paint It Black:

cat paintitblack.txt | tr -d '[:punct:]' | tr ' ' '\n' | tr A-Z a-z | awk NF > tmp.txt
tail -n+2 tmp.txt > tmp2.txt
paste -d ',' tmp.txt tmp2.txt | sort | uniq -c | sort -rn | head -n20

And the results:

   7 i,see
   6 painted,black
   4 to,turn
   4 painted,painted
   4 it,painted
   4 i,want
   3 wanna,see
   3 see,the
   3 see,a
   3 red,door
   3 i,wanna
   3 have,to
   3 door,and
   3 black,no
   3 black,i
   2 want,them
   2 want,it
   2 walk,by
   2 until,my
   2 turn,my

Classic utterances like 'painted black' and 'red door' are represented nicely in the top list, as well as many other combinations that are repeated many times across the song.

Removing stopwords

First, let's download a list of English stopwords. You can find one here.

I'm downloading it here with curl and saving it to stopwords.txt:

curl "https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK's%2520list%2520of%2520english%2520stopwords" -o stopwords.txt

Now, we're going to do the following:

  • We're going to use our list of depunctuated, lowercase words
  • We cross-reference our stopwords.
  • We only want to keep those words that are not in the stopword list.
  • We sort the remaining words again, aggregate and count, sort numerically in reverse order

We can use the following command for that:

cat herecomesthesun.txt| tr -d '[:punct:]' | tr ' ' '\n' | tr A-Z a-z | awk NF | grep -v -w -f stopwords.txt | sort | uniq -c | sort -nr | head -10

Note that we use grep -v -w -f stopwords.txt to remove the stopwords from our lyrics. What does this command do?

  • Grep is used to match one or more patterns. In this case a whole bunch of stopwords in a file
  • -v Means that we invert-match. So we keep the lines that don't match any of the words in the stopwords file.
  • -w Makes sure that we match on whole words, not on parts of words
  • -f Tells grep that we want to use a file as input, not a single word

The top 10 words excluding stopwords for Here Comes The Sun are:

  25 sun
  15 comes
   6 little
   6 darling
   5 right
   4 say
   3 years
   3 since
   3 like
   2 seems

For Paint it Black:

  12 black
  11 see
  10 painted
   6 turn
   4 want
   3 wanna
   3 red
   3 look
   3 hmm
   3 door
💡
Now that you've removed stopwords, you could go back and recalculate the type/token ratio without stopwords. Does Paint it Black still have a higher type/token ratio?

Visualising our top words with Gnuplot

Installing Gnuplot

First things first: we'll have to install gnuplot. On Mac you can use brew to install it. For any other platform, please check out the official download page.

Once you enter gnuplot in the terminal and you see something like the following you should be good to go.

Preparing the data

For each song we do the following:

  1. We start with what we had before (word list sorted by counts descending)
  2. Swap the word count and word around, so it becomes <word>\t<word count>.
  3. Insert column names on row 1
  4. Select only the top 10
  5. Save in <title>_top10.tsv

These are the respective commands we use:

  1. cat herecomesthesun.txt| tr -d '[:punct:]' | tr ' ' '\n' | tr A-Z a-z | awk NF | grep -v -w -f stopwords.txt | sort | uniq -c | sort -nr
  2. awk '{ print $2 "\t" $1}'
  3. sed "1s/^/word\tfreq\n/" (essentially pushes original line 1 to second line)
  4. head -n 10
  5. <all previous commands> > herecomesthesun_top10.tsv

The full command:

cat herecomesthesun.txt| tr -d '[:punct:]' | tr ' ' '\n' | tr A-Z a-z | awk NF | grep -v -w -f stopwords.txt | sort | uniq -c | sort -nr | awk '{ print $2 "\t" $1}' | sed "1s/^/word\tfreq\n/" | head -10 > herecomesthesun_top10.tsv

Using Gnuplot

With the Gnuplot CLI you can declare your graph settings and then plot your data. It is relatively straightforward but the syntax is a little different than your regular old Matplotlib or Seaborn. Let's get started:

Set the styles of the barchart

set boxwidth 0.5
set style fill solid
set style line 1 linecolor "blue"

Set plot title and axes

set title 'Unigram Frequencies Here Comes The Sun'
set xlabel 'Words'
set ylabel 'Frequencies'

Set output settings

set terminal png size 700,400; set output 'herecomesthesun.png'

Let's plot

plot "herecomesthesun_top10.tsv"  using 0:2:xtic(1) with boxes ls 1 notitle

The result

It's great that this functionality exists at the command line, but honestly, I am way too comfortable with my trusty friends Matplotlib and Seaborn to make Gnuplot a regular in my toolkit. 😉

Summary

It's a wrap! In this tutorial, I've shown you how to do simple data analysis at the command line.

Note that such a comparative analysis would be even better if we compare entire discographies... I'll leave that as an exercise to the reader! 💪

In this tutorial we've learned:

  • How to do basic stats (number of words, unique words, type/token ratios)
  • How to calculate the most frequent words
  • How to calculate co-occurring words
  • How to remove stopwords
  • How to plot your frequency charts using Gnuplot

Please keep in mind that it's helpful to use the right tool for a particular problem.

Personally, I would probably pick Python over the command line in 99% of the cases when conducting such an analysis. It is simply more convenient, easier to work with and has a bigger toolbox to work with.

Regardless, it is fun to challenge yourself a bit and explore alternative ways of solving problems! 😁

Moreover, sometimes you simply won't have access to Python. In that case it's great to be skilled enough at the command line to get the insights you want. For example, I sometimes have to check out my server logs and it is always very useful to know these commands to quickly zoom in on the data you need.

Let's keep in touch! 📫

If you would like to be notified whenever I post a new article, you can sign up for my email newsletter here.

If you have any comments, questions or want to collaborate, please email me at lucy@lucytalksdata.com or drop me a message on Twitter.