Data Analysis at the Command Line💻
As you may know, I'm a data scientist so data analysis is kind of my cup of tea. 🫖
However, I tend to do my analysis work in Python. You know... because it's fast, easy, and the ecosystem for data analysis and data science is great.
But sometimes you just want to have some fun. Sometimes, you want to challenge yourself to do things a little differently.
So today we're going to do some data analysis on lyrics from The Beatles and The Rolling Stones. Not in Python, but at the command line!
Let's dive in!
Requirements
Before we get started, you need to install the csvkit Python package. You can easily install it using pip or poetry.
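With pip, for example, the install is a one-liner (poetry users can run poetry add csvkit instead):

```bash
# Install csvkit (provides csvcut, csvgrep, csvlook, ...)
pip install csvkit
```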
Though it may seem obvious, you should keep your favorite terminal application ready as well. :)
The Data
For this tutorial, we're using the Kaggle song lyrics from 79 musical genres dataset. Specifically, the aptly named file lyrics-data.csv.
The columns
Using csvcut -n we can get the column names and indices:
csvcut -n lyrics-data.csv
1: ALink
2: SName
3: SLink
4: Lyric
5: language
What does the data look like?
The Lyric column is huge and we'll come back to that later, but for now let's visualise the other columns using csvcut.
We can use csvcut -c to select specific columns in our dataset. Here we select the columns ALink, SName, SLink and language.
csvcut -c ALink,SName,SLink,language lyrics-data.csv | csvlook --max-column-width 30 --max-rows 5
If you want to select many columns, you can also use column numbers (1 = first column, 2 = second column etc.) to make your selection more succinct.
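For instance, based on the column indices we got from csvcut -n earlier, the selection above could equivalently be written as:

```bash
# Same selection as before, but by column index (1=ALink, 2=SName, 3=SLink, 5=language)
csvcut -c 1,2,3,5 lyrics-data.csv | csvlook --max-column-width 30 --max-rows 5
```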
We then pipe the output to csvlook, which pretty-prints it. Here, we also cap the rows to 5 and the column width to 30. This is purely to make the output look nice and avoid having columns overflow.
| ALink | SName | SLink | language |
| --------------- | ------------------------------ | ------------------------------ | -------- |
| /ivete-sangalo/ | Arerê | /ivete-sangalo/arere.html | pt |
| /ivete-sangalo/ | Se Eu Não Te Amasse Tanto A... | /ivete-sangalo/se-eu-nao-te... | pt |
| /ivete-sangalo/ | Céu da Boca | /ivete-sangalo/chupa-toda.html | pt |
| /ivete-sangalo/ | Quando A Chuva Passar | /ivete-sangalo/quando-a-chu... | pt |
| /ivete-sangalo/ | Sorte Grande | /ivete-sangalo/sorte-grande... | pt |
| ... | ... | ... | ... |
As you can see, we've got quite a bit of Portuguese music in this dataset.
Select data
For this tutorial, I'd like to select 2 songs and do a small comparative analysis between them.
Let's go with a Beatles song and a Rolling Stones song, to spice up the ol' debate between the two bands once again. :')
But first let's verify whether these artists are actually in the dataset.
We can check whether their songs appear by chaining the following commands:
- We use csvgrep to search for specific values in one or more columns. In this case we search the ALink column (-c) for the regular expression (-r) "(?i)/the-beatles/". The (?i) flag stands for case-insensitive search.
- We are only interested (for now) in the first two columns to check out the songs, so we select only those two using csvcut.
- We use csvlook again to pretty-print our data, capped at 15 rows.
csvgrep -c ALink -r "(?i)/the-beatles/" lyrics-data.csv | csvcut -c 1,2 | csvlook --max-rows 15
| ALink | SName |
| ------------- | ---------------------------- |
| /the-beatles/ | Yesterday |
| /the-beatles/ | Let It Be |
| /the-beatles/ | Hey Jude |
| /the-beatles/ | Here Comes The Sun |
| /the-beatles/ | Don't Let Me Down |
| /the-beatles/ | Help! |
| /the-beatles/ | Something |
| /the-beatles/ | Come Together |
| /the-beatles/ | In My Life |
| /the-beatles/ | Hello, Goodbye |
| /the-beatles/ | I Want To Hold Your Hand |
| /the-beatles/ | Get Back |
| /the-beatles/ | All My Loving |
| /the-beatles/ | All You Need Is Love |
| /the-beatles/ | While My Guitar Gently Weeps |
| ... | ... |
Well, the good news is that there are plenty of Beatles songs. Let's go for 'Here Comes The Sun'.
Now, let's do the same for The Rolling Stones.
csvgrep -c ALink -r "(?i)/the-rolling-stones/" lyrics-data.csv | csvcut -c 1,2 | csvlook --max-rows 15
| ALink | SName |
| -------------------- | ---------------------------------- |
| /the-rolling-stones/ | Sympathy For The Devil |
| /the-rolling-stones/ | (I Can't Get No) Satisfaction |
| /the-rolling-stones/ | Start Me Up |
| /the-rolling-stones/ | Angie |
| /the-rolling-stones/ | Paint It Black |
| /the-rolling-stones/ | Miss You |
| /the-rolling-stones/ | You Can't Always Get What You Want |
| /the-rolling-stones/ | Wild Horses |
| /the-rolling-stones/ | Gimme Shelter |
| /the-rolling-stones/ | She's A Rainbow |
| /the-rolling-stones/ | Brown Sugar |
| /the-rolling-stones/ | Like A Rolling Stone |
| /the-rolling-stones/ | As Tears Go By |
| /the-rolling-stones/ | Jumpin' Jack Flash |
| /the-rolling-stones/ | Honky Tonk Women |
| ... | ... |
Let's go for 'Paint it Black'.
Selecting the Lyrics
Now it's time to select the specific rows for these two songs. Let's start with The Beatles. We can use the following commands for that:
csvgrep -c SName -r "(?i)^here comes the sun$" lyrics-data.csv | csvcut -c 1,2 | csvlook --max-rows 15
| ALink | SName |
| ------------------------- | ------------------ |
| /nick-cave-the-bad-seeds/ | Here Comes The Sun |
| /george-harrison/ | Here Comes The Sun |
| /jewel/ | Here Comes The Sun |
| /bon-jovi/ | Here Comes The Sun |
| /ghost/ | Here Comes the Sun |
| /runaways/ | Here Comes The Sun |
| /coldplay/ | Here Comes The Sun |
| /christina-perri/ | Here comes the sun |
| /colbie-caillat/ | Here Comes The Sun |
| /glee/ | Here Comes the Sun |
| /the-beatles/ | Here Comes The Sun |
| /ben-harper/ | Here Comes The Sun |
| /sheryl-crow/ | Here Comes The Sun |
And for The Rolling Stones:
csvgrep -c SName -r "(?i)^paint it black$" lyrics-data.csv | csvcut -c 1,2 | csvlook --max-rows 15
| ALink | SName |
| -------------------- | -------------- |
| /the-rolling-stones/ | Paint It Black |
| /jonny-lang/ | Paint It Black |
| /deep-purple/ | Paint It Black |
| /face-to-face/ | Paint it Black |
| /wasp/ | Paint It Black |
| /rage/ | Paint It Black |
| /anvil/ | Paint It Black |
| /ministry/ | Paint It Black |
| /vanessa-carlton/ | Paint It Black |
| /u2/ | Paint It Black |
| /the-animals/ | Paint it Black |
| /the-lemonheads/ | Paint It Black |
| /the-vines/ | Paint It Black |
Now, we quickly notice that grepping for the exact song title alone is not enough: plenty of other artists have recorded songs with the same title. We have to be more specific.
So, to be more specific, we first get all songs with the exact title, then pipe that to a second csvgrep that filters on the artist. This gives us the exact (unique) results we're looking for:
csvgrep -c SName -r "(?i)^here comes the sun$" lyrics-data.csv | csvgrep -c ALink -r "(?i)/the-beatles/" | csvcut -c 1,2 | csvlook --max-rows 5
| ALink | SName |
| ------------- | ------------------ |
| /the-beatles/ | Here Comes The Sun |
csvgrep -c SName -r "(?i)^paint it black$" lyrics-data.csv | csvgrep -c ALink -r "(?i)/the-rolling-stones/" | csvcut -c 1,2 | csvlook --max-rows 15
| ALink | SName |
| -------------------- | -------------- |
| /the-rolling-stones/ | Paint It Black |
Write lyrics to text files
Now, we can't really do any analysis on just these two titles; we're actually interested in the Lyric column. So finally, let's get the lyrics for each of the songs and write them to <song>.txt files.
csvgrep -c SName -r "(?i)^here comes the sun$" lyrics-data.csv | csvgrep -c ALink -r "(?i)/the-beatles/" | csvcut -c Lyric
Lyric
"Here comes the sun, here comes the sun
And I say it's all right
Little darling, it's been a long cold lonely winter
Little darling, it feels like years since it's been here
Here comes the sun, here comes the sun
And I say it's all right
Little darling, the smiles returning to the faces
Little darling, it seems like years since it's been here
Here comes the sun, here comes the sun
And I say it's all right
...
Here we see that the column name Lyric still shows up at the top of the text. We don't need this column name. Luckily, there's a command we can use to remove it: sed 1d gives us everything from the second line onwards.
csvgrep -c SName -r "(?i)^here comes the sun$" lyrics-data.csv | csvgrep -c ALink -r "(?i)/the-beatles/" | csvcut -c Lyric | sed 1d
"Here comes the sun, here comes the sun
And I say it's all right
Little darling, it's been a long cold lonely winter
Little darling, it feels like years since it's been here
Here comes the sun, here comes the sun
And I say it's all right
Little darling, the smiles returning to the faces
Little darling, it seems like years since it's been here
Here comes the sun, here comes the sun
And I say it's all right
...
So the final two full commands are the following:
csvgrep -c SName -r "(?i)^here comes the sun$" lyrics-data.csv | csvgrep -c ALink -r "(?i)/the-beatles/" | csvcut -c Lyric | sed 1d > herecomesthesun.txt
csvgrep -c SName -r "(?i)^paint it black$" lyrics-data.csv | csvgrep -c ALink -r "(?i)/the-rolling-stones/" | csvcut -c Lyric | sed 1d > paintitblack.txt
Preprocessing
In order to effectively analyse our data using the command line, we're first going to do some preprocessing steps.
Remove punctuation
We can remove punctuation using the tr command, which stands for translate. In this case we use it to delete (-d) all punctuation ('[:punct:]').
cat paintitblack.txt | tr -d '[:punct:]' | head -10
I see a red door and I want it painted black
No colors anymore I want them to turn black
I see the girls walk by dressed in their summer clothes
I have to turn my head until my darkness goes
I see a line of cars and theyre all painted black
With flowers and my love both never to come back
I see people turn their heads and quickly look away
Like a new born baby it just happens every day
cat herecomesthesun.txt | tr -d '[:punct:]' | head -10
Put each word on a new line
After deleting the punctuation, we can put each word on a new line.
We do this by translating every space between words into a newline character using tr ' ' '\n'. This puts each word on its own line.
cat herecomesthesun.txt | tr -d '[:punct:]' | tr ' ' '\n' | head -15
Here
comes
the
sun
here
comes
the
sun
And
I
say
its
all
right
Lowercase
Next, we lowercase our words by translating all uppercase letters to lowercase ones using tr A-Z a-z.
cat paintitblack.txt | tr -d '[:punct:]' | tr ' ' '\n' | tr A-Z a-z | head -10
i
see
a
red
door
and
i
want
it
painted
Remove blank lines
If we look at the last 15 lines of Here Comes The Sun, we see that there are still blank lines in our 'word list':
cat herecomesthesun.txt | tr -d '[:punct:]' | tr ' ' '\n' | tr A-Z a-z | tail -15
its
all
right
here
comes
the
sun
here
comes
the
sun
its
all
right
We'd like to remove those blank lines. We do so using the following command: awk NF.
✍️A little note about awk NF.
Whenever you use awk, the syntax is: awk condition [statement]. The statement is optional.
If we don't provide a statement, awk will print the line it has just read whenever the condition is not equal to zero.
In our case, we don't use a [statement], and the condition NF holds the number of fields on the line (for us, that's either 1 word or 0 words).
So if the line contains a word, it will print the line, and if it doesn't contain a word, it will not print anything. That's exactly what we need. :)
cat herecomesthesun.txt | tr -d '[:punct:]' | tr ' ' '\n' | tr A-Z a-z | awk NF | tail -15
say
its
all
right
here
comes
the
sun
here
comes
the
sun
its
all
right
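Since we'll be reusing this exact preprocessing pipeline over and over for the rest of the tutorial, you could optionally wrap it in a small shell function. This is just a convenience sketch (the name tokenize is mine, not a standard tool); the commands below keep spelling the pipeline out in full, so you can follow along either way.

```bash
# Hypothetical helper: turn a lyrics file into one lowercase word per line,
# with punctuation and blank lines removed.
tokenize() {
  tr -d '[:punct:]' < "$1" | tr ' ' '\n' | tr A-Z a-z | awk NF
}

# Usage:
tokenize herecomesthesun.txt | tail -15
```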
Basic stats
Now onto the fun part. Let's calculate some basic stats on our preprocessed datasets!
How many words are there in total?
We can calculate this using wc -l, which counts the number of lines. Since each line now holds a single word, this gives us the number of words in our songs.
cat herecomesthesun.txt | tr -d '[:punct:]' | tr ' ' '\n' | tr A-Z a-z | awk NF | wc -l
For the Beatles' Here Comes The Sun:
- 153 words
For The Rolling Stones' Paint it Black:
- 241 words
How many unique words?
We can calculate this with the following chained commands:
- sort: sorts your words alphabetically
- uniq -c: aggregates the 'duplicates' and counts how many times each word occurs, returning an overview of <wordcount> <word>
- wc -l: counts the number of lines. In this case each line is a unique word, so it returns the unique word count.
cat paintitblack.txt | tr -d '[:punct:]' | tr ' ' '\n' | tr A-Z a-z | awk NF | sort | uniq -c | wc -l
For The Beatles:
- 35 words
For The Rolling Stones:
- 110 words
What's the type / token ratio?
The type/token ratio is the number of unique words (types) divided by the total number of words (tokens):
So for Here Comes The Sun this can be calculated as:
- 35 / 153 words = ~0.23
For Paint it Black, this is:
- 110 / 241 words = ~0.46
It appears that The Rolling Stones use a wider variety of words in their lyrics (of this song at least). The Beatles seem to be more repetitive. Who would have thought, huh?!
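If you'd rather not do the division by hand, here's a minimal sketch that computes the ratio directly in the shell; it reuses the same preprocessing pipeline and divides with awk (the variable names are my own):

```bash
# Type/token ratio for Here Comes The Sun
tokens=$(cat herecomesthesun.txt | tr -d '[:punct:]' | tr ' ' '\n' | tr A-Z a-z | awk NF | wc -l)
types=$(cat herecomesthesun.txt | tr -d '[:punct:]' | tr ' ' '\n' | tr A-Z a-z | awk NF | sort -u | wc -l)
awk -v types="$types" -v tokens="$tokens" 'BEGIN { printf "%.2f\n", types / tokens }'
```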
Which words occur a lot?
For this we use the same first steps (sorting and counting unique tokens). This gives us lines with <frequency> <word> combinations. But then we:
- Sort again not in alphabetical order but by numeric counts (-n)
- Also, we'd like to sort in reverse order (-r), so we get the highest counts at the top.
- And then we take the top 20:
cat herecomesthesun.txt | tr -d '[:punct:]' | tr ' ' '\n' | tr A-Z a-z | awk NF | sort | uniq -c | sort -nr | head -n 20
So sort -nr means:
- We sort by the numeric values (n)
- We sort in reverse order (r)
For Here Comes The Sun, these are the top 20 words:
25 sun
17 here
15 comes
12 the
9 its
8 it
6 little
6 darling
5 right
5 i
5 all
4 say
4 been
4 and
3 years
3 since
3 like
2 seems
1 winter
1 to
Unsurprisingly, the word 'sun' tops the list. 🤣
For Paint it Black, the list looks like this:
19 i
12 black
11 see
10 painted
9 my
8 and
7 to
7 the
6 turn
6 it
5 a
4 want
3 wanna
3 their
3 red
3 not
3 no
3 look
3 hmm
3 have
For Paint it Black, the word black is the second most frequent word in the song, after I.
Note that many of these words are stopwords, in both songs actually. For those of you who are not familiar with stopwords: stopwords are a collection of the most common words in a language that don't add a lot of information to a text. Examples are articles, pronouns and prepositions.
We'll remove these later on, but first let's look at co-occurring words.
Co-occurring words
In order to find the co-occurring words we're going to do the following:
- Get the word list like we had before (one word per line, lowercase, no punctuation, no empty lines)
- Save it in tmp.txt
- Get a copy of tmp.txt except for the first word and save that in tmp2.txt
- Then we're basically going to smash the two word lists together, so that the first word of the first file matches the first word of the second file (so actually, we're matching the first word with the second, the second with the third and so on)
- Finally, we sort the combinations, count the unique combinations, and sort by reverse numeric count and print the top 20
So here we have our 'original' command, which we save to tmp.txt:
cat herecomesthesun.txt | tr -d '[:punct:]' | tr ' ' '\n' | tr A-Z a-z | awk NF > tmp.txt
Then we use tail -n+2 to save all lines except the first (basically starting at line 2) to tmp2.txt.
tail -n+2 tmp.txt > tmp2.txt
We use paste -d ',' <txt1> <txt2> to combine the two lists. We paste them next to each other, separated by a comma.
We then sort, count unique combinations, sort by the numeric counts in reverse order and print the top 20.
paste -d ',' tmp.txt tmp2.txt | sort | uniq -c | sort -rn | head -n20
This is the outcome for Here Comes The Sun:
10 the,sun
10 sun,sun
10 sun,here
10 here,comes
10 comes,the
6 little,darling
5 its,all
5 it,comes
5 here,it
5 all,right
4 sun,and
4 say,its
4 its,been
4 i,say
4 comes,sun
4 and,i
3 years,since
3 since,its
3 like,years
3 darling,it
Again, the sun plays a big part in the top combinations, unsurprisingly, as well as the famous 'little darling'.
This is the command for Paint It Black:
cat paintitblack.txt | tr -d '[:punct:]' | tr ' ' '\n' | tr A-Z a-z | awk NF > tmp.txt
tail -n+2 tmp.txt > tmp2.txt
paste -d ',' tmp.txt tmp2.txt | sort | uniq -c | sort -rn | head -n20
And the results:
7 i,see
6 painted,black
4 to,turn
4 painted,painted
4 it,painted
4 i,want
3 wanna,see
3 see,the
3 see,a
3 red,door
3 i,wanna
3 have,to
3 door,and
3 black,no
3 black,i
2 want,them
2 want,it
2 walk,by
2 until,my
2 turn,my
Classic utterances like 'painted black' and 'red door' are represented nicely in the top list, as well as many other combinations that are repeated many times across the song.
Removing stopwords
First, let's download a list of English stopwords. You can find one here.
I'm downloading it here with curl and saving it to stopwords.txt:
curl "https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK's%2520list%2520of%2520english%2520stopwords" -o stopwords.txt
Now, we're going to do the following:
- We're going to use our list of depunctuated, lowercase words
- We cross-reference our stopwords.
- We only want to keep those words that are not in the stopword list.
- We sort the remaining words again, aggregate and count, sort numerically in reverse order
We can use the following command for that:
cat herecomesthesun.txt| tr -d '[:punct:]' | tr ' ' '\n' | tr A-Z a-z | awk NF | grep -v -w -f stopwords.txt | sort | uniq -c | sort -nr | head -10
Note that we use grep -v -w -f stopwords.txt
to remove the stopwords from our lyrics. What does this command do?
- grep is used to match one or more patterns; in this case, a whole bunch of stopwords listed in a file.
- -v means we invert the match, so we keep only the lines that don't match any of the words in the stopwords file.
- -w makes sure we match whole words, not parts of words.
- -f tells grep to read its patterns from a file rather than taking a single pattern on the command line.
The top 10 words excluding stopwords for Here Comes The Sun are:
25 sun
15 comes
6 little
6 darling
5 right
4 say
3 years
3 since
3 like
2 seems
For Paint it Black:
12 black
11 see
10 painted
6 turn
4 want
3 wanna
3 red
3 look
3 hmm
3 door
Visualising our top words with Gnuplot
Installing Gnuplot
First things first: we'll have to install gnuplot. On Mac you can use brew to install it. For any other platform, please check out the official download page.
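For example, on macOS:

```bash
brew install gnuplot
```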
Once you enter gnuplot in the terminal and see something like the following, you should be good to go.

Preparing the data
For each song we do the following:
- We start with what we had before (word list sorted by counts descending)
- Swap the word count and word around, so each line becomes <word>\t<word count>
- Insert column names on row 1
- Select only the top 10
- Save it in <title>_top10.tsv
These are the respective commands we use:
cat herecomesthesun.txt| tr -d '[:punct:]' | tr ' ' '\n' | tr A-Z a-z | awk NF | grep -v -w -f stopwords.txt | sort | uniq -c | sort -nr
awk '{ print $2 "\t" $1}'
sed "1s/^/word\tfreq\n/"
(essentially pushes original line 1 to second line)head -n 10
<all previous commands> > herecomesthesun_top10.tsv
The full command:
cat herecomesthesun.txt| tr -d '[:punct:]' | tr ' ' '\n' | tr A-Z a-z | awk NF | grep -v -w -f stopwords.txt | sort | uniq -c | sort -nr | awk '{ print $2 "\t" $1}' | sed "1s/^/word\tfreq\n/" | head -10 > herecomesthesun_top10.tsv
Using Gnuplot
With the Gnuplot CLI you can declare your graph settings and then plot your data. It is relatively straightforward but the syntax is a little different than your regular old Matplotlib or Seaborn. Let's get started:
Set the styles of the barchart
set boxwidth 0.5
set style fill solid
set style line 1 linecolor "blue"
Set plot title and axes
set title 'Unigram Frequencies Here Comes The Sun'
set xlabel 'Words'
set ylabel 'Frequencies'
Set output settings
set terminal png size 700,400; set output 'herecomesthesun.png'
Let's plot
plot "herecomesthesun_top10.tsv" using 0:2:xtic(1) with boxes ls 1 notitle
The result

It's great that this functionality exists at the command line, but honestly, I am way too comfortable with my trusty friends Matplotlib and Seaborn to make Gnuplot a regular in my toolkit. 😉
Summary
It's a wrap! In this tutorial, I've shown you how to do simple data analysis at the command line.
Note that such a comparative analysis would be even better if we compared entire discographies... I'll leave that as an exercise to the reader! 💪
In this tutorial we've learned:
- How to do basic stats (number of words, unique words, type/token ratios)
- How to calculate the most frequent words
- How to calculate co-occurring words
- How to remove stopwords
- How to plot your frequency charts using Gnuplot
Please keep in mind that it's helpful to use the right tool for a particular problem.
Personally, I would probably pick Python over the command line in 99% of the cases when conducting such an analysis. It is simply more convenient, easier to work with, and has a bigger toolbox.
Regardless, it is fun to challenge yourself a bit and explore alternative ways of solving problems! 😁
Moreover, sometimes you simply won't have access to Python. In that case it's great to be skilled enough at the command line to get the insights you want. For example, I sometimes have to check out my server logs and it is always very useful to know these commands to quickly zoom in on the data you need.
Let's keep in touch! 📫
If you would like to be notified whenever I post a new article, you can sign up for my email newsletter here.
If you have any comments, questions or want to collaborate, please email me at lucy@lucytalksdata.com or drop me a message on Twitter.