Introduction

Influences and inspirations

I have loved music since the day I was created. I love to be surrounded by glorious sounds and music I can really connect with. I’m by no means a musician or a sound engineer; I’m simply a lover and enthusiast of listening to (and sometimes singing along with) music.

My family greatly influenced this musical enthusiasm. I grew up in a family full of music listening enthusiasts and “ninja singers” (people who sing when they think no one is around or can hear them!). My Dad influenced me with his eclectic collection of classic rock and famous mainstream songwriters from the ’60s through to the ’80s. From my Dad, I learned very early on to appreciate the crystal-clear sound vinyl records offer, as well as quality stereo equipment. My Mum influenced me with her regular listening to radio stations, in particular Gold 104 FM (Melbourne, Australia), and random sing-alongs in the car. My brother influenced me with his electrical engineering talents. I distinctly remember as a young kid the day he created his first car-boot boom box. Thundering rumbles of finely tuned bass could be heard from three blocks away in tranquil suburbia, and of course it was ’90s techno pop, electronica and occasionally Jamiroquai!

So how did all this intriguing musical enthusiasm influence me?

I was always drawn to classic and hard rock. I’ve always been fascinated with the interaction of sound compositions and poetic lyrical prose. Being an athlete since I was 9, I recognised early on how sounds and lyrics could influence my performances. There’s music to psych you up or calm you down, and rhymes to help you stay in time!

I discovered my fan-girl love for Irish rock group U2 when I was 14. It was their album “All That You Can’t Leave Behind” which got my attention. I then did what any obsessed fan-girl would do. I researched everything humanly possible about their music, and in addition to knowing every lyric to every song U2 ever wrote, I consequently ended up learning a great deal about many music-related and life-experience topics. Things like the guitars and complex sound effects The Edge uses, Bono’s public speaking, songwriting talents and stage presence, Larry Mullen Junior’s masterful methods of staying out of the public spotlight while protecting his band, and Adam Clayton’s talents for creative collaboration and sophisticated, artistic conversation.

As the years moved on, I began to open up and diversify my music collection. I have my husband to thank in part for this, introducing me to heavy metal and modern electronic genres. These days, my music interest is eclectic. On any single day I enjoy listening to anything from Eminem to Enya, Rammstein to Bobby McFerrin, Tibetan Buddhist monk chanting to DragonForce’s “Through the Fire and Flames”. Yes, I’m also a huge fan of the Audiosurf game series, and Guitar Hero!

 

The purpose and objective for this analysis

The data of music and unpacking personal music preferences

Few days go by where I’m not listening to music. I spend hours every day working away while listening to music, many hours creating an inadvertent fashion trend with my headphones, and many hours with my thoughts randomly jumping in with playful musical commentary.

More recently, this thought pattern commentary has been circulating in my head while listening to my personal collection of music:

  1. “This is such an eclectic playlist; how do you make sense of such a diverse collection of music? Can we prove this is truly an eclectic playlist?”
  2. “How on earth are these pieces of music even related or connected?” “Yes, Joshua Tree, I’m referring to you. I still haven’t found what I’m looking for, and it’s not Bullet the Blue Sky!”
  3. “What makes this song unique, special, similar and different to another song in your playlist? Surely Led Zeppelin’s Kashmir is some kind of random outlier? What makes this song so impactful?”
  4. “Does this album actually tell a story or have a meaning?”
  5. “Is it possible to obtain an understanding of a music album through only observing its lyrics, having never actually listened to it?”
  6. “Hey Bree, you are a data scientist by trade. Music IS data, could you do something useful and insightful with it to find some answers?”

On that last comment, my inner kid sparked up and went “ooh this could be so much fun, we can learn, write, and do an analysis with sound engineering, signal processing, text mining and natural language processing”. My logical mind very quickly realised that covering all of those topics in a single article wouldn’t do any of them justice. I would need to chunk these topics down into separate articles. So for this article, the purpose is to demonstrate with great love:

A comprehensive music lyrical analysis!

Working purely with the lyrical components for a selected collection of music album data, I will apply contemporary text mining and natural language processing techniques to this data to explore the above questions.

This analysis is aimed at the general population of music enthusiasts, data scientists, data-folk, and anyone who’s ever wondered what on earth Led Zeppelin’s Kashmir was about.

 

Technical

Guiding questions for this analysis

As we flow through this analysis I’m seeking to weave in my responses to five key technical questions which are relevant for text mining and natural language processing:

  1. How will I acquire, structure and group the data?
  2. How can I visualise the data to tell a story and help make meaningful sense of the data?
  3. What are the important key words and n-gram constructs associated with each artist’s album?
  4. What are the sentiments and emotional representations of each song and album?
  5. What topics emerge from the albums? Can machine learning be useful here?

 

Project workflow

This project is written in R, with the Github repository link here.

Sequencing:

1. Data Considerations

    a. Intellectual property & copyrights
    b. Artist and album selection
    c. Obtaining lyrics
    d. Converting lyric sheets into a useful dataset
    e. Data cleaning & pre-processing
    f. Additional data: Using the Spotify API
    g. Observing the specific and unique nature of music lyrics in a text analysis context
    h. Variable names and the source dataset

2. Manual Data Exploration

    a. Creating tokens & data wrangling
    b. Identifying vocal and instrumental songs
    c. Word counts by song & album
    d. Wordclouds
    e. Lexical diversity (vocabulary)
    f. Song lyrics self-similarity matrices (SongSim) & repetition
    g. Term Frequency - Inverse Document Frequency (TF-IDF)

3. Sentiment Analysis & Natural Language Processing (NLP)

    a. NRC Emotional Sentiment
    b. N-grams, bi-grams & tri-grams
    c. Bi-gram network analysis
    d. Pair-wise comparisons
    e. Album similarity
    f. Song dissimilarity (agreement between lyrics)

4. Unsupervised Machine Learning

    a. Topic modelling: Structured Topic Modelling (STM)

5. Findings & Learnings

    a. Findings
    b. Learnings, gotchas, traps for young players
    c. Where to next & part 2

6. References

 

1. Data Considerations

a. Intellectual property & copyrights

Because we are working with copyrighted material (music lyrics), I’d like to make it clear that copyright in the songs and lyrics belongs to the artists and songwriters who created them.

 

b. Artist and album selection

After much deliberation and consulting with friends and family, I decided to select 6 artists and one album from each of these artists. The selection (cherry picking) of these artists is based on the human perception that, when viewed as a collection, this group of artists would be considered an “eclectic” grouping. This analysis will test that perception and arrive at a conclusion by the end.

 

The summary below lists, for each artist: the album, release year, genre, track listing, review tags, and other notes & rationale for selection.

Daft Punk: Discovery (2001), Electronic
Tracks:
  1. One More Time
  2. Aerodynamic
  3. Digital Love
  4. Harder, Better, Faster, Stronger
  5. Crescendolls
  6. Nightvision
  7. Superheroes
  8. High Life
  9. Something About Us
  10. Voyager
  11. Veridis Quo
  12. Short Circuit
  13. Face to Face
  14. Too Long
Review tags: Electronic, House, Rock, Techno, Funk, Modern Disco

Notes: Aerodynamic, Crescendolls, Nightvision, Superheroes, High Life, Voyager, Veridis Quo and Short Circuit are instrumental songs. We will include these for sound analysis in a subsequent exploration of music analysis.

The music video movie “Interstella 5555: The 5tory of the 5ecret 5tar 5ystem” is the visual realisation of Discovery.

http://www.imdb.com/title/tt0368667/

Bree’s favourite Daft Punk album.

U2: The Joshua Tree (1987), Rock
Tracks:
  1. Where the streets have no name
  2. I still haven’t found what I’m looking for
  3. With or without you
  4. Bullet the blue sky
  5. Running to stand still
  6. Red hill mining town
  7. In God’s country
  8. Trip through your wires
  9. One tree hill
  10. Exit
  11. Mothers of the disappeared
Review tags: Rock, Gospel

Notes: One of the biggest-selling albums of all time. The 30th anniversary of its release in 2017 saw U2 take it on tour.
Elton John: Honky Chateau (1972), Rock
Tracks:
  1. Honky Cat
  2. Mellow
  3. I think I’m going to kill myself
  4. Susie (Dramas)
  5. Rocket Man (I think it’s going to be a long, long time)
  6. Salvation
  7. Slave
  8. Amy
  9. Mona Lisas and Mad Hatters
  10. Hercules
Review tags: Rock, Pop, Rock & Roll

Notes: Rolling Stone believes this was the album which marked the transformation of Elton John from gentle singer-songwriter to a legitimate rock star.

https://www.rollingstone.com/music/pictures/readers-poll-the-10-best-elton-john-albums-20130918/5-honky-chateau-3-78-9

Killswitch Engage: Alive or Just Breathing (2002), Metalcore
Tracks:
  1. Numbered Days
  2. Self Revolution
  3. Fixation on the Darkness
  4. My Last Serenade
  5. Life to Lifeless
  6. Just Barely Breathing
  7. To The Sons of Man
  8. Temple from Within
  9. The Element of One
  10. Vide Infra
  11. Without a Name
  12. Rise Inside
  13. When the Balance is Broken
Review tags: Hardcore Metal

Notes: Loudwire suggests this is the greatest album KsE produced.

http://loudwire.com/killswitch-engage-albums-ranked/

Bree has not yet listened to this album.

Iron Maiden: Powerslave (1984), Heavy Metal
Tracks:
  1. Aces High
  2. Two Minutes to Midnight
  3. Losfer Words (Big ’Orra)
  4. Flash of the Blade
  5. The Duellists
  6. Back in the Village
  7. Powerslave
  8. Rime of the Ancient Mariner
Review tags: Hard Rock, Heavy Metal

Notes: LouderSound suggests this is the greatest album Iron Maiden produced.

https://www.loudersound.com/features/every-iron-maiden-album-ranked-from-worst-to-best

Bree has not yet listened to this album.

Led Zeppelin: Physical Graffiti (1975), Rock
Tracks:
  1. Custard Pie
  2. The Rover
  3. In my Time of Dying
  4. Houses of the Holy
  5. Trampled Under Foot
  6. Kashmir
  7. In the Light
  8. Bron-Yr-Aur
Review tags: Hard Rock, Heavy Metal

Notes: Bree wants to inspect Kashmir further.

Bron-Yr-Aur is an instrumental.

 

c. Obtaining lyrics

I made a conscious decision to download the lyrics for the selected albums manually. This was achieved by googling, finding the correct lyric sheets, then copying and pasting them into Microsoft Word. I could have just as easily set up an R script to web-scrape, but I wanted to observe, and have full control over, the state and structure the lyrics were stored in, and to apply some very specific formats, for reasons I will explain in the next section.

 

d. Converting lyric sheets into a useful dataset

For the source dataset we’re creating, we want to maintain the “structural integrity” of the music lyrics as we import them into data frames. This means our dataset looks the same way the lyric sheet looks. We have rows for album title, rows for song titles, rows for each verse and each chorus.

Why?

Music is all about patterns, sequences and time signatures. Lyrics are written differently to many other forms of written communication. Lyrics are also interwoven with sound; otherwise they would just be poetry or a short story. It would be inappropriate at this point to treat the lyrics as one giant “word pool”. We need to maintain hierarchical groupings and identify which lyrics belong with which artist and album. Let’s first observe the lyrics in as close to the original structure as they were written.

How do we do this?

Let’s keep the lyric sheets in .docx format for now and check the integrity of the style formats. There’s magic in the little “¶” button on the paragraph menu pane in Microsoft Word.

For each lyric sheet we have, turn on paragraph marks and make the following adjustments:

  • Configure styles for Heading 1, Heading 2, Heading 3, Normal body and spacing after paragraph
  • Heading 1 – Artist name
  • Heading 2 – Album name
  • Heading 3 – Song title name
  • Song lyrics – Normal body

Each verse and chorus is separated by a paragraph. Individual lines within verses and choruses are separated with a carriage return (Enter key), with a single space at the end of each line.

Separate each style component used with a single paragraph. Paragraphs will create new rows for our dataset, as will differentiating styles.

Where a song is instrumental, use "[Instrumental]" as a consistent tag across all lyric sheets for ease of flagging these songs for filtering later.

Here is an example of the lyric sheet for U2 - The Joshua Tree, and of reading it in to create a data frame.

 

 

We can use the officer and qdapTools packages to read in the collection of lyric sheet documents:

# Working with Word documents (docx):
# Read in each file and use docx_summary() to map the styles used in each document as records in a data frame.
# Then apply some common hierarchical groupings (Artist, Album, Track Name) and add in verse line counters.
# Note: F() is this project's file-path helper.
library(officer)
library(qdapTools)
library(zoo)
library(dplyr)

doc02 <- read_docx(F("Data/Raw/U2 - The Joshua Tree.docx"))
raw.data02 <- docx_summary(doc02) %>%
  mutate(CATMusicArtist = "U2",
         CATMusicAlbum = "The Joshua Tree",
         CATTrackName = ifelse(style_name == "heading 3", text, NA),
         CATTrackName = zoo::na.locf(CATTrackName, na.rm = FALSE), # fill the track name down to the lyric rows beneath it
         style_name = ifelse(is.na(style_name), "body", style_name)
  ) %>%
  group_by(CATTrackName) %>%
  mutate(NUMTrackLyricLineNumber = sequence(n()) - 1) # minus 1 so the heading row itself isn't counted as a lyric line

 

Our imported lyric sheet for U2 - The Joshua Tree has now become a nicely structured data frame. Here’s an example of the first five records:

 

 

 

We will repeat the above steps for the other Artists and Albums in our selection and append the datasets together.
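
To avoid repeating the same block six times, here is a minimal sketch of looping the import over every lyric sheet. The import_lyric_sheet() helper and the file names in the albums table are hypothetical illustrations; they simply wrap the steps demonstrated above.

library(officer)
library(dplyr)
library(purrr)
library(tibble)
library(zoo)

# Hypothetical helper wrapping the import steps shown above
import_lyric_sheet <- function(path, artist, album) {
  docx_summary(read_docx(path)) %>%
    mutate(CATMusicArtist = artist,
           CATMusicAlbum  = album,
           CATTrackName   = ifelse(style_name == "heading 3", text, NA),
           CATTrackName   = zoo::na.locf(CATTrackName, na.rm = FALSE),
           style_name     = ifelse(is.na(style_name), "body", style_name)) %>%
    group_by(CATTrackName) %>%
    mutate(NUMTrackLyricLineNumber = sequence(n()) - 1) %>%
    ungroup()
}

# Illustrative file names; the remaining four artists follow the same pattern
albums <- tribble(
  ~path,                                    ~artist,     ~album,
  F("Data/Raw/U2 - The Joshua Tree.docx"),  "U2",        "The Joshua Tree",
  F("Data/Raw/Daft Punk - Discovery.docx"), "Daft Punk", "Discovery"
)

# Import every sheet and append the results into one dataset
wrk.data <- pmap_dfr(albums, import_lyric_sheet)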

 

e. Data cleaning & pre-processing

Next, we will perform some data checks and cleanup using the qdap package.

# Using the dataset which contains all raw imported lyric sheets from the previous step:
# - tidy up the track name variable for when we need to use it as a merge key
# - identify the songs which are instrumental; we will filter these out for this analysis
library(qdap)
library(dplyr)

wrk.01_Data_Prep <- wrk.data %>%
  mutate(BINTrackIsInstrumental = ifelse(style_name == "body" & trimws(text) == "[Instrumental]", 1, 0)) %>%
  mutate(KEYTrackName = toupper(trimws(CATTrackName)))

# Use qdap's qview() to inspect the data and check_text() to identify suggestions
# for tidy-up. Results will be written to a text file.
qview(wrk.01_Data_Prep)
check_text(wrk.01_Data_Prep$text, file = F("Data/Raw/QDAPCheckText_wrk.01_Data_Prep.txt"))

# Applying the qdap cleanup recommendations
wrk.01_Data_Prep$text <- replace_number(wrk.01_Data_Prep$text, num.paste = TRUE, remove = FALSE) # spell out numbers
wrk.01_Data_Prep$text <- incomplete_replace(wrk.01_Data_Prep$text) # standardise incomplete sentence end marks
wrk.01_Data_Prep$text <- comma_spacer(wrk.01_Data_Prep$text)       # ensure a space follows each comma
wrk.01_Data_Prep$text <- clean(wrk.01_Data_Prep$text)              # remove escaped characters & extra whitespace
wrk.01_Data_Prep$text <- scrubber(wrk.01_Data_Prep$text, fix.comma = TRUE, fix.space = TRUE) # general text scrub

 

f. Additional data: Using the Spotify API

I thought it would be useful to explore alternative sources of data to complement the lyrical dataset we just created. In my exploration, I came across the spotifyr package with more details here. After setting up a Spotify developers account I was able to obtain the necessary “client ID” and “client secret” tokens, and set these as environment variables to use in the spotifyr functions.

 

#devtools::install_github('charlie86/spotifyr')
#install.packages('spotifyr')
library(spotifyr)
library(dplyr)
library(purrr)      # for reduce()
library(data.table) # for rbindlist()
library(feather)    # for write_feather()
# Reference: https://github.com/charlie86/spotifyr

# App name "MusicLyricAnalysis1"
Sys.setenv(SPOTIFY_CLIENT_ID = "ADD TOKEN HERE")
Sys.setenv(SPOTIFY_CLIENT_SECRET = "ADD TOKEN HERE")

access_token <- get_spotify_access_token()

# Extract the audio feature data from Spotify for each artist
spotify_df_U2 <- get_artist_audio_features('U2', access_token)
spotify_df_DaftPunk <- get_artist_audio_features('Daft Punk', access_token)
spotify_df_EltonJohn <- get_artist_audio_features('Elton John', access_token)
spotify_df_LedZeppelin <- get_artist_audio_features('Led Zeppelin', access_token)
spotify_df_KillswitchEngage <- get_artist_audio_features('Killswitch Engage', access_token)
spotify_df_IronMaiden <- get_artist_audio_features('Iron Maiden', access_token)

# Filter each extract down to the selected album
spotify_U2_filtered <- filter(spotify_df_U2, album_name == "The Joshua Tree (Deluxe)") # we have extra live songs
spotify_DaftPunk_filtered <- filter(spotify_df_DaftPunk, album_name == "Discovery")
spotify_EltonJohn_filtered <- filter(spotify_df_EltonJohn, album_name == "Honky Chateau") # no results found
spotify_LedZeppelin_filtered <- filter(spotify_df_LedZeppelin, album_name == "Physical Graffiti")
spotify_KillswitchEngage_filtered <- filter(spotify_df_KillswitchEngage, album_name == "Alive or Just Breathing [Topshelf Edition]")
spotify_IronMaiden_filtered <- filter(spotify_df_IronMaiden, album_name == "Powerslave (1998 Remastered Edition)")

# We can keep these datasets in data frame format for now; the dataset size is tiny.
raw.SpotifyArtistList <- list(spotify_U2_filtered, spotify_DaftPunk_filtered, spotify_EltonJohn_filtered,
                              spotify_LedZeppelin_filtered, spotify_KillswitchEngage_filtered, spotify_IronMaiden_filtered)
# Append all of the above raw datasets together
raw.SpotifyArtistAlbumTrackData <- rbindlist(raw.SpotifyArtistList) %>%
  mutate(KEYTrackName = toupper(trimws(track_name)))

# Save a feather file of the appended raw Spotify data
write_feather(raw.SpotifyArtistAlbumTrackData, F("Data/Raw/raw.SpotifyArtistAlbumTrackData.feather"))

# Merge the Spotify data onto the lyrics dataset by track name
wrk.01_DataPrep_LyricsWithSpotify <- list(wrk.01_Data_Prep, raw.SpotifyArtistAlbumTrackData) %>%
  reduce(left_join, by = c("KEYTrackName" = "KEYTrackName"))

# Save a feather file of the merged lyrics + Spotify dataset
write_feather(wrk.01_DataPrep_LyricsWithSpotify, F("Data/Processed/wrk.01_DataPrep_LyricsWithSpotify.feather"))

 

Inspecting the Spotify data for the collection of albums:

# Check over the dataset
glimpse(raw.SpotifyArtistAlbumTrackData)
## Observations: 76
## Variables: 24
## $ album_uri          <chr> "2t4UTpa53ALkISHhiKrEtv", "2t4UTpa53ALkISHh...
## $ album_name         <chr> "The Joshua Tree (Deluxe)", "The Joshua Tre...
## $ album_img          <chr> "https://i.scdn.co/image/3dc58a6d1e838ff4d5...
## $ album_release_date <chr> "1987-03-03", "1987-03-03", "1987-03-03", "...
## $ album_release_year <date> 1987-03-03, 1987-03-03, 1987-03-03, 1987-0...
## $ album_popularity   <int> 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62,...
## $ track_name         <chr> "Where The Streets Have No Name", "I Still ...
## $ track_uri          <chr> "2IlT1DLSpmmHkHlAeuHMU3", "4GW8K6bDiiJGEgGP...
## $ danceability       <dbl> 0.4950, 0.5660, 0.5430, 0.3360, 0.5270, 0.3...
## $ energy             <dbl> 0.728, 0.783, 0.432, 0.653, 0.189, 0.679, 0...
## $ key                <chr> "D", "C#", "D", "G#", "D", "C", "E", "C", "...
## $ loudness           <dbl> -9.549, -9.412, -11.832, -10.210, -18.605, ...
## $ mode               <chr> "major", "major", "major", "major", "major"...
## $ speechiness        <dbl> 0.0385, 0.0363, 0.0288, 0.0499, 0.0293, 0.0...
## $ acousticness       <dbl> 0.011000, 0.015700, 0.000207, 0.006700, 0.8...
## $ instrumentalness   <dbl> 0.0035300, 0.0030000, 0.3690000, 0.4380000,...
## $ liveness           <dbl> 0.1510, 0.0806, 0.1460, 0.1360, 0.3340, 0.2...
## $ valence            <dbl> 0.2180, 0.5870, 0.1070, 0.4610, 0.2090, 0.3...
## $ tempo              <dbl> 125.810, 100.864, 110.196, 152.308, 94.642,...
## $ duration_ms        <dbl> 337506, 277477, 295516, 271547, 257366, 292...
## $ time_signature     <dbl> 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 3, 4...
## $ key_mode           <chr> "D major", "C# major", "D major", "G# major...
## $ track_popularity   <int> 39, 41, 68, 34, 34, 33, 33, 31, 32, 29, 29,...
## $ KEYTrackName       <chr> "WHERE THE STREETS HAVE NO NAME", "I STILL ...

After attempting to extract the desired albums from the selected artists, I discovered that not all albums, or songs from the albums, were available on Spotify. For example, Elton John’s Honky Chateau was not available to extract; only single tracks were available via a best-of album. For U2’s The Joshua Tree, the original release is not available, only the deluxe edition, which comes with extra live performance tracks. The situation is similar for Iron Maiden’s Powerslave.

With this information, I decided the Spotify data is simply a “nice to have” and will not be essential to this analysis. I would very much have liked it to cover all artists, all albums and all songs in the collection.

We will leave the Spotify variables in the merged dataset, but we won’t use them in this analysis.

 

CheckSpotify <- select(raw.SpotifyArtistAlbumTrackData, album_name, track_name)
checkSpotify_NotMatchingSource <- anti_join(CheckSpotify, wrk.01_Data_Prep, by = c("track_name" = "CATTrackName")) # all Spotify songs not matching our source data
glimpse(select(checkSpotify_NotMatchingSource, album_name, track_name))
## Observations: 42
## Variables: 2
## $ album_name <chr> "The Joshua Tree (Deluxe)", "The Joshua Tree (Delux...
## $ track_name <chr> "Where The Streets Have No Name", "With Or Without ...

 

g. Observing the specific and unique nature of music lyrics in a text analysis context

A few things we can observe and acknowledge so far:

  • Some songs will be instrumental. The raw lyric sheets will only contain “[Instrumental]” for instrumental songs. We can filter these out of our analysis and re-use them in a subsequent one, perhaps for signal or beat detection analysis!
  • Each lyric “line” is meaningful in the flow of a song. Each line can be linked to subsequent lines via rhyming and context. We expect the lyric sheets to loosely resemble poetry and we expect a higher instance of repetition, because music is full of patterns (unless we’re dealing with some kind of random jam session or jazz genres!).
  • The number of songs will vary per artist and album. Some albums will have more songs than others. We need to be mindful of any analysis utilising word counts or averages.
  • In reality, the lyrics are interwoven with an audio track; this analysis is like staring at words in segregated silence. The audio track adds the dimensions of time series, rhythm, verbal intonation, speech patterns and emotional sentiment, all of which lead us to interpret the lyrics differently from how we would analysing them in isolation. Think for a moment about receiving a text-based email or text message: the emotions, sentiment and context of the communication can be very difficult to convey in text alone, and the same message transmitted as sound can be received very differently.

 

h. Variable names and the source dataset

Here is a summary of the source dataset we’ll use for this analysis. It is one row per album track. All lyrics for each track are contained within a single variable, TXTAllTrackLyrics. The end of each lyric line is marked with a “<br>” tag.

# Create a new data frame with one row of lyrics for each track (instead of multiple rows per verse/chorus)
library(dplyr)
library(stringr)

wrk.02_TextAnalysis_00 <- wrk.01_DataPrep_LyricsWithSpotify %>%
  group_by(CATTrackName) %>%
  mutate(TXTAllTrackLyrics = paste0(text, "<br>", collapse = " ")) %>% # append "<br>" to each line, then collapse into one string
  mutate(NUMMaxLyricLines = max(as.numeric(NUMTrackLyricLineNumber)))

# Create a dataset with one row per song. One variable & record to hold all lyrics for a song.
wrk.02_TextAnalysis_01 <- wrk.02_TextAnalysis_00 %>%
  filter(style_name == "heading 3")

# Let's now remove the songs which were purely instrumental
wrk.02_TextAnalysis_02 <- wrk.02_TextAnalysis_01 %>%
  filter(!str_detect(TXTAllTrackLyrics, "Instrumental"))

# Summary
glimpse(wrk.02_TextAnalysis_02)
## Observations: 59
## Variables: 37
## $ doc_index               <int> 3, 22, 29, 68, 78, 82, 3, 10, 15, 22, ...
## $ content_type            <chr> "paragraph", "paragraph", "paragraph",...
## $ style_name              <chr> "heading 3", "heading 3", "heading 3",...
## $ text                    <chr> "One More Time", "Digital Love", "Hard...
## $ level                   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ num_id                  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ CATMusicArtist          <chr> "Daft Punk", "Daft Punk", "Daft Punk",...
## $ CATMusicAlbum           <chr> "Discovery", "Discovery", "Discovery",...
## $ CATTrackName            <chr> "One More Time ", "Digital Love ", "Ha...
## $ NUMTrackLyricLineNumber <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BINTrackIsInstrumental  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ KEYTrackName            <chr> "ONE MORE TIME", "DIGITAL LOVE", "HARD...
## $ album_uri               <chr> "2noRn2Aes5aoNVsU6iWThc", "2noRn2Aes5a...
## $ album_name              <chr> "Discovery", "Discovery", NA, "Discove...
## $ album_img               <chr> "https://i.scdn.co/image/1a9dab25976c7...
## $ album_release_date      <chr> "2001-03-07", "2001-03-07", NA, "2001-...
## $ album_release_year      <date> 2001-03-07, 2001-03-07, NA, 2001-03-0...
## $ album_popularity        <int> 76, 76, NA, 76, 76, 76, 62, 62, 62, 62...
## $ track_name              <chr> "One More Time", "Digital Love", NA, "...
## $ track_uri               <chr> "0DiWol3AO6WpXZgp0goxAV", "2VEZx7NWsZ1...
## $ danceability            <dbl> 0.611, 0.644, NA, 0.875, 0.874, 0.691,...
## $ energy                  <dbl> 0.697, 0.664, NA, 0.475, 0.437, 0.582,...
## $ key                     <chr> "D", "A", NA, "A", "C#", "F", "D", "C#...
## $ loudness                <dbl> -8.618, -8.398, NA, -12.673, -10.234, ...
## $ mode                    <chr> "major", "major", NA, "minor", "minor"...
## $ speechiness             <dbl> 0.1330, 0.0333, NA, 0.0986, 0.0706, 0....
## $ acousticness            <dbl> 0.019300, 0.048100, NA, 0.440000, 0.00...
## $ instrumentalness        <dbl> 0.0000000, 0.8620000, NA, 0.7200000, 0...
## $ liveness                <dbl> 0.3320, 0.3420, NA, 0.0460, 0.0293, 0....
## $ valence                 <dbl> 0.4760, 0.5300, NA, 0.3840, 0.9630, 0....
## $ tempo                   <dbl> 122.752, 124.726, NA, 99.958, 117.790,...
## $ duration_ms             <dbl> 320357, 301373, NA, 232667, 240173, 60...
## $ time_signature          <dbl> 4, 4, NA, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ key_mode                <chr> "D major", "A major", NA, "A minor", "...
## $ track_popularity        <int> 75, 63, NA, 67, 60, 52, 39, 41, 68, 34...
## $ TXTAllTrackLyrics       <chr> "One More Time<br> One more time One m...
## $ NUMMaxLyricLines        <dbl> 16, 6, 30, 3, 3, 12, 6, 4, 6, 6, 3, 8,...

From this dataset, the most important variables we will use frequently are:

  • text & TXTAllTrackLyrics, usage will depend on the character string structure we need for text analysis functions.
  • CATMusicArtist, our main grouping variable.
  • CATMusicAlbum, our secondary grouping variable.
  • CATTrackName, most granular grouping variable.
  • BINTrackIsInstrumental, for filtering out songs with no lyrics.

 

2. Manual Data Exploration

a. Creating tokens & data wrangling

Our dataset is almost ready for exploration. There are a few remaining data wrangling steps to complete:

  1. Manually identify any “undesirable words” in the context of lyrics. Sometimes we see markers in the lyric sheets like “chorus”, “verse”, “repeat x3” etc.
  2. We have the option to remove stop words at this point, although I would like to leave them in for the initial exploration, purely to observe for now, and remove them when we begin more detailed text analysis.
  3. Create lyric line tokens: split the lyrics into individual verse lines, using the tidytext package.
  4. Create lyric word tokens: split the lyrics into individual words, using the tidytext package.
  5. Create a numeric counter of how many words per song, per album & artist.

 

library(tidytext)
library(dplyr)

# Identify any specific/customisable words we wish to eliminate later on
undesirable_words <- c("chorus", "lyrics", "verse")

# Create lyric verse tokens using tidytext [dataset will be one row per verse]
lineToken <- wrk.02_TextAnalysis_02 %>%
  ungroup() %>%
  unnest_tokens(line, TXTAllTrackLyrics, token = stringr::str_split, pattern = '<br>') %>% # break the lyrics into verses
  mutate(lineCount = row_number()) # create a line counter so we know which verse record is which

# Create lyric word tokens and apply the tidy text format [dataset will be one row per lyric word]
# We have the option to remove stop words at this point.
# Electing not to do this yet, as we wish to visually observe the raw form of the dataset.
wordToken <- lineToken %>%
  unnest_tokens(word, line) %>% # break the lyrics into individual words
  # anti_join(stop_words) %>% # tidytext stop word removal (deferred for now)
  filter(!word %in% undesirable_words) # remove our custom-configured undesirable words

# Add in full word counts for each song (non-distinct)
wrk.02_TextAnalysis_03_WordCount <- wordToken %>%
  group_by(CATMusicArtist, CATMusicAlbum, CATTrackName) %>%
  summarise(num_words = n()) %>%
  arrange(desc(num_words))

 

b. Identifying vocal and instrumental songs

We need to identify which songs are instrumental so we can exclude them from this analysis. Not surprisingly, more than half of the songs on Daft Punk’s Discovery are instrumental. This is significant: 8 of the album’s 14 songs are flagged as instrumental, leaving only 6 songs available to lyrically analyse and compare with the other 5 selected artists.
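
A minimal sketch of producing this exclusion list, using the BINTrackIsInstrumental flag we created during data preparation:

library(dplyr)

# List the instrumental tracks flagged earlier so we can exclude them
wrk.01_Data_Prep %>%
  filter(BINTrackIsInstrumental == 1) %>%
  distinct(CATMusicArtist, CATMusicAlbum, CATTrackName)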

 

Instrumental Songs to Exclude from Analysis

CATMusicArtist     | CATMusicAlbum           | CATTrackName
-------------------|-------------------------|--------------------------
Daft Punk          | Discovery               | Aerodynamic
Daft Punk          | Discovery               | Crescendolls
Daft Punk          | Discovery               | Nightvision
Daft Punk          | Discovery               | Superheroes
Daft Punk          | Discovery               | High Life
Daft Punk          | Discovery               | Voyager
Daft Punk          | Discovery               | Veridis Quo
Daft Punk          | Discovery               | Short Circuit
Iron Maiden        | Powerslave              | Losfer Words (Big ’Orra)
Killswitch Engage  | Alive or Just Breathing | Without a Name
Led Zeppelin       | Physical Graffiti       | Bron-Yr-Aur

 

c. Word counts by song & album

With our dataset now ready for exploration, let’s inspect these questions:

  • In total, how many songs with lyrics are available to work with?
  • What are the raw word counts for each of these songs?

We have 59 songs to work with, and Iron Maiden’s Rime of the Ancient Mariner has a very large word count at 650 words, significantly higher than any other song in our dataset.

Could this song’s lyrics skew any subsequent analysis we will be performing?

 

 

While this table shows us the raw individual word counts for each song, it doesn’t clearly illustrate the Artist-Album groupings and the total word counts per Artist-Album. Some questions here:

  • How many songs do we have for each artist & album?
  • What is the “wordiest” album in the collection? Could this be used as a basic indicator of the balance between storytelling and sound engineering or audible experience?

We must be careful here about using total word counts. Considering Daft Punk’s Discovery had 8 of 14 songs excluded due to them being instrumental, there are fewer songs available to contribute to an overall word count.
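
The published chart was built from the word count dataset; a hedged sketch of how such a view can be produced with ggplot2 follows (the exact aesthetics of the original chart are not reproduced here):

library(dplyr)
library(ggplot2)

# Total words and song counts per artist & album, plotted as horizontal bars
wrk.02_TextAnalysis_03_WordCount %>%
  group_by(CATMusicArtist, CATMusicAlbum) %>%
  summarise(num_songs = n(), total_words = sum(num_words)) %>%
  ggplot(aes(x = reorder(CATMusicAlbum, total_words), y = total_words)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Total word count", title = "Lyric word counts by album")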

 

From this visualisation:

  • Daft Punk’s Discovery has 6 songs with lyrics, with a total word count of just under 1500.
  • Elton John’s Honky Chateau has 10 songs with lyrics, with a total word count of just over 1500.
  • Iron Maiden’s Powerslave has 7 songs with lyrics, with a total word count of just under 2000.
  • Killswitch Engage’s Alive or Just Breathing has 12 songs with lyrics, with a total word count of just under 1400.
  • Led Zeppelin’s Physical Graffiti has 13 songs with lyrics, with a total word count of near 3000.
  • U2’s The Joshua Tree has 11 songs with lyrics, with a total word count of just under 2000.

There’s not enough information here to draw any meaningful conclusions about storytelling or proving the collection is eclectic. However, it is interesting to note the variance of word counts between songs. In general, it seems a song can be of any length between 0 (instrumental) and 650 (or even more) words, and an album can have any number of songs listed. There is no consistency between the artists on either of these observations.

 

d. Wordclouds

We can visualise the raw word counts using word clouds. The intention here is to get an initial, basic view as to the common words for each artist & album.

We do need to be careful with the usage and context of these visualisations here because we know some songs were designed for repetition and certain words will dominate for this reason.

For this section I will use the wordcloud package to create the word clouds for each artist and album.
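
As a sketch of the approach (one artist shown; the others follow the same pattern), assuming the wordToken dataset built earlier:

library(dplyr)
library(wordcloud)

# Word frequencies for a single artist, rendered as a word cloud
u2_counts <- wordToken %>%
  filter(CATMusicArtist == "U2") %>%
  count(word, sort = TRUE)

wordcloud(words = u2_counts$word, freq = u2_counts$n,
          max.words = 100, random.order = FALSE,
          colors = RColorBrewer::brewer.pal(8, "Dark2"))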

 

Daft Punk - Discovery

U2 - The Joshua Tree

Elton John - Honky Chateau

Led Zeppelin - Physical Graffiti

Killswitch Engage - Alive or Just Breathing

Iron Maiden - Powerslave

 

On initial observation of these word clouds:

  • All of these word clouds are visually different and unique! At face value, this view suggests the collection of artists, albums and songs in the dataset is an eclectic collection.
  • There is an observable difference in word choice & vocabulary between artists. See Led Zeppelin’s relaxed usage of “mama” and “baby” versus Iron Maiden’s articulate “mariner” and “village”.
  • Elton John’s word cloud renders larger, with more defined words, versus Daft Punk’s smaller word cloud with less defined words.
  • There is perceived similar (but not precisely the same), synonymous word usage between U2 and Killswitch Engage: “heart”, “eyes”, “life” versus “love”.

 

e. Lexical diversity (vocabulary)

Time to explore the depth of each song’s lyrical vocabulary; we will refer to this as “lexical diversity”.

A curious subjective question at this point is: could a larger vocabulary for a song (and therefore artist) be an indicator of great storytelling?

In calculating the lexical diversity we will:

  • Remove stop words
  • Work with a dataset which is one row per word (un-nested, token by word)
  • Group by Artist & Album
  • Count by distinct words used in each song
  • Visualise using a pirateplot box plot, from the yarrr package
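
A hedged sketch of these steps, assuming the wordToken dataset and the yarrr package’s pirateplot() (the published plot’s exact styling may differ):

library(dplyr)
library(tidytext)
library(yarrr)

# Distinct word counts per song, with stop words removed
lexical_diversity <- wordToken %>%
  anti_join(stop_words, by = "word") %>%
  group_by(CATMusicArtist, CATMusicAlbum, CATTrackName) %>%
  summarise(num_distinct_words = n_distinct(word)) %>%
  ungroup()

# One dot per song, with a horizontal line at each artist's average
pirateplot(num_distinct_words ~ CATMusicArtist, data = lexical_diversity,
           xlab = "Artist", ylab = "Distinct words per song",
           point.o = 0.6, avg.line.o = 1, bar.f.o = 0)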

 

In this visualisation, each dot represents a song on the artist’s album, plotted by its total distinct (unique) count of words used. The black horizontal lines measure the average count of distinct words used for the artist & album.

It becomes immediately obvious that Iron Maiden’s Powerslave contains a wide range of more and less lexically diverse songs. Iron Maiden’s song “Rime of the Ancient Mariner” is by far an outlier in this visualisation, at 167 distinct words.

On the lower end of the scale, Daft Punk’s Discovery and Killswitch Engage’s Alive or Just Breathing appear to have a smaller vocabulary. Some human experiential insight into these musical compositions suggests these albums are heavily engineered for the audio experience. Daft Punk’s Discovery could be described as a dance record, while Killswitch Engage’s Alive or Just Breathing may have been designed to get the most out of vocally “growling” the lyrics, and the careful selection of words to achieve this purpose could be what we are observing in the lower vocabulary range.

In the mid-range of the lexical diversity we see Elton John, Led Zeppelin and U2; perhaps this is where the balance between audio experience and storytelling can be observed?

With this measure and perspective, we begin to see some similarities emerging between the artists and albums. The notion this is an eclectic grouping of music begins to have a counter argument.

 

f. Song lyrics self-similarity matrices (SongSim) & repetition

We have observed the lexical diversity for our collection of songs. Let’s now take a look at lyrical repetition within individual songs and observe for consistencies within and across albums.

Measuring and observing lyrical (or word) repetition is very relevant for this analysis as our context is music. Repetition can be observed in both instrumental waveforms and in lyrical structure. Ever had a song stuck in your head, repeating over and over?

The challenge here is to identify and use a visualisation which neatly and clearly describes repetition for our collection of songs.

Enter package songsim.

SongSim uses self-similarity matrices to visualise patterns of repetition in text. Each word (lyric) of a song forms a row and a column of the matrix. The cell at position (x, y) is filled in if the x-th and y-th words of the song are the same. For a more technical explanation check out the package author’s site here.

A self-similarity matrix is used to answer the question “which parts of this text thing are alike?”.
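
As a tiny worked example of the idea, a self-similarity matrix can be built in base R with outer():

# Each cell (x, y) is TRUE when the x-th and y-th words match
words <- c("one", "more", "time", "one", "more", "time")
outer(words, words, `==`)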

To get started, let’s set up our process flow for SongSim and take a look at Led Zeppelin’s “Kashmir”.
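
A hedged sketch of this setup, assuming songsim() returns the $songMat matrix and $repetitiveness score described later in this section; the exact call signature is illustrative (depending on the package version, songsim() may instead expect a path to a plain-text lyric file rather than the text itself):

library(songsim)
library(dplyr)

# Pull the lyrics for Kashmir from our one-row-per-song dataset
kashmir_lyrics <- wrk.02_TextAnalysis_02 %>%
  filter(KEYTrackName == "KASHMIR") %>%
  pull(TXTAllTrackLyrics)

# Illustrative call shape: build the self-similarity matrix in "Colorful" mode
kashmir_sim <- songsim(kashmir_lyrics, colorful = TRUE)
kashmir_sim$repetitiveness # mean of the matrix's upper triangle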

   

Most of “Kashmir” has very little lyrical repetition. This can be seen in the songsim matrix plot, where the “dots” form a weak and very sparse pattern. The lack of lyrical repetition is also apparent when reading the lyric sheet. I find this a curious song to analyse. The element which strikes me most is actually the sound of the song; it has a very repetitious component to it. When I listen to Kashmir it is this element of sound that “gets stuck in my head”, and not the low-repetition lyrics! In the documentary “It Might Get Loud”, Jimmy Page describes the guitar sound of Kashmir:

“…it has this riff which is circling around, then a cascade which goes over the top and hits this atonal point. It’s one of those real hypnotic riffs.”

Returning to the songsim matrix plot, as we progress down the black diagonal line to the bottom right of the square we begin to see some pattern “blobs” coloured in blue, purple and pink. These are attributed to the repeated lyrics:

Ooh, yeah-yeah, ooh, yeah-yeah, when I’m down…

Ooh, yeah-yeah, ooh, yeah-yeah, well I’m down, so down

Ooh, my baby, ooh, my baby, let me take you there

Let me take you there. Let me take you there

The “Colorful” mode of the SongSim matrix plot assigns a unique color to each repeated word (words appearing only once are black). When there are several repeated themes, this can make it easier to distinguish them.

The SongSim matrices also come with some handy outputs:

  • $songMat - the matrix structure behind the SongSim plot.
  • $repetitiveness - quantifies how repetitive a song is. It is a simple mean of the upper triangle of the matrix; the larger the value, the more repetitive the song.

We can create songsim matrices for our collection of songs and then compare repetitiveness scores as a method of assessing lyrical repetition (or lyrical density).

 

In this visualisation, each dot represents a song on the artist’s album, plotted by its lyrical repetitiveness score. The black horizontal lines measure the average song lyrical repetitiveness for the artist & album. A measure of 0.0 (0%) means each word in the song is unique, nothing is the same. A measure of 1.0 (100%) means the song is completely repetitive, i.e. a song with only one word used throughout the whole song.

Is it so surprising to see Daft Punk’s Discovery with the highest average lyrical repetitiveness at 35%? Interestingly, it also has the widest range of repetitiveness, between 12% and 64%.

It is curious to observe that U2’s The Joshua Tree has the lowest average at 15% repetition and the smallest range of lyrical repetitiveness, between 10% and 18%. This grouping of songs appears to be very consistent in terms of lyrical repetition.

Observing the similarities, Elton John, Iron Maiden, Killswitch Engage and Led Zeppelin all share an average level of repetitiveness between 20% and 24%.

So what do all of these songs look like as SongSim matrices? We have an opportunity here to create some “data art”, and also observe each of these songs visually on one canvas.

Click here to see the SongSim poster we created for the collection of songs in this analysis.

 

Some notable mentions and observable “base patterns” from the “SongSim Matrix Poster”:

  • Entire patterned square: Daft Punk’s “Harder, Better, Faster, Stronger” is a great example of visualised pop music. The chorus is basically the entire song. Daft Punk’s “One more time” and “Too long” also fit this description and pattern.
  • Small checkerboard-like patterns: Most of Elton John’s songs only have a small number of words within a repetition, see “Salvation” and “Hercules”.
  • Verses and Bridges “gutter” patterns: Most of Iron Maiden’s and U2’s songs appear to follow an intro - verse - chorus - verse - chorus - outro pattern.
  • Broken diagonal patterns: Most of Killswitch Engage’s songs suggest a variation on the chorus or another major repeating section. Most of their songs are structured verse - chorus - verse - chorus, but with some words moved around or swapped out at the very end.
  • Hybrid of “gutter” and “broken diagonal” patterns: Very much Led Zeppelin.
  • No two songs “look” the same. There may be similarities in terms of sections of patterns, but when lyric structure, lyric repetitiveness, and vocabulary are observed together, the songs are truly different from one another.

This has certainly uncovered some very interesting insights about how the songs in the collection are lyrically structured, and may even shed some light on the artists’ preferences in lyrical song writing.

The songsim matrices have done a great job of displaying how diverse each song (and therefore artist & album) is from a lyrical pattern and structure perspective. From the repetitiveness measure we observed strong differences and weak similarities in the levels of repetitiveness across artists and albums. This gives strength to the hypothesis that the collection of music is eclectic!

 

g. Term Frequency - Inverse Document Frequency (TF-IDF)

Let’s now address quantifying how important various lyrics (words) are in a song with respect to an album.

Term Frequency - Inverse Document Frequency (TF-IDF for short) is a measure of:

Term Frequency * Inverse Document Frequency

  • The Term Frequency (TF): the number of times a word is counted in a document

    Multiplied by

  • The Inverse Document Frequency (IDF): the log of the total number of documents divided by the number of documents that contain the word

With the TF and IDF combined, a word’s (or a lyric’s) importance is adjusted for how rarely it is used. The assumption with TF-IDF is that words which appear more frequently in a document (or a song) should be given a higher weighting, unless the word also appears in many documents (or songs).

For this analysis we can use TF-IDF to identify which words are important to each of the albums in the collection, and compare these across albums. We expect the albums to differ in subject/topic, content and sentiment; we therefore expect the frequency of words to differ between albums, and the TF-IDF metric will highlight these differences.

So let’s take a look at word importance through the lens of TF-IDF.
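
A minimal sketch of the calculation using tidytext’s bind_tf_idf(), treating each artist’s album as the “document”:

library(dplyr)
library(tidytext)

# Count words per album, then weight them by TF-IDF
album_tf_idf <- wordToken %>%
  count(CATMusicArtist, word, sort = TRUE) %>%
  bind_tf_idf(word, CATMusicArtist, n) %>%
  arrange(desc(tf_idf))

head(album_tf_idf, 10)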

 

 

 

We see lots of familiar lyric words here which we can trace clearly back to individual songs. There are also specific narrative elements for individual songs. In Daft Punk’s “Harder, Better, Faster, Stronger” we can see almost the entire chorus structure (we do know it’s highly repetitive!). True to the nature of the TF-IDF metric, the words to this song are really only found in this song and are very unlikely to be found elsewhere.

The same goes for “rocket” in Elton John’s Rocket Man, “(two) minutes (to) midnight” in Iron Maiden, “numbered days” in Killswitch Engage, “Kashmir” in Led Zeppelin, and “red hill mining town” in U2. These are all curiously exclusive to the songs they are found in.

There is one small similarity, with “time” appearing for both Daft Punk and Led Zeppelin; in both instances it rates low in significance, at < 0.1, compared to the other words for each artist.

With respect to the hypothesis of an eclectic grouping of music, we can clearly see that the top words of significance for each artist are different and unique when compared across all artists.

This measure is performing as expected and doesn’t tell us anything new. However, this can still be useful information to be aware of prior to designing and training any models and exploring topic modelling.

 

3. Sentiment Analysis & Natural Language Processing (NLP)

Sentiment analysis is a type of text mining which aims to determine the opinion and subjectivity of its content. When applied to song lyrics, the results can be representative of the artist’s attitude as well as their influences.

Natural Language Processing (NLP) is another methodology used in mining text. It tries to decipher the ambiguities in written language through tokenization, clustering, extracting entity and word relationships, and using algorithms to identify themes and quantify subjective information. Earlier sections of this analysis touched on the basics of NLP, via exploring the lexical complexity of the song lyrics (word frequencies, lexical diversity “vocabulary” and lexical density “repetition”).

At this point, there are some important questions to consider before commencing with sentiment analysis and machine learning:

  1. Do we need to perform more data preparation?
  2. Stemming: do we need to remove suffixes from words and reduce down to the common word origin? Is it appropriate?
  3. Lemmatization: do we need to remove inflectional endings of words, and return the base or dictionary form of a word (which is known as the lemma)?
  4. How will we approach negation in sentence constructs? “I am not happy and I don’t like it” can change its sentiment and meaning very quickly when we remove key words like “not” and “don’t”.
  5. An advanced concept in sentiment analysis: is it appropriate to simply replace certain words with more frequently used synonyms (semantically similar peers) and/or hypernyms (common parents)? This would be used to address lexicon word-matching challenges between the words in the text and the lexicon used.
  6. Do we need to construct our own lexicon for the sentiment analysis, or will off-the-shelf packaged lexicons be appropriate to use?

 

a. NRC Emotional Sentiment

There are different methods which can be used for sentiment analysis. For this analysis we will explore our collection of songs using a predefined lexical dictionary (lexicon) named NRC.

NRC is the Word-Emotion Association Lexicon. It assigns words into one or more of ten categories: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

We will use the tidytext package and call in the NRC lexicon by using the get_sentiments() function when creating our data frame.

Let’s observe how the NRC lexicon triages our song lyric words into the emotional sentiments.
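
A minimal sketch of the join between our word tokens and the NRC lexicon (recent tidytext versions fetch the lexicon via the textdata package on first use):

library(dplyr)
library(tidytext)

# Keep only words that appear in the NRC lexicon, tagged with their sentiment
nrc_words <- wordToken %>%
  anti_join(stop_words, by = "word") %>%
  inner_join(get_sentiments("nrc"), by = "word")

# Word counts by emotional sentiment for each artist
nrc_profile <- nrc_words %>%
  count(CATMusicArtist, sentiment)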

 

 

This is a very dense visualisation, and it is difficult to distinguish any clear patterns. However, it is helpful to observe which words are being triaged into which emotional sentiment grouping.

Let’s summarise this down to word counts by emotional sentiment for each artist and then consider this question:

  • How do these albums make us feel when we listen to them?

 

 

This visualisation paints an “emotional profile” picture for each artist and album. This illustrates for us that no two artists have the exact same emotional profile.

I have always suspected Daft Punk’s Discovery to be jovial and energetic. I am glad to see anticipation, joy and trust rate highly for this album. This is a nice validation of my own personal opinion.

Equally for Killswitch Engage’s Alive or Just Breathing: this album tells stories with much sadness, anger and fear. It is not very surprising these emotions rate highly for this album.

There are some faint similarities of emotional profile patterns between:

  • Daft Punk and Led Zeppelin - Anticipation, Joy and Trust rate the highest with these artists.
  • Killswitch Engage and Iron Maiden - Fear, Sadness and Anger rate the highest with these artists.

From an emotional sentiment perspective this collection of artists and albums are unique, and diverse enough to label as an eclectic grouping.

 

b. N-grams, bi-grams & tri-grams

Earlier in this analysis we explored single word (or unigram) frequency counts. This section is dedicated to exploring what precedes and follows the most common words we have identified in our collection of songs.

There are a few ways we can approach n-gram analysis; two options are described below:

  1. Using the tidytext package and the unnest_tokens() function to perform a simple and quick “blind” n-gram analysis. For bi-gram and tri-gram analysis, this will recount n-gram parts and potentially skew the results.
  2. Using the tm and RWeka packages to perform a more robust, intelligent n-gram analysis. This is a more complex approach than option 1. It uses a VCorpus and a DocumentTermMatrix. The ID attribute needs to be retained to re-identify the grouping variables, i.e. which n-gram belongs to which song, album and artist, so we can compare the results.

I will select option 2 for the n-gram analysis, as I’d like to inspect at greater resolution the subject matter the n-grams will reveal. A sketch of this approach follows.
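
A hedged sketch of option 2 (tm + RWeka), keeping one corpus document per song so each n-gram can be traced back to its artist and album by position:

library(tm)
library(RWeka)

# One corpus document per song (rows map back to songs in our dataset by position)
corpus <- VCorpus(VectorSource(wrk.02_TextAnalysis_02$TXTAllTrackLyrics))

# Tokenizer producing bi-grams and tri-grams
BiTriTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3))

# Document-term matrix of n-gram counts
ngram_dtm <- DocumentTermMatrix(corpus, control = list(tokenize = BiTriTokenizer))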

 

 

 

The lasting subject matter and “sentiments” from observing these bi-grams and tri-grams:

  • Daft Punk - The highest n-gram frequency count of all the artists. The lyrical repetitiveness from the song “One More Time” is very obvious, the strongest of all the artists. The bi-gram and tri-gram structures point toward “being in the moment” and self-enjoyment. There is a strong subject theme pointing toward the “music” and “dancing” along with it.
  • Elton John - The lowest n-gram frequency of all the artists; this could be linked back to the observations in the songsim matrices, as Elton John’s songs don’t have much lyrical repetition in them. The bi-grams and tri-grams here weakly highlight a common subject thread of thankfulness (“thank”), chill (“mellow”) and a desirable “pretty” girl.
  • Iron Maiden - Strong subject themes of returning to a communal “village” and anticipating “midnight”; there’s a sense of survival urgency in “run” and escapism in “fly”.
  • Killswitch Engage - Strong subject themes of observing the passing and impacts of “time”, and emotional pain resulting in “tears”.
  • Led Zeppelin - A low degree of lyrical repetitiveness circulating around “love”, “light” and asking “mama” what’s up.
  • U2 - The songwriter’s observations of life journeys, of people (“haven’t found”, “bullet [the] blue”, “give away”) and of foreign places (“streets”). Some unique Bono-esque chorus constructions emerge with “la di day”.

Just like the NRC emotional sentiment, the bi-grams and tri-grams also begin to paint a picture that each artist has its own “n-gram frequency profile”. On face value it could be argued that Daft Punk and Killswitch Engage share the same n-gram of “time”, and that this makes these artists similar. However, we now know from the NRC emotional sentiment that these artists are at very different ends of the emotional profile spectrum: Daft Punk is full of joy, while Killswitch Engage contains a great deal of sadness.

In terms of subject matter similarity, it is curious to note:

  • Elton John has written lyrics about talking to a woman named “Amy”, and Led Zeppelin has done something very similar by questioning an anonymous female figure named “mama”.
  • Iron Maiden has lyrics which describe a location known as the “village”, while U2 have lyrics which describe a location which has “streets”.

How does all this new information rate against the hypothesis that this collection of music is eclectic? This section has begun to uncover the hidden subject matter and potential (and subjective) meaning of the albums and songs in the collection. While the n-gram frequency counts vary across the artists, the similarities in subject matter, via similar analogies, give rise to a counter argument.

 

c. Bi-gram network analysis

From our bi-gram constructs we can create a network graph using the ggraph and igraph packages. We can arrange words into connected nodes, with selected “centering words” at the centres.

From our earlier inspection of unigram word frequencies, the following high-frequency “centering words” have been selected:

"hey", "feel", "gonna", "people", "yeah", "love", "light", "time", "life"

Our dataset will be grouped up, to represent all songs, for all artists and albums.
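
A minimal sketch of the network graph, assuming a bigram count data frame (here called bigram_counts, with columns word1, word2, n) derived from the DocumentTermMatrix above; the centering-word filter and layout are illustrative:

library(dplyr)
library(igraph)
library(ggraph)

centering_words <- c("hey", "feel", "gonna", "people", "yeah",
                     "love", "light", "time", "life")

# Keep bigrams touching a centering word, then build the graph
set.seed(2018) # fixed seed so the layout is reproducible
bigram_graph <- bigram_counts %>%
  filter(word1 %in% centering_words | word2 %in% centering_words) %>%
  graph_from_data_frame()

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)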

 

This is just a simple network graph to demonstrate the various methods we can use to visualise our song lyric data. It is cognitively heavy to interpret, and it shows all n-grams without illustrating their artist/album/song origins.

It does, however, show us some lyric construction pathways and how subjects within a song can be formed. For testing the hypothesis of an eclectic grouping of music, this offers an alternative perspective: across all songs some lyrical pathways are shared, which inspires the idea that some pathways are more common than others. This visualisation does not show us the strength or level of commonality of these shared pathways; it only shows us that similarities exist.

Another practical application for this visualisation: for any songwriters out there, perhaps this could be useful in brainstorming or identifying ideas for lyric constructs?

 

d. Pair-wise comparisons

Which songs are similar to each other in lyrical content? We can explore this by finding the pairwise correlation of lyric (word) frequencies within each song, using the pairwise_cor() function from the widyr package.

The assumption here is, the higher the correlation factor the higher the similarity between songs.

We will remove stop words for this analysis, to allow us to observe more meaningful results.
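
A minimal sketch of the correlation step with widyr (pairwise_cor() computes the phi coefficient between songs based on the words they share):

library(dplyr)
library(tidytext)
library(widyr)

# Correlate songs by the words they share, after removing stop words
song_cors <- wordToken %>%
  anti_join(stop_words, by = "word") %>%
  pairwise_cor(CATTrackName, word, sort = TRUE)

# Keep only the stronger relationships for plotting
filter(song_cors, correlation > 0.4)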

 

 

Well, this is a revelation! We created a pairwise correlation plot using the igraph package, using the graph_from_data_frame() function with our input dataset. We filtered our dataset to show only correlations stronger than 0.4.

From this it appears that only Elton John’s Mona Lisas and Mad Hatters and Led Zeppelin’s Down by the Seaside have a connection with a correlation greater than 0.4. Does this mean they are similar songs? A correlation of 0.4 isn’t very strong; more evidence would be needed to support this idea. When observing the word counts and repetitiveness, these two songs are distinctly different. Checking the songsim matrix plots also confirms this, although one could say “Mona Lisas and Mad Hatters” has a “weaker”, more “sparse” songsim lyrical pattern plot compared to “Down by the Seaside”. Maybe there is some similarity here, but it is very weak.

It is interesting that no other songs had a high enough correlation in the pairwise calculation to be considered “significant”. With regard to the hypothesis that this collection of music is eclectic, this method highlights far more differences than similarities.

 

e. Album similarity

Using the qdap package we have access to a function called trans_venn(). Since we have just observed pairwise correlations between songs, let’s take a look at the similarity between albums, visualised as a venn diagram. A sketch follows.
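
A hedged sketch of the call (qdap’s trans_venn() takes a text variable and a grouping variable; styling arguments are omitted):

library(qdap)

# Venn diagram of shared terms between the artists' albums
with(wrk.02_TextAnalysis_02,
     trans_venn(TXTAllTrackLyrics, CATMusicArtist))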

This visualisation could be interpreted as more of a novelty than providing any solid metric to assess the hypothesis of an eclectic music collection. In the absence of metrics, it does illustrate that similarity overlap between artists exists, even if that overlap is fairly weak.

In deriving any meaning from this venn diagram, it appears:

  • U2’s The Joshua Tree is a versatile linking centroid in this venn diagram. This evokes a more philosophical question: can the other albums realistically be “linked” together via The Joshua Tree?
  • Iron Maiden and Killswitch Engage share the most visual overlap, closely followed by U2 and Elton John, then U2 and Led Zeppelin.
  • Daft Punk is only weakly grazing the border of U2 and Killswitch Engage. Does this mean they are the most dissimilar of all the artists?

 

f. Song dissimilarity (agreement between lyrics)

Just as we calculated album similarity in the previous section, we can do something similar for songs. Still using the qdap package, we can use the Dissimilarity() function, which uses the distance function to calculate dissimilarity statistics by grouping variables.

The Dissimilarity() function returns a matrix of dissimilarity values, i.e. the agreement between texts, or songs. We will plot this matrix as a dendrogram and identify some potential clusters.
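A minimal sketch, with the same assumed column names as before, and assuming the returned Dissimilarity object can be coerced for clustering with as.dist():

library(qdap)

# Dissimilarity (agreement) between songs, grouped by song title
song_diss <- with(MusicLyrics, Dissimilarity(text, CATMusicSong))

# Hierarchical clustering plotted as a dendrogram, with the borders
# of 14 clusters drawn over the top
hc <- hclust(as.dist(song_diss))
plot(hc)
rect.hclust(hc, k = 14, border = rainbow(14))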

 

This dendrogram offers an alternative view of how similar and different the individual songs in the collection are. We configured the dendrogram to identify 14 clusters and to illustrate the borders of these clusters with the coloured rectangles.

It is curious that there is one large cluster in the middle, illustrated with the green rectangle. It is difficult to discern why some songs fall into certain clusters:

  • The most repetitive and least lexically diverse songs, Daft Punk’s “One More Time” and “Harder, Better, Faster, Stronger”, are in the same cluster as one of the least repetitive, more lexically diverse songs, Led Zeppelin’s “Kashmir”, along with a wide variety of songs from all other artists.

This visualisation paints a picture that there are both similarities and differences in song lyrics at the individual song level. No single artist dominates any of the dendrogram’s defined clusters, and we already know from previous metrics that all of the songs differ in word counts, vocabulary, lyric pattern structure and repetitiveness.

 

4. Unsupervised Machine Learning

 

a. Topic modelling: Structural Topic Model (STM)

Topic modelling is a method for unsupervised classification of documents, similar to clustering on numeric data, which finds natural groups of items even when we’re not sure what we’re looking for or what our target is.

Some questions for guiding this section are:

  • Can we identify any meaningful groups or themes in our collection of songs?
  • Which words (or lyrics) contribute to which topics?
  • Which topics contribute to which albums?

For this piece, we will use the Structural Topic Model (STM) and make use of the quanteda and stm packages.

First step is to prepare our input dataset:

 

# Load the libraries
library(quanteda)
library(stm)
library(tidytext) # unnest_tokens(), cast_dfm(), tidy()
library(dplyr)    # %>%, anti_join(), count()

# Using our LineToken dataset from earlier, un-nest by single words
# Then we will remove stop words and filter out any more undesirable words
tidy_MusicLyrics <- lineToken %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) # %>%
  #filter(word != "")

# Check the output from tidy_MusicLyrics and identify any further words to filter out above; rerun the step above if needed.
tidy_MusicLyrics %>%
  count(word, sort = TRUE)
## # A tibble: 90 x 2
##    word         n
##    <chr>    <int>
##  1 time        49
##  2 faster      32
##  3 harder      32
##  4 stronger    32
##  5 ancient     25
##  6 mariner     25
##  7 rime        25
##  8 dying       23
##  9 foot        21
## 10 trampled    21
## # ... with 80 more rows
# Create a Quanteda DFM object, ready to use in the STM
MusicLyrics_dfm <- tidy_MusicLyrics %>%
  count(CATMusicArtist, word, sort = TRUE) %>%
  cast_dfm(CATMusicArtist, word, n)

One of the drawbacks of STM is the need to select the number of topics “K” to train the model with. Fortunately, the stm package comes with lots of functions and support for choosing an appropriate number of topics for the model.
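For example (a sketch only; the candidate K grid here is purely illustrative), searchK() refits the model over a range of K values and reports diagnostics such as held-out likelihood, residuals and semantic coherence:

# Convert the dfm to the stm input format, then search over candidate K values
stm_input <- quanteda::convert(MusicLyrics_dfm, to = "stm")
K_search <- searchK(stm_input$documents, stm_input$vocab,
                    K = c(4, 6, 8, 10), init.type = "Spectral")
plot(K_search)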

With our DFM object prepared and ready, we can proceed to running the STM model with parameters init.type = "Spectral" and K = 0. With K = 0, the algorithm will choose a number of topics itself during training. From there we can further analyse and refine what K should be set to.

The Spectral initialisation uses a decomposition of the VxV word co-occurrence matrix to identify “anchor” words: words that belong to only one topic and therefore identify that topic. The topic loadings of other words are then calculated based on these anchor words. This process is deterministic, so the same values are always reached from the same VxV matrix. The catch is that this process assumes the VxV matrix is generated from a population of an infinite number of documents (or in our case, songs), so it does not behave well with infrequent/rare words. The solution is to remove infrequent words, although we still need to be careful in situations where we don’t have a lot of documents; in our case we have 59 songs across 6 albums.

## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Remove Custom Stopwords...
## Removing numbers... 
## Stemming... 
## Creating Output...
# Train the model with K = 0 so the algorithm selects the number of topics. Let's observe how many topics emerge:
STM_topic_model <- stm(MusicLyrics_dfm, K = 0, seed=12345, verbose = FALSE, init.type = "Spectral")

Training the model with K = 0 identifies 32 topics. We can use labelTopics() to describe the common words associated with each of these 32 topics.

labelTopics(STM_topic_model)
## Topic 1 Top Words:
##       Highest Prob: balance, temple, element, breathing, darkness, fixation, life 
##       FREX: balance, trip, bullet, sky, streets, mining, tree 
##       Lift: balance, temple, element, breathing, darkness, fixation, life 
##       Score: balance, temple, element, breathing, darkness, fixation, life 
## Topic 2 Top Words:
##       Highest Prob: barely, temple, element, breathing, darkness, fixation, life 
##       FREX: barely, trip, bullet, sky, streets, mining, tree 
##       Lift: barely, temple, element, breathing, darkness, fixation, life 
##       Score: barely, temple, element, breathing, darkness, fixation, life 
## Topic 3 Top Words:
##       Highest Prob: black, dying, trampled, kashmir, time, rover, flight 
##       FREX: black, dying, trampled, kashmir, time, rover, flight 
##       Lift: black, kashmir, trampled, dying, rover, flight, woman 
##       Score: black, dying, trampled, kashmir, rover, flight, woman 
## Topic 4 Top Words:
##       Highest Prob: blue, hill, mining, town, red, bullet, trip 
##       FREX: blue, hill, mining, town, red, bullet, trip 
##       Lift: blue, stand, found, tree, bullet, trip, sky 
##       Score: blue, hill, mining, town, red, bullet, trip 
## Topic 5 Top Words:
##       Highest Prob: broken, temple, element, breathing, darkness, fixation, life 
##       FREX: broken, trip, bullet, sky, streets, mining, tree 
##       Lift: broken, temple, element, breathing, darkness, fixation, life 
##       Score: broken, temple, element, breathing, darkness, fixation, life 
## Topic 6 Top Words:
##       Highest Prob: country, hill, town, mining, red, bullet, sky 
##       FREX: country, bullet, sky, streets, trip, tree, found 
##       Lift: country, hill, town, mining, red, bullet, sky 
##       Score: country, hill, town, mining, red, bullet, sky 
## Topic 7 Top Words:
##       Highest Prob: custard, dying, trampled, kashmir, rover, flight, woman 
##       FREX: custard, dying, trampled, kashmir, rover, time, flight 
##       Lift: custard, kashmir, trampled, dying, rover, flight, woman 
##       Score: custard, dying, trampled, kashmir, rover, flight, woman 
## Topic 8 Top Words:
##       Highest Prob: temple, days, element, breathing, darkness, fixation, life 
##       FREX: days, temple, element, breathing, darkness, fixation, life 
##       Lift: days, breathing, darkness, fixation, life, lifeless, revolution 
##       Score: days, temple, element, breathing, darkness, fixation, life 
## Topic 9 Top Words:
##       Highest Prob: disappeared, hill, mining, red, town, bullet, trip 
##       FREX: disappeared, hill, mining, red, town, bullet, trip 
##       Lift: disappeared, hill, mining, red, town, bullet, trip 
##       Score: disappeared, hill, mining, red, town, bullet, trip 
## Topic 10 Top Words:
##       Highest Prob: dramas, it’s, rocket, mona, salvation, amy, cat 
##       FREX: dramas, it’s, rocket, mona, salvation, amy, cat 
##       Lift: dramas, it’s, rocket, mona, salvation, amy, cat 
##       Score: dramas, it’s, rocket, mona, salvation, amy, cat 
## Topic 11 Top Words:
##       Highest Prob: exit, hill, mining, red, town, sky, streets 
##       FREX: exit, hill, mining, red, town, sky, streets 
##       Lift: exit, stand, found, tree, sky, streets, trip 
##       Score: exit, hill, mining, red, town, sky, streets 
## Topic 12 Top Words:
##       Highest Prob: foot, time, dying, trampled, kashmir, rover, flight 
##       FREX: foot, time, dying, trampled, kashmir, rover, flight 
##       Lift: foot, time, pie, boogie, stu, song, wanton 
##       Score: foot, time, dying, trampled, kashmir, rover, flight 
## Topic 13 Top Words:
##       Highest Prob: god's, hill, red, town, mining, bullet, trip 
##       FREX: god's, hill, red, town, mining, bullet, trip 
##       Lift: god's, stand, found, tree, bullet, trip, sky 
##       Score: god's, hill, red, town, mining, bullet, trip 
## Topic 14 Top Words:
##       Highest Prob: infra, temple, element, breathing, darkness, fixation, life 
##       FREX: infra, trip, bullet, sky, streets, mining, tree 
##       Lift: infra, temple, element, breathing, darkness, fixation, life 
##       Score: infra, temple, element, breathing, darkness, fixation, life 
## Topic 15 Top Words:
##       Highest Prob: inside, temple, element, breathing, darkness, fixation, life 
##       FREX: inside, trip, bullet, sky, streets, mining, tree 
##       Lift: inside, temple, element, breathing, darkness, fixation, life 
##       Score: inside, temple, element, breathing, darkness, fixation, life 
## Topic 16 Top Words:
##       Highest Prob: faster, harder, stronger, time, digital, kill, love 
##       FREX: faster, harder, stronger, digital, kill, love, time 
##       Lift: digital, kill, love, faster, harder, stronger, time 
##       Score: faster, harder, stronger, kill, time, digital, love 
## Topic 17 Top Words:
##       Highest Prob: light, dying, trampled, kashmir, time, rover, flight 
##       FREX: light, dying, trampled, kashmir, time, rover, flight 
##       Lift: light, kashmir, trampled, dying, rover, flight, woman 
##       Score: light, dying, trampled, kashmir, rover, flight, woman 
## Topic 18 Top Words:
##       Highest Prob: mad, rocket, it’s, amy, cat, hatters, hercules 
##       FREX: mad, rocket, it’s, amy, cat, hatters, hercules 
##       Lift: mad, amy, cat, hatters, hercules, honky, lisas 
##       Score: mad, rocket, it’s, amy, cat, hatters, hercules 
## Topic 19 Top Words:
##       Highest Prob: mellow, time, rocket, it’s, mona, salvation, amy 
##       FREX: mellow, time, mona, salvation, amy, cat, hatters 
##       Lift: mellow, time, mona, salvation, amy, cat, hatters 
##       Score: mellow, time, rocket, it’s, mona, salvation, amy 
## Topic 20 Top Words:
##       Highest Prob: mothers, hill, mining, town, red, bullet, sky 
##       FREX: mothers, hill, mining, town, red, bullet, sky 
##       Lift: mothers, hill, mining, town, red, bullet, sky 
##       Score: mothers, hill, mining, town, red, bullet, sky 
## Topic 21 Top Words:
##       Highest Prob: night, dying, trampled, kashmir, time, rover, flight 
##       FREX: night, dying, trampled, kashmir, time, rover, flight 
##       Lift: night, kashmir, trampled, dying, rover, flight, woman 
##       Score: night, dying, trampled, kashmir, rover, flight, woman 
## Topic 22 Top Words:
##       Highest Prob: temple, numbered, element, breathing, darkness, fixation, life 
##       FREX: numbered, temple, element, breathing, darkness, fixation, life 
##       Lift: numbered, breathing, darkness, fixation, life, lifeless, revolution 
##       Score: numbered, temple, element, breathing, darkness, fixation, life 
## Topic 23 Top Words:
##       Highest Prob: rise, temple, element, breathing, darkness, fixation, life 
##       FREX: rise, trip, sky, streets, bullet, mining, tree 
##       Lift: rise, temple, element, breathing, darkness, fixation, life 
##       Score: rise, temple, element, breathing, darkness, fixation, life 
## Topic 24 Top Words:
##       Highest Prob: running, hill, town, red, mining, sky, streets 
##       FREX: running, hill, town, red, mining, sky, streets 
##       Lift: running, hill, town, red, mining, sky, streets 
##       Score: running, hill, town, red, mining, sky, streets 
## Topic 25 Top Words:
##       Highest Prob: seaside, time, dying, trampled, kashmir, rover, flight 
##       FREX: seaside, dying, time, trampled, kashmir, rover, flight 
##       Lift: seaside, pie, boogie, stu, wanton, song, woman 
##       Score: seaside, dying, trampled, time, kashmir, rover, flight 
## Topic 26 Top Words:
##       Highest Prob: serenade, temple, element, breathing, darkness, fixation, life 
##       FREX: serenade, trip, bullet, sky, streets, mining, tree 
##       Lift: serenade, temple, element, breathing, darkness, fixation, life 
##       Score: serenade, temple, element, breathing, darkness, fixation, life 
## Topic 27 Top Words:
##       Highest Prob: sick, dying, trampled, time, kashmir, rover, flight 
##       FREX: sick, dying, trampled, time, kashmir, rover, flight 
##       Lift: sick, pie, boogie, stu, song, wanton, woman 
##       Score: sick, dying, trampled, kashmir, rover, time, flight 
## Topic 28 Top Words:
##       Highest Prob: slave, it’s, rocket, mona, salvation, amy, cat 
##       FREX: slave, it’s, rocket, mona, salvation, amy, cat 
##       Lift: slave, it’s, rocket, mona, salvation, amy, cat 
##       Score: slave, it’s, rocket, mona, salvation, amy, cat 
## Topic 29 Top Words:
##       Highest Prob: susie, it’s, rocket, mona, salvation, amy, cat 
##       FREX: susie, it’s, rocket, mona, salvation, amy, cat 
##       Lift: susie, it’s, rocket, mona, salvation, amy, cat 
##       Score: susie, it’s, rocket, mona, salvation, amy, cat 
## Topic 30 Top Words:
##       Highest Prob: ten, dying, trampled, kashmir, time, rover, flight 
##       FREX: ten, dying, trampled, kashmir, time, rover, flight 
##       Lift: ten, kashmir, trampled, dying, rover, flight, woman 
##       Score: ten, dying, trampled, kashmir, rover, flight, woman 
## Topic 31 Top Words:
##       Highest Prob: ancient, mariner, rime, village, midnight, minutes, aces 
##       FREX: ancient, mariner, rime, village, midnight, minutes, aces 
##       Lift: aces, village, ancient, blade, duellists, flash, mariner 
##       Score: ancient, mariner, rime, village, midnight, minutes, aces 
## Topic 32 Top Words:
##       Highest Prob: wires, hill, mining, town, red, trip, sky 
##       FREX: wires, hill, mining, town, red, trip, sky 
##       Lift: wires, stand, found, tree, trip, sky, streets 
##       Score: wires, hill, mining, town, red, trip, sky

We can now use plot.STM() and observe how common each topic is:

stm::plot.STM(STM_topic_model, type = "summary", xlim = c(0, 0.1))

Exploring semantic coherence and exclusivity for each topic using the function stm::topicQuality():

  • Semantic coherence is the empirical co-occurrence of words with high probability under a given topic. If a topic has the word “apple”, what is the probability the word “banana” will also appear? For the topic, the coherence is the sum of the logs of these probabilities.
  • Exclusivity measures how unlikely a topic’s top words are to appear among the top words of other topics.
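The call producing the scores below looks like the following (a sketch; topicQuality() also draws a coherence vs exclusivity plot, and passing the dfm directly is assumed to work here as it does for stm(), otherwise convert it with quanteda::convert(..., to = "stm")):

topicQuality(model = STM_topic_model, documents = MusicLyrics_dfm)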
##  [1] -274.6426 -280.0723 -284.9488 -309.0505 -280.0723 -310.3935 -277.7341
##  [8] -280.0723 -310.3935 -303.7534 -310.3935 -277.7341 -310.3935 -280.0723
## [15] -280.0723 -319.9913 -277.7341 -304.2346 -308.3448 -310.3935 -277.7341
## [22] -280.0723 -280.0723 -310.3935 -277.7341 -280.0723 -277.7341 -306.6729
## [29] -306.6729 -277.7341 -303.4867 -310.3935
##  [1] 9.500000 9.500000 9.474528 9.353203 9.500000 9.500000 9.475682
##  [8] 9.496062 9.500000 9.472222 9.353203 9.306970 9.353203 9.500000
## [15] 9.500000 9.380764 9.474528 9.474647 9.420263 9.500000 9.474528
## [22] 9.496062 9.500000 9.500000 9.281457 9.500000 9.281718 9.472209
## [29] 9.472209 9.474528 9.414610 9.353203

From our initial 32 identified topics, we can begin to see some groups of topics emerging. Perhaps we can refine these 32 topics down to a smaller number; 6 looks like a more reasonable number of topics.

Let’s train our model on 6 topics, setting K = 6.

STM_topic_model <- stm(MusicLyrics_dfm, K = 6, seed=12345, verbose = FALSE, init.type = "Spectral")

 

We can now observe and visualise the output from our trained model across the 6 topics.

Using the beta matrix we can see which words contribute the most to each topic.

 

td_BetaMatrix <- tidy(STM_topic_model)

td_BetaMatrix %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  mutate(topic = paste0("Topic ", topic),
         term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = as.factor(topic))) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  coord_flip() +
  #scale_x_reordered() +
  labs(x = NULL, y = expression(beta),
       title = "Beta Matrix: Highest word probabilities for each topic",
       subtitle = "Different words are associated with different topics")

 

Our STM model also gives us another matrix we can observe: the gamma matrix. This is the probability that each “document” is generated from each topic.

Questions we can explore here are:

  • How much did this document contribute to the topic?
  • How likely is this document to belong to this topic?

 

td_GammaMatrix <- tidy(STM_topic_model, matrix = "gamma",
                 document_names = rownames(MusicLyrics_dfm))

ggplot(td_GammaMatrix, aes(gamma, fill = as.factor(topic))) +
  geom_histogram(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, ncol = 3) +
  labs(title = "Gamma Matrix: Distribution of document probabilities for each topic",
       subtitle = "Each document (artist & album) is associated with a single topic",
       y = "Number of documents", x = expression(gamma))

 

From these results it seems each document (artist & album) is strongly associated with a single topic. This is not a given; topic modelling does not always work out this way. It is worth remembering that the model was built with only a small number of “documents” (6 artists & albums, 59 songs) and a small total number of “words” to work with. This was still a good exercise in working with topic modelling and interpreting beta and gamma results.
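A quick, sketched check of that association: dplyr’s slice_max() keeps the highest-gamma row for each document (artist & album):

library(dplyr)

# The single most likely topic for each document
td_GammaMatrix %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup()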

Observing the results from the beta matrix, the topics, and music artist influences:

  • Topic 1 - Dominantly Daft Punk themed, with the key repetition words “harder”, “faster”, “stronger”.
  • Topic 2 - Strongly Iron Maiden themed, with all words, including “rime” and “mariner”, distinctly belonging to Iron Maiden.
  • Topic 3 - Weakly describes Killswitch Engage with “numbered”, “temple” and “vide”. We also know from other metrics that Killswitch Engage has low word and vocabulary counts.
  • Topic 4 - Mostly Led Zeppelin, with words like “dying” and “kashmir”, plus a hint of Killswitch Engage and the appearance of “time” - we know about this word’s placement from the ngram analysis and the emotional profile connotation of Killswitch Engage’s album. It’s not surprising to see “time” (with a sadness connotation) mingled with “dying”, “sick” and “night” within the same topic.
  • Topic 5 - Mostly Elton John. I suspect the appearance of “time” here originates from the song “Rocket Man” and its chorus structure.
  • Topic 6 - Distinctively U2; who else would sing about a “red hill mining town”?

Acknowledging the technical challenge of working with a small amount of data, and taking the results as they are, the model topics reveal very minimal overlap and similarity between artists. The artists and their album/song content group fairly cleanly into their own topics. This lends more confidence to the hypothesis that the collection of music is eclectic.

 

5. Findings & Learnings

 

a. Findings

Returning to our guiding questions:

  1. The hypothesis. Can we prove this collection of music is truly an eclectic playlist? A variety of text mining, NLP and machine learning techniques were utilised in this analysis, and their results were objectively assessed. Here’s the short version of the results:
    1. Album word length – the collection is eclectic. Each artist and album has differing word durations and numbers of songs per album available for analysis.
    2. Word Clouds & Counts – the collection is eclectic. Each artist demonstrates different word selection.
    3. Lexical Diversity – the collection is similar between four of the six artists. Vocabulary size is demonstrated.
    4. SongSim matrix plots – the collection is eclectic. Each song has a different lyrical pattern “signature”, though albums of songs tend to share similar base patterns and “flow”.
    5. SongSim repetition scores – the collection is similar between four of the six artists. The average repetition of the artists & albums is demonstrated.
    6. TF-IDF scores – the collection is eclectic; the words of significance differ between artists.
    7. NRC Emotional Sentiment – the collection is eclectic. Each artist has a distinct and different “emotional profile”.
    8. Bi-grams and Tri-grams – the collection is similar. Similarities are observed via perceived subject matter and subject analogies.
    9. Bi-gram network – the collection is similar. Lyrical construct pathways can potentially be shared across artists, albums and songs.
    10. Pair-wise correlation – the collection is eclectic. The majority of songs did not yield a high enough correlation to be labelled significant.
    11. Venn Diagram of album similarity – the collection is eclectic. Notable mention to U2’s The Joshua Tree for being a potential similarity linkage to all other artists in the collection.
    12. Dendrogram of song similarity – the collection is eclectic. No strong themes emerge from the defined clusters.
    13. Structural Topic Model (STM) – the collection is eclectic. Six topics were defined, with the artists & albums grouping neatly under each topic.

 

Summarising these results by majority vote: nine of the thirteen (9/13) methods used in this analysis pointed toward this selected collection of music being eclectic. Of the four methods which pointed toward similarity, two relied on calculated “average” metrics and two generated an awareness that the perceived subject matter could be synonymously linked between artists, i.e. the listener could derive lyrical similarities and subject analogies between artists.

 

  2. Can we identify what makes a song unique, special, similar to and different from another song? The SongSim matrix plots have been particularly useful in answering this question. The plots not only show the lyrical pattern “signature” for each song but also make us aware of the basic lyrical flow and structure of a song.

 

  3. Do the albums actually tell a story or have a meaning? We could explore this question further in a subsequent analysis, as telling a story involves establishing a timeline. Observing how ngrams and emotional sentiment change as an album’s songs are listened to/experienced in their listed order (track 1, 2, 3, …), and even at the song level as a song progresses from beginning to end, could yield some interesting results on storytelling. As for identifying a meaning for each album, this is subjective to the listener (or, in this case, the reader of the lyrics); we could begin to extract a basic meaning by observing the ngrams and the overall NRC “emotional profile” sentiments.

 

  4. Is it possible to obtain an understanding of a music album through only observing its lyrics, having never actually listened to it? Having never listened to Iron Maiden’s Powerslave album, this analysis has provided me with a much clearer understanding of the subjects, derived meanings, and sentiments this album contains. That’s pretty cool! The lyrics read like it’s a very intense album, and “Rime of the Ancient Mariner” in particular looks like a very epic song! I’m looking forward to listening to the album and validating this thought.

 

b. Learnings, gotchas, traps for young players

Two key lessons I picked up while preparing this analysis:

  1. Text Mining, NLP, Sentiment Analysis and Topic Modelling are all large subjects in their own right. One can easily go into great detail and depth on each of them; this analysis only scratched the surface. The downside of going deeper is that one must factor in more research, analysis and development time, while continually circling back to the big questions: “what is the purpose of this analysis?” and “what problem are we trying to solve, and why does it matter?”.

  2. Consider, very carefully, the technical & narrative flow of the analysis for the reader’s (and developer’s) benefit. This analysis intended to start with the simple data clean up and descriptive stats then progress to more intermediate and advanced techniques, building with the “lego blocks” of data and constructs along the way.

 

c. Where to next & part 2

There were a few elements in this analysis we could unpack further, such as:

  • Album and song sentiment progression, from the first to the last listed track and verse by verse (by “sentence”), respectively.
  • What would happen if we applied word stemming, lemmatization, synonyms and hypernyms?
  • Including all albums released by the artists selected for this analysis. How much difference would adding more “documents” make?
  • Impacts of “negation” words, for example: “I’m not happy and I don’t like it” will have different sentiment and meaning depending on which words are eliminated. Same goes for lyrics.
  • Could the songsim matrix plots be used to detect lyrical plagiarism between artists?

Given this analysis is all about music, we covered only one aspect: the lyrical constructs, approached through text analysis. To make this a more holistic analysis, we will consider the audio components of music in a subsequent analysis.

 

I’d love to receive your thoughts, queries and feedback on this analysis. Please feel free to reach out to Bree at bree_mclennan@outlook.com.

 

Thanks for connecting!

Bree.

 

6. References

Many hours have been spent researching approaches for designing and writing this analysis. Here are some items I’d like to share:

Inspirations for this piece

  • [1] TV series “It might get loud” with The Edge (U2), Jimmy Page (Led Zeppelin), Jack White (The White Stripes)
  • [2] Game: Audiosurf (visualising sound, beat detection algorithms and digital signal processing)

Text mining, NLP and Machine Learning with Music Lyrics & Text Scripts

  • [3] Prince analysis - NLP
  • [4] Prince analysis - Sentiments
  • [5] The Ramones
  • [6] Rick and Morty
  • [7] 50 Years of Pop Music Lyrics
  • [8] Radiohead & Using the Spotify API
  • [9] Alternative sentiment analysis: Using the “gloom” index to find depressing songs
  • [10] Alternative visualisations: Visualising songs as matrix structures and finding repetitions
  • [11] Topic Modeling of Sherlock Holmes Stories
  • [12] Topic modeling using STM