Music has accompanied people for millennia. What began with naturally occurring sounds and rhythms in prehistory is now a multi-billion-dollar industry and an integral part of everyday life. Through song lyrics, people passed knowledge from generation to generation, and they both laughed and cried to them. Lyrics present us with an artist's perspective and carry with them the mood of their times. Recent advancements in the field of Natural Language Processing enable us to analyze lyrics on an unprecedented scale with tremendous efficiency. But because lyrics are so often structured differently from prose, they require caution with assumptions and a carefully discriminating choice of analytic techniques.
In our project, we have attempted to analyze the unique characteristics of song lyrics. First, we examined some descriptive statistics about our database; then we took a closer look at the development of lyrics through the decades and their differences across musical genres, and applied sentiment analysis. Next, we examined the importance of words, both those that are timeless and those that are important in their rarity. Finally, we performed topic modeling using Latent Dirichlet Allocation (LDA).
Disclaimer: We decided not to censor any inappropriate content, some of which may be considered objectionable or offensive by some readers. We take no responsibility for the words contained in the analyzed songs.
We start by loading the packages required for the entire project, and next, we load our database.
library(cld3)          # language detection
library(data.table)    # fast data tables
library(devtools)      # installing packages from GitHub
library(dplyr)         # data manipulation
library(ggplot2)       # plotting
library(ggraph)        # graph/network visualization
library(httr)          # HTTP requests
library(igraph)        # graph objects
library(jsonlite)      # JSON parsing
library(magrittr)      # pipe operators
library(RColorBrewer)  # color palettes
library(reshape2)      # reshaping data (acast)
library(rvest)         # web scraping
library(tidyverse)     # core tidy-data tools
library(tidytext)      # text mining and sentiment lexicons
library(topicmodels)   # LDA topic modeling
library(widyr)         # pairwise word correlations
library(wordcloud)     # comparison word clouds
library(wordcloud2)    # interactive word clouds
music <- read.csv("Data/original_cleaned_lyrics.csv", stringsAsFactors = F)
Let us first see the general structure of the data.
glimpse(music)
## Rows: 227,449
## Columns: 7
## $ X <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ index <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ song <chr> "ego-remix", "then-tell-me", "honesty", "you-are-my-rock", "...
## $ year <int> 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, ...
## $ artist <chr> "beyonce-knowles", "beyonce-knowles", "beyonce-knowles", "be...
## $ genre <chr> "Pop", "Pop", "Pop", "Pop", "Pop", "Pop", "Pop", "Pop", "Pop...
## $ lyrics <chr> "Oh baby how you doing You know I'm gonna cut right to the c...
Initially, there are 227,449 observations - songs authored by a total of 11,117 artists and bands. There are 7 columns; however, two of them, X and index, are exact duplicates of each other. Therefore, we remove one of them.
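We can quickly verify that the two columns are indeed identical before dropping one (this check should return TRUE, given the duplication noted above):
identical(music$X, music$index)
## [1] TRUE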
music$X <- NULL
Our database includes not only the lyrics, artist, and title of each song, but also its year and genre. It is mainly for this reason that we chose this database for our analysis - it allows for broader and more interesting results.
At the very beginning, it is worth mentioning that we found some songs in languages other than English, and unfortunately, there is no variable in the database indicating the language of the lyrics. For this reason, in the first step, we use the detect_language() function from the cld3 library to get rid of songs whose lyrics are in a language other than English.
music$language <- detect_language(music$lyrics)
music <- music %>%
filter(language == "en") %>%
select(!language)
Let us now take a brief look at the individual variables and make the necessary changes, starting with the year variable.
table(music$year) %>% head(12)
##
## 67 112 702 1970 1971 1972 1973 1974 1975 1976 1977 1978
## 1 4 1 149 178 177 236 149 128 70 211 163
There are some meaningless year values and only one observation potentially from the 1960s. Since our analysis is concerned with changes in song lyrics across the decades, we exclude all observations from before 1970.
music <- music %>%
filter(year >= 1970)
What is more, we add a new factor variable indicating the decade the song comes from.
breaks <- c(1970,1980,1990,2000,2010,2020)
labels <- c("1970s", "1980s", "1990s", "2000s", "2010s")
music$decade <- cut(music$year,
breaks = breaks,
include.lowest = TRUE,
right = FALSE,
labels = labels)
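As a quick sanity check, the binning should produce exactly the five decade labels defined above:
levels(music$decade)
## [1] "1970s" "1980s" "1990s" "2000s" "2010s"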
To make the data visually more appealing, we also slightly change the form of two columns (artist and song) by replacing the dashes that join words in those columns with spaces.
music <- music %>%
mutate(artist = chartr("-", " ", artist)) %>%
mutate(song = chartr("-", " ", song))
Next, we check whether there are any missing values in the database.
any(is.na(music))
## [1] FALSE
Even though there are no missing values in our database, it turned out that some songs have very short or even empty lyrics. Since this text column is one of our main interests, we had to exclude from further analysis songs whose lyrics are very short or missing altogether. We decided to set the threshold at 30 words.
Therefore, we create a function number_of_words() that counts the words in the lyrics and apply it to all observations. Finally, we exclude songs whose lyrics are shorter than 30 words.
number_of_words <- function(x){
result <- x %>%
tolower %>%
stringr::str_extract_all('\\w+') %>%
unlist() %>%
length()
return(result)
}
music$n_words <- sapply(music$lyrics, number_of_words)
music <- music %>%
filter(n_words >= 30)
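As a quick, deterministic sanity check of the helper on a toy string:
number_of_words("Oh baby how you doing")
## [1] 5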
At the end of the data preprocessing, observations for which the genre of the song is unknown (the Other value in the genre column) are deleted.
music <- music %>%
filter(genre != "Other")
music_by_artist <- music %>%
group_by(artist) %>%
summarise(n = n())
wordcloud2(music_by_artist %>% top_n(100),
size = .5)
We start this section with a word cloud of artists sized by their song counts. Almost everyone should be able to find at least a few artists they know there, with representatives of various genres and eras. Below, we present the top 20 artists with the highest number of songs in the database.
music_by_artist %>%
arrange(-n) %>%
head(20) %>%
mutate(artist = reorder(artist, n)) %>%
ggplot(aes(artist, n)) +
geom_segment(aes(x = artist, xend = artist, y = 0, yend = n), color = "steelblue", size = 1) +
geom_point(color = "blue", size = 4, alpha = 0.8) +
geom_text(aes(label = n), hjust = -0.35) +
scale_y_continuous(breaks = seq(0,800,100),
labels = seq(0,800,100),
limits = c(0,800)) +
theme_minimal() +
labs(title = "Top 20 Artists in the Number of Songs",
x = "",
y = "") +
coord_flip()
Dolly Parton clearly leads this ranking with 741 songs. The vast majority of these performers are very famous artists and bands: Elton John, Bee Gees, Eminem, David Bowie, Eric Clapton, to name just a few.
Obviously, the vast majority of artists in our database have far fewer songs. We provide the distribution of the number of songs in the table below.
music_by_artist <- music_by_artist %>%
mutate(cut = case_when(n > 200 ~ '>200',
n > 100 ~ '101-200',
n > 30 ~ '31-100',
n > 15 ~ '16-30',
n > 5 ~ '6-15',
n > 1 ~ '2-5',
T ~ '1')) %>%
mutate(cut = factor(cut,
levels = c('1','2-5','6-15','16-30','31-100','101-200','>200'),
ordered = TRUE))
music_by_artist %>%
group_by(cut) %>%
summarise(share = n() / dim(music_by_artist)[1]) %>%
mutate(share = round(share * 100, 2) %>% paste("%")) -> music_share
colnames(music_share) <- c("Number of songs", "Share")
music_share %>%
knitr::kable('html') %>%
kableExtra::kable_styling(full_width = F)
| Number of songs | Share |
|---|---|
| 1 | 21.91 % |
| 2-5 | 21.76 % |
| 6-15 | 22.39 % |
| 16-30 | 13.23 % |
| 31-100 | 16.04 % |
| 101-200 | 3.37 % |
| >200 | 1.3 % |
As can be noticed, more than one-fifth of the artists have only one song in the database. Furthermore, over 65% of them have no more than 15 songs. Artists with over 200 songs constitute just over 1%.
To wrap up the data visualization, let us look at the relationship between the number of words in a song and its genre. Due to the numerous outliers, we used the median instead of the mean.
music %>%
group_by(genre) %>%
summarise(median = median(n_words)) %>%
ggplot(aes(x= genre, y = median, fill = genre)) +
geom_col() +
theme_minimal() +
labs(title="Median Number of Words in Song by Genre",
x = "",
y = "Median Number of Words") +
theme(legend.position = "none")
The most important and not so surprising observation is that Hip-Hop definitely outweighs the rest of the genres in terms of word count. Among the others, the median number of words is fairly even; only Metal and Jazz are slightly below.
Before we process the data for analysis, we need to clean it. We have already made some changes: among other things, observations containing erroneous years were removed, and we eliminated the genre "Other", as it contained a mix of a dozen smaller genres of music and thus would add nothing meaningful to the analysis. We also binned years into decades. Lastly, we ran Google's Compact Language Detector 3, a neural-network model for language detection available in the cld3 package, to remove all non-English songs.
The remaining database had more than 210,000 observations before tokenization. Operations on it proved too computationally expensive. This is why we decided to sub-sample the database, randomly pulling 1,650 observations (songs) for each genre. This sub-sampling also helps balance the genres, as the database is dominated by Hip-Hop and Pop songs, with severe underrepresentation of other genres.
Many lyrics, when transcribed, include phrases like "Repeat Chorus" or labels such as "Bridge" and "Verse". There are also many other undesirable words that can muddy the results. Having done some prior analysis, we picked out a number of words that we either transformed to their full forms or deleted altogether.
Then we tokenize the lyrics into individual words. We use dplyr's anti_join() to remove stop words from the general stop_words lexicon. Next, we get rid of the undesirable words defined earlier using dplyr's filter() with the %in% operator. With the help of distinct(), we also get rid of any duplicate records. Lastly, we remove all words shorter than three characters. This is another subjective decision, but lyrics often contain interjections such as "ye" or "ey".
set.seed(42)
df_cleand <- music %>% group_by(genre) %>%
do(sample_n(., 1650))  # sub-sample 1650 songs per genre
a <- df_cleand         # keep a copy with the original lyrics
clean = function(x){
# expand common contractions and normalize slang spellings
x = gsub("won't", "will not", x)
x = gsub("can't", "can not", x)
x = gsub("n't", " not", x)
x = gsub("'ll", " will", x)
x = gsub("'re", " are", x)
x = gsub("'ve", " have", x)
x = gsub("'m", " am", x)
x = gsub("'d", " would", x)
x = gsub("'s", "", x)
x = gsub('feelin', 'feeling', x)
x = gsub('tryin', 'trying', x)
x = gsub('mothafucka', 'motherfucker', x)
x = gsub('wanna', 'want to', x)
x = gsub('dat', 'that', x)
return(x)
}
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]", " ", x)  # keep only letters, digits, and spaces
df_cleand$lyrics = a$lyrics %>%
gsub(pattern = '\\[[^][]*]', replacement = ' ') %>%  # drop bracketed section labels like [Chorus]
tolower() %>%
clean() %>%
removeSpecialChars() %>%
gsub(pattern = '[[:punct:]|[:digit:]]', replacement = ' ')  # drop leftover punctuation and digits
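To illustrate what clean() does, here is a small deterministic toy example, traced by hand from the substitution rules above:
clean("i can't stop, i'm tryin")
## [1] "i can not stop, i am trying"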
myStopwords = c('ooh', 'oooh', 'oh', 'uh', 'baby', 'babi', 'bebe',
'yeah', 'yeh', 'ye', 'yes', 'ya', 'eh', 'da', 'cardi', 'se',
'ayy', 'ah', 'yo', 'o', 'bum', 'na', 'la', 'ai', 'ba', 'hey','chorus',
'da','yo','dr','aah','mckenzie','2006','yuhh','yurr','aaaaaaaaaaaaahhhhhhh',
'aaahhs','aaaaahh', 'aaaaaaaaaaaaaah')
df_tidy <- df_cleand %>%
ungroup() %>%
unnest_tokens(word, lyrics) %>%
distinct() %>%
filter(!word %in% myStopwords & decade != 'NA') %>%
anti_join(stop_words) %>%
filter(nchar(word) > 2) %>%
select(artist,song,genre,word,decade)
First, we investigate word frequencies after removing all stop words and cleaning the data. Similar to the pre-cleaning medians, we can see that hip-hop songs tend to have, on average, more words. The situation looks different for the metal genre: previously it had one of the lowest median word counts per song, but after removing stop words, it tends to have more meaningful words than any other genre except hip-hop. Genres like pop and R&B that previously had among the highest median word counts now display only average numbers, suggesting that lyrics in these genres might not have much vocabulary variety. An interesting outlier in the folk genre is the song "hypnagogue" by the artist Current-93, with a whopping 657 distinct words.
df_tidy %>%
group_by(artist,song,genre) %>%
count() %>%
ggplot(aes(x = genre, y = n, fill = genre)) +
geom_boxplot(show.legend = F) +
labs(x = '', y = 'Word Count', title = 'Word frequency in relation to song genre') +
theme_bw()
Then we move on to visualize the most popular words across all genres. The top eight results are the words: love, time, feel, heart, night, day, life, and eyes.
unigram_tidy <- df_tidy %>%
group_by(word) %>%
count() %>%
ungroup() %>%
arrange(desc(n))
wordcloud2(data = unigram_tidy[1:100, ], size = 1, color = brewer.pal(8, 'Dark2'))
We also looked at word bigrams, because words often take on different meanings depending on their neighbors. Across all genres, the most popular bigrams were: true love, talkin bout, hip hop, santa claus, deep inside, broken heart, gonna love, and coming home.
bigram_token <- df_cleand %>%
select(song,artist,decade,lyrics,genre) %>%
unnest_tokens(output = bigram, input = lyrics, token = 'ngrams', n = 2)
bigram_token <- bigram_token %>%
separate(bigram, into = c('word1', 'word2'), sep = ' ') %>%
filter(!word1 %in% c(myStopwords, stop_words$word)) %>%
filter(!word2 %in% c(myStopwords, stop_words$word)) %>%
filter(word1 != word2) %>%
unite(col = bigram, word1, word2, sep = ' ') %>%
filter(!bigram %in% tolower(gsub("-", " ", (df_cleand$artist))))
bigram_tidy = bigram_token %>%
group_by(bigram) %>%
count() %>%
arrange(desc(n)) %>%
ungroup()
wordcloud2(bigram_tidy[1:50, ], size = .5, color = brewer.pal(8, 'Dark2'), shape = 'circle')
In this part, we use the tidytext package, since it provides sentiment lexicons based on single words (unigrams). The first one, the Bing lexicon, categorizes words into positive and negative. The NRC lexicon, in turn, categorizes them in a binary way into eight more detailed basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). Finally, the AFINN lexicon assigns each word an integer score from -5 to 5, where negative scores indicate negative sentiment and positive scores indicate positive sentiment.
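As a quick reference, each of these lexicons can be pulled with tidytext's get_sentiments(); a short sketch (note that the NRC and AFINN lexicons are distributed via the textdata package and may ask for a one-time download confirmation):
get_sentiments("bing") %>% count(sentiment)           # positive / negative labels
get_sentiments("nrc") %>% count(sentiment)            # 8 emotions + 2 sentiments
get_sentiments("afinn") %>% pull(value) %>% range()   # integer scores from -5 to 5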
We start by presenting the results of the Bing lexicon, which divides words into positive and negative categories. A comparison word cloud, as below, is a good way to depict a larger number of words in this breakdown.
df_tidy %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("red", "blue"),
max.words = 200)
It can be noticed that neither group is clearly dominant - perhaps there are slightly more positive words. The most dominant word is undeniably love. Among the remaining positive words are free, smile, sweet, and strong, and among the negative words wrong, hard, lost, and fall.
We used the Bing lexicon for one more interesting type of analysis. Namely, using the numbers of negative and positive words in a given song, we created a ratio that determines how strongly positive or negative the song is.
ratio_song <- df_tidy %>%
inner_join(get_sentiments("bing")) %>%
group_by(song, sentiment) %>%
summarize(score = n()) %>%
spread(sentiment, score) %>%
ungroup() %>%
mutate(ratio = positive / (positive + negative),
song = reorder(song, ratio))
The first chart presents the top 20 most positive songs. It is not difficult to notice that many of them are actually Christmas songs and carols, and thus the percentage of words classified as negative is relatively low. For example, the most positive song (oral fixation) has less than 4% negative words among all its words assigned to either category.
ratio_song %>%
top_n(20) %>%
ggplot(aes(x = song, y = ratio)) +
geom_point(color = "blue", size = 4) +
coord_flip() +
labs(title = "Top 20 Most Positive Songs",
x = "",
caption = "ratio = positive to positive and negative words jointly") +
theme_minimal() +
theme(plot.title = element_text(size = 16, face = "bold"),
panel.grid = element_line(linetype = "dashed", color = "darkgrey", size = .5))
We next created a similar chart for the most negative songs. The ratio here was calculated by subtracting the positive ratio from one. Thanks to this, we actually get the most negative songs, not merely the least positive ones. Already from the titles themselves, it can be concluded that these songs are much closer to themes of death or hate.
ratio_song %>%
mutate(ratio = 1 - ratio,
song = reorder(song, ratio)) %>%
top_n(20) %>%
ggplot(aes(x = song, y = ratio)) +
geom_point(color = "red", size = 4) +
coord_flip() +
labs(title = "Top 20 Most Negative Songs",
x = "",
caption = "ratio = negative to positive and negative words jointly") +
theme_minimal() +
theme(plot.title = element_text(size = 16, face = "bold"),
panel.grid = element_line(linetype = "dashed", color = "darkgrey", size = .5))
In the next step, we analyzed the sentiment of song lyrics using the NRC lexicon.
(nrc = get_sentiments(lexicon = 'nrc'))
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
As expected, song sentiments differ by genre. Country, folk, pop, and jazz tend to be more positive, whereas rock, hip-hop, and especially metal have more negative connotations. We can also see that some genres, like metal and hip-hop, convey a more polarized range of emotions, while others, like rock or electronic, show a more even distribution across emotions.
song_nrc <- df_tidy %>%
inner_join(nrc) %>%
group_by(genre, sentiment) %>%
count() %>%
ungroup()
ggplot(song_nrc, aes(x = reorder(sentiment, n), y = n, fill = genre)) +
geom_col(show.legend = F) +
facet_wrap(genre ~., scales = "free") +
coord_flip() +
labs(x = NULL, y = NULL, title = 'Sentiment Analysis by NRC Lexicon') +
theme_bw()
To further investigate sentiment, we took a closer look at which words contributed the most to each sentiment category. The findings are displayed in the figure below.
unigram_tidy <- df_tidy %>%
group_by(word) %>%
count() %>%
ungroup() %>%
arrange(desc(n))
unigram_tidy %>%
inner_join(nrc, by = "word") %>%
ungroup() %>%
filter(!sentiment %in% c("positive", "negative")) %>%
arrange(desc(n)) %>%
group_by(sentiment) %>%
slice(1:10) %>%
ggplot(aes(
x = reorder(word, n),
y = n,
fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap( ~ sentiment, scales = "free") +
coord_flip() +
labs(x = NULL, y = NULL, title = 'Top 10 most frequent words per each sentiment category') +
theme_bw()
Next, we move to analyze the sentiment of song lyrics using the AFINN lexicon.
(afinn = get_sentiments(lexicon = 'afinn'))
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,467 more rows
As expected, the AFINN lexicon gave the Hip-Hop genre the lowest scores. The metal genre ranked second most negative, whereas genres like jazz or country tend to score higher. Interestingly, the positive outliers in country, folk, hip-hop, jazz, pop, and R&B are different arrangements of the song "Silent Night".
song_afinn <- df_tidy %>%
inner_join(afinn) %>%
group_by(song) %>%
mutate(total_score = sum(value)) %>%
ungroup() %>%
arrange(desc(value))
song_afinn %>%
ggplot(aes(x = genre, y = total_score, fill = genre)) +
geom_boxplot(show.legend = F) +
labs(x = 'Genre', y = 'Sentiment Score', title = 'Sentiment Score by AFINN Lexicon') +
theme_bw()
Using text mining, we can additionally check which words in songs are key for given decades and which could be called timeless. This way, we may see which words inspired artists and mattered most to them over the past decades. By filtering appropriately, we choose the 7 most frequent words for each decade.
timeless_words <- df_tidy %>%
group_by(decade) %>%
count(word, decade, sort = TRUE) %>%
slice(seq_len(7)) %>%
ungroup() %>%
arrange(decade, n) %>%
mutate(row = row_number())
timeless_words %>%
ggplot(aes(row, n, fill = decade)) +
geom_col() +
labs(title = "Timeless words",
x = NULL,
y = NULL) +
theme_bw() +
theme(legend.position = "None") +
facet_wrap(~decade, scales = "free", ncol = 5) +
scale_x_continuous(breaks = timeless_words$row,
labels = timeless_words$word) +
theme(axis.text.x = element_blank()) +
coord_flip()
Love is undeniably the most timeless word. Interestingly, some other words stay in the top 7 through the decades, such as time, life, heart, and feel. Generally, we can conclude that the top words do not change significantly over the years.
After searching for words used often across decades and genres, we investigated word importance adjusted for how rarely words are used. To do that, we used the TF-IDF metric, which gives a higher score to terms that appear frequently in a document, unless they also occur in many documents (here, the documents are decades or genres). We carried out this analysis across both decades and genres.
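As a reminder, here is a minimal sketch of the quantity that bind_tf_idf() computes for a term t in a document d (tidytext uses the natural logarithm for the IDF part):
# tf(t, d)     = count of t in d / total terms in d
# idf(t)       = ln(number of documents / number of documents containing t)
# tf_idf(t, d) = tf(t, d) * idf(t)
tf_idf_manual <- function(n, doc_total, n_docs, n_docs_with_term) {
(n / doc_total) * log(n_docs / n_docs_with_term)
}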
# Differentiated by Decade
tfidf_words_decade <- df_tidy %>%
count(decade, word, sort = TRUE) %>%
ungroup() %>%
bind_tf_idf(word, decade, n) %>%
arrange(desc(tf_idf))
top_tfidf_words_decade <- tfidf_words_decade %>%
group_by(decade) %>%
slice(seq_len(8)) %>%
ungroup() %>%
arrange(decade, tf_idf) %>%
mutate(row = row_number())
top_tfidf_words_decade %>%
ggplot(aes(x = row, tf_idf, fill = decade)) +
geom_col(show.legend = NULL) +
labs(x = NULL, y = "TF-IDF") +
ggtitle("Important Words using TF-IDF by Decade") +
theme_bw() +
facet_wrap(~decade,
ncol = 2, nrow = 3,
scales = "free") +
scale_x_continuous(
breaks = top_tfidf_words_decade$row,
labels = top_tfidf_words_decade$word) +
coord_flip()
Although interpreting the actual importance of the words promoted by the TF-IDF metric leaves a lot of room for subjectivity, it nonetheless offers another perspective. When grouping by decade, we can see that vulgar words tend to score higher in recent decades.
# Differentiated by Genre
tfidf_words_genre <- df_tidy %>%
count(genre, word, sort = TRUE) %>%
ungroup() %>%
bind_tf_idf(word, genre, n) %>%
arrange(desc(tf_idf))
top_tfidf_words_genre <- tfidf_words_genre %>%
group_by(genre) %>%
slice(seq_len(7)) %>%
ungroup() %>%
arrange(genre, tf_idf) %>%
mutate(row = row_number())
top_tfidf_words_genre %>%
ggplot(aes(x = row, tf_idf, fill = genre)) +
geom_col(show.legend = NULL) +
labs(x = NULL, y = "TF-IDF") +
ggtitle("Important Words using TF-IDF by Genre") +
theme_bw() +
facet_wrap(~genre,
ncol = 3, nrow = 4,
scales = "free") +
scale_x_continuous(
breaks = top_tfidf_words_genre$row,
labels = top_tfidf_words_genre$word) +
coord_flip()
For genres, we can see that in country music, artist names like Dolly Parton carry high weight, suggesting that this person is indeed important for the genre. In electronic music, on the other hand, the highest-scoring tokens are not so much words as sound-alike interjections. In the hip-hop genre, we see a dominance of vulgar and racist words.
Next, we looked closer at the relationships between words by examining networks of co-occurring words. We grouped the songs into small sections of five songs each and then calculated the pairwise correlation of words across those sections; correlated words do not necessarily appear in the same song. We examined the songs of the Hip-Hop artist Chris Brown. As we can see, some words begin to cluster, like the one in the figure below, in the upper right-hand corner, which contains a lot of sexually charged and vulgar vocabulary.
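Under the hood, widyr's pairwise_cor() computes the phi coefficient between every pair of words across the sections; a minimal sketch of that quantity for two words x and y:
# n11 = sections containing both x and y, n00 = sections with neither,
# n10 = sections with x only, n01 = sections with y only
phi_coef <- function(n11, n10, n01, n00) {
(n11 * n00 - n10 * n01) /
sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
}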
section <- df_cleand %>%
ungroup() %>%
filter(artist=='chris brown') %>%
mutate(section = row_number() %/% 5) %>%  # group every five songs into one section
filter(section > 0) %>%
unnest_tokens(word, lyrics) %>%
distinct() %>%
filter(!word %in% myStopwords & decade != 'NA') %>%
anti_join(stop_words) %>%
filter(nchar(word) > 2)
word_corr <- section %>%
group_by(word) %>%
filter(n() >= 5) %>%
pairwise_cor(word, section, sort = TRUE)
word_corr %>%
filter(correlation > .75) %>%
graph_from_data_frame() %>%
ggraph(layout = "kk") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()
Another method of analyzing lyrics data that we deployed was Latent Dirichlet Allocation (LDA), used for topic modeling. Every document is a mixture of topics: we imagine that each document contains words from several topics in particular proportions. Every topic, in turn, is a mixture of words. LDA is a mathematical method for estimating both of these simultaneously: finding the mixture of words associated with each topic, while also determining the mixture of topics that describes each document. (In our implementation below, the documents are artists' combined lyrics.) We performed LDA twice: on the whole available data with a four-topic model, and on the pop genre alone with three topics.
When analyzing the whole dataset, we can see that topics tend to form around words with certain sentiments. In topics 2 and 4, we see predominantly words with positive sentiment, whereas topic 1 is more connected with negative sentiment and the emotions of fear and sadness. Topic 3 also mainly comprises words with negative sentiment but displays more disgust and anger, typical of hip-hop songs.
tidy_lda <- df_cleand %>%
ungroup() %>%
unnest_tokens(word, lyrics) %>%
distinct() %>%
filter(!word %in% myStopwords & decade != 'NA') %>%
anti_join(stop_words) %>%
filter(nchar(word) > 2) %>%
select(artist,song,genre,word)
topics <- LDA(cast_dtm(data = tidy_lda %>%  # the LDA-ready tokens prepared above
count(artist, word) %>%
ungroup(),
term = word,
document = artist,
value = n),
k = 4, control = list(seed = 42)) %>%
tidy(matrix = "beta") %>%
group_by(topic) %>%
arrange(desc(beta)) %>%
top_n(12, beta) %>%
ungroup()
topics %>%
arrange(topic, -beta) %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = 'free') +
coord_flip() +
ggtitle("Topic modeling using LDA")
When assessing only the pop genre, we might see that topics are more uniformly distributed, with a majority of words appearing in every topic, indicating that the underlying motifs of pop songs tend to oscillate around similar themes.
tidy_lda_pop <- df_cleand %>%
filter(genre=="Pop") %>%
ungroup() %>%
unnest_tokens(word, lyrics) %>%
distinct() %>%
filter(!word %in% myStopwords & decade != 'NA') %>%
anti_join(stop_words) %>%
filter(nchar(word) > 2) %>%
select(artist,song,genre,word)
topics_pop <- LDA(cast_dtm(data = tidy_lda_pop %>%
count(artist, word) %>%
ungroup(),
term = word,
document = artist,
value = n),
k = 3, control = list(seed = 42)) %>%
tidy(matrix = "beta") %>%
group_by(topic) %>%
arrange(desc(beta)) %>%
top_n(12, beta) %>%
ungroup()
topics_pop %>%
arrange(topic, -beta) %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = 'free') +
coord_flip() +
ggtitle("Topic modeling using LDA in the Pop Genre")
Lyrics analysis is no easy task. It requires a lot of attention during data preprocessing and cautious assumptions about this highly unstructured data. Through our analysis, we showed that musical lyrics can be very diverse in some aspects, like the most popular words in certain genres, but at the same time similar in others, like the timeless top words that persist across decades. Our analysis also showed that lyrics evolve over time, but some underlying themes, like love, echo across decades. An interesting extension of our analysis would be the inclusion of more advanced topic modeling methods like lda2vec, or predictive analytics revolving around predicting the genre of songs using machine learning, or even generative analytics using deep learning methods like Long Short-Term Memory recurrent neural networks to generate lyrics.