Text analytics is used in many fields; the goal is to discover relevant information that may be hidden in the text. Natural Language Processing (NLP) is one methodology used in text mining. It tries to resolve the ambiguities of written language through tokenization, clustering, relationship extraction, and algorithms that identify topics.

Loading libraries

library(dplyr)        # data manipulation
library(ggplot2)      # visualization
library(gridExtra)    # arranging multiple plots
library(tidytext)     # tidy text mining
library(wordcloud2)   # interactive word clouds
library(formattable)  # table formatting helpers
library(knitr)        # report generation
library(kableExtra)   # styled HTML tables
library(ggraph)       # network visualization
library(igraph)       # graph objects
library(widyr)        # pairwise counts and correlations
library(tm)           # corpus and document-term matrix tools
library(topicmodels)  # LDA topic modeling
library(SnowballC)    # word stemming

Loading data

orig <- read.csv("C:/Users/user/Desktop/articoli/song/songdata.csv")
names(orig)
## [1] "artist" "song"   "link"   "text"
dim(orig)
## [1] 57650     4
class(orig)
## [1] "data.frame"
str(orig)
## 'data.frame':    57650 obs. of  4 variables:
##  $ artist: Factor w/ 643 levels "'n Sync","ABBA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ song  : Factor w/ 44824 levels "'39","'59 Crunch",..: 1364 2345 2896 3677 3678 6022 6771 7219 8410 8598 ...
##  $ link  : Factor w/ 57650 levels "/a/abba/ahes+my+kind+of+girl_20598417.html",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ text  : Factor w/ 57494 levels "'AD 'AKHSHYAV LO NISH'AR DAVAR  \nTEN LI SIMAN  \nHAKE'EV SHEHAYAH NISH'AR  \nKEN KOL HAZMAN  \nATAH MITRACHEK,"| __truncated__,..: 31695 43057 17616 32559 32560 51085 11108 8594 25646 18976 ...
glimpse(orig)
## Observations: 57,650
## Variables: 4
## $ artist <fct> ABBA, ABBA, ABBA, ABBA, ABBA, ABBA, ABBA, ABBA, ABBA, A...
## $ song   <fct> Ahe's My Kind Of Girl, Andante, Andante, As Good As New...
## $ link   <fct> /a/abba/ahes+my+kind+of+girl_20598417.html, /a/abba/and...
## $ text   <fct> Look at her face, it's a wonderful face  
## And it means ...
orig$text <- as.character(orig$text)
# new dataset: keep the artist, song and text columns
lyrics <- orig[, c(1, 2, 4)]

The analysis begins by breaking the lyrics out into individual words and mining for insights, using tidy data principles. Tidy data has a specific structure:
- Each variable is a column
- Each observation is a row
- Each type of observational unit is a table
We thus define the tidy text format as a table with one token per row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. There are also other text-mining approaches, such as strings, corpora, and document-term matrices.
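As a minimal sketch of the one-token-per-row format, unnest_tokens() turns a column of text into one row per word (the tiny data frame below is made up for illustration, reusing the ABBA line shown in the glimpse output above):

example <- data.frame(song = "demo",
                      text = "Look at her face, it's a wonderful face",
                      stringsAsFactors = FALSE)
example %>% unnest_tokens(word, text)
# one row per lower-cased token: "look", "at", "her", "face", "it's", ...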

Exploratory data analysis (EDA)

Song count per artist

song_grp <- lyrics %>%
  group_by(artist) %>%
  summarise(song_cnt = n()) %>%
  arrange(desc(song_cnt))
song_grp[1:10, ] %>%
  ungroup() %>%
  mutate(song_cnt = color_bar("lightblue")(song_cnt)) %>%
  mutate(artist = color_tile("green", "green")(artist)) %>%
  kable("html", escape = FALSE, align = "c", caption = "Artists With Highest Song Count") %>%
  kable_styling(bootstrap_options = 
                  c("striped", "condensed", "bordered"), 
                full_width = FALSE)
Artists With Highest Song Count
artist              song_cnt
Donna Summer             191
Gordon Lightfoot         189
Bob Dylan                188
George Strait            188
Alabama                  187
Cher                     187
Loretta Lynn             187
Reba Mcentire            187
Chaka Khan               186
Dean Martin              186

Wordcloud of artists

wordcloud2(song_grp[1:300, ], size = .1)

Tokenization

Tokenization starts with the unnest_tokens() function from the tidytext library loaded above. This function requires at least two arguments: the output column name that will be created as the text is unnested into it (i.e. word), and the input column that holds the current text (i.e. text). I take the lyrics dataset, pipe it into unnest_tokens(), and then remove stop words. Stop words are overly common words that may not add any meaning to our results. There are different lists to choose from, but here I've used the stop_words lexicon from the tidytext package. After tokenizing the lyrics into words, I've used anti_join() to remove stop words, then distinct() to get rid of any duplicate records as well. Lastly, I've removed all words with fewer than four characters, because in lyrics these are often interjections.

head(sample(stop_words$word, 15), 15)
##  [1] "whose"        "sorry"        "lest"         "she's"       
##  [5] "who's"        "consequently" "specify"      "this"        
##  [9] "indicated"    "backing"      "specifying"   "normally"    
## [13] "upon"         "others"       "viz"
lyrics_filtered <- lyrics %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  distinct() %>%
  filter(nchar(word) > 3)
head(lyrics_filtered)
##   artist                  song      word
## 1   ABBA Ahe's My Kind Of Girl wonderful
## 2   ABBA Ahe's My Kind Of Girl     means
## 3   ABBA Ahe's My Kind Of Girl   special
## 4   ABBA Ahe's My Kind Of Girl    smiles
## 5   ABBA Ahe's My Kind Of Girl     lucky
## 6   ABBA Ahe's My Kind Of Girl    fellow
dim(lyrics_filtered)
## [1] 2190290       3

Word frequency

In music, individual word frequencies carry a great deal of importance, whether through repetition or rarity; both affect the memorability of the song itself. One question a songwriter may want to ask is whether there is a correlation between word frequency and hit songs. So I've summarized the word count per song and, as a simple evaluation, also the most frequently used words across the whole dataset.

full_word_count <- lyrics %>%
  unnest_tokens(word, text) %>%
  group_by(song) %>%
  summarise(num_words = n()) %>%
  arrange(desc(num_words)) 

full_word_count[1:10,] %>%
  ungroup() %>%
  mutate(num_words = color_bar("lightblue")(num_words)) %>%
  mutate(song = color_tile("green","green")(song)) %>%
  kable("html", escape = FALSE, align = "c", caption = "Songs With Highest Word Count") %>%
  kable_styling(bootstrap_options = 
                  c("striped", "condensed", "bordered"), 
                full_width = FALSE)
Songs With Highest Word Count
song          num_words
Hold On            6230
Angel              6003
Forever            5827
Home               5209
I Believe          5084
Best Friend        4803
Crazy              4735
Beautiful          4733
Paradise           4624
Stay               4569

One of the most frequent words ("angel") per song and artist

lyrics_filtered %>% 
  filter(word == "angel") %>%
  select(word, song, artist) %>%
  top_n(10, song) %>%
  mutate(song = color_tile("lightblue", "lightblue")(song)) %>%
  mutate(word = color_tile("green", "green")(word)) %>%
  kable("html", escape = FALSE, align = "c", caption = "angel word per song and artist") %>%
  kable_styling(bootstrap_options = 
                  c("striped", "condensed", "bordered"), 
                full_width = FALSE)
angel word per song and artist
word    song                                             artist
angel   You Opened My Eyes                               Bosson
angel   You’re So Beautiful                              Def Leppard
angel   You’ve Changed                                   George Michael
angel   You Rescue Me                                    Wishbone Ash
angel   You Made Me Love You (Dear Mr. Gable)            Judy Garland
angel   You’re Not Alone Tonight                         Keith Urban
angel   Young Offender                                   New Order
angel   You Only Tell Me You Love Me When You’re Drunk   Pet Shop Boys
angel   You Just Gotta Love Christmas                    Peter Cetera
angel   Your Precious Love                               Whitesnake

Top words

lyrics_filtered %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot() +
  geom_col(aes(word, n), fill = "blue") +
  theme(legend.position = "none", 
        plot.title = element_text(hjust = 0.5),
        panel.grid.major = element_blank()) +
  xlab("") + 
  ylab("Song Count") +
  ggtitle("Most Frequently Used Words in Lyrics") +
  coord_flip()

Wordcloud of words

lyrics_words_counts <- lyrics_filtered %>%
  count(word, sort = TRUE) 
wordcloud2(lyrics_words_counts[1:500, ], size = .5)

Word length

lyrics_word_lengths <- lyrics %>%
  unnest_tokens(word, text) %>%
  group_by(song, artist) %>%
  distinct() %>%
  mutate(word_length = nchar(word)) 
lyrics_word_lengths %>%
  count(word_length, sort = TRUE) %>%
  ggplot(aes(word_length)) + 
  geom_histogram(aes(fill = ..count..),
                 breaks = seq(1, 25, by = 2), 
                 show.legend = FALSE) + 
  xlab("Word Length") + 
  ylab("Word Count") +
  ggtitle("Word Length Distribution") +
  theme(plot.title = element_text(hjust = 0.5),
        panel.grid.minor = element_blank())

Wordcloud of word length

wc <- lyrics_word_lengths %>%
  ungroup() %>%
  select(word, word_length) %>%
  distinct() %>%
  arrange(desc(word_length))

wordcloud2(wc[1:300, ], 
           size = .15,
           minSize = .0005,
           ellipticity = .3, 
           rotateRatio = 1, 
           fontWeight = "bold")

Correlation: find the correlation between words in the songs written by the artist “U2”

Tidy data is a useful structure for comparing between variables or grouping by rows, but it can be challenging to compare between rows: most operations for finding pairwise counts or correlations need to turn the data into a wide matrix first. The widyr package makes operations such as computing counts and correlations easy. One useful function is pairwise_count(). The prefix pairwise_ means it will result in one row for each pair of words in the word variable: while the input has one row for each pair of a document and a word, the output has one row for each pair of words. This is also a tidy format. The target is to examine correlation among words, which indicates how often they appear together relative to how often they appear separately. The focus is on the phi coefficient, a common measure for binary correlation: it indicates how much more likely it is that either both word X and word Y appear, or neither do, than that one appears without the other, and it is equivalent to the Pearson correlation applied to binary data. The pairwise_cor() function in widyr lets us find the phi coefficient between words based on how often they appear in the same section; its syntax is similar to pairwise_count(). ggraph, as used for bigram networks, is then used to visualize the correlations and clusters of words found by the widyr package.
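For reference, the phi coefficient for two words X and Y can be written from the 2x2 contingency table of their co-occurrence across sections (this is the standard definition, added here for clarity; it is computed internally by pairwise_cor()):

$$\phi = \frac{n_{11}\,n_{00} - n_{10}\,n_{01}}{\sqrt{n_{1\cdot}\,n_{0\cdot}\,n_{\cdot 1}\,n_{\cdot 0}}}$$

where n11 counts the sections containing both words, n00 those containing neither, n10 and n01 those containing only one of the two, and the dotted subscripts denote the corresponding row and column totals.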

section_U2 <- lyrics %>%
  filter(artist == 'U2') %>%
  mutate(section = row_number() %/% 10) %>%
  filter(section > 0) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  filter(nchar(word) > 3)
head(section_U2)
##   artist    song section  word
## 1     U2 Breathe       1  16th
## 2     U2 Breathe       1  june
## 3     U2 Breathe       1  door
## 4     U2 Breathe       1  bell
## 5     U2 Breathe       1 rings
## 6     U2 Breathe       1  door
lyric_pair <- section_U2 %>% pairwise_count(word, section, sort = TRUE)
head(lyric_pair)
## # A tibble: 6 x 3
##   item1 item2     n
##   <chr> <chr> <dbl>
## 1 time  love   13.0
## 2 song  love   13.0
## 3 love  time   13.0
## 4 song  time   13.0
## 5 love  song   13.0
## 6 time  song   13.0
word_corr <- section_U2 %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, section, sort = TRUE)
head(word_corr)
## # A tibble: 6 x 3
##   item1   item2   correlation
##   <chr>   <chr>         <dbl>
## 1 york    alright       0.993
## 2 alright york          0.993
## 3 mother  alright       0.986
## 4 alright mother        0.986
## 5 york    mother        0.984
## 6 mother  york          0.984
set.seed(12345)
word_corr %>%
  filter(correlation > .75) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "kk") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "blue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

Unsupervised learning with LDA

Topic modeling is a method for the unsupervised classification of documents, similar to clustering on numeric data: it finds natural groups of items. Latent Dirichlet allocation (LDA) is one of the most common algorithms for fitting a topic model. It rests on two principles. Every document is a mixture of topics: we imagine that each document may contain words from several topics in particular proportions. Every topic is a mixture of words. LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document. I've used the LDA() function from the topicmodels package, setting k = 5, to create a five-topic LDA model. This function returns an object containing the full details of the model fit, such as how words are associated with topics and how topics are associated with documents.
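As a small sketch of those two mixtures (the object names here are illustrative; lda stands for a fitted topicmodels LDA object like the one created inside the function below), tidytext::tidy() can extract both:

word_topics <- tidy(lda, matrix = "beta")   # per-topic word probabilities: topic, term, beta
doc_topics  <- tidy(lda, matrix = "gamma")  # per-document topic proportions: document, topic, gamma

The analysis below uses only the beta matrix; the gamma matrix would answer the complementary question of which topics describe each song.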

Subsample the dataset

set.seed(1234) 
row_indexes <- sample(1:nrow(lyrics), 1600, replace = FALSE) 
texts_subsample <- slice(lyrics, row_indexes) 

Create a document-term matrix, convert it into a tidy format, remove stop words and apply LDA

# build a corpus and document-term matrix, then tidy it for cleaning
viewsCorpus <- Corpus(VectorSource(texts_subsample$text)) 
viewsDTM <- DocumentTermMatrix(viewsCorpus)
viewsDTM_tidy <- tidy(viewsDTM)
# remove stop words and very short terms
viewsDTM_tidy_cleaned <- viewsDTM_tidy %>% 
  anti_join(stop_words, by = c("term" = "word")) %>%
  filter(nchar(term) > 3)
top_terms_by_topic_LDA <- function(input_text, 
                                   plot = TRUE, 
                                   number_of_topics = 5) 
{    
  # build a corpus and document-term matrix from the input text
  corpus <- Corpus(VectorSource(input_text))
  DTM <- DocumentTermMatrix(corpus) 
  # keep only non-empty documents, which LDA() requires
  unique_indexes <- unique(DTM$i) 
  DTM <- DTM[unique_indexes, ] 
  # fit the LDA model and extract the per-topic word probabilities (beta)
  lda <- LDA(DTM, k = number_of_topics, control = list(seed = 1234))
  topics <- tidy(lda, matrix = "beta") 
  # keep the ten most probable terms for each topic
  top_terms <- topics %>% 
    group_by(topic) %>% 
    top_n(10, beta) %>%
    ungroup() %>% 
    arrange(topic, -beta) 
  if (plot == TRUE) {
    top_terms %>% 
      mutate(term = reorder(term, beta)) %>% 
      ggplot(aes(term, beta, fill = factor(topic))) + 
      geom_col(show.legend = FALSE) + 
      facet_wrap(~ topic, scales = "free") + 
      labs(x = NULL, y = "Beta") + 
      coord_flip() 
  } else { 
    return(top_terms)
  }
}
top_terms_by_topic_LDA(viewsDTM_tidy_cleaned$term, number_of_topics = 5)

Stem the words (i.e. convert each word to its stem) and re-apply LDA

viewsDTM_tidy_cleaned <- viewsDTM_tidy_cleaned %>% 
  mutate(stem = wordStem(term))
top_terms_by_topic_LDA(viewsDTM_tidy_cleaned$stem, number_of_topics = 5)
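As a quick illustration of what wordStem() from SnowballC does (a standalone sketch, not part of the pipeline above), related inflections collapse to a single Porter stem:

wordStem(c("loves", "loved", "loving"))
# [1] "love" "love" "love"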