Text analytics is used in many fields; the goal is to discover relevant information that may be hidden in the text. Natural Language Processing (NLP) is one methodology used in text mining. It tries to resolve the ambiguities of written language through tokenization, clustering, relationship extraction, and algorithms that identify topics.

Loading libraries

library(dplyr)        # data manipulation
library(ggplot2)      # visualization
library(gridExtra)    # arranging multiple plots
library(tidytext)     # tidy text mining
library(wordcloud2)   # interactive word clouds
library(formattable)  # table formatting helpers
library(knitr)        # report generation
library(kableExtra)   # styled HTML tables
library(ggraph)       # network visualization
library(igraph)       # graph objects
library(widyr)        # pairwise counts and correlations
library(tm)           # corpus and document-term matrix tools
library(topicmodels)  # LDA topic modeling
library(SnowballC)    # word stemming

Loading data

orig <- read.csv("C:/Users/user/Desktop/articoli/song/songdata.csv")
names(orig)
## [1] "artist" "song"   "link"   "text"
dim(orig)
## [1] 57650     4
class(orig)
## [1] "data.frame"
str(orig)
## 'data.frame':    57650 obs. of  4 variables:
##  $ artist: Factor w/ 643 levels "'n Sync","ABBA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ song  : Factor w/ 44824 levels "'39","'59 Crunch",..: 1364 2345 2896 3677 3678 6022 6771 7219 8410 8598 ...
##  $ link  : Factor w/ 57650 levels "/a/abba/ahes+my+kind+of+girl_20598417.html",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ text  : Factor w/ 57494 levels "'AD 'AKHSHYAV LO NISH'AR DAVAR  \nTEN LI SIMAN  \nHAKE'EV SHEHAYAH NISH'AR  \nKEN KOL HAZMAN  \nATAH MITRACHEK,"| __truncated__,..: 31695 43057 17616 32559 32560 51085 11108 8594 25646 18976 ...
glimpse(orig)
## Observations: 57,650
## Variables: 4
## $ artist <fct> ABBA, ABBA, ABBA, ABBA, ABBA, ABBA, ABBA, ABBA, ABBA, A...
## $ song   <fct> Ahe's My Kind Of Girl, Andante, Andante, As Good As New...
## $ link   <fct> /a/abba/ahes+my+kind+of+girl_20598417.html, /a/abba/and...
## $ text   <fct> Look at her face, it's a wonderful face  
## And it means ...
orig$text <- as.character(orig$text)
# new dataset: keep the artist, song and text columns
lyrics <- orig[, c(1, 2, 4)]

The analysis begins by breaking the lyrics out into individual words and mining for insights, using tidy data principles. Tidy data has a specific structure:
- Each variable is a column
- Each observation is a row
- Each type of observational unit is a table
We thus define the tidy text format as a table with one token per row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. There are also other text-mining approaches, such as strings, corpora, and document-term matrices.
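As a minimal sketch of the one-token-per-row format, unnest_tokens() turns a column of text into one row per word (the tiny data frame below is made up for illustration, reusing the ABBA line shown in the glimpse output above):

example <- data.frame(song = "demo",
                      text = "Look at her face, it's a wonderful face",
                      stringsAsFactors = FALSE)
example %>% unnest_tokens(word, text)
# one row per lower-cased token: "look", "at", "her", "face", "it's", ...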

Exploratory data analysis (EDA)

Song count per artist

song_grp <- lyrics %>%
  group_by(artist) %>%
  summarise(song_cnt = n()) %>%
  arrange(desc(song_cnt))
song_grp[1:10, ] %>%
  ungroup() %>%
  mutate(song_cnt = color_bar("lightblue")(song_cnt)) %>%
  mutate(artist = color_tile("green", "green")(artist)) %>%
  kable("html", escape = FALSE, align = "c", caption = "Artists With Highest Song Count") %>%
  kable_styling(bootstrap_options = 
                  c("striped", "condensed", "bordered"), 
                full_width = FALSE)
Artists With Highest Song Count
artist              song_cnt
Donna Summer             191
Gordon Lightfoot         189
Bob Dylan                188
George Strait            188
Alabama                  187
Cher                     187
Loretta Lynn             187
Reba Mcentire            187
Chaka Khan               186
Dean Martin              186

Wordcloud of artists

wordcloud2(song_grp[1:300, ], size = .1)

Tokenization

Tokenization starts with the unnest_tokens() function from the tidytext library loaded above. This function requires at least two arguments: the output column name that will be created as the text is unnested into it (i.e. word), and the input column that holds the current text (i.e. text). I take the lyrics dataset, pipe it into unnest_tokens(), and then remove stop words. Stop words are overly common words that may not add any meaning to our results. There are different lists to choose from, but here I've used the stop_words lexicon from the tidytext package. After tokenizing the lyrics into words, I've used anti_join() to remove stop words, then distinct() to get rid of any duplicate records as well. Lastly, I've removed all words with fewer than four characters, because in lyrics these are often interjections.

head(sample(stop_words$word, 15), 15)
##  [1] "whose"        "sorry"        "lest"         "she's"       
##  [5] "who's"        "consequently" "specify"      "this"        
##  [9] "indicated"    "backing"      "specifying"   "normally"    
## [13] "upon"         "others"       "viz"
lyrics_filtered <- lyrics %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  distinct() %>%
  filter(nchar(word) > 3)
head(lyrics_filtered)
##   artist                  song      word
## 1   ABBA Ahe's My Kind Of Girl wonderful
## 2   ABBA Ahe's My Kind Of Girl     means
## 3   ABBA Ahe's My Kind Of Girl   special
## 4   ABBA Ahe's My Kind Of Girl    smiles
## 5   ABBA Ahe's My Kind Of Girl     lucky
## 6   ABBA Ahe's My Kind Of Girl    fellow
dim(lyrics_filtered)
## [1] 2190290       3

Word frequency

In music, individual word frequencies carry a great deal of importance, whether through repetition or rarity; both affect the memorability of the song itself. One question a songwriter may want to ask is whether there is a correlation between word frequency and hit songs. So I've summarized the word count per song and, as a simple evaluation, also the most frequently used words across the whole dataset.

full_word_count <- lyrics %>%
  unnest_tokens(word, text) %>%
  group_by(song) %>%
  summarise(num_words = n()) %>%
  arrange(desc(num_words)) 

full_word_count[1:10,] %>%
  ungroup() %>%
  mutate(num_words = color_bar("lightblue")(num_words)) %>%
  mutate(song = color_tile("green","green")(song)) %>%
  kable("html", escape = FALSE, align = "c", caption = "Songs With Highest Word Count") %>%
  kable_styling(bootstrap_options = 
                  c("striped", "condensed", "bordered"), 
                full_width = FALSE)
Songs With Highest Word Count
song          num_words
Hold On            6230
Angel              6003
Forever            5827
Home               5209
I Believe          5084
Best Friend        4803
Crazy              4735
Beautiful          4733
Paradise           4624
Stay               4569

One of the most frequent words ("angel") per song and artist

lyrics_filtered %>% 
  filter(word == "angel") %>%
  select(word, song, artist) %>%
  top_n(10, song) %>%
  mutate(song = color_tile("lightblue", "lightblue")(song)) %>%
  mutate(word = color_tile("green", "green")(word)) %>%
  kable("html", escape = FALSE, align = "c", caption = "angel word per song and artist") %>%
  kable_styling(bootstrap_options = 
                  c("striped", "condensed", "bordered"), 
                full_width = FALSE)
angel word per song and artist
word    song                                             artist
angel   You Opened My Eyes                               Bosson
angel   You’re So Beautiful                              Def Leppard
angel   You’ve Changed                                   George Michael
angel   You Rescue Me                                    Wishbone Ash
angel   You Made Me Love You (Dear Mr. Gable)            Judy Garland
angel   You’re Not Alone Tonight                         Keith Urban
angel   Young Offender                                   New Order
angel   You Only Tell Me You Love Me When You’re Drunk   Pet Shop Boys
angel   You Just Gotta Love Christmas                    Peter Cetera
angel   Your Precious Love                               Whitesnake

Top words

lyrics_filtered %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot() +
  geom_col(aes(word, n), fill = "blue") +
  theme(legend.position = "none", 
        plot.title = element_text(hjust = 0.5),
        panel.grid.major = element_blank()) +
  xlab("") + 
  ylab("Song Count") +
  ggtitle("Most Frequently Used Words in Lyrics") +
  coord_flip()

Wordcloud of words

lyrics_words_counts <- lyrics_filtered %>%
  count(word, sort = TRUE) 
wordcloud2(lyrics_words_counts[1:500, ], size = .5)

Word length

lyrics_word_lengths <- lyrics %>%
  unnest_tokens(word, text) %>%
  group_by(song, artist) %>%
  distinct() %>%
  mutate(word_length = nchar(word)) 
lyrics_word_lengths %>%
  count(word_length, sort = TRUE) %>%
  ggplot(aes(word_length)) + 
  geom_histogram(aes(fill = ..count..),
                 breaks = seq(1, 25, by = 2), 
                 show.legend = FALSE) + 
  xlab("Word Length") + 
  ylab("Word Count") +
  ggtitle("Word Length Distribution") +
  theme(plot.title = element_text(hjust = 0.5),
        panel.grid.minor = element_blank())

Wordcloud of word length

wc <- lyrics_word_lengths %>%
  ungroup() %>%
  select(word, word_length) %>%
  distinct() %>%
  arrange(desc(word_length))

wordcloud2(wc[1:300, ], 
           size = .15,
           minSize = .0005,
           ellipticity = .3, 
           rotateRatio = 1, 
           fontWeight = "bold")

Correlation: find the correlation between words in the songs written by the artist “U2”

Tidy data is a useful structure for comparing between variables or grouping by rows, but it can be challenging to compare between rows: most operations for finding pairwise counts or correlations need to turn the data into a wide matrix first. The widyr package makes operations such as computing counts and correlations easy. One useful function is pairwise_count(). The prefix pairwise_ means it will result in one row for each pair of words in the word variable: while the input has one row for each pair of a document and a word, the output has one row for each pair of words. This is also a tidy format. The target is to examine correlation among words, which indicates how often they appear together relative to how often they appear separately. The focus is on the phi coefficient, a common measure for binary correlation: it indicates how much more likely it is that either both word X and word Y appear, or neither do, than that one appears without the other, and it is equivalent to the Pearson correlation applied to binary data. The pairwise_cor() function in widyr lets us find the phi coefficient between words based on how often they appear in the same section; its syntax is similar to pairwise_count(). ggraph, as used for bigram networks, is then used to visualize the correlations and clusters of words found by the widyr package.
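For reference, the phi coefficient for two words X and Y can be written from the 2x2 contingency table of their co-occurrence across sections (this is the standard definition, added here for clarity; it is computed internally by pairwise_cor()):

$$\phi = \frac{n_{11}\,n_{00} - n_{10}\,n_{01}}{\sqrt{n_{1\cdot}\,n_{0\cdot}\,n_{\cdot 1}\,n_{\cdot 0}}}$$

where n11 counts the sections containing both words, n00 those containing neither, n10 and n01 those containing only one of the two, and the dotted subscripts denote the corresponding row and column totals.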

section_U2 <- lyrics %>%
  filter(artist == 'U2') %>%
  mutate(section = row_number() %/% 10) %>%
  filter(section > 0) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  filter(nchar(word) > 3)
head(section_U2)
##   artist    song section  word
## 1     U2 Breathe       1  16th
## 2     U2 Breathe       1  june
## 3     U2 Breathe       1  door
## 4     U2 Breathe       1  bell
## 5     U2 Breathe       1 rings
## 6     U2 Breathe       1  door
lyric_pair <- section_U2 %>% pairwise_count(word, section, sort = TRUE)
head(lyric_pair)
## # A tibble: 6 x 3
##   item1 item2     n
##   <chr> <chr> <dbl>
## 1 time  love   13.0
## 2 song  love   13.0
## 3 love  time   13.0
## 4 song  time   13.0
## 5 love  song   13.0
## 6 time  song   13.0
word_corr <- section_U2 %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, section, sort = TRUE)
head(word_corr)
## # A tibble: 6 x 3
##   item1   item2   correlation
##   <chr>   <chr>         <dbl>
## 1 york    alright       0.993
## 2 alright york          0.993
## 3 mother  alright       0.986
## 4 alright mother        0.986
## 5 york    mother        0.984
## 6 mother  york          0.984
set.seed(12345)
word_corr %>%
  filter(correlation > .75) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "kk") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "blue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

Unsupervised learning with LDA

Topic modeling is a method for the unsupervised classification of documents, similar to clustering on numeric data: it finds natural groups of items. Latent Dirichlet allocation (LDA) is one of the most common algorithms for fitting a topic model. It rests on two principles. Every document is a mixture of topics: we imagine that each document may contain words from several topics in particular proportions. Every topic is a mixture of words. LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document. I've used the LDA() function from the topicmodels package, setting k = 5, to create a five-topic LDA model. This function returns an object containing the full details of the model fit, such as how words are associated with topics and how topics are associated with documents.
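As a small sketch of those two mixtures (the object names here are illustrative; lda stands for a fitted topicmodels LDA object like the one created inside the function below), tidytext::tidy() can extract both:

word_topics <- tidy(lda, matrix = "beta")   # per-topic word probabilities: topic, term, beta
doc_topics  <- tidy(lda, matrix = "gamma")  # per-document topic proportions: document, topic, gamma

The analysis below uses only the beta matrix; the gamma matrix would answer the complementary question of which topics describe each song.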

Subsample the dataset

set.seed(1234) 
row_indexes <- sample(1:nrow(lyrics), 1600, replace = FALSE) 
texts_subsample <- slice(lyrics, row_indexes) 

Create a document-term matrix, convert it into a tidy format, remove stop words and apply LDA

# build a corpus and document-term matrix, then tidy it for cleaning
viewsCorpus <- Corpus(VectorSource(texts_subsample$text)) 
viewsDTM <- DocumentTermMatrix(viewsCorpus)
viewsDTM_tidy <- tidy(viewsDTM)
# remove stop words and very short terms
viewsDTM_tidy_cleaned <- viewsDTM_tidy %>% 
  anti_join(stop_words, by = c("term" = "word")) %>%
  filter(nchar(term) > 3)
top_terms_by_topic_LDA <- function(input_text, 
                                   plot = TRUE, 
                                   number_of_topics = 5) 
{    
  # build a corpus and document-term matrix from the input text
  corpus <- Corpus(VectorSource(input_text))
  DTM <- DocumentTermMatrix(corpus) 
  # keep only non-empty documents, which LDA() requires
  unique_indexes <- unique(DTM$i) 
  DTM <- DTM[unique_indexes, ] 
  # fit the LDA model and extract the per-topic word probabilities (beta)
  lda <- LDA(DTM, k = number_of_topics, control = list(seed = 1234))
  topics <- tidy(lda, matrix = "beta") 
  # keep the ten most probable terms for each topic
  top_terms <- topics %>% 
    group_by(topic) %>% 
    top_n(10, beta) %>%
    ungroup() %>% 
    arrange(topic, -beta) 
  if (plot == TRUE) {
    top_terms %>% 
      mutate(term = reorder(term, beta)) %>% 
      ggplot(aes(term, beta, fill = factor(topic))) + 
      geom_col(show.legend = FALSE) + 
      facet_wrap(~ topic, scales = "free") + 
      labs(x = NULL, y = "Beta") + 
      coord_flip() 
  } else { 
    return(top_terms)
  }
}
top_terms_by_topic_LDA(viewsDTM_tidy_cleaned$term, number_of_topics = 5)

Stem the words (i.e. convert each word to its stem) and re-apply LDA

viewsDTM_tidy_cleaned <- viewsDTM_tidy_cleaned %>% 
  mutate(stem = wordStem(term))
top_terms_by_topic_LDA(viewsDTM_tidy_cleaned$stem, number_of_topics = 5)
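As a quick illustration of what wordStem() from SnowballC does (a standalone sketch, not part of the pipeline above), related inflections collapse to a single Porter stem:

wordStem(c("loves", "loved", "loving"))
# [1] "love" "love" "love"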