Text analytics is used in many fields; the goal is to discover relevant information that may be hidden in the text. Natural Language Processing (NLP) is one methodology used in text mining. It tries to resolve the ambiguities of written language through tokenization, clustering, relationship extraction and algorithms that identify topics.
library(dplyr)
library(ggplot2)
library(gridExtra)
library(tidytext)
library(wordcloud2)
library(formattable)
library(knitr)
library(kableExtra)
library(ggraph)
library(igraph)
library(widyr)
library(tm)
library(topicmodels)
library(SnowballC)
orig <- read.csv("C:/Users/user/Desktop/articoli/song/songdata.csv")
names(orig)
## [1] "artist" "song" "link" "text"
dim(orig)
## [1] 57650 4
class(orig)
## [1] "data.frame"
str(orig)
## 'data.frame': 57650 obs. of 4 variables:
## $ artist: Factor w/ 643 levels "'n Sync","ABBA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ song : Factor w/ 44824 levels "'39","'59 Crunch",..: 1364 2345 2896 3677 3678 6022 6771 7219 8410 8598 ...
## $ link : Factor w/ 57650 levels "/a/abba/ahes+my+kind+of+girl_20598417.html",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ text : Factor w/ 57494 levels "'AD 'AKHSHYAV LO NISH'AR DAVAR \nTEN LI SIMAN \nHAKE'EV SHEHAYAH NISH'AR \nKEN KOL HAZMAN \nATAH MITRACHEK,"| __truncated__,..: 31695 43057 17616 32559 32560 51085 11108 8594 25646 18976 ...
glimpse(orig)
## Observations: 57,650
## Variables: 4
## $ artist <fct> ABBA, ABBA, ABBA, ABBA, ABBA, ABBA, ABBA, ABBA, ABBA, A...
## $ song <fct> Ahe's My Kind Of Girl, Andante, Andante, As Good As New...
## $ link <fct> /a/abba/ahes+my+kind+of+girl_20598417.html, /a/abba/and...
## $ text <fct> Look at her face, it's a wonderful face
## And it means ...
orig$text <- as.character(orig$text)
# new dataset: keep the artist, song and text columns
lyrics <- orig[, c(1, 2, 4)]
The analysis begins by breaking the lyrics into individual words and mining them for insights, using tidy data principles. Tidy data has a specific structure:

- Each variable is a column
- Each observation is a row
- Each type of observational unit is a table

We thus define the tidy text format as a table with one token per row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. There are also other text-mining representations, such as strings, corpora and document-term matrices.
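To make the one-token-per-row idea concrete, here is a minimal sketch on a made-up two-song data frame (the toy data and its values are purely illustrative, not part of the dataset):

# a hypothetical two-song data frame, invented only to illustrate tokenization
toy <- data.frame(song = c("Example A", "Example B"),
                  text = c("Look at her face", "It means something special"),
                  stringsAsFactors = FALSE)
# each row of the result holds one lower-cased word; the song column is repeated for every token
toy %>% unnest_tokens(word, text)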
song_grp <- lyrics %>%
  group_by(artist) %>%
  summarise(song_cnt = n()) %>%  # one row per song, so n() counts songs per artist
  arrange(desc(song_cnt))
song_grp[1:10,] %>%
ungroup(artist, song_cnt) %>%
mutate(song_cnt = color_bar("lightblue")(song_cnt)) %>%
mutate(artist = color_tile("green","green")(artist)) %>%
kable("html", escape = FALSE, align = "c", caption = "Artist With Highest song Count") %>%
kable_styling(bootstrap_options =
c("striped", "condensed", "bordered"),
full_width = FALSE)
| artist | song_cnt |
|---|---|
| Donna Summer | 191 |
| Gordon Lightfoot | 189 |
| Bob Dylan | 188 |
| George Strait | 188 |
| Alabama | 187 |
| Cher | 187 |
| Loretta Lynn | 187 |
| Reba Mcentire | 187 |
| Chaka Khan | 186 |
| Dean Martin | 186 |
wordcloud2(song_grp[1:300,],size = .1)
Tokenization starts with the unnest_tokens() function from the tidytext library, already loaded. This function requires at least two arguments: the name of the output column that will be created as the text is unnested into it (i.e. word), and the input column that holds the current text (i.e. text). I take the lyrics dataset, pipe it into unnest_tokens(), and then remove stop words. Stop words are overly common words that may not add any meaning to the results. There are different lists to choose from, but here I've used the stop_words dataset from the tidytext package. After tokenizing the lyrics into words, I've used anti_join() to remove stop words, then distinct() to get rid of any duplicate records. Lastly, I've removed all words with fewer than four characters, because in lyrics these are often interjections.
head(sample(stop_words$word, 15), 15)
## [1] "whose" "sorry" "lest" "she's"
## [5] "who's" "consequently" "specify" "this"
## [9] "indicated" "backing" "specifying" "normally"
## [13] "upon" "others" "viz"
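As an aside (not part of the original analysis), the stop_words table actually bundles several lexicons; counting by its lexicon column shows which lists are available:

# the bundled stop-word lists in tidytext: onix, SMART and snowball
stop_words %>% count(lexicon)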
lyrics_filtered <- lyrics %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
distinct() %>%
filter(nchar(word) > 3)
head(lyrics_filtered)
## artist song word
## 1 ABBA Ahe's My Kind Of Girl wonderful
## 2 ABBA Ahe's My Kind Of Girl means
## 3 ABBA Ahe's My Kind Of Girl special
## 4 ABBA Ahe's My Kind Of Girl smiles
## 5 ABBA Ahe's My Kind Of Girl lucky
## 6 ABBA Ahe's My Kind Of Girl fellow
dim(lyrics_filtered)
## [1] 2190290 3
In music, individual word frequencies carry a great deal of importance, whether through repetition or rarity; both affect how memorable the song is. One question a songwriter may want answered is whether there is a correlation between word frequency and hit songs. So I've summarized the count of words per song and, as a simple evaluation, also the most frequently used words across the whole dataset.
full_word_count <- lyrics %>%
unnest_tokens(word, text) %>%
group_by(song) %>%
summarise(num_words = n()) %>%
arrange(desc(num_words))
full_word_count[1:10,] %>%
ungroup(num_words, song) %>%
mutate(num_words = color_bar("lightblue")(num_words)) %>%
mutate(song = color_tile("green","green")(song)) %>%
kable("html", escape = FALSE, align = "c", caption = "Songs With Highest Word Count") %>%
kable_styling(bootstrap_options =
c("striped", "condensed", "bordered"),
full_width = FALSE)
| song | num_words |
|---|---|
| Hold On | 6230 |
| Angel | 6003 |
| Forever | 5827 |
| Home | 5209 |
| I Believe | 5084 |
| Best Friend | 4803 |
| Crazy | 4735 |
| Beautiful | 4733 |
| Paradise | 4624 |
| Stay | 4569 |
lyrics_filtered %>%
filter(word == "angel") %>%
select(word, song, artist) %>%
arrange() %>%
top_n(10,song) %>%
mutate(song = color_tile("lightblue","lightblue")(song)) %>%
mutate(word = color_tile("green","green")(word)) %>%
kable("html", escape = FALSE, align = "c", caption = "angel word per song and artist") %>%
kable_styling(bootstrap_options =
c("striped", "condensed", "bordered"),
full_width = FALSE)
| word | song | artist |
|---|---|---|
| angel | You Opened My Eyes | Bosson |
| angel | You’re So Beautiful | Def Leppard |
| angel | You’ve Changed | George Michael |
| angel | You Rescue Me | Wishbone Ash |
| angel | You Made Me Love You (Dear Mr. Gable) | Judy Garland |
| angel | You’re Not Alone Tonight | Keith Urban |
| angel | Young Offender | New Order |
| angel | You Only Tell Me You Love Me When You’re Drunk | Pet Shop Boys |
| angel | You Just Gotta Love Christmas | Peter Cetera |
| angel | Your Precious Love | Whitesnake |
lyrics_filtered %>%
count(word, sort = TRUE) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot() +
geom_col(aes(word, n), fill = "blue") +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5),
panel.grid.major = element_blank()) +
xlab("") +
ylab("Song Count") +
ggtitle("Most Frequently Used Words in Lyrics") +
coord_flip()
lyrics_words_counts <- lyrics_filtered %>%
count(word, sort = TRUE)
wordcloud2(lyrics_words_counts[1:500, ], size = .5)
popular_words <- lyrics_filtered %>%
group_by(song) %>%
count(word, song, sort = TRUE) %>%
slice(seq_len(8)) %>%
ungroup() %>%
arrange(song,n) %>%
mutate(row = row_number())
popular_words
## # A tibble: 357,967 x 4
## song word n row
## <fct> <chr> <int> <int>
## 1 '39 ahead 1 1
## 2 '39 assembled 1 2
## 3 '39 blue 1 3
## 4 '39 born 1 4
## 5 '39 brave 1 5
## 6 '39 bring 1 6
## 7 '39 call 1 7
## 8 '39 calling 1 8
## 9 '59 Crunch alive 1 9
## 10 '59 Crunch chords 1 10
## # ... with 357,957 more rows
lyrics_word_lengths <- lyrics %>%
unnest_tokens(word, text) %>%
group_by(song,artist) %>%
distinct() %>%
#filter(!word %in% undesirable_words) %>%
mutate(word_length = nchar(word))
lyrics_word_lengths %>%
count(word_length, sort = TRUE) %>%
ggplot(aes(word_length),
binwidth = 10) +
geom_histogram(aes(fill = ..count..),
breaks = seq(1,25, by = 2),
show.legend = FALSE) +
xlab("Word Length") +
ylab("Word Count") +
ggtitle("Word Length Distribution") +
theme(plot.title = element_text(hjust = 0.5),
panel.grid.minor = element_blank())
wc <- lyrics_word_lengths %>%
ungroup() %>%
select(word, word_length) %>%
distinct() %>%
arrange(desc(word_length))
wordcloud2(wc[1:300, ],
size = .15,
minSize = .0005,
ellipticity = .3,
rotateRatio = 1,
fontWeight = "bold")
Tidy data is a useful structure for comparing between variables or grouping by rows, but it can be challenging to compare between rows: most operations for finding pairwise counts or correlations need to turn the data into a wide matrix first. The widyr package makes operations such as computing counts and correlations easy. One useful function is pairwise_count(). The prefix pairwise_ means it will result in one row for each pair of words in the word variable: while the input has one row for each pair of a document and a word, the output has one row for each pair of words. This is also a tidy format. The target is to examine correlation among words, which indicates how often they appear together relative to how often they appear separately. The focus is on the phi coefficient, a common measure of binary correlation. It is equivalent to the Pearson correlation applied to binary data, and it measures how much more likely it is that either both word X and word Y appear, or neither does, than that one appears without the other. The pairwise_cor() function in widyr lets us find the phi coefficient between words based on how often they appear in the same section; its syntax is similar to pairwise_count(). ggraph is then used to visualize the correlations and clusters of words found by the widyr package.
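For reference (this is the standard definition of the phi coefficient, not something computed explicitly in the analysis), write $n_{11}$ for the number of sections containing both word X and word Y, $n_{00}$ for the number containing neither, $n_{10}$ and $n_{01}$ for the numbers containing only one of them, and $n_{1\cdot}$, $n_{0\cdot}$, $n_{\cdot 1}$, $n_{\cdot 0}$ for the corresponding row and column totals; then

$$\phi = \frac{n_{11}\,n_{00} - n_{10}\,n_{01}}{\sqrt{n_{1\cdot}\, n_{0\cdot}\, n_{\cdot 1}\, n_{\cdot 0}}}$$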
section_U2 <- lyrics %>%
  filter(artist == 'U2') %>%
  mutate(section = row_number() %/% 10) %>%
  filter(section > 0) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  filter(nchar(word) > 3)
head(section_U2)
## artist song section word
## 1 U2 Breathe 1 16th
## 2 U2 Breathe 1 june
## 3 U2 Breathe 1 door
## 4 U2 Breathe 1 bell
## 5 U2 Breathe 1 rings
## 6 U2 Breathe 1 door
lyric_pair <- section_U2 %>%
  pairwise_count(word, section, sort = TRUE)
head(lyric_pair)
## # A tibble: 6 x 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 time love 13.0
## 2 song love 13.0
## 3 love time 13.0
## 4 song time 13.0
## 5 love song 13.0
## 6 time song 13.0
word_corr <- section_U2 %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, section, sort = TRUE)
head(word_corr)
## # A tibble: 6 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 york alright 0.993
## 2 alright york 0.993
## 3 mother alright 0.986
## 4 alright mother 0.986
## 5 york mother 0.984
## 6 mother york 0.984
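A quick usage example (not in the original analysis): since word_corr is an ordinary tidy data frame already sorted by correlation, the words most correlated with any single word of interest can be pulled out with a filter.

# words most strongly correlated with "love" across the U2 sections
word_corr %>%
  filter(item1 == "love") %>%
  head(10)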
set.seed(12345)
word_corr %>%
filter(correlation > .75) %>%
graph_from_data_frame() %>%
ggraph(layout = "kk") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "blue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()
Topic modeling is a method for unsupervised classification of documents, similar to clustering on numeric data, in that it finds natural groups of items. Latent Dirichlet allocation (LDA) is one of the most common algorithms for fitting a topic model. It treats every document as a mixture of topics and every topic as a mixture of words: we imagine that each document contains words from several topics in particular proportions, and that each topic is characterized by its own distribution over words. LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words associated with each topic, while also determining the mixture of topics that describes each document. I've used the LDA() function from the topicmodels package, setting k = 5, to create a five-topic LDA model. This function returns an object containing the full details of the model fit, such as how words are associated with topics and how topics are associated with documents.
set.seed(1234)
row_indexes <- sample(1:nrow(lyrics), 1600, replace = F)
texts_subsample <-slice(lyrics, row_indexes)
viewsCorpus <- Corpus(VectorSource(texts_subsample$text))
viewsDTM <- DocumentTermMatrix(viewsCorpus)
viewsDTM_tidy <- tidy(viewsDTM)
viewsDTM_tidy_cleaned <- viewsDTM_tidy %>%
anti_join(stop_words, by = c("term" = "word")) %>%
filter(nchar(term) > 3)
top_terms_by_topic_LDA <- function(input_text,
                                   plot = TRUE,
                                   number_of_topics = 5) {
  # build a corpus and document-term matrix from the input text
  Corpus <- Corpus(VectorSource(input_text))
  DTM <- DocumentTermMatrix(Corpus)
  # drop empty documents, which LDA() cannot handle
  unique_indexes <- unique(DTM$i)
  DTM <- DTM[unique_indexes, ]
  # fit the LDA model and extract the per-topic-per-word probabilities (beta)
  lda <- LDA(DTM, k = number_of_topics, control = list(seed = 1234))
  topics <- tidy(lda, matrix = "beta")
  # keep the ten terms with the highest beta within each topic
  top_terms <- topics %>%
    group_by(topic) %>%
    top_n(10, beta) %>%
    ungroup() %>%
    arrange(topic, -beta)
  if (plot == TRUE) {
    top_terms %>%
      mutate(term = reorder(term, beta)) %>%
      ggplot(aes(term, beta, fill = factor(topic))) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~ topic, scales = "free") +
      labs(x = NULL, y = "Beta") +
      coord_flip()
  } else {
    return(top_terms)
  }
}
top_terms_by_topic_LDA(viewsDTM_tidy_cleaned$term, number_of_topics = 5)
viewsDTM_tidy_cleaned <- viewsDTM_tidy_cleaned %>%
mutate(stem = wordStem(term))
top_terms_by_topic_LDA(viewsDTM_tidy_cleaned$stem, number_of_topics=5)
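The beta matrix used above describes each topic as a distribution over words. LDA also estimates the mixture of topics within each document (the gamma matrix); to inspect it you need the fitted model itself, which the helper function does not return. A minimal sketch, refitting a five-topic model on the cleaned terms outside the function (the names cleaned_dtm, lda_fit and doc_topics are illustrative assumptions):

# cast the cleaned tidy terms back into a document-term matrix
cleaned_dtm <- viewsDTM_tidy_cleaned %>%
  cast_dtm(document, term, count)
lda_fit <- LDA(cleaned_dtm, k = 5, control = list(seed = 1234))
# gamma holds the estimated proportion of each topic within each document
doc_topics <- tidy(lda_fit, matrix = "gamma")
head(doc_topics)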