According to Wikipedia, Natural Language Processing (NLP) “is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.”
In this exercise, some NLP techniques are used to extract key insights from a dataset of Lana Del Rey lyrics obtained from Kaggle. First, a text analysis examines the lyrics using word-frequency metrics and word clouds. Then, a sentiment analysis measures the sentiment (positivity and negativity) behind the lyrics and tries to draw conclusions from it.
The data was obtained from Kaggle. The original dataset contains lyrics for 57,650 songs; this analysis uses only the songs by Lana Del Rey. The dataset contains 4 variables: artist, song, link and text (the lyrics).
# load libraries
# plyr is attached first so that it does not mask the dplyr verbs
library(plyr)
library(tidyverse)  # attaches dplyr, tidyr and ggplot2, among others
library(tidytext)
library(vctrs)
library(readxl)
library(DT)
library(lubridate)
library(nortest)
library(plotly)
library(highcharter)
library(wordcloud)
library(reshape2)
library(hrbrthemes)
In this step, the data is loaded and the columns are converted to character vectors. This is required for tokenization, which is the act of splitting a character string into smaller parts (tokens) for further analysis.
songs <- read.csv("songdata.csv")
songs$song <- songs$song %>% as.character()
songs$artist <- songs$artist %>% as.character()
songs$link <- songs$link %>% as.character()
songs$text <- songs$text %>% as.character()
lyrics.lana <- subset(songs, artist == 'Lana Del Rey')
The first step is to remove stop words, i.e. very common words such as ‘the’ or ‘oh’ that carry little meaning. The tidytext package provides a data frame of typical stop words (stop_words), which is used in this analysis. The unnest_tokens() function is also used: it splits the text into one token per row and converts the tokens to lowercase. By default, the tokens are single words.
data(stop_words)
lyrics.lana <- lyrics.lana %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
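To make these two steps concrete, here is a minimal sketch on a made-up one-line lyric. The toy data frame, its contents and the names toy, toy.tokens and toy.clean are assumptions for illustration only; the function calls are the same ones used above.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row data frame standing in for the lyrics data
toy <- tibble(song = "Toy Song", text = "Oh baby, I love the summer")

# Tokenization: one lowercase word per row, punctuation removed
toy.tokens <- toy %>%
  unnest_tokens(word, text)

# Drop every word that appears in the tidytext stop word list
toy.clean <- toy.tokens %>%
  anti_join(stop_words, by = "word")

toy.tokens$word
toy.clean$word
```

Capitalization and punctuation disappear during tokenization, and common words such as ‘the’ are dropped by the anti-join, while words such as ‘baby’ and ‘love’ are kept.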
# Look at the most popular words
count.words <- lyrics.lana %>%
  dplyr::count(word, sort = TRUE)
head(count.words)
## word n
## 1 baby 320
## 2 love 295
## 3 wanna 181
## 4 gonna 117
## 5 ooh 113
## 6 boy 112
There are still a few filler words that can be removed: “uh”, “ya”, “ah”, “oooh”, “la”, “em”, “ooh” and “hey”.
stop.words2 <- tibble(word = c("uh", "ya", "ah", "oooh", "la", "em", "ooh", "hey"))
lyrics.lana <- lyrics.lana %>%
  anti_join(stop.words2, by = "word")
# Look at the most popular words
count.words <- lyrics.lana %>%
  dplyr::count(word, sort = TRUE)
head(count.words)
## word n
## 1 baby 320
## 2 love 295
## 3 wanna 181
## 4 gonna 117
## 5 boy 112
## 6 bad 83
# bar chart of the 15 most frequent words
graph1 <- lyrics.lana %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(15, n) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "#377EB8") +
  coord_flip() +
  ggtitle("Lana Del Rey", "Most frequently used words in songs") +
  geom_label(aes(x = reorder(word, n), y = n, label = n)) +
  labs(x = "Word", y = "Number of times used") +
  theme_minimal()
graph1
As can be seen, ‘baby’ is the most frequent word, appearing 320 times. The second most used word is ‘love’, which appears 295 times.
# number of unique words in each song
lyric.total <- lyrics.lana %>%
  group_by(song) %>%
  dplyr::distinct(word) %>%
  dplyr::summarise(Total = n())
# bar chart of the 15 songs with the most unique words
graph2 <- lyric.total %>%
  top_n(15, Total) %>%
  ggplot(aes(x = reorder(song, Total), y = Total)) +
  geom_col(fill = "#377EB8") +
  coord_flip() +
  ggtitle("Lana Del Rey", "Total song length based on the number of unique words") +
  geom_label(aes(x = reorder(song, Total), y = Total, label = Total)) +
  labs(x = "Song", y = "Unique words by song") +
  theme_minimal()
graph2
“Off To The Races”, released in 2012 as part of the album “Born To Die”, is the song with the highest number of unique words.
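The unique-word count can be illustrated with a small sketch. The toy tokens below are made up for illustration, and the names toy.words and toy.total are hypothetical; the group_by(), distinct() and summarise() calls follow the same pattern as the code above.

```r
library(dplyr)

# Hypothetical tokens: one row per word occurrence
toy.words <- tibble(
  song = c("A", "A", "A", "B", "B"),
  word = c("baby", "baby", "love", "summer", "sad")
)

toy.total <- toy.words %>%
  group_by(song) %>%
  distinct(word) %>%       # keep each word once per song
  summarise(Total = n())   # number of unique words per song

toy.total
```

Song A contains three tokens but only two unique words, because the repeated ‘baby’ is counted once.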
The wordcloud package makes it possible to visualize the most recurring positive and negative words. The comparison.cloud() function plots both negative and positive words in a single word cloud, as follows.
bing <- get_sentiments("bing")
# keep only the words that appear in the bing lexicon
Word.emotion <- lyrics.lana %>%
  inner_join(bing, by = "word")
# negative words are drawn in red, positive words in green
graph3 <- Word.emotion %>%
  dplyr::count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "green"),
                   max.words = 100)
The word ‘love’ is the most frequently used positive word. Other popular positive words are ‘fun’, ‘pretty’ and ‘beautiful’. Among the negative words, we can see words we would associate with angry or unhappy songs, such as ‘bad’ (Fake Diamonds) or ‘crazy’ (Cruel World).
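The step that prepares the data for comparison.cloud() is the reshaping done by acast(). The sketch below uses made-up word counts (the data frame toy.counts and its values are assumptions for illustration) to show the kind of matrix that acast() produces.

```r
library(reshape2)

# Hypothetical word counts, already labelled with a sentiment
toy.counts <- data.frame(
  word      = c("love", "bad", "crazy"),
  sentiment = c("positive", "negative", "negative"),
  n         = c(5, 3, 2)
)

# One row per word, one column per sentiment;
# combinations that do not occur are filled with 0
toy.matrix <- acast(toy.counts, word ~ sentiment, value.var = "n", fill = 0)
toy.matrix
```

The columns come out in the order negative, positive. Since comparison.cloud() assigns its colors column by column, the first color (red) is used for the negative words and the second (green) for the positive ones.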
Sentiment analysis is a type of classification in which text is assigned to different classes. The classes can be binary (positive or negative) or there can be several of them (happy, sad, angry, etc.). Here, the bing sentiment lexicon is used to count the positive and negative words in each song. The sentiment rate of a song is then computed with the following formula:
sentiment_rate = (positive - negative) / (positive + negative)
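Then, depending on the sentiment rate, each song is classified as Positive if the rate is greater than 0, Neutral if the rate is 0 (or cannot be computed) and Negative if the rate is less than 0.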
bing <- get_sentiments("bing")
emotion.analysis <- lyrics.lana %>%
  inner_join(bing, by = "word")
# count positive and negative words per song and compute the sentiment rate
lyrics.sentiment <- emotion.analysis %>%
  dplyr::count(song, sentiment) %>%
  spread(key = sentiment, value = n) %>%
  mutate(sentiment_rate = (positive - negative) / (positive + negative)) %>%
  select(song, sentiment_rate)
# classify each song by the sign of its sentiment rate
lyrics.sentiment$sentiment_final <-
  ifelse(lyrics.sentiment$sentiment_rate > 0, "Positive",
         ifelse(lyrics.sentiment$sentiment_rate == 0, "Neutral", "Negative"))
# the rate is NA when a song has no positive or no negative words;
# such songs are labelled Neutral
lyrics.sentiment <- lyrics.sentiment %>%
  mutate(sentiment_final = replace(sentiment_final, is.na(sentiment_final), "Neutral"))
graph4 <- lyrics.sentiment %>%
  group_by(sentiment_final) %>%
  dplyr::summarise(Total = n())
graph4
## # A tibble: 3 x 2
## sentiment_final Total
## <chr> <int>
## 1 Negative 40
## 2 Neutral 17
## 3 Positive 55
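To make the sentiment rate more concrete, here is a small worked example in base R. The word counts are made up, and the helper names sentiment.rate and sentiment.label are hypothetical. Labelling an undefined rate (0 / 0) as Neutral is an assumption made for this sketch.

```r
# Sentiment rate: (positive - negative) / (positive + negative)
sentiment.rate <- function(positive, negative) {
  (positive - negative) / (positive + negative)
}

# Label by the sign of the rate; an undefined rate (0 / 0) is
# labelled Neutral (assumption made for this sketch)
sentiment.label <- function(rate) {
  ifelse(is.na(rate), "Neutral",
         ifelse(rate > 0, "Positive",
                ifelse(rate == 0, "Neutral", "Negative")))
}

# Hypothetical counts of positive and negative words for four songs
positive <- c(6, 2, 1, 0)
negative <- c(2, 2, 4, 0)

rates <- sentiment.rate(positive, negative)
rates                    # 0.5 0.0 -0.6 NaN
sentiment.label(rates)   # "Positive" "Neutral" "Negative" "Neutral"
```

The rate always lies between -1 (only negative words) and 1 (only positive words), so songs of different lengths can be compared on the same scale.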
# share of each sentiment class, in percent
Var1 <- graph4$Total / sum(graph4$Total) * 100
ggplot(graph4, aes(x = "", y = Total, fill = sentiment_final)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  scale_fill_brewer(palette = "Set1") +
  ggtitle("Lana del Rey", "Sentiments in songs: positive, negative or neutral") +
  labs(x = "", y = "", fill = "Type") +
  geom_text(aes(label = paste0(sprintf("%0.1f", round(Var1, digits = 3)), "%")),
            size = 4,
            color = "white",
            position = position_stack(vjust = 0.5))
Based on the pie chart, 49.1% of the songs evoke positive emotions, 35.7% evoke negative emotions and 15.2% evoke neutral emotions.