NLP - Natural Language Processing to analyze the sentiment in Lana del Rey lyrics

1. Overview

According to Wikipedia, Natural Language Processing (NLP) “is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.”

In this exercise, several NLP techniques are used to extract key insights from a dataset of Lana del Rey lyrics obtained from Kaggle. First, a text analysis examines the lyrics using simple metrics and word clouds. Then, a sentiment analysis measures the positivity and negativity of the lyrics in order to draw conclusions about the songs.

1.1 Research Questions

What is Lana del Rey singing about?
What are the most common words used in her songs?
How long are the songs?
What sentiments do Lana del Rey songs evoke?

2. Data

2.1 Data Information

The data was obtained from Kaggle. The original data contains lyrics for 57,650 songs; this analysis uses only the songs by Lana del Rey. The dataset contains 4 variables:

artist: name of the artist.
song: name of the song.
link: a link to a webpage containing the song.
text: the lyrics of the song identified in the song column.

2.2 Packages Required

# load libraries
# plyr is loaded before the tidyverse so that it does not mask
# dplyr verbs such as count(), summarise() and mutate()
library(plyr)
library(tidyverse)   # attaches dplyr, tidyr and ggplot2, among others
library(tidytext)
library(vctrs)
library(readxl)
library(DT)
library(lubridate)
library(nortest)
library(plotly)
library(highcharter)
library(wordcloud)
library(reshape2)
library(hrbrthemes)

2.3 Data Loading

In this step, the data is loaded and its columns are converted to character. This is required for tokenization, which is the act of splitting a character string into smaller parts for further analysis.

songs <- read.csv("songdata.csv")
songs$song <- songs$song %>% as.character()
songs$artist <- songs$artist %>% as.character()
songs$link <- songs$link %>% as.character()
songs$text <- songs$text %>% as.character()

lyrics.lana <- subset(songs, artist == 'Lana Del Rey')

3. Data Cleaning

The lyrics are first tokenized with the unnest_tokens() function, which splits the text column into one token per row (by default, single lowercase words). Stop words are then removed. These are words such as ‘the’, ‘oh’, etc. that carry little meaning on their own. The tidytext package provides a data frame of typical stop words, stop_words, which is used for this analysis.

data(stop_words)

lyrics.lana <- lyrics.lana %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# Look at the most popular words
count.words <- lyrics.lana %>%
  dplyr::count(word, sort = TRUE)

head(count.words)

##    word   n
## 1  baby 320
## 2  love 295
## 3 wanna 181
## 4 gonna 117
## 5   ooh 113
## 6   boy 112
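To make the tokenization and stop-word step more concrete, here is a minimal sketch on a made-up data frame. The object names toy and toy_tokens and the two lyric lines are hypothetical and do not come from the Kaggle dataset; only unnest_tokens(), anti_join() and stop_words are the same as in the analysis.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical toy data, not taken from the Kaggle dataset
toy <- tibble(
  song = c("Song A", "Song B"),
  text = c("Oh baby, I love the summer", "The night is dark and I am blue")
)

toy_tokens <- toy %>%
  unnest_tokens(word, text) %>%        # one lowercase word per row, punctuation dropped
  anti_join(stop_words, by = "word")   # remove rows whose word is a stop word

toy_tokens
```

Each remaining row pairs a song with one word, which is the shape that the counting and sentiment steps below rely on.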

There are still a few words we can remove, like “uh”, “ya”, “ah”, “oooh”, “la”, “em”, “ooh” and “hey”.

stop.words2 <- tibble(word = c("uh", "ya", "ah", "oooh","la", "em","ooh","hey"))

lyrics.lana <- lyrics.lana %>%
  anti_join(stop.words2,by=c("word"="word"))

# Look at the most popular words
count.words <- lyrics.lana %>%
  dplyr::count(word, sort = TRUE)

head(count.words)

##    word   n
## 1  baby 320
## 2  love 295
## 3 wanna 181
## 4 gonna 117
## 5   boy 112
## 6   bad  83

4. Text Analysis

4.1 Most popular words in Lana del Rey’s songs

graph1 <- lyrics.lana %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(15) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "#377EB8") +
  coord_flip() +
  ggtitle("Lana del Rey", "Most frequently used words in songs") +
  geom_label(aes(x = reorder(word, n), y = n, label = n)) +
  labs(x = "Word", y = "Number of times used") +
  theme_minimal()
graph1

As can be seen, ‘baby’ is the word that appears the most, with 320 occurrences. The second most used word is ‘love’, which appears 295 times.

4.2 Song length based on the number of unique words

lyric.total <- lyrics.lana %>%
  group_by(song) %>%
  dplyr::distinct(word) %>%
  dplyr::summarise(Total = n())

graph2 <- lyric.total %>% 
  top_n(15) %>%
  ggplot(aes(x = reorder(song, Total), y = Total)) +
  geom_col(fill = "#377EB8") +
  coord_flip() +
  ggtitle("Lana del Rey", "Total song length based on the number of unique words") +
  geom_label(aes(x = reorder(song, Total), y = Total, label = Total)) +
  labs(x = "Song", y = "Unique words by song") +
  theme_minimal()

graph2

“Off To The Races”, released in 2012 on the album “Born To Die”, is the song with the highest number of unique words. Note that the count is taken after stop words have been removed.
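The unique-word count per song can be illustrated on a small made-up example. The sketch below uses a hypothetical toy_tokens data frame and n_distinct(), which is assumed here as an equivalent shortcut for the distinct() plus n() combination used in the analysis.

```r
library(dplyr)
library(tibble)

# Hypothetical tokenized lyrics: "baby" is repeated in Song A
toy_tokens <- tibble(
  song = c("Song A", "Song A", "Song A", "Song B", "Song B", "Song B"),
  word = c("baby", "baby", "love", "summer", "blue", "dark")
)

# Number of distinct words per song
toy_total <- toy_tokens %>%
  group_by(song) %>%
  dplyr::summarise(Total = n_distinct(word))

toy_total
```

Song A has three tokens but only two unique words, so repeated words in a chorus do not inflate this measure of length.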

4.3 Wordcloud with the most recurring positive and negative words

A word cloud gives a visual overview of the most recurring positive and negative words. The comparison.cloud() function plots both negative and positive words in a single word cloud, as follows.

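bing <- get_sentiments("bing")

# Attach the Bing sentiment label (positive or negative) to each word
Word.emotion <- lyrics.lana %>%
  inner_join(bing, by = "word")

# Word.emotion already carries the sentiment column, so no second join is needed.
# The matrix columns are ordered negative, positive: red = negative, green = positive.
graph3 <- Word.emotion %>%
  dplyr::count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "green"),
                   max.words = 100)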

The word ‘love’ is the most frequently used positive word. Other popular positive words are ‘fun’, ‘pretty’ and ‘beautiful’. Among the negative words are ones we’d associate with angry or unhappy songs, like ‘bad’ (“Fake Diamonds”) or ‘crazy’ (“Cruel World”).

4.4 Sentiments in Lana del Rey’s songs

Sentiment analysis is a type of classification in which text is assigned to different classes. These classes can be binary (positive or negative), or there can be multiple classes (happy, sad, angry, etc.). Here, the Bing lexicon is used to count the positive and negative words in each song. The following formula is then applied:

sentiment_rate = (positive - negative) / (positive + negative)

Then, depending on the sentiment rate:

Sentiment rate > 0: positive song
Sentiment rate = 0: neutral song
Sentiment rate < 0: negative song
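As a worked example of the formula and the three thresholds, the sketch below applies them to made-up word counts. The helper names sentiment_rate() and classify_rate() and the counts are hypothetical; they only illustrate the arithmetic and are not part of the analysis code.

```r
# Sentiment rate as defined above: (positive - negative) / (positive + negative)
sentiment_rate <- function(positive, negative) {
  (positive - negative) / (positive + negative)
}

# Map a rate to the three labels used in this analysis
classify_rate <- function(rate) {
  ifelse(rate > 0, "Positive",
         ifelse(rate == 0, "Neutral", "Negative"))
}

# Hypothetical positive and negative word counts for three songs
positive <- c(12, 5, 3)
negative <- c(4, 5, 9)

rate <- sentiment_rate(positive, negative)
rate                 # 0.5  0.0 -0.5
classify_rate(rate)  # "Positive" "Neutral" "Negative"
```

The rate always lies between -1 (only negative words) and 1 (only positive words), which makes songs of different lengths comparable.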

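bing <- get_sentiments("bing")

# Keep only the words that appear in the Bing lexicon
emotion.analysis <- lyrics.lana %>%
  inner_join(bing, by = "word")

# Count positive and negative words per song and compute the sentiment rate.
# A song with only positive or only negative words gets NA in the missing
# column, so its sentiment_rate is NA as well.
lyrics.sentiment <- emotion.analysis %>%
  dplyr::count(song, sentiment) %>%
  spread(key = sentiment, value = n) %>%
  dplyr::mutate(sentiment_rate = (positive - negative) / (positive + negative)) %>%
  dplyr::select(song, sentiment_rate)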

lyrics.sentiment$sentiment_final<-
      ifelse(lyrics.sentiment$sentiment_rate>0, "Positive",
      ifelse(lyrics.sentiment$sentiment_rate==0, "Neutral","Negative"))

lyrics.sentiment <- lyrics.sentiment %>%
  mutate(sentiment_final = replace(sentiment_final, is.na(sentiment_final) , "Neutral" ))
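One consequence of this pipeline is that a song whose lyrics contain only positive or only negative Bing words ends up with an NA rate and is therefore labelled Neutral. A possible variant, shown here purely as an assumption and not as the method that produced the results in this document, is to pass fill = 0 to spread() so that a missing count is treated as zero. The toy_counts data below is hypothetical.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical per-song counts; "Song B" has no negative words at all
toy_counts <- tibble(
  song      = c("Song A", "Song A", "Song B"),
  sentiment = c("positive", "negative", "positive"),
  n         = c(3, 1, 4)
)

toy_rates <- toy_counts %>%
  spread(key = sentiment, value = n, fill = 0) %>%   # missing counts become 0 instead of NA
  dplyr::mutate(sentiment_rate = (positive - negative) / (positive + negative))

toy_rates
```

With this variant, Song B gets a rate of 1 and would be classed as Positive rather than Neutral, so applying it to the real data would likely change the totals reported below.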

graph4 <- lyrics.sentiment %>% 
  group_by(sentiment_final) %>% 
  dplyr::summarise(Total = n())
graph4

## # A tibble: 3 x 2
##   sentiment_final Total
##   <chr>           <int>
## 1 Negative           40
## 2 Neutral            17
## 3 Positive           55

# Percentage share of each sentiment class, used as labels on the pie chart
Var1 <- graph4$Total / sum(graph4$Total) * 100

ggplot(graph4, aes(x = "", y = Total, fill = sentiment_final)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  scale_fill_brewer(palette = "Set1") +
  ggtitle("Lana del Rey", "Sentiments in songs: positive, negative or neutral") +
  labs(x = "", y = "", fill = "Type") +
  geom_text(aes(label = paste0(sprintf("%0.1f", round(Var1, digits = 3)), "%")),
            size = 4,
            color = "white",
            position = position_stack(vjust = 0.5))

Based on the pie chart, 49.1% of the songs evoke positive emotions, 35.7% evoke negative emotions and 15.2% evoke neutral emotions.
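These percentages follow directly from the counts in the summary table. The short check below recomputes them in base R; the vector name totals is hypothetical, while the counts are copied from the table above.

```r
# Song counts per label, copied from the summary table above
totals <- c(Negative = 40, Neutral = 17, Positive = 55)

# Share of each label as a percentage of all classified songs
shares <- round(totals / sum(totals) * 100, 1)
shares
# Negative  Neutral Positive
#     35.7     15.2     49.1
```

The shares are relative to the 112 songs that contain at least one word from the Bing lexicon, since songs without any match are dropped by the inner join.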