In this document, I will be reviewing some of today's most popular songs to see how much positivity and how much negativity their lyrics contain. I am interested in this because I believe music should carry some meaning, yet that meaning seems to be missing from several modern songs I hear. We will look into the 20 most-streamed songs on Spotify, according to its weekly-updated chart, to see whether the music that is popular today has a positive or negative connotation.
To do this, we will take those 20 songs and analyze their lyrics in R, fetching them with the genius package. We will apply two different sentiment scales to judge how positive or negative the lyrics are, and we will also find the most-used words in these songs and their connotations. Throughout this project, the variable n represents how often a word appears. Overall, we will come to a conclusion as to whether the most popular songs on Spotify are positive or negative.
To begin, we load the required packages. (The commented lines show how to install two of them from GitHub.)
library("devtools")
library("tidytext")
#Use devtools::install_github("hadley/tidyverse") in console
library("tidyverse")
#Use devtools::install_github("josiahparry/genius") in console
library("genius")
Now we will take the 20 most-streamed songs on Spotify, put them into a tribble, and fetch their lyrics with add_genius() so we can work with them.
popular_songs <- tribble(
  ~artist, ~song,
  "Ed Sheeran", "Shape of You",
  "Drake", "One Dance",
  "The Chainsmokers", "Closer",
  "Post Malone", "rockstar",
  "Ed Sheeran", "Thinking Out Loud",
  "Major Lazer", "Lean On",
  "Drake", "God's Plan",
  "Luis Fonsi", "Despacito",
  "Justin Bieber", "Love Yourself",
  "Justin Bieber", "Sorry",
  "Camila Cabello", "Havana",
  "The Chainsmokers", "Don't Let Me Down",
  "The Weeknd", "Starboy",
  "Mike Posner", "I Took A Pill In Ibiza",
  "Dua Lipa", "New Rules",
  "DJ Snake", "Let Me Love You",
  "Ed Sheeran", "Photograph",
  "James Arthur", "Say You Won't Let Go",
  "Ed Sheeran", "Perfect",
  "Kendrick Lamar", "Humble"
) %>%
  add_genius(artist, song, "lyrics")
## Joining, by = c("artist", "song")
head(popular_songs)
## # A tibble: 6 x 5
## artist song track_title line lyric
## <chr> <chr> <chr> <int> <chr>
## 1 Ed Sheer~ Shape of~ Shape of Y~ 1 The club isn't the best place to f~
## 2 Ed Sheer~ Shape of~ Shape of Y~ 2 So the bar is where I go
## 3 Ed Sheer~ Shape of~ Shape of Y~ 3 Me and my friends at the table doi~
## 4 Ed Sheer~ Shape of~ Shape of Y~ 4 Drinking fast and then we talk slow
## 5 Ed Sheer~ Shape of~ Shape of Y~ 5 And you come over and start up a c~
## 6 Ed Sheer~ Shape of~ Shape of Y~ 6 And trust me I'll give it a chance~
For starters, let’s see which of these songs is the longest in terms of line count.
popular_songs %>%
  count(track_title) %>%
  arrange(-n)
## # A tibble: 20 x 2
## track_title n
## <chr> <int>
## 1 Shape of You 91
## 2 Despacito 77
## 3 Havana 68
## 4 HUMBLE. 67
## 5 One Dance 66
## 6 <U+200B><U+200B>rockstar 64
## 7 New Rules 62
## 8 Say You Won't Let Go 62
## 9 Starboy 59
## 10 Closer 57
## 11 I Took a Pill in Ibiza 53
## 12 Don't Let Me Down 52
## 13 Let Me Love You 51
## 14 Love Yourself 51
## 15 Lean On 47
## 16 God's Plan 46
## 17 Photograph 46
## 18 Sorry 43
## 19 Thinking Out Loud 35
## 20 Perfect 34
The next step is to tokenize the lyrics, splitting each line into individual words. This lets us discover which words are the most common and how often they appear in the selected songs.
popular_lyrics <- popular_songs %>%
  unnest_tokens(word, lyric)
head(popular_lyrics)
## # A tibble: 6 x 5
## artist song track_title line word
## <chr> <chr> <chr> <int> <chr>
## 1 Ed Sheeran Shape of You Shape of You 1 the
## 2 Ed Sheeran Shape of You Shape of You 1 club
## 3 Ed Sheeran Shape of You Shape of You 1 isn't
## 4 Ed Sheeran Shape of You Shape of You 1 the
## 5 Ed Sheeran Shape of You Shape of You 1 best
## 6 Ed Sheeran Shape of You Shape of You 1 place
Now we can build a table showing which words are the most common across all of these songs.
popular_lyrics %>%
  count(word) %>%
  arrange(-n)
## # A tibble: 1,254 x 2
## word n
## <chr> <int>
## 1 i 343
## 2 you 245
## 3 me 189
## 4 a 184
## 5 and 171
## 6 the 152
## 7 on 148
## 8 my 147
## 9 up 138
## 10 in 127
## # ... with 1,244 more rows
These words are very common, but they tell us nothing about sentiment. We need to exclude them from the results. To do so, we turn to a list of stopwords, which covers exactly the kind of filler words that dominate the counts above.
get_stopwords()
## # A tibble: 175 x 2
## word lexicon
## <chr> <chr>
## 1 i snowball
## 2 me snowball
## 3 my snowball
## 4 myself snowball
## 5 we snowball
## 6 our snowball
## 7 ours snowball
## 8 ourselves snowball
## 9 you snowball
## 10 your snowball
## # ... with 165 more rows
Now that we have our list of stopwords to exclude, we can find the most common words that will hopefully carry some positive or negative connotation.
popular_words <- popular_lyrics %>%
  anti_join(get_stopwords()) %>%
  count(word) %>%
  arrange(-n)
## Joining, by = "word"
Now that the data is more informative than before, we can plot the words that appear most often in the most-streamed songs on Spotify.
popular_words %>%
  top_n(10, n) %>%
  mutate(word = fct_reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  coord_flip() +
  geom_col()
Next, we will analyze the words using tidytext’s get_sentiments() function to determine their sentiment. We will use the Bing lexicon, which classifies each word as either positive or negative, and store the result in another tibble so we can come back to it when we need to.
popular_sentiment <- popular_lyrics %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(track_title) %>%
  count(sentiment, word)
## Joining, by = "word"
Now we’ll see how many of the words in these songs are positive, how many are negative, and how often they appear, viewing the data in side-by-side graphs of positive and negative words.
#Now we can see how many words are positive and negative in these songs, and how often they appear.
popular_sentiment %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ggplot(aes(fct_reorder(word, n), n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~sentiment, scales = "free")
There is a greater variety of negative words, but the positive words appear more frequently. Now we’ll tally how many positive and how many negative words appear in each of the 20 songs.
popular_sentiment %>%
  inner_join(get_sentiments("bing")) %>%
  count(track_title, sentiment) %>%
  spread(sentiment, n, fill = 0)
## Joining, by = c("sentiment", "word")
## # A tibble: 20 x 3
## # Groups: track_title [20]
## track_title negative positive
## <chr> <dbl> <dbl>
## 1 <U+200B><U+200B>rockstar 10 6
## 2 Closer 6 6
## 3 Despacito 1 1
## 4 Don't Let Me Down 2 3
## 5 God's Plan 8 5
## 6 Havana 3 4
## 7 HUMBLE. 14 8
## 8 I Took a Pill in Ibiza 7 6
## 9 Lean On 2 3
## 10 Let Me Love You 5 5
## 11 Love Yourself 8 6
## 12 New Rules 2 4
## 13 One Dance 3 5
## 14 Perfect 4 12
## 15 Photograph 7 4
## 16 Say You Won't Let Go 5 14
## 17 Shape of You 5 9
## 18 Sorry 4 3
## 19 Starboy 13 5
## 20 Thinking Out Loud 5 8
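A quick way to compare songs in this table is the difference between each song’s positive and negative totals. Here is a minimal sketch of that idea, summing word occurrences (the n column) rather than distinct words; the names bing_net, occurrences, and net are just for illustration:
bing_net <- popular_sentiment %>%
  group_by(track_title, sentiment) %>%
  summarize(occurrences = sum(n)) %>% #total occurrences per sentiment per song
  spread(sentiment, occurrences, fill = 0) %>%
  mutate(net = positive - negative) %>% #above 0 leans positive, below 0 negative
  arrange(net)
bing_net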
This shows the variety of positive and negative words that appear in each song, but now we can try to quantify the positivity and negativity of each word that was found in our process. We’ll use the lexicon known as “AFINN” to achieve this. It gives each word a score on a scale from -5 (most negative) to +5 (most positive), as opposed to just saying whether it’s positive or negative.
popular_score <- popular_lyrics %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(track_title) %>%
  count(score, word)
## Joining, by = "word"
Now that we have a data set that quantifies positivity and negativity, we can see how many words in each song fall under each score.
popular_score %>%
  inner_join(get_sentiments("afinn")) %>%
  count(track_title, score) %>%
  spread(score, n, fill = 0)
## Joining, by = c("score", "word")
## # A tibble: 20 x 10
## # Groups: track_title [20]
## track_title `-5` `-4` `-3` `-2` `-1` `1` `2` `3` `4`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 <U+200B><U+200B>rockstar 1 2 1 2 5 4 3 0 0
## 2 Closer 0 0 0 2 4 1 2 2 0
## 3 Despacito 0 0 0 0 1 1 1 0 0
## 4 Don't Let Me Down 0 0 1 0 1 2 1 0 1
## 5 God's Plan 0 1 2 3 3 2 4 2 0
## 6 Havana 0 1 0 0 2 2 1 2 0
## 7 HUMBLE. 1 6 3 3 2 4 2 2 0
## 8 I Took a Pill in ~ 0 1 1 4 3 3 1 0 0
## 9 Lean On 0 0 0 1 2 1 4 0 0
## 10 Let Me Love You 0 1 0 1 0 2 0 3 1
## 11 Love Yourself 0 0 1 4 3 1 4 2 0
## 12 New Rules 0 0 0 3 1 1 2 2 0
## 13 One Dance 0 0 0 0 1 5 2 1 0
## 14 Perfect 0 0 0 1 0 4 5 3 0
## 15 Photograph 0 0 1 3 2 2 1 1 0
## 16 Say You Won't Let~ 0 0 0 3 2 2 8 4 0
## 17 Shape of You 0 0 0 1 3 2 4 2 0
## 18 Sorry 0 0 1 3 3 2 2 0 0
## 19 Starboy 1 0 3 2 4 3 3 1 0
## 20 Thinking Out Loud 0 0 0 1 2 0 5 1 0
Now we’ll put each song into its own object so we can compute their scores separately. The “rockstar” subset caused a few issues here: the track title returned from Genius contains invisible zero-width space characters (printed as <U+200B> in the output above), so matching on the literal string "rockstar" fails. I therefore selected its rows by position instead of by track title.
Shape_of_You <- subset(popular_score, track_title == "Shape of You")
One_Dance <- subset(popular_score, track_title == "One Dance")
Closer <- subset(popular_score, track_title == "Closer")
rockstar <- popular_score[1:18, ] #rows selected by position; see the note above
Thinking_Out_Loud <- subset(popular_score, track_title == "Thinking Out Loud")
Lean_On <- subset(popular_score, track_title == "Lean On")
Gods_Plan <- subset(popular_score, track_title == "God's Plan")
Despacito <- subset(popular_score, track_title == "Despacito")
Love_Yourself <- subset(popular_score, track_title == "Love Yourself")
Sorry <- subset(popular_score, track_title == "Sorry")
Havana <- subset(popular_score, track_title == "Havana")
Dont_Let_Me_Down <- subset(popular_score, track_title == "Don't Let Me Down")
Starboy <- subset(popular_score, track_title == "Starboy")
I_Took_A_Pill_In_Ibiza <- subset(popular_score, track_title == "I Took a Pill in Ibiza")
New_Rules <- subset(popular_score, track_title == "New Rules")
Let_Me_Love_You <- subset(popular_score, track_title == "Let Me Love You")
Photograph <- subset(popular_score, track_title == "Photograph")
Say_You_Wont_Let_Go <- subset(popular_score, track_title == "Say You Won't Let Go")
Perfect <- subset(popular_score, track_title == "Perfect")
Humble <- subset(popular_score, track_title == "HUMBLE.")
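As an aside, the title mismatch can also be handled directly. The following sketch assumes the stray characters are zero-width spaces (U+200B), which is what the <U+200B> markers in the output suggest; rockstar_alt is an illustrative name:
#Strip zero-width spaces from the track titles, then match by name as usual
rockstar_alt <- subset(popular_score, gsub("\u200B", "", track_title) == "rockstar")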
Now we can figure out each song’s overall positivity or negativity score. To finalize the scores, we take the score of each word in the song, multiply it by the number of times it appears, and sum the products. In this case, that means multiplying each song’s score column by its respective n column and summing the result.
Score1 <- sum(Shape_of_You$score*Shape_of_You$n)
Score2 <- sum(One_Dance$score*One_Dance$n)
Score3 <- sum(Closer$score*Closer$n)
Score4 <- sum(rockstar$score*rockstar$n)
Score5 <- sum(Thinking_Out_Loud$score*Thinking_Out_Loud$n)
Score6 <- sum(Lean_On$score*Lean_On$n)
Score7 <- sum(Gods_Plan$score*Gods_Plan$n)
Score8 <- sum(Despacito$score*Despacito$n)
Score9 <- sum(Love_Yourself$score*Love_Yourself$n)
Score10 <- sum(Sorry$score*Sorry$n)
Score11 <- sum(Havana$score*Havana$n)
Score12 <- sum(Dont_Let_Me_Down$score*Dont_Let_Me_Down$n)
Score13 <- sum(Starboy$score*Starboy$n)
Score14 <- sum(I_Took_A_Pill_In_Ibiza$score*I_Took_A_Pill_In_Ibiza$n)
Score15 <- sum(New_Rules$score*New_Rules$n)
Score16 <- sum(Let_Me_Love_You$score*Let_Me_Love_You$n)
Score17 <- sum(Photograph$score*Photograph$n)
Score18 <- sum(Say_You_Wont_Let_Go$score*Say_You_Wont_Let_Go$n)
Score19 <- sum(Perfect$score*Perfect$n)
Score20 <- sum(Humble$score*Humble$n)
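For reference, the same twenty sums can be produced in a single pipeline with group_by() and summarize(); this sketch is equivalent to the code above, with song_scores and total_score as illustrative names:
song_scores <- popular_score %>%
  group_by(track_title) %>%
  summarize(total_score = sum(score * n)) %>% #score of each word times its count
  arrange(-total_score)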
Now for the final results:
#Score for Shape of You
Score1
## [1] 98
#Score for One Dance
Score2
## [1] 20
#Score for Closer
Score3
## [1] -7
#Score for rockstar
Score4
## [1] 8
#Score for Thinking Out Loud
Score5
## [1] 44
#Score for Lean On
Score6
## [1] -6
#Score for God's Plan
Score7
## [1] -24
#Score for Despacito
Score8
## [1] 5
#Score for Love Yourself
Score9
## [1] 41
#Score for Sorry
Score10
## [1] -19
#Score for Havana
Score11
## [1] 19
#Score for Don't Let Me Down
Score12
## [1] 5
#Score for Starboy
Score13
## [1] -18
#Score for I Took a Pill in Ibiza
Score14
## [1] -25
#Score for New Rules
Score15
## [1] 5
#Score for Let Me Love You
Score16
## [1] 46
#Score for Photograph
Score17
## [1] -5
#Score for Say You Won't Let Go
Score18
## [1] 26
#Score for Perfect
Score19
## [1] 47
#Score for HUMBLE.
Score20
## [1] -187
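To view all twenty scores at once, we can also chart the song_scores object sketched earlier; again a minimal sketch:
song_scores %>%
  mutate(track_title = fct_reorder(track_title, total_score)) %>%
  ggplot(aes(track_title, total_score, fill = total_score > 0)) +
  geom_col() +
  coord_flip()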
At the end of the analysis, we have measured the positivity and negativity of the 20 most-streamed songs on Spotify. We discovered that “Shape of You” has the highest positivity score (98) and “HUMBLE.” is by far the most negative (-187). The remaining scores range from -25 to 47, and with 12 of the 20 songs scoring above zero, there seems to be more positivity than negativity in these popular songs overall.
This should not be taken as a definitive answer about the amount of positivity and negativity in modern music. The analysis looked only at individual words and whether they usually carry a positive or negative connotation; some words and phrases were surely used in a different context than the one they were counted for. The sample consisted of only 20 songs, so the results should not be generalized to most music, and they may well differ if someone else attempts their own version of this analysis, which is encouraged.
This document was produced as a final project for MAT 143H - Introduction to Statistics (Honors) at North Shore Community College.
The course was led by Professor Billy Jackson.
Student Name: David Lanfranchi
Semester: Spring 2019