Overview

In this document, I review some of today's most popular songs to measure how much positivity and how much negativity their lyrics contain. I am interested in this because I believe music should carry some meaning, yet meaning often seems to be missing from the modern songs I hear. We will look at the 20 most-streamed songs on Spotify, taken from its weekly-updated streaming chart, to see whether the music that is popular today carries a positive or a negative connotation.

Introduction

In this document, we take the 20 most-streamed songs on Spotify and analyze their lyrics in R, retrieving the lyrics with the genius package. We apply two different sentiment scales to the lyrics to gauge their positivity and negativity, and we also look at the most frequently used words in these songs and whether their connotation is positive or negative. Throughout the project, the variable n represents how often a word or lyric appears. Overall, we will reach a conclusion as to whether the most popular songs on Spotify lean positive or negative.

Exploring the Data

To begin the process, we need the required packages installed and loaded.
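
If any of these packages are missing, they can be installed once from the console first. A minimal sketch (the CRAN call uses the standard install.packages(); the two GitHub installs repeat the comments in the code below):

#Run once in the console, not in the knitted document
install.packages(c("devtools", "tidytext"))
devtools::install_github("hadley/tidyverse")
devtools::install_github("josiahparry/genius")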

library("devtools")
## Warning: package 'devtools' was built under R version 3.5.3
## Warning: package 'usethis' was built under R version 3.5.3
library("tidytext")
## Warning: package 'tidytext' was built under R version 3.5.3
#Use devtools::install_github("hadley/tidyverse") in console
library("tidyverse")
## Warning: package 'ggplot2' was built under R version 3.5.3
## Warning: package 'tibble' was built under R version 3.5.3
## Warning: package 'tidyr' was built under R version 3.5.3
## Warning: package 'readr' was built under R version 3.5.3
## Warning: package 'purrr' was built under R version 3.5.3
## Warning: package 'dplyr' was built under R version 3.5.3
## Warning: package 'stringr' was built under R version 3.5.3
## Warning: package 'forcats' was built under R version 3.5.3
#Use devtools::install_github("josiahparry/genius") in console
library("genius")

Now we list the 20 most-streamed songs on Spotify in a tribble and pull their lyrics with add_genius() so we can work with them.

popular_songs <- tribble(~artist, ~song,
                  "Ed Sheeran", "Shape of You",
                  "Drake", "One Dance",
                  "The Chainsmokers", "Closer",
                  "Post Malone", "rockstar",
                  "Ed Sheeran", "Thinking Out Loud",
                  "Major Lazer", "Lean On",
                  "Drake", "God's Plan",
                  "Luis Fonsi", "Despacito",
                  "Justin Bieber", "Love Yourself",
                  "Justin Bieber", "Sorry",
                  "Camila Cabello", "Havana",
                  "The Chainsmokers", "Don't Let Me Down",
                  "The Weeknd", "Starboy",
                  "Mike Posner", "I Took A Pill In  Ibiza",
                  "Dua Lipa", "New Rules",
                  "DJ Snake", "Let Me Love You",
                  "Ed Sheeran", "Photograph",
                  "James Arthur", "Say You Won't Let Go",
                  "Ed Sheeran", "Perfect",
                  "Kendrick Lamar", "Humble") %>%
  
  add_genius(artist, song, "lyrics")
## Joining, by = c("artist", "song")
head(popular_songs)
## # A tibble: 6 x 5
##   artist    song      track_title  line lyric                              
##   <chr>     <chr>     <chr>       <int> <chr>                              
## 1 Ed Sheer~ Shape of~ Shape of Y~     1 The club isn't the best place to f~
## 2 Ed Sheer~ Shape of~ Shape of Y~     2 So the bar is where I go           
## 3 Ed Sheer~ Shape of~ Shape of Y~     3 Me and my friends at the table doi~
## 4 Ed Sheer~ Shape of~ Shape of Y~     4 Drinking fast and then we talk slow
## 5 Ed Sheer~ Shape of~ Shape of Y~     5 And you come over and start up a c~
## 6 Ed Sheer~ Shape of~ Shape of Y~     6 And trust me I'll give it a chance~

For starters, let’s just see which of these songs is the longest in terms of how many lines they have.

popular_songs %>%
  count(track_title) %>%
  arrange(-n)
## # A tibble: 20 x 2
##    track_title                n
##    <chr>                  <int>
##  1 Shape of You              91
##  2 Despacito                 77
##  3 Havana                    68
##  4 HUMBLE.                   67
##  5 One Dance                 66
##  6 <U+200B><U+200B>rockstar                  64
##  7 New Rules                 62
##  8 Say You Won't Let Go      62
##  9 Starboy                   59
## 10 Closer                    57
## 11 I Took a Pill in Ibiza    53
## 12 Don't Let Me Down         52
## 13 Let Me Love You           51
## 14 Love Yourself             51
## 15 Lean On                   47
## 16 God's Plan                46
## 17 Photograph                46
## 18 Sorry                     43
## 19 Thinking Out Loud         35
## 20 Perfect                   34

The next step in this process is to tokenize the lyrics, splitting each line into individual words. Doing so lets us discover which words are the most common and how often they appear in the selected songs.

popular_lyrics <- popular_songs %>%
  unnest_tokens(word, lyric)

head(popular_lyrics)
## # A tibble: 6 x 5
##   artist     song         track_title   line word 
##   <chr>      <chr>        <chr>        <int> <chr>
## 1 Ed Sheeran Shape of You Shape of You     1 the  
## 2 Ed Sheeran Shape of You Shape of You     1 club 
## 3 Ed Sheeran Shape of You Shape of You     1 isn't
## 4 Ed Sheeran Shape of You Shape of You     1 the  
## 5 Ed Sheeran Shape of You Shape of You     1 best 
## 6 Ed Sheeran Shape of You Shape of You     1 place

Now, we can get a table to discover which words are the most common in all of these songs.

popular_lyrics %>%
  count(word) %>%
  arrange(-n)
## # A tibble: 1,254 x 2
##    word      n
##    <chr> <int>
##  1 i       343
##  2 you     245
##  3 me      189
##  4 a       184
##  5 and     171
##  6 the     152
##  7 on      148
##  8 my      147
##  9 up      138
## 10 in      127
## # ... with 1,244 more rows

These words are very common, but they do not tell us anything we are looking for, so we need to exclude them from the results. To do so, we use a list of stopwords, which contains common function words like the ones that dominated the counts above.

get_stopwords()
## # A tibble: 175 x 2
##    word      lexicon 
##    <chr>     <chr>   
##  1 i         snowball
##  2 me        snowball
##  3 my        snowball
##  4 myself    snowball
##  5 we        snowball
##  6 our       snowball
##  7 ours      snowball
##  8 ourselves snowball
##  9 you       snowball
## 10 your      snowball
## # ... with 165 more rows

Now that we have our list of common words to exclude, we can find the most frequent remaining words, which will hopefully carry some positive or negative connotation.

popular_words <- popular_lyrics %>%
  anti_join(get_stopwords()) %>%
  count(word) %>%
  arrange(-n)
## Joining, by = "word"

Now that the data has been organized into a more informative form, we can plot the words that appear most often across Spotify's most-streamed songs.

popular_words %>%
  top_n(10, n) %>%
  mutate(word = fct_reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  coord_flip() +
  geom_col()

Analysis

Now we will analyze the data using sentiment lexicons from the tidytext package to determine the sentiment of the words. First we use the Bing lexicon, which classifies each word as either positive or negative. We store the result in another tibble so we can return to it when we need to.

popular_sentiment <- popular_lyrics %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(track_title) %>%
  count(sentiment, word)
## Joining, by = "word"

Next, we look at which words in the songs are positive, which are negative, and how often these words appear. We view this as side-by-side graphs of the top positive and negative words.

#Now we can see how many words are positive and negative in these songs, and how often they appear.
popular_sentiment %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ggplot(aes(fct_reorder(word, n), n, fill = sentiment)) +
  geom_col() + coord_flip() +
  facet_wrap(~sentiment, scales = "free")

There are more distinct negative words, but the positive words appear more frequently. Now, we'll count how many distinct positive and negative words appear in each of the 20 selected songs.

popular_sentiment %>%
  inner_join(get_sentiments("bing")) %>%
  count(track_title, sentiment) %>%
  spread(sentiment, n, fill = 0)
## Joining, by = c("sentiment", "word")
## # A tibble: 20 x 3
## # Groups:   track_title [20]
##    track_title            negative positive
##    <chr>                     <dbl>    <dbl>
##  1 <U+200B><U+200B>rockstar                     10        6
##  2 Closer                        6        6
##  3 Despacito                     1        1
##  4 Don't Let Me Down             2        3
##  5 God's Plan                    8        5
##  6 Havana                        3        4
##  7 HUMBLE.                      14        8
##  8 I Took a Pill in Ibiza        7        6
##  9 Lean On                       2        3
## 10 Let Me Love You               5        5
## 11 Love Yourself                 8        6
## 12 New Rules                     2        4
## 13 One Dance                     3        5
## 14 Perfect                       4       12
## 15 Photograph                    7        4
## 16 Say You Won't Let Go          5       14
## 17 Shape of You                  5        9
## 18 Sorry                         4        3
## 19 Starboy                      13        5
## 20 Thinking Out Loud             5        8

This shows the diversity of positive and negative words that appear in each song. Next, we can try to quantify the positivity and negativity of each word found in our process. To do this we use the AFINN lexicon, which assigns each word an integer score from -5 (most negative) to +5 (most positive) rather than simply labeling it positive or negative.
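
For reference, a quick way to peek at the AFINN lexicon itself (output not shown here):

#Each entry pairs a word with an integer score from -5 (most negative) to 5 (most positive)
get_sentiments("afinn") %>%
  head(10)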

popular_score <- popular_lyrics %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(track_title) %>%
  count(score, word)
## Joining, by = "word"

Now that we have a data set quantifying positivity and negativity, we can see how many distinct words in each selected song fall under each score.

popular_score %>%
  inner_join(get_sentiments("afinn")) %>%
  count(track_title, score) %>%
  spread(score, n, fill = 0)
## Joining, by = c("score", "word")
## # A tibble: 20 x 10
## # Groups:   track_title [20]
##    track_title         `-5`  `-4`  `-3`  `-2`  `-1`   `1`   `2`   `3`   `4`
##    <chr>              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 <U+200B><U+200B>rockstar               1     2     1     2     5     4     3     0     0
##  2 Closer                 0     0     0     2     4     1     2     2     0
##  3 Despacito              0     0     0     0     1     1     1     0     0
##  4 Don't Let Me Down      0     0     1     0     1     2     1     0     1
##  5 God's Plan             0     1     2     3     3     2     4     2     0
##  6 Havana                 0     1     0     0     2     2     1     2     0
##  7 HUMBLE.                1     6     3     3     2     4     2     2     0
##  8 I Took a Pill in ~     0     1     1     4     3     3     1     0     0
##  9 Lean On                0     0     0     1     2     1     4     0     0
## 10 Let Me Love You        0     1     0     1     0     2     0     3     1
## 11 Love Yourself          0     0     1     4     3     1     4     2     0
## 12 New Rules              0     0     0     3     1     1     2     2     0
## 13 One Dance              0     0     0     0     1     5     2     1     0
## 14 Perfect                0     0     0     1     0     4     5     3     0
## 15 Photograph             0     0     1     3     2     2     1     1     0
## 16 Say You Won't Let~     0     0     0     3     2     2     8     4     0
## 17 Shape of You           0     0     0     1     3     2     4     2     0
## 18 Sorry                  0     0     1     3     3     2     2     0     0
## 19 Starboy                1     0     3     2     4     3     3     1     0
## 20 Thinking Out Loud      0     0     0     1     2     0     5     1     0

Next, we put each song into its own object so we can total its scores separately. The "rockstar" subset gave me a few issues here (its track title contains invisible characters, which show up as <U+200B> in the tables above), so I selected its rows by index instead of matching on the track title; a pattern-matching alternative is sketched after the block below.

Shape_of_You <- subset(popular_score, track_title == "Shape of You")
One_Dance <- subset(popular_score, track_title == "One Dance")
Closer <- subset(popular_score, track_title == "Closer")
rockstar <- popular_score[1:18, ] #Rows 1-18 hold the "rockstar" entries; selected by index because of the hidden characters in the title
Thinking_Out_Loud <- subset(popular_score, track_title == "Thinking Out Loud")
Lean_On <- subset(popular_score, track_title == "Lean On")
Gods_Plan <- subset(popular_score, track_title == "God's Plan")
Despacito <- subset(popular_score, track_title == "Despacito")
Love_Yourself <- subset(popular_score, track_title == "Love Yourself")
Sorry <- subset(popular_score, track_title == "Sorry")
Havana <- subset(popular_score, track_title == "Havana")
Dont_Let_Me_Down <- subset(popular_score, track_title == "Don't Let Me Down")
Starboy <- subset(popular_score, track_title == "Starboy")
I_Took_A_Pill_In_Ibiza <- subset(popular_score, track_title == "I Took a Pill in Ibiza")
New_Rules <- subset(popular_score, track_title == "New Rules")
Let_Me_Love_You <- subset(popular_score, track_title == "Let Me Love You")
Photograph <- subset(popular_score, track_title == "Photograph")
Say_You_Wont_Let_Go <- subset(popular_score, track_title == "Say You Won't Let Go")
Perfect <- subset(popular_score, track_title == "Perfect")
Humble <- subset(popular_score, track_title == "HUMBLE.")

Now we can compute an overall positivity/negativity score for each song. To do so, we take the AFINN score of each word in the song, multiply it by the number of times that word appears, and sum the results. In code, this means multiplying each song's score column by its respective n column and summing.

Score1 <- sum(Shape_of_You$score*Shape_of_You$n)
Score2 <- sum(One_Dance$score*One_Dance$n)
Score3 <- sum(Closer$score*Closer$n)
Score4 <- sum(rockstar$score*rockstar$n)
Score5 <- sum(Thinking_Out_Loud$score*Thinking_Out_Loud$n)
Score6 <- sum(Lean_On$score*Lean_On$n)
Score7 <- sum(Gods_Plan$score*Gods_Plan$n)
Score8 <- sum(Despacito$score*Despacito$n)
Score9 <- sum(Love_Yourself$score*Love_Yourself$n)
Score10 <- sum(Sorry$score*Sorry$n)
Score11 <- sum(Havana$score*Havana$n)
Score12 <- sum(Dont_Let_Me_Down$score*Dont_Let_Me_Down$n)
Score13 <- sum(Starboy$score*Starboy$n)
Score14 <- sum(I_Took_A_Pill_In_Ibiza$score*I_Took_A_Pill_In_Ibiza$n)
Score15 <- sum(New_Rules$score*New_Rules$n)
Score16 <- sum(Let_Me_Love_You$score*Let_Me_Love_You$n)
Score17 <- sum(Photograph$score*Photograph$n)
Score18 <- sum(Say_You_Wont_Let_Go$score*Say_You_Wont_Let_Go$n)
Score19 <- sum(Perfect$score*Perfect$n)
Score20 <- sum(Humble$score*Humble$n)
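
As a side note, the same per-song totals could also be computed in one grouped summary rather than twenty separate objects; a minimal sketch of that alternative (the individual Score objects above are what we report below):

#Equivalent in one step: sum score * n within each track, then sort
popular_score %>%
  group_by(track_title) %>%
  summarise(total_score = sum(score * n)) %>%
  arrange(-total_score)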

Now for the final results:

#Score for Shape of You
Score1
## [1] 98
#Score for One Dance
Score2
## [1] 20
#Score for Closer
Score3
## [1] -7
#Score for rockstar
Score4
## [1] 8
#Score for Thinking Out Loud
Score5
## [1] 44
#Score for Lean On
Score6
## [1] -6
#Score for God's Plan
Score7
## [1] -24
#Score for Despacito
Score8
## [1] 5
#Score for Love Yourself
Score9
## [1] 41
#Score for Sorry
Score10
## [1] -19
#Score for Havana
Score11
## [1] 19
#Score for Don't Let Me Down
Score12
## [1] 5
#Score for Starboy
Score13
## [1] -18
#Score for I Took a Pill in Ibiza
Score14
## [1] -25
#Score for New Rules
Score15
## [1] 5
#Score for Let Me Love You
Score16
## [1] 46
#Score for Photograph
Score17
## [1] -5
#Score for Say You Won't Let Go
Score18
## [1] 26
#Score for Perfect
Score19
## [1] 47
#Score for HUMBLE.
Score20
## [1] -187

Conclusions

At the end of the analysis, we have a positivity/negativity score for each of the 20 most-streamed songs on Spotify. "Shape of You" has the most positive score (98) and "HUMBLE." has the most negative score (-187). The remaining scores vary widely, from the -20s to the mid 40s. Overall, there appears to be more positivity than negativity in these popular songs.

Limitations

This should not be taken as a definitive answer about the amount of positivity and negativity in modern music. The analysis looked only at individual words and the connotations they usually carry; some words and phrases may appear in a different context than the one they were counted for. The sample consisted of only 20 songs, so the results should not be generalized to music at large, and they may vary if someone else repeats the analysis (which is encouraged).


This document was produced as a final project for MAT 143H - Introduction to Statistics (Honors) at North Shore Community College.
The course was led by Professor Billy Jackson.
Student: David Lanfranchi
Semester: Spring 2019