Introduction

This project analyzes the topics of the most popular songs in order to identify similarities and trends. Lyrics about love, heartbreak, and betrayal have been known to go viral. But are sad songs really what most people want to listen to? Or is it just a fad? Trends can help artists and recording labels decide what to produce next, as they can help predict popularity. This report focuses on the overall chart-topping songs from 2006 to 2015. A single decade keeps the results manageable while still covering a large enough range. Additionally, music released in 2008 and 2012 is more likely to share similarities than music from 1980 and 2012.

The billboard package was used to determine the songs for this inquiry. Information on billboard can be found at the following sources:

https://github.com/mikkelkrogsholm/billboard

https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features

Required Libraries

Before any analysis can begin, the required libraries must be loaded.

library(billboard)
library(tidyverse)
library(tidytext)
library(dplyr)
library(ggplot2)
library(wordcloud2)
library(knitr)
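
If any of these packages are not yet installed, something along these lines should work first (a setup sketch; note that tidyverse already bundles dplyr and ggplot2, and the billboard package may need to be installed from GitHub if it is not available on CRAN):

# install the CRAN packages used in this report
install.packages(c("tidyverse", "tidytext", "wordcloud2", "knitr"))
# billboard can be installed from its GitHub repository if needed
# install.packages("remotes")
remotes::install_github("mikkelkrogsholm/billboard")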

The billboard Package

The billboard package consists of four data sets: ‘lyrics’, ‘spotify_playlists’, ‘spotify_track_data’, and ‘wiki_hot_100s’.

A variety of functions can be used to understand the contents of each data set and decide when to utilize which one. Some of my favorites include str(), names(), head(), and View().
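
For example, a first look at the data might go something like this (an exploratory sketch; the output is not shown in the report):

str(spotify_track_data)    # structure and column types of the track-level data
names(lyrics)              # variable names in the lyrics data
head(wiki_hot_100s)        # first few rows of the chart rankings
# View(spotify_playlists)  # opens a spreadsheet-style viewer in RStudio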

Specializing the Data

The following lines limit the data to the desired time frame. Although most of the data sets go up to 2016, the most recent entries in ‘spotify_playlists’ are from 2015. In order to have equal access to information about all the songs, 2016 is excluded.

lyrics %>% 
  filter(year <= 2015 & year >= 2006) -> words
spotify_playlists %>% 
  filter(year <= 2015 & year >= 2006) -> playlist
spotify_track_data %>% 
  filter(year <= 2015 & year >= 2006) -> songs
wiki_hot_100s %>% 
  filter(year <= 2015 & year >= 2006) -> rank
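
A quick sanity check that the filter caught the intended window (a sketch; the conversion guards against ‘year’ being stored as text in some of these data sets):

range(as.numeric(words$year))   # expect 2006 and 2015
range(as.numeric(rank$year))    # same window for the rankings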

Now the actual analysis can begin.

Valence

One of the variables in the ‘songs’ data set is valence. Spotify defines valence as:

A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Since we are trying to find the similarities between the songs, valence can be useful. The first line of the code below groups songs by year and calculates the mean valence for each. Then that information is graphed.

aggregate(x = songs$valence, by = list(songs$year), FUN = mean) -> mean_val
mean_val %>% 
  ggplot(aes(x = Group.1, y = x)) + geom_col(fill = "deepskyblue",
                                             color = "white") +
  labs(x = "Year", y = "Mean Valence", title = "Mean Valence per Year")

Throughout the decade, mean valence stayed nearly constant: the difference between the highest and lowest yearly mean is only 0.065. On average, the top 100 songs for each year are positive, but just barely. As noted above, the valence scale runs from 0.0 to 1.0, so 0.5 is neutral, and all of the bars fall between 0.5 and 0.6.
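
That spread can be confirmed directly from the ‘mean_val’ object created above (a one-line check):

max(mean_val$x) - min(mean_val$x)   # difference between the highest and lowest yearly mean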

Combining for a New Data Set

Before diving into lyrics and sentiment, I combined two of the data sets. ‘wiki_hot_100s’ is the only place the ranks can be found, and ‘lyrics’ is the only data set that contains the words of the songs. The other three variables - ‘title’, ‘artist’, and ‘year’ - overlap. To avoid the same information appearing twice, I created a subset of the lyrics data without those overlapping columns. Note that all alterations were made on copies of the data sets so the original versions could still be accessed. Next, I joined the subset, ‘words2’, to the full data frame, ‘rank2’. The result is a data set containing the variables ‘no’, ‘title’, ‘artist’, ‘year’, and ‘lyrics’. I named it ‘combo’.

rank2 <- rank      # working copies so the originals stay untouched
songs2 <- songs
words2 <- words
words2 <- subset(words2, select = -c(artist, year))   # drop the overlapping columns
combo <- left_join(rank2, words2, by = "title")
combo$no <- as.numeric(as.character(combo$no))        # rank as a number, not text

The last line of the code chunk converts the ‘no’ column from the character type to the numeric type so the rankings can be sorted in numerical order.
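
A quick way to confirm the conversion, using the ‘combo’ object from the chunk above:

class(combo$no)   # should now be "numeric"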

These lines arrange the songs by ranking as opposed to year:

combo %>% 
  arrange(no) -> order

The Top Songs From 2006 to 2015

There are no lyrics available for the #1 songs in the years 2007, 2008, 2012, 2013, and 2015. The next most popular song with lyrics available for each of those years is analyzed instead. These are the songs we are going to focus on:

rank %>% 
  filter(title %in% c("Bad Day", "Before He Cheats", "Bleeding Love",
                      "Boom Boom Pow", "Tik Tok", "Rolling in the Deep",
                      "Call Me Maybe", "Radioactive",
                      "Happy", "Thinking Out Loud")) -> top2
top2 %>% 
  filter(no != "71") %>% 
  filter(no != "57") %>% kable()
no   title                 artist                year
1    Bad Day               Daniel Powter         2006
6    Before He Cheats      Carrie Underwood      2007
2    Bleeding Love         Leona Lewis           2008
1    Boom Boom Pow         The Black Eyed Peas   2009
1    Tik Tok               Kesha                 2010
1    Rolling in the Deep   Adele                 2011
2    Call Me Maybe         Carly Rae Jepsen      2012
3    Radioactive           Imagine Dragons       2013
1    Happy                 Pharrell Williams     2014
2    Thinking Out Loud     Ed Sheeran            2015
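
The missing lyrics can also be spotted directly from ‘combo’, since the left join leaves NA in the ‘lyrics’ column for titles that have no match in the lyrics data (a quick check, not shown in the original report):

combo %>% 
  filter(no == 1, is.na(lyrics)) %>%   # the #1 songs with no lyrics available
  select(year, title, artist)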

I made a new data set in which to store these songs along with their lyrics:

words %>% 
  filter(title %in% c("Bad Day", "Before He Cheats", "Bleeding Love",
                      "Boom Boom Pow", "Tik Tok", "Rolling in the Deep",
                      "Call Me Maybe", "Radioactive",
                      "Happy", "Thinking Out Loud")) -> top

The reason I used the ‘rank’ and ‘words’ data sets above, as opposed to ‘combo’, is to obtain clean output. Only including the necessary variables minimizes the possibility of repeats or errors when printing.

Now that the data has been trimmed down and finalized, we can get to the sentiment analysis.

Cleaning the Lyrics

Since the lyrics are taken from https://genius.com/, they include annotation words that are not actually sung in the songs. Leaving section labels like “[Verse 1]” and “[Chorus]” in would skew the data, so those words have been removed, along with standard stop words.

# Note: unnest_tokens() lowercases tokens and splits on punctuation, so the
# bracketed section labels below never survive as single tokens; only the
# plain lowercase entries at the end of the filter actually match anything.
# Lowercase "verse" is not in the list, which is why it still appears in the counts.
top %>% 
  unnest_tokens(word, lyrics) %>% 
  anti_join(stop_words) %>% 
  filter(!word %in% c("[Verse 1]", "[Verse 2]", "[Verse 3]", "[Chorus]",
                      "[Chorus 2]","[Bridge]", "[Hook]", "[Pre-Chorus]",
                      "[Pre-Chorus 1]", "[Pre-Chorus 2]", "[Bridge]",
                      "[Breakdown]", "[Outro]", "[Album Intro]",
                      "[Intro: Will.I.Am]", "[Chorus: Will.I.Am]",
                      "[Verse 1: Will.I.Am]", "[Pre-Chorus: Fergie]",
                      "[Verse 2: Taboo]", "[Verse 3: Apl.De.Ap & Will.I.Am]",
                      "[Hook: Will.I.Am]", "[Verse 4: Will.I.Am]",
                      "[Bridge: Fergie]", "[Break]", "[Refrain]",
                      "[Produced by Pharrell Williams]",
                      "Chorus", "chorus", "Verse", "Bridge", "Hook")) %>% 
  count(word, sort = TRUE) -> clean

I have constructed a word cloud of ‘clean’ and shown a table of the top 15 results.

Word Cloud

wordcloud2(clean, color = 'random-light')
head(clean, 15) %>% kable()
word        n
boom      106
whoa       54
gonna      51
love       35
deep       32
rolling    32
met        29
beat       28
fall       27
bleeding   26
feel       26
verse      26
bad        25
age        24
clap       24

Bar Graph

This bar graph visualizes the same data as the word cloud.

clean %>% 
  head(10) %>% 
  ggplot(aes(reorder(word, -n), n, fill = word)) + geom_col() +
  scale_fill_hue(c = 120) +
  labs(x = "Word", y = "Frequency", title = "Most Common Words in Top Songs")

Both the word cloud and the bar graph show a high presence of “boom”. Usually this word would be considered an ad-lib, but it is used intentionally and frequently in “Boom Boom Pow”. The majority of the words in this graph are not strongly positive or negative, so a sentiment analysis will help us get a better understanding of the moods of the songs.

Sentiment Analysis

I chose to use the ‘words’ data set here instead of ‘clean’ because unnest_tokens() and anti_join() filter the data in their own way, and the sample size is about double. A graph of the top 15 words has been produced.

words %>% 
  unnest_tokens(word, lyrics) %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  inner_join(get_sentiments("afinn")) -> sent
sent %>% 
  head(15) %>% 
  ggplot(aes(x = word, value, fill = word)) + geom_col() +
  labs(x = "Word", y = "Value", title = "Overall Most Popular Words
       In the Last Decade") + theme(legend.position = "none")

The results of this graph are interesting. Nine of the fifteen words have a negative value, yet the valence data earlier showed a dominance of positivity.
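
That count can be verified from the ‘sent’ object built above (a quick check):

sent %>% 
  head(15) %>% 
  summarise(negative_words = sum(value < 0))   # how many of the top 15 words score below zero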

Let’s look at the sentiment analysis of another lexicon:

words %>% 
  unnest_tokens(word, lyrics) %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  inner_join(get_sentiments("nrc")) -> sent2
sent2 %>% 
  ggplot(aes(x = sentiment, y = n, fill = sentiment)) +
  geom_col() +
  scale_fill_hue(c = 120) +
  theme(legend.position = "none") +
  labs(title = "Emotions of Overall Most Popular Songs",
       x = "Emotion", y = "Frequency")

Conclusion

“Emotions of Overall Most Popular Songs” supports the findings of the valence graph. Positive is clearly the most frequent emotion. The second highest bar is negative, which aligns with the first sentiment analysis. A conclusion we can draw is that songs often included negative lyrics, even when the general attitude was positive. This leads me to believe the most successful music exhibits a variety of emotions.