Introduction

This project analyzes the topics of the most popular songs in order to identify similarities and trends. Lyrics about love, heartbreak, and betrayal have been known to go viral. But are sad songs really what most people want to listen to? Or is it just a fad? Trends can help artists and recording labels decide what to produce next, as they can help predict popularity. This report focuses on the overall chart-topping songs from 2006 to 2015. A single decade keeps the results manageable while still covering a large enough range. Additionally, music released in 2008 and 2012 is more likely to share similarities than music from 1980 and 2012.

The billboard package was used to determine the songs for this inquiry. Information on billboard can be found at the following sources:

https://github.com/mikkelkrogsholm/billboard

https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features

Required Libraries

Before any analysis can begin, the required libraries must be loaded.

library(billboard)
library(tidyverse)
library(tidytext)
library(dplyr)
library(ggplot2)
library(wordcloud2)
library(knitr)
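
If any of these packages are not yet installed, something along these lines should work first (a setup sketch; note that tidyverse already bundles dplyr and ggplot2, and the billboard package may need to be installed from GitHub if it is not available on CRAN):

# install the CRAN packages used in this report
install.packages(c("tidyverse", "tidytext", "wordcloud2", "knitr"))
# billboard can be installed from its GitHub repository if needed
# install.packages("remotes")
remotes::install_github("mikkelkrogsholm/billboard")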

The billboard Package

The billboard package consists of four data sets: ‘lyrics’, ‘spotify_playlists’, ‘spotify_track_data’, and ‘wiki_hot_100s’.

A variety of functions can be used to understand the contents of each data set and decide when to utilize which one. Some of my favorites include str(), names(), head(), and View().
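
For example, a first look at the data might go something like this (an exploratory sketch; the output is not shown in the report):

str(spotify_track_data)    # structure and column types of the track-level data
names(lyrics)              # variable names in the lyrics data
head(wiki_hot_100s)        # first few rows of the chart rankings
# View(spotify_playlists)  # opens a spreadsheet-style viewer in RStudio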

Specializing the Data

The following lines limit the data to the desired time frame. Although most of the data sets go up to 2016, the most recent entries in ‘spotify_playlists’ are from 2015. In order to have equal access to information about all the songs, 2016 is excluded.

lyrics %>% 
  filter(year <= 2015 & year >= 2006) -> words
spotify_playlists %>% 
  filter(year <= 2015 & year >= 2006) -> playlist
spotify_track_data %>% 
  filter(year <= 2015 & year >= 2006) -> songs
wiki_hot_100s %>% 
  filter(year <= 2015 & year >= 2006) -> rank
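
A quick sanity check that the filter caught the intended window (a sketch; the conversion guards against ‘year’ being stored as text in some of these data sets):

range(as.numeric(words$year))   # expect 2006 and 2015
range(as.numeric(rank$year))    # same window for the rankings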

Now the actual analysis can begin.

Valence

One of the variables in the ‘songs’ data set is valence. Spotify defines valence as:

A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Since we are trying to find the similarities between the songs, valence can be useful. The first line of the code below groups songs by year and calculates the mean valence for each. Then that information is graphed.

aggregate(x = songs$valence, by = list(songs$year), FUN = mean) -> mean_val
mean_val %>% 
  ggplot(aes(x = Group.1, y = x)) + geom_col(fill = "deepskyblue",
                                             color = "white") +
  labs(x = "Year", y = "Mean Valence", title = "Mean Valence per Year")

Throughout the decade, mean valence stayed nearly constant: the difference between the highest and lowest yearly mean is only 0.065. On average, the top 100 songs for each year are positive, but just barely. As noted above, the valence scale runs from 0.0 to 1.0, so 0.5 is neutral, and all of the bars fall between 0.5 and 0.6.
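
That spread can be confirmed directly from the ‘mean_val’ object created above (a one-line check):

max(mean_val$x) - min(mean_val$x)   # difference between the highest and lowest yearly mean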

Combining for a New Data Set

Before diving into lyrics and sentiment, I combined two of the data sets. ‘wiki_hot_100s’ is the only place the ranks can be found, and ‘lyrics’ is the only data set that contains the words of the songs. The other three variables - ‘title’, ‘artist’, and ‘year’ - overlap. To avoid the same information appearing twice, I created a subset of the lyrics data without those overlapping columns. Note that all alterations were made on copies of the data sets so the original versions could still be accessed. Next, I joined the subset, ‘words2’, to the full data frame, ‘rank2’. The result is a data set containing the variables ‘no’, ‘title’, ‘artist’, ‘year’, and ‘lyrics’. I named it ‘combo’.

rank2 <- rank      # working copies so the originals stay untouched
songs2 <- songs
words2 <- words
words2 <- subset(words2, select = -c(artist, year))   # drop the overlapping columns
combo <- left_join(rank2, words2, by = "title")
combo$no <- as.numeric(as.character(combo$no))        # rank as a number, not text

The last line of the code chunk converts the ‘no’ column from the character type to the numeric type so the rankings can be sorted in numerical order.
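
A quick way to confirm the conversion, using the ‘combo’ object from the chunk above:

class(combo$no)   # should now be "numeric"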

These lines arrange the songs by ranking as opposed to year:

combo %>% 
  arrange(no) -> order

The Top Songs From 2006 to 2015

There are no lyrics available for the #1 songs in the years 2007, 2008, 2012, 2013, and 2015. The next most popular song with lyrics available for each of those years is analyzed instead. These are the songs we are going to focus on:

rank %>% 
  filter(title %in% c("Bad Day", "Before He Cheats", "Bleeding Love",
                      "Boom Boom Pow", "Tik Tok", "Rolling in the Deep",
                      "Call Me Maybe", "Radioactive",
                      "Happy", "Thinking Out Loud")) -> top2
top2 %>% 
  filter(no != "71") %>% 
  filter(no != "57") %>% kable()
no   title                 artist                year
1    Bad Day               Daniel Powter         2006
6    Before He Cheats      Carrie Underwood      2007
2    Bleeding Love         Leona Lewis           2008
1    Boom Boom Pow         The Black Eyed Peas   2009
1    Tik Tok               Kesha                 2010
1    Rolling in the Deep   Adele                 2011
2    Call Me Maybe         Carly Rae Jepsen      2012
3    Radioactive           Imagine Dragons       2013
1    Happy                 Pharrell Williams     2014
2    Thinking Out Loud     Ed Sheeran            2015
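
The missing lyrics can also be spotted directly from ‘combo’, since the left join leaves NA in the ‘lyrics’ column for titles that have no match in the lyrics data (a quick check, not shown in the original report):

combo %>% 
  filter(no == 1, is.na(lyrics)) %>%   # the #1 songs with no lyrics available
  select(year, title, artist)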

I made a new data set in which to store these songs along with their lyrics:

words %>% 
  filter(title %in% c("Bad Day", "Before He Cheats", "Bleeding Love",
                      "Boom Boom Pow", "Tik Tok", "Rolling in the Deep",
                      "Call Me Maybe", "Radioactive",
                      "Happy", "Thinking Out Loud")) -> top

The reason I used the ‘rank’ and ‘words’ data sets above, as opposed to ‘combo’, is to obtain clean output. Only including the necessary variables minimizes the possibility of repeats or errors when printing.

Now that the data has been trimmed down and finalized, we can get to the sentiment analysis.

Cleaning the Lyrics

Since the lyrics are taken from https://genius.com/, they include annotation words that are not actually sung in the songs. Leaving section labels like “[Verse 1]” and “[Chorus]” in would skew the data, so those words have been removed, along with standard stop words.

# Note: unnest_tokens() lowercases tokens and splits on punctuation, so the
# bracketed section labels below never survive as single tokens; only the
# plain lowercase entries at the end of the filter actually match anything.
# Lowercase "verse" is not in the list, which is why it still appears in the counts.
top %>% 
  unnest_tokens(word, lyrics) %>% 
  anti_join(stop_words) %>% 
  filter(!word %in% c("[Verse 1]", "[Verse 2]", "[Verse 3]", "[Chorus]",
                      "[Chorus 2]","[Bridge]", "[Hook]", "[Pre-Chorus]",
                      "[Pre-Chorus 1]", "[Pre-Chorus 2]", "[Bridge]",
                      "[Breakdown]", "[Outro]", "[Album Intro]",
                      "[Intro: Will.I.Am]", "[Chorus: Will.I.Am]",
                      "[Verse 1: Will.I.Am]", "[Pre-Chorus: Fergie]",
                      "[Verse 2: Taboo]", "[Verse 3: Apl.De.Ap & Will.I.Am]",
                      "[Hook: Will.I.Am]", "[Verse 4: Will.I.Am]",
                      "[Bridge: Fergie]", "[Break]", "[Refrain]",
                      "[Produced by Pharrell Williams]",
                      "Chorus", "chorus", "Verse", "Bridge", "Hook")) %>% 
  count(word, sort = TRUE) -> clean

I have constructed a word cloud of ‘clean’ and shown a table of the top 15 results.

Word Cloud

wordcloud2(clean, color = 'random-light')
head(clean, 15) %>% kable()
word        n
boom      106
whoa       54
gonna      51
love       35
deep       32
rolling    32
met        29
beat       28
fall       27
bleeding   26
feel       26
verse      26
bad        25
age        24
clap       24

Bar Graph

This bar graph visualizes the same data as the word cloud.

clean %>% 
  head(10) %>% 
  ggplot(aes(reorder(word, -n), n, fill = word)) + geom_col() +
  scale_fill_hue(c = 120) +
  labs(x = "Word", y = "Frequency", title = "Most Common Words in Top Songs")

Both the word cloud and the bar graph show a high presence of “boom”. Usually this word would be considered an ad-lib, but it is used intentionally and frequently in “Boom Boom Pow”. The majority of the words in this graph are not strongly positive or negative, so a sentiment analysis will help us get a better understanding of the moods of the songs.

Sentiment Analysis

I chose to use the ‘words’ data set here instead of ‘clean’ because unnest_tokens() and anti_join() filter the data in their own way, and the sample size is about double. A graph of the top 15 words has been produced.

words %>% 
  unnest_tokens(word, lyrics) %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  inner_join(get_sentiments("afinn")) -> sent
sent %>% 
  head(15) %>% 
  ggplot(aes(x = word, value, fill = word)) + geom_col() +
  labs(x = "Word", y = "Value", title = "Overall Most Popular Words
       In the Last Decade") + theme(legend.position = "none")

The results of this graph are interesting. Nine of the fifteen words have a negative value, yet the valence data earlier showed a dominance of positivity.
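
That count can be verified from the ‘sent’ object built above (a quick check):

sent %>% 
  head(15) %>% 
  summarise(negative_words = sum(value < 0))   # how many of the top 15 words score below zero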

Let’s look at the sentiment analysis of another lexicon:

words %>% 
  unnest_tokens(word, lyrics) %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  inner_join(get_sentiments("nrc")) -> sent2
sent2 %>% 
  ggplot(aes(x = sentiment, y = n, fill = sentiment)) +
  geom_col() +
  scale_fill_hue(c = 120) +
  theme(legend.position = "none") +
  labs(title = "Emotions of Overall Most Popular Songs",
       x = "Emotion", y = "Frequency")

Conclusion

“Emotions of Overall Most Popular Songs” supports the findings of the valence graph. Positive is clearly the most frequent emotion. The second highest bar is negative, which aligns with the first sentiment analysis. A conclusion we can draw is that songs often included negative lyrics, even when the general attitude was positive. This leads me to believe the most successful music exhibits a variety of emotions.