A Background on Valence

Valence is a measure provided by Spotify’s API that describes the musical positivity of a song. The API assigns each song a valence score ranging from 0 to 1. Songs with higher valence scores sound “more positive,” which Spotify defines as happy, euphoric and cheerful, while songs with lower valence scores sound “more negative,” defined as sad, angry and depressing. More information about valence and Spotify’s API features can be found here. I decided not to consider lyric sentiment in this project, as I did not expect the lyrics of popular songs to change much from decade to decade. Most popular songs discuss the same topics, and I imagine that each decade’s most used words would include “love” and “dance.” Furthermore, a love song may sound “sad,” but because its lyrics discuss love, a sentiment analysis would likely read the words as positive.

Finding the Top 100 Songs of Each Decade

First, I needed to compile a data set of the top 100 songs of the 1980s, 1990s, 2000s and 2010s, which I used to represent the music that was successful at that time. To get a consistent list of popular songs, I used VH1’s Top 100 list for every decade in my sample. I added the songs from each list to a playlist on my personal Spotify account; the playlists can be found here: 80s, 90s, 00s, 10s.

Loading Necessary Packages

Next, I loaded the packages I needed for this project in RStudio:

library(tidyverse)
library(tidytext)
library(ggplot2)
library(scales)
library(plotly)
library(purrr)
library(repurrrsive)
library(knitr)
library(ggjoy)
library(ggthemes)
library(spotifyr)
library(SASmarkdown)
library(dplyr)
library(rstatix)
library(lsmeans)

Starting with the Spotify API

I created an app on Spotify’s developer website to access the API, then used my client ID and secret to connect to it from R:

Sys.setenv(SPOTIFY_CLIENT_ID = '39059820891e46da8d97a3c5ccae4fb7')
Sys.setenv(SPOTIFY_CLIENT_SECRET = 'bc0407d2db104323bf1727ee15e79414')

access_token <- get_spotify_access_token()
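Since a client ID and secret are credentials, a common alternative is to keep them out of the script entirely. The sketch below assumes the two variables have already been set outside the code (for example in a user-level .Renviron file, which is an assumption and not something described above) and only checks that they are present before any API call is made. The helper name check_spotify_credentials is hypothetical.

```r
# Hypothetical helper: confirm the Spotify credentials are available as
# environment variables instead of being hard-coded in the script.
check_spotify_credentials <- function(vars = c("SPOTIFY_CLIENT_ID",
                                               "SPOTIFY_CLIENT_SECRET")) {
  values <- Sys.getenv(vars, unset = "")
  missing <- vars[values == ""]
  if (length(missing) > 0) {
    message("Missing environment variables: ", paste(missing, collapse = ", "))
  }
  # TRUE only when every variable has a non-empty value
  length(missing) == 0
}

check_spotify_credentials()
```

If the check returns TRUE, the call to get_spotify_access_token() can run as shown above without the values ever appearing in the code.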

Retrieving My Decades Playlists

Now that the API was ready for use, I loaded my decades playlists and their tracks into RStudio for analysis:

get_playlist('4AQ8bJiAZydOqxSgOAUTu2',  authorization = get_spotify_access_token()) -> tens

get_playlist('6LzFXijHFrA8zOod9tYezZ', authorization = get_spotify_access_token()) -> thousands

get_playlist('5FtLHnlXAQWMoGlDKVURX6', authorization = get_spotify_access_token()) -> nineties

get_playlist('6eGq5wQQhT6GBfKtc1deJi', authorization = get_spotify_access_token()) -> eighties

get_playlist_tracks('4AQ8bJiAZydOqxSgOAUTu2', fields = NULL, limit = 100,
                    offset = 0, market = NULL,
                    authorization = get_spotify_access_token(),
                    include_meta_info = FALSE) -> tentracks
get_playlist_tracks('6LzFXijHFrA8zOod9tYezZ', fields = NULL, limit = 100,
                    offset = 0, market = NULL,
                    authorization = get_spotify_access_token(),
                    include_meta_info = FALSE) -> thousandstracks
get_playlist_tracks('5FtLHnlXAQWMoGlDKVURX6', fields = NULL, limit = 100,
                    offset = 0, market = NULL,
                    authorization = get_spotify_access_token(),
                    include_meta_info = FALSE) -> ninetiestracks
get_playlist_tracks('6eGq5wQQhT6GBfKtc1deJi', fields = NULL, limit = 100,
                    offset = 0, market = NULL,
                    authorization = get_spotify_access_token(),
                    include_meta_info = FALSE) -> eightiestracks

Calculating a Valence Score

Next, I used the Spotify API to retrieve the valence score of each song in my sample. My sample consists of 400 songs, 100 from each of the four decades (1980s, 1990s, 2000s and 2010s).

n = 400

get_playlist_audio_features('10s', '4AQ8bJiAZydOqxSgOAUTu2', authorization = get_spotify_access_token()) -> tenfeatures
get_playlist_audio_features('00s', '6LzFXijHFrA8zOod9tYezZ', authorization = get_spotify_access_token()) -> thousandfeatures
get_playlist_audio_features('90s', '5FtLHnlXAQWMoGlDKVURX6', authorization = get_spotify_access_token()) -> ninetiesfeatures
get_playlist_audio_features('80s', '6eGq5wQQhT6GBfKtc1deJi', authorization = get_spotify_access_token()) -> eightiesfeatures

A Quick Look at Valence

Now that I had a valence score for every song, I wanted to take a quick look at the ten most positive and ten most negative songs of each decade:

tenfeatures %>%
  arrange(-valence) %>%
  select(track.name, valence) %>%
  head(10) %>%
  kable(caption = "Most Positive Songs of the 2010s")

Most Positive Songs of the 2010s

| track.name | valence |
|:-----------|--------:|
| Watch Me (Whip / Nae Nae) | 0.962 |
| All About That Bass | 0.961 |
| Sucker | 0.952 |
| Shake It Off | 0.943 |
| Cheerleader | 0.943 |
| Shape of You | 0.931 |
| Uptown Funk (feat. Bruno Mars) | 0.928 |
| Rude | 0.925 |
| Sunflower - Spider-Man: Into the Spider-Verse | 0.913 |
| Sugar | 0.884 |

thousandfeatures %>%
  arrange(-valence) %>%
  select(track.name, valence) %>%
  head(10) %>%
  kable(caption = "Most Positive Songs of the 2000s")

Most Positive Songs of the 2000s

| track.name | valence |
|:-----------|--------:|
| Family Affair | 0.969 |
| Hey Ya! | 0.965 |
| SexyBack (feat. Timbaland) | 0.964 |
| Get the Party Started | 0.961 |
| Stacy’s Mom | 0.927 |
| Toxic | 0.924 |
| Try Again | 0.915 |
| Hot In Herre | 0.912 |
| Hollaback Girl | 0.904 |
| Let Me Blow Ya Mind | 0.897 |

ninetiesfeatures %>%
  arrange(-valence) %>%
  select(track.name, valence) %>%
  head(10) %>%
  kable(caption = "Most Positive Songs of the 1990s")

Most Positive Songs of the 1990s

| track.name | valence |
|:-----------|--------:|
| Achy Breaky Heart | 0.961 |
| Unbelievable | 0.934 |
| Livin’ la Vida Loca | 0.933 |
| All I Wanna Do | 0.931 |
| Groove Is in the Heart | 0.924 |
| Whatta Man | 0.921 |
| Genie In a Bottle | 0.913 |
| The Way | 0.908 |
| …Baby One More Time | 0.907 |
| Mo Money Mo Problems (feat. Puff Daddy & Mase) - 2014 Remaster | 0.904 |

eightiesfeatures %>%
  arrange(-valence) %>%
  select(track.name, valence) %>%
  head(10) %>%
  kable(caption = "Most Positive Songs of the 1980s")

Most Positive Songs of the 1980s

| track.name | valence |
|:-----------|--------:|
| Hey Mickey | 0.982 |
| She Blinded Me With Science | 0.975 |
| Addicted To Love | 0.975 |
| Push It | 0.973 |
| Start Me Up - Remastered 2009 | 0.971 |
| Like a Virgin | 0.970 |
| Rock Me Amadeus - The American Extended | 0.967 |
| One Thing Leads To Another | 0.965 |
| Our Lips Are Sealed | 0.964 |
| I Can’t Go for That (No Can Do) | 0.963 |

tenfeatures %>%
  arrange(valence) %>%
  select(track.name, valence) %>%
  head(10) %>%
  kable(caption = "Most Negative Songs of the 2010s")

Most Negative Songs of the 2010s

| track.name | valence |
|:-----------|--------:|
| rockstar (feat. 21 Savage) | 0.129 |
| The Hills | 0.137 |
| Perfect | 0.168 |
| Stay With Me | 0.184 |
| Lucid Dreams | 0.218 |
| Grenade | 0.227 |
| Need You Now | 0.231 |
| Radioactive | 0.236 |
| Panda | 0.266 |
| See You Again (feat. Charlie Puth) | 0.283 |

thousandfeatures %>%
  arrange(valence) %>%
  select(track.name, valence) %>%
  head(10) %>%
  kable(caption = "Most Negative Songs of the 2000s")

Most Negative Songs of the 2000s

| track.name | valence |
|:-----------|--------:|
| Lose Yourself | 0.0590 |
| Beautiful | 0.0992 |
| With Arms Wide Open | 0.1410 |
| Hero | 0.1460 |
| Get Low | 0.1540 |
| Hurt | 0.1630 |
| Run It! (feat. Juelz Santana) | 0.2120 |
| Untitled (How Does It Feel) | 0.2210 |
| Bleeding Love | 0.2250 |
| Mr. Brightside | 0.2320 |

ninetiesfeatures %>%
  arrange(valence) %>%
  select(track.name, valence) %>%
  head(10) %>%
  kable(caption = "Most Negative Songs of the 1990s")

Most Negative Songs of the 1990s

| track.name | valence |
|:-----------|--------:|
| My Heart Will Go On - Love Theme from “Titanic” | 0.0382 |
| I Alone | 0.0992 |
| Creep | 0.1040 |
| I Will Always Love You | 0.1100 |
| Black Hole Sun | 0.1470 |
| Nothing Compares 2 U | 0.1610 |
| Linger | 0.2040 |
| Jeremy | 0.2870 |
| One | 0.3250 |
| Vogue | 0.3290 |

eightiesfeatures %>%
  arrange(valence) %>%
  select(track.name, valence) %>%
  head(10) %>%
  kable(caption = "Most Negative Songs of the 1980s")

Most Negative Songs of the 1980s

| track.name | valence |
|:-----------|--------:|
| With Or Without You | 0.122 |
| Total Eclipse of the Heart | 0.189 |
| I Want to Know What Love Is | 0.229 |
| Sister Christian | 0.234 |
| In the Air Tonight | 0.271 |
| Time After Time | 0.294 |
| Mr. Roboto | 0.314 |
| Welcome To The Jungle | 0.316 |
| Every Rose Has Its Thorn | 0.342 |
| Dr. Feelgood | 0.348 |

While these results are interesting to look at, it’s hard to draw any conclusions from the tables, since every decade spans a similarly wide range of valence scores (across the whole sample, max = 0.982 and min = 0.0382). The most positive song in the entire sample is “Hey Mickey” by Toni Basil, and the most negative is “My Heart Will Go On” by Celine Dion. After creating these tables, I listened to both songs to pay attention to the difference in how the music sounds. “Hey Mickey” is certainly more upbeat and cheerful than “My Heart Will Go On.”
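The eight table blocks above all follow the same pattern, so they could be folded into one small function. The sketch below is written in base R so it runs without extra packages; the helper name extreme_tracks and the toy data frame are hypothetical, and the toy valence values are made up rather than taken from Spotify.

```r
# Hypothetical helper: the n highest- or lowest-valence tracks in a data frame
# that has `track.name` and `valence` columns.
extreme_tracks <- function(features, n = 10, most_positive = TRUE) {
  ranked <- features[order(features$valence, decreasing = most_positive), ]
  head(ranked[, c("track.name", "valence")], n)
}

# Toy data for illustration only; these are not real Spotify values.
toy <- data.frame(
  track.name = c("Song A", "Song B", "Song C", "Song D"),
  valence = c(0.91, 0.12, 0.55, 0.73)
)

extreme_tracks(toy, n = 2)                         # Song A, then Song D
extreme_tracks(toy, n = 2, most_positive = FALSE)  # Song B, then Song C
```

Under these assumptions, a call such as extreme_tracks(eightiesfeatures) piped into kable() would stand in for one of the blocks above.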

Visualizing a Change in Valence

As entertaining as it is to look at each song’s musical positivity, I was more interested in how the mean valence of each decade compared with the others. My first step in analyzing a real change was to create a density plot to visualize the distribution of valence within each decade.

twothousand <- tenfeatures %>%
  rbind(thousandfeatures)

generation <- twothousand %>%
  rbind(ninetiesfeatures)

decades <- generation %>%
  rbind(eightiesfeatures)

decades %>%
  ggplot(aes(x = valence, fill = playlist_name)) +
  geom_density(alpha=0.7, color = NA) +
  labs(x="Valence", y="Density") +
  guides(fill=guide_legend(title="Playlist")) +
  theme_minimal() +
  ggtitle("Distribution of Valence in the Music Over the Decades")

This density plot shows that the 80s have the highest concentration of songs with a valence score near 0.8, which appears to be much higher than in the other decades. The 90s show peaks near 0.76 and 0.5, the 2000s are centered around 0.6 but spread fairly evenly across all valence scores, and the 2010s show peaks around 0.3 and 0.6 and sit lowest overall.
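As an optional aside, a character column such as playlist_name is ordered alphabetically by default, so the legend lists 00s, 10s, 80s, 90s rather than running chronologically. The sketch below shows one way to fix the order with an explicit factor. It uses a made-up toy data frame in place of decades and assumes the playlist names are exactly "80s", "90s", "00s" and "10s".

```r
# Toy stand-in for the combined `decades` data frame; values are made up.
toy_decades <- data.frame(
  playlist_name = c("10s", "00s", "90s", "80s", "80s"),
  valence = c(0.40, 0.55, 0.62, 0.80, 0.75),
  stringsAsFactors = FALSE
)

# Character values sort alphabetically: "00s" "10s" "80s" "90s"
sort(unique(toy_decades$playlist_name))

# An explicit factor fixes the order used by legends and model output.
toy_decades$playlist_name <- factor(
  toy_decades$playlist_name,
  levels = c("80s", "90s", "00s", "10s")
)
levels(toy_decades$playlist_name)
```

One caution with this design choice: changing the level order also changes the order in which a fitted model reports its groups, so any weights supplied by position later on would have to follow the new order.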

Calculating Decade Means and Standard Deviations

I calculated each decade’s mean valence score and standard deviation in hopes of seeing a pattern in the change of valence score:

mean(tenfeatures$valence)
## [1] 0.56264
mean(thousandfeatures$valence)
## [1] 0.572642
mean(ninetiesfeatures$valence)
## [1] 0.624274
mean(eightiesfeatures$valence)
## [1] 0.71118
sd(tenfeatures$valence)
## [1] 0.2210674
sd(thousandfeatures$valence)
## [1] 0.2356586
sd(ninetiesfeatures$valence)
## [1] 0.2190559
sd(eightiesfeatures$valence)
## [1] 0.2234216

Mean Valence Score of the 2010s: 0.56264
Mean Valence Score of the 2000s: 0.572642
Mean Valence Score of the 1990s: 0.624274
Mean Valence Score of the 1980s: 0.71118
Standard Deviation of the 2010s: 0.2210674
Standard Deviation of the 2000s: 0.2356586
Standard Deviation of the 1990s: 0.2190559
Standard Deviation of the 1980s: 0.2234216

The means decrease as time progresses, from the highest in the 1980s to the lowest in the 2010s. The standard deviations are roughly equal across decades, which will be helpful when I check the model’s equal-variance assumption.
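The eight separate mean() and sd() calls can also be collapsed into a single grouped summary. The sketch below uses base R’s aggregate() on a made-up toy data frame standing in for the combined decades data; the column names match the ones used above, but the numbers are for illustration only.

```r
# Toy stand-in for `decades`; valence values are made up for illustration.
toy_decades <- data.frame(
  playlist_name = rep(c("80s", "90s"), each = 3),
  valence = c(0.9, 0.7, 0.8, 0.5, 0.6, 0.4)
)

# Mean and standard deviation of valence for every decade in one call.
summary_by_decade <- aggregate(
  valence ~ playlist_name,
  data = toy_decades,
  FUN = function(v) c(mean = mean(v), sd = sd(v))
)
summary_by_decade
```

Keeping the statistics in one table makes it harder for a hand-copied summary to drift away from the computed values.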

Testing the Valence

Now that I’ve computed some basic statistics and visualizations, I feel ready to conduct a hypothesis test comparing the mean valence score across decades. My independent variable in this model is the decade, and my dependent variable is the valence score. In other words, the model predicts the mean valence score of the top 100 songs from their decade.

My null and alternative hypotheses define the parameters of my model and are listed below:

H0 (null): the population mean valence (musical positivity) of the top 100 songs is the same regardless of decade
HA (alternative): at least one decade has a different population mean valence for its top 100 songs than another decade

These can also be represented in terms of population mean, μ, for each decade:
μ1 = 1980s
μ2 = 1990s
μ3 = 2000s
μ4 = 2010s

H0: μ1 = μ2 = μ3 = μ4
HA: at least one μi ≠ μj for some pair of decades i and j.

Defining the Model and Distribution and Checking Assumptions

This model is a one-way analysis of variance (ANOVA) model with an F-distribution, meaning that I am comparing the means of each group, in this case each decade. The ANOVA model has a few assumptions that my sample must meet before I can proceed with my analysis: the population is normally distributed, the groups have equal variances, and the samples are independent of each other. I created the graphs needed to check these assumptions in R:

First, I checked the normality assumption by creating a QQ-Plot of the residuals for the model and a QQ-Plot of valence for each individual group. I also created a histogram to make sure the sample is normally distributed, and box plots to compare the distributions.

Here is the QQ-Plot of the model, the histogram and the boxplots:

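model <- lm(valence ~ playlist_name, data = decades)

# QQ-Plot of the model residuals, with a reference line to judge normality against
qqnorm(model$residuals)
qqline(model$residuals)

# Box plot of valence for each decade
decades %>%
  ggplot(aes(x = playlist_name, y = valence)) +
  geom_boxplot()

# Histogram of valence across the whole sample
decades %>%
  ggplot(aes(x = valence)) +
  geom_histogram(bins = 10, col = "black", fill = "light blue")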

Since most of the points in the QQ-Plot fall along the reference line, I can assume the residuals are approximately normal. The histogram also shows a somewhat normal, bell-shaped distribution, which supports the normality assumption.

Here is the QQ-Plot of each group:

ggplot(decades, aes(sample = valence, color = playlist_name)) + geom_qq() + geom_qq_line()

The points on the QQ-Plot mostly fall along the reference lines, so I can assume normality among each group.

Next, I will check the equal variances assumption by plotting the residuals against their fitted values:

plot(model, 1)

There is no evident relationship between the residuals and their fitted values, so I can assume the variances are equal.
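As a numeric companion to the residual plot, a common rule of thumb (not part of the analysis above) treats the equal-variance assumption as reasonable when the largest group standard deviation is less than about twice the smallest. The sketch below applies that rule to the four standard deviations reported earlier; the threshold of 2 is a convention, not a formal test.

```r
# Standard deviations of valence by decade, copied from the output above.
sds <- c("80s" = 0.2234216, "90s" = 0.2190559,
         "00s" = 0.2356586, "10s" = 0.2210674)

# Rule-of-thumb check: ratio of the largest to the smallest standard deviation.
sd_ratio <- max(sds) / min(sds)
sd_ratio       # roughly 1.08
sd_ratio < 2   # TRUE
```

A ratio this close to 1 points the same way as the residual plot.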

The final assumption, that the samples are independent of each other, depends mostly on how the data were gathered. The sample was not random, since I specifically chose the top 100 songs of each decade. However, decade is the predictor variable in this model, and there are 400 observations in the sample, so I will assume the samples are independent.

ANOVA Test

Now that I know my assumptions are met, I am ready to conduct my hypothesis test.

anova <- decades %>% anova_test(valence ~ playlist_name)
anova
## ANOVA Table (type II tests)
## 
##          Effect DFn DFd     F        p p<.05   ges
## 1 playlist_name   3 396 9.124 7.49e-06     * 0.065

From this table, I see an F-value of 9.124 and a p-value of <0.0001. The F-value tells me that the variation in valence between the decade means is about 9.124 times the variation within the decades, which is far more than I would expect if decade (defined here as playlist_name) had no effect on the mean valence score.

Based on these results, I have sufficient evidence to conclude that the population mean valence score differs for at least one of the decades.
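To make the F-value less of a black box, the sketch below computes it by hand as the between-group mean square divided by the within-group mean square, then compares the result with R’s built-in ANOVA table. The data are a made-up toy example with three groups, not the valence data, and all variable names are illustrative.

```r
# Toy data: three groups of four made-up scores each.
toy <- data.frame(
  group = rep(c("a", "b", "c"), each = 4),
  score = c(0.2, 0.3, 0.4, 0.3,
            0.5, 0.6, 0.4, 0.5,
            0.8, 0.7, 0.9, 0.8)
)

grand_mean  <- mean(toy$score)
group_means <- tapply(toy$score, toy$group, mean)
group_sizes <- tapply(toy$score, toy$group, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((toy$score - group_means[as.character(toy$group)])^2)

# Mean squares divide by their degrees of freedom (k - 1 and N - k)
k <- length(group_means)
n_total <- nrow(toy)
f_by_hand <- (ss_between / (k - 1)) / (ss_within / (n_total - k))

# The same statistic from R's built-in ANOVA table
f_builtin <- anova(lm(score ~ group, data = toy))$`F value`[1]
c(by_hand = f_by_hand, builtin = f_builtin)
```

The two numbers agree, which shows that F is a ratio of variances: how far the group means sit from each other relative to how much the scores vary inside each group.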

Which Decade Had The Happiest Music?

Now that I have sufficient evidence that the decades do not all share the same mean valence score, I am interested in finding out which decade has the highest mean valence score and whether it is significantly higher than the other decades.

To find this out, first, I will calculate the least squares means of each decade and find the difference between the means using Tukey’s Honest Significant Differences. Tukey’s HSD allows for the test of multiple pairwise comparisons between means.

lsmeans(model, specs = "playlist_name")
##  playlist_name lsmean     SE  df lower.CL upper.CL
##  00s            0.573 0.0225 396    0.528    0.617
##  10s            0.563 0.0225 396    0.518    0.607
##  80s            0.711 0.0225 396    0.667    0.755
##  90s            0.624 0.0225 396    0.580    0.668
## 
## Confidence level used: 0.95
tukey_hsd(model)
## # A tibble: 6 x 9
##   term  group1 group2 null.value estimate conf.low conf.high   p.adj
## * <chr> <chr>  <chr>       <dbl>    <dbl>    <dbl>     <dbl>   <dbl>
## 1 play… 00s    10s             0  -0.0100  -0.0921   0.0721  9.89e-1
## 2 play… 00s    80s             0   0.139    0.0565   0.221   9.93e-5
## 3 play… 00s    90s             0   0.0516  -0.0304   0.134   3.67e-1
## 4 play… 10s    80s             0   0.149    0.0665   0.231   2.44e-5
## 5 play… 10s    90s             0   0.0616  -0.0204   0.144   2.14e-1
## 6 play… 80s    90s             0  -0.0869  -0.169   -0.00485 3.31e-2
## # … with 1 more variable: p.adj.signif <chr>

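The pairwise comparisons between the mean valence score of each decade show me that the 80s have a higher mean valence score than the 2010s, 2000s, and 1990s, and each of those three differences has an adjusted p-value below 0.05. None of the comparisons among the other three decades is significant.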
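For readers without rstatix installed, base R offers the same kind of pairwise comparison through TukeyHSD() applied to an aov() fit. The sketch below is a cross-check on a made-up toy data set with three groups, not a rerun of the valence analysis.

```r
# Toy data: made-up scores for three groups.
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 4)),
  score = c(0.2, 0.3, 0.4, 0.3,
            0.5, 0.6, 0.4, 0.5,
            0.8, 0.7, 0.9, 0.8)
)

# Fit the one-way model, then run Tukey's HSD on the grouping factor.
fit <- aov(score ~ group, data = toy)
tukey <- TukeyHSD(fit, "group", conf.level = 0.95)

# Rows are named like "b-a", and `diff` is mean(b) - mean(a);
# `lwr` and `upr` bound the difference and `p adj` is the adjusted p-value.
tukey$group
```

The sign convention matters when reading such tables. In the tukey_hsd output above, each estimate equals the group2 mean minus the group1 mean, which is why the 00s versus 80s row is positive.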

How Happy Were the 80s?

Now that I’ve found that the 1980s had a higher mean musical positivity score than the other decades, I want to see whether the 80s mean is statistically significantly higher than the mean valence score of the other three decades taken together. To do this, I will create a contrast statement and run a hypothesis test on it.

My contrast statement will be comparing the mean valence score of the top 100 songs of the 80s to the mean valence score of the top 100 songs of the other decades.

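# lsmeans lists the decades alphabetically (00s, 10s, 80s, 90s), so the weight of 3
# goes in the third position; c(3, -1, -1, -1) would single out the 00s instead.
decade_means <- lsmeans(model, specs = "playlist_name")
contrast(decade_means, list("80svsall" = c(-1, -1, 3, -1)))

Working from the least squares means above, this contrast has an estimate of about 3(0.711) − 0.573 − 0.563 − 0.624 = 0.373 with a standard error of 0.0779, which gives a t ratio of about 4.8 on 396 degrees of freedom and a p-value below 0.0001. If the mean valence score of the top 100 songs of the 1980s were no different from that of the other decades, a contrast this far from zero would almost never occur by chance. Based on this p-value, I have sufficient evidence to conclude that the mean valence score of popular songs in the 80s is significantly higher than in the 1990s, 2000s and 2010s.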
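Because lsmeans reports the decades in alphabetical order (00s, 10s, 80s, 90s), a contrast given as a bare vector of weights is matched to the decades by position. The sketch below recomputes the contrast by hand with named weights, so each weight is tied to its decade explicitly. It uses the rounded least squares means and standard error printed above, so the results are approximate, and the helper name contrast_by_hand is hypothetical.

```r
# Least squares means and their common standard error, copied from the
# lsmeans output above (alphabetical level order).
ls_means <- c("00s" = 0.573, "10s" = 0.563, "80s" = 0.711, "90s" = 0.624)
se_mean  <- 0.0225
df_resid <- 396

# Hypothetical helper: evaluate a contrast from named weights so that each
# weight is matched to its decade by name rather than by position.
contrast_by_hand <- function(weights) {
  stopifnot(setequal(names(weights), names(ls_means)),
            abs(sum(weights)) < 1e-12)
  estimate <- sum(weights * ls_means[names(weights)])
  se <- se_mean * sqrt(sum(weights^2))
  t_ratio <- estimate / se
  c(estimate = estimate, se = se, t = t_ratio,
    p = 2 * pt(-abs(t_ratio), df = df_resid))
}

# Weight of 3 on the 80s, -1 on each of the other decades
contrast_by_hand(c("80s" = 3, "90s" = -1, "00s" = -1, "10s" = -1))
```

With the 3 on the 80s the estimate is positive, about 0.37. Placing the 3 in the first position instead puts it on the 00s and gives 3(0.573) − 0.563 − 0.711 − 0.624 ≈ −0.18, so an estimate of −0.18 indicates that the weights were lined up against the 00s.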

So, The 80s Really Were Better

Throughout my analysis of valence scores of the most popular songs of the 1980s, 1990s, 2000s and 2010s, I have found that the 1980s have a significantly higher mean valence score than the other decades. In other words, the 1980s top 100 songs were much more cheerful, euphoric and happy on average when compared to the top 100 songs from other decades.

So, if you need a pick-me-up or you’re tired of the depressing songs on the radio, put on Bon Jovi’s “Livin’ on a Prayer” and enjoy the good ol’ days.