Valence is a measure provided by Spotify’s API that describes the musical positivity of a song. The API assigns each song a valence score ranging from 0 to 1. Songs with higher valence scores sound “more positive”, which Spotify defines as happy, euphoric and cheerful, while songs with lower valence scores sound “more negative”, defined as sad, angry and depressing. More information about valence and Spotify’s API features can be found here. I decided not to consider lyric sentiment in this project, as I did not expect the lyrics of popular songs to change significantly between decades. Most popular songs discuss the same topics, and I imagine that each decade’s most used words would include “love” and “dance.” Furthermore, a love song may sound “sad”, but because its lyrics discuss love, a sentiment analysis would likely read the words as positive.
According to this article, since the 1950s, popular music has changed from being sappy to being upbeat and dark, with added profanity. I was interested in just how drastic the change in positivity of popular music actually is, and after studying Spotify’s API, I wanted to compare the valence score of popular songs across decades. I hypothesize that I will find a decrease in valence score from older decades (the 80s and 90s) to newer decades (2000s and 2010s).
First, I needed to compile a data set of the top 100 songs of the 1980s, 1990s, 2000s and 2010s. I used the top 100 songs from each decade to represent the music that was successful at that time. To keep the source of popular songs consistent, I used VH1’s Top 100 list for every decade in my sample. I added each of these songs to a playlist on my personal Spotify account; each of these playlists can be found here: 80s, 90s, 00s, 10s.
Next, I loaded the packages needed for this project in RStudio:
library(tidyverse)
library(tidytext)
library(ggplot2)
library(scales)
library(plotly)
library(purrr)
library(repurrrsive)
library(knitr)
library(ggjoy)
library(ggthemes)
library(spotifyr)
library(SASmarkdown)
library(dplyr)
library(rstatix)
library(lsmeans)
I created an app to access Spotify’s API on their developer website, and used my client ID and secret to access the program in R:
# Replace the placeholders with the client ID and secret of your own Spotify app;
# real credentials should not be published
Sys.setenv(SPOTIFY_CLIENT_ID = 'your-client-id')
Sys.setenv(SPOTIFY_CLIENT_SECRET = 'your-client-secret')
access_token <- get_spotify_access_token()
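As a small aside, the credentials do not have to be typed into the script itself. The sketch below assumes they are stored as environment variables (for example in a `~/.Renviron` file) under the same two names used above. The helper `have_spotify_credentials()` is a hypothetical name of my own, not part of spotifyr; it only checks that both variables are set before a token is requested.

```r
# Hypothetical helper (not part of spotifyr): returns TRUE only when both
# credentials are available as non-empty environment variables.
have_spotify_credentials <- function() {
  id     <- Sys.getenv("SPOTIFY_CLIENT_ID")
  secret <- Sys.getenv("SPOTIFY_CLIENT_SECRET")
  nzchar(id) && nzchar(secret)
}

if (have_spotify_credentials()) {
  message("Spotify credentials found in the environment.")
} else {
  message("Set SPOTIFY_CLIENT_ID and SPOTIFY_CLIENT_SECRET, for example in ~/.Renviron.")
}
```

Keeping the secret out of the script means the code can be shared without exposing the app’s credentials.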
Now that the API was ready for use, I loaded my decade playlists and their tracks into RStudio for analysis:
get_playlist('4AQ8bJiAZydOqxSgOAUTu2', authorization = get_spotify_access_token()) -> tens
get_playlist('6LzFXijHFrA8zOod9tYezZ', authorization = get_spotify_access_token()) -> thousands
get_playlist('5FtLHnlXAQWMoGlDKVURX6', authorization = get_spotify_access_token()) -> nineties
get_playlist('6eGq5wQQhT6GBfKtc1deJi', authorization = get_spotify_access_token()) -> eighties
get_playlist_tracks('4AQ8bJiAZydOqxSgOAUTu2', fields = NULL, limit = 100,
offset = 0, market = NULL,
authorization = get_spotify_access_token(),
include_meta_info = FALSE) -> tentracks
get_playlist_tracks('6LzFXijHFrA8zOod9tYezZ', fields = NULL, limit = 100,
offset = 0, market = NULL,
authorization = get_spotify_access_token(),
include_meta_info = FALSE) -> thousandstracks
get_playlist_tracks('5FtLHnlXAQWMoGlDKVURX6', fields = NULL, limit = 100,
offset = 0, market = NULL,
authorization = get_spotify_access_token(),
include_meta_info = FALSE) -> ninetiestracks
get_playlist_tracks('6eGq5wQQhT6GBfKtc1deJi', fields = NULL, limit = 100,
offset = 0, market = NULL,
authorization = get_spotify_access_token(),
include_meta_info = FALSE) -> eightiestracks
Next, I used the Spotify API to retrieve the valence score of each song in my sample. My sample consists of 400 songs, 100 per decade, covering the 1980s, 1990s, 2000s and 2010s.
n = 400
get_playlist_audio_features('10s', '4AQ8bJiAZydOqxSgOAUTu2', authorization = get_spotify_access_token()) -> tenfeatures
get_playlist_audio_features('00s', '6LzFXijHFrA8zOod9tYezZ', authorization = get_spotify_access_token()) -> thousandfeatures
get_playlist_audio_features('90s', '5FtLHnlXAQWMoGlDKVURX6', authorization = get_spotify_access_token()) -> ninetiesfeatures
get_playlist_audio_features('80s', '6eGq5wQQhT6GBfKtc1deJi', authorization = get_spotify_access_token()) -> eightiesfeatures
Now that I had the valence score for every song, I wanted to take a quick look at the ten most positive and ten most negative songs of each decade:
tenfeatures %>%
arrange(-valence) %>%
select(track.name, valence) %>%
head(10) %>%
kable(caption = "Most Positive Songs of the 2010s")
track.name | valence |
---|---|
Watch Me (Whip / Nae Nae) | 0.962 |
All About That Bass | 0.961 |
Sucker | 0.952 |
Shake It Off | 0.943 |
Cheerleader | 0.943 |
Shape of You | 0.931 |
Uptown Funk (feat. Bruno Mars) | 0.928 |
Rude | 0.925 |
Sunflower - Spider-Man: Into the Spider-Verse | 0.913 |
Sugar | 0.884 |
thousandfeatures %>%
arrange(-valence) %>%
select(track.name, valence) %>%
head(10) %>%
kable(caption = "Most Positive Songs of the 2000s")
track.name | valence |
---|---|
Family Affair | 0.969 |
Hey Ya! | 0.965 |
SexyBack (feat. Timbaland) | 0.964 |
Get the Party Started | 0.961 |
Stacy’s Mom | 0.927 |
Toxic | 0.924 |
Try Again | 0.915 |
Hot In Herre | 0.912 |
Hollaback Girl | 0.904 |
Let Me Blow Ya Mind | 0.897 |
ninetiesfeatures %>%
arrange(-valence) %>%
select(track.name, valence) %>%
head(10) %>%
kable(caption = "Most Positive Songs of the 1990s")
track.name | valence |
---|---|
Achy Breaky Heart | 0.961 |
Unbelievable | 0.934 |
Livin’ la Vida Loca | 0.933 |
All I Wanna Do | 0.931 |
Groove Is in the Heart | 0.924 |
Whatta Man | 0.921 |
Genie In a Bottle | 0.913 |
The Way | 0.908 |
…Baby One More Time | 0.907 |
Mo Money Mo Problems (feat. Puff Daddy & Mase) - 2014 Remaster | 0.904 |
eightiesfeatures %>%
arrange(-valence) %>%
select(track.name, valence) %>%
head(10) %>%
kable(caption = "Most Positive Songs of the 1980s")
track.name | valence |
---|---|
Hey Mickey | 0.982 |
She Blinded Me With Science | 0.975 |
Addicted To Love | 0.975 |
Push It | 0.973 |
Start Me Up - Remastered 2009 | 0.971 |
Like a Virgin | 0.970 |
Rock Me Amadeus - The American Extended | 0.967 |
One Thing Leads To Another | 0.965 |
Our Lips Are Sealed | 0.964 |
I Can’t Go for That (No Can Do) | 0.963 |
tenfeatures %>%
arrange(valence) %>%
select(track.name, valence) %>%
head(10) %>%
kable(caption = "Most Negative Songs of the 2010s")
track.name | valence |
---|---|
rockstar (feat. 21 Savage) | 0.129 |
The Hills | 0.137 |
Perfect | 0.168 |
Stay With Me | 0.184 |
Lucid Dreams | 0.218 |
Grenade | 0.227 |
Need You Now | 0.231 |
Radioactive | 0.236 |
Panda | 0.266 |
See You Again (feat. Charlie Puth) | 0.283 |
thousandfeatures %>%
arrange(valence) %>%
select(track.name, valence) %>%
head(10) %>%
kable(caption = "Most Negative Songs of the 2000s")
track.name | valence |
---|---|
Lose Yourself | 0.0590 |
Beautiful | 0.0992 |
With Arms Wide Open | 0.1410 |
Hero | 0.1460 |
Get Low | 0.1540 |
Hurt | 0.1630 |
Run It! (feat. Juelz Santana) | 0.2120 |
Untitled (How Does It Feel) | 0.2210 |
Bleeding Love | 0.2250 |
Mr. Brightside | 0.2320 |
ninetiesfeatures %>%
arrange(valence) %>%
select(track.name, valence) %>%
head(10) %>%
kable(caption = "Most Negative Songs of the 1990s")
track.name | valence |
---|---|
My Heart Will Go On - Love Theme from “Titanic” | 0.0382 |
I Alone | 0.0992 |
Creep | 0.1040 |
I Will Always Love You | 0.1100 |
Black Hole Sun | 0.1470 |
Nothing Compares 2 U | 0.1610 |
Linger | 0.2040 |
Jeremy | 0.2870 |
One | 0.3250 |
Vogue | 0.3290 |
eightiesfeatures %>%
arrange(valence) %>%
select(track.name, valence) %>%
head(10) %>%
kable(caption = "Most Negative Songs of the 1980s")
track.name | valence |
---|---|
With Or Without You | 0.122 |
Total Eclipse of the Heart | 0.189 |
I Want to Know What Love Is | 0.229 |
Sister Christian | 0.234 |
In the Air Tonight | 0.271 |
Time After Time | 0.294 |
Mr. Roboto | 0.314 |
Welcome To The Jungle | 0.316 |
Every Rose Has Its Thorn | 0.342 |
Dr. Feelgood | 0.348 |
While these results are interesting to look at, it’s hard to draw any conclusions from the tables, since each decade’s minimum and maximum valence scores are close to those of the other decades (across the whole sample, max = 0.982 and min = 0.0382). The most positive song in the entire sample is “Hey Mickey” by Toni Basil, and the most negative is “My Heart Will Go On” by Celine Dion. After creating these tables, I listened to both songs to hear the difference in their sound. “Hey Mickey” is certainly more upbeat and cheerful than “My Heart Will Go On.”
As entertaining as it is to look at each song’s musical positivity, I was more interested in how the decades compare to one another. My first step was to create a density plot to visualize the distribution of valence within each decade.
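The sample-wide extremes can also be pulled out directly instead of being read off eight tables. The sketch below is only an illustration: `songs` is a toy data frame I built from the top and bottom row of each table above (with one track title shortened), not the full 400-song data set, and it assumes a dplyr version that provides `slice_max()` and `slice_min()`.

```r
library(dplyr)

# Toy data: the highest and lowest valence row from each decade's tables above
songs <- data.frame(
  playlist_name = c("80s", "80s", "90s", "90s", "00s", "00s", "10s", "10s"),
  track.name = c("Hey Mickey", "With Or Without You",
                 "Achy Breaky Heart", "My Heart Will Go On",
                 "Family Affair", "Lose Yourself",
                 "Watch Me (Whip / Nae Nae)", "rockstar (feat. 21 Savage)"),
  valence = c(0.982, 0.122, 0.961, 0.0382, 0.969, 0.0590, 0.962, 0.129)
)

# Most positive and most negative song across all decades
most_positive <- slice_max(songs, valence, n = 1)
most_negative <- slice_min(songs, valence, n = 1)

# Minimum and maximum valence within each decade
decade_range <- songs %>%
  group_by(playlist_name) %>%
  summarise(min_valence = min(valence), max_valence = max(valence))

most_positive
most_negative
decade_range
```

On the full data, the same calls could be applied to the combined data frame of all four playlists.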
twothousand <- tenfeatures %>%
rbind(thousandfeatures)
generation <- twothousand %>%
rbind(ninetiesfeatures)
decades <- generation %>%
rbind(eightiesfeatures)
decades %>%
ggplot(aes(x = valence, fill = playlist_name)) +
geom_density(alpha=0.7, color = NA) +
labs(x="Valence", y="Density") +
guides(fill=guide_legend(title="Playlist")) +
theme_minimal() +
ggtitle("Distribution of Valence in the Music Over the Decades")
This density plot shows that the 80s have the highest concentration of songs with a valence score near 0.8, which appears to be much higher than the other decades. The 90s have peaks near 0.76 and 0.5, the 2000s are centered around 0.6 but fairly evenly spread across all valence scores, and the 2010s peak around 0.3 and 0.6, the lowest of the four decades.
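The three chained `rbind()` calls used to build `decades` could also be written as a single `bind_rows()` call. The sketch below uses toy stand-ins for the four feature tables (two rows each, with valence values taken from the tables above), so the `*_toy` names are assumptions for illustration only. It also shows one optional step: turning `playlist_name` into a factor with chronological levels, so the legend reads 80s, 90s, 00s, 10s instead of alphabetical order.

```r
library(dplyr)

# Toy stand-ins for the four audio feature tables
eighties_toy  <- data.frame(playlist_name = "80s", valence = c(0.982, 0.122))
nineties_toy  <- data.frame(playlist_name = "90s", valence = c(0.961, 0.0382))
thousands_toy <- data.frame(playlist_name = "00s", valence = c(0.969, 0.0590))
tens_toy      <- data.frame(playlist_name = "10s", valence = c(0.962, 0.129))

# Stack all four tables in one call
decades_toy <- bind_rows(eighties_toy, nineties_toy, thousands_toy, tens_toy)

# Optional: order the decades chronologically rather than alphabetically
decades_toy <- decades_toy %>%
  mutate(playlist_name = factor(playlist_name,
                                levels = c("80s", "90s", "00s", "10s")))

decades_toy
```

Note that changing the level order also changes the order in which a model lists the decades, so any later code that depends on that order would need to match it.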
I calculated each decade’s mean valence score and standard deviation in hopes of seeing a pattern in the change of valence score:
mean(tenfeatures$valence)
## [1] 0.56264
mean(thousandfeatures$valence)
## [1] 0.572642
mean(ninetiesfeatures$valence)
## [1] 0.624274
mean(eightiesfeatures$valence)
## [1] 0.71118
sd(tenfeatures$valence)
## [1] 0.2210674
sd(thousandfeatures$valence)
## [1] 0.2356586
sd(ninetiesfeatures$valence)
## [1] 0.2190559
sd(eightiesfeatures$valence)
## [1] 0.2234216
Mean Valence Score of the 2010s: 0.56264
Mean Valence Score of the 2000s: 0.572642
Mean Valence Score of the 1990s: 0.624274
Mean Valence Score of the 1980s: 0.71118
Standard Deviation of the 2010s: 0.2210674
Standard Deviation of the 2000s: 0.2356586
Standard Deviation of the 1990s: 0.2190559
Standard Deviation of the 1980s: 0.2234216
The decade means decrease as time progresses, with the highest mean in the 1980s and the lowest in the 2010s. The standard deviations are roughly equal across decades, which will be helpful when I check the model’s assumptions.
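The eight separate `mean()` and `sd()` calls can also be collapsed into one grouped summary. The sketch below assumes a data frame shaped like `decades`, with a `playlist_name` and a `valence` column. To stay self-contained it uses a toy data frame of three songs per decade taken from the tables above, so the numbers it prints are not the decade means reported here.

```r
library(dplyr)

# Toy data: three songs per decade, values copied from the tables above
decades_toy <- data.frame(
  playlist_name = rep(c("80s", "90s", "00s", "10s"), each = 3),
  valence = c(0.982, 0.975, 0.122,
              0.961, 0.934, 0.0382,
              0.969, 0.965, 0.0590,
              0.962, 0.961, 0.129)
)

# One row per decade with the sample size, mean and standard deviation
decade_summary <- decades_toy %>%
  group_by(playlist_name) %>%
  summarise(n = n(),
            mean_valence = mean(valence),
            sd_valence = sd(valence))

decade_summary
```

Copying the printed summary into the text, rather than retyping the figures, also avoids transcription slips.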
Now that I’ve conducted some basic statistics and visualizations, I felt ready to conduct a hypothesis test comparing the mean valence score across decades. My independent variable in this model is the decade, and my dependent variable is the valence score. This model represents the mean valence score of the top 100 songs predicted by the decade of the top 100 songs.
My null and alternative hypotheses define the parameters of my model and are listed below:
H0 (null): the population mean valence (musical positivity) of the top 100 songs is the same for every decade
HA (alternative): at least one decade has a different population mean valence for its top 100 songs than another decade
These can also be represented in terms of the population mean, μ, of each decade:
μ1 = 1980s, μ2 = 1990s, μ3 = 2000s, μ4 = 2010s
H0: μ1 = μ2 = μ3 = μ4
HA: at least one μi differs from another μj
This model is a one-way analysis of variance (ANOVA) model with an F-distribution, meaning that I am comparing the means of each group, in this case, decade. The ANOVA model has a few assumptions that my sample must meet before I can proceed with my analysis: the population is normally distributed within each group, the groups have equal variances, and the samples are independent of each other. I created the necessary graphs to check these assumptions using code:
First, I checked the normality assumption by creating a QQ-Plot of the residuals for the model and a QQ-Plot of valence for each individual group. I also created a histogram to check that the sample is approximately normally distributed, and box plots to compare the distributions.
Here is the QQ-Plot of the model, the histogram and the boxplots:
model <- lm(valence ~ playlist_name, data = decades)
# QQ-Plot of the model residuals, with a reference line
qqnorm(model$residuals)
qqline(model$residuals)
# Box plot of valence for each decade
decades %>%
  ggplot(aes(x = playlist_name, y = valence)) + geom_boxplot()
# Histogram of valence across the whole sample
decades %>%
  ggplot(aes(x = valence)) + geom_histogram(
    bins = 10,
    col = "black",
    fill = "lightblue"
  )
Since most of the points in the QQ-Plot fall along the reference line, I can assume that the model’s residuals are approximately normal. The histogram also shows a somewhat normal, bell-shaped distribution, which supports the normality assumption.
Here is the QQ-Plot of each group:
ggplot(decades, aes(sample = valence, color = playlist_name)) + geom_qq() + geom_qq_line()
The points on the QQ-Plot mostly fall along the reference lines, so I can assume normality among each group.
Next, I will check the equal variances assumption by plotting the residuals against their fitted values:
plot(model, 1)
There is no evident relationship between the residuals and their fitted values, so we can assume there are equal variances.
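A quick numeric cross-check of the equal variances assumption is to compare the largest and smallest group standard deviations. The sketch below uses the four standard deviations reported earlier; the cutoff of 2 is a common rule of thumb that I am assuming here, not a formal test.

```r
# Standard deviations of valence per decade, as reported above
decade_sd <- c("80s" = 0.2234216, "90s" = 0.2190559,
               "00s" = 0.2356586, "10s" = 0.2210674)

# Ratio of the largest to the smallest standard deviation
sd_ratio <- max(decade_sd) / min(decade_sd)
sd_ratio  # roughly 1.08

# Rule of thumb (an assumption, not a formal test): a ratio below 2 is
# usually treated as compatible with equal variances
sd_ratio < 2
```

A ratio this close to 1 agrees with what the residual plot suggests.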
The final assumption, that the samples are independent of each other, relies mostly on how the data was gathered. The sample was not random, since I specifically chose the top 100 songs of each decade. However, decade is the predictor variable in this model, and there are 400 observations in the sample, so I will assume the samples are independent.
Now that I know my assumptions are met, I am ready to conduct my hypothesis test.
anova <- decades %>% anova_test(valence ~ playlist_name)
anova
## ANOVA Table (type II tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 playlist_name 3 396 9.124 7.49e-06 * 0.065
From this table, I see that I have an F-value of 9.124 and a p-value of <0.0001. The F-value tells me that the variation between the decade means (decade is defined here as playlist_name) is about 9.124 times as large as the variation within decades, and the p-value tells me that a ratio this large would be very unlikely if decade had no effect on the mean valence score.
Based on these results, I have sufficient evidence to conclude that the population mean valence score differs between at least two of the decades.
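The p-value in the ANOVA table can be recovered from the F-value and its two degrees of freedom using base R. The sketch below plugs in the rounded values from the table, so the result should be close to, but not exactly, the 7.49e-06 printed there.

```r
# Values read from the ANOVA table above
f_value    <- 9.124
df_between <- 3     # number of decades minus 1
df_within  <- 396   # number of songs minus number of decades

# Upper-tail probability of the F-distribution: the chance of an F-value
# at least this large if all decade means were equal
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Critical F-value at the 5% significance level, for comparison
f_critical <- qf(0.95, df_between, df_within)

p_value
f_critical
f_value > f_critical
```

The observed F-value sits well above the 5% critical value, which is another way of reading the same result.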
Now that I have sufficient evidence that each decade does not have the same mean valence score, I am interested in finding out which decade has the highest mean valence score and whether it is significantly higher than the other decades.
To find this out, first, I will calculate the least squares means of each decade and find the difference between the means using Tukey’s Honest Significant Differences. Tukey’s HSD allows for the test of multiple pairwise comparisons between means.
lsmeans(model, specs = "playlist_name")
## playlist_name lsmean SE df lower.CL upper.CL
## 00s 0.573 0.0225 396 0.528 0.617
## 10s 0.563 0.0225 396 0.518 0.607
## 80s 0.711 0.0225 396 0.667 0.755
## 90s 0.624 0.0225 396 0.580 0.668
##
## Confidence level used: 0.95
tukey_hsd(model)
## # A tibble: 6 x 9
## term group1 group2 null.value estimate conf.low conf.high p.adj
## * <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 play… 00s 10s 0 -0.0100 -0.0921 0.0721 9.89e-1
## 2 play… 00s 80s 0 0.139 0.0565 0.221 9.93e-5
## 3 play… 00s 90s 0 0.0516 -0.0304 0.134 3.67e-1
## 4 play… 10s 80s 0 0.149 0.0665 0.231 2.44e-5
## 5 play… 10s 90s 0 0.0616 -0.0204 0.144 2.14e-1
## 6 play… 80s 90s 0 -0.0869 -0.169 -0.00485 3.31e-2
## # … with 1 more variable: p.adj.signif <chr>
The pairwise comparisons show that the 80s have a significantly higher mean valence score than the 1990s, 2000s and 2010s (adjusted p-values of 0.0331, 0.0000993 and 0.0000244, respectively). None of the other pairs of decades differ significantly.
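To see where the Tukey confidence limits come from, the interval for one pair can be rebuilt by hand. The sketch below uses the rounded least squares means and standard error from the output above and assumes equal group sizes, in which case the half-width of every interval is the studentized range quantile times the standard error of a single decade mean. Because the inputs are rounded, the result only approximately matches the table.

```r
# Values read from the lsmeans output above
se_mean  <- 0.0225   # standard error of each decade mean
k        <- 4        # number of decades
df_resid <- 396      # residual degrees of freedom

# Studentized range quantile for a 95% family-wise confidence level
q_crit <- qtukey(0.95, nmeans = k, df = df_resid)

# With equal group sizes, the half-width of each Tukey interval is
# q_crit * sqrt(MSE / n), and sqrt(MSE / n) is the SE of one decade mean
half_width <- q_crit * se_mean

# Interval for the 90s minus 80s difference, using the rounded means
diff_90s_80s <- 0.624 - 0.711
ci_90s_80s   <- diff_90s_80s + c(-1, 1) * half_width

half_width
ci_90s_80s
```

The upper limit falls just below zero, which is why the 80s versus 90s comparison is significant but only narrowly.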
Now that I’ve found that the 1980s had a higher mean musical positivity score than each of the other decades, I want to see if the 80s mean is also significantly higher than the mean valence score of the other three decades taken together. To do this, I will need to create a contrast statement and run a hypothesis test on it.
My contrast statement will be comparing the mean valence score of the top 100 songs of the 80s to the mean valence score of the top 100 songs of the other decades.
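lsmeans(model, specs = "playlist_name") -> decade_means
# The weights follow the level order shown above (00s, 10s, 80s, 90s),
# so c(3,-1,-1,-1) places the 3 on the 00s, not the 80s
contrast(decade_means, list("80svsall" = c(3,-1,-1,-1)))
## contrast estimate SE df t.ratio p.value
## 80svsall -0.18 0.0779 396 -2.313 0.0213
A p-value of 0.0213 means that, if the contrast were truly zero, there would be about a 2.13% chance of seeing an estimate at least this far from zero. Looking at the lsmeans output, though, the levels are ordered 00s, 10s, 80s, 90s, so my weights actually compare the 2000s, not the 80s, against the other decades, which is why the estimate is negative. The conclusion about the 80s therefore rests on the Tukey comparisons above, where the mean valence score of popular songs in the 80s is significantly higher than in the 1990s, 2000s and 2010s.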
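Contrast weights are applied to the factor levels in the order the lsmeans output lists them, which here is alphabetical: 00s, 10s, 80s, 90s. The sketch below redoes the contrast arithmetic by hand from the rounded least squares means and standard error printed above, once with the weights as written and once with the 3 moved to the 80s position. `contrast_summary()` is a hypothetical helper of my own, and because the inputs are rounded, the numbers are approximations rather than a replacement for rerunning `contrast()`.

```r
# Rounded least squares means, in the level order of the lsmeans output
lsm <- c("00s" = 0.573, "10s" = 0.563, "80s" = 0.711, "90s" = 0.624)
se_mean  <- 0.0225   # standard error of each decade mean
df_resid <- 396      # residual degrees of freedom

# Hypothetical helper: estimate, SE, t-ratio and two-sided p-value of a
# linear contrast of independent means that share the same standard error
contrast_summary <- function(weights) {
  estimate <- sum(weights * lsm)
  se       <- se_mean * sqrt(sum(weights^2))
  t_ratio  <- estimate / se
  p_value  <- 2 * pt(abs(t_ratio), df_resid, lower.tail = FALSE)
  c(estimate = estimate, SE = se, t.ratio = t_ratio, p.value = p_value)
}

# Weights as written: the 3 lands on the first level, "00s"
first_level_vs_rest <- contrast_summary(c(3, -1, -1, -1))

# Weights with the 3 on the third level, "80s"
eighties_vs_rest <- contrast_summary(c(-1, -1, 3, -1))

first_level_vs_rest
eighties_vs_rest
```

The first set of weights reproduces the negative estimate of about -0.18 shown above, while the second gives a positive estimate for the 80s against the other three decades.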
Throughout my analysis of valence scores of the most popular songs of the 1980s, 1990s, 2000s and 2010s, I have found that the 1980s have a significantly higher mean valence score than the other decades. In other words, the 1980s top 100 songs were much more cheerful, euphoric and happy on average when compared to the top 100 songs from other decades.
So, if you need a pick-me-up or you’re tired of the depressing songs on the radio, put on Bon Jovi’s “Livin’ on a Prayer” and enjoy the good ol’ days.