Using the spotifyr package, I wanted to better understand the patterns that drive popular music. Specifically, I wanted to look at what factors have changed in popular music over the last few decades. In order to do so, I use the Billboard Top 100 End of Year chart as a proxy for the most popular songs per year.
To begin, I use the following packages and gain access using my specific Spotify Client ID and secret code:
setwd("~/Desktop/ /Text Documents/R")
require(httr)
require(jsonlite)
require(plyr)
require(glmnet)
library(spotifyr)
library(knitr)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(readr)
library(kableExtra)
Not all of these packages were needed and please do not use my ID and password.
#Gaining Access
Sys.setenv(SPOTIFY_CLIENT_ID = '269036b6875046f7ab4e49c8f9270a1b')
Sys.setenv(SPOTIFY_CLIENT_SECRET = '75202597e596483e9dd3c19ec1c2a296')
access_token <- get_spotify_access_token()
To begin and test whether the spotifyr package works for me, I look at the Spotify Top 20 Hits at the moment:
#Using Spotify top hits
#this is using the ID from finding playlist online at end of URL
tophits <- get_playlist("37i9dQZF1DXcBWIGoYBM5M")
In order to get playlist code, search name of playlist online and find spotify URL of it. Playlist code is the last string of characters of URL.
track_list <- tophits[["tracks"]]
item_list <- track_list[["items"]]
track_id <- item_list[,16]
track_id <- as.data.frame(track_id)
num_tracks <- count(track_id)
num_tracks <- as.integer(num_tracks)
#Iterating over track ID's
a <- sapply(track_id, get_track_audio_features)
#Cleaning and naming list columns into a dataframe
b <- as.data.frame(a)
b <- as.data.frame(b)
e <- sapply(b, unlist)
f <- sapply(e, as.vector)
g <- as.data.frame(f)
#creating a vector for each column value
There are likely far more efficient ways to do this..
h1 <- as.vector(g[1:num_tracks , ])
h2 <- as.vector(g[(1+num_tracks):(num_tracks*2) , ])
h3 <- as.vector(g[(1+num_tracks*2):(num_tracks*3) , ])
h4 <- as.vector(g[(1+num_tracks*3):(num_tracks*4) , ])
h5 <- as.vector(g[(1+num_tracks*4):(num_tracks*5) , ])
h6 <- as.vector(g[(1+num_tracks*5):(num_tracks*6) , ])
h7 <- as.vector(g[(1+num_tracks*6):(num_tracks*7) , ])
h8 <- as.vector(g[(1+num_tracks*7):(num_tracks*8) , ])
h9 <- as.vector(g[(1+num_tracks*8):(num_tracks*9) , ])
h10 <- as.vector(g[(1+num_tracks*9):(num_tracks*10) , ])
h11 <- as.vector(g[(1+num_tracks*10):(num_tracks*11) , ])
h17 <- as.vector(g[(1+num_tracks*16):(num_tracks*17) , ])
music <- data.frame(h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11, h17)
categories <- row.names(b)
cat_vec <- categories[1:11]
cat_vec1 <- categories[17]
names(music) <- cat_vec
names(music)[12] <- cat_vec1
#Coercing into a numeric instead of factor
music2 <- music %>%
mutate_all(as.character)
music2 <- music2 %>%
mutate_all(as.numeric)
#Visualization
Valence is a measure of song happiness/positivity.
ggplot(music2) +
geom_density(aes(x = danceability, fill = "red"), alpha = 0.4) +
geom_density(aes(x = energy,fill = "purple"), alpha = 0.4) +
geom_density(aes(x = valence, fill = "green"), alpha = 0.4) +
labs(title = "Spotify Top 20 Playlist",
x = "Measure of Attribute",
y = "Frequency", fill = "Song Attribute") +
scale_fill_manual(values = c("red", "purple", "green"), labels = c("Danceability", "Energy", "Valence"))
One thing you might notice are that the most popular songs are not necessarily overly dancy (on average). Yet, they do tend to rank highly in terms of energy.
I want to move onto looking at changes in popular music across time. I decide to use the Billboard Top 100 for every 5 years starting with 1957.
The following code shows how I gathered a dataframe for the Billboard top 100 from 1957 (used same process as earlier with Spotify Top 20 Playlist).
###USING 1957 BILLBOARDS FOR EXAMPLE
nine_fif_seven <- get_playlist("59B77qi1KbJQnXLPHQDwXh")
track_list_57 <- nine_fif_seven[["tracks"]]
item_list_57 <- track_list_57[["items"]]
track_id_57 <- item_list_57[,16]
track_id_57 <- as.data.frame(track_id_57)
num_tracks_57 <- count(track_id_57)
num_tracks_57 <- as.integer(num_tracks_57)
#Iterating over track ID's
a57 <- sapply(track_id_57, get_track_audio_features)
#Cleaning and naming list columns into a dataframe
b57 <- as.data.frame(a57)
b57 <- as.data.frame(b57)
e57 <- sapply(b57, unlist)
f57 <- sapply(e57, as.vector)
g57 <- as.data.frame(f57)
Once again, there are ways to do this with much cleaner code. Considering I was not doing this with many different playlists, I simply used same process without streamlining code to be nicer.
#creating a vector for each column value
h1.57 <- as.vector(g57[1:num_tracks_57 , ])
h2.57 <- as.vector(g57[(1+num_tracks_57):(num_tracks_57*2) , ])
h3.57 <- as.vector(g57[(1+num_tracks_57*2):(num_tracks_57*3) , ])
h4.57 <- as.vector(g57[(1+num_tracks_57*3):(num_tracks_57*4) , ])
h5.57 <- as.vector(g57[(1+num_tracks_57*4):(num_tracks_57*5) , ])
h6.57 <- as.vector(g57[(1+num_tracks_57*5):(num_tracks_57*6) , ])
h7.57 <- as.vector(g57[(1+num_tracks_57*6):(num_tracks_57*7) , ])
h8.57 <- as.vector(g57[(1+num_tracks_57*7):(num_tracks_57*8) , ])
h9.57 <- as.vector(g57[(1+num_tracks_57*8):(num_tracks_57*9) , ])
h10.57 <- as.vector(g57[(1+num_tracks_57*9):(num_tracks_57*10) , ])
h11.57 <- as.vector(g57[(1+num_tracks_57*10):(num_tracks_57*11) , ])
h17.57 <- as.vector(g57[(1+num_tracks_57*16):(num_tracks_57*17) , ])
music.57 <- data.frame(h1.57, h2.57, h3.57, h4.57, h5.57, h6.57, h7.57, h8.57, h9.57, h10.57, h11.57, h17.57)
categories_57 <- row.names(b57)
cat_vec_57 <- categories_57[1:11]
cat_vec1_57 <- categories_57[17]
names(music.57) <- cat_vec_57
names(music.57)[12] <- cat_vec1_57
#Coercing into a numeric instead of factor
music2.57 <- music.57 %>%
mutate_all(as.character)
music2.57 <- music2.57 %>%
mutate_all(as.numeric)
#..........REPEATED FOR EACH YEAR AND SAVED
After running this code for the 1957 Playlist, I saved the playlist as a csv and repeated the process for each 5 year increment, saving accordingly.
I now have a dataframe with musical attributes for each Billboard Top 100 and combine them into one dataframe.
setwd("~/Desktop/ /Text Documents/R")
bill57 <- read.csv("bill57.csv")
bill62 <- read.csv("bill62.csv")
bill67 <- read.csv("bill67.csv")
bill72 <- read.csv("bill72.csv")
bill77 <- read.csv("bill77.csv")
bill82 <- read.csv("bill82.csv")
bill87 <- read.csv("bill87.csv")
bill92 <- read.csv("bill92.csv")
bill97 <- read.csv("bill97.csv")
bill02 <- read.csv("bill02.csv")
bill07 <- read.csv("bill07.csv")
bill12 <- read.csv("bill12.csv")
bill17 <- read.csv("bill17.csv")
I am not sure what went wrong with all this crap.
bill02$X.1 <- NULL
bill07$X.1 <- NULL
bill12$year <- 2012
#Merging all dataframes
bill_total <- rbind(bill57, bill62, bill67, bill72, bill77, bill82, bill87, bill92, bill97, bill02, bill07, bill12, bill17)
Using the Billboard Top 100, one of the most famous rankings of music by popularity, I look at how attributes of popular music have changed over time.
#Differences in Summary Statistics by year
summary <- bill_total %>%
group_by(year) %>%
summarize(mean(danceability), mean(energy), mean(loudness), mean(valence), mean(tempo))
names(summary) <- c("year", "dancy", "energy", "loud", "valence", "tempo")
knitr::kable(
summary %>%
round(digits = 2), caption = "Mean Attributes of Billboard Top 100's"
)
Mean Attributes of Billboard Top 100’s
| year | dancy | energy | loud | valence | tempo |
|---|---|---|---|---|---|
| 1957 | 0.53 | 0.47 | -10.47 | 0.67 | 121.62 |
| 1962 | 0.56 | 0.48 | -10.56 | 0.68 | 118.93 |
| 1967 | 0.54 | 0.54 | -9.90 | 0.65 | 122.56 |
| 1972 | 0.57 | 0.55 | -10.67 | 0.65 | 121.87 |
| 1977 | 0.56 | 0.60 | -9.94 | 0.60 | 123.03 |
| 1982 | 0.62 | 0.59 | -10.05 | 0.65 | 122.02 |
| 1987 | 0.64 | 0.67 | -9.40 | 0.64 | 120.21 |
| 1992 | 0.63 | 0.64 | -9.18 | 0.56 | 115.07 |
| 1997 | 0.68 | 0.59 | -8.37 | 0.56 | 116.29 |
| 2002 | 0.67 | 0.69 | -5.94 | 0.59 | 120.27 |
| 2007 | 0.65 | 0.71 | -5.50 | 0.56 | 120.43 |
| 2012 | 0.65 | 0.72 | -5.17 | 0.56 | 125.11 |
| 2017 | 0.68 | 0.68 | -5.53 | 0.53 | 118.50 |
ggplot(summary, aes(x = year)) +
geom_line(aes(y = dancy, color = "red")) +
geom_line(aes(y = energy, color = "blue")) +
geom_line(aes(y = valence, color = "purple")) +
scale_color_manual(values = c("red", "blue", "purple"), labels = c("Danceability", "Valence", "Energy")) +
labs(title = "Billboard Top 100's Over Time",
x = "Year",
y = "Amount", colour = "Song Attribute")
The graph above shows how the attributes of songs in the Billboard Top 100’s have changed over time. Both energy and danceability have tended to go upwards since the 1960’s. Yet, valence (positivity) has tended to go downwards. It appears that the most popular songs have tended to become sadder over time, atleast according to this ranking.
Can we predict the decade music was on the Top 100’s list based on these attributes? To answer, I run a multinomial logistic regression:
require(foreign)
require(nnet)
require(reshape2)
library(stargazer)
multi1 <- multinom(year ~ danceability + energy +
speechiness + valence,data = bill_total, trace=FALSE)
stargazer(multi1, type = 'html', title = "Multinomial Logistic Results")
| Dependent variable: | ||||||||||||
| 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 | 2012 | 2017 | |
| (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | (9) | (10) | (11) | (12) | |
| danceability | 1.896 | 1.728 | 3.970*** | 5.367*** | 8.190*** | 11.366*** | 11.999*** | 14.563*** | 14.229*** | 13.744*** | 14.283*** | 15.943*** |
| (1.181) | (1.205) | (1.198) | (1.241) | (1.322) | (1.425) | (1.396) | (1.457) | (1.464) | (1.450) | (1.449) | (1.466) | |
| energy | 0.442 | 3.481*** | 3.615*** | 6.766*** | 6.550*** | 10.024*** | 9.206*** | 7.567*** | 11.102*** | 12.577*** | 13.104*** | 11.752*** |
| (0.919) | (0.946) | (0.923) | (0.956) | (0.983) | (1.050) | (1.026) | (1.037) | (1.104) | (1.138) | (1.129) | (1.102) | |
| speechiness | -3.086 | -15.934*** | -2.457 | -7.772** | -31.326*** | -28.144*** | -3.776 | -0.499 | 2.002 | 3.918* | 0.457 | 2.014 |
| (2.401) | (5.212) | (2.329) | (3.237) | (7.533) | (6.637) | (2.613) | (2.277) | (2.150) | (2.091) | (2.254) | (2.158) | |
| valence | -0.609 | -2.133** | -3.347*** | -5.939*** | -5.712*** | -8.307*** | -10.040*** | -10.315*** | -11.006*** | -11.889*** | -12.085*** | -12.760*** |
| (0.932) | (0.950) | (0.933) | (0.954) | (0.974) | (1.008) | (1.011) | (1.024) | (1.045) | (1.055) | (1.044) | (1.045) | |
| Constant | -0.652 | -0.554 | -1.668*** | -2.350*** | -3.035*** | -5.710*** | -5.709*** | -6.515*** | -8.367*** | -8.733*** | -8.955*** | -8.885*** |
| (0.572) | (0.590) | (0.598) | (0.631) | (0.706) | (0.823) | (0.786) | (0.828) | (0.925) | (0.949) | (0.939) | (0.924) | |
| Akaike Inf. Crit. | 5,673.800 | 5,673.800 | 5,673.800 | 5,673.800 | 5,673.800 | 5,673.800 | 5,673.800 | 5,673.800 | 5,673.800 | 5,673.800 | 5,673.800 | 5,673.800 |
| Note: | p<0.1; p<0.05; p<0.01 | |||||||||||
The results above show the multinomial logistic coefficients and standard errors of the regression. Specifically, they represent coefficients relative to the base year 1957.
These coefficients are not very useful at the moment, since they are not in log odds form, but they do show what is statistically significant for us. For instance, notice that every year past 1967 presents a statistically significant estimate at the 0.01% level for danceability, energy, and valence. Speechiness is only significant for a few years in the 70’s and 80’s.
In order to find meaning in the coefficients themselves, I convert the estimates to a log-odd ratio and present the results below:
#Converting values to log-odds ratios
results <- as.data.frame(exp(coef(multi1)))
knitr::kable(
results, caption = "Log-Odd Ratios for Song Attributes on Year"
)
Log-Odd Ratios for Song Attributes on Year
| (Intercept) | danceability | energy | speechiness | valence | |
|---|---|---|---|---|---|
| 1962 | 0.5208494 | 6.661422e+00 | 1.555074e+00 | 0.0456903 | 0.5441405 |
| 1967 | 0.5748024 | 5.631347e+00 | 3.249863e+01 | 0.0000001 | 0.1185238 |
| 1972 | 0.1885981 | 5.296551e+01 | 3.715979e+01 | 0.0856583 | 0.0351924 |
| 1977 | 0.0953486 | 2.143047e+02 | 8.682193e+02 | 0.0004214 | 0.0026335 |
| 1982 | 0.0480518 | 3.606303e+03 | 6.990921e+02 | 0.0000000 | 0.0033055 |
| 1987 | 0.0033116 | 8.634961e+04 | 2.256384e+04 | 0.0000000 | 0.0002467 |
| 1992 | 0.0033144 | 1.625644e+05 | 9.961289e+03 | 0.0229122 | 0.0000436 |
| 1997 | 0.0014814 | 2.112033e+06 | 1.933329e+03 | 0.6072308 | 0.0000331 |
| 2002 | 0.0002324 | 1.511616e+06 | 6.632075e+04 | 7.4048987 | 0.0000166 |
| 2007 | 0.0001612 | 9.309401e+05 | 2.898579e+05 | 50.3147102 | 0.0000069 |
| 2012 | 0.0001291 | 1.596567e+06 | 4.911120e+05 | 1.5792237 | 0.0000056 |
| 2017 | 0.0001384 | 8.390913e+06 | 1.269533e+05 | 7.4920731 | 0.0000029 |
Disregarding speechiness(since most values are not statistically significant), we can notice that the odds of a more positive or “happier” song making the Billboard Top 100’s Chart are much lower for a song in the 2000’s compared to 1957. This is what we previously expected based on the plot.