Analyzing Spotify Data

Jacob Linger

5/8/2019

Using the spotifyr package, I wanted to better understand the patterns that drive popular music. Specifically, I wanted to look at what factors have changed in popular music over the last few decades. In order to do so, I use the Billboard Top 100 End of Year chart as a proxy for the most popular songs per year.

To begin, I use the following packages and gain access using my specific Spotify Client ID and secret code:

   


setwd("~/Desktop/  /Text Documents/R")


require(httr)
require(jsonlite)
require(plyr)
require(glmnet)
library(spotifyr)
library(knitr)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(readr)
library(kableExtra)

Not all of these packages were needed and please do not use my ID and password.

#Gaining Access
Sys.setenv(SPOTIFY_CLIENT_ID = '269036b6875046f7ab4e49c8f9270a1b')
Sys.setenv(SPOTIFY_CLIENT_SECRET = '75202597e596483e9dd3c19ec1c2a296')
access_token <- get_spotify_access_token()

   

Looking at Current Hits on Spotify

To begin and test whether the spotifyr package works for me, I look at the Spotify Top 20 Hits at the moment:

   


#Using Spotify top hits
#this is using the ID from finding playlist online at end of URL
tophits <- get_playlist("37i9dQZF1DXcBWIGoYBM5M")

In order to get playlist code, search name of playlist online and find spotify URL of it. Playlist code is the last string of characters of URL.

track_list <- tophits[["tracks"]]
item_list <- track_list[["items"]]
track_id <- item_list[,16]
track_id <- as.data.frame(track_id)
num_tracks <- count(track_id)
num_tracks <- as.integer(num_tracks)

#Iterating over track ID's 
a <- sapply(track_id, get_track_audio_features)
#Cleaning and naming list columns into a dataframe
b <- as.data.frame(a)
b <- as.data.frame(b)
e <- sapply(b, unlist)
f <- sapply(e, as.vector)
g <- as.data.frame(f)
#creating a vector for each column value

There are likely far more efficient ways to do this..

h1 <- as.vector(g[1:num_tracks , ])
h2 <- as.vector(g[(1+num_tracks):(num_tracks*2) , ])
h3 <- as.vector(g[(1+num_tracks*2):(num_tracks*3) , ])
h4 <- as.vector(g[(1+num_tracks*3):(num_tracks*4) , ])
h5 <- as.vector(g[(1+num_tracks*4):(num_tracks*5) , ])
h6 <- as.vector(g[(1+num_tracks*5):(num_tracks*6) , ])
h7 <- as.vector(g[(1+num_tracks*6):(num_tracks*7) , ])
h8 <- as.vector(g[(1+num_tracks*7):(num_tracks*8) , ])
h9 <- as.vector(g[(1+num_tracks*8):(num_tracks*9) , ])
h10 <- as.vector(g[(1+num_tracks*9):(num_tracks*10) , ])
h11 <- as.vector(g[(1+num_tracks*10):(num_tracks*11) , ])
h17 <- as.vector(g[(1+num_tracks*16):(num_tracks*17) , ])
music <- data.frame(h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11, h17)
categories <- row.names(b)
cat_vec <- categories[1:11]
cat_vec1 <- categories[17]
names(music) <- cat_vec
names(music)[12] <- cat_vec1

#Coercing into a numeric instead of factor
music2 <- music %>%
  mutate_all(as.character)
music2 <- music2 %>%
  mutate_all(as.numeric)

#Visualization

Valence is a measure of song happiness/positivity.

ggplot(music2) +
  geom_density(aes(x = danceability, fill = "red"), alpha = 0.4) +
  geom_density(aes(x = energy,fill = "purple"), alpha = 0.4) +
  geom_density(aes(x = valence, fill = "green"), alpha = 0.4) +
  labs(title = "Spotify Top 20 Playlist",
         x = "Measure of Attribute",
         y = "Frequency", fill = "Song Attribute") +
  scale_fill_manual(values = c("red", "purple", "green"), labels = c("Danceability", "Energy", "Valence"))


   

One thing you might notice are that the most popular songs are not necessarily overly dancy (on average). Yet, they do tend to rank highly in terms of energy.

Loading The Saved CSV’s and Combining to One Dataframe

I now have a dataframe with musical attributes for each Billboard Top 100 and combine them into one dataframe.

   


setwd("~/Desktop/  /Text Documents/R")
bill57 <- read.csv("bill57.csv")
bill62 <- read.csv("bill62.csv")
bill67 <- read.csv("bill67.csv")
bill72 <- read.csv("bill72.csv")
bill77 <- read.csv("bill77.csv")
bill82 <- read.csv("bill82.csv")
bill87 <- read.csv("bill87.csv")
bill92 <- read.csv("bill92.csv")
bill97 <- read.csv("bill97.csv")
bill02 <- read.csv("bill02.csv")
bill07 <- read.csv("bill07.csv")
bill12 <- read.csv("bill12.csv")
bill17 <- read.csv("bill17.csv")

I am not sure what went wrong with all this crap.

bill02$X.1 <- NULL
bill07$X.1 <- NULL
bill12$year <- 2012

#Merging all dataframes
bill_total <- rbind(bill57, bill62, bill67, bill72, bill77, bill82, bill87, bill92, bill97, bill02, bill07, bill12, bill17)

   

Analysis of Billboard Top 100 Lists over Time

Using the Billboard Top 100, one of the most famous rankings of music by popularity, I look at how attributes of popular music have changed over time.

   


#Differences in Summary Statistics by year
summary <- bill_total %>%
  group_by(year) %>%
  summarize(mean(danceability), mean(energy), mean(loudness), mean(valence), mean(tempo))

names(summary) <- c("year", "dancy", "energy", "loud", "valence", "tempo")

knitr::kable(
  summary %>% 
    round(digits = 2), caption = "Mean Attributes of Billboard Top 100's"
)

Mean Attributes of Billboard Top 100’s

year dancy energy loud valence tempo
1957 0.53 0.47 -10.47 0.67 121.62
1962 0.56 0.48 -10.56 0.68 118.93
1967 0.54 0.54 -9.90 0.65 122.56
1972 0.57 0.55 -10.67 0.65 121.87
1977 0.56 0.60 -9.94 0.60 123.03
1982 0.62 0.59 -10.05 0.65 122.02
1987 0.64 0.67 -9.40 0.64 120.21
1992 0.63 0.64 -9.18 0.56 115.07
1997 0.68 0.59 -8.37 0.56 116.29
2002 0.67 0.69 -5.94 0.59 120.27
2007 0.65 0.71 -5.50 0.56 120.43
2012 0.65 0.72 -5.17 0.56 125.11
2017 0.68 0.68 -5.53 0.53 118.50
ggplot(summary, aes(x = year)) +
  geom_line(aes(y = dancy, color = "red")) +
  geom_line(aes(y = energy, color = "blue")) +
  geom_line(aes(y = valence, color = "purple")) +
  scale_color_manual(values = c("red", "blue", "purple"), labels = c("Danceability", "Valence", "Energy")) +
  labs(title = "Billboard Top 100's Over Time",
         x = "Year",
         y = "Amount", colour = "Song Attribute")


   

The graph above shows how the attributes of songs in the Billboard Top 100’s have changed over time. Both energy and danceability have tended to go upwards since the 1960’s. Yet, valence (positivity) has tended to go downwards. It appears that the most popular songs have tended to become sadder over time, atleast according to this ranking.

Running a Multinomial Logistic Regression on Decade

Can we predict the decade music was on the Top 100’s list based on these attributes? To answer, I run a multinomial logistic regression:

   


require(foreign)
require(nnet)
require(reshape2)
library(stargazer)

multi1 <- multinom(year ~ danceability + energy +
                    speechiness + valence,data = bill_total, trace=FALSE)

stargazer(multi1, type = 'html', title = "Multinomial Logistic Results")
Multinomial Logistic Results
Dependent variable:
1962 1967 1972 1977 1982 1987 1992 1997 2002 2007 2012 2017
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12)
danceability 1.896 1.728 3.970*** 5.367*** 8.190*** 11.366*** 11.999*** 14.563*** 14.229*** 13.744*** 14.283*** 15.943***
(1.181) (1.205) (1.198) (1.241) (1.322) (1.425) (1.396) (1.457) (1.464) (1.450) (1.449) (1.466)
energy 0.442 3.481*** 3.615*** 6.766*** 6.550*** 10.024*** 9.206*** 7.567*** 11.102*** 12.577*** 13.104*** 11.752***
(0.919) (0.946) (0.923) (0.956) (0.983) (1.050) (1.026) (1.037) (1.104) (1.138) (1.129) (1.102)
speechiness -3.086 -15.934*** -2.457 -7.772** -31.326*** -28.144*** -3.776 -0.499 2.002 3.918* 0.457 2.014
(2.401) (5.212) (2.329) (3.237) (7.533) (6.637) (2.613) (2.277) (2.150) (2.091) (2.254) (2.158)
valence -0.609 -2.133** -3.347*** -5.939*** -5.712*** -8.307*** -10.040*** -10.315*** -11.006*** -11.889*** -12.085*** -12.760***
(0.932) (0.950) (0.933) (0.954) (0.974) (1.008) (1.011) (1.024) (1.045) (1.055) (1.044) (1.045)
Constant -0.652 -0.554 -1.668*** -2.350*** -3.035*** -5.710*** -5.709*** -6.515*** -8.367*** -8.733*** -8.955*** -8.885***
(0.572) (0.590) (0.598) (0.631) (0.706) (0.823) (0.786) (0.828) (0.925) (0.949) (0.939) (0.924)
Akaike Inf. Crit. 5,673.800 5,673.800 5,673.800 5,673.800 5,673.800 5,673.800 5,673.800 5,673.800 5,673.800 5,673.800 5,673.800 5,673.800
Note: p<0.1; p<0.05; p<0.01

   

The results above show the multinomial logistic coefficients and standard errors of the regression. Specifically, they represent coefficients relative to the base year 1957.

These coefficients are not very useful at the moment, since they are not in log odds form, but they do show what is statistically significant for us. For instance, notice that every year past 1967 presents a statistically significant estimate at the 0.01% level for danceability, energy, and valence. Speechiness is only significant for a few years in the 70’s and 80’s.

In order to find meaning in the coefficients themselves, I convert the estimates to a log-odd ratio and present the results below:

   


#Converting values to log-odds ratios
results <- as.data.frame(exp(coef(multi1)))

knitr::kable(
  results, caption = "Log-Odd Ratios for Song Attributes on Year"
)

Log-Odd Ratios for Song Attributes on Year

(Intercept) danceability energy speechiness valence
1962 0.5208494 6.661422e+00 1.555074e+00 0.0456903 0.5441405
1967 0.5748024 5.631347e+00 3.249863e+01 0.0000001 0.1185238
1972 0.1885981 5.296551e+01 3.715979e+01 0.0856583 0.0351924
1977 0.0953486 2.143047e+02 8.682193e+02 0.0004214 0.0026335
1982 0.0480518 3.606303e+03 6.990921e+02 0.0000000 0.0033055
1987 0.0033116 8.634961e+04 2.256384e+04 0.0000000 0.0002467
1992 0.0033144 1.625644e+05 9.961289e+03 0.0229122 0.0000436
1997 0.0014814 2.112033e+06 1.933329e+03 0.6072308 0.0000331
2002 0.0002324 1.511616e+06 6.632075e+04 7.4048987 0.0000166
2007 0.0001612 9.309401e+05 2.898579e+05 50.3147102 0.0000069
2012 0.0001291 1.596567e+06 4.911120e+05 1.5792237 0.0000056
2017 0.0001384 8.390913e+06 1.269533e+05 7.4920731 0.0000029

   

Disregarding speechiness(since most values are not statistically significant), we can notice that the odds of a more positive or “happier” song making the Billboard Top 100’s Chart are much lower for a song in the 2000’s compared to 1957. This is what we previously expected based on the plot.