Analyzing Spotify Data

Using the spotifyr package, I wanted to better understand the patterns that drive popular music. Specifically, I wanted to look at what factors have changed in popular music over the last few decades. In order to do so, I use the Billboard Top 100 End of Year chart as a proxy for the most popular songs per year.

To begin, I use the following packages and gain access using my specific Spotify Client ID and secret code:

⊕Not all of these packages were needed and please do not use my ID and password.

Looking at Current Hits on Spotify

To begin and test whether the spotifyr package works for me, I look at the Spotify Top 20 Hits at the moment:

#Using Spotify top hits
#this is using the ID from finding playlist online at end of URL
tophits <- get_playlist("37i9dQZF1DXcBWIGoYBM5M")

⊕In order to get playlist code, search name of playlist online and find spotify URL of it. Playlist code is the last string of characters of URL.

track_list <- tophits[["tracks"]]
item_list <- track_list[["items"]]
track_id <- item_list[,16]
track_id <- as.data.frame(track_id)
num_tracks <- count(track_id)
num_tracks <- as.integer(num_tracks)

#Iterating over track ID's 
a <- sapply(track_id, get_track_audio_features)
#Cleaning and naming list columns into a dataframe
b <- as.data.frame(a)
b <- as.data.frame(b)
e <- sapply(b, unlist)
f <- sapply(e, as.vector)
g <- as.data.frame(f)
#creating a vector for each column value

⊕There are likely far more efficient ways to do this..

h1 <- as.vector(g[1:num_tracks , ])
h2 <- as.vector(g[(1+num_tracks):(num_tracks*2) , ])
h3 <- as.vector(g[(1+num_tracks*2):(num_tracks*3) , ])
h4 <- as.vector(g[(1+num_tracks*3):(num_tracks*4) , ])
h5 <- as.vector(g[(1+num_tracks*4):(num_tracks*5) , ])
h6 <- as.vector(g[(1+num_tracks*5):(num_tracks*6) , ])
h7 <- as.vector(g[(1+num_tracks*6):(num_tracks*7) , ])
h8 <- as.vector(g[(1+num_tracks*7):(num_tracks*8) , ])
h9 <- as.vector(g[(1+num_tracks*8):(num_tracks*9) , ])
h10 <- as.vector(g[(1+num_tracks*9):(num_tracks*10) , ])
h11 <- as.vector(g[(1+num_tracks*10):(num_tracks*11) , ])
h17 <- as.vector(g[(1+num_tracks*16):(num_tracks*17) , ])
music <- data.frame(h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11, h17)
categories <- row.names(b)
cat_vec <- categories[1:11]
cat_vec1 <- categories[17]
names(music) <- cat_vec
names(music)[12] <- cat_vec1

#Coercing into a numeric instead of factor
music2 <- music %>%
  mutate_all(as.character)
music2 <- music2 %>%
  mutate_all(as.numeric)

#Visualization

⊕Valence is a measure of song happiness/positivity.

ggplot(music2) +
  geom_density(aes(x = danceability, fill = "red"), alpha = 0.4) +
  geom_density(aes(x = energy,fill = "purple"), alpha = 0.4) +
  geom_density(aes(x = valence, fill = "green"), alpha = 0.4) +
  labs(title = "Spotify Top 20 Playlist",
         x = "Measure of Attribute",
         y = "Frequency", fill = "Song Attribute") +
  scale_fill_manual(values = c("red", "purple", "green"), labels = c("Danceability", "Energy", "Valence"))

One thing you might notice are that the most popular songs are not necessarily overly dancy (on average). Yet, they do tend to rank highly in terms of energy.

Analyzing Changes in Popular Music Over Time

I want to move onto looking at changes in popular music across time. I decide to use the Billboard Top 100 for every 5 years starting with 1957.

The following code shows how I gathered a dataframe for the Billboard top 100 from 1957 (used same process as earlier with Spotify Top 20 Playlist).

###USING 1957 BILLBOARDS FOR EXAMPLE
nine_fif_seven <- get_playlist("59B77qi1KbJQnXLPHQDwXh")
track_list_57 <- nine_fif_seven[["tracks"]]
item_list_57 <- track_list_57[["items"]]
track_id_57 <- item_list_57[,16]
track_id_57 <- as.data.frame(track_id_57)
num_tracks_57 <- count(track_id_57)
num_tracks_57 <- as.integer(num_tracks_57)

#Iterating over track ID's 
a57 <- sapply(track_id_57, get_track_audio_features)
#Cleaning and naming list columns into a dataframe
b57 <- as.data.frame(a57)
b57 <- as.data.frame(b57)
e57 <- sapply(b57, unlist)
f57 <- sapply(e57, as.vector)
g57 <- as.data.frame(f57)

⊕Once again, there are ways to do this with much cleaner code. Considering I was not doing this with many different playlists, I simply used same process without streamlining code to be nicer.

#creating a vector for each column value
h1.57 <- as.vector(g57[1:num_tracks_57 , ])
h2.57 <- as.vector(g57[(1+num_tracks_57):(num_tracks_57*2) , ])
h3.57 <- as.vector(g57[(1+num_tracks_57*2):(num_tracks_57*3) , ])
h4.57 <- as.vector(g57[(1+num_tracks_57*3):(num_tracks_57*4) , ])
h5.57 <- as.vector(g57[(1+num_tracks_57*4):(num_tracks_57*5) , ])
h6.57 <- as.vector(g57[(1+num_tracks_57*5):(num_tracks_57*6) , ])
h7.57 <- as.vector(g57[(1+num_tracks_57*6):(num_tracks_57*7) , ])
h8.57 <- as.vector(g57[(1+num_tracks_57*7):(num_tracks_57*8) , ])
h9.57 <- as.vector(g57[(1+num_tracks_57*8):(num_tracks_57*9) , ])
h10.57 <- as.vector(g57[(1+num_tracks_57*9):(num_tracks_57*10) , ])
h11.57 <- as.vector(g57[(1+num_tracks_57*10):(num_tracks_57*11) , ])
h17.57 <- as.vector(g57[(1+num_tracks_57*16):(num_tracks_57*17) , ])
music.57 <- data.frame(h1.57, h2.57, h3.57, h4.57, h5.57, h6.57, h7.57, h8.57, h9.57, h10.57, h11.57, h17.57)
categories_57 <- row.names(b57)
cat_vec_57 <- categories_57[1:11]
cat_vec1_57 <- categories_57[17]
names(music.57) <- cat_vec_57
names(music.57)[12] <- cat_vec1_57

#Coercing into a numeric instead of factor
music2.57 <- music.57 %>%
  mutate_all(as.character)
music2.57 <- music2.57 %>%
  mutate_all(as.numeric)


#..........REPEATED FOR EACH YEAR AND SAVED

After running this code for the 1957 Playlist, I saved the playlist as a csv and repeated the process for each 5 year increment, saving accordingly.

Loading The Saved CSV’s and Combining to One Dataframe

I now have a dataframe with musical attributes for each Billboard Top 100 and combine them into one dataframe.

setwd("~/Desktop/  /Text Documents/R")
bill57 <- read.csv("bill57.csv")
bill62 <- read.csv("bill62.csv")
bill67 <- read.csv("bill67.csv")
bill72 <- read.csv("bill72.csv")
bill77 <- read.csv("bill77.csv")
bill82 <- read.csv("bill82.csv")
bill87 <- read.csv("bill87.csv")
bill92 <- read.csv("bill92.csv")
bill97 <- read.csv("bill97.csv")
bill02 <- read.csv("bill02.csv")
bill07 <- read.csv("bill07.csv")
bill12 <- read.csv("bill12.csv")
bill17 <- read.csv("bill17.csv")

⊕I am not sure what went wrong with all this crap.

bill02$X.1 <- NULL
bill07$X.1 <- NULL
bill12$year <- 2012

#Merging all dataframes
bill_total <- rbind(bill57, bill62, bill67, bill72, bill77, bill82, bill87, bill92, bill97, bill02, bill07, bill12, bill17)

Analysis of Billboard Top 100 Lists over Time

Using the Billboard Top 100, one of the most famous rankings of music by popularity, I look at how attributes of popular music have changed over time.

#Differences in Summary Statistics by year
summary <- bill_total %>%
  group_by(year) %>%
  summarize(mean(danceability), mean(energy), mean(loudness), mean(valence), mean(tempo))

names(summary) <- c("year", "dancy", "energy", "loud", "valence", "tempo")

knitr::kable(
  summary %>% 
    round(digits = 2), caption = "Mean Attributes of Billboard Top 100's"
)

Mean Attributes of Billboard Top 100’s

year	dancy	energy	loud	valence	tempo
1957	0.53	0.47	-10.47	0.67	121.62
1962	0.56	0.48	-10.56	0.68	118.93
1967	0.54	0.54	-9.90	0.65	122.56
1972	0.57	0.55	-10.67	0.65	121.87
1977	0.56	0.60	-9.94	0.60	123.03
1982	0.62	0.59	-10.05	0.65	122.02
1987	0.64	0.67	-9.40	0.64	120.21
1992	0.63	0.64	-9.18	0.56	115.07
1997	0.68	0.59	-8.37	0.56	116.29
2002	0.67	0.69	-5.94	0.59	120.27
2007	0.65	0.71	-5.50	0.56	120.43
2012	0.65	0.72	-5.17	0.56	125.11
2017	0.68	0.68	-5.53	0.53	118.50

ggplot(summary, aes(x = year)) +
  geom_line(aes(y = dancy, color = "red")) +
  geom_line(aes(y = energy, color = "blue")) +
  geom_line(aes(y = valence, color = "purple")) +
  scale_color_manual(values = c("red", "blue", "purple"), labels = c("Danceability", "Valence", "Energy")) +
  labs(title = "Billboard Top 100's Over Time",
         x = "Year",
         y = "Amount", colour = "Song Attribute")

The graph above shows how the attributes of songs in the Billboard Top 100’s have changed over time. Both energy and danceability have tended to go upwards since the 1960’s. Yet, valence (positivity) has tended to go downwards. It appears that the most popular songs have tended to become sadder over time, atleast according to this ranking.

Running a Multinomial Logistic Regression on Decade

Can we predict the decade music was on the Top 100’s list based on these attributes? To answer, I run a multinomial logistic regression:

require(foreign)
require(nnet)
require(reshape2)
library(stargazer)

multi1 <- multinom(year ~ danceability + energy +
                    speechiness + valence,data = bill_total, trace=FALSE)

stargazer(multi1, type = 'html', title = "Multinomial Logistic Results")

**Multinomial Logistic Results**

	Dependent variable:

	1962	1967	1972	1977	1982	1987	1992	1997	2002	2007	2012	2017
	(1)	(2)	(3)	(4)	(5)	(6)	(7)	(8)	(9)	(10)	(11)	(12)

danceability	1.896	1.728	3.970^***	5.367^***	8.190^***	11.366^***	11.999^***	14.563^***	14.229^***	13.744^***	14.283^***	15.943^***
	(1.181)	(1.205)	(1.198)	(1.241)	(1.322)	(1.425)	(1.396)	(1.457)	(1.464)	(1.450)	(1.449)	(1.466)

energy	0.442	3.481^***	3.615^***	6.766^***	6.550^***	10.024^***	9.206^***	7.567^***	11.102^***	12.577^***	13.104^***	11.752^***
	(0.919)	(0.946)	(0.923)	(0.956)	(0.983)	(1.050)	(1.026)	(1.037)	(1.104)	(1.138)	(1.129)	(1.102)

speechiness	-3.086	-15.934^***	-2.457	-7.772^**	-31.326^***	-28.144^***	-3.776	-0.499	2.002	3.918^*	0.457	2.014
	(2.401)	(5.212)	(2.329)	(3.237)	(7.533)	(6.637)	(2.613)	(2.277)	(2.150)	(2.091)	(2.254)	(2.158)

valence	-0.609	-2.133^**	-3.347^***	-5.939^***	-5.712^***	-8.307^***	-10.040^***	-10.315^***	-11.006^***	-11.889^***	-12.085^***	-12.760^***
	(0.932)	(0.950)	(0.933)	(0.954)	(0.974)	(1.008)	(1.011)	(1.024)	(1.045)	(1.055)	(1.044)	(1.045)

Constant	-0.652	-0.554	-1.668^***	-2.350^***	-3.035^***	-5.710^***	-5.709^***	-6.515^***	-8.367^***	-8.733^***	-8.955^***	-8.885^***
	(0.572)	(0.590)	(0.598)	(0.631)	(0.706)	(0.823)	(0.786)	(0.828)	(0.925)	(0.949)	(0.939)	(0.924)


Akaike Inf. Crit.	5,673.800	5,673.800	5,673.800	5,673.800	5,673.800	5,673.800	5,673.800	5,673.800	5,673.800	5,673.800	5,673.800	5,673.800

Note:	p<0.1; p<0.05; p<0.01

The results above show the multinomial logistic coefficients and standard errors of the regression. Specifically, they represent coefficients relative to the base year 1957.

These coefficients are not very useful at the moment, since they are not in log odds form, but they do show what is statistically significant for us. For instance, notice that every year past 1967 presents a statistically significant estimate at the 0.01% level for danceability, energy, and valence. Speechiness is only significant for a few years in the 70’s and 80’s.

In order to find meaning in the coefficients themselves, I convert the estimates to a log-odd ratio and present the results below:

#Converting values to log-odds ratios
results <- as.data.frame(exp(coef(multi1)))

knitr::kable(
  results, caption = "Log-Odd Ratios for Song Attributes on Year"
)

Log-Odd Ratios for Song Attributes on Year

	(Intercept)	danceability	energy	speechiness	valence
1962	0.5208494	6.661422e+00	1.555074e+00	0.0456903	0.5441405
1967	0.5748024	5.631347e+00	3.249863e+01	0.0000001	0.1185238
1972	0.1885981	5.296551e+01	3.715979e+01	0.0856583	0.0351924
1977	0.0953486	2.143047e+02	8.682193e+02	0.0004214	0.0026335
1982	0.0480518	3.606303e+03	6.990921e+02	0.0000000	0.0033055
1987	0.0033116	8.634961e+04	2.256384e+04	0.0000000	0.0002467
1992	0.0033144	1.625644e+05	9.961289e+03	0.0229122	0.0000436
1997	0.0014814	2.112033e+06	1.933329e+03	0.6072308	0.0000331
2002	0.0002324	1.511616e+06	6.632075e+04	7.4048987	0.0000166
2007	0.0001612	9.309401e+05	2.898579e+05	50.3147102	0.0000069
2012	0.0001291	1.596567e+06	4.911120e+05	1.5792237	0.0000056
2017	0.0001384	8.390913e+06	1.269533e+05	7.4920731	0.0000029

Disregarding speechiness(since most values are not statistically significant), we can notice that the odds of a more positive or “happier” song making the Billboard Top 100’s Chart are much lower for a song in the 2000’s compared to 1957. This is what we previously expected based on the plot.