Introduction

For our PSTAT 131 final project, we chose to explore Spotify music and the attributes that make a song popular because we all listen to music frequently and have specific songs that we enjoy. Something we’ve always wondered about is why we enjoy a particular song and, as a consequence, how certain songs become more popular than others. Is it the artist singing the song, the fact that we’ve heard it on the radio, the beats per minute, or whether the lyrics are explicit? To answer these questions, we set specific goals: get to know our data thoroughly, create visualizations of the attributes relevant to our question, run regression models to find which attributes are most relevant, select and fine-tune a model, and convey our results in an easily understandable report. This brings us to our research question: can we predict a song’s popularity based on its attributes?

Some current hits are actually created with the help of machine learning algorithms. Whether they are enjoyable to listen to is another topic, but these machine-assisted hits use algorithms to choose lyrics, beats, timing, and other factors that make a song popular. This inspired our research question, especially when we wondered which factors in these algorithms carry the most weight in song creation.

Loading Data and Packages

The first step of our machine learning project was simple: find a dataset. However, it was not quite as simple as we had originally anticipated. We went through a multitude of datasets before finding one that did not present problems such as too many missing values, too many observations, or too few observations. This process took much longer than we expected, as it consisted of downloading multiple files and attempting to upload them to GitHub, only to discover that they didn’t satisfy all of the data needs of our research question.

Finally, we landed on a dataset from a Kaggle project called “Music Recommendation System using Spotify Dataset”, for which a codebook is attached in our zip files. Below, we provide the list of general packages we used for this project (model-specific packages will be included in the model-building section), along with a preview of our dataset.

# list all packages here
library(tidyverse)
library(knitr)
library(lubridate)
library(httpuv)
library(cluster)
library(factoextra)
library(data.table)
library(dplyr)
library(ggplot2)
library(corrplot)

# load music.csv
music <- read.csv("/Users/kimbauer/Desktop/college work/pstat 131/project/Bauer_Cai_Navarra_131_Project/Data/musicdata.csv")

# music.csv preview
head(music)
##   valence year acousticness             artists danceability duration_ms energy
## 1   0.328 1960        0.547      ['Etta James']        0.274      179693  0.348
## 2   0.402 1960        0.829      ['Etta James']        0.421      196133  0.285
## 3   0.644 1960        0.764 ['Ella Fitzgerald']        0.508      175987  0.287
## 4   0.836 1960        0.733 ['Ella Fitzgerald']        0.579      131733  0.502
## 5   0.965 1960        0.699     ['Neil Sedaka']        0.743      139200  0.799
## 6   0.407 1960        0.878       ['Sam Cooke']        0.553      165560  0.291
##   explicit                     id instrumentalness key liveness loudness mode
## 1        0 4Hhv2vrOTy89HFRcjU3QOx         1.33e-02   5   0.3340   -8.631    1
## 2        0 0zGLlXbHlrAyBN1x6sY0rb         1.55e-06   2   0.2330   -9.430    0
## 3        0 4ukUoXLuFzMixyZyabSGc4         0.00e+00   1   0.1530  -12.472    1
## 4        0 65irrLqfCMRiO3p87P4C0D         0.00e+00   8   0.2810   -7.570    1
## 5        0 2x6pbpjVGjiWCcH89IK8AX         0.00e+00   8   0.0635   -5.466    0
## 6        0 0BFEyqJ9DJXS7gKg0Kj46R         0.00e+00   4   0.1290  -10.426    0
##                        name popularity release_date speechiness   tempo
## 1                   At Last         76         1960      0.0293  87.430
## 2     A Sunday Kind Of Love         70         1960      0.0293  85.861
## 3               Sleigh Ride         69       1/1/60      0.0523 154.759
## 4        Frosty The Snowman         69       1/1/60      0.0513  76.816
## 5 Breaking Up Is Hard to Do         62     12/30/60      0.0375 116.112
## 6               You Send Me         59       1/1/60      0.0301  96.217

Data Pre-processing

In our initial look at this dataset, we wanted to identify any concerns early on so that we could either address them in our data pre-processing or look for a new dataset if the concerns were too large to overcome. The first thing we checked for was missing values. After running a summary of the dataset, we found that there were none, which we took as a good sign. We did notice that some of the release_date values only contained the release year and not the exact release date of the song, but we decided that this would not be a problem for our exploration, as we had an additional column containing the release year of each song, which was much more consistent. We kept the release year column and decided to drop release_date later.
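
The check itself was simple; a sketch of what we ran is below (the full summary() output is omitted here for space).

# quick missing-value check (sketch of what we ran; full summary output omitted)
sum(is.na(music))   # returned 0, i.e. no missing values
summary(music)      # column-by-column summary of every variable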

Additionally, we were thrown off by the number of zero values we saw in our initial glance at the data, specifically in the columns explicit and mode. After further examination, we realized that they were coded as dummy variables, so this was not a problem for our exploration. We also determined that we did not want to drop either of these variables from our dataset, as we felt that whether a song is explicit could influence how popular it becomes, and whether a song is in a minor or major key could as well. Thus, we kept these dummy variables in our dataset and later coded them as factor variables.

Because our research question required a regression approach, we were concerned that our data had too many categorical variables to fit into a regression model. To address this problem, we thought about which variables would be most important to our data exploration. One of the variables that we thought would be most important to include was artists, as we know that the artist of a song can have a direct impact on the song’s popularity: a more popular artist is more likely to release a song that achieves higher popularity, and vice versa. With the artists column being string-valued categorical observations, and with there being so many different artists in our data, we wanted to find a way to incorporate this variable as a unique but quantitative variable. We decided that this would be the next step of our data pre-processing.

In order to keep the important variable of artists but make it quantitative, we decided to take the average song popularity by each artist in the data, and assign that variable of average artist popularity to each artist as a new variable in the dataset. Thus, our dataset would lose the column artists containing each artist’s name, and instead assign the variable avg_art_pop as the artist’s average song popularity. This variable mutation process is outlined in the code chunk below.

# create a new table that aggregates the variables popularity and artists and finds the mean song popularity for each given artist
artist_pop_table <- aggregate(music$popularity, list(music$artists), FUN = mean)

# merge the table above into our music dataset by artist
full_merge <- merge(x = music, y = artist_pop_table, by.x = c("artists"), by.y = c("Group.1"), all.x = TRUE)

# rename full dataset and x column
full_data <- rename(full_merge, avg_art_pop = x)

# delete original 'artist' column
delete1 <- c("artists")
full_data <- full_data[!(names(full_data) %in% delete1)]
head(full_data)
##   valence year acousticness danceability duration_ms energy explicit
## 1   0.460 2012       0.0280        0.730      279080  0.682        1
## 2   0.841 1964       0.0589        0.681      177427  0.499        0
## 3   0.952 1966       0.1460        0.616      133413  0.881        0
## 4   0.827 1966       0.1410        0.659      131480  0.573        0
## 5   0.678 1966       0.5060        0.776      149267  0.764        0
## 6   0.963 1966       0.1050        0.646      170827  0.926        0
##                       id instrumentalness key liveness loudness mode
## 1 55iiBrGOWT5W99VKy4Fk2C         2.87e-06   7   0.1140   -4.517    1
## 2 4v75K5mEyumdXCZU09kwAg         6.39e-06   7   0.0743   -6.894    1
## 3 63Vw8PU3ps8TAxDdHAUTMX         7.52e-01   5   0.0643   -6.357    1
## 4 40NHppXZVM0eYvsVseywWl         9.70e-05   2   0.0779  -10.183    1
## 5 0HmDZd0eP9PVMaZggRp6Em         0.00e+00   0   0.1760   -8.336    1
## 6 23MXQSnv6ZrZUiLy6DdnYl         8.00e-01   0   0.0884   -5.933    1
##                     name popularity release_date speechiness   tempo
## 1               Badlands         46      3/26/12      0.1430 147.988
## 2               96 Tears         42       1/1/64      0.0340 123.458
## 3        I Need Somebody         23         1966      0.0372 134.048
## 4            Ten O'clock         16         1966      0.0336  96.254
## 5 You're Telling Me Lies         15         1966      0.0802  88.574
## 6                Up Side         26         1966      0.0426 125.672
##   avg_art_pop
## 1    46.00000
## 2    25.11111
## 3    25.11111
## 4    25.11111
## 5    25.11111
## 6    25.11111

In addressing the categorical variables we had left in our dataset, we then set out to determine whether these variables would be important to our exploration and whether they would be feasible to incorporate quantitatively into our data (similar to artists above). The qualitative variables we had left (which we decided would be impractical to dummy code) were id, key, name, and release_date. We discussed each of these variables as a group and decided that none of them would be very useful or feasible to fit quantitatively into our final dataset, and we did not feel they would logically be important in predicting a given song’s popularity. This is because id served as a unique song identifier, most listeners would not be able to distinguish between the different keys, name could be almost anything an artist desired (and thus had no real consequence for popularity), and release_date was closely related to year, which we had already decided to keep in our data.

# delete variables that will not be useful in our exploration
delete2 <- c("id", "key", "name", "release_date")
full_data <- full_data[!(names(full_data) %in% delete2)]

Finally, after making sure that all of our quantitative variables were numeric (shown in code chunk below), our data was ready for some exploratory data analysis.

# mutating leftover int variables to num type
full_data$popularity <- as.numeric(full_data$popularity)
full_data$year <- as.numeric(full_data$year)
full_data$duration_ms <- as.numeric(full_data$duration_ms)

Exploratory Data Analysis

In our exploratory data analysis, we set out to explore the relationship between the various predictors and the response variable, popularity. First, an analysis of the distribution of the response popularity allowed us to check that it was fairly normally-distributed.

# adding a density histogram for our outcome variable popularity
hist <- ggplot(full_data) + 
  ggtitle("Density Histogram for Song Popularity (Outcome Variable)") + # include title
  geom_histogram(aes(popularity, y = after_stat(density)), binwidth = 5, fill = "cyan", color = "grey")  # add histogram w density
x <- seq(0, 100, length.out = 100)
df <- with(full_data, data.frame(x=x, y = dnorm(x, mean(popularity), sd(popularity))))
hist + geom_line(data = df, aes(x = x, y = y), color = "red")

We then subset the data to a more manageable size, as our data set had too many observations to plot, in order to examine the visual relationships between our predictors and popularity:

set.seed(123)
# sample data to use for exploratory graphics
sample <- full_data[sample(nrow(full_data), 200), ]

Then, we defined various simple scatter plots for our exploratory data analysis:

valence <- ggplot(sample, aes(x=valence, y=popularity)) + geom_point(color = "seagreen1") + geom_smooth(se = FALSE, color = "black")

year <- ggplot(sample, aes(x=year, y = popularity)) + geom_point(color = "dodgerblue2") + geom_smooth(se = FALSE, color = "black") 

acousticness <- ggplot(sample, aes(x=acousticness, y=popularity)) + geom_point(col = "mediumpurple1") 

danceability <- ggplot(sample, aes(x=danceability, y=popularity)) + geom_point(col = "lightpink") + geom_smooth(se = FALSE, color = "black")

duration <- ggplot(sample, aes(x=duration_ms, y=popularity)) + geom_point(col = "sienna1") 

energy <- ggplot(sample, aes(x=energy, y=popularity)) + geom_point(col = "firebrick3") + geom_smooth(se = FALSE, color = "black")

explicit <- ggplot(sample, aes(x=explicit)) + geom_bar(fill = "darkseagreen") + scale_x_continuous(breaks = c(0,1), labels=c("0" = "Clean", "1" = "Explicit"))

instrumentalness <- ggplot(sample, aes(x=instrumentalness, y=popularity)) + geom_point(col = "goldenrod1") 

liveness <- ggplot(sample, aes(x=liveness, y=popularity)) + geom_point(col = "deeppink1") 

loudness <- ggplot(sample, aes(x=loudness, y=popularity)) + geom_point(col = "khaki1") + geom_smooth(se = FALSE, color = "black")

mode <- ggplot(sample, aes(x=mode)) + geom_bar(fill = "plum") + scale_x_continuous(breaks=0:1,
              labels=c("Minor", "Major"))

speechiness <- ggplot(sample, aes(x=speechiness, y=popularity)) + geom_point(col = "green4")

tempo <- ggplot(sample, aes(x=tempo, y=popularity)) + geom_point(col = "springgreen") 

avg_art_pop <- ggplot(sample, aes(x=avg_art_pop, y=popularity)) + geom_point(col = "deepskyblue") + geom_smooth(se = FALSE, color = "black")

After defining our various plots, we accumulated them into a panel for better visibility:

library(ggpubr)
figure <- ggarrange(valence, year, acousticness, danceability, duration, energy, explicit, instrumentalness, liveness, loudness, mode, speechiness, tempo, avg_art_pop, ncol = 3, nrow = 5)
figure

There are several things that we gathered from the panel of plots: mainly, many of the predictors, such as valence, acousticness, danceability, duration_ms, and tempo, have fairly weak relationships with our response variable popularity. One can also see that many songs in the dataset sit near 0 for speechiness, liveness, and acousticness. Regarding the dummy variables explicit and mode, there is a clear skew towards clean songs and songs in a major key.

Looking specifically at the variables that have the strongest relationships with popularity, the two predictors year and avg_art_pop appear to have the strongest positive relationship with popularity. This makes intuitive sense.

There are several factors that could explain why year has a positive, near linear, relationship with popularity: for one, the population has increased from the 1960s until present day. With more people comes a more active listening base for artists to reach. Additionally, the methods by which we obtain music have changed over the decades, starting off with things like records and record players, CDs, iTunes, and now with music streaming through platforms like Spotify. With each changing method, listening to music has become more accessible over time, and thus more popular. While we didn’t expect year to be so highly correlated with song popularity, it does make logical sense given further thought.

It also makes intuitive sense why an artist’s average popularity would have a strong, positive relationship with popularity. Artists with high brand value or recognition, such as Kendrick Lamar, Ariana Grande, or Justin Bieber, are more likely to have new releases listened to by significantly more people than a more obscure or lesser-known artist such as Gus Dapperton or Peach Pit. This strong, positive relationship between our avg_art_pop variable and our response variable popularity is thus explainable given further thought.

For the next step of our exploratory data analysis, we wanted to take an even closer look at the relationships between our predictors and the response through a correlation matrix and a heatmap. At this point we also converted our dummy variables to factors (stored in full_data2), as mentioned earlier; the correlation matrix itself uses the original numeric coding in full_data so that explicit and mode can be included:

# mutating dummy variables 
full_data2 = full_data %>% 
  mutate(explicit = as.factor(ifelse(explicit == 0, "Clean", "Explicit"))) %>% 
  mutate(mode = as.factor(ifelse(mode == 0, "Minor", "Major")))

Finally, we created the correlation matrix and heatmap:

# correlation matrix
corr_mat <- cor(full_data)
corr_mat[, 12]
##          valence             year     acousticness     danceability 
##      -0.05766409       0.68198486      -0.24959846       0.18750900 
##      duration_ms           energy         explicit instrumentalness 
##      -0.02197240       0.19717541       0.26782335      -0.14840380 
##         liveness         loudness             mode       popularity 
##      -0.07600035       0.31230798      -0.06658299       1.00000000 
##      speechiness            tempo      avg_art_pop 
##       0.10043080       0.03306722       0.81967675
# heatmap
corrplot(corr_mat, type = "lower", order = "hclust", tl.col = "black", tl.srt = 45)

The heatmap above reinforced the notion that the two predictors with the strongest positive association with popularity were year and avg_art_pop. We thus had reason to believe that they would be influential and most important to include in our model building.

Model Fitting

After we explored our data visually, we began to build our models. We looked at the characteristics of our data, such as the number of observations, type of data (qualitative vs quantitative), etc., and did some outside research on models that might fit best with our data. We found that an elastic net model, a Bayesian additive regression trees (BART) model, a multivariate adaptive regression splines (MARS) model, and a k-nearest neighbors (k-NN) model would suit our regression-based research question and dataset well. We explored the reasons why we chose each model, we fitted each of these models to our training data, and then we calculated the test MSE of each model to compare later.

Splitting our data

Since our dataset has 120,750 observations (songs) in total, our data can safely be split into train and test sets. That way we could use the training data to fit our model and save the test data to evaluate how well our model predicted the data (and calculate the test MSE). We chose to use 70% of the data for training and 30% of the data for testing. To implement this, we set the seed for reproducibility, took a random sample of 70% of the dataset, and used that sample to create a training set. The remaining 30% of the dataset were used for the testing set.

# split training and testing data
#set seed
set.seed(123)
# Sample 70% of observations as training data 
trainsample = sort(sample(nrow(full_data), nrow(full_data)*.7))
# define train as the 70% of observations
train = full_data[trainsample,]
# The rest as test data
test = full_data[-trainsample,]

k-NN

The k-nearest neighbors regression algorithm is a non-parametric method that takes observations from different neighboring observations to approximate the association between the predictors and the outcome variable (in our case, song popularity). We found that k-NN was a good fit for our research question because the function can have many forms other than linear, which is what we were looking for.

One downside of the k-NN algorithm is that it is distance-based, so it works best when the predictors are on comparable scales (similar variances); by standardizing our predictors (as in the sketch below and in the model-building code), we put every variable on the same scale so that no single predictor dominated the distance calculation. Furthermore, k-NN makes it difficult to “predict into the future”, and, alongside the curse of dimensionality, it is not very interpretable. In spite of these downsides, we moved forward with k-NN, as we knew it would be suitable for our purposes.
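
As a quick, informal check of this assumption, one can compare the predictor variances before and after standardizing; the scale() call below mirrors the standardization we apply in the model-building code, and is an illustrative sketch rather than required output.

# informal check of the comparable-scale assumption (illustrative, not required by the model code)
raw_vars <- sapply(train %>% select(-popularity), var)
round(raw_vars, 3)      # raw variances differ by orders of magnitude (e.g. duration_ms vs valence)
scaled_vars <- sapply(as.data.frame(scale(train %>% select(-popularity))), var)
round(scaled_vars, 3)   # all approximately 1 after standardizing with scale()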

To build our k-NN model, we first loaded the required packages and defined the do.chunk() function for k-fold cross-validation. The folddef argument is an index that assigns every observation in the design matrix (and its response) to one of the k folds, and chunkid sets which fold is held out as the validation set. For the k-NN model itself, our tuning parameter was k, the number of neighbors. The do.chunk() function trains the model on all folds except the chunkid fold and computes the error on that held-out fold; the resulting data frame records, for each fold, its training and validation error.

library(ISLR) # load libraries
library(tidyverse)
library(class)
library(FNN)

do.chunk <- function(chunkid, folddef, Xdat, Ydat,...){ # for k-fold cross validation
  train = (folddef!= chunkid)

  Xtr = Xdat[train,] # split x train
  Ytr = Ydat[train] # split y train

  Xval = Xdat[!train,] # split x validation
  Yval = Ydat[!train] # split y validation
  
  predYtr = knn(train = Xtr, test = Xtr, cl = Ytr,...) # k-NN predictions on the training folds
  predYvl = knn(train = Xtr, test = Xval, cl = Ytr,...) # k-NN predictions on the validation fold

  data.frame(fold = chunkid,
             train.error = mean(predYtr != Ytr), 
             val.error = mean(predYvl != Yval)) # return the fold id with its training and validation error
}

Next, we specified that we wanted 10 folds and divided the data into ten intervals.

nfold = 10 # 10 folds
set.seed(100) # for reproducibility
folds = cut(1:nrow(train), breaks = nfold, labels = FALSE) %>% sample() # divide data

We decided to consider up to 50 neighbors, so for each candidate k we looped through all ten fold ids.

error.folds = NULL
# Give possible number of nearest neighbours to be considered
allK = 1:50
# Set seed since do.chunk() contains a random component induced by knn()
set.seed(888)

Ytrainknn <- train$popularity # from lab 4
Xtrainknn <- train %>% select(-popularity) %>% scale(center = TRUE, scale = TRUE)

Ytestknn <- test$popularity
Xtestknn <- test %>% select(-popularity) %>% scale(center = TRUE, scale = TRUE)

# Loop through different number of neighbors
for (k in allK){
# Loop through different chunk id 
  for (j in seq(10)){
    tmp = do.chunk(chunkid=j, folddef=folds, Xdat=Xtrainknn,
                   Ydat=Ytrainknn, k=k)
    tmp$neighbors = k # record the number of neighbors used
    error.folds = rbind(error.folds, tmp) # combine results
  }
}

head(error.folds, 10)
##    fold train.error val.error neighbors
## 1     1 0.008649700 0.9558737         1
## 2     2 0.009043945 0.9504259         1
## 3     3 0.009030918 0.9564652         1
## 4     4 0.008754749 0.9549219         1
## 5     5 0.008728573 0.9532710         1
## 6     6 0.008820475 0.9583531         1
## 7     7 0.008754749 0.9555135         1
## 8     8 0.009070354 0.9517331         1
## 9     9 0.008807330 0.9555135         1
## 10   10 0.008912609 0.9587129         1

Next, we used error.folds to decide the optimal number of neighbors for k-NN. To do so, we reshaped the results so that each row records the fold, the number of neighbors, the type of error (training or validation), and its value, using the melt() function as shown below.

# Transform the format of error.folds for further convenience
errors = melt(error.folds, id.vars=c('fold', 'neighbors'), value.name='error')
# Choose the number of neighbors which minimizes validation error
val.error.means = errors %>%
# Select all rows of validation errors 
  filter(variable=='val.error') %>%
# Group the selected data frame by neighbors 
  group_by(neighbors, variable) %>%
# Calculate CV error rate for each k 
  summarise_each(funs(mean), error) %>%
# Remove existing group
  ungroup() %>%
  filter(error==min(error))
# Best number of neighbors
# if there is a tie, pick larger number of neighbors for simpler model 
numneighbor = max(val.error.means$neighbors)
numneighbor
## [1] 50

The 10-fold cross-validation found that k = 50 neighbors would be best, which was the largest value we considered. This gave us pause, and we wanted to see whether the model would overfit the training data or otherwise fail to generalize. To check, we calculated the training and test MSE using k-NN regression.

# train knn regressor and make predictions on training set using the selected k = 50
set.seed(888)
pred.Ytrainknn = knn.reg(train = Xtrainknn, test = Xtrainknn, y = Ytrainknn, k = numneighbor)
# get training MSE
mean((pred.Ytrainknn$pred - Ytrainknn)^2)
## [1] 79.66631
set.seed(888)
# now make predictions on test set
pred.Ytestknn = knn.reg(train = Xtrainknn, test = Xtestknn, y = Ytrainknn, k = numneighbor)
mean((pred.Ytestknn$pred - Ytestknn)^2) 
## [1] 82.41085

The training MSE was 79.66631 and the test MSE was 82.41085. The test MSE being somewhat higher than the training MSE suggested slight overfitting, so we were wary of choosing k-NN as our final model.

Elastic Net

The next model we fit was the elastic net, a combination of the ridge and lasso techniques. The elastic net is a regularized linear regression that combines the L1 and L2 penalty functions of the lasso and ridge methods, so it performs variable selection and regularization simultaneously. We found the elastic net preferable to ridge or lasso alone because it blends the two penalties, performing lasso-style variable selection while retaining ridge-style shrinkage.

The idea of the elastic net is to minimize the residual sum of squares and add a penalty term:

\[\sum\limits_{i=1}^n (y_i - \beta_0 - \sum\limits_{j=1}^p \beta_j x_{ij})^2 + \lambda[(1 - \alpha)\ ||\beta||_2^2/2 + \alpha\ ||\beta||_1]\] where \(||\beta||_1\) takes the form of L1 (L1 norm): \[||\beta||_1 = \sum\limits_{j=1}^p |\beta_j|\]

and \(||\beta||_2\) takes the form of L2 (Euclidean norm): \[||\beta||_2 = \sqrt{\sum\limits_{j=1}^p \beta_j^2}\]
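
For intuition, the role of the mixing parameter alpha can be seen in a direct glmnet() call: alpha = 0 gives ridge, alpha = 1 gives lasso, and anything in between gives the elastic net. The snippet below is a minimal sketch on toy data, not the tuned model we fit later.

# minimal sketch of the alpha mixing parameter in glmnet() (toy data, not our model)
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 5), ncol = 5)   # toy predictor matrix
y <- rnorm(100)                          # toy numeric response
fit_ridge <- glmnet(X, y, alpha = 0)     # alpha = 0: pure L2 penalty (ridge)
fit_lasso <- glmnet(X, y, alpha = 1)     # alpha = 1: pure L1 penalty (lasso)
fit_enet  <- glmnet(X, y, alpha = 0.5)   # 0 < alpha < 1: elastic net blend of the two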

knitr::include_graphics("/Users/kimbauer/Desktop/college work/pstat 131/project/Bauer_Cai_Navarra_131_Project/Figures/elasticnetimage.png")

This model was a good choice for our data because it is a regression model, which fits our mostly quantitative variables, and it improves on the lasso and ridge techniques. Additionally, the ridge and lasso penalties handle collinearity among predictors, which eliminated our need to investigate collinearity separately. Below, we load the required packages for elastic net model fitting, re-create the training and testing data in matrix form (which glmnet requires), and begin fitting our model with built-in tuning and cross-validation.

# load packages
library(dplyr) # data manipulation
library(glmnet) # elastic net 
library(ggplot2) # for visualization
library(caret) # classification and regression training

# creating new train and test sets as matrices (from lab 5)
el_X <- model.matrix(popularity ~ ., full_data)[, -1]
el_Y <- full_data$popularity
set.seed(123)
eltrain = sort(sample(nrow(el_X), nrow(el_X)*.7)) # same as train sample above
eltest = (-eltrain)
x.eltrain = el_X[eltrain,]
y.eltrain = el_Y[eltrain]
x.eltest = el_X[eltest,]
y.eltest = el_Y[eltest]

# model building on training sets
control <- trainControl(method = "repeatedcv", 
                        number = 5, # 5 fold
                        repeats = 5, # 5 repeats
                        search = "random", 
                        verboseIter = TRUE)
# training elastic net regression model
elastic_model <- train(x = x.eltrain, y = y.eltrain,
                       method = "glmnet",
                       preProcess = c("center", "scale"),
                       tuneLength = 25, # tuning the model
                       trControl = control) # our cross-validation defined above

As stated above, our elastic net fit uses caret’s built-in cross-validation and random tuning to select the optimal alpha and lambda values. Below, we print the resampling results across the tuning parameters, plot the tuned model, and print the values from the best-tuned model.

elastic_model 
## glmnet 
## 
## 84525 samples
##    14 predictor
## 
## Pre-processing: centered (14), scaled (14) 
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 67621, 67619, 67620, 67620, 67620, 67621, ... 
## Resampling results across tuning parameters:
## 
##   alpha       lambda       RMSE       Rsquared   MAE     
##   0.06876433  0.113432800   8.735923  0.6847304  6.361206
##   0.17250896  0.149374972   8.736850  0.6846874  6.361618
##   0.20867316  1.714738322   8.847165  0.6815835  6.557331
##   0.21125721  3.394815536   9.045809  0.6783818  6.825648
##   0.29286574  0.465784791   8.752501  0.6840097  6.388511
##   0.31549376  7.029300792   9.743641  0.6771812  7.601590
##   0.35680771  0.001551316   8.735681  0.6847318  6.354317
##   0.39223995  4.100785818   9.233766  0.6800352  7.063697
##   0.48713970  0.971739738   8.805351  0.6822081  6.469126
##   0.52333486  0.072676349   8.736796  0.6846624  6.353545
##   0.53708309  0.217686804   8.744468  0.6842437  6.362709
##   0.55336682  2.737403145   9.045971  0.6818364  6.839039
##   0.58332201  0.008188642   8.735672  0.6847297  6.352686
##   0.66370131  0.005493856   8.735676  0.6847288  6.352366
##   0.66803380  6.064987237  10.094770  0.6819633  7.986012
##   0.69651918  0.001657071   8.735681  0.6847283  6.352238
##   0.81856105  0.021962223   8.735706  0.6847262  6.351891
##   0.82267651  1.898387317   8.957145  0.6819579  6.707539
##   0.82977965  0.154433921   8.744647  0.6841816  6.356173
##   0.83235296  2.524038284   9.098348  0.6816817  6.900946
##   0.87944868  0.008348323   8.735705  0.6847261  6.351754
##   0.88215644  0.001898939   8.735704  0.6847261  6.351749
##   0.88673137  0.188647932   8.749121  0.6839133  6.359207
##   0.89035794  3.971283445   9.608186  0.6788287  7.495460
##   0.89747707  0.020774229   8.735702  0.6847262  6.351708
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.583322 and lambda
##  = 0.008188642.
# RMSE was used to select the optimal model using the smallest value.

# plot
plot(elastic_model, main = "Elastic Net Regression")

# print out values from best tuned model
best_elastic <- elastic_model$results %>%
  filter(alpha == elastic_model$bestTune$alpha, lambda == elastic_model$bestTune$lambda)
best_elastic
##      alpha      lambda     RMSE  Rsquared      MAE     RMSESD  RsquaredSD
## 1 0.583322 0.008188642 8.735672 0.6847297 6.352686 0.05325549 0.003896935
##        MAESD
## 1 0.02982672

Once our elastic net model was trained, we wanted to make a prediction on our test data. To do this, we used the predict() function on our tuned and cross-validated model, and then we calculated the test MSE, which is printed out below.

# elastic model prediction on test data
elastic.pred <- predict(elastic_model, x.eltest)

# evaluate mse on the test data
elastic_mse <- mean((elastic.pred - y.eltest)^2)
elastic_mse
## [1] 76.22993

This test MSE of 76.22993 is smaller than the one we calculated for k-NN, so we continued with our model building to see how the remaining models compared.

Bayesian Additive Regression Trees (BART)

The BART method uses decision trees for regression and constructs each tree randomly, similar to bagging and random forest models. Each new tree is trying to capture a “signal” that is not accounted for by the rest of the trees. BART is related to both the bagging and random forest algorithms, as well as the boosting algorithm, in that each tree is constructed randomly and each tree tries to capture a signal not yet accounted for by the current model. In the BART model, K is the number of regression trees and B is the number of iterations that the BART will run through. Each tree in BART is perturbed to avoid local minima and achieve a more thorough exploration of the model space (ISLR Ch 8).

Below is a description of the algorithm we used to construct our BART model.

knitr::include_graphics("/Users/kimbauer/Desktop/college work/pstat 131/project/Bauer_Cai_Navarra_131_Project/Figures/BARTalgorithm.png")

First, we took our training and testing data and split them into Xtrain, Ytrain, Xtest, and Ytest for the BART model.

library(BART)

# BART Implementation
Ytrainbart <- train$popularity # from lab 4
Xtrainbart <- train %>% select(-popularity) %>% scale(center = TRUE, scale = TRUE)

Ytestbart <- test$popularity
Xtestbart <- test %>% select(-popularity) %>% scale(center = TRUE, scale = TRUE)
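
For reference, the quantities described above map onto arguments of gbart(): ntree corresponds to the number of trees K, while ndpost (after nskip burn-in draws) controls the number of kept MCMC iterations. Below is a minimal sketch of a direct gbart() call on these matrices with illustrative settings; it is not one of the cross-validated fits we run next.

# minimal sketch of a direct gbart() call (illustrative settings, not our cross-validated fits)
set.seed(1)
bart_sketch <- gbart(
  x.train = Xtrainbart,  # standardized training predictors
  y.train = Ytrainbart,  # training popularity values
  x.test  = Xtestbart,   # test predictors for out-of-sample predictions
  ntree   = 200,         # K: number of regression trees in the sum
  ndpost  = 1000,        # posterior draws kept (iterations after burn-in)
  nskip   = 100          # burn-in iterations discarded
)
yhat_test <- bart_sketch$yhat.test.mean  # posterior mean prediction for each test song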

We then implemented k-fold cross validation with our BART model, which is similar to the k-NN cross validation that we did above:

# do chunk k-fold cv with BART
do.chunk.bart <- function(chunkid, folddef, Xdat, Ydat, ...){
  # Get training index
  train = (folddef!=chunkid)
  # Get training set by the above index
  Xtr = Xdat[train,]
  # Get responses in training set
  Ytr = Ydat[train]
  # Get validation set
  Xval = Xdat[!train,]
  # Get responses in validation set
  Yval = Ydat[!train]
  # Predict training labels
  predYtr = gbart(Xtr, Ytr, x.test = Xtr)
  # Predict validation labels
  predYval <<- gbart(Xtr, Ytr, x.test = Xval)
  data.frame(fold = chunkid,
    train.error = mean((Ytr - predYtr$yhat.train.mean)^2), # Training MSE for this fold (posterior mean predictions)
    val.error = mean((Yval - predYval$yhat.test.mean)^2))  # Validation MSE for this fold (posterior mean predictions)
}
# k-fold CV on BART
nfold = 5
# cut: divides all training observations into 5 intervals;
# labels = FALSE instructs R to use integers to code different intervals
set.seed(3)
folds = cut(1:nrow(train), breaks=nfold, labels=FALSE) %>% sample()
head(folds) # preview the fold assignments

# Set error.folds (a vector) to save validation errors in future
error.folds = NULL
# Set seed since do.chunk.bart() contains a random component induced by gbart()
set.seed(888)
# Loop through different chunk id
for (i in seq(5)){
  tmp = do.chunk.bart(chunkid=i, folddef=folds, Xdat=Xtrainbart, Ydat=Ytrainbart)
  error.folds = rbind(error.folds, tmp)
}
head(error.folds)

Finally, we were able to get the test MSE after we iterated through a 5-fold cross validation.

# Test mse
test_mse = mean((Ytestbart - predYval$yhat.test)^2)
test_mse
## [1] 407.7647

A test MSE of 407.7647 was surprisingly high and demonstrated that BART did not perform well for our data.

Multivariate Adaptive Regression Splines (MARS)

The final algorithm we implemented was the multivariate adaptive regression splines (MARS). The MARS model is an ensemble of piece-wise linear functions that are joined together by hinge functions. This accounts for non-linearity between inputs and outputs. The MARS algorithm has two stages: a forward stage and a backward stage. The forward stage generates many pairs of functions as candidates for the basis of the model. The backward stage is considered the pruning stage; the model goes through all of the functions and deletes the functions that don’t add further value to the model performance.
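
To make the hinge-function idea concrete, below is a toy sketch (separate from our analysis) of the mirrored hinge pair that MARS uses as building blocks, where the knot c splits the range of a predictor.

# toy illustration of MARS hinge basis functions (not part of our model code)
hinge        <- function(x, knot) pmax(0, x - knot)  # h(x - c): zero to the left of the knot
mirror_hinge <- function(x, knot) pmax(0, knot - x)  # h(c - x): zero to the right of the knot

x <- seq(0, 1, length.out = 100)
knot <- 0.4
# a MARS fit is a weighted sum of such terms (coefficients here are arbitrary)
f_hat <- 2 + 3 * hinge(x, knot) - 1.5 * mirror_hinge(x, knot)
plot(x, f_hat, type = "l", main = "Piecewise-linear curve built from two hinge functions")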

We chose to use the MARS algorithm because we saw that the relationship between popularity and our predictors is often non-linear. Additionally, we had many predictor variables and the MARS algorithm was capable of achieving good performance on a regression problem with many predictor variables.

In a nutshell, the general MARS method works like this: 1. Divide the range of the data into pieces at knots. 2. Fit a linear basis function to each piece. 3. Use k-fold cross-validation to choose how many pieces (terms) to keep.

Our first step was to create a tuning grid for pruning and set a seed for reproducibility:

library(earth) # library for fitting MARS models
library(caret) # library for tuning the model parameters

trainmars <- train # copy dataset for mars model

#create a tuning grid
hyper_grid <- expand.grid(degree = 1:3,
                          nprune = seq(2, 50, length.out = 10) %>%
                          floor())

#make this example reproducible
set.seed(123)

Next, we fit the MARS model into a 10-fold cross validation:

#fit MARS model using k-fold cross-validation
cv_mars <- train(
  x = subset(trainmars, select = -c(popularity)),
  y = trainmars$popularity,
  method = "earth",
  metric = "RMSE",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid = hyper_grid)

Finally, we extracted the tuned model with the lowest cross-validated RMSE:

#display model with lowest cross-validated RMSE
mars_model <- cv_mars$results %>%
  filter(nprune == cv_mars$bestTune$nprune, degree == cv_mars$bestTune$degree)   

mars_model
##   degree nprune     RMSE  Rsquared      MAE    RMSESD RsquaredSD    MAESD
## 1      3     18 8.621526 0.6928844 6.265727 0.1530968 0.01070354 0.102432

This result shows that the best model found by the MARS method was of degree 3 with 18 terms retained (nprune = 18). The R-squared was 0.6928, which indicates a reasonably strong fit, and the cross-validated RMSE was 8.6215.

Below is a graph of the cross-validated RMSE by number of terms and degree; one can see that the third degree has the lowest RMSE, and the number of terms minimizes RMSE between 10 and 20 (hence our value of 18 makes sense).

#display test RMSE by terms and degree
ggplot(cv_mars)

To compare this model to the other models we fit, we calculated the test MSE:

# MARS model prediction on test data
mars.pred <- predict(cv_mars, test)

# evaluate mse on the test data
mars_mse <- mean((mars.pred - test$popularity)^2)
mars_mse
## [1] 74.13321

Thus, we found a test MSE value of 74.13321 from our model using multivariate adaptive regression splines.

Comparing models

We decided that the model with the lowest test MSE best fit the data. From highest test MSE to lowest, the models were as follows (each “percentiles off” figure is the square root of the model’s test MSE, i.e., its RMSE; a quick conversion is sketched after the list):

  • BART had a test MSE of 407.76, so on average, it was 20.19 percentiles off the true popularity.
  • k-NN had a test MSE of 82.41, so on average, it was 9.08 percentiles off the actual song popularity.
  • Elastic net had a test MSE of 76.23, so on average, it was 8.73 percentiles off the actual song popularity.
  • MARS had a test MSE of 74.13, so on average, it was 8.6 percentiles off the actual song popularity.
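
For reference, the sketch below shows the MSE-to-RMSE conversion behind those figures, using the test MSE values reported above.

# convert each model's test MSE to an RMSE, the average error in popularity percentiles
test_mses <- c(BART = 407.76, kNN = 82.41, elastic_net = 76.23, MARS = 74.13)
round(sqrt(test_mses), 2)
#        BART         kNN elastic_net        MARS 
#       20.19        9.08        8.73        8.61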

Based on these results, the best performing model of the four we chose was MARS.

Performance of our Prediction

With MARS being our best-performing model of the four we chose to explore, we wanted to evaluate how well it did in predicting on our test data. Below, we reproduced a plot we had shown above of the test RMSE of our tuned MARS model by terms and degree.

plot(cv_mars)

The MARS model having a test MSE of 74.13 and an RMSE of about 8.6 (as displayed in the plot above) indicated that, on average, our MARS prediction of song popularity was about 8.6 percentiles off the actual song popularity for each of the roughly 36,000 test observations.

While machine learning algorithms have come close to dissecting the attributes of a hit song, music is a very human topic. No algorithm can replace the human ear in listening to music. Because our topic of music was quite subjective, there were inherently many outliers and a large portion of randomness. Thus, we found our average residual of about 8 percentage points to be relatively low; our model did a great job with prediction!

To determine how well the model predictions performed in comparison to the test data, we also looked at a plot of the predictions against the true values. In the plot below, if the values were perfectly predicted, the points would fall along the y = x line, since the \(\hat{y}\) values would match the \(y\) values.

plot(test$popularity, mars.pred, xlab="actual", ylab="predicted", col = "darkseagreen", cex = 0.6) # scatterplot of actual vs predicted
abline(0, 1) # add the y = x reference line

From the above plot, we can see that our prediction was decent, as the points follow relatively closely to the y=x line (shown in black).

We can take an even closer look at this relationship between actual and predicted values by plotting a smaller sample of the data, which makes the individual points easier to see. This sample is taken below:

resid <- test$popularity - mars.pred # get residuals
df_mars_visual <- data.frame(test$popularity, mars.pred, resid) # data frame of actual, predicted, and residuals 
sample_mars <- df_mars_visual[sample(nrow(df_mars_visual), 300), ] # 300 observation sample of the dataframe

And the above plot of actual vs. predicted is replicated with this sample below:

plot(sample_mars$test.popularity, sample_mars$y, xlab= 'actual', ylab= 'predicted', main = "Sample: predicted vs actual values using MARS", col = "skyblue", cex = 0.7) # same scatter with sampled data
abline(0, 1) # add the y = x reference line

By plotting a sample of the data, we can see the relationship between actual and predicted values much more clearly, and the points fall relatively close to the y = x line.

We can also evaluate our prediction by making a plot of the residuals:

plot(sample_mars$y.1, ylab = 'MARS residuals', col = 'blue', cex = 0.4) # scatter of residuals
abline(0, 0, col = 'red') # horizontal line at zero

From this plot, we can see that the residuals are fairly evenly spread. The highest residuals are around 30 points off, with most residuals within 10 points off.

From all of these examinations of the performance of our MARS prediction, we can conclude that it performed fairly well for our purposes, and thus we were able to make a decently good prediction of song popularity based on a song’s attributes.

Conclusion

Summary

Thus, we have answered our research question of “Can we predict a song’s popularity based on its attributes?” by going through the entire machine learning exploration process. We started by tidying our data in preparation for a regression model and adding the variable avg_art_pop, the average popularity of all of an artist’s songs. The tidying process took longer than expected because we encountered dataset issues: some datasets were too large for GitHub, some didn’t have the key variables we were looking for, and so on, and choosing the right one took a lot of trial and error.

Next, we plotted our variables of interest to identify key patterns. We made a scatterplot of every quantitative variable in our models and a bar plot of every categorical variable we had. This allowed us to get a glimpse of which predictors might be influential and make sure that we were aware of any patterns that might violate an assumption in our models. For this reason, we decided not to run a linear regression model because our data violated too many of the model assumptions! We also created a correlation matrix and a heatmap to see which predictors were most correlated with song popularity, and which predictors varied together (collinearity). Finally, we made a density histogram of song popularity and compared it to a normal curve - it was nearly a perfect normal distribution!

Once we completed the exploratory data analysis, we were able to research the models that might best fit our data. We looked into aspects such as number of predictors, best prediction accuracy, model flexibility and interpretability, and other model specifications mentioned in lecture. Finally, we landed on the following algorithms: k-nearest neighbors, elastic net, Bayesian additive regression trees (BART), and multivariate adaptive regression splines (MARS). After fitting, tuning, and running cross-validation on each model, we found that the model with the lowest MSE was the MARS.

To finish our project, we tuned and ran a prediction using the MARS model to answer our research question. We observed that it did a fairly good job of predicting popularity. The test MSE was 74.13, which we found to be fairly low given the data. Since song popularity is measured as a percentile between 0 and 100, the MARS model performed decently well, with its predictions of song popularity being an average of about 8.6 percentiles off the true song popularity.

Reflection

In concluding our machine learning project, we reflected on some aspects of the project we would improve on or do differently if given the opportunity to re-do it. One area we felt we could improve on was the definition of our avg_art_pop variable. Toward the end of the project, we realized that this variable may violate a basic rule of prediction: because it averages the popularity of all of an artist’s songs, it can include information from songs released after the observation being predicted. For example, if an observation is Katy Perry’s “Spit”, released in 2003, the avg_art_pop value for that observation includes Katy Perry’s future hits, released after 2003, in its average. If we were to re-do this project, we could address this problem by making average artist popularity rely only on the songs that chronologically come before each observation.
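
As a rough sketch of one way this could be done (not something we implemented), a cumulative average over each artist’s earlier releases would avoid borrowing information from the future:

# sketch: a leakage-free version of artist popularity (not part of our actual pipeline)
# for each song, average only the popularity of that artist's earlier releases
music_no_leak <- music %>%
  group_by(artists) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(prior_art_pop = ifelse(row_number() == 1,
                                NA,  # an artist's first release has no history to average
                                (cumsum(popularity) - popularity) / (row_number() - 1))) %>%
  ungroup()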

Another area we can improve on for our next project would be in our decision making regarding our research question and the data we use to answer it. Our approach in this project was a bit misguided; we were unsure of the topic we wanted to focus on, so we started by looking at some general data sets and seeing what caught our eye. After some discussion, we decided on pursuing something within our shared interest of music, but we began looking for a dataset before we were certain about the research question we wanted to explore. We thought that we would be able to create our research question from a data set we decided on. As a result, we went through multiple datasets and multiple research questions before we were able to finally settle on the ones presented above. In hindsight, we feel that we would have been able to save time and labor by deciding on a topic and creating our research question before beginning our search for data.

Based on this exploration into song popularity and building of a machine learning model, we learned that the factors that most influence a song’s popularity are the artist’s own popularity and the year the song was released in. For future artists, this means building a brand and gaining popularity and recognition so that songs that you do release will be more noticed. In addition, with so many technologies and tools available to artists to spread their work and their brand (like Spotify, Apple Music, social media, and marketing firms) there is no time like the present to get into the music industry. Artists who maximize the utility of these resources have the greatest chance at success.

Sources

Introduction to Statistical Learning with Applications in R (James et. al.) Chapter 4, 8

PSTAT 131 Lab 2: k-NN

PSTAT 131 Lab 4: Cross-Validation

PSTAT 131 Lab 5: Ridge-Lasso

Kaggle dataset

Elastic Net model information

Elastic Net code

Elastic Net code

BART and overfitting

MARS model information

MARS code

MARS plots