For our PSTAT 131 final project, we chose to explore Spotify music and the attributes that make a song popular because we all listen to music frequently and have specific songs that we enjoy. Something we’ve always wondered about is why we enjoy a particular song and, as a consequence, how certain songs become more popular than others. Is it the artist singing the song, the fact that we’ve heard it on the radio, the beats per minute, or whether the lyrics are explicit? To answer these questions, we set specific goals: get to know our data thoroughly, create visualizations of the attributes relevant to our question, run regression models to find which attributes are most relevant, select and fine-tune a model, and convey our results in an easily understandable report. This brings us to our research question: can we predict a song’s popularity based on its attributes?
Many current hits are actually created with the help of machine learning algorithms. Whether they are enjoyable to listen to is another topic, but nevertheless these machine-created hits successfully used algorithms for lyrics, beats, timing, and other factors to make these songs popular. This topic inspired our research question, especially when we wondered which factors in these algorithms carried the most weight in the song creation.
The first step of our machine learning project was simple: find a dataset. However, it was not quite as simple as we had originally anticipated. We went through a multitude of datasets looking for one that did not present problems such as too many missing values, too many observations, too few observations, and more. This process took much longer than we expected, as it consisted of downloading multiple files and attempting to upload them to git, only to discover that they didn’t satisfy all of the data needs of our research question.
Finally, we landed on a dataset from a Kaggle project called “Music Recommendation System using Spotify Dataset”, for which a codebook is attached in our zip files. Below, we provide the list of general packages we used for this project (model-specific packages will be included in the model-building section), along with a preview of our dataset.
# list all packages here
library(tidyverse)
library(knitr)
library(lubridate)
library(httpuv)
library(cluster)
library(factoextra)
library(data.table)
library(dplyr)
library(ggplot2)
library(corrplot)
# load music.csv
music <- read.csv("/Users/kimbauer/Desktop/college work/pstat 131/project/Bauer_Cai_Navarra_131_Project/Data/musicdata.csv")
# music.csv preview
head(music)
## valence year acousticness artists danceability duration_ms energy
## 1 0.328 1960 0.547 ['Etta James'] 0.274 179693 0.348
## 2 0.402 1960 0.829 ['Etta James'] 0.421 196133 0.285
## 3 0.644 1960 0.764 ['Ella Fitzgerald'] 0.508 175987 0.287
## 4 0.836 1960 0.733 ['Ella Fitzgerald'] 0.579 131733 0.502
## 5 0.965 1960 0.699 ['Neil Sedaka'] 0.743 139200 0.799
## 6 0.407 1960 0.878 ['Sam Cooke'] 0.553 165560 0.291
## explicit id instrumentalness key liveness loudness mode
## 1 0 4Hhv2vrOTy89HFRcjU3QOx 1.33e-02 5 0.3340 -8.631 1
## 2 0 0zGLlXbHlrAyBN1x6sY0rb 1.55e-06 2 0.2330 -9.430 0
## 3 0 4ukUoXLuFzMixyZyabSGc4 0.00e+00 1 0.1530 -12.472 1
## 4 0 65irrLqfCMRiO3p87P4C0D 0.00e+00 8 0.2810 -7.570 1
## 5 0 2x6pbpjVGjiWCcH89IK8AX 0.00e+00 8 0.0635 -5.466 0
## 6 0 0BFEyqJ9DJXS7gKg0Kj46R 0.00e+00 4 0.1290 -10.426 0
## name popularity release_date speechiness tempo
## 1 At Last 76 1960 0.0293 87.430
## 2 A Sunday Kind Of Love 70 1960 0.0293 85.861
## 3 Sleigh Ride 69 1/1/60 0.0523 154.759
## 4 Frosty The Snowman 69 1/1/60 0.0513 76.816
## 5 Breaking Up Is Hard to Do 62 12/30/60 0.0375 116.112
## 6 You Send Me 59 1/1/60 0.0301 96.217
In our initial look at this dataset, we wanted to identify any concerns early on so that we could either address them in our data pre-processing or look for a new dataset if the concerns were too large to overcome. The first thing we checked for was missing values. After running a summary of the dataset, we found that there were no missing values, which we took as a good sign. We did notice that some of the release_date values only contained the release year and not the exact release date of the song, but we decided that this would not be a problem for our exploration, as we had an additional column containing the release year of each song, which was much more consistent. We kept the release year column and decided to drop release_date later.
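As a quick illustration of the missing-value check described above, a minimal sketch using the music data frame loaded earlier (summary(music) gives the same information alongside other summary statistics):
# count missing values in each column; all zeros means there are no NAs to handle
colSums(is.na(music))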
Additionally, we were initially thrown off by the number of zero values we saw in our first glance at the data, specifically in the columns explicit and mode. After further examination, we realized that they were coded as dummy variables, so this was not a problem for our exploration. We also determined that we did not want to drop either of these variables from our dataset, as we felt that whether a song is explicit could influence how popular it becomes, and whether a song is in a minor or a major key could do the same. Thus, we kept these dummy variables in our dataset, and we later recoded them as factor variables.
Because our research question required us to take a regression approach, we were concerned that our data had too many categorical variables to fit into our regression model. To address this problem, we thought about which variables would be most important to our data exploration. One of the variables that we thought would be most important to include was artists, as we know that the artist of a song can have a direct impact on the song’s popularity: a more popular artist is more likely to release a song that reaches higher popularity, and vice versa. With the artists column being string-valued categorical observations, and with there being so many different artists in our data, we wanted to find a way to incorporate this variable as a unique but quantitative variable. We decided that this would be the next step of our data pre-processing.
In order to keep the important variable of artists
but make it quantitative, we decided to take the average song popularity by each artist in the data, and assign that variable of average artist popularity to each artist as a new variable in the dataset. Thus, our dataset would lose the column artists
containing each artist’s name, and instead assign the variable avg_art_pop
as the artist’s average song popularity. This variable mutation process is outlined in the code chunk below.
# create a new table that aggregates the variables popularity and artists and finds the mean song popularity for each given artist
artist_pop_table <- aggregate(music$popularity, list(music$artists), FUN = mean)
# merge the table above into our music dataset by artist
full_merge <- merge(x = music, y = artist_pop_table, by.x = c("artists"), by.y = c("Group.1"), all.x = TRUE)
# rename full dataset and x column
full_data <- rename(full_merge, avg_art_pop = x)
# delete original 'artist' column
delete1 <- c("artists")
full_data <- full_data[!(names(full_data) %in% delete1)]
head(full_data)
## valence year acousticness danceability duration_ms energy explicit
## 1 0.460 2012 0.0280 0.730 279080 0.682 1
## 2 0.841 1964 0.0589 0.681 177427 0.499 0
## 3 0.952 1966 0.1460 0.616 133413 0.881 0
## 4 0.827 1966 0.1410 0.659 131480 0.573 0
## 5 0.678 1966 0.5060 0.776 149267 0.764 0
## 6 0.963 1966 0.1050 0.646 170827 0.926 0
## id instrumentalness key liveness loudness mode
## 1 55iiBrGOWT5W99VKy4Fk2C 2.87e-06 7 0.1140 -4.517 1
## 2 4v75K5mEyumdXCZU09kwAg 6.39e-06 7 0.0743 -6.894 1
## 3 63Vw8PU3ps8TAxDdHAUTMX 7.52e-01 5 0.0643 -6.357 1
## 4 40NHppXZVM0eYvsVseywWl 9.70e-05 2 0.0779 -10.183 1
## 5 0HmDZd0eP9PVMaZggRp6Em 0.00e+00 0 0.1760 -8.336 1
## 6 23MXQSnv6ZrZUiLy6DdnYl 8.00e-01 0 0.0884 -5.933 1
## name popularity release_date speechiness tempo
## 1 Badlands 46 3/26/12 0.1430 147.988
## 2 96 Tears 42 1/1/64 0.0340 123.458
## 3 I Need Somebody 23 1966 0.0372 134.048
## 4 Ten O'clock 16 1966 0.0336 96.254
## 5 You're Telling Me Lies 15 1966 0.0802 88.574
## 6 Up Side 26 1966 0.0426 125.672
## avg_art_pop
## 1 46.00000
## 2 25.11111
## 3 25.11111
## 4 25.11111
## 5 25.11111
## 6 25.11111
In addressing the multiple categorical variables we had left in our dataset, we then set out to determine whether these variables would be important to our exploration and whether they would be feasible to incorporate quantitatively into our data (similar to artists above). The qualitative variables that we had left in our dataset (which we decided would be irrational to dummy code) were id, key, name, and release_date. We discussed each of these variables as a group and decided that none of them would be very useful or feasible to fit quantitatively into our final dataset, and we did not feel they would logically be important in predicting a given song’s popularity. This is because id served as a unique song identifier, most individuals would not be able to distinguish between the different keys, name could be almost anything an artist desired (and thus had no real consequence for popularity), and release_date was closely related to year, which we had already decided to incorporate in our data.
# delete variables that will not be useful in our exploration
<- c("id", "key", "name", "release_date")
delete2 <- full_data[!(names(full_data) %in% delete2)] full_data
Finally, after making sure that all of our quantitative variables were numeric (shown in code chunk below), our data was ready for some exploratory data analysis.
# mutating leftover int variables to num type
full_data$popularity <- as.numeric(full_data$popularity)
full_data$year <- as.numeric(full_data$year)
full_data$duration_ms <- as.numeric(full_data$duration_ms)
In our exploratory data analysis, we set out to explore the relationship between the various predictors and the response variable, popularity. First, an analysis of the distribution of the response popularity
allowed us to check that it was fairly normally-distributed.
# adding a density histogram for our outcome variable popularity
hist <- ggplot(full_data) +
  ggtitle("Density Histogram for Song Popularity (Outcome Variable)") + # include title
  geom_histogram(aes(popularity, y = after_stat(density)), binwidth = 5, fill = "cyan", color = "grey") # add histogram w density
x <- seq(0, 100, length.out = 100)
df <- with(full_data, data.frame(x = x, y = dnorm(x, mean(popularity), sd(popularity))))
hist + geom_line(data = df, aes(x = x, y = y), color = "red")
We then subset the data to a more manageable size, as our data set had too many observations to plot, in order to examine the visual relationships between our predictors and popularity:
set.seed(123)
# sample data to use for exploratory graphics
sample <- full_data[sample(nrow(full_data), 200), ]
Then, we defined various simple scatter plots for our exploratory data analysis:
valence <- ggplot(sample, aes(x = valence, y = popularity)) + geom_point(color = "seagreen1") + geom_smooth(se = FALSE, color = "black")
year <- ggplot(sample, aes(x = year, y = popularity)) + geom_point(color = "dodgerblue2") + geom_smooth(se = FALSE, color = "black")
acousticness <- ggplot(sample, aes(x = acousticness, y = popularity)) + geom_point(col = "mediumpurple1")
danceability <- ggplot(sample, aes(x = danceability, y = popularity)) + geom_point(col = "lightpink") + geom_smooth(se = FALSE, color = "black")
duration <- ggplot(sample, aes(x = duration_ms, y = popularity)) + geom_point(col = "sienna1")
energy <- ggplot(sample, aes(x = energy, y = popularity)) + geom_point(col = "firebrick3") + geom_smooth(se = FALSE, color = "black")
explicit <- ggplot(sample, aes(x = explicit)) + geom_bar(fill = "darkseagreen") + scale_x_continuous(breaks = c(0,1), labels = c("0" = "Clean", "1" = "Explicit"))
instrumentalness <- ggplot(sample, aes(x = instrumentalness, y = popularity)) + geom_point(col = "goldenrod1")
liveness <- ggplot(sample, aes(x = liveness, y = popularity)) + geom_point(col = "deeppink1")
loudness <- ggplot(sample, aes(x = loudness, y = popularity)) + geom_point(col = "khaki1") + geom_smooth(se = FALSE, color = "black")
mode <- ggplot(sample, aes(x = mode)) + geom_bar(fill = "plum") + scale_x_continuous(breaks = 0:1, labels = c("Minor", "Major"))
speechiness <- ggplot(sample, aes(x = speechiness, y = popularity)) + geom_point(col = "green4")
tempo <- ggplot(sample, aes(x = tempo, y = popularity)) + geom_point(col = "springgreen")
avg_art_pop <- ggplot(sample, aes(x = avg_art_pop, y = popularity)) + geom_point(col = "deepskyblue") + geom_smooth(se = FALSE, color = "black")
After defining our various plots, we accumulated them into a panel for better visibility:
library(ggpubr)
figure <- ggarrange(valence, year, acousticness, danceability, duration, energy, explicit, instrumentalness, liveness, loudness, mode, speechiness, tempo, avg_art_pop, ncol = 3, nrow = 5)
figure
There are several things that we extrapolated from the panel of plots: mainly, many of the predictors, such as valence
, acousticness
, danceability
, duration_ms
, and tempo
all have fairly weak relationships with our response variable popularity
. One can also see that many songs in the dataset appear to be around 0 for speechiness
, liveness
, and acousticness
. Regarding the dummy variables explicit
and mode
, there is a clear skew towards clean songs and songs in a major key.
Looking specifically at the variables that have the strongest relationships with popularity, the two predictors year
and avg_art_pop
appear to have the strongest positive relationship with popularity. This makes intuitive sense.
There are several factors that could explain why year has a positive, near linear, relationship with popularity: for one, the population has increased from the 1960s until present day. With more people comes a more active listening base for artists to reach. Additionally, the methods by which we obtain music have changed over the decades, starting off with things like records and record players, CDs, iTunes, and now with music streaming through platforms like Spotify. With each changing method, listening to music has become more accessible over time, and thus more popular. While we didn’t expect year to be so highly correlated with song popularity, it does make logical sense given further thought.
It also makes intuitive sense why an artist’s average popularity would have a strong, positive relationship with popularity. An artist with high brand value or recognition, such as Kendrick Lamar, Ariana Grande, or Justin Bieber, is more likely to have new releases listened to by significantly more people than a more obscure or lesser-known artist such as Gus Dapperton or Peach Pit. This strong, positive relationship between our avg_art_pop variable and our response variable popularity is thus explainable given further thought.
For the next step of our exploratory data analysis, we wanted to take an even closer look at the relationships between our predictors and response through a correlation matrix and subsequent heatmap. First, we created a version of our data with the dummy variables recoded as labeled factors:
# mutating dummy variables
full_data2 = full_data %>%
  mutate(explicit = as.factor(ifelse(explicit == 0, "Clean", "Explicit"))) %>%
  mutate(mode = as.factor(ifelse(mode == 0, "Minor", "Major")))
Finally, we created the correlation matrix and heatmap:
# correlation matrix
corr_mat <- cor(full_data)
corr_mat[, 12]
## valence year acousticness danceability
## -0.05766409 0.68198486 -0.24959846 0.18750900
## duration_ms energy explicit instrumentalness
## -0.02197240 0.19717541 0.26782335 -0.14840380
## liveness loudness mode popularity
## -0.07600035 0.31230798 -0.06658299 1.00000000
## speechiness tempo avg_art_pop
## 0.10043080 0.03306722 0.81967675
# heatmap
corrplot(corr_mat, type = "lower", order = "hclust", tl.col = "black", tl.srt = 45)
The heatmap above reinforced the notion that the two predictors with the strongest positive association with popularity were year
and avg_art_pop
. We thus had reason to believe that they would be influential and most important to include in our model building.
After we explored our data visually, we began to build our models. We looked at the characteristics of our data, such as the number of observations, type of data (qualitative vs quantitative), etc., and did some outside research on models that might fit best with our data. We found that an elastic net model, a Bayesian additive regression trees (BART) model, a multivariate adaptive regression splines (MARS) model, and a k-nearest neighbors (k-NN) model would suit our regression-based research question and dataset well. We explored the reasons why we chose each model, we fitted each of these models to our training data, and then we calculated the test MSE of each model to compare later.
Since our dataset has 120,750 observations (songs) in total, our data can safely be split into train and test sets. That way we could use the training data to fit our model and save the test data to evaluate how well our model predicted the data (and calculate the test MSE). We chose to use 70% of the data for training and 30% of the data for testing. To implement this, we set the seed for reproducibility, took a random sample of 70% of the dataset, and used that sample to create a training set. The remaining 30% of the dataset were used for the testing set.
# split training and testing data
#set seed
set.seed(123)
# Sample 70% of observations as training data
trainsample = sort(sample(nrow(full_data), nrow(full_data)*.7))
# define train as the 70% of observations
train = full_data[trainsample,]
# The rest as test data
test = full_data[-trainsample,]
The k-nearest neighbors (k-NN) regression algorithm is a non-parametric method that predicts a response value by averaging the responses of the k closest neighboring observations, thereby approximating the association between the predictors and the outcome variable (in our case, song popularity). We found that k-NN was a good fit for our research question because the fitted function is not restricted to a linear form, which is what we were looking for.
One downside of the k-NN algorithm is that it is sensitive to the scales of the predictors, so it works best when the variances of the variables are similar (or when the predictors are standardized). By using this model, we made the assumption that the variances of each predictor of popularity are similar enough, after scaling, for k-NN to work. Furthermore, k-NN makes it difficult to “predict into the future,” and, alongside the curse of dimensionality, it is not very interpretable. In spite of these downsides, we moved forward with our use of k-NN as we knew it would be suitable for our purposes.
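As a quick check of this assumption, one could compare the predictor variances directly; a minimal sketch using the full_data object created above (not part of our original pipeline):
# compare predictor variances; large differences would suggest standardizing before k-NN
sapply(full_data %>% select(-popularity), var)
# we address the scale issue below by passing the predictors through scale() before fitting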
To build our k-NN model, we first loaded the required packages and defined the do.chunk() function for k-fold cross-validation. The folddef argument specifies an index that maps every observation in the design matrix and in the response into one of the k folds, and chunkid sets which fold is used as the validation set. For the k-NN model, our tuning parameter was k, the number of neighbors. The do.chunk() function trains the model on all folds except the chunkid fold and computes the validation error on that held-out fold. The resulting data frame contains, for each fold, its training and validation error rates.
library(ISLR) # load libraries
library(tidyverse)
library(class)
library(FNN)
do.chunk <- function(chunkid, folddef, Xdat, Ydat, ...){ # for k-fold cross validation
  train = (folddef != chunkid)

  Xtr = Xdat[train,] # split x train
  Ytr = Ydat[train] # split y train

  Xval = Xdat[!train,] # split x validation
  Yval = Ydat[!train] # split y validation

  predYtr = knn(train = Xtr, test = Xtr, cl = Ytr, ...) # knn prediction y train
  predYvl = knn(train = Xtr, test = Xval, cl = Ytr, ...) # knn prediction y value

  data.frame(fold = chunkid,
             train.error = mean(predYtr != Ytr),
             val.error = mean(predYvl != Yval)) # load into dataset
}
Next, we specified that we wanted 10 folds and divided the data into ten intervals.
nfold = 10 # 10 folds
set.seed(100) # for reproducibility
folds = cut(1:nrow(train), breaks = nfold, labels = FALSE) %>% sample() # divide data into folds
We decided to have 50 neighbors considered, so we looped through all 50 of the neighbors for each different chunk id.
error.folds = NULL
# Give possible number of nearest neighbours to be considered
allK = 1:50
# Set seed since do.chunk() contains a random component induced by knn()
set.seed(888)

Ytrainknn <- train$popularity # from lab 4
Xtrainknn <- train %>% select(-popularity) %>% scale(center = TRUE, scale = TRUE)

Ytestknn <- test$popularity
Xtestknn <- test %>% select(-popularity) %>% scale(center = TRUE, scale = TRUE)

# Loop through different number of neighbors
for (k in allK){
  # Loop through different chunk id
  for (j in seq(10)){
    tmp = do.chunk(chunkid = j, folddef = folds, Xdat = Xtrainknn, Ydat = Ytrainknn, k = k)
    tmp$neighbors = k # Record the number of neighbors
    error.folds = rbind(error.folds, tmp) # combine results
  }
}
head(error.folds, 10)
## fold train.error val.error neighbors
## 1 1 0.008649700 0.9558737 1
## 2 2 0.009043945 0.9504259 1
## 3 3 0.009030918 0.9564652 1
## 4 4 0.008754749 0.9549219 1
## 5 5 0.008728573 0.9532710 1
## 6 6 0.008820475 0.9583531 1
## 7 7 0.008754749 0.9555135 1
## 8 8 0.009070354 0.9517331 1
## 9 9 0.008807330 0.9555135 1
## 10 10 0.008912609 0.9587129 1
We wanted to determine the optimal number of neighbors for k-NN using error.folds. For each fold and each number of neighbors, we needed the type of error (training or validation) and its corresponding value, so we reshaped error.folds with the melt() function, as shown below.
# Transform the format of error.folds for further convenience
errors = melt(error.folds, id.vars = c('fold', 'neighbors'), value.name = 'error')
# Choose the number of neighbors which minimizes validation error
val.error.means = errors %>%
  # Select all rows of validation errors
  filter(variable == 'val.error') %>%
  # Group the selected data frame by neighbors
  group_by(neighbors, variable) %>%
  # Calculate CV error rate for each k
  summarise_each(funs(mean), error) %>%
  # Remove existing group
  ungroup() %>%
  filter(error == min(error))
# Best number of neighbors
# if there is a tie, pick larger number of neighbors for simpler model
numneighbor = max(val.error.means$neighbors)
numneighbor
## [1] 50
The 10-fold cross-validation found that a 50-nearest-neighbor model performed best. This concerned us because of how many neighboring observations were being averaged for each prediction, and it led us to wonder whether the model might overfit the training data. To check, we calculated the training and test MSE using the k-NN predictions.
# train knn regressor and make predictions on training set using k = numneighbor (50)
set.seed(888)
pred.Ytrainknn = knn.reg(train = Xtrainknn, test = Xtrainknn, y = Ytrainknn, k = numneighbor)
# get training MSE
mean((pred.Ytrainknn$pred - Ytrainknn)^2)
## [1] 79.66631
set.seed(888)
# now make predictions on test set
pred.Ytestknn = knn.reg(train = Xtrainknn, test = Xtestknn, y = Ytrainknn, k = numneighbor)
mean((pred.Ytestknn$pred - Ytestknn)^2)
## [1] 82.41085
The training MSE was 79.66631 and the test MSE was 82.41085. This small discrepancy between the training and test MSEs was a sign of slight overfitting. Because of this, we were wary of choosing k-NN as our model.
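Since 50 was also the largest value in our allK grid, one optional diagnostic (a sketch, not part of our original workflow) is to plot the average cross-validation error against the number of neighbors and check whether the error curve has leveled off by k = 50:
# average CV validation error for each candidate k, using the errors data frame built above
cv_curve <- errors %>%
  filter(variable == 'val.error') %>%
  group_by(neighbors) %>%
  summarise(error = mean(error))
ggplot(cv_curve, aes(x = neighbors, y = error)) + geom_line() + geom_point() +
  labs(x = "number of neighbors (k)", y = "mean CV validation error")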
The next model we fit is the elastic net, a combination of the ridge and lasso techniques. The elastic net method is a regularized linear regression that combines L1 and L2 penalty functions from the lasso and ridge techniques. Thus, the elastic net performs variable selection and regularization simultaneously. We found the elastic net to be more ideal than the ridge and lasso techniques on their own because it uses both methods and learns from them.
The idea of the elastic net is to minimize the residual sum of squares and add a penalty term:
\[\sum\limits_{i=1}^n (y_i - \beta_0 - \sum\limits_{j=1}^p \beta_j x_{ij})^2 + \lambda[(1 - \alpha)\ ||\beta||_2^2/2 + \alpha\ ||\beta||_1]\] where \(||\beta||_1\) takes the form of L1 (L1 norm): \[||\beta||_1 = \sum\limits_{j=1}^p |\beta_j|\]
and \(||\beta||_2\) takes the form of L2 (Euclidean norm): \[||\beta||_2 = \sqrt{\sum\limits_{j=1}^p \beta_j^2}\]
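To make the roles of \(\alpha\) and \(\lambda\) concrete, the glmnet package fits exactly this penalized objective; below is a minimal sketch with illustrative (untuned) values of alpha and lambda, separate from the tuned caret workflow we actually use:
library(glmnet)
# sketch: one elastic net fit at alpha = 0.5 (an even ridge/lasso mix) and lambda = 0.01
x_demo <- model.matrix(popularity ~ ., full_data)[, -1] # predictor matrix
y_demo <- full_data$popularity # response
fit_demo <- glmnet(x_demo, y_demo, alpha = 0.5, lambda = 0.01)
coef(fit_demo) # penalized coefficient estimates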
knitr::include_graphics("/Users/kimbauer/Desktop/college work/pstat 131/project/Bauer_Cai_Navarra_131_Project/Figures/elasticnetimage.png")
This model was a good choice for our data because it is a regression model, which suits our mostly quantitative variables, and it improves on the lasso and ridge techniques. Additionally, because the combined penalty handles correlated predictors, it reduced our need to investigate collinearity between predictors separately. Below, we load the required packages for elastic net model fitting, re-split the training and testing data in matrix form (which we found was required for the elastic net model), and fit our model with built-in tuning and cross-validation.
# load packages
library(dplyr) # data manipulation
library(glmnet) # elastic net
library(ggplot2) # for visualization
library(caret) # classification and regression training
# creating new train and test sets as matrices (from lab 5)
el_X <- model.matrix(popularity ~ ., full_data)[, -1]
el_Y <- full_data$popularity
set.seed(123)
eltrain = sort(sample(nrow(el_X), nrow(el_X)*.7)) # same as train sample above
eltest = (-eltrain)
x.eltrain = el_X[eltrain,]
y.eltrain = el_Y[eltrain]
x.eltest = el_X[eltest,]
y.eltest = el_Y[eltest]

# model building on training sets
control <- trainControl(method = "repeatedcv",
                        number = 5, # 5 fold
                        repeats = 5, # 5 repeats
                        search = "random",
                        verboseIter = TRUE)
# training elastic net regression model
elastic_model <- train(x = x.eltrain, y = y.eltrain,
                       method = "glmnet",
                       preProcess = c("center", "scale"),
                       tuneLength = 25, # tuning the model
                       trControl = control) # our cross-validation defined above
As stated above, our caret training setup performs the cross-validation and random tuning needed to select the optimal alpha and lambda values for model fitting. Below, we print out the best tuned model it selected from cross-validation, plot the tuning results, and print out the values of the best-tuned model.
elastic_model
## glmnet
##
## 84525 samples
## 14 predictor
##
## Pre-processing: centered (14), scaled (14)
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 67621, 67619, 67620, 67620, 67620, 67621, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.06876433 0.113432800 8.735923 0.6847304 6.361206
## 0.17250896 0.149374972 8.736850 0.6846874 6.361618
## 0.20867316 1.714738322 8.847165 0.6815835 6.557331
## 0.21125721 3.394815536 9.045809 0.6783818 6.825648
## 0.29286574 0.465784791 8.752501 0.6840097 6.388511
## 0.31549376 7.029300792 9.743641 0.6771812 7.601590
## 0.35680771 0.001551316 8.735681 0.6847318 6.354317
## 0.39223995 4.100785818 9.233766 0.6800352 7.063697
## 0.48713970 0.971739738 8.805351 0.6822081 6.469126
## 0.52333486 0.072676349 8.736796 0.6846624 6.353545
## 0.53708309 0.217686804 8.744468 0.6842437 6.362709
## 0.55336682 2.737403145 9.045971 0.6818364 6.839039
## 0.58332201 0.008188642 8.735672 0.6847297 6.352686
## 0.66370131 0.005493856 8.735676 0.6847288 6.352366
## 0.66803380 6.064987237 10.094770 0.6819633 7.986012
## 0.69651918 0.001657071 8.735681 0.6847283 6.352238
## 0.81856105 0.021962223 8.735706 0.6847262 6.351891
## 0.82267651 1.898387317 8.957145 0.6819579 6.707539
## 0.82977965 0.154433921 8.744647 0.6841816 6.356173
## 0.83235296 2.524038284 9.098348 0.6816817 6.900946
## 0.87944868 0.008348323 8.735705 0.6847261 6.351754
## 0.88215644 0.001898939 8.735704 0.6847261 6.351749
## 0.88673137 0.188647932 8.749121 0.6839133 6.359207
## 0.89035794 3.971283445 9.608186 0.6788287 7.495460
## 0.89747707 0.020774229 8.735702 0.6847262 6.351708
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.583322 and lambda
## = 0.008188642.
# RMSE was used to select the optimal model using the smallest value.
# plot
plot(elastic_model, main = "Elastic Net Regression")
# print out values from best tuned model
best_elastic <- elastic_model$results %>%
  filter(alpha == elastic_model$bestTune$alpha, lambda == elastic_model$bestTune$lambda)
best_elastic
## alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD
## 1 0.583322 0.008188642 8.735672 0.6847297 6.352686 0.05325549 0.003896935
## MAESD
## 1 0.02982672
Once our elastic net model was trained, we wanted to make a prediction on our test data. To do this, we used the predict()
function on our tuned and cross-validated model, and then we calculated the test MSE, which is printed out below.
# elastic model prediction on test data
elastic.pred <- predict(elastic_model, x.eltest)

# evaluate mse on the test data
elastic_mse <- mean((elastic.pred - y.eltest)^2)
elastic_mse
## [1] 76.22993
This test MSE of 76.22993 is smaller than the k-NN test MSE we calculated above, so we continued with our model building to see how the remaining models would compare.
The BART method uses decision trees for regression and constructs each tree randomly, similar to bagging and random forest models. Each new tree is trying to capture a “signal” that is not accounted for by the rest of the trees. BART is related to both the bagging and random forest algorithms, as well as the boosting algorithm, in that each tree is constructed randomly and each tree tries to capture a signal not yet accounted for by the current model. In the BART model, K is the number of regression trees and B is the number of iterations that the BART will run through. Each tree in BART is perturbed to avoid local minima and achieve a more thorough exploration of the model space (ISLR Ch 8).
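In the gbart() function we use below, these quantities correspond to tuning arguments; the sketch that follows uses a tiny synthetic dataset purely to illustrate that mapping (the argument values are illustrative assumptions, not the settings of our fitted model):
library(BART)
# sketch: ntree sets the number of regression trees (K above) and ndpost the number of
# kept MCMC draws (related to the number of iterations B); toy data for illustration only
set.seed(1)
x_toy <- matrix(rnorm(100 * 3), ncol = 3)
y_toy <- x_toy[, 1] + rnorm(100)
bart_toy <- gbart(x.train = x_toy, y.train = y_toy, ntree = 20, ndpost = 100, nskip = 50)
str(bart_toy$yhat.train.mean) # posterior mean predictions for the toy training data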
Below is a description of the algorithm we used to construct our BART model.
knitr::include_graphics("/Users/kimbauer/Desktop/college work/pstat 131/project/Bauer_Cai_Navarra_131_Project/Figures/BARTalgorithm.png")
First, we took our training and testing data and split them into Xtrain, Ytrain, Xtest, and Ytest for the BART model.
library(BART)
# BART Implementation
Ytrainbart <- train$popularity # from lab 4
Xtrainbart <- train %>% select(-popularity) %>% scale(center = TRUE, scale = TRUE)

Ytestbart <- test$popularity
Xtestbart <- test %>% select(-popularity) %>% scale(center = TRUE, scale = TRUE)
We then implemented k-fold cross validation with our BART model, which is similar to the k-NN cross validation that we did above:
# do chunk k-fold cv with BART
do.chunk.bart <- function(chunkid, folddef, Xdat, Ydat, ...){
  # Get training index
  train = (folddef != chunkid)
  # Get training set by the above index
  Xtr = Xdat[train,]
  # Get responses in training set
  Ytr = Ydat[train]
  # Get validation set
  Xval = Xdat[!train,]
  # Get responses in validation set
  Yval = Ydat[!train]
  # Predict training labels
  predYtr = gbart(Xtr, Ytr, x.test = Xtr)
  # Predict validation labels
  predYval <<- gbart(Xtr, Ytr, x.test = Xval)
  data.frame(fold = chunkid,
             train.error = mean((Ytrainbart - predYtr$yhat.train)^2), # Training error for each fold
             val.error = mean((Ytestbart - predYval$yhat.test)^2)) # Validation error for each fold
}
# k-fold CV on BART
nfold = 5
# cut: divides all training observations into nfold = 5 intervals;
# labels = FALSE instructs R to use integers to code different intervals
set.seed(3)
folds = cut(1:nrow(train), breaks = nfold, labels = FALSE) %>% sample()
folds

# Set error.folds (a vector) to save validation errors in future
error.folds = NULL
# Set seed since do.chunk.bart() contains a random component induced by gbart()
set.seed(888)
# Loop through different chunk id
for (i in seq(5)){
  tmp = do.chunk.bart(chunkid = i, folddef = folds, Xdat = Xtrainbart, Ydat = Ytrainbart)
  error.folds = rbind(error.folds, tmp)
}
head(error.folds)
Finally, we were able to get the test MSE after we iterated through a 5-fold cross validation.
# Test mse
test_mse = mean((Ytestbart - predYval$yhat.test)^2)
test_mse
## [1] 407.7647
A test MSE of 407.7647 was surprisingly high and demonstrated that BART did not perform well for our data.
The final algorithm we implemented was the multivariate adaptive regression splines (MARS). The MARS model is an ensemble of piece-wise linear functions that are joined together by hinge functions. This accounts for non-linearity between inputs and outputs. The MARS algorithm has two stages: a forward stage and a backward stage. The forward stage generates many pairs of functions as candidates for the basis of the model. The backward stage is considered the pruning stage; the model goes through all of the functions and deletes the functions that don’t add further value to the model performance.
We chose to use the MARS algorithm because we saw that the relationship between popularity and our predictors is often non-linear. Additionally, we had many predictor variables and the MARS algorithm was capable of achieving good performance on a regression problem with many predictor variables.
In a nutshell, the general MARS method works like this: 1. Divide a given dataset into k pieces. 2. Fit a regression model to each piece. 3. Use k-fold cross-validation to choose a value for k.
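Because MARS represents the fit with hinge basis functions, it can be instructive to look at the terms a fitted model actually selects. Below is a minimal, un-tuned sketch on our training data (illustrative only; our tuned caret fit follows):
library(earth)
# single MARS fit with default settings; summary() lists the selected hinge terms
# of the form h(x - c) together with their coefficients
mars_demo <- earth(popularity ~ ., data = train)
summary(mars_demo)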
Our first step was to create a tuning grid for pruning and set a seed for reproducibility:
library(earth) # library for fitting MARS models
library(caret) # library for tuning the model parameters
trainmars <- train # copy dataset for mars model

# create a tuning grid
hyper_grid <- expand.grid(degree = 1:3,
                          nprune = seq(2, 50, length.out = 10) %>% floor())
# make this example reproducible
set.seed(123)
Next, we fit the MARS model using 10-fold cross-validation:
#fit MARS model using k-fold cross-validation
cv_mars <- train(
  x = subset(trainmars, select = -c(popularity)),
  y = trainmars$popularity,
  method = "earth",
  metric = "RMSE",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid = hyper_grid)
Finally, we extracted the MARS model with the lowest cross-validated RMSE:
# display model with lowest cross-validated RMSE
mars_model <- cv_mars$results %>%
  filter(nprune == cv_mars$bestTune$nprune, degree == cv_mars$bestTune$degree)
mars_model
## degree nprune RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 3 18 8.621526 0.6928844 6.265727 0.1530968 0.01070354 0.102432
This result shows that the best model found by the MARS method was of degree 3 with nprune = 18 terms retained after pruning. The cross-validated R-squared was 0.6928, indicating a reasonably strong fit, and the cross-validated root mean squared error was 8.6215.
Below is a graph of the cross-validated RMSE by number of terms and degree - one can see that the third degree has the lowest RMSE and that the number of terms appears to minimize RMSE at a value between 10 and 20 (hence our selected value of 18 makes sense).
#display test RMSE by terms and degree
ggplot(cv_mars)
To compare this model to the other models we fit, we will calculate the test MSE:
# MARS model prediction on test data
mars.pred <- predict(cv_mars, test)

# evaluate mse on the test data
mars_mse <- mean((mars.pred - test$popularity)^2)
mars_mse
## [1] 74.13321
Thus, we found a test MSE value of 74.13321 from our model using multivariate adaptive regression splines.
We decided that the model with the lowest test MSE best fit the data. From highest test MSE to lowest, the models were as follows: BART (test MSE = 407.76), k-NN (82.41), elastic net (76.23), and MARS (74.13).
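The same comparison can be assembled into a small table from the test MSEs computed above (values copied from the results printed earlier in this report):
# gather the test MSEs reported above into one comparison table, highest to lowest
model_mse <- data.frame(model = c("BART", "k-NN", "Elastic net", "MARS"),
                        test_mse = c(407.76, 82.41, 76.23, 74.13))
model_mse[order(-model_mse$test_mse), ]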
Based on these results, the best performing model of the four we chose was MARS.
With MARS being our best-performing model of the four we chose to explore, we wanted to evaluate how well it did in predicting on our test data. Below, we reproduced a plot we had shown above of the test RMSE of our tuned MARS model by terms and degree.
plot(cv_mars)
The MARS model having a test MSE of 74.13, and therefore a test RMSE of about 8.6 (the square root of the MSE, in line with the RMSE displayed in the plot above), indicated that on average our MARS prediction of song popularity was about 8.6 percentiles off the actual song popularity across the ~30,000 test observations.
While machine learning algorithms have come close to dissecting the attributes of a hit song, music is a very human topic. No algorithm can replace the human ear in listening to music. Because our topic of music was quite subjective, there were inherently many outliers and a large portion of randomness. Thus, we found our average residual of about 8 percentage points to be relatively low; our model did a great job with prediction!
To determine how well the model predictions performed in comparison to the test data, we also looked at a plot of the predictions against the true values. In the plot below, if the values were perfectly predicted, the points would fall along the y=x line, for the \(\hat{y}\)s would match the \(y\)s.
plot(test$popularity, mars.pred, xlab = "actual", ylab = "predicted", col = "darkseagreen", cex = 0.6) # scatterplot
abline(lm(test$popularity ~ mars.pred)) # add trend line
From the above plot, we can see that our prediction was decent, as the points fall relatively close to the y = x line; the black line shows the fitted trend between actual and predicted values.
We can take an even closer look at this relationship between actual values and predicted values by taking a sample of the data for visual purposes, as we want to be able to see the data in smaller quantity. This sample is taken below:
resid <- test$popularity - mars.pred # get residuals
df_mars_visual <- data.frame(test$popularity, mars.pred, resid) # data frame of actual, predicted, and residuals
sample_mars <- df_mars_visual[sample(nrow(df_mars_visual), 300), ] # 300 observation sample of the dataframe
And the above plot of actual vs. predicted is replicated with this sample below:
plot(sample_mars$test.popularity, sample_mars$y, xlab = 'actual', ylab = 'predicted', main = "Sample: predicted vs actual values using MARS", col = "skyblue", cex = 0.7) # same scatter with sampled data
abline(lm(sample_mars$test.popularity ~ sample_mars$y)) # trend line
By plotting a sample of the data, we can see the relationship between actual and predicted values much more clearly, and the points again fall relatively close to the y = x line.
We can also evaluate our prediction by making a plot of the residuals:
plot(sample_mars$y.1, ylab = 'MARS residuals', col = 'blue', cex = 0.4) # residual scatter
abline(0, 0, col = 'red') # horizontal residual line
From this plot, we can see that the residuals are fairly evenly spread. The highest residuals are around 30 points off, with most residuals within 10 points off.
From all of these examinations of the performance of our MARS prediction, we can conclude that it performed fairly well for our purposes, and thus we were able to make a decently good prediction of song popularity based on a song’s attributes.
Thus, we have answered our research question of “Can we predict a song’s popularity based on its attributes?” by going through the entire machine learning exploration process. We started by tidying our data in preparation for a regression model and adding the variable, avg_art_pop
, which was the average popularity of all of an artist’s songs. The tidying process took longer than expected when we encountered dataset issues; some datasets were too large for Github, some datasets didn’t have key variables we were looking for, etc., and choosing the right one took a lot of trial and error.
Next, we plotted our variables of interest to identify key patterns. We made a scatterplot of every quantitative variable in our models and a bar plot of every categorical variable we had. This allowed us to get a glimpse of which predictors might be influential and make sure that we were aware of any patterns that might violate an assumption in our models. For this reason, we decided not to run a linear regression model because our data violated too many of the model assumptions! We also created a correlation matrix and a heatmap to see which predictors were most correlated with song popularity, and which predictors varied together (collinearity). Finally, we made a density histogram of song popularity and compared it to a normal curve - it was nearly a perfect normal distribution!
Once we completed the exploratory data analysis, we were able to research the models that might best fit our data. We looked into aspects such as number of predictors, best prediction accuracy, model flexibility and interpretability, and other model specifications mentioned in lecture. Finally, we landed on the following algorithms: k-nearest neighbors, elastic net, Bayesian additive regression trees (BART), and multivariate adaptive regression splines (MARS). After fitting, tuning, and running cross-validation on each model, we found that the model with the lowest MSE was the MARS.
To finish our project, we tuned and ran a prediction using the MARS model to answer our research question. We observed that the prediction did a fairly good job at predicting popularity. The test MSE was 74.13, which we found to be a fairly low MSE given the data. Since song popularity is measured in a percentile between 0 and 100, the MARS model performed decently well, with predictions of song popularity being an average of 8 percentiles off of the true song popularity.
In concluding our machine learning project, we reflected on some aspects of the project we would improve on or do differently if given the opportunity to re-do it. One area we felt we could improve on was the definition of our avg_art_pop variable. Toward the end of the project, we realized that this variable may violate a rule of prediction - because it includes the average popularity of all of an artist’s songs, it may include future values. For example, if an observation is Katy Perry’s “Spit”, released in 2003, the avg_art_pop variable for that observation includes Katy Perry’s future hits, released after 2003, in its average. If we were to re-do this project, we could address this problem by making average artist popularity a variable that relies only on the observations that chronologically come before it, as sketched below.
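One possible implementation of that fix (a hedged sketch only, not something we actually ran): for each song, average the popularity of the same artist's earlier releases.
# sketch: leakage-free alternative to avg_art_pop built from earlier releases only
# (the first release per artist gets NA; songs sharing a year are ordered arbitrarily)
library(dplyr)
music_prior <- music %>%
  group_by(artists) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(prior_art_pop = dplyr::lag(cummean(popularity))) %>%
  ungroup()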
Another area we can improve on for our next project would be in our decision making regarding our research question and the data we use to answer it. Our approach in this project was a bit misguided; we were unsure of the topic we wanted to focus on, so we started by looking at some general data sets and seeing what caught our eye. After some discussion, we decided on pursuing something within our shared interest of music, but we began looking for a dataset before we were certain about the research question we wanted to explore. We thought that we would be able to create our research question from a data set we decided on. As a result, we went through multiple datasets and multiple research questions before we were able to finally settle on the ones presented above. In hindsight, we feel that we would have been able to save time and labor by deciding on a topic and creating our research question before beginning our search for data.
Based on this exploration into song popularity and building of a machine learning model, we learned that the factors that most influence a song’s popularity are the artist’s own popularity and the year the song was released in. For future artists, this means building a brand and gaining popularity and recognition so that songs that you do release will be more noticed. In addition, with so many technologies and tools available to artists to spread their work and their brand (like Spotify, Apple Music, social media, and marketing firms) there is no time like the present to get into the music industry. Artists who maximize the utility of these resources have the greatest chance at success.
James, G., Witten, D., Hastie, T., and Tibshirani, R., An Introduction to Statistical Learning: with Applications in R, Chapters 4 and 8
PSTAT 131 Lab 2: k-NN
PSTAT 131 Lab 4: Cross-Validation
PSTAT 131 Lab 5: Ridge-Lasso