Question and Background Info

Our overarching question: How well can we predict the popularity of songs on Spotify?

Our main criterion when selecting a question was to pick one that interested both us and a broad audience. As avid popular music listeners, we felt this subject has personal significance, and we could learn about the music we’re listening to. Popular music is a common topic of discussion, so we thought that the topic would interest a general audience. Additionally, if successful, this model could have practical uses. For example, if someone were interested in creating popular songs, they could use the model to identify characteristics to consider when writing them.

Information about our dataset:

Spotify data was collected from the Data Visualization students (SARC 5400) from the Spring 2022 semester. 23 people provided their data, and the instructor pooled it together to create the dataset. We ended up with 89,273 songs to include in our model. For our machine learning model, we will join two datasets. The first is “TrackInfo,” with descriptive information about each song: the disc number, the name of the song, the name of the album, the name of the artist, the release date, the release date precision, the track number, the duration of the song in ms, whether the song is explicit, whether the song is a single, part of a compilation, or part of an album, the popularity of the song, and the total number of tracks in the album. The “TrackFeatures” dataset contains information about the characteristics of each song: danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration in ms, and time signature. We joined these two datasets because our target variable is in a dataset separate from many of our predictor variables of interest. Our target variable is the “popularity” variable, originally found in the “TrackInfo” dataset. It’s on a 0-100 scale, with higher values representing higher popularity. For more information about Spotify data, see the following link: https://support.spotify.com/us/article/understanding-my-data/
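
As a rough sketch of how the two files can be read and joined in R: the file names, the shared track “id” key, and the use of dplyr are assumptions for illustration, not details taken from the original analysis.

# Sketch: read the two CSVs and join them on an assumed shared "id" column
library(dplyr)

track_info     <- read.csv("TrackInfo.csv")
track_features <- read.csv("TrackFeatures.csv")

# Keep only tracks that appear in both files
tracks <- inner_join(track_info, track_features, by = "id")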

Previous Research:

We wanted to see if anyone had already created a machine learning model to study popular music, Spotify data, or anything similar. This way, we could understand how our model might perform. We found researchers at the Manipal Institute of Technology in Manipal, India, who used Spotify API data to see if they could build a machine learning model to predict the popularity of songs. Before running the machine learning models, they found that “Popularity is better correlated with loudness, followed by danceability, energy, valence, tempo, and duration. Popularity is negatively correlated with acoustics, speechiness, liveness, and instrumentalness” (Kaneria et al.). The researchers created random forest, logistic regression, naïve Bayes, and gradient-boosted decision tree models to see which performed the best. Their best-performing model was the random forest, with an accuracy of 89%. However, they created a binary classification model, and we plan to use a regression model, so it will be interesting to see how our results compare.

Citation:

Kaneria, A.V., Rao, A.B., Aithal, S.G., Pai, S.N. (2021). Prediction of Song Popularity Using Machine Learning Concepts. In: K V, S., Rao, K. (eds) Smart Sensors Measurements and Instrumentation. Lecture Notes in Electrical Engineering, vol 750. Springer, Singapore. https://doi.org/10.1007/978-981-16-0336-5_4

Exploratory Data Analysis

Target Variable: Popularity

Statistic Value
Min. 0.00
1st Qu. 41.00
Median 52.00
Mean 50.62
3rd Qu. 62.00
Max. 100.00

In order to better understand the data we’re working with, we calculated the correlation of each feature with the popularity variable and created a boxplot of the popularity variable.

Calculating correlations and creating a correlation matrix helps us determine whether there are any variables that could help us predict the popularity of songs. “1” represents popularity. Red indicates a negative correlation and blue signals a positive one. The darker the color, the stronger the correlation is. Feature correlations with popularity seem to be weak overall. The two strongest positive correlations are loudness (0.16) and danceability (0.14). The strongest negative correlations are instrumentalness (-0.19) and acousticness (-0.09). We predict these four variables will be important for our analysis. The weak correlations between popularity and the features indicate that creating a model with accurate predictions could be difficult.
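
A correlation matrix like the one described could be produced along the following lines; the `tracks` data frame and the exact feature column names are assumptions, and corrplot’s default palette maps negative correlations to red and positive ones to blue.

# Sketch: correlation of the numeric features with popularity (column names assumed)
library(corrplot)

num_vars <- tracks[, c("Popularity", "danceability", "energy", "loudness",
                       "speechiness", "acousticness", "instrumentalness",
                       "liveness", "valence", "tempo")]

corr_mat <- cor(num_vars, use = "complete.obs")
corrplot(corr_mat, method = "color")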

We calculated basic summary statistics to better understand the overall characteristics of our target variable. Popularity ranges from 0-100, with 0 meaning low popularity and 100 meaning high popularity. There are outliers on both ends. The median and mean popularity ratings are both around 50, roughly halfway between 0 and 100. Most of the popularity ratings (the middle 50%) fall in the 40-60 range. This indicates that half of the data we’re using to train and test the model comes from this 40-60 popularity range.
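
The summary statistics and boxplot can be produced with base R as sketched below; the “Popularity” column name in the joined `tracks` data frame is an assumption.

# Sketch: summary statistics and boxplot of the target variable
summary(tracks$Popularity)
boxplot(tracks$Popularity,
        main = "Distribution of Track Popularity",
        ylab = "Popularity (0-100)")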

Popularity versus Duration

Here we graphed popularity versus duration of songs on a bar graph. Popularity is highest at approximately 200,000 milliseconds, but in general popularity varies widely at every level of track duration. This suggests that duration and popularity do not correlate in any clear way, and that duration is unlikely to be an important factor in how popular a song will be.
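
A plot of this relationship could be produced roughly as follows; this is sketched as a scatterplot rather than the bar graph described above, and the `duration_ms` and `Popularity` column names are assumptions.

# Sketch: popularity against track duration
library(ggplot2)

ggplot(tracks, aes(x = duration_ms, y = Popularity)) +
  geom_point(alpha = 0.1) +
  labs(x = "Duration (ms)", y = "Popularity")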

Loudness vs Popularity

Tempo vs Popularity

Liveness vs Popularity

Popularity versus Danceability

## [1] 0.1369768

In order to analyze the relationship between popularity and danceability, we constructed a scatterplot. The scatterplot is very dense, as there are many data points; however, we can see a large cluster around the middle of the popularity range. To analyze the relationship further, we look at the correlation coefficient. The correlation coefficient is 0.1369768, which indicates only a weak positive correlation between these two variables.
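
The scatterplot and the correlation coefficient quoted above can be reproduced roughly as follows; the column names are assumptions.

# Sketch: danceability vs. popularity and their correlation
plot(tracks$danceability, tracks$Popularity,
     pch = ".", xlab = "Danceability", ylab = "Popularity")
cor(tracks$danceability, tracks$Popularity)   # reported as 0.1369768 above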

Methods and Evaluation

Cleaning

Before we could create a model, we had to clean the data. We inner joined the track features and track info datasets so we could have access to all of the data we needed. From there, we dropped variables with no predictive value (disc number, album id, artist id, track name, album name, artist name, release date, release date precision, track number) and a repeated variable (duration_ms). Afterwards, we converted the album type, explicit, and mode columns to factors. Finally, we kept only complete cases to ensure there was no missing data.
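
The cleaning steps could look roughly like the following; the exact column names are assumptions based on the variables described above, and the sketch starts from the joined `tracks` data frame.

# Sketch: drop non-predictive columns, convert factors, and keep complete cases
library(dplyr)

tracks_clean <- tracks %>%
  select(-disc_number, -album_id, -artist_id, -track_name, -album_name,
         -artist_name, -release_date, -release_date_precision,
         -track_number, -duration_ms) %>%
  mutate(album_type = as.factor(album_type),
         explicit   = as.factor(explicit),
         mode       = as.factor(mode))

# Keep only rows with no missing values
tracks_clean <- tracks_clean[complete.cases(tracks_clean), ]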

Decision Tree

After the cleaning process, we split the data into training, testing, and tuning sets. We initially built a regression decision tree model, but we obtained an R-squared value of 0.52, which indicates a poorly performing model, and an RMSE of 11, which is high given that popularity ranges from 0 to 100. We therefore decided to create a regression random forest model to see if we could obtain better results.

# Load caret for cross-validation and model training
library(caret)

# Cross-validation: 10-fold, repeated 5 times
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 5)

# Setting/determining the hyperparameters
# (increasing depth did not increase accuracy)
tree.grid <- expand.grid(maxdepth = c(3:21))

# Train the models
# `features` holds the predictor columns and `target` the Popularity column
# from the training split
set.seed(1984)
tracks_dt <- train(x = features,
                   y = target,
                   method = "rpart2",
                   trControl = fitControl,
                   metric = "RMSE")
tracks_dt

# Same tree, tuned over the maxdepth grid
tracks2_dt <- train(x = features,
                    y = target,
                    method = "rpart2",
                    trControl = fitControl,
                    tuneGrid = tree.grid,
                    metric = "RMSE")

Random Forest

We first created a random forest regression model with an mtry of 4 (roughly the square root of the number of predictors) and 1000 trees. The RMSE we obtained was still fairly high at about 15, but the R-squared was much higher at about 0.93. Next, we adjusted each hyperparameter while holding the others constant to find the best-performing combination. We kept the number of variables considered at 4 because increasing or reducing it decreased model accuracy. Increasing the sample size decreased the MSE slightly to 232 but decreased the R-squared to about 0.91. For our final model, we set ntree to 700, mtry to 4, sampsize to 500, and nodesize to 20. When using our model on the testing set, we obtained an RMSE of 15.51, an MAE of 12.21, and an R-squared of 0.08.

# Load the randomForest package
library(randomForest)

set.seed(1984)
# Final random forest regression model, fit on the training split (tr_train)
tracks_RF_final = randomForest(Popularity ~ .,      # predict Popularity from all remaining columns
                               tr_train,
                               ntree = 700,         # number of trees
                               mtry = 4,            # variables considered at each split
                               replace = TRUE,      # bootstrap sampling with replacement
                               sampsize = 500,      # size of each bootstrap sample
                               nodesize = 20,       # minimum size of terminal nodes
                               importance = TRUE,   # record variable importance
                               proximity = FALSE,
                               norm.votes = TRUE,
                               do.trace = TRUE,     # print progress during training
                               keep.forest = TRUE,
                               keep.inbag = TRUE)
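
The test-set metrics reported above can be reproduced along these lines; `tr_test` as the name of the held-out test split is an assumption, and the metric formulas are standard definitions rather than code taken from the original analysis.

# Sketch: evaluate the final random forest on the held-out test set
rf_pred <- predict(tracks_RF_final, newdata = tr_test)   # tr_test is assumed

rmse <- sqrt(mean((tr_test$Popularity - rf_pred)^2))   # root mean squared error
mae  <- mean(abs(tr_test$Popularity - rf_pred))        # mean absolute error
rsq  <- cor(tr_test$Popularity, rf_pred)^2             # squared correlation as R-squared

c(RMSE = rmse, MAE = mae, R2 = rsq)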

Evaluation Statistics Summary

Statistic Value
R-squared(DT) 0.52
RMSE(DT) 11.00
R-squared(RF) 0.91
RMSE(RF) 15.00
R-squared(RF-test) 0.08
RMSE(RF-test) 15.51
MAE(RF-test) 12.21

In order to evaluate our model, we used the metrics R-squared, RMSE, and MAE. Since R-squared is a measure of how much of the variation in the target variable is explained by the independent variables, we wanted to interpret how well our features explained the variation in popularity. In order to evaluate the actual accuracy of our models, we used RMSE to indicate the level of error for the training sets in the decision tree and random forest. In order to evaluate accuracy after passing the test set through the random forest, we used MAE to indicate the average of all the observations’ errors.

Conclusions and Future Work

Given the low R-squared values and high MAE and RMSE values, our random forest isn’t the best option to predict the popularity of songs on Spotify. While we achieved a 0.91 R-squared with the training set, we obtained a 0.08 R-squared with the testing set. To us, this indicates that the model might have overfit the training data. In the future, we would have to determine ways to mitigate this. Are there any hyperparameter adjustments we could make to mitigate the overfitting? Also, given the range of the target variable (0-100), the RMSE (15.51) and the MAE (12.21) are quite high, which lowers the predictive value of the model. We couldn’t lower these much with the hyperparameter adjustments we tried, so it would be interesting to see if another machine learning model type could achieve a lower RMSE and MAE. Additionally, would reducing the features to the five most correlated with popularity make a difference in the quality of predictions? We are also interested in subsetting the data by some criterion (e.g., by the year the song was created) to see if that decreases the error in predictions. In our analysis, we have only scratched the surface of what we can do with the data.

In this project, several aspects limited our analysis, although the dataset itself was generally excellent. The biggest limitation was that we did not have an exact link (URL) for the dataset, only two CSV files (TrackInfo and TrackFeatures). Although at first glance this may not seem like a clear limitation, not having descriptions of the variables limited our analysis, as we did not immediately have clear definitions of certain variable names, such as “key.” In addition, we do not necessarily know the difference between “energy” and “danceability,” or “liveness” and “tempo,” as they can appear to be synonyms. Another limitation was that this dataset was much larger than the datasets we have handled this semester for the class. Handling a large dataset for the first time can be challenging, especially when it comes to data cleaning and preprocessing.