Chris Steege
Problem Statement: I have been captivated by music throughout my entire life. I am an avid listener and participant, and I enjoy the art of songwriting and jamming out with friends. I chose to work with this Spotify data set, which I pulled from GitHub, because I am curious what there is to be learned from a set of songs, their qualities, artists, and popularity outcomes. In particular, I have often wondered what qualities make a song popular and how industry professionals might be able to predict a “hit” song. Moreover, I wonder how these predictions could be valuable to companies in the music industry looking to find potential hit songs and profit from publishing, marketing, and selling recorded music and music videos.
Solution Approach: After cleaning the data, I chose to explore it through visualization and univariate/multivariate statistical methods, answering the questions that arose from my natural curiosity as I got to know the data set. I also want to try clustering the tracks to explore prototypical songs and whether genre properly categorizes song prototypes.
Insights:
Pop, Rap, and Latin are currently the most popular genre types on Spotify in descending order.
The genres which appear to be experiencing the most significant increase of attention from the year 2000 to 2020 were rap, R&B, and, to a lesser extent, Latin music. EDM (Electronic Dance Music) reached its pinnacle in 2005 and appears to be making a gradual resurgence to former heights.
Every major genre was more likely to have its songs produced in the major key. Rock has the largest proportion with nearly 70% of songs being major.
Top Artists by Decade (Most Hit Songs):
The distribution of popularity scores is unbalanced because most songs never seemed to make it off the ground. The bottom 1% and the top 25% each capture 8% of the popularity scores from the data.
Two notable pairs of independent variables have an issue with collinearity. There was a strong positive correlation between loudness and energy (Pearson correlation of 0.68) and a strong negative correlation between energy and acousticness (Pearson correlation of -0.55). Loudness and acousticness, as well as danceability and valence, also shared a fairly strong correlation.
Energy, instrumentalness, speechiness, and danceability were the strongest predictors of track popularity according to our regression models, while duration, tempo, and key held close to no predictive power.
Songs that are energetic and loud and songs that are quiet and gentle can both be popular. The quantitative attributes seem to be the greatest indicators of popularity. The most important is instrumentalness, which works against popularity based on our clustering evaluation.
Here is a comprehensive list of all the packages necessary for data cleaning, visualization, and analysis.
library(tibble)
library(xtable)
library(DT)
library(knitr)
library(tm)
library(ggplot2)
library(dplyr)
library(plotly)
library(readr)
library(caret)
library(magrittr)
library(reshape2)
library(Hmisc)
library(tidyverse)
library(modelr)
library(broom)
library(pscl)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(glmnet)
library(factoextra)
library(fpc)
library(viridis)
library(rpart)
library(rpart.plot)
library(randomForest)
library(gbm)
library(e1071)
All of the procedures in this section are for the purpose of preparing for data analysis.
Spotify Songs data set from GitHub: The data set of Spotify songs contains 23 variables and 32,833 songs from 1957 to 2020. There are 10,693 artists and 6 main genres with sub-categories for each.
This data set was downloaded from GitHub.
More information about the data here.
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
## [1] 32833 23
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
## # A tibble: 5 x 23
## track_id track_name track_artist track_popularity track_album_id
## <chr> <chr> <chr> <dbl> <chr>
## 1 69gRFGO~ <NA> <NA> 0 717UG2du6utFe~
## 2 5cjecvX~ <NA> <NA> 0 3luHJEPw434tv~
## 3 5TTzhRS~ <NA> <NA> 0 3luHJEPw434tv~
## 4 3VKFip3~ <NA> <NA> 0 717UG2du6utFe~
## 5 69gRFGO~ <NA> <NA> 0 717UG2du6utFe~
## # ... with 18 more variables: track_album_name <chr>,
## # track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## # playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## # energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## # acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## # tempo <dbl>, duration_ms <dbl>
We will now continue forward by removing any duplicates.
The dimensions following the removal of duplicates by id and name are different. This tells us that there are songs within the data set that contain the same name but have a different id. We want to remove these duplicates, but we want to keep our songs titled ‘NA’.
Dimensions after removing track_id duplicates:
## [1] 28356 23
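# Keep the first occurrence of each track_name, then merge back every row that
# contains an NA so the songs titled 'NA' are not dropped as duplicates.
Spotify_songs = merge(
  Spotify_songs[!duplicated(Spotify_songs$track_name), ],
  Spotify_songs[rowSums(is.na(Spotify_songs)) > 0, ],
  all = TRUE, sort = FALSE
)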
Dimensions after removing track_name duplicate:
## [1] 23450 23
Spotify_songs <- Spotify_songs %>%
  mutate(playlist_genre = as.factor(playlist_genre),
         playlist_subgenre = as.factor(playlist_subgenre),
         mode = as.factor(mode),
         key = as.factor(key))
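# Initial quartile-based grouping (overwritten by the fixed cutoffs below).
Spotify_songs$popularity_group <- as.character(ntile(Spotify_songs$track_popularity, 4))
# Assign groups from fixed cutoffs at 20, 40, and 60.
# Note: a score of exactly 0 matches none of the first three conditions,
# so it falls through to the final TRUE branch and lands in group "4".
Spotify_songs <- Spotify_songs %>%
  mutate(popularity_group = case_when(
    ((track_popularity > 0) & (track_popularity < 20)) ~ "1",
    ((track_popularity >= 20) & (track_popularity < 40)) ~ "2",
    ((track_popularity >= 40) & (track_popularity < 60)) ~ "3",
    TRUE ~ "4")
  )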
table(Spotify_songs$popularity_group)
##
## 1 2 3 4
## 3225 5211 7552 7466
In this data set, each row represents a unique song, and the columns give us information about the songs.
Here, I will provide a table with the variable names, types, and descriptions.
| Variable Name | Data Type | Variable Description |
|---|---|---|
| track_id | character | Song unique ID |
| track_name | character | Song Name |
| track_artist | character | Song Artist |
| track_popularity | numeric | Song Popularity (0-100) where higher is better |
| track_album_id | character | Album unique ID |
| track_album_name | character | Song album name |
| track_album_release_date | character | Date when album released |
| playlist_name | character | Name of playlist |
| playlist_id | character | Playlist ID |
| playlist_genre | factor | Playlist genre |
| playlist_subgenre | factor | Playlist subgenre |
| danceability | numeric | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | numeric | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | factor | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | numeric | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | factor | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | numeric | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | numeric | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | numeric | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | numeric | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | numeric | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | numeric | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | numeric | Duration of song in milliseconds |
| popularity_group | character | Popularity group ("1" to "4") assigned from track_popularity using cutoffs at 20, 40, and 60 |
First, I wanted to explore the popularity of different genres. I found that Pop, Rap, Latin, Rock, R&B, and EDM were the genres from most to least popular.
I then wanted to visualize how the popularity of each genre changes with respect to its release date. The genres which appear to be experiencing the most significant increase of attention from the year 2000 to 2020 were rap, R&B, and, to a lesser extent, Latin music. Notably, EDM music reached its pinnacle in 2005 and appears to be making a return.
The musician in me was curious to see how often the major and minor keys were used for each genre. I was expecting to see the major key more often, but I was surprised to see that it captured the majority for every single genre. Pop and rock especially are very likely to be produced in the major key.
Here, I wanted to find out the number of hit songs released by each artist per decade. I’ve enjoyed music from the 1950s onward, so it was interesting for me to see who the most popular artists from each decade are according to this data set. I’m happy to see The Beatles and Queen performing well in their respective decades!
I decided to start exploring the popularity scores by plotting a histogram. Upon doing this, I noticed that a large portion of the songs are not popular. There are very few songs with a popularity greater than 75. The top 25% captures 8% of our observations, and the bottom 1% also captures 8%, so the popularity scores are quite unbalanced on the low end. The mean popularity score is about 39.
Here is a list of all the correlations between our independent and dependent numeric variables sorted by significance.
I wanted to look at a correlation heat map of this data to visualize any large correlations between the independent variables and track popularity. Upon inspection, it does seem we have a few notable correlations. Loudness and energy have a strong positive correlation, while loudness-acousticness and acousticness-energy have moderate negative correlations. The rest are below 0.2 in absolute value.
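As a rough illustration of how such a sorted correlation list can be produced, here is a minimal base R sketch. The data frame `demo` and its values are synthetic stand-ins (not the Spotify data), and the pairs are sorted by absolute Pearson correlation, which is an assumption about what "sorted by significance" means here.

```r
# Synthetic stand-in data (hypothetical values, not the Spotify data set).
set.seed(1)
n <- 300
energy <- runif(n)
demo <- data.frame(
  energy       = energy,
  loudness     = -20 + 15 * energy + rnorm(n, sd = 2),
  acousticness = pmin(pmax(1 - energy + rnorm(n, sd = 0.3), 0), 1),
  tempo        = rnorm(n, mean = 120, sd = 25)
)

# Pearson correlation matrix of all numeric columns.
cor_mat <- cor(demo, method = "pearson")

# Keep each variable pair once by taking the upper triangle.
pairs <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_tbl <- data.frame(
  var1 = rownames(cor_mat)[pairs[, "row"]],
  var2 = colnames(cor_mat)[pairs[, "col"]],
  r    = cor_mat[pairs]
)

# Sort so the strongest correlations (positive or negative) come first.
cor_tbl <- cor_tbl[order(-abs(cor_tbl$r)), ]
print(cor_tbl, row.names = FALSE)
```

The same `cor_mat` could be handed to a heat map function; the sorted table simply makes the strongest pairs easy to spot.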
Profit Function
I’m going to start by modeling this data with a linear regression to try to predict the popularity of songs. I will use the standard procedure of splitting the data set into a training set and a test set with a 70% to 30% split, respectively. Then I will perform variable selection and fit the training data to a linear model. I will be using forward, backward, and LASSO variable selection techniques later on.
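The 70/30 split described above could look roughly like the following base R sketch. The `songs` data frame is a synthetic stand-in, and the names `train_demo` and `test_demo` are hypothetical; the seed and the two predictors are chosen only for illustration.

```r
# Synthetic stand-in for the cleaned song table (hypothetical values).
set.seed(42)
songs <- data.frame(
  track_popularity = sample(0:100, 200, replace = TRUE),
  energy           = runif(200),
  danceability     = runif(200)
)

# Draw 70% of the row indices at random for training;
# the remaining 30% of rows form the test set.
train_idx  <- sample(nrow(songs), size = round(0.7 * nrow(songs)))
train_demo <- songs[train_idx, ]
test_demo  <- songs[-train_idx, ]

# Fit the linear model on the training rows only.
fit <- lm(track_popularity ~ energy + danceability, data = train_demo)
summary(fit)
```

Fitting only on the training rows keeps the test rows untouched, so they can later serve as an honest out-of-sample check.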
Below are the results from fitting the full model on the training data using all relevant continuous and categorical variables.
##
## Call:
## lm(formula = track_popularity ~ danceability + energy + loudness +
## speechiness + acousticness + instrumentalness + liveness +
## valence + tempo + duration_ms + key + mode + playlist_genre,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.276 -16.016 3.035 17.158 63.369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.521e+01 2.320e+00 23.794 < 2e-16 ***
## danceability 1.065e+01 1.481e+00 7.189 6.79e-13 ***
## energy -2.246e+01 1.583e+00 -14.186 < 2e-16 ***
## loudness 1.274e+00 8.327e-02 15.298 < 2e-16 ***
## speechiness -3.268e+00 1.895e+00 -1.724 0.08467 .
## acousticness 4.723e+00 9.378e-01 5.036 4.80e-07 ***
## instrumentalness -6.705e+00 8.132e-01 -8.245 < 2e-16 ***
## liveness -2.442e+00 1.149e+00 -2.126 0.03349 *
## valence -8.001e-01 8.719e-01 -0.918 0.35883
## tempo 2.983e-02 6.643e-03 4.490 7.16e-06 ***
## duration_ms -3.783e-05 2.930e-06 -12.910 < 2e-16 ***
## key1 5.063e-01 7.310e-01 0.693 0.48860
## key2 -6.390e-01 7.931e-01 -0.806 0.42044
## key3 -8.053e-03 1.192e+00 -0.007 0.99461
## key4 -2.030e-01 8.636e-01 -0.235 0.81415
## key5 3.836e-01 8.218e-01 0.467 0.64068
## key6 3.646e-01 8.330e-01 0.438 0.66165
## key7 -9.224e-01 7.597e-01 -1.214 0.22471
## key8 8.560e-01 8.321e-01 1.029 0.30366
## key9 -7.544e-01 7.826e-01 -0.964 0.33507
## key10 6.295e-01 8.587e-01 0.733 0.46348
## key11 -2.153e-01 7.947e-01 -0.271 0.78645
## mode1 7.823e-01 3.721e-01 2.103 0.03552 *
## playlist_genrelatin 7.718e+00 6.829e-01 11.301 < 2e-16 ***
## playlist_genrepop 1.267e+01 6.327e-01 20.029 < 2e-16 ***
## playlist_genrer&b 2.357e+00 7.197e-01 3.274 0.00106 **
## playlist_genrerap 7.913e+00 6.572e-01 12.041 < 2e-16 ***
## playlist_genrerock 1.116e+01 7.311e-01 15.266 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.27 on 16407 degrees of freedom
## Multiple R-squared: 0.09744, Adjusted R-squared: 0.09596
## F-statistic: 65.61 on 27 and 16407 DF, p-value: < 2.2e-16
When visualizing the performance of the full model without variable selection, it is clear that our model fit is far from optimal based on the spread between the residuals and fitted values. Also, the error between our predictions and the actual values, as shown in the second chart, indicates that our predictions are not very accurate.
From the Q-Q plot, we can see that the residuals are not normally distributed. This is evident by the snaking line. The predictions of popularity for values in the quantiles around the mean are fit well to the line, but the extreme values have significant residuals.
Based on estimate values, the attributes which most contribute to the prediction are:
Playlist Genre
Energy
Danceability
Instrumentalness
Based on estimate values, the attributes which least contribute to the prediction are:
Key
Tempo
Duration
The coefficients indicate the degree of incremental change for each variable in its prediction of track popularity. The negative values will cause a prediction of lower track popularity and positive values predict higher track popularity.
Forward and backward stepwise variable selection ended up choosing the same set of predictor variables.
I want to choose the optimal penalty parameter, which reduces the MSE of our predictions the most. Based on the chart, as long as the log lambda is low then we should be fine.
LASSO has chosen similar variables as stepwise regression, but it left a few additional variables out of the picture. The additional variables it dropped were mode, speechiness, and a few of the specific genre categories.
## 31 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 6.153480e+01
## playlist_genreedm -7.513109e+00
## playlist_genrelatin .
## playlist_genrepop 3.591696e+00
## playlist_genrer&b -4.499203e+00
## playlist_genrerap .
## playlist_genrerock 7.074539e-02
## danceability 3.561999e+00
## energy -1.508254e+01
## key0 .
## key1 .
## key2 .
## key3 .
## key4 .
## key5 .
## key6 .
## key7 .
## key8 .
## key9 .
## key10 .
## key11 .
## loudness 7.740227e-01
## mode0 .
## mode1 .
## speechiness .
## acousticness 2.522423e+00
## instrumentalness -5.777941e+00
## liveness -8.977511e-01
## valence .
## tempo 3.474570e-03
## duration_ms -3.356553e-05
Here I want to look at the in-sample MSE, out-of-sample MSE, and profit for the full model, stepwise, and LASSO regression models.
The in-sample and out-of-sample MSEs are very similar for all variable selection models.
| | In-Sample | Out-of-Sample | Profit |
|---|---|---|---|
| Full_Model_MSE | 495.0772 | 493.4970 | 524000 |
| Step_b_MSE | 495.4129 | 493.3253 | 852000 |
| Step_f_MSE | 495.4129 | 493.3253 | 852000 |
| LASSO_MSE | 495.2913 | 493.3668 | 660000 |
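For reference, in-sample and out-of-sample MSE values like those in the table above can be computed along these lines. This is a minimal sketch on the built-in `mtcars` data rather than the Spotify data; the helper `mse()` and the names `train_demo` and `test_demo` are hypothetical, and the profit column is not reproduced.

```r
# Mean squared error between observed and predicted values.
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Built-in data as a stand-in; 70/30 train/test split.
set.seed(7)
idx        <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train_demo <- mtcars[idx, ]
test_demo  <- mtcars[-idx, ]

fit <- lm(mpg ~ wt + hp, data = train_demo)

# In-sample: predictions on the rows the model was fit to.
mse_in  <- mse(train_demo$mpg, predict(fit, newdata = train_demo))
# Out-of-sample: predictions on rows the model has never seen.
mse_out <- mse(test_demo$mpg, predict(fit, newdata = test_demo))

c(in_sample = mse_in, out_of_sample = mse_out)
```

When the two numbers are close, as they are for the variable selection models here, the model is not overfitting, even if the absolute error is still large.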
From earlier we noticed that our track popularity was unbalanced. The majority of tracks are unpopular. To try to improve the linear regression model, I will continue on by using under-sampling. This will cause the regression to balance the classes and give more value to the popular songs we care about.
I’m going to approach this task by taking all the samples from popular tracks and then randomly sampling an equal amount of observations from the unpopular category. Since the cutoff for popular and unpopular songs is arbitrarily chosen, I will choose multiple cutoffs and try linear regression on each.
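The under-sampling procedure described above can be sketched as follows. The helper `undersample()`, the synthetic `songs` data, and the choice to treat scores at or above the cutoff as popular are all assumptions made for illustration.

```r
# Synthetic, low-skewed popularity scores (hypothetical values).
set.seed(99)
songs <- data.frame(
  track_popularity = pmin(100, round(rexp(1000, rate = 1 / 30))),
  energy           = runif(1000),
  danceability     = runif(1000)
)

# Keep every popular track and an equally sized random sample
# of unpopular tracks. Assumes "popular" means score >= cutoff.
undersample <- function(data, cutoff) {
  popular   <- data[data$track_popularity >= cutoff, ]
  unpopular <- data[data$track_popularity < cutoff, ]
  n_keep    <- min(nrow(popular), nrow(unpopular))
  keep      <- sample(nrow(unpopular), n_keep)
  rbind(popular, unpopular[keep, ])
}

# Try several cutoffs and record the R squared of a linear fit on each.
cutoffs <- seq(45, 70, by = 5)
r_squared <- sapply(cutoffs, function(cutoff) {
  balanced <- undersample(songs, cutoff)
  fit <- lm(track_popularity ~ energy + danceability, data = balanced)
  summary(fit)$r.squared
})
names(r_squared) <- paste("Cutoff -", cutoffs)
r_squared
```

Because the predictors in this sketch are pure noise, the R squared values carry no meaning; the point is only the mechanics of balancing the two groups at each cutoff.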
The R squared increases as the threshold rises. This could be a good indicator, but we should see how well the model performs in regards to MSE to determine if this is actually improving the model.
| | model_r.squared | model_pvalue |
|---|---|---|
| Original model | 0.0620000 | 0 |
| Cutoff - 45 | 0.1022451 | 0 |
| Cutoff - 50 | 0.1091899 | 0 |
| Cutoff - 55 | 0.1254122 | 0 |
| Cutoff - 60 | 0.1452157 | 0 |
| Cutoff - 65 | 0.1712999 | 0 |
| Cutoff - 70 | 0.2096549 | 0 |
Even though the R squared increases when under-sampling at a higher threshold, it leads to a higher MSE, especially out of sample. That is not good, but our profits increased, which is what matters most.
Below is a table of MSE using an under-sampling threshold of 70. Any variation of the threshold still led to worse predictive power. Notably, the different variable selection techniques are still performing very close to each other.
We are going to have to try out some different models to see if we can improve our predictions.
| | In-Sample | Out-of-Sample | Profit |
|---|---|---|---|
| Full_Model_under_MSE | 507.2579 | 660.1027 | 1064000 |
| Step_b_under_MSE | 508.8740 | 658.8303 | 1116000 |
| Step_f_under_MSE | 508.8740 | 658.8303 | 1116000 |
| LASSO_under_MSE | 508.3991 | 659.5437 | 1068000 |
Random Forest
The main idea of random forest is to consider only a random m out of p variables at each split in each decision tree generated. The purpose of doing this is to decorrelate the trees so as to reduce the variance of the averaged predictions.
Below I have fit the data to the random forest algorithm using the under-sampled data set.
I also want to observe how the error rate relates to the number of trees.
Choosing Number of Candidate Variables
## 1 2 3 4 5 6 7 8 9 10 11 12 13
Building Model
##
## Call:
## randomForest(formula = track_popularity ~ ., data = train[, -c(14)], importance = TRUE, ntree = 120)
## Type of random forest: regression
## Number of trees: 120
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 496.7592
## % Var explained: 9.44
Model Validation
| | In-Sample | Out-of-Sample | Profit |
|---|---|---|---|
| Spotify_rf_MSE | 496.7592 | 490.8139 | 1248000 |
| Spotify_undersampled_rf_MSE | 507.1951 | 629.3525 | 1832000 |
Boosted Trees
Boosting trees is similar to bagging trees in that it combines many trees into one ensemble. What is different is that boosted trees are grown sequentially: each new tree is fit to the errors of the trees before it, so observations that were predicted poorly are given more weight in subsequent trees.
Interestingly, the relative influence of our variables using this model is slightly different than our linear regression models.
## var rel.inf
## key key 15.3262726
## playlist_genre playlist_genre 13.3735598
## instrumentalness instrumentalness 9.1094248
## duration_ms duration_ms 8.5271447
## loudness loudness 7.4915693
## energy energy 7.1776526
## tempo tempo 7.1092586
## acousticness acousticness 6.9209974
## speechiness speechiness 6.8783594
## danceability danceability 6.8293068
## valence valence 5.6309892
## liveness liveness 5.2902582
## mode mode 0.3352065
Optimal Trees
Model Validation
## Using 510 trees...
##
## Using 510 trees...
##
## Using 510 trees...
##
## Using 510 trees...
| | In-Sample | Out-of-Sample | Profit |
|---|---|---|---|
| Spotify_boost_MSE | 462.1989 | 477.3446 | 2472000 |
| Spotify_undersampled_boost_MSE | 392.2046 | 643.7124 | 2152000 |
Support Vector Regression
Support Vector Machines are most commonly used as classifiers, but they can also be used in regression for predicting continuous numerical values.
How it works:
Tuning - Grid Search
# tuneResult <- tune(svm, track_popularity ~., data = Undersampled_train,
# ranges = list(epsilon = seq(0.5,0.7,0.02), cost = 2^(2:7))
# )
Building the Model
##
## Call:
## svm(formula = track_popularity ~ ., data = train, cost = 4, gamma = 1/length(train),
## epsilon = 0.58, probability = TRUE)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 4
## gamma: 0.07142857
## epsilon: 0.58
##
## Sigma: 0.7847259
##
##
## Number of Support Vectors: 9361
Model Validation
| | In-Sample | Out-of-Sample | Profit |
|---|---|---|---|
| Spotify_svm_MSE | 377.4687 | 510.1713 | 2400000 |
| Spotify_undersampled_svm_MSE | 315.9451 | 658.7301 | 1412000 |
Results and Analysis
There was a clear overall winner. The boosted trees had the best out-of-sample MSE at 477 and a profit of $2,472,000 on the test data set. It is possible that this does not stay the case if we were to add more data or attributes, or change the cost function. As the cost function changes, the relative profits could change significantly.
I found it interesting that under-sampling always increased the out-of-sample MSE (while lowering the in-sample MSE for boosted trees and SVR), but only sometimes increased the profits. Profits rose for the regression algorithms and random forest, probably because the data is heavily influenced by songs that never gain traction and never get popular. Some of these songs are experiencing such low popularity for reasons that are not contingent upon their attributes, or they simply have not had time to get attention. This is a problem when we value being able to predict popular songs. The larger MSE with under-sampling is not necessarily a bad thing, because we value being able to predict popular songs even at the expense of being far off sometimes.
It actually makes sense that Boosted Trees and SVR were the best-performing algorithms based on how they work.
## In-Sample Out-of-Sample Profit
## Boosted Trees 462.1989 477.3446 2472000
## SVR 377.4687 510.1713 2400000
## Boosted Trees Undersampled 392.2046 643.7124 2152000
## Random Forest Undersampled 507.1951 629.3525 1832000
## SVR Undersampled 315.9451 658.7301 1412000
## Random Forest 496.7592 490.8139 1248000
## LR Undersampled - Stepwise Selection 508.8740 658.8303 1116000
## LASSO Undersampled 508.3991 659.5437 1068000
## LR Undersampled - Full Model 507.2579 660.1027 1064000
## LR - Stepwise Selection 495.4129 493.3253 852000
## LASSO 495.2913 493.3668 660000
## LR - Full Model 495.0772 493.4970 524000
K-Means Algorithm
K-Means Clustering is an unsupervised clustering algorithm.
In this case, however, I want to use it for 2 purposes: as a classifier for genre and as a way to find clusters apparent in the data.
Classification is possible because we can choose the same K value as the number of genres and classify songs based on the dominant genre of the most proximal cluster.
How it works: (in a nutshell).
K-Means - Genre Classification
Scaling
I will start by standardizing the data so that every numeric variable has a mean of 0 and a standard deviation of 1. This helps the algorithm because, otherwise, variables measured on larger scales would have more inherent influence on the distance calculations.
train_K <- train
train_scale <- scale(train[,-c(2,5,7,14)])
test_scale <- scale(test[,-c(1,2,3,5,6,7,8,9,10,11,14,16,24)])
Fitting K-Means
We have 6 genres, so I will choose to have 6 centroids. Setting nstart to 25 generates 25 random initial configurations and keeps the best one. This puts the algorithm in a good direction to capture the clusters as well as possible.
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3555997 190.0 6596764 352.4 6302146 336.6
## Vcells 21790322 166.3 67024115 511.4 67024115 511.4
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 821750)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 821750)
| Cluster | Size |
|---|---|
| Cluster 1 | 1340 |
| Cluster 2 | 1046 |
| Cluster 3 | 2388 |
| Cluster 4 | 4072 |
| Cluster 5 | 5290 |
| Cluster 6 | 2299 |
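The scaling and fitting steps above can be sketched in a few lines of base R. The built-in `iris` measurements stand in for the song attributes here, so the cluster sizes will not match the table above; only the `scale()` and `kmeans()` calls mirror the approach described.

```r
# Built-in data as a stand-in for the numeric song attributes.
set.seed(123)
features <- scale(iris[, 1:4])  # each column now has mean 0 and sd 1

# Confirm the standardization.
round(colMeans(features), 10)
apply(features, 2, sd)

# Six centroids, keeping the best of 25 random starting configurations.
km <- kmeans(features, centers = 6, nstart = 25)

# Number of observations assigned to each cluster.
km$size
```

Using a larger `nstart` costs more computation but makes the result less sensitive to an unlucky random start.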
Visualization of Clusters
I want to try to visualize the cluster separation by projecting the attributes in 2 dimensions and highlighting the clusters. This is helpful for picturing how well our clusters are being separated.
It seems that K-Means is doing an okay job of clustering, since the boundaries are discernible. There is some significant overlap between clusters, however.
Model Validation - Classifier
I am now going to use the clustering results and try to associate each cluster with a genre based on the most common genre in each.
Unfortunately, the clusters did not seem to separate by genre distinctively. This is indicated by a high cluster entropy. The clusters are not always dominated by one particular genre, and in some cases, a class is the dominant genre in multiple clusters. Rather than trying to figure out how to make this work as a genre classifier, I believe it may be best to not continue trying to use K means as a classifier for genre.
## `summarise()` regrouping output by 'Cluster' (override with `.groups` argument)
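To make the cluster-to-genre check above concrete, here is a minimal base R sketch of assigning each cluster its most common label and measuring how mixed each cluster is. It uses the built-in `iris` species as a stand-in for genre, and computing entropy in bits from the within-cluster label shares is an assumption about how "cluster entropy" is defined here.

```r
# Built-in data as a stand-in: species plays the role of genre.
set.seed(123)
features <- scale(iris[, 1:4])
km <- kmeans(features, centers = 3, nstart = 25)

# Cross-tabulate cluster membership against the known labels.
counts <- table(cluster = km$cluster, label = iris$Species)

# The dominant label in each cluster is the one with the highest count.
dominant <- colnames(counts)[apply(counts, 1, which.max)]

# Entropy (in bits) of the label shares within each cluster:
# 0 means one label only, log2(number of labels) means an even mix.
shares <- prop.table(counts, margin = 1)
cluster_entropy <- apply(shares, 1, function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
})

counts
dominant
cluster_entropy
```

If the same label turns up as `dominant` for several clusters, or the entropy values sit near their maximum, the clusters are not a usable classifier for that label, which is the situation described above.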
Unsupervised K-Means
The Elbow Method:
I tried variable selection based on the results I got from stepwise variable selection, LASSO, and the correlation heat map. The variables I experimented with removing were Valence, Speechiness, and Loudness, using different combinations.
I have decided to keep 9 variables by removing valence and speechiness. When I did this, it led to a prominent elbow at 3 clusters. This is promising so I will move forward with this set of variables and use 3 clusters.
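The elbow method mentioned above can be sketched as follows: fit K-Means for a range of K and plot the total within-cluster sum of squares. The built-in `iris` data is again a stand-in, and the range of 1 to 10 clusters is an arbitrary choice for illustration.

```r
# Built-in data as a stand-in for the selected song attributes.
set.seed(123)
features <- scale(iris[, 1:4])

# Total within-cluster sum of squares for K = 1, ..., 10.
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(features, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the K after which adding clusters stops paying off.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```

The curve always decreases as K grows, so the choice rests on where the decrease flattens rather than on the minimum.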
3 Cluster Model
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3700933 197.7 6596764 352.4 6596764 352.4
## Vcells 22111515 168.7 67024115 511.4 67024115 511.4
| Cluster | Size |
|---|---|
| Cluster 1 | 1513 |
| Cluster 2 | 10681 |
| Cluster 3 | 4241 |
Visualizing Clusters
We can again plot the clusters in 2 dimensions to visualize the clusters and their distinctness.
There is some overlap between clusters again, but they are fairly distinct and well separated.
Descriptive Statistics
Boxplots
The box plots offer a way of understanding how the clusters are distinct from each other. Each cluster has some characteristics that distinguish it from the others.
Cluster 1
Cluster 2
Cluster 3
Cluster Result and Analysis
I wanted to use K-means for 2 purposes: to classify genre and to find clusters apparent in the data. Our classifier did not work well, because our data did not naturally separate into 6 clusters that each matched a single genre. I could have inferred a genre associated with each cluster, based on its share of each cluster and the count of songs of each genre, but in this case, it did not seem a proper use of the tool.
I then used K-means for unsupervised clustering and let the data tell me what the clusters are. The elbow method pointed to 3 clusters, and in my attribute analysis, I found each cluster is identifiable by a few attributes. Clusters 1 and 2 were characterized by being either loud, energetic, and live sounding or soft, gentle, and acoustic, and both were quite popular compared to the 3rd cluster. The third cluster was not very popular, and its distinguishing characteristic was instrumentalness.
The final result from the clustering indicated to me that music on both ends of the energetic spectrum and of the different genres can be popular and that the prototypical songs based on the attributes are reducible beyond genre. What does not work very well for popularity is instrumental music. This is reminiscent of our insights from the linear regression where instrumentalness had a strong negative association with popularity.
Insights:
Pop, Rap, and Latin are currently the most popular genre types on Spotify in descending order.
The genres which appear to be experiencing the most significant increase of attention from the year 2000 to 2020 were rap, R&B, and, to a lesser extent, Latin music. EDM (Electronic Dance Music) reached its pinnacle in 2005 and appears to be making a gradual resurgence to former heights.
Every major genre was more likely to have its songs produced in the major key. Rock has the largest proportion with nearly 70% of songs being major.
Top Artists by Decade (Most Hit Songs):
The distribution of popularity scores is unbalanced because most songs never seemed to make it off the ground. The bottom 1% and the top 25% each capture 8% of the popularity scores from the data.
Two notable pairs of independent variables have an issue with collinearity. There was a strong positive correlation between loudness and energy (Pearson correlation of 0.68) and a strong negative correlation between energy and acousticness (Pearson correlation of -0.55). Loudness and acousticness, as well as danceability and valence, also shared a fairly strong correlation.
Energy, instrumentalness, speechiness, and danceability were the strongest predictors of track popularity according to our regression models, while duration, tempo, and key held close to no predictive power.
Songs that are energetic and loud and songs that are quiet and gentle can both be popular. The quantitative attributes seem to be the greatest indicators of popularity. The most important is instrumentalness, which works against popularity based on our clustering evaluation.
Reflection and Future Work:
This project was exciting, because I got to explore and apply many new packages and techniques in R through statistical analysis and data visualization. I was happy with the way our track popularity predictions turned out. The first few models did not perform well compared to the boosted trees and SVM. I found it very interesting that under-sampling improved the efficacy of some models with profit estimations, but worsened it in others. This is largely due to the mechanics of particular algorithms.
The scope of what could be explored with this project is not nearly exhausted. There is still more to be explored with regard to variable/model selection, different under-sampling techniques, new algorithms, and optimization and hyper-parameter tuning. One immediate idea for variable selection is to use PCA to find the principal components for building the model, which would address the collinearity issues. I also would like to see what would happen if I simply normalized the track_popularity curve by dropping many of the low popularity songs. Finally, the algorithms I used predicted popularity as continuous values, but I would also like to see what our profits and accuracy would be using classification algorithms to predict popularity groups, such as multinomial regression, naive Bayes, SVM, and ensemble methods. At the very least, what we have now could serve as a useful tool for music industry professionals by utilizing machine learning in the prediction of hit songs. Also, the insights presented may offer something of value to those trying to produce popular songs by infusing qualities of popular songs into their music.