Spotify Data Analysis

Synopsis

Problem Statement: I have been captivated by music throughout my entire life. I am an avid listener and participant, and I enjoy the art of songwriting and jamming out with friends. I chose to work with this Spotify data set, which I pulled from GitHub, because I am curious what can be learned from a set of songs, their qualities, artists, and popularity outcomes. In particular, I have often wondered what qualities make a song popular and how industry professionals might be able to predict a “hit” song. Moreover, I wonder how these predictions could be valuable to companies in the music industry looking to find potential hit songs and profit from publishing, marketing, and selling recorded music and music videos.

Solution Approach: I chose to explore this data, following data cleaning, by visualizing it and using univariate/multivariate statistical methods to answer different questions that arose from my natural curiosity as I got to know the data set. I also want to try clustering on the tracks to explore prototypical songs and whether genre properly categorizes song prototypes.

I then plan to build multiple machine learning models aimed at predicting the popularity of songs based on their attributes alone. The popularity predictions will then be used to estimate profits with a profit function I developed, which estimates the cost to publish, market, and sell a song and the revenue, which depends on how popular the song becomes.

Insights:

Pop, Rap, and Latin are currently the most popular genre types on Spotify in descending order.
The genres which appear to have experienced the most significant increase in attention from the year 2000 to 2020 were rap, R&B, and, to a lesser extent, Latin music. EDM (Electronic Dance Music) reached its pinnacle in 2005 and appears to be making a gradual resurgence to former heights.
Every major genre was more likely to have its songs produced in the major key. Rock has the largest proportion with nearly 70% of songs being major.
Top Artists by Decade (Most Hit Songs):
- 1950’s - Elvis Presley
- 1960’s - The Beatles
- 1970’s - Fleetwood Mac and Queen
- 1980’s - AC/DC, Luis Miguel, and Michael Jackson
- 1990’s - Red Hot Chili Peppers
- 2000’s - Coldplay
- 2010’s - Billie Eilish
- 2020’s - Eminem (So far as of March 2020)
The distribution of popularity scores is unbalanced because most songs never seemed to make it off the ground. The bottom 1% and the top 25% of the score range each capture about 8% of the songs in the data.
Two notable pairs of independent variables have an issue with collinearity. There was a strong positive correlation between loudness and energy (Pearson correlation of 0.68) and a strong negative correlation between energy and acousticness (Pearson correlation of -0.55). Loudness & acousticness and danceability & valence also shared a fairly strong correlation.
Energy, instrumentalness, speechiness, and danceability were the strongest predictors of track popularity according to our regression models, while duration, tempo, and key held close to no predictive power.
Songs that are energetic and loud and songs that are quiet and gentle can both be popular. The quantitative attributes seem to be the greatest indicators of popularity, the most important being instrumentalness based on our clustering evaluation.

Packages Required

Here is a comprehensive list of all the packages needed for data cleaning, visualization, and analysis.

library(tibble)
library(xtable)
library(DT)
library(knitr)
library(tm)
library(ggplot2)
library(dplyr)
library(plotly)
library(readr)
library(caret)
library(magrittr)
library(reshape2)
library(Hmisc)
library(tidyverse) 
library(modelr)     
library(broom) 
library(pscl)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(glmnet)
library(factoextra)
library(fpc)
library(viridis)
library(rpart)
library(rpart.plot)
library(randomForest)
library(gbm)
library(e1071)

Data Preparation

All of the procedures in this section are for the purpose of preparing for data analysis.

Data Import

Spotify Songs data set from GitHub: The data set of Spotify songs contains 23 variables and 32,833 songs from 1957-2020. There are 10,693 artists and 6 main genres with sub-categories for each.

This data set was downloaded from GitHub.

More information about the data here.

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"

## [1] 32833    23

Data Cleaning

Missing Values:

We’ll start cleaning the data by scanning for any missing values within the data set.

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
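A per-column count like the one above can be produced with base R. The sketch below is only illustrative: `songs` is a made-up three-row data frame standing in for the real data, not the author's code or data.

```r
# Toy stand-in for the real data set (hypothetical values).
songs <- data.frame(
  track_id         = c("a1", "b2", "c3"),
  track_name       = c("Song A", NA, "Song C"),
  track_popularity = c(10, 0, 55),
  stringsAsFactors = FALSE
)

# Count the missing values in each column.
na_counts <- colSums(is.na(songs))
print(na_counts)

# Pull out the rows that contain at least one missing value.
rows_with_na <- songs[!complete.cases(songs), ]
print(rows_with_na)
```

`complete.cases()` flags rows with no missing values, so negating it isolates the incomplete rows for inspection.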

There are only 5 observations with missing values. These songs still contain useful information and are distinguishable by their id, so we will simply leave them in the data set.

## # A tibble: 5 x 23
##   track_id track_name track_artist track_popularity track_album_id
##   <chr>    <chr>      <chr>                   <dbl> <chr>         
## 1 69gRFGO~ <NA>       <NA>                        0 717UG2du6utFe~
## 2 5cjecvX~ <NA>       <NA>                        0 3luHJEPw434tv~
## 3 5TTzhRS~ <NA>       <NA>                        0 3luHJEPw434tv~
## 4 3VKFip3~ <NA>       <NA>                        0 717UG2du6utFe~
## 5 69gRFGO~ <NA>       <NA>                        0 717UG2du6utFe~
## # ... with 18 more variables: track_album_name <chr>,
## #   track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## #   playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## #   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, duration_ms <dbl>

Dealing With Duplicates:

We will now continue forward by removing any duplicates.
The dimensions following the removal of duplicates by id and name are different. This tells us that there are songs within the data set that contain the same name but have a different id. We want to remove these duplicates, but we want to keep our songs titled ‘NA’.

Dimensions after removing track_id duplicates:

## [1] 28356    23

# Keep the first row per track_name, then add back the rows with missing values.
Spotify_songs <- merge(
  Spotify_songs[!duplicated(Spotify_songs$track_name), ],
  Spotify_songs[rowSums(is.na(Spotify_songs)) > 0, ],
  all = TRUE, sort = FALSE
)

Dimensions after removing track_name duplicates:

## [1] 23450    23
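As an alternative to the merge above, the same two-step idea (drop repeated ids, then drop repeated names while keeping rows whose name is missing) can be sketched with logical indexing. This is a different technique from the one used in this report, shown on a hypothetical toy data frame.

```r
# Toy data (hypothetical): one repeated id, one repeated name, two missing names.
songs <- data.frame(
  track_id   = c("a1", "a1", "b2", "c3", "d4", "e5"),
  track_name = c("Hello", "Hello", "Hello", NA, NA, "World"),
  stringsAsFactors = FALSE
)

# Step 1: keep the first row for each track_id.
by_id <- songs[!duplicated(songs$track_id), ]

# Step 2: keep the first row for each track_name, but never drop a row
# whose name is missing.
keep    <- is.na(by_id$track_name) | !duplicated(by_id$track_name)
deduped <- by_id[keep, ]

print(dim(by_id))
print(dim(deduped))
```

Comparing the two dimensions shows how many rows shared a name under different ids.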

Converting Categorical Variables to Factors:

I want to convert genre, sub genre, key, and mode into factors. This will aid data analysis in the future.

Spotify_songs <- Spotify_songs %>%
  mutate(playlist_genre = as.factor(playlist_genre),
         playlist_subgenre = as.factor(playlist_subgenre),
         mode = as.factor(mode),
         key = as.factor(key))

Creating New Variables:

Having the popularity separated into groups will be useful in the future for either classification or clustering.

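# Initial quartile grouping; the case_when() below overwrites it with fixed bands.
Spotify_songs$popularity_group <- as.character(ntile(Spotify_songs$track_popularity,4))

Spotify_songs <- Spotify_songs %>% 
  mutate(popularity_group = case_when(
    ((track_popularity > 0) & (track_popularity < 20)) ~ "1",
    ((track_popularity >= 20) & (track_popularity < 40))~ "2",
    ((track_popularity >= 40) & (track_popularity < 60)) ~ "3",
    # Everything else lands here: scores of 60 and above, and also scores of
    # exactly 0, because the first condition uses a strict > 0.
    TRUE ~ "4")
    )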
table(Spotify_songs$popularity_group)

## 
##    1    2    3    4 
## 3225 5211 7552 7466

Data Preview

In this data set, each row represents a unique song, and the columns give us information about the songs.

Data Description

Here, I will provide a table with the variable names, types, and descriptions.

Variable Name	Data Type	Variable Description
track_id	character	Song unique ID
track_name	character	Song Name
track_artist	character	Song Artist
track_popularity	numeric	Song Popularity (0-100) where higher is better
track_album_id	character	Album unique ID
track_album_name	character	Song album name
track_album_release_date	character	Date when album released
playlist_name	character	Name of playlist
playlist_id	character	Playlist ID
playlist_genre	factor	Playlist genre
playlist_subgenre	factor	Playlist subgenre
danceability	numeric	Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	numeric	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key	factor	The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness	numeric	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode	factor	Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness	numeric	Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness	numeric	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness	numeric	Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness	numeric	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence	numeric	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo	numeric	The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms	numeric	Duration of song in milliseconds
popularity_group	character	Popularity group ("1" to "4") derived from track_popularity

Exploratory Data Analysis

Genre Popularity

First, I wanted to explore the popularity of different genres. I found that Pop, Rap, Latin, Rock, R&B, and EDM were the genres from most to least popular.

Genre Popularity by Release Date

I then wanted to visualize how the popularity of each genre changes with respect to its release date. The genres which appear to be experiencing the most significant increase of attention from the year 2000 to 2020 were rap, R&B, and, to a lesser extent, Latin music. Notably, EDM music reached its pinnacle in 2005 and appears to be making a return.

Major and Minor Key Proportion

The musician in me was curious to see how often the major and minor keys were used for each genre. I was expecting to see the major key more often, but I was surprised to see that it captured the majority for every single genre. Pop and rock especially are very likely to be produced in the major key.

Artists - Most Hits by Decade

Here, I wanted to find out the number of hit songs released by each artist per decade. I’ve enjoyed music from the 1950’s and on so it was interesting for me to see what the most popular artists from each decade are according to this data set. I’m happy to see The Beatles and Queen performing well in their respective decades!

Popularity Scores

I decided to start exploring the popularity scores by plotting a histogram. Upon doing this, I noticed that a large portion of the songs are not popular. There are very few songs with a popularity greater than 75. The top 25% of the score range captures 8% of our observations, but the bottom 1% alone also captures 8%, so the popularity scores are quite unbalanced on the low end. The mean popularity score is about 39.

Variable Correlation

Here is a list of all the correlations between our independent and dependent numeric variables sorted by significance.

Correlation Heat Map

I wanted to look at a correlation heat map of this data to visualize any large correlations between the independent variables and track popularity. Upon inspection, it does seem we have a few notable correlations. Loudness-energy has a strong positive correlation, while loudness-acousticness and acousticness-energy have moderate negative correlations. The rest are below 0.2 in absolute value.
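A sorted list of pairwise correlations can be built from the matrix returned by `cor()`. The sketch below uses simulated columns that merely borrow the audio feature names; the values are random and are not the Spotify data, and `cor_pairs` is a hypothetical name.

```r
set.seed(1)
n <- 200
energy <- runif(n)

# Simulated features (assumption: arbitrary relationships chosen for illustration).
toy <- data.frame(
  energy       = energy,
  loudness     = -20 + 15 * energy + rnorm(n, sd = 2),
  acousticness = 1 - energy + rnorm(n, sd = 0.3),
  tempo        = rnorm(n, mean = 120, sd = 20)
)

# Pearson correlation matrix.
cors <- cor(toy)

# Flatten the upper triangle into one row per variable pair.
idx <- which(upper.tri(cors), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = cors[idx]
)

# Sort by strength of correlation, strongest first.
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs)
```

The same matrix `cors` is what a heat map function such as `ggcorrplot()` would take as input.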

Popularity Prediction Using Regression

Profit Function

I developed a simple prototype profit function. Basically it works like this:
- If a song is predicted to be in the first quartile of popularity amongst the pool, then do not invest any money.
- If a song is predicted to be in the second, third, or fourth quartile of popularity amongst the pool, then invest $1,800, $5,400 and $11,000 respectively.
- If the song's actual popularity is in the second, third, or fourth quartile of popularity among the pool, then you make $2,000, $6,000, and $12,000, respectively, but only if you invested.
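The rules above can be sketched as a small R function. This is only an illustration, not the function used to produce the profit figures in this report. The names `invest_by_quartile`, `revenue_by_quartile`, and `profit_sketch` are hypothetical, and the sketch assumes that a song whose actual popularity falls in the first quartile earns nothing, which the rules do not state.

```r
# Investment by predicted quartile (1 to 4), taken from the rules above.
invest_by_quartile <- c(0, 1800, 5400, 11000)

# Revenue by actual quartile (1 to 4). The first entry is an assumption.
revenue_by_quartile <- c(0, 2000, 6000, 12000)

# pred_q and actual_q are integer vectors of quartiles, one entry per song.
profit_sketch <- function(pred_q, actual_q) {
  invested <- invest_by_quartile[pred_q]
  # Revenue is only earned on songs we actually invested in.
  revenue <- ifelse(invested > 0, revenue_by_quartile[actual_q], 0)
  sum(revenue - invested)
}

# Three songs: one skipped, one predicted correctly, one badly overrated.
example_profit <- profit_sketch(pred_q = c(1, 2, 4), actual_q = c(4, 2, 1))
print(example_profit)
```

In this example the skipped song contributes nothing, the correct prediction earns $200, and the overrated song loses the full $11,000 investment, so one bad bet on the top quartile outweighs many small wins.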

Multiple Linear Regression

Train/Test Split

I’m going to start by modeling this data with a linear regression to try to predict the popularity of songs. I will use the standard procedure of splitting the data set into a training set and a test set with a 70% to 30% split, respectively. Then I will perform variable selection and fit the training data to a linear model. I will be using forward, backward, and LASSO variable selection techniques later on.
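A 70/30 split can be sketched with `sample()`. The data frame `toy` and the seed are placeholders chosen for illustration, not the values used in this analysis.

```r
set.seed(42)  # assumption: any fixed seed, used only for reproducibility

# Toy data standing in for the song table.
toy <- data.frame(id = 1:100, y = rnorm(100))

# Randomly pick 70% of the row indices for training.
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

print(c(train = nrow(train), test = nrow(test)))
```

Indexing with `-train_idx` guarantees that no row appears in both sets.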

Below are the results from fitting the full model, using all relevant continuous and categorical variables, on the training set.

## 
## Call:
## lm(formula = track_popularity ~ danceability + energy + loudness + 
##     speechiness + acousticness + instrumentalness + liveness + 
##     valence + tempo + duration_ms + key + mode + playlist_genre, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.276 -16.016   3.035  17.158  63.369 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.521e+01  2.320e+00  23.794  < 2e-16 ***
## danceability         1.065e+01  1.481e+00   7.189 6.79e-13 ***
## energy              -2.246e+01  1.583e+00 -14.186  < 2e-16 ***
## loudness             1.274e+00  8.327e-02  15.298  < 2e-16 ***
## speechiness         -3.268e+00  1.895e+00  -1.724  0.08467 .  
## acousticness         4.723e+00  9.378e-01   5.036 4.80e-07 ***
## instrumentalness    -6.705e+00  8.132e-01  -8.245  < 2e-16 ***
## liveness            -2.442e+00  1.149e+00  -2.126  0.03349 *  
## valence             -8.001e-01  8.719e-01  -0.918  0.35883    
## tempo                2.983e-02  6.643e-03   4.490 7.16e-06 ***
## duration_ms         -3.783e-05  2.930e-06 -12.910  < 2e-16 ***
## key1                 5.063e-01  7.310e-01   0.693  0.48860    
## key2                -6.390e-01  7.931e-01  -0.806  0.42044    
## key3                -8.053e-03  1.192e+00  -0.007  0.99461    
## key4                -2.030e-01  8.636e-01  -0.235  0.81415    
## key5                 3.836e-01  8.218e-01   0.467  0.64068    
## key6                 3.646e-01  8.330e-01   0.438  0.66165    
## key7                -9.224e-01  7.597e-01  -1.214  0.22471    
## key8                 8.560e-01  8.321e-01   1.029  0.30366    
## key9                -7.544e-01  7.826e-01  -0.964  0.33507    
## key10                6.295e-01  8.587e-01   0.733  0.46348    
## key11               -2.153e-01  7.947e-01  -0.271  0.78645    
## mode1                7.823e-01  3.721e-01   2.103  0.03552 *  
## playlist_genrelatin  7.718e+00  6.829e-01  11.301  < 2e-16 ***
## playlist_genrepop    1.267e+01  6.327e-01  20.029  < 2e-16 ***
## playlist_genrer&b    2.357e+00  7.197e-01   3.274  0.00106 ** 
## playlist_genrerap    7.913e+00  6.572e-01  12.041  < 2e-16 ***
## playlist_genrerock   1.116e+01  7.311e-01  15.266  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.27 on 16407 degrees of freedom
## Multiple R-squared:  0.09744,    Adjusted R-squared:  0.09596 
## F-statistic: 65.61 on 27 and 16407 DF,  p-value: < 2.2e-16

Visual Assessment For Full Model

When visualizing the performance of the full model without variable selection, it is clear that our model fit is far from optimal based on the spread of the residuals against the fitted values. Also, the error between our predicted and actual values, shown in the second chart, indicates that our predictions are not very accurate.

From the Q-Q plot, we can see that the residuals are not normally distributed. This is evident from the snaking line. Residuals in the quantiles around the mean fit the line well, but the extreme values deviate from it significantly.

Coefficient Assessment For Full Model

Based on estimate values, the attributes which most contribute to the prediction are:

Playlist Genre
Energy
Danceability
Instrumentalness

Based on estimate values, the attributes which least contribute to the prediction are:

Key
Tempo
Duration

Each coefficient estimates the change in predicted track popularity for a one-unit increase in that variable, holding the other variables fixed. Negative coefficients lower the predicted popularity and positive coefficients raise it.

Forward/Backward Stepwise Variable Selection

Forward and backward stepwise variable selection ended up choosing the same set of predictor variables.

playlist_genre
duration_ms
instrumentalness
energy
loudness
danceability
acousticness
tempo
liveness
mode
speechiness
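One common way to run forward and backward selection in R is `step()`, which adds or drops terms by AIC. The sketch below assumes that approach and uses the built-in `mtcars` data purely as a stand-in; it is not the code or data behind the list above.

```r
# Full model with every predictor, and an intercept-only model to start from.
full <- lm(mpg ~ ., data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)

# Backward selection: start from the full model and drop terms.
backward <- step(full, direction = "backward", trace = 0)

# Forward selection: start from the null model and add terms up to the full set.
forward <- step(
  null,
  scope = list(lower = ~ 1, upper = formula(full)),
  direction = "forward",
  trace = 0
)

# Compare the terms each direction kept.
print(attr(terms(backward), "term.labels"))
print(attr(terms(forward), "term.labels"))
```

The two directions are not guaranteed to agree in general, so comparing the retained terms is a useful check.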

Choosing Optimal Penalty For LASSO

I want to choose the optimal penalty parameter, the one which reduces the MSE of our predictions the most. Based on the chart, the MSE stays near its minimum as long as log(lambda) is low.
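A cross-validated search for the penalty can be sketched with `cv.glmnet()` from the glmnet package. The predictors and response below are simulated, so the selected lambda has no connection to the Spotify results.

```r
library(glmnet)

set.seed(1)
n <- 200

# Simulated predictors; only x1 and x2 actually drive the response.
x <- matrix(rnorm(n * 5), ncol = 5,
            dimnames = list(NULL, paste0("x", 1:5)))
y <- 3 * x[, 1] - 2 * x[, 2] + rnorm(n)

# alpha = 1 selects the LASSO penalty; cv.glmnet runs k-fold cross-validation
# over a grid of lambda values.
cv_fit <- cv.glmnet(x, y, alpha = 1)

# Lambda with the lowest cross-validated MSE, and the largest lambda whose
# error is within one standard error of that minimum.
print(cv_fit$lambda.min)
print(cv_fit$lambda.1se)

# Coefficients at the more conservative choice; dots mark dropped variables.
print(coef(cv_fit, s = "lambda.1se"))
```

Calling `plot(cv_fit)` draws the MSE against log(lambda), which is the kind of chart described above.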

LASSO Chosen Coefficients

LASSO chose a similar set of variables to stepwise regression, but it left a few additional variables out of the picture. The additional variables it dropped were mode, speechiness, and a few of the specific genre categories.

## 31 x 1 sparse Matrix of class "dgCMatrix"
##                                 1
## (Intercept)          6.153480e+01
## playlist_genreedm   -7.513109e+00
## playlist_genrelatin  .           
## playlist_genrepop    3.591696e+00
## playlist_genrer&b   -4.499203e+00
## playlist_genrerap    .           
## playlist_genrerock   7.074539e-02
## danceability         3.561999e+00
## energy              -1.508254e+01
## key0                 .           
## key1                 .           
## key2                 .           
## key3                 .           
## key4                 .           
## key5                 .           
## key6                 .           
## key7                 .           
## key8                 .           
## key9                 .           
## key10                .           
## key11                .           
## loudness             7.740227e-01
## mode0                .           
## mode1                .           
## speechiness          .           
## acousticness         2.522423e+00
## instrumentalness    -5.777941e+00
## liveness            -8.977511e-01
## valence              .           
## tempo                3.474570e-03
## duration_ms         -3.356553e-05

Model Validation

Here I want to look at the in-sample MSE, out-of-sample MSE, and profit for the full model, stepwise, and LASSO regression models.

The in-sample and out-of-sample MSEs are very similar across the full, stepwise, and LASSO models.

	In-Sample	Out of Sample	Profit
Full_Model_MSE	495.0772	493.4970	524000
Step_b_MSE	495.4129	493.3253	852000
Step_f_MSE	495.4129	493.3253	852000
LASSO_MSE	495.2913	493.3668	660000
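For reference, in-sample and out-of-sample MSE can be computed as below. The helper `mse()` and the simulated data are assumptions made for illustration; the figures in the table come from the real train and test sets.

```r
# Mean squared error between actual and predicted values.
mse <- function(actual, predicted) mean((actual - predicted)^2)

set.seed(7)
toy <- data.frame(x = rnorm(100))
toy$y <- 2 * toy$x + rnorm(100)

train <- toy[1:70, ]
test  <- toy[71:100, ]

fit <- lm(y ~ x, data = train)

# In-sample: error on the data the model was fit to.
in_sample_mse <- mse(train$y, fitted(fit))

# Out-of-sample: error on data the model has not seen.
out_sample_mse <- mse(test$y, predict(fit, newdata = test))

print(c(in_sample = in_sample_mse, out_of_sample = out_sample_mse))
```

When the two values are close, as in the table above, the model is not over-fitting the training data, even if both errors are large.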

Undersampling

From earlier we noticed that our track popularity was unbalanced: the majority of tracks are unpopular. To try to improve the linear regression model, I will use under-sampling. This balances the popular and unpopular groups so that the regression gives more weight to the popular songs we care about.

I’m going to approach this task by taking all the samples from popular tracks and then randomly sampling an equal amount of observations from the unpopular category. Since the cutoff for popular and unpopular songs is arbitrarily chosen, I will choose multiple cutoffs and try linear regression on each.
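The under-sampling step can be sketched as a small helper. The function name `undersample` is hypothetical, and treating scores at or above the cutoff as popular is an assumption, since the text does not say which side of the cutoff is inclusive.

```r
# Keep every popular row and an equally sized random sample of unpopular rows.
undersample <- function(df, cutoff) {
  popular   <- df[df$track_popularity >= cutoff, , drop = FALSE]
  unpopular <- df[df$track_popularity <  cutoff, , drop = FALSE]
  n_keep    <- min(nrow(popular), nrow(unpopular))
  sampled   <- unpopular[sample(nrow(unpopular), n_keep), , drop = FALSE]
  rbind(popular, sampled)
}

set.seed(3)

# Toy data: 80 unpopular songs and 20 popular ones.
toy <- data.frame(track_popularity = c(rep(10, 80), rep(80, 20)))

balanced <- undersample(toy, cutoff = 70)
print(table(balanced$track_popularity >= 70))
```

Running the helper over several cutoffs, then refitting the regression on each balanced set, mirrors the comparison described above.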

The R squared increases as the threshold rises. This could be a good indicator, but we should see how well the model performs with regard to MSE to determine if this is actually improving the model.

	model_r.squared	model_pvalue
Original model	0.0620000	0
Cutoff - 45	0.1022451	0
Cutoff - 50	0.1091899	0
Cutoff - 55	0.1254122	0
Cutoff - 60	0.1452157	0
Cutoff - 65	0.1712999	0
Cutoff - 70	0.2096549	0

Model Validation With Undersampling

Even though the R squared increases when under-sampling at a higher threshold, it leads to higher MSE, especially out of sample. This is not good, but our profits increased, which is what matters most.

Below is a table of MSE using an under-sampling threshold of 70. Every variation of the threshold still led to worse predictive power. Notably, the different variable selection techniques are still performing very close to each other.

We are going to have to try out some different models to see if we can improve our predictions.

	In-Sample	Out-of-Sample	Profit
Full_Model_under_MSE	507.2579	660.1027	1064000
Step_b_under_MSE	508.8740	658.8303	1116000
Step_f_under_MSE	508.8740	658.8303	1116000
LASSO_under_MSE	508.3991	659.5437	1068000

Random Forest

The main idea of random forest is to select m out of p variables as candidates for each split in each decision tree generated. The purpose of doing this is to decorrelate the trees, which reduces the variance of the averaged prediction.
Below I have fit the data to the random forest algorithm using the under-sampled data set.
I want to observe how the error rate relates to the number of trees.
- The out-of-bag error rate tends to decline as I increase the number of trees, as expected.
- The decline tapers off around 100, so I will use 120 trees to build the model.
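The out-of-bag error curve can be read directly from a fitted `randomForest` regression object. The sketch below uses the built-in `mtcars` data as a stand-in, so the shape of the curve will differ from the one described above.

```r
library(randomForest)

set.seed(10)

# Regression forest on a stand-in data set.
rf <- randomForest(mpg ~ ., data = mtcars, ntree = 200, importance = TRUE)

# For regression, rf$mse holds the out-of-bag MSE after 1, 2, ..., ntree trees.
oob_mse <- rf$mse

plot(oob_mse, type = "l",
     xlab = "Number of trees", ylab = "Out-of-bag MSE")

# Number of candidate variables tried at each split.
print(rf$mtry)
```

Picking the tree count where the curve flattens keeps the model small without giving up accuracy.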

Choosing Number of Candidate Variables

Here I graph the error rate against the number of variables that will be randomly selected for each split in each tree.
- Based on the plot, using around 4 variables is sufficient to minimize the testing error.
- This is about equal to the default setting for regression in randomForest, which is one third of the number of predictors, rounded down (the square-root rule is the default for classification).

## 1  2  3  4  5  6  7  8  9  10  11  12  13

Building Model

We can build our model using the parameters acquired from our optimality search.
- 120 Trees.
- ~ 3-4 Candidate Variables.

## 
## Call:
##  randomForest(formula = track_popularity ~ ., data = train[, -c(14)],      importance = TRUE, ntree = 120) 
##                Type of random forest: regression
##                      Number of trees: 120
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 496.7592
##                     % Var explained: 9.44

Model Validation

Just like for Multiple Linear Regression, I want to see the in-sample error, out-of-sample error, and profit.
Our out-of-sample MSE improved slightly for the full model and, happily, profits for both random forest models increased compared to the regression models.

	In-Sample	Out-of-Sample	Profit
Spotify_rf_MSE	496.7592	490.8139	1248000
Spotify_undersampled_rf_MSE	507.1951	629.3525	1832000

Boosted Trees

Boosting is similar to bagging in that it combines many trees, but it builds them sequentially rather than independently. Each new tree is fit to the errors left by the trees before it, so observations that were predicted poorly are given more attention in subsequent trees.
Interestingly, the relative influence of our variables using this model is different from that in our linear regression models.

##                               var    rel.inf
## key                           key 15.3262726
## playlist_genre     playlist_genre 13.3735598
## instrumentalness instrumentalness  9.1094248
## duration_ms           duration_ms  8.5271447
## loudness                 loudness  7.4915693
## energy                     energy  7.1776526
## tempo                       tempo  7.1092586
## acousticness         acousticness  6.9209974
## speechiness           speechiness  6.8783594
## danceability         danceability  6.8293068
## valence                   valence  5.6309892
## liveness                 liveness  5.2902582
## mode                         mode  0.3352065

Optimal Trees

I want to do a search for the optimal number of trees.
Choosing too many trees can result in over-fitting.
Below is a plot of the MSE when using varying amounts of trees.
The optimal number of trees is around 510.

Model Validation

Just like for Multiple Linear Regression and Random Forest, I want to see the in-sample error, out-of-sample error, and profit.
The in-sample MSE was much lower than for our previous models. Even better, the out-of-sample MSE for the full model improved as well.
Profits made a huge jump using the boosted trees vs. the other models. The full model boasts a predicted profit of $2,472,000.

## Using 510 trees...

	In-Sample	Out-of-Sample	Profit
Spotify_boost_MSE	462.1989	477.3446	2472000
Spotify_undersampled_boost_MSE	392.2046	643.7124	2152000

Support Vector Regression

Support Vector Machines are most commonly used as classifiers, but they can also be used in regression for predicting continuous numerical values.
How it works:
- SVM works by constructing a hyperplane which separates classes while creating the largest margin between classes.
- The logic behind this is that larger margins generally imply a lower error when generalized to data not used to train the model.
- Maximizing the margin is set up as an optimization problem where the orthogonal vector to the hyperplane is a linear combination of the training points.
- Support Vector Regression differs from support vector classification. Rather than optimizing under the condition that all examples are classified correctly (which is made possible by slack), SVR optimizes under the condition that each y value deviates less than epsilon (Required Accuracy) from the hyperplane.
- Kernels can be used to capture many types of non-linear boundaries.
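For reference, the points above correspond to the standard epsilon-insensitive SVR problem, written here in its usual textbook form. The symbols are not taken from this report: w is the vector orthogonal to the hyperplane, b the offset, phi the feature map induced by the kernel, C the cost, and xi, xi* the slack variables.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^{*}} \quad & \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^{*} \right) \\
\text{subject to} \quad & y_i - \left( w^{\top}\phi(x_i) + b \right) \le \varepsilon + \xi_i, \\
& \left( w^{\top}\phi(x_i) + b \right) - y_i \le \varepsilon + \xi_i^{*}, \\
& \xi_i \ge 0, \quad \xi_i^{*} \ge 0, \qquad i = 1, \dots, n.
\end{aligned}
```

Points whose prediction error is within epsilon cost nothing, while a larger C makes each unit of slack more expensive, which is why the two hyper-parameters are tuned together.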

Tuning - Grid Search

For SVR we need to tune our 2 hyper-parameters.
1. Epsilon (Required Accuracy)
2. Cost (Weight given to slack in optimization)
- When we choose a higher cost value, we allow less slack.

# tuneResult <- tune(svm, track_popularity ~.,  data = Undersampled_train,
#               ranges = list(epsilon = seq(0.5,0.7,0.02), cost = 2^(2:7))
# )

Building the Model

From our tuning:
1. Epsilon = 0.58
2. Cost = 4

## 
## Call:
## svm(formula = track_popularity ~ ., data = train, cost = 4, gamma = 1/length(train), 
##     epsilon = 0.58, probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  4 
##       gamma:  0.07142857 
##     epsilon:  0.58 
## 
## Sigma:  0.7847259 
## 
## 
## Number of Support Vectors:  9361

Model Validation

Again just like for the other models, I want to see the in-sample error, out-of-sample error, and profit.

	In-Sample	Out-of-Sample	Profit
Spotify_svm_MSE	377.4687	510.1713	2400000
Spotify_undersampled_svm_MSE	315.9451	658.7301	1412000

Results and Analysis

There was a clear overall winner. Boosted trees had the best out-of-sample MSE at 477 and the highest profit, $2,472,000, on the test data set. This might not remain the case if we were to add more data or attributes, or change the cost function. As the cost function changes, the relative profits could change significantly.
I found it interesting that under-sampling always increased the out-of-sample MSE, but only sometimes increased the profits. Profits rose for the regression algorithms and random forest, probably because the data is heavily influenced by songs that never gain traction and never get popular. Some of these songs have such low popularity for reasons that are not contingent upon their attributes, or they simply have not had time to get attention. This is a problem when we value being able to predict popular songs. The larger MSE with under-sampling is not necessarily a bad thing, because we value being able to predict popular songs even at the expense of being far off sometimes.
It actually makes sense that Boosted Trees and SVR were the best-performing algorithms, given how they work.
- Boosted trees fit each new tree to the errors left by the previous trees. Because of the bias toward low track popularity in the data set, the early trees should predict popular songs more erroneously than other tracks. Later trees then concentrate on those songs, which corrects for the bias toward low popularity.
- SVR is one of the more powerful machine learning algorithms, and it also does very well at capturing non-linear boundaries. It also heavily penalizes values outside epsilon, so it still puts a lot of relative value on predicting the high end of track popularity. I would be curious to see how polynomial regression performs.

##                                      In-Sample Out-of-Sample  Profit
## Boosted Trees                         462.1989      477.3446 2472000
## SVR                                   377.4687      510.1713 2400000
## Boosted Trees Undersampled            392.2046      643.7124 2152000
## Random Forest Undersampled            507.1951      629.3525 1832000
## SVR Undersampled                      315.9451      658.7301 1412000
## Random Forest                         496.7592      490.8139 1248000
## LR Undersampled - Stepwise Selection  508.8740      658.8303 1116000
## LASSO Undersampled                    508.3991      659.5437 1068000
## LR Undersampled - Full Model          507.2579      660.1027 1064000
## LR - Stepwise Selection               495.4129      493.3253  852000
## LASSO                                 495.2913      493.3668  660000
## LR - Full Model                       495.0772      493.4970  524000

Clustering For Prototypical Songs

K-Means Clustering

The Algorithm

K-Means Algorithm

K-Means Clustering is an unsupervised clustering algorithm.
In this case, however, I want to use it as a supervised learning algorithm for 2 purposes.
1. I want to be able to predict genres of new songs.
2. I want to see what attributes are characteristic of the different genres.
Classification is possible because we can choose the same K value as the number of genres and classify songs based on the dominant genre of the most proximal cluster.
How it works (in a nutshell):
- The K-Means algorithm works by choosing K points a priori. These will serve as the starting cluster centroids, and ideally these points will be far apart.
- Each point is assigned to the cluster whose centroid is closest to it.
- After each round of assignments, the centroids are re-computed.
- The assignment and update steps repeat until the assignments stop changing (or a maximum number of iterations is reached), at which point the algorithm concludes.
- The algorithm aims to minimize the sum of squared distances between points and their respective cluster centroids.
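
The steps above can be sketched as a short Lloyd-style loop. This is only an illustration on simulated data. `simple_kmeans` is a hypothetical function name, and R's built-in `kmeans()` uses the Hartigan-Wong algorithm by default, which refines this basic scheme.

```r
# Minimal Lloyd-style K-means sketch (illustrative only)
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: choose K starting centroids at random from the data
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared distance from every point to every centroid (n x k matrix)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    new_assignment <- max.col(-d, ties.method = "first")
    # Stop when no point changes cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Step 3: recompute each centroid as the mean of its assigned points
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  # Objective being minimized: total within-cluster sum of squares
  wss <- sum((x - centroids[assignment, , drop = FALSE])^2)
  list(cluster = assignment, centers = centroids, tot.withinss = wss)
}

set.seed(42)
# Two simulated blobs in two dimensions (not the Spotify data)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

Because the starting centroids are random, different seeds can give different local optima, which is the motivation for running several starts and keeping the best one.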

K-Means - Genre Classification


I am curious to see if K-means will find 6 clusters that are each characterized by a specific genre.
This would indicate that the genres have a fairly well defined boundary with respect to each other, based on the song attributes.

Scaling

I will start by standardizing the data so that every numeric variable has a mean of 0 and a standard deviation of 1. This helps the algorithm, because K-means relies on distances; without scaling, variables measured on larger scales would have more inherent influence on the procedure.

train_K <- train
train_scale <- scale(train[,-c(2,5,7,14)])
test_scale <- scale(test[,-c(1,2,3,5,6,7,8,9,10,11,14,16,24)])
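
With its defaults, `scale()` centers each column to mean 0 and rescales it to standard deviation 1. The sketch below shows this on simulated data, along with one common alternative to scaling the test set independently: reusing the training means and standard deviations, so both sets share the same units. The `toy_*` names and column choices are assumptions for illustration, not the columns selected above.

```r
set.seed(1)
# Simulated numeric attributes standing in for the song features
toy_train <- data.frame(tempo = rnorm(100, 120, 25), loudness = rnorm(100, -7, 3))
toy_test  <- data.frame(tempo = rnorm(40, 120, 25),  loudness = rnorm(40, -7, 3))

# Standardize the training set: each column gets mean 0 and sd 1
toy_train_scale <- scale(toy_train)

# Reuse the training means and standard deviations for the test set
toy_test_scale <- scale(toy_test,
                        center = attr(toy_train_scale, "scaled:center"),
                        scale  = attr(toy_train_scale, "scaled:scale"))

colMeans(toy_train_scale)        # approximately 0 for every column
apply(toy_train_scale, 2, sd)    # 1 for every column
```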

Fitting K-Means

We have 6 genres, so I will choose to have 6 centroids. Setting nstart to 25 runs the algorithm from 25 random initial configurations and keeps the best result. This gives the algorithm a good chance of capturing the clusters as well as possible.

##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  3555997 190.0    6596764 352.4  6302146 336.6
## Vcells 21790322 166.3   67024115 511.4 67024115 511.4

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 821750)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 821750)

Cluster	Size
Cluster 1	1340
Cluster 2	1046
Cluster 3	2388
Cluster 4	4072
Cluster 5	5290
Cluster 6	2299
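
A fit and size table of this kind could be produced along the following lines. This is a sketch on simulated data: `toy_scale` and `km6` are hypothetical names, and raising `iter.max` above its default of 10 is my assumption rather than a setting taken from the original fit. The Quick-TRANSfer warnings come from the default Hartigan-Wong algorithm; passing `algorithm = "Lloyd"` or `algorithm = "MacQueen"` is a commonly suggested workaround, though it is untested on this data.

```r
set.seed(123)
# Simulated, already-scaled attributes standing in for train_scale
toy_scale <- scale(matrix(rnorm(600 * 4), ncol = 4))

# centers = 6 clusters; nstart = 25 random starts, keeping the fit with the
# lowest total within-cluster sum of squares
km6 <- kmeans(toy_scale, centers = 6, nstart = 25, iter.max = 50)

# Cluster sizes, in the same layout as the table above
data.frame(Cluster = paste("Cluster", seq_along(km6$size)), Size = km6$size)
```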

Visualization of Clusters

I want to try to visualize the cluster separation by projecting the attributes in 2 dimensions and highlighting the clusters. This is helpful for picturing how well our clusters are being separated.

It seems that K-means is doing an okay job of clustering, since boundaries are discernible. There is some significant overlap between clusters, however.
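
One common way to get such a 2-dimensional view is to project the scaled attributes onto their first two principal components and color the points by cluster. The sketch below does this with base R on simulated data; it is an assumption about the approach, and the original plot may have been produced with a different package.

```r
set.seed(7)
# Simulated scaled attributes with three loose groups (not the Spotify data)
toy <- scale(rbind(matrix(rnorm(300, 0),  ncol = 3),
                   matrix(rnorm(300, 3),  ncol = 3),
                   matrix(rnorm(300, -3), ncol = 3)))
km <- kmeans(toy, centers = 3, nstart = 25)

# Project the attributes onto the first two principal components
pca <- prcomp(toy)
coords <- pca$x[, 1:2]

# Color each point by its cluster assignment
plot(coords, col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "K-means clusters in PCA space")
```

Overlap in this projection does not necessarily mean overlap in the full attribute space, since only two components are shown.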

Model Validation - Classifier

I am now going to use the clustering results and try to associate each cluster with a genre based on the most common genre in each.
Unfortunately, the clusters did not seem to separate by genre distinctively. This is indicated by a high cluster entropy. The clusters are not always dominated by one particular genre, and in some cases, a class is the dominant genre in multiple clusters. Rather than trying to figure out how to make this work as a genre classifier, I believe it may be best to not continue trying to use K means as a classifier for genre.

## `summarise()` regrouping output by 'Cluster' (override with `.groups` argument)
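
To make the idea of a dominant genre and cluster entropy concrete, here is a minimal sketch with simulated labels. The column names and the random assignments are assumptions for illustration, so the numbers say nothing about the actual clusters.

```r
set.seed(11)
# Simulated cluster labels and genres (hypothetical stand-ins)
genres <- c("pop", "rap", "rock", "latin", "r&b", "edm")
toy <- data.frame(Cluster = sample(1:6, 500, replace = TRUE),
                  genre   = sample(genres, 500, replace = TRUE))

# Rows are clusters, columns are genres
counts <- table(toy$Cluster, toy$genre)
props  <- prop.table(counts, margin = 1)

# Dominant genre of each cluster and its share of that cluster
dominant <- colnames(props)[max.col(props, ties.method = "first")]
share    <- apply(props, 1, max)

# Shannon entropy (in bits) of each cluster's genre mix:
# 0 means a single genre, log2(6) means all six genres equally mixed
entropy <- apply(props, 1, function(p) -sum(p[p > 0] * log2(p[p > 0])))

data.frame(Cluster = rownames(props), dominant,
           share = round(share, 2), entropy = round(entropy, 2))
```

A useful classifier would show shares near 1, entropies near 0, and a different dominant genre in every cluster.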

Unsupervised K-Means Clustering

Unsupervised K-Means

Our attempt at finding genres as clusters was unsuccessful with K-Means. I still want to see what agglomerations are to be discovered with no supervision.
I want the data to tell me how many clusters are hidden in the data. To do this, I will use the elbow method.

The Elbow Method:
- This works by calculating the total within-cluster sum of squares for different numbers of clusters.
- The ideal number of clusters is chosen at the elbow on the chart.
- We want to have compact clusters (low within-cluster sum of squares), but not artificially compact ones (the extreme case being every point as its own cluster).
I tried variable selection based on the results I got from stepwise variable selection, LASSO, and the correlation heat map. The variables I experimented with removing were Valence, Speechiness, and Loudness, using different combinations.
I have decided to keep 9 variables by removing valence and speechiness. When I did this, it led to a prominent elbow at 3 clusters. This is promising so I will move forward with this set of variables and use 3 clusters.
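
A minimal sketch of the elbow computation follows, using simulated data with three groups. The range of K values, `nstart`, and `iter.max` are assumptions, not the settings used for the chart in this analysis.

```r
set.seed(2020)
# Simulated scaled data with three groups (not the Spotify data)
toy <- scale(rbind(matrix(rnorm(400, 0), ncol = 2),
                   matrix(rnorm(400, 6), ncol = 2),
                   cbind(rnorm(200, 0), rnorm(200, 6))))

# Total within-cluster sum of squares for K = 1..10
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25, iter.max = 50)$tot.withinss
})

# The "elbow" is the K after which adding clusters stops reducing WSS sharply
plot(k_values, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```

The within-cluster sum of squares keeps falling as K grows, so the elbow is read off by eye as the point where the curve flattens, rather than as a minimum.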

3 Cluster Model

This time we have 3 clusters. One large, one medium, one small.

##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  3700933 197.7    6596764 352.4  6596764 352.4
## Vcells 22111515 168.7   67024115 511.4 67024115 511.4

Cluster	Size
Cluster 1	1513
Cluster 2	10681
Cluster 3	4241

Visualizing Clusters

We can again plot the clusters in 2 dimensions to visualize the clusters and their distinctness.
There is some overlap between clusters again, but they are fairly distinct and well separated.

Descriptive Statistics

One way of observing the characteristics of our clusters is to observe the mean values of the clustering attributes.
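
A per-cluster summary of this kind can be computed with `aggregate()`. The sketch below uses simulated attributes, and the column names are stand-ins rather than the full set of clustering variables.

```r
set.seed(5)
# Simulated attributes (hypothetical stand-ins for the clustering variables)
toy <- data.frame(energy = runif(300),
                  tempo = rnorm(300, 120, 25),
                  instrumentalness = runif(300))
km3 <- kmeans(scale(toy), centers = 3, nstart = 25)

# Mean of each attribute within each cluster, on the original (unscaled) scale
cluster_means <- aggregate(toy, by = list(Cluster = km3$cluster), FUN = mean)
cluster_means

# Per-cluster standard deviations show how narrow an attribute is in a cluster
cluster_sds <- aggregate(toy, by = list(Cluster = km3$cluster), FUN = sd)
cluster_sds
```

Reporting the means on the original scale keeps them interpretable (tempo in beats per minute, for example), even though the clustering itself ran on scaled data.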

Boxplots

The box plots offer a way of understanding how the clusters are distinct from each other. Each cluster has some characteristics that distinguish it from the others.
Cluster 1
- Small Cluster
- Low Track Popularity
- High Instrumentalness
- A High and very specific Tempo (Around 125)
  - The specificity is indicated by the low standard deviation compared to others
Cluster 2
- Largest Cluster
- High Loudness
- High Energy
- High Liveness
Cluster 3
- Mid-sized Cluster
- Low Energy
- Low Loudness
- Low Tempo
- High Acousticness

Results and Analysis

Cluster Result and Analysis

I wanted to use K-means for 2 purposes: to classify genre and to find clusters apparent in the data. Our classifier did not work well, because our data did not naturally separate into 6 clusters that were each dominated by a single, distinct genre. I could have inferred a genre associated with each cluster, based on its market share in each cluster and the count of songs of each genre, but in this case, it did not seem a proper use of the tool.

I then used K-means for unsupervised clustering and let the data tell me what the clusters are. The elbow method pointed to 3 clusters, and in my attribute analysis, I found each cluster is identifiable by a few attributes. Clusters 2 and 3 were quite popular and were characterized by being loud, energetic, and live sounding (Cluster 2) or soft, gentle, and acoustic (Cluster 3). Cluster 1 was not very popular, and its distinguishing characteristic was instrumentalness.
The final result from the clustering indicated to me that music on both ends of the energetic spectrum, and from different genres, can be popular, and that the prototypical songs based on the attributes are reducible beyond genre. What does not work very well for popularity is instrumental music. This is reminiscent of our insights from the linear regression, where instrumentalness had a strong negative association with popularity.

Summary

Insights:

Pop, Rap, and Latin are currently the most popular genre types on Spotify in descending order.
The genres which appear to be experiencing the most significant increase of attention from the year 2000 to 2020 were rap, R&B, and to a lesser extent with Latin music. EDM (Electronic Dance Music) reached its pinnacle in 2005 and appears to be making a gradual resurgence to former heights.
Every major genre was more likely to have its songs produced in the major key. Rock has the largest proportion with nearly 70% of songs being major.
Top Artists by Decade (Most Hit Songs):
- 1950’s - Elvis Presley
- 1960’s - The Beatles
- 1970’s - Fleetwood Mac and Queen
- 1980’s - AC/DC, Luis Miguel, and Michael Jackson
- 1990’s - Red Hot Chili Peppers
- 2000’s - Coldplay
- 2010’s - Billie Eilish
- 2020’s - Eminem (So far as of March 2020)
The distribution of popularity scores is unbalanced because most songs never seemed to make it off the ground. The bottom 1% and top 25% capture 8% of the popularity scores from the data.
Two notable pairs of independent variables have an issue with collinearity. There was a strong positive correlation between loudness and energy (Pearson correlation: 0.68) and a strong negative correlation between energy and acousticness (Pearson correlation: -0.55). Loudness & acousticness and danceability & valence also shared a fairly strong correlation.
Energy, instrumentalness, speechiness, and danceability were the strongest predictors of track popularity according to our regression models, while duration, tempo, and key held close to no predictive power.
Songs that are energetic and loud and songs that are quiet and gentle can both be popular. The quantitative attributes seem to be the greatest indicators of popularity. The most important is instrumentalness, which works to a song's detriment, based on our clustering evaluation.

Reflection and Future Work:

This project was exciting, because I got to explore and apply many new packages and techniques in R through statistical analysis and data visualization. I was happy with the way our track popularity predictions turned out. The first few models did not perform well compared to the boosted trees and SVM. I found it very interesting that under-sampling improved the efficacy of some models with profit estimations, but worsened it in others. This is largely due to the mechanics of particular algorithms.
The scope of what could be explored with this project is not nearly exhausted. There is still more to be explored with regard to variable/model selection, different under-sampling techniques, new algorithms, and optimization and hyper-parameter tuning. One immediate idea for variable selection is to use PCA to find the principal components for building the model, to fix any collinearity issues. I also would like to see what would happen if I simply normalized the track_popularity curve by dropping many of the low popularity songs. Finally, the algorithms I used predicted popularity as continuous values, but I would also like to see what our profits and accuracy would be using classification algorithms to predict popularity groups, such as multinomial regression, naive Bayes, SVM, and ensemble methods. At the very least, what we have now could serve as a useful tool for music industry professionals by utilizing machine learning in the prediction of hit songs. Also, the insights presented may offer something of value to those trying to produce popular songs by infusing qualities of popular songs into their music.