The aim of this analysis was to build a classifier that uses Spotify audio features to predict music genres. The algorithms selected for this purpose are Random Forest, K-Nearest Neighbor, and Boosted Trees. The evaluation metrics discussed are Precision, Recall, Area Under Curve (roc-auc), and Specificity.
All the packages required for this analysis can be loaded by running the code below. The relevant library calls are also repeated throughout the analysis for reference.
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(
readr,
tidyverse,
visdat,
psych,
GGally,
tidymodels,
workflows,
tune,
ranger,
xgboost,
vip,
hardhat)
1.1 Data description
The data at hand was extracted from the Spotify API via the R package spotifyr. The dataset can be loaded from the GitHub link provided below.
# load dataset from github account
library(readr)
spotify <- read.csv('https://raw.githubusercontent.com/karendelarosa/datasets/master/spotifyb.csv')
library(tidyverse)
spotify <- spotify %>%
mutate_if(is.integer, as.numeric)
spotify$genre <- factor(spotify$genre)
spotify <- spotify %>%
select(-X)
# inspect variable names
names(spotify)
## [1] "count" "artist" "album" "song"
## [5] "genre" "year" "popularity" "acousticness"
## [9] "danceability" "energy" "instrumentalness" "liveness"
## [13] "speechiness" "tempo" "valence" "key"
## [17] "mode" "duration"
As shown in the output, the dataset contains variables referencing information about the music such as artist name, album name, song name, song popularity score, and year of release. In addition, the dataset contains 12 audio feature variables that reflect music characteristics. The audio features in question usually fall into three major categories: a) confidence measures, b) perceptual measures, and c) music descriptors. A brief description for each of these audio features is provided below:
Sources:
https://developer.spotify.com/discover/
Spotify audio features
a) confidence measures
Acousticness: A measure from 0.0 to 1.0 that detects acoustic sounds in a track.
Liveness: A measure from 0.0 to 1.0 that detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
Speechiness: A measure from 0.0 to 1.0 that detects the presence of spoken words in a track. The more speech sounds are heard in the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the value will be.
Instrumentalness: A measure from 0.0 to 1.0 that reflects the extent to which a track does not contain vocalizations. The closer the instrumentalness value is to 1.0, the greater the likelihood that the track does not contain vocal content.
b) perceptual measures
Energy: A measure from 0.0 to 1.0 that reflects the intensity of a track. Energetic tracks are usually fast, loud, and noisy.
Loudness: A measure from -60 to 0 that represents the overall loudness of a track. This value is measured in decibels (dB). Loudness values are averaged across the entire track.
Danceability: A measure from 0.0 to 1.0 that indicates how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, and beat strength. A value closer to 0.0 indicates that a track is less danceable, and a value closer to 1.0 indicates that a track is more danceable.
Valence: A measure from 0.0 to 1.0 that reflects the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
c) music descriptors
Tempo: The overall estimated speed of a track measured in beats per minute (BPM).
Duration: The duration of a track measured in milliseconds.
Key: A measure from 0 to 11 that indicates the key of a track.
Mode: A measure that indicates the modality (major or minor) of a track (1 is major, and 0 is minor).
1.2 Data inspection and Variables
A plot and summary outputs were computed to inspect the dataset (the plot code takes about 30 seconds to run).
# see type of variables
str(spotify)
## 'data.frame': 21182 obs. of 18 variables:
## $ count : num 1 2 3 4 5 6 7 8 9 10 ...
## $ artist : chr "Bella Renee" "tyDi" "Electric Polar Bears" "My Bad" ...
## $ album : chr "Swimming Pools" "You Never Know" "Too Good To Lose" "Delusions - EP" ...
## $ song : chr "Swimming Pools" "You Never Know" "Too Good To Lose" "Viva La Vida" ...
## $ genre : Factor w/ 4 levels "latin","pop",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ year : num 2021 2021 2021 2021 2021 ...
## $ popularity : num 40 51 55 48 35 36 89 67 49 47 ...
## $ acousticness : num 0.27 0.0871 0.309 0.025 0.632 0.167 0.00226 0.198 0.0586 0.0273 ...
## $ danceability : num 0.629 0.621 0.674 0.548 0.663 0.724 0.721 0.404 0.738 0.616 ...
## $ energy : num 0.524 0.955 0.95 0.755 0.573 0.644 0.738 0.806 0.909 0.731 ...
## $ instrumentalness: num 1.25e-05 3.26e-03 3.14e-05 0.00 5.26e-04 0.00 4.41e-06 0.00 1.93e-02 0.00 ...
## $ liveness : num 0.05 0.296 0.319 0.0985 0.249 0.0867 0.118 0.114 0.364 0.215 ...
## $ speechiness : num 0.0407 0.0444 0.116 0.0552 0.0446 0.132 0.0403 0.0496 0.061 0.061 ...
## $ tempo : num 120 127 127.1 97.9 120.1 ...
## $ valence : num 0.508 0.424 0.47 0.659 0.292 0.787 0.637 0.112 0.711 0.618 ...
## $ key : num 4 0 11 11 0 1 7 2 9 0 ...
## $ mode : num 0 1 1 0 1 0 1 1 1 1 ...
## $ duration : num 202027 139173 187087 146939 224000 ...
# compute summary statistics
summary(spotify)
## count artist album song
## Min. : 1 Length:21182 Length:21182 Length:21182
## 1st Qu.: 5466 Class :character Class :character Class :character
## Median :11054 Mode :character Mode :character Mode :character
## Mean :11272
## 3rd Qu.:17177
## Max. :22900
## genre year popularity acousticness
## latin:4814 Min. :1962 Min. : 0.00 Min. :0.0000014
## pop :6200 1st Qu.:2006 1st Qu.: 31.00 1st Qu.:0.0193000
## rap :5549 Median :2016 Median : 51.00 Median :0.0912000
## rock :4619 Mean :2011 Mean : 45.56 Mean :0.1811030
## 3rd Qu.:2020 3rd Qu.: 65.00 3rd Qu.:0.2670000
## Max. :2021 Max. :100.00 Max. :0.9920000
## danceability energy instrumentalness liveness
## Min. :0.0000 Min. :0.0185 Min. :0.0000000 Min. :0.0167
## 1st Qu.:0.5620 1st Qu.:0.5870 1st Qu.:0.0000000 1st Qu.:0.0932
## Median :0.6760 Median :0.7100 Median :0.0000040 Median :0.1270
## Mean :0.6589 Mean :0.6956 Mean :0.0493000 Mean :0.1873
## 3rd Qu.:0.7700 3rd Qu.:0.8250 3rd Qu.:0.0008697 3rd Qu.:0.2440
## Max. :0.9880 Max. :0.9990 Max. :0.9850000 Max. :0.9880
## speechiness tempo valence key
## Min. :0.0000 Min. : 0.00 Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0406 1st Qu.: 97.92 1st Qu.:0.3650 1st Qu.: 2.000
## Median :0.0645 Median :120.00 Median :0.5490 Median : 5.000
## Mean :0.1182 Mean :121.51 Mean :0.5428 Mean : 5.289
## 3rd Qu.:0.1530 3rd Qu.:139.94 3rd Qu.:0.7290 3rd Qu.: 8.000
## Max. :0.9670 Max. :249.44 Max. :0.9920 Max. :11.000
## mode duration
## Min. :0.0000 Min. : 9560
## 1st Qu.:0.0000 1st Qu.:185924
## Median :1.0000 Median :215480
## Mean :0.5923 Mean :222603
## 3rd Qu.:1.0000 3rd Qu.:251064
## Max. :1.0000 Max. :747325
# plot variable distributions
library(visdat)
vis_dat(spotify)
Overall, the dataset contains 21,182 observations and 18 variables (see the summary and str outputs for reference). As shown in the plot, count, year, popularity, and the twelve audio features discussed are numerical variables. Moreover, genre is a categorical variable, and the identifier columns (artist, album, and song) are characters.
Some of the descriptor audio features (e.g. tempo, duration, and key) are measured at the ratio level and span much wider ranges than the 0.0 to 1.0 features. Thus, if any of these audio features are included as predictors in the classifier model, all the predictors should be normalized so that variables on wider scales do not dominate the estimated effects. Genre (the outcome variable) contains four categories: latin, pop, rap, and rock (note: genre names are not capitalized in this assessment so they remain consistent with the labels in the genre column of the dataset). These four categories are the classes to be predicted.
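To illustrate the scale differences, the exploratory sketch below (using the spotify data frame loaded above) compares the spread of two 0.0 to 1.0 audio features with the wider-range descriptors tempo and duration.
# exploratory sketch: compare the spread of 0.0-1.0 features with the
# wider-range descriptors (tempo in BPM, duration in milliseconds)
spotify %>%
  summarise(across(c(danceability, valence, tempo, duration),
                   list(min = min, max = max, sd = sd))) %>%
  glimpse()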
Missing data might disrupt some of the machine learning and statistical procedures that will be performed. Hence, it is important to check whether any observations are incomplete and, if so, decide how they should be handled. The code below counts missing values within each variable.
# look for missing data (just in case)
is.na(spotify) %>% colSums()
## count artist album song
## 0 0 0 0
## genre year popularity acousticness
## 0 0 0 0
## danceability energy instrumentalness liveness
## 0 0 0 0
## speechiness tempo valence key
## 0 0 0 0
## mode duration
## 0 0
Based on this output, the dataset does not contain missing data.
2.1 Descriptive statistics
The describe function (from the psych package) was used to compute descriptive statistics. Some insights can be drawn from the output below.
library(psych)
describe(spotify)
## vars n mean sd median trimmed mad
## count 1 21182 11271.69 6675.06 11054.50 11236.39 8674.69
## artist* 2 21182 3473.89 2027.23 3490.00 3473.39 2621.24
## album* 3 21182 6742.70 3924.36 6770.50 6760.93 5109.04
## song* 4 21182 8399.24 4840.98 8410.00 8400.66 6225.44
## genre* 5 21182 2.47 1.07 2.00 2.46 1.48
## year 6 21182 2011.17 12.14 2016.00 2013.58 7.41
## popularity 7 21182 45.56 25.06 51.00 47.02 23.72
## acousticness 8 21182 0.18 0.22 0.09 0.14 0.13
## danceability 9 21182 0.66 0.15 0.68 0.67 0.15
## energy 10 21182 0.70 0.17 0.71 0.71 0.17
## instrumentalness 11 21182 0.05 0.17 0.00 0.00 0.00
## liveness 12 21182 0.19 0.15 0.13 0.16 0.07
## speechiness 13 21182 0.12 0.12 0.06 0.09 0.05
## tempo 14 21182 121.51 28.79 120.00 119.63 30.77
## valence 15 21182 0.54 0.23 0.55 0.55 0.27
## key 16 21182 5.29 3.63 5.00 5.25 4.45
## mode 17 21182 0.59 0.49 1.00 0.62 0.00
## duration 18 21182 222603.14 58122.62 215480.00 218282.48 47541.05
## min max range skew kurtosis se
## count 1.00 22900.00 22899.00 0.05 -1.23 45.86
## artist* 1.00 6972.00 6971.00 0.00 -1.21 13.93
## album* 1.00 13441.00 13440.00 -0.03 -1.23 26.96
## song* 1.00 16774.00 16773.00 0.00 -1.21 33.26
## genre* 1.00 4.00 3.00 0.05 -1.24 0.01
## year 1962.00 2021.00 59.00 -1.63 2.15 0.08
## popularity 0.00 100.00 100.00 -0.55 -0.76 0.17
## acousticness 0.00 0.99 0.99 1.52 1.67 0.00
## danceability 0.00 0.99 0.99 -0.50 -0.09 0.00
## energy 0.02 1.00 0.98 -0.54 -0.03 0.00
## instrumentalness 0.00 0.98 0.98 3.92 14.68 0.00
## liveness 0.02 0.99 0.97 2.15 5.62 0.00
## speechiness 0.00 0.97 0.97 2.50 9.62 0.00
## tempo 0.00 249.44 249.44 0.52 -0.31 0.20
## valence 0.00 0.99 0.99 -0.11 -0.88 0.00
## key 0.00 11.00 11.00 0.01 -1.32 0.02
## mode 0.00 1.00 1.00 -0.38 -1.86 0.00
## duration 9560.00 747325.00 737765.00 1.25 4.59 399.36
The mean values reflect interesting characteristics of the tracks in the dataset. For instance, most of the tracks appear to be highly danceable (m = 0.66, sd = 0.15) and to have a high level of energy (m = 0.70, sd = 0.17). Valence sits near the middle of its scale (m = 0.54, sd = 0.23), suggesting a roughly even split between more positive and less positive tracks. Moreover, most of the tracks appear to be non-instrumental (m = 0.05, sd = 0.17) and not recorded live (m = 0.19, sd = 0.15). Insights about the four genre categories are discussed below.
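As an exploratory follow-up (not part of the modeling pipeline), the sketch below breaks a few of these audio features down by genre to check whether the pooled means hide genre-level differences.
# exploratory sketch: average audio feature scores by genre
spotify %>%
  group_by(genre) %>%
  summarise(across(c(danceability, energy, valence, speechiness), mean),
            .groups = "drop")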
2.2 Balance assessment
Data imbalances can affect classification predictions when they are not managed properly. Thus, a barplot was computed to assess genre class distributions and identify potential imbalances.
First, let’s sort the music genre categories by count in ascending order.
# preview the data sorted by genre (display only; the data frame is not reassigned)
spotify[order(spotify$genre),]
# reorder genre levels by frequency
spotify$genre <- fct_infreq(spotify$genre)
# then reverse the level order (so counts appear
# in decreasing order in the plot)
spotify$genre <- fct_rev(spotify$genre)
Now, let’s look at the plot.
# music genres counts and percentages
plot <- ggplot(data = spotify, aes(y=genre)) +
geom_bar(aes(fill=genre)) +
theme (plot.title = element_text(hjust = 0.5)) +
expand_limits(x = 8500) +
labs(title = "Music genres - Number of Cases and Percentages",
y = "Genres",
x = "Count") +
geom_text(stat='count',
aes(label = sprintf('%s (%.1f%%)',
after_stat(count),
after_stat(count / sum(count) * 100))),
hjust = -0.1)
plot
These are the current class distributions in the data based on the plot: rock (21.8%), rap (26.2%), pop (29.3%), and latin (22.7%). Thus, it seems like there are no acute imbalances between the genre classes.
In order to further assess these distributions, a normalized version of the Shannon diversity index was run to observe how balanced the data is. A value closer to 0 indicates unbalanced data, whereas a value closer to 1 indicates balanced data. The result from this analysis is reported below.
# compute normalized Shannon diversity index for balance evaluation
balance <- spotify %>%
group_by(genre) %>%
count() %>%
ungroup() %>%
mutate(check = - (n / sum(n) * log(n / sum(n))) / log(length(genre)) ) %>%
summarise(balance = sum(check)) %>%
pull()
balance
## [1] 0.9949964
The score for this balance analysis is 0.99, which further supports the conclusion that the genre classes in the data are fairly evenly distributed.
2.3 Feature assessment
A logistic regression was run to explore which audio features are associated with genre, in order to help inform feature selection decisions. Note that because genre has four levels, a glm with a binomial family contrasts the first genre level against the remaining levels rather than modeling all four classes at once; the analysis is therefore used only as a rough screening tool. The results of the logistic regression are discussed below.
# compute logistic regression with glm function
genremodel <- glm(genre ~ acousticness + danceability + energy +
instrumentalness +
liveness + speechiness + tempo + valence +
key + mode + duration,
data = spotify, family = binomial)
# see model output
summary(genremodel)
##
## Call:
## glm(formula = genre ~ acousticness + danceability + energy +
## instrumentalness + liveness + speechiness + tempo + valence +
## key + mode + duration, family = binomial, data = spotify)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6858 0.1258 0.3444 0.5931 2.5133
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.324e+00 2.024e-01 -11.484 < 2e-16 ***
## acousticness 1.645e-01 1.085e-01 1.517 0.12936
## danceability 9.847e+00 1.825e-01 53.950 < 2e-16 ***
## energy -2.429e-01 1.447e-01 -1.679 0.09320 .
## instrumentalness -3.635e-01 1.118e-01 -3.251 0.00115 **
## liveness -6.484e-01 1.315e-01 -4.930 8.23e-07 ***
## speechiness 3.698e+00 2.289e-01 16.156 < 2e-16 ***
## tempo 4.126e-03 6.938e-04 5.947 2.74e-09 ***
## valence -2.458e+00 1.037e-01 -23.698 < 2e-16 ***
## key 1.059e-02 5.621e-03 1.884 0.05960 .
## mode -2.197e-01 4.207e-02 -5.223 1.76e-07 ***
## duration -7.001e-06 3.481e-07 -20.110 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22218 on 21181 degrees of freedom
## Residual deviance: 15988 on 21170 degrees of freedom
## AIC: 16012
##
## Number of Fisher Scoring iterations: 5
Based on this output, five audio features were selected as predictors for the classifier model: acousticness, danceability, energy, speechiness, and valence.
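If desired, this selection could be cross-checked with a multinomial model, which considers all four genre classes at once instead of contrasting the first level against the rest. The sketch below (using nnet::multinom, not part of the original pipeline) is one way to do that.
# optional cross-check (sketch): multinomial regression over all four genres
library(nnet)
genre_multi <- multinom(genre ~ acousticness + danceability + energy +
                          instrumentalness + liveness + speechiness +
                          tempo + valence + key + mode + duration,
                        data = spotify, trace = FALSE)
summary(genre_multi)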
The plot below was computed to look at the distributions of these five numerical variables.
# compute variable distributions
library(GGally)
ggpairs(spotify, columns = c("acousticness",
"danceability",
"energy",
"speechiness",
"valence"),
aes(color = genre, alpha = 0.5),
upper = list(continuous = wrap("cor", size = 2.5)))
Looking at the plot, the distribution curves for the acousticness and speechiness variables appear positively skewed. Skewed distributions in numerical predictors can make it harder for some algorithms to detect patterns. Transformations such as normalization and log transforms can be applied so that the distributions of numerical variables become more symmetrical. To keep all the predictors on a comparable scale, a normalization step (step_normalize) will be added to the model recipe.
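As an illustration only (this transformation is not applied in the final recipe), the sketch below uses a log1p transform to make the heavily skewed speechiness distribution easier to inspect by genre.
# illustrative sketch: log1p reduces the right skew of speechiness
# for visual inspection (exploration only)
spotify %>%
  ggplot(aes(x = log1p(speechiness), fill = genre)) +
  geom_density(alpha = 0.5)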
3.1 Feature selection
A subset with the selected features was created to start processing the classifier model. The new dataset contains five predictors (acousticness, danceability, energy, speechiness, and valence), as well as the outcome variable (genre).
# create subset with selected audio feature predictors
spotify2 <- subset(spotify, select=c(
"acousticness","danceability","energy",
"speechiness","valence","genre"))
#Inspect subset
names(spotify2)
## [1] "acousticness" "danceability" "energy" "speechiness" "valence"
## [6] "genre"
str(spotify2)
## 'data.frame': 21182 obs. of 6 variables:
## $ acousticness: num 0.27 0.0871 0.309 0.025 0.632 0.167 0.00226 0.198 0.0586 0.0273 ...
## $ danceability: num 0.629 0.621 0.674 0.548 0.663 0.724 0.721 0.404 0.738 0.616 ...
## $ energy : num 0.524 0.955 0.95 0.755 0.573 0.644 0.738 0.806 0.909 0.731 ...
## $ speechiness : num 0.0407 0.0444 0.116 0.0552 0.0446 0.132 0.0403 0.0496 0.061 0.061 ...
## $ valence : num 0.508 0.424 0.47 0.659 0.292 0.787 0.637 0.112 0.711 0.618 ...
## $ genre : Factor w/ 4 levels "rock","latin",..: 4 4 4 4 4 4 4 4 4 4 ...
summary(spotify2)
## acousticness danceability energy speechiness
## Min. :0.0000014 Min. :0.0000 Min. :0.0185 Min. :0.0000
## 1st Qu.:0.0193000 1st Qu.:0.5620 1st Qu.:0.5870 1st Qu.:0.0406
## Median :0.0912000 Median :0.6760 Median :0.7100 Median :0.0645
## Mean :0.1811030 Mean :0.6589 Mean :0.6956 Mean :0.1182
## 3rd Qu.:0.2670000 3rd Qu.:0.7700 3rd Qu.:0.8250 3rd Qu.:0.1530
## Max. :0.9920000 Max. :0.9880 Max. :0.9990 Max. :0.9670
## valence genre
## Min. :0.0000 rock :4619
## 1st Qu.:0.3650 latin:4814
## Median :0.5490 rap :5549
## Mean :0.5428 pop :6200
## 3rd Qu.:0.7290
## Max. :0.9920
describe(spotify2)
## vars n mean sd median trimmed mad min max range skew
## acousticness 1 21182 0.18 0.22 0.09 0.14 0.13 0.00 0.99 0.99 1.52
## danceability 2 21182 0.66 0.15 0.68 0.67 0.15 0.00 0.99 0.99 -0.50
## energy 3 21182 0.70 0.17 0.71 0.71 0.17 0.02 1.00 0.98 -0.54
## speechiness 4 21182 0.12 0.12 0.06 0.09 0.05 0.00 0.97 0.97 2.50
## valence 5 21182 0.54 0.23 0.55 0.55 0.27 0.00 0.99 0.99 -0.11
## genre* 6 21182 2.63 1.12 3.00 2.66 1.48 1.00 4.00 3.00 -0.17
## kurtosis se
## acousticness 1.67 0.00
## danceability -0.09 0.00
## energy -0.03 0.00
## speechiness 9.62 0.00
## valence -0.88 0.00
## genre* -1.34 0.01
Upon inspection, the new subset looks good.
3.2 Data splitting
The main dataset will be split into training and testing sets, which will be used to fit and evaluate the models in subsequent phases of the analysis. The initial_split, training, and testing functions from the tidymodels package were used to assign 80% of the data to the training set and the remaining 20% to the test set. See the splitting and statistics output provided below.
# load packages
library(tidymodels)
library(workflows)
library(tune)
# set seed for reproducibility
set.seed(234589)
# split the data into training (80%) and testing (20%)
spotify_split <- initial_split(spotify2, prop = 0.80, strata = genre)
spotify_split # proportions of cases (train/test/total)
## <Analysis/Assess/Total>
## <16945/4237/21182>
# extract training and testing sets
spotify_train <- training(spotify_split)
spotify_test <- testing(spotify_split)
# inspect training and testing datasets
str(spotify_train)
## 'data.frame': 16945 obs. of 6 variables:
## $ acousticness: num 0.0419 0.0207 0.171 0.199 0.163 0.261 0.00761 0.0606 0.401 0.171 ...
## $ danceability: num 0.71 0.705 0.662 0.706 0.662 0.704 0.745 0.598 0.712 0.789 ...
## $ energy : num 0.793 0.803 0.794 0.869 0.489 0.643 0.669 0.815 0.776 0.638 ...
## $ speechiness : num 0.0694 0.037 0.0533 0.0517 0.0525 0.0301 0.0316 0.0858 0.0802 0.045 ...
## $ valence : num 0.604 0.555 0.641 0.603 0.506 0.413 0.513 0.383 0.659 0.427 ...
## $ genre : Factor w/ 4 levels "rock","latin",..: 2 2 2 2 2 2 2 2 2 2 ...
str(spotify_test)
## 'data.frame': 4237 obs. of 6 variables:
## $ acousticness: num 0.00474 0.116 0.512 0.231 0.148 0.361 0.0697 0.0689 0.136 0.123 ...
## $ danceability: num 0.617 0.833 0.664 0.664 0.496 0.57 0.614 0.632 0.652 0.362 ...
## $ energy : num 0.928 0.46 0.66 0.915 0.752 0.797 0.934 0.595 0.815 0.921 ...
## $ speechiness : num 0.153 0.0532 0.0243 0.123 0.237 0.0416 0.07 0.0401 0.131 0.486 ...
## $ valence : num 0.521 0.128 0.355 0.407 0.255 0.227 0.436 0.435 0.794 0.529 ...
## $ genre : Factor w/ 4 levels "rock","latin",..: 4 4 4 4 4 4 4 4 4 4 ...
dim(spotify_train)
## [1] 16945 6
dim(spotify_test)
## [1] 4237 6
As shown in the output, the training set contains 16,945 observations (80% of the data), and the testing set contains 4,237 observations (20% of the data). Both the training and testing sets have 6 variables in total, including the five audio feature predictors and the genre outcome. Upon inspection, the training and testing sets look good.
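As an optional sanity check (a short sketch using the objects created above), the class proportions in the training and testing sets can be compared to confirm that the stratified split kept the genre distribution similar in both sets.
# optional check (sketch): genre proportions in the training and testing sets
bind_rows(
  spotify_train %>% count(genre) %>% mutate(set = "train"),
  spotify_test  %>% count(genre) %>% mutate(set = "test")
) %>%
  group_by(set) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()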
3.3 Data Preprocessing Recipe
Three objects should be created in order to tune and evaluate predictive models when using the tidymodels package. These objects are the recipe, the model specification, and the workflow. The recipe for this model is created below.
As indicated previously, a normalizing step (step_normalize) was added to the recipe to center and scale all the numeric predictors.
# create the recipe for the model
model_recipe <- recipe(genre ~ ., data = spotify_train) %>%
step_normalize(all_numeric())
model_recipe
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 5
##
## Operations:
##
## Centering and scaling for all_numeric()
Upon inspection, the model recipe looks good.
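If further verification is desired, the recipe can be prepped and baked on the training data (an optional sketch, not required for the workflow below) to confirm that the predictors come out centered and scaled.
# optional check (sketch): apply the recipe to the training data and
# summarise the processed predictors (means near 0, sd near 1 after normalization)
model_recipe %>%
  prep(training = spotify_train) %>%
  bake(new_data = NULL) %>%
  summary()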
3.4 Model specification
The next step is to specify which algorithms will be used to conduct the predictive analysis. Each algorithm is assigned to a separate model specification so all the models can be evaluated and compared with one another.
As indicated in the project description, the algorithms used for this analysis are Random Forest, K-Nearest Neighbor, and Boosted Trees. Please note that some of these algorithms require specific packages: the Random Forest model requires the ranger package, and the Boosted Trees model requires the xgboost package. The mode argument for all the models should be set to ‘classification’. The specific engines that can be used for each algorithm are listed on the tidymodels website (https://parsnip.tidymodels.org/reference/set_engine.html).
# Random Forest model
library(ranger)
rf_model <-
rand_forest() %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
# K-Nearest Neighbor model
knn_model <-
nearest_neighbor(neighbors = 4) %>%
set_engine("kknn") %>%
set_mode("classification")
# Boosted Trees model
library(xgboost)
boost_model <-
boost_tree() %>%
set_engine("xgboost") %>%
set_mode("classification")
Any of the created objects can be inspected when needed. If the code runs without errors, the model specifications were created correctly.
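For example, printing a specification or piping it through translate() (an optional, quick sketch) shows how parsnip will call the underlying engine.
# optional inspection (sketch): how parsnip translates the spec into a ranger call
rf_model
rf_model %>% translate()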
3.5 Workflows
The next step is to create a workflow for each of the models computed in the previous phase. The recipe and the specific algorithm model objects should be added to the workflows in question. The workflow objects were created below.
# workflow for Random Forest
rf_wf <- workflow() %>%
add_model(rf_model) %>%
add_recipe(model_recipe)
# workflow for K-Nearest Neighbor
knn_wf <- workflow() %>%
add_model(knn_model) %>%
add_recipe(model_recipe)
# workflow for Boosted model
boost_wf <- workflow() %>%
add_model(boost_model) %>%
add_recipe(model_recipe)
3.6 Folds
A resampling object (usually labeled as a ‘folds’ object) should be created to define the number of resamples used to evaluate the models. The models can then be assessed across several resampling rounds when this object is passed as a parameter.
The folds object should be created from the training set. In addition, the outcome variable should be specified in the ‘strata =’ argument within the code. The cross-validation scheme for this analysis will have 5 folds. See the code for the creation of this object below.
# split the train set into folds
set.seed(9876)
folds <- vfold_cv(data = spotify_train, v = 5, strata = "genre")
3.7 Tuning parameters and Model evaluation
All the required objects for the prediction analysis have been created, so it is time to evaluate the models at hand. Each resampling call should include the workflow defined previously, the folds object, and the performance metrics that will be used to evaluate the model.
The two performance metrics requested for this assessment are precision and recall. However, additional evaluation metrics were also computed to give a fuller picture of each model’s performance. The additional metrics were the F-measure (f_meas), accuracy, Cohen’s kappa (kap), area under the curve (roc_auc), sensitivity (sens), and specificity (spec).
These evaluation metrics are measured on a scale from 0.0 to 1.0. A score closer to 0 indicates lower performance, and a score closer to 1 indicates higher performance. A performance score is often considered ‘good’ when it is 0.8 or higher (please note that acceptable thresholds vary depending on the aim of the project and the professional / academic field in question). The evaluation metrics for each model are reported below.
Note:
The resampling code for each model takes about 1 to 2 minutes to run. The reported evaluation scores are rounded in the discussion for a clearer assessment.
-Random Forest model
The Random Forest model is tuned and evaluated below.
# tune Random Forest model
set.seed(634)
rf_tune <-
rf_wf %>%
fit_resamples(
resamples = folds,
metrics = metric_set(
recall, precision, f_meas,
accuracy, kap,
roc_auc, sens, spec),
control = control_resamples(save_pred = TRUE)
)
# report evaluation metrics for the Random Forest model
rf_tune %>%
collect_metrics()
## # A tibble: 8 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy multiclass 0.605 5 0.00311 Preprocessor1_Model1
## 2 f_meas macro 0.602 5 0.00297 Preprocessor1_Model1
## 3 kap multiclass 0.469 5 0.00430 Preprocessor1_Model1
## 4 precision macro 0.607 5 0.00258 Preprocessor1_Model1
## 5 recall macro 0.603 5 0.00334 Preprocessor1_Model1
## 6 roc_auc hand_till 0.835 5 0.00268 Preprocessor1_Model1
## 7 sens macro 0.603 5 0.00334 Preprocessor1_Model1
## 8 spec macro 0.867 5 0.00110 Preprocessor1_Model1
The precision (0.61) and recall (0.60) scores for the Random Forest model were nearly identical. These scores are not particularly high. In contrast, the scores for roc-auc (0.83) and specificity (0.87) were notably better. These scores raise interesting questions about the performance and parameters of the model. Let’s observe how the K-Nearest Neighbor and Boosted Trees models perform in relation to this model.
-K-Nearest Neighbor model
The K-Nearest Neighbor model is tuned and evaluated below.
set.seed(768)
# tune K-Nearest Neighbor model
knn_tune <-
knn_wf %>%
fit_resamples(
resamples = folds,
metrics = metric_set(
recall, precision, f_meas,
accuracy, kap,
roc_auc, sens, spec),
control = control_resamples(save_pred = TRUE)
)
# report evaluation metrics for the K-Nearest Neighbor model
knn_tune %>%
collect_metrics()
## # A tibble: 8 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy multiclass 0.521 5 0.00546 Preprocessor1_Model1
## 2 f_meas macro 0.522 5 0.00571 Preprocessor1_Model1
## 3 kap multiclass 0.358 5 0.00729 Preprocessor1_Model1
## 4 precision macro 0.522 5 0.00580 Preprocessor1_Model1
## 5 recall macro 0.522 5 0.00557 Preprocessor1_Model1
## 6 roc_auc hand_till 0.761 5 0.00234 Preprocessor1_Model1
## 7 sens macro 0.522 5 0.00557 Preprocessor1_Model1
## 8 spec macro 0.839 5 0.00181 Preprocessor1_Model1
The precision and recall scores for the K-Nearest Neighbor model were also low and essentially identical (0.52). The roc-auc score was decent (0.76) and the specificity score was acceptable (0.84). Similar patterns might be observed in the Boosted Trees model.
-Boosted Trees
The Boosted Trees model is tuned and evaluated below.
set.seed(143)
# tune the Boosted Trees model
boost_tune <- boost_wf %>%
fit_resamples(
resamples = folds,
metrics = metric_set(
recall, precision, f_meas,
accuracy, kap,
roc_auc, sens, spec),
control = control_resamples(save_pred = TRUE)
)
# report evaluation metrics for the Boosted Trees model
boost_tune %>%
collect_metrics()
## # A tibble: 8 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy multiclass 0.584 5 0.00343 Preprocessor1_Model1
## 2 f_meas macro 0.580 5 0.00328 Preprocessor1_Model1
## 3 kap multiclass 0.440 5 0.00469 Preprocessor1_Model1
## 4 precision macro 0.588 5 0.00297 Preprocessor1_Model1
## 5 recall macro 0.580 5 0.00347 Preprocessor1_Model1
## 6 roc_auc hand_till 0.824 5 0.00280 Preprocessor1_Model1
## 7 sens macro 0.580 5 0.00347 Preprocessor1_Model1
## 8 spec macro 0.859 5 0.00119 Preprocessor1_Model1
The precision score for the Boosted Trees model is 0.59, whereas the recall score is 0.58. As observed with the other models, these scores are very close. Similar to the Random Forest model, the roc-auc score is acceptable (0.82) and the specificity score is the highest of its metrics (0.86). Overall, the pattern of performance metrics is consistent across the three models.
3.8 Best performing model
Let’s gather the highest scores on all the reported metrics to determine which model had the best performance (this chunk of code takes about 40 seconds to run).
Highest performance scores
# report highest precision score
best_resamples <-
bind_rows(
show_best(rf_tune, metric = "precision", n = 1) %>% mutate(model = "Random Forest") %>% select(model, precision = mean),
show_best(knn_tune, metric = "precision", n = 1) %>% mutate(model = "K-NN") %>% select(model, precision = mean),
show_best(boost_tune, metric = "precision", n = 1) %>% mutate(model = "Boosted Trees") %>% select(model, precision = mean)
)
best_resamples %>%
arrange(desc(precision)) %>%
knitr::kable()
| model | precision |
|---|---|
| Random Forest | 0.6070627 |
| Boosted Trees | 0.5884896 |
| K-NN | 0.5222312 |
# report highest recall score
best_resamples <-
bind_rows(
show_best(rf_tune, metric = "recall", n = 1) %>% mutate(model = "Random Forest") %>% select(model, recall = mean),
show_best(knn_tune, metric = "recall", n = 1) %>% mutate(model = "K-NN") %>% select(model, recall = mean),
show_best(boost_tune, metric = "recall", n = 1) %>% mutate(model = "Boosted Trees") %>% select(model, recall = mean)
)
best_resamples %>%
arrange(desc(recall)) %>%
knitr::kable()
| model | recall |
|---|---|
| Random Forest | 0.6025992 |
| Boosted Trees | 0.5800721 |
| K-NN | 0.5221858 |
# report highest roc-auc score
best_resamples <-
bind_rows(
show_best(rf_tune, metric = "roc_auc", n = 1) %>% mutate(model = "Random Forest") %>% select(model, roc_auc = mean),
show_best(knn_tune, metric = "roc_auc", n = 1) %>% mutate(model = "K-NN") %>% select(model, roc_auc = mean),
show_best(boost_tune, metric = "roc_auc", n = 1) %>% mutate(model = "Boosted Trees") %>% select(model, roc_auc = mean)
)
best_resamples %>%
arrange(desc(roc_auc)) %>%
knitr::kable()
| model | roc_auc |
|---|---|
| Random Forest | 0.8349646 |
| Boosted Trees | 0.8243245 |
| K-NN | 0.7605227 |
# report highest specificity score
best_resamples <-
bind_rows(
show_best(rf_tune, metric = "spec", n = 1) %>% mutate(model = "Random Forest") %>% select(model, spec = mean),
show_best(knn_tune, metric = "spec", n = 1) %>% mutate(model = "K-NN") %>% select(model, spec = mean),
show_best(boost_tune, metric = "spec", n = 1) %>% mutate(model = "Boosted Trees") %>% select(model, spec = mean)
)
best_resamples %>%
arrange(desc(spec)) %>%
knitr::kable()
| model | spec |
|---|---|
| Random Forest | 0.8665823 |
| Boosted Trees | 0.8592468 |
| K-NN | 0.8391618 |
Results showed that the Random Forest model had the highest performance with the following evaluation scores: specificity (0.87), area under curve (0.83), precision (0.61), and recall (0.60). A general assessment of model performance is provided in the final phase of this analysis.
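As an optional alternative to the per-metric tables above (a sketch built from the resampling objects already created), all the collected metrics can also be combined into a single comparison table.
# optional sketch: one comparison table across models and metrics
bind_rows(
  collect_metrics(rf_tune)    %>% mutate(model = "Random Forest"),
  collect_metrics(knn_tune)   %>% mutate(model = "K-NN"),
  collect_metrics(boost_tune) %>% mutate(model = "Boosted Trees")
) %>%
  select(model, .metric, mean) %>%
  pivot_wider(names_from = .metric, values_from = mean)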
3.9 Resampling predictions
Predictions drawn from resampling
The best performing model has been identified (Random Forest). Thus, we can proceed to inspect the predictions drawn from this model. The predictions collected below are the out-of-fold predictions saved during resampling on the training folds (hence the 16,945 rows); predictions on the held-out test set are generated when the final model is fitted in a later section. These predictions are assessed below.
# collect out-of-fold predictions from the resampling results
test_predictions <- rf_tune %>% collect_predictions()
test_predictions
## # A tibble: 16,945 × 9
## id .pred_class .row .pred_rock .pred_latin .pred_rap .pred_pop genre
## <chr> <fct> <int> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 Fold1 pop 1 0.0313 0.255 0.152 0.562 latin
## 2 Fold1 pop 2 0.0851 0.169 0.0364 0.709 latin
## 3 Fold1 rap 11 0.0811 0.308 0.410 0.201 latin
## 4 Fold1 latin 14 0.0571 0.513 0.291 0.139 latin
## 5 Fold1 pop 16 0.0330 0.0823 0.110 0.775 latin
## 6 Fold1 pop 27 0.0397 0.288 0.236 0.436 latin
## 7 Fold1 pop 38 0.386 0.168 0.00837 0.437 latin
## 8 Fold1 rap 39 0.00701 0.172 0.649 0.172 latin
## 9 Fold1 pop 41 0.189 0.0607 0.0383 0.712 latin
## 10 Fold1 rap 42 0.0571 0.170 0.457 0.315 latin
## # … with 16,935 more rows, and 1 more variable: .config <chr>
Looking at the output, each row shows the predicted class, the class probabilities, and the observed genre, so the predictions can be compared against the truth. The confusion matrix below summarizes this comparison.
Preliminary confusion matrix
A preliminary confusion matrix was generated from the predictions at hand in order to evaluate some of the observations. See the confusion matrix below.
# generate a preliminary confusion matrix with the resampling predictions
test_predictions %>%
conf_mat(truth = genre, estimate = .pred_class)
## Truth
## Prediction rock latin rap pop
## rock 2561 283 105 764
## latin 204 1599 528 583
## rap 96 1004 3079 595
## pop 834 965 727 3018
The confusion matrix shows that the majority of the rock, rap, and pop observations were correctly predicted, whereas fewer than half of the latin tracks were classified correctly. When looking at proportions, the predictions made for rock fared consistently well. Moreover, there seems to be substantial overlap between the pop and latin music genres. This suggests that several audio feature scores might be fairly similar across these two genres.
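To back up these proportions, the counts in the confusion matrix can be converted into per-class recall (a short follow-up sketch based on the predictions collected above).
# follow-up sketch: proportion of each true genre predicted correctly
test_predictions %>%
  count(genre, .pred_class) %>%
  group_by(genre) %>%
  mutate(prop = n / sum(n)) %>%
  filter(genre == .pred_class) %>%
  ungroup()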
3.10 Fitting the final model
We should fit the final model to complete the analytical procedures at hand. The last_fit function from the tidymodels package is used to fit the best performing model (in this case, the Random Forest model) to the training data and evaluate it on the test set. The workflow for the best performing model, the data split object, and the performance metrics of interest should be passed to this function. See the performance results reported below.
# fit the final model
last_fit_rf <- last_fit(rf_wf,
split = spotify_split,
metrics = metric_set(
recall, precision, f_meas,
accuracy, kap,
roc_auc, sens, spec)
)
# compute performance score for the final model
last_fit_rf %>%
collect_metrics()
## # A tibble: 8 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 recall macro 0.614 Preprocessor1_Model1
## 2 precision macro 0.620 Preprocessor1_Model1
## 3 f_meas macro 0.615 Preprocessor1_Model1
## 4 accuracy multiclass 0.617 Preprocessor1_Model1
## 5 kap multiclass 0.485 Preprocessor1_Model1
## 6 sens macro 0.614 Preprocessor1_Model1
## 7 spec macro 0.871 Preprocessor1_Model1
## 8 roc_auc hand_till 0.843 Preprocessor1_Model1
Looking at the results, we can tell that the precision and recall scores improved slightly in the final model. The performance scores for the final model are the following: precision (0.62), recall (0.61), roc-auc (0.84), and specificity (0.87).
General assessment of model performance
In order to conduct a well informed assessment of model performance, it is important to discuss differences between evaluation metrics and understand how each of these metrics inform the model.
For instance, precision is the proportion of predicted cases of a class that truly belong to that class (it is penalized by false positives), whereas recall is the proportion of actual cases of a class that the model correctly identifies (it is penalized by false negatives). The area under the curve summarizes how well a model separates the classes across all classification thresholds of the Receiver Operating Characteristic (ROC) curve; the larger the area under the ROC curve, the better the model distinguishes the classes. Lastly, specificity is the true negative rate, that is, the proportion of cases outside a class that the model correctly classifies as not belonging to that class.
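To make these definitions concrete, here is a minimal sketch with purely hypothetical counts (not taken from this analysis) showing how each metric is computed from the cells of a single-class confusion matrix.
# toy illustration with hypothetical counts (tp/fp/fn/tn are made up)
tp <- 80; fp <- 20; fn <- 10; tn <- 90
precision   <- tp / (tp + fp)   # penalized by false positives
recall      <- tp / (tp + fn)   # penalized by false negatives (equals sensitivity)
specificity <- tn / (tn + fp)   # true negative rate
c(precision = precision, recall = recall, specificity = specificity)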
The specificity and roc-auc scores were notably higher than the precision and recall scores. This suggests that the model is better at ruling out the classes a track does not belong to than at assigning the correct class directly. Specificity and roc-auc scores can be inflated when the outcome classes are imbalanced; however, this is unlikely to be the case here, as the preliminary assessments showed that the classes are fairly balanced. It is also worth noting that, with four roughly equal classes, each class has about three times as many negative cases as positive ones, which makes a high specificity easier to achieve than a high precision or recall. Sources discussing which evaluation metrics are most appropriate for multinomial classification models should be further examined.
Visualizations can show interesting insights about predictive models. Hence, the following visualizations were computed to assess the classifier model at hand.
4.1 Distributions of predicted classes
This plot was produced to illustrate prediction distributions among the four music genres. Insights are discussed below.
# compute distributions of predicted classes
test_predictions %>%
ggplot() +
geom_density(aes(x = .pred_class, fill = genre),
alpha = 0.5)
The prediction distributions for the latin genre seem to be somewhat leptokurtic. This might be attributed to the assumption that latin songs tend to be more energetic and danceable than songs from other music genres. On the other hand, the distributions for the rock genre seem to be slightly platykurtic, indicating fewer extreme cases than would be expected in a normal distribution. Thus, audio feature scores in the rock genre might be lower than those observed in the other music genres. These assumptions could be tested by conducting an ANOVA (followed by a post-hoc test) to compare audio feature scores across the four music genres, as sketched below.
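A sketch of that suggested follow-up is provided here (it was not run as part of this analysis); it compares danceability across genres with a one-way ANOVA and Tukey's post-hoc test, and the same template would apply to the other audio features.
# sketch of the suggested follow-up: one-way ANOVA with Tukey post-hoc test
dance_aov <- aov(danceability ~ genre, data = spotify)
summary(dance_aov)
TukeyHSD(dance_aov)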
4.2 Variables of Importance
A plot was computed to show the variables of importance in the classifier model. Insights about the variables in question are discussed below.
# plot variables of importance
library(vip)
last_fit_rf %>%
pluck(".workflow", 1) %>%
extract_fit_parsnip() %>%
vip(num_features = 5)
This plot indicates that the three audio features with the highest importance scores are speechiness, danceability, and acousticness. The highest speechiness scores might be observed in the rap genre, as rap songs tend to contain more spoken words than songs in other genres. On the other hand, the highest danceability scores might be observed in the latin genre, due to latin songs being perceived as highly danceable (as suggested in the previous assessment). Curiously, acousticness had a slight edge over valence in importance. This suggests that acoustic sounds might help differentiate music genres from one another. It might be interesting to find out whether certain acoustic sounds (such as sounds linked to specific instruments) are more prominent in some music genres than in others.
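If numeric importance values are needed in addition to the plot, the same fitted model can be passed to vip::vi() (an optional sketch using the objects created above).
# optional sketch: variable importance scores as a table instead of a plot
last_fit_rf %>%
  pluck(".workflow", 1) %>%
  extract_fit_parsnip() %>%
  vi()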
4.3 Final Confusion Matrix
An additional confusion matrix was computed to assess predictions from the final model. Insights about genre classifications in the matrix are discussed below.
# compute final confusion matrix
last_fit_rf %>%
collect_predictions() %>%
conf_mat(genre, .pred_class) %>%
autoplot(type = "heatmap")
In this confusion matrix heatmap, cells with higher counts are shown in darker shades, so the most accurate predictions stand out along the diagonal. Based on the results, the music genre with the most accurate predictions is rock (with roughly 70.84% of cases correctly classified). This observation is consistent with insights drawn from the preliminary confusion matrix. The second music genre with the most accurate predictions is rap, which might be attributed to the perceived high levels of speechiness in rap songs, as discussed previously. The third genre with the most accurate predictions is pop, followed by latin (the genre with the least accurate predictions in the matrix). These findings are consistent with the assumption that latin songs might be more difficult to classify due to their overlap with other music genres in terms of audio features.
End of Project