The aim of this analysis was to build a classifier that uses Spotify audio features to predict music genres. The algorithms selected for this purpose are Random Forest, K-Nearest Neighbor, and Boosted Trees. The evaluation metrics discussed are Precision, Recall, Area Under Curve (roc-auc), and Specificity.
All the packages required for this analysis can be loaded by running the code below. The relevant library calls are also repeated throughout the analysis for reference.
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(
readr,
tidyverse,
visdat,
psych,
GGally,
tidymodels,
workflows,
tune,
ranger,
xgboost,
vip,
hardhat)
1.1 Data description
The data at hand was extracted from the Spotify API via the R package spotifyr. The dataset can be loaded from the GitHub link provided below.
# load dataset from github account
library(readr)
spotify <- read.csv('https://raw.githubusercontent.com/karendelarosa/datasets/master/spotifyb.csv')
library(tidyverse)
spotify <- spotify %>%
mutate_if(is.integer, as.numeric)
spotify$genre <- factor(spotify$genre)
spotify <- spotify %>%
select(-X)
# inspect variable names
names(spotify)
## [1] "count" "artist" "album" "song"
## [5] "genre" "year" "popularity" "acousticness"
## [9] "danceability" "energy" "instrumentalness" "liveness"
## [13] "speechiness" "tempo" "valence" "key"
## [17] "mode" "duration"
As shown in the output, the dataset contains variables referencing information about the music such as artist name, album name, song name, song popularity score, and year of release. In addition, the dataset contains 12 audio feature variables that reflect music characteristics. The audio features in question usually fall into three major categories: a) confidence measures, b) perceptual measures, and c) music descriptors. A brief description for each of these audio features is provided below:
Sources:
https://developer.spotify.com/discover/
Spotify audio features
a) confidence measures
Acousticness: A measure from 0.0 to 1.0 that detects acoustic sounds in a track.
Liveness: A measure from 0.0 to 1.0 that detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
Speechiness: A measure from 0.0 to 1.0 that detects the presence of spoken words in a track. The more speech sounds are heard in the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the value will be.
Instrumentalness: A measure from 0.0 to 1.0 that reflects the extent to which a track does not contain vocalizations. The closer the instrumentalness value is to 1.0, the greater the likelihood that the track does not contain vocal content.
b) perceptual measures
Energy: A measure from 0.0 to 1.0 that reflects the intensity of a track. Energetic tracks are usually fast, loud, and noisy.
Loudness: A measure from -60 to 0 that represents the overall loudness of a track. This value is measured in decibels (dB). Loudness values are averaged across the entire track.
Danceability: A measure from 0.0 to 1.0 that indicates how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, and beat strength. A value closer to 0.0 indicates that a track is less danceable, and a value closer to 1.0 indicates that a track is more danceable.
Valence: A measure from 0.0 to 1.0 that reflects the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
c) music descriptors
Tempo: The overall estimated speed of a track measured in beats per minute (BPM).
Duration: The duration of a track measured in milliseconds.
Key: A measure from 0 to 11 that indicates the key of a track.
Mode: A measure that indicates the modality (major or minor) of a track (1 is major, and 0 is minor).
1.2 Data inspection and Variables
A plot and summary outputs were computed to inspect the dataset (the plot code takes about 30 seconds to run).
# see type of variables
str(spotify)
## 'data.frame': 21182 obs. of 18 variables:
## $ count : num 1 2 3 4 5 6 7 8 9 10 ...
## $ artist : chr "Bella Renee" "tyDi" "Electric Polar Bears" "My Bad" ...
## $ album : chr "Swimming Pools" "You Never Know" "Too Good To Lose" "Delusions - EP" ...
## $ song : chr "Swimming Pools" "You Never Know" "Too Good To Lose" "Viva La Vida" ...
## $ genre : Factor w/ 4 levels "latin","pop",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ year : num 2021 2021 2021 2021 2021 ...
## $ popularity : num 40 51 55 48 35 36 89 67 49 47 ...
## $ acousticness : num 0.27 0.0871 0.309 0.025 0.632 0.167 0.00226 0.198 0.0586 0.0273 ...
## $ danceability : num 0.629 0.621 0.674 0.548 0.663 0.724 0.721 0.404 0.738 0.616 ...
## $ energy : num 0.524 0.955 0.95 0.755 0.573 0.644 0.738 0.806 0.909 0.731 ...
## $ instrumentalness: num 1.25e-05 3.26e-03 3.14e-05 0.00 5.26e-04 0.00 4.41e-06 0.00 1.93e-02 0.00 ...
## $ liveness : num 0.05 0.296 0.319 0.0985 0.249 0.0867 0.118 0.114 0.364 0.215 ...
## $ speechiness : num 0.0407 0.0444 0.116 0.0552 0.0446 0.132 0.0403 0.0496 0.061 0.061 ...
## $ tempo : num 120 127 127.1 97.9 120.1 ...
## $ valence : num 0.508 0.424 0.47 0.659 0.292 0.787 0.637 0.112 0.711 0.618 ...
## $ key : num 4 0 11 11 0 1 7 2 9 0 ...
## $ mode : num 0 1 1 0 1 0 1 1 1 1 ...
## $ duration : num 202027 139173 187087 146939 224000 ...
# compute summary statistics
summary(spotify)
## count artist album song
## Min. : 1 Length:21182 Length:21182 Length:21182
## 1st Qu.: 5466 Class :character Class :character Class :character
## Median :11054 Mode :character Mode :character Mode :character
## Mean :11272
## 3rd Qu.:17177
## Max. :22900
## genre year popularity acousticness
## latin:4814 Min. :1962 Min. : 0.00 Min. :0.0000014
## pop :6200 1st Qu.:2006 1st Qu.: 31.00 1st Qu.:0.0193000
## rap :5549 Median :2016 Median : 51.00 Median :0.0912000
## rock :4619 Mean :2011 Mean : 45.56 Mean :0.1811030
## 3rd Qu.:2020 3rd Qu.: 65.00 3rd Qu.:0.2670000
## Max. :2021 Max. :100.00 Max. :0.9920000
## danceability energy instrumentalness liveness
## Min. :0.0000 Min. :0.0185 Min. :0.0000000 Min. :0.0167
## 1st Qu.:0.5620 1st Qu.:0.5870 1st Qu.:0.0000000 1st Qu.:0.0932
## Median :0.6760 Median :0.7100 Median :0.0000040 Median :0.1270
## Mean :0.6589 Mean :0.6956 Mean :0.0493000 Mean :0.1873
## 3rd Qu.:0.7700 3rd Qu.:0.8250 3rd Qu.:0.0008697 3rd Qu.:0.2440
## Max. :0.9880 Max. :0.9990 Max. :0.9850000 Max. :0.9880
## speechiness tempo valence key
## Min. :0.0000 Min. : 0.00 Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0406 1st Qu.: 97.92 1st Qu.:0.3650 1st Qu.: 2.000
## Median :0.0645 Median :120.00 Median :0.5490 Median : 5.000
## Mean :0.1182 Mean :121.51 Mean :0.5428 Mean : 5.289
## 3rd Qu.:0.1530 3rd Qu.:139.94 3rd Qu.:0.7290 3rd Qu.: 8.000
## Max. :0.9670 Max. :249.44 Max. :0.9920 Max. :11.000
## mode duration
## Min. :0.0000 Min. : 9560
## 1st Qu.:0.0000 1st Qu.:185924
## Median :1.0000 Median :215480
## Mean :0.5923 Mean :222603
## 3rd Qu.:1.0000 3rd Qu.:251064
## Max. :1.0000 Max. :747325
# plot variable distributions
library(visdat)
vis_dat(spotify)
Overall, the dataset contains 21,182 observations and 18 variables (see the summary and str outputs for reference). As shown in the plot, count, year, popularity, and the twelve audio features discussed are numerical variables. Moreover, genre is a categorical variable, and the identifier columns (artist, album, and song) are characters.
Some of the descriptor audio features (e.g. tempo, duration, and key) are measured at the ratio level and span much wider ranges than the 0.0 to 1.0 features. Thus, if any of these audio features are included as predictors in the classifier model, all the predictors should be normalized so that variables on wider scales do not dominate the estimated effects. Genre (the outcome variable) contains four categories: latin, pop, rap, and rock (note: genre names are not capitalized in this assessment so they remain consistent with the labels in the genre column of the dataset). These four categories are the classes to be predicted.
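To illustrate the scale differences, the exploratory sketch below (using the spotify data frame loaded above) compares the spread of two 0.0 to 1.0 audio features with the wider-range descriptors tempo and duration.
# exploratory sketch: compare the spread of 0.0-1.0 features with the
# wider-range descriptors (tempo in BPM, duration in milliseconds)
spotify %>%
  summarise(across(c(danceability, valence, tempo, duration),
                   list(min = min, max = max, sd = sd))) %>%
  glimpse()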
Missing data might disrupt some of the machine learning and statistical procedures that will be performed. Hence, it is important to check whether any observations are incomplete and, if so, decide how they should be handled. The code below counts missing values within each variable.
# look for missing data (just in case)
is.na(spotify) %>% colSums()
## count artist album song
## 0 0 0 0
## genre year popularity acousticness
## 0 0 0 0
## danceability energy instrumentalness liveness
## 0 0 0 0
## speechiness tempo valence key
## 0 0 0 0
## mode duration
## 0 0
Based on this output, the dataset does not contain missing data.
2.1 Descriptive statistics
The describe function (from the psych package) was used to compute descriptive statistics. Some insights can be drawn from the output below.
library(psych)
describe(spotify)
## vars n mean sd median trimmed mad
## count 1 21182 11271.69 6675.06 11054.50 11236.39 8674.69
## artist* 2 21182 3473.89 2027.23 3490.00 3473.39 2621.24
## album* 3 21182 6742.70 3924.36 6770.50 6760.93 5109.04
## song* 4 21182 8399.24 4840.98 8410.00 8400.66 6225.44
## genre* 5 21182 2.47 1.07 2.00 2.46 1.48
## year 6 21182 2011.17 12.14 2016.00 2013.58 7.41
## popularity 7 21182 45.56 25.06 51.00 47.02 23.72
## acousticness 8 21182 0.18 0.22 0.09 0.14 0.13
## danceability 9 21182 0.66 0.15 0.68 0.67 0.15
## energy 10 21182 0.70 0.17 0.71 0.71 0.17
## instrumentalness 11 21182 0.05 0.17 0.00 0.00 0.00
## liveness 12 21182 0.19 0.15 0.13 0.16 0.07
## speechiness 13 21182 0.12 0.12 0.06 0.09 0.05
## tempo 14 21182 121.51 28.79 120.00 119.63 30.77
## valence 15 21182 0.54 0.23 0.55 0.55 0.27
## key 16 21182 5.29 3.63 5.00 5.25 4.45
## mode 17 21182 0.59 0.49 1.00 0.62 0.00
## duration 18 21182 222603.14 58122.62 215480.00 218282.48 47541.05
## min max range skew kurtosis se
## count 1.00 22900.00 22899.00 0.05 -1.23 45.86
## artist* 1.00 6972.00 6971.00 0.00 -1.21 13.93
## album* 1.00 13441.00 13440.00 -0.03 -1.23 26.96
## song* 1.00 16774.00 16773.00 0.00 -1.21 33.26
## genre* 1.00 4.00 3.00 0.05 -1.24 0.01
## year 1962.00 2021.00 59.00 -1.63 2.15 0.08
## popularity 0.00 100.00 100.00 -0.55 -0.76 0.17
## acousticness 0.00 0.99 0.99 1.52 1.67 0.00
## danceability 0.00 0.99 0.99 -0.50 -0.09 0.00
## energy 0.02 1.00 0.98 -0.54 -0.03 0.00
## instrumentalness 0.00 0.98 0.98 3.92 14.68 0.00
## liveness 0.02 0.99 0.97 2.15 5.62 0.00
## speechiness 0.00 0.97 0.97 2.50 9.62 0.00
## tempo 0.00 249.44 249.44 0.52 -0.31 0.20
## valence 0.00 0.99 0.99 -0.11 -0.88 0.00
## key 0.00 11.00 11.00 0.01 -1.32 0.02
## mode 0.00 1.00 1.00 -0.38 -1.86 0.00
## duration 9560.00 747325.00 737765.00 1.25 4.59 399.36
The mean values reflect interesting characteristics of the tracks in the dataset. For instance, most of the tracks appear to be highly danceable (m = 0.66, sd = 0.15) and to have a high level of energy (m = 0.70, sd = 0.17). Valence sits near the middle of its scale (m = 0.54, sd = 0.23), suggesting a roughly even split between more positive and less positive tracks. Moreover, most of the tracks appear to be non-instrumental (m = 0.05, sd = 0.17) and not recorded live (m = 0.19, sd = 0.15). Insights about the four genre categories are discussed below.
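As an exploratory follow-up (not part of the modeling pipeline), the sketch below breaks a few of these audio features down by genre to check whether the pooled means hide genre-level differences.
# exploratory sketch: average audio feature scores by genre
spotify %>%
  group_by(genre) %>%
  summarise(across(c(danceability, energy, valence, speechiness), mean),
            .groups = "drop")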
2.2 Balance assessment
Data imbalances can affect classification predictions when they are not managed properly. Thus, a barplot was computed to assess genre class distributions and identify potential imbalances.
First, let’s sort the music genre categories by count in ascending order.
# preview the data sorted by genre (display only; the data frame is not reassigned)
spotify[order(spotify$genre),]
# reorder genre levels by frequency
spotify$genre <- fct_infreq(spotify$genre)
# then reverse the level order (so counts appear
# in decreasing order in the plot)
spotify$genre <- fct_rev(spotify$genre)
Now, let’s look at the plot.
# music genres counts and percentages
plot <- ggplot(data = spotify, aes(y=genre)) +
geom_bar(aes(fill=genre)) +
theme (plot.title = element_text(hjust = 0.5)) +
expand_limits(x = 8500) +
labs(title = "Music genres - Number of Cases and Percentages",
y = "Genres",
x = "Count") +
geom_text(stat='count',
aes(label = sprintf('%s (%.1f%%)',
after_stat(count),
after_stat(count / sum(count) * 100))),
hjust = -0.1)
plot
These are the current class distributions in the data based on the plot: rock (21.8%), rap (26.2%), pop (29.3%), and latin (22.7%). Thus, it seems like there are no acute imbalances between the genre classes.
In order to further assess these distributions, a normalized version of the Shannon diversity index was run to observe how balanced the data is. A value closer to 0 indicates unbalanced data, whereas a value closer to 1 indicates balanced data. The result from this analysis is reported below.
# compute normalized Shannon diversity index for balance evaluation
balance <- spotify %>%
group_by(genre) %>%
count() %>%
ungroup() %>%
mutate(check = - (n / sum(n) * log(n / sum(n))) / log(length(genre)) ) %>%
summarise(balance = sum(check)) %>%
pull()
balance
## [1] 0.9949964
The score for this balance analysis is 0.99, which further supports the conclusion that the genre classes in the data are fairly evenly distributed.
2.3 Feature assessment
A logistic regression was run to explore which audio features are associated with genre, in order to help inform feature selection decisions. Note that because genre has four levels, a glm with a binomial family contrasts the first genre level against the remaining levels rather than modeling all four classes at once; the analysis is therefore used only as a rough screening tool. The results of the logistic regression are discussed below.
# compute logistic regression with glm function
genremodel <- glm(genre ~ acousticness + danceability + energy +
instrumentalness +
liveness + speechiness + tempo + valence +
key + mode + duration,
data = spotify, family = binomial)
# see model output
summary(genremodel)
##
## Call:
## glm(formula = genre ~ acousticness + danceability + energy +
## instrumentalness + liveness + speechiness + tempo + valence +
## key + mode + duration, family = binomial, data = spotify)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6858 0.1258 0.3444 0.5931 2.5133
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.324e+00 2.024e-01 -11.484 < 2e-16 ***
## acousticness 1.645e-01 1.085e-01 1.517 0.12936
## danceability 9.847e+00 1.825e-01 53.950 < 2e-16 ***
## energy -2.429e-01 1.447e-01 -1.679 0.09320 .
## instrumentalness -3.635e-01 1.118e-01 -3.251 0.00115 **
## liveness -6.484e-01 1.315e-01 -4.930 8.23e-07 ***
## speechiness 3.698e+00 2.289e-01 16.156 < 2e-16 ***
## tempo 4.126e-03 6.938e-04 5.947 2.74e-09 ***
## valence -2.458e+00 1.037e-01 -23.698 < 2e-16 ***
## key 1.059e-02 5.621e-03 1.884 0.05960 .
## mode -2.197e-01 4.207e-02 -5.223 1.76e-07 ***
## duration -7.001e-06 3.481e-07 -20.110 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22218 on 21181 degrees of freedom
## Residual deviance: 15988 on 21170 degrees of freedom
## AIC: 16012
##
## Number of Fisher Scoring iterations: 5
Based on this output, five audio features were selected as predictors for the classifier model: acousticness, danceability, energy, speechiness, and valence.
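If desired, this selection could be cross-checked with a multinomial model, which considers all four genre classes at once instead of contrasting the first level against the rest. The sketch below (using nnet::multinom, not part of the original pipeline) is one way to do that.
# optional cross-check (sketch): multinomial regression over all four genres
library(nnet)
genre_multi <- multinom(genre ~ acousticness + danceability + energy +
                          instrumentalness + liveness + speechiness +
                          tempo + valence + key + mode + duration,
                        data = spotify, trace = FALSE)
summary(genre_multi)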
The plot below was computed to look at the distributions of these five numerical variables.
# compute variable distributions
library(GGally)
ggpairs(spotify, columns = c("acousticness",
"danceability",
"energy",
"speechiness",
"valence"),
aes(color = genre, alpha = 0.5),
upper = list(continuous = wrap("cor", size = 2.5)))
Looking at the plot, the distribution curves for the acousticness and speechiness variables appear positively skewed. Skewed distributions in numerical predictors can make it harder for some algorithms to detect patterns. Transformations such as normalization and log transforms can be applied so that the distributions of numerical variables become more symmetrical. To keep all the predictors on a comparable scale, a normalization step (step_normalize) will be added to the model recipe.
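As an illustration only (this transformation is not applied in the final recipe), the sketch below uses a log1p transform to make the heavily skewed speechiness distribution easier to inspect by genre.
# illustrative sketch: log1p reduces the right skew of speechiness
# for visual inspection (exploration only)
spotify %>%
  ggplot(aes(x = log1p(speechiness), fill = genre)) +
  geom_density(alpha = 0.5)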
3.1 Feature selection
A subset with the selected features was created to start processing the classifier model. The new dataset contains five predictors (acousticness, danceability, energy, speechiness, and valence), as well as the outcome variable (genre).
# create subset with selected audio feature predictors
spotify2 <- subset(spotify, select=c(
"acousticness","danceability","energy",
"speechiness","valence","genre"))
#Inspect subset
names(spotify2)
## [1] "acousticness" "danceability" "energy" "speechiness" "valence"
## [6] "genre"
str(spotify2)
## 'data.frame': 21182 obs. of 6 variables:
## $ acousticness: num 0.27 0.0871 0.309 0.025 0.632 0.167 0.00226 0.198 0.0586 0.0273 ...
## $ danceability: num 0.629 0.621 0.674 0.548 0.663 0.724 0.721 0.404 0.738 0.616 ...
## $ energy : num 0.524 0.955 0.95 0.755 0.573 0.644 0.738 0.806 0.909 0.731 ...
## $ speechiness : num 0.0407 0.0444 0.116 0.0552 0.0446 0.132 0.0403 0.0496 0.061 0.061 ...
## $ valence : num 0.508 0.424 0.47 0.659 0.292 0.787 0.637 0.112 0.711 0.618 ...
## $ genre : Factor w/ 4 levels "rock","latin",..: 4 4 4 4 4 4 4 4 4 4 ...
summary(spotify2)
## acousticness danceability energy speechiness
## Min. :0.0000014 Min. :0.0000 Min. :0.0185 Min. :0.0000
## 1st Qu.:0.0193000 1st Qu.:0.5620 1st Qu.:0.5870 1st Qu.:0.0406
## Median :0.0912000 Median :0.6760 Median :0.7100 Median :0.0645
## Mean :0.1811030 Mean :0.6589 Mean :0.6956 Mean :0.1182
## 3rd Qu.:0.2670000 3rd Qu.:0.7700 3rd Qu.:0.8250 3rd Qu.:0.1530
## Max. :0.9920000 Max. :0.9880 Max. :0.9990 Max. :0.9670
## valence genre
## Min. :0.0000 rock :4619
## 1st Qu.:0.3650 latin:4814
## Median :0.5490 rap :5549
## Mean :0.5428 pop :6200
## 3rd Qu.:0.7290
## Max. :0.9920
describe(spotify2)
## vars n mean sd median trimmed mad min max range skew
## acousticness 1 21182 0.18 0.22 0.09 0.14 0.13 0.00 0.99 0.99 1.52
## danceability 2 21182 0.66 0.15 0.68 0.67 0.15 0.00 0.99 0.99 -0.50
## energy 3 21182 0.70 0.17 0.71 0.71 0.17 0.02 1.00 0.98 -0.54
## speechiness 4 21182 0.12 0.12 0.06 0.09 0.05 0.00 0.97 0.97 2.50
## valence 5 21182 0.54 0.23 0.55 0.55 0.27 0.00 0.99 0.99 -0.11
## genre* 6 21182 2.63 1.12 3.00 2.66 1.48 1.00 4.00 3.00 -0.17
## kurtosis se
## acousticness 1.67 0.00
## danceability -0.09 0.00
## energy -0.03 0.00
## speechiness 9.62 0.00
## valence -0.88 0.00
## genre* -1.34 0.01
Upon inspection, the new subset looks good.
3.2 Data splitting
The main dataset will be split into training and testing sets, which will be used to fit and evaluate the models in subsequent phases of the analysis. The initial_split, training, and testing functions from the tidymodels package were used to assign 80% of the data to the training set and the remaining 20% to the test set. See the splitting and statistics output provided below.
# load packages
library(tidymodels)
library(workflows)
library(tune)
# set seed for reproducibility
set.seed(234589)
# split the data into training (80%) and testing (20%)
spotify_split <- initial_split(spotify2, prop = 0.80, strata = genre)
spotify_split # proportions of cases (train/test/total)
## <Analysis/Assess/Total>
## <16945/4237/21182>
# extract training and testing sets
spotify_train <- training(spotify_split)
spotify_test <- testing(spotify_split)
# inspect training and testing datasets
str(spotify_train)
## 'data.frame': 16945 obs. of 6 variables:
## $ acousticness: num 0.0419 0.0207 0.171 0.199 0.163 0.261 0.00761 0.0606 0.401 0.171 ...
## $ danceability: num 0.71 0.705 0.662 0.706 0.662 0.704 0.745 0.598 0.712 0.789 ...
## $ energy : num 0.793 0.803 0.794 0.869 0.489 0.643 0.669 0.815 0.776 0.638 ...
## $ speechiness : num 0.0694 0.037 0.0533 0.0517 0.0525 0.0301 0.0316 0.0858 0.0802 0.045 ...
## $ valence : num 0.604 0.555 0.641 0.603 0.506 0.413 0.513 0.383 0.659 0.427 ...
## $ genre : Factor w/ 4 levels "rock","latin",..: 2 2 2 2 2 2 2 2 2 2 ...
str(spotify_test)
## 'data.frame': 4237 obs. of 6 variables:
## $ acousticness: num 0.00474 0.116 0.512 0.231 0.148 0.361 0.0697 0.0689 0.136 0.123 ...
## $ danceability: num 0.617 0.833 0.664 0.664 0.496 0.57 0.614 0.632 0.652 0.362 ...
## $ energy : num 0.928 0.46 0.66 0.915 0.752 0.797 0.934 0.595 0.815 0.921 ...
## $ speechiness : num 0.153 0.0532 0.0243 0.123 0.237 0.0416 0.07 0.0401 0.131 0.486 ...
## $ valence : num 0.521 0.128 0.355 0.407 0.255 0.227 0.436 0.435 0.794 0.529 ...
## $ genre : Factor w/ 4 levels "rock","latin",..: 4 4 4 4 4 4 4 4 4 4 ...
dim(spotify_train)
## [1] 16945 6
dim(spotify_test)
## [1] 4237 6
As shown in the output, the training set contains 16,945 observations (80% of the data), and the testing set contains 4,237 observations (20% of the data). Both the training and testing sets have 6 variables in total, including the five audio feature predictors and the genre outcome. Upon inspection, the training and testing sets look good.
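As an optional sanity check (a short sketch using the objects created above), the class proportions in the training and testing sets can be compared to confirm that the stratified split kept the genre distribution similar in both sets.
# optional check (sketch): genre proportions in the training and testing sets
bind_rows(
  spotify_train %>% count(genre) %>% mutate(set = "train"),
  spotify_test  %>% count(genre) %>% mutate(set = "test")
) %>%
  group_by(set) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()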
3.3 Data Preprocessing Recipe
Three objects should be created in order to tune and evaluate predictive models when using the tidymodels package. These objects are the recipe, the model specification, and the workflow. The recipe for this model is created below.
As indicated previously, a normalizing step (step_normalize) was added to the recipe to center and scale all the numeric predictors.
# create the recipe for the model
model_recipe <- recipe(genre ~ ., data = spotify_train) %>%
step_normalize(all_numeric())
model_recipe
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 5
##
## Operations:
##
## Centering and scaling for all_numeric()
Upon inspection, the model recipe looks good.
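If further verification is desired, the recipe can be prepped and baked on the training data (an optional sketch, not required for the workflow below) to confirm that the predictors come out centered and scaled.
# optional check (sketch): apply the recipe to the training data and
# summarise the processed predictors (means near 0, sd near 1 after normalization)
model_recipe %>%
  prep(training = spotify_train) %>%
  bake(new_data = NULL) %>%
  summary()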
3.4 Model specification
The next step is to specify which algorithms will be used to conduct the predictive analysis. Each algorithm is assigned to a separate model specification so all the models can be evaluated and compared with one another.
As indicated in the project description, the algorithms used for this analysis are Random Forest, K-Nearest Neighbor, and Boosted Trees. Please note that some of these algorithms require specific packages: the Random Forest model requires the ranger package, and the Boosted Trees model requires the xgboost package. The mode argument for all the models should be set to ‘classification’. The specific engines that can be used for each algorithm are listed on the tidymodels website (https://parsnip.tidymodels.org/reference/set_engine.html).
# Random Forest model
library(ranger)
rf_model <-
rand_forest() %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
# K-Nearest Neighbor model
knn_model <-
nearest_neighbor(neighbors = 4) %>%
set_engine("kknn") %>%
set_mode("classification")
# Boosted Trees model
library(xgboost)
boost_model <-
boost_tree() %>%
set_engine("xgboost") %>%
set_mode("classification")
Any of the created objects can be inspected when needed. If the code runs without errors, the model specifications were created correctly.
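For example, printing a specification or piping it through translate() (an optional, quick sketch) shows how parsnip will call the underlying engine.
# optional inspection (sketch): how parsnip translates the spec into a ranger call
rf_model
rf_model %>% translate()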
3.5 Workflows
The next step is to create a workflow for each of the models computed in the previous phase. The recipe and the specific algorithm model objects should be added to the workflows in question. The workflow objects were created below.
# workflow for Random Forest
rf_wf <- workflow() %>%
add_model(rf_model) %>%
add_recipe(model_recipe)
# workflow for K-Nearest Neighbor
knn_wf <- workflow() %>%
add_model(knn_model) %>%
add_recipe(model_recipe)
# workflow for Boosted model
boost_wf <- workflow() %>%
add_model(boost_model) %>%
add_recipe(model_recipe)
3.6 Folds
A resampling object (usually labeled as a ‘folds’ object) should be created to define the number of resamples used to evaluate the models. The models can then be assessed across several resampling rounds when this object is passed as a parameter.
The folds object should be created from the training set. In addition, the outcome variable should be specified in the ‘strata =’ argument within the code. The cross-validation scheme for this analysis will have 5 folds. See the code for the creation of this object below.
# split the train set into folds
set.seed(9876)
folds <- vfold_cv(data = spotify_train, v = 5, strata = "genre")
3.7 Tuning parameters and Model evaluation
All the required objects for the prediction analysis have been created, so it is time to evaluate the models at hand. Each resampling call should include the workflow defined previously, the folds object, and the performance metrics that will be used to evaluate the model.
The two performance metrics requested for this assessment are precision and recall. However, additional evaluation metrics were also computed to give a fuller picture of each model’s performance. The additional metrics were the F-measure (f_meas), accuracy, Cohen’s kappa (kap), area under the curve (roc_auc), sensitivity (sens), and specificity (spec).
These evaluation metrics are measured on a scale from 0.0 to 1.0. A score closer to 0 indicates lower performance, and a score closer to 1 indicates higher performance. A performance score is often considered ‘good’ when it is 0.8 or higher (please note that acceptable thresholds vary depending on the aim of the project and the professional / academic field in question). The evaluation metrics for each model are reported below.
Note:
The resampling code for each model takes about 1 to 2 minutes to run. The reported evaluation scores are rounded in the discussion for a clearer assessment.
-Random Forest model
The Random Forest model is tuned and evaluated below.
# tune Random Forest model
set.seed(634)
rf_tune <-
rf_wf %>%
fit_resamples(
resamples = folds,
metrics = metric_set(
recall, precision, f_meas,
accuracy, kap,
roc_auc, sens, spec),
control = control_resamples(save_pred = TRUE)
)
# report evaluation metrics for the Random Forest model
rf_tune %>%
collect_metrics()
## # A tibble: 8 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy multiclass 0.605 5 0.00311 Preprocessor1_Model1
## 2 f_meas macro 0.602 5 0.00297 Preprocessor1_Model1
## 3 kap multiclass 0.469 5 0.00430 Preprocessor1_Model1
## 4 precision macro 0.607 5 0.00258 Preprocessor1_Model1
## 5 recall macro 0.603 5 0.00334 Preprocessor1_Model1
## 6 roc_auc hand_till 0.835 5 0.00268 Preprocessor1_Model1
## 7 sens macro 0.603 5 0.00334 Preprocessor1_Model1
## 8 spec macro 0.867 5 0.00110 Preprocessor1_Model1
The precision (0.61) and recall (0.60) scores for the Random Forest model were nearly identical. These scores are not particularly high. In contrast, the scores for roc-auc (0.83) and specificity (0.87) were notably better. These scores raise interesting questions about the performance and parameters of the model. Let’s observe how the K-Nearest Neighbor and Boosted Trees models perform in relation to this model.
-K-Nearest Neighbor model
The K-Nearest Neighbor model is tuned and evaluated below.
set.seed(768)
# tune K-Nearest Neighbor model
knn_tune <-
knn_wf %>%
fit_resamples(
resamples = folds,
metrics = metric_set(
recall, precision, f_meas,
accuracy, kap,
roc_auc, sens, spec),
control = control_resamples(save_pred = TRUE)
)
# report evaluation metrics for the K-Nearest Neighbor model
knn_tune %>%
collect_metrics()
## # A tibble: 8 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy multiclass 0.521 5 0.00546 Preprocessor1_Model1
## 2 f_meas macro 0.522 5 0.00571 Preprocessor1_Model1
## 3 kap multiclass 0.358 5 0.00729 Preprocessor1_Model1
## 4 precision macro 0.522 5 0.00580 Preprocessor1_Model1
## 5 recall macro 0.522 5 0.00557 Preprocessor1_Model1
## 6 roc_auc hand_till 0.761 5 0.00234 Preprocessor1_Model1
## 7 sens macro 0.522 5 0.00557 Preprocessor1_Model1
## 8 spec macro 0.839 5 0.00181 Preprocessor1_Model1
The precision and recall scores for the K-Nearest Neighbor model were also low and essentially identical (0.52). The roc-auc score was decent (0.76) and the specificity score was acceptable (0.84). Similar patterns might be observed in the Boosted Trees model.
-Boosted Trees
The Boosted Trees model is tuned and evaluated below.
set.seed(143)
# tune the Boosted Trees model
boost_tune <- boost_wf %>%
fit_resamples(
resamples = folds,
metrics = metric_set(
recall, precision, f_meas,
accuracy, kap,
roc_auc, sens, spec),
control = control_resamples(save_pred = TRUE)
)
# report evaluation metrics for the Boosted Trees model
boost_tune %>%
collect_metrics()
## # A tibble: 8 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy multiclass 0.584 5 0.00343 Preprocessor1_Model1
## 2 f_meas macro 0.580 5 0.00328 Preprocessor1_Model1
## 3 kap multiclass 0.440 5 0.00469 Preprocessor1_Model1
## 4 precision macro 0.588 5 0.00297 Preprocessor1_Model1
## 5 recall macro 0.580 5 0.00347 Preprocessor1_Model1
## 6 roc_auc hand_till 0.824 5 0.00280 Preprocessor1_Model1
## 7 sens macro 0.580 5 0.00347 Preprocessor1_Model1
## 8 spec macro 0.859 5 0.00119 Preprocessor1_Model1
The precision score for the Boosted Trees model is 0.59, whereas the recall score is 0.58. As observed with the other models, these scores are very close. Similar to the Random Forest model, the roc-auc score is acceptable (0.82) and the specificity score is the highest of its metrics (0.86). Overall, the pattern of performance metrics is consistent across the three models.
3.8 Best performing model
Let’s gather the highest scores on all the reported metrics to determine which model had the best performance (this chunk of code takes about 40 seconds to run).
Highest performance scores
# report highest precision score
best_resamples <-
bind_rows(
show_best(rf_tune, metric = "precision", n = 1) %>% mutate(model = "Random Forest") %>% select(model, precision = mean),
show_best(knn_tune, metric = "precision", n = 1) %>% mutate(model = "K-NN") %>% select(model, precision = mean),
show_best(boost_tune, metric = "precision", n = 1) %>% mutate(model = "Boosted Trees") %>% select(model, precision = mean)
)
best_resamples %>%
arrange(desc(precision)) %>%
knitr::kable()
| model | precision |
|---|---|
| Random Forest | 0.6070627 |
| Boosted Trees | 0.5884896 |
| K-NN | 0.5222312 |
# report highest recall score
best_resamples <-
bind_rows(
show_best(rf_tune, metric = "recall", n = 1) %>% mutate(model = "Random Forest") %>% select(model, recall = mean),
show_best(knn_tune, metric = "recall", n = 1) %>% mutate(model = "K-NN") %>% select(model, recall = mean),
show_best(boost_tune, metric = "recall", n = 1) %>% mutate(model = "Boosted Trees") %>% select(model, recall = mean)
)
best_resamples %>%
arrange(desc(recall)) %>%
knitr::kable()
| model | recall |
|---|---|
| Random Forest | 0.6025992 |
| Boosted Trees | 0.5800721 |
| K-NN | 0.5221858 |
# report highest roc-auc score
best_resamples <-
bind_rows(
show_best(rf_tune, metric = "roc_auc", n = 1) %>% mutate(model = "Random Forest") %>% select(model, roc_auc = mean),
show_best(knn_tune, metric = "roc_auc", n = 1) %>% mutate(model = "K-NN") %>% select(model, roc_auc = mean),
show_best(boost_tune, metric = "roc_auc", n = 1) %>% mutate(model = "Boosted Trees") %>% select(model, roc_auc = mean)
)
best_resamples %>%
arrange(desc(roc_auc)) %>%
knitr::kable()
| model | roc_auc |
|---|---|
| Random Forest | 0.8349646 |
| Boosted Trees | 0.8243245 |
| K-NN | 0.7605227 |
# report highest specificity score
best_resamples <-
bind_rows(
show_best(rf_tune, metric = "spec", n = 1) %>% mutate(model = "Random Forest") %>% select(model, spec = mean),
show_best(knn_tune, metric = "spec", n = 1) %>% mutate(model = "K-NN") %>% select(model, spec = mean),
show_best(boost_tune, metric = "spec", n = 1) %>% mutate(model = "Boosted Trees") %>% select(model, spec = mean)
)
best_resamples %>%
arrange(desc(spec)) %>%
knitr::kable()
| model | spec |
|---|---|
| Random Forest | 0.8665823 |
| Boosted Trees | 0.8592468 |
| K-NN | 0.8391618 |
Results showed that the Random Forest model had the highest performance with the following evaluation scores: specificity (0.87), area under curve (0.83), precision (0.61), and recall (0.60). A general assessment of model performance is provided in the final phase of this analysis.
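As an optional alternative to the per-metric tables above (a sketch built from the resampling objects already created), all the collected metrics can also be combined into a single comparison table.
# optional sketch: one comparison table across models and metrics
bind_rows(
  collect_metrics(rf_tune)    %>% mutate(model = "Random Forest"),
  collect_metrics(knn_tune)   %>% mutate(model = "K-NN"),
  collect_metrics(boost_tune) %>% mutate(model = "Boosted Trees")
) %>%
  select(model, .metric, mean) %>%
  pivot_wider(names_from = .metric, values_from = mean)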
3.9 Resampling predictions
Predictions drawn from resampling
The best performing model has been identified (Random Forest). Thus, we can proceed to inspect the predictions drawn from this model. The predictions collected below are the out-of-fold predictions saved during resampling on the training folds (hence the 16,945 rows); predictions on the held-out test set are generated when the final model is fitted in a later section. These predictions are assessed below.
# collect out-of-fold predictions from the resampling results
test_predictions <- rf_tune %>% collect_predictions()
test_predictions
## # A tibble: 16,945 × 9
## id .pred_class .row .pred_rock .pred_latin .pred_rap .pred_pop genre
## <chr> <fct> <int> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 Fold1 pop 1 0.0313 0.255 0.152 0.562 latin
## 2 Fold1 pop 2 0.0851 0.169 0.0364 0.709 latin
## 3 Fold1 rap 11 0.0811 0.308 0.410 0.201 latin
## 4 Fold1 latin 14 0.0571 0.513 0.291 0.139 latin
## 5 Fold1 pop 16 0.0330 0.0823 0.110 0.775 latin
## 6 Fold1 pop 27 0.0397 0.288 0.236 0.436 latin
## 7 Fold1 pop 38 0.386 0.168 0.00837 0.437 latin
## 8 Fold1 rap 39 0.00701 0.172 0.649 0.172 latin
## 9 Fold1 pop 41 0.189 0.0607 0.0383 0.712 latin
## 10 Fold1 rap 42 0.0571 0.170 0.457 0.315 latin
## # … with 16,935 more rows, and 1 more variable: .config <chr>
Looking at the output, each row shows the predicted class, the class probabilities, and the observed genre, so the predictions can be compared against the truth. The confusion matrix below summarizes this comparison.
Preliminary confusion matrix
A preliminary confusion matrix was generated from the predictions at hand in order to evaluate some of the observations. See the confusion matrix below.
# generate a preliminary confusion matrix with the resampling predictions
test_predictions %>%
conf_mat(truth = genre, estimate = .pred_class)
## Truth
## Prediction rock latin rap pop
## rock 2561 283 105 764
## latin 204 1599 528 583
## rap 96 1004 3079 595
## pop 834 965 727 3018
The confusion matrix shows that the majority of the rock, rap, and pop observations were correctly predicted, whereas fewer than half of the latin tracks were classified correctly. When looking at proportions, the predictions made for rock fared consistently well. Moreover, there seems to be substantial overlap between the pop and latin music genres. This suggests that several audio feature scores might be fairly similar across these two genres.
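To back up these proportions, the counts in the confusion matrix can be converted into per-class recall (a short follow-up sketch based on the predictions collected above).
# follow-up sketch: proportion of each true genre predicted correctly
test_predictions %>%
  count(genre, .pred_class) %>%
  group_by(genre) %>%
  mutate(prop = n / sum(n)) %>%
  filter(genre == .pred_class) %>%
  ungroup()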
3.10 Fitting the final model
We should fit the final model to complete the analytical procedures at hand. The last_fit function from the tidymodels package is used to fit the best performing model (in this case, the Random Forest model) to the training data and evaluate it on the test set. The workflow for the best performing model, the data split object, and the performance metrics of interest should be passed to this function. See the performance results reported below.
# fit the final model
last_fit_rf <- last_fit(rf_wf,
split = spotify_split,
metrics = metric_set(
recall, precision, f_meas,
accuracy, kap,
roc_auc, sens, spec)
)
# compute performance score for the final model
last_fit_rf %>%
collect_metrics()
## # A tibble: 8 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 recall macro 0.614 Preprocessor1_Model1
## 2 precision macro 0.620 Preprocessor1_Model1
## 3 f_meas macro 0.615 Preprocessor1_Model1
## 4 accuracy multiclass 0.617 Preprocessor1_Model1
## 5 kap multiclass 0.485 Preprocessor1_Model1
## 6 sens macro 0.614 Preprocessor1_Model1
## 7 spec macro 0.871 Preprocessor1_Model1
## 8 roc_auc hand_till 0.843 Preprocessor1_Model1
Looking at the results, we can tell that the precision and recall scores improved slightly in the final model. The performance scores for the final model are the following: precision (0.62), recall (0.61), roc-auc (0.84), and specificity (0.87).
General assessment of model performance
In order to conduct a well informed assessment of model performance, it is important to discuss differences between evaluation metrics and understand how each of these metrics inform the model.
For instance, precision is the proportion of predicted cases of a class that truly belong to that class (it is penalized by false positives), whereas recall is the proportion of actual cases of a class that the model correctly identifies (it is penalized by false negatives). The area under the curve summarizes how well a model separates the classes across all classification thresholds of the Receiver Operating Characteristic (ROC) curve; the larger the area under the ROC curve, the better the model distinguishes the classes. Lastly, specificity is the true negative rate, that is, the proportion of cases outside a class that the model correctly classifies as not belonging to that class.
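To make these definitions concrete, here is a minimal sketch with purely hypothetical counts (not taken from this analysis) showing how each metric is computed from the cells of a single-class confusion matrix.
# toy illustration with hypothetical counts (tp/fp/fn/tn are made up)
tp <- 80; fp <- 20; fn <- 10; tn <- 90
precision   <- tp / (tp + fp)   # penalized by false positives
recall      <- tp / (tp + fn)   # penalized by false negatives (equals sensitivity)
specificity <- tn / (tn + fp)   # true negative rate
c(precision = precision, recall = recall, specificity = specificity)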
The specificity and roc-auc scores were notably higher than the precision and recall scores. This suggests that the model is better at ruling out the classes a track does not belong to than at assigning the correct class directly. Specificity and roc-auc scores can be inflated when the outcome classes are imbalanced; however, this is unlikely to be the case here, as the preliminary assessments showed that the classes are fairly balanced. It is also worth noting that, with four roughly equal classes, each class has about three times as many negative cases as positive ones, which makes a high specificity easier to achieve than a high precision or recall. Sources discussing which evaluation metrics are most appropriate for multinomial classification models should be further examined.
Visualizations can show interesting insights about predictive models. Hence, the following visualizations were computed to assess the classifier model at hand.
4.1 Distributions of predicted classes
This plot was produced to illustrate prediction distributions among the four music genres. Insights are discussed below.
# compute distributions of predicted classes
test_predictions %>%
ggplot() +
geom_density(aes(x = .pred_class, fill = genre),
alpha = 0.5)
The prediction distributions for the latin genre seem to be somewhat leptokurtic. This might be attributed to the assumption that latin songs tend to be more energetic and danceable than songs from other music genres. On the other hand, the distributions for the rock genre seem to be slightly platykurtic, indicating fewer extreme cases than would be expected in a normal distribution. Thus, audio feature scores in the rock genre might be lower than those observed in the other music genres. These assumptions could be tested by conducting an ANOVA (followed by a post-hoc test) to compare audio feature scores across the four music genres, as sketched below.
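A sketch of that suggested follow-up is provided here (it was not run as part of this analysis); it compares danceability across genres with a one-way ANOVA and Tukey's post-hoc test, and the same template would apply to the other audio features.
# sketch of the suggested follow-up: one-way ANOVA with Tukey post-hoc test
dance_aov <- aov(danceability ~ genre, data = spotify)
summary(dance_aov)
TukeyHSD(dance_aov)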
4.2 Variables of Importance
A plot was computed to show the variables of importance in the classifier model. Insights about the variables in question are discussed below.
# plot variables of importance
library(vip)
last_fit_rf %>%
pluck(".workflow", 1) %>%
extract_fit_parsnip() %>%
vip(num_features = 5)
This plot indicates that the three audio features with the highest importance scores are speechiness, danceability, and acousticness. The highest speechiness scores might be observed in the rap genre, as rap songs tend to contain more spoken words than songs in other genres. On the other hand, the highest danceability scores might be observed in the latin genre, due to latin songs being perceived as highly danceable (as suggested in the previous assessment). Curiously, acousticness had a slight edge over valence in importance. This suggests that acoustic sounds might help differentiate music genres from one another. It might be interesting to find out whether certain acoustic sounds (such as sounds linked to specific instruments) are more prominent in some music genres than in others.
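If numeric importance values are needed in addition to the plot, the same fitted model can be passed to vip::vi() (an optional sketch using the objects created above).
# optional sketch: variable importance scores as a table instead of a plot
last_fit_rf %>%
  pluck(".workflow", 1) %>%
  extract_fit_parsnip() %>%
  vi()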
4.3 Final Confusion Matrix
An additional confusion matrix was computed to assess predictions from the final model. Insights about genre classifications in the matrix are discussed below.
# compute final confusion matrix
last_fit_rf %>%
collect_predictions() %>%
conf_mat(genre, .pred_class) %>%
autoplot(type = "heatmap")
In this confusion matrix heatmap, cells with higher counts are shown in darker shades, so the most accurate predictions stand out along the diagonal. Based on the results, the music genre with the most accurate predictions is rock (with roughly 70.84% of cases correctly classified). This observation is consistent with insights drawn from the preliminary confusion matrix. The second music genre with the most accurate predictions is rap, which might be attributed to the perceived high levels of speechiness in rap songs, as discussed previously. The third genre with the most accurate predictions is pop, followed by latin (the genre with the least accurate predictions in the matrix). These findings are consistent with the assumption that latin songs might be more difficult to classify due to their overlap with other music genres in terms of audio features.
End of Project