Spotify Genre

Introduction

Recall that the first time we logged in to Spotify, it asked what our favourite music genres were, so that it could recommend songs we might like in Daily Mixes or Discover Weekly. It also uses our listening data, such as listening history, to learn our preferences.

But how does Spotify classify songs into broad genres? What are the characteristic features of each genre, and how do the features of a song determine its genre? Those are the questions this project sets out to answer.

Each song has 12 audio features, and the labels consist of 6 broad genres and 24 subgenres. Details are available in the Data Preparation and Exploratory Data Analysis sections. I'll analyze these 14 variables to answer the questions raised in the last paragraph.

During this project, you’ll see:

  1. The correlation between features.

  2. The feature pattern of each genre.

  3. Given the features of a song, how we can classify its genre.

To fulfill those goals, I’ll conduct:

  1. Exploratory Data Analysis:

    • Visualization of audio features across genres
    • Correlation between features
  2. Data Modeling:

    • Random Forest
    • Support Vector Machine (SVM)

Let's explore the spotify_songs dataset to discover the patterns behind it.

Packages Required

Packages required for this project are:

library(tidyverse)  
library(knitr)       # kable() function will be used to display output
library(corrplot)    # corrplot() function will be used to explore correlation
library(randomForest)# package for random forest
library(e1071)       # package for support vector machine

Data Preparation

Load the data

The very first step is to load the spotify_songs dataset. This dataset originally comes from the spotifyr package, an R wrapper for pulling track audio features and other information from Spotify's Web API in bulk. By automatically batching API requests, it allows you to enter an artist's name and retrieve their entire discography in seconds, along with Spotify's audio features and track/album popularity metrics.
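For reference, here is a minimal sketch of how such a pull might look with spotifyr. It is not run in this project, which works from a pre-built CSV, and it assumes valid Spotify API credentials (the credential strings below are placeholders):

library(spotifyr)

# Placeholder credentials -- spotifyr reads these environment variables
Sys.setenv(SPOTIFY_CLIENT_ID = "your_client_id")
Sys.setenv(SPOTIFY_CLIENT_SECRET = "your_client_secret")

# Pull an artist's discography together with Spotify's audio features
beatles_features <- get_artist_audio_features("the beatles")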

songs <- read_csv('spotify_songs.csv')

Data Documentation

There are 32833 observations and 23 variables in the dataset.

Here are the variables we have:

Variables Description
track_id Song unique ID
track_name Song Name
track_artist Song Artist
track_popularity Song Popularity (0-100) where higher is better
track_album_id Album unique ID
track_album_name Song album name
track_album_release_date Date when album released
playlist_name Name of playlist
playlist_id Playlist ID
playlist_genre Playlist genre
playlist_subgenre Playlist subgenre
danceability Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms Duration of song in milliseconds

As our main purpose is to explore the characteristics and classification of each genre, the variables of interest are playlist_genre, playlist_subgenre and the numeric audio feature variables that follow, which are columns 10 to 23.

Data Cleaning

Missing Value

Now let's check the number of missing values.

n_of_missing <- c()

for (i in 1:ncol(songs)) {
  n_of_missing[i] = sum(is.na(songs[i]))
}

missing <- data.frame(rbind(n_of_missing))
colnames(missing) <- colnames(songs)
missing[which(missing > 0)]
##              track_name track_artist track_album_name
## n_of_missing          5            5                5

There are 5 missing values in each of the columns track_name, track_artist and track_album_name.

which(is.na(songs$track_name))
## [1]  8152  9283  9284 19569 19812
which(is.na(songs$track_artist)) == which(is.na(songs$track_name))
## [1] TRUE TRUE TRUE TRUE TRUE
which(is.na(songs$track_album_name)) == which(is.na(songs$track_name))
## [1] TRUE TRUE TRUE TRUE TRUE

We find that the indices of the missing values are the same across these variables, which means only 5 observations have missing values.

We can keep those observations: the missing values are not in columns 10 to 23, so these observations still have values in all variables of interest, and we don't need to remove them.
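As a quick sanity check, we can confirm that the columns of interest contain no missing values (a one-line sketch, assuming the column positions 10 to 23 described above):

# Count missing values among the columns of interest (positions 10 to 23); expected to be 0
sum(is.na(songs[, 10:23]))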

Variable Type

Convert the variables playlist_genre and playlist_subgenre to categorical variables. As we want to explore the classification of genre, playlist_genre and playlist_subgenre should be transformed from character to factor to facilitate our analysis.

There is another variable, mode, that needs to be converted into a categorical variable. As mentioned in the data documentation, mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived; major is represented by 1 and minor by 0. Thus, instead of being a numeric variable, mode is categorical in nature.

songs <- songs %>%
  mutate(playlist_genre = as.factor(playlist_genre),
         playlist_subgenre = as.factor(playlist_subgenre),
         mode = as.factor(mode))

Outlier Identification

Now let's focus on the numeric variables and check whether there are any outliers in the dataset.

summary(songs[,c(13:15, 17:23)])
##      energy              key            loudness        speechiness    
##  Min.   :0.000175   Min.   : 0.000   Min.   :-46.448   Min.   :0.0000  
##  1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171   1st Qu.:0.0410  
##  Median :0.721000   Median : 6.000   Median : -6.166   Median :0.0625  
##  Mean   :0.698619   Mean   : 5.374   Mean   : -6.720   Mean   :0.1071  
##  3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645   3rd Qu.:0.1320  
##  Max.   :1.000000   Max.   :11.000   Max.   :  1.275   Max.   :0.9180  
##   acousticness    instrumentalness       liveness         valence      
##  Min.   :0.0000   Min.   :0.0000000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0151   1st Qu.:0.0000000   1st Qu.:0.0927   1st Qu.:0.3310  
##  Median :0.0804   Median :0.0000161   Median :0.1270   Median :0.5120  
##  Mean   :0.1753   Mean   :0.0847472   Mean   :0.1902   Mean   :0.5106  
##  3rd Qu.:0.2550   3rd Qu.:0.0048300   3rd Qu.:0.2480   3rd Qu.:0.6930  
##  Max.   :0.9940   Max.   :0.9940000   Max.   :0.9960   Max.   :0.9910  
##      tempo         duration_ms    
##  Min.   :  0.00   Min.   :  4000  
##  1st Qu.: 99.96   1st Qu.:187819  
##  Median :121.98   Median :216000  
##  Mean   :120.88   Mean   :225800  
##  3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :239.44   Max.   :517810

Some variables need attention:

  • loudness: Its minimum is -46.448 while the first quartile is -8.171. Let's check the distribution.
ggplot(songs, aes(x = 1, y = loudness)) +
  geom_boxplot() +
  coord_flip() +
  ggtitle('Boxplot of Loudness')

From the boxplot above we can see that loudness has a left-skewed distribution (a long tail toward low values), but the minimum value still looks like an outlier, as it is quite isolated from the rest of the data. Locating this observation (see the sketch below) shows that it comes from the genre latin, so we can compare the distribution of latin's loudness with that of the other genres.
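A minimal sketch of one way to locate this track (an assumption, using the dplyr verbs loaded via tidyverse):

# Find the track with the minimum loudness and check its genre
songs %>%
  filter(loudness == min(loudness)) %>%
  select(track_name, track_artist, playlist_genre, loudness)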

ggplot(songs, aes(x = 1, y = loudness)) +
  geom_boxplot() +
  facet_grid(playlist_genre~., scales = "free_x") +
  coord_flip() +
  labs(title = "Boxplot of Loudness",
       subtitle = 'By different Genre') +
  theme(axis.text.y  = element_blank())

Genre latin does have relatively more outliers on the low end than the other genres. Therefore, this minimum value may well be genuine, as latin appears to have a longer left tail than the other genres.

Besides, the data documentation says loudness values typically range between -60 and 0 dB, so the minimum value of -46.448 is acceptable.

  • instrumentalness: the average value of instrumentalness is 0.0847, much larger than its median of 0.0000161 (1.61e-05). Let's check the distribution.
ggplot(songs, aes(x = instrumentalness)) +
  geom_histogram(binwidth = 0.1, bins = 10) +
  ggtitle("Histogram of Instrumentalness")

We see that the majority (about 85.4%) of observations have an instrumentalness value no larger than 0.1, which is why the difference between the mean and the median of instrumentalness is so large.
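This share can be computed directly (a one-line sketch):

# Share of tracks with instrumentalness no larger than 0.1
mean(songs$instrumentalness <= 0.1)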

  • tempo: The maximum value of tempo is 239.44, much larger than the third quartile of 133.92. The minimum is 0, which would mean the song has no tempo at all. Let's check the distribution.
ggplot(songs, aes(x = 1, y = tempo)) +
  geom_boxplot() +
  coord_flip() +
  ggtitle('Boxplot of Tempo') 

The boxplot of tempo shows outliers on both ends. In particular, the minimum value of 0 makes no sense, as it would mean the overall estimated tempo of the track is 0 beats per minute. Locating this observation shows that it comes from the genre rock. Now we can compare the distribution of rock's tempo with that of the other genres.

ggplot(songs, aes(x = 1, y = tempo)) +
  geom_boxplot() +
  facet_grid(playlist_genre~., scales = "free_x") +
  coord_flip() +
  labs(title = "Boxplot of Different Genre's Tempo",
       subtitle = 'By different Genre') +
  theme(axis.text.y  = element_blank())

The minimum of tempo does look like an outlier; it should be removed from the dataset to avoid skewing the analysis.

songs <- songs[-which(songs$tempo == min(songs$tempo)),]

Data Preparation Summary

Now the data is ready for exploratory data analysis.

  • There are 32832 observations and 23 variables in the dataset. Variable names are listed below.
colnames(songs)
##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"

The variables of interest are playlist_genre, playlist_subgenre and the audio feature variables that follow, i.e. the columns from playlist_genre to duration_ms.

  • There are 5 observations with missing values. But those missing values don't belong to the variables of interest, so I keep them in the dataset.

  • There is 1 outlier in the variable tempo (value = 0, which would mean the song has no tempo at all and makes no sense); it is removed from the dataset.

  • Among the 23 variables, three of them (playlist_genre, playlist_subgenre and mode) are categorical variables.

kable(songs %>% 
  group_by(playlist_genre) %>% 
  summarise(total = n()))
playlist_genre total
edm 6043
latin 5155
pop 5507
r&b 5431
rap 5746
rock 4950
kable(songs %>% 
  group_by(playlist_subgenre) %>% 
  summarise(total = n()) %>%
  head(n = 6))
playlist_subgenre total
album rock 1064
big room 1206
classic rock 1296
dance pop 1298
electro house 1511
electropop 1408
kable(songs %>% 
  group_by(mode) %>% 
  summarise(total = n()) %>%
  head(n = 6))
mode total
0 14259
1 18573
  • The columns from danceability to duration_ms, excluding mode, are numeric variables.

The table below shows descriptive statistics of the numeric variables in the dataset.

Variables Minimum 1st Quartile Mean Median 3rd Quartile Maximum
danceability 0.0771 0.563 0.6548695 0.672 0.761 0.983
energy 0.000175 0.581 0.698631 0.721 0.84 1
key 0 2 5.374604 6 9 11
loudness -46.448 -8.171 -6.7189092 -6.166 -4.645 1.275
speechiness 0.0224 0.041 0.1070713 0.0625 0.132 0.918
acousticness 0.0000014 0.0151 0.1753391 0.0804 0.255 0.994
liveness 0.00936 0.0927 0.190182 0.127 0.248 0.996
valence 0.00001 0.331 0.5105765 0.512 0.693 0.991
tempo 35.477 99.96 120.8848134 121.984 133.91825 239.44
duration_ms 29493 187821.25 225806.57 216000.5 253585 517810
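For reference, a minimal sketch of how a table like the one above could be generated with dplyr and kable() (an assumption about one possible way, not the exact code used; this version also includes instrumentalness):

songs %>%
  select(danceability:duration_ms) %>%
  select(-mode) %>%                       # mode is categorical, drop it
  pivot_longer(everything(), names_to = "Variables") %>%
  group_by(Variables) %>%
  summarise(Minimum = min(value),
            `1st Quartile` = quantile(value, 0.25),
            Mean = mean(value),
            Median = median(value),
            `3rd Quartile` = quantile(value, 0.75),
            Maximum = max(value)) %>%
  kable()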

Exploratory Data Analysis

The feature pattern of each genre

First, let's explore the feature pattern of each genre.

feature_names <- names(songs)[c(12:15,17:23)]

long_songs <- songs %>%
  select(c('playlist_genre', feature_names)) %>%
  pivot_longer(cols = feature_names) 

long_songs %>%
  ggplot(aes(x = value)) +
  geom_freqpoly(aes(color = playlist_genre)) +
  facet_wrap(~name, ncol = 3, scales = 'free') +
  labs(title = 'Spotify Audio Feature Pattern',
       subtitle = 'By different Genre', 
       x = '', y = '') +
  theme(axis.text.y = element_blank())

From the plot above it is clear that different genres do have different patterns in their audio features. For variables like instrumentalness, key, liveness and loudness, the distributions of the different genres don't vary much and largely overlap. But for variables like danceability, energy and valence, the genres have quite different distributions. Those variables could be more helpful for classification. Let's take a closer look at these three variables.

long_songs %>%
  filter(name %in% c('danceability', 'energy', 'valence')) %>%
  ggplot(aes(x = value)) +
  geom_freqpoly(aes(color = playlist_genre)) +
  facet_wrap(~name, ncol = 3, scales = 'free') +
  labs(title = 'Pattern on Danceability, Energy and Valence',
       subtitle = 'By different Genre', 
       x = '', y = '') +
  theme(axis.text.y = element_blank())

It's clear that some genres have distinct feature patterns on those variables. For example, edm appears to have medium danceability, higher energy and low valence compared to other genres, while rock appears to have lower danceability, higher energy and medium valence. These figures suggest that, based on the combination of a song's audio features, it could be possible to classify it into a specific genre with some confidence.

The correlation between features

Let's also check the correlation between audio features and draw a correlation plot.

songs %>%
  select(feature_names) %>%
  cor() %>%
  corrplot(method = 'color', order = 'hclust',  type = 'upper', 
           diag = TRUE, main = 'Correlation between Audio Features',
           mar = c(2,2,2,2))

We can see that there is a strong negative correlation between acousticness and energy, a relatively strong negative correlation between acousticness and loudness, and a relatively strong positive correlation between energy and loudness. This is consistent with our common sense.

Data Modeling

Multiple models will be built to explore the classification of playlist genre. Model performance will be evaluated using the in-sample (a randomly chosen 75% of the original dataset) and out-of-sample (the remaining 25%) misclassification rate (MR).

Train-test Split

The first step is to randomly split the whole dataset into a training (75%) and a testing (25%) set for model validation.

set.seed(7052)
idx <- sample(1:nrow(songs), nrow(songs) * .75)

songs <- songs[,c(10, 12:23)]
songs_train <- songs[idx,]
songs_test <- songs[-idx,]

Random Forest

Random forest is an extension of bagging, in which an ensemble of trees is aggregated. By bootstrapping (sampling with replacement) from the data and fitting a tree to each bootstrap sample, random forest aggregates those trees and achieves a significant improvement in prediction.

The idea of random forests is to randomly select m out of p predictors as candidate variables for each split in each tree. Commonly, m = √p in classification. The reason for doing this is that it decorrelates the trees, which reduces the variance when we aggregate them.

In this particular case, m should be about the square root of the number of predictors, √12 ≈ 3.5, so m = 4 is used. Let's fit this model on the training data and make predictions on both the training and testing data.

songs_rf <- randomForest(playlist_genre~., data = songs_train, mtry = 4)

pred_train <- predict(songs_rf)
pred_test <- predict(songs_rf, songs_test)

The misclassification rate (MR) is defined as the rate at which a song's genre is misclassified. We can compute both the in-sample and out-of-sample misclassification rates based on the predictions from the last step, as sketched below.
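A minimal sketch (one straightforward way to compute it, reusing the predictions and splits defined above):

# Misclassification rate: share of songs whose genre is predicted incorrectly
mr_train <- mean(pred_train != songs_train$playlist_genre)   # in-sample (training data)
mr_test  <- mean(pred_test  != songs_test$playlist_genre)    # out-of-sample (testing data)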

In-sample MR (Training data) Out-of-sample MR (Testing data)
0.445947 0.454191

We see that the in-sample and out-of-sample MR are both around 0.45 and close to each other, indicating that there is no overfitting problem in this model. Let's check the confusion matrices for more detail.

In-sample Confusion Matrix

kable(songs_rf$confusion)
edm latin pop r&b rap rock class.error
edm 3116 228 644 162 228 139 0.3101616
latin 367 1517 723 458 643 169 0.6087181
pop 742 495 1394 603 280 606 0.6616505
r&b 158 324 524 1948 824 310 0.5234834
rap 229 367 228 524 2877 111 0.3364852
rock 117 92 357 271 58 2791 0.2428106

Out-of-sample Confusion Matrix

kable(table(pred_test, songs_test$playlist_genre))
edm latin pop r&b rap rock
edm 1024 133 253 43 90 46
latin 81 519 161 113 100 28
pop 230 202 453 186 72 120
r&b 66 153 223 618 170 113
rap 88 220 104 279 933 24
rock 37 51 193 104 45 933

From the two tables above we can see that the random forest model classifies rock, edm and rap fairly well, with misclassification rates of 24.70%, 31.30% and 34.61% respectively, while it performs worse at identifying r&b, latin and pop, with misclassification rates of 52.84%, 58.06% and 67.06%.
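For reference, per-genre error rates can be read off a confusion matrix like this (a short sketch using the out-of-sample objects above; the exact values depend on which confusion matrix is used):

# Per-genre out-of-sample error rate: 1 - correct / total, by true genre
cm_test <- table(truth = songs_test$playlist_genre, prediction = pred_test)
round(1 - diag(cm_test) / rowSums(cm_test), 4)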

SVM

SVM is probably one of the best off-the-shelf classifiers for many problems. An SVM model is a representation of the examples as points in space, mapped so that examples of the separate categories are divided by a clear gap that is as wide as possible. It handles nonlinearity, is well regularized (avoiding overfitting), has few parameters, and is fast for a large number of observations.

In this particular case, we still use the training data to build the SVM model and use the testing data for model validation.

song_svm <- svm(playlist_genre ~ ., data = songs_train, cost = 1)

pred_train_svm <- predict(song_svm)
pred_test_svm <- predict(song_svm, songs_test)
In-sample MR (Training data) Out-of-sample MR (Testing data)
0.440627 0.4678363

We see that the in-sample MR is 0.441 and the out-of-sample MR is 0.468; they are not far from each other and are similar to the performance of the random forest. Let's check the confusion matrices for more detail.

In-sample Confusion Matrix

kable(table(songs_train$playlist_genre, pred_train_svm))
edm latin pop r&b rap rock
edm 3029 196 697 136 276 183
latin 330 1521 681 483 653 209
pop 612 402 1657 524 359 566
r&b 159 368 487 1893 882 299
rap 259 353 252 378 2988 106
rock 179 115 350 299 57 2686

Out-of-sample Confusion Matrix

kable(table(songs_test$playlist_genre, pred_test_svm))
edm latin pop r&b rap rock
edm 966 69 263 55 119 54
latin 112 506 221 147 241 51
pop 224 142 514 196 114 197
r&b 40 150 168 576 297 112
rap 106 107 87 135 938 37
rock 65 41 141 126 23 868

Similarly, the SVM model classifies rock, edm and rap quite well, but does not do as well at identifying r&b, latin and pop.
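Note that the SVM above uses a fixed cost = 1. As a possible next step (not performed in this project), the cost parameter could be tuned with e1071's tune() function; a hedged sketch is below, noting that it can be slow on roughly 24,000 training rows:

# Hypothetical tuning sketch (not run in this project): grid search over the cost parameter
svm_tune <- tune(svm, playlist_genre ~ ., data = songs_train,
                 ranges = list(cost = c(0.1, 1, 10)))
summary(svm_tune)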

Summary

In this project, using data that originally comes from the spotifyr package, I've explored:

  1. The audio feature pattern of each genre, by visualizing audio features across genres.

  2. The correlation between features, by computing the correlation coefficients.

  3. Given the features of a song, how to classify its genre using a random forest model and an SVM model.

Those explorations could give us some hints on:

  1. The specific feature pattern of each genre. For example, edm appears to have medium danceability, higher energy and low valence compared to other genres, while rock appears to have lower danceability, higher energy and medium valence.

  2. The relationship between audio features. For example, there is a strong negative correlation between acousticness and energy, a relatively strong negative correlation between acousticness and loudness, and a relatively strong positive correlation between energy and loudness, all of which are consistent with our common sense.

  3. How classification models work on this problem. The random forest model is more accurate at classifying rock, edm and rap, but does not work as well at classifying r&b, latin and pop.

This project does give us some insight into audio feature patterns and genre classification; however, due to limitations of data, models and time, it is far from perfect. More data could be imported and more sophisticated models, such as neural networks, could be built to extract more insights and information from the data.