Recall that the first time we logged in to Spotify, it asked what our favourite music genres are so that it could recommend songs we might like in Daily Mixes or Discover Weekly. It also uses our listening data, such as listening history, to learn our preferences.
But how can Spotify classify songs into broad genres? What characterises each genre, and how do a song's features determine its genre? Those are the questions this project sets out to answer.
Each song has 12 audio features, and the labels consist of 6 broad genres and 24 subgenres. Details are available in the Data Preparation and Exploratory Data Analysis parts. I'll analyse those 14 variables to answer the questions asked in the last paragraph.
During this project, you’ll see:
- The correlation between audio features.
- The feature pattern of each genre.
- How we can classify a song's genre given its features.
To fulfill those goals, I’ll conduct:
- Exploratory Data Analysis;
- Data Modeling.
Let's explore the spotify_songs dataset to discover the patterns behind it.
Packages required for this project are:
library(tidyverse)
library(knitr) # kable() function will be used to display output
library(corrplot) # corrplot() function will be used to explore correlation
library(randomForest) # package for random forest
library(e1071) # package for support vector machine
The very first step is to load the spotify_songs dataset. This dataset originally comes from the spotifyr package, an R wrapper for pulling track audio features and other information from Spotify's Web API in bulk. By automatically batching API requests, it allows you to enter an artist's name and retrieve their entire discography in seconds, along with Spotify's audio features and track/album popularity metrics.
songs <- read_csv('spotify_songs.csv')
There are 32,833 observations and 23 variables in the dataset.
Here are the variables we have:
| Variable | Description |
|---|---|
| track_id | Song unique ID |
| track_name | Song Name |
| track_artist | Song Artist |
| track_popularity | Song Popularity (0-100) where higher is better |
| track_album_id | Album unique ID |
| track_album_name | Song album name |
| track_album_release_date | Date when album released |
| playlist_name | Name of playlist |
| playlist_id | Playlist ID |
| playlist_genre | Playlist genre |
| playlist_subgenre | Playlist subgenre |
| danceability | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | Duration of song in milliseconds |
As our main purpose is to explore the characteristics and classification of each genre, the variables of interest are playlist_genre, playlist_subgenre and the numeric variables that follow them, i.e. columns 10 to 23.
Now let's check the number of missing values.
# count the missing values in each column
n_of_missing <- c()
for (i in 1:ncol(songs)) {
  n_of_missing[i] <- sum(is.na(songs[i]))
}
missing <- data.frame(rbind(n_of_missing))
colnames(missing) <- colnames(songs)
missing[which(missing > 0)]
## track_name track_artist track_album_name
## n_of_missing 5 5 5
There are 5 missing values in each of the columns track_name, track_artist and track_album_name.
which(is.na(songs$track_name))
## [1] 8152 9283 9284 19569 19812
which(is.na(songs$track_artist)) == which(is.na(songs$track_name))
## [1] TRUE TRUE TRUE TRUE TRUE
which(is.na(songs$track_album_name)) == which(is.na(songs$track_name))
## [1] TRUE TRUE TRUE TRUE TRUE
We find that the indices of the missing values are the same across these variables, which means only 5 observations contain missing values.
We can keep those observations: since the missing values are not in columns 10 to 23, these rows still have complete values for the variables of interest and don't need to be removed.
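As a quick sanity check, here is a small sketch (assuming the audio features sit in columns 10 to 23, as described above) confirming that those five rows are complete in the variables of interest:
# rows flagged above with a missing track_name
incomplete_rows <- which(is.na(songs$track_name))
# count missing values in the variables of interest for those rows; 0 means they can be kept
sum(is.na(songs[incomplete_rows, 10:23]))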
Next, convert playlist_genre and playlist_subgenre to categorical variables. As we want to explore the classification of genre, playlist_genre and playlist_subgenre should be transformed from character to factor to facilitate our analysis.
Another variable, mode, also needs to be converted into a categorical variable. As mentioned in the data documentation, mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived; major is represented by 1 and minor by 0. Thus, rather than being numeric, mode is categorical in nature.
songs <- songs %>%
  mutate(playlist_genre = as.factor(playlist_genre),
         playlist_subgenre = as.factor(playlist_subgenre),
         mode = as.factor(mode))
Now let's focus on the numeric variables and check whether there are any outliers in the dataset.
summary(songs[,c(13:15, 17:23)])
## energy key loudness speechiness
## Min. :0.000175 Min. : 0.000 Min. :-46.448 Min. :0.0000
## 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171 1st Qu.:0.0410
## Median :0.721000 Median : 6.000 Median : -6.166 Median :0.0625
## Mean :0.698619 Mean : 5.374 Mean : -6.720 Mean :0.1071
## 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645 3rd Qu.:0.1320
## Max. :1.000000 Max. :11.000 Max. : 1.275 Max. :0.9180
## acousticness instrumentalness liveness valence
## Min. :0.0000 Min. :0.0000000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0151 1st Qu.:0.0000000 1st Qu.:0.0927 1st Qu.:0.3310
## Median :0.0804 Median :0.0000161 Median :0.1270 Median :0.5120
## Mean :0.1753 Mean :0.0847472 Mean :0.1902 Mean :0.5106
## 3rd Qu.:0.2550 3rd Qu.:0.0048300 3rd Qu.:0.2480 3rd Qu.:0.6930
## Max. :0.9940 Max. :0.9940000 Max. :0.9960 Max. :0.9910
## tempo duration_ms
## Min. : 0.00 Min. : 4000
## 1st Qu.: 99.96 1st Qu.:187819
## Median :121.98 Median :216000
## Mean :120.88 Mean :225800
## 3rd Qu.:133.92 3rd Qu.:253585
## Max. :239.44 Max. :517810
There are a few variables that need attention: loudness, instrumentalness and tempo.
ggplot(songs, aes(x = 1, y = loudness)) +
geom_boxplot() +
coord_flip() +
ggtitle('Boxplot of Loudness')
From the boxplot above we can see that loudness has a left-skewed distribution; even so, the minimum value looks like an outlier, as it is quite isolated from the rest of the data. Locating this observation shows that it comes from the genre latin, so we can compare the distribution of latin's loudness with that of the other genres.
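One way to locate that observation (a small sketch using the dplyr verbs already loaded via tidyverse):
# find the track with the lowest loudness and check its genre
songs %>%
  filter(loudness == min(loudness)) %>%
  select(track_name, track_artist, playlist_genre, loudness)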
ggplot(songs, aes(x = 1, y = loudness)) +
geom_boxplot() +
facet_grid(playlist_genre~., scales = "free_x") +
coord_flip() +
labs(title = "Boxplot of Loudness",
subtitle = 'By different Genre') +
theme(axis.text.y = element_blank())
The genre latin does have relatively more low-end outliers than the other genres. Therefore, this minimum value might be genuine, as latin appears to have a more left-skewed loudness distribution than the other genres.
Besides, the data documentation says loudness values typically range between -60 and 0 dB, so the minimum value of -46.448 is acceptable.
ggplot(songs, aes(x = instrumentalness)) +
geom_histogram(binwidth = 0.1, bins = 10) +
ggtitle("Histogram of Instrumentalness")
We see that the majority of observations (about 85.43%) have an instrumentalness value no larger than 0.1, which is why the difference between the mean and the median of instrumentalness is quite large.
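That share can be computed directly; a minimal sketch:
# proportion of tracks with instrumentalness no larger than 0.1
mean(songs$instrumentalness <= 0.1)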
ggplot(songs, aes(x = 1, y = tempo)) +
geom_boxplot() +
coord_flip() +
ggtitle('Boxplot of Tempo')
The variable tempo has a left-skewed distribution, but the minimum value of 0 makes no sense, as it would mean the track's overall estimated tempo is 0 beats per minute. Locating this observation shows that it comes from the genre rock, so we can compare the distribution of rock's tempo with that of the other genres.
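Again, a small sketch to locate it:
# find the track with a tempo of 0 and check its genre
songs %>%
  filter(tempo == 0) %>%
  select(track_name, track_artist, playlist_genre, tempo)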
ggplot(songs, aes(x = 1, y = tempo)) +
geom_boxplot() +
facet_grid(playlist_genre~., scales = "free_x") +
coord_flip() +
labs(title = "Boxplot of Different Genre's Tempo",
subtitle = 'By different Genre') +
theme(axis.text.y = element_blank())
The minimum of tempo does look like an outlier, so it should be removed from the dataset to avoid skewing the analysis.
songs <- songs[-which(songs$tempo == min(songs$tempo)),]
Now the data is ready for exploratory data analysis.
colnames(songs)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
- The variables of interest are playlist_genre, playlist_subgenre and the audio feature variables that follow them, i.e. the columns from playlist_genre to duration_ms.
- There are 5 observations with missing values. Those missing values are not in the variables of interest, so they are kept in the dataset.
- There is 1 outlier in the variable tempo (a value of 0, which would mean the song has no tempo at all and makes no sense); it has been removed from the dataset.
Among the 23 variables, three of them (playlist_genre, playlist_subgenre and mode) are categorical. The number of tracks in each category is shown below.
kable(songs %>%
group_by(playlist_genre) %>%
summarise(total = n()))
| playlist_genre | total |
|---|---|
| edm | 6043 |
| latin | 5155 |
| pop | 5507 |
| r&b | 5431 |
| rap | 5746 |
| rock | 4950 |
kable(songs %>%
group_by(playlist_subgenre) %>%
summarise(total = n()) %>%
head(n = 6))
| playlist_subgenre | total |
|---|---|
| album rock | 1064 |
| big room | 1206 |
| classic rock | 1296 |
| dance pop | 1298 |
| electro house | 1511 |
| electropop | 1408 |
kable(songs %>%
group_by(mode) %>%
summarise(total = n()) %>%
head(n = 6))
| mode | total |
|---|---|
| 0 | 14259 |
| 1 | 18573 |
The table below shows descriptive statistics for the numeric variables in the dataset.
| Variables | Minimum | 1st Quartile | Mean | Median | 3rd Quartile | Maximum |
|---|---|---|---|---|---|---|
| danceability | 0.0771 | 0.563 | 0.6548695 | 0.672 | 0.761 | 0.983 |
| energy | 0.0001751 | 0.581 | 0.698631 | 0.721 | 0.84 | 1 |
| key | 0 | 2 | 5.374604 | 6 | 9 | 11 |
| loudness | -46.448 | -8.171 | -6.7189092 | -6.166 | -4.645 | 1.275 |
| speechiness | 0.0224 | 0.041 | 0.1070713 | 0.0625 | 0.132 | 0.918 |
| acousticness | 0.0000014 | 0.0151 | 0.1753391 | 0.0804 | 0.255 | 0.994 |
| liveness | 0.00936 | 0.0927 | 0.190182 | 0.127 | 0.248 | 0.996 |
| valence | 0.00001 | 0.331 | 0.5105765 | 0.512 | 0.693 | 0.991 |
| tempo | 35.477 | 99.96 | 120.8848134 | 121.984 | 133.91825 | 239.44 |
| duration_ms | 29493 | 187821.25 | 225806.57 | 216000.5 | 253585 | 517810 |
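A sketch of how such a table could be assembled with dplyr/tidyr and kable() (the exact rounding of the rendered output may differ):
numeric_features <- c('danceability', 'energy', 'key', 'loudness', 'speechiness',
                      'acousticness', 'liveness', 'valence', 'tempo', 'duration_ms')
songs %>%
  select(all_of(numeric_features)) %>%
  pivot_longer(everything(), names_to = 'Variables') %>%
  group_by(Variables) %>%
  summarise(Minimum = min(value),
            `1st Quartile` = quantile(value, 0.25),
            Mean = mean(value),
            Median = median(value),
            `3rd Quartile` = quantile(value, 0.75),
            Maximum = max(value)) %>%
  kable()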
First, let's explore the feature pattern of each genre.
feature_names <- names(songs)[c(12:15,17:23)]
long_songs <- songs %>%
select(c('playlist_genre', feature_names)) %>%
pivot_longer(cols = feature_names)
long_songs %>%
ggplot(aes(x = value)) +
geom_freqpoly(aes(color = playlist_genre)) +
facet_wrap(~name, ncol = 3, scales = 'free') +
labs(title = 'Spotify Audio Feature Pattern',
subtitle = 'By different Genre',
x = '', y = '') +
theme(axis.text.y = element_blank())
From the plot above it is clear that different genres do have different audio feature patterns. For variables like instrumentalness, key, liveness and loudness, the distributions of the genres don't vary much and largely overlap. But for variables like danceability, energy and valence, the genres have quite different distributions, so those variables could be more helpful for classification. Let's take a closer look at these three variables.
long_songs %>%
filter(name %in% c('danceability', 'energy', 'valence')) %>%
ggplot(aes(x = value)) +
geom_freqpoly(aes(color = playlist_genre)) +
facet_wrap(~name, ncol = 3, scales = 'free') +
labs(title = 'Pattern on Danceability, Energy and Valence',
subtitle = 'By different Genre',
x = '', y = '') +
theme(axis.text.y = element_blank())
It's clear that some genres have distinct feature patterns on these variables. For example, edm appears to have medium danceability, higher energy and low valence compared to other genres, while rock appears to have lower danceability, higher energy and medium valence. These figures suggest that, based on the combination of a song's audio features, it could be possible to classify it into a specific genre with some confidence.
Let's also check the correlation between audio features and draw a correlation plot.
songs %>%
select(feature_names) %>%
cor() %>%
corrplot(method = 'color', order = 'hclust', type = 'upper',
diag = TRUE, main = 'Correlation between Audio Features',
mar = c(2,2,2,2))
We can see that there is a strong negative correlation between acousticness and energy, a relatively strong negative correlation between acousticness and loudness, and a relatively strong positive correlation between energy and loudness. This is consistent with common sense.
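To put numbers on those relationships, a short sketch extracting the relevant coefficients:
# pairwise correlations among the three features highlighted above
songs %>%
  select(acousticness, energy, loudness) %>%
  cor() %>%
  round(2)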
Multiple models will be built to explore the classification of playlist genre. Model performance will be evaluated using the in-sample (a randomly chosen 75% of the original dataset) and out-of-sample (the remaining 25%) misclassification rate (MR).
The first step is to randomly split the whole dataset into a training set (75%) and a testing set (25%) for model validation.
set.seed(7052)
idx <- sample(1:nrow(songs), nrow(songs) * .75)
songs <- songs[,c(10, 12:23)]
songs_train <- songs[idx,]
songs_test <- songs[-idx,]
Random forest is an extension of bagging, which is an aggregation of trees. By bootstrapping (sampling with replacement) from the data and fitting a tree to each sample, a random forest aggregates those trees and achieves a significant improvement in prediction.
The idea of random forests is to randomly select m out of the p predictors as candidate variables for each split in each tree. Commonly, m = √p for classification. Doing this decorrelates the trees, which reduces variance when the trees are aggregated.
In this particular case, m should be the square root of the number of predictors, √12 ≈ 3.5, so we round up and set mtry = 4. Let's fit this model on the training data and make predictions on both the training and testing data.
songs_rf <- randomForest(playlist_genre~., data = songs_train, mtry = 4)
pred_train <- predict(songs_rf)
pred_test <- predict(songs_rf, songs_test)
The misclassification rate (MR) is defined as the proportion of songs whose genre is misclassified. We can compute both the in-sample and out-of-sample misclassification rates from the predictions obtained in the last step.
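A minimal sketch of that computation (note that predict() on a randomForest object without new data returns out-of-bag predictions for the training set, so the "in-sample" rate below is effectively an out-of-bag error):
# misclassification rate = share of songs whose predicted genre differs from the true genre
mr_train <- mean(pred_train != songs_train$playlist_genre)
mr_test <- mean(pred_test != songs_test$playlist_genre)
c(in_sample = mr_train, out_of_sample = mr_test)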
| In-sample MR (Training data) | Out-of-sample MR (Testing data) |
|---|---|
| 0.445947 | 0.454191 |
We see that the in-sample MR and out-of-sample MR are both around 0.45 and close to each other, indicating that the model is not overfitting. Let's look at the confusion matrices for more detail.
In-sample Confusion Matrix
kable(songs_rf$confusion)
| | edm | latin | pop | r&b | rap | rock | class.error |
|---|---|---|---|---|---|---|---|
| edm | 3116 | 228 | 644 | 162 | 228 | 139 | 0.3101616 |
| latin | 367 | 1517 | 723 | 458 | 643 | 169 | 0.6087181 |
| pop | 742 | 495 | 1394 | 603 | 280 | 606 | 0.6616505 |
| r&b | 158 | 324 | 524 | 1948 | 824 | 310 | 0.5234834 |
| rap | 229 | 367 | 228 | 524 | 2877 | 111 | 0.3364852 |
| rock | 117 | 92 | 357 | 271 | 58 | 2791 | 0.2428106 |
Out-of-sample Confusion Matrix
kable(table(pred_test, songs_test$playlist_genre))
| | edm | latin | pop | r&b | rap | rock |
|---|---|---|---|---|---|---|
| edm | 1024 | 133 | 253 | 43 | 90 | 46 |
| latin | 81 | 519 | 161 | 113 | 100 | 28 |
| pop | 230 | 202 | 453 | 186 | 72 | 120 |
| r&b | 66 | 153 | 223 | 618 | 170 | 113 |
| rap | 88 | 220 | 104 | 279 | 933 | 24 |
| rock | 37 | 51 | 193 | 104 | 45 | 933 |
From the two tables above we can see that the random forest model classifies rock, edm and rap fairly well, with misclassification rates of 24.70%, 31.30% and 34.61% respectively, while it performs worse at identifying r&b, latin and pop, with misclassification rates of 52.84%, 58.06% and 67.06%.
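Per-genre rates like these can be derived from the out-of-sample confusion matrix (a sketch; in that table the predictions are in the rows and the true genres in the columns, so the column sums give the number of songs per true genre):
conf_test <- table(pred_test, songs_test$playlist_genre)
# per-genre misclassification rate: 1 minus the share of each true genre predicted correctly
round(1 - diag(conf_test) / colSums(conf_test), 4)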
A support vector machine (SVM) is probably one of the best off-the-shelf classifiers for many problems. An SVM model represents the examples as points in space, mapped so that the examples of the separate categories are divided by a gap that is as wide as possible. It handles nonlinearity, is well regularized (which helps avoid overfitting), has few parameters, and is fast for a large number of observations.
As before, we use the training data to build the SVM model and the testing data for model validation.
song_svm <- svm(playlist_genre ~ ., data = songs_train, cost = 1)
pred_train_svm <- predict(song_svm)
pred_test_svm <- predict(song_svm, songs_test)
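As with the random forest, the rates in the table below can be computed from these predictions; a minimal sketch:
# in-sample and out-of-sample misclassification rates for the SVM model
c(in_sample = mean(pred_train_svm != songs_train$playlist_genre),
  out_of_sample = mean(pred_test_svm != songs_test$playlist_genre))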
| In-sample MR (Training data) | Out-of-sample MR (Testing data) |
|---|---|
| 0.440627 | 0.4678363 |
We see that the in-sample MR is 0.441 and the out-of-sample MR is 0.468; they are not far from each other and are similar to the performance of the random forest. Let's look at the confusion matrices for more detail.
In-sample Confusion Matrix
kable(table(songs_train$playlist_genre, pred_train_svm))
| | edm | latin | pop | r&b | rap | rock |
|---|---|---|---|---|---|---|
| edm | 3029 | 196 | 697 | 136 | 276 | 183 |
| latin | 330 | 1521 | 681 | 483 | 653 | 209 |
| pop | 612 | 402 | 1657 | 524 | 359 | 566 |
| r&b | 159 | 368 | 487 | 1893 | 882 | 299 |
| rap | 259 | 353 | 252 | 378 | 2988 | 106 |
| rock | 179 | 115 | 350 | 299 | 57 | 2686 |
Out-of-sample Confusion Matrix
kable(table(songs_test$playlist_genre, pred_test_svm))
| | edm | latin | pop | r&b | rap | rock |
|---|---|---|---|---|---|---|
| edm | 966 | 69 | 263 | 55 | 119 | 54 |
| latin | 112 | 506 | 221 | 147 | 241 | 51 |
| pop | 224 | 142 | 514 | 196 | 114 | 197 |
| r&b | 40 | 150 | 168 | 576 | 297 | 112 |
| rap | 106 | 107 | 87 | 135 | 938 | 37 |
| rock | 65 | 41 | 141 | 126 | 23 | 868 |
Similarly, the SVM model classifies rock, edm and rap fairly well, but does not do as well at identifying r&b, latin and pop.
In this project, using data that originally comes from the spotifyr package, I've explored:
- The audio feature pattern of each genre, by visualizing audio features across different genres.
- The correlation between features, by computing correlation coefficients.
- How to classify a song's genre given its features, using a random forest model and an SVM model.
These explorations give us some hints on:
- The specific feature pattern of each genre. For example, edm appears to have medium danceability, higher energy and low valence compared to other genres, while rock appears to have lower danceability, higher energy and medium valence.
- The relationships between audio features. For example, there is a strong negative correlation between acousticness and energy, a relatively strong negative correlation between acousticness and loudness, and a relatively strong positive correlation between energy and loudness, all of which are consistent with common sense.
- How classification models work on this problem. The random forest model is more accurate at classifying rock, edm and rap, but does not work as well for r&b, latin and pop.
This project does give us some insights into audio feature patterns and genre classification. However, due to limitations of data, models and time, it is far from perfect. More data could be collected, and more sophisticated models, such as neural networks, could be built to extract further insights from the data.