Recall that the first time we logged in to Spotify, it asked what our favourite music genres are so that it could recommend songs we might like in Daily Mixes or Discover Weekly. It also uses our listening data, such as listening history, to learn our preferences.
But how can Spotify classify songs into broad genres? What characterises each genre, and how do a song's features determine its genre? Those are the questions this project sets out to answer.
Each song has 12 audio features, and the labels consist of 6 broad genres and 24 subgenres. Details are available in the Data Preparation and Exploratory Data Analysis parts. I'll analyse those 14 variables to answer the questions asked in the last paragraph.
During this project, you’ll see:
- The correlation between audio features.
- The feature pattern of each genre.
- How we can classify a song's genre given its features.
To fulfill those goals, I’ll conduct:
- Exploratory Data Analysis;
- Data Modeling.
Let's explore the spotify_songs dataset to discover the patterns behind it.
Packages required for this project are:
library(tidyverse)
library(knitr) # kable() function will be used to display output
library(corrplot) # corrplot() function will be used to explore correlation
library(randomForest) # package for random forest
library(e1071) # package for support vector machine
The very first step is to load the spotify_songs dataset. This dataset originally comes from the spotifyr package, an R wrapper for pulling track audio features and other information from Spotify's Web API in bulk. By automatically batching API requests, it allows you to enter an artist's name and retrieve their entire discography in seconds, along with Spotify's audio features and track/album popularity metrics.
songs <- read_csv('spotify_songs.csv')
There are 32,833 observations and 23 variables in the dataset.
Here are the variables we have:
| Variable | Description |
|---|---|
| track_id | Song unique ID |
| track_name | Song Name |
| track_artist | Song Artist |
| track_popularity | Song Popularity (0-100) where higher is better |
| track_album_id | Album unique ID |
| track_album_name | Song album name |
| track_album_release_date | Date when album released |
| playlist_name | Name of playlist |
| playlist_id | Playlist ID |
| playlist_genre | Playlist genre |
| playlist_subgenre | Playlist subgenre |
| danceability | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | Duration of song in milliseconds |
As our main purpose is to explore the characteristics and classification of each genre, the variables of interest are playlist_genre, playlist_subgenre and the numeric variables that follow them, i.e. columns 10 to 23.
Now let's check the number of missing values.
# count the missing values in each column
n_of_missing <- c()
for (i in 1:ncol(songs)) {
  n_of_missing[i] <- sum(is.na(songs[i]))
}
missing <- data.frame(rbind(n_of_missing))
colnames(missing) <- colnames(songs)
missing[which(missing > 0)]
## track_name track_artist track_album_name
## n_of_missing 5 5 5
There are 5 missing values in each of the columns track_name, track_artist and track_album_name.
which(is.na(songs$track_name))
## [1] 8152 9283 9284 19569 19812
which(is.na(songs$track_artist)) == which(is.na(songs$track_name))
## [1] TRUE TRUE TRUE TRUE TRUE
which(is.na(songs$track_album_name)) == which(is.na(songs$track_name))
## [1] TRUE TRUE TRUE TRUE TRUE
We find that the indices of the missing values are the same across these variables, which means only 5 observations contain missing values.
We can keep those observations: since the missing values are not in columns 10 to 23, these rows still have complete values for the variables of interest and don't need to be removed.
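As a quick sanity check, here is a small sketch (assuming the audio features sit in columns 10 to 23, as described above) confirming that those five rows are complete in the variables of interest:
# rows flagged above with a missing track_name
incomplete_rows <- which(is.na(songs$track_name))
# count missing values in the variables of interest for those rows; 0 means they can be kept
sum(is.na(songs[incomplete_rows, 10:23]))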
Next, convert playlist_genre and playlist_subgenre to categorical variables. As we want to explore the classification of genre, playlist_genre and playlist_subgenre should be transformed from character to factor to facilitate our analysis.
Another variable, mode, also needs to be converted into a categorical variable. As mentioned in the data documentation, mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived; major is represented by 1 and minor by 0. Thus, rather than being numeric, mode is categorical in nature.
songs <- songs %>%
  mutate(playlist_genre = as.factor(playlist_genre),
         playlist_subgenre = as.factor(playlist_subgenre),
         mode = as.factor(mode))
Now let's focus on the numeric variables and check whether there are any outliers in the dataset.
summary(songs[,c(13:15, 17:23)])
## energy key loudness speechiness
## Min. :0.000175 Min. : 0.000 Min. :-46.448 Min. :0.0000
## 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171 1st Qu.:0.0410
## Median :0.721000 Median : 6.000 Median : -6.166 Median :0.0625
## Mean :0.698619 Mean : 5.374 Mean : -6.720 Mean :0.1071
## 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645 3rd Qu.:0.1320
## Max. :1.000000 Max. :11.000 Max. : 1.275 Max. :0.9180
## acousticness instrumentalness liveness valence
## Min. :0.0000 Min. :0.0000000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0151 1st Qu.:0.0000000 1st Qu.:0.0927 1st Qu.:0.3310
## Median :0.0804 Median :0.0000161 Median :0.1270 Median :0.5120
## Mean :0.1753 Mean :0.0847472 Mean :0.1902 Mean :0.5106
## 3rd Qu.:0.2550 3rd Qu.:0.0048300 3rd Qu.:0.2480 3rd Qu.:0.6930
## Max. :0.9940 Max. :0.9940000 Max. :0.9960 Max. :0.9910
## tempo duration_ms
## Min. : 0.00 Min. : 4000
## 1st Qu.: 99.96 1st Qu.:187819
## Median :121.98 Median :216000
## Mean :120.88 Mean :225800
## 3rd Qu.:133.92 3rd Qu.:253585
## Max. :239.44 Max. :517810
There are a few variables that need attention: loudness, instrumentalness and tempo.
ggplot(songs, aes(x = 1, y = loudness)) +
geom_boxplot() +
coord_flip() +
ggtitle('Boxplot of Loudness')
From the boxplot above we can see that loudness has a left-skewed distribution; even so, the minimum value looks like an outlier, as it is quite isolated from the rest of the data. Locating this observation shows that it comes from the genre latin, so we can compare the distribution of latin's loudness with that of the other genres.
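One way to locate that observation (a small sketch using the dplyr verbs already loaded via tidyverse):
# find the track with the lowest loudness and check its genre
songs %>%
  filter(loudness == min(loudness)) %>%
  select(track_name, track_artist, playlist_genre, loudness)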
ggplot(songs, aes(x = 1, y = loudness)) +
geom_boxplot() +
facet_grid(playlist_genre~., scales = "free_x") +
coord_flip() +
labs(title = "Boxplot of Loudness",
subtitle = 'By different Genre') +
theme(axis.text.y = element_blank())
The genre latin does have relatively more low-end outliers than the other genres. Therefore, this minimum value might be genuine, as latin appears to have a more left-skewed loudness distribution than the other genres.
Besides, the data documentation says loudness values typically range between -60 and 0 dB, so the minimum value of -46.448 is acceptable.
ggplot(songs, aes(x = instrumentalness)) +
geom_histogram(binwidth = 0.1, bins = 10) +
ggtitle("Histogram of Instrumentalness")
We see that the majority of observations (about 85.43%) have an instrumentalness value no larger than 0.1, which is why the difference between the mean and the median of instrumentalness is quite large.
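That share can be computed directly; a minimal sketch:
# proportion of tracks with instrumentalness no larger than 0.1
mean(songs$instrumentalness <= 0.1)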
ggplot(songs, aes(x = 1, y = tempo)) +
geom_boxplot() +
coord_flip() +
ggtitle('Boxplot of Tempo')
The variable tempo has a left-skewed distribution, but the minimum value of 0 makes no sense, as it would mean the track's overall estimated tempo is 0 beats per minute. Locating this observation shows that it comes from the genre rock, so we can compare the distribution of rock's tempo with that of the other genres.
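Again, a small sketch to locate it:
# find the track with a tempo of 0 and check its genre
songs %>%
  filter(tempo == 0) %>%
  select(track_name, track_artist, playlist_genre, tempo)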
ggplot(songs, aes(x = 1, y = tempo)) +
geom_boxplot() +
facet_grid(playlist_genre~., scales = "free_x") +
coord_flip() +
labs(title = "Boxplot of Different Genre's Tempo",
subtitle = 'By different Genre') +
theme(axis.text.y = element_blank())
The minimum of tempo does look like an outlier, so it should be removed from the dataset to avoid skewing the analysis.
songs <- songs[-which(songs$tempo == min(songs$tempo)),]
Now the data is ready for exploratory data analysis.
colnames(songs)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
- The variables of interest are playlist_genre, playlist_subgenre and the audio feature variables that follow them, i.e. the columns from playlist_genre to duration_ms.
- There are 5 observations with missing values. Those missing values are not in the variables of interest, so they are kept in the dataset.
- There is 1 outlier in the variable tempo (a value of 0, which would mean the song has no tempo at all and makes no sense); it has been removed from the dataset.
Among the 23 variables, three of them (playlist_genre, playlist_subgenre and mode) are categorical. The number of tracks in each category is shown below.
kable(songs %>%
group_by(playlist_genre) %>%
summarise(total = n()))
| playlist_genre | total |
|---|---|
| edm | 6043 |
| latin | 5155 |
| pop | 5507 |
| r&b | 5431 |
| rap | 5746 |
| rock | 4950 |
kable(songs %>%
group_by(playlist_subgenre) %>%
summarise(total = n()) %>%
head(n = 6))
| playlist_subgenre | total |
|---|---|
| album rock | 1064 |
| big room | 1206 |
| classic rock | 1296 |
| dance pop | 1298 |
| electro house | 1511 |
| electropop | 1408 |
kable(songs %>%
group_by(mode) %>%
summarise(total = n()) %>%
head(n = 6))
| mode | total |
|---|---|
| 0 | 14259 |
| 1 | 18573 |
The table below shows descriptive statistics for the numeric variables in the dataset.
| Variables | Minimum | 1st Quartile | Mean | Median | 3rd Quartile | Maximum |
|---|---|---|---|---|---|---|
| danceability | 0.0771 | 0.563 | 0.6548695 | 0.672 | 0.761 | 0.983 |
| energy | 0.0001751 | 0.581 | 0.698631 | 0.721 | 0.84 | 1 |
| key | 0 | 2 | 5.374604 | 6 | 9 | 11 |
| loudness | -46.448 | -8.171 | -6.7189092 | -6.166 | -4.645 | 1.275 |
| speechiness | 0.0224 | 0.041 | 0.1070713 | 0.0625 | 0.132 | 0.918 |
| acousticness | 0.0000014 | 0.0151 | 0.1753391 | 0.0804 | 0.255 | 0.994 |
| liveness | 0.00936 | 0.0927 | 0.190182 | 0.127 | 0.248 | 0.996 |
| valence | 0.00001 | 0.331 | 0.5105765 | 0.512 | 0.693 | 0.991 |
| tempo | 35.477 | 99.96 | 120.8848134 | 121.984 | 133.91825 | 239.44 |
| duration_ms | 29493 | 187821.25 | 225806.57 | 216000.5 | 253585 | 517810 |
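A sketch of how such a table could be assembled with dplyr/tidyr and kable() (the exact rounding of the rendered output may differ):
numeric_features <- c('danceability', 'energy', 'key', 'loudness', 'speechiness',
                      'acousticness', 'liveness', 'valence', 'tempo', 'duration_ms')
songs %>%
  select(all_of(numeric_features)) %>%
  pivot_longer(everything(), names_to = 'Variables') %>%
  group_by(Variables) %>%
  summarise(Minimum = min(value),
            `1st Quartile` = quantile(value, 0.25),
            Mean = mean(value),
            Median = median(value),
            `3rd Quartile` = quantile(value, 0.75),
            Maximum = max(value)) %>%
  kable()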
First, let's explore the feature pattern of each genre.
feature_names <- names(songs)[c(12:15,17:23)]
long_songs <- songs %>%
select(c('playlist_genre', feature_names)) %>%
pivot_longer(cols = feature_names)
long_songs %>%
ggplot(aes(x = value)) +
geom_freqpoly(aes(color = playlist_genre)) +
facet_wrap(~name, ncol = 3, scales = 'free') +
labs(title = 'Spotify Audio Feature Pattern',
subtitle = 'By different Genre',
x = '', y = '') +
theme(axis.text.y = element_blank())
From the plot above it is clear that different genres do have different audio feature patterns. For variables like instrumentalness, key, liveness and loudness, the distributions of the genres don't vary much and largely overlap. But for variables like danceability, energy and valence, the genres have quite different distributions, so those variables could be more helpful for classification. Let's take a closer look at these three variables.
long_songs %>%
filter(name %in% c('danceability', 'energy', 'valence')) %>%
ggplot(aes(x = value)) +
geom_freqpoly(aes(color = playlist_genre)) +
facet_wrap(~name, ncol = 3, scales = 'free') +
labs(title = 'Pattern on Danceability, Energy and Valence',
subtitle = 'By different Genre',
x = '', y = '') +
theme(axis.text.y = element_blank())
It's clear that some genres have distinct feature patterns on these variables. For example, edm appears to have medium danceability, higher energy and low valence compared to other genres, while rock appears to have lower danceability, higher energy and medium valence. These figures suggest that, based on the combination of a song's audio features, it could be possible to classify it into a specific genre with some confidence.
Let's also check the correlation between audio features and draw a correlation plot.
songs %>%
select(feature_names) %>%
cor() %>%
corrplot(method = 'color', order = 'hclust', type = 'upper',
diag = TRUE, main = 'Correlation between Audio Features',
mar = c(2,2,2,2))
We can see that there is a strong negative correlation between acousticness and energy, a relatively strong negative correlation between acousticness and loudness, and a relatively strong positive correlation between energy and loudness. This is consistent with common sense.
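To put numbers on those relationships, a short sketch extracting the relevant coefficients:
# pairwise correlations among the three features highlighted above
songs %>%
  select(acousticness, energy, loudness) %>%
  cor() %>%
  round(2)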
Multiple models will be built to explore the classification of playlist genre. Model performance will be evaluated using the in-sample (a randomly chosen 75% of the original dataset) and out-of-sample (the remaining 25%) misclassification rate (MR).
The first step is to randomly split the whole dataset into a training set (75%) and a testing set (25%) for model validation.
set.seed(7052)
idx <- sample(1:nrow(songs), nrow(songs) * .75)
songs <- songs[,c(10, 12:23)]
songs_train <- songs[idx,]
songs_test <- songs[-idx,]
Random forest is an extension of bagging, which is an aggregation of trees. By bootstrapping (sampling with replacement) from the data and fitting a tree to each sample, a random forest aggregates those trees and achieves a significant improvement in prediction.
The idea of random forests is to randomly select m out of the p predictors as candidate variables for each split in each tree. Commonly, m = √p for classification. Doing this decorrelates the trees, which reduces variance when the trees are aggregated.
In this particular case, m should be the square root of the number of predictors, √12 ≈ 3.5, so we round up and set mtry = 4. Let's fit this model on the training data and make predictions on both the training and testing data.
songs_rf <- randomForest(playlist_genre~., data = songs_train, mtry = 4)
pred_train <- predict(songs_rf)
pred_test <- predict(songs_rf, songs_test)
The misclassification rate (MR) is defined as the proportion of songs whose genre is misclassified. We can compute both the in-sample and out-of-sample misclassification rates from the predictions obtained in the last step.
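A minimal sketch of that computation (note that predict() on a randomForest object without new data returns out-of-bag predictions for the training set, so the "in-sample" rate below is effectively an out-of-bag error):
# misclassification rate = share of songs whose predicted genre differs from the true genre
mr_train <- mean(pred_train != songs_train$playlist_genre)
mr_test <- mean(pred_test != songs_test$playlist_genre)
c(in_sample = mr_train, out_of_sample = mr_test)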
| In-sample MR (Training data) | Out-of-sample MR (Testing data) |
|---|---|
| 0.445947 | 0.454191 |
We see that the in-sample MR and out-of-sample MR are both around 0.45 and close to each other, indicating that the model is not overfitting. Let's look at the confusion matrices for more detail.
In-sample Confusion Matrix
kable(songs_rf$confusion)
| | edm | latin | pop | r&b | rap | rock | class.error |
|---|---|---|---|---|---|---|---|
| edm | 3116 | 228 | 644 | 162 | 228 | 139 | 0.3101616 |
| latin | 367 | 1517 | 723 | 458 | 643 | 169 | 0.6087181 |
| pop | 742 | 495 | 1394 | 603 | 280 | 606 | 0.6616505 |
| r&b | 158 | 324 | 524 | 1948 | 824 | 310 | 0.5234834 |
| rap | 229 | 367 | 228 | 524 | 2877 | 111 | 0.3364852 |
| rock | 117 | 92 | 357 | 271 | 58 | 2791 | 0.2428106 |
Out-of-sample Confusion Matrix
kable(table(pred_test, songs_test$playlist_genre))
| | edm | latin | pop | r&b | rap | rock |
|---|---|---|---|---|---|---|
| edm | 1024 | 133 | 253 | 43 | 90 | 46 |
| latin | 81 | 519 | 161 | 113 | 100 | 28 |
| pop | 230 | 202 | 453 | 186 | 72 | 120 |
| r&b | 66 | 153 | 223 | 618 | 170 | 113 |
| rap | 88 | 220 | 104 | 279 | 933 | 24 |
| rock | 37 | 51 | 193 | 104 | 45 | 933 |
From the two tables above we can see that the random forest model classifies rock, edm and rap fairly well, with misclassification rates of 24.70%, 31.30% and 34.61% respectively, while it performs worse at identifying r&b, latin and pop, with misclassification rates of 52.84%, 58.06% and 67.06%.
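Per-genre rates like these can be derived from the out-of-sample confusion matrix (a sketch; in that table the predictions are in the rows and the true genres in the columns, so the column sums give the number of songs per true genre):
conf_test <- table(pred_test, songs_test$playlist_genre)
# per-genre misclassification rate: 1 minus the share of each true genre predicted correctly
round(1 - diag(conf_test) / colSums(conf_test), 4)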
A support vector machine (SVM) is probably one of the best off-the-shelf classifiers for many problems. An SVM model represents the examples as points in space, mapped so that the examples of the separate categories are divided by a gap that is as wide as possible. It handles nonlinearity, is well regularized (which helps avoid overfitting), has few parameters, and is fast for a large number of observations.
As before, we use the training data to build the SVM model and the testing data for model validation.
song_svm <- svm(playlist_genre ~ ., data = songs_train, cost = 1)
pred_train_svm <- predict(song_svm)
pred_test_svm <- predict(song_svm, songs_test)
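As with the random forest, the rates in the table below can be computed from these predictions; a minimal sketch:
# in-sample and out-of-sample misclassification rates for the SVM model
c(in_sample = mean(pred_train_svm != songs_train$playlist_genre),
  out_of_sample = mean(pred_test_svm != songs_test$playlist_genre))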
| In-sample MR (Training data) | Out-of-sample MR (Testing data) |
|---|---|
| 0.440627 | 0.4678363 |
We see that the in-sample MR is 0.441 and the out-of-sample MR is 0.468; they are not far from each other and are similar to the performance of the random forest. Let's look at the confusion matrices for more detail.
In-sample Confusion Matrix
kable(table(songs_train$playlist_genre, pred_train_svm))
| | edm | latin | pop | r&b | rap | rock |
|---|---|---|---|---|---|---|
| edm | 3029 | 196 | 697 | 136 | 276 | 183 |
| latin | 330 | 1521 | 681 | 483 | 653 | 209 |
| pop | 612 | 402 | 1657 | 524 | 359 | 566 |
| r&b | 159 | 368 | 487 | 1893 | 882 | 299 |
| rap | 259 | 353 | 252 | 378 | 2988 | 106 |
| rock | 179 | 115 | 350 | 299 | 57 | 2686 |
Out-of-sample Confusion Matrix
kable(table(songs_test$playlist_genre, pred_test_svm))
| | edm | latin | pop | r&b | rap | rock |
|---|---|---|---|---|---|---|
| edm | 966 | 69 | 263 | 55 | 119 | 54 |
| latin | 112 | 506 | 221 | 147 | 241 | 51 |
| pop | 224 | 142 | 514 | 196 | 114 | 197 |
| r&b | 40 | 150 | 168 | 576 | 297 | 112 |
| rap | 106 | 107 | 87 | 135 | 938 | 37 |
| rock | 65 | 41 | 141 | 126 | 23 | 868 |
Similarly, the SVM model classifies rock, edm and rap fairly well, but does not do as well at identifying r&b, latin and pop.
In this project, using data that originally comes from the spotifyr package, I've explored:
- The audio feature pattern of each genre, by visualizing audio features across different genres.
- The correlation between features, by computing correlation coefficients.
- How to classify a song's genre given its features, using a random forest model and an SVM model.
These explorations give us some hints on:
- The specific feature pattern of each genre. For example, edm appears to have medium danceability, higher energy and low valence compared to other genres, while rock appears to have lower danceability, higher energy and medium valence.
- The relationships between audio features. For example, there is a strong negative correlation between acousticness and energy, a relatively strong negative correlation between acousticness and loudness, and a relatively strong positive correlation between energy and loudness, all of which are consistent with common sense.
- How classification models work on this problem. The random forest model is more accurate at classifying rock, edm and rap, but does not work as well for r&b, latin and pop.
This project does give us some insights into audio feature patterns and genre classification. However, due to limitations of data, models and time, it is far from perfect. More data could be collected, and more sophisticated models, such as neural networks, could be built to extract further insights from the data.