This or That: Classifying Spotify Song Genres

Introduction

The desire to group items into buckets of similar characteristics is an ingrained interest all humans experience at a very young age. Animals, colored blocks, or anything else a child can get their hands on can be sorted and categorized. As we listen to music the same innate desire occurs. This is where genres come into play. As a human listens to a song it may be patently clear what genre it belongs in, but it can be much more difficult to objectively identify why it belongs in a specific genre. Even more challenging is the task of classifying a large number of songs into appropriate genres.

As the world’s most popular audio streaming subscription service, Spotify has an incredible amount of data and insight into how we listen to music. As of 2021, Spotify has over 70 million tracks. This project will explore a dataset of over 30,000 songs within six genres.

Machine Learning & Classification Analysis

The dataset contains approximately 30,000 songs that fall into six different genres (rap, rock, latin, pop, R&B, and EDM). After cleaning and visualizing the data, machine learning algorithms will be utilized with the goal of building a model that can classify the songs into their correct genres based on the song metrics. With six categories there is a 16.67% chance of randomly guessing a song’s genre. The goal is to far outperform random chance and build a model to automate the classification of these songs.
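The 16.67% figure assumes all six genres are equally likely. As a minimal sketch of that arithmetic, the snippet below computes the uniform baseline and, for comparison, a majority-class baseline. The genre counts in `toy_genres` are hypothetical and only illustrate how the second baseline would be computed on real labels.

```r
# Uniform baseline: one correct guess out of six equally likely genres
n_genres <- 6
uniform_baseline <- 100 / n_genres

# Majority-class baseline: always guess the most common genre
# (hypothetical counts, not taken from the Spotify data)
toy_genres <- rep(c("edm", "latin", "pop", "r&b", "rap", "rock"),
                  times = c(6, 5, 5, 5, 6, 5))
majority_baseline <- max(table(toy_genres)) / length(toy_genres) * 100

round(uniform_baseline, 2)
round(majority_baseline, 2)
```

If the genres are not perfectly balanced, the majority-class baseline is the fairer number to beat.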

Packages

The following packages are required to run the included code and analysis:

library(tidyverse)    # creating clean and tidy data
library(dplyr)        # transforming data
library(ggplot2)      # data visualization
library(kableExtra)   # creation of complex tables
library(knitr)        # dynamic report generation
library(purrr)        # functional programming toolkit
library(corrplot)     # correlation matrix visualization
library(class)        # classification functions including K-nearest neighbor
theme_set(theme_bw()) # set default plot theme

Data Import and Preparation

Obtaining the Data

A CSV of the data, a data dictionary, and additional context can be found on the tidytuesday GitHub repository. This data set and data dictionary were used for a January 2020 edition of TidyTuesday, a weekly data project.

Data Dictionary

kable(data_dict, caption = "Table 1: Data Dictionary") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)

Table 1: Data Dictionary
variable_names	class	description
track_id	character	Song unique ID
track_name	character	Song Name
track_artist	character	Song Artist
track_popularity	double	Song Popularity (0-100) where higher is better
track_album_id	character	Album unique ID
track_album_name	character	Song album name
track_album_release_date	character	Date when album released
playlist_name	character	Name of playlist
playlist_id	character	Playlist ID
playlist_genre	character	Playlist genre
playlist_subgenre	character	Playlist subgenre
danceability	double	Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	double	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key	double	The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C sharp/D flat, 2 = D, and so on. If no key was detected, the value is -1.
loudness	double	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode	double	Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness	double	Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness	double	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness	double	Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness	double	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence	double	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo	double	The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms	double	Duration of song in milliseconds
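The key variable in the data dictionary is stored as a pitch-class integer. A small sketch of how those integers could be translated into note names follows; `key_to_pitch` is a hypothetical helper written for illustration and is not part of the original analysis.

```r
# Standard pitch-class names, indexed 0 (C) through 11 (B)
pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# Hypothetical helper: map key integers to pitch names.
# A value of -1 (no key detected) or anything outside 0-11 becomes NA.
key_to_pitch <- function(key) {
  out <- rep(NA_character_, length(key))
  valid <- !is.na(key) & key >= 0 & key <= 11
  out[valid] <- pitch_classes[key[valid] + 1]  # R vectors are 1-indexed
  out
}

pitches <- key_to_pitch(c(0, 1, 7, -1))
pitches
```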

Import Raw Data

After obtaining the CSV file, it is read in using the read.csv function.

raw_data <- read.csv("spotify_songs.csv", stringsAsFactors =  FALSE)

raw_data %>% 
  select(c("track_name","track_album_name","playlist_genre","playlist_subgenre")) %>% 
  slice_sample(n = 10) %>% 
  kable(caption = "Table 2: Preview of Raw Data") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)

Table 2: Preview of Raw Data
track_name	track_album_name	playlist_genre	playlist_subgenre
Jumbo	Jumbo	edm	electro house
Share the World with Me, Pt. 4	Sceneries	rock	album rock
Where The Hood At	Grand Champ	rap	gangster rap
No Strings Attached	Outsidein	edm	electro house
Mala Costumbre	Numerología	latin	reggaeton
Drop It	Drop It	edm	big room
Immortal - Edited Version	Immortal	rap	southern hip hop
Mascara	Mascara	latin	latin hip hop
Praise You (Chill Mix)	Believe In The Kingdom	latin	tropical
Dig	Sobville (Episode I)	rap	trap

Data Cleaning

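The data is fairly clean and tidy and does not require significant restructuring. Checks for missing values, duplicates, and outliers are completed.

Five values are missing in each of track_name, track_artist, and track_album_name. None of these columns is used as a model feature, so the missing values are not considered significant and are not adjusted or removed.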

kable(sapply(raw_data, function(x) sum(is.na(x))),
      caption = "Table 3: Count of missing values by variable") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)

Table 3: Count of missing values by variable
	x
track_id	0
track_name	5
track_artist	5
track_popularity	0
track_album_id	0
track_album_name	5
track_album_release_date	0
playlist_name	0
playlist_id	0
playlist_genre	0
playlist_subgenre	0
danceability	0
energy	0
key	0
loudness	0
mode	0
speechiness	0
acousticness	0
instrumentalness	0
liveness	0
valence	0
tempo	0
duration_ms	0
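The per-column counts above do not show whether the missing values fall in the same rows. As a sketch, the same check can be paired with `complete.cases()` to pull out the affected rows; the `toy` data frame below is made up for illustration.

```r
# Made-up three-row example with one missing track name
toy <- data.frame(track_id   = c("t1", "t2", "t3"),
                  track_name = c("Song A", NA, "Song C"),
                  energy     = c(0.5, 0.7, 0.9),
                  stringsAsFactors = FALSE)

# Count of missing values per column
na_counts <- colSums(is.na(toy))

# Rows that contain at least one missing value
incomplete_rows <- toy[!complete.cases(toy), ]

na_counts
incomplete_rows
```

Applied to raw_data, the second step would show whether the five missing names, artists, and album names belong to the same five tracks.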

Convert duration to seconds

For readability and interpretability, the duration_ms variable is updated to be in seconds.

raw_data <- raw_data %>% 
  mutate(duration_s = duration_ms / 1000) %>% 
  select(-duration_ms)

Check for Outliers

Most of the variables have already been normalized or are not continuous; however, duration, tempo, and loudness are checked for outliers. The counts in Table 4 flag values that lie more than 2.5 times the interquartile range beyond the box (range = 2.5). While there are outliers in each variable, the decision is made to retain all data. This may need to be revisited after modeling and further refinement is undertaken.

raw_data %>% 
  select("duration_s", "tempo", "loudness") %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~key, scales = 'free') +
  geom_boxplot(outlier.color = 'red') + 
  coord_flip()

duration_outliers <- boxplot(raw_data$duration_s, plot = FALSE, range = 2.5)$out
loudness_outliers <- boxplot(raw_data$loudness, plot = FALSE, range = 2.5)$out
tempo_outliers <- boxplot(raw_data$tempo, plot = FALSE, range = 2.5)$out
data.frame(metric = c("duration_s", "loudness", "tempo"),
           count_outliers = c(length(duration_outliers),
                              length(loudness_outliers),
                              length(tempo_outliers))) %>% 
  kable("html", caption = "Table 4: Count of Outliers") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)

Table 4: Count of Outliers
metric	count_outliers
duration_s	403
loudness	215
tempo	5
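The range argument controls how far a value must sit from the box before it is flagged. The sketch below reproduces that rule by hand on a made-up vector of durations and compares it with `boxplot.stats()`, which `boxplot()` calls internally. The numbers in `x` are illustrative only.

```r
# Made-up durations in seconds with one very long track
x <- c(180, 195, 200, 205, 210, 215, 220, 225, 240, 600)

# boxplot() uses Tukey's hinges (from fivenum) rather than quantile()
hinges <- fivenum(x)[c(2, 4)]
iqr <- diff(hinges)

# Flag values more than 2.5 * IQR beyond the hinges
lower <- hinges[1] - 2.5 * iqr
upper <- hinges[2] + 2.5 * iqr
manual_out <- x[x < lower | x > upper]

# Same rule via the function boxplot() relies on
builtin_out <- boxplot.stats(x, coef = 2.5)$out

manual_out
builtin_out
```

The default for both functions is 1.5, so a value of 2.5 is a more permissive threshold that flags fewer points.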

Remove Duplicates

There are very few missing data points given the sample size of almost 33,000 observations. Checking the track_id, however, reveals a large number of duplicated songs. The 4,477 duplicated tracks are removed, keeping the first occurrence of each track_id.

clean_data <- raw_data %>%
  distinct(track_id, .keep_all = TRUE)
clean_data %>%
  select(c("track_name","track_album_name","playlist_genre","playlist_subgenre")) %>% 
  slice_sample(n = 10) %>% 
  kable(caption = "Table 5: Preview of Clean Data") %>% 
  kable_styling(bootstrap_options = "striped", full_width = T)

Table 5: Preview of Clean Data
track_name	track_album_name	playlist_genre	playlist_subgenre
God Damnit (with Call Me Karizma)	God Damnit (with Call Me Karizma)	edm	pop edm
Diva	Diva	edm	pop edm
Red Right Hand - From ‘Peaky Blinders’ Original Soundtrack	Red Right Hand (From ‘Peaky Blinders’ Original Soundtrack)	rock	permanent wave
The Way I Are (Dance with Somebody) (feat. Lil Wayne) - Spotify Version	The Way I Are (Dance With Somebody) [feat. Lil Wayne]	r&b	hip pop
Ready Or Not	The Very Best Of After 7	r&b	new jack swing
Outlaws	Outlaws	edm	electro house
Black Ice (Sky High) (feat. OutKast)	Still Standing	rap	southern hip hop
Sky	Powder Rooms, Vol. 1	pop	indie poptimism
minimal	minimal	pop	indie poptimism
Chill Cloud	Chill Cloud	r&b	hip pop
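The duplicate count can be checked before rows are dropped. The following is a base R sketch on a made-up data frame; it mirrors what `distinct(track_id, .keep_all = TRUE)` does, namely keeping the first row seen for each track_id.

```r
# Made-up example: track "a" appears in three playlists, "b" in two
toy <- data.frame(track_id       = c("a", "b", "a", "c", "b", "a"),
                  playlist_genre = c("pop", "rock", "edm", "rap", "pop", "latin"),
                  stringsAsFactors = FALSE)

# Number of rows that repeat an earlier track_id
n_dupes <- sum(duplicated(toy$track_id))

# Keep only the first occurrence of each track_id
deduped <- toy[!duplicated(toy$track_id), ]

n_dupes
deduped
```

One consequence worth noting: when the same track sits in playlists of different genres, only the genre of its first occurrence survives, so the retained label depends on row order.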

Exploratory Data Analysis

Based on the goal of developing a classification tool, identifying how genres differ from one another is key. Twelve variables are identified as potential features from early analysis.

Song Metrics

A density plot of these variables demonstrates that some genres, such as rap and rock, have potentially identifiable characteristics, but others do not have features that are immediately apparent.

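feature_metrics <- names(clean_data[12:23]) # names of the twelve song metrics

clean_data %>% 
  select(playlist_genre, all_of(feature_metrics)) %>%  # all_of() selects columns named in a character vector
  pivot_longer(cols = all_of(feature_metrics)) %>%     # one row per song-metric pair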
  ggplot(aes(value)) +
  geom_density(aes(color = playlist_genre)) +
  facet_wrap(~ name, ncol = 3, scales = "free") +
  labs(title = "Song Metrics by Genre", x = '', y = "density")

This visualization was inspired by Kaylin Pavlik’s analysis of Spotify data.

Metric Correlations

Prior to constructing a model, checking for correlated variables helps to avoid collinearity issues. There are only a small number of strong correlations, including a positive correlation between loudness and energy and a negative correlation between energy and acousticness. Both are intuitive given the definitions in the data dictionary.

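variable_corr <- cor(clean_data[12:23]) # pairwise Pearson correlations of the song metrics
corrplot(variable_corr,  method = "color", type = "upper",
         tl.col = "black", tl.srt = 45, title = "Correlation of Song Metrics",
         mar = c(0, 0, 2, 0)) # top margin leaves room so the title is not clipped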
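Reading strong pairs off a correlation plot by eye can be supplemented with a short filter over the correlation matrix. The sketch below uses synthetic data (generated with an arbitrary seed, not the Spotify values) and an assumed cutoff of 0.7 for what counts as "strong".

```r
# Synthetic stand-in data: loudness and acousticness are built from energy
set.seed(1)
energy <- runif(200)
toy <- data.frame(energy       = energy,
                  loudness     = -20 + 15 * energy + rnorm(200, sd = 1),
                  acousticness = 1 - energy + rnorm(200, sd = 0.1),
                  tempo        = rnorm(200, mean = 120, sd = 20))

corr <- cor(toy)

# Keep each pair once by blanking the lower triangle and the diagonal
corr[lower.tri(corr, diag = TRUE)] <- NA

# Long format: one row per variable pair, correlation stored in Freq
pairs <- as.data.frame(as.table(corr))
strong <- pairs[!is.na(pairs$Freq) & abs(pairs$Freq) > 0.7, ]

strong
```

The same filter could be run on variable_corr to list the strongly correlated metric pairs explicitly.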

Modeling

There are a large number of models that can be considered for machine learning classification. Initial models considered included XGBoost, random forest, decision tree, and K-nearest neighbor. For this initial effort, K-nearest neighbor was selected due to its flexibility and easy deployment.

Data Preparation

The cleaned data is randomized, normalized, and split into training and testing datasets.

model_data <- clean_data[c(10, 12:23)] # extract genre (column 10) and the twelve metrics
random <- sample(seq_len(nrow(model_data)), 0.8 * nrow(model_data)) # randomly draw row indices for 80% of the dataset
nor <- function(x) { (x - min(x)) / (max(x) - min(x)) } # min-max normalization to the range [0, 1]
data_norm <- as.data.frame(lapply(model_data[, 2:13], nor)) # normalize the metric columns
spotify_train <- data_norm[random, ] # training data: the sampled 80% of rows
spotify_test <- data_norm[-random, ] # test data: the remaining 20% of rows
spotify_train_target <- model_data[random, 1] # genre labels for the training rows
spotify_test_target <- model_data[-random, 1] # genre labels for assessing model accuracy
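To make the normalization and split steps concrete, here is a small worked sketch. The tempo values are made up, and the seed of 42 is an arbitrary choice added here so the split can be reproduced; the original code does not set one.

```r
# Min-max normalization: smallest value maps to 0, largest to 1
nor <- function(x) (x - min(x)) / (max(x) - min(x))

tempo_bpm <- c(60, 90, 120, 180)  # made-up tempos
tempo_scaled <- nor(tempo_bpm)
tempo_scaled

# Reproducible 80/20 split of ten row indices
set.seed(42)  # arbitrary seed, an assumption for this sketch
n <- 10
train_idx <- sample(seq_len(n), size = 0.8 * n)
test_idx <- setdiff(seq_len(n), train_idx)

length(train_idx)
length(test_idx)
```

Scaling matters for K-nearest neighbor because distances would otherwise be dominated by variables with large raw ranges, such as duration and tempo.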

K-Nearest Neighbor

The k-nearest neighbor algorithm works well in this application because it is simple to implement, does not require an explicit training step, and extends naturally to multi-class problems, unlike algorithms designed primarily for binary classification.

A potential downside of this algorithm is its compute demand at prediction time, which grows as K is increased. The model can also struggle when one category is over-represented in the training dataset, which can lead to the most common genre being over-predicted.
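The idea behind the algorithm can be sketched in a few lines of base R: measure the distance from a new song to every training song, take the K closest, and let them vote. `knn_predict_one` is a hypothetical helper for illustration, not the `class::knn` implementation, and the feature values are made up.

```r
# Hypothetical helper: classify one query point by majority vote
knn_predict_one <- function(train, labels, query, k) {
  # Euclidean distance from the query to every training row
  dists <- sqrt(rowSums(sweep(as.matrix(train), 2, query)^2))
  nearest <- order(dists)[seq_len(k)]   # indices of the k closest rows
  votes <- table(labels[nearest])       # count labels among the neighbors
  names(votes)[which.max(votes)]        # most frequent label (first on ties)
}

# Made-up, already-normalized features for six songs
train <- data.frame(speechiness  = c(0.05, 0.08, 0.10, 0.30, 0.35, 0.40),
                    danceability = c(0.40, 0.45, 0.50, 0.75, 0.80, 0.70))
labels <- c("rock", "rock", "rock", "rap", "rap", "rap")

pred <- knn_predict_one(train, labels, query = c(0.32, 0.78), k = 3)
pred
```

This sketch resolves tied votes by taking the first label, whereas `class::knn` breaks ties at random, so results can differ in borderline cases.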

knn_model <- knn(spotify_train, spotify_test, cl = spotify_train_target, k = 250) # classify each test song by vote of its 250 nearest training songs
confusion_matrix <- table(knn_model, spotify_test_target) # rows = predicted genre, columns = actual genre
accuracy <- function(x) { sum(diag(x)) / sum(x) * 100 } # percent of songs on the diagonal (correctly classified)
knn_model_accuracy <- accuracy(confusion_matrix) # assess accuracy of model

# divide each row by its total so that each predicted genre sums to 100%
kable(round((confusion_matrix/rowSums(confusion_matrix))*100,1), "html", caption = "Table 6: Confusion Matrix (Percent Classified by Model)") %>% 
  kable_styling(bootstrap_options = "striped", full_width = T)

Table 6: Confusion Matrix (Percent Classified by Model)
	edm	latin	pop	r&b	rap	rock
edm	49.8	7.8	16.4	3.3	11.0	11.6
latin	7.5	39.3	13.5	16.7	15.4	7.5
pop	14.0	17.1	31.0	12.9	12.4	12.6
r&b	3.4	11.4	15.2	42.8	12.9	14.5
rap	4.3	10.6	9.2	16.8	57.5	1.5
rock	5.9	4.3	17.5	8.5	4.3	59.4
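In Table 6 the rows are the genre predicted by the model and the columns are the actual genre, and each row is scaled to sum to 100%. The diagonal therefore answers "of the songs labeled X, how many really are X" (precision). Dividing by column totals instead answers "of the songs that really are X, how many were found" (recall). The sketch below shows both on a made-up three-genre matrix of counts.

```r
# Made-up counts: rows = predicted genre, columns = actual genre
cm <- matrix(c(30,  5,  0,
               15, 40, 10,
                5,  5, 40),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = c("edm", "pop", "rock"),
                             actual    = c("edm", "pop", "rock")))

precision <- diag(cm) / rowSums(cm)  # share of each predicted genre that is correct
recall    <- diag(cm) / colSums(cm)  # share of each actual genre that is recovered
overall   <- sum(diag(cm)) / sum(cm) # overall accuracy

round(precision, 3)
round(recall, 3)
round(overall, 3)
```

In this toy example pop has the lowest precision but a high recall, which is the pattern to look for when a model over-predicts one genre.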

Summary

Conclusions

Starting from a random guess of 1 out of 6 (16.67%), the model is able to achieve 45.13% accuracy. Using machine learning, an improvement of roughly 2.7 times over random guessing is realized; however, this still leaves significant room for improvement. Utilizing K-nearest neighbors, the model incorrectly classifies over half of the songs in the test dataset.

Certain genres are even less accurate in classification. Because each row of Table 6 is a predicted genre, the diagonal shows how often a predicted label is correct. The pop genre is very hard to classify: only about 31% of the songs the model labels as pop are actually pop. One potential explanation for this is that pop does not have any characteristics that are significantly different from the genres it is being compared against. This can be observed in the density plots broken down by genre. Conversely, songs labeled rock and rap are correct nearly 60% of the time (59.4% and 57.5%). The rap genre has a strong speechiness distribution that sets it apart from other genres, and rock similarly has a danceability distribution that helps set it apart.

The model accomplished the goal of providing a tool to classify the songs, but there is room for improvement. Future investigation of this problem could include testing additional models, such as random forest and decision trees, or looking at other groupings of genres.