Introduction

The desire to group items into buckets of similar characteristics is an ingrained interest all humans experience at a very young age. Animals, colored blocks, or anything else a child can get their hands on can be sorted and categorized. As we listen to music the same innate desire occurs. This is where genres come into play. As a human listens to a song it may be patently clear what genre it belongs in, but it can be much more difficult to objectively identify why it belongs in a specific genre. Even more challenging is the task of classifying a large number of songs into appropriate genres.

As the world’s most popular audio streaming subscription service, Spotify has an incredible amount of data and insight into how we listen to music. As of 2021, Spotify has over 70,000,000 musical tracks. This project will explore a dataset of over 30,000 songs within six genres.

Machine Learning & Classification Analysis

This project will explore a dataset of approximately 30,000 songs that fall into six different genres (rap, rock, latin, pop, R&B, and EDM). After cleaning and visualizing the data, machine learning algorithms will be used to build a model that can classify the songs into their correct genres based on the song metrics. With six categories, there is a 16.67% chance of randomly guessing a song’s genre. The goal is to far outperform random chance and build a model to automate the classification of these songs.
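The 16.67% figure is the chance of a uniform random guess across six classes. The short sketch below reproduces that arithmetic and, as an aside, shows how a majority-class baseline could be computed if the genres were not evenly represented. The `toy_genres` vector is illustrative only and is not drawn from the Spotify data.

```r
# Chance-level accuracy when guessing uniformly among six genres
n_genres <- 6
uniform_baseline <- 100 / n_genres
round(uniform_baseline, 2)

# Toy labels (hypothetical, for illustration only)
toy_genres <- c("rap", "rap", "rap", "rock", "pop", "edm")

# Accuracy obtained by always predicting the most common label
majority_baseline <- 100 * max(table(toy_genres)) / length(toy_genres)
majority_baseline
```

If the genre counts in the real data are uneven, the majority-class baseline is the more demanding comparison for a classifier.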

Packages

The following packages are required to run the included code and analysis:

library(tidyverse)    # creating clean and tidy data
library(dplyr)        # transforming data
library(ggplot2)      # data visualization
library(kableExtra)   # creation of complex tables
library(knitr)        # dynamic report generation
library(purrr)        # functional programming toolkit
library(corrplot)     # correlation matrix visualization
library(class)        # classification functions including K-nearest neighbor
theme_set(theme_bw()) # set default plot theme

Data Import and Preparation

Obtaining the Data

A CSV of the data can be found here. A data dictionary and additional context can be found on the tidytuesday GitHub repository. This data set and data dictionary were used for a January 2020 edition of the tidytuesday weekly data project.

Data Dictionary

kable(data_dict, caption = "Table 1: Data Dictionary") %>% 
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 1: Data Dictionary
variable_names | class | description
track_id | character | Song unique ID
track_name | character | Song Name
track_artist | character | Song Artist
track_popularity | double | Song Popularity (0-100) where higher is better
track_album_id | character | Album unique ID
track_album_name | character | Song album name
track_album_release_date | character | Date when album released
playlist_name | character | Name of playlist
playlist_id | character | Playlist ID
playlist_genre | character | Playlist genre
playlist_subgenre | character | Playlist subgenre
danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C sharp/D flat, 2 = D, and so on. If no key was detected, the value is -1.
loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness | double | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms | double | Duration of song in milliseconds

Import Raw Data

After obtaining the CSV file, it is read into a data frame using the read.csv function.

raw_data <- read.csv("spotify_songs.csv", stringsAsFactors = FALSE)
raw_data %>% 
  select(c("track_name","track_album_name","playlist_genre","playlist_subgenre")) %>% 
  slice_sample(n = 10) %>% 
  kable(caption = "Table 2: Preview of Raw Data") %>% 
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 2: Preview of Raw Data
track_name | track_album_name | playlist_genre | playlist_subgenre
Jumbo | Jumbo | edm | electro house
Share the World with Me, Pt. 4 | Sceneries | rock | album rock
Where The Hood At | Grand Champ | rap | gangster rap
No Strings Attached | Outsidein | edm | electro house
Mala Costumbre | Numerología | latin | reggaeton
Drop It | Drop It | edm | big room
Immortal - Edited Version | Immortal | rap | southern hip hop
Mascara | Mascara | latin | latin hip hop
Praise You (Chill Mix) | Believe In The Kingdom | latin | tropical
Dig | Sobville (Episode I) | rap | trap

Data Cleaning

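The data is fairly clean and tidy and does not require significant restructuring. Checks for missing values, duplicates, and outliers are completed.

Five values are missing in each of track_name, track_artist, and track_album_name. Because these fields are descriptive and are not used as model features, the missing values are not considered significant and are not adjusted or removed.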

kable(sapply(raw_data, function(x) sum(is.na(x))),
      caption = "Table 3: Count of missing values by variable") %>% 
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 3: Count of missing values by variable
variable | missing_count
track_id | 0
track_name | 5
track_artist | 5
track_popularity | 0
track_album_id | 0
track_album_name | 5
track_album_release_date | 0
playlist_name | 0
playlist_id | 0
playlist_genre | 0
playlist_subgenre | 0
danceability | 0
energy | 0
key | 0
loudness | 0
mode | 0
speechiness | 0
acousticness | 0
instrumentalness | 0
liveness | 0
valence | 0
tempo | 0
duration_ms | 0
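Per-column counts do not show whether the missing values fall in the same rows. A minimal sketch of how that could be checked, using a small made-up data frame (`toy`) in place of raw_data:

```r
# Toy data frame standing in for raw_data (values are illustrative only)
toy <- data.frame(
  track_name   = c("a", NA, "c", NA),
  track_artist = c("x", NA, "z", NA),
  tempo        = c(120, 95, 128, 110),
  stringsAsFactors = FALSE
)

# Missing values per column (same information as the sapply() call)
missing_by_column <- colSums(is.na(toy))

# Number of rows that contain at least one missing value
rows_with_missing <- sum(!complete.cases(toy))

missing_by_column
rows_with_missing
```

When the row count equals the per-column count, the gaps all belong to the same handful of records.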

Convert duration to seconds

For readability and interpretability, the duration_ms variable is updated to be in seconds.

raw_data <- raw_data %>% 
  mutate(duration_s = duration_ms / 1000) %>% 
  select(-duration_ms)

Check for Outliers

Most of the variables have already been normalized or are not continuous; however, duration, tempo, and loudness are checked for outliers. Outliers are counted as values that fall more than 2.5 times the interquartile range beyond the box of a boxplot. While there are outliers in each variable, the decision is made to retain all data. This may need to be revisited after modeling, when further refinement is undertaken.

raw_data %>% 
  select("duration_s", "tempo", "loudness") %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~key, scales = 'free') +
  geom_boxplot(outlier.color = 'red') + 
  coord_flip()

duration_outliers <- boxplot(raw_data$duration_s, plot = FALSE, range = 2.5)$out
loudness_outliers <- boxplot(raw_data$loudness, plot = FALSE, range = 2.5)$out
tempo_outliers <- boxplot(raw_data$tempo, plot = FALSE, range = 2.5)$out
data.frame(metric = c("duration_s", "loudness", "tempo"),
           count_outliers = c(length(duration_outliers),
                              length(loudness_outliers),
                              length(tempo_outliers))) %>% 
  kable("html", caption = "Table 4: Count of Outliers") %>% 
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 4: Count of Outliers
metric | count_outliers
duration_s | 403
loudness | 215
tempo | 5
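The `range = 2.5` argument controls how far the whiskers extend from the box, in multiples of the interquartile range, before a point is reported as an outlier. The sketch below illustrates the effect on a small made-up vector (not the Spotify data), comparing the default multiplier of 1.5 with 2.5 via `boxplot.stats()`, which is the function `boxplot()` relies on for these statistics.

```r
# Toy vector (illustrative only): twenty ordinary values and two large ones
x <- c(1:20, 35, 100)

# Default rule: points beyond 1.5 * IQR from the hinges
out_default <- boxplot.stats(x, coef = 1.5)$out

# Wider rule matching range = 2.5 in the analysis
out_wide <- boxplot.stats(x, coef = 2.5)$out

# The same result through boxplot(), as used in the analysis
out_wide_boxplot <- boxplot(x, plot = FALSE, range = 2.5)$out

out_default
out_wide
```

A larger multiplier flags fewer points, so the counts in Table 4 are more conservative than the default boxplot rule would give.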

Remove Duplicates

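There are very few missing data points given the sample size of almost 33,000 observations. Checking track_id, however, reveals a large number of duplicated songs, most likely because the same track appears in more than one playlist. The 4,477 duplicated tracks are removed, keeping the first occurrence of each track_id.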

clean_data <- raw_data %>%
  distinct(track_id, .keep_all = TRUE)
clean_data %>%
  select(c("track_name","track_album_name","playlist_genre","playlist_subgenre")) %>% 
  slice_sample(n = 10) %>% 
  kable(caption = "Table 5: Preview of Clean Data") %>% 
  kable_styling(bootstrap_options = "striped", full_width = TRUE)
Table 5: Preview of Clean Data
track_name | track_album_name | playlist_genre | playlist_subgenre
God Damnit (with Call Me Karizma) | God Damnit (with Call Me Karizma) | edm | pop edm
Diva | Diva | edm | pop edm
Red Right Hand - From ‘Peaky Blinders’ Original Soundtrack | Red Right Hand (From ‘Peaky Blinders’ Original Soundtrack) | rock | permanent wave
The Way I Are (Dance with Somebody) (feat. Lil Wayne) - Spotify Version | The Way I Are (Dance With Somebody) [feat. Lil Wayne] | r&b | hip pop
Ready Or Not | The Very Best Of After 7 | r&b | new jack swing
Outlaws | Outlaws | edm | electro house
Black Ice (Sky High) (feat. OutKast) | Still Standing | rap | southern hip hop
Sky | Powder Rooms, Vol. 1 | pop | indie poptimism
minimal | minimal | pop | indie poptimism
Chill Cloud | Chill Cloud | r&b | hip pop
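Before dropping duplicates, it can be useful to count them and to check whether repeated tracks carry conflicting genre labels, since only the first label survives. A minimal base R sketch on a made-up data frame (`toy`, hypothetical values):

```r
# Toy data standing in for raw_data (illustrative only)
toy <- data.frame(
  track_id       = c("t1", "t2", "t1", "t3", "t2"),
  playlist_genre = c("pop", "rock", "edm", "rap", "rock"),
  stringsAsFactors = FALSE
)

# Number of rows that repeat an earlier track_id
n_duplicates <- sum(duplicated(toy$track_id))

# Keep the first row per track_id (base R counterpart of
# distinct(track_id, .keep_all = TRUE))
deduped <- toy[!duplicated(toy$track_id), ]

# Tracks whose repeated rows disagree on genre
genres_per_track <- tapply(toy$playlist_genre, toy$track_id,
                           function(g) length(unique(g)))
conflicting_ids <- names(genres_per_track[genres_per_track > 1])

n_duplicates
conflicting_ids
```

Tracks listed in `conflicting_ids` are inherently ambiguous training examples, which may be worth keeping in mind when reading the classification results.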

Exploratory Data Analysis

Based on the goal of developing a classification tool, identifying how genres differ from one another is key. Twelve variables are identified as potential features from early analysis.

Song Metrics

A density plot of these variables demonstrates that some genres, such as rap and rock, have potentially identifiable characteristics, but others do not have features that are immediately apparent.

feature_metrics <- names(clean_data[12:23])

clean_data %>% 
  select(playlist_genre, all_of(feature_metrics)) %>% # all_of() avoids the tidyselect warning for external vectors
  pivot_longer(cols = all_of(feature_metrics)) %>% 
  ggplot(aes(value)) +
  geom_density(aes(color = playlist_genre)) +
  facet_wrap(~ name, ncol = 3, scales = "free") +
  labs(title = "Song Metrics by Genre", x = '', y = "density")

This visualization was inspired by Kaylin Pavlik’s analysis of Spotify data.

Metric Correlations

Prior to constructing a model, checking for correlated variables helps to avoid collinearity issues. There are only a small number of strong correlations, including a positive correlation between loudness and energy and an inverse correlation between energy and acousticness. Both are intuitively expected from the definitions in the data dictionary.

variable_corr <- cor(clean_data[12:23])
corrplot(variable_corr,  method = "color", type = "upper",
         tl.col = "black", tl.srt = 45, title = "Correlation of Song Metrics")
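Alongside the plot, the strongest pairs can be listed numerically. The sketch below does this for a small made-up data frame; the values in `toy` and the 0.6 cut-off (`threshold`) are assumptions chosen for illustration, not figures from the analysis.

```r
# Toy metrics (illustrative values only, not the Spotify data)
toy <- data.frame(
  energy       = c(0.2, 0.4, 0.6, 0.8, 0.9),
  loudness     = c(-20, -14, -9, -5, -4),
  acousticness = c(0.9, 0.7, 0.4, 0.2, 0.1),
  tempo        = c(120, 90, 130, 100, 110)
)

corr <- cor(toy)
threshold <- 0.6 # hypothetical cut-off for a "strong" correlation

# Look only at the upper triangle so each pair is reported once
idx <- which(abs(corr) > threshold & upper.tri(corr), arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = round(corr[idx], 2),
  stringsAsFactors = FALSE
)
strong_pairs
```

Applied to the twelve song metrics, the same approach would give a short table of pairs to review before modeling.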

Modeling

There are a large number of models that can be considered for machine learning classification. Initial models considered included XGBoost, random forest, decision tree, and K-nearest neighbor. For this initial effort, K-nearest neighbor was selected due to its flexibility and easy deployment.

Data Preparation

The cleaned data is randomized, normalized, and split into training and testing datasets.

model_data <- clean_data[c(10, 12:23)] # extract genre (column 10) and the twelve metrics
random <- sample(1:nrow(model_data), 0.8 * nrow(model_data)) # randomly sample 80% of the row indices for training
nor <- function(x) { (x - min(x)) / (max(x) - min(x)) } # min-max normalization to the range [0, 1]
data_norm <- as.data.frame(lapply(model_data[, 2:13], nor)) # normalize the metric columns
spotify_train <- data_norm[random, ] # training features
spotify_test <- data_norm[-random, ] # test features
spotify_train_target <- model_data[random, 1] # genre labels for model training
spotify_test_target <- model_data[-random, 1] # genre labels for assessing accuracy
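To make the two preparation steps concrete, the sketch below applies min-max normalization to a tiny vector and draws a reproducible 80/20 split. The seed value and the ten-row size are assumptions for illustration; the original analysis does not state a seed, so its exact split cannot be reproduced this way.

```r
# Min-max normalization: rescales values to the range [0, 1]
nor <- function(x) (x - min(x)) / (max(x) - min(x))
nor(c(60, 120, 180))

# Reproducible 80/20 split of row indices
set.seed(42) # hypothetical seed, not taken from the analysis
n <- 10      # toy number of rows
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
test_idx <- setdiff(seq_len(n), train_idx)

train_idx
test_idx
```

Setting a seed before `sample()` means the reported accuracy can be regenerated exactly on a later run. Note that `nor()` returns NaN for a constant column, because the denominator is zero.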

K-Nearest Neighbor

The k-nearest neighbor algorithm works well in this application because its implementation is simple, it does not require an explicit training step, and it handles multi-class problems naturally, unlike algorithms designed primarily for binary classification.

A potential downside of this algorithm is the compute demand: every prediction requires distances to all training songs, and the cost grows further as K is increased. Similarly, the model can struggle if one category is over-represented in the training dataset, which can lead to the most common genre being over-predicted.
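The choice of K is a tuning decision. One way to explore it is to compare held-out accuracy across several candidate values. The sketch below does this on R's built-in iris data as a stand-in for the song metrics; the candidate values, the seed, and the use of iris are all assumptions for illustration and say nothing about which K suits the Spotify data.

```r
library(class) # provides knn()

set.seed(1) # hypothetical seed for a reproducible split
features <- scale(iris[, 1:4]) # standardize the four numeric columns
labels <- iris$Species

train_idx <- sample(seq_len(nrow(iris)), size = round(0.8 * nrow(iris)))

candidate_k <- c(1, 5, 15, 45) # hypothetical grid of K values

# Held-out accuracy (percent) for each candidate K
accuracy_by_k <- sapply(candidate_k, function(k) {
  pred <- knn(features[train_idx, ], features[-train_idx, ],
              cl = labels[train_idx], k = k)
  mean(pred == labels[-train_idx]) * 100
})
names(accuracy_by_k) <- candidate_k

best_k <- candidate_k[which.max(accuracy_by_k)]
accuracy_by_k
```

In practice, K is better chosen on a separate validation split or by cross-validation, so that the test set remains untouched for the final accuracy estimate.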

knn_model <- knn(spotify_train, spotify_test, cl = spotify_train_target, k = 250) # classify each test song by majority vote of its 250 nearest training songs
confusion_matrix <- table(knn_model, spotify_test_target) # rows: predicted genre, columns: actual genre
accuracy <- function(x) { sum(diag(x)) / sum(x) * 100 } # percent of songs classified correctly
knn_model_accuracy <- accuracy(confusion_matrix) # assess accuracy of model

kable(round((confusion_matrix / rowSums(confusion_matrix)) * 100, 1), "html", caption = "Table 6: Confusion Matrix (Percent Classified by Model)") %>% 
  kable_styling(bootstrap_options = "striped", full_width = TRUE)
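Table 6: Confusion Matrix (Percent Classified by Model)
Rows are the genre predicted by the model and columns are the actual genre; each row sums to approximately 100%.
predicted | edm | latin | pop | r&b | rap | rock
edm | 49.8 | 7.8 | 16.4 | 3.3 | 11.0 | 11.6
latin | 7.5 | 39.3 | 13.5 | 16.7 | 15.4 | 7.5
pop | 14.0 | 17.1 | 31.0 | 12.9 | 12.4 | 12.6
r&b | 3.4 | 11.4 | 15.2 | 42.8 | 12.9 | 14.5
rap | 4.3 | 10.6 | 9.2 | 16.8 | 57.5 | 1.5
rock | 5.9 | 4.3 | 17.5 | 8.5 | 4.3 | 59.4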
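Because the confusion matrix is built as table(predicted, actual) and divided by its row sums, the diagonal of Table 6 shows, for each predicted genre, the share of those predictions that were correct (precision). Dividing by column sums instead gives the share of each actual genre that the model found (recall). A minimal sketch on a made-up three-class matrix (`cm`, hypothetical counts):

```r
# Toy confusion matrix (illustrative counts only)
# rows = predicted class, columns = actual class
cm <- matrix(c(8, 2, 1,
               1, 6, 3,
               1, 2, 6),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = c("rap", "rock", "pop"),
                             actual    = c("rap", "rock", "pop")))

# Precision: of the songs predicted as a genre, percent that truly are
precision <- diag(cm) / rowSums(cm) * 100

# Recall: of the songs truly in a genre, percent the model labeled so
recall <- diag(cm) / colSums(cm) * 100

# Overall accuracy: percent of all songs classified correctly
overall_accuracy <- sum(diag(cm)) / sum(cm) * 100

round(precision, 1)
round(recall, 1)
```

Reporting both views can show whether a genre is rarely predicted correctly, rarely detected, or both.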

Summary

Conclusions

Starting from a random-guess baseline of 1 in 6 (16.67%), the model achieves 45.13% accuracy. This is roughly 2.7 times better than random guessing; however, it still leaves significant room for improvement, since the K-nearest neighbor model misclassifies over half of the songs in the test dataset.

Certain genres are classified even less reliably. Pop is the hardest: only about 31% of the songs the model labeled as pop were actually pop (Table 6). One potential explanation is that pop does not have any characteristics that differ significantly from the genres it is being compared against, which can be observed in the density plots broken down by genre. Conversely, nearly 60% of the songs labeled as rap or rock were labeled correctly. The rap genre has a distinctive speechiness distribution that sets it apart from other genres, and rock similarly has a danceability distribution that helps to set it apart.

The model accomplished the goal of providing a tool to classify the songs, but there is room for improvement. Future investigation of this problem could include testing additional models, such as random forest and decision trees, or looking at other groupings of genres.