The desire to group items into buckets of similar characteristics is an ingrained interest all humans experience at a very young age. Animals, colored blocks, or anything else a child can get their hands on can be sorted and categorized. As we listen to music the same innate desire occurs. This is where genres come into play. As a human listens to a song it may be patently clear what genre it belongs in, but it can be much more difficult to objectively identify why it belongs in a specific genre. Even more challenging is the task of classifying a large number of songs into appropriate genres.
As the world’s most popular audio streaming subscription service, Spotify has an incredible amount of data and insight into how we listen to music. As of 2021, Spotify has over 70,000,000 musical tracks.
This project will explore a dataset of approximately 30,000 songs that fall into six different genres (rap, rock, latin, pop, R&B, and EDM). After cleaning and visualizing the data, machine learning algorithms will be used with the goal of building a model that can classify the songs into their correct genres based on the song metrics. With six categories, there is a 16.67% chance of randomly guessing a song’s genre. The goal is to far outperform random chance and build a model to automate the classification of these songs.
The following packages are required to run the included code and analysis:
library(tidyverse) # creating clean and tidy data
library(dplyr) # transforming data
library(ggplot2) # data visualization
library(kableExtra) # creation of complex tables
library(knitr) # dynamic report generation
library(purrr) # functional programming toolkit
library(corrplot) # correlation matrix visualization
library(class) # classification functions including K-nearest neighbor
theme_set(theme_bw()) # set default plot theme
A CSV of the data can be found here. A data dictionary and additional context can be found on the tidytuesday GitHub repository. This data set and data dictionary were used for a January 2020 edition of the TidyTuesday weekly data project.
kable(data_dict, caption = "Table 1: Data Dictionary") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| variable_names | class | description |
|---|---|---|
| track_id | character | Song unique ID |
| track_name | character | Song Name |
| track_artist | character | Song Artist |
| track_popularity | double | Song Popularity (0-100) where higher is better |
| track_album_id | character | Album unique ID |
| track_album_name | character | Song album name |
| track_album_release_date | character | Date when album released |
| playlist_name | character | Name of playlist |
| playlist_id | character | Playlist ID |
| playlist_genre | character | Playlist genre |
| playlist_subgenre | character | Playlist subgenre |
| danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C sharp/D flat, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | double | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | double | Duration of song in milliseconds |
After obtaining the CSV file, it is ingested using the read.csv command.
raw_data <- read.csv("spotify_songs.csv", stringsAsFactors = FALSE)
raw_data %>%
select(c("track_name","track_album_name","playlist_genre","playlist_subgenre")) %>%
slice_sample(n = 10) %>%
kable(caption = "Table 2: Preview of Raw Data") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| track_name | track_album_name | playlist_genre | playlist_subgenre |
|---|---|---|---|
| Jumbo | Jumbo | edm | electro house |
| Share the World with Me, Pt. 4 | Sceneries | rock | album rock |
| Where The Hood At | Grand Champ | rap | gangster rap |
| No Strings Attached | Outsidein | edm | electro house |
| Mala Costumbre | Numerología | latin | reggaeton |
| Drop It | Drop It | edm | big room |
| Immortal - Edited Version | Immortal | rap | southern hip hop |
| Mascara | Mascara | latin | latin hip hop |
| Praise You (Chill Mix) | Believe In The Kingdom | latin | tropical |
| Dig | Sobville (Episode I) | rap | trap |
The data is fairly clean and tidy and does not require significant restructuring. Checks for missing values, duplicates, and outliers are completed.
Only track_name, track_artist, and track_album_name contain missing values, five each. These are not considered significant and are not adjusted or removed.
kable(sapply(raw_data, function(x) sum(is.na(x))),
caption = "Table 3: Count of missing values by variable") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| variable | missing values |
|---|---|
| track_id | 0 |
| track_name | 5 |
| track_artist | 5 |
| track_popularity | 0 |
| track_album_id | 0 |
| track_album_name | 5 |
| track_album_release_date | 0 |
| playlist_name | 0 |
| playlist_id | 0 |
| playlist_genre | 0 |
| playlist_subgenre | 0 |
| danceability | 0 |
| energy | 0 |
| key | 0 |
| loudness | 0 |
| mode | 0 |
| speechiness | 0 |
| acousticness | 0 |
| instrumentalness | 0 |
| liveness | 0 |
| valence | 0 |
| tempo | 0 |
| duration_ms | 0 |
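The per-column counts above can also be produced with `colSums(is.na(...))`, and `complete.cases()` shows how many rows are affected. The following is a minimal sketch on a made-up three-row data frame; the `toy` data is an assumption for illustration and is not the Spotify file.

```r
# Toy data frame standing in for raw_data (hypothetical values)
toy <- data.frame(
  track_name   = c("A", NA, "C"),
  track_artist = c("X", NA, "Z"),
  tempo        = c(120, 98, 140),
  stringsAsFactors = FALSE
)

# Missing values per column (same counts the sapply() approach would give)
na_by_column <- colSums(is.na(toy))

# Number of rows containing at least one missing value
rows_with_na <- sum(!complete.cases(toy))

print(na_by_column)
print(rows_with_na)
```

Counting affected rows as well as affected cells helps distinguish a few incomplete records from missingness spread across the whole dataset.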
For readability and interpretability, the duration_ms variable is updated to be in seconds.
raw_data <- raw_data %>%
mutate(duration_s = duration_ms / 1000) %>%
select(-duration_ms)
Most of the variables have already been normalized or are not continuous; however, duration, tempo, and loudness are checked for outliers. Outliers are counted using a boxplot rule with the whisker range widened to 2.5 times the interquartile range. While there are outliers in each variable, the decision is made to retain all data. This may need to be revisited after modeling, when further refinement is undertaken.
raw_data %>%
select("duration_s", "tempo", "loudness") %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scales = 'free') +
geom_boxplot(outlier.color = 'red') +
coord_flip()
duration_outliers <- boxplot(raw_data$duration_s, plot = FALSE, range = 2.5)$out
loudness_outliers <- boxplot(raw_data$loudness, plot = FALSE, range = 2.5)$out
tempo_outliers <- boxplot(raw_data$tempo, plot = FALSE, range = 2.5)$out
data.frame(metric = c("duration_s", "loudness", "tempo"),
count_outliers = c(length(duration_outliers),
length(loudness_outliers),
length(tempo_outliers))) %>%
kable("html", caption = "Table 4: Count of Outliers") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| metric | count_outliers |
|---|---|
| duration_s | 403 |
| loudness | 215 |
| tempo | 5 |
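The `range = 2.5` argument widens the usual 1.5 × IQR boxplot rule, so a point is flagged only when it lies more than 2.5 times the hinge spread beyond the hinges. A minimal sketch on a made-up vector (not the Spotify data) shows the rule written out by hand next to the `boxplot()` result:

```r
# Made-up values with one extreme point (not the Spotify data)
x <- c(1:20, 100)

# Outliers as flagged by boxplot() with the wider 2.5 multiplier
flagged <- boxplot(x, plot = FALSE, range = 2.5)$out

# The same rule written out by hand using Tukey's hinges
hinges <- fivenum(x)[c(2, 4)]   # lower and upper hinge
spread <- diff(hinges)          # hinge spread (approximately the IQR)
lower  <- hinges[1] - 2.5 * spread
upper  <- hinges[2] + 2.5 * spread
manual <- x[x < lower | x > upper]

print(flagged)
print(manual)
```

A larger multiplier flags fewer points, which is consistent with the decision to retain all observations at this stage.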
There are very few missing data points given the sample size of almost 33,000 observations. Checking track_id reveals a large number of duplicated songs. The 4,477 duplicated tracks are removed, keeping the first occurrence of each track_id.
clean_data <- raw_data %>%
distinct(track_id, .keep_all = TRUE)
clean_data %>%
select(c("track_name","track_album_name","playlist_genre","playlist_subgenre")) %>%
slice_sample(n = 10) %>%
kable(caption = "Table 5: Preview of Clean Data") %>%
kable_styling(bootstrap_options = "striped", full_width = T)
| track_name | track_album_name | playlist_genre | playlist_subgenre |
|---|---|---|---|
| God Damnit (with Call Me Karizma) | God Damnit (with Call Me Karizma) | edm | pop edm |
| Diva | Diva | edm | pop edm |
| Red Right Hand - From ‘Peaky Blinders’ Original Soundtrack | Red Right Hand (From ‘Peaky Blinders’ Original Soundtrack) | rock | permanent wave |
| The Way I Are (Dance with Somebody) (feat. Lil Wayne) - Spotify Version | The Way I Are (Dance With Somebody) [feat. Lil Wayne] | r&b | hip pop |
| Ready Or Not | The Very Best Of After 7 | r&b | new jack swing |
| Outlaws | Outlaws | edm | electro house |
| Black Ice (Sky High) (feat. OutKast) | Still Standing | rap | southern hip hop |
| Sky | Powder Rooms, Vol. 1 | pop | indie poptimism |
| minimal | minimal | pop | indie poptimism |
| Chill Cloud | Chill Cloud | r&b | hip pop |
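The duplicate count quoted earlier can be checked with `duplicated()` before calling `distinct()`. The sketch below uses a made-up five-row data frame (an assumption for illustration) in which one track appears in several playlists:

```r
library(dplyr)

# Toy stand-in for raw_data: track "t1" appears in three playlists (hypothetical)
toy <- data.frame(
  track_id       = c("t1", "t2", "t1", "t3", "t1"),
  playlist_genre = c("pop", "rock", "edm", "rap", "latin"),
  stringsAsFactors = FALSE
)

# Rows whose track_id has already appeared earlier in the data
n_duplicates <- sum(duplicated(toy$track_id))

# distinct() keeps the first row for each track_id, so the genre label
# retained for "t1" is whichever playlist happened to come first
deduped <- distinct(toy, track_id, .keep_all = TRUE)

print(n_duplicates)
print(deduped)
```

Because only the first row per track is kept, a song that sits in playlists of several genres retains a single genre label determined by row order, which is worth keeping in mind when interpreting classification errors.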
Based on the goal of developing a classification tool, identifying how genres differ from one another is key. Twelve variables are identified as potential features from early analysis.
A density plot of these variables demonstrates that some genres, such as rap and rock, have potentially identifiable characteristics, while others do not have features that are immediately apparent.
feature_metrics <- names(clean_data[12:23])
clean_data %>%
select(playlist_genre, all_of(feature_metrics)) %>% # all_of() selects columns named in a character vector
pivot_longer(cols = all_of(feature_metrics)) %>%
ggplot(aes(value)) +
geom_density(aes(color = playlist_genre)) +
facet_wrap(~ name, ncol = 3, scales = "free") +
labs(title = "Song Metrics by Genre", x = '', y = "density")
This visualization was inspired by Kaylin Pavlik’s analysis of Spotify data.
Prior to constructing a model, checking for correlated variables helps to avoid collinearity issues. There are only a small number of strong correlations, including a positive correlation between loudness and energy and an inverse correlation between energy and acousticness. Both are intuitively expected given their definitions in the data dictionary.
variable_corr <- cor(clean_data[12:23])
corrplot(variable_corr, method = "color", type = "upper",
tl.col = "black", tl.srt = 45, title = "Correlation of Song Metrics")
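Alongside the plot, the strongest pairs can be listed directly from the correlation matrix. The following is a minimal sketch on simulated data; the column values are random, and the 0.6 cut-off is an illustrative assumption rather than a threshold used in the original analysis.

```r
# Small simulated data set (assumption: values are random, not Spotify metrics)
set.seed(1)
energy       <- runif(200)
loudness     <- energy + rnorm(200, sd = 0.1)      # built to track energy
acousticness <- 1 - energy + rnorm(200, sd = 0.1)  # built to oppose energy
tempo        <- runif(200)                         # unrelated noise
metrics <- data.frame(energy, loudness, acousticness, tempo)

corr <- cor(metrics)

# Keep each pair once (upper triangle) and only where |r| exceeds a threshold
threshold <- 0.6  # illustrative cut-off
idx <- which(abs(corr) > threshold & upper.tri(corr), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = round(corr[idx], 2)
)

print(strong_pairs)
```

A table like this makes it easy to decide whether one variable from a highly correlated pair should be dropped before modeling.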
There are a large number of models that can be considered for machine learning classification. Initial models considered included XGBoost, random forest, decision tree, and K-nearest neighbor. For this initial effort, K-nearest neighbor was selected due to its flexibility and easy deployment.
The cleaned data is normalized and then randomly split into training (80%) and testing (20%) datasets.
model_data <- clean_data[c(10,12:23)] # extract genre and metrics
random <- sample(1:nrow(model_data), 0.8 * nrow(model_data)) # sample row indices for an 80% training split
nor <- function(x) { (x - min(x))/(max(x) - min(x)) } # normalization function
data_norm <- as.data.frame(lapply(model_data[,c(2:13)], nor)) # normalize the metric columns
spotify_train <- data_norm[random,] # generate training data
spotify_test <- data_norm[-random,] # generate test data
spotify_train_target <- model_data[random,1] # extract genre data for model training
spotify_test_target <- model_data[-random,1] # extract genre data for model accuracy
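One refinement worth considering: the code above rescales every column using the minimum and maximum of the full dataset, so the test rows influence the scaling. A minimal sketch of an alternative, in which the range is learned from the training rows only and a seed makes the split repeatable, is shown below. It uses the built-in `iris` data as a stand-in; `set.seed(42)`, `fit_range`, and `apply_range` are illustrative choices and hypothetical helper names, not part of the original analysis.

```r
set.seed(42)  # assumption: any fixed seed, used only to make the split repeatable

features <- iris[, 1:4]   # stand-in for the twelve song metrics
labels   <- iris$Species  # stand-in for playlist_genre

train_idx <- sample(seq_len(nrow(features)), size = floor(0.8 * nrow(features)))

# Learn per-column min and max from the training rows only
fit_range <- function(df) list(min = sapply(df, min), max = sapply(df, max))

# Rescale any data frame with a previously learned range
apply_range <- function(df, rng) {
  as.data.frame(Map(function(col, lo, hi) (col - lo) / (hi - lo),
                    df, rng$min, rng$max))
}

rng        <- fit_range(features[train_idx, ])
train_norm <- apply_range(features[train_idx, ], rng)
test_norm  <- apply_range(features[-train_idx, ], rng)

# Training columns span exactly [0, 1]; test values may fall slightly outside
print(sapply(train_norm, range))
print(sapply(test_norm, range))
```

With a dataset of this size the effect on accuracy is likely small, but keeping the test rows out of the scaling step gives a cleaner estimate of how the model would behave on unseen songs.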
The k-nearest neighbor algorithm works well in this application because it is simple to implement, does not require an explicit training step, and handles multi-class problems naturally, whereas many algorithms are designed primarily for binary problems.
A potential downside of this algorithm is its compute demand, which grows with K and with the size of the dataset because each prediction compares a song against every training observation. The model can also struggle if one category is over-represented in the training dataset, which can lead to the most common genre being over-predicted.
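Because the choice of K affects both accuracy and run time, it can be useful to compare a few values before settling on one. The sketch below does this on the built-in `iris` data as a stand-in for the song metrics; the seed and the candidate values of K are illustrative assumptions, and in practice a separate validation set or cross-validation would be preferable to tuning against the test set.

```r
library(class)

set.seed(1)  # assumption: fixed seed for a repeatable split
idx   <- sample(seq_len(nrow(iris)), size = 120)
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
train_labels <- iris$Species[idx]
test_labels  <- iris$Species[-idx]

# Candidate neighbourhood sizes (illustrative values)
k_values <- c(1, 5, 15, 45)

# For each k, classify the test rows and record the share predicted correctly
accuracy_by_k <- sapply(k_values, function(k) {
  predicted <- knn(train, test, cl = train_labels, k = k)
  mean(predicted == test_labels)
})

print(data.frame(k = k_values, accuracy = accuracy_by_k))
```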
knn_model <- knn(spotify_train, spotify_test, cl = spotify_train_target, k = 250) # classify test songs using the 250 nearest training songs
confusion_matrix <- table(knn_model, spotify_test_target) # confusion matrix
accuracy <- function(x) {sum(diag(x)/(sum(rowSums(x)))) * 100} # accuracy function
knn_model_accuracy <- accuracy(confusion_matrix) # assess accuracy of model
kable(round((confusion_matrix/rowSums(confusion_matrix))*100,1), "html", caption = "Table 6: Confusion Matrix (Percent Classified by Model)") %>%
kable_styling(bootstrap_options = "striped", full_width = T)
| predicted genre (rows) / actual genre (columns) | edm | latin | pop | r&b | rap | rock |
|---|---|---|---|---|---|---|
| edm | 49.8 | 7.8 | 16.4 | 3.3 | 11.0 | 11.6 |
| latin | 7.5 | 39.3 | 13.5 | 16.7 | 15.4 | 7.5 |
| pop | 14.0 | 17.1 | 31.0 | 12.9 | 12.4 | 12.6 |
| r&b | 3.4 | 11.4 | 15.2 | 42.8 | 12.9 | 14.5 |
| rap | 4.3 | 10.6 | 9.2 | 16.8 | 57.5 | 1.5 |
| rock | 5.9 | 4.3 | 17.5 | 8.5 | 4.3 | 59.4 |
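Table 6 divides each row by its row total, so its diagonal shows, for each predicted genre, the share of those predictions that were correct (precision). Dividing by column totals instead gives the share of each actual genre that was recovered (recall). A minimal sketch on a hypothetical three-genre confusion matrix (the counts are invented for illustration) shows both alongside overall accuracy:

```r
# Hypothetical counts: rows are predictions, columns are actual genres
cm <- matrix(c(50, 10,  5,
               20, 60, 10,
               30, 30, 85),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = c("pop", "rap", "rock"),
                             actual    = c("pop", "rap", "rock")))

# Row-normalised diagonal: of the songs labelled X, the share that are X
precision <- diag(cm) / rowSums(cm)

# Column-normalised diagonal: of the songs that are X, the share labelled X
recall <- diag(cm) / colSums(cm)

# Overall accuracy: correct predictions over all predictions
overall <- sum(diag(cm)) / sum(cm)

print(round(100 * precision, 1))
print(round(100 * recall, 1))
print(round(100 * overall, 1))
```

The two views can differ noticeably for the same genre, so it helps to state which normalization a confusion matrix uses when quoting per-genre figures.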
Starting from a random-guess baseline of 1 in 6 (16.67%), the model achieves 45.13% accuracy, roughly 2.7 times better than chance. This still leaves significant room for improvement: with K-nearest neighbors, the model misclassifies more than half of the songs in the test dataset.
Performance also varies by genre. Each row of Table 6 shows, for the songs the model assigned to a genre, the percentage that actually belong to each genre. Pop is the hardest: only about 31% of the songs labeled pop are actually pop. One potential explanation is that pop has no characteristics that differ significantly from those of the genres it is compared against, which can be observed in the density plots broken down by genre. Conversely, close to 60% of the songs labeled rock or rap are correct. Rap has a distinctive speechiness distribution, and rock similarly has a danceability distribution that helps set it apart from other genres.
The model accomplished the goal of providing a tool to classify the songs, but there is room for improvement. Future investigation could include testing additional models such as random forest and decision trees, or looking at other groupings of genres.