Recently there was an AI bot by The Pudding that went viral for roasting how basic people’s Spotify accounts are (you can try it out here: https://pudding.cool/2020/12/judge-my-spotify/). On one hand, the bot was pretty cool and showed how powerful AI can be. On the other hand, I was personally offended by it calling me old for liking alternative rock and r&b from the early 2000s.
Look, it’s not that I have anything against more modern music, but I just don’t understand what style it is going for most of the time. I tried putting on a more recent r&b playlist (trying to impress the AI, of course) and discovered a song called “Lurkin” by Chris Brown and Tory Lanez. Chris Brown is mostly an r&b singer and Tory Lanez is a rapper, yet “Lurkin” with its catchy hooks seems clearly targeted to be a mainstream pop song. It only got more confusing when I looked the song up on the Spotify API and saw it classified as “Latin hip hop”. To summarize, we have a rapper and an r&b singer collaborating on a song that could be considered pop or rap, yet I first heard it on an r&b playlist. You got all that?
This got me wondering whether we could use machine learning to train a model to tell the difference between rap, r&b, and pop. Thankfully, there was a publicly available Spotify dataset from the Tidy Tuesday community (https://github.com/rfordatascience/tidytuesday) to work with.
Data Cleanup
Pre-processing
The release dates are very inconsistent, with several having a full “yyyy-mm-dd” format and others simply having “yyyy”. For purposes of this analysis, we only want the years, which happen to be the first 4 characters for every release date string. I am also grouping them into decades (with 2020 being looped in with the 2010s since this dataset only goes through March 2020), and then removing abnormally long and short songs.
df_spotify = spotify_songs %>%
  filter(!is.na(track_artist),
         !is.na(track_name)) %>%
  #Extracts the first 4 characters of the release date, which is the year
  mutate(year = substr(track_album_release_date, 1, 4),
         year = as.numeric(year),
         #Groups years into decades
         decade = case_when(year <= 1979 ~ 'Pre-1980s',
                            year <= 1989 ~ '1980s',
                            year <= 1999 ~ '1990s',
                            year <= 2009 ~ '2000s',
                            year <= 2020 ~ '2010s',
                            TRUE ~ 'NA'),
         #Reorders decade as a factor to remain sequential from earliest to latest
         decade = fct_relevel(decade, 'Pre-1980s', '1980s', '1990s', '2000s', '2010s'),
         #Converts milliseconds to seconds
         duration_sec = duration_ms * .001) %>%
  filter(decade != 'NA')
#This gets the 10th and 90th percentiles of song lengths, used to remove abnormally short or long songs
duration.bounds = quantile(df_spotify$duration_sec, c(.1, .9))
df_spotify = df_spotify %>%
  filter(duration_sec >= duration.bounds[1] &
         duration_sec <= duration.bounds[2])
#Eliminates a lot of the columns that won't be needed, mostly stuff about the album or playlist
df_spotify = df_spotify %>%
  select(track_id, track_artist, track_name, playlist_genre, decade) %>%
  bind_cols(df_spotify %>% select_if(is.numeric)) %>%
  select(-duration_ms)
The dataset has 26,264 rows, but it contains about 6,000 duplicated songs across different playlists due to artist re-releases (for example, Bon Jovi’s “Livin’ on a Prayer” showed up 11 different times on albums released from 1986 all the way through 2010). While there is no sure way to know which release of each duplicated song was the original (“Livin’ on a Prayer” had 5 distinct releases from January 1986 alone), I will assume that the most popular version of the song in the dataset is most likely the original release. The chunk below extracts only the most popular release of each track and eliminates any duplicates.
df_spotify = df_spotify %>%
  #Sorts the data so the most popular version of each track comes first, then uses slice to keep the 1st row of each artist/song combo
  arrange(desc(track_popularity)) %>%
  group_by(track_artist, track_name) %>%
  slice(1) %>%
  ungroup()
#Sample Data
set.seed(20120)
kable(df_spotify %>% sample_n(5)) %>%
  kable_paper() %>%
  scroll_box(height = '75%', width = '100%')
| track_id | track_artist | track_name | playlist_genre | decade | track_popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | year | duration_sec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1MwSMSjbU20bF5nFgWLsJh | SAND&SEA | looking around at the sky. | pop | 2010s | 38 | 0.708 | 0.667 | 2 | -7.328 | 1 | 0.0587 | 0.51200 | 0.00e+00 | 0.137 | 0.5250 | 160.004 | 2020 | 225.000 |
| 5a2dsl9XD05WBQtSCAkTdF | Lucchii | Disconnected | rap | 2010s | 36 | 0.512 | 0.877 | 5 | -5.548 | 0 | 0.5740 | 0.71100 | 0.00e+00 | 0.425 | 0.3170 | 128.224 | 2020 | 277.500 |
| 455AfCsOhhLPRc68sE01D8 | Katy Perry | Last Friday Night (T.G.I.F.) | pop | 2010s | 74 | 0.649 | 0.815 | 3 | -3.796 | 0 | 0.0415 | 0.00125 | 4.31e-05 | 0.671 | 0.7650 | 126.030 | 2012 | 230.747 |
| 4rvnIyOOA0ws4PMiZpZgZK | Coopex | Never Letting Go | latin | 2010s | 45 | 0.490 | 0.818 | 6 | -6.653 | 0 | 0.2190 | 0.10800 | 1.29e-04 | 0.142 | 0.0987 | 128.065 | 2019 | 183.755 |
| 5cYZ5msSztsVKkaSYIoZ3b | Gucci Mane | Drop It Off (Feat. Eva Trill) | rap | 2010s | 20 | 0.792 | 0.428 | 2 | -7.862 | 1 | 0.0715 | 0.07620 | 0.00e+00 | 0.183 | 0.2020 | 148.065 | 2015 | 182.973 |
Popularity by Genre
Here is the breakdown of popularity by genre, to check for observable trends. This chunk creates a histogram of popularity for each genre.
#Calculate the group mean popularity by genre for overlay on the plot
mean.popularity = df_spotify %>%
  group_by(playlist_genre) %>%
  summarize(mean.pop = mean(track_popularity))
ggplot(df_spotify) +
  geom_histogram(aes(track_popularity),
                 breaks = 0:100, color = 'blue', alpha = .75) +
  geom_vline(data = mean.popularity, col = 'red',
             mapping = aes(xintercept = mean.pop)) +
  geom_label(data = mean.popularity, col = 'red',
             aes(x = mean.pop, y = 300,
                 label = paste('Mean:', round(mean.pop, 1)))) +
  facet_wrap(~playlist_genre) +
  ggtitle('Song Popularity on Spotify by Genre')
The first thing that stands out is the large number of songs with 0 popularity, which looks consistent across genres. I don’t know if this is some sort of input error or if these are simply obscure songs that don’t register many listens. To be honest, the data dictionary did not provide any info on how popularity is calculated, but I think it is fairly safe to remove these songs for this analysis.
Clearly edm is the least popular genre (amen to that), with Latin and pop the most popular, but the important thing is that, other than edm, the group means are relatively close to each other.
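The chunk that actually drops those zero-popularity songs isn’t shown above; here is a minimal sketch of what it presumably looks like (a simple filter on the popularity score):
#Drops songs with a popularity score of 0 before any further analysis
df_spotify = df_spotify %>%
  filter(track_popularity > 0)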
This boxplot is now looking at the popularity distribution by decade.
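The plotting chunk isn’t shown here; below is a minimal ggplot2 sketch of what it might look like, assuming popularity is broken out by decade and split by genre:
#Boxplots of track popularity by decade, faceted by genre
ggplot(df_spotify) +
  geom_boxplot(aes(x = decade, y = track_popularity, fill = playlist_genre)) +
  facet_wrap(~playlist_genre) +
  theme(legend.position = 'none') +
  ggtitle('Song Popularity on Spotify by Decade and Genre')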
Interesting to see pop music peaking in the 1980s and then dropping after that. Meanwhile, rap and r&b both saw declines in the 2000s but huge resurgences in the 2010s. It is important to note that with so much more data on 2010s songs than any other decade, it’s unlikely there’s much statistical significance in saying that 2010s music is more popular than any other decade (that’s probably not a big surprise, since I would assume Spotify’s userbase skews much younger). We can also see a number of outliers in 1980s rap, which is largely because of the lack of data points (rap didn’t really take off on a national level until the 1990s). So I will remove all songs before 1990 from modeling.
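The chunk that builds the data frame used in the correlation plot below isn’t shown; here is a minimal sketch, assuming df_fewer_genres is simply df_spotify with the pre-1990 decades dropped (the name suggests a genre filter may also have been applied here, but that is a guess):
#Removes all songs released before 1990; columns are left in the same order as df_spotify
df_fewer_genres = df_spotify %>%
  filter(!decade %in% c('Pre-1980s', '1980s'))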
Correlation Plot
The dataset contains 12 numeric attributes about the actual sound of the music. We can use the corrplot function to create a heat map of the correlations between the numerical song attributes provided in the dataset.
df_fewer_genres %>%
  select(7:17) %>% #these are the song attributes
  scale() %>%
  cor() %>%
  corrplot(method = 'color',
           type = 'upper',
           diag = F,
           tl.col = 'black',
           addCoef.col = "grey30",
           number.cex = 0.6,
           tl.cex = .8,
           main = 'Correlation Plot of Song Attributes',
           mar = c(1, 0, 2, 0))
Energy has a strong 68% positive correlation with Loudness, which I suppose explains why so many musicians are always asking their crowds to “make some noise.” Energy also has a moderate negative correlation with Acousticness, which makes intuitive sense as well. To reduce collinearity, we will remove Energy from modeling.
Model Setup
Training and Testing Splits
For this multi-class classification, we will try both a k-nearest neighbors model and a random forest model using the musical attributes. We will not include anything about time period or artist, because we want to see whether the computer can distinguish genres on sound alone. The 9,790 songs used are split into training and testing sets (7,344/2,446) with the proportion of each genre held constant. We will also use bootstrap resampling, which lets us evaluate the models across repeated resamples of the training data.
set.seed(516)
#Creates a separate data frame for just our 3 genres and the decades of interest
df_spotify_model = df_spotify %>%
  filter(decade %in% c('1990s', '2000s', '2010s'),
         playlist_genre %in% c('rap', 'pop', 'r&b'))
#Splits the data into training and testing. Strata ensures an equal genre distribution across the testing and training sets.
df_split = initial_split(df_spotify_model,
                         strata = playlist_genre)
df_training = training(df_split)
df_testing = testing(df_split)
#Gets stratified bootstrap samples of the training data
df_bootstraps = bootstraps(df_training,
                           strata = playlist_genre)
Pre-Processing Recipe
The tidymodels package allows us to create a “recipe” for pre-processing our data before modeling, which eliminates redundancy when building multiple models. The recipe below specifies a prediction formula, drops the non-predictor variables, removes highly correlated variables (energy), and then centers and scales the numeric attributes so that they are on a level playing field when assessing their predictive power. We then use the prep and bake functions to see what the recipe looks like when applied to the training data. A summary of the attributes is below:
#Creates a recipe for consistently processing our data. Starts with writing the formula for modeling to predict genre
spotify_recipe = recipe(playlist_genre ~ ., df_training) %>%
  #Removes all of the excess variables
  step_rm(all_nominal(), track_popularity, year, -playlist_genre) %>%
  #Gets rid of variables with correlation over .6 (which will filter out Energy)
  step_corr(all_numeric(), threshold = .6) %>%
  #Centers and scales all numeric predictors
  step_center(all_predictors()) %>%
  step_scale(all_predictors())
#Tidymodels requires a prep step on the training data. You can then feed this into "bake", which shows the results of our recipe when applied to the training data
spotify_prepped_recipe = prep(spotify_recipe, df_training)
df_training_baked = bake(spotify_prepped_recipe, df_training)
summary(df_training_baked)
## danceability key loudness mode
## Min. :-4.21927 Min. :-1.4581 Min. :-7.2229 Min. :-1.0918
## 1st Qu.:-0.63863 1st Qu.:-0.9078 1st Qu.:-0.4979 1st Qu.:-1.0918
## Median : 0.09736 Median : 0.1929 Median : 0.1533 Median : 0.9158
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.71560 3rd Qu.: 0.7432 3rd Qu.: 0.6855 3rd Qu.: 0.9158
## Max. : 2.24648 Max. : 1.5687 Max. : 2.6410 Max. : 0.9158
## speechiness acousticness instrumentalness liveness
## Min. :-0.9086 Min. :-0.8797 Min. :-0.2626 Min. :-1.2257
## 1st Qu.:-0.7311 1st Qu.:-0.7593 1st Qu.:-0.2626 1st Qu.:-0.6121
## Median :-0.4610 Median :-0.4111 Median :-0.2626 Median :-0.3946
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4954 3rd Qu.: 0.4253 3rd Qu.:-0.2598 3rd Qu.: 0.3087
## Max. : 6.9208 Max. : 3.4208 Max. : 6.2030 Max. : 5.8912
## valence tempo duration_sec playlist_genre
## Min. :-2.1608634 Min. :-2.36162 Min. :-1.6871 pop:2807
## 1st Qu.:-0.7622605 1st Qu.:-0.82223 1st Qu.:-0.7891 r&b:2083
## Median :-0.0007789 Median :-0.03537 Median :-0.1268 rap:2454
## Mean : 0.0000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.7697143 3rd Qu.: 0.65717 3rd Qu.: 0.6941
## Max. : 2.1349743 Max. : 3.36046 Max. : 2.3891
Random Forest and KNN Models
Finally, we set the engine for each model and combine it with our pre-processing recipe in a workflow.
#Set the random forest engine
rf_model = rand_forest(mtry = tune(), min_n = tune(), trees = 300) %>%
  set_mode("classification") %>%
  set_engine("ranger")
#Create the workflow to apply the pre-processing recipe and then the model
rf_workflow = workflow() %>%
  add_model(rf_model) %>%
  add_recipe(spotify_recipe)
set.seed(516)
#Tune the random forest across the bootstrap resamples
rf_bootstraps = rf_workflow %>%
  tune_grid(resamples = df_bootstraps, grid = 3)
#Now repeat the same 3 steps for the knn model (left at its default settings)
knn_model = nearest_neighbor() %>%
  set_mode('classification') %>%
  set_engine('kknn')
knn_workflow = workflow() %>%
  add_model(knn_model) %>%
  add_recipe(spotify_recipe)
set.seed(516)
knn_bootstraps = knn_workflow %>%
  tune_grid(resamples = df_bootstraps, grid = 3)
Evaluate Models
Accuracy Measures: Random Forest vs. KNN
Here are the results for each model when it comes to predicting the correct genre:
#Select the best bootstrap models by accuracy
rf_best = select_best(rf_bootstraps, metric = 'accuracy')
knn_best = select_best(knn_bootstraps, metric = 'accuracy')
rf_best_metrics = collect_metrics(rf_bootstraps) %>%
  filter(`.config` == rf_best$.config)
#Gets the metrics on the training data for the best models and puts them into table form
model_comp = rf_best_metrics %>%
  mutate(model = 'rf') %>%
  bind_rows(
    collect_metrics(knn_bootstraps) %>%
      filter(`.config` == knn_best$.config) %>%
      mutate(model = 'knn')) %>%
  pivot_wider(names_from = .metric, values_from = mean) %>%
  group_by(model) %>%
  summarise(accuracy = round(sum(accuracy, na.rm = T), 4),
            roc_auc = round(sum(roc_auc, na.rm = T), 4))
kable(model_comp)
| model | accuracy | roc_auc |
|---|---|---|
| knn | 0.5325 | 0.6959 |
| rf | 0.6325 | 0.7943 |
For a classification problem like this one with fairly balanced classes, accuracy is a reasonable headline metric (it is simply the share of correct predictions), while ROC AUC is more useful when class imbalance or the cost of false positives and negatives matters. The random forest model reached about 63.3% accuracy on the training resamples compared to about 53% for the KNN model. That gap makes some sense given the original premise that pop, r&b, and rap overlap heavily in how they sound: the random forest can build more specific decision rules, whereas if KNN misclassifies one song, there is a very good chance it will also misclassify the other songs that sound just like it (hence the name “nearest neighbors”).
By comparison, a random guess would be right only 33.3% of the time, so the RF model nearly doubling that is fairly strong.
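To back up both the “fairly balanced classes” point and that baseline, here is a small sanity check (not part of the original modeling) of the training class proportions:
#Class balance in the training data: a uniform random guess is right ~33% of the time,
#while always guessing the largest class (pop) would be right ~38% of the time
df_training %>%
  count(playlist_genre) %>%
  mutate(proportion = round(n / sum(n), 3))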
Benchmarking Accuracy Measure
However, I wanted a benchmark for evaluating my initial question of whether pop, rap, and r&b really are that similar. So I also created a separate random forest model, classifying rap against 2 of the other genres in the original dataset: rock and edm. Those 3 genres sound very distinct from each other. I would hypothesize that the accuracy of the rock/edm model should be much higher than the 63.3% of the pop/rap/r&b model, even when holding all of the rap songs constant across both models.
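The benchmark chunk isn’t shown in full; below is a sketch of how it could be built, assuming the rock/edm/rap subset goes through the same split, recipe, and tuning steps as the main model (note that this simple split does not force exactly the same rap songs into both training sets):
#Benchmark: the same random forest pipeline, but on a rock/edm/rap subset
set.seed(516)
df_benchmark = df_spotify %>%
  filter(decade %in% c('1990s', '2000s', '2010s'),
         playlist_genre %in% c('rock', 'edm', 'rap'))
benchmark_split = initial_split(df_benchmark, strata = playlist_genre)
benchmark_training = training(benchmark_split)
benchmark_bootstraps = bootstraps(benchmark_training, strata = playlist_genre)
#Same recipe steps as before, just templated on the benchmark training data
benchmark_recipe = recipe(playlist_genre ~ ., benchmark_training) %>%
  step_rm(all_nominal(), track_popularity, year, -playlist_genre) %>%
  step_corr(all_numeric(), threshold = .6) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())
benchmark_results = workflow() %>%
  add_model(rf_model) %>%
  add_recipe(benchmark_recipe) %>%
  tune_grid(resamples = benchmark_bootstraps, grid = 3)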
| model | accuracy | roc_auc |
|---|---|---|
| RF - pop/rap/r&b | 0.6325 | 0.7943 |
| RF - rock/edm/rap | 0.8060 | 0.9290 |
As I suspected, the computer has a much easier time telling rap, rock, and edm apart than it does rap, pop, and r&b. The rock/edm model was 80.6% accurate on in-sample songs, over 17 percentage points better than the pop/rap/r&b model! So it’s clearly not just my old-man ears that struggle to tell rap, pop, and r&b apart.
Modeling Testing Data
Let’s now run the RF model (rap/pop/r&b) on the testing data and see how it does predicting genres out of sample.
#Finalizes the random forest workflow with the most accurate hyperparameters from the bootstrap models
final_rf = rf_workflow %>%
  finalize_workflow(select_best(rf_bootstraps, metric = 'accuracy'))
#Fits the final model on the training data and evaluates it on the testing data
spotify_fit = last_fit(final_rf, df_split)
kable(collect_metrics(spotify_fit) %>% select(.metric, .estimate))
| .metric | .estimate |
|---|---|
| accuracy | 0.6512674 |
| roc_auc | 0.8101257 |
Interesting that our random forest model performed about 2 percentage points better on the testing data (65% accuracy) than it did on the training resamples (63.3%). Typically we would expect accuracy on the training data to be slightly higher (the model parameters are “biased” toward the training data). This likely indicates a fair amount of variance in the model. In other words, there are probably quite a few songs the model classified correctly on the testing data even though it wasn’t particularly confident in the choice.
One way to examine this is to look at the distribution of the predicted probabilities on the testing data. The RF model assigns each song a probability of belonging to rap, r&b, and pop, so the graph shows how confident the model was when making its selection for each song.
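The chunk behind this graph isn’t shown; here is a sketch of one way to build it from the last_fit() predictions (the .pred_* column names match those in the prediction table near the end of this post):
#Pulls the out-of-sample predictions and keeps the probability of whichever genre the model picked
df_confidence = collect_predictions(spotify_fit) %>%
  mutate(chosen_prob = pmax(.pred_pop, `.pred_r&b`, .pred_rap))
#Average confidence across the testing songs
mean(df_confidence$chosen_prob)
#Histogram of how confident the model was in each pick
ggplot(df_confidence) +
  geom_histogram(aes(chosen_prob), bins = 30, color = 'blue', alpha = .75) +
  ggtitle('Predicted Probability of the Chosen Genre (Testing Data)')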
The predicted probabilities are skewed toward the low end of the range, which confirms that the model was more often than not classifying with low confidence. It’s worth reiterating that this is not a surprise given my original hypothesis: if the three genres really do sound alike, the model should not be overly confident when making a choice. This isn’t to say the results are bad (57% average confidence across 3 choices does show some degree of certainty), but it does offer an explanation as to why the testing data might have slightly outperformed the training data.
Confusion Matrix Plot by Predicted Genre
How did the model do in predicting each of the 3 genres individually? Let’s check the mosaic plot.
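The mosaic plot chunk isn’t included; a minimal sketch using yardstick’s conf_mat(), assuming the plot is built straight from the testing predictions:
#Confusion matrix of actual vs. predicted genre on the testing data, drawn as a mosaic plot
collect_predictions(spotify_fit) %>%
  conf_mat(truth = playlist_genre, estimate = .pred_class) %>%
  autoplot(type = 'mosaic')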
Wow! So it turns out that the model is actually quite accurate at distinguishing pop (77%) and rap (74%), but it is only a little better than random at distinguishing r&b (43%). This means my suspicion that pop and rap were blending together was wrong, but I was right about it happening to r&b. The model confirms that r&b tends to sound too much like either a rap or a pop song.
Keep in mind that nearly 80% of the training data consists of songs from 2010 on. As a fan of 90s and 2000s r&b, I’d be really curious whether this is a trend that has emerged in more recent songs or whether it was always like this and I had just never noticed. However, I do not have enough data to evaluate that right now.
Variable Importance Plot
Finally, let’s look at a variable importance plot to see which features were most predictive.
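The variable importance chunk isn’t shown either. One wrinkle is that ranger only records importances if an importance mode is set at fit time, which the spec above did not do; here is a sketch that refits the best model with permutation importance and plots it with the vip package (an assumption about how the original plot was made):
library(vip)
#Refits the best random forest with an importance mode so ranger records variable importances
rf_vip_spec = rand_forest(mtry = tune(), min_n = tune(), trees = 300) %>%
  set_mode('classification') %>%
  set_engine('ranger', importance = 'permutation') %>%
  finalize_model(select_best(rf_bootstraps, metric = 'accuracy'))
workflow() %>%
  add_model(rf_vip_spec) %>%
  add_recipe(spotify_recipe) %>%
  fit(df_training) %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)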
The model found speechiness and loudness (associated with rap) and danceability and tempo (associated with pop) to be the most important features. Duration also showed some predictive power, ironically being the trait most associated with predicting the otherwise difficult r&b (r&b songs were about 10 seconds longer on average than rap and 15 seconds longer than pop in the training data).
But perhaps most importantly, now that we have a model, we can circle back to my original question: what the heck genre is “Lurkin” supposed to be?
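The lookup and prediction chunk isn’t shown; here is a sketch of how it could be done, assuming a recent enough tune version where extract_workflow() works on last_fit() results:
#Finds the track in the cleaned data and runs it through the fitted workflow
df_lurkin = df_spotify %>%
  filter(track_artist == 'Chris Brown',
         grepl('Lurkin', track_name))
extract_workflow(spotify_fit) %>%
  predict(df_lurkin, type = 'prob') %>%
  bind_cols(df_lurkin %>% select(track_artist, track_name))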
| track_artist | track_name | .pred_pop | .pred_r&b | .pred_rap |
|---|---|---|---|---|
| Chris Brown | Lurkin’ (feat. Tory Lanez) | 0.315856 | 0.2456512 | 0.4384928 |
Sigh. The computer was just as confused as I was!
Summary
- The random forest model predicted rap, r&b, and pop songs with 65% accuracy out of sample, beating the k-nearest neighbors model by over 10 percentage points.
- It was significantly better at predicting pop (77%) and rap (73%) than r&b (43%).
- The amount of speech and loudness of a song are strong predictors for rap.
- The amount of danceability and tempo are strong predictors for pop.
- The attributes for r&b songs blend in too much with pop and/or rap attributes for the model to distinguish them.