Recently there was an AI bot by The Pudding that went viral for roasting how basic people’s Spotify accounts are (you can try it out here: https://pudding.cool/2020/12/judge-my-spotify/). On one hand, the bot was pretty cool and showed how powerful AI can be. On the other hand, I was personally offended by it calling me old for liking alternative rock and r&b from the early 2000s.

Look, it’s not that I have anything against more modern music; I just don’t understand what style it’s going for most of the time. I tried putting on a more recent r&b playlist (trying to impress the AI, of course) and discovered a song called “Lurkin” by Chris Brown and Tory Lanez. Chris Brown is mostly an r&b singer and Lanez is a rapper, yet “Lurkin,” with its catchy hooks, seems clearly targeted to be a mainstream pop song. It only got more confusing when I looked the song up on the Spotify API and saw it classified as “Latin hip hop”. To summarize: we have a rapper and an r&b singer collaborating on a song that could be considered pop or rap, yet I first heard it on an r&b playlist. You got all that?

This got me wondering whether we could use machine learning to train a model to distinguish rap, r&b, and pop. Thankfully, there was a publicly available Spotify dataset from the Tidy Tuesday community (https://github.com/rfordatascience/tidytuesday) to work with.

Data Cleanup

Pre-processing

The release dates are very inconsistent, with several having a full “yyyy-mm-dd” format and others simply having “yyyy”. For purposes of this analysis, we only want the years, which happen to be the first 4 characters for every release date string. I am also grouping them into decades (with 2020 being looped in with the 2010s since this dataset only goes through March 2020), and then removing abnormally long and short songs.
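A sketch of that pre-processing, assuming the Tidy Tuesday file loads as `spotify_songs` (the URL's date path and the duration cutoffs are my assumptions, not the exact values used):

```r
library(tidyverse)

# Load the Tidy Tuesday Spotify data (date path assumed)
spotify_songs <- readr::read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv"
)

songs <- spotify_songs %>%
  mutate(
    # the first 4 characters of every release date string are the year
    year = as.integer(str_sub(track_album_release_date, 1, 4)),
    # group into decades, looping 2020 in with the 2010s
    decade = paste0(pmin(year - year %% 10, 2010), "s"),
    duration_sec = duration_ms / 1000
  ) %>%
  # drop abnormally short and long songs (illustrative cutoffs)
  filter(duration_sec >= 60, duration_sec <= 600)
```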

The dataset has 26,264 rows, but it contains about 6,000 duplicated songs across different playlists due to artist re-releases (for example, Bon Jovi’s “Livin’ on a Prayer” showed up 11 times on albums released from 1986 all the way through 2010). While there is no way to know for sure what the original release of each duplicated song was (“Livin’ on a Prayer” had 5 distinct releases from January 1986 alone), I will assume that the most popular version of the song in the dataset is most likely the original release. So the chunk below will extract only the most popular release of each track and eliminate any duplicates.
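A sketch of what that chunk might look like, keeping the single most popular release per track and artist:

```r
# Keep only the most popular release of each track, assumed to be the original
songs_deduped <- songs %>%
  group_by(track_name, track_artist) %>%
  slice_max(track_popularity, n = 1, with_ties = FALSE) %>%
  ungroup()
```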

| track_id | track_artist | track_name | playlist_genre | decade | track_popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | year | duration_sec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1MwSMSjbU20bF5nFgWLsJh | SAND&SEA | looking around at the sky. | pop | 2010s | 38 | 0.708 | 0.667 | 2 | -7.328 | 1 | 0.0587 | 0.51200 | 0.00e+00 | 0.137 | 0.5250 | 160.004 | 2020 | 225.000 |
| 5a2dsl9XD05WBQtSCAkTdF | Lucchii | Disconnected | rap | 2010s | 36 | 0.512 | 0.877 | 5 | -5.548 | 0 | 0.5740 | 0.71100 | 0.00e+00 | 0.425 | 0.3170 | 128.224 | 2020 | 277.500 |
| 455AfCsOhhLPRc68sE01D8 | Katy Perry | Last Friday Night (T.G.I.F.) | pop | 2010s | 74 | 0.649 | 0.815 | 3 | -3.796 | 0 | 0.0415 | 0.00125 | 4.31e-05 | 0.671 | 0.7650 | 126.030 | 2012 | 230.747 |
| 4rvnIyOOA0ws4PMiZpZgZK | Coopex | Never Letting Go | latin | 2010s | 45 | 0.490 | 0.818 | 6 | -6.653 | 0 | 0.2190 | 0.10800 | 1.29e-04 | 0.142 | 0.0987 | 128.065 | 2019 | 183.755 |
| 5cYZ5msSztsVKkaSYIoZ3b | Gucci Mane | Drop It Off (Feat. Eva Trill) | rap | 2010s | 20 | 0.792 | 0.428 | 2 | -7.862 | 1 | 0.0715 | 0.07620 | 0.00e+00 | 0.183 | 0.2020 | 148.065 | 2015 | 182.973 |

Popularity by Genre

Here is the breakdown of popularity by genre, to check for observable trends. This chunk creates a histogram of popularity by genre.
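A sketch of that chunk with ggplot2 (the binwidth and faceting choices are my assumptions):

```r
# Histogram of track popularity, faceted by playlist genre
songs_deduped %>%
  ggplot(aes(track_popularity, fill = playlist_genre)) +
  geom_histogram(binwidth = 5, show.legend = FALSE) +
  facet_wrap(~ playlist_genre) +
  labs(x = "Track Popularity", y = "Number of Songs")
```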

The first thing that stands out is the large number of songs with 0 popularity, which looks consistent across genres. I don’t know if this is some sort of input error or if these are simply obscure songs that don’t register many listens. To be honest, the data dictionary did not provide any info on how popularity is calculated, but I think it is fairly safe to remove them for this analysis.

Clearly edm is the least popular genre (amen to that), with latin and pop the most popular; the important takeaway is that, other than edm, the group means are relatively close to each other.

This boxplot looks at the popularity distribution by decade.
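A sketch of the boxplot, with the zero-popularity songs removed as discussed above:

```r
# Popularity distribution by decade, one panel per genre
songs_deduped %>%
  filter(track_popularity > 0) %>%
  ggplot(aes(decade, track_popularity, fill = playlist_genre)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~ playlist_genre) +
  labs(x = "Decade", y = "Track Popularity")
```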

Interesting to see pop music peaking in the 1980s and then dropping after that. Meanwhile, rap and r&b both saw declines in the 2000s but huge resurgences in the 2010s. It is important to note that with so much more data on 2010s songs than any other decade, it’s hard to claim much statistical significance in saying that 2010s music is more popular than any other decade (probably not a big surprise, since I would assume Spotify’s userbase skews much younger). We can also see a number of outliers in 1980s rap, which is largely due to the lack of data points (rap didn’t really take off on a national level until the 1990s). So I will remove all songs released before 1990 from modeling.

Correlation Plot

The dataset contains 12 numeric attributes describing the actual sound of the music. We can use the corrplot function to create a heat map of the correlations between these numeric song attributes.
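Something like the following, selecting the 12 audio attributes explicitly:

```r
library(corrplot)

# Correlation heat map of the numeric audio features
songs_deduped %>%
  select(danceability, energy, key, loudness, mode, speechiness,
         acousticness, instrumentalness, liveness, valence, tempo, duration_sec) %>%
  cor() %>%
  corrplot(method = "color", type = "upper")
```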

Energy has a strong 0.68 positive correlation with Loudness, which I suppose explains why so many musicians are always asking their crowds to “make some noise.” Energy also has a moderate negative correlation with Acousticness, which intuitively makes sense as well. To reduce collinearity, we will remove Energy from modeling.

Model Setup

Training and Testing Splits

For this multi-class classification, we will try both a k-nearest neighbors model and a random forest model using the musical attributes. We will not include anything about time period or artist because we want to see if the computer can distinguish genres on sound alone. The 9,790 songs used are split into training and testing sets (7,344/2,446) with the proportion of each genre held constant. We will also use bootstrap resampling, which allows us to evaluate the models across many different resamples of the training data.
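A minimal sketch of the split and resamples with rsample (the seed and the 75/25 proportion are my assumptions, chosen to match the counts above):

```r
library(tidymodels)

# pop/rap/r&b songs from 1990 on, with zero-popularity tracks removed
model_data <- songs_deduped %>%
  filter(playlist_genre %in% c("pop", "rap", "r&b"),
         year >= 1990,
         track_popularity > 0)

set.seed(1234)
# 75/25 split, stratified so each genre's proportion is held constant
song_split <- initial_split(model_data, prop = 0.75, strata = playlist_genre)
song_train <- training(song_split)
song_test  <- testing(song_split)

# bootstrap resamples of the training data for model evaluation
song_boot <- bootstraps(song_train, strata = playlist_genre)
```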

Pre-Processing Recipe

The Tidymodels package allows us to create a “recipe” for pre-processing our data before modeling, which eliminates redundancy when building multiple models. The recipe below specifies a prediction formula, eliminates non-predictor variables, removes highly correlated variables (energy), and then centers and scales the numeric attributes so they are on a level playing field when assessing their predictive power. We then use the prep and bake functions to see what the recipe looks like when applied to the training data. A summary of the attributes is below:
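A sketch of that recipe (the exact variables dropped and the 0.6 correlation threshold are my assumptions; the threshold just needs to be low enough to catch energy):

```r
song_recipe <- recipe(playlist_genre ~ ., data = song_train) %>%
  # drop identifiers and other non-predictors
  step_rm(track_id, track_artist, track_name, decade, track_popularity, year) %>%
  # remove highly correlated predictors (this catches energy)
  step_corr(all_numeric_predictors(), threshold = 0.6) %>%
  # center and scale the remaining numeric attributes
  step_normalize(all_numeric_predictors())

# apply the recipe to the training data and summarize the result
song_recipe %>% prep() %>% bake(new_data = NULL) %>% summary()
```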

##   danceability           key             loudness            mode        
##  Min.   :-4.21927   Min.   :-1.4581   Min.   :-7.2229   Min.   :-1.0918  
##  1st Qu.:-0.63863   1st Qu.:-0.9078   1st Qu.:-0.4979   1st Qu.:-1.0918  
##  Median : 0.09736   Median : 0.1929   Median : 0.1533   Median : 0.9158  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.71560   3rd Qu.: 0.7432   3rd Qu.: 0.6855   3rd Qu.: 0.9158  
##  Max.   : 2.24648   Max.   : 1.5687   Max.   : 2.6410   Max.   : 0.9158  
##   speechiness       acousticness     instrumentalness     liveness      
##  Min.   :-0.9086   Min.   :-0.8797   Min.   :-0.2626   Min.   :-1.2257  
##  1st Qu.:-0.7311   1st Qu.:-0.7593   1st Qu.:-0.2626   1st Qu.:-0.6121  
##  Median :-0.4610   Median :-0.4111   Median :-0.2626   Median :-0.3946  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4954   3rd Qu.: 0.4253   3rd Qu.:-0.2598   3rd Qu.: 0.3087  
##  Max.   : 6.9208   Max.   : 3.4208   Max.   : 6.2030   Max.   : 5.8912  
##     valence               tempo           duration_sec     playlist_genre
##  Min.   :-2.1608634   Min.   :-2.36162   Min.   :-1.6871   pop:2807      
##  1st Qu.:-0.7622605   1st Qu.:-0.82223   1st Qu.:-0.7891   r&b:2083      
##  Median :-0.0007789   Median :-0.03537   Median :-0.1268   rap:2454      
##  Mean   : 0.0000000   Mean   : 0.00000   Mean   : 0.0000                 
##  3rd Qu.: 0.7697143   3rd Qu.: 0.65717   3rd Qu.: 0.6941                 
##  Max.   : 2.1349743   Max.   : 3.36046   Max.   : 2.3891

Evaluate Models

Accuracy Measures: Random Forest vs. KNN

Here are the results for each model when it comes to predicting the correct genre:

| model | accuracy | roc_auc |
|---|---|---|
| knn | 0.5325 | 0.6959 |
| rf | 0.6325 | 0.7943 |
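For reference, a minimal sketch of how the two models might be specified and resampled (the engine choices and the absence of tuning are my assumptions):

```r
knn_spec <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")

# importance = "impurity" is set so we can pull variable importance later
rf_spec <- rand_forest() %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

knn_res <- workflow() %>%
  add_recipe(song_recipe) %>%
  add_model(knn_spec) %>%
  fit_resamples(resamples = song_boot, metrics = metric_set(accuracy, roc_auc))

rf_res <- workflow() %>%
  add_recipe(song_recipe) %>%
  add_model(rf_spec) %>%
  fit_resamples(resamples = song_boot, metrics = metric_set(accuracy, roc_auc))

collect_metrics(knn_res)
collect_metrics(rf_res)
```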

For a classification problem like this one with fairly balanced classes, accuracy is the better metric (accuracy is a pure measure of correct predictions; AUC is better when weighting false positives and false negatives matters). The random forest model performed at about 63.3% accuracy on the training data compared to about 53% for the knn model. I think this gap makes sense given the original premise that pop, r&b, and rap have a lot of crossover in how they sound: a random forest can establish specific decision rules for the borderline cases, while if KNN misclassifies one song, there’s a very good chance it will also misclassify all the songs that sound just like it (hence the name “nearest neighbors”).

By comparison, a random guess would get it right 33.33% of the time, so the RF model nearly doubling that is fairly strong.

Benchmarking Accuracy Measure

However, I wanted a benchmark for evaluating my initial question of whether pop, rap, and r&b really are that similar. So I also created a separate random forest model to classify rap against two of the other genres in the original dataset: rock and edm. Those three genres sound very distinct from each other. I hypothesized that the accuracy of the rock/edm model should be much higher than the 63.3% of the pop/rap/r&b model, even when holding all of the rap songs constant across both models.

| model | accuracy | roc_auc |
|---|---|---|
| RF - pop/rap/r&b | 0.6325 | 0.7943 |
| RF - rock/edm/rap | 0.8060 | 0.9290 |

As I suspected, the computer has a much easier time telling rap, rock, and edm apart than rap, pop, and r&b. The rock/edm model was accurate on 80.6% of in-sample songs, over 17 percentage points better than the pop/rap/r&b model! So it’s clearly not just my old-man ears that struggle to figure out rap, pop, and r&b.

Modeling Testing Data

Let’s now run the RF model (rap/pop/r&b) on the testing data and see how it does predicting genres out of sample.
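A sketch of this step using tune’s last_fit(), which fits the workflow on the full training set and evaluates it on the held-out test set in one call:

```r
rf_final <- workflow() %>%
  add_recipe(song_recipe) %>%
  add_model(rf_spec) %>%
  last_fit(song_split, metrics = metric_set(accuracy, roc_auc))

collect_metrics(rf_final)
```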

| .metric | .estimate |
|---|---|
| accuracy | 0.6512674 |
| roc_auc | 0.8101257 |

Interestingly, our random forest model performed about 2 percentage points better on the testing data (65.1% accuracy) than it did on the training data (63.3%). Typically we would expect accuracy on the training data to be slightly higher (the model parameters are “biased” toward the training data). This likely indicates a fair amount of variance in the model. In other words, there are probably quite a few songs the model classified correctly on the testing data even though it wasn’t particularly confident in the choice.

One way to examine this is the distribution of predicted probabilities on the testing data. The RF model assigns each song a probability of belonging to rap, r&b, or pop. This graph shows how confident the model was when making its selection for each song.
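A sketch of that graph, taking each song’s highest predicted probability as its confidence:

```r
# Pull the test-set predictions from the final fit
test_preds <- collect_predictions(rf_final)

test_preds %>%
  mutate(confidence = pmax(.pred_pop, `.pred_r&b`, .pred_rap)) %>%
  ggplot(aes(confidence)) +
  geom_histogram(binwidth = 0.05) +
  labs(x = "Predicted Probability of Chosen Genre", y = "Number of Songs")
```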

We can see the predicted probabilities skew heavily toward the lower end, which confirms that the model was more often than not classifying with low confidence. It’s worth reiterating that this is not a surprise given my original hypothesis: if the three genres really do sound alike, then the model would not be overly confident when making a choice. This isn’t to say the results are not good (57% average confidence across 3 choices does show some degree of certainty), but it does offer an explanation as to why the testing data might have slightly outperformed the training data.

Confusion Matrix Plot by Predicted Genre

How did the model do at predicting each of the 3 genres individually? Let’s check the mosaic plot.
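A sketch using yardstick’s conf_mat() and its mosaic autoplot, reusing test_preds from above:

```r
# Mosaic plot of the confusion matrix on the test set
test_preds %>%
  conf_mat(truth = playlist_genre, estimate = .pred_class) %>%
  autoplot(type = "mosaic")
```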

Wow! So it turns out the model is actually very accurate at distinguishing pop (77%) and rap (74%), but it is only a little better than random at distinguishing r&b (43%). This means my hunch that pop and rap were blending together was wrong, but I was right about it happening to r&b. The model validates that r&b tends to sound too much like either a rap or a pop song.

Keep in mind, nearly 80% of the training data consists of songs from 2010 on. As a fan of 90s and 2000s r&b, I’d be really curious whether this is a trend that has evolved in more recent songs or whether it was always like this and I had just never noticed. However, I do not have enough data to evaluate this trend right now.

Variable Importance Plot

Finally, let’s see a variable importance plot to see what features were most predictive.
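A sketch with the vip package, which works here because the ranger engine was fit with importance = "impurity" above:

```r
library(vip)

# Variable importance from the fitted random forest
rf_final %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)
```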

The model found speechiness and loudness (markers of rap) and danceability and tempo (markers of pop) to be the most important features. Duration also showed some predictive power, and ironically it was the trait most associated with predicting the otherwise difficult r&b (r&b songs averaged about 10 seconds longer than rap and 15 seconds longer than pop in the training data).

But perhaps most importantly, now that we have a model, we can circle back to my original question: what the heck genre is “Lurkin” supposed to be?
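A sketch of that prediction, assuming `lurkin` is a hypothetical one-row data frame holding the song’s audio features in the same format as the training data:

```r
# Predicted genre probabilities for "Lurkin'" from the fitted workflow
rf_final %>%
  extract_workflow() %>%
  predict(new_data = lurkin, type = "prob")
```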

| track_artist | track_name | .pred_pop | .pred_r&b | .pred_rap |
|---|---|---|---|---|
| Chris Brown | Lurkin’ (feat. Tory Lanez) | 0.315856 | 0.2456512 | 0.4384928 |

Sigh. The computer was just as confused as I was!

Summary