Recently there was an AI bot by The Pudding that went viral for roasting how basic people’s Spotify accounts are (you can try it out here: https://pudding.cool/2020/12/judge-my-spotify/). On one hand, the bot was pretty cool and showed how powerful AI can be. On the other hand, I was personally offended by it calling me old for liking alternative rock and r&b from the early 2000s.
Look, it’s not that I have anything against more modern music, but I just don’t understand what style it is going for most of the time. I tried putting on a more recent r&b playlist (trying to impress the AI, of course) and discovered a song called “Lurkin” by Chris Brown and Tory Lanez. Chris Brown is mostly an r&b singer and Tory Lanez is a rapper, yet “Lurkin” with its catchy hooks seems clearly targeted to be a mainstream pop song. It only got more confusing when I looked the song up on the Spotify API and saw it classified as “Latin hip hop”. To summarize, we have a rapper and an r&b singer collaborating on a song that could be considered pop or rap, yet I first heard it on an r&b playlist. You got all that?
This got me wondering whether we could use machine learning to train a model to tell the difference between rap, r&b, and pop. Thankfully, there was a publicly available Spotify dataset from the Tidy Tuesday community (https://github.com/rfordatascience/tidytuesday) to work with.
Data Cleanup
Pre-processing
The release dates are very inconsistent, with several having a full “yyyy-mm-dd” format and others simply having “yyyy”. For purposes of this analysis, we only want the years, which happen to be the first 4 characters for every release date string. I am also grouping them into decades (with 2020 being looped in with the 2010s since this dataset only goes through March 2020), and then removing abnormally long and short songs.
df_spotify = spotify_songs %>%
  filter(!is.na(track_artist),
         !is.na(track_name)) %>%
  #Extracts the first 4 characters of the release date, which is the year
  mutate(year = substr(track_album_release_date, 1, 4),
         year = as.numeric(year),
         #Groups years into decades
         decade = case_when(year <= 1979 ~ 'Pre-1980s',
                            year <= 1989 ~ '1980s',
                            year <= 1999 ~ '1990s',
                            year <= 2009 ~ '2000s',
                            year <= 2020 ~ '2010s',
                            TRUE ~ 'NA'),
         #Reorders decade as a factor to remain sequential from earliest to latest
         decade = fct_relevel(decade, 'Pre-1980s', '1980s', '1990s', '2000s', '2010s'),
         #Converts milliseconds to seconds
         duration_sec = duration_ms * .001) %>%
  filter(decade != 'NA')
#This gets the 10th and 90th percentiles of song lengths, used to remove abnormally short or long songs
duration.bounds = quantile(df_spotify$duration_sec, c(.1, .9))
df_spotify = df_spotify %>%
  filter(duration_sec >= duration.bounds[1] &
         duration_sec <= duration.bounds[2])
#Eliminates a lot of the columns that won't be needed, mostly stuff about the album or playlist
df_spotify = df_spotify %>%
  select(track_id, track_artist, track_name, playlist_genre, decade) %>%
  bind_cols(df_spotify %>% select_if(is.numeric)) %>%
  select(-duration_ms)
The dataset has 26,264 rows, but it contains about 6,000 duplicated songs across different playlists due to artist re-releases (for example, Bon Jovi’s “Livin’ on a Prayer” showed up 11 different times on albums released from 1986 all the way through 2010). While there is no sure way to know which release of each duplicated song was the original (“Livin’ on a Prayer” had 5 distinct releases from January 1986 alone), I will assume that the most popular version of the song in the dataset is most likely the original release. The chunk below extracts only the most popular release of each track and eliminates any duplicates.
df_spotify = df_spotify %>%
  #Sorts the data so the most popular version of each track comes first, then uses slice to keep the 1st row of each artist/song combo
  arrange(desc(track_popularity)) %>%
  group_by(track_artist, track_name) %>%
  slice(1) %>%
  ungroup()
#Sample Data
set.seed(20120)
kable(df_spotify %>% sample_n(5)) %>%
  kable_paper() %>%
  scroll_box(height = '75%', width = '100%')
| track_id | track_artist | track_name | playlist_genre | decade | track_popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | year | duration_sec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1MwSMSjbU20bF5nFgWLsJh | SAND&SEA | looking around at the sky. | pop | 2010s | 38 | 0.708 | 0.667 | 2 | -7.328 | 1 | 0.0587 | 0.51200 | 0.00e+00 | 0.137 | 0.5250 | 160.004 | 2020 | 225.000 |
| 5a2dsl9XD05WBQtSCAkTdF | Lucchii | Disconnected | rap | 2010s | 36 | 0.512 | 0.877 | 5 | -5.548 | 0 | 0.5740 | 0.71100 | 0.00e+00 | 0.425 | 0.3170 | 128.224 | 2020 | 277.500 |
| 455AfCsOhhLPRc68sE01D8 | Katy Perry | Last Friday Night (T.G.I.F.) | pop | 2010s | 74 | 0.649 | 0.815 | 3 | -3.796 | 0 | 0.0415 | 0.00125 | 4.31e-05 | 0.671 | 0.7650 | 126.030 | 2012 | 230.747 |
| 4rvnIyOOA0ws4PMiZpZgZK | Coopex | Never Letting Go | latin | 2010s | 45 | 0.490 | 0.818 | 6 | -6.653 | 0 | 0.2190 | 0.10800 | 1.29e-04 | 0.142 | 0.0987 | 128.065 | 2019 | 183.755 |
| 5cYZ5msSztsVKkaSYIoZ3b | Gucci Mane | Drop It Off (Feat. Eva Trill) | rap | 2010s | 20 | 0.792 | 0.428 | 2 | -7.862 | 1 | 0.0715 | 0.07620 | 0.00e+00 | 0.183 | 0.2020 | 148.065 | 2015 | 182.973 |
Popularity by Genre
Here is the breakdown of popularity by genre, to check for observable trends. This chunk creates a histogram of popularity for each genre.
#Calculate the group mean popularity by genre for overlay on the plot
mean.popularity = df_spotify %>%
  group_by(playlist_genre) %>%
  summarize(mean.pop = mean(track_popularity))
ggplot(df_spotify) +
  geom_histogram(aes(track_popularity),
                 breaks = 0:100, color = 'blue', alpha = .75) +
  geom_vline(data = mean.popularity, col = 'red',
             mapping = aes(xintercept = mean.pop)) +
  geom_label(data = mean.popularity, col = 'red',
             aes(x = mean.pop, y = 300,
                 label = paste('Mean:', round(mean.pop, 1)))) +
  facet_wrap(~playlist_genre) +
  ggtitle('Song Popularity on Spotify by Genre')
The first thing that stands out is the large number of songs with 0 popularity, which looks consistent across genres. I don’t know if this is some sort of input error or if these are simply obscure songs that don’t register many listens. To be honest, the data dictionary did not provide any info on how popularity is calculated, but I think it is fairly safe to remove these songs for this analysis.
Clearly edm is the least popular genre (amen to that), with Latin and pop the most popular, but the important thing is that, other than edm, the group means are relatively close to each other.
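The chunk that actually drops those zero-popularity songs isn’t shown above; here is a minimal sketch of what it presumably looks like (a simple filter on the popularity score):
#Drops songs with a popularity score of 0 before any further analysis
df_spotify = df_spotify %>%
  filter(track_popularity > 0)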
This boxplot is now looking at the popularity distribution by decade.
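The plotting chunk isn’t shown here; below is a minimal ggplot2 sketch of what it might look like, assuming popularity is broken out by decade and split by genre:
#Boxplots of track popularity by decade, faceted by genre
ggplot(df_spotify) +
  geom_boxplot(aes(x = decade, y = track_popularity, fill = playlist_genre)) +
  facet_wrap(~playlist_genre) +
  theme(legend.position = 'none') +
  ggtitle('Song Popularity on Spotify by Decade and Genre')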
Interesting to see pop music peaking in the 1980s and then dropping after that. Meanwhile, rap and r&b both saw declines in the 2000s but huge resurgences in the 2010s. It is important to note that with so much more data on 2010s songs than any other decade, it’s unlikely there’s much statistical significance in saying that 2010s music is more popular than any other decade (that’s probably not a big surprise, since I would assume Spotify’s userbase skews much younger). We can also see a number of outliers in 1980s rap, which is largely because of the lack of data points (rap didn’t really take off on a national level until the 1990s). So I will remove all songs before 1990 from modeling.
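The chunk that builds the data frame used in the correlation plot below isn’t shown; here is a minimal sketch, assuming df_fewer_genres is simply df_spotify with the pre-1990 decades dropped (the name suggests a genre filter may also have been applied here, but that is a guess):
#Removes all songs released before 1990; columns are left in the same order as df_spotify
df_fewer_genres = df_spotify %>%
  filter(!decade %in% c('Pre-1980s', '1980s'))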
Correlation Plot
The dataset contains 12 numeric attributes about the actual sound of the music. We can use the corrplot function to create a heat map of the correlations between the numerical song attributes provided in the dataset.
df_fewer_genres %>%
  select(7:17) %>% #these are the song attributes
  scale() %>%
  cor() %>%
  corrplot(method = 'color',
           type = 'upper',
           diag = F,
           tl.col = 'black',
           addCoef.col = "grey30",
           number.cex = 0.6,
           tl.cex = .8,
           main = 'Correlation Plot of Song Attributes',
           mar = c(1, 0, 2, 0))
Energy has a strong 68% positive correlation with Loudness, which I suppose explains why so many musicians are always asking their crowds to “make some noise.” Energy also has a moderate negative correlation with Acousticness, which makes intuitive sense as well. To reduce collinearity, we will remove Energy from modeling.
Model Setup
Training and Testing Splits
For this multi-class classification, we will try both a k-nearest neighbors model and a random forest model using the musical attributes. We will not include anything about time period or artist, because we want to see whether the computer can distinguish genres on sound alone. The 9,790 songs used are split into training and testing sets (7,344/2,446) with the proportion of each genre held constant. We will also use bootstrap resampling, which lets us evaluate the models across repeated resamples of the training data.
set.seed(516)
#Creates a separate data frame for just our 3 genres and the decades of interest
df_spotify_model = df_spotify %>%
  filter(decade %in% c('1990s', '2000s', '2010s'),
         playlist_genre %in% c('rap', 'pop', 'r&b'))
#Splits the data into training and testing. Strata ensures an equal genre distribution across the testing and training sets.
df_split = initial_split(df_spotify_model,
                         strata = playlist_genre)
df_training = training(df_split)
df_testing = testing(df_split)
#Gets stratified bootstrap samples of the training data
df_bootstraps = bootstraps(df_training,
                           strata = playlist_genre)
Pre-Processing Recipe
The tidymodels package allows us to create a “recipe” for pre-processing our data before modeling, which eliminates redundancy when building multiple models. The recipe below specifies a prediction formula, drops the non-predictor variables, removes highly correlated variables (energy), and then centers and scales the numeric attributes so that they are on a level playing field when assessing their predictive power. We then use the prep and bake functions to see what the recipe looks like when applied to the training data. A summary of the attributes is below:
#Creates a recipe for consistently processing our data. Starts with writing the formula for modeling to predict genre
spotify_recipe = recipe(playlist_genre ~ ., df_training) %>%
  #Removes all of the excess variables
  step_rm(all_nominal(), track_popularity, year, -playlist_genre) %>%
  #Gets rid of variables with correlation over .6 (which will filter out Energy)
  step_corr(all_numeric(), threshold = .6) %>%
  #Centers and scales all numeric predictors
  step_center(all_predictors()) %>%
  step_scale(all_predictors())
#Tidymodels requires a prep step on the training data. You can then feed this into "bake", which shows the results of our recipe when applied to the training data
spotify_prepped_recipe = prep(spotify_recipe, df_training)
df_training_baked = bake(spotify_prepped_recipe, df_training)
summary(df_training_baked)
## danceability key loudness mode
## Min. :-4.21927 Min. :-1.4581 Min. :-7.2229 Min. :-1.0918
## 1st Qu.:-0.63863 1st Qu.:-0.9078 1st Qu.:-0.4979 1st Qu.:-1.0918
## Median : 0.09736 Median : 0.1929 Median : 0.1533 Median : 0.9158
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.71560 3rd Qu.: 0.7432 3rd Qu.: 0.6855 3rd Qu.: 0.9158
## Max. : 2.24648 Max. : 1.5687 Max. : 2.6410 Max. : 0.9158
## speechiness acousticness instrumentalness liveness
## Min. :-0.9086 Min. :-0.8797 Min. :-0.2626 Min. :-1.2257
## 1st Qu.:-0.7311 1st Qu.:-0.7593 1st Qu.:-0.2626 1st Qu.:-0.6121
## Median :-0.4610 Median :-0.4111 Median :-0.2626 Median :-0.3946
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4954 3rd Qu.: 0.4253 3rd Qu.:-0.2598 3rd Qu.: 0.3087
## Max. : 6.9208 Max. : 3.4208 Max. : 6.2030 Max. : 5.8912
## valence tempo duration_sec playlist_genre
## Min. :-2.1608634 Min. :-2.36162 Min. :-1.6871 pop:2807
## 1st Qu.:-0.7622605 1st Qu.:-0.82223 1st Qu.:-0.7891 r&b:2083
## Median :-0.0007789 Median :-0.03537 Median :-0.1268 rap:2454
## Mean : 0.0000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.7697143 3rd Qu.: 0.65717 3rd Qu.: 0.6941
## Max. : 2.1349743 Max. : 3.36046 Max. : 2.3891
Random Forest and KNN Models
Finally, we set the engine for each model and combine it with our pre-processing recipe in a workflow.
#Set the random forest engine
rf_model = rand_forest(mtry = tune(), min_n = tune(), trees = 300) %>%
  set_mode("classification") %>%
  set_engine("ranger")
#Create the workflow to apply the pre-processing recipe and then the model
rf_workflow = workflow() %>%
  add_model(rf_model) %>%
  add_recipe(spotify_recipe)
set.seed(516)
#Tune the random forest across the bootstrap resamples
rf_bootstraps = rf_workflow %>%
  tune_grid(resamples = df_bootstraps, grid = 3)
#Now repeat the same 3 steps for the knn model (left at its default settings)
knn_model = nearest_neighbor() %>%
  set_mode('classification') %>%
  set_engine('kknn')
knn_workflow = workflow() %>%
  add_model(knn_model) %>%
  add_recipe(spotify_recipe)
set.seed(516)
knn_bootstraps = knn_workflow %>%
  tune_grid(resamples = df_bootstraps, grid = 3)
Evaluate Models
Accuracy Measures: Random Forest vs. KNN
Here are the results for each model when it comes to predicting the correct genre:
#Select the best bootstrap models by accuracy
rf_best = select_best(rf_bootstraps, metric = 'accuracy')
knn_best = select_best(knn_bootstraps, metric = 'accuracy')
rf_best_metrics = collect_metrics(rf_bootstraps) %>%
  filter(`.config` == rf_best$.config)
#Gets the metrics on the training data for the best models and puts them into table form
model_comp = rf_best_metrics %>%
  mutate(model = 'rf') %>%
  bind_rows(
    collect_metrics(knn_bootstraps) %>%
      filter(`.config` == knn_best$.config) %>%
      mutate(model = 'knn')) %>%
  pivot_wider(names_from = .metric, values_from = mean) %>%
  group_by(model) %>%
  summarise(accuracy = round(sum(accuracy, na.rm = T), 4),
            roc_auc = round(sum(roc_auc, na.rm = T), 4))
kable(model_comp)
| model | accuracy | roc_auc |
|---|---|---|
| knn | 0.5325 | 0.6959 |
| rf | 0.6325 | 0.7943 |
For a classification problem like this one with fairly balanced classes, accuracy is a reasonable headline metric (it is simply the share of correct predictions), while ROC AUC is more useful when class imbalance or the cost of false positives and negatives matters. The random forest model reached about 63.3% accuracy on the training resamples compared to about 53% for the KNN model. That gap makes some sense given the original premise that pop, r&b, and rap overlap heavily in how they sound: the random forest can build more specific decision rules, whereas if KNN misclassifies one song, there is a very good chance it will also misclassify the other songs that sound just like it (hence the name “nearest neighbors”).
By comparison, a random guess would be right only 33.3% of the time, so the RF model nearly doubling that is fairly strong.
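To back up both the “fairly balanced classes” point and that baseline, here is a small sanity check (not part of the original modeling) of the training class proportions:
#Class balance in the training data: a uniform random guess is right ~33% of the time,
#while always guessing the largest class (pop) would be right ~38% of the time
df_training %>%
  count(playlist_genre) %>%
  mutate(proportion = round(n / sum(n), 3))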
Benchmarking Accuracy Measure
However, I wanted a benchmark for evaluating my initial question of whether pop, rap, and r&b really are that similar. So I also created a separate random forest model, classifying rap against 2 of the other genres in the original dataset: rock and edm. Those 3 genres sound very distinct from each other. I would hypothesize that the accuracy of the rock/edm model should be much higher than the 63.3% of the pop/rap/r&b model, even when holding all of the rap songs constant across both models.
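The benchmark chunk isn’t shown in full; below is a sketch of how it could be built, assuming the rock/edm/rap subset goes through the same split, recipe, and tuning steps as the main model (note that this simple split does not force exactly the same rap songs into both training sets):
#Benchmark: the same random forest pipeline, but on a rock/edm/rap subset
set.seed(516)
df_benchmark = df_spotify %>%
  filter(decade %in% c('1990s', '2000s', '2010s'),
         playlist_genre %in% c('rock', 'edm', 'rap'))
benchmark_split = initial_split(df_benchmark, strata = playlist_genre)
benchmark_training = training(benchmark_split)
benchmark_bootstraps = bootstraps(benchmark_training, strata = playlist_genre)
#Same recipe steps as before, just templated on the benchmark training data
benchmark_recipe = recipe(playlist_genre ~ ., benchmark_training) %>%
  step_rm(all_nominal(), track_popularity, year, -playlist_genre) %>%
  step_corr(all_numeric(), threshold = .6) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())
benchmark_results = workflow() %>%
  add_model(rf_model) %>%
  add_recipe(benchmark_recipe) %>%
  tune_grid(resamples = benchmark_bootstraps, grid = 3)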
| model | accuracy | roc_auc |
|---|---|---|
| RF - pop/rap/r&b | 0.6325 | 0.7943 |
| RF - rock/edm/rap | 0.8060 | 0.9290 |
As I suspected, the computer has a much easier time telling rap, rock, and edm apart than it does rap, pop, and r&b. The rock/edm model was 80.6% accurate on in-sample songs, over 17 percentage points better than the pop/rap/r&b model! So it’s clearly not just my old-man ears that struggle to tell rap, pop, and r&b apart.
Modeling Testing Data
Let’s now run the RF model (rap/pop/r&b) on the testing data and see how it does predicting genres out of sample.
#Finalizes the random forest workflow with the most accurate hyperparameters from the bootstrap models
final_rf = rf_workflow %>%
  finalize_workflow(select_best(rf_bootstraps, metric = 'accuracy'))
#Fits the final model on the training data and evaluates it on the testing data
spotify_fit = last_fit(final_rf, df_split)
kable(collect_metrics(spotify_fit) %>% select(.metric, .estimate))
| .metric | .estimate |
|---|---|
| accuracy | 0.6512674 |
| roc_auc | 0.8101257 |
Interesting that our random forest model performed about 2 percentage points better on the testing data (65% accuracy) than it did on the training resamples (63.3%). Typically we would expect accuracy on the training data to be slightly higher (the model parameters are “biased” toward the training data). This likely indicates a fair amount of variance in the model. In other words, there are probably quite a few songs the model classified correctly on the testing data even though it wasn’t particularly confident in the choice.
One way to examine this is to look at the distribution of the predicted probabilities on the testing data. The RF model assigns each song a probability of belonging to rap, r&b, and pop, so the graph shows how confident the model was when making its selection for each song.
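The chunk behind this graph isn’t shown; here is a sketch of one way to build it from the last_fit() predictions (the .pred_* column names match those in the prediction table near the end of this post):
#Pulls the out-of-sample predictions and keeps the probability of whichever genre the model picked
df_confidence = collect_predictions(spotify_fit) %>%
  mutate(chosen_prob = pmax(.pred_pop, `.pred_r&b`, .pred_rap))
#Average confidence across the testing songs
mean(df_confidence$chosen_prob)
#Histogram of how confident the model was in each pick
ggplot(df_confidence) +
  geom_histogram(aes(chosen_prob), bins = 30, color = 'blue', alpha = .75) +
  ggtitle('Predicted Probability of the Chosen Genre (Testing Data)')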
The predicted probabilities are skewed toward the low end of the range, which confirms that the model was more often than not classifying with low confidence. It’s worth reiterating that this is not a surprise given my original hypothesis: if the three genres really do sound alike, the model should not be overly confident when making a choice. This isn’t to say the results are bad (57% average confidence across 3 choices does show some degree of certainty), but it does offer an explanation as to why the testing data might have slightly outperformed the training data.
Confusion Matrix Plot by Predicted Genre
How did the model do in predicting each of the 3 genres individually? Let’s check the mosaic plot.
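The mosaic plot chunk isn’t included; a minimal sketch using yardstick’s conf_mat(), assuming the plot is built straight from the testing predictions:
#Confusion matrix of actual vs. predicted genre on the testing data, drawn as a mosaic plot
collect_predictions(spotify_fit) %>%
  conf_mat(truth = playlist_genre, estimate = .pred_class) %>%
  autoplot(type = 'mosaic')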
Wow! So it turns out that the model is actually quite accurate at distinguishing pop (77%) and rap (74%), but it is only a little better than random at distinguishing r&b (43%). This means my suspicion that pop and rap were blending together was wrong, but I was right about it happening to r&b. The model confirms that r&b tends to sound too much like either a rap or a pop song.
Keep in mind that nearly 80% of the training data consists of songs from 2010 on. As a fan of 90s and 2000s r&b, I’d be really curious whether this is a trend that has emerged in more recent songs or whether it was always like this and I had just never noticed. However, I do not have enough data to evaluate that right now.
Variable Importance Plot
Finally, let’s look at a variable importance plot to see which features were most predictive.
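The variable importance chunk isn’t shown either. One wrinkle is that ranger only records importances if an importance mode is set at fit time, which the spec above did not do; here is a sketch that refits the best model with permutation importance and plots it with the vip package (an assumption about how the original plot was made):
library(vip)
#Refits the best random forest with an importance mode so ranger records variable importances
rf_vip_spec = rand_forest(mtry = tune(), min_n = tune(), trees = 300) %>%
  set_mode('classification') %>%
  set_engine('ranger', importance = 'permutation') %>%
  finalize_model(select_best(rf_bootstraps, metric = 'accuracy'))
workflow() %>%
  add_model(rf_vip_spec) %>%
  add_recipe(spotify_recipe) %>%
  fit(df_training) %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)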
The model found speechiness and loudness (associated with rap) and danceability and tempo (associated with pop) to be the most important features. Duration also showed some predictive power, ironically being the trait most associated with predicting the otherwise difficult r&b (r&b songs were about 10 seconds longer on average than rap and 15 seconds longer than pop in the training data).
But perhaps most importantly, now that we have a model, we can circle back to my original question: what the heck genre is “Lurkin” supposed to be?
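The lookup and prediction chunk isn’t shown; here is a sketch of how it could be done, assuming a recent enough tune version where extract_workflow() works on last_fit() results:
#Finds the track in the cleaned data and runs it through the fitted workflow
df_lurkin = df_spotify %>%
  filter(track_artist == 'Chris Brown',
         grepl('Lurkin', track_name))
extract_workflow(spotify_fit) %>%
  predict(df_lurkin, type = 'prob') %>%
  bind_cols(df_lurkin %>% select(track_artist, track_name))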
| track_artist | track_name | .pred_pop | .pred_r&b | .pred_rap |
|---|---|---|---|---|
| Chris Brown | Lurkin’ (feat. Tory Lanez) | 0.315856 | 0.2456512 | 0.4384928 |
Sigh. The computer was just as confused as I was!
Summary
- The random forest model predicted rap, r&b, and pop songs with 65% accuracy out of sample, beating the k-nearest neighbors model by over 10 percentage points.
- It was significantly better at predicting pop (77%) and rap (73%) than r&b (43%).
- The amount of speech and loudness of a song are strong predictors for rap.
- The amount of danceability and tempo are strong predictors for pop.
- The attributes for r&b songs blend in too much with pop and/or rap attributes for the model to distinguish them.