Spotify Playlist Clustering Demo

This post is a walkthrough of an R script I made, which lets you make personalised Spotify playlists from the songs you already have saved in your library. More specifically, the script organises songs into playlists based on the audio features provided by Spotify. This gives a quick way to arrange large volumes of songs into playlists that have similar audio characteristics.

Currently Spotify tracks 8 audio features, which are as follows:

energy - highly energetic songs feel fast and noisy
danceability - defines how suitable a track is for dancing
instrumentalness - tracks high in instrumentalness will contain fewer vocals
liveness - detects presence of a live audience in the recording
loudness - overall loudness of a track in decibels (dB)
speechiness - detects presence of spoken words in a track
tempo - tempo of a track in beats per minute
valence - how musically positive or negative a track is

You can see that some features are better defined than others. For fuller explanations visit https://developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/.

I am first going to show how we can select the songs we want to be used for the new playlist. Then I will explain how the songs are organised and how we can add the selected songs to the new playlist.

A lot of user-made functions are used in this walkthrough, which, to keep the focus on the workflow, will not be explained in detail. If you want, you can check out the annotated code on my GitHub page at https://github.com/AndriiG13/Personalised_Spotify_Playlists. Finally, the functions upon which my custom functions are built are from the “spotifyr” package by Charlie Thompson at https://github.com/charlie86/spotifyr.

Let’s begin.

Step 1

First we need to choose the pool of songs from which the most sonically suitable ones will be picked for our new playlist. This can be done by selecting a number of the user's existing Spotify playlists and then combining all of the songs from those playlists into one dataframe.

There is already a function in the “spotifyr” package that lets us download track data but it only loads a maximum of 100 tracks at once.
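A minimal sketch of how a per-request cap like this can be worked around by requesting the tracks page by page. The fetch_page helper below is a hypothetical stand-in for a real API call (it only fabricates row numbers), and the offset-style paging is an assumption rather than a description of how spotifyr or my own function does it.

```r
page_size <- 100      # per-request cap mentioned above
total_tracks <- 147   # size of the example song pool

# Hypothetical stand-in for one API request: returns the rows
# offset + 1 ... offset + limit, truncated at the end of the playlist.
fetch_page <- function(offset, limit, total = total_tracks) {
  if (offset >= total) return(data.frame(track_number = integer(0)))
  data.frame(track_number = seq(offset + 1, min(offset + limit, total)))
}

# One offset per page: 0, 100
offsets <- seq(0, total_tracks - 1, by = page_size)

# Fetch every page and stack the results into one dataframe
pages <- lapply(offsets, fetch_page, limit = page_size)
all_tracks <- do.call(rbind, pages)

nrow(all_tracks)
```

With 147 tracks this takes two requests, one returning 100 rows and one returning the remaining 47.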

The modified function put_songs_from_playlists_into_df takes the user id (which can be found on your Spotify profile) as its first argument and a vector of playlist names as its second.

my_id <- 'dtj7fdsb05eq0rswpflu3su6z'

tracks_from_playlists <- put_songs_from_playlists_into_df(
  user_id = my_id,
  pl_name = c("rap ++", "electronic sounds 1", "the sun is out")
)

nrow(tracks_from_playlists)

## [1] 147

This will be the pool of songs from which we make the new playlist. Right now there are 147 songs to choose from. So far we only have general information about the songs, such as names and ids, so we now need to add audio features for each song.
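The code that fetches the features is not shown in this post. As a rough illustration of the idea only, the sketch below joins a table of audio features onto a table of tracks by track id. Both tables are made-up toy data, and the names tracks, features and track_features_demo are hypothetical. In real use the feature values would come from the Spotify API (as far as I know, spotifyr exposes this as get_track_audio_features).

```r
# Toy stand-in for the track table (names and ids only)
tracks <- data.frame(
  track.id   = c("a1", "b2", "c3"),
  track.name = c("Song A", "Song B", "Song C"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the audio features, deliberately in a different order
features <- data.frame(
  id           = c("c3", "a1", "b2"),
  danceability = c(0.63, 0.41, 0.54),
  energy       = c(0.62, 0.67, 0.87),
  stringsAsFactors = FALSE
)

# Match on the track id so every song gets its own feature values,
# regardless of the order in which the features were returned.
track_features_demo <- merge(tracks, features, by.x = "track.id", by.y = "id")

track_features_demo
```

Joining on the id rather than binding columns side by side avoids mismatches if the two tables are ordered differently.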

track_features <- track_features %>% dplyr::select(-type, -id, -track_href, -analysis_url, -duration_ms, -time_signature, -key, -mode)

Once we remove some of the unnecessary columns, the dataframe looks a bit like this (only the first 6 songs are shown).

track.name	danceability	energy	loudness	speechiness	acousticness	liveness
Checkpoints	0.413	0.670	-7.662	0.0882	0.2050	0.506
Piece of Mind	0.536	0.869	-5.224	0.1830	0.4400	0.748
TOKYO	0.632	0.625	-5.781	0.2910	0.0644	0.689
Heavenly Father	0.608	0.676	-7.560	0.2480	0.2120	0.347
We Don’t Care	0.595	0.754	-5.827	0.1810	0.0142	0.247
No Problem (feat. Lil Wayne & 2 Chainz)	0.652	0.795	-5.192	0.1740	0.1560	0.123

Step 2

Now that we have our song pool and we have audio features for each song, we can begin to make the new playlist.

To do that, I will use K-means clustering, a technique that partitions data into clusters of observations with similar features. We set the number of clusters we want and then the algorithm finds clusters so that the total within-cluster variation is as small as possible. This video gives a great explanation of the algorithm https://www.youtube.com/watch?v=4b5d3muPQmA.
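To make this concrete, here is a small sketch using base R's kmeans() on made-up data. The two toy groups, the column names and the choice of 2 clusters are all assumptions for illustration and are not the settings used in my script.

```r
set.seed(1)

# Two well-separated toy groups of 20 points each, in 2 dimensions
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)
colnames(toy) <- c("feature_1", "feature_2")

# Ask for 2 clusters; nstart = 5 tries 5 random starts and keeps the best
fit <- kmeans(toy, centers = 2, nstart = 5)

table(fit$cluster)   # how many points ended up in each cluster
fit$centers          # the mean of each feature within each cluster
fit$tot.withinss     # total within-cluster sum of squares being minimised
```

The tot.withinss value is the total within-cluster variation that the algorithm tries to make as small as possible.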

Before we do K-means clustering we need to scale the data, as the features are measured on different scales. I also decided to rescale the values to 0-100, so that we only have positive values.

track_features_scaled <- scale(dplyr::select(track_features, -track.id, -track.name, -uri), center = TRUE, scale = TRUE)

track_features_scaled <- apply(track_features_scaled, 2, scales::rescale, to = c(0,100))

track_features_scaled_full <- bind_cols(dplyr::select(track_features, track.id, track.name, uri), as_tibble(track_features_scaled))
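To see what these two steps do, here is a sketch on a tiny made-up table. The rescale_0_100 helper is a hypothetical base R stand-in for scales::rescale(to = c(0, 100)), written out so the example runs without extra packages.

```r
# Toy features measured on very different scales
toy <- data.frame(
  tempo    = c(80, 120, 160),   # beats per minute
  loudness = c(-12, -6, -3)     # decibels
)

# Min-max rescaling of one column to the 0-100 range
rescale_0_100 <- function(x) (x - min(x)) / (max(x) - min(x)) * 100

# Step 1: centre each column on 0 and give it unit standard deviation
standardised <- scale(toy, center = TRUE, scale = TRUE)

# Step 2: rescale each column separately to 0-100
rescaled <- apply(standardised, 2, rescale_0_100)

rescaled
```

After rescaling, every column runs from 0 (its smallest value) to 100 (its largest), so tempo and loudness contribute on a comparable footing.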

Now we get to the really cool part, the clustering function. The user supplies three things: a vector of so-called “high_features”, for which the songs in the new playlist should have high values; a vector of “low_features”, for which the songs should have low values; and the minimum number of songs they want in the new playlist.

The function then runs the clustering algorithm with the specified features and then checks which cluster has the biggest difference between the high and the low features, and saves this value as the “playlist_score” variable. The cluster with the highest playlist_score is then selected.
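The exact formula is not spelled out here, so treat the following as one reading of it: the mean of the high features minus the mean of the low features, computed on a cluster's average values. The helper name playlist_score is hypothetical. The input figures are the Playlist 1 values from the output shown further down, and this reading reproduces the score reported there.

```r
# Hypothetical helper: mean of the "high" features minus mean of the "low" ones
playlist_score <- function(centre, high_features, low_features) {
  mean(centre[high_features]) - mean(centre[low_features])
}

# Average feature values of Playlist 1, taken from the output table
playlist_1 <- c(
  danceability = 79.27126,
  valence      = 81.8576,
  speechiness  = 12.44106,
  loudness     = 27.25764
)

score <- playlist_score(
  playlist_1,
  high_features = c("danceability", "valence"),
  low_features  = c("speechiness", "loudness")
)

score   # 60.71508, matching the reported playlist_score
```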

Importantly, K-means converges to a local minimum, meaning that running the algorithm multiple times will likely result in different cluster structures being formed. Because of this I run K-means 10 times, each time selecting the cluster with the biggest playlist_score. Finally, the function outputs the 3 clusters with the highest playlist_score values.
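A rough sketch of this repeat-and-select loop on made-up data follows. It is not the code from my script: the toy song table, the choice of 4 clusters, and the score (mean of high features minus mean of low features on the cluster centres) are all assumptions for illustration.

```r
set.seed(42)

# Toy song pool: 120 songs with features already on a 0-100 scale
n <- 120
songs <- data.frame(
  danceability = runif(n, 0, 100),
  valence      = runif(n, 0, 100),
  speechiness  = runif(n, 0, 100),
  loudness     = runif(n, 0, 100)
)
high <- c("danceability", "valence")
low  <- c("speechiness", "loudness")
min_songs <- 15

# Run K-means 10 times; in each run keep the best-scoring cluster
best_per_run <- lapply(1:10, function(i) {
  fit <- kmeans(songs[, c(high, low)], centers = 4)  # 4 clusters is an assumption
  centres <- fit$centers
  score <- rowMeans(centres[, high, drop = FALSE]) -
    rowMeans(centres[, low, drop = FALSE])
  score[fit$size < min_songs] <- -Inf   # ignore clusters that are too small
  best <- which.max(score)
  data.frame(iteration = i, playlist_score = score[[best]], size = fit$size[[best]])
})

# Rank the 10 winners and keep the top 3
runs <- do.call(rbind, best_per_run)
top3 <- head(runs[order(-runs$playlist_score), ], 3)

top3
```

Note that kmeans() has its own nstart argument for restarts, but it keeps the run with the lowest total within-cluster sum of squares rather than the highest playlist score, which is why the loop is written by hand here.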

I decided to create a playlist that is high in danceability and has a positive mood (high valence) but is low in speechiness and loudness:

clustering_function_output <- clustering_function(
  songs_data = track_features_scaled_full,
  high_features = c("danceability", "valence"),
  low_features = c("speechiness", "loudness"),
  min_number_of_songs = 15
)

A visually cleaned up version of the R output looks like this:

Playlist 1 Score
danceability	valence	speechiness	loudness	playlist_score	size	iteration
79.27126	81.8576	12.44106	27.25764	60.71508	15	9

Playlist 1 Songs
track.name	uri
Can You Get To That	spotify:track:6QUngYwZ65et2ye7Bj85EK
King James	spotify:track:5ri4b7YQp2PWn8tl3MRYgE
Lady and Man	spotify:track:0tjTndnyFm1xQsaHGf2imW
untitled 08 | 09.06.2014.	spotify:track:5bBUDJUfGcG7eFy3Bf4fXv
SPEEDBOAT	spotify:track:2FTOLKjQUswhpdMFq15Raf
Ruff Hysteria	spotify:track:5cJnjUiWhTTBwT4Fo2J7rM

Playlist 2 Score
danceability	valence	speechiness	loudness	playlist_score	size	iteration
78.43286	80.68804	11.93591	27.58975	59.79763	16	3

Playlist 2 Songs
track.name	uri
Can You Get To That	spotify:track:6QUngYwZ65et2ye7Bj85EK
King James	spotify:track:5ri4b7YQp2PWn8tl3MRYgE
Lady and Man	spotify:track:0tjTndnyFm1xQsaHGf2imW
untitled 08 | 09.06.2014.	spotify:track:5bBUDJUfGcG7eFy3Bf4fXv
SPEEDBOAT	spotify:track:2FTOLKjQUswhpdMFq15Raf
Ruff Hysteria	spotify:track:5cJnjUiWhTTBwT4Fo2J7rM

Playlist 3 Score
danceability	valence	speechiness	loudness	playlist_score	size	iteration
75.88844	78.40915	11.53143	26.42675	58.1697	18	4

Playlist 3 Songs
track.name	uri
Can You Get To That	spotify:track:6QUngYwZ65et2ye7Bj85EK
King James	spotify:track:5ri4b7YQp2PWn8tl3MRYgE
Lady and Man	spotify:track:0tjTndnyFm1xQsaHGf2imW
Everybody Loves The Sunshine	spotify:track:5le4sn0iMcnKU56bdmNzso
untitled 08 | 09.06.2014.	spotify:track:5bBUDJUfGcG7eFy3Bf4fXv
SPEEDBOAT	spotify:track:2FTOLKjQUswhpdMFq15Raf

Note that only the first 6 songs of each playlist are shown here for readability. The decision to output the 3 clusters with the highest playlist scores was made to give the user some choice. Even though one cluster may have a higher playlist score, the user may like the songs in a different cluster more.

Step 3

The final step is to add the songs from our chosen cluster to a playlist. In this case I chose the first cluster and then used the put_selected_songs_in_a_playlist function to make a new playlist called “Spotify Clustering Demo” and add all of the songs from the cluster directly into it.

chosen_playlist <- clustering_function_output$playlist_1_songs

put_selected_songs_in_a_playlist(user_id = my_id, chosen_playlst_df =  chosen_playlist, name_of_the_new_playlist = "Spotify Clustering Demo")

Just like that we now have a new playlist tailored to our preferences.

Let's compare the selected audio features of songs from our new playlist to the songs in our original song pool.

songs_pool_means <- colMeans(dplyr::select(track_features_scaled_full, danceability,valence, speechiness,loudness))

combined_dfs <- bind_rows(clustering_function_output$playlist_1_features[ ,-c(5,6,7)],songs_pool_means)

combined_dfs$type <- c("new_playlist", "song_pool")

combined_dfs_gathered <- gather(combined_dfs, key = "feature", value = "score", -type)

combined_dfs_gathered$feature <- factor(combined_dfs_gathered$feature, 
                                     levels = c("danceability", "valence", "speechiness","loudness"))

ggplot(combined_dfs_gathered, aes(x = feature, y = score, fill = type)) +
  geom_col(position = "dodge")

As expected, our new playlist is higher in danceability/valence and is lower in speechiness/loudness than the original song pool.

In my experience the quality of playlists produced by the script is good, but a cool addition to the script could be to also scrape song lyrics using the Genius API, so that the lyrical content of songs can also be taken into account when making playlists.

Andrii Grygoryshyn
