Spotify Tracks Clustering with KMeans

Introduction

For many people, listening to music is a favorite leisure activity, and some even call it a hobby. One of the most popular platforms for listening to music is Spotify. Spotify has a recommendation system that tailors music to each user’s taste. In this project, we will try to build a simple music recommendation based on a Spotify dataset. We will use the K-Means clustering algorithm, which belongs to the unsupervised learning family of machine learning methods.

In this project, we will use these libraries:

library(dplyr)
library(GGally)
library(inspectdf)
library(ggiraphExtra)
library(factoextra)

About K-Means Clustering

Clustering refers to the practice of finding meaningful ways to group data (or create subgroups) within a dataset - and the resulting groups are usually called clusters. The objective is to have a number of partitions where the observations that fall into each partition are similar to others in that group, while the partitions are distinctive from one another.

K-means is a centroid-based clustering algorithm that follows a simple procedure of classifying a given dataset into a pre-determined number of clusters, denoted as “k”. This procedure is essentially a series of iterations where we:

Find cluster centers
Compute the distance from each point to each cluster center
Assign / re-assign cluster membership
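
The three steps above can be sketched in a few lines of base R. This is only an illustrative sketch under simple assumptions (random initial centers, squared Euclidean distance, made-up toy data); simple_kmeans() is a hypothetical helper written for this explanation, and the rest of the project uses the built-in kmeans() function instead.

```r
# Minimal k-means loop on a numeric matrix.
# simple_kmeans() is a hypothetical helper, not part of any package.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k randomly chosen observations as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))

  for (iter in seq_len(max_iter)) {
    # Compute the squared Euclidean distance from every point to every center
    dists <- vapply(seq_len(k), function(j) {
      rowSums(sweep(x, 2, centers[j, ])^2)
    }, numeric(nrow(x)))

    # Assign / re-assign each point to its nearest center
    new_cluster <- max.col(-dists, ties.method = "first")
    if (all(new_cluster == cluster)) break  # memberships are stable
    cluster <- new_cluster

    # Find the new cluster centers as the mean of each cluster's members
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers, iter = iter)
}

# Toy data: two made-up, well-separated groups of 20 points each
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

The loop stops as soon as no observation changes cluster, which is the same stopping idea as the "stable cluster" reported by the iter element of a kmeans() fit.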

Inspect Data

For this project, we will use the Spotify Tracks Database dataset from Kaggle. The data was obtained from the Spotify API in 2019. It contains approximately 10,000 tracks per genre across 26 genres, for a total of 232,725 tracks.

spotify <- read.csv("SpotifyFeatures.csv", stringsAsFactors = TRUE)
head(spotify)

##   ï..genre        artist_name                       track_name               track_id popularity acousticness danceability duration_ms energy instrumentalness key liveness loudness  mode speechiness   tempo time_signature valence
## 1    Movie     Henri Salvador      C'est beau de faire un Show 0BRjO6ga9RKCKjfDqeFgWV          0        0.611        0.389       99373 0.9100            0.000  C#   0.3460   -1.828 Major      0.0525 166.969            4/4   0.814
## 2    Movie Martin & les fÃ©es Perdu d'avance (par Gad Elmaleh) 0BjC1NfoEOOusryehmNudP          1        0.246        0.590      137373 0.7370            0.000  F#   0.1510   -5.559 Minor      0.0868 174.003            4/4   0.816
## 3    Movie    Joseph Williams   Don't Let Me Be Lonely Tonight 0CoSDzoNIKCRs124s9uTVy          3        0.952        0.663      170267 0.1310            0.000   C   0.1030  -13.879 Minor      0.0362  99.488            5/4   0.368
## 4    Movie     Henri Salvador   Dis-moi Monsieur Gordon Cooper 0Gc6TVm52BwZD07Ki6tIvf          0        0.703        0.240      152427 0.3260            0.000  C#   0.0985  -12.178 Major      0.0395 171.758            4/4   0.227
## 5    Movie       Fabien Nataf                        Ouverture 0IuslXpMROHdEPvSl1fTQK          4        0.950        0.331       82625 0.2250            0.123   F   0.2020  -21.150 Major      0.0456 140.576            4/4   0.390
## 6    Movie     Henri Salvador   Le petit souper aux chandelles 0Mf1jKa8eNAf1a4PwTbizj          0        0.749        0.578      160627 0.0948            0.000  C#   0.1070  -14.970 Major      0.1430  87.479            4/4   0.358

In this dataset, we have:

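ï..genre : Genre. The ï.. prefix is a byte-order-mark artifact from reading the CSV file; the column is renamed to genre during data cleaning.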
artist_name : Artist Name
track_name : Track Name
track_id : The Spotify ID for the track.
popularity : The popularity of the track.
acousticness : A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
danceability : Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
duration_ms : The duration of the track in milliseconds.
energy : Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
instrumentalness : Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”.
key : The key the track is in.
liveness : Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
loudness : The overall loudness of a track in decibels (dB).
mode : Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived.
speechiness : Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.
tempo : The overall estimated tempo of a track in beats per minute (BPM).
time_signature : An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
valence : A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.

Data Cleaning

First, we will adjust the data types and convert the track duration from milliseconds to minutes.

# Adjust Data Type
spotify <- spotify %>% 
  mutate(artist_name = as.character(artist_name),
         track_name = as.character(track_name),
         track_id = as.character(track_id),
         duration_ms = duration_ms/60000) %>% 
  rename(genre = ï..genre,
         duration_min = duration_ms)
head(spotify)

##   genre        artist_name                       track_name               track_id popularity acousticness danceability duration_min energy instrumentalness key liveness loudness  mode speechiness   tempo time_signature valence
## 1 Movie     Henri Salvador      C'est beau de faire un Show 0BRjO6ga9RKCKjfDqeFgWV          0        0.611        0.389     1.656217 0.9100            0.000  C#   0.3460   -1.828 Major      0.0525 166.969            4/4   0.814
## 2 Movie Martin & les fÃ©es Perdu d'avance (par Gad Elmaleh) 0BjC1NfoEOOusryehmNudP          1        0.246        0.590     2.289550 0.7370            0.000  F#   0.1510   -5.559 Minor      0.0868 174.003            4/4   0.816
## 3 Movie    Joseph Williams   Don't Let Me Be Lonely Tonight 0CoSDzoNIKCRs124s9uTVy          3        0.952        0.663     2.837783 0.1310            0.000   C   0.1030  -13.879 Minor      0.0362  99.488            5/4   0.368
## 4 Movie     Henri Salvador   Dis-moi Monsieur Gordon Cooper 0Gc6TVm52BwZD07Ki6tIvf          0        0.703        0.240     2.540450 0.3260            0.000  C#   0.0985  -12.178 Major      0.0395 171.758            4/4   0.227
## 5 Movie       Fabien Nataf                        Ouverture 0IuslXpMROHdEPvSl1fTQK          4        0.950        0.331     1.377083 0.2250            0.123   F   0.2020  -21.150 Major      0.0456 140.576            4/4   0.390
## 6 Movie     Henri Salvador   Le petit souper aux chandelles 0Mf1jKa8eNAf1a4PwTbizj          0        0.749        0.578     2.677117 0.0948            0.000  C#   0.1070  -14.970 Major      0.1430  87.479            4/4   0.358

Then, we will check whether there are any missing values in our dataset.

# Is there any missing value?
colSums(is.na(spotify))

##            genre      artist_name       track_name         track_id       popularity     acousticness     danceability     duration_min           energy instrumentalness              key         liveness         loudness             mode      speechiness            tempo   time_signature          valence 
##                0                0                0                0                0                0                0                0                0                0                0                0                0                0                0                0                0                0

As we can see, our dataset doesn’t have any missing values.

Last, we will use the track_id column as row names because it identifies each track. Before doing so, we remove rows with a duplicated track_id, since row names must be unique. Then we keep only the numeric variables, because k-means clustering groups observations by distance, which is only defined for numeric values.

# Remove duplicated data 
spotify_clean <- spotify[!duplicated(spotify$track_id),]

# Assign track_id into rownames
rownames(spotify_clean) <- spotify_clean$track_id

# Filter only numeric variables
spotify_clean <- spotify_clean %>% 
  select(where(is.numeric))
head(spotify_clean)

##                        popularity acousticness danceability duration_min energy instrumentalness liveness loudness speechiness   tempo valence
## 0BRjO6ga9RKCKjfDqeFgWV          0        0.611        0.389     1.656217 0.9100            0.000   0.3460   -1.828      0.0525 166.969   0.814
## 0BjC1NfoEOOusryehmNudP          1        0.246        0.590     2.289550 0.7370            0.000   0.1510   -5.559      0.0868 174.003   0.816
## 0CoSDzoNIKCRs124s9uTVy          3        0.952        0.663     2.837783 0.1310            0.000   0.1030  -13.879      0.0362  99.488   0.368
## 0Gc6TVm52BwZD07Ki6tIvf          0        0.703        0.240     2.540450 0.3260            0.000   0.0985  -12.178      0.0395 171.758   0.227
## 0IuslXpMROHdEPvSl1fTQK          4        0.950        0.331     1.377083 0.2250            0.123   0.2020  -21.150      0.0456 140.576   0.390
## 0Mf1jKa8eNAf1a4PwTbizj          0        0.749        0.578     2.677117 0.0948            0.000   0.1070  -14.970      0.1430  87.479   0.358

Exploratory Data Analysis

Correlation Matrix

spotify_clean %>% ggcorr(label = T)

From the correlation matrix, we found that some variables are strongly correlated with each other. The highest correlation is 0.8, between energy and loudness.

Data Distribution

spotify_clean %>% inspect_num() %>% show_plot()

From the histograms, we can observe that each variable has a different range, so we need to scale our dataset.

Data Preprocessing

We want to find the optimum number of clusters k, but the dataset is too large to plot. So, we will sample the data to reduce its size, randomly choosing 5% of the rows.

RNGkind(sample.kind = "Rounding")
set.seed(205)

index <- sample(x = nrow(spotify_clean), size = nrow(spotify_clean)*0.05)
spotify_red <- spotify_clean[index,]

Because K-Means clustering is distance-based and our variables are measured on different scales, we must scale the dataset first.

# Scaling the data
spotify_scale_red <- scale(spotify_red)
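
To make the scaling step concrete, here is a small sketch of what scale() does with its default arguments: each column is centered on its mean and divided by its standard deviation. The toy values are made up for illustration and are not taken from the Spotify data.

```r
# Toy columns on very different ranges (made-up values)
toy <- data.frame(loudness = c(-20, -10, -5),
                  tempo    = c(90, 120, 150))

# scale() subtracts the column mean and divides by the column standard deviation
z_auto <- scale(toy)

# The same z-scores computed by hand
z_manual <- sapply(toy, function(col) (col - mean(col)) / sd(col))

all.equal(as.numeric(z_auto), as.numeric(z_manual))
colMeans(z_auto)       # approximately 0 for every column
apply(z_auto, 2, sd)   # 1 for every column
```

After scaling, a one-unit difference means "one standard deviation" in every variable, so a wide-ranging variable such as tempo no longer dominates the distance calculation.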

Determine the Optimal Number of Clusters

We will determine the optimal number of clusters with the fviz_nbclust() function from the factoextra package.

# Elbow method
fviz_nbclust(x = spotify_scale_red,
             FUNcluster = kmeans,
             method = "wss")

# Silhouette method
fviz_nbclust(x = spotify_scale_red,
             FUNcluster = kmeans,
             method = "silhouette")

According to the elbow and silhouette plots, the optimal number of clusters for our dataset is 3.
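
For readers who want to see what the elbow plot is built from, the sketch below computes the total within-cluster sum of squares for several values of k using only base R. It runs on made-up toy data with three obvious groups, not on the Spotify sample, and the range 1 to 8 and nstart = 10 are arbitrary choices for this illustration.

```r
set.seed(42)
# Three made-up, well-separated blobs in two dimensions (30 points each)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 6), ncol = 2),
             matrix(rnorm(60, mean = 12), ncol = 2))

# Total within-cluster sum of squares for k = 1..8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which adding clusters stops reducing WSS much
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On data like this the curve drops steeply up to the true number of groups and flattens afterwards; on real data the bend is usually less sharp, which is why the silhouette method is a useful second opinion.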

K-Means Clustering for Reduced Dataset

Model Fitting

# k-means with 3 clusters
RNGkind(sample.kind = "Rounding")
set.seed(100)

spotify_kmeans_red <- kmeans(x = spotify_scale_red, centers = 3)

K-Means Output:

The number of observations per cluster

# The number of observations per cluster
spotify_kmeans_red$size

## [1] 2296 6015  527

Location of the center of the cluster / centroid, commonly used for profiling clusters

# Location of the center of the cluster / centroid
spotify_kmeans_red$centers

##   popularity acousticness danceability duration_min     energy instrumentalness   liveness   loudness speechiness      tempo    valence
## 1 -0.5374929    1.1455026   -0.9752292   0.06026724 -1.2695006        0.8892772 -0.2716888 -1.2249279  -0.3679547 -0.3593301 -0.8785874
## 2  0.2845410   -0.5279572    0.3623791  -0.01924507  0.4532126       -0.2934957 -0.1069123  0.4952637  -0.1740382  0.1928593  0.3468601
## 3 -0.9059396    1.0352723    0.1127438  -0.04291174  0.3580642       -0.5244853  2.4039376 -0.3160846   3.5894944 -0.6357250 -0.1311705

Cluster label for each observation

# clustering output (Cluster label for each observation)
head(spotify_kmeans_red$cluster)

## 1gZisUSub74637T2rJmimh 0n0HybfiBU3YDQNVtWugtm 4zag02jtwumRcnPdYYr0Do 2p4jhU7DOjYcN0W2o1PBM6 06peZfvxR5721oGqHwogha 3d5GNCBqQqXd3stDPPY5FO 
##                      2                      2                      2                      3                      2                      2

The number of repetitions (iterations) of the k-means algorithm until a stable cluster is produced

# The number of repetitions (iterations) of the k-means algorithm until a stable cluster is produced
spotify_kmeans_red$iter

## [1] 3

Goodness of Fit

# wss check 
spotify_kmeans_red$withinss

## [1] 20219.515 37053.428  3844.882

# bss/tss check 
spotify_kmeans_red$betweenss/spotify_kmeans_red$totss

## [1] 0.3712611
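
The ratio above relies on the decomposition TSS = WSS + BSS. As a sketch on made-up toy data (not the Spotify sample), we can recompute each piece by hand and compare it with what kmeans() reports:

```r
set.seed(7)
toy <- scale(matrix(rnorm(200), ncol = 4))   # made-up data, 50 rows x 4 columns
fit <- kmeans(toy, centers = 3, nstart = 5)

# Total sum of squares: squared distance of every point to the overall mean
tss <- sum(sweep(toy, 2, colMeans(toy))^2)

# Within sum of squares: squared distance of every point to its own centroid
wss <- sum((toy - fit$centers[fit$cluster, ])^2)

# Between sum of squares is whatever the clustering explains
bss <- tss - wss

all.equal(tss, fit$totss)
all.equal(wss, fit$tot.withinss)
all.equal(bss, fit$betweenss)
bss / tss   # share of total variation explained by the clustering
```

A ratio closer to 1 means the clusters are compact relative to the overall spread of the data; the ratio also tends to rise as k increases, so it should be read together with the chosen number of clusters.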

Profiling

# Assign cluster column into the dataset
spotify_red$cluster <- spotify_kmeans_red$cluster


# Profiling with summarised data
spotify_red %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

## # A tibble: 3 x 12
##   cluster popularity acousticness danceability duration_min energy instrumentalness liveness loudness speechiness tempo valence
##     <int>      <dbl>        <dbl>        <dbl>        <dbl>  <dbl>            <dbl>    <dbl>    <dbl>       <dbl> <dbl>   <dbl>
## 1       1       26.7        0.830        0.357         4.02  0.206          0.455      0.169   -18.1       0.0511 106.    0.215
## 2       2       41.0        0.215        0.609         3.87  0.679          0.0762     0.205    -7.03      0.0910 124.    0.542
## 3       3       20.3        0.789        0.562         3.82  0.653          0.00222    0.747   -12.2       0.866   97.9   0.414

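# Plot against the scaled matrix used for clustering
# (spotify_red now also contains the cluster label, which should not enter the plot's PCA)
fviz_cluster(object = spotify_kmeans_red,
             data = spotify_scale_red, labelsize = 1)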

ggRadar(
  data=spotify_red,
  mapping = aes(colours = cluster),
  interactive = T
)

K-Means Clustering for Full Dataset

If we assume that the 5% sample is representative of the full data, we can apply k-means clustering with the same k to the full dataset.

spotify_scale <- scale(spotify_clean)

Model Fitting

# k-means with 3 clusters
RNGkind(sample.kind = "Rounding")
set.seed(100)

spotify_kmeans <- kmeans(x = spotify_scale, centers = 3)

K-Means Output:

The number of observations per cluster

# The number of observations per cluster
spotify_kmeans$size

## [1]  10253 121677  44844

Location of the center of the cluster / centroid, commonly used for profiling clusters

# Location of the center of the cluster / centroid
spotify_kmeans$centers

##   popularity acousticness danceability duration_min     energy instrumentalness   liveness   loudness speechiness      tempo    valence
## 1 -0.9161329    1.0606121    0.1090624   0.06128842  0.3699452       -0.5282455  2.3887044 -0.2960337   3.6231110 -0.6073367 -0.1333129
## 2  0.2704178   -0.5239985    0.3546318  -0.03156168  0.4449938       -0.2926363 -0.1002277  0.4842328  -0.1676415  0.1888458  0.3382422
## 3 -0.5242733    1.1792906   -0.9871723   0.07162475 -1.2920026        0.9147983 -0.2741945 -1.2462037  -0.3735091 -0.3735431 -0.8872857
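
Cluster numbers are arbitrary labels, so cluster 1 of the sample does not have to be cluster 1 of the full data. As a rough sketch, we can match the two sets of centroids by nearest distance. The values below are copied from the two centroid outputs above (first three columns only, to keep the example short); the two fits were scaled separately, so the comparison is only approximate.

```r
# Centroids from the reduced-sample fit (popularity, acousticness, danceability)
centers_red <- rbind(c(-0.5374929,  1.1455026, -0.9752292),
                     c( 0.2845410, -0.5279572,  0.3623791),
                     c(-0.9059396,  1.0352723,  0.1127438))

# Centroids from the full-data fit, same three columns
centers_full <- rbind(c(-0.9161329,  1.0606121,  0.1090624),
                      c( 0.2704178, -0.5239985,  0.3546318),
                      c(-0.5242733,  1.1792906, -0.9871723))

# Euclidean distance between every reduced centroid (rows) and full centroid (columns)
d <- as.matrix(dist(rbind(centers_red, centers_full)))[1:3, 4:6]

# For each reduced-sample cluster, the index of the nearest full-data cluster
match_full <- apply(d, 1, which.min)
match_full
```

On these three columns, sample clusters 1, 2, and 3 line up with full-data clusters 3, 2, and 1 respectively, which agrees with the swapped sizes and profiles of clusters 1 and 3 in the two fits.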

Cluster label for each observation

# clustering output (Cluster label for each observation)
head(spotify_kmeans$cluster)

## 0BRjO6ga9RKCKjfDqeFgWV 0BjC1NfoEOOusryehmNudP 0CoSDzoNIKCRs124s9uTVy 0Gc6TVm52BwZD07Ki6tIvf 0IuslXpMROHdEPvSl1fTQK 0Mf1jKa8eNAf1a4PwTbizj 
##                      2                      2                      3                      3                      3                      3

The number of repetitions (iterations) of the k-means algorithm until a stable cluster is produced

# The number of repetitions (iterations) of the k-means algorithm until a stable cluster is produced
spotify_kmeans$iter

## [1] 4

Goodness of Fit

#check wss
spotify_kmeans$withinss

## [1]  98955.25 743993.28 383516.15

#check bss/tss
spotify_kmeans$betweenss/spotify_kmeans$totss

## [1] 0.3692657

Profiling

# Assign cluster column into the dataset
spotify_clean$cluster <- spotify_kmeans$cluster


# Profiling with summarised data
spotify_clean %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

## # A tibble: 3 x 12
##   cluster popularity acousticness danceability duration_min energy instrumentalness liveness loudness speechiness tempo valence
##     <int>      <dbl>        <dbl>        <dbl>        <dbl>  <dbl>            <dbl>    <dbl>    <dbl>       <dbl> <dbl>   <dbl>
## 1       1       20.3        0.793        0.562         4.07  0.659          0.00148    0.729   -12.0       0.868   98.2   0.416
## 2       2       41.0        0.212        0.609         3.87  0.680          0.0776     0.203    -7.04      0.0931 123.    0.542
## 3       3       27.2        0.836        0.353         4.09  0.201          0.467      0.167   -18.1       0.0511 106.    0.214

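# Plot against the scaled matrix used for clustering
# (spotify_clean now also contains the cluster label, which should not enter the plot's PCA)
fviz_cluster(object = spotify_kmeans,
             data = spotify_scale, labelsize = 1)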

ggRadar(
  data=spotify_clean,
  mapping = aes(colours = cluster),
  interactive = T
)

Characteristics of Clusters

Cluster 1 : Highest speechiness and liveness. Lowest tempo, popularity, and instrumentalness.
Cluster 2 : Highest energy, loudness, danceability, valence, tempo, and popularity. Lowest acousticness.
Cluster 3 : Highest acousticness and instrumentalness. Lowest speechiness, valence, danceability, energy, liveness, and loudness.

How the Spotify Track Recommendation Works

First, we will join our numeric dataset back to the original spotify dataset so that it has the track_name, artist_name, and genre columns.

# Move the row names back into a track_id column
spotify_clean$track_id <- rownames(spotify_clean)
rownames(spotify_clean) <- NULL
# Combine dataset
spotify_track <- spotify %>% 
  select(track_id, track_name, artist_name, genre) %>% 
  left_join(spotify_clean, by = "track_id")
head(spotify_track)

##                 track_id                       track_name        artist_name genre popularity acousticness danceability duration_min energy instrumentalness liveness loudness speechiness   tempo valence cluster
## 1 0BRjO6ga9RKCKjfDqeFgWV      C'est beau de faire un Show     Henri Salvador Movie          0        0.611        0.389     1.656217 0.9100            0.000   0.3460   -1.828      0.0525 166.969   0.814       2
## 2 0BjC1NfoEOOusryehmNudP Perdu d'avance (par Gad Elmaleh) Martin & les fÃ©es Movie          1        0.246        0.590     2.289550 0.7370            0.000   0.1510   -5.559      0.0868 174.003   0.816       2
## 3 0CoSDzoNIKCRs124s9uTVy   Don't Let Me Be Lonely Tonight    Joseph Williams Movie          3        0.952        0.663     2.837783 0.1310            0.000   0.1030  -13.879      0.0362  99.488   0.368       3
## 4 0Gc6TVm52BwZD07Ki6tIvf   Dis-moi Monsieur Gordon Cooper     Henri Salvador Movie          0        0.703        0.240     2.540450 0.3260            0.000   0.0985  -12.178      0.0395 171.758   0.227       3
## 5 0IuslXpMROHdEPvSl1fTQK                        Ouverture       Fabien Nataf Movie          4        0.950        0.331     1.377083 0.2250            0.123   0.2020  -21.150      0.0456 140.576   0.390       3
## 6 0Mf1jKa8eNAf1a4PwTbizj   Le petit souper aux chandelles     Henri Salvador Movie          0        0.749        0.578     2.677117 0.0948            0.000   0.1070  -14.970      0.1430  87.479   0.358       3

Now that the dataset is ready, we can try out the track recommendation. For example, suppose I like Ariana Grande’s song “Break Free”. First, we look up which cluster contains Ariana Grande - Break Free.

spotify_cluster <- spotify_track %>% 
  filter(track_name == "Break Free",
         artist_name == "Ariana Grande") %>%
  select(artist_name,track_name,cluster, genre) %>% 
  head(1)
spotify_cluster

##     artist_name track_name cluster genre
## 1 Ariana Grande Break Free       2 Dance

It turns out that Ariana Grande - Break Free is in cluster 2. Now, suppose we want to listen to Justin Bieber, but only to songs that fit our preference.

set.seed(100)
spotify_track %>% 
  filter(cluster == spotify_cluster$cluster, 
         artist_name=="Justin Bieber") %>%
  slice_sample(n = 3) %>% 
  select(-track_id) %>% 
  arrange(-popularity)

##             track_name   artist_name genre popularity acousticness danceability duration_min energy instrumentalness liveness loudness speechiness  tempo valence cluster
## 1  Stuck In The Moment Justin Bieber Dance         61        0.179        0.715     3.716000  0.690                0    0.131   -6.279      0.0531 89.991   0.469       2
## 2 Sorry - Latino Remix Justin Bieber Dance         61        0.162        0.667     3.666450  0.755                0    0.306   -3.773      0.0431 99.994   0.513       2
## 3           Hold Tight Justin Bieber Dance         53        0.287        0.456     4.236883  0.608                0    0.101   -5.815      0.3280 58.875   0.491       2

The output shows that the system recommends three Justin Bieber songs: Stuck In The Moment, Sorry - Latino Remix, and Hold Tight.
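
The two steps above (look up the cluster of a liked track, then sample tracks by another artist from the same cluster) can be wrapped into one function. The sketch below is written in base R so it runs without extra packages; recommend_tracks() and the toy data frame are hypothetical and only illustrate the idea. It assumes a data frame with the same column names as spotify_track.

```r
# recommend_tracks() is a hypothetical helper, not part of the original analysis.
recommend_tracks <- function(tracks, liked_track, liked_artist,
                             target_artist, n = 3) {
  liked <- tracks[tracks$track_name == liked_track &
                    tracks$artist_name == liked_artist, ]
  if (nrow(liked) == 0) stop("Liked track not found in the dataset.")

  # Candidates: same cluster as the liked track, by the requested artist
  candidates <- tracks[tracks$cluster == liked$cluster[1] &
                         tracks$artist_name == target_artist, ]
  # A track can be listed once per genre, so keep one row per track_id
  candidates <- candidates[!duplicated(candidates$track_id), ]

  # Sample at most n candidates, then order them by popularity (descending)
  picked <- candidates[sample.int(nrow(candidates),
                                  min(n, nrow(candidates))), ]
  picked[order(-picked$popularity), ]
}

# Made-up toy data with the same column names as spotify_track
toy_tracks <- data.frame(
  track_id    = c("t1", "t2", "t3", "t4", "t5", "t5"),
  track_name  = c("Song A", "Song B", "Song C", "Song D", "Song E", "Song E"),
  artist_name = c("Artist X", rep("Artist Y", 5)),
  popularity  = c(70, 55, 61, 40, 48, 48),
  cluster     = c(2, 2, 2, 3, 2, 2),
  stringsAsFactors = FALSE
)

set.seed(100)
recs <- recommend_tracks(toy_tracks, "Song A", "Artist X", "Artist Y", n = 2)
recs
```

With the real data, the equivalent call would be recommend_tracks(spotify_track, "Break Free", "Ariana Grande", "Justin Bieber"). Removing duplicated track_id values inside the function avoids recommending the same song twice when it is listed under several genres.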

Conclusion

The optimum number of clusters based on the elbow method and the silhouette method is 3.
The reduced dataset generalizes to the full dataset: \(\frac{BSS}{TSS}\) barely changes between the two fits (0.371 for the sample and 0.369 for the full data), staying at around 37%.
From the k-means profiling for the full dataset, we can conclude that:
- For cluster 1, we can call it the Live Music cluster. It has the highest liveness and speechiness. Higher liveness values represent a higher probability that the track was performed live, so the songs in this cluster have a live-music ambience.
- For cluster 2, we can call it the Dance Music cluster. It has the highest danceability, valence, tempo, energy, and loudness, so it fits well if we want to dance. It also has the highest average popularity of the three clusters.
- For cluster 3, we can call it the Music for Study cluster. It has the highest acousticness and instrumentalness. It also has the lowest speechiness, so it is suitable for studying because the music is less likely to distract us.