Spotify Tracks Clustering with KMeans
Introduction
For many people, listening to music is a regular leisure activity, and some even call it a hobby. One of the most popular platforms for listening to music is Spotify. Spotify has a recommendation system that tailors music to each user’s taste. In this project, we will try to build a music recommendation based on a Spotify dataset. We will use the K-Means clustering algorithm, which belongs to the unsupervised learning branch of machine learning.
In this project, we will use these libraries:
library(dplyr)
library(GGally)
library(inspectdf)
library(ggiraphExtra)
library(factoextra)
About K-Means Clustering
Clustering refers to the practice of finding meaningful ways to group data (or create subgroups) within a dataset - and the resulting groups are usually called clusters. The objective is to have a number of partitions where the observations that fall into each partition are similar to others in that group, while the partitions are distinctive from one another.
K-means is a centroid-based clustering algorithm that follows a simple procedure for partitioning a given dataset into a pre-determined number of clusters, denoted as “k”. This procedure is essentially a series of iterations where we:
- Find cluster centers
- Compute the distance from each point to each cluster center
- Assign / re-assign cluster membership
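The loop described above can be sketched in a few lines of base R. This is only a minimal illustration of the idea, not the implementation behind R's kmeans() (which defaults to the Hartigan-Wong algorithm). The helper simple_kmeans() and the toy data are hypothetical names introduced here for the example.

```r
# Minimal k-means sketch on a numeric matrix.
# simple_kmeans() is a hypothetical helper, not part of any package.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k distinct rows chosen at random as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every center (n x k matrix)
    d <- vapply(seq_len(k),
                function(j) rowSums(sweep(x, 2, centers[j, ])^2),
                numeric(nrow(x)))
    # Assign each point to its nearest center
    new_cluster <- max.col(-d, ties.method = "first")
    if (all(new_cluster == cluster)) break  # memberships are stable, stop
    cluster <- new_cluster
    # Recompute each center as the mean of its current members
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers, iter = i)
}

# Toy data (made up): two blobs of 20 points each in two dimensions
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

Because the starting centers are random, different seeds can give different cluster numbers for the same groups.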
Inspect Data
For this project, we will use the Spotify Tracks database from Kaggle. This dataset was obtained from the Spotify API in 2019. It contains approximately 10,000 tracks per genre across 26 genres, for a total of 232,725 tracks.
spotify <- read.csv("SpotifyFeatures.csv",stringsAsFactors = T)
head(spotify)
## ï..genre artist_name track_name track_id popularity acousticness danceability duration_ms energy instrumentalness key liveness loudness mode speechiness tempo time_signature valence
## 1 Movie Henri Salvador C'est beau de faire un Show 0BRjO6ga9RKCKjfDqeFgWV 0 0.611 0.389 99373 0.9100 0.000 C# 0.3460 -1.828 Major 0.0525 166.969 4/4 0.814
## 2 Movie Martin & les fées Perdu d'avance (par Gad Elmaleh) 0BjC1NfoEOOusryehmNudP 1 0.246 0.590 137373 0.7370 0.000 F# 0.1510 -5.559 Minor 0.0868 174.003 4/4 0.816
## 3 Movie Joseph Williams Don't Let Me Be Lonely Tonight 0CoSDzoNIKCRs124s9uTVy 3 0.952 0.663 170267 0.1310 0.000 C 0.1030 -13.879 Minor 0.0362 99.488 5/4 0.368
## 4 Movie Henri Salvador Dis-moi Monsieur Gordon Cooper 0Gc6TVm52BwZD07Ki6tIvf 0 0.703 0.240 152427 0.3260 0.000 C# 0.0985 -12.178 Major 0.0395 171.758 4/4 0.227
## 5 Movie Fabien Nataf Ouverture 0IuslXpMROHdEPvSl1fTQK 4 0.950 0.331 82625 0.2250 0.123 F 0.2020 -21.150 Major 0.0456 140.576 4/4 0.390
## 6 Movie Henri Salvador Le petit souper aux chandelles 0Mf1jKa8eNAf1a4PwTbizj 0 0.749 0.578 160627 0.0948 0.000 C# 0.1070 -14.970 Major 0.1430 87.479 4/4 0.358
In this dataset, we have:
- ï..genre: Genre
- artist_name: Artist Name
- track_name: Track Name
- track_id: The Spotify ID for the track.
- popularity: The popularity for the track.
- acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- duration_ms: The duration of the track in milliseconds.
- energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
- instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”.
- key: The key the track is in.
- liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
- loudness: The overall loudness of a track in decibels (dB).
- mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived.
- speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.
- tempo: The overall estimated tempo of a track in beats per minute (BPM).
- time_signature: An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
- valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.
Data Cleaning
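First, we will adjust the data types, convert the track duration from milliseconds to minutes, and rename the affected columns.
# Adjust data types, convert duration to minutes, and rename columns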
spotify <- spotify %>%
mutate(artist_name = as.character(artist_name),
track_name = as.character(track_name),
track_id = as.character(track_id),
duration_ms = duration_ms/60000) %>%
rename(genre = ï..genre,
duration_min = duration_ms)
head(spotify)
## genre artist_name track_name track_id popularity acousticness danceability duration_min energy instrumentalness key liveness loudness mode speechiness tempo time_signature valence
## 1 Movie Henri Salvador C'est beau de faire un Show 0BRjO6ga9RKCKjfDqeFgWV 0 0.611 0.389 1.656217 0.9100 0.000 C# 0.3460 -1.828 Major 0.0525 166.969 4/4 0.814
## 2 Movie Martin & les fées Perdu d'avance (par Gad Elmaleh) 0BjC1NfoEOOusryehmNudP 1 0.246 0.590 2.289550 0.7370 0.000 F# 0.1510 -5.559 Minor 0.0868 174.003 4/4 0.816
## 3 Movie Joseph Williams Don't Let Me Be Lonely Tonight 0CoSDzoNIKCRs124s9uTVy 3 0.952 0.663 2.837783 0.1310 0.000 C 0.1030 -13.879 Minor 0.0362 99.488 5/4 0.368
## 4 Movie Henri Salvador Dis-moi Monsieur Gordon Cooper 0Gc6TVm52BwZD07Ki6tIvf 0 0.703 0.240 2.540450 0.3260 0.000 C# 0.0985 -12.178 Major 0.0395 171.758 4/4 0.227
## 5 Movie Fabien Nataf Ouverture 0IuslXpMROHdEPvSl1fTQK 4 0.950 0.331 1.377083 0.2250 0.123 F 0.2020 -21.150 Major 0.0456 140.576 4/4 0.390
## 6 Movie Henri Salvador Le petit souper aux chandelles 0Mf1jKa8eNAf1a4PwTbizj 0 0.749 0.578 2.677117 0.0948 0.000 C# 0.1070 -14.970 Major 0.1430 87.479 4/4 0.358
Then, we will check whether there are any missing values in our dataset.
# Are there any missing values?
colSums(is.na(spotify))
## genre artist_name track_name track_id popularity acousticness danceability duration_min energy instrumentalness key liveness loudness mode speechiness tempo time_signature valence
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
As we can see, our dataset has no missing values.
Last, we will assign the track_id column to the row names because it uniquely identifies each track. Before doing so, we remove duplicated rows, since row names must be unique. Then we keep only the numeric variables, because k-means measures similarity with a distance that is defined on numeric values.
# Remove duplicated data
spotify_clean <- spotify[!duplicated(spotify$track_id),]
# Assign track_id into rownames
rownames(spotify_clean) <- spotify_clean$track_id
# Filter only numeric variables
spotify_clean <- spotify_clean %>%
select(where(is.numeric))
head(spotify_clean)
## popularity acousticness danceability duration_min energy instrumentalness liveness loudness speechiness tempo valence
## 0BRjO6ga9RKCKjfDqeFgWV 0 0.611 0.389 1.656217 0.9100 0.000 0.3460 -1.828 0.0525 166.969 0.814
## 0BjC1NfoEOOusryehmNudP 1 0.246 0.590 2.289550 0.7370 0.000 0.1510 -5.559 0.0868 174.003 0.816
## 0CoSDzoNIKCRs124s9uTVy 3 0.952 0.663 2.837783 0.1310 0.000 0.1030 -13.879 0.0362 99.488 0.368
## 0Gc6TVm52BwZD07Ki6tIvf 0 0.703 0.240 2.540450 0.3260 0.000 0.0985 -12.178 0.0395 171.758 0.227
## 0IuslXpMROHdEPvSl1fTQK 4 0.950 0.331 1.377083 0.2250 0.123 0.2020 -21.150 0.0456 140.576 0.390
## 0Mf1jKa8eNAf1a4PwTbizj 0 0.749 0.578 2.677117 0.0948 0.000 0.1070 -14.970 0.1430 87.479 0.358
Exploratory Data Analysis
Correlation Matrix
spotify_clean %>% ggcorr(label = TRUE)
From the correlation matrix, we find that some variables are strongly correlated with each other. The highest correlation is 0.8, between energy and loudness.
Data Distribution
spotify_clean %>% inspect_num() %>% show_plot()
From the histograms, we can observe that the variables have very different ranges, so we need to scale our dataset.
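To see why differing ranges matter for a distance-based method, here is a small sketch on made-up values (the toy data frame is an assumption, not taken from the dataset). It shows that scale() centers each column to mean 0 and rescales it to standard deviation 1, so that no single variable dominates the distances.

```r
# Toy data (made up): two features on very different scales
toy <- data.frame(tempo   = c(90, 120, 150, 180),
                  valence = c(0.1, 0.9, 0.2, 0.8))

# Unscaled Euclidean distances are driven almost entirely by tempo
round(dist(toy), 2)

# scale() subtracts the column mean and divides by the column standard deviation
toy_scaled <- scale(toy)
manual <- sweep(sweep(as.matrix(toy), 2, colMeans(toy)), 2, sapply(toy, sd), "/")
all.equal(toy_scaled, manual, check.attributes = FALSE)

# After scaling, both features contribute on a comparable footing
round(dist(toy_scaled), 2)
```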
Data Preprocessing
We want to find the optimal number of clusters k, but the dataset is too large to plot. So we will sample the data to reduce its size, randomly choosing 5% of the rows.
RNGkind(sample.kind = "Rounding")
set.seed(205)
index <- sample(x = nrow(spotify_clean), size = nrow(spotify_clean)*0.05)
spotify_red <- spotify_clean[index,]
Because K-Means clustering is distance-based and our variables have different ranges, we must scale the dataset first.
# Scaling the data
spotify_scale_red <- scale(spotify_red)
Determine Optimal Cluster
We will determine the optimal number of clusters with the fviz_nbclust() function from the factoextra package.
# Elbow method
fviz_nbclust(x = spotify_scale_red,
FUNcluster = kmeans,
method = 'wss'
)
# Silhouette method
fviz_nbclust(spotify_scale_red,
kmeans,
method = "silhouette")
According to the elbow and silhouette plots, the optimal number of clusters for our dataset is 3.
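For readers who want to see what the elbow method computes, the sketch below fits k-means for several values of k and records the total within-cluster sum of squares. It runs on made-up data standing in for spotify_scale_red, and nstart = 10 is an assumption chosen here to reduce the effect of random starts.

```r
# Toy stand-in for the scaled data: three made-up blobs of 50 points each
set.seed(42)
toy <- scale(rbind(matrix(rnorm(100, mean = 0), ncol = 2),
                   matrix(rnorm(100, mean = 4), ncol = 2),
                   matrix(rnorm(100, mean = 8), ncol = 2)))

# Total within-cluster sum of squares for k = 1..8
ks  <- 1:8
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# Plotting wss against ks gives the elbow curve; the "elbow" is the k after
# which adding another cluster no longer reduces wss by much
print(data.frame(k = ks, wss = round(wss, 1)))
```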
K-Means Clustering for Reduced Dataset
Model Fitting
# k-means with 3 clusters
RNGkind(sample.kind = "Rounding")
set.seed(100)
spotify_kmeans_red <- kmeans(x = spotify_scale_red, centers = 3)
K-Means Output:
- The number of observations per cluster
# The number of observations per cluster
spotify_kmeans_red$size
## [1] 2296 6015 527
- Location of the center of the cluster / centroid, commonly used for profiling clusters
# Location of the center of the cluster / centroid
spotify_kmeans_red$centers
## popularity acousticness danceability duration_min energy instrumentalness liveness loudness speechiness tempo valence
## 1 -0.5374929 1.1455026 -0.9752292 0.06026724 -1.2695006 0.8892772 -0.2716888 -1.2249279 -0.3679547 -0.3593301 -0.8785874
## 2 0.2845410 -0.5279572 0.3623791 -0.01924507 0.4532126 -0.2934957 -0.1069123 0.4952637 -0.1740382 0.1928593 0.3468601
## 3 -0.9059396 1.0352723 0.1127438 -0.04291174 0.3580642 -0.5244853 2.4039376 -0.3160846 3.5894944 -0.6357250 -0.1311705
- Cluster label for each observation
# clustering output (Cluster label for each observation)
head(spotify_kmeans_red$cluster)
## 1gZisUSub74637T2rJmimh 0n0HybfiBU3YDQNVtWugtm 4zag02jtwumRcnPdYYr0Do 2p4jhU7DOjYcN0W2o1PBM6 06peZfvxR5721oGqHwogha 3d5GNCBqQqXd3stDPPY5FO
## 2 2 2 3 2 2
- The number of repetitions (iterations) of the k-means algorithm until a stable cluster is produced
# The number of repetitions (iterations) of the k-means algorithm until a stable cluster is produced
spotify_kmeans_red$iter
## [1] 3
Goodness of fit
# wss check
spotify_kmeans_red$withinss
## [1] 20219.515 37053.428 3844.882
# bss/tss check
spotify_kmeans_red$betweenss/spotify_kmeans_red$totss
## [1] 0.3712611
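The ratio above rests on the decomposition TSS = WSS + BSS: the total sum of squares splits into a within-cluster part and a between-cluster part, so a ratio closer to 1 means the clusters account for more of the variation. The sketch below checks this by hand on made-up data; the toy matrix and the seed are assumptions for illustration only.

```r
# Toy data (made up): 50 rows x 4 columns, scaled like the article's data
set.seed(100)
toy <- scale(matrix(rnorm(200), ncol = 4))
fit <- kmeans(toy, centers = 3)

# TSS: squared distances of all points from the overall mean
tss <- sum(sweep(toy, 2, colMeans(toy))^2)

# WSS: squared distances of each point from its own cluster center
wss <- sum((toy - fit$centers[fit$cluster, ])^2)

all.equal(tss, fit$totss)
all.equal(wss, fit$tot.withinss)

# BSS / TSS computed by hand, the same quantity as fit$betweenss / fit$totss
(tss - wss) / tss
```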
Profiling
# Assign cluster column into the dataset
spotify_red$cluster <- spotify_kmeans_red$cluster
# Profiling with summarised data
spotify_red %>%
group_by(cluster) %>%
summarise_all(mean)
## # A tibble: 3 x 12
## cluster popularity acousticness danceability duration_min energy instrumentalness liveness loudness speechiness tempo valence
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 26.7 0.830 0.357 4.02 0.206 0.455 0.169 -18.1 0.0511 106. 0.215
## 2 2 41.0 0.215 0.609 3.87 0.679 0.0762 0.205 -7.03 0.0910 124. 0.542
## 3 3 20.3 0.789 0.562 3.82 0.653 0.00222 0.747 -12.2 0.866 97.9 0.414
fviz_cluster(object = spotify_kmeans_red,
data = spotify_red, labelsize = 1)
ggRadar(
data=spotify_red,
mapping = aes(colours = cluster),
interactive = T
)
K-Means Clustering for Full Dataset
If we assume that the 5% sample generalizes to the full data, we can apply k-means clustering with the same number of clusters to the full dataset.
spotify_scale <- scale(spotify_clean)
Model Fitting
# k-means with 3 clusters
RNGkind(sample.kind = "Rounding")
set.seed(100)
spotify_kmeans <- kmeans(x = spotify_scale, centers = 3)
K-Means Output:
- The number of observations per cluster
# The number of observations per cluster
spotify_kmeans$size
## [1] 10253 121677 44844
- Location of the center of the cluster / centroid, commonly used for profiling clusters
# Location of the center of the cluster / centroid
spotify_kmeans$centers
## popularity acousticness danceability duration_min energy instrumentalness liveness loudness speechiness tempo valence
## 1 -0.9161329 1.0606121 0.1090624 0.06128842 0.3699452 -0.5282455 2.3887044 -0.2960337 3.6231110 -0.6073367 -0.1333129
## 2 0.2704178 -0.5239985 0.3546318 -0.03156168 0.4449938 -0.2926363 -0.1002277 0.4842328 -0.1676415 0.1888458 0.3382422
## 3 -0.5242733 1.1792906 -0.9871723 0.07162475 -1.2920026 0.9147983 -0.2741945 -1.2462037 -0.3735091 -0.3735431 -0.8872857
- Cluster label for each observation
# clustering output (Cluster label for each observation)
head(spotify_kmeans$cluster)
## 0BRjO6ga9RKCKjfDqeFgWV 0BjC1NfoEOOusryehmNudP 0CoSDzoNIKCRs124s9uTVy 0Gc6TVm52BwZD07Ki6tIvf 0IuslXpMROHdEPvSl1fTQK 0Mf1jKa8eNAf1a4PwTbizj
## 2 2 3 3 3 3
- The number of repetitions (iterations) of the k-means algorithm until a stable cluster is produced
# The number of repetitions (iterations) of the k-means algorithm until a stable cluster is produced
spotify_kmeans$iter
## [1] 4
Goodness of Fit
# wss check
spotify_kmeans$withinss
## [1] 98955.25 743993.28 383516.15
# bss/tss check
spotify_kmeans$betweenss/spotify_kmeans$totss
## [1] 0.3692657
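Note that cluster numbers are arbitrary labels: comparing the centroids printed for the reduced and full datasets, the same three groups appear under different numbers. A simple way to line them up is to match each sample centroid to its nearest full-data centroid. The sketch below does this for three of the columns, with values copied from the two centers outputs; nearest-centroid matching is an assumption made here for illustration and could assign two clusters to the same partner in less clear-cut cases.

```r
vars <- c("acousticness", "liveness", "speechiness")

# Centroids from the 5% sample (values copied from the output above)
centers_red <- matrix(c( 1.1455026, -0.2716888, -0.3679547,
                        -0.5279572, -0.1069123, -0.1740382,
                         1.0352723,  2.4039376,  3.5894944),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(paste0("red_", 1:3), vars))

# Centroids from the full dataset (values copied from the output above)
centers_full <- matrix(c( 1.0606121,  2.3887044,  3.6231110,
                         -0.5239985, -0.1002277, -0.1676415,
                          1.1792906, -0.2741945, -0.3735091),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(paste0("full_", 1:3), vars))

# Distances from each sample centroid (rows) to each full-data centroid (columns)
d <- as.matrix(dist(rbind(centers_red, centers_full)))[1:3, 4:6]

# For each sample cluster, the number of the closest full-data cluster
match_full <- apply(d, 1, which.min)
match_full
```

On these values, sample cluster 1 lines up with full-data cluster 3 and sample cluster 3 with full-data cluster 1, while cluster 2 keeps its number.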
Profiling
# Assign cluster column into the dataset
spotify_clean$cluster <- spotify_kmeans$cluster
# Profiling with summarised data
spotify_clean %>%
group_by(cluster) %>%
summarise_all(mean)
## # A tibble: 3 x 12
## cluster popularity acousticness danceability duration_min energy instrumentalness liveness loudness speechiness tempo valence
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 20.3 0.793 0.562 4.07 0.659 0.00148 0.729 -12.0 0.868 98.2 0.416
## 2 2 41.0 0.212 0.609 3.87 0.680 0.0776 0.203 -7.04 0.0931 123. 0.542
## 3 3 27.2 0.836 0.353 4.09 0.201 0.467 0.167 -18.1 0.0511 106. 0.214
fviz_cluster(object = spotify_kmeans,
data = spotify_clean, labelsize = 1)
ggRadar(
data=spotify_clean,
mapping = aes(colours = cluster),
interactive = T
)
Characteristics of Clusters
- Cluster 1: Highest speechiness and liveness. Lowest tempo, popularity, and instrumentalness.
- Cluster 2: Highest energy, loudness, danceability, valence, tempo, and popularity. Lowest acousticness.
- Cluster 3: Highest acousticness and instrumentalness. Lowest speechiness, valence, danceability, energy, liveness, and loudness.
How the Spotify Track Recommendation Works
First, we will join our numeric dataset back to the original spotify dataset, so that it has the columns track_name, artist_name, and genre.
# Move row names back into a track_id column, then remove the row names
spotify_clean$track_id <- rownames(spotify_clean)
rownames(spotify_clean) <- NULL
# Combine dataset
spotify_track <- spotify %>%
select(track_id, track_name, artist_name, genre) %>%
left_join(spotify_clean, by = "track_id")
head(spotify_track)
## track_id track_name artist_name genre popularity acousticness danceability duration_min energy instrumentalness liveness loudness speechiness tempo valence cluster
## 1 0BRjO6ga9RKCKjfDqeFgWV C'est beau de faire un Show Henri Salvador Movie 0 0.611 0.389 1.656217 0.9100 0.000 0.3460 -1.828 0.0525 166.969 0.814 2
## 2 0BjC1NfoEOOusryehmNudP Perdu d'avance (par Gad Elmaleh) Martin & les fées Movie 1 0.246 0.590 2.289550 0.7370 0.000 0.1510 -5.559 0.0868 174.003 0.816 2
## 3 0CoSDzoNIKCRs124s9uTVy Don't Let Me Be Lonely Tonight Joseph Williams Movie 3 0.952 0.663 2.837783 0.1310 0.000 0.1030 -13.879 0.0362 99.488 0.368 3
## 4 0Gc6TVm52BwZD07Ki6tIvf Dis-moi Monsieur Gordon Cooper Henri Salvador Movie 0 0.703 0.240 2.540450 0.3260 0.000 0.0985 -12.178 0.0395 171.758 0.227 3
## 5 0IuslXpMROHdEPvSl1fTQK Ouverture Fabien Nataf Movie 4 0.950 0.331 1.377083 0.2250 0.123 0.2020 -21.150 0.0456 140.576 0.390 3
## 6 0Mf1jKa8eNAf1a4PwTbizj Le petit souper aux chandelles Henri Salvador Movie 0 0.749 0.578 2.677117 0.0948 0.000 0.1070 -14.970 0.1430 87.479 0.358 3
Now that the dataset is ready, we can build our Spotify track recommendation. For example, suppose I like Ariana Grande’s song “Break Free”. We first look up which cluster Ariana Grande - Break Free belongs to.
spotify_cluster <- spotify_track %>%
filter(track_name == "Break Free",
artist_name == "Ariana Grande") %>%
select(artist_name,track_name,cluster, genre) %>%
head(1)
spotify_cluster
## artist_name track_name cluster genre
## 1 Ariana Grande Break Free 2 Dance
It turns out that Ariana Grande - Break Free is in cluster 2. Now suppose we want to listen to Justin Bieber, but only to songs that fit our preference, so we pick Justin Bieber tracks from the same cluster.
set.seed(100)
spotify_track %>%
filter(cluster == spotify_cluster$cluster,
artist_name=="Justin Bieber") %>%
slice_sample(n = 3) %>%
select(-track_id) %>%
arrange(-popularity)
## track_name artist_name genre popularity acousticness danceability duration_min energy instrumentalness liveness loudness speechiness tempo valence cluster
## 1 Stuck In The Moment Justin Bieber Dance 61 0.179 0.715 3.716000 0.690 0 0.131 -6.279 0.0531 89.991 0.469 2
## 2 Sorry - Latino Remix Justin Bieber Dance 61 0.162 0.667 3.666450 0.755 0 0.306 -3.773 0.0431 99.994 0.513 2
## 3 Hold Tight Justin Bieber Dance 53 0.287 0.456 4.236883 0.608 0 0.101 -5.815 0.3280 58.875 0.491 2
The output shows that the system recommends three Justin Bieber songs: Stuck In The Moment, Sorry - Latino Remix, and Hold Tight.
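The two steps above (look up the seed track's cluster, then filter another artist's tracks to that cluster) can be wrapped into one function. The sketch below is a hypothetical helper, recommend_tracks(), written in base R so it runs without extra packages. It works on a made-up toy table standing in for spotify_track, and unlike the random sample above it simply returns the most popular matches.

```r
# Toy stand-in for spotify_track; every row here is made up for illustration
toy_tracks <- data.frame(
  track_name  = c("Song A", "Song B", "Song C", "Song D", "Song E"),
  artist_name = c("Artist X", "Artist Y", "Artist Y", "Artist Y", "Artist Z"),
  popularity  = c(70, 55, 80, 40, 65),
  cluster     = c(2, 2, 2, 3, 2),
  stringsAsFactors = FALSE
)

# recommend_tracks() is a hypothetical helper: find the seed track's cluster,
# then return up to n tracks by target_artist from that same cluster,
# ordered from most to least popular
recommend_tracks <- function(tracks, seed_track, seed_artist, target_artist, n = 3) {
  seed <- tracks[tracks$track_name == seed_track &
                 tracks$artist_name == seed_artist, ]
  if (nrow(seed) == 0) stop("Seed track not found")
  pool <- tracks[tracks$cluster == seed$cluster[1] &
                 tracks$artist_name == target_artist, ]
  pool <- pool[order(-pool$popularity), ]
  head(pool, n)
}

recs <- recommend_tracks(toy_tracks, "Song A", "Artist X", "Artist Y")
recs$track_name
```

With the real data, the call would take spotify_track as the first argument, for example with "Break Free", "Ariana Grande", and "Justin Bieber".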
Conclusion
The optimal number of clusters based on the elbow method and the silhouette method is 3.
The reduced dataset generalizes to the full dataset: \(\frac{BSS}{TSS}\) barely changes between the two fits (0.371 for the sample and 0.369 for the full data), staying at around 37%.
From the k-means profiling of the full dataset, we can conclude that:
Cluster 1 can be called the Live Music cluster. It has the highest liveness and speechiness. Higher liveness values represent a higher probability that the track was performed live, so the songs in this cluster have a live music ambience.
Cluster 2 can be called the Dance Music cluster. It has the highest danceability, valence, tempo, energy, and loudness, so it fits well if we want to dance. It also has the highest popularity, which shows that the tracks in this cluster are, on average, the most popular.
Cluster 3 can be called the Music for Study cluster. It has the highest acousticness and instrumentalness. It also has the lowest speechiness, so it suits studying because the music is less likely to distract us.