Introduction
Spotify uses recommender algorithms to recommend the most appropriate music to their customers. It also curates numerous playlists, curated by humans, that include songs with similar moods, tempos etc. This paper will try to evaluate whether a similar result can be achieved with a simple clustering mechanism.
The data used for this project was found on kaggle: https://www.kaggle.com/edalrami/19000-spotify-songs
The process
The data will be first inspected and prepared for further analysis. A Principal Component Analysis will be then carried out to find out the optimal number of components and the most contributing variables.
K-means and hierarchical clustering will be run and compared. Lastly, the membership to the clusters will be compared with the playlist membership.
Load and Inspect the data
Load the neccessary libraries and data:
Check for missing values:
cbind(
lapply(
lapply(data, is.na)
, sum)
)## [,1]
## song_name 0
## song_popularity 0
## song_duration_ms 0
## acousticness 0
## danceability 0
## energy 0
## instrumentalness 0
## key 0
## liveness 0
## loudness 0
## audio_mode 0
## speechiness 0
## tempo 0
## time_signature 0
## audio_valence 0
Check the summary of the data to see if there are any abnormal results. The data has duplicate songs which have to be removed.
summary(data)## song_name song_popularity song_duration_ms acousticness
## Length:18835 Min. : 0.00 Min. : 12000 Min. :0.000001
## Class :character 1st Qu.: 40.00 1st Qu.: 184340 1st Qu.:0.024100
## Mode :character Median : 56.00 Median : 211306 Median :0.132000
## Mean : 52.99 Mean : 218212 Mean :0.258539
## 3rd Qu.: 69.00 3rd Qu.: 242844 3rd Qu.:0.424000
## Max. :100.00 Max. :1799346 Max. :0.996000
## danceability energy instrumentalness key
## Min. :0.0000 Min. :0.00107 Min. :0.0000000 Min. : 0.000
## 1st Qu.:0.5330 1st Qu.:0.51000 1st Qu.:0.0000000 1st Qu.: 2.000
## Median :0.6450 Median :0.67400 Median :0.0000114 Median : 5.000
## Mean :0.6333 Mean :0.64499 Mean :0.0780080 Mean : 5.289
## 3rd Qu.:0.7480 3rd Qu.:0.81500 3rd Qu.:0.0025700 3rd Qu.: 8.000
## Max. :0.9870 Max. :0.99900 Max. :0.9970000 Max. :11.000
## liveness loudness audio_mode speechiness
## Min. :0.0109 Min. :-38.768 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0929 1st Qu.: -9.044 1st Qu.:0.0000 1st Qu.:0.0378
## Median :0.1220 Median : -6.555 Median :1.0000 Median :0.0555
## Mean :0.1797 Mean : -7.447 Mean :0.6281 Mean :0.1021
## 3rd Qu.:0.2210 3rd Qu.: -4.908 3rd Qu.:1.0000 3rd Qu.:0.1190
## Max. :0.9860 Max. : 1.585 Max. :1.0000 Max. :0.9410
## tempo time_signature audio_valence
## Min. : 0.00 Min. :0.000 Min. :0.000
## 1st Qu.: 98.37 1st Qu.:4.000 1st Qu.:0.335
## Median :120.01 Median :4.000 Median :0.527
## Mean :121.07 Mean :3.959 Mean :0.528
## 3rd Qu.:139.93 3rd Qu.:4.000 3rd Qu.:0.725
## Max. :242.32 Max. :5.000 Max. :0.984
data<- data %>%
distinct(song_name, .keep_all=TRUE)
rownames(data)<- data[,1]The correlation matrix is run initially to check if we can expect any highly correlated variables. Loudness and energy appear to be highly positively correlated; while acousticness is found to be negatively correlated to both loudness and energy.
data_ns_mt <-as.matrix(data[2:15])
cor1<- cor(data_ns_mt, method = "pearson")
corrplot(cor1)The data is then normalized for further analysis. A correlation matrix is run again on the normalized data. Similar results are observed from the normalized correlation matrix as before.
preproc <- preProcess(data, method=c("center", "scale"))
data_standard <- predict(preproc, data)
summary(data_standard)## song_name song_popularity song_duration_ms acousticness
## Length:13070 Min. :-2.4113 Min. :-3.2565 Min. :-0.9204
## Class :character 1st Qu.:-0.5713 1st Qu.:-0.5620 1st Qu.:-0.8376
## Mode :character Median : 0.1249 Median :-0.1126 Median :-0.4333
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7217 3rd Qu.: 0.4079 3rd Qu.: 0.6693
## Max. : 2.5618 Max. :24.9126 Max. : 2.3799
## danceability energy instrumentalness key
## Min. :-3.92780 Min. :-2.8335 Min. :-0.3921 Min. :-1.48427
## 1st Qu.:-0.63321 1st Qu.:-0.6506 1st Qu.:-0.3921 1st Qu.:-0.92622
## Median : 0.07535 Median : 0.1396 Median :-0.3920 Median :-0.08915
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.72893 3rd Qu.: 0.8004 3rd Qu.:-0.3681 3rd Qu.: 0.74792
## Max. : 2.27488 Max. : 1.6219 Max. : 3.6514 Max. : 1.58499
## liveness loudness audio_mode speechiness
## Min. :-1.1599 Min. :-7.5357 Min. :-1.3132 Min. :-0.9586
## 1st Qu.:-0.5999 1st Qu.:-0.4249 1st Qu.:-1.3132 1st Qu.:-0.6040
## Median :-0.4094 Median : 0.2265 Median : 0.7614 Median :-0.4414
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2922 3rd Qu.: 0.6688 3rd Qu.: 0.7614 3rd Qu.: 0.1346
## Max. : 5.5407 Max. : 2.2807 Max. : 0.7614 Max. : 7.9866
## tempo time_signature audio_valence
## Min. :-4.16252 Min. :-12.3761 Min. :-2.125720
## 1st Qu.:-0.79306 1st Qu.: 0.1517 1st Qu.:-0.782477
## Median :-0.03872 Median : 0.1517 Median : 0.001752
## Mean : 0.00000 Mean : 0.0000 Mean : 0.000000
## 3rd Qu.: 0.64606 3rd Qu.: 0.1517 3rd Qu.: 0.810111
## Max. : 4.16301 Max. : 3.2836 Max. : 1.831620
data_col<-colnames(data[1:15])
data_mt <- as.matrix(data_standard[, 2:14])
data.cor<-cor(data_mt, method="pearson")
print(data.cor, digits=2)## song_popularity song_duration_ms acousticness danceability
## song_popularity 1.00000 -0.0070 -0.0372 0.052
## song_duration_ms -0.00697 1.0000 -0.1220 -0.086
## acousticness -0.03715 -0.1220 1.0000 -0.180
## danceability 0.05214 -0.0859 -0.1797 1.000
## energy -0.01248 0.1057 -0.6807 0.064
## instrumentalness -0.09109 -0.0269 0.1863 -0.127
## key -0.00040 -0.0042 -0.0015 0.011
## liveness -0.03476 0.0235 -0.0899 -0.094
## loudness 0.05118 0.0354 -0.5724 0.182
## audio_mode 0.00022 -0.0256 0.0602 -0.100
## speechiness -0.00044 -0.0808 -0.0916 0.204
## tempo -0.02694 0.0171 -0.1471 -0.127
## time_signature 0.02145 0.0042 -0.1563 0.139
## energy instrumentalness key liveness loudness audio_mode
## song_popularity -0.012 -0.091 -0.0004 -0.0348 0.0512 0.00022
## song_duration_ms 0.106 -0.027 -0.0042 0.0235 0.0354 -0.02559
## acousticness -0.681 0.186 -0.0015 -0.0899 -0.5724 0.06019
## danceability 0.064 -0.127 0.0108 -0.0943 0.1817 -0.10049
## energy 1.000 -0.231 0.0163 0.1776 0.7703 -0.04833
## instrumentalness -0.231 1.000 -0.0111 -0.0438 -0.4053 -0.01662
## key 0.016 -0.011 1.0000 -0.0035 0.0099 -0.17661
## liveness 0.178 -0.044 -0.0035 1.0000 0.1155 -0.00690
## loudness 0.770 -0.405 0.0099 0.1155 1.0000 -0.05712
## audio_mode -0.048 -0.017 -0.1766 -0.0069 -0.0571 1.00000
## speechiness 0.078 -0.082 0.0271 0.0894 0.0836 -0.11347
## tempo 0.181 -0.046 -0.0067 0.0260 0.1422 0.02413
## time_signature 0.148 -0.065 -0.0120 0.0116 0.1147 -0.02452
## speechiness tempo time_signature
## song_popularity -0.00044 -0.0269 0.0214
## song_duration_ms -0.08082 0.0171 0.0042
## acousticness -0.09155 -0.1471 -0.1563
## danceability 0.20391 -0.1267 0.1391
## energy 0.07774 0.1814 0.1476
## instrumentalness -0.08235 -0.0456 -0.0649
## key 0.02711 -0.0067 -0.0120
## liveness 0.08941 0.0260 0.0116
## loudness 0.08362 0.1422 0.1147
## audio_mode -0.11347 0.0241 -0.0245
## speechiness 1.00000 0.0530 0.0545
## tempo 0.05296 1.0000 0.0075
## time_signature 0.05446 0.0075 1.0000
cor_standard<- corrplot(data.cor, order ="alphabet", tl.cex=0.6)#Principal Component Analysis
The principal component analysis was chosen among other dimension reduction technologies due to its computational efficiency (other methods took a long time to load) and ability to remove correlating features.
The analysis is run and the eigenvalues are determined.
pca <-prcomp(data_mt)
eigen(data.cor)$values## [1] 2.7071950 1.3728252 1.1749742 1.0851599 1.0138411 0.9780977 0.9296091
## [8] 0.8996730 0.8491665 0.7660557 0.6500129 0.3936726 0.1797171
summary(pca)## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.6454 1.1717 1.08396 1.04171 1.00690 0.98899 0.96416
## Proportion of Variance 0.2082 0.1056 0.09038 0.08347 0.07799 0.07524 0.07151
## Cumulative Proportion 0.2082 0.3139 0.40423 0.48770 0.56569 0.64093 0.71244
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.94851 0.92150 0.87525 0.8062 0.62743 0.42393
## Proportion of Variance 0.06921 0.06532 0.05893 0.0500 0.03028 0.01382
## Cumulative Proportion 0.78164 0.84696 0.90589 0.9559 0.98618 1.00000
Visualizations of the eigenvalues and percentage of explained variances:
scree_plot_eigenvalue <-fviz_eig(pca, choice = "eigenvalue", addlabels = TRUE,
ggtheme = theme_hc(style = "darkunica"), barfill = "cyan",
barcolor = "cyan", linecolor = "azure")
scree_plot_prct<-fviz_eig(pca,addlabels = TRUE, ggtheme = theme_hc(style = "darkunica"), barfill = "deeppink2",
barcolor = "deeppink2", linecolor = "azure")
grid.arrange(scree_plot_eigenvalue, scree_plot_prct, nrow = 1) get_eigenvalue(pca)## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.7071950 20.824577 20.82458
## Dim.2 1.3728252 10.560194 31.38477
## Dim.3 1.1749742 9.038263 40.42303
## Dim.4 1.0851599 8.347384 48.77042
## Dim.5 1.0138411 7.798777 56.56920
## Dim.6 0.9780977 7.523828 64.09302
## Dim.7 0.9296091 7.150839 71.24386
## Dim.8 0.8996730 6.920562 78.16442
## Dim.9 0.8491665 6.532050 84.69647
## Dim.10 0.7660557 5.892737 90.58921
## Dim.11 0.6500129 5.000099 95.58931
## Dim.12 0.3936726 3.028250 98.61756
## Dim.13 0.1797171 1.382439 100.00000
From the output and the graphs it can be observed that the first 4 components have eigenvalues larger than one. These components together explain 0.48% of the variance.
We check for the most contributing variables to the first 4 principal components. We select the first 7 variables for further analysis.
var<-get_pca_var(pca)
contribution<-fviz_contrib(pca, "var", axes=1:4, xtickslab.rt=90, ggtheme = theme_hc(style = "darkunica"))
contributionfeatures<- data_standard %>%
select(energy, loudness, acousticness, audio_mode, danceability, audio_mode, speechiness, key)#K-Means The K-means clustering was chosen due to it’s simplicty and efficiency in working with larger datasets.
Step 1: Find the optimal number of clusters
We use the elbowplot and the silhouette plot to select the optimal number of clusters. The elbow plot would suggest that 4 or 6 clusters should be selected while the silhouette plot suggests 2. We therefore select 2, 4 & 6 clusters for the k-means analysis.
#hopkins(features, n=nrow(features)-1)
opt_elbow <-Optimal_Clusters_KMeans(features, max_clusters=10, plot_clusters = TRUE)opt_silhouette<-Optimal_Clusters_KMeans(features, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")Step 2: Run and visualise K-means
The k-means clustering is run with 2, 4 and 6 clusters and visualized. There appear to be no clear clusters and the k-means with larger number of clusters displays overlapping clusters. Nevertheless some tendencies can be found.
##### K-means
kmeans_clusters6<- kmeans(features, 6)
kmeans_clusters4<- kmeans(features, 4)
kmeans_clusters2<- kmeans(features, 2)
kmeans6_plot<- fviz_cluster(kmeans_clusters6, data = features, ellipse.type = "convex",
ggtheme = theme_hc(style = "darkunica"),
geom = "point",
main = "K-means with 6 clusters")
kmeans4_plot <- fviz_cluster(kmeans_clusters4, data = features, ellipse.type = "convex",
ggtheme = theme_hc(style = "darkunica"),
geom = "point",
main = "K-means with 4 clusters")
kmeans2_plot<- fviz_cluster(kmeans_clusters2, data = features, ellipse.type = "convex",
ggtheme = theme_hc(style = "darkunica"),
geom = "point",
main = "K-means with 2 clusters")
grid.arrange(kmeans6_plot, kmeans4_plot, kmeans2_plot, nrow = 1)Hierarchical Clustering
The optimal number of clusters selection in RMarkdown was ommitted as it had a long running time.
An agglomerative hierarchical clustering approach was selected. Agglomerative coefficient was calculated for each methodology and the best performing methodology (wards and complete) were chosen.
#elbow_hc<- fviz_nbclust(features, FUN = hcut, method = "wss")
#silhouette_hc <- fviz_nbclust(df, FUN = hcut, method = "silhouette")
d <- dist(features, method = "euclidean")
hc_complete<- hclust(d, method = "complete")
hc_average <- hclust(d, method = "average")
hc_single <- hclust(d, method = "single")
hc_ward <- hclust(d, method = "ward.D2")
complete_cf <- coef.hclust(hc_complete)
average_cf <- coef.hclust(hc_average)
single_cf <- coef.hclust(hc_single)
ward_cf<- coef.hclust(hc_ward)
Coefficients <- data.table(
Method = c("Complete", "Average", "Single", "Ward"),
Agglomerative_Coefficient= c(complete_cf, average_cf,single_cf, ward_cf )
)
datatable(Coefficients)The trees are cut at 2,4 and 6 clusters for both methods.
# cutting the tree
# Ward's method
# cut tree into 4 groups
sub_grp_wards4 <- cutree(hc_ward, k = 4)
sub_grp_wards6 <- cutree(hc_ward, k = 6)
sub_grp_wards2 <- cutree(hc_ward, k = 2)
wards4_plot <-fviz_cluster(list(data = features, cluster = sub_grp_wards4),
ggtheme = theme_hc(style = "darkunica"),
geom = "point",
main = "Hclust (Wards method) with 4 clusters")
wards6_plot <-fviz_cluster(list(data = features, cluster = sub_grp_wards6),
ggtheme = theme_hc(style = "darkunica"),
geom = "point",
main = "Hclust (Wards method) with 6 clusters")
wards2_plot <-fviz_cluster(list(data = features, cluster = sub_grp_wards2),
ggtheme = theme_hc(style = "darkunica"),
geom = "point",
main = "Hclust (Wards method) with 2 clusters")
grid.arrange(wards6_plot, wards4_plot, wards2_plot, nrow = 1)# Complete method
# cut tree into 4 groups
sub_grp_complete4 <- cutree(hc_complete, k = 4)
sub_grp_complete6 <- cutree(hc_complete, k = 6)
sub_grp_complete2 <- cutree(hc_complete, k = 2)
complete4_plot <-fviz_cluster(list(data = features, cluster = sub_grp_complete4),
ggtheme = theme_hc(style = "darkunica"),
geom = "point",
main = "Hclust (Complete method) with 4 clusters")
complete2_plot <-fviz_cluster(list(data = features, cluster = sub_grp_complete2),
ggtheme = theme_hc(style = "darkunica"),
geom = "point",
main = "Hclust (Complete method) with 2 clusters")
complete6_plot <-fviz_cluster(list(data = features, cluster = sub_grp_complete6),
ggtheme = theme_hc(style = "darkunica"),
geom = "point",
main = "Hclust (Complete method) with 6 clusters")
grid.arrange(complete6_plot, complete4_plot, complete2_plot, nrow = 1)Cluster membership vs Playlist membership
The data source provided another interesting set of data- the list of songs in 300 playlists. In order to evaluate how well the clustering mechanisms would cluster the dataset songs into playlists, the number of distinct playlists in each cluster were counted.
It is not a sophisticated measure, but it is still interesting to see if the clusters have any resemblance to human-made playlists.
The results below show that there is little correlation: the bigger clusters included all or almost all of the playlists.
Hierarchical clustering seemed to perform best in this regard, however the lower playlist count is due to the lower number of objects in the cluster (e.g. cluster 6 had 22 objects and 22 playlists). It is important to note, however that one song could have been featured in multiple playlists.
#Evaluate the quality of clusters
clusters <- features[,0]
clusters$kmeans2 <- kmeans_clusters2$cluster
clusters$kmeans4 <- kmeans_clusters4$cluster
clusters$kmeans6 <- kmeans_clusters6$cluster
clusters$hc_w_2<- sub_grp_wards2
clusters$hc_w_4<- sub_grp_wards4
clusters$hc_w_6<- sub_grp_wards6
clusters$hc_c_2<- sub_grp_complete2
clusters$hc_c_4<- sub_grp_complete4
clusters$hc_c_6<- sub_grp_complete6
clusters$song_name <- rownames(clusters)
playlists <- left_join(clusters, info,by = "song_name")
playlists<- playlists %>%
select(-album_names, -artist_name)
Hc_c_6_playlist_table <- playlists %>%
distinct(playlist,hc_c_6) %>%
group_by(hc_c_6) %>%
summarize("playlists in cluster" = n())
Hc_c_4_playlist_table<- playlists %>%
distinct(playlist,hc_c_4) %>%
group_by(hc_c_4) %>%
summarize("playlists in cluster" = n())
Hc_c_2_playlist_table<-playlists %>%
distinct(playlist,hc_c_2) %>%
group_by(hc_c_2) %>%
summarize("playlists in cluster" = n())
Hc_w_4_playlist_table<- playlists %>%
distinct(playlist,hc_w_4) %>%
group_by(hc_w_4) %>%
summarize("playlists in cluster" = n())
Hc_w_6_playlist_table<- playlists %>%
distinct(playlist,hc_w_6) %>%
group_by(hc_w_6) %>%
summarize("playlists in cluster" = n())
Hc_w_2_playlist_table<-playlists %>%
distinct(playlist,hc_w_2) %>%
group_by(hc_w_2) %>%
summarize("playlists in cluster" = n())
kmeans2_playlist_table<- playlists %>%
distinct(playlist,kmeans2) %>%
group_by(kmeans2) %>%
summarize("playlists in cluster" = n())
kmeans4_playlist_table <- playlists %>%
distinct(playlist,kmeans4) %>%
group_by(kmeans4) %>%
summarize("playlists in cluster" = n())
kmeans6_playlist_table <-as.data.table(playlists %>%
distinct(playlist,kmeans6) %>%
group_by(kmeans6) %>%
summarize("playlists in cluster" = n()))
library(knitr)## Warning: package 'knitr' was built under R version 4.0.5
kable(list(Hc_c_2_playlist_table,Hc_w_2_playlist_table, kmeans2_playlist_table ))
|
|
|
kable(list(Hc_c_4_playlist_table,Hc_w_4_playlist_table, kmeans4_playlist_table ))
|
|
|
kable(list(Hc_c_6_playlist_table,Hc_w_6_playlist_table, kmeans6_playlist_table ))
|
|
|