Clusters and Dimension reduction

Paulina Sereikyte

2022

Introduction

Spotify uses recommender algorithms to recommend the most appropriate music to their customers. It also curates numerous playlists, curated by humans, that include songs with similar moods, tempos etc. This paper will try to evaluate whether a similar result can be achieved with a simple clustering mechanism.

The data used for this project was found on kaggle: https://www.kaggle.com/edalrami/19000-spotify-songs

The process

The data will be first inspected and prepared for further analysis. A Principal Component Analysis will be then carried out to find out the optimal number of components and the most contributing variables.

K-means and hierarchical clustering will be run and compared. Lastly, the membership to the clusters will be compared with the playlist membership.

Load and Inspect the data

Load the neccessary libraries and data:

Check for missing values:

cbind(
  lapply(
    lapply(data, is.na)
    , sum)
)
##                  [,1]
## song_name        0   
## song_popularity  0   
## song_duration_ms 0   
## acousticness     0   
## danceability     0   
## energy           0   
## instrumentalness 0   
## key              0   
## liveness         0   
## loudness         0   
## audio_mode       0   
## speechiness      0   
## tempo            0   
## time_signature   0   
## audio_valence    0

Check the summary of the data to see if there are any abnormal results. The data has duplicate songs which have to be removed.

summary(data)
##   song_name         song_popularity  song_duration_ms   acousticness     
##  Length:18835       Min.   :  0.00   Min.   :  12000   Min.   :0.000001  
##  Class :character   1st Qu.: 40.00   1st Qu.: 184340   1st Qu.:0.024100  
##  Mode  :character   Median : 56.00   Median : 211306   Median :0.132000  
##                     Mean   : 52.99   Mean   : 218212   Mean   :0.258539  
##                     3rd Qu.: 69.00   3rd Qu.: 242844   3rd Qu.:0.424000  
##                     Max.   :100.00   Max.   :1799346   Max.   :0.996000  
##   danceability        energy        instrumentalness         key        
##  Min.   :0.0000   Min.   :0.00107   Min.   :0.0000000   Min.   : 0.000  
##  1st Qu.:0.5330   1st Qu.:0.51000   1st Qu.:0.0000000   1st Qu.: 2.000  
##  Median :0.6450   Median :0.67400   Median :0.0000114   Median : 5.000  
##  Mean   :0.6333   Mean   :0.64499   Mean   :0.0780080   Mean   : 5.289  
##  3rd Qu.:0.7480   3rd Qu.:0.81500   3rd Qu.:0.0025700   3rd Qu.: 8.000  
##  Max.   :0.9870   Max.   :0.99900   Max.   :0.9970000   Max.   :11.000  
##     liveness         loudness         audio_mode      speechiness    
##  Min.   :0.0109   Min.   :-38.768   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0929   1st Qu.: -9.044   1st Qu.:0.0000   1st Qu.:0.0378  
##  Median :0.1220   Median : -6.555   Median :1.0000   Median :0.0555  
##  Mean   :0.1797   Mean   : -7.447   Mean   :0.6281   Mean   :0.1021  
##  3rd Qu.:0.2210   3rd Qu.: -4.908   3rd Qu.:1.0000   3rd Qu.:0.1190  
##  Max.   :0.9860   Max.   :  1.585   Max.   :1.0000   Max.   :0.9410  
##      tempo        time_signature  audio_valence  
##  Min.   :  0.00   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 98.37   1st Qu.:4.000   1st Qu.:0.335  
##  Median :120.01   Median :4.000   Median :0.527  
##  Mean   :121.07   Mean   :3.959   Mean   :0.528  
##  3rd Qu.:139.93   3rd Qu.:4.000   3rd Qu.:0.725  
##  Max.   :242.32   Max.   :5.000   Max.   :0.984
data<- data %>% 
  distinct(song_name, .keep_all=TRUE)
rownames(data)<- data[,1]

The correlation matrix is run initially to check if we can expect any highly correlated variables. Loudness and energy appear to be highly positively correlated; while acousticness is found to be negatively correlated to both loudness and energy.

data_ns_mt <-as.matrix(data[2:15])
cor1<- cor(data_ns_mt, method = "pearson")
corrplot(cor1)

The data is then normalized for further analysis. A correlation matrix is run again on the normalized data. Similar results are observed from the normalized correlation matrix as before.

preproc <- preProcess(data, method=c("center", "scale"))
data_standard <- predict(preproc, data)
summary(data_standard)
##   song_name         song_popularity   song_duration_ms   acousticness    
##  Length:13070       Min.   :-2.4113   Min.   :-3.2565   Min.   :-0.9204  
##  Class :character   1st Qu.:-0.5713   1st Qu.:-0.5620   1st Qu.:-0.8376  
##  Mode  :character   Median : 0.1249   Median :-0.1126   Median :-0.4333  
##                     Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##                     3rd Qu.: 0.7217   3rd Qu.: 0.4079   3rd Qu.: 0.6693  
##                     Max.   : 2.5618   Max.   :24.9126   Max.   : 2.3799  
##   danceability          energy        instrumentalness       key          
##  Min.   :-3.92780   Min.   :-2.8335   Min.   :-0.3921   Min.   :-1.48427  
##  1st Qu.:-0.63321   1st Qu.:-0.6506   1st Qu.:-0.3921   1st Qu.:-0.92622  
##  Median : 0.07535   Median : 0.1396   Median :-0.3920   Median :-0.08915  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.72893   3rd Qu.: 0.8004   3rd Qu.:-0.3681   3rd Qu.: 0.74792  
##  Max.   : 2.27488   Max.   : 1.6219   Max.   : 3.6514   Max.   : 1.58499  
##     liveness          loudness         audio_mode       speechiness     
##  Min.   :-1.1599   Min.   :-7.5357   Min.   :-1.3132   Min.   :-0.9586  
##  1st Qu.:-0.5999   1st Qu.:-0.4249   1st Qu.:-1.3132   1st Qu.:-0.6040  
##  Median :-0.4094   Median : 0.2265   Median : 0.7614   Median :-0.4414  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2922   3rd Qu.: 0.6688   3rd Qu.: 0.7614   3rd Qu.: 0.1346  
##  Max.   : 5.5407   Max.   : 2.2807   Max.   : 0.7614   Max.   : 7.9866  
##      tempo          time_signature     audio_valence      
##  Min.   :-4.16252   Min.   :-12.3761   Min.   :-2.125720  
##  1st Qu.:-0.79306   1st Qu.:  0.1517   1st Qu.:-0.782477  
##  Median :-0.03872   Median :  0.1517   Median : 0.001752  
##  Mean   : 0.00000   Mean   :  0.0000   Mean   : 0.000000  
##  3rd Qu.: 0.64606   3rd Qu.:  0.1517   3rd Qu.: 0.810111  
##  Max.   : 4.16301   Max.   :  3.2836   Max.   : 1.831620
data_col<-colnames(data[1:15])
data_mt <- as.matrix(data_standard[, 2:14])
data.cor<-cor(data_mt, method="pearson") 
print(data.cor, digits=2)
##                  song_popularity song_duration_ms acousticness danceability
## song_popularity          1.00000          -0.0070      -0.0372        0.052
## song_duration_ms        -0.00697           1.0000      -0.1220       -0.086
## acousticness            -0.03715          -0.1220       1.0000       -0.180
## danceability             0.05214          -0.0859      -0.1797        1.000
## energy                  -0.01248           0.1057      -0.6807        0.064
## instrumentalness        -0.09109          -0.0269       0.1863       -0.127
## key                     -0.00040          -0.0042      -0.0015        0.011
## liveness                -0.03476           0.0235      -0.0899       -0.094
## loudness                 0.05118           0.0354      -0.5724        0.182
## audio_mode               0.00022          -0.0256       0.0602       -0.100
## speechiness             -0.00044          -0.0808      -0.0916        0.204
## tempo                   -0.02694           0.0171      -0.1471       -0.127
## time_signature           0.02145           0.0042      -0.1563        0.139
##                  energy instrumentalness     key liveness loudness audio_mode
## song_popularity  -0.012           -0.091 -0.0004  -0.0348   0.0512    0.00022
## song_duration_ms  0.106           -0.027 -0.0042   0.0235   0.0354   -0.02559
## acousticness     -0.681            0.186 -0.0015  -0.0899  -0.5724    0.06019
## danceability      0.064           -0.127  0.0108  -0.0943   0.1817   -0.10049
## energy            1.000           -0.231  0.0163   0.1776   0.7703   -0.04833
## instrumentalness -0.231            1.000 -0.0111  -0.0438  -0.4053   -0.01662
## key               0.016           -0.011  1.0000  -0.0035   0.0099   -0.17661
## liveness          0.178           -0.044 -0.0035   1.0000   0.1155   -0.00690
## loudness          0.770           -0.405  0.0099   0.1155   1.0000   -0.05712
## audio_mode       -0.048           -0.017 -0.1766  -0.0069  -0.0571    1.00000
## speechiness       0.078           -0.082  0.0271   0.0894   0.0836   -0.11347
## tempo             0.181           -0.046 -0.0067   0.0260   0.1422    0.02413
## time_signature    0.148           -0.065 -0.0120   0.0116   0.1147   -0.02452
##                  speechiness   tempo time_signature
## song_popularity     -0.00044 -0.0269         0.0214
## song_duration_ms    -0.08082  0.0171         0.0042
## acousticness        -0.09155 -0.1471        -0.1563
## danceability         0.20391 -0.1267         0.1391
## energy               0.07774  0.1814         0.1476
## instrumentalness    -0.08235 -0.0456        -0.0649
## key                  0.02711 -0.0067        -0.0120
## liveness             0.08941  0.0260         0.0116
## loudness             0.08362  0.1422         0.1147
## audio_mode          -0.11347  0.0241        -0.0245
## speechiness          1.00000  0.0530         0.0545
## tempo                0.05296  1.0000         0.0075
## time_signature       0.05446  0.0075         1.0000
cor_standard<- corrplot(data.cor, order ="alphabet", tl.cex=0.6)

#Principal Component Analysis

The principal component analysis was chosen among other dimension reduction technologies due to its computational efficiency (other methods took a long time to load) and ability to remove correlating features.

The analysis is run and the eigenvalues are determined.

pca <-prcomp(data_mt)
eigen(data.cor)$values
##  [1] 2.7071950 1.3728252 1.1749742 1.0851599 1.0138411 0.9780977 0.9296091
##  [8] 0.8996730 0.8491665 0.7660557 0.6500129 0.3936726 0.1797171
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.6454 1.1717 1.08396 1.04171 1.00690 0.98899 0.96416
## Proportion of Variance 0.2082 0.1056 0.09038 0.08347 0.07799 0.07524 0.07151
## Cumulative Proportion  0.2082 0.3139 0.40423 0.48770 0.56569 0.64093 0.71244
##                            PC8     PC9    PC10   PC11    PC12    PC13
## Standard deviation     0.94851 0.92150 0.87525 0.8062 0.62743 0.42393
## Proportion of Variance 0.06921 0.06532 0.05893 0.0500 0.03028 0.01382
## Cumulative Proportion  0.78164 0.84696 0.90589 0.9559 0.98618 1.00000

Visualizations of the eigenvalues and percentage of explained variances:

scree_plot_eigenvalue <-fviz_eig(pca, choice = "eigenvalue", addlabels = TRUE,
                                 ggtheme = theme_hc(style = "darkunica"), barfill = "cyan",
                                 barcolor = "cyan", linecolor = "azure")
scree_plot_prct<-fviz_eig(pca,addlabels = TRUE, ggtheme = theme_hc(style = "darkunica"), barfill = "deeppink2",
                          barcolor = "deeppink2", linecolor = "azure")

grid.arrange(scree_plot_eigenvalue, scree_plot_prct, nrow = 1) 

get_eigenvalue(pca)
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   2.7071950        20.824577                    20.82458
## Dim.2   1.3728252        10.560194                    31.38477
## Dim.3   1.1749742         9.038263                    40.42303
## Dim.4   1.0851599         8.347384                    48.77042
## Dim.5   1.0138411         7.798777                    56.56920
## Dim.6   0.9780977         7.523828                    64.09302
## Dim.7   0.9296091         7.150839                    71.24386
## Dim.8   0.8996730         6.920562                    78.16442
## Dim.9   0.8491665         6.532050                    84.69647
## Dim.10  0.7660557         5.892737                    90.58921
## Dim.11  0.6500129         5.000099                    95.58931
## Dim.12  0.3936726         3.028250                    98.61756
## Dim.13  0.1797171         1.382439                   100.00000

From the output and the graphs it can be observed that the first 4 components have eigenvalues larger than one. These components together explain 0.48% of the variance.

We check for the most contributing variables to the first 4 principal components. We select the first 7 variables for further analysis.

var<-get_pca_var(pca)
contribution<-fviz_contrib(pca, "var", axes=1:4, xtickslab.rt=90,  ggtheme = theme_hc(style = "darkunica"))
contribution

features<- data_standard %>% 
  select(energy, loudness, acousticness, audio_mode, danceability, audio_mode, speechiness, key)

#K-Means The K-means clustering was chosen due to it’s simplicty and efficiency in working with larger datasets.

Step 1: Find the optimal number of clusters

We use the elbowplot and the silhouette plot to select the optimal number of clusters. The elbow plot would suggest that 4 or 6 clusters should be selected while the silhouette plot suggests 2. We therefore select 2, 4 & 6 clusters for the k-means analysis.

#hopkins(features, n=nrow(features)-1)
opt_elbow <-Optimal_Clusters_KMeans(features, max_clusters=10, plot_clusters = TRUE)

opt_silhouette<-Optimal_Clusters_KMeans(features, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")

Step 2: Run and visualise K-means

The k-means clustering is run with 2, 4 and 6 clusters and visualized. There appear to be no clear clusters and the k-means with larger number of clusters displays overlapping clusters. Nevertheless some tendencies can be found.

##### K-means
kmeans_clusters6<- kmeans(features, 6)
kmeans_clusters4<- kmeans(features, 4)
kmeans_clusters2<- kmeans(features, 2)

kmeans6_plot<- fviz_cluster(kmeans_clusters6, data = features, ellipse.type = "convex", 
             ggtheme = theme_hc(style = "darkunica"), 
             geom = "point",
             main = "K-means with 6 clusters")

kmeans4_plot <- fviz_cluster(kmeans_clusters4, data = features, ellipse.type = "convex", 
             ggtheme = theme_hc(style = "darkunica"), 
             geom = "point",
             main = "K-means with 4 clusters")

kmeans2_plot<- fviz_cluster(kmeans_clusters2, data = features, ellipse.type = "convex", 
             ggtheme = theme_hc(style = "darkunica"), 
             geom = "point",
             main = "K-means with 2 clusters")

grid.arrange(kmeans6_plot, kmeans4_plot, kmeans2_plot, nrow = 1)

Hierarchical Clustering

The optimal number of clusters selection in RMarkdown was ommitted as it had a long running time.

An agglomerative hierarchical clustering approach was selected. Agglomerative coefficient was calculated for each methodology and the best performing methodology (wards and complete) were chosen.

#elbow_hc<- fviz_nbclust(features, FUN = hcut, method = "wss")
#silhouette_hc <- fviz_nbclust(df, FUN = hcut, method = "silhouette")

d <- dist(features, method = "euclidean")
hc_complete<- hclust(d, method = "complete")
hc_average <- hclust(d, method = "average")
hc_single <- hclust(d, method = "single")
hc_ward <- hclust(d, method = "ward.D2")

complete_cf <- coef.hclust(hc_complete)
average_cf <- coef.hclust(hc_average)
single_cf <- coef.hclust(hc_single)
ward_cf<- coef.hclust(hc_ward)


Coefficients <- data.table(
  Method = c("Complete", "Average", "Single", "Ward"),
  Agglomerative_Coefficient= c(complete_cf, average_cf,single_cf, ward_cf )
)
datatable(Coefficients)

The trees are cut at 2,4 and 6 clusters for both methods.

# cutting the tree

# Ward's method

# cut tree into 4 groups
sub_grp_wards4 <- cutree(hc_ward, k = 4)
sub_grp_wards6 <- cutree(hc_ward, k = 6)
sub_grp_wards2 <- cutree(hc_ward, k = 2)

wards4_plot <-fviz_cluster(list(data = features, cluster = sub_grp_wards4), 
                           ggtheme = theme_hc(style = "darkunica"), 
                           geom = "point",
                           main = "Hclust (Wards method) with 4 clusters")

wards6_plot <-fviz_cluster(list(data = features, cluster = sub_grp_wards6), 
                           ggtheme = theme_hc(style = "darkunica"), 
                           geom = "point",
                           main = "Hclust (Wards method) with 6 clusters")

wards2_plot <-fviz_cluster(list(data = features, cluster = sub_grp_wards2), 
                           ggtheme = theme_hc(style = "darkunica"), 
                           geom = "point",
                           main = "Hclust (Wards method) with 2 clusters")

grid.arrange(wards6_plot, wards4_plot, wards2_plot, nrow = 1)

# Complete method

# cut tree into 4 groups
sub_grp_complete4 <- cutree(hc_complete, k = 4)
sub_grp_complete6 <- cutree(hc_complete, k = 6)
sub_grp_complete2 <- cutree(hc_complete, k = 2)

complete4_plot <-fviz_cluster(list(data = features, cluster = sub_grp_complete4), 
                           ggtheme = theme_hc(style = "darkunica"), 
                           geom = "point",
                           main = "Hclust (Complete method) with 4 clusters")

complete2_plot <-fviz_cluster(list(data = features, cluster = sub_grp_complete2), 
                              ggtheme = theme_hc(style = "darkunica"), 
                              geom = "point",
                              main = "Hclust (Complete method) with 2 clusters")

complete6_plot <-fviz_cluster(list(data = features, cluster = sub_grp_complete6), 
                              ggtheme = theme_hc(style = "darkunica"), 
                              geom = "point",
                              main = "Hclust (Complete method) with 6 clusters")

grid.arrange(complete6_plot, complete4_plot, complete2_plot, nrow = 1)

Cluster membership vs Playlist membership

The data source provided another interesting set of data- the list of songs in 300 playlists. In order to evaluate how well the clustering mechanisms would cluster the dataset songs into playlists, the number of distinct playlists in each cluster were counted.

It is not a sophisticated measure, but it is still interesting to see if the clusters have any resemblance to human-made playlists.

The results below show that there is little correlation: the bigger clusters included all or almost all of the playlists.

Hierarchical clustering seemed to perform best in this regard, however the lower playlist count is due to the lower number of objects in the cluster (e.g. cluster 6 had 22 objects and 22 playlists). It is important to note, however that one song could have been featured in multiple playlists.

#Evaluate the quality of clusters
clusters <- features[,0]
clusters$kmeans2 <- kmeans_clusters2$cluster
clusters$kmeans4 <- kmeans_clusters4$cluster
clusters$kmeans6 <- kmeans_clusters6$cluster
clusters$hc_w_2<- sub_grp_wards2
clusters$hc_w_4<- sub_grp_wards4
clusters$hc_w_6<- sub_grp_wards6
clusters$hc_c_2<- sub_grp_complete2
clusters$hc_c_4<- sub_grp_complete4
clusters$hc_c_6<- sub_grp_complete6

clusters$song_name <- rownames(clusters)
playlists <- left_join(clusters, info,by = "song_name")
playlists<- playlists %>% 
  select(-album_names, -artist_name)

Hc_c_6_playlist_table <- playlists %>%
  distinct(playlist,hc_c_6) %>% 
  group_by(hc_c_6) %>%
  summarize("playlists in cluster" = n())

Hc_c_4_playlist_table<- playlists %>%
  distinct(playlist,hc_c_4) %>% 
  group_by(hc_c_4) %>%
  summarize("playlists in cluster" = n())

Hc_c_2_playlist_table<-playlists %>%
  distinct(playlist,hc_c_2) %>% 
  group_by(hc_c_2) %>%
  summarize("playlists in cluster" = n())

Hc_w_4_playlist_table<- playlists %>%
  distinct(playlist,hc_w_4) %>% 
  group_by(hc_w_4) %>%
  summarize("playlists in cluster" = n())

Hc_w_6_playlist_table<- playlists %>%
  distinct(playlist,hc_w_6) %>% 
  group_by(hc_w_6) %>%
  summarize("playlists in cluster" = n())

Hc_w_2_playlist_table<-playlists %>%
  distinct(playlist,hc_w_2) %>% 
  group_by(hc_w_2) %>%
  summarize("playlists in cluster" = n())

kmeans2_playlist_table<- playlists %>%
  distinct(playlist,kmeans2) %>% 
  group_by(kmeans2) %>%
  summarize("playlists in cluster" = n())

kmeans4_playlist_table <- playlists %>%
  distinct(playlist,kmeans4) %>% 
  group_by(kmeans4) %>%
  summarize("playlists in cluster" = n())

kmeans6_playlist_table <-as.data.table(playlists %>%
  distinct(playlist,kmeans6) %>% 
  group_by(kmeans6) %>%
  summarize("playlists in cluster" = n()))

library(knitr)
## Warning: package 'knitr' was built under R version 4.0.5
kable(list(Hc_c_2_playlist_table,Hc_w_2_playlist_table, kmeans2_playlist_table ))
hc_c_2 playlists in cluster
1 298
2 220
hc_w_2 playlists in cluster
1 300
2 223
kmeans2 playlists in cluster
1 263
2 298
kable(list(Hc_c_4_playlist_table,Hc_w_4_playlist_table, kmeans4_playlist_table ))
hc_c_4 playlists in cluster
1 298
2 217
3 77
4 29
hc_w_4 playlists in cluster
1 298
2 300
3 223
4 264
kmeans4 playlists in cluster
1 242
2 298
3 293
4 241
kable(list(Hc_c_6_playlist_table,Hc_w_6_playlist_table, kmeans6_playlist_table ))
hc_c_6 playlists in cluster
1 298
2 208
3 217
4 77
5 11
6 22
hc_w_6 playlists in cluster
1 298
2 287
3 292
4 223
5 264
6 255
kmeans6 playlists in cluster
1 228
2 294
3 293
4 249
5 295
6 235