4 Main methods
4.1 First clustering
Let us start by simply trying the NbClust function, which automatically finds the optimal number of clusters:
set.seed(123)
c3<-NbClust(data_pc.n, distance="euclidean", method="complete", index="ch")
c3$Best.nc
## Number_clusters Value_Index
## 11.0000 11.6441
c3$Best.partition
## [1] 1 2 2 3 2 4 5 5 6 2 7 7 5 5 8 4 2 8 3 4 3 3 6 2 3
## [26] 9 9 8 10 2 4 11 7 7 4 5 5 7 4 7
Well, not exactly what we aimed to discover. Finding 11 clusters suggests that there is enough variation to split single artists into multiple clusters. While this is a good observation about the nature and complexity of music, it is not exactly what we were hoping to achieve. We can now map these clusters to the artists' names and check whether the excess clusters are simply divisions within each artist's catalogue, perhaps changes in style or era of their discography, or whether the algorithm has dismissed artist boundaries altogether while grouping:
first_result <- data.frame(cluster = c3$Best.partition, artist = artist_name_pc[[1]])
unique_counts <- first_result %>%
  group_by(cluster) %>%
  summarise(
    unique = n_distinct(artist),
    total = n()
  )
kable(unique_counts)
| cluster | unique | total |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 3 | 7 |
| 3 | 3 | 5 |
| 4 | 3 | 6 |
| 5 | 3 | 6 |
| 6 | 2 | 2 |
| 7 | 2 | 6 |
| 8 | 2 | 3 |
| 9 | 1 | 2 |
| 10 | 1 | 1 |
| 11 | 1 | 1 |
Again, this is not exactly the result we were hoping for: many of the clusters include more than one artist.
4.2 Clustering tests
We can now try calculating the silhouette score to check for the optimal number of clusters. The overall silhouette score for a clustering solution is the mean of the silhouette coefficients over all \(N\) data points:
\[S = \frac{1}{N} \sum_{i=1}^{N} s(i)\] where \(s(i)\) is the silhouette of a single data point:
\[s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\]
where \(a(i)\) represents the average distance from point \(i\) to other points within its own cluster, and \(b(i)\) is the smallest average distance from \(i\) to points in a different cluster.
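For a single candidate \(k\) the same score can be computed by hand; a minimal sketch, assuming the cluster package is available (the k-means fit and \(k = 4\) here are only illustrative, not the call behind the plot below):
set.seed(123)
km_tmp <- kmeans(data_pc.n, centers = 4, nstart = 25)        # illustrative clustering for k = 4
sil <- cluster::silhouette(km_tmp$cluster, dist(data_pc.n))  # s(i) for every observation
mean(sil[, "sil_width"])                                     # the overall silhouette score S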
opt<-Optimal_Clusters_KMeans(data_pc.n, max_clusters=11, plot_clusters = TRUE, criterion="silhouette")
The silhouettes reveal that, contrary to our expectations, \(n=4\) produces by far the lowest score. Additionally, no number of clusters surpasses a score of \(0.2\), indicating poor separability in the raw data. We can try to verify this further by checking the Hopkins statistic and visualizing the distance matrix.
set.seed(123)
verification_statistics <- get_clust_tendency(data_pc.n, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
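The returned list stores both the statistic and the ordered dissimilarity plot requested with graph = TRUE; they can be pulled out directly:
verification_statistics$hopkins_stat   # the Hopkins statistic quoted below
verification_statistics$plot           # the ordered dissimilarity image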
The Hopkins statistic for our data equals 0.793. This is great news: the closer it is to one, the better the chances that the data is genuinely clusterable rather than just random noise. We can verify this once more by plotting the visual dissimilarity matrix.
d <- get_dist(data_pc.n, method = "euclidean")
fviz_dist(d, show_labels = FALSE) + labs(title = "Distance Matrix of Data Before PCA")
Here we can see more clearly what the structure of the data really is. Blue values represent points positioned far apart from each other (solid blue stripes indicate outliers). The red diagonal represents the distance between each point and itself, which is always zero. Despite the presence of outliers, we can still see some reddish squares along the diagonal, which suggest that there is similarity within the data.
4.3 PCA
Let us try reducing the dimensions of our data with PCA (principal component analysis); this may help with the clustering by simplifying the numerical operations. PCA is used for exploratory data analysis: it reduces the number of variables by calculating the eigenvectors and eigenvalues of the data's covariance matrix. The eigenvectors serve as the new axes and the eigenvalues represent the significance (variance) of each new variable (principal component).
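To make the eigenvector/eigenvalue description concrete, the same decomposition can be written out by hand; a short sketch on the normalized data, equivalent to the prcomp call below up to the sign of each axis:
eig <- eigen(cov(data_pc.n))               # eigendecomposition of the covariance matrix
eig$values                                 # eigenvalues = variances of the components
scores <- scale(data_pc.n, center = TRUE, scale = FALSE) %*% eig$vectors
head(round(scores, 2))                     # should match data_pc.pca$x below, up to column signs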
data_pc.pca <- prcomp(data_pc.n, scale = FALSE)
kable(head(round(data_pc.pca$x, 2)))
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
|---|---|---|---|---|---|---|---|
| -0.66 | 3.39 | -0.78 | -4.96 | 1.79 | 0.95 | 0.17 | -0.09 |
| 0.24 | -0.39 | -1.39 | -0.90 | -1.33 | -0.83 | 0.25 | -0.11 |
| 0.62 | 1.02 | -1.74 | 0.66 | -0.05 | -0.74 | -0.47 | -0.53 |
| -0.32 | 1.00 | -1.25 | 1.02 | 0.45 | -0.76 | 0.45 | -0.60 |
| 0.03 | 0.72 | -1.29 | 0.52 | 0.09 | -0.61 | -0.57 | 0.50 |
| 0.02 | 0.92 | 0.83 | 0.02 | -0.86 | -0.20 | 1.26 | -0.04 |
summary(data_pc.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.4883 1.2381 1.1731 1.0414 0.77952 0.7736 0.66040
## Proportion of Variance 0.2769 0.1916 0.1720 0.1356 0.07596 0.0748 0.05452
## Cumulative Proportion 0.2769 0.4685 0.6405 0.7761 0.85205 0.9268 0.98137
## PC8
## Standard deviation 0.38609
## Proportion of Variance 0.01863
## Cumulative Proportion 1.00000
# we predict PCA values for test data based on train data linear transformation
data_test.pca <- predict(data_pc.pca, newdata = data_test.n)
A common rule of thumb for PCA (the Kaiser criterion) is to keep only the principal components with variance greater than one. Following this logic, in further analysis we will limit the dataset to the first 4 principal components, which carry about 78% of the original dataset's variance.
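The rule can be checked directly on the fitted prcomp object:
data_pc.pca$sdev^2                                     # variance (eigenvalue) of each component
sum(data_pc.pca$sdev^2 > 1)                            # components with variance above one (4 here)
cumsum(data_pc.pca$sdev^2) / sum(data_pc.pca$sdev^2)   # cumulative share of variance (~0.78 at PC4)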
data_pc.pca_reduced <- data_pc.pca$x[, 1:4]
data_test.pca_reduced <- data_test.pca[, 1:4]
fviz_pca_var(data_pc.pca, col.var="wheat4")
In the graph above we can see the share of each variable in the first and second principal components. For example, high values of PC1 correspond to high acousticness but low energy. Additionally, we can plot the feature contributions for each of the principal components. Let us look at a couple of examples:
for (i in 3:4) {
  p <- fviz_contrib(data_pc.pca,
                    choice = "var",
                    axes = i,
                    title = paste("Contribution of variables to Dim", i))
  print(p)
}
We can also try plotting all our observations based on their values of the first two principal components:
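The chunk producing this plot is not reproduced here; one possible way to draw it with factoextra, coloring the points by artist (the coloring is an illustrative choice):
fviz_pca_ind(data_pc.pca,
             geom = "point",
             habillage = as.factor(artist_name_pc[[1]]),   # color observations by artist
             addEllipses = FALSE,
             title = "Observations on the first two principal components")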
Once again we can see some possibility of proper clustering. It is important to keep in mind that there are two additional components not shown in the graph, which will still be included in the further analysis. With that established, let us use the K-means algorithm and impose 4 clusters to avoid getting lost in the different nuances of each artist's discography.
4.4 K-means
The K-means algorithm aims to partition \(n\) observations into \(k\) clusters in which each observation belongs to the cluster with the nearest mean (centroid). The objective is to minimize the Within-Cluster Sum of Squares (WCSS):
\[\underset{C_1,\dots,C_k}{\arg\min}\ J = \sum_{j=1}^{k} \sum_{x_i \in C_j} ||x_i - \mu_j||^2\]
Where:
- \(J\) is the objective function (sum of squared errors).
- \(k\) is the number of clusters.
- \(C_j\) is the set of points belonging to cluster \(j\).
- \(x_i\) is a data point in cluster \(C_j\).
- \(\mu_j\) is the centroid (mean) of cluster \(C_j\).
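As a quick sanity check, the objective \(J\) can be read off a fitted k-means object; a sketch with plain stats::kmeans (not the eclust call used below):
set.seed(123)
km_check <- kmeans(data_pc.pca_reduced, centers = 4, nstart = 25)
km_check$tot.withinss    # J: the total within-cluster sum of squares
sum(km_check$withinss)   # the same value, summed over the k = 4 clusters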
set.seed(123)
kmeans_result <- eclust(data_pc.pca_reduced, "kmeans", hc_metric="euclidean", k=4)
second_result<- table(cluster = kmeans_result$cluster, artist = artist_name_pc[[1]])
kable(second_result)
| cluster | Bad Bunny | Harry Styles | Kendrick Lamar | Morgan Wallen |
|---|---|---|---|---|
| 1 | 2 | 4 | 0 | 4 |
| 2 | 0 | 2 | 5 | 0 |
| 3 | 3 | 3 | 0 | 6 |
| 4 | 5 | 1 | 5 | 0 |
Again, no luck: the divide is not the one we would like. Let us try a different approach, an algorithm similar to K-means: K-medoids (PAM).
4.5 K-medoids (PAM)
The two algorithms, K-means and PAM, are similar in nature, but instead of choosing an arbitrary point in space as the center of each cluster, K-medoids chooses one of the actual data points (a medoid). This approach might help us, as such a solution makes the results more resistant to outliers.
pam_result <- pam(data_pc.pca_reduced, k = 4)
fviz_cluster(pam_result)
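Unlike K-means centroids, the PAM centers are real observations; the fitted object from cluster::pam records which songs were picked:
pam_result$id.med    # row indices of the 4 songs chosen as medoids
pam_result$medoids   # their coordinates in the reduced PCA space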
pam_table <- table(cluster = pam_result$clustering, artist = artist_name_pc[[1]])
kable(pam_table)
| cluster | Bad Bunny | Harry Styles | Kendrick Lamar | Morgan Wallen |
|---|---|---|---|---|
| 1 | 7 | 2 | 4 | 0 |
| 2 | 1 | 5 | 3 | 3 |
| 3 | 1 | 3 | 0 | 7 |
| 4 | 1 | 0 | 3 | 0 |
A bit better. This time we could actually try naming the clusters: the first one could be the Bad Bunny "reggaeton" cluster, number three the Morgan Wallen "country" cluster and number four the Kendrick Lamar "rap" cluster. The clusters are not clearly separated, though; there are still songs that break the boundaries of this clustering, and cluster two in particular contains songs by all four artists. This might simply be the truth about popular music in the 2020s: most hits are musically very alike. We can try verifying this by seeing where the already trained algorithm will assign a few new songs by the same artists.
pam_train_kcca <- as.kcca(pam_result, data_pc.pca_reduced)
pam_pred <- predict(pam_train_kcca, data_test.pca_reduced)
pam_table_test <- table(cluster = pam_pred, artist = artist_name_test)
kable(pam_table_test)
| cluster | Bad Bunny | Harry Styles | Kendrick Lamar | Morgan Wallen |
|---|---|---|---|---|
| 1 | 3 | 1 | 0 | 0 |
| 2 | 1 | 3 | 1 | 1 |
| 3 | 4 | 3 | 0 | 0 |
| 4 | 1 | 0 | 1 | 0 |
It is probably time to use a more scientific test than the eye test to properly determine whether the clustering was successful. Let us combine the train and test data to get the full picture of the analysis and calculate the Rand index. It is a measure used to assess the similarity between two data clusterings. It operates by comparing all pairs of observations between two partitions (denoted \(t_0\) and \(t_1\)) to determine consistency in their grouping.
For any given pair of observations, there are four possible outcomes based on whether they belong to the same cluster or different clusters in the two partitions:
- \(a\): The number of pairs that are in the same cluster in \(t_0\) and remain in the same cluster in \(t_1\).
- \(b\): The number of pairs that are in different clusters in \(t_0\) and remain in different clusters in \(t_1\).
- \(c\): The number of pairs that are in the same cluster in \(t_0\) but move to different clusters in \(t_1\).
- \(d\): The number of pairs that are in different clusters in \(t_0\) but move to the same cluster in \(t_1\).
The Rand Index (\(RI\)) is calculated as the ratio of consistent pairs (agreements) to total pairs:
\[RI = \frac{a + b}{a + b + c + d}\]
- \(RI = 1\): Indicates that the partitions are identical (perfect agreement); in this case \(c\) and \(d\) are both zero.
- \(RI = 0\): Indicates that the partitions do not agree on any pair of points.
The Rand Index specifically evaluates pairs of points and is insensitive to relabelling (i.e., changing the names of the clusters does not affect the score).
result_together <- c(pam_result$clustering, pam_pred)
artist_together <- c(artist_name_pc[[1]], artist_name_test)
table_together <- table(cluster = result_together, artist = artist_together)
kable(table_together)
| cluster | Bad Bunny | Harry Styles | Kendrick Lamar | Morgan Wallen |
|---|---|---|---|---|
| 1 | 10 | 3 | 4 | 0 |
| 2 | 2 | 8 | 4 | 4 |
| 3 | 5 | 6 | 0 | 7 |
| 4 | 2 | 0 | 4 | 0 |
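The chunk computing the index is not shown in the text; a direct transcription of the formula above into R, as a sketch:
rand_index <- function(x, y) {
  same_x <- outer(x, x, "==")                           # pairs grouped together in the first partition
  same_y <- outer(y, y, "==")                           # pairs grouped together in the second partition
  agree <- same_x == same_y                             # agreements on a pair: a + b in the formula
  sum(agree[upper.tri(agree)]) / choose(length(x), 2)   # (a + b) / (a + b + c + d)
}
rand_index(result_together, artist_together)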
Well, unfortunately the Rand index of 0.103 means the clustering is almost as bad as random guessing. This time we have no choice but to accept that, to a computer looking strictly at statistics like "danceability" or "instrumentalness", there is almost no difference between the most popular acts. Whether the cause is the inability of computers to distinguish between artists (or genres) in popular music, or the fact that the Spotify-created "audio feature" indicators may be lackluster in truly reflecting a song's nature, we were unable to recreate the proper divide between the songs. Nevertheless, we have managed to split the songs into categories: cluster one seems to represent more up-tempo music, clusters two and three focus more on melody, and cluster four seems to contain the outliers, songs that differ from the English-speaking mainstream.