1. Introduction

The goal of this project is to analyze audio features of Spotify tracks to identify distinct mood clusters using unsupervised learning techniques.

Instead of relying on predefined genres, clustering algorithms are applied to group songs based on sonic characteristics such as energy, danceability, and acousticness.

1.1 Dataset

The data used in this analysis is a dataset of audio features extracted from Spotify tracks. It contains numerical variables describing sonic characteristics (e.g., danceability, energy, tempo) that are used to determine the mood of the music. The dataset is imported directly from Kaggle: https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset
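
The report's setup chunk is not shown, so the following is an inferred, minimal list of the R packages assumed by the code below; adjust it to match the actual environment.

# Packages assumed by the code in this report (inferred from the functions
# used below; install any missing package with install.packages())
library(dplyr)       # distinct(), sample_n(), select()
library(ggplot2)     # ggtitle(), theme_classic()
library(gridExtra)   # grid.arrange()
library(corrplot)    # corrplot()
library(cluster)     # pam(), clara(), agnes(), silhouette(), coef.hclust()
library(factoextra)  # eclust(), fviz_*(), get_clust_tendency()
library(fpc)         # cluster.stats()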

# Loading the dataset
if (file.exists("dataset.csv")) {
  df_raw <- read.csv("dataset.csv")
} else {
  stop("File 'dataset.csv' not found.")
}

# Display first few rows (optional check)
head(df_raw)
##   X               track_id                artists
## 1 0 5SuOikwiRyPMVoIQDJUgSV            Gen Hoshino
## 2 1 4qPNDBW1i3p13qLCt0Ki3A           Ben Woodward
## 3 2 1iJBSr7s7jYXzM8EGcbK5b Ingrid Michaelson;ZAYN
## 4 3 6lfxq3CG4xtTiEg7opyCyx           Kina Grannis
## 5 4 5vjLSffimiIP26QG5WcN2K       Chord Overstreet
## 6 5 01MVOl9KtVTNfFiBU9I7dc           Tyrone Wells
##                                               album_name
## 1                                                 Comedy
## 2                                       Ghost (Acoustic)
## 3                                         To Begin Again
## 4 Crazy Rich Asians (Original Motion Picture Soundtrack)
## 5                                                Hold On
## 6                                   Days I Will Remember
##                   track_name popularity duration_ms explicit danceability
## 1                     Comedy         73      230666    False        0.676
## 2           Ghost - Acoustic         55      149610    False        0.420
## 3             To Begin Again         57      210826    False        0.438
## 4 Can't Help Falling In Love         71      201933    False        0.266
## 5                    Hold On         82      198853    False        0.618
## 6       Days I Will Remember         58      214240    False        0.688
##   energy key loudness mode speechiness acousticness instrumentalness liveness
## 1 0.4610   1   -6.746    0      0.1430       0.0322         1.01e-06   0.3580
## 2 0.1660   1  -17.235    1      0.0763       0.9240         5.56e-06   0.1010
## 3 0.3590   0   -9.734    1      0.0557       0.2100         0.00e+00   0.1170
## 4 0.0596   0  -18.515    1      0.0363       0.9050         7.07e-05   0.1320
## 5 0.4430   2   -9.681    1      0.0526       0.4690         0.00e+00   0.0829
## 6 0.4810   6   -8.807    1      0.1050       0.2890         0.00e+00   0.1890
##   valence   tempo time_signature track_genre
## 1   0.715  87.917              4    acoustic
## 2   0.267  77.489              4    acoustic
## 3   0.120  76.332              4    acoustic
## 4   0.143 181.740              3    acoustic
## 5   0.167 119.949              4    acoustic
## 6   0.666  98.017              4    acoustic

2. Data Preprocessing

2.1 Loading and Cleaning

# The raw data was already loaded in Section 1.1; here we keep only unique
# tracks, retaining the first occurrence of each track_id
df_clean <- df_raw %>% 
  distinct(track_id, .keep_all = TRUE)

2.2 Feature Selection and Sampling

Nine audio features are selected for clustering: danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, and tempo.

audio_features <- c(
  "danceability", "energy", "loudness", "speechiness",
  "acousticness", "instrumentalness", "liveness",
  "valence", "tempo"
)

set.seed(123)
df_sample <- df_clean %>% 
  sample_n(500) %>% 
  select(all_of(audio_features))

A random sample of 500 observations is taken. While the original dataset is much larger, sampling is necessary because hierarchical clustering builds a full pairwise distance matrix, whose size and runtime grow quadratically with the number of tracks. A sample of 500 was judged representative enough to let the algorithms run efficiently while still producing clear, readable results.

2.3 Data Scaling

The scale() function transforms all values so they have a mean of 0 and a standard deviation of 1. This ensures that every musical feature has an equal “vote” in determining the clusters.

df_scaled <- scale(df_sample)
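
As a quick optional check, each column of df_scaled should now have a mean of approximately 0 and a standard deviation of 1:

# Optional check: column means should be ~0 and standard deviations ~1
round(colMeans(df_scaled), 3)
round(apply(df_scaled, 2, sd), 3)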

3. Exploratory Data Analysis (EDA)

Correlation Matrix helps to see which features naturally move together (e.g., high energy often correlates with high loudness).

M <- cor(df_scaled)
corrplot(
  M, method = "number", number.cex = 0.75,
  order = "hclust", type = "upper"
)

# Scatter plot of each scaled feature against its observation index (3 x 3 grid)
par(mfrow = c(3, 3))
for (i in colnames(df_scaled)) {
  plot(
    df_scaled[, i],
    main = paste("Plot of", i),
    ylab = i,
    col = "steelblue",
    pch = 19,
    cex = 0.5
  )
}

par(mfrow = c(1, 1))

The correlation matrix shows a strong positive link between energy and loudness and a clear negative one between energy and acousticness, allowing for a sharp distinction between intense and calm tracks. Since no features are perfectly redundant, each variable provides unique and necessary information to accurately identify different musical moods.

The individual feature plots show substantial spread across all audio features, confirming that the 500-track sample covers a diverse range of musical styles.
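
That spread can also be checked numerically, for example via the per-feature standard deviations of the raw sample (an optional diagnostic):

# Per-feature standard deviations of the raw (unscaled) sample
round(apply(df_sample, 2, sd), 3)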

4. Clustering Tendency (Hopkins Statistic)

set.seed(123)
hopkins_res <- get_clust_tendency(
  df_scaled, n = 123, graph = TRUE,
  gradient = list(low = "red", mid = "white", high = "blue")
)

paste("Hopkins Statistic:", round(hopkins_res$hopkins_stat, 2))
## [1] "Hopkins Statistic: 0.77"

The Hopkins statistic of 0.77, well above the 0.5 expected for uniformly random data, confirms that the dataset has a genuine clustering tendency rather than random structure.
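
For reference, the Hopkins statistic compares nearest-neighbor distances in the real data with those of points placed uniformly at random over the same space. In a common formulation, consistent with the convention used here (values near 0.5 indicate random data, values near 1 indicate clustered data),

\[
H = \frac{\sum_{i=1}^{m} u_i}{\sum_{i=1}^{m} u_i + \sum_{i=1}^{m} w_i},
\]

where \(u_i\) is the distance from the \(i\)-th artificial (uniformly generated) point to its nearest real observation and \(w_i\) is the distance from the \(i\)-th sampled real observation to its nearest neighbor in the data.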

5. Optimal Number of Clusters

The next step is to determine the optimal number of clusters, using the average silhouette method for each of the four algorithms.

p1 <- fviz_nbclust(df_scaled, kmeans, method = "silhouette") +
  ggtitle("Optimal k (K-Means)") + theme_classic()

p2 <- fviz_nbclust(df_scaled, pam, method = "silhouette") +
  ggtitle("Optimal k (PAM)") + theme_classic()

p3 <- fviz_nbclust(df_scaled, clara, method = "silhouette") +
  ggtitle("Optimal k (CLARA)") + theme_classic()

p4 <- fviz_nbclust(df_scaled, hcut, method = "silhouette") +
  ggtitle("Optimal k (Hierarchical)") + theme_classic()

grid.arrange(p1, p2, p3, p4, ncol = 1)

The analysis consistently identified \(k = 4\) as the optimal number of clusters across all tested algorithms, indicating a highly stable and natural division of musical moods within the dataset. While Ward’s Method showed the strongest internal structural cohesion with an agglomerative coefficient of 0.969, the DIANA algorithm achieved a better average silhouette width of 0.223.

6. Partitioning Algorithms (Visual Comparison)

This section visualizes how K-Means, PAM, and CLARA distribute the 500 tracks into four distinct clusters using Principal Component Analysis (PCA) to project the multi-dimensional audio features onto a two-dimensional space.
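
fviz_cluster() performs this projection internally when the data has more than two dimensions; an optional check of how much variance the two-dimensional view retains:

# Cumulative share of variance captured by the first two principal components
pca <- prcomp(df_scaled)
round(summary(pca)$importance["Cumulative Proportion", 2], 3)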

set.seed(123)

km4 <- kmeans(df_scaled, centers = 4, nstart = 25)
pam_res <- pam(df_scaled, k = 4)
clara_res <- clara(df_scaled, k = 4, samples = 50)

p_km <- fviz_cluster(km4, data = df_scaled) + ggtitle("K-Means")
p_pam <- fviz_cluster(pam_res, data = df_scaled) + ggtitle("PAM")
p_clara <- fviz_cluster(clara_res, data = df_scaled) + ggtitle("CLARA")

grid.arrange(p_km, p_pam, p_clara, ncol = 1)

The visual comparison reveals similar grouping patterns across the three algorithms, with K-Means producing the most evenly balanced clusters across the audio feature space. Although there is some overlap between groups, reflecting the natural, fluid transitions between musical moods, each cluster maintains a distinct core representing specific sonic characteristics.

7. Hierarchical Clustering (Agglomerative) Comparison

Hierarchical clustering is a method that builds a hierarchy of clusters. In the Agglomerative (bottom-up) approach, the algorithm starts by treating every single music track as its own individual cluster. It then repeatedly merges the two “closest” clusters together until all tracks are joined into one large group. This process is visualized using a Dendrogram, a tree-like diagram that shows the relationships and distances between clusters.

Single Linkage defines the distance between two clusters as the distance between their closest members.

Complete Linkage uses the distance between the farthest members of two clusters.

Ward’s Method minimizes the total within-cluster variance. At each step it chooses the merger that produces the smallest increase in the total within-cluster sum of squares.

hc_single <- eclust(
  df_scaled, k = 4, FUNcluster = "hclust",
  hc_method = "single", graph = FALSE
)

hc_complete <- eclust(
  df_scaled, k = 4, FUNcluster = "hclust",
  hc_method = "complete", graph = FALSE
)

hc_ward <- eclust(
  df_scaled, k = 4, FUNcluster = "hclust",
  hc_method = "ward.D2", graph = FALSE
)

p1 <- fviz_dend(hc_single, show_labels = FALSE, k = 4,
                k_colors = "jco", rect = TRUE,
                main = "Single Linkage")

p2 <- fviz_dend(hc_complete, show_labels = FALSE, k = 4,
                k_colors = "jco", rect = TRUE,
                main = "Complete Linkage")

p3 <- fviz_dend(hc_ward, show_labels = FALSE, k = 4,
                k_colors = "jco", rect = TRUE,
                main = "Ward's Method")

grid.arrange(p1, p2, p3, ncol = 1)
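
As a cross-check, the agglomerative coefficient of each linkage can also be computed directly with cluster::agnes (a brief sketch; agnes's "ward" method corresponds to hclust's "ward.D2", so the values should be very close to those reported below):

# Agglomerative coefficient per linkage method (higher = stronger structure)
linkages <- c(single = "single", complete = "complete", ward = "ward")
round(sapply(linkages, function(m) agnes(df_scaled, method = m)$ac), 3)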

Ward’s method was the most successful, achieving a high agglomerative coefficient (0.969). This means it built a very cohesive, stable hierarchical structure for the mood profiles.

8. Efficiency Testing (Ward’s Method)

After generating the hierarchical structure, it is necessary to evaluate the quality and stability of the resulting clusters. This section utilizes internal validation metrics to determine how well the songs are grouped.

dm <- dist(df_scaled, method = "euclidean")

clusterCut_ward <- cutree(hc_ward, 4)
table(clusterCut_ward)
## clusterCut_ward
##   1   2   3   4 
## 306  98  29  67
hc_stats <- cluster.stats(dm, clusterCut_ward)

cat("Average silhouette width:",
    round(hc_stats$avg.silwidth, 3), "\n")
## Average silhouette width: 0.16
cat("Average silhouette widths per cluster:\n")
## Average silhouette widths per cluster:
print(round(hc_stats$clus.avg.silwidths, 3))
##     1     2     3     4 
## 0.092 0.305 0.308 0.191
dunn <- hc_stats$min.separation / hc_stats$max.diameter
cat("Dunn Index:", round(dunn, 4), "\n")
## Dunn Index: 0.0745
cat("Agglomerative coefficient:",
    round(coef.hclust(hc_ward), 3))
## Agglomerative coefficient: 0.969

The cluster distribution output reveals the size of each group. Cluster 1 is the largest (\(n = 306\)), while Cluster 3 is the smallest (\(n = 29\)), suggesting that some musical moods are much more common in the dataset than others.

The average silhouette width (0.16) measures how similar an object is to its own cluster compared to other clusters. A value of \(0.16\) indicates that while a structure exists, there is significant overlap between groups, which is expected given how fluidly musical moods blend into one another.

The Dunn index (0.0745) measures how compact and well-separated the clusters are. A low value suggests that the boundaries between different musical moods are fluid rather than rigid.
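
For reference, the two statistics are defined as

\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad
D = \frac{\min_{k \neq l} d(C_k, C_l)}{\max_{m} \operatorname{diam}(C_m)},
\]

where \(a(i)\) is the mean distance from track \(i\) to the other members of its own cluster, \(b(i)\) is the mean distance to the nearest other cluster, \(d(C_k, C_l)\) is the smallest distance between members of clusters \(k\) and \(l\), and \(\operatorname{diam}(C_m)\) is the largest within-cluster distance. The Dunn index above is exactly the min.separation / max.diameter ratio computed in the code.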

Exceptionally high score of Agglomerative Coefficient (0.969) indicates a very strong clustering structure. It confirms that Ward’s method was highly effective.

9. Divisive Hierarchical Clustering (DIANA)

In contrast to the agglomerative approach, DIANA (Divisive Analysis) operates top-down, starting with a single cluster containing all observations and recursively splitting it.

hc_diana <- eclust(df_scaled, k = 4, FUNcluster = "diana", graph = FALSE)

pltree(hc_diana, cex = 0.6, hang = -1,
       main = "Dendrogram of DIANA")

hc_stats_diana <- cluster.stats(dm, hc_diana$cluster)

cat("Average silhouette width (DIANA):",
    round(hc_stats_diana$avg.silwidth, 3), "\n")
## Average silhouette width (DIANA): 0.223
cat("Average silhouette widths per cluster (DIANA):\n")
## Average silhouette widths per cluster (DIANA):
print(round(hc_stats_diana$clus.avg.silwidths, 3))
##     1     2     3     4 
## 0.208 0.262 0.276 0.638
dunn_diana <- hc_stats_diana$min.separation /
              hc_stats_diana$max.diameter

cat("Dunn Index (DIANA):",
    round(dunn_diana, 4), "\n")
## Dunn Index (DIANA): 0.0683
cat("Divisive coefficient:",
    round(coef(hc_diana), 3))
## Divisive coefficient: 0.891

DIANA achieved an average silhouette width of 0.223, notably higher than the 0.16 obtained with Ward’s method. This indicates that the divisive approach created clusters where tracks are more clearly assigned to their respective groups.
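
For convenience, the key validation figures for the two hierarchical approaches can be collected side by side (the values below are simply transcribed from the output in Sections 8 and 9):

# Summary of the hierarchical results reported above (values copied from output)
data.frame(
  method         = c("Ward (agglomerative)", "DIANA (divisive)"),
  avg_silhouette = c(0.160, 0.223),
  dunn_index     = c(0.0745, 0.0683),
  coefficient    = c(0.969, 0.891)  # agglomerative / divisive coefficient
)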

10. Final Silhouette Comparison

The final step of the analysis involves a direct comparison of the silhouette profiles for the three partitioning algorithms: K-Means, PAM, and CLARA. Silhouette plots provide a visual representation of how well each individual track fits into its assigned cluster.

sil_km <- silhouette(km4$cluster, dm)
sil_pam <- silhouette(pam_res$clustering, dm)
sil_clara <- silhouette(clara_res$clustering, dm)

p1 <- fviz_silhouette(sil_km, print.summary = FALSE) +
  ggtitle("K-Means")

p2 <- fviz_silhouette(sil_pam, print.summary = FALSE) +
  ggtitle("PAM")

p3 <- fviz_silhouette(sil_clara, print.summary = FALSE) +
  ggtitle("CLARA")

grid.arrange(p1, p2, p3, ncol = 1)
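
The visual comparison can be condensed into one number per method, the average silhouette width, computed directly from the silhouette objects created above:

# Average silhouette width per partitioning method
round(c(
  kmeans = mean(sil_km[, "sil_width"]),
  pam    = mean(sil_pam[, "sil_width"]),
  clara  = mean(sil_clara[, "sil_width"])
), 3)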

11. Final Conclusion

The silhouette analysis confirms that K-Means is the most reliable partitioning method for this dataset, as it exhibits the highest consistency and the fewest misclassified points (negative silhouette widths) compared to PAM or CLARA. Furthermore, all three silhouette plots show four clearly delineated cluster profiles, which supports the initial choice of \(k = 4\) clusters.

This confirms that unsupervised learning can effectively discover meaningful musical mood structures without manual genre labeling.