# Imports
library(dplyr) # Data manipulation
library(factoextra) # Hopkins test and clustering visualisation
library(cluster) # Silhouette evaluation
# Set language to English
Sys.setenv(LANGUAGE='en')
metadata_dir <- "/Users/teo/Datasets/fma/data"
This project uses a subset of the Free Music Archive (FMA) dataset, specifically its audio features. The dataset contains 8 audio features provided by The Echo Nest (now part of Spotify) for 13,129 tracks: acousticness, danceability, energy, instrumentalness, liveness, speechiness, tempo, and valence.
More information about this dataset is available on the FMA project page.
echonest_raw <- read.csv(file.path(metadata_dir, "echonest.csv"), sep = ",")
# Find column indices
track_id_col <- which(echonest_raw[3, ] == "track_id")
audio_cols <- which(echonest_raw[1, ] == "audio_features")
# Set track_id column name
echonest_raw[2, 1] <- "track_id"
# Select relevant columns
echonest_clean <- echonest_raw %>%
  select(all_of(c(track_id_col, audio_cols)))
# Promote the second header row to column names, then drop the three header rows
echonest_clean <- echonest_clean %>%
  setNames(as.character(echonest_clean[2, ])) %>%
  slice(-c(1, 2, 3))
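As a quick sanity check (our addition, not part of the original workflow), the cleaned frame should have one row per track and nine columns, track_id plus the eight audio features:
# Sanity check: expect 13,129 rows and 9 columns
dim(echonest_clean)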
head(echonest_clean)
## track_id acousticness danceability energy instrumentalness liveness
## 1 2 0.4166752327 0.6758939853 0.6344762684 0.0106280683 0.1776465712
## 2 3 0.3744077685 0.5286430621 0.8174611317 0.0018511032 0.1058799438
## 3 5 0.0435668989 0.7455658702 0.7014699916 0.0006967990 0.3731433124
## 4 10 0.9516699648 0.6581786543 0.9245251615 0.9654270154 0.1154738842
## 5 134 0.4522173071 0.5132380502 0.5604099311 0.0194426943 0.0965666940
## 6 139 0.1065495253 0.2609111726 0.6070668636 0.8350869898 0.2236762711
## speechiness tempo valence
## 1 0.1593100648 165.9220000000 0.5766609880
## 2 0.4618181276 126.9570000000 0.2692402421
## 3 0.1245953419 100.2600000000 0.6216612236
## 4 0.0329852191 111.5620000000 0.9635898919
## 5 0.5255193792 114.2900000000 0.8940722715
## 6 0.0305692764 196.9610000000 0.1602670903
# Convert to appropriate data types
echonest_clean <- echonest_clean %>%
mutate(
track_id = as.integer(track_id),
acousticness = as.numeric(acousticness),
danceability = as.numeric(danceability),
energy = as.numeric(energy),
instrumentalness = as.numeric(instrumentalness),
liveness = as.numeric(liveness),
speechiness = as.numeric(speechiness),
tempo = as.numeric(tempo),
valence = as.numeric(valence)
)
# Check for missing values
colSums(is.na(echonest_clean))
## track_id acousticness danceability energy
## 0 0 0 0
## instrumentalness liveness speechiness tempo
## 0 0 0 0
## valence
## 0
# Check for duplicates
sum(duplicated(echonest_clean$track_id))
## [1] 0
# Summary statistics
summary(echonest_clean)
## track_id acousticness danceability energy
## Min. : 2 Min. :0.0000009 Min. :0.05131 Min. :2.017e-05
## 1st Qu.: 12986 1st Qu.:0.1037726 1st Qu.:0.34476 1st Qu.:3.213e-01
## Median : 28097 Median :0.5739848 Median :0.48563 Median :5.491e-01
## Mean : 34031 Mean :0.5246876 Mean :0.48729 Mean :5.375e-01
## 3rd Qu.: 45021 3rd Qu.:0.9207270 3rd Qu.:0.62909 3rd Qu.:7.763e-01
## Max. :124911 Max. :0.9957965 Max. :0.96864 Max. :1.000e+00
## instrumentalness liveness speechiness tempo
## Min. :0.0000 Min. :0.0253 Min. :0.02232 Min. : 12.75
## 1st Qu.:0.3235 1st Qu.:0.1014 1st Qu.:0.03693 1st Qu.: 95.97
## Median :0.8381 Median :0.1190 Median :0.04902 Median :120.06
## Mean :0.6405 Mean :0.1878 Mean :0.09917 Mean :123.08
## 3rd Qu.:0.9182 3rd Qu.:0.2110 3rd Qu.:0.08545 3rd Qu.:145.32
## Max. :0.9980 Max. :0.9803 Max. :0.96618 Max. :251.07
## valence
## Min. :0.00001
## 1st Qu.:0.19732
## Median :0.41774
## Mean :0.43976
## 3rd Qu.:0.66558
## Max. :0.99999
All features except tempo already lie between 0 and 1, so only tempo needs rescaling.
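A quick range check (an optional addition) confirms this:
# Optional check: ranges of the bounded features (excluding track_id and tempo)
sapply(select(echonest_clean, -track_id, -tempo), range)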
boxplot(
echonest_clean$tempo,
horizontal = TRUE,
main = "Distribution of Tempo",
xlab = "Tempo (BPM)"
)
# Summary statistics for tempo
summary(echonest_clean$tempo)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.75 95.97 120.06 123.08 145.32 251.07
Some extreme tempo values (12 BPM or 251 BPM) appear unusual, but they likely reflect cases where automatic tempo detection halved or doubled the true BPM. They are kept, as they do not materially affect the clustering analysis.
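One optional way to eyeball these outliers (the 40 and 220 BPM cutoffs below are arbitrary thresholds of our choosing, not part of the original analysis):
# Optional: inspect tracks with implausibly slow or fast tempo estimates
echonest_clean %>%
  filter(tempo < 40 | tempo > 220) %>%
  arrange(tempo) %>%
  head()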
# Scale tempo and remove track_id
X_audio <- echonest_clean %>%
mutate(tempo = scale(tempo)[, 1]) %>%
select(-track_id)
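Note that scale() standardizes tempo to zero mean and unit variance, so its range no longer matches the 0-1 features. A minimal sketch of an alternative we did not use, min-max rescaling tempo into [0, 1] like the other features:
# Alternative (not used here): min-max scale tempo to [0, 1]
X_audio_alt <- echonest_clean %>%
  mutate(tempo = (tempo - min(tempo)) / (max(tempo) - min(tempo))) %>%
  select(-track_id)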
Does the data have a tendency to form clusters, or is it mainly noise? The Hopkins statistic answers this question by measuring the clustering tendency of a dataset. A value close to 1 indicates a strong clustering tendency, while a value near 0.5 suggests a random (uniform) distribution.
set.seed(42)
hopkins <- get_clust_tendency(X_audio, n = 50)
hopkins$hopkins_stat
## [1] 0.7509289
With a Hopkins statistic of 0.75, well above 0.5, the data shows a clear clustering tendency, which justifies further cluster analysis.
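As an additional sanity check (our addition, not part of the original workflow), the same statistic computed on uniformly random data with the same per-feature ranges should land near 0.5:
# Baseline: Hopkins statistic on random uniform data of the same shape
set.seed(42)
X_random <- as.data.frame(lapply(X_audio, function(col) runif(length(col), min(col), max(col))))
get_clust_tendency(X_random, n = 50)$hopkins_stat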
fviz_nbclust(X_audio, kmeans, method = "wss")
The elbow plot shows a strong decrease in within-cluster sum of squares between k=1 and k=2, followed by a smaller decrease between k=2 and k=3. From k=3 onwards the curve flattens, suggesting k=2 or k=3 as candidate values. Since the elbow method alone is not sufficient to determine the optimal number of clusters, we complement it with silhouette analysis.
fviz_nbclust(X_audio, kmeans, method = "silhouette")
The silhouette analysis clearly indicates k=2 as the optimal partition, with the highest average silhouette width. The value at k=3 is the second highest, while k=4 shows a notable drop.
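A third criterion one could consult is the gap statistic, which fviz_nbclust also supports; we did not run it here, as it is computationally heavy on roughly 13,000 rows:
# Optional third check (not run here; slow on this dataset size)
# fviz_nbclust(X_audio, kmeans, method = "gap_stat", nboot = 50)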
Both the elbow and silhouette methods converge on k=2 as the most statistically robust solution. However, this result reflects the nature of the audio features: they describe continuous acoustic properties rather than discrete musical genres. Audio descriptors like energy, tempo, and danceability create gradual transitions rather than sharp categorical boundaries.
Therefore, to explore finer musical structure, it would also be interesting to examine solutions with k=3 and k=5.
set.seed(123)
km_2 <- kmeans(X_audio, centers = 2, nstart = 25)
fviz_cluster(km_2, data = X_audio, geom = "point", ellipse = TRUE)
With the k=2 solution we see two wide clouds of points with considerable overlap in the 2D projection. Note that clustering was performed in the 8-dimensional feature space; the 2D visualization is only a dimensionality reduction for interpretability. The separation occurs primarily along dimension 2 (the y-axis), and cluster 2 (blue) appears denser and more populous than cluster 1 (red).
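fviz_cluster projects the data onto its first two principal components. As an added check (assuming the default centered, unscaled PCA applied when the data has more than two columns), we can see how much variance this 2D view actually captures:
# Variance captured by the first two principal components
pca <- prcomp(X_audio)
summary(pca)$importance[, 1:2]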
# Compute the distance matrix once; it is reused for all silhouette evaluations
D <- dist(X_audio)
sil_2 <- silhouette(km_2$cluster, D)
mean(sil_2[, 3])
## [1] 0.3479694
set.seed(42)
km_3 <- kmeans(X_audio, centers = 3, nstart = 25)
fviz_cluster(km_3, data = X_audio, geom = "point", ellipse = TRUE)
With k=3, cluster 2 (green) closely resembles the blue cluster from k=2, while the red cluster from k=2 appears to split into two sub-clusters (orange and blue). The interpretation becomes more challenging with increased overlap, though the main separation still occurs along dimension 2.
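To back up this visual impression (an added check), cross-tabulating the k=2 and k=3 assignments shows which k=2 cluster each k=3 cluster draws from:
# How k=2 cluster memberships map onto k=3 clusters
table(k2 = km_2$cluster, k3 = km_3$cluster)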
sil_3 <- silhouette(km_3$cluster, D)
mean(sil_3[, 3])
## [1] 0.2535154
set.seed(42)
km_5 <- kmeans(X_audio, centers = 5, nstart = 25)
fviz_cluster(km_5, data = X_audio, geom = "point", ellipse = TRUE)
At k=5, cluster boundaries become increasingly fuzzy, with substantial overlap in the 2D projection making the visualization difficult to interpret. Although the individual clusters are smaller, dimension 1 begins to contribute more to the separation alongside dimension 2.
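The claim that the individual clusters are smaller can be confirmed directly (an added check) from the fitted object:
# Number of tracks assigned to each of the five clusters
km_5$size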
sil_5 <- silhouette(km_5$cluster, D)
mean(sil_5[, 3])
## [1] 0.2104938
Based on statistical evidence, k=2 provides the most robust and interpretable solution. We examine its internal structure using detailed silhouette analysis.
sil_k2 <- silhouette(km_2$cluster, D)
fviz_silhouette(sil_k2)
## cluster size ave.sil.width
## 1 1 5371 0.32
## 2 2 7758 0.37
mean(sil_k2[, "sil_width"])
## [1] 0.3479694
The silhouette analysis for k=2 yields an average silhouette width of approximately 0.35, indicating moderate but meaningful clustering structure. While the separation between clusters is not perfect, most observations have positive silhouette values, confirming that the two groups capture genuine differences in the audio feature space.
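To give the two groups an acoustic interpretation (an added step using the model fitted above), we can inspect the cluster centers; recall that tempo is on a standardized scale while the other features lie in [0, 1]:
# Mean feature values per cluster (tempo is z-scaled)
round(km_2$centers, 3)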
The clustering analysis reveals important insights about musical audio features.

**Why k=2 rather than genre-based clustering?**

Initially, we might expect clustering to group music by genre (rock, jazz, electronic, classical, etc.), which would require many more clusters. However, the audio features describe how music sounds, not how it is culturally categorized.

**Continuous nature of audio features**

The audio descriptors (energy, tempo, danceability, valence) capture perceptual characteristics along continuous scales, not categorical labels. This leads to gradual transitions and overlapping groups rather than sharply separated clusters.
Given the continuous nature of musical characteristics, the observed level of overlap is expected and does not invalidate the clustering results. The k=2 solution successfully identifies broad acoustic profiles within the dataset.
K-means clustering on audio features reveals that musical tracks naturally organize into a small number of acoustic profiles rather than discrete genre categories. The optimal k=2 partition, supported by both elbow and silhouette methods, demonstrates that continuous acoustic properties create overlapping but meaningful groups in the feature space.