FMA Audio Features - A clustering analysis

# Imports
library(dplyr) # Data manipulation
library(factoextra) # Hopkins test and clustering visualisation
library(cluster) # Silhouette evaluation

# Set language to English
Sys.setenv(LANGUAGE='en')
metadata_dir <- "/Users/teo/Datasets/fma/data"

Data Subset Description (FMA)

This project uses a subset of the Free Music Archive (FMA) dataset, specifically audio features. The dataset contains 8 audio features provided by Echonest (now Spotify) for 13,129 tracks:

acousticness: Confidence measure of whether the track is acoustic
danceability: How suitable a track is for dancing
energy: Perceptual measure of intensity and activity
instrumentalness: Predicts whether a track contains no vocals
liveness: Detects the presence of an audience in the recording
speechiness: Detects the presence of spoken words
tempo: Overall estimated tempo in beats per minute (BPM)
valence: Musical positiveness conveyed by a track

More information of this dataset can be found on the following links:

Github repositoy: https://github.com/mdeff/fma?tab=readme-ov-file
Website: https://freemusicarchive.org/
Codebook paper: https://arxiv.org/abs/1612.01840

Data Preparation

Data Loading and Subset Selection

echonest_raw <- read.csv(file.path(metadata_dir, "echonest.csv"), sep = ",")

# Find column indices
track_id_col <- which(echonest_raw[3, ] == "track_id")
audio_cols <- which(echonest_raw[1, ] == "audio_features")

# Set track_id column name
echonest_raw[2, 1] <- "track_id"

# Select relevant columns
echonest_clean <- echonest_raw %>% 
  select(track_id_col, audio_cols)

echonest_clean <- echonest_clean %>% 
  setNames(echonest_clean[2, ]) %>%
  slice(-c(1, 2, 3))

head(echonest_clean)

##   track_id acousticness danceability       energy instrumentalness     liveness
## 1        2 0.4166752327 0.6758939853 0.6344762684     0.0106280683 0.1776465712
## 2        3 0.3744077685 0.5286430621 0.8174611317     0.0018511032 0.1058799438
## 3        5 0.0435668989 0.7455658702 0.7014699916     0.0006967990 0.3731433124
## 4       10 0.9516699648 0.6581786543 0.9245251615     0.9654270154 0.1154738842
## 5      134 0.4522173071 0.5132380502 0.5604099311     0.0194426943 0.0965666940
## 6      139 0.1065495253 0.2609111726 0.6070668636     0.8350869898 0.2236762711
##    speechiness          tempo      valence
## 1 0.1593100648 165.9220000000 0.5766609880
## 2 0.4618181276 126.9570000000 0.2692402421
## 3 0.1245953419 100.2600000000 0.6216612236
## 4 0.0329852191 111.5620000000 0.9635898919
## 5 0.5255193792 114.2900000000 0.8940722715
## 6 0.0305692764 196.9610000000 0.1602670903

Data Cleaning

# Convert to appropriate data types
echonest_clean <- echonest_clean %>% 
  mutate(
    track_id = as.integer(track_id),
    acousticness = as.numeric(acousticness),
    danceability = as.numeric(danceability),
    energy = as.numeric(energy),
    instrumentalness = as.numeric(instrumentalness),
    liveness = as.numeric(liveness),
    speechiness = as.numeric(speechiness),
    tempo = as.numeric(tempo),
    valence = as.numeric(valence)
  )

# Check for missing values
colSums(is.na(echonest_clean))

##         track_id     acousticness     danceability           energy 
##                0                0                0                0 
## instrumentalness         liveness      speechiness            tempo 
##                0                0                0                0 
##          valence 
##                0

# Check for duplicates
sum(duplicated(echonest_clean$track_id))

## [1] 0

# Summary statistics
summary(echonest_clean)

##     track_id       acousticness        danceability         energy         
##  Min.   :     2   Min.   :0.0000009   Min.   :0.05131   Min.   :2.017e-05  
##  1st Qu.: 12986   1st Qu.:0.1037726   1st Qu.:0.34476   1st Qu.:3.213e-01  
##  Median : 28097   Median :0.5739848   Median :0.48563   Median :5.491e-01  
##  Mean   : 34031   Mean   :0.5246876   Mean   :0.48729   Mean   :5.375e-01  
##  3rd Qu.: 45021   3rd Qu.:0.9207270   3rd Qu.:0.62909   3rd Qu.:7.763e-01  
##  Max.   :124911   Max.   :0.9957965   Max.   :0.96864   Max.   :1.000e+00  
##  instrumentalness    liveness       speechiness          tempo       
##  Min.   :0.0000   Min.   :0.0253   Min.   :0.02232   Min.   : 12.75  
##  1st Qu.:0.3235   1st Qu.:0.1014   1st Qu.:0.03693   1st Qu.: 95.97  
##  Median :0.8381   Median :0.1190   Median :0.04902   Median :120.06  
##  Mean   :0.6405   Mean   :0.1878   Mean   :0.09917   Mean   :123.08  
##  3rd Qu.:0.9182   3rd Qu.:0.2110   3rd Qu.:0.08545   3rd Qu.:145.32  
##  Max.   :0.9980   Max.   :0.9803   Max.   :0.96618   Max.   :251.07  
##     valence       
##  Min.   :0.00001  
##  1st Qu.:0.19732  
##  Median :0.41774  
##  Mean   :0.43976  
##  3rd Qu.:0.66558  
##  Max.   :0.99999

All features except tempo are already scaled between 0 and 1.

boxplot(
  echonest_clean$tempo,
  horizontal = TRUE,
  main = "Distribution of Tempo",
  xlab = "Tempo (BPM)"
)

# Summary statistics for tempo
summary(echonest_clean$tempo)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.75   95.97  120.06  123.08  145.32  251.07

While some extreme tempo values (12 BPM or 251 BPM) appear unusual, they could likely represent cases where automatic tempo detection doubled or halved the actual BPM. This does not significantly affect clustering analysis.

# Scale tempo and remove track_id
X_audio <- echonest_clean %>% 
  mutate(tempo = scale(tempo)[, 1]) %>% 
  select(-track_id)

Clustering analysis

Clusterability Assessment: Hopkins Test

Does the data have tendency to form clusters? Or is it maily noise? In order to answer those questions, the Hopkins statistic measures the clustering tendency of data. A value close to 1 indicates strong clustering tendency, while a value near 0.5 suggests random distribution.

set.seed(42)
hopkins <- get_clust_tendency(X_audio, n = 50)
hopkins$hopkins_stat

## [1] 0.7509289

With a Hopkins result of 0.75 > 0.5, the data demonstrates clear clustering tendency. This justify further cluster analysis.

Determining Optimal Number of Clusters

Elbow Method

fviz_nbclust(X_audio, kmeans, method = "wss")

The elbow plot shows a strong decrease between k=1 and k=2, followed by a smaller decrease between k=2 and k=3. From k=3 onwards, the decrease becomes nearly constant, suggesting k=2 or k=3 as optimal values. The elbow plot by itself is not a sufficient method for determining the optimal number of clusters, so we complement it with silhouette analysis.

Silhouette Method

fviz_nbclust(X_audio, kmeans, method = "silhouette")

The silhouette analysis clearly indicates k=2 as the optimal partition, with the highest average silhouette width. The value at k=3 is the second highest, while k=4 shows a notable drop.

Both methods converge on k=2 as the most statistically robust solution. However, this result reflects the nature of the audio features: they describe continuous acoustic properties rather than discrete musical genres. Audio descriptors like energy, tempo, and danceability create gradual transitions rather than sharp categorical boundaries.

Therefore, to explore finer musical structures, it would be also interesting to examine solutions with k=3 and k=5.

K-means Clustering

Clustering with k=2

set.seed(123)
km_2 <- kmeans(X_audio, centers = 2, nstart = 25)

fviz_cluster(km_2, data = X_audio, geom = "point", ellipse = TRUE)

Here, with k=2 solution, we can see two wide clouds of points, with considerable overlap in the 2D projection. We should note that clustering was performed in 8-dimensional space: the 2D visualization is a dimensionality reduction for interpretability. The separation occurs primarily along dimension 2 (y-axis), while cluster 2 (blue) appears more dense and populous than cluster 1 (red).

sil_2 <- silhouette(km_2$cluster, dist(X_audio))
mean(sil_2[, 3])

## [1] 0.3479694

Clustering with k=3

set.seed(42)
km_3 <- kmeans(X_audio, centers = 3, nstart = 25)

fviz_cluster(km_3, data = X_audio, geom = "point", ellipse = TRUE)

With k=3, cluster 2 (green) closely resembles the blue cluster from k=2, while the red cluster from k=2 appears to split into two sub-clusters (orange and blue). The interpretation becomes more challenging with increased overlap, though the main separation still occurs along dimension 2.

sil_3 <- silhouette(km_3$cluster, dist(X_audio))
mean(sil_3[, 3])

## [1] 0.2535154

Clustering with k=5

set.seed(42)
km_5 <- kmeans(X_audio, centers = 5, nstart = 25)

fviz_cluster(km_5, data = X_audio, geom = "point", ellipse = TRUE)

At k=5, cluster boundaries become increasingly fuzzy with substantial overlap in the 2D projection, making the visualization very difficult to interprete. Though, while individual clusters are smaller, dimension 1 begins to contribute more to the separation alongside dimension 2.

sil_5 <- silhouette(km_5$cluster, dist(X_audio))
mean(sil_5[, 3])

## [1] 0.2104938

Silhouette Analysis of Final Partition

Based on statistical evidence, k=2 provides the most robust and interpretable solution. We examine its internal structure using detailed silhouette analysis.

D <- dist(X_audio)
sil_k2 <- silhouette(km_2$cluster, D)
fviz_silhouette(sil_k2)

##   cluster size ave.sil.width
## 1       1 5371          0.32
## 2       2 7758          0.37

mean(sil_k2[, "sil_width"])

## [1] 0.3479694

The silhouette analysis for k=2 yields an average silhouette width of approximately 0.35, indicating moderate but meaningful clustering structure. While the separation between clusters is not perfect, most observations have positive silhouette values, confirming that the two groups capture genuine differences in the audio feature space.

Discussion

The clustering analysis reveals important insights about musical audio features:

Why k=2 rather than genre-based clustering?

Initially, we could expect clustering to group music by genres (rock, jazz, electronic, classical, etc.), which would require many more clusters. However, the audio features describe how music sounds rather than cultural categorization:

A hip-hop instrumental and a chill electronic track might both be “calm,” “low energy,” “slightly danceable,” “instrumental,” and “low tempo”, which results in clustering them together despite different genres.
Two rock songs could differ dramatically: hard rock (loud, energetic, fast tempo) versus soft rock ballad (quiet, calm, slow tempo), which results in placing them in different clusters despite sharing a genre.

Continuous nature of audio features

The audio descriptors (energy, tempo, danceability, valence) capture perceptual characteristics along continuous scales, not categorical labels. This leads to:

Overlapping clusters with gradual transitions
Few well-separated groups rather than many discrete categories
Moderate silhouette values reflecting inherent feature continuity

Given the continuous nature of musical characteristics, the observed level of overlap is expected and does not invalidate the clustering results. The k=2 solution successfully identifies broad acoustic profiles within the dataset.

Conclusion

K-means clustering on audio features reveals that musical tracks naturally organize into a small number of acoustic profiles rather than discrete genre categories. The optimal k=2 partition, supported by both elbow and silhouette methods, demonstrates that continuous acoustic properties create overlapping but meaningful groups in the feature space.