Introduction

Music is an important part of human culture, with diverse styles and characteristics that appeal to listeners worldwide. With the rise of streaming platforms like Spotify and data-driven technologies, analyzing and categorizing songs has become more accessible than ever. This project explores clustering techniques for grouping music based on audio features and metadata. It uses the Million Song Dataset, introduced by Bertin-Mahieux, Ellis, Whitman, and Lamere (2011), which contains metadata for a million contemporary popular music tracks. For computational reasons, I use a subset of 10,000 songs (AsmitaRao, 2022), which provides a manageable yet representative sample for analysis. The aim is to uncover hidden patterns and insights within the data.

The Million Song Dataset

The Million Song Dataset is a freely available collection of music metadata designed to support research. The dataset includes:

  • Song Metadata: Titles, IDs, albums, durations, and release years.

  • Artist Information: Names, locations, latitude, longitude, and familiarity scores.

  • Audio Features: Tempo, time signatures, danceability, energy, loudness, key signatures, etc.

  • Popularity Metrics: Hotness and familiarity scores (Note: these are excluded from the analysis because no clear explanation of how they were derived is available).

This wealth of information allows for a variety of analyses, from understanding musical styles to exploring geographic trends in music production.

Objective

The goal of this project is to apply clustering techniques to the dataset, grouping songs based on their audio features and metadata. By doing so, I aim to explore patterns and trends in music characteristics and possibly discover how certain attributes can help group songs into meaningful clusters.

Tools and Methodologies

To achieve the above objectives, we employ the following tools and techniques:

Data Preprocessing

First, we drop the irrelevant variables and select the variables we want to examine in more depth. We then:

  • Handle missing and inconsistent values.

  • Normalize data to ensure fair comparisons across features.

# Step 1: Subset the data (we select the characteristics Duration and Tempo for the first clustering)
clustering_data <- subset_data[, c("Duration","Tempo")]

# Step 2: Remove NA and Inf values
clustering_data <- na.omit(clustering_data)  # Remove rows with missing values
clustering_data <- clustering_data[is.finite(rowSums(clustering_data)), ]  # Remove rows with Inf/NaN

# Step 3: Feature Normalization
clustering_data <- scale(clustering_data)
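The effect of these preprocessing steps can be checked on a small example. The sketch below uses a made-up data frame (`toy`, with invented Duration and Tempo values, not taken from the dataset) to show that rows containing NA or Inf are dropped and that, after `scale()`, every column has mean 0 and standard deviation 1.

```r
# Toy stand-in for the Duration/Tempo columns; all values are invented for illustration
toy <- data.frame(Duration = c(180, 240, 300, 210, NA),
                  Tempo    = c(90, 120, 150, Inf, 100))

toy <- na.omit(toy)                     # drops the row with NA
toy <- toy[is.finite(rowSums(toy)), ]   # drops the row with Inf

scaled <- scale(toy)                    # centre each column and divide by its standard deviation

col_means <- colMeans(scaled)           # approximately 0 for every column
col_sds   <- apply(scaled, 2, sd)       # 1 for every column
print(round(col_means, 10))
print(col_sds)
```

Without this step, a feature measured on a larger numeric scale (here Duration in seconds) would dominate the Euclidean distances used by k-means.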

Clustering

Once the data is ready, we apply k-means clustering to identify patterns in the data. The number of clusters is initially set to 3.

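# Load the packages used below (factoextra: eclust() and fviz_*(); cluster: silhouette())
library(factoextra)
library(cluster)

# Step 4: Run k-means clustering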
km1 <- eclust(clustering_data, "kmeans", hc_metric = "euclidean", k = 3, graph = FALSE)

# Step 5: Visualize clusters
fviz_cluster(km1, geom = "point", main = "k-means Clustering / Euclidean")

To assess the clustering, we add quality checks for the results and address how to choose the number of clusters (k).

Choosing the Optimal Number of Clusters (k):
Elbow Method: This is a common way to select the optimal number of clusters. We compute the within-cluster sum of squares (WSS) for different values of k and look for an “elbow,” where the reduction in WSS slows down.

# Elbow method to determine optimal k
wss <- sapply(1:10, function(k) {
  kmeans(clustering_data, centers = k, nstart = 25)$tot.withinss
})
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 500050)
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
# Plot the Elbow curve
plot(1:10, wss, type = "b", pch = 19, xlab = "Number of Clusters", 
     ylab = "Within-cluster Sum of Squares", main = "Elbow Method for Optimal k")
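The warnings printed above are raised because kmeans() stops after its default limit of 10 iterations (iter.max = 10), and the Quick-TRANSfer message comes from the default Hartigan-Wong algorithm. A possible remedy, shown here as a minimal sketch on simulated data (the matrix `sim` is an assumption standing in for clustering_data), is to raise iter.max and, if needed, switch to another algorithm such as "Lloyd". Setting a seed also makes the random starts reproducible.

```r
set.seed(42)                                   # reproducible random starts
sim <- scale(matrix(rnorm(2000), ncol = 2))    # simulated stand-in for clustering_data

wss_sim <- sapply(1:10, function(k) {
  kmeans(sim, centers = k, nstart = 25,
         iter.max = 100,                       # allow more iterations than the default 10
         algorithm = "Lloyd")$tot.withinss     # alternative to the default Hartigan-Wong
})

print(round(wss_sim, 1))
```

Whether this removes the warnings on the real data would need to be checked by rerunning the analysis; the WSS values themselves should change little if the original runs were already close to convergence.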

We can see an “elbow” around 3 clusters, but let's also run a silhouette analysis before drawing any conclusions.
Silhouette Analysis: This can be used to assess the quality of clustering. Silhouette scores range from -1 to 1: a value close to 1 indicates that a data point sits well inside its cluster, a value near 0 indicates that it lies between clusters, and a negative value suggests that it may have been assigned to the wrong cluster.

# Calculate the distance matrix (Euclidean distance in this case)
dist_matrix <- dist(clustering_data)
sil_scores <- silhouette(km1$cluster, dist_matrix)

# Visualize silhouette plot
fviz_silhouette(sil_scores)
##   cluster size ave.sil.width
## 1       1 3360          0.39
## 2       2 5689          0.44
## 3       3  952          0.27
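For reference, the silhouette width of an observation i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is its mean distance to the other members of its own cluster and b(i) is its mean distance to the members of the nearest other cluster. The sketch below computes this by hand for one point of a four-point toy example (the coordinates and labels are invented) and compares it with the value returned by silhouette().

```r
library(cluster)

# Four invented points forming two obvious groups
x <- matrix(c(0, 0,
              0, 1,
              5, 5,
              5, 6), ncol = 2, byrow = TRUE)
labels <- c(1, 1, 2, 2)

d <- as.matrix(dist(x))                  # full Euclidean distance matrix

# Silhouette width of observation 1, computed by hand
own   <- labels == 1 & seq_along(labels) != 1
a1    <- mean(d[1, own])                 # mean distance to the rest of its own cluster
b1    <- mean(d[1, labels == 2])         # mean distance to the only other cluster
s1    <- (b1 - a1) / max(a1, b1)

# Same quantity from the cluster package
sil   <- silhouette(labels, dist(x))
s1_pk <- sil[, "sil_width"][1]

print(c(manual = s1, package = unname(s1_pk)))
```

The ave.sil.width column in the output above is the mean of these per-observation values within each cluster.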

The average silhouette widths (0.27 to 0.44) indicate only moderate separation. Duration and tempo alone may not be enough to clearly differentiate between clusters, as these characteristics often overlap across different music genres and styles. The elbow method pointed to 3 clusters as a reasonable choice, but the modest silhouette widths suggest the need for further exploration.

To improve the clustering, we could include additional features such as loudness, key, time signature, etc. These variables could help create more distinct groupings. Additionally, experimenting with different clustering methods like hierarchical clustering, which reveals nested relationships between clusters, could lead to better results.

Clustering with more characteristics

We use Ward’s method for hierarchical clustering because, at each step, it merges the pair of clusters that results in the smallest increase in total within-cluster variance. We choose the number of clusters k by calculating the average silhouette width, which measures how well each data point fits within its own cluster compared to other clusters; the k with the highest score is taken as the best configuration.
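As a minimal sketch of how Ward's method behaves, the example below builds two well-separated simulated groups (`blob_a` and `blob_b` are invented for illustration) and checks that cutting the tree at k = 2 recovers them. In hclust(), method = "ward.D2" is the variant that applies Ward's criterion to unsquared Euclidean distances, which is what dist() returns by default.

```r
set.seed(1)
blob_a <- matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2)   # 20 simulated points near (0, 0)
blob_b <- matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2)   # 20 simulated points near (10, 10)
toy    <- rbind(blob_a, blob_b)

hc_toy <- hclust(dist(toy), method = "ward.D2")   # agglomerative clustering with Ward's criterion
groups <- cutree(hc_toy, k = 2)                   # cut the dendrogram into two clusters

print(table(groups))
```

On data this clean the two groups are recovered exactly; on real song features the clusters overlap, which is why the silhouette width is used to judge the result.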

# Subset data to include relevant features
MillionSongSubset <- subset_data[, c("Duration", "Tempo", "Loudness", "Year", "TimeSignature", "key")]

# Remove NA and Inf values from the data
clustering_data <- na.omit(MillionSongSubset)  # Remove rows with missing values
clustering_data <- clustering_data[is.finite(rowSums(clustering_data)), ]  # Remove rows with Inf/NaN values

# Feature Normalization
clustering_data <- scale(clustering_data)

# Hierarchical Clustering
dist_matrix <- dist(clustering_data)
hc_result <- hclust(dist_matrix, method = "ward.D2")

# Automatically choose the number of clusters
sil_scores <- data.frame(k = 2:10, score = NA)
for (k in 2:10) {
  clusters <- cutree(hc_result, k = k)
  
  # Calculate the silhouette width of every observation for this k
  sil <- silhouette(clusters, dist_matrix)
  
  # Store the average silhouette width
  sil_scores$score[k-1] <- mean(sil[, "sil_width"])
}

# Plot the silhouette scores to determine the best k
ggplot(sil_scores, aes(x = k, y = score)) +
  geom_line() + geom_point() +
  labs(title = "Silhouette Scores for Hierarchical Clustering", x = "Number of Clusters", y = "Silhouette Score")

optimal_k <- sil_scores[which.max(sil_scores$score), ]$k
optimal_silhouette_score <- sil_scores[which.max(sil_scores$score), ]$score

cat("Optimal k:", optimal_k, "\n")
## Optimal k: 4
cat("Silhouette Score for Optimal k:", optimal_silhouette_score, "\n")
## Silhouette Score for Optimal k: 0.1930445
optimal_clusters <- cutree(hc_result, k = optimal_k)

Although 4 clusters were identified as the optimal solution, the low average silhouette width (about 0.19) indicates that the clusters are not well separated and may not be highly meaningful.

Conclusion

These results imply that the data may not have a clear or strong inherent cluster structure, which is consistent with the low silhouette scores obtained above. Based on this, clustering on the selected features may not be the best approach for this dataset. Further analysis could explore alternatives, such as adding more features to the data, using dimensionality reduction methods like PCA, or experimenting with different clustering algorithms, such as DBSCAN, which does not require the number of clusters to be specified in advance.
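As an illustration of the PCA suggestion, the sketch below applies prcomp() to simulated features (the columns f1, f2 and f3 are invented and not taken from the dataset) and keeps the first two principal components as input for a later clustering step. It is a sketch of the idea under these assumptions, not a result for the song data.

```r
set.seed(7)
n      <- 200
latent <- rnorm(n)

# Three simulated features: f1 and f2 share a common signal, f3 is independent noise
features <- cbind(f1 = latent + rnorm(n, sd = 0.1),
                  f2 = latent + rnorm(n, sd = 0.1),
                  f3 = rnorm(n))

pca <- prcomp(features, center = TRUE, scale. = TRUE)   # PCA on standardized features

var_explained <- pca$sdev^2 / sum(pca$sdev^2)           # share of variance per component
reduced       <- pca$x[, 1:2]                           # scores on the first two components

print(round(var_explained, 3))
```

The matrix `reduced` could then be passed to kmeans() or hclust() in place of the scaled feature matrix.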

References

Bertin-Mahieux, T., Ellis, D., Whitman, B., & Lamere, P. (2011). The Million Song Dataset. Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 591-596.

AsmitaRao. (2022). Subset of the “Million Song dataset.” Retrieved December 24, 2024, from Kaggle.com website: https://www.kaggle.com/datasets/sansastark/subset-of-the-million-song-dataset?resource=download&select=billboard_rank.csv