Music is an important part of human culture, with diverse styles and characteristics that appeal to listeners worldwide. With the rise of streaming platforms like Spotify and data-driven technologies, analyzing and categorizing songs has become more accessible than ever. This project explores clustering techniques to group music based on audio features and metadata. It uses the Million Song Dataset, introduced by Bertin-Mahieux, Ellis, Whitman, and Lamere (2011), which contains metadata for a million contemporary popular music tracks. For computational reasons I use a subset of 10,000 songs (AsmitaRao, 2022), which provides a manageable yet representative sample for analysis. The aim is to uncover hidden patterns and insights within the data.
The Million Song Dataset is a freely available collection of music metadata designed to support research. The dataset includes:
Song Metadata: Titles, IDs, albums, durations, and release years.
Artist Information: Names, locations, latitude, longitude, and familiarity scores.
Audio Features: Tempo, time signatures, danceability, energy, loudness, key signatures, etc.
Popularity Metrics: Hotness and familiarity scores (note: these are excluded because no clear explanation of how they were derived is available).
This wealth of information allows for a variety of analyses, from understanding musical styles to exploring geographic trends in music production.
The goal of this project is to apply clustering techniques to the dataset, grouping songs based on their audio features and metadata. By doing so, I aim to explore patterns and trends in music characteristics and possibly discover how certain attributes can help group songs into meaningful clusters.
To achieve the above objectives, we employ the following tools and techniques:
Drop irrelevant variables and select the variables to examine in more depth.
Handle missing and inconsistent values.
Normalize data to ensure fair comparisons across features.
# Step 1: Subset the data (we select the characteristics duration and tempo for the first clustering)
clustering_data <- subset_data[, c("Duration","Tempo")]
# Step 2: Remove NA and Inf values
clustering_data <- na.omit(clustering_data) # Remove rows with missing values
clustering_data <- clustering_data[is.finite(rowSums(clustering_data)), ] # Remove rows with Inf/NaN
# Step 3: Feature Normalization
clustering_data <- scale(clustering_data)
Once the data is ready, we apply k-means clustering to identify patterns in the data. The number of clusters is initially set to 3.
# Step 4: Run k-means clustering
library(factoextra) # provides eclust() and fviz_cluster()
km1 <- eclust(clustering_data, "kmeans", hc_metric = "euclidean", k = 3, graph = FALSE)
# Step 5: Visualize clusters
fviz_cluster(km1, geom = "point", main = "k-means Clustering / Euclidean")
To assess the clustering, we add quality checks on the results and determine the optimal number of clusters (k).
Choosing the Optimal Number of Clusters (k):
Elbow Method: This is a common way to select the optimal number of
clusters. We compute the within-cluster sum of squares (WSS) for
different values of k and look for an “elbow,” where the reduction in
WSS slows down.
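As a small illustration of what WSS measures, the sketch below recomputes it by hand on toy data and compares it with the value reported by `kmeans()`. The toy matrix and the names `toy`, `fit`, and `wss_manual` are assumptions made for this example and are not part of the analysis.

```r
# Toy data: two well-separated groups (hypothetical, not from the dataset)
set.seed(42)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)

fit <- kmeans(toy, centers = 2, nstart = 10)

# WSS by hand: squared distance of every point to its assigned centroid
wss_manual <- sum((toy - fit$centers[fit$cluster, ])^2)

# Should agree with the total within-cluster sum of squares from kmeans()
all.equal(wss_manual, fit$tot.withinss)
```

Adding clusters can only lower this quantity, which is why the elbow method looks for the point where the decrease flattens rather than for a minimum.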
# Elbow method to determine optimal k
wss <- sapply(1:10, function(k) {
  kmeans(clustering_data, centers = k, nstart = 25)$tot.withinss
})
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 500050)
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
# Plot the Elbow curve
plot(1:10, wss, type = "b", pch = 19, xlab = "Number of Clusters",
     ylab = "Within-cluster Sum of Squares", main = "Elbow Method for Optimal k")
We can see an “elbow” around 3 clusters, but let’s also do a silhouette analysis before we draw any conclusions.
Silhouette Analysis: This can be used to assess the quality of
clustering. Silhouette scores range from -1 to 1, where a value closer
to 1 indicates that the data points are well-clustered.
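To make the score concrete, the following sketch computes silhouette widths by hand for four one-dimensional points, using the usual definition s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other points in the same cluster and b(i) is the mean distance to the nearest other cluster. The data and the names `x`, `cl`, and `sil_by_hand` are assumptions made for this example.

```r
# Four points on a line, split into two obvious clusters (toy data)
x  <- c(1, 2, 9, 10)
cl <- c(1, 1, 2, 2)
d  <- as.matrix(dist(x))

sil_by_hand <- sapply(seq_along(x), function(i) {
  own <- cl == cl[i]
  own[i] <- FALSE                        # exclude the point itself
  a <- mean(d[i, own])                   # mean distance within own cluster
  other <- cl != cl[i]
  b <- min(tapply(d[i, other], cl[other], mean))  # nearest other cluster
  (b - a) / max(a, b)
})

round(sil_by_hand, 3)
# [1] 0.882 0.867 0.867 0.882
```

The values are close to 1 because the two groups are far apart relative to their internal spread. With the cluster package loaded, `silhouette(cl, dist(x))` should report the same widths.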
# Calculate the distance matrix (Euclidean distance in this case)
library(cluster) # provides silhouette()
dist_matrix <- dist(clustering_data)
sil_scores <- silhouette(km1$cluster, dist_matrix)
# Visualize silhouette plot
fviz_silhouette(sil_scores)
##   cluster size ave.sil.width
## 1       1 3360          0.39
## 2       2 5689          0.44
## 3       3  952          0.27
Duration and tempo alone may not be enough to clearly differentiate between clusters, as these characteristics often overlap across different music genres and styles. The elbow method suggested that 3 clusters might be the most appropriate, but the average silhouette widths of 0.27 to 0.44 per cluster indicate that the groups are only moderately separated, which suggests the need for further exploration.
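The per-cluster widths in the silhouette output can be combined into one overall figure by weighting each width by its cluster size. The short calculation below uses the rounded values printed above, so the result is approximate; the variable names are introduced only for this example.

```r
# Cluster sizes and average silhouette widths as printed in the output above
sizes  <- c(3360, 5689, 952)
widths <- c(0.39, 0.44, 0.27)

# Size-weighted average silhouette width across all songs
overall <- weighted.mean(widths, sizes)
round(overall, 2)
# [1] 0.41
```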
To improve the clustering, we could include additional features such as loudness, key, time signature, etc. These variables could help create more distinct groupings. Additionally, experimenting with different clustering methods like hierarchical clustering, which reveals nested relationships between clusters, could lead to better results.
Next, tempo is aggregated by release year to explore how it has evolved over time. Grouping songs by release year lets us look for long-term trends in musical style.
# Aggregate tempo by year
library(dplyr) # provides %>%, filter(), group_by(), summarize()
tempo_by_year <- subset_data[, c("Duration", "Tempo", "Year")] %>%
  filter(!is.na(Year) & Year != 0) %>%
  group_by(Year) %>%
  summarize(Avg_Tempo = mean(Tempo, na.rm = TRUE),
            Median_Tempo = median(Tempo, na.rm = TRUE))
# Prepare data for clustering: normalize the Avg_Tempo column
clustering_data <- scale(tempo_by_year$Avg_Tempo) # Normalize Avg_Tempo
tempo_for_clustering <- data.frame(
  Year = tempo_by_year$Year,
  Avg_Tempo = clustering_data
)
# Perform k-means clustering using eclust
# (Year is not rescaled here, so it likely dominates the Euclidean distances)
km2 <- eclust(tempo_for_clustering, "kmeans", hc_metric = "euclidean", k = 3, graph = FALSE)
# Visualize the clusters and add year labels
fviz_cluster(km2, geom = "point", main = "k-means Clustering / Euclidean") +
  geom_text(aes(label = tempo_for_clustering$Year), size = 3, vjust = -1)
tempo_for_clustering$Cluster <- as.factor(km2$cluster)
# Visualization
ggplot(tempo_for_clustering, aes(x = Year, y = Avg_Tempo, color = Cluster)) +
  geom_point(size = 3) +
  geom_line(aes(group = 1), color = "grey") +
  scale_color_discrete(name = "Cluster") +
  labs(title = "Clustering of Years by Tempo", x = "Year", y = "Normalized Average Tempo") +
  theme_minimal()
# Evaluating cluster quality
# Elbow method to find the optimal number of clusters
wss <- sapply(1:10, function(k) {
  kmeans(tempo_for_clustering$Avg_Tempo, centers = k, nstart = 25)$tot.withinss
})
# Plot the Elbow curve
plot(1:10, wss, type = "b", pch = 19, xlab = "Number of Clusters",
     ylab = "Within-cluster Sum of Squares", main = "Elbow Method for Optimal k")
# Perform silhouette analysis to evaluate cluster quality
# (distances use Avg_Tempo only, whereas km2 was fit on Year and Avg_Tempo)
dist_matrix <- dist(tempo_for_clustering$Avg_Tempo)
silhouette_scores <- silhouette(km2$cluster, dist_matrix)
plot(silhouette_scores, main = "Silhouette Analysis of K-means Clustering")
By clustering years based on tempo, we can observe how musical trends evolve over time, such as shifts towards faster or slower tempos. Historical analysis and genre comparisons may reveal patterns where specific genres, like disco in the ’70s or electronic music in the ’90s, cluster within particular tempo ranges. Such insights would show how musical preferences have shifted across eras. However, the current dataset lacks genre information, which limits this analysis. The Last.fm dataset, which includes the official song tags, would be crucial for identifying genres and supporting more detailed analysis. It is available on the official website: http://millionsongdataset.com/lastfm.
Given the low average silhouette width of 0.1, it becomes clear that further improvements can be made. Aggregating data by year doesn’t yield well-separated clusters, and including additional features would likely result in more meaningful groupings. To achieve better clustering, we should focus on a more comprehensive set of features for each song. The variables we will use include: Duration, Tempo, Loudness, Year, Time Signature, and Key. This approach should lead to clusters that are more distinct and representative of the data’s true structure.
| Variable | Description |
|---|---|
| Duration | Duration of the song in seconds |
| Time signature | Estimated number of beats per bar, e.g. 4 |
| Tempo | Estimated tempo in BPM |
| Key | Key the song is in |
| Loudness | Overall loudness in dB |
| Year | Year the song was released |
We use Ward’s method for hierarchical clustering because it minimizes within-cluster variance: at each step it merges the pair of clusters that yields the smallest increase in total within-cluster variance. We choose the number of clusters k by calculating the average silhouette width, which measures how well each data point fits within its cluster compared to other clusters; the k with the highest score indicates the best clustering configuration.
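The merge criterion can be checked numerically. The sketch below, on toy data, compares the increase in within-cluster sum of squares caused by merging two clusters with the closed-form Ward cost |A||B| / (|A| + |B|) times the squared distance between the two centroids. The clusters `A` and `B` and the helper `ss()` are assumptions made for this example.

```r
# Two toy clusters (hypothetical, not from the dataset)
set.seed(1)
A <- matrix(rnorm(20), ncol = 2)             # 10 points
B <- matrix(rnorm(30, mean = 3), ncol = 2)   # 15 points

# Within-cluster sum of squares around the cluster centroid
ss <- function(m) sum(sweep(m, 2, colMeans(m))^2)

# Increase in total within-cluster sum of squares if A and B are merged
increase <- ss(rbind(A, B)) - ss(A) - ss(B)

# Closed-form Ward cost for the same merge
nA <- nrow(A)
nB <- nrow(B)
ward_cost <- nA * nB / (nA + nB) * sum((colMeans(A) - colMeans(B))^2)

all.equal(increase, ward_cost)
```

Because the cost grows with both cluster sizes and centroid distance, Ward's method tends to merge small, nearby clusters first.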
# Subset data to include relevant features
MillionSongSubset <- subset_data[, c("Duration", "Tempo", "Loudness", "Year", "TimeSignature", "key")]
# Remove NA and Inf values from the data
clustering_data <- na.omit(MillionSongSubset) # Remove rows with missing values
clustering_data <- clustering_data[is.finite(rowSums(clustering_data)), ] # Remove rows with Inf/NaN values
# Feature Normalization
clustering_data <- scale(clustering_data)
# Hierarchical Clustering
dist_matrix <- dist(clustering_data)
hc_result <- hclust(dist_matrix, method = "ward.D2")
# Automatically choose the number of clusters
sil_scores <- data.frame(k = 2:10, score = NA)
for (k in 2:10) {
  clusters <- cutree(hc_result, k = k)
  # Calculate silhouette widths for this clustering
  sil <- silhouette(clusters, dist_matrix)
  # Store the average silhouette width
  sil_scores$score[k - 1] <- mean(sil[, "sil_width"])
}
# Plot silhouette scores to determine the best k
ggplot(sil_scores, aes(x = k, y = score)) +
  geom_line() + geom_point() +
  labs(title = "Silhouette Scores for Hierarchical Clustering", x = "Number of Clusters", y = "Silhouette Score")
optimal_k <- sil_scores[which.max(sil_scores$score), ]$k
optimal_silhouette_score <- sil_scores[which.max(sil_scores$score), ]$score
cat("Optimal k:", optimal_k, "\n")
cat("Silhouette Score for Optimal k:", optimal_silhouette_score, "\n")
## Optimal k: 4
## Silhouette Score for Optimal k: 0.1930445
Although 4 clusters were identified as the optimal solution, the low silhouette score of about 0.19 still indicates that the clusters may not be well-separated or highly meaningful.
These results imply that the data may not have a clear or strong inherent structure for clustering, which is consistent with the low silhouette scores obtained in the earlier analyses. Based on this, clustering on these features may not be the best approach for this dataset. Further analysis could involve adding more features, using dimensionality reduction methods like PCA, or experimenting with different clustering algorithms such as DBSCAN, which does not require the number of clusters to be specified in advance.
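As a rough sketch of the PCA route, the example below reduces a six-column feature matrix to its first two principal components with base R's `prcomp()` and then runs k-means on the scores. The matrix is random synthetic data standing in for the scaled song features, so any cluster sizes it prints carry no meaning; the names `features`, `pca`, `explained`, `scores`, and `km_pca`, and the choice of 4 centers, are assumptions for illustration only.

```r
set.seed(123)

# Synthetic stand-in for the scaled feature matrix (random values, not real songs)
features <- scale(matrix(
  rnorm(600), ncol = 6,
  dimnames = list(NULL, c("Duration", "Tempo", "Loudness",
                          "Year", "TimeSignature", "key"))
))

# PCA on the already centered and scaled data
pca <- prcomp(features)

# Share of variance explained by each component, and its running total
explained <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(explained)

# Cluster on the first two principal component scores
scores <- pca$x[, 1:2]
km_pca <- kmeans(scores, centers = 4, nstart = 25)
table(km_pca$cluster)
```

On the real data, the cumulative variance would indicate how many components are worth keeping before clustering.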
Bertin-Mahieux, T., Ellis, D., Whitman, B., & Lamere, P. (2011). The Million Song Dataset. Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 591-596.
AsmitaRao. (2022). Subset of the “Million Song dataset.” Retrieved December 24, 2024, from Kaggle.com website: https://www.kaggle.com/datasets/sansastark/subset-of-the-million-song-dataset?resource=download&select=billboard_rank.csv