INTRODUCTION
This analysis uses the Spotify dataset to explore and group music tracks based on their audio features. Principal Component Analysis (PCA) is applied for dimensionality reduction, followed by K-Means clustering, with the goal of uncovering patterns and relationships among tracks.
DATASET DESCRIPTION
The Spotify dataset includes numerical features such as popularity, danceability, energy, tempo, and acousticness. It also contains categorical features such as explicit and genre: explicit is converted to a binary indicator during preprocessing, while genre labels can be one-hot encoded (a sketch at the end of the next section illustrates this). Before applying machine learning algorithms, preprocessing ensures data quality and compatibility.
DATA PRE-PROCESSING
library(dplyr)  # pipes and column selection
library(caret)  # preProcess() for range scaling

# Read the raw dataset
spotify_data <- read.csv("C:/Users/abdul/Documents/data/spotifydataset.csv", stringsAsFactors = FALSE)

# Drop identifier and free-text columns that carry no audio information
columns_to_drop <- c("track_id", "artists", "album_name", "track_name", "Unnamed: 0")
spotify_data <- spotify_data %>% select(-all_of(intersect(columns_to_drop, colnames(spotify_data))))

# Remove rows with missing values and binary-encode the explicit flag
spotify_data <- na.omit(spotify_data)
spotify_data$explicit <- ifelse(spotify_data$explicit == "True", 1, 0)

# Keep numeric columns only, dropping zero-variance features
numeric_columns <- sapply(spotify_data, is.numeric)
spotify_data_numeric <- spotify_data[, numeric_columns]
spotify_data_numeric <- spotify_data_numeric[, apply(spotify_data_numeric, 2, var) > 0]

# Rescale every feature to [0, 1]
spotify_data_scaled <- preProcess(spotify_data_numeric, method = c("range")) %>%
  predict(spotify_data_numeric)
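The preprocessing above only binary-encodes explicit. As the dataset description notes, genre labels are also categorical; the sketch below shows one way they could be one-hot encoded, assuming a character column named track_genre in the raw file (the column name and toy data are assumptions, not taken from the code above).

# Hypothetical one-hot encoding sketch; assumes a character column "track_genre"
example <- data.frame(
  explicit    = c("True", "False", "True"),
  track_genre = c("rock", "jazz", "rock"),
  stringsAsFactors = FALSE
)
example$explicit <- ifelse(example$explicit == "True", 1, 0)
# model.matrix() builds the indicator columns; "+ 0" drops the intercept
genre_dummies <- model.matrix(~ track_genre + 0, data = example)
example <- cbind(example[, "explicit", drop = FALSE], genre_dummies)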
DIMENSIONALITY REDUCTION ANALYSIS
Principal Component Analysis (PCA) was applied to reduce the dataset’s dimensionality while preserving as much variance as possible. The cumulative explained variance plot, with a reference line at 90%, shows how quickly variance accumulates across components. Based on this plot, the first 10 principal components were selected for further analysis to strike a balance between reducing complexity and retaining meaningful patterns.
# Fit PCA on the scaled features
pca_model <- prcomp(spotify_data_scaled, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, and its running total
explained_var <- summary(pca_model)$importance[2, ]
cumulative_var <- cumsum(explained_var)
library(ggplot2)

ggplot(data.frame(PC = seq_along(cumulative_var), Variance = cumulative_var),
       aes(x = PC, y = Variance)) +
  geom_point() +
  geom_line() +
  ggtitle("Cumulative Explained Variance by Principal Components") +
  xlab("Principal Components") +
  ylab("Cumulative Proportion of Variance Explained") +
  geom_hline(yintercept = 0.9, linetype = "dashed", color = "red") +
  annotate("text", x = 15, y = 0.9, label = "90% Variance", vjust = -1)
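As a quick numeric complement to the plot, the one-liner below (added here as a convenience, not part of the original script) reports the first component at which the cumulative proportion crosses a chosen threshold:

# Index of the first principal component reaching 90% cumulative variance
which(cumulative_var >= 0.90)[1]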
DETERMINING OPTIMAL CLUSTERS
To determine the optimal number of clusters, the Elbow Method was employed. The Elbow plot shows the Within-Cluster Sum of Squares (WCSS) for different cluster counts; the “elbow,” where the rate of decrease in WCSS slows markedly, occurs at 3 clusters, making that the preferred choice for this dataset. The Silhouette Method was then used to validate this choice: silhouette scores were computed for cluster counts from 2 to 5, and the highest average score was observed at 3 clusters, reinforcing the conclusion drawn from the Elbow Method.
# Compute WCSS for k = 1..10 on the first 10 principal components
wcss <- c()
k_values <- 1:10
for (k in k_values) {
  set.seed(42)  # reproducible initial centers for each k
  kmeans_model <- kmeans(pca_model$x[, 1:10], centers = k, nstart = 50, iter.max = 100)
  wcss <- c(wcss, kmeans_model$tot.withinss)
}
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 1000050)
This warning comes from kmeans()’s default Hartigan-Wong algorithm and is common on large datasets; passing algorithm = "MacQueen" is one way to avoid it.
ggplot(data.frame(k = k_values, wcss = wcss), aes(x = k, y = wcss)) +
  geom_point() +
  geom_line() +
  ggtitle("Elbow Method for Optimal Clusters") +
  xlab("Number of Clusters") +
  ylab("WCSS") +
  geom_vline(xintercept = 3, linetype = "dashed", color = "blue") +
  annotate("text", x = 3, y = max(wcss), label = "Optimal Clusters", vjust = -1)
library(cluster)  # silhouette()

# Average silhouette width for k = 2..5
silhouette_scores <- sapply(2:5, function(k) {
  set.seed(42)
  km <- kmeans(pca_model$x[, 1:10], centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(pca_model$x[, 1:10]))[, 3])
})

ggplot(data.frame(k = 2:5, silhouette = silhouette_scores), aes(x = k, y = silhouette)) +
  geom_point() +
  geom_line() +
  ggtitle("Silhouette Scores for Optimal Clusters") +
  xlab("Number of Clusters") +
  ylab("Silhouette Score")
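As an optional cross-check (not part of the original workflow), factoextra’s fviz_nbclust() produces each of these diagnostics with a single call; note that the silhouette variant computes a full distance matrix, which can be slow or memory-hungry on a large track set.

library(factoextra)
# Elbow (within-cluster sum of squares) and average-silhouette diagnostics
fviz_nbclust(pca_model$x[, 1:10], kmeans, method = "wss", k.max = 10)
fviz_nbclust(pca_model$x[, 1:10], kmeans, method = "silhouette", k.max = 5)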
CLUSTERING ANALYSIS
With 3 clusters chosen as optimal, K-Means clustering is applied to the PCA-transformed dataset. Visualizations demonstrate the separation between clusters.
library(factoextra)  # fviz_cluster()

# Final K-Means fit with k = 3 on the first 10 principal components
set.seed(42)
kmeans_result <- kmeans(pca_model$x[, 1:10], centers = 3, nstart = 25)

pca_data <- as.data.frame(pca_model$x[, 1:10])
pca_data$Cluster <- as.factor(kmeans_result$cluster)

fviz_cluster(kmeans_result, data = pca_model$x[, 1:10]) +
  ggtitle("K-Means Clustering Visualization (PCA Reduced Data)")

ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(alpha = 0.6) +
  ggtitle("Cluster Separation in First Two Principal Components") +
  xlab("Principal Component 1") +
  ylab("Principal Component 2") +
  theme_minimal()
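A quick tabulation (added here as a sanity check) shows how many tracks fall into each cluster:

# Number of tracks assigned to each cluster
table(pca_data$Cluster)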
CLUSTER CHARACTERISTICS
The mean feature values for each cluster were calculated to understand their characteristics.
# Mean of each scaled feature within each cluster
# (the X column is a leftover row index from the CSV and can be ignored)
pca_data_summary <- spotify_data_scaled %>%
  as.data.frame() %>%
  mutate(Cluster = kmeans_result$cluster) %>%
  group_by(Cluster) %>%
  summarise(across(everything(), mean))
print(pca_data_summary)
## # A tibble: 3 × 16
## Cluster X popularity duration_ms danceability energy key loudness mode
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.506 0.359 0.0403 0.471 0.247 0.450 0.565 0.724
## 2 2 0.907 0.246 0.0446 0.567 0.715 0.460 0.666 0.695
## 3 3 0.472 0.322 0.0479 0.605 0.712 0.496 0.751 0.659
## # ℹ 7 more variables: speechiness <dbl>, acousticness <dbl>,
## # instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## # time_signature <dbl>
Based on these mean feature values, the following observations can be made:
• Cluster 1: Tracks with low energy and the lowest loudness, consistent with mellow or acoustic music.
• Cluster 2: Tracks with high energy but more moderate danceability, spanning a mix of genres.
• Cluster 3: Tracks combining high energy with the highest danceability and loudness, suggesting dance or electronic styles.
These insights align with the grouping observed in the PCA visualizations and confirm that the clusters represent meaningful distinctions among the tracks.
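To make these interpretations usable downstream, cluster numbers can be mapped to descriptive labels; the label strings below are illustrative choices based on the mean-feature table, not labels produced by the analysis itself.

# Hypothetical human-readable labels derived from the cluster means above
cluster_labels <- c("1" = "Mellow / acoustic",
                    "2" = "High energy, mixed",
                    "3" = "Loud and danceable")
pca_data$Label <- cluster_labels[as.character(pca_data$Cluster)]
head(pca_data[, c("PC1", "PC2", "Cluster", "Label")])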
CONCLUSION
This analysis demonstrates the power of PCA and K-Means clustering in uncovering meaningful patterns in high-dimensional datasets. By reducing the dataset’s complexity, clustering analysis reveals distinct groupings of tracks, which can inform music recommendations and other applications in the music industry.
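As one concrete illustration of the recommendation idea, the sketch below suggests tracks from the same cluster that sit closest in the 10-dimensional PCA space; recommend_similar is a hypothetical helper written for this report, not part of the analysis above.

# Hypothetical helper: n nearest neighbours of a track within its own cluster
recommend_similar <- function(track_index, n = 5) {
  pcs <- paste0("PC", 1:10)
  same_cluster <- which(pca_data$Cluster == pca_data$Cluster[track_index])
  coords <- as.matrix(pca_data[same_cluster, pcs])
  target <- unlist(pca_data[track_index, pcs])
  dists <- sqrt(rowSums(sweep(coords, 2, target)^2))
  same_cluster[order(dists)][2:(n + 1)]  # drop the query track itself
}
recommend_similar(1, n = 5)  # row indices of 5 tracks similar to track 1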