K-Means Clustering Analysis
Data Explanation
Objective
The data used in this case consist of audio-feature variables for individual tracks, each identified by a Spotify track ID. There are more than 232 thousand observations, spread across many genres.
Why?
The purpose of clustering analysis using K-means is to find groups that have not been explicitly labeled in the data. This kind of analysis can be used to confirm business assumptions about what types of groups exist or to identify unknown groups in complex data sets. Here, clustering lets us separate the tracks into groups so that we can find music with similar characteristics.
Libraries for Clustering
library(tidyverse)
library(lubridate)
library(factoextra)
library(FactoMineR)
Data Preparation
data <- read.csv("SpotifyFeatures.csv",encoding = "UTF-8")
glimpse(data)
## Rows: 232,725
## Columns: 18
## $ X.U.FEFF.genre <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie",~
## $ artist_name <chr> "Henri Salvador", "Martin & les fées", "Joseph Willia~
## $ track_name <chr> "C'est beau de faire un Show", "Perdu d'avance (par G~
## $ track_id <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "~
## $ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, ~
## $ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,~
## $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41~
## $ duration_ms <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,~
## $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270~
## $ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 1.23e-01, 0.0~
## $ key <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G~
## $ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105~
## $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -~
## $ mode <chr> "Major", "Minor", "Minor", "Major", "Major", "Major",~
## $ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953~
## $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8~
## $ time_signature <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/4~
## $ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533~
Result: The data set has many columns, but only the numeric ones are used for clustering.
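The first column shows up as `X.U.FEFF.genre` because the CSV starts with a byte-order mark (BOM), which `encoding = "UTF-8"` does not strip. A minimal sketch of one way to avoid this, assuming the file is UTF-8 with a BOM, is to pass `fileEncoding = "UTF-8-BOM"` instead. The temporary file and its two columns are made up for illustration.

```r
# Write a tiny CSV that starts with a UTF-8 byte-order mark (EF BB BF).
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("genre,popularity\nMovie,0\nMovie,1\n")), tmp)

# fileEncoding = "UTF-8-BOM" tells R to drop the BOM while reading,
# so the first column keeps its intended name.
toy <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(toy)
```

With the real file, the equivalent call would presumably be `read.csv("SpotifyFeatures.csv", fileEncoding = "UTF-8-BOM")`.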
First, we take a random 5% sample of the observations.
index <- sample(x = nrow(data),
                size = nrow(data) * 0.05)
data_new <- data[index, ]
Fetching numerical columns of the data
data_num <- data_new %>%
  select_if(is.numeric)
data_num <- data_num %>%
  select(-duration_ms) # this step is optional
Checking the NA data
colSums(is.na(data_num))
##       popularity     acousticness     danceability           energy
##                0                0                0                0
## instrumentalness         liveness         loudness      speechiness
##                0                0                0                0
##            tempo          valence
##                0                0
Result: There are no missing values in the data set.
Scaling the data
data_scale <- scale(data_num)
K-Means Clustering
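K-means groups points by Euclidean distance, so the scaling step matters: a variable such as tempo, whose values are in the tens and hundreds, would otherwise outweigh features bounded between 0 and 1. The sketch below, on a small invented data frame, shows that `scale()` simply subtracts each column's mean and divides by its standard deviation.

```r
# Toy data (values invented for illustration): two features on very different scales.
toy <- data.frame(
  tempo   = c(90, 120, 150, 180),
  valence = c(0.2, 0.4, 0.6, 0.8)
)

# scale() centres each column on its mean and divides by its standard deviation.
scaled <- scale(toy)

# The same computation written out by hand.
manual <- sapply(toy, function(col) (col - mean(col)) / sd(col))

round(colMeans(scaled), 10)   # column means after scaling
apply(scaled, 2, sd)          # column standard deviations after scaling
```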
Finding the Optimal K Using the Elbow Method
fviz_nbclust(
  x = data_scale,
  FUNcluster = kmeans,
  method = "wss"
)
## Warning: did not converge in 10 iterations
Result: The graph shows that the total within-cluster sum of squares drops sharply up to k = 3, but only marginally from k = 4 onward. Thus, 3 was chosen as the optimal k according to the Elbow Method.
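What the elbow plot shows can be reproduced by hand: run K-means for each candidate k and record the total within-cluster sum of squares. The sketch below does this on the built-in iris measurements rather than the Spotify sample, an assumption made to keep it self-contained. Raising `iter.max` and `nstart` above their defaults is one common way to deal with the "did not converge in 10 iterations" warning; the values 50 and 10 are arbitrary choices.

```r
set.seed(123)                     # arbitrary seed so the run is repeatable
x <- scale(iris[, 1:4])           # stand-in for data_scale

# Total within-cluster sum of squares for k = 1..10.
wss <- sapply(1:10, function(k) {
  kmeans(x, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# The "elbow" is where the curve stops dropping sharply.
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```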
data_kmeans <- kmeans(data_scale, centers = 3)
data_kmeans$size # number of observations in each cluster
## [1] 2583  530 8523
data_kmeans$iter # number of iterations needed to reach stable groups
## [1] 3
data_kmeans$centers # centroid of each cluster, in scaled units
##   popularity acousticness danceability     energy instrumentalness    liveness
## 1 -0.6578030    1.2951257  -1.00753132 -1.3830455        0.9736966 -0.26007953
## 2 -1.1658180    1.1817326   0.02993458  0.3737098       -0.4819517  2.63161962
## 3  0.2718513   -0.4659895   0.30348329  0.3959099       -0.2651207 -0.08482611
##     loudness speechiness      tempo    valence
## 1 -1.3469639  -0.3728035 -0.3518853 -0.8992572
## 2 -0.3982507   3.9807505 -0.6101984 -0.1668102
## 3  0.4329791  -0.1345590  0.1445881  0.2829040
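Because `sample()` and `kmeans()` both draw random numbers, the sample, the cluster sizes, and even which cluster is called 1, 2, or 3 can differ between runs. A minimal sketch of making a run repeatable, using the built-in iris data as a stand-in and an arbitrary seed:

```r
x <- scale(iris[, 1:4])           # stand-in for data_scale

# Fixing the seed before each call makes the random starts identical.
# nstart = 25 keeps the best of 25 random initialisations.
set.seed(42)
fit_a <- kmeans(x, centers = 3, nstart = 25)
set.seed(42)
fit_b <- kmeans(x, centers = 3, nstart = 25)

identical(fit_a$cluster, fit_b$cluster)   # compares the labels of both runs
fit_a$size
```

In the analysis itself, a `set.seed()` call would go before `sample()` as well, so that the same 5% of tracks is drawn each time.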
Adding the Clustering Result to the Data
data_num$cluster <- data_kmeans$cluster
library(ggiraphExtra)
## Warning: package 'ggiraphExtra' was built under R version 4.1.3
ggRadar(data = data_num,
        mapping = aes(colours = cluster),
        interactive = TRUE)
Profiling
data_centroid <- data_num %>%
  group_by(cluster) %>%
  summarise_all(mean)
data_centroid %>%
  pivot_longer(-cluster) %>%
  group_by(name) %>%
  summarise(min_group = which.min(value),
            max_group = which.max(value))
## # A tibble: 10 x 3
##    name             min_group max_group
##    <chr>                <int>     <int>
##  1 acousticness             3         1
##  2 danceability             1         3
##  3 energy                   1         3
##  4 instrumentalness         2         1
##  5 liveness                 1         2
##  6 loudness                 1         3
##  7 popularity               2         3
##  8 speechiness              1         2
##  9 tempo                    2         3
## 10 valence                  1         3
Result: Based on the profiling table above, Cluster 3 has the highest loudness, danceability, energy, tempo, valence, and popularity, with the lowest acousticness. Cluster 1 has the highest acousticness and instrumentalness, with the lowest speechiness, valence, danceability, energy, loudness, and liveness. Finally, Cluster 2 has the highest speechiness and liveness, with the lowest popularity, tempo, and instrumentalness. Note that K-means numbers its clusters arbitrarily, so the labels can change from one run to the next.
Conclusion
Based on the K-means clustering result, Cluster 3 tends to contain active, cheerful, and energetic music, such as Hip-Hop and Rock. Cluster 1 is mellow, calm, and peaceful music, such as Pop and Blues. Cluster 2 is music with a lot of speech and a live feel, low tempo, and low popularity.
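The genre associations above are not tested in the analysis itself. One way they could be checked is to cross-tabulate the genre column against the cluster labels. The sketch below uses a small invented data frame; with the real data, the inputs would be the genre column of `data_new` (named `X.U.FEFF.genre` in the `glimpse()` output) and `data_kmeans$cluster`.

```r
# Invented example: six tracks with a genre and an assigned cluster.
toy <- data.frame(
  genre   = c("Rock", "Rock", "Hip-Hop", "Blues", "Blues", "Comedy"),
  cluster = c(3, 3, 3, 1, 1, 2)
)

# Count tracks per genre and cluster.
counts <- table(toy$genre, toy$cluster)

# Row-wise proportions: the share of each genre that falls in each cluster.
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

A genre whose tracks are spread evenly across the clusters would suggest that the clusters capture something other than genre.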