K-Means Clustering Analysis

Data Explanation

Objective

The data used in this case consist of audio-feature variables that Spotify provides for individual tracks. There are more than 232 thousand observations (232,725 rows), spread across many genres.

Why?

The purpose of K-means clustering is to find groups that have not been explicitly labeled in the data. This kind of analysis can be used to confirm business assumptions about what types of groups exist or to identify unknown groups in complex data sets. Here, clustering lets us sort tracks into groups so that we can find music with similar characteristics.

Libraries for Clustering

library(tidyverse)
library(lubridate)
library(factoextra)
library(FactoMineR)

Data Preparation

data <- read.csv("SpotifyFeatures.csv",encoding = "UTF-8")
glimpse(data)

## Rows: 232,725
## Columns: 18
## $ X.U.FEFF.genre   <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie",~
## $ artist_name      <chr> "Henri Salvador", "Martin & les fées", "Joseph Willia~
## $ track_name       <chr> "C'est beau de faire un Show", "Perdu d'avance (par G~
## $ track_id         <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "~
## $ popularity       <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, ~
## $ acousticness     <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,~
## $ danceability     <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41~
## $ duration_ms      <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,~
## $ energy           <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270~
## $ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 1.23e-01, 0.0~
## $ key              <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G~
## $ liveness         <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105~
## $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -~
## $ mode             <chr> "Major", "Minor", "Minor", "Major", "Major", "Major",~
## $ speechiness      <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953~
## $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8~
## $ time_signature   <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/4~
## $ valence          <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533~

Result: The data set has many columns, but we will use only the numeric ones for clustering.
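The first column name in the output above, X.U.FEFF.genre, suggests that the CSV file starts with a byte-order mark (BOM) that was read as part of the header. The following is a minimal sketch of one way to avoid this, assuming the file really is UTF-8 with a BOM: pass fileEncoding = "UTF-8-BOM" to read.csv(). The temporary file and its two columns are made up for illustration.

```r
# Write a tiny CSV that starts with a UTF-8 byte-order mark (hypothetical data).
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("genre,popularity\nMovie,0\nMovie,1\n")), tmp)

# fileEncoding = "UTF-8-BOM" tells R to drop the BOM while reading.
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(demo)
```

The same argument could be added to the read.csv() call for SpotifyFeatures.csv. The genre column is not numeric and is not used for clustering, so the clustering itself is unaffected either way.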

First, we will use only a 5% random sample of the observations.

index <- sample(x = nrow(data),
                size = floor(nrow(data) * 0.05)) # 5% of the rows, rounded down
data_new <- data[index, ]

Selecting the numeric columns

data_num <- data_new %>% 
  select(where(is.numeric)) %>% # keep only the numeric columns
  select(-duration_ms)          # optional: exclude track duration

Checking for missing values

colSums(is.na(data_num))

##       popularity     acousticness     danceability           energy 
##                0                0                0                0 
## instrumentalness         liveness         loudness      speechiness 
##                0                0                0                0 
##            tempo          valence 
##                0                0

Result: There are no missing values in the data set.

Scaling the data

data_scale <- scale(data_num)
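Scaling matters because K-means relies on Euclidean distance, so a feature measured in large units (tempo in BPM, loudness in dB) would otherwise outweigh features that range from 0 to 1. As a small illustration, the sketch below shows that scale() subtracts each column's mean and divides by its standard deviation. The toy values are made up.

```r
# Toy matrix with two columns on very different scales (made-up values).
toy <- cbind(tempo = c(90, 120, 150), valence = c(0.2, 0.5, 0.8))

toy_scaled <- scale(toy)

# Same result computed by hand: subtract the column mean, divide by the column sd.
by_hand <- sweep(sweep(toy, 2, colMeans(toy)), 2, apply(toy, 2, sd), "/")

colMeans(toy_scaled)      # approximately 0 for every column
apply(toy_scaled, 2, sd)  # 1 for every column
all.equal(by_hand, toy_scaled, check.attributes = FALSE)
```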

K-Means Clustering

Finding the Optimal k Using the Elbow Method

fviz_nbclust(
  x = data_scale,
  FUNcluster = kmeans,
  method = "wss"
)

## Warning: did not converge in 10 iterations

Result: The graph shows that the total within-cluster sum of squares drops sharply up to k = 3 and only gradually from k = 4 onward. Thus, 3 was chosen as the optimal k according to the elbow method.
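The elbow curve can also be computed by hand with base R, which makes it easy to see what is being plotted and to address the convergence warning above. The sketch below is not the code used in this analysis: it runs on a made-up data set of three blobs, and the values iter.max = 50 and nstart = 10 are assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Hypothetical stand-in for data_scale: three well-separated 2-D blobs.
toy <- scale(rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
))

# Total within-cluster sum of squares for k = 1..10.
# iter.max raises the iteration limit; nstart keeps the best of several random starts.
wss <- sapply(1:10, function(k) {
  kmeans(toy, centers = k, iter.max = 50, nstart = 10)$tot.withinss
})

plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

fviz_nbclust() documents a `...` argument for passing further arguments to FUNcluster, so adding iter.max and nstart to that call may have the same effect.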

data_kmeans <- kmeans(data_scale, centers=3)
data_kmeans$size # number of observations in each cluster

## [1] 2583  530 8523

data_kmeans$iter # number of iterations until the clusters stabilized

## [1] 3

data_kmeans$centers # centroid coordinates (in scaled units)

##   popularity acousticness danceability     energy instrumentalness    liveness
## 1 -0.6578030    1.2951257  -1.00753132 -1.3830455        0.9736966 -0.26007953
## 2 -1.1658180    1.1817326   0.02993458  0.3737098       -0.4819517  2.63161962
## 3  0.2718513   -0.4659895   0.30348329  0.3959099       -0.2651207 -0.08482611
##     loudness speechiness      tempo    valence
## 1 -1.3469639  -0.3728035 -0.3518853 -0.8992572
## 2 -0.3982507   3.9807505 -0.6101984 -0.1668102
## 3  0.4329791  -0.1345590  0.1445881  0.2829040
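The centroid values above are in standardized units (standard deviations from the sample mean), because K-means was run on data_scale. To read them in the original units, the scaling can be undone with the attributes that scale() stores. The sketch below uses a made-up two-feature data set, not the Spotify sample.

```r
set.seed(1)  # arbitrary seed for a repeatable sketch

# Hypothetical stand-in for data_num: two numeric features on different scales.
toy <- data.frame(tempo    = c(rnorm(50, 90, 5), rnorm(50, 150, 5)),
                  loudness = c(rnorm(50, -20, 2), rnorm(50, -5, 2)))

toy_scaled <- scale(toy)
fit <- kmeans(toy_scaled, centers = 2, nstart = 10)

# scale() stores the column means and sds as attributes; use them to undo the scaling.
mu    <- attr(toy_scaled, "scaled:center")
sigma <- attr(toy_scaled, "scaled:scale")
centers_original <- sweep(sweep(fit$centers, 2, sigma, "*"), 2, mu, "+")

centers_original                                   # centroids in BPM and dB
aggregate(toy, list(cluster = fit$cluster), mean)  # same values, computed directly
```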

Adding the Cluster Labels to the Data

data_num$cluster <- data_kmeans$cluster

library(ggiraphExtra)

## Warning: package 'ggiraphExtra' was built under R version 4.1.3

ggRadar(data = data_num,
        mapping = aes(colours = cluster),
        interactive = TRUE)

Profiling

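data_centroid <- data_num %>% 
  group_by(cluster) %>% 
  summarise(across(everything(), mean)) # per-cluster mean of each feature

data_centroid %>% 
  pivot_longer(-cluster) %>% 
  group_by(name) %>% 
  summarise(min_group = cluster[which.min(value)], # cluster with the lowest mean
            max_group = cluster[which.max(value)]) # cluster with the highest mean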

## # A tibble: 10 x 3
##    name             min_group max_group
##    <chr>                <int>     <int>
##  1 acousticness             3         1
##  2 danceability             1         3
##  3 energy                   1         3
##  4 instrumentalness         2         1
##  5 liveness                 1         2
##  6 loudness                 1         3
##  7 popularity               2         3
##  8 speechiness              1         2
##  9 tempo                    2         3
## 10 valence                  1         3

Result: Based on the radar chart and the table above, Cluster 3 has the highest loudness, danceability, energy, tempo, valence, and popularity, with the lowest acousticness. Cluster 1 has the highest acousticness and instrumentalness, with the lowest danceability, energy, liveness, loudness, speechiness, and valence. Cluster 2 has the highest speechiness and liveness, with the lowest popularity, tempo, and instrumentalness. Because the 5% sample and the K-means starting centroids are both random, cluster numbers and sizes can change from run to run unless a seed is fixed with set.seed().

Conclusion

Based on the K-means clustering result, Cluster 3 tends to contain active, cheerful, and energetic music, such as Hip-Hop and Rock. Cluster 1 contains mellow, calm, and peaceful music, such as Pop and Blues. Cluster 2 contains speech-heavy, live-sounding tracks with low tempo that are not very popular.
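The genre examples in the conclusion are read off the cluster centroids rather than from the genre column itself. One way to check them is to cross-tabulate genre against cluster. The sketch below uses a made-up eight-track sample; in this analysis the genre labels would come from the first column of data_new and the cluster labels from data_kmeans$cluster.

```r
# Hypothetical sample: genre labels and cluster assignments for eight tracks.
toy <- data.frame(
  genre   = c("Rock", "Rock", "Hip-Hop", "Classical",
              "Classical", "Comedy", "Comedy", "Rock"),
  cluster = c(3, 3, 3, 1, 1, 2, 2, 1)
)

# Count tracks per genre and cluster, then turn each cluster column into proportions.
counts <- table(toy$genre, toy$cluster, dnn = c("genre", "cluster"))
shares <- prop.table(counts, margin = 2)

round(shares, 2)
```

Each column of shares sums to 1, so the largest entries in a column show which genres dominate that cluster.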

Reference

https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/