Case: Using Spotify music data, we want to cluster songs with similar characteristics into the same group. The data can be downloaded from https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db
songs <- read.csv('SpotifyFeatures.csv')
head(songs)
Check the data structure:
str(songs)
## 'data.frame': 232725 obs. of 18 variables:
## $ ï..genre : chr "Movie" "Movie" "Movie" "Movie" ...
## $ artist_name : chr "Henri Salvador" "Martin & les fées" "Joseph Williams" "Henri Salvador" ...
## $ track_name : chr "C'est beau de faire un Show" "Perdu d'avance (par Gad Elmaleh)" "Don't Let Me Be Lonely Tonight" "Dis-moi Monsieur Gordon Cooper" ...
## $ track_id : chr "0BRjO6ga9RKCKjfDqeFgWV" "0BjC1NfoEOOusryehmNudP" "0CoSDzoNIKCRs124s9uTVy" "0Gc6TVm52BwZD07Ki6tIvf" ...
## $ popularity : int 0 1 3 0 4 0 2 15 0 10 ...
## $ acousticness : num 0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
## $ danceability : num 0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
## $ duration_ms : int 99373 137373 170267 152427 82625 160627 212293 240067 226200 152694 ...
## $ energy : num 0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
## $ instrumentalness: num 0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
## $ key : chr "C#" "F#" "C" "C#" ...
## $ liveness : num 0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
## $ loudness : num -1.83 -5.56 -13.88 -12.18 -21.15 ...
## $ mode : chr "Major" "Minor" "Minor" "Major" ...
## $ speechiness : num 0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
## $ tempo : num 167 174 99.5 171.8 140.6 ...
## $ time_signature : chr "4/4" "4/4" "5/4" "4/4" ...
## $ valence : num 0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...
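The `ï..genre` name in the output above is what a UTF-8 byte-order mark at the start of a CSV file typically turns into when `read.csv` treats it as part of the first header. As a minimal sketch, assuming that is what happened here, the mark can be dropped at read time with `fileEncoding = "UTF-8-BOM"`. The temporary file and the `demo` data frame below are hypothetical stand-ins, not the real `SpotifyFeatures.csv`.

```r
# Hypothetical two-row file that starts with a UTF-8 byte-order mark,
# standing in for SpotifyFeatures.csv.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)                        # the BOM bytes
writeBin(charToRaw("genre,popularity\nMovie,0\nMovie,1\n"), con)  # the CSV text
close(con)

# "UTF-8-BOM" asks R to strip the mark while reading,
# so the first column keeps its plain name.
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(demo)
```

Renaming the column after reading, for example with `names(songs)[1] <- "genre"`, would be an equally valid alternative. Either way the rest of the analysis is unaffected, because only numeric columns are used for clustering.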
The songs data contains the following columns:
ï..genre : Genre of the music. The column is genre; the ï.. prefix is a byte-order-mark artifact picked up when the CSV was read.
artist_name : Name of the artist
track_name : Name of the song
track_id : The Spotify ID for the track.
popularity : Song popularity
acousticness : A confidence measure from 0.0 to 1.0 of whether the track is acoustic.
danceability : Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.
energy : Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.
instrumentalness : Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context.
key : The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. (In this data set the key is stored as a pitch name such as "C#", as the str() output shows.)
liveness : Detects the presence of an audience in the recording.
loudness : The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks.
mode : Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived.
speechiness : Speechiness detects the presence of spoken words in a track.
tempo : The overall estimated tempo of a track in beats per minute (BPM).
time_signature : The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
valence : A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.
Convert the duration from milliseconds to minutes and rename the column accordingly:
library(dplyr)
songs <- songs %>% mutate(duration_ms = duration_ms/60000) %>% rename(duration_in_min = duration_ms)
head(songs)
Check for missing values. If any exist, remove them.
anyNA(songs)
## [1] FALSE
Check for duplicate data. If any exist, remove them.
sum(duplicated(songs$track_id))
## [1] 55951
spotify <- songs[!duplicated(songs$track_id),]
sum(duplicated(spotify$track_id))
## [1] 0
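The `!duplicated()` idiom used above keeps the first row for each `track_id` and drops every later repeat. A minimal sketch on a made-up data frame shows which rows survive (`toy` and its values are hypothetical, not taken from the Spotify file):

```r
# Hypothetical data: tracks "a" and "b" each appear twice, under different genres.
toy <- data.frame(
  track_id = c("a", "b", "a", "c", "b"),
  genre    = c("Pop", "Rock", "Dance", "Jazz", "Indie"),
  stringsAsFactors = FALSE
)

# duplicated() marks the second and later occurrences of a value.
dup <- duplicated(toy$track_id)
sum(dup)

# Negating the flag keeps the first occurrence of every track_id.
toy_unique <- toy[!dup, ]
toy_unique
```

Note that the genre kept for a repeated track is simply the one that appears first in the file. That does not matter here because genre is not used in the clustering, but it would matter if genre were analysed later.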
We want to cluster the songs by track_id, so assign the track_id column to the row names.
rownames(spotify) <- spotify$track_id
Select songs with popularity >= 70 to speed up computation.
spotify <- spotify %>% filter(popularity >= 70)
Because k-means works only on numeric variables, keep only the numeric columns.
spotify <- spotify %>% select_if(is.numeric)
Check the numeric variables:
str(spotify)
## 'data.frame': 3835 obs. of 11 variables:
## $ popularity : int 71 76 70 72 70 70 70 71 70 70 ...
## $ acousticness : num 0.735 0.0233 0.41 0.323 0.198 0.53 0.15 0.0983 0.023 0.0104 ...
## $ danceability : num 0.501 0.845 0.721 0.637 0.781 0.679 0.866 0.596 0.689 0.549 ...
## $ duration_in_min : num 4.3 3.13 3.56 3.65 4.49 ...
## $ energy : num 0.378 0.709 0.881 0.73 0.745 0.44 0.749 0.557 0.791 0.791 ...
## $ instrumentalness: num 0.00 0.00 7.61e-06 0.00 1.14e-05 5.15e-06 0.00 0.00 0.00 0.00 ...
## $ liveness : num 0.119 0.094 0.292 0.0981 0.36 0.228 0.0591 0.0565 0.0526 0.444 ...
## $ loudness : num -9.37 -4.55 -2.53 -5.38 -5.81 ...
## $ speechiness : num 0.029 0.0714 0.342 0.0874 0.0332 0.351 0.253 0.134 0.053 0.133 ...
## $ tempo : num 120 98.1 127.8 93.9 130 ...
## $ valence : num 0.178 0.62 0.643 0.732 0.326 0.0336 0.891 0.661 0.755 0.293 ...
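The columns above sit on very different scales: loudness is in decibels, tempo is in BPM, and most other features lie between 0 and 1. K-means relies on Euclidean distance, so without standardization the wide-ranging columns would dominate the result. The sketch below shows what `scale()` does to three of the columns. The `toy` data frame is a hypothetical stand-in built from the first four rounded values printed above.

```r
# First four (rounded) values of three columns, as printed by str() above.
toy <- data.frame(
  loudness = c(-9.37, -4.55, -2.53, -5.38),
  tempo    = c(120, 98.1, 127.8, 93.9),
  valence  = c(0.178, 0.62, 0.643, 0.732)
)

# scale() subtracts each column's mean and divides by its standard deviation.
toy_scaled <- scale(toy)

# After scaling, every column has mean 0 and standard deviation 1.
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)
```

After this transformation a one-unit difference means "one standard deviation" in every column, so no single feature outweighs the others in the distance calculation.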
Scale the data to standardize it.
spotify_scale <- scale(spotify)
Search for the optimal k with the fviz_nbclust function from the factoextra package.
RNGkind(sample.kind = "Rounding")
set.seed(572)
library(factoextra)
fviz_nbclust(x = spotify_scale,
             FUNcluster = kmeans,
             method = "wss")
From the graph above, let’s take k = 4 as the optimal number of clusters for k-means.
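The "wss" curve plots the total within-cluster sum of squares against k, and the "elbow" is the point where adding another cluster stops reducing it by much. As a rough sketch of the quantity behind the plot, it can be computed by hand. The built-in `iris` measurements are used here as a stand-in for `spotify_scale`, and `nstart = 10` is an assumption that does not appear in the original analysis.

```r
set.seed(572)  # kmeans starts from random centers

# Stand-in data: the four numeric columns of the built-in iris data set,
# standardized the same way as spotify_scale.
x <- scale(iris[, 1:4])

# Total within-cluster sum of squares for k = 1..6.
# nstart = 10 (an assumption) reruns kmeans from several random starts
# and keeps the best fit, which makes the curve less noisy.
wss <- sapply(1:6, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

round(wss, 1)
```

Calling `plot(1:6, wss, type = "b")` on the result draws the same kind of curve. For k = 1 the value equals (n - 1) times the number of columns on standardized data, which is a handy sanity check.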
Run the clustering with the optimal k found with the elbow method above.
RNGkind(sample.kind = "Rounding")
set.seed(100)
spotify_kmeans <- kmeans(spotify_scale, centers = 4)
spotify_kmeans$size
## [1] 690 1396 1065 684
From the output above, cluster 1 has 690 observations, cluster 2 has 1396, cluster 3 has 1065, and cluster 4 has 684. To learn more about each cluster, let’s profile them with the ggRadar function from the ggiraphExtra package.
# Assign clustering result into data
spotify$cluster <- spotify_kmeans$cluster
library(ggiraphExtra)
ggRadar(data = spotify, mapping = aes(colours = cluster), interactive = TRUE)
To dig deeper, we can summarize which cluster has the lowest and the highest mean value of each variable.
library(tidyverse)
spotify_centroid <- spotify %>% group_by(cluster) %>% summarise_all(mean)
spotify_centroid %>%
  pivot_longer(-cluster) %>%
  group_by(name) %>%
  summarize(
    min = which.min(value),
    max = which.max(value))
From the information above, here are some insights that can help with decision-making:
Cluster 1 : 690 observations
Cluster 2 : 1396 observations
Cluster 3 : 1065 observations
Cluster 4 : 684 observations
Audio characteristics of each cluster:
Cluster 1 Highest : acousticness, instrumentalness; Lowest : energy, liveness, loudness, speechiness, tempo, valence
Cluster 2 Highest : loudness, valence; Lowest : not specific
Cluster 3 Highest : duration_in_min, energy, liveness, tempo; Lowest : acousticness, danceability, popularity
Cluster 4 Highest : danceability, popularity, speechiness; Lowest : duration_in_min, instrumentalness