Case: Using Spotify music data, we want to cluster songs so that tracks with similar characteristics fall into the same group. The data can be downloaded here: https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db

Read Data

songs <- read.csv('SpotifyFeatures.csv')
head(songs)

Data Wrangling

Check data structure

str(songs)
## 'data.frame':    232725 obs. of  18 variables:
##  $ ï..genre        : chr  "Movie" "Movie" "Movie" "Movie" ...
##  $ artist_name     : chr  "Henri Salvador" "Martin & les fées" "Joseph Williams" "Henri Salvador" ...
##  $ track_name      : chr  "C'est beau de faire un Show" "Perdu d'avance (par Gad Elmaleh)" "Don't Let Me Be Lonely Tonight" "Dis-moi Monsieur Gordon Cooper" ...
##  $ track_id        : chr  "0BRjO6ga9RKCKjfDqeFgWV" "0BjC1NfoEOOusryehmNudP" "0CoSDzoNIKCRs124s9uTVy" "0Gc6TVm52BwZD07Ki6tIvf" ...
##  $ popularity      : int  0 1 3 0 4 0 2 15 0 10 ...
##  $ acousticness    : num  0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
##  $ danceability    : num  0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
##  $ duration_ms     : int  99373 137373 170267 152427 82625 160627 212293 240067 226200 152694 ...
##  $ energy          : num  0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
##  $ instrumentalness: num  0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
##  $ key             : chr  "C#" "F#" "C" "C#" ...
##  $ liveness        : num  0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
##  $ loudness        : num  -1.83 -5.56 -13.88 -12.18 -21.15 ...
##  $ mode            : chr  "Major" "Minor" "Minor" "Major" ...
##  $ speechiness     : num  0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
##  $ tempo           : num  167 174 99.5 171.8 140.6 ...
##  $ time_signature  : chr  "4/4" "4/4" "5/4" "4/4" ...
##  $ valence         : num  0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...
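
The ï.. prefix on the first column name above is typically the residue of a UTF-8 byte-order mark (BOM) at the start of the CSV file. The sketch below shows one way to avoid it, assuming the file is UTF-8 with a BOM. The temporary file and its contents are made up for illustration.

```r
# Hypothetical demo file: a tiny CSV that starts with a UTF-8 byte-order mark
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("genre,popularity\nMovie,0\nMovie,1\n")), tmp)

# fileEncoding = "UTF-8-BOM" asks R to strip the mark while reading
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(demo)

# Alternative when the data is already loaded: rename the first column
# names(demo)[1] <- "genre"
```

For the real data, the equivalent call would presumably be read.csv('SpotifyFeatures.csv', fileEncoding = "UTF-8-BOM"), after which the first column would be named genre.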

The songs data contains the following columns:

  • ï..genre : Genre of the music (the ï.. prefix is a byte-order-mark artifact picked up when reading the CSV)
  • artist_name : Name of the artist
  • track_name : Name of the Song
  • track_id : The Spotify ID for the track.
  • popularity : Song Popularity
  • acousticness : A confidence measure from 0.0 to 1.0 of whether the track is acoustic.
  • danceability : Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.
  • duration_ms : The duration of the track in milliseconds.
  • energy : Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.
  • instrumentalness : Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context.
  • key : The key the track is in. In this dataset it is stored as a pitch name such as "C" or "C#". The Spotify API encodes the same information as integers in standard Pitch Class notation (0 = C, 1 = C♯/D♭, 2 = D, and so on, with -1 if no key was detected).
  • liveness : Detects the presence of an audience in the recording.
  • loudness : The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks.
  • mode : Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived.
  • speechiness : Speechiness detects the presence of spoken words in a track.
  • tempo : The overall estimated tempo of a track in beats per minute (BPM).
  • time_signature : The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
  • valence : A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.

Convert the track duration from milliseconds to minutes and rename the column accordingly

library(dplyr)
songs <- songs %>% mutate(duration_ms = duration_ms/60000) %>% rename(duration_in_min = duration_ms)

head(songs)

Check for missing values. If any exist, remove them.

anyNA(songs)
## [1] FALSE
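
Here anyNA() returns FALSE, so nothing has to be removed. If it had returned TRUE, one way to handle it is sketched below on a made-up toy data frame. The column names are borrowed from the dataset, but the values are invented.

```r
# Toy data with two missing values (invented for illustration)
toy <- data.frame(
  tempo  = c(120, NA, 98),
  energy = c(0.5, 0.7, NA)
)

anyNA(toy)              # is anything missing at all?
colSums(is.na(toy))     # number of missing values per column

# Drop every row that has at least one missing value
toy_complete <- na.omit(toy)
nrow(toy_complete)
```

Dropping rows is the simplest choice. Whether it is appropriate depends on how many rows would be lost.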

Check for duplicated tracks (repeated track_id). If any exist, remove them.

sum(duplicated(songs$track_id))
## [1] 55951
spotify <- songs[!duplicated(songs$track_id),]

sum(duplicated(spotify$track_id))
## [1] 0
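
To make the behaviour of duplicated() concrete, here is a small sketch on invented data. It marks every repeat of a value after its first occurrence, so subsetting with !duplicated() keeps the first row for each track_id and drops the rest, even when the other columns differ.

```r
# Toy data: track "a" and track "b" each appear twice (values are invented)
toy <- data.frame(
  track_id = c("a", "b", "a", "c", "b"),
  genre    = c("Pop", "Rock", "Dance", "Jazz", "Indie"),
  stringsAsFactors = FALSE
)

duplicated(toy$track_id)   # TRUE marks the second and later occurrences

# Keep only the first row seen for each track_id
toy_unique <- toy[!duplicated(toy$track_id), ]
toy_unique
```

Note that the rows dropped here carry a different genre than the rows kept, so this step decides implicitly which genre label survives.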

Exploratory Data Analysis

We want to cluster the tracks themselves, so assign the track_id column as the row names.

rownames(spotify) <- spotify$track_id 

Keep only songs with popularity >= 70 to speed up computation.

spotify <- spotify %>% filter(popularity >= 70)

Because k-means works only on numeric variables, keep only the numeric columns.

spotify <- spotify %>% select_if(is.numeric)

Check the remaining numeric variables

str(spotify)
## 'data.frame':    3835 obs. of  11 variables:
##  $ popularity      : int  71 76 70 72 70 70 70 71 70 70 ...
##  $ acousticness    : num  0.735 0.0233 0.41 0.323 0.198 0.53 0.15 0.0983 0.023 0.0104 ...
##  $ danceability    : num  0.501 0.845 0.721 0.637 0.781 0.679 0.866 0.596 0.689 0.549 ...
##  $ duration_in_min : num  4.3 3.13 3.56 3.65 4.49 ...
##  $ energy          : num  0.378 0.709 0.881 0.73 0.745 0.44 0.749 0.557 0.791 0.791 ...
##  $ instrumentalness: num  0.00 0.00 7.61e-06 0.00 1.14e-05 5.15e-06 0.00 0.00 0.00 0.00 ...
##  $ liveness        : num  0.119 0.094 0.292 0.0981 0.36 0.228 0.0591 0.0565 0.0526 0.444 ...
##  $ loudness        : num  -9.37 -4.55 -2.53 -5.38 -5.81 ...
##  $ speechiness     : num  0.029 0.0714 0.342 0.0874 0.0332 0.351 0.253 0.134 0.053 0.133 ...
##  $ tempo           : num  120 98.1 127.8 93.9 130 ...
##  $ valence         : num  0.178 0.62 0.643 0.732 0.326 0.0336 0.891 0.661 0.755 0.293 ...

Data Pre-Processing

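Scale the data so that every variable is standardized. K-means relies on Euclidean distance, so without scaling, variables measured on larger scales (such as tempo) would dominate the result.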

spotify_scale <- scale(spotify)
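
As a quick illustration of what scale() does with its default arguments, the sketch below standardizes a small invented matrix and checks the result against the manual formula (x - mean) / sd, computed column by column.

```r
# Toy matrix with two variables on very different scales (invented values)
toy <- cbind(tempo = c(90, 120, 150), energy = c(0.2, 0.5, 0.8))

scaled <- scale(toy)   # default: center on the mean, divide by the sd

# The same transformation written out by hand
centered <- sweep(toy, 2, colMeans(toy))                 # subtract column means
manual   <- sweep(centered, 2, apply(toy, 2, sd), "/")   # divide by column sds

all.equal(as.vector(scaled), as.vector(manual))

colMeans(scaled)       # each column now has mean 0
apply(scaled, 2, sd)   # and standard deviation 1
```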

Search for the optimum k with the elbow method, using the fviz_nbclust function from the factoextra package.

RNGkind(sample.kind = "Rounding")
set.seed(572)
library(factoextra)

fviz_nbclust(x = spotify_scale,
             FUNcluster = kmeans,
             method = "wss")

Based on the elbow in the graph above, we take k = 4 as the optimum number of clusters for k-means.
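
The quantity behind the elbow plot is the total within-cluster sum of squares (WSS) for each candidate k. The sketch below computes it by hand, which is roughly what method = "wss" plots. It uses the built-in iris data as a stand-in for the Spotify data, and the range of k and the nstart value are arbitrary choices for illustration.

```r
# Elbow method by hand on a built-in dataset (iris stands in for the Spotify data)
set.seed(572)
x <- scale(iris[, 1:4])          # standardize the four numeric columns

ks  <- 1:8
wss <- sapply(ks, function(k) {
  # nstart = 10 reruns k-means from 10 random starts and keeps the best fit
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

print(data.frame(k = ks, wss = round(wss, 1)))
# plot(ks, wss, type = "b") would draw the elbow curve from these values
```

The "elbow" is the k after which adding another cluster reduces WSS only slightly. Reading it off the curve is a judgment call, not an exact rule.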

K-Means Clustering

Run the clustering with the optimum k found through the elbow method above.

RNGkind(sample.kind = "Rounding")
set.seed(100)
spotify_kmeans <- kmeans(spotify_scale, centers = 4)

spotify_kmeans$size
## [1]  690 1396 1065  684
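
Because k-means starts from randomly chosen centers, the result depends on the seed. The sketch below, on synthetic data, shows the nstart argument of kmeans(), which reruns the algorithm from several random starts and keeps the best solution, along with the main fields of the fitted object. The two-blob data and nstart = 25 are assumptions made for illustration, not part of the analysis above.

```r
# Synthetic data: two well-separated blobs of 50 points each
set.seed(100)
blob <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2)
)

# nstart = 25: try 25 random initialisations, keep the lowest total WSS
fit <- kmeans(blob, centers = 2, nstart = 25)

fit$size                   # number of observations per cluster
fit$centers                # cluster centroids
fit$betweenss / fit$totss  # share of total variance explained by the clustering
```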

From the output above, cluster 1 contains 690 observations, cluster 2 contains 1396, cluster 3 contains 1065, and cluster 4 contains 684. To understand what distinguishes the clusters, let’s profile them with the ggRadar function from the ggiraphExtra package.

# Assign clustering result into data
spotify$cluster <- spotify_kmeans$cluster

library(ggiraphExtra)
ggRadar(data = spotify, mapping = aes(colours = cluster), interactive = T)

To make the profile easier to read, we can compute the mean of every variable per cluster and then report, for each variable, which cluster has the lowest and which has the highest mean.

library(tidyverse)
spotify_centroid <- spotify %>% group_by(cluster) %>% summarise_all(mean) 

spotify_centroid %>% 
  pivot_longer(-cluster) %>% 
  group_by(name) %>% 
  summarize(
    # report the cluster label, not just the row position, of each extreme
    min = cluster[which.min(value)],
    max = cluster[which.max(value)])
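
To show how the min/max table is read, here is a base R sketch of the same idea on invented data: average each variable per cluster, then look up which cluster holds the smallest and the largest average. The toy values and the helper names (centroids, profile) are made up for illustration.

```r
# Toy data: three clusters, two variables (values are invented)
toy <- data.frame(
  cluster = c(1, 1, 2, 2, 3, 3),
  energy  = c(0.2, 0.3, 0.8, 0.9, 0.5, 0.6),
  tempo   = c(150, 160, 100, 110, 90, 95)
)

# Mean of every variable per cluster (the cluster "centroids")
centroids <- aggregate(. ~ cluster, data = toy, FUN = mean)

# For each variable, the cluster with the lowest and the highest mean
vars <- setdiff(names(centroids), "cluster")
profile <- data.frame(
  name = vars,
  min  = vapply(vars, function(v) centroids$cluster[which.min(centroids[[v]])],
                numeric(1), USE.NAMES = FALSE),
  max  = vapply(vars, function(v) centroids$cluster[which.max(centroids[[v]])],
                numeric(1), USE.NAMES = FALSE)
)
profile
```

In this toy case, cluster 2 has the highest mean energy and cluster 1 the highest mean tempo. The table only ranks cluster means, so it does not say how large the differences are.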

Conclusion

From the information above, here are some insights that can help with decision making.

  • Cluster 1 : 690 observations
  • Cluster 2 : 1396 observations
  • Cluster 3 : 1065 observations
  • Cluster 4 : 684 observations

Audio characteristics in every cluster

  • Cluster 1 Highest : acousticness, instrumentalness; Lowest : energy, liveness, loudness, speechiness, tempo, valence
  • Cluster 2 Highest : loudness, valence; Lowest : none in particular
  • Cluster 3 Highest : duration_in_min, energy, liveness, tempo; Lowest : acousticness, danceability, popularity
  • Cluster 4 Highest : danceability, popularity, speechiness; Lowest : duration_in_min, instrumentalness