1. Objective

The objective of this analysis is to cluster popular songs based on their audio feature ratings on Spotify. This analysis uses data from https://www.kaggle.com/datasets/zaheenhamidani/ultimate-spotify-tracks-db, which will be analysed using Principal Component Analysis (PCA) and K-Means clustering.

2. Library Setup & Data Input

2.1. Library & Setup

Here are the libraries used in this analysis:

library(tidyverse)
library(plotly)
library(GGally)
library(cowplot)
library(FactoMineR)
library(factoextra)
library(dplyr)
library(scales)
library(ggiraphExtra)

2.2. Data Input

Here is the data used in this analysis.

spotify <- read.csv("SpotifyFeatures.csv")

glimpse(spotify)
## Rows: 232,725
## Columns: 18
## $ ï..genre         <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie",~
## $ artist_name      <chr> "Henri Salvador", "Martin & les fées", "Joseph Willi~
## $ track_name       <chr> "C'est beau de faire un Show", "Perdu d'avance (par G~
## $ track_id         <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "~
## $ popularity       <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, ~
## $ acousticness     <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,~
## $ danceability     <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41~
## $ duration_ms      <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,~
## $ energy           <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270~
## $ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 1.23e-01, 0.0~
## $ key              <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G~
## $ liveness         <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105~
## $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -~
## $ mode             <chr> "Major", "Minor", "Minor", "Major", "Major", "Major",~
## $ speechiness      <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953~
## $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8~
## $ time_signature   <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/4~
## $ valence          <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533~

Column Description:

  • ï..genre: song genre
  • artist_name: name of the artist
  • track_name: song name
  • track_id: song ID
  • popularity: Song popularity score (0-100)
  • acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
  • danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
  • duration_ms: Duration of song in milliseconds
  • energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
  • instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
  • key: The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. In this dataset, the key is stored as a note name (e.g. "C#").
  • liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
  • loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
  • mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. In this dataset, the mode is stored as "Major" or "Minor".
  • speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
  • tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
  • valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
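
As a quick sanity check that the loaded data matches the ranges documented above, the minimum and maximum of every numeric column can be printed; a small sketch:

# Sketch: min and max of each numeric column, to compare with the documented
# ranges (e.g. 0-1 scores, loudness in dB, popularity 0-100)
sapply(spotify %>% select_if(is.numeric), range)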

3. Data Preparation & EDA

3.1. Data Wrangling

The following adjustments will be made to the dataset:

  • Change data type to factor: "genre", "artist_name", "key", and "mode"
  • Rename column: from "ï..genre" to "genre"

colnames(spotify)[1] = "genre"

spotify <- spotify %>% 
  mutate(genre = as.factor(genre), 
         artist_name = as.factor(artist_name),
         key = as.factor(key),
         mode = as.factor(mode)) %>% 
  relocate(mode, .after = track_id)

3.2. Missing Value Check

colSums(is.na(spotify))
##            genre      artist_name       track_name         track_id 
##                0                0                0                0 
##             mode       popularity     acousticness     danceability 
##                0                0                0                0 
##      duration_ms           energy instrumentalness              key 
##                0                0                0                0 
##         liveness         loudness      speechiness            tempo 
##                0                0                0                0 
##   time_signature          valence 
##                0                0

Based on the check above, there are no missing values in the dataset.

3.3. Duplicated Data Check

sum(duplicated(spotify))
## [1] 0

There are no duplicated rows in the dataset.

3.4. Popularity Assumption

Assumption

Since the objective of this analysis is to cluster popular songs, a minimum popularity score must be set as the basis for filtering; the threshold used in this analysis is a popularity score equal to or above 80.

pop <- spotify %>% 
  filter(popularity >= 80) %>% 
  select_if(is.numeric)

head(pop)

3.5. Scaling

pop <- pop %>% 
  select(-popularity) %>% 
  mutate_if(is.numeric, scale)

#Summary
summary(pop)
##    acousticness.V1     danceability.V1     duration_ms.V1   
##  Min.   :-0.955447   Min.   :-3.336624   Min.   :-2.915567  
##  1st Qu.:-0.791561   1st Qu.:-0.638045   1st Qu.:-0.571740  
##  Median :-0.377482   Median : 0.046210   Median :-0.091764  
##  Mean   : 0.000000   Mean   : 0.000000   Mean   : 0.000000  
##  3rd Qu.: 0.458083   3rd Qu.: 0.676648   3rd Qu.: 0.437254  
##  Max.   : 3.311964   Max.   : 2.091287   Max.   : 7.447512  
##       energy.V1       instrumentalness.V1     liveness.V1    
##  Min.   :-3.0771310   Min.   :-0.156660   Min.   :-1.245467  
##  1st Qu.:-0.6654032   1st Qu.:-0.156660   1st Qu.:-0.610231  
##  Median : 0.0624327   Median :-0.156660   Median :-0.391842  
##  Mean   : 0.0000000   Mean   : 0.000000   Mean   : 0.000000  
##  3rd Qu.: 0.7779325   3rd Qu.:-0.156117   3rd Qu.: 0.232128  
##  Max.   : 1.8943589   Max.   :17.393947   Max.   : 5.189220  
##      loudness.V1       speechiness.V1          tempo.V1      
##  Min.   :-5.278633   Min.   :-0.910396   Min.   :-2.0150107  
##  1st Qu.:-0.547862   1st Qu.:-0.708679   1st Qu.:-0.8082041  
##  Median : 0.157128   Median :-0.421064   Median :-0.0600111  
##  Mean   : 0.000000   Mean   : 0.000000   Mean   : 0.0000000  
##  3rd Qu.: 0.695544   3rd Qu.: 0.347198   3rd Qu.: 0.6869917  
##  Max.   : 1.942509   Max.   : 4.318803   Max.   : 2.8600938  
##       valence.V1     
##  Min.   :-2.0427858  
##  1st Qu.:-0.7454640  
##  Median :-0.0534979  
##  Mean   : 0.0000000  
##  3rd Qu.: 0.7117890  
##  Max.   : 2.2240328
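
The .V1 suffixes in the summary above appear because scale() returns a one-column matrix for each variable. An equivalent scaling step that keeps plain numeric columns is sketched below, using the hypothetical name pop_alt so the original object is left untouched:

# Sketch: same z-score scaling as above, but as.numeric() drops the matrix
# class, so summary(pop_alt) shows plain column names without ".V1"
pop_alt <- spotify %>% 
  filter(popularity >= 80) %>% 
  select_if(is.numeric) %>% 
  select(-popularity) %>% 
  mutate(across(everything(), ~ as.numeric(scale(.x))))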

4. Principal Component Analysis

Before performing Principal Component Analysis, the correlation among variables is checked to understand how strongly they are related, since PCA summarises strongly correlated variables into a smaller number of components.

ggcorr(pop, label = T, label_size = 2, hjust = 0.9, size = 3, layout.exp = 2)

Based on the matrix, most of the variables have low correlation with one another. This analysis aims to retain around 80% of the information in the data, therefore principal components will be accumulated until the cumulative percentage of variance reaches at least 80%.

pop.pca <- PCA(pop, scale.unit = FALSE)

pop.pca$eig
##         eigenvalue percentage of variance cumulative percentage of variance
## comp 1   2.3116319              23.134992                          23.13499
## comp 2   1.4496584              14.508294                          37.64329
## comp 3   1.1818541              11.828088                          49.47137
## comp 4   1.0405453              10.413858                          59.88523
## comp 5   0.9506650               9.514329                          69.39956
## comp 6   0.9004916               9.012190                          78.41175
## comp 7   0.7961682               7.968113                          86.37986
## comp 8   0.6162914               6.167892                          92.54776
## comp 9   0.5103521               5.107644                          97.65540
## comp 10  0.2342709               2.344601                         100.00000
fviz_eig(pop.pca, ncp = 10, addlabels = TRUE)

According to the cumulative percentage of variance, principal components 1 to 7 will be used, retaining around 86.37% of the information in the data.
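
If the reduced data itself is needed later (for example, to cluster on the retained components instead of on the scaled features), the scores of the first seven components can be extracted from a PCA object; a minimal sketch using the hypothetical names pop.pca7 and pop.scores7:

# Sketch: re-run PCA keeping 7 components (FactoMineR keeps 5 by default),
# then extract the individual scores (~86.4% of the variance) as a data frame
pop.pca7 <- PCA(pop, scale.unit = FALSE, ncp = 7, graph = FALSE)
pop.scores7 <- as.data.frame(pop.pca7$ind$coord)
head(pop.scores7)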

Based on the data above, here is the PCA Plot.

plot.PCA(x = pop.pca, 
         choix = "ind",
         invisible = "quali", 
         select = "contrib 5")

The PCA plot above shows that the main outliers are observations 486, 505, 821, 919, and 1224. Therefore, before clustering, these outliers will be removed from the data.

Next, we examine the variables contributing to each of the first two dimensions, PC1 and PC2.

fviz_contrib(X = pop.pca, choice = "var", axes = 1)

fviz_contrib(X = pop.pca, choice = "var", axes = 2)

Based on the two plots above, the energy, loudness, and acousticness variables contribute strongly to Dim-1 (PC1), while danceability, speechiness, and duration_ms contribute strongly to Dim-2 (PC2).
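
The exact contribution percentages behind these two plots can also be read directly from the PCA object; a short sketch:

# Sketch: variable contributions (in %) to the first two dimensions
round(pop.pca$var$contrib[, 1:2], 2)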

5. K-Means Clustering

Removing Outliers

#Remove Outliers & Scaling

pop.outliers <- c(486, 505, 821, 919, 1224)
pop.clean <- pop[!(row.names(pop) %in% pop.outliers),] %>% 
  scale() %>% 
  as.data.frame()

head(pop.clean)

Finding the Optimum Number of K

fviz_nbclust(x = pop.clean, 
             FUNcluster = kmeans, 
             method = "wss", 
             print.summary = TRUE)

RNGkind(sample.kind = "Rounding")
set.seed(192)

kmeansTunning <- function(data, maxK) {
  withinall <- NULL
  total_k <- NULL
  for (i in 1:maxK) {
    set.seed(192)
    temp <- kmeans(data,i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k,i)
  }
  plot(x = total_k, y = withinall, type = "o", xlab = "Number of Cluster", ylab = "Total within")
}

kmeansTunning(pop.clean, maxK = 10)

Based on the two plots above, the most significant reduction in total within-cluster sum of squares occurs when moving from 1 to 2 clusters; nevertheless, 3 clusters will be used in this analysis.
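
Since the elbow in the plots above is not very sharp, a silhouette-based check from the same factoextra package can help confirm the choice of k; a brief sketch:

# Sketch: average silhouette width for each number of clusters;
# the peak suggests a k to compare against the elbow plots above
fviz_nbclust(x = pop.clean, 
             FUNcluster = kmeans, 
             method = "silhouette")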

Building Clusters

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(192)

pop.cluster <- kmeans(x = pop.clean, centers = 3)

fviz_cluster(object = pop.cluster, data = pop.clean)+theme_classic()

pop.clean$cluster <- as.factor(pop.cluster$cluster)

head(pop.clean)

Rescaling Features to 0 - 100

pop.clean$acousticness <- rescale(pop.clean$acousticness, to = c(0,100))
pop.clean$danceability <- rescale(pop.clean$danceability, to = c(0,100))
pop.clean$duration_ms <- rescale(pop.clean$duration_ms, to = c(0,100))
pop.clean$energy <- rescale(pop.clean$energy, to = c(0,100))
pop.clean$instrumentalness <- rescale(pop.clean$instrumentalness, to = c(0,100))
pop.clean$liveness <- rescale(pop.clean$liveness, to = c(0,100))
pop.clean$loudness <- rescale(pop.clean$loudness, to = c(0,100))
pop.clean$speechiness <- rescale(pop.clean$speechiness, to = c(0,100))
pop.clean$tempo <- rescale(pop.clean$tempo, to = c(0,100))
pop.clean$valence <- rescale(pop.clean$valence, to = c(0,100))

head(pop.clean)
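
The ten rescale() calls above could equivalently be collapsed into a single across() call; a sketch of that alternative:

# Sketch: alternative to the ten assignments above, rescaling every numeric
# column at once (the cluster factor column is left untouched)
pop.clean <- pop.clean %>% 
  mutate(across(where(is.numeric), ~ rescale(.x, to = c(0, 100))))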

Cluster Profiling

agg.pop.clean <- pop.clean %>%  
  group_by(cluster) %>% 
  summarise_all(mean)

agg.pop.clean
agg.pop.clean %>% 
  pivot_longer(-cluster) %>% 
  ggplot(aes(x = cluster, y = value, fill = cluster)) +
  geom_col() +
  facet_wrap(~name)

ggRadar(data = agg.pop.clean, 
        aes(colour = cluster), 
        interactive=TRUE)

Cluster Profiling Characteristics:

  • Cluster 1: Low acousticness, medium danceability, medium duration_ms, high energy, low instrumentalness, low liveness, high loudness, low speechiness, medium tempo, and medium valence.

  • Cluster 2: Low acousticness, high danceability, medium duration_ms, medium energy, low instrumentalness, low liveness, high loudness, medium speechiness, medium tempo, and medium valence.

  • Cluster 3: Medium acousticness, medium danceability, medium duration_ms, medium energy, low instrumentalness, low liveness, high loudness, low speechiness, low tempo, and low valence.
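
To see which popular tracks fall into each of these profiles, the cluster labels can be joined back to the artist and track names; a sketch using the hypothetical name pop.tracks, assuming the filtered rows keep the same row names as pop so the same outliers can be dropped:

# Sketch: rebuild the popular-track metadata with the same filter, drop the
# same outlier rows, then attach the k-means cluster labels
pop.tracks <- spotify %>% 
  filter(popularity >= 80) %>% 
  select(artist_name, track_name)

pop.tracks <- pop.tracks[!(row.names(pop.tracks) %in% pop.outliers), ]
pop.tracks$cluster <- as.factor(pop.cluster$cluster)

head(pop.tracks)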

6. Conclusion

Based on the clustering analysis using PCA and K-Means, the data can be categorized into 3 clusters, with the optimum K found using the elbow method. In addition, the dimensionality can be reduced to 7 principal components while retaining around 86% of the information in the data.