The objective of this analysis is to cluster popular songs based on their Spotify audio feature ratings. The analysis uses data from https://www.kaggle.com/datasets/zaheenhamidani/ultimate-spotify-tracks-db, which will be analysed using Principal Component Analysis (PCA) and K-Means clustering.
Here are the libraries used in this analysis:
library(tidyverse)
library(plotly)
library(GGally)
library(cowplot)
library(FactoMineR)
library(factoextra)
library(dplyr)
library(scales)
library(ggiraphExtra)
Here is the data used:
spotify <- read.csv("SpotifyFeatures.csv")
glimpse(spotify)
## Rows: 232,725
## Columns: 18
## $ ï..genre <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie",~
## $ artist_name <chr> "Henri Salvador", "Martin & les fées", "Joseph Willi~
## $ track_name <chr> "C'est beau de faire un Show", "Perdu d'avance (par G~
## $ track_id <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "~
## $ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, ~
## $ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,~
## $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41~
## $ duration_ms <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,~
## $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270~
## $ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 1.23e-01, 0.0~
## $ key <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G~
## $ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105~
## $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -~
## $ mode <chr> "Major", "Minor", "Minor", "Major", "Major", "Major",~
## $ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953~
## $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8~
## $ time_signature <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/4~
## $ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533~
Column Description: the dataset contains track metadata (genre, artist_name, track_name, track_id) and Spotify audio feature ratings (popularity, acousticness, danceability, duration_ms, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, time_signature, valence).
The following adjustments will be made to the dataset:
colnames(spotify)[1] <- "genre"

spotify <- spotify %>%
  mutate(genre = as.factor(genre),
         artist_name = as.factor(artist_name),
         key = as.factor(key),
         mode = as.factor(mode)) %>%
  relocate(mode, .after = track_id)

colSums(is.na(spotify))
## genre artist_name track_name track_id
## 0 0 0 0
## mode popularity acousticness danceability
## 0 0 0 0
## duration_ms energy instrumentalness key
## 0 0 0 0
## liveness loudness speechiness tempo
## 0 0 0 0
## time_signature valence
## 0 0
The check above shows that the dataset has no missing values.
sum(duplicated(spotify))
## [1] 0
There are no duplicated rows in the dataset.
Assumption
Since the objective of this analysis is to cluster popular songs, a minimum popularity rating must be set as the basis for selecting tracks; the assumption used in this analysis is a popularity of 80 or above.
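As a quick sanity check (not part of the original workflow), the size of the subset that clears this cut-off can be inspected before any scaling:
# Sketch: count the tracks that pass the popularity filter
spotify %>%
  filter(popularity >= 80) %>%
  nrow()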
pop <- spotify %>%
  filter(popularity >= 80) %>%
  select_if(is.numeric)
head(pop)

pop <- pop %>%
  select(-popularity) %>%
  mutate_if(is.numeric, scale)
#Summary
summary(pop)
## acousticness.V1 danceability.V1 duration_ms.V1
## Min. :-0.955447 Min. :-3.336624 Min. :-2.915567
## 1st Qu.:-0.791561 1st Qu.:-0.638045 1st Qu.:-0.571740
## Median :-0.377482 Median : 0.046210 Median :-0.091764
## Mean : 0.000000 Mean : 0.000000 Mean : 0.000000
## 3rd Qu.: 0.458083 3rd Qu.: 0.676648 3rd Qu.: 0.437254
## Max. : 3.311964 Max. : 2.091287 Max. : 7.447512
## energy.V1 instrumentalness.V1 liveness.V1
## Min. :-3.0771310 Min. :-0.156660 Min. :-1.245467
## 1st Qu.:-0.6654032 1st Qu.:-0.156660 1st Qu.:-0.610231
## Median : 0.0624327 Median :-0.156660 Median :-0.391842
## Mean : 0.0000000 Mean : 0.000000 Mean : 0.000000
## 3rd Qu.: 0.7779325 3rd Qu.:-0.156117 3rd Qu.: 0.232128
## Max. : 1.8943589 Max. :17.393947 Max. : 5.189220
## loudness.V1 speechiness.V1 tempo.V1
## Min. :-5.278633 Min. :-0.910396 Min. :-2.0150107
## 1st Qu.:-0.547862 1st Qu.:-0.708679 1st Qu.:-0.8082041
## Median : 0.157128 Median :-0.421064 Median :-0.0600111
## Mean : 0.000000 Mean : 0.000000 Mean : 0.0000000
## 3rd Qu.: 0.695544 3rd Qu.: 0.347198 3rd Qu.: 0.6869917
## Max. : 1.942509 Max. : 4.318803 Max. : 2.8600938
## valence.V1
## Min. :-2.0427858
## 1st Qu.:-0.7454640
## Median :-0.0534979
## Mean : 0.0000000
## 3rd Qu.: 0.7117890
## Max. : 2.2240328
Before performing the Principal Component Analysis, the correlations among variables must be checked to understand how strongly they are related, so that strongly correlated (redundant) variables can be reduced.
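Besides the correlation plot below, the matrix can also be inspected numerically; a small sketch that ranks the feature pairs by absolute correlation:
# Sketch: list the most strongly correlated feature pairs
cor_mat <- cor(pop)
cor_pairs <- as.data.frame(as.table(cor_mat)) %>%
  filter(as.character(Var1) < as.character(Var2)) %>%
  arrange(desc(abs(Freq)))
head(cor_pairs)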
ggcorr(pop, label = T, label_size = 2, hjust = 0.9, size = 3, layout.exp = 2)
Based on the correlation matrix, most of the variables are only weakly correlated with one another. The goal is to retain around 80% of the information in the data, so principal components will be accumulated until the cumulative variance reaches at least 80%.
pop.pca <- PCA(pop, scale.unit = FALSE)
pop.pca$eig
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 2.3116319 23.134992 23.13499
## comp 2 1.4496584 14.508294 37.64329
## comp 3 1.1818541 11.828088 49.47137
## comp 4 1.0405453 10.413858 59.88523
## comp 5 0.9506650 9.514329 69.39956
## comp 6 0.9004916 9.012190 78.41175
## comp 7 0.7961682 7.968113 86.37986
## comp 8 0.6162914 6.167892 92.54776
## comp 9 0.5103521 5.107644 97.65540
## comp 10 0.2342709 2.344601 100.00000
fviz_eig(pop.pca, ncp = 11, addlabels = T)
According to the cumulative percentage of variance, principal components 1 to 7 will be used, retaining around 86.37% of the variance.
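The same cut-off can be applied programmatically; a sketch that picks the smallest number of components whose cumulative variance reaches 80%:
# Sketch: smallest number of principal components with cumulative variance >= 80%
cum_var <- pop.pca$eig[, "cumulative percentage of variance"]
which(cum_var >= 80)[1]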
Based on the data above, here is the PCA Plot.
plot.PCA(x = pop.pca,
         choix = "ind",
         invisible = "quali",
         select = "contrib 5")
The PCA plot above shows that observations 486, 505, 821, 919, and 1224 stand out as outliers. Therefore, these outliers will be removed from the data before clustering.
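The flagged points can also be approximated programmatically from the individuals' contributions stored in the PCA object; a sketch (summing contributions to the first two dimensions, which approximates but may not exactly reproduce the selection rule of plot.PCA):
# Sketch: individuals contributing most to the first two dimensions
ind_contrib <- rowSums(pop.pca$ind$contrib[, 1:2])
head(sort(ind_contrib, decreasing = TRUE), 5)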
Next, we examine which variables contribute most to each dimension, namely PC1 and PC2. Here is the process.
fviz_contrib(X = pop.pca, choice = "var", axes = 1)
fviz_contrib(X = pop.pca, choice = "var", axes = 2)
Based on the two plots above, the energy, loudness, and acousticness variables contribute strongly to Dim-1 (PC1), while danceability, speechiness, and duration_ms contribute strongly to Dim-2 (PC2).
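The underlying percentages can also be read directly from the PCA object; a sketch:
# Sketch: variable contributions (in %) to the first two principal components
round(pop.pca$var$contrib[, 1:2], 2)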
Removing Outliers
#Remove Outliers & Scaling
pop.outliers <- c(486, 505, 821, 919, 1224)
pop.clean <- pop[!(row.names(pop) %in% pop.outliers),] %>%
scale() %>%
as.data.frame()
head(pop.clean)

Finding the Optimum Number of K
fviz_nbclust(x = pop.clean,
             FUNcluster = kmeans,
             method = "wss",
             print.summary = TRUE)

RNGkind(sample.kind = "Rounding")
set.seed(192)
# Compute total within-cluster sum of squares for k = 1..maxK and plot the elbow curve
kmeansTunning <- function(data, maxK) {
  withinall <- NULL
  total_k <- NULL
  for (i in 1:maxK) {
    set.seed(192)
    temp <- kmeans(data, i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k, i)
  }
  plot(x = total_k, y = withinall, type = "o",
       xlab = "Number of Cluster", ylab = "Total within")
}
kmeansTunning(pop.clean, maxK = 10)
Based on the two plots above, the most significant reduction in total within-cluster sum of squares occurs from 1 to 2 clusters; nevertheless, 3 clusters will be used in this analysis.
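As a cross-check (not part of the original workflow), the average silhouette width offers an alternative criterion for choosing k; a sketch using the same fviz_nbclust function:
# Sketch: silhouette method for choosing the number of clusters
fviz_nbclust(x = pop.clean,
             FUNcluster = kmeans,
             method = "silhouette")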
Building Clusters
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(192)
pop.cluster <- kmeans(x = pop.clean, centers = 3)
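# Sketch (not in the original workflow): quick numeric summary of the fit --
# tracks per cluster and total within-cluster sum of squares returned by kmeans()
pop.cluster$size
pop.cluster$tot.withinss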
fviz_cluster(object = pop.cluster, data = pop.clean) + theme_classic()

pop.clean$cluster <- as.factor(pop.cluster$cluster)
head(pop.clean)

Scaling Clusters to 0 - 100
pop.clean$acousticness <- rescale(pop.clean$acousticness, to = c(0,100))
pop.clean$danceability <- rescale(pop.clean$danceability, to = c(0,100))
pop.clean$duration_ms <- rescale(pop.clean$duration_ms, to = c(0,100))
pop.clean$energy <- rescale(pop.clean$energy, to = c(0,100))
pop.clean$instrumentalness <- rescale(pop.clean$instrumentalness, to = c(0,100))
pop.clean$liveness <- rescale(pop.clean$liveness, to = c(0,100))
pop.clean$loudness <- rescale(pop.clean$loudness, to = c(0,100))
pop.clean$speechiness <- rescale(pop.clean$speechiness, to = c(0,100))
pop.clean$tempo <- rescale(pop.clean$tempo, to = c(0,100))
pop.clean$valence <- rescale(pop.clean$valence, to = c(0,100))
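# Sketch: the ten rescale() calls above could equivalently be written with a single
# dplyr::across() call (an alternative to the block above, not an additional step):
# pop.clean <- pop.clean %>%
#   mutate(across(where(is.numeric), ~ rescale(.x, to = c(0, 100))))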
head(pop.clean)

Cluster Profiling
agg.pop.clean <- pop.clean %>%
  group_by(cluster) %>%
  summarise_all(mean)
agg.pop.clean

agg.pop.clean %>%
  pivot_longer(-cluster) %>%
  ggplot(aes(x = cluster, y = value, fill = cluster)) +
  geom_col() +
  facet_wrap(~name)

ggRadar(data = agg.pop.clean,
        aes(colour = cluster),
        interactive = TRUE)

Cluster Profiling Characteristics:
Cluster 1: Low acousticness, medium danceability, medium duration_ms, high energy, low instrumentalness, low liveness, high loudness, low speechiness, medium tempo, and medium valence.
Cluster 2: Low acousticness, high danceability, medium duration_ms, medium energy, low instrumentalness, low liveness, high loudness, medium speechiness, medium tempo, and medium valence.
Cluster 3: Medium acousticness, medium danceability, medium duration_ms, medium energy, low instrumentalness, low liveness, high loudness, low speechiness, low tempo, and low valence.
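To relate these profiles back to concrete tracks, the cluster labels can be joined to the track metadata. A sketch, assuming the row order of the popularity-filtered data is preserved after dropping the five outliers (pop.meta is a hypothetical helper data frame):
# Sketch: attach cluster labels to the track metadata for inspection
pop.meta <- spotify %>%
  filter(popularity >= 80) %>%
  select(genre, artist_name, track_name) %>%
  slice(-pop.outliers) %>%
  mutate(cluster = pop.clean$cluster)
head(pop.meta)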
Based on the clustering analysis using PCA and K-Means, the data can be categorized into 3 clusters, with the optimum K found using the elbow method. In addition, the dimensionality can be reduced to 7 principal components while retaining around 86% of the variance.