2018 has been a fantastic year for data enthusiasts, with numerous opportunities to access intriguing data. Platforms like Kaggle, which boasts over 10,000 published datasets across various industries, are particularly valuable. Google, which owns Kaggle, has also introduced a dataset search tool, making it as simple to find datasets as installing a data science library such as Pandas.
For those eager to dive into data, APIs offer a great way to obtain valuable information. Tech giants like Twitter, Slack, and Google provide APIs that enable developers to build applications and extract data for analysis.
This series of articles explores how the Spotify Web API was used to retrieve data automatically. Future articles will discuss leveraging data science tools such as Python, SQL, and Bash to gain insights from the data.
Source : https://www.kaggle.com/datasets/zaheenhamidani/ultimate-spotify-tracks-db/data
The goal is to perform clustering analysis using the K-means method. Additionally, the possibility of applying dimensionality reduction through Principal Component Analysis (PCA) will be explored.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(cowplot)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(FactoMineR)
library(scales)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ lubridate::stamp() masks cowplot::stamp()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(lubridate)
library(cluster)
library(ggforce)
options(scipen = 100, max.print = 101)
data <- read.csv("SpotifyFeatures.csv")
head(data)
## genre artist_name track_name
## 1 Movie Henri Salvador C'est beau de faire un Show
## 2 Movie Martin & les fées Perdu d'avance (par Gad Elmaleh)
## 3 Movie Joseph Williams Don't Let Me Be Lonely Tonight
## 4 Movie Henri Salvador Dis-moi Monsieur Gordon Cooper
## 5 Movie Fabien Nataf Ouverture
## track_id popularity acousticness danceability duration_ms
## 1 0BRjO6ga9RKCKjfDqeFgWV 0 0.611 0.389 99373
## 2 0BjC1NfoEOOusryehmNudP 1 0.246 0.590 137373
## 3 0CoSDzoNIKCRs124s9uTVy 3 0.952 0.663 170267
## 4 0Gc6TVm52BwZD07Ki6tIvf 0 0.703 0.240 152427
## 5 0IuslXpMROHdEPvSl1fTQK 4 0.950 0.331 82625
## energy instrumentalness key liveness loudness mode speechiness tempo
## 1 0.910 0.000 C# 0.3460 -1.828 Major 0.0525 166.969
## 2 0.737 0.000 F# 0.1510 -5.559 Minor 0.0868 174.003
## 3 0.131 0.000 C 0.1030 -13.879 Minor 0.0362 99.488
## 4 0.326 0.000 C# 0.0985 -12.178 Major 0.0395 171.758
## 5 0.225 0.123 F 0.2020 -21.150 Major 0.0456 140.576
## time_signature valence
## 1 4/4 0.814
## 2 4/4 0.816
## 3 5/4 0.368
## 4 4/4 0.227
## 5 4/4 0.390
## [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
Let’s check for missing values in our data.
str(data)
## 'data.frame': 232725 obs. of 18 variables:
## $ genre : chr "Movie" "Movie" "Movie" "Movie" ...
## $ artist_name : chr "Henri Salvador" "Martin & les fées" "Joseph Williams" "Henri Salvador" ...
## $ track_name : chr "C'est beau de faire un Show" "Perdu d'avance (par Gad Elmaleh)" "Don't Let Me Be Lonely Tonight" "Dis-moi Monsieur Gordon Cooper" ...
## $ track_id : chr "0BRjO6ga9RKCKjfDqeFgWV" "0BjC1NfoEOOusryehmNudP" "0CoSDzoNIKCRs124s9uTVy" "0Gc6TVm52BwZD07Ki6tIvf" ...
## $ popularity : int 0 1 3 0 4 0 2 15 0 10 ...
## $ acousticness : num 0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
## $ danceability : num 0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
## $ duration_ms : int 99373 137373 170267 152427 82625 160627 212293 240067 226200 152694 ...
## $ energy : num 0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
## $ instrumentalness: num 0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
## $ key : chr "C#" "F#" "C" "C#" ...
## $ liveness : num 0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
## $ loudness : num -1.83 -5.56 -13.88 -12.18 -21.15 ...
## $ mode : chr "Major" "Minor" "Minor" "Major" ...
## $ speechiness : num 0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
## $ tempo : num 167 174 99.5 171.8 140.6 ...
## $ time_signature : chr "4/4" "4/4" "5/4" "4/4" ...
## $ valence : num 0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...
anyNA(data)
## [1] FALSE
colSums(is.na(data))
## genre artist_name track_name track_id
## 0 0 0 0
## popularity acousticness danceability duration_ms
## 0 0 0 0
## energy instrumentalness key liveness
## 0 0 0 0
## loudness mode speechiness tempo
## 0 0 0 0
## time_signature valence
## 0 0
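One caveat about the check above: `is.na()` only counts true `NA` values, so empty strings in character columns are not flagged. A minimal sketch on a made-up data frame (the `toy` object and its values are hypothetical, not taken from the Spotify data):

```r
# Hypothetical toy data frame, used only to illustrate the checks above
toy <- data.frame(
  a = c(1, NA, 3),
  b = c("x", "", NA),
  stringsAsFactors = FALSE
)

has_na     <- anyNA(toy)                       # is there any NA at all?
na_per_col <- colSums(is.na(toy))              # NA count per column
empty_b    <- sum(toy$b == "", na.rm = TRUE)   # empty strings are not NA

print(has_na)
print(na_per_col)
print(empty_b)
```

If empty strings should count as missing, they would need to be converted to `NA` before running the same checks.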
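To make processing easier, let’s convert each column to an appropriate data type: names and identifiers to character, measurements to numeric, and key, mode, and time signature to factor.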
data1 <- data %>%
  mutate(
    genre = as.character(genre),
    artist_name = as.character(artist_name),
    track_name = as.character(track_name),
    track_id = as.character(track_id),
    popularity = as.numeric(popularity),
    acousticness = as.numeric(acousticness),
    danceability = as.numeric(danceability),
    duration_ms = as.numeric(duration_ms),
    energy = as.numeric(energy),
    instrumentalness = as.numeric(instrumentalness),
    key = as.factor(key),
    liveness = as.numeric(liveness),
    loudness = as.numeric(loudness),
    mode = as.factor(mode),
    speechiness = as.numeric(speechiness),
    tempo = as.numeric(tempo),
    time_signature = as.factor(time_signature),
    valence = as.numeric(valence)
  )
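With the data prepared, let’s start exploring it. Only the numeric columns are kept for the analysis.
# data1 has no `name` column, so data1$name is NULL; assigning NULL simply
# resets the row names to the default 1..n sequence.
rownames(data1) <- NULL
data2 <- data1 %>%
  select(-c(genre, artist_name, track_name, track_id, key, mode, time_signature))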
data2 %>% str()
## 'data.frame': 232725 obs. of 11 variables:
## $ popularity : num 0 1 3 0 4 0 2 15 0 10 ...
## $ acousticness : num 0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
## $ danceability : num 0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
## $ duration_ms : num 99373 137373 170267 152427 82625 ...
## $ energy : num 0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
## $ instrumentalness: num 0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
## $ liveness : num 0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
## $ loudness : num -1.83 -5.56 -13.88 -12.18 -21.15 ...
## $ speechiness : num 0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
## $ tempo : num 167 174 99.5 171.8 140.6 ...
## $ valence : num 0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...
summary(data2)
## popularity acousticness danceability duration_ms
## Min. : 0.00 Min. :0.0000 Min. :0.0569 Min. : 15387
## 1st Qu.: 29.00 1st Qu.:0.0376 1st Qu.:0.4350 1st Qu.: 182857
## Median : 43.00 Median :0.2320 Median :0.5710 Median : 220427
## Mean : 41.13 Mean :0.3686 Mean :0.5544 Mean : 235122
## 3rd Qu.: 55.00 3rd Qu.:0.7220 3rd Qu.:0.6920 3rd Qu.: 265768
## Max. :100.00 Max. :0.9960 Max. :0.9890 Max. :5552917
## energy instrumentalness liveness loudness
## Min. :0.0000203 Min. :0.0000000 Min. :0.00967 Min. :-52.457
## 1st Qu.:0.3850000 1st Qu.:0.0000000 1st Qu.:0.09740 1st Qu.:-11.771
## Median :0.6050000 Median :0.0000443 Median :0.12800 Median : -7.762
## Mean :0.5709577 Mean :0.1483012 Mean :0.21501 Mean : -9.570
## 3rd Qu.:0.7870000 3rd Qu.:0.0358000 3rd Qu.:0.26400 3rd Qu.: -5.501
## Max. :0.9990000 Max. :0.9990000 Max. :1.00000 Max. : 3.744
## speechiness tempo valence
## Min. :0.0222 Min. : 30.38 Min. :0.0000
## 1st Qu.:0.0367 1st Qu.: 92.96 1st Qu.:0.2370
## Median :0.0501 Median :115.78 Median :0.4440
## Mean :0.1208 Mean :117.67 Mean :0.4549
## 3rd Qu.:0.1050 3rd Qu.:139.05 3rd Qu.:0.6600
## Max. :0.9670 Max. :242.90 Max. :1.0000
The distribution of several features, such as acousticness,
instrumentalness, and liveness, shows clear differences between Major
and Minor modes, indicating good potential for clustering based on these
features. However, other features like popularity and loudness display
more similar distributions, which may be less effective for use as a
basis for clustering.
ggcorr(data2, label = TRUE, hjust = 1, layout.exp = 2)
The correlation plot shows strong correlations among several audio features (such as energy, danceability, and acousticness). PCA can be used to identify principal components that explain most of the variance in the data, thereby reducing dimensionality and removing this redundancy.
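To see why correlated features are redundant, here is a small sketch on simulated data (not the Spotify set; the variables `x` and `y` are made up for illustration). When two standardized variables are strongly correlated, the first principal component alone carries nearly all of their variance:

```r
# Toy illustration with simulated data: y is mostly a noisy copy of x
set.seed(1)
x <- rnorm(500)
y <- x + rnorm(500, sd = 0.3)
toy <- data.frame(x = x, y = y)

toy_cor <- cor(toy$x, toy$y)
toy_pca <- prcomp(toy, scale. = TRUE)

# Share of total variance carried by each principal component
toy_share <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)

print(toy_cor)
print(toy_share)
```

For two standardized variables with correlation r, the first component's share is (1 + r) / 2, so the stronger the correlation, the less information the second dimension adds.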
data_scale <- scale(data2)
head(data_scale)
## popularity acousticness danceability duration_ms energy
## [1,] -2.261002 0.6833748 -0.8909329 -1.1413655 1.2869052
## [2,] -2.206026 -0.3454664 0.1919933 -0.8218657 0.6302479
## [3,] -2.096075 1.6445663 0.5852948 -0.5452965 -1.6699502
## [4,] -2.261002 0.9426992 -1.6936990 -0.6952933 -0.9297874
## [5,] -2.041100 1.6389288 -1.2034190 -1.2821808 -1.3131538
## [6,] -2.261002 1.0723614 0.1273410 -0.6263486 -1.8073548
## instrumentalness liveness loudness speechiness tempo valence
## [1,] -0.48981747 0.66065975 1.2907007 -0.3679692 1.5956039 1.3807413
## [2,] -0.48981747 -0.32283477 0.6686811 -0.1830817 1.8232495 1.3884316
## [3,] -0.48981747 -0.56492573 -0.7184009 -0.4558311 -0.5883245 -0.3342114
## [4,] -0.48981747 -0.58762176 -0.4348159 -0.4380431 1.7505932 -0.8763826
## [5,] -0.08356631 -0.06561313 -1.9305971 -0.4051623 0.7414313 -0.2496173
## [6,] -0.48981747 -0.54475148 -0.9002886 0.1198533 -0.9769791 -0.3726633
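What `scale()` does can be checked on a tiny matrix: each column is centered on its mean and divided by its standard deviation, so features measured in very different units (for example `duration_ms` versus `valence`) become comparable. The matrix `m` below is a made-up example:

```r
# Made-up matrix with two columns on very different scales
m <- cbind(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))

z <- scale(m)  # center and scale each column

# Same computation by hand: subtract column means, divide by column sds
centered <- sweep(m, 2, colMeans(m))
manual   <- sweep(centered, 2, apply(m, 2, sd), "/")

print(z)
print(manual)
```

After scaling, every column has mean 0 and standard deviation 1, which keeps large-valued features from dominating PCA and the distance calculations in K-means.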
The data exhibits significant variation across audio features, and after scaling the data, PCA will be employed to identify the principal components that explain most of the variance in the dataset.
pca_data <- PCA(X = data_scale,
                scale.unit = FALSE,
                graph = FALSE,
                ncp = 11)
The PCA results include various statistics and coordinates for both variables and individuals, providing insights into the principal components and their contributions to the variance in the dataset.
To proceed with PCA modeling, we will analyze the eigenvalues and eigenvectors to understand the variance captured by each principal component, and then use this information to transform the original data into new values based on the principal components.
pca_data$eig
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 3.6104430 32.822350 32.82235
## comp 2 1.7100248 15.545747 48.36810
## comp 3 1.1712427 10.647707 59.01580
## comp 4 0.9998305 9.089407 68.10521
## comp 5 0.8617186 7.833839 75.93905
## comp 6 0.7567533 6.879605 82.81866
## comp 7 0.6378529 5.798688 88.61734
## comp 8 0.4853764 4.412532 93.02988
## comp 9 0.3751899 3.410832 96.44071
## comp 10 0.2767438 2.515864 98.95657
## comp 11 0.1147767 1.043429 100.00000
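The percentage columns follow directly from the eigenvalues: each percentage is the eigenvalue divided by the sum of all eigenvalues. A short check using the eigenvalues printed above (copied by hand into the vector `eig`, so small rounding differences are possible):

```r
# Eigenvalues copied from the pca_data$eig output above
eig <- c(3.6104430, 1.7100248, 1.1712427, 0.9998305, 0.8617186, 0.7567533,
         0.6378529, 0.4853764, 0.3751899, 0.2767438, 0.1147767)

pct     <- eig / sum(eig) * 100   # percentage of variance per component
cum_pct <- cumsum(pct)            # cumulative percentage of variance

print(sum(eig))
print(round(pct, 2))
print(round(cum_pct, 2))
```

The eigenvalues sum to a value very close to 11, the number of standardized variables, which is why an eigenvalue near 1 (such as component 4) corresponds to roughly the variance of one original feature.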
as.data.frame(pca_data$ind$coord) %>% head()
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## 1 0.9904481 0.9993903 -0.1597105 -3.0962912 0.71325308 -0.74345195
## 2 1.2096105 0.2730414 -0.6842276 -2.7298245 1.28269932 -0.12134335
## 3 -2.1123978 0.3531611 -1.8668172 -0.2666744 0.70972338 0.39952609
## 4 -1.9545010 -0.1868432 0.2513141 -2.6613814 -0.02211554 0.60691571
## 5 -2.9355815 0.3915754 -1.1231315 -2.0937683 -0.02518811 0.41551544
## 6 -2.2612410 0.7007202 -1.7307093 -0.1446502 0.52331036 0.03778243
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## 1 -1.3200974 0.4246202 0.576058781 1.16234095 0.04739096
## 2 -0.9083572 -0.4478824 -0.057434104 0.09104371 -0.13059333
## 3 -1.4628733 -0.4420655 -0.872496223 0.59542626 -0.09975048
## 4 -1.7930776 -0.5882799 -0.041750169 0.17564117 -0.22897820
## 5 -1.1238702 0.2555989 -0.007753993 -0.30349575 0.41629524
## 6 -1.5234275 -0.7598055 -0.626642226 -0.09946476 -0.44224934
The Individual & Variable Factor Map visualization displays the relationships and projections of individuals and variables onto the principal components, highlighting how each contributes to the overall structure of the data.
custom_colors <- c("cyan")
plot.PCA(
x = pca_data,
choix = "ind",
select = "contrib 10",
invisible = "quali",
col.ind = custom_colors
)
## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
The first two principal components capture a little under half of the variance: Principal Component 1 (Dim 1) explains approximately 32.82% and Principal Component 2 (Dim 2) about 15.55%, for a combined 48.37%. The data are spread relatively evenly along Principal Component 1, while along Principal Component 2 most points cluster around zero, with some outliers showing extreme values. This suggests greater variability along the first dimension than along the second.
plot.PCA(x = pca_data,
choix = "var")
The first two principal components capture approximately 48.37% of the total variance in the audio data, with Component 1 representing a spectrum from high-energy to more calm and acoustic tracks, as it positively correlates with features like energy, danceability, and loudness, and negatively with acousticness and instrumentalness.
fviz_contrib(X = pca_data,
choice = "var",
axes = 2)
The variable contribution plot for Dimension 2 shows that “danceability” has the most significant impact, suggesting that Dimension 2 represents the “danceability” or “rhythm” of a song.
data_keep <- as.data.frame(pca_data$ind$coord[,1:6])
data_keep %>% head()
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## 1 0.9904481 0.9993903 -0.1597105 -3.0962912 0.71325308 -0.74345195
## 2 1.2096105 0.2730414 -0.6842276 -2.7298245 1.28269932 -0.12134335
## 3 -2.1123978 0.3531611 -1.8668172 -0.2666744 0.70972338 0.39952609
## 4 -1.9545010 -0.1868432 0.2513141 -2.6613814 -0.02211554 0.60691571
## 5 -2.9355815 0.3915754 -1.1231315 -2.0937683 -0.02518811 0.41551544
## 6 -2.2612410 0.7007202 -1.7307093 -0.1446502 0.52331036 0.03778243
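The analysis keeps the first six dimensions without stating a selection rule. One common heuristic, assumed here purely for illustration, is to keep the smallest number of components whose cumulative variance reaches a chosen threshold. With an assumed threshold of 80%, the cumulative percentages reported earlier lead to the same choice:

```r
# Cumulative percentage of variance, copied from the pca_data$eig output
cum_pct <- c(32.82235, 48.36810, 59.01580, 68.10521, 75.93905, 82.81866,
             88.61734, 93.02988, 96.44071, 98.95657, 100.00000)

threshold <- 80  # assumed cutoff, not stated in the original analysis

# Index of the first component whose cumulative variance reaches the cutoff
n_keep <- which(cum_pct >= threshold)[1]

print(n_keep)           # 6
print(cum_pct[n_keep])  # 82.81866
```

A different threshold would give a different count, so the cutoff should be treated as a tunable choice rather than a fixed rule.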
data_small <- data2 %>% head(100)
pca_small <- prcomp(data_small, scale. = TRUE)
biplot(x = pca_small,
cex = 0.7,
scale = FALSE)
The data shows substantial variation along both PC1 and PC2, with certain feature groups clustering in specific areas, suggesting correlations between features, while the contributions of features like “danceability” and “energy” in shaping the principal components need further confirmation through loading plots or contribution tables.
data2[55,]
## popularity acousticness danceability duration_ms energy instrumentalness
## 55 0 0.924 0.683 101653 0.147 0
## liveness loudness speechiness tempo valence
## 55 0.606 -21.998 0.822 32.244 0.595
data2[16,]
## popularity acousticness danceability duration_ms energy instrumentalness
## 16 0 0.548 0.588 2447870 0.405 0
## liveness loudness speechiness tempo valence
## 16 0.754 -15.55 0.938 83.56 0.48
data2[97,]
## popularity acousticness danceability duration_ms energy instrumentalness
## 97 0 0.84 0.688 3435625 0.331 0
## liveness loudness speechiness tempo valence
## 97 0.0673 -8.645 0.772 102.244 0.529
K-Means Clustering is a method of grouping data based on similarity, represented through distance metrics, which requires numerical data for effective modeling.
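The distance-based idea can be sketched on R's built-in `iris` data, used here only as a small stand-in for the Spotify features. After K-means converges, each observation belongs to the cluster whose centroid is nearest in squared Euclidean distance:

```r
# Stand-in example on the built-in iris data (numeric columns only)
set.seed(42)
X  <- scale(iris[, 1:4])
km <- kmeans(X, centers = 3, nstart = 10)

# Squared Euclidean distance from every row to every centroid
d2 <- sapply(seq_len(nrow(km$centers)), function(k) {
  rowSums(sweep(X, 2, km$centers[k, ])^2)
})

# Index of the closest centroid for each observation
nearest <- max.col(-d2, ties.method = "first")

print(table(assigned = km$cluster, nearest = nearest))
```

The cross-table is diagonal, meaning the assigned cluster and the nearest centroid agree for every row. This is also why non-numeric columns such as `key` and `mode` had to be dropped: Euclidean distance is only defined for numeric inputs.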
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
# Plot the total within-cluster sum of squares (WSS) for k = 2..maxK
kmeansTunning <- function(data, maxK) {
  withinall <- NULL
  total_k <- NULL
  for (i in 2:maxK) {
    set.seed(101)  # same seed for every k so the runs are comparable
    temp <- kmeans(data, i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k, i)
  }
  plot(x = total_k, y = withinall, type = "o", xlab = "Cluster", ylab = "Total")
}
kmeansTunning(data_scale, maxK = 6)
The plot shows a decrease in the total within-cluster sum of squares (WSS) as the number of clusters increases. However, this decrease starts to slow down after reaching 3 clusters, suggesting that 3 clusters may be an optimal choice.
The code below sets a random seed for reproducibility and performs K-Means clustering on the scaled data. Note that it uses 5 clusters rather than the 3 suggested by the elbow plot.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
data_cluster <- kmeans(x = data_scale,
centers = 5)
data_cluster$size
## [1] 60940 86609 29700 10111 45365
data_cluster$centers
## popularity acousticness danceability duration_ms energy instrumentalness
## 1 0.07985509 -0.7489145 -0.31576634 0.08376743 0.7581379 -0.1571035
## 2 0.45783618 -0.5509724 0.77070603 -0.08734646 0.4015419 -0.3278841
## 3 -0.79234540 1.3780146 -1.44247233 0.23239420 -1.5619442 1.7361848
## 4 -1.12475422 1.1971189 0.04380536 0.07222012 0.3461937 -0.4857360
## 5 -0.21192669 0.8889432 -0.11261470 -0.11401130 -0.8396052 -0.1913769
## liveness loudness speechiness tempo valence
## 1 0.1767196 0.5947072 -0.15163567 0.8780751 -0.02396741
## 2 -0.2150034 0.4576242 -0.07841903 -0.2938935 0.58362231
## 3 -0.3041790 -1.8847884 -0.39716186 -0.5059485 -1.17793132
## 4 2.5972431 -0.4103686 4.07706176 -0.6370862 -0.15647396
## 5 -0.2066498 -0.3471483 -0.29527152 -0.1452184 -0.27597713
head(data_cluster$cluster)
## [1] 1 1 5 5 3 5
data_cluster$iter
## [1] 6
Goodness of Fit refers to how well a statistical model or algorithm represents the observed data.
data_cluster$withinss
## [1] 368381.0 381937.3 264187.0 107129.7 285441.3
The values indicate the within-cluster sum of squares (WSS) for each of the 5 clusters, providing insight into the goodness of fit, where lower WSS values suggest better cluster cohesion and a more accurate fit of the model to the data.
data_cluster$betweenss
## [1] 1152888
data_cluster$totss
## [1] 2559964
data_cluster$betweenss / data_cluster$totss
## [1] 0.4503531
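The three quantities above are linked: the total sum of squares splits into the within-cluster part and the between-cluster part, and the ratio `betweenss / totss` is the share of total variation explained by the grouping. A quick check using the values printed above (copied by hand, so they carry the rounding of the console output):

```r
# Values copied from the data_cluster output above
withinss  <- c(368381.0, 381937.3, 264187.0, 107129.7, 285441.3)
betweenss <- 1152888
totss     <- 2559964

tot_withinss <- sum(withinss)

# Decomposition: totss = tot.withinss + betweenss (up to rounding)
print(tot_withinss + betweenss)

# For standardized data, totss = (n - 1) * number of variables
print((232725 - 1) * 11)

# Share of total variation explained by the 5 clusters
ratio <- betweenss / totss
print(round(ratio, 4))
```

A ratio of about 0.45 means the 5 clusters account for roughly 45% of the total variation. The ratio generally rises as more clusters are added, so it is best compared between solutions together with the number of clusters used.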
unique(data$genre)
## [1] "Movie" "R&B" "A Capella" "Alternative"
## [5] "Country" "Dance" "Electronic" "Anime"
## [9] "Folk" "Blues" "Opera" "Hip-Hop"
## [13] "Children's Music" "Children’s Music" "Rap" "Indie"
## [17] "Classical" "Pop" "Reggae" "Reggaeton"
## [21] "Jazz" "Rock" "Ska" "Comedy"
## [25] "Soul" "Soundtrack" "World"
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
data_cluster27 <- kmeans(x = data_scale,
centers = 27)
data_cluster27$withinss
## [1] 22575.059 27644.655 42819.421 24646.616 24215.116 29743.248 22348.232
## [8] 28213.419 25815.628 21078.422 24295.697 56681.674 27028.144 35776.112
## [15] 24123.662 33350.647 23510.063 26924.699 25370.206 27777.885 6301.294
## [22] 26402.215 16263.833 56616.432 27218.945 59446.095 29374.792
data_cluster27$betweenss / data_cluster27$totss
## [1] 0.6892291
data_cluster27$size
## [1] 7638 14021 10580 4559 7274 12051 5841 11958 6483 5321 9121 10659
## [13] 11281 15181 9712 11752 8556 8649 6252 9682 108 5752 2691 9417
## [25] 14272 11277 2637
Cluster profiling involves analyzing the characteristics and attributes of each cluster to understand the distinct features and patterns of the data grouped within them.
data2$cluster <- as.factor(data_cluster27$cluster)
data2 %>% head()
## popularity acousticness danceability duration_ms energy instrumentalness
## 1 0 0.611 0.389 99373 0.9100 0.000
## 2 1 0.246 0.590 137373 0.7370 0.000
## 3 3 0.952 0.663 170267 0.1310 0.000
## 4 0 0.703 0.240 152427 0.3260 0.000
## 5 4 0.950 0.331 82625 0.2250 0.123
## 6 0 0.749 0.578 160627 0.0948 0.000
## liveness loudness speechiness tempo valence cluster
## 1 0.3460 -1.828 0.0525 166.969 0.814 1
## 2 0.1510 -5.559 0.0868 174.003 0.816 1
## 3 0.1030 -13.879 0.0362 99.488 0.368 10
## 4 0.0985 -12.178 0.0395 171.758 0.227 10
## 5 0.2020 -21.150 0.0456 140.576 0.390 10
## 6 0.1070 -14.970 0.1430 87.479 0.358 10
data_centroid <- data2 %>%
  group_by(cluster) %>%
  summarise_all(mean)
data_centroid
## # A tibble: 27 × 12
## cluster popularity acousticness danceability duration_ms energy
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 27.4 0.0868 0.460 208434. 0.855
## 2 2 60.5 0.137 0.735 219909. 0.747
## 3 3 31.1 0.904 0.228 215890. 0.0809
## 4 4 36.5 0.146 0.453 294500. 0.773
## 5 5 54.9 0.240 0.717 218027. 0.617
## 6 6 53.3 0.0664 0.435 235116. 0.790
## 7 7 39.8 0.189 0.424 384482. 0.558
## 8 8 52.2 0.231 0.539 237436. 0.540
## 9 9 17.2 0.933 0.264 244659. 0.112
## 10 10 10.5 0.850 0.464 177478. 0.247
## # ℹ 17 more rows
## # ℹ 6 more variables: instrumentalness <dbl>, liveness <dbl>, loudness <dbl>,
## # speechiness <dbl>, tempo <dbl>, valence <dbl>
data_centroid %>%
  pivot_longer(-cluster) %>%
  group_by(name) %>%
  summarize(
    kelompok_min = which.min(value),
    kelompok_max = which.max(value))
## # A tibble: 11 × 3
## name kelompok_min kelompok_max
## <chr> <int> <int>
## 1 acousticness 25 9
## 2 danceability 3 15
## 3 duration_ms 22 21
## 4 energy 3 1
## 5 instrumentalness 24 3
## 6 liveness 3 4
## 7 loudness 3 25
## 8 popularity 22 2
## 9 speechiness 3 24
## 10 tempo 8 1
## 11 valence 3 13
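The same profiling pattern can be sketched in base R on the built-in `iris` data, used here only as a small stand-in. One difference from the code above, chosen for illustration: the cluster label is looked up explicitly instead of relying on the row position returned by `which.max()`, which only matches the label when the centroid table is sorted by cluster.

```r
# Stand-in example on the built-in iris data
set.seed(7)
X  <- scale(iris[, 1:4])
km <- kmeans(X, centers = 3, nstart = 10)

# Mean of each original (unscaled) feature per cluster
profile <- aggregate(iris[, 1:4], by = list(cluster = km$cluster), FUN = mean)

# For each feature, the label of the cluster with the highest mean
top_cluster <- sapply(profile[, -1], function(v) profile$cluster[which.max(v)])

print(profile)
print(top_cluster)
```

Profiling on the unscaled features keeps the means in their original units, which makes each cluster easier to describe in plain terms.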
data_small_cluster <- data2 %>% select(-cluster) %>% head(100)
data_cluster_small <- kmeans(x = data_scale %>% head(100),
                             centers = 5)
# 2-dimensional visualization
fviz_cluster(object = data_cluster_small,
             data = data_small_cluster)
The visualization reveals 5 distinct data clusters with some overlapping
points, suggesting the potential presence of sub-groups within the
data.
data_pca <- PCA(X = data_scale %>% head(100),
                scale.unit = FALSE,
                graph = FALSE)
fviz_pca_biplot(X = data_pca,
                geom.ind = "point",
                addEllipses = TRUE)
Several observations follow from this exercise with the Spotify dataset. First, a smaller sample makes unsupervised results much easier to visualize, although the full dataset can still be analyzed. Second, the PCA and K-means results can be compared to see which variables drive cluster formation. For instance, the PCA biplot suggests a strong correlation between speechiness and liveness, as well as between danceability and loudness. The clustering output is consistent with the first pairing: in the 5-cluster solution, cluster 4 has by far the highest centroid values for both liveness (2.60) and speechiness (4.08), and in the 27-cluster profile the highest mean liveness and speechiness fall in clusters 4 and 24 respectively. Both approaches therefore surface similar structure in the data, whether through the eigenvalue and eigenvector decomposition of PCA or through distance-based grouping with K-means.