Introduction

2018 has been a fantastic year for data enthusiasts, with numerous opportunities to access intriguing data. Platforms like Kaggle, which boasts over 10,000 published datasets across various industries, are particularly valuable. Google, which owns Kaggle, has also introduced a dataset search tool, making it as simple to find datasets as installing a data science library such as Pandas.

For those eager to dive into data, APIs offer a great way to obtain valuable information. Tech giants like Twitter, Slack, and Google provide APIs that enable developers to build applications and extract data for analysis.

This series of articles explores how the Spotify Web API can be used to retrieve data automatically. Future articles will discuss leveraging data science tools such as Python, SQL, and Bash to gain insights from the data.

Source: https://www.kaggle.com/datasets/zaheenhamidani/ultimate-spotify-tracks-db/data

Business Question

The goal is to perform clustering analysis using the K-means method. Additionally, the possibility of applying dimensionality reduction through Principal Component Analysis (PCA) will be explored.

Variable Description

1. Data Preparation

1.1 Prerequisites

1.2 Importing Libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(cowplot)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(FactoMineR)
library(scales)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ✔ readr     2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ lubridate::stamp()  masks cowplot::stamp()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(lubridate)
library(cluster)
library(ggforce)

options(scipen = 100, max.print = 101)

1.3 Importing Dataset

data <- read.csv("SpotifyFeatures.csv")
head(data)
##   genre       artist_name                       track_name
## 1 Movie    Henri Salvador      C'est beau de faire un Show
## 2 Movie Martin & les fées Perdu d'avance (par Gad Elmaleh)
## 3 Movie   Joseph Williams   Don't Let Me Be Lonely Tonight
## 4 Movie    Henri Salvador   Dis-moi Monsieur Gordon Cooper
## 5 Movie      Fabien Nataf                        Ouverture
##                 track_id popularity acousticness danceability duration_ms
## 1 0BRjO6ga9RKCKjfDqeFgWV          0        0.611        0.389       99373
## 2 0BjC1NfoEOOusryehmNudP          1        0.246        0.590      137373
## 3 0CoSDzoNIKCRs124s9uTVy          3        0.952        0.663      170267
## 4 0Gc6TVm52BwZD07Ki6tIvf          0        0.703        0.240      152427
## 5 0IuslXpMROHdEPvSl1fTQK          4        0.950        0.331       82625
##   energy instrumentalness key liveness loudness  mode speechiness   tempo
## 1  0.910            0.000  C#   0.3460   -1.828 Major      0.0525 166.969
## 2  0.737            0.000  F#   0.1510   -5.559 Minor      0.0868 174.003
## 3  0.131            0.000   C   0.1030  -13.879 Minor      0.0362  99.488
## 4  0.326            0.000  C#   0.0985  -12.178 Major      0.0395 171.758
## 5  0.225            0.123   F   0.2020  -21.150 Major      0.0456 140.576
##   time_signature valence
## 1            4/4   0.814
## 2            4/4   0.816
## 3            5/4   0.368
## 4            4/4   0.227
## 5            4/4   0.390
##  [ reached 'max' / getOption("max.print") -- omitted 1 rows ]

Let’s inspect the structure of the data and then check for missing values.

1.4 Data Inspection

str(data)
## 'data.frame':    232725 obs. of  18 variables:
##  $ genre           : chr  "Movie" "Movie" "Movie" "Movie" ...
##  $ artist_name     : chr  "Henri Salvador" "Martin & les fées" "Joseph Williams" "Henri Salvador" ...
##  $ track_name      : chr  "C'est beau de faire un Show" "Perdu d'avance (par Gad Elmaleh)" "Don't Let Me Be Lonely Tonight" "Dis-moi Monsieur Gordon Cooper" ...
##  $ track_id        : chr  "0BRjO6ga9RKCKjfDqeFgWV" "0BjC1NfoEOOusryehmNudP" "0CoSDzoNIKCRs124s9uTVy" "0Gc6TVm52BwZD07Ki6tIvf" ...
##  $ popularity      : int  0 1 3 0 4 0 2 15 0 10 ...
##  $ acousticness    : num  0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
##  $ danceability    : num  0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
##  $ duration_ms     : int  99373 137373 170267 152427 82625 160627 212293 240067 226200 152694 ...
##  $ energy          : num  0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
##  $ instrumentalness: num  0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
##  $ key             : chr  "C#" "F#" "C" "C#" ...
##  $ liveness        : num  0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
##  $ loudness        : num  -1.83 -5.56 -13.88 -12.18 -21.15 ...
##  $ mode            : chr  "Major" "Minor" "Minor" "Major" ...
##  $ speechiness     : num  0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
##  $ tempo           : num  167 174 99.5 171.8 140.6 ...
##  $ time_signature  : chr  "4/4" "4/4" "5/4" "4/4" ...
##  $ valence         : num  0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...

1.5 Missing Value

anyNA(data)
## [1] FALSE
colSums(is.na(data))
##            genre      artist_name       track_name         track_id 
##                0                0                0                0 
##       popularity     acousticness     danceability      duration_ms 
##                0                0                0                0 
##           energy instrumentalness              key         liveness 
##                0                0                0                0 
##         loudness             mode      speechiness            tempo 
##                0                0                0                0 
##   time_signature          valence 
##                0                0

To make processing easier, let’s convert each column to an appropriate data type.

1.6 Data Types

data1 <- data %>%
  mutate(
    genre = as.character(genre),
    artist_name = as.character(artist_name),
    track_name = as.character(track_name),
    track_id = as.character(track_id),
    popularity = as.numeric(popularity),
    acousticness = as.numeric(acousticness),
    danceability = as.numeric(danceability),
    duration_ms = as.numeric(duration_ms),
    energy = as.numeric(energy),
    instrumentalness = as.numeric(instrumentalness),
    key = as.factor(key),
    liveness = as.numeric(liveness),
    loudness = as.numeric(loudness),
    mode = as.factor(mode),
    speechiness = as.numeric(speechiness),
    tempo = as.numeric(tempo),
    time_signature = as.factor(time_signature),
    valence = as.numeric(valence)
  )

Now that the data is prepared, let’s start exploring it.

1.7 Subsetting

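# Row names keep their default integer index: data1 has no `name` column,
# so the original `rownames(data1) <- data1$name` assigned NULL and had no effect.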

data2 <- data1 %>% 
  select(-c(genre, artist_name, track_name, track_id, key, mode, time_signature))

data2 %>% str()
## 'data.frame':    232725 obs. of  11 variables:
##  $ popularity      : num  0 1 3 0 4 0 2 15 0 10 ...
##  $ acousticness    : num  0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
##  $ danceability    : num  0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
##  $ duration_ms     : num  99373 137373 170267 152427 82625 ...
##  $ energy          : num  0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
##  $ instrumentalness: num  0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
##  $ liveness        : num  0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
##  $ loudness        : num  -1.83 -5.56 -13.88 -12.18 -21.15 ...
##  $ speechiness     : num  0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
##  $ tempo           : num  167 174 99.5 171.8 140.6 ...
##  $ valence         : num  0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...
summary(data2)
##    popularity      acousticness     danceability     duration_ms     
##  Min.   :  0.00   Min.   :0.0000   Min.   :0.0569   Min.   :  15387  
##  1st Qu.: 29.00   1st Qu.:0.0376   1st Qu.:0.4350   1st Qu.: 182857  
##  Median : 43.00   Median :0.2320   Median :0.5710   Median : 220427  
##  Mean   : 41.13   Mean   :0.3686   Mean   :0.5544   Mean   : 235122  
##  3rd Qu.: 55.00   3rd Qu.:0.7220   3rd Qu.:0.6920   3rd Qu.: 265768  
##  Max.   :100.00   Max.   :0.9960   Max.   :0.9890   Max.   :5552917  
##      energy          instrumentalness       liveness          loudness      
##  Min.   :0.0000203   Min.   :0.0000000   Min.   :0.00967   Min.   :-52.457  
##  1st Qu.:0.3850000   1st Qu.:0.0000000   1st Qu.:0.09740   1st Qu.:-11.771  
##  Median :0.6050000   Median :0.0000443   Median :0.12800   Median : -7.762  
##  Mean   :0.5709577   Mean   :0.1483012   Mean   :0.21501   Mean   : -9.570  
##  3rd Qu.:0.7870000   3rd Qu.:0.0358000   3rd Qu.:0.26400   3rd Qu.: -5.501  
##  Max.   :0.9990000   Max.   :0.9990000   Max.   :1.00000   Max.   :  3.744  
##   speechiness         tempo           valence      
##  Min.   :0.0222   Min.   : 30.38   Min.   :0.0000  
##  1st Qu.:0.0367   1st Qu.: 92.96   1st Qu.:0.2370  
##  Median :0.0501   Median :115.78   Median :0.4440  
##  Mean   :0.1208   Mean   :117.67   Mean   :0.4549  
##  3rd Qu.:0.1050   3rd Qu.:139.05   3rd Qu.:0.6600  
##  Max.   :0.9670   Max.   :242.90   Max.   :1.0000

2. Data Wrangling

2.1 Clustering Potential

The distribution of several features, such as acousticness, instrumentalness, and liveness, shows clear differences between Major and Minor modes, indicating good potential for clustering based on these features. However, other features like popularity and loudness display more similar distributions, which may be less effective for use as a basis for clustering.

3. Principal Component Analysis (PCA)

3.1 Possibility for Principal Component Analysis (PCA)

ggcorr(data2, label = TRUE, hjust = 1, layout.exp = 2)

The correlation plot shows strong correlations among several audio features (such as energy, danceability, and acousticness). PCA can be used to identify principal components that explain most of the variance in the data, thereby reducing dimensionality and eliminating redundancy.

3.2 Data Scale

data_scale <- scale(data2)
head(data_scale)
##      popularity acousticness danceability duration_ms     energy
## [1,]  -2.261002    0.6833748   -0.8909329  -1.1413655  1.2869052
## [2,]  -2.206026   -0.3454664    0.1919933  -0.8218657  0.6302479
## [3,]  -2.096075    1.6445663    0.5852948  -0.5452965 -1.6699502
## [4,]  -2.261002    0.9426992   -1.6936990  -0.6952933 -0.9297874
## [5,]  -2.041100    1.6389288   -1.2034190  -1.2821808 -1.3131538
## [6,]  -2.261002    1.0723614    0.1273410  -0.6263486 -1.8073548
##      instrumentalness    liveness   loudness speechiness      tempo    valence
## [1,]      -0.48981747  0.66065975  1.2907007  -0.3679692  1.5956039  1.3807413
## [2,]      -0.48981747 -0.32283477  0.6686811  -0.1830817  1.8232495  1.3884316
## [3,]      -0.48981747 -0.56492573 -0.7184009  -0.4558311 -0.5883245 -0.3342114
## [4,]      -0.48981747 -0.58762176 -0.4348159  -0.4380431  1.7505932 -0.8763826
## [5,]      -0.08356631 -0.06561313 -1.9305971  -0.4051623  0.7414313 -0.2496173
## [6,]      -0.48981747 -0.54475148 -0.9002886   0.1198533 -0.9769791 -0.3726633

The data exhibits significant variation across audio features, and after scaling the data, PCA will be employed to identify the principal components that explain most of the variance in the dataset.
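To make the scaling step concrete, here is a minimal sketch of what scale() does with its default arguments. The toy matrix and its values are made up for illustration and are not taken from the Spotify data.

```r
# Toy matrix with two columns on very different scales (illustrative values).
x <- cbind(duration_ms = c(100000, 200000, 300000),
           energy      = c(0.2, 0.5, 0.8))

# Manual z-score: subtract each column mean, then divide by each column sd.
centered <- sweep(x, 2, colMeans(x), "-")
z_manual <- sweep(centered, 2, apply(x, 2, sd), "/")

# scale() performs the same two steps when center = TRUE and scale = TRUE.
z_scale <- scale(x)

# The two results agree apart from the attributes scale() attaches.
all.equal(z_manual, z_scale, check.attributes = FALSE)
```

After this transformation every column has mean 0 and standard deviation 1, so a feature such as duration_ms no longer dominates distance or variance calculations simply because of its units.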

3.3 PCA Modeling

pca_data <- PCA(X = data_scale,
                scale.unit = FALSE,
                graph = FALSE,
                ncp = 11) 

The PCA results include various statistics and coordinates for both variables and individuals, providing insights into the principal components and their contributions to the variance in the dataset.

To proceed with PCA modeling, we will analyze the eigenvalues and eigenvectors to understand the variance captured by each principal component, and then use this information to transform the original data into new values based on the principal components.

3.3.1 Eigenvalue

pca_data$eig
##         eigenvalue percentage of variance cumulative percentage of variance
## comp 1   3.6104430              32.822350                          32.82235
## comp 2   1.7100248              15.545747                          48.36810
## comp 3   1.1712427              10.647707                          59.01580
## comp 4   0.9998305               9.089407                          68.10521
## comp 5   0.8617186               7.833839                          75.93905
## comp 6   0.7567533               6.879605                          82.81866
## comp 7   0.6378529               5.798688                          88.61734
## comp 8   0.4853764               4.412532                          93.02988
## comp 9   0.3751899               3.410832                          96.44071
## comp 10  0.2767438               2.515864                          98.95657
## comp 11  0.1147767               1.043429                         100.00000
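The percentage columns can be reproduced directly from the eigenvalues. The sketch below retypes the eigenvalues printed above and looks for the smallest number of components that reaches a cumulative share of variance. The 80% threshold is an assumption chosen for illustration, not a rule stated in this analysis.

```r
# Eigenvalues copied from the pca_data$eig output above.
eig <- c(3.6104430, 1.7100248, 1.1712427, 0.9998305, 0.8617186, 0.7567533,
         0.6378529, 0.4853764, 0.3751899, 0.2767438, 0.1147767)

# Share of variance per component and its running total, in percent.
pct     <- eig / sum(eig) * 100
cum_pct <- cumsum(pct)

# Assumed threshold: keep enough components to explain at least 80%.
threshold <- 80
n_keep <- which(cum_pct >= threshold)[1]

round(cum_pct, 2)
n_keep
```

With these eigenvalues, six components are the first to pass 80% (about 82.82%). The eigenvalues also sum to roughly 11, the number of standardized variables, which is what we would expect when every input column has unit variance.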

3.3.2 Eigenvector

3.3.3 New values on every PC

as.data.frame(pca_data$ind$coord) %>% head()
##        Dim.1      Dim.2      Dim.3      Dim.4       Dim.5       Dim.6
## 1  0.9904481  0.9993903 -0.1597105 -3.0962912  0.71325308 -0.74345195
## 2  1.2096105  0.2730414 -0.6842276 -2.7298245  1.28269932 -0.12134335
## 3 -2.1123978  0.3531611 -1.8668172 -0.2666744  0.70972338  0.39952609
## 4 -1.9545010 -0.1868432  0.2513141 -2.6613814 -0.02211554  0.60691571
## 5 -2.9355815  0.3915754 -1.1231315 -2.0937683 -0.02518811  0.41551544
## 6 -2.2612410  0.7007202 -1.7307093 -0.1446502  0.52331036  0.03778243
##        Dim.7      Dim.8        Dim.9      Dim.10      Dim.11
## 1 -1.3200974  0.4246202  0.576058781  1.16234095  0.04739096
## 2 -0.9083572 -0.4478824 -0.057434104  0.09104371 -0.13059333
## 3 -1.4628733 -0.4420655 -0.872496223  0.59542626 -0.09975048
## 4 -1.7930776 -0.5882799 -0.041750169  0.17564117 -0.22897820
## 5 -1.1238702  0.2555989 -0.007753993 -0.30349575  0.41629524
## 6 -1.5234275 -0.7598055 -0.626642226 -0.09946476 -0.44224934
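The new values are the scaled observations projected onto the eigenvectors. As a minimal sketch of that relationship, the example below uses base R's prcomp() on the built-in iris measurements rather than FactoMineR on the Spotify data, so the numbers are unrelated to the output above. FactoMineR may also differ in sign or in the variance divisor it uses.

```r
# Standardize a small built-in dataset (assumption: iris stands in for our data).
x <- scale(iris[, 1:4])

# PCA on data that is already centered and scaled.
pca <- prcomp(x, center = FALSE, scale. = FALSE)

# Scores = scaled data multiplied by the matrix of eigenvectors (rotation).
scores_manual <- x %*% pca$rotation

# The manual projection matches the scores stored by prcomp().
all.equal(scores_manual, pca$x, check.attributes = FALSE)
```

Each row of the result is one observation expressed in principal component coordinates, which is the same kind of object as pca_data$ind$coord.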

3.4 Individual & Variable Factor Map

The Individual & Variable Factor Map visualization displays the relationships and projections of individuals and variables onto the principal components, highlighting how each contributes to the overall structure of the data.

3.4.1 Individual Factor Map

custom_colors <- c("cyan")


plot.PCA(
  x = pca_data,
  choix = "ind",
  select = "contrib 10",
  invisible = "quali",
  col.ind = custom_colors
)
## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

The first two principal components together capture about 48.37% of the variance, with Principal Component 1 (Dim 1) explaining approximately 32.82% and Principal Component 2 (Dim 2) about 15.55%. The data are spread fairly evenly along Principal Component 1, while along Principal Component 2 most points are clustered around zero, with some outliers showing extreme values. This is consistent with greater variability along the first dimension than along the second.

3.4.2 Variable Factor Map

plot.PCA(x = pca_data, 
         choix = "var") 

The first two principal components capture approximately 48.37% of the total variance in the audio data, with Component 1 representing a spectrum from high-energy to more calm and acoustic tracks, as it positively correlates with features like energy, danceability, and loudness, and negatively with acousticness and instrumentalness.

3.4.3 Dimension Description

fviz_contrib(X = pca_data,
             choice = "var",
             axes = 2) 

The variable contribution plot for Dimension 2 shows that “danceability” has the most significant impact, suggesting that Dimension 2 represents the “danceability” or “rhythm” of a song.
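A variable's contribution to a dimension can be understood as its squared eigenvector entry, expressed as a percentage. The sketch below computes this by hand with base R's prcomp() on the built-in iris data, which is an assumption made to keep the example self-contained. The values it produces should correspond to what a contribution plot displays, but they are not the Spotify results.

```r
# PCA on a small built-in dataset (assumption: iris stands in for our data).
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Each eigenvector has unit length, so its squared entries sum to 1.
# Multiplying by 100 gives each variable's contribution to PC2 in percent.
contrib_pc2 <- pca$rotation[, 2]^2 * 100

# Variables ranked by their contribution to the second component.
sort(contrib_pc2, decreasing = TRUE)

# Average contribution if all variables contributed equally.
expected <- 100 / length(contrib_pc2)
expected
```

Variables whose contribution exceeds the equal-share value are the ones usually read as defining the dimension.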

3.5 Reduce Dimension

pca_data$ind$coord %>% head()
##        Dim.1      Dim.2      Dim.3      Dim.4       Dim.5       Dim.6
## 1  0.9904481  0.9993903 -0.1597105 -3.0962912  0.71325308 -0.74345195
## 2  1.2096105  0.2730414 -0.6842276 -2.7298245  1.28269932 -0.12134335
## 3 -2.1123978  0.3531611 -1.8668172 -0.2666744  0.70972338  0.39952609
## 4 -1.9545010 -0.1868432  0.2513141 -2.6613814 -0.02211554  0.60691571
## 5 -2.9355815  0.3915754 -1.1231315 -2.0937683 -0.02518811  0.41551544
## 6 -2.2612410  0.7007202 -1.7307093 -0.1446502  0.52331036  0.03778243
##        Dim.7      Dim.8        Dim.9      Dim.10      Dim.11
## 1 -1.3200974  0.4246202  0.576058781  1.16234095  0.04739096
## 2 -0.9083572 -0.4478824 -0.057434104  0.09104371 -0.13059333
## 3 -1.4628733 -0.4420655 -0.872496223  0.59542626 -0.09975048
## 4 -1.7930776 -0.5882799 -0.041750169  0.17564117 -0.22897820
## 5 -1.1238702  0.2555989 -0.007753993 -0.30349575  0.41629524
## 6 -1.5234275 -0.7598055 -0.626642226 -0.09946476 -0.44224934
data_keep <- as.data.frame(pca_data$ind$coord[,1:6])

data_keep %>% head()
##        Dim.1      Dim.2      Dim.3      Dim.4       Dim.5       Dim.6
## 1  0.9904481  0.9993903 -0.1597105 -3.0962912  0.71325308 -0.74345195
## 2  1.2096105  0.2730414 -0.6842276 -2.7298245  1.28269932 -0.12134335
## 3 -2.1123978  0.3531611 -1.8668172 -0.2666744  0.70972338  0.39952609
## 4 -1.9545010 -0.1868432  0.2513141 -2.6613814 -0.02211554  0.60691571
## 5 -2.9355815  0.3915754 -1.1231315 -2.0937683 -0.02518811  0.41551544
## 6 -2.2612410  0.7007202 -1.7307093 -0.1446502  0.52331036  0.03778243

3.6 Biplot

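data_small <- data2 %>% head(100)

# prcomp()'s scaling argument is named `scale.`
pca_small <- prcomp(data_small, scale. = TRUE)

# scale = 0 (previously written as FALSE) plots the unscaled scores
biplot(x = pca_small,
       cex = 0.7,
       scale = 0)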

The data shows substantial variation along both PC1 and PC2, with certain feature groups clustering in specific areas, suggesting correlations between features, while the contributions of features like “danceability” and “energy” in shaping the principal components need further confirmation through loading plots or contribution tables.

data2[55,]
##    popularity acousticness danceability duration_ms energy instrumentalness
## 55          0        0.924        0.683      101653  0.147                0
##    liveness loudness speechiness  tempo valence
## 55    0.606  -21.998       0.822 32.244   0.595
data2[16,]
##    popularity acousticness danceability duration_ms energy instrumentalness
## 16          0        0.548        0.588     2447870  0.405                0
##    liveness loudness speechiness tempo valence
## 16    0.754   -15.55       0.938 83.56    0.48
data2[97,]
##    popularity acousticness danceability duration_ms energy instrumentalness
## 97          0         0.84        0.688     3435625  0.331                0
##    liveness loudness speechiness   tempo valence
## 97   0.0673   -8.645       0.772 102.244   0.529

4. K-Means Clustering

K-Means Clustering is a method of grouping data based on similarity, represented through distance metrics, which requires numerical data for effective modeling.

4.1 Data Preprocessing | Optimum number of K using Elbow Method

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
kmeansTunning <- function(data, maxK) {
  withinall <- NULL
  total_k <- NULL
  for (i in 2:maxK) {
    set.seed(101)
    temp <- kmeans(data,i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k,i)
  }
  plot(x = total_k, y = withinall, type = "o", xlab = "Cluster", ylab = "Total")
}

kmeansTunning(data_scale, maxK = 6)

The plot shows a decrease in the total within-cluster sum of squares (WSS) as the number of clusters increases. However, this decrease starts to slow down after reaching 3 clusters, suggesting that 3 clusters may be an optimal choice.

4.2 Cluster Modelling

The code below sets a random seed for reproducibility and performs K-means clustering on the scaled data. Note that 5 clusters are used here rather than the 3 suggested by the elbow plot.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

data_cluster <- kmeans(x = data_scale,
                       centers = 5)
data_cluster$size
## [1] 60940 86609 29700 10111 45365
data_cluster$centers
##    popularity acousticness danceability duration_ms     energy instrumentalness
## 1  0.07985509   -0.7489145  -0.31576634  0.08376743  0.7581379       -0.1571035
## 2  0.45783618   -0.5509724   0.77070603 -0.08734646  0.4015419       -0.3278841
## 3 -0.79234540    1.3780146  -1.44247233  0.23239420 -1.5619442        1.7361848
## 4 -1.12475422    1.1971189   0.04380536  0.07222012  0.3461937       -0.4857360
## 5 -0.21192669    0.8889432  -0.11261470 -0.11401130 -0.8396052       -0.1913769
##     liveness   loudness speechiness      tempo     valence
## 1  0.1767196  0.5947072 -0.15163567  0.8780751 -0.02396741
## 2 -0.2150034  0.4576242 -0.07841903 -0.2938935  0.58362231
## 3 -0.3041790 -1.8847884 -0.39716186 -0.5059485 -1.17793132
## 4  2.5972431 -0.4103686  4.07706176 -0.6370862 -0.15647396
## 5 -0.2066498 -0.3471483 -0.29527152 -0.1452184 -0.27597713
head(data_cluster$cluster)
## [1] 1 1 5 5 3 5
data_cluster$iter
## [1] 6

4.3 Goodness of Fit

Goodness of Fit refers to how well a statistical model or algorithm represents the observed data.

data_cluster$withinss
## [1] 368381.0 381937.3 264187.0 107129.7 285441.3

The values indicate the within-cluster sum of squares (WSS) for each of the 5 clusters, providing insight into the goodness of fit, where lower WSS values suggest better cluster cohesion and a more accurate fit of the model to the data.

data_cluster$betweenss
## [1] 1152888
data_cluster$totss
## [1] 2559964
data_cluster$betweenss / data_cluster$totss
## [1] 0.4503531
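The ratio above relies on the identity totss = betweenss + sum(withinss). Here is a minimal sketch that checks the identity on a toy model and then against the numbers printed above. The toy model uses the built-in iris data with 3 centers and nstart = 10, all of which are assumptions for illustration.

```r
# Toy model on standardized built-in data (assumption: not the Spotify data).
x <- scale(iris[, 1:4])
set.seed(100)
km <- kmeans(x, centers = 3, nstart = 10)

# Total sum of squares splits into between-cluster and within-cluster parts.
all.equal(km$totss, km$betweenss + sum(km$withinss))

# For standardized data, totss equals (n - 1) * number of columns.
all.equal(km$totss, (nrow(x) - 1) * ncol(x))

# The same checks using the values printed in this section.
withinss  <- c(368381.0, 381937.3, 264187.0, 107129.7, 285441.3)
betweenss <- 1152888
ratio <- betweenss / (betweenss + sum(withinss))
ratio
(232725 - 1) * 11   # matches data_cluster$totss
```

A ratio closer to 1 means that more of the total variation is explained by differences between clusters rather than by spread within them.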
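# List the distinct genre labels; their count (27) matches the k used in the next model.
# "Children's Music" appears twice because of two different apostrophe characters.
unique(data$genre)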
##  [1] "Movie"            "R&B"              "A Capella"        "Alternative"     
##  [5] "Country"          "Dance"            "Electronic"       "Anime"           
##  [9] "Folk"             "Blues"            "Opera"            "Hip-Hop"         
## [13] "Children's Music" "Children’s Music" "Rap"              "Indie"           
## [17] "Classical"        "Pop"              "Reggae"           "Reggaeton"       
## [21] "Jazz"             "Rock"             "Ska"              "Comedy"          
## [25] "Soul"             "Soundtrack"       "World"
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

data_cluster27 <- kmeans(x = data_scale,
                       centers = 27)
data_cluster27$withinss
##  [1] 22575.059 27644.655 42819.421 24646.616 24215.116 29743.248 22348.232
##  [8] 28213.419 25815.628 21078.422 24295.697 56681.674 27028.144 35776.112
## [15] 24123.662 33350.647 23510.063 26924.699 25370.206 27777.885  6301.294
## [22] 26402.215 16263.833 56616.432 27218.945 59446.095 29374.792
data_cluster27$betweenss / data_cluster27$totss
## [1] 0.6892291
data_cluster27$size 
##  [1]  7638 14021 10580  4559  7274 12051  5841 11958  6483  5321  9121 10659
## [13] 11281 15181  9712 11752  8556  8649  6252  9682   108  5752  2691  9417
## [25] 14272 11277  2637

4.4 Interpretation: Cluster Profiling

Cluster profiling involves analyzing the characteristics and attributes of each cluster to understand the distinct features and patterns of the data grouped within them.

data2$cluster <- as.factor(data_cluster27$cluster)

data2 %>% head()
##   popularity acousticness danceability duration_ms energy instrumentalness
## 1          0        0.611        0.389       99373 0.9100            0.000
## 2          1        0.246        0.590      137373 0.7370            0.000
## 3          3        0.952        0.663      170267 0.1310            0.000
## 4          0        0.703        0.240      152427 0.3260            0.000
## 5          4        0.950        0.331       82625 0.2250            0.123
## 6          0        0.749        0.578      160627 0.0948            0.000
##   liveness loudness speechiness   tempo valence cluster
## 1   0.3460   -1.828      0.0525 166.969   0.814       1
## 2   0.1510   -5.559      0.0868 174.003   0.816       1
## 3   0.1030  -13.879      0.0362  99.488   0.368      10
## 4   0.0985  -12.178      0.0395 171.758   0.227      10
## 5   0.2020  -21.150      0.0456 140.576   0.390      10
## 6   0.1070  -14.970      0.1430  87.479   0.358      10
data_centroid <- data2 %>% 
  group_by(cluster) %>% 
  summarise(across(everything(), mean))

data_centroid
## # A tibble: 27 × 12
##    cluster popularity acousticness danceability duration_ms energy
##    <fct>        <dbl>        <dbl>        <dbl>       <dbl>  <dbl>
##  1 1             27.4       0.0868        0.460     208434. 0.855 
##  2 2             60.5       0.137         0.735     219909. 0.747 
##  3 3             31.1       0.904         0.228     215890. 0.0809
##  4 4             36.5       0.146         0.453     294500. 0.773 
##  5 5             54.9       0.240         0.717     218027. 0.617 
##  6 6             53.3       0.0664        0.435     235116. 0.790 
##  7 7             39.8       0.189         0.424     384482. 0.558 
##  8 8             52.2       0.231         0.539     237436. 0.540 
##  9 9             17.2       0.933         0.264     244659. 0.112 
## 10 10            10.5       0.850         0.464     177478. 0.247 
## # ℹ 17 more rows
## # ℹ 6 more variables: instrumentalness <dbl>, liveness <dbl>, loudness <dbl>,
## #   speechiness <dbl>, tempo <dbl>, valence <dbl>
data_centroid %>% 
  pivot_longer(-cluster) %>% 
  group_by(name) %>% 
  summarize(
    kelompok_min = which.min(value),
    kelompok_max = which.max(value))
## # A tibble: 11 × 3
##    name             kelompok_min kelompok_max
##    <chr>                   <int>        <int>
##  1 acousticness               25            9
##  2 danceability                3           15
##  3 duration_ms                22           21
##  4 energy                      3            1
##  5 instrumentalness           24            3
##  6 liveness                    3            4
##  7 loudness                    3           25
##  8 popularity                 22            2
##  9 speechiness                 3           24
## 10 tempo                       8            1
## 11 valence                     3           13

4.5 Clustering Visualization

data_small_cluster <- data2 %>% select(-cluster) %>% head(100)
data_cluster_small <- kmeans(x = data_scale %>% head(100),
                       centers = 5)
# 2-dimensional visualization
fviz_cluster(object = data_cluster_small, 
             data = data_small_cluster )

The visualization reveals 5 distinct data clusters with some overlapping points, suggesting the potential presence of sub-groups within the data.

4.6 Biplot and Clustering Visualization

data_pca <- PCA(X = data_scale %>% head(100),
               scale.unit = FALSE,
               graph = FALSE)
fviz_pca_biplot(X = data_pca, 
                geom.ind = "point",
                addEllipses = TRUE)

5. Conclusion

Based on this exercise with the Spotify dataset, several observations can be made. First, it is advisable to use a manageable amount of data for unsupervised learning to make visualization easier, although analyzing larger datasets remains feasible. Second, the PCA and K-means results can be compared to see which variables drive cluster formation. For instance, the PCA visualization shows a strong correlation between speechiness and liveness, as well as between danceability and loudness. The clustering results agree: in the 5-cluster model, cluster 4 has by far the highest centroid values for both speechiness (4.08) and liveness (2.60). This indicates that both methods capture similar structure in the data, whether through the eigenvalue-eigenvector decomposition of PCA or distance-based grouping with K-means.

6. Dataset