Intro

What we will do

We will analyze the data using the k-means method. We will also try to reduce the dimensionality of the dataset using Principal Component Analysis (PCA).

Datasets

The dataset was acquired from Kaggle.

Library and Setup

Load the required libraries.

library(dplyr)
library(tidyr)
library(GGally)
library(gridExtra)
library(factoextra)
library(FactoMineR)
library(plotly)
library(ggiraphExtra)
options(scipen = 100, max.print = 101)

Import Data

spotify <- read.csv("data_input/SpotifyFeatures.csv", sep = ";")
glimpse(spotify)

#> Rows: 232,725
#> Columns: 18
#> $ genres           <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie",…
#> $ artist_name      <chr> "Henri Salvador", "Martin & les f\xe9es", "Joseph Wil…
#> $ track_name       <chr> "C'est beau de faire un Show", "Perdu d'avance (par G…
#> $ track_id         <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "…
#> $ popularity       <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
#> $ acousticness     <chr> "0,424305556", "0,170833333", "0,661111111", "0,48819…
#> $ danceability     <chr> "0,270138889", "00.59", "0,460416667", "00.24", "0,22…
#> $ duration_ms      <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy           <chr> "0,063194444", "0,511805556", "0,090972222", "0,22638…
#> $ instrumentalness <chr> "0", "0", "0", "0", "0,085416667", "0", "0", "0", "0.…
#> $ key              <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G…
#> $ liveness         <chr> "0,240277778", "0,104861111", "0,071527778", "0,68402…
#> $ loudness         <chr> "-1.828", "-5.559", "-13.879", "-12.178", "-21.15", "…
#> $ mode             <chr> "Major", "Minor", "Minor", "Major", "Major", "Major",…
#> $ speechiness      <chr> "0,364583333", "0,602777778", "0,251388889", "0,27430…
#> $ tempo            <chr> "166.969", "174.003", "99.488", "171.758", "140.576",…
#> $ time_signature   <chr> "04-Apr", "04-Apr", "05-Apr", "04-Apr", "04-Apr", "04…
#> $ valence          <chr> "0,565277778", "0,566666667", "0,255555556", "0,15763…

Variable explanation:

Based on the provided dataset, here is an explanation of each feature:

genres: The genre of the track.
artist_name: The name of the artist or performer.
track_name: The name of the track.
track_id: The unique identifier for the track.
popularity: The popularity score of the track.
acousticness: The measure of how acoustic the track is.
danceability: The measure of how suitable the track is for dancing.
duration_ms: The duration of the track in milliseconds.
energy: The energy level of the track.
instrumentalness: The measure of how likely the track is to be instrumental.
key: The key in which the track is composed.
liveness: The measure of how likely the track was recorded live, in front of an audience.
loudness: The loudness of the track.
mode: The modality (major or minor) of the track.
speechiness: The measure of how many spoken words are present in the track.
tempo: The tempo (beats per minute) of the track.
time_signature: The time signature of the track.
valence: The measure of musical positiveness conveyed by the track.

Data Preprocessing

Data type

Check the number of unique values in each feature.

sapply(spotify, function(X) length(unique(X)))

#>           genres      artist_name       track_name         track_id 
#>               27            14556           147093           176774 
#>       popularity     acousticness     danceability      duration_ms 
#>              101             3918              935            70749 
#>           energy instrumentalness              key         liveness 
#>             1698             4582               12              979 
#>         loudness             mode      speechiness            tempo 
#>            27923                2              965            77978 
#>   time_signature          valence 
#>                5              973

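mode should be a categorical (factor) data type.
acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, and valence should be numerical. Several of these columns use a decimal comma, so we replace it with a decimal point before converting.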

spotify$acousticness <- gsub(",",".", spotify$acousticness)
spotify$danceability <- gsub(",",".", spotify$danceability)
spotify$energy <- gsub(",",".", spotify$energy)
spotify$instrumentalness <- gsub(",",".", spotify$instrumentalness)
spotify$liveness <- gsub(",",".", spotify$liveness)
spotify$speechiness <- gsub(",",".", spotify$speechiness)
spotify$valence <- gsub(",",".", spotify$valence)

spotify_c <- spotify %>%
  mutate_at(vars(genres, mode), as.factor) %>% 
  mutate_at(vars(acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence), as.numeric)
glimpse(spotify_c)

#> Rows: 232,725
#> Columns: 18
#> $ genres           <fct> Movie, Movie, Movie, Movie, Movie, Movie, Movie, Movi…
#> $ artist_name      <chr> "Henri Salvador", "Martin & les f\xe9es", "Joseph Wil…
#> $ track_name       <chr> "C'est beau de faire un Show", "Perdu d'avance (par G…
#> $ track_id         <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "…
#> $ popularity       <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
#> $ acousticness     <dbl> 0.42430556, 0.17083333, 0.66111111, 0.48819444, 0.065…
#> $ danceability     <dbl> 0.27013889, 0.59000000, 0.46041667, 0.24000000, 0.229…
#> $ duration_ms      <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy           <dbl> 0.06319444, 0.51180556, 0.09097222, 0.22638889, 0.156…
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.085…
#> $ key              <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G…
#> $ liveness         <dbl> 0.24027778, 0.10486111, 0.07152778, 0.68402778, 0.140…
#> $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
#> $ mode             <fct> Major, Minor, Minor, Major, Major, Major, Major, Majo…
#> $ speechiness      <dbl> 0.36458333, 0.60277778, 0.25138889, 0.27430556, 0.316…
#> $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
#> $ time_signature   <chr> "04-Apr", "04-Apr", "05-Apr", "04-Apr", "04-Apr", "04…
#> $ valence          <dbl> 0.56527778, 0.56666667, 0.25555556, 0.15763889, 0.390…

Now the data types are set.

Check whether there are any missing values.
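As a minimal sketch of what the conversion above does, the toy vector below (made-up strings, not rows read from the file) mixes decimal commas and decimal points the way the raw columns do. It also shows that `as.numeric()` returns `NA` for any string it cannot parse, which is one way missing values can appear after a type conversion.

```r
# Toy strings (hypothetical), mimicking the mixed decimal marks in the raw file
raw <- c("0,424305556", "00.59", "0", "n/a")

# Replace the decimal comma with a point; fixed = TRUE treats "," literally
fixed <- gsub(",", ".", raw, fixed = TRUE)

# Strings that are not valid numbers become NA (R normally warns about this)
num <- suppressWarnings(as.numeric(fixed))
print(num)
```

Checking `colSums(is.na(...))` right after the conversion, as done below, is a simple way to see how many values failed to parse.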

anyNA(spotify_c)

#> [1] TRUE

Handling the missing values. First, count them per column.

colSums(is.na(spotify_c))

#>           genres      artist_name       track_name         track_id 
#>                0                0                0                0 
#>       popularity     acousticness     danceability      duration_ms 
#>                0                0                0                0 
#>           energy instrumentalness              key         liveness 
#>                0                0                0                0 
#>         loudness             mode      speechiness            tempo 
#>               53                0                0            13077 
#>   time_signature          valence 
#>                0                0

The simplest option is to drop every row that contains a missing value. This removes 13,128 of the 232,725 rows (about 5.6%).

spotify_c <- spotify_c %>% 
  filter(complete.cases(.))

anyNA(spotify_c)

#> [1] FALSE

nrow(spotify_c)

#> [1] 219597
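The per-column counts and the number of dropped rows do not have to add up, because a row with several missing values is dropped only once. A small sketch on a made-up data frame (the values are illustrative only):

```r
# Hypothetical toy data: row 2 and row 3 miss one value each, row 4 misses both
toy <- data.frame(
  loudness = c(-1.8, NA, -13.9, NA),
  tempo    = c(167.0, 174.0, NA, NA)
)

na_per_column <- colSums(is.na(toy))        # missing cells per column
rows_with_na  <- sum(!complete.cases(toy))  # rows with at least one NA
toy_clean     <- toy[complete.cases(toy), , drop = FALSE]

print(na_per_column)
print(rows_with_na)
print(nrow(toy_clean))
```

In the real data, 53 + 13,077 = 13,130 cells are missing while 232,725 - 219,597 = 13,128 rows were dropped, which is consistent with two rows missing both loudness and tempo.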

Because PCA works on the variance of numeric variables, we only need the features with a numerical data type.

spotify_num <- spotify_c %>% 
  select_if(is.numeric)
glimpse(spotify_num)

#> Rows: 219,597
#> Columns: 11
#> $ popularity       <int> 0, 1, 3, 0, 4, 0, 2, 15, 10, 0, 2, 4, 3, 0, 0, 3, 1, …
#> $ acousticness     <dbl> 0.42430556, 0.17083333, 0.66111111, 0.48819444, 0.065…
#> $ danceability     <dbl> 0.27013889, 0.59000000, 0.46041667, 0.24000000, 0.229…
#> $ duration_ms      <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy           <dbl> 0.06319444, 0.51180556, 0.09097222, 0.22638889, 0.156…
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.085…
#> $ liveness         <dbl> 0.24027778, 0.10486111, 0.07152778, 0.68402778, 0.140…
#> $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
#> $ speechiness      <dbl> 0.36458333, 0.60277778, 0.25138889, 0.27430556, 0.316…
#> $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
#> $ valence          <dbl> 0.56527778, 0.56666667, 0.25555556, 0.15763889, 0.390…

Statistical peek

summary(spotify_num)

#>    popularity      acousticness      danceability     duration_ms     
#>  Min.   :  0.00   Min.   :0.00000   Min.   :0.0100   Min.   :  15387  
#>  1st Qu.: 29.00   1st Qu.:0.09097   1st Qu.:0.2882   1st Qu.: 182885  
#>  Median : 43.00   Median :0.24861   Median :0.3931   Median : 220462  
#>  Mean   : 41.14   Mean   :0.29748   Mean   :0.3746   Mean   : 235057  
#>  3rd Qu.: 55.00   3rd Qu.:0.51944   3rd Qu.:0.4813   3rd Qu.: 265700  
#>  Max.   :100.00   Max.   :0.69375   Max.   :0.6937   Max.   :5552917  
#>      energy          instrumentalness       liveness          loudness      
#>  Min.   :0.0000203   Min.   :0.0000000   Min.   :0.00967   Min.   :-52.457  
#>  1st Qu.:0.2583333   1st Qu.:0.0000000   1st Qu.:0.09097   1st Qu.:-11.750  
#>  Median :0.4145833   Median :0.0000448   Median :0.17847   Median : -7.754  
#>  Mean   :0.3921543   Mean   :0.1125078   Mean   :0.26875   Mean   : -9.559  
#>  3rd Qu.:0.5430556   3rd Qu.:0.0965278   3rd Qu.:0.46000   3rd Qu.: -5.500  
#>  Max.   :0.6937500   Max.   :0.6937500   Max.   :1.00000   Max.   :  3.744  
#>   speechiness         tempo           valence      
#>  Min.   :0.0100   Min.   : 30.38   Min.   :0.0000  
#>  1st Qu.:0.1868   1st Qu.: 92.98   1st Qu.:0.2028  
#>  Median :0.2556   Median :115.90   Median :0.3271  
#>  Mean   :0.2796   Mean   :117.72   Mean   :0.3363  
#>  3rd Qu.:0.3535   3rd Qu.:139.09   3rd Qu.:0.4701  
#>  Max.   :0.6937   Max.   :242.90   Max.   :1.0000

The variables are not on the same scale: their ranges differ considerably. For example, duration_ms reaches into the millions, while most audio features lie between 0 and 1.

Variance

cov(spotify_num)

#>                     popularity  acousticness      danceability
#> popularity        330.76086683  -1.190446167     0.47471461926
#> acousticness       -1.19044617   0.053760063    -0.00561330404
#> danceability        0.47471462  -0.005613304     0.02069178987
#> duration_ms      4056.56339317 308.016447290 -1310.09048668792
#> energy              0.50851660  -0.017069707     0.00325540905
#> instrumentalness   -0.75464087   0.010285226    -0.00665807172
#> liveness           -0.26689974   0.001389778    -0.00002348215
#> loudness           39.51673087  -0.758403068     0.24583248509
#> speechiness        -0.35347937   0.002220435     0.00006632012
#>                        duration_ms         energy instrumentalness
#> popularity              4056.56339    0.508516600    -0.7546408728
#> acousticness             308.01645   -0.017069707     0.0102852258
#> danceability           -1310.09049    0.003255409    -0.0066580717
#> duration_ms      14004813224.70875 -317.316117845  1951.7082595447
#> energy                  -317.31612    0.033241566    -0.0081901969
#> instrumentalness        1951.70826   -0.008190197     0.0445239410
#> liveness                 923.60011    0.001440206    -0.0006159730
#> loudness              -33344.80557    0.575879063    -0.5881938502
#> speechiness                3.02723    0.001461263    -0.0008750963
#>                         liveness        loudness    speechiness
#> popularity        -0.26689974249     39.51673087 -0.35347937342
#> acousticness       0.00138977779     -0.75840307  0.00222043540
#> danceability      -0.00002348215      0.24583249  0.00006632012
#> duration_ms      923.60010683694 -33344.80556920  3.02723018453
#> energy             0.00144020583      0.57587906  0.00146126272
#> instrumentalness  -0.00061597301     -0.58819385 -0.00087509631
#> liveness           0.04298743496     -0.03194345  0.00290565235
#> loudness          -0.03194345349     35.83298005 -0.04674082649
#> speechiness        0.00290565235     -0.04674083  0.02456972282
#>                             tempo          valence
#> popularity            45.24597101    -0.0507765166
#> acousticness          -1.44253956    -0.0045151196
#> danceability           0.01146579     0.0049633824
#> duration_ms      -100305.45301324 -1656.3801618383
#> energy                 0.88294007     0.0047481783
#> instrumentalness      -0.58171172    -0.0038253664
#> liveness              -0.19860434     0.0004865074
#> loudness              41.96913774     0.1722833173
#> speechiness           -0.15978976     0.0002235817
#>  [ reached getOption("max.print") -- omitted 2 rows ]

The variances are also not in the same range across variables, especially for duration_ms.

Correlation peek
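Variance depends on the unit of measurement, which helps explain why duration_ms stands out. The sketch below takes the first five durations shown in the glimpse output and shows that expressing the same values in minutes instead of milliseconds divides the variance by 60000^2. The variable names are illustrative.

```r
# First five duration_ms values from the glimpse output above
duration_ms  <- c(99373, 137373, 170267, 152427, 82625)
duration_min <- duration_ms / 60000   # same information, different unit

var_ms  <- var(duration_ms)
var_min <- var(duration_min)

# Rescaling by a constant c rescales the variance by c^2
ratio <- var_ms / var_min
print(c(var_ms = var_ms, var_min = var_min, ratio = ratio))
```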

ggcorr(spotify_num,
       label = TRUE,
       label_size = 3,
       hjust = 0.9)
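The Pearson correlation is the covariance of the standardized variables, so it is not affected by the unit differences seen in the covariance matrix. A minimal sketch on simulated data (the columns `a`, `b`, and `c` are made up for illustration):

```r
set.seed(1)
# Three hypothetical columns on very different scales
x <- cbind(a = rnorm(50), b = rnorm(50, sd = 1000), c = runif(50))

cov_scaled <- cov(scale(x))  # covariance after centering and scaling
cor_raw    <- cor(x)         # correlation of the raw columns

# The two matrices agree up to floating-point error
max_diff <- max(abs(cov_scaled - cor_raw))
print(max_diff)
```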

Exploratory Data Analysis

plot(prcomp(spotify_num))

On the unscaled data, PC1 captures nearly all of the variance, while the subsequent PCs capture very little. This is because duration_ms has a far larger variance than any other variable.

We have to scale the data before running PCA.

spotify_scale <- scale(spotify_num)

summary(spotify_scale)

#>    popularity       acousticness      danceability      duration_ms     
#>  Min.   :-2.2620   Min.   :-1.2830   Min.   :-2.5344   Min.   :-1.8562  
#>  1st Qu.:-0.6675   1st Qu.:-0.8907   1st Qu.:-0.6004   1st Qu.:-0.4409  
#>  Median : 0.1023   Median :-0.2108   Median : 0.1285   Median :-0.1233  
#>  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
#>  3rd Qu.: 0.7621   3rd Qu.: 0.9573   3rd Qu.: 0.7416   3rd Qu.: 0.2589  
#>  Max.   : 3.2365   Max.   : 1.7091   Max.   : 2.2189   Max.   :44.9364  
#>      energy        instrumentalness      liveness          loudness      
#>  Min.   :-2.1508   Min.   :-0.53319   Min.   :-1.2496   Min.   :-7.1662  
#>  1st Qu.:-0.7340   1st Qu.:-0.53319   1st Qu.:-0.8574   1st Qu.:-0.3660  
#>  Median : 0.1230   Median :-0.53298   Median :-0.4354   Median : 0.3016  
#>  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
#>  3rd Qu.: 0.8277   3rd Qu.:-0.07573   3rd Qu.: 0.9224   3rd Qu.: 0.6781  
#>  Max.   : 1.6542   Max.   : 2.75461   Max.   : 3.5269   Max.   : 2.2224  
#>   speechiness          tempo             valence        
#>  Min.   :-1.7200   Min.   :-2.82739   Min.   :-1.97606  
#>  1st Qu.:-0.5920   1st Qu.:-0.80092   1st Qu.:-0.78441  
#>  Median :-0.1534   Median :-0.05888   Median :-0.05391  
#>  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
#>  3rd Qu.: 0.4713   3rd Qu.: 0.69175   3rd Qu.: 0.78678  
#>  Max.   : 2.6421   Max.   : 4.05258   Max.   : 3.90058

Check the plot of how much variance each PC explains when using the scaled data.

plot(prcomp(spotify_scale))

After scaling, the bias in the PCA toward large-variance variables is handled.
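The effect of scaling on PCA can be reproduced on simulated data. The sketch below is an illustration under assumed toy data (two independent columns, one with a much larger spread), not a rerun of the Spotify analysis; `prop_var()` is a helper defined here.

```r
set.seed(42)
toy <- data.frame(
  small = rnorm(200),            # variance near 1
  large = rnorm(200, sd = 1e5)   # variance near 1e10
)

# Proportion of variance explained by each principal component
prop_var <- function(pca) pca$sdev^2 / sum(pca$sdev^2)

prop_raw    <- prop_var(prcomp(toy))                 # unscaled
prop_scaled <- prop_var(prcomp(toy, scale. = TRUE))  # scaled to unit variance

print(round(prop_raw, 4))
print(round(prop_scaled, 4))
```

Without scaling, the first component is essentially the large-variance column; after scaling, the two components share the variance more evenly.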

UL: Clustering

Obtain the optimum k

fviz_nbclust(
  x = spotify_scale %>% head(1000),
  FUNcluster = kmeans,
  method = "wss"
)

fviz_nbclust(
  x = spotify_scale %>% head(1000),
  FUNcluster = kmeans,
  method = "silhouette"
)

fviz_nbclust(
  x = spotify_scale %>% head(1000),
  FUNcluster = kmeans,
  method = "gap_stat"
)

From the plots, which are computed on the first 1,000 rows, we take 3 as the optimum number of clusters. Beyond k = 3, adding clusters does not considerably decrease the total within-cluster sum of squares (a measure of internal cohesion), nor does it considerably increase the between-cluster sum of squares or the between/total sum of squares ratio (measures of external separation).
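The elbow reasoning can also be checked by hand. The sketch below computes the total within-cluster sum of squares for k = 1 to 6 on simulated data with three well-separated groups; the data and the `nstart = 10` setting are assumptions for illustration and are not taken from the analysis above.

```r
set.seed(123)
# Three hypothetical groups of 50 points around different centers
centers <- rbind(c(0, 0), c(6, 6), c(0, 6))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, mean = centers[i, 1]), rnorm(50, mean = centers[i, 2]))
}))

# Total within-cluster sum of squares for each candidate k
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# Improvement gained by adding one more cluster
wss_drop <- -diff(wss)

print(round(wss, 1))
print(round(wss_drop, 1))
```

The elbow is the k after which the entries of `wss_drop` become small compared with the earlier ones.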

k-means clustering

RNGkind(sample.kind = "Rounding")
set.seed(123)
spotify_k <- kmeans(spotify_scale %>% head(1000), centers = 3)

# result
spotify_k

#> K-means clustering with 3 clusters of sizes 291, 497, 212
#> 
#> Cluster means:
#>   popularity acousticness danceability duration_ms      energy instrumentalness
#> 1 -2.0126152    0.9636559   -0.7361089  -0.3300738 -0.81199139       -0.1211050
#> 2  0.5085631   -0.1668230    0.1885333  -0.1443190  0.05033251       -0.4679869
#> 3 -2.1435023    0.1285317    0.4160414   0.7054267  0.00977984       -0.4391777
#>      liveness    loudness speechiness       tempo     valence
#> 1 -0.29269442 -0.75467594  -0.1794143 -0.29759088 -0.41351755
#> 2 -0.04202170  0.36277495  -0.1289156  0.11898597  0.03433042
#> 3  0.01044287  0.03461486   0.3139956  0.09357227  0.91782518
#> 
#> Clustering vector:
#>   [1] 3 3 1 1 1 1 3 1 3 1 1 1 3 1 3 3 1 3 1 3 3 3 1 3 3 3 3 3 1 1 3 3 3 3 1 1 3
#>  [38] 3 1 1 3 3 1 3 3 3 1 3 1 1 1 1 1 1 1 3 1 3 1 3 3 3 1 1 3 3 1 1 1 1 1 3 1 3
#>  [75] 1 3 3 3 1 1 1 3 1 3 3 1 3 1 3 3 1 3 3 1 1 3 1 1 3 3 1
#>  [ reached getOption("max.print") -- omitted 899 entries ]
#> 
#> Within cluster sum of squares by cluster:
#> [1] 2308.473 3150.552 5006.065
#>  (between_SS / total_SS =  22.1 %)
#> 
#> Available components:
#> 
#> [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
#> [6] "betweenss"    "size"         "iter"         "ifault"
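The `between_SS / total_SS` line in the printout can be recomputed from the components of the fitted object. A minimal sketch on simulated data (the toy matrix is an assumption; the component names are those listed under "Available components"):

```r
set.seed(123)
# Hypothetical standardized data: 100 rows, 3 columns
toy <- scale(matrix(rnorm(300), ncol = 3))
km  <- kmeans(toy, centers = 3, nstart = 10)

# Share of the total variation that lies between the cluster centers
ratio <- km$betweenss / km$totss

# The total sum of squares splits into a within part and a between part
identity_ok <- isTRUE(all.equal(km$totss, km$tot.withinss + km$betweenss))

print(round(100 * ratio, 1))
print(identity_ok)
```

In the output above, 22.1 % means that the three clusters account for roughly a fifth of the total variation in the 1,000 rows; the rest remains inside the clusters.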

spotify_clustered <- spotify_c %>% 
  head(1000) %>% 
  bind_cols(cluster = as.factor(spotify_k$cluster)) %>% 
  select(cluster, 1:18)
spotify_clustered

fviz_cluster(object = spotify_k,
             data = spotify_scale %>% head(1000))

spotify_clustered %>%
  group_by(cluster) %>% 
  summarise_all(.funs = "mean") %>% 
  select(where(~all(!is.na(.)))) %>% 
  mutate_if(is.numeric, .funs = "round", digits = 2)

From the cluster means above:
- Cluster 1: music in this cluster is more acoustic and more instrumental than in the other clusters, but it is less danceable, shorter, less energetic, and somewhat mellow in loudness and tempo. This suggests that the first cluster contains slow instrumental or acoustic music.
- Cluster 2: this cluster has the highest popularity of the three. It is also the loudest, with the highest tempo, slightly above-average energy, and above-average danceability. This suggests music commonly played at parties, when people are having fun, or at live concerts: the kind of music that hypes up the listener and is presumably popular among teenagers.
- Cluster 3: these songs are the least popular, but they have the highest danceability, the longest duration, and the highest speechiness and valence, with roughly average energy. This suggests rap, mixes, or DJ-style tracks; the high valence points to upbeat songs.

UL: Principal Component Analysis

PCA

Here we will run PCA on the dataset. We will look at the eigenvalues and the percentage of variance explained by each dimension. The eigenvalues measure the amount of variation retained by each principal component. Eigenvalues are large for the first PCs and small for the subsequent PCs. That is, the first PCs correspond to the directions with the maximum amount of variation in the dataset.

spotify_pca <- PCA(spotify_c %>% select(-c(artist_name, track_name, track_id, key, time_signature, mode)) %>% head(1000),
    scale.unit = T,
    quali.sup = "genres",
    ncp = 11,
    graph = F)

summary(spotify_pca)

#> 
#> Call:
#> PCA(X = spotify_c %>% select(-c(artist_name, track_name, track_id,  
#>      key, time_signature, mode)) %>% head(1000), scale.unit = T,  
#>      ncp = 11, quali.sup = "genres", graph = F) 
#> 
#> 
#> Eigenvalues
#>                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
#> Variance               2.610   1.274   1.127   0.997   0.928   0.879   0.866
#> % of var.             23.730  11.579  10.241   9.061   8.439   7.989   7.869
#> Cumulative % of var.  23.730  35.309  45.550  54.611  63.050  71.039  78.908
#>                        Dim.8   Dim.9  Dim.10  Dim.11
#> Variance               0.770   0.596   0.544   0.410
#> % of var.              6.996   5.419   4.945   3.731
#> Cumulative % of var.  85.904  91.323  96.269 100.000
#> 
#> Individuals (the 10 first)
#>                      Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
#> 1                |  3.536 |  0.058  0.000  0.000 |  0.627  0.031  0.031 |
#> 2                |  4.100 |  2.000  0.153  0.238 |  2.105  0.348  0.264 |
#> 3                |  2.868 | -2.033  0.158  0.503 | -0.007  0.000  0.000 |
#> 4                |  3.605 | -1.142  0.050  0.100 | -0.082  0.001  0.001 |
#> 5                |  3.512 | -1.678  0.108  0.228 | -0.133  0.001  0.001 |
#> 6                |  3.143 | -0.746  0.021  0.056 | -0.547  0.024  0.030 |
#>                   Dim.3    ctr   cos2  
#> 1                -1.346  0.161  0.145 |
#> 2                -2.061  0.377  0.253 |
#> 3                 0.125  0.001  0.002 |
#> 4                 0.163  0.002  0.002 |
#> 5                -1.312  0.153  0.140 |
#> 6                -0.657  0.038  0.044 |
#>  [ reached getOption("max.print") -- omitted 4 rows ]
#> 
#> Variables (the 10 first)
#>                     Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
#> popularity       |  0.608 14.157  0.370 | -0.356  9.931  0.126 |  0.348 10.779
#> acousticness     | -0.688 18.155  0.474 |  0.107  0.899  0.011 | -0.045  0.182
#> danceability     |  0.470  8.445  0.220 |  0.346  9.392  0.120 |  0.230  4.703
#> duration_ms      | -0.094  0.339  0.009 |  0.369 10.663  0.136 |  0.518 23.860
#> energy           |  0.671 17.243  0.450 |  0.045  0.159  0.002 | -0.296  7.759
#> instrumentalness | -0.446  7.635  0.199 | -0.050  0.194  0.002 | -0.171  2.595
#> liveness         |  0.078  0.230  0.006 |  0.323  8.190  0.104 |  0.474 19.986
#>                    cos2  
#> popularity        0.121 |
#> acousticness      0.002 |
#> danceability      0.053 |
#> duration_ms       0.269 |
#> energy            0.087 |
#> instrumentalness  0.029 |
#> liveness          0.225 |
#>  [ reached getOption("max.print") -- omitted 3 rows ]
#> 
#> Supplementary categories
#>                       Dist     Dim.1    cos2  v.test     Dim.2    cos2  v.test
#> A Capella        |   1.570 |  -1.379   0.771  -9.582 |  -0.030   0.000  -0.296
#> Alternative      |   2.154 |   1.464   0.462   3.767 |  -0.787   0.134  -2.899
#> Country          |   1.023 |   0.928   0.823  13.324 |  -0.339   0.110  -6.961
#> Movie            |   1.221 |  -0.921   0.569 -14.193 |   0.512   0.176  11.291
#> R&B              |   1.692 |   1.141   0.454   8.927 |  -0.440   0.068  -4.930
#>                      Dim.3    cos2  v.test  
#> A Capella        |  -0.221   0.020  -2.336 |
#> Alternative      |   0.616   0.082   2.414 |
#> Country          |   0.165   0.026   3.616 |
#> Movie            |  -0.355   0.085  -8.340 |
#> R&B              |   0.670   0.157   7.985 |

Case: we want to reduce the dimensionality of the data while keeping 90% of the information. Looking at the cumulative percentage of variance, PC1 to PC9 are needed to keep more than 90% of the information (91.3%); PC1 to PC7 keep about 78.9%.

fviz_eig(spotify_pca,
         ncp = 11,
         addlabels = T)

23.7 + 11.6 + 10.2 + 9.1

#> [1] 54.6

More than 50% of the variance (54.6%) can be explained by the first 4 dimensions alone, with the first dimension explaining 23.7% of the total variance.

We can keep close to 80% of the information (78.9%) by using only 7 dimensions. That is a 36% reduction in dimensionality: we can reduce the number of features in our dataset from 11 numeric features to just 7.

We can extract the values of PC1 to PC7 for all of the observations and put them into a new data frame. This data frame can later be analyzed with a supervised classification technique or used for other purposes.
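The number of dimensions needed for a given target can be read off the eigenvalues printed in the summary above. The sketch below recomputes the cumulative percentages from those rounded values; `dims_needed()` is a helper defined here for illustration.

```r
# Eigenvalues copied from the "Variance" row of summary(spotify_pca)
eigenvalues <- c(2.610, 1.274, 1.127, 0.997, 0.928, 0.879,
                 0.866, 0.770, 0.596, 0.544, 0.410)

# Cumulative percentage of variance explained
cum_pct <- 100 * cumsum(eigenvalues) / sum(eigenvalues)

# Smallest number of dimensions reaching a target percentage
dims_needed <- function(target) which(cum_pct >= target)[1]

print(round(cum_pct, 1))
print(c(dims_needed(50), dims_needed(80), dims_needed(90)))
```

With these rounded values, seven dimensions retain about 78.9% of the variance, eight about 85.9%, and nine about 91.3%.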

df_pca <- data.frame(spotify_pca$ind$coord[, 1:7]) %>% 
  bind_cols(cluster = as.factor(spotify_k$cluster)) %>% 
  select(cluster, 1:7)

df_pca

Individual and Variable Factor Map

fviz_pca_ind(spotify_pca,
             habillage = "genres",
             addEllipses = T)

From the plot above we can gather some information:
- Movie is the dominant genre, as we can see from its ellipse (the blue one), which is the largest in the plot.
- The other genres, such as Country, R&B, and Alternative, appear in roughly the same amounts.

The Movie tracks in rows 448, 430, and 92 appear to be outliers.

Variable Factor Map

fviz_pca_var(spotify_pca,
             select.var = list(contrib = 11),
             col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)

The plot displays the distribution of variables within a circle, suggesting that more than two components are necessary for a perfect representation of our data. The distance of each variable from the origin reflects its significance on the factor map. Additionally, variable contributions are indicated by the color, represented as percentages. Variables strongly correlated with PC1 and PC2 play a vital role in explaining data variability, while those weakly correlated or associated with later dimensions have low contribution and could be excluded for a more straightforward analysis.

To assess variable representation quality, we can examine the cos2 of each variable. A high cos2 implies a favorable representation on the principal component, indicated by proximity to the circumference of the correlation circle. Conversely, a low cos2 suggests that the variable is not well-represented by the principal components, evident when it is positioned closer to the center of the circle.

fviz_cos2(spotify_pca, 
          choice = "var", 
          fill = "cos2") + 
  scale_fill_viridis_c(option = "B") +
  theme(legend.position = "top")

The variables that contribute most to our data are loudness, acousticness, energy, and popularity. Some variables contribute less, such as speechiness, liveness, and duration_ms. Speechiness has the lowest correlation with the principal components.
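The cos2 values discussed above are the squared coordinates of each variable, that is, its squared correlation with each principal component. A minimal sketch with base R's `prcomp()` on simulated data (the variables `v1` to `v4` are hypothetical):

```r
set.seed(11)
toy <- matrix(rnorm(400), ncol = 4,
              dimnames = list(NULL, c("v1", "v2", "v3", "v4")))
toy[, "v2"] <- toy[, "v1"] + rnorm(100, sd = 0.3)  # correlate v1 and v2

pca <- prcomp(toy, scale. = TRUE)

# Variable coordinates: loadings multiplied by each component's std. deviation
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# cos2: quality of representation of each variable on each component
cos2 <- coord^2

print(round(cos2[, 1:2], 3))
print(round(rowSums(cos2), 3))  # sums to 1 across all components
```

In the summary above, for instance, the Dim.1 coordinate of popularity is 0.608, and 0.608^2 is about 0.370, the value reported in its cos2 column.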

ggRadar(data = spotify_clustered,
        aes(colour = cluster),
        interactive = T)

From the radar plot above, we can see how strongly each cluster leans toward each variable.

Product Recommendation

We want to make recommendations to users who listen to the Dragon Ball GT soundtrack. First, let's look at the data.

spotify_clustered %>% 
  filter(track_name == "Dragon Ball GT")

From the output above, we can see that Dragon Ball GT is in cluster 3, with Movie as its genre, Bernard Minet as the artist, and a Major mode.

From that information we can pick some recommendations for the listener, as the result below shows.

spotify_clustered %>% 
  filter(cluster == 3 & genres == "Movie" & artist_name == "Bernard Minet" & mode == "Major")

The result shows that there are 13 recommendations similar to Dragon Ball GT.
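One detail of a filter like the one above is that the seed track matches its own cluster, genre, artist, and mode, so it is returned along with the recommendations. The sketch below shows, on a made-up data frame in base R, how the seed could be excluded; all track and artist names are hypothetical.

```r
# Hypothetical catalogue with a cluster label per track
tracks <- data.frame(
  track_name  = c("Seed Song", "Song A", "Song B", "Song C"),
  artist_name = c("Artist X", "Artist X", "Artist Y", "Artist X"),
  genres      = c("Movie", "Movie", "Movie", "Movie"),
  cluster     = factor(c(3, 3, 3, 1)),
  stringsAsFactors = FALSE
)

seed <- tracks[tracks$track_name == "Seed Song", ]

# Same cluster, genre, and artist as the seed, but not the seed itself
recs <- tracks[
  tracks$cluster == seed$cluster &
    tracks$genres == seed$genres &
    tracks$artist_name == seed$artist_name &
    tracks$track_name != seed$track_name,
]

print(recs$track_name)
```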

Conclusion

We can draw some conclusions about our dataset based on the previous clustering and principal component analysis:

We can separate our data into at least 3 clusters based on all of the numerical features.
We can reduce our data from 11 numeric features to just 7 dimensions and still retain close to 80% (78.9%) of the variance using PCA. The dimensionality reduction can be useful if we use the principal components as inputs for machine learning applications.
However, as we have seen, two dimensions are not enough to visualize the clustering of our data, as indicated by the overlapping clusters when we use only the first 2 dimensions. Perhaps the result from the gap statistic method is right, and there is only 1 big cluster.

LBB - Spotify

Ozy Prazuganda

July 29, 2023