1 Introduction

The data for this LBB project in Unsupervised Learning comes from the following source: Kaggle - Spotify Tracks DB.

It is a list of tracks from Spotify, a music streaming platform, with audio-feature information for each track.

Because we are applying Unsupervised Learning, the goal is to find patterns in the data that yield useful information.

2 Data Preparation

The first step is to prepare our dataset.

# Library setup: load the necessary packages
# data wrangling
library(dplyr)
library(lubridate)
library(GGally)
library(factoextra)
library(tidyr)
library(ggiraphExtra)
library(FactoMineR)

2.1 Read Data

# Read dataset
spotify <- read.csv("data input/SpotifyFeatures.csv")
head(spotify)  # Check dataset

We can also take a quick look at the structure of our dataset:

# Quick overview of dataset
str(spotify)

#> 'data.frame':    232725 obs. of  18 variables:
#>  $ genre           : chr  "Movie" "Movie" "Movie" "Movie" ...
#>  $ artist_name     : chr  "Henri Salvador" "Martin & les fées" "Joseph Williams" "Henri Salvador" ...
#>  $ track_name      : chr  "C'est beau de faire un Show" "Perdu d'avance (par Gad Elmaleh)" "Don't Let Me Be Lonely Tonight" "Dis-moi Monsieur Gordon Cooper" ...
#>  $ track_id        : chr  "0BRjO6ga9RKCKjfDqeFgWV" "0BjC1NfoEOOusryehmNudP" "0CoSDzoNIKCRs124s9uTVy" "0Gc6TVm52BwZD07Ki6tIvf" ...
#>  $ popularity      : int  0 1 3 0 4 0 2 15 0 10 ...
#>  $ acousticness    : num  0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
#>  $ danceability    : num  0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
#>  $ duration_ms     : int  99373 137373 170267 152427 82625 160627 212293 240067 226200 152694 ...
#>  $ energy          : num  0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
#>  $ instrumentalness: num  0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
#>  $ key             : chr  "C#" "F#" "C" "C#" ...
#>  $ liveness        : num  0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
#>  $ loudness        : num  -1.83 -5.56 -13.88 -12.18 -21.15 ...
#>  $ mode            : chr  "Major" "Minor" "Minor" "Major" ...
#>  $ speechiness     : num  0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
#>  $ tempo           : num  167 174 99.5 171.8 140.6 ...
#>  $ time_signature  : chr  "4/4" "4/4" "5/4" "4/4" ...
#>  $ valence         : num  0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...

There are 232,725 observations in our spotify dataset. The columns are described as follows (descriptions taken from Spotify):

1. genre : Genre of the audio track
2. artist_name : The name of the Artist who performed the audio track
3. track_name : The name of the audio track
4. track_id : The Spotify ID for the track.
5. popularity : The popularity of the track. The value will be between 0 and 100, with 100 being the most popular.
6. acousticness : A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
7. danceability : Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
8. duration_ms : The duration of the track in milliseconds.
9. energy : Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
10. instrumentalness : Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
11. key : The key the track is in.
12. liveness : Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
13. loudness : The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
14. mode : Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived.
15. speechiness : Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
16. tempo : The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
17. time_signature : An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
18. valence : A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

2.2 Data Cleaning

Check which columns are good candidates for conversion to factor type, based on their number of unique values:

# Check for Potential Factor type
df_num_uniques <- spotify %>% 
  summarise(across(everything(), n_distinct)) %>% 
  pivot_longer(everything()) %>% 
  arrange(value)

df_num_uniques

Remove the unnecessary column :

track_id

Adjust to the appropriate data type :

Columns with character type :

mode : chr → factor
time_signature : chr → factor
key : chr → factor
genre : chr → factor

# Drop track_id and change the four columns above to factor type
spotify_clean <- spotify %>% 
  select(-track_id) %>% 
  mutate_at(vars(mode, time_signature, key, genre), as.factor)

# Re-check current data type
str(spotify_clean)

#> 'data.frame':    232725 obs. of  17 variables:
#>  $ genre           : Factor w/ 27 levels "A Capella","Alternative",..: 16 16 16 16 16 16 16 16 16 16 ...
#>  $ artist_name     : chr  "Henri Salvador" "Martin & les fées" "Joseph Williams" "Henri Salvador" ...
#>  $ track_name      : chr  "C'est beau de faire un Show" "Perdu d'avance (par Gad Elmaleh)" "Don't Let Me Be Lonely Tonight" "Dis-moi Monsieur Gordon Cooper" ...
#>  $ popularity      : int  0 1 3 0 4 0 2 15 0 10 ...
#>  $ acousticness    : num  0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
#>  $ danceability    : num  0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
#>  $ duration_ms     : int  99373 137373 170267 152427 82625 160627 212293 240067 226200 152694 ...
#>  $ energy          : num  0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
#>  $ instrumentalness: num  0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
#>  $ key             : Factor w/ 12 levels "A","A#","B","C",..: 5 10 4 5 9 5 5 10 4 11 ...
#>  $ liveness        : num  0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
#>  $ loudness        : num  -1.83 -5.56 -13.88 -12.18 -21.15 ...
#>  $ mode            : Factor w/ 2 levels "Major","Minor": 1 2 2 1 1 1 1 1 1 1 ...
#>  $ speechiness     : num  0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
#>  $ tempo           : num  167 174 99.5 171.8 140.6 ...
#>  $ time_signature  : Factor w/ 5 levels "0/4","1/4","3/4",..: 4 4 5 4 4 4 4 4 4 4 ...
#>  $ valence         : num  0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...

Check for any missing values :

colSums(is.na(spotify_clean))

#>            genre      artist_name       track_name       popularity 
#>                0                0                0                0 
#>     acousticness     danceability      duration_ms           energy 
#>                0                0                0                0 
#> instrumentalness              key         liveness         loudness 
#>                0                0                0                0 
#>             mode      speechiness            tempo   time_signature 
#>                0                0                0                0 
#>          valence 
#>                0

anyNA(spotify_clean)

#> [1] FALSE

Our dataset does not have any missing values and is ready for further analysis.

3 Exploratory Data Analysis (EDA)

PCA works on the variance of the data, so we will only use the columns with numeric data type.

# Data for PCA
spotify_num <- spotify_clean %>% 
  select_if(is.numeric)

glimpse(spotify_num)

#> Rows: 232,725
#> Columns: 11
#> $ popularity       <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
#> $ acousticness     <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,…
#> $ danceability     <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
#> $ duration_ms      <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy           <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123…
#> $ liveness         <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105…
#> $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
#> $ speechiness      <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953…
#> $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
#> $ valence          <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533…

Check the range of each column to see whether they share the same scale:

summary(spotify_num)

#>    popularity      acousticness     danceability     duration_ms     
#>  Min.   :  0.00   Min.   :0.0000   Min.   :0.0569   Min.   :  15387  
#>  1st Qu.: 29.00   1st Qu.:0.0376   1st Qu.:0.4350   1st Qu.: 182857  
#>  Median : 43.00   Median :0.2320   Median :0.5710   Median : 220427  
#>  Mean   : 41.13   Mean   :0.3686   Mean   :0.5544   Mean   : 235122  
#>  3rd Qu.: 55.00   3rd Qu.:0.7220   3rd Qu.:0.6920   3rd Qu.: 265768  
#>  Max.   :100.00   Max.   :0.9960   Max.   :0.9890   Max.   :5552917  
#>      energy          instrumentalness       liveness          loudness      
#>  Min.   :0.0000203   Min.   :0.0000000   Min.   :0.00967   Min.   :-52.457  
#>  1st Qu.:0.3850000   1st Qu.:0.0000000   1st Qu.:0.09740   1st Qu.:-11.771  
#>  Median :0.6050000   Median :0.0000443   Median :0.12800   Median : -7.762  
#>  Mean   :0.5709577   Mean   :0.1483012   Mean   :0.21501   Mean   : -9.570  
#>  3rd Qu.:0.7870000   3rd Qu.:0.0358000   3rd Qu.:0.26400   3rd Qu.: -5.501  
#>  Max.   :0.9990000   Max.   :0.9990000   Max.   :1.00000   Max.   :  3.744  
#>   speechiness         tempo           valence      
#>  Min.   :0.0222   Min.   : 30.38   Min.   :0.0000  
#>  1st Qu.:0.0367   1st Qu.: 92.96   1st Qu.:0.2370  
#>  Median :0.0501   Median :115.78   Median :0.4440  
#>  Mean   :0.1208   Mean   :117.67   Mean   :0.4549  
#>  3rd Qu.:0.1050   3rd Qu.:139.05   3rd Qu.:0.6600  
#>  Max.   :0.9670   Max.   :242.90   Max.   :1.0000

💡 Insight: the columns do not yet share the same scale. For example, duration_ms goes up to 5,552,917 while most audio features lie between 0 and 1

cov(spotify_num)

#>                    popularity  acousticness    danceability      duration_ms
#> popularity        330.8741927  -2.460579486     0.866213977        5079.7963
#> acousticness       -2.4605795   0.125860362    -0.024004549         472.7050
#> danceability        0.8662140  -0.024004549     0.034450413       -2776.6700
#> duration_ms      5079.7962942 472.705010446 -2776.670015161 14145750520.8081
#> energy              1.1928936  -0.067816440     0.015931805        -957.2578
#> instrumentalness   -1.1619558   0.033958915    -0.020508345        2737.5056
#> liveness           -0.6058861   0.004853762    -0.001534008         560.8353
#> loudness           39.6070158  -1.468729115     0.488376611      -33970.6429
#> speechiness        -0.5098157   0.009933929     0.004633400        -356.8187
#> tempo              45.5478771  -2.611654392     0.125822789     -104575.8610
#> valence             0.2841956  -0.030059094     0.026411285       -4386.3797
#>                          energy instrumentalness      liveness         loudness
#> popularity          1.192893559     -1.161955848  -0.605886064     39.607015849
#> acousticness       -0.067816440      0.033958915   0.004853762     -1.468729115
#> danceability        0.015931805     -0.020508345  -0.001534008      0.488376611
#> duration_ms      -957.257831674   2737.505647036 560.835264134 -33970.642881913
#> energy              0.069408833     -0.030227884   0.010071149      1.289631260
#> instrumentalness   -0.030227884      0.091668682  -0.008055978     -0.919510999
#> liveness            0.010071149     -0.008055978   0.039312018      0.054333071
#> loudness            1.289631260     -0.919510999   0.054333071     35.978446810
#> speechiness         0.007092851     -0.009950208   0.018764818     -0.002529084
#> tempo               1.862333254     -0.974186536  -0.314619135     42.324447453
#> valence             0.029925682     -0.024214148   0.000608679      0.623816414
#>                     speechiness           tempo         valence
#> popularity         -0.509815661      45.5478771     0.284195566
#> acousticness        0.009933929      -2.6116544    -0.030059094
#> danceability        0.004633400       0.1258228     0.026411285
#> duration_ms      -356.818722591 -104575.8610139 -4386.379669824
#> energy              0.007092851       1.8623333     0.029925682
#> instrumentalness   -0.009950208      -0.9741865    -0.024214148
#> liveness            0.018764818      -0.3146191     0.000608679
#> loudness           -0.002529084      42.3244475     0.623816414
#> speechiness         0.034417042      -0.4674159     0.001150285
#> tempo              -0.467415886     954.7424276     1.083677477
#> valence             0.001150285       1.0836775     0.067634056

plot(prcomp(spotify_num))

💡 Insight: the larger the scale of a column, the larger its variance and covariance values. Here duration_ms, with a variance of about 1.4e10, dwarfs every other column
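As a minimal sketch of why the unit of measurement matters, the snippet below uses made-up durations (the vector duration_s is hypothetical and not taken from the dataset). Expressing the same values in milliseconds instead of seconds multiplies the variance by 1000^2, which is why a column such as duration_ms can swamp the others before scaling.

```r
# Hypothetical track durations in seconds (illustrative values only)
duration_s  <- c(99, 137, 170, 152, 83)
duration_ms <- duration_s * 1000   # same information, different unit

var_s  <- var(duration_s)
var_ms <- var(duration_ms)

# Multiplying a variable by a constant c multiplies its variance by c^2
ratio <- var_ms / var_s
ratio
```

The ratio comes out at about one million, even though both vectors describe exactly the same tracks.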

3.1 Data Pre-Processing: Scaling

Therefore, we need to scale our dataset before running PCA.

# Scaling
spotify_scaled <- scale(spotify_num)

# Check data range
summary(spotify_scaled)

#>    popularity       acousticness      danceability       duration_ms     
#>  Min.   :-2.2610   Min.   :-1.0389   Min.   :-2.68019   Min.   :-1.8475  
#>  1st Qu.:-0.6667   1st Qu.:-0.9329   1st Qu.:-0.64310   1st Qu.:-0.4394  
#>  Median : 0.1029   Median :-0.3849   Median : 0.08963   Median :-0.1236  
#>  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
#>  3rd Qu.: 0.7626   3rd Qu.: 0.9963   3rd Qu.: 0.74154   3rd Qu.: 0.2577  
#>  Max.   : 3.2365   Max.   : 1.7686   Max.   : 2.34168   Max.   :44.7114  
#>      energy        instrumentalness     liveness          loudness      
#>  Min.   :-2.1671   Min.   :-0.4898   Min.   :-1.0356   Min.   :-7.1500  
#>  1st Qu.:-0.7058   1st Qu.:-0.4898   1st Qu.:-0.5932   1st Qu.:-0.3670  
#>  Median : 0.1292   Median :-0.4897   Median :-0.4388   Median : 0.3014  
#>  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
#>  3rd Qu.: 0.8200   3rd Qu.:-0.3716   3rd Qu.: 0.2471   3rd Qu.: 0.6784  
#>  Max.   : 1.6247   Max.   : 2.8097   Max.   : 3.9591   Max.   : 2.2196  
#>   speechiness           tempo             valence        
#>  Min.   :-0.53129   Min.   :-2.82494   Min.   :-1.74924  
#>  1st Qu.:-0.45314   1st Qu.:-0.79963   1st Qu.:-0.83793  
#>  Median :-0.38091   Median :-0.06112   Median :-0.04198  
#>  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
#>  3rd Qu.:-0.08498   3rd Qu.: 0.69217   3rd Qu.: 0.78858  
#>  Max.   : 4.56146   Max.   : 4.05310   Max.   : 2.09595
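For reference, scale() with its default arguments subtracts each column mean and divides by each column standard deviation, which is why every mean above is 0. The following sketch checks this by hand on the built-in USArrests dataset, used here only as a small stand-in for the Spotify data.

```r
# Built-in dataset used only for illustration (not the Spotify data)
x <- as.matrix(USArrests)

# Z-score by hand: subtract each column mean, then divide by each column sd
manual <- sweep(x, 2, colMeans(x), "-")
manual <- sweep(manual, 2, apply(x, 2, sd), "/")

# scale() with its defaults (center = TRUE, scale = TRUE)
auto <- scale(x)

# Compare the numbers only; scale() stores the centers and sds as extra attributes
same_scaling <- isTRUE(all.equal(manual, auto, check.attributes = FALSE))
same_scaling
```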

Check the plot of the variances again, now using the scaled dataset:

# Variance summarized by each PC
plot(prcomp(spotify_scaled))

4 Principal Component Analysis (PCA)

4.1 Method #1 : Using `prcomp()` on scaled data

prcomp(spotify_scaled)

#> Standard deviations (1, .., p=11):
#>  [1] 1.9001207 1.3076820 1.0822420 0.9999174 0.9282900 0.8699176 0.7986587
#>  [8] 0.6966911 0.6125288 0.5260656 0.3387878
#> 
#> Rotation (n x k) = (11 x 11):
#>                          PC1         PC2         PC3          PC4           PC5
#> popularity        0.23639321 -0.29846005  0.09943228  0.422119586 -0.4736815401
#> acousticness     -0.42022853  0.18772369 -0.20855119  0.006879475  0.0282681519
#> danceability      0.33424065  0.06019282 -0.45070673  0.239688252  0.2376309068
#> duration_ms      -0.06002993 -0.03275554  0.59352497  0.418326167  0.6447009218
#> energy            0.44628732  0.09692738  0.24698541 -0.097178416  0.0002462744
#> instrumentalness -0.32180898 -0.18288404  0.06985215 -0.132956545  0.1569641273
#> liveness          0.02965112  0.61932242  0.25373682 -0.056658396 -0.1249026942
#> loudness          0.46703496 -0.02465578  0.15341481  0.011412322 -0.0633230456
#> speechiness       0.02970675  0.64556159  0.02852151  0.068078940 -0.0954483000
#> tempo             0.15717921 -0.14937864  0.25897753 -0.733221667  0.0571792098
#> valence           0.32385768  0.07008504 -0.41173869 -0.128880686  0.4960754526
#>                          PC6         PC7         PC8         PC9        PC10
#> popularity        0.32083437 -0.42232755 -0.31413684  0.24820465  0.01528871
#> acousticness      0.30090066  0.04694641 -0.15387384  0.26137273 -0.69955892
#> danceability      0.22268142 -0.29141242  0.18169713 -0.58512988 -0.14978489
#> duration_ms       0.22709391  0.01628894 -0.01447981  0.01538797 -0.01145835
#> energy           -0.35002606 -0.03904784  0.11993763  0.22449855 -0.14230425
#> instrumentalness -0.44386715 -0.75377712  0.07019165 -0.01147147 -0.17749694
#> liveness         -0.11179760 -0.13086640 -0.60551414 -0.36598616 -0.01034614
#> loudness         -0.15515093  0.14147365  0.11624804  0.01293022 -0.62277521
#> speechiness       0.21165146 -0.28954981  0.51727247  0.33736400  0.15979238
#> tempo             0.55013273 -0.18408344  0.01084837 -0.08771153 -0.02238735
#> valence          -0.04261779 -0.09837511 -0.42101780  0.47043909  0.14856079
#>                           PC11
#> popularity        0.0278823444
#> acousticness      0.2640209720
#> danceability      0.1878030923
#> duration_ms       0.0008992083
#> energy            0.7154782335
#> instrumentalness -0.1184748572
#> liveness         -0.0451919382
#> loudness         -0.5549575890
#> speechiness      -0.1795962710
#> tempo             0.0135260826
#> valence          -0.1607522457

4.2 Method #2 : Using `prcomp()` on un-scaled data, with the scaling parameter

pca_spotify <- prcomp(spotify_num, scale. = TRUE)  # scale. is the full argument name
pca_spotify

#> Standard deviations (1, .., p=11):
#>  [1] 1.9001207 1.3076820 1.0822420 0.9999174 0.9282900 0.8699176 0.7986587
#>  [8] 0.6966911 0.6125288 0.5260656 0.3387878
#> 
#> Rotation (n x k) = (11 x 11):
#>                          PC1         PC2         PC3          PC4           PC5
#> popularity        0.23639321 -0.29846005  0.09943228  0.422119586 -0.4736815401
#> acousticness     -0.42022853  0.18772369 -0.20855119  0.006879475  0.0282681519
#> danceability      0.33424065  0.06019282 -0.45070673  0.239688252  0.2376309068
#> duration_ms      -0.06002993 -0.03275554  0.59352497  0.418326167  0.6447009218
#> energy            0.44628732  0.09692738  0.24698541 -0.097178416  0.0002462744
#> instrumentalness -0.32180898 -0.18288404  0.06985215 -0.132956545  0.1569641273
#> liveness          0.02965112  0.61932242  0.25373682 -0.056658396 -0.1249026942
#> loudness          0.46703496 -0.02465578  0.15341481  0.011412322 -0.0633230456
#> speechiness       0.02970675  0.64556159  0.02852151  0.068078940 -0.0954483000
#> tempo             0.15717921 -0.14937864  0.25897753 -0.733221667  0.0571792098
#> valence           0.32385768  0.07008504 -0.41173869 -0.128880686  0.4960754526
#>                          PC6         PC7         PC8         PC9        PC10
#> popularity        0.32083437 -0.42232755 -0.31413684  0.24820465  0.01528871
#> acousticness      0.30090066  0.04694641 -0.15387384  0.26137273 -0.69955892
#> danceability      0.22268142 -0.29141242  0.18169713 -0.58512988 -0.14978489
#> duration_ms       0.22709391  0.01628894 -0.01447981  0.01538797 -0.01145835
#> energy           -0.35002606 -0.03904784  0.11993763  0.22449855 -0.14230425
#> instrumentalness -0.44386715 -0.75377712  0.07019165 -0.01147147 -0.17749694
#> liveness         -0.11179760 -0.13086640 -0.60551414 -0.36598616 -0.01034614
#> loudness         -0.15515093  0.14147365  0.11624804  0.01293022 -0.62277521
#> speechiness       0.21165146 -0.28954981  0.51727247  0.33736400  0.15979238
#> tempo             0.55013273 -0.18408344  0.01084837 -0.08771153 -0.02238735
#> valence          -0.04261779 -0.09837511 -0.42101780  0.47043909  0.14856079
#>                           PC11
#> popularity        0.0278823444
#> acousticness      0.2640209720
#> danceability      0.1878030923
#> duration_ms       0.0008992083
#> energy            0.7154782335
#> instrumentalness -0.1184748572
#> liveness         -0.0451919382
#> loudness         -0.5549575890
#> speechiness      -0.1795962710
#> tempo             0.0135260826
#> valence          -0.1607522457

Both methods above give the same result.
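The equivalence can also be checked in code rather than by eye. This is a minimal sketch on the built-in USArrests dataset (a stand-in, not the Spotify data); absolute values are compared for the rotation because the sign of each component is arbitrary.

```r
x <- USArrests  # built-in dataset, used only for illustration

pca_a <- prcomp(scale(x))          # Method 1: scale first, then PCA
pca_b <- prcomp(x, scale. = TRUE)  # Method 2: let prcomp() do the scaling

# Standard deviations of the PCs should match
same_sdev <- isTRUE(all.equal(pca_a$sdev, pca_b$sdev))

# Loadings should match up to sign
same_rotation <- isTRUE(all.equal(abs(pca_a$rotation), abs(pca_b$rotation)))

c(same_sdev = same_sdev, same_rotation = same_rotation)
```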

The object returned by prcomp() holds three main pieces of information :

Eigenvalues (variance)

# Eigenvalues (variance of each PC)
pca_spotify$sdev^2

#>  [1] 3.6104585 1.7100322 1.1712478 0.9998348 0.8617223 0.7567566 0.6378557
#>  [8] 0.4853785 0.3751915 0.2767450 0.1147772
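On standardized data these values are the eigenvalues of the correlation matrix, and they add up to the number of variables (the 11 values above sum to 11). A small sketch of that relationship, again on the built-in USArrests dataset rather than the Spotify data:

```r
pca <- prcomp(USArrests, scale. = TRUE)  # illustrative dataset only

eig_from_pca <- pca$sdev^2                   # variance of each PC
eig_from_cor <- eigen(cor(USArrests))$values # eigenvalues of the correlation matrix

same_eigen <- isTRUE(all.equal(eig_from_pca, eig_from_cor))

# Total variance of standardized data equals the number of variables
total_var <- sum(eig_from_pca)

same_eigen
total_var
```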

Eigenvectors (rotation matrix)

# Eigenvectors
pca_spotify$rotation

#>                          PC1         PC2         PC3          PC4           PC5
#> popularity        0.23639321 -0.29846005  0.09943228  0.422119586 -0.4736815401
#> acousticness     -0.42022853  0.18772369 -0.20855119  0.006879475  0.0282681519
#> danceability      0.33424065  0.06019282 -0.45070673  0.239688252  0.2376309068
#> duration_ms      -0.06002993 -0.03275554  0.59352497  0.418326167  0.6447009218
#> energy            0.44628732  0.09692738  0.24698541 -0.097178416  0.0002462744
#> instrumentalness -0.32180898 -0.18288404  0.06985215 -0.132956545  0.1569641273
#> liveness          0.02965112  0.61932242  0.25373682 -0.056658396 -0.1249026942
#> loudness          0.46703496 -0.02465578  0.15341481  0.011412322 -0.0633230456
#> speechiness       0.02970675  0.64556159  0.02852151  0.068078940 -0.0954483000
#> tempo             0.15717921 -0.14937864  0.25897753 -0.733221667  0.0571792098
#> valence           0.32385768  0.07008504 -0.41173869 -0.128880686  0.4960754526
#>                          PC6         PC7         PC8         PC9        PC10
#> popularity        0.32083437 -0.42232755 -0.31413684  0.24820465  0.01528871
#> acousticness      0.30090066  0.04694641 -0.15387384  0.26137273 -0.69955892
#> danceability      0.22268142 -0.29141242  0.18169713 -0.58512988 -0.14978489
#> duration_ms       0.22709391  0.01628894 -0.01447981  0.01538797 -0.01145835
#> energy           -0.35002606 -0.03904784  0.11993763  0.22449855 -0.14230425
#> instrumentalness -0.44386715 -0.75377712  0.07019165 -0.01147147 -0.17749694
#> liveness         -0.11179760 -0.13086640 -0.60551414 -0.36598616 -0.01034614
#> loudness         -0.15515093  0.14147365  0.11624804  0.01293022 -0.62277521
#> speechiness       0.21165146 -0.28954981  0.51727247  0.33736400  0.15979238
#> tempo             0.55013273 -0.18408344  0.01084837 -0.08771153 -0.02238735
#> valence          -0.04261779 -0.09837511 -0.42101780  0.47043909  0.14856079
#>                           PC11
#> popularity        0.0278823444
#> acousticness      0.2640209720
#> danceability      0.1878030923
#> duration_ms       0.0008992083
#> energy            0.7154782335
#> instrumentalness -0.1184748572
#> liveness         -0.0451919382
#> loudness         -0.5549575890
#> speechiness      -0.1795962710
#> tempo             0.0135260826
#> valence          -0.1607522457

The resulting formula is as follows :

PC1 = 0.24 * popularity + (-0.42) * acousticness + 0.33 * danceability + … + 0.32 * valence

💡 Insight : the variables contributing most to PC1 are loudness and energy

pca_spotify$x : the value of each PC for every observation (the new dataset values)

# Original data - Using scaled data
as.data.frame(spotify_scaled)

# New values in every PC
as.data.frame(pca_spotify$x)

Formula for the first observation, using its scaled values :

PC1 = 0.24 * popularity + (-0.42) * acousticness + 0.33 * danceability + … + 0.32 * valence = 0.990448080

Re-check the calculation :

sum(spotify_scaled[1,] * pca_spotify$rotation[,1])

#> [1] 0.9904481
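The same check extends to every observation and every PC at once: the score matrix is the scaled data multiplied by the rotation matrix. A minimal sketch on the built-in USArrests dataset (not the Spotify data):

```r
pca <- prcomp(USArrests, scale. = TRUE)  # illustrative dataset only
x_scaled <- scale(USArrests)

# All scores in one step: scaled data times the rotation matrix
scores_manual <- x_scaled %*% pca$rotation
same_scores <- isTRUE(all.equal(scores_manual, pca$x, check.attributes = FALSE))

# PC1 score of the first observation, computed like the re-check above
pc1_first <- sum(x_scaled[1, ] * pca$rotation[, 1])

same_scores
```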

4.3 PCA for Dimensionality Reduction

summary(pca_spotify)

#> Importance of components:
#>                           PC1    PC2    PC3     PC4     PC5    PC6     PC7
#> Standard deviation     1.9001 1.3077 1.0822 0.99992 0.92829 0.8699 0.79866
#> Proportion of Variance 0.3282 0.1555 0.1065 0.09089 0.07834 0.0688 0.05799
#> Cumulative Proportion  0.3282 0.4837 0.5902 0.68105 0.75939 0.8282 0.88617
#>                            PC8     PC9    PC10    PC11
#> Standard deviation     0.69669 0.61253 0.52607 0.33879
#> Proportion of Variance 0.04413 0.03411 0.02516 0.01043
#> Cumulative Proportion  0.93030 0.96441 0.98957 1.00000

To retain about 82% of the information, we choose 6 PCs because the Cumulative Proportion at PC6 equals **0.8282**.
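The number of PCs can also be picked programmatically. The sketch below reuses the eigenvalues printed earlier from pca_spotify$sdev^2; the 0.80 threshold is an assumed target chosen for illustration, not a rule from the analysis.

```r
# Eigenvalues copied from the pca_spotify$sdev^2 output
eigenvalues <- c(3.6104585, 1.7100322, 1.1712478, 0.9998348, 0.8617223,
                 0.7567566, 0.6378557, 0.4853785, 0.3751915, 0.2767450,
                 0.1147772)

# Cumulative proportion of variance, as reported by summary(pca_spotify)
cum_prop <- cumsum(eigenvalues) / sum(eigenvalues)

threshold <- 0.80  # assumed target share of variance to retain
n_keep <- which(cum_prop >= threshold)[1]  # first PC that reaches the threshold

n_keep
round(cum_prop[n_keep], 4)
```

With this threshold the first PC to qualify is PC6, matching the choice made here.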

pca_spotify_keep <- as.data.frame(pca_spotify$x[, 1:6])
pca_spotify_keep

pca_spotify_keep is the result of the dimensionality reduction. However, each new column is a linear combination of all the original variables, so the values are not directly interpretable.
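One way to relate the reduced data back to the original variables is to project the kept PCs back through the rotation matrix. This is a hedged sketch on the built-in USArrests dataset, with k = 2 chosen arbitrarily; it is not part of the analysis above. It shows that the share of information lost in the approximation equals the share of variance carried by the dropped PCs.

```r
pca <- prcomp(USArrests, scale. = TRUE)  # illustrative dataset only
x_scaled <- scale(USArrests)
k <- 2                                   # number of PCs kept (arbitrary choice)

# Approximate the standardized data from the first k PCs only
approx_scaled <- pca$x[, 1:k] %*% t(pca$rotation[, 1:k])

# Share of total squared variation lost by the approximation
lost <- sum((x_scaled - approx_scaled)^2) / sum(x_scaled^2)

# Share of variance carried by the dropped PCs
dropped <- sum(pca$sdev[-(1:k)]^2) / sum(pca$sdev^2)

c(lost = lost, dropped = dropped)
```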

4.4 PCA Visualization

Biplot using the first 500 observations of spotify_num

# Subset the first 500 observations
spotify500 <- spotify_num %>% head(n = 500)

pca_spotify500 <- prcomp(spotify500, scale. = TRUE)

biplot(x = pca_spotify500,
       cex = 0.7,
       scale = 0)  # 0 (same as FALSE) plots scores and loadings without rescaling

💡 Insight : High Positive Correlation

- `speechiness` X `liveness`
- `danceability` X `valence`
- `tempo`  X `energy`

💡 Insight : Almost No Correlation

- `speechiness` X `popularity`

Validate these correlations with a correlation matrix :

ggcorr(data = spotify500, label = T, hjust = T, layout.exp = 3)

The highest correlation is between energy and loudness, at a value of about 0.8.

Another way to verify this is to use fviz_contrib(), which shows the variables ordered by their contribution to a given PC.

The following example is for PC1 :

fviz_contrib(
  X = pca_spotify500,
  choice = "var",
  axes = 1  # Which PC to inspect
)
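As a rough cross-check that needs no plotting, the percentage contribution of each variable to a PC can be computed as its squared loading divided by the sum of squared loadings. To our understanding this matches what fviz_contrib() plots for a prcomp object, but treat that as an assumption. The sketch uses the PC1 loadings of the full-data pca_spotify printed earlier, not those of pca_spotify500, so the numbers may differ from the chart.

```r
# PC1 loadings copied from the pca_spotify$rotation output (full dataset)
pc1 <- c(popularity = 0.23639321, acousticness = -0.42022853,
         danceability = 0.33424065, duration_ms = -0.06002993,
         energy = 0.44628732, instrumentalness = -0.32180898,
         liveness = 0.02965112, loudness = 0.46703496,
         speechiness = 0.02970675, tempo = 0.15717921,
         valence = 0.32385768)

# Contribution in percent: squared loading over the sum of squared loadings
contrib_pct <- 100 * pc1^2 / sum(pc1^2)
contrib_sorted <- sort(contrib_pct, decreasing = TRUE)

round(contrib_sorted, 1)
```

If all 11 variables contributed equally, each would account for 100 / 11, or about 9.1%. loudness and energy sit well above that level, in line with the insight on PC1.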

LBB Unsupervised Learning: Spotify Features

Melissa Rusli

29 November 2023
