The data used for this LBB project in Unsupervised Learning is taken from the following source: Kaggle - Spotify Tracks DB.
It is a list of music tracks taken from Spotify, a streaming platform, with information about each audio track.
As we are applying Unsupervised Learning to our data, the goal is to find patterns within the data that generate useful information.
The first step is to prepare our dataset.
# Library setup: load the necessary packages
# data wrangling
library(dplyr)
library(lubridate)
library(GGally)
library(factoextra)
library(tidyr)
library(ggiraphExtra)
library(FactoMineR)
We can also check some basic information about our dataset:
#> 'data.frame': 232725 obs. of 18 variables:
#> $ genre : chr "Movie" "Movie" "Movie" "Movie" ...
#> $ artist_name : chr "Henri Salvador" "Martin & les fées" "Joseph Williams" "Henri Salvador" ...
#> $ track_name : chr "C'est beau de faire un Show" "Perdu d'avance (par Gad Elmaleh)" "Don't Let Me Be Lonely Tonight" "Dis-moi Monsieur Gordon Cooper" ...
#> $ track_id : chr "0BRjO6ga9RKCKjfDqeFgWV" "0BjC1NfoEOOusryehmNudP" "0CoSDzoNIKCRs124s9uTVy" "0Gc6TVm52BwZD07Ki6tIvf" ...
#> $ popularity : int 0 1 3 0 4 0 2 15 0 10 ...
#> $ acousticness : num 0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
#> $ danceability : num 0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
#> $ duration_ms : int 99373 137373 170267 152427 82625 160627 212293 240067 226200 152694 ...
#> $ energy : num 0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
#> $ instrumentalness: num 0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
#> $ key : chr "C#" "F#" "C" "C#" ...
#> $ liveness : num 0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
#> $ loudness : num -1.83 -5.56 -13.88 -12.18 -21.15 ...
#> $ mode : chr "Major" "Minor" "Minor" "Major" ...
#> $ speechiness : num 0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
#> $ tempo : num 167 174 99.5 171.8 140.6 ...
#> $ time_signature : chr "4/4" "4/4" "5/4" "4/4" ...
#> $ valence : num 0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...
There are a total of 232,725 observations in our dataset spotify. The columns are described as follows (reference information taken from Spotify):
- `genre` : Genre of the audio track.
- `artist_name` : The name of the artist who performed the audio track.
- `track_name` : The name of the audio track.
- `track_id` : The Spotify ID for the track.
- `popularity` : The popularity of the track. The value will be between 0 and 100, with 100 being the most popular.
- `acousticness` : A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- `danceability` : Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- `duration_ms` : The duration of the track in milliseconds.
- `energy` : Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
- `instrumentalness` : Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- `key` : The key the track is in.
- `liveness` : Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
- `loudness` : The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
- `mode` : Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived.
- `speechiness` : Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
- `tempo` : The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
- `time_signature` : An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
- `valence` : A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
Check which columns appropriately have the potential to be changed to factor type:
# Check for Potential Factor type
df_num_uniques <- spotify %>%
  summarise(across(everything(), n_distinct)) %>%
  pivot_longer(everything()) %>%
  arrange(value)
df_num_uniques
`track_id` is only an identifier for each track, so it is dropped in the next step.
Data with character type :
- `mode` : chr → factor
- `time_signature` : chr → factor
- `key` : chr → factor
- `genre` : chr → factor
# Drop track_id and change the four columns above to factor type
spotify_clean <- spotify %>%
  select(-track_id) %>%
  mutate(across(c(mode, time_signature, key, genre), as.factor))
# Re-check current data type
str(spotify_clean)
#> 'data.frame': 232725 obs. of 17 variables:
#> $ genre : Factor w/ 27 levels "A Capella","Alternative",..: 16 16 16 16 16 16 16 16 16 16 ...
#> $ artist_name : chr "Henri Salvador" "Martin & les fées" "Joseph Williams" "Henri Salvador" ...
#> $ track_name : chr "C'est beau de faire un Show" "Perdu d'avance (par Gad Elmaleh)" "Don't Let Me Be Lonely Tonight" "Dis-moi Monsieur Gordon Cooper" ...
#> $ popularity : int 0 1 3 0 4 0 2 15 0 10 ...
#> $ acousticness : num 0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
#> $ danceability : num 0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
#> $ duration_ms : int 99373 137373 170267 152427 82625 160627 212293 240067 226200 152694 ...
#> $ energy : num 0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
#> $ instrumentalness: num 0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
#> $ key : Factor w/ 12 levels "A","A#","B","C",..: 5 10 4 5 9 5 5 10 4 11 ...
#> $ liveness : num 0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
#> $ loudness : num -1.83 -5.56 -13.88 -12.18 -21.15 ...
#> $ mode : Factor w/ 2 levels "Major","Minor": 1 2 2 1 1 1 1 1 1 1 ...
#> $ speechiness : num 0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
#> $ tempo : num 167 174 99.5 171.8 140.6 ...
#> $ time_signature : Factor w/ 5 levels "0/4","1/4","3/4",..: 4 4 5 4 4 4 4 4 4 4 ...
#> $ valence : num 0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...
#> genre artist_name track_name popularity
#> 0 0 0 0
#> acousticness danceability duration_ms energy
#> 0 0 0 0
#> instrumentalness key liveness loudness
#> 0 0 0 0
#> mode speechiness tempo time_signature
#> 0 0 0 0
#> valence
#> 0
#> [1] FALSE
Our dataset does not have any missing values and is ready for further analysis.
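The per-column counts and the single `FALSE` above are the kind of output produced by a missing-value check. A minimal sketch of such a check in base R follows. The data frame `toy` and its values are hypothetical stand-ins for `spotify_clean`, with two missing values planted so the check has something to find.

```r
# Hypothetical stand-in for spotify_clean (not the real data)
toy <- data.frame(
  popularity = c(0L, 1L, 3L, NA),
  energy     = c(0.91, 0.737, NA, 0.326),
  mode       = factor(c("Major", "Minor", "Minor", "Major"))
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# TRUE if at least one value anywhere in the data frame is missing
has_missing <- anyNA(toy)
print(has_missing)
```

On a dataset without missing values, every count is 0 and `anyNA()` returns `FALSE`.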
PCA works with the variance of the data, hence we will only use the columns with numeric data type.
#> Rows: 232,725
#> Columns: 11
#> $ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
#> $ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,…
#> $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
#> $ duration_ms <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123…
#> $ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105…
#> $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
#> $ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953…
#> $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
#> $ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533…
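A numeric-only subset like the one shown above can be built in several ways. The sketch below uses base R on a small hypothetical data frame (`toy_clean` is an illustrative stand-in, not the real data). With dplyr loaded, `select(where(is.numeric))` would be an equivalent option.

```r
# Hypothetical stand-in for spotify_clean with mixed column types
toy_clean <- data.frame(
  genre        = factor(c("Movie", "Movie", "Pop")),
  artist_name  = c("A", "B", "C"),
  popularity   = c(0L, 1L, 3L),
  acousticness = c(0.611, 0.246, 0.952),
  stringsAsFactors = FALSE
)

# Flag the numeric columns (integer and double); factors and characters are FALSE
is_num <- vapply(toy_clean, is.numeric, logical(1))

# Keep only the numeric columns, staying a data frame
toy_num <- toy_clean[, is_num, drop = FALSE]
str(toy_num)
```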
Check the range of each column to see whether the columns share the same scale
#> popularity acousticness danceability duration_ms
#> Min. : 0.00 Min. :0.0000 Min. :0.0569 Min. : 15387
#> 1st Qu.: 29.00 1st Qu.:0.0376 1st Qu.:0.4350 1st Qu.: 182857
#> Median : 43.00 Median :0.2320 Median :0.5710 Median : 220427
#> Mean : 41.13 Mean :0.3686 Mean :0.5544 Mean : 235122
#> 3rd Qu.: 55.00 3rd Qu.:0.7220 3rd Qu.:0.6920 3rd Qu.: 265768
#> Max. :100.00 Max. :0.9960 Max. :0.9890 Max. :5552917
#> energy instrumentalness liveness loudness
#> Min. :0.0000203 Min. :0.0000000 Min. :0.00967 Min. :-52.457
#> 1st Qu.:0.3850000 1st Qu.:0.0000000 1st Qu.:0.09740 1st Qu.:-11.771
#> Median :0.6050000 Median :0.0000443 Median :0.12800 Median : -7.762
#> Mean :0.5709577 Mean :0.1483012 Mean :0.21501 Mean : -9.570
#> 3rd Qu.:0.7870000 3rd Qu.:0.0358000 3rd Qu.:0.26400 3rd Qu.: -5.501
#> Max. :0.9990000 Max. :0.9990000 Max. :1.00000 Max. : 3.744
#> speechiness tempo valence
#> Min. :0.0222 Min. : 30.38 Min. :0.0000
#> 1st Qu.:0.0367 1st Qu.: 92.96 1st Qu.:0.2370
#> Median :0.0501 Median :115.78 Median :0.4440
#> Mean :0.1208 Mean :117.67 Mean :0.4549
#> 3rd Qu.:0.1050 3rd Qu.:139.05 3rd Qu.:0.6600
#> Max. :0.9670 Max. :242.90 Max. :1.0000
💡 Insight: the columns are not yet on the same scale
#> popularity acousticness danceability duration_ms
#> popularity 330.8741927 -2.460579486 0.866213977 5079.7963
#> acousticness -2.4605795 0.125860362 -0.024004549 472.7050
#> danceability 0.8662140 -0.024004549 0.034450413 -2776.6700
#> duration_ms 5079.7962942 472.705010446 -2776.670015161 14145750520.8081
#> energy 1.1928936 -0.067816440 0.015931805 -957.2578
#> instrumentalness -1.1619558 0.033958915 -0.020508345 2737.5056
#> liveness -0.6058861 0.004853762 -0.001534008 560.8353
#> loudness 39.6070158 -1.468729115 0.488376611 -33970.6429
#> speechiness -0.5098157 0.009933929 0.004633400 -356.8187
#> tempo 45.5478771 -2.611654392 0.125822789 -104575.8610
#> valence 0.2841956 -0.030059094 0.026411285 -4386.3797
#> energy instrumentalness liveness loudness
#> popularity 1.192893559 -1.161955848 -0.605886064 39.607015849
#> acousticness -0.067816440 0.033958915 0.004853762 -1.468729115
#> danceability 0.015931805 -0.020508345 -0.001534008 0.488376611
#> duration_ms -957.257831674 2737.505647036 560.835264134 -33970.642881913
#> energy 0.069408833 -0.030227884 0.010071149 1.289631260
#> instrumentalness -0.030227884 0.091668682 -0.008055978 -0.919510999
#> liveness 0.010071149 -0.008055978 0.039312018 0.054333071
#> loudness 1.289631260 -0.919510999 0.054333071 35.978446810
#> speechiness 0.007092851 -0.009950208 0.018764818 -0.002529084
#> tempo 1.862333254 -0.974186536 -0.314619135 42.324447453
#> valence 0.029925682 -0.024214148 0.000608679 0.623816414
#> speechiness tempo valence
#> popularity -0.509815661 45.5478771 0.284195566
#> acousticness 0.009933929 -2.6116544 -0.030059094
#> danceability 0.004633400 0.1258228 0.026411285
#> duration_ms -356.818722591 -104575.8610139 -4386.379669824
#> energy 0.007092851 1.8623333 0.029925682
#> instrumentalness -0.009950208 -0.9741865 -0.024214148
#> liveness 0.018764818 -0.3146191 0.000608679
#> loudness -0.002529084 42.3244475 0.623816414
#> speechiness 0.034417042 -0.4674159 0.001150285
#> tempo -0.467415886 954.7424276 1.083677477
#> valence 0.001150285 1.0836775 0.067634056
💡 Insight: the larger the scale of a column, the larger its variance and covariance values
Therefore, we need to scale our dataset
#> popularity acousticness danceability duration_ms
#> Min. :-2.2610 Min. :-1.0389 Min. :-2.68019 Min. :-1.8475
#> 1st Qu.:-0.6667 1st Qu.:-0.9329 1st Qu.:-0.64310 1st Qu.:-0.4394
#> Median : 0.1029 Median :-0.3849 Median : 0.08963 Median :-0.1236
#> Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
#> 3rd Qu.: 0.7626 3rd Qu.: 0.9963 3rd Qu.: 0.74154 3rd Qu.: 0.2577
#> Max. : 3.2365 Max. : 1.7686 Max. : 2.34168 Max. :44.7114
#> energy instrumentalness liveness loudness
#> Min. :-2.1671 Min. :-0.4898 Min. :-1.0356 Min. :-7.1500
#> 1st Qu.:-0.7058 1st Qu.:-0.4898 1st Qu.:-0.5932 1st Qu.:-0.3670
#> Median : 0.1292 Median :-0.4897 Median :-0.4388 Median : 0.3014
#> Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
#> 3rd Qu.: 0.8200 3rd Qu.:-0.3716 3rd Qu.: 0.2471 3rd Qu.: 0.6784
#> Max. : 1.6247 Max. : 2.8097 Max. : 3.9591 Max. : 2.2196
#> speechiness tempo valence
#> Min. :-0.53129 Min. :-2.82494 Min. :-1.74924
#> 1st Qu.:-0.45314 1st Qu.:-0.79963 1st Qu.:-0.83793
#> Median :-0.38091 Median :-0.06112 Median :-0.04198
#> Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
#> 3rd Qu.:-0.08498 3rd Qu.: 0.69217 3rd Qu.: 0.78858
#> Max. : 4.56146 Max. : 4.05310 Max. : 2.09595
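The effect of scaling on the variances can be sketched on synthetic data. The column names below mirror the dataset, but the values are randomly generated assumptions, not the Spotify data. `scale()` centers each column to mean 0 and divides it by its standard deviation.

```r
set.seed(1)

# Synthetic columns on very different scales (assumed values)
toy_num <- data.frame(
  duration_ms = rnorm(200, mean = 235000, sd = 120000),
  loudness    = rnorm(200, mean = -9.5, sd = 6),
  energy      = runif(200)
)

# Variances before scaling: dominated by the column with the largest scale
raw_var <- diag(cov(toy_num))
print(raw_var)

# Center each column to mean 0 and rescale it to standard deviation 1
toy_scaled <- scale(toy_num)

# Variances after scaling: every column now has variance 1
scaled_var <- diag(cov(toy_scaled))
print(scaled_var)
```

After scaling, no single column can dominate the PCA purely because of its unit of measurement.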
Check the plot of the variance values using the current scaled dataset
prcomp() using scaled data
#> Standard deviations (1, .., p=11):
#> [1] 1.9001207 1.3076820 1.0822420 0.9999174 0.9282900 0.8699176 0.7986587
#> [8] 0.6966911 0.6125288 0.5260656 0.3387878
#>
#> Rotation (n x k) = (11 x 11):
#> PC1 PC2 PC3 PC4 PC5
#> popularity 0.23639321 -0.29846005 0.09943228 0.422119586 -0.4736815401
#> acousticness -0.42022853 0.18772369 -0.20855119 0.006879475 0.0282681519
#> danceability 0.33424065 0.06019282 -0.45070673 0.239688252 0.2376309068
#> duration_ms -0.06002993 -0.03275554 0.59352497 0.418326167 0.6447009218
#> energy 0.44628732 0.09692738 0.24698541 -0.097178416 0.0002462744
#> instrumentalness -0.32180898 -0.18288404 0.06985215 -0.132956545 0.1569641273
#> liveness 0.02965112 0.61932242 0.25373682 -0.056658396 -0.1249026942
#> loudness 0.46703496 -0.02465578 0.15341481 0.011412322 -0.0633230456
#> speechiness 0.02970675 0.64556159 0.02852151 0.068078940 -0.0954483000
#> tempo 0.15717921 -0.14937864 0.25897753 -0.733221667 0.0571792098
#> valence 0.32385768 0.07008504 -0.41173869 -0.128880686 0.4960754526
#> PC6 PC7 PC8 PC9 PC10
#> popularity 0.32083437 -0.42232755 -0.31413684 0.24820465 0.01528871
#> acousticness 0.30090066 0.04694641 -0.15387384 0.26137273 -0.69955892
#> danceability 0.22268142 -0.29141242 0.18169713 -0.58512988 -0.14978489
#> duration_ms 0.22709391 0.01628894 -0.01447981 0.01538797 -0.01145835
#> energy -0.35002606 -0.03904784 0.11993763 0.22449855 -0.14230425
#> instrumentalness -0.44386715 -0.75377712 0.07019165 -0.01147147 -0.17749694
#> liveness -0.11179760 -0.13086640 -0.60551414 -0.36598616 -0.01034614
#> loudness -0.15515093 0.14147365 0.11624804 0.01293022 -0.62277521
#> speechiness 0.21165146 -0.28954981 0.51727247 0.33736400 0.15979238
#> tempo 0.55013273 -0.18408344 0.01084837 -0.08771153 -0.02238735
#> valence -0.04261779 -0.09837511 -0.42101780 0.47043909 0.14856079
#> PC11
#> popularity 0.0278823444
#> acousticness 0.2640209720
#> danceability 0.1878030923
#> duration_ms 0.0008992083
#> energy 0.7154782335
#> instrumentalness -0.1184748572
#> liveness -0.0451919382
#> loudness -0.5549575890
#> speechiness -0.1795962710
#> tempo 0.0135260826
#> valence -0.1607522457
prcomp() using un-scaled data, with the added parameter for scaling
#> Standard deviations (1, .., p=11):
#> [1] 1.9001207 1.3076820 1.0822420 0.9999174 0.9282900 0.8699176 0.7986587
#> [8] 0.6966911 0.6125288 0.5260656 0.3387878
#>
#> Rotation (n x k) = (11 x 11):
#> PC1 PC2 PC3 PC4 PC5
#> popularity 0.23639321 -0.29846005 0.09943228 0.422119586 -0.4736815401
#> acousticness -0.42022853 0.18772369 -0.20855119 0.006879475 0.0282681519
#> danceability 0.33424065 0.06019282 -0.45070673 0.239688252 0.2376309068
#> duration_ms -0.06002993 -0.03275554 0.59352497 0.418326167 0.6447009218
#> energy 0.44628732 0.09692738 0.24698541 -0.097178416 0.0002462744
#> instrumentalness -0.32180898 -0.18288404 0.06985215 -0.132956545 0.1569641273
#> liveness 0.02965112 0.61932242 0.25373682 -0.056658396 -0.1249026942
#> loudness 0.46703496 -0.02465578 0.15341481 0.011412322 -0.0633230456
#> speechiness 0.02970675 0.64556159 0.02852151 0.068078940 -0.0954483000
#> tempo 0.15717921 -0.14937864 0.25897753 -0.733221667 0.0571792098
#> valence 0.32385768 0.07008504 -0.41173869 -0.128880686 0.4960754526
#> PC6 PC7 PC8 PC9 PC10
#> popularity 0.32083437 -0.42232755 -0.31413684 0.24820465 0.01528871
#> acousticness 0.30090066 0.04694641 -0.15387384 0.26137273 -0.69955892
#> danceability 0.22268142 -0.29141242 0.18169713 -0.58512988 -0.14978489
#> duration_ms 0.22709391 0.01628894 -0.01447981 0.01538797 -0.01145835
#> energy -0.35002606 -0.03904784 0.11993763 0.22449855 -0.14230425
#> instrumentalness -0.44386715 -0.75377712 0.07019165 -0.01147147 -0.17749694
#> liveness -0.11179760 -0.13086640 -0.60551414 -0.36598616 -0.01034614
#> loudness -0.15515093 0.14147365 0.11624804 0.01293022 -0.62277521
#> speechiness 0.21165146 -0.28954981 0.51727247 0.33736400 0.15979238
#> tempo 0.55013273 -0.18408344 0.01084837 -0.08771153 -0.02238735
#> valence -0.04261779 -0.09837511 -0.42101780 0.47043909 0.14856079
#> PC11
#> popularity 0.0278823444
#> acousticness 0.2640209720
#> danceability 0.1878030923
#> duration_ms 0.0008992083
#> energy 0.7154782335
#> instrumentalness -0.1184748572
#> liveness -0.0451919382
#> loudness -0.5549575890
#> speechiness -0.1795962710
#> tempo 0.0135260826
#> valence -0.1607522457
Both methods above give the same result.
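The equivalence of the two approaches can be checked on a small example. The sketch below uses randomly generated data (`toy_num` is a hypothetical stand-in for `spotify_num`). Absolute loadings are compared because the sign of a principal component is arbitrary.

```r
set.seed(42)

# Synthetic numeric data on different scales (assumed values)
toy_num <- data.frame(
  a = rnorm(100, mean = 50, sd = 20),
  b = rnorm(100),
  c = runif(100, min = 0, max = 1000)
)

# Approach 1: scale the data first, then run PCA
pca_scaled_first <- prcomp(scale(toy_num))

# Approach 2: pass the un-scaled data and let prcomp() do the scaling
pca_scale_arg <- prcomp(toy_num, scale. = TRUE)

# Compare standard deviations and absolute loadings of both fits
same_sdev     <- all.equal(pca_scaled_first$sdev, pca_scale_arg$sdev)
same_rotation <- all.equal(abs(pca_scaled_first$rotation),
                           abs(pca_scale_arg$rotation))
print(same_sdev)
print(same_rotation)
```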
prcomp() returns three main pieces of information :
#> [1] 3.6104585 1.7100322 1.1712478 0.9998348 0.8617223 0.7567566 0.6378557
#> [8] 0.4853785 0.3751915 0.2767450 0.1147772
#> PC1 PC2 PC3 PC4 PC5
#> popularity 0.23639321 -0.29846005 0.09943228 0.422119586 -0.4736815401
#> acousticness -0.42022853 0.18772369 -0.20855119 0.006879475 0.0282681519
#> danceability 0.33424065 0.06019282 -0.45070673 0.239688252 0.2376309068
#> duration_ms -0.06002993 -0.03275554 0.59352497 0.418326167 0.6447009218
#> energy 0.44628732 0.09692738 0.24698541 -0.097178416 0.0002462744
#> instrumentalness -0.32180898 -0.18288404 0.06985215 -0.132956545 0.1569641273
#> liveness 0.02965112 0.61932242 0.25373682 -0.056658396 -0.1249026942
#> loudness 0.46703496 -0.02465578 0.15341481 0.011412322 -0.0633230456
#> speechiness 0.02970675 0.64556159 0.02852151 0.068078940 -0.0954483000
#> tempo 0.15717921 -0.14937864 0.25897753 -0.733221667 0.0571792098
#> valence 0.32385768 0.07008504 -0.41173869 -0.128880686 0.4960754526
#> PC6 PC7 PC8 PC9 PC10
#> popularity 0.32083437 -0.42232755 -0.31413684 0.24820465 0.01528871
#> acousticness 0.30090066 0.04694641 -0.15387384 0.26137273 -0.69955892
#> danceability 0.22268142 -0.29141242 0.18169713 -0.58512988 -0.14978489
#> duration_ms 0.22709391 0.01628894 -0.01447981 0.01538797 -0.01145835
#> energy -0.35002606 -0.03904784 0.11993763 0.22449855 -0.14230425
#> instrumentalness -0.44386715 -0.75377712 0.07019165 -0.01147147 -0.17749694
#> liveness -0.11179760 -0.13086640 -0.60551414 -0.36598616 -0.01034614
#> loudness -0.15515093 0.14147365 0.11624804 0.01293022 -0.62277521
#> speechiness 0.21165146 -0.28954981 0.51727247 0.33736400 0.15979238
#> tempo 0.55013273 -0.18408344 0.01084837 -0.08771153 -0.02238735
#> valence -0.04261779 -0.09837511 -0.42101780 0.47043909 0.14856079
#> PC11
#> popularity 0.0278823444
#> acousticness 0.2640209720
#> danceability 0.1878030923
#> duration_ms 0.0008992083
#> energy 0.7154782335
#> instrumentalness -0.1184748572
#> liveness -0.0451919382
#> loudness -0.5549575890
#> speechiness -0.1795962710
#> tempo 0.0135260826
#> valence -0.1607522457
The resulting formula is as follows :
PC1 = 0.24 * popularity + (-0.42) * acousticness + 0.33 * danceability + … + 0.32 * valence
💡 Insight : the variables contributing most to PC1 are loudness and energy
pca$x : the value of each PC for every observation (the new dataset values)
Formula (using the scaled values of an observation) :
PC1 = 0.24 * popularity + (-0.42) * acousticness + 0.33 * danceability + … + 0.32 * valence = 0.990448080
Re-check the calculation :
#> [1] 0.9904481
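The re-check above amounts to multiplying an observation's scaled values by the PC1 loadings and summing the result. A minimal sketch on synthetic data follows. The data frame `toy_num` and the object name `pca` are assumptions for illustration only.

```r
set.seed(7)

# Synthetic numeric data (assumed values)
toy_num <- data.frame(
  popularity = rnorm(50, mean = 40, sd = 18),
  energy     = runif(50),
  tempo      = rnorm(50, mean = 118, sd = 30)
)

pca <- prcomp(toy_num, scale. = TRUE)

# Scaled values of the first observation
first_row_scaled <- scale(toy_num)[1, ]

# Manual PC1 score: sum of (scaled value * PC1 loading) over all variables
pc1_manual <- sum(first_row_scaled * pca$rotation[, "PC1"])

# PC1 score of the same observation as stored by prcomp()
pc1_from_prcomp <- pca$x[1, "PC1"]

print(c(manual = pc1_manual, prcomp = unname(pc1_from_prcomp)))
```

Both numbers agree, which confirms that `pca$x` is the scaled data projected onto the loadings in `pca$rotation`.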
#> Importance of components:
#> PC1 PC2 PC3 PC4 PC5 PC6 PC7
#> Standard deviation 1.9001 1.3077 1.0822 0.99992 0.92829 0.8699 0.79866
#> Proportion of Variance 0.3282 0.1555 0.1065 0.09089 0.07834 0.0688 0.05799
#> Cumulative Proportion 0.3282 0.4837 0.5902 0.68105 0.75939 0.8282 0.88617
#> PC8 PC9 PC10 PC11
#> Standard deviation 0.69669 0.61253 0.52607 0.33879
#> Proportion of Variance 0.04413 0.03411 0.02516 0.01043
#> Cumulative Proportion 0.93030 0.96441 0.98957 1.00000
To retain about 82% of the information, we will choose 6 PCs because the Cumulative Proportion at PC6 is equal to **0.8282**.
pca_spotify_keep is the result of the dimensionality reduction; however, the resulting data is no longer directly interpretable.
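Choosing the number of components from the cumulative proportion can also be done programmatically. The sketch below uses synthetic data and an assumed target of 80%. The names `toy_num`, `target`, `k` and `pca_keep` are illustrative and not taken from the original analysis.

```r
set.seed(3)

# Synthetic data where v1 and v2 share a common signal (assumed values)
shared <- rnorm(300)
toy_num <- data.frame(
  v1 = shared + rnorm(300, sd = 0.3),
  v2 = shared + rnorm(300, sd = 0.3),
  v3 = rnorm(300),
  v4 = rnorm(300),
  v5 = rnorm(300)
)

pca <- prcomp(toy_num, scale. = TRUE)

# Proportion of variance per PC and its running total
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_prop <- cumsum(prop_var)
print(round(cum_prop, 4))

# Smallest number of PCs whose cumulative proportion reaches the target
target <- 0.80
k <- which(cum_prop >= target)[1]

# Reduced dataset: the scores of the first k PCs for every observation
pca_keep <- pca$x[, seq_len(k), drop = FALSE]
dim(pca_keep)
```

The columns of the reduced data are PC scores, so each one mixes all original variables, which is why they are hard to interpret directly.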
Biplot using the first 500 rows of spotify_num
# 1. Subset the first 500 rows
spotify500 <- spotify_num %>% head(n = 500)
pca_spotify500 <- prcomp(spotify500, scale. = TRUE)
biplot(x = pca_spotify500,
       cex = 0.7,
       scale = FALSE)
💡 Insight : High Positive Correlation
- `speechiness` X `liveness`
- `danceability` X `valence`
- `tempo` X `energy`
💡 Insight : Almost No Correlation
- `speechiness` X `popularity`
Validation of the correlation values
The highest correlation is between `energy` and `loudness`, at a value of about 0.8, as seen on PC 1 and PC 2
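A correlation read off a biplot can be validated numerically with `cor()`. The sketch below builds synthetic data in which `loudness` is deliberately constructed to follow `energy`. That relationship is an assumption of the example, not a result from the Spotify data.

```r
set.seed(11)

# Synthetic data: loudness is built to depend on energy (assumed relationship)
energy <- runif(500)
toy_num <- data.frame(
  energy   = energy,
  loudness = -30 + 25 * energy + rnorm(500, sd = 3),
  tempo    = rnorm(500, mean = 118, sd = 30),
  valence  = runif(500)
)

# Pairwise Pearson correlations
cor_mat <- cor(toy_num)
print(round(cor_mat, 2))

# Keep the upper triangle only so each pair appears once and the diagonal is ignored
cor_upper <- cor_mat
cor_upper[lower.tri(cor_upper, diag = TRUE)] <- NA

# Locate the pair with the largest absolute correlation
top <- which(abs(cor_upper) == max(abs(cor_upper), na.rm = TRUE), arr.ind = TRUE)
top_pair <- c(rownames(cor_mat)[top[1, "row"]], colnames(cor_mat)[top[1, "col"]])
print(top_pair)
```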
Another way to verify this is to use fviz_contrib() to look at the order of the contributing variables for each PC.
The following example is for PC 1 :
fviz_contrib(
  X = pca_spotify500,
  choice = "var",
  axes = 1 # Which PC to inspect for variable contributions
)
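The same ordering can be approximated without plotting. For a `prcomp()` fit, each column of `rotation` has unit length, so the squared loadings times 100 give each variable's percentage contribution to that PC. To my understanding these are the percentages `fviz_contrib()` displays, but treat that as an assumption. The data below is synthetic.

```r
set.seed(5)

# Synthetic data: energy and loudness share a signal, tempo is independent (assumed)
signal <- rnorm(200)
toy_num <- data.frame(
  energy   = signal + rnorm(200, sd = 0.4),
  loudness = signal + rnorm(200, sd = 0.4),
  tempo    = rnorm(200)
)

pca <- prcomp(toy_num, scale. = TRUE)

# Percentage contribution of each variable to PC1 (squared loading * 100)
contrib_pc1 <- pca$rotation[, "PC1"]^2 * 100

# Variables ordered from largest to smallest contribution
print(sort(contrib_pc1, decreasing = TRUE))

# Average contribution if all variables contributed equally
expected_avg <- 100 / ncol(toy_num)
print(expected_avg)
```

Variables whose contribution exceeds the equal-share average are the ones that matter most for that component.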