We will analyze the data using the k-means method. We will also try to reduce the dimensionality of the dataset using Principal Component Analysis (PCA).
The dataset was acquired from Kaggle.
Load the required libraries.
library(dplyr)
library(tidyr)
library(GGally)
library(gridExtra)
library(factoextra)
library(FactoMineR)
library(plotly)
library(ggiraphExtra)
options(scipen = 100, max.print = 101)
spotify <- read.csv("data_input/SpotifyFeatures.csv", sep = ";")
glimpse(spotify)
#> Rows: 232,725
#> Columns: 18
#> $ genres <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie",…
#> $ artist_name <chr> "Henri Salvador", "Martin & les f\xe9es", "Joseph Wil…
#> $ track_name <chr> "C'est beau de faire un Show", "Perdu d'avance (par G…
#> $ track_id <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "…
#> $ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
#> $ acousticness <chr> "0,424305556", "0,170833333", "0,661111111", "0,48819…
#> $ danceability <chr> "0,270138889", "00.59", "0,460416667", "00.24", "0,22…
#> $ duration_ms <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy <chr> "0,063194444", "0,511805556", "0,090972222", "0,22638…
#> $ instrumentalness <chr> "0", "0", "0", "0", "0,085416667", "0", "0", "0", "0.…
#> $ key <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G…
#> $ liveness <chr> "0,240277778", "0,104861111", "0,071527778", "0,68402…
#> $ loudness <chr> "-1.828", "-5.559", "-13.879", "-12.178", "-21.15", "…
#> $ mode <chr> "Major", "Minor", "Minor", "Major", "Major", "Major",…
#> $ speechiness <chr> "0,364583333", "0,602777778", "0,251388889", "0,27430…
#> $ tempo <chr> "166.969", "174.003", "99.488", "171.758", "140.576",…
#> $ time_signature <chr> "04-Apr", "04-Apr", "05-Apr", "04-Apr", "04-Apr", "04…
#> $ valence <chr> "0,565277778", "0,566666667", "0,255555556", "0,15763…
Variable explanation:
Based on the provided dataset, here is an explanation of each feature:
- genres: The genre of the track.
- artist_name: The name of the artist or performer.
- track_name: The name of the track.
- track_id: The unique identifier for the track.
- popularity: The popularity score of the track.
- acousticness: The measure of how acoustic the track is.
- danceability: The measure of how suitable the track is for dancing.
- duration_ms: The duration of the track in milliseconds.
- energy: The energy level of the track.
- instrumentalness: The measure of how likely the track is to be instrumental.
- key: The key in which the track is composed.
- liveness: The measure of how likely the track is to have been performed live.
- loudness: The loudness of the track.
- mode: The modality (major or minor) of the track.
- speechiness: The measure of how many spoken words are present in the track.
- tempo: The tempo (beats per minute) of the track.
- time_signature: The time signature of the track.
- valence: The measure of musical positiveness conveyed by the track.

Check the number of unique values in each feature.
sapply(spotify, function(X) length(unique(X)))
#> genres artist_name track_name track_id
#> 27 14556 147093 176774
#> popularity acousticness danceability duration_ms
#> 101 3918 935 70749
#> energy instrumentalness key liveness
#> 1698 4582 12 979
#> loudness mode speechiness tempo
#> 27923 2 965 77978
#> time_signature valence
#> 5 973
mode (along with genres) should be a categorical (factor) data type. acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, and valence should be numeric. Several of these columns use a comma as the decimal separator, so we replace it with a point before converting.
# Replace decimal commas with decimal points
spotify$acousticness <- gsub(",",".", spotify$acousticness)
spotify$danceability <- gsub(",",".", spotify$danceability)
spotify$energy <- gsub(",",".", spotify$energy)
spotify$instrumentalness <- gsub(",",".", spotify$instrumentalness)
spotify$liveness <- gsub(",",".", spotify$liveness)
spotify$speechiness <- gsub(",",".", spotify$speechiness)
spotify$valence <- gsub(",",".", spotify$valence)
# Convert the categorical columns to factor and the audio features to numeric
spotify_c <- spotify %>%
  mutate_at(vars(genres, mode), as.factor) %>%
  mutate_at(vars(acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence), as.numeric)
glimpse(spotify_c)
#> Rows: 232,725
#> Columns: 18
#> $ genres <fct> Movie, Movie, Movie, Movie, Movie, Movie, Movie, Movi…
#> $ artist_name <chr> "Henri Salvador", "Martin & les f\xe9es", "Joseph Wil…
#> $ track_name <chr> "C'est beau de faire un Show", "Perdu d'avance (par G…
#> $ track_id <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "…
#> $ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
#> $ acousticness <dbl> 0.42430556, 0.17083333, 0.66111111, 0.48819444, 0.065…
#> $ danceability <dbl> 0.27013889, 0.59000000, 0.46041667, 0.24000000, 0.229…
#> $ duration_ms <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy <dbl> 0.06319444, 0.51180556, 0.09097222, 0.22638889, 0.156…
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.085…
#> $ key <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G…
#> $ liveness <dbl> 0.24027778, 0.10486111, 0.07152778, 0.68402778, 0.140…
#> $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
#> $ mode <fct> Major, Minor, Minor, Major, Major, Major, Major, Majo…
#> $ speechiness <dbl> 0.36458333, 0.60277778, 0.25138889, 0.27430556, 0.316…
#> $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
#> $ time_signature <chr> "04-Apr", "04-Apr", "05-Apr", "04-Apr", "04-Apr", "04…
#> $ valence <dbl> 0.56527778, 0.56666667, 0.25555556, 0.15763889, 0.390…
The data types are now set.
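As a minimal sketch of the conversion step, the snippet below applies the same comma-to-point replacement to a small made-up vector. The `raw` values are illustrative (modeled on the strings shown by `glimpse()`), not taken from the full dataset.

```r
# Toy vector mixing decimal commas and decimal points, as seen in the raw CSV.
raw <- c("0,424305556", "00.59", "0", "0.5")

# Replace the decimal comma with a point, then convert to numeric.
# fixed = TRUE treats "," as a literal character instead of a regex.
clean <- as.numeric(gsub(",", ".", raw, fixed = TRUE))
clean

# A string that cannot be parsed becomes NA (with a warning), so a
# conversion like this can introduce missing values.
suppressWarnings(as.numeric("not-a-number"))
```

Because unparseable strings silently turn into NA, it is worth checking for missing values right after a conversion like this.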
Check whether there are any missing values.
anyNA(spotify_c)
#> [1] TRUE
Handle the missing values. First, count them per column.
colSums(is.na(spotify_c))
#> genres artist_name track_name track_id
#> 0 0 0 0
#> popularity acousticness danceability duration_ms
#> 0 0 0 0
#> energy instrumentalness key liveness
#> 0 0 0 0
#> loudness mode speechiness tempo
#> 53 0 0 13077
#> time_signature valence
#> 0 0
The best option here is to drop every row that contains a missing value.
spotify_c <- spotify_c %>%
  filter(complete.cases(.))
anyNA(spotify_c)
#> [1] FALSE
nrow(spotify_c)
#> [1] 219597
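To put the cost of dropping rows in perspective, here is a small worked calculation using only the counts printed above (232,725 rows before, 219,597 after, and 53 and 13,077 missing values in loudness and tempo). The variable names are illustrative.

```r
rows_before <- 232725
rows_after  <- 219597
na_loudness <- 53
na_tempo    <- 13077

# Number and share of rows removed by complete.cases().
dropped <- rows_before - rows_after
pct_dropped <- round(100 * dropped / rows_before, 2)

# Rows missing both values are counted twice in the column totals,
# so the overlap is the column total minus the rows actually dropped.
both_missing <- (na_loudness + na_tempo) - dropped

c(dropped = dropped, pct_dropped = pct_dropped, both_missing = both_missing)
```

Roughly 5.6% of the rows are removed, which leaves the bulk of the dataset intact.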
Because PCA works on the variance of the data, we only need the features with a numeric data type.
spotify_num <- spotify_c %>%
  select_if(is.numeric)
glimpse(spotify_num)
#> Rows: 219,597
#> Columns: 11
#> $ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 10, 0, 2, 4, 3, 0, 0, 3, 1, …
#> $ acousticness <dbl> 0.42430556, 0.17083333, 0.66111111, 0.48819444, 0.065…
#> $ danceability <dbl> 0.27013889, 0.59000000, 0.46041667, 0.24000000, 0.229…
#> $ duration_ms <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy <dbl> 0.06319444, 0.51180556, 0.09097222, 0.22638889, 0.156…
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.085…
#> $ liveness <dbl> 0.24027778, 0.10486111, 0.07152778, 0.68402778, 0.140…
#> $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
#> $ speechiness <dbl> 0.36458333, 0.60277778, 0.25138889, 0.27430556, 0.316…
#> $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
#> $ valence <dbl> 0.56527778, 0.56666667, 0.25555556, 0.15763889, 0.390…
summary(spotify_num)
#> popularity acousticness danceability duration_ms
#> Min. : 0.00 Min. :0.00000 Min. :0.0100 Min. : 15387
#> 1st Qu.: 29.00 1st Qu.:0.09097 1st Qu.:0.2882 1st Qu.: 182885
#> Median : 43.00 Median :0.24861 Median :0.3931 Median : 220462
#> Mean : 41.14 Mean :0.29748 Mean :0.3746 Mean : 235057
#> 3rd Qu.: 55.00 3rd Qu.:0.51944 3rd Qu.:0.4813 3rd Qu.: 265700
#> Max. :100.00 Max. :0.69375 Max. :0.6937 Max. :5552917
#> energy instrumentalness liveness loudness
#> Min. :0.0000203 Min. :0.0000000 Min. :0.00967 Min. :-52.457
#> 1st Qu.:0.2583333 1st Qu.:0.0000000 1st Qu.:0.09097 1st Qu.:-11.750
#> Median :0.4145833 Median :0.0000448 Median :0.17847 Median : -7.754
#> Mean :0.3921543 Mean :0.1125078 Mean :0.26875 Mean : -9.559
#> 3rd Qu.:0.5430556 3rd Qu.:0.0965278 3rd Qu.:0.46000 3rd Qu.: -5.500
#> Max. :0.6937500 Max. :0.6937500 Max. :1.00000 Max. : 3.744
#> speechiness tempo valence
#> Min. :0.0100 Min. : 30.38 Min. :0.0000
#> 1st Qu.:0.1868 1st Qu.: 92.98 1st Qu.:0.2028
#> Median :0.2556 Median :115.90 Median :0.3271
#> Mean :0.2796 Mean :117.72 Mean :0.3363
#> 3rd Qu.:0.3535 3rd Qu.:139.09 3rd Qu.:0.4701
#> Max. :0.6937 Max. :242.90 Max. :1.0000
The variables cannot be treated as being on the same scale, because their value ranges differ significantly.
cov(spotify_num)
#> popularity acousticness danceability
#> popularity 330.76086683 -1.190446167 0.47471461926
#> acousticness -1.19044617 0.053760063 -0.00561330404
#> danceability 0.47471462 -0.005613304 0.02069178987
#> duration_ms 4056.56339317 308.016447290 -1310.09048668792
#> energy 0.50851660 -0.017069707 0.00325540905
#> instrumentalness -0.75464087 0.010285226 -0.00665807172
#> liveness -0.26689974 0.001389778 -0.00002348215
#> loudness 39.51673087 -0.758403068 0.24583248509
#> speechiness -0.35347937 0.002220435 0.00006632012
#> duration_ms energy instrumentalness
#> popularity 4056.56339 0.508516600 -0.7546408728
#> acousticness 308.01645 -0.017069707 0.0102852258
#> danceability -1310.09049 0.003255409 -0.0066580717
#> duration_ms 14004813224.70875 -317.316117845 1951.7082595447
#> energy -317.31612 0.033241566 -0.0081901969
#> instrumentalness 1951.70826 -0.008190197 0.0445239410
#> liveness 923.60011 0.001440206 -0.0006159730
#> loudness -33344.80557 0.575879063 -0.5881938502
#> speechiness 3.02723 0.001461263 -0.0008750963
#> liveness loudness speechiness
#> popularity -0.26689974249 39.51673087 -0.35347937342
#> acousticness 0.00138977779 -0.75840307 0.00222043540
#> danceability -0.00002348215 0.24583249 0.00006632012
#> duration_ms 923.60010683694 -33344.80556920 3.02723018453
#> energy 0.00144020583 0.57587906 0.00146126272
#> instrumentalness -0.00061597301 -0.58819385 -0.00087509631
#> liveness 0.04298743496 -0.03194345 0.00290565235
#> loudness -0.03194345349 35.83298005 -0.04674082649
#> speechiness 0.00290565235 -0.04674083 0.02456972282
#> tempo valence
#> popularity 45.24597101 -0.0507765166
#> acousticness -1.44253956 -0.0045151196
#> danceability 0.01146579 0.0049633824
#> duration_ms -100305.45301324 -1656.3801618383
#> energy 0.88294007 0.0047481783
#> instrumentalness -0.58171172 -0.0038253664
#> liveness -0.19860434 0.0004865074
#> loudness 41.96913774 0.1722833173
#> speechiness -0.15978976 0.0002235817
#> [ reached getOption("max.print") -- omitted 2 rows ]
The variances are not in the same range either, especially for duration_ms.
ggcorr(spotify_num,
       label = TRUE,
       label_size = 3,
       hjust = 0.9)
plot(prcomp(spotify_num))
We have to scale the data because PCA is driven by the variance of each variable.
spotify_scale <- scale(spotify_num)
summary(spotify_scale)
#> popularity acousticness danceability duration_ms
#> Min. :-2.2620 Min. :-1.2830 Min. :-2.5344 Min. :-1.8562
#> 1st Qu.:-0.6675 1st Qu.:-0.8907 1st Qu.:-0.6004 1st Qu.:-0.4409
#> Median : 0.1023 Median :-0.2108 Median : 0.1285 Median :-0.1233
#> Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
#> 3rd Qu.: 0.7621 3rd Qu.: 0.9573 3rd Qu.: 0.7416 3rd Qu.: 0.2589
#> Max. : 3.2365 Max. : 1.7091 Max. : 2.2189 Max. :44.9364
#> energy instrumentalness liveness loudness
#> Min. :-2.1508 Min. :-0.53319 Min. :-1.2496 Min. :-7.1662
#> 1st Qu.:-0.7340 1st Qu.:-0.53319 1st Qu.:-0.8574 1st Qu.:-0.3660
#> Median : 0.1230 Median :-0.53298 Median :-0.4354 Median : 0.3016
#> Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
#> 3rd Qu.: 0.8277 3rd Qu.:-0.07573 3rd Qu.: 0.9224 3rd Qu.: 0.6781
#> Max. : 1.6542 Max. : 2.75461 Max. : 3.5269 Max. : 2.2224
#> speechiness tempo valence
#> Min. :-1.7200 Min. :-2.82739 Min. :-1.97606
#> 1st Qu.:-0.5920 1st Qu.:-0.80092 1st Qu.:-0.78441
#> Median :-0.1534 Median :-0.05888 Median :-0.05391
#> Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
#> 3rd Qu.: 0.4713 3rd Qu.: 0.69175 3rd Qu.: 0.78678
#> Max. : 2.6421 Max. : 4.05258 Max. : 3.90058
Check the plot of how much variance each PC explains when using the scaled data.
plot(prcomp(spotify_scale))
After scaling, the bias in PCA toward large-variance variables is handled.
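The effect of scaling can be illustrated on synthetic data. The sketch below is not the Spotify data: `toy` is a made-up two-column data frame in which one column mimics a 0 to 1 audio feature and the other mimics a duration in milliseconds. The means and standard deviations are assumptions chosen only to resemble the summaries above.

```r
set.seed(1)
n <- 200

# Two independent synthetic variables on very different scales (assumed values).
toy <- data.frame(
  small = rnorm(n, mean = 0.4, sd = 0.2),        # roughly a 0-1 audio feature
  big   = rnorm(n, mean = 235000, sd = 118000)   # roughly a duration in ms
)

# Share of total variance captured by the first principal component.
pc1_share <- function(fit) fit$sdev[1]^2 / sum(fit$sdev^2)

pc1_raw    <- pc1_share(prcomp(toy))                 # unscaled
pc1_scaled <- pc1_share(prcomp(toy, scale. = TRUE))  # scaled to unit variance

round(c(raw = pc1_raw, scaled = pc1_scaled), 3)
```

Without scaling, PC1 is almost entirely the large-variance column. With scaling, the two independent variables share the variance far more evenly, which is the behavior we want before clustering and PCA.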
fviz_nbclust(
  x = spotify_scale %>% head(1000),
  FUNcluster = kmeans,
  method = "wss"
)
fviz_nbclust(
  x = spotify_scale %>% head(1000),
  FUNcluster = kmeans,
  method = "silhouette"
)
fviz_nbclust(
  x = spotify_scale %>% head(1000),
  FUNcluster = kmeans,
  method = "gap_stat"
)
From the plots, we can see that 3 is the optimum number of clusters (k). Beyond k = 3, increasing k yields neither a considerable decrease in the total within-cluster sum of squares (strong internal cohesion) nor a considerable increase in the between-cluster sum of squares and the between/total sum of squares ratio (maximum external separation).
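The elbow reasoning can be reproduced by hand with base `kmeans()`. This is a minimal sketch on synthetic data: `toy` contains three made-up, well-separated groups, so the expected elbow at k = 3 is a property of the toy data, not a result from the Spotify data.

```r
set.seed(42)

# Three well-separated synthetic groups in two dimensions (assumed data).
toy <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 6),  ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6.
# nstart = 10 reruns k-means from several random starts and keeps the best fit.
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# Decrease in WSS gained by each additional cluster.
drops <- -diff(wss)
names(drops) <- paste0(1:5, "->", 2:6)

round(wss, 1)
round(drops, 1)
```

The elbow is the point after which the entries of `drops` become small: adding more clusters past it barely improves cohesion.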
RNGkind(sample.kind = "Rounding")
set.seed(123)
spotify_k <- kmeans(spotify_scale %>% head(1000), centers = 3)
# result
spotify_k
#> K-means clustering with 3 clusters of sizes 291, 497, 212
#>
#> Cluster means:
#> popularity acousticness danceability duration_ms energy instrumentalness
#> 1 -2.0126152 0.9636559 -0.7361089 -0.3300738 -0.81199139 -0.1211050
#> 2 0.5085631 -0.1668230 0.1885333 -0.1443190 0.05033251 -0.4679869
#> 3 -2.1435023 0.1285317 0.4160414 0.7054267 0.00977984 -0.4391777
#> liveness loudness speechiness tempo valence
#> 1 -0.29269442 -0.75467594 -0.1794143 -0.29759088 -0.41351755
#> 2 -0.04202170 0.36277495 -0.1289156 0.11898597 0.03433042
#> 3 0.01044287 0.03461486 0.3139956 0.09357227 0.91782518
#>
#> Clustering vector:
#> [1] 3 3 1 1 1 1 3 1 3 1 1 1 3 1 3 3 1 3 1 3 3 3 1 3 3 3 3 3 1 1 3 3 3 3 1 1 3
#> [38] 3 1 1 3 3 1 3 3 3 1 3 1 1 1 1 1 1 1 3 1 3 1 3 3 3 1 1 3 3 1 1 1 1 1 3 1 3
#> [75] 1 3 3 3 1 1 1 3 1 3 3 1 3 1 3 3 1 3 3 1 1 3 1 1 3 3 1
#> [ reached getOption("max.print") -- omitted 899 entries ]
#>
#> Within cluster sum of squares by cluster:
#> [1] 2308.473 3150.552 5006.065
#> (between_SS / total_SS = 22.1 %)
#>
#> Available components:
#>
#> [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
#> [6] "betweenss" "size" "iter" "ifault"
spotify_clustered <- spotify_c %>%
  head(1000) %>%
  bind_cols(cluster = as.factor(spotify_k$cluster)) %>%
  select(cluster, 1:18)
spotify_clustered
fviz_cluster(object = spotify_k,
             data = spotify_scale %>% head(1000))
spotify_clustered %>%
  group_by(cluster) %>%
  summarise_all(.funs = "mean") %>%
  select(where(~all(!is.na(.)))) %>%
  mutate_if(is.numeric, .funs = "round", digits = 2)
Based on this information:
- Cluster 1: music in this cluster is much more acoustic and more instrumental than in the other clusters, but it is not danceable, has a shorter duration, is not energetic, and is somewhat mellow judging by its loudness and tempo. This suggests that the first cluster is a slow instrumental or acoustic type of music.
- Cluster 2: has the highest popularity of the three clusters. Its tracks are the loudest, with the highest tempo and energy, and they are fairly danceable. This suggests music that is commonly played at parties, when people are having fun, or at live concerts: the kind of music that hypes up the listener and is likely popular among teenagers.
- Cluster 3: these songs are less popular but high in danceability, long in duration, and fairly energetic. Speechiness is also the highest of the three clusters. This suggests that these songs are rap, mixes, or DJ material. The high valence supports this, as it points to very upbeat songs.
Here we run PCA on the dataset. We will look at the eigenvalues and the percentage of variance explained by each dimension. The eigenvalues measure the amount of variation retained by each principal component. Eigenvalues are large for the first PCs and small for the subsequent PCs. That is, the first PCs correspond to the directions with the maximum amount of variation in the data set.
spotify_pca <- PCA(spotify_c %>% select(-c(artist_name, track_name, track_id, key, time_signature, mode)) %>% head(1000),
                   scale.unit = T,
                   quali.sup = "genres",
                   ncp = 11,
                   graph = F)
summary(spotify_pca)
#>
#> Call:
#> PCA(X = spotify_c %>% select(-c(artist_name, track_name, track_id,
#> key, time_signature, mode)) %>% head(1000), scale.unit = T,
#> ncp = 11, quali.sup = "genres", graph = F)
#>
#>
#> Eigenvalues
#> Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
#> Variance 2.610 1.274 1.127 0.997 0.928 0.879 0.866
#> % of var. 23.730 11.579 10.241 9.061 8.439 7.989 7.869
#> Cumulative % of var. 23.730 35.309 45.550 54.611 63.050 71.039 78.908
#> Dim.8 Dim.9 Dim.10 Dim.11
#> Variance 0.770 0.596 0.544 0.410
#> % of var. 6.996 5.419 4.945 3.731
#> Cumulative % of var. 85.904 91.323 96.269 100.000
#>
#> Individuals (the 10 first)
#> Dist Dim.1 ctr cos2 Dim.2 ctr cos2
#> 1 | 3.536 | 0.058 0.000 0.000 | 0.627 0.031 0.031 |
#> 2 | 4.100 | 2.000 0.153 0.238 | 2.105 0.348 0.264 |
#> 3 | 2.868 | -2.033 0.158 0.503 | -0.007 0.000 0.000 |
#> 4 | 3.605 | -1.142 0.050 0.100 | -0.082 0.001 0.001 |
#> 5 | 3.512 | -1.678 0.108 0.228 | -0.133 0.001 0.001 |
#> 6 | 3.143 | -0.746 0.021 0.056 | -0.547 0.024 0.030 |
#> Dim.3 ctr cos2
#> 1 -1.346 0.161 0.145 |
#> 2 -2.061 0.377 0.253 |
#> 3 0.125 0.001 0.002 |
#> 4 0.163 0.002 0.002 |
#> 5 -1.312 0.153 0.140 |
#> 6 -0.657 0.038 0.044 |
#> [ reached getOption("max.print") -- omitted 4 rows ]
#>
#> Variables (the 10 first)
#> Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
#> popularity | 0.608 14.157 0.370 | -0.356 9.931 0.126 | 0.348 10.779
#> acousticness | -0.688 18.155 0.474 | 0.107 0.899 0.011 | -0.045 0.182
#> danceability | 0.470 8.445 0.220 | 0.346 9.392 0.120 | 0.230 4.703
#> duration_ms | -0.094 0.339 0.009 | 0.369 10.663 0.136 | 0.518 23.860
#> energy | 0.671 17.243 0.450 | 0.045 0.159 0.002 | -0.296 7.759
#> instrumentalness | -0.446 7.635 0.199 | -0.050 0.194 0.002 | -0.171 2.595
#> liveness | 0.078 0.230 0.006 | 0.323 8.190 0.104 | 0.474 19.986
#> cos2
#> popularity 0.121 |
#> acousticness 0.002 |
#> danceability 0.053 |
#> duration_ms 0.269 |
#> energy 0.087 |
#> instrumentalness 0.029 |
#> liveness 0.225 |
#> [ reached getOption("max.print") -- omitted 3 rows ]
#>
#> Supplementary categories
#> Dist Dim.1 cos2 v.test Dim.2 cos2 v.test
#> A Capella | 1.570 | -1.379 0.771 -9.582 | -0.030 0.000 -0.296
#> Alternative | 2.154 | 1.464 0.462 3.767 | -0.787 0.134 -2.899
#> Country | 1.023 | 0.928 0.823 13.324 | -0.339 0.110 -6.961
#> Movie | 1.221 | -0.921 0.569 -14.193 | 0.512 0.176 11.291
#> R&B | 1.692 | 1.141 0.454 8.927 | -0.440 0.068 -4.930
#> Dim.3 cos2 v.test
#> A Capella | -0.221 0.020 -2.336 |
#> Alternative | 0.616 0.082 2.414 |
#> Country | 0.165 0.026 3.616 |
#> Movie | -0.355 0.085 -8.340 |
#> R&B | 0.670 0.157 7.985 |
Case: we want to reduce the dimension of the data while keeping as much of the information as possible. Judging by the cumulative percentage of variance, PC1 to PC7 retain about 79% of the information; keeping at least 90% would require PC1 to PC9 (91.3%).
fviz_eig(spotify_pca,
         ncp = 11,
         addlabels = T)
23.7 + 11.6 + 10.2 + 9.1
#> [1] 54.6
More than 50% of the variance (54.6%) can be explained using only the first 4 dimensions, and the first dimension alone explains 23.7% of the total variance.
We can keep about 79% of the information (close to 80%) by using only 7 dimensions. That is a 36% dimensionality reduction from the original data: we can reduce the number of numeric features in our dataset from 11 to just 7.
We can extract the values of PC1 to PC7 for all of the observations and put them into a new data frame. This data frame can later be analyzed with a supervised classification technique or used for other purposes.
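The variance thresholds can be checked directly from the eigenvalues printed by `summary(spotify_pca)`. The sketch below copies those rounded eigenvalues by hand; `n_for()` is a hypothetical helper introduced here for illustration, not part of FactoMineR.

```r
# Eigenvalues (variance per dimension) as printed in the PCA summary above.
eig <- c(2.610, 1.274, 1.127, 0.997, 0.928, 0.879, 0.866,
         0.770, 0.596, 0.544, 0.410)

# Cumulative percentage of variance explained.
cum_pct <- cumsum(eig) / sum(eig) * 100

# Hypothetical helper: smallest number of dimensions reaching a threshold (in %).
n_for <- function(threshold) which(cum_pct >= threshold)[1]

round(cum_pct, 1)
c(`50%` = n_for(50), `80%` = n_for(80), `90%` = n_for(90))
```

With these eigenvalues, 4 dimensions pass 50%, 8 are needed to pass 80%, and 9 to pass 90%, while 7 dimensions stop just short of 80%.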
df_pca <- data.frame(spotify_pca$ind$coord[, 1:7]) %>%
  bind_cols(cluster = as.factor(spotify_k$cluster)) %>%
  select(cluster, 1:7)
df_pca
fviz_pca_ind(spotify_pca,
             habillage = "genres",
             addEllipses = T)
From the plot above we can gather some information:
- Music in the Movie genre dominates the PC1 axis; its ellipse (blue) is the largest in the plot.
- The other genres, such as Country, R&B, and Alternative, appear in roughly similar amounts.

The Movie tracks at rows 448, 430, and 92 look like outliers.
fviz_pca_var(spotify_pca,
             select.var = list(contrib = 11),
             col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)
The plot displays the distribution of variables within a circle, suggesting that more than two components are necessary for a perfect representation of our data. The distance of each variable from the origin reflects its significance on the factor map. Additionally, variable contributions are indicated by the color, represented as percentages. Variables strongly correlated with PC1 and PC2 play a vital role in explaining data variability, while those weakly correlated or associated with later dimensions have low contribution and could be excluded for a more straightforward analysis.
To assess variable representation quality, we can examine the cos2 of each variable. A high cos2 implies a favorable representation on the principal component, indicated by proximity to the circumference of the correlation circle. Conversely, a low cos2 suggests that the variable is not well-represented by the principal components, evident when it is positioned closer to the center of the circle.
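For a scaled PCA, a variable's cos2 on a dimension is the square of its coordinate on that dimension, which equals its correlation with the component. The sketch below recomputes cos2 from the Dim.1 and Dim.2 coordinates printed in the PCA summary. The values are copied by hand and already rounded, so small differences in the third decimal are expected.

```r
# Variable coordinates on Dim.1 and Dim.2, copied from the PCA summary output.
coord_dim1 <- c(popularity = 0.608, acousticness = -0.688,
                danceability = 0.470, energy = 0.671)
coord_dim2 <- c(popularity = -0.356, acousticness = 0.107,
                danceability = 0.346, energy = 0.045)

# cos2 on each dimension is the squared coordinate.
cos2_dim1 <- coord_dim1^2
cos2_dim2 <- coord_dim2^2

# Quality of representation on the Dim.1-Dim.2 plane is the sum of both.
cos2_plane <- cos2_dim1 + cos2_dim2

round(rbind(cos2_dim1, cos2_dim2, cos2_plane), 3)
```

A variable with a `cos2_plane` near 1 would sit close to the circumference of the correlation circle; here even the best-represented variables reach only about 0.5, consistent with needing more than two components.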
fviz_cos2(spotify_pca,
          choice = "var",
          fill = "cos2") +
  scale_fill_viridis_c(option = "B") +
  theme(legend.position = "top")
The variables that contribute most are loudness, acousticness, energy, and popularity. Some variables contribute less, such as speechiness, liveness, and duration_ms. speechiness has the lowest correlation with the principal components.
ggRadar(data = spotify_clustered,
        aes(colour = cluster),
        interactive = T)
From the plot above, we can see which variables each cluster leans toward.
We want to make recommendations to users who listen to the Dragon Ball GT soundtrack. First, let's look at the data.
spotify_clustered %>%
  filter(track_name == "Dragon Ball GT")
From the output above, we can see that Dragon Ball GT is in cluster 3, in the Movie genre, with Bernard Minet as the artist and a Major mode.
From that information we can pick some recommendations for the listener, as shown in the result below.
spotify_clustered %>%
  filter(cluster == 3 & genres == "Movie" & artist_name == "Bernard Minet" & mode == "Major")
The result shows 13 recommendations that are similar to Dragon Ball GT.
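The same filtering logic can be wrapped in a small reusable function. This is only a sketch in base R on made-up data: `recommend_similar()`, the `tracks` data frame, and its song and artist names are all hypothetical, and the choice of matching columns simply mirrors the filter above.

```r
# Made-up tracks with the columns used by the filter above.
tracks <- data.frame(
  track_name  = c("Seed Song", "Song A", "Song B", "Song C"),
  artist_name = c("Artist X", "Artist X", "Artist Y", "Artist X"),
  genres      = c("Movie", "Movie", "Movie", "Movie"),
  mode        = c("Major", "Major", "Major", "Minor"),
  cluster     = factor(c(3, 3, 3, 1)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return tracks that match the seed track on every key column.
recommend_similar <- function(data, seed_name,
                              keys = c("cluster", "genres", "artist_name", "mode")) {
  seed_rows <- data[data$track_name == seed_name, , drop = FALSE]
  if (nrow(seed_rows) == 0) return(data[0, , drop = FALSE])  # unknown seed track
  seed <- seed_rows[1, ]

  same <- rep(TRUE, nrow(data))
  for (key in keys) {
    same <- same & as.character(data[[key]]) == as.character(seed[[key]])
  }
  # Exclude the seed track itself from its own recommendations.
  data[same & data$track_name != seed_name, , drop = FALSE]
}

recs <- recommend_similar(tracks, "Seed Song")
recs$track_name
```

Unlike the plain filter, this helper leaves the seed track out of the result, and passing a shorter `keys` vector (for example only `"cluster"` and `"genres"`) loosens the match when an artist has too few tracks.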
We can draw some conclusions about our dataset based on the preceding clustering and principal component analysis: