This is an unsupervised learning analysis of spotify tracks dataset which can be downloaded at:
https://www.kaggle.com/datasets/zaheenhamidani/ultimate-spotify-tracks-db
The dataset is composed of tracks characterized by the following musical measures :
popularity, acousticness, danceability, duration_ms, energy, instrumentalness, key, liveness, loudnes, mode, speechiness, tempo, time_signature, valence.
The dataset have no target label variable and so are analyzes using Principal Component Analysis (PCA) and then visualized using K-Means clustering. Once the number of k clusters have been decided upon we are able to classify each row element in the dataset in terms of a cluster grouping.
Invoke the required libraries.
library(tm)
library(tidyverse)
library(lubridate)
library(cluster)
library(ggforce)
library(GGally)
library(scales)
library(cowplot)
library(plotly)
library(FactoMineR)
library(factoextra)
library(dplyr)
library(ggcorrplot)
options(scipen = 100, max.print = 101)
Import the dataset and use glimpse() to view the data fields’ information.
data <- read.csv("SpotifyFeatures.csv")
glimpse(data)
## Rows: 232,725
## Columns: 20
## $ genre <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie",…
## $ artist_name <chr> "Henri Salvador", "Martin & les fées", "Joseph Willia…
## $ track_name <chr> "C'est beau de faire un Show", "Perdu d'avance (par G…
## $ track_id <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "…
## $ popularity <chr> "0", "1", "3", "0", "4", "0", "2", "15", "0", "10", "…
## $ acousticness <chr> "0.611", "0.246", "0.952", "0.703", "0.95", "0.749", …
## $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
## $ duration_ms <dbl> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
## $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
## $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123…
## $ key <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G…
## $ liveness <chr> "0.346", "0.151", "0.103", "0.0985", "0.202", "0.107"…
## $ loudness <chr> "-1.828", "-5.559", "-13.879", "-12.178", "-21.15", "…
## $ mode <chr> "Major", "Minor", "Minor", "Major", "Major", "Major",…
## $ speechiness <chr> "0.0525", "0.0868", "0.0362", "0.0395", "0.0456", "0.…
## $ tempo <chr> "166.969", "174.003", "99.488", "171.758", "140.576",…
## $ time_signature <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/4…
## $ valence <chr> "0.814", "0.816", "0.368", "0.227", "0.39", "0.358", …
## $ X <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "…
## $ X.1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
Check the unique values of data$x.
unique(data$x)
## NULL
Check the unique values of data$x.1.
unique(data$x.1)
## NULL
Check the unique values of instrumentalness.
unique(data$instrumentalness)
## [1] 0.00000000 0.12300000 0.00086000 0.00125000 0.52900000 0.88700000
## [7] 0.07290000 0.93300000 0.00000725 0.04220000 0.00000813 0.00032800
## [13] 0.00110000 0.00200000 0.00000127 0.91900000 0.00000832 0.97600000
## [19] 0.00004900 0.00073200 0.00068300 0.00002940 0.06080000 0.89100000
## [25] 0.04180000 0.02860000 0.00001660 0.01170000 0.00002440 0.01060000
## [31] 0.00018400 0.00008630 0.00000900 0.00016900 0.00445000 0.85500000
## [37] 0.00004810 0.00038100 0.00000603 0.00003550 0.00002950 0.00000316
## [43] 0.00007440 0.33100000 0.95900000 0.00007550 0.00000805 0.00000295
## [49] 0.86100000 0.03270000 0.22900000 0.00000655 0.00000140 0.13900000
## [55] 0.32700000 0.00051200 0.00061400 0.88100000 0.78500000 0.00000859
## [61] 0.92600000 0.14500000 0.00000503 0.00000395 0.00243000 0.00013000
## [67] 0.00020800 0.00000833 0.00000807 0.01820000 0.00000927 0.00013100
## [73] 0.00006590 0.00050100 0.09980000 0.00000761 0.00000141 0.00012700
## [79] 0.01610000 0.00001140 0.00000515 0.00000210 0.00000378 0.25400000
## [85] 0.00020000 0.01840000 0.00002880 0.00000588 0.00001780 0.00018900
## [91] 0.00001190 0.00132000 0.00000865 0.00016100 0.00005070 0.00003030
## [97] 0.00000108 0.00452000 0.00000139 0.01260000 0.00131000
## [ reached getOption("max.print") -- omitted 5309 entries ]
Check the unique values of acousticness.
unique(data$acousticness)
## [1] "0.611" "0.246" "0.952" "0.703" "0.95" "0.749" "0.344"
## [8] "0.939" "0.00104" "0.319" "0.921" "0.0383" "0.215" "0.958"
## [15] "0.97" "0.548" "0.7" "0.488" "0.381" "0.161" "0.852"
## [22] "0.513" "0.689" "0.669" "0.706" "0.882" "0.159" "0.864"
## [29] "0.716" "0.184" "0.00323" "0.305" "0.922" "0.942" "0.123"
## [36] "0.767" "0.164" "0.619" "0.932" "0.508" "0.649" "0.983"
## [43] "0.934" "0.576" "0.751" "0.614" "0.0287" "0.285" "0.659"
## [50] "0.581" "0.902" "0.386" "0.712" "0.917" "0.924" "0.967"
## [57] "0.989" "0.728" "0.871" "0.126" "0.929" "0.455" "0.733"
## [64] "0.71" "0.0575" "0.667" "0.791" "0.334" "0.933" "0.48"
## [71] "0.615" "0.979" "0.695" "0.756" "0.91" "0.776" "0.715"
## [78] "0.778" "0.398" "0.761" "0.616" "0.962" "0.525" "0.876"
## [85] "0.847" "0.66" "0.366" "0.357" "0.813" "0.345" "0.79"
## [92] "0.0367" "0.744" "0" "0.844" "0.987" "0.414" "0.636"
## [99] "0.951" "0.974" "0.523"
## [ reached getOption("max.print") -- omitted 4686 entries ]
Check the unique values of popularity.
unique(data$popularity)
## [1] "0" "1" "3"
## [4] "4" "2" "15"
## [7] "10" "8" "5"
## [10] "6" "7" "11"
## [13] "3NXlNZSmjO3DsJ3DQuyU8e" "65" "63"
## [16] "62" "61" "68"
## [19] "64" "66" "60"
## [22] "69" "71" "76"
## [25] "67" "70" "72"
## [28] "57" "59" "56"
## [31] "28" "31" "74"
## [34] "55" "53" "9"
## [37] "13" "23" "15hzCoAhSiWk8juLz5Jwut"
## [40] "12" "4Er5ftgKkMeKg07YPLVOd6" "7Ib5UBlT8wh2RuwbMkKRb5"
## [43] "44" "33" "25"
## [46] "26" "24" "22"
## [49] "20" "19" "18"
## [52] "16" "17" "14"
## [55] "83" "81" "73"
## [58] "78" "77" "75"
## [61] "45" "42" "46"
## [64] "54" "41" "52"
## [67] "58" "51" "43"
## [70] "47" "48" "40"
## [73] "50" "49" "39"
## [76] "80" "37" "35"
## [79] "21" "38" "36"
## [82] "29" "7gNbidsHK16wvnuK2VgaVc" "34"
## [85] "32" "99" "100"
## [88] "97" "92" "91"
## [91] "95" "90" "93"
## [94] "88" "87" "89"
## [97] "96" "86" "85"
## [100] "84" "94"
## [ reached getOption("max.print") -- omitted 103 entries ]
Find non numeric data values in popularity and drop rows containing such values.
data <- transform(data[grep("^\\d+$", data$popularity),,drop=F], A= as.numeric(as.character(popularity)))
Check the unique values of popularity.
unique(data$popularity)
## [1] "0" "1" "3" "4" "2" "15" "10" "8" "5" "6" "7" "11"
## [13] "65" "63" "62" "61" "68" "64" "66" "60" "69" "71" "76" "67"
## [25] "70" "72" "57" "59" "56" "28" "31" "74" "55" "53" "9" "13"
## [37] "23" "12" "44" "33" "25" "26" "24" "22" "20" "19" "18" "16"
## [49] "17" "14" "83" "81" "73" "78" "77" "75" "45" "42" "46" "54"
## [61] "41" "52" "58" "51" "43" "47" "48" "40" "50" "49" "39" "80"
## [73] "37" "35" "21" "38" "36" "29" "34" "32" "99" "100" "97" "92"
## [85] "91" "95" "90" "93" "88" "87" "89" "96" "86" "85" "84" "94"
## [97] "82" "79" "27" "30" "98"
Check numeric values of new data$A.
unique(data$A)
## [1] 0 1 3 4 2 15 10 8 5 6 7 11 65 63 62 61 68 64
## [19] 66 60 69 71 76 67 70 72 57 59 56 28 31 74 55 53 9 13
## [37] 23 12 44 33 25 26 24 22 20 19 18 16 17 14 83 81 73 78
## [55] 77 75 45 42 46 54 41 52 58 51 43 47 48 40 50 49 39 80
## [73] 37 35 21 38 36 29 34 32 99 100 97 92 91 95 90 93 88 87
## [91] 89 96 86 85 84 94 82 79 27 30 98
Convert key, mode and time_signature to factor type.
data<-data %>%
mutate(across(c(key,
mode,time_signature),
factor))
Convert key, acousticness, liveness, popularity, loudness, speechiness, tempo, and valence to numeric type.
data <- data %>% mutate_at(c('key','acousticness', 'liveness','popularity','loudness','speechiness','tempo', 'valence'), as.numeric)
View structure of transformed data.
glimpse(data)
## Rows: 232,603
## Columns: 21
## $ genre <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie",…
## $ artist_name <chr> "Henri Salvador", "Martin & les fées", "Joseph Willia…
## $ track_name <chr> "C'est beau de faire un Show", "Perdu d'avance (par G…
## $ track_id <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "…
## $ popularity <dbl> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
## $ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,…
## $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
## $ duration_ms <dbl> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
## $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
## $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123…
## $ key <dbl> 5, 10, 4, 5, 9, 5, 5, 10, 4, 11, 8, 4, 10, 7, 11, 12,…
## $ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105…
## $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
## $ mode <fct> Major, Minor, Minor, Major, Major, Major, Major, Majo…
## $ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953…
## $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
## $ time_signature <fct> 4/4, 4/4, 5/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4…
## $ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533…
## $ X <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "…
## $ X.1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ A <dbl> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
Check for missing values.
colSums(is.na(data))
## genre artist_name track_name track_id
## 0 0 0 0
## popularity acousticness danceability duration_ms
## 0 0 0 0
## energy instrumentalness key liveness
## 0 0 0 0
## loudness mode speechiness tempo
## 0 0 0 0
## time_signature valence X X.1
## 0 0 0 232603
## A
## 0
Remove non meaningful columns.
data_cleaned <- data %>% select(-c(X, X.1, A, track_id))
Check number of rows and display the transformed and cleaned data.
nrow(data_cleaned)
## [1] 232603
rmarkdown::paged_table(data_cleaned )
We drop columns with non meaningful values for Principle Component Analysis clusterization. These are artist_name, track_name, genre, mode, time_signature; and convert “key” column into factor type.
spotify_reduced <- data_cleaned %>% select(-c(artist_name, track_name, genre, mode, time_signature))
spotify_reduced$key <- as.factor(data_cleaned$key)
Scale the chosen data columns.
spotify_reduced.scaled <- scale(spotify_reduced[, c('energy', 'liveness', 'tempo', 'speechiness' , 'acousticness', 'instrumentalness', 'danceability' , 'duration_ms' ,'loudness', 'valence')])
View the scaled data.
spotify_reduced.scaled
## energy liveness tempo speechiness acousticness
## 1 1.28677820573 0.66067044236 1.59553078739 -0.368026641 0.683600143
## 2 0.63011056975 -0.32280635876 1.82317135659 -0.183147727 -0.345290172
## 3 -1.67012403954 -0.56489295596 -0.58834940172 -0.455884551 1.644837396
## 4 -0.92994953655 -0.58758857445 1.75051666824 -0.438097367 0.942936880
## 5 -1.31332197143 -0.06558934924 0.74137702809 -0.405218026 1.639199641
## 6 -1.80753079343 -0.54471907286 -0.97699534934 0.119773409 1.072605248
## 7 -1.14251247074 -0.55480601441 -1.12605882129 4.485718606 -0.069040170
## 8 -1.14630823742 -0.51445824821 -0.67446705237 -0.496848975 1.608191987
## 9 -0.34160570084 -0.69854493150 0.23988168258 -0.403062004 -1.035802423
## 10 0.50864603592 0.67580085469 0.64169932648 -0.499544003 -0.139512109
## instrumentalness danceability duration_ms loudness
## 1 -0.489868378 -0.890999212 -1.14294225406 1.290537266908
## 2 -0.489868378 0.191875596 -0.82295591648 0.668520702628
## 3 -0.489868378 0.585158487 -0.54596563785 -0.718554562664
## 4 -0.489868378 -1.693727303 -0.69619080264 -0.434971025854
## 5 -0.083649263 -1.203470549 -1.28397202200 -1.930744977801
## 6 -0.489868378 0.127226354 -0.62714111927 -0.900441475457
## 7 -0.489868378 0.800655961 -0.19207758987 -0.517828767062
## 8 -0.489868378 -0.745538417 0.04179874012 0.103354218332
## 9 -0.487028147 0.967666503 -0.07497101101 0.307414329476
## 10 -0.485740135 0.234975091 -0.69394247759 0.296577803966
## valence
## 1 1.3809857490
## 2 1.3886766187
## 3 -0.3340781846
## 4 -0.8762844954
## 5 -0.2494786183
## 6 -0.3725325329
## 7 0.3004185622
## 8 -0.6955490585
## 9 1.1925594424
## 10 1.0118240055
## [ reached getOption("max.print") -- omitted 232593 rows ]
## attr(,"scaled:center")
## energy liveness tempo speechiness
## 0.5709965 0.2150048 117.6677546 0.1207788
## acousticness instrumentalness danceability duration_ms
## 0.3684921 0.1483283 0.5543846 235103.1876803
## loudness valence
## -9.5689426 0.4548766
## attr(,"scaled:scale")
## energy liveness tempo speechiness
## 0.2634514 0.1982762 30.8995889 0.1855268
## acousticness instrumentalness danceability duration_ms
## 0.3547511 0.3027923 0.1856170 118755.0702572
## loudness valence
## 5.9982325 0.2600486
View a brief statistical summary of dataset.
summary(spotify_reduced.scaled)
## energy liveness tempo speechiness
## Min. :-2.1673 Min. :-1.0356 Min. :-2.82492 Min. :-0.53134
## 1st Qu.:-0.7041 1st Qu.:-0.5931 1st Qu.:-0.79961 1st Qu.:-0.45319
## Median : 0.1291 Median :-0.4388 Median :-0.06113 Median :-0.38096
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.8199 3rd Qu.: 0.2471 3rd Qu.: 0.69241 3rd Qu.:-0.08505
## Max. : 1.6246 Max. : 3.9591 Max. : 4.05297 Max. : 4.56118
## acousticness instrumentalness danceability duration_ms
## Min. :-1.0387 Min. :-0.4899 Min. :-2.68017 Min. :-1.8502
## 1st Qu.:-0.9330 1st Qu.:-0.4899 1st Qu.:-0.64318 1st Qu.:-0.4399
## Median :-0.3848 Median :-0.4897 Median : 0.08951 Median :-0.1236
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.9965 3rd Qu.:-0.3716 3rd Qu.: 0.74139 3rd Qu.: 0.2581
## Max. : 1.7689 Max. : 2.8094 Max. : 2.34146 Max. :44.7797
## loudness valence
## Min. :-7.1501 Min. :-1.74920
## 1st Qu.:-0.3670 1st Qu.:-0.83783
## Median : 0.3014 Median :-0.04183
## Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6782 3rd Qu.: 0.78879
## Max. : 2.2195 Max. : 2.09624
Check for correlation between columns in dataset using cor().
corr_matrix <- cor(spotify_reduced.scaled)
ggcorrplot(corr_matrix)
We will use princomp() and prcomp() to generate new dimensions to describe the dataset. The princomp() function uses spectral decomposition of the variables (which yields eigenvalues and eigenvectors) to examine the covariances or the correlations, whereas prcomp() uses singular value decomposition of individual data points to examine the covariances or the correlations.
Check importance of each newly computed components using princomp(). Note that the parameter “cor=TRUE” is required if results with prcomp() are to be the same.
data.pca <- princomp(spotify_reduced.scaled, cor=TRUE)
summary(data.pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.8598987 1.2736103 1.0805238 0.97557858 0.89115126
## Proportion of Variance 0.3459223 0.1622083 0.1167532 0.09517536 0.07941506
## Cumulative Proportion 0.3459223 0.5081307 0.6248838 0.72005918 0.79947423
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 0.83402081 0.72241907 0.62909636 0.52607089 0.33950675
## Proportion of Variance 0.06955907 0.05218893 0.03957622 0.02767506 0.01152648
## Cumulative Proportion 0.86903331 0.92122224 0.96079846 0.98847352 1.00000000
From the above process, we notice that 10 principal components have been generated (Comp.1 to Comp.10).
In the Cumulative Proportion section, the first principal component explains almost 65% of the total variance.
Check the loadings of the original data columns in these computed components.
data.pca$loadings[, 1:10]
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## energy 0.46288362 0.048539594 0.262962371 0.02410196 0.23315823
## liveness 0.05451613 0.641679889 0.228647374 0.14427840 0.07137532
## tempo 0.16182899 -0.206137689 0.316727458 0.67371223 -0.59241483
## speechiness 0.05355133 0.675961422 -0.009716966 0.06414449 -0.14133312
## acousticness -0.42027103 0.226232382 -0.225490866 0.06090180 -0.26438028
## instrumentalness -0.33290119 -0.173276562 0.086712133 0.04730942 0.04374225
## danceability 0.34159993 0.034058937 -0.461889968 -0.26365828 -0.29686399
## duration_ms -0.06634947 -0.002321853 0.579893204 -0.66220086 -0.46787149
## loudness 0.47466331 -0.062343730 0.161817820 -0.04173756 0.20751005
## valence 0.34621007 -0.014705015 -0.379462602 -0.06989459 -0.38433023
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## energy 0.22946948 0.02421827 0.27005340 0.14286895 0.716035365
## liveness 0.20752260 -0.43614997 -0.52092257 0.01047949 -0.048187408
## tempo -0.05790394 0.10561876 -0.10484265 0.02186972 0.013430821
## speechiness 0.03120918 0.54042965 0.40616092 -0.15627919 -0.183293004
## acousticness -0.13436722 -0.16894027 0.19459832 0.70479380 0.257955243
## instrumentalness 0.85631733 0.25492526 -0.04860984 0.17932361 -0.121224354
## danceability 0.06876644 0.38419642 -0.54825682 0.14141793 0.195557997
## duration_ms -0.01555523 -0.03386144 0.01948756 0.01152142 0.001206506
## loudness -0.07275085 0.05022084 0.05318060 0.62128597 -0.550619014
## valence 0.37195949 -0.51104191 0.36841180 -0.13984949 -0.168916196
Display a scree plot from data.pca to see the importance of each component.
fviz_eig(data.pca, addlabels = TRUE)
The scree plot above show that the first component explains 35% of the variance in the dataset. Plot a graph to view loadings of original variables on the new components. As this can only be visualized in two dimensions we can only see the loadings for component 1 and 2.
# Graph of the variables
fviz_pca_var(data.pca, col.var = "black")
Check to view how much of each original variable is represented in dimensions 1 and 2.
fviz_cos2(data.pca, choice = "var", axes = 1:2)
fviz_pca_var(data.pca, col.var = "cos2",
gradient.cols = c("black", "orange", "green"),
repel = TRUE)
View the eigenvalues and variables information for all dimensions in a tabular format.
# Eigenvalues
eig.val <- get_eigenvalue(data.pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.4592233 34.592233 34.59223
## Dim.2 1.6220832 16.220832 50.81307
## Dim.3 1.1675317 11.675317 62.48838
## Dim.4 0.9517536 9.517536 72.00592
## Dim.5 0.7941506 7.941506 79.94742
## Dim.6 0.6955907 6.955907 86.90333
## Dim.7 0.5218893 5.218893 92.12222
## Dim.8 0.3957622 3.957622 96.07985
## Dim.9 0.2767506 2.767506 98.84735
## Dim.10 0.1152648 1.152648 100.00000
# Results for Variables
res.var <- get_pca_var(data.pca)
res.var$coord # Coordinates
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## energy 0.86091667 0.061820527 0.28413710 0.02351336 0.20777925
## liveness 0.10139448 0.817250120 0.24705893 0.14075491 0.06360620
## tempo 0.30098554 -0.262539085 0.34223155 0.65725922 -0.52793123
## speechiness 0.09960005 0.860911433 -0.01049941 0.06257799 -0.12594919
## acousticness -0.78166156 0.288131893 -0.24364824 0.05941449 -0.23560282
## instrumentalness -0.61916250 -0.220686815 0.09369452 0.04615405 0.03898096
## danceability 0.63534128 0.043377814 -0.49908310 -0.25721937 -0.26455072
## duration_ms -0.12340329 -0.002957136 0.62658840 -0.64602897 -0.41694427
## loudness 0.88282569 -0.079401617 0.17484800 -0.04071827 0.18492284
## valence 0.64391568 -0.018728458 -0.41001837 -0.06818766 -0.34249637
## Dim.6 Dim.7 Dim.8 Dim.9 Dim.10
## energy 0.19138232 0.01749574 0.16988961 0.075159195 0.2430988365
## liveness 0.17307817 -0.31508306 -0.32771049 0.005512956 -0.0163599500
## tempo -0.04829309 0.07630101 -0.06595613 0.011505021 0.0045598543
## speechiness 0.02602910 0.39041668 0.25551436 -0.082213934 -0.0622292113
## acousticness -0.11206506 -0.12204567 0.12242110 0.370771499 0.0875775449
## instrumentalness 0.71418647 0.18416287 -0.03058027 0.094336932 -0.0411564859
## danceability 0.05735264 0.27755082 -0.34490637 0.074395855 0.0663932591
## duration_ms -0.01297338 -0.02446215 0.01225955 0.006061084 0.0004096168
## loudness -0.06067572 0.03628049 0.03345572 0.326840463 -0.1869388693
## valence 0.31022195 -0.36918642 0.23176652 -0.073570747 -0.0573481881
res.var$contrib # Contributions to the PCs
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## energy 21.4261250 0.2356092189 6.914920880 0.05809046 5.4362759
## liveness 0.2972008 41.1753079632 5.227962142 2.08162554 0.5094436
## tempo 2.6188623 4.2492746845 10.031628234 45.38881732 35.0955335
## speechiness 0.2867745 45.6923843487 0.009441943 0.41145162 1.9975051
## acousticness 17.6627739 5.1181090659 5.084613056 0.37090295 6.9896932
## instrumentalness 11.0823202 3.0024766882 0.751899395 0.22381809 0.1913384
## danceability 11.6690512 0.1160011219 21.334234287 6.95156898 8.8128229
## duration_ms 0.4402252 0.0005391002 33.627612760 43.85099731 21.8903730
## loudness 22.5305254 0.3886740625 2.618500696 0.17420238 4.3060420
## valence 11.9861415 0.0216237461 14.399186608 0.48852534 14.7709724
## Dim.6 Dim.7 Dim.8 Dim.9 Dim.10
## energy 5.26562428 0.05865244 7.29288371 2.04115365 51.2706644604
## liveness 4.30656312 19.02267994 27.13603230 0.01098198 0.2322026282
## tempo 0.33528657 1.11553222 1.09919804 0.04782845 0.0180386949
## speechiness 0.09740128 29.20642047 16.49666962 2.44231866 3.3596325361
## acousticness 1.80545505 2.85408139 3.78685076 49.67342992 6.6540907199
## instrumentalness 73.32793643 6.49868897 0.23629163 3.21569579 1.4695343998
## danceability 0.47288229 14.76068858 30.05855449 1.99990307 3.8242930266
## duration_ms 0.02419651 0.11465971 0.03797649 0.01327431 0.0001455656
## loudness 0.52926860 0.25221327 0.28281767 38.59962610 30.3181298278
## valence 13.83538589 26.11638301 13.57272528 1.95578808 2.8532681407
res.var$cos2 # Quality of representation
## Dim.1 Dim.2 Dim.3 Dim.4
## energy 0.741177516 0.003821777587 0.0807338904 0.0005528781
## liveness 0.010280841 0.667897758379 0.0610381131 0.0198119452
## tempo 0.090592294 0.068926771332 0.1171224354 0.4319896863
## speechiness 0.009920169 0.741168496148 0.0001102377 0.0039160054
## acousticness 0.610994795 0.083019987982 0.0593644671 0.0035300821
## instrumentalness 0.383362205 0.048702670333 0.0087786635 0.0021301967
## danceability 0.403658542 0.001881634725 0.2490839392 0.0661618055
## duration_ms 0.015228373 0.000008744654 0.3926130247 0.4173534296
## loudness 0.779381192 0.006304616722 0.0305718246 0.0016579774
## valence 0.414627403 0.000350755155 0.1681150621 0.0046495573
## Dim.5 Dim.6 Dim.7 Dim.8
## energy 0.043172216 0.0366271938 0.0003061008 0.0288624793
## liveness 0.004045749 0.0299560533 0.0992773330 0.1073941668
## tempo 0.278711378 0.0023322223 0.0058218434 0.0043502107
## speechiness 0.015863198 0.0006775143 0.1524251862 0.0652875877
## acousticness 0.055508688 0.0125585777 0.0148951457 0.0149869251
## instrumentalness 0.001519515 0.5100623201 0.0339159630 0.0009351530
## danceability 0.069987083 0.0032893253 0.0770344558 0.1189604058
## duration_ms 0.173842522 0.0001683087 0.0005983968 0.0001502966
## loudness 0.034196457 0.0036815432 0.0013162741 0.0011192855
## valence 0.117303761 0.0962376602 0.1362986110 0.0537157204
## Dim.9 Dim.10
## energy 0.00564890454 0.059097044285
## liveness 0.00003039269 0.000267647965
## tempo 0.00013236551 0.000020792271
## speechiness 0.00675913101 0.003872474735
## acousticness 0.13747150463 0.007669826364
## instrumentalness 0.00889945671 0.001693856329
## danceability 0.00553474330 0.004408064860
## duration_ms 0.00003673674 0.000000167786
## loudness 0.10682468851 0.034946140838
## valence 0.00541265481 0.003288814675
Now we use another Principal Component Analysis (PCA) method :- prcomp() function; to reduce the dimensionality - from the currently ten(10) attributes.
km_pca = prcomp(spotify_reduced.scaled)
print(km_pca)
## Standard deviations (1, .., p=10):
## [1] 1.8598987 1.2736103 1.0805238 0.9755786 0.8911513 0.8340208 0.7224191
## [8] 0.6290964 0.5260709 0.3395067
##
## Rotation (n x k) = (10 x 10):
## PC1 PC2 PC3 PC4 PC5
## energy -0.46288362 0.048539594 -0.262962371 0.02410196 0.23315823
## liveness -0.05451613 0.641679889 -0.228647374 0.14427840 0.07137532
## tempo -0.16182899 -0.206137689 -0.316727458 0.67371223 -0.59241483
## speechiness -0.05355133 0.675961422 0.009716966 0.06414449 -0.14133312
## acousticness 0.42027103 0.226232382 0.225490866 0.06090180 -0.26438028
## instrumentalness 0.33290119 -0.173276562 -0.086712133 0.04730942 0.04374225
## danceability -0.34159993 0.034058937 0.461889968 -0.26365828 -0.29686399
## duration_ms 0.06634947 -0.002321853 -0.579893204 -0.66220086 -0.46787149
## loudness -0.47466331 -0.062343730 -0.161817820 -0.04173756 0.20751005
## valence -0.34621007 -0.014705015 0.379462602 -0.06989459 -0.38433023
## PC6 PC7 PC8 PC9 PC10
## energy -0.22946948 0.02421827 -0.27005340 -0.14286895 -0.716035365
## liveness -0.20752260 -0.43614997 0.52092257 -0.01047949 0.048187408
## tempo 0.05790394 0.10561876 0.10484265 -0.02186972 -0.013430821
## speechiness -0.03120918 0.54042965 -0.40616092 0.15627919 0.183293004
## acousticness 0.13436722 -0.16894027 -0.19459832 -0.70479380 -0.257955243
## instrumentalness -0.85631733 0.25492526 0.04860984 -0.17932361 0.121224354
## danceability -0.06876644 0.38419642 0.54825682 -0.14141793 -0.195557997
## duration_ms 0.01555523 -0.03386144 -0.01948756 -0.01152142 -0.001206506
## loudness 0.07275085 0.05022084 -0.05318060 -0.62128597 0.550619014
## valence -0.37195949 -0.51104191 -0.36841180 0.13984949 0.168916196
PCA transforms the variables of the original data set into principal components (PC1:PC10). The degree to which each principal component explains the variance in the original data is also inferred from the standard deviation of each principal component and its calculated variance.
pca_tbl = tibble(proportional_variance = km_pca$sdev^2/sum(km_pca$sdev^2) , PC =paste0("PC", 1:10))
print(pca_tbl)
## # A tibble: 10 × 2
## proportional_variance PC
## <dbl> <chr>
## 1 0.346 PC1
## 2 0.162 PC2
## 3 0.117 PC3
## 4 0.0952 PC4
## 5 0.0794 PC5
## 6 0.0696 PC6
## 7 0.0522 PC7
## 8 0.0396 PC8
## 9 0.0277 PC9
## 10 0.0115 PC10
In contrast to the PRINCOMP() function method, here the first component only explains 35% of the variance in data. We plot the Cumulative Variance Explained.
ggplot(pca_tbl, aes(x = 1:10, y = cumsum(proportional_variance))) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = 1:10, labels = pca_tbl$PC, name = "Principal Component") +
scale_y_continuous(name = "Cummulative Variance Explained", breaks = seq.default(from = 0.6, to = 1, by = 0.05), labels = scales::percent_format(accuracy = 1)) +
labs(caption = "Fig. 1 Explaining dataset variance using PCA")
The above shows that only using first six (6) principal components - PC1 to PC6 - we can explain more than 85% of the data set’s variance. We plot the importance of these principal components using a scree plot.
fviz_eig(km_pca, addlabels = TRUE)
Plot the loadings of the original variables on components 1 and 2.
# Graph of the variables
fviz_pca_var(km_pca, col.var = "black")
Check to view how much of each original variable is represented in dimensions 1 and 2.
fviz_cos2(km_pca, choice = "var", axes = 1:2)
Plot the quality of representation of original variables in newly computed dimensions.
library(corrplot)
## corrplot 0.92 loaded
var <- get_pca_var(km_pca)
corrplot(var$cos2, is.corr=FALSE)
Plot the graph of variables of km_pca.
fviz_pca_var(km_pca, col.var = "cos2",
gradient.cols = c("black", "orange", "green"),
repel = TRUE)
View the eigenvalue and variables information for all dimensions in a tabular format.
# Eigenvalues
eig.val <- get_eigenvalue(km_pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.4592233 34.592233 34.59223
## Dim.2 1.6220832 16.220832 50.81307
## Dim.3 1.1675317 11.675317 62.48838
## Dim.4 0.9517536 9.517536 72.00592
## Dim.5 0.7941506 7.941506 79.94742
## Dim.6 0.6955907 6.955907 86.90333
## Dim.7 0.5218893 5.218893 92.12222
## Dim.8 0.3957622 3.957622 96.07985
## Dim.9 0.2767506 2.767506 98.84735
## Dim.10 0.1152648 1.152648 100.00000
# Results for Variables
res.var <- get_pca_var(km_pca)
res.var$coord # Coordinates
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## energy -0.86091667 0.061820527 -0.28413710 0.02351336 0.20777925
## liveness -0.10139448 0.817250120 -0.24705893 0.14075491 0.06360620
## tempo -0.30098554 -0.262539085 -0.34223155 0.65725922 -0.52793123
## speechiness -0.09960005 0.860911433 0.01049941 0.06257799 -0.12594919
## acousticness 0.78166156 0.288131893 0.24364824 0.05941449 -0.23560282
## instrumentalness 0.61916250 -0.220686815 -0.09369452 0.04615405 0.03898096
## danceability -0.63534128 0.043377814 0.49908310 -0.25721937 -0.26455072
## duration_ms 0.12340329 -0.002957136 -0.62658840 -0.64602897 -0.41694427
## loudness -0.88282569 -0.079401617 -0.17484800 -0.04071827 0.18492284
## valence -0.64391568 -0.018728458 0.41001837 -0.06818766 -0.34249637
## Dim.6 Dim.7 Dim.8 Dim.9 Dim.10
## energy -0.19138232 0.01749574 -0.16988961 -0.075159195 -0.2430988365
## liveness -0.17307817 -0.31508306 0.32771049 -0.005512956 0.0163599500
## tempo 0.04829309 0.07630101 0.06595613 -0.011505021 -0.0045598543
## speechiness -0.02602910 0.39041668 -0.25551436 0.082213934 0.0622292113
## acousticness 0.11206506 -0.12204567 -0.12242110 -0.370771499 -0.0875775449
## instrumentalness -0.71418647 0.18416287 0.03058027 -0.094336932 0.0411564859
## danceability -0.05735264 0.27755082 0.34490637 -0.074395855 -0.0663932591
## duration_ms 0.01297338 -0.02446215 -0.01225955 -0.006061084 -0.0004096168
## loudness 0.06067572 0.03628049 -0.03345572 -0.326840463 0.1869388693
## valence -0.31022195 -0.36918642 -0.23176652 0.073570747 0.0573481881
res.var$contrib # Contributions to the PCs
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## energy 21.4261250 0.2356092189 6.914920880 0.05809046 5.4362759
## liveness 0.2972008 41.1753079632 5.227962142 2.08162554 0.5094436
## tempo 2.6188623 4.2492746845 10.031628234 45.38881732 35.0955335
## speechiness 0.2867745 45.6923843487 0.009441943 0.41145162 1.9975051
## acousticness 17.6627739 5.1181090659 5.084613056 0.37090295 6.9896932
## instrumentalness 11.0823202 3.0024766882 0.751899395 0.22381809 0.1913384
## danceability 11.6690512 0.1160011219 21.334234287 6.95156898 8.8128229
## duration_ms 0.4402252 0.0005391002 33.627612760 43.85099731 21.8903730
## loudness 22.5305254 0.3886740625 2.618500696 0.17420238 4.3060420
## valence 11.9861415 0.0216237461 14.399186608 0.48852534 14.7709724
## Dim.6 Dim.7 Dim.8 Dim.9 Dim.10
## energy 5.26562428 0.05865244 7.29288371 2.04115365 51.2706644603
## liveness 4.30656312 19.02267994 27.13603230 0.01098198 0.2322026282
## tempo 0.33528657 1.11553222 1.09919804 0.04782845 0.0180386949
## speechiness 0.09740128 29.20642047 16.49666962 2.44231866 3.3596325361
## acousticness 1.80545505 2.85408139 3.78685076 49.67342992 6.6540907199
## instrumentalness 73.32793643 6.49868897 0.23629163 3.21569579 1.4695343998
## danceability 0.47288229 14.76068858 30.05855449 1.99990307 3.8242930266
## duration_ms 0.02419651 0.11465971 0.03797649 0.01327431 0.0001455656
## loudness 0.52926860 0.25221327 0.28281767 38.59962610 30.3181298278
## valence 13.83538589 26.11638301 13.57272528 1.95578808 2.8532681407
res.var$cos2 # Quality of representation
## Dim.1 Dim.2 Dim.3 Dim.4
## energy 0.741177516 0.003821777587 0.0807338904 0.0005528781
## liveness 0.010280841 0.667897758379 0.0610381131 0.0198119452
## tempo 0.090592294 0.068926771332 0.1171224354 0.4319896863
## speechiness 0.009920169 0.741168496148 0.0001102377 0.0039160054
## acousticness 0.610994795 0.083019987982 0.0593644671 0.0035300821
## instrumentalness 0.383362205 0.048702670333 0.0087786635 0.0021301967
## danceability 0.403658542 0.001881634725 0.2490839392 0.0661618055
## duration_ms 0.015228373 0.000008744654 0.3926130247 0.4173534296
## loudness 0.779381192 0.006304616722 0.0305718246 0.0016579774
## valence 0.414627403 0.000350755155 0.1681150621 0.0046495573
## Dim.5 Dim.6 Dim.7 Dim.8
## energy 0.043172216 0.0366271938 0.0003061008 0.0288624793
## liveness 0.004045749 0.0299560533 0.0992773330 0.1073941668
## tempo 0.278711378 0.0023322223 0.0058218434 0.0043502107
## speechiness 0.015863198 0.0006775143 0.1524251862 0.0652875877
## acousticness 0.055508688 0.0125585777 0.0148951457 0.0149869251
## instrumentalness 0.001519515 0.5100623201 0.0339159630 0.0009351530
## danceability 0.069987083 0.0032893253 0.0770344558 0.1189604058
## duration_ms 0.173842522 0.0001683087 0.0005983968 0.0001502966
## loudness 0.034196457 0.0036815432 0.0013162741 0.0011192855
## valence 0.117303761 0.0962376602 0.1362986110 0.0537157204
## Dim.9 Dim.10
## energy 0.00564890454 0.059097044285
## liveness 0.00003039269 0.000267647965
## tempo 0.00013236551 0.000020792271
## speechiness 0.00675913101 0.003872474735
## acousticness 0.13747150463 0.007669826364
## instrumentalness 0.00889945671 0.001693856329
## danceability 0.00553474330 0.004408064860
## duration_ms 0.00003673674 0.000000167786
## loudness 0.10682468851 0.034946140838
## valence 0.00541265481 0.003288814675
Plot the top 1000 contributing individual points from the dataset to dimensional components 6 and 7.
fviz_pca_ind(km_pca, axes = c(6, 7), label = "none",
select.ind= list(contrib = 1000),
geom.ind = "point")+
labs(title ="PCA", x = "PC6", y = "PC7")
Plot the top 1000 contributing individual points from the dataset to dimensional components 1 and 2.
fviz_pca_ind(km_pca, axes = c(1, 2), label = "none",
select.ind= list(contrib = 1000),
geom.ind = "point")+
labs(title ="PCA", x = "PC1", y = "PC2")
Use k-means() to cluster dataset into 10 centers.
set.seed(123)
km_10 <- kmeans( spotify_reduced.scaled, centers = 10)
km_10
## K-means clustering with 10 clusters of sizes 22325, 35951, 9630, 3952, 52237, 28236, 46794, 108, 21086, 12284
##
## Cluster means:
## energy liveness tempo speechiness acousticness instrumentalness
## 1 -1.3290418 -0.1764674 -0.31024696 -0.36809381 1.3376732 -0.35598308
## 2 0.7731637 -0.1539308 1.41650750 -0.06430712 -0.7058481 -0.27337354
## 3 0.3681702 2.6126105 -0.64832710 4.17208956 1.2144820 -0.48800649
## 4 -1.3224690 -0.1758294 -0.44303299 -0.30439439 1.2101034 1.48929376
## 5 0.5918253 -0.2600115 -0.25000007 -0.10051790 -0.6329072 -0.33946172
## 6 -0.5969196 -0.3406919 -0.06118599 -0.15283855 0.7220474 -0.13622259
## 7 0.3042919 -0.2827280 -0.26654143 -0.21507673 -0.6536308 -0.15060635
## 8 -0.3753096 0.7539466 -0.80834933 2.49580982 0.9351803 -0.01980392
## 9 -1.5038465 -0.3892401 -0.45747143 -0.40931604 1.3068999 2.28914952
## 10 0.5704019 2.4069607 0.08048224 -0.03685787 -0.4366858 -0.24840148
## danceability duration_ms loudness valence
## 1 -0.91610715 -0.05680334 -0.9469556 -0.91255378
## 2 -0.26055771 -0.07760216 0.6283494 0.29094665
## 3 0.04410542 -0.12981649 -0.4107823 -0.18225243
## 4 -1.38658587 3.51740454 -1.5386806 -1.11317389
## 5 0.86109123 -0.10684960 0.5349201 1.01592334
## 6 0.59129195 -0.28915360 -0.1998337 0.45646890
## 7 0.05713629 0.11668976 0.4327238 -0.59749389
## 8 -0.17391141 23.80895659 -0.6481866 0.02627823
## 9 -1.36710407 -0.23352332 -1.8547459 -1.13155661
## 10 -0.05130122 0.16653602 0.4247701 0.15678481
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 2 2 1 1 1 1 3 1 5 5 4 9 7 6 9 8 1 5 7 2
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 1 6 5 6 6 6 5 6 6 5 5 3 6 6 5 6 5 6 9 6
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 6 9 1 6 6 6 5 5 5 6 1 10 3 1 3 1 1 10 9 7
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 1 6 1 6 6 2 5 1 5 6 6 1 1 6 3 1 6 1 4 3
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 98 99 100 101
## 5 6 3 9 10 6 6 1 6 6 7 6 2 6 5 1 6 9 7 5
## 102
## 6
## [ reached getOption("max.print") -- omitted 232502 entries ]
##
## Within cluster sum of squares by cluster:
## [1] 94441.126 127524.764 57243.046 41856.639 145044.366 117894.327
## [7] 166138.730 6345.525 111628.508 67389.655
## (between_SS / total_SS = 59.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Plot the k-means result with 10 centers.
fviz_cluster(km_10 , data=spotify_reduced.scaled, palette = c("yellow", "black", "pink", "green", "orange", "red", "grey", "#2E9FDF", "#00AFBB", "#E7B800"),
geom="point",
ellipse.type="convex",
ggtheme=theme_bw()
)
Use k-means() to cluster dataset into 7 centers.
set.seed(123)
km_7 <- kmeans( spotify_reduced.scaled, centers = 7)
km_7
## K-means clustering with 7 clusters of sizes 22667, 39079, 10018, 23521, 56881, 31062, 49375
##
## Cluster means:
## energy liveness tempo speechiness acousticness instrumentalness
## 1 -1.3334698 -0.06860210 -0.33183609 -0.36652258 1.3419550 -0.3469254
## 2 0.7976342 0.16989815 1.37388833 -0.05595129 -0.7059861 -0.2757158
## 3 0.3599028 2.60575281 -0.64173234 4.10083161 1.1930227 -0.4867282
## 4 -1.5107929 -0.37611351 -0.46372339 -0.41159065 1.3185027 2.2779556
## 5 0.5789961 -0.14583843 -0.24627668 -0.08762423 -0.6202609 -0.3435237
## 6 -0.6116571 -0.29491438 -0.04630864 -0.16313431 0.7276884 -0.1311518
## 7 0.3453222 -0.09896193 -0.27109748 -0.21985191 -0.6706929 -0.1306630
## danceability duration_ms loudness valence
## 1 -0.977368450 0.08460353 -0.9840996 -0.9362126
## 2 -0.273900462 -0.05509184 0.6304404 0.2830091
## 3 0.041010693 0.06499160 -0.4009519 -0.1647969
## 4 -1.402124960 0.19937570 -1.8539274 -1.1570395
## 5 0.858405465 -0.10998391 0.5256913 0.9974758
## 6 0.538809220 -0.27931492 -0.2007224 0.3643702
## 7 -0.002777729 0.19902175 0.4379860 -0.5879171
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 2 2 6 1 1 1 3 1 5 5 4 4 7 6 4 3 1 5 7 2
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 1 6 5 6 6 6 5 6 6 5 5 3 6 6 5 6 5 6 4 6
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 6 4 1 6 6 6 5 5 5 6 1 2 3 6 3 1 1 1 4 7
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 1 6 1 6 6 2 5 6 5 6 6 1 1 6 3 6 6 1 1 3
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 98 99 100 101
## 5 6 3 4 6 6 6 6 6 6 7 6 2 6 5 1 6 4 7 5
## 102
## 6
## [ reached getOption("max.print") -- omitted 232502 entries ]
##
## Within cluster sum of squares by cluster:
## [1] 115957.15 169521.95 97929.37 167498.56 178439.09 136345.26 233534.47
## (between_SS / total_SS = 52.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Plot the k-means result with 7 centers.
fviz_cluster(km_7 , data=spotify_reduced.scaled, palette = c("green", "orange", "red", "grey", "#2E9FDF", "#00AFBB", "#E7B800"),
geom="point",
ellipse.type="convex",
ggtheme=theme_bw()
)
Use kmeans() to cluster dataset into 3 centers.
set.seed(123)
km_3 <- kmeans( spotify_reduced.scaled, centers = 3)
km_3
## K-means clustering with 3 clusters of sizes 53140, 169097, 10366
##
## Cluster means:
## energy liveness tempo speechiness acousticness instrumentalness
## 1 -1.3585806 -0.25338106 -0.3686223 -0.3778857 1.2767734 0.9674164
## 2 0.4060388 -0.07866395 0.1540547 -0.1278091 -0.4730022 -0.2742902
## 3 0.3410212 2.58214421 -0.6233460 4.0220889 1.1707034 -0.4849378
## danceability duration_ms loudness valence
## 1 -1.01051313 0.09955897 -1.3183608 -0.9145171
## 2 0.31493676 -0.03539747 0.4387450 0.2965517
## 3 0.04281365 0.06705024 -0.3986855 -0.1493891
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 2 2 1 1 1 1 3 1 2 2 1 1 2 2 1 3 3 2 2 2
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 1 2 2 2 2 3 2 2 2 2 2 3 2 2 2 2 2 2 1 2
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 2 1 1 2 3 2 2 2 2 2 1 2 3 1 3 1 1 1 1 2
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 1 2 1 2 2 2 2 1 2 2 2 1 1 1 3 1 2 1 1 3
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 98 99 100 101
## 2 2 3 1 2 2 2 2 2 2 2 2 2 3 2 1 2 1 2 2
## 102
## 1
## [ reached getOption("max.print") -- omitted 232502 entries ]
##
## Within cluster sum of squares by cluster:
## [1] 453830.3 908321.4 108320.2
## (between_SS / total_SS = 36.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Plot the k-means result with 3 centers.
fviz_cluster(km_3 , data=spotify_reduced.scaled, palette = c("#2E9FDF", "purple", "#E7B800"),
geom="point",
ellipse.type="convex",
ggtheme=theme_bw()
)
The k-means computation yield object components that can be accessed. The “betweenss” component is the mean sum of squares distances between cluster centers and we would like this to be as high as possible. The “tot.withinss” component computes the mean sum of squares distances from the center within each cluster and we would like this to be as small as possible.
Compute percentage of ratio of “betweens” to “tot.withinss” for 3-center cluster.
print(paste0(round(km_3$betweenss/km_3$tot.withinss, 4)*100, "%"))
## [1] "58.18%"
Compute percentage of ratio of “betweens” to “tot.withinss” for 7-center cluster.
print(paste0(round(km_7$betweenss/km_7$tot.withinss, 4)*100, "%"))
## [1] "111.61%"
Compute percentage of ratio of “betweens” to “tot.withinss” for 10-center cluster.
print(paste0(round(km_10$betweenss/km_10$tot.withinss, 4)*100, "%"))
## [1] "148.64%"
Choose to classify original data using 7-center cluster as this would explain 85% of variances and has a good balance between high ratio of “betweens” to “tot.withinss” and computational cost. Also create new column “cluster_num” in original dataset to denote classification.
spotify_songs_final <- cbind(spotify_reduced, cluster_num = km_7$cluster)
Display original dataset with new classification in new “cluster_num” column.
rmarkdown::paged_table(spotify_songs_final)
Check the proportion of each type of cluster in dataset.
table("\nFrequency" = factor(spotify_songs_final$cluster_num)
) %>%
prop.table()
##
## Frequency
## 1 2 3 4 5 6 7
## 0.09744930 0.16800729 0.04306909 0.10112079 0.24454113 0.13354084 0.21227155
We used princomp() and prcomp() as preliminaries to using k-means. According to the Scree plot for prcomp() a seven(7) cluster classification can explain 85% of variances in the original dataset. In the k-means process, We choose the seven(7) clusters to classify the original dataset. The resulting dataset show that 25% of the dataset lie in cluster number 7, 22% in cluster number 6, 17% in cluster number 5, and 15% in cluster number 2. The 7-cluster classification model provides a “betweenss to tot.withinss” ratio of 115%.
https://rpubs.com/DayaNayak/Spotify
https://rpubs.com/amritha_sharma/832890
https://rstudio-pubs-static.s3.amazonaws.com/569420_ca30064fcdc64caaa6f6b66ea3cdf9e1.html
http://data-mining.business-intelligence.uoc.edu/k-means
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html
https://stackoverflow.com/questions/48470534/principal-component-analysis-graph-of-individuals