Today, we’ll analyze a data set of songs and their audio characteristics pulled from Spotify, and build an unsupervised learning model from it. You can find the data set here: https://www.kaggle.com/datasets/zaheenhamidani/ultimate-spotify-tracks-db
Let’s do some data pre-processing and take a look at our data set.
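For reference, here’s a minimal way to load everything — assuming the Kaggle CSV has been saved locally as SpotifyFeatures.csv (the file name and library choices here are assumptions):
# libraries used throughout this analysis
library(dplyr)       # data wrangling and the %>% pipe
library(FactoMineR)  # PCA() used later on
# read the Kaggle CSV (file name is an assumption) and inspect its structure
songs <- read.csv("SpotifyFeatures.csv")
glimpse(songs)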
## Rows: 232,725
## Columns: 18
## $ genre <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie",…
## $ artist_name <chr> "Henri Salvador", "Martin & les fées", "Joseph Willia…
## $ track_name <chr> "C'est beau de faire un Show", "Perdu d'avance (par G…
## $ track_id <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "…
## $ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
## $ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,…
## $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
## $ duration_ms <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
## $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
## $ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 1.23e-01, 0.0…
## $ key <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G…
## $ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105…
## $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
## $ mode <chr> "Major", "Minor", "Minor", "Major", "Major", "Major",…
## $ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953…
## $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
## $ time_signature <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/4…
## $ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533…
Seems like some of the variables have repeating values. Let’s check.
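One quick way to check is to list the distinct values of the suspect columns:
# distinct values of the categorical-looking columns
unique(songs$key)
unique(songs$time_signature)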
## [1] "C#" "F#" "C" "F" "G" "E" "D#" "G#" "D" "A#" "A" "B"
## [1] "4/4" "5/4" "3/4" "1/4" "0/4"
Let’s also check if there are any NA values.
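A simple per-column count does the trick:
# count missing values in each column
colSums(is.na(songs))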
## genre artist_name track_name track_id
## 0 0 0 0
## popularity acousticness danceability duration_ms
## 0 0 0 0
## energy instrumentalness key liveness
## 0 0 0 0
## loudness mode speechiness tempo
## 0 0 0 0
## time_signature valence
## 0 0
Now let’s get rid of the non-numerical variables that have too many unique values: track_id, artist_name, and track_name. Before dropping track_name, we de-duplicate on it and store it as row names, so each row still identifies a song. Let’s also convert the non-numerical variables with repeating values into factors.
songs <- songs %>% distinct(track_name, .keep_all = TRUE)
rownames(songs) <- songs$track_name
songs <- songs %>%
  select(-c(track_id, artist_name, track_name)) %>%
  mutate(genre = as.factor(genre),
         key = as.factor(key),
         mode = as.factor(mode),
         time_signature = as.factor(time_signature))
Now let’s check the ranges of the values.
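Base R’s summary() shows these at a glance:
# five-number summaries for numeric columns, level counts for factors
summary(songs)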
## genre popularity acousticness danceability
## Comedy : 8640 Min. : 0.00 Min. :0.0000 Min. :0.0569
## Classical : 8395 1st Qu.: 24.00 1st Qu.:0.0474 1st Qu.:0.4060
## Alternative: 8305 Median : 36.00 Median :0.3050 Median :0.5530
## Anime : 8258 Mean : 35.69 Mean :0.4151 Mean :0.5356
## Opera : 8096 3rd Qu.: 48.00 3rd Qu.:0.8100 3rd Qu.:0.6800
## Electronic : 7915 Max. :100.00 Max. :0.9960 Max. :0.9870
## (Other) :99006
## duration_ms energy instrumentalness key
## Min. : 15387 Min. :0.0000203 Min. :0.0000000 C :17495
## 1st Qu.: 177380 1st Qu.:0.3300000 1st Qu.:0.0000000 G :17328
## Median : 220000 Median :0.5890000 Median :0.0000887 D :15730
## Mean : 238023 Mean :0.5523232 Mean :0.1818384 A :14778
## 3rd Qu.: 270867 3rd Qu.:0.7910000 3rd Qu.:0.1300000 C# :14191
## Max. :5552917 Max. :0.9990000 Max. :0.9990000 F :13128
## (Other):55965
## liveness loudness mode speechiness
## Min. :0.00967 Min. :-52.457 Major:98080 Min. :0.0222
## 1st Qu.:0.09775 1st Qu.:-13.302 Minor:50535 1st Qu.:0.0371
## Median :0.13100 Median : -8.323 Median :0.0496
## Mean :0.22882 Mean :-10.382 Mean :0.1305
## 3rd Qu.:0.28400 3rd Qu.: -5.659 3rd Qu.:0.1030
## Max. :1.00000 Max. : 3.744 Max. :0.9670
##
## tempo time_signature valence
## Min. : 30.38 0/4: 6 Min. :0.0000
## 1st Qu.: 91.84 1/4: 2066 1st Qu.:0.2150
## Median :114.89 3/4: 18138 Median :0.4370
## Mean :116.93 4/4:124534 Mean :0.4486
## 3rd Qu.:138.28 5/4: 3871 3rd Qu.:0.6660
## Max. :242.90 Max. :1.0000
##
Seems like the data ranges are quite varied. Let’s scale these values later.
Since we have quite a lot of variables, let’s use PCA; the categorical ones can be passed along as supplementary variables. For the number of PCs to compute, let’s simply use the number of numerical variables we have for now.
# quantitative column names
quanti <- songs %>%
  select_if(is.numeric) %>%
  colnames()
# numeric column index
quantivar <- which(colnames(songs) %in% quanti)
# qualitative column names
quali <- songs %>%
  select_if(is.factor) %>%
  colnames()
# categorical column index
qualivar <- which(colnames(songs) %in% quali)
Don’t forget to set scale.unit to TRUE, as we want to scale our values.
songs_pca <- PCA(
  X = songs,
  scale.unit = TRUE,
  quali.sup = qualivar,
  graph = FALSE,
  ncp = length(quanti)
)
songs_pca
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 148615 individuals, described by 15 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$quali.sup" "results for the supplementary categorical variables"
## 12 "$quali.sup$coord" "coord. for the supplementary categories"
## 13 "$quali.sup$v.test" "v-test of the supplementary categories"
## 14 "$call" "summary statistics"
## 15 "$call$centre" "mean of the variables"
## 16 "$call$ecart.type" "standard error of the variables"
## 17 "$call$row.w" "weights for the individuals"
## 18 "$call$col.w" "weights for the variables"
In order to analyze the results, let’s plot out the individual values of the PCA.
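One way to draw this individuals map with FactoMineR — the habillage and label options here are assumptions, since the original call isn’t shown:
# individuals map, colored by genre (column 1 of songs);
# label only the category centroids to keep the plot readable
plot(songs_pca, choix = "ind", habillage = 1, label = "quali")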
## Warning: ggrepel: 9 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
It makes sense that each genre would naturally cluster up, as
people created genres in order to classify songs based on their
characteristics in the first place. But there are also other things we
can infer from this plot, such as the fact that World music seems to
have overlapping characteristics with Electronic and Soundtrack music.
We can also see that one World song seems to have characteristics
quite different from the rest of its genre; in fact, the variation in
characteristics within the World genre seems quite large. Another
interesting point is that Comedy music seems to be quite distinct
compared to the other genres, as it is sequestered at the top of the
chart.
How about the variables?
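The variables can be visualized with a correlation circle, for example:
# correlation circle of the active (numeric) variables
plot(songs_pca, choix = "var")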
In order to interpret this chart, first we have to remember the meaning of the angle between two variable arrows:
- an angle below 90 degrees indicates a positive correlation
- a 90 degree angle indicates no correlation
- an angle nearing 180 degrees indicates a negative correlation
Thus, when it comes to a song’s popularity, we can infer that:
- tempo, loudness, valence, danceability, and energy correlate positively
- acousticness, instrumentalness, duration_ms, speechiness, and liveness correlate negatively
This all makes sense, as most songs that go viral tend to be short and fast-paced, the type that would be played at parties.
We can also see how much certain variables contribute to the information of the PC.
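This kind of contribution bar chart comes from the factoextra package (the loading messages below support this); a sketch for the first PC, assuming fviz_contrib was the function used:
library(factoextra)
# bar chart of each variable's contribution (%) to the first PC
fviz_contrib(songs_pca, choice = "var", axes = 1)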
## Warning: package 'factoextra' was built under R version 4.4.1
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
In our first PC, we can see that loudness and energy have the
biggest contributions. We can also see this numerically, as below.
# dimdesc: dimension description
songs_dim <- dimdesc(songs_pca)
# variables that contribute to PC1
songs_dim$Dim.1$quanti %>% as.data.frame()
We can also use our PCA for supervised learning models. In that case, we can prepare it to avoid overfitting and reduce computation time by keeping only a limited number of dimensions. To decide how many to keep, we can start by looking at the eigenvalues.
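The eigenvalues live in the $eig element of the PCA object:
# eigenvalues and (cumulative) percentage of variance per component
songs_pca$eig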
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 3.7330652 33.936956 33.93696
## comp 2 1.7935575 16.305069 50.24203
## comp 3 1.1467482 10.424984 60.66701
## comp 4 0.9719350 8.835773 69.50278
## comp 5 0.8486468 7.714971 77.21775
## comp 6 0.7098561 6.453237 83.67099
## comp 7 0.6447742 5.861583 89.53257
## comp 8 0.4463965 4.058150 93.59072
## comp 9 0.3382590 3.075082 96.66581
## comp 10 0.2610486 2.373169 99.03898
## comp 11 0.1057127 0.961025 100.00000
For example, if we’d like our PCA to retain no less than 90% of its information, we can discard comps 9-11 and keep comps 1-8 (93.59% of the cumulative variance), as below.
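A minimal sketch of the selection, assuming songs_keep holds each song’s coordinates on the first eight components (this object is reused in the merge below):
# keep each song's coordinates on the first 8 principal components
songs_keep <- as.data.frame(songs_pca$ind$coord[, 1:8])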
We can then combine this with the original data set by keeping only the categorical variables and replacing all the numerical ones with the dimensions we selected. The result can be used for supervised learning models.
quali_songs <- songs %>%
  select_if(is.factor)
songs_new <- merge(quali_songs, songs_keep, by = "row.names", all = TRUE)
head(songs_new)
We can use PCA to simplify data, visualize clusters, and figure out how variables correlate with one another. In our case, we’ve used it to see how each genre clusters together and which variables affect a song’s popularity. Our PCA can also be used to train supervised learning models, although it will no longer have the original values (as they have been scaled).