# Libraries
library(tidyverse)
library(corrplot)
library(ggbiplot)
library(ggpubr)
library(factoextra)
library(FactoMineR)
library(psych)
library(kableExtra)
# Importing dataset
songs_dataset = read.csv("spotify_data.csv") %>% select(-X)
What has been done: the dataset was imported and the technical index column X was dropped.
Note: there was nothing else in the dataset that needed to be cleaned up.
The focus of this project is Spotify. I guess there is no need to introduce this company, but just in case: Spotify is the most popular audio-streaming service, with ~489 million users (Spotify, 2023).
Now, let's go into the details. To maintain such a huge audience, you, as a company, should have something really valuable to offer. And Spotify has. Its music collection contains a mind-boggling number of tracks: over 100 million (Spotify, 2023). But it's not enough to take a tourist to a garden full of wonders; you need to help them navigate this garden, i.e. suggest what to enjoy. Here, Spotify also has something to offer: a fancy recommendation system highlighted by industry experts like Ural Garrett (2022).
To make the recommendation system work, Spotify builds really advanced models on many different sorts of data (Dmitry Pastukhov, 2022). Some of this data is attached to users, the rest to the tracks themselves. The latter type is partly public: using the Spotify API (https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features) it is possible to get raw audio features for each song. There are ~13 features available, but does Spotify need all of them to build great models? Could it be that feature A and feature B are very similar to each other? I did not find any public discussion from Spotify on this issue, so I am raising it right now.
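For illustration, here is a hedged sketch of how such features could be pulled in R. It assumes the spotifyr package, valid API credentials stored in environment variables, and that the audio-features endpoint is still available; none of this is part of the analysis below.

# Sketch: fetching raw audio features for one track via the Spotify API
# (assumes the spotifyr package and registered API credentials)
library(spotifyr)
Sys.setenv(SPOTIFY_CLIENT_ID = "your_client_id",
           SPOTIFY_CLIENT_SECRET = "your_client_secret")
access_token = get_spotify_access_token()
# track_id comes from the dataset introduced later in this report
get_track_audio_features(songs_dataset$track_id[1])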
Well, there are 2 reasons why this issue is important: money and model quality. As for the money, you already know how many songs Spotify has (over 100 million). All these millions of songs have features stored somewhere. Of course, this storage is not free: Spotify has to pay for maintaining and updating the facilities that hold these features. And I guess keeping even one feature for over 100 million songs can cost a round sum, so reducing the number of stored features could save a lot of money.
The second thing is model quality: if 2 features correlate strongly with each other, they can harm model estimates due to multicollinearity. Of course, I don't think Spotify's data scientists fail to account for this; I'm talking about risks, which grow simply from having such highly correlated variables in the data. For example, you could mistakenly include both features together or forget to exclude one of them, and the consequences of such a situation could be critical for the company. So, at Spotify's scale, it's better to reduce risks whenever possible, especially when the price is just several hours of student work, as in the case of the analysis below.
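To make this risk concrete, here is a small illustrative check one could run on the dataset introduced below. The car package is an assumption (it is not used elsewhere in this report): if two predictors are highly correlated, their variance inflation factors in a regression will be inflated.

# Sketch: variance inflation factors for a toy model predicting popularity
# (the car package is assumed; high VIFs flag multicollinearity)
library(car)
vif(lm(popularity ~ energy + loudness + acousticness + valence,
       data = songs_dataset))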
PCA, or, in full, principal component analysis (together with preparatory steps like building a correlation matrix), is a solution for Spotify. Why? This analysis can identify similar features and significantly reduce their number: an ideal fit for the stated issue! It should also be noted that this analysis is quite distinctive: no other method known to the author does the same or anything similar.
Now, let’s do that PCA!
For its findings to generalize well, an analysis should be done on data that covers as many options as possible. Also, as far as I know, PCA is not subject to the overfitting problem. With all these considerations in mind, I decided to choose the largest and most diverse set of songs and their features available. The chosen dataset contains 114,000 songs across 125 different genres.
Link to the dataset: https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset
In the chosen dataset there are 20 variables.
5 variables represent song metadata. All of them except the last one are nominal; the last one, popularity, is quantitative:
- track_id: track identifier on the Spotify platform;
- track_name: name of the track;
- album_name: name of the album the track belongs to;
- artists: creator(s) of the track;
- popularity: the degree of track popularity. An algorithm calculates this value based on the track's total plays and the recency of these plays (from 0 to 100).
Next, we have qualitative features of the track:
- explicit: presence of explicit lyrics (swearing, violence, etc.) in the track;
- track_genre: genre the track belongs to.
And, finally, raw audio features:
- duration_ms: length of the track (in milliseconds);
- danceability: the degree to which the track is suitable for dancing, based on a combination of musical elements including rhythm stability, beat strength, and overall regularity (from 0.0 to 1.0);
- energy: perceived measure of intensity and activity; energetic songs often feel fast, loud, and cacophonous (from 0.0 to 1.0);
- key: the key the track is in. For example, 0 corresponds to C, 1 corresponds to C♯/D♭, 2 corresponds to D, and so on;
- loudness: overall loudness of the track (in decibels);
- mode: modality of the track (1 = major, 0 = minor);
- speechiness: the degree of spoken words in the track; a higher value indicates that the recording is more speech-like (from 0.0 to 1.0);
- acousticness: confidence degree of whether the track is acoustic (from 0.0 to 1.0);
- instrumentalness: confidence degree of whether the track contains no vocals (from 0.0 to 1.0);
- liveness: confidence degree of whether the track was performed live; a higher value means an increased probability of a live performance (from 0.0 to 1.0);
- valence: the degree of positive or negative emotional content conveyed by the track. High valence indicates a more positive, happy, or euphoric sound; low valence corresponds to a more negative, sad, or angry sound (from 0.0 to 1.0);
- tempo: overall estimated tempo of the track (in beats per minute);
- time_signature: measure representing how many beats are in each bar (from 3 to 7).
These raw audio features are the raw material for PCA.
In this part we will do 2 things:
- select the raw audio features needed for PCA;
- look at their distributions (histograms, density plots, skew and kurtosis) to decide which variables need transformation.
# Choosing necessary variables
songs_dataset_af = songs_dataset %>%
select(duration_ms, danceability:time_signature)
# Looking at chosen variables
str(songs_dataset_af)
## 'data.frame': 114000 obs. of 13 variables:
## $ duration_ms : int 230666 149610 210826 201933 198853 214240 229400 242946 189613 205594 ...
## $ danceability : num 0.676 0.42 0.438 0.266 0.618 0.688 0.407 0.703 0.625 0.442 ...
## $ energy : num 0.461 0.166 0.359 0.0596 0.443 0.481 0.147 0.444 0.414 0.632 ...
## $ key : int 1 1 0 0 2 6 2 11 0 1 ...
## $ loudness : num -6.75 -17.23 -9.73 -18.52 -9.68 ...
## $ mode : int 0 1 1 1 1 1 1 1 1 1 ...
## $ speechiness : num 0.143 0.0763 0.0557 0.0363 0.0526 0.105 0.0355 0.0417 0.0369 0.0295 ...
## $ acousticness : num 0.0322 0.924 0.21 0.905 0.469 0.289 0.857 0.559 0.294 0.426 ...
## $ instrumentalness: num 1.01e-06 5.56e-06 0.00 7.07e-05 0.00 0.00 2.89e-06 0.00 0.00 4.19e-03 ...
## $ liveness : num 0.358 0.101 0.117 0.132 0.0829 0.189 0.0913 0.0973 0.151 0.0735 ...
## $ valence : num 0.715 0.267 0.12 0.143 0.167 0.666 0.0765 0.712 0.669 0.196 ...
## $ tempo : num 87.9 77.5 76.3 181.7 119.9 ...
## $ time_signature : int 4 4 4 3 4 4 3 4 4 4 ...
# Renaming the "key" variable so it is not confused with the "key" column created by gather()
songs_dataset_hist = songs_dataset_af %>%
dplyr::rename("key_var" = "key")
# Building histograms
ggplot(
gather(songs_dataset_hist), aes(value)) +
geom_histogram(bins = 25) +
facet_wrap(~key, scales = 'free_x') +
ylab("Number of songs") +
xlab("Raw audio feature") +
ggtitle("Histograms for raw audio features")
# Building density plots
ggplot(
gather(songs_dataset_hist), aes(value)) +
geom_density() +
facet_wrap(~key, scales = 'free_x') +
ylab("Density of songs") +
xlab("Raw audio feature") +
ggtitle("Density plots for raw audio features")
# Looking at skew & kurtosis of variables in focus
describe(songs_dataset_af)[11:12]
## skew kurtosis
## duration_ms 11.19 354.93
## danceability -0.40 -0.18
## energy -0.60 -0.53
## key -0.01 -1.28
## loudness -2.01 5.90
## mode -0.57 -1.67
## speechiness 4.65 28.82
## acousticness 0.73 -0.95
## instrumentalness 1.73 1.27
## liveness 2.11 4.38
## valence 0.12 -1.03
## tempo 0.23 -0.11
## time_signature -4.10 26.01
What we can see from the distributions:
- mode and time_signature are not numeric, but factor / ordinal with 2 and 6 levels respectively. Because of this, we exclude them from further analysis;
- the only variable distributed close to normally is danceability. In the case of the others, it would be good to perform a log-transformation.
Now, let's find out which variables are well correlated and which are not. This will help us understand which variables should be kept and which should be dropped before the PCA.
# Removing ordinal variables
songs_dataset_corr = songs_dataset_af %>%
select(-mode, -time_signature)
# Creating circle correlation matrix
songs_dataset_corr %>%
cor() %>%
corrplot(diag = FALSE, order = 'FPC')
# Creating number correlation matrix
songs_dataset_corr %>%
cor() %>%
corrplot(diag = FALSE, order = 'FPC', method = "number", number.cex = 0.8)
We have 2 sets of non-weak correlations (|r| > 0.4):
- danceability ~ valence;
- energy ~ loudness ~ acousticness, loudness ~ instrumentalness.
It can be supposed that for each of these sets we will get a principal component that describes well the variance of the variables belonging to that set.
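As a cross-check of the pairs read off the plots above, the same correlations can be listed programmatically; a small sketch:

# Listing all variable pairs with |r| > 0.4
cor_mat = cor(songs_dataset_corr)
cor_mat[upper.tri(cor_mat, diag = TRUE)] = NA   # keep each pair only once
strong = which(abs(cor_mat) > 0.4, arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[strong[, 1]],
           var2 = colnames(cor_mat)[strong[, 2]],
           r = round(cor_mat[strong], 2))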
2 manipulations with the data were performed:
- log-transformation of all variables except danceability (already distributed normally) and loudness (has negative values);
- standardization (scaling) of all variables.
# Log-transformation
songs_dataset_final = songs_dataset_corr
# Transforming all columns except danceability (2) and loudness (5);
# a tiny offset is added so that zero values do not produce -Inf
songs_dataset_final[, c(1, 3, 4, 6:11)] =
  sapply(songs_dataset_final[, c(1, 3, 4, 6:11)], function(x) log(x + 1e-11))
# Standardization (= scaling)
songs_dataset_scaled = songs_dataset_final %>% scale() %>% as.data.frame()
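As a quick sanity check (a small sketch, not part of the original pipeline), the scaled variables should now have mean ~0 and standard deviation 1:

# Sanity check: each scaled variable should have mean ~0 and sd = 1
round(colMeans(songs_dataset_scaled), 10)
apply(songs_dataset_scaled, 2, sd)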
We can perform PCA on 2 sets of variables:
- all 11 scaled features;
- only the well-correlated features identified above (danceability, valence, energy, loudness, acousticness, instrumentalness).
What's the plan? The plan is to run PCA on each of these sets and compare them by the cumulative variance of the first two principal components: the one with the higher number wins and goes on to the next steps of the analysis.
All features:
# Creating PCA
pca_scores_all = prcomp(songs_dataset_scaled)
# Looking at the results of PCA
summary(pca_scores_all)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7589 1.2994 1.1596 1.0577 0.9972 0.96041 0.84970
## Proportion of Variance 0.2812 0.1535 0.1222 0.1017 0.0904 0.08385 0.06564
## Cumulative Proportion 0.2812 0.4347 0.5570 0.6587 0.7491 0.83295 0.89859
## PC8 PC9 PC10 PC11
## Standard deviation 0.65274 0.59931 0.41512 0.39746
## Proportion of Variance 0.03873 0.03265 0.01567 0.01436
## Cumulative Proportion 0.93732 0.96997 0.98564 1.00000
Cumulative variance explained by the first two components: 43.47%.
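For transparency, the same figure can be computed directly from the component standard deviations returned by prcomp():

# Cumulative proportion of variance of the first two components, by hand
cumsum(pca_scores_all$sdev^2 / sum(pca_scores_all$sdev^2))[2]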
Well-correlated features:
# Creating PCA
pca_scores_wf = prcomp(
songs_dataset_scaled[
c("danceability", "valence", "energy",
"loudness", "acousticness", "instrumentalness")])
# Looking at the results of PCA
summary(pca_scores_wf)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.5617 1.2179 0.9422 0.76872 0.64516 0.42774
## Proportion of Variance 0.4065 0.2472 0.1480 0.09849 0.06937 0.03049
## Cumulative Proportion 0.4065 0.6537 0.8016 0.90014 0.96951 1.00000
Cumulative variance explained by the first two components: 65.37%.
Verdict: in the case of only well-correlated features we get a higher share of variance explained by the first two components than with all features (65.37% vs 43.47%, respectively). It means that in the further analysis we will use the dataset with only well-correlated features.
To decide how many principal components to focus on, let's build a scree plot of the proportion of variance explained by each component:
fviz_screeplot(pca_scores_wf, addlabels = TRUE)
Description: there is a noticeable drop in the proportion of explained variance from the first PC to the third. But this drop is not sharp enough to easily discard the remaining components. So, if the picture with the first 3 PCs is not clear, we can easily take a look at the other components as well.
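As a complementary rule of thumb, one could also check the Kaiser criterion (keep components with eigenvalue > 1); a small sketch using factoextra, which is already loaded. Judging by the standard deviations above, the first two eigenvalues exceed 1 and the third is close to it (~0.89), which is broadly consistent with inspecting 3 components.

# Eigenvalues of the components (Kaiser criterion: eigenvalue > 1)
get_eigenvalue(pca_scores_wf)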
In the previous section it was decided to consider the first 3 principal components. So, for each pair of components, let's do 2 things:
- draw the correlation circle for the variables;
- build a biplot with both observations and variable loadings.
Note: due to the huge number of observations, even a minimal alpha value (0.002) doesn't allow seeing all the variables clearly, so next to the usual plot with observations we also build the plot without them.
# Creating graphs
first_second_PCA = PCA(
songs_dataset_scaled[
c("danceability", "valence", "energy",
"loudness", "acousticness", "instrumentalness")],
graph = F,
axes = c(1, 2),
ncp = 6)
first_second_PCs_circles = plot.PCA(first_second_PCA, choix = "varcor")
first_second_PCs_obs =
ggbiplot(pca_scores_wf, ellipse = TRUE, alpha = 0.002, choices = c(1, 2)) +
theme_bw() + xlim(-4, 4) + ylim(-6, 6)
# alpha = 0 fully hides the observation points, leaving only the loadings
first_second_PCs =
  ggbiplot(pca_scores_wf, ellipse = TRUE, alpha = 0, choices = c(1, 2)) +
  theme_bw() + xlim(-4, 4) + ylim(-6, 6)
# Plotting graphs
ggarrange(first_second_PCs_obs, first_second_PCs)
Description: all variables are pretty close to both components at the same time. Still, we can say which one they are closer to: energy, loudness, valence, and danceability are closer to the 1st component (X-axis), while instrumentalness and acousticness are closer to the 2nd component (Y-axis).
# Creating graphs
first_third_PCA = PCA(
songs_dataset_scaled[
c("danceability", "valence", "energy",
"loudness", "acousticness", "instrumentalness")],
graph = F,
axes = c(1, 3),
ncp = 6)
first_third_PCs_circles = plot.PCA(first_third_PCA, choix = "varcor")
first_third_PCs_obs =
ggbiplot(pca_scores_wf, ellipse = TRUE, alpha = 0.002, choices = c(1, 3)) +
theme_bw() + xlim(-4, 4) + ylim(-6, 6)
# alpha = 0 fully hides the observation points, leaving only the loadings
first_third_PCs =
  ggbiplot(pca_scores_wf, ellipse = TRUE, alpha = 0, choices = c(1, 3)) +
  theme_bw() + xlim(-4, 4) + ylim(-6, 6)
# Plotting graphs
ggarrange(first_third_PCs_obs, first_third_PCs)
Description: 5/6 variables (energy,
loudness, valence, danceability,
acousticness) are noticeably closer to the 1st component
(X-axis), and only instrumentalness is closer to the 3rd
component (Y-axis).
# Creating graphs
second_third_PCA = PCA(
songs_dataset_scaled[
c("danceability", "valence", "energy",
"loudness", "acousticness", "instrumentalness")],
graph = F,
axes = c(2, 3),
ncp = 6)
second_third_PCs_circles = plot.PCA(second_third_PCA, choix = "varcor")
second_third_PCs_obs =
ggbiplot(pca_scores_wf, ellipse = TRUE, alpha = 0.002, choices = c(2, 3)) +
theme_bw() + xlim(-4, 4) + ylim(-6, 6)
# alpha = 0 fully hides the observation points, leaving only the loadings
second_third_PCs =
  ggbiplot(pca_scores_wf, ellipse = TRUE, alpha = 0, choices = c(2, 3)) +
  theme_bw() + xlim(-4, 4) + ylim(-6, 6)
# Plotting graphs
ggarrange(second_third_PCs_obs, second_third_PCs)
Description: we see a picture quite similar to the previous plot. Again, 5 of 6 variables (energy, loudness, valence, danceability, acousticness) are noticeably closer to one component, here the 2nd (X-axis), and only instrumentalness is closer to the 3rd component (Y-axis).
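To complement the visual judgments above with numbers, we can plot each variable's contribution to the first three components; a sketch using fviz_contrib() from factoextra, which is already loaded:

# Contribution of each variable to PC1, PC2, and PC3 (in %)
ggarrange(fviz_contrib(first_second_PCA, choice = "var", axes = 1),
          fviz_contrib(first_second_PCA, choice = "var", axes = 2),
          fviz_contrib(first_second_PCA, choice = "var", axes = 3),
          nrow = 1)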
# Creating table with loadings
PCs_scores = as.data.frame(pca_scores_wf$rotation[, 1:3])
PCs_scores$Variable = rownames(PCs_scores)
rownames(PCs_scores) = NULL
colnames(PCs_scores) = c("PC1 loading", "PC2 loading", "PC3 loading", "Variable")
PCs_scores[, 1:3] = sapply(PCs_scores[, 1:3], function(x) round(x, 2))
# Presenting table with loadings
PCs_scores %>%
relocate(Variable, .before = "PC1 loading") %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "bordered"), full_width = T)
| Variable | PC1 loading | PC2 loading | PC3 loading |
|---|---|---|---|
| danceability | -0.32 | -0.40 | -0.54 |
| valence | -0.40 | -0.34 | -0.31 |
| energy | -0.56 | 0.22 | 0.01 |
| loudness | -0.56 | 0.18 | 0.22 |
| acousticness | 0.22 | -0.67 | 0.05 |
| instrumentalness | 0.24 | 0.44 | -0.75 |
Description: danceability and valence contribute noticeably, though not greatly, to all 3 components in a similarly negative manner. Then differences start to appear: energy and loudness have considerable negative contributions to PC1 (-0.56), while their contributions to the other 2 components are rather small and positive (from 0.01 to 0.22). As for acousticness, it has a substantial negative contribution to PC2 (-0.67), while its contributions to the other 2 PCs are positive and not really noticeable (0.22 to PC1 and 0.05 to PC3). The last variable, instrumentalness, has a big negative contribution to PC3 (-0.75), a middle-sized positive contribution to PC2 (0.44), and, again, a positive but now small contribution to PC1 (0.24).
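The composition discussed next can also be extracted programmatically from the rotation matrix; a small sketch:

# Variables passing the |0.30| loading cut-off, per component
apply(pca_scores_wf$rotation[, 1:3], 2,
      function(x) names(x)[abs(x) >= 0.30])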
If we take a cut-off point at the level of |0.30|, we see the following composition structure:
- PC1: danceability, valence, energy, loudness;
- PC2: danceability, valence, acousticness, instrumentalness;
- PC3: danceability, valence, instrumentalness.
If we try to give a meaning to this structure, we can suppose the following:
all components have danceability and valence in their structure. I guess the combination of these 2 features describes the overall mood (vibe) of the song, and the variables that are exclusive to a PC can be treated as a condition for this mood. Going into details, PC1 is about song mood conditioned by the degree to which the song is emotionally charged (more or less energetic / loud). At the same time, PC2 and PC3 are about more technical conditions: they describe how rich the song is in acoustic instruments (acousticness) and vocals (instrumentalness). Again, it's just a guess; I cannot say it for sure without deeper analysis.
Why do we see such unfocused and overlapping components? My best guess is the following: there are more complex relationships under the hood, not purely linear ones as PCA assumes. It would be great to check these relationships by looking at scatterplots of danceability / valence against all the other variables, as sketched below.
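A sketch of what such a check could look like; the sample size and the choice of danceability as the anchor variable are my assumptions, not part of the original analysis:

# Danceability against the other well-correlated features on a 5,000-song
# sample, with a smoother to surface possible non-linear patterns
set.seed(42)
songs_dataset_corr %>%
  slice_sample(n = 5000) %>%
  pivot_longer(c(valence, energy, loudness, acousticness, instrumentalness)) %>%
  ggplot(aes(value, danceability)) +
  geom_point(alpha = 0.1) +
  geom_smooth() +
  facet_wrap(~name, scales = "free_x")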
In our dataset we have 2 types of songs: non-explicit (safe for children) and explicit (including content that can be harmful for children: violence, drug abuse, swearing, etc.). Using common-sense logic, we can assume that explicit songs imply the violation of certain norms, and such violation can be expressed in the mood of the song. So, explicit and non-explicit songs may differ by our main components, which in one way or another represent the mood of a song. It is therefore interesting to check whether explicit and non-explicit songs differ by our main PCs. We will do it by performing the following steps:
- plotting observations in the space of the components, grouped by explicitness;
- running t-tests on component scores between the two groups;
- comparing any significant difference with the IQR of the component.
Note: for the simplicity of analysis, we will focus only on the first 2 PCs (as PC2 and PC3 are quite similar).
# Creating plot
ggbiplot(pca_scores_wf, ellipse = TRUE, alpha = 0.002, groups = songs_dataset$explicit) +
  theme_bw() +
  ggtitle("PCA with songs explicitness") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_discrete(name = "Explicit") +
  xlim(-9, 9) + ylim(-6, 6)
Description: call the police, someone shot the plot… Seriously though, there is such a huge number of tracks that the explicit ones are barely visible, overlapped by the non-explicit ones. Fortunately, we can still assess the spread of values by looking at the ellipses describing it. Actually, we do not see serious differences; only one thing is noticeable: explicit songs vary less along PC1 (mood conditioned by emotional charge). But this is probably caused by the difference in sample sizes: we have ~10k explicit songs and ~104k non-explicit ones.
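One possible way around the overplotting, sketched below, is to give each group its own panel and use hexagonal binning (this assumes the hexbin package, which ggplot2's geom_hex() relies on, is installed):

# Hexagonal binning per group avoids one group hiding the other
as.data.frame(pca_scores_wf$x) %>%
  mutate(explicit = songs_dataset$explicit) %>%
  ggplot(aes(PC1, PC2)) +
  geom_hex(bins = 60) +
  facet_wrap(~explicit) +
  theme_bw()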
Now, let’s conduct T-tests and find out if there are indeed no differences between explicit and non-explicit songs.
# Extracting component scores in a convenient form
components = as.data.frame(pca_scores_wf$x)
PC1_t = t.test(components$PC1 ~ songs_dataset$explicit)
PC1_t
##
## Welch Two Sample t-test
##
## data: components$PC1 by songs_dataset$explicit
## t = 81.735, df = 20316, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group False and group True is not equal to 0
## 95 percent confidence interval:
## 0.7071480 0.7418974
## sample estimates:
## mean in group False mean in group True
## 0.06194669 -0.66257600
Description: p-value < 0.001, the difference is statistically significant. We have -0.66 for explicit songs and 0.06 for non-explicit songs. Taking into account that all key features of PC1 contribute negatively, we can read this difference as follows: explicit songs are more emotionally charged than non-explicit ones.
With a sample like ours, even the tiniest difference would come out significant. So, it is important to assess the size of the difference. We can do it by comparing it with the IQR (chosen because it does not include extreme values):
cat(paste0(
" IQR of PC1: ", round(IQR(components$PC1), 2)), "\n",
"Difference between groups: ",
round((abs(PC1_t$estimate[2]) - abs(PC1_t$estimate[1])), 2))
## IQR of PC1: 1.38
## Difference between groups: 0.6
Description: taken in the context of the IQR, the difference is quite pronounced.
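As an alternative effect-size check (a sketch computed by hand, not part of the original analysis), Cohen's d relates the mean difference to the pooled standard deviation of the two groups:

# Cohen's d for PC1 scores between non-explicit and explicit songs
grp = split(components$PC1, songs_dataset$explicit)
n1 = length(grp[[1]]); n2 = length(grp[[2]])
pooled_sd = sqrt(((n1 - 1) * var(grp[[1]]) + (n2 - 1) * var(grp[[2]])) /
                   (n1 + n2 - 2))
(mean(grp[[1]]) - mean(grp[[2]])) / pooled_sd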
# Component scores were already extracted above, so we reuse them here
PC2_t = t.test(components$PC2 ~ songs_dataset$explicit)
PC2_t
##
## Welch Two Sample t-test
##
## data: components$PC2 by songs_dataset$explicit
## t = 5.7185, df = 11220, p-value = 1.102e-08
## alternative hypothesis: true difference in means between group False and group True is not equal to 0
## 95 percent confidence interval:
## 0.05363125 0.10957439
## sample estimates:
## mean in group False mean in group True
## 0.006977041 -0.074625778
Description: p-value < 0.001, the difference is statistically significant. We have -0.07 for explicit songs and 0.01 for non-explicit songs. Now, again, let's see if this difference is considerable compared to the IQR:
cat(paste0(
" IQR of PC2: ", round(IQR(components$PC2), 2)), "\n",
"Difference between groups: ",
round((abs(PC2_t$estimate[2]) - abs(PC2_t$estimate[1])), 2))
## IQR of PC2: 1.47
## Difference between groups: 0.07
Description: taken in the context of the IQR, the difference is almost negligible. It is significant only because of the huge sample size, so there is no point in further interpretation of this difference.
This analysis can be concluded in the following way:
- the raw audio features contain clearly redundant subsets: danceability ~ valence and energy ~ loudness ~ acousticness (plus loudness ~ instrumentalness);
- PCA on the 6 well-correlated features compresses them into 3 interpretable components, with the first two explaining ~65% of the variance, versus ~43% when all 11 features are used;
- explicit and non-explicit songs differ noticeably along PC1 (explicit songs are more emotionally charged), while the difference along PC2 is negligible in practical terms.