# Libraries
library(tidyverse)
library(corrplot)
library(ggbiplot)
library(ggpubr)
library(factoextra)
library(FactoMineR)
library(psych)
library(kableExtra)
# Importing dataset
songs_dataset = read.csv("spotify_data.csv") %>% select(-X)
What has been done: the dataset was imported and the technical index column X was dropped.
Note: there was nothing else in the dataset that needed to be cleaned up.
The focus of this project is Spotify. I guess there is no need to introduce this company, but just in case: Spotify is the most popular audio-streaming service, with ~489 million users (Spotify, 2023).
Now, let's go into the details. To maintain such a huge audience, you, as a company, should have something really valuable to offer. And Spotify has. Its music collection contains a mind-boggling number of tracks: over 100 million (Spotify, 2023). But it's not enough to take a tourist to a garden full of wonders; you need to help them navigate this garden, i.e. suggest what to enjoy. Here, Spotify also has something to offer: a fancy recommendation system highlighted by industry experts like Ural Garrett (2022).
To make the recommendation system work, Spotify builds really advanced models on many different sorts of data (Dmitry Pastukhov, 2022). Some of this data is attached to users, the rest to the tracks themselves. The latter type is partly public: using the Spotify API (https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features) it is possible to get raw audio features for each song. There are ~13 features available, but does Spotify need all of them to build great models? Could it be that feature A and feature B are very similar to each other? I did not find any public discussion from Spotify on this issue, so I am raising it right now.
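For illustration, here is a hedged sketch of how such features could be pulled in R. It assumes the spotifyr package, valid API credentials stored in environment variables, and that the audio-features endpoint is still available; none of this is part of the analysis below.

# Sketch: fetching raw audio features for one track via the Spotify API
# (assumes the spotifyr package and registered API credentials)
library(spotifyr)
Sys.setenv(SPOTIFY_CLIENT_ID = "your_client_id",
           SPOTIFY_CLIENT_SECRET = "your_client_secret")
access_token = get_spotify_access_token()
# track_id comes from the dataset introduced later in this report
get_track_audio_features(songs_dataset$track_id[1])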
Well, there are 2 reasons why this issue is important: money and model quality. As for the money, you already know how many songs Spotify has (over 100 million). All these millions of songs have features stored somewhere. Of course, this storage is not free: Spotify has to pay for maintaining and updating the facilities that hold these features. And I guess keeping even one feature for over 100 million songs can cost a round sum, so reducing the number of stored features could save a lot of money.
The second thing is model quality: if 2 features correlate strongly with each other, they can harm model estimates due to multicollinearity. Of course, I don't think Spotify's data scientists fail to account for this; I'm talking about risks, which grow simply from having such highly correlated variables in the data. For example, you could mistakenly include both features together or forget to exclude one of them, and the consequences of such a situation could be critical for the company. So, at Spotify's scale, it's better to reduce risks whenever possible, especially when the price is just several hours of student work, as in the case of the analysis below.
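To make this risk concrete, here is a small illustrative check one could run on the dataset introduced below. The car package is an assumption (it is not used elsewhere in this report): if two predictors are highly correlated, their variance inflation factors in a regression will be inflated.

# Sketch: variance inflation factors for a toy model predicting popularity
# (the car package is assumed; high VIFs flag multicollinearity)
library(car)
vif(lm(popularity ~ energy + loudness + acousticness + valence,
       data = songs_dataset))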
PCA, or, in full, principal component analysis (together with preparatory steps like building a correlation matrix), is a solution for Spotify. Why? This analysis can identify similar features and significantly reduce their number: an ideal fit for the stated issue! It should also be noted that this analysis is quite distinctive: no other method known to the author does the same or anything similar.
Now, let’s do that PCA!
For its findings to generalize well, an analysis should be done on data that covers as many options as possible. Also, as far as I know, PCA is not subject to the overfitting problem. With all these considerations in mind, I decided to choose the largest and most diverse set of songs and their features available. The chosen dataset contains 114,000 songs across 125 different genres.
Link to the dataset: https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset
In the chosen dataset there are 20 variables.
5 variables represent song metadata. All of them except the last one are nominal; the last one, popularity, is quantitative:
- track_id: track identifier on the Spotify platform;
- track_name: name of the track;
- album_name: name of the album the track belongs to;
- artists: creator(s) of the track;
- popularity: the degree of track popularity. An algorithm calculates this value based on the track's total plays and the recency of these plays (from 0 to 100).
Next, we have qualitative features of the track:
- explicit: presence of explicit lyrics (swearing, violence, etc.) in the track;
- track_genre: genre the track belongs to.
And, finally, raw audio features:
- duration_ms: length of the track (in milliseconds);
- danceability: the degree to which the track is suitable for dancing, based on a combination of musical elements including rhythm stability, beat strength, and overall regularity (from 0.0 to 1.0);
- energy: perceived measure of intensity and activity; energetic songs often feel fast, loud, and cacophonous (from 0.0 to 1.0);
- key: the key the track is in. For example, 0 corresponds to C, 1 corresponds to C♯/D♭, 2 corresponds to D, and so on;
- loudness: overall loudness of the track (in decibels);
- mode: modality of the track (1 = major, 0 = minor);
- speechiness: the degree of spoken words in the track; a higher value indicates that the recording is more speech-like (from 0.0 to 1.0);
- acousticness: confidence degree of whether the track is acoustic (from 0.0 to 1.0);
- instrumentalness: confidence degree of whether the track contains no vocals (from 0.0 to 1.0);
- liveness: confidence degree of whether the track was performed live; a higher value means an increased probability of a live performance (from 0.0 to 1.0);
- valence: the degree of positive or negative emotional content conveyed by the track. High valence indicates a more positive, happy, or euphoric sound; low valence corresponds to a more negative, sad, or angry sound (from 0.0 to 1.0);
- tempo: overall estimated tempo of the track (in beats per minute);
- time_signature: measure representing how many beats are in each bar (from 3 to 7).
These raw audio features are the raw material for PCA.
In this part we will do 2 things:
- select the raw audio features needed for PCA;
- look at their distributions (histograms, density plots, skew and kurtosis) to decide which variables need transformation.
# Choosing necessary variables
songs_dataset_af = songs_dataset %>%
select(duration_ms, danceability:time_signature)
# Looking at chosen variables
str(songs_dataset_af)
## 'data.frame': 114000 obs. of 13 variables:
## $ duration_ms : int 230666 149610 210826 201933 198853 214240 229400 242946 189613 205594 ...
## $ danceability : num 0.676 0.42 0.438 0.266 0.618 0.688 0.407 0.703 0.625 0.442 ...
## $ energy : num 0.461 0.166 0.359 0.0596 0.443 0.481 0.147 0.444 0.414 0.632 ...
## $ key : int 1 1 0 0 2 6 2 11 0 1 ...
## $ loudness : num -6.75 -17.23 -9.73 -18.52 -9.68 ...
## $ mode : int 0 1 1 1 1 1 1 1 1 1 ...
## $ speechiness : num 0.143 0.0763 0.0557 0.0363 0.0526 0.105 0.0355 0.0417 0.0369 0.0295 ...
## $ acousticness : num 0.0322 0.924 0.21 0.905 0.469 0.289 0.857 0.559 0.294 0.426 ...
## $ instrumentalness: num 1.01e-06 5.56e-06 0.00 7.07e-05 0.00 0.00 2.89e-06 0.00 0.00 4.19e-03 ...
## $ liveness : num 0.358 0.101 0.117 0.132 0.0829 0.189 0.0913 0.0973 0.151 0.0735 ...
## $ valence : num 0.715 0.267 0.12 0.143 0.167 0.666 0.0765 0.712 0.669 0.196 ...
## $ tempo : num 87.9 77.5 76.3 181.7 119.9 ...
## $ time_signature : int 4 4 4 3 4 4 3 4 4 4 ...
# Renaming the "key" variable so it is not confused with the "key" column created by gather()
songs_dataset_hist = songs_dataset_af %>%
dplyr::rename("key_var" = "key")
# Building histograms
ggplot(
gather(songs_dataset_hist), aes(value)) +
geom_histogram(bins = 25) +
facet_wrap(~key, scales = 'free_x') +
ylab("Number of songs") +
xlab("Raw audio feature") +
ggtitle("Histograms for raw audio features")
# Building density plots
ggplot(
gather(songs_dataset_hist), aes(value)) +
geom_density() +
facet_wrap(~key, scales = 'free_x') +
ylab("Density of songs") +
xlab("Raw audio feature") +
ggtitle("Density plots for raw audio features")
# Looking at skew & kurtosis of variables in focus
describe(songs_dataset_af)[11:12]
## skew kurtosis
## duration_ms 11.19 354.93
## danceability -0.40 -0.18
## energy -0.60 -0.53
## key -0.01 -1.28
## loudness -2.01 5.90
## mode -0.57 -1.67
## speechiness 4.65 28.82
## acousticness 0.73 -0.95
## instrumentalness 1.73 1.27
## liveness 2.11 4.38
## valence 0.12 -1.03
## tempo 0.23 -0.11
## time_signature -4.10 26.01
What we can see from the distributions:
- mode and time_signature are not numeric, but factor / ordinal with 2 and 6 levels respectively. Because of this, we exclude them from further analysis;
- the only variable distributed close to normally is danceability. In the case of the others, it would be good to perform a log-transformation.
Now, let's find out which variables are well correlated and which are not. This will help us understand which variables should be kept and which should be dropped before the PCA.
# Removing ordinal variables
songs_dataset_corr = songs_dataset_af %>%
select(-mode, -time_signature)
# Creating circle correlation matrix
songs_dataset_corr %>%
cor() %>%
corrplot(diag = FALSE, order = 'FPC')
# Creating number correlation matrix
songs_dataset_corr %>%
cor() %>%
corrplot(diag = FALSE, order = 'FPC', method = "number", number.cex = 0.8)
We have 2 sets of non-weak correlations (|r| > 0.4):
- danceability ~ valence;
- energy ~ loudness ~ acousticness, loudness ~ instrumentalness.
It can be supposed that for each of these sets we will get a principal component that describes well the variance of the variables belonging to that set.
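As a cross-check of the pairs read off the plots above, the same correlations can be listed programmatically; a small sketch:

# Listing all variable pairs with |r| > 0.4
cor_mat = cor(songs_dataset_corr)
cor_mat[upper.tri(cor_mat, diag = TRUE)] = NA   # keep each pair only once
strong = which(abs(cor_mat) > 0.4, arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[strong[, 1]],
           var2 = colnames(cor_mat)[strong[, 2]],
           r = round(cor_mat[strong], 2))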
2 manipulations with the data were performed:
- log-transformation of all variables except danceability (already distributed normally) and loudness (has negative values);
- standardization (scaling) of all variables.
# Log-transformation
songs_dataset_final = songs_dataset_corr
# Transforming all columns except danceability (2) and loudness (5);
# a tiny offset is added so that zero values do not produce -Inf
songs_dataset_final[, c(1, 3, 4, 6:11)] =
  sapply(songs_dataset_final[, c(1, 3, 4, 6:11)], function(x) log(x + 1e-11))
# Standardization (= scaling)
songs_dataset_scaled = songs_dataset_final %>% scale() %>% as.data.frame()
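As a quick sanity check (a small sketch, not part of the original pipeline), the scaled variables should now have mean ~0 and standard deviation 1:

# Sanity check: each scaled variable should have mean ~0 and sd = 1
round(colMeans(songs_dataset_scaled), 10)
apply(songs_dataset_scaled, 2, sd)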
We can perform PCA on 2 sets of variables:
- all 11 scaled features;
- only the well-correlated features identified above (danceability, valence, energy, loudness, acousticness, instrumentalness).
What's the plan? The plan is to run PCA on each of these sets and compare them by the cumulative variance of the first two principal components: the one with the higher number wins and goes on to the next steps of the analysis.
All features:
# Creating PCA
pca_scores_all = prcomp(songs_dataset_scaled)
# Looking at the results of PCA
summary(pca_scores_all)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7589 1.2994 1.1596 1.0577 0.9972 0.96041 0.84970
## Proportion of Variance 0.2812 0.1535 0.1222 0.1017 0.0904 0.08385 0.06564
## Cumulative Proportion 0.2812 0.4347 0.5570 0.6587 0.7491 0.83295 0.89859
## PC8 PC9 PC10 PC11
## Standard deviation 0.65274 0.59931 0.41512 0.39746
## Proportion of Variance 0.03873 0.03265 0.01567 0.01436
## Cumulative Proportion 0.93732 0.96997 0.98564 1.00000
Cumulative variance explained by the first two components: 43.47%.
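For transparency, the same figure can be computed directly from the component standard deviations returned by prcomp():

# Cumulative proportion of variance of the first two components, by hand
cumsum(pca_scores_all$sdev^2 / sum(pca_scores_all$sdev^2))[2]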
Well-correlated features:
# Creating PCA
pca_scores_wf = prcomp(
songs_dataset_scaled[
c("danceability", "valence", "energy",
"loudness", "acousticness", "instrumentalness")])
# Looking at the results of PCA
summary(pca_scores_wf)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.5617 1.2179 0.9422 0.76872 0.64516 0.42774
## Proportion of Variance 0.4065 0.2472 0.1480 0.09849 0.06937 0.03049
## Cumulative Proportion 0.4065 0.6537 0.8016 0.90014 0.96951 1.00000
Cumulative variance explained by the first two components: 65.37%.
Verdict: in the case of only well-correlated features we get a higher share of variance explained by the first two components than with all features (65.37% vs 43.47%, respectively). It means that in the further analysis we will use the dataset with only well-correlated features.
To decide how many principal components to focus on, let's build a scree plot of the proportion of variance explained by each component:
fviz_screeplot(pca_scores_wf, addlabels = TRUE)
Description: there is a noticeable drop in the proportion of explained variance from the first PC to the third. But this drop is not sharp enough to easily discard the remaining components. So, if the picture with the first 3 PCs is not clear, we can easily take a look at the other components as well.
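As a complementary rule of thumb, one could also check the Kaiser criterion (keep components with eigenvalue > 1); a small sketch using factoextra, which is already loaded. Judging by the standard deviations above, the first two eigenvalues exceed 1 and the third is close to it (~0.89), which is broadly consistent with inspecting 3 components.

# Eigenvalues of the components (Kaiser criterion: eigenvalue > 1)
get_eigenvalue(pca_scores_wf)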
In the previous section it was decided to consider the first 3 principal components. So, for each pair of components, let's do 2 things:
- draw the correlation circle for the variables;
- build a biplot with both observations and variable loadings.
Note: due to the huge number of observations, even a minimal alpha value (0.002) doesn't allow seeing all the variables clearly, so next to the usual plot with observations we also build the plot without them.
# Creating graphs
first_second_PCA = PCA(
songs_dataset_scaled[
c("danceability", "valence", "energy",
"loudness", "acousticness", "instrumentalness")],
graph = F,
axes = c(1, 2),
ncp = 6)
first_second_PCs_circles = plot.PCA(first_second_PCA, choix = "varcor")
first_second_PCs_obs =
ggbiplot(pca_scores_wf, ellipse = TRUE, alpha = 0.002, choices = c(1, 2)) +
theme_bw() + xlim(-4, 4) + ylim(-6, 6)
# alpha = 0 fully hides the observation points, leaving only the loadings
first_second_PCs =
  ggbiplot(pca_scores_wf, ellipse = TRUE, alpha = 0, choices = c(1, 2)) +
  theme_bw() + xlim(-4, 4) + ylim(-6, 6)
# Plotting graphs
ggarrange(first_second_PCs_obs, first_second_PCs)
Description: all variables are pretty close to both components at the same time. Still, we can say which one they are closer to: energy, loudness, valence, and danceability are closer to the 1st component (X-axis), while instrumentalness and acousticness are closer to the 2nd component (Y-axis).
# Creating graphs
first_third_PCA = PCA(
songs_dataset_scaled[
c("danceability", "valence", "energy",
"loudness", "acousticness", "instrumentalness")],
graph = F,
axes = c(1, 3),
ncp = 6)
first_third_PCs_circles = plot.PCA(first_third_PCA, choix = "varcor")
first_third_PCs_obs =
ggbiplot(pca_scores_wf, ellipse = TRUE, alpha = 0.002, choices = c(1, 3)) +
theme_bw() + xlim(-4, 4) + ylim(-6, 6)
# alpha = 0 fully hides the observation points, leaving only the loadings
first_third_PCs =
  ggbiplot(pca_scores_wf, ellipse = TRUE, alpha = 0, choices = c(1, 3)) +
  theme_bw() + xlim(-4, 4) + ylim(-6, 6)
# Plotting graphs
ggarrange(first_third_PCs_obs, first_third_PCs)
Description: 5/6 variables (energy,
loudness, valence, danceability,
acousticness) are noticeably closer to the 1st component
(X-axis), and only instrumentalness is closer to the 3rd
component (Y-axis).
# Creating graphs
second_third_PCA = PCA(
songs_dataset_scaled[
c("danceability", "valence", "energy",
"loudness", "acousticness", "instrumentalness")],
graph = F,
axes = c(2, 3),
ncp = 6)
second_third_PCs_circles = plot.PCA(second_third_PCA, choix = "varcor")
second_third_PCs_obs =
ggbiplot(pca_scores_wf, ellipse = TRUE, alpha = 0.002, choices = c(2, 3)) +
theme_bw() + xlim(-4, 4) + ylim(-6, 6)
# alpha = 0 fully hides the observation points, leaving only the loadings
second_third_PCs =
  ggbiplot(pca_scores_wf, ellipse = TRUE, alpha = 0, choices = c(2, 3)) +
  theme_bw() + xlim(-4, 4) + ylim(-6, 6)
# Plotting graphs
ggarrange(second_third_PCs_obs, second_third_PCs)
Description: we see a picture quite similar to the previous plot. Again, 5 of 6 variables (energy, loudness, valence, danceability, acousticness) are noticeably closer to one component, here the 2nd (X-axis), and only instrumentalness is closer to the 3rd component (Y-axis).
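To complement the visual judgments above with numbers, we can plot each variable's contribution to the first three components; a sketch using fviz_contrib() from factoextra, which is already loaded:

# Contribution of each variable to PC1, PC2, and PC3 (in %)
ggarrange(fviz_contrib(first_second_PCA, choice = "var", axes = 1),
          fviz_contrib(first_second_PCA, choice = "var", axes = 2),
          fviz_contrib(first_second_PCA, choice = "var", axes = 3),
          nrow = 1)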
# Creating table with loadings
PCs_scores = as.data.frame(pca_scores_wf$rotation[, 1:3])
PCs_scores$Variable = rownames(PCs_scores)
rownames(PCs_scores) = NULL
colnames(PCs_scores) = c("PC1 loading", "PC2 loading", "PC3 loading", "Variable")
PCs_scores[, 1:3] = sapply(PCs_scores[, 1:3], function(x) round(x, 2))
# Presenting table with loadings
PCs_scores %>%
relocate(Variable, .before = "PC1 loading") %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "bordered"), full_width = T)
| Variable | PC1 loading | PC2 loading | PC3 loading |
|---|---|---|---|
| danceability | -0.32 | -0.40 | -0.54 |
| valence | -0.40 | -0.34 | -0.31 |
| energy | -0.56 | 0.22 | 0.01 |
| loudness | -0.56 | 0.18 | 0.22 |
| acousticness | 0.22 | -0.67 | 0.05 |
| instrumentalness | 0.24 | 0.44 | -0.75 |
Description: danceability and valence contribute noticeably, though not greatly, to all 3 components in a similarly negative manner. Then differences start to appear: energy and loudness have considerable negative contributions to PC1 (-0.56), while their contributions to the other 2 components are rather small and positive (from 0.01 to 0.22). As for acousticness, it has a substantial negative contribution to PC2 (-0.67), while its contributions to the other 2 PCs are positive and not really noticeable (0.22 to PC1 and 0.05 to PC3). The last variable, instrumentalness, has a big negative contribution to PC3 (-0.75), a middle-sized positive contribution to PC2 (0.44), and, again, a positive but now small contribution to PC1 (0.24).
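The composition discussed next can also be extracted programmatically from the rotation matrix; a small sketch:

# Variables passing the |0.30| loading cut-off, per component
apply(pca_scores_wf$rotation[, 1:3], 2,
      function(x) names(x)[abs(x) >= 0.30])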
If we take a cut-off point at the level of |0.30|, we see the following composition structure:
- PC1: danceability, valence, energy, loudness;
- PC2: danceability, valence, acousticness, instrumentalness;
- PC3: danceability, valence, instrumentalness.
If we try to give a meaning to this structure, we can suppose the following:
all components have danceability and valence in their structure. I guess the combination of these 2 features describes the overall mood (vibe) of the song, and the variables that are exclusive to a PC can be treated as a condition for this mood. Going into details, PC1 is about song mood conditioned by the degree to which the song is emotionally charged (more or less energetic / loud). At the same time, PC2 and PC3 are about more technical conditions: they describe how rich the song is in acoustic instruments (acousticness) and vocals (instrumentalness). Again, it's just a guess; I cannot say it for sure without deeper analysis.
Why do we see such unfocused and overlapping components? My best guess is the following: there are more complex relationships under the hood, not purely linear ones as PCA assumes. It would be great to check these relationships by looking at scatterplots of danceability / valence against all the other variables, as sketched below.
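A sketch of what such a check could look like; the sample size and the choice of danceability as the anchor variable are my assumptions, not part of the original analysis:

# Danceability against the other well-correlated features on a 5,000-song
# sample, with a smoother to surface possible non-linear patterns
set.seed(42)
songs_dataset_corr %>%
  slice_sample(n = 5000) %>%
  pivot_longer(c(valence, energy, loudness, acousticness, instrumentalness)) %>%
  ggplot(aes(value, danceability)) +
  geom_point(alpha = 0.1) +
  geom_smooth() +
  facet_wrap(~name, scales = "free_x")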
In our dataset we have 2 types of songs: non-explicit (safe for children) and explicit (including content that can be harmful for children: violence, drug abuse, swearing, etc.). Using common-sense logic, we can assume that explicit songs imply the violation of certain norms, and such violation can be expressed in the mood of the song. So, explicit and non-explicit songs may differ by our main components, which in one way or another represent the mood of a song. It is therefore interesting to check whether explicit and non-explicit songs differ by our main PCs. We will do it by performing the following steps:
- plotting observations in the space of the components, grouped by explicitness;
- running t-tests on component scores between the two groups;
- comparing any significant difference with the IQR of the component.
Note: for the simplicity of analysis, we will focus only on the first 2 PCs (as PC2 and PC3 are quite similar).
# Creating plot
ggbiplot(pca_scores_wf, ellipse = TRUE, alpha = 0.002, groups = songs_dataset$explicit) +
  theme_bw() +
  ggtitle("PCA with songs explicitness") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_discrete(name = "Explicit") +
  xlim(-9, 9) + ylim(-6, 6)
Description: call the police, someone shot the plot… Seriously though, there is such a huge number of tracks that the explicit ones are barely visible, overlapped by the non-explicit ones. Fortunately, we can still assess the spread of values by looking at the ellipses describing it. Actually, we do not see serious differences; only one thing is noticeable: explicit songs vary less along PC1 (mood conditioned by emotional charge). But this is probably caused by the difference in sample sizes: we have ~10k explicit songs and ~104k non-explicit ones.
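One possible way around the overplotting, sketched below, is to give each group its own panel and use hexagonal binning (this assumes the hexbin package, which ggplot2's geom_hex() relies on, is installed):

# Hexagonal binning per group avoids one group hiding the other
as.data.frame(pca_scores_wf$x) %>%
  mutate(explicit = songs_dataset$explicit) %>%
  ggplot(aes(PC1, PC2)) +
  geom_hex(bins = 60) +
  facet_wrap(~explicit) +
  theme_bw()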
Now, let’s conduct T-tests and find out if there are indeed no differences between explicit and non-explicit songs.
# Extracting component scores in a convenient form
components = as.data.frame(pca_scores_wf$x)
PC1_t = t.test(components$PC1 ~ songs_dataset$explicit)
PC1_t
##
## Welch Two Sample t-test
##
## data: components$PC1 by songs_dataset$explicit
## t = 81.735, df = 20316, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group False and group True is not equal to 0
## 95 percent confidence interval:
## 0.7071480 0.7418974
## sample estimates:
## mean in group False mean in group True
## 0.06194669 -0.66257600
Description: p-value < 0.001, the difference is statistically significant. We have -0.66 for explicit songs and 0.06 for non-explicit songs. Taking into account that all key features of PC1 contribute negatively, we can read this difference as follows: explicit songs are more emotionally charged than non-explicit ones.
With a sample like ours, even the tiniest difference would come out significant. So, it is important to assess the size of the difference. We can do it by comparing it with the IQR (chosen because it does not include extreme values):
cat(paste0(
" IQR of PC1: ", round(IQR(components$PC1), 2)), "\n",
"Difference between groups: ",
round((abs(PC1_t$estimate[2]) - abs(PC1_t$estimate[1])), 2))
## IQR of PC1: 1.38
## Difference between groups: 0.6
Description: taken in the context of the IQR, the difference is quite pronounced.
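As an alternative effect-size check (a sketch computed by hand, not part of the original analysis), Cohen's d relates the mean difference to the pooled standard deviation of the two groups:

# Cohen's d for PC1 scores between non-explicit and explicit songs
grp = split(components$PC1, songs_dataset$explicit)
n1 = length(grp[[1]]); n2 = length(grp[[2]])
pooled_sd = sqrt(((n1 - 1) * var(grp[[1]]) + (n2 - 1) * var(grp[[2]])) /
                   (n1 + n2 - 2))
(mean(grp[[1]]) - mean(grp[[2]])) / pooled_sd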
# Component scores were already extracted above, so we reuse them here
PC2_t = t.test(components$PC2 ~ songs_dataset$explicit)
PC2_t
##
## Welch Two Sample t-test
##
## data: components$PC2 by songs_dataset$explicit
## t = 5.7185, df = 11220, p-value = 1.102e-08
## alternative hypothesis: true difference in means between group False and group True is not equal to 0
## 95 percent confidence interval:
## 0.05363125 0.10957439
## sample estimates:
## mean in group False mean in group True
## 0.006977041 -0.074625778
Description: p-value < 0.001, the difference is statistically significant. We have -0.07 for explicit songs and 0.01 for non-explicit songs. Now, again, let's see if this difference is considerable compared to the IQR:
cat(paste0(
" IQR of PC2: ", round(IQR(components$PC2), 2)), "\n",
"Difference between groups: ",
round((abs(PC2_t$estimate[2]) - abs(PC2_t$estimate[1])), 2))
## IQR of PC2: 1.47
## Difference between groups: 0.07
Description: taken in the context of the IQR, the difference is almost negligible. It is significant only because of the huge sample size, so there is no point in further interpretation of this difference.
This analysis can be concluded in the following way:
- the raw audio features contain clearly redundant subsets: danceability ~ valence and energy ~ loudness ~ acousticness (plus loudness ~ instrumentalness);
- PCA on the 6 well-correlated features compresses them into 3 interpretable components, with the first two explaining ~65% of the variance, versus ~43% when all 11 features are used;
- explicit and non-explicit songs differ noticeably along PC1 (explicit songs are more emotionally charged), while the difference along PC2 is negligible in practical terms.