# loading all necessary libraries

library(readxl)
library(corrplot)
library(factoextra)
library(caret)
library(gridExtra)
library(psych)
library(ggplot2)
library(plotly)
library(ggfortify)
library(GGally)
library(Hmisc)
library(kableExtra)

Introduction

The goal of this project is to apply Principal Component Analysis (PCA) to a dataset of Spotify songs and their characteristics. PCA is a technique commonly used for dimensionality reduction in datasets with many features. It transforms the original features into a smaller set of new, uncorrelated features called principal components, while preserving as much of the variability (information) in the data as possible.
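As a minimal sketch of this idea, the snippet below uses R's built-in `USArrests` data (not the Spotify set, purely for illustration). It obtains the principal directions as eigenvectors of the correlation matrix and checks the two properties mentioned above: the new features are uncorrelated, and together they keep all of the original variance.

```r
# Illustration on a built-in dataset (an assumption, not the data used in this project)
X <- scale(USArrests)                  # center and scale each column

# Principal directions = eigenvectors of the correlation matrix
eig <- eigen(cor(USArrests))
scores <- X %*% eig$vectors            # project observations onto the components

# Components are uncorrelated: off-diagonal correlations are numerically zero
round(cor(scores), 10)

# Total variance is preserved: component variances sum to the number of variables
total_var <- sum(apply(scores, 2, var))
total_var
```

Keeping only the first few columns of `scores` is what turns this rotation into a dimension reduction.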

Review of the data

The paper uses a dataset of songs from a Spotify playlist and their characteristics. The data comes from the Kaggle website and can be found under the name “Spotify Playlist-ORIGINS.” After the relevant columns are selected, the dataset contains 265 observations and 12 variables; basic statistics for each variable are presented below.

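# loading the dataset as data and selecting only the necessary columns

options(scipen=999)                     # avoid scientific notation in printed output
data <- read_excel("spotify.xlsx")
r_name <- data$`Track Name`             # track names, used later as row labels
# keep the 12 numeric features; a plain data.frame supports row names (on a tibble they are deprecated)
data_num <- as.data.frame(data[,c(6, 7, 12:15, 17:22)])
rownames(data_num) <- r_name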


summary(data_num)
##  Duration (ms)      Popularity     Danceability        Energy     
##  Min.   :127931   Min.   : 0.00   Min.   :0.1740   Min.   :0.152  
##  1st Qu.:195320   1st Qu.:15.00   1st Qu.:0.4420   1st Qu.:0.524  
##  Median :218040   Median :31.00   Median :0.5360   Median :0.637  
##  Mean   :219548   Mean   :31.71   Mean   :0.5338   Mean   :0.622  
##  3rd Qu.:238266   3rd Qu.:47.00   3rd Qu.:0.6200   3rd Qu.:0.766  
##  Max.   :485333   Max.   :83.00   Max.   :0.8900   Max.   :0.970  
##       Key            Loudness        Speechiness       Acousticness      
##  Min.   : 0.000   Min.   :-16.550   Min.   :0.02430   Min.   :0.0000133  
##  1st Qu.: 2.000   1st Qu.: -7.937   1st Qu.:0.03170   1st Qu.:0.0198000  
##  Median : 6.000   Median : -6.353   Median :0.03890   Median :0.1210000  
##  Mean   : 5.509   Mean   : -6.841   Mean   :0.04963   Mean   :0.2599463  
##  3rd Qu.: 8.000   3rd Qu.: -5.254   3rd Qu.:0.05230   3rd Qu.:0.4430000  
##  Max.   :11.000   Max.   : -1.395   Max.   :0.28400   Max.   :0.9350000  
##  Instrumentalness       Liveness         Valence           Tempo       
##  Min.   :0.0000000   Min.   :0.0304   Min.   :0.0370   Min.   : 65.53  
##  1st Qu.:0.0000000   1st Qu.:0.0973   1st Qu.:0.2300   1st Qu.: 88.00  
##  Median :0.0000108   Median :0.1180   Median :0.3840   Median :100.10  
##  Mean   :0.0306197   Mean   :0.1611   Mean   :0.3954   Mean   :117.96  
##  3rd Qu.:0.0005750   3rd Qu.:0.1900   3rd Qu.:0.5210   3rd Qu.:151.98  
##  Max.   :0.9420000   Max.   :0.6920   Max.   :0.9750   Max.   :202.00

The basic statistics of each variable are presented above. As we can observe, the variables differ by several orders of magnitude (compare, for example, Duration, measured in hundreds of thousands of milliseconds, with Acousticness, which lies between 0 and 1). Because PCA looks for directions of maximum variance, variables with large numeric ranges would dominate the components and distort the results. It was therefore decided to standardize the variables.

preproc <- preProcess(data_num, method=c("center", "scale"))
data_norm <- predict(preproc, data_num)
kable(head(data_norm))
| Duration (ms)| Popularity| Danceability|     Energy|        Key|   Loudness| Speechiness| Acousticness| Instrumentalness|   Liveness|    Valence|      Tempo|
|-------------:|----------:|------------:|----------:|----------:|----------:|-----------:|------------:|----------------:|----------:|----------:|----------:|
|    -0.4321688|  1.1098094|    0.4297122| -0.0639959| -1.5644244|  0.3976705|  -0.1263595|   -0.8546238|       -0.2332278| -0.8458215| -1.3717324| -1.1659985|
|     0.1013924|  0.3472763|    1.9178646| -0.3896815| -1.5644244|  0.3910216|  -0.2578149|   -0.8172455|       -0.2321995|  0.5343138|  1.5822328| -0.6872256|
|    -0.6268363|  1.4910760|    0.2869097|  1.0759037| -1.5644244|  0.6204089|   0.2321552|   -0.6125379|       -0.2332278| -0.8923874|  1.1488518|  1.1689991|
|    -0.9200203| -0.6058901|    1.1587566| -0.0349168| -1.2804707| -0.2048867|  -0.1353224|   -0.8598638|       -0.2297164| -0.3970051| -0.9483141| -0.8591569|
|    -1.2298165|  0.2996180|   -0.1414979| -1.2329746| -1.5644244| -0.1163731|  -0.2548273|    1.0307121|       -0.2329171| -0.5753427| -0.7440770| -0.9204696|
|     0.1773038|  0.1089847|   -0.3970392| -0.5699717|  0.4232518| -0.0806352|  -0.5804781|   -0.7980323|       -0.2332196|  0.7324668| -1.1625139| -1.1638028|
dim(data_norm)
## [1] 265  12
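The effect of standardization can be illustrated with a small sketch. The two synthetic features below (`duration_ms` and `acousticness`) are made-up values that only imitate the scales seen in the summary; they are not taken from the dataset. Without standardization the large-scale variable takes over the first component, while after standardization both variables enter with equal weight.

```r
set.seed(1)
# Two synthetic, independent features on very different scales (hypothetical values)
toy <- data.frame(
  duration_ms  = rnorm(200, mean = 220000, sd = 40000),
  acousticness = runif(200)
)

# Without standardization the large-scale variable dominates PC1
pca_raw <- prcomp(toy, center = TRUE, scale. = FALSE)
abs(pca_raw$rotation[, "PC1"])

# After standardization both variables enter on an equal footing
toy_std <- scale(toy)                  # base R equivalent of centering and scaling
pca_std <- prcomp(toy_std, center = FALSE, scale. = FALSE)
abs(pca_std$rotation[, "PC1"])
```

Base R's `scale()` is used here only to keep the sketch free of extra packages; it performs the same centering and scaling as the `preProcess` call above.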

Two correlation charts are presented to examine the relationship between the variables and the distribution of each variable.

In the following graphs, we can see that most of the variables are only weakly correlated. The exceptions are the strong positive correlation between the Loudness and Energy variables and the notable negative correlations between the Acousticness variable and both Loudness and Energy. The second graph also shows the distribution of each variable. As we can observe, only the distribution of the Danceability variable resembles a normal distribution.

kor<-cor(data_norm)
corrplot(kor, method="color", type="lower", tl.col = 'black', tl.cex = 0.75)

ggpairs(as.data.frame(data_norm))
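Reading the strongest relationships off a colored matrix can be imprecise, so a numeric listing may help. The helper below, `strongest_pairs`, is a hypothetical function written for this sketch (it is not part of any of the loaded packages), and it is demonstrated on the built-in `mtcars` data; in this project one would pass the matrix `kor` instead.

```r
# Hypothetical helper: list variable pairs ordered by absolute correlation
strongest_pairs <- function(cor_mat, n = 3) {
  idx <- which(upper.tri(cor_mat), arr.ind = TRUE)   # take each pair only once
  pairs <- data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r    = cor_mat[idx],
    stringsAsFactors = FALSE
  )
  head(pairs[order(-abs(pairs$r)), ], n)
}

# Demonstration on a built-in dataset (illustration only)
cm  <- cor(mtcars)
top <- strongest_pairs(cm, n = 3)
top
```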

Principal Component Analysis (PCA)

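To reduce the dimensionality of the dataset, the Principal Component Analysis (PCA) method was chosen. Because the data were already standardized in the previous step, centering and scaling are switched off in the call below. The resulting rotation matrix contains the loadings of each variable on each component.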

pca1<-prcomp(data_norm, center=FALSE, scale.=FALSE)
pca1$rotation
##                           PC1         PC2         PC3          PC4         PC5
## Duration (ms)     0.086391370 -0.46850636  0.37070876  0.111584030  0.03724326
## Popularity        0.021089620  0.09414013  0.25989767  0.693160910 -0.09180853
## Danceability     -0.170894010  0.60666517  0.24711157 -0.057971544  0.19723048
## Energy           -0.536428477 -0.16863332  0.09072684  0.008798769  0.12367162
## Key              -0.005817961 -0.15480857 -0.16112581  0.481046799  0.47545519
## Loudness         -0.487588518 -0.19734460  0.10231296  0.104017059 -0.05205065
## Speechiness      -0.154440871  0.16743385 -0.49787860 -0.003976076  0.33943423
## Acousticness      0.474439469  0.18813878 -0.08329414  0.053627118 -0.09144706
## Instrumentalness  0.121438202 -0.15726077  0.15453546 -0.335101768  0.68810569
## Liveness         -0.230928715 -0.12844632  0.01998414 -0.348877450 -0.29744716
## Valence          -0.331704968  0.41646712  0.04240320  0.086184368 -0.01457413
## Tempo            -0.101522348 -0.18478218 -0.63906466  0.130710273 -0.14651918
##                          PC6         PC7         PC8         PC9        PC10
## Duration (ms)    -0.29139737  0.14187313  0.29755268 -0.63526924  0.07636888
## Popularity       -0.39953859 -0.32380046  0.09833904  0.36294767  0.15407166
## Danceability      0.04561951  0.06377205 -0.02124302 -0.18447888  0.49417906
## Energy           -0.03832814  0.08649149 -0.06632376  0.08892684 -0.15017524
## Key               0.61453727 -0.27054335  0.01146402 -0.19328586 -0.01252928
## Loudness          0.01748997  0.18385061 -0.05531270  0.24463436 -0.30175815
## Speechiness      -0.28589723  0.13362538  0.68954178  0.05705242 -0.05745829
## Acousticness     -0.07066133 -0.13071873  0.04333394 -0.04312228 -0.53475857
## Instrumentalness -0.35733589 -0.31816476 -0.27590898  0.17172291 -0.05355805
## Liveness          0.12267608 -0.74261040  0.34857968 -0.01953979  0.10714726
## Valence          -0.18122473 -0.23454845 -0.18871335 -0.49111371 -0.46412869
## Tempo            -0.33411574 -0.11759143 -0.42797059 -0.22221252  0.30564127
##                         PC11         PC12
## Duration (ms)    -0.13557902  0.020405420
## Popularity        0.06151943 -0.029035989
## Danceability     -0.46243017 -0.027231850
## Energy           -0.06154105 -0.782692337
## Key              -0.05943333  0.023352820
## Loudness         -0.48088154  0.529148690
## Speechiness       0.02730844  0.051345942
## Acousticness     -0.58468503 -0.260179868
## Instrumentalness -0.03188091  0.111538491
## Liveness         -0.14010899  0.007970105
## Valence           0.32425190  0.147896126
## Tempo            -0.23882378 -0.011848238
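The call above passes `center=FALSE, scale.=FALSE` because the input is already standardized. The sketch below, run on the built-in `USArrests` data for illustration only, suggests that this is equivalent to letting `prcomp` standardize the raw data itself, and that the columns of the rotation matrix are orthonormal (each has unit length and they are mutually perpendicular).

```r
X <- USArrests                                             # built-in data, illustration only

pca_a <- prcomp(scale(X), center = FALSE, scale. = FALSE)  # pre-standardized input
pca_b <- prcomp(X, center = TRUE, scale. = TRUE)           # prcomp standardizes internally

# Same component standard deviations
same_sdev <- all.equal(pca_a$sdev, pca_b$sdev)

# Same loadings up to sign (the sign of a component is arbitrary)
same_rot <- all.equal(abs(unname(pca_a$rotation)), abs(unname(pca_b$rotation)))

# Columns of the rotation matrix are orthonormal: R'R is the identity
R <- pca_a$rotation
ortho_err <- max(abs(crossprod(R) - diag(ncol(R))))

c(same_sdev = isTRUE(same_sdev), same_rot = isTRUE(same_rot))
ortho_err
```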

Choosing number of components

An important part of dimension reduction is selecting the right number of components. For this purpose, two charts, shown below, were used: the first shows the percentage of variance explained by each component, and the second shows the eigenvalue of each component (the variance of the data along the direction of that principal component).

summary(pca1)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.6676 1.2469 1.1323 1.06108 1.00164 0.96460 0.93334
## Proportion of Variance 0.2317 0.1296 0.1068 0.09382 0.08361 0.07754 0.07259
## Cumulative Proportion  0.2317 0.3613 0.4681 0.56196 0.64557 0.72311 0.79570
##                           PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.8709 0.82221 0.68479 0.61909 0.40607
## Proportion of Variance 0.0632 0.05634 0.03908 0.03194 0.01374
## Cumulative Proportion  0.8589 0.91524 0.95432 0.98626 1.00000
fviz_eig(pca1)

screeplot(pca1, col = 'steelblue', main = 'Scree plot')
abline(h = 1, col = "black", lty = 2, lwd = 2)

In the graphs above, we can see that the first 2 components account for only 36% of the total variance, and that 7 components are needed to explain roughly 80% (79.6%) of it. The second graph shows that the first 5 components have eigenvalues greater than 1, i.e. each of them carries more variance than a single standardized variable. It was therefore decided to reduce the data to 5 dimensions, which explain 65% of the total variance.
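The figures quoted above can be recomputed directly from the standard deviations printed by `summary(pca1)`. The sketch below copies those values by hand (so small rounding differences are possible) and applies the eigenvalue-greater-than-1 rule, often called the Kaiser criterion, alongside a cumulative-variance threshold.

```r
# Standard deviations copied from the summary(pca1) output above
sdev <- c(1.6676, 1.2469, 1.1323, 1.06108, 1.00164, 0.96460,
          0.93334, 0.8709, 0.82221, 0.68479, 0.61909, 0.40607)

eigenvalues <- sdev^2                          # variance along each component
prop_var    <- eigenvalues / sum(eigenvalues)  # proportion of variance
cum_var     <- cumsum(prop_var)                # cumulative proportion

# Kaiser criterion: keep components with eigenvalue > 1
n_kaiser <- sum(eigenvalues > 1)

# Smallest number of components whose cumulative share reaches 80%
n_for_80 <- which(cum_var >= 0.80)[1]

round(data.frame(eigenvalues, prop_var, cum_var), 4)
n_kaiser
n_for_80
```

Note that seven components reach 79.6%, so a strict 80% threshold is first crossed at the eighth component; the eigenvalue rule, by contrast, selects five.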

Graphical analysis

Another part of the study is graphical analysis. Interpreting the relevant graphs should help in analyzing the components obtained using the PCA method. The analysis will mainly cover the first two dimensions (PC1 and PC2), since we are dealing with 2-dimensional charts.

fviz_pca_var(pca1,
             col.var = "contrib", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     
)

The first graph shows the relationships between all variables. Warm colors and long arrows indicate that a variable contributes strongly to PC1 and PC2, while cold colors and short arrows indicate a small contribution. The direction of the arrows indicates the correlations between the variables: arrows pointing the same way belong to positively correlated variables, and arrows pointing in opposite directions belong to negatively correlated ones.

In the graph, it can be observed that the components are shaped largely by the Energy and Loudness variables, which are positively correlated with each other, and by the Acousticness variable, which is negatively correlated with both. The variables Valence and Danceability also have a large share in the first 2 components. The variables Key and Popularity are the least represented.
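The ranking read from the colors can be approximated numerically. The sketch below copies the PC1 and PC2 loadings (rounded to four decimals) from the `pca1$rotation` output and combines the per-component contributions with eigenvalue weights. The weighting formula is an assumption about how a two-axis contribution is usually defined; it is not taken from the code in this report.

```r
vars <- c("Duration (ms)", "Popularity", "Danceability", "Energy", "Key",
          "Loudness", "Speechiness", "Acousticness", "Instrumentalness",
          "Liveness", "Valence", "Tempo")

# Loadings copied (rounded) from pca1$rotation; sdev copied from summary(pca1)
pc1 <- c( 0.0864,  0.0211, -0.1709, -0.5364, -0.0058, -0.4876,
         -0.1544,  0.4744,  0.1214, -0.2309, -0.3317, -0.1015)
pc2 <- c(-0.4685,  0.0941,  0.6067, -0.1686, -0.1548, -0.1973,
          0.1674,  0.1881, -0.1573, -0.1284,  0.4165, -0.1848)
eig <- c(1.6676, 1.2469)^2

# Contribution (%) of each variable to a single component = 100 * loading^2
contrib1 <- 100 * pc1^2
contrib2 <- 100 * pc2^2

# Assumed combined contribution to PC1 and PC2, weighted by the eigenvalues
contrib12 <- (contrib1 * eig[1] + contrib2 * eig[2]) / sum(eig)
names(contrib12) <- vars

ranked <- sort(contrib12, decreasing = TRUE)
round(ranked, 1)
```

Under these assumptions Energy, Loudness and Acousticness come out on top and Key and Popularity at the bottom, in line with the reading of the graph.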

fviz_pca_ind(pca1,
             col.ind = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))

The next graph shows the relationships between individuals (individuals with similar profiles are grouped together) and how they are represented by a given component (warm colors and a large distance from the center of the graph mean that an individual is well described by a given component, while cold colors and a small distance mean the opposite).

In the graph above, we can see that the best represented individuals are in the upper right corner of the graph. Combined with the analysis of the variable graph, we can conclude that these are observations with a high level of the Acousticness variable. This graph also helps detect outliers, but with the exception of the observation ‘The Concept’ (probably due to the high value of the Duration variable), no other individuals significantly deviate in distance from the others.
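The quality of representation used for the coloring (`cos2`) can be computed by hand from the component scores. The following is a minimal sketch on the built-in `USArrests` data, used for illustration only: for each individual, the squared coordinate on a component is divided by the squared distance of that individual from the origin in the full component space.

```r
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)  # built-in data, illustration only
scores <- pca$x                                         # coordinates of the individuals

# cos2 on a component = squared coordinate / squared distance to the origin
dist2 <- rowSums(scores^2)
cos2  <- scores^2 / dist2          # each row is divided by its own squared distance

# Quality of representation on the PC1-PC2 plane
cos2_plane <- rowSums(cos2[, 1:2])
head(sort(cos2_plane, decreasing = TRUE))
```

Values close to 1 mean that an individual lies almost entirely in the PC1-PC2 plane; values close to 0 mean that it is mostly described by the remaining components.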

fviz_pca_biplot(pca1, repel = TRUE,
                col.var = "#2E9FDF", # Variables color
                col.ind = "#696969",  # Individuals color
                label = 'var'
)

The next chart combines the previous two. It allows both variables and individuals to be analyzed in terms of PC1 and PC2. In it, we can observe that components 1 and 2 most likely depend in large part on the variables Acousticness, Energy, and Loudness, and that they describe well observations with high levels of these variables. We can also see that the observations in the lower right corner of the graph are probably not well described by PC1 and PC2, as they depend heavily on underrepresented variables such as Instrumentalness.

Analysis of components

The following two charts show the extent to which each variable contributes to a component. The first chart covers the components PC1 and PC2 analyzed above, and the second covers the other 3 components (PC3, PC4, PC5).

var<-get_pca_var(pca1)
PC1<-fviz_contrib(pca1, "var", axes=1, xtickslab.rt=90) # default angle=45°
PC2<-fviz_contrib(pca1, "var", axes=2, xtickslab.rt=90)
grid.arrange(PC1,PC2,top='Contribution to the first two Principal Components')

As we can see from the graph above, the first component consists primarily of the Energy, Loudness, and Acousticness variables. This component therefore focuses on the way a song sounds: whether it is loud and energetic or rather calm and performed with acoustic instruments.

The second component consists primarily of the Danceability, Duration and Valence variables. Danceability and Valence load positively on it, while Duration loads negatively. Thus, it can be assumed that this component determines whether a song is suitable for a party, that is, whether it is positive, good to dance to, and relatively short.

The other 3 components can be found in the chart below, but it is more difficult to find a guiding characteristic that would describe each of them as a whole. A special case is PC4, where the Popularity variable alone accounts for about half of the component (its squared loading is 0.69² ≈ 0.48).

PC3<-fviz_contrib(pca1, "var", axes=3, xtickslab.rt=90) # default angle=45°
PC4<-fviz_contrib(pca1, "var", axes=4, xtickslab.rt=90)
PC5<-fviz_contrib(pca1, "var", axes=5, xtickslab.rt=90)
grid.arrange(PC3,PC4,PC5)
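The statement about PC4 can be checked from the loadings alone, since the contribution of a variable to one component is its squared loading expressed as a percentage. The sketch below copies the PC4 column (rounded to four decimals) from the `pca1$rotation` output and compares each contribution with the level expected if all 12 variables contributed equally.

```r
vars <- c("Duration (ms)", "Popularity", "Danceability", "Energy", "Key",
          "Loudness", "Speechiness", "Acousticness", "Instrumentalness",
          "Liveness", "Valence", "Tempo")

# PC4 loadings copied (rounded) from pca1$rotation
pc4 <- c( 0.1116,  0.6932, -0.0580,  0.0088,  0.4810,  0.1040,
         -0.0040,  0.0536, -0.3351, -0.3489,  0.0862,  0.1307)

# Contribution (%) of each variable to PC4 = 100 * squared loading
contrib_pc4 <- setNames(100 * pc4^2, vars)
round(sort(contrib_pc4, decreasing = TRUE), 1)

# Reference level: contribution expected if all variables contributed equally
expected <- 100 / length(vars)
above_avg <- names(contrib_pc4)[contrib_pc4 > expected]
above_avg
```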

Conclusions

The study presents the application of Principal Component Analysis (PCA) to reduce the dimensions of a dataset of songs on Spotify. Using the PCA method, the number of dimensions of the set was reduced from 12 to 5 while retaining 65% of the total variance. The study also attempted to name the newly created components in order to better understand the set trimmed to 5 dimensions.
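For completeness, the reduction itself amounts to keeping the first few columns of the score matrix. The sketch below shows this on the built-in `USArrests` data with `k = 2`, both chosen for illustration only; for the Spotify data the analogous step would keep the first 5 columns of `pca1$x`.

```r
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)  # built-in data, illustration only
k <- 2                                                  # components kept (5 in this study)

# Observations expressed in the reduced k-dimensional space
reduced <- pca$x[, 1:k]
dim(reduced)

# Share of the total variance retained by the first k components
retained <- sum(pca$sdev[1:k]^2) / sum(pca$sdev^2)
retained
```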