# loading all necessary libraries
library(readxl)
library(corrplot)
library(factoextra)
library(caret)
library(gridExtra)
library(psych)
library(ggplot2)
library(plotly)
library(ggfortify)
library(GGally)
library(Hmisc)
library(kableExtra)
The goal of this project is to apply Principal Component Analysis (PCA) to a dataset of Spotify songs and their characteristics. PCA is a technique commonly used for dimensionality reduction in datasets with many features. It transforms the original features into a smaller set of new, uncorrelated features called principal components, while preserving as much of the variability (information) in the data as possible.
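To make the idea concrete, here is a minimal sketch on simulated data (not the Spotify playlist) using base R's `prcomp`. The variables `x1` to `x3` and the seed are invented for illustration. It checks the two properties mentioned above: the component scores are uncorrelated, and the components are ordered by the share of variance they carry.

```r
# Illustrative only: simulated data, not the Spotify playlist
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.4)   # strongly correlated with x1
x3 <- rnorm(n)                        # independent noise
toy <- data.frame(x1, x2, x3)

toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Principal component scores are mutually uncorrelated
score_cor <- cor(toy_pca$x)
print(round(score_cor, 3))

# Share of total variance carried by each component (decreasing, sums to 1)
var_share <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
print(round(var_share, 3))
```

Because `x1` and `x2` move together, most of their joint variability should end up in a single component, which is what makes dropping later components possible.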
The paper uses a dataset of songs from a Spotify playlist and their characteristics. The data comes from Kaggle and can be found under the name “Spotify Playlist-ORIGINS.” The analysis uses 265 observations and 12 numeric variables from this dataset; the meaning of each variable is given below.
Duration (ms) - Length of the song in milliseconds
Popularity - How popular the song is, from 0 to 100, where 100 is the most popular
Danceability - How well a song can be danced to
Energy - How active or intense a song is
Key - A group of notes, or scale, as the basis of the song
Loudness - Volume of the song
Speechiness - How much spoken word the song contains
Acousticness - How likely it is that the song uses acoustic instruments
Instrumentalness - How likely it is that the song contains no vocals
Liveness - How likely it is that the song was recorded in front of a live audience
Valence - How positive the song can be
Tempo - How fast or slow the song can be
# loading the dataset as data and selecting only the necessary columns
options(scipen=999)
data <- read_excel("spotify.xlsx")
r_name<-data[,2]$`Track Name`
data_num <- data[,c(6, 7, 12:15, 17:22)]
rownames(data_num) <- r_name
summary(data_num)
## Duration (ms) Popularity Danceability Energy
## Min. :127931 Min. : 0.00 Min. :0.1740 Min. :0.152
## 1st Qu.:195320 1st Qu.:15.00 1st Qu.:0.4420 1st Qu.:0.524
## Median :218040 Median :31.00 Median :0.5360 Median :0.637
## Mean :219548 Mean :31.71 Mean :0.5338 Mean :0.622
## 3rd Qu.:238266 3rd Qu.:47.00 3rd Qu.:0.6200 3rd Qu.:0.766
## Max. :485333 Max. :83.00 Max. :0.8900 Max. :0.970
## Key Loudness Speechiness Acousticness
## Min. : 0.000 Min. :-16.550 Min. :0.02430 Min. :0.0000133
## 1st Qu.: 2.000 1st Qu.: -7.937 1st Qu.:0.03170 1st Qu.:0.0198000
## Median : 6.000 Median : -6.353 Median :0.03890 Median :0.1210000
## Mean : 5.509 Mean : -6.841 Mean :0.04963 Mean :0.2599463
## 3rd Qu.: 8.000 3rd Qu.: -5.254 3rd Qu.:0.05230 3rd Qu.:0.4430000
## Max. :11.000 Max. : -1.395 Max. :0.28400 Max. :0.9350000
## Instrumentalness Liveness Valence Tempo
## Min. :0.0000000 Min. :0.0304 Min. :0.0370 Min. : 65.53
## 1st Qu.:0.0000000 1st Qu.:0.0973 1st Qu.:0.2300 1st Qu.: 88.00
## Median :0.0000108 Median :0.1180 Median :0.3840 Median :100.10
## Mean :0.0306197 Mean :0.1611 Mean :0.3954 Mean :117.96
## 3rd Qu.:0.0005750 3rd Qu.:0.1900 3rd Qu.:0.5210 3rd Qu.:151.98
## Max. :0.9420000 Max. :0.6920 Max. :0.9750 Max. :202.00
The basic statistics of each variable are presented above. Some variables differ by several orders of magnitude (for example, Duration is measured in hundreds of thousands of milliseconds, while Acousticness lies between 0 and 1). Because PCA is driven by variance, the large-scale variables would otherwise dominate the components and distort the results. It was therefore decided to standardize the variables, i.e. to center each one to mean 0 and scale it to standard deviation 1.
preproc <- preProcess(data_num, method=c("center", "scale"))
data_norm <- predict(preproc, data_num)
kable(head(data_norm))
| Duration (ms) | Popularity | Danceability | Energy | Key | Loudness | Speechiness | Acousticness | Instrumentalness | Liveness | Valence | Tempo |
|---|---|---|---|---|---|---|---|---|---|---|---|
| -0.4321688 | 1.1098094 | 0.4297122 | -0.0639959 | -1.5644244 | 0.3976705 | -0.1263595 | -0.8546238 | -0.2332278 | -0.8458215 | -1.3717324 | -1.1659985 |
| 0.1013924 | 0.3472763 | 1.9178646 | -0.3896815 | -1.5644244 | 0.3910216 | -0.2578149 | -0.8172455 | -0.2321995 | 0.5343138 | 1.5822328 | -0.6872256 |
| -0.6268363 | 1.4910760 | 0.2869097 | 1.0759037 | -1.5644244 | 0.6204089 | 0.2321552 | -0.6125379 | -0.2332278 | -0.8923874 | 1.1488518 | 1.1689991 |
| -0.9200203 | -0.6058901 | 1.1587566 | -0.0349168 | -1.2804707 | -0.2048867 | -0.1353224 | -0.8598638 | -0.2297164 | -0.3970051 | -0.9483141 | -0.8591569 |
| -1.2298165 | 0.2996180 | -0.1414979 | -1.2329746 | -1.5644244 | -0.1163731 | -0.2548273 | 1.0307121 | -0.2329171 | -0.5753427 | -0.7440770 | -0.9204696 |
| 0.1773038 | 0.1089847 | -0.3970392 | -0.5699717 | 0.4232518 | -0.0806352 | -0.5804781 | -0.7980323 | -0.2332196 | 0.7324668 | -1.1625139 | -1.1638028 |
dim(data_norm)
## [1] 265 12
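As a rough illustration of why standardization matters, the sketch below compares PCA on raw and on z-scored data. It uses a simulated two-column data frame with invented column names and values, and base R's `scale()`, which is assumed here to be equivalent to centering and scaling with `preProcess` (mean 0, sample standard deviation 1).

```r
# Illustrative only: two simulated columns on very different scales
set.seed(1)
toy <- data.frame(
  duration_ms  = rnorm(100, mean = 220000, sd = 40000),
  acousticness = runif(100, min = 0, max = 1)
)

# Without scaling, PC1 is essentially the large-scale column on its own
pca_raw <- prcomp(toy, center = TRUE, scale. = FALSE)
pc1_share_raw <- pca_raw$sdev[1]^2 / sum(pca_raw$sdev^2)

# z-score each column: subtract the mean, divide by the standard deviation
toy_std <- as.data.frame(scale(toy))
pca_std <- prcomp(toy_std, center = FALSE, scale. = FALSE)
pc1_share_std <- pca_std$sdev[1]^2 / sum(pca_std$sdev^2)

print(round(colMeans(toy_std), 10))      # approximately 0
print(round(sapply(toy_std, sd), 10))    # 1
print(c(raw = pc1_share_raw, standardized = pc1_share_std))
```

On the raw data almost all variance sits in the millisecond column, so the first component says nothing about the second variable; after standardization both columns enter on equal terms.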
Two correlation charts are used to examine the relationships between the variables and the distribution of each variable.
In the following graphs, most pairs of variables show little or no correlation. The exceptions are the strong positive correlation between Loudness and Energy and the marked negative correlations of Acousticness with both Loudness and Energy. The second graph also shows the distribution of each variable. With the exception of Danceability, none of them resembles a normal distribution.
kor<-cor(data_norm)
corrplot(kor, method="color", type="lower", tl.col = 'black', tl.cex = 0.75)
ggpairs(as.data.frame(data_norm))
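The visual reading of the charts can be backed up numerically. The following sketch, on simulated data with invented variable names, ranks all variable pairs by absolute correlation and runs a Shapiro-Wilk test per column as one possible normality check. It is an assumption that these checks suit the playlist data; they are not part of the original analysis.

```r
# Illustrative only: simulated variables with a known correlation structure
set.seed(7)
n <- 150
energy       <- rnorm(n)
loudness     <- 0.8 * energy + rnorm(n, sd = 0.5)
acousticness <- -0.7 * energy + rnorm(n, sd = 0.6)
tempo        <- rnorm(n)
toy <- data.frame(energy, loudness, acousticness, tempo)

cor_mat <- cor(toy)

# Keep each pair once (lower triangle) and sort by absolute correlation
idx <- which(lower.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs, row.names = FALSE)

# Shapiro-Wilk p-values: small values suggest a departure from normality
shapiro_p <- sapply(toy, function(x) shapiro.test(x)$p.value)
print(round(shapiro_p, 3))
```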
To reduce the dimensionality of the dataset, the Principal Component Analysis (PCA) method was chosen. Because the data were already standardized in the previous step, centering and scaling are switched off in the call below.
pca1<-prcomp(data_norm, center=FALSE, scale.=FALSE)
pca1$rotation
## PC1 PC2 PC3 PC4 PC5
## Duration (ms) 0.086391370 -0.46850636 0.37070876 0.111584030 0.03724326
## Popularity 0.021089620 0.09414013 0.25989767 0.693160910 -0.09180853
## Danceability -0.170894010 0.60666517 0.24711157 -0.057971544 0.19723048
## Energy -0.536428477 -0.16863332 0.09072684 0.008798769 0.12367162
## Key -0.005817961 -0.15480857 -0.16112581 0.481046799 0.47545519
## Loudness -0.487588518 -0.19734460 0.10231296 0.104017059 -0.05205065
## Speechiness -0.154440871 0.16743385 -0.49787860 -0.003976076 0.33943423
## Acousticness 0.474439469 0.18813878 -0.08329414 0.053627118 -0.09144706
## Instrumentalness 0.121438202 -0.15726077 0.15453546 -0.335101768 0.68810569
## Liveness -0.230928715 -0.12844632 0.01998414 -0.348877450 -0.29744716
## Valence -0.331704968 0.41646712 0.04240320 0.086184368 -0.01457413
## Tempo -0.101522348 -0.18478218 -0.63906466 0.130710273 -0.14651918
## PC6 PC7 PC8 PC9 PC10
## Duration (ms) -0.29139737 0.14187313 0.29755268 -0.63526924 0.07636888
## Popularity -0.39953859 -0.32380046 0.09833904 0.36294767 0.15407166
## Danceability 0.04561951 0.06377205 -0.02124302 -0.18447888 0.49417906
## Energy -0.03832814 0.08649149 -0.06632376 0.08892684 -0.15017524
## Key 0.61453727 -0.27054335 0.01146402 -0.19328586 -0.01252928
## Loudness 0.01748997 0.18385061 -0.05531270 0.24463436 -0.30175815
## Speechiness -0.28589723 0.13362538 0.68954178 0.05705242 -0.05745829
## Acousticness -0.07066133 -0.13071873 0.04333394 -0.04312228 -0.53475857
## Instrumentalness -0.35733589 -0.31816476 -0.27590898 0.17172291 -0.05355805
## Liveness 0.12267608 -0.74261040 0.34857968 -0.01953979 0.10714726
## Valence -0.18122473 -0.23454845 -0.18871335 -0.49111371 -0.46412869
## Tempo -0.33411574 -0.11759143 -0.42797059 -0.22221252 0.30564127
## PC11 PC12
## Duration (ms) -0.13557902 0.020405420
## Popularity 0.06151943 -0.029035989
## Danceability -0.46243017 -0.027231850
## Energy -0.06154105 -0.782692337
## Key -0.05943333 0.023352820
## Loudness -0.48088154 0.529148690
## Speechiness 0.02730844 0.051345942
## Acousticness -0.58468503 -0.260179868
## Instrumentalness -0.03188091 0.111538491
## Liveness -0.14010899 0.007970105
## Valence 0.32425190 0.147896126
## Tempo -0.23882378 -0.011848238
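The rotation matrix printed above holds the loadings: each column gives the weights that combine the standardized variables into one component. As a sketch on simulated data (the four columns are invented), the checks below illustrate three properties such a matrix is expected to have: its columns are orthonormal, the squared component standard deviations equal the eigenvalues of the correlation matrix, and the loadings match its eigenvectors up to sign.

```r
# Illustrative only: simulated data with some induced correlation
set.seed(3)
toy <- as.data.frame(matrix(rnorm(300 * 4), ncol = 4))
toy$V2 <- toy$V1 + rnorm(300, sd = 0.5)
toy_std <- scale(toy)

toy_pca <- prcomp(toy_std, center = FALSE, scale. = FALSE)
eig     <- eigen(cor(toy))

# Loadings are orthonormal: t(R) %*% R is the identity matrix
orth_err <- max(abs(crossprod(toy_pca$rotation) - diag(4)))

# Eigenvalues of the correlation matrix equal the squared PC standard deviations
eig_err <- max(abs(eig$values - toy_pca$sdev^2))

# Eigenvectors match the loadings up to the (arbitrary) sign of each column
vec_err <- max(abs(abs(eig$vectors) - abs(unname(toy_pca$rotation))))

print(c(orthonormality = orth_err, eigenvalues = eig_err, eigenvectors = vec_err))
```

The sign of a whole column is arbitrary, so a component with negative loadings on Energy and Loudness describes the same axis as one with positive loadings on both.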
An important part of dimensionality reduction is choosing the right number of components. Two charts, shown below, were used for this purpose: the first shows the percentage of variance explained by each component, and the second shows the eigenvalue of each component (the variance of the data along the direction of that principal component).
summary(pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.6676 1.2469 1.1323 1.06108 1.00164 0.96460 0.93334
## Proportion of Variance 0.2317 0.1296 0.1068 0.09382 0.08361 0.07754 0.07259
## Cumulative Proportion 0.2317 0.3613 0.4681 0.56196 0.64557 0.72311 0.79570
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.8709 0.82221 0.68479 0.61909 0.40607
## Proportion of Variance 0.0632 0.05634 0.03908 0.03194 0.01374
## Cumulative Proportion 0.8589 0.91524 0.95432 0.98626 1.00000
fviz_eig(pca1)
screeplot(pca1, col = 'steelblue', main = 'Scree plot')
abline(h = 1, col = "black", lty = 2, lwd = 2)
In the graphs above, we can see that the first 2 components account for only 36% of the total variance, and that 7 components are needed to explain about 80%. The second graph shows that the first 5 components have eigenvalues greater than 1, i.e. each of them explains more variance than a single standardized variable. It was therefore decided to reduce the data to 5 dimensions, which together explain 65% of the total variance.
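The figures quoted above can be recomputed from the standard deviations in the `summary(pca1)` output. The sketch below copies those rounded values by hand, so small rounding differences are expected. The eigenvalue-greater-than-1 rule applied here is commonly known as the Kaiser criterion.

```r
# Standard deviations copied from the summary(pca1) output above
sdev <- c(1.6676, 1.2469, 1.1323, 1.06108, 1.00164, 0.96460,
          0.93334, 0.8709, 0.82221, 0.68479, 0.61909, 0.40607)
names(sdev) <- paste0("PC", seq_along(sdev))

eigenvalues <- sdev^2                       # variance along each component
prop_var    <- eigenvalues / sum(eigenvalues)
cum_var     <- cumsum(prop_var)

# Kaiser criterion: keep components whose eigenvalue exceeds 1
n_kaiser <- sum(eigenvalues > 1)

print(round(rbind(eigenvalues, prop_var, cum_var), 3))
print(n_kaiser)
print(round(cum_var[c("PC2", "PC5", "PC7")], 3))
```

With 12 standardized variables the eigenvalues sum to roughly 12, so an eigenvalue of 1 is the average. Note that PC5 clears the threshold only narrowly, so the choice of 5 components is sensitive to small changes in the data.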
The next part of the study is a graphical analysis; interpreting the relevant graphs should help in analyzing the components obtained with the PCA method. The analysis mainly covers the first two dimensions (PC1 and PC2), since the charts are 2-dimensional.
fviz_pca_var(pca1,
col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)
The first graph shows the relationships between all variables. Warm colors and long arrows indicate that a variable contributes strongly to PC1 and PC2, while cold colors and short arrows indicate a small contribution. The direction of the arrows indicates the correlations between the variables: arrows pointing the same way suggest a positive correlation, and arrows pointing in opposite directions a negative one.
In the graph, the largest contributions come from the Energy and Loudness variables, which are positively correlated with each other, and from the Acousticness variable, which is negatively correlated with both. The Valence and Danceability variables also contribute strongly to the first 2 components. The Key and Popularity variables are the least represented.
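The arrow lengths can be approximated from the numbers already printed. For standardized data, the coordinate of a variable on a component is its loading multiplied by that component's standard deviation, and the squared arrow length in the PC1-PC2 plane is the sum of the two squared coordinates. The sketch below copies the PC1 and PC2 loadings and standard deviations from the output above. It is a hand calculation under that assumption, not a call into factoextra.

```r
# Loadings (PC1, PC2) copied from the pca1$rotation output above
loadings <- rbind(
  "Duration (ms)"  = c( 0.086391370, -0.46850636),
  Popularity       = c( 0.021089620,  0.09414013),
  Danceability     = c(-0.170894010,  0.60666517),
  Energy           = c(-0.536428477, -0.16863332),
  Key              = c(-0.005817961, -0.15480857),
  Loudness         = c(-0.487588518, -0.19734460),
  Speechiness      = c(-0.154440871,  0.16743385),
  Acousticness     = c( 0.474439469,  0.18813878),
  Instrumentalness = c( 0.121438202, -0.15726077),
  Liveness         = c(-0.230928715, -0.12844632),
  Valence          = c(-0.331704968,  0.41646712),
  Tempo            = c(-0.101522348, -0.18478218)
)
colnames(loadings) <- c("PC1", "PC2")
sdev <- c(PC1 = 1.6676, PC2 = 1.2469)   # from summary(pca1)

# Correlation of each variable with each component: loading * sdev
coord <- sweep(loadings, 2, sdev, `*`)

# Squared arrow length in the PC1-PC2 plane (quality of representation)
cos2_plane <- rowSums(coord^2)

print(round(coord, 3))
print(round(sort(cos2_plane, decreasing = TRUE), 3))
```

Sorting `cos2_plane` puts Energy, Loudness, Acousticness, Danceability and Valence at the top and Key and Popularity at the bottom, and the opposite signs of Acousticness and Energy on PC1 reflect their negative correlation.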
fviz_pca_ind(pca1,
col.ind = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))
The next graph shows the relationships between individuals (individuals with similar profiles are grouped together) and how well each is represented by PC1 and PC2, measured by cos2 (warm colors and a large distance from the center of the graph mean that an individual is well described by these components, while cold colors and a small distance mean the opposite).
In the graph above, we can see that the best represented individuals are in the upper right corner of the graph. Combined with the analysis of the variable graph, we can conclude that these are observations with a high level of the Acousticness variable. This graph also helps detect outliers, but with the exception of the observation ‘The Concept’ (probably due to the high value of the Duration variable), no other individuals significantly deviate in distance from the others.
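For readers who want the numbers behind the colors, here is a minimal sketch of how cos2 for individuals can be computed by hand. It uses the built-in `USArrests` data as a stand-in for the song table, and the formula (squared score divided by squared distance from the origin) is the usual definition, assumed here to match what the plot displays.

```r
# Illustrative only: built-in USArrests data stands in for the song table
std <- scale(USArrests)
pca <- prcomp(std, center = FALSE, scale. = FALSE)

scores <- pca$x                    # coordinates of each individual on every PC
dist2  <- rowSums(scores^2)        # squared distance from the origin, all PCs
cos2   <- scores^2 / dist2         # share of that distance captured by each PC

# Quality of representation in the PC1-PC2 plane
cos2_plane <- rowSums(cos2[, 1:2])
print(round(head(sort(cos2_plane, decreasing = TRUE)), 3))

# Individuals far from the origin in the plane are outlier candidates
plane_dist <- sqrt(rowSums(scores[, 1:2]^2))
print(round(head(sort(plane_dist, decreasing = TRUE), 3), 2))
```

A point can lie far from the center and still have a modest cos2 if most of its distance is along components that are not plotted, so the two measures are worth checking separately.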
fviz_pca_biplot(pca1, repel = TRUE,
col.var = "#2E9FDF", # Variables color
col.ind = "#696969", # Individuals color
label = 'var'
)
The next chart is a combination of the previous two charts. It allows you to analyze both variables and individuals in terms of PC1 and PC2. In it, we can observe that components 1 and 2 most likely depend in large part on the variables Acousticness, Energy, or Loudness, and that they describe well observations with high levels of these variables. We can also see that the observations in the lower right corner of the graph are probably not well described by PC1 and PC2, as they depend heavily on underrepresented variables such as Instrumentalness.
The following two charts show the extent to which each variable contributes to a component. The first chart covers the components PC1 and PC2 analyzed above, and the second covers the other 3 components (PC3, PC4, PC5).
var<-get_pca_var(pca1)
PC1<-fviz_contrib(pca1, "var", axes=1, xtickslab.rt=90) # default angle=45°
PC2<-fviz_contrib(pca1, "var", axes=2, xtickslab.rt=90)
grid.arrange(PC1,PC2,top='Contribution to the first two Principal Components')
As we can see from the graph above, the first component consists primarily of the Energy, Loudness, and Acousticness variables. This component therefore mainly captures how a song sounds: whether it is loud and energetic, or rather calm and performed with acoustic instruments.
The second component consists primarily of the Danceability, Duration and Valence variables. Thus, it can be assumed that this component determines whether a song is suitable for a party, that is, if it is positive and good to dance to.
The other 3 components are shown in the chart below, but it is harder to find a single guiding characteristic that describes each of them. PC4 is a special case: about half of this component is accounted for by the Popularity variable.
PC3<-fviz_contrib(pca1, "var", axes=3, xtickslab.rt=90) # default angle=45°
PC4<-fviz_contrib(pca1, "var", axes=4, xtickslab.rt=90)
PC5<-fviz_contrib(pca1, "var", axes=5, xtickslab.rt=90)
grid.arrange(PC3,PC4,PC5)
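The statement about PC4 can be checked directly from the loadings. The contribution of a variable to a component is its squared loading expressed as a percentage. The sketch below copies the PC4 column from the `pca1$rotation` output and compares each contribution with the level expected if all 12 variables contributed equally (assumed here to correspond to the dashed reference line in the contribution charts).

```r
# PC4 loadings copied from the pca1$rotation output above
pc4 <- c(
  "Duration (ms)"  =  0.111584030,
  Popularity       =  0.693160910,
  Danceability     = -0.057971544,
  Energy           =  0.008798769,
  Key              =  0.481046799,
  Loudness         =  0.104017059,
  Speechiness      = -0.003976076,
  Acousticness     =  0.053627118,
  Instrumentalness = -0.335101768,
  Liveness         = -0.348877450,
  Valence          =  0.086184368,
  Tempo            =  0.130710273
)

# Contribution (%) of each variable to the component: squared loading * 100
contrib <- 100 * pc4^2 / sum(pc4^2)

# Reference level if all 12 variables contributed equally
expected <- 100 / length(pc4)

print(round(sort(contrib, decreasing = TRUE), 1))
print(names(contrib)[contrib > expected])
```

Popularity comes out at roughly 48%, with Key, Liveness and Instrumentalness as the only other variables above the equal-share level.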
The study presents the application of Principal Component Analysis (PCA) to reduce the dimensions of a dataset of songs on Spotify. Using the PCA method, the number of dimensions of the set was reduced from 12 to 5 while retaining 65% of the total variance. The study also attempted to name the newly created components in order to better understand the set trimmed to 5 dimensions.
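As a closing illustration of what "reducing the dimensions" means in practice, the sketch below keeps the first `k` component scores of a PCA and measures how much variance is retained. It uses the four numeric columns of the built-in `iris` data with `k = 2` as a stand-in; in the study the same step would keep 5 of 12 components. The names `reduced`, `recon` and `lost` are introduced here for illustration only.

```r
# Illustrative only: four numeric columns of the built-in iris data
std <- scale(iris[, 1:4])
pca <- prcomp(std, center = FALSE, scale. = FALSE)

k <- 2                                    # components to keep (5 in the study)
reduced <- pca$x[, 1:k, drop = FALSE]     # reduced data: one row per observation

# Share of total variance retained by the first k components
retained <- sum(pca$sdev[1:k]^2) / sum(pca$sdev^2)

# Rank-k reconstruction of the standardized data from the kept components
recon <- reduced %*% t(pca$rotation[, 1:k, drop = FALSE])
lost  <- sum((std - recon)^2) / sum(std^2)

print(dim(reduced))
print(round(c(retained = retained, lost = lost), 4))
```

The retained and lost shares add up to 1, which is the sense in which a 5-component solution that retains 65% of the variance discards the remaining 35%.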