The aim of this work is to apply principal component analysis (PCA) to data on songs from the 1980s available on Spotify. PCA is used to reduce the number of variables while preserving as much information as possible.
https://www.kaggle.com/datasets/thebumpkin/1980s-classic-hits-with-spotify-data - This dataset comprises 997 classic hit songs from the 1980s, featuring tracks from 478 different artists.
Duration - The length of the song, typically measured in minutes and seconds.
Time signature - The musical meter of the song, indicating the number of beats per measure.
Danceability - A measure of how suitable a track is for dancing, based on tempo, rhythm stability, beat strength, and overall regularity.
Energy - A measure of intensity and activity in the song, with higher values indicating a more energetic track.
Key - The musical key in which the song is composed.
Loudness - The average volume of the song, measured in decibels (dB).
Speechiness - A measure of the presence of spoken words in a track, with higher values indicating more speech-like qualities.
Acousticness - A measure of the track’s acoustic quality, with higher values indicating a greater likelihood of being acoustic.
Instrumentalness - A measure indicating the absence of vocals, with higher values representing more instrumental tracks.
Liveness - A measure of the likelihood that the track was performed live, with higher values indicating more audience noise.
Valence - A measure of the musical positiveness of the track, with higher values indicating more positive or happy music.
Tempo - The speed or pace of the track, measured in beats per minute (BPM).
The variable Mode was removed because it is binary, and PCA is designed for continuous variables. The variable Popularity was also removed because it reflects current popularity rather than popularity in the 1980s. It is also not directly related to the content of the songs, whereas our focus is on their musical attributes.
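library(corrplot)    # corrplot()
library(factoextra)  # fviz_*() and get_*() helpers
library(psych)       # principal()
dane <- read.csv(file = "1980sClassics.csv", header = TRUE, sep = ",")
dane <- unique(dane) # drop duplicated rows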
# Convert a "minutes:seconds" string to the total number of seconds
convert_to_seconds <- function(time_str) {
  parts <- strsplit(time_str, ":")[[1]]
  minutes <- as.numeric(parts[1])
  seconds <- as.numeric(parts[2])
  total_seconds <- minutes * 60 + seconds
  return(total_seconds)
}
dane$Duration <- sapply(dane$Duration, convert_to_seconds)
dane <- dane[-1:-2] # drop the artist and song names
dane <- dane[-7] # drop Mode (binary variable)
dane <- dane[-13:-14] # drop the year and Popularity (on Spotify)
dane <- na.omit(dane)
summary(dane)
## Duration Time_Signature Danceability Energy
## Min. : 41.0 Min. :1.000 Min. :0.1740 Min. :0.0183
## 1st Qu.:205.0 1st Qu.:4.000 1st Qu.:0.5340 1st Qu.:0.4890
## Median :234.0 Median :4.000 Median :0.6330 Median :0.6520
## Mean :240.2 Mean :3.967 Mean :0.6265 Mean :0.6335
## 3rd Qu.:270.0 3rd Qu.:4.000 3rd Qu.:0.7350 3rd Qu.:0.7970
## Max. :929.0 Max. :5.000 Max. :0.9880 Max. :0.9940
## Key Loudness Speechiness Acousticness
## Min. : 0.00 Min. :-28.980 Min. :0.02270 Min. :0.0000035
## 1st Qu.: 2.00 1st Qu.:-11.262 1st Qu.:0.03170 1st Qu.:0.0433000
## Median : 5.00 Median : -8.269 Median :0.03930 Median :0.1550000
## Mean : 5.23 Mean : -8.885 Mean :0.05763 Mean :0.2442580
## 3rd Qu.: 9.00 3rd Qu.: -6.042 3rd Qu.:0.05650 3rd Qu.:0.3850000
## Max. :11.00 Max. : -1.496 Max. :0.52400 Max. :0.9960000
## Instrumentalness Liveness Valence Tempo
## Min. :0.0000000 Min. :0.0223 Min. :0.0287 Min. : 61.53
## 1st Qu.:0.0000000 1st Qu.:0.0839 1st Qu.:0.3880 1st Qu.:102.48
## Median :0.0000221 Median :0.1130 Median :0.6440 Median :119.97
## Mean :0.0425860 Mean :0.1787 Mean :0.6030 Mean :120.94
## 3rd Qu.:0.0013800 3rd Qu.:0.2260 3rd Qu.:0.8250 3rd Qu.:135.00
## Max. :0.9740000 Max. :0.9810 Max. :0.9840 Max. :208.57
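As a quick sanity check on the duration conversion used in the preprocessing above, the sketch below applies the same minutes-to-seconds arithmetic to a few hand-written strings. The helper name `mmss_to_seconds` is hypothetical, and the sample strings are made up. They were chosen only so that the results equal 41, 234, and 929 seconds, the minimum, median, and maximum shown in the summary.

```r
# Hypothetical helper: convert "m:ss" strings to seconds (same arithmetic as above),
# written in vectorised form with vapply() so the result is always a plain numeric vector.
mmss_to_seconds <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

# Made-up sample strings, NOT rows taken from the data set.
example_durations <- c("0:41", "3:54", "15:29")
example_seconds <- mmss_to_seconds(example_durations)
print(example_seconds)
```

Checking the helper on known values before applying it to the whole column makes it easier to spot malformed entries, which would otherwise surface only as `NA` values later on.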
dane <- as.data.frame(scale(dane)) # standardize every column to mean 0 and standard deviation 1
cor_w <- cor(dane)
corrplot(cor_w, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.5)
The analysis of the correlation matrix indicates a strong correlation between several variables, while others, such as key, tempo, speechiness, and liveness, show little to no correlation with the rest.
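A correlation plot is convenient for a visual impression, but the same matrix can also be ranked numerically. The sketch below lists the most and least correlated variable pairs. Because the CSV file is not available here, it uses the built-in `mtcars` data purely as a stand-in, and the object name `cor_pairs` is an assumption made for illustration.

```r
# Stand-in data: mtcars ships with R; it is NOT the Spotify data set.
cor_m <- cor(scale(mtcars))

# Keep each pair once (lower triangle), then sort by absolute correlation.
idx <- which(lower.tri(cor_m), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_m)[idx[, "row"]],
  var2 = colnames(cor_m)[idx[, "col"]],
  r    = cor_m[idx],
  stringsAsFactors = FALSE
)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]

print(head(cor_pairs, 5))   # most strongly related pairs
print(tail(cor_pairs, 5))   # pairs closest to zero correlation
```

Replacing `mtcars` with the standardized song data would give the same kind of ranking for the audio features.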
The data were standardized because the variables differ greatly in their value ranges, which would otherwise let the variables with the largest ranges dominate the analysis. Following the Kaiser criterion, five components should be retained, because only the first five eigenvalues are greater than 1.
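To illustrate why standardization matters, the sketch below runs PCA with and without scaling and compares the share of variance taken by the first component. It again uses the built-in `mtcars` data as a stand-in, not the song data, and all object names are illustrative.

```r
# Stand-in data: mtcars ships with R; it is NOT the Spotify data set.
x <- mtcars

# After scale(), every column has mean 0 and standard deviation 1.
x_scaled <- scale(x)
print(round(colMeans(x_scaled), 10))
print(apply(x_scaled, 2, sd))

# PCA without and with standardization.
pca_raw    <- prcomp(x, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(x, center = TRUE, scale. = TRUE)

# Share of total variance captured by PC1 in each case.
share_raw    <- pca_raw$sdev[1]^2 / sum(pca_raw$sdev^2)
share_scaled <- pca_scaled$sdev[1]^2 / sum(pca_scaled$sdev^2)
print(c(raw = share_raw, scaled = share_scaled))

# With standardized data, the component variances are the eigenvalues
# of the correlation matrix, which is what the eigenvalue > 1 rule refers to.
eig_cor <- eigen(cor(x))$values
print(all.equal(pca_scaled$sdev^2, eig_cor))
```

Without scaling, the first component mostly reflects whichever variables have the largest raw variance, so its share is inflated. With scaling, each variable contributes one unit of variance, which is also why an eigenvalue of 1 is the natural cut-off: it corresponds to the variance of a single original variable.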
pca <- prcomp(dane, center = TRUE, scale. = TRUE)
fviz_eig(pca, choice='eigenvalue')
fviz_eig(pca)
eig.val <- get_eigenvalue(pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.6128676 21.773896 21.77390
## Dim.2 1.5428467 12.857055 34.63095
## Dim.3 1.2480974 10.400812 45.03176
## Dim.4 1.0698691 8.915576 53.94734
## Dim.5 1.0375432 8.646194 62.59353
## Dim.6 0.9317870 7.764891 70.35842
## Dim.7 0.8889980 7.408317 77.76674
## Dim.8 0.8207588 6.839657 84.60640
## Dim.9 0.7940228 6.616857 91.22325
## Dim.10 0.4838693 4.032244 95.25550
## Dim.11 0.4079514 3.399595 98.65509
## Dim.12 0.1613887 1.344905 100.00000
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.6164 1.2421 1.1172 1.03434 1.01860 0.96529 0.94287
## Proportion of Variance 0.2177 0.1286 0.1040 0.08916 0.08646 0.07765 0.07408
## Cumulative Proportion 0.2177 0.3463 0.4503 0.53947 0.62594 0.70358 0.77767
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.9060 0.89108 0.69561 0.6387 0.40173
## Proportion of Variance 0.0684 0.06617 0.04032 0.0340 0.01345
## Cumulative Proportion 0.8461 0.91223 0.95255 0.9866 1.00000
a <- summary(pca)
plot(a$importance[3, ], type = "l") # cumulative proportion of variance
The results are not outstanding: the data cannot be reduced to three or fewer components without losing most of the information, since the first three components explain only 45% of the variance. PC1 explains 22% of the variance, and PC1 to PC5 together explain nearly 63%. Nine components are needed to explain more than 90% of the variance.
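The numbers quoted above can be recomputed directly from the eigenvalues. The sketch below copies the eigenvalues from the `get_eigenvalue()` output and derives the number of components kept by the eigenvalue-greater-than-1 rule, the number needed to reach 90% of the variance, and the share retained by three components. The variable names are illustrative.

```r
# Eigenvalues copied from the get_eigenvalue() output above.
eig <- c(2.6128676, 1.5428467, 1.2480974, 1.0698691, 1.0375432, 0.9317870,
         0.8889980, 0.8207588, 0.7940228, 0.4838693, 0.4079514, 0.1613887)

# Cumulative share of the total variance (the eigenvalues sum to 12, one per variable).
cum_share <- cumsum(eig) / sum(eig)

n_kaiser <- sum(eig > 1)                  # components with eigenvalue > 1
n_for_90 <- which(cum_share >= 0.90)[1]   # smallest number of components reaching 90%
share_3  <- cum_share[3]                  # what three components would retain

print(n_kaiser)
print(n_for_90)
print(round(100 * share_3, 1))
```

This gives 5 components for the eigenvalue rule, 9 components for the 90% threshold, and about 45% of the variance for three components, in line with the tables above.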
fviz_pca_var(pca, col.var = "contrib") +
  scale_color_gradient2(low = "#99FF33", mid = "#CC0066",
                        high = "black", midpoint = 11)
fviz_pca_ind(pca, col.ind="cos2", geom = "point", gradient.cols = c("#99FF33", "#CC0066", "black" ))
Let’s check the contributions of the variables to the first five components.
library(gridExtra)
# Contribution of each variable to PC1..PC5, reusing the PCA fitted above
a1 <- fviz_contrib(pca, "var", axes = 1, xtickslab.rt = 90)
a2 <- fviz_contrib(pca, "var", axes = 2, xtickslab.rt = 90)
a3 <- fviz_contrib(pca, "var", axes = 3, xtickslab.rt = 90)
a4 <- fviz_contrib(pca, "var", axes = 4, xtickslab.rt = 90)
a5 <- fviz_contrib(pca, "var", axes = 5, xtickslab.rt = 90)
grid.arrange(a1, top = "Contribution to the first principal component")
grid.arrange(a2, a3, top = "Contribution to the second and third principal components")
grid.arrange(a4, a5, top = "Contribution to the fourth and fifth principal components")
pca_var <- get_pca_var(pca)
fviz_contrib(pca, "var", axes = 1:5, fill = "tomato3", color = "tomato4")
Taken over the first five components together, Energy and Danceability contribute the most to our results.
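The contribution plots can be reproduced by hand, which makes it clearer what they measure. The sketch below is a minimal illustration on the built-in `mtcars` data (a stand-in, not the song data). The eigenvalue weighting used for several components at once is my understanding of what `fviz_contrib()` reports for `axes = 1:5` and should be treated as an assumption.

```r
# Stand-in data: mtcars ships with R; it is NOT the Spotify data set.
p <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Columns of p$rotation have unit length, so squared loadings sum to 1 per component.
contrib <- 100 * p$rotation^2     # percent contribution of each variable to each PC
print(round(contrib[, 1:3], 1))
print(colSums(contrib))           # every column adds up to 100

# Combined contribution over several components, weighted by eigenvalue
# (assumed to mirror the multi-axis output of fviz_contrib()).
axes <- 1:2
eig  <- p$sdev[axes]^2
combined <- as.vector(contrib[, axes] %*% eig) / sum(eig)
names(combined) <- rownames(contrib)
print(round(sort(combined, decreasing = TRUE), 1))

# Reference level: the contribution each variable would have if all were equal.
print(100 / nrow(contrib))
```

A variable whose combined contribution lies above the reference level matters more than average for the selected components. Because of the eigenvalue weighting, the first component dominates the combined ranking.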
Let’s now perform a rotated PCA (varimax rotation) to try to simplify the interpretation of the components. We attempt it with 5 and 8 components.
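Before looking at the rotated results, it may help to see what an orthogonal rotation does and does not change. The sketch below uses `varimax()` from base R on the built-in `mtcars` data (a stand-in, not the song data), with an arbitrary choice of three components. It is only an illustration of the idea and may differ in details from what `principal()` does internally.

```r
# Stand-in data: mtcars ships with R; it is NOT the Spotify data set.
p <- prcomp(mtcars, center = TRUE, scale. = TRUE)
k <- 3   # number of components kept (arbitrary choice for this illustration)

# Loadings scaled by the component standard deviations,
# i.e. correlations between the variables and the components.
L <- p$rotation[, 1:k] %*% diag(p$sdev[1:k])

# Orthogonal varimax rotation from the stats package.
rot   <- varimax(L)
L_rot <- unclass(rot$loadings)

# Rotation redistributes variance among the retained components ...
print(round(rbind(unrotated = colSums(L^2), rotated = colSums(L_rot^2)), 2))

# ... but leaves each variable's communality, and hence the total, unchanged.
h2_before <- rowSums(L^2)
h2_after  <- rowSums(L_rot^2)
print(all.equal(h2_before, h2_after, check.attributes = FALSE))
```

Rotation therefore cannot raise the total variance explained by a fixed number of components. Its purpose is to make each variable load strongly on as few components as possible, which is what makes the loading tables below easier to read.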
dane.pca2<-principal(dane, nfactors=5, rotate="varimax")
dane.pca2
## Principal Components Analysis
## Call: principal(r = dane, nfactors = 5, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 RC3 RC4 RC5 h2 u2 com
## Duration 0.05 0.06 -0.76 -0.14 0.23 0.66 0.34 1.3
## Time_Signature 0.15 0.20 -0.09 -0.04 0.46 0.29 0.71 1.7
## Danceability 0.07 0.84 0.16 -0.16 0.13 0.78 0.22 1.2
## Energy 0.91 0.07 0.02 0.06 0.07 0.84 0.16 1.0
## Key -0.08 -0.12 0.08 0.05 0.83 0.72 0.28 1.1
## Loudness 0.81 -0.16 0.20 -0.12 -0.07 0.73 0.27 1.3
## Speechiness 0.11 0.06 0.72 -0.08 0.19 0.58 0.42 1.2
## Acousticness -0.72 -0.28 0.13 -0.01 -0.12 0.63 0.37 1.4
## Instrumentalness -0.29 0.07 0.19 0.63 0.00 0.53 0.47 1.6
## Liveness 0.27 -0.51 0.15 -0.07 0.20 0.40 0.60 2.1
## Valence 0.41 0.69 -0.03 0.03 0.12 0.66 0.34 1.7
## Tempo 0.21 -0.13 -0.12 0.79 0.00 0.69 0.31 1.3
##
## RC1 RC2 RC3 RC4 RC5
## SS loadings 2.40 1.64 1.26 1.10 1.10
## Proportion Var 0.20 0.14 0.11 0.09 0.09
## Cumulative Var 0.20 0.34 0.44 0.53 0.63
## Proportion Explained 0.32 0.22 0.17 0.15 0.15
## Cumulative Proportion 0.32 0.54 0.71 0.85 1.00
##
## Mean item complexity = 1.4
## Test of the hypothesis that 5 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.1
## with the empirical chi square 1432.15 with prob < 2e-295
##
## Fit based upon off diagonal values = 0.67
summary(dane.pca2)
##
## Factor analysis with Call: principal(r = dane, nfactors = 5, rotate = "varimax")
##
## Test of the hypothesis that 5 factors are sufficient.
## The degrees of freedom for the model is 16 and the objective function was 1.61
## The number of observations was 997 with Chi Square = 1585.64 with prob < 0
##
## The root mean square of the residuals (RMSA) is 0.1
print(loadings(dane.pca2), digits=3, cutoff=0.4, sort=TRUE)
##
## Loadings:
## RC1 RC2 RC3 RC4 RC5
## Energy 0.907
## Loudness 0.806
## Acousticness -0.719
## Danceability 0.843
## Liveness -0.513
## Valence 0.406 0.692
## Duration -0.758
## Speechiness 0.719
## Instrumentalness 0.635
## Tempo 0.786
## Key 0.834
## Time_Signature 0.464
##
## RC1 RC2 RC3 RC4 RC5
## SS loadings 2.404 1.643 1.260 1.103 1.101
## Proportion Var 0.200 0.137 0.105 0.092 0.092
## Cumulative Var 0.200 0.337 0.442 0.534 0.626
The fit based on off-diagonal values is 0.67, and the five rotated components explain 63% of the variance, the same share as the first five unrotated components. The groupings of variables are interesting: some make immediate sense, while others would need more thought before they could be described and interpreted together.
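The `h2`, `u2`, and `SS loadings` entries in the output above follow directly from the loading matrix. The sketch below recomputes them from the two-decimal loadings printed for the 5-component solution. Because the inputs are rounded, the results agree with the reported values only up to about 0.01. The object names are illustrative.

```r
# Rotated loadings copied (two decimals) from the 5-component output above.
vars <- c("Duration", "Time_Signature", "Danceability", "Energy", "Key", "Loudness",
          "Speechiness", "Acousticness", "Instrumentalness", "Liveness", "Valence", "Tempo")
loadings5 <- matrix(c(
   0.05,  0.06, -0.76, -0.14,  0.23,
   0.15,  0.20, -0.09, -0.04,  0.46,
   0.07,  0.84,  0.16, -0.16,  0.13,
   0.91,  0.07,  0.02,  0.06,  0.07,
  -0.08, -0.12,  0.08,  0.05,  0.83,
   0.81, -0.16,  0.20, -0.12, -0.07,
   0.11,  0.06,  0.72, -0.08,  0.19,
  -0.72, -0.28,  0.13, -0.01, -0.12,
  -0.29,  0.07,  0.19,  0.63,  0.00,
   0.27, -0.51,  0.15, -0.07,  0.20,
   0.41,  0.69, -0.03,  0.03,  0.12,
   0.21, -0.13, -0.12,  0.79,  0.00
), ncol = 5, byrow = TRUE, dimnames = list(vars, paste0("RC", 1:5)))

h2 <- rowSums(loadings5^2)   # communality: variance of each variable reproduced by the 5 components
u2 <- 1 - h2                 # uniqueness: variance left unexplained
ss <- colSums(loadings5^2)   # "SS loadings": variance attributed to each rotated component

print(round(cbind(h2, u2), 2))
print(round(ss, 2))
print(round(sum(ss) / nrow(loadings5), 2))   # share of the total variance explained
```

Read this way, the table shows which variables the five components describe poorly: Time_Signature and Liveness have the lowest communalities, so most of their variance lies outside the retained components.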
dane.pca2<-principal(dane, nfactors=8, rotate="varimax")
dane.pca2
## Principal Components Analysis
## Call: principal(r = dane, nfactors = 8, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 RC3 RC4 RC7 RC6 RC8 RC5 h2 u2 com
## Duration 0.04 0.04 -0.76 -0.04 -0.04 0.20 -0.16 0.14 0.66 0.338 1.3
## Time_Signature 0.08 0.08 -0.01 0.01 0.02 0.96 -0.01 -0.01 0.94 0.058 1.0
## Danceability 0.02 0.84 0.12 -0.20 -0.21 0.10 0.00 0.02 0.81 0.193 1.3
## Energy 0.90 0.19 -0.02 0.09 0.14 0.04 0.01 0.02 0.88 0.124 1.2
## Key 0.01 0.00 0.01 0.00 0.02 -0.01 0.02 0.98 0.97 0.032 1.0
## Loudness 0.87 -0.13 0.19 -0.07 0.03 -0.02 -0.11 -0.03 0.83 0.165 1.2
## Speechiness 0.08 0.08 0.77 -0.02 0.01 0.18 -0.11 0.15 0.66 0.336 1.3
## Acousticness -0.70 -0.32 0.13 -0.07 0.05 -0.13 0.07 -0.04 0.64 0.355 1.6
## Instrumentalness -0.11 -0.04 0.03 0.03 -0.01 -0.01 0.98 0.02 0.98 0.025 1.0
## Liveness 0.09 -0.08 0.05 0.00 0.98 0.02 -0.01 0.02 0.97 0.026 1.0
## Valence 0.24 0.86 -0.06 0.11 0.09 0.01 -0.05 -0.02 0.83 0.169 1.2
## Tempo 0.06 -0.06 0.02 0.98 0.00 0.01 0.03 0.00 0.97 0.026 1.0
##
## RC1 RC2 RC3 RC4 RC7 RC6 RC8 RC5
## SS loadings 2.16 1.63 1.23 1.04 1.03 1.03 1.02 1.01
## Proportion Var 0.18 0.14 0.10 0.09 0.09 0.09 0.08 0.08
## Cumulative Var 0.18 0.32 0.42 0.51 0.59 0.68 0.76 0.85
## Proportion Explained 0.21 0.16 0.12 0.10 0.10 0.10 0.10 0.10
## Cumulative Proportion 0.21 0.37 0.49 0.60 0.70 0.80 0.90 1.00
##
## Mean item complexity = 1.2
## Test of the hypothesis that 8 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.07
## with the empirical chi square 587.46 with prob < NA
##
## Fit based upon off diagonal values = 0.87
summary(dane.pca2)
##
## Factor analysis with Call: principal(r = dane, nfactors = 8, rotate = "varimax")
##
## Test of the hypothesis that 8 factors are sufficient.
## The degrees of freedom for the model is -2 and the objective function was 1.68
## The number of observations was 997 with Chi Square = 1658.13 with prob < NA
##
## The root mean square of the residuals (RMSA) is 0.07
print(loadings(dane.pca2), digits=3, cutoff=0.4, sort=TRUE)
##
## Loadings:
## RC1 RC2 RC3 RC4 RC7 RC6 RC8 RC5
## Energy 0.900
## Loudness 0.872
## Acousticness -0.703
## Danceability 0.836
## Valence 0.864
## Duration -0.756
## Speechiness 0.765
## Tempo 0.983
## Liveness 0.978
## Time_Signature 0.963
## Instrumentalness 0.979
## Key 0.983
##
## RC1 RC2 RC3 RC4 RC7 RC6 RC8 RC5
## SS loadings 2.163 1.628 1.234 1.039 1.033 1.026 1.018 1.012
## Proportion Var 0.180 0.136 0.103 0.087 0.086 0.085 0.085 0.084
## Cumulative Var 0.180 0.316 0.419 0.505 0.591 0.677 0.762 0.846
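In the rotated solutions, Energy loads on the same component as Loudness (positively) and Acousticness (negatively), which is quite logical: louder songs tend to sound more energetic, while acoustic songs tend to be less so. Danceability and Valence also share a component, plausibly because more positive songs are more danceable. Duration and Speechiness load on a common component with opposite signs, and I have no ready interpretation for that. In the 8-component solution, Tempo, Liveness, Time_Signature, Instrumentalness, and Key each end up on a component of their own.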
In summary, PCA is a useful tool for exploratory data analysis. It was applied to Spotify data for songs from the 1980s. Although the variables could not be reduced to three dimensions, the analysis still proved to be interesting. The weak overlap between most variables suggests that Spotify does a good job of selecting and creating variables for its data: they largely carry distinct information.