Dimension Reduction with Spotify Data for Songs from the 1980s

The aim of this work is to apply principal component analysis (PCA) to data on songs from the 1980s available on Spotify. PCA is used to reduce the number of variables while preserving as much of the variance in the data as possible.

Data source

https://www.kaggle.com/datasets/thebumpkin/1980s-classic-hits-with-spotify-data - This dataset comprises 997 classic hit songs from the 1980s, featuring tracks from 478 different artists.

Variables

Duration - The length of the song, given as minutes and seconds (m:ss) in the dataset and converted to seconds during data preparation.

Time signature - The musical meter of the song, indicating the number of beats per measure.

Danceability - A measure of how suitable a track is for dancing, based on tempo, rhythm stability, beat strength, and overall regularity.

Energy - A measure of intensity and activity in the song, with higher values indicating a more energetic track.

Key - The musical key in which the song is composed.

Loudness - The average volume of the song, measured in decibels (dB).

Speechiness - A measure of the presence of spoken words in a track, with higher values indicating more speech-like qualities.

Acousticness - A measure of the track’s acoustic quality, with higher values indicating a greater likelihood of being acoustic.

Instrumentalness - A measure of the likelihood that the track contains no vocals, with higher values representing more instrumental tracks.

Liveness - A measure of the likelihood that the track was performed live, with higher values indicating more audience noise.

Valence - A measure of the musical positiveness of the track, with higher values indicating more positive or happy music.

Tempo - The speed or pace of the track, measured in beats per minute (BPM).

Data preparation

The variable Mode was removed because it is binary: PCA is built on variances and linear correlations, which are hard to interpret for a two-valued variable. The variable Popularity was also removed, as it reflects current popularity rather than that of the 1980s. It is also not directly related to the content of the songs, whereas our focus is on their musical attributes.

library(corrplot)    # corrplot()
library(factoextra)  # fviz_*(), get_eigenvalue(), get_pca_var()
library(psych)       # principal()

dane <- read.csv(file = "1980sClassics.csv", header = TRUE, sep = ",")
dane <- unique(dane)  # drop duplicated rows

# Convert a "m:ss" string to the total number of seconds
convert_to_seconds <- function(time_str) {
  parts <- strsplit(time_str, ":")[[1]]
  minutes <- as.numeric(parts[1])
  seconds <- as.numeric(parts[2])
  minutes * 60 + seconds
}
dane$Duration <- vapply(dane$Duration, convert_to_seconds, numeric(1), USE.NAMES = FALSE)

dane <- dane[-1:-2]    # names of artists and songs
dane <- dane[-7]       # binary variable (Mode)
dane <- dane[-13:-14]  # years and popularity (on Spotify)
dane <- na.omit(dane)
summary(dane)

##     Duration     Time_Signature   Danceability        Energy      
##  Min.   : 41.0   Min.   :1.000   Min.   :0.1740   Min.   :0.0183  
##  1st Qu.:205.0   1st Qu.:4.000   1st Qu.:0.5340   1st Qu.:0.4890  
##  Median :234.0   Median :4.000   Median :0.6330   Median :0.6520  
##  Mean   :240.2   Mean   :3.967   Mean   :0.6265   Mean   :0.6335  
##  3rd Qu.:270.0   3rd Qu.:4.000   3rd Qu.:0.7350   3rd Qu.:0.7970  
##  Max.   :929.0   Max.   :5.000   Max.   :0.9880   Max.   :0.9940  
##       Key           Loudness        Speechiness       Acousticness      
##  Min.   : 0.00   Min.   :-28.980   Min.   :0.02270   Min.   :0.0000035  
##  1st Qu.: 2.00   1st Qu.:-11.262   1st Qu.:0.03170   1st Qu.:0.0433000  
##  Median : 5.00   Median : -8.269   Median :0.03930   Median :0.1550000  
##  Mean   : 5.23   Mean   : -8.885   Mean   :0.05763   Mean   :0.2442580  
##  3rd Qu.: 9.00   3rd Qu.: -6.042   3rd Qu.:0.05650   3rd Qu.:0.3850000  
##  Max.   :11.00   Max.   : -1.496   Max.   :0.52400   Max.   :0.9960000  
##  Instrumentalness       Liveness         Valence           Tempo       
##  Min.   :0.0000000   Min.   :0.0223   Min.   :0.0287   Min.   : 61.53  
##  1st Qu.:0.0000000   1st Qu.:0.0839   1st Qu.:0.3880   1st Qu.:102.48  
##  Median :0.0000221   Median :0.1130   Median :0.6440   Median :119.97  
##  Mean   :0.0425860   Mean   :0.1787   Mean   :0.6030   Mean   :120.94  
##  3rd Qu.:0.0013800   3rd Qu.:0.2260   3rd Qu.:0.8250   3rd Qu.:135.00  
##  Max.   :0.9740000   Max.   :0.9810   Max.   :0.9840   Max.   :208.57
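
The preparation step above removes columns by position, which depends on the column order of the CSV file. A minimal sketch of the same idea using column names is shown here on a toy data frame. The names Track, Artist, Mode, Year and Popularity are assumptions made for illustration and are not confirmed from the file.

```r
# Toy data frame standing in for the CSV; all column names are hypothetical.
toy <- data.frame(
  Track = c("a", "b", "c"),
  Artist = c("x", "y", "z"),
  Duration = c("3:45", "4:02", "2:58"),
  Energy = c(0.81, 0.43, 0.67),
  Mode = c(1, 0, 1),
  Year = c(1983, 1985, 1989),
  Popularity = c(70, 55, 62),
  stringsAsFactors = FALSE
)

# Drop identifier, binary and non-musical columns by name instead of position.
drop_cols <- c("Track", "Artist", "Mode", "Year", "Popularity")
toy_num <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

# Convert every "m:ss" string to seconds.
parts <- strsplit(toy_num$Duration, ":", fixed = TRUE)
toy_num$Duration <- vapply(
  parts,
  function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
  numeric(1)
)

names(toy_num)
toy_num$Duration
```

Selecting by name keeps working if the file gains or reorders columns, at the cost of having to know the exact names.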

Correlation matrix

dane <- as.data.frame(scale(dane))  # standardize every column to mean 0, sd 1
cor_w <- cor(dane)
corrplot(cor_w, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.5)

The correlation matrix shows strong correlations between several variables, while others, such as Key, Tempo, Speechiness, and Liveness, show almost no correlation with the rest.
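
To complement the plot, the strongest pairs can also be listed numerically. The sketch below uses the built-in mtcars data as a stand-in (an assumption, so that it runs without the Spotify file); the name pair_tab is introduced only for this example.

```r
# mtcars is a built-in stand-in; the Spotify file is not needed to run this.
vars <- c("mpg", "disp", "hp", "wt", "qsec")
x <- as.data.frame(scale(mtcars[, vars]))
cor_m <- cor(x)

# Keep each pair once (lower triangle, no diagonal).
idx <- which(lower.tri(cor_m), arr.ind = TRUE)
pair_tab <- data.frame(
  var1 = rownames(cor_m)[idx[, "row"]],
  var2 = colnames(cor_m)[idx[, "col"]],
  r = cor_m[idx]
)

# Sort by absolute correlation, strongest first.
pair_tab <- pair_tab[order(-abs(pair_tab$r)), ]
head(pair_tab, 3)

# Standardizing the columns does not change their correlations.
all.equal(cor(mtcars[, vars]), cor_m)
```

Variables that appear only near the bottom of such a table are the ones PCA cannot merge into shared components.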

PCA

The data were standardized because some variables differ greatly in their value ranges, which would otherwise let the variables with the largest variances dominate the components. Following the Kaiser criterion, five components should be retained, because their eigenvalues are greater than 1.

pca <- prcomp(dane, center = TRUE, scale. = TRUE)
fviz_eig(pca, choice='eigenvalue')

fviz_eig(pca)

eig.val <- get_eigenvalue(pca)
eig.val

##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   2.6128676        21.773896                    21.77390
## Dim.2   1.5428467        12.857055                    34.63095
## Dim.3   1.2480974        10.400812                    45.03176
## Dim.4   1.0698691         8.915576                    53.94734
## Dim.5   1.0375432         8.646194                    62.59353
## Dim.6   0.9317870         7.764891                    70.35842
## Dim.7   0.8889980         7.408317                    77.76674
## Dim.8   0.8207588         6.839657                    84.60640
## Dim.9   0.7940228         6.616857                    91.22325
## Dim.10  0.4838693         4.032244                    95.25550
## Dim.11  0.4079514         3.399595                    98.65509
## Dim.12  0.1613887         1.344905                   100.00000
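
The eigenvalues in this table are the variances of the principal components. For standardized data they coincide with the eigenvalues of the correlation matrix, which is why a value of 1 (the variance of a single standardized variable) is a natural cut-off. A small sketch on the built-in USArrests data, used here only as a stand-in:

```r
# USArrests is a built-in stand-in for the song features.
p <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Squared standard deviations of the components are the eigenvalues.
eig_prcomp <- p$sdev^2
eig_cor <- eigen(cor(USArrests))$values

all.equal(eig_prcomp, eig_cor)

# Kaiser criterion: count components with an eigenvalue above 1.
sum(eig_prcomp > 1)

# Eigenvalues of a correlation matrix sum to the number of variables.
sum(eig_prcomp)
```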

summary(pca)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.6164 1.2421 1.1172 1.03434 1.01860 0.96529 0.94287
## Proportion of Variance 0.2177 0.1286 0.1040 0.08916 0.08646 0.07765 0.07408
## Cumulative Proportion  0.2177 0.3463 0.4503 0.53947 0.62594 0.70358 0.77767
##                           PC8     PC9    PC10   PC11    PC12
## Standard deviation     0.9060 0.89108 0.69561 0.6387 0.40173
## Proportion of Variance 0.0684 0.06617 0.04032 0.0340 0.01345
## Cumulative Proportion  0.8461 0.91223 0.95255 0.9866 1.00000

a<-summary(pca)
plot(a$importance[3,],type="l")

The results are not outstanding, as the data cannot be reduced to three or fewer components without a substantial loss of information. PC1 explains 22% of the variance, and PC1 to PC5 together explain nearly 63%. The first nine components are needed to explain 90% of the variance.
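
These figures can be reproduced directly from the eigenvalues printed by get_eigenvalue(pca). The sketch below copies them by hand; the helper n_for is a name introduced only for this example.

```r
# Eigenvalues copied from the get_eigenvalue(pca) output above.
eig <- c(2.6128676, 1.5428467, 1.2480974, 1.0698691, 1.0375432, 0.9317870,
         0.8889980, 0.8207588, 0.7940228, 0.4838693, 0.4079514, 0.1613887)

# Cumulative share of variance explained by the first k components.
cum_var <- cumsum(eig) / sum(eig)

# Smallest number of components whose cumulative share reaches a target.
n_for <- function(target) which(cum_var >= target)[1]

sum(eig > 1)                # components passing the Kaiser criterion
n_for(0.80)                 # components needed for 80% of the variance
n_for(0.90)                 # components needed for 90% of the variance
round(cum_var[c(3, 5)], 3)  # share retained by 3 and by 5 components
```

Three components retain only about 45% of the variance, which is why a three-dimensional summary is not adequate here.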

First 2 components

fviz_pca_var(pca, col.var="contrib")+
scale_color_gradient2(low="#99FF33", mid="#CC0066", 
                      high="black", midpoint=11)

fviz_pca_ind(pca, col.ind="cos2", geom = "point", gradient.cols = c("#99FF33", "#CC0066", "black" ))

Contributions

Let’s check the contributions of the variables to the first five components.

library(gridExtra)
# Contribution of each variable to each of the first five components,
# reusing the pca object fitted above
a1 <- fviz_contrib(pca, "var", axes = 1, xtickslab.rt = 90)
a2 <- fviz_contrib(pca, "var", axes = 2, xtickslab.rt = 90)
a3 <- fviz_contrib(pca, "var", axes = 3, xtickslab.rt = 90)
a4 <- fviz_contrib(pca, "var", axes = 4, xtickslab.rt = 90)
a5 <- fviz_contrib(pca, "var", axes = 5, xtickslab.rt = 90)
grid.arrange(a1, top = "Contribution to PC1")

grid.arrange(a2, a3, top = "Contributions to PC2 and PC3")

grid.arrange(a4, a5, top = "Contributions to PC4 and PC5")

pca_var <- get_pca_var(pca)
fviz_contrib(pca, "var", axes = 1:5, fill = "tomato3", color = "tomato4")

Across the first five components combined, Energy and Danceability contribute the most.
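
For readers who want to see where such percentages come from, the sketch below computes contributions by hand in base R. It follows the definition as I understand it from factoextra (squared variable coordinate as a share of the component total, with an eigenvalue-weighted average across several components); this is an assumption about the package, not its actual code. USArrests is a built-in stand-in for the song features.

```r
# USArrests is a built-in stand-in for the song features.
p <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Variable coordinates: loadings scaled by the component standard deviations.
coord <- sweep(p$rotation, 2, p$sdev, "*")

# Contribution (%) of each variable to each component.
cos2 <- coord^2
contrib <- sweep(cos2, 2, colSums(cos2), "/") * 100

# Combined contribution to the first k components, weighted by eigenvalue.
k <- 2
eig <- p$sdev^2
contrib_k <- as.vector(contrib[, 1:k] %*% eig[1:k]) / sum(eig[1:k])
names(contrib_k) <- rownames(contrib)

round(contrib, 1)
sort(contrib_k, decreasing = TRUE)

# With p variables, a perfectly uniform contribution would be 100 / p percent.
100 / ncol(USArrests)
```

Because each column of the rotation matrix has unit length, the per-component contribution reduces to the squared loading times 100.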

Rotated PCA

Let’s now perform a rotated PCA (varimax rotation) to try to simplify the interpretation of the components. We attempt it with 5 and with 8 components.

dane.pca2<-principal(dane, nfactors=5, rotate="varimax")
dane.pca2

## Principal Components Analysis
## Call: principal(r = dane, nfactors = 5, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                    RC1   RC2   RC3   RC4   RC5   h2   u2 com
## Duration          0.05  0.06 -0.76 -0.14  0.23 0.66 0.34 1.3
## Time_Signature    0.15  0.20 -0.09 -0.04  0.46 0.29 0.71 1.7
## Danceability      0.07  0.84  0.16 -0.16  0.13 0.78 0.22 1.2
## Energy            0.91  0.07  0.02  0.06  0.07 0.84 0.16 1.0
## Key              -0.08 -0.12  0.08  0.05  0.83 0.72 0.28 1.1
## Loudness          0.81 -0.16  0.20 -0.12 -0.07 0.73 0.27 1.3
## Speechiness       0.11  0.06  0.72 -0.08  0.19 0.58 0.42 1.2
## Acousticness     -0.72 -0.28  0.13 -0.01 -0.12 0.63 0.37 1.4
## Instrumentalness -0.29  0.07  0.19  0.63  0.00 0.53 0.47 1.6
## Liveness          0.27 -0.51  0.15 -0.07  0.20 0.40 0.60 2.1
## Valence           0.41  0.69 -0.03  0.03  0.12 0.66 0.34 1.7
## Tempo             0.21 -0.13 -0.12  0.79  0.00 0.69 0.31 1.3
## 
##                        RC1  RC2  RC3  RC4  RC5
## SS loadings           2.40 1.64 1.26 1.10 1.10
## Proportion Var        0.20 0.14 0.11 0.09 0.09
## Cumulative Var        0.20 0.34 0.44 0.53 0.63
## Proportion Explained  0.32 0.22 0.17 0.15 0.15
## Cumulative Proportion 0.32 0.54 0.71 0.85 1.00
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 5 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.1 
##  with the empirical chi square  1432.15  with prob <  2e-295 
## 
## Fit based upon off diagonal values = 0.67

summary(dane.pca2)

## 
## Factor analysis with Call: principal(r = dane, nfactors = 5, rotate = "varimax")
## 
## Test of the hypothesis that 5 factors are sufficient.
## The degrees of freedom for the model is 16  and the objective function was  1.61 
## The number of observations was  997  with Chi Square =  1585.64  with prob <  0 
## 
## The root mean square of the residuals (RMSA) is  0.1

print(loadings(dane.pca2), digits=3, cutoff=0.4, sort=TRUE)

## 
## Loadings:
##                  RC1    RC2    RC3    RC4    RC5   
## Energy            0.907                            
## Loudness          0.806                            
## Acousticness     -0.719                            
## Danceability             0.843                     
## Liveness                -0.513                     
## Valence           0.406  0.692                     
## Duration                       -0.758              
## Speechiness                     0.719              
## Instrumentalness                       0.635       
## Tempo                                  0.786       
## Key                                           0.834
## Time_Signature                                0.464
## 
##                  RC1   RC2   RC3   RC4   RC5
## SS loadings    2.404 1.643 1.260 1.103 1.101
## Proportion Var 0.200 0.137 0.105 0.092 0.092
## Cumulative Var 0.200 0.337 0.442 0.534 0.626

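The fit based on off-diagonal values is 0.67, and the five rotated components together explain 63% of the variance, the same share as the first five unrotated components. The combinations of variables are interesting: RC1 groups Energy and Loudness with (negatively) Acousticness; RC2 groups Danceability and Valence with (negatively) Liveness; RC3 contrasts Speechiness with Duration; RC4 groups Tempo and Instrumentalness; RC5 groups Key and Time_Signature. In some cases these make sense, while in others it would require some thought on how they could be described and interpreted together.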
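
The rotation redistributes variance among the retained components without changing how much they explain in total, which is why the cumulative share stays at 63% while RC1 drops from 22% to 20%. A minimal sketch of this property using varimax() from base R on the built-in mtcars data follows. Both the data and k = 3 are stand-in choices, and this is not necessarily identical to what principal() does internally.

```r
# mtcars is a built-in stand-in; k = 3 components is an arbitrary choice here.
k <- 3
p <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Unrotated loadings: eigenvectors scaled by component standard deviations.
L <- sweep(p$rotation[, 1:k], 2, p$sdev[1:k], "*")

# Varimax rotation from base R (stats package).
rot <- varimax(L)
L_rot <- unclass(rot$loadings)

# Communalities (h2) are unchanged by an orthogonal rotation.
h2_before <- rowSums(L^2)
h2_after <- rowSums(L_rot^2)
all.equal(h2_before, h2_after)

# Total variance explained is unchanged; only its split across components moves.
c(before = sum(L^2), after = sum(L_rot^2))
round(rbind(unrotated = colSums(L^2), rotated = colSums(L_rot^2)), 2)
```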

dane.pca2<-principal(dane, nfactors=8, rotate="varimax")
dane.pca2

## Principal Components Analysis
## Call: principal(r = dane, nfactors = 8, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                    RC1   RC2   RC3   RC4   RC7   RC6   RC8   RC5   h2    u2 com
## Duration          0.04  0.04 -0.76 -0.04 -0.04  0.20 -0.16  0.14 0.66 0.338 1.3
## Time_Signature    0.08  0.08 -0.01  0.01  0.02  0.96 -0.01 -0.01 0.94 0.058 1.0
## Danceability      0.02  0.84  0.12 -0.20 -0.21  0.10  0.00  0.02 0.81 0.193 1.3
## Energy            0.90  0.19 -0.02  0.09  0.14  0.04  0.01  0.02 0.88 0.124 1.2
## Key               0.01  0.00  0.01  0.00  0.02 -0.01  0.02  0.98 0.97 0.032 1.0
## Loudness          0.87 -0.13  0.19 -0.07  0.03 -0.02 -0.11 -0.03 0.83 0.165 1.2
## Speechiness       0.08  0.08  0.77 -0.02  0.01  0.18 -0.11  0.15 0.66 0.336 1.3
## Acousticness     -0.70 -0.32  0.13 -0.07  0.05 -0.13  0.07 -0.04 0.64 0.355 1.6
## Instrumentalness -0.11 -0.04  0.03  0.03 -0.01 -0.01  0.98  0.02 0.98 0.025 1.0
## Liveness          0.09 -0.08  0.05  0.00  0.98  0.02 -0.01  0.02 0.97 0.026 1.0
## Valence           0.24  0.86 -0.06  0.11  0.09  0.01 -0.05 -0.02 0.83 0.169 1.2
## Tempo             0.06 -0.06  0.02  0.98  0.00  0.01  0.03  0.00 0.97 0.026 1.0
## 
##                        RC1  RC2  RC3  RC4  RC7  RC6  RC8  RC5
## SS loadings           2.16 1.63 1.23 1.04 1.03 1.03 1.02 1.01
## Proportion Var        0.18 0.14 0.10 0.09 0.09 0.09 0.08 0.08
## Cumulative Var        0.18 0.32 0.42 0.51 0.59 0.68 0.76 0.85
## Proportion Explained  0.21 0.16 0.12 0.10 0.10 0.10 0.10 0.10
## Cumulative Proportion 0.21 0.37 0.49 0.60 0.70 0.80 0.90 1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 8 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.07 
##  with the empirical chi square  587.46  with prob <  NA 
## 
## Fit based upon off diagonal values = 0.87

summary(dane.pca2)

## 
## Factor analysis with Call: principal(r = dane, nfactors = 8, rotate = "varimax")
## 
## Test of the hypothesis that 8 factors are sufficient.
## The degrees of freedom for the model is -2  and the objective function was  1.68 
## The number of observations was  997  with Chi Square =  1658.13  with prob <  NA 
## 
## The root mean square of the residuals (RMSA) is  0.07

print(loadings(dane.pca2), digits=3, cutoff=0.4, sort=TRUE)

## 
## Loadings:
##                  RC1    RC2    RC3    RC4    RC7    RC6    RC8    RC5   
## Energy            0.900                                                 
## Loudness          0.872                                                 
## Acousticness     -0.703                                                 
## Danceability             0.836                                          
## Valence                  0.864                                          
## Duration                       -0.756                                   
## Speechiness                     0.765                                   
## Tempo                                  0.983                            
## Liveness                                      0.978                     
## Time_Signature                                       0.963              
## Instrumentalness                                            0.979       
## Key                                                                0.983
## 
##                  RC1   RC2   RC3   RC4   RC7   RC6   RC8   RC5
## SS loadings    2.163 1.628 1.234 1.039 1.033 1.026 1.018 1.012
## Proportion Var 0.180 0.136 0.103 0.087 0.086 0.085 0.085 0.084
## Cumulative Var 0.180 0.316 0.419 0.505 0.591 0.677 0.762 0.846

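With eight components, Energy again loads together with Loudness and, negatively, with Acousticness (RC1), which is logical: loud songs tend to sound more energetic, and acoustic songs less so. Danceability and Valence load together (RC2), perhaps because more positive songs are more danceable. I have no clear interpretation for the pairing of Duration and Speechiness (RC3), which load with opposite signs. The remaining five components each correspond to a single variable (Tempo, Liveness, Time_Signature, Instrumentalness, and Key).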

Summary

In summary, PCA is a useful tool for data analysis, and it was applied here to Spotify data for songs from the 1980s. Although the 12 variables could not be reduced to three dimensions without losing more than half of the variance, the analysis still proved to be interesting. The weak overlap between the variables suggests that Spotify does a good job of selecting and creating variables that each contain largely unique information.