Introduction

This project uses Principal Component Analysis (PCA) to reduce the dimensionality of a Spotify music dataset that contains the features of top songs. The goal is to identify the factors that contribute most to the characteristics of the music while making the dataset simpler to analyze. PCA transforms the original variables into a smaller set of uncorrelated components that preserve as much variance as possible, so most of the information is kept. The reduced representation makes it easier to understand the structure and defining features of popular tracks.

Data set

I used a data set from Kaggle. The data was gathered by using a Python script that extracts top songs and corresponding audio and descriptive features from Spotify’s API. The descriptive features provide metadata about the song, like the name of the artist, album title, and release date. The audio features, according to Spotify’s audio analysis, comprise attributes like key, valence, danceability, and energy. Here is the set:

https://www.kaggle.com/datasets/solomonameh/spotify-music-dataset?resource=download

Audio features:

Descriptive features:

Data preparation and cleaning

df_spotify_music<-read.csv("high_popularity_spotify_data.csv", sep=",", dec=".", header=TRUE)
summary(df_spotify_music)
##      energy            tempo        danceability    playlist_genre    
##  Min.   :0.00161   Min.   : 49.3   Min.   :0.1360   Length:1686       
##  1st Qu.:0.55100   1st Qu.:100.1   1st Qu.:0.5433   Class :character  
##  Median :0.68900   Median :120.0   Median :0.6645   Mode  :character  
##  Mean   :0.66722   Mean   :121.1   Mean   :0.6504                     
##  3rd Qu.:0.80700   3rd Qu.:136.8   3rd Qu.:0.7690                     
##  Max.   :0.99000   Max.   :209.7   Max.   :0.9790                     
##     loudness          liveness         valence       track_artist      
##  Min.   :-43.643   Min.   :0.0210   Min.   :0.0348   Length:1686       
##  1st Qu.: -7.950   1st Qu.:0.0934   1st Qu.:0.3390   Class :character  
##  Median : -5.974   Median :0.1210   Median :0.5280   Mode  :character  
##  Mean   : -6.704   Mean   :0.1716   Mean   :0.5257                     
##  3rd Qu.: -4.687   3rd Qu.:0.2100   3rd Qu.:0.7200                     
##  Max.   :  1.295   Max.   :0.9500   Max.   :0.9780                     
##  time_signature  speechiness     track_popularity  track_href       
##  Min.   :1.00   Min.   :0.0232   Min.   : 68.00   Length:1686       
##  1st Qu.:4.00   1st Qu.:0.0379   1st Qu.: 71.00   Class :character  
##  Median :4.00   Median :0.0581   Median : 75.00   Mode  :character  
##  Mean   :3.95   Mean   :0.1009   Mean   : 75.81                     
##  3rd Qu.:4.00   3rd Qu.:0.1180   3rd Qu.: 79.00                     
##  Max.   :5.00   Max.   :0.8480   Max.   :100.00                     
##      uri            track_album_name   playlist_name      analysis_url      
##  Length:1686        Length:1686        Length:1686        Length:1686       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    track_id          track_name        track_album_release_date
##  Length:1686        Length:1686        Length:1686             
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  instrumentalness    track_album_id          mode             key        
##  Min.   :0.0000000   Length:1686        Min.   :0.0000   Min.   : 0.000  
##  1st Qu.:0.0000000   Class :character   1st Qu.:0.0000   1st Qu.: 2.000  
##  Median :0.0000060   Mode  :character   Median :1.0000   Median : 5.000  
##  Mean   :0.0415204                      Mean   :0.5783   Mean   : 5.338  
##  3rd Qu.:0.0008138                      3rd Qu.:1.0000   3rd Qu.: 8.000  
##  Max.   :0.9710000                      Max.   :1.0000   Max.   :11.000  
##   duration_ms      acousticness            id            playlist_subgenre 
##  Min.   : 61673   Min.   :0.0000133   Length:1686        Length:1686       
##  1st Qu.:176608   1st Qu.:0.0230500   Class :character   Class :character  
##  Median :211180   Median :0.1240000   Mode  :character   Mode  :character  
##  Mean   :214562   Mean   :0.2212195                                        
##  3rd Qu.:244993   3rd Qu.:0.3347500                                        
##  Max.   :547107   Max.   :0.9950000                                        
##      type           playlist_id       
##  Length:1686        Length:1686       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
dim(df_spotify_music)
## [1] 1686   29
head(df_spotify_music,5)
##   energy   tempo danceability playlist_genre loudness liveness valence
## 1  0.592 157.969        0.521            pop   -7.777    0.122   0.535
## 2  0.507 104.978        0.747            pop  -10.171    0.117   0.438
## 3  0.808 108.548        0.554            pop   -4.169    0.159   0.372
## 4  0.910 112.966        0.670            pop   -4.070    0.304   0.786
## 5  0.783 149.027        0.777            pop   -4.477    0.355   0.939
##            track_artist time_signature speechiness track_popularity
## 1 Lady Gaga, Bruno Mars              3      0.0304              100
## 2         Billie Eilish              4      0.0358               97
## 3         Gracie Abrams              4      0.0368               93
## 4     Sabrina Carpenter              4      0.0634               81
## 5      ROSÉ, Bruno Mars              4      0.2600               98
##                                                 track_href
## 1 https://api.spotify.com/v1/tracks/2plbrEY59IikOBgBGLjaoe
## 2 https://api.spotify.com/v1/tracks/6dOtVTDdiauQNBQEDOtlAB
## 3 https://api.spotify.com/v1/tracks/7ne4VBA60CxGM75vw0EYad
## 4 https://api.spotify.com/v1/tracks/1d7Ptw3qYcfpdLNL5REhtJ
## 5 https://api.spotify.com/v1/tracks/5vNRhkKd0yEAg8suGBpjeY
##                                    uri          track_album_name
## 1 spotify:track:2plbrEY59IikOBgBGLjaoe          Die With A Smile
## 2 spotify:track:6dOtVTDdiauQNBQEDOtlAB      HIT ME HARD AND SOFT
## 3 spotify:track:7ne4VBA60CxGM75vw0EYad The Secret of Us (Deluxe)
## 4 spotify:track:1d7Ptw3qYcfpdLNL5REhtJ            Short n' Sweet
## 5 spotify:track:5vNRhkKd0yEAg8suGBpjeY                      APT.
##      playlist_name
## 1 Today's Top Hits
## 2 Today's Top Hits
## 3 Today's Top Hits
## 4 Today's Top Hits
## 5 Today's Top Hits
##                                                       analysis_url
## 1 https://api.spotify.com/v1/audio-analysis/2plbrEY59IikOBgBGLjaoe
## 2 https://api.spotify.com/v1/audio-analysis/6dOtVTDdiauQNBQEDOtlAB
## 3 https://api.spotify.com/v1/audio-analysis/7ne4VBA60CxGM75vw0EYad
## 4 https://api.spotify.com/v1/audio-analysis/1d7Ptw3qYcfpdLNL5REhtJ
## 5 https://api.spotify.com/v1/audio-analysis/5vNRhkKd0yEAg8suGBpjeY
##                 track_id         track_name track_album_release_date
## 1 2plbrEY59IikOBgBGLjaoe   Die With A Smile               2024-08-16
## 2 6dOtVTDdiauQNBQEDOtlAB BIRDS OF A FEATHER               2024-05-17
## 3 7ne4VBA60CxGM75vw0EYad     That’s So True               2024-10-18
## 4 1d7Ptw3qYcfpdLNL5REhtJ              Taste               2024-08-23
## 5 5vNRhkKd0yEAg8suGBpjeY               APT.               2024-10-18
##   instrumentalness         track_album_id mode key duration_ms acousticness
## 1           0.0000 10FLjwfpbxLmW8c25Xyc2N    0   6      251668       0.3080
## 2           0.0608 7aJuG4TFXa2hmE4z1yxc3n    1   2      210373       0.2000
## 3           0.0000 0hBRqPYPXhr1RkTDG3n4Mk    1   1      166300       0.2140
## 4           0.0000 4B4Elma4nNDUyl6D5PvQkj    0   0      157280       0.0939
## 5           0.0000 2IYQwwgxgOIn7t3iF6ufFD    0   0      169917       0.0283
##                       id playlist_subgenre           type
## 1 2plbrEY59IikOBgBGLjaoe        mainstream audio_features
## 2 6dOtVTDdiauQNBQEDOtlAB        mainstream audio_features
## 3 7ne4VBA60CxGM75vw0EYad        mainstream audio_features
## 4 1d7Ptw3qYcfpdLNL5REhtJ        mainstream audio_features
## 5 5vNRhkKd0yEAg8suGBpjeY        mainstream audio_features
##              playlist_id
## 1 37i9dQZF1DXcBWIGoYBM5M
## 2 37i9dQZF1DXcBWIGoYBM5M
## 3 37i9dQZF1DXcBWIGoYBM5M
## 4 37i9dQZF1DXcBWIGoYBM5M
## 5 37i9dQZF1DXcBWIGoYBM5M

Since the data set contains descriptive features that PCA cannot use, I start by removing all character columns and keeping only the numeric ones.

library(dplyr)

df_spotify_music <- df_spotify_music %>%
  select(
    -playlist_genre, -track_artist, -track_href, -uri, -track_album_name,
    -playlist_name, -analysis_url, -track_id, -track_name, -track_album_release_date,
    -track_album_id, -id, -playlist_subgenre, -type, -playlist_id
  )

head(df_spotify_music,5)
##   energy   tempo danceability loudness liveness valence time_signature
## 1  0.592 157.969        0.521   -7.777    0.122   0.535              3
## 2  0.507 104.978        0.747  -10.171    0.117   0.438              4
## 3  0.808 108.548        0.554   -4.169    0.159   0.372              4
## 4  0.910 112.966        0.670   -4.070    0.304   0.786              4
## 5  0.783 149.027        0.777   -4.477    0.355   0.939              4
##   speechiness track_popularity instrumentalness mode key duration_ms
## 1      0.0304              100           0.0000    0   6      251668
## 2      0.0358               97           0.0608    1   2      210373
## 3      0.0368               93           0.0000    1   1      166300
## 4      0.0634               81           0.0000    0   0      157280
## 5      0.2600               98           0.0000    0   0      169917
##   acousticness
## 1       0.3080
## 2       0.2000
## 3       0.2140
## 4       0.0939
## 5       0.0283
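The same filtering can also be done by column type instead of by listing names, which avoids missing a column. The following is a minimal sketch in base R, assuming a small hypothetical data frame `toy` whose numeric values are copied from the first rows of the `head()` output above; `toy`, `numeric_cols`, and `toy_numeric` are illustrative names, not part of the original analysis.

```r
# Toy stand-in for the Spotify data: two numeric and two character columns
toy <- data.frame(
  energy         = c(0.592, 0.507, 0.808),
  tempo          = c(157.969, 104.978, 108.548),
  playlist_genre = c("pop", "pop", "pop"),
  track_name     = c("a", "b", "c"),   # placeholder names
  stringsAsFactors = FALSE
)

# Flag every column that is numeric (integer or double)
numeric_cols <- vapply(toy, is.numeric, logical(1))

# Keep only the flagged columns; drop = FALSE keeps a data frame
toy_numeric <- toy[, numeric_cols, drop = FALSE]

print(names(toy_numeric))
```

With dplyr 1.0.0 or later, `select(where(is.numeric))` should give the same result inside a pipe.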

Missing values

summary(df_spotify_music)
##      energy            tempo        danceability       loudness      
##  Min.   :0.00161   Min.   : 49.3   Min.   :0.1360   Min.   :-43.643  
##  1st Qu.:0.55100   1st Qu.:100.1   1st Qu.:0.5433   1st Qu.: -7.950  
##  Median :0.68900   Median :120.0   Median :0.6645   Median : -5.974  
##  Mean   :0.66722   Mean   :121.1   Mean   :0.6504   Mean   : -6.704  
##  3rd Qu.:0.80700   3rd Qu.:136.8   3rd Qu.:0.7690   3rd Qu.: -4.687  
##  Max.   :0.99000   Max.   :209.7   Max.   :0.9790   Max.   :  1.295  
##     liveness         valence       time_signature  speechiness    
##  Min.   :0.0210   Min.   :0.0348   Min.   :1.00   Min.   :0.0232  
##  1st Qu.:0.0934   1st Qu.:0.3390   1st Qu.:4.00   1st Qu.:0.0379  
##  Median :0.1210   Median :0.5280   Median :4.00   Median :0.0581  
##  Mean   :0.1716   Mean   :0.5257   Mean   :3.95   Mean   :0.1009  
##  3rd Qu.:0.2100   3rd Qu.:0.7200   3rd Qu.:4.00   3rd Qu.:0.1180  
##  Max.   :0.9500   Max.   :0.9780   Max.   :5.00   Max.   :0.8480  
##  track_popularity instrumentalness         mode             key        
##  Min.   : 68.00   Min.   :0.0000000   Min.   :0.0000   Min.   : 0.000  
##  1st Qu.: 71.00   1st Qu.:0.0000000   1st Qu.:0.0000   1st Qu.: 2.000  
##  Median : 75.00   Median :0.0000060   Median :1.0000   Median : 5.000  
##  Mean   : 75.81   Mean   :0.0415204   Mean   :0.5783   Mean   : 5.338  
##  3rd Qu.: 79.00   3rd Qu.:0.0008138   3rd Qu.:1.0000   3rd Qu.: 8.000  
##  Max.   :100.00   Max.   :0.9710000   Max.   :1.0000   Max.   :11.000  
##   duration_ms      acousticness      
##  Min.   : 61673   Min.   :0.0000133  
##  1st Qu.:176608   1st Qu.:0.0230500  
##  Median :211180   Median :0.1240000  
##  Mean   :214562   Mean   :0.2212195  
##  3rd Qu.:244993   3rd Qu.:0.3347500  
##  Max.   :547107   Max.   :0.9950000

The summary reports no NA counts for any column, so there are no missing values in the data set.
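Reading the absence of an `NA's` row from `summary()` works, but an explicit count is easier to check. A minimal sketch, assuming a hypothetical data frame `toy` with one deliberately missing value (the real data set is reported to have none):

```r
# Toy data with a single missing value in 'energy'
toy <- data.frame(
  energy = c(0.592, NA, 0.808),
  tempo  = c(157.969, 104.978, 108.548)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Single TRUE/FALSE answer for the whole data frame
has_missing <- anyNA(toy)
print(has_missing)
```

Applied to `df_spotify_music`, every count would be expected to be zero if the conclusion above holds.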

Correlation

library(corrplot)

cor_df <- cor(df_spotify_music)
corrplot(cor_df, type = "full", order = "hclust", tl.col = "black", tl.cex = 0.6, addCoef.col = "black", number.cex = 0.6)

A fairly high correlation between energy and loudness was observed (0.69). This means that tracks with higher energy, a measure usually associated with intensity and activity, also tend to be louder. The relationship is consistent with the idea that loudness is a large part of what makes a track sound energetic.
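Reading the strongest pair off a 14 x 14 plot can be error-prone, so it may help to rank the pairs numerically. The sketch below uses simulated data, not the Spotify data: `toy` is a hypothetical data frame in which `energy` is constructed to depend on `loudness`, and `cor_pairs` is an illustrative name.

```r
set.seed(1)
n <- 200

# Simulated features: energy is built to correlate with loudness
loudness <- rnorm(n)
toy <- data.frame(
  loudness = loudness,
  energy   = 0.7 * loudness + rnorm(n, sd = 0.7),
  valence  = rnorm(n)
)

cor_mat <- cor(toy)

# Each pair appears twice in the matrix; keep the upper triangle only
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)

# Sort by absolute correlation, strongest first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs)
```

Replacing `toy` with `df_spotify_music` would list the pairs behind the correlation plot in order of strength.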

PCA

Optimal number of components

library(factoextra)

df <- df_spotify_music #creating a copy of our df
df.pca1 <- prcomp(df, scale.=TRUE)

df.pca1$rotation
##                           PC1         PC2         PC3         PC4         PC5
## energy            0.498365906 -0.25895683  0.05335802  0.13915665  0.03720911
## tempo             0.075550702 -0.18748189  0.08491305  0.16600461 -0.68194753
## danceability      0.220238340  0.59212024 -0.13484838 -0.04363353  0.13005576
## loudness          0.508615428 -0.09573085 -0.10848692  0.02372267 -0.01170464
## liveness          0.109653023 -0.13049501  0.02215428 -0.06035775 -0.29411711
## valence           0.306501000  0.22891180 -0.21077983 -0.04035818  0.18976558
## time_signature    0.237831483  0.11098830  0.33716582 -0.26862082  0.14241955
## speechiness       0.111654266  0.40939607  0.34994639 -0.05095480 -0.39838760
## track_popularity  0.011912416 -0.11159525 -0.72260156  0.14295565 -0.08452811
## instrumentalness -0.242715074 -0.01056810  0.19918622  0.23729274  0.01476424
## mode             -0.050566977 -0.18770973 -0.16211959 -0.60770230 -0.20087987
## key              -0.008428124  0.06841685  0.06045678  0.64533995  0.04433057
## duration_ms      -0.037548303 -0.43929971  0.26933385 -0.05188086  0.40643765
## acousticness     -0.447930079  0.20785279 -0.12091449 -0.07337815 -0.01127919
##                          PC6         PC7         PC8         PC9        PC10
## energy            0.12095256  0.15119858  0.05401774 -0.03240875 -0.20758926
## tempo            -0.35075375  0.28108007  0.05028737 -0.19587107  0.39623506
## danceability     -0.09045587  0.07075547  0.09512079 -0.06292296 -0.01905207
## loudness         -0.04750215 -0.10789014 -0.01924759  0.10556814 -0.26030265
## liveness          0.77263984 -0.34881107 -0.12372866 -0.13990798  0.19867297
## valence           0.08592547  0.27326869 -0.12040201 -0.58974516  0.30403709
## time_signature    0.04136050  0.18193182 -0.25582928  0.57088440  0.53552989
## speechiness      -0.13548090 -0.33627080  0.15318211  0.04211304 -0.18942006
## track_popularity -0.05000591 -0.08840399  0.15696044  0.40304714  0.21748458
## instrumentalness  0.35507154  0.62593690  0.20557512  0.11380287 -0.19297557
## mode             -0.11625212  0.22658407 -0.49556288 -0.03585591 -0.34726109
## key              -0.10014210 -0.09236512 -0.71782183  0.04150845 -0.07313487
## duration_ms      -0.26855801 -0.26320587  0.10890729 -0.23339554  0.20391127
## acousticness      0.05691654 -0.10100957 -0.16372775 -0.13245375  0.15856988
##                           PC11        PC12        PC13          PC14
## energy            4.783531e-02  0.25315915 -0.12050890  0.7039349171
## tempo             4.966577e-02 -0.16760523 -0.16969638  0.0060547727
## danceability     -2.152887e-01 -0.58754654 -0.30284562  0.2324903103
## loudness          1.732155e-01  0.08622987 -0.52224944 -0.5624267049
## liveness         -1.592915e-01 -0.23834054 -0.07070086 -0.0115956537
## valence          -1.467080e-01  0.36232728  0.19779450 -0.2070462722
## time_signature   -3.047197e-05  0.09824555 -0.04839608  0.0065104333
## speechiness      -4.133107e-01  0.40852317  0.07673507  0.0002968965
## track_popularity -4.049695e-01  0.16114974  0.01677100  0.0521579672
## instrumentalness -3.789019e-01  0.06361912 -0.22269141 -0.1929938010
## mode             -2.972624e-01 -0.05179226 -0.01822152  0.0507712276
## key              -1.704787e-01 -0.05000989  0.02627828  0.0031778383
## duration_ms      -4.927721e-01 -0.06770713 -0.26145919 -0.0098805142
## acousticness      1.826204e-01  0.38827497 -0.64854673  0.2199223409
fviz_eig(df.pca1, addlabels = TRUE)

fviz_eig(df.pca1, choice= "eigenvalue", addlabels = TRUE, main = "Eigenvalues") +
  geom_hline(yintercept = 1, linetype = "dashed") # reference line at eigenvalue = 1

summary(df.pca1)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5    PC6    PC7
## Standard deviation     1.6299 1.2328 1.08236 1.07848 1.04255 1.0117 1.0026
## Proportion of Variance 0.1898 0.1085 0.08368 0.08308 0.07764 0.0731 0.0718
## Cumulative Proportion  0.1898 0.2983 0.38197 0.46505 0.54269 0.6158 0.6876
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.94037 0.93309 0.85352 0.83626 0.73492 0.67724 0.43840
## Proportion of Variance 0.06316 0.06219 0.05203 0.04995 0.03858 0.03276 0.01373
## Cumulative Proportion  0.75076 0.81294 0.86498 0.91493 0.95351 0.98627 1.00000

The first seven components have standard deviations above 1 and together explain 68.8% of the variance, just short of the 70% target. The eighth component has a standard deviation below 1 (0.94), but including it raises the cumulative variance to 75.1%, which clears the threshold. Stopping at seven components would risk over-simplifying the model, so eight components were selected to give a more comprehensive picture of the dataset.
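The two selection rules discussed here can be checked directly from the standard deviations printed by `summary(df.pca1)`. This is a small sketch that copies those values by hand into a vector; the names `sdev`, `n_kaiser`, and `n_70` are illustrative, and the 70% target is the threshold used in this report rather than a general rule.

```r
# Standard deviations of PC1..PC14, copied from the summary output above
sdev <- c(1.6299, 1.2328, 1.08236, 1.07848, 1.04255, 1.0117, 1.0026,
          0.94037, 0.93309, 0.85352, 0.83626, 0.73492, 0.67724, 0.43840)

# Eigenvalues are the squared standard deviations
eigenvalues <- sdev^2

# Share of variance per component and running total
prop_var <- eigenvalues / sum(eigenvalues)
cum_var  <- cumsum(prop_var)

# Kaiser rule: keep components with eigenvalue > 1
n_kaiser <- sum(eigenvalues > 1)
print(n_kaiser)   # 7

# Smallest number of components reaching 70% cumulative variance
n_70 <- which(cum_var >= 0.70)[1]
print(n_70)       # 8

print(round(cum_var[7:8], 3))
```

The two rules disagree by one component here. The seventh eigenvalue is only marginally above 1, so the choice between seven and eight components rests mainly on the variance target.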

PCA components analysis

library(gridExtra)
library(cowplot)
fviz_pca_var(df.pca1, col.var="contrib")

df.pca1$rotation[,1:8]
##                           PC1         PC2         PC3         PC4         PC5
## energy            0.498365906 -0.25895683  0.05335802  0.13915665  0.03720911
## tempo             0.075550702 -0.18748189  0.08491305  0.16600461 -0.68194753
## danceability      0.220238340  0.59212024 -0.13484838 -0.04363353  0.13005576
## loudness          0.508615428 -0.09573085 -0.10848692  0.02372267 -0.01170464
## liveness          0.109653023 -0.13049501  0.02215428 -0.06035775 -0.29411711
## valence           0.306501000  0.22891180 -0.21077983 -0.04035818  0.18976558
## time_signature    0.237831483  0.11098830  0.33716582 -0.26862082  0.14241955
## speechiness       0.111654266  0.40939607  0.34994639 -0.05095480 -0.39838760
## track_popularity  0.011912416 -0.11159525 -0.72260156  0.14295565 -0.08452811
## instrumentalness -0.242715074 -0.01056810  0.19918622  0.23729274  0.01476424
## mode             -0.050566977 -0.18770973 -0.16211959 -0.60770230 -0.20087987
## key              -0.008428124  0.06841685  0.06045678  0.64533995  0.04433057
## duration_ms      -0.037548303 -0.43929971  0.26933385 -0.05188086  0.40643765
## acousticness     -0.447930079  0.20785279 -0.12091449 -0.07337815 -0.01127919
##                          PC6         PC7         PC8
## energy            0.12095256  0.15119858  0.05401774
## tempo            -0.35075375  0.28108007  0.05028737
## danceability     -0.09045587  0.07075547  0.09512079
## loudness         -0.04750215 -0.10789014 -0.01924759
## liveness          0.77263984 -0.34881107 -0.12372866
## valence           0.08592547  0.27326869 -0.12040201
## time_signature    0.04136050  0.18193182 -0.25582928
## speechiness      -0.13548090 -0.33627080  0.15318211
## track_popularity -0.05000591 -0.08840399  0.15696044
## instrumentalness  0.35507154  0.62593690  0.20557512
## mode             -0.11625212  0.22658407 -0.49556288
## key              -0.10014210 -0.09236512 -0.71782183
## duration_ms      -0.26855801 -0.26320587  0.10890729
## acousticness      0.05691654 -0.10100957 -0.16372775
# One variable-contribution bar chart for each of the eight retained components
contrib_plots <- lapply(1:8, function(i) fviz_contrib(df.pca1, "var", axes = i))

plot_grid(plotlist = contrib_plots, ncol = 3)
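For a single component, the contribution of a variable shown in these bar charts should equal, to my understanding of `fviz_contrib`, its squared loading expressed as a percentage. The reference line corresponds to equal contributions, 100 / 14 ≈ 7.1% for 14 variables. The sketch below recomputes this by hand for PC6, with the loadings copied from the rotation matrix above; `pc6`, `contrib`, and `threshold` are illustrative names.

```r
# PC6 loadings copied from df.pca1$rotation above
pc6 <- c(
  energy = 0.12095256, tempo = -0.35075375, danceability = -0.09045587,
  loudness = -0.04750215, liveness = 0.77263984, valence = 0.08592547,
  time_signature = 0.04136050, speechiness = -0.13548090,
  track_popularity = -0.05000591, instrumentalness = 0.35507154,
  mode = -0.11625212, key = -0.10014210, duration_ms = -0.26855801,
  acousticness = 0.05691654
)

# Contribution in percent: squared loading relative to the column total
contrib <- 100 * pc6^2 / sum(pc6^2)

# Expected contribution if all variables contributed equally
threshold <- 100 / length(pc6)

# Variables ranked by contribution, largest first
contrib_sorted <- sort(contrib, decreasing = TRUE)
print(round(contrib_sorted, 1))
print(names(contrib_sorted)[contrib_sorted > threshold])
```

The sign of a loading is lost when squaring, so the direction of each effect still has to be read from the rotation matrix.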

PC1

  • Dominant Variables: Loudness, Energy, Acousticness
  • Interpretation:
    PC1 reflects the sonic intensity and acoustic profile of the tracks. Tracks with high PC1 values are likely louder, more energetic, and less acoustic.

PC2

  • Dominant Variables: Danceability, Duration, Speechiness
  • Interpretation:
    PC2 captures the track rhythm and vocal features. Tracks with high PC2 scores tend to be more danceable, more speech-oriented, and shorter (duration loads negatively, -0.44).

PC3

  • Dominant Variables: Track Popularity, Speechiness, Time Signature
  • Interpretation:
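    PC3 represents the popularity and structural characteristics of tracks. Popularity loads strongly and negatively (-0.72), so high PC3 scores correspond to less popular tracks.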

PC4

  • Dominant Variables: Key, Mode, Time Signature
  • Interpretation:
    PC4 reflects musical structure and tonal characteristics, focusing on composition-related features.

PC5

  • Dominant Variables: Tempo, Duration, Speechiness
  • Interpretation:
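    PC5 highlights tempo and track length characteristics. Tempo loads negatively (-0.68) and duration positively (0.41), so high PC5 scores point to slower, longer tracks.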

PC6

  • Dominant Variables: Liveness, Instrumentalness, Tempo
  • Interpretation:
    PC6 captures the live or studio-recorded nature of tracks.

PC7

  • Dominant Variables: Instrumentalness, Liveness, Speechiness
  • Interpretation:
    PC7 reflects the degree of instrumentation versus vocals in the tracks.

PC8

  • Dominant Variables: Key, Mode, Time Signature
  • Interpretation:
    PC8 focuses on tonal and structural attributes of the tracks.

General Observations

  • PC1 and PC2 focus on sound and rhythm (e.g., energy, danceability, acousticness).
  • PC3 and PC4 emphasize popularity and musical structure (e.g., key, mode, time signature).
  • PC5 to PC8 delve into more specific attributes like tempo, liveness, and instrumentalness.
  • This decomposition provides a nuanced understanding of the dataset, from track popularity to energy and composition.

Summary

The study aimed to reduce the dimensionality of the dataset of top Spotify tracks while losing as little information, here measured as variance, as possible. Principal Component Analysis (PCA) was used as the reduction technique, and eight principal components were retained. They were chosen because together they explain about 75% of the variance in the dataset, which keeps information loss limited.
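To actually work with the reduced data, the component scores have to be extracted from the fitted PCA object. The sketch below shows one way to do this on simulated data: `toy` is a hypothetical four-column data frame and `k = 2` is an arbitrary choice for illustration, where this report would use `df_spotify_music` and `k = 8`.

```r
set.seed(42)
n <- 100

# Simulated data with one correlated pair of variables (a, b)
toy <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
toy$b <- toy$a + 0.3 * toy$b

# PCA on standardized variables, as in the analysis above
pca <- prcomp(toy, scale. = TRUE)

# Number of components to keep (illustrative value)
k <- 2

# Reduced representation: one row per observation, one column per component
scores <- pca$x[, 1:k, drop = FALSE]

# The same scores computed by hand: standardized data times the loadings
manual <- scale(toy) %*% pca$rotation[, 1:k, drop = FALSE]

print(dim(scores))
print(isTRUE(all.equal(scores, manual, check.attributes = FALSE)))
```

The resulting `scores` matrix can be used in place of the original variables in later steps such as clustering or plotting.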