This project uses Principal Component Analysis (PCA) to reduce the dimensionality of a Spotify music dataset containing the features of top songs. The aim is to identify the most significant factors that define the characteristics of the music while reducing the complexity of the dataset to simplify analysis. PCA transforms the original variables into a smaller set of components that preserve as much variance as possible, and therefore as much information as possible. This dimensionality reduction makes the structure and defining features of popular tracks easier to understand.
I used a dataset from Kaggle. The data were gathered with a Python script that extracts top songs and their audio and descriptive features from Spotify’s API. The descriptive features provide metadata about each song, such as the artist name, album title, and release date. The audio features, which come from Spotify’s audio analysis, include attributes such as key, valence, danceability, and energy. The dataset is available here:
https://www.kaggle.com/datasets/solomonameh/spotify-music-dataset?resource=download
Audio features:
Energy: A measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
Tempo: The speed of a track, measured in beats per minute (BPM).
Danceability: A score describing how suitable a track is for dancing based on tempo, rhythm stability, beat strength, and overall regularity.
Loudness: The overall loudness of a track in decibels (dB). Higher values indicate louder tracks overall.
Liveness: The likelihood of a track being performed live. Higher values suggest more audience presence.
Valence: The overall musical positiveness (emotion) of a track. High valence sounds happy; low valence sounds sad or angry.
Speechiness: Measures the presence of spoken words.
Instrumentalness: The likelihood a track contains no vocals. Values closer to 1.0 suggest solely instrumental tracks.
Mode: Indicates the modality of the track (major or minor), encoded as 1 for major and 0 for minor.
Key: The musical key, represented as an integer from 0 to 11, mapping to standard pitch class notation (0 = C, 1 = C♯/D♭, 2 = D, and so on).
Duration_ms: The length of the track in milliseconds.
Acousticness: A confidence measure of whether a track is acoustic (1) or not (0).
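The integer encodings of key and mode can be turned into readable labels. The following is a minimal sketch, assuming the standard pitch class order starting at C and sharps-only spelling; the names `pitch_classes` and `key_label` are illustrative and not part of the original analysis. The key and mode values are those of the first five tracks in the dataset.

```r
# Pitch classes 0..11 in standard order (sharps-only spelling is an assumption).
pitch_classes <- c("C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B")

# key and mode of the first five tracks in the dataset.
key  <- c(6, 2, 1, 0, 0)
mode <- c(0, 1, 1, 0, 0)

# R vectors are 1-indexed, so key 0 maps to element 1.
key_label <- paste(pitch_classes[key + 1], ifelse(mode == 1, "major", "minor"))
print(key_label)
```

Because key is a categorical code rather than a quantity, its numeric order carries no musical meaning, which is worth keeping in mind when it enters a correlation or PCA.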
Descriptive features:
Track Name: The name of the track.
Track Artist: The artist(s) performing the track.
Track Album Name: The album in which the track is featured.
Track Album Release Date: The release date of the album containing the track.
Track ID: A unique identifier assigned to the track by Spotify.
Track Album ID: A unique identifier for the album.
Playlist Name: The name of the playlist where the track is included.
Playlist Genre: The main genre associated with the playlist (e.g., pop, rock, classical).
Playlist Subgenre: A more specific subgenre tied to the playlist (e.g., indie pop, punk rock).
Playlist ID: A unique identifier for the playlist.
Track Popularity: A score (0–100) calculated from the track's total number of streams relative to other songs.
df_spotify_music <- read.csv("high_popularity_spotify_data.csv", sep = ",", dec = ".", header = TRUE)
summary(df_spotify_music)
## energy tempo danceability playlist_genre
## Min. :0.00161 Min. : 49.3 Min. :0.1360 Length:1686
## 1st Qu.:0.55100 1st Qu.:100.1 1st Qu.:0.5433 Class :character
## Median :0.68900 Median :120.0 Median :0.6645 Mode :character
## Mean :0.66722 Mean :121.1 Mean :0.6504
## 3rd Qu.:0.80700 3rd Qu.:136.8 3rd Qu.:0.7690
## Max. :0.99000 Max. :209.7 Max. :0.9790
## loudness liveness valence track_artist
## Min. :-43.643 Min. :0.0210 Min. :0.0348 Length:1686
## 1st Qu.: -7.950 1st Qu.:0.0934 1st Qu.:0.3390 Class :character
## Median : -5.974 Median :0.1210 Median :0.5280 Mode :character
## Mean : -6.704 Mean :0.1716 Mean :0.5257
## 3rd Qu.: -4.687 3rd Qu.:0.2100 3rd Qu.:0.7200
## Max. : 1.295 Max. :0.9500 Max. :0.9780
## time_signature speechiness track_popularity track_href
## Min. :1.00 Min. :0.0232 Min. : 68.00 Length:1686
## 1st Qu.:4.00 1st Qu.:0.0379 1st Qu.: 71.00 Class :character
## Median :4.00 Median :0.0581 Median : 75.00 Mode :character
## Mean :3.95 Mean :0.1009 Mean : 75.81
## 3rd Qu.:4.00 3rd Qu.:0.1180 3rd Qu.: 79.00
## Max. :5.00 Max. :0.8480 Max. :100.00
## uri track_album_name playlist_name analysis_url
## Length:1686 Length:1686 Length:1686 Length:1686
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## track_id track_name track_album_release_date
## Length:1686 Length:1686 Length:1686
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## instrumentalness track_album_id mode key
## Min. :0.0000000 Length:1686 Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0000000 Class :character 1st Qu.:0.0000 1st Qu.: 2.000
## Median :0.0000060 Mode :character Median :1.0000 Median : 5.000
## Mean :0.0415204 Mean :0.5783 Mean : 5.338
## 3rd Qu.:0.0008138 3rd Qu.:1.0000 3rd Qu.: 8.000
## Max. :0.9710000 Max. :1.0000 Max. :11.000
## duration_ms acousticness id playlist_subgenre
## Min. : 61673 Min. :0.0000133 Length:1686 Length:1686
## 1st Qu.:176608 1st Qu.:0.0230500 Class :character Class :character
## Median :211180 Median :0.1240000 Mode :character Mode :character
## Mean :214562 Mean :0.2212195
## 3rd Qu.:244993 3rd Qu.:0.3347500
## Max. :547107 Max. :0.9950000
## type playlist_id
## Length:1686 Length:1686
## Class :character Class :character
## Mode :character Mode :character
##
##
##
dim(df_spotify_music)
## [1] 1686 29
head(df_spotify_music,5)
## energy tempo danceability playlist_genre loudness liveness valence
## 1 0.592 157.969 0.521 pop -7.777 0.122 0.535
## 2 0.507 104.978 0.747 pop -10.171 0.117 0.438
## 3 0.808 108.548 0.554 pop -4.169 0.159 0.372
## 4 0.910 112.966 0.670 pop -4.070 0.304 0.786
## 5 0.783 149.027 0.777 pop -4.477 0.355 0.939
## track_artist time_signature speechiness track_popularity
## 1 Lady Gaga, Bruno Mars 3 0.0304 100
## 2 Billie Eilish 4 0.0358 97
## 3 Gracie Abrams 4 0.0368 93
## 4 Sabrina Carpenter 4 0.0634 81
## 5 ROSÉ, Bruno Mars 4 0.2600 98
## track_href
## 1 https://api.spotify.com/v1/tracks/2plbrEY59IikOBgBGLjaoe
## 2 https://api.spotify.com/v1/tracks/6dOtVTDdiauQNBQEDOtlAB
## 3 https://api.spotify.com/v1/tracks/7ne4VBA60CxGM75vw0EYad
## 4 https://api.spotify.com/v1/tracks/1d7Ptw3qYcfpdLNL5REhtJ
## 5 https://api.spotify.com/v1/tracks/5vNRhkKd0yEAg8suGBpjeY
## uri track_album_name
## 1 spotify:track:2plbrEY59IikOBgBGLjaoe Die With A Smile
## 2 spotify:track:6dOtVTDdiauQNBQEDOtlAB HIT ME HARD AND SOFT
## 3 spotify:track:7ne4VBA60CxGM75vw0EYad The Secret of Us (Deluxe)
## 4 spotify:track:1d7Ptw3qYcfpdLNL5REhtJ Short n' Sweet
## 5 spotify:track:5vNRhkKd0yEAg8suGBpjeY APT.
## playlist_name
## 1 Today's Top Hits
## 2 Today's Top Hits
## 3 Today's Top Hits
## 4 Today's Top Hits
## 5 Today's Top Hits
## analysis_url
## 1 https://api.spotify.com/v1/audio-analysis/2plbrEY59IikOBgBGLjaoe
## 2 https://api.spotify.com/v1/audio-analysis/6dOtVTDdiauQNBQEDOtlAB
## 3 https://api.spotify.com/v1/audio-analysis/7ne4VBA60CxGM75vw0EYad
## 4 https://api.spotify.com/v1/audio-analysis/1d7Ptw3qYcfpdLNL5REhtJ
## 5 https://api.spotify.com/v1/audio-analysis/5vNRhkKd0yEAg8suGBpjeY
## track_id track_name track_album_release_date
## 1 2plbrEY59IikOBgBGLjaoe Die With A Smile 2024-08-16
## 2 6dOtVTDdiauQNBQEDOtlAB BIRDS OF A FEATHER 2024-05-17
## 3 7ne4VBA60CxGM75vw0EYad That’s So True 2024-10-18
## 4 1d7Ptw3qYcfpdLNL5REhtJ Taste 2024-08-23
## 5 5vNRhkKd0yEAg8suGBpjeY APT. 2024-10-18
## instrumentalness track_album_id mode key duration_ms acousticness
## 1 0.0000 10FLjwfpbxLmW8c25Xyc2N 0 6 251668 0.3080
## 2 0.0608 7aJuG4TFXa2hmE4z1yxc3n 1 2 210373 0.2000
## 3 0.0000 0hBRqPYPXhr1RkTDG3n4Mk 1 1 166300 0.2140
## 4 0.0000 4B4Elma4nNDUyl6D5PvQkj 0 0 157280 0.0939
## 5 0.0000 2IYQwwgxgOIn7t3iF6ufFD 0 0 169917 0.0283
## id playlist_subgenre type
## 1 2plbrEY59IikOBgBGLjaoe mainstream audio_features
## 2 6dOtVTDdiauQNBQEDOtlAB mainstream audio_features
## 3 7ne4VBA60CxGM75vw0EYad mainstream audio_features
## 4 1d7Ptw3qYcfpdLNL5REhtJ mainstream audio_features
## 5 5vNRhkKd0yEAg8suGBpjeY mainstream audio_features
## playlist_id
## 1 37i9dQZF1DXcBWIGoYBM5M
## 2 37i9dQZF1DXcBWIGoYBM5M
## 3 37i9dQZF1DXcBWIGoYBM5M
## 4 37i9dQZF1DXcBWIGoYBM5M
## 5 37i9dQZF1DXcBWIGoYBM5M
Because the dataset contains descriptive features, I start by removing all columns of character type, keeping only the numeric ones.
library(dplyr)
df_spotify_music <- df_spotify_music %>%
select(
-playlist_genre, -track_artist, -track_href, -uri, -track_album_name,
-playlist_name, -analysis_url, -track_id, -track_name, -track_album_release_date,
-track_album_id, -id, -playlist_subgenre, -type, -playlist_id
)
head(df_spotify_music,5)
## energy tempo danceability loudness liveness valence time_signature
## 1 0.592 157.969 0.521 -7.777 0.122 0.535 3
## 2 0.507 104.978 0.747 -10.171 0.117 0.438 4
## 3 0.808 108.548 0.554 -4.169 0.159 0.372 4
## 4 0.910 112.966 0.670 -4.070 0.304 0.786 4
## 5 0.783 149.027 0.777 -4.477 0.355 0.939 4
## speechiness track_popularity instrumentalness mode key duration_ms
## 1 0.0304 100 0.0000 0 6 251668
## 2 0.0358 97 0.0608 1 2 210373
## 3 0.0368 93 0.0000 1 1 166300
## 4 0.0634 81 0.0000 0 0 157280
## 5 0.2600 98 0.0000 0 0 169917
## acousticness
## 1 0.3080
## 2 0.2000
## 3 0.2140
## 4 0.0939
## 5 0.0283
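The same filtering can also be done by column type instead of listing every name. This is a minimal base R sketch on a small stand-in data frame (`toy` is hypothetical; its values are copied from the first three rows shown above) rather than on the full dataset.

```r
# Stand-in data frame with mixed column types (values from the first three rows).
toy <- data.frame(
  energy       = c(0.592, 0.507, 0.808),
  track_artist = c("Lady Gaga, Bruno Mars", "Billie Eilish", "Gracie Abrams"),
  key          = c(6L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Flag the columns that are numeric (integer or double) and keep only those.
numeric_cols <- vapply(toy, is.numeric, logical(1))
toy_numeric  <- toy[, numeric_cols, drop = FALSE]

print(names(toy_numeric))
```

With dplyr, `select(where(is.numeric))` expresses the same idea; selecting by type avoids having to update the list if the set of descriptive columns changes.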
summary(df_spotify_music)
## energy tempo danceability loudness
## Min. :0.00161 Min. : 49.3 Min. :0.1360 Min. :-43.643
## 1st Qu.:0.55100 1st Qu.:100.1 1st Qu.:0.5433 1st Qu.: -7.950
## Median :0.68900 Median :120.0 Median :0.6645 Median : -5.974
## Mean :0.66722 Mean :121.1 Mean :0.6504 Mean : -6.704
## 3rd Qu.:0.80700 3rd Qu.:136.8 3rd Qu.:0.7690 3rd Qu.: -4.687
## Max. :0.99000 Max. :209.7 Max. :0.9790 Max. : 1.295
## liveness valence time_signature speechiness
## Min. :0.0210 Min. :0.0348 Min. :1.00 Min. :0.0232
## 1st Qu.:0.0934 1st Qu.:0.3390 1st Qu.:4.00 1st Qu.:0.0379
## Median :0.1210 Median :0.5280 Median :4.00 Median :0.0581
## Mean :0.1716 Mean :0.5257 Mean :3.95 Mean :0.1009
## 3rd Qu.:0.2100 3rd Qu.:0.7200 3rd Qu.:4.00 3rd Qu.:0.1180
## Max. :0.9500 Max. :0.9780 Max. :5.00 Max. :0.8480
## track_popularity instrumentalness mode key
## Min. : 68.00 Min. :0.0000000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 71.00 1st Qu.:0.0000000 1st Qu.:0.0000 1st Qu.: 2.000
## Median : 75.00 Median :0.0000060 Median :1.0000 Median : 5.000
## Mean : 75.81 Mean :0.0415204 Mean :0.5783 Mean : 5.338
## 3rd Qu.: 79.00 3rd Qu.:0.0008138 3rd Qu.:1.0000 3rd Qu.: 8.000
## Max. :100.00 Max. :0.9710000 Max. :1.0000 Max. :11.000
## duration_ms acousticness
## Min. : 61673 Min. :0.0000133
## 1st Qu.:176608 1st Qu.:0.0230500
## Median :211180 Median :0.1240000
## Mean :214562 Mean :0.2212195
## 3rd Qu.:244993 3rd Qu.:0.3347500
## Max. :547107 Max. :0.9950000
There are no missing values in the dataset: summary() reports no NA counts for any of the remaining numeric columns.
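An explicit check makes this easier to verify than reading the summary. The sketch below uses a hypothetical data frame `toy` with one planted NA, so that a positive result is visible. On the real data, `colSums(is.na(df_spotify_music))` would be expected to return zero for every column.

```r
# Hypothetical data frame; the NA is planted to show what a hit looks like.
toy <- data.frame(
  energy   = c(0.592, NA, 0.808),
  loudness = c(-7.777, -10.171, -4.169)
)

na_per_column <- colSums(is.na(toy))  # number of missing values in each column
any_missing   <- anyNA(toy)           # TRUE if at least one value is missing

print(na_per_column)
print(any_missing)
```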
library(corrplot)
cor_df <- cor(df_spotify_music)
corrplot(cor_df, type = "full", order = "hclust", tl.col = "black", tl.cex = 0.6, addCoef.col = "black", number.cex = 0.6)
A fairly strong positive correlation between energy and loudness was observed (0.69). This means that tracks with higher energy, which is typically associated with intensity and overall activity, also tend to be louder. The correlation is consistent with the idea that energy is one of the driving forces in the audio experience of music, although a correlation alone does not establish which attribute drives the other.
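Reading the strongest pairs off a correlation plot can be error-prone, so they can also be listed directly. The sketch below is an illustration, not part of the original analysis: `top_correlations` is a hypothetical helper, and the data are only the first five rows of the dataset, so the coefficients will not match the 0.69 computed on all 1686 tracks.

```r
# Hypothetical helper: list variable pairs of a correlation matrix,
# sorted by absolute correlation (each pair appears once).
top_correlations <- function(cor_mat, n = 5) {
  idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
  pairs <- data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r    = cor_mat[idx],
    stringsAsFactors = FALSE
  )
  pairs <- pairs[order(-abs(pairs$r)), ]
  head(pairs, n)
}

# First five rows of three audio features, as printed by head() earlier.
first_rows <- data.frame(
  energy       = c(0.592, 0.507, 0.808, 0.910, 0.783),
  loudness     = c(-7.777, -10.171, -4.169, -4.070, -4.477),
  danceability = c(0.521, 0.747, 0.554, 0.670, 0.777)
)

top_pairs <- top_correlations(cor(first_rows))
print(top_pairs)
```

Applied to the full matrix, the call would be `top_correlations(cor_df)`, assuming `cor_df` as defined above.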
library(factoextra)
df <- df_spotify_music  # create a copy of the data frame
# Standardize the variables so that large-scale ones such as duration_ms do not dominate
df.pca1 <- prcomp(df, scale. = TRUE)
df.pca1$rotation
## PC1 PC2 PC3 PC4 PC5
## energy 0.498365906 -0.25895683 0.05335802 0.13915665 0.03720911
## tempo 0.075550702 -0.18748189 0.08491305 0.16600461 -0.68194753
## danceability 0.220238340 0.59212024 -0.13484838 -0.04363353 0.13005576
## loudness 0.508615428 -0.09573085 -0.10848692 0.02372267 -0.01170464
## liveness 0.109653023 -0.13049501 0.02215428 -0.06035775 -0.29411711
## valence 0.306501000 0.22891180 -0.21077983 -0.04035818 0.18976558
## time_signature 0.237831483 0.11098830 0.33716582 -0.26862082 0.14241955
## speechiness 0.111654266 0.40939607 0.34994639 -0.05095480 -0.39838760
## track_popularity 0.011912416 -0.11159525 -0.72260156 0.14295565 -0.08452811
## instrumentalness -0.242715074 -0.01056810 0.19918622 0.23729274 0.01476424
## mode -0.050566977 -0.18770973 -0.16211959 -0.60770230 -0.20087987
## key -0.008428124 0.06841685 0.06045678 0.64533995 0.04433057
## duration_ms -0.037548303 -0.43929971 0.26933385 -0.05188086 0.40643765
## acousticness -0.447930079 0.20785279 -0.12091449 -0.07337815 -0.01127919
## PC6 PC7 PC8 PC9 PC10
## energy 0.12095256 0.15119858 0.05401774 -0.03240875 -0.20758926
## tempo -0.35075375 0.28108007 0.05028737 -0.19587107 0.39623506
## danceability -0.09045587 0.07075547 0.09512079 -0.06292296 -0.01905207
## loudness -0.04750215 -0.10789014 -0.01924759 0.10556814 -0.26030265
## liveness 0.77263984 -0.34881107 -0.12372866 -0.13990798 0.19867297
## valence 0.08592547 0.27326869 -0.12040201 -0.58974516 0.30403709
## time_signature 0.04136050 0.18193182 -0.25582928 0.57088440 0.53552989
## speechiness -0.13548090 -0.33627080 0.15318211 0.04211304 -0.18942006
## track_popularity -0.05000591 -0.08840399 0.15696044 0.40304714 0.21748458
## instrumentalness 0.35507154 0.62593690 0.20557512 0.11380287 -0.19297557
## mode -0.11625212 0.22658407 -0.49556288 -0.03585591 -0.34726109
## key -0.10014210 -0.09236512 -0.71782183 0.04150845 -0.07313487
## duration_ms -0.26855801 -0.26320587 0.10890729 -0.23339554 0.20391127
## acousticness 0.05691654 -0.10100957 -0.16372775 -0.13245375 0.15856988
## PC11 PC12 PC13 PC14
## energy 4.783531e-02 0.25315915 -0.12050890 0.7039349171
## tempo 4.966577e-02 -0.16760523 -0.16969638 0.0060547727
## danceability -2.152887e-01 -0.58754654 -0.30284562 0.2324903103
## loudness 1.732155e-01 0.08622987 -0.52224944 -0.5624267049
## liveness -1.592915e-01 -0.23834054 -0.07070086 -0.0115956537
## valence -1.467080e-01 0.36232728 0.19779450 -0.2070462722
## time_signature -3.047197e-05 0.09824555 -0.04839608 0.0065104333
## speechiness -4.133107e-01 0.40852317 0.07673507 0.0002968965
## track_popularity -4.049695e-01 0.16114974 0.01677100 0.0521579672
## instrumentalness -3.789019e-01 0.06361912 -0.22269141 -0.1929938010
## mode -2.972624e-01 -0.05179226 -0.01822152 0.0507712276
## key -1.704787e-01 -0.05000989 0.02627828 0.0031778383
## duration_ms -4.927721e-01 -0.06770713 -0.26145919 -0.0098805142
## acousticness 1.826204e-01 0.38827497 -0.64854673 0.2199223409
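To make clear what the loadings above represent, the sketch below reproduces the output of prcomp() with scale. = TRUE from an eigen-decomposition of the correlation matrix. It uses the built-in USArrests data purely for illustration, not the Spotify data, and the object names (`X`, `eig`, `pca`, `scores`) are assumptions of this example.

```r
# Built-in dataset used only for illustration; not the Spotify data.
X   <- scale(USArrests)            # center and scale each column
eig <- eigen(cor(USArrests))       # eigen-decomposition of the correlation matrix
pca <- prcomp(USArrests, scale. = TRUE)

# Component variances equal the eigenvalues of the correlation matrix.
print(all.equal(pca$sdev^2, eig$values))

# Loadings (rotation) equal the eigenvectors up to sign.
print(all.equal(abs(unname(pca$rotation)), abs(eig$vectors)))

# Scores are the standardized data projected onto the loadings.
scores <- X %*% pca$rotation
print(all.equal(unname(scores), unname(pca$x)))
```

The sign of each column is arbitrary, which is why the comparison of loadings uses absolute values. Only the relative signs of variables within a component are interpretable.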
fviz_eig(df.pca1, addlabels = TRUE)
fviz_eig(df.pca1, choice = "eigenvalue", addlabels = TRUE, main = "Eigenvalues") +
geom_hline(yintercept = 1, linetype = "dashed")  # reference line at eigenvalue = 1
summary(df.pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.6299 1.2328 1.08236 1.07848 1.04255 1.0117 1.0026
## Proportion of Variance 0.1898 0.1085 0.08368 0.08308 0.07764 0.0731 0.0718
## Cumulative Proportion 0.1898 0.2983 0.38197 0.46505 0.54269 0.6158 0.6876
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.94037 0.93309 0.85352 0.83626 0.73492 0.67724 0.43840
## Proportion of Variance 0.06316 0.06219 0.05203 0.04995 0.03858 0.03276 0.01373
## Cumulative Proportion 0.75076 0.81294 0.86498 0.91493 0.95351 0.98627 1.00000
Cumulative explained variance first exceeds 70% with eight principal components. The eighth component has a standard deviation below 1 (0.94, an eigenvalue of about 0.88), so it would be dropped under the eigenvalue-greater-than-one rule, but it is worth retaining to preserve sufficient information. The first seven components alone explain only 68.8% of the variance, which risks over-simplifying the model. Including the eighth component raises the cumulative variance to 75.1%, which meets the 70% threshold. Eight components were therefore selected to give a more comprehensive picture of the dataset.
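The component count can be checked directly from the standard deviations printed by summary(df.pca1). The sketch below copies those values by hand. The 70% threshold is the one used in the text, and the variable names (`k_threshold`, `k_kaiser`) are illustrative.

```r
# Standard deviations of PC1..PC14 as printed by summary(df.pca1).
sdev <- c(1.6299, 1.2328, 1.08236, 1.07848, 1.04255, 1.0117, 1.0026,
          0.94037, 0.93309, 0.85352, 0.83626, 0.73492, 0.67724, 0.43840)

eigenvalues <- sdev^2                           # variance of each component
prop_var    <- eigenvalues / sum(eigenvalues)   # proportion of variance
cum_var     <- cumsum(prop_var)                 # cumulative proportion

k_threshold <- which(cum_var >= 0.70)[1]   # smallest number of components reaching 70%
k_kaiser    <- sum(eigenvalues > 1)        # components with eigenvalue above 1

print(round(cum_var[7:8], 3))
print(c(threshold = k_threshold, kaiser = k_kaiser))
```

The two rules disagree by one component here, which is why the choice of eight rests on the variance threshold and not on the eigenvalue rule.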
library(gridExtra)
library(cowplot)
fviz_pca_var(df.pca1, col.var="contrib")
df.pca1$rotation[,1:8]
## PC1 PC2 PC3 PC4 PC5
## energy 0.498365906 -0.25895683 0.05335802 0.13915665 0.03720911
## tempo 0.075550702 -0.18748189 0.08491305 0.16600461 -0.68194753
## danceability 0.220238340 0.59212024 -0.13484838 -0.04363353 0.13005576
## loudness 0.508615428 -0.09573085 -0.10848692 0.02372267 -0.01170464
## liveness 0.109653023 -0.13049501 0.02215428 -0.06035775 -0.29411711
## valence 0.306501000 0.22891180 -0.21077983 -0.04035818 0.18976558
## time_signature 0.237831483 0.11098830 0.33716582 -0.26862082 0.14241955
## speechiness 0.111654266 0.40939607 0.34994639 -0.05095480 -0.39838760
## track_popularity 0.011912416 -0.11159525 -0.72260156 0.14295565 -0.08452811
## instrumentalness -0.242715074 -0.01056810 0.19918622 0.23729274 0.01476424
## mode -0.050566977 -0.18770973 -0.16211959 -0.60770230 -0.20087987
## key -0.008428124 0.06841685 0.06045678 0.64533995 0.04433057
## duration_ms -0.037548303 -0.43929971 0.26933385 -0.05188086 0.40643765
## acousticness -0.447930079 0.20785279 -0.12091449 -0.07337815 -0.01127919
## PC6 PC7 PC8
## energy 0.12095256 0.15119858 0.05401774
## tempo -0.35075375 0.28108007 0.05028737
## danceability -0.09045587 0.07075547 0.09512079
## loudness -0.04750215 -0.10789014 -0.01924759
## liveness 0.77263984 -0.34881107 -0.12372866
## valence 0.08592547 0.27326869 -0.12040201
## time_signature 0.04136050 0.18193182 -0.25582928
## speechiness -0.13548090 -0.33627080 0.15318211
## track_popularity -0.05000591 -0.08840399 0.15696044
## instrumentalness 0.35507154 0.62593690 0.20557512
## mode -0.11625212 0.22658407 -0.49556288
## key -0.10014210 -0.09236512 -0.71782183
## duration_ms -0.26855801 -0.26320587 0.10890729
## acousticness 0.05691654 -0.10100957 -0.16372775
# Build one variable-contribution plot per retained component (PC1 to PC8)
contrib_plots <- lapply(1:8, function(i) fviz_contrib(df.pca1, choice = "var", axes = i))
plot_grid(plotlist = contrib_plots, ncol = 3)
The study aimed to reduce the dimensionality of the dataset of top Spotify tracks without losing important information, measured here as variance in the data. Principal Component Analysis (PCA) was used as the dimensionality reduction technique, and eight principal components were identified as the number to retain. The components were selected for their ability to explain most of the variance in the dataset (about 75%) while minimizing information loss.
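Once the number of components is fixed, the reduced dataset is simply the first k columns of the score matrix. The sketch below shows this on the built-in USArrests data with k = 2, purely for illustration. For the analysis above, the analogous call would be `df.pca1$x[, 1:8]`.

```r
# Built-in dataset used only for illustration; not the Spotify data.
pca <- prcomp(USArrests, scale. = TRUE)

k <- 2                                  # number of components to keep (illustrative)
reduced <- pca$x[, 1:k, drop = FALSE]   # observations expressed in the first k components

# Share of total variance retained by the first k components.
retained <- sum(pca$sdev[1:k]^2) / sum(pca$sdev^2)

print(dim(reduced))
print(retained)
```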