Dimensionality Reduction for Spotify Data

Introduction

The goal of this project is to apply two dimensionality reduction methods, Principal Component Analysis (PCA) and Multiple Factor Analysis (MFA), to data on musical tracks released in 2023, focusing on the pop genre, and to analyze their outputs.

R Setup

The following R packages were used to perform the analysis:

library(stats) # pca
library(corrplot) # corrplots
library(factoextra) # plots
library(gridExtra) # displaying most significant variables that constitute pc
library(psych) # rotated pca
library(FactoMineR) # mfa

Data Overview

The dataset was found on Kaggle and includes information on thousands of musical tracks, pulled with Spotify’s Web API. It covers song metadata, popularity, and audio analysis. The raw file has 49 columns; the 24 variables described below are the ones used in the analysis:

track_name: Name of the track.
album_type: Album type, describing whether the track was released as a single or within an album.
album_popularity: Popularity of the album, ranging between 0 and 100, with 100 being the most popular.
artist_0: The main artist of the track.
genre_0: The main genre of the track.
artist_1: The secondary artist of the track.
duration_sec: Duration of the track in seconds.
followers: Total number of followers the artist has on Spotify.
artist_popularity: Popularity of the artist, calculated by Spotify based on all of the artist’s tracks (ranging between 0 and 100).
acousticness: A confidence measure from 0.0 to 1.0, assessing whether the track is acoustic (higher values indicate more acoustic content).
danceability: How suitable a track is for dancing, based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity (0.0 is least danceable, 1.0 is most danceable).
energy: A measure from 0.0 to 1.0 representing intensity and activity (energetic tracks feel fast, loud, and noisy, leading to a higher score).
instrumentalness: A measure of whether a track contains no vocals (values closer to 1.0 indicate a greater likelihood that the track is purely instrumental).
key: The key the track is in (integers map to pitches using standard Pitch Class notation).
liveness: Detects the presence of an audience in the recording, ranging between 0.0 and 1.0 (higher values indicate a greater probability that the track was performed live).
loudness: The overall loudness of a track in decibels (typically ranging between -60 and 0 dB).
mode: Represents the modality of the track (either major or minor).
speechiness: Detects the presence of spoken words in a track, ranging between 0.0 and 1.0 (the more exclusively speech-like the recording, the closer to 1.0 the attribute value).
tempo: The overall estimated tempo of a track in beats per minute (BPM).
valence: A measure from 0.0 to 1.0, describing the musical positiveness conveyed by a track (positive-sounding tracks have higher valence).
time_signature: The number of beats in each bar (ranges from 3 to 7, indicating time signatures such as 3/4 to 7/4).
explicit: Indicates whether the track has explicit lyrics.
track_popularity: Popularity of the track calculated by Spotify’s algorithm, ranging between 0 and 100, with 100 being the most popular.
collab: Whether the song was a collaboration or performed solo (derived from artist_1 during preprocessing).

spotify <- read.csv("spotify_data_12_20_2023.csv")

Detailed information on how the measures are assigned can be found in Spotify’s Web API documentation. Before the analysis and the initial preprocessing, the dataset was filtered to include only tracks released in 2023 whose main genre contains "pop" (this also keeps sub-genres such as k-pop ballad or hyperpop).

spotify <- spotify[grep("pop", spotify$genre_0), ]
spotify <- spotify[spotify$release_year==2023,]
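Because grep() matches substrings, the genre filter keeps every main genre that contains "pop", not only the genre labelled exactly pop. The sketch below illustrates this on a made-up data frame; `toy` and its rows are assumptions for illustration and are not part of the dataset. It also combines both conditions in a single step with grepl().

```r
# Toy stand-in for the real data; the rows are invented for illustration.
toy <- data.frame(
  genre_0 = c("pop", "k-pop ballad", "hyperpop", "rock", "deep neo-synthpop"),
  release_year = c(2023, 2023, 2022, 2023, 2023),
  stringsAsFactors = FALSE
)

# grepl() returns a logical vector, so it combines directly with the year test.
is_pop <- grepl("pop", toy$genre_0, fixed = TRUE)
toy_pop_2023 <- toy[is_pop & toy$release_year == 2023, ]

print(toy_pop_2023$genre_0)
```

In this toy example three rows survive: "pop", "k-pop ballad" and "deep neo-synthpop". The hyperpop row is dropped only because of its release year.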

Data Preprocessing

First, the data is preprocessed so that the dimensionality reduction algorithms can be applied.

str(spotify)

## 'data.frame':    2257 obs. of  49 variables:
##  $ album_id         : chr  "13u5VtPqwiS0olRYW2iUet" "02V8WaeAzCF6Tfk26yWwnL" "1oxqIMFixLqwqxqqHifJAk" "1oxqIMFixLqwqxqqHifJAk" ...
##  $ album_name       : chr  "잘해보고 싶어요" "O'Rkha D'Shmayya" "Perfected" "Perfected" ...
##  $ album_popularity : int  1 1 0 0 2 2 2 2 2 2 ...
##  $ album_type       : chr  "single" "single" "single" "single" ...
##  $ artists          : chr  "['Lia Kim']" "['Rayan Zaito']" "['Orpheus in red velvet']" "['Orpheus in red velvet']" ...
##  $ artist_0         : chr  "Lia Kim" "Rayan Zaito" "Orpheus in red velvet" "Orpheus in red velvet" ...
##  $ artist_1         : chr  "" "" "" "" ...
##  $ artist_2         : chr  "" "" "" "" ...
##  $ artist_3         : chr  "" "" "" "" ...
##  $ artist_4         : chr  "" "" "" "" ...
##  $ artist_id        : chr  "4XkyKEhzoNQEg8ruN7OkPs" "2VCgGRjYUv9LKgLzE3yz0o" "3TNxQnixmRnIOr1jV8ghsJ" "3TNxQnixmRnIOr1jV8ghsJ" ...
##  $ duration_sec     : num  232 257 238 297 173 ...
##  $ label            : chr  "jh kimsmusic" "Babylon Records" "3185475 Records DK" "3185475 Records DK" ...
##  $ release_date     : chr  "2023-02-06 00:00:00 UTC" "2023-11-01 00:00:00 UTC" "2023-07-01 00:00:00 UTC" "2023-07-01 00:00:00 UTC" ...
##  $ total_tracks     : int  1 1 2 2 10 10 10 10 10 10 ...
##  $ track_id         : chr  "3qPXM4XdLCv6GNK7n8GoL6" "5o7296SqFh2dK2rjnfskwh" "2tWoFHY2Sl32KtiiRRGK8J" "17BtcevO6GU9l2fqVZ2ePs" ...
##  $ track_name       : chr  "잘해보고 싶어요" "O'Rkha D'Shmayya" "Perfected - Single Version" "Perfected - B-Side-Mix" ...
##  $ track_number     : int  1 1 1 2 3 7 9 8 10 6 ...
##  $ artist_genres    : chr  "['k-pop ballad']" "['assyrian pop']" "['deep neo-synthpop']" "['deep neo-synthpop']" ...
##  $ artist_popularity: int  0 0 0 0 1 1 1 1 1 1 ...
##  $ followers        : int  60 60 69 69 204 204 204 204 204 204 ...
##  $ name             : chr  "Lia Kim" "Rayan Zaito" "Orpheus in red velvet" "Orpheus in red velvet" ...
##  $ genre_0          : chr  "k-pop ballad" "assyrian pop" "deep neo-synthpop" "deep neo-synthpop" ...
##  $ genre_1          : chr  "" "" "" "" ...
##  $ genre_2          : chr  "" "" "" "" ...
##  $ genre_3          : chr  "" "" "" "" ...
##  $ genre_4          : chr  "" "" "" "" ...
##  $ acousticness     : num  0.697 0.783 0.101 0.0208 0.497 0.163 0.271 0.176 0.563 0.248 ...
##  $ analysis_url     : chr  "https://api.spotify.com/v1/audio-analysis/3qPXM4XdLCv6GNK7n8GoL6" "https://api.spotify.com/v1/audio-analysis/5o7296SqFh2dK2rjnfskwh" "https://api.spotify.com/v1/audio-analysis/2tWoFHY2Sl32KtiiRRGK8J" "https://api.spotify.com/v1/audio-analysis/17BtcevO6GU9l2fqVZ2ePs" ...
##  $ danceability     : num  0.467 0.499 0.573 0.53 0.564 0.798 0.739 0.756 0.529 0.65 ...
##  $ duration_ms      : int  231613 257088 238427 296976 173103 219591 214625 184808 216052 228195 ...
##  $ energy           : num  0.307 0.284 0.977 0.99 0.534 0.428 0.433 0.433 0.191 0.328 ...
##  $ instrumentalness : num  0.00 2.31e-06 1.53e-01 2.43e-01 1.14e-04 3.53e-05 1.37e-04 0.00 0.00 2.35e-05 ...
##  $ key              : int  7 4 6 6 5 9 5 10 6 10 ...
##  $ liveness         : num  0.151 0.133 0.338 0.353 0.0925 0.14 0.25 0.0934 0.098 0.108 ...
##  $ loudness         : num  -7.83 -11.22 -8.37 -7.46 -9.24 ...
##  $ mode             : int  1 0 0 0 0 1 1 1 0 1 ...
##  $ speechiness      : num  0.0271 0.0366 0.0489 0.0472 0.0431 0.0344 0.0337 0.0676 0.0292 0.137 ...
##  $ tempo            : num  75 142 124 124 100 ...
##  $ time_signature   : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ track_href       : chr  "https://api.spotify.com/v1/tracks/3qPXM4XdLCv6GNK7n8GoL6" "https://api.spotify.com/v1/tracks/5o7296SqFh2dK2rjnfskwh" "https://api.spotify.com/v1/tracks/2tWoFHY2Sl32KtiiRRGK8J" "https://api.spotify.com/v1/tracks/17BtcevO6GU9l2fqVZ2ePs" ...
##  $ type             : chr  "audio_features" "audio_features" "audio_features" "audio_features" ...
##  $ uri              : chr  "spotify:track:3qPXM4XdLCv6GNK7n8GoL6" "spotify:track:5o7296SqFh2dK2rjnfskwh" "spotify:track:2tWoFHY2Sl32KtiiRRGK8J" "spotify:track:17BtcevO6GU9l2fqVZ2ePs" ...
##  $ valence          : num  0.161 0.356 0.659 0.614 0.39 0.237 0.342 0.243 0.228 0.238 ...
##  $ explicit         : chr  "false" "false" "false" "false" ...
##  $ track_popularity : int  3 3 0 0 1 0 0 0 0 0 ...
##  $ release_year     : int  2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
##  $ release_month    : chr  "February" "November" "July" "July" ...
##  $ rn               : int  1 1 1 1 1 1 1 1 1 1 ...

Redundant and unnecessary variables are removed.

spotify <- spotify[, c("track_name", "album_type", "album_popularity", "artist_0", "genre_0", "artist_1", "duration_sec", "followers", "artist_popularity","label", "acousticness", "danceability", "energy", "instrumentalness", "key", "liveness", "loudness", "mode", "speechiness", "tempo", "valence", "time_signature", "explicit", "track_popularity")]

Several variables are converted to factors, and a new variable collab is created, which indicates whether the track was performed solo or with other artists.

spotify$album_type <- as.factor(spotify$album_type)
spotify$explicit <- as.factor(spotify$explicit)
spotify$label <- as.factor(spotify$label)
spotify$key <- as.factor(spotify$key)
spotify$time_signature <- as.factor(spotify$time_signature)
spotify$genre_0 <- as.factor(spotify$genre_0)
spotify$mode <- factor(spotify$mode, levels = c(0, 1), labels = c("minor", "major"))

# An empty artist_1 means the track has no secondary artist, i.e. it is solo.
spotify$collab <- as.factor(ifelse(spotify$artist_1 == "", "solo", "collab"))

Missing values are identified and removed. Overall, the final dataset contains 2252 observations.

spotify[!complete.cases(spotify),]

##                         track_name album_type album_popularity    artist_0
## 149153 Places That Are Gone (Live)      album                4 Tommy Keene
## 280007                        Play      album               47        Yeek
## 280008          At Least You Tried      album               47        Yeek
## 363531         Dorothy'S Interlude      album               55   Sam Smith
## 363553         Dorothy'S Interlude      album               77   Sam Smith
##           genre_0 artist_1 duration_sec followers artist_popularity
## 149153 jangle pop              1002.000      6275                16
## 280007   hyperpop                 7.363    223625                48
## 280008   hyperpop                13.636    223625                48
## 363531        pop                 8.991  23538152                83
## 363553        pop                 8.991  23538152                83
##                           label acousticness danceability energy
## 149153             DePaul Music           NA           NA     NA
## 280007           Valencia House           NA           NA     NA
## 280008           Valencia House           NA           NA     NA
## 363531 Capitol Records UK / EMI           NA           NA     NA
## 363553 Capitol Records UK / EMI           NA           NA     NA
##        instrumentalness  key liveness loudness mode speechiness tempo valence
## 149153               NA <NA>       NA       NA <NA>          NA    NA      NA
## 280007               NA <NA>       NA       NA <NA>          NA    NA      NA
## 280008               NA <NA>       NA       NA <NA>          NA    NA      NA
## 363531               NA <NA>       NA       NA <NA>          NA    NA      NA
## 363553               NA <NA>       NA       NA <NA>          NA    NA      NA
##        time_signature explicit track_popularity collab
## 149153           <NA>    false                0   solo
## 280007           <NA>    false                1   solo
## 280008           <NA>    false                0   solo
## 363531           <NA>    false                3   solo
## 363553           <NA>    false                3   solo

spotify <- spotify[complete.cases(spotify),]
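In the output above, the five incomplete rows lack all audio features at once, which is why dropping them costs only 5 of the 2257 observations. A minimal sketch of how such a check could be done, using a made-up data frame (`toy` is an assumption for illustration):

```r
# Toy stand-in; the values are invented for illustration.
toy <- data.frame(
  track_name = c("a", "b", "c"),
  energy = c(0.53, NA, 0.98),
  tempo = c(100, NA, 124),
  stringsAsFactors = FALSE
)

# Number of missing values per column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows lost when keeping only complete cases.
toy_complete <- toy[complete.cases(toy), ]
rows_dropped <- nrow(toy) - nrow(toy_complete)
print(rows_dropped)
```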

Lastly, a dataset containing only the quantitative variables is created.

spotify_quan <- spotify[, c("album_popularity", "duration_sec", "followers", "artist_popularity", "acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "track_popularity")]

Correlation Analysis

One of the main goals of PCA is to summarize the information in multi-dimensional data in order to facilitate later analysis. The method works especially well when variables are highly correlated with each other. Such correlation is expected within several groups of variables, one of them being the popularity of the artist, the album, and the track. Before using any algorithm, let’s visualize the correlation matrix and check whether reducing the dimensions of the dataset is reasonable.

corrplot(cor(spotify_quan))

As expected, artist popularity, album popularity, track popularity and followers are positively correlated with each other. A strong negative correlation can be observed between energy and acousticness. Weaker correlations between other variables are also present, which suggests that dimensionality reduction could improve later analysis of the dataset.
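A corrplot is easy to scan, but the strongest pairs can also be listed numerically. The following is a sketch only: it uses the built-in mtcars data as a stand-in, and in the project the same steps would be applied to spotify_quan. The name `cor_pairs` is an assumption introduced here.

```r
# mtcars is a built-in stand-in; in the project this would be spotify_quan.
cors <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])

# Keep each pair once by taking the upper triangle of the matrix.
idx <- which(upper.tri(cors), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r = cors[idx]
)

# Sort by absolute correlation, strongest first.
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(head(cor_pairs, 3))
```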

PCA

Principal Component Analysis (PCA) is the first method used in the project, performed only on quantitative data. In the first step, the stats::prcomp function is applied to standardized data (centered and scaled to unit variance).

spotify_pca <- prcomp(spotify_quan, center=TRUE, scale.=TRUE) 
summary(spotify_pca)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.7814 1.6487 1.1891 1.04998 1.02312 0.99335 0.96614
## Proportion of Variance 0.2267 0.1941 0.1010 0.07875 0.07477 0.07048 0.06667
## Cumulative Proportion  0.2267 0.4208 0.5218 0.60055 0.67532 0.74580 0.81248
##                            PC8    PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.85941 0.7539 0.68664 0.61466 0.47311 0.40279 0.28812
## Proportion of Variance 0.05276 0.0406 0.03368 0.02699 0.01599 0.01159 0.00593
## Cumulative Proportion  0.86523 0.9058 0.93951 0.96649 0.98248 0.99407 1.00000

Let’s examine the results, starting with the proportion of variance explained by the principal components.

fviz_eig(spotify_pca, addlabels = TRUE)

The first two PCs explain only 42% of the dataset’s variation, which is not satisfactory. Thus, 6 principal components are selected, as together they explain over 70% of the variance within the data (74.6%).
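The proportions in the summary follow directly from the reported standard deviations: each component’s variance is its squared standard deviation, and because the data were standardized the variances sum to the number of variables (14). A short sketch that reproduces the cumulative proportions and the choice of 6 components from the printed values (the variable names are introduced here for illustration):

```r
# Standard deviations copied from the summary(spotify_pca) output above.
sdev <- c(1.7814, 1.6487, 1.1891, 1.04998, 1.02312, 0.99335, 0.96614,
          0.85941, 0.7539, 0.68664, 0.61466, 0.47311, 0.40279, 0.28812)

eigenvalues <- sdev^2                       # variance carried by each PC
prop_var <- eigenvalues / sum(eigenvalues)  # proportion of variance
cum_var <- cumsum(prop_var)                 # cumulative proportion

# Smallest number of components whose cumulative share reaches 70%.
n_components <- which(cum_var >= 0.70)[1]
print(round(cum_var[1:6], 3))
print(n_components)
```

With these inputs the threshold is first reached at the sixth component, in line with the selection made above.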

pca_var<-get_pca_var(spotify_pca)

Next, the PCA results for the variables are extracted for further analysis, starting with correlation circles. The variables are additionally grouped by their cos2 values, which reflect the quality of their representation on the factor map.

Let’s analyze the outputs:

PC1/PC2: Album popularity, followers, artist popularity and track popularity are positively correlated with each other and well represented by the first two principal components. These variables are also strongly correlated with the first principal component. Loudness, energy and valence are negatively correlated with acousticness and well represented by the first two principal components. These variables are also correlated with the second principal component.

PC3/PC4: Speechiness, danceability and song duration are moderately represented by the third and fourth principal components. Speechiness and song duration are negatively correlated with each other. Danceability is highly correlated with the third principal component.

PC5/PC6: Liveness and instrumentalness are moderately represented by the fifth and sixth principal components. There also appears to be a negative correlation between liveness and tempo.
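To make the cos2 measure concrete, the sketch below computes variable coordinates and cos2 by hand from a prcomp fit. It uses the built-in USArrests data as a stand-in for spotify_quan, and the names `var_coord`, `var_cos2` and `cos2_plane` are assumptions introduced here. For standardized data these quantities should correspond to the coord and cos2 elements returned by get_pca_var.

```r
# USArrests is a built-in stand-in for spotify_quan.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Variable coordinates: loadings scaled by the component standard deviations.
# For standardized data these equal the variable-component correlations.
var_coord <- sweep(pca$rotation, 2, pca$sdev, "*")

# cos2: squared coordinates, i.e. the share of each variable's variance
# captured by each component.
var_cos2 <- var_coord^2

# Quality of representation of each variable on the PC1/PC2 plane.
cos2_plane <- rowSums(var_cos2[, 1:2])
print(round(cos2_plane, 3))
```

A variable whose cos2 on a plane is close to 1 lies near the edge of the correlation circle, while a value close to 0 means the plane says little about that variable.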

Next, contributions of variables to PCs are analyzed. Variables highly correlated with the first two principal components are the most important, as they explain most of the variability within the data.

Let’s look at the results:

Dim1: Variables describing popularity and artist’s followers contribute the most to the first dimension.
Dim2: Variables energy, loudness, acousticness, and valence contribute the most to the second dimension.
Dim3: Variables danceability, valence, liveness, tempo, and speechiness contribute the most to the third dimension.
Dim4: Variables speechiness, song duration, tempo, and instrumentalness contribute the most to the fourth dimension.
Dim5: Variables instrumentalness, liveness, and speechiness contribute the most to the fifth dimension.
Dim6: Variables tempo, instrumentalness, and liveness contribute the most to the sixth dimension.
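The contribution of a variable to a component can be computed as its cos2 divided by the component’s total cos2, which for prcomp output reduces to the squared loading. The sketch below shows this on the built-in USArrests data as a stand-in for spotify_quan. The names `contrib`, `expected` and `above_average` are assumptions, and the uniform reference 100 / p is the usual "expected average contribution" line, assumed here to match the one drawn in the contribution plots.

```r
# USArrests is a built-in stand-in for spotify_quan.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Contribution (%) of variable j to component k: 100 * squared loading,
# since every column of pca$rotation has unit length.
contrib <- 100 * pca$rotation^2

# Reference value if all variables contributed equally.
expected <- 100 / nrow(contrib)

# Variables contributing more than average to the first component.
above_average <- rownames(contrib)[contrib[, "PC1"] > expected]
print(round(contrib[, 1:2], 1))
print(above_average)
```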

To better understand the PCA results and interpret the principal components, in the next step, rotated PCA is performed. Due to a large sample size (2252), a cutoff of 0.3 is selected as a significance threshold of loadings.

# rotated pca
spotify_pca_rotated<-principal(spotify_quan, nfactors=6, rotate="varimax")

print(loadings(spotify_pca_rotated), digits=3, cutoff=0.3, sort=TRUE)

## 
## Loadings:
##                   RC1    RC2    RC3    RC4    RC5    RC6   
## album_popularity   0.917                                   
## followers          0.749                                   
## artist_popularity  0.904                                   
## track_popularity   0.904                                   
## acousticness             -0.824                            
## energy                    0.924                            
## loudness                  0.815                            
## danceability                     0.659  0.305        -0.367
## tempo                           -0.784                     
## duration_sec                           -0.636              
## speechiness                             0.754              
## instrumentalness                              -0.922       
## liveness                                              0.908
## valence                   0.470  0.328         0.354       
## 
##                  RC1   RC2   RC3   RC4   RC5   RC6
## SS loadings    3.150 2.602 1.217 1.200 1.152 1.121
## Proportion Var 0.225 0.186 0.087 0.086 0.082 0.080
## Cumulative Var 0.225 0.411 0.498 0.583 0.666 0.746

Let’s give “umbrella names” to the rotated components:

RC1 is the overall popularity component within the data, capturing how well-known a musical track is.
RC2 represents the overall intensity of a musical track.
RC3 represents the rhythm of a musical track.
RC4 represents the “density” of words in a musical track.
RC5 separates highly instrumental tracks with few vocals from vocal tracks, and also reflects their tone (valence).
RC6 focuses on the live performance aspect of the track. Music performed live might feel less danceable, since these tracks could be quite demanding to perform.

It is important to note that RC5 and RC6 are each composed of only two variables, which raises the question of whether they are necessary.
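The effect of a varimax rotation can be sketched with base R alone. The example below is an illustration under assumptions, not the psych::principal implementation: it uses the built-in mtcars data, keeps a hypothetical k = 3 components, and rotates with stats::varimax, so its numbers are unrelated to the Spotify results. It shows that rotation redistributes variance among the retained components while leaving each variable’s communality unchanged.

```r
# mtcars is a built-in stand-in; k = 3 is an arbitrary choice for the sketch.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)
k <- 3

# Unrotated loadings: eigenvectors scaled by the component standard deviations.
loadings_raw <- sweep(pca$rotation[, 1:k], 2, pca$sdev[1:k], "*")

# Orthogonal varimax rotation of the retained loadings.
rot <- varimax(loadings_raw)
loadings_rot <- unclass(rot$loadings)

# Communalities (row sums of squared loadings) before and after rotation.
communality_raw <- rowSums(loadings_raw^2)
communality_rot <- rowSums(loadings_rot^2)

# Hide small loadings, similar in spirit to print(..., cutoff = 0.3).
shown <- ifelse(abs(loadings_rot) >= 0.3, round(loadings_rot, 3), NA)
print(shown)
```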

Next, the complexity and uniqueness of the variables in the rotated PCA are examined.

plot(spotify_pca_rotated$complexity, spotify_pca_rotated$uniqueness)
text(spotify_pca_rotated$complexity, spotify_pca_rotated$uniqueness, labels=names(spotify_pca_rotated$uniqueness), cex=0.8)
abline(h=c(0.38, 0.75), lty=3, col=2)
abline(v=c(1.8), lty=3, col=2)

Most variables show a similar level of complexity, within an acceptable threshold, with danceability and valence as the only exceptions. Uniqueness varies more, though only track duration has a uniqueness above 0.5. Overall, none of the variables appears particularly problematic, i.e. none has extremely high values of both uniqueness and complexity.
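Both quantities can be derived from a loading matrix. Uniqueness is one minus the communality, and complexity is Hofmann’s index, which equals 1 for a variable loading on a single component and grows as the loadings spread over several components. A minimal sketch with a hypothetical loading matrix (the matrix `L` and its values are invented for illustration):

```r
# Hypothetical rotated loadings for three variables on two components.
L <- rbind(
  clean = c(0.9, 0.0),  # loads on one component only
  split = c(0.6, 0.6),  # loads equally on both components
  weak  = c(0.3, 0.1)   # poorly explained by either component
)
colnames(L) <- c("RC1", "RC2")

communality <- rowSums(L^2)
uniqueness <- 1 - communality                # variance left unexplained
complexity <- rowSums(L^2)^2 / rowSums(L^4)  # Hofmann's index

print(round(cbind(uniqueness, complexity), 3))
```

Here "clean" has complexity 1, "split" has complexity 2, and "weak" has a uniqueness of 0.9; a variable that was both complex and highly unique would be the problematic case described above.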

Multiple Factor Analysis

In this section, Multiple Factor Analysis (MFA) is used as an additional dimensionality reduction method. MFA allows both quantitative and qualitative variables to be used, thus expanding on PCA. Individual, presumably related variables are combined into several groups, which balances the influence of each group of variables on the result. The following groups of variables are identified within the selected dataset (tempo and duration_sec were originally meant to be grouped with qualitative variables, but a group cannot mix quantitative and qualitative variables, so they form a group of their own):

Group 1: Popularity: album_popularity, artist_popularity, track_popularity, followers
Group 2: Audio Analysis: acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, valence
Group 3: Time features: tempo, duration_sec
Group 4: Track Features: key, mode, time_signature
Group 5: Metadata: album_type, explicit, collab

First, the columns are reordered, so that variables belonging to the same group are next to each other in the dataframe.

spotify_reordered <- spotify[, c("album_popularity", "artist_popularity", "track_popularity", "followers", "acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "valence", "tempo", "duration_sec","key", "mode", "time_signature", "album_type", "explicit", "collab")]

Next, the FactoMineR::MFA function is used. Standardization of the continuous variables is specified within the function (type "s"), while "n" marks the groups of qualitative variables.

spotify_mfa <- MFA(spotify_reordered, 
                   group = c(4, 8, 2, 3, 3), 
                   type = c("s", "s", "s", "n", "n"), 
                   name.group = c("popularity", "audio_analysis", "time_features", "track_features", "metadata"),
                   graph = FALSE)
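The balancing of groups mentioned above is achieved by weighting: each group is divided by the square root of the first eigenvalue of its own separate analysis, so that no group can dominate the first global dimension on its own. The sketch below illustrates this idea for quantitative groups only, in base R, on the built-in mtcars data. It is a simplified illustration, not the FactoMineR implementation; the grouping is invented, and qualitative groups (which are coded as indicator columns) are not covered.

```r
# mtcars is a built-in stand-in; the two groups are invented for illustration.
X <- scale(mtcars[, c("disp", "hp", "wt", "mpg", "qsec", "drat")])
groups <- list(
  size = c("disp", "hp", "wt"),
  performance = c("mpg", "qsec", "drat")
)

# First eigenvalue of a PCA on an already centered block.
first_eigenvalue <- function(m) prcomp(m, center = FALSE)$sdev[1]^2

# Weight each group by 1 / sqrt(first eigenvalue of its separate PCA).
weighted <- lapply(groups, function(cols) {
  block <- X[, cols, drop = FALSE]
  block / sqrt(first_eigenvalue(block))
})

# After weighting, every group has a first eigenvalue of 1.
first_after <- sapply(weighted, first_eigenvalue)
print(round(first_after, 6))

# Global PCA on the weighted, concatenated groups.
global_pca <- prcomp(do.call(cbind, weighted), center = FALSE)
print(round(global_pca$sdev^2, 3))
```

In the MFA call above, the group sizes c(4, 8, 2, 3, 3) must sum to the 20 columns of spotify_reordered, in the same order as the columns.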

Again, let’s start with analyzing the percentage of explained variance and visualizing the screeplot.

fviz_eig(spotify_mfa, addlabels = TRUE)

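Compared to PCA, the MFA results are much weaker. The first two dimensions explain only about 13% of the variance, and 16 dimensions would be needed for the cumulative variance to pass the 70% threshold. Part of this gap is to be expected, since each qualitative variable enters the analysis as a set of indicator columns, so the variance is spread over many more dimensions than in the 14-variable PCA. To simplify the analysis, let’s focus only on the first two dimensions.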

fviz_mfa_var(spotify_mfa, "group")

fviz_contrib(spotify_mfa, "group", axes = 1)

fviz_contrib(spotify_mfa, "group", axes = 2)

The code above visualizes the correlations and contributions of the groups of variables for the first two dimensions. Metadata contributes the most to the first dimension, and popularity the least. This is an interesting result, since the popularity variables were the most important in the first dimension of PCA. Time features contribute the most to the second MFA dimension. Audio analysis contributes to both the first and the second dimension, though it is nowhere near as important as some of its variables were in PCA.

fviz_mfa_var(spotify_mfa, repel = TRUE)

As in PCA, a correlation circle can be created for the quantitative variables. Energy, loudness, and valence are negatively correlated with acousticness. Once again, the popularity variables are positively correlated with each other (although they show no clear correlation with either of the first two dimensions). The time features seem to be negatively correlated with the popularity group.

fviz_contrib(spotify_mfa, choice = "quanti.var", axes = 1, top = 20,
             palette = "jco")

fviz_contrib(spotify_mfa, choice = "quanti.var", axes = 2, top = 20,
             palette = "jco")

As for the contributions of individual quantitative variables, song duration is the most important variable for both the first and the second dimension. It is also the only variable that exceeds the expected average contribution for the first dimension. For the second dimension, the other main contributors are tempo and energy. Overall, these results can be described as poor, as they do not really allow the dimensions of the data to be reduced or any meaningful groups to be found.

Conclusions

The results indicate that PCA on the quantitative variables alone is the better dimensionality reduction method for this particular dataset. Adding qualitative variables in MFA substantially reduced the percentage of variance explained by the first two dimensions (about 13%, compared to 42% in PCA). As an extension of this project, it would be interesting to repeat the analysis on a different dataset.