The goal of the project is to apply two dimensionality reduction methods, PCA and MFA, to data on musical tracks released in 2023, focusing specifically on the pop genre, and to analyze their outputs.
The following R packages were used to perform the analysis:
library(stats) # pca
library(corrplot) # corrplots
library(factoextra) # plots
library(gridExtra) # displaying most significant variables that constitute pc
library(psych) # rotated pca
library(FactoMineR) # mfa
The dataset was found on Kaggle and includes information on thousands of musical tracks, pulled with Spotify’s Web API. It covers song metadata, popularity and audio analysis; a total of 24 columns are used in the analysis:
track_name: Name of the track.
album_type: Album type, describing whether the track was released as a single or within an album.
album_popularity: Popularity of the album, ranging between 0 and 100, with 100 being the most popular.
artist_0: The main artist of the track.
genre_0: The main genre of the track.
artist_1: The secondary artist of the track.
duration_sec: Duration of the track in seconds.
followers: Total number of followers the artist has on Spotify.
artist_popularity: Popularity of the artist, calculated by Spotify based on all of the artist’s tracks (ranging between 0 and 100).
acousticness: A confidence measure from 0.0 to 1.0, assessing whether the track is acoustic (higher values indicate more acoustic content).
danceability: How suitable a track is for dancing, based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity (0.0 is least danceable, 1.0 is most danceable).
energy: A measure from 0.0 to 1.0 representing intensity and activity (energetic tracks feel fast, loud, and noisy, leading to a higher score).
instrumentalness: A measure of whether a track contains no vocals (values closer to 1.0 indicate a greater likelihood that the track is purely instrumental).
key: The key the track is in (integers map to pitches using standard Pitch Class notation).
liveness: Detects the presence of an audience in the recording, ranging between 0.0 and 1.0 (higher values indicate a greater probability that the track was performed live).
loudness: The overall loudness of a track in decibels (typically ranging between -60 and 0 dB).
mode: Represents the modality of the track (either major or minor).
speechiness: Detects the presence of spoken words in a track, ranging between 0.0 and 1.0 (the more exclusively speech-like the recording, the closer to 1.0 the attribute value).
tempo: The overall estimated tempo of a track in beats per minute (BPM).
valence: A measure from 0.0 to 1.0, describing the musical positiveness conveyed by a track (positive-sounding tracks have higher valence).
time_signature: The number of beats in each bar (ranges from 3 to 7, indicating time signatures such as 3/4 to 7/4).
explicit: Indicates whether the track has explicit lyrics.
track_popularity: Popularity of the track calculated by Spotify’s algorithm, ranging between 0 and 100, with 100 being the most popular.
collab: Whether the song was a collaboration or performed solo.
spotify <- read.csv("spotify_data_12_20_2023.csv")
Detailed information on how the measures are assigned can be found in Spotify’s Web API documentation. Before the analysis and initial pre-processing, the dataset was filtered to include only tracks released in 2023 with pop as the main genre.
spotify <- spotify[grep("pop", spotify$genre_0), ]
spotify <- spotify[spotify$release_year==2023,]
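Note that matching on the substring "pop" keeps every genre label that contains it, so sub-genres such as "k-pop ballad" or "hyperpop" are retained alongside plain "pop". The sketch below illustrates this on a small made-up data frame (`toy` and its values are hypothetical, not taken from the dataset) and combines both filters into a single logical index. Using `%in%` rather than `==` is an optional safeguard, assumed here, so that a missing release year does not produce an all-NA row.

```r
# Hypothetical toy data standing in for the Spotify table
toy <- data.frame(
  genre_0 = c("pop", "k-pop ballad", "rock", "hyperpop", "pop"),
  release_year = c(2023, 2023, 2023, 2022, NA),
  stringsAsFactors = FALSE
)

# Substring match: any genre label containing "pop" counts as pop
is_pop <- grepl("pop", toy$genre_0, fixed = TRUE)

# %in% returns FALSE (not NA) for a missing year
is_2023 <- toy$release_year %in% 2023

toy_pop <- toy[is_pop & is_2023, ]
print(toy_pop)
```

Only the first two rows survive: "rock" fails the genre filter, "hyperpop" was released in 2022, and the last row has no release year.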
First, the data is pre-processed so that dimensionality reduction algorithms can be applied.
str(spotify)
## 'data.frame': 2257 obs. of 49 variables:
## $ album_id : chr "13u5VtPqwiS0olRYW2iUet" "02V8WaeAzCF6Tfk26yWwnL" "1oxqIMFixLqwqxqqHifJAk" "1oxqIMFixLqwqxqqHifJAk" ...
## $ album_name : chr "잘해보고 싶어요" "O'Rkha D'Shmayya" "Perfected" "Perfected" ...
## $ album_popularity : int 1 1 0 0 2 2 2 2 2 2 ...
## $ album_type : chr "single" "single" "single" "single" ...
## $ artists : chr "['Lia Kim']" "['Rayan Zaito']" "['Orpheus in red velvet']" "['Orpheus in red velvet']" ...
## $ artist_0 : chr "Lia Kim" "Rayan Zaito" "Orpheus in red velvet" "Orpheus in red velvet" ...
## $ artist_1 : chr "" "" "" "" ...
## $ artist_2 : chr "" "" "" "" ...
## $ artist_3 : chr "" "" "" "" ...
## $ artist_4 : chr "" "" "" "" ...
## $ artist_id : chr "4XkyKEhzoNQEg8ruN7OkPs" "2VCgGRjYUv9LKgLzE3yz0o" "3TNxQnixmRnIOr1jV8ghsJ" "3TNxQnixmRnIOr1jV8ghsJ" ...
## $ duration_sec : num 232 257 238 297 173 ...
## $ label : chr "jh kimsmusic" "Babylon Records" "3185475 Records DK" "3185475 Records DK" ...
## $ release_date : chr "2023-02-06 00:00:00 UTC" "2023-11-01 00:00:00 UTC" "2023-07-01 00:00:00 UTC" "2023-07-01 00:00:00 UTC" ...
## $ total_tracks : int 1 1 2 2 10 10 10 10 10 10 ...
## $ track_id : chr "3qPXM4XdLCv6GNK7n8GoL6" "5o7296SqFh2dK2rjnfskwh" "2tWoFHY2Sl32KtiiRRGK8J" "17BtcevO6GU9l2fqVZ2ePs" ...
## $ track_name : chr "잘해보고 싶어요" "O'Rkha D'Shmayya" "Perfected - Single Version" "Perfected - B-Side-Mix" ...
## $ track_number : int 1 1 1 2 3 7 9 8 10 6 ...
## $ artist_genres : chr "['k-pop ballad']" "['assyrian pop']" "['deep neo-synthpop']" "['deep neo-synthpop']" ...
## $ artist_popularity: int 0 0 0 0 1 1 1 1 1 1 ...
## $ followers : int 60 60 69 69 204 204 204 204 204 204 ...
## $ name : chr "Lia Kim" "Rayan Zaito" "Orpheus in red velvet" "Orpheus in red velvet" ...
## $ genre_0 : chr "k-pop ballad" "assyrian pop" "deep neo-synthpop" "deep neo-synthpop" ...
## $ genre_1 : chr "" "" "" "" ...
## $ genre_2 : chr "" "" "" "" ...
## $ genre_3 : chr "" "" "" "" ...
## $ genre_4 : chr "" "" "" "" ...
## $ acousticness : num 0.697 0.783 0.101 0.0208 0.497 0.163 0.271 0.176 0.563 0.248 ...
## $ analysis_url : chr "https://api.spotify.com/v1/audio-analysis/3qPXM4XdLCv6GNK7n8GoL6" "https://api.spotify.com/v1/audio-analysis/5o7296SqFh2dK2rjnfskwh" "https://api.spotify.com/v1/audio-analysis/2tWoFHY2Sl32KtiiRRGK8J" "https://api.spotify.com/v1/audio-analysis/17BtcevO6GU9l2fqVZ2ePs" ...
## $ danceability : num 0.467 0.499 0.573 0.53 0.564 0.798 0.739 0.756 0.529 0.65 ...
## $ duration_ms : int 231613 257088 238427 296976 173103 219591 214625 184808 216052 228195 ...
## $ energy : num 0.307 0.284 0.977 0.99 0.534 0.428 0.433 0.433 0.191 0.328 ...
## $ instrumentalness : num 0.00 2.31e-06 1.53e-01 2.43e-01 1.14e-04 3.53e-05 1.37e-04 0.00 0.00 2.35e-05 ...
## $ key : int 7 4 6 6 5 9 5 10 6 10 ...
## $ liveness : num 0.151 0.133 0.338 0.353 0.0925 0.14 0.25 0.0934 0.098 0.108 ...
## $ loudness : num -7.83 -11.22 -8.37 -7.46 -9.24 ...
## $ mode : int 1 0 0 0 0 1 1 1 0 1 ...
## $ speechiness : num 0.0271 0.0366 0.0489 0.0472 0.0431 0.0344 0.0337 0.0676 0.0292 0.137 ...
## $ tempo : num 75 142 124 124 100 ...
## $ time_signature : int 4 4 4 4 4 4 4 4 4 4 ...
## $ track_href : chr "https://api.spotify.com/v1/tracks/3qPXM4XdLCv6GNK7n8GoL6" "https://api.spotify.com/v1/tracks/5o7296SqFh2dK2rjnfskwh" "https://api.spotify.com/v1/tracks/2tWoFHY2Sl32KtiiRRGK8J" "https://api.spotify.com/v1/tracks/17BtcevO6GU9l2fqVZ2ePs" ...
## $ type : chr "audio_features" "audio_features" "audio_features" "audio_features" ...
## $ uri : chr "spotify:track:3qPXM4XdLCv6GNK7n8GoL6" "spotify:track:5o7296SqFh2dK2rjnfskwh" "spotify:track:2tWoFHY2Sl32KtiiRRGK8J" "spotify:track:17BtcevO6GU9l2fqVZ2ePs" ...
## $ valence : num 0.161 0.356 0.659 0.614 0.39 0.237 0.342 0.243 0.228 0.238 ...
## $ explicit : chr "false" "false" "false" "false" ...
## $ track_popularity : int 3 3 0 0 1 0 0 0 0 0 ...
## $ release_year : int 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
## $ release_month : chr "February" "November" "July" "July" ...
## $ rn : int 1 1 1 1 1 1 1 1 1 1 ...
Redundant and unnecessary variables are removed.
spotify <- spotify[, c("track_name", "album_type", "album_popularity", "artist_0", "genre_0", "artist_1", "duration_sec", "followers", "artist_popularity","label", "acousticness", "danceability", "energy", "instrumentalness", "key", "liveness", "loudness", "mode", "speechiness", "tempo", "valence", "time_signature", "explicit", "track_popularity")]
Several variables are converted to factors, and a new variable, collab, is created to indicate whether the track was performed solo or with one or more other artists.
spotify$album_type <- as.factor(spotify$album_type)
spotify$explicit <- as.factor(spotify$explicit)
spotify$label <- as.factor(spotify$label)
spotify$key <- as.factor(spotify$key)
spotify$time_signature <- as.factor(spotify$time_signature)
spotify$genre_0 <- as.factor(spotify$genre_0)
spotify$mode <- factor(spotify$mode, levels = c(0, 1), labels = c("minor", "major"))
spotify$collab <- NA
spotify[spotify$artist_1 == "", "collab"] <- "solo"
spotify[spotify$artist_1 != "", "collab"] <- "collab"
spotify$collab <- as.factor(spotify$collab)
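The same solo/collab flag can also be derived in a single step. The following is a minimal sketch on a made-up vector (`artist_1` here is hypothetical, and the explicit level order is an assumption rather than what `as.factor` would choose alphabetically):

```r
# Hypothetical secondary-artist column: empty string means no second artist
artist_1 <- c("", "Artist B", "", "Artist C")

# An empty secondary artist marks a solo track, anything else a collaboration
collab <- factor(ifelse(artist_1 == "", "solo", "collab"),
                 levels = c("solo", "collab"))
table(collab)
```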
Missing values are identified and removed. Overall, the final dataset contains 2252 observations.
spotify[!complete.cases(spotify),]
## track_name album_type album_popularity artist_0
## 149153 Places That Are Gone (Live) album 4 Tommy Keene
## 280007 Play album 47 Yeek
## 280008 At Least You Tried album 47 Yeek
## 363531 Dorothy'S Interlude album 55 Sam Smith
## 363553 Dorothy'S Interlude album 77 Sam Smith
## genre_0 artist_1 duration_sec followers artist_popularity
## 149153 jangle pop 1002.000 6275 16
## 280007 hyperpop 7.363 223625 48
## 280008 hyperpop 13.636 223625 48
## 363531 pop 8.991 23538152 83
## 363553 pop 8.991 23538152 83
## label acousticness danceability energy
## 149153 DePaul Music NA NA NA
## 280007 Valencia House NA NA NA
## 280008 Valencia House NA NA NA
## 363531 Capitol Records UK / EMI NA NA NA
## 363553 Capitol Records UK / EMI NA NA NA
## instrumentalness key liveness loudness mode speechiness tempo valence
## 149153 NA <NA> NA NA <NA> NA NA NA
## 280007 NA <NA> NA NA <NA> NA NA NA
## 280008 NA <NA> NA NA <NA> NA NA NA
## 363531 NA <NA> NA NA <NA> NA NA NA
## 363553 NA <NA> NA NA <NA> NA NA NA
## time_signature explicit track_popularity collab
## 149153 <NA> false 0 solo
## 280007 <NA> false 1 solo
## 280008 <NA> false 0 solo
## 363531 <NA> false 3 solo
## 363553 <NA> false 3 solo
spotify <- spotify[complete.cases(spotify),]
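The printed rows suggest that the audio features are missing together for the affected tracks. A quick way to confirm such a pattern is to count the missing values per column before dropping rows. The sketch below does this on a small made-up data frame (`toy` and its values are hypothetical):

```r
# Hypothetical data: the second track lacks all audio features
toy <- data.frame(
  energy = c(0.8, NA, 0.4, 0.6),
  tempo = c(120, NA, 98, 140),
  track_popularity = c(10, 3, 0, 55)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows with at least one missing value
n_incomplete <- sum(!complete.cases(toy))

# Keep only the complete rows
toy_clean <- toy[complete.cases(toy), ]
```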
Lastly, a dataset containing only the quantitative variables is created.
spotify_quan <- spotify[, c("album_popularity", "duration_sec", "followers", "artist_popularity", "acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "track_popularity")]
One of the main goals of PCA is to summarize the information in multi-dimensional data in order to facilitate later analysis. The method works especially well if the variables are highly correlated with each other. Such behaviour is expected for several groups of variables, one of them being the popularity of the artist, the album and the track. Before using any algorithm, let’s visualize the correlation matrix and confirm whether reducing the dimensions of the dataset is reasonable.
corrplot(cor(spotify_quan))
As expected, artist popularity, album popularity, track popularity and followers are positively correlated with each other. A strong negative correlation can be observed between energy and acousticness. Weaker correlations between other variables are also observed, which suggests that dimensionality reduction could benefit later analysis of the dataset.
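Besides reading the plot, the strongest pairs can be listed numerically. The following sketch ranks variable pairs by absolute correlation; it uses a few columns of the built-in mtcars dataset as stand-in data, and the name `cor_pairs` is introduced here purely for illustration.

```r
# Stand-in data: a few numeric columns from the built-in mtcars dataset
cor_mat <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])

# Indices of the upper triangle, so each pair is listed once
upper <- which(upper.tri(cor_mat), arr.ind = TRUE)

cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[upper[, "row"]],
  var2 = colnames(cor_mat)[upper[, "col"]],
  r = cor_mat[upper]
)

# Strongest correlations (positive or negative) first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
head(cor_pairs, 3)
```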
Principal Component Analysis (PCA) is the first method used in the project, performed only on the quantitative data. In the first step, the stats::prcomp function is applied to standardized (centered and scaled) data.
spotify_pca <- prcomp(spotify_quan, center=TRUE, scale.=TRUE)
summary(spotify_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7814 1.6487 1.1891 1.04998 1.02312 0.99335 0.96614
## Proportion of Variance 0.2267 0.1941 0.1010 0.07875 0.07477 0.07048 0.06667
## Cumulative Proportion 0.2267 0.4208 0.5218 0.60055 0.67532 0.74580 0.81248
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.85941 0.7539 0.68664 0.61466 0.47311 0.40279 0.28812
## Proportion of Variance 0.05276 0.0406 0.03368 0.02699 0.01599 0.01159 0.00593
## Cumulative Proportion 0.86523 0.9058 0.93951 0.96649 0.98248 0.99407 1.00000
Let’s examine the results, starting with the proportion of variance explained by the principal components.
fviz_eig(spotify_pca, addlabels = TRUE)
The first two PCs explain only 42% of the dataset’s variation, which is not satisfactory. Thus, 6 principal components are selected, as together they explain over 70% of the variance in the data.
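The choice of 6 components can be checked directly from the standard deviations printed in the summary above. In the sketch below, the values are copied from that output (rounded as printed), and the 70% threshold is the one used in the text.

```r
# Standard deviations of PC1..PC14, copied from the summary output
sdev <- c(1.7814, 1.6487, 1.1891, 1.04998, 1.02312, 0.99335, 0.96614,
          0.85941, 0.7539, 0.68664, 0.61466, 0.47311, 0.40279, 0.28812)

# Eigenvalues are the squared standard deviations
eigenvalues <- sdev^2
prop_var <- eigenvalues / sum(eigenvalues)
cum_var <- cumsum(prop_var)

# Smallest number of components whose cumulative share reaches 70%
n_components <- which(cum_var >= 0.70)[1]

round(cum_var[1:6], 3)
n_components
```

Because the data were standardized, the eigenvalues sum to (approximately) the number of variables, 14.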
pca_var<-get_pca_var(spotify_pca)
Next, the results for the variables are extracted for further analysis, starting with correlation circles. The variables are additionally grouped by their cos2 values, which reflect the quality of their representation on the factor map.
Let’s analyze the outputs:
PC1/PC2: Album popularity, followers, artist popularity and track popularity are positively correlated with each other and well represented by the first two principal components. These variables are also strongly correlated with the first principal component. Loudness, energy and valence are negatively correlated with acousticness and well represented by the first two principal components. These variables are also correlated with the second principal component.
PC3/PC4: Speechiness, danceability and song duration are moderately represented by the third and fourth principal components. Speechiness and song duration are negatively correlated with each other. Danceability is highly correlated with the third principal component.
PC5/PC6: Liveness and instrumentalness are moderately represented by the fifth and sixth principal components. There appears to be a negative correlation between liveness and tempo.
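For a standardized PCA, the coordinates drawn on a correlation circle are the correlations between each variable and each component, and cos2 is their square. The following sketch derives both by hand from a prcomp result. It uses the built-in USArrests dataset as stand-in data, and it is an illustration of the definition rather than the exact code path of factoextra.

```r
# Stand-in data: the built-in USArrests dataset
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Variable coordinates: loadings scaled by the component standard deviations
var_coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# cos2: squared coordinates, i.e. the share of a variable captured by a PC
var_cos2 <- var_coord^2

round(var_cos2[, 1:2], 3)

# Across all components, each variable's cos2 values add up to 1
rowSums(var_cos2)
```

A variable whose cos2 values on two axes sum to nearly 1 lies close to the edge of the correlation circle for that pair of axes.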
Next, contributions of variables to PCs are analyzed. Variables highly correlated with the first two principal components are the most important, as they explain most of the variability within the data.
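As a rough guide to reading contribution plots, the quantity they display can be derived by hand: the contribution of a variable to a component is its cos2 divided by the sum of cos2 on that component, expressed as a percentage. The sketch below uses the built-in USArrests dataset as stand-in data. The reference value of 100 divided by the number of variables is the contribution expected if all variables contributed equally.

```r
# Stand-in data: the built-in USArrests dataset
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Squared variable coordinates (cos2)
var_cos2 <- sweep(pca$rotation, 2, pca$sdev, `*`)^2

# Contribution (%) of each variable to each component
var_contrib <- sweep(var_cos2, 2, colSums(var_cos2), `/`) * 100

# Contribution expected under equal weights
expected_average <- 100 / nrow(var_contrib)

round(var_contrib[, 1:2], 1)

# Variables contributing more than average to the first component
rownames(var_contrib)[var_contrib[, 1] > expected_average]
```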
Let’s look at the results:
To better understand the PCA results and interpret the principal components, rotated PCA is performed in the next step. Due to the large sample size (2252), a cutoff of 0.3 is selected as the significance threshold for loadings.
# rotated pca
spotify_pca_rotated<-principal(spotify_quan, nfactors=6, rotate="varimax")
print(loadings(spotify_pca_rotated), digits=3, cutoff=0.3, sort=TRUE)
##
## Loadings:
## RC1 RC2 RC3 RC4 RC5 RC6
## album_popularity 0.917
## followers 0.749
## artist_popularity 0.904
## track_popularity 0.904
## acousticness -0.824
## energy 0.924
## loudness 0.815
## danceability 0.659 0.305 -0.367
## tempo -0.784
## duration_sec -0.636
## speechiness 0.754
## instrumentalness -0.922
## liveness 0.908
## valence 0.470 0.328 0.354
##
## RC1 RC2 RC3 RC4 RC5 RC6
## SS loadings 3.150 2.602 1.217 1.200 1.152 1.121
## Proportion Var 0.225 0.186 0.087 0.086 0.082 0.080
## Cumulative Var 0.225 0.411 0.498 0.583 0.666 0.746
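To illustrate what the rotation does, the sketch below applies base R's stats::varimax to the loadings of an ordinary PCA. This is a stand-in for psych::principal, not a reproduction of it: the built-in mtcars dataset and the choice of k = 3 components are assumptions made for the example, and signs or component order may differ from what psych reports.

```r
# Stand-in data: the built-in mtcars dataset
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)
k <- 3  # number of retained components (arbitrary for this example)

# Unrotated loadings: correlations between variables and the first k PCs
unrotated <- sweep(pca$rotation[, 1:k], 2, pca$sdev[1:k], `*`)

# Varimax rotation from the stats package
rotated <- varimax(unrotated)
rotated_loadings <- unclass(rotated$loadings)
colnames(rotated_loadings) <- paste0("RC", 1:k)

# The rotation redistributes variance between components,
# but leaves each variable's communality unchanged
communality_before <- rowSums(unrotated^2)
communality_after <- rowSums(rotated_loadings^2)

# Hide small loadings, as in the output above
print(rotated$loadings, cutoff = 0.3)
```

Since the rotation is orthogonal, the total variance explained by the retained components stays the same; only its split across components changes, which is why the rotated proportions differ from the unrotated ones.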
Let’s give “umbrella names” to the principal components:
It is important to note that RC5 and RC6 are each composed of only two variables, which raises the question of whether they are necessary.
Next, the complexity and uniqueness of the variables within the rotated PCA are examined.
plot(spotify_pca_rotated$complexity, spotify_pca_rotated$uniqueness)
text(spotify_pca_rotated$complexity, spotify_pca_rotated$uniqueness, labels=names(spotify_pca_rotated$uniqueness), cex=0.8)
abline(h=c(0.38, 0.75), lty=3, col=2)
abline(v=c(1.8), lty=3, col=2)
Most variables show a similar level of complexity within an acceptable threshold, with danceability and valence as the only exceptions. More variability can be found in uniqueness, though only the track duration variable has a uniqueness value higher than 0.5. Overall, none of the variables appear to be particularly problematic, i.e. none has extremely high values of both uniqueness and complexity.
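Both measures can be computed from a loading matrix. Uniqueness is one minus the communality (the sum of squared loadings of a variable), and complexity is taken here to be Hofmann's index, the squared sum of squared loadings divided by the sum of fourth powers. The loading matrix in the sketch is made up for illustration, and its row names are hypothetical.

```r
# Hypothetical loadings of three variables on two components
loadings <- rbind(
  single_factor = c(0.9, 0.0),   # loads on one component only
  split_evenly = c(0.5, 0.5),    # loads equally on both components
  weak_everywhere = c(0.2, 0.1)  # barely captured by either component
)

communality <- rowSums(loadings^2)

# Share of a variable's variance not explained by the retained components
uniqueness <- 1 - communality

# Hofmann's complexity index: 1 means the variable loads on one component
complexity <- communality^2 / rowSums(loadings^4)

round(cbind(uniqueness, complexity), 3)
```

A variable loading on a single component has complexity 1, an even split across two components gives 2, and a variable that is poorly captured overall shows up with high uniqueness.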
In the following section of the project, Multiple Factor Analysis (MFA) is used as an additional dimensionality reduction method. MFA allows for using both quantitative and qualitative variables, thus expanding on PCA. Individual, presumably related variables are selected to form several groups, which balances the influence of each group of variables on the analysis. The following groups of variables are identified within the selected dataset (initially, tempo and duration_sec were meant to be placed in groups 3 and 4; however, it is not possible to create mixed groups of quantitative and qualitative variables):
Group 1: Popularity: album_popularity, artist_popularity, track_popularity, followers
Group 2: Audio Analysis: acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, valence
Group 3: Time features: tempo, duration_sec
Group 4: Track Features: key, mode, time_signature
Group 5: Metadata: album_type, explicit, collab
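The balancing of groups in MFA is commonly described as follows: each group is analyzed separately, its variables are divided by the square root of the first eigenvalue of that analysis, and a global PCA is then run on the combined, weighted table. The sketch below shows this for quantitative groups only, using base R and the built-in mtcars dataset as stand-in data. The group names and column choices are hypothetical, and FactoMineR's implementation differs in details, for example in how nominal groups are coded.

```r
# Stand-in data: two quantitative groups from the built-in mtcars dataset
groups <- list(
  engine = mtcars[, c("disp", "hp", "cyl")],
  performance = mtcars[, c("mpg", "qsec")]
)

# First eigenvalue of a group's own standardized PCA
first_eigenvalue <- function(x) {
  prcomp(x, center = TRUE, scale. = TRUE)$sdev[1]^2
}

# Standardize each group, then divide by the square root of that eigenvalue
weighted <- lapply(groups, function(x) scale(x) / sqrt(first_eigenvalue(x)))

# After weighting, the leading eigenvalue of every group equals 1,
# so no group can dominate the first global dimension by size alone
leading <- sapply(weighted, function(x) prcomp(x, center = FALSE)$sdev[1]^2)
print(leading)

# Global PCA on the combined weighted table
global_pca <- prcomp(do.call(cbind, unname(weighted)), center = FALSE)
summary(global_pca)
```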
First, the columns are reordered, so that variables belonging to the same group are next to each other in the dataframe.
spotify_reordered <- spotify[, c("album_popularity", "artist_popularity", "track_popularity", "followers", "acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "valence", "tempo", "duration_sec","key", "mode", "time_signature", "album_type", "explicit", "collab")]
Next, the FactoMineR::MFA function is used. Standardization of the continuous variables is specified within the function call.
spotify_mfa <- MFA(spotify_reordered,
group = c(4, 8, 2, 3, 3),
type = c("s", "s", "s", "n", "n"),
name.group = c("popularity", "audio_analysis", "time_features", "track_features", "metadata"),
graph = FALSE)
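Because `group` is matched to the columns purely by position, the column order and the group sizes have to stay in sync. One way to guard against mismatches, sketched below, is to derive both from a single named list. The list `mfa_groups` is a hypothetical helper, while the column and group names are those used above.

```r
# Single source of truth for the MFA grouping (hypothetical helper)
mfa_groups <- list(
  popularity = c("album_popularity", "artist_popularity",
                 "track_popularity", "followers"),
  audio_analysis = c("acousticness", "danceability", "energy",
                     "instrumentalness", "liveness", "loudness",
                     "speechiness", "valence"),
  time_features = c("tempo", "duration_sec"),
  track_features = c("key", "mode", "time_signature"),
  metadata = c("album_type", "explicit", "collab")
)

# Column order for the reordered data frame
column_order <- unlist(mfa_groups, use.names = FALSE)

# Group sizes and names in the same order
group_sizes <- unname(lengths(mfa_groups))
group_names <- names(mfa_groups)

print(group_sizes)
print(group_names)
```

Under this approach, `column_order` would be used to select the columns, and `group_sizes` and `group_names` would be passed as `group` and `name.group`.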
Again, let’s start with analyzing the percentage of explained variance and visualizing the screeplot.
fviz_eig(spotify_mfa, addlabels = TRUE)
Compared to PCA, the MFA results are much worse. The first two dimensions explain only about 13% of the variance. For the cumulative variance to pass the 70% threshold, 16 dimensions would be needed. Thus, to simplify the analysis, let’s focus only on the first two dimensions.
fviz_mfa_var(spotify_mfa, "group")
fviz_contrib(spotify_mfa, "group", axes = 1)
fviz_contrib(spotify_mfa, "group", axes = 2)
The correlations and contributions of the variable groups for the first two dimensions are visualized next. Metadata contributes the most to the first dimension, and popularity the least. This is an interesting result, since the popularity group variables were the most important in the first dimension of PCA. Time features contribute the most to the second MFA dimension. Audio analysis contributes to both the first and the second dimension, though nowhere near as strongly as some of its variables did in PCA.
fviz_mfa_var(spotify_mfa, repel = TRUE)
As in PCA, a correlation circle for the quantitative variables can be created. Energy, loudness and valence are negatively correlated with acousticness. Once again, the popularity group variables are positively correlated with each other (though they show no correlation with the first two dimensions). The time features seem to be negatively correlated with the popularity group.
fviz_contrib(spotify_mfa, choice = "quanti.var", axes = 1, top = 20,
palette = "jco")
fviz_contrib(spotify_mfa, choice = "quanti.var", axes = 2, top = 20,
palette = "jco")
When it comes to the contribution of individual quantitative variables, song duration is the most important variable for both the first and the second dimension. It is also the only variable that exceeds the expected average contribution for the first dimension. For the second dimension, the other top contributors are tempo and energy. Overall, these results can be described as poor, as they don’t really allow for reducing the dimensions of the data or finding any meaningful groups.
The results indicate that PCA using only quantitative variables is a better method of dimensionality reduction for this particular dataset. In this case, adding qualitative variables in MFA substantially reduced the percentage of variance explained by the first two dimensions. As an extension of this project, it could be interesting to replicate the analysis on a different dataset.