A. Introduction
How do intrinsic audio characteristics like energy, danceability, and acousticness define the primary differences between tracks on Spotify?
This research paper uses the Spotify Songs dataset, which was sourced from the TidyTuesday project and collected via the spotifyr package. The dataset contains 32,833 observations and 23 variables. Each case in the dataset represents a music track. The variables used for this analysis are numeric audio features such as danceability, energy, and loudness. This topic was chosen because understanding the dimensions of music audio features can reveal how streaming platforms categorize sounds and how songs differ in their measured audio profiles.
# Loading necessary libraries for analysis and visualization
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.5.3
## Welcome to factoextra!
## Want to learn more? See two factoextra-related books at https://www.datanovia.com/en/product/practical-guide-to-principal-component-methods-in-r/
B. Data Analysis
To prepare the Spotify dataset for Principal Component Analysis (PCA), I first used the select() function to isolate the nine numeric audio features required, excluding non-numeric metadata such as track names and IDs. Second, I applied the drop_na() function to remove rows with missing values, so that only complete observations enter the model. Finally, I used the mutate() function to ensure all variables were stored as numeric data types suitable for standardization.
# Import the dataset
spotify_raw <- read_csv("spotify_songs.csv")
## Rows: 32833 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Cleaning and preparing the data using dplyr
pca_ready_data <- spotify_raw %>%
  # 1. select(): keep only the nine numeric audio features
  select(danceability, energy, loudness, speechiness,
         acousticness, instrumentalness, liveness, valence, tempo) %>%
  # 2. drop_na(): remove rows with any missing value
  drop_na() %>%
  # 3. mutate(): coerce every remaining column to numeric
  mutate(across(everything(), as.numeric))
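Before running PCA it can help to confirm that the cleaning steps did what they were meant to do. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not rows from the Spotify file) and base R's `complete.cases()` as a stand-in for `drop_na()`, so it runs without any packages.

```r
# Hypothetical toy data standing in for the selected audio features
toy <- data.frame(
  danceability = c(0.75, 0.62, NA, 0.48),
  energy       = c(0.92, 0.81, 0.55, 0.33),
  loudness     = c(-2.6, -4.9, -7.1, -11.2)
)

# Keep only complete rows (base R analogue of drop_na())
toy_complete <- toy[complete.cases(toy), ]

# Checks worth running before PCA
stopifnot(
  !anyNA(toy_complete),                               # no missing values remain
  all(vapply(toy_complete, is.numeric, logical(1)))   # every column is numeric
)

# How many rows were lost to missing values
rows_dropped <- nrow(toy) - nrow(toy_complete)
rows_dropped
```

Reporting the number of dropped rows alongside the final sample size makes it clear how much of the original data the analysis actually uses.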
C. Statistical Analysis
1. Scaling and PCA Execution
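The data was centered and scaled to unit variance before the analysis. This step is critical because audio features such as loudness (measured in decibels) and danceability (measured on a 0-1 scale) have different units; without scaling, the features with the largest numeric variance would dominate the components.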
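The effect of scaling can be illustrated with simulated data. This is a minimal sketch: the three columns and their distributions are assumptions chosen only to mimic features on very different scales, not values taken from the Spotify dataset.

```r
set.seed(42)
n <- 200

# Hypothetical, independent features on very different scales
toy <- data.frame(
  danceability = runif(n, 0, 1),     # unitless, 0-1
  loudness     = rnorm(n, -7, 3),    # decibels
  tempo        = rnorm(n, 120, 27)   # beats per minute
)

pca_unscaled <- prcomp(toy, center = TRUE, scale. = FALSE)
pca_scaled   <- prcomp(toy, center = TRUE, scale. = TRUE)

# Absolute PC1 loadings with and without scaling
abs_unscaled <- abs(pca_unscaled$rotation[, "PC1"])
abs_scaled   <- abs(pca_scaled$rotation[, "PC1"])

# Without scaling, the feature with the largest raw variance dominates PC1
dominant_unscaled <- names(which.max(abs_unscaled))
dominant_unscaled

# With scaling, every variable contributes one unit of variance
total_scaled_variance <- sum(pca_scaled$sdev^2)
total_scaled_variance
```

In the unscaled fit PC1 is almost entirely tempo, simply because beats per minute span a wider numeric range than the other two columns. Setting `scale. = TRUE` removes that artifact.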
# Perform PCA with scaling enabled
pca_result <- prcomp(pca_ready_data, center = TRUE, scale. = TRUE)
# Report the importance of components
summary(pca_result)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6     PC7
## Standard deviation     1.4670 1.2280 1.0441 0.9906 0.9891 0.92572 0.79872
## Proportion of Variance 0.2391 0.1676 0.1211 0.1090 0.1087 0.09522 0.07088
## Cumulative Proportion  0.2391 0.4067 0.5278 0.6368 0.7455 0.84075 0.91164
##                            PC8     PC9
## Standard deviation     0.75532 0.47409
## Proportion of Variance 0.06339 0.02497
## Cumulative Proportion  0.97503 1.00000
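The proportions in this table follow directly from the standard deviations: each component's variance is its standard deviation squared, divided by the total. The sketch below recomputes them from the rounded values printed above (the variable names are mine), so small rounding differences are expected.

```r
# Standard deviations copied from the summary() output above
sdev <- c(1.4670, 1.2280, 1.0441, 0.9906, 0.9891,
          0.92572, 0.79872, 0.75532, 0.47409)
names(sdev) <- paste0("PC", seq_along(sdev))

variances <- sdev^2                        # eigenvalues of the correlation matrix
prop_var  <- variances / sum(variances)    # proportion of variance per component
cum_prop  <- cumsum(prop_var)              # cumulative proportion

# With nine standardized variables the variances sum to about 9
total_variance <- sum(variances)

# First component at which the cumulative proportion reaches 80%
n_for_80 <- unname(which(cum_prop >= 0.80)[1])

round(prop_var, 4)
round(cum_prop, 4)
n_for_80
```

By this calculation six components are needed to pass 80% of the variance, which is useful context when reading the two-component plots.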
# Graph 1: Individual Songs (Dots) - PC1 vs PC2
fviz_pca_ind(pca_result,
             geom.ind = "point",
             col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             title = "PCA - Spotify Song Map (PC1 vs PC2)")
# Graph 2: Rotation Matrix (Arrows) - Directions and Magnitude
fviz_pca_var(pca_result,
             col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE,
             title = "Rotation Matrix: Variables' Directions and Magnitude")
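As far as I understand, the arrows in a variable plot like this one are drawn from the rotation matrix rescaled by each component's standard deviation, which for standardized data equals the correlation between each variable and each component. The sketch below checks that identity in base R on simulated data; the data frame is hypothetical and the code does not call factoextra itself.

```r
set.seed(7)
n <- 250
shared <- rnorm(n)

# Hypothetical features: two related, one unrelated
toy <- data.frame(
  energy   = shared + rnorm(n, sd = 0.6),
  loudness = 3 * shared + rnorm(n, sd = 2) - 7,
  valence  = runif(n)
)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Rotation columns (unit-length eigenvectors) times each component's sd
arrow_coords <- sweep(pca$rotation, 2, pca$sdev, "*")

# Correlation between each original variable and each component's scores
var_pc_cor <- cor(toy, pca$x)

# The two matrices should agree up to floating-point error
max_gap <- max(abs(arrow_coords - var_pc_cor))
max_gap
```

Under this reading, an arrow's length reflects how well the two plotted components represent that variable, while the raw weights are available in `pca_result$rotation`.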
4. Interpretation of the Rotation Matrix
The rotation matrix gives the weight of each audio feature in each
component. PC1 (23.91% of variance) is heavily weighted by Energy and
Loudness in the negative direction, opposing Acousticness in the
positive direction. Because the sign of a principal component is
arbitrary, what matters is that these features sit at opposite ends of
the same axis. This suggests that the primary difference among Spotify
tracks is the contrast between loud, electric sounds and quiet,
acoustic ones. PC2 (16.76% of variance) represents the Vibe, with
Danceability and Valence (positivity) pulling strongly in the same
direction. Combined, PC1 and PC2 explain 40.67% of the total variance,
summarizing nine variables in two core musical dimensions while
leaving about 59% of the variance to the remaining components.
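The reading of PC1 as a contrast between opposing features can be reproduced on simulated data. In this sketch a hypothetical latent `intensity` factor drives three made-up columns; it also shows that flipping the sign of a component changes nothing of substance.

```r
set.seed(1)
n <- 300

# Hypothetical latent factor: raises energy and loudness, lowers acousticness
intensity <- rnorm(n)
toy <- data.frame(
  energy       = intensity + rnorm(n, sd = 0.5),
  loudness     = intensity + rnorm(n, sd = 0.5),
  acousticness = -intensity + rnorm(n, sd = 0.5)
)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)
loadings_pc1 <- pca$rotation[, "PC1"]

# Energy and loudness share a sign; acousticness takes the opposite one
same_side <- unname(sign(loadings_pc1["energy"]) == sign(loadings_pc1["loudness"]))
opposed   <- unname(sign(loadings_pc1["energy"]) != sign(loadings_pc1["acousticness"]))

# Flip the sign of PC1 in both the loadings and the scores
flipped_rotation <- pca$rotation
flipped_rotation[, 1] <- -flipped_rotation[, 1]
flipped_scores <- pca$x
flipped_scores[, 1] <- -flipped_scores[, 1]

# Both versions reconstruct the same standardized data
reconstruction_same <- isTRUE(all.equal(
  pca$x %*% t(pca$rotation),
  flipped_scores %*% t(flipped_rotation)
))

loadings_pc1
c(same_side = same_side, opposed = opposed, reconstruction_same = reconstruction_same)
```

Which end of the axis is labeled negative can therefore differ between software versions or runs; the grouping of features is the stable result.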
D. Conclusion and Future Directions
This analysis summarized 32,833 tracks in a two-dimensional map, suggesting that Intensity and Mood are the two strongest dimensions of variation in this dataset. PCA captured about 41% of the dataset's variance using only two components, so a majority of the variance lies in the remaining components. These results imply that musical profiles can be partly summarized along a small number of dimensions, which could be useful for recommendation algorithms. Future research should apply this model to specific genres like Rap or Pop to see whether these patterns shift or remain the same.
E. References
Dataset: TidyTuesday Spotify Songs (2020). Retrieved from https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md