A. Introduction

How do intrinsic audio characteristics such as energy, danceability, and acousticness define the primary differences between tracks on Spotify?

This paper uses the Spotify Songs dataset, which was sourced from the TidyTuesday project and originally collected with the spotifyr package. The dataset contains 32,833 observations and 23 variables, and each observation represents one music track. The variables used for this analysis are nine numeric audio features, such as danceability, energy, and loudness. This topic was chosen because understanding the dimensions underlying these audio features can reveal how streaming platforms categorize sounds and how songs differ in their measurable structure.

# Loading necessary libraries for analysis and visualization
library(tidyverse)
library(factoextra)

B. Data Analysis

To prepare the Spotify dataset for Principal Component Analysis (PCA), I took three steps. First, I used the select() function to isolate the nine numeric audio features required, excluding non-numeric metadata such as track names and IDs. Second, I applied the drop_na() function to remove rows with missing values, so that only complete observations enter the analysis. Finally, I used the mutate() function to confirm that every variable is stored as a numeric type suitable for standardization.

# Import the dataset
spotify_raw <- read_csv("spotify_songs.csv")
## Rows: 32833 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Clean and prepare the data using dplyr and tidyr
pca_ready_data <- spotify_raw %>%
  # 1. select(): keep only the nine numeric audio features
  select(danceability, energy, loudness, speechiness, 
         acousticness, instrumentalness, liveness, valence, tempo) %>%
  # 2. drop_na(): remove rows that contain any missing value
  drop_na() %>%
  # 3. mutate(): coerce every remaining column to numeric
  mutate(across(everything(), as.numeric))
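
Before running PCA, it can help to confirm that the prepared data really is all numeric and free of missing values. The sketch below uses a small made-up data frame (`toy_ready`, a hypothetical stand-in for `pca_ready_data`) and base R only; the values are illustrative, not taken from the dataset.

```r
# Hypothetical stand-in for pca_ready_data; values are made up for illustration
toy_ready <- data.frame(
  danceability = c(0.75, 0.62, 0.48),
  energy       = c(0.92, 0.55, 0.31),
  loudness     = c(-2.6, -7.1, -11.4)
)

# Every column should be numeric, and no cell should be missing
all_numeric <- all(vapply(toy_ready, is.numeric, logical(1)))
no_missing  <- !anyNA(toy_ready)

# Stop with an error if either condition fails
stopifnot(all_numeric, no_missing)
```

The same two checks could be applied to `pca_ready_data` itself once the cleaning pipeline has run.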

C. Statistical Analysis

1. Scaling and PCA Execution

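The data were centered and scaled to unit variance before the analysis. This step is critical because audio features such as loudness (measured in decibels) and danceability (measured on a 0-1 scale) have different units and ranges; without scaling, the features with the largest numeric variance would dominate the principal components.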
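
To illustrate why scaling matters, here is a minimal sketch on simulated data. The two columns are hypothetical stand-ins for loudness and danceability (random values, not the Spotify data), chosen only so that one variable has a much larger variance than the other.

```r
set.seed(1)

# Simulated stand-ins: a decibel-like variable and a 0-1 variable (made-up values)
toy <- data.frame(
  loudness     = rnorm(200, mean = -7, sd = 3),
  danceability = runif(200, min = 0, max = 1)
)

# PCA without and with scaling to unit variance
pca_unscaled <- prcomp(toy, center = TRUE, scale. = FALSE)
pca_scaled   <- prcomp(toy, center = TRUE, scale. = TRUE)

# Absolute PC1 loadings: unscaled PC1 is almost entirely loudness,
# while scaled PC1 weights both variables equally in magnitude
unscaled_pc1 <- abs(pca_unscaled$rotation[, "PC1"])
scaled_pc1   <- abs(pca_scaled$rotation[, "PC1"])

print(round(unscaled_pc1, 3))
print(round(scaled_pc1, 3))
```

In the unscaled fit, the first component mostly reflects the variable with the larger numeric spread, which is the behavior that `scale. = TRUE` is meant to prevent.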

# Perform PCA with scaling enabled
pca_result <- prcomp(pca_ready_data, center = TRUE, scale. = TRUE)

# Report the importance of components
summary(pca_result)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6     PC7
## Standard deviation     1.4670 1.2280 1.0441 0.9906 0.9891 0.92572 0.79872
## Proportion of Variance 0.2391 0.1676 0.1211 0.1090 0.1087 0.09522 0.07088
## Cumulative Proportion  0.2391 0.4067 0.5278 0.6368 0.7455 0.84075 0.91164
##                            PC8     PC9
## Standard deviation     0.75532 0.47409
## Proportion of Variance 0.06339 0.02497
## Cumulative Proportion  0.97503 1.00000
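
The proportions in this table follow directly from the standard deviations: each component's variance is its standard deviation squared, divided by the total variance. As a sketch, the calculation can be reproduced by hand from the values printed above (the vector name `sdev` is just a label for this example; in practice the same numbers are stored in `pca_result$sdev`).

```r
# Standard deviations copied from the summary() output above
sdev <- c(1.4670, 1.2280, 1.0441, 0.9906, 0.9891,
          0.92572, 0.79872, 0.75532, 0.47409)

# Variance of each component (the eigenvalues of the correlation matrix)
eigenvalues <- sdev^2

# With nine standardized variables, the total variance is approximately 9
total_variance <- sum(eigenvalues)

# Proportion and cumulative proportion of variance explained
prop_var <- eigenvalues / total_variance
cum_var  <- cumsum(prop_var)

print(round(prop_var[1:2], 4))
print(round(cum_var[2], 4))
```

Up to rounding of the printed standard deviations, the first two proportions and their cumulative sum match the PC1 and PC2 columns of the table.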

2. Visualizations

The following graphs illustrate the distribution of songs and the relationships between audio features.

# Graph 1: Individual Songs (Dots) - PC1 vs PC2
fviz_pca_ind(pca_result, 
             geom.ind = "point", 
             col.ind = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             title = "PCA - Spotify Song Map (PC1 vs PC2)")

# Graph 2: Rotation Matrix (Arrows) - Directions and Magnitude
fviz_pca_var(pca_result,
             col.var = "contrib", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE, 
             title = "Rotation Matrix: Variables' Directions and Magnitude")

3. Interpretation of the Rotation Matrix

The rotation matrix shows how much each audio feature contributes to each component. PC1 (23.91% of the variance) is heavily weighted by energy and loudness in the negative direction, opposing acousticness in the positive direction. Because the sign of a principal component is arbitrary, what matters is the contrast itself: the largest single source of variation among these Spotify tracks is the difference between loud, high-energy sounds and quiet, acoustic ones. PC2 (16.76% of the variance) represents mood, with danceability and valence (positivity) loading strongly in the same direction. Together, PC1 and PC2 explain 40.67% of the total variance, condensing nine variables into two interpretable musical dimensions while leaving roughly 59% of the variance to the remaining components.
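
As a rough illustration of how such a contrast shows up in the loadings, the sketch below simulates three features driven by one hidden factor. The data, the `intensity` factor, and the object names are all hypothetical; only the column names echo the Spotify features. For the real model, the equivalent table would be `pca_result$rotation`.

```r
set.seed(42)
n <- 500

# Hypothetical latent factor: high values mean loud, energetic tracks
intensity <- rnorm(n)

# Simulated features: two rise with intensity, one falls (made-up data)
toy <- data.frame(
  energy       = intensity + rnorm(n, sd = 0.5),
  loudness     = intensity + rnorm(n, sd = 0.5),
  acousticness = -intensity + rnorm(n, sd = 0.5)
)

toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# PC1 loadings: energy and loudness share a sign, acousticness has the opposite one
loadings_pc1 <- toy_pca$rotation[, "PC1"]
print(round(loadings_pc1, 2))

# Flipping every sign gives an equally valid component of unit length
flipped_pc1 <- -loadings_pc1
print(sum(flipped_pc1^2))
```

Whether energy and loudness come out negative or positive can differ between runs or software; the opposition between the two groups of features is the part that carries meaning.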

D. Conclusion and Future Directions

This analysis projected 32,833 tracks onto a two-dimensional map, suggesting that intensity and mood are the two largest sources of variation among the audio features studied. PCA captured just over 40% of the dataset’s variance using only two components, so a majority of the variation remains in the other seven components. These results imply that complex musical profiles can be partly summarized by a small number of dimensions, which may be useful for recommendation algorithms. Future research should apply this model to specific genres such as rap or pop to see whether these patterns shift or remain the same.

E. References

Dataset: TidyTuesday Spotify Songs (2020). Retrieved from https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md