1 Overview and Objectives
2 1. Data Preparation & Cleaning
- 2.1 1.1 Loading Libraries and Data
- 2.2 1.2 Data Cleaning
3 2. Exploratory Data Analysis (EDA)
4 3. Advanced Modeling
5 4. Clustering Analysis: Grouping Songs by All Musical Features
- 5.1 4.1 K-Means Clustering
- 5.2 4.2 Visualizing the Clusters
6 5. Statistical Testing
7 6. User Recommendations and Next Steps
8 7. The “Perfect Song” Based on Our Analysis
9 8. Reproducibility
10 Final Summary

1 Overview and Objectives

In this report, we work with all 18 columns of the Spotify dataset, covering both musical attributes (danceability, energy, loudness, etc.) and metadata (genre, artist, track name, key, mode, time signature). Our goals are to:

Understand what factors drive a song’s popularity on Spotify.
Explore advanced modeling techniques (including ensemble methods).
Identify natural groupings of songs using clustering.
Perform statistical tests to validate observed trends.
Describe the characteristics of an ideal, “perfect” song.
Provide actionable recommendations for music production.

Every step is explained in plain language with detailed labels for charts, legends, and performance metrics such as R-squared, MAE, and RMSE.

2 1. Data Preparation & Cleaning

2.1 1.1 Loading Libraries and Data

glimpse(spotify_data)

## Rows: 232,725
## Columns: 18
## $ genre            <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie",…
## $ artist_name      <chr> "Henri Salvador", "Martin & les fées", "Joseph Willia…
## $ track_name       <chr> "C'est beau de faire un Show", "Perdu d'avance (par G…
## $ track_id         <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "…
## $ popularity       <dbl> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
## $ acousticness     <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,…
## $ danceability     <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
## $ duration_ms      <dbl> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
## $ energy           <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
## $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123…
## $ key              <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G…
## $ liveness         <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105…
## $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
## $ mode             <chr> "Major", "Minor", "Minor", "Major", "Major", "Major",…
## $ speechiness      <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953…
## $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
## $ time_signature   <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/4…
## $ valence          <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533…

Explanation: This overview shows the column names, data types, and sample values of the dataset.
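The chunk that loads the packages and reads the data is not shown in the rendered report. The sketch below shows one way such a setup step might look. The package names come from the session information at the end of the report; the file name `SpotifyFeatures.csv` and the use of base `read.csv()` are assumptions made for illustration.

```r
# Packages that appear under "other attached packages" in the session info.
pkgs <- c("tidyverse", "caret", "randomForest", "factoextra", "car")

# Report any package that is not installed instead of failing midway.
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
}

# Hypothetical file name; replace with the actual path to the dataset.
data_path <- "SpotifyFeatures.csv"
if (file.exists(data_path)) {
  spotify_data <- read.csv(data_path, stringsAsFactors = FALSE)
  str(spotify_data)  # base-R counterpart of glimpse()
}
```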

2.2 1.2 Data Cleaning

spotify_clean <- spotify_data %>%
  drop_na(popularity, danceability, energy, loudness, acousticness, speechiness, tempo, valence, mode, key, time_signature) %>%
  mutate(
    mode = factor(mode),
    key = factor(key),
    time_signature = factor(time_signature),
    genre = factor(genre)
  )

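Explanation: Rows with missing values in the modeling variables are dropped, and the categorical columns (mode, key, time_signature, genre) are converted to factors so that R treats them as categories in plots and models. The cleaned data still has 232,725 rows (see the summary in the next section), so no rows were actually removed.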
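To see what a cleaning step like this does, it helps to count missing values before and after. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the Spotify data) and base R only.

```r
# Toy data frame standing in for spotify_data (values are made up).
toy <- data.frame(
  popularity   = c(10, NA, 35, 60),
  danceability = c(0.4, 0.6, NA, 0.8),
  mode         = c("Major", "Minor", "Major", "Minor"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only rows with no missing values, then count how many were dropped.
toy_clean <- toy[complete.cases(toy), ]
rows_dropped <- nrow(toy) - nrow(toy_clean)
print(rows_dropped)

# Convert the categorical column to a factor, as in the cleaning step.
toy_clean$mode <- factor(toy_clean$mode)
```

Running the same `colSums(is.na(...))` check on the real data before calling `drop_na()` would show directly whether any rows are at risk of being removed.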

3 2. Exploratory Data Analysis (EDA)

3.1 2.1 Data Overview

summary(spotify_clean)

##         genre        artist_name         track_name          track_id        
##  Comedy    :  9681   Length:232725      Length:232725      Length:232725     
##  Soundtrack:  9646   Class :character   Class :character   Class :character  
##  Indie     :  9543   Mode  :character   Mode  :character   Mode  :character  
##  Jazz      :  9441                                                           
##  Pop       :  9386                                                           
##  Electronic:  9377                                                           
##  (Other)   :175651                                                           
##    popularity      acousticness     danceability     duration_ms     
##  Min.   :  0.00   Min.   :0.0000   Min.   :0.0569   Min.   :  15387  
##  1st Qu.: 29.00   1st Qu.:0.0376   1st Qu.:0.4350   1st Qu.: 182857  
##  Median : 43.00   Median :0.2320   Median :0.5710   Median : 220427  
##  Mean   : 41.13   Mean   :0.3686   Mean   :0.5544   Mean   : 235122  
##  3rd Qu.: 55.00   3rd Qu.:0.7220   3rd Qu.:0.6920   3rd Qu.: 265768  
##  Max.   :100.00   Max.   :0.9960   Max.   :0.9890   Max.   :5552917  
##                                                                      
##      energy          instrumentalness         key           liveness      
##  Min.   :0.0000203   Min.   :0.0000000   C      :27583   Min.   :0.00967  
##  1st Qu.:0.3850000   1st Qu.:0.0000000   G      :26390   1st Qu.:0.09740  
##  Median :0.6050000   Median :0.0000443   D      :24077   Median :0.12800  
##  Mean   :0.5709577   Mean   :0.1483012   C#     :23201   Mean   :0.21501  
##  3rd Qu.:0.7870000   3rd Qu.:0.0358000   A      :22671   3rd Qu.:0.26400  
##  Max.   :0.9990000   Max.   :0.9990000   F      :20279   Max.   :1.00000  
##                                          (Other):88524                    
##     loudness          mode         speechiness         tempo       
##  Min.   :-52.457   Major:151744   Min.   :0.0222   Min.   : 30.38  
##  1st Qu.:-11.771   Minor: 80981   1st Qu.:0.0367   1st Qu.: 92.96  
##  Median : -7.762                  Median :0.0501   Median :115.78  
##  Mean   : -9.570                  Mean   :0.1208   Mean   :117.67  
##  3rd Qu.: -5.501                  3rd Qu.:0.1050   3rd Qu.:139.05  
##  Max.   :  3.744                  Max.   :0.9670   Max.   :242.90  
##                                                                    
##  time_signature    valence      
##  0/4:     8     Min.   :0.0000  
##  1/4:  2608     1st Qu.:0.2370  
##  3/4: 24111     Median :0.4440  
##  4/4:200760     Mean   :0.4549  
##  5/4:  5238     3rd Qu.:0.6600  
##                 Max.   :1.0000  
##

Explanation: This summary shows overall statistics (min, max, mean, etc.) for each variable.

3.2 2.2 Correlation Analysis for Numeric Features

numeric_vars <- spotify_clean %>% 
  select(popularity, danceability, energy, loudness, acousticness, speechiness, tempo, valence, duration_ms, instrumentalness, liveness)
cor_matrix_all <- cor(numeric_vars)
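heatmap(cor_matrix_all, 
        main = "Heatmap: Correlation Among Musical Attributes & Popularity\n(Colors indicate strength and direction)",
        col = heat.colors(10), 
        symm = TRUE,      # a correlation matrix is symmetric
        scale = "none")   # plot the raw correlations; rescaling columns would distort them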

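Explanation: The heatmap shows the pairwise Pearson correlations between the numeric features and popularity, with color encoding the value of each correlation. Note that heatmap() reorders rows and columns by hierarchical clustering, so related features appear next to each other.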
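A heatmap is easier to read alongside a ranked list of correlations with the target. The sketch below builds such a list on synthetic data (the variables are simulated for illustration and do not reproduce the Spotify values).

```r
set.seed(1)  # synthetic data only; not the Spotify dataset
n <- 200
loudness   <- rnorm(n, mean = -9, sd = 4)
energy     <- 0.6 + 0.04 * (loudness + 9) + rnorm(n, sd = 0.1)  # built to track loudness
tempo      <- rnorm(n, mean = 118, sd = 25)                      # unrelated by construction
popularity <- 40 + 1.5 * (loudness + 9) + rnorm(n, sd = 10)
toy <- data.frame(popularity, loudness, energy, tempo)

cor_toy <- cor(toy)

# Correlation of every feature with popularity, largest absolute value first.
with_pop <- cor_toy["popularity", colnames(cor_toy) != "popularity"]
ranked <- with_pop[order(abs(with_pop), decreasing = TRUE)]
print(round(ranked, 2))
```

Applied to the report's `cor_matrix_all`, the same two lines would list which features are most strongly associated with popularity.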

3.3 2.3 Exploring Categorical Data

3.3.1 Genre Distribution

ggplot(spotify_clean, aes(x = genre)) +
  geom_bar(fill = "purple", alpha = 0.7) +
  labs(title = "Distribution of Genres in the Dataset", 
       x = "Genre (e.g., Pop, Rock, Hip-Hop)", 
       y = "Number of Songs") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Explanation: Shows how many songs fall into each genre. The six largest genres each contain roughly 9,400 to 9,700 songs.

3.3.2 Musical Key Distribution

ggplot(spotify_clean, aes(x = key)) +
  geom_bar(fill = "orange", alpha = 0.7) +
  labs(title = "Distribution of Musical Keys", 
       x = "Key (e.g., C, D, E, etc.)", 
       y = "Number of Songs") +
  theme_minimal()

Explanation: Shows how often each musical key occurs. C (27,583 songs), G (26,390), and D (24,077) are the most common keys.

3.3.3 Mode Distribution

ggplot(spotify_clean, aes(x = mode)) +
  geom_bar(fill = "green", alpha = 0.7) +
  labs(title = "Distribution of Song Modes", 
       x = "Mode (Major vs. Minor)", 
       y = "Number of Songs") +
  theme_minimal()

Explanation: Shows the split between modes: 151,744 songs are in a major mode and 80,981 in a minor mode, roughly 65% versus 35%.

3.3.4 Time Signature Distribution

ggplot(spotify_clean, aes(x = time_signature)) +
  geom_bar(fill = "blue", alpha = 0.7) +
  labs(title = "Time Signatures in Songs", 
       x = "Time Signature (e.g., 4/4)", 
       y = "Number of Songs") +
  theme_minimal()

Explanation: Shows which time signatures are most common. 4/4 dominates with 200,760 songs (about 86%), while only 8 songs are labeled 0/4.

4 3. Advanced Modeling

4.1 3.1 Comprehensive Regression Model

lm_model_full <- lm(popularity ~ danceability + energy + loudness + acousticness + speechiness + tempo + valence + key + mode + time_signature, 
                    data = spotify_clean)
summary(lm_model_full)

## 
## Call:
## lm(formula = popularity ~ danceability + energy + loudness + 
##     acousticness + speechiness + tempo + valence + key + mode + 
##     time_signature, data = spotify_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.299 -10.427   1.334  11.159  59.897 
## 
## Coefficients:
##                      Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)        60.8892197   5.6593938  10.759 < 0.0000000000000002 ***
## danceability       18.0389947   0.2451583  73.581 < 0.0000000000000002 ***
## energy             -9.7239038   0.2737627 -35.519 < 0.0000000000000002 ***
## loudness            0.8624896   0.0105662  81.627 < 0.0000000000000002 ***
## acousticness      -12.3995216   0.1551569 -79.916 < 0.0000000000000002 ***
## speechiness       -11.0615866   0.2111855 -52.379 < 0.0000000000000002 ***
## tempo               0.0001819   0.0011219   0.162              0.87117    
## valence           -12.5135840   0.1654022 -75.655 < 0.0000000000000002 ***
## keyA#               0.3079245   0.1667519   1.847              0.06481 .  
## keyB                1.1113265   0.1608946   6.907     0.00000000000496 ***
## keyC                0.1584206   0.1439133   1.101              0.27098    
## keyC#               1.9679865   0.1500904  13.112 < 0.0000000000000002 ***
## keyD                0.1063631   0.1483129   0.717              0.47328    
## keyD#               0.6763823   0.2127035   3.180              0.00147 ** 
## keyE                0.2963662   0.1612899   1.837              0.06614 .  
## keyF                0.2826252   0.1545548   1.829              0.06745 .  
## keyF#               1.9704873   0.1678619  11.739 < 0.0000000000000002 ***
## keyG               -0.1951597   0.1450772  -1.345              0.17856    
## keyG#               1.7755203   0.1680502  10.565 < 0.0000000000000002 ***
## modeMinor           1.6577619   0.0719483  23.041 < 0.0000000000000002 ***
## time_signature1/4  -8.4094669   5.6569357  -1.487              0.13713    
## time_signature3/4  -7.6453690   5.6492302  -1.353              0.17595    
## time_signature4/4  -5.2815154   5.6486920  -0.935              0.34979    
## time_signature5/4  -6.0170861   5.6525871  -1.064              0.28711    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.97 on 232701 degrees of freedom
## Multiple R-squared:  0.2288, Adjusted R-squared:  0.2287 
## F-statistic:  3002 on 23 and 232701 DF,  p-value: < 0.00000000000000022

Explanation:
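This regression model predicts song popularity using both numeric and categorical features. Key performance metrics include:
- R-squared (R²): Proportion of variance explained (closer to 1 is better). Here R² is 0.2288, so the model explains about 23% of the variation in popularity.
- P-values: Indicate statistical significance (p < 0.05 typically means significant). With more than 230,000 rows, even small effects reach significance, so the size of each coefficient matters more than its p-value. Tempo (p = 0.87) is the only numeric feature with no detectable effect.
Coefficients for key, mode, and time signature are measured against the reference levels (key A, Major, and 0/4). The 0/4 level contains only 8 songs, which is why the intercept and the time-signature coefficients have large standard errors.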
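A coefficient is the expected change in popularity for a one-unit change in a feature, with the other predictors held fixed. The sketch below turns a few coefficients from the output above into more intuitive step sizes (the step sizes are illustrative assumptions) and recomputes the adjusted R-squared from the reported R-squared.

```r
# Coefficients copied from the regression output above.
coefs <- c(danceability = 18.0389947, loudness = 0.8624896,
           acousticness = -12.3995216, tempo = 0.0001819)

# Illustrative changes in each feature (assumed step sizes, not from the report).
steps <- c(danceability = 0.1, loudness = 1, acousticness = 0.1, tempo = 10)

# Expected change in popularity points, holding the other predictors fixed.
effect <- coefs * steps[names(coefs)]
print(round(effect, 2))

# Adjusted R-squared from R-squared, sample size n, and number of slope coefficients p.
r2 <- 0.2288
n  <- 232725
p  <- 23
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))
```

With this many rows relative to the number of coefficients, the adjustment barely changes R-squared, which matches the near-identical values in the output.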

4.2 3.1.1 Performance Metrics

predicted <- predict(lm_model_full, spotify_clean)
actual <- spotify_clean$popularity
mae <- mean(abs(predicted - actual))       # Mean Absolute Error
rmse <- sqrt(mean((predicted - actual)^2))   # Root Mean Squared Error

mae  # Lower MAE means better prediction accuracy

## [1] 12.776

rmse # Lower RMSE indicates fewer large errors

## [1] 15.97409

Explanation:
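MAE and RMSE measure the model’s prediction errors in popularity points (on the 0 to 100 scale); lower values indicate better performance. An MAE of 12.78 means predictions miss by about 13 points on average. Both metrics are computed on the same data the model was fitted to, so they describe fit rather than performance on unseen songs.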
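Error metrics are most informative when computed on rows the model has not seen. The sketch below wraps MAE and RMSE in small helper functions and evaluates a linear model on a 20% hold-out set. The data are simulated, and the helper names `mae()` and `rmse()` are introduced here for illustration.

```r
# Helper functions (names chosen for this sketch).
mae  <- function(actual, predicted) mean(abs(actual - predicted))
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(42)  # synthetic data only
n <- 1000
toy <- data.frame(danceability = runif(n), loudness = rnorm(n, -9, 4))
toy$popularity <- 45 + 18 * toy$danceability + 0.9 * toy$loudness + rnorm(n, sd = 15)

# Hold out 20% of the rows for testing.
test_idx <- sample(n, size = 0.2 * n)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Fit on the training rows only, then predict the held-out rows.
fit <- lm(popularity ~ danceability + loudness, data = train)
test_pred <- predict(fit, newdata = test)

test_mae  <- mae(test$popularity, test_pred)
test_rmse <- rmse(test$popularity, test_pred)
print(c(MAE = test_mae, RMSE = test_rmse))
```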

4.3 3.2 Advanced Ensemble Modeling with Random Forest

To improve predictive accuracy and capture nonlinear relationships, we build a random forest model on a random sample of 5,000 songs, which keeps computation time manageable.

set.seed(42)
# Downsample to 5000 rows (or adjust as needed) to speed up the model
spotify_rf <- spotify_clean %>% sample_n(min(nrow(spotify_clean), 5000))
train_control <- trainControl(method = "cv", number = 5)
rf_model <- train(popularity ~ danceability + energy + loudness + acousticness + speechiness + tempo + valence + key + mode + time_signature,
                  data = spotify_rf,
                  method = "rf",
                  trControl = train_control,
                  ntree = 50)  # Reduced number of trees for faster performance
rf_model

## Random Forest 
## 
## 5000 samples
##   10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 4001, 4001, 3999, 4001, 3998 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    2    15.76468  0.2803838  12.67905
##   12    15.39906  0.3024494  12.19876
##   23    15.47550  0.2967762  12.28065
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 12.

Explanation:
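We use a random forest model with 5-fold cross-validation and reduce the number of trees to 50 to shorten training time. The model averages multiple decision trees to capture complex, nonlinear relationships. The cross-validated results for the selected setting (mtry = 12) are RMSE 15.40, MAE 12.20, and R² 0.30, a modest improvement over the linear regression (RMSE 15.97, MAE 12.78, R² 0.23).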

rf_pred <- predict(rf_model, spotify_rf)
rf_mae <- mean(abs(rf_pred - spotify_rf$popularity))
rf_rmse <- sqrt(mean((rf_pred - spotify_rf$popularity)^2))
rf_mae

## [1] 5.06033

rf_rmse

## [1] 6.550461

Explanation:
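We calculate MAE and RMSE for the random forest model on the same 5,000 songs it was trained on. These in-sample values are optimistic, because a random forest can fit its training data very closely. The cross-validated figures in the model output (RMSE 15.40, MAE 12.20) are the better estimate of accuracy on new songs.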
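The gap between training error and cross-validated error can be quantified directly from the numbers printed above. The sketch below only does arithmetic on those reported values; with the fitted object available, the cross-validation table itself is stored in `rf_model$results`.

```r
# Values copied from the outputs above.
lm_in_sample <- c(RMSE = 15.97409, MAE = 12.776)    # linear model, all rows
rf_cv        <- c(RMSE = 15.39906, MAE = 12.19876)  # random forest, 5-fold CV, mtry = 12
rf_in_sample <- c(RMSE = 6.550461, MAE = 5.06033)   # random forest, training rows

# Training error as a fraction of the cross-validated error.
optimism_ratio <- rf_in_sample / rf_cv

# Percentage improvement of the cross-validated forest over the linear model.
improvement_pct <- 100 * (lm_in_sample - rf_cv) / lm_in_sample

print(round(optimism_ratio, 2))
print(round(improvement_pct, 1))
```

The training error is less than half the cross-validated error, while the gain over the linear model is only a few percent.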

4.4 3.3 Checking Model Assumptions (Regression)

par(mfrow = c(2, 2))
plot(lm_model_full)

par(mfrow = c(1, 1))

Explanation:
Diagnostic plots help verify that the residuals (errors) of our regression model are random, normally distributed, and have consistent spread.

4.5 3.4 Assessing Multicollinearity with VIF

vif_full <- vif(lm_model_full)
vif_full

##                    GVIF Df GVIF^(1/(2*Df))
## danceability   1.888216  1        1.374124
## energy         4.743810  1        2.178029
## loudness       3.663065  1        1.913914
## acousticness   2.763090  1        1.662254
## speechiness    1.399800  1        1.183131
## tempo          1.095965  1        1.046883
## valence        1.687377  1        1.298991
## key            1.110300 11        1.004767
## mode           1.071064  1        1.034922
## time_signature 1.245960  4        1.027870

Explanation:
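VIF (variance inflation factor) measures how strongly each predictor is explained by the other predictors. Values near 1 mean little overlap, and values above about 5 are a common warning sign. All values here are below 5. Energy (4.74) and loudness (3.66) are the highest, so their coefficients are best interpreted together. For the multi-level factors (key and time_signature), the GVIF^(1/(2*Df)) column adjusts for the number of levels.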
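For a single numeric predictor, the VIF can be computed by hand as 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below does this on simulated data in which two predictors are built to overlap; it illustrates the definition and is not a substitute for `car::vif()`.

```r
set.seed(7)  # synthetic data only
n <- 500
loudness <- rnorm(n, -9, 4)
energy   <- 0.6 + 0.04 * (loudness + 9) + rnorm(n, sd = 0.1)  # overlaps with loudness
tempo    <- rnorm(n, 118, 25)                                  # independent by construction
X <- data.frame(loudness, energy, tempo)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the rest.
manual_vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(manual_vif, 2))
```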

5 4. Clustering Analysis: Grouping Songs by All Musical Features

5.1 4.1 K-Means Clustering

spotify_scaled_all <- scale(spotify_clean %>% 
                              select(danceability, energy, loudness, acousticness, speechiness, tempo, valence, duration_ms, instrumentalness, liveness))
set.seed(42)
kmeans_result_all <- kmeans(spotify_scaled_all, centers = 4, nstart = 10)
spotify_clean$cluster_all <- as.factor(kmeans_result_all$cluster)

Explanation:
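Standardizing (mean 0, standard deviation 1) puts all features on the same scale; otherwise duration_ms, which is measured in milliseconds, would dominate the distance calculation. K-means then groups songs into 4 clusters based on their musical attributes. The number of clusters is chosen in advance, and nstart = 10 reruns the algorithm from 10 random starting points and keeps the best solution.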
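One common way to check a choice such as 4 clusters is an elbow plot: run k-means for several values of k and look for the point where the within-cluster sum of squares stops dropping sharply. The sketch below does this on simulated data with three built-in groups; it is an illustration of the technique, not the procedure used in this report.

```r
set.seed(42)  # synthetic data only
# Three well-separated groups in two features.
toy <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 6), ncol = 2),
  cbind(rnorm(100, mean = 0), rnorm(100, mean = 6))
)
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1..8.
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 10)$tot.withinss
})

plot(ks, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow plot (synthetic data)")
```

On the full dataset this loop would be slow, so running it on a random sample of rows is a reasonable shortcut.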

5.2 4.2 Visualizing the Clusters

fviz_cluster(kmeans_result_all, data = spotify_scaled_all, geom = "point", ellipse.type = "norm",
             ggtheme = theme_minimal(), main = "Clusters of Songs Based on All Musical Features")

Explanation:
- Points and Colors: Each point represents a song; its color (as indicated in the automatically generated legend) shows which cluster it belongs to (e.g., Cluster 1, Cluster 2, etc.).
- Axes: Because the clustering uses ten features, fviz_cluster() projects the songs onto the first two principal components, so clusters that overlap in the plot may still be separated in the full feature space.
- Cluster Interpretation: The descriptions below are hypotheses rather than results. They need to be checked against the cluster centers (kmeans_result_all$centers), and the cluster numbers themselves are arbitrary labels.
  - Cluster 1: May consist of songs with very high energy, loudness, and danceability (likely upbeat, danceable tracks).
  - Cluster 2: Might include songs with lower energy and more acoustic qualities (possibly softer or ballad-like tracks).
  - Cluster 3: Could represent a balanced mix of features (classic mainstream tracks).
  - Cluster 4: May capture songs with unique characteristics (such as higher speechiness or experimental structures).
- Ellipses: The ellipses indicate the overall spread or boundary of the songs in each cluster.
- Legend: The automatically generated legend maps each color to a specific cluster number.
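Cluster descriptions are easiest to verify by tabulating the average of each feature within each cluster. The sketch below shows the idea on simulated data with two obvious groups; the data frame `toy` and its values are made up for illustration.

```r
set.seed(42)  # synthetic data only
toy <- data.frame(
  energy       = c(rnorm(50, 0.8, 0.05), rnorm(50, 0.3, 0.05)),
  acousticness = c(rnorm(50, 0.1, 0.05), rnorm(50, 0.8, 0.05))
)

# Cluster on standardized features, then attach the labels to the original data.
km <- kmeans(scale(toy), centers = 2, nstart = 10)
toy$cluster <- factor(km$cluster)

# Mean of each feature (in original units) per cluster.
profile <- aggregate(cbind(energy, acousticness) ~ cluster, data = toy, FUN = mean)
print(profile)

# Number of songs in each cluster.
sizes <- table(toy$cluster)
print(sizes)
```

The same `aggregate()` call on `spotify_clean`, grouped by `cluster_all`, would show whether the four clusters match the descriptions above.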

6 5. Statistical Testing

We perform an ANOVA to test if differences in popularity exist between genres.

anova_result <- aov(popularity ~ genre, data = spotify_clean)
summary(anova_result)

##                 Df   Sum Sq Mean Sq F value              Pr(>F)    
## genre           26 55432620 2132024   23001 <0.0000000000000002 ***
## Residuals   232698 21569746      93                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Explanation:
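ANOVA (Analysis of Variance) tests whether the mean popularity differs significantly across genres. Here it does (F = 23,001, p < 0.001). The sums of squares also show that genre accounts for about 72% of the total variation in popularity (55,432,620 of 77,002,366), far more than the 23% explained by the audio features in the regression model.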
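The share of variation explained by genre (often called eta-squared) and the F statistic can both be recomputed from the ANOVA table. The sketch below uses only the numbers printed above, so it runs without the dataset.

```r
# Values copied from the ANOVA table above.
ss_genre    <- 55432620
ss_residual <- 21569746
df_genre    <- 26
df_residual <- 232698

# Eta-squared: share of total variation attributable to genre.
eta_sq <- ss_genre / (ss_genre + ss_residual)

# F statistic: mean square between genres divided by mean square within genres.
f_stat <- (ss_genre / df_genre) / (ss_residual / df_residual)

print(round(eta_sq, 3))
print(round(f_stat))
```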

7 6. User Recommendations and Next Steps

Based on our enhanced analysis, here are some recommendations for further improvement:

Expand Data Integration:
Incorporate additional metadata (e.g., release dates, streaming counts, listener demographics) for more nuanced insights.
Advanced Modeling:
Explore more ensemble methods or neural networks, and use interpretability tools (like SHAP) to better understand feature contributions.
Dashboard Development:
Build a fully interactive dashboard using Shiny or flexdashboard for real-time data exploration.
Experimentation:
Use A/B testing with real streaming data to validate which song characteristics most improve listener engagement.
Personalized Recommendations:
Develop a recommendation engine to suggest song features for artists based on current trends.

8 7. The “Perfect Song” Based on Our Analysis

Based on our comprehensive analysis, the ideal (“perfect”) song for mainstream popularity on Spotify would have these characteristics:

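Loudness Without Excess Energy:
A loud, full mix helps (about +0.86 popularity points per dB). Energy, by contrast, has a negative coefficient (-9.72) once loudness is accounted for, so the model does not support simply maximizing intensity.
Great Danceability:
A catchy, rhythmic groove that makes people want to dance; danceability has the largest positive coefficient in the model (+18.04).
Low Acousticness and Minimal Speech:
A produced, electronic sound rather than an overly acoustic or spoken-word style (coefficients of -12.40 and -11.06).
Flexible Tempo:
Tempo has no detectable effect in the model (p = 0.87), so a moderate, comfortable speed is a reasonable default rather than a requirement.
Familiar Structure:
A 4/4 time signature is the standard format (about 86% of songs), although the time-signature coefficients are not statistically significant. Minor mode is associated with slightly higher popularity than major (+1.66), and lower valence with higher popularity (-12.51), so a darker or moodier tone fits the data better than a bright, uplifting one.
Polished Production:
Clean, professional sound quality with a well-balanced mix. The dataset does not measure this directly, so treat it as general production advice.
Genre Consideration:
Genre is the strongest single factor in this report (see the ANOVA in Section 5). Per-genre averages are not shown here, so which genres rank highest should be confirmed before targeting one.
Additional Touches:
Instrumentalness and liveness were not included in the regression model, so their effect on popularity is untested; subtle variations can add uniqueness without straying from the core formula.

In simple terms, the model favors a loud, danceable, heavily produced track in a familiar 4/4 structure. These are associations in the data rather than guarantees: the audio features explain only about 23% of the variation in popularity.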
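To make the profile concrete, the regression coefficients from Section 3.1 can be applied to a specific song. The sketch below scores one hypothetical profile; the feature values are illustrative assumptions, and the categorical terms are chosen as key C#, minor mode, and 4/4 time (the reference levels are key A, Major, and 0/4).

```r
# Coefficients copied from the regression output in Section 3.1.
intercept <- 60.8892197
num_coefs <- c(danceability = 18.0389947, energy = -9.7239038,
               loudness = 0.8624896, acousticness = -12.3995216,
               speechiness = -11.0615866, tempo = 0.0001819,
               valence = -12.5135840)
cat_coefs <- c(`keyC#` = 1.9679865, modeMinor = 1.6577619,
               `time_signature4/4` = -5.2815154)

# Hypothetical song profile (illustrative values, not taken from the report).
profile <- c(danceability = 0.8, energy = 0.6, loudness = -5,
             acousticness = 0.05, speechiness = 0.05, tempo = 120,
             valence = 0.4)

# Linear prediction: intercept + numeric terms + dummy-variable terms.
predicted_popularity <- intercept +
  sum(num_coefs * profile[names(num_coefs)]) +
  sum(cat_coefs)
print(round(predicted_popularity, 1))
```

The prediction comes out at roughly 57, well above the dataset mean of 41.13 but far from 100, which reflects how much of popularity the audio features leave unexplained.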

9 8. Reproducibility

To ensure that others can repeat this analysis exactly, we record our session information below.

sessionInfo()

## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggcorrplot_0.1.4.1   shiny_1.10.0         car_3.1-3           
##  [4] carData_3.0-5        factoextra_1.0.7     randomForest_4.7-1.2
##  [7] cluster_2.1.6        caret_7.0-1          lattice_0.22-6      
## [10] DT_0.33              plotly_4.10.4        lubridate_1.9.4     
## [13] forcats_1.0.0        stringr_1.5.1        dplyr_1.1.4         
## [16] purrr_1.0.2          readr_2.1.5          tidyr_1.3.1         
## [19] tibble_3.2.1         ggplot2_3.5.1        tidyverse_2.0.0     
## 
## loaded via a namespace (and not attached):
##  [1] pROC_1.18.5          rlang_1.1.5          magrittr_2.0.3      
##  [4] compiler_4.4.2       vctrs_0.6.5          reshape2_1.4.4      
##  [7] pkgconfig_2.0.3      crayon_1.5.3         fastmap_1.2.0       
## [10] backports_1.5.0      labeling_0.4.3       promises_1.3.2      
## [13] rmarkdown_2.29       prodlim_2024.06.25   tzdb_0.4.0          
## [16] bit_4.5.0.1          xfun_0.50            cachem_1.1.0        
## [19] jsonlite_1.8.9       recipes_1.1.0        later_1.4.1         
## [22] broom_1.0.7          parallel_4.4.2       R6_2.5.1            
## [25] bslib_0.9.0          stringi_1.8.4        parallelly_1.42.0   
## [28] rpart_4.1.23         jquerylib_0.1.4      Rcpp_1.0.14         
## [31] iterators_1.0.14     knitr_1.49           future.apply_1.11.3 
## [34] httpuv_1.6.15        Matrix_1.7-1         splines_4.4.2       
## [37] nnet_7.3-19          timechange_0.3.0     tidyselect_1.2.1    
## [40] rstudioapi_0.17.1    abind_1.4-8          yaml_2.3.10         
## [43] timeDate_4041.110    codetools_0.2-20     listenv_0.9.1       
## [46] plyr_1.8.9           withr_3.0.2          evaluate_1.0.3      
## [49] future_1.34.0        survival_3.7-0       ggpubr_0.6.0        
## [52] pillar_1.10.1        foreach_1.5.2        stats4_4.4.2        
## [55] generics_0.1.3       vroom_1.6.5          hms_1.1.3           
## [58] munsell_0.5.1        scales_1.3.0         globals_0.16.3      
## [61] xtable_1.8-4         class_7.3-22         glue_1.8.0          
## [64] lazyeval_0.2.2       tools_4.4.2          data.table_1.16.4   
## [67] ggsignif_0.6.4       ModelMetrics_1.2.2.2 gower_1.0.2         
## [70] grid_4.4.2           ipred_0.9-15         colorspace_2.1-1    
## [73] nlme_3.1-166         Formula_1.2-5        cli_3.6.3           
## [76] viridisLite_0.4.2    lava_1.8.1           gtable_0.3.6        
## [79] rstatix_0.7.2        sass_0.4.9           digest_0.6.37       
## [82] ggrepel_0.9.6        farver_2.1.2         htmlwidgets_1.6.4   
## [85] htmltools_0.5.8.1    lifecycle_1.0.4      hardhat_1.4.1       
## [88] httr_1.4.7           mime_0.12            bit64_4.6.0-1       
## [91] MASS_7.3-61

Explanation:
Recording session information ensures that others can recreate the same environment and replicate our analysis, providing transparency and reproducibility.

10 Final Summary

This report provides a comprehensive look at what makes a song popular on Spotify, drawing on the 18 columns of the dataset, from musical attributes to metadata like genre and key. We have:
- Cleaned and prepared the data,
- Explored relationships using clearly labeled charts,
- Built statistical models (both linear regression and random forest) with performance metrics (R-squared, MAE, RMSE),
- Performed statistical testing (ANOVA by genre),
- Grouped songs into natural clusters,
- And described the characteristics of the “perfect song” based on our findings.

Every step is explained in plain language so that anyone—even without a technical background—can understand how various factors contribute to a song’s success on Spotify and how these insights can guide future music production.

What Makes a Hit? A Comprehensive Analysis of Spotify Song Popularity and Recommendations for Improvement

Matt Elkins

2025-02-06