In this report, we analyze all 18 variables available in the Spotify dataset, including both musical attributes (danceability, energy, loudness, etc.) and metadata (genre, artist, track name, key, mode, time signature). Our goals are to:
Every step is explained in plain language with detailed labels for charts, legends, and performance metrics such as R-squared, MAE, and RMSE.
glimpse(spotify_data)
## Rows: 232,725
## Columns: 18
## $ genre <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie",…
## $ artist_name <chr> "Henri Salvador", "Martin & les fées", "Joseph Willia…
## $ track_name <chr> "C'est beau de faire un Show", "Perdu d'avance (par G…
## $ track_id <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "…
## $ popularity <dbl> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
## $ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,…
## $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
## $ duration_ms <dbl> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
## $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
## $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123…
## $ key <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G…
## $ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105…
## $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
## $ mode <chr> "Major", "Minor", "Minor", "Major", "Major", "Major",…
## $ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953…
## $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
## $ time_signature <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/4…
## $ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533…
Explanation: This overview shows the column names, data types, and sample values of the dataset.
spotify_clean <- spotify_data %>%
  drop_na(popularity, danceability, energy, loudness, acousticness,
          speechiness, tempo, valence, mode, key, time_signature) %>%
  mutate(
    mode = factor(mode),
    key = factor(key),
    time_signature = factor(time_signature),
    genre = factor(genre)
  )
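Explanation: drop_na() removes rows with missing values in the modeling columns, and converting genre, key, mode, and time_signature to factors lets R treat them as categories rather than free text. No rows were dropped here: the cleaned data still has 232,725 rows.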
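Counting missing values per column before filtering makes the effect of the cleaning step visible. The following is a minimal sketch in base R on a small invented data frame (`toy` and its values are illustrative and not taken from the Spotify data); `complete.cases()` is used here as a base-R stand-in for `drop_na()`.

```r
# Toy data frame standing in for spotify_data (values are invented).
toy <- data.frame(
  popularity   = c(10, NA, 35, 50),
  danceability = c(0.4, 0.6, NA, 0.7),
  mode         = c("Major", "Minor", "Major", NA)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))

# Keep only rows with no missing values (base-R analogue of drop_na()).
toy_clean <- toy[complete.cases(toy), ]
rows_dropped <- nrow(toy) - nrow(toy_clean)

na_per_column
rows_dropped
```

On the real data, comparing nrow(spotify_data) with nrow(spotify_clean) gives the same kind of check.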
summary(spotify_clean)
## genre artist_name track_name track_id
## Comedy : 9681 Length:232725 Length:232725 Length:232725
## Soundtrack: 9646 Class :character Class :character Class :character
## Indie : 9543 Mode :character Mode :character Mode :character
## Jazz : 9441
## Pop : 9386
## Electronic: 9377
## (Other) :175651
## popularity acousticness danceability duration_ms
## Min. : 0.00 Min. :0.0000 Min. :0.0569 Min. : 15387
## 1st Qu.: 29.00 1st Qu.:0.0376 1st Qu.:0.4350 1st Qu.: 182857
## Median : 43.00 Median :0.2320 Median :0.5710 Median : 220427
## Mean : 41.13 Mean :0.3686 Mean :0.5544 Mean : 235122
## 3rd Qu.: 55.00 3rd Qu.:0.7220 3rd Qu.:0.6920 3rd Qu.: 265768
## Max. :100.00 Max. :0.9960 Max. :0.9890 Max. :5552917
##
## energy instrumentalness key liveness
## Min. :0.0000203 Min. :0.0000000 C :27583 Min. :0.00967
## 1st Qu.:0.3850000 1st Qu.:0.0000000 G :26390 1st Qu.:0.09740
## Median :0.6050000 Median :0.0000443 D :24077 Median :0.12800
## Mean :0.5709577 Mean :0.1483012 C# :23201 Mean :0.21501
## 3rd Qu.:0.7870000 3rd Qu.:0.0358000 A :22671 3rd Qu.:0.26400
## Max. :0.9990000 Max. :0.9990000 F :20279 Max. :1.00000
## (Other):88524
## loudness mode speechiness tempo
## Min. :-52.457 Major:151744 Min. :0.0222 Min. : 30.38
## 1st Qu.:-11.771 Minor: 80981 1st Qu.:0.0367 1st Qu.: 92.96
## Median : -7.762 Median :0.0501 Median :115.78
## Mean : -9.570 Mean :0.1208 Mean :117.67
## 3rd Qu.: -5.501 3rd Qu.:0.1050 3rd Qu.:139.05
## Max. : 3.744 Max. :0.9670 Max. :242.90
##
## time_signature valence
## 0/4: 8 Min. :0.0000
## 1/4: 2608 1st Qu.:0.2370
## 3/4: 24111 Median :0.4440
## 4/4:200760 Mean :0.4549
## 5/4: 5238 3rd Qu.:0.6600
## Max. :1.0000
##
Explanation: This summary shows overall statistics (min, max, mean, etc.) for each variable.
numeric_vars <- spotify_clean %>%
  select(popularity, danceability, energy, loudness, acousticness, speechiness,
         tempo, valence, duration_ms, instrumentalness, liveness)
cor_matrix_all <- cor(numeric_vars)
# scale = "none" keeps the colors tied to the actual correlation values;
# symm = TRUE treats the matrix as symmetric so rows and columns share one ordering.
heatmap(cor_matrix_all,
        main = "Heatmap: Correlation Among Musical Attributes & Popularity\n(Colors indicate strength and direction)",
        col = heat.colors(10),
        symm = TRUE,
        scale = "none")
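Explanation: The heatmap uses color intensity to show the strength of the pairwise correlations between features. Rows and columns are reordered by hierarchical clustering, so features with similar correlation patterns sit next to each other.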
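A heatmap gives a visual impression, but a sorted list of correlations with popularity can be easier to read. The sketch below shows one way to extract and rank that row of a correlation matrix. It runs on simulated data (`sim` is hypothetical; only the column names mirror the report), so the numbers it prints say nothing about the real dataset.

```r
# Simulated stand-in for numeric_vars (values are random, not Spotify data).
set.seed(1)
n <- 200
sim <- data.frame(
  danceability = runif(n),
  energy       = runif(n),
  valence      = runif(n)
)
sim$popularity <- 40 + 15 * sim$danceability - 8 * sim$valence + rnorm(n, sd = 5)

cor_sim <- cor(sim)

# Correlation of each feature with popularity, strongest (by absolute value) first.
with_pop <- cor_sim["popularity", colnames(cor_sim) != "popularity"]
ranked <- with_pop[order(abs(with_pop), decreasing = TRUE)]
round(ranked, 2)
```

Applied to cor_matrix_all, the same two lines would rank the ten musical attributes by how strongly each correlates with popularity.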
ggplot(spotify_clean, aes(x = genre)) +
  geom_bar(fill = "purple", alpha = 0.7) +
  labs(title = "Distribution of Genres in the Dataset",
       x = "Genre (e.g., Pop, Rock, Hip-Hop)",
       y = "Number of Songs") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Explanation: This bar chart shows how many songs fall in each genre. The largest genres (Comedy, Soundtrack, Indie, Jazz, Pop, Electronic) each contain roughly 9,400 to 9,700 songs.
ggplot(spotify_clean, aes(x = key)) +
  geom_bar(fill = "orange", alpha = 0.7) +
  labs(title = "Distribution of Musical Keys",
       x = "Key (e.g., C, D, E)",
       y = "Number of Songs") +
  theme_minimal()
Explanation: This chart shows how often each musical key occurs. C (27,583 songs) and G (26,390) are the most common.
ggplot(spotify_clean, aes(x = mode)) +
  geom_bar(fill = "green", alpha = 0.7) +
  labs(title = "Distribution of Song Modes",
       x = "Mode (Major vs. Minor)",
       y = "Number of Songs") +
  theme_minimal()
Explanation: This chart compares the prevalence of Major and Minor modes. Major is nearly twice as common (151,744 songs versus 80,981).
ggplot(spotify_clean, aes(x = time_signature)) +
  geom_bar(fill = "blue", alpha = 0.7) +
  labs(title = "Time Signatures in Songs",
       x = "Time Signature (e.g., 4/4)",
       y = "Number of Songs") +
  theme_minimal()
Explanation: This chart shows which rhythmic meters occur in the data. 4/4 dominates, and 0/4 appears only 8 times.
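Bar charts show the shape of a distribution, but percentages make the imbalance easier to quote. As a small worked example, the sketch below converts the mode and time-signature counts printed in the summary() output into shares of all songs. The counts are copied from that output; the variable names are introduced here for illustration.

```r
# Counts copied from the summary(spotify_clean) output.
time_sig_counts <- c("0/4" = 8, "1/4" = 2608, "3/4" = 24111,
                     "4/4" = 200760, "5/4" = 5238)
mode_counts <- c(Major = 151744, Minor = 80981)

# Convert counts to percentages of all songs, rounded to one decimal.
time_sig_share <- round(100 * time_sig_counts / sum(time_sig_counts), 1)
mode_share     <- round(100 * mode_counts / sum(mode_counts), 1)

time_sig_share
mode_share
```

About 86% of tracks are in 4/4 and about 65% are in a Major mode, so the rare levels (especially 0/4) contribute very little information to any model that uses these variables.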
lm_model_full <- lm(popularity ~ danceability + energy + loudness + acousticness +
                      speechiness + tempo + valence + key + mode + time_signature,
                    data = spotify_clean)
summary(lm_model_full)
##
## Call:
## lm(formula = popularity ~ danceability + energy + loudness +
## acousticness + speechiness + tempo + valence + key + mode +
## time_signature, data = spotify_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.299 -10.427 1.334 11.159 59.897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.8892197 5.6593938 10.759 < 0.0000000000000002 ***
## danceability 18.0389947 0.2451583 73.581 < 0.0000000000000002 ***
## energy -9.7239038 0.2737627 -35.519 < 0.0000000000000002 ***
## loudness 0.8624896 0.0105662 81.627 < 0.0000000000000002 ***
## acousticness -12.3995216 0.1551569 -79.916 < 0.0000000000000002 ***
## speechiness -11.0615866 0.2111855 -52.379 < 0.0000000000000002 ***
## tempo 0.0001819 0.0011219 0.162 0.87117
## valence -12.5135840 0.1654022 -75.655 < 0.0000000000000002 ***
## keyA# 0.3079245 0.1667519 1.847 0.06481 .
## keyB 1.1113265 0.1608946 6.907 0.00000000000496 ***
## keyC 0.1584206 0.1439133 1.101 0.27098
## keyC# 1.9679865 0.1500904 13.112 < 0.0000000000000002 ***
## keyD 0.1063631 0.1483129 0.717 0.47328
## keyD# 0.6763823 0.2127035 3.180 0.00147 **
## keyE 0.2963662 0.1612899 1.837 0.06614 .
## keyF 0.2826252 0.1545548 1.829 0.06745 .
## keyF# 1.9704873 0.1678619 11.739 < 0.0000000000000002 ***
## keyG -0.1951597 0.1450772 -1.345 0.17856
## keyG# 1.7755203 0.1680502 10.565 < 0.0000000000000002 ***
## modeMinor 1.6577619 0.0719483 23.041 < 0.0000000000000002 ***
## time_signature1/4 -8.4094669 5.6569357 -1.487 0.13713
## time_signature3/4 -7.6453690 5.6492302 -1.353 0.17595
## time_signature4/4 -5.2815154 5.6486920 -0.935 0.34979
## time_signature5/4 -6.0170861 5.6525871 -1.064 0.28711
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.97 on 232701 degrees of freedom
## Multiple R-squared: 0.2288, Adjusted R-squared: 0.2287
## F-statistic: 3002 on 23 and 232701 DF, p-value: < 0.00000000000000022
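Explanation:
This regression model predicts song popularity using both numeric and categorical features. Key performance metrics include:
- R-squared (R²): Proportion of variance explained (closer to 1 is better). Here R² is 0.2288, so the model explains about 23% of the variation in popularity.
- P-values: Indicate statistical significance (p < 0.05 typically means significant). With more than 230,000 rows, even small effects reach significance; tempo is the only numeric predictor that does not (p = 0.87).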
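To make the definition of R-squared concrete, the sketch below computes it by hand as one minus the ratio of unexplained to total variation, then compares the result with the value that summary() reports. It uses R's built-in mtcars data purely as an illustration; the model formula is arbitrary and unrelated to the Spotify analysis.

```r
# Illustration on R's built-in mtcars data, not the Spotify data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

sse <- sum(residuals(fit)^2)                    # unexplained variation
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation around the mean
r_squared <- 1 - sse / sst

# The hand-computed value agrees with the one reported by summary().
all.equal(r_squared, summary(fit)$r.squared)
```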
predicted <- predict(lm_model_full, spotify_clean)
actual <- spotify_clean$popularity
mae <- mean(abs(predicted - actual)) # Mean Absolute Error
rmse <- sqrt(mean((predicted - actual)^2)) # Root Mean Squared Error
mae # Lower MAE means better prediction accuracy
## [1] 12.776
rmse # Lower RMSE indicates fewer large errors
## [1] 15.97409
Explanation:
MAE and RMSE measure the model’s prediction errors; lower values
indicate better performance.
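A tiny worked example may help show how the two metrics differ. The helper names `mae_fn` and `rmse_fn` and the three-value vectors are made up for illustration.

```r
# Hypothetical helpers; same formulas as the code above.
mae_fn  <- function(y, y_hat) mean(abs(y_hat - y))
rmse_fn <- function(y, y_hat) sqrt(mean((y_hat - y)^2))

y     <- c(10, 20, 30)   # actual values (invented)
y_hat <- c(12, 18, 33)   # predictions; errors are 2, -2, 3

mae_val  <- mae_fn(y, y_hat)    # (2 + 2 + 3) / 3
rmse_val <- rmse_fn(y, y_hat)   # sqrt((4 + 4 + 9) / 3)

c(MAE = mae_val, RMSE = rmse_val)
```

Because RMSE squares each error before averaging, large misses weigh more heavily, and RMSE is never smaller than MAE. The regression results above follow the same pattern (RMSE 15.97 versus MAE 12.78).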
To improve performance and capture nonlinear relationships, we build a random forest model using a sample of the data to speed up computation.
set.seed(42)
# Downsample to 5000 rows (or adjust as needed) to speed up the model
spotify_rf <- spotify_clean %>% sample_n(min(nrow(spotify_clean), 5000))
train_control <- trainControl(method = "cv", number = 5)
rf_model <- train(popularity ~ danceability + energy + loudness + acousticness +
                    speechiness + tempo + valence + key + mode + time_signature,
                  data = spotify_rf,
                  method = "rf",
                  trControl = train_control,
                  ntree = 50) # Reduced number of trees for faster performance
rf_model
## Random Forest
##
## 5000 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 4001, 4001, 3999, 4001, 3998
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 15.76468 0.2803838 12.67905
## 12 15.39906 0.3024494 12.19876
## 23 15.47550 0.2967762 12.28065
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 12.
Explanation:
We use a random forest model with 5-fold cross-validation and reduce
the number of trees to 50 to shorten training time. The model averages
multiple decision trees to capture complex, nonlinear relationships.
Cross-validation selected mtry = 12, with RMSE 15.40, R-squared 0.30,
and MAE 12.20.
rf_pred <- predict(rf_model, spotify_rf)
rf_mae <- mean(abs(rf_pred - spotify_rf$popularity))
rf_rmse <- sqrt(mean((rf_pred - spotify_rf$popularity)^2))
rf_mae
## [1] 5.06033
rf_rmse
## [1] 6.550461
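Explanation:
We calculate MAE and RMSE for the random forest model on the same 5,000
songs it was trained on. These in-sample errors (MAE 5.06, RMSE 6.55)
are optimistic; the cross-validated results above (MAE 12.20, RMSE
15.40) are the better guide to accuracy on new songs.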
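Errors computed on the rows used for fitting tend to understate how a model will do on new data, and flexible models such as random forests are especially prone to this. One common remedy is to hold out some rows before training and score the model only on those. The sketch below illustrates the pattern on R's built-in mtcars data, with lm() standing in for the random forest so that it runs with base R alone; the 25% split and all names are assumptions made for the example.

```r
# Illustration on built-in mtcars; lm() stands in for the random forest.
set.seed(42)
n <- nrow(mtcars)
test_idx <- sample(n, size = round(0.25 * n))   # hold out 25% of the rows
train <- mtcars[-test_idx, ]
test  <- mtcars[test_idx, ]

# Fit on the training rows only.
fit <- lm(mpg ~ wt + hp, data = train)

rmse_fn <- function(y, y_hat) sqrt(mean((y_hat - y)^2))
train_rmse <- rmse_fn(train$mpg, predict(fit, train))   # in-sample error
test_rmse  <- rmse_fn(test$mpg,  predict(fit, test))    # out-of-sample error

c(train = train_rmse, test = test_rmse)
```

In the report's setting, the analogous step would be to score rf_model on rows of spotify_clean that were not sampled into spotify_rf.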
par(mfrow = c(2, 2))
plot(lm_model_full)
par(mfrow = c(1, 1))
Explanation:
Diagnostic plots help verify that the residuals (errors) of our
regression model are random, normally distributed, and have consistent
spread.
vif_full <- vif(lm_model_full)
vif_full
## GVIF Df GVIF^(1/(2*Df))
## danceability 1.888216 1 1.374124
## energy 4.743810 1 2.178029
## loudness 3.663065 1 1.913914
## acousticness 2.763090 1 1.662254
## speechiness 1.399800 1 1.183131
## tempo 1.095965 1 1.046883
## valence 1.687377 1 1.298991
## key 1.110300 11 1.004767
## mode 1.071064 1 1.034922
## time_signature 1.245960 4 1.027870
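Explanation:
VIF values check whether predictors overlap too much with one another.
For terms with more than one degree of freedom (key, time_signature),
the last column, GVIF^(1/(2*Df)), is the figure to compare across terms.
All GVIF values are below the common cutoff of 5, though energy (4.74)
and loudness (3.66) are the highest, which suggests these two overlap
with other predictors more than the rest do.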
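For a single numeric predictor, the VIF is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes this by hand on R's built-in mtcars data to show what the number measures; the three predictors are chosen arbitrarily, and `vif_by_hand` is a name introduced for the example.

```r
# Illustration on built-in mtcars; predictors chosen arbitrarily.
predictors <- c("wt", "hp", "disp")

vif_by_hand <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # Regress this predictor on the remaining ones.
  aux_fit <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux_fit)$r.squared)
})

round(vif_by_hand, 2)
```

A value of 1 means the predictor is uncorrelated with the others; larger values mean more of its variation is already explained by them.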
spotify_scaled_all <- scale(spotify_clean %>%
  select(danceability, energy, loudness, acousticness, speechiness,
         tempo, valence, duration_ms, instrumentalness, liveness))
set.seed(42)
kmeans_result_all <- kmeans(spotify_scaled_all, centers = 4, nstart = 10)
spotify_clean$cluster_all <- as.factor(kmeans_result_all$cluster)
Explanation:
Standardizing ensures that all features are on the same scale. K-means
clustering groups songs into 4 clusters based on their musical
attributes.
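The choice of 4 clusters is not justified in the text. One common heuristic for picking the number of clusters is the elbow method: run k-means for several values of k and look for the point where the total within-cluster sum of squares stops dropping sharply. The sketch below shows the idea on the numeric columns of R's built-in iris data, not on the Spotify features; the range 1 to 6 is an arbitrary choice for the example.

```r
# Illustration on the numeric columns of built-in iris, not the Spotify data.
x <- scale(iris[, 1:4])

set.seed(42)
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

# A bend ("elbow") in these values suggests a reasonable number of clusters.
data.frame(k = k_values, total_within_ss = round(wss, 1))
```

Plotting wss against k_values with type = "b" makes the bend easier to spot.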
fviz_cluster(kmeans_result_all, data = spotify_scaled_all, geom = "point",
             ellipse.type = "norm", ggtheme = theme_minimal(),
             main = "Clusters of Songs Based on All Musical Features")
Explanation:
- Points and Colors: Each point represents a song; its color (as indicated in the automatically generated legend) shows which cluster it belongs to (e.g., Cluster 1, Cluster 2, etc.).
- Cluster Interpretation:
  - Cluster 1: May consist of songs with very high energy, loudness, and danceability (likely upbeat, danceable tracks).
  - Cluster 2: Might include songs with lower energy and more acoustic qualities (possibly softer or ballad-like tracks).
  - Cluster 3: Could represent a balanced mix of features (classic mainstream tracks).
  - Cluster 4: May capture songs with unique characteristics (such as higher speechiness or experimental structures).
- Ellipses: The ellipses indicate the overall spread or boundary of the songs in each cluster.
- Legend: The automatically generated legend maps each color to a specific cluster number.
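The cluster descriptions above are tentative, and one way to check them is to compute the average of each feature within each cluster. The sketch below shows that pattern on R's built-in iris data; its column names are not Spotify features, and `profile` is a name introduced for the example.

```r
# Illustration on built-in iris; the columns are not Spotify features.
x <- scale(iris[, 1:4])
set.seed(42)
km <- kmeans(x, centers = 3, nstart = 10)

# Mean of each (unscaled) feature within each cluster.
profile <- aggregate(iris[, 1:4], by = list(cluster = km$cluster), FUN = mean)
profile

# Cluster sizes, to see whether any cluster is very small.
cluster_sizes <- table(km$cluster)
cluster_sizes
```

For the report's data, the equivalent would be to aggregate the ten musical attributes in spotify_clean by cluster_all and compare the resulting means with the descriptions in the list.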
We perform an ANOVA to test if differences in popularity exist between genres.
anova_result <- aov(popularity ~ genre, data = spotify_clean)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## genre 26 55432620 2132024 23001 <0.0000000000000002 ***
## Residuals 232698 21569746 93
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
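Explanation:
ANOVA (Analysis of Variance) tests whether the mean popularity differs
significantly across genres. Here the F statistic is 23,001 on 26 and
232,698 degrees of freedom (p < 2e-16), so mean popularity differs
significantly across the 27 genres.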
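A p-value this small says the genre means differ, but not by how much. A rough effect size, eta squared, can be read off the ANOVA table as the genre sum of squares divided by the total sum of squares. The sketch below does that arithmetic with the two sums of squares copied from the table above; the variable names are introduced here for illustration.

```r
# Sums of squares copied from the ANOVA table above.
ss_genre     <- 55432620
ss_residuals <- 21569746

# Eta squared: share of total variation in popularity associated with genre.
eta_squared <- ss_genre / (ss_genre + ss_residuals)
round(eta_squared, 3)
```

The result is about 0.72, so genre alone accounts for roughly 72% of the variation in popularity, far more than the 23% explained by the linear regression, which did not include genre.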
Based on our enhanced analysis, here are some recommendations for further improvement:
Based on our comprehensive analysis, the ideal (“perfect”) song for mainstream popularity on Spotify would have these characteristics:
In simple terms, the perfect song combines a high-energy, danceable beat with a familiar, accessible structure, all wrapped in polished production. This formula is likely to resonate with a broad audience on Spotify.
To ensure that others can repeat this analysis exactly, we record our session information below.
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggcorrplot_0.1.4.1 shiny_1.10.0 car_3.1-3
## [4] carData_3.0-5 factoextra_1.0.7 randomForest_4.7-1.2
## [7] cluster_2.1.6 caret_7.0-1 lattice_0.22-6
## [10] DT_0.33 plotly_4.10.4 lubridate_1.9.4
## [13] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
## [16] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1
## [19] tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] pROC_1.18.5 rlang_1.1.5 magrittr_2.0.3
## [4] compiler_4.4.2 vctrs_0.6.5 reshape2_1.4.4
## [7] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0
## [10] backports_1.5.0 labeling_0.4.3 promises_1.3.2
## [13] rmarkdown_2.29 prodlim_2024.06.25 tzdb_0.4.0
## [16] bit_4.5.0.1 xfun_0.50 cachem_1.1.0
## [19] jsonlite_1.8.9 recipes_1.1.0 later_1.4.1
## [22] broom_1.0.7 parallel_4.4.2 R6_2.5.1
## [25] bslib_0.9.0 stringi_1.8.4 parallelly_1.42.0
## [28] rpart_4.1.23 jquerylib_0.1.4 Rcpp_1.0.14
## [31] iterators_1.0.14 knitr_1.49 future.apply_1.11.3
## [34] httpuv_1.6.15 Matrix_1.7-1 splines_4.4.2
## [37] nnet_7.3-19 timechange_0.3.0 tidyselect_1.2.1
## [40] rstudioapi_0.17.1 abind_1.4-8 yaml_2.3.10
## [43] timeDate_4041.110 codetools_0.2-20 listenv_0.9.1
## [46] plyr_1.8.9 withr_3.0.2 evaluate_1.0.3
## [49] future_1.34.0 survival_3.7-0 ggpubr_0.6.0
## [52] pillar_1.10.1 foreach_1.5.2 stats4_4.4.2
## [55] generics_0.1.3 vroom_1.6.5 hms_1.1.3
## [58] munsell_0.5.1 scales_1.3.0 globals_0.16.3
## [61] xtable_1.8-4 class_7.3-22 glue_1.8.0
## [64] lazyeval_0.2.2 tools_4.4.2 data.table_1.16.4
## [67] ggsignif_0.6.4 ModelMetrics_1.2.2.2 gower_1.0.2
## [70] grid_4.4.2 ipred_0.9-15 colorspace_2.1-1
## [73] nlme_3.1-166 Formula_1.2-5 cli_3.6.3
## [76] viridisLite_0.4.2 lava_1.8.1 gtable_0.3.6
## [79] rstatix_0.7.2 sass_0.4.9 digest_0.6.37
## [82] ggrepel_0.9.6 farver_2.1.2 htmlwidgets_1.6.4
## [85] htmltools_0.5.8.1 lifecycle_1.0.4 hardhat_1.4.1
## [88] httr_1.4.7 mime_0.12 bit64_4.6.0-1
## [91] MASS_7.3-61
Explanation:
Recording session information ensures that others can recreate the same
environment and replicate our analysis, providing transparency and
reproducibility.
This report provides a comprehensive look at what makes a song popular on Spotify by considering all 18 available variables, from musical attributes to metadata like genre and key. We have:
- Cleaned and prepared the data,
- Explored relationships using clearly labeled charts,
- Engineered additional features,
- Built statistical models (both linear regression and random forest) with performance metrics (R-squared, MAE, RMSE),
- Performed statistical testing (ANOVA by genre),
- Grouped songs into natural clusters,
- And described the characteristics of the “perfect song” based on our findings.
Every step is explained in plain language so that anyone—even without a technical background—can understand how various factors contribute to a song’s success on Spotify and how these insights can guide future music production.