library(tidyverse)
library(janitor)
library(gt)
library(caret) # train/test split, confusion matrix
library(randomForest) # Random Forest
library(pROC) # AUC / ROC curve
library(broom) # tidy model output06 Modeling
Executive Summary
Objective: To identify which audio features best explain the characteristics of a Billboard Hit using two classification models: Logistic Regression and Random Forest.
Research Question: Which audio features are most strongly associated with a song’s chart performance?
Methodology:
Logistic Regression: Used to interpret the direction and magnitude of the association for each feature.
Random Forest: Used to validate feature importance and capture potential non-linear relationships.
Note:
This research employs Explanatory Analysis rather than predictive modeling. The goal is to identify historical features associated with hits, not to build a system for predicting future songs.
The model is trained on pooled data across six decades. Given that musical trends evolve over time, feature importance may vary by era. Future research could explore decade-specific models to capture temporal variation.
Setup
Build Modeling Dataset
# -- Modeling directly with D2 raw data ----------------
# D2 contains complete hit(1) and flop(0) records
# It is already a 50/50 balanced dataset, no resampling needed
D2_model <- readRDS("../Data/wrangled_D2.rds") %>%
# Convert target to factor for classification models
mutate(target = as.factor(target)) %>%
# Keep only the features required for modeling
select(
target, # Target Variable: 1 = Hit, 0 = Flop
danceability, # Danceability
energy, # Energy
loudness, # Loudness
speechiness, # Speechiness
acousticness, # Acousticness
instrumentalness, # Instrumentalness
liveness, # Liveness
valence, # Valence
tempo, # Tempo
duration_ms # Song Duration
) %>%
drop_na()
cat("Total Records: ", nrow(D2_model), "\n")Total Records: 41106
cat("Hit (1) Count: ", sum(D2_model$target == 1), "\n")Hit (1) Count: 20553
cat("Flop (0) Count: ", sum(D2_model$target == 0), "\n")Flop (0) Count: 20553
cat("Hit Percentage: ",
round(sum(D2_model$target == 1) / nrow(D2_model) * 100, 2), "%\n")Hit Percentage: 50 %
Train / Test Split
# 80/20 Random Split
set.seed(42)
train_index <- createDataPartition(
D2_model$target,
p = 0.8,
list = FALSE
)
train_data <- D2_model[train_index, ] # 80% Training Set
test_data <- D2_model[-train_index, ] # 20% Testing Set
cat("Train set size: ", nrow(train_data), "\n")Train set size: 32886
cat("Test set size: ", nrow(test_data), "\n\n")Test set size: 8220
# Verify hit/flop balance in both subsets
cat("Train set Hit ratio: ",
round(sum(train_data$target == 1) / nrow(train_data) * 100, 2), "%\n")Train set Hit ratio: 50 %
cat("Test set Hit ratio: ",
round(sum(test_data$target == 1) / nrow(test_data) * 100, 2), "%\n")Test set Hit ratio: 50 %
Part 1: Logistic Regression
1.1 Model Construction
# Build Logistic Regression Model
# Use all audio features to predict target (hit/flop)
logistic_model <- glm(
target ~ danceability + energy + loudness +
speechiness + acousticness + instrumentalness +
liveness + valence + tempo + duration_ms,
data = train_data,
family = binomial
)1.2 Model Summary
# Coefficient Summary
tidy(logistic_model) %>%
mutate(
significant = ifelse(p.value < 0.05, "Yes", "No"),
direction = ifelse(estimate > 0, "↑ Positive", "↓ Negative")
) %>%
arrange(p.value) %>%
gt() %>%
tab_header(
title = "Logistic Regression Coefficients",
subtitle = "Sorted by significance (p-value)"
) %>%
fmt_number(
columns = c(estimate, std.error, statistic, p.value),
decimals = 4
) %>%
cols_label(
term = "Feature",
estimate = "Coefficient",
std.error = "Std Error",
statistic = "Z-value",
p.value = "P-value",
significant = "Significant?",
direction = "Direction"
)| Logistic Regression Coefficients | ||||||
| Sorted by significance (p-value) | ||||||
| Feature | Coefficient | Std Error | Z-value | P-value | Significant? | Direction |
|---|---|---|---|---|---|---|
| instrumentalness | −3.4265 | 0.0755 | −45.4098 | 0.0000 | Yes | ↓ Negative |
| danceability | 3.3055 | 0.1023 | 32.2979 | 0.0000 | Yes | ↑ Positive |
| acousticness | −1.3480 | 0.0588 | −22.9141 | 0.0000 | Yes | ↓ Negative |
| loudness | 0.1054 | 0.0049 | 21.5045 | 0.0000 | Yes | ↑ Positive |
| speechiness | −3.3248 | 0.1752 | −18.9767 | 0.0000 | Yes | ↓ Negative |
| energy | −1.8304 | 0.1106 | −16.5421 | 0.0000 | Yes | ↓ Negative |
| (Intercept) | 1.2236 | 0.1379 | 8.8712 | 0.0000 | Yes | ↑ Positive |
| duration_ms | 0.0000 | 0.0000 | −6.9537 | 0.0000 | Yes | ↓ Negative |
| valence | 0.3576 | 0.0668 | 5.3527 | 0.0000 | Yes | ↑ Positive |
| tempo | 0.0020 | 0.0005 | 4.2360 | 0.0000 | Yes | ↑ Positive |
| liveness | −0.1709 | 0.0771 | −2.2161 | 0.0267 | Yes | ↓ Negative |
1.3 Feature Importance Visualization
# Coefficient Plot: Visualizing Direction and Magnitude
# Highlights significant features (p < 0.05)
tidy(logistic_model) %>%
filter(term != "(Intercept)") %>%
mutate(
term = fct_reorder(term, estimate),
direction = ifelse(estimate > 0, "Positive", "Negative"),
significant = p.value < 0.05
) %>%
ggplot(aes(x = term, y = estimate, fill = direction)) +
geom_col(aes(alpha = significant)) +
geom_errorbar(
aes(ymin = estimate - std.error,
ymax = estimate + std.error),
width = 0.3
) +
scale_fill_manual(values = c(
"Positive" = "#27AE60",
"Negative" = "#E74C3C"
)) +
scale_alpha_manual(
values = c("TRUE" = 1, "FALSE" = 0.3),
guide = "none"
) +
geom_hline(yintercept = 0, linetype = "dashed",
color = "gray50") +
coord_flip() +
labs(
title = "Logistic Regression: Feature Coefficients",
subtitle = "Green = positive association with Hit | Red = negative | Transparent = not significant",
x = "Audio Feature",
y = "Coefficient",
fill = "Direction",
caption = "Source: Spotify Hit Predictor Dataset"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 9, color = "gray40"),
legend.position = "bottom"
)1.4 Model Evaluation
# -- Evaluate performance on Test Set ----------------
# Predicted probabilities
pred_prob <- predict(logistic_model,
newdata = test_data,
type = "response")
# Convert probability to class (threshold = 0.5)
pred_class <- ifelse(pred_prob > 0.5, 1, 0) %>%
as.factor()
# Confusion Matrix
cm <- confusionMatrix(pred_class, test_data$target,
positive = "1")
print(cm)Confusion Matrix and Statistics
Reference
Prediction 0 1
0 2636 803
1 1474 3307
Accuracy : 0.723
95% CI : (0.7132, 0.7326)
No Information Rate : 0.5
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.446
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.8046
Specificity : 0.6414
Pos Pred Value : 0.6917
Neg Pred Value : 0.7665
Prevalence : 0.5000
Detection Rate : 0.4023
Detection Prevalence : 0.5816
Balanced Accuracy : 0.7230
'Positive' Class : 1
# AUC / ROC Curve
roc_obj <- roc(as.numeric(test_data$target),
as.numeric(pred_prob))
cat("\n========================================\n")
========================================
cat("Logistic Regression Evaluation\n")Logistic Regression Evaluation
cat("========================================\n\n")========================================
cat("Accuracy:", round(cm$overall["Accuracy"], 4), "\n")Accuracy: 0.723
cat("AUC: ", round(auc(roc_obj), 4), "\n")AUC: 0.8032
# -- ROC Curve ------------------------------------
# AUC closer to 1 indicates better performance
# 0.5 = random guessing
plot(roc_obj,
main = "ROC Curve - Logistic Regression",
col = "#E74C3C",
lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "gray50")
legend("bottomright",
legend = paste("AUC =", round(auc(roc_obj), 4)),
col = "#E74C3C",
lwd = 2)Note Audio features alone can predict chart success with 72.3% accuracy and an AUC of 0.80, suggesting that the sonic characteristics of a song are meaningfully associated with its commercial appeal — though not deterministic, as approximately 28% of cases remain unexplained by audio features alone.
Part 2: Random Forest
2.1 Model Building
# Build Random Forest Model
# ntree = 500: Build 500 decision trees
# importance = TRUE: Calculate importance for each feature
# set.seed ensures reproducibility
set.seed(42)
rf_model <- randomForest(
target ~ danceability + energy + loudness +
speechiness + acousticness + instrumentalness +
liveness + valence + tempo + duration_ms,
data = train_data,
ntree = 500,
importance = TRUE
)
print(rf_model)
Call:
randomForest(formula = target ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = train_data, ntree = 500, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 21.71%
Confusion matrix:
0 1 class.error
0 12006 4437 0.2698413
1 2703 13740 0.1643861
2.2 Feature Importance
# MeanDecreaseAccuracy: How much accuracy decreases when the feature is removed
# MeanDecreaseGini: Contribution of the feature to node purity
# Higher values indicate more important features
importance_df <- importance(rf_model) %>%
as.data.frame() %>%
rownames_to_column("feature") %>%
arrange(desc(MeanDecreaseAccuracy))
cat("Feature Importance Ranking:\n\n")Feature Importance Ranking:
importance_df %>%
gt() %>%
tab_header(
title = "Random Forest Feature Importance",
subtitle = "Sorted by Mean Decrease Accuracy"
) %>%
fmt_number(
columns = c(MeanDecreaseAccuracy, MeanDecreaseGini),
decimals = 4
) %>%
cols_label(
feature = "Audio Feature",
MeanDecreaseAccuracy = "Mean Decrease Accuracy",
MeanDecreaseGini = "Mean Decrease Gini"
) %>%
data_color(
columns = MeanDecreaseAccuracy,
palette = "Greens"
)| Random Forest Feature Importance | ||||
| Sorted by Mean Decrease Accuracy | ||||
| Audio Feature | 0 | 1 | Mean Decrease Accuracy | Mean Decrease Gini |
|---|---|---|---|---|
| instrumentalness | 126.91536 | 278.78191 | 278.0437 | 3,091.2316 |
| acousticness | 44.90685 | 136.01724 | 155.9062 | 2,080.4282 |
| danceability | 61.63692 | 125.09639 | 138.1979 | 1,971.1687 |
| duration_ms | -10.06077 | 138.50291 | 117.3328 | 1,490.8381 |
| speechiness | 46.24600 | 93.26129 | 103.4313 | 1,428.5137 |
| energy | 36.20967 | 74.76022 | 97.5500 | 1,507.6373 |
| loudness | 16.82617 | 87.65934 | 92.0379 | 1,433.2766 |
| valence | 36.05523 | 77.72137 | 86.4586 | 1,304.0270 |
| tempo | 3.54342 | 64.18624 | 56.0789 | 1,100.2033 |
| liveness | 10.66290 | 31.77347 | 32.7011 | 1,023.0966 |
2.3 Feature Importance Visualization
# Feature Importance Bar Plot
# Sorted by MeanDecreaseAccuracy
# Longer bars represent features that are more important to the model
importance_df %>%
mutate(feature = fct_reorder(feature, MeanDecreaseAccuracy)) %>%
ggplot(aes(x = feature, y = MeanDecreaseAccuracy)) +
geom_col(fill = "#27AE60", alpha = 0.85) +
geom_text(aes(label = round(MeanDecreaseAccuracy, 3)),
hjust = -0.2, size = 3.5, fontface = "bold") +
coord_flip() +
labs(
title = "Random Forest: Feature Importance",
subtitle = "Mean Decrease in Accuracy when feature is removed",
x = "Audio Feature",
y = "Mean Decrease Accuracy",
caption = "Source: Spotify Hit Predictor Dataset"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 10, color = "gray40")
)2.4 Model Evaluation
# Evaluate model performance on Test Set
# Predict categories
rf_pred_class <- predict(rf_model,
newdata = test_data,
type = "class")
# Predict probabilities (for AUC calculation)
rf_pred_prob <- predict(rf_model,
newdata = test_data,
type = "prob")[, 2]
# Confusion Matrix
rf_cm <- confusionMatrix(rf_pred_class, test_data$target,
positive = "1")
print(rf_cm)Confusion Matrix and Statistics
Reference
Prediction 0 1
0 3010 644
1 1100 3466
Accuracy : 0.7878
95% CI : (0.7788, 0.7966)
No Information Rate : 0.5
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5757
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.8433
Specificity : 0.7324
Pos Pred Value : 0.7591
Neg Pred Value : 0.8238
Prevalence : 0.5000
Detection Rate : 0.4217
Detection Prevalence : 0.5555
Balanced Accuracy : 0.7878
'Positive' Class : 1
# AUC
rf_roc <- roc(as.numeric(test_data$target),
as.numeric(rf_pred_prob))
cat("\n========================================\n")
========================================
cat("Random Forest Model Evaluation\n")Random Forest Model Evaluation
cat("========================================\n\n")========================================
cat("Accuracy:", round(rf_cm$overall["Accuracy"], 4), "\n")Accuracy: 0.7878
cat("AUC: ", round(auc(rf_roc), 4), "\n")AUC: 0.8682
# ROC Curve
plot(rf_roc,
main = "ROC Curve - Random Forest",
col = "#27AE60",
lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "gray50")
legend("bottomright",
legend = paste("AUC =", round(auc(rf_roc), 4)),
col = "#27AE60",
lwd = 2)Note Random Forest achieved higher accuracy (78.8% vs 72.3%) and AUC (0.87 vs 0.80) compared to Logistic Regression, suggesting that the relationship between audio features and chart success contains non-linear patterns that Logistic Regression cannot fully capture.
Part 3: Model Comparison
3.1 Performance Comparison Table
# Create comparison table for both models
comparison_df <- tibble(
Model = c("Logistic Regression", "Random Forest"),
Accuracy = c(
round(cm$overall["Accuracy"], 4),
round(rf_cm$overall["Accuracy"], 4)
),
AUC = c(
round(auc(roc_obj), 4),
round(auc(rf_roc), 4)
),
Sensitivity = c(
round(cm$byClass["Sensitivity"], 4),
round(rf_cm$byClass["Sensitivity"], 4)
),
Specificity = c(
round(cm$byClass["Specificity"], 4),
round(rf_cm$byClass["Specificity"], 4)
)
)
comparison_df %>%
gt() %>%
tab_header(
title = "Model Performance Comparison",
subtitle = "Logistic Regression vs Random Forest"
) %>%
fmt_number(
columns = c(Accuracy, AUC, Sensitivity, Specificity),
decimals = 4
) %>%
data_color(
columns = c(Accuracy, AUC),
palette = "Greens"
) %>%
cols_label(
Model = "Model",
Accuracy = "Accuracy",
AUC = "AUC",
Sensitivity = "Sensitivity (Hit)",
Specificity = "Specificity (Flop)"
)| Model Performance Comparison | ||||
| Logistic Regression vs Random Forest | ||||
| Model | Accuracy | AUC | Sensitivity (Hit) | Specificity (Flop) |
|---|---|---|---|---|
| Logistic Regression | 0.7230 | 0.8032 | 0.8046 | 0.6414 |
| Random Forest | 0.7878 | 0.8682 | 0.8433 | 0.7324 |
3.2 ROC Curve Comparison
# Plot both ROC Curves on the same graph
# Facilitates visual comparison of classification power between models
plot(roc_obj,
main = "ROC Curve Comparison",
col = "#E74C3C",
lwd = 2)
plot(rf_roc,
add = TRUE,
col = "#27AE60",
lwd = 2)
abline(a = 0, b = 1,
lty = 2, col = "gray50")
legend(
"bottomright",
legend = c(
paste("Logistic Regression (AUC =",
round(auc(roc_obj), 4), ")"),
paste("Random Forest (AUC =",
round(auc(rf_roc), 4), ")")
),
col = c("#E74C3C", "#27AE60"),
lwd = 2
)3.3 Feature Importance Comparison
# ── Compare feature importance across both models ────
# Logistic Regression: uses absolute value of coefficients
# Random Forest: uses MeanDecreaseAccuracy
# Standardized to 0-1 for direct comparison
# Logistic Regression Feature Importance
lr_importance <- tidy(logistic_model) %>%
filter(term != "(Intercept)") %>%
mutate(
feature = term,
importance = abs(estimate),
# Scaled to 0-1
importance_scaled = (importance - min(importance)) /
(max(importance) - min(importance)),
model = "Logistic Regression"
) %>%
select(feature, importance_scaled, model)
# Random Forest Feature Importance
rf_importance <- importance_df %>%
mutate(
feature = feature,
importance = MeanDecreaseAccuracy,
# Scaled to 0-1
importance_scaled = (importance - min(importance)) /
(max(importance) - min(importance)),
model = "Random Forest"
) %>%
select(feature, importance_scaled, model)
# Combine importance data from both models
combined_importance <- bind_rows(lr_importance, rf_importance)
# Visualization
combined_importance %>%
ggplot(aes(x = reorder(feature, importance_scaled),
y = importance_scaled,
fill = model)) +
geom_col(position = "dodge", alpha = 0.85) +
scale_fill_manual(values = c(
"Logistic Regression" = "#E74C3C",
"Random Forest" = "#27AE60"
)) +
coord_flip() +
labs(
title = "Feature Importance Comparison",
subtitle = "Logistic Regression (scaled coefficients) vs Random Forest (Mean Decrease Accuracy)",
x = "Audio Feature",
y = "Scaled Importance (0–1)",
fill = "Model",
caption = "Source: Spotify Hit Predictor Dataset"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 9, color = "gray40"),
legend.position = "bottom"
)Note
The audio features most strongly associated with chart success are instrumentalness (negative), danceability (positive), speechiness (positive), and acousticness (negative). These findings suggest that a typical Billboard hit tends to feature vocals, danceable rhythms, spoken-word elements, and electronic production — consistent with the dominance of Hip-Hop and Pop in the modern era.
3.4 Conclusion
Research Question
Which audio features are most strongly associated with a song’s chart performance?
Model Performance Summary
Logistic Regression: Accuracy ≈ 0.723 | AUC ≈ 0.803
Random Forest: Accuracy ≈ 0.788 | AUC ≈ 0.868
Key Findings
- Model Superiority:
Random Forest outperformed Logistic Regression, suggesting that the relationship between audio features and “hits” involves non-linear components that a linear model cannot fully capture.
- Shared Importance:
Both models consistently identified specific features as critical predictors (refer to the comparison chart in Section 3.3).
- Explanatory Power:
An AUC of 0.87 indicates strong discriminative ability. However, approximately 21% of the variance remains unexplained by audio features alone, highlighting the importance of external factors like marketing and artist reputation.
Analysis Limitations
Temporal Bias: The model treats sixty years of data as a single pool, overlooking shifting musical standards across decades.
Label Definition: The “hit/flop” labels are defined by the dataset author and may not perfectly align with official Billboard chart performance.
Part 4: Final Summary and Limitations
Project Purpose
This analysis explores the association between Spotify audio features and commercial chart success (Explanatory Analysis).
Dataset Overview
Source: Spotify Hit Predictor (D2)
Sample Size: ~40,000 tracks (Balanced: 50% Hit / 50% Flop)
Time Range: 1960s – 2010s
| Model | Accuracy | AUC |
|---|---|---|
| Logistic Regression | 0.7230 | 0.8032 |
| Random Forest | 0.7878 | 0.8682 |
As shown in Section 3.1, Random Forest outperformed Logistic Regression across all metrics.
Core Conclusion
Audio features alone can predict chart success with reasonable accuracy (AUC = 0.87). This suggests that the sonic characteristics of a song are meaningfully associated with its commercial appeal—though they are not the sole determinants of success.
Research Limitations
- Temporal Consistency:
Standards for a “hit” have evolved; the model does not account for era-specific trends.
- Operational Definitions:
Labels are proxy indicators based on specific dataset criteria rather than universal industry standards.
- Missing Variables:
Critical non-musical factors such as marketing budgets, social media presence, and artist fame were not included.
- Genre Bias:
Variations in audio feature distributions across genres may influence the model’s fairness and generalizability.