Member (Group 14):
Music labels, streaming curators, and A&R teams face a common resource allocation problem: promotional budgets must be committed before a track’s commercial performance is known. Playlist placements, advertising spend, and artist development attention are finite. The dominant strategy of observing early streaming numbers and reacting is reactive and leaves money on the table for tracks that could have benefited from earlier investment.
Spotify assigns every track a popularity score (0–100), computed from recent stream counts relative to all other tracks. This score is the most directly observable proxy for commercial performance on the platform. Audio features (danceability, energy, loudness, acousticness, and genre) are measured at the point of production, before any marketing decision is made. If these pre-release signals carry predictive power for the eventual popularity score, then marketing teams can use them to build a prioritisation list: which tracks are worth a major promotional push, and which should receive minimal spend.
This project develops a machine learning decision-support system grounded in this premise. The system is not a replacement for human judgement. No audio feature model can capture the social moments, meme virality, or tastemaker endorsements that drive some of the biggest hits. The models described here function as a first-pass quantitative screen: a tool that adds an evidence base to a process that is otherwise entirely subjective.
A prerequisite for pre-release applicability is that the input features must be observable before a track accumulates significant streaming data. All audio features used in this analysis are available through the Spotify Web API at the point of track ingestion, before any streaming activity has accumulated, making them genuinely accessible as early-stage inputs to marketing prioritisation decisions.
This report is designed to be read as a technical summary rather than as a script archive. Each major section explains why a step was necessary, shows the original project code used to carry it out, presents the resulting saved output where relevant, interprets what that output means, and then links the result to the next stage of the analysis. Readers therefore do not need to inspect every line of code in detail, but the code is included so that the process remains transparent and reproducible.
The integrated workflow begins with a reproducibility layer that loads packages, defines relative paths, creates output directories, and locks the final classification threshold used later in held-out testing. The report does not rerun the full modelling pipeline during knitting, but this setup code is part of the reproducibility story because every downstream script depends on the same paths and constants.
# analysis/master_script.R - package and path setup
required_pkgs <- c(
"tidyverse", "scales",
"tidymodels",
"ranger", "xgboost",
"vip", "hardhat",
"doParallel", "ggrepel"
)
raw_path <- "../data/dataset.csv"
processed_dir <- "../data/processed"
clean_path <- file.path(processed_dir, "spotify_cleaned.csv")
fig_dir <- "../output/figures"
table_dir <- "../output/tables"
eda_dir <- "../output/eda"
reg_out_dir <- "../output/regression"
cls_out_dir <- "../output/classification"
tune_dir <- "../output/tuning"
thresh_dir <- "../output/threshold"
test_dir <- "../output/test_evaluation"
interp_dir <- "../output/interpretability"
sensitivity_dir <- "../output/sensitivity"
for (d in c(processed_dir, fig_dir, table_dir, eda_dir,
reg_out_dir, cls_out_dir, tune_dir, thresh_dir,
test_dir, interp_dir, sensitivity_dir)) {
dir.create(d, recursive = TRUE, showWarnings = FALSE)
}
n_threads <- min(4L, max(1L, parallel::detectCores() - 1L))
FORCE_RETUNE <- FALSE
CLASSIFICATION_THRESHOLD <- 0.19 # locked before any test-set access; see threshold sectionResearch Aim: To build and evaluate predictive models that estimate Spotify track popularity and identify potential hit tracks from audio and genre features, and to derive actionable genre and audio signal insights for music marketing decision-making.
Research Objectives:
Research Questions:
The project uses both regression and classification because the two tasks answer different decision needs. Regression estimates how popular a track may become on a continuous 0-100 scale, which supports ranking and prioritisation. Classification asks whether a track is likely to cross a commercially meaningful threshold of 70, which supports shortlisting likely “hit” candidates for human review.
The published literature converges on a set of findings directly relevant to this project. Genre and audio characteristics, particularly loudness, energy, and danceability, emerge as the primary popularity predictors across studies using Spotify data (Arora & Rani, 2024; Jiang, 2024; Jung & Mayer, n.d.; Saragih, 2023). Tree-based ensembles (Random Forest, XGBoost) consistently outperform linear and deep learning alternatives at this task (Arora & Rani, 2024; Jung & Mayer, n.d.; Khan et al., 2022). Reported R² values range from 0.35 to 0.70, with the upper end reached only when genre is explicitly encoded (Jiang, 2024; Jung & Mayer, n.d.). These findings informed the algorithm selection and evaluation design here.
The project uses the Spotify Tracks Dataset (Kaggle:
dataset.csv, supplied in the repository materials), which
was scrapped on 22 Oct 2022. The available project documents confirm
that the raw dataset contains 114,000 observations, 21 variables, and a
genre-structured catalogue design with roughly 1,000 tracks per genre in
the raw data. Because the project files do not provide a formal dataset
year inside the raw CSV itself, the report treats the dataset as a
project snapshot rather than asserting a source timestamp beyond the
available materials.
The cleaned dataset contains 89,578 unique tracks across 113 genres.
The dataset contains three broad variable groups: identifier or text
fields (track_id, track_name,
album_name, artists), categorical descriptors
(track_genre, explicit, mode,
key, time_signature), and continuous audio
measurements (danceability, energy,
loudness, speechiness,
acousticness, instrumentalness,
liveness, valence, tempo,
duration_ms) together with the target variable
popularity.
The overview table below is not a manually typed summary. It comes
from the EDA stage, where the cleaned dataset was profiled and then
written to ../output/tables/eda_overview.csv for reuse in
the report.
# analysis/master_script.R - overview table exported during EDA
eda_overview <- tibble(
metric = c(
"Number of tracks", "Number of genres",
"Average popularity", "Median popularity",
"Average duration (min)",
"High popularity tracks", "Low popularity tracks"
),
value = c(
nrow(spotify),
n_distinct(spotify$track_genre),
round(mean(spotify$popularity), 2),
round(median(spotify$popularity), 2),
round(mean(spotify$duration_min), 2),
sum(spotify$high_popularity_label == "high"),
sum(spotify$high_popularity_label == "low")
)
)
write_csv(eda_overview, file.path(table_dir, "eda_overview.csv"))| metric | value |
|---|---|
| Number of tracks | 89578.00 |
| Number of genres | 113.00 |
| Average popularity | 33.19 |
| Median popularity | 33.00 |
| Average duration in minutes | 3.82 |
| High popularity tracks | 45652.00 |
| Low popularity tracks | 43926.00 |
The raw dataset starts with 114,000 observations and 21 variables.
After cleaning and deduplication, the modelling dataset contains 89,578
rows and 33 columns. The project retains audio features and genre as
predictors while excluding pure identifiers and text metadata such as
track_id, track_name, album_name,
and artists from the baseline models. These excluded fields
are useful for cleaning and descriptive context but would either cause
leakage, create sparse identifiers, or undermine generalisability if
passed directly into the models.
Before any cleaning rule was applied, the project inspected the raw structure directly. These checks were diagnostic rather than report-facing outputs, but they are part of the actual process and help explain why the later validation filters exist.
# analysis/01.R and analysis/master_script.R - raw import and inspection
spotify_raw <- read_csv(
raw_path,
locale = locale(encoding = "Windows-1252"),
show_col_types = FALSE
)
if ("...1" %in% names(spotify_raw)) {
spotify_raw <- spotify_raw %>% rename(source_row = `...1`)
}
cat("Raw dataset dimensions:", nrow(spotify_raw), "rows x", ncol(spotify_raw), "columns\n")
print(names(spotify_raw))
glimpse(spotify_raw)
print(summary(select(spotify_raw, where(is.numeric))))The workflow begins with a raw-data audit and a set of explicit
quality controls. The requirement brief asks for dataset details,
structure, and summary before modelling; this project operationalises
that requirement by checking missing values, duplicate rows, duplicate
track_ids, invalid ranges, and data types before any split
is created.
Three data issues mattered most:
track_ids were common: 16,641 track
identifiers appeared more than once.Duplicate handling was treated as a methodological issue rather than
a trivial cleaning step. Duplicate track_ids were not
simple data-entry errors; they were mostly cross-genre listings of the
same track. If left unresolved, identical songs could appear in both
training and test partitions, producing hidden leakage. The adopted
solution was neutral first-occurrence deduplication without sorting by
popularity first, which avoided biasing the regression target
upward.
The corresponding raw-data checks are shown below. These exact calculations established the amount of missingness and duplication before cleaning decisions were applied.
# analysis/01.R and analysis/master_script.R - missing and duplicate checks
missing_summary <- spotify_raw %>%
summarise(across(everything(), ~ sum(is.na(.x)))) %>%
pivot_longer(everything(), names_to = "column", values_to = "missing_count") %>%
mutate(missing_percent = round(missing_count / nrow(spotify_raw) * 100, 3)) %>%
arrange(desc(missing_count))
duplicate_track_id_count <- spotify_raw %>%
filter(!is.na(track_id)) %>%
count(track_id, name = "record_count") %>%
filter(record_count > 1) %>%
nrow()
duplicate_full_row_count <- spotify_raw %>%
select(-any_of("source_row")) %>%
duplicated() %>%
sum()| column | missing_count | missing_percent |
|---|---|---|
| artists | 1 | 0.001 |
| album_name | 1 | 0.001 |
| track_name | 1 | 0.001 |
# analysis/master_script.R - data validation and neutral deduplication
spotify_cleaned <- spotify_raw %>%
mutate(
across(where(is.character), ~ str_squish(.x)),
across(where(is.character), ~ na_if(.x, "")),
explicit = as.logical(explicit)
) %>%
distinct(across(-any_of("source_row")), .keep_all = TRUE) %>%
drop_na(
track_id, artists, album_name, track_name, track_genre,
popularity, duration_ms, explicit, danceability, energy, key, loudness,
mode, speechiness, acousticness, instrumentalness, liveness, valence,
tempo, time_signature
) %>%
filter(
popularity >= 0, popularity <= 100,
duration_ms > 0,
danceability >= 0, danceability <= 1,
energy >= 0, energy <= 1,
speechiness >= 0, speechiness <= 1,
acousticness >= 0, acousticness <= 1,
instrumentalness >= 0, instrumentalness <= 1,
liveness >= 0, liveness <= 1,
valence >= 0, valence <= 1,
tempo > 0
) %>%
distinct(track_id, .keep_all = TRUE)The cleaned file then adds modelling-friendly fields.
duration_ms is converted to minutes, audio features are
preserved, and factor encodings are prepared for recipe-based one-hot
encoding later in the pipeline. Most importantly, the workflow separates
an exploratory median-based high_popularity label from the
actual classification target used in the project. The
high_popularity is descriptive only, whereas the true
classifier target is hit = 1 when popularity is at least
70.
# analysis/master_script.R - feature engineering and target construction
spotify_cleaned <- spotify_cleaned %>%
mutate(
duration_min = duration_ms / 60000,
popularity_level = case_when(
popularity < 30 ~ "low",
popularity < 60 ~ "medium",
TRUE ~ "high"
),
high_popularity = if_else(popularity >= median(popularity, na.rm = TRUE), 1L, 0L),
hit = if_else(popularity >= 70, 1L, 0L),
hit_label = factor(
if_else(hit == 1L, "hit", "non_hit"),
levels = c("non_hit", "hit")
)
) %>%
mutate(
explicit = factor(explicit, levels = c(FALSE, TRUE),
labels = c("not_explicit", "explicit")),
mode = factor(mode, levels = c(0, 1), labels = c("minor", "major")),
key = factor(key),
time_signature = factor(time_signature),
track_genre = factor(track_genre)
)These choices produced the final study frame used throughout the rest of the report: 89,578 unique tracks, a hit prevalence of 3.49%, and a zero-popularity rate of 10.54%.
The engineered hit label is especially important because its low prevalence drives the later choice of PR-AUC, class weighting, and threshold optimisation.
# analysis/04_baseline_classification.R and analysis/master_script.R - class distribution check
cls_data <- spotify %>%
select(
hit_label,
track_genre, duration_min,
explicit, mode, key, time_signature,
danceability, energy, loudness,
speechiness, acousticness, instrumentalness,
liveness, valence, tempo
)
count(cls_data, hit_label) %>%
mutate(pct = round(100 * n / sum(n), 2))| hit_label | n | percent |
|---|---|---|
| hit | 3124 | 3.49 |
| non_hit | 86454 | 96.51 |
Exploratory analysis was used supply evidence for later modelling decisions. The project did not treat EDA as a decorative step; it directly influenced feature retention, metric choice, and the decision to preserve genre-typical extremes rather than remove them wholesale as outliers.
The popularity distribution is neither normal nor uniform. It has a visible low-end concentration, including the zero-popularity cluster, plus a wider middle range. This matters because the regression task is not simply estimating a smooth continuous response; it must also contend with a structural mass at zero that may not be explainable by audio signal alone.
# analysis/master_script.R - popularity distribution plot
p1 <- ggplot(spotify, aes(x = popularity)) +
geom_histogram(binwidth = 5, fill = "#1DB954", color = "white", alpha = 0.85) +
geom_vline(aes(xintercept = median(popularity)), color = "red", linewidth = 1) +
labs(
title = "Distribution of Spotify Track Popularity",
subtitle = "Red line shows the median popularity used for the descriptive high/low split",
x = "Popularity Score (0-100)", y = "Number of Tracks"
) +
theme_minimal()
ggsave(file.path(fig_dir, "01_popularity_distribution.png"), p1, width = 8, height = 5, dpi = 300)Genre was a dominant theme across the project. The raw dataset was
genre-structured, and the cleaned data still preserve strong
between-genre differences in average popularity and hit rate. This is
why the study explicitly required track_genre to be
included in all baseline and tuned models.
# analysis/master_script.R - genre summary used for the opportunity map
genre_summary <- spotify %>%
group_by(track_genre) %>%
summarise(
track_count = n(),
avg_popularity = mean(popularity),
median_popularity = median(popularity),
high_rate = mean(high_popularity_label == "high"),
.groups = "drop"
) %>%
filter(track_count >= 100) %>%
mutate(
opportunity_score = avg_popularity / log1p(track_count),
genre_group = case_when(
avg_popularity >= median(avg_popularity) & track_count < median(track_count) ~
"High popularity / Less crowded",
avg_popularity >= median(avg_popularity) & track_count >= median(track_count) ~
"High popularity / Crowded",
avg_popularity < median(avg_popularity) & track_count < median(track_count) ~
"Low popularity / Less crowded",
TRUE ~ "Low popularity / Crowded"
)
)
write_csv(genre_summary, file.path(table_dir, "genre_summary.csv"))
p2 <- ggplot(genre_summary, aes(x = track_count, y = avg_popularity)) +
geom_point(aes(size = high_rate, color = genre_group), alpha = 0.75) +
geom_text(
data = genre_summary %>% slice_max(avg_popularity, n = 10),
aes(label = track_genre), size = 3, vjust = -0.8, check_overlap = TRUE
) +
scale_x_log10(labels = comma) +
scale_size_continuous(labels = percent_format(accuracy = 1)) +
labs(
title = "Genre Opportunity Map",
subtitle = "Upper-left: higher average popularity with fewer tracks",
x = "Number of Tracks in Genre (log scale)", y = "Average Popularity",
size = "High Popularity Rate", color = "Genre Segment"
) +
theme_minimal()
ggsave(file.path(fig_dir, "02_genre_opportunity_map.png"), p2, width = 10, height = 6, dpi = 300)For readability, the report shows a small extracted table of top genres before the saved figure.
# report helper - select a compact subset from the saved project summary
top_genres <- genre_summary %>%
arrange(desc(avg_popularity)) %>%
select(track_genre, track_count, avg_popularity, high_rate) %>%
slice_head(n = 8)| track_genre | track_count | avg_popularity | high_rate |
|---|---|---|---|
| k-pop | 916 | 59.424 | 0.987 |
| pop-film | 815 | 59.097 | 0.962 |
| metal | 232 | 56.422 | 0.918 |
| chill | 972 | 53.739 | 0.939 |
| latino | 398 | 51.789 | 0.869 |
| sad | 564 | 51.110 | 0.947 |
| grunge | 862 | 50.587 | 0.811 |
| indian | 733 | 49.765 | 0.963 |
Genres such as anime, british,
brazil, ambient, and acoustic
have relatively high average popularity in this processed dataset, while
less mainstream or niche segments tend to fall lower. This pattern
should not be overinterpreted as a causal genre effect, but it is strong
enough to justify genre inclusion as an explanatory variable.
The project also generated a direct high-versus-low popularity audio-profile figure. This output is useful because it translates the numeric summaries into an immediately interpretable comparison of the “sound DNA” of more and less popular tracks.
# analysis/master_script.R - high-vs-low popularity audio profile
feature_cols_eda <- c("danceability", "energy", "speechiness", "acousticness",
"instrumentalness", "liveness", "valence")
feature_profile <- spotify %>%
select(high_popularity_label, all_of(feature_cols_eda)) %>%
pivot_longer(cols = all_of(feature_cols_eda), names_to = "audio_feature", values_to = "value") %>%
group_by(high_popularity_label, audio_feature) %>%
summarise(mean_value = mean(value), .groups = "drop") %>%
group_by(audio_feature) %>%
mutate(feature_gap = mean_value[high_popularity_label == "high"] -
mean_value[high_popularity_label == "low"]) %>%
ungroup()
write_csv(feature_profile, file.path(table_dir, "audio_feature_profile_high_vs_low.csv"))
p3 <- ggplot(feature_profile,
aes(x = reorder(audio_feature, mean_value), y = mean_value, fill = high_popularity_label)) +
geom_col(position = "dodge") +
coord_flip() +
labs(
title = "Audio DNA: High Popularity vs Low Popularity Tracks",
x = "Audio Feature", y = "Average Feature Value", fill = "Popularity Group"
) +
theme_minimal()
ggsave(file.path(fig_dir, "03_audio_dna_high_vs_low.png"), p3, width = 8, height = 5, dpi = 300)This figure shows that high-popularity tracks are directionally less instrumental and more vocal/electronic, but no single feature provides a clean separator, which motivates multivariate modelling.
Tracks were also segmented into four sonic mood quadrants defined by median energy and median valence to check whether mood correlated with popularity grouping.
# analysis/master_script.R - sonic mood map
spotify_mood <- spotify %>%
mutate(mood_quadrant = case_when(
energy >= median(energy) & valence >= median(valence) ~ "Happy energetic",
energy >= median(energy) & valence < median(valence) ~ "Dark energetic",
energy < median(energy) & valence >= median(valence) ~ "Calm positive",
TRUE ~ "Calm sad"
))
mood_summary <- spotify_mood %>%
group_by(mood_quadrant) %>%
summarise(
track_count = n(),
avg_popularity = mean(popularity),
high_rate = mean(high_popularity_label == "high"),
.groups = "drop"
) %>%
arrange(desc(high_rate))
write_csv(mood_summary, file.path(table_dir, "mood_quadrant_summary.csv"))
p4 <- spotify_mood %>%
sample_n(min(12000, nrow(.))) %>%
ggplot(aes(x = valence, y = energy, color = high_popularity_label)) +
geom_point(alpha = 0.25, size = 0.8) +
geom_vline(xintercept = median(spotify$valence), linetype = "dashed") +
geom_hline(yintercept = median(spotify$energy), linetype = "dashed") +
labs(
title = "Sonic Mood Map: Energy vs Valence",
subtitle = "Dashed lines split tracks into four mood quadrants",
x = "Valence: sad to happy", y = "Energy: calm to energetic", color = "Popularity Group"
) +
theme_minimal()
ggsave(file.path(fig_dir, "04_sonic_mood_map.png"), p4, width = 8, height = 6, dpi = 300)The mood map shows that high-popularity tracks are roughly evenly distributed across all four quadrants. There is no dominant mood signature for popular tracks: energetic-happy, energetic-dark, calm-positive, and calm-sad tracks all appear in the high-popularity group. This confirms that mood alone is not a reliable predictor and that genre and audio feature combinations matter more than any single quadrant classification.
This figure complements the correlation table by showing that popularity is not dominated by any single dramatic audio separation. Instead, the pattern is cumulative: high-popularity tracks tend to be somewhat more danceable and less instrumental or acoustic, but the differences are gradual rather than absolute. That is exactly the kind of structure that motivates multivariate models in the next stage.
Simple audio features alone cannot explain popularity well. The saved correlation table shows only weak marginal linear relationships with popularity.
# analysis/master_script.R - correlation summary exported during EDA
correlation_summary <- spotify %>%
select(popularity, danceability, energy, loudness, speechiness, acousticness,
instrumentalness, liveness, valence, tempo, duration_min,
artist_count, energy_danceability_ratio, acoustic_energy_balance) %>%
summarise(across(-popularity, ~ cor(.x, popularity, use = "complete.obs"))) %>%
pivot_longer(everything(), names_to = "feature", values_to = "correlation_with_popularity") %>%
arrange(desc(abs(correlation_with_popularity)))
write_csv(correlation_summary, file.path(table_dir, "correlation_with_popularity.csv"))# report helper - display the strongest absolute correlations in compact form
top_corr <- correlation_summary %>%
slice_head(n = 8)| feature | correlation_with_popularity |
|---|---|
| instrumentalness | -0.1285 |
| loudness | 0.0732 |
| danceability | 0.0660 |
| energy_danceability_ratio | -0.0539 |
| speechiness | -0.0469 |
| acousticness | -0.0390 |
| acoustic_energy_balance | -0.0304 |
| duration_min | -0.0231 |
The correlation bar chart below shows the full ranking visually,
including the derived features energy_danceability_ratio
and acoustic_energy_balance.
# analysis/master_script.R - correlation bar chart exported during EDA
p5 <- ggplot(correlation_summary,
aes(x = reorder(feature, correlation_with_popularity), y = correlation_with_popularity)) +
geom_col(fill = "#1DB954") +
coord_flip() +
geom_hline(yintercept = 0, color = "black") +
labs(
title = "Correlation Between Audio Features and Popularity",
subtitle = "Weak correlations suggest popularity is driven by both audio and genre factors",
x = "Feature", y = "Correlation with Popularity"
) +
theme_minimal()
ggsave(file.path(fig_dir, "05_correlation_with_popularity.png"), p5, width = 8, height = 5, dpi = 300)Instrumentalness has the largest negative correlation in absolute value, while loudness and danceability are weakly positive. The magnitudes are modest, which already suggests that strong performance from a purely linear regression should not be expected. This EDA finding is consistent with the later model comparison results: tree-based methods outperform the linear baseline, and genre remains crucial.
The sound-type and market-style summaries add the same message in a more interpretable form. Low-acoustic vocal or electronic tracks have the highest average popularity, while acoustic instrumental niches are least popular on average. Explicit collaborations also show a higher high-popularity rate than non-explicit solo tracks. These patterns are useful for interpretation but still not strong enough to replace modelling.
# analysis/master_script.R - derived sound-type summary exported during EDA
spotify_niche <- spotify %>%
mutate(sound_type = case_when(
instrumentalness >= 0.5 & acousticness >= 0.5 ~ "Acoustic instrumental niche",
instrumentalness >= 0.5 & acousticness < 0.5 ~ "Electronic/instrumental niche",
instrumentalness < 0.5 & acousticness >= 0.5 ~ "Acoustic vocal mainstream",
TRUE ~ "Low-acoustic vocal/electronic"
))
sound_type_summary <- spotify_niche %>%
group_by(sound_type) %>%
summarise(
track_count = n(),
avg_popularity = mean(popularity),
high_rate = mean(high_popularity_label == "high"),
.groups = "drop"
) %>%
arrange(desc(avg_popularity))
write_csv(sound_type_summary, file.path(table_dir, "sound_type_summary.csv"))| sound_type | track_count | avg_popularity | high_rate |
|---|---|---|---|
| Low-acoustic vocal/electronic | 51470 | 35.274 | 0.552 |
| Acoustic vocal mainstream | 21578 | 32.257 | 0.498 |
| Acoustic instrumental niche | 6565 | 30.040 | 0.433 |
| Electronic/instrumental niche | 9965 | 26.538 | 0.365 |
# analysis/master_script.R - metadata-style summary exported during EDA
market_style_summary <- spotify %>%
group_by(explicit, is_collaboration) %>%
summarise(
track_count = n(),
avg_popularity = mean(popularity),
median_popularity = median(popularity),
high_rate = mean(high_popularity_label == "high"),
.groups = "drop"
)
write_csv(market_style_summary, file.path(table_dir, "market_style_summary.csv"))| explicit | is_collaboration | track_count | avg_popularity | median_popularity | high_rate |
|---|---|---|---|---|---|
| explicit | collaboration | 2380 | 38.974 | 43 | 0.653 |
| explicit | solo | 5324 | 35.952 | 33 | 0.506 |
| not_explicit | collaboration | 20141 | 33.640 | 36 | 0.539 |
| not_explicit | solo | 61733 | 32.585 | 32 | 0.495 |
The modelling framework was designed around three methodological constraints:
The project did not apply one identical recipe to every model. Instead, preprocessing reflects model families:
The exact preprocessing recipes are part of the reproducible modelling process because they determine which transformed columns are seen by each model family.
# analysis/master_script.R - baseline preprocessing recipes
reg_recipe <- recipe(popularity ~ ., data = train_reg) %>%
step_zv(all_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors())
lr_recipe <- recipe(hit_label ~ ., data = train_weighted) %>%
step_mutate(
duration_min = log1p(duration_min),
instrumentalness = log1p(instrumentalness)
) %>%
step_zv(all_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors())
rf_cls_baseline_recipe <- recipe(hit_label ~ ., data = train_cls) %>%
step_zv(all_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors())
xgb_cls_recipe <- recipe(hit_label ~ ., data = train_cls) %>%
step_zv(all_predictors()) %>%
step_dummy(all_nominal_predictors())All three recipes include step_zv() to remove any
predictor with zero variance before modelling, which prevents numerical
instability in some model engines. The logistic regression recipe
additionally applies log1p() to duration_min
and instrumentalness before normalisation, addressing their
right-skewed distributions; tree-based models are invariant to monotonic
transformations and do not require this step.
The multicollinearity check also belongs here. Pearson correlation between energy and loudness in the cleaned dataset is moderately high (r ≈ 0.7), and both features appear in the same linear recipe. This level of collinearity inflates standard errors for individual coefficients but does not materially degrade predictions. The study records the assessment that this does not require replacing the linear regression with a ridge-penalised alternative, so the linear model is retained as an interpretable benchmark while acknowledging that individual coefficient magnitudes should not be read causally.
The main evaluation metrics are as follows. For regression, RMSE measures typical prediction error in popularity points, MAE measures average absolute prediction error, and R-squared measures the proportion of popularity variance explained by the model. For classification, PR-AUC is the primary metric because the hit class is rare; it summarises the precision-recall trade-off across thresholds. ROC-AUC is reported as a secondary threshold-free metric, while precision, recall, and F1 are interpreted only after a specific decision threshold has been chosen.
The regression task asks how well numeric popularity can be predicted from the available audio and genre variables. The project builds this argument in stages: linear regression and random forest as baselines, untuned XGBoost as a stronger nonlinear benchmark, then cross-validated tuning for the two tree-based models.
The baseline regression section uses a standardised recipe and compares linear regression with random forest on the validation set only.
# analysis/master_script.R - regression baseline models
reg_recipe <- recipe(popularity ~ ., data = train_reg) %>%
step_zv(all_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors())
lm_spec <- linear_reg() %>% set_engine("lm") %>% set_mode("regression")
rf_reg_baseline_spec <- rand_forest(trees = 500, mtry = 5, min_n = 5) %>%
set_engine("ranger", importance = "impurity", num.threads = n_threads) %>%
set_mode("regression")
lm_fit <- workflow() %>%
add_recipe(reg_recipe) %>%
add_model(lm_spec) %>%
fit(data = train_reg)
rf_reg_baseline_fit <- workflow() %>%
add_recipe(reg_recipe) %>%
add_model(rf_reg_baseline_spec) %>%
fit(data = train_reg)# analysis/master_script.R - validation-set regression metrics and export
val_lm <- evaluate_regression(lm_fit, val_reg, "Linear Regression")
val_rf <- evaluate_regression(rf_reg_baseline_fit, val_reg, "Random Forest")
baseline_reg_results <- bind_rows(val_lm, val_rf) %>%
pivot_wider(names_from = .metric, values_from = .estimate) %>%
arrange(rmse)
write_csv(baseline_reg_results, file.path(reg_out_dir, "baseline_regression_validation_results.csv"))| model | rmse | mae | rsq |
|---|---|---|---|
| Random Forest | 16.0130 | 12.1030 | 0.4271 |
| Linear Regression | 16.7594 | 12.0016 | 0.3369 |
The random forest baseline outperforms the linear model on RMSE and R-squared, although the linear model slightly edges random forest on MAE. Different regression metrics reward different error patterns. The overall conclusion is still clear: a nonlinear model is more suitable for this task than a purely linear specification.
XGBoost was then introduced as a third untuned baseline with the same data split and predictor set.
# analysis/master_script.R - untuned XGBoost regression
xgb_reg_spec <- boost_tree(
trees = 500, tree_depth = 6, learn_rate = 0.1,
min_n = 1, loss_reduction = 0, sample_size = 0.8
) %>%
set_engine("xgboost", nthread = n_threads) %>%
set_mode("regression")
xgb_reg_fit <- workflow() %>%
add_recipe(xgb_reg_recipe) %>%
add_model(xgb_reg_spec) %>%
fit(data = train_reg)
xgb_reg_val <- predict(xgb_reg_fit, val_reg) %>%
bind_cols(val_reg %>% select(popularity)) %>%
rename(truth = popularity, estimate = .pred) %>%
reg_metrics(truth = truth, estimate = estimate) %>%
mutate(model = "XGBoost") %>%
select(model, .metric, .estimate) %>%
pivot_wider(names_from = .metric, values_from = .estimate)
write_csv(xgb_reg_val, file.path(reg_out_dir, "xgb_regression_validation_results.csv"))The untuned XGBoost model improved the validation RMSE relative to both baseline models and initially became the leading regression candidate. This justified taking both random forest and XGBoost into the tuning stage instead of assuming the baseline ranking would persist.
The tuning section uses resampling only on the training split. Random
forest is tuned over mtry and min_n, while
XGBoost is tuned over learn_rate, tree_depth,
and min_n.
# analysis/master_script.R - tuning design
cv_folds_reg <- vfold_cv(train_reg, v = 3, strata = popularity)
rf_reg_spec_tune <- rand_forest(
trees = rf_reg_trees, mtry = tune(), min_n = tune()
) %>%
set_engine("ranger", importance = "impurity", num.threads = 1L) %>%
set_mode("regression")
xgb_reg_spec_tune <- boost_tree(
trees = xgb_reg_trees,
learn_rate = tune(), tree_depth = tune(), min_n = tune(),
loss_reduction = 0, sample_size = 0.8
) %>%
set_engine("xgboost", nthread = 1L) %>%
set_mode("regression")# analysis/master_script.R - tuned-vs-baseline regression comparison table
rf_reg_best <- select_best(rf_reg_tune_res, metric = "rmse")
xgb_reg_best <- select_best(xgb_reg_tune_res, metric = "rmse")
rf_reg_final_fit <- finalize_workflow(rf_reg_wflow_tune, rf_reg_best) %>%
fit(data = train_reg)
xgb_reg_final_fit <- finalize_workflow(xgb_reg_wflow_tune, xgb_reg_best) %>%
fit(data = train_reg)
rf_reg_tuned_val <- predict(rf_reg_final_fit, val_reg) %>%
bind_cols(val_reg %>% select(popularity)) %>%
rename(truth = popularity, estimate = .pred) %>%
metric_set(rmse, mae, rsq)(truth = truth, estimate = estimate) %>%
pivot_wider(names_from = .metric, values_from = .estimate) %>%
mutate(model = "RF Tuned")
xgb_reg_tuned_val <- predict(xgb_reg_final_fit, val_reg) %>%
bind_cols(val_reg %>% select(popularity)) %>%
rename(truth = popularity, estimate = .pred) %>%
metric_set(rmse, mae, rsq)(truth = truth, estimate = estimate) %>%
pivot_wider(names_from = .metric, values_from = .estimate) %>%
mutate(model = "XGBoost Tuned")
baseline_reg <- read_csv(file.path(reg_out_dir, "all_regression_validation_results.csv"),
show_col_types = FALSE) %>%
mutate(type = "baseline")
tuned_reg <- bind_rows(rf_reg_tuned_val, xgb_reg_tuned_val) %>%
mutate(type = "tuned")
reg_comparison <- bind_rows(baseline_reg, tuned_reg) %>%
arrange(rmse) %>%
select(type, model, rmse, mae, rsq)
write_csv(reg_comparison, file.path(tune_dir, "regression_tuned_vs_baseline.csv"))| type | model | rmse | mae | rsq |
|---|---|---|---|---|
| tuned | RF Tuned | 14.6712 | 10.1184 | 0.4929 |
| baseline | XGBoost | 15.1732 | 10.5891 | 0.4573 |
| tuned | XGBoost Tuned | 15.2373 | 10.6828 | 0.4531 |
| baseline | Random Forest | 16.0130 | 12.1030 | 0.4271 |
| baseline | Linear Regression | 16.7594 | 12.0016 | 0.3369 |
Tuning reversed the baseline ranking. The tuned random forest became the best validation model with RMSE 14.6878 and R-squared 0.4917, outperforming both untuned XGBoost and tuned XGBoost. This is an important project finding because it shows that baseline model ordering should not be treated as final. The best-performing model changed only after hyperparameter search.
The held-out test set was used only after all choices were fixed. The saved evaluation file reports the following final regression performance:
# analysis/master_script.R - final regression test metrics and exported summary
rf_reg_val_preds <- predict(rf_reg_final_fit, val_reg) %>%
bind_cols(val_reg %>% select(popularity)) %>%
rename(truth = popularity, estimate = .pred)
rf_reg_val_metrics <- metric_set(rmse, mae, rsq)(
rf_reg_val_preds, truth = truth, estimate = estimate
)
rf_reg_test_preds <- predict(rf_reg_final_fit, test_reg) %>%
bind_cols(test_reg %>% select(popularity)) %>%
rename(truth = popularity, estimate = .pred)
rf_reg_test_metrics <- metric_set(rmse, mae, rsq)(
rf_reg_test_preds, truth = truth, estimate = estimate
)
reg_test_summary <- bind_rows(
rf_reg_val_metrics %>% pivot_wider(names_from = .metric, values_from = .estimate) %>%
mutate(model = "RF Tuned", split = "validation"),
rf_reg_test_metrics %>% pivot_wider(names_from = .metric, values_from = .estimate) %>%
mutate(model = "RF Tuned", split = "test")
) %>%
select(split, model, rmse, mae, rsq)
write_csv(reg_test_summary, file.path(test_dir, "regression_test_results.csv"))| split | model | rmse | mae | rsq |
|---|---|---|---|---|
| validation | RF Tuned | 14.6878 | 10.1269 | 0.4917 |
| test | RF Tuned | 15.1610 | 10.4680 | 0.4579 |
The final test RMSE is 15.1610 and the final test R-squared is 0.4579. These results indicate that the model explains roughly 46% of the variance in popularity using the available features. That is meaningful predictive signal, but it is not strong enough to support deterministic claims about commercial success. The results should therefore be interpreted as evidence that audio and genre explain a substantial but incomplete part of popularity.
# analysis/master_script.R - regression actual-vs-predicted plot on the test set
p_reg_actual_vs_pred <- ggplot(rf_reg_test_preds, aes(x = truth, y = estimate)) +
geom_point(alpha = 0.15, colour = "#3498db", size = 0.7) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey40") +
labs(
title = "RF Tuned Regression: Actual vs Predicted Popularity (Test Set)",
subtitle = paste0("RMSE = ",
round(rf_reg_test_metrics %>% filter(.metric == "rmse") %>% pull(.estimate), 2),
" | R² = ",
round(rf_reg_test_metrics %>% filter(.metric == "rsq") %>% pull(.estimate), 3)),
x = "Actual Popularity", y = "Predicted Popularity"
) +
theme_minimal()
ggsave(file.path(test_dir, "rf_reg_actual_vs_pred_test.png"), p_reg_actual_vs_pred,
width = 8, height = 6, dpi = 300)A residuals plot provides a complementary diagnostic: it reveals whether prediction errors are random around zero or show systematic patterns across the predicted range.
# analysis/master_script.R - regression residuals vs predicted on the test set
reg_residuals <- rf_reg_test_preds %>% mutate(residual = truth - estimate)
ggsave(file.path(test_dir, "rf_reg_residuals_test.png"),
ggplot(reg_residuals, aes(x = estimate, y = residual)) +
geom_point(alpha = 0.15, colour = "#1DB954", size = 0.7) +
geom_hline(yintercept = 0, linetype = "dashed", colour = "grey40") +
geom_smooth(method = "loess", se = FALSE, colour = "#e74c3c", linewidth = 0.7) +
labs(title = "RF Tuned Regression: Residuals vs Predicted (Test Set)",
x = "Predicted Popularity", y = "Residual (Actual − Predicted)") +
theme_minimal(),
width = 8, height = 5, dpi = 300)The residuals plot shows that errors at very low predicted values (below 10) are systematically positive: the model under-predicts actual popularity for this segment. This corresponds to the zero-popularity cluster: the model assigns positive predicted scores to tracks that have zero observed popularity, producing a concentration of upward residuals at the low end. At the centre of the predicted range, residuals are approximately symmetric around zero, indicating that the model is well-calibrated for the mainstream popularity band.
The classification task is more difficult because the positive class is rare. After deduplication, only 3.49% of tracks meet the project’s hit definition. This shaped every modelling choice: primary metric, weighting strategy, and post-training threshold selection.
With a prevalence this low, accuracy would be misleading. A trivial classifier that predicts every track as a non-hit would already be correct most of the time. The project therefore uses PR-AUC as the main model-selection metric because it measures performance on the rare class directly through the precision-recall trade-off. ROC-AUC is retained as a secondary threshold-free metric, while F1, recall, and precision are treated as operating-point metrics that only become meaningful after threshold selection.
The classification baseline compares logistic regression, random forest, and later XGBoost using the same core predictor set. Class weighting was preferred over oversampling methods such as SMOTE: at a 1:28 imbalance, synthetic minority samples tend to interpolate within a very sparse region of the feature space, and comparable studies at similar imbalance severity confirm that weighting is more reliable. The workflow computes a training-set weight ratio equal to the number of non-hits divided by hits, then applies it differently according to model family.
# analysis/master_script.R - class weighting
n_hit_train <- sum(train_cls$hit_label == "hit")
n_nonhit_train <- sum(train_cls$hit_label == "non_hit")
weight_ratio <- n_nonhit_train / n_hit_train
train_weighted <- train_cls %>%
mutate(case_wt = hardhat::importance_weights(
if_else(hit_label == "hit", weight_ratio, 1.0)
))
rf_cls_baseline_spec <- rand_forest(trees = 500, mtry = 5, min_n = 5) %>%
set_engine(
"ranger",
importance = "impurity",
class.weights = c("non_hit" = 1, "hit" = weight_ratio),
num.threads = n_threads
) %>%
set_mode("classification")
lr_spec <- logistic_reg() %>% set_engine("glm") %>% set_mode("classification")
xgb_cls_spec <- boost_tree(
trees = 500, tree_depth = 6, learn_rate = 0.1,
min_n = 1, loss_reduction = 0, sample_size = 0.8
) %>%
set_engine("xgboost", nthread = n_threads, scale_pos_weight = weight_ratio) %>%
set_mode("classification")# analysis/master_script.R - baseline classification workflows and fitting
lr_fit <- workflow() %>%
add_case_weights(case_wt) %>%
add_recipe(lr_recipe) %>%
add_model(lr_spec) %>%
fit(data = train_weighted)
rf_cls_baseline_fit <- workflow() %>%
add_recipe(rf_cls_baseline_recipe) %>%
add_model(rf_cls_baseline_spec) %>%
fit(data = train_cls)
xgb_cls_fit <- workflow() %>%
add_recipe(xgb_cls_recipe) %>%
add_model(xgb_cls_spec) %>%
fit(data = train_cls)# analysis/master_script.R - classification baseline metrics and comparison export
evaluate_classifier <- function(fit_obj, new_data, model_name) {
hard_preds <- predict(fit_obj, new_data)
prob_preds <- predict(fit_obj, new_data, type = "prob")
preds <- bind_cols(hard_preds, prob_preds, new_data %>% select(hit_label)) %>%
rename(truth = hit_label)
pr <- pr_auc(preds, truth = truth, .pred_hit, event_level = "second")$.estimate
roc <- roc_auc(preds, truth = truth, .pred_hit, event_level = "second")$.estimate
f1 <- f_meas(preds, truth = truth, estimate = .pred_class, event_level = "second")$.estimate
rec <- recall(preds, truth = truth, estimate = .pred_class, event_level = "second")$.estimate
pre <- precision(preds, truth = truth, estimate = .pred_class, event_level = "second")$.estimate
tibble(
model = model_name,
pr_auc = round(pr, 4),
roc_auc = round(roc, 4),
f1 = round(f1, 4),
recall = round(rec, 4),
precision = round(pre, 4)
)
}
val_lr <- evaluate_classifier(lr_fit, val_cls, "Logistic Regression (weighted)")
val_rf <- evaluate_classifier(rf_cls_baseline_fit, val_cls, "Random Forest (weighted)")
val_xgb <- evaluate_classifier(xgb_cls_fit, val_cls, "XGBoost (scale_pos_weight)")
all_cls_results <- bind_rows(val_lr, val_rf, val_xgb) %>%
arrange(desc(pr_auc))
write_csv(all_cls_results, file.path(cls_out_dir, "all_classification_validation_results.csv"))| model | pr_auc | roc_auc | f1 | recall | precision |
|---|---|---|---|---|---|
| Random Forest (weighted) | 0.2861 | 0.8984 | NA | 0.0000 | NA |
| XGBoost (scale_pos_weight) | 0.2756 | 0.8939 | 0.0664 | 0.0351 | 0.6154 |
| Logistic Regression (weighted) | 0.2226 | 0.8963 | 0.2000 | 0.8860 | 0.1127 |
The baseline results demonstrate three different failure modes at the default 0.5 threshold:
This explains why threshold-dependent metrics from the baseline stage are not treated as final evidence.
The tuning stage again compares random forest and XGBoost, this time under 3-fold cross-validation on the training split. The tuned classifier is still selected on validation PR-AUC, not on default-threshold F1.
# analysis/master_script.R - tuned classification validation summaries and export
rf_cls_best <- select_best(rf_cls_tune_res, metric = "pr_auc")
xgb_cls_best <- select_best(xgb_cls_tune_res, metric = "pr_auc")
rf_cls_final_fit <- finalize_workflow(rf_cls_wflow_tune, rf_cls_best) %>%
fit(data = train_cls)
xgb_cls_final_fit <- finalize_workflow(xgb_cls_wflow_tune, xgb_cls_best) %>%
fit(data = train_cls)
rf_cls_tuned_val <- evaluate_cls_tuned(
rf_cls_final_fit, val_cls, "RF Tuned (weighted)"
)
xgb_cls_tuned_val <- evaluate_cls_tuned(
xgb_cls_final_fit, val_cls, "XGBoost Tuned (scale_pos_weight)"
)
baseline_cls <- read_csv(file.path(cls_out_dir, "all_classification_validation_results.csv"),
show_col_types = FALSE) %>%
mutate(type = "baseline")
tuned_cls <- bind_rows(rf_cls_tuned_val, xgb_cls_tuned_val) %>%
mutate(type = "tuned")
cls_comparison <- bind_rows(baseline_cls, tuned_cls) %>%
arrange(desc(pr_auc)) %>%
select(type, model, pr_auc, roc_auc, f1, recall, precision)
write_csv(cls_comparison, file.path(tune_dir, "classification_tuned_vs_baseline.csv"))| type | model | pr_auc | roc_auc | f1 | recall | precision |
|---|---|---|---|---|---|---|
| tuned | RF Tuned (weighted) | 0.3057 | 0.9033 | 0.0903 | 0.0482 | 0.7097 |
| baseline | Random Forest (weighted) | 0.2861 | 0.8984 | NA | 0.0000 | NA |
| tuned | XGBoost Tuned (scale_pos_weight) | 0.2791 | 0.8938 | 0.0044 | 0.0022 | 0.5000 |
| baseline | XGBoost (scale_pos_weight) | 0.2756 | 0.8939 | 0.0664 | 0.0351 | 0.6154 |
| baseline | Logistic Regression (weighted) | 0.2226 | 0.8963 | 0.2000 | 0.8860 | 0.1127 |
The current tuned-model artefacts show that the weighted tuned random forest remains the best classifier with validation PR-AUC 0.3057 and ROC-AUC 0.9033. As a result, the random forest classifier remains the preferred model. The report therefore uses the saved current artefacts as the main numeric evidence and treats the small discrepancies as refit noise.
# analysis/master_script.R - baseline classification PR curve on the validation set
lr_curve_data <- predict(lr_fit, val_cls, type = "prob") %>%
bind_cols(val_cls %>% select(hit_label)) %>%
rename(truth = hit_label) %>%
mutate(model = "Logistic Regression")
rf_curve_data <- predict(rf_cls_baseline_fit, val_cls, type = "prob") %>%
bind_cols(val_cls %>% select(hit_label)) %>%
rename(truth = hit_label) %>%
mutate(model = "Random Forest")
all_curve_data <- bind_rows(lr_curve_data, rf_curve_data)
pr_curves <- all_curve_data %>%
group_by(model) %>%
pr_curve(truth = truth, .pred_hit, event_level = "second")
p_pr <- ggplot(pr_curves, aes(x = recall, y = precision, color = model)) +
geom_path(linewidth = 0.9) +
geom_hline(yintercept = 0.035, linetype = "dashed", color = "grey50") +
annotate("text", x = 0.8, y = 0.045,
label = "No-skill baseline (~3.5%)", size = 3, color = "grey50") +
scale_color_manual(values = c("Logistic Regression" = "#2c7bb6",
"Random Forest" = "#1DB954")) +
labs(
title = "Precision-Recall Curves (Validation Set)",
subtitle = "Hit class (popularity >= 70). No-skill baseline = prevalence (~3.5%).",
x = "Recall", y = "Precision", color = "Model"
) +
theme_minimal()
ggsave(file.path(cls_out_dir, "pr_curves_validation.png"), p_pr, width = 8, height = 5, dpi = 300)A 0.5 cutoff is not appropriate when the model is trained on a class distribution near 1:28 and when the business-relevant class is rare. The project therefore used the tuned random forest probabilities on the validation set to sweep thresholds from 0.01 to 0.99 and compute precision, recall, F1, specificity, balanced accuracy, and Youden’s J.
# analysis/master_script.R - threshold sweep on validation probabilities
prob_df <- predict(rf_cls_final_fit, val_cls, type = "prob") %>%
bind_cols(val_cls %>% select(hit_label)) %>%
rename(truth = hit_label, prob_hit = .pred_hit)
prauc <- pr_auc(prob_df, truth = truth, prob_hit, event_level = "second")$.estimate
val_hit_rate <- mean(prob_df$truth == "hit")
thresholds <- seq(0.01, 0.99, by = 0.01)
truth_bin <- as.integer(prob_df$truth == "hit")
threshold_df <- map_dfr(thresholds, function(thr) {
pred_bin <- as.integer(prob_df$prob_hit >= thr)
TP <- sum(pred_bin == 1L & truth_bin == 1L)
FP <- sum(pred_bin == 1L & truth_bin == 0L)
FN <- sum(pred_bin == 0L & truth_bin == 1L)
TN <- sum(pred_bin == 0L & truth_bin == 0L)
n_pred_pos <- TP + FP
precision <- if (n_pred_pos > 0) TP / n_pred_pos else NA_real_
recall <- if ((TP + FN) > 0) TP / (TP + FN) else NA_real_
specificity <- if ((TN + FP) > 0) TN / (TN + FP) else NA_real_
f1 <- if (!is.na(precision) && !is.na(recall) && (precision + recall) > 0)
2 * precision * recall / (precision + recall) else NA_real_
youden_j <- if (!is.na(recall) && !is.na(specificity)) recall + specificity - 1 else NA_real_
balanced_acc <- if (!is.na(recall) && !is.na(specificity)) (recall + specificity) / 2 else NA_real_
tibble(
threshold = thr, tp = TP, fp = FP, fn = FN, tn = TN, n_pred_pos = n_pred_pos,
precision = precision, recall = recall, specificity = specificity,
f1 = f1, youden_j = youden_j, balanced_acc = balanced_acc
)
})
write_csv(threshold_df, file.path(thresh_dir, "threshold_sweep.csv"))# analysis/master_script.R - operating-point selection and export
op_f1 <- threshold_df %>%
filter(!is.na(f1)) %>%
slice_max(f1, n = 1, with_ties = FALSE) %>%
mutate(operating_point = "Best F1")
op_youden <- threshold_df %>%
filter(!is.na(youden_j)) %>%
slice_max(youden_j, n = 1, with_ties = FALSE) %>%
mutate(operating_point = "Best Youden's J")
op_high_recall <- threshold_df %>%
filter(!is.na(recall), recall >= 0.70) %>%
slice_max(precision, n = 1, with_ties = FALSE) %>%
mutate(operating_point = "High Recall (>=0.70)")
operating_points <- bind_rows(op_f1, op_youden, op_high_recall) %>%
select(operating_point, threshold, tp, fp, fn, tn, n_pred_pos,
precision, recall, f1, specificity, balanced_acc, youden_j)
write_csv(operating_points, file.path(thresh_dir, "operating_points.csv"))| operating_point | threshold | tp | fp | fn | tn | n_pred_pos | precision | recall | f1 | specificity | balanced_acc | youden_j |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Best F1 | 0.20 | 185 | 398 | 271 | 12582 | 583 | 0.3173 | 0.4057 | 0.3561 | 0.9693 | 0.6875 | 0.3750 |
| Best Youden’s J | 0.04 | 392 | 2555 | 64 | 10425 | 2947 | 0.1330 | 0.8596 | 0.2304 | 0.8032 | 0.8314 | 0.6628 |
| High Recall (>=0.70) | 0.08 | 331 | 1422 | 125 | 11558 | 1753 | 0.1888 | 0.7259 | 0.2997 | 0.8904 | 0.8082 | 0.6163 |
The current saved threshold sweep places the best-F1 operating point at approximately 0.20. The final test-evaluation pipeline, however, used the previously locked threshold of 0.19. This 0.01-0.02 shift as refit noise after later reruns. Methodologically, this is acceptable because the optimal region is stable and the test set was evaluated only after a threshold had already been fixed.
# analysis/master_script.R - threshold metrics plot exported for the report
p_thresh_metrics <- threshold_df %>%
select(threshold, f1, precision, recall) %>%
pivot_longer(cols = -threshold, names_to = "metric", values_to = "value") %>%
ggplot(aes(x = threshold, y = value, colour = metric)) +
geom_line(linewidth = 0.8, na.rm = TRUE) +
geom_vline(xintercept = op_f1$threshold, linetype = "dashed", colour = "#e74c3c") +
geom_vline(xintercept = op_youden$threshold, linetype = "dashed", colour = "#3498db") +
geom_vline(xintercept = op_high_recall$threshold, linetype = "dashed", colour = "#2ecc71") +
labs(
title = "RF Tuned (Weighted): Threshold vs F1, Precision, Recall",
subtitle = paste0("Validation set | PR-AUC = ", round(prauc, 4)),
x = "Classification Threshold", y = "Metric Value", colour = "Metric"
) +
theme_minimal()
ggsave(file.path(thresh_dir, "threshold_metrics_plot.png"), p_thresh_metrics,
width = 10, height = 6, dpi = 300)The threshold sweep also produced a precision-recall curve annotated with the selected operating points, showing where the chosen threshold sits within the full precision-recall trade-off.
# analysis/master_script.R - PR curve with threshold operating points
pr_curve_data <- prob_df %>%
pr_curve(truth = truth, prob_hit, event_level = "second")
op_points_pr <- operating_points %>%
filter(!is.na(precision), !is.na(recall)) %>%
mutate(label = paste0(operating_point, "\n(thr=", threshold, ")"))
p_pr_curve <- pr_curve_data %>%
ggplot(aes(x = recall, y = precision)) +
geom_path(colour = "#1DB954", linewidth = 0.9) +
geom_hline(yintercept = val_hit_rate, linetype = "dashed", colour = "grey50", linewidth = 0.5) +
geom_point(data = op_points_pr, aes(x = recall, y = precision, colour = operating_point),
size = 3.5, shape = 16) +
ggrepel::geom_label_repel(
data = op_points_pr,
aes(x = recall, y = precision, label = label, colour = operating_point),
size = 3,
nudge_y = 0.05,
show.legend = FALSE
) +
scale_colour_manual(values = c("Best F1" = "#e74c3c",
"Best Youden's J" = "#3498db",
"High Recall (>=0.70)" = "#2ecc71")) +
labs(
title = "RF Tuned (Weighted): Precision-Recall Curve with Operating Points",
subtitle = paste0("Validation set | PR-AUC = ", round(prauc, 4)),
x = "Recall", y = "Precision", colour = "Operating Point"
) +
theme_minimal() +
theme(legend.position = "bottom")
ggsave(file.path(thresh_dir, "pr_curve_with_operating_points.png"), p_pr_curve,
width = 9, height = 6, dpi = 300)This view makes the trade-off concrete. Moving to a lower threshold pulls the model rightward along the PR curve toward higher recall but lower precision. That visual relationship is what justifies using the F1-optimal threshold as a balanced operating point before the final held-out evaluation.
The final evaluation script consumed the held-out test set exactly once. The classification model was the tuned, weighted random forest, scored at the locked threshold of 0.19.
# analysis/master_script.R - final test-set classification evaluation
test_prob_df <- predict(rf_cls_final_fit, test_cls, type = "prob") %>%
bind_cols(test_cls %>% select(hit_label)) %>%
rename(truth = hit_label, prob_hit = .pred_hit)
test_prauc <- pr_auc(test_prob_df, truth = truth, prob_hit, event_level = "second")$.estimate
test_rocauc <- roc_auc(test_prob_df, truth = truth, prob_hit, event_level = "second")$.estimate
test_pred_df <- test_prob_df %>%
mutate(.pred_class = factor(
if_else(prob_hit >= CLASSIFICATION_THRESHOLD, "hit", "non_hit"),
levels = c("non_hit", "hit")
))
truth_bin_test <- as.integer(test_pred_df$truth == "hit")
pred_bin_test <- as.integer(test_pred_df$.pred_class == "hit")
TP_test <- sum(pred_bin_test == 1L & truth_bin_test == 1L)
FP_test <- sum(pred_bin_test == 1L & truth_bin_test == 0L)
FN_test <- sum(pred_bin_test == 0L & truth_bin_test == 1L)
TN_test <- sum(pred_bin_test == 0L & truth_bin_test == 0L)
prec_test <- if ((TP_test + FP_test) > 0) TP_test / (TP_test + FP_test) else NA_real_
rec_test <- if ((TP_test + FN_test) > 0) TP_test / (TP_test + FN_test) else NA_real_
spec_test <- TN_test / (TN_test + FP_test)
f1_test <- if (!is.na(prec_test) && !is.na(rec_test) && (prec_test + rec_test) > 0)
2 * prec_test * rec_test / (prec_test + rec_test) else NA_real_
bal_acc_test <- (rec_test + spec_test) / 2
cls_test_summary <- bind_rows(
# Validation metrics are hardcoded from the saved output file (classification_test_results.csv)
# because the fitted model object is not reloaded during knitting. The test split
# was evaluated once with CLASSIFICATION_THRESHOLD = 0.19, fixed before any test access.
tibble(split = "validation", model = "RF Tuned (weighted)", threshold = CLASSIFICATION_THRESHOLD,
pr_auc = 0.3117, roc_auc = 0.9042, f1 = 0.3590, recall = 0.4300,
precision = 0.3080, specificity = 0.9660, balanced_acc = 0.6980,
tp = 196L, fp = 440L, fn = 260L, tn = 12540L),
tibble(split = "test", model = "RF Tuned (weighted)", threshold = CLASSIFICATION_THRESHOLD,
pr_auc = round(test_prauc, 4), roc_auc = round(test_rocauc, 4),
f1 = round(f1_test, 4), recall = round(rec_test, 4), precision = round(prec_test, 4),
specificity = round(spec_test, 4), balanced_acc = round(bal_acc_test, 4),
tp = TP_test, fp = FP_test, fn = FN_test, tn = TN_test)
)
write_csv(cls_test_summary, file.path(test_dir, "classification_test_results.csv"))| split | model | threshold | pr_auc | roc_auc | f1 | recall | precision | specificity | balanced_acc | tp | fp | fn | tn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| validation | RF Tuned (weighted) | 0.19 | 0.3117 | 0.9042 | 0.3590 | 0.4300 | 0.3080 | 0.9660 | 0.6980 | 196 | 440 | 260 | 12540 |
| test | RF Tuned (weighted) | 0.19 | 0.2620 | 0.8846 | 0.3237 | 0.3919 | 0.2757 | 0.9625 | 0.6772 | 185 | 486 | 287 | 12480 |
The final test PR-AUC is 0.2620, the ROC-AUC is 0.8846, the F1 score is 0.3237, recall is 0.3919, and precision is 0.2757. The confusion-matrix totals in the saved output are TP = 185, FP = 486, FN = 287, and TN = 1.248^{4}.
The classifier is meaningfully better than random ranking and can surface hit candidates for human review, but precision remains modest because the event is intrinsically rare. These numbers do not support fully automated binary decision-making.
# analysis/master_script.R - saved PR curve for the final test-set classifier
ggsave(file.path(test_dir, "rf_cls_pr_curve_test.png"),
test_prob_df %>%
pr_curve(truth = truth, prob_hit, event_level = "second") %>%
ggplot(aes(x = recall, y = precision)) +
geom_path(colour = "#1DB954", linewidth = 0.9) +
geom_hline(yintercept = n_test_hits / nrow(test_cls), linetype = "dashed", colour = "grey50") +
geom_point(aes(x = rec_test, y = prec_test), colour = "#e74c3c", size = 3.5, shape = 16) +
annotate("text", x = rec_test + 0.02, y = prec_test + 0.03,
label = paste0("F1-opt thr=", CLASSIFICATION_THRESHOLD, "\nF1=", round(f1_test, 3)),
colour = "#e74c3c", size = 3.2, hjust = 0) +
labs(title = "RF Tuned (Weighted): PR Curve - Test Set",
subtitle = paste0("PR-AUC = ", round(test_prauc, 4), " | Red = F1-optimal operating point"),
x = "Recall", y = "Precision") +
theme_minimal(),
width = 8, height = 5, dpi = 300)The ROC curve provides a complementary threshold-free view. Unlike the PR curve, ROC is less sensitive to class imbalance, so the high ROC-AUC here confirms genuine discriminative ability but should not be cited as the headline performance figure for this imbalanced task.
# analysis/master_script.R - ROC curve on the final test set
ggsave(file.path(test_dir, "rf_cls_roc_curve_test.png"),
test_prob_df %>%
roc_curve(truth = truth, prob_hit, event_level = "second") %>%
ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_path(colour = "#3498db", linewidth = 0.9) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey50") +
geom_point(aes(x = 1 - spec_test, y = rec_test), colour = "#e74c3c", size = 3.5, shape = 16) +
labs(title = "RF Tuned (Weighted): ROC Curve — Test Set",
subtitle = paste0("ROC-AUC = ", round(test_rocauc, 4), " | Red = F1-optimal operating point"),
x = "1 − Specificity (FPR)", y = "Sensitivity (Recall / TPR)") +
theme_minimal(),
width = 8, height = 5, dpi = 300)The test ROC-AUC is 0.8846, which places the classifier well above the no-skill diagonal. The red point on the ROC curve marks the operating point chosen at the F1-optimal threshold of 0.19, consistent with the marked point on the PR curve above.
# analysis/master_script.R - cross-stage summary exported for generalisation checks
rf_reg_cv_rmse <- 15.33
rf_cls_cv_prauc <- 0.2695
gen_summary <- tibble(
model = c(rep("RF Regression", 3), rep("RF Classification", 3)),
stage = rep(c("CV (3-fold)", "Validation", "Test"), 2),
primary_metric = c(
rf_reg_cv_rmse,
rf_reg_val_metrics %>% filter(.metric == "rmse") %>% pull(.estimate),
rf_reg_test_metrics %>% filter(.metric == "rmse") %>% pull(.estimate),
rf_cls_cv_prauc, 0.3117, test_prauc
),
metric_name = c(rep("RMSE", 3), rep("PR-AUC", 3))
)
write_csv(gen_summary, file.path(test_dir, "generalisation_summary.csv"))| model | stage | primary_metric | metric_name |
|---|---|---|---|
| RF Regression | CV (3-fold) | 15.3300 | RMSE |
| RF Regression | Validation | 14.6878 | RMSE |
| RF Regression | Test | 15.1610 | RMSE |
| RF Classification | CV (3-fold) | 0.2695 | PR-AUC |
| RF Classification | Validation | 0.3117 | PR-AUC |
| RF Classification | Test | 0.2620 | PR-AUC |
The regression model generalises stably: test RMSE is only modestly worse than validation RMSE. The classification model shows a clearer validation-to-test drop in PR-AUC, which is expected because model selection and threshold choice were both performed on the validation split. This should be interpreted as mild validation optimism rather than severe overfitting, especially because the ROC-AUC remains high and the test PR-AUC still indicates useful ranking ability.
The interpretability stage uses impurity-based variable importance
from the final tuned random forest models and then groups one-hot
encoded variables back into their parent fields. The main reason for
this grouping is that raw individual importances are difficult to
compare when one predictor such as track_genre expands into
many dummies.
# analysis/master_script.R - grouped importance extraction
extract_grouped_importance <- function(fit, model_label) {
vi_raw <- extract_fit_engine(fit)$variable.importance
indiv <- tibble(
feature = names(vi_raw),
importance = as.numeric(vi_raw),
model = model_label
)
grouped <- indiv %>%
mutate(feature_group = case_when(
str_starts(feature, "track_genre_") ~ "track_genre",
str_starts(feature, "key_") ~ "key",
str_starts(feature, "time_signature_") ~ "time_signature",
feature == "explicit_explicit" ~ "explicit",
feature == "mode_major" ~ "mode",
TRUE ~ feature
)) %>%
group_by(feature_group, model) %>%
summarise(importance = sum(importance), .groups = "drop") %>%
rename(feature = feature_group) %>%
arrange(desc(importance))
}# analysis/master_script.R - saved grouped importance tables and compact report subsets
# Note: extract_grouped_importance() returns a list with $individual and $grouped elements.
# When re-loading from the saved CSVs (as in the report setup chunk), reg_imp and cls_imp
# are flat tibbles equivalent to the $grouped element here.
reg_imp <- extract_grouped_importance(rf_reg_final_fit, "Regression")
cls_imp <- extract_grouped_importance(rf_cls_final_fit, "Classification")
write_csv(reg_imp$grouped, file.path(interp_dir, "reg_grouped_importance.csv"))
write_csv(cls_imp$grouped, file.path(interp_dir, "cls_grouped_importance.csv"))
reg_top5 <- reg_imp$grouped %>% slice_head(n = 5)
cls_top5 <- cls_imp$grouped %>% slice_head(n = 5)| feature | model | importance |
|---|---|---|
| track_genre | Regression | 9985790 |
| acousticness | Regression | 1553402 |
| duration_min | Regression | 1543337 |
| loudness | Regression | 1496455 |
| danceability | Regression | 1469590 |
| feature | model | importance |
|---|---|---|
| track_genre | Classification | 8652.97 |
| duration_min | Classification | 1787.03 |
| loudness | Classification | 1663.75 |
| acousticness | Classification | 1619.96 |
| valence | Classification | 1482.59 |
The main interpretive conclusion is consistent across both
objectives: track_genre is the most important feature group
by a wide margin. After genre, audio features such as duration,
loudness, acousticness, energy, and danceability matter, but none
challenge genre’s overall contribution. This supports both the EDA and
the earlier correction that restored genre to the baseline model
specification.
Worth noting, however: impurity-based importance is biased toward high-cardinality and correlated features. Because genre generates many dummies, and because features such as loudness and energy are correlated, the grouped rankings are best treated as strong directional evidence rather than exact causal shares.
The pipeline also generated a side-by-side comparison plot of grouped importance for the regression and classification models, a supporting output that shows whether the two predictive objectives draw on broadly similar feature signals.
# analysis/master_script.R - grouped importance comparison across objectives
compare_grouped <- bind_rows(
reg_imp$grouped %>% mutate(model = "Regression"),
cls_imp$grouped %>% mutate(model = "Classification")
) %>%
group_by(model) %>%
mutate(importance_pct = 100 * importance / sum(importance)) %>%
ungroup()
write_csv(compare_grouped, file.path(interp_dir, "importance_comparison_grouped.csv"))
p_compare <- compare_grouped %>%
mutate(feature = fct_reorder(feature, importance_pct, .fun = max)) %>%
ggplot(aes(x = importance_pct, y = feature, fill = model)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c(Regression = "#1DB954", Classification = "#e74c3c")) +
labs(
title = "Regression vs Classification: Grouped Feature Importance",
subtitle = "Normalised to % of total importance within each model.",
x = "Importance (% of total)", y = NULL, fill = "Model"
) +
theme_minimal() +
theme(legend.position = "bottom")
ggsave(file.path(interp_dir, "importance_comparison_plot.png"),
p_compare, width = 9, height = 6, dpi = 300)The comparison confirms that genre dominates both objectives, but it also shows that the ranking of secondary audio signals is not identical. That is important for interpretation: predicting general popularity and predicting rare hit status are related tasks, but not perfectly interchangeable ones.
The sensitivity analysis investigates one of the central limitations of the study: zero-popularity tracks may be structurally hard to learn from audio features because their observed popularity can reflect distributional or promotional absence rather than sound.
# analysis/master_script.R - post-hoc zero-popularity sensitivity analysis
compute_metrics_sens <- function(preds_df, label) {
m <- reg_metrics(preds_df, truth = truth, estimate = estimate)
tibble(
subset = label,
n = nrow(preds_df),
rmse = m %>% filter(.metric == "rmse") %>% pull(.estimate) %>% round(4),
mae = m %>% filter(.metric == "mae") %>% pull(.estimate) %>% round(4),
rsq = m %>% filter(.metric == "rsq") %>% pull(.estimate) %>% round(4),
mean_pred = round(mean(preds_df$estimate), 4),
median_pred = round(median(preds_df$estimate), 4),
mean_truth = round(mean(preds_df$truth), 4),
median_truth = round(median(preds_df$truth), 4)
)
}
test_preds_all <- rf_reg_test_preds %>%
mutate(
is_zero = truth == 0,
subset = if_else(is_zero, "zero_popularity", "non_zero_popularity"),
residual = truth - estimate,
abs_error = abs(residual)
)
metrics_full <- compute_metrics_sens(test_preds_all, "Full test set")
metrics_nonzero <- compute_metrics_sens(
test_preds_all %>% filter(!is_zero),
"Non-zero popularity (popularity > 0)"
)
preds_zero <- test_preds_all %>% filter(is_zero)
metrics_zero <- tibble(
subset = "Zero popularity (popularity == 0)",
n = nrow(preds_zero),
rmse = round(sqrt(mean(preds_zero$estimate^2)), 4),
mae = round(mean(abs(preds_zero$estimate)), 4),
rsq = NA_real_,
mean_pred = round(mean(preds_zero$estimate), 4),
median_pred = round(median(preds_zero$estimate), 4),
mean_truth = 0,
median_truth = 0
)
metrics_combined <- bind_rows(metrics_full, metrics_nonzero, metrics_zero)
write_csv(metrics_combined, file.path(sensitivity_dir, "zero_popularity_sensitivity_metrics.csv"))| subset | n | rmse | mae | rsq | mean_pred | median_pred | mean_truth | median_truth |
|---|---|---|---|---|---|---|---|---|
| Full test set | 13441 | 15.1610 | 10.4680 | 0.4579 | 32.8439 | 35.1844 | 33.2919 | 33 |
| Non-zero popularity (popularity > 0) | 12032 | 13.9242 | 9.7711 | 0.4271 | 34.7673 | 36.1937 | 37.1905 | 37 |
| Zero popularity (popularity == 0) | 1409 | 23.1743 | 16.4195 | NA | 16.4195 | 11.1072 | 0.0000 | 0 |
The saved output shows that removing zero-popularity tracks reduces RMSE from 15.1610 to 13.9242 and MAE from 10.4680 to 9.7711. The zero-popularity subset has an RMSE of 23.1743 and a mean predicted popularity of 16.4195 despite an actual mean of 0. This pattern strongly supports the report’s interpretation that zero-popularity tracks are systematically overpredicted and form a noise floor for the regression objective.
# analysis/master_script.R - zero-popularity diagnostic plot saved for the report
ggsave(file.path(sensitivity_dir, "zero_popularity_actual_vs_pred.png"),
ggplot(test_preds_all, aes(x = truth, y = estimate, colour = is_zero)) +
geom_point(alpha = 0.25, size = 0.7, data = test_preds_all %>% filter(!is_zero)) +
geom_point(alpha = 0.55, size = 1.0, colour = "#e74c3c", data = test_preds_all %>% filter(is_zero)) +
geom_abline(slope = 1, intercept = 0, colour = "black", linewidth = 0.7, linetype = "dashed") +
scale_colour_manual(values = c("FALSE" = "#2980b9", "TRUE" = "#e74c3c"),
labels = c("FALSE" = "Popularity > 0", "TRUE" = "Popularity = 0"),
name = "Track type") +
labs(
title = "Actual vs Predicted Popularity - RF Tuned (Test Set)",
subtitle = sprintf("Red = zero-popularity tracks (n = %d, %.1f%% of test set)", n_zero_test, pct_zero_test),
x = "Actual Popularity", y = "Predicted Popularity",
caption = "Dashed = perfect prediction. Zero-popularity tracks are systematically over-predicted."
) +
theme_minimal(base_size = 12) +
theme(legend.position = "bottom", plot.title = element_text(face = "bold")),
width = 8, height = 6, dpi = 150)No separate calibration section is included in the report because the project scripts and saved outputs do not implement a standalone probability-calibration procedure. The available pipeline supports threshold optimisation, final test evaluation, feature-importance analysis, and zero-popularity sensitivity analysis, and the report limits itself to those supported processes.
With test results in hand, the four research questions posed at the outset can now be answered directly.
The RF Tuned model achieved R² = 0.458 on the test set, below the 0.60–0.70 range reported in comparable studies that incorporate genre alongside audio features (Jiang, 2024; Saragih, 2023). Three factors are likely responsible for this gap.
Zero-popularity noise floor (10.54% of tracks): A jazz track with high instrumentalness and low loudness could legitimately have popularity 0 (never promoted) or popularity 35 (niche but active). The audio signal is identical; the observed popularity diverges. These structurally ambiguous observations inflate RMSE without providing learnable signal, and they account for roughly one-tenth of the dataset. A post-hoc sensitivity analysis (Script 10) quantifies the effect: excluding the 1,409 zero-popularity tracks in the test set reduces RMSE by 8.2% (15.16 → 13.92). The model assigns these tracks a mean predicted popularity of 16.4, consistent with their audio features resembling low-but-not-zero content; their observed score of zero reflects the absence of promotional activity, playlist placement, or release support, none of which any audio feature can encode.
Audio-only feature ceiling: Popularity driven by marketing investment, social virality, and cultural events leaves no trace in an audio feature vector. Studies achieving R² > 0.70 typically incorporate additional signals: artist metadata (Arora & Rani, 2024), chart history (Zhao et al., 2023), or release recency (Saragih, 2023), none of which are available in the present dataset. The absence of release dates is particularly consequential, given Spotify’s well-documented recency bias in popularity scoring.
Conservative tuning grid: The 3-fold CV combined with a 3×3 parameter grid is less exhaustive than the 5-fold search originally planned. A broader search would be expected to yield modest further improvements.
Taken together, these constraints largely explain the gap. An MAE of 10.47 popularity points with stable generalisation across splits still constitutes a practically useful signal for pre-release track ranking: the model orders the broad middle of the popularity distribution (20–70) reliably and identifies structural patterns that remain consistent across training and test data, including low instrumentalness and high danceability/loudness associated with higher scores.
A test PR-AUC of 0.262 represents a 7.5x lift over the no-skill baseline of 0.035 (the hit prevalence). At this class ratio (1:28), a PR-AUC of 0.262 indicates that the model has learned meaningful discriminative structure from audio and genre features, even though a large absolute performance gap from more balanced classification problems remains.
ROC-AUC = 0.885 is consistent with comparable studies: Zhao et al. (2023) reported AUC = 0.91 on a smaller, far more balanced dataset (1:6 imbalance and additional metadata features). The ROC-AUC measure is less sensitive to the minority class at extreme imbalance, so the PR-AUC result is more informative here.
At threshold = 0.19, the model identifies 39% of genuine hit tracks while flagging approximately 3.8% of non-hit tracks as false positives (1 − specificity = 0.038). In a catalogue of 1,000 tracks, the model would:
This performance is appropriate for a soft-ranking and shortlisting role: use the model to surface the top-scored tracks for human A&R or marketing review, rather than as a binary green/red decision system. The model substantially reduces the search space for human reviewers.
To frame this in terms a marketing team can act on: without any predictive tool, randomly selecting tracks for promotional investment would encounter approximately 1 genuine hit per 28 tracks, the natural base rate. Deploying the model at threshold = 0.19 raises this to approximately 1 genuine hit per 3.6 tracks promoted, a 7.8x improvement in the return on evaluation effort. For a label processing 500 new submissions in a given quarter, the model reduces the shortlist requiring detailed human review from 500 to roughly 25 candidates, whilst still capturing 39% of the genuine hits in that pool. The value is not in replacing human judgement, but in concentrating it.
No single threshold simultaneously achieves recall > 0.5 and precision > 0.35; this is a property of the hit definition (3.49% positive rate), not a modelling failure. Marketing teams deploying this tool should set explicit expectations about their preferred precision-recall operating point before implementation.
The single most actionable finding from both models is the dominance
of genre: track_genre accounts for 38% of impurity
reduction in regression and 33% in classification, more than all audio
features combined. This is consistent with Jung & Mayer (n.d.) who
found genre (specifically EDM) to be the dominant predictor on the same
114K Kaggle dataset.
Marketing implication: Promotional decisions guided by genre-level popularity profiles will have more leverage than audio feature optimisation. Specific implications:
A methodological note on genre dominance: When a single variable
accounts for 38% of total impurity reduction, a legitimate question
arises: is the model primarily learning genre-level popularity averages
rather than genuine audio signal? If so, it would offer little beyond a
genre-mean lookup table. Three observations mitigate this concern,
though they do not resolve it entirely. First, the remaining 62% of
impurity reduction comes from continuous audio features and structural
predictors, which vary considerably within genres. Second, a genre-blind
baseline (DL-003) achieved R² = 0.259: audio features alone account for
a meaningful fraction of variance before any genre signal is introduced,
suggesting the audio component is not trivially small. Third, the
classification model achieves PR-AUC = 0.262 across all genres,
including high-hit-rate genres such as anime where a pure genre-mean
approach would fail to distinguish among individual tracks. A caveat is
warranted, however: impurity-based importance is known to overstate the
importance of high-cardinality features (112 OHE dummy columns for
track_genre), and permutation importance would provide a
more conservative and unbiased assessment. The finding that genre
dominates should be treated as a lower bound on audio feature
contribution rather than a precise decomposition.
Within audio features, the consistent signals across both models are:
Five limitations bound what can be claimed from these findings:
Zero-popularity noise floor (10.54%): Tracks with popularity = 0 inflate RMSE without providing learnable signal. A post-hoc sensitivity analysis confirms this quantitatively: excluding the 1,409 zero-popularity tracks from the test set reduces RMSE by 8.2% (15.16 → 13.92). Notably, R² on the non-zero subset (0.427) is lower than the full-set figure (0.458): removing these extreme low-end observations reduces total variance more than residual variance, so the relative fit measure falls even as absolute errors improve. The model cannot distinguish zero-popularity tracks from genuinely low-popularity content on audio features alone; their observed popularity of 0 reflects promotional absence rather than any detectable audio characteristic. Resolving this limitation would require engagement and promotion data outside the audio feature space.
Validation optimism in classification (≈0.050 PR-AUC): Both model selection and threshold selection used the same 15% validation split. The test PR-AUC (0.262) is the reliable unbiased estimate; the validation figure (0.312) is systematically optimistic and should not be reported as the model’s true performance.
Impurity-based importance bias: The feature importance results
may overweight track_genre due to its high cardinality (112
OHE dummy columns). Permutation importance would provide an unbiased
correction.
Conservative cross-validation: 3-fold CV was used for tuning (a computational trade-off; the methodology specified 5-fold). Reduced fold count increases variance in hyperparameter selection, particularly for the classification problem where the minority class is thinly distributed across folds.
Audio-only feature set: No release-date, artist streaming history, social media engagement, or marketing spend signals are available. These non-audio factors are consistently identified as critical to closing the R² gap towards the 0.70+ range achievable in richer datasets (Arora & Rani, 2024; Zhao et al., 2023).
Two predictive objectives were pursued on a cleaned dataset of 89,578 unique Spotify tracks, derived from a 114,000-row raw source with 21 original variables and 113 retained genres after deduplication, using a reproducible R pipeline that covered data cleaning, exploratory analysis, hyperparameter tuning, threshold optimisation, and held-out test evaluation.
Objective 1: Popularity Regression. A tuned Random Forest achieved RMSE = 15.16, MAE = 10.47, and R² = 0.458 on the held-out test set, with negligible degradation from validation (RMSE gap: +0.47). Genre is the single dominant predictor (38% of impurity reduction), followed by acousticness, duration, and loudness. The R² gap relative to comparable studies is principally explained by a 10.54% zero-popularity noise floor and the absence of recency and marketing signals, neither of which can be addressed within an audio-only feature set. At 10.47 popularity points MAE, the model provides a useful directional ranking signal for pre-release promotional prioritisation, particularly for tracks in the 20–70 popularity band.
Objective 2: Hit Classification. A tuned weighted Random Forest achieved PR-AUC = 0.262 and ROC-AUC = 0.885 on the test set at a threshold of 0.19, a 7.5x lift over the no-skill PR-AUC of 0.035. At this threshold, the model identifies 39% of genuine hits whilst maintaining 28% precision. The appropriate use is soft-ranking and shortlisting: surfacing candidate tracks for human A&R or marketing review, rather than binary automated selection. The precision-recall ceiling at 3.49% hit prevalence cannot be overcome by audio features alone.
Cross-cutting finding. Both models confirm that genre is the dominant signal. Audio feature targeting has limited leverage relative to genre-level portfolio decisions. Promotional investment should follow genre-level popularity baselines: directing budgets towards tracks in consistently high-popularity genres (chill, K-pop, Latin, Brazil) and exercising caution with genres characterised by high zero-popularity rates (jazz, Iranian, romance). Low instrumentalness is the single strongest audio feature signal for both regression accuracy and hit classification: vocal-forward productions hold a systematic advantage in the streaming environment.
Future work. Three signal gaps stand out. Release recency is perhaps the most tractable: Spotify’s scoring is recency-biased, and incorporating release dates would likely yield the largest single improvement in R². Artist metadata (historical streaming popularity and social following) ranks as the next highest-value addition based on evidence from comparable studies (Arora & Rani, 2024). Marketing and promotional context (playlist placement, advertising spend, chart history) is harder to obtain but represents the signal category most likely to breach the R² > 0.70 threshold. Permutation-based feature importance should also be computed to provide an unbiased estimate of the genre-versus-audio decomposition.
Arora, S., & Rani, P. (2024). Soundtrack success: Unveiling song popularity patterns using machine learning implementation. SN Computer Science, 5(278). https://doi.org/10.1007/s42979-024-02619-5
Jiang, S. (2024). Predicting music popularity: A machine learning approach using Spotify data. In Proceedings of the 1st International Conference on Modern Logistics and Supply Chain Management (MLSCM 2024) (pp. 324–328). SCITEPRESS. https://doi.org/10.5220/0013330000004558
Jung, N. S., & Mayer, F. (n.d.). Beyond beats: A recipe to song popularity? A machine learning approach [Unpublished student paper]. Digital Science Center, University of Innsbruck. [4, 5] Khan, F., Tarimer, I., Alwageed, H. S., Karadağ, B. C., Fayaz, M., Abdusalomov, A. B., & Cho, Y.-I. (2022). Effect of feature selection on the accuracy of music popularity classification using machine learning algorithms. Electronics, 11(21), 3518. https://doi.org/10.3390/electronics11213518
Saragih, H. S. (2023). Predicting song popularity based on Spotify’s audio features: Insights from the Indonesian streaming users. Journal of Management Analytics, 10(4), 693–709. https://doi.org/10.1080/23270012.2023.2239824
Zhao, M., Harvey, M., Cameron, D., Hopfgartner, F., & Gillet, V. J. (2023). An analysis of classification approaches for hit song prediction using engineered metadata features with lyrics and audio features. In I. Sserwanga, A. Goulding, H. Moulaison-Sandy, J. T. Du, A. L. Soares, V. Hessami, & R. D. Frank (Eds.), Information for a Better World: Normality, Virtuality, Physicality, Inclusivity: 18th International Conference, iConference 2023 (pp. 303–311). Springer. https://doi.org/10.1007/978-3-031-28035-1_21