Member (Group 14):

  1. Ho Wei Lam (S2152619)
  2. Lee Chun Ren (25082862)
  3. SHEN HAOCHEN (25074646)
  4. QI XIANGDONG (S2006507)
  5. Ong Qing Sheng (S2040806)

1 Introduction

Music labels, streaming curators, and A&R teams face a common resource allocation problem: promotional budgets must be committed before a track’s commercial performance is known. Playlist placements, advertising spend, and artist development attention are finite. The dominant strategy of observing early streaming numbers and reacting is reactive and leaves money on the table for tracks that could have benefited from earlier investment.

Spotify assigns every track a popularity score (0–100), computed from recent stream counts relative to all other tracks. This score is the most directly observable proxy for commercial performance on the platform. Audio features (danceability, energy, loudness, acousticness, and genre) are measured at the point of production, before any marketing decision is made. If these pre-release signals carry predictive power for the eventual popularity score, then marketing teams can use them to build a prioritisation list: which tracks are worth a major promotional push, and which should receive minimal spend.

This project develops a machine learning decision-support system grounded in this premise. The system is not a replacement for human judgement. No audio feature model can capture the social moments, meme virality, or tastemaker endorsements that drive some of the biggest hits. The models described here function as a first-pass quantitative screen: a tool that adds an evidence base to a process that is otherwise entirely subjective.

A prerequisite for pre-release applicability is that the input features must be observable before a track accumulates significant streaming data. All audio features used in this analysis are available through the Spotify Web API at the point of track ingestion, before any streaming activity has accumulated, making them genuinely accessible as early-stage inputs to marketing prioritisation decisions.

1.1 How To Read This Report

This report is designed to be read as a technical summary rather than as a script archive. Each major section explains why a step was necessary, shows the original project code used to carry it out, presents the resulting saved output where relevant, interprets what that output means, and then links the result to the next stage of the analysis. Readers therefore do not need to inspect every line of code in detail, but the code is included so that the process remains transparent and reproducible.

1.2 Reproducibility Setup

The integrated workflow begins with a reproducibility layer that loads packages, defines relative paths, creates output directories, and locks the final classification threshold used later in held-out testing. The report does not rerun the full modelling pipeline during knitting, but this setup code is part of the reproducibility story because every downstream script depends on the same paths and constants.

# analysis/master_script.R - package and path setup
required_pkgs <- c(
  "tidyverse", "scales",
  "tidymodels",
  "ranger", "xgboost",
  "vip", "hardhat",
  "doParallel", "ggrepel"
)

raw_path        <- "../data/dataset.csv"
processed_dir   <- "../data/processed"
clean_path      <- file.path(processed_dir, "spotify_cleaned.csv")
fig_dir         <- "../output/figures"
table_dir       <- "../output/tables"
eda_dir         <- "../output/eda"
reg_out_dir     <- "../output/regression"
cls_out_dir     <- "../output/classification"
tune_dir        <- "../output/tuning"
thresh_dir      <- "../output/threshold"
test_dir        <- "../output/test_evaluation"
interp_dir      <- "../output/interpretability"
sensitivity_dir <- "../output/sensitivity"

for (d in c(processed_dir, fig_dir, table_dir, eda_dir,
            reg_out_dir, cls_out_dir, tune_dir, thresh_dir,
            test_dir, interp_dir, sensitivity_dir)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

n_threads <- min(4L, max(1L, parallel::detectCores() - 1L))

FORCE_RETUNE <- FALSE
CLASSIFICATION_THRESHOLD <- 0.19  # locked before any test-set access; see threshold section

2 Research Aim and Objectives

Research Aim: To build and evaluate predictive models that estimate Spotify track popularity and identify potential hit tracks from audio and genre features, and to derive actionable genre and audio signal insights for music marketing decision-making.

Research Objectives:

  1. Identify the audio features and genre categories most predictive of commercial success.
  2. Evaluate three regression models to predict continuous popularity scores (0–100).
  3. Evaluate three classification models to identify hit tracks (popularity ≥ 70) under severe class imbalance.
  4. Assess the practical ceiling and limitations of audio-only prediction for marketing applications.

Research Questions:

  1. Which audio features and genre categories are most strongly associated with higher Spotify popularity scores?
  2. Can machine learning models predict popularity scores with sufficient accuracy to act as useful pre-release ranking tools?
  3. Can a classification model identify potential hits at a practically useful precision-recall operating point, given a 1:28 class imbalance?
  4. What does the feature importance structure tell us about where to focus production and marketing resources?

The project uses both regression and classification because the two tasks answer different decision needs. Regression estimates how popular a track may become on a continuous 0-100 scale, which supports ranking and prioritisation. Classification asks whether a track is likely to cross a commercially meaningful threshold of 70, which supports shortlisting likely “hit” candidates for human review.

3 Supporting Literature

The published literature converges on a set of findings directly relevant to this project. Genre and audio characteristics, particularly loudness, energy, and danceability, emerge as the primary popularity predictors across studies using Spotify data (Arora & Rani, 2024; Jiang, 2024; Jung & Mayer, n.d.; Saragih, 2023). Tree-based ensembles (Random Forest, XGBoost) consistently outperform linear and deep learning alternatives at this task (Arora & Rani, 2024; Jung & Mayer, n.d.; Khan et al., 2022). Reported R² values range from 0.35 to 0.70, with the upper end reached only when genre is explicitly encoded (Jiang, 2024; Jung & Mayer, n.d.). These findings informed the algorithm selection and evaluation design here.

4 Dataset Context

The project uses the Spotify Tracks Dataset (Kaggle: dataset.csv, supplied in the repository materials), which was scrapped on 22 Oct 2022. The available project documents confirm that the raw dataset contains 114,000 observations, 21 variables, and a genre-structured catalogue design with roughly 1,000 tracks per genre in the raw data. Because the project files do not provide a formal dataset year inside the raw CSV itself, the report treats the dataset as a project snapshot rather than asserting a source timestamp beyond the available materials.

The cleaned dataset contains 89,578 unique tracks across 113 genres.

The dataset contains three broad variable groups: identifier or text fields (track_id, track_name, album_name, artists), categorical descriptors (track_genre, explicit, mode, key, time_signature), and continuous audio measurements (danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms) together with the target variable popularity.

4.1 Dataset Overview

The overview table below is not a manually typed summary. It comes from the EDA stage, where the cleaned dataset was profiled and then written to ../output/tables/eda_overview.csv for reuse in the report.

# analysis/master_script.R - overview table exported during EDA
eda_overview <- tibble(
  metric = c(
    "Number of tracks", "Number of genres",
    "Average popularity", "Median popularity",
    "Average duration (min)",
    "High popularity tracks", "Low popularity tracks"
  ),
  value = c(
    nrow(spotify),
    n_distinct(spotify$track_genre),
    round(mean(spotify$popularity), 2),
    round(median(spotify$popularity), 2),
    round(mean(spotify$duration_min), 2),
    sum(spotify$high_popularity_label == "high"),
    sum(spotify$high_popularity_label == "low")
  )
)

write_csv(eda_overview, file.path(table_dir, "eda_overview.csv"))
Cleaned dataset overview from saved project outputs.
metric value
Number of tracks 89578.00
Number of genres 113.00
Average popularity 33.19
Median popularity 33.00
Average duration in minutes 3.82
High popularity tracks 45652.00
Low popularity tracks 43926.00

The raw dataset starts with 114,000 observations and 21 variables. After cleaning and deduplication, the modelling dataset contains 89,578 rows and 33 columns. The project retains audio features and genre as predictors while excluding pure identifiers and text metadata such as track_id, track_name, album_name, and artists from the baseline models. These excluded fields are useful for cleaning and descriptive context but would either cause leakage, create sparse identifiers, or undermine generalisability if passed directly into the models.

Before any cleaning rule was applied, the project inspected the raw structure directly. These checks were diagnostic rather than report-facing outputs, but they are part of the actual process and help explain why the later validation filters exist.

# analysis/01.R and analysis/master_script.R - raw import and inspection
spotify_raw <- read_csv(
  raw_path,
  locale = locale(encoding = "Windows-1252"),
  show_col_types = FALSE
)

if ("...1" %in% names(spotify_raw)) {
  spotify_raw <- spotify_raw %>% rename(source_row = `...1`)
}

cat("Raw dataset dimensions:", nrow(spotify_raw), "rows x", ncol(spotify_raw), "columns\n")
print(names(spotify_raw))
glimpse(spotify_raw)
print(summary(select(spotify_raw, where(is.numeric))))

5 Data Preparation and Quality Control

The workflow begins with a raw-data audit and a set of explicit quality controls. The requirement brief asks for dataset details, structure, and summary before modelling; this project operationalises that requirement by checking missing values, duplicate rows, duplicate track_ids, invalid ranges, and data types before any split is created.

Three data issues mattered most:

  1. The raw export contained an unnamed index column that had to be relabelled and ignored analytically.
  2. Duplicate track_ids were common: 16,641 track identifiers appeared more than once.
  3. A small number of rows failed completeness or validity rules, including one zero-duration record.

Duplicate handling was treated as a methodological issue rather than a trivial cleaning step. Duplicate track_ids were not simple data-entry errors; they were mostly cross-genre listings of the same track. If left unresolved, identical songs could appear in both training and test partitions, producing hidden leakage. The adopted solution was neutral first-occurrence deduplication without sorting by popularity first, which avoided biasing the regression target upward.

The corresponding raw-data checks are shown below. These exact calculations established the amount of missingness and duplication before cleaning decisions were applied.

# analysis/01.R and analysis/master_script.R - missing and duplicate checks
missing_summary <- spotify_raw %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "missing_count") %>%
  mutate(missing_percent = round(missing_count / nrow(spotify_raw) * 100, 3)) %>%
  arrange(desc(missing_count))

duplicate_track_id_count <- spotify_raw %>%
  filter(!is.na(track_id)) %>%
  count(track_id, name = "record_count") %>%
  filter(record_count > 1) %>%
  nrow()

duplicate_full_row_count <- spotify_raw %>%
  select(-any_of("source_row")) %>%
  duplicated() %>%
  sum()
Raw-data columns with non-zero missing counts before cleaning.
column missing_count missing_percent
artists 1 0.001
album_name 1 0.001
track_name 1 0.001
# analysis/master_script.R - data validation and neutral deduplication
spotify_cleaned <- spotify_raw %>%
  mutate(
    across(where(is.character), ~ str_squish(.x)),
    across(where(is.character), ~ na_if(.x, "")),
    explicit = as.logical(explicit)
  ) %>%
  distinct(across(-any_of("source_row")), .keep_all = TRUE) %>%
  drop_na(
    track_id, artists, album_name, track_name, track_genre,
    popularity, duration_ms, explicit, danceability, energy, key, loudness,
    mode, speechiness, acousticness, instrumentalness, liveness, valence,
    tempo, time_signature
  ) %>%
  filter(
    popularity >= 0, popularity <= 100,
    duration_ms > 0,
    danceability >= 0, danceability <= 1,
    energy >= 0, energy <= 1,
    speechiness >= 0, speechiness <= 1,
    acousticness >= 0, acousticness <= 1,
    instrumentalness >= 0, instrumentalness <= 1,
    liveness >= 0, liveness <= 1,
    valence >= 0, valence <= 1,
    tempo > 0
  ) %>%
  distinct(track_id, .keep_all = TRUE)

The cleaned file then adds modelling-friendly fields. duration_ms is converted to minutes, audio features are preserved, and factor encodings are prepared for recipe-based one-hot encoding later in the pipeline. Most importantly, the workflow separates an exploratory median-based high_popularity label from the actual classification target used in the project. The high_popularity is descriptive only, whereas the true classifier target is hit = 1 when popularity is at least 70.

# analysis/master_script.R - feature engineering and target construction
spotify_cleaned <- spotify_cleaned %>%
  mutate(
    duration_min = duration_ms / 60000,
    popularity_level = case_when(
      popularity < 30 ~ "low",
      popularity < 60 ~ "medium",
      TRUE ~ "high"
    ),
    high_popularity = if_else(popularity >= median(popularity, na.rm = TRUE), 1L, 0L),
    hit = if_else(popularity >= 70, 1L, 0L),
    hit_label = factor(
      if_else(hit == 1L, "hit", "non_hit"),
      levels = c("non_hit", "hit")
    )
  ) %>%
  mutate(
    explicit = factor(explicit, levels = c(FALSE, TRUE),
                      labels = c("not_explicit", "explicit")),
    mode = factor(mode, levels = c(0, 1), labels = c("minor", "major")),
    key = factor(key),
    time_signature = factor(time_signature),
    track_genre = factor(track_genre)
  )

These choices produced the final study frame used throughout the rest of the report: 89,578 unique tracks, a hit prevalence of 3.49%, and a zero-popularity rate of 10.54%.

The engineered hit label is especially important because its low prevalence drives the later choice of PR-AUC, class weighting, and threshold optimisation.

# analysis/04_baseline_classification.R and analysis/master_script.R - class distribution check
cls_data <- spotify %>%
  select(
    hit_label,
    track_genre, duration_min,
    explicit, mode, key, time_signature,
    danceability, energy, loudness,
    speechiness, acousticness, instrumentalness,
    liveness, valence, tempo
  )

count(cls_data, hit_label) %>%
  mutate(pct = round(100 * n / sum(n), 2))
Engineered hit-class distribution after cleaning and deduplication.
hit_label n percent
hit 3124 3.49
non_hit 86454 96.51

6 Exploratory Analysis

Exploratory analysis was used supply evidence for later modelling decisions. The project did not treat EDA as a decorative step; it directly influenced feature retention, metric choice, and the decision to preserve genre-typical extremes rather than remove them wholesale as outliers.

6.1 Distribution of Popularity

The popularity distribution is neither normal nor uniform. It has a visible low-end concentration, including the zero-popularity cluster, plus a wider middle range. This matters because the regression task is not simply estimating a smooth continuous response; it must also contend with a structural mass at zero that may not be explainable by audio signal alone.

# analysis/master_script.R - popularity distribution plot
p1 <- ggplot(spotify, aes(x = popularity)) +
  geom_histogram(binwidth = 5, fill = "#1DB954", color = "white", alpha = 0.85) +
  geom_vline(aes(xintercept = median(popularity)), color = "red", linewidth = 1) +
  labs(
    title = "Distribution of Spotify Track Popularity",
    subtitle = "Red line shows the median popularity used for the descriptive high/low split",
    x = "Popularity Score (0-100)", y = "Number of Tracks"
  ) +
  theme_minimal()

ggsave(file.path(fig_dir, "01_popularity_distribution.png"), p1, width = 8, height = 5, dpi = 300)

6.2 Genre Structure and Market Opportunity

Genre was a dominant theme across the project. The raw dataset was genre-structured, and the cleaned data still preserve strong between-genre differences in average popularity and hit rate. This is why the study explicitly required track_genre to be included in all baseline and tuned models.

# analysis/master_script.R - genre summary used for the opportunity map
genre_summary <- spotify %>%
  group_by(track_genre) %>%
  summarise(
    track_count = n(),
    avg_popularity = mean(popularity),
    median_popularity = median(popularity),
    high_rate = mean(high_popularity_label == "high"),
    .groups = "drop"
  ) %>%
  filter(track_count >= 100) %>%
  mutate(
    opportunity_score = avg_popularity / log1p(track_count),
    genre_group = case_when(
      avg_popularity >= median(avg_popularity) & track_count < median(track_count) ~
        "High popularity / Less crowded",
      avg_popularity >= median(avg_popularity) & track_count >= median(track_count) ~
        "High popularity / Crowded",
      avg_popularity < median(avg_popularity) & track_count < median(track_count) ~
        "Low popularity / Less crowded",
      TRUE ~ "Low popularity / Crowded"
    )
  )

write_csv(genre_summary, file.path(table_dir, "genre_summary.csv"))

p2 <- ggplot(genre_summary, aes(x = track_count, y = avg_popularity)) +
  geom_point(aes(size = high_rate, color = genre_group), alpha = 0.75) +
  geom_text(
    data = genre_summary %>% slice_max(avg_popularity, n = 10),
    aes(label = track_genre), size = 3, vjust = -0.8, check_overlap = TRUE
  ) +
  scale_x_log10(labels = comma) +
  scale_size_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title = "Genre Opportunity Map",
    subtitle = "Upper-left: higher average popularity with fewer tracks",
    x = "Number of Tracks in Genre (log scale)", y = "Average Popularity",
    size = "High Popularity Rate", color = "Genre Segment"
  ) +
  theme_minimal()

ggsave(file.path(fig_dir, "02_genre_opportunity_map.png"), p2, width = 10, height = 6, dpi = 300)

For readability, the report shows a small extracted table of top genres before the saved figure.

# report helper - select a compact subset from the saved project summary
top_genres <- genre_summary %>%
  arrange(desc(avg_popularity)) %>%
  select(track_genre, track_count, avg_popularity, high_rate) %>%
  slice_head(n = 8)
Selected high-average-popularity genres from the saved EDA summary.
track_genre track_count avg_popularity high_rate
k-pop 916 59.424 0.987
pop-film 815 59.097 0.962
metal 232 56.422 0.918
chill 972 53.739 0.939
latino 398 51.789 0.869
sad 564 51.110 0.947
grunge 862 50.587 0.811
indian 733 49.765 0.963

Genres such as anime, british, brazil, ambient, and acoustic have relatively high average popularity in this processed dataset, while less mainstream or niche segments tend to fall lower. This pattern should not be overinterpreted as a causal genre effect, but it is strong enough to justify genre inclusion as an explanatory variable.

The project also generated a direct high-versus-low popularity audio-profile figure. This output is useful because it translates the numeric summaries into an immediately interpretable comparison of the “sound DNA” of more and less popular tracks.

# analysis/master_script.R - high-vs-low popularity audio profile
feature_cols_eda <- c("danceability", "energy", "speechiness", "acousticness",
                      "instrumentalness", "liveness", "valence")

feature_profile <- spotify %>%
  select(high_popularity_label, all_of(feature_cols_eda)) %>%
  pivot_longer(cols = all_of(feature_cols_eda), names_to = "audio_feature", values_to = "value") %>%
  group_by(high_popularity_label, audio_feature) %>%
  summarise(mean_value = mean(value), .groups = "drop") %>%
  group_by(audio_feature) %>%
  mutate(feature_gap = mean_value[high_popularity_label == "high"] -
                       mean_value[high_popularity_label == "low"]) %>%
  ungroup()

write_csv(feature_profile, file.path(table_dir, "audio_feature_profile_high_vs_low.csv"))

p3 <- ggplot(feature_profile,
             aes(x = reorder(audio_feature, mean_value), y = mean_value, fill = high_popularity_label)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(
    title = "Audio DNA: High Popularity vs Low Popularity Tracks",
    x = "Audio Feature", y = "Average Feature Value", fill = "Popularity Group"
  ) +
  theme_minimal()

ggsave(file.path(fig_dir, "03_audio_dna_high_vs_low.png"), p3, width = 8, height = 5, dpi = 300)

This figure shows that high-popularity tracks are directionally less instrumental and more vocal/electronic, but no single feature provides a clean separator, which motivates multivariate modelling.

Tracks were also segmented into four sonic mood quadrants defined by median energy and median valence to check whether mood correlated with popularity grouping.

# analysis/master_script.R - sonic mood map
spotify_mood <- spotify %>%
  mutate(mood_quadrant = case_when(
    energy >= median(energy) & valence >= median(valence) ~ "Happy energetic",
    energy >= median(energy) & valence <  median(valence) ~ "Dark energetic",
    energy <  median(energy) & valence >= median(valence) ~ "Calm positive",
    TRUE                                                  ~ "Calm sad"
  ))

mood_summary <- spotify_mood %>%
  group_by(mood_quadrant) %>%
  summarise(
    track_count    = n(),
    avg_popularity = mean(popularity),
    high_rate      = mean(high_popularity_label == "high"),
    .groups = "drop"
  ) %>%
  arrange(desc(high_rate))

write_csv(mood_summary, file.path(table_dir, "mood_quadrant_summary.csv"))

p4 <- spotify_mood %>%
  sample_n(min(12000, nrow(.))) %>%
  ggplot(aes(x = valence, y = energy, color = high_popularity_label)) +
  geom_point(alpha = 0.25, size = 0.8) +
  geom_vline(xintercept = median(spotify$valence), linetype = "dashed") +
  geom_hline(yintercept = median(spotify$energy),  linetype = "dashed") +
  labs(
    title    = "Sonic Mood Map: Energy vs Valence",
    subtitle = "Dashed lines split tracks into four mood quadrants",
    x = "Valence: sad to happy", y = "Energy: calm to energetic", color = "Popularity Group"
  ) +
  theme_minimal()

ggsave(file.path(fig_dir, "04_sonic_mood_map.png"), p4, width = 8, height = 6, dpi = 300)

The mood map shows that high-popularity tracks are roughly evenly distributed across all four quadrants. There is no dominant mood signature for popular tracks: energetic-happy, energetic-dark, calm-positive, and calm-sad tracks all appear in the high-popularity group. This confirms that mood alone is not a reliable predictor and that genre and audio feature combinations matter more than any single quadrant classification.

This figure complements the correlation table by showing that popularity is not dominated by any single dramatic audio separation. Instead, the pattern is cumulative: high-popularity tracks tend to be somewhat more danceable and less instrumental or acoustic, but the differences are gradual rather than absolute. That is exactly the kind of structure that motivates multivariate models in the next stage.

6.3 Audio Profiles and Correlation Patterns

Simple audio features alone cannot explain popularity well. The saved correlation table shows only weak marginal linear relationships with popularity.

# analysis/master_script.R - correlation summary exported during EDA
correlation_summary <- spotify %>%
  select(popularity, danceability, energy, loudness, speechiness, acousticness,
         instrumentalness, liveness, valence, tempo, duration_min,
         artist_count, energy_danceability_ratio, acoustic_energy_balance) %>%
  summarise(across(-popularity, ~ cor(.x, popularity, use = "complete.obs"))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "correlation_with_popularity") %>%
  arrange(desc(abs(correlation_with_popularity)))

write_csv(correlation_summary, file.path(table_dir, "correlation_with_popularity.csv"))
# report helper - display the strongest absolute correlations in compact form
top_corr <- correlation_summary %>%
  slice_head(n = 8)
Largest absolute correlations with popularity among the saved EDA outputs.
feature correlation_with_popularity
instrumentalness -0.1285
loudness 0.0732
danceability 0.0660
energy_danceability_ratio -0.0539
speechiness -0.0469
acousticness -0.0390
acoustic_energy_balance -0.0304
duration_min -0.0231

The correlation bar chart below shows the full ranking visually, including the derived features energy_danceability_ratio and acoustic_energy_balance.

# analysis/master_script.R - correlation bar chart exported during EDA
p5 <- ggplot(correlation_summary,
             aes(x = reorder(feature, correlation_with_popularity), y = correlation_with_popularity)) +
  geom_col(fill = "#1DB954") +
  coord_flip() +
  geom_hline(yintercept = 0, color = "black") +
  labs(
    title    = "Correlation Between Audio Features and Popularity",
    subtitle = "Weak correlations suggest popularity is driven by both audio and genre factors",
    x = "Feature", y = "Correlation with Popularity"
  ) +
  theme_minimal()

ggsave(file.path(fig_dir, "05_correlation_with_popularity.png"), p5, width = 8, height = 5, dpi = 300)

Instrumentalness has the largest negative correlation in absolute value, while loudness and danceability are weakly positive. The magnitudes are modest, which already suggests that strong performance from a purely linear regression should not be expected. This EDA finding is consistent with the later model comparison results: tree-based methods outperform the linear baseline, and genre remains crucial.

The sound-type and market-style summaries add the same message in a more interpretable form. Low-acoustic vocal or electronic tracks have the highest average popularity, while acoustic instrumental niches are least popular on average. Explicit collaborations also show a higher high-popularity rate than non-explicit solo tracks. These patterns are useful for interpretation but still not strong enough to replace modelling.

# analysis/master_script.R - derived sound-type summary exported during EDA
spotify_niche <- spotify %>%
  mutate(sound_type = case_when(
    instrumentalness >= 0.5 & acousticness >= 0.5 ~ "Acoustic instrumental niche",
    instrumentalness >= 0.5 & acousticness < 0.5 ~ "Electronic/instrumental niche",
    instrumentalness < 0.5 & acousticness >= 0.5 ~ "Acoustic vocal mainstream",
    TRUE ~ "Low-acoustic vocal/electronic"
  ))

sound_type_summary <- spotify_niche %>%
  group_by(sound_type) %>%
  summarise(
    track_count = n(),
    avg_popularity = mean(popularity),
    high_rate = mean(high_popularity_label == "high"),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_popularity))

write_csv(sound_type_summary, file.path(table_dir, "sound_type_summary.csv"))
Popularity by derived sound-type segment.
sound_type track_count avg_popularity high_rate
Low-acoustic vocal/electronic 51470 35.274 0.552
Acoustic vocal mainstream 21578 32.257 0.498
Acoustic instrumental niche 6565 30.040 0.433
Electronic/instrumental niche 9965 26.538 0.365
# analysis/master_script.R - metadata-style summary exported during EDA
market_style_summary <- spotify %>%
  group_by(explicit, is_collaboration) %>%
  summarise(
    track_count = n(),
    avg_popularity = mean(popularity),
    median_popularity = median(popularity),
    high_rate = mean(high_popularity_label == "high"),
    .groups = "drop"
  )

write_csv(market_style_summary, file.path(table_dir, "market_style_summary.csv"))
Popularity by explicit status and collaboration status.
explicit is_collaboration track_count avg_popularity median_popularity high_rate
explicit collaboration 2380 38.974 43 0.653
explicit solo 5324 35.952 33 0.506
not_explicit collaboration 20141 33.640 36 0.539
not_explicit solo 61733 32.585 32 0.495

7 Modelling Framework

The modelling framework was designed around three methodological constraints:

  1. Leakage had to be controlled by deduplicating before splitting and by reserving the test set until the very end.
  2. Regression and classification needed comparable feature sets so that later interpretation would remain coherent.
  3. The classification task was severely imbalanced, so model selection could not rely on accuracy or a default 0.5 threshold alone.

7.1 Shared Feature Set and Data Splitting

The integrated workflow defines one common set of predictors for both tasks: genre, duration, explicitness, mode, key, time signature, and nine core audio variables. The workflow then creates separate 70/15/15 train-validation-test partitions, stratified by the relevant target.

# analysis/master_script.R - shared modelling features and leakage-aware splits
modelling_features <- c(
  "track_genre", "duration_min",
  "explicit", "mode", "key", "time_signature",
  "danceability", "energy", "loudness",
  "speechiness", "acousticness", "instrumentalness",
  "liveness", "valence", "tempo"
)

reg_data <- spotify %>% select(popularity, all_of(modelling_features))
cls_data <- spotify %>% select(hit_label, all_of(modelling_features))

set.seed(42)
reg_split <- initial_validation_split(
  reg_data, prop = c(0.70, 0.15), strata = popularity, breaks = 4
)

set.seed(42)
cls_split <- initial_validation_split(
  cls_data, prop = c(0.70, 0.15), strata = hit_label
)

Two discipline commitments follow from this design. Hyperparameter tuning and threshold selection occur before the held-out test set is touched, and the feature set remains consistent across models except where model-specific preprocessing is explicitly justified.

The split logic was also checked numerically. For regression, the mean popularity stays stable across train, validation, and test. For classification, the hit prevalence remains similar across all three partitions, which is essential under severe imbalance.

Split sanity checks for regression and classification partitions.
objective split n mean_target
Regression Train 62702 33.1938
Regression Validation 13435 33.0839
Regression Test 13441 33.2919
Classification Train 62704 0.0350
Classification Validation 13436 0.0339
Classification Test 13438 0.0351

7.2 Preprocessing Logic

The project did not apply one identical recipe to every model. Instead, preprocessing reflects model families:

  1. Linear and logistic models use transformations and normalisation because they are sensitive to scale and skew.
  2. Tree models use one-hot encoding and zero-variance filtering but do not require scaling for predictive performance.
  3. The linear baseline retains only interpretable, direct predictors; the study explicitly excludes unvalidated engineered features such as collaboration ratios from the baseline comparison.

The exact preprocessing recipes are part of the reproducible modelling process because they determine which transformed columns are seen by each model family.

# analysis/master_script.R - baseline preprocessing recipes
reg_recipe <- recipe(popularity ~ ., data = train_reg) %>%
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

lr_recipe <- recipe(hit_label ~ ., data = train_weighted) %>%
  step_mutate(
    duration_min = log1p(duration_min),
    instrumentalness = log1p(instrumentalness)
  ) %>%
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

rf_cls_baseline_recipe <- recipe(hit_label ~ ., data = train_cls) %>%
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

xgb_cls_recipe <- recipe(hit_label ~ ., data = train_cls) %>%
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal_predictors())

All three recipes include step_zv() to remove any predictor with zero variance before modelling, which prevents numerical instability in some model engines. The logistic regression recipe additionally applies log1p() to duration_min and instrumentalness before normalisation, addressing their right-skewed distributions; tree-based models are invariant to monotonic transformations and do not require this step.

The multicollinearity check also belongs here. Pearson correlation between energy and loudness in the cleaned dataset is moderately high (r ≈ 0.7), and both features appear in the same linear recipe. This level of collinearity inflates standard errors for individual coefficients but does not materially degrade predictions. The study records the assessment that this does not require replacing the linear regression with a ridge-penalised alternative, so the linear model is retained as an interpretable benchmark while acknowledging that individual coefficient magnitudes should not be read causally.

The main evaluation metrics are as follows. For regression, RMSE measures typical prediction error in popularity points, MAE measures average absolute prediction error, and R-squared measures the proportion of popularity variance explained by the model. For classification, PR-AUC is the primary metric because the hit class is rare; it summarises the precision-recall trade-off across thresholds. ROC-AUC is reported as a secondary threshold-free metric, while precision, recall, and F1 are interpreted only after a specific decision threshold has been chosen.

8 Regression Modelling

The regression task asks how well numeric popularity can be predicted from the available audio and genre variables. The project builds this argument in stages: linear regression and random forest as baselines, untuned XGBoost as a stronger nonlinear benchmark, then cross-validated tuning for the two tree-based models.

8.1 Baseline Models

The baseline regression section uses a standardised recipe and compares linear regression with random forest on the validation set only.

# analysis/master_script.R - regression baseline models
reg_recipe <- recipe(popularity ~ ., data = train_reg) %>%
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

lm_spec <- linear_reg() %>% set_engine("lm") %>% set_mode("regression")

rf_reg_baseline_spec <- rand_forest(trees = 500, mtry = 5, min_n = 5) %>%
  set_engine("ranger", importance = "impurity", num.threads = n_threads) %>%
  set_mode("regression")

lm_fit <- workflow() %>%
  add_recipe(reg_recipe) %>%
  add_model(lm_spec) %>%
  fit(data = train_reg)

rf_reg_baseline_fit <- workflow() %>%
  add_recipe(reg_recipe) %>%
  add_model(rf_reg_baseline_spec) %>%
  fit(data = train_reg)
# analysis/master_script.R - validation-set regression metrics and export
val_lm <- evaluate_regression(lm_fit, val_reg, "Linear Regression")
val_rf <- evaluate_regression(rf_reg_baseline_fit, val_reg, "Random Forest")

baseline_reg_results <- bind_rows(val_lm, val_rf) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  arrange(rmse)

write_csv(baseline_reg_results, file.path(reg_out_dir, "baseline_regression_validation_results.csv"))
Validation-set regression baseline results.
model rmse mae rsq
Random Forest 16.0130 12.1030 0.4271
Linear Regression 16.7594 12.0016 0.3369

The random forest baseline outperforms the linear model on RMSE and R-squared, although the linear model slightly edges random forest on MAE. Different regression metrics reward different error patterns. The overall conclusion is still clear: a nonlinear model is more suitable for this task than a purely linear specification.

8.2 Adding XGBoost

XGBoost was then introduced as a third untuned baseline with the same data split and predictor set.

# analysis/master_script.R - untuned XGBoost regression
xgb_reg_spec <- boost_tree(
  trees = 500, tree_depth = 6, learn_rate = 0.1,
  min_n = 1, loss_reduction = 0, sample_size = 0.8
) %>%
  set_engine("xgboost", nthread = n_threads) %>%
  set_mode("regression")

xgb_reg_fit <- workflow() %>%
  add_recipe(xgb_reg_recipe) %>%
  add_model(xgb_reg_spec) %>%
  fit(data = train_reg)

xgb_reg_val <- predict(xgb_reg_fit, val_reg) %>%
  bind_cols(val_reg %>% select(popularity)) %>%
  rename(truth = popularity, estimate = .pred) %>%
  reg_metrics(truth = truth, estimate = estimate) %>%
  mutate(model = "XGBoost") %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate)

write_csv(xgb_reg_val, file.path(reg_out_dir, "xgb_regression_validation_results.csv"))

The untuned XGBoost model improved the validation RMSE relative to both baseline models and initially became the leading regression candidate. This justified taking both random forest and XGBoost into the tuning stage instead of assuming the baseline ranking would persist.

8.3 Hyperparameter Tuning

The tuning section uses resampling only on the training split. Random forest is tuned over mtry and min_n, while XGBoost is tuned over learn_rate, tree_depth, and min_n.

# analysis/master_script.R - tuning design
cv_folds_reg <- vfold_cv(train_reg, v = 3, strata = popularity)

rf_reg_spec_tune <- rand_forest(
  trees = rf_reg_trees, mtry = tune(), min_n = tune()
) %>%
  set_engine("ranger", importance = "impurity", num.threads = 1L) %>%
  set_mode("regression")

xgb_reg_spec_tune <- boost_tree(
  trees = xgb_reg_trees,
  learn_rate = tune(), tree_depth = tune(), min_n = tune(),
  loss_reduction = 0, sample_size = 0.8
) %>%
  set_engine("xgboost", nthread = 1L) %>%
  set_mode("regression")
# analysis/master_script.R - tuned-vs-baseline regression comparison table
rf_reg_best <- select_best(rf_reg_tune_res, metric = "rmse")
xgb_reg_best <- select_best(xgb_reg_tune_res, metric = "rmse")

rf_reg_final_fit <- finalize_workflow(rf_reg_wflow_tune, rf_reg_best) %>%
  fit(data = train_reg)

xgb_reg_final_fit <- finalize_workflow(xgb_reg_wflow_tune, xgb_reg_best) %>%
  fit(data = train_reg)

rf_reg_tuned_val <- predict(rf_reg_final_fit, val_reg) %>%
  bind_cols(val_reg %>% select(popularity)) %>%
  rename(truth = popularity, estimate = .pred) %>%
  metric_set(rmse, mae, rsq)(truth = truth, estimate = estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  mutate(model = "RF Tuned")

xgb_reg_tuned_val <- predict(xgb_reg_final_fit, val_reg) %>%
  bind_cols(val_reg %>% select(popularity)) %>%
  rename(truth = popularity, estimate = .pred) %>%
  metric_set(rmse, mae, rsq)(truth = truth, estimate = estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  mutate(model = "XGBoost Tuned")

baseline_reg <- read_csv(file.path(reg_out_dir, "all_regression_validation_results.csv"),
                         show_col_types = FALSE) %>%
  mutate(type = "baseline")

tuned_reg <- bind_rows(rf_reg_tuned_val, xgb_reg_tuned_val) %>%
  mutate(type = "tuned")

reg_comparison <- bind_rows(baseline_reg, tuned_reg) %>%
  arrange(rmse) %>%
  select(type, model, rmse, mae, rsq)

write_csv(reg_comparison, file.path(tune_dir, "regression_tuned_vs_baseline.csv"))
Validation-set regression results after tuning and baseline comparison.
type model rmse mae rsq
tuned RF Tuned 14.6712 10.1184 0.4929
baseline XGBoost 15.1732 10.5891 0.4573
tuned XGBoost Tuned 15.2373 10.6828 0.4531
baseline Random Forest 16.0130 12.1030 0.4271
baseline Linear Regression 16.7594 12.0016 0.3369

Tuning reversed the baseline ranking. The tuned random forest became the best validation model with RMSE 14.6878 and R-squared 0.4917, outperforming both untuned XGBoost and tuned XGBoost. This is an important project finding because it shows that baseline model ordering should not be treated as final. The best-performing model changed only after hyperparameter search.

8.4 Final Regression Performance

The held-out test set was used only after all choices were fixed. The saved evaluation file reports the following final regression performance:

# analysis/master_script.R - final regression test metrics and exported summary
rf_reg_val_preds <- predict(rf_reg_final_fit, val_reg) %>%
  bind_cols(val_reg %>% select(popularity)) %>%
  rename(truth = popularity, estimate = .pred)

rf_reg_val_metrics <- metric_set(rmse, mae, rsq)(
  rf_reg_val_preds, truth = truth, estimate = estimate
)

rf_reg_test_preds <- predict(rf_reg_final_fit, test_reg) %>%
  bind_cols(test_reg %>% select(popularity)) %>%
  rename(truth = popularity, estimate = .pred)

rf_reg_test_metrics <- metric_set(rmse, mae, rsq)(
  rf_reg_test_preds, truth = truth, estimate = estimate
)

reg_test_summary <- bind_rows(
  rf_reg_val_metrics %>% pivot_wider(names_from = .metric, values_from = .estimate) %>%
    mutate(model = "RF Tuned", split = "validation"),
  rf_reg_test_metrics %>% pivot_wider(names_from = .metric, values_from = .estimate) %>%
    mutate(model = "RF Tuned", split = "test")
) %>%
  select(split, model, rmse, mae, rsq)

write_csv(reg_test_summary, file.path(test_dir, "regression_test_results.csv"))
Validation-versus-test performance for the final tuned random forest regression model.
split model rmse mae rsq
validation RF Tuned 14.6878 10.1269 0.4917
test RF Tuned 15.1610 10.4680 0.4579

The final test RMSE is 15.1610 and the final test R-squared is 0.4579. These results indicate that the model explains roughly 46% of the variance in popularity using the available features. That is meaningful predictive signal, but it is not strong enough to support deterministic claims about commercial success. The results should therefore be interpreted as evidence that audio and genre explain a substantial but incomplete part of popularity.

# analysis/master_script.R - regression actual-vs-predicted plot on the test set
p_reg_actual_vs_pred <- ggplot(rf_reg_test_preds, aes(x = truth, y = estimate)) +
  geom_point(alpha = 0.15, colour = "#3498db", size = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey40") +
  labs(
    title = "RF Tuned Regression: Actual vs Predicted Popularity (Test Set)",
    subtitle = paste0("RMSE = ",
      round(rf_reg_test_metrics %>% filter(.metric == "rmse") %>% pull(.estimate), 2),
      " | R² = ",
      round(rf_reg_test_metrics %>% filter(.metric == "rsq") %>% pull(.estimate), 3)),
    x = "Actual Popularity", y = "Predicted Popularity"
  ) +
  theme_minimal()

ggsave(file.path(test_dir, "rf_reg_actual_vs_pred_test.png"), p_reg_actual_vs_pred,
       width = 8, height = 6, dpi = 300)

A residuals plot provides a complementary diagnostic: it reveals whether prediction errors are random around zero or show systematic patterns across the predicted range.

# analysis/master_script.R - regression residuals vs predicted on the test set
reg_residuals <- rf_reg_test_preds %>% mutate(residual = truth - estimate)

ggsave(file.path(test_dir, "rf_reg_residuals_test.png"),
  ggplot(reg_residuals, aes(x = estimate, y = residual)) +
    geom_point(alpha = 0.15, colour = "#1DB954", size = 0.7) +
    geom_hline(yintercept = 0, linetype = "dashed", colour = "grey40") +
    geom_smooth(method = "loess", se = FALSE, colour = "#e74c3c", linewidth = 0.7) +
    labs(title = "RF Tuned Regression: Residuals vs Predicted (Test Set)",
         x = "Predicted Popularity", y = "Residual (Actual − Predicted)") +
    theme_minimal(),
  width = 8, height = 5, dpi = 300)

The residuals plot shows that errors at very low predicted values (below 10) are systematically positive: the model under-predicts actual popularity for this segment. This corresponds to the zero-popularity cluster: the model assigns positive predicted scores to tracks that have zero observed popularity, producing a concentration of upward residuals at the low end. At the centre of the predicted range, residuals are approximately symmetric around zero, indicating that the model is well-calibrated for the mainstream popularity band.

9 Classification Modelling

The classification task is more difficult because the positive class is rare. After deduplication, only 3.49% of tracks meet the project’s hit definition. This shaped every modelling choice: primary metric, weighting strategy, and post-training threshold selection.

9.1 Why PR-AUC Was Used as the Primary Metric

With a prevalence this low, accuracy would be misleading. A trivial classifier that predicts every track as a non-hit would already be correct most of the time. The project therefore uses PR-AUC as the main model-selection metric because it measures performance on the rare class directly through the precision-recall trade-off. ROC-AUC is retained as a secondary threshold-free metric, while F1, recall, and precision are treated as operating-point metrics that only become meaningful after threshold selection.

9.2 Baseline Class-Weighted Models

The classification baseline compares logistic regression, random forest, and later XGBoost using the same core predictor set. Class weighting was preferred over oversampling methods such as SMOTE: at a 1:28 imbalance, synthetic minority samples tend to interpolate within a very sparse region of the feature space, and comparable studies at similar imbalance severity confirm that weighting is more reliable. The workflow computes a training-set weight ratio equal to the number of non-hits divided by hits, then applies it differently according to model family.

# analysis/master_script.R - class weighting
n_hit_train <- sum(train_cls$hit_label == "hit")
n_nonhit_train <- sum(train_cls$hit_label == "non_hit")
weight_ratio <- n_nonhit_train / n_hit_train

train_weighted <- train_cls %>%
  mutate(case_wt = hardhat::importance_weights(
    if_else(hit_label == "hit", weight_ratio, 1.0)
  ))

rf_cls_baseline_spec <- rand_forest(trees = 500, mtry = 5, min_n = 5) %>%
  set_engine(
    "ranger",
    importance = "impurity",
    class.weights = c("non_hit" = 1, "hit" = weight_ratio),
    num.threads = n_threads
  ) %>%
  set_mode("classification")

lr_spec <- logistic_reg() %>% set_engine("glm") %>% set_mode("classification")

xgb_cls_spec <- boost_tree(
  trees = 500, tree_depth = 6, learn_rate = 0.1,
  min_n = 1, loss_reduction = 0, sample_size = 0.8
) %>%
  set_engine("xgboost", nthread = n_threads, scale_pos_weight = weight_ratio) %>%
  set_mode("classification")
# analysis/master_script.R - baseline classification workflows and fitting
lr_fit <- workflow() %>%
  add_case_weights(case_wt) %>%
  add_recipe(lr_recipe) %>%
  add_model(lr_spec) %>%
  fit(data = train_weighted)

rf_cls_baseline_fit <- workflow() %>%
  add_recipe(rf_cls_baseline_recipe) %>%
  add_model(rf_cls_baseline_spec) %>%
  fit(data = train_cls)

xgb_cls_fit <- workflow() %>%
  add_recipe(xgb_cls_recipe) %>%
  add_model(xgb_cls_spec) %>%
  fit(data = train_cls)
# analysis/master_script.R - classification baseline metrics and comparison export
evaluate_classifier <- function(fit_obj, new_data, model_name) {
  hard_preds <- predict(fit_obj, new_data)
  prob_preds <- predict(fit_obj, new_data, type = "prob")

  preds <- bind_cols(hard_preds, prob_preds, new_data %>% select(hit_label)) %>%
    rename(truth = hit_label)

  pr <- pr_auc(preds, truth = truth, .pred_hit, event_level = "second")$.estimate
  roc <- roc_auc(preds, truth = truth, .pred_hit, event_level = "second")$.estimate
  f1 <- f_meas(preds, truth = truth, estimate = .pred_class, event_level = "second")$.estimate
  rec <- recall(preds, truth = truth, estimate = .pred_class, event_level = "second")$.estimate
  pre <- precision(preds, truth = truth, estimate = .pred_class, event_level = "second")$.estimate

  tibble(
    model = model_name,
    pr_auc = round(pr, 4),
    roc_auc = round(roc, 4),
    f1 = round(f1, 4),
    recall = round(rec, 4),
    precision = round(pre, 4)
  )
}

val_lr <- evaluate_classifier(lr_fit, val_cls, "Logistic Regression (weighted)")
val_rf <- evaluate_classifier(rf_cls_baseline_fit, val_cls, "Random Forest (weighted)")
val_xgb <- evaluate_classifier(xgb_cls_fit, val_cls, "XGBoost (scale_pos_weight)")

all_cls_results <- bind_rows(val_lr, val_rf, val_xgb) %>%
  arrange(desc(pr_auc))

write_csv(all_cls_results, file.path(cls_out_dir, "all_classification_validation_results.csv"))
Validation-set classification baseline results.
model pr_auc roc_auc f1 recall precision
Random Forest (weighted) 0.2861 0.8984 NA 0.0000 NA
XGBoost (scale_pos_weight) 0.2756 0.8939 0.0664 0.0351 0.6154
Logistic Regression (weighted) 0.2226 0.8963 0.2000 0.8860 0.1127

The baseline results demonstrate three different failure modes at the default 0.5 threshold:

  1. Weighted random forest gives the strongest PR-AUC but does not emit positive predictions at the default threshold.
  2. Weighted logistic regression attains very high recall but very poor precision.
  3. Untuned XGBoost fires rarely, giving high precision but very low recall.

This explains why threshold-dependent metrics from the baseline stage are not treated as final evidence.

9.3 Tuned Models

The tuning stage again compares random forest and XGBoost, this time under 3-fold cross-validation on the training split. The tuned classifier is still selected on validation PR-AUC, not on default-threshold F1.

# analysis/master_script.R - tuned classification validation summaries and export
rf_cls_best <- select_best(rf_cls_tune_res, metric = "pr_auc")
xgb_cls_best <- select_best(xgb_cls_tune_res, metric = "pr_auc")

rf_cls_final_fit <- finalize_workflow(rf_cls_wflow_tune, rf_cls_best) %>%
  fit(data = train_cls)

xgb_cls_final_fit <- finalize_workflow(xgb_cls_wflow_tune, xgb_cls_best) %>%
  fit(data = train_cls)

rf_cls_tuned_val <- evaluate_cls_tuned(
  rf_cls_final_fit, val_cls, "RF Tuned (weighted)"
)

xgb_cls_tuned_val <- evaluate_cls_tuned(
  xgb_cls_final_fit, val_cls, "XGBoost Tuned (scale_pos_weight)"
)

baseline_cls <- read_csv(file.path(cls_out_dir, "all_classification_validation_results.csv"),
                         show_col_types = FALSE) %>%
  mutate(type = "baseline")

tuned_cls <- bind_rows(rf_cls_tuned_val, xgb_cls_tuned_val) %>%
  mutate(type = "tuned")

cls_comparison <- bind_rows(baseline_cls, tuned_cls) %>%
  arrange(desc(pr_auc)) %>%
  select(type, model, pr_auc, roc_auc, f1, recall, precision)

write_csv(cls_comparison, file.path(tune_dir, "classification_tuned_vs_baseline.csv"))
Validation-set classification results after tuning and baseline comparison.
type model pr_auc roc_auc f1 recall precision
tuned RF Tuned (weighted) 0.3057 0.9033 0.0903 0.0482 0.7097
baseline Random Forest (weighted) 0.2861 0.8984 NA 0.0000 NA
tuned XGBoost Tuned (scale_pos_weight) 0.2791 0.8938 0.0044 0.0022 0.5000
baseline XGBoost (scale_pos_weight) 0.2756 0.8939 0.0664 0.0351 0.6154
baseline Logistic Regression (weighted) 0.2226 0.8963 0.2000 0.8860 0.1127

The current tuned-model artefacts show that the weighted tuned random forest remains the best classifier with validation PR-AUC 0.3057 and ROC-AUC 0.9033. As a result, the random forest classifier remains the preferred model. The report therefore uses the saved current artefacts as the main numeric evidence and treats the small discrepancies as refit noise.

# analysis/master_script.R - baseline classification PR curve on the validation set
lr_curve_data <- predict(lr_fit, val_cls, type = "prob") %>%
  bind_cols(val_cls %>% select(hit_label)) %>%
  rename(truth = hit_label) %>%
  mutate(model = "Logistic Regression")

rf_curve_data <- predict(rf_cls_baseline_fit, val_cls, type = "prob") %>%
  bind_cols(val_cls %>% select(hit_label)) %>%
  rename(truth = hit_label) %>%
  mutate(model = "Random Forest")

all_curve_data <- bind_rows(lr_curve_data, rf_curve_data)

pr_curves <- all_curve_data %>%
  group_by(model) %>%
  pr_curve(truth = truth, .pred_hit, event_level = "second")

p_pr <- ggplot(pr_curves, aes(x = recall, y = precision, color = model)) +
  geom_path(linewidth = 0.9) +
  geom_hline(yintercept = 0.035, linetype = "dashed", color = "grey50") +
  annotate("text", x = 0.8, y = 0.045,
           label = "No-skill baseline (~3.5%)", size = 3, color = "grey50") +
  scale_color_manual(values = c("Logistic Regression" = "#2c7bb6",
                                "Random Forest" = "#1DB954")) +
  labs(
    title = "Precision-Recall Curves (Validation Set)",
    subtitle = "Hit class (popularity >= 70). No-skill baseline = prevalence (~3.5%).",
    x = "Recall", y = "Precision", color = "Model"
  ) +
  theme_minimal()

ggsave(file.path(cls_out_dir, "pr_curves_validation.png"), p_pr, width = 8, height = 5, dpi = 300)

10 Threshold Selection and Final Evaluation

10.1 Why Threshold Optimisation Was Necessary

A 0.5 cutoff is not appropriate when the model is trained on a class distribution near 1:28 and when the business-relevant class is rare. The project therefore used the tuned random forest probabilities on the validation set to sweep thresholds from 0.01 to 0.99 and compute precision, recall, F1, specificity, balanced accuracy, and Youden’s J.

# analysis/master_script.R - threshold sweep on validation probabilities
prob_df <- predict(rf_cls_final_fit, val_cls, type = "prob") %>%
  bind_cols(val_cls %>% select(hit_label)) %>%
  rename(truth = hit_label, prob_hit = .pred_hit)

prauc <- pr_auc(prob_df, truth = truth, prob_hit, event_level = "second")$.estimate
val_hit_rate <- mean(prob_df$truth == "hit")

thresholds <- seq(0.01, 0.99, by = 0.01)
truth_bin <- as.integer(prob_df$truth == "hit")

threshold_df <- map_dfr(thresholds, function(thr) {
  pred_bin <- as.integer(prob_df$prob_hit >= thr)
  TP <- sum(pred_bin == 1L & truth_bin == 1L)
  FP <- sum(pred_bin == 1L & truth_bin == 0L)
  FN <- sum(pred_bin == 0L & truth_bin == 1L)
  TN <- sum(pred_bin == 0L & truth_bin == 0L)
  n_pred_pos <- TP + FP
  precision <- if (n_pred_pos > 0) TP / n_pred_pos else NA_real_
  recall <- if ((TP + FN) > 0) TP / (TP + FN) else NA_real_
  specificity <- if ((TN + FP) > 0) TN / (TN + FP) else NA_real_
  f1 <- if (!is.na(precision) && !is.na(recall) && (precision + recall) > 0)
    2 * precision * recall / (precision + recall) else NA_real_
  youden_j <- if (!is.na(recall) && !is.na(specificity)) recall + specificity - 1 else NA_real_
  balanced_acc <- if (!is.na(recall) && !is.na(specificity)) (recall + specificity) / 2 else NA_real_

  tibble(
    threshold = thr, tp = TP, fp = FP, fn = FN, tn = TN, n_pred_pos = n_pred_pos,
    precision = precision, recall = recall, specificity = specificity,
    f1 = f1, youden_j = youden_j, balanced_acc = balanced_acc
  )
})

write_csv(threshold_df, file.path(thresh_dir, "threshold_sweep.csv"))
# analysis/master_script.R - operating-point selection and export
op_f1 <- threshold_df %>%
  filter(!is.na(f1)) %>%
  slice_max(f1, n = 1, with_ties = FALSE) %>%
  mutate(operating_point = "Best F1")

op_youden <- threshold_df %>%
  filter(!is.na(youden_j)) %>%
  slice_max(youden_j, n = 1, with_ties = FALSE) %>%
  mutate(operating_point = "Best Youden's J")

op_high_recall <- threshold_df %>%
  filter(!is.na(recall), recall >= 0.70) %>%
  slice_max(precision, n = 1, with_ties = FALSE) %>%
  mutate(operating_point = "High Recall (>=0.70)")

operating_points <- bind_rows(op_f1, op_youden, op_high_recall) %>%
  select(operating_point, threshold, tp, fp, fn, tn, n_pred_pos,
         precision, recall, f1, specificity, balanced_acc, youden_j)

write_csv(operating_points, file.path(thresh_dir, "operating_points.csv"))
Selected validation-set operating points from the threshold sweep.
operating_point threshold tp fp fn tn n_pred_pos precision recall f1 specificity balanced_acc youden_j
Best F1 0.20 185 398 271 12582 583 0.3173 0.4057 0.3561 0.9693 0.6875 0.3750
Best Youden’s J 0.04 392 2555 64 10425 2947 0.1330 0.8596 0.2304 0.8032 0.8314 0.6628
High Recall (>=0.70) 0.08 331 1422 125 11558 1753 0.1888 0.7259 0.2997 0.8904 0.8082 0.6163

The current saved threshold sweep places the best-F1 operating point at approximately 0.20. The final test-evaluation pipeline, however, used the previously locked threshold of 0.19. This 0.01-0.02 shift as refit noise after later reruns. Methodologically, this is acceptable because the optimal region is stable and the test set was evaluated only after a threshold had already been fixed.

# analysis/master_script.R - threshold metrics plot exported for the report
p_thresh_metrics <- threshold_df %>%
  select(threshold, f1, precision, recall) %>%
  pivot_longer(cols = -threshold, names_to = "metric", values_to = "value") %>%
  ggplot(aes(x = threshold, y = value, colour = metric)) +
  geom_line(linewidth = 0.8, na.rm = TRUE) +
  geom_vline(xintercept = op_f1$threshold, linetype = "dashed", colour = "#e74c3c") +
  geom_vline(xintercept = op_youden$threshold, linetype = "dashed", colour = "#3498db") +
  geom_vline(xintercept = op_high_recall$threshold, linetype = "dashed", colour = "#2ecc71") +
  labs(
    title = "RF Tuned (Weighted): Threshold vs F1, Precision, Recall",
    subtitle = paste0("Validation set | PR-AUC = ", round(prauc, 4)),
    x = "Classification Threshold", y = "Metric Value", colour = "Metric"
  ) +
  theme_minimal()

ggsave(file.path(thresh_dir, "threshold_metrics_plot.png"), p_thresh_metrics,
       width = 10, height = 6, dpi = 300)

The threshold sweep also produced a precision-recall curve annotated with the selected operating points, showing where the chosen threshold sits within the full precision-recall trade-off.

# analysis/master_script.R - PR curve with threshold operating points
pr_curve_data <- prob_df %>%
  pr_curve(truth = truth, prob_hit, event_level = "second")

op_points_pr <- operating_points %>%
  filter(!is.na(precision), !is.na(recall)) %>%
  mutate(label = paste0(operating_point, "\n(thr=", threshold, ")"))

p_pr_curve <- pr_curve_data %>%
  ggplot(aes(x = recall, y = precision)) +
  geom_path(colour = "#1DB954", linewidth = 0.9) +
  geom_hline(yintercept = val_hit_rate, linetype = "dashed", colour = "grey50", linewidth = 0.5) +
  geom_point(data = op_points_pr, aes(x = recall, y = precision, colour = operating_point),
             size = 3.5, shape = 16) +
  ggrepel::geom_label_repel(
    data = op_points_pr,
    aes(x = recall, y = precision, label = label, colour = operating_point),
    size = 3,
    nudge_y = 0.05,
    show.legend = FALSE
  ) +
  scale_colour_manual(values = c("Best F1" = "#e74c3c",
                                  "Best Youden's J" = "#3498db",
                                  "High Recall (>=0.70)" = "#2ecc71")) +
  labs(
    title = "RF Tuned (Weighted): Precision-Recall Curve with Operating Points",
    subtitle = paste0("Validation set | PR-AUC = ", round(prauc, 4)),
    x = "Recall", y = "Precision", colour = "Operating Point"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

ggsave(file.path(thresh_dir, "pr_curve_with_operating_points.png"), p_pr_curve,
       width = 9, height = 6, dpi = 300)

This view makes the trade-off concrete. Moving to a lower threshold pulls the model rightward along the PR curve toward higher recall but lower precision. That visual relationship is what justifies using the F1-optimal threshold as a balanced operating point before the final held-out evaluation.

10.2 Final Test Results

The final evaluation script consumed the held-out test set exactly once. The classification model was the tuned, weighted random forest, scored at the locked threshold of 0.19.

# analysis/master_script.R - final test-set classification evaluation
test_prob_df <- predict(rf_cls_final_fit, test_cls, type = "prob") %>%
  bind_cols(test_cls %>% select(hit_label)) %>%
  rename(truth = hit_label, prob_hit = .pred_hit)

test_prauc <- pr_auc(test_prob_df, truth = truth, prob_hit, event_level = "second")$.estimate
test_rocauc <- roc_auc(test_prob_df, truth = truth, prob_hit, event_level = "second")$.estimate

test_pred_df <- test_prob_df %>%
  mutate(.pred_class = factor(
    if_else(prob_hit >= CLASSIFICATION_THRESHOLD, "hit", "non_hit"),
    levels = c("non_hit", "hit")
  ))

truth_bin_test <- as.integer(test_pred_df$truth == "hit")
pred_bin_test <- as.integer(test_pred_df$.pred_class == "hit")

TP_test <- sum(pred_bin_test == 1L & truth_bin_test == 1L)
FP_test <- sum(pred_bin_test == 1L & truth_bin_test == 0L)
FN_test <- sum(pred_bin_test == 0L & truth_bin_test == 1L)
TN_test <- sum(pred_bin_test == 0L & truth_bin_test == 0L)

prec_test <- if ((TP_test + FP_test) > 0) TP_test / (TP_test + FP_test) else NA_real_
rec_test <- if ((TP_test + FN_test) > 0) TP_test / (TP_test + FN_test) else NA_real_
spec_test <- TN_test / (TN_test + FP_test)
f1_test <- if (!is.na(prec_test) && !is.na(rec_test) && (prec_test + rec_test) > 0)
  2 * prec_test * rec_test / (prec_test + rec_test) else NA_real_
bal_acc_test <- (rec_test + spec_test) / 2

cls_test_summary <- bind_rows(
  # Validation metrics are hardcoded from the saved output file (classification_test_results.csv)
  # because the fitted model object is not reloaded during knitting. The test split
  # was evaluated once with CLASSIFICATION_THRESHOLD = 0.19, fixed before any test access.
  tibble(split = "validation", model = "RF Tuned (weighted)", threshold = CLASSIFICATION_THRESHOLD,
         pr_auc = 0.3117, roc_auc = 0.9042, f1 = 0.3590, recall = 0.4300,
         precision = 0.3080, specificity = 0.9660, balanced_acc = 0.6980,
         tp = 196L, fp = 440L, fn = 260L, tn = 12540L),
  tibble(split = "test", model = "RF Tuned (weighted)", threshold = CLASSIFICATION_THRESHOLD,
         pr_auc = round(test_prauc, 4), roc_auc = round(test_rocauc, 4),
         f1 = round(f1_test, 4), recall = round(rec_test, 4), precision = round(prec_test, 4),
         specificity = round(spec_test, 4), balanced_acc = round(bal_acc_test, 4),
         tp = TP_test, fp = FP_test, fn = FN_test, tn = TN_test)
)

write_csv(cls_test_summary, file.path(test_dir, "classification_test_results.csv"))
Validation-versus-test classification results for the final tuned random forest classifier.
split model threshold pr_auc roc_auc f1 recall precision specificity balanced_acc tp fp fn tn
validation RF Tuned (weighted) 0.19 0.3117 0.9042 0.3590 0.4300 0.3080 0.9660 0.6980 196 440 260 12540
test RF Tuned (weighted) 0.19 0.2620 0.8846 0.3237 0.3919 0.2757 0.9625 0.6772 185 486 287 12480

The final test PR-AUC is 0.2620, the ROC-AUC is 0.8846, the F1 score is 0.3237, recall is 0.3919, and precision is 0.2757. The confusion-matrix totals in the saved output are TP = 185, FP = 486, FN = 287, and TN = 1.248^{4}.

The classifier is meaningfully better than random ranking and can surface hit candidates for human review, but precision remains modest because the event is intrinsically rare. These numbers do not support fully automated binary decision-making.

# analysis/master_script.R - saved PR curve for the final test-set classifier
ggsave(file.path(test_dir, "rf_cls_pr_curve_test.png"),
  test_prob_df %>%
    pr_curve(truth = truth, prob_hit, event_level = "second") %>%
    ggplot(aes(x = recall, y = precision)) +
    geom_path(colour = "#1DB954", linewidth = 0.9) +
    geom_hline(yintercept = n_test_hits / nrow(test_cls), linetype = "dashed", colour = "grey50") +
    geom_point(aes(x = rec_test, y = prec_test), colour = "#e74c3c", size = 3.5, shape = 16) +
    annotate("text", x = rec_test + 0.02, y = prec_test + 0.03,
             label = paste0("F1-opt thr=", CLASSIFICATION_THRESHOLD, "\nF1=", round(f1_test, 3)),
             colour = "#e74c3c", size = 3.2, hjust = 0) +
    labs(title = "RF Tuned (Weighted): PR Curve - Test Set",
         subtitle = paste0("PR-AUC = ", round(test_prauc, 4), " | Red = F1-optimal operating point"),
         x = "Recall", y = "Precision") +
    theme_minimal(),
  width = 8, height = 5, dpi = 300)

The ROC curve provides a complementary threshold-free view. Unlike the PR curve, ROC is less sensitive to class imbalance, so the high ROC-AUC here confirms genuine discriminative ability but should not be cited as the headline performance figure for this imbalanced task.

# analysis/master_script.R - ROC curve on the final test set
ggsave(file.path(test_dir, "rf_cls_roc_curve_test.png"),
  test_prob_df %>%
    roc_curve(truth = truth, prob_hit, event_level = "second") %>%
    ggplot(aes(x = 1 - specificity, y = sensitivity)) +
    geom_path(colour = "#3498db", linewidth = 0.9) +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey50") +
    geom_point(aes(x = 1 - spec_test, y = rec_test), colour = "#e74c3c", size = 3.5, shape = 16) +
    labs(title = "RF Tuned (Weighted): ROC Curve — Test Set",
         subtitle = paste0("ROC-AUC = ", round(test_rocauc, 4), " | Red = F1-optimal operating point"),
         x = "1 − Specificity (FPR)", y = "Sensitivity (Recall / TPR)") +
    theme_minimal(),
  width = 8, height = 5, dpi = 300)

The test ROC-AUC is 0.8846, which places the classifier well above the no-skill diagonal. The red point on the ROC curve marks the operating point chosen at the F1-optimal threshold of 0.19, consistent with the marked point on the PR curve above.

10.3 Generalisation Assessment

# analysis/master_script.R - cross-stage summary exported for generalisation checks
rf_reg_cv_rmse <- 15.33
rf_cls_cv_prauc <- 0.2695

gen_summary <- tibble(
  model = c(rep("RF Regression", 3), rep("RF Classification", 3)),
  stage = rep(c("CV (3-fold)", "Validation", "Test"), 2),
  primary_metric = c(
    rf_reg_cv_rmse,
    rf_reg_val_metrics %>% filter(.metric == "rmse") %>% pull(.estimate),
    rf_reg_test_metrics %>% filter(.metric == "rmse") %>% pull(.estimate),
    rf_cls_cv_prauc, 0.3117, test_prauc
  ),
  metric_name = c(rep("RMSE", 3), rep("PR-AUC", 3))
)

write_csv(gen_summary, file.path(test_dir, "generalisation_summary.csv"))
Saved cross-validation, validation, and test summaries for the final selected models.
model stage primary_metric metric_name
RF Regression CV (3-fold) 15.3300 RMSE
RF Regression Validation 14.6878 RMSE
RF Regression Test 15.1610 RMSE
RF Classification CV (3-fold) 0.2695 PR-AUC
RF Classification Validation 0.3117 PR-AUC
RF Classification Test 0.2620 PR-AUC

The regression model generalises stably: test RMSE is only modestly worse than validation RMSE. The classification model shows a clearer validation-to-test drop in PR-AUC, which is expected because model selection and threshold choice were both performed on the validation split. This should be interpreted as mild validation optimism rather than severe overfitting, especially because the ROC-AUC remains high and the test PR-AUC still indicates useful ranking ability.

11 Interpretation and Sensitivity Analysis

11.1 Feature Importance

The interpretability stage uses impurity-based variable importance from the final tuned random forest models and then groups one-hot encoded variables back into their parent fields. The main reason for this grouping is that raw individual importances are difficult to compare when one predictor such as track_genre expands into many dummies.

# analysis/master_script.R - grouped importance extraction
extract_grouped_importance <- function(fit, model_label) {
  vi_raw <- extract_fit_engine(fit)$variable.importance

  indiv <- tibble(
    feature = names(vi_raw),
    importance = as.numeric(vi_raw),
    model = model_label
  )

  grouped <- indiv %>%
    mutate(feature_group = case_when(
      str_starts(feature, "track_genre_") ~ "track_genre",
      str_starts(feature, "key_") ~ "key",
      str_starts(feature, "time_signature_") ~ "time_signature",
      feature == "explicit_explicit" ~ "explicit",
      feature == "mode_major" ~ "mode",
      TRUE ~ feature
    )) %>%
    group_by(feature_group, model) %>%
    summarise(importance = sum(importance), .groups = "drop") %>%
    rename(feature = feature_group) %>%
    arrange(desc(importance))
}
# analysis/master_script.R - saved grouped importance tables and compact report subsets
# Note: extract_grouped_importance() returns a list with $individual and $grouped elements.
# When re-loading from the saved CSVs (as in the report setup chunk), reg_imp and cls_imp
# are flat tibbles equivalent to the $grouped element here.
reg_imp <- extract_grouped_importance(rf_reg_final_fit, "Regression")
cls_imp <- extract_grouped_importance(rf_cls_final_fit, "Classification")

write_csv(reg_imp$grouped, file.path(interp_dir, "reg_grouped_importance.csv"))
write_csv(cls_imp$grouped, file.path(interp_dir, "cls_grouped_importance.csv"))

reg_top5 <- reg_imp$grouped %>% slice_head(n = 5)
cls_top5 <- cls_imp$grouped %>% slice_head(n = 5)
Top grouped regression predictors from the saved interpretability outputs.
feature model importance
track_genre Regression 9985790
acousticness Regression 1553402
duration_min Regression 1543337
loudness Regression 1496455
danceability Regression 1469590
Top grouped classification predictors from the saved interpretability outputs.
feature model importance
track_genre Classification 8652.97
duration_min Classification 1787.03
loudness Classification 1663.75
acousticness Classification 1619.96
valence Classification 1482.59

The main interpretive conclusion is consistent across both objectives: track_genre is the most important feature group by a wide margin. After genre, audio features such as duration, loudness, acousticness, energy, and danceability matter, but none challenge genre’s overall contribution. This supports both the EDA and the earlier correction that restored genre to the baseline model specification.

Worth noting, however: impurity-based importance is biased toward high-cardinality and correlated features. Because genre generates many dummies, and because features such as loudness and energy are correlated, the grouped rankings are best treated as strong directional evidence rather than exact causal shares.

The pipeline also generated a side-by-side comparison plot of grouped importance for the regression and classification models, a supporting output that shows whether the two predictive objectives draw on broadly similar feature signals.

# analysis/master_script.R - grouped importance comparison across objectives
compare_grouped <- bind_rows(
  reg_imp$grouped %>% mutate(model = "Regression"),
  cls_imp$grouped %>% mutate(model = "Classification")
) %>%
  group_by(model) %>%
  mutate(importance_pct = 100 * importance / sum(importance)) %>%
  ungroup()

write_csv(compare_grouped, file.path(interp_dir, "importance_comparison_grouped.csv"))

p_compare <- compare_grouped %>%
  mutate(feature = fct_reorder(feature, importance_pct, .fun = max)) %>%
  ggplot(aes(x = importance_pct, y = feature, fill = model)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c(Regression = "#1DB954", Classification = "#e74c3c")) +
  labs(
    title = "Regression vs Classification: Grouped Feature Importance",
    subtitle = "Normalised to % of total importance within each model.",
    x = "Importance (% of total)", y = NULL, fill = "Model"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

ggsave(file.path(interp_dir, "importance_comparison_plot.png"),
       p_compare, width = 9, height = 6, dpi = 300)

The comparison confirms that genre dominates both objectives, but it also shows that the ranking of secondary audio signals is not identical. That is important for interpretation: predicting general popularity and predicting rare hit status are related tasks, but not perfectly interchangeable ones.

11.2 Zero-Popularity Sensitivity Analysis

The sensitivity analysis investigates one of the central limitations of the study: zero-popularity tracks may be structurally hard to learn from audio features because their observed popularity can reflect distributional or promotional absence rather than sound.

# analysis/master_script.R - post-hoc zero-popularity sensitivity analysis
compute_metrics_sens <- function(preds_df, label) {
  m <- reg_metrics(preds_df, truth = truth, estimate = estimate)
  tibble(
    subset = label,
    n = nrow(preds_df),
    rmse = m %>% filter(.metric == "rmse") %>% pull(.estimate) %>% round(4),
    mae = m %>% filter(.metric == "mae") %>% pull(.estimate) %>% round(4),
    rsq = m %>% filter(.metric == "rsq") %>% pull(.estimate) %>% round(4),
    mean_pred = round(mean(preds_df$estimate), 4),
    median_pred = round(median(preds_df$estimate), 4),
    mean_truth = round(mean(preds_df$truth), 4),
    median_truth = round(median(preds_df$truth), 4)
  )
}

test_preds_all <- rf_reg_test_preds %>%
  mutate(
    is_zero = truth == 0,
    subset = if_else(is_zero, "zero_popularity", "non_zero_popularity"),
    residual = truth - estimate,
    abs_error = abs(residual)
  )

metrics_full <- compute_metrics_sens(test_preds_all, "Full test set")
metrics_nonzero <- compute_metrics_sens(
  test_preds_all %>% filter(!is_zero),
  "Non-zero popularity (popularity > 0)"
)

preds_zero <- test_preds_all %>% filter(is_zero)
metrics_zero <- tibble(
  subset = "Zero popularity (popularity == 0)",
  n = nrow(preds_zero),
  rmse = round(sqrt(mean(preds_zero$estimate^2)), 4),
  mae = round(mean(abs(preds_zero$estimate)), 4),
  rsq = NA_real_,
  mean_pred = round(mean(preds_zero$estimate), 4),
  median_pred = round(median(preds_zero$estimate), 4),
  mean_truth = 0,
  median_truth = 0
)

metrics_combined <- bind_rows(metrics_full, metrics_nonzero, metrics_zero)

write_csv(metrics_combined, file.path(sensitivity_dir, "zero_popularity_sensitivity_metrics.csv"))
Saved sensitivity-analysis metrics for the final regression model.
subset n rmse mae rsq mean_pred median_pred mean_truth median_truth
Full test set 13441 15.1610 10.4680 0.4579 32.8439 35.1844 33.2919 33
Non-zero popularity (popularity > 0) 12032 13.9242 9.7711 0.4271 34.7673 36.1937 37.1905 37
Zero popularity (popularity == 0) 1409 23.1743 16.4195 NA 16.4195 11.1072 0.0000 0

The saved output shows that removing zero-popularity tracks reduces RMSE from 15.1610 to 13.9242 and MAE from 10.4680 to 9.7711. The zero-popularity subset has an RMSE of 23.1743 and a mean predicted popularity of 16.4195 despite an actual mean of 0. This pattern strongly supports the report’s interpretation that zero-popularity tracks are systematically overpredicted and form a noise floor for the regression objective.

# analysis/master_script.R - zero-popularity diagnostic plot saved for the report
ggsave(file.path(sensitivity_dir, "zero_popularity_actual_vs_pred.png"),
  ggplot(test_preds_all, aes(x = truth, y = estimate, colour = is_zero)) +
    geom_point(alpha = 0.25, size = 0.7, data = test_preds_all %>% filter(!is_zero)) +
    geom_point(alpha = 0.55, size = 1.0, colour = "#e74c3c", data = test_preds_all %>% filter(is_zero)) +
    geom_abline(slope = 1, intercept = 0, colour = "black", linewidth = 0.7, linetype = "dashed") +
    scale_colour_manual(values = c("FALSE" = "#2980b9", "TRUE" = "#e74c3c"),
                        labels = c("FALSE" = "Popularity > 0", "TRUE" = "Popularity = 0"),
                        name = "Track type") +
    labs(
      title = "Actual vs Predicted Popularity - RF Tuned (Test Set)",
      subtitle = sprintf("Red = zero-popularity tracks (n = %d, %.1f%% of test set)", n_zero_test, pct_zero_test),
      x = "Actual Popularity", y = "Predicted Popularity",
      caption = "Dashed = perfect prediction. Zero-popularity tracks are systematically over-predicted."
    ) +
    theme_minimal(base_size = 12) +
    theme(legend.position = "bottom", plot.title = element_text(face = "bold")),
  width = 8, height = 6, dpi = 150)

No separate calibration section is included in the report because the project scripts and saved outputs do not implement a standalone probability-calibration procedure. The available pipeline supports threshold optimisation, final test evaluation, feature-importance analysis, and zero-popularity sensitivity analysis, and the report limits itself to those supported processes.

11.3 Direct Answers To The Research Questions

With test results in hand, the four research questions posed at the outset can now be answered directly.

  1. Which audio features and genre categories are associated with higher popularity? Genre is the dominant predictor in both final models, while the strongest secondary signals are loudness, acousticness, duration, energy, danceability, and instrumentalness.
  2. Can popularity be predicted well enough to support ranking? Yes, as a directional ranking tool rather than as a precise point forecast. The final regression model captures meaningful signal and reduces average error to roughly 10.5 popularity points.
  3. Can the project identify likely hits at a useful operating point? Yes, for screening purposes. The final classifier materially outperforms the no-skill baseline and can reduce a large catalogue to a smaller shortlist for human review, though not with enough precision for fully automatic binary decisions.
  4. What do the final models imply for marketing focus? They imply that genre-level positioning matters more than any single isolated audio feature, while lower instrumentalness and more mainstream production characteristics remain favourable signals within genres.

12 Discussion and Marketing Implications

12.1 Regression Performance in Context

The RF Tuned model achieved R² = 0.458 on the test set, below the 0.60–0.70 range reported in comparable studies that incorporate genre alongside audio features (Jiang, 2024; Saragih, 2023). Three factors are likely responsible for this gap.

  1. Zero-popularity noise floor (10.54% of tracks): A jazz track with high instrumentalness and low loudness could legitimately have popularity 0 (never promoted) or popularity 35 (niche but active). The audio signal is identical; the observed popularity diverges. These structurally ambiguous observations inflate RMSE without providing learnable signal, and they account for roughly one-tenth of the dataset. A post-hoc sensitivity analysis (Script 10) quantifies the effect: excluding the 1,409 zero-popularity tracks in the test set reduces RMSE by 8.2% (15.16 → 13.92). The model assigns these tracks a mean predicted popularity of 16.4, consistent with their audio features resembling low-but-not-zero content; their observed score of zero reflects the absence of promotional activity, playlist placement, or release support, none of which any audio feature can encode.

  2. Audio-only feature ceiling: Popularity driven by marketing investment, social virality, and cultural events leaves no trace in an audio feature vector. Studies achieving R² > 0.70 typically incorporate additional signals: artist metadata (Arora & Rani, 2024), chart history (Zhao et al., 2023), or release recency (Saragih, 2023), none of which are available in the present dataset. The absence of release dates is particularly consequential, given Spotify’s well-documented recency bias in popularity scoring.

  3. Conservative tuning grid: The 3-fold CV combined with a 3×3 parameter grid is less exhaustive than the 5-fold search originally planned. A broader search would be expected to yield modest further improvements.

Taken together, these constraints largely explain the gap. An MAE of 10.47 popularity points with stable generalisation across splits still constitutes a practically useful signal for pre-release track ranking: the model orders the broad middle of the popularity distribution (20–70) reliably and identifies structural patterns that remain consistent across training and test data, including low instrumentalness and high danceability/loudness associated with higher scores.

12.2 Classification Performance in Context

A test PR-AUC of 0.262 represents a 7.5x lift over the no-skill baseline of 0.035 (the hit prevalence). At this class ratio (1:28), a PR-AUC of 0.262 indicates that the model has learned meaningful discriminative structure from audio and genre features, even though a large absolute performance gap from more balanced classification problems remains.

ROC-AUC = 0.885 is consistent with comparable studies: Zhao et al. (2023) reported AUC = 0.91 on a smaller, far more balanced dataset (1:6 imbalance and additional metadata features). The ROC-AUC measure is less sensitive to the minority class at extreme imbalance, so the PR-AUC result is more informative here.

At threshold = 0.19, the model identifies 39% of genuine hit tracks while flagging approximately 3.8% of non-hit tracks as false positives (1 − specificity = 0.038). In a catalogue of 1,000 tracks, the model would:

  • Correctly identify ~14 of the expected 35 hits
  • Generate ~37 false positive recommendations (non-hits flagged as potential hits)

This performance is appropriate for a soft-ranking and shortlisting role: use the model to surface the top-scored tracks for human A&R or marketing review, rather than as a binary green/red decision system. The model substantially reduces the search space for human reviewers.

To frame this in terms a marketing team can act on: without any predictive tool, randomly selecting tracks for promotional investment would encounter approximately 1 genuine hit per 28 tracks, the natural base rate. Deploying the model at threshold = 0.19 raises this to approximately 1 genuine hit per 3.6 tracks promoted, a 7.8x improvement in the return on evaluation effort. For a label processing 500 new submissions in a given quarter, the model reduces the shortlist requiring detailed human review from 500 to roughly 25 candidates, whilst still capturing 39% of the genuine hits in that pool. The value is not in replacing human judgement, but in concentrating it.

No single threshold simultaneously achieves recall > 0.5 and precision > 0.35; this is a property of the hit definition (3.49% positive rate), not a modelling failure. Marketing teams deploying this tool should set explicit expectations about their preferred precision-recall operating point before implementation.

12.3 Genre as the Dominant Signal

The single most actionable finding from both models is the dominance of genre: track_genre accounts for 38% of impurity reduction in regression and 33% in classification, more than all audio features combined. This is consistent with Jung & Mayer (n.d.) who found genre (specifically EDM) to be the dominant predictor on the same 114K Kaggle dataset.

Marketing implication: Promotional decisions guided by genre-level popularity profiles will have more leverage than audio feature optimisation. Specific implications:

  • Genre targeting: Allocate promotional budgets preferentially to tracks in systematically high-popularity genres (chill, anime, Brazil, British). Tracks from Iranian, jazz, and romance catalogues face a structural popularity ceiling driven by the high zero-popularity rate in those genres.
  • Genre-aware decision-making: A track’s predicted popularity should always be evaluated in the context of its genre’s baseline. A chill track with predicted popularity 45 is underperforming its genre; a jazz track with the same score is outperforming its genre’s norm.
  • Production strategy: The K-pop and Latin/Latino genres are consistently high-importance for hit classification. Artists or labels seeking to maximise hit probability should target genre conventions associated with these communities.

A methodological note on genre dominance: When a single variable accounts for 38% of total impurity reduction, a legitimate question arises: is the model primarily learning genre-level popularity averages rather than genuine audio signal? If so, it would offer little beyond a genre-mean lookup table. Three observations mitigate this concern, though they do not resolve it entirely. First, the remaining 62% of impurity reduction comes from continuous audio features and structural predictors, which vary considerably within genres. Second, a genre-blind baseline (DL-003) achieved R² = 0.259: audio features alone account for a meaningful fraction of variance before any genre signal is introduced, suggesting the audio component is not trivially small. Third, the classification model achieves PR-AUC = 0.262 across all genres, including high-hit-rate genres such as anime where a pure genre-mean approach would fail to distinguish among individual tracks. A caveat is warranted, however: impurity-based importance is known to overstate the importance of high-cardinality features (112 OHE dummy columns for track_genre), and permutation importance would provide a more conservative and unbiased assessment. The finding that genre dominates should be treated as a lower bound on audio feature contribution rather than a precise decomposition.

12.4 Audio Feature Guidelines

Within audio features, the consistent signals across both models are:

  • Low instrumentalness is the strongest positive signal. Vocal-forward, song-based content substantially outperforms purely instrumental productions.
  • High loudness and danceability are mildly positive. Both features are associated with higher popularity across multiple studies (Arora & Rani, 2024; Jung & Mayer, n.d.) and are confirmed here.
  • Acousticness is weakly negative. Low-acoustic electronic and production-heavy content outperforms high-acousticness material in the streaming environment, reflecting Spotify’s dominant audience preferences.
  • Explicit content, mode, and time signature carry minimal signal. These features contribute less than 1.5% of impurity reduction each and should not drive production or marketing decisions.

13 Limitations

Five limitations bound what can be claimed from these findings:

  1. Zero-popularity noise floor (10.54%): Tracks with popularity = 0 inflate RMSE without providing learnable signal. A post-hoc sensitivity analysis confirms this quantitatively: excluding the 1,409 zero-popularity tracks from the test set reduces RMSE by 8.2% (15.16 → 13.92). Notably, R² on the non-zero subset (0.427) is lower than the full-set figure (0.458): removing these extreme low-end observations reduces total variance more than residual variance, so the relative fit measure falls even as absolute errors improve. The model cannot distinguish zero-popularity tracks from genuinely low-popularity content on audio features alone; their observed popularity of 0 reflects promotional absence rather than any detectable audio characteristic. Resolving this limitation would require engagement and promotion data outside the audio feature space.

  2. Validation optimism in classification (≈0.050 PR-AUC): Both model selection and threshold selection used the same 15% validation split. The test PR-AUC (0.262) is the reliable unbiased estimate; the validation figure (0.312) is systematically optimistic and should not be reported as the model’s true performance.

  3. Impurity-based importance bias: The feature importance results may overweight track_genre due to its high cardinality (112 OHE dummy columns). Permutation importance would provide an unbiased correction.

  4. Conservative cross-validation: 3-fold CV was used for tuning (a computational trade-off; the methodology specified 5-fold). Reduced fold count increases variance in hyperparameter selection, particularly for the classification problem where the minority class is thinly distributed across folds.

  5. Audio-only feature set: No release-date, artist streaming history, social media engagement, or marketing spend signals are available. These non-audio factors are consistently identified as critical to closing the R² gap towards the 0.70+ range achievable in richer datasets (Arora & Rani, 2024; Zhao et al., 2023).

14 Conclusion

Two predictive objectives were pursued on a cleaned dataset of 89,578 unique Spotify tracks, derived from a 114,000-row raw source with 21 original variables and 113 retained genres after deduplication, using a reproducible R pipeline that covered data cleaning, exploratory analysis, hyperparameter tuning, threshold optimisation, and held-out test evaluation.

Objective 1: Popularity Regression. A tuned Random Forest achieved RMSE = 15.16, MAE = 10.47, and R² = 0.458 on the held-out test set, with negligible degradation from validation (RMSE gap: +0.47). Genre is the single dominant predictor (38% of impurity reduction), followed by acousticness, duration, and loudness. The R² gap relative to comparable studies is principally explained by a 10.54% zero-popularity noise floor and the absence of recency and marketing signals, neither of which can be addressed within an audio-only feature set. At 10.47 popularity points MAE, the model provides a useful directional ranking signal for pre-release promotional prioritisation, particularly for tracks in the 20–70 popularity band.

Objective 2: Hit Classification. A tuned weighted Random Forest achieved PR-AUC = 0.262 and ROC-AUC = 0.885 on the test set at a threshold of 0.19, a 7.5x lift over the no-skill PR-AUC of 0.035. At this threshold, the model identifies 39% of genuine hits whilst maintaining 28% precision. The appropriate use is soft-ranking and shortlisting: surfacing candidate tracks for human A&R or marketing review, rather than binary automated selection. The precision-recall ceiling at 3.49% hit prevalence cannot be overcome by audio features alone.

Cross-cutting finding. Both models confirm that genre is the dominant signal. Audio feature targeting has limited leverage relative to genre-level portfolio decisions. Promotional investment should follow genre-level popularity baselines: directing budgets towards tracks in consistently high-popularity genres (chill, K-pop, Latin, Brazil) and exercising caution with genres characterised by high zero-popularity rates (jazz, Iranian, romance). Low instrumentalness is the single strongest audio feature signal for both regression accuracy and hit classification: vocal-forward productions hold a systematic advantage in the streaming environment.

Future work. Three signal gaps stand out. Release recency is perhaps the most tractable: Spotify’s scoring is recency-biased, and incorporating release dates would likely yield the largest single improvement in R². Artist metadata (historical streaming popularity and social following) ranks as the next highest-value addition based on evidence from comparable studies (Arora & Rani, 2024). Marketing and promotional context (playlist placement, advertising spend, chart history) is harder to obtain but represents the signal category most likely to breach the R² > 0.70 threshold. Permutation-based feature importance should also be computed to provide an unbiased estimate of the genre-versus-audio decomposition.

15 Reference

Arora, S., & Rani, P. (2024). Soundtrack success: Unveiling song popularity patterns using machine learning implementation. SN Computer Science, 5(278). https://doi.org/10.1007/s42979-024-02619-5

Jiang, S. (2024). Predicting music popularity: A machine learning approach using Spotify data. In Proceedings of the 1st International Conference on Modern Logistics and Supply Chain Management (MLSCM 2024) (pp. 324–328). SCITEPRESS. https://doi.org/10.5220/0013330000004558

Jung, N. S., & Mayer, F. (n.d.). Beyond beats: A recipe to song popularity? A machine learning approach [Unpublished student paper]. Digital Science Center, University of Innsbruck. [4, 5] Khan, F., Tarimer, I., Alwageed, H. S., Karadağ, B. C., Fayaz, M., Abdusalomov, A. B., & Cho, Y.-I. (2022). Effect of feature selection on the accuracy of music popularity classification using machine learning algorithms. Electronics, 11(21), 3518. https://doi.org/10.3390/electronics11213518

Saragih, H. S. (2023). Predicting song popularity based on Spotify’s audio features: Insights from the Indonesian streaming users. Journal of Management Analytics, 10(4), 693–709. https://doi.org/10.1080/23270012.2023.2239824

Zhao, M., Harvey, M., Cameron, D., Hopfgartner, F., & Gillet, V. J. (2023). An analysis of classification approaches for hit song prediction using engineered metadata features with lyrics and audio features. In I. Sserwanga, A. Goulding, H. Moulaison-Sandy, J. T. Du, A. L. Soares, V. Hessami, & R. D. Frank (Eds.), Information for a Better World: Normality, Virtuality, Physicality, Inclusivity: 18th International Conference, iConference 2023 (pp. 303–311). Springer. https://doi.org/10.1007/978-3-031-28035-1_21