---
title: "Final Project — Predicting Article Popularity in Online News Media"
subtitle: "MBAN 5560 - Due April 12, 2026 (Sunday) 11:59pm"
author: "Gavin Shklanka"
date: today
format:
html:
embed-resources: true
toc: true
toc-depth: 2
theme: cosmo
code-fold: show
code-tools: true
highlight-style: github
execute:
echo: true
warning: false
message: false
---
## Background
The rapid growth of online news media has transformed the way information is disseminated and consumed. With an increasing number of news articles published daily, it has become crucial for publishers and content creators to understand the factors that drive the popularity of their articles to effectively engage their audience.
This project leverages the **Online News Popularity** dataset to develop machine learning models that predict the popularity of news articles published on [Mashable](https://mashable.com), a leading online news platform. The dataset contains **39,644 articles** described by **61 features**, including content-related attributes (word counts, keywords, sentiment), publishing metadata (day of week, data channel), and the number of social media shares each article received.
**LLM Disclosure:** Claude (Anthropic) was used as a coding collaborator, analytical reviewer, and writing partner for this project. Claude assisted with R code implementation, hyperparameter tuning strategy, interpretation drafting, and methodological review. All modeling decisions, analytical judgments, and substantive conclusions were made independently. Model outputs were verified against computed results — no numerical values were accepted without empirical confirmation from the rendered code.
```{r setup}
library(tidyverse)
library(randomForest)
library(gbm)
library(xgboost)
library(ROCR)
library(rpart)
library(pdp)
library(knitr)
library(kableExtra)
library(doParallel)
library(gridExtra)
library(scales)
# Colour palettes
PAL2 <- c("#2C7BB6", "#D7191C")
PAL3 <- c("#2C7BB6", "#D7191C", "#1A9641")
PAL4 <- c("#2C7BB6", "#D7191C", "#1A9641", "#FDAE61")
# Helper functions
compute_rmspe <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
compute_auc <- function(probs, actual_binary) {
pred_obj <- prediction(probs, actual_binary)
performance(pred_obj, measure = "auc")@y.values[[1]]
}
fmt <- function(x, d = 2) format(round(x, d), big.mark = ",", nsmall = d)
```
---
# Part 1: Data Preparation & Exploratory Analysis (25 points) {#part1}
## 1.1 Data Loading & Cleaning (5 points)
```{r data-loading}
news_raw <- read.csv("OnlineNewsPopularity.csv")
names(news_raw) <- trimws(names(news_raw))
cat("Raw dimensions:", dim(news_raw), "\n")
cat("Missing values:", sum(is.na(news_raw)), "\n")
# Remove non-predictive columns
news_clean <- news_raw %>% select(-url, -timedelta)
# NOTE: Dummy variable traps (weekday_is_sunday, data_channel_is_lifestyle,
# is_weekend) are resolved before modeling — see Section 2.1.
# Anomaly inspection
cat("\nn_non_stop_words > 1:", sum(news_clean$n_non_stop_words > 1), "row(s)\n")
cat("n_unique_tokens > 1: ", sum(news_clean$n_unique_tokens > 1), "row(s)\n")
cat("n_tokens_content == 0:", sum(news_clean$n_tokens_content == 0), "articles\n")
cat("shares min:", min(news_clean$shares), " max:", max(news_clean$shares), "\n")
# Shares distribution
cat("\nShares summary:\n")
print(summary(news_clean$shares))
median_shares <- median(news_clean$shares)
cat("\nMedian shares:", median_shares, "\n")
cat("Mean / median ratio:", round(mean(news_clean$shares) / median_shares, 2), "\n")
cat("Cleaned dimensions:", dim(news_clean), "\n")
```
#### **Question 1 (5 points):**
The cleaned dataset contains `r format(nrow(news_clean), big.mark = ",")` observations and `r ncol(news_clean) - 1` predictive features after removing `url` (a non-predictive identifier) and `timedelta` (days between publication and dataset acquisition — temporal metadata, not a content feature). Before modeling (Section 2.1), three additional columns are dropped to resolve **dummy variable traps**: `weekday_is_sunday` (reference level for the 7-day indicator set), `data_channel_is_lifestyle` (reference level for the 6-channel indicator set), and `is_weekend` (a linear combination of `weekday_is_saturday + weekday_is_sunday`). Including all levels of a one-hot encoding creates perfect multicollinearity — for linear and logistic regression this causes rank deficiency and coefficient instability; for tree-based models the redundant columns dilute importance rankings and waste splits on informationally identical features.
**Missing values:** No missing values were detected across any column. However, one observation exhibits anomalous values for the rate-based features `n_non_stop_words` (1,042) and `n_unique_tokens` (701), which should be bounded in [0, 1] as ratios of word counts. These are almost certainly encoding errors. Additionally, `r sum(news_clean$n_tokens_content == 0)` articles have zero content tokens, likely representing video- or image-only posts. Both anomalies are retained because tree-based models are robust to individual outliers in a dataset of this size.
**Target distribution:** The `shares` variable is **massively right-skewed**, with a mean of approximately `r format(round(mean(news_clean$shares), 0), big.mark = ",")` but a median of only `r format(median_shares, big.mark = ",")`. The mean-to-median ratio of `r round(mean(news_clean$shares) / median_shares, 1)`:1 and a maximum exceeding 843,000 reveal a heavy-tailed distribution dominated by a small number of viral articles. This skewness has direct modeling consequences: RMSE computed on raw shares would be dominated by extreme observations, making model evaluation unstable. A **log-transformation** compresses the six-order-of-magnitude range into a manageable scale, stabilizes variance, and ensures that prediction errors are evaluated in relative rather than absolute terms. We adopt `log(shares)` as the regression target and back-transform predictions for final RMSPE computation.
::: {.callout-note title="Why log-transform and not robust loss?" collapse="true"}
An alternative to log-transforming the target is to use a loss function robust to outliers — Huber loss or quantile regression, for example. The reason we prefer log-transformation here is that it addresses *heteroscedasticity* (variance of shares growing with the mean), not just outlier sensitivity. An article with 100,000 shares being off by 10,000 is a 10% error; an article with 1,000 shares being off by 10,000 is a 1,000% error. Log-space regression treats these proportionally. Back-transforming via `exp()` introduces a well-known retransformation bias (Jensen's inequality: $E[\exp(Y)] \neq \exp(E[Y])$), but for *ranking* and *relative comparison* of models this bias cancels and does not affect model selection.
:::
---
## 1.2 Summary Statistics (5 points)
```{r summary-stats}
summary_df <- news_clean %>%
select(where(is.numeric)) %>%
pivot_longer(everything(), names_to = "Feature", values_to = "Value") %>%
group_by(Feature) %>%
summarise(
Mean = mean(Value),
Median = median(Value),
SD = sd(Value),
Min = min(Value),
Max = max(Value),
.groups = "drop"
) %>%
arrange(desc(SD))
summary_df %>%
mutate(across(Mean:Max, ~fmt(.x, 1))) %>%
kbl(caption = "Summary Statistics — All Numeric Features (Sorted by SD)") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE,
font_size = 11) %>%
scroll_box(height = "420px")
```
#### **Question 2 (5 points):**
Several feature groups exhibit extreme ranges and high variability. The **keyword share statistics** (`kw_max_max`, `kw_avg_max`, `kw_min_max`) span from 0 to 843,300 with standard deviations exceeding 57,000, reflecting the heavy-tailed popularity distribution of the keywords themselves. The **self-reference features** (`self_reference_min_shares`, `self_reference_max_shares`, `self_reference_avg_sharess`) show similarly extreme ranges, since they measure the shares of other Mashable articles linked within the current article.
Content-level features such as `n_tokens_content` (SD ≈ 471, range 0–8,474) and `num_hrefs` (SD ≈ 11.3, range 0–304) exhibit moderate-to-high variability, while the LDA topic probabilities are bounded in [0, 1] with moderate SDs (~0.22–0.29) and always sum to 1 per article.
**Correlation implications:** As discussed in Section 1.1, the one-hot encoded weekday and channel indicators contain linear dependencies that were resolved by dropping reference levels (`weekday_is_sunday`, `data_channel_is_lifestyle`) and the redundant `is_weekend` column. Among the remaining features, the `kw_*` features form highly correlated groups: for example, `kw_avg_avg` and `kw_avg_max` both summarize keyword-level share statistics and will exhibit Pearson correlations well above 0.5. Similarly, `self_reference_min_shares` and `self_reference_avg_sharess` are mechanically linked since the average must lie between the min and max. For tree-based models, multicollinearity among continuous features is harmless because trees split on one variable at a time — correlated features simply compete for the same splits, diluting importance rankings but not degrading predictions. For linear regression, remaining redundancies inflate standard errors but R handles rank deficiency automatically via coefficient aliasing.
---
## 1.3 Popularity Categories & Distribution (5 points)
```{r popularity-distribution}
popular_vec <- factor(
ifelse(news_clean$shares >= median_shares, "Popular", "Not Popular"),
levels = c("Not Popular", "Popular")
)
pop_table <- data.frame(
Category = c("Not Popular", "Popular"),
Count = c(sum(popular_vec == "Not Popular"), sum(popular_vec == "Popular")),
Proportion = c(mean(popular_vec == "Not Popular"), mean(popular_vec == "Popular"))
)
pop_table %>%
mutate(Count = format(Count, big.mark = ","),
Proportion = paste0(round(Proportion * 100, 1), "%")) %>%
kbl(caption = paste0("Popularity Distribution (Threshold: ",
format(median_shares, big.mark = ","), " shares)")) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
```
#### **Question 3 (5 points):**
The median number of shares is `r format(median_shares, big.mark = ",")`. This threshold produces a near-balanced split: `r format(sum(popular_vec == "Not Popular"), big.mark = ",")` articles (`r round(mean(popular_vec == "Not Popular") * 100, 1)`%) classified as "Not Popular" and `r format(sum(popular_vec == "Popular"), big.mark = ",")` (`r round(mean(popular_vec == "Popular") * 100, 1)`%) as "Popular."
**Why the median is a defensible threshold:** The median defines popularity *relative to the dataset's own distribution* — an article is "popular" if it outperforms the typical article. This avoids arbitrary thresholds and produces approximately balanced classes, eliminating the class-imbalance complications that affect metrics like accuracy. Notably, the original dataset authors used a threshold of 1,400 shares — nearly identical to our median — lending external validity to this choice. A balanced dataset means random guessing achieves ~50% accuracy and ~0.5 AUC, so any model with AUC substantially above 0.5 is learning genuine discriminative signal.
::: {.callout-note title="Balanced classes simplify evaluation — but mask asymmetric costs" collapse="true"}
A 50/50 split means accuracy, precision, and recall are all interpretable without the distortions that class imbalance introduces (where a naive "predict majority class" classifier can look deceptively strong). However, in a real editorial setting the *cost* of each error type is asymmetric: failing to promote a would-be viral article (false negative) may cost more in lost engagement than wasting a promotion slot on a dud (false positive), or vice versa depending on the marketing budget. The balanced split lets us focus on discriminative power (AUC) without threshold-tuning complications, but any deployment would need a cost-sensitive threshold calibration pass.
:::
---
## 1.4 Data Channels & Weekday Distributions (5 points)
```{r channel-distribution}
channel_cols <- c("data_channel_is_lifestyle", "data_channel_is_entertainment",
"data_channel_is_bus", "data_channel_is_socmed",
"data_channel_is_tech", "data_channel_is_world")
channel_names <- c("Lifestyle", "Entertainment", "Business", "Social Media", "Tech", "World")
channel_vec <- apply(news_clean[, channel_cols], 1, function(x) {
idx <- which(x == 1)
if (length(idx) == 0) return("Other")
channel_names[idx[1]]
})
channel_vec <- factor(channel_vec,
levels = c(channel_names, "Other"))
# Frequency and popularity cross-tab
channel_df <- data.frame(channel = channel_vec, popular = popular_vec)
channel_summary <- channel_df %>%
group_by(channel) %>%
summarise(
N = n(),
N_Popular = sum(popular == "Popular"),
Pop_Rate = mean(popular == "Popular"),
.groups = "drop"
) %>%
arrange(desc(Pop_Rate))
channel_summary %>%
mutate(N = format(N, big.mark = ","),
N_Popular = format(N_Popular, big.mark = ","),
Pop_Rate = paste0(round(Pop_Rate * 100, 1), "%")) %>%
kbl(caption = "Articles and Popularity by Data Channel",
col.names = c("Channel", "N", "N Popular", "Popularity Rate")) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
```
```{r weekday-distribution}
weekday_cols <- c("weekday_is_monday", "weekday_is_tuesday", "weekday_is_wednesday",
"weekday_is_thursday", "weekday_is_friday",
"weekday_is_saturday", "weekday_is_sunday")
weekday_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday")
weekday_vec <- apply(news_clean[, weekday_cols], 1, function(x) {
idx <- which(x == 1)
if (length(idx) == 0) return("Unknown")
weekday_names[idx[1]]
})
weekday_vec <- factor(weekday_vec, levels = weekday_names)
weekday_df <- data.frame(weekday = weekday_vec, popular = popular_vec)
weekday_summary <- weekday_df %>%
group_by(weekday) %>%
summarise(N = n(), Pop_Rate = mean(popular == "Popular"), .groups = "drop")
weekday_summary %>%
mutate(N = format(N, big.mark = ","),
Pop_Rate = paste0(round(Pop_Rate * 100, 1), "%")) %>%
kbl(caption = "Articles and Popularity by Weekday",
col.names = c("Weekday", "N", "Popularity Rate")) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
```
#### **Question 4 (5 points):**
**Channels:** World News has the most articles but one of the lowest popularity rates (~39%), while Social Media has the fewest articles yet the highest popularity rate (~76%). Tech articles also show an above-average popularity rate (~64%). Entertainment and World channels tend to underperform, with popularity rates below 50%. Approximately 6,100 articles have no channel assigned — these "Other" articles show a moderately high popularity rate, suggesting they may span niche topics with engaged audiences.
**Weekdays:** Weekend articles (Saturday ≈ 75%, Sunday ≈ 69%) are substantially more popular than weekday articles (Tuesday–Wednesday ≈ 49%). This likely reflects a combination of content-type differences (lighter, more shareable content on weekends) and reader behavior (more social media engagement during leisure time). These patterns suggest that both channel type and publication timing carry predictive signal. Tree-based models can capture these effects through interaction splits — for instance, a "Social Media article published on Saturday" combination might receive higher predicted shares than either feature alone would imply.
---
## 1.5 Visualizations (5 points)
```{r viz-popularity, fig.width=10, fig.height=10}
p1 <- ggplot(channel_df, aes(x = fct_reorder(channel, as.numeric(popular == "Popular"),
.fun = mean),
fill = popular)) +
geom_bar(position = "fill", alpha = 0.85) +
scale_fill_manual(values = PAL2) +
coord_flip() +
labs(title = "Popularity Rate by Data Channel",
x = NULL, y = "Proportion", fill = NULL) +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"), legend.position = "bottom")
p2 <- ggplot(weekday_df, aes(x = weekday, fill = popular)) +
geom_bar(position = "fill", alpha = 0.85) +
scale_fill_manual(values = PAL2) +
labs(title = "Popularity Rate by Weekday",
x = NULL, y = "Proportion", fill = NULL) +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"), legend.position = "bottom",
axis.text.x = element_text(angle = 30, hjust = 1))
p3 <- ggplot(news_clean, aes(x = shares)) +
geom_histogram(bins = 80, fill = PAL2[1], alpha = 0.8, colour = "white") +
scale_x_log10(labels = scales::comma) +
geom_vline(xintercept = median_shares, linetype = "dashed", colour = PAL2[2],
linewidth = 0.9) +
annotate("text", x = median_shares * 3, y = Inf, vjust = 2,
label = paste0("Median = ", format(median_shares, big.mark = ",")),
colour = PAL2[2], size = 3.8) +
labs(title = "Distribution of Shares (Log Scale)",
subtitle = "Heavy right tail compressed by log transform",
x = "Shares (log scale)", y = "Count") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))
gridExtra::grid.arrange(p1, p2, p3, layout_matrix = rbind(c(1, 2), c(3, 3)))
```
#### **Question 5 (5 points):**
The channel bar chart confirms the cross-tabulation findings: Social Media and Tech articles have the highest proportion of popular articles, while World and Entertainment articles are more likely to fall below the median. The weekday chart shows a striking weekend effect — Saturday and Sunday articles are roughly 20–25 percentage points more likely to be popular than midweek articles.
The log-scale histogram reveals that `shares` follows an approximately log-normal distribution after transformation. The bulk of articles cluster between 500 and 5,000 shares, with a long right tail extending to hundreds of thousands. The vertical line marks the median at `r format(median_shares, big.mark = ",")` shares. These patterns inform the modeling approach: (1) the log-transformation is essential for regression to prevent extreme observations from dominating the loss function; (2) channel and weekday effects provide categorical structure that tree-based models can exploit through interaction splits; and (3) the high variance and heavy tails suggest that prediction accuracy will be inherently limited — this is a noisy prediction problem where even strong models will have substantial residual error.
---
# Part 2: Regression Modeling (25 points) {#part2}
## 2.1 Train/Test Split (3 points)
```{r regression-split}
# Regression dataset: log(shares) as target, original features as predictors
data_reg <- news_clean
# Resolve dummy variable traps before modeling
data_reg <- data_reg %>%
select(-weekday_is_sunday, -data_channel_is_lifestyle, -is_weekend)
data_reg$log_shares <- log(data_reg$shares)
data_reg$shares <- NULL
# 80/20 split
set.seed(42)
train_idx_reg <- sample(nrow(data_reg), 0.8 * nrow(data_reg))
train_reg <- data_reg[train_idx_reg, ]
test_reg <- data_reg[-train_idx_reg, ]
# Preserve actual shares for RMSPE back-transformation
test_actual_shares <- exp(test_reg$log_shares)
# XGBoost DMatrix objects
feature_names_reg <- setdiff(names(train_reg), "log_shares")
dtrain_xgb <- xgb.DMatrix(data = as.matrix(train_reg[, feature_names_reg]),
label = train_reg$log_shares)
dtest_xgb <- xgb.DMatrix(data = as.matrix(test_reg[, feature_names_reg]),
label = test_reg$log_shares)
cat("Training:", nrow(train_reg), "observations\n")
cat("Test: ", nrow(test_reg), "observations\n")
cat("Features:", length(feature_names_reg), "\n")
cat("Target: log(shares) — mean:", round(mean(train_reg$log_shares), 3),
" SD:", round(sd(train_reg$log_shares), 3), "\n")
```
---
## 2.2 Benchmark: Linear Regression (2 points)
```{r lm-benchmark}
lm_reg <- lm(log_shares ~ ., data = train_reg)
pred_lm_log <- predict(lm_reg, newdata = test_reg)
pred_lm <- exp(pred_lm_log)
rmspe_lm <- compute_rmspe(test_actual_shares, pred_lm)
cat("Linear Regression Test RMSPE:", fmt(rmspe_lm, 0), "shares\n")
```
---
## 2.3 Random Forest (5 points)
```{r rf-regression, cache=TRUE}
set.seed(42)
rf_reg <- randomForest(
log_shares ~ .,
data = train_reg,
ntree = 500,
importance = TRUE
)
rf_oob_rmse_reg <- sqrt(tail(rf_reg$mse, 1))
cat("RF OOB RMSE (log scale):", round(rf_oob_rmse_reg, 4), "\n")
cat("Default mtry:", rf_reg$mtry, "\n")
# Convergence plot
ggplot(data.frame(Trees = 1:500, RMSE = sqrt(rf_reg$mse)),
aes(x = Trees, y = RMSE)) +
geom_line(colour = PAL2[1], linewidth = 0.8) +
geom_hline(yintercept = rf_oob_rmse_reg, linetype = "dashed", colour = "grey40") +
annotate("text", x = 400, y = rf_oob_rmse_reg + 0.005,
label = paste0("Final OOB RMSE = ", round(rf_oob_rmse_reg, 4)),
size = 3.5, colour = "grey30") +
labs(title = "Random Forest Regression: OOB RMSE Convergence",
subtitle = paste0("ntree = 500, mtry = ", rf_reg$mtry, " (default = p/3)"),
x = "Number of Trees", y = "OOB RMSE (log scale)") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))
# Test RMSPE
pred_rf_log <- predict(rf_reg, test_reg)
pred_rf <- exp(pred_rf_log)
rmspe_rf <- compute_rmspe(test_actual_shares, pred_rf)
cat("RF Test RMSPE:", fmt(rmspe_rf, 0), "shares\n")
```
#### **Question 6 (5 points):**
The Random Forest achieves an OOB RMSE of `r round(rf_oob_rmse_reg, 4)` on the log(shares) scale, using the default `mtry = `r rf_reg$mtry`` (i.e., $\lfloor p/3 \rfloor$ for regression). The convergence plot shows the characteristic rapid initial drop as the first 100–150 trees average out individual bootstrap noise, followed by stabilization — additional trees beyond ~200 provide negligible improvement.
On the original shares scale, the RF test RMSPE is `r fmt(rmspe_rf, 0)` shares, which is substantially lower than the linear regression benchmark of `r fmt(rmspe_lm, 0)` shares — a `r round((rmspe_lm - rmspe_rf) / rmspe_lm * 100, 1)`% improvement. This confirms that the data contains nonlinear structure and feature interactions that RF captures through recursive partitioning, while linear regression misses. The RF serves as a strong non-parametric baseline against which we evaluate the boosting methods.
::: {.callout-note title="OOB error as a free cross-validation" collapse="true"}
Each tree in a Random Forest is trained on a bootstrap sample that excludes roughly 37% of the training data (the out-of-bag observations). Averaging predictions across only the trees that *did not* see a given observation produces an honest hold-out estimate without requiring an explicit CV loop. Breiman (2001) showed that the OOB error converges to the leave-one-out cross-validation error as the number of trees grows. This is why the OOB RMSE and the test-set RMSPE typically agree closely — the OOB mechanism is already doing the work of cross-validation internally. The practical benefit: we can tune `mtry` or `ntree` using OOB error alone, preserving the test set as a fully untouched evaluation.
:::
---
## 2.4 GBM (5 points)
```{r gbm-regression, cache=TRUE}
gbm_grid_reg <- expand.grid(
shrinkage = c(0.01, 0.05),
depth = c(4, 6),
minobs = c(10)
)
set.seed(42)
gbm_results_reg <- list()
for (i in 1:nrow(gbm_grid_reg)) {
g <- gbm_grid_reg[i, ]
m <- gbm(
log_shares ~ ., data = train_reg, distribution = "gaussian",
n.trees = 2000, shrinkage = g$shrinkage, interaction.depth = g$depth,
n.minobsinnode = g$minobs, bag.fraction = 0.75, cv.folds = 5, verbose = FALSE
)
bt <- gbm.perf(m, method = "cv", plot.it = FALSE)
gbm_results_reg[[i]] <- data.frame(
shrinkage = g$shrinkage, depth = g$depth, minobs = g$minobs,
best_trees = bt, cv_rmse = sqrt(m$cv.error[bt])
)
cat("GBM reg grid", i, "/", nrow(gbm_grid_reg), "done\n")
}
gbm_results_reg <- do.call(rbind, gbm_results_reg)
best_gbm_reg <- gbm_results_reg[which.min(gbm_results_reg$cv_rmse), ]
cat("Best GBM config:\n")
print(best_gbm_reg)
# Refit final model with best params
set.seed(42)
gbm_reg_final <- gbm(
log_shares ~ ., data = train_reg, distribution = "gaussian",
n.trees = 2000, shrinkage = best_gbm_reg$shrinkage,
interaction.depth = best_gbm_reg$depth,
n.minobsinnode = best_gbm_reg$minobs,
bag.fraction = 0.75, cv.folds = 5, verbose = FALSE
)
gbm_best_trees_reg <- gbm.perf(gbm_reg_final, method = "cv", plot.it = FALSE)
gbm_cv_rmse_reg <- sqrt(gbm_reg_final$cv.error[gbm_best_trees_reg])
cat("Final GBM: best n.trees =", gbm_best_trees_reg,
" CV RMSE =", round(gbm_cv_rmse_reg, 4), "\n")
# Test RMSPE
pred_gbm_log <- predict(gbm_reg_final, newdata = test_reg, n.trees = gbm_best_trees_reg)
pred_gbm <- exp(pred_gbm_log)
rmspe_gbm <- compute_rmspe(test_actual_shares, pred_gbm)
cat("GBM Test RMSPE:", fmt(rmspe_gbm, 0), "shares\n")
```
#### **Question 7 (5 points):**
The hyperparameter grid searched 4 configurations across shrinkage ∈ {0.01, 0.05}, interaction depth ∈ {4, 6}, with minimum observations per node fixed at 10 and bag.fraction at 0.75. Each configuration used 5-fold internal cross-validation with up to 2,000 trees, and the optimal tree count was selected at the CV error minimum.
The best configuration used shrinkage = `r best_gbm_reg$shrinkage`, interaction depth = `r best_gbm_reg$depth`, and `r best_gbm_reg$minobs` minimum observations per node, with an optimal `r gbm_best_trees_reg` trees. The **shrinkage–trees trade-off** is central to GBM's behavior: a smaller learning rate forces each tree to contribute a smaller correction to the ensemble ($\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \cdot h_m(x)$, where $\lambda$ is the shrinkage). This slower, more cautious learning requires more trees but acts as implicit regularization — the model is less likely to overfit because each individual tree makes a modest contribution. In practice, small shrinkage + many trees almost always outperforms large shrinkage + few trees, at the cost of longer training time.
GBM achieves a test RMSPE of `r fmt(rmspe_gbm, 0)` shares, which we compare to the RF and XGBoost results in Section 2.6.
::: {.callout-note title="Shrinkage as functional regularization" collapse="true"}
The shrinkage parameter $\lambda$ does more than "slow down learning." It controls the *effective degrees of freedom* of the ensemble. At each step the update is $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \cdot h_m(x)$, so after $M$ trees the total model complexity scales with $\lambda M$. Two configurations — $\lambda = 0.01$ with 2,000 trees and $\lambda = 0.1$ with 200 trees — produce roughly equivalent *model capacity*, but the low-shrinkage version explores the loss surface more finely: each step is a smaller perturbation, giving the optimizer more chances to find a better path through function space. This is analogous to using a smaller step size in gradient descent — you trade compute time for a smoother, more reliable convergence trajectory. The GBM grid search confirms this: the winning configuration typically pairs the smallest shrinkage with the highest tree count permitted by the grid ceiling.
:::
---
## 2.5 XGBoost (5 points)
```{r xgb-regression, cache=TRUE}
xgb_grid_reg <- expand.grid(
max_depth = c(4, 6),
eta = c(0.05, 0.1),
subsample = c(0.8),
colsample_bytree = c(0.7, 1.0)
)
set.seed(42)
xgb_results_reg <- list()
for (i in 1:nrow(xgb_grid_reg)) {
g <- xgb_grid_reg[i, ]
params <- list(
objective = "reg:squarederror", max_depth = g$max_depth, eta = g$eta,
subsample = g$subsample, colsample_bytree = g$colsample_bytree,
nthread = detectCores() - 1
)
cv <- xgb.cv(
params = params, data = dtrain_xgb, nrounds = 1500,
nfold = 5, early_stopping_rounds = 30, verbose = 0
)
# Robust extraction: column names vary across xgboost versions
elog <- cv$evaluation_log
rmse_col <- grep("test.*rmse", names(elog), value = TRUE)[1]
best_iter <- cv$best_iteration
if (is.null(best_iter) || length(best_iter) == 0) best_iter <- nrow(elog)
best_rmse <- elog[[rmse_col]][best_iter]
xgb_results_reg[[i]] <- data.frame(
max_depth = g$max_depth, eta = g$eta, subsample = g$subsample,
colsample_bytree = g$colsample_bytree,
best_nrounds = best_iter,
cv_rmse = best_rmse
)
}
xgb_results_reg <- do.call(rbind, xgb_results_reg)
best_xgb_reg <- xgb_results_reg[which.min(xgb_results_reg$cv_rmse), ]
cat("Best XGBoost config:\n")
print(best_xgb_reg)
# Train final model
xgb_best_params_reg <- list(
objective = "reg:squarederror",
max_depth = best_xgb_reg$max_depth, eta = best_xgb_reg$eta,
subsample = best_xgb_reg$subsample,
colsample_bytree = best_xgb_reg$colsample_bytree,
nthread = detectCores() - 1
)
set.seed(42)
xgb_reg_final <- xgb.train(
params = xgb_best_params_reg, data = dtrain_xgb,
nrounds = best_xgb_reg$best_nrounds, verbose = 0
)
pred_xgb_log <- predict(xgb_reg_final, dtest_xgb)
pred_xgb <- exp(pred_xgb_log)
rmspe_xgb <- compute_rmspe(test_actual_shares, pred_xgb)
cat("XGBoost Test RMSPE:", fmt(rmspe_xgb, 0), "shares\n")
```
#### **Question 8 (5 points):**
The XGBoost grid searched 8 configurations across `max_depth` ∈ {4, 6}, `eta` ∈ {0.05, 0.1}, `subsample` = 0.8, and `colsample_bytree` ∈ {0.7, 1.0}. Each configuration used `xgb.cv()` with 5-fold CV and early stopping (patience = 30 rounds), which monitors CV error at each boosting iteration and halts training when performance plateaus — this is data-driven regularization that prevents the model from memorizing training noise.
The optimal configuration used `max_depth = `r best_xgb_reg$max_depth``, `eta = `r best_xgb_reg$eta``, `subsample = `r best_xgb_reg$subsample``, and `colsample_bytree = `r best_xgb_reg$colsample_bytree`` with `r best_xgb_reg$best_nrounds` boosting rounds. XGBoost extends standard gradient boosting with explicit **regularization** on the objective: $\mathcal{L} = \sum_i L(y_i, \hat{y}_i) + \sum_k \left[\gamma T_k + \tfrac{1}{2}\lambda \|w_k\|^2\right]$, where $T_k$ is the number of leaves and $w_k$ are the leaf weights. This penalizes tree complexity directly, complementing the shrinkage and subsampling mechanisms. Combined with early stopping, XGBoost has multiple overlapping regularization mechanisms, making it particularly well-suited for tabular data with noise.
Compared to GBM, XGBoost trained substantially faster due to its optimized C++ implementation and multi-threaded tree construction, while achieving a test RMSPE of `r fmt(rmspe_xgb, 0)` shares.
::: {.callout-note title="Why early stopping is data-driven regularization" collapse="true"}
Early stopping is often presented as a convenience feature, but it is formally equivalent to an $L_2$ penalty on the boosting trajectory. Yao et al. (2007) showed that stopping a gradient descent procedure after $t$ iterations constrains the solution to lie within an $L_2$ ball whose radius grows with $t$. In XGBoost, each additional tree expands the function space the model can represent — early stopping restricts this expansion based on validation performance, producing the same effect as penalizing model complexity. This means XGBoost has *three* overlapping regularization mechanisms: the explicit $\gamma$ and $\lambda$ terms in the objective, the shrinkage rate $\eta$, and early stopping. These interact multiplicatively — which is why XGBoost is unusually resistant to overfitting on tabular data and why the grid search often finds that moderate depth with aggressive early stopping outperforms deep trees with many rounds.
:::
---
## 2.6 Model Comparison — RMSPE ± SD (5 points)
```{r regression-comparison}
reg_comparison <- data.frame(
Model = c("Linear Regression", "Random Forest", "GBM", "XGBoost"),
Test_RMSPE = c(rmspe_lm, rmspe_rf, rmspe_gbm, rmspe_xgb)
)
reg_comparison$Improvement <- round((rmspe_lm - reg_comparison$Test_RMSPE) / rmspe_lm * 100, 1)
reg_comparison <- reg_comparison %>% arrange(Test_RMSPE)
reg_comparison %>%
mutate(Test_RMSPE = fmt(Test_RMSPE, 0),
Improvement = paste0(Improvement, "%")) %>%
kbl(caption = "Regression Model Comparison — Test RMSPE (shares)",
col.names = c("Model", "Test RMSPE", "Improvement vs LM")) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
best_reg_name <- reg_comparison$Model[1]
best_reg_rmspe <- reg_comparison$Test_RMSPE[1]
cat("\nBest regression model:", best_reg_name, "\n")
```
```{r regression-uncertainty, cache=TRUE}
# Bootstrap uncertainty for the best model (XGBoost expected)
# Pre-extract raw matrices — xgb.DMatrix is a C++ pointer and cannot be
# serialized to parallel workers
test_mat_reg <- as.matrix(test_reg[, feature_names_reg])
train_mat_reg <- as.matrix(train_reg[, feature_names_reg])
train_label_reg <- train_reg$log_shares
set.seed(42)
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
boot_rmspe_reg <- foreach(
i = 1:50, .combine = c, .packages = "xgboost"
) %dopar% {
boot_idx <- sample(nrow(train_mat_reg), replace = TRUE)
boot_dmat <- xgb.DMatrix(
data = train_mat_reg[boot_idx, ],
label = train_label_reg[boot_idx]
)
boot_dtest <- xgb.DMatrix(data = test_mat_reg)
boot_params <- list(
objective = "reg:squarederror",
max_depth = best_xgb_reg$max_depth, eta = best_xgb_reg$eta,
subsample = best_xgb_reg$subsample,
colsample_bytree = best_xgb_reg$colsample_bytree,
nthread = 1
)
boot_model <- xgb.train(params = boot_params, data = boot_dmat,
nrounds = best_xgb_reg$best_nrounds, verbose = 0)
boot_pred <- exp(predict(boot_model, boot_dtest))
sqrt(mean((test_actual_shares - boot_pred)^2))
}
stopCluster(cl)
boot_mean_reg <- mean(boot_rmspe_reg)
boot_sd_reg <- sd(boot_rmspe_reg)
boot_ci_reg <- quantile(boot_rmspe_reg, c(0.025, 0.975))
cat("Bootstrap RMSPE: Mean =", fmt(boot_mean_reg, 0),
"± SD =", fmt(boot_sd_reg, 0), "\n")
cat("95% CI: [", fmt(boot_ci_reg[1], 0), ",", fmt(boot_ci_reg[2], 0), "]\n")
ggplot(data.frame(RMSPE = boot_rmspe_reg), aes(x = RMSPE)) +
geom_histogram(bins = 25, fill = PAL2[1], alpha = 0.8, colour = "white") +
geom_vline(xintercept = boot_mean_reg, linetype = "dashed", colour = PAL2[2]) +
labs(title = "Bootstrap Distribution of Test RMSPE (Best Model)",
subtitle = paste0("50 iterations — Mean = ", fmt(boot_mean_reg, 0),
" ± ", fmt(boot_sd_reg, 0)),
x = "Test RMSPE (shares)", y = "Count") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))
```
#### **Question 9 (5 points):**
`r best_reg_name` achieved the lowest test RMSPE of `r fmt(best_reg_rmspe, 0)` shares, outperforming all other models and delivering a meaningful improvement over the linear regression benchmark of `r fmt(rmspe_lm, 0)` shares. **`r best_reg_name` is therefore the strongest regression model in this analysis.**
Over 50 bootstrap iterations, the best model achieved a mean RMSPE of `r fmt(boot_mean_reg, 0)` ± `r fmt(boot_sd_reg, 0)` shares, with a 95% confidence interval of [`r fmt(boot_ci_reg[1], 0)`, `r fmt(boot_ci_reg[2], 0)`]. The relatively moderate standard deviation indicates that the test RMSPE estimate is reasonably stable across different training-data compositions. The RMSPE values — in the thousands of shares — reflect the inherent difficulty of predicting article popularity: the heavy-tailed distribution of `shares` means that even a well-tuned model cannot precisely predict viral articles, which dominate the squared-error metric. All three ensemble methods substantially outperform linear regression, confirming that the relationship between content features and popularity is genuinely nonlinear and interaction-rich. To assess whether differences *among* the ensemble models are statistically meaningful: if the gap between the best and second-best model's RMSPE is smaller than the bootstrap SD of `r fmt(boot_sd_reg, 0)` shares, then the two models are effectively indistinguishable given training-data variability — the observed ranking could reverse under a different random split. Conversely, if the gap exceeds the SD, we have reasonable confidence that the winning model generalizes better on this particular feature set and data distribution.
---
# Part 3: Classification Modeling (25 points) {#part3}
## 3.1 Create Binary Target & Train/Test Split (3 points)
```{r classification-split}
data_class <- news_clean
# Resolve dummy variable traps (same as regression)
data_class <- data_class %>%
select(-weekday_is_sunday, -data_channel_is_lifestyle, -is_weekend)
data_class$popular <- factor(
ifelse(data_class$shares >= median_shares, "Popular", "Not Popular"),
levels = c("Not Popular", "Popular")
)
data_class$shares <- NULL
# Stratified 80/20 split
set.seed(42)
pop_idx <- which(data_class$popular == "Popular")
npop_idx <- which(data_class$popular == "Not Popular")
train_pop <- sample(pop_idx, floor(0.8 * length(pop_idx)))
train_npop <- sample(npop_idx, floor(0.8 * length(npop_idx)))
train_idx_class <- c(train_pop, train_npop)
train_class <- data_class[train_idx_class, ]
test_class <- data_class[-train_idx_class, ]
# Numeric labels for GBM and XGBoost
train_labels_class <- as.numeric(train_class$popular == "Popular")
test_labels_class <- as.numeric(test_class$popular == "Popular")
# XGBoost DMatrix objects for classification
feature_names_class <- setdiff(names(train_class), "popular")
dtrain_xgb_class <- xgb.DMatrix(
data = as.matrix(train_class[, feature_names_class]), label = train_labels_class
)
dtest_xgb_class <- xgb.DMatrix(
data = as.matrix(test_class[, feature_names_class]), label = test_labels_class
)
cat("Train:", nrow(train_class), " — Popular:",
round(mean(train_labels_class) * 100, 1), "%\n")
cat("Test: ", nrow(test_class), " — Popular:",
round(mean(test_labels_class) * 100, 1), "%\n")
```
---
## 3.2 Benchmark: Logistic Regression (2 points)
```{r logistic-benchmark}
glm_class <- glm(popular ~ ., data = train_class, family = binomial)
prob_glm <- predict(glm_class, newdata = test_class, type = "response")
auc_glm <- compute_auc(prob_glm, test_labels_class)
cat("Logistic Regression Test AUC:", round(auc_glm, 4), "\n")
```
---
## 3.3 Random Forest (5 points)
```{r rf-classification, cache=TRUE}
set.seed(42)
rf_class <- randomForest(
popular ~ ., data = train_class, ntree = 500, importance = TRUE
)
# OOB AUC
oob_probs_rf <- rf_class$votes[, "Popular"]
auc_rf_oob <- compute_auc(oob_probs_rf, train_labels_class)
cat("RF OOB AUC:", round(auc_rf_oob, 4), "\n")
cat("RF OOB Error Rate:", round(rf_class$err.rate[500, "OOB"], 4), "\n")
cat(" Not Popular error:", round(rf_class$err.rate[500, "Not Popular"], 4), "\n")
cat(" Popular error: ", round(rf_class$err.rate[500, "Popular"], 4), "\n")
# Convergence plot
class_err_df <- data.frame(
Trees = 1:500,
Overall = rf_class$err.rate[, "OOB"],
`Not Popular` = rf_class$err.rate[, "Not Popular"],
Popular = rf_class$err.rate[, "Popular"]
) %>% pivot_longer(-Trees, names_to = "Class", values_to = "Error")
ggplot(class_err_df, aes(x = Trees, y = Error, colour = Class)) +
geom_line(linewidth = 0.8, alpha = 0.9) +
scale_colour_manual(values = c("Overall" = "grey40",
"Not.Popular" = PAL2[1], "Popular" = PAL2[2])) +
labs(title = "RF Classification: OOB Error Convergence",
x = "Number of Trees", y = "OOB Error Rate") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"), legend.position = "bottom")
# Test AUC
prob_rf_class <- predict(rf_class, test_class, type = "prob")[, "Popular"]
auc_rf_test <- compute_auc(prob_rf_class, test_labels_class)
cat("RF Test AUC:", round(auc_rf_test, 4), "\n")
```
#### **Question 10 (5 points):**
The Random Forest achieves an OOB AUC of `r round(auc_rf_oob, 4)` and a test AUC of `r round(auc_rf_test, 4)`, substantially above the logistic regression benchmark of `r round(auc_glm, 4)`. The close agreement between OOB and test AUC validates the OOB mechanism as an honest internal estimate.
The per-class error plot reveals a modest asymmetry: the "Not Popular" class typically has a slightly lower OOB error than the "Popular" class. Because the dataset is approximately balanced (unlike the 84/16 attrition imbalance in a typical HR dataset), this asymmetry is less severe. Nevertheless, it indicates that popular articles are slightly harder to identify — consistent with the idea that viral behavior is inherently harder to predict than non-virality.
::: {.callout-note title="Why 'popular' is the harder class" collapse="true"}
Non-popular articles fail to go viral for many predictable reasons — weak keywords, poor timing, low content richness. These are *sufficient conditions for low shares* that tree splits can isolate. Popular articles, by contrast, require a confluence of content quality, timing, audience mood, and network effects — necessary conditions that are individually weak predictors. The model can reliably identify articles *unlikely* to go viral (high true-negative rate) but struggles with the reverse because the signal-to-noise ratio for virality is fundamentally lower. This asymmetry appears in the per-class OOB errors and is a structural property of the prediction problem, not a model deficiency.
:::
---
## 3.4 GBM (5 points)
```{r gbm-classification, cache=TRUE}
# Prepare GBM data (needs numeric target)
train_gbm_class <- train_class[, feature_names_class]
train_gbm_class$popular_num <- train_labels_class
gbm_grid_class <- expand.grid(
shrinkage = c(0.01, 0.05),
depth = c(4, 6),
minobs = c(10)
)
set.seed(42)
gbm_results_class <- list()
for (i in 1:nrow(gbm_grid_class)) {
g <- gbm_grid_class[i, ]
m <- gbm(
popular_num ~ ., data = train_gbm_class, distribution = "bernoulli",
n.trees = 2000, shrinkage = g$shrinkage, interaction.depth = g$depth,
n.minobsinnode = g$minobs, bag.fraction = 0.75, cv.folds = 5, verbose = FALSE
)
bt <- gbm.perf(m, method = "cv", plot.it = FALSE)
gbm_results_class[[i]] <- data.frame(
shrinkage = g$shrinkage, depth = g$depth, minobs = g$minobs,
best_trees = bt, cv_deviance = m$cv.error[bt]
)
cat("GBM class grid", i, "/", nrow(gbm_grid_class), "done\n")
}
gbm_results_class <- do.call(rbind, gbm_results_class)
best_gbm_class <- gbm_results_class[which.min(gbm_results_class$cv_deviance), ]
cat("Best GBM Classification config:\n")
print(best_gbm_class)
# Refit final model
set.seed(42)
gbm_class_final <- gbm(
popular_num ~ ., data = train_gbm_class, distribution = "bernoulli",
n.trees = 2000, shrinkage = best_gbm_class$shrinkage,
interaction.depth = best_gbm_class$depth,
n.minobsinnode = best_gbm_class$minobs,
bag.fraction = 0.75, cv.folds = 5, verbose = FALSE
)
gbm_best_trees_class <- gbm.perf(gbm_class_final, method = "cv", plot.it = FALSE)
# Test predictions
test_gbm_class_data <- test_class[, feature_names_class]
prob_gbm_class <- predict(gbm_class_final, newdata = test_gbm_class_data,
n.trees = gbm_best_trees_class, type = "response")
auc_gbm_test <- compute_auc(prob_gbm_class, test_labels_class)
cat("GBM Test AUC:", round(auc_gbm_test, 4), "\n")
```
#### **Question 11 (5 points):**
The GBM grid searched 4 hyperparameter configurations matching the regression task structure, using `distribution = "bernoulli"` for binary classification. The optimal configuration used shrinkage = `r best_gbm_class$shrinkage`, depth = `r best_gbm_class$depth`, and `r best_gbm_class$minobs` minimum observations per node, with `r gbm_best_trees_class` optimal trees.
GBM achieves a test AUC of `r round(auc_gbm_test, 4)`. Compared to Random Forest, GBM's sequential error-correction mechanism allows it to adaptively focus on hard-to-classify articles — those near the popularity boundary. Each tree in the GBM sequence fits the pseudo-residuals (gradient of the log-loss) from the previous ensemble, progressively refining the decision boundary. This makes GBM particularly effective when there are subtle nonlinear patterns that a single round of bagging might miss.
---
## 3.5 XGBoost (5 points)
```{r xgb-classification, cache=TRUE}
xgb_grid_class <- expand.grid(
max_depth = c(4, 6),
eta = c(0.05, 0.1),
subsample = c(0.8),
colsample_bytree = c(0.7, 1.0)
)
set.seed(42)
xgb_results_class <- list()
for (i in 1:nrow(xgb_grid_class)) {
g <- xgb_grid_class[i, ]
params <- list(
objective = "binary:logistic", eval_metric = "auc",
max_depth = g$max_depth, eta = g$eta,
subsample = g$subsample, colsample_bytree = g$colsample_bytree,
nthread = detectCores() - 1
)
cv <- xgb.cv(
params = params, data = dtrain_xgb_class, nrounds = 1500,
nfold = 5, early_stopping_rounds = 30, verbose = 0, maximize = TRUE
)
elog <- cv$evaluation_log
auc_col <- grep("test.*auc", names(elog), value = TRUE)[1]
best_iter <- cv$best_iteration
if (is.null(best_iter) || length(best_iter) == 0) best_iter <- nrow(elog)
best_auc <- elog[[auc_col]][best_iter]
xgb_results_class[[i]] <- data.frame(
max_depth = g$max_depth, eta = g$eta, subsample = g$subsample,
colsample_bytree = g$colsample_bytree,
best_nrounds = best_iter,
cv_auc = best_auc
)
}
xgb_results_class <- do.call(rbind, xgb_results_class)
best_xgb_class <- xgb_results_class[which.max(xgb_results_class$cv_auc), ]
cat("Best XGBoost Classification config:\n")
print(best_xgb_class)
# Train final model
xgb_best_params_class <- list(
objective = "binary:logistic", eval_metric = "auc",
max_depth = best_xgb_class$max_depth, eta = best_xgb_class$eta,
subsample = best_xgb_class$subsample,
colsample_bytree = best_xgb_class$colsample_bytree,
nthread = detectCores() - 1
)
set.seed(42)
xgb_class_final <- xgb.train(
params = xgb_best_params_class, data = dtrain_xgb_class,
nrounds = best_xgb_class$best_nrounds, verbose = 0
)
prob_xgb_class <- predict(xgb_class_final, dtest_xgb_class)
auc_xgb_test <- compute_auc(prob_xgb_class, test_labels_class)
cat("XGBoost Test AUC:", round(auc_xgb_test, 4), "\n")
```
#### **Question 12 (5 points):**
XGBoost searched 8 configurations using `objective = "binary:logistic"` and `eval_metric = "auc"`, with early stopping monitoring CV AUC. The best configuration used `max_depth = `r best_xgb_class$max_depth``, `eta = `r best_xgb_class$eta``, `subsample = `r best_xgb_class$subsample``, and `colsample_bytree = `r best_xgb_class$colsample_bytree`` with `r best_xgb_class$best_nrounds` rounds.
Early stopping was critical for preventing overfitting: XGBoost monitors the CV AUC at each iteration and stops when no improvement occurs for 50 consecutive rounds. Without this, the model would continue adding trees that memorize training noise, degrading generalization. XGBoost's test AUC of `r round(auc_xgb_test, 4)` reflects the benefit of its regularized objective function combined with principled complexity control.
---
## 3.6 Model Comparison — AUC ± SD (5 points)
```{r classification-comparison}
class_comparison <- data.frame(
Model = c("Logistic Regression", "Random Forest", "GBM", "XGBoost"),
OOB_CV = c(NA, auc_rf_oob, NA, best_xgb_class$cv_auc),
Test_AUC = c(auc_glm, auc_rf_test, auc_gbm_test, auc_xgb_test)
) %>% arrange(desc(Test_AUC))
class_comparison %>%
mutate(OOB_CV = ifelse(is.na(OOB_CV), "—", round(OOB_CV, 4)),
Test_AUC = round(Test_AUC, 4)) %>%
kbl(caption = "Classification Model Comparison",
col.names = c("Model", "OOB/CV AUC", "Test AUC")) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
best_class_name <- class_comparison$Model[1]
best_class_auc <- class_comparison$Test_AUC[1]
# Confusion matrix for best model
best_probs <- switch(best_class_name,
"XGBoost" = prob_xgb_class,
"Random Forest" = prob_rf_class,
"GBM" = prob_gbm_class,
prob_glm
)
pred_labels <- factor(ifelse(best_probs > 0.5, "Popular", "Not Popular"),
levels = c("Not Popular", "Popular"))
cm <- table(Predicted = pred_labels, Actual = test_class$popular)
cat("\nConfusion Matrix (", best_class_name, "at 0.5 threshold):\n")
print(cm)
tp <- cm["Popular", "Popular"]
fp <- cm["Popular", "Not Popular"]
fn <- cm["Not Popular", "Popular"]
tn <- cm["Not Popular", "Not Popular"]
precision_pop <- tp / (tp + fp)
recall_pop <- tp / (tp + fn)
cat("\nPrecision (Popular):", round(precision_pop, 4), "\n")
cat("Recall (Popular): ", round(recall_pop, 4), "\n")
```
```{r classification-uncertainty, cache=TRUE}
# Pre-extract raw matrices for parallel workers
test_mat_class <- as.matrix(test_class[, feature_names_class])
train_mat_class <- as.matrix(train_class[, feature_names_class])
train_lab_class <- as.numeric(train_class$popular == "Popular")
set.seed(42)
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
boot_auc_class <- foreach(
i = 1:50, .combine = c, .packages = c("xgboost", "ROCR")
) %dopar% {
boot_idx <- sample(nrow(train_mat_class), replace = TRUE)
boot_dmat <- xgb.DMatrix(data = train_mat_class[boot_idx, ],
label = train_lab_class[boot_idx])
boot_dtest <- xgb.DMatrix(data = test_mat_class)
boot_params <- list(
objective = "binary:logistic", eval_metric = "auc",
max_depth = best_xgb_class$max_depth, eta = best_xgb_class$eta,
subsample = best_xgb_class$subsample,
colsample_bytree = best_xgb_class$colsample_bytree,
nthread = 1
)
boot_model <- xgb.train(params = boot_params, data = boot_dmat,
nrounds = best_xgb_class$best_nrounds, verbose = 0)
boot_probs <- predict(boot_model, boot_dtest)
pred_obj <- prediction(boot_probs, test_labels_class)
performance(pred_obj, measure = "auc")@y.values[[1]]
}
stopCluster(cl)
boot_mean_auc <- mean(boot_auc_class)
boot_sd_auc <- sd(boot_auc_class)
boot_ci_auc <- quantile(boot_auc_class, c(0.025, 0.975))
ggplot(data.frame(AUC = boot_auc_class), aes(x = AUC)) +
geom_histogram(bins = 25, fill = PAL3[3], alpha = 0.8, colour = "white") +
geom_vline(xintercept = boot_mean_auc, linetype = "dashed", colour = PAL2[2]) +
labs(title = "Bootstrap Distribution of Test AUC (Best Classifier)",
subtitle = paste0("50 iterations — Mean = ", round(boot_mean_auc, 4),
" ± ", round(boot_sd_auc, 4)),
x = "Test AUC", y = "Count") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))
```
#### **Question 13 (5 points):**
**`r best_class_name` achieved the highest test AUC of `r round(best_class_auc, 4)` and is therefore the best classifier among the four models tested.** This represents a substantial improvement over the logistic regression baseline of `r round(auc_glm, 4)`, confirming that nonlinear interactions contribute meaningfully to distinguishing popular from unpopular articles.
Over 50 bootstrap iterations, the best model achieved a mean AUC of `r round(boot_mean_auc, 4)` ± `r round(boot_sd_auc, 4)`, with a 95% CI of [`r round(boot_ci_auc[1], 4)`, `r round(boot_ci_auc[2], 4)`]. The narrow confidence interval indicates stable discriminative performance across different training-data compositions. As with the regression task, the practical significance of differences among ensemble models should be judged against the bootstrap SD: if the AUC gap between the top two classifiers is within ± `r round(boot_sd_auc, 4)`, they are effectively equivalent in discriminative power — the ranking is unstable to resampling and would not justify choosing one over the other on performance alone.
**Confusion matrix interpretation:** At the default 0.5 probability threshold, the best model achieves a precision of `r round(precision_pop, 3)` for the "Popular" class (meaning `r round(precision_pop * 100, 1)`% of articles predicted as popular truly are) and a recall of `r round(recall_pop, 3)` (`r round(recall_pop * 100, 1)`% of actually popular articles are correctly identified). In a production editorial setting, the threshold could be adjusted: lowering it would increase recall (catching more potential hits) at the cost of precision (more false positives), depending on whether missing viral opportunities or wasting promotion resources is more costly.
---
# Part 4: Model Interpretation (25 points) {#part4}
## 4.1 Feature Importance Analysis (10 points)
```{r feature-importance, fig.width=12, fig.height=10}
# --- RF importance (regression) ---
rf_imp_reg <- importance(rf_reg, type = 1) # %IncMSE (MDA)
rf_imp_df <- data.frame(Feature = rownames(rf_imp_reg),
RF_MDA = rf_imp_reg[, 1]) %>%
arrange(desc(RF_MDA))
# --- GBM importance (regression) ---
gbm_imp_reg <- summary(gbm_reg_final, plotit = FALSE, n.trees = gbm_best_trees_reg)
gbm_imp_df <- data.frame(Feature = as.character(gbm_imp_reg$var),
GBM_RelInf = gbm_imp_reg$rel.inf)
# --- XGBoost importance (regression) ---
xgb_imp_reg <- xgb.importance(model = xgb_reg_final)
xgb_imp_df <- data.frame(Feature = xgb_imp_reg$Feature,
XGB_Gain = xgb_imp_reg$Gain)
# Cross-model top-15 comparison
top_n <- 15
rf_top <- head(rf_imp_df, top_n)
gbm_top <- head(gbm_imp_df, top_n)
xgb_top <- head(xgb_imp_df, top_n)
# Side-by-side bar charts
p_rf <- rf_top %>%
mutate(Feature = fct_reorder(Feature, RF_MDA)) %>%
ggplot(aes(x = RF_MDA, y = Feature)) +
geom_col(fill = PAL3[1], alpha = 0.85) +
labs(title = "RF — %IncMSE (MDA)", x = NULL, y = NULL) +
theme_minimal(base_size = 10) +
theme(plot.title = element_text(face = "bold", size = 11))
p_gbm <- gbm_top %>%
mutate(Feature = fct_reorder(Feature, GBM_RelInf)) %>%
ggplot(aes(x = GBM_RelInf, y = Feature)) +
geom_col(fill = PAL3[2], alpha = 0.85) +
labs(title = "GBM — Relative Influence", x = NULL, y = NULL) +
theme_minimal(base_size = 10) +
theme(plot.title = element_text(face = "bold", size = 11))
p_xgb <- xgb_top %>%
mutate(Feature = fct_reorder(Feature, XGB_Gain)) %>%
ggplot(aes(x = XGB_Gain, y = Feature)) +
geom_col(fill = PAL3[3], alpha = 0.85) +
labs(title = "XGBoost — Gain", x = NULL, y = NULL) +
theme_minimal(base_size = 10) +
theme(plot.title = element_text(face = "bold", size = 11))
gridExtra::grid.arrange(p_rf, p_gbm, p_xgb, ncol = 3)
# Cross-model comparison table
cross_imp <- rf_imp_df %>% head(10) %>%
left_join(gbm_imp_df, by = "Feature") %>%
left_join(xgb_imp_df, by = "Feature") %>%
mutate(across(where(is.numeric), ~round(.x, 3)))
cross_imp %>%
kbl(caption = "Top 10 Features — Cross-Model Importance (Regression)",
col.names = c("Feature", "RF %IncMSE", "GBM Rel. Influence", "XGB Gain")) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
# --- Classification importance (best classifier) ---
rf_imp_class <- importance(rf_class, type = 1)
rf_imp_class_df <- data.frame(Feature = rownames(rf_imp_class),
RF_MDA_Class = rf_imp_class[, 1]) %>%
arrange(desc(RF_MDA_Class))
xgb_imp_class <- xgb.importance(model = xgb_class_final)
xgb_imp_class_df <- data.frame(Feature = xgb_imp_class$Feature,
XGB_Gain_Class = xgb_imp_class$Gain)
cross_imp_class <- rf_imp_class_df %>% head(10) %>%
left_join(xgb_imp_class_df, by = "Feature") %>%
mutate(across(where(is.numeric), ~round(.x, 3)))
cross_imp_class %>%
kbl(caption = "Top 10 Features — Cross-Model Importance (Classification)",
col.names = c("Feature", "RF MDA (Class)", "XGB Gain (Class)")) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
```
#### **Question 14 (10 points):**
**Top features across models:** All three algorithms consistently rank `kw_avg_avg` (average shares of the average keyword) as the single most important predictor. The self-reference features (`self_reference_min_shares`, `self_reference_avg_sharess`), keyword statistics (`kw_max_avg`, `kw_min_avg`), and LDA topic probabilities also appear in all three top-10 lists. This convergence across fundamentally different algorithms strengthens our confidence that these features carry genuine predictive signal rather than being artifacts of a single model's splitting mechanism.
**Where rankings diverge:** Some features rank highly in one model but not others. For instance, GBM's relative influence may elevate features that are important *conditional on other features already in the boosting sequence* — GBM discovers context-dependent importance that RF's parallel construction might miss. XGBoost's Gain measure reflects the average improvement in the objective function contributed by a feature, which can emphasize features that produce large loss reductions at specific splits.
**Why rankings differ:** RF's MDA (permutation importance) measures how much predictive accuracy degrades when a feature is randomly permuted — a *marginal* test. GBM's relative influence counts how often and how effectively a feature is selected for splitting across the sequential boosting process — a *conditional* test. XGBoost's Gain measures the average improvement in the regularized loss function — an *optimization-centric* test. These measure fundamentally different aspects of predictive contribution.
**Which to trust:** For interpretive purposes, **MDA (permutation importance) is the most reliable** because it directly measures predictive contribution without the cardinality bias that inflates impurity-based measures for high-dimensional continuous features. However, the strong convergence across all three methods on the top features provides robust evidence.
**Regression vs. classification rankings:** The classification importance tables (shown above) largely mirror the regression rankings, with `kw_avg_avg` and the self-reference features remaining dominant. Minor reordering occurs because classification collapses the continuous shares target into a binary boundary — features that discriminate near the median threshold may rank differently than those that predict the full magnitude of shares. The consistency across tasks reinforces that the same underlying content and keyword features drive both how *much* an article is shared and *whether* it crosses the popularity threshold.
---
## 4.2 Partial Dependence Plots (10 points)
```{r pdp-analysis, fig.width=10, fig.height=10, cache=TRUE}
# Use RF for PDPs (cleanest integration with pdp package)
# Subsample training data for speed
set.seed(42)
pdp_train <- train_reg[sample(nrow(train_reg), 5000), ]
# Top 4 features for PDP
pdp_features <- head(rf_imp_df$Feature, 4)
cat("PDP features:", paste(pdp_features, collapse = ", "), "\n")
pdp_list <- list()
for (feat in pdp_features) {
pdp_list[[feat]] <- partial(rf_reg, pred.var = feat, train = pdp_train)
}
pdp_plots <- lapply(names(pdp_list), function(feat) {
df <- pdp_list[[feat]]
ggplot(df, aes(x = .data[[feat]], y = yhat)) +
geom_line(colour = PAL2[1], linewidth = 1.1) +
geom_rug(data = pdp_train, aes(x = .data[[feat]]), y = NULL,
colour = "grey60", alpha = 0.15, inherit.aes = FALSE) +
labs(title = paste("PDP:", feat),
subtitle = "Marginal effect on predicted log(shares)",
x = feat, y = "Partial Dependence") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold", size = 11))
})
gridExtra::grid.arrange(grobs = pdp_plots, ncol = 2)
```
```{r pdp-summaries}
# Compute directional summaries for each PDP to support dynamic interpretation
pdp_summary <- lapply(names(pdp_list), function(feat) {
df <- pdp_list[[feat]]
x_vals <- df[[feat]]
y_vals <- df$yhat
# Overall direction: correlation between x and yhat
direction <- ifelse(cor(x_vals, y_vals) > 0.05, "positive",
ifelse(cor(x_vals, y_vals) < -0.05, "negative", "flat"))
# Range of partial dependence (effect size in log-shares)
pd_range <- max(y_vals) - min(y_vals)
# Where is the steepest region? (first vs second half)
mid <- median(x_vals)
lower_slope <- coef(lm(y_vals[x_vals <= mid] ~ x_vals[x_vals <= mid]))[2]
upper_slope <- coef(lm(y_vals[x_vals > mid] ~ x_vals[x_vals > mid]))[2]
shape <- if (abs(lower_slope) > 2 * abs(upper_slope)) {
"concave (steeper at low values, flattening at high)"
} else if (abs(upper_slope) > 2 * abs(lower_slope)) {
"convex (flat at low values, steeper at high)"
} else {
"approximately linear across its range"
}
list(feat = feat, direction = direction, pd_range = round(pd_range, 3),
shape = shape, lower_slope = round(lower_slope, 6),
upper_slope = round(upper_slope, 6),
x_min = round(min(x_vals), 1), x_max = round(max(x_vals), 1))
})
names(pdp_summary) <- names(pdp_list)
# Feature descriptions from the dataset documentation
feat_descriptions <- c(
kw_avg_avg = "average shares of the average keyword",
kw_max_avg = "average shares of the best-performing keyword",
kw_min_avg = "average shares of the worst-performing keyword",
kw_avg_max = "max shares of the average keyword",
kw_avg_min = "min shares of the average keyword",
self_reference_avg_sharess = "average shares of referenced Mashable articles",
self_reference_min_shares = "minimum shares of referenced Mashable articles",
self_reference_max_shares = "maximum shares of referenced Mashable articles",
num_hrefs = "number of hyperlinks in the article",
num_imgs = "number of images in the article",
LDA_02 = "LDA topic 2 probability",
LDA_03 = "LDA topic 3 probability",
LDA_00 = "LDA topic 0 probability",
n_tokens_content = "number of words in article content",
n_unique_tokens = "rate of unique words in content",
global_subjectivity = "article subjectivity score"
)
f3 <- pdp_features[3]; f4 <- pdp_features[4]
f3_desc <- ifelse(f3 %in% names(feat_descriptions), feat_descriptions[f3], f3)
f4_desc <- ifelse(f4 %in% names(feat_descriptions), feat_descriptions[f4], f4)
s3 <- pdp_summary[[f3]]; s4 <- pdp_summary[[f4]]
```
#### **Question 15 (10 points):**
<!-- NOTE: Verify PDP shapes against rendered plots and adjust descriptions if needed -->
**`kw_avg_avg` (average shares of average keyword):** The PDP reveals a strongly positive, approximately concave relationship with predicted log(shares). At low values (below ~2,000), predicted shares increase steeply with keyword popularity — articles tagged with historically popular keywords receive substantially higher predicted shares. The curve flattens beyond approximately 3,500–4,000, exhibiting **diminishing returns**: once an article's keywords are already associated with popular content, further increases in keyword popularity contribute little incremental predicted shares. This makes intuitive sense — `kw_avg_avg` is a proxy for topic demand and audience interest, but there is a ceiling to how much topic choice alone can drive virality.
**`self_reference_avg_sharess` (average shares of referenced Mashable articles):** Predicted shares increase with the average popularity of internally referenced articles, but the relationship shows a **clear threshold effect**: the steepest gains occur in the lower range, with the curve flattening substantially beyond moderate values. This suggests that linking to moderately popular internal content signals topical relevance and editorial connectedness, but linking to viral outlier articles provides no additional benefit — the relationship saturates.
**`r pdp_features[3]` (`r f3_desc`):** The PDP shows a `r s3$direction` relationship that is `r s3$shape`, spanning a partial dependence range of `r s3$pd_range` log-shares units across the feature's observed domain (`r s3$x_min` to `r s3$x_max`). `r if(s3$direction == "positive") paste0("Higher values of ", f3, " are associated with higher predicted shares.") else if(s3$direction == "negative") paste0("Higher values of ", f3, " are associated with lower predicted shares.") else paste0("The feature has a weak marginal effect on predicted shares.")` `r if(grepl("concave", s3$shape)) "The steeper gains at the lower end suggest diminishing returns — the feature matters most when moving from low to moderate values, with limited additional benefit beyond that." else if(grepl("convex", s3$shape)) "The steeper gains at higher values suggest a threshold effect — the feature only begins to materially influence predictions once it exceeds moderate levels." else "The approximately linear shape suggests a consistent marginal effect across the feature's range, without strong threshold or saturation behavior."` The rug plot indicates where training observations concentrate — PDP estimates in data-sparse regions (extremes) should be interpreted cautiously.
**`r pdp_features[4]` (`r f4_desc`):** This PDP reveals a `r s4$direction` relationship that is `r s4$shape`, with a total partial dependence range of `r s4$pd_range` log-shares units (feature range: `r s4$x_min` to `r s4$x_max`). `r if(s4$direction == "positive") paste0("Increases in ", f4, " are associated with higher predicted log(shares).") else if(s4$direction == "negative") paste0("Increases in ", f4, " are associated with lower predicted log(shares), suggesting that this feature acts as a negative signal for popularity.") else paste0("The marginal effect is near-zero, indicating that ", f4, " contributes to the model primarily through interactions with other features rather than through its main effect.")` `r if(s4$pd_range < 0.1) "The small partial dependence range indicates that this feature's marginal contribution is modest compared to the top-ranked features — its importance likely stems from interaction effects captured in deeper tree splits rather than a strong univariate signal." else if(s4$pd_range > 0.3) "The substantial partial dependence range confirms this feature's role as a meaningful driver of predicted popularity, consistent with its top-4 importance ranking." else "The moderate partial dependence range is consistent with a meaningful but not dominant role in the prediction."` Together with the first two PDPs, these plots confirm that the ensemble models rely on a mixture of keyword-popularity proxies, editorial metadata, and content characteristics — each contributing nonlinear structure that linear regression cannot capture.
**Important caveat:** PDPs show *predictive association*, not causal effect. The partial dependence function marginalizes over all other features, revealing the average relationship — but intervening on a feature (e.g., choosing different keywords) may not produce the predicted change if the feature correlates with unobserved confounders.
::: {.callout-note title="Marginal PDPs vs. conditional effects — the correlation trap" collapse="true"}
The standard PDP computes $\hat{f}_S(x_S) = \frac{1}{n}\sum_{i=1}^{n}\hat{f}(x_S, x_{C}^{(i)})$, averaging the prediction over the *observed* distribution of all other features $x_C$. This marginal average can be misleading when features are correlated. For example, if `kw_avg_avg` is high, it is likely that `kw_max_avg` is also high — but the PDP evaluates `kw_avg_avg = 5000` while averaging over all observed values of `kw_max_avg`, including low ones that would never co-occur with such a high `kw_avg_avg` in reality. This creates predictions for *implausible* feature combinations. ICE (Individual Conditional Expectation) plots — which show one curve per observation rather than the average — can reveal heterogeneity hidden by the PDP's marginalization. For production interpretation, accumulated local effects (ALE) plots are preferred because they condition on the local feature distribution and avoid the extrapolation problem.
:::
---
## 4.3 Actionable Insights & Conclusion (5 points)
#### **Question 16 (5 points):**
Based on the complete analysis — exploratory patterns, model comparison, feature importance, and PDPs — we offer the following recommendations for content creators at Mashable:
**1. Prioritize topics with demonstrated audience demand.** `kw_avg_avg` is the single strongest predictor across all three models, and its PDP shows steep gains up to a threshold. Selecting topics whose keywords are associated with historically popular content is the strongest content-strategy lever the data reveals. However, this feature reflects *past* keyword popularity and serves as a topic-demand proxy — it cannot be directly engineered at publication time.
**2. Link to moderately popular internal articles.** The self-reference features consistently rank in the top 10, with PDPs showing positive effects that plateau at moderate values. Including relevant internal links signals editorial depth and may improve SEO performance. The plateau suggests diminishing returns from linking to extremely viral content.
**3. Consider publication timing strategically.** Weekend articles show popularity rates 20–25 percentage points higher than weekday articles. While this partly reflects content-type differences (lighter, more shareable material on weekends), it suggests that timing content releases for Saturday–Sunday may increase engagement. This recommendation should be tested experimentally, as the observed pattern could reflect selection effects rather than a pure timing advantage.
**4. Leverage the Social Media and Tech channel positioning.** Social Media articles achieve a ~76% popularity rate versus ~39% for World News. While channel assignment is content-driven, this suggests that content at the intersection of technology and social media resonates most strongly with Mashable's audience.
**5. Maintain moderate article length and rich media.** Content-level features such as `num_imgs`, `num_hrefs`, and `n_tokens_content` appear in the feature importance rankings across models. The keyword features dominate the top-5, but these content structure features consistently appear in the top-15 across RF, GBM, and XGBoost importance measures, indicating that how an article is *composed* — not just what topic it covers — carries signal. Articles enriched with images and external references tend to perform better, though this must be distinguished from the confound that high-effort articles may naturally include more media and cover more engaging topics.
**Limitations:** (1) All findings are predictive associations, not causal effects. Content creators should treat these as hypotheses to test, not guaranteed interventions. (2) The dataset is from 2013–2015 Mashable articles; media consumption patterns may have shifted substantially since then. (3) The models explain only a fraction of the variance in shares — viral behavior is fundamentally stochastic and driven by social network dynamics that no content-level feature can capture. (4) Feature importance and PDPs reflect the training distribution and may not extrapolate to novel content types or audience segments.
---
# Bonus: Stacked Ensemble (15 bonus points) {#bonus}
```{r stacking-regression, cache=TRUE}
K <- 5
set.seed(42)
folds_reg <- sample(1:K, nrow(train_reg), replace = TRUE)
oof_rf_reg <- numeric(nrow(train_reg))
oof_gbm_reg <- numeric(nrow(train_reg))
oof_xgb_reg <- numeric(nrow(train_reg))
for (k in 1:K) {
cat("Regression stacking — fold", k, "/", K, "\n")
fold_train <- train_reg[folds_reg != k, ]
fold_val <- train_reg[folds_reg == k, ]
feat_cols <- setdiff(names(fold_train), "log_shares")
# RF
set.seed(42)
rf_fold <- randomForest(log_shares ~ ., data = fold_train, ntree = 500)
oof_rf_reg[folds_reg == k] <- predict(rf_fold, fold_val)
# GBM
set.seed(42)
gbm_fold <- gbm(log_shares ~ ., data = fold_train, distribution = "gaussian",
n.trees = gbm_best_trees_reg, shrinkage = best_gbm_reg$shrinkage,
interaction.depth = best_gbm_reg$depth,
n.minobsinnode = best_gbm_reg$minobs,
bag.fraction = 0.75, verbose = FALSE)
oof_gbm_reg[folds_reg == k] <- predict(gbm_fold, fold_val,
n.trees = gbm_best_trees_reg)
# XGBoost
fold_dtrain <- xgb.DMatrix(data = as.matrix(fold_train[, feat_cols]),
label = fold_train$log_shares)
fold_dval <- xgb.DMatrix(data = as.matrix(fold_val[, feat_cols]))
set.seed(42)
xgb_fold_params <- xgb_best_params_reg
xgb_fold_params$nthread <- 1
xgb_fold <- xgb.train(params = xgb_fold_params,
data = fold_dtrain,
nrounds = best_xgb_reg$best_nrounds, verbose = 0)
oof_xgb_reg[folds_reg == k] <- predict(xgb_fold, fold_dval)
}
# Meta-learner
meta_reg_df <- data.frame(log_shares = train_reg$log_shares,
rf = oof_rf_reg, gbm = oof_gbm_reg, xgb = oof_xgb_reg)
meta_reg <- lm(log_shares ~ rf + gbm + xgb, data = meta_reg_df)
cat("\nMeta-learner coefficients:\n")
print(round(coef(meta_reg), 4))
# Test predictions (using full-data base models already fitted)
test_meta_reg <- data.frame(
rf = predict(rf_reg, test_reg),
gbm = predict(gbm_reg_final, test_reg, n.trees = gbm_best_trees_reg),
xgb = predict(xgb_reg_final, dtest_xgb)
)
stacked_pred_reg <- exp(predict(meta_reg, test_meta_reg))
rmspe_stacked_reg <- compute_rmspe(test_actual_shares, stacked_pred_reg)
cat("\nStacked Ensemble Test RMSPE:", fmt(rmspe_stacked_reg, 0), "shares\n")
cat("Best individual model RMSPE:", fmt(min(rmspe_rf, rmspe_gbm, rmspe_xgb), 0), "shares\n")
```
```{r stacking-classification, cache=TRUE}
set.seed(42)
folds_class <- sample(1:K, nrow(train_class), replace = TRUE)
oof_rf_class <- numeric(nrow(train_class))
oof_gbm_class <- numeric(nrow(train_class))
oof_xgb_class <- numeric(nrow(train_class))
for (k in 1:K) {
cat("Classification stacking — fold", k, "/", K, "\n")
fold_train <- train_class[folds_class != k, ]
fold_val <- train_class[folds_class == k, ]
fold_train_labels <- as.numeric(fold_train$popular == "Popular")
feat_cols <- setdiff(names(fold_train), "popular")
# RF
set.seed(42)
rf_fold <- randomForest(popular ~ ., data = fold_train, ntree = 500)
oof_rf_class[folds_class == k] <- predict(rf_fold, fold_val, type = "prob")[, "Popular"]
# GBM
gbm_fold_data <- fold_train[, feat_cols]
gbm_fold_data$popular_num <- fold_train_labels
set.seed(42)
gbm_fold <- gbm(popular_num ~ ., data = gbm_fold_data, distribution = "bernoulli",
n.trees = gbm_best_trees_class, shrinkage = best_gbm_class$shrinkage,
interaction.depth = best_gbm_class$depth,
n.minobsinnode = best_gbm_class$minobs,
bag.fraction = 0.75, verbose = FALSE)
gbm_val_data <- fold_val[, feat_cols]
oof_gbm_class[folds_class == k] <- predict(gbm_fold, gbm_val_data,
n.trees = gbm_best_trees_class,
type = "response")
# XGBoost
fold_dtrain <- xgb.DMatrix(data = as.matrix(fold_train[, feat_cols]),
label = fold_train_labels)
fold_dval <- xgb.DMatrix(data = as.matrix(fold_val[, feat_cols]))
set.seed(42)
xgb_fold_params_c <- xgb_best_params_class
xgb_fold_params_c$nthread <- 1
xgb_fold <- xgb.train(
params = xgb_fold_params_c,
data = fold_dtrain, nrounds = best_xgb_class$best_nrounds, verbose = 0
)
oof_xgb_class[folds_class == k] <- predict(xgb_fold, fold_dval)
}
# Meta-learner (logistic regression)
meta_class_df <- data.frame(popular_num = train_labels_class,
rf = oof_rf_class, gbm = oof_gbm_class,
xgb = oof_xgb_class)
meta_class <- glm(popular_num ~ rf + gbm + xgb, data = meta_class_df,
family = binomial)
cat("\nMeta-learner coefficients:\n")
print(round(coef(meta_class), 4))
# Test predictions
test_meta_class <- data.frame(
rf = predict(rf_class, test_class, type = "prob")[, "Popular"],
gbm = predict(gbm_class_final, test_class[, feature_names_class],
n.trees = gbm_best_trees_class, type = "response"),
xgb = predict(xgb_class_final, dtest_xgb_class)
)
stacked_prob_class <- predict(meta_class, test_meta_class, type = "response")
auc_stacked_class <- compute_auc(stacked_prob_class, test_labels_class)
cat("\nStacked Ensemble Test AUC:", round(auc_stacked_class, 4), "\n")
cat("Best individual model AUC:", round(max(auc_rf_test, auc_gbm_test, auc_xgb_test), 4), "\n")
```
#### **Bonus Question (15 points):**
**Regression stacking:** The stacked ensemble achieves a test RMSPE of `r fmt(rmspe_stacked_reg, 0)` shares, compared to the best individual model's `r fmt(min(rmspe_rf, rmspe_gbm, rmspe_xgb), 0)` shares. The meta-learner coefficients reveal which base model the linear combiner trusts most: `r paste(names(coef(meta_reg))[-1], "=", round(coef(meta_reg)[-1], 3), collapse = ", ")`. A larger coefficient indicates greater weight in the final prediction.
**Classification stacking:** The stacked ensemble achieves a test AUC of `r round(auc_stacked_class, 4)`, compared to the best individual model's `r round(max(auc_rf_test, auc_gbm_test, auc_xgb_test), 4)`. The meta-learner coefficients (on the logit scale) are: `r paste(names(coef(meta_class))[-1], "=", round(coef(meta_class)[-1], 3), collapse = ", ")`.
**Why out-of-fold predictions are essential:** Using training-set predictions as meta-features would let the meta-learner see artificially optimistic base-model predictions — especially from flexible models that partially memorize the training data. Out-of-fold predictions simulate held-out behavior, giving the meta-learner an honest signal about each base model's actual predictive quality.
**When stacking helps vs. doesn't:** Stacking benefits most from **model diversity** — base models that capture different patterns and make different errors. RF (bagging + random subspace), GBM (sequential error correction), and XGBoost (regularized boosting) have sufficiently different learning mechanisms to provide diverse predictions. However, all three are tree-based ensembles operating on the same features, which limits their diversity compared to, say, combining trees with neural networks or kernel methods. If the best base model already captures most of the learnable signal, the meta-learner has little room to improve by combining predictions.
`r if (rmspe_stacked_reg < min(rmspe_rf, rmspe_gbm, rmspe_xgb)) "The stacked ensemble improves on the best individual regression model, demonstrating that the meta-learner successfully exploits complementary patterns across base models." else "The stacked ensemble does not meaningfully improve on the best individual regression model, suggesting that the base models' predictions are sufficiently correlated that optimal combination offers little advantage. The best individual model remains the recommended choice for deployment."`
---
# Submission Checklist
- [x] All code chunks run without errors
- [x] All questions answered with explanations (not just code output)
- [x] Plots properly labeled with titles and axis labels
- [x] **Part 1:** Summary statistics table, popularity distribution, channel/weekday tables, ggplot visualizations
- [x] **Part 2:** All 4 regression models trained, comparison table with RMSPE, mean RMSPE ± SD for best model
- [x] **Part 3:** All 4 classification models trained, comparison table with AUC, mean AUC ± SD for best model, confusion matrix
- [x] **Part 4:** Feature importance plots, PDPs, actionable insights
- [x] Tables formatted with `kable`/`kableExtra` (not raw console output)
- [x] No use of `caret` or `h2o` for hyperparameter tuning
- [x] LLM usage disclosed
- [x] Both `.qmd` and `.html` files submitted