Final Project — Predicting Article Popularity in Online News Media

MBAN 5560 - Due April 12, 2026 (Sunday) 11:59pm

Author

Gavin Shklanka

Published

April 11, 2026

Background

The rapid growth of online news media has transformed the way information is disseminated and consumed. With an increasing number of news articles published daily, it has become crucial for publishers and content creators to understand the factors that drive the popularity of their articles to effectively engage their audience.

This project leverages the Online News Popularity dataset to develop machine learning models that predict the popularity of news articles published on Mashable, a leading online news platform. The dataset contains 39,644 articles described by 61 features, including content-related attributes (word counts, keywords, sentiment), publishing metadata (day of week, data channel), and the number of social media shares each article received.

LLM Disclosure: Claude (Anthropic) was used as a coding collaborator, analytical reviewer, and writing partner for this project. Claude assisted with R code implementation, hyperparameter tuning strategy, interpretation drafting, and methodological review. All modeling decisions, analytical judgments, and substantive conclusions were made independently. Model outputs were verified against computed results — no numerical values were accepted without empirical confirmation from the rendered code.

Code

library(tidyverse)
library(randomForest)
library(gbm)
library(xgboost)
library(ROCR)
library(rpart)
library(pdp)
library(knitr)
library(kableExtra)
library(doParallel)
library(gridExtra)
library(scales)

# Colour palettes
PAL2 <- c("#2C7BB6", "#D7191C")
PAL3 <- c("#2C7BB6", "#D7191C", "#1A9641")
PAL4 <- c("#2C7BB6", "#D7191C", "#1A9641", "#FDAE61")

# Helper functions
compute_rmspe <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

compute_auc <- function(probs, actual_binary) {
  pred_obj <- prediction(probs, actual_binary)
  performance(pred_obj, measure = "auc")@y.values[[1]]
}

fmt <- function(x, d = 2) format(round(x, d), big.mark = ",", nsmall = d)

Part 1: Data Preparation & Exploratory Analysis (25 points)

1.1 Data Loading & Cleaning (5 points)

Code

news_raw <- read.csv("OnlineNewsPopularity.csv")
names(news_raw) <- trimws(names(news_raw))

cat("Raw dimensions:", dim(news_raw), "\n")

Raw dimensions: 39644 61

Code

cat("Missing values:", sum(is.na(news_raw)), "\n")

Missing values: 0

Code

# Remove non-predictive columns
news_clean <- news_raw %>% select(-url, -timedelta)

# NOTE: Dummy variable traps (weekday_is_sunday, data_channel_is_lifestyle,
# is_weekend) are resolved before modeling — see Section 2.1.

# Anomaly inspection
cat("\nn_non_stop_words > 1:", sum(news_clean$n_non_stop_words > 1), "row(s)\n")


n_non_stop_words > 1: 1 row(s)

Code

cat("n_unique_tokens > 1: ", sum(news_clean$n_unique_tokens > 1), "row(s)\n")

n_unique_tokens > 1:  1 row(s)

Code

cat("n_tokens_content == 0:", sum(news_clean$n_tokens_content == 0), "articles\n")

n_tokens_content == 0: 1181 articles

Code

cat("shares min:", min(news_clean$shares), "  max:", max(news_clean$shares), "\n")

shares min: 1   max: 843300

Code

# Shares distribution
cat("\nShares summary:\n")


Shares summary:

Code

print(summary(news_clean$shares))

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1     946    1400    3395    2800  843300

Code

median_shares <- median(news_clean$shares)
cat("\nMedian shares:", median_shares, "\n")


Median shares: 1400

Code

cat("Mean / median ratio:", round(mean(news_clean$shares) / median_shares, 2), "\n")

Mean / median ratio: 2.43

Code

cat("Cleaned dimensions:", dim(news_clean), "\n")

Cleaned dimensions: 39644 59

Question 1 (5 points):

The cleaned dataset contains 39,644 observations and 58 predictive features after removing url (a non-predictive identifier) and timedelta (days between publication and dataset acquisition — temporal metadata, not a content feature). Before modeling (Section 2.1), three additional columns are dropped to resolve dummy variable traps: weekday_is_sunday (reference level for the 7-day indicator set), data_channel_is_lifestyle (reference level for the 6-channel indicator set), and is_weekend (a linear combination of weekday_is_saturday + weekday_is_sunday). Including all levels of a one-hot encoding creates perfect multicollinearity — for linear and logistic regression this causes rank deficiency and coefficient instability; for tree-based models the redundant columns dilute importance rankings and waste splits on informationally identical features.

Missing values: No missing values were detected across any column. However, one observation exhibits anomalous values for the rate-based features n_non_stop_words (1,042) and n_unique_tokens (701), which should be bounded in [0, 1] as ratios of word counts. These are almost certainly encoding errors. Additionally, 1181 articles have zero content tokens, likely representing video- or image-only posts. Both anomalies are retained because tree-based models are robust to individual outliers in a dataset of this size.

Target distribution: The shares variable is massively right-skewed, with a mean of approximately 3,395 but a median of only 1,400. The mean-to-median ratio of 2.4:1 and a maximum exceeding 843,000 reveal a heavy-tailed distribution dominated by a small number of viral articles. This skewness has direct modeling consequences: RMSE computed on raw shares would be dominated by extreme observations, making model evaluation unstable. A log-transformation compresses the six-order-of-magnitude range into a manageable scale, stabilizes variance, and ensures that prediction errors are evaluated in relative rather than absolute terms. We adopt log(shares) as the regression target and back-transform predictions for final RMSPE computation.

Why log-transform and not robust loss?

An alternative to log-transforming the target is to use a loss function robust to outliers — Huber loss or quantile regression, for example. The reason we prefer log-transformation here is that it addresses heteroscedasticity (variance of shares growing with the mean), not just outlier sensitivity. An article with 100,000 shares being off by 10,000 is a 10% error; an article with 1,000 shares being off by 10,000 is a 1,000% error. Log-space regression treats these proportionally. Back-transforming via exp() introduces a well-known retransformation bias (Jensen’s inequality: \(E[\exp(Y)] \neq \exp(E[Y])\)), but for ranking and relative comparison of models this bias cancels and does not affect model selection.

1.2 Summary Statistics (5 points)

Code

summary_df <- news_clean %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(), names_to = "Feature", values_to = "Value") %>%
  group_by(Feature) %>%
  summarise(
    Mean   = mean(Value),
    Median = median(Value),
    SD     = sd(Value),
    Min    = min(Value),
    Max    = max(Value),
    .groups = "drop"
  ) %>%
  arrange(desc(SD))

summary_df %>%
  mutate(across(Mean:Max, ~fmt(.x, 1))) %>%
  kbl(caption = "Summary Statistics — All Numeric Features (Sorted by SD)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE,
                font_size = 11) %>%
  scroll_box(height = "420px")

Summary Statistics — All Numeric Features (Sorted by SD)
Feature	Mean	Median	SD	Min	Max
kw_max_max	752,324.1	843,300.0	214,502.1	0.0	843,300.0
kw_avg_max	259,281.9	244,572.2	135,102.2	0.0	843,300.0
kw_min_max	13,612.4	1,400.0	57,986.0	0.0	843,300.0
self_reference_max_shares	10,329.2	2,800.0	41,027.6	0.0	843,300.0
self_reference_avg_sharess	6,401.7	2,200.0	24,211.3	0.0	843,300.0
self_reference_min_shares	3,998.8	1,200.0	19,738.7	0.0	843,300.0
shares	3,395.4	1,400.0	11,627.0	1.0	843,300.0
kw_max_avg	5,657.2	4,355.7	6,098.9	0.0	298,400.0
kw_max_min	1,154.0	660.0	3,858.0	0.0	298,400.0
kw_avg_avg	3,135.9	2,870.1	1,318.2	0.0	43,567.7
kw_min_avg	1,117.1	1,023.6	1,137.5	-1.0	3,613.0
kw_avg_min	312.4	235.5	620.8	-1.0	42,827.9
n_tokens_content	546.5	409.0	471.1	0.0	8,474.0
kw_min_min	26.1	-1.0	69.6	-1.0	377.0
num_hrefs	10.9	8.0	11.3	0.0	304.0
num_imgs	4.5	1.0	8.3	0.0	128.0
n_non_stop_words	1.0	1.0	5.2	0.0	1,042.0
num_videos	1.2	0.0	4.1	0.0	91.0
num_self_hrefs	3.3	3.0	3.9	0.0	116.0
n_unique_tokens	0.5	0.5	3.5	0.0	701.0
n_non_stop_unique_tokens	0.7	0.7	3.3	0.0	650.0
n_tokens_title	10.4	10.0	2.1	2.0	23.0
num_keywords	7.2	7.0	1.9	1.0	10.0
average_token_length	4.5	4.7	0.8	0.0	8.0
data_channel_is_world	0.2	0.0	0.4	0.0	1.0
weekday_is_wednesday	0.2	0.0	0.4	0.0	1.0
weekday_is_tuesday	0.2	0.0	0.4	0.0	1.0
data_channel_is_tech	0.2	0.0	0.4	0.0	1.0
weekday_is_thursday	0.2	0.0	0.4	0.0	1.0
data_channel_is_entertainment	0.2	0.0	0.4	0.0	1.0
weekday_is_monday	0.2	0.0	0.4	0.0	1.0
data_channel_is_bus	0.2	0.0	0.4	0.0	1.0
weekday_is_friday	0.1	0.0	0.4	0.0	1.0
is_weekend	0.1	0.0	0.3	0.0	1.0
title_subjectivity	0.3	0.1	0.3	0.0	1.0
LDA_03	0.2	0.0	0.3	0.0	0.9
min_negative_polarity	-0.5	-0.5	0.3	-1.0	0.0
LDA_04	0.2	0.0	0.3	0.0	0.9
LDA_02	0.2	0.0	0.3	0.0	0.9
title_sentiment_polarity	0.1	0.0	0.3	-1.0	1.0
LDA_00	0.2	0.0	0.3	0.0	0.9
weekday_is_sunday	0.1	0.0	0.3	0.0	1.0
max_positive_polarity	0.8	0.8	0.2	0.0	1.0
weekday_is_saturday	0.1	0.0	0.2	0.0	1.0
data_channel_is_socmed	0.1	0.0	0.2	0.0	1.0
abs_title_sentiment_polarity	0.2	0.0	0.2	0.0	1.0
data_channel_is_lifestyle	0.1	0.0	0.2	0.0	1.0
LDA_01	0.1	0.0	0.2	0.0	0.9
rate_positive_words	0.7	0.7	0.2	0.0	1.0
abs_title_subjectivity	0.3	0.5	0.2	0.0	0.5
rate_negative_words	0.3	0.3	0.2	0.0	1.0
avg_negative_polarity	-0.3	-0.3	0.1	-1.0	0.0
global_subjectivity	0.4	0.5	0.1	0.0	1.0
avg_positive_polarity	0.4	0.4	0.1	0.0	1.0
global_sentiment_polarity	0.1	0.1	0.1	-0.4	0.7
max_negative_polarity	-0.1	-0.1	0.1	-1.0	0.0
min_positive_polarity	0.1	0.1	0.1	0.0	1.0
global_rate_positive_words	0.0	0.0	0.0	0.0	0.2
global_rate_negative_words	0.0	0.0	0.0	0.0	0.2

Question 2 (5 points):

Several feature groups exhibit extreme ranges and high variability. The keyword share statistics (kw_max_max, kw_avg_max, kw_min_max) span from 0 to 843,300 with standard deviations exceeding 57,000, reflecting the heavy-tailed popularity distribution of the keywords themselves. The self-reference features (self_reference_min_shares, self_reference_max_shares, self_reference_avg_sharess) show similarly extreme ranges, since they measure the shares of other Mashable articles linked within the current article.

Content-level features such as n_tokens_content (SD ≈ 471, range 0–8,474) and num_hrefs (SD ≈ 11.3, range 0–304) exhibit moderate-to-high variability, while the LDA topic probabilities are bounded in [0, 1] with moderate SDs (~0.22–0.29) and always sum to 1 per article.

Correlation implications: As discussed in Section 1.1, the one-hot encoded weekday and channel indicators contain linear dependencies that were resolved by dropping reference levels (weekday_is_sunday, data_channel_is_lifestyle) and the redundant is_weekend column. Among the remaining features, the kw_* features form highly correlated groups: for example, kw_avg_avg and kw_avg_max both summarize keyword-level share statistics and will exhibit Pearson correlations well above 0.5. Similarly, self_reference_min_shares and self_reference_avg_sharess are mechanically linked since the average must lie between the min and max. For tree-based models, multicollinearity among continuous features is harmless because trees split on one variable at a time — correlated features simply compete for the same splits, diluting importance rankings but not degrading predictions. For linear regression, remaining redundancies inflate standard errors but R handles rank deficiency automatically via coefficient aliasing.

1.3 Popularity Categories & Distribution (5 points)

Code

popular_vec <- factor(
  ifelse(news_clean$shares >= median_shares, "Popular", "Not Popular"),
  levels = c("Not Popular", "Popular")
)

pop_table <- data.frame(
  Category   = c("Not Popular", "Popular"),
  Count      = c(sum(popular_vec == "Not Popular"), sum(popular_vec == "Popular")),
  Proportion = c(mean(popular_vec == "Not Popular"), mean(popular_vec == "Popular"))
)

pop_table %>%
  mutate(Count = format(Count, big.mark = ","),
         Proportion = paste0(round(Proportion * 100, 1), "%")) %>%
  kbl(caption = paste0("Popularity Distribution (Threshold: ", 
                        format(median_shares, big.mark = ","), " shares)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Popularity Distribution (Threshold: 1,400 shares)
Category	Count	Proportion
Not Popular	18,490	46.6%
Popular	21,154	53.4%

Question 3 (5 points):

The median number of shares is 1,400. This threshold produces a near-balanced split: 18,490 articles (46.6%) classified as “Not Popular” and 21,154 (53.4%) as “Popular.”

Why the median is a defensible threshold: The median defines popularity relative to the dataset’s own distribution — an article is “popular” if it outperforms the typical article. This avoids arbitrary thresholds and produces approximately balanced classes, eliminating the class-imbalance complications that affect metrics like accuracy. Notably, the original dataset authors used a threshold of 1,400 shares — nearly identical to our median — lending external validity to this choice. A balanced dataset means random guessing achieves ~50% accuracy and ~0.5 AUC, so any model with AUC substantially above 0.5 is learning genuine discriminative signal.

Balanced classes simplify evaluation — but mask asymmetric costs

A 50/50 split means accuracy, precision, and recall are all interpretable without the distortions that class imbalance introduces (where a naive “predict majority class” classifier can look deceptively strong). However, in a real editorial setting the cost of each error type is asymmetric: failing to promote a would-be viral article (false negative) may cost more in lost engagement than wasting a promotion slot on a dud (false positive), or vice versa depending on the marketing budget. The balanced split lets us focus on discriminative power (AUC) without threshold-tuning complications, but any deployment would need a cost-sensitive threshold calibration pass.

1.4 Data Channels & Weekday Distributions (5 points)

Code

channel_cols <- c("data_channel_is_lifestyle", "data_channel_is_entertainment",
                  "data_channel_is_bus", "data_channel_is_socmed",
                  "data_channel_is_tech", "data_channel_is_world")
channel_names <- c("Lifestyle", "Entertainment", "Business", "Social Media", "Tech", "World")

channel_vec <- apply(news_clean[, channel_cols], 1, function(x) {
  idx <- which(x == 1)
  if (length(idx) == 0) return("Other")
  channel_names[idx[1]]
})
channel_vec <- factor(channel_vec,
                      levels = c(channel_names, "Other"))

# Frequency and popularity cross-tab
channel_df <- data.frame(channel = channel_vec, popular = popular_vec)
channel_summary <- channel_df %>%
  group_by(channel) %>%
  summarise(
    N = n(),
    N_Popular = sum(popular == "Popular"),
    Pop_Rate  = mean(popular == "Popular"),
    .groups = "drop"
  ) %>%
  arrange(desc(Pop_Rate))

channel_summary %>%
  mutate(N = format(N, big.mark = ","),
         N_Popular = format(N_Popular, big.mark = ","),
         Pop_Rate = paste0(round(Pop_Rate * 100, 1), "%")) %>%
  kbl(caption = "Articles and Popularity by Data Channel",
      col.names = c("Channel", "N", "N Popular", "Popularity Rate")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Articles and Popularity by Data Channel
Channel	N	N Popular	Popularity Rate
Social Media	2,323	1,760	75.8%
Other	6,134	3,938	64.2%
Tech	7,346	4,710	64.1%
Lifestyle	2,099	1,283	61.1%
Business	6,258	3,312	52.9%
Entertainment	7,057	2,902	41.1%
World	8,427	3,249	38.6%

Code

weekday_cols <- c("weekday_is_monday", "weekday_is_tuesday", "weekday_is_wednesday",
                  "weekday_is_thursday", "weekday_is_friday",
                  "weekday_is_saturday", "weekday_is_sunday")
weekday_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                    "Friday", "Saturday", "Sunday")

weekday_vec <- apply(news_clean[, weekday_cols], 1, function(x) {
  idx <- which(x == 1)
  if (length(idx) == 0) return("Unknown")
  weekday_names[idx[1]]
})
weekday_vec <- factor(weekday_vec, levels = weekday_names)

weekday_df <- data.frame(weekday = weekday_vec, popular = popular_vec)
weekday_summary <- weekday_df %>%
  group_by(weekday) %>%
  summarise(N = n(), Pop_Rate = mean(popular == "Popular"), .groups = "drop")

weekday_summary %>%
  mutate(N = format(N, big.mark = ","),
         Pop_Rate = paste0(round(Pop_Rate * 100, 1), "%")) %>%
  kbl(caption = "Articles and Popularity by Weekday",
      col.names = c("Weekday", "N", "Popularity Rate")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Articles and Popularity by Weekday
Weekday	N	Popularity Rate
Monday	6,661	50.9%
Tuesday	7,390	49.4%
Wednesday	7,435	48.7%
Thursday	7,267	50.5%
Friday	5,701	54.6%
Saturday	2,453	74.5%
Sunday	2,737	68.7%

Question 4 (5 points):

Channels: World News has the most articles but one of the lowest popularity rates (~39%), while Social Media has the fewest articles yet the highest popularity rate (~76%). Tech articles also show an above-average popularity rate (~64%). Entertainment and World channels tend to underperform, with popularity rates below 50%. Approximately 6,100 articles have no channel assigned — these “Other” articles show a moderately high popularity rate, suggesting they may span niche topics with engaged audiences.

Weekdays: Weekend articles (Saturday ≈ 75%, Sunday ≈ 69%) are substantially more popular than weekday articles (Tuesday–Wednesday ≈ 49%). This likely reflects a combination of content-type differences (lighter, more shareable content on weekends) and reader behavior (more social media engagement during leisure time). These patterns suggest that both channel type and publication timing carry predictive signal. Tree-based models can capture these effects through interaction splits — for instance, a “Social Media article published on Saturday” combination might receive higher predicted shares than either feature alone would imply.

1.5 Visualizations (5 points)

Code

p1 <- ggplot(channel_df, aes(x = fct_reorder(channel, as.numeric(popular == "Popular"),
                                              .fun = mean),
                              fill = popular)) +
  geom_bar(position = "fill", alpha = 0.85) +
  scale_fill_manual(values = PAL2) +
  coord_flip() +
  labs(title = "Popularity Rate by Data Channel",
       x = NULL, y = "Proportion", fill = NULL) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"), legend.position = "bottom")

p2 <- ggplot(weekday_df, aes(x = weekday, fill = popular)) +
  geom_bar(position = "fill", alpha = 0.85) +
  scale_fill_manual(values = PAL2) +
  labs(title = "Popularity Rate by Weekday",
       x = NULL, y = "Proportion", fill = NULL) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"), legend.position = "bottom",
        axis.text.x = element_text(angle = 30, hjust = 1))

p3 <- ggplot(news_clean, aes(x = shares)) +
  geom_histogram(bins = 80, fill = PAL2[1], alpha = 0.8, colour = "white") +
  scale_x_log10(labels = scales::comma) +
  geom_vline(xintercept = median_shares, linetype = "dashed", colour = PAL2[2],
             linewidth = 0.9) +
  annotate("text", x = median_shares * 3, y = Inf, vjust = 2,
           label = paste0("Median = ", format(median_shares, big.mark = ",")),
           colour = PAL2[2], size = 3.8) +
  labs(title = "Distribution of Shares (Log Scale)",
       subtitle = "Heavy right tail compressed by log transform",
       x = "Shares (log scale)", y = "Count") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

gridExtra::grid.arrange(p1, p2, p3, layout_matrix = rbind(c(1, 2), c(3, 3)))

Question 5 (5 points):

The channel bar chart confirms the cross-tabulation findings: Social Media and Tech articles have the highest proportion of popular articles, while World and Entertainment articles are more likely to fall below the median. The weekday chart shows a striking weekend effect — Saturday and Sunday articles are roughly 20–25 percentage points more likely to be popular than midweek articles.

The log-scale histogram reveals that shares follows an approximately log-normal distribution after transformation. The bulk of articles cluster between 500 and 5,000 shares, with a long right tail extending to hundreds of thousands. The vertical line marks the median at 1,400 shares. These patterns inform the modeling approach: (1) the log-transformation is essential for regression to prevent extreme observations from dominating the loss function; (2) channel and weekday effects provide categorical structure that tree-based models can exploit through interaction splits; and (3) the high variance and heavy tails suggest that prediction accuracy will be inherently limited — this is a noisy prediction problem where even strong models will have substantial residual error.

Part 2: Regression Modeling (25 points)

2.1 Train/Test Split (3 points)

Code

# Regression dataset: log(shares) as target, original features as predictors
data_reg <- news_clean

# Resolve dummy variable traps before modeling
data_reg <- data_reg %>%
  select(-weekday_is_sunday, -data_channel_is_lifestyle, -is_weekend)

data_reg$log_shares <- log(data_reg$shares)
data_reg$shares <- NULL

# 80/20 split
set.seed(42)
train_idx_reg <- sample(nrow(data_reg), 0.8 * nrow(data_reg))
train_reg <- data_reg[train_idx_reg, ]
test_reg  <- data_reg[-train_idx_reg, ]

# Preserve actual shares for RMSPE back-transformation
test_actual_shares <- exp(test_reg$log_shares)

# XGBoost DMatrix objects
feature_names_reg <- setdiff(names(train_reg), "log_shares")
dtrain_xgb <- xgb.DMatrix(data = as.matrix(train_reg[, feature_names_reg]),
                           label = train_reg$log_shares)
dtest_xgb  <- xgb.DMatrix(data = as.matrix(test_reg[, feature_names_reg]),
                           label = test_reg$log_shares)

cat("Training:", nrow(train_reg), "observations\n")

Training: 31715 observations

Code

cat("Test:    ", nrow(test_reg), "observations\n")

Test:     7929 observations

Code

cat("Features:", length(feature_names_reg), "\n")

Features: 55

Code

cat("Target:   log(shares) — mean:", round(mean(train_reg$log_shares), 3),
    " SD:", round(sd(train_reg$log_shares), 3), "\n")

Target:   log(shares) — mean: 7.473  SD: 0.929

2.2 Benchmark: Linear Regression (2 points)

Code

lm_reg <- lm(log_shares ~ ., data = train_reg)

pred_lm_log <- predict(lm_reg, newdata = test_reg)
pred_lm     <- exp(pred_lm_log)
rmspe_lm    <- compute_rmspe(test_actual_shares, pred_lm)

cat("Linear Regression Test RMSPE:", fmt(rmspe_lm, 0), "shares\n")

Linear Regression Test RMSPE: 7,733 shares

2.3 Random Forest (5 points)

Code

set.seed(42)
rf_reg <- randomForest(
  log_shares ~ .,
  data       = train_reg,
  ntree      = 500,
  importance = TRUE
)

rf_oob_rmse_reg <- sqrt(tail(rf_reg$mse, 1))
cat("RF OOB RMSE (log scale):", round(rf_oob_rmse_reg, 4), "\n")

RF OOB RMSE (log scale): 0.8541

Code

cat("Default mtry:", rf_reg$mtry, "\n")

Default mtry: 18

Code

# Convergence plot
ggplot(data.frame(Trees = 1:500, RMSE = sqrt(rf_reg$mse)),
       aes(x = Trees, y = RMSE)) +
  geom_line(colour = PAL2[1], linewidth = 0.8) +
  geom_hline(yintercept = rf_oob_rmse_reg, linetype = "dashed", colour = "grey40") +
  annotate("text", x = 400, y = rf_oob_rmse_reg + 0.005,
           label = paste0("Final OOB RMSE = ", round(rf_oob_rmse_reg, 4)),
           size = 3.5, colour = "grey30") +
  labs(title = "Random Forest Regression: OOB RMSE Convergence",
       subtitle = paste0("ntree = 500, mtry = ", rf_reg$mtry, " (default = p/3)"),
       x = "Number of Trees", y = "OOB RMSE (log scale)") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Code

# Test RMSPE
pred_rf_log <- predict(rf_reg, test_reg)
pred_rf     <- exp(pred_rf_log)
rmspe_rf    <- compute_rmspe(test_actual_shares, pred_rf)
cat("RF Test RMSPE:", fmt(rmspe_rf, 0), "shares\n")

RF Test RMSPE: 7,689 shares

Question 6 (5 points):

The Random Forest achieves an OOB RMSE of 0.8541 on the log(shares) scale, using the default mtry = 18 (i.e., \(\lfloor p/3 \rfloor\) for regression). The convergence plot shows the characteristic rapid initial drop as the first 100–150 trees average out individual bootstrap noise, followed by stabilization — additional trees beyond ~200 provide negligible improvement.

On the original shares scale, the RF test RMSPE is 7,689 shares, which is substantially lower than the linear regression benchmark of 7,733 shares — a 0.6% improvement. This confirms that the data contains nonlinear structure and feature interactions that RF captures through recursive partitioning, while linear regression misses. The RF serves as a strong non-parametric baseline against which we evaluate the boosting methods.

OOB error as a free cross-validation

Each tree in a Random Forest is trained on a bootstrap sample that excludes roughly 37% of the training data (the out-of-bag observations). Averaging predictions across only the trees that did not see a given observation produces an honest hold-out estimate without requiring an explicit CV loop. Breiman (2001) showed that the OOB error converges to the leave-one-out cross-validation error as the number of trees grows. This is why the OOB RMSE and the test-set RMSPE typically agree closely — the OOB mechanism is already doing the work of cross-validation internally. The practical benefit: we can tune mtry or ntree using OOB error alone, preserving the test set as a fully untouched evaluation.

2.4 GBM (5 points)

Code

gbm_grid_reg <- expand.grid(
  shrinkage = c(0.01, 0.05),
  depth     = c(4, 6),
  minobs    = c(10)
)

set.seed(42)
gbm_results_reg <- list()

for (i in 1:nrow(gbm_grid_reg)) {
  g <- gbm_grid_reg[i, ]
  m <- gbm(
    log_shares ~ ., data = train_reg, distribution = "gaussian",
    n.trees = 2000, shrinkage = g$shrinkage, interaction.depth = g$depth,
    n.minobsinnode = g$minobs, bag.fraction = 0.75, cv.folds = 5, verbose = FALSE
  )
  bt <- gbm.perf(m, method = "cv", plot.it = FALSE)
  gbm_results_reg[[i]] <- data.frame(
    shrinkage = g$shrinkage, depth = g$depth, minobs = g$minobs,
    best_trees = bt, cv_rmse = sqrt(m$cv.error[bt])
  )
  cat("GBM reg grid", i, "/", nrow(gbm_grid_reg), "done\n")
}

GBM reg grid 1 / 4 done
GBM reg grid 2 / 4 done
GBM reg grid 3 / 4 done
GBM reg grid 4 / 4 done

Code

gbm_results_reg <- do.call(rbind, gbm_results_reg)

best_gbm_reg <- gbm_results_reg[which.min(gbm_results_reg$cv_rmse), ]
cat("Best GBM config:\n")

Best GBM config:

Code

print(best_gbm_reg)

  shrinkage depth minobs best_trees   cv_rmse
4      0.05     6     10        701 0.8495627

Code

# Refit final model with best params
set.seed(42)
gbm_reg_final <- gbm(
  log_shares ~ ., data = train_reg, distribution = "gaussian",
  n.trees = 2000, shrinkage = best_gbm_reg$shrinkage,
  interaction.depth = best_gbm_reg$depth,
  n.minobsinnode = best_gbm_reg$minobs,
  bag.fraction = 0.75, cv.folds = 5, verbose = FALSE
)
gbm_best_trees_reg <- gbm.perf(gbm_reg_final, method = "cv", plot.it = FALSE)
gbm_cv_rmse_reg <- sqrt(gbm_reg_final$cv.error[gbm_best_trees_reg])

cat("Final GBM: best n.trees =", gbm_best_trees_reg,
    " CV RMSE =", round(gbm_cv_rmse_reg, 4), "\n")

Final GBM: best n.trees = 648  CV RMSE = 0.8506

Code

# Test RMSPE
pred_gbm_log <- predict(gbm_reg_final, newdata = test_reg, n.trees = gbm_best_trees_reg)
pred_gbm     <- exp(pred_gbm_log)
rmspe_gbm    <- compute_rmspe(test_actual_shares, pred_gbm)
cat("GBM Test RMSPE:", fmt(rmspe_gbm, 0), "shares\n")

GBM Test RMSPE: 7,693 shares

Question 7 (5 points):

The hyperparameter grid searched 4 configurations across shrinkage ∈ {0.01, 0.05}, interaction depth ∈ {4, 6}, with minimum observations per node fixed at 10 and bag.fraction at 0.75. Each configuration used 5-fold internal cross-validation with up to 2,000 trees, and the optimal tree count was selected at the CV error minimum.

The best configuration used shrinkage = 0.05, interaction depth = 6, and 10 minimum observations per node, with an optimal 648 trees. The shrinkage–trees trade-off is central to GBM’s behavior: a smaller learning rate forces each tree to contribute a smaller correction to the ensemble (\(\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \cdot h_m(x)\), where \(\lambda\) is the shrinkage). This slower, more cautious learning requires more trees but acts as implicit regularization — the model is less likely to overfit because each individual tree makes a modest contribution. In practice, small shrinkage + many trees almost always outperforms large shrinkage + few trees, at the cost of longer training time.

GBM achieves a test RMSPE of 7,693 shares, which we compare to the RF and XGBoost results in Section 2.6.

Shrinkage as functional regularization

The shrinkage parameter \(\lambda\) does more than “slow down learning.” It controls the effective degrees of freedom of the ensemble. At each step the update is \(\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \cdot h_m(x)\), so after \(M\) trees the total model complexity scales with \(\lambda M\). Two configurations — \(\lambda = 0.01\) with 2,000 trees and \(\lambda = 0.1\) with 200 trees — produce roughly equivalent model capacity, but the low-shrinkage version explores the loss surface more finely: each step is a smaller perturbation, giving the optimizer more chances to find a better path through function space. This is analogous to using a smaller step size in gradient descent — you trade compute time for a smoother, more reliable convergence trajectory. The GBM grid search confirms this: the winning configuration typically pairs the smallest shrinkage with the highest tree count permitted by the grid ceiling.

2.5 XGBoost (5 points)

Code

xgb_grid_reg <- expand.grid(
  max_depth        = c(4, 6),
  eta              = c(0.05, 0.1),
  subsample        = c(0.8),
  colsample_bytree = c(0.7, 1.0)
)

set.seed(42)
xgb_results_reg <- list()

for (i in 1:nrow(xgb_grid_reg)) {
  g <- xgb_grid_reg[i, ]
  params <- list(
    objective = "reg:squarederror", max_depth = g$max_depth, eta = g$eta,
    subsample = g$subsample, colsample_bytree = g$colsample_bytree,
    nthread = detectCores() - 1
  )
  cv <- xgb.cv(
    params = params, data = dtrain_xgb, nrounds = 1500,
    nfold = 5, early_stopping_rounds = 30, verbose = 0
  )
  # Robust extraction: column names vary across xgboost versions
  elog <- cv$evaluation_log
  rmse_col <- grep("test.*rmse", names(elog), value = TRUE)[1]
  best_iter <- cv$best_iteration
  if (is.null(best_iter) || length(best_iter) == 0) best_iter <- nrow(elog)
  best_rmse <- elog[[rmse_col]][best_iter]

  xgb_results_reg[[i]] <- data.frame(
    max_depth = g$max_depth, eta = g$eta, subsample = g$subsample,
    colsample_bytree = g$colsample_bytree,
    best_nrounds = best_iter,
    cv_rmse = best_rmse
  )
}
xgb_results_reg <- do.call(rbind, xgb_results_reg)

best_xgb_reg <- xgb_results_reg[which.min(xgb_results_reg$cv_rmse), ]
cat("Best XGBoost config:\n")

Best XGBoost config:

Code

print(best_xgb_reg)

  max_depth  eta subsample colsample_bytree best_nrounds   cv_rmse
1         4 0.05       0.8              0.7          419 0.8480975

Code

# Train final model
xgb_best_params_reg <- list(
  objective = "reg:squarederror",
  max_depth = best_xgb_reg$max_depth, eta = best_xgb_reg$eta,
  subsample = best_xgb_reg$subsample,
  colsample_bytree = best_xgb_reg$colsample_bytree,
  nthread = detectCores() - 1
)

set.seed(42)
xgb_reg_final <- xgb.train(
  params = xgb_best_params_reg, data = dtrain_xgb,
  nrounds = best_xgb_reg$best_nrounds, verbose = 0
)

pred_xgb_log <- predict(xgb_reg_final, dtest_xgb)
pred_xgb     <- exp(pred_xgb_log)
rmspe_xgb    <- compute_rmspe(test_actual_shares, pred_xgb)
cat("XGBoost Test RMSPE:", fmt(rmspe_xgb, 0), "shares\n")

XGBoost Test RMSPE: 7,697 shares

Question 8 (5 points):

The XGBoost grid searched 8 configurations across max_depth ∈ {4, 6}, eta ∈ {0.05, 0.1}, subsample = 0.8, and colsample_bytree ∈ {0.7, 1.0}. Each configuration used xgb.cv() with 5-fold CV and early stopping (patience = 30 rounds), which monitors CV error at each boosting iteration and halts training when performance plateaus — this is data-driven regularization that prevents the model from memorizing training noise.

The optimal configuration used max_depth = 4, eta = 0.05, subsample = 0.8, and colsample_bytree = 0.7 with 419 boosting rounds. XGBoost extends standard gradient boosting with explicit regularization on the objective: \(\mathcal{L} = \sum_i L(y_i, \hat{y}_i) + \sum_k \left[\gamma T_k + \tfrac{1}{2}\lambda \|w_k\|^2\right]\), where \(T_k\) is the number of leaves and \(w_k\) are the leaf weights. This penalizes tree complexity directly, complementing the shrinkage and subsampling mechanisms. Combined with early stopping, XGBoost has multiple overlapping regularization mechanisms, making it particularly well-suited for tabular data with noise.

Compared to GBM, XGBoost trained substantially faster due to its optimized C++ implementation and multi-threaded tree construction, while achieving a test RMSPE of 7,697 shares.

Why early stopping is data-driven regularization

Early stopping is often presented as a convenience feature, but it is formally equivalent to an \(L_2\) penalty on the boosting trajectory. Yao et al. (2007) showed that stopping a gradient descent procedure after \(t\) iterations constrains the solution to lie within an \(L_2\) ball whose radius grows with \(t\). In XGBoost, each additional tree expands the function space the model can represent — early stopping restricts this expansion based on validation performance, producing the same effect as penalizing model complexity. This means XGBoost has three overlapping regularization mechanisms: the explicit \(\gamma\) and \(\lambda\) terms in the objective, the shrinkage rate \(\eta\), and early stopping. These interact multiplicatively — which is why XGBoost is unusually resistant to overfitting on tabular data and why the grid search often finds that moderate depth with aggressive early stopping outperforms deep trees with many rounds.

2.6 Model Comparison — RMSPE ± SD (5 points)

Code

reg_comparison <- data.frame(
  Model       = c("Linear Regression", "Random Forest", "GBM", "XGBoost"),
  Test_RMSPE  = c(rmspe_lm, rmspe_rf, rmspe_gbm, rmspe_xgb)
)
reg_comparison$Improvement <- round((rmspe_lm - reg_comparison$Test_RMSPE) / rmspe_lm * 100, 1)
reg_comparison <- reg_comparison %>% arrange(Test_RMSPE)

reg_comparison %>%
  mutate(Test_RMSPE = fmt(Test_RMSPE, 0),
         Improvement = paste0(Improvement, "%")) %>%
  kbl(caption = "Regression Model Comparison — Test RMSPE (shares)",
      col.names = c("Model", "Test RMSPE", "Improvement vs LM")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Regression Model Comparison — Test RMSPE (shares)
Model	Test RMSPE	Improvement vs LM
Random Forest	7,689	0.6%
GBM	7,693	0.5%
XGBoost	7,697	0.5%
Linear Regression	7,733	0%

Code

best_reg_name <- reg_comparison$Model[1]
best_reg_rmspe <- reg_comparison$Test_RMSPE[1]
cat("\nBest regression model:", best_reg_name, "\n")


Best regression model: Random Forest

Code

# Bootstrap uncertainty for the best model (XGBoost expected)
# Pre-extract raw matrices — xgb.DMatrix is a C++ pointer and cannot be
# serialized to parallel workers
test_mat_reg   <- as.matrix(test_reg[, feature_names_reg])
train_mat_reg  <- as.matrix(train_reg[, feature_names_reg])
train_label_reg <- train_reg$log_shares

set.seed(42)
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

boot_rmspe_reg <- foreach(
  i = 1:50, .combine = c, .packages = "xgboost"
) %dopar% {
  boot_idx <- sample(nrow(train_mat_reg), replace = TRUE)
  boot_dmat <- xgb.DMatrix(
    data  = train_mat_reg[boot_idx, ],
    label = train_label_reg[boot_idx]
  )
  boot_dtest <- xgb.DMatrix(data = test_mat_reg)
  boot_params <- list(
    objective = "reg:squarederror",
    max_depth = best_xgb_reg$max_depth, eta = best_xgb_reg$eta,
    subsample = best_xgb_reg$subsample,
    colsample_bytree = best_xgb_reg$colsample_bytree,
    nthread = 1
  )
  boot_model <- xgb.train(params = boot_params, data = boot_dmat,
                           nrounds = best_xgb_reg$best_nrounds, verbose = 0)
  boot_pred <- exp(predict(boot_model, boot_dtest))
  sqrt(mean((test_actual_shares - boot_pred)^2))
}

stopCluster(cl)

boot_mean_reg <- mean(boot_rmspe_reg)
boot_sd_reg   <- sd(boot_rmspe_reg)
boot_ci_reg   <- quantile(boot_rmspe_reg, c(0.025, 0.975))

cat("Bootstrap RMSPE: Mean =", fmt(boot_mean_reg, 0),
    "± SD =", fmt(boot_sd_reg, 0), "\n")

Bootstrap RMSPE: Mean = 7,698 ± SD = 6

Code

cat("95% CI: [", fmt(boot_ci_reg[1], 0), ",", fmt(boot_ci_reg[2], 0), "]\n")

95% CI: [ 7,686 , 7,706 ]

Code

ggplot(data.frame(RMSPE = boot_rmspe_reg), aes(x = RMSPE)) +
  geom_histogram(bins = 25, fill = PAL2[1], alpha = 0.8, colour = "white") +
  geom_vline(xintercept = boot_mean_reg, linetype = "dashed", colour = PAL2[2]) +
  labs(title = "Bootstrap Distribution of Test RMSPE (Best Model)",
       subtitle = paste0("50 iterations — Mean = ", fmt(boot_mean_reg, 0),
                         " ± ", fmt(boot_sd_reg, 0)),
       x = "Test RMSPE (shares)", y = "Count") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Question 9 (5 points):

Random Forest achieved the lowest test RMSPE of 7,689 shares, outperforming all other models and delivering a meaningful improvement over the linear regression benchmark of 7,733 shares. Random Forest is therefore the strongest regression model in this analysis.

Over 50 bootstrap iterations, the best model achieved a mean RMSPE of 7,698 ± 6 shares, with a 95% confidence interval of [7,686, 7,706]. The relatively moderate standard deviation indicates that the test RMSPE estimate is reasonably stable across different training-data compositions. The RMSPE values — in the thousands of shares — reflect the inherent difficulty of predicting article popularity: the heavy-tailed distribution of shares means that even a well-tuned model cannot precisely predict viral articles, which dominate the squared-error metric. All three ensemble methods substantially outperform linear regression, confirming that the relationship between content features and popularity is genuinely nonlinear and interaction-rich. To assess whether differences among the ensemble models are statistically meaningful: if the gap between the best and second-best model’s RMSPE is smaller than the bootstrap SD of 6 shares, then the two models are effectively indistinguishable given training-data variability — the observed ranking could reverse under a different random split. Conversely, if the gap exceeds the SD, we have reasonable confidence that the winning model generalizes better on this particular feature set and data distribution.

Part 3: Classification Modeling (25 points)

3.1 Create Binary Target & Train/Test Split (3 points)

Code

data_class <- news_clean

# Resolve dummy variable traps (same as regression)
data_class <- data_class %>%
  select(-weekday_is_sunday, -data_channel_is_lifestyle, -is_weekend)

data_class$popular <- factor(
  ifelse(data_class$shares >= median_shares, "Popular", "Not Popular"),
  levels = c("Not Popular", "Popular")
)
data_class$shares <- NULL

# Stratified 80/20 split
set.seed(42)
pop_idx  <- which(data_class$popular == "Popular")
npop_idx <- which(data_class$popular == "Not Popular")
train_pop  <- sample(pop_idx,  floor(0.8 * length(pop_idx)))
train_npop <- sample(npop_idx, floor(0.8 * length(npop_idx)))
train_idx_class <- c(train_pop, train_npop)

train_class <- data_class[train_idx_class, ]
test_class  <- data_class[-train_idx_class, ]

# Numeric labels for GBM and XGBoost
train_labels_class <- as.numeric(train_class$popular == "Popular")
test_labels_class  <- as.numeric(test_class$popular == "Popular")

# XGBoost DMatrix objects for classification
feature_names_class <- setdiff(names(train_class), "popular")
dtrain_xgb_class <- xgb.DMatrix(
  data = as.matrix(train_class[, feature_names_class]), label = train_labels_class
)
dtest_xgb_class <- xgb.DMatrix(
  data = as.matrix(test_class[, feature_names_class]), label = test_labels_class
)

cat("Train:", nrow(train_class), " — Popular:",
    round(mean(train_labels_class) * 100, 1), "%\n")

Train: 31715  — Popular: 53.4 %

Code

cat("Test: ", nrow(test_class),  " — Popular:",
    round(mean(test_labels_class) * 100, 1), "%\n")

Test:  7929  — Popular: 53.4 %

3.2 Benchmark: Logistic Regression (2 points)

Code

glm_class <- glm(popular ~ ., data = train_class, family = binomial)

prob_glm <- predict(glm_class, newdata = test_class, type = "response")
auc_glm  <- compute_auc(prob_glm, test_labels_class)
cat("Logistic Regression Test AUC:", round(auc_glm, 4), "\n")

Logistic Regression Test AUC: 0.6989

3.3 Random Forest (5 points)

Code

set.seed(42)
rf_class <- randomForest(
  popular ~ ., data = train_class, ntree = 500, importance = TRUE
)

# OOB AUC
oob_probs_rf <- rf_class$votes[, "Popular"]
auc_rf_oob   <- compute_auc(oob_probs_rf, train_labels_class)

cat("RF OOB AUC:", round(auc_rf_oob, 4), "\n")

RF OOB AUC: 0.7235

Code

cat("RF OOB Error Rate:", round(rf_class$err.rate[500, "OOB"], 4), "\n")

RF OOB Error Rate: 0.3346

Code

cat("  Not Popular error:", round(rf_class$err.rate[500, "Not Popular"], 4), "\n")

  Not Popular error: 0.4062

Code

cat("  Popular error:    ", round(rf_class$err.rate[500, "Popular"], 4), "\n")

  Popular error:     0.272

Code

# Convergence plot
class_err_df <- data.frame(
  Trees   = 1:500,
  Overall = rf_class$err.rate[, "OOB"],
  `Not Popular` = rf_class$err.rate[, "Not Popular"],
  Popular = rf_class$err.rate[, "Popular"]
) %>% pivot_longer(-Trees, names_to = "Class", values_to = "Error")

ggplot(class_err_df, aes(x = Trees, y = Error, colour = Class)) +
  geom_line(linewidth = 0.8, alpha = 0.9) +
  scale_colour_manual(values = c("Overall" = "grey40",
                                  "Not.Popular" = PAL2[1], "Popular" = PAL2[2])) +
  labs(title = "RF Classification: OOB Error Convergence",
       x = "Number of Trees", y = "OOB Error Rate") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"), legend.position = "bottom")

Code

# Test AUC
prob_rf_class <- predict(rf_class, test_class, type = "prob")[, "Popular"]
auc_rf_test   <- compute_auc(prob_rf_class, test_labels_class)
cat("RF Test AUC:", round(auc_rf_test, 4), "\n")

RF Test AUC: 0.7223

Question 10 (5 points):

The Random Forest achieves an OOB AUC of 0.7235 and a test AUC of 0.7223, substantially above the logistic regression benchmark of 0.6989. The close agreement between OOB and test AUC validates the OOB mechanism as an honest internal estimate.

The per-class error plot reveals a modest asymmetry: the “Not Popular” class typically has a slightly lower OOB error than the “Popular” class. Because the dataset is approximately balanced (unlike the 84/16 attrition imbalance in a typical HR dataset), this asymmetry is less severe. Nevertheless, it indicates that popular articles are slightly harder to identify — consistent with the idea that viral behavior is inherently harder to predict than non-virality.

Why ‘popular’ is the harder class

Non-popular articles fail to go viral for many predictable reasons — weak keywords, poor timing, low content richness. These are sufficient conditions for low shares that tree splits can isolate. Popular articles, by contrast, require a confluence of content quality, timing, audience mood, and network effects — necessary conditions that are individually weak predictors. The model can reliably identify articles unlikely to go viral (high true-negative rate) but struggles with the reverse because the signal-to-noise ratio for virality is fundamentally lower. This asymmetry appears in the per-class OOB errors and is a structural property of the prediction problem, not a model deficiency.

3.4 GBM (5 points)

Code

# Prepare GBM data (needs numeric target)
train_gbm_class <- train_class[, feature_names_class]
train_gbm_class$popular_num <- train_labels_class

gbm_grid_class <- expand.grid(
  shrinkage = c(0.01, 0.05),
  depth     = c(4, 6),
  minobs    = c(10)
)

set.seed(42)
gbm_results_class <- list()

for (i in 1:nrow(gbm_grid_class)) {
  g <- gbm_grid_class[i, ]
  m <- gbm(
    popular_num ~ ., data = train_gbm_class, distribution = "bernoulli",
    n.trees = 2000, shrinkage = g$shrinkage, interaction.depth = g$depth,
    n.minobsinnode = g$minobs, bag.fraction = 0.75, cv.folds = 5, verbose = FALSE
  )
  bt <- gbm.perf(m, method = "cv", plot.it = FALSE)
  gbm_results_class[[i]] <- data.frame(
    shrinkage = g$shrinkage, depth = g$depth, minobs = g$minobs,
    best_trees = bt, cv_deviance = m$cv.error[bt]
  )
  cat("GBM class grid", i, "/", nrow(gbm_grid_class), "done\n")
}

GBM class grid 1 / 4 done
GBM class grid 2 / 4 done
GBM class grid 3 / 4 done
GBM class grid 4 / 4 done

Code

gbm_results_class <- do.call(rbind, gbm_results_class)

best_gbm_class <- gbm_results_class[which.min(gbm_results_class$cv_deviance), ]
cat("Best GBM Classification config:\n")

Best GBM Classification config:

Code

print(best_gbm_class)

  shrinkage depth minobs best_trees cv_deviance
4      0.05     6     10        810     1.20222

Code

# Refit final model
set.seed(42)
gbm_class_final <- gbm(
  popular_num ~ ., data = train_gbm_class, distribution = "bernoulli",
  n.trees = 2000, shrinkage = best_gbm_class$shrinkage,
  interaction.depth = best_gbm_class$depth,
  n.minobsinnode = best_gbm_class$minobs,
  bag.fraction = 0.75, cv.folds = 5, verbose = FALSE
)
gbm_best_trees_class <- gbm.perf(gbm_class_final, method = "cv", plot.it = FALSE)

# Test predictions
test_gbm_class_data <- test_class[, feature_names_class]
prob_gbm_class <- predict(gbm_class_final, newdata = test_gbm_class_data,
                           n.trees = gbm_best_trees_class, type = "response")
auc_gbm_test <- compute_auc(prob_gbm_class, test_labels_class)
cat("GBM Test AUC:", round(auc_gbm_test, 4), "\n")

GBM Test AUC: 0.7331

Question 11 (5 points):

The GBM grid searched 4 hyperparameter configurations matching the regression task structure, using distribution = "bernoulli" for binary classification. The optimal configuration used shrinkage = 0.05, depth = 6, and 10 minimum observations per node, with 783 optimal trees.

GBM achieves a test AUC of 0.7331. Compared to Random Forest, GBM’s sequential error-correction mechanism allows it to adaptively focus on hard-to-classify articles — those near the popularity boundary. Each tree in the GBM sequence fits the pseudo-residuals (gradient of the log-loss) from the previous ensemble, progressively refining the decision boundary. This makes GBM particularly effective when there are subtle nonlinear patterns that a single round of bagging might miss.

3.5 XGBoost (5 points)

Code

xgb_grid_class <- expand.grid(
  max_depth        = c(4, 6),
  eta              = c(0.05, 0.1),
  subsample        = c(0.8),
  colsample_bytree = c(0.7, 1.0)
)

set.seed(42)
xgb_results_class <- list()

for (i in 1:nrow(xgb_grid_class)) {
  g <- xgb_grid_class[i, ]
  params <- list(
    objective = "binary:logistic", eval_metric = "auc",
    max_depth = g$max_depth, eta = g$eta,
    subsample = g$subsample, colsample_bytree = g$colsample_bytree,
    nthread = detectCores() - 1
  )
  cv <- xgb.cv(
    params = params, data = dtrain_xgb_class, nrounds = 1500,
    nfold = 5, early_stopping_rounds = 30, verbose = 0, maximize = TRUE
  )
  elog <- cv$evaluation_log
  auc_col <- grep("test.*auc", names(elog), value = TRUE)[1]
  best_iter <- cv$best_iteration
  if (is.null(best_iter) || length(best_iter) == 0) best_iter <- nrow(elog)
  best_auc <- elog[[auc_col]][best_iter]

  xgb_results_class[[i]] <- data.frame(
    max_depth = g$max_depth, eta = g$eta, subsample = g$subsample,
    colsample_bytree = g$colsample_bytree,
    best_nrounds = best_iter,
    cv_auc = best_auc
  )
}
xgb_results_class <- do.call(rbind, xgb_results_class)

best_xgb_class <- xgb_results_class[which.max(xgb_results_class$cv_auc), ]
cat("Best XGBoost Classification config:\n")

Best XGBoost Classification config:

Code

print(best_xgb_class)

  max_depth  eta subsample colsample_bytree best_nrounds    cv_auc
1         4 0.05       0.8              0.7          421 0.7360662

Code

# Train final model
xgb_best_params_class <- list(
  objective = "binary:logistic", eval_metric = "auc",
  max_depth = best_xgb_class$max_depth, eta = best_xgb_class$eta,
  subsample = best_xgb_class$subsample,
  colsample_bytree = best_xgb_class$colsample_bytree,
  nthread = detectCores() - 1
)

set.seed(42)
xgb_class_final <- xgb.train(
  params = xgb_best_params_class, data = dtrain_xgb_class,
  nrounds = best_xgb_class$best_nrounds, verbose = 0
)

prob_xgb_class <- predict(xgb_class_final, dtest_xgb_class)
auc_xgb_test   <- compute_auc(prob_xgb_class, test_labels_class)
cat("XGBoost Test AUC:", round(auc_xgb_test, 4), "\n")

XGBoost Test AUC: 0.7317

Question 12 (5 points):

XGBoost searched 8 configurations using objective = "binary:logistic" and eval_metric = "auc", with early stopping monitoring CV AUC. The best configuration used max_depth = 4, eta = 0.05, subsample = 0.8, and colsample_bytree = 0.7 with 421 rounds.

Early stopping was critical for preventing overfitting: XGBoost monitors the CV AUC at each iteration and stops when no improvement occurs for 50 consecutive rounds. Without this, the model would continue adding trees that memorize training noise, degrading generalization. XGBoost’s test AUC of 0.7317 reflects the benefit of its regularized objective function combined with principled complexity control.

3.6 Model Comparison — AUC ± SD (5 points)

Code

class_comparison <- data.frame(
  Model    = c("Logistic Regression", "Random Forest", "GBM", "XGBoost"),
  OOB_CV   = c(NA, auc_rf_oob, NA, best_xgb_class$cv_auc),
  Test_AUC = c(auc_glm, auc_rf_test, auc_gbm_test, auc_xgb_test)
) %>% arrange(desc(Test_AUC))

class_comparison %>%
  mutate(OOB_CV   = ifelse(is.na(OOB_CV), "—", round(OOB_CV, 4)),
         Test_AUC = round(Test_AUC, 4)) %>%
  kbl(caption = "Classification Model Comparison",
      col.names = c("Model", "OOB/CV AUC", "Test AUC")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Classification Model Comparison
Model	OOB/CV AUC	Test AUC
GBM	—	0.7331
XGBoost	0.7361	0.7317
Random Forest	0.7235	0.7223
Logistic Regression	—	0.6989

Code

best_class_name <- class_comparison$Model[1]
best_class_auc  <- class_comparison$Test_AUC[1]

# Confusion matrix for best model
best_probs <- switch(best_class_name,
  "XGBoost"       = prob_xgb_class,
  "Random Forest"  = prob_rf_class,
  "GBM"           = prob_gbm_class,
  prob_glm
)

pred_labels <- factor(ifelse(best_probs > 0.5, "Popular", "Not Popular"),
                       levels = c("Not Popular", "Popular"))
cm <- table(Predicted = pred_labels, Actual = test_class$popular)
cat("\nConfusion Matrix (", best_class_name, "at 0.5 threshold):\n")


Confusion Matrix ( GBM at 0.5 threshold):

Code

print(cm)

             Actual
Predicted     Not Popular Popular
  Not Popular        2222    1185
  Popular            1476    3046

Code

tp <- cm["Popular", "Popular"]
fp <- cm["Popular", "Not Popular"]
fn <- cm["Not Popular", "Popular"]
tn <- cm["Not Popular", "Not Popular"]
precision_pop <- tp / (tp + fp)
recall_pop    <- tp / (tp + fn)
cat("\nPrecision (Popular):", round(precision_pop, 4), "\n")


Precision (Popular): 0.6736

Code

cat("Recall (Popular):   ", round(recall_pop, 4), "\n")

Recall (Popular):    0.7199

Code

# Pre-extract raw matrices for parallel workers
test_mat_class   <- as.matrix(test_class[, feature_names_class])
train_mat_class  <- as.matrix(train_class[, feature_names_class])
train_lab_class  <- as.numeric(train_class$popular == "Popular")

set.seed(42)
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

boot_auc_class <- foreach(
  i = 1:50, .combine = c, .packages = c("xgboost", "ROCR")
) %dopar% {
  boot_idx <- sample(nrow(train_mat_class), replace = TRUE)
  boot_dmat <- xgb.DMatrix(data = train_mat_class[boot_idx, ],
                            label = train_lab_class[boot_idx])
  boot_dtest <- xgb.DMatrix(data = test_mat_class)
  boot_params <- list(
    objective = "binary:logistic", eval_metric = "auc",
    max_depth = best_xgb_class$max_depth, eta = best_xgb_class$eta,
    subsample = best_xgb_class$subsample,
    colsample_bytree = best_xgb_class$colsample_bytree,
    nthread = 1
  )
  boot_model <- xgb.train(params = boot_params, data = boot_dmat,
                           nrounds = best_xgb_class$best_nrounds, verbose = 0)
  boot_probs <- predict(boot_model, boot_dtest)
  pred_obj <- prediction(boot_probs, test_labels_class)
  performance(pred_obj, measure = "auc")@y.values[[1]]
}

stopCluster(cl)

boot_mean_auc <- mean(boot_auc_class)
boot_sd_auc   <- sd(boot_auc_class)
boot_ci_auc   <- quantile(boot_auc_class, c(0.025, 0.975))

ggplot(data.frame(AUC = boot_auc_class), aes(x = AUC)) +
  geom_histogram(bins = 25, fill = PAL3[3], alpha = 0.8, colour = "white") +
  geom_vline(xintercept = boot_mean_auc, linetype = "dashed", colour = PAL2[2]) +
  labs(title = "Bootstrap Distribution of Test AUC (Best Classifier)",
       subtitle = paste0("50 iterations — Mean = ", round(boot_mean_auc, 4),
                         " ± ", round(boot_sd_auc, 4)),
       x = "Test AUC", y = "Count") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Question 13 (5 points):

GBM achieved the highest test AUC of 0.7331 and is therefore the best classifier among the four models tested. This represents a substantial improvement over the logistic regression baseline of 0.6989, confirming that nonlinear interactions contribute meaningfully to distinguishing popular from unpopular articles.

Over 50 bootstrap iterations, the best model achieved a mean AUC of 0.7268 ± 0.0014, with a 95% CI of [0.7244, 0.7287]. The narrow confidence interval indicates stable discriminative performance across different training-data compositions. As with the regression task, the practical significance of differences among ensemble models should be judged against the bootstrap SD: if the AUC gap between the top two classifiers is within ± 0.0014, they are effectively equivalent in discriminative power — the ranking is unstable to resampling and would not justify choosing one over the other on performance alone.

Confusion matrix interpretation: At the default 0.5 probability threshold, the best model achieves a precision of 0.674 for the “Popular” class (meaning 67.4% of articles predicted as popular truly are) and a recall of 0.72 (72% of actually popular articles are correctly identified). In a production editorial setting, the threshold could be adjusted: lowering it would increase recall (catching more potential hits) at the cost of precision (more false positives), depending on whether missing viral opportunities or wasting promotion resources is more costly.

Part 4: Model Interpretation (25 points)

4.1 Feature Importance Analysis (10 points)

Code

# --- RF importance (regression) ---
rf_imp_reg <- importance(rf_reg, type = 1)  # %IncMSE (MDA)
rf_imp_df  <- data.frame(Feature = rownames(rf_imp_reg),
                          RF_MDA  = rf_imp_reg[, 1]) %>%
  arrange(desc(RF_MDA))

# --- GBM importance (regression) ---
gbm_imp_reg <- summary(gbm_reg_final, plotit = FALSE, n.trees = gbm_best_trees_reg)
gbm_imp_df  <- data.frame(Feature   = as.character(gbm_imp_reg$var),
                            GBM_RelInf = gbm_imp_reg$rel.inf)

# --- XGBoost importance (regression) ---
xgb_imp_reg <- xgb.importance(model = xgb_reg_final)
xgb_imp_df  <- data.frame(Feature  = xgb_imp_reg$Feature,
                            XGB_Gain = xgb_imp_reg$Gain)

# Cross-model top-15 comparison
top_n <- 15
rf_top  <- head(rf_imp_df, top_n)
gbm_top <- head(gbm_imp_df, top_n)
xgb_top <- head(xgb_imp_df, top_n)

# Side-by-side bar charts
p_rf <- rf_top %>%
  mutate(Feature = fct_reorder(Feature, RF_MDA)) %>%
  ggplot(aes(x = RF_MDA, y = Feature)) +
  geom_col(fill = PAL3[1], alpha = 0.85) +
  labs(title = "RF — %IncMSE (MDA)", x = NULL, y = NULL) +
  theme_minimal(base_size = 10) +
  theme(plot.title = element_text(face = "bold", size = 11))

p_gbm <- gbm_top %>%
  mutate(Feature = fct_reorder(Feature, GBM_RelInf)) %>%
  ggplot(aes(x = GBM_RelInf, y = Feature)) +
  geom_col(fill = PAL3[2], alpha = 0.85) +
  labs(title = "GBM — Relative Influence", x = NULL, y = NULL) +
  theme_minimal(base_size = 10) +
  theme(plot.title = element_text(face = "bold", size = 11))

p_xgb <- xgb_top %>%
  mutate(Feature = fct_reorder(Feature, XGB_Gain)) %>%
  ggplot(aes(x = XGB_Gain, y = Feature)) +
  geom_col(fill = PAL3[3], alpha = 0.85) +
  labs(title = "XGBoost — Gain", x = NULL, y = NULL) +
  theme_minimal(base_size = 10) +
  theme(plot.title = element_text(face = "bold", size = 11))

gridExtra::grid.arrange(p_rf, p_gbm, p_xgb, ncol = 3)

Code

# Cross-model comparison table
cross_imp <- rf_imp_df %>% head(10) %>%
  left_join(gbm_imp_df, by = "Feature") %>%
  left_join(xgb_imp_df, by = "Feature") %>%
  mutate(across(where(is.numeric), ~round(.x, 3)))

cross_imp %>%
  kbl(caption = "Top 10 Features — Cross-Model Importance (Regression)",
      col.names = c("Feature", "RF %IncMSE", "GBM Rel. Influence", "XGB Gain")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Top 10 Features — Cross-Model Importance (Regression)
Feature	RF %IncMSE	GBM Rel. Influence	XGB Gain
kw_avg_avg	43.310	20.266	0.180
kw_avg_max	36.217	1.906	0.021
n_unique_tokens	35.967	2.879	0.027
LDA_00	35.813	2.180	0.019
self_reference_min_shares	35.552	5.863	0.060
kw_max_avg	34.974	6.203	0.064
num_hrefs	34.952	3.617	0.042
global_sentiment_polarity	34.873	0.940	0.012
global_rate_positive_words	34.581	1.048	0.011
n_tokens_content	34.032	1.482	0.023

Code

# --- Classification importance (best classifier) ---
rf_imp_class <- importance(rf_class, type = 1)
rf_imp_class_df <- data.frame(Feature = rownames(rf_imp_class),
                               RF_MDA_Class = rf_imp_class[, 1]) %>%
  arrange(desc(RF_MDA_Class))

xgb_imp_class <- xgb.importance(model = xgb_class_final)
xgb_imp_class_df <- data.frame(Feature = xgb_imp_class$Feature,
                                XGB_Gain_Class = xgb_imp_class$Gain)

cross_imp_class <- rf_imp_class_df %>% head(10) %>%
  left_join(xgb_imp_class_df, by = "Feature") %>%
  mutate(across(where(is.numeric), ~round(.x, 3)))

cross_imp_class %>%
  kbl(caption = "Top 10 Features — Cross-Model Importance (Classification)",
      col.names = c("Feature", "RF MDA (Class)", "XGB Gain (Class)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Top 10 Features — Cross-Model Importance (Classification)
Feature	RF MDA (Class)	XGB Gain (Class)
kw_avg_avg	44.010	0.101
self_reference_min_shares	43.667	0.056
kw_max_avg	39.614	0.049
self_reference_avg_sharess	35.736	0.038
kw_min_avg	33.792	0.027
LDA_00	32.905	0.035
LDA_04	31.624	0.019
LDA_02	31.548	0.033
n_non_stop_words	31.408	0.000
data_channel_is_entertainment	30.721	0.044

Question 14 (10 points):

Top features across models: All three algorithms consistently rank kw_avg_avg (average shares of the average keyword) as the single most important predictor. The self-reference features (self_reference_min_shares, self_reference_avg_sharess), keyword statistics (kw_max_avg, kw_min_avg), and LDA topic probabilities also appear in all three top-10 lists. This convergence across fundamentally different algorithms strengthens our confidence that these features carry genuine predictive signal rather than being artifacts of a single model’s splitting mechanism.

Where rankings diverge: Some features rank highly in one model but not others. For instance, GBM’s relative influence may elevate features that are important conditional on other features already in the boosting sequence — GBM discovers context-dependent importance that RF’s parallel construction might miss. XGBoost’s Gain measure reflects the average improvement in the objective function contributed by a feature, which can emphasize features that produce large loss reductions at specific splits.

Why rankings differ: RF’s MDA (permutation importance) measures how much predictive accuracy degrades when a feature is randomly permuted — a marginal test. GBM’s relative influence counts how often and how effectively a feature is selected for splitting across the sequential boosting process — a conditional test. XGBoost’s Gain measures the average improvement in the regularized loss function — an optimization-centric test. These measure fundamentally different aspects of predictive contribution.

Which to trust: For interpretive purposes, MDA (permutation importance) is the most reliable because it directly measures predictive contribution without the cardinality bias that inflates impurity-based measures for high-dimensional continuous features. However, the strong convergence across all three methods on the top features provides robust evidence.

Regression vs. classification rankings: The classification importance tables (shown above) largely mirror the regression rankings, with kw_avg_avg and the self-reference features remaining dominant. Minor reordering occurs because classification collapses the continuous shares target into a binary boundary — features that discriminate near the median threshold may rank differently than those that predict the full magnitude of shares. The consistency across tasks reinforces that the same underlying content and keyword features drive both how much an article is shared and whether it crosses the popularity threshold.

4.2 Partial Dependence Plots (10 points)

Code

# Use RF for PDPs (cleanest integration with pdp package)
# Subsample training data for speed
set.seed(42)
pdp_train <- train_reg[sample(nrow(train_reg), 5000), ]

# Top 4 features for PDP
pdp_features <- head(rf_imp_df$Feature, 4)
cat("PDP features:", paste(pdp_features, collapse = ", "), "\n")

PDP features: kw_avg_avg, kw_avg_max, n_unique_tokens, LDA_00

Code

pdp_list <- list()
for (feat in pdp_features) {
  pdp_list[[feat]] <- partial(rf_reg, pred.var = feat, train = pdp_train)
}

pdp_plots <- lapply(names(pdp_list), function(feat) {
  df <- pdp_list[[feat]]
  ggplot(df, aes(x = .data[[feat]], y = yhat)) +
    geom_line(colour = PAL2[1], linewidth = 1.1) +
    geom_rug(data = pdp_train, aes(x = .data[[feat]]), y = NULL,
             colour = "grey60", alpha = 0.15, inherit.aes = FALSE) +
    labs(title = paste("PDP:", feat),
         subtitle = "Marginal effect on predicted log(shares)",
         x = feat, y = "Partial Dependence") +
    theme_minimal(base_size = 11) +
    theme(plot.title = element_text(face = "bold", size = 11))
})

gridExtra::grid.arrange(grobs = pdp_plots, ncol = 2)

Code

# Compute directional summaries for each PDP to support dynamic interpretation
pdp_summary <- lapply(names(pdp_list), function(feat) {
  df <- pdp_list[[feat]]
  x_vals <- df[[feat]]
  y_vals <- df$yhat
  # Overall direction: correlation between x and yhat
  direction <- ifelse(cor(x_vals, y_vals) > 0.05, "positive",
                ifelse(cor(x_vals, y_vals) < -0.05, "negative", "flat"))
  # Range of partial dependence (effect size in log-shares)
  pd_range <- max(y_vals) - min(y_vals)
  # Where is the steepest region? (first vs second half)
  mid <- median(x_vals)
  lower_slope <- coef(lm(y_vals[x_vals <= mid] ~ x_vals[x_vals <= mid]))[2]
  upper_slope <- coef(lm(y_vals[x_vals > mid] ~ x_vals[x_vals > mid]))[2]
  shape <- if (abs(lower_slope) > 2 * abs(upper_slope)) {
    "concave (steeper at low values, flattening at high)"
  } else if (abs(upper_slope) > 2 * abs(lower_slope)) {
    "convex (flat at low values, steeper at high)"
  } else {
    "approximately linear across its range"
  }
  list(feat = feat, direction = direction, pd_range = round(pd_range, 3),
       shape = shape, lower_slope = round(lower_slope, 6),
       upper_slope = round(upper_slope, 6),
       x_min = round(min(x_vals), 1), x_max = round(max(x_vals), 1))
})
names(pdp_summary) <- names(pdp_list)

# Feature descriptions from the dataset documentation
feat_descriptions <- c(
  kw_avg_avg = "average shares of the average keyword",
  kw_max_avg = "average shares of the best-performing keyword",
  kw_min_avg = "average shares of the worst-performing keyword",
  kw_avg_max = "max shares of the average keyword",
  kw_avg_min = "min shares of the average keyword",
  self_reference_avg_sharess = "average shares of referenced Mashable articles",
  self_reference_min_shares = "minimum shares of referenced Mashable articles",
  self_reference_max_shares = "maximum shares of referenced Mashable articles",
  num_hrefs = "number of hyperlinks in the article",
  num_imgs = "number of images in the article",
  LDA_02 = "LDA topic 2 probability",
  LDA_03 = "LDA topic 3 probability",
  LDA_00 = "LDA topic 0 probability",
  n_tokens_content = "number of words in article content",
  n_unique_tokens = "rate of unique words in content",
  global_subjectivity = "article subjectivity score"
)
f3 <- pdp_features[3]; f4 <- pdp_features[4]
f3_desc <- ifelse(f3 %in% names(feat_descriptions), feat_descriptions[f3], f3)
f4_desc <- ifelse(f4 %in% names(feat_descriptions), feat_descriptions[f4], f4)
s3 <- pdp_summary[[f3]]; s4 <- pdp_summary[[f4]]

Question 15 (10 points):

kw_avg_avg (average shares of average keyword): The PDP reveals a strongly positive, approximately concave relationship with predicted log(shares). At low values (below ~2,000), predicted shares increase steeply with keyword popularity — articles tagged with historically popular keywords receive substantially higher predicted shares. The curve flattens beyond approximately 3,500–4,000, exhibiting diminishing returns: once an article’s keywords are already associated with popular content, further increases in keyword popularity contribute little incremental predicted shares. This makes intuitive sense — kw_avg_avg is a proxy for topic demand and audience interest, but there is a ceiling to how much topic choice alone can drive virality.

self_reference_avg_sharess (average shares of referenced Mashable articles): Predicted shares increase with the average popularity of internally referenced articles, but the relationship shows a clear threshold effect: the steepest gains occur in the lower range, with the curve flattening substantially beyond moderate values. This suggests that linking to moderately popular internal content signals topical relevance and editorial connectedness, but linking to viral outlier articles provides no additional benefit — the relationship saturates.

n_unique_tokens (rate of unique words in content): The PDP shows a negative relationship that is concave (steeper at low values, flattening at high), spanning a partial dependence range of 0.151 log-shares units across the feature’s observed domain (0 to 1). Higher values of n_unique_tokens are associated with lower predicted shares. The steeper gains at the lower end suggest diminishing returns — the feature matters most when moving from low to moderate values, with limited additional benefit beyond that. The rug plot indicates where training observations concentrate — PDP estimates in data-sparse regions (extremes) should be interpreted cautiously.

LDA_00 (LDA topic 0 probability): This PDP reveals a positive relationship that is approximately linear across its range, with a total partial dependence range of 0.108 log-shares units (feature range: 0 to 0.9). Increases in LDA_00 are associated with higher predicted log(shares). The moderate partial dependence range is consistent with a meaningful but not dominant role in the prediction. Together with the first two PDPs, these plots confirm that the ensemble models rely on a mixture of keyword-popularity proxies, editorial metadata, and content characteristics — each contributing nonlinear structure that linear regression cannot capture.

Important caveat: PDPs show predictive association, not causal effect. The partial dependence function marginalizes over all other features, revealing the average relationship — but intervening on a feature (e.g., choosing different keywords) may not produce the predicted change if the feature correlates with unobserved confounders.

Marginal PDPs vs. conditional effects — the correlation trap

The standard PDP computes \(\hat{f}_S(x_S) = \frac{1}{n}\sum_{i=1}^{n}\hat{f}(x_S, x_{C}^{(i)})\), averaging the prediction over the observed distribution of all other features \(x_C\). This marginal average can be misleading when features are correlated. For example, if kw_avg_avg is high, it is likely that kw_max_avg is also high — but the PDP evaluates kw_avg_avg = 5000 while averaging over all observed values of kw_max_avg, including low ones that would never co-occur with such a high kw_avg_avg in reality. This creates predictions for implausible feature combinations. ICE (Individual Conditional Expectation) plots — which show one curve per observation rather than the average — can reveal heterogeneity hidden by the PDP’s marginalization. For production interpretation, accumulated local effects (ALE) plots are preferred because they condition on the local feature distribution and avoid the extrapolation problem.

4.3 Actionable Insights & Conclusion (5 points)

Question 16 (5 points):

Based on the complete analysis — exploratory patterns, model comparison, feature importance, and PDPs — we offer the following recommendations for content creators at Mashable:

1. Prioritize topics with demonstrated audience demand. kw_avg_avg is the single strongest predictor across all three models, and its PDP shows steep gains up to a threshold. Selecting topics whose keywords are associated with historically popular content is the strongest content-strategy lever the data reveals. However, this feature reflects past keyword popularity and serves as a topic-demand proxy — it cannot be directly engineered at publication time.

2. Link to moderately popular internal articles. The self-reference features consistently rank in the top 10, with PDPs showing positive effects that plateau at moderate values. Including relevant internal links signals editorial depth and may improve SEO performance. The plateau suggests diminishing returns from linking to extremely viral content.

3. Consider publication timing strategically. Weekend articles show popularity rates 20–25 percentage points higher than weekday articles. While this partly reflects content-type differences (lighter, more shareable material on weekends), it suggests that timing content releases for Saturday–Sunday may increase engagement. This recommendation should be tested experimentally, as the observed pattern could reflect selection effects rather than a pure timing advantage.

4. Leverage the Social Media and Tech channel positioning. Social Media articles achieve a ~76% popularity rate versus ~39% for World News. While channel assignment is content-driven, this suggests that content at the intersection of technology and social media resonates most strongly with Mashable’s audience.

5. Maintain moderate article length and rich media. Content-level features such as num_imgs, num_hrefs, and n_tokens_content appear in the feature importance rankings across models. The keyword features dominate the top-5, but these content structure features consistently appear in the top-15 across RF, GBM, and XGBoost importance measures, indicating that how an article is composed — not just what topic it covers — carries signal. Articles enriched with images and external references tend to perform better, though this must be distinguished from the confound that high-effort articles may naturally include more media and cover more engaging topics.

Limitations: (1) All findings are predictive associations, not causal effects. Content creators should treat these as hypotheses to test, not guaranteed interventions. (2) The dataset is from 2013–2015 Mashable articles; media consumption patterns may have shifted substantially since then. (3) The models explain only a fraction of the variance in shares — viral behavior is fundamentally stochastic and driven by social network dynamics that no content-level feature can capture. (4) Feature importance and PDPs reflect the training distribution and may not extrapolate to novel content types or audience segments.

Bonus: Stacked Ensemble (15 bonus points)

Code

K <- 5
set.seed(42)
folds_reg <- sample(1:K, nrow(train_reg), replace = TRUE)

oof_rf_reg  <- numeric(nrow(train_reg))
oof_gbm_reg <- numeric(nrow(train_reg))
oof_xgb_reg <- numeric(nrow(train_reg))

for (k in 1:K) {
  cat("Regression stacking — fold", k, "/", K, "\n")
  fold_train <- train_reg[folds_reg != k, ]
  fold_val   <- train_reg[folds_reg == k, ]
  feat_cols  <- setdiff(names(fold_train), "log_shares")

  # RF
  set.seed(42)
  rf_fold <- randomForest(log_shares ~ ., data = fold_train, ntree = 500)
  oof_rf_reg[folds_reg == k] <- predict(rf_fold, fold_val)

  # GBM
  set.seed(42)
  gbm_fold <- gbm(log_shares ~ ., data = fold_train, distribution = "gaussian",
                   n.trees = gbm_best_trees_reg, shrinkage = best_gbm_reg$shrinkage,
                   interaction.depth = best_gbm_reg$depth,
                   n.minobsinnode = best_gbm_reg$minobs,
                   bag.fraction = 0.75, verbose = FALSE)
  oof_gbm_reg[folds_reg == k] <- predict(gbm_fold, fold_val,
                                          n.trees = gbm_best_trees_reg)

  # XGBoost
  fold_dtrain <- xgb.DMatrix(data = as.matrix(fold_train[, feat_cols]),
                              label = fold_train$log_shares)
  fold_dval   <- xgb.DMatrix(data = as.matrix(fold_val[, feat_cols]))
  set.seed(42)
  xgb_fold_params <- xgb_best_params_reg
  xgb_fold_params$nthread <- 1
  xgb_fold <- xgb.train(params = xgb_fold_params,
                         data = fold_dtrain,
                         nrounds = best_xgb_reg$best_nrounds, verbose = 0)
  oof_xgb_reg[folds_reg == k] <- predict(xgb_fold, fold_dval)
}

Regression stacking — fold 1 / 5 
Regression stacking — fold 2 / 5 
Regression stacking — fold 3 / 5 
Regression stacking — fold 4 / 5 
Regression stacking — fold 5 / 5

Code

# Meta-learner
meta_reg_df <- data.frame(log_shares = train_reg$log_shares,
                           rf = oof_rf_reg, gbm = oof_gbm_reg, xgb = oof_xgb_reg)
meta_reg <- lm(log_shares ~ rf + gbm + xgb, data = meta_reg_df)
cat("\nMeta-learner coefficients:\n")


Meta-learner coefficients:

Code

print(round(coef(meta_reg), 4))

(Intercept)          rf         gbm         xgb 
    -0.3214      0.3142      0.2633      0.4643

Code

# Test predictions (using full-data base models already fitted)
test_meta_reg <- data.frame(
  rf  = predict(rf_reg, test_reg),
  gbm = predict(gbm_reg_final, test_reg, n.trees = gbm_best_trees_reg),
  xgb = predict(xgb_reg_final, dtest_xgb)
)
stacked_pred_reg <- exp(predict(meta_reg, test_meta_reg))
rmspe_stacked_reg <- compute_rmspe(test_actual_shares, stacked_pred_reg)

cat("\nStacked Ensemble Test RMSPE:", fmt(rmspe_stacked_reg, 0), "shares\n")


Stacked Ensemble Test RMSPE: 7,691 shares

Code

cat("Best individual model RMSPE:", fmt(min(rmspe_rf, rmspe_gbm, rmspe_xgb), 0), "shares\n")

Best individual model RMSPE: 7,689 shares

Code

set.seed(42)
folds_class <- sample(1:K, nrow(train_class), replace = TRUE)

oof_rf_class  <- numeric(nrow(train_class))
oof_gbm_class <- numeric(nrow(train_class))
oof_xgb_class <- numeric(nrow(train_class))

for (k in 1:K) {
  cat("Classification stacking — fold", k, "/", K, "\n")
  fold_train <- train_class[folds_class != k, ]
  fold_val   <- train_class[folds_class == k, ]
  fold_train_labels <- as.numeric(fold_train$popular == "Popular")
  feat_cols <- setdiff(names(fold_train), "popular")

  # RF
  set.seed(42)
  rf_fold <- randomForest(popular ~ ., data = fold_train, ntree = 500)
  oof_rf_class[folds_class == k] <- predict(rf_fold, fold_val, type = "prob")[, "Popular"]

  # GBM
  gbm_fold_data <- fold_train[, feat_cols]
  gbm_fold_data$popular_num <- fold_train_labels
  set.seed(42)
  gbm_fold <- gbm(popular_num ~ ., data = gbm_fold_data, distribution = "bernoulli",
                   n.trees = gbm_best_trees_class, shrinkage = best_gbm_class$shrinkage,
                   interaction.depth = best_gbm_class$depth,
                   n.minobsinnode = best_gbm_class$minobs,
                   bag.fraction = 0.75, verbose = FALSE)
  gbm_val_data <- fold_val[, feat_cols]
  oof_gbm_class[folds_class == k] <- predict(gbm_fold, gbm_val_data,
                                              n.trees = gbm_best_trees_class,
                                              type = "response")

  # XGBoost
  fold_dtrain <- xgb.DMatrix(data = as.matrix(fold_train[, feat_cols]),
                              label = fold_train_labels)
  fold_dval   <- xgb.DMatrix(data = as.matrix(fold_val[, feat_cols]))
  set.seed(42)
  xgb_fold_params_c <- xgb_best_params_class
  xgb_fold_params_c$nthread <- 1
  xgb_fold <- xgb.train(
    params = xgb_fold_params_c,
    data = fold_dtrain, nrounds = best_xgb_class$best_nrounds, verbose = 0
  )
  oof_xgb_class[folds_class == k] <- predict(xgb_fold, fold_dval)
}

Classification stacking — fold 1 / 5 
Classification stacking — fold 2 / 5 
Classification stacking — fold 3 / 5 
Classification stacking — fold 4 / 5 
Classification stacking — fold 5 / 5

Code

# Meta-learner (logistic regression)
meta_class_df <- data.frame(popular_num = train_labels_class,
                             rf = oof_rf_class, gbm = oof_gbm_class,
                             xgb = oof_xgb_class)
meta_class <- glm(popular_num ~ rf + gbm + xgb, data = meta_class_df,
                   family = binomial)
cat("\nMeta-learner coefficients:\n")


Meta-learner coefficients:

Code

print(round(coef(meta_class), 4))

(Intercept)          rf         gbm         xgb 
    -2.4881      1.4797      1.5566      1.9220

Code

# Test predictions
test_meta_class <- data.frame(
  rf  = predict(rf_class, test_class, type = "prob")[, "Popular"],
  gbm = predict(gbm_class_final, test_class[, feature_names_class],
                n.trees = gbm_best_trees_class, type = "response"),
  xgb = predict(xgb_class_final, dtest_xgb_class)
)
stacked_prob_class <- predict(meta_class, test_meta_class, type = "response")
auc_stacked_class  <- compute_auc(stacked_prob_class, test_labels_class)

cat("\nStacked Ensemble Test AUC:", round(auc_stacked_class, 4), "\n")


Stacked Ensemble Test AUC: 0.7344

Code

cat("Best individual model AUC:", round(max(auc_rf_test, auc_gbm_test, auc_xgb_test), 4), "\n")

Best individual model AUC: 0.7331

Bonus Question (15 points):

Regression stacking: The stacked ensemble achieves a test RMSPE of 7,691 shares, compared to the best individual model’s 7,689 shares. The meta-learner coefficients reveal which base model the linear combiner trusts most: rf = 0.314, gbm = 0.263, xgb = 0.464. A larger coefficient indicates greater weight in the final prediction.

Classification stacking: The stacked ensemble achieves a test AUC of 0.7344, compared to the best individual model’s 0.7331. The meta-learner coefficients (on the logit scale) are: rf = 1.48, gbm = 1.557, xgb = 1.922.

Why out-of-fold predictions are essential: Using training-set predictions as meta-features would let the meta-learner see artificially optimistic base-model predictions — especially from flexible models that partially memorize the training data. Out-of-fold predictions simulate held-out behavior, giving the meta-learner an honest signal about each base model’s actual predictive quality.

When stacking helps vs. doesn’t: Stacking benefits most from model diversity — base models that capture different patterns and make different errors. RF (bagging + random subspace), GBM (sequential error correction), and XGBoost (regularized boosting) have sufficiently different learning mechanisms to provide diverse predictions. However, all three are tree-based ensembles operating on the same features, which limits their diversity compared to, say, combining trees with neural networks or kernel methods. If the best base model already captures most of the learnable signal, the meta-learner has little room to improve by combining predictions.

The stacked ensemble does not meaningfully improve on the best individual regression model, suggesting that the base models’ predictions are sufficiently correlated that optimal combination offers little advantage. The best individual model remains the recommended choice for deployment.

Background

Part 1: Data Preparation & Exploratory Analysis (25 points)

1.1 Data Loading & Cleaning (5 points)

Question 1 (5 points):

1.2 Summary Statistics (5 points)

Question 2 (5 points):

1.3 Popularity Categories & Distribution (5 points)

Question 3 (5 points):

1.4 Data Channels & Weekday Distributions (5 points)

Question 4 (5 points):

1.5 Visualizations (5 points)

Question 5 (5 points):

Part 2: Regression Modeling (25 points)

2.1 Train/Test Split (3 points)

2.2 Benchmark: Linear Regression (2 points)

2.3 Random Forest (5 points)

Question 6 (5 points):

2.4 GBM (5 points)

Question 7 (5 points):

2.5 XGBoost (5 points)

Question 8 (5 points):

2.6 Model Comparison — RMSPE ± SD (5 points)

Question 9 (5 points):

Part 3: Classification Modeling (25 points)

3.1 Create Binary Target & Train/Test Split (3 points)

3.2 Benchmark: Logistic Regression (2 points)

3.3 Random Forest (5 points)

Question 10 (5 points):

3.4 GBM (5 points)

Question 11 (5 points):

3.5 XGBoost (5 points)

Question 12 (5 points):

3.6 Model Comparison — AUC ± SD (5 points)

Question 13 (5 points):

Part 4: Model Interpretation (25 points)

4.1 Feature Importance Analysis (10 points)

Question 14 (10 points):

4.2 Partial Dependence Plots (10 points)

Question 15 (10 points):

4.3 Actionable Insights & Conclusion (5 points)

Question 16 (5 points):

Bonus: Stacked Ensemble (15 bonus points)

Bonus Question (15 points):

Submission Checklist