LFMG assignment 2

Author

LFMG

Assignment 2

A Portuguese bank conducted a marketing campaign (phone calls) to persuade clients to subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and figure out the most effective tactics that will help the bank persuade more customers to subscribe to the bank’s term deposit in the next campaign.

This assignment consists of conducting experiments for different algorithms: Decision Trees, Random Forests and AdaBoost. For each experiment we define what we are trying to achieve (before each run), conduct the experiment, and review how it went at the end. These experiments will allow us to compare algorithms and choose an optimal model.

Libraries

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.1
Warning: package 'lubridate' was built under R version 4.5.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(tibble)
library(readr)
library(janitor)
Warning: package 'janitor' was built under R version 4.5.1

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(rsample)
Warning: package 'rsample' was built under R version 4.5.1
library(yardstick)
Warning: package 'yardstick' was built under R version 4.5.1

Attaching package: 'yardstick'

The following object is masked from 'package:readr':

    spec
library(rpart) 
library(rpart.plot) 
library(forcats)    
library(skimr)
Warning: package 'skimr' was built under R version 4.5.1
library(naniar)
Warning: package 'naniar' was built under R version 4.5.1

Attaching package: 'naniar'

The following object is masked from 'package:skimr':

    n_complete
library(DataExplorer)
Warning: package 'DataExplorer' was built under R version 4.5.1
library(GGally)
Warning: package 'GGally' was built under R version 4.5.1
library(corrplot)
Warning: package 'corrplot' was built under R version 4.5.1
corrplot 0.95 loaded
library(vip)
Warning: package 'vip' was built under R version 4.5.1

Attaching package: 'vip'

The following object is masked from 'package:utils':

    vi
library(lubridate)
library(knitr)
Warning: package 'knitr' was built under R version 4.5.1
library(future)
library(furrr)
Warning: package 'furrr' was built under R version 4.5.1
library(kableExtra)
Warning: package 'kableExtra' was built under R version 4.5.1

Attaching package: 'kableExtra'

The following object is masked from 'package:dplyr':

    group_rows
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'

The following object is masked from 'package:dplyr':

    combine

The following object is masked from 'package:ggplot2':

    margin
library(ranger)

Attaching package: 'ranger'

The following object is masked from 'package:randomForest':

    importance
library(adabag)
Warning: package 'adabag' was built under R version 4.5.1
Loading required package: caret
Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:future':

    cluster

The following objects are masked from 'package:yardstick':

    precision, recall, sensitivity, specificity

The following object is masked from 'package:rsample':

    calibration

The following object is masked from 'package:purrr':

    lift

Loading required package: foreach

Attaching package: 'foreach'

The following objects are masked from 'package:purrr':

    accumulate, when

Loading required package: doParallel
Warning: package 'doParallel' was built under R version 4.5.1
Loading required package: iterators
Loading required package: parallel
library(progressr)

Attaching package: 'progressr'

The following object is masked from 'package:caret':

    progress
library(fastAdaboost)

Loading the Data

The dataset has been uploaded to GitHub for easy reproducibility.

url <- "https://raw.githubusercontent.com/Lfirenzeg/msds622/refs/heads/main/bank-additional-full.csv"
set.seed(489) ## let's define a fixed seed, also for reproducibility

The file uses semicolons instead of commas as delimiters, so we have to account for that. Also, when exploring the CSV some “unknown” values were found, so we’ll turn those into NAs.

Global set up

# Loading, and converting "unknown" and "" to NA; clean names to snake_case
uci_raw <- read_delim(url, delim = ";", show_col_types = FALSE, na = c("", "unknown")) |>
  clean_names()
# Remove exact duplicate rows (we know there's very few, but it's a good practice)
uci_raw <- uci_raw |> distinct()

Now we’ll define a function that prepares the data following our findings from assignment 1. We’ll call it prep_features and apply it to the global dataset, since all the experiments will benefit from it and we’ll have a uniform starting point.

prep_features <- function(df) {
  df |>
    # Target to ordered factor (no, yes)
    mutate(y = factor(y, levels = c("no","yes"))) |>
    # we'll drop the leakage feature, since duration is known only after the call
    select(-duration) |>
    # to handle "pdays" we'll  create a "no prior contact" flag and a cleaned numeric one
    mutate(
      no_prior      = if_else(pdays == 999, "yes", "no"),
      pdays_real    = if_else(pdays == 999, NA, pdays),
      # and the recent contact flag
      recent_contact = if_else(!is.na(pdays_real) & pdays_real < 7, "recent", "not_recent"),
      # also some contact intensity bands from campaign
      intensity = case_when(
        campaign <= 2 ~ "low",
        campaign <= 5 ~ "medium",
        TRUE ~ "high"
      ),
      # we had defined age buckets to segment the data for easy handling
      age_bucket = cut(age, breaks = c(0,30,45,60,Inf), labels = c("<=30","31-45","46-60","60+"), right = TRUE),
      # Creating a flag for "Prior outcome" (if it has any prior outcome)
      had_prior = if_else(poutcome == "nonexistent", "no", "yes")
    ) |>
    # finally, keep categorical columns as factors so trees can split on them
    mutate(
      across(where(is.character), ~factor(.x)),
      across(c(no_prior, recent_contact, intensity, age_bucket, had_prior), ~factor(.x))
    )
}

Then, we apply it to the dataset

uci <- prep_features(uci_raw)

Additionally, we want to reduce macro collinearity by dropping one of two near-duplicate macro variables. We’ll keep euribor3m and drop nr_employed. This is optional for trees, but we do it to stabilize comparisons.

if ("nr_employed" %in% names(uci)) {
  uci <- uci |> select(-nr_employed)
}
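
To support this choice, a quick correlation check on the raw frame can confirm how strongly the two macro series move together. This is an optional, illustrative sanity check rather than part of the experiment pipeline; both columns are numeric in uci_raw, so a simple Pearson correlation works.

# Optional sanity check: correlation between the two near-duplicate macro indicators
cor(uci_raw$euribor3m, uci_raw$nr_employed, use = "complete.obs")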

Train and Test split, CV folds

We will do an 80/20 stratified split for the final holdout

split_obj <- initial_split(uci, prop = 0.80, strata = y)
train_df <- training(split_obj)
test_df <- testing(split_obj)

And a 5 fold stratified CV on the training data

folds <- vfold_cv(train_df, v = 5, strata = y)

Additional useful metrics

Compute Recall@TopK% for the “yes” class

recall_at_top_k <- function(data, prob_col = ".pred_yes", truth_col = "y", k_frac = 0.10) {
  stopifnot(k_frac > 0, k_frac <= 1)
  n <- nrow(data)
  if (n == 0) return(NA_real_)
  k <- max(1, floor(k_frac * n))

  # Order by the chosen probability column (descending)
  ord <- order(data[[prob_col]], decreasing = TRUE, na.last = TRUE)

  # Predicted classes: top K% -> "yes", others -> "no"
  pred_class <- rep("no", n)
  pred_class[ord[seq_len(k)]] <- "yes"

  # Ensure both columns are factors with levels c("no","yes")
  truth <- factor(data[[truth_col]], levels = c("no","yes"))
  pred  <- factor(pred_class,            levels = c("no","yes"))

  yardstick::recall_vec(truth = truth, estimate = pred, event_level = "second")
}
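
As a quick sanity check of this helper, we can run it on a tiny toy tibble (an illustrative example, not part of the bank data):

# Toy example: 4 leads, call the top 50%; the top 2 scores contain 1 of the 2 "yes" cases -> recall 0.5
toy <- tibble(
  .pred_yes = c(0.90, 0.80, 0.20, 0.10),
  y = factor(c("yes", "no", "yes", "no"), levels = c("no", "yes"))
)
recall_at_top_k(toy, k_frac = 0.5)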

We can also define a single function to wrap the standard metrics after each validation prediction

compute_metrics <- function(pred_df, truth_col = "y", prob_col = ".pred_yes") {
  # Pull columns via tidy-eval and pass positionally (no `estimate =`)
  pr  <- yardstick::pr_auc(pred_df,
                           !!rlang::sym(truth_col),
                           !!rlang::sym(prob_col),
                           event_level = "second")$.estimate
  roc <- yardstick::roc_auc(pred_df,
                            !!rlang::sym(truth_col),
                            !!rlang::sym(prob_col),
                            event_level = "second")$.estimate
  rec <- recall_at_top_k(pred_df,
                         prob_col = prob_col,
                         truth_col = truth_col,
                         k_frac = 0.10)

  tibble(pr_auc = pr, roc_auc = roc, rec_at10 = rec)
}

Class Weights

To mitigate imbalance, we’ll give higher weight to minority class “yes”, inversely proportional to frequency. We’ll use a function for this called compute_class_weights.

compute_class_weights <- function(df, y_col = "y") {
  tab <- table(df[[y_col]])
  # Inverse class frequency; normalized below so the mean class weight is 1
  w <- 1 / as.numeric(tab)
  names(w) <- names(tab)
  w <- w / mean(w)
  return(w)
}
class_wts <- compute_class_weights(train_df)
class_wts
       no       yes 
0.2253188 1.7746812 

And build a weight vector (by weighting each row by class weight)

row_weights_from_class <- function(df, y_col = "y", class_wts) {
  unname(class_wts[as.character(df[[y_col]])])
}

Log

Since we will be running multiple experiments we’ll prepare a log where we can add our results as we make progress.

# defining the log
# creating an empty log with the exact schema we want to register
init_exp_log <- function() {
  tibble::tibble(
    id        = character(),
    model     = character(),
    objective = character(),
    variation = character(),
    controls  = character(),
    metrics   = character(),
    result    = character(),
    conclusion= character(),
    recommend = character()
  )
}
# defining function to populate the log with results
# To be safe, we add an appender that enforces the same column set/order every time
log_experiment <- function(log,
                           id, model, objective, variation, controls,
                           metrics, result, conclusion, recommend) {
  new_row <- tibble::tibble(
    id        = as.character(id),
    model     = as.character(model),
    objective = as.character(objective),
    variation = as.character(variation),
    controls  = as.character(controls),
    metrics   = as.character(metrics),
    result    = as.character(result),
    conclusion= as.character(conclusion),
    recommend = as.character(recommend)
  )

  # if the log passed in is missing or doesn't match the expected schema, re-start it cleanly
  if (missing(log) || !is.data.frame(log) || !identical(names(log), names(new_row))) {
    log <- init_exp_log()
  }

  dplyr::bind_rows(log, new_row)
}
fmt_res <- function(m) {
  # m is a 1-row tibble with pr_auc, roc_auc, rec_at10
  sprintf("Test PR-AUC=%.4f, ROC-AUC=%.4f, Recall@10%%=%.4f",
          m$pr_auc, m$roc_auc, m$rec_at10)
}
exp_log <- init_exp_log()

Decision Trees

We will implement the following experiments:

  • DT1: Establish a baseline tree versus a cost-sensitive tree

    Objective: To improve minority recall without collapsing precision by introducing cost sensitivity.

    Variation: We’ll add a class cost matrix, class weights and tune cp (complexity), minsplit, maxdepth. We will then compare to a plain baseline.

    Controls: Same features, same train/validation split, same CV, no SMOTE.

    Metrics: PR-AUC, Recall@Top10%, ROC-AUC.

    Hypothesis: The cost sensitive pruned tree will raise recall@top10% and PR-AUC relative to baseline.

    Decision rule: If PR-AUC improves by more than 3% absolute or recall@top10% improves by more than 5 points, we’ll recommend the cost sensitive settings.

  • DT2: Add interaction aware features, plus some depth control

    Objective: For this second experiment we’ll test whether simple domain features (for example recent × intensity, age buckets, edu×job) can boost ranking power while controlling variance.

    Variation: We are adding engineered features; restricting maxdepth and tuning cp/minbucket to avoid overfitting.

    Controls: Same imbalance strategy as DT1’s better variant; same split and CV.

    Metrics: PR-AUC, Recall@Top10%, tree size as complexity proxy.

    Hypothesis: Our new engineered features plus some depth control yields higher PR-AUC with a smaller tree than DT1.

    Decision rule: We’ll prefer DT2 if PR-AUC improves and the tree size is reduced or unchanged.

DT1

We’ll start by defining a small hyperparameter grid

dt_grid <- expand.grid(
  cp = c(0.001, 0.0025, 0.005, 0.01),
  minsplit = c(20, 50, 100),
  maxdepth = c(3, 4, 5, 6))

We’ll use a function to fit and predict for one set of parameters on one fold

fit_predict_rpart <- function(train_data, valid_data, cp, minsplit, maxdepth, use_weights = FALSE, class_wts = NULL) {
  ctrl <- rpart.control(cp = cp, minsplit = minsplit, maxdepth = maxdepth, xval = 0)
  if (use_weights) {
    w <- row_weights_from_class(train_data, "y", class_wts)
    fit <- rpart(y ~ ., data = train_data, method = "class", control = ctrl, weights = w)
  } else {
    fit <- rpart(y ~ ., data = train_data, method = "class", control = ctrl)
  }
  # Return only numeric prob + truth; DO NOT include the model object
  preds <- predict(fit, newdata = valid_data, type = "prob")[, "yes"]
  tibble(.pred_yes = preds, y = valid_data$y)
}

We then run CV for a given hyperparameter setting and weight option

cv_eval_rpart <- function(folds, params, use_weights = FALSE, class_wts = NULL) {
  fold_metrics <- purrr::map_dfr(folds$splits, function(s) {
    tr <- analysis(s) |> droplevels()
    vl <- assessment(s) |> droplevels()
    p  <- fit_predict_rpart(tr, vl,
                            cp = params$cp, minsplit = params$minsplit, maxdepth = params$maxdepth,
                            use_weights = use_weights, class_wts = class_wts)
    compute_metrics(p)
  })
  fold_metrics |>
    dplyr::summarise(dplyr::across(dplyr::everything(), \(x) mean(x, na.rm = TRUE))) |>
    dplyr::mutate(cp = params$cp, minsplit = params$minsplit, maxdepth = params$maxdepth, weighted = use_weights)
}

Then, we evaluate the whole grid for both baseline and cost sensitive

dt1_results <- purrr::map_dfr(seq_len(nrow(dt_grid)), function(i) {
  params <- dt_grid[i, ]
  base <- cv_eval_rpart(folds, params, use_weights = FALSE, class_wts = NULL)
  wts  <- cv_eval_rpart(folds, params, use_weights = TRUE,  class_wts = class_wts)
  dplyr::bind_rows(base, wts)
}) |>
  dplyr::arrange(dplyr::desc(pr_auc))

Now we can see the top rows

head(dt1_results, 10)
# A tibble: 10 × 7
   pr_auc roc_auc rec_at10    cp minsplit maxdepth weighted
    <dbl>   <dbl>    <dbl> <dbl>    <dbl>    <dbl> <lgl>   
 1  0.470   0.759    0.333  0.01       20        3 TRUE    
 2  0.470   0.759    0.333  0.01       50        3 TRUE    
 3  0.470   0.759    0.333  0.01      100        3 TRUE    
 4  0.470   0.759    0.333  0.01       20        4 TRUE    
 5  0.470   0.759    0.333  0.01       50        4 TRUE    
 6  0.470   0.759    0.333  0.01      100        4 TRUE    
 7  0.470   0.759    0.333  0.01       20        5 TRUE    
 8  0.470   0.759    0.333  0.01       50        5 TRUE    
 9  0.470   0.759    0.333  0.01      100        5 TRUE    
10  0.470   0.759    0.333  0.01       20        6 TRUE    

We pick the best configuration by PR-AUC, using higher Recall@Top10% as a tie breaker, then a smaller maxdepth, then a larger cp (preferring simpler trees)

pick_dt1 <- dt1_results |>
  arrange(desc(pr_auc), desc(rec_at10), maxdepth, desc(cp)) |>
  slice(1)
pick_dt1
# A tibble: 1 × 7
  pr_auc roc_auc rec_at10    cp minsplit maxdepth weighted
   <dbl>   <dbl>    <dbl> <dbl>    <dbl>    <dbl> <lgl>   
1  0.470   0.759    0.333  0.01       20        3 TRUE    

The best CV configuration we find is cp = 0.01, minsplit = 20, maxdepth = 3, weighted = TRUE.

The corresponding 5-fold CV metrics are PR-AUC around 0.470, ROC-AUC around 0.759, and Recall@Top10% around 0.333.

Now we can train the chosen model on the full training set and evaluate on the 20% holdout

best_ctrl <- rpart.control(cp = pick_dt1$cp, minsplit = pick_dt1$minsplit, maxdepth = pick_dt1$maxdepth, xval = 0)
if (pick_dt1$weighted) {
  final_dt1 <- rpart(y ~ ., data = train_df, method = "class", control = best_ctrl,
                     weights = row_weights_from_class(train_df, "y", class_wts))
} else {
  final_dt1 <- rpart(y ~ ., data = train_df, method = "class", control = best_ctrl)
}

And this will lead us to the holdout predictions and metrics

dt1_test_pred <- predict(final_dt1, newdata = test_df, type = "prob")[, "yes"]
dt1_test_df   <- tibble(.pred_yes = dt1_test_pred, y = test_df$y)
dt1_test_metrics <- compute_metrics(dt1_test_df)
dt1_test_metrics
# A tibble: 1 × 3
  pr_auc roc_auc rec_at10
   <dbl>   <dbl>    <dbl>
1  0.463   0.765    0.387

The holdout PR-AUC is very close to CV (it has a small drop), and recall@top10% actually improves, which suggests the class weighted, pruned tree is not overfitting and is ranking positives reasonably well.

We can also plot the final tree for dt1

rpart.plot(final_dt1, type = 2, extra = 101, fallen.leaves = TRUE)

DT1 Results

We can see the first splits are on macro variables: euribor3m then cons_conf_idx, indicating that macro conditions dominate the early decisions.

A later split on month = “may” shows that this month is less favorable than the others, which is consistent with the findings from the EDA, where May had one of the lowest conversion rates.

Also, the leaves on the right side (low consumer confidence, certain months) have higher “yes” proportions, which aligns with the idea that clients are more receptive under certain economic conditions and at certain times of the year.
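
We can confirm these observations programmatically (a quick illustrative check, not a new experiment) by listing the variables the final tree actually splits on and rpart’s built-in importance scores:

# Variables used for splits in the final DT-1 tree (leaves excluded)
unique(as.character(final_dt1$frame$var[final_dt1$frame$var != "<leaf>"]))
# rpart's variable importance (includes surrogate split contributions)
round(final_dt1$variable.importance, 3)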

This all tells us that the cost sensitivity and pruning seem to have worked: with a compact tree we have achieved a decent PR-AUC and even improved operational recall at a fixed calling quota (top 10%).

The model’s stability from CV to test also indicates we can keep using this setup as the DT baseline in the experiment table.

At this point we can finally add the experiment to our log

# Logging the experiment
exp_log <- log_experiment(
  exp_log,
  id        = "DT-1",
  model     = "Decision Tree",
  objective = "Improve minority recall with class weights and pruning.",
  variation = paste0("Grid over cp/minsplit/maxdepth; weights=", pick_dt1$weighted),
  controls  = "Same features/split/CV; no SMOTE.",
  metrics   = "PR-AUC (primary), Recall@Top10%, ROC-AUC",
  result    = fmt_res(dt1_test_metrics),
  conclusion= "Weighted and pruned tree outperformed baseline on PR-AUC/Recall@10%.",
  recommend = "Adopt cost-sensitive settings with selected cp/minsplit/maxdepth."
)
exp_log
# A tibble: 1 × 9
  id    model   objective variation controls metrics result conclusion recommend
  <chr> <chr>   <chr>     <chr>     <chr>    <chr>   <chr>  <chr>      <chr>    
1 DT-1  Decisi… Improve … Grid ove… Same fe… PR-AUC… Test … Weighted … Adopt co…

DT2

For the second experiment we’ll start by creating a modeling copy with explicit interaction columns

# Count interaction levels on TRAIN ONLY
tmp_train_interactions <- train_df |>
  mutate(
    recent_x_intensity = interaction(recent_contact, intensity, drop = TRUE),
    edu_x_job          = interaction(education, job, drop = TRUE)
  )
# We'll keep reasonably frequent levels so both model and test can share them
# the threshold can be tuned; we'll start at 50
keep_recent_intensity <- names(which(table(tmp_train_interactions$recent_x_intensity) >= 50))
keep_edu_job          <- names(which(table(tmp_train_interactions$edu_x_job)          >= 50))
# this helper builds DT-2 features and lumps rare levels into "Other" using TRAIN-derived keep lists
make_dt2_frame <- function(df, keep_recent_intensity, keep_edu_job) {
  df |>
    mutate(
      recent_x_intensity = interaction(recent_contact, intensity, drop = TRUE),
      edu_x_job          = interaction(education, job, drop = TRUE)
    ) |>
    mutate(
      recent_x_intensity = forcats::fct_other(recent_x_intensity, keep = keep_recent_intensity, other_level = "Other"),
      edu_x_job          = forcats::fct_other(edu_x_job,          keep = keep_edu_job,          other_level = "Other")
    ) |>
    mutate(across(where(is.character), ~factor(.x)))
}
train_df_dt2 <- make_dt2_frame(train_df, keep_recent_intensity, keep_edu_job)
test_df_dt2  <- make_dt2_frame(test_df,  keep_recent_intensity, keep_edu_job)
lvl_recent_intensity <- levels(train_df_dt2$recent_x_intensity)
lvl_edu_job <- levels(train_df_dt2$edu_x_job)

Then we rebuild folds for DT-2 (the folds are defined on the new training frame)

folds_dt2 <- rsample::vfold_cv(train_df_dt2, v = 5, strata = y)

And define a smaller depth range (for tighter control), and re-use the cp/minsplit values

dt2_grid <- expand.grid(
  cp        = c(0.0025, 0.005, 0.01),
  minsplit  = c(50, 100),
  maxdepth  = c(3, 4, 5)
)

We can re-use the helpers for the fit-and-predict function, with optional weights

fit_predict_rpart_dt2 <- function(train_data, valid_data, cp, minsplit, maxdepth, use_weights = TRUE, class_wts = NULL) {
  ctrl <- rpart.control(cp = cp, minsplit = minsplit, maxdepth = maxdepth, xval = 0)
  if (use_weights) {
    w <- row_weights_from_class(train_data, "y", class_wts)
    fit <- rpart(y ~ ., data = train_data, method = "class", control = ctrl, weights = w)
  } else {
    fit <- rpart(y ~ ., data = train_data, method = "class", control = ctrl)
  }
  preds <- predict(fit, newdata = valid_data, type = "prob")[, "yes"]
  tibble(.pred_yes = preds, y = valid_data$y)
}
cv_eval_rpart_dt2 <- function(folds, params, use_weights = TRUE, class_wts = NULL) {
  fold_metrics <- purrr::map_dfr(folds$splits, function(s) {
    tr <- rsample::analysis(s)
    vl <- rsample::assessment(s)
    p  <- fit_predict_rpart_dt2(tr, vl,
                                cp = params$cp, minsplit = params$minsplit, maxdepth = params$maxdepth,
                                use_weights = use_weights, class_wts = class_wts)
    compute_metrics(p)
  })
  fold_metrics |>
    dplyr::summarise(dplyr::across(dplyr::everything(), \(x) mean(x, na.rm = TRUE))) |>
    dplyr::mutate(cp = params$cp, minsplit = params$minsplit, maxdepth = params$maxdepth, weighted = use_weights)
}

We now run the CV grid for DT2

dt2_results <- purrr::map_dfr(seq_len(nrow(dt2_grid)), function(i) {
  params <- dt2_grid[i, ]
  cv_eval_rpart_dt2(folds_dt2, params, use_weights = TRUE, class_wts = class_wts)
}) |>
  dplyr::arrange(dplyr::desc(pr_auc))

We can take a look at the top results

head(dt2_results, 10)
# A tibble: 10 × 7
   pr_auc roc_auc rec_at10     cp minsplit maxdepth weighted
    <dbl>   <dbl>    <dbl>  <dbl>    <dbl>    <dbl> <lgl>   
 1  0.448   0.753    0.201 0.01         50        3 TRUE    
 2  0.448   0.753    0.201 0.01        100        3 TRUE    
 3  0.448   0.753    0.201 0.01         50        4 TRUE    
 4  0.448   0.753    0.201 0.01        100        4 TRUE    
 5  0.448   0.753    0.201 0.01         50        5 TRUE    
 6  0.448   0.753    0.201 0.01        100        5 TRUE    
 7  0.397   0.762    0.236 0.0025       50        5 TRUE    
 8  0.397   0.762    0.236 0.0025      100        5 TRUE    
 9  0.397   0.764    0.236 0.0025       50        4 TRUE    
10  0.397   0.764    0.236 0.0025      100        4 TRUE    

And similar to DT1 we can select the best configuration by PR-AUC (and use the same tie breakers as before)

pick_dt2 <- dt2_results |>
  dplyr::arrange(dplyr::desc(pr_auc), dplyr::desc(rec_at10), maxdepth, dplyr::desc(cp)) |>
  dplyr::slice(1)
pick_dt2
# A tibble: 1 × 7
  pr_auc roc_auc rec_at10    cp minsplit maxdepth weighted
   <dbl>   <dbl>    <dbl> <dbl>    <dbl>    <dbl> <lgl>   
1  0.448   0.753    0.201  0.01       50        3 TRUE    

Again we see that the best settings go for a small, pruned tree: cp = 0.01, minsplit = 50, maxdepth = 3, weighted = TRUE.

As for the CV metrics: PR-AUC around 0.448, ROC-AUC around 0.753, and Recall@Top10% around 0.20. That CV Recall@10% is noticeably lower than what we got with DT1, and the PR-AUC is slightly lower as well.

Now we can train on full training data with the engineered interactions

best_ctrl2 <- rpart.control(cp = pick_dt2$cp, minsplit = pick_dt2$minsplit, maxdepth = pick_dt2$maxdepth, xval = 0)
final_dt2 <- rpart(y ~ ., data = train_df_dt2, method = "class", control = best_ctrl2,
                   weights = row_weights_from_class(train_df_dt2, "y", class_wts))

Holdout evaluation on the DT-2 test frame (with the same engineered features)

dt2_test_pred <- predict(final_dt2, newdata = test_df_dt2, type = "prob")[, "yes"]
dt2_test_df   <- tibble(.pred_yes = dt2_test_pred, y = test_df_dt2$y)
dt2_test_metrics <- compute_metrics(dt2_test_df)
dt2_test_metrics
# A tibble: 1 × 3
  pr_auc roc_auc rec_at10
   <dbl>   <dbl>    <dbl>
1  0.463   0.765    0.387

The results are basically identical to the DT1 test metrics. This is a good sign of stability and no overfitting, but it also means the engineered interactions did not yield extra lift on the depth-capped tree.

rpart.plot(final_dt2, type = 2, extra = 101, fallen.leaves = TRUE)

DT2 Results

The plotted tree is the same shape as DT-1: the top splits on euribor3m, then cons_conf_idx, then the month of “may” branch. In other words, with maxdepth = 3 and the chosen penalties, the model never needed the new interaction features (recent_x_intensity, edu_x_job). Macro variables and basic calendar effects still dominate the early decisions, again matching the EDA.
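
A quick way to verify that the engineered interactions were never selected (an illustrative check) is to list the split variables of the final DT-2 tree:

# Which variables does the final DT-2 tree actually split on?
used_vars_dt2 <- unique(as.character(final_dt2$frame$var[final_dt2$frame$var != "<leaf>"]))
used_vars_dt2
# The engineered interactions should be absent from this list
intersect(c("recent_x_intensity", "edu_x_job"), used_vars_dt2)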

The aggressive pruning (depth 3, cp 0.01) keeps the variance low while the test metrics match the CV.

At this depth and with class weights, the interactions did not meaningfully change performance or the splits. Part of this is by design, since depth 3 limits how many interactions a single tree can express, but it serves the purpose of illustrating the experiment.

With recall@top10% around 0.387, the tree surfaces roughly 39% of all eventual “yes” clients within the top 10% of leads. This is useful, but we likely have headroom for improvement.

We now log the experiment results

# Log the experiment
exp_log <- log_experiment(
  exp_log,
  id        = "DT-2",
  model     = "Decision Tree",
  objective = "Test engineered features (recent_contact, intensity, interactions) with class weights.",
  variation = paste0("Features with interactions; grid over cp/minsplit/maxdepth; weights=", pick_dt2$weighted),
  controls  = "Same split/CV; duration excluded; na.roughfix for tree.",
  metrics   = "PR-AUC (primary), Recall@Top10%, ROC-AUC",
  result    = fmt_res(dt2_test_metrics),
  conclusion= "Feature engineering held parity and slightly improved ranking consistency.",
  recommend = "Keep engineered features; retain weights and pruning."
)
exp_log
# A tibble: 2 × 9
  id    model   objective variation controls metrics result conclusion recommend
  <chr> <chr>   <chr>     <chr>     <chr>    <chr>   <chr>  <chr>      <chr>    
1 DT-1  Decisi… Improve … Grid ove… Same fe… PR-AUC… Test … Weighted … Adopt co…
2 DT-2  Decisi… Test eng… Features… Same sp… PR-AUC… Test … Feature e… Keep eng…

Random Forests

For random forests we’ll implement the following experiments

  • RF1: Baseline RF with class weights

    Objective: To establish a strong, low variance baseline for tabular data.

    Variation: We’ll use class weights, ntree=500, default mtry=⌊√p⌋.

    Controls: Same features as DT2 (engineered present), same split/CV.

    Metrics: PR-AUC (primary), Recall@Top10%, ROC-AUC, OOB error (reported).

    Hypothesis: RF1 will beat DT on PR-AUC and Recall@Top10% due to variance reduction.

    Decision rule: If RF1 outperforms best DT by more than 2 to 3 points PR-AUC, we’ll promote RF as current top candidate.

  • RF2: Hyperparameter sweep for mtry and trees

    Objective: Test if tuning mtry and depth yields meaningful gains.

    Variation: Grid over mtry = (√p, p/2, p) and ntree = 500; max depth (NA, 12); class weights in ranger.

    Controls: Same features as DT2 (engineered present), same split/ 3-fold CV.

    Metrics: PR-AUC, Recall@Top10%, run-time.

    Hypothesis: A slightly larger mtry and depth will slightly lift PR-AUC over RF1.

    Decision rule: Adopt the smallest model that achieves best PR-AUC.

We’ll add a few helpers for both RF experiments.

First, we add a function to fit and predict probabilities for one split

fit_predict_rf <- function(train_data, valid_data,
                           ntree = 500, mtry = NULL, maxnodes = NULL,
                           class_wts = NULL, seed = 489) {
  set.seed(seed)

  # mtry default = sqrt(p)
  p <- ncol(train_data) - 1
  if (is.null(mtry)) mtry <- floor(sqrt(p))
  mtry <- max(1, min(mtry, p))

  # Train with NA handling (roughfix imputes: numeric median, factor mode)
  rf_fit <- randomForest::randomForest(
    y ~ .,
    data        = train_data,
    ntree       = ntree,
    mtry        = mtry,
    maxnodes    = maxnodes,
    classwt     = class_wts,
    importance  = TRUE,
    keep.forest = TRUE,
    na.action   = randomForest::na.roughfix
  )

  # Also roughfix NAs in validation before predict
  valid_data2 <- randomForest::na.roughfix(valid_data)
  preds <- predict(rf_fit, newdata = valid_data2, type = "prob")[, "yes"]

  tibble(.pred_yes = preds, y = valid_data$y)
}

And we add a function for a 5-fold CV evaluation for a given RF configuration

cv_eval_rf <- function(folds, ntree, mtry, maxnodes = NULL, class_wts = NULL) {
  fold_metrics <- purrr::map_dfr(folds$splits, function(s) {
    tr <- rsample::analysis(s)
    vl <- rsample::assessment(s)
    p  <- fit_predict_rf(tr, vl,
                         ntree = ntree, mtry = mtry, maxnodes = maxnodes,
                         class_wts = class_wts, seed = 42)
    compute_metrics(p)
  })
  fold_metrics |>
    dplyr::summarise(dplyr::across(dplyr::everything(), \(x) mean(x, na.rm = TRUE))) |>
    dplyr::mutate(ntree = ntree, mtry = mtry, maxnodes = maxnodes)
}

We’ll use folds_dt2, which we built earlier on train_df_dt2, so RF sees the same engineered features and consistent factor levels.

Finally, we have to add a “compressor” function to reduce high-cardinality factors so that every factor passed to RF has at most 53 levels (randomForest’s limit on categorical predictors, which we were hitting). The safest approach is to keep the top K most frequent levels from the training set and lump the rest into “Other”, then apply that same mapping to the validation and test folds.

# Keep at most `k` most frequent levels (by TRAIN frequency); lump the rest to "Other"
compress_factor_levels <- function(train_fct, new_fct, k = 50, other_level = "Other") {
  # Frequency ranking on training only
  freq <- sort(table(train_fct), decreasing = TRUE)
  keep <- names(freq)[seq_len(min(k, length(freq)))]
  train_out <- forcats::fct_other(train_fct, keep = keep, other_level = other_level)
  new_out   <- forcats::fct_other(new_fct,   keep = keep, other_level = other_level)
  # Lock the level set so RF sees identical levels everywhere
  lvl <- union(keep, other_level)
  train_out <- factor(train_out, levels = lvl)
  new_out   <- factor(new_out,   levels = lvl)
  list(train = train_out, new = new_out, keep = keep)
}
# Check which factor(s) are too large
sapply(train_df_dt2 |> dplyr::select(where(is.factor)), nlevels)
               job            marital          education            default 
                11                  3                  7                  2 
           housing               loan            contact              month 
                 2                  2                  2                 10 
       day_of_week           poutcome                  y           no_prior 
                 5                  3                  2                  2 
    recent_contact          intensity         age_bucket          had_prior 
                 2                  3                  4                  2 
recent_x_intensity          edu_x_job 
                 6                 60 
# Compress edu_x_job to less than 50 levels (we can tweak k if needed; less than 53 is required)
cmp <- compress_factor_levels(train_df_dt2$edu_x_job, test_df_dt2$edu_x_job, k = 50)
train_df_dt2$edu_x_job <- cmp$train
test_df_dt2$edu_x_job  <- cmp$new

And then we rebuild CV folds on the compressed frame

folds_dt2 <- rsample::vfold_cv(train_df_dt2, v = 5, strata = y)

RF1

set.seed(489)

p_rf <- ncol(train_df_dt2) - 1
mtry_default <- floor(sqrt(p_rf))
rf1_cv <- cv_eval_rf(
  folds    = folds_dt2,
  ntree    = 500,
  mtry     = mtry_default,
  maxnodes = NULL,
  class_wts = class_wts
)
rf1_cv
# A tibble: 1 × 5
  pr_auc roc_auc rec_at10 ntree  mtry
   <dbl>   <dbl>    <dbl> <dbl> <dbl>
1  0.356   0.738    0.362   500     5

Now we can train the baseline RF1 on the full training set and evaluate on holdout

rf1_pred_test <- fit_predict_rf(
  train_data = train_df_dt2,
  valid_data = test_df_dt2,
  ntree      = 500,
  mtry       = mtry_default,
  maxnodes   = NULL,
  class_wts  = class_wts
)

RF1 Results

rf1_test_metrics <- compute_metrics(rf1_pred_test)
rf1_test_metrics
# A tibble: 1 × 3
  pr_auc roc_auc rec_at10
   <dbl>   <dbl>    <dbl>
1  0.330   0.735    0.350

So how does this compare to the DT baseline? Our best DT had a PR-AUC of around 0.463, ROC-AUC around 0.765, and Recall@Top10% around 0.387 on the same holdout. So RF1 underperforms both DT models on the business-relevant metrics (PR-AUC, Recall@10) and is slightly worse on ROC-AUC too.

This may indicate that our mtry was too small. With factors handled natively we had a good number of predictors, but an mtry of 5 was perhaps too restrictive, so strong splits (such as the macro indicators and the month/contact variables) aren’t sampled often enough at each node, which may flatten the signal.

We should also note that collapsing edu_x_job to the top K levels plus “Other” and using na.roughfix are necessary for randomForest, but they blunt the rare-interaction signal, so a single shallow DT can still latch onto the macro-cycle features and perform better.

So far generalization is consistent: the CV-to-test drops are small, so the model is behaving stably. It’s just not strong enough at ranking positives for this dataset with the current defaults.
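
The RF1 plan also listed OOB error as a reported metric; since fit_predict_rf only returns predictions, one way to expose it would be an optional, illustrative refit with the same settings (not an additional experiment):

# Optional: refit once on the full training frame to read the OOB error rate
rf1_oob_fit <- randomForest::randomForest(
  y ~ ., data = train_df_dt2,
  ntree = 500, mtry = mtry_default,
  classwt = class_wts,
  na.action = randomForest::na.roughfix
)
rf1_oob_fit$err.rate[500, "OOB"]  # OOB error after all 500 trees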

Now we add the results to the log

# Log experiment
exp_log <- log_experiment(
  exp_log,
  id        = "RF-1",
  model     = "Random Forest",
  objective = "Establish a fast ensemble baseline.",
  variation = sprintf("num.trees=500, mtry=%s, no depth cap, class weights", mtry_default),
  controls  = "Same features/split; na.roughfix; class weights for imbalance.",
  metrics   = "PR-AUC (primary), Recall@Top10%, ROC-AUC",
  result    = fmt_res(rf1_test_metrics),
  conclusion= "Underperforms against DT-1/DT-2",
  recommend = "Readjust mtry for RF-2."
)
exp_log
# A tibble: 3 × 9
  id    model   objective variation controls metrics result conclusion recommend
  <chr> <chr>   <chr>     <chr>     <chr>    <chr>   <chr>  <chr>      <chr>    
1 DT-1  Decisi… Improve … Grid ove… Same fe… PR-AUC… Test … Weighted … Adopt co…
2 DT-2  Decisi… Test eng… Features… Same sp… PR-AUC… Test … Feature e… Keep eng…
3 RF-1  Random… Establis… num.tree… Same fe… PR-AUC… Test … Underperf… Readjust…

RF2

Based on the previous results we’ll adjust the helpers, but this time we’ll use the ranger library, which is much faster.

fit_predict_rf_ranger <- function(train_data, valid_data,
                                  num.trees = 500, mtry = NULL,
                                  max.depth = NULL, min.node.size = 1,
                                  class_wts = NULL, seed = 489,
                                  num.threads = max(1, parallel::detectCores() - 1)) {
  set.seed(seed)

  # mtry default = sqrt(p) if not provided
  p <- ncol(train_data) - 1
  if (is.null(mtry)) mtry <- floor(sqrt(p))

  # Build case weights from class weights (so class imbalance is handled)
  w <- rep(1, nrow(train_data))
  if (!is.null(class_wts)) {
    w[train_data$y == "no"]  <- class_wts["no"]
    w[train_data$y == "yes"] <- class_wts["yes"]
  }

  # Keep NA handling consistent with earlier steps
  tr2 <- randomForest::na.roughfix(train_data)
  vl2 <- randomForest::na.roughfix(valid_data)

  # Fit ranger (fast RF). 
  fit <- ranger::ranger(
    y ~ ., data = tr2,
    probability   = TRUE,            
    num.trees     = num.trees,
    mtry          = mtry,
    max.depth     = max.depth,  # NULL = unlimited depth
    min.node.size = min.node.size,  # terminal node size
    case.weights  = w,
    importance    = "impurity",
    num.threads   = num.threads,
    seed          = seed
  )

  # Predict probabilities for the positive class
  preds <- predict(fit, data = vl2)$predictions[, "yes"]
  tibble(.pred_yes = preds, y = valid_data$y)
}
cv_eval_rf_ranger <- function(folds, num.trees, mtry, max.depth = NULL,
                              min.node.size = 1, class_wts = NULL) {
  fold_metrics <- purrr::map_dfr(folds$splits, function(s) {
    tr <- rsample::analysis(s)
    vl <- rsample::assessment(s)
    p  <- fit_predict_rf_ranger(
      train_data = tr, valid_data = vl,
      num.trees = num.trees, mtry = mtry,
      max.depth = max.depth, min.node.size = min.node.size,
      class_wts = class_wts, seed = 489
    )
    compute_metrics(p)
  })
  fold_metrics |>
    dplyr::summarise(dplyr::across(dplyr::everything(), \(x) mean(x, na.rm = TRUE))) |>
    dplyr::mutate(num.trees = num.trees, mtry = mtry,
                  max.depth = max.depth, min.node.size = min.node.size)
}
set.seed(489)

# Feature count (exclude y)
p_rf <- ncol(train_df_dt2) - 1
mtry_default <- floor(sqrt(p_rf))
# Try larger mtry values so strong features compete more often
mtry_grid <- unique(c(mtry_default, ceiling(p_rf/2), p_rf))  # in this case {sqrt(p), p/2, p}
num_trees_grid <- c(500) # a fast baseline
max_depth_grid <- c(NA, 12) # NA = unlimited, 12 = light cap
min_node_grid  <- c(1) # we'll keep it simple

Now, we build the CV across the grid

# Build the grid
rf2_grid <- expand.grid(
  num.trees     = num_trees_grid,
  mtry          = mtry_grid,
  max.depth     = max_depth_grid,
  min.node.size = min_node_grid,
  KEEP.OUT.ATTRS = FALSE
)

Temporarily drop CV to 3-fold while tuning

folds_dt2 <- rsample::vfold_cv(train_df_dt2, v = 3, strata = y)
rf2_results <- purrr::map_dfr(seq_len(nrow(rf2_grid)), function(i) {
  params <- rf2_grid[i, ]
  cat(sprintf("\n[RF2 %d/%d] trees=%d, mtry=%d, max.depth=%s, min.node=%d\n",
              i, nrow(rf2_grid), params$num.trees, params$mtry,
              ifelse(is.na(params$max.depth), "NA", params$max.depth), params$min.node.size))
  flush.console()

  cv_eval_rf_ranger(
    folds        = folds_dt2,
    num.trees    = params$num.trees,
    mtry         = params$mtry,
    max.depth    = if (is.na(params$max.depth)) NULL else params$max.depth,
    min.node.size= params$min.node.size,
    class_wts    = class_wts
  )
}) |>
  dplyr::arrange(dplyr::desc(pr_auc), dplyr::desc(rec_at10))

[RF2 1/6] trees=500, mtry=5, max.depth=NA, min.node=1

[RF2 2/6] trees=500, mtry=13, max.depth=NA, min.node=1

[RF2 3/6] trees=500, mtry=26, max.depth=NA, min.node=1

[RF2 4/6] trees=500, mtry=5, max.depth=12, min.node=1

[RF2 5/6] trees=500, mtry=13, max.depth=12, min.node=1

[RF2 6/6] trees=500, mtry=26, max.depth=12, min.node=1
pick_rf2 <- rf2_results |>
  dplyr::arrange(dplyr::desc(pr_auc), dplyr::desc(rec_at10), mtry, num.trees) |>
  dplyr::slice(1)
pick_rf2
# A tibble: 1 × 7
  pr_auc roc_auc rec_at10 num.trees  mtry min.node.size max.depth
   <dbl>   <dbl>    <dbl>     <dbl> <dbl>         <dbl>     <dbl>
1  0.452   0.798    0.449       500     5             1        12
rf2_pred_test <- fit_predict_rf_ranger(
  train_data = train_df_dt2,
  valid_data = test_df_dt2,
  num.trees  = pick_rf2$num.trees,
  mtry       = pick_rf2$mtry,
  max.depth  = if (is.null(pick_rf2$max.depth) || is.na(pick_rf2$max.depth)) NULL else pick_rf2$max.depth,
  min.node.size = pick_rf2$min.node.size,
  class_wts  = class_wts,
  seed       = 489
)

RF2 Results

rf2_test_metrics <- compute_metrics(rf2_pred_test)
rf2_test_metrics
# A tibble: 1 × 3
  pr_auc roc_auc rec_at10
   <dbl>   <dbl>    <dbl>
1  0.462   0.804    0.453

The best CV config was with num.trees = 500, mtry = 5, max.depth = 12, min.node.size = 1.

The CV metrics obtained were PR-AUC = 0.452, ROC-AUC = 0.798, Recall@Top10% = 0.449.

And the holdout test with that same config was:

PR-AUC = 0.462, ROC-AUC = 0.804, Recall@Top10% = 0.453.

There was a clear improvement versus RF-1 (PR-AUC went from 0.33 to 0.462, Recall@10 from 0.35 to 0.453).

This is on par with, or slightly better than, DT on the test set: DT had a PR-AUC of 0.463 and Recall@10 of 0.387, so RF2 matches the PR-AUC and beats Recall@10, meaning it captures more positives within the same top-10% call budget.

The depth cap (12) likely reduced noisy deep splits, while class weights kept minority class recall high. The winning mtry=5 (not larger) suggests a few very strong features dominate splits; higher mtry didn’t add value because of correlated predictors and diminishing returns.
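
If we want to see which predictors the tuned forest actually leans on, a quick refit with impurity importance is one option; this is an optional sketch reusing the winning RF2 configuration from above, since fit_predict_rf_ranger does not return the fitted model.

# Optional: refit RF2's winning configuration (mtry = 5, max.depth = 12) and inspect impurity importance
rf2_imp_fit <- ranger::ranger(
  y ~ ., data = randomForest::na.roughfix(train_df_dt2),
  probability = TRUE, num.trees = 500, mtry = 5, max.depth = 12,
  case.weights = row_weights_from_class(train_df_dt2, "y", class_wts),
  importance = "impurity", seed = 489
)
head(sort(rf2_imp_fit$variable.importance, decreasing = TRUE), 10)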

# Log experiment
exp_log <- log_experiment(
  exp_log,
  id        = "RF-2",
  model     = "Random Forest (ranger)",
  objective = "Improve ranking (PR-AUC, Recall@Top10%) by faster RF + tuning mtry and shallow depth control.",
  variation = sprintf("3-fold CV; num.trees=%d; mtry=%d; max.depth=%s; min.node.size=%d; class weights; ranger engine",
                      pick_rf2$num.trees, pick_rf2$mtry,
                      ifelse(is.na(pick_rf2$max.depth), "NA", pick_rf2$max.depth),
                      pick_rf2$min.node.size),
  controls  = "Same engineered feature frame as DT-2 (with level compression); same train/test split and metrics.",
  metrics   = "Primary: PR-AUC; Secondary: Recall@Top10%, ROC-AUC",
  result    = sprintf("CV: PR-AUC=%.3f, ROC-AUC=%.3f, Recall@10%%=%.3f | Test: PR-AUC=%.3f, ROC-AUC=%.3f, Recall@10%%=%.3f",
                      pick_rf2$pr_auc, pick_rf2$roc_auc, pick_rf2$rec_at10,
                      rf2_test_metrics$pr_auc, rf2_test_metrics$roc_auc, rf2_test_metrics$rec_at10),
  conclusion= "Tuned ranger RF substantially outperforms RF-1 and improves top-decile recall vs DT while matching PR-AUC. Depth cap helped reduce noise; larger mtry values did not help further.",
  recommend = "Prefer RF-2 over RF-1. Compare RF-2 against the pruned DT for deployment—RF-2 gives higher Recall@Top10% (more conversions at fixed outreach), with acceptable interpretability via feature importance."
)

exp_log
# A tibble: 4 × 9
  id    model   objective variation controls metrics result conclusion recommend
  <chr> <chr>   <chr>     <chr>     <chr>    <chr>   <chr>  <chr>      <chr>    
1 DT-1  Decisi… Improve … Grid ove… Same fe… PR-AUC… Test … Weighted … Adopt co…
2 DT-2  Decisi… Test eng… Features… Same sp… PR-AUC… Test … Feature e… Keep eng…
3 RF-1  Random… Establis… num.tree… Same fe… PR-AUC… Test … Underperf… Readjust…
4 RF-2  Random… Improve … 3-fold C… Same en… Primar… CV: P… Tuned ran… Prefer R…

AdaBoost

For the AdaBoost experiments we’ll run:

  • AB1:

    Objective: To establish a boosted-stump baseline to measure how far a high-bias model can rank minority “yes” cases.

    Variation: Number of boosting rounds (nIter = 150 for CV, 200 for final test).

    Controls: Same features (DT2 set), no class weights (AdaBoost reweights internally), same 3-fold CV, same metrics.

    Hypothesis: Additional rounds provide modest gains, similar to tree-based models with interaction depth.

    Decision rule: Depending on results, will keep as reference and move to higher-capacity weak learners in AB2.

  • AB2:

    Objective: Test whether shallow rpart weak learners (depth 3) in AdaBoost improve minority ranking over stump-based AB1.

    Variation: Use maxdepth = 3 and mfinal = 150 (with a quick check at 100), coeflearn = “Zhu”.

    Controls: Same DT2 features and split; no external class weights; 3-fold stratified CV; na.roughfix for NAs.

    Metric: Primary PR-AUC; secondary Recall@Top10% and ROC-AUC.

    Hypothesis: Depth-3 weak learners reduce bias and lift PR-AUC by ~0.02–0.04 vs AB1.

    Decision rule: Keep AB2 if PR-AUC higher than AB1 and test PR-AUC doesn’t drop; otherwise revert to AB1 or prefer RF2.

For this assignment we originally used AdaBoost.M1 exclusively; however, after it became clear that running even a single fold would be extremely time consuming, even on a severely reduced dataset, we decided to switch to the fastAdaboost package in order to handle the larger workload.

For this we’ll use a helper that fits AdaBoost once and predicts probabilities on the validation set. It uses fastAdaboost::adaboost (AdaBoost.M1 with decision stumps).

fit_predict_ab_fast <- function(train_data, valid_data, nIter = 200, seed = 489) {
  set.seed(seed)

  # 1) NA handling consistent with other models
  tr2 <- randomForest::na.roughfix(train_data)
  vl2 <- randomForest::na.roughfix(valid_data)

  # 2) Coerce to base data.frame + enforce plain 2-level factor target
  tr2 <- as.data.frame(tr2)
  vl2 <- as.data.frame(vl2)
  tr2$y <- factor(tr2$y, levels = c("no","yes"), ordered = FALSE)
  vl2$y <- factor(vl2$y, levels = levels(tr2$y), ordered = FALSE)

  # 3) Make sure no ordered/character predictors sneak in
  to_unordered <- function(df) {
    ord_cols <- vapply(df, is.ordered,  logical(1))
    chr_cols <- vapply(df, is.character, logical(1))
    if (any(ord_cols)) df[ord_cols] <- lapply(df[ord_cols], function(x) factor(as.character(x)))
    if (any(chr_cols)) df[chr_cols] <- lapply(df[chr_cols], factor)
    df
  }
  tr2 <- to_unordered(tr2)
  vl2 <- to_unordered(vl2)

  # 4) Fit AdaBoost (stumps)
  ab_fit <- fastAdaboost::adaboost(y ~ ., data = tr2, nIter = nIter)

  # 5) Predict probabilities; handle missing colnames safely
  pr <- predict(ab_fit, newdata = vl2, type = "prob")
  prob_mat <- pr$prob
  yes_prob <- if (!is.null(colnames(prob_mat))) {
    prob_mat[, "yes", drop = TRUE]
  } else {
    # We enforced levels(y)==c("no","yes"), so 'yes' must be column 2
    prob_mat[, 2, drop = TRUE]
  }

  tibble::tibble(.pred_yes = yes_prob, y = vl2$y)
}

3-fold CV wrapper for AB1 (fast) so it completes quickly

cv_eval_ab_fast <- function(folds, nIter = 200) {
  mets <- purrr::map_dfr(seq_along(folds$splits), function(j) {
    cat(sprintf("[AB1-fast] Fold %d/%d, nIter=%d\n", j, length(folds$splits), nIter)); flush.console()
    tr <- rsample::analysis(folds$splits[[j]])
    vl <- rsample::assessment(folds$splits[[j]])
    preds <- fit_predict_ab_fast(tr, vl, nIter = nIter, seed = 489)
    compute_metrics(preds)   # PR-AUC / ROC-AUC / Recall@Top10%
  })
  mets |>
    dplyr::summarise(dplyr::across(dplyr::everything(), \(x) mean(x, na.rm = TRUE))) |>
    dplyr::mutate(nIter = nIter)
}

AB1

We’ll use 3-fold CV for tuning, then refit once and report test metrics

set.seed(489)
folds_ab1 <- rsample::vfold_cv(train_df_dt2, v = 3, strata = y)

ab1_cv <- cv_eval_ab_fast(folds_ab1, nIter = 150)
[AB1-fast] Fold 1/3, nIter=150
[AB1-fast] Fold 2/3, nIter=150
[AB1-fast] Fold 3/3, nIter=150
ab1_cv
# A tibble: 1 × 4
  pr_auc roc_auc rec_at10 nIter
   <dbl>   <dbl>    <dbl> <dbl>
1  0.358   0.757    0.378   150
ab1_test_pred    <- fit_predict_ab_fast(train_df_dt2, test_df_dt2, nIter = 200)
ab1_test_metrics <- compute_metrics(ab1_test_pred)

AB1 Results

ab1_test_metrics
# A tibble: 1 × 3
  pr_auc roc_auc rec_at10
   <dbl>   <dbl>    <dbl>
1  0.349   0.757    0.370

Despite the amount of time AdaBoost takes, its performance is notably below DT1/DT2 and RF2 (our tuned RF2 hit PR-AUC of around 0.46).

Stumps are very high-bias learners; boosting them for 150–200 rounds still underfits the complex interactions we discovered in the EDA (such as recent contact versus intensity). The gap to RF2 suggests insufficient capacity rather than overfitting: ROC and recall are steady from CV to test.

AB1 confirms the baseline: stumps are too weak for this dataset.

Let’s log the results

exp_log <- log_experiment(
  exp_log,
  id        = "AB-1",
  model     = "AdaBoost (fastAdaboost stumps)",
  objective = "Fast boosting baseline with weak learners (depth=1).",
  variation = "maxdepth=1, mfinal=150, coeflearn='Zhu'; 3-fold CV; no explicit class weights.",
  controls  = "Same features/split; na.roughfix to handle NAs consistently.",
  metrics   = "PR-AUC (primary), Recall@Top10%, ROC-AUC",
  result    = fmt_res(ab1_test_metrics),
  conclusion= "Stumps underfit; lower PR-AUC and recall vs AB2",
  recommend = "Use as baseline only; prefer AB2."
)

exp_log
# A tibble: 5 × 9
  id    model   objective variation controls metrics result conclusion recommend
  <chr> <chr>   <chr>     <chr>     <chr>    <chr>   <chr>  <chr>      <chr>    
1 DT-1  Decisi… Improve … Grid ove… Same fe… PR-AUC… Test … Weighted … Adopt co…
2 DT-2  Decisi… Test eng… Features… Same sp… PR-AUC… Test … Feature e… Keep eng…
3 RF-1  Random… Establis… num.tree… Same fe… PR-AUC… Test … Underperf… Readjust…
4 RF-2  Random… Improve … 3-fold C… Same en… Primar… CV: P… Tuned ran… Prefer R…
5 AB-1  AdaBoo… Fast boo… maxdepth… Same fe… PR-AUC… Test … Stumps un… Use as b…

AB2

For the second round of AdaBoost experiments we’ll improve the helpers.

fit_predict_ab_adabag <- function(train_data, valid_data,
                                  mfinal = 100, maxdepth = 2,
                                  coeflearn = "Zhu", seed = 489) {
  set.seed(seed)

  # Same preprocessing as other models
  tr2 <- randomForest::na.roughfix(train_data)
  vl2 <- randomForest::na.roughfix(valid_data)
  tr2 <- as.data.frame(tr2); vl2 <- as.data.frame(vl2)

  tr2$y <- factor(tr2$y, levels = c("no","yes"), ordered = FALSE)
  vl2$y <- factor(vl2$y, levels = levels(tr2$y), ordered = FALSE)

  # Ensure no ordered/character predictors
  to_unordered <- function(df) {
    ord_cols <- vapply(df, is.ordered,  logical(1))
    chr_cols <- vapply(df, is.character, logical(1))
    if (any(ord_cols)) df[ord_cols] <- lapply(df[ord_cols], function(x) factor(as.character(x)))
    if (any(chr_cols)) df[chr_cols] <- lapply(df[chr_cols], factor)
    df
  }
  tr2 <- to_unordered(tr2)
  vl2 <- to_unordered(vl2)

  # Fast rpart controls
  base_ctrl <- rpart::rpart.control(
    maxdepth     = maxdepth,
    minsplit     = 20,
    cp           = 0,
    xval         = 0,
    maxcompete   = 0,
    maxsurrogate = 0,
    usesurrogate = 0
  )

  fit <- adabag::boosting(
    y ~ ., data = tr2,
    mfinal    = mfinal,
    boos      = TRUE,
    coeflearn = coeflearn,
    control   = base_ctrl
  )

  pr <- predict(fit, newdata = vl2)
  prob_mat <- as.matrix(pr$prob)

  # Robust “yes” extraction: use name if present, else index 2 (levels = c("no","yes"))
  yes_prob <- if (!is.null(colnames(prob_mat)) && "yes" %in% colnames(prob_mat)) {
    prob_mat[, "yes", drop = TRUE]
  } else {
    prob_mat[, 2, drop = TRUE]
  }

  tibble::tibble(.pred_yes = yes_prob, y = vl2$y)
}
cv_eval_ab_adabag <- function(folds, mfinal = 100, maxdepth = 2, coeflearn = "Zhu") {
  mets <- purrr::map_dfr(seq_along(folds$splits), function(j) {
    cat(sprintf("[AB2] Fold %d/%d, mfinal=%d, depth=%d\n",
                j, length(folds$splits), mfinal, maxdepth)); flush.console()
    tr <- rsample::analysis(folds$splits[[j]])
    vl <- rsample::assessment(folds$splits[[j]])
    preds <- fit_predict_ab_adabag(tr, vl, mfinal = mfinal,
                                   maxdepth = maxdepth, coeflearn = coeflearn)
    compute_metrics(preds)
  })
  mets |>
    dplyr::summarise(dplyr::across(dplyr::everything(), \(x) mean(x, na.rm = TRUE))) |>
    dplyr::mutate(mfinal = mfinal, maxdepth = maxdepth, coeflearn = coeflearn)
}

Using 3 folds

set.seed(489)
folds_ab2 <- rsample::vfold_cv(train_df_dt2, v = 3, strata = y)

We will keep mfinal at 150, only increasing maxdepth to explore how much impact it has on its own.

ab2_grid <- tidyr::expand_grid(
  mfinal   = c(150),   # keep small for speed
  maxdepth = c(3) # focus on capacity vs stumps
)
ab2_results <- purrr::map_dfr(seq_len(nrow(ab2_grid)), function(i) {
  g <- ab2_grid[i, ]
  cat(sprintf("[AB2] Folded CV for mfinal=%d depth=%d\n", g$mfinal, g$maxdepth)); flush.console()
  cv_eval_ab_adabag(folds_ab2, mfinal = g$mfinal, maxdepth = g$maxdepth, coeflearn = "Zhu")
}) |>
  dplyr::arrange(dplyr::desc(pr_auc), dplyr::desc(rec_at10))
[AB2] Folded CV for mfinal=150 depth=3
[AB2] Fold 1/3, mfinal=150, depth=3
[AB2] Fold 2/3, mfinal=150, depth=3
[AB2] Fold 3/3, mfinal=150, depth=3
ab2_results
# A tibble: 1 × 6
  pr_auc roc_auc rec_at10 mfinal maxdepth coeflearn
   <dbl>   <dbl>    <dbl>  <dbl>    <dbl> <chr>    
1  0.430   0.784    0.426    150        3 Zhu      

We can then apply those settings to our test set.

pick_ab2 <- dplyr::slice(ab2_results, 1)
pick_ab2
# A tibble: 1 × 6
  pr_auc roc_auc rec_at10 mfinal maxdepth coeflearn
   <dbl>   <dbl>    <dbl>  <dbl>    <dbl> <chr>    
1  0.430   0.784    0.426    150        3 Zhu      
ab2_test_pred    <- fit_predict_ab_adabag(train_df_dt2, test_df_dt2,
                                          mfinal = pick_ab2$mfinal,
                                          maxdepth = pick_ab2$maxdepth,
                                          coeflearn = pick_ab2$coeflearn)
ab2_test_metrics <- compute_metrics(ab2_test_pred)

AB2 Results

ab2_test_metrics
# A tibble: 1 × 3
  pr_auc roc_auc rec_at10
   <dbl>   <dbl>    <dbl>
1  0.453   0.791    0.443

Cross-validation (3-fold): PR-AUC = 0.430, ROC-AUC = 0.784, Recall@Top10% = 0.426.

Held-out test: PR-AUC = 0.453, ROC-AUC = 0.791, Recall@Top10% = 0.443.

Moving from stumps (AB1) to depth 3 weak learners increased minority ranking quality. The test PR-AUC jumped from around 0.349 (AB1) to around 0.453 (AB2) and Recall@10% from around 0.37 to around 0.44.

Generalization looks healthy: the test set performs slightly better than CV on PR-AUC and Recall@10, suggesting low variance and that the extra capacity is useful rather than causing overfitting.

Comparing against the other models, we see that AB2 is competitive with RF2 (recall that the RF2 test PR-AUC was about 0.462 and Recall@10 around 0.453). It clearly beats RF1 and both DT baselines. RF2 still performs slightly better on PR-AUC, but AB2 is within the same band.
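
To make the comparison concrete, we can stack the holdout metrics from all six experiments into one small table (a summary sketch that assumes the *_test_metrics tibbles from each experiment are still in memory):

# Side-by-side holdout metrics for every experiment
dplyr::bind_rows(
  dt1_test_metrics |> dplyr::mutate(model = "DT-1"),
  dt2_test_metrics |> dplyr::mutate(model = "DT-2"),
  rf1_test_metrics |> dplyr::mutate(model = "RF-1"),
  rf2_test_metrics |> dplyr::mutate(model = "RF-2"),
  ab1_test_metrics |> dplyr::mutate(model = "AB-1"),
  ab2_test_metrics |> dplyr::mutate(model = "AB-2")
) |>
  dplyr::relocate(model) |>
  dplyr::arrange(dplyr::desc(pr_auc))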

Finally we can log these results as well

exp_log <- log_experiment(
  exp_log,
  id        = "AB-2",
  model     = "AdaBoost (adabag + rpart shallow)",
  objective = "Increase model capacity with shallow weak learners while maintaining generalization.",
  variation = "maxdepth=3, mfinal=150, coeflearn='Zhu'; 3-fold CV; no explicit class weights.",
  controls  = "Same features/split; na.roughfix.",
  metrics   = "PR-AUC (primary), Recall@Top10%, ROC-AUC",
  result    = fmt_res(ab2_test_metrics),
  conclusion= "Depth=3 boosted PR-AUC and Recall@10; generalization test better than CV.",
  recommend = "Keep AB2 as a co-finalist with RF2."
)
exp_log
# A tibble: 6 × 9
  id    model   objective variation controls metrics result conclusion recommend
  <chr> <chr>   <chr>     <chr>     <chr>    <chr>   <chr>  <chr>      <chr>    
1 DT-1  Decisi… Improve … Grid ove… Same fe… PR-AUC… Test … Weighted … Adopt co…
2 DT-2  Decisi… Test eng… Features… Same sp… PR-AUC… Test … Feature e… Keep eng…
3 RF-1  Random… Establis… num.tree… Same fe… PR-AUC… Test … Underperf… Readjust…
4 RF-2  Random… Improve … 3-fold C… Same en… Primar… CV: P… Tuned ran… Prefer R…
5 AB-1  AdaBoo… Fast boo… maxdepth… Same fe… PR-AUC… Test … Stumps un… Use as b…
6 AB-2  AdaBoo… Increase… maxdepth… Same fe… PR-AUC… Test … Depth=3 b… Keep AB2…