A Portuguese bank conducted a marketing campaign (phone calls) to predict whether a client would subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics that will help the bank persuade more customers to subscribe to a term deposit in its next campaign.
This assignment builds on the findings from the previous ones and compares the existing results against a new algorithm: SVM. We aim to find out which algorithm is recommended for more accurate results.
Libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.0 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(489) ## let's define a fixed seed, also for reproducibility
The file uses semicolons instead of commas as the delimiter, so we have to account for that. Also, when exploring the CSV we found some “unknown” values, so we’ll turn those into NAs.
This time we’ll also load the log that we saved previously with the results from our DT, RF and AB models.
# Loading, and converting "unknown" and "" to NA; clean names to snake_case
library(janitor)   # clean_names(); assumed here in case it wasn't loaded in a setup chunk

uci_raw <- read_delim(url, delim = ";", show_col_types = FALSE,
                      na = c("", "unknown")) |>
  clean_names()
# Remove exact duplicate rows (we know there's very few, but it's a good practice)
uci_raw <- uci_raw |>
  distinct()
We’ll re-use a function that helped prepare the data in previous assignments. We’ll call it prep_features and apply it to the full dataset, since all the experiments will benefit from it and we’ll have a uniform starting point.
prep_features <- function(df) {
  df |>
    # Target to ordered factor (no, yes)
    mutate(y = factor(y, levels = c("no", "yes"))) |>
    # we'll drop the leakage feature, since duration is known only after the call
    select(-duration) |>
    # to handle "pdays" we'll create a "no prior contact" flag and a cleaned numeric one
    mutate(
      no_prior   = if_else(pdays == 999, "yes", "no"),
      pdays_real = if_else(pdays == 999, NA, pdays),
      # and the recent contact flag
      recent_contact = if_else(!is.na(pdays_real) & pdays_real < 7, "recent", "not_recent"),
      # also some contact intensity bands from campaign
      intensity = case_when(
        campaign <= 2 ~ "low",
        campaign <= 5 ~ "medium",
        TRUE          ~ "high"
      ),
      # we had defined age buckets to segment the data for easy handling
      age_bucket = cut(age, breaks = c(0, 30, 45, 60, Inf),
                       labels = c("<=30", "31-45", "46-60", "60+"), right = TRUE),
      # Creating a flag for "Prior outcome" (if it has any prior outcome)
      had_prior = if_else(poutcome == "nonexistent", "no", "yes")
    ) |>
    # finally, keep categorical columns as factors so trees can split on them
    mutate(
      across(where(is.character), ~ factor(.x)),
      across(c(no_prior, recent_contact, intensity, age_bucket, had_prior), ~ factor(.x))
    )
}
Then, we apply it to the dataset:
uci <- prep_features(uci_raw)
Additionally, we want to reduce collinearity among the macroeconomic variables by dropping one near-duplicate: we’ll keep euribor3m and drop nr_employed. This is optional for trees, but we do it to stabilize comparisons.
if ("nr_employed" %in% names(uci)) {
  uci <- uci |> select(-nr_employed)
}
Train and Test split, CV folds
We will do an 80/20 stratified split for the final holdout.
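The split chunk itself is not echoed in this section, so here is a minimal sketch (an assumption, not necessarily the original code) of one way to build the stratified 80/20 split that produces the train_df and test_df objects used below:
# Stratified 80/20 split sketch: sample 80% of the row ids within each class of y
train_idx <- uci |>
  mutate(.row_id = row_number()) |>
  group_by(y) |>
  slice_sample(prop = 0.8) |>
  pull(.row_id)

train_df <- uci[train_idx, ]
test_df  <- uci[-train_idx, ]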
To mitigate class imbalance, we’ll give a higher weight to the minority class “yes”, inversely proportional to its frequency. We’ll use a function for this called compute_class_weights.
compute_class_weights <- function(df, y_col = "y") {
  tab <- table(df[[y_col]])
  # Inverse frequency, then rescale so that the mean weight is 1
  w <- 1 / as.numeric(tab)
  names(w) <- names(tab)
  w <- w / mean(w)
  return(w)
}
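As a quick illustration (assuming the train_df created in the split sketch above), the helper can be called directly; the LiblineaR-specific weight vector built later in this notebook recomputes the same quantities inline.
# Hypothetical usage: inverse-frequency weights rescaled to mean 1; "yes" gets the larger weight
cw_example <- compute_class_weights(train_df)
cw_example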
Since we will be comparing the SVM results with those from previous experiments, we’ll re-use a log where we can add our new results for easy comparison and export.
# defining the log
# creating an empty log with the exact schema we want to register
init_exp_log <- function() {
  tibble::tibble(
    id         = character(),
    model      = character(),
    objective  = character(),
    variation  = character(),
    controls   = character(),
    metrics    = character(),
    result     = character(),
    conclusion = character(),
    recommend  = character()
  )
}
# defining function to populate the log with results
# To be safe, we add an appender that enforces the same column set/order every time
log_experiment <- function(log, id, model, objective, variation, controls,
                           metrics, result, conclusion, recommend) {
  new_row <- tibble::tibble(
    id         = as.character(id),
    model      = as.character(model),
    objective  = as.character(objective),
    variation  = as.character(variation),
    controls   = as.character(controls),
    metrics    = as.character(metrics),
    result     = as.character(result),
    conclusion = as.character(conclusion),
    recommend  = as.character(recommend)
  )
  # and if the log is missing or has a different column set, it re-starts cleanly
  if (is.null(log) || !identical(names(log), names(new_row))) {
    log <- init_exp_log()
  }
  dplyr::bind_rows(log, new_row)
}
fmt_res <- function(m) {
  # m is a 1-row tibble with pr_auc, roc_auc, rec_at10
  sprintf("Test PR-AUC=%.4f, ROC-AUC=%.4f, Recall@10%%=%.4f",
          m$pr_auc, m$roc_auc, m$rec_at10)
}
Reviewing previous findings
Let’s quickly bring up again the saved experiment log from Project 2 so we can reference prior DT/RF/AB results in this notebook.
exp_log_prev
# A tibble: 6 × 9
id model objective variation controls metrics result conclusion recommend
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 DT-1 Decisi… Improve … Grid ove… Same fe… PR-AUC… Test … Weighted … Adopt co…
2 DT-2 Decisi… Test eng… Features… Same sp… PR-AUC… Test … Feature e… Keep eng…
3 RF-1 Random… Establis… num.tree… Same fe… PR-AUC… Test … Underperf… Readjust…
4 RF-2 Random… Improve … 3-fold C… Same en… Primar… CV: P… Tuned ran… Prefer R…
5 AB-1 AdaBoo… Fast boo… maxdepth… Same fe… PR-AUC… Test … Stumps un… Use as b…
6 AB-2 AdaBoo… Increase… maxdepth… Same fe… PR-AUC… Test … Depth=3 b… Keep AB2…
We’ll copy the previous log into exp_log so any new entries (from SVM) append to the same schema.
# So we can continue logging:
exp_log <- exp_log_prev
SVM
Initially we attempted a robust, two-stage preprocessing pipeline to avoid data leakage. The idea was to learn all statistics from the training set only, then apply those exact statistics to the validation/test sets. The first implementation (using e1071::svm) worked conceptually but proved slower and a bit fussier in practice. We decided to keep a short “legacy” section to document that approach and why we replaced it with a faster, simpler LiblineaR workflow.
Small legacy section
To treat the train and validation splits consistently, we first tried a function that scans a data frame, saves the numeric median for each column, and records which columns are factors.
We also needed a function that imputes missing numerics with the saved medians and fills missing factor levels with “Unknown,” ensuring stable factor types across folds.
This was the “learning” part of our one-hot encoding. It used the model.matrix() function, which creates a numeric design matrix for linear models: it took the training data (after imputation) and converted all factor columns into binary “dummy” columns.
Perhaps more importantly, it didn’t just return the matrix; it also returned the exact set of column names produced. This matters because the test set might not have all the same factor levels, which would result in a different set of dummy columns.
# chunk 24
encode_train_mm <- function(df, y_col = "y") {
  mm <- model.matrix(as.formula(paste(y_col, "~ .")), df)[, -1, drop = FALSE]
  list(X = mm, ref_cols = colnames(mm))
}
This was the “applying” part of the one-hot encoding, and its job is to solve the problem mentioned above. It takes the validation/test data and the list of target column names from encode_train_mm().
It runs model.matrix() on the new data. Then it performs some checks:
Missing Columns: If the training data had a color_Green column but the test set had no “Green” examples, this function manually adds a color_Green column filled with zeros.
Extra Columns: (Rarer, but good to check) If the test set had a level not in the train set (e.g., color_Purple), this function would remove it.
This ensures the final validation/test matrix has the exact same columns in the exact same order as the training matrix, which is a hard requirement for the predict() function.
# chunk 25
encode_apply_mm <- function(df, ref_cols, y_col = "y") {
  mm  <- model.matrix(as.formula(paste(y_col, "~ .")), df)[, -1, drop = FALSE]
  add <- setdiff(ref_cols, colnames(mm))
  if (length(add)) {
    mm_add <- matrix(0, nrow = nrow(mm), ncol = length(add))
    colnames(mm_add) <- add
    mm <- cbind(mm, mm_add)
  }
  mm[, ref_cols, drop = FALSE]
}
Next we tried creating a pipeline helper function, fit_predict_svm(), that bundled the entire process for a single CV fold (a sketch of this legacy helper appears after the list). It takes training and validation data, then:
1) Imputes: Runs apply_impute_maps().
2) Encodes: Runs encode_apply_mm().
3) Fits: Trains an e1071::svm model on the processed training data. It’s flexible, allowing us to test both a linear kernel (fast, simple) and an RBF (radial basis function) kernel (more complex, can find non-linear patterns).
4) Predicts: Uses the fitted model to generate probabilities for the “yes” class on the processed validation data.
This ensures the exact same preprocessing is applied consistently inside each loop.
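The legacy helper itself is not reproduced in this notebook; the following rough sketch shows what fit_predict_svm() looked like conceptually. The name learn_impute_maps() and the exact signatures of the imputation helpers are assumptions here, not the original code.
library(e1071)  # legacy path; replaced by LiblineaR below

# Rough sketch of the legacy per-fold helper (apply_impute_maps() is the
# imputation helper described above; its exact signature is assumed)
fit_predict_svm <- function(train_df, valid_df, kernel = "linear",
                            cost = 1, class_weights = NULL) {
  # 1) impute with train-only statistics
  imp      <- learn_impute_maps(train_df)          # assumed "learning" helper
  train_df <- apply_impute_maps(train_df, imp)
  valid_df <- apply_impute_maps(valid_df, imp)

  # 2) one-hot encode, aligning validation columns to the training columns
  enc  <- encode_train_mm(train_df)
  X_tr <- enc$X
  X_vl <- encode_apply_mm(valid_df, enc$ref_cols)

  # 3) fit e1071::svm with probability estimates (linear or RBF kernel)
  fit <- svm(x = X_tr, y = train_df$y, kernel = kernel, cost = cost,
             class.weights = class_weights, probability = TRUE)

  # 4) predict "yes" probabilities on the validation fold
  pr    <- predict(fit, X_vl, probability = TRUE)
  p_yes <- attr(pr, "probabilities")[, "yes"]

  tibble(.pred_yes = as.numeric(p_yes), y = valid_df$y)
}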
Then we had a cross-validation manager that takes the full dataset, creates K folds, and loops K times. In each loop, it uses fit_predict_svm() to train on K-1 folds and predict on the one held-out fold.
After the loop finishes, it has a complete set of out-of-sample predictions for the entire dataset. It then uses these predictions to calculate the key business metrics we defined back in Project 2 (a sketch of the compute_metrics() helper follows the list below):
PR-AUC: Area Under the Precision-Recall Curve (best for imbalanced data).
ROC-AUC: Area Under the Receiver Operating Characteristic Curve (standard classification metric).
Recall@10%: the percentage of true “yes” cases captured if we contact only the top 10% highest-probability predictions. This is a very common and practical business-focused metric.
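compute_metrics() was defined back in Project 2 and is simply reused in this notebook. Since its code isn’t repeated here, this is a minimal sketch of an equivalent helper, assuming the yardstick package and a predictions tibble with columns .pred_yes and y (factor with levels no/yes):
library(yardstick)  # assumed; PRROC would be an alternative

compute_metrics <- function(preds) {
  # preds: tibble with .pred_yes (probability of "yes") and y (factor no/yes)
  n_top <- ceiling(0.10 * nrow(preds))
  top   <- preds |> arrange(desc(.pred_yes)) |> slice_head(n = n_top)

  tibble(
    pr_auc   = pr_auc_vec(preds$y, preds$.pred_yes, event_level = "second"),
    roc_auc  = roc_auc_vec(preds$y, preds$.pred_yes, event_level = "second"),
    # share of all true "yes" cases captured in the top 10% of scores
    rec_at10 = sum(top$y == "yes") / sum(preds$y == "yes")
  )
}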
This section replaces the previous pipeline with one optimized for LiblineaR, which is known for being extremely fast on sparse, high-dimensional linear data.
This next chunk condenses chunks 22-25 into a single, optimized “learning” step. It scans the training data to learn all preprocessing statistics at once (medians for imputation, means/SDs for standardization, and all unique factor levels).
It then defines an apply_encoder() function that uses these “locked” stats to process new data by imputing, standardizing (scaling numerics to mean=0, sd=1, which helps LiblineaR converge), and one-hot encoding.
Finally, it applies this encoder to create the final X_tr (training matrix) and X_te (test matrix), ensuring they are perfectly aligned. It also prepares the labels: y_tr_num (numeric 0/1 for the model) and y_te_fac (factors for the yardstick or PRROC metrics packages).
## Train-locked impute + scale + one-hot
num_cols <- names(train_df)[sapply(train_df, is.numeric)]
cat_cols <- names(train_df)[sapply(train_df, is.factor)]

num_median <- vapply(train_df[num_cols], \(x) median(x, na.rm = TRUE), numeric(1))
num_mean   <- vapply(train_df[num_cols], \(x) mean(x, na.rm = TRUE), numeric(1))
num_sd     <- vapply(train_df[num_cols], \(x) sd(x, na.rm = TRUE), numeric(1))
num_sd[num_sd == 0 | is.na(num_sd)] <- 1
cat_levels <- lapply(train_df[cat_cols], levels)

apply_encoder <- function(df, form = y ~ .) {
  # numeric: impute + standardize (with TRAIN stats)
  for (nm in num_cols) if (nm %in% names(df)) {
    v <- df[[nm]]; v[is.na(v)] <- num_median[[nm]]
    df[[nm]] <- (v - num_mean[[nm]]) / num_sd[[nm]]
  }
  # categoricals: align levels to TRAIN; unseen -> first level (keeps rows)
  for (nm in cat_cols) if (nm %in% names(df)) {
    df[[nm]] <- factor(df[[nm]], levels = cat_levels[[nm]])
    df[[nm]][is.na(df[[nm]])] <- levels(df[[nm]])[1]
  }
  mm <- model.matrix(form, df)
  mm[, -1, drop = FALSE]  # drop intercept
}

X_tr     <- apply_encoder(train_df)
X_te_tmp <- apply_encoder(test_df)

# force same columns/order on TEST
miss <- setdiff(colnames(X_tr), colnames(X_te_tmp))
if (length(miss)) {
  X_te_tmp <- cbind(X_te_tmp,
                    matrix(0, nrow(X_te_tmp), length(miss),
                           dimnames = list(NULL, miss)))
}
X_te <- X_te_tmp[, colnames(X_tr), drop = FALSE]

# labels: numeric {0,1} for LiblineaR; keep factors for metrics
y_tr_num <- as.integer(train_df$y == "yes")
y_te_fac <- factor(test_df$y, levels = c("no", "yes"))
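As an optional sanity check (a small sketch, not part of the original pipeline), we can confirm the two matrices ended up aligned:
# same columns, same order; a mismatch here would break predict() later
stopifnot(identical(colnames(X_tr), colnames(X_te)))
dim(X_tr)
dim(X_te)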
Next we set up our cross-validation scheme for model selection (a sketch of the fold construction follows the list below).
3-fold: We’ve chosen K=3. This means we’ll train on 2/3 of the data and validate on 1/3, three times.
Stratified: This is critical for imbalanced data. It ensures that each of our 3 folds has the same percentage of “yes” and “no” cases as the original dataset. This prevents a “bad” split where one fold might accidentally get almost no “yes” cases.
Stable .row_id: This is a robustness trick. It makes it easy to join predictions back to our original data, even after all the shuffling and subsetting.
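The fold-construction chunk is not echoed here either; a minimal sketch under the same assumptions (train_df from the split sketch above), producing a folds list of validation-row indices, could look like this:
k <- 3

# stable .row_id = position within train_df (and therefore within X_tr);
# stratified fold assignment so each fold keeps the original yes/no proportions
fold_tbl <- train_df |>
  mutate(.row_id = row_number()) |>
  group_by(y) |>
  mutate(fold = sample(rep_len(1:k, n()))) |>
  ungroup()

# list of validation-row indices, one element per fold
folds <- split(fold_tbl$.row_id, fold_tbl$fold)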
Our current strategy for handling the imbalanced dataset is to compute the inverse frequency of each class.
Example: if we have 90% “no” and 10% “yes”, the raw weights would be 1/0.90 for “no” and 1/0.10 for “yes”. When we pass these weights to the model, it is penalized roughly nine times more for misclassifying a rare “yes” case than for misclassifying a common “no” one. This forces the model to pay attention to the positive class. The renaming to “0” and “1” just matches the syntax that LiblineaR’s weights argument (wi) expects.
## Class weights for LiblineaR: names must be "0" and "1"
cw <- table(train_df$y)      # c(no = ?, yes = ?)
cw <- 1 / as.numeric(cw)     # inverse frequency
cw <- cw / mean(cw)
names(cw) <- c("no", "yes")
cw_lib <- c(`0` = unname(cw["no"]), `1` = unname(cw["yes"]))
This is the new core function for our CV loop, replacing chunk 26.
LiblineaR (type = 0): LiblineaR type = 0 is L2-regularized logistic regression, but it produces a linear decision boundary very close to a linear SVM and gives calibrated probabilities natively. That makes it ideal here: fast, robust on wide one-hot data, and business-friendly.
The function fits this model on a training fold (using the class weights from the previous chunk) and predicts the “1” class probability on the validation fold. Because LiblineaR is C++ based, this is typically much faster than the e1071 path.
## CV helper: linear LiblineaR with probabilities (type = 0)
library(LiblineaR)  # assumed here in case it wasn't loaded in a setup chunk

svm_linear_lib_cv <- function(idx_tr, idx_vl, cost, class_weights = cw_lib) {
  fit <- LiblineaR(
    data    = X_tr[idx_tr, , drop = FALSE],
    target  = y_tr_num[idx_tr],  # labels 0/1
    type    = 0,                 # <-- logistic regression (L2) supports proba
    cost    = cost,
    wi      = class_weights,     # names must be "0" and "1"
    bias    = TRUE,
    verbose = FALSE
  )
  pr <- predict(fit, X_tr[idx_vl, , drop = FALSE], proba = TRUE)
  P  <- pr$probabilities
  # pick the positive column safely
  pos_col <- if ("1" %in% colnames(P)) "1" else
    colnames(P)[which.max(colnames(P) %in% c("1", "yes", "TRUE"))]
  p_yes <- as.numeric(P[, pos_col])
  tibble(
    .pred_yes = p_yes,
    y = factor(ifelse(y_tr_num[idx_vl] == 1, "yes", "no"), levels = c("no", "yes"))
  )
}
Next we’ll do the hyperparameter tuning step. The cost parameter (C) in SVMs controls the trade-off between model complexity and classification error (the “regularization strength”).
A low cost (let’s say 0.5) creates a “simpler” model (stronger regularization).
A high cost (like 2.0) allows the model to be more “complex” to fit the data better (weaker regularization), at the risk of overfitting.
We run the entire 3-fold CV for each value in our grid {0.5, 1, 2}. We then average the metrics (PR-AUC, etc.) for each cost value and select the one (pick_lin) that gives the best performance (in this case we look for the highest PR-AUC).
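The tuning chunk itself is not reproduced above. Here is a minimal sketch of the loop under our assumptions (the folds list from the CV setup sketch, the svm_linear_lib_cv() helper, and compute_metrics()); the deprecation warning printed just below came from the original run, which still used the older across(..., mean, na.rm = TRUE) signature, whereas the sketch uses the updated anonymous-function form.
lin_costs <- c(0.5, 1, 2)

svm_lin_cv <- purrr::map_dfr(lin_costs, function(cc) {
  # metrics on each validation fold, then averaged for this cost value
  fold_metrics <- purrr::map_dfr(seq_along(folds), function(i) {
    idx_vl <- folds[[i]]
    idx_tr <- setdiff(seq_len(nrow(X_tr)), idx_vl)
    compute_metrics(svm_linear_lib_cv(idx_tr, idx_vl, cost = cc))
  })
  fold_metrics |>
    dplyr::summarise(dplyr::across(dplyr::everything(), \(x) mean(x, na.rm = TRUE))) |>
    mutate(kernel = "linear", cost = cc)
}) |>
  arrange(desc(pr_auc))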
Warning: There was 1 warning in `dplyr::summarise()`.
ℹ In argument: `dplyr::across(dplyr::everything(), mean, na.rm = TRUE)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.
# Previously
across(a:b, mean, na.rm = TRUE)
# Now
across(a:b, \(x) mean(x, na.rm = TRUE))
svm_lin_cv
# A tibble: 3 × 5
pr_auc roc_auc rec_at10 kernel cost
<dbl> <dbl> <dbl> <chr> <dbl>
1 0.441 0.788 0.434 linear 2
2 0.441 0.788 0.434 linear 1
3 0.440 0.788 0.433 linear 0.5
pick_lin <- dplyr::slice(svm_lin_cv, 1)
Finally, we reach the “go-live” step.
Refit: We take the best hyperparameter (pick_lin, in this case cost = 2) from the CV step and refit the LiblineaR model one last time on the entire training set. This gives the model the maximum possible data to learn from.
Test: We then take this final model and predict probabilities on the hold-out test set, the data that was never seen during training, preprocessing, or tuning.
Evaluate: We compute metrics (PR-AUC, ROC-AUC, Recall@10%) on these test set predictions. These final numbers will be the reportable results of our experiment. They represent our best estimate of how this model will perform on brand-new, unseen data.
## Fit best on ALL train, evaluate on TEST with probabilities
fit_lin <- LiblineaR(
  data    = X_tr,
  target  = y_tr_num,
  type    = 0,              # logistic regression
  cost    = pick_lin$cost,
  wi      = cw_lib,
  bias    = TRUE,
  verbose = FALSE
)

pr_te <- predict(fit_lin, X_te, proba = TRUE)
P_te  <- pr_te$probabilities
pos_col <- if ("1" %in% colnames(P_te)) "1" else
  colnames(P_te)[which.max(colnames(P_te) %in% c("1", "yes", "TRUE"))]
p_te <- as.numeric(P_te[, pos_col])

svm_lin_test_metrics <- compute_metrics(tibble(.pred_yes = p_te, y = y_te_fac))
svm_lin_test_metrics
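Since interpretability is one of the arguments for the linear model (see the recommendations below), it can be useful to inspect the fitted weights. This is a small optional sketch, assuming the fit_lin object above: fit_lin$W holds the weight matrix (with the bias as the last column when bias = TRUE), and the sign convention depends on LiblineaR’s internal class ordering (fit_lin$ClassNames), so signs should be read with that in mind.
# one row of weights in the binary case; columns are the one-hot features (+ "Bias")
w <- drop(fit_lin$W)

coef_tbl <- tibble(feature = names(w), weight = as.numeric(w)) |>
  filter(feature != "Bias") |>
  arrange(desc(abs(weight)))

head(coef_tbl, 10)   # the ten most influential (standardized) features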
Now that we have our metrics for the SVM, let’s add our findings to the log to get a good general view for comparison.
exp_log <- log_experiment(
  exp_log,
  id        = "SVM-LIN",
  model     = "Linear (LiblineaR type=0)",
  objective = "Fast, calibrated linear baseline with class weights.",
  variation = sprintf(
    "3-fold stratified CV; cost ∈ {%s}; train-locked impute+standardize; one-hot; class weights.",
    paste(lin_costs, collapse = ",")
  ),
  controls  = "DT2 features; drop duration; same split/holdout; no SMOTE.",
  metrics   = "PR-AUC (primary), Recall@Top10%, ROC-AUC",
  result    = paste0(fmt_res(svm_lin_test_metrics), ", best_cost=", pick_lin$cost),
  conclusion= "Matched/best PR-AUC and near-tied ROC-AUC; slightly below RF2 on Recall@10%. Very fast and stable.",
  recommend = "Pick LiblineaR if PR-AUC/speed/interpretability matter most; keep RF2 if top-decile recall is the priority."
)
exp_log
# A tibble: 7 × 9
id model objective variation controls metrics result conclusion recommend
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 DT-1 Deci… Improve … Grid ove… Same fe… PR-AUC… Test … Weighted … Adopt co…
2 DT-2 Deci… Test eng… Features… Same sp… PR-AUC… Test … Feature e… Keep eng…
3 RF-1 Rand… Establis… num.tree… Same fe… PR-AUC… Test … Underperf… Readjust…
4 RF-2 Rand… Improve … 3-fold C… Same en… Primar… CV: P… Tuned ran… Prefer R…
5 AB-1 AdaB… Fast boo… maxdepth… Same fe… PR-AUC… Test … Stumps un… Use as b…
6 AB-2 AdaB… Increase… maxdepth… Same fe… PR-AUC… Test … Depth=3 b… Keep AB2…
7 SVM-LIN Line… Fast, ca… 3-fold s… DT2 fea… PR-AUC… Test … Matched/b… Pick Lib…
On PR-AUC (our primary metric), the linear LiblineaR model is marginally best, about 0.002 ahead of RF-2, and clearly ahead of AB-2.
On ROC-AUC, LiblineaR and RF-2 are basically tied (around 0.804).
On Recall@Top10%, RF-2 still leads (around 0.453 vs 0.444), which could translate to a few more “yes” captures at a fixed outreach budget.
What recommendations can we derive from this then?
If the business priority is overall ranking of positives across thresholds (PR-AUC), it would be advisable to adopt the LiblineaR linear model. It’s fast, simple, and gives clear probabilities and coefficients for interpretation.
If the priority is maximizing captured positives in the top decile (operational outreach), RF-2 remains the slightly better pick.
Either way, these two models are strong finalists; the business can choose based on the operational metric that matters most.