A Portuguese bank conducted a marketing campaign (phone calls) to predict whether a client would subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics that will help the bank persuade more customers to subscribe to a term deposit in its next campaign.
This assignment builds on the findings from the previous ones and compares the existing results against a new algorithm: SVM. We aim to find out which algorithm is recommended for more accurate results.
Libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.0 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(489) ## let's define a fixed seed, also for reproducibility
The file uses semicolons instead of commas as the delimiter, so we have to account for that. Also, when exploring the CSV we found some “unknown” values, so we’ll turn those into NAs.
This time we’ll also load the log that we saved previously with the results from our DT, RF and AB models.
# Loading, and converting "unknown" and "" to NA; clean names to snake_case
library(janitor)   # clean_names(); assumed here in case it wasn't loaded in a setup chunk

uci_raw <- read_delim(url, delim = ";", show_col_types = FALSE,
                      na = c("", "unknown")) |>
  clean_names()
# Remove exact duplicate rows (we know there's very few, but it's a good practice)
uci_raw <- uci_raw |>
  distinct()
We’ll re-use a function that helped prepare the data in previous assignments. We’ll call it prep_features and apply it to the full dataset, since all the experiments will benefit from it and we’ll have a uniform starting point.
prep_features <- function(df) {
  df |>
    # Target to ordered factor (no, yes)
    mutate(y = factor(y, levels = c("no", "yes"))) |>
    # we'll drop the leakage feature, since duration is known only after the call
    select(-duration) |>
    # to handle "pdays" we'll create a "no prior contact" flag and a cleaned numeric one
    mutate(
      no_prior   = if_else(pdays == 999, "yes", "no"),
      pdays_real = if_else(pdays == 999, NA, pdays),
      # and the recent contact flag
      recent_contact = if_else(!is.na(pdays_real) & pdays_real < 7, "recent", "not_recent"),
      # also some contact intensity bands from campaign
      intensity = case_when(
        campaign <= 2 ~ "low",
        campaign <= 5 ~ "medium",
        TRUE          ~ "high"
      ),
      # we had defined age buckets to segment the data for easy handling
      age_bucket = cut(age, breaks = c(0, 30, 45, 60, Inf),
                       labels = c("<=30", "31-45", "46-60", "60+"), right = TRUE),
      # Creating a flag for "Prior outcome" (if it has any prior outcome)
      had_prior = if_else(poutcome == "nonexistent", "no", "yes")
    ) |>
    # finally, keep categorical columns as factors so trees can split on them
    mutate(
      across(where(is.character), ~ factor(.x)),
      across(c(no_prior, recent_contact, intensity, age_bucket, had_prior), ~ factor(.x))
    )
}
Then, we apply it to the dataset:
uci <- prep_features(uci_raw)
Additionally, we want to reduce collinearity among the macroeconomic variables by dropping one near-duplicate: we’ll keep euribor3m and drop nr_employed. This is optional for trees, but we do it to stabilize comparisons.
if ("nr_employed" %in% names(uci)) {
  uci <- uci |> select(-nr_employed)
}
Train and Test split, CV folds
We will do an 80/20 stratified split for the final holdout.
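The split chunk itself is not echoed in this section, so here is a minimal sketch (an assumption, not necessarily the original code) of one way to build the stratified 80/20 split that produces the train_df and test_df objects used below:
# Stratified 80/20 split sketch: sample 80% of the row ids within each class of y
train_idx <- uci |>
  mutate(.row_id = row_number()) |>
  group_by(y) |>
  slice_sample(prop = 0.8) |>
  pull(.row_id)

train_df <- uci[train_idx, ]
test_df  <- uci[-train_idx, ]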
To mitigate class imbalance, we’ll give a higher weight to the minority class “yes”, inversely proportional to its frequency. We’ll use a function for this called compute_class_weights.
compute_class_weights <- function(df, y_col = "y") {
  tab <- table(df[[y_col]])
  # Inverse frequency, then rescale so that the mean weight is 1
  w <- 1 / as.numeric(tab)
  names(w) <- names(tab)
  w <- w / mean(w)
  return(w)
}
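As a quick illustration (assuming the train_df created in the split sketch above), the helper can be called directly; the LiblineaR-specific weight vector built later in this notebook recomputes the same quantities inline.
# Hypothetical usage: inverse-frequency weights rescaled to mean 1; "yes" gets the larger weight
cw_example <- compute_class_weights(train_df)
cw_example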
Since we will be comparing the SVM results with those from previous experiments, we’ll re-use a log where we can add our new results for easy comparison and export.
# defining the log
# creating an empty log with the exact schema we want to register
init_exp_log <- function() {
  tibble::tibble(
    id         = character(),
    model      = character(),
    objective  = character(),
    variation  = character(),
    controls   = character(),
    metrics    = character(),
    result     = character(),
    conclusion = character(),
    recommend  = character()
  )
}
# defining function to populate the log with results
# To be safe, we add an appender that enforces the same column set/order every time
log_experiment <- function(log, id, model, objective, variation, controls,
                           metrics, result, conclusion, recommend) {
  new_row <- tibble::tibble(
    id         = as.character(id),
    model      = as.character(model),
    objective  = as.character(objective),
    variation  = as.character(variation),
    controls   = as.character(controls),
    metrics    = as.character(metrics),
    result     = as.character(result),
    conclusion = as.character(conclusion),
    recommend  = as.character(recommend)
  )
  # and if the log is missing or has a different column set, it re-starts cleanly
  if (is.null(log) || !identical(names(log), names(new_row))) {
    log <- init_exp_log()
  }
  dplyr::bind_rows(log, new_row)
}
fmt_res <- function(m) {
  # m is a 1-row tibble with pr_auc, roc_auc, rec_at10
  sprintf("Test PR-AUC=%.4f, ROC-AUC=%.4f, Recall@10%%=%.4f",
          m$pr_auc, m$roc_auc, m$rec_at10)
}
Reviewing previous findings
Let’s quickly bring up again the saved experiment log from Project 2 so we can reference prior DT/RF/AB results in this notebook.
exp_log_prev
# A tibble: 6 × 9
id model objective variation controls metrics result conclusion recommend
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 DT-1 Decisi… Improve … Grid ove… Same fe… PR-AUC… Test … Weighted … Adopt co…
2 DT-2 Decisi… Test eng… Features… Same sp… PR-AUC… Test … Feature e… Keep eng…
3 RF-1 Random… Establis… num.tree… Same fe… PR-AUC… Test … Underperf… Readjust…
4 RF-2 Random… Improve … 3-fold C… Same en… Primar… CV: P… Tuned ran… Prefer R…
5 AB-1 AdaBoo… Fast boo… maxdepth… Same fe… PR-AUC… Test … Stumps un… Use as b…
6 AB-2 AdaBoo… Increase… maxdepth… Same fe… PR-AUC… Test … Depth=3 b… Keep AB2…
We’ll copy the previous log into exp_log so any new entries (from SVM) append to the same schema.
# So we can continue logging:
exp_log <- exp_log_prev
SVM
Initially we attempted a robust, two-stage preprocessing pipeline to avoid data leakage. The idea was to learn all statistics from the training set only, then apply those exact statistics to the validation/test sets. The first implementation (using e1071::svm) worked conceptually but proved slower and a bit fussier in practice. We decided to keep a short “legacy” section to document that approach and why we replaced it with a faster, simpler LiblineaR workflow.
Small legacy section
To treat the train and validation splits consistently, we first tried a function that scans a data frame, saves the numeric median for each column, and records which columns are factors.
We also needed a function that imputes missing numerics with the saved medians and fills missing factor levels with “Unknown,” ensuring stable factor types across folds.
This was the “learning” part of our one-hot encoding. It used the model.matrix() function, which creates a numeric design matrix for linear models: it took the training data (after imputation) and converted all factor columns into binary “dummy” columns.
Perhaps more importantly, it didn’t just return the matrix; it also returned the exact set of column names produced. This matters because the test set might not have all the same factor levels, which would result in a different set of dummy columns.
# chunk 24
encode_train_mm <- function(df, y_col = "y") {
  mm <- model.matrix(as.formula(paste(y_col, "~ .")), df)[, -1, drop = FALSE]
  list(X = mm, ref_cols = colnames(mm))
}
This was the “applying” part of the one-hot encoding, and its job is to solve the problem mentioned above. It takes the validation/test data and the list of target column names from encode_train_mm().
It runs model.matrix() on the new data. Then it performs some checks:
Missing Columns: If the training data had a color_Green column but the test set had no “Green” examples, this function manually adds a color_Green column filled with zeros.
Extra Columns: (Rarer, but good to check) If the test set had a level not in the train set (e.g., color_Purple), this function would remove it.
This ensures the final validation/test matrix has the exact same columns in the exact same order as the training matrix, which is a hard requirement for the predict() function.
# chunk 25
encode_apply_mm <- function(df, ref_cols, y_col = "y") {
  mm  <- model.matrix(as.formula(paste(y_col, "~ .")), df)[, -1, drop = FALSE]
  add <- setdiff(ref_cols, colnames(mm))
  if (length(add)) {
    mm_add <- matrix(0, nrow = nrow(mm), ncol = length(add))
    colnames(mm_add) <- add
    mm <- cbind(mm, mm_add)
  }
  mm[, ref_cols, drop = FALSE]
}
Next we tried creating a pipeline helper function, fit_predict_svm(), that bundled the entire process for a single CV fold (a sketch of this legacy helper appears after the list). It takes training and validation data, then:
1) Imputes: Runs apply_impute_maps().
2) Encodes: Runs encode_apply_mm().
3) Fits: Trains an e1071::svm model on the processed training data. It’s flexible, allowing us to test both a linear kernel (fast, simple) and an RBF (radial basis function) kernel (more complex, can find non-linear patterns).
4) Predicts: Uses the fitted model to generate probabilities for the “yes” class on the processed validation data.
This ensures the exact same preprocessing is applied consistently inside each loop.
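The legacy helper itself is not reproduced in this notebook; the following rough sketch shows what fit_predict_svm() looked like conceptually. The name learn_impute_maps() and the exact signatures of the imputation helpers are assumptions here, not the original code.
library(e1071)  # legacy path; replaced by LiblineaR below

# Rough sketch of the legacy per-fold helper (apply_impute_maps() is the
# imputation helper described above; its exact signature is assumed)
fit_predict_svm <- function(train_df, valid_df, kernel = "linear",
                            cost = 1, class_weights = NULL) {
  # 1) impute with train-only statistics
  imp      <- learn_impute_maps(train_df)          # assumed "learning" helper
  train_df <- apply_impute_maps(train_df, imp)
  valid_df <- apply_impute_maps(valid_df, imp)

  # 2) one-hot encode, aligning validation columns to the training columns
  enc  <- encode_train_mm(train_df)
  X_tr <- enc$X
  X_vl <- encode_apply_mm(valid_df, enc$ref_cols)

  # 3) fit e1071::svm with probability estimates (linear or RBF kernel)
  fit <- svm(x = X_tr, y = train_df$y, kernel = kernel, cost = cost,
             class.weights = class_weights, probability = TRUE)

  # 4) predict "yes" probabilities on the validation fold
  pr    <- predict(fit, X_vl, probability = TRUE)
  p_yes <- attr(pr, "probabilities")[, "yes"]

  tibble(.pred_yes = as.numeric(p_yes), y = valid_df$y)
}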
Then we had a cross-validation manager that takes the full dataset, creates K folds, and loops K times. In each loop, it uses fit_predict_svm() to train on K-1 folds and predict on the one held-out fold.
After the loop finishes, it has a complete set of out-of-sample predictions for the entire dataset. It then uses these predictions to calculate the key business metrics we defined back in Project 2 (a sketch of the compute_metrics() helper follows the list below):
PR-AUC: Area Under the Precision-Recall Curve (best for imbalanced data).
ROC-AUC: Area Under the Receiver Operating Characteristic Curve (standard classification metric).
Recall@10%: the percentage of true “yes” cases captured if we contact only the top 10% highest-probability predictions. This is a very common and practical business-focused metric.
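compute_metrics() was defined back in Project 2 and is simply reused in this notebook. Since its code isn’t repeated here, this is a minimal sketch of an equivalent helper, assuming the yardstick package and a predictions tibble with columns .pred_yes and y (factor with levels no/yes):
library(yardstick)  # assumed; PRROC would be an alternative

compute_metrics <- function(preds) {
  # preds: tibble with .pred_yes (probability of "yes") and y (factor no/yes)
  n_top <- ceiling(0.10 * nrow(preds))
  top   <- preds |> arrange(desc(.pred_yes)) |> slice_head(n = n_top)

  tibble(
    pr_auc   = pr_auc_vec(preds$y, preds$.pred_yes, event_level = "second"),
    roc_auc  = roc_auc_vec(preds$y, preds$.pred_yes, event_level = "second"),
    # share of all true "yes" cases captured in the top 10% of scores
    rec_at10 = sum(top$y == "yes") / sum(preds$y == "yes")
  )
}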
This section replaces the previous pipeline with one optimized for LiblineaR, which is known for being extremely fast on sparse, high-dimensional linear data.
This next chunk condenses chunks 22-25 into a single, optimized “learning” step. It scans the training data to learn all preprocessing statistics at once (medians for imputation, means/SDs for standardization, and all unique factor levels).
It then defines an apply_encoder() function that uses these “locked” stats to process new data by imputing, standardizing (scaling numerics to mean=0, sd=1, which helps LiblineaR converge), and one-hot encoding.
Finally, it applies this encoder to create the final X_tr (training matrix) and X_te (test matrix), ensuring they are perfectly aligned. It also prepares the labels: y_tr_num (numeric 0/1 for the model) and y_te_fac (factors for the yardstick or PRROC metrics packages).
## Train-locked impute + scale + one-hot
num_cols <- names(train_df)[sapply(train_df, is.numeric)]
cat_cols <- names(train_df)[sapply(train_df, is.factor)]

num_median <- vapply(train_df[num_cols], \(x) median(x, na.rm = TRUE), numeric(1))
num_mean   <- vapply(train_df[num_cols], \(x) mean(x, na.rm = TRUE), numeric(1))
num_sd     <- vapply(train_df[num_cols], \(x) sd(x, na.rm = TRUE), numeric(1))
num_sd[num_sd == 0 | is.na(num_sd)] <- 1
cat_levels <- lapply(train_df[cat_cols], levels)

apply_encoder <- function(df, form = y ~ .) {
  # numeric: impute + standardize (with TRAIN stats)
  for (nm in num_cols) if (nm %in% names(df)) {
    v <- df[[nm]]; v[is.na(v)] <- num_median[[nm]]
    df[[nm]] <- (v - num_mean[[nm]]) / num_sd[[nm]]
  }
  # categoricals: align levels to TRAIN; unseen -> first level (keeps rows)
  for (nm in cat_cols) if (nm %in% names(df)) {
    df[[nm]] <- factor(df[[nm]], levels = cat_levels[[nm]])
    df[[nm]][is.na(df[[nm]])] <- levels(df[[nm]])[1]
  }
  mm <- model.matrix(form, df)
  mm[, -1, drop = FALSE]  # drop intercept
}

X_tr     <- apply_encoder(train_df)
X_te_tmp <- apply_encoder(test_df)

# force same columns/order on TEST
miss <- setdiff(colnames(X_tr), colnames(X_te_tmp))
if (length(miss)) {
  X_te_tmp <- cbind(X_te_tmp,
                    matrix(0, nrow(X_te_tmp), length(miss),
                           dimnames = list(NULL, miss)))
}
X_te <- X_te_tmp[, colnames(X_tr), drop = FALSE]

# labels: numeric {0,1} for LiblineaR; keep factors for metrics
y_tr_num <- as.integer(train_df$y == "yes")
y_te_fac <- factor(test_df$y, levels = c("no", "yes"))
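As an optional sanity check (a small sketch, not part of the original pipeline), we can confirm the two matrices ended up aligned:
# same columns, same order; a mismatch here would break predict() later
stopifnot(identical(colnames(X_tr), colnames(X_te)))
dim(X_tr)
dim(X_te)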
Next we set up our cross-validation scheme for model selection (a sketch of the fold construction follows the list below).
3-fold: We’ve chosen K=3. This means we’ll train on 2/3 of the data and validate on 1/3, three times.
Stratified: This is critical for imbalanced data. It ensures that each of our 3 folds has the same percentage of “yes” and “no” cases as the original dataset. This prevents a “bad” split where one fold might accidentally get almost no “yes” cases.
Stable .row_id: This is a robustness trick. It makes it easy to join predictions back to our original data, even after all the shuffling and subsetting.
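The fold-construction chunk is not echoed here either; a minimal sketch under the same assumptions (train_df from the split sketch above), producing a folds list of validation-row indices, could look like this:
k <- 3

# stable .row_id = position within train_df (and therefore within X_tr);
# stratified fold assignment so each fold keeps the original yes/no proportions
fold_tbl <- train_df |>
  mutate(.row_id = row_number()) |>
  group_by(y) |>
  mutate(fold = sample(rep_len(1:k, n()))) |>
  ungroup()

# list of validation-row indices, one element per fold
folds <- split(fold_tbl$.row_id, fold_tbl$fold)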
Our current strategy for handling the imbalanced dataset is to compute the inverse frequency of each class.
Example: if we have 90% “no” and 10% “yes”, the raw weights would be 1/0.90 for “no” and 1/0.10 for “yes”. When we pass these weights to the model, it is penalized roughly nine times more for misclassifying a rare “yes” case than for misclassifying a common “no” one. This forces the model to pay attention to the positive class. The renaming to “0” and “1” just matches the syntax that LiblineaR’s weights argument (wi) expects.
## Class weights for LiblineaR: names must be "0" and "1"
cw <- table(train_df$y)      # c(no = ?, yes = ?)
cw <- 1 / as.numeric(cw)     # inverse frequency
cw <- cw / mean(cw)
names(cw) <- c("no", "yes")
cw_lib <- c(`0` = unname(cw["no"]), `1` = unname(cw["yes"]))
This is the new core function for our CV loop, replacing chunk 26.
LiblineaR (type = 0): LiblineaR type = 0 is L2-regularized logistic regression, but it produces a linear decision boundary very close to a linear SVM and gives calibrated probabilities natively. That makes it ideal here: fast, robust on wide one-hot data, and business-friendly.
The function fits this model on a training fold (using the class weights from the previous chunk) and predicts the “1” class probability on the validation fold. Because LiblineaR is C++ based, this is typically much faster than the e1071 path.
## CV helper: linear LiblineaR with probabilities (type = 0)
library(LiblineaR)  # assumed here in case it wasn't loaded in a setup chunk

svm_linear_lib_cv <- function(idx_tr, idx_vl, cost, class_weights = cw_lib) {
  fit <- LiblineaR(
    data    = X_tr[idx_tr, , drop = FALSE],
    target  = y_tr_num[idx_tr],  # labels 0/1
    type    = 0,                 # <-- logistic regression (L2) supports proba
    cost    = cost,
    wi      = class_weights,     # names must be "0" and "1"
    bias    = TRUE,
    verbose = FALSE
  )
  pr <- predict(fit, X_tr[idx_vl, , drop = FALSE], proba = TRUE)
  P  <- pr$probabilities
  # pick the positive column safely
  pos_col <- if ("1" %in% colnames(P)) "1" else
    colnames(P)[which.max(colnames(P) %in% c("1", "yes", "TRUE"))]
  p_yes <- as.numeric(P[, pos_col])
  tibble(
    .pred_yes = p_yes,
    y = factor(ifelse(y_tr_num[idx_vl] == 1, "yes", "no"), levels = c("no", "yes"))
  )
}
Next we’ll do the hyperparameter tuning step. The cost parameter (C) in SVMs controls the trade-off between model complexity and classification error (the “regularization strength”).
A low cost (let’s say 0.5) creates a “simpler” model (stronger regularization).
A high cost (like 2.0) allows the model to be more “complex” to fit the data better (weaker regularization), at the risk of overfitting.
We run the entire 3-fold CV for each value in our grid {0.5, 1, 2}. We then average the metrics (PR-AUC, etc.) for each cost value and select the one (pick_lin) that gives the best performance (in this case we look for the highest PR-AUC).
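The tuning chunk itself is not reproduced above. Here is a minimal sketch of the loop under our assumptions (the folds list from the CV setup sketch, the svm_linear_lib_cv() helper, and compute_metrics()); the deprecation warning printed just below came from the original run, which still used the older across(..., mean, na.rm = TRUE) signature, whereas the sketch uses the updated anonymous-function form.
lin_costs <- c(0.5, 1, 2)

svm_lin_cv <- purrr::map_dfr(lin_costs, function(cc) {
  # metrics on each validation fold, then averaged for this cost value
  fold_metrics <- purrr::map_dfr(seq_along(folds), function(i) {
    idx_vl <- folds[[i]]
    idx_tr <- setdiff(seq_len(nrow(X_tr)), idx_vl)
    compute_metrics(svm_linear_lib_cv(idx_tr, idx_vl, cost = cc))
  })
  fold_metrics |>
    dplyr::summarise(dplyr::across(dplyr::everything(), \(x) mean(x, na.rm = TRUE))) |>
    mutate(kernel = "linear", cost = cc)
}) |>
  arrange(desc(pr_auc))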
Warning: There was 1 warning in `dplyr::summarise()`.
ℹ In argument: `dplyr::across(dplyr::everything(), mean, na.rm = TRUE)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.
# Previously
across(a:b, mean, na.rm = TRUE)
# Now
across(a:b, \(x) mean(x, na.rm = TRUE))
svm_lin_cv
# A tibble: 3 × 5
pr_auc roc_auc rec_at10 kernel cost
<dbl> <dbl> <dbl> <chr> <dbl>
1 0.441 0.788 0.434 linear 2
2 0.441 0.788 0.434 linear 1
3 0.440 0.788 0.433 linear 0.5
pick_lin <- dplyr::slice(svm_lin_cv, 1)
Finally, we reach the “go-live” step.
Refit: We take the best hyperparameter (pick_lin, in this case cost = 2) from the CV step and refit the LiblineaR model one last time on the entire training set. This gives the model the maximum possible data to learn from.
Test: We then take this final model and predict probabilities on the hold-out test set, the data that was never seen during training, preprocessing, or tuning.
Evaluate: We compute metrics (PR-AUC, ROC-AUC, Recall@10%) on these test set predictions. These final numbers will be the reportable results of our experiment. They represent our best estimate of how this model will perform on brand-new, unseen data.
## Fit best on ALL train, evaluate on TEST with probabilities
fit_lin <- LiblineaR(
  data    = X_tr,
  target  = y_tr_num,
  type    = 0,              # logistic regression
  cost    = pick_lin$cost,
  wi      = cw_lib,
  bias    = TRUE,
  verbose = FALSE
)

pr_te <- predict(fit_lin, X_te, proba = TRUE)
P_te  <- pr_te$probabilities
pos_col <- if ("1" %in% colnames(P_te)) "1" else
  colnames(P_te)[which.max(colnames(P_te) %in% c("1", "yes", "TRUE"))]
p_te <- as.numeric(P_te[, pos_col])

svm_lin_test_metrics <- compute_metrics(tibble(.pred_yes = p_te, y = y_te_fac))
svm_lin_test_metrics
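Since interpretability is one of the arguments for the linear model (see the recommendations below), it can be useful to inspect the fitted weights. This is a small optional sketch, assuming the fit_lin object above: fit_lin$W holds the weight matrix (with the bias as the last column when bias = TRUE), and the sign convention depends on LiblineaR’s internal class ordering (fit_lin$ClassNames), so signs should be read with that in mind.
# one row of weights in the binary case; columns are the one-hot features (+ "Bias")
w <- drop(fit_lin$W)

coef_tbl <- tibble(feature = names(w), weight = as.numeric(w)) |>
  filter(feature != "Bias") |>
  arrange(desc(abs(weight)))

head(coef_tbl, 10)   # the ten most influential (standardized) features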
Now that we have our metrics for the SVM, let’s add our findings to the log to get a good general view for comparison.
exp_log <- log_experiment(
  exp_log,
  id        = "SVM-LIN",
  model     = "Linear (LiblineaR type=0)",
  objective = "Fast, calibrated linear baseline with class weights.",
  variation = sprintf(
    "3-fold stratified CV; cost ∈ {%s}; train-locked impute+standardize; one-hot; class weights.",
    paste(lin_costs, collapse = ",")
  ),
  controls  = "DT2 features; drop duration; same split/holdout; no SMOTE.",
  metrics   = "PR-AUC (primary), Recall@Top10%, ROC-AUC",
  result    = paste0(fmt_res(svm_lin_test_metrics), ", best_cost=", pick_lin$cost),
  conclusion= "Matched/best PR-AUC and near-tied ROC-AUC; slightly below RF2 on Recall@10%. Very fast and stable.",
  recommend = "Pick LiblineaR if PR-AUC/speed/interpretability matter most; keep RF2 if top-decile recall is the priority."
)
exp_log
# A tibble: 7 × 9
id model objective variation controls metrics result conclusion recommend
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 DT-1 Deci… Improve … Grid ove… Same fe… PR-AUC… Test … Weighted … Adopt co…
2 DT-2 Deci… Test eng… Features… Same sp… PR-AUC… Test … Feature e… Keep eng…
3 RF-1 Rand… Establis… num.tree… Same fe… PR-AUC… Test … Underperf… Readjust…
4 RF-2 Rand… Improve … 3-fold C… Same en… Primar… CV: P… Tuned ran… Prefer R…
5 AB-1 AdaB… Fast boo… maxdepth… Same fe… PR-AUC… Test … Stumps un… Use as b…
6 AB-2 AdaB… Increase… maxdepth… Same fe… PR-AUC… Test … Depth=3 b… Keep AB2…
7 SVM-LIN Line… Fast, ca… 3-fold s… DT2 fea… PR-AUC… Test … Matched/b… Pick Lib…
On PR-AUC (our primary metric), the linear LiblineaR model is marginally best, about 0.002 ahead of RF-2, and clearly ahead of AB-2.
On ROC-AUC, LiblineaR and RF-2 are basically tied (around 0.804).
On Recall@Top10%, RF-2 still leads (around 0.453 vs 0.444), which could translate to a few more “yes” captures at a fixed outreach budget.
What recommendations can we derive from this then?
If the business priority is overall ranking of positives across thresholds (PR-AUC), it would be advisable to adopt the LiblineaR linear model. It’s fast, simple, and gives clear probabilities and coefficients for interpretation.
If the priority is maximizing captured positives in the top decile (operational outreach), RF-2 remains the slightly better pick.
Either way, these two models are strong finalists; the business can choose based on the operational metric that matters most.