Assignment 03 - SVM Analysis with Bank Data

Introduction

For this homework, I used the same bank marketing dataset as in Homework 2 and tested Support Vector Machines (SVMs) on the same classification problem. I kept the train/test split approach from the previous assignment so the comparison would stay fair. To keep the code fast and easy to rerun, I ran two small SVM experiments with a linear kernel and avoided large tuning grids and cross-validation loops.

Load Packages

library(tidyverse)
library(caret)
library(e1071)
library(pROC)

Load Data

df <- read_delim("bank-full.csv", delim = ";", 
                 escape_double = FALSE, trim_ws = TRUE) |>
  rename(target = y)

df$target <- factor(df$target, levels = c("no", "yes"))

Create Train/Test Split

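This is the same splitting idea I used in Homework 2. createDataPartition() samples within each class, so the split is stratified; the retry loop is an extra check that both classes appear in both the training and testing data.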

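make_split_with_both_classes <- function(dat, p = 0.8, strata = "target", max_tries = 100) {
  for (i in seq_len(max_tries)) {
    # Draw a stratified partition on the target column
    idx <- createDataPartition(dat[[strata]], p = p, list = FALSE)
    train <- dat[idx, , drop = FALSE]
    test  <- dat[-idx, , drop = FALSE]
    # Accept the split only if both classes are present on each side
    if (nlevels(droplevels(train[[strata]])) == 2 &&
        nlevels(droplevels(test[[strata]])) == 2) {
      return(list(train = train, test = test))
    }
  }
  # Fail loudly instead of silently returning NULL
  stop("No split with both classes in train and test after ", max_tries, " tries.")
}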

splits <- make_split_with_both_classes(df, p = 0.8, strata = "target")
train <- splits$train
test  <- splits$test
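
The split draws random rows, so the exact metrics reported later can shift from run to run unless a seed is fixed first. The sketch below shows the idea in base R on toy labels; the seed value 123, the helper `stratified_idx`, and the toy vector `y` are assumptions for illustration and are not part of the original analysis.

```r
# Toy labels standing in for df$target (hypothetical data, not the bank set)
y <- factor(rep(c("no", "yes"), times = c(90, 10)), levels = c("no", "yes"))

# Stratified split: sample a fraction p of the row indices within each class
stratified_idx <- function(y, p = 0.8) {
  unname(unlist(lapply(split(seq_along(y), y), function(rows) {
    rows[sample.int(length(rows), size = round(p * length(rows)))]
  })))
}

set.seed(123)  # assumed seed; any fixed integer works
idx_a <- stratified_idx(y)
set.seed(123)
idx_b <- stratified_idx(y)

identical(idx_a, idx_b)  # TRUE: the same seed reproduces the same split
table(y[idx_a])          # 72 "no" and 8 "yes" rows in the training part
```

Calling set.seed() once before make_split_with_both_classes() and before the subset sampling would have the same effect on the real pipeline.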

Prepare Data for SVM

SVM requires numeric inputs, so I converted the categorical predictors to dummy variables and then centered and scaled the predictors. To keep runtime low, I trained the SVM models on a stratified subset of the training data and still evaluated on the full test set.

# Dummy encoding based on the full training set
dmy <- dummyVars(target ~ ., data = train, fullRank = TRUE)
x_train_full <- predict(dmy, newdata = train) |>
  as.data.frame()

x_test <- predict(dmy, newdata = test) |>
  as.data.frame()

y_train_full <- train$target
y_test <- test$target

# Fast stratified subset for SVM training
svm_n <- min(5000, nrow(train))
pos_idx <- which(y_train_full == "yes")
neg_idx <- which(y_train_full == "no")

n_pos <- round(svm_n * mean(y_train_full == "yes"))
n_neg <- svm_n - n_pos

svm_rows <- c(
  sample(pos_idx, n_pos),
  sample(neg_idx, n_neg)
)
svm_rows <- sample(svm_rows)

x_train <- x_train_full[svm_rows, , drop = FALSE]
y_train <- y_train_full[svm_rows]

# Center and scale using only the SVM training subset
pp <- preProcess(x_train, method = c("center", "scale"))
x_train <- predict(pp, x_train)
x_test  <- predict(pp, x_test)
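
The key point of the last step is that the centering and scaling constants come from the training rows only and are then reused unchanged on the test rows. A minimal base-R sketch of that idea, using made-up numbers and a hypothetical helper `scale_with` (preProcess() does the equivalent bookkeeping in the real code):

```r
# Hypothetical toy predictors, not taken from the bank data
train_x <- data.frame(age = c(20, 30, 40), balance = c(100, 200, 600))
test_x  <- data.frame(age = c(30, 50),     balance = c(300, 0))

# Learn the constants from the training rows only
mu  <- colMeans(train_x)
sdv <- apply(train_x, 2, sd)

# Apply the same constants to any data set
scale_with <- function(x, mu, sdv) {
  as.data.frame(scale(x, center = mu, scale = sdv))
}

train_s <- scale_with(train_x, mu, sdv)
test_s  <- scale_with(test_x, mu, sdv)

colMeans(train_s)  # zero (up to rounding) for every column
test_s$age         # 0 and 2: test rows are expressed in training-set units
```

The test columns do not end up with mean 0 and standard deviation 1, and that is expected: the test set is treated as unseen data.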

Evaluation Function

metrics <- function(y_true, y_pred_class, score = NULL, positive = "yes") {
  y_true <- factor(y_true, levels = c("no", "yes"))
  y_pred_class <- factor(y_pred_class, levels = levels(y_true))

  tp <- sum(y_true == positive & y_pred_class == positive)
  fp <- sum(y_true != positive & y_pred_class == positive)
  fn <- sum(y_true == positive & y_pred_class != positive)
  tn <- sum(y_true != positive & y_pred_class != positive)

  accuracy <- (tp + tn) / (tp + tn + fp + fn)
  precision <- ifelse(tp + fp == 0, NA, tp / (tp + fp))
  recall <- ifelse(tp + fn == 0, NA, tp / (tp + fn))
  f1 <- ifelse(is.na(precision) | is.na(recall) | (precision + recall == 0),
               NA, 2 * precision * recall / (precision + recall))

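  auc <- NA_real_
  if (!is.null(score)) {
    # roc() picks the comparison direction automatically by default
    auc_try <- as.numeric(roc(y_true, score, levels = c("no", "yes"), quiet = TRUE)$auc)
    # Safeguard: the sign of SVM decision values follows the model's internal
    # class order, so flip the score if it ranks the classes backwards
    if (auc_try < 0.5) {
      auc_try <- as.numeric(roc(y_true, -score, levels = c("no", "yes"), quiet = TRUE)$auc)
    }
    auc <- auc_try
  }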

  data.frame(accuracy, precision, recall, f1, auc, check.names = FALSE)
}

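score_svm <- function(model, x_new) {
  pred <- predict(model, x_new, decision.values = TRUE)
  # For a two-class model the decision values are a one-column matrix;
  # their sign follows the model's internal class order, not necessarily "yes"
  score <- as.numeric(attr(pred, "decision.values"))
  list(class = pred, score = score)
}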
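
To make the arithmetic inside metrics() concrete, here is a small worked example on eight made-up predictions. The vectors are hypothetical; the formulas are the same ones used in the function above.

```r
# Hypothetical labels and predictions (3 actual "yes", 5 actual "no")
y_true <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "no"),
                 levels = c("no", "yes"))
y_pred <- factor(c("yes", "no", "no", "yes", "no", "no", "no", "no"),
                 levels = c("no", "yes"))

tp <- sum(y_true == "yes" & y_pred == "yes")  # 1
fp <- sum(y_true == "no"  & y_pred == "yes")  # 1
fn <- sum(y_true == "yes" & y_pred == "no")   # 2
tn <- sum(y_true == "no"  & y_pred == "no")   # 4

accuracy  <- (tp + tn) / (tp + tn + fp + fn)                 # 5/8 = 0.625
precision <- tp / (tp + fp)                                  # 1/2 = 0.5
recall    <- tp / (tp + fn)                                  # 1/3
f1        <- 2 * precision * recall / (precision + recall)   # 0.4

c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
```

Accuracy looks acceptable here even though only one of the three positive cases was found, which is why the later tables report recall and F1 next to accuracy.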

SVM Experiment 1

Linear kernel with cost = 0.5 and no class weighting.

svm_1 <- svm(
  x = x_train,
  y = y_train,
  kernel = "linear",
  cost = 0.5,
  scale = FALSE,
  decision.values = TRUE
)

pred_1 <- score_svm(svm_1, x_test)
met_1 <- metrics(y_test, pred_1$class, pred_1$score)
met_1$model <- "SVM-A (linear, unweighted)"
met_1[, c("model", "accuracy", "precision", "recall", "f1", "auc")]

##                        model  accuracy precision    recall        f1       auc
## 1 SVM-A (linear, unweighted) 0.8980201 0.6753247 0.2459792 0.3606103 0.9001936
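
The combination of a high AUC (0.900) and a low recall (0.246) is not a contradiction. AUC measures how well the decision values rank "yes" above "no" across all cutoffs, while recall depends on the single default cutoff at 0. The toy sketch below uses hypothetical decision values and assumes that positive values point toward "yes"; it is not output from svm_1.

```r
# Hypothetical labels and decision values (positive is assumed to mean "yes")
y     <- factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes"),
                levels = c("no", "yes"))
score <- c(-2.0, -1.5, -1.0, -0.2, -0.6, -0.3, 0.4, 1.2)

# Recall at a given cutoff: share of actual "yes" cases scored above it
recall_at <- function(cut) mean(score[y == "yes"] > cut)

recall_at(0)     # 0.5: the default cutoff misses half of the "yes" cases
recall_at(-0.7)  # 1:   a lower cutoff catches all of them

# AUC as the share of (yes, no) pairs where the "yes" case scores higher
auc <- mean(outer(score[y == "yes"], score[y == "no"], ">"))
auc              # 0.875, and it does not depend on the cutoff
```

Moving the cutoff trades precision for recall without changing the ranking, so recall can change a lot while AUC stays the same.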

SVM Experiment 2

Objective: Improve detection of the minority class.

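What changed: I kept the linear kernel but increased cost from 0.5 to 1 and added class weights inversely proportional to the class frequencies. Because both settings changed at once, the difference from Experiment 1 reflects their combined effect.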

What stayed the same: Same split, same preprocessing, same SVM training subset, same test set.

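# Inverse-frequency weights: the majority class gets 1, the minority class gets more
class_counts <- table(y_train)
class_wts <- setNames(as.numeric(max(class_counts) / class_counts), names(class_counts))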

svm_2 <- svm(
  x = x_train,
  y = y_train,
  kernel = "linear",
  cost = 1,
  class.weights = class_wts,
  scale = FALSE,
  decision.values = TRUE
)

pred_2 <- score_svm(svm_2, x_test)
met_2 <- metrics(y_test, pred_2$class, pred_2$score)
met_2$model <- "SVM-B (linear, weighted)"
met_2[, c("model", "accuracy", "precision", "recall", "f1", "auc")]

##                      model accuracy precision    recall        f1       auc
## 1 SVM-B (linear, weighted) 0.835527 0.3984891 0.7984863 0.5316535 0.9012345
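
For intuition about the class weights, the same weighting rule can be worked through on round numbers. The counts below (4400 "no" and 600 "yes") are hypothetical and only illustrate the calculation; the real counts come from table(y_train).

```r
# Hypothetical class counts for a 5000-row training subset
y_toy  <- factor(rep(c("no", "yes"), times = c(4400, 600)),
                 levels = c("no", "yes"))
counts <- table(y_toy)

# Weight = size of the largest class divided by the size of each class
wts <- setNames(as.numeric(max(counts) / counts), names(counts))
wts  # no = 1, yes = 7.33 (4400 / 600)
```

With these weights, misclassifying one "yes" case costs the model about as much as misclassifying seven "no" cases, which pushes the boundary toward predicting "yes" more often.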

SVM Results Table

svm_results <- bind_rows(
  met_1[, c("model", "accuracy", "precision", "recall", "f1", "auc")],
  met_2[, c("model", "accuracy", "precision", "recall", "f1", "auc")]
)

knitr::kable(svm_results, digits = 3)

model                       accuracy  precision  recall  f1     auc
SVM-A (linear, unweighted)  0.898     0.675      0.246   0.361  0.900
SVM-B (linear, weighted)    0.836     0.398      0.798   0.532  0.901

Homework 2 Comparison Table

The table below brings forward the Homework 2 results so I can compare them directly with the SVM models.

hw2_results <- tibble(
  model = c(
    "DT Tuned (cp)",
    "DT Tuned (maxdepth)",
    "RF-A",
    "RF-B",
    "ADA-A",
    "ADA-B"
  ),
  accuracy = c(0.9074217, 0.9013383, 0.9077740, 0.9075528, 0.9014490, 0.9034399),
  precision = c(0.6540616, 0.6565465, 0.6637427, 0.6632353, 0.6155989, 0.6167513),
  recall = c(0.4418165, 0.3273415, 0.4291115, 0.4262760, 0.4181646, 0.4597919),
  f1 = c(0.5273857, 0.4368687, NA, NA, 0.4980282, 0.5268293),
  auc = c(0.8639102, 0.7490713, 0.9331561, 0.9330660, 0.9185975, 0.9235401)
)

comparison <- bind_rows(hw2_results, svm_results)
knitr::kable(comparison, digits = 3)

model                       accuracy  precision  recall  f1     auc
DT Tuned (cp)               0.907     0.654      0.442   0.527  0.864
DT Tuned (maxdepth)         0.901     0.657      0.327   0.437  0.749
RF-A                        0.908     0.664      0.429   NA     0.933
RF-B                        0.908     0.663      0.426   NA     0.933
ADA-A                       0.901     0.616      0.418   0.498  0.919
ADA-B                       0.903     0.617      0.460   0.527  0.924
SVM-A (linear, unweighted)  0.898     0.675      0.246   0.361  0.900
SVM-B (linear, weighted)    0.836     0.398      0.798   0.532  0.901
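
The F1 values for RF-A and RF-B are NA in the table, but F1 is fully determined by precision and recall, so they can be recovered from the two columns that were carried over. The sketch below does that with the reported values; the resulting F1 numbers are derived here and were not part of the Homework 2 output.

```r
# Precision and recall for the two Random Forest rows, as reported above
rf <- data.frame(
  model     = c("RF-A", "RF-B"),
  precision = c(0.6637427, 0.6632353),
  recall    = c(0.4291115, 0.4262760),
  stringsAsFactors = FALSE
)

# F1 is the harmonic mean of precision and recall
rf$f1 <- 2 * rf$precision * rf$recall / (rf$precision + rf$recall)

round(rf$f1, 3)  # 0.521 for RF-A and 0.519 for RF-B
```

On these derived values, both Random Forest models sit slightly below the weighted SVM (0.532) and the cp-tuned decision tree (0.527) on F1.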

Discussion

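Based on the comparison table, the Random Forest models from Homework 2 remain the strongest overall on accuracy and AUC (about 0.908 and 0.933, versus 0.898 and 0.900 for the unweighted SVM). Neither SVM beat the best Random Forest on those two metrics. The two SVMs behave very differently at the default cutoff. The unweighted model is conservative: it has the highest precision in the table (0.675) but the lowest recall (0.246). The weighted model raises recall to 0.798, the highest in the table, at the cost of lower precision (0.398) and lower accuracy (0.836). Their AUC values are nearly identical (0.900 and 0.901), which suggests the weighting mainly shifts the decision boundary rather than improving how well the model ranks customers.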
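
The per-metric comparison can be checked directly from the reported numbers. The sketch below rebuilds the comparison table by hand (F1 is left out because two of its values are NA) and looks up the best model for each metric; the data frame name `cmp` and the shortened SVM labels are my own.

```r
# Reported metrics, copied from the tables above
cmp <- data.frame(
  model     = c("DT Tuned (cp)", "DT Tuned (maxdepth)", "RF-A", "RF-B",
                "ADA-A", "ADA-B", "SVM-A", "SVM-B"),
  accuracy  = c(0.9074217, 0.9013383, 0.9077740, 0.9075528,
                0.9014490, 0.9034399, 0.8980201, 0.835527),
  precision = c(0.6540616, 0.6565465, 0.6637427, 0.6632353,
                0.6155989, 0.6167513, 0.6753247, 0.3984891),
  recall    = c(0.4418165, 0.3273415, 0.4291115, 0.4262760,
                0.4181646, 0.4597919, 0.2459792, 0.7984863),
  auc       = c(0.8639102, 0.7490713, 0.9331561, 0.9330660,
                0.9185975, 0.9235401, 0.9001936, 0.9012345),
  stringsAsFactors = FALSE
)

# For each metric column, return the name of the model with the highest value
best <- sapply(cmp[-1], function(m) cmp$model[which.max(m)])
best
# accuracy: RF-A, precision: SVM-A, recall: SVM-B, auc: RF-A
```

The margins matter as much as the winners: the accuracy gap between RF-A and the cp-tuned tree is tiny, while the recall gap between SVM-B and every other model is large.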

Questions

Which algorithm is recommended to get more accurate results?

For this dataset, I would recommend Random Forest if the main goal is overall predictive performance. In Homework 2, Random Forest had the highest AUC (0.933) and the highest accuracy (0.908) of all the models compared here, so it looks like the safest overall choice.

Is it better for classification or regression scenarios?

In this homework, SVM is better viewed as a classification method because the target variable is binary: yes or no. SVM can also be used for regression, but that would not be the right setup for this problem.
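
For contrast, this is roughly what the regression setup looks like with the same e1071 package. It is a sketch on simulated data with a numeric target, which the bank problem does not have; the seed, the sine curve, and the radial kernel are all assumptions chosen only for illustration.

```r
library(e1071)

# Simulated data with a numeric response (not the bank data)
set.seed(1)
x <- matrix(seq(0, 2 * pi, length.out = 100), ncol = 1)
y <- sin(x[, 1]) + rnorm(100, sd = 0.1)

# Requesting eps-regression explicitly makes svm() fit a regression model
fit  <- svm(x = x, y = y, type = "eps-regression", kernel = "radial")
pred <- predict(fit, x)

# Predictions are numeric values rather than class labels
rmse <- sqrt(mean((pred - y)^2))
rmse
```

The target here is a continuous number, so the model is judged with an error measure such as RMSE instead of accuracy, recall, or AUC.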

Do you agree with the recommendations? Why?

Yes, I agree with recommending Random Forest for this dataset if the goal is the most accurate overall model. I would only prefer the weighted SVM if the main business goal was to catch more of the positive “yes” cases, even if that created more false positives. So for pure overall performance I would stay with Random Forest, but for minority class sensitivity the weighted SVM could still be useful.

Conclusion

Overall, the SVM analysis added a helpful comparison to the models from Homework 2. The SVM models were fast to run and gave reasonable results, but the Random Forest model from Homework 2 still looks like the best all-around option for this bank marketing classification problem.