For this homework, I used the same bank marketing dataset from Homework 2 and tested Support Vector Machines (SVMs) on the same classification problem. I kept the train/test split approach from the previous assignment so the comparison would stay fair. To keep the code fast and easy to reproduce, I ran two small SVM experiments with a linear kernel and avoided large tuning grids and cross-validation loops.
library(tidyverse)
library(caret)
library(e1071)
library(pROC)
df <- read_delim("bank-full.csv", delim = ";",
escape_double = FALSE, trim_ws = TRUE) |>
rename(target = y)
df$target <- factor(df$target, levels = c("no", "yes"))
This is the same splitting idea I used in Homework 2. It makes sure both classes appear in both the training and testing data.
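make_split_with_both_classes <- function(dat, p = 0.8, strata = "target", max_tries = 100) {
  for (i in seq_len(max_tries)) {
    # Stratified partition on the target column
    idx <- createDataPartition(dat[[strata]], p = p, list = FALSE)
    train <- dat[idx, , drop = FALSE]
    test <- dat[-idx, , drop = FALSE]
    # Accept the split only if both classes are present on each side
    if (nlevels(droplevels(train[[strata]])) == 2 &&
        nlevels(droplevels(test[[strata]])) == 2) {
      return(list(train = train, test = test))
    }
  }
  # Fail loudly instead of silently returning NULL
  stop("No split with both classes found in ", max_tries, " tries.")
}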
splits <- make_split_with_both_classes(df, p = 0.8, strata = "target")
train <- splits$train
test <- splits$test
SVM requires numeric inputs, so I converted the categorical predictors to dummy variables and then centered and scaled the predictors. To keep runtime low, I trained the SVM models on a stratified subset of the training data and still evaluated on the full test set.
# Dummy encoding based on the full training set
dmy <- dummyVars(target ~ ., data = train, fullRank = TRUE)
x_train_full <- predict(dmy, newdata = train) |>
as.data.frame()
x_test <- predict(dmy, newdata = test) |>
as.data.frame()
y_train_full <- train$target
y_test <- test$target
# Fast stratified subset for SVM training
svm_n <- min(5000, nrow(train))
pos_idx <- which(y_train_full == "yes")
neg_idx <- which(y_train_full == "no")
n_pos <- round(svm_n * mean(y_train_full == "yes"))
n_neg <- svm_n - n_pos
svm_rows <- c(
sample(pos_idx, n_pos),
sample(neg_idx, n_neg)
)
svm_rows <- sample(svm_rows)
x_train <- x_train_full[svm_rows, , drop = FALSE]
y_train <- y_train_full[svm_rows]
# Center and scale using only the SVM training subset
pp <- preProcess(x_train, method = c("center", "scale"))
x_train <- predict(pp, x_train)
x_test <- predict(pp, x_test)
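As a small illustration of the encoding and scaling steps above, the sketch below uses only base R on a made-up data frame. The column names and values are hypothetical, and `model.matrix()` and `scale()` stand in here for caret's `dummyVars()` and `preProcess()`. What it shows is that the centering and scaling statistics are learned from the training rows only and then reused on the test rows.

```r
# Toy data; columns and values are hypothetical, not from bank-full.csv
toy_train <- data.frame(
  age = c(25, 40, 55, 30),
  job = factor(c("admin", "tech", "admin", "retired"),
               levels = c("admin", "retired", "tech"))
)
toy_test <- data.frame(
  age = c(35, 60),
  job = factor(c("tech", "admin"), levels = levels(toy_train$job))
)

# Full-rank dummy encoding: drop the intercept column, keep k - 1 dummies
x_tr <- model.matrix(~ ., data = toy_train)[, -1, drop = FALSE]
x_te <- model.matrix(~ ., data = toy_test)[, -1, drop = FALSE]

# Learn the center and scale from the training rows only
mu <- colMeans(x_tr)
sdv <- apply(x_tr, 2, sd)

# Apply the same statistics to both sets
x_tr_s <- scale(x_tr, center = mu, scale = sdv)
x_te_s <- scale(x_te, center = mu, scale = sdv)

round(colMeans(x_tr_s), 10)  # training columns are centered at 0
```

The test columns will generally not have mean 0 or standard deviation 1 after this step, which is expected: they are scaled with the training statistics, not their own.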
metrics <- function(y_true, y_pred_class, score = NULL, positive = "yes") {
  y_true <- factor(y_true, levels = c("no", "yes"))
  y_pred_class <- factor(y_pred_class, levels = levels(y_true))
  # Confusion-matrix counts, with `positive` as the class of interest
  tp <- sum(y_true == positive & y_pred_class == positive)
  fp <- sum(y_true != positive & y_pred_class == positive)
  fn <- sum(y_true == positive & y_pred_class != positive)
  tn <- sum(y_true != positive & y_pred_class != positive)
  accuracy <- (tp + tn) / (tp + tn + fp + fn)
  # Return NA instead of dividing by zero
  precision <- ifelse(tp + fp == 0, NA, tp / (tp + fp))
  recall <- ifelse(tp + fn == 0, NA, tp / (tp + fn))
  f1 <- ifelse(is.na(precision) | is.na(recall) | (precision + recall == 0),
               NA, 2 * precision * recall / (precision + recall))
  # AUC is computed only when a continuous score is supplied
  auc <- NA_real_
  if (!is.null(score)) {
    auc_try <- as.numeric(roc(y_true, score, levels = c("no", "yes"), quiet = TRUE)$auc)
    # The decision values may point toward "no", so retry with the score negated
    if (auc_try < 0.5) {
      auc_try <- as.numeric(roc(y_true, -score, levels = c("no", "yes"), quiet = TRUE)$auc)
    }
    auc <- auc_try
  }
  data.frame(accuracy, precision, recall, f1, auc, check.names = FALSE)
}
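To make the formulas inside `metrics()` concrete, here is a hand-checkable example with made-up confusion-matrix counts. The counts are hypothetical and are not results from this homework.

```r
# Hypothetical counts, chosen only so the arithmetic is easy to verify
tp <- 30; fp <- 10; fn <- 20; tn <- 140

accuracy  <- (tp + tn) / (tp + tn + fp + fn)   # 170 / 200
precision <- tp / (tp + fp)                    # 30 / 40
recall    <- tp / (tp + fn)                    # 30 / 50
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 3)
# accuracy 0.850, precision 0.750, recall 0.600, f1 0.667
```

Accuracy looks strong here even though 40% of the positive cases are missed, which is the same pattern that shows up with an imbalanced target.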
score_svm <- function(model, x_new) {
pred <- predict(model, x_new, decision.values = TRUE)
score <- as.numeric(attr(pred, "decision.values"))
list(class = pred, score = score)
}
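The retry with `-score` inside `metrics()` is there because the sign of the SVM decision values can depend on which class the fitted model treats as its first label. The base R sketch below, on made-up labels and scores, shows the property that retry relies on: negating a score turns an AUC of a into 1 - a. The helper `auc_rank()` is a hypothetical stand-in written for this illustration (a rank-based AUC), not the pROC implementation.

```r
# Rank-based AUC (Mann-Whitney form); hypothetical helper for illustration
auc_rank <- function(y_true, score, positive = "yes") {
  is_pos <- y_true == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(score)  # average ranks handle ties
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up labels and scores where low scores go with "yes"
y <- factor(c("no", "no", "no", "yes", "yes"), levels = c("no", "yes"))
s <- c(0.9, 0.8, 0.3, 0.4, 0.1)

auc_rank(y, s)   # below 0.5: the score points the wrong way
auc_rank(y, -s)  # equals 1 minus the value above
```

A stricter alternative would be to fix the orientation from the fitted model's label order rather than from the test-set AUC, but that is not what the code in this homework does.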
Experiment 1 (SVM-A): linear kernel with cost = 0.5 and no class weighting. This is the baseline for the second experiment.
svm_1 <- svm(
x = x_train,
y = y_train,
kernel = "linear",
cost = 0.5,
scale = FALSE,
decision.values = TRUE
)
pred_1 <- score_svm(svm_1, x_test)
met_1 <- metrics(y_test, pred_1$class, pred_1$score)
met_1$model <- "SVM-A (linear, unweighted)"
met_1[, c("model", "accuracy", "precision", "recall", "f1", "auc")]
## model accuracy precision recall f1 auc
## 1 SVM-A (linear, unweighted) 0.8980201 0.6753247 0.2459792 0.3606103 0.9001936
Objective: Improve detection of the minority ("yes") class.
What changed: I kept the linear kernel but increased cost from 0.5 to 1 and added class weights inversely proportional to class frequency (the majority class count divided by each class count).
What stayed the same: The split, preprocessing, SVM training subset, and test set are all unchanged.
class_wts <- table(y_train)
class_wts <- max(class_wts) / class_wts
svm_2 <- svm(
x = x_train,
y = y_train,
kernel = "linear",
cost = 1,
class.weights = class_wts,
scale = FALSE,
decision.values = TRUE
)
pred_2 <- score_svm(svm_2, x_test)
met_2 <- metrics(y_test, pred_2$class, pred_2$score)
met_2$model <- "SVM-B (linear, weighted)"
met_2[, c("model", "accuracy", "precision", "recall", "f1", "auc")]
## model accuracy precision recall f1 auc
## 1 SVM-B (linear, weighted) 0.835527 0.3984891 0.7984863 0.5316535 0.9012345
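The weighting rule used for SVM-B divides the majority class count by each class count, so the majority class gets weight 1 and the minority class gets a weight equal to the imbalance ratio. A toy example with made-up labels (not the homework data) shows the arithmetic:

```r
# Made-up labels: 8 "no" and 2 "yes"
y_toy <- factor(c(rep("no", 8), rep("yes", 2)), levels = c("no", "yes"))

counts <- table(y_toy)
wts <- max(counts) / counts   # same rule as class_wts above

# Convert to a plain named numeric vector for readability
wts <- setNames(as.numeric(wts), names(counts))
wts
# no = 1, yes = 4
```

With these weights, the misclassification cost for a "yes" case is four times the cost for a "no" case, which pushes the decision boundary toward predicting "yes" more often.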
svm_results <- bind_rows(
met_1[, c("model", "accuracy", "precision", "recall", "f1", "auc")],
met_2[, c("model", "accuracy", "precision", "recall", "f1", "auc")]
)
knitr::kable(svm_results, digits = 3)
| model | accuracy | precision | recall | f1 | auc |
|---|---|---|---|---|---|
| SVM-A (linear, unweighted) | 0.898 | 0.675 | 0.246 | 0.361 | 0.900 |
| SVM-B (linear, weighted) | 0.836 | 0.398 | 0.798 | 0.532 | 0.901 |
The table below brings forward the Homework 2 results so I can compare them directly with the SVM models.
hw2_results <- tibble(
model = c(
"DT Tuned (cp)",
"DT Tuned (maxdepth)",
"RF-A",
"RF-B",
"ADA-A",
"ADA-B"
),
accuracy = c(0.9074217, 0.9013383, 0.9077740, 0.9075528, 0.9014490, 0.9034399),
precision = c(0.6540616, 0.6565465, 0.6637427, 0.6632353, 0.6155989, 0.6167513),
recall = c(0.4418165, 0.3273415, 0.4291115, 0.4262760, 0.4181646, 0.4597919),
f1 = c(0.5273857, 0.4368687, NA, NA, 0.4980282, 0.5268293),
auc = c(0.8639102, 0.7490713, 0.9331561, 0.9330660, 0.9185975, 0.9235401)
)
comparison <- bind_rows(hw2_results, svm_results)
knitr::kable(comparison, digits = 3)
| model | accuracy | precision | recall | f1 | auc |
|---|---|---|---|---|---|
| DT Tuned (cp) | 0.907 | 0.654 | 0.442 | 0.527 | 0.864 |
| DT Tuned (maxdepth) | 0.901 | 0.657 | 0.327 | 0.437 | 0.749 |
| RF-A | 0.908 | 0.664 | 0.429 | NA | 0.933 |
| RF-B | 0.908 | 0.663 | 0.426 | NA | 0.933 |
| ADA-A | 0.901 | 0.616 | 0.418 | 0.498 | 0.919 |
| ADA-B | 0.903 | 0.617 | 0.460 | 0.527 | 0.924 |
| SVM-A (linear, unweighted) | 0.898 | 0.675 | 0.246 | 0.361 | 0.900 |
| SVM-B (linear, weighted) | 0.836 | 0.398 | 0.798 | 0.532 | 0.901 |
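The two NA values in the f1 column for RF-A and RF-B can be filled in from the precision and recall already in the table, since F1 is the harmonic mean of the two. The sketch below does this in base R with the values carried over from Homework 2; the data frame name `rf_rows` is hypothetical.

```r
# Precision and recall for the Random Forest rows, copied from hw2_results
rf_rows <- data.frame(
  model     = c("RF-A", "RF-B"),
  precision = c(0.6637427, 0.6632353),
  recall    = c(0.4291115, 0.4262760)
)

# F1 is the harmonic mean of precision and recall
rf_rows$f1 <- with(rf_rows, 2 * precision * recall / (precision + recall))

round(rf_rows$f1, 3)
# 0.521 0.519
```

With these values filled in, the top of the F1 ordering does not change: SVM-B (0.532) stays slightly ahead of DT Tuned (cp) and ADA-B (both 0.527).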
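Based on the results table above, the Random Forest models from Homework 2 still appear to be the strongest overall if the goal is the best balance of accuracy and AUC: both reach about 0.908 accuracy and 0.933 AUC, compared with 0.898 and 0.900 for SVM-A and 0.836 and 0.901 for SVM-B. The SVM models were still useful, but they did not beat the best Random Forest results on either metric. The unweighted SVM-A is the most conservative model in the table, with the highest precision (0.675) and the lowest recall (0.246). The weighted SVM-B goes the other way: it has the highest recall (0.798) and the highest reported F1 (0.532), but the lowest precision (0.398) and the lowest accuracy (0.836). Because SVM-B changed both the cost and the class weights, the recall gain cannot be attributed to the weights alone, although the nearly identical AUC values (0.900 vs 0.901) suggest that the ranking of cases barely changed and the main effect was a shift in the decision boundary.
For this dataset, I would recommend Random Forest if the main goal is overall predictive performance. In Homework 2, Random Forest had the strongest AUC (0.933) and the highest accuracy (0.908) of all the models compared, so it looks like the safest overall choice.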
In this homework, SVM is better viewed as a classification method because the target variable is binary: yes or no. SVM can also be used for regression, but that would not be the right setup for this problem.
Yes, I agree with recommending Random Forest for this dataset if the goal is the most accurate overall model. I would only prefer the weighted SVM if the main business goal was to catch more of the positive “yes” cases, even if that created more false positives: it found about 80% of them (recall 0.798) compared with about 43% for Random Forest, but only about 40% of its “yes” predictions were correct (precision 0.398). So for pure overall performance I would stay with Random Forest, but for minority class sensitivity the weighted SVM could still be useful.
Overall, the SVM analysis added a helpful comparison to the models from Homework 2. The SVM models were fast to run and gave reasonable results, but the prior Random Forest model still looks like the best all-around option for this bank marketing classification problem.