Continuing from my previous work with the bank marketing data, this experiment and analysis will use support vector machines (SVMs) for classification.
library(tidyverse)
library(caret)
library(e1071)
library(randomForest)
The steps for data import, pre-processing, and partitioning are all repeated from the previous work. The experiment log is imported as well.
# bank-full.csv is semicolon-separated, so read.csv2 is used
bank_raw <- read.csv2(file="bank+marketing/bank/bank-full.csv")
bank <- bank_raw
# treat "unknown" and "other" poutcome values as missing
bank <- bank |>
  mutate(poutcome = na_if(poutcome, "unknown")) |>
  mutate(poutcome = na_if(poutcome, "other"))
# convert the character columns to factors for modeling
chr_cols <- c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome", "y")
bank <- bank |> mutate(across(all_of(chr_cols), as.factor))
head(bank)
set.seed(123)
# stratified 80/20 split on the target variable y
splitIndex <- createDataPartition(bank$y, p = 0.8, list = FALSE)
bank_train <- bank[splitIndex,]
bank_test <- bank[-splitIndex,]
round(prop.table(table(select(bank, y))), 2)
## y
## no yes
## 0.88 0.12
round(prop.table(table(select(bank_train, y))), 2)
## y
## no yes
## 0.88 0.12
round(prop.table(table(select(bank_test, y))), 2)
## y
## no yes
## 0.88 0.12
# Import the log from the previous experiments for comparison.
experiment_log <- read_csv("experiment_log.csv")
Objective: We will test whether a support vector machine can classify this banking dataset better than the algorithms from the previous experiments.
Variations: This first model will use a linear kernel with the default cost (hardness/softness of margin) of 1.
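As an illustration of what cost controls, the sketch below fits the same linear kernel on an arbitrary subsample of 2000 training rows (my own choice, purely to keep it fast) at a few cost values; a softer margin (smaller cost) typically retains more support vectors.
set.seed(123)
demo_idx <- sample(nrow(bank_train), 2000)  # arbitrary subsample for speed
for (c_val in c(0.1, 1, 10)) {
  m <- svm(y ~ ., data = na.roughfix(bank_train[demo_idx, ]),
           kernel = "linear", cost = c_val)
  cat("cost =", c_val, "-> support vectors:", m$tot.nSV, "\n")
}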
Evaluation: A table will be generated to view the SVM predictions against the actual values, along with the confusion matrix for accuracy.
Experiment:
First, as with random forests, any missing values in the data will need to be imputed. For consistency, I will again apply na.roughfix. Next, the numeric columns must be scaled, because SVMs use distances between data points to determine the hyperplane and make classifications. If a column has a highly variable range of numbers, it could dominate the distance calculations; scaling helps balance the features' contributions.
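For a concrete sense of what these two steps do, here is a small sketch using columns from this dataset: na.roughfix() fills factor NAs with the most frequent level (and numeric NAs with the column median), and scale() performs the same centering and standardizing that svm()'s scale argument requests internally.
table(bank_train$poutcome, useNA = "ifany")   # NAs present before imputation
table(na.roughfix(bank_train)$poutcome)       # filled in with the modal level
demo_scaled <- bank_train |>
  mutate(across(c(age, balance, duration), ~ as.numeric(scale(.x))))
summary(select(demo_scaled, age, balance, duration))  # columns now centered near 0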
set.seed(123)
# impute missing values (factor NAs -> modal level) with randomForest's na.roughfix
bank_train3 <- na.roughfix(bank_train)
bank_test3 <- na.roughfix(bank_test)
# flag the numeric columns so svm() scales only those
num_cols <- sapply(bank_train, is.numeric)
# linear-kernel SVM with the default cost of 1
bank_svm1 <- svm(y ~ .,
                 data = bank_train,
                 scale = num_cols,
                 kernel = "linear")
summary(bank_svm1)
##
## Call:
## svm(formula = y ~ ., data = bank_train, kernel = "linear", scale = num_cols)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 1894
##
## ( 953 941 )
##
##
## Number of Classes: 2
##
## Levels:
## no yes
# predict and evaluate on training data
svm1_train_pred <- predict(bank_svm1, bank_train3)
table(predict = svm1_train_pred, truth = bank_train3$y)
## truth
## predict no yes
## no 31423 3282
## yes 515 950
svm1_train_cm <- confusionMatrix(svm1_train_pred, bank_train3$y)
svm1_train_cm$overall["Accuracy"]
## Accuracy
## 0.8950235
# predict and evaluate on testing data
svm1_test_pred <- predict(bank_svm1, bank_test3)
table(predict = svm1_test_pred, truth = bank_test3$y)
## truth
## predict no yes
## no 7862 830
## yes 122 227
svm1_test_cm <- confusionMatrix(svm1_test_pred, bank_test3$y)
svm1_test_cm$overall["Accuracy"]
## Accuracy
## 0.8947019
With this model, there are 32373 correct classifications and 3797 errors on the training data. On the testing set, there are 952 errors against 8089 correct classifications. The accuracy values from the confusion matrices are about the same as in the previous experiments using decision trees.
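The correct/error counts quoted above come straight from the cross-tabulations; this small helper (my own, not part of the original workflow) makes the arithmetic explicit.
# diagonal of the table = correct classifications, off-diagonal = errors
count_errors <- function(pred, truth) {
  tab <- table(pred, truth)
  c(correct = sum(diag(tab)), errors = sum(tab) - sum(diag(tab)))
}
count_errors(svm1_train_pred, bank_train3$y)  # 32373 correct, 3797 errors
count_errors(svm1_test_pred, bank_test3$y)    # 8089 correct, 952 errors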
svm1_log <- data.frame(
ID = 7,
Model = "SVM",
Features = "all",
Hyperparameters = "cost = 1",
Train = 0.90,
Test = 0.89,
Notes = "same accuracy as decision tree experiments"
)
experiment_log <- bind_rows(experiment_log, svm1_log)
Objective: To see if we can improve on this, 10-fold cross-validation will be applied with different, commonly used cost values to determine the best-performing model for training and testing.
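As an aside, the fold sampling that tune() uses can be stated explicitly (and seeded for reproducibility) through tune.control(); this is just a sketch of the option, since 10-fold cross-validation is already tune()'s default.
set.seed(123)
ctrl <- tune.control(sampling = "cross", cross = 10)  # the default scheme, made explicit
# passed to tune() via its tunecontrol argument, e.g.:
# tune(svm, y ~ ., data = bank_train, kernel = "linear",
#      ranges = list(cost = c(0.01, 0.1, 1, 10)), tunecontrol = ctrl)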
Variations: Based on hyperparameter tuning, the cost will be either 0.01, 0.1, 1 (same), or 10.
Evaluation: The same table and accuracy will be generated.
Experiment:
# grid search over cost; tune() defaults to 10-fold cross-validation
tune_mod <- tune(svm,
                 y ~ .,
                 data = bank_train,
                 kernel = "linear",
                 ranges = list(cost = c(0.01, 0.1, 1, 10)))
summary(tune_mod)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.01
##
## - best performance: 0.1775158
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.1775158 0.006975374
## 2 0.10 0.1808359 0.005952814
## 3 1.00 0.1816526 0.005950147
## 4 10.00 0.1810692 0.006229219
The error rates vary little across the grid, but the best model was determined to have a cost value of 0.01.
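The detailed grid printed above is also available programmatically from the tune object, which is handy for sorting or further inspection.
# full cross-validation results as a data frame, sorted by error
tune_mod$performances[order(tune_mod$performances$error), ]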
best_mod <- tune_mod$best.model
# predict and evaluate on training data
best_train_pred <- predict(best_mod, bank_train3)
table(predict = best_train_pred, truth = bank_train3$y)
## truth
## predict no yes
## no 31354 3169
## yes 584 1063
best_train_cm <- confusionMatrix(best_train_pred, bank_train3$y)
best_train_cm$overall["Accuracy"]
## Accuracy
## 0.89624
# predict and evaluate on testing data
best_test_pred <- predict(best_mod, bank_test3)
table(predict = best_test_pred, truth = bank_test3$y)
## truth
## predict no yes
## no 7841 808
## yes 143 249
best_test_cm <- confusionMatrix(best_test_pred, bank_test3$y)
best_test_cm$overall["Accuracy"]
## Accuracy
## 0.8948125
In this case, the number of errors on training was 3753, a small drop of 44. On testing, there were 951 errors; the overall improvement was minuscule.
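Putting the two linear-kernel models side by side (both values reproduced from the confusion matrices above):
data.frame(model = c("default cost = 1", "tuned cost = 0.01"),
           test_accuracy = unname(c(svm1_test_cm$overall["Accuracy"],
                                    best_test_cm$overall["Accuracy"])))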
svm2_log <- data.frame(
ID = 8,
Model = "SVM",
Features = "all",
Hyperparameters = "tuned to best cost = 0.01",
Train = 0.90,
Test = 0.89,
Notes = "no real improvement"
)
experiment_log <- bind_rows(experiment_log, svm2_log)
Objective: To see if changing the kernel from linear to the Radial Basis Function (RBF), a common non-linear kernel, will affect performance.
Variations: The kernel will be changed; and in the case of non-linear kernels, the gamma hyperparameter will be taken into account. Gamma determines how influential individual points are on the hyperplane, or basically how smooth/sensitive the decision boundary will be. The default value for svm is 1/(data dimension).
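To see where that default lands for this data, the sketch below mimics the model-matrix expansion that svm()'s formula interface performs (factors become dummy columns and the intercept is dropped); the exact column count is my reconstruction and depends on the factor levels present after imputation.
trms <- terms(y ~ ., data = bank_train3)
attr(trms, "intercept") <- 0      # svm's formula method drops the intercept
x_mat <- model.matrix(trms, data = bank_train3)
1 / ncol(x_mat)                   # approximately 0.024, matching the log entry below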
Evaluation: The same table and accuracy will be generated.
Experiment:
set.seed(123)
num_cols <- sapply(bank_train, is.numeric)
# RBF-kernel SVM; gamma is left at its default of 1/(data dimension)
bank_svm3 <- svm(y ~ .,
                 data = bank_train,
                 scale = num_cols,
                 cost = 0.1,
                 kernel = "radial")
summary(bank_svm3)
##
## Call:
## svm(formula = y ~ ., data = bank_train, cost = 0.1, kernel = "radial",
## scale = num_cols)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.1
##
## Number of Support Vectors: 2501
##
## ( 1257 1244 )
##
##
## Number of Classes: 2
##
## Levels:
## no yes
# predict and evaluate on training data
svm3_train_pred <- predict(bank_svm3, bank_train3)
table(predict = svm3_train_pred, truth = bank_train3$y)
## truth
## predict no yes
## no 31186 2860
## yes 752 1372
svm3_train_cm <- confusionMatrix(svm3_train_pred, bank_train3$y)
svm3_train_cm$overall["Accuracy"]
## Accuracy
## 0.9001382
# predict and evaluate on testing data
svm3_test_pred <- predict(bank_svm3, bank_test3)
table(predict = svm3_test_pred, truth = bank_test3$y)
## truth
## predict no yes
## no 7801 732
## yes 183 325
svm3_test_cm <- confusionMatrix(svm3_test_pred, bank_test3$y)
svm3_test_cm$overall["Accuracy"]
## Accuracy
## 0.8987944
Again, there is minimal change to the errors and accuracy.
svm3_log <- data.frame(
ID = 9,
Model = "SVM",
Features = "all",
Hyperparameters = "RBF kernel, cost = 0.1, gamma = 0.024",
Train = 0.90,
Test = 0.90,
Notes = "no real improvement"
)
experiment_log <- bind_rows(experiment_log, svm3_log)
Objective: To perform cross-validation on models with different, commonly used gamma values.
Variations: Based on this hyperparameter tuning, the gamma will be either 0.001, 0.024 (same), 0.1, or 1.
Evaluation: The same table and accuracy will be generated.
Experiment:
# grid search over gamma with 10-fold cross-validation
RBF_tune_mod <- tune(svm,
                     y ~ .,
                     data = bank_train,
                     kernel = "radial",
                     ranges = list(gamma = c(0.001, 0.024, 0.1, 1)))
summary(RBF_tune_mod)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma
## 0.1
##
## - best performance: 0.1607619
##
## - Detailed performance results:
## gamma error dispersion
## 1 0.001 0.2163326 0.015528251
## 2 0.024 0.1726482 0.009425219
## 3 0.100 0.1607619 0.011262381
## 4 1.000 0.2410393 0.021859604
A gamma of 0.1 was determined to give the best performance, with a cross-validation error rate of about 0.16; this is not far off the 0.17 error at the previous default gamma of 0.024.
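e1071 also provides a one-line visual of these tuning results (error versus gamma over the grid):
plot(RBF_tune_mod)  # performance curve across the gamma values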
best_rbf_mod <- RBF_tune_mod$best.model
# predict and evaluate on training data
brbf_train_pred <- predict(best_rbf_mod, bank_train3)
table(predict = brbf_train_pred, truth = bank_train3$y)
## truth
## predict no yes
## no 31318 2883
## yes 620 1349
brbf_train_cm <- confusionMatrix(brbf_train_pred, bank_train3$y)
brbf_train_cm$overall["Accuracy"]
## Accuracy
## 0.9031518
# predict and evaluate on testing data
brbf_test_pred <- predict(best_rbf_mod, bank_test3)
table(predict = brbf_test_pred, truth = bank_test3$y)
## truth
## predict no yes
## no 7809 763
## yes 175 294
brbf_test_cm <- confusionMatrix(brbf_test_pred, bank_test3$y)
brbf_test_cm$overall["Accuracy"]
## Accuracy
## 0.8962504
Once again, hyperparameter tuning appeared to have little effect on the performance.
svm4_log <- data.frame(
ID = 10,
Model = "SVM",
Features = "all",
Hyperparameters = "tuned to best gamma = 0.1",
Train = 0.90,
Test = 0.90,
Notes = "no real improvement"
)
experiment_log <- bind_rows(experiment_log, svm4_log)
knitr::kable(experiment_log, format = "pipe", padding = 0)
ID | Model | Features | Hyperparameters | Train | Test | Notes |
---|---|---|---|---|---|---|
1 | Decision Tree | duration, poutcome, pdays | none | 0.90 | 0.89 | marketing features only |
2 | Decision Tree | poutcome, pdays | none | 0.89 | 0.89 | dropped duration, minimal changes |
3 | Random Forest | all, with different ranking order from decision trees after ‘duration’ | impute method, number of trees | 1.00 | 0.85 | overfitting |
4 | Random Forest | ranked ‘duration’, ‘month’, and ‘poutcome’ | leaf size, number of features randomly sampled | 0.85 | 0.81 | less accurate, lowered variance |
5 | XGBoost | all | nrounds = 100, defaults | 0.96 | 0.91 | duration ranked first |
6 | XGBoost | all | k-fold cross-validation, gamma, minimum child weight, nrounds = 55 | 0.94 | 0.91 | boosting rounds reduced significantly, similar accuracy |
7 | SVM | all | cost = 1 | 0.90 | 0.89 | same accuracy as decision tree experiments |
8 | SVM | all | tuned to best cost = 0.01 | 0.90 | 0.89 | no real improvement |
9 | SVM | all | RBF kernel, cost = 0.1, gamma = 0.024 | 0.90 | 0.90 | no real improvement |
10 | SVM | all | tuned to best gamma = 0.1 | 0.90 | 0.90 | no real improvement |
The previous experiments determined that XGBoost was the best-performing model by accuracy, compared to decision trees and random forests. Adding these SVM results has not changed that conclusion. The SVM models, even when tuned for the best cost or gamma hyperparameters, did not vary much in accuracy from each other, nor did they meaningfully exceed the results of the previous second-best algorithm, decision trees (which, unlike SVMs, required no particular data manipulation such as imputation, encoding, or scaling before training).
In the context of this binary classification problem on this large, multidimensional banking dataset, the XGBoost ensemble method performed better than single models like the SVM and the decision tree, as could be expected.
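Finally, since the log was imported from experiment_log.csv at the start, writing it back keeps these SVM rows available for the next installment; overwriting the same file is an assumption on my part.
write_csv(experiment_log, "experiment_log.csv")  # persist the updated log (assumed path)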