The Bank Marketing dataset was obtained from the UCI Machine Learning Repository (https://archive.ics.uci.edu/dataset/222/bank+marketing). It contains 45,211 client records from direct marketing campaigns conducted by a Portuguese banking institution via phone calls. Each record has 17 variables: 16 input features covering client demographics (age, education, job, and marital status), financial status (bank balance in euros, personal and housing loan status, and credit default status), and campaign-related variables, plus the target variable, which indicates whether the client subscribed to a term deposit. For this assignment, we conducted experiments using SVM kernels; the objective is to choose the optimal model for predicting whether a client will subscribe to a bank term deposit.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(kernlab)
##
## Attaching package: 'kernlab'
##
## The following object is masked from 'package:purrr':
##
## cross
##
## The following object is masked from 'package:ggplot2':
##
## alpha
library(e1071)
library(rpart)
library(ROCR)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
raw_data <- read.csv("https://raw.githubusercontent.com/suswong/DATA-622/refs/heads/main/bank-full.csv", sep = ";")
# Convert all character columns (including the target y) to factors
bank_df <- raw_data %>%
  mutate(across(where(is.character), as.factor))
set.seed(1)
# Stratified 70/30 train/test split on the outcome y
trainIndex <- createDataPartition(bank_df$y, p = 0.7, list = FALSE)
train_data <- bank_df[trainIndex, ]
test_data <- bank_df[-trainIndex, ]
All models were fit with feature scaling because several numeric features have skewed distributions and very different ranges, and SVMs are sensitive to feature scale since they rely on distance calculations.
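As a quick side check (not part of the original workflow), the spread of the numeric features can be inspected to confirm they are on very different scales before relying on scale = TRUE:

# Side check: compare the standard deviations of the numeric features.
# Large differences (e.g., balance in euros vs. age in years) are why
# scaling matters for distance-based methods like SVM.
train_data %>%
  summarise(across(where(is.numeric), sd)) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "sd") %>%
  arrange(desc(sd))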
The linear SVM model below predicts the outcome y (yes or no to a term deposit) from the features in the training data, using the default settings of cost = 1 (the regularization parameter) and cross = 0, with scaling.
Result: Accuracy: 0.8917, Precision: 0.9004, F1-Score: 0.9415, AUC: 0.904
set.seed(10)
# Fit a linear-kernel SVM with default cost = 1; probability = TRUE enables
# class-probability estimates, and scale = TRUE standardizes the features
svm_linear_prob <- svm(formula = y ~ ., data = train_data, kernel = "linear", probability = TRUE, scale = TRUE)
svm_linear_pred <- predict(svm_linear_prob, newdata = test_data, probability = TRUE)
confusion_matrix_svm_linear <- confusionMatrix(svm_linear_pred, test_data$y)
# Extract the predicted probability of the "yes" class for the ROC curve
svm_linear_probs <- as.numeric(attr(svm_linear_pred, "probabilities")[, "yes"])
test_data$y <- factor(test_data$y, levels = c("no", "yes"))
roc_svm_linear <- roc(test_data$y, svm_linear_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
auc_svm_linear <- auc(roc_svm_linear)
print(confusion_matrix_svm_linear)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 11814 1307
## yes 162 279
##
## Accuracy : 0.8917
## 95% CI : (0.8863, 0.8969)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 0.0008341
##
## Kappa : 0.2364
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9865
## Specificity : 0.1759
## Pos Pred Value : 0.9004
## Neg Pred Value : 0.6327
## Prevalence : 0.8831
## Detection Rate : 0.8711
## Detection Prevalence : 0.9675
## Balanced Accuracy : 0.5812
##
## 'Positive' Class : no
##
Hyperparameter tuning was applied to find optimal settings. A smaller number of CV folds was used because of the long computation time; if time allows, 10-fold CV would produce a more robust model.
Result: The tuned model shows identical accuracy, precision, and F1-score, though AUC increased slightly. This suggests tuning changed little because the default setting was already close to optimal.
set.seed(12)
tuned_optimal_l <- tune.svm(y ~ .,
                            data = train_data,
                            kernel = "linear",
                            cost = c(0.001, 0.01, 0.1),
                            probability = TRUE,
                            tunecontrol = tune.control(cross = 2,
                                                       best.model = TRUE,
                                                       performances = TRUE))
# Usually I would do 10-fold CV; however, it was taking hours to run.
print(tuned_optimal_l$best.parameters)
## cost
## 3 0.1
print(tuned_optimal_l$performances)
## cost error dispersion
## 1 0.001 0.1170021 0.002184303
## 2 0.010 0.1069860 0.001157011
## 3 0.100 0.1066384 0.001737923
svm_linear_tuned <- svm(y ~ ., data = train_data, kernel = "linear", cost = tuned_optimal_l$best.parameters$cost, probability = TRUE, scale = TRUE)
svm_linear_tuned_pred <- predict(svm_linear_tuned, newdata = test_data, probability = TRUE)
confusion_matrix_svm_linear_tuned <- confusionMatrix(svm_linear_tuned_pred, test_data$y)
svm_linear_tuned_probs <- as.numeric(attr(svm_linear_tuned_pred, "probabilities")[,"yes"])
test_data$y <- factor(test_data$y, levels = c("no", "yes"))
roc_svm_linear_tuned <- roc(test_data$y, svm_linear_tuned_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
auc_svm_linear_tuned <- auc(roc_svm_linear_tuned)
print(confusion_matrix_svm_linear_tuned)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 11814 1307
## yes 162 279
##
## Accuracy : 0.8917
## 95% CI : (0.8863, 0.8969)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 0.0008341
##
## Kappa : 0.2364
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9865
## Specificity : 0.1759
## Pos Pred Value : 0.9004
## Neg Pred Value : 0.6327
## Prevalence : 0.8831
## Detection Rate : 0.8711
## Detection Prevalence : 0.9675
## Balanced Accuracy : 0.5812
##
## 'Positive' Class : no
##
The radial SVM model below predicts the outcome y (yes or no to a term deposit) from the features in the training data, using the default settings of cost = 1 (the regularization parameter), cross = 0, and gamma = 1/(number of features), with scaling.
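As a side note, the default gamma can be computed directly; the sketch below assumes (per e1071's documented default) that the "number of features" is the column count of the dummy-expanded model matrix with the intercept dropped:

# e1071's default gamma is 1/(data dimension), where factor predictors count
# via their dummy-expanded columns (approximation of svm()'s internal matrix)
1 / ncol(model.matrix(y ~ . - 1, data = train_data))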
Result: The untuned radial SVM already performs better than both linear SVMs across all metrics, which suggests that the non-linear decision boundary provided by this kernel is a better fit for the dataset.
set.seed(10)
svm_radial_prob <- svm(formula = y ~ ., data = train_data,kernel = "radial", probability = TRUE, scale = TRUE)
svm_radial_pred <- predict(svm_radial_prob, newdata = test_data, probability = TRUE)
confusion_matrix_svm_radial <- confusionMatrix(svm_radial_pred, test_data$y)
svm_radial_probs <- as.numeric(attr(svm_radial_pred, "probabilities")[,"yes"])
test_data$y <- factor(test_data$y, levels = c("no", "yes"))
roc_svm_radial <- roc(test_data$y, svm_radial_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
auc_svm_radial <- auc(roc_svm_radial)
print(confusion_matrix_svm_radial)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 11737 1181
## yes 239 405
##
## Accuracy : 0.8953
## 95% CI : (0.89, 0.9004)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 3.581e-06
##
## Kappa : 0.3171
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9800
## Specificity : 0.2554
## Pos Pred Value : 0.9086
## Neg Pred Value : 0.6289
## Prevalence : 0.8831
## Detection Rate : 0.8654
## Detection Prevalence : 0.9525
## Balanced Accuracy : 0.6177
##
## 'Positive' Class : no
##
Hyperparameter tuning was applied to find optimal settings. A smaller number of CV folds was used because of the long computation time; if time allows, 10-fold CV would produce a more robust model.
set.seed(11)
tuned_optimal <- tune.svm(y ~ .,
                          data = train_data,
                          kernel = "radial",
                          gamma = 10^(-1:1),
                          cost = c(0.001, 0.01, 0.1),
                          probability = TRUE,
                          tunecontrol = tune.control(cross = 2,
                                                     best.model = TRUE,
                                                     performances = TRUE))
# Usually I would do 10-fold CV; however, it was taking hours to run.
print(tuned_optimal$best.parameters)
## gamma cost
## 7 0.1 0.1
print(tuned_optimal$performances)
## gamma cost error dispersion
## 1 0.1 0.001 0.1170022 0.002194759
## 2 1.0 0.001 0.1170022 0.002194759
## 3 10.0 0.001 0.1170022 0.002194759
## 4 0.1 0.010 0.1170022 0.002194759
## 5 1.0 0.010 0.1170022 0.002194759
## 6 10.0 0.010 0.1170022 0.002194759
## 7 0.1 0.100 0.1096718 0.003803067
## 8 1.0 0.100 0.1170022 0.002194759
## 9 10.0 0.100 0.1170022 0.002194759
svm_radial_tuned <- svm(y ~ ., data = train_data, kernel = "radial", cost = tuned_optimal$best.parameters$cost, gamma = tuned_optimal$best.parameters$gamma, probability = TRUE, scale = TRUE)
svm_radial_tuned_pred <- predict(svm_radial_tuned, newdata = test_data, probability = TRUE)
confusion_matrix_svm_radial_tuned <- confusionMatrix(svm_radial_tuned_pred, test_data$y)
svm_radial_tuned_probs <- as.numeric(attr(svm_radial_tuned_pred, "probabilities")[,"yes"])
test_data$y <- factor(test_data$y, levels = c("no", "yes"))
roc_svm_radial_tuned <- roc(test_data$y, svm_radial_tuned_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
auc_svm_radial_tuned <- auc(roc_svm_radial_tuned)
print(confusion_matrix_svm_radial_tuned)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 11716 1123
## yes 260 463
##
## Accuracy : 0.898
## 95% CI : (0.8928, 0.9031)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 1.76e-08
##
## Kappa : 0.3537
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9783
## Specificity : 0.2919
## Pos Pred Value : 0.9125
## Neg Pred Value : 0.6404
## Prevalence : 0.8831
## Detection Rate : 0.8639
## Detection Prevalence : 0.9467
## Balanced Accuracy : 0.6351
##
## 'Positive' Class : no
##
The polynomial SVM model below predicts the outcome y (yes or no to a term deposit) from the features in the training data, using the default settings of cost = 1 (the regularization parameter), cross = 0, gamma = 1/(number of features), and degree = 3, with scaling.
set.seed(10)
svm_polynomial_prob <- svm(formula = y ~ ., data = train_data, kernel = "polynomial", probability = TRUE, scale = TRUE)
svm_polynomial_pred <- predict(svm_polynomial_prob, newdata = test_data, probability = TRUE)
confusion_matrix_svm_polynomial <- confusionMatrix(svm_polynomial_pred, test_data$y)
svm_polynomial_probs <- as.numeric(attr(svm_polynomial_pred, "probabilities")[,"yes"])
test_data$y <- factor(test_data$y, levels = c("no", "yes"))
roc_svm_polynomial <- roc(test_data$y, svm_polynomial_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
auc_svm_polynomial <- auc(roc_svm_polynomial)
print(confusion_matrix_svm_polynomial)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 11820 1301
## yes 156 285
##
## Accuracy : 0.8926
## 95% CI : (0.8872, 0.8977)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 0.0002584
##
## Kappa : 0.2427
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9870
## Specificity : 0.1797
## Pos Pred Value : 0.9008
## Neg Pred Value : 0.6463
## Prevalence : 0.8831
## Detection Rate : 0.8716
## Detection Prevalence : 0.9675
## Balanced Accuracy : 0.5833
##
## 'Positive' Class : no
##
Support Vector Machines are powerful supervised learning algorithms used for both classification and regression. They handle high-dimensional data well because they draw a maximum-margin boundary between groups of data, making it easier to tell which points belong to which class.
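To illustrate the margin idea, here is a toy example on R's built-in iris data (not part of the assignment): a linear SVM separating two species in two dimensions.

# Toy illustration: a linear SVM separating two classes in 2-D.
# Support vectors are marked with crosses in the resulting plot.
toy <- droplevels(subset(iris, Species != "virginica"))
toy_fit <- svm(Species ~ Petal.Length + Petal.Width, data = toy, kernel = "linear")
plot(toy_fit, toy)  # decision boundary between setosa and versicolor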
All models performed reasonably well with both default and tuned parameters. Tuning was performed using 2-fold cross-validation due to computational limits. The SVM with the radial kernel outperformed the others overall, especially once tuned, which makes sense since it is better at handling complex, nonlinear patterns, as expected in real-world data like this dataset. The tuned radial model showed the highest accuracy (0.898) and F1-score (0.944). However, the AUC decreased slightly after tuning the radial SVM, suggesting possible overfitting to the majority class.
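Note that caret treats "no" as the positive class here, so the precision and F1 above mostly reflect the majority class. A quick side check (not in the original workflow) recomputes them for the minority class:

# Re-evaluate the tuned radial SVM with "yes" as the positive class,
# which is more informative for the imbalanced minority class
cm_yes <- confusionMatrix(svm_radial_tuned_pred, test_data$y, positive = "yes")
cm_yes$byClass[c("Precision", "Recall", "F1")]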
The linear kernel models, both default and tuned, performed consistently well. The polynomial kernel is capable of modeling complex relationships, but it did not outperform the other models. Tuning was not applied to this kernel due to time constraints; a sketch of what such tuning could look like follows.
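A minimal sketch, assuming the same tune.svm workflow used for the other kernels; the degree and cost grids are illustrative choices, not values from the original analysis:

# Hypothetical tuning grid for the polynomial kernel (not run in this report)
tuned_poly <- tune.svm(y ~ .,
                       data = train_data,
                       kernel = "polynomial",
                       degree = 2:4,            # assumed grid
                       cost = c(0.01, 0.1, 1),  # assumed grid
                       probability = TRUE,
                       tunecontrol = tune.control(cross = 2))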
summary_comparison <- data.frame(
  Model = c("SVM linear",
            "SVM linear (tuned)",
            "SVM Radial",
            "SVM Radial (tuned)",
            "SVM Polynomial"),
  Accuracy = c(confusion_matrix_svm_linear$overall["Accuracy"],
               confusion_matrix_svm_linear_tuned$overall["Accuracy"],
               confusion_matrix_svm_radial$overall["Accuracy"],
               confusion_matrix_svm_radial_tuned$overall["Accuracy"],
               confusion_matrix_svm_polynomial$overall["Accuracy"]),
  Precision = c(confusion_matrix_svm_linear$byClass["Pos Pred Value"],
                confusion_matrix_svm_linear_tuned$byClass["Pos Pred Value"],
                confusion_matrix_svm_radial$byClass["Pos Pred Value"],
                confusion_matrix_svm_radial_tuned$byClass["Pos Pred Value"],
                confusion_matrix_svm_polynomial$byClass["Pos Pred Value"]),
  F1_Score = c(confusion_matrix_svm_linear$byClass["F1"],
               confusion_matrix_svm_linear_tuned$byClass["F1"],
               confusion_matrix_svm_radial$byClass["F1"],
               confusion_matrix_svm_radial_tuned$byClass["F1"],
               confusion_matrix_svm_polynomial$byClass["F1"]),
  AUC = c(auc_svm_linear, auc_svm_linear_tuned, auc_svm_radial,
          auc_svm_radial_tuned, auc_svm_polynomial)
)
print(summary_comparison)
##                Model  Accuracy Precision  F1_Score       AUC
## 1         SVM linear 0.8916826 0.9003887 0.9414671 0.9036006
## 2 SVM linear (tuned) 0.8916826 0.9003887 0.9414671 0.9049189
## 3         SVM Radial 0.8952957 0.9085772 0.9429581 0.9059097
## 4 SVM Radial (tuned) 0.8980239 0.9125321 0.9442676 0.8909411
## 5     SVM Polynomial 0.8925675 0.9008460 0.9419453 0.8850769
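For a visual comparison (not part of the original output), the ROC curves of the fitted kernels can be overlaid with pROC:

# Overlay ROC curves; lines() dispatches to pROC's lines.roc for roc objects
plot(roc_svm_linear, col = "black", legacy.axes = TRUE)
lines(roc_svm_radial, col = "blue")
lines(roc_svm_radial_tuned, col = "red")
lines(roc_svm_polynomial, col = "darkgreen")
legend("bottomright",
       legend = c("linear", "radial", "radial (tuned)", "polynomial"),
       col = c("black", "blue", "red", "darkgreen"), lwd = 2)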
This dataset contains 16 input features, including both categorical and numerical values, and is moderately large, with over 45,000 observations. Class imbalance exists, with a subscription rate of around 11.7% ("yes" is the minority class), and the categorical variables contain a large number of unknown values (e.g., about 81% of poutcome is unknown). Given these characteristics, Random Forest and AdaBoost are the recommended algorithms for predicting the categorical outcome of this dataset (yes or no to a term deposit).
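Both characteristics can be verified directly (a side check, not in the original output):

prop.table(table(bank_df$y))         # ~11.7% of clients subscribed ("yes")
mean(bank_df$poutcome == "unknown")  # ~81% of poutcome values are "unknown"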
The results align with the conclusions above: Random Forest and AdaBoost clearly outperformed the other algorithms. They consistently achieved the highest F1-scores, which is the most reliable metric for an imbalanced dataset because it balances precision and recall. The notably lower AUC for the Decision Tree models is a strong indicator of their struggle to identify the minority class.
Random Forest models:
- Highest F1-scores: 0.947-0.948
- Highest AUC: 0.926-0.928
- Highest accuracy: 0.905-0.906
AdaBoost models:
- Highest precision: 0.930-0.933
- F1-scores: 0.946-0.947 (just below Random Forest)
- AUC: 0.920-0.927
- Accuracy: 0.903-0.905
Decision Tree models:
- F1-scores: 0.942-0.945 (lower than the ensembles)
- AUC: 0.744-0.747 (far lower than the ensembles, which highlights their weakness on imbalanced data despite a decent F1-score)
- Accuracy: 0.895-0.899
SVM models:
- F1-scores: 0.941-0.944
- AUC: 0.885-0.906
- Accuracy: 0.892-0.898 (lowest of the four model families)
Random Forest is recommended for the most accurate results, as the Random Forest variants performed at or near the top on every metric compared to the other models.
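For reference, a minimal sketch of the recommended model on the same split; ntree = 200 mirrors the best-performing variant in the table below, with other settings left at their defaults:

# Minimal Random Forest fit on the same train/test split
set.seed(1)
rf_fit <- randomForest(y ~ ., data = train_data, ntree = 200)
rf_pred <- predict(rf_fit, newdata = test_data)
confusionMatrix(rf_pred, test_data$y)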
summary_comparison <- data.frame(
  Model = c("Decision Tree (default)",
            "Decision Tree (Max-depth = 3)",
            "Decision Tree (Max-depth = 5)",
            "Decision Tree (Pruned)",
            "Random Forest (default)",
            "Random Forest (ntree=200)",
            "Random Forest (Tuned, mtry=6)",
            "AdaBoost (Default)",
            "AdaBoost (mfinal=100,cp=0.001)",
            "SVM linear",
            "SVM linear (tuned)",
            "SVM Radial",
            "SVM Radial (tuned)",
            "SVM Polynomial"),
  Accuracy = c(0.8990562, 0.8946321, 0.8990562, 0.8990562,
               0.9046601, 0.9057661, 0.9049550, 0.9046601, 0.9031116,
               confusion_matrix_svm_linear$overall["Accuracy"],
               confusion_matrix_svm_linear_tuned$overall["Accuracy"],
               confusion_matrix_svm_radial$overall["Accuracy"],
               confusion_matrix_svm_radial_tuned$overall["Accuracy"],
               confusion_matrix_svm_polynomial$overall["Accuracy"]),
  Precision = c(0.9173027, 0.9165153, 0.9173027, 0.9173027,
                0.9206899, 0.9211811, 0.9201195, 0.9302457, 0.9327110,
                confusion_matrix_svm_linear$byClass["Pos Pred Value"],
                confusion_matrix_svm_linear_tuned$byClass["Pos Pred Value"],
                confusion_matrix_svm_radial$byClass["Pos Pred Value"],
                confusion_matrix_svm_radial_tuned$byClass["Pos Pred Value"],
                confusion_matrix_svm_polynomial$byClass["Pos Pred Value"]),
  F1_Score = c(0.9445412, 0.9419978, 0.9445412, 0.9445412,
               0.9475945, 0.9482088, 0.9478032, 0.9469886, 0.9459170,
               confusion_matrix_svm_linear$byClass["F1"],
               confusion_matrix_svm_linear_tuned$byClass["F1"],
               confusion_matrix_svm_radial$byClass["F1"],
               confusion_matrix_svm_radial_tuned$byClass["F1"],
               confusion_matrix_svm_polynomial$byClass["F1"]),
  AUC = c(0.7465744, 0.7445256, 0.7465744, 0.7465744,
          0.9260661, 0.9280303, 0.9280113, 0.9268785, 0.9201388,
          auc_svm_linear, auc_svm_linear_tuned, auc_svm_radial,
          auc_svm_radial_tuned, auc_svm_polynomial)
)
print(summary_comparison)
##                             Model  Accuracy Precision  F1_Score       AUC
## 1         Decision Tree (default) 0.8990562 0.9173027 0.9445412 0.7465744
## 2   Decision Tree (Max-depth = 3) 0.8946321 0.9165153 0.9419978 0.7445256
## 3   Decision Tree (Max-depth = 5) 0.8990562 0.9173027 0.9445412 0.7465744
## 4          Decision Tree (Pruned) 0.8990562 0.9173027 0.9445412 0.7465744
## 5         Random Forest (default) 0.9046601 0.9206899 0.9475945 0.9260661
## 6       Random Forest (ntree=200) 0.9057661 0.9211811 0.9482088 0.9280303
## 7   Random Forest (Tuned, mtry=6) 0.9049550 0.9201195 0.9478032 0.9280113
## 8              AdaBoost (Default) 0.9046601 0.9302457 0.9469886 0.9268785
## 9  AdaBoost (mfinal=100,cp=0.001) 0.9031116 0.9327110 0.9459170 0.9201388
## 10                     SVM linear 0.8916826 0.9003887 0.9414671 0.9036006
## 11             SVM linear (tuned) 0.8916826 0.9003887 0.9414671 0.9049189
## 12                     SVM Radial 0.8952957 0.9085772 0.9429581 0.9059097
## 13             SVM Radial (tuned) 0.8980239 0.9125321 0.9442676 0.8909411
## 14                 SVM Polynomial 0.8925675 0.9008460 0.9419453 0.8850769
Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection: A Comparative Study. This study shows that class imbalance presents a challenge when working with medical data, in this case predicting COVID-19, where about 13% of the cases are actually positive. The researchers used techniques like SMOTE and RUS to address class imbalance. Ensemble models such as Random Forest, XGBoost, and AdaBoost were compared with a single decision tree and evaluated with metrics like recall, F1-measure, precision, and AUROC, since accuracy alone is not a good measure for an imbalanced dataset. Models designed for imbalanced data, like the balanced random forest, performed the best and outperformed a single decision tree. Relevance to our assignment: this article supports my strong recommendation of Random Forest and AdaBoost, as they are robust to imbalanced datasets. It also validates evaluating the models with precision, AUC, and F1-score in addition to accuracy.
A novel approach to predict COVID-19 using support vector machine. This article compares SVM against other models, such as Random Forest and Decision Tree, for predicting COVID-19 infection severity. SVM outperformed all the other models, including Naive Bayes, KNN, and Random Forest. This contrasts with the results I found in our assignment, highlighting that model performance can be dataset specific. The article shows that SVMs remain powerful classifiers, as they handle high-dimensional data well and draw a boundary between different groups of data, making it easier to tell which points belong to which class.
A Comparative Analysis of Decision Tree and Support Vector Machine on Suicide Ideation Detection. This 2023 article compares decision trees and SVM for suicide ideation detection using social media data. Here, too, SVM outperforms a single decision tree on all metrics (precision, recall, accuracy, and F1-score).
Are Random Forests Better than Support Vector Machines for Microarray-Based Cancer Classification? This article compares two algorithms, Random Forest and SVM, and challenges the earlier belief that Random Forest is superior. This matters because no single algorithm is always best; performance depends on the dataset and the methodology, as we saw in the second article. The authors discuss bias in prior work caused by using the wrong metric: previous studies relied heavily on accuracy alone, which is flawed because that metric is sensitive to class imbalance (as in our bank dataset). They also point out that previous studies did not properly tune the SVM models; it is important to optimize models with the right parameters to see their true potential.
SVM vs Decision Tree Algorithm: Cost Effective Comparison to Enhance Crime Detection and Prevention. This article examines how well Decision Tree and SVM perform when used to predict crime patterns. Both models were applied to a dataset from 56 states with 30 features. The results showed that the Decision Tree performed significantly better than SVM in terms of accuracy (p-value = 0.025). This is a sharp contrast to some of the other articles I found (suicide ideation detection and COVID-19 severity prediction), where SVM outperformed Decision Tree. The article notes the Decision Tree's advantages, which include low computational cost, robustness to different data types, and interpretability.