This RMD supports Assignment 03 for CUNY SPS DATA 622, in which an SVM model is built on a dataset from a Portuguese bank's direct marketing campaign. The campaign was conducted through phone calls, and the classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y). Machine learning techniques are applied to analyze the data and uncover the most effective strategies for boosting customer subscriptions in future campaigns. Results from this RMD are compared to the models from Assignment 02.
Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection: A Comparative Study
Five machine learning algorithms (neural networks, random forests, XGBoost, logistic regression, and support vector machines (SVM)) were tested on a COVID-19 dataset built from commonly ordered laboratory tests, with the goal of predicting positive COVID-19 cases.
Accuracy is one of the most commonly used performance measures, but because the experiment uses an imbalanced dataset, precision, recall, and the F1-measure are reported alongside accuracy.
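For reference, the relationship between these metrics can be sketched from hypothetical confusion-matrix counts (the numbers below are illustrative only, not from the study):

# Illustrative counts for an imbalanced problem
TP <- 90; FP <- 10; FN <- 30; TN <- 870
precision <- TP / (TP + FP)   # correctness of positive predictions
recall    <- TP / (TP + FN)   # coverage of actual positives
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (TP + TN) / (TP + FP + FN + TN)
c(precision = precision, recall = recall, F1 = f1, accuracy = accuracy)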
A novel approach to predict COVID-19 using support vector machine
The use of predictive analytics in finance
link: https://www.sciencedirect.com/science/article/pii/S2405918822000071
3.2. Earnings prediction
4.1. Customer acquisition and attrition prediction
4.2. Customer segmentation and sales prediction
Predicting Employee Attrition Using Machine Learning Approaches
link: https://www.mdpi.com/2076-3417/12/13/6424
4. Proposed Machine Learning Approaches
5. Results and Discussions
Machine Learning for Predicting Employee Attrition
B. Machine Learning Classification Algorithms
library(tidyverse)  # core data wrangling and plotting tools
library(ggplot2)    # plotting (also attached by tidyverse)
library(dplyr)      # data manipulation (also attached by tidyverse)
library(caret)      # model training, tuning, and evaluation
library(kernlab)    # kernel-based SVM backend for caret's svmRadial
library(pROC)       # ROC curves and AUC
The code below mirrors the data preparation from the previous assignments.
# Read csv from github
bank_df <- read.csv("https://raw.githubusercontent.com/evanskaylie/DATA622/refs/heads/main/bank-full.csv", sep = ";")
# Change character cols to factors
character_columns <- sapply(bank_df, is.character)
bank_df[character_columns] <- lapply(bank_df[character_columns], as.factor)
# Check the data
head(bank_df)
## age job marital education default balance housing loan contact day
## 1 58 management married tertiary no 2143 yes no unknown 5
## 2 44 technician single secondary no 29 yes no unknown 5
## 3 33 entrepreneur married secondary no 2 yes yes unknown 5
## 4 47 blue-collar married unknown no 1506 yes no unknown 5
## 5 33 unknown single unknown no 1 no no unknown 5
## 6 35 management married tertiary no 231 yes no unknown 5
## month duration campaign pdays previous poutcome y
## 1 may 261 1 -1 0 unknown no
## 2 may 151 1 -1 0 unknown no
## 3 may 76 1 -1 0 unknown no
## 4 may 92 1 -1 0 unknown no
## 5 may 198 1 -1 0 unknown no
## 6 may 139 1 -1 0 unknown no
summary(bank_df)
## age job marital education
## Min. :18.00 blue-collar:9732 divorced: 5207 primary : 6851
## 1st Qu.:33.00 management :9458 married :27214 secondary:23202
## Median :39.00 technician :7597 single :12790 tertiary :13301
## Mean :40.94 admin. :5171 unknown : 1857
## 3rd Qu.:48.00 services :4154
## Max. :95.00 retired :2264
## (Other) :6835
## default balance housing loan contact
## no :44396 Min. : -8019 no :20081 no :37967 cellular :29285
## yes: 815 1st Qu.: 72 yes:25130 yes: 7244 telephone: 2906
## Median : 448 unknown :13020
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
##
## day month duration campaign
## Min. : 1.00 may :13766 Min. : 0.0 Min. : 1.000
## 1st Qu.: 8.00 jul : 6895 1st Qu.: 103.0 1st Qu.: 1.000
## Median :16.00 aug : 6247 Median : 180.0 Median : 2.000
## Mean :15.81 jun : 5341 Mean : 258.2 Mean : 2.764
## 3rd Qu.:21.00 nov : 3970 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :31.00 apr : 2932 Max. :4918.0 Max. :63.000
## (Other): 6060
## pdays previous poutcome y
## Min. : -1.0 Min. : 0.0000 failure: 4901 no :39922
## 1st Qu.: -1.0 1st Qu.: 0.0000 other : 1840 yes: 5289
## Median : -1.0 Median : 0.0000 success: 1511
## Mean : 40.2 Mean : 0.5803 unknown:36959
## 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :871.0 Max. :275.0000
##
sum(is.na(bank_df))
## [1] 0
# Ensure the target variable y is a factor (already converted above, but made explicit here)
bank_df$y <- as.factor(bank_df$y)
The data is clean, with no missing values and plenty of observations. It is imbalanced, however: the minority class (yes, 5,289) is roughly 13% of the majority class (no, 39,922). Undersampling the majority class will balance the classes, which helps the SVM model.
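As a quick check of that ratio, this short optional snippet uses the bank_df object loaded above:

# Class counts and proportions before balancing
table(bank_df$y)
round(prop.table(table(bank_df$y)), 3)
# "yes" as a fraction of "no" (roughly 0.13)
sum(bank_df$y == "yes") / sum(bank_df$y == "no")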
# Set seed for reproducibility
set.seed(64)
# Separate into majority and minority classes
majority_class <- names(sort(table(bank_df$y), decreasing = TRUE))[1]
minority_class <- names(sort(table(bank_df$y), decreasing = TRUE))[2]
majority_df <- bank_df[bank_df$y == majority_class, ]
minority_df <- bank_df[bank_df$y == minority_class, ]
# Undersample the majority class to match the minority
undersampled_majority <- majority_df[sample(nrow(majority_df), nrow(minority_df)), ]
# Combine to create a balanced dataset
balanced_bank <- rbind(undersampled_majority, minority_df)
# Shuffle the rows
balanced_bank <- balanced_bank[sample(nrow(balanced_bank)), ]
# Confirm new class balance
table(balanced_bank$y)
##
## no yes
## 5289 5289
# Set seed for reproducibility
set.seed(64)
# Create partition of 80-20
index_bank <- createDataPartition(balanced_bank$y, p = .8, list = FALSE)
# One-hot encode before model training
dummies_bank <- dummyVars(y ~ ., data = balanced_bank)
X_dummified <- predict(dummies_bank, newdata = balanced_bank)
# Split X into training and testing
X_train_bank <- X_dummified[index_bank, ]
X_test_bank <- X_dummified[-index_bank, ]
# Split y into training and testing (only the 'y' column)
y_train_bank <- balanced_bank[index_bank, "y", drop = TRUE]
y_test_bank <- balanced_bank[-index_bank, "y", drop = TRUE]
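As an optional sanity check (not part of the original code), createDataPartition() stratifies on y, so both splits should remain close to the 50/50 balance created above:

# Class proportions in each split should stay near 50/50
prop.table(table(y_train_bank))
prop.table(table(y_test_bank))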
# Tune the SVM model
ctrl <- trainControl(method = "cv",
classProbs = TRUE,
savePredictions = "final",
summaryFunction = twoClassSummary)
svm_model <- train(x = X_train_bank,
y = y_train_bank,
method = "svmRadial",
preProc = c("center", "scale"),
metric = "ROC", # use AUC as the optimization goal
tuneLength = 3,
trControl = ctrl)
# svm_model
# > The final values used for the model were sigma = 0.01313339 and C = 1.
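If finer control over the search were wanted, caret also accepts an explicit tuneGrid in place of tuneLength. The sketch below centers a grid on the values caret selected (sigma ≈ 0.013, C = 1); the grid values and the svm_model_grid name are illustrative and this model is not re-run here.

# Explicit tuning grid around the values caret selected above
svm_grid <- expand.grid(sigma = c(0.005, 0.013, 0.03),
                        C = c(0.5, 1, 2, 4))
svm_model_grid <- train(x = X_train_bank,
                        y = y_train_bank,
                        method = "svmRadial",
                        preProc = c("center", "scale"),
                        metric = "ROC",
                        tuneGrid = svm_grid,
                        trControl = ctrl)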
# Predict on the model
svm_pred <- predict(svm_model, newdata = X_test_bank)
#Get the test set performance values
svm_performance <- postResample(pred = svm_pred, obs = y_test_bank)
svm_performance
## Accuracy Kappa
## 0.8618732 0.7237465
# Predict using the final model and calculate confusion matrix
svm_pred <- predict(svm_model, newdata = X_test_bank)
svm_cm_bank <- confusionMatrix(svm_pred, y_test_bank)
# Show the final confusion matrix
print(svm_cm_bank)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 906 141
## yes 151 916
##
## Accuracy : 0.8619
## 95% CI : (0.8464, 0.8763)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7237
##
## Mcnemar's Test P-Value : 0.5984
##
## Sensitivity : 0.8571
## Specificity : 0.8666
## Pos Pred Value : 0.8653
## Neg Pred Value : 0.8585
## Prevalence : 0.5000
## Detection Rate : 0.4286
## Detection Prevalence : 0.4953
## Balanced Accuracy : 0.8619
##
## 'Positive' Class : no
##
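Note that the 'Positive' class above is no, because caret uses the first factor level as the positive class by default. As an optional check (svm_cm_yes is a name introduced here, not part of the original workflow), the class-level metrics can be recomputed with yes, the subscribers, as the positive class:

# Recompute class-level metrics with "yes" as the positive class
svm_cm_yes <- confusionMatrix(svm_pred, y_test_bank, positive = "yes")
svm_cm_yes$byClass[c("Sensitivity", "Specificity", "Precision", "Recall", "F1")]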
# Initialize matrix to store metrics from each experiment
results_matrix <- matrix(NA, nrow = 0, ncol = 6)
colnames(results_matrix) <- c("Model",
"Precision",
"Accuracy",
"Recall",
"F1",
"AUC")
# Probability predictions for AUC calculation
svm_probabilities <- predict(svm_model, newdata = X_test_bank, type = "prob")
positive_class <- "yes"
# AUC
svm_auc_bank <- auc(roc(response = y_test_bank, predictor = svm_probabilities[, positive_class]))
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
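Optionally, the ROC curve behind this AUC can be plotted with pROC. This is a small sketch not in the original workflow, reusing svm_probabilities and y_test_bank from above; levels and direction are set explicitly to silence the messages shown above.

# Build the ROC object explicitly and plot it with the AUC printed on the curve
svm_roc <- roc(response = y_test_bank,
               predictor = svm_probabilities[, positive_class],
               levels = c("no", "yes"), direction = "<")
plot(svm_roc, print.auc = TRUE, legacy.axes = TRUE)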
# Add metrics to the results matrix (ordered to match the column names above;
# byClass metrics use "no" as the positive class, caret's default first level)
results_matrix <- rbind(results_matrix,
c("SVM",
svm_cm_bank$byClass["Precision"],
svm_cm_bank$overall["Accuracy"],
svm_cm_bank$byClass["Recall"],
svm_cm_bank$byClass["F1"],
svm_auc_bank))
# Show results matrix
results_matrix
## Model Precision Accuracy Recall
## [1,] "SVM" "0.865329512893983" "0.861873226111637" "0.857142857142857"
## F1 AUC
## [1,] "0.861216730038023" "0.925176035064699"
After adding the SVM performance to the previous assignment's evaluation, Random Forest remains the optimal model, achieving the highest precision, accuracy, F1-score, and AUC-ROC. It also demonstrates strong recall, making it the most balanced model overall and striking an effective balance between bias and variance. The SVM model performs well overall and comes closest on AUC-ROC, but it does not surpass Random Forest in any category.
Random Forest performance:
Precision - 0.912
Accuracy - 0.936
Recall - 0.966
F1 - 0.951
AUC-ROC - 0.937
Support Vector Machine performance [compared to RF]:
Precision - 0.867 [-0.045]
Accuracy - 0.871 [-0.065]
Recall - 0.862 [-0.104]
F1 - 0.866 [-0.085]
AUC-ROC - 0.926 [-0.011]
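For a side-by-side view, the figures listed above can be tabulated; the comparison data frame below is introduced here for illustration only, with values copied directly from this comparison.

# Tabulate the reported RF and SVM metrics and their differences
comparison <- data.frame(
  Metric       = c("Precision", "Accuracy", "Recall", "F1", "AUC-ROC"),
  RandomForest = c(0.912, 0.936, 0.966, 0.951, 0.937),
  SVM          = c(0.867, 0.871, 0.862, 0.866, 0.926)
)
comparison$Difference <- round(comparison$SVM - comparison$RandomForest, 3)
comparison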