Introduction

This RMD supports Assignment 03 for CUNY SPS DATA 622, where an SVM algorithm model is built with a dataset from a Portuguese bank’s marketing campaign. The bank used phone calls to predict whether clients would subscribe to a term deposit. The classification goal is to predict if the client will subscribe (yes/no) a term deposit (variable y) by applying machine learning techniques to analyze the data and uncover the most effective strategies for boosting customer subscriptions in future campaigns. Results from this RMD will be compared to results from Assignment02’s models.

Provided Articles (Analysis)

Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection: A Comparative Study

  • Five machine learning algorithms (neural networks, random forests, XGBoost, logistic regression, and support vector machines (SVM)) were tested on a Covid-19 dataset based on commonly taken laboratory tests with the goal of predicting positive Covid-19 cases.

  • Accuracy is one of the most commonly used performance measures. This experiment uses an imbalanced dataset so accuracy, precision, recall, and F1-measure are all used.

A novel approach to predict COVID-19 using support vector machine

  • Detecting COVID-19 based on symptoms is challenging due to uncertainty and a lack of reliable reference datasets. Most infected patients showed symptoms like high fever, cough with sputum, and shortness of breath. An SVM classifier was chosen for its ability to handle non-linear problems using the kernel trick. The data was split 70% for training and 30% for testing, and a linear kernel was used to separate the classes with a maximum margin. A cost parameter of C = 10 was selected to prioritize accurate classification of COVID-19 cases with minimal misclassifications.

Academic Content(Analysis)

The use of predictive analytics in finance

link: https://www.sciencedirect.com/science/article/pii/S2405918822000071

  • 3.2. Earnings prediction

    • Corporate earnings are easier to predict than economic trends, but still complex. Nonlinear models like decision trees and neural networks perform better than regressions. SVMs are especially effective in decision support systems and often outperform models like ARIMA.
  • 4.1. Customer acquisition and attrition prediction

    • Predictive analytics in DSS often focus on customer behavior, including acquisition and retention. Methods like SVM and decision trees have been used to predict customer churn, with SVM often showing higher accuracy. Studies found decision trees outperform some traditional methods, while SVMs are frequently recommended for their effectiveness.
  • 4.2. Customer segmentation and sales prediction

    • Sales prediction and customer segmentation are key parts of DSS. Techniques like regression forests and decision trees help classify customer attributes for better targeting. SVMs are used for tasks like predicting satisfaction from social media sentiment and customer lifetime value. While linear models often fall short, SVMs and kernel-based methods offer better accuracy for complex, nonlinear patterns. These methods support more precise decision-making in finance and marketing.

Predicting Employee Attrition Using Machine Learning Approaches

link: https://www.mdpi.com/2076-3417/12/13/6424

  • 4. Proposed Machine Learning Approaches

    • This study calls decision trees Decision Tree Classifiers (DTC). It defines and compares how the machine learning algorithms relate to the data, such as SVM support vectors and DTC mirroring human thinking.
  • 5. Results and Discussions

    • Accuracy and other result analysis shows which models performed the best. SVM outperformed DTC just barely.

Machine Learning for Predicting Employee Attrition

link: https://thesai.org/Downloads/Volume12No11/Paper_49-Machine_Learning_for_Predicting_Employee_Attrition.pdf

  • B. Machine Learning Classification Algorithms

    • This section of the paper discusses the pros and cons of different machine learning algorithms for the data they have. There are visuals of what a decision tree would look like for the specific attrition data. Decision Trees and SVM modeling are discussed for the same data.

Load Libraries

library(tidyverse)
library(ggplot2)
library(dplyr)
library(caret)
library(kernlab)
library(pROC)

Data Prep

The below code mirrors the previous assignments.

# Read csv from github
bank_df <- read.csv("https://raw.githubusercontent.com/evanskaylie/DATA622/refs/heads/main/bank-full.csv", sep = ";")

# Change character cols to factors
character_columns <- sapply(bank_df, is.character)
bank_df[character_columns] <- lapply(bank_df[character_columns], as.factor)

# Check the data
head(bank_df)
##   age          job marital education default balance housing loan contact day
## 1  58   management married  tertiary      no    2143     yes   no unknown   5
## 2  44   technician  single secondary      no      29     yes   no unknown   5
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown   5
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown   5
## 5  33      unknown  single   unknown      no       1      no   no unknown   5
## 6  35   management married  tertiary      no     231     yes   no unknown   5
##   month duration campaign pdays previous poutcome  y
## 1   may      261        1    -1        0  unknown no
## 2   may      151        1    -1        0  unknown no
## 3   may       76        1    -1        0  unknown no
## 4   may       92        1    -1        0  unknown no
## 5   may      198        1    -1        0  unknown no
## 6   may      139        1    -1        0  unknown no
summary(bank_df)
##       age                 job           marital          education    
##  Min.   :18.00   blue-collar:9732   divorced: 5207   primary  : 6851  
##  1st Qu.:33.00   management :9458   married :27214   secondary:23202  
##  Median :39.00   technician :7597   single  :12790   tertiary :13301  
##  Mean   :40.94   admin.     :5171                    unknown  : 1857  
##  3rd Qu.:48.00   services   :4154                                     
##  Max.   :95.00   retired    :2264                                     
##                  (Other)    :6835                                     
##  default        balance       housing      loan            contact     
##  no :44396   Min.   : -8019   no :20081   no :37967   cellular :29285  
##  yes:  815   1st Qu.:    72   yes:25130   yes: 7244   telephone: 2906  
##              Median :   448                           unknown  :13020  
##              Mean   :  1362                                            
##              3rd Qu.:  1428                                            
##              Max.   :102127                                            
##                                                                        
##       day            month          duration         campaign     
##  Min.   : 1.00   may    :13766   Min.   :   0.0   Min.   : 1.000  
##  1st Qu.: 8.00   jul    : 6895   1st Qu.: 103.0   1st Qu.: 1.000  
##  Median :16.00   aug    : 6247   Median : 180.0   Median : 2.000  
##  Mean   :15.81   jun    : 5341   Mean   : 258.2   Mean   : 2.764  
##  3rd Qu.:21.00   nov    : 3970   3rd Qu.: 319.0   3rd Qu.: 3.000  
##  Max.   :31.00   apr    : 2932   Max.   :4918.0   Max.   :63.000  
##                  (Other): 6060                                    
##      pdays          previous           poutcome       y        
##  Min.   : -1.0   Min.   :  0.0000   failure: 4901   no :39922  
##  1st Qu.: -1.0   1st Qu.:  0.0000   other  : 1840   yes: 5289  
##  Median : -1.0   Median :  0.0000   success: 1511              
##  Mean   : 40.2   Mean   :  0.5803   unknown:36959              
##  3rd Qu.: -1.0   3rd Qu.:  0.0000                              
##  Max.   :871.0   Max.   :275.0000                              
## 
sum(is.na(bank_df))
## [1] 0
# Update
bank_df$y <- as.factor(bank_df$y)

Data looks good. There are plenty of observations and the minority class (yes) is 13% of the majority class (no) in count. Undersampling the majority class will bring a balance that will help the SVM model.

# Set seed for reproducability 
set.seed(64)

# Separate into majority and minority classes
majority_class <- names(sort(table(bank_df$y), decreasing = TRUE))[1]
minority_class <- names(sort(table(bank_df$y), decreasing = TRUE))[2]

majority_df <- bank_df[bank_df$y == majority_class, ]
minority_df <- bank_df[bank_df$y == minority_class, ]

# Undersample the majority class to match the minority
undersampled_majority <- majority_df[sample(nrow(majority_df), nrow(minority_df)), ]

# Combine to create a balanced dataset
balanced_bank <- rbind(undersampled_majority, minority_df)

# Shuffle the rows
balanced_bank <- balanced_bank[sample(nrow(balanced_bank)), ]

# Confirm new class balance
table(balanced_bank$y)
## 
##   no  yes 
## 5289 5289
# Set seed for reproducability 
set.seed(64)

# Create partition of 80-20
index_bank <- createDataPartition(balanced_bank$y, p = .8, list = FALSE)

# One-hot encode before model training
dummies_bank <- dummyVars(y ~ ., data = balanced_bank)
X_dummified <- predict(dummies_bank, newdata = balanced_bank)

# Split X into training and testing 
X_train_bank <- X_dummified[index_bank, ]
X_test_bank  <- X_dummified[-index_bank, ]

# Split y into training and testing (only the 'y' column)
y_train_bank <- balanced_bank[index_bank, "y", drop = TRUE]
y_test_bank  <- balanced_bank[-index_bank, "y", drop = TRUE]

Train the SVM model

# Tune the SVM  model
ctrl <- trainControl(method = "cv", 
                     classProbs = TRUE,
                     savePredictions = "final",
                     summaryFunction = twoClassSummary)

svm_model <- train(x = X_train_bank,
                   y = y_train_bank,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   metric = "ROC",      # use AUC as the optimization goal
                   tuneLength = 3,
                   trControl = ctrl)

# svm_model
  # > The final values used for the model were sigma = 0.01313339 and C = 1.

# Predict on the model 
svm_pred <- predict(svm_model, newdata = X_test_bank)

#Get the test set performance values
svm_performance <- postResample(pred = svm_pred, obs = y_test_bank)
svm_performance
##  Accuracy     Kappa 
## 0.8618732 0.7237465
# Predict using the final model and calculate confusion matrix
svm_pred <- predict(svm_model, newdata = X_test_bank)
svm_cm_bank <- confusionMatrix(svm_pred, y_test_bank)

# Show the final confusion matrix
print(svm_cm_bank)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  906 141
##        yes 151 916
##                                           
##                Accuracy : 0.8619          
##                  95% CI : (0.8464, 0.8763)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7237          
##                                           
##  Mcnemar's Test P-Value : 0.5984          
##                                           
##             Sensitivity : 0.8571          
##             Specificity : 0.8666          
##          Pos Pred Value : 0.8653          
##          Neg Pred Value : 0.8585          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4286          
##    Detection Prevalence : 0.4953          
##       Balanced Accuracy : 0.8619          
##                                           
##        'Positive' Class : no              
## 
# Initialize matrix to store metrics from each experiment
results_matrix <- matrix(NA, nrow = 0, ncol = 6) 
colnames(results_matrix) <- c("Model",
                              "Precision",
                              "Accuracy", 
                              "Recall", 
                              "F1", 
                              "AUC")
# Probability predictions for AUC calculation
svm_probabilities <- predict(svm_model, newdata = X_test_bank, type = "prob")

positive_class <- "yes"

# AUC
svm_auc_bank <- auc(roc(response = y_test_bank, predictor = svm_probabilities[, positive_class]))
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
# Add metrics to the results matrix
results_matrix <- rbind(results_matrix, 
                        c("SVM",
                          svm_cm_bank$overall["Accuracy"], 
                          svm_cm_bank$byClass["Precision"],
                          svm_cm_bank$byClass["Recall"],
                          svm_cm_bank$byClass["F1"],
                          svm_auc_bank))

# Show results matrix
results_matrix
##      Model Precision           Accuracy            Recall             
## [1,] "SVM" "0.861873226111637" "0.865329512893983" "0.857142857142857"
##      F1                  AUC                
## [1,] "0.861216730038023" "0.925176035064699"

Analysis and Conclusion

After adding the SVM performance to the previous assignment’s evaluation, Random Forest stays as the optimal model, achieving the highest precision, accuracy, F1-score, and AUC-ROC. It also demonstrated strong recall, making it the most balanced model overall. Random Forest strikes an effective balance between bias and variance, providing the best combination of all metrics. The SVM model performs well overall and excels at the AUC-ROC statistic; however, it does not beat out Random Forest performance in any category. 

Random Forest performance:

Support Vector Machine performance [compared to RF]: