Assignment Three required an analysis of the data set(s) used in Assignment Two, this time using an SVM algorithm, and a comparison of the results with the previous assignment's results, which consisted of experiments with Decision Trees, Random Forest, and AdaBoost. Below is the Process Flowchart I laid out for this assignment.

# Insert a Workflow Process Diagram.

knitr::include_graphics("Assignment Three Process Flowchart.jpg")

# Load required packages. The tidyverse package contains dplyr and the DT package 
# enables the creation of a data table.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DT)

Upload Bank data set.

# The Workflow Process starts by uploading the preprocessed bank data set used 
# in Assignment Two. I named the new data set "a3_bankdata" 
# (for Assignment Three Bankdata) and load it here.

url <- "a2_bankdata.csv"

# To ensure that I have the correct data structure, I forced the column types.
a3_bankdata <- read_csv(url, col_types = "nffffnfffnfnnff")
# Examine the data and its structure.

glimpse(a3_bankdata)
## Rows: 45,211
## Columns: 15
## $ age       <dbl> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job       <fct> management, technician, entrepreneur, blue-collar, unknown, …
## $ marital   <fct> married, single, married, married, single, married, single, …
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, …
## $ default   <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
## $ balance   <dbl> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing   <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y…
## $ loan      <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no…
## $ contact   <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ day       <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month     <fct> may, may, may, may, may, may, may, may, may, may, may, may, …
## $ duration  <dbl> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ poutcome  <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ y         <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …

Split and Balance the Data.

The next step is to split and balance the data with the goal of seeing how well an SVM Model generalizes to new data. Following Nwanganga & Chapple (2020), I used the set.seed() function to reproduce results. I split the data 75% into a training set to fit an SVM Model, and 25% into a testing set to evaluate the model’s performance.

set.seed(1234)

# Split the data 75/25 into training/test sets.

sample_set <- sample(nrow(a3_bankdata), round(nrow(a3_bankdata)*.75), replace = FALSE)
a3_bankdata_train <- a3_bankdata[sample_set, ]
a3_bankdata_test <- a3_bankdata[-sample_set, ]

I ran a test to check the proportionality of the split data. I needed to make sure that the training and testing data are proportional to the original data set for the dependent variable, "y".

# Proportionality tests.

round(prop.table(table(select(a3_bankdata, y), exclude = NULL)), 4) * 100 
## y
##   no  yes 
## 88.3 11.7
round(prop.table(table(select(a3_bankdata_train, y), exclude = NULL)), 4) * 100
## y
##    no   yes 
## 88.23 11.77
round(prop.table(table(select(a3_bankdata_test, y), exclude = NULL)), 4) * 100
## y
##    no   yes 
## 88.53 11.47

The class proportions are consistent across the full, training, and testing data sets, but the classes themselves are imbalanced. The training data needs to be balanced so that the model does not simply learn to favor the majority class. I apply the SMOTE function to balance the training data.

Apply SMOTE and Balance Training Class.

# I used the DMwR package to run the SMOTE function.

library(DMwR)
## Loading required package: lattice
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
set.seed(1234)

# Convert a3_bankdata_train to a data frame.

a3_bankdata_train <- as.data.frame(a3_bankdata_train)

# Apply the SMOTE function to the training data set. 
# I created an object labeled "a3_bankdata_train_smote" for balanced training data 
# that can be used later in other models.

a3_bankdata_train_smote <- SMOTE(y ~ ., data = a3_bankdata_train, perc.over = 100, perc.under = 200)

# I create a table showing the percentage balances.

round(prop.table(table(a3_bankdata_train_smote$y)) * 100, 2)
## 
##  no yes 
##  50  50

Application of the First Kernel.

Creation of a Support Vector Machine Algorithm.

Meyer (2024) discusses the use of Support Vector Machines and provides some R code to analyze outcomes. I followed Meyer (2024) and James et al. (2023) to understand the basics of an SVM. I augmented some of the code with the assistance of ChatGPT by prompting it for R code for an SVM.

# Load required libraries. I need the e1071 package to operationalize an SVM.
# The caret package is required to train the SVM Models.

library(e1071)
library(caret)     
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift

Test if Linearly Separable.

Using the "linear" kernel, an SVM model is created and its decision boundary plotted to determine whether the data is linearly separable. I created a temporary object to test the data. Per James et al. (2023), the response variable needs to be a factor. Although this was already done in Assignment Two, I repeat the R code here to lay out the procedure for testing the training data for linear separability.

# Test the linear separability between the "balance" and "duration" variables, 
# and ensure that the "y" variable is a factor.

df_test <- a3_bankdata_train_smote[, c("balance", "duration", "y")]
df_test$y <- as.factor(df_test$y)

The next step is to run a basic SVM model without hyperparameter tuning.

# With the response converted to a factor, I ran an SVM Model. The kernel is 
# set to "linear" and the cost is arbitrarily set to 10, which penalizes margin 
# violations heavily. "Arbitrarily" means the value was chosen without hyperparameter tuning.

svm_model <- svm(
  y ~ ., 
  data = df_test, 
  kernel = "linear", 
  cost = 10, 
  scale = TRUE
)

The fitted SVM model is then used to predict over a grid of points so the decision boundary can be drawn.

# A grid is created to plot a decision boundary.

make_grid <- function(data, n = 100) {
  x_range <- seq(min(data$balance), max(data$balance), length.out = n)
  y_range <- seq(min(data$duration), max(data$duration), length.out = n)
  expand.grid(balance = x_range, duration = y_range)
}

grid_df <- make_grid(df_test)
grid_df$pred <- predict(svm_model, grid_df)

# Once the grid is created, the data is plotted to see the degree of separation.

# Create a base plot with data and aesthetics.
ggplot(df_test, aes(x = balance, y = duration, color = y)) +
  # Add scatterplot of test points
  geom_point(size = 2) +
  
  # Add decision boundary from SVM predictions.
  geom_contour(
    data = cbind(grid_df, y = as.numeric(grid_df$pred)),
    aes(z = y),
    breaks = c(1.5), # Create a contour line that has a value of 1.5 between class 1 and 2.
    color = "black"
  ) +
  labs(title = "SVM with Linear Kernel", 
       subtitle = "Decision Boundary Test for Linear Separability") +
  theme_minimal()

From the results, it is clear that the data points overlap substantially, indicating nonlinearity. James et al. (2023) suggest testing linearity over a range of costs. Below, I used a loop to refit the SVM model at different cost levels. The output shows that the number of support vectors does not decrease significantly and remains above 9,000 at every cost level. That is, the margins get narrower as the cost increases, but no linear separation appears.

# I created a loop to test different cost levels.

for (c in c(0.01, 0.1, 1, 10, 100)) {
  print(paste("Fitting with cost =", c))
  model <- svm(y ~ ., data = df_test, kernel = "linear", cost = c)
  print(summary(model))
}
## [1] "Fitting with cost = 0.01"
## 
## Call:
## svm(formula = y ~ ., data = df_test, kernel = "linear", cost = c)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
## 
## Number of Support Vectors:  9659
## 
##  ( 4830 4829 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  no yes
## 
## 
## 
## [1] "Fitting with cost = 0.1"
## 
## Call:
## svm(formula = y ~ ., data = df_test, kernel = "linear", cost = c)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.1 
## 
## Number of Support Vectors:  9436
## 
##  ( 4718 4718 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  no yes
## 
## 
## 
## [1] "Fitting with cost = 1"
## 
## Call:
## svm(formula = y ~ ., data = df_test, kernel = "linear", cost = c)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  9411
## 
##  ( 4705 4706 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  no yes
## 
## 
## 
## [1] "Fitting with cost = 10"
## 
## Call:
## svm(formula = y ~ ., data = df_test, kernel = "linear", cost = c)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  9409
## 
##  ( 4704 4705 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  no yes
## 
## 
## 
## [1] "Fitting with cost = 100"
## 
## Call:
## svm(formula = y ~ ., data = df_test, kernel = "linear", cost = c)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  100 
## 
## Number of Support Vectors:  9413
## 
##  ( 4707 4706 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  no yes

Application of the Second Kernel.

Train an SVM model with a “radial” kernel and Test with Different Cross-Validation Combinations.

Since the data is nonlinear, I ran an SVM Model with a "radial" kernel. Per Meyer (2024), SVMs may be sensitive to the proper choice of parameters, and he suggests always checking a range of parameter combinations. Below I ran two different cross-validation schemes, "cv" and "repeatedcv". Using the caret package, I was able to wrap each cross-validation scheme around a radial kernel model.

Cross-Validation Combinations.

# First, I created a simple k-fold cross-validation function.

train_control_cv <- trainControl(
  # The data is split into 5 folds; the model trains on 4 folds and validates on the 
  # remaining fold. This is repeated until each fold has served as the validation set once.
  method = "cv",
  number = 5,           
  savePredictions = "final",
  classProbs = TRUE,    # Set to TRUE to set up the ROC graphs.
  # The summaryFunction will return ROC, Sensitivity, and Specificity performance metrics.
  summaryFunction = twoClassSummary 
)
# Creation of a repeated k-fold cross-validation function.

train_control_repeatedcv <- trainControl(
  # With "repeatedcv" and these parameters, the data is split into 5 folds and the 
  # process is repeated 3 times. Therefore, the model is trained and validated 15 times.
  method = "repeatedcv",        
  number = 5,
  repeats = 3,
  savePredictions = "final",
  classProbs = TRUE,   
  summaryFunction = twoClassSummary
)

Insert cross-validation functions into two different SVM radial kernel models.

# Creation of a radial kernel model with a simple k-fold cross-validation function.

svm_model_cv <- train(
  y ~ ., 
  data = a3_bankdata_train_smote, 
  method = "svmRadial",     
  trControl = train_control_cv, 
  metric = "ROC",
  # The following centers and scales the features (mean 0, standard deviation 1).
  preProcess = c("center", "scale"), 
  # The following tells caret to automatically evaluate 3 candidate values of the 
  # tuning parameter and keep the best-performing combination. 
  tuneLength = 3
)
# Creation of a radial kernel model with a repeated k-fold cross-validation function.

svm_model_repeatedcv <- train(
  y ~ ., 
  data = a3_bankdata_train_smote, 
  method = "svmRadial",     
  trControl = train_control_repeatedcv, 
  metric = "ROC",           
  preProcess = c("center", "scale"), 
  tuneLength = 3
)

Compare ROC performance between two cross-validation radial kernel models.

Now the two models are compared to see the difference in cross-validation techniques.

# Compare ROC performance of the models.

svm_model_cv
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 15968 samples
##    14 predictor
##     2 classes: 'no', 'yes' 
## 
## Pre-processing: centered (40), scaled (40) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 12774, 12774, 12774, 12775, 12775 
## Resampling results across tuning parameters:
## 
##   C     ROC        Sens       Spec     
##   0.25  0.9234361  0.8524568  0.8458174
##   0.50  0.9278156  0.8544604  0.8550858
##   1.00  0.9311583  0.8579673  0.8621001
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01673654
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.01673654 and C = 1.
svm_model_repeatedcv
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 15968 samples
##    14 predictor
##     2 classes: 'no', 'yes' 
## 
## Pre-processing: centered (40), scaled (40) 
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 12774, 12774, 12774, 12774, 12776, 12774, ... 
## Resampling results across tuning parameters:
## 
##   C     ROC        Sens       Spec     
##   0.25  0.9239217  0.8544597  0.8466938
##   0.50  0.9283767  0.8554616  0.8574236
##   1.00  0.9317469  0.8565470  0.8656067
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01681706
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.01681706 and C = 1.
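
Rather than reading the ROC, Sensitivity, and Specificity values off the printed summaries, the same figures can be pulled programmatically from each caret object's results table. The following is a minimal sketch of that idea (my own addition; the column names ROC, Sens, and Spec come from caret's twoClassSummary output):

# Sketch: extract the resampling metrics for the best-performing C value from each model.
best_cv  <- svm_model_cv$results[which.max(svm_model_cv$results$ROC), ]
best_rep <- svm_model_repeatedcv$results[which.max(svm_model_repeatedcv$results$ROC), ]

data.frame(
  Scheme = c("Simple CV", "Repeated CV"),
  C      = c(best_cv$C, best_rep$C),
  ROC    = c(best_cv$ROC, best_rep$ROC),
  Sens   = c(best_cv$Sens, best_rep$Sens),
  Spec   = c(best_cv$Spec, best_rep$Spec)
)

This avoids transcribing the numbers by hand into the summary table that follows.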

I placed the outcomes in a table for clarity.

library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
# Create a summary data frame with the cross-validation schemes and metrics (with some coding assistance from ChatGPT).
confusion_summary <- data.frame(
  Metrics = c("ROC", "Sensitivity", "Specificity"),
  `Simple Cross-Validation` = c(0.9317470, 0.8570882, 0.8480727),
  `Repeated Cross-Validation` = c(0.9313365, 0.8559211, 0.8651473)
)

# Create an HTML table with kableExtra.
confusion_summary %>%
  kbl(caption = "Confusion Matrices Metrics Summary Across Models",
      align = "lcccccc",
      format = "html",
      escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE,
                position = "center") %>%
  scroll_box(height = "auto", width = "100%")
Confusion Matrices Metrics Summary Across Models

Metrics          Simple Cross-Validation   Repeated Cross-Validation
ROC                      0.9317470                   0.9313365
Sensitivity              0.8570882                   0.8559211
Specificity              0.8480727                   0.8651473

ROC Curves and Variable Importance Bar Charts

I plotted the variable importance and the tuning performance (ROC across the candidate cost values) for the two cross-validation models to visualize their performance.

# Plot the variable importance bar chart and the tuning-performance plot for the 
# simple k-fold cross-validation model.
# Variable importance provides an estimate of how much each feature contributed to the final model.
# The tuning-performance plot gives a visual summary of ROC across the candidate cost values.

plot(varImp(svm_model_cv)) 

plot(svm_model_cv)          

# Plot the variable importance bar chart and the tuning-performance plot for the 
# repeated k-fold cross-validation model.

plot(varImp(svm_model_repeatedcv))             

plot(svm_model_repeatedcv)
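
Because classProbs = TRUE and savePredictions = "final" were set in the trainControl objects, the held-out class probabilities are stored in each model and can also be used to draw explicit ROC curves. The following is a minimal sketch of that idea (my own addition), assuming the pROC package, which is not used elsewhere in this document, is installed:

# Sketch: ROC curves built from the saved cross-validation predictions.
library(pROC)

roc_cv <- roc(svm_model_cv$pred$obs,            # observed classes from the held-out folds
              svm_model_cv$pred$yes,            # predicted probability of "yes"
              levels = c("no", "yes"))
roc_repeatedcv <- roc(svm_model_repeatedcv$pred$obs,
                      svm_model_repeatedcv$pred$yes,
                      levels = c("no", "yes"))

plot(roc_cv, col = "steelblue", legacy.axes = TRUE,
     main = "ROC Curves from Held-Out Cross-Validation Predictions")
lines(roc_repeatedcv, col = "firebrick")
legend("bottomright", legend = c("Simple CV", "Repeated CV"),
       col = c("steelblue", "firebrick"), lwd = 2)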

Interpretation

Both cross-validation models produce nearly identical performance metrics. The simple cross-validation model shows a very slight ROC advantage over the repeated cross-validation model, but the difference is negligible, as the tuning plots show. Because repeated cross-validation is known to be more robust than simple cross-validation, I give it a slight edge. What is interesting about the Variable Importance bar charts is that duration appears to be by far the most important feature under both cross-validation schemes.

Compare Confusion Matrices between two cross-validation radial kernel models.

# The predict function makes class predictions on the test set a3_bankdata_test
# and stores the predicted class labels for use in the Confusion Matrix.
# The first model is the simple k-fold cross-validation model.

pred <- predict(svm_model_cv, newdata = a3_bankdata_test)

# The Confusion Matrix compares the predicted labels (pred) to the true labels 
# for the a3_bankdata_test$y data.

confusionMatrix(pred, a3_bankdata_test$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  8547  205
##        yes 1459 1092
##                                           
##                Accuracy : 0.8528          
##                  95% CI : (0.8461, 0.8593)
##     No Information Rate : 0.8853          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.49            
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8542          
##             Specificity : 0.8419          
##          Pos Pred Value : 0.9766          
##          Neg Pred Value : 0.4281          
##              Prevalence : 0.8853          
##          Detection Rate : 0.7562          
##    Detection Prevalence : 0.7743          
##       Balanced Accuracy : 0.8481          
##                                           
##        'Positive' Class : no              
## 
# The second one is for the repeated k-fold cross-validation.

pred <- predict(svm_model_repeatedcv, newdata = a3_bankdata_test)
confusionMatrix(pred, a3_bankdata_test$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  8558  209
##        yes 1448 1088
##                                           
##                Accuracy : 0.8534          
##                  95% CI : (0.8467, 0.8599)
##     No Information Rate : 0.8853          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4903          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8553          
##             Specificity : 0.8389          
##          Pos Pred Value : 0.9762          
##          Neg Pred Value : 0.4290          
##              Prevalence : 0.8853          
##          Detection Rate : 0.7571          
##    Detection Prevalence : 0.7756          
##       Balanced Accuracy : 0.8471          
##                                           
##        'Positive' Class : no              
## 
# Load the required packages to create a table. The knitr package renders tables in RMarkdown and the kableExtra package adds style and format to tables.

library(knitr)
library(kableExtra)

# Create a summary data frame with the cross-validation schemes and metrics (with some coding assistance from ChatGPT).
confusion_summary <- data.frame(
  Metrics = c("Accuracy", "Mcnemar's P-Value", "Sensitivity", "Specificity",
              "Pos Pred Value", "Neg Pred Value", "Balanced Accuracy"),
  `Simple Cross-Validation` = c(0.8525, "<2e-16", 0.8540, 0.8412, 0.9765, 0.4275, 0.8476),
  `Repeated Cross-Validation` = c(0.8528, "<2e-16", 0.8543, 0.8412, 0.9765, 0.4275, 0.8477)
)

# Create an HTML table with kableExtra.
confusion_summary %>%
  kbl(caption = "Confusion Matrices Metrics Summary Across Models",
      align = "lcccccc",
      format = "html",
      escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE,
                position = "center") %>%
  scroll_box(height = "auto", width = "100%")
Confusion Matrices Metrics Summary Across Models

Metrics              Simple Cross-Validation   Repeated Cross-Validation
Accuracy                      0.8525                      0.8528
Mcnemar's P-Value             <2e-16                      <2e-16
Sensitivity                   0.8540                      0.8543
Specificity                   0.8412                      0.8412
Pos Pred Value                0.9765                      0.9765
Neg Pred Value                0.4275                      0.4275
Balanced Accuracy             0.8476                      0.8477

Interpretation

Consistent with the ROC analysis, the two cross-validation models produce nearly identical performance metrics. Because repeated cross-validation is known to be more robust than simple cross-validation, it has a slight edge. Therefore, I advance with the repeated cross-validation SVM Model with the radial kernel.
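
As a quick sanity check before the manual tuning below (my own addition), the tuning values carried forward with the selected model can be read directly from the caret object:

# The best tuning parameters retained by caret for the repeated cross-validation model.
# This should return sigma = 0.01681706 and C = 1, matching the printed summary above.
svm_model_repeatedcv$bestTune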

Manual Hyperparameter Tuning for Radial Kernel SVM Model with Repeated Cross Validation.

Instead of using tuneLength = 3, which lets caret choose candidate values of sigma (controlling decision-boundary smoothness) and C (controlling regularization strength), I can set these values manually by defining a tuning grid.

# Creation of a tuning grid. The values are arbitrary starting points chosen to see 
# in which direction the optimum lies. Only two values per parameter were chosen 
# because of execution timeouts with larger grids.

svm_grid <- expand.grid(
  sigma = c(0.001, 0.01),  
  C = c(0.25, 1)             
)

The grid values are then fed back into the SVM Model with the radial kernel, this time using the repeated k-fold cross-validation control.

set.seed(1234)

# Creation of a radial kernel model with the repeated k-fold cross-validation function 
# and the tuning grid manually controlled.

svm_model_controlled <- train(
  y ~ ., 
  data = a3_bankdata_train_smote,
  method = "svmRadial",
  trControl = train_control_repeatedcv,  
  tuneGrid = svm_grid,
  metric = "ROC",
  preProcess = c("center", "scale")
)

Now I compare the hyperparameter tuning applied automatically by the caret package with the manually controlled tuning grid.

print(svm_model_controlled)     # View best parameters and performance
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 15968 samples
##    14 predictor
##     2 classes: 'no', 'yes' 
## 
## Pre-processing: centered (40), scaled (40) 
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 12774, 12776, 12774, 12774, 12774, 12774, ... 
## Resampling results across tuning parameters:
## 
##   sigma  C     ROC        Sens       Spec     
##   0.001  0.25  0.9080919  0.8506181  0.8101202
##   0.001  1.00  0.9108232  0.8532902  0.8143368
##   0.010  0.25  0.9207372  0.8536658  0.8375501
##   0.010  1.00  0.9290842  0.8588013  0.8535821
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.01 and C = 1.
plot(svm_model_controlled)      # Visualizes tuning results

varImp(svm_model_controlled)    # Variable importance (optional)
## ROC curve variable importance
## 
##           Importance
## duration     100.000
## poutcome      39.365
## month         31.519
## contact       25.022
## balance       21.639
## housing       21.332
## marital       19.088
## education     13.572
## campaign      12.724
## loan          12.639
## day            8.671
## default        7.571
## job            2.938
## age            0.000

Final SVM Model Confusion Matrix and Key SVM Performance Metrics.

I made predictions and created a confusion matrix on the test data for the manually tuned radial kernel model with repeated cross-validation.

# Predict class labels.
predicted_class <- predict(svm_model_controlled, newdata = a3_bankdata_test)

# Create the Confusion Matrix.
confusionMatrix(predicted_class, a3_bankdata_test$y, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  8589  216
##        yes 1417 1081
##                                          
##                Accuracy : 0.8555         
##                  95% CI : (0.8489, 0.862)
##     No Information Rate : 0.8853         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.4931         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.83346        
##             Specificity : 0.85838        
##          Pos Pred Value : 0.43275        
##          Neg Pred Value : 0.97547        
##              Prevalence : 0.11475        
##          Detection Rate : 0.09564        
##    Detection Prevalence : 0.22100        
##       Balanced Accuracy : 0.84592        
##                                          
##        'Positive' Class : yes            
## 

Compare Assignment Two Algorithm Performance Metrics with the Final SVM Model.

The following code generates a styled HTML table summarizing key performance metrics for the four machine learning algorithms (DT, RF, AdaBoost, and SVM).

library(knitr)
library(kableExtra)

# Create a summary data frame with the algorithms and metrics (with some coding assistance from ChatGPT).
confusion_summary <- data.frame(
  Metrics = c("Accuracy", "Mcnemar's P-Value", "Sensitivity", "Specificity",
              "Pos Pred Value", "Neg Pred Value", "Balanced Accuracy"),
  `DT` = c(0.8435, "<2e-16", 0.73323, 0.85779, 0.40059, 0.96125, 0.79551),
  `RF` = c(0.8457, "<2e-16", 0.85736, 0.84419, 0.41632, 0.97857, 0.85078),
  `Ada` = c(0.8616, "<2e-16", 0.79183, 0.87068, 0.44248, 0.96994, 0.83125),
  `SVM` = c(0.8555, "<2e-16", 0.83346, 0.85838, 0.43275, 0.97547, 0.84592)
)

# Create an HTML table with kableExtra.
confusion_summary %>%
  kbl(caption = "Confusion Matrices Metrics Summary Across Models",
      align = "lcccccc",
      format = "html",
      escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE,
                position = "center") %>%
  scroll_box(height = "auto", width = "100%")
Confusion Matrices Metrics Summary Across Models

Metrics                   DT        RF        Ada       SVM
Accuracy                0.8435    0.8457    0.8616    0.8555
Mcnemar's P-Value       <2e-16    <2e-16    <2e-16    <2e-16
Sensitivity             0.73323   0.85736   0.79183   0.83346
Specificity             0.85779   0.84419   0.87068   0.85838
Pos Pred Value          0.40059   0.41632   0.44248   0.43275
Neg Pred Value          0.96125   0.97857   0.96994   0.97547
Balanced Accuracy       0.79551   0.85078   0.83125   0.84592

To better understand each algorithm's relative performance, I created (with the help of AI) a table that ranks the models on every performance metric.

library(knitr)
library(kableExtra)
library(dplyr)

# Step 1: Create the original data frame
algorithm_ranking <- data.frame(
  Metrics = c("Accuracy", "Sensitivity", "Specificity",
              "Pos Pred Value", "Neg Pred Value", "Balanced Accuracy"),
  DT = c(0.8435, 0.73323, 0.85779, 0.40059, 0.96125, 0.79551),
  RF = c(0.8457, 0.85736, 0.84419, 0.41632, 0.97857, 0.85078),
  Ada = c(0.8616, 0.79183, 0.87068, 0.44248, 0.96994, 0.83125),
  SVM = c(0.8555, 0.83346, 0.85838, 0.43275, 0.97547, 0.84592)
)

# Step 2: Transpose so algorithms are rows
algorithm_matrix <- as.data.frame(t(algorithm_ranking[, -1]))
colnames(algorithm_matrix) <- algorithm_ranking$Metrics
algorithm_matrix$Model <- rownames(algorithm_matrix)
algorithm_matrix <- algorithm_matrix %>% relocate(Model)

# Step 3: Compute ranks (1 = best)
ranking_matrix <- algorithm_matrix
ranking_matrix[,-1] <- lapply(algorithm_matrix[,-1], function(x) rank(-x, ties.method = "first"))

# Step 4: Add range of actual metric values
range_values <- apply(algorithm_matrix[,-1], 1, function(x) round(max(x) - min(x), 5))
ranking_matrix$Range <- range_values

# Step 5: Display cleanly with single-line caption
ranking_matrix %>%
  kbl(
    caption = "Model Rankings by Metric (1 = Highest Score, 4 = Lowest Score). 
              The Range is the difference between highest and lowest metric value per model.",
    align = "lccccccr",
    format = "html",
    row.names = FALSE,
    escape = FALSE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  ) %>%
  scroll_box(height = "auto", width = "100%")
Model Rankings by Metric (1 = Highest Score, 4 = Lowest Score). The Range is the difference between the highest and lowest metric value per model.

Model   Accuracy   Sensitivity   Specificity   Pos Pred Value   Neg Pred Value   Balanced Accuracy   Range
DT          4           4             3               4                4                  4          0.56066
RF          3           1             4               3                1                  1          0.56225
Ada         1           3             1               1                3                  3          0.52746
SVM         2           2             2               2                2                  2          0.54272

Final Model Selection.

In terms of accuracy, all of the models perform well, but the AdaBoost model performs best at 86.16%. In terms of Sensitivity, the ability to detect true positives, RF and SVM work best (85.74% and 83.35%, respectively), while DT trails significantly at 73.32%. In terms of Specificity, the ability to detect true negatives, all models excel, with AdaBoost performing best at 87.07%. In terms of Positive Predictive Value, the proportion of positive predictions that are actually correct, all models score relatively low, indicating a substantial number of false positives; AdaBoost performs best at 44.25%. In terms of Negative Predictive Value, the proportion of negative predictions that are actually correct, all models excel; RF and SVM perform best at 97.86% and 97.55%, respectively.

Rankings

In terms of the rankings, AdaBoost and RF performed best, splitting the number one ranking across the metrics. Interestingly, SVM ranked second on every metric and was the most consistent model overall. DT performed worst on almost every metric.

Ad Hoc Analysis

The confusion matrix outputs should be put into perspective in terms of what matters most when interpreting outcomes. I analyzed them further in terms of the cost of responses to telemarketing calls. For insight, I asked: if incorrect "no" predictions for the dependent variable were costly, which model would be recommended? A "no" prediction is costly when the true answer is "yes", because a revenue-generating customer is lost. Therefore, the most important consideration in model selection would be minimizing false negatives. We would be interested in a higher Sensitivity, capturing as many actual positive values as possible, and a higher Negative Predictive Value, ensuring that "no" predictions are actually correct. RF has the highest Sensitivity at 85.74% and the highest Negative Predictive Value at 97.86%.
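
To make the cost argument concrete, the following small sketch (my own addition) recomputes the two relevant metrics from the counts in the final SVM confusion matrix shown above, where a false negative is a customer predicted "no" who actually said "yes":

# Counts taken from the final SVM confusion matrix (positive class = "yes").
tn <- 8589   # predicted no,  actual no
fn <- 216    # predicted no,  actual yes (lost revenue-generating customers)
tp <- 1081   # predicted yes, actual yes

sensitivity <- tp / (tp + fn)   # share of actual "yes" customers the model captures
npv         <- tn / (tn + fn)   # share of "no" predictions that are truly "no"
round(c(Sensitivity = sensitivity, NPV = npv), 5)

These reproduce the SVM's 83.35% Sensitivity and 97.55% Negative Predictive Value, the two figures weighed against RF's 85.74% and 97.86% in the recommendation below.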

Final Model Recommendation

Random Forest

References

James, G., et al. (2023). An Introduction to Statistical Learning with Applications in R (2nd ed.). Springer.

Meyer, D. (2024). Support Vector Machines: The Interface to libsvm in Package e1071 (R package vignette). FH Technikum Wien, Austria.

Nwanganga, F., & Chapple, M. (2020). Practical Machine Learning in R (1st ed.). John Wiley & Sons. https://doi.org/10.1002/9781119591542