Background

The assignment is to apply the SVM algorithm to the UCI “Bank Marketing” dataset (detailed description at: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing) and to compare the results with those from the previous homework.

The data is related to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls; often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’).

The classification goal is to predict whether the client will subscribe to a term deposit (variable y). We will use the bank-full.csv file.

Input predictors:

# bank client data:

1 - age (numeric)

2 - job : type of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”, “blue-collar”,“self-employed”,“retired”,“technician”,“services”)

3 - marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed)

4 - education (categorical: “unknown”,“secondary”,“primary”,“tertiary”)

5 - default: has credit in default? (binary: “yes”,“no”)

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (binary: “yes”,“no”)

8 - loan: has personal loan? (binary: “yes”,“no”)

# related with the last contact of the current campaign:

9 - contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”)

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)

12 - duration: last contact duration, in seconds (numeric)

# other attributes:

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)

Output variable (desired target):

17 - y - has the client subscribed a term deposit? (binary: “yes”,“no”)

Missing values: none

Data Preparation

1. Importing Libraries

#Import needed libraries

library(ggplot2)
library(readr) # to use the read_csv function
library(dplyr) # to use filter, mutate, arrange functions, etc.
library(tidyr) # to use pivot_longer function

library(e1071)  # For skewness function
library(corrplot)

library(ROSE)
library(smotefamily)

library(caret) # varImp, trainControl, train functions

library(rpart) # for decision tree rpart()
library(rpart.plot) # for decision tree rpart.plot()

# ROC Curve
library(pROC)

# Precision-Recall Curve
library(PRROC)

# randomForest
library(randomForest)

# XGBoost
library(xgboost)      

2. Data Ingestion and inspection

The data analysis shows there are 45211 observations and 17 variables. Some of the variables are not of the correct type, so we convert them to the correct data types below.

bank_raw <- read.csv("https://raw.githubusercontent.com/datanerddhanya/DATA622/refs/heads/main/bank-full.csv")

head(bank_raw)
##   age          job marital education default balance housing loan contact day
## 1  58   management married  tertiary      no    2143     yes   no unknown   5
## 2  44   technician  single secondary      no      29     yes   no unknown   5
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown   5
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown   5
## 5  33      unknown  single   unknown      no       1      no   no unknown   5
## 6  35   management married  tertiary      no     231     yes   no unknown   5
##   month duration campaign pdays previous poutcome  y
## 1   may      261        1    -1        0  unknown no
## 2   may      151        1    -1        0  unknown no
## 3   may       76        1    -1        0  unknown no
## 4   may       92        1    -1        0  unknown no
## 5   may      198        1    -1        0  unknown no
## 6   may      139        1    -1        0  unknown no

3. Change the predictors to the correct data types.

bank_transform <- bank_raw
bank_transform$job <- as.factor(bank_raw$job)
bank_transform$marital <- as.factor(bank_raw$marital)
bank_transform$education <- as.factor(bank_raw$education)
bank_transform$default <- as.factor(bank_raw$default)
bank_transform$balance <- as.integer(bank_raw$balance)
bank_transform$housing <- as.factor(bank_raw$housing)
bank_transform$loan <- as.factor(bank_raw$loan)
bank_transform$contact <- as.factor(bank_raw$contact)
bank_transform$month <- as.factor(bank_raw$month)
bank_transform$pdays <- as.integer(bank_raw$pdays)
bank_transform$poutcome <- as.factor(bank_raw$poutcome)
bank_transform$y <- as.factor(bank_raw$y)
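
Equivalently, the same conversions can be written more compactly with dplyr; a sketch (kept commented out, since the explicit version above is the one that was run):

# Compact alternative: convert every character column to a factor in one step
# bank_transform <- bank_raw %>%
#   mutate(across(where(is.character), as.factor))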

str(bank_transform)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

4. Target variable as a new numeric variable

Some analyses require the target in numeric (0/1) format as well. The conversion is kept commented out here, since the factor form is used throughout this report.

#bank_transform$y_numeric <- ifelse(bank_transform$y == "yes", 1, 0)
#bank_transform$y_numeric <- as.integer(bank_transform$y_numeric)

Pre-processing

1. Data Cleaning

There are no missing values, and a direct check for duplicate rows is shown below. Several categorical variables contain “unknown” entries; these are retained as a regular factor level rather than converted to NA, so the imputation steps that follow are defensive and change nothing here.

# Count missing values per column
colSums(is.na(bank_transform ))  
##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0
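
A direct check for duplicate rows (a quick sketch supporting the claim above; 0 means no duplicates):

# Count duplicate rows
sum(duplicated(bank_transform))
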
# Replace missing numerical values with median
bank_final <- bank_transform |>
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))

# Replace missing categorical values with the most common value (mode)
for (col in names(bank_final)) {
  if (is.factor(bank_final[[col]])) {
    mode_val <- names(sort(table(bank_final[[col]]), decreasing = TRUE))[1]
    bank_final[[col]][is.na(bank_final[[col]])] <- mode_val
  }
}
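
An equivalent vectorized version of the mode imputation (a sketch using dplyr; mode_level is a small helper introduced here for illustration):

# Helper: most frequent level of a factor
# mode_level <- function(x) names(which.max(table(x)))
# bank_final <- bank_final |>
#   mutate(across(where(is.factor), ~ replace(.x, is.na(.x), mode_level(.x))))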

2. Feature Engineering

# Create Age Groups
bank_final_add <- bank_final %>%
  mutate(age_group = case_when(
    age <= 30 ~ "18-30",
    age > 30 & age <= 40 ~ "31-40",
    age > 40 & age <= 50 ~ "41-50",
    age > 50 & age <= 60 ~ "51-60",
    age > 60 ~ "60+"
  ))

# Categorize Balance Levels
bank_final_add$balance_group <- cut(bank_final_add$balance, 
                       breaks = quantile(bank_final_add$balance, probs = seq(0, 1, 0.2)),
                       labels = c("Very Low", "Low", "Medium", "High", "Very High"),
                       include.lowest = TRUE,   # Ensures the lowest value is included
                       right = FALSE            # Makes the intervals [a, b) instead of (a, b]
  )

# Categorize Contact Duration
bank_final_add <- bank_final_add %>%
  mutate(duration_category = case_when(
    duration < 100 ~ "Short",
    duration >= 100 & duration <= 300 ~ "Medium",
    duration > 300 ~ "Long"
  ))

# Convert categorical variables to factors
bank_final_add <- bank_final_add %>% mutate(across(where(is.character), as.factor))
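
A quick sanity check of the engineered features (a sketch; the exact counts depend on the data):

# Inspect the distribution of each new feature
table(bank_final_add$age_group)
table(bank_final_add$balance_group)
table(bank_final_add$duration_category)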

3. Imbalanced Data

# Check imbalance
table(bank_final_add$y)
## 
##    no   yes 
## 39922  5289
# Balance the classes using ROSE (Random Over-Sampling Examples)
bank_final_balanced <- ROSE(y ~ ., data = bank_final_add, seed = 123)$data
table(bank_final_balanced$y)
## 
##    no   yes 
## 22885 22326
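
Note that ROSE() generates synthetic examples rather than simply duplicating minority rows; the balanced copy is kept for reference, while the experiments below instead handle imbalance with down-sampling inside trainControl. If plain random over-sampling is preferred, ROSE’s ovun.sample() is an alternative (a sketch, kept commented since it is not used downstream):

# Alternative: simple random over-sampling of the minority class
# bank_oversampled <- ovun.sample(y ~ ., data = bank_final_add,
#                                 method = "over", seed = 123)$data
# table(bank_oversampled$y)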

4. Split the data into train and test

# 80% train data, 20% test data
set.seed(1234)
# Stratify the split on the target of the data actually being split (bank_final_add);
# class imbalance is handled during training via down-sampling in trainControl
sample_set <- createDataPartition(bank_final_add$y, p = 0.8, list = FALSE)

bank_train <- bank_final_add[sample_set, ]

bank_test <- bank_final_add[-sample_set, ]

5. Relevel the target variable so that “yes” is the positive class

By default the factor levels are ordered “no”, “yes”. caret treats the first level as the positive class, so we relevel the factor to “yes”, “no”.

# Verify levels before training
levels(bank_train$y)
## [1] "no"  "yes"
# Re-level the target variable by explicitly setting "yes" as the first level of the factor y. This makes "yes" the positive class.
bank_train$y <- factor(bank_train$y, levels = c("yes", "no"))
bank_test$y <- factor(bank_test$y, levels = c("yes", "no"))
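
A quick check that the stratified split preserved the class proportions (a sketch; the exact values depend on the seed):

# Class proportions in train and test should be similar
prop.table(table(bank_train$y))
prop.table(table(bank_test$y))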

Experiments using the SVM algorithm

Experiment 7. Support Vector Machine with a linear kernel

Objective: Determine whether the performance of the linear-kernel SVM improves when tuning the cost parameter (C).

Evaluation metric: Accuracy, ROC-AUC

Results/Run:

  • The SVM linear model was trained on 36169 samples with 19 predictors and 2 classes: ‘yes’, ‘no’.

  • Cross-validated accuracy was best at C = 10, reaching 0.8319.

  • The validation plot confirms that C = 10 provides the best accuracy.

  • Number of support vectors: 3349.

  • Training error: 0.157788.

  • Duration, contact and housing were the most important features.

  • Test prediction accuracy is 0.8326 and ROC-AUC = 0.844.

Conclusion/Recommendation:

An AUC value of 0.844 means the model does a good job distinguishing between subscribers and non-subscribers.

The relatively high accuracy suggests an approximately linear relationship between some predictors and the target variable.

Recommendation: try other kernels, such as the radial basis function (RBF).

set.seed(100)
# cross-validation as the resampling method for model training
# Down-sampling deals with class imbalance: it randomly removes examples from
# the majority class to balance the classes during training.
control <- trainControl(method = "cv", number = 5, sampling = "down")

# Fit the model (tuneGrid supplies the candidate values of the cost parameter C)
svmModel_linear <- train(y ~ .,
                   data = bank_train,
                   method = "svmLinear",
                   preProcess = c("center", "scale"),
                   trControl = control,
                   tuneGrid = expand.grid(C = c(0.01, 0.1, 1, 10)))

#display the model results
svmModel_linear
## Support Vector Machines with Linear Kernel 
## 
## 36169 samples
##    19 predictor
##     2 classes: 'yes', 'no' 
## 
## Pre-processing: centered (52), scaled (52) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 28935, 28936, 28935, 28935, 28935 
## Addtional sampling using down-sampling prior to pre-processing
## 
## Resampling results across tuning parameters:
## 
##   C      Accuracy   Kappa    
##    0.01  0.8203434  0.4311621
##    0.10  0.8264260  0.4450691
##    1.00  0.8290248  0.4478733
##   10.00  0.8319277  0.4521268
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 10.
#validation plot
plot(svmModel_linear)

#summary
svmModel_linear$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 10 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 3349 
## 
## Objective Function Value : -33216.03 
## Training error : 0.157788
# Important Variables
plot(varImp(svmModel_linear))

# predict for the test data
pred_svm_exp1 <- predict(svmModel_linear, bank_test)

# Generate the confusion matrix
cm_svm_exp1 <- confusionMatrix(pred_svm_exp1 , bank_test$y)

cm_svm_exp1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes  932 1362
##        no   152 6596
##                                           
##                Accuracy : 0.8326          
##                  95% CI : (0.8247, 0.8402)
##     No Information Rate : 0.8801          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4646          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8598          
##             Specificity : 0.8289          
##          Pos Pred Value : 0.4063          
##          Neg Pred Value : 0.9775          
##              Prevalence : 0.1199          
##          Detection Rate : 0.1031          
##    Detection Prevalence : 0.2537          
##       Balanced Accuracy : 0.8443          
##                                           
##        'Positive' Class : yes             
## 
# display the accuracy
acc_svm_exp1 <- cm_svm_exp1$overall["Accuracy"]
paste0("SVM Linear model: Accuracy =", acc_svm_exp1,"/n" )
## [1] "SVM Linear model: Accuracy =0.832559168325592/n"
# display the ROC -AUC
roc_curve_linear <- roc(response = bank_test$y, 
                 predictor = as.numeric(pred_svm_exp1))
## Setting levels: control = yes, case = no
## Setting direction: controls < cases
tuned_auc_linear <- auc(roc_curve_linear)
plot(roc_curve_linear, main = paste0("ROC curve with AUC = ",round(tuned_auc_linear, 3)))
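
The ROC curve above is built from hard class predictions, so it has a single operating point and its AUC equals the balanced accuracy. A smoother, more informative curve needs class probabilities; a sketch, assuming the model is retrained with classProbs = TRUE (the object names here are illustrative):

# Sketch: probability-based ROC (requires class probabilities during training)
control_prob <- trainControl(method = "cv", number = 5, sampling = "down",
                             classProbs = TRUE, summaryFunction = twoClassSummary)
svm_linear_prob <- train(y ~ ., data = bank_train, method = "svmLinear",
                         preProcess = c("center", "scale"),
                         metric = "ROC", trControl = control_prob,
                         tuneGrid = expand.grid(C = c(0.01, 0.1, 1, 10)))
# Probability of the positive class ("yes") on the test set
prob_yes <- predict(svm_linear_prob, bank_test, type = "prob")[, "yes"]
roc_prob <- roc(response = bank_test$y, predictor = prob_yes, levels = c("no", "yes"))
auc(roc_prob)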

Experiment 8. Support Vector Machine with a Radial Basis Function (RBF) / Gaussian Kernel

Objective: Determine whether the performance of the RBF-kernel SVM improves when tuning the cost parameter (C) and the kernel width (sigma/gamma).

Evaluation metric: Accuracy, ROC-AUC

Results/Run:

  • The SVM radial model was trained on 36169 samples with 19 predictors and 2 classes: ‘yes’, ‘no’.

  • Cross-validated accuracy was best at sigma = 0.01 and C = 10, reaching 0.8333.

  • The validation plot confirms that C = 10 and sigma = 0.01 provide the best accuracy.

  • Number of support vectors: 3362.

  • Training error: 0.077883.

  • Duration, contact and housing were the most important features.

  • Test prediction accuracy is 0.8389 and ROC-AUC = 0.849.

Conclusion/Recommendation:

A high number of false positives (1,308 people predicted to subscribe who didn’t) still exists.

An AUC value of 0.849 means the model does a very good job distinguishing between subscribers and non-subscribers.

This model is better in accuracy and ROC-AUC than the linear model.

Recommendation: try a polynomial kernel to check whether accuracy improves further.

# cross-validation as the resampling method for model training
control <- trainControl(method = "cv",number = 5,sampling = "down") 

# Create tuning grid with combinations of C and sigma
# caret's svmRadial backend (kernlab) parameterizes the RBF kernel with sigma
# rather than e1071's gamma, so the grid specifies sigma
tune_grid <- expand.grid(
  C = c(1, 10),
  sigma = c(0.01, 0.1)
)
svmModel_radial <- train(y ~ .,
                   data= bank_train,
                   method = "svmRadial",
                   preProcess = c("zv","center", "scale"),
                   trControl = control,
                   tuneGrid = tune_grid )
svmModel_radial
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 36169 samples
##    19 predictor
##     2 classes: 'yes', 'no' 
## 
## Pre-processing: centered (52), scaled (52) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 28935, 28935, 28936, 28935, 28935 
## Addtional sampling using down-sampling prior to pre-processing
## 
## Resampling results across tuning parameters:
## 
##   C   sigma  Accuracy   Kappa    
##    1  0.01   0.8228314  0.4467695
##    1  0.10   0.7467721  0.3317558
##   10  0.01   0.8333379  0.4606905
##   10  0.10   0.7528823  0.3287391
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.01 and C = 10.
#validation plot
plot(svmModel_radial)

#summary
svmModel_radial$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 10 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.01 
## 
## Number of Support Vectors : 3362 
## 
## Objective Function Value : -20576.66 
## Training error : 0.077883
# Important Variables
plot(varImp(svmModel_radial))

# predict for the test data
pred_svm_exp2 <- predict(svmModel_radial, bank_test)

# Generate the confusion matrix
cm_svm_exp2 <- confusionMatrix(pred_svm_exp2 , bank_test$y)

cm_svm_exp2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes  935 1308
##        no   149 6650
##                                           
##                Accuracy : 0.8389          
##                  95% CI : (0.8311, 0.8464)
##     No Information Rate : 0.8801          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4776          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8625          
##             Specificity : 0.8356          
##          Pos Pred Value : 0.4169          
##          Neg Pred Value : 0.9781          
##              Prevalence : 0.1199          
##          Detection Rate : 0.1034          
##    Detection Prevalence : 0.2481          
##       Balanced Accuracy : 0.8491          
##                                           
##        'Positive' Class : yes             
## 
# display the accuracy
acc_svm_exp2 <- cm_svm_exp2$overall["Accuracy"]
paste0("SVM Radial model: Accuracy =", acc_svm_exp2,"/n" )
## [1] "SVM Radial model: Accuracy =0.838863083388631/n"
# display the ROC -AUC
roc_curve_radial<- roc(response = bank_test$y, 
                 predictor = as.numeric(pred_svm_exp2))
## Setting levels: control = yes, case = no
## Setting direction: controls < cases
tuned_auc_radial <- auc(roc_curve_radial)
plot(roc_curve_radial, main = paste0("ROC curve with AUC = ",round(tuned_auc_radial, 3)))
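
The sigma grid need not be guessed: kernlab can estimate a plausible range directly from the data. A sketch (sigest returns the 10%, 50% and 90% quantiles of an empirical sigma estimate):

# Estimate a reasonable range for the RBF sigma from the data
library(kernlab)
sigest(y ~ ., data = bank_train, scaled = TRUE)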

Experiment 9. Support Vector Machine with Polynomial Kernel

Objective: Evaluate the performance of the polynomial-kernel SVM while tuning the cost parameter (C), the degree, and the scale.

Evaluation metric: Accuracy, ROC-AUC

Results/Run:

  • The SVM polynomial model was trained on 36169 samples with 19 predictors and 2 classes: ‘yes’, ‘no’.

  • Cross-validated accuracy was best at degree = 2, scale = 0.01 and C = 10, reaching 0.8323.

  • The validation plot confirms that degree = 2, scale = 0.01 and C = 10 provide the best accuracy.

  • Number of support vectors: 3020.

  • Training error: 0.102616.

  • Duration, contact and housing were the most important features.

  • Test prediction accuracy is 0.8376 and ROC-AUC = 0.853.

Conclusion/Recommendation:

A high number of false positives (1,330 people predicted to subscribe who didn’t) still exists. An AUC value of 0.853 means the model does a very good job distinguishing between subscribers and non-subscribers.

This model has the best ROC-AUC and sensitivity of the three kernels, with accuracy comparable to the radial model.

# cross-validation as the resampling method for model training
control <- trainControl(method = "cv",number = 5,sampling = "down") 

# Create tuning grid with combinations of C, degree and scale
tune_grid <- expand.grid(
  C = c(1, 10),
  degree = c(2, 3), # degree of the polynomial
  scale = c(0.001, 0.01)) # scale factor (also known as alpha)

svmModel_poly <- train(y ~ .,
                   data= bank_train,
                   method = "svmPoly",
                   preProcess = c("zv","center", "scale"),
                   trControl = control,
                   tuneGrid = tune_grid )
svmModel_poly
## Support Vector Machines with Polynomial Kernel 
## 
## 36169 samples
##    19 predictor
##     2 classes: 'yes', 'no' 
## 
## Pre-processing: centered (52), scaled (52) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 28936, 28935, 28935, 28935, 28935 
## Addtional sampling using down-sampling prior to pre-processing
## 
## Resampling results across tuning parameters:
## 
##   C   degree  scale  Accuracy   Kappa    
##    1  2       0.001  0.8138738  0.4184082
##    1  2       0.010  0.8312644  0.4600537
##    1  3       0.001  0.8130167  0.4180915
##    1  3       0.010  0.8305179  0.4579542
##   10  2       0.001  0.8241312  0.4398467
##   10  2       0.010  0.8322599  0.4587298
##   10  3       0.001  0.8241587  0.4430253
##   10  3       0.010  0.8299648  0.4489567
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 2, scale = 0.01 and C = 10.
#validation plot
plot(svmModel_poly)

#summary
svmModel_poly$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 10 
## 
## Polynomial kernel function. 
##  Hyperparameters : degree =  2  scale =  0.01  offset =  1 
## 
## Number of Support Vectors : 3020 
## 
## Objective Function Value : -24056.76 
## Training error : 0.102616
# Important Variables
plot(varImp(svmModel_poly))

# predict for the test data
pred_svm_exp3 <- predict(svmModel_poly, bank_test)

# Generate the confusion matrix
cm_svm_exp3 <- confusionMatrix(pred_svm_exp3 , bank_test$y)

cm_svm_exp3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes  946 1330
##        no   138 6628
##                                           
##                Accuracy : 0.8376          
##                  95% CI : (0.8299, 0.8452)
##     No Information Rate : 0.8801          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4784          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8727          
##             Specificity : 0.8329          
##          Pos Pred Value : 0.4156          
##          Neg Pred Value : 0.9796          
##              Prevalence : 0.1199          
##          Detection Rate : 0.1046          
##    Detection Prevalence : 0.2517          
##       Balanced Accuracy : 0.8528          
##                                           
##        'Positive' Class : yes             
## 
# display the accuracy
acc_svm_exp3 <- cm_svm_exp3$overall["Accuracy"]
paste0("SVM Radial model: Accuracy =", acc_svm_exp3,"/n" )
## [1] "SVM Radial model: Accuracy =0.837646538376465/n"
# display the ROC -AUC
roc_curve_poly <- roc(response = bank_test$y, 
                 predictor = as.numeric(pred_svm_exp3))
tuned_auc_poly <- auc(roc_curve_poly)
plot(roc_curve_poly, main = paste0("ROC curve with AUC = ",round(tuned_auc_poly, 3)))

Comparison of three experiments

Experiment 7 SVM Linear. Hyperparameters: C = 10 provided the best performance. Number of support vectors: 3349. Accuracy = 0.8326, ROC-AUC = 0.844, Sensitivity = 0.8598, Specificity = 0.8289, Precision = 0.4063.

Experiment 8 SVM Radial. Hyperparameters: sigma = 0.01 and C = 10 provided the best performance. Number of support vectors: 3362. Accuracy = 0.8389, ROC-AUC = 0.849, Sensitivity = 0.8625, Specificity = 0.8356, Precision = 0.4169.

Experiment 9 SVM Polynomial. Hyperparameters: degree = 2, scale = 0.01 and C = 10 provided the best performance. Number of support vectors: 3020. Accuracy = 0.8376, ROC-AUC = 0.853, Sensitivity = 0.8727, Specificity = 0.8329, Precision = 0.4156.

Comparison

All models perform similarly in terms of accuracy (around 83.3–83.9%).

SVM_Poly has a slight edge in ROC-AUC (0.853), meaning it is best at distinguishing between the two classes across thresholds.

SVM_Poly has the highest sensitivity (87.3%), followed by SVM_Radial (86.3%), meaning they are better at identifying actual subscribers.

SVM_Radial slightly edges out the others in specificity (83.6%), so it is a bit better at identifying non-subscribers.

All models have low precision (~41%), which is expected given the class imbalance (most customers did not subscribe).

This means that when a model predicts “yes”, it is right only ~41% of the time.

# Metric values taken from the confusion matrices and ROC curves above
svm_results <- data.frame(
  Model = c("SVM_Linear", "SVM_Radial", "SVM_Poly"),
  Accuracy = c(0.8326, 0.8389, 0.8376),
  Sensitivity = c(0.8598, 0.8625, 0.8727),
  Specificity = c(0.8289, 0.8356, 0.8329),
  Precision = c(0.4063, 0.4169, 0.4156),
  AUCROC = c(0.844, 0.849, 0.853)
)
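
Rather than hard-coding, the same table can be assembled directly from the objects computed above (a sketch; svm_results2 is an illustrative name):

# Pull the metrics straight from the stored confusion matrices and AUC objects
cms <- list(cm_svm_exp1, cm_svm_exp2, cm_svm_exp3)
svm_results2 <- data.frame(
  Model = c("SVM_Linear", "SVM_Radial", "SVM_Poly"),
  Accuracy = sapply(cms, function(cm) cm$overall["Accuracy"]),
  Sensitivity = sapply(cms, function(cm) cm$byClass["Sensitivity"]),
  Specificity = sapply(cms, function(cm) cm$byClass["Specificity"]),
  Precision = sapply(cms, function(cm) cm$byClass["Precision"]),
  AUCROC = c(as.numeric(tuned_auc_linear),
             as.numeric(tuned_auc_radial),
             as.numeric(tuned_auc_poly))
)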

# Reshape for plotting
svm_results_long <- svm_results %>%
  pivot_longer(cols = -Model, names_to = "Metric", values_to = "Value")

# Bar plot
ggplot(svm_results_long, aes(x = Metric, y = Value, fill = Model)) +
  geom_col(position = "dodge") +
  labs(title = "Comparison of SVM Models",
       y = "Score", x = "Metric") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

plot(roc_curve_linear, col = "blue")
plot(roc_curve_radial, col = "red", add = TRUE)
plot(roc_curve_poly, col = "green", add = TRUE)
legend("bottomright", legend = c("Linear", "Radial", "Poly"),
       col = c("blue", "red", "green"), lwd = 2)
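
Because precision is the weak metric here, a precision-recall curve adds useful detail, and the PRROC package is already loaded. A sketch, assuming the probability scores prob_yes from the probability-based model sketched in Experiment 7:

# Precision-recall curve from probability scores (illustrative)
pr <- pr.curve(scores.class0 = prob_yes[bank_test$y == "yes"],
               scores.class1 = prob_yes[bank_test$y == "no"],
               curve = TRUE)
plot(pr)  # area under the PR curve is reported in pr$auc.integral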

Comparison with previous homework

Introduction

The objective of this study was to evaluate SVM models with various kernels on a classification problem and to compare them with the tree-based models from the previous homework. The experiments were designed to analyze the impact of train-test ratios and hyperparameter tuning.

Bias-Variance Considerations

Decision Trees generally have low bias but high variance, leading to overfitting on training data. Random Forest, as an ensemble method, reduces variance by averaging multiple Decision Trees, improving generalization. XGBoost, a boosting method, builds models sequentially to correct errors, optimizing both bias and variance for better predictive performance.

SVM’s behavior depends heavily on:

  • the kernel you choose (linear, radial, polynomial),

  • the regularization parameter C,

  • the kernel parameters, such as sigma (RBF) and degree/scale (polynomial).

svmLinear typically has lower variance and higher bias (a simpler model). svmRadial is more flexible, with higher variance and lower bias. For svmPoly, bias and variance depend on the degree and scale: a high degree gives low bias and high variance, while a low degree gives high bias and low variance.

Experiment Summaries and Results

Decision Tree Experiments

The Decision Tree experiments explored how data partitioning and class weight adjustments impact performance. Experiment 1 tested the effect of changing the training-to-test ratio from 80/20 to 70/30, resulting in a marginal accuracy increase from 0.900 to 0.901. However, the model struggled with false negatives for the minority class. Experiment 2 adjusted class weights, slightly reducing accuracy (0.8912) but improving the minority class prediction (AUC-ROC = 0.746). Cross-validation did not improve results, indicating that weighted class adjustments were more effective.

Random Forest Experiments

Random Forest significantly improved model performance over Decision Trees. Experiment 3 applied class weight adjustments while keeping all features, achieving 0.9064 accuracy and 0.927 AUC-ROC. The most predictive features were ‘month,’ ‘day,’ and ‘duration.’ Experiment 4 increased the number of trees from 100 to 200, slightly boosting accuracy to 0.9082 and AUC-ROC to 0.929. The results confirmed that increasing tree count marginally enhances predictive power while maintaining model stability.

XGBoost Experiments

XGBoost provided the best balance of accuracy and generalization. Experiment 5 tested an increase in boosting rounds from 100 to 200, which unexpectedly led to a slight decline in performance (accuracy: 0.9043, AUC-ROC: 0.927). Experiment 6 focused on hyperparameter tuning, varying max_depth from 3 to 9. The best results came from max_depth = 6, yielding an accuracy of 0.9089 and an AUC-ROC of 0.9315, indicating an optimal balance between model complexity and generalization.

SVM experiments

The SVM experiments showed that svmPoly (polynomial kernel) provided better recall and AUC than svmLinear and svmRadial, yielding an accuracy of 0.8376 and ROC-AUC = 0.853 on the 20% test split, while svmRadial provided slightly better accuracy, specificity and precision. However, in accuracy and ROC-AUC every SVM variant trails the other algorithms (Decision Tree, Random Forest and XGBoost). Hence SVM is less suitable for this classification problem, where the predictors are primarily categorical.

Comparison of Model Performance

library(knitr)

# Define the data
results_table <- data.frame(
  Model = c("Decision Tree", "Random Forest", "XGBoost", "SVM", "SVM","SVM"),
  Experiment = c("Exp 1", "Exp 3", "Exp 6", "Exp 7", "Exp 8", "Exp 9"),
  Key_Variation = c("70/30 data split",  "Increased trees to 200", "Tuned max_depth = 6" , "Linear kernel", "Radial kernel", "Polynomial kernel"),
  Accuracy    = c(0.9010, 0.9082, 0.9089, 0.8326, 0.8389, 0.8376),
  AUC_ROC     = c(0.6900, 0.7460, 0.9315, 0.8440, 0.8490, 0.8530),
  Sensitivity = c(0.9652, 0.9629, 0.9731, 0.8598, 0.8625, 0.8727),
  Specificity = c(0.4142, 0.4950, 0.4029, 0.8289, 0.8356, 0.8329),
  Precision   = c(0.9256, 0.9351, 0.9248, 0.4063, 0.4169, 0.4156)
)
# Display the table using kable
kable(results_table, format = "markdown", align = "l")
|Model         |Experiment |Key_Variation          |Accuracy |AUC_ROC |Sensitivity |Specificity |Precision |
|:-------------|:----------|:----------------------|:--------|:-------|:-----------|:-----------|:---------|
|Decision Tree |Exp 1      |70/30 data split       |0.9010   |0.6900  |0.9652      |0.4142      |0.9256    |
|Random Forest |Exp 3      |Increased trees to 200 |0.9082   |0.7460  |0.9629      |0.4950      |0.9351    |
|XGBoost       |Exp 6      |Tuned max_depth = 6    |0.9089   |0.9315  |0.9731      |0.4029      |0.9248    |
|SVM           |Exp 7      |Linear kernel          |0.8326   |0.8440  |0.8598      |0.8289      |0.4063    |
|SVM           |Exp 8      |Radial kernel          |0.8389   |0.8490  |0.8625      |0.8356      |0.4169    |
|SVM           |Exp 9      |Polynomial kernel      |0.8376   |0.8530  |0.8727      |0.8329      |0.4156    |
# Reshape for plotting
results_long <- results_table |>
  select(-Experiment) |>
  pivot_longer(cols = -c(Model,Key_Variation), names_to = "Metric", values_to = "Value")

# Bar plot
ggplot(results_long, aes(x = Metric, y = Value, fill = Model)) +
  geom_col(position = "dodge") +
  labs(title = "Comparison of classification Models",
       y = "Score", x = "Metric") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

Best Performing Model

The XGBoost model with max_depth = 6 demonstrated the highest accuracy (0.9089) and AUC-ROC (0.9315), making it the optimal model. It provided better class discrimination compared to Decision Trees and Random Forest while maintaining generalization.

I agree with this result, as tree-based models handle categorical predictors better than SVM classification does, and they offer better interpretability and transparency.

If specificity is the primary criterion, however, the SVM models perform best.

Conclusion and Recommendations

Decision Trees offer interpretability but struggle with variance and class imbalance. While simple to implement, they require careful tuning of class weights to improve minority class predictions.

Random Forest reduces variance and performs better than Decision Trees, offering improved generalization. However, class imbalance remains a challenge, and tuning hyperparameters such as tree count and feature selection can further optimize results.

XGBoost provides the best trade-off between bias and variance, with hyperparameter tuning yielding the highest accuracy and AUC-ROC. It is well-suited for structured data and imbalanced classification problems.

SVM performs worst among all the models even with hyperparameter tuning. With imbalanced, largely categorical data, it does not perform as well as the tree-based methods.

Recommendation for Data Science: XGBoost with max_depth = 6 should be the preferred model, with further tuning of learning rate, regularization, and feature selection to enhance performance.

Recommendation for Business Problem: Given the superior class discrimination and predictive power of XGBoost, it should be implemented to predict term deposit subscriptions and to improve decision-making accuracy. Business stakeholders should focus on the important features: longer contact duration, a successful previous campaign outcome, and contacting clients during months such as May and August.

Review of articles

Article 1: https://www.hindawi.com/journals/complexity/2021/5550344/

Objective: Predict COVID-19 using decision tree ensembles.

Explanation: Machine learning algorithms have been applied to various kinds of datasets to predict COVID-19-positive cases. As there is an imbalance between the number of positive and negative cases, decision tree ensembles developed for imbalanced datasets performed better. In cases where the accuracy measure is not very useful, precision, recall, and F1-measure are used in the experiments. The area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) are also used to compare the performance of the decision tree ensembles.

Comparison:

  • For two of the six performance measures (accuracy and precision), the standard classifiers performed best, whereas for the other four (F1-measure, recall, AUROC, and AUPRC) the decision tree ensembles for imbalanced datasets performed best.

  • Random forests performed best for two performance measures (accuracy and precision).

  • Balanced random forest (RUS) performed best for three performance measures: recall, F1-measure, and AUPRC.

  • RUSBagging performed best for AUROC. AUROC and AUPRC are widely used performance measures for imbalanced datasets.

  • RUSBagging performed best for AUROC with the value 0.881, whereas balanced random forest (RUS) gave the best AUPRC with the value 0.561.

  • The study demonstrates that decision tree ensembles designed for imbalanced datasets perform better on this COVID-19 dataset.

Insight drawn:

  • Sampling is an approach to overcoming dataset imbalance. The article uses standard decision tree ensembles with two sampling approaches, SMOTE and RUS; the two approaches did not, however, have similar effects on all of the standard decision tree ensembles.

  • AUROC and AUPRC are the most widely used performance measures for imbalanced datasets.

  • The results suggest that performance generally improves slightly or remains constant as the ensemble size (the number of combined classifiers) grows; this is shown by including the Age variable.

  • The selection of classification methods should be based on the properties of data.

Article 2: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/

Objective: Predict whether a person is affected by COVID-19 or not using SVM classification.

Explanation:

Machine learning algorithms have been applied to various kinds of datasets to predict COVID-19-positive cases. Based on the crucial impact of symptoms, the article applies a support vector machine classifier to classify the patient’s condition into no infection (numerical value = 3), mild infection (numerical value = 2), and serious infection (numerical value = 1). Hypertension, heart disease, chest pain, and acute respiratory syndromes are selected as the attributes for the dataset.

SVM is chosen because it uses the kernel trick to map the low-dimensional input space to a high-dimensional space, thus converting a non-separable problem into a separable one.

Comparison:

  • The SVM’s performance was compared with that of various supervised machine-learning models, viz. kNN, Naïve Bayes, RF, AdaBoost, and a binary tree. SVM outperformed all the other models tested.

  • Precision, recall, F1-score, and support are the performance metrics used to compare the three classes. The severely infected class had the highest precision and recall.

Insight drawn:

  • Since the article deals with the very high-priority task of detecting COVID-19, it seeks a smaller-margin hyperplane that classifies the infected classes more accurately, with fewer mispredictions.

  • The model has an accuracy of 87%. The classification report shows that the methodology has a high success rate in predicting severely infected cases, which is crucial for COVID-19 prediction.


Comparison of both articles

Both articles use ML algorithms to predict whether a person is affected by COVID-19. Article 1 uses decision tree ensembles to predict COVID-19-positive cases, whereas Article 2 uses SVM classification to predict infection severity.

Both articles used ROC, classification accuracy (CA), F1-score, precision, and recall to evaluate the predictions.

Article 1 empirically compares many decision tree ensembles and sampling strategies across performance measures, whereas Article 2 builds and tunes a single SVM model to predict infection severity.

Search for academic content (at least 3 articles) that compare the use of decision trees vs SVMs in your current area of expertise.

Article 3: https://www.sciencedirect.com/science/article/pii/S187705092400406X

Article 4: https://wjarr.com/content/comparative-study-decision-tree-and-support-vector-machine-breast-cancer-prediction

Article 5: https://www.geeksforgeeks.org/comparing-support-vector-machines-and-decision-trees-for-text-classification/

Explanation:

Article 3 provides a performance comparison of Decision Tree and Support Vector Machine algorithms for heart failure prediction. By exhaustively evaluating different parameter configurations, the authors were able to pinpoint the optimal values that maximize the performance metrics.

Article 4 explores the effectiveness of Support Vector Machine (SVM) techniques for diagnosing breast cancer, compared against a Decision Tree (DT) model.

Article 5 considers Support Vector Machines (SVMs) and Decision Trees, both popular algorithms, for multi-class text classification tasks.

Comparison:

In Article 3, SVM outperforms the decision tree in terms of accuracy, precision, recall, and F1-score. However, the performance of both methods is influenced by the preprocessing steps applied, indicating the importance of selecting appropriate data preprocessing techniques for optimal performance with a specific machine learning algorithm.

Decision trees are favored for their simplicity, interpretability, and handling of missing data, while SVMs are preferred for their ability to handle high-dimensional data and capture nonlinear relationships.

In Article 4, the results indicated that the SVM model achieved superior performance with an accuracy of 94%, AUC of 98%, sensitivity of 95%, specificity of 87%, and precision of 93%. In comparison, the DT model showed an accuracy of 89%, AUC of 95%, sensitivity of 90%, specificity of 85%, and precision of 90%.

In Article 5, the SVM model achieved an accuracy of 91.43% on the test set, indicating that it correctly predicted the newsgroup category for the majority of documents; it also performed well on precision, recall, and F1-score. The decision tree model achieved an accuracy of 61.67% on the test set, far lower than the SVM model.

Insight drawn:

In Article 3, compared to the results without preprocessing, the decision tree method experienced a decrease in true negatives and true positives but an increase in false positives and false negatives. Conversely, the SVM method exhibited an increase in true negatives and true positives, with a slight rise in false positives and false negatives. Overall, the SVM method still outperforms the decision tree in terms of true positives and true negatives, albeit with a less pronounced difference than in the results without preprocessing.

In Article 4, the findings underscore the potential of SVM to enhance breast cancer diagnostic accuracy, thereby supporting early detection and treatment. The SVM model’s high performance and robustness make it a valuable addition to existing diagnostic methods, providing a complementary approach to traditional techniques.

In Article 5, selecting the ideal model for text classification resembles selecting the ideal tool for a task: it is crucial to weigh accuracy against interpretability. SVMs are often preferred for text classification because they handle high-dimensional data such as text features and deal effectively with non-linear boundaries between classes. However, the choice of model depends on various factors, including the specific characteristics of the dataset and the computational resources available. In some cases, Decision Trees may be preferred for their simplicity and interpretability, especially when the dataset is not as complex or when interpretability matters.