Data 622 - Assignment 3

Assignment 3 - Support Vector Machines (SVM)

Perform an analysis of the dataset(s) used in Homework #2 using the SVM algorithm. Compare the results with the results from previous homework.

Read the following articles:
https://www.hindawi.com/journals/complexity/2021/5550344/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/

Search for academic content (at least 3 articles) that compare the use of decision trees vs SVMs in your current area of expertise. Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with the results from previous homework.

Answer questions, such as: Which algorithm is recommended to get more accurate results? Is it better for classification or regression scenarios? Do you agree with the recommendations? Why?

library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.5.1

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(caret)

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(e1071)

## 
## Attaching package: 'e1071'
## 
## The following object is masked from 'package:ggplot2':
## 
##     element

library(dplyr)
library(rpart)
library(rpart.plot)
library(pROC)

## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(kernlab)

## 
## Attaching package: 'kernlab'
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## The following object is masked from 'package:ggplot2':
## 
##     alpha

Data from assignment #2

The data were imported and preprocessed exactly as in assignment #2 so that the results are comparable.

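Bank_data<-read.csv2("https://raw.githubusercontent.com/Andreina-A/Data-622/refs/heads/main/bank-full.csv") #semicolon-delimited file; character columns stay as character here (stringsAsFactors defaults to FALSE in R >= 4.0), only y is converted to a factor below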

Bank_data$y<-as.factor(Bank_data$y)

#preprocessing
#removed the "default" variable because it had many unknown values and a very low correlation with the subscription outcome.
Bank_data<-Bank_data|>
  select(-default)

head(Bank_data)

##   age          job marital education balance housing loan contact day month
## 1  58   management married  tertiary    2143     yes   no unknown   5   may
## 2  44   technician  single secondary      29     yes   no unknown   5   may
## 3  33 entrepreneur married secondary       2     yes  yes unknown   5   may
## 4  47  blue-collar married   unknown    1506     yes   no unknown   5   may
## 5  33      unknown  single   unknown       1      no   no unknown   5   may
## 6  35   management married  tertiary     231     yes   no unknown   5   may
##   duration campaign pdays previous poutcome  y
## 1      261        1    -1        0  unknown no
## 2      151        1    -1        0  unknown no
## 3       76        1    -1        0  unknown no
## 4       92        1    -1        0  unknown no
## 5      198        1    -1        0  unknown no
## 6      139        1    -1        0  unknown no

The same train/test split from assignment #2 was used here, again for consistency.

set.seed(123)
#Use 80 percent of data for training 20 percent for test
trainIndex<-createDataPartition(Bank_data$y, p=0.8, list=FALSE)
trainData<-Bank_data[trainIndex,]
testData<-Bank_data[-trainIndex,]

#recode for positive = yes
trainData$y<-relevel(trainData$y, ref = "yes")
testData$y<-relevel(testData$y, ref= "yes")
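The relevel step matters because caret's confusionMatrix treats the first factor level as the positive class unless told otherwise. A minimal sketch on a made-up toy vector (not the bank data) shows what relevel changes and what it leaves alone:

```r
# Toy outcome vector; the values are made up for illustration.
y <- factor(c("no", "no", "yes", "no", "yes"))
levels(y)            # alphabetical by default: "no" "yes"

# relevel() moves "yes" to the front without changing any observation.
y_yes_first <- relevel(y, ref = "yes")
levels(y_yes_first)  # "yes" "no"

# The underlying labels are untouched; only the level order differs.
identical(as.character(y), as.character(y_yes_first))
```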

Analysis using SVM algorithm

SVM: finds the separating boundary between classes that maximizes the margin, i.e. the distance from the boundary to the nearest training points (the support vectors).

SVM linear: assumes the classes can be separated reasonably well by a linear boundary in the feature space, and finds the hyperplane that best divides the data into two classes. It serves as a baseline model.
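To make the idea of a margin concrete, here is a small base-R sketch. The points, labels, and the hyperplane (w, b) are hypothetical and hand-picked for illustration; nothing here is fitted or taken from the bank data.

```r
# Hypothetical 2-D points with labels +1 / -1, chosen to be linearly separable.
X <- rbind(c(2, 2), c(3, 3), c(0, 0), c(-1, 0))
y <- c(1, 1, -1, -1)

# A candidate hyperplane w . x + b = 0 (hand-picked, not estimated).
w <- c(0.5, 0.5)
b <- -1

# Functional margin y_i * (w . x_i + b); a value >= 1 means the point lies
# on or outside the margin on the correct side.
functional_margin <- y * (as.vector(X %*% w) + b)

# Geometric distance of each point from the hyperplane.
distance <- functional_margin / sqrt(sum(w^2))

# Width of the band between the two margin lines.
margin_width <- 2 / sqrt(sum(w^2))

# Hinge loss, the penalty a soft-margin SVM trades off against margin width.
hinge_loss <- pmax(0, 1 - functional_margin)

print(functional_margin)
print(margin_width)
print(hinge_loss)
```

The cost parameter in the models below controls how heavily hinge-loss violations are penalized relative to keeping the margin wide.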

SVM_linear<-svm(y~., data = trainData, kernel="linear", probability=TRUE)
summary(SVM_linear)

## 
## Call:
## svm(formula = y ~ ., data = trainData, kernel = "linear", probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  8187
## 
##  ( 4279 3908 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  yes no

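SVM linear tuned: the cost parameter C was tuned with 5-fold cross-validation. Note that this model differs from the baseline in more than cost: the majority class is also down-sampled within each fold (sampling = "down"), which changes the class balance the model sees. caret's train function was used instead of the svm function because svm is slower on large data.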

SVM_Linear_tuned <- train(
  y ~ ., 
  data = trainData, 
  method = "svmLinear",
  trControl = trainControl(
    method = "cv", 
    number = 5, 
    sampling = "down", 
    classProbs = TRUE
  ),
  preProcess = c("center", "scale"), 
  tuneGrid = expand.grid(C = c(0.01, 0.1, 1, 10))
)

Best performing cost:

bestlinear<-SVM_Linear_tuned$bestTune
print(bestlinear)

##      C
## 1 0.01
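The down-sampling requested through trainControl above can be pictured with a short base-R sketch. This is an illustration on made-up data of the general technique (randomly dropping majority-class rows until the classes are the same size), not a copy of caret's internal implementation.

```r
set.seed(123)

# Toy imbalanced data: 90 "no" and 10 "yes" rows (made-up counts).
toy <- data.frame(
  x = rnorm(100),
  y = factor(rep(c("no", "yes"), times = c(90, 10)))
)

# Size of the smallest class.
n_min <- min(table(toy$y))

# Draw n_min row indices from each class, then stack them.
keep <- unlist(lapply(split(seq_len(nrow(toy)), toy$y),
                      function(idx) sample(idx, n_min)))
toy_down <- toy[keep, ]

table(toy_down$y)  # both classes now have 10 rows
```

Training on balanced data usually pushes a classifier to predict the minority class more often, which likely explains much of the gain in sensitivity for the tuned model reported later.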

SVM radial: the radial basis function (RBF) kernel is commonly used when the classes are not linearly separable. It implicitly maps the data into a higher-dimensional space where a linear hyperplane can separate them.

SVM_Radial<-svm(y~., data=trainData, kernel="radial", cost=1, gamma=0.1, probability = TRUE, scale=TRUE)
summary(SVM_Radial)

## 
## Call:
## svm(formula = y ~ ., data = trainData, kernel = "radial", cost = 1, 
##     gamma = 0.1, probability = TRUE, scale = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  8242
## 
##  ( 4445 3797 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  yes no
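As a rough illustration of what the radial kernel computes, the sketch below evaluates exp(-gamma * ||u - v||^2), the form documented for e1071's radial kernel, with gamma = 0.1 as in the call above. The helper name rbf_kernel and the example vectors are made up for illustration.

```r
# RBF kernel as parameterized in e1071: exp(-gamma * ||u - v||^2).
# rbf_kernel is a hypothetical helper, not an e1071 function.
rbf_kernel <- function(u, v, gamma = 0.1) {
  exp(-gamma * sum((u - v)^2))
}

u <- c(1, 2)
k_same <- rbf_kernel(u, u)           # identical points give exactly 1
k_near <- rbf_kernel(u, c(1.5, 2))   # nearby point, close to 1
k_far  <- rbf_kernel(u, c(10, -5))   # distant point, close to 0

print(c(same = k_same, near = k_near, far = k_far))
```

A larger gamma makes the similarity fall off faster with distance, so each support vector influences a smaller neighborhood.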

SVM polynomial: the polynomial kernel also maps the data into a higher-dimensional space. The default degree of 3 was used.

SVM_Poly<-svm(y~., data=trainData, kernel="poly", cost=1, gamma=0.1, probability = TRUE, scale=TRUE)
summary(SVM_Poly)

## 
## Call:
## svm(formula = y ~ ., data = trainData, kernel = "poly", cost = 1, 
##     gamma = 0.1, probability = TRUE, scale = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##      coef.0:  0 
## 
## Number of Support Vectors:  7871
## 
##  ( 4160 3711 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  yes no
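For comparison with the radial kernel, here is a sketch of the polynomial kernel in the form documented for e1071, (gamma * u'v + coef0)^degree, using the settings shown in the summary above (gamma = 0.1, coef.0 = 0, degree = 3). The helper name poly_kernel and the input vectors are assumptions made for illustration.

```r
# Polynomial kernel as parameterized in e1071: (gamma * <u, v> + coef0)^degree.
# poly_kernel is a hypothetical helper, not an e1071 function.
poly_kernel <- function(u, v, gamma = 0.1, coef0 = 0, degree = 3) {
  (gamma * sum(u * v) + coef0)^degree
}

u <- c(1, 2)
v <- c(3, 4)

# Inner product is 1*3 + 2*4 = 11, so the kernel value is (0.1 * 11)^3.
k_uv <- poly_kernel(u, v)
print(k_uv)
```

Unlike the RBF kernel, this value depends on the inner product rather than the distance between the points, so it is not bounded between 0 and 1.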

Predictions and comparison of SVM models

Linear SVM

#linear
pred_linear<- predict(SVM_linear, testData)
confMat_linear<-confusionMatrix(pred_linear,testData$y)
confMat_linear

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes  184   98
##        no   873 7886
##                                          
##                Accuracy : 0.8926         
##                  95% CI : (0.886, 0.8989)
##     No Information Rate : 0.8831         
##     P-Value [Acc > NIR] : 0.002341       
##                                          
##                   Kappa : 0.2373         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.17408        
##             Specificity : 0.98773        
##          Pos Pred Value : 0.65248        
##          Neg Pred Value : 0.90033        
##              Prevalence : 0.11691        
##          Detection Rate : 0.02035        
##    Detection Prevalence : 0.03119        
##       Balanced Accuracy : 0.58090        
##                                          
##        'Positive' Class : yes            
##
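The statistics in this output can be reproduced by hand from the four cell counts, which helps when reading the remaining matrices. The sketch below copies the counts from the linear SVM matrix above, with "yes" as the positive class; the variable names are chosen for illustration.

```r
# Counts copied from the linear SVM confusion matrix ("yes" is positive).
TP <- 184   # predicted yes, actually yes
FP <- 98    # predicted yes, actually no
FN <- 873   # predicted no,  actually yes
TN <- 7886  # predicted no,  actually no
n  <- TP + FP + FN + TN

sensitivity       <- TP / (TP + FN)   # share of actual "yes" that were found
specificity       <- TN / (TN + FP)   # share of actual "no" that were found
accuracy          <- (TP + TN) / n
balanced_accuracy <- (sensitivity + specificity) / 2
no_info_rate      <- (FP + TN) / n    # accuracy of always predicting "no"

print(round(c(sensitivity = sensitivity,
              specificity = specificity,
              accuracy = accuracy,
              balanced_accuracy = balanced_accuracy,
              no_info_rate = no_info_rate), 4))
```

Because about 88% of the test cases are "no", accuracy alone says little here: a model can sit just above the no information rate while still missing most subscribers.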

Linear SVM Tuned

# linear tuned
pred_lineartuned<- predict(SVM_Linear_tuned, testData)
confMat_linear_tuned<-confusionMatrix(pred_lineartuned,testData$y)
confMat_linear_tuned

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes  826 1136
##        no   231 6848
##                                           
##                Accuracy : 0.8488          
##                  95% CI : (0.8412, 0.8561)
##     No Information Rate : 0.8831          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4661          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.78146         
##             Specificity : 0.85772         
##          Pos Pred Value : 0.42100         
##          Neg Pred Value : 0.96737         
##              Prevalence : 0.11691         
##          Detection Rate : 0.09136         
##    Detection Prevalence : 0.21701         
##       Balanced Accuracy : 0.81959         
##                                           
##        'Positive' Class : yes             
##

Radial SVM

#radial
pred_radial<- predict(SVM_Radial, testData)
confMat_radial<-confusionMatrix(pred_radial,testData$y)
confMat_radial

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes  344  170
##        no   713 7814
##                                          
##                Accuracy : 0.9023         
##                  95% CI : (0.896, 0.9084)
##     No Information Rate : 0.8831         
##     P-Value [Acc > NIR] : 2.943e-09      
##                                          
##                   Kappa : 0.3914         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.32545        
##             Specificity : 0.97871        
##          Pos Pred Value : 0.66926        
##          Neg Pred Value : 0.91638        
##              Prevalence : 0.11691        
##          Detection Rate : 0.03805        
##    Detection Prevalence : 0.05685        
##       Balanced Accuracy : 0.65208        
##                                          
##        'Positive' Class : yes            
##

Poly SVM

#poly
pred_poly<- predict(SVM_Poly, testData)
confMat_poly<-confusionMatrix(pred_poly,testData$y)
confMat_poly

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes  326  164
##        no   731 7820
##                                           
##                Accuracy : 0.901           
##                  95% CI : (0.8947, 0.9071)
##     No Information Rate : 0.8831          
##     P-Value [Acc > NIR] : 3.203e-08       
##                                           
##                   Kappa : 0.3752          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.30842         
##             Specificity : 0.97946         
##          Pos Pred Value : 0.66531         
##          Neg Pred Value : 0.91451         
##              Prevalence : 0.11691         
##          Detection Rate : 0.03606         
##    Detection Prevalence : 0.05420         
##       Balanced Accuracy : 0.64394         
##                                           
##        'Positive' Class : yes             
##

Decision tree (baseline)

#the same seed (123) was set at the beginning of this assignment.
DT1_model<-rpart(y ~ ., data=trainData, method = "class")
rpart.plot(DT1_model, main="Decision Tree: Baseline")

DT1_predictions<-predict(DT1_model, newdata = testData, type = "class")
confusionMatrix(DT1_predictions, testData$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes  350  217
##        no   707 7767
##                                          
##                Accuracy : 0.8978         
##                  95% CI : (0.8914, 0.904)
##     No Information Rate : 0.8831         
##     P-Value [Acc > NIR] : 5.031e-06      
##                                          
##                   Kappa : 0.3805         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.33113        
##             Specificity : 0.97282        
##          Pos Pred Value : 0.61728        
##          Neg Pred Value : 0.91657        
##              Prevalence : 0.11691        
##          Detection Rate : 0.03871        
##    Detection Prevalence : 0.06271        
##       Balanced Accuracy : 0.65197        
##                                          
##        'Positive' Class : yes            
##

#helper that extracts the comparison metrics from a confusion matrix
evaluate_model<-function(model_predictions, true_labels, model_name){
  cm<- confusionMatrix(model_predictions, true_labels, positive="yes")
  
  list(model=model_name,
       accuracy= cm$overall["Accuracy"],
       precision=cm$byClass["Precision"],
       recall=cm$byClass["Recall"],
       f1_score=cm$byClass["F1"]
  )
}

Comparison chart

results<-bind_rows(
  evaluate_model(pred_linear, testData$y, "SVM Linear"),
  evaluate_model(pred_lineartuned, testData$y, "SVM linear Tuned"),
  evaluate_model(pred_radial, testData$y, "SVM Radial"),
  evaluate_model(pred_poly, testData$y, "SVM poly"),
  evaluate_model(DT1_predictions, testData$y, "Decision Tree 1")
)
print(results)

## # A tibble: 5 × 5
##   model            accuracy precision recall f1_score
##   <chr>               <dbl>     <dbl>  <dbl>    <dbl>
## 1 SVM Linear          0.893     0.652  0.174    0.275
## 2 SVM linear Tuned    0.849     0.421  0.781    0.547
## 3 SVM Radial          0.902     0.669  0.325    0.438
## 4 SVM poly            0.901     0.665  0.308    0.421
## 5 Decision Tree 1     0.898     0.617  0.331    0.431
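The precision, recall, and F1 columns follow directly from the confusion matrix counts. As a check, the sketch below recomputes the "SVM linear Tuned" row from the counts in its confusion matrix earlier in this report; the variable names are chosen for illustration.

```r
# Counts copied from the tuned linear SVM confusion matrix ("yes" is positive).
TP <- 826    # predicted yes, actually yes
FP <- 1136   # predicted yes, actually no
FN <- 231    # predicted no,  actually yes

precision <- TP / (TP + FP)   # how many predicted "yes" were correct
recall    <- TP / (TP + FN)   # how many actual "yes" were found

# F1 is the harmonic mean of precision and recall.
f1 <- 2 * precision * recall / (precision + recall)

print(round(c(precision = precision, recall = recall, f1 = f1), 3))
```

The harmonic mean rewards models that keep both quantities reasonably high, which is why the tuned linear SVM has the best F1 despite the lowest precision.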

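ROC comparison

ROC curves compare the models across all classification thresholds rather than at a single cutoff. The tuned linear SVM had the best ROC value, with the radial SVM almost as good. On the test-set metrics in the comparison chart, the radial SVM had the highest accuracy (0.902) and precision (0.669), while the tuned linear SVM had by far the highest recall (0.781) and F1 score (0.547) but the lowest accuracy (0.849) and precision (0.421). The decision tree had the lowest ROC value, yet its threshold metrics were close to those of the radial and polynomial SVMs (accuracy 0.898, recall 0.331, F1 0.431), and its precision and accuracy were higher than those of the tuned linear SVM.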

# Get probability predictions for the "yes" class
pred_linear_prob <- attr(predict(SVM_linear, testData, probability = TRUE), "probabilities")[, "yes"]
pred_lineartuned_prob <- predict(SVM_Linear_tuned, testData, type = "prob")[, "yes"]
pred_radial_prob <- attr(predict(SVM_Radial, testData, probability = TRUE), "probabilities")[, "yes"]
pred_poly_prob <- attr(predict(SVM_Poly, testData, probability = TRUE), "probabilities")[, "yes"]

# For Decision Tree (rpart)
DT1_prob <- predict(DT1_model, newdata = testData, type = "prob")[, "yes"]

par(mfrow=c(2,3))
par(mai=c(.3,.3,.3,.3))
roc1 <- roc(testData$y, pred_linear_prob, plot = TRUE,
            print.auc = TRUE, show.thres = TRUE)

## Setting levels: control = yes, case = no

## Setting direction: controls > cases

title(main = "Linear SVM: ROC")

roc2 <- roc(testData$y, pred_lineartuned_prob, plot = TRUE,
            print.auc = TRUE, show.thres = TRUE)

## Setting levels: control = yes, case = no
## Setting direction: controls > cases

title(main = "linear SVM Tuned: ROC")

roc3 <- roc(testData$y, pred_radial_prob, plot = TRUE,
            print.auc = TRUE, show.thres = TRUE)

## Setting levels: control = yes, case = no
## Setting direction: controls > cases

title(main = "Radial SVM : ROC")

roc4 <- roc(testData$y, pred_poly_prob, plot = TRUE,
            print.auc = TRUE, show.thres = TRUE)

## Setting levels: control = yes, case = no
## Setting direction: controls > cases

title(main = "Poly SVM: ROC")

roc5 <- roc(testData$y, DT1_prob, plot = TRUE,
            print.auc = TRUE, show.thres = TRUE)

## Setting levels: control = yes, case = no
## Setting direction: controls > cases

title(main = "Decision Tree: ROC")
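The AUC printed on each panel can be read as the probability that a randomly chosen "yes" case receives a higher score than a randomly chosen "no" case. The base-R sketch below computes it with the rank-sum (Mann-Whitney) identity on made-up scores; auc_rank is a hypothetical helper for illustration and is not part of pROC.

```r
# AUC via the rank-sum identity. Assumes higher scores mean "more likely
# positive"; average ranks are used so tied scores count as half.
auc_rank <- function(score, is_positive) {
  r <- rank(score)
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up scores for three "yes" and three "no" cases.
score  <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
is_yes <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

# One of the nine yes/no pairs is ordered the wrong way (0.35 vs 0.6).
auc_toy <- auc_rank(score, is_yes)
print(auc_toy)
```

Because it only depends on how the scores are ordered, AUC is unaffected by the 0.5 cutoff that drives the confusion-matrix metrics, which is one reason the two kinds of comparison can rank the models differently.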