Read the following articles:
https://www.hindawi.com/journals/complexity/2021/5550344/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/
Search for academic content (at least 3 articles) that compares the use of decision trees vs. SVMs in your current area of expertise. Perform an analysis of the dataset(s) used in Homework #2 using the SVM algorithm. Compare the results with the results from the previous homework. Answer questions such as: Which algorithm is recommended to get more accurate results? Is it better for classification or regression scenarios? Do you agree with the recommendations? Why?
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.5.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(e1071)
##
## Attaching package: 'e1071'
##
## The following object is masked from 'package:ggplot2':
##
## element
library(dplyr)
library(rpart)
library(rpart.plot)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(kernlab)
##
## Attaching package: 'kernlab'
##
## The following object is masked from 'package:purrr':
##
## cross
##
## The following object is masked from 'package:ggplot2':
##
## alpha
The data were imported and preprocessed exactly as in Assignment #2 for consistency.
Bank_data<-read.csv2("https://raw.githubusercontent.com/Andreina-A/Data-622/refs/heads/main/bank-full.csv") #character columns are read in as strings
Bank_data$y<-as.factor(Bank_data$y) #the outcome is converted to a factor for classification
#preprocessing
#removed the "default" variable as it had many unknown values and very low correlation with the subscription outcome.
Bank_data<-Bank_data|>
  select(-default)
head(Bank_data)
## age job marital education balance housing loan contact day month
## 1 58 management married tertiary 2143 yes no unknown 5 may
## 2 44 technician single secondary 29 yes no unknown 5 may
## 3 33 entrepreneur married secondary 2 yes yes unknown 5 may
## 4 47 blue-collar married unknown 1506 yes no unknown 5 may
## 5 33 unknown single unknown 1 no no unknown 5 may
## 6 35 management married tertiary 231 yes no unknown 5 may
## duration campaign pdays previous poutcome y
## 1 261 1 -1 0 unknown no
## 2 151 1 -1 0 unknown no
## 3 76 1 -1 0 unknown no
## 4 92 1 -1 0 unknown no
## 5 198 1 -1 0 unknown no
## 6 139 1 -1 0 unknown no
The same train/test split from Assignment #2 was used here, again for consistency.
set.seed(123)
#Use 80 percent of data for training 20 percent for test
trainIndex<-createDataPartition(Bank_data$y, p=0.8, list=FALSE)
trainData<-Bank_data[trainIndex,]
testData<-Bank_data[-trainIndex,]
#make "yes" the first factor level so that it is treated as the positive class
trainData$y<-relevel(trainData$y, ref = "yes")
testData$y<-relevel(testData$y, ref= "yes")
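createDataPartition draws its sample within each level of the outcome, so the training and test sets keep roughly the same share of "yes" responses. The idea can be sketched in base R on made-up labels. The helper stratified_index and the vector y_toy are hypothetical names used only for illustration, and the class counts are assumed values chosen to resemble the imbalance in the bank data.

```r
# Hypothetical helper: draw a fraction p of the row indices within each class.
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(123)
# Toy outcome with an imbalance similar to the bank data (assumed values).
y_toy <- factor(rep(c("no", "yes"), times = c(880, 120)))
train_idx <- stratified_index(y_toy, p = 0.8)

# The class shares are the same in both partitions.
prop.table(table(y_toy[train_idx]))
prop.table(table(y_toy[-train_idx]))
```

Without stratification, a random 20% test set drawn from an imbalanced outcome could by chance contain noticeably more or fewer "yes" cases than the training set, which would make the test metrics harder to compare.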
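SVM: finds the separating boundary between classes that maximizes the margin, that is, the distance between the boundary and the closest training points of each class.
SVM linear: finds a linear separator (hyperplane) that best divides the data into two classes. It works best when the classes are approximately linearly separable in the feature space, and it serves here as a baseline model.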
SVM_linear<-svm(y~., data = trainData, kernel="linear", probability=TRUE)
summary(SVM_linear)
##
## Call:
## svm(formula = y ~ ., data = trainData, kernel = "linear", probability = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 8187
##
## ( 4279 3908 )
##
##
## Number of Classes: 2
##
## Levels:
## yes no
SVM linear tuned: only the cost parameter C was tuned, using 5-fold cross-validation. The train function from caret was used instead of the svm function because svm is slower on large data. Note that this model also differs from the baseline in that the training data are down-sampled to balance the two classes.
SVM_Linear_tuned <- train(
  y ~ .,
  data = trainData,
  method = "svmLinear",
  trControl = trainControl(
    method = "cv",
    number = 5,
    sampling = "down",
    classProbs = TRUE
  ),
  preProcess = c("center", "scale"),
  tuneGrid = expand.grid(C = c(0.01, 0.1, 1, 10))
)
Best performing cost:
bestlinear<-SVM_Linear_tuned$bestTune
print(bestlinear)
## C
## 1 0.01
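The sampling = "down" option balances the classes by randomly discarding rows of the majority class, so the tuned model differs from the baseline linear SVM in more than its cost value. A minimal base-R sketch of down-sampling on made-up data follows. The helper down_sample and the data frame toy are hypothetical, and caret's internal implementation may differ in detail.

```r
# Hypothetical helper: keep every row of the rarest class and an equally
# sized random subset of each other class.
down_sample <- function(df, outcome) {
  idx_by_class <- split(seq_len(nrow(df)), df[[outcome]])
  n_min <- min(lengths(idx_by_class))
  keep <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = n_min)]
  })
  df[sort(unlist(keep, use.names = FALSE)), , drop = FALSE]
}

set.seed(123)
# Toy data with an imbalance similar to the bank data (assumed values).
toy <- data.frame(
  x = rnorm(1000),
  y = factor(rep(c("no", "yes"), times = c(880, 120)))
)
balanced <- down_sample(toy, "y")
table(balanced$y)  # both classes now have 120 rows
```

A model fit on balanced data no longer gains much from predicting "no" by default, so it tends to predict "yes" more often. That usually raises sensitivity and lowers precision and overall accuracy.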
SVM radial: the radial basis function (RBF) kernel is commonly used when the classes are not linearly separable. It implicitly maps the data into a higher-dimensional space in which a linear separating hyperplane can be found.
SVM_Radial<-svm(y~., data=trainData, kernel="radial", cost=1, gamma=0.1, probability = TRUE, scale=TRUE)
summary(SVM_Radial)
##
## Call:
## svm(formula = y ~ ., data = trainData, kernel = "radial", cost = 1,
## gamma = 0.1, probability = TRUE, scale = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 8242
##
## ( 4445 3797 )
##
##
## Number of Classes: 2
##
## Levels:
## yes no
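For reference, the radial kernel in e1071 measures the similarity of two observations as exp(-gamma * |u - v|^2). The short sketch below evaluates it on made-up points with gamma = 0.1, the value used in the call above. The function name rbf_kernel and the example vectors are illustrative only.

```r
# RBF kernel between two numeric vectors; gamma = 0.1 matches the svm() call.
rbf_kernel <- function(u, v, gamma = 0.1) {
  exp(-gamma * sum((u - v)^2))
}

a <- c(0, 0)
b <- c(1, 1)
far <- c(5, 5)

rbf_kernel(a, a)    # identical points give 1
rbf_kernel(a, b)    # exp(-0.2): nearby points stay close to 1
rbf_kernel(a, far)  # exp(-5): distant points approach 0
```

A larger gamma makes the similarity decay faster with distance, which gives a more flexible boundary that is also more prone to overfitting.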
SVM polynomial: the polynomial kernel also maps the data into a higher-dimensional space, one made up of polynomial combinations of the original features (degree 3 by default).
SVM_Poly<-svm(y~., data=trainData, kernel="poly", cost=1, gamma=0.1, probability = TRUE, scale=TRUE)
summary(SVM_Poly)
##
## Call:
## svm(formula = y ~ ., data = trainData, kernel = "poly", cost = 1,
## gamma = 0.1, probability = TRUE, scale = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## coef.0: 0
##
## Number of Support Vectors: 7871
##
## ( 4160 3711 )
##
##
## Number of Classes: 2
##
## Levels:
## yes no
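The polynomial kernel in e1071 is (gamma * u'v + coef0)^degree. The sketch below evaluates it with the settings reported in the summary above (gamma = 0.1, coef0 = 0, degree = 3). It then uses a simpler degree-2 case to illustrate that a kernel value equals an ordinary dot product in an expanded feature space. The names poly_kernel and phi2 and the example vectors are hypothetical.

```r
# Polynomial kernel with the same defaults as the fitted model.
poly_kernel <- function(u, v, gamma = 0.1, coef0 = 0, degree = 3) {
  (gamma * sum(u * v) + coef0)^degree
}

u <- c(1, 2)
v <- c(3, 4)
poly_kernel(u, v)  # (0.1 * 11)^3 = 1.331

# Degree-2 illustration (gamma = 1, coef0 = 0): explicit feature map for 2-D input.
phi2 <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)

sum(u * v)^2            # kernel computed in the original 2-D space: 121
sum(phi2(u) * phi2(v))  # same value from the explicit 3-D feature map
```

The kernel gives the same result as the dot product in the expanded space without ever building the expanded features, which is what makes higher-degree kernels affordable.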
#linear
pred_linear<- predict(SVM_linear, testData)
confMat_linear<-confusionMatrix(pred_linear,testData$y)
confMat_linear
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 184 98
## no 873 7886
##
## Accuracy : 0.8926
## 95% CI : (0.886, 0.8989)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 0.002341
##
## Kappa : 0.2373
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.17408
## Specificity : 0.98773
## Pos Pred Value : 0.65248
## Neg Pred Value : 0.90033
## Prevalence : 0.11691
## Detection Rate : 0.02035
## Detection Prevalence : 0.03119
## Balanced Accuracy : 0.58090
##
## 'Positive' Class : yes
##
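The No Information Rate in this output is the accuracy obtained by always predicting the majority class. Recomputing it from the counts above shows how little the baseline linear SVM improves on that trivial rule despite its high accuracy. The sketch uses only the four counts printed in the confusion matrix; the variable names are illustrative.

```r
# Counts copied from the linear SVM confusion matrix above
# (rows = prediction, columns = reference).
cm <- matrix(c(184, 873, 98, 7886), nrow = 2,
             dimnames = list(Prediction = c("yes", "no"),
                             Reference = c("yes", "no")))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n                       # correct predictions / all cases
nir <- max(colSums(cm)) / n                         # share of the majority class
sensitivity <- cm["yes", "yes"] / sum(cm[, "yes"])  # detected "yes" / actual "yes"

round(c(accuracy = accuracy,
        no_information_rate = nir,
        sensitivity = sensitivity), 4)
```

Accuracy is about 0.8926 against a no-information rate of about 0.8831, while only about 17% of the actual subscribers are detected. This is why sensitivity, precision, F1, and ROC are reported alongside accuracy for this imbalanced outcome.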
# Linear SVM Tuned
pred_lineartuned<- predict(SVM_Linear_tuned, testData)
confMat_linear_tuned<-confusionMatrix(pred_lineartuned,testData$y)
confMat_linear_tuned
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 826 1136
## no 231 6848
##
## Accuracy : 0.8488
## 95% CI : (0.8412, 0.8561)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4661
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.78146
## Specificity : 0.85772
## Pos Pred Value : 0.42100
## Neg Pred Value : 0.96737
## Prevalence : 0.11691
## Detection Rate : 0.09136
## Detection Prevalence : 0.21701
## Balanced Accuracy : 0.81959
##
## 'Positive' Class : yes
##
#radial
pred_radial<- predict(SVM_Radial, testData)
confMat_radial<-confusionMatrix(pred_radial,testData$y)
confMat_radial
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 344 170
## no 713 7814
##
## Accuracy : 0.9023
## 95% CI : (0.896, 0.9084)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 2.943e-09
##
## Kappa : 0.3914
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.32545
## Specificity : 0.97871
## Pos Pred Value : 0.66926
## Neg Pred Value : 0.91638
## Prevalence : 0.11691
## Detection Rate : 0.03805
## Detection Prevalence : 0.05685
## Balanced Accuracy : 0.65208
##
## 'Positive' Class : yes
##
#poly
pred_poly<- predict(SVM_Poly, testData)
confMat_poly<-confusionMatrix(pred_poly,testData$y)
confMat_poly
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 326 164
## no 731 7820
##
## Accuracy : 0.901
## 95% CI : (0.8947, 0.9071)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 3.203e-08
##
## Kappa : 0.3752
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.30842
## Specificity : 0.97946
## Pos Pred Value : 0.66531
## Neg Pred Value : 0.91451
## Prevalence : 0.11691
## Detection Rate : 0.03606
## Detection Prevalence : 0.05420
## Balanced Accuracy : 0.64394
##
## 'Positive' Class : yes
##
#same seed 123 was set in the beginning of this assignment.
DT1_model<-rpart(y ~ ., data=trainData, method = "class")
rpart.plot(DT1_model, main="Decision Tree: Baseline")
DT1_predictions<-predict(DT1_model, newdata = testData, type = "class")
confusionMatrix(DT1_predictions, testData$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 350 217
## no 707 7767
##
## Accuracy : 0.8978
## 95% CI : (0.8914, 0.904)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 5.031e-06
##
## Kappa : 0.3805
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.33113
## Specificity : 0.97282
## Pos Pred Value : 0.61728
## Neg Pred Value : 0.91657
## Prevalence : 0.11691
## Detection Rate : 0.03871
## Detection Prevalence : 0.06271
## Balanced Accuracy : 0.65197
##
## 'Positive' Class : yes
##
#evaluation helper: collects accuracy, precision, recall, and F1 for one model
evaluate_model<-function(model_predictions, true_labels, model_name){
  cm<- confusionMatrix(model_predictions, true_labels, positive="yes")
  list(model=model_name,
       accuracy= cm$overall["Accuracy"],
       precision=cm$byClass["Precision"],
       recall=cm$byClass["Recall"],
       f1_score=cm$byClass["F1"]
  )
}
results<-bind_rows(
  evaluate_model(pred_linear, testData$y, "SVM Linear"),
  evaluate_model(pred_lineartuned, testData$y, "SVM linear Tuned"),
  evaluate_model(pred_radial, testData$y, "SVM Radial"),
  evaluate_model(pred_poly, testData$y, "SVM poly"),
  evaluate_model(DT1_predictions, testData$y, "Decision Tree 1")
)
print(results)
## # A tibble: 5 × 5
## model accuracy precision recall f1_score
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 SVM Linear 0.893 0.652 0.174 0.275
## 2 SVM linear Tuned 0.849 0.421 0.781 0.547
## 3 SVM Radial 0.902 0.669 0.325 0.438
## 4 SVM poly 0.901 0.665 0.308 0.421
## 5 Decision Tree 1 0.898 0.617 0.331 0.431
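Each row of this table can be recomputed by hand from the corresponding confusion matrix. As a worked example, the sketch below reproduces the "SVM linear Tuned" row from its four counts (826 true positives, 1136 false positives, 231 false negatives, 6848 true negatives). The variable names are illustrative.

```r
# Counts from the tuned linear SVM confusion matrix, with "yes" as the positive class.
tp <- 826   # predicted yes, actually yes
fp <- 1136  # predicted yes, actually no
fn <- 231   # predicted no, actually yes
tn <- 6848  # predicted no, actually no

precision <- tp / (tp + fp)  # share of predicted "yes" that are correct
recall    <- tp / (tp + fn)  # share of actual "yes" that are found
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (tp + tn) / (tp + fp + fn + tn)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 3)
```

The result matches the table (0.849, 0.421, 0.781, 0.547). It also makes the trade-off explicit: the model finds most subscribers, but more than half of its "yes" predictions are false alarms.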
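The ROC curves show which models perform best across classification thresholds. The tuned linear SVM had the highest ROC AUC, with the radial SVM almost as good. On the test set, the radial SVM had the best accuracy (0.902) and precision (0.669), while the tuned linear SVM, which was trained with down-sampling, had by far the best recall (0.781) and F1 score (0.547) at the cost of the lowest accuracy (0.849) and precision (0.421). The untuned linear SVM had the lowest recall (0.174) and F1 score (0.275). The decision tree had the lowest ROC AUC, and its accuracy (0.898) and precision (0.617) were below those of the radial and polynomial SVMs, although both were higher than those of the tuned linear SVM, and its recall (0.331) was slightly above that of the radial SVM (0.325).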
# Get probability predictions for the "yes" class
pred_linear_prob <- attr(predict(SVM_linear, testData, probability = TRUE), "probabilities")[, "yes"]
pred_lineartuned_prob <- predict(SVM_Linear_tuned, testData, type = "prob")[, "yes"]
pred_radial_prob <- attr(predict(SVM_Radial, testData, probability = TRUE), "probabilities")[, "yes"]
pred_poly_prob <- attr(predict(SVM_Poly, testData, probability = TRUE), "probabilities")[, "yes"]
# For Decision Tree (rpart)
DT1_prob <- predict(DT1_model, newdata = testData, type = "prob")[, "yes"]
par(mfrow=c(2,3))
par(mai=c(.3,.3,.3,.3))
roc1 <- roc(testData$y, pred_linear_prob, plot = TRUE,
print.auc = TRUE, show.thres = TRUE)
## Setting levels: control = yes, case = no
## Setting direction: controls > cases
title(main = "Linear SVM: ROC")
roc2 <- roc(testData$y, pred_lineartuned_prob, plot = TRUE,
print.auc = TRUE, show.thres = TRUE)
## Setting levels: control = yes, case = no
## Setting direction: controls > cases
title(main = "linear SVM Tuned: ROC")
roc3 <- roc(testData$y, pred_radial_prob, plot = TRUE,
print.auc = TRUE, show.thres = TRUE)
## Setting levels: control = yes, case = no
## Setting direction: controls > cases
title(main = "Radial SVM : ROC")
roc4 <- roc(testData$y, pred_poly_prob, plot = TRUE,
print.auc = TRUE, show.thres = TRUE)
## Setting levels: control = yes, case = no
## Setting direction: controls > cases
title(main = "Poly SVM: ROC")
roc5 <- roc(testData$y, DT1_prob, plot = TRUE,
print.auc = TRUE, show.thres = TRUE)
## Setting levels: control = yes, case = no
## Setting direction: controls > cases
title(main = "Decision Tree: ROC")
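The AUC printed on each plot can be read as the probability that a randomly chosen "yes" case receives a higher predicted probability than a randomly chosen "no" case. The base-R sketch below computes that quantity from ranks on made-up scores. The helper auc_rank and the toy vectors are hypothetical and are not the pROC implementation.

```r
# Hypothetical helper: AUC as the share of (positive, negative) pairs in which
# the positive case has the higher score; ties count as one half.
auc_rank <- function(scores, labels, positive = "yes") {
  is_pos <- labels == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(scores)  # average ranks handle ties
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

scores <- c(0.9, 0.8, 0.3, 0.6, 0.2, 0.1)  # toy predicted P(yes)
labels <- c("yes", "yes", "yes", "no", "no", "no")

auc_rank(scores, labels)  # 8 of the 9 yes/no pairs are ordered correctly
```

The messages "control = yes, case = no" appear because "yes" was made the first factor level earlier. pROC then picks the comparison direction automatically, so the AUC values should be unaffected. Passing levels = c("no", "yes") to roc would state the intended roles explicitly.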