Part I: Assignment

Instructions
Perform an analysis of the dataset used in Homeworks 1 and 2 using the SVM algorithm.
Compare the results with the results from the previous homeworks:

Read the following articles:

To do’s:

PART II: Essay

| Model | Y=0 | Y=1 | TP | FN | FP | TN | Accuracy | Sensitivity | Specificity | Bal. Acc. | Precision | F1 | AUC-PR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Decision Tree | | | | | | | | | | | | | |
| Initial Decision Tree (DT) | 39822 | 5289 | 364 | 706 | 239 | 7733 | 0.8955 | 0.9700 | 0.3402 | 0.6551 | 0.934 | 0.945 | |
| SMOTE #1 (K=5, dup_size=1) & DT | 31950 | 8438 | 497 | 573 | 409 | 7563 | 0.8914 | 0.9487 | 0.4645 | 0.7066 | | | |
| SMOTE #2 (K=5, dup_size=2) & DT | 31950 | 12657 | 577 | 493 | 510 | 7462 | 0.8891 | 0.9360 | 0.5393 | 0.7376 | | | |
| Hyperparameter tuning #1 (cp=0.05) | 31950 | 12657 | 628 | 442 | 700 | 7272 | 0.8737 | 0.9122 | 0.5869 | 0.7496 | | | |
| Hyperparameter tuning #2 (weight) | 31950 | 12657 | 901 | 169 | 1851 | 6121 | 0.7766 | 0.7678 | 0.8421 | 0.8049 | 0.973 | 0.858 | 0.37 |
| Random Forest | | | | | | | | | | | | | |
| Initial Random Forest (RF) | 39822 | 5289 | 410 | 660 | 212 | 7760 | 0.9036 | 0.9734 | 0.3832 | 0.6783 | | | |
| SMOTE #1 (K=5, dup_size=1) & RF | 31950 | 8438 | 445 | 625 | 234 | 7738 | 0.9050 | 0.9706 | 0.4159 | 0.6933 | | | |
| SMOTE #2 (K=5, dup_size=2) & RF | 31950 | 12657 | 448 | 622 | 249 | 7723 | 0.9037 | 0.9688 | 0.4187 | 0.6937 | | | |
| Hyperparameter tuning #1 (weight) | 31950 | 12657 | 530 | 540 | 371 | 7601 | 0.8992 | 0.9535 | 0.4953 | 0.7244 | | | |
| Hyperparameter tuning #2 (weight) | 31950 | 12657 | 537 | 533 | 395 | 7577 | 0.8974 | 0.9505 | 0.5019 | 0.7319 | | | |
| ADA Boost | | | | | | | | | | | | | |
| Initial ADA Boost (ADA) | 39822 | 5289 | 236 | 807 | 134 | 7838 | 0.8959 | 0.9832 | 0.2458 | 0.6145 | | | |
| SMOTE #1 (K=5, dup_size=1) & ADA | 31950 | 8438 | 461 | 609 | 305 | 7667 | 0.8989 | 0.9617 | 0.4308 | 0.6963 | | | |
| SMOTE #2 (K=5, dup_size=2) & ADA | 31950 | 12657 | 520 | 550 | 377 | 7595 | 0.8975 | 0.9527 | 0.4860 | 0.7193 | | | |
| Hyperparameter tuning #1 (weight 4,1) | 31950 | 12657 | 507 | 563 | 333 | 7639 | 0.9009 | 0.9582 | 0.4738 | 0.7160 | | | |
| Hyperparameter tuning #2 (weight 5,1) | 31950 | 12657 | 507 | 563 | 333 | 7639 | 0.9009 | 0.9582 | 0.4738 | 0.7160 | | | |
| SVM | | | | | | | | | | | | | |
| Initial SVM | 39822 | 5289 | 348 | 722 | 165 | 7807 | 0.9019 | 0.9793 | 0.3252 | 0.6523 | 0.915 | 0.946 | |
| SMOTE #1 (K=5, dup_size=1) & Radial | 31950 | 8438 | 465 | 605 | 270 | 7702 | 0.9032 | 0.9661 | 0.4346 | 0.7004 | 0.927 | 0.946 | |
| SMOTE #2 (K=5, dup_size=2) & Radial | 31950 | 12657 | 533 | 537 | 337 | 7635 | 0.9033 | 0.9577 | 0.4981 | 0.7279 | 0.934 | 0.946 | |
| Kernel Adjustment (Linear) | 31950 | 12657 | 696 | 374 | 616 | 7356 | 0.8905 | 0.9227 | 0.6505 | 0.7866 | 0.951 | 0.937 | 0.56 |
| Hyperparameter tuning #2 (C=0.1) | 31950 | 12657 | 696 | 374 | 615 | 7357 | 0.8906 | 0.9229 | 0.6505 | 0.7867 | | | |
| Hyperparameter tuning #3 (C=1.0) | 31950 | 12657 | 696 | 374 | 616 | 7356 | 0.8905 | 0.9227 | 0.6505 | 0.7866 | 0.952 | 0.937 | |

Assignment 1: Results Summarized/EDA
EDA was performed in the previous assignment (https://rpubs.com/greggmaloy/1275261). The EDA from that assignment is summarized below:

There was considerable class imbalance in the target variable (y): ~11% of clients subscribed, while ~88% did not. There are seven numerical and ten categorical variables in the dataset. Most numerical features were right-skewed, and many had outliers, detected via IQR and scatterplots. There were no strong linear relationships between features, with most variables showing either very weak or no correlations in the correlation matrix. There was no missing data.

What the EDA Means for the Decision Tree, Random Forest, ADA Boost & SVM Models
Since SVM and ADA Boost cannot handle categorical variables directly, and since decision tree and random forest models can accommodate one-hot encoded features, categorical variables were transformed via one-hot encoding. This transformation standardizes the dataset across all four models.
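A minimal sketch of the encoding step, using the same fastDummies call as the code in Part IV:

library(readr)
library(fastDummies)

# Read the semicolon-delimited bank data, then one-hot encode every character column,
# dropping the first dummy level of each variable and the original columns
bank_data <- read_csv2("https://raw.githubusercontent.com/greggmaloy/Data622/main/bank-full.csv", show_col_types = FALSE)
categorical_cols <- names(bank_data)[sapply(bank_data, is.character)]
bank_data_encoded <- fastDummies::dummy_cols(bank_data,
                                             select_columns = categorical_cols,
                                             remove_first_dummy = TRUE,
                                             remove_selected_columns = TRUE)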

Assignment 2: Summarized Results for the DT, RF, and ADA Boost Models
Assignment 2 concluded with the final decision tree as the preferred model. That model included ~7,000 rows of SMOTE-generated data, an adjusted complexity parameter (5%), and increased weight assigned to the minority class of the target variable. In deciding which model was preferred for Assignment 2, we assumed the business had unlimited resources and would therefore prefer a model that maximizes true positives while minimizing false negatives. As a result, even though accuracy and sensitivity decreased compared to the initial model, the increase in true positives in the final decision tree expanded the pool of potential clients that could be contacted, increasing the likelihood of successful subscription calls.
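For reference, the "weight" adjustment in that final tree is implemented in the Part IV code through rpart's class priors rather than case weights; a minimal sketch (trainData_balanced is the SMOTE-balanced training set):

library(rpart)

# A prior of 0.6 on the minority class ("1") pushes the tree toward predicting subscribers
dt_weighted <- rpart(y_yes ~ .,
                     data = trainData_balanced,
                     method = "class",
                     parms = list(prior = c(0.4, 0.6)),
                     control = rpart.control(minsplit = 20, cp = 0.01))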

Assignment 3: Initial Models
In the present assignment, SVM was utilized and the results were compared to the decision tree, random forest, and ADA Boost models. The initial SVM results were comparable to the initial decision tree and initial random forest models across all metrics. However, the ADA Boost model had a markedly lower specificity than the other three models.
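A minimal sketch of the initial SVM fit, mirroring Part IV (scaled_train and scaled_test are the standardized, one-hot-encoded training and test frames):

library(e1071)
library(caret)

# RBF-kernel SVM; probability = TRUE so class probabilities can be extracted later
svm_model <- svm(y_yes ~ ., data = scaled_train, kernel = "radial", probability = TRUE)
svm_predictions <- predict(svm_model, newdata = scaled_test)
confusionMatrix(svm_predictions, scaled_test$y_yes)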

Assignment 3: Utilization of SMOTE
As with Assignment 2, SMOTE was utilized to generate ~7,000 rows of synthetic data for the SVM model in an attempt to address the class imbalance in the target variable. Results were comparable between the SVM, DT, and ADA Boost models, with all three experiencing negligible decreases in accuracy and sensitivity alongside larger increases in specificity and balanced accuracy. Of note, the RF model did not show as large an increase in specificity and balanced accuracy. This may be because, as an ensemble model, it inherently addressed some of the class imbalance.
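A minimal sketch of the SMOTE step (smotefamily; K = 5 nearest neighbours, and dup_size controls how many synthetic rows are generated per original minority row, so dup_size = 1 roughly doubles the minority class):

library(smotefamily)

# Generate synthetic minority-class rows on the one-hot-encoded training set
smote_data <- SMOTE(trainData[, setdiff(names(trainData), "y_yes")],
                    trainData$y_yes,
                    K = 5, dup_size = 1)
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"   # SMOTE renames the target column
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)
table(trainData_balanced$y_yes)   # inspect the new class distribution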

Assignment 3: Kernel Adjustment
Initially, the SVM model was run with a radial basis function (RBF) kernel based on findings from Assignment 1, which found a lack of substantial linear relationships. However, after applying SMOTE to address class imbalance and one-hot encoding to standardize the categorical variables, a linear kernel was introduced to reassess whether the data might now support a more linear relationship. Surprisingly, the linear kernel outperformed the RBF kernel, yielding higher specificity and balanced accuracy. This improvement may have been due to the one-hot encoding expanding the feature space, creating a clearer decision boundary. The kernel adjustment increased specificity by ~15 percentage points while lowering sensitivity by only ~3.
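A minimal sketch of the kernel change (same SMOTE-balanced, scaled training frame scaled_train_df as in Part IV; only the kernel argument differs):

library(e1071)

# Swap the RBF kernel for a linear one; everything else stays the same
svm_linear <- svm(y_yes ~ ., data = scaled_train_df, kernel = "linear", probability = TRUE)
svm_predictions <- predict(svm_linear, newdata = scaled_test_df)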

Assignment 3: Cost Parameter Adjustment
After the kernel adjustment, the cost parameter was adjusted to control how strongly the SVM penalizes misclassifications. The results across two different settings (C = 0.1, C = 1.0) were nearly identical to the model that introduced the linear kernel. This suggests that the model's performance was not sensitive to the cost parameter, possibly due to bias introduced by SMOTE or the one-hot encoding. A higher value (C = 10) was attempted, but R threw a warning (reaching max number of iterations).
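A minimal sketch of the cost adjustment (cost is the C penalty on margin violations; larger values fit the training data more tightly):

library(e1071)

# Softer vs. default margin; both settings produced nearly identical confusion matrices
svm_c_low <- svm(y_yes ~ ., data = scaled_train_df, kernel = "linear", cost = 0.1, probability = TRUE)
svm_c_one <- svm(y_yes ~ ., data = scaled_train_df, kernel = "linear", cost = 1.0, probability = TRUE)
# cost = 10 was also attempted (see the commented-out block in Part IV) but triggered the
# "reaching max number of iterations" warning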

Assignment 3: Final Model Choice

FINAL TWO MODELS

| Model | Y=0 | Y=1 | TP | FN | FP | TN | Accuracy | Sensitivity | Specificity | Bal. Acc. | Precision | F1 | AUC-PR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DT Hyperparameter tuning #2 (SMOTE & weight) | 31950 | 12657 | 901 | 169 | 1851 | 6121 | 0.7766 | 0.7678 | 0.8421 | 0.8049 | 0.973 | 0.858 | 0.37 |
| SVM Kernel Adjustment (Linear & SMOTE) | 31950 | 12657 | 696 | 374 | 616 | 7356 | 0.8905 | 0.9227 | 0.6505 | 0.7866 | 0.951 | 0.937 | 0.56 |

The final model selected for this assignment is the SVM with SMOTE and a linear kernel. Although the decision tree model outperformed on some metrics, such as precision, balanced accuracy, and specificity, the final decision tree also saw a massive increase in the number of false positives (FP = 1851), dramatically lower sensitivity (0.7678), and lower accuracy (0.7766). The SVM had the higher AUC-PR (0.56 vs 0.37), indicating it struck a better balance between true positives and false positives. Additionally, the SVM had the higher F1 score of the two models (0.937 vs 0.858), indicating a better balance between precision and recall.

In terms of business use, although we assumed the business had unlimited resources, the number of false positives generated by the decision tree model was approximately three times that of the SVM model (FP = 1851 vs 616). This would translate into significant operational overhead. In contrast, the SVM model maintained a strong true positive rate while keeping false positives comparatively low, making it more cost-effective for subscriber outreach.
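The AUC-PR values cited above were computed with the PRROC package; a minimal sketch, assuming svm_linear is the fitted linear-kernel SVM and scaled_test_df the scaled test frame from Part IV:

library(PRROC)

# Predicted probability of the positive class ("1"), then split the scores by true label
svm_probs <- attr(predict(svm_linear, newdata = scaled_test_df, probability = TRUE),
                  "probabilities")[, "1"]
labels <- as.numeric(as.character(scaled_test_df$y_yes))
pr <- pr.curve(scores.class0 = svm_probs[labels == 1],
               scores.class1 = svm_probs[labels == 0],
               curve = TRUE)
pr$auc.integral   # area under the precision-recall curve
plot(pr, main = "Precision-Recall Curve for SVM (Linear)")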

PART III: Lit Review & Relevance to Interest

Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection: A Comparative Study
The article “Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection: A Comparative Study” by Amir Ahmad et al. investigated the effectiveness of various decision tree-based models in predicting COVID-19 infection. More specifically, the authors compared standard ensemble methods such as Random Forest, AdaBoost, and XGBoost against specialized ensemble methods designed for imbalanced data, such as Balanced Random Forest and SMOTEBoost. Similar to the bank dataset, the data used in this study suffered from class imbalance, with relatively few positive cases in the target variable. To address this, the authors applied both SMOTE and random undersampling. The models were evaluated using AUROC and AUPRC. The main finding was that ensemble methods designed for imbalanced data outperform standard ensemble techniques. Of note, SVM was not used in this study, although SMOTE was.

A Novel Approach to Predict COVID-19 Using Support Vector Machine
In “A Novel Approach to Predict COVID-19 Using Support Vector Machine,” the authors utilized SVM, logistic regression, KNN, Naive Bayes, and random forest to predict the severity of COVID-19 infection from symptom features. Among these models, SVM achieved the highest accuracy. Unlike the study by Amir Ahmad et al., this article did not utilize SMOTE or any other rebalancing technique to address class imbalance. Additionally, the authors relied primarily on classification accuracy as the evaluation metric, whereas Amir Ahmad et al. used AUROC, AUPRC, and F1-score.

Predicting metabolic syndrome using decision tree and support vector machine methods
The study “Predicting metabolic syndrome using decision tree and support vector machine methods” aimed to predict the incidence of metabolic syndrome. The study utilized two models: an SVM (polynomial kernel) and a decision tree. To address the class imbalance in the dataset, the authors used SMOTE. Evaluation metrics included sensitivity, specificity, and accuracy. Ultimately, the SVM outperformed the decision tree on all three metrics.

A comparative study of decision tree and support vector machine for breast cancer prediction
In the article “A comparative study of decision tree and support vector machine for breast cancer prediction,” the authors aimed to improve breast cancer diagnosis by comparing SVM models to decision tree models. Accuracy, sensitivity, specificity, precision, and AUC were used to evaluate the models. In the end, SVM outperformed the decision tree models across all metrics, although the results were fairly comparable. These results suggest that SVM provides a slightly more accurate model. Of note, although there was some mild class imbalance present in the dataset, there was no attempt at imbalance correction, such as SMOTE.

Utility of support vector machine and decision tree to identify the prognosis of metformin poisoning in the United States: Analysis of National Poisoning Data System
In this article, the authors aimed to predict the prognosis of metformin poisoning using SVM and decision tree models to classify outcomes as minor, moderate, or major. SVM outperformed the decision tree in all evaluation metrics, including accuracy, precision, recall, F1-score, ROC-AUC, and PR-AUC. Of note, this dataset also suffered from class imbalance; the authors, however, did not utilize techniques such as SMOTE to address it.

| Article | Imbalance Present | Imbalance Addressed | Metrics | Preferred Model(s) |
|---|---|---|---|---|
| Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection | yes | yes (various, incl. SMOTE) | AUROC, AUPRC | Ensemble methods for imbalanced data |
| A Novel Approach to Predict COVID-19 Using SVM | yes | no | accuracy | SVM |
| Predicting metabolic syndrome using DT and SVM methods | yes | yes (SMOTE) | sensitivity, specificity, accuracy | SVM |
| A comparative study of DT and SVM for breast cancer prediction | mild | no | sensitivity, specificity, accuracy, precision, AUC | SVM |
| Utility of SVM and DT: prognosis of metformin poisoning | yes | no | sensitivity, accuracy, precision, F1, AUC | SVM |

Discussion
As a healthcare professional, I find the above articles extremely useful. All of them allude to a similar underlying data issue that frequently affects healthcare data: class imbalance in the target variable. The target variable in healthcare is often a disease or condition that is relatively uncommon compared to the general population. Two of the five articles employed SMOTE to help remedy this imbalance. Furthermore, in four of the five articles, SVM models outperformed the other models (usually decision trees). Of note, the five articles also utilized similar evaluation metrics.

PART III: References

Ahmad, A., Safi, O., Malebary, S., Alesawi, S., & Alkayal, E. (2021). Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection: A Comparative Study. Complexity, 2021. https://doi.org/10.1155/2021/5550344

Guhathakurata, S., Kundu, S., Chakraborty, A., & Banerjee, J. S. (2021). A novel approach to predict COVID-19 using support vector machine. In Data Science for COVID-19 (pp. 351–364). Elsevier. https://doi.org/10.1016/B978-0-12-824536-1.00014-9

Karimi-Alavijeh, F., Jalili, S., & Sadeghi, M. (2016). Predicting metabolic syndrome using decision tree and support vector machine methods. ARYA Atherosclerosis, 12(3), 146–152. https://pmc.ncbi.nlm.nih.gov/articles/PMC5055373/

Mehrpour, O., Saeedi, F., Hoyte, C., Goss, F., & Shirazi, F. M. (2022). Utility of support vector machine and decision tree to identify the prognosis of metformin poisoning in the United States: Analysis of National Poisoning Data System. BMC Pharmacology and Toxicology, 23(1), 49. https://doi.org/10.1186/s40360-022-00588-0

Ogbe, M. I., Nzeanorue, C. C., Olusola, R. A., Olofin, D. O., Owoeye, M. C., Enabulele, E. C., Ibijola, A. P., Ifechukwu, C. J., & Ayo, O. I. (2024). A comparative study of decision tree and support vector machine for breast cancer prediction. World Journal of Advanced Research and Reviews, 23(1), 746–752. https://doi.org/10.30574/wjarr.2024.23.1.2024

PART IV: Code

########################################## ONLY CODE BELOW###############################################################

library(tidyverse)   
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)      
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(fastDummies) 
library(rpart)       
library(rpart.plot)  
library(smotefamily)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(mlbench)
library(e1071)

bank_data <- read_csv2("https://raw.githubusercontent.com/greggmaloy/Data622/main/bank-full.csv", show_col_types = FALSE)
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
# Categorical variables
categorical_cols <- names(bank_data)[sapply(bank_data, is.character)]

# One-hot encoding 
bank_data_encoded <- fastDummies::dummy_cols(bank_data, select_columns = categorical_cols, 
                                             remove_first_dummy = TRUE, remove_selected_columns = TRUE)

# Split data 80 training/ 20 testing 
set.seed(42)
trainIndex <- createDataPartition(bank_data_encoded$y_yes, p = 0.8, list = FALSE)  
trainData <- bank_data_encoded[trainIndex, ]
testData <- bank_data_encoded[-trainIndex, ]

# factorize target
trainData$y_yes <- factor(trainData$y_yes, levels = c(0, 1))
testData$y_yes <- factor(testData$y_yes, levels = c(0, 1))

set.seed(123)
# BEFORE SMOTE
# Scale features
scaled_train <- trainData
scaled_test <- testData

# Remove target for scaling
train_features <- scaled_train[, setdiff(names(scaled_train), "y_yes")]
test_features <- scaled_test[, setdiff(names(scaled_test), "y_yes")]

# Standardize
scaled_train_scaled <- scale(train_features)
scaled_test_scaled <- scale(test_features, center = attr(scaled_train_scaled, "scaled:center"), 
                            scale = attr(scaled_train_scaled, "scaled:scale"))

# Re-add target variable
scaled_train <- data.frame(scaled_train_scaled)
scaled_train$y_yes <- trainData$y_yes

scaled_test <- data.frame(scaled_test_scaled)
scaled_test$y_yes <- testData$y_yes

# Train SVM with radial 
svm_model <- svm(y_yes ~ ., data = scaled_train, kernel = "radial", probability = TRUE)

# Predictions
svm_predictions <- predict(svm_model, newdata = scaled_test)

# Confusion Matrix
conf_matrix_svm <- confusionMatrix(svm_predictions, scaled_test$y_yes)


precision <- conf_matrix_svm$byClass["Precision"]
recall <- conf_matrix_svm$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)

cat("Precision:", precision, "\n")
## Precision: 0.9153476
cat("Recall:", recall, "\n")
## Recall: 0.9793026
cat("F1 Score:", f1, "\n")
## F1 Score: 0.9462457
print(conf_matrix_svm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7807  722
##          1  165  348
##                                          
##                Accuracy : 0.9019         
##                  95% CI : (0.8956, 0.908)
##     No Information Rate : 0.8817         
##     P-Value [Acc > NIR] : 5.443e-10      
##                                          
##                   Kappa : 0.3931         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9793         
##             Specificity : 0.3252         
##          Pos Pred Value : 0.9153         
##          Neg Pred Value : 0.6784         
##              Prevalence : 0.8817         
##          Detection Rate : 0.8634         
##    Detection Prevalence : 0.9433         
##       Balanced Accuracy : 0.6523         
##                                          
##        'Positive' Class : 0              
## 
# Load SVM library

#install.packages("e1071")
#SMOTE 1!!!!!!!!!!!!!!!!!!

set.seed(123)

# Ensure target is factor
trainData$y_yes <- as.factor(trainData$y_yes)
testData$y_yes <- as.factor(testData$y_yes)

# SMOTE - balance the training set
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], 
                    trainData$y_yes, 
                    K = 5, dup_size = 1)

# Create SMOTE-balanced dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)


print(table(trainData_balanced$y_yes))
## 
##     0     1 
## 31950  8438
# Standardize numeric features after SMOTE
train_features <- trainData_balanced[, setdiff(names(trainData_balanced), "y_yes")]
test_features <- testData[, setdiff(names(testData), "y_yes")]

# Scale training and testing 
scaled_train <- scale(train_features)
scaled_test <- scale(test_features, 
                     center = attr(scaled_train, "scaled:center"), 
                     scale = attr(scaled_train, "scaled:scale"))

# Combine scaled features with target variable
scaled_train_df <- data.frame(scaled_train)
scaled_train_df$y_yes <- trainData_balanced$y_yes

scaled_test_df <- data.frame(scaled_test)
scaled_test_df$y_yes <- testData$y_yes

# Train SVM radial kernel
svm_model <- svm(y_yes ~ ., data = scaled_train_df, 
                 kernel = "radial", probability = TRUE)

# Predictions
svm_predictions <- predict(svm_model, newdata = scaled_test_df)

# Confusion
conf_matrix_svm <- confusionMatrix(svm_predictions, scaled_test_df$y_yes)

precision <- conf_matrix_svm$byClass["Precision"]
recall <- conf_matrix_svm$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)

cat("Precision:", precision, "\n")
## Precision: 0.9271699
cat("Recall:", recall, "\n")
## Recall: 0.9661315
cat("F1 Score:", f1, "\n")
## F1 Score: 0.9462498
print(conf_matrix_svm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7702  605
##          1  270  465
##                                           
##                Accuracy : 0.9032          
##                  95% CI : (0.8969, 0.9092)
##     No Information Rate : 0.8817          
##     P-Value [Acc > NIR] : 3.782e-11       
##                                           
##                   Kappa : 0.4635          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9661          
##             Specificity : 0.4346          
##          Pos Pred Value : 0.9272          
##          Neg Pred Value : 0.6327          
##              Prevalence : 0.8817          
##          Detection Rate : 0.8518          
##    Detection Prevalence : 0.9187          
##       Balanced Accuracy : 0.7004          
##                                           
##        'Positive' Class : 0               
## 
#SMOTE 2!!!!!!!!!!!!!!!!!!

set.seed(123)

#ensure target is factor
trainData$y_yes <- as.factor(trainData$y_yes)
testData$y_yes <- as.factor(testData$y_yes)

# SMOTE dup_size=2
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], 
                    trainData$y_yes, 
                    K = 5, dup_size = 2)

# create SMOTE dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)


print(table(trainData_balanced$y_yes))
## 
##     0     1 
## 31950 12657
# Standardize numeric features after SMOTE
train_features <- trainData_balanced[, setdiff(names(trainData_balanced), "y_yes")]
test_features <- testData[, setdiff(names(testData), "y_yes")]

# Scale training and testing 
scaled_train <- scale(train_features)
scaled_test <- scale(test_features, 
                     center = attr(scaled_train, "scaled:center"), 
                     scale = attr(scaled_train, "scaled:scale"))

# Combine scaled features with target variable
scaled_train_df <- data.frame(scaled_train)
scaled_train_df$y_yes <- trainData_balanced$y_yes

scaled_test_df <- data.frame(scaled_test)
scaled_test_df$y_yes <- testData$y_yes

# Train SVM radial kernel
svm_model <- svm(y_yes ~ ., data = scaled_train_df, 
                 kernel = "radial", probability = TRUE)

# Predictions
svm_predictions <- predict(svm_model, newdata = scaled_test_df)

# Confusion
conf_matrix_svm <- confusionMatrix(svm_predictions, scaled_test_df$y_yes)


precision <- conf_matrix_svm$byClass["Precision"]
recall <- conf_matrix_svm$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)

cat("Precision:", precision, "\n")
## Precision: 0.9342878
cat("Recall:", recall, "\n")
## Recall: 0.957727
cat("F1 Score:", f1, "\n")
## F1 Score: 0.9458622
print(conf_matrix_svm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7635  537
##          1  337  533
##                                           
##                Accuracy : 0.9033          
##                  95% CI : (0.8971, 0.9094)
##     No Information Rate : 0.8817          
##     P-Value [Acc > NIR] : 3.004e-11       
##                                           
##                   Kappa : 0.496           
##                                           
##  Mcnemar's Test P-Value : 1.682e-11       
##                                           
##             Sensitivity : 0.9577          
##             Specificity : 0.4981          
##          Pos Pred Value : 0.9343          
##          Neg Pred Value : 0.6126          
##              Prevalence : 0.8817          
##          Detection Rate : 0.8444          
##    Detection Prevalence : 0.9038          
##       Balanced Accuracy : 0.7279          
##                                           
##        'Positive' Class : 0               
## 
set.seed(123)
# Train with linear
svm_model <- svm(y_yes ~ ., data = scaled_train_df, 
                 kernel = "linear", probability = TRUE)

# Prediction
svm_predictions <- predict(svm_model, newdata = scaled_test_df)

# Confusion 
conf_matrix_svm <- confusionMatrix(svm_predictions, scaled_test_df$y_yes)
precision <- conf_matrix_svm$byClass["Precision"]
recall <- conf_matrix_svm$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)

cat("Precision:", precision, "\n")
## Precision: 0.9516171
cat("Recall:", recall, "\n")
## Recall: 0.9227296
cat("F1 Score:", f1, "\n")
## F1 Score: 0.9369507
print(conf_matrix_svm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7356  374
##          1  616  696
##                                           
##                Accuracy : 0.8905          
##                  95% CI : (0.8839, 0.8969)
##     No Information Rate : 0.8817          
##     P-Value [Acc > NIR] : 0.00449         
##                                           
##                   Kappa : 0.5221          
##                                           
##  Mcnemar's Test P-Value : 1.867e-14       
##                                           
##             Sensitivity : 0.9227          
##             Specificity : 0.6505          
##          Pos Pred Value : 0.9516          
##          Neg Pred Value : 0.5305          
##              Prevalence : 0.8817          
##          Detection Rate : 0.8135          
##    Detection Prevalence : 0.8549          
##       Balanced Accuracy : 0.7866          
##                                           
##        'Positive' Class : 0               
## 
library(PRROC)
## Loading required package: rlang
## 
## Attaching package: 'rlang'
## The following objects are masked from 'package:purrr':
## 
##     %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
##     flatten_raw, invoke, splice
svm_probs <- attr(predict(svm_model, newdata = scaled_test_df, probability = TRUE), "probabilities")[, "1"]


labels <- as.numeric(as.character(scaled_test_df$y_yes))


pr <- pr.curve(scores.class0 = svm_probs[labels == 1],
               scores.class1 = svm_probs[labels == 0],
               curve = TRUE)

# Plot PR curve
plot(pr,
     main = "Precision-Recall Curve for SVM (Linear)",
     auc.main = TRUE,
     color = "#2c7fb8",
     lwd = 2)

set.seed(123)
# Try different cost with linear kernel
svm_model <- svm(y_yes ~ ., data = scaled_train_df, 
                 kernel = "linear", 
                 cost = 1,        
                 probability = TRUE)

# Prediction
svm_predictions <- predict(svm_model, newdata = scaled_test_df)

# confusion
conf_matrix_svm <- confusionMatrix(svm_predictions, scaled_test_df$y_yes)
precision <- conf_matrix_svm$byClass["Precision"]
recall <- conf_matrix_svm$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)

cat("Precision:", precision, "\n")
## Precision: 0.9516171
cat("Recall:", recall, "\n")
## Recall: 0.9227296
cat("F1 Score:", f1, "\n")
## F1 Score: 0.9369507
print(conf_matrix_svm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7356  374
##          1  616  696
##                                           
##                Accuracy : 0.8905          
##                  95% CI : (0.8839, 0.8969)
##     No Information Rate : 0.8817          
##     P-Value [Acc > NIR] : 0.00449         
##                                           
##                   Kappa : 0.5221          
##                                           
##  Mcnemar's Test P-Value : 1.867e-14       
##                                           
##             Sensitivity : 0.9227          
##             Specificity : 0.6505          
##          Pos Pred Value : 0.9516          
##          Neg Pred Value : 0.5305          
##              Prevalence : 0.8817          
##          Detection Rate : 0.8135          
##    Detection Prevalence : 0.8549          
##       Balanced Accuracy : 0.7866          
##                                           
##        'Positive' Class : 0               
## 
#set.seed(123)
# Cost =10 with linear
#svm_model <- svm(y_yes ~ ., data = scaled_train_df, 
#                 kernel = "linear", 
#                 cost = 10,       
#                 probability = TRUE)

# Prediction
#svm_predictions <- predict(svm_model, newdata = scaled_test_df)

# Confusion
#conf_matrix_svm <- confusionMatrix(svm_predictions, scaled_test_df$y_yes)
#print(conf_matrix_svm)
set.seed(123)
install.packages("PRROC")
## Warning: package 'PRROC' is in use and will not be installed
library(PRROC)
# Get predicted probabilities for class "1"
svm_probs <- attr(predict(svm_model, newdata = scaled_test_df, probability = TRUE), "probabilities")[, "1"]

# Convert test labels to numeric (must be 0/1)
labels <- as.numeric(as.character(scaled_test_df$y_yes))

# Create precision-recall object
pr <- pr.curve(scores.class0 = svm_probs[labels == 1],
               scores.class1 = svm_probs[labels == 0],
               curve = TRUE)

# Plot the PR curve
plot(pr,
     main = "Precision-Recall Curve for SVM",
     auc.main = TRUE,
     color = "#2c7fb8",
     lwd = 2)

DECISION TREE

bank_data <- read_csv2("https://raw.githubusercontent.com/greggmaloy/Data622/main/bank-full.csv", show_col_types = FALSE)
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
# Categorical variables
categorical_cols <- names(bank_data)[sapply(bank_data, is.character)]

# One-hot encoding 
bank_data_encoded <- fastDummies::dummy_cols(bank_data, select_columns = categorical_cols, 
                                             remove_first_dummy = TRUE, remove_selected_columns = TRUE)

# Split data 80 training/ 20 testing 
set.seed(42)
trainIndex <- createDataPartition(bank_data_encoded$y_yes, p = 0.8, list = FALSE)  
trainData <- bank_data_encoded[trainIndex, ]
testData <- bank_data_encoded[-trainIndex, ]

# factorize target
trainData$y_yes <- factor(trainData$y_yes, levels = c(0, 1))
testData$y_yes <- factor(testData$y_yes, levels = c(0, 1))

set.seed(123) 

#INITIAL DT MODEL!!!!!!!!!!!!!!!!!!

dt_model <- rpart(y_yes ~ ., data = trainData, method = "class", control = rpart.control(minsplit = 20, cp = 0.01))
rpart.plot(dt_model)

# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")

# Confusion matrix
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
precision <- conf_matrix$byClass["Precision"]
recall <- conf_matrix$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)

cat("Precision:", precision, "\n")
## Precision: 0.9163408
cat("Recall:", recall, "\n")
## Recall: 0.9700201
cat("F1 Score:", f1, "\n")
## F1 Score: 0.9424167
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7733  706
##          1  239  364
##                                          
##                Accuracy : 0.8955         
##                  95% CI : (0.889, 0.9017)
##     No Information Rate : 0.8817         
##     P-Value [Acc > NIR] : 1.886e-05      
##                                          
##                   Kappa : 0.3825         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9700         
##             Specificity : 0.3402         
##          Pos Pred Value : 0.9163         
##          Neg Pred Value : 0.6036         
##              Prevalence : 0.8817         
##          Detection Rate : 0.8552         
##    Detection Prevalence : 0.9333         
##       Balanced Accuracy : 0.6551         
##                                          
##        'Positive' Class : 0              
## 
# Feature importance
print(dt_model$variable.importance)
##         duration poutcome_success  contact_unknown            pdays 
##     1110.9180804      685.8992109        3.9494818        0.8776626 
##         previous              age         campaign 
##        0.5831593        0.1443280        0.1443280
#SMOTE 1!!!!!!!!!!!!!!!!!!
set.seed(123)
trainData$y_yes <- as.factor(trainData$y_yes)

#SMOTE
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], trainData$y_yes, K = 5, dup_size = 1)

# Creation of new dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"  
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)

# New class distribution 
table(trainData_balanced$y_yes)
## 
##     0     1 
## 31950  8438
set.seed(123)
# Train decision tree
dt_model <- rpart(y_yes ~ ., data = trainData_balanced, method = "class", control = rpart.control(minsplit = 20, cp = 0.01))



# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")

# Confusion Matrix
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
precision <- conf_matrix$byClass["Precision"]
recall <- conf_matrix$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)

cat("Precision:", precision, "\n")
## Precision: 0.9295723
cat("Recall:", recall, "\n")
## Recall: 0.9486954
cat("F1 Score:", f1, "\n")
## F1 Score: 0.9390365
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7563  573
##          1  409  497
##                                           
##                Accuracy : 0.8914          
##                  95% CI : (0.8848, 0.8977)
##     No Information Rate : 0.8817          
##     P-Value [Acc > NIR] : 0.001992        
##                                           
##                   Kappa : 0.4425          
##                                           
##  Mcnemar's Test P-Value : 1.976e-07       
##                                           
##             Sensitivity : 0.9487          
##             Specificity : 0.4645          
##          Pos Pred Value : 0.9296          
##          Neg Pred Value : 0.5486          
##              Prevalence : 0.8817          
##          Detection Rate : 0.8364          
##    Detection Prevalence : 0.8998          
##       Balanced Accuracy : 0.7066          
##                                           
##        'Positive' Class : 0               
## 
# Plot 
rpart.plot(dt_model, 
           type = 3,        
           extra = 104,     
           under = TRUE,    
           tweak = 1.2,     
           box.palette = "RdYlGn",  
           fallen.leaves = TRUE)  

#SMOTE 2!!!!!!!!!!!
set.seed(123)

trainData$y_yes <- as.factor(trainData$y_yes)

#SMOTE
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], trainData$y_yes, K = 5, dup_size = 2)

# new dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes" 
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)  

# class distribution
table(trainData_balanced$y_yes)
## 
##     0     1 
## 31950 12657
# Train dt
dt_model <- rpart(y_yes ~ ., data = trainData_balanced, method = "class", control = rpart.control(minsplit = 20, cp = 0.01))

# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")

# conf matrix
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
precision <- conf_matrix$byClass["Precision"]
recall <- conf_matrix$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)

cat("Precision:", precision, "\n")
## Precision: 0.9380264
cat("Recall:", recall, "\n")
## Recall: 0.9360261
cat("F1 Score:", f1, "\n")
## F1 Score: 0.9370252
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7462  493
##          1  510  577
##                                           
##                Accuracy : 0.8891          
##                  95% CI : (0.8824, 0.8955)
##     No Information Rate : 0.8817          
##     P-Value [Acc > NIR] : 0.0146          
##                                           
##                   Kappa : 0.472           
##                                           
##  Mcnemar's Test P-Value : 0.6134          
##                                           
##             Sensitivity : 0.9360          
##             Specificity : 0.5393          
##          Pos Pred Value : 0.9380          
##          Neg Pred Value : 0.5308          
##              Prevalence : 0.8817          
##          Detection Rate : 0.8253          
##    Detection Prevalence : 0.8798          
##       Balanced Accuracy : 0.7376          
##                                           
##        'Positive' Class : 0               
## 
# Plot 
rpart.plot(dt_model, 
           type = 3,        
           extra = 104,     
           under = TRUE,    
           tweak = 1.2,    
           box.palette = "RdYlGn",  
           fallen.leaves = TRUE)  

#HYPERPARAMETER TUNING #1
#MODIFY COMPLEXITY PARAMETER
set.seed(123)

dt_model <- rpart(y_yes ~ ., data = trainData_balanced, method = "class",
                  control = rpart.control(minsplit = 20, cp = 0.05))
# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")

# confusion
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
precision <- conf_matrix$byClass["Precision"]
recall <- conf_matrix$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)

cat("Precision:", precision, "\n")
## Precision: 0.9427016
cat("Recall:", recall, "\n")
## Recall: 0.9121927
cat("F1 Score:", f1, "\n")
## F1 Score: 0.9271962
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7272  442
##          1  700  628
##                                           
##                Accuracy : 0.8737          
##                  95% CI : (0.8667, 0.8805)
##     No Information Rate : 0.8817          
##     P-Value [Acc > NIR] : 0.9904          
##                                           
##                   Kappa : 0.4519          
##                                           
##  Mcnemar's Test P-Value : 2.849e-14       
##                                           
##             Sensitivity : 0.9122          
##             Specificity : 0.5869          
##          Pos Pred Value : 0.9427          
##          Neg Pred Value : 0.4729          
##              Prevalence : 0.8817          
##          Detection Rate : 0.8042          
##    Detection Prevalence : 0.8531          
##       Balanced Accuracy : 0.7496          
##                                           
##        'Positive' Class : 0               
## 
# Plot
rpart.plot(dt_model, 
           type = 3,       
           extra = 104,    
           under = TRUE,    
           tweak = 1.2,     
           box.palette = "RdYlGn",  
           fallen.leaves = TRUE)  

#HYPERPARAMETER TUNING #2 CHANGING WEIGHTS!!!!!!!!!!!
set.seed(123)

dt_model <- rpart(y_yes ~ ., 
                  data = trainData_balanced, 
                  method = "class",
                  parms = list(prior = c(0.4, 0.6)),  # class priors: shift weight toward the minority class ("1")
                  control = rpart.control(minsplit = 20, cp = 0.01))

# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")

# Confusion
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
precision <- conf_matrix$byClass["Precision"]
recall <- conf_matrix$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)

cat("Precision:", precision, "\n")
## Precision: 0.973132
cat("Recall:", recall, "\n")
## Recall: 0.7678123
cat("F1 Score:", f1, "\n")
## F1 Score: 0.8583649
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6121  169
##          1 1851  901
##                                           
##                Accuracy : 0.7766          
##                  95% CI : (0.7679, 0.7851)
##     No Information Rate : 0.8817          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3629          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7678          
##             Specificity : 0.8421          
##          Pos Pred Value : 0.9731          
##          Neg Pred Value : 0.3274          
##              Prevalence : 0.8817          
##          Detection Rate : 0.6770          
##    Detection Prevalence : 0.6956          
##       Balanced Accuracy : 0.8049          
##                                           
##        'Positive' Class : 0               
## 
# Plot DT
rpart.plot(dt_model, 
           type = 3,        
           extra = 104,    
           under = TRUE,    
           tweak = 1.2,     
           box.palette = "RdYlGn",  
           fallen.leaves = TRUE) 

library(PRROC)

# Get predicted probabilities for the positive class ("1")
dt_probs <- predict(dt_model, newdata = testData, type = "prob")[, "1"]

# Convert actual labels to numeric (0/1)
labels <- as.numeric(as.character(testData$y_yes))

# Compute Precision-Recall curve
pr <- pr.curve(scores.class0 = dt_probs[labels == 1],
               scores.class1 = dt_probs[labels == 0],
               curve = TRUE)

# Plot PR Curve
plot(pr,
     main = "Precision-Recall Curve for Decision Tree (Weighted)",
     auc.main = TRUE,
     color = "#d95f02",
     lwd = 2)