Instructions
Perform an analysis of the dataset used in Homework 1 and 2 using the
SVM algorithm.
Compare the results with the results from the previous homeworks.
Read the following articles (reviewed later in this report).
To-dos:
| Model | Y=0 | Y=1 | TP | FN | FP | TN | Accuracy | Sensitivity | Specificity | Bal. Acc. | Precision | F1 | AUC-PR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Decision Tree | |||||||||||||
| Initial Decision Tree (DT) | 39822 | 5289 | 364 | 706 | 239 | 7733 | 0.8955 | 0.9700 | 0.3402 | 0.6551 | 0.934 | 0.945 | |
| SMOTE #1(K=5,dup_size=1) & DT | 31950 | 8438 | 497 | 573 | 409 | 7563 | 0.8914 | 0.9487 | 0.4645 | 0.7066 | |||
| SMOTE #2(K=5,dup_size=2) & DT | 31950 | 12657 | 577 | 493 | 510 | 7462 | 0.8891 | 0.9360 | 0.5393 | 0.7376 | |||
| Hyperparameter tuning #1 (cp=0.05) | 31950 | 12657 | 628 | 442 | 700 | 7272 | 0.8737 | 0.9122 | 0.5869 | 0.7496 | |||
| Hyperparameter tuning #2 (weight) | 31950 | 12657 | 901 | 169 | 1851 | 6121 | 0.7766 | 0.7678 | 0.8421 | 0.8049 | 0.973 | 0.858 | 0.37 |
| Random Forest | |||||||||||||
| Initial Random Forest (RF) | 39822 | 5289 | 410 | 660 | 212 | 7760 | 0.9036 | 0.9734 | 0.3832 | 0.6783 | |||
| SMOTE #1(K=5,dup_size=1) & RF | 31950 | 8438 | 445 | 625 | 234 | 7738 | 0.9050 | 0.9706 | 0.4159 | 0.6933 | |||
| SMOTE #2(K=5,dup_size=2) & RF | 31950 | 12657 | 448 | 622 | 249 | 7723 | 0.9037 | 0.9688 | 0.4187 | 0.6937 | |||
| Hyperparameter tuning #1 (weight) | 31950 | 12657 | 530 | 540 | 371 | 7601 | 0.8992 | 0.9535 | 0.4953 | 0.7244 | |||
| Hyperparameter tuning #2 (weight) | 31950 | 12657 | 537 | 533 | 395 | 7577 | 0.8974 | 0.9505 | 0.5019 | 0.7319 | |||
| ADA Boost | |||||||||||||
| Initial ADA BOOST (ADA) | 39822 | 5289 | 236 | 807 | 134 | 7838 | 0.8959 | 0.9832 | 0.2458 | 0.6145 | |||
| SMOTE #1(K=5,dup_size=1) & ADA | 31950 | 8438 | 461 | 609 | 305 | 7667 | 0.8989 | 0.9617 | 0.4308 | 0.6963 | |||
| SMOTE #2(K=5,dup_size=2) & ADA | 31950 | 12657 | 520 | 550 | 377 | 7595 | 0.8975 | 0.9527 | 0.4860 | 0.7193 | |||
| Hyperparameter tuning #1 (weight4,1) | 31950 | 12657 | 507 | 563 | 333 | 7639 | 0.9009 | 0.9582 | 0.4738 | 0.7160 | |||
| Hyperparameter tuning #2 (weight5,1) | 31950 | 12657 | 507 | 563 | 333 | 7639 | 0.9009 | 0.9582 | 0.4738 | 0.7160 | |||
| SVM | |||||||||||||
| Initial SVM | 39822 | 5289 | 348 | 722 | 165 | 7807 | 0.9019 | 0.9793 | 0.3252 | 0.6523 | 0.915 | 0.946 | |
| SMOTE #1(K=5,dup_size=1) & Radial | 31950 | 8438 | 465 | 605 | 270 | 7702 | 0.9032 | 0.9661 | 0.4346 | 0.7004 | 0.927 | 0.946 | |
| SMOTE #2(K=5,dup_size=2) & Radial | 31950 | 12657 | 533 | 537 | 337 | 7635 | 0.9033 | 0.9577 | 0.4981 | 0.7279 | 0.934 | 0.946 | |
| Kernel Adjustment (Linear) | 31950 | 12657 | 696 | 374 | 616 | 7356 | 0.8905 | 0.9227 | 0.6505 | 0.7866 | 0.951 | 0.937 | 0.56 |
| Hyperparameter tuning #2 (C = 0.1) | 31950 | 12657 | 696 | 374 | 615 | 7357 | 0.8906 | 0.9229 | 0.6505 | 0.7867 | | | |
| Hyperparameter tuning #3 (C = 1.0) | 31950 | 12657 | 696 | 374 | 616 | 7356 | 0.8905 | 0.9227 | 0.6505 | 0.7866 | 0.952 | 0.937 | |
Assignment 1: Results Summarized/EDA
EDA was performed in the previous assignment (https://rpubs.com/greggmaloy/1275261). The EDA from that assignment is summarized below:
There was considerable class imbalance in the target variable (y): ~11% of clients subscribed, while ~88% did not. The dataset contains seven numerical and ten categorical variables. Most numerical features were right-skewed and many had outliers, detected via IQR and scatterplots. The correlation matrix showed no strong linear relationships between features; most variables had very weak or no correlation. There was no missing data.
What the EDA Means for the Decision Tree, Random Forest, ADA Boost & SVM Models
Since SVM and ADA Boost cannot handle categorical variables directly, and since both decision tree and random forest models can accommodate one-hot-encoded features, the categorical variables were transformed via one-hot encoding. This transformation standardizes the dataset across all four models.
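As a rough sketch of this step (it mirrors the encoding code in the appendix and assumes the same bank-full.csv source, with all categorical fields read as character columns):

```r
library(readr)        # read_csv2()
library(fastDummies)  # dummy_cols()

# Read the semicolon-delimited bank dataset (same source used in the appendix)
bank_data <- read_csv2(
  "https://raw.githubusercontent.com/greggmaloy/Data622/main/bank-full.csv",
  show_col_types = FALSE
)

# One-hot encode every character column, dropping the first dummy of each
# variable and removing the original categorical columns
categorical_cols <- names(bank_data)[sapply(bank_data, is.character)]
bank_data_encoded <- fastDummies::dummy_cols(
  bank_data,
  select_columns          = categorical_cols,
  remove_first_dummy      = TRUE,
  remove_selected_columns = TRUE
)
```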
Assignment 2: Summarized Results for DT, RF & ADA Boost Models
Assignment 2 concluded with the final decision tree being the preferred
model. This model included ~7,000 rows of SMOTE generated data, the
complexity parameter adjusted to 5%, and an increased weight assigned to
the target variable of the minority class. In deciding which model was
the most preferred for assignment 2, we assumed the business had
unlimited resources and would therefore prefer a model that maximizes
true positives while minimizing false negatives. As a result, even
though accuracy and sensitivity decreased compared to the initial model,
the increase in true positives in the final decision tree expanded the
pool of potential clients that could be contacted, increasing the
likelihood of successful subscriptions/phone calls.
Assignment 3: Initial Models
In the present assignment, an SVM was utilized and the results were compared to the decision tree, random forest, and ADA Boost models. The initial SVM results were comparable to those of the initial decision tree and initial random forest models across all metrics. The ADA Boost model, however, had markedly lower specificity than the other three models.
Assignment 3: Utilization of SMOTE
As in assignment 2, SMOTE was utilized to generate ~7,000 rows of synthetic data for the SVM model in an attempt to address the class imbalance in the target variable. Results were comparable between the SVM, DT, and ADA Boost models, with all three experiencing negligible decreases in accuracy and sensitivity along with larger increases in specificity and balanced accuracy. Of note, the RF model did not show as large an increase in specificity and balanced accuracy. This may be because, as an ensemble model, it inherently addresses some of the class imbalance.
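A condensed sketch of the SMOTE step (taken from the appendix code, which uses the smotefamily package; trainData is assumed to be the one-hot-encoded training split with the binary target y_yes):

```r
library(smotefamily)  # SMOTE()

set.seed(123)
# K = 5 nearest neighbours; dup_size = 1 adds roughly one synthetic
# example per original minority-class row
smote_out <- SMOTE(trainData[, setdiff(names(trainData), "y_yes")],
                   trainData$y_yes,
                   K = 5, dup_size = 1)

# smotefamily returns the augmented data with the target in a column named "class"
trainData_balanced <- smote_out$data
names(trainData_balanced)[names(trainData_balanced) == "class"] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)

table(trainData_balanced$y_yes)  # inspect the new class distribution
```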
Assignment 3: Kernel Adjustment
Initially, the SVM model was run with a radial basis function (RBF) kernel based on findings from assignment 1, which found a lack of substantial linear relationships among the features. However, after applying SMOTE to address class imbalance and one-hot encoding to standardize the categorical variables, a linear kernel was introduced to reassess whether the data might now support a more linear decision boundary. Surprisingly, the linear kernel outperformed the RBF kernel, yielding higher specificity and balanced accuracy. This improvement may have been due to the one-hot encoding expanding the feature space and creating a clearer decision boundary. The kernel adjustment increased specificity by ~15 percentage points while lowering sensitivity by only ~3 percentage points.
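A minimal sketch of the kernel comparison, condensed from the appendix (scaled_train_df and scaled_test_df are assumed to be the standardized SMOTE-balanced training set and the correspondingly scaled test set):

```r
library(e1071)  # svm()
library(caret)  # confusionMatrix()

set.seed(123)
# Radial basis function kernel (the original choice)
svm_rbf <- svm(y_yes ~ ., data = scaled_train_df, kernel = "radial", probability = TRUE)
# Linear kernel (the adjustment that improved specificity)
svm_lin <- svm(y_yes ~ ., data = scaled_train_df, kernel = "linear", probability = TRUE)

# Compare key confusion-matrix metrics on the held-out test set
cm_rbf <- confusionMatrix(predict(svm_rbf, scaled_test_df), scaled_test_df$y_yes)
cm_lin <- confusionMatrix(predict(svm_lin, scaled_test_df), scaled_test_df$y_yes)
cm_rbf$byClass[c("Sensitivity", "Specificity", "Balanced Accuracy")]
cm_lin$byClass[c("Sensitivity", "Specificity", "Balanced Accuracy")]
```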
Assignment 3: Cost Parameter Adjustment
After the kernel adjustment, the cost parameter was tuned to control how heavily the SVM penalizes misclassifications. The results across two settings (C = 0.1, C = 1.0) were nearly identical to those of the model that introduced the linear kernel. These consistent results suggest that the model's performance was not sensitive to the cost parameter over this range, possibly due to bias introduced by SMOTE or the one-hot encoding. A higher value was attempted (C = 10), but R returned a convergence warning (reaching max number of iterations), so that run was not used.
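One way to organize the cost sweep is sketched below (this structure is an assumption; the appendix fits each cost value in a separate chunk). Large cost values, such as C = 10, may again trigger the "reaching max number of iterations" warning.

```r
library(e1071)
library(caret)

set.seed(123)
for (c_val in c(0.1, 1)) {
  svm_c <- svm(y_yes ~ ., data = scaled_train_df,
               kernel = "linear", cost = c_val, probability = TRUE)
  cm <- confusionMatrix(predict(svm_c, scaled_test_df), scaled_test_df$y_yes)
  cat("cost =", c_val,
      "| sensitivity =", round(cm$byClass["Sensitivity"], 4),
      "| specificity =", round(cm$byClass["Specificity"], 4), "\n")
}
```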
Assignment 3: Final Model Choice
FINAL TWO MODELS
| Model | Y=0 | Y=1 | TP | FN | FP | TN | Accuracy | Sensitivity | Specificity | Bal. Acc. | Precision | F1 | AUC-PR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DT HP tuning #2 (SMOTE & weight) | 31950 | 12657 | 901 | 169 | 1851 | 6121 | 0.7766 | 0.7678 | 0.8421 | 0.8049 | 0.973 | 0.858 | 0.37 |
| SVM Kernel Adjustment (Linear & SMOTE) | 31950 | 12657 | 696 | 374 | 616 | 7356 | 0.8905 | 0.9227 | 0.6505 | 0.7866 | 0.951 | 0.937 | 0.56 |
The final model selected for this assignment is the SVM with SMOTE and a linear kernel. Although the decision tree model outperformed it on some metrics, such as precision, balanced accuracy, and specificity, the final decision tree also saw a massive increase in false positives (FP = 1,851), dramatically lower sensitivity (0.7678), and lower accuracy (0.7766). The SVM had the higher AUC-PR (0.56 vs. 0.37), indicating it struck a better balance between true positives and false positives. Additionally, the SVM had the higher F1 score of the two models (0.937), denoting a better balance between precision and recall.
In terms of business utilization, although we assumed the business had unlimited resources, the number of false positives generated by the decision tree model (FP = 1,851) was approximately three times that of the SVM model (FP = 616). This would translate into significant operational costs. In contrast, the SVM model maintained a strong true positive rate while keeping false positives comparatively low, making it the more cost-effective choice for subscriber outreach.
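The AUC-PR values cited above come from the PRROC package; a condensed sketch (mirroring the appendix code, with the minority class "1" treated as the positive class and svm_lin taken from the kernel sketch above) is:

```r
library(PRROC)  # pr.curve()

# Predicted probability of subscribing ("1") for each test observation
svm_probs <- attr(predict(svm_lin, scaled_test_df, probability = TRUE),
                  "probabilities")[, "1"]
labels <- as.numeric(as.character(scaled_test_df$y_yes))

# Precision-recall curve and its area (the AUC-PR reported in the table)
pr <- pr.curve(scores.class0 = svm_probs[labels == 1],  # scores for actual positives
               scores.class1 = svm_probs[labels == 0],  # scores for actual negatives
               curve = TRUE)
pr$auc.integral
```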
Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection: A Comparative Study
The article “Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection: A Comparative Study” by Amir Ahmad et al. investigated the effectiveness of various decision tree-based models in predicting COVID-19 infection. More specifically, the authors compared standard ensemble methods, such as Random Forest, AdaBoost, and XGBoost, against specialized ensemble methods designed for imbalanced data, such as Balanced Random Forest and SMOTEBoost. Similar to the bank dataset, the data used in this study suffered from class imbalance, with relatively few positive cases in the target variable. To address this, the authors applied both SMOTE and random undersampling. The models were evaluated using AUROC and AUPRC. The main finding was that ensemble methods designed for imbalanced data outperform standard ensemble techniques. Of note, SVM was not used in this study, although SMOTE was.
A Novel Approach to Predict COVID-19 Using Support Vector
Machine
In “A Novel Approach to Predict COVID-19 Using Support Vector Machine”, the authors utilized SVM, logistic regression, KNN, Naive Bayes, and random forest to predict the severity of COVID-19 infection based on symptom features. Among these models, SVM achieved the highest accuracy. Unlike the study by Amir Ahmad et al., this article did not utilize SMOTE or any other rebalancing technique to address class imbalance. Additionally, the authors relied primarily on classification accuracy as the evaluation metric, whereas Amir Ahmad et al. used AUROC, AUPRC, and F1-score.
Predicting metabolic syndrome using decision tree and support
vector machine methods
The study “Predicting metabolic syndrome using decision tree and support vector machine methods” aimed to predict the incidence of metabolic syndrome. It utilized two models, an SVM (polynomial kernel) and a decision tree, and used SMOTE to address the class imbalance in the dataset. Evaluation metrics included sensitivity, specificity, and accuracy. Ultimately, the SVM outperformed the decision tree on all three metrics.
A comparative study of decision tree and support vector machine for breast cancer prediction
In the article “A comparative study of decision tree and support vector machine for breast cancer prediction”, the authors aimed to improve breast cancer diagnosis by comparing SVM models to decision tree models. Accuracy, sensitivity, specificity, precision, and AUC were used to evaluate the models. In the end, SVM outperformed the decision tree models across all metrics, although the results were fairly comparable, suggesting that SVM provides a slightly more accurate model. Of note, although there was some mild class imbalance in the dataset, no imbalance correction, such as SMOTE, was attempted.
Utility of support vector machine and decision tree to identify the prognosis of metformin poisoning in the United States: Analysis of National Poisoning Data System
In this article, the authors aimed to predict the prognosis of metformin poisoning using SVM and decision tree models to classify outcomes as minor, moderate, or major. SVM outperformed the decision tree in all evaluation metrics, including accuracy, precision, recall, F1-score, ROC-AUC, and PR-AUC. Of note, this dataset also suffered from class imbalance; however, the authors did not utilize techniques such as SMOTE to address it.
| Article | Imbalance Present | Imbalance Addressed | Metrics | Preferred Model(s) |
|---|---|---|---|---|
| DT Ensembles to Predict Coronavirus Disease 2019 Inf | yes | yes (various, incl. SMOTE) | AUROC, AUPRC | Ensemble methods for imbalanced data |
| A Novel Approach to Predict COVID-19 Using SVM | yes | no | Accuracy | SVM |
| Predicting metabolic syndrome using DT and SVM methods | yes | yes-SMOTE | sen,spec,acc | SVM |
| A comparative study of SVM and DT for breast ca predict | mild | no | sen,spec,acc,prec,AUC | SVM |
| Utility of SVM and DT prognosis of metformin poisoning | yes | no | sen,acc,prec,F1,AUC | SVM |
Discussion
As a healthcare professional, I find the above articles extremely useful. All of them allude to a data issue that frequently affects healthcare data: class imbalance in the target variable. In healthcare, the target variable is often a disease or condition that is relatively uncommon in the general population. Two of the five articles employed SMOTE to help remedy this imbalance. Furthermore, in four of the five articles, SVM models outperformed the other models (usually decision trees). Of note, the five articles also utilized similar evaluation metrics.
Ahmad, A., Safi, O., Malebary, S., Alesawi, S., & Alkayal, E. (2021). Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection: A Comparative Study. Complexity, 2021. https://doi.org/10.1155/2021/5550344
Guhathakurata, S., Kundu, S., Chakraborty, A., & Banerjee, J. S. (2021). A novel approach to predict COVID-19 using support vector machine. In Data Science for COVID-19 (pp. 351–364). Elsevier. https://doi.org/10.1016/B978-0-12-824536-1.00014-9
Karimi-Alavijeh, F., Jalili, S., & Sadeghi, M. (2016). Predicting metabolic syndrome using decision tree and support vector machine methods. ARYA Atherosclerosis, 12(3), 146–152. https://pmc.ncbi.nlm.nih.gov/articles/PMC5055373/
Ogbe, M. I., Nzeanorue, C. C., Olusola, R. A., Olofin, D. O., Owoeye, M. C., Enabulele, E. C., Ibijola, A. P., Ifechukwu, C. J., & Ayo, O. I. (2024). A comparative study of decision tree and support vector machine for breast cancer prediction. World Journal of Advanced Research and Reviews, 23(1), 746–752. https://doi.org/10.30574/wjarr.2024.23.1.2024
Mehrpour, O., Saeedi, F., Hoyte, C., Goss, F., & Shirazi, F. M. (2022). Utility of support vector machine and decision tree to identify the prognosis of metformin poisoning in the United States: Analysis of National Poisoning Data System. BMC Pharmacology and Toxicology, 23(1), 49. https://doi.org/10.1186/s40360-022-00588-0
########################################## ONLY CODE BELOW###############################################################
library(tidyverse) ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(mlbench)
library(e1071)        # svm()
library(smotefamily)  # SMOTE()
library(rpart)        # rpart()
library(rpart.plot)   # rpart.plot()
bank_data <- read_csv2("https://raw.githubusercontent.com/greggmaloy/Data622/main/bank-full.csv", show_col_types = FALSE)## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
# Categorical variables
categorical_cols <- names(bank_data)[sapply(bank_data, is.character)]
# One-hot encoding
bank_data_encoded <- fastDummies::dummy_cols(bank_data, select_columns = categorical_cols,
remove_first_dummy = TRUE, remove_selected_columns = TRUE)
# Split data 80 training/ 20 testing
set.seed(42)
trainIndex <- createDataPartition(bank_data_encoded$y_yes, p = 0.8, list = FALSE)
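# (createDataPartition keeps the proportion of 0s and 1s in y_yes roughly equal across the splits)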
trainData <- bank_data_encoded[trainIndex, ]
testData <- bank_data_encoded[-trainIndex, ]
# factorize target
trainData$y_yes <- factor(trainData$y_yes, levels = c(0, 1))
testData$y_yes <- factor(testData$y_yes, levels = c(0, 1))
set.seed(123)
# BEFORE SMOTE
# Scale features
scaled_train <- trainData
scaled_test <- testData
# Remove target for scaling
train_features <- scaled_train[, setdiff(names(scaled_train), "y_yes")]
test_features <- scaled_test[, setdiff(names(scaled_test), "y_yes")]
# Standardize
scaled_train_scaled <- scale(train_features)
scaled_test_scaled <- scale(test_features, center = attr(scaled_train_scaled, "scaled:center"),
scale = attr(scaled_train_scaled, "scaled:scale"))
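# (the test set is centered and scaled with the training set's statistics so no test information leaks into training)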
# Re-add target variable
scaled_train <- data.frame(scaled_train_scaled)
scaled_train$y_yes <- trainData$y_yes
scaled_test <- data.frame(scaled_test_scaled)
scaled_test$y_yes <- testData$y_yes
# Train SVM with radial
svm_model <- svm(y_yes ~ ., data = scaled_train, kernel = "radial", probability = TRUE)
# Predictions
svm_predictions <- predict(svm_model, newdata = scaled_test)
# Confusion Matrix
conf_matrix_svm <- confusionMatrix(svm_predictions, scaled_test$y_yes)
precision <- conf_matrix_svm$byClass["Precision"]
recall <- conf_matrix_svm$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.9153476
## Recall: 0.9793026
## F1 Score: 0.9462457
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7807 722
## 1 165 348
##
## Accuracy : 0.9019
## 95% CI : (0.8956, 0.908)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 5.443e-10
##
## Kappa : 0.3931
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9793
## Specificity : 0.3252
## Pos Pred Value : 0.9153
## Neg Pred Value : 0.6784
## Prevalence : 0.8817
## Detection Rate : 0.8634
## Detection Prevalence : 0.9433
## Balanced Accuracy : 0.6523
##
## 'Positive' Class : 0
##
#SMOTE 1!!!!!!!!!!!!!!!!!!
set.seed(123)
# Ensure target is factor
trainData$y_yes <- as.factor(trainData$y_yes)
testData$y_yes <- as.factor(testData$y_yes)
# SMOTE - balance the training set
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")],
trainData$y_yes,
K = 5, dup_size = 1)
# Create SMOTE-balanced dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)
print(table(trainData_balanced$y_yes))##
## 0 1
## 31950 8438
# Standardize numeric features after SMOTE
train_features <- trainData_balanced[, setdiff(names(trainData_balanced), "y_yes")]
test_features <- testData[, setdiff(names(testData), "y_yes")]
# Scale training and testing
scaled_train <- scale(train_features)
scaled_test <- scale(test_features,
center = attr(scaled_train, "scaled:center"),
scale = attr(scaled_train, "scaled:scale"))
# Combine scaled features with target variable
scaled_train_df <- data.frame(scaled_train)
scaled_train_df$y_yes <- trainData_balanced$y_yes
scaled_test_df <- data.frame(scaled_test)
scaled_test_df$y_yes <- testData$y_yes
# Train SVM radial kernel
svm_model <- svm(y_yes ~ ., data = scaled_train_df,
kernel = "radial", probability = TRUE)
# Predictions
svm_predictions <- predict(svm_model, newdata = scaled_test_df)
# Confusion
conf_matrix_svm <- confusionMatrix(svm_predictions, scaled_test_df$y_yes)
precision <- conf_matrix_svm$byClass["Precision"]
recall <- conf_matrix_svm$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.9271699
## Recall: 0.9661315
## F1 Score: 0.9462498
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7702 605
## 1 270 465
##
## Accuracy : 0.9032
## 95% CI : (0.8969, 0.9092)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 3.782e-11
##
## Kappa : 0.4635
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9661
## Specificity : 0.4346
## Pos Pred Value : 0.9272
## Neg Pred Value : 0.6327
## Prevalence : 0.8817
## Detection Rate : 0.8518
## Detection Prevalence : 0.9187
## Balanced Accuracy : 0.7004
##
## 'Positive' Class : 0
##
#SMOTE 2!!!!!!!!!!!!!!!!!!
set.seed(123)
#ensure target is factor
trainData$y_yes <- as.factor(trainData$y_yes)
testData$y_yes <- as.factor(testData$y_yes)
# SMOTE dup_size=2
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")],
trainData$y_yes,
K = 5, dup_size = 2)
# create SMOTE dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)
print(table(trainData_balanced$y_yes))##
## 0 1
## 31950 12657
# Standardize numeric features after SMOTE
train_features <- trainData_balanced[, setdiff(names(trainData_balanced), "y_yes")]
test_features <- testData[, setdiff(names(testData), "y_yes")]
# Scale training and testing
scaled_train <- scale(train_features)
scaled_test <- scale(test_features,
center = attr(scaled_train, "scaled:center"),
scale = attr(scaled_train, "scaled:scale"))
# Combine scaled features with target variable
scaled_train_df <- data.frame(scaled_train)
scaled_train_df$y_yes <- trainData_balanced$y_yes
scaled_test_df <- data.frame(scaled_test)
scaled_test_df$y_yes <- testData$y_yes
# Train SVM radial kernel
svm_model <- svm(y_yes ~ ., data = scaled_train_df,
kernel = "radial", probability = TRUE)
# Predictions
svm_predictions <- predict(svm_model, newdata = scaled_test_df)
# Confusion
conf_matrix_svm <- confusionMatrix(svm_predictions, scaled_test_df$y_yes)
precision <- conf_matrix_svm$byClass["Precision"]
recall <- conf_matrix_svm$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.9342878
## Recall: 0.957727
## F1 Score: 0.9458622
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7635 537
## 1 337 533
##
## Accuracy : 0.9033
## 95% CI : (0.8971, 0.9094)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 3.004e-11
##
## Kappa : 0.496
##
## Mcnemar's Test P-Value : 1.682e-11
##
## Sensitivity : 0.9577
## Specificity : 0.4981
## Pos Pred Value : 0.9343
## Neg Pred Value : 0.6126
## Prevalence : 0.8817
## Detection Rate : 0.8444
## Detection Prevalence : 0.9038
## Balanced Accuracy : 0.7279
##
## 'Positive' Class : 0
##
set.seed(123)
# Train with linear
svm_model <- svm(y_yes ~ ., data = scaled_train_df,
kernel = "linear", probability = TRUE)
# Prediction
svm_predictions <- predict(svm_model, newdata = scaled_test_df)
# Confusion
conf_matrix_svm <- confusionMatrix(svm_predictions, scaled_test_df$y_yes)
precision <- conf_matrix_svm$byClass["Precision"]
recall <- conf_matrix_svm$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.9516171
## Recall: 0.9227296
## F1 Score: 0.9369507
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7356 374
## 1 616 696
##
## Accuracy : 0.8905
## 95% CI : (0.8839, 0.8969)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 0.00449
##
## Kappa : 0.5221
##
## Mcnemar's Test P-Value : 1.867e-14
##
## Sensitivity : 0.9227
## Specificity : 0.6505
## Pos Pred Value : 0.9516
## Neg Pred Value : 0.5305
## Prevalence : 0.8817
## Detection Rate : 0.8135
## Detection Prevalence : 0.8549
## Balanced Accuracy : 0.7866
##
## 'Positive' Class : 0
##
## Loading required package: rlang
##
## Attaching package: 'rlang'
## The following objects are masked from 'package:purrr':
##
## %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
## flatten_raw, invoke, splice
svm_probs <- attr(predict(svm_model, newdata = scaled_test_df, probability = TRUE), "probabilities")[, "1"]
labels <- as.numeric(as.character(scaled_test_df$y_yes))
pr <- pr.curve(scores.class0 = svm_probs[labels == 1],
scores.class1 = svm_probs[labels == 0],
curve = TRUE)
# Plot PR curve
plot(pr,
main = "Precision-Recall Curve for SVM (Linear)",
auc.main = TRUE,
color = "#2c7fb8",
lwd = 2)
set.seed(123)
# Try different cost with linear kernel
svm_model <- svm(y_yes ~ ., data = scaled_train_df,
kernel = "linear",
cost = 1,
probability = TRUE)
# Prediction
svm_predictions <- predict(svm_model, newdata = scaled_test_df)
# confusion
conf_matrix_svm <- confusionMatrix(svm_predictions, scaled_test_df$y_yes)
precision <- conf_matrix_svm$byClass["Precision"]
recall <- conf_matrix_svm$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.9516171
## Recall: 0.9227296
## F1 Score: 0.9369507
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7356 374
## 1 616 696
##
## Accuracy : 0.8905
## 95% CI : (0.8839, 0.8969)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 0.00449
##
## Kappa : 0.5221
##
## Mcnemar's Test P-Value : 1.867e-14
##
## Sensitivity : 0.9227
## Specificity : 0.6505
## Pos Pred Value : 0.9516
## Neg Pred Value : 0.5305
## Prevalence : 0.8817
## Detection Rate : 0.8135
## Detection Prevalence : 0.8549
## Balanced Accuracy : 0.7866
##
## 'Positive' Class : 0
##
#set.seed(123)
# Cost =10 with linear
#svm_model <- svm(y_yes ~ ., data = scaled_train_df,
# kernel = "linear",
# cost = 10,
# probability = TRUE)
# Prediction
#svm_predictions <- predict(svm_model, newdata = scaled_test_df)
# Confusion
#conf_matrix_svm <- confusionMatrix(svm_predictions, scaled_test_df$y_yes)
#print(conf_matrix_svm)
## Warning: package 'PRROC' is in use and will not be installed
library(PRROC)
# Get predicted probabilities for class "1"
svm_probs <- attr(predict(svm_model, newdata = scaled_test_df, probability = TRUE), "probabilities")[, "1"]
# Convert test labels to numeric (must be 0/1)
labels <- as.numeric(as.character(scaled_test_df$y_yes))
# Create precision-recall object
pr <- pr.curve(scores.class0 = svm_probs[labels == 1],
scores.class1 = svm_probs[labels == 0],
curve = TRUE)
# Plot the PR curve
plot(pr,
main = "Precision-Recall Curve for SVM",
auc.main = TRUE,
color = "#2c7fb8",
lwd = 2)
# DECISION TREE
bank_data <- read_csv2("https://raw.githubusercontent.com/greggmaloy/Data622/main/bank-full.csv", show_col_types = FALSE)## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
# Categorical variables
categorical_cols <- names(bank_data)[sapply(bank_data, is.character)]
# One-hot encoding
bank_data_encoded <- fastDummies::dummy_cols(bank_data, select_columns = categorical_cols,
remove_first_dummy = TRUE, remove_selected_columns = TRUE)
# Split data 80 training/ 20 testing
set.seed(42)
trainIndex <- createDataPartition(bank_data_encoded$y_yes, p = 0.8, list = FALSE)
trainData <- bank_data_encoded[trainIndex, ]
testData <- bank_data_encoded[-trainIndex, ]
# factorize target
trainData$y_yes <- factor(trainData$y_yes, levels = c(0, 1))
testData$y_yes <- factor(testData$y_yes, levels = c(0, 1))
set.seed(123)
#INITIAL DT MODEL!!!!!!!!!!!!!!!!!!
dt_model <- rpart(y_yes ~ ., data = trainData, method = "class", control = rpart.control(minsplit = 20, cp = 0.01))
rpart.plot(dt_model)
# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")
# COnfusion matrix
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
precision <- conf_matrix$byClass["Precision"]
recall <- conf_matrix$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.9163408
## Recall: 0.9700201
## F1 Score: 0.9424167
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7733 706
## 1 239 364
##
## Accuracy : 0.8955
## 95% CI : (0.889, 0.9017)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 1.886e-05
##
## Kappa : 0.3825
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9700
## Specificity : 0.3402
## Pos Pred Value : 0.9163
## Neg Pred Value : 0.6036
## Prevalence : 0.8817
## Detection Rate : 0.8552
## Detection Prevalence : 0.9333
## Balanced Accuracy : 0.6551
##
## 'Positive' Class : 0
##
dt_model$variable.importance## duration poutcome_success contact_unknown pdays
## 1110.9180804 685.8992109 3.9494818 0.8776626
## previous age campaign
## 0.5831593 0.1443280 0.1443280
#SMOTE 1!!!!!!!!!!!!!!!!!!
set.seed(123)
trainData$y_yes <- as.factor(trainData$y_yes)
#SMOTE
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], trainData$y_yes, K = 5, dup_size = 1)
# Creation of new dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)
# New class distribution
table(trainData_balanced$y_yes)##
## 0 1
## 31950 8438
set.seed(123)
# Train decision tree
dt_model <- rpart(y_yes ~ ., data = trainData_balanced, method = "class", control = rpart.control(minsplit = 20, cp = 0.01))
# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")
# Confusion Matrix
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
precision <- conf_matrix$byClass["Precision"]
recall <- conf_matrix$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.9295723
## Recall: 0.9486954
## F1 Score: 0.9390365
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7563 573
## 1 409 497
##
## Accuracy : 0.8914
## 95% CI : (0.8848, 0.8977)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 0.001992
##
## Kappa : 0.4425
##
## Mcnemar's Test P-Value : 1.976e-07
##
## Sensitivity : 0.9487
## Specificity : 0.4645
## Pos Pred Value : 0.9296
## Neg Pred Value : 0.5486
## Prevalence : 0.8817
## Detection Rate : 0.8364
## Detection Prevalence : 0.8998
## Balanced Accuracy : 0.7066
##
## 'Positive' Class : 0
##
# Plot
rpart.plot(dt_model,
type = 3,
extra = 104,
under = TRUE,
tweak = 1.2,
box.palette = "RdYlGn",
fallen.leaves = TRUE)
# SMOTE 2
set.seed(123)
trainData$y_yes <- as.factor(trainData$y_yes)
#SMOTE
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], trainData$y_yes, K = 5, dup_size = 2)
# new dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)
# class distribution
table(trainData_balanced$y_yes)##
## 0 1
## 31950 12657
# Train dt
dt_model <- rpart(y_yes ~ ., data = trainData_balanced, method = "class", control = rpart.control(minsplit = 20, cp = 0.01))
# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")
# conf matrix
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
precision <- conf_matrix$byClass["Precision"]
recall <- conf_matrix$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.9380264
## Recall: 0.9360261
## F1 Score: 0.9370252
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7462 493
## 1 510 577
##
## Accuracy : 0.8891
## 95% CI : (0.8824, 0.8955)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 0.0146
##
## Kappa : 0.472
##
## Mcnemar's Test P-Value : 0.6134
##
## Sensitivity : 0.9360
## Specificity : 0.5393
## Pos Pred Value : 0.9380
## Neg Pred Value : 0.5308
## Prevalence : 0.8817
## Detection Rate : 0.8253
## Detection Prevalence : 0.8798
## Balanced Accuracy : 0.7376
##
## 'Positive' Class : 0
##
# Plot
rpart.plot(dt_model,
type = 3,
extra = 104,
under = TRUE,
tweak = 1.2,
box.palette = "RdYlGn",
fallen.leaves = TRUE)
# HYPERPARAMETER TUNING #1
#MODIFY COMPLEXITY PARAMETER
set.seed(123)
dt_model <- rpart(y_yes ~ ., data = trainData_balanced, method = "class",
control = rpart.control(minsplit = 20, cp = 0.05))
# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")
# confusion
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
precision <- conf_matrix$byClass["Precision"]
recall <- conf_matrix$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.9427016
## Recall: 0.9121927
## F1 Score: 0.9271962
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7272 442
## 1 700 628
##
## Accuracy : 0.8737
## 95% CI : (0.8667, 0.8805)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 0.9904
##
## Kappa : 0.4519
##
## Mcnemar's Test P-Value : 2.849e-14
##
## Sensitivity : 0.9122
## Specificity : 0.5869
## Pos Pred Value : 0.9427
## Neg Pred Value : 0.4729
## Prevalence : 0.8817
## Detection Rate : 0.8042
## Detection Prevalence : 0.8531
## Balanced Accuracy : 0.7496
##
## 'Positive' Class : 0
##
# Plot
rpart.plot(dt_model,
type = 3,
extra = 104,
under = TRUE,
tweak = 1.2,
box.palette = "RdYlGn",
fallen.leaves = TRUE)
# HYPERPARAMETER TUNING #2: CHANGING WEIGHTS
set.seed(123)
dt_model <- rpart(y_yes ~ .,
data = trainData_balanced,
method = "class",
parms = list(prior = c(0.4, 0.6)),  # shift prior weight toward the minority class (y = 1)
control = rpart.control(minsplit = 20, cp = 0.01))
# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")
# Confusion
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
precision <- conf_matrix$byClass["Precision"]
recall <- conf_matrix$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.973132
## Recall: 0.7678123
## F1 Score: 0.8583649
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6121 169
## 1 1851 901
##
## Accuracy : 0.7766
## 95% CI : (0.7679, 0.7851)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3629
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7678
## Specificity : 0.8421
## Pos Pred Value : 0.9731
## Neg Pred Value : 0.3274
## Prevalence : 0.8817
## Detection Rate : 0.6770
## Detection Prevalence : 0.6956
## Balanced Accuracy : 0.8049
##
## 'Positive' Class : 0
##
# Plot DT
rpart.plot(dt_model,
type = 3,
extra = 104,
under = TRUE,
tweak = 1.2,
box.palette = "RdYlGn",
fallen.leaves = TRUE)
library(PRROC)
# Get predicted probabilities for the positive class ("1")
dt_probs <- predict(dt_model, newdata = testData, type = "prob")[, "1"]
# Convert actual labels to numeric (0/1)
labels <- as.numeric(as.character(testData$y_yes))
# Compute Precision-Recall curve
pr <- pr.curve(scores.class0 = dt_probs[labels == 1],
scores.class1 = dt_probs[labels == 0],
curve = TRUE)
# Plot PR Curve
plot(pr,
main = "Precision-Recall Curve for Decision Tree (Weighted)",
auc.main = TRUE,
color = "#d95f02",
lwd = 2)