Instructions
Perform an analysis of the dataset used in Homework 1 and 2 using the
SVM algorithm.
Compare the results with the results from the previous homeworks.
Read the following articles (reviewed later in this report).
To-dos:
| Model | Y=0 | Y=1 | TP | FN | FP | TN | Accuracy | Sensitivity | Specificity | Bal. Acc. | Precision | F1 | AUC-PR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Decision Tree | |||||||||||||
| Initial Decision Tree (DT) | 39822 | 5289 | 364 | 706 | 239 | 7733 | 0.8955 | 0.9700 | 0.3402 | 0.6551 | 0.934 | 0.945 | |
| SMOTE #1(K=5,dup_size=1) & DT | 31950 | 8438 | 497 | 573 | 409 | 7563 | 0.8914 | 0.9487 | 0.4645 | 0.7066 | |||
| SMOTE #2(K=5,dup_size=2) & DT | 31950 | 12657 | 577 | 493 | 510 | 7462 | 0.8891 | 0.9360 | 0.5393 | 0.7376 | |||
| Hyperparameter tuning #1 (cp=0.05) | 31950 | 12657 | 628 | 442 | 700 | 7272 | 0.8737 | 0.9122 | 0.5869 | 0.7496 | |||
| Hyperparameter tuning #2 (weight) | 31950 | 12657 | 901 | 169 | 1851 | 6121 | 0.7766 | 0.7678 | 0.8421 | 0.8049 | 0.973 | 0.858 | 0.37 |
| Random Forest | |||||||||||||
| Initial Random Forest (RF) | 39822 | 5289 | 410 | 660 | 212 | 7760 | 0.9036 | 0.9734 | 0.3832 | 0.6783 | |||
| SMOTE #1(K=5,dup_size=1) & RF | 31950 | 8438 | 445 | 625 | 234 | 7738 | 0.9050 | 0.9706 | 0.4159 | 0.6933 | |||
| SMOTE #2(K=5,dup_size=2) & RF | 31950 | 12657 | 448 | 622 | 249 | 7723 | 0.9037 | 0.9688 | 0.4187 | 0.6937 | |||
| Hyperparameter tuning #1 (weight) | 31950 | 12657 | 530 | 540 | 371 | 7601 | 0.8992 | 0.9535 | 0.4953 | 0.7244 | |||
| Hyperparameter tuning #2 (weight) | 31950 | 12657 | 537 | 533 | 395 | 7577 | 0.8974 | 0.9505 | 0.5019 | 0.7319 | |||
| ADA Boost | |||||||||||||
| Initial ADA BOOST (ADA) | 39822 | 5289 | 236 | 807 | 134 | 7838 | 0.8959 | 0.9832 | 0.2458 | 0.6145 | |||
| SMOTE #1(K=5,dup_size=1) & ADA | 31950 | 8438 | 461 | 609 | 305 | 7667 | 0.8989 | 0.9617 | 0.4308 | 0.6963 | |||
| SMOTE #2(K=5,dup_size=2) & ADA | 31950 | 12657 | 520 | 550 | 377 | 7595 | 0.8975 | 0.9527 | 0.4860 | 0.7193 | |||
| Hyperparameter tuning #1 (weight4,1) | 31950 | 12657 | 507 | 563 | 333 | 7639 | 0.9009 | 0.9582 | 0.4738 | 0.7160 | |||
| Hyperparameter tuning #2 (weight5,1) | 31950 | 12657 | 507 | 563 | 333 | 7639 | 0.9009 | 0.9582 | 0.4738 | 0.7160 | |||
| SVM | |||||||||||||
| Initial SVM | 39822 | 5289 | 348 | 722 | 165 | 7807 | 0.9019 | 0.9793 | 0.3252 | 0.6523 | 0.915 | 0.946 | |
| SMOTE #1(K=5,dup_size=1) & Radial | 31950 | 8438 | 465 | 605 | 270 | 7702 | 0.9032 | 0.9661 | 0.4346 | 0.7004 | 0.927 | 0.946 | |
| SMOTE #2(K=5,dup_size=2) & Radial | 31950 | 12657 | 533 | 537 | 337 | 7635 | 0.9033 | 0.9577 | 0.4981 | 0.7279 | 0.934 | 0.946 | |
| Kernel Adjustment (Linear) | 31950 | 12657 | 696 | 374 | 616 | 7356 | 0.8905 | 0.9227 | 0.6505 | 0.7866 | 0.951 | 0.937 | 0.56 |
| Hyperparameter tuning #2 (C = 0.1) | 31950 | 12657 | 696 | 374 | 615 | 7357 | 0.8906 | 0.9229 | 0.6505 | 0.7867 | | | |
| Hyperparameter tuning #3 (C = 1.0) | 31950 | 12657 | 696 | 374 | 616 | 7356 | 0.8905 | 0.9227 | 0.6505 | 0.7866 | 0.952 | 0.937 | |
Assignment 1: Results Summarized/EDA
EDA was performed in the previous assignment (https://rpubs.com/greggmaloy/1275261). The EDA from that assignment is summarized below:
There was considerable class imbalance in the target variable (y): ~11% of clients subscribed, while ~88% did not. The dataset contains seven numerical and ten categorical variables. Most numerical features were right-skewed and many had outliers, detected via IQR and scatterplots. The correlation matrix showed no strong linear relationships between features; most variables had very weak or no correlation. There was no missing data.
What the EDA Means for the Decision Tree, Random Forest, ADA Boost & SVM Models
Since SVM and ADA Boost cannot handle categorical variables directly, and since both decision tree and random forest models can accommodate one-hot-encoded features, the categorical variables were transformed via one-hot encoding. This transformation standardizes the dataset across all four models.
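As a rough sketch of this step (it mirrors the encoding code in the appendix and assumes the same bank-full.csv source, with all categorical fields read as character columns):

```r
library(readr)        # read_csv2()
library(fastDummies)  # dummy_cols()

# Read the semicolon-delimited bank dataset (same source used in the appendix)
bank_data <- read_csv2(
  "https://raw.githubusercontent.com/greggmaloy/Data622/main/bank-full.csv",
  show_col_types = FALSE
)

# One-hot encode every character column, dropping the first dummy of each
# variable and removing the original categorical columns
categorical_cols <- names(bank_data)[sapply(bank_data, is.character)]
bank_data_encoded <- fastDummies::dummy_cols(
  bank_data,
  select_columns          = categorical_cols,
  remove_first_dummy      = TRUE,
  remove_selected_columns = TRUE
)
```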
Assignment 2: Summarized Results for DT, RF & ADA Boost Models
Assignment 2 concluded with the final decision tree being the preferred
model. This model included ~7,000 rows of SMOTE generated data, the
complexity parameter adjusted to 5%, and an increased weight assigned to
the target variable of the minority class. In deciding which model was
the most preferred for assignment 2, we assumed the business had
unlimited resources and would therefore prefer a model that maximizes
true positives while minimizing false negatives. As a result, even
though accuracy and sensitivity decreased compared to the initial model,
the increase in true positives in the final decision tree expanded the
pool of potential clients that could be contacted, increasing the
likelihood of successful subscriptions/phone calls.
Assignment 3: Initial Models
In the present assignment, an SVM was utilized and the results were compared to the decision tree, random forest, and ADA Boost models. The initial SVM results were comparable to those of the initial decision tree and initial random forest models across all metrics. The ADA Boost model, however, had markedly lower specificity than the other three models.
Assignment 3: Utilization of SMOTE
As in assignment 2, SMOTE was utilized to generate ~7,000 rows of synthetic data for the SVM model in an attempt to address the class imbalance in the target variable. Results were comparable between the SVM, DT, and ADA Boost models, with all three experiencing negligible decreases in accuracy and sensitivity along with larger increases in specificity and balanced accuracy. Of note, the RF model did not show as large an increase in specificity and balanced accuracy. This may be because, as an ensemble model, it inherently addresses some of the class imbalance.
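A condensed sketch of the SMOTE step (taken from the appendix code, which uses the smotefamily package; trainData is assumed to be the one-hot-encoded training split with the binary target y_yes):

```r
library(smotefamily)  # SMOTE()

set.seed(123)
# K = 5 nearest neighbours; dup_size = 1 adds roughly one synthetic
# example per original minority-class row
smote_out <- SMOTE(trainData[, setdiff(names(trainData), "y_yes")],
                   trainData$y_yes,
                   K = 5, dup_size = 1)

# smotefamily returns the augmented data with the target in a column named "class"
trainData_balanced <- smote_out$data
names(trainData_balanced)[names(trainData_balanced) == "class"] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)

table(trainData_balanced$y_yes)  # inspect the new class distribution
```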
Assignment 3: Kernel Adjustment
Initially, the SVM model was run with a radial basis function (RBF) kernel based on findings from assignment 1, which found a lack of substantial linear relationships among the features. However, after applying SMOTE to address class imbalance and one-hot encoding to standardize the categorical variables, a linear kernel was introduced to reassess whether the data might now support a more linear decision boundary. Surprisingly, the linear kernel outperformed the RBF kernel, yielding higher specificity and balanced accuracy. This improvement may have been due to the one-hot encoding expanding the feature space and creating a clearer decision boundary. The kernel adjustment increased specificity by ~15 percentage points while lowering sensitivity by only ~3 percentage points.
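A minimal sketch of the kernel comparison, condensed from the appendix (scaled_train_df and scaled_test_df are assumed to be the standardized SMOTE-balanced training set and the correspondingly scaled test set):

```r
library(e1071)  # svm()
library(caret)  # confusionMatrix()

set.seed(123)
# Radial basis function kernel (the original choice)
svm_rbf <- svm(y_yes ~ ., data = scaled_train_df, kernel = "radial", probability = TRUE)
# Linear kernel (the adjustment that improved specificity)
svm_lin <- svm(y_yes ~ ., data = scaled_train_df, kernel = "linear", probability = TRUE)

# Compare key confusion-matrix metrics on the held-out test set
cm_rbf <- confusionMatrix(predict(svm_rbf, scaled_test_df), scaled_test_df$y_yes)
cm_lin <- confusionMatrix(predict(svm_lin, scaled_test_df), scaled_test_df$y_yes)
cm_rbf$byClass[c("Sensitivity", "Specificity", "Balanced Accuracy")]
cm_lin$byClass[c("Sensitivity", "Specificity", "Balanced Accuracy")]
```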
Assignment 3: Cost Parameter Adjustment
After the kernel adjustment, the cost parameter was tuned to control how heavily the SVM penalizes misclassifications. The results across two settings (C = 0.1, C = 1.0) were nearly identical to those of the model that introduced the linear kernel. These consistent results suggest that the model's performance was not sensitive to the cost parameter over this range, possibly due to bias introduced by SMOTE or the one-hot encoding. A higher value was attempted (C = 10), but R returned a convergence warning (reaching max number of iterations), so that run was not used.
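One way to organize the cost sweep is sketched below (this structure is an assumption; the appendix fits each cost value in a separate chunk). Large cost values, such as C = 10, may again trigger the "reaching max number of iterations" warning.

```r
library(e1071)
library(caret)

set.seed(123)
for (c_val in c(0.1, 1)) {
  svm_c <- svm(y_yes ~ ., data = scaled_train_df,
               kernel = "linear", cost = c_val, probability = TRUE)
  cm <- confusionMatrix(predict(svm_c, scaled_test_df), scaled_test_df$y_yes)
  cat("cost =", c_val,
      "| sensitivity =", round(cm$byClass["Sensitivity"], 4),
      "| specificity =", round(cm$byClass["Specificity"], 4), "\n")
}
```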
Assignment 3: Final Model Choice
FINAL TWO MODELS
| Model | Y=0 | Y=1 | TP | FN | FP | TN | Accuracy | Sensitivity | Specificity | Bal. Acc. | Precision | F1 | AUC-PR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DT HP tuning #2 (SMOTE & weight) | 31950 | 12657 | 901 | 169 | 1851 | 6121 | 0.7766 | 0.7678 | 0.8421 | 0.8049 | 0.973 | 0.858 | 0.37 |
| SVM Kernel Adjustment (Linear & SMOTE) | 31950 | 12657 | 696 | 374 | 616 | 7356 | 0.8905 | 0.9227 | 0.6505 | 0.7866 | 0.951 | 0.937 | 0.56 |
The final model selected for this assignment is the SVM with SMOTE and a linear kernel. Although the decision tree model outperformed it on some metrics, such as precision, balanced accuracy, and specificity, the final decision tree also saw a massive increase in false positives (FP = 1,851), dramatically lower sensitivity (0.7678), and lower accuracy (0.7766). The SVM had the higher AUC-PR (0.56 vs. 0.37), indicating it struck a better balance between true positives and false positives. Additionally, the SVM had the higher F1 score of the two models (0.937), denoting a better balance between precision and recall.
In terms of business utilization, although we assumed the business had unlimited resources, the number of false positives generated by the decision tree model (FP = 1,851) was approximately three times that of the SVM model (FP = 616). This would translate into significant operational costs. In contrast, the SVM model maintained a strong true positive rate while keeping false positives comparatively low, making it the more cost-effective choice for subscriber outreach.
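The AUC-PR values cited above come from the PRROC package; a condensed sketch (mirroring the appendix code, with the minority class "1" treated as the positive class and svm_lin taken from the kernel sketch above) is:

```r
library(PRROC)  # pr.curve()

# Predicted probability of subscribing ("1") for each test observation
svm_probs <- attr(predict(svm_lin, scaled_test_df, probability = TRUE),
                  "probabilities")[, "1"]
labels <- as.numeric(as.character(scaled_test_df$y_yes))

# Precision-recall curve and its area (the AUC-PR reported in the table)
pr <- pr.curve(scores.class0 = svm_probs[labels == 1],  # scores for actual positives
               scores.class1 = svm_probs[labels == 0],  # scores for actual negatives
               curve = TRUE)
pr$auc.integral
```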
Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection: A Comparative Study
The article “Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection: A Comparative Study” by Amir Ahmad et al. investigated the effectiveness of various decision tree-based models in predicting COVID-19 infection. More specifically, the authors compared standard ensemble methods, such as Random Forest, AdaBoost, and XGBoost, against specialized ensemble methods designed for imbalanced data, such as Balanced Random Forest and SMOTEBoost. Similar to the bank dataset, the data used in this study suffered from class imbalance, with relatively few positive cases in the target variable. To address this, the authors applied both SMOTE and random undersampling. The models were evaluated using AUROC and AUPRC. The main finding was that ensemble methods designed for imbalanced data outperform standard ensemble techniques. Of note, SVM was not used in this study, although SMOTE was.
A Novel Approach to Predict COVID-19 Using Support Vector
Machine
In “A Novel Approach to Predict COVID-19 Using Support Vector Machine”, the authors utilized SVM, logistic regression, KNN, Naive Bayes, and random forest to predict the severity of COVID-19 infection based on symptom features. Among these models, SVM achieved the highest accuracy. Unlike the study by Amir Ahmad et al., this article did not utilize SMOTE or any other rebalancing technique to address class imbalance. Additionally, the authors relied primarily on classification accuracy as the evaluation metric, whereas Amir Ahmad et al. used AUROC, AUPRC, and F1-score.
Predicting metabolic syndrome using decision tree and support
vector machine methods
The study “Predicting metabolic syndrome using decision tree and support vector machine methods” aimed to predict the incidence of metabolic syndrome. It utilized two models, an SVM (polynomial kernel) and a decision tree, and used SMOTE to address the class imbalance in the dataset. Evaluation metrics included sensitivity, specificity, and accuracy. Ultimately, the SVM outperformed the decision tree on all three metrics.
A comparative study of decision tree and support vector machine for breast cancer prediction
In the article “A comparative study of decision tree and support vector machine for breast cancer prediction”, the authors aimed to improve breast cancer diagnosis by comparing SVM models to decision tree models. Accuracy, sensitivity, specificity, precision, and AUC were used to evaluate the models. In the end, SVM outperformed the decision tree models across all metrics, although the results were fairly comparable, suggesting that SVM provides a slightly more accurate model. Of note, although there was some mild class imbalance in the dataset, no imbalance correction, such as SMOTE, was attempted.
Utility of support vector machine and decision tree to identify the prognosis of metformin poisoning in the United States: Analysis of National Poisoning Data System
In this article, the authors aimed to predict the prognosis of metformin poisoning using SVM and decision tree models to classify outcomes as minor, moderate, or major. SVM outperformed the decision tree in all evaluation metrics, including accuracy, precision, recall, F1-score, ROC-AUC, and PR-AUC. Of note, this dataset also suffered from class imbalance; however, the authors did not utilize techniques such as SMOTE to address it.
| Article | Imbalance Present | Imbalance Addressed | Metrics | Preferred Model(s) |
|---|---|---|---|---|
| DT Ensembles to Predict Coronavirus Disease 2019 Inf | yes | yes (various, incl. SMOTE) | AUROC, AUPRC | Ensemble methods for imbalanced data |
| A Novel Approach to Predict COVID-19 Using SVM | yes | no | Accuracy | SVM |
| Predicting metabolic syndrome using DT and SVM methods | yes | yes-SMOTE | sen,spec,acc | SVM |
| A comparative study of SVM and DT for breast ca predict | mild | no | sen,spec,acc,prec,AUC | SVM |
| Utility of SVM and DT prognosis of metformin poisoning | yes | no | sen,acc,prec,F1,AUC | SVM |
Discussion
As a healthcare professional, I find the above articles extremely useful. All of them allude to a data issue that frequently affects healthcare data: class imbalance in the target variable. In healthcare, the target variable is often a disease or condition that is relatively uncommon in the general population. Two of the five articles employed SMOTE to help remedy this imbalance. Furthermore, in four of the five articles, SVM models outperformed the other models (usually decision trees). Of note, the five articles also utilized similar evaluation metrics.
Ahmad, A., Safi, O., Malebary, S., Alesawi, S., & Alkayal, E. (2021). Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection: A Comparative Study. Complexity, 2021. https://doi.org/10.1155/2021/5550344
Guhathakurata, S., Kundu, S., Chakraborty, A., & Banerjee, J. S. (2021). A novel approach to predict COVID-19 using support vector machine. In Data Science for COVID-19 (pp. 351–364). Elsevier. https://doi.org/10.1016/B978-0-12-824536-1.00014-9
Karimi-Alavijeh, F., Jalili, S., & Sadeghi, M. (2016). Predicting metabolic syndrome using decision tree and support vector machine methods. ARYA Atherosclerosis, 12(3), 146–152. https://pmc.ncbi.nlm.nih.gov/articles/PMC5055373/
Ogbe, M. I., Nzeanorue, C. C., Olusola, R. A., Olofin, D. O., Owoeye, M. C., Enabulele, E. C., Ibijola, A. P., Ifechukwu, C. J., & Ayo, O. I. (2024). A comparative study of decision tree and support vector machine for breast cancer prediction. World Journal of Advanced Research and Reviews, 23(1), 746–752. https://doi.org/10.30574/wjarr.2024.23.1.2024
Mehrpour, O., Saeedi, F., Hoyte, C., Goss, F., & Shirazi, F. M. (2022). Utility of support vector machine and decision tree to identify the prognosis of metformin poisoning in the United States: Analysis of National Poisoning Data System. BMC Pharmacology and Toxicology, 23(1), 49. https://doi.org/10.1186/s40360-022-00588-0
########################################## ONLY CODE BELOW###############################################################
library(tidyverse) ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(mlbench)
library(e1071)        # svm()
library(smotefamily)  # SMOTE()
library(rpart)        # rpart()
library(rpart.plot)   # rpart.plot()
bank_data <- read_csv2("https://raw.githubusercontent.com/greggmaloy/Data622/main/bank-full.csv", show_col_types = FALSE)## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
# Categorical variables
categorical_cols <- names(bank_data)[sapply(bank_data, is.character)]
# One-hot encoding
bank_data_encoded <- fastDummies::dummy_cols(bank_data, select_columns = categorical_cols,
remove_first_dummy = TRUE, remove_selected_columns = TRUE)
# Split data 80 training/ 20 testing
set.seed(42)
trainIndex <- createDataPartition(bank_data_encoded$y_yes, p = 0.8, list = FALSE)
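# (createDataPartition keeps the proportion of 0s and 1s in y_yes roughly equal across the splits)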
trainData <- bank_data_encoded[trainIndex, ]
testData <- bank_data_encoded[-trainIndex, ]
# factorize target
trainData$y_yes <- factor(trainData$y_yes, levels = c(0, 1))
testData$y_yes <- factor(testData$y_yes, levels = c(0, 1))
set.seed(123)
# BEFORE SMOTE
# Scale features
scaled_train <- trainData
scaled_test <- testData
# Remove target for scaling
train_features <- scaled_train[, setdiff(names(scaled_train), "y_yes")]
test_features <- scaled_test[, setdiff(names(scaled_test), "y_yes")]
# Standardize
scaled_train_scaled <- scale(train_features)
scaled_test_scaled <- scale(test_features, center = attr(scaled_train_scaled, "scaled:center"),
scale = attr(scaled_train_scaled, "scaled:scale"))
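# (the test set is centered and scaled with the training set's statistics so no test information leaks into training)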
# Re-add target variable
scaled_train <- data.frame(scaled_train_scaled)
scaled_train$y_yes <- trainData$y_yes
scaled_test <- data.frame(scaled_test_scaled)
scaled_test$y_yes <- testData$y_yes
# Train SVM with radial
svm_model <- svm(y_yes ~ ., data = scaled_train, kernel = "radial", probability = TRUE)
# Predictions
svm_predictions <- predict(svm_model, newdata = scaled_test)
# Confusion Matrix
conf_matrix_svm <- confusionMatrix(svm_predictions, scaled_test$y_yes)
precision <- conf_matrix_svm$byClass["Precision"]
recall <- conf_matrix_svm$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.9153476
## Recall: 0.9793026
## F1 Score: 0.9462457
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7807 722
## 1 165 348
##
## Accuracy : 0.9019
## 95% CI : (0.8956, 0.908)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 5.443e-10
##
## Kappa : 0.3931
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9793
## Specificity : 0.3252
## Pos Pred Value : 0.9153
## Neg Pred Value : 0.6784
## Prevalence : 0.8817
## Detection Rate : 0.8634
## Detection Prevalence : 0.9433
## Balanced Accuracy : 0.6523
##
## 'Positive' Class : 0
##
#SMOTE 1!!!!!!!!!!!!!!!!!!
set.seed(123)
# Ensure target is factor
trainData$y_yes <- as.factor(trainData$y_yes)
testData$y_yes <- as.factor(testData$y_yes)
# SMOTE - balance the training set
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")],
trainData$y_yes,
K = 5, dup_size = 1)
# Create SMOTE-balanced dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)
print(table(trainData_balanced$y_yes))##
## 0 1
## 31950 8438
# Standardize numeric features after SMOTE
train_features <- trainData_balanced[, setdiff(names(trainData_balanced), "y_yes")]
test_features <- testData[, setdiff(names(testData), "y_yes")]
# Scale training and testing
scaled_train <- scale(train_features)
scaled_test <- scale(test_features,
center = attr(scaled_train, "scaled:center"),
scale = attr(scaled_train, "scaled:scale"))
# Combine scaled features with target variable
scaled_train_df <- data.frame(scaled_train)
scaled_train_df$y_yes <- trainData_balanced$y_yes
scaled_test_df <- data.frame(scaled_test)
scaled_test_df$y_yes <- testData$y_yes
# Train SVM radial kernel
svm_model <- svm(y_yes ~ ., data = scaled_train_df,
kernel = "radial", probability = TRUE)
# Predictions
svm_predictions <- predict(svm_model, newdata = scaled_test_df)
# Confusion
conf_matrix_svm <- confusionMatrix(svm_predictions, scaled_test_df$y_yes)
precision <- conf_matrix_svm$byClass["Precision"]
recall <- conf_matrix_svm$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.9271699
## Recall: 0.9661315
## F1 Score: 0.9462498
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7702 605
## 1 270 465
##
## Accuracy : 0.9032
## 95% CI : (0.8969, 0.9092)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 3.782e-11
##
## Kappa : 0.4635
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9661
## Specificity : 0.4346
## Pos Pred Value : 0.9272
## Neg Pred Value : 0.6327
## Prevalence : 0.8817
## Detection Rate : 0.8518
## Detection Prevalence : 0.9187
## Balanced Accuracy : 0.7004
##
## 'Positive' Class : 0
##
#SMOTE 2!!!!!!!!!!!!!!!!!!
set.seed(123)
#ensure target is factor
trainData$y_yes <- as.factor(trainData$y_yes)
testData$y_yes <- as.factor(testData$y_yes)
# SMOTE dup_size=2
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")],
trainData$y_yes,
K = 5, dup_size = 2)
# create SMOTE dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)
print(table(trainData_balanced$y_yes))##
## 0 1
## 31950 12657
# Standardize numeric features after SMOTE
train_features <- trainData_balanced[, setdiff(names(trainData_balanced), "y_yes")]
test_features <- testData[, setdiff(names(testData), "y_yes")]
# Scale training and testing
scaled_train <- scale(train_features)
scaled_test <- scale(test_features,
center = attr(scaled_train, "scaled:center"),
scale = attr(scaled_train, "scaled:scale"))
# Combine scaled features with target variable
scaled_train_df <- data.frame(scaled_train)
scaled_train_df$y_yes <- trainData_balanced$y_yes
scaled_test_df <- data.frame(scaled_test)
scaled_test_df$y_yes <- testData$y_yes
# Train SVM radial kernel
svm_model <- svm(y_yes ~ ., data = scaled_train_df,
kernel = "radial", probability = TRUE)
# Predictions
svm_predictions <- predict(svm_model, newdata = scaled_test_df)
# Confusion
conf_matrix_svm <- confusionMatrix(svm_predictions, scaled_test_df$y_yes)
precision <- conf_matrix_svm$byClass["Precision"]
recall <- conf_matrix_svm$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.9342878
## Recall: 0.957727
## F1 Score: 0.9458622
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7635 537
## 1 337 533
##
## Accuracy : 0.9033
## 95% CI : (0.8971, 0.9094)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 3.004e-11
##
## Kappa : 0.496
##
## Mcnemar's Test P-Value : 1.682e-11
##
## Sensitivity : 0.9577
## Specificity : 0.4981
## Pos Pred Value : 0.9343
## Neg Pred Value : 0.6126
## Prevalence : 0.8817
## Detection Rate : 0.8444
## Detection Prevalence : 0.9038
## Balanced Accuracy : 0.7279
##
## 'Positive' Class : 0
##
set.seed(123)
# Train with linear
svm_model <- svm(y_yes ~ ., data = scaled_train_df,
kernel = "linear", probability = TRUE)
# Prediction
svm_predictions <- predict(svm_model, newdata = scaled_test_df)
# Confusion
conf_matrix_svm <- confusionMatrix(svm_predictions, scaled_test_df$y_yes)
precision <- conf_matrix_svm$byClass["Precision"]
recall <- conf_matrix_svm$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.9516171
## Recall: 0.9227296
## F1 Score: 0.9369507
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7356 374
## 1 616 696
##
## Accuracy : 0.8905
## 95% CI : (0.8839, 0.8969)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 0.00449
##
## Kappa : 0.5221
##
## Mcnemar's Test P-Value : 1.867e-14
##
## Sensitivity : 0.9227
## Specificity : 0.6505
## Pos Pred Value : 0.9516
## Neg Pred Value : 0.5305
## Prevalence : 0.8817
## Detection Rate : 0.8135
## Detection Prevalence : 0.8549
## Balanced Accuracy : 0.7866
##
## 'Positive' Class : 0
##
## Loading required package: rlang
##
## Attaching package: 'rlang'
## The following objects are masked from 'package:purrr':
##
## %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
## flatten_raw, invoke, splice
svm_probs <- attr(predict(svm_model, newdata = scaled_test_df, probability = TRUE), "probabilities")[, "1"]
labels <- as.numeric(as.character(scaled_test_df$y_yes))
pr <- pr.curve(scores.class0 = svm_probs[labels == 1],
scores.class1 = svm_probs[labels == 0],
curve = TRUE)
# Plot PR curve
plot(pr,
main = "Precision-Recall Curve for SVM (Linear)",
auc.main = TRUE,
color = "#2c7fb8",
lwd = 2)
set.seed(123)
# Try different cost with linear kernel
svm_model <- svm(y_yes ~ ., data = scaled_train_df,
kernel = "linear",
cost = 1,
probability = TRUE)
# Prediction
svm_predictions <- predict(svm_model, newdata = scaled_test_df)
# confusion
conf_matrix_svm <- confusionMatrix(svm_predictions, scaled_test_df$y_yes)
precision <- conf_matrix_svm$byClass["Precision"]
recall <- conf_matrix_svm$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.9516171
## Recall: 0.9227296
## F1 Score: 0.9369507
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7356 374
## 1 616 696
##
## Accuracy : 0.8905
## 95% CI : (0.8839, 0.8969)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 0.00449
##
## Kappa : 0.5221
##
## Mcnemar's Test P-Value : 1.867e-14
##
## Sensitivity : 0.9227
## Specificity : 0.6505
## Pos Pred Value : 0.9516
## Neg Pred Value : 0.5305
## Prevalence : 0.8817
## Detection Rate : 0.8135
## Detection Prevalence : 0.8549
## Balanced Accuracy : 0.7866
##
## 'Positive' Class : 0
##
#set.seed(123)
# Cost =10 with linear
#svm_model <- svm(y_yes ~ ., data = scaled_train_df,
# kernel = "linear",
# cost = 10,
# probability = TRUE)
# Prediction
#svm_predictions <- predict(svm_model, newdata = scaled_test_df)
# Confusion
#conf_matrix_svm <- confusionMatrix(svm_predictions, scaled_test_df$y_yes)
#print(conf_matrix_svm)
## Warning: package 'PRROC' is in use and will not be installed
library(PRROC)
# Get predicted probabilities for class "1"
svm_probs <- attr(predict(svm_model, newdata = scaled_test_df, probability = TRUE), "probabilities")[, "1"]
# Convert test labels to numeric (must be 0/1)
labels <- as.numeric(as.character(scaled_test_df$y_yes))
# Create precision-recall object
pr <- pr.curve(scores.class0 = svm_probs[labels == 1],
scores.class1 = svm_probs[labels == 0],
curve = TRUE)
# Plot the PR curve
plot(pr,
main = "Precision-Recall Curve for SVM",
auc.main = TRUE,
color = "#2c7fb8",
lwd = 2)
# DECISION TREE
bank_data <- read_csv2("https://raw.githubusercontent.com/greggmaloy/Data622/main/bank-full.csv", show_col_types = FALSE)## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
# Categorical variables
categorical_cols <- names(bank_data)[sapply(bank_data, is.character)]
# One-hot encoding
bank_data_encoded <- fastDummies::dummy_cols(bank_data, select_columns = categorical_cols,
remove_first_dummy = TRUE, remove_selected_columns = TRUE)
# Split data 80 training/ 20 testing
set.seed(42)
trainIndex <- createDataPartition(bank_data_encoded$y_yes, p = 0.8, list = FALSE)
trainData <- bank_data_encoded[trainIndex, ]
testData <- bank_data_encoded[-trainIndex, ]
# factorize target
trainData$y_yes <- factor(trainData$y_yes, levels = c(0, 1))
testData$y_yes <- factor(testData$y_yes, levels = c(0, 1))
set.seed(123)
#INITIAL DT MODEL!!!!!!!!!!!!!!!!!!
dt_model <- rpart(y_yes ~ ., data = trainData, method = "class", control = rpart.control(minsplit = 20, cp = 0.01))
rpart.plot(dt_model)
# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")
# COnfusion matrix
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
precision <- conf_matrix$byClass["Precision"]
recall <- conf_matrix$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.9163408
## Recall: 0.9700201
## F1 Score: 0.9424167
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7733 706
## 1 239 364
##
## Accuracy : 0.8955
## 95% CI : (0.889, 0.9017)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 1.886e-05
##
## Kappa : 0.3825
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9700
## Specificity : 0.3402
## Pos Pred Value : 0.9163
## Neg Pred Value : 0.6036
## Prevalence : 0.8817
## Detection Rate : 0.8552
## Detection Prevalence : 0.9333
## Balanced Accuracy : 0.6551
##
## 'Positive' Class : 0
##
dt_model$variable.importance## duration poutcome_success contact_unknown pdays
## 1110.9180804 685.8992109 3.9494818 0.8776626
## previous age campaign
## 0.5831593 0.1443280 0.1443280
#SMOTE 1!!!!!!!!!!!!!!!!!!
set.seed(123)
trainData$y_yes <- as.factor(trainData$y_yes)
#SMOTE
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], trainData$y_yes, K = 5, dup_size = 1)
# Creation of new dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)
# New class distribution
table(trainData_balanced$y_yes)##
## 0 1
## 31950 8438
set.seed(123)
# Train decision tree
dt_model <- rpart(y_yes ~ ., data = trainData_balanced, method = "class", control = rpart.control(minsplit = 20, cp = 0.01))
# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")
# Confusion Matrix
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
precision <- conf_matrix$byClass["Precision"]
recall <- conf_matrix$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.9295723
## Recall: 0.9486954
## F1 Score: 0.9390365
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7563 573
## 1 409 497
##
## Accuracy : 0.8914
## 95% CI : (0.8848, 0.8977)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 0.001992
##
## Kappa : 0.4425
##
## Mcnemar's Test P-Value : 1.976e-07
##
## Sensitivity : 0.9487
## Specificity : 0.4645
## Pos Pred Value : 0.9296
## Neg Pred Value : 0.5486
## Prevalence : 0.8817
## Detection Rate : 0.8364
## Detection Prevalence : 0.8998
## Balanced Accuracy : 0.7066
##
## 'Positive' Class : 0
##
# Plot
rpart.plot(dt_model,
type = 3,
extra = 104,
under = TRUE,
tweak = 1.2,
box.palette = "RdYlGn",
fallen.leaves = TRUE)
# SMOTE 2
set.seed(123)
trainData$y_yes <- as.factor(trainData$y_yes)
#SMOTE
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], trainData$y_yes, K = 5, dup_size = 2)
# new dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)
# class distribution
table(trainData_balanced$y_yes)##
## 0 1
## 31950 12657
# Train dt
dt_model <- rpart(y_yes ~ ., data = trainData_balanced, method = "class", control = rpart.control(minsplit = 20, cp = 0.01))
# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")
# conf matrix
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
precision <- conf_matrix$byClass["Precision"]
recall <- conf_matrix$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.9380264
## Recall: 0.9360261
## F1 Score: 0.9370252
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7462 493
## 1 510 577
##
## Accuracy : 0.8891
## 95% CI : (0.8824, 0.8955)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 0.0146
##
## Kappa : 0.472
##
## Mcnemar's Test P-Value : 0.6134
##
## Sensitivity : 0.9360
## Specificity : 0.5393
## Pos Pred Value : 0.9380
## Neg Pred Value : 0.5308
## Prevalence : 0.8817
## Detection Rate : 0.8253
## Detection Prevalence : 0.8798
## Balanced Accuracy : 0.7376
##
## 'Positive' Class : 0
##
# Plot
rpart.plot(dt_model,
type = 3,
extra = 104,
under = TRUE,
tweak = 1.2,
box.palette = "RdYlGn",
fallen.leaves = TRUE)
# HYPERPARAMETER TUNING #1
#MODIFY COMPLEXITY PARAMETER
set.seed(123)
dt_model <- rpart(y_yes ~ ., data = trainData_balanced, method = "class",
control = rpart.control(minsplit = 20, cp = 0.05))
# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")
# confusion
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
precision <- conf_matrix$byClass["Precision"]
recall <- conf_matrix$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.9427016
## Recall: 0.9121927
## F1 Score: 0.9271962
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7272 442
## 1 700 628
##
## Accuracy : 0.8737
## 95% CI : (0.8667, 0.8805)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 0.9904
##
## Kappa : 0.4519
##
## Mcnemar's Test P-Value : 2.849e-14
##
## Sensitivity : 0.9122
## Specificity : 0.5869
## Pos Pred Value : 0.9427
## Neg Pred Value : 0.4729
## Prevalence : 0.8817
## Detection Rate : 0.8042
## Detection Prevalence : 0.8531
## Balanced Accuracy : 0.7496
##
## 'Positive' Class : 0
##
# Plot
rpart.plot(dt_model,
type = 3,
extra = 104,
under = TRUE,
tweak = 1.2,
box.palette = "RdYlGn",
fallen.leaves = TRUE)
# HYPERPARAMETER TUNING #2: CHANGING WEIGHTS
set.seed(123)
dt_model <- rpart(y_yes ~ .,
data = trainData_balanced,
method = "class",
parms = list(prior = c(0.4, 0.6)),  # shift prior weight toward the minority class (y = 1)
control = rpart.control(minsplit = 20, cp = 0.01))
# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")
# Confusion
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
precision <- conf_matrix$byClass["Precision"]
recall <- conf_matrix$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")## Precision: 0.973132
## Recall: 0.7678123
## F1 Score: 0.8583649
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6121 169
## 1 1851 901
##
## Accuracy : 0.7766
## 95% CI : (0.7679, 0.7851)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3629
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7678
## Specificity : 0.8421
## Pos Pred Value : 0.9731
## Neg Pred Value : 0.3274
## Prevalence : 0.8817
## Detection Rate : 0.6770
## Detection Prevalence : 0.6956
## Balanced Accuracy : 0.8049
##
## 'Positive' Class : 0
##
# Plot DT
rpart.plot(dt_model,
type = 3,
extra = 104,
under = TRUE,
tweak = 1.2,
box.palette = "RdYlGn",
fallen.leaves = TRUE)
library(PRROC)
# Get predicted probabilities for the positive class ("1")
dt_probs <- predict(dt_model, newdata = testData, type = "prob")[, "1"]
# Convert actual labels to numeric (0/1)
labels <- as.numeric(as.character(testData$y_yes))
# Compute Precision-Recall curve
pr <- pr.curve(scores.class0 = dt_probs[labels == 1],
scores.class1 = dt_probs[labels == 0],
curve = TRUE)
# Plot PR Curve
plot(pr,
main = "Precision-Recall Curve for Decision Tree (Weighted)",
auc.main = TRUE,
color = "#d95f02",
lwd = 2)