library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(rpart)
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(ada)
## Warning: package 'ada' was built under R version 4.4.3
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Load the bank marketing data (semicolon-separated CSV)
bank_data <- read.csv("https://raw.githubusercontent.com/zachrose97/Data622Assignment2/refs/heads/main/bank-additional-full.csv", sep = ";")
# Drop 'duration': it is only known after a call ends, so it would leak the outcome
bank_data <- bank_data %>% select(-duration)
# Convert character columns to factors and standardize the numeric predictors
bank_data <- bank_data %>% mutate(across(where(is.character), as.factor))
num_vars <- sapply(bank_data, is.numeric)
bank_data[num_vars] <- scale(bank_data[num_vars])
# Stratified 70/30 train/test split on the target y
set.seed(123)
train_index <- createDataPartition(bank_data$y, p = 0.7, list = FALSE)
train_data <- bank_data[train_index, ]
test_data <- bank_data[-train_index, ]
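One caveat worth flagging: scale() above is applied to the full dataset before the split, so the test rows contribute to the centering and scaling statistics. A minimal leakage-free alternative, sketched here under the assumption that the scale() step is skipped and the split is performed first (pre_proc is an illustrative name, using caret's preProcess):

# Learn center/scale parameters from the training partition only, then apply to both sets
pre_proc <- preProcess(train_data[, num_vars], method = c("center", "scale"))
train_data[, num_vars] <- predict(pre_proc, train_data[, num_vars])
test_data[, num_vars] <- predict(pre_proc, test_data[, num_vars])

With tree-based models the effect of scaling is small either way, but fitting the transformation on the training data only keeps the evaluation strictly out-of-sample.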
This experiment evaluates the baseline performance of a decision tree model using default parameters. No hyperparameters were modified. The dataset preprocessing, train-test split, and evaluation metrics (accuracy, Kappa, sensitivity, specificity, balanced accuracy, and AUC) remained the same across all experiments.
tree_model1 <- rpart(y ~ ., data = train_data, method = "class")
pred_tree1 <- predict(tree_model1, test_data, type = "class")
confusionMatrix(pred_tree1, test_data$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10876 1164
## yes 88 228
##
## Accuracy : 0.8987
## 95% CI : (0.8932, 0.9039)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 2.836e-05
##
## Kappa : 0.2351
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9920
## Specificity : 0.1638
## Pos Pred Value : 0.9033
## Neg Pred Value : 0.7215
## Prevalence : 0.8873
## Detection Rate : 0.8802
## Detection Prevalence : 0.9744
## Balanced Accuracy : 0.5779
##
## 'Positive' Class : no
##
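Before tuning, it is useful to look at what the baseline tree actually learned. A quick sketch for inspecting it (rpart.plot is an additional package, not loaded above):

# Complexity-parameter table for the fitted tree, plus a plot of its splits
printcp(tree_model1)
rpart.plot::rpart.plot(tree_model1)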
The objective of this experiment is to test whether limiting tree depth helps reduce overfitting and improve generalization. The maximum depth of the tree was limited to 3, while all other conditions, such as data preprocessing and evaluation metrics, were left unchanged.
tree_model2 <- rpart(y ~ ., data = train_data, method = "class", control = rpart.control(maxdepth = 3))
pred_tree2 <- predict(tree_model2, test_data, type = "class")
confusionMatrix(pred_tree2, test_data$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10876 1164
## yes 88 228
##
## Accuracy : 0.8987
## 95% CI : (0.8932, 0.9039)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 2.836e-05
##
## Kappa : 0.2351
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9920
## Specificity : 0.1638
## Pos Pred Value : 0.9033
## Neg Pred Value : 0.7215
## Prevalence : 0.8873
## Detection Rate : 0.8802
## Detection Prevalence : 0.9744
## Balanced Accuracy : 0.5779
##
## 'Positive' Class : no
##
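The confusion matrix is identical to the baseline, which suggests the default rpart tree never grows beyond three levels on this data, so the maxdepth = 3 cap never binds. A quick sketch to verify that the two fitted trees and their predictions really are the same:

# If the depth cap never binds, both models should share the same node structure
identical(tree_model1$frame, tree_model2$frame)
all(pred_tree1 == pred_tree2)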
This experiment tests the performance of a random forest model using default values for the number of trees and the number of features considered at each split. The goal is to compare its ensemble performance against the single decision tree baseline using consistent preprocessing and evaluation metrics.
rf_model1 <- randomForest(y ~ ., data = train_data)
pred_rf1 <- predict(rf_model1, test_data)
confusionMatrix(pred_rf1, test_data$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10707 1013
## yes 257 379
##
## Accuracy : 0.8972
## 95% CI : (0.8917, 0.9025)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.0002335
##
## Kappa : 0.3262
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9766
## Specificity : 0.2723
## Pos Pred Value : 0.9136
## Neg Pred Value : 0.5959
## Prevalence : 0.8873
## Detection Rate : 0.8665
## Detection Prevalence : 0.9485
## Balanced Accuracy : 0.6244
##
## 'Positive' Class : no
##
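The fitted forest also records how much each predictor contributes to the splits, which helps explain what drives the ensemble. A short sketch using the importance measures stored in rf_model1:

# Mean decrease in Gini impurity per predictor, and the standard importance plot
importance(rf_model1)
varImpPlot(rf_model1)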
In this experiment, the random forest was re-fit with 300 trees (fewer than the default of 500) and mtry fixed at 4. The objective is to examine whether these hyperparameter adjustments lead to better performance. All other settings and evaluation metrics were kept constant.
rf_model2 <- randomForest(y ~ ., data = train_data, ntree = 300, mtry = 4)
pred_rf2 <- predict(rf_model2, test_data)
confusionMatrix(pred_rf2, test_data$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10715 1011
## yes 249 381
##
## Accuracy : 0.898
## 95% CI : (0.8926, 0.9033)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 7.486e-05
##
## Kappa : 0.3298
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9773
## Specificity : 0.2737
## Pos Pred Value : 0.9138
## Neg Pred Value : 0.6048
## Prevalence : 0.8873
## Detection Rate : 0.8672
## Detection Prevalence : 0.9490
## Balanced Accuracy : 0.6255
##
## 'Positive' Class : no
##
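To judge whether 300 trees were sufficient, the out-of-bag error curve of the tuned forest can be inspected; a minimal sketch:

# OOB and per-class error as a function of the number of trees; a flat tail suggests ntree is enough
plot(rf_model2)
tail(rf_model2$err.rate)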
This experiment assesses the baseline performance of the AdaBoost model using 50 boosting iterations and default base-learner settings. The goal is to compare AdaBoost’s ability to reduce bias and variance against the decision tree and random forest models under consistent conditions.
ada_model1 <- ada(y ~ ., data = train_data, iter = 50)
pred_ada1 <- predict(ada_model1, test_data)
confusionMatrix(pred_ada1, test_data$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10818 1081
## yes 146 311
##
## Accuracy : 0.9007
## 95% CI : (0.8953, 0.9059)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 9.596e-07
##
## Kappa : 0.2973
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9867
## Specificity : 0.2234
## Pos Pred Value : 0.9092
## Neg Pred Value : 0.6805
## Prevalence : 0.8873
## Detection Rate : 0.8755
## Detection Prevalence : 0.9630
## Balanced Accuracy : 0.6051
##
## 'Positive' Class : no
##
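The ada object keeps the training error path across iterations, which indicates whether 50 rounds were enough before committing to a longer run. A brief sketch using the package's standard diagnostics:

# Training error by boosting iteration, and variable-importance scores from the boosted trees
plot(ada_model1)
varplot(ada_model1)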
In this experiment, the number of boosting iterations was increased to 100 and the base learners were allowed a greater maximum depth of 4. The objective is to explore whether a more complex model structure improves performance. All preprocessing and evaluation metrics remained the same.
ada_model2 <- ada(y ~ ., data = train_data, iter = 100, control = rpart.control(maxdepth = 4))
pred_ada2 <- predict(ada_model2, test_data)
confusionMatrix(pred_ada2, test_data$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10830 1104
## yes 134 288
##
## Accuracy : 0.8998
## 95% CI : (0.8944, 0.905)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 4.55e-06
##
## Kappa : 0.2798
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9878
## Specificity : 0.2069
## Pos Pred Value : 0.9075
## Neg Pred Value : 0.6825
## Prevalence : 0.8873
## Detection Rate : 0.8765
## Detection Prevalence : 0.9658
## Balanced Accuracy : 0.5973
##
## 'Positive' Class : no
##
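Both boosted models, like the earlier ones, still miss most of the "yes" cases. One common remedy is to rebalance the training data before fitting; the sketch below uses caret's downSample and is purely illustrative (train_bal and ada_bal are hypothetical objects, not part of the six experiments):

# Downsample the majority class so both classes are equally represented, then refit
set.seed(123)
train_bal <- downSample(x = train_data[, setdiff(names(train_data), "y")], y = train_data$y, yname = "y")
ada_bal <- ada(y ~ ., data = train_bal, iter = 50)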
# The outcome levels are already c("no", "yes"); this just makes the ordering explicit for the ROC calls below
levels(test_data$y) <- c("no", "yes")
##DECISION TREE EXP 1
prob_tree1 <- predict(tree_model1, test_data, type = "prob")[, "yes"]
roc_tree1 <- roc(response = test_data$y, predictor = prob_tree1, levels = c("no", "yes"), direction = "<")
auc_tree1 <- auc(roc_tree1)
##DECISION TREE EXP 2
prob_tree2 <- predict(tree_model2, test_data, type = "prob")[, "yes"]
roc_tree2 <- roc(response = test_data$y, predictor = prob_tree2, levels = c("no", "yes"), direction = "<")
auc_tree2 <- auc(roc_tree2)
##RANDOM FOREST EXP 1
prob_rf1 <- predict(rf_model1, test_data, type = "prob")[, "yes"]
roc_rf1 <- roc(response = test_data$y, predictor = prob_rf1, levels = c("no", "yes"), direction = "<")
auc_rf1 <- auc(roc_rf1)
##RANDOM FOREST EXP 2
prob_rf2 <- predict(rf_model2, test_data, type = "prob")[, "yes"]
roc_rf2 <- roc(response = test_data$y, predictor = prob_rf2, levels = c("no", "yes"), direction = "<")
auc_rf2 <- auc(roc_rf2)
##ADABOOST EXP 1
prob_ada1 <- predict(ada_model1, test_data, type = "prob")[, 2] # 2nd column = class "yes"
roc_ada1 <- roc(response = test_data$y, predictor = prob_ada1, levels = c("no", "yes"), direction = "<")
auc_ada1 <- auc(roc_ada1)
##ADABOOST EXP 2
prob_ada2 <- predict(ada_model2, test_data, type = "prob")[, 2]
roc_ada2 <- roc(response = test_data$y, predictor = prob_ada2, levels = c("no", "yes"), direction = "<")
auc_ada2 <- auc(roc_ada2)
auc_values <- data.frame(
Algorithm = c("Decision Tree", "Decision Tree", "Random Forest", "Random Forest", "AdaBoost", "AdaBoost"),
Experiment = c("Exp 1", "Exp 2", "Exp 1", "Exp 2", "Exp 1", "Exp 2"),
AUC = c(auc_tree1, auc_tree2, auc_rf1, auc_rf2, auc_ada1, auc_ada2)
)
print(auc_values)
## Algorithm Experiment AUC
## 1 Decision Tree Exp 1 0.7076750
## 2 Decision Tree Exp 2 0.7076750
## 3 Random Forest Exp 1 0.7875810
## 4 Random Forest Exp 2 0.7842289
## 5 AdaBoost Exp 1 0.8065157
## 6 AdaBoost Exp 2 0.8060440
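Since all six roc objects are already in memory, the curves can also be overlaid for a visual comparison; a short sketch using pROC's ggroc (labels are illustrative):

# Overlay the six test-set ROC curves in one ggplot
ggroc(list("Tree Exp 1" = roc_tree1, "Tree Exp 2" = roc_tree2, "RF Exp 1" = roc_rf1,
"RF Exp 2" = roc_rf2, "Ada Exp 1" = roc_ada1, "Ada Exp 2" = roc_ada2)) +
labs(colour = "Model", title = "ROC curves on the test set")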
# Summary table; values transcribed from the confusionMatrix() and auc() output above
results_table <- data.frame(
Algorithm = c("Decision Tree", "Decision Tree", "Random Forest", "Random Forest", "AdaBoost", "AdaBoost"),
Experiment = c("Exp 1", "Exp 2", "Exp 1", "Exp 2", "Exp 1", "Exp 2"),
Accuracy = c(0.8987, 0.8987, 0.8972, 0.8980, 0.9007, 0.8998),
Kappa = c(0.2351, 0.2351, 0.3262, 0.3298, 0.2973, 0.2798),
Sensitivity = c(0.9920, 0.9920, 0.9766, 0.9773, 0.9867, 0.9878),
Specificity = c(0.1638, 0.1638, 0.2723, 0.2737, 0.2234, 0.2069),
BalancedAccuracy = c(0.5779, 0.5779, 0.6244, 0.6255, 0.6051, 0.5973),
AUC = c(0.7076750, 0.7076750, 0.7875810, 0.7842289, 0.8065157, 0.8060440)
)
library(knitr)
kable(results_table, caption = "Summary of Model Experiment Results")
| Algorithm | Experiment | Accuracy | Kappa | Sensitivity | Specificity | BalancedAccuracy | AUC |
|---|---|---|---|---|---|---|---|
| Decision Tree | Exp 1 | 0.8987 | 0.2351 | 0.9920 | 0.1638 | 0.5779 | 0.7076750 |
| Decision Tree | Exp 2 | 0.8987 | 0.2351 | 0.9920 | 0.1638 | 0.5779 | 0.7076750 |
| Random Forest | Exp 1 | 0.8972 | 0.3262 | 0.9766 | 0.2723 | 0.6244 | 0.7875810 |
| Random Forest | Exp 2 | 0.8980 | 0.3298 | 0.9773 | 0.2737 | 0.6255 | 0.7842289 |
| AdaBoost | Exp 1 | 0.9007 | 0.2973 | 0.9867 | 0.2234 | 0.6051 | 0.8065157 |
| AdaBoost | Exp 2 | 0.8998 | 0.2798 | 0.9878 | 0.2069 | 0.5973 | 0.8060440 |
In this project, I conducted a series of machine learning experiments to evaluate and compare the predictive performance of three supervised classification algorithms: decision trees, random forest, and AdaBoost. The main objective was to determine which algorithm best predicts whether a client will subscribe to a term deposit based on a range of demographic and campaign-related features from a bank marketing dataset. To that end, two distinct experiments were run per algorithm, each exploring how variations in model parameters and configuration affect performance. The evaluation criteria included accuracy, kappa, sensitivity, specificity, balanced accuracy, and the area under the ROC curve (AUC), providing a comprehensive view of each model’s strengths and weaknesses.
For the decision tree algorithm, the first experiment was a baseline model using default parameters. This model performed reasonably well, achieving an accuracy of 89.87% and a high sensitivity of 99.20%, indicating strong performance on the majority class (“no”). However, the specificity was only 16.38%, showing poor performance in identifying the minority class (“yes”), and the low kappa score of 0.2351 suggested limited agreement beyond chance. In the second experiment, the tree depth was capped at 3 with the goal of reducing overfitting. In practice this changed nothing: the confusion matrix and the AUC of 0.7077 are identical to the baseline, indicating that the default rpart tree was already no deeper than three levels, so the cap never came into play. Depth is evidently not the binding constraint for this model; the class imbalance is.

Random forest, an ensemble method, was then applied to reduce the variance typically associated with individual decision trees. The first experiment used default parameters and achieved a kappa of 0.3262, a balanced accuracy of 62.44%, and an AUC of 0.7876, a noticeable improvement over the decision tree models. In the second experiment, the number of trees was set to 300 and mtry to 4. This produced only marginal gains in accuracy (89.80% versus 89.72%), kappa (0.3298), and specificity (27.37%), while the AUC slipped slightly to 0.7842, indicating that the adjustments did not meaningfully improve the model’s ability to separate the two classes. This observation highlights the importance of using multiple metrics, as accuracy alone can be misleading on an imbalanced dataset.

AdaBoost was the final algorithm tested, and it delivered the best class discrimination. The first AdaBoost model achieved the highest AUC, 0.8065, together with the highest accuracy of 90.07%, and better specificity (22.34%) than the baseline decision tree, though its balanced accuracy (60.51%) remained slightly below that of the random forests. In the second experiment, the number of boosting iterations was increased to 100 and the base learners were allowed a maximum depth of 4. The extra capacity did not pay off: accuracy (89.98%), kappa (0.2798), and AUC (0.8060) all came in slightly below the 50-iteration model. Even so, both AdaBoost configurations outperformed every other model on AUC, suggesting that boosting handles the bias-variance trade-off well on this dataset.
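Because the business cost of missing a potential subscriber differs from the cost of a wasted call, the implicit 0.5 probability cutoff behind these confusion matrices is not the only option. A brief sketch of choosing an operating threshold from the already-computed AdaBoost ROC curve using Youden’s J statistic (illustrative, not part of the reported experiments):

# Threshold that maximizes sensitivity + specificity - 1 on the test-set ROC curve
coords(roc_ada1, x = "best", best.method = "youden", ret = c("threshold", "sensitivity", "specificity"))

Shifting the cutoff trades some sensitivity on “no” for better detection of “yes”, which may be preferable in a marketing context.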
In conclusion, AdaBoost outperformed both the decision trees and the random forests in terms of AUC and overall accuracy, while random forest retained a slight edge in balanced accuracy. Random forest demonstrated strong, stable performance relative to a single decision tree, but AdaBoost’s sequential boosting strategy provided the best class discrimination across both of its experiments. Therefore, based on the six experiments, AdaBoost is the recommended model for deployment. It offers a robust and consistent solution that maximizes discriminative performance, with the caveat that, like the other models, it still misses most “yes” cases at the default threshold and would benefit from explicit handling of the class imbalance, for example by rebalancing the training data or adjusting the decision threshold.