Assignment

Introduction
In Machine Learning, experimentation refers to the systematic process of designing, executing, and analyzing different configurations to identify the settings that perform best on a given task. Experimentation is learning by doing. It involves systematically changing parameters, evaluating results with metrics, and comparing different approaches to find the best solution; essentially, it is the practice of testing and refining machine learning models through controlled experiments to improve their performance.

The key is to modify only one or a few variables at a time to isolate the impact of each change and understand its effect on model performance. In this assignment you will conduct at least 6 experiments. In real life, data scientists run anywhere from a dozen to hundreds of experiments, depending on the dataset and problem domain.

Dataset
A Portuguese bank conducted a marketing campaign (phone calls) to predict whether a client would subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics to help the bank persuade more customers to subscribe to its term deposit in the next campaign. Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing

Assignment
This assignment consists of conducting at least two (2) experiments for each of three algorithms: Decision Trees, Random Forest, and AdaBoost. That is, at least six (6) experiments in total (3 algorithms x 2 experiments each). For each experiment you will define what you are trying to achieve (before each run), conduct the experiment, and at the end review how the experiment went. These experiments will allow you to compare algorithms and choose the optimal model.

Transformations and Algorithm Selection
You will perform experiments using the following algorithms:

  • Decision Trees
  • Random Forest
  • AdaBoost

Experiment
For each of the algorithms (above), perform at least two (2) experiments. In a typical experiment you should:

  • Define the objective of the experiment (hypothesis)
  • Decide what will change, and what will stay the same
  • Select the evaluation metric (what you want to measure)
  • Perform the experiment
  • Document the experiment so you compare results (track progress)

Variations
There are many things you can vary between experiments, here are some examples:

  • Data sampling (feature selection)
  • Data augmentation e.g., regularization, normalization, scaling
  • Hyperparameter optimization (you decide, random search, grid search, etc.)
  • Decision Tree breadth & depth (this is an example of a hyperparameter)
  • Evaluation metrics e.g., Accuracy, precision, recall, F1-score, AUC-ROC
  • Cross-validation strategy e.g., holdout, k-fold, leave-one-out
  • Number of trees (for ensemble models)
  • Train-test split: Using different data splits to assess model generalization ability

Essay
Write a short essay summarizing your findings. Your essay should include:

  • Explain why you chose the experiments you did
  • Discuss bias & variance across the experiments, e.g. between Decision Tree experiments, and with Random Forest & Adaboost
  • A table with experiments & results
  • What was the optimal model you found, and why
  • What conclusion did you come to? What do you recommend?

PART I: EDA & Pre-processing

EDA
EDA was performed in the previous assignment: https://rpubs.com/greggmaloy/1275261
The EDA from the previous assignment is summarized below:

There was considerable class imbalance in the target variable (y): ~11% of clients subscribed (yes), while ~88% did not (no). There are seven numerical and ten categorical variables in the dataset. Most numerical features were right-skewed and many had outliers (detected via IQR and scatterplots). There were no strong linear relationships between features, with most variables showing very weak or no correlations in the correlation matrix. There was no missing data.

What the EDA Means for the Decision Tree, Random Forest & AdaBoost Models
Since AdaBoost requires a numerical representation of categorical variables in order to assign weights to features, and since both decision tree and random forest models can accommodate one-hot encoded categorical variables, the categorical variables are transformed below via one-hot encoding. This transformation standardizes the dataset for all three models.
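
For illustration, below is a minimal sketch of what one-hot encoding does to a single categorical column. The toy data frame and its values are hypothetical; fastDummies::dummy_cols is the same function applied to the bank data further down.

library(fastDummies)

# Toy data frame with one categorical column (hypothetical values)
toy <- data.frame(job = c("admin", "technician", "admin", "services"))

# One 0/1 indicator column per level; the first level ("admin") is dropped
# as the reference category, mirroring the encoding applied below
fastDummies::dummy_cols(toy, select_columns = "job",
                        remove_first_dummy = TRUE,
                        remove_selected_columns = TRUE)
# yields indicator columns such as job_services and job_technician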

Evaluation Metrics
Confusion matrices were used to assess all models in this analysis. More specifically, accuracy, sensitivity, specificity, and balanced accuracy were of particular importance. Note that caret reports sensitivity and specificity with Y=0 as the positive class, so specificity throughout this analysis measures how well a model identifies subscribers (Y=1).
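
For reference, below is a minimal sketch of how these metrics fall out of a 2x2 confusion matrix, using the initial decision tree's test-set counts (reported further down) and caret's convention that Y=0 is the positive class:

# Confusion matrix as caret prints it: rows = predicted, columns = reference
cm <- matrix(c(7733, 239,    # reference 0: predicted 0, predicted 1
               706,  364),   # reference 1: predicted 0, predicted 1
             nrow = 2,
             dimnames = list(Predicted = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)        # (7733 + 364) / 9042
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # class-0 recall (non-subscribers)
specificity <- cm["1", "1"] / sum(cm[, "1"])  # class-1 recall (subscribers)
balanced_accuracy <- (sensitivity + specificity) / 2
round(c(accuracy, sensitivity, specificity, balanced_accuracy), 4)
## 0.8955 0.9700 0.3402 0.6551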

########################################## ONLY CODE BELOW###############################################################

library(tidyverse)   
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)      
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(fastDummies) 
library(rpart)       
library(rpart.plot)  
library(smotefamily)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(mlbench)

bank_data <- read_csv2("https://raw.githubusercontent.com/greggmaloy/Data622/main/bank-full.csv", show_col_types = FALSE)
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
# Categorical variables
categorical_cols <- names(bank_data)[sapply(bank_data, is.character)]

# One-hot encoding 
bank_data_encoded <- fastDummies::dummy_cols(bank_data, select_columns = categorical_cols, 
                                             remove_first_dummy = TRUE, remove_selected_columns = TRUE)

# Split data 80 training/ 20 testing 
set.seed(42)
trainIndex <- createDataPartition(bank_data_encoded$y_yes, p = 0.8, list = FALSE)  
trainData <- bank_data_encoded[trainIndex, ]
testData <- bank_data_encoded[-trainIndex, ]

# factorize target
trainData$y_yes <- factor(trainData$y_yes, levels = c(0, 1))
testData$y_yes <- factor(testData$y_yes, levels = c(0, 1))

set.seed(123) 

Part II: Decision Trees

| Model | Y=0 | Y=1 | TP | FN | FP | TN | Accuracy | Sensitivity | Specificity | Bal. Acc. |
|---|---|---|---|---|---|---|---|---|---|---|
| Initial Decision Tree (DT) | 39822 | 5289 | 364 | 706 | 239 | 7733 | 0.8955 | 0.9700 | 0.3402 | 0.6551 |
| SMOTE #1 (K=5, dup_size=1) & DT | 31950 | 8438 | 497 | 573 | 409 | 7563 | 0.8914 | 0.9487 | 0.4645 | 0.7066 |
| SMOTE #2 (K=5, dup_size=2) & DT | 31950 | 12657 | 577 | 493 | 510 | 7462 | 0.8891 | 0.9360 | 0.5393 | 0.7376 |
| Hyperparameter tuning #1 (cp=0.05) | 31950 | 12657 | 628 | 442 | 700 | 7272 | 0.8737 | 0.9122 | 0.5869 | 0.7496 |
| Hyperparameter tuning #2 (weight) | 31950 | 12657 | 901 | 169 | 1851 | 6121 | 0.7766 | 0.7678 | 0.8421 | 0.8049 |

TP/FN/FP/TN count test-set predictions with Y=1 treated as the positive class; the Sensitivity and Specificity columns are caret's values with Y=0 as the positive class.

Initial Decision Tree
The initial decision tree resulted in a model with high accuracy (90%) and sensitivity (97%), but only moderate balanced accuracy (66%) and low specificity (34%). Since class imbalance was not adequately addressed by the initial decision tree, the experiments below focus on balancing the minority class of the dependent variable while also attempting to improve specificity.

Experiments
The following models/experiments were run:
1. SMOTE with dup_size=1 in an attempt to balance the dataset
2. SMOTE with dup_size=2 in another attempt to balance the dataset
3. Hyperparameter tuning #1 (cp=0.05) in an attempt to improve specificity
4. Hyperparameter tuning #2, which weighted the minority class in an attempt to improve specificity

EXPERIMENT 1: SMOTE
Since predicting class Y=1 is important, SMOTE was utilized to resample. SMOTE generates synthetic data using k-nearest neighbors of the minority class in order to overcome the majority-class imbalance. Below, SMOTE is used to generate additional rows in which Y=1 (roughly 4,200 synthetic rows at dup_size=1 and 8,400 at dup_size=2, per the class counts in the table above) and the decision trees are rerun. The expected result is an increase in specificity and a decrease in sensitivity. Overall accuracy is also expected to decrease, since synthetic data is being generated. Balanced accuracy, however, is expected to increase.
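
Conceptually, each synthetic row is a random point on the line segment between a minority-class observation and one of its K nearest minority-class neighbors. Below is a minimal sketch of that interpolation (toy feature vectors; smotefamily::SMOTE performs this internally across all numeric features):

set.seed(1)

# Two hypothetical minority-class observations
x        <- c(age = 35, balance = 1200, duration = 300)
neighbor <- c(age = 41, balance = 900,  duration = 420)

# SMOTE-style interpolation: slide a random fraction of the way to the neighbor
gap <- runif(1)                       # uniform draw in [0, 1]
synthetic <- x + gap * (neighbor - x)
synthetic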

SMOTE Hypothesis
Null Hypothesis: The use of SMOTE will not effectively balance the dataset and will not result in an improvement in specificity.
Alternate Hypothesis: The use of SMOTE will create a more balanced dataset, leading to an increase in specificity.

SMOTE Results
SMOTE created roughly 4,200 (dup_size=1) and 8,400 (dup_size=2) synthetic instances of the minority class (Y=1). SMOTE improved specificity (initial = 34% versus SMOTE #2 = 54%) and further balanced the minority class (initial balanced accuracy = 66% versus SMOTE #2 = 74%), with minimal impact on sensitivity and accuracy. Although specificity only increased from low to relatively moderate, we are still able to reject the null hypothesis.

EXPERIMENT 2: Hyperparameter Tuning
In an attempt to improve specificity and balance the data, hyperparameter tuning was utilized. Two experiments were run (a sketch of inspecting candidate cp values follows this list):
1.) The complexity parameter (cp) was set to 0.05.
2.) The decision tree's class priors were adjusted in favor of the minority class (Y=0: 40%; Y=1: 60%).
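
A larger cp requires each split to improve the tree's overall fit by at least that fraction before it is kept, so raising cp from 0.01 to 0.05 prunes the tree back to only its strongest splits. Below is a minimal sketch of inspecting candidate cp values, assuming a fitted rpart model named dt_model as in the code further down:

# Complexity table: tree size and cross-validated error at each candidate cp;
# a split is retained only if it reduces the relative error by at least cp
printcp(dt_model)

# Visual version: pick cp near the minimum of the cross-validated error curve
plotcp(dt_model)

# Prune an already-grown tree back to cp = 0.05 instead of refitting
dt_pruned <- prune(dt_model, cp = 0.05)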

Hyperparameter Tuning Hypothesis #1: Complexity Parameter Increase
Null Hypothesis: Increasing the complexity parameter will not increase specificity.
Alternate Hypothesis: Increasing the complexity parameter will increase specificity.

Hyperparameter Tuning #1 Results
As expected, increasing the complexity parameter increased specificity (SMOTE #2 = 54% versus HP tuning #1 = 59%) and slightly decreased accuracy and sensitivity. As such, the null hypothesis can be rejected: the complexity parameter increase led to increased specificity.

Hyperparameter Tuning Hypothesis #2: Increase the Minority Class Weight
Null Hypothesis: Increasing the weight of the minority class will not increase specificity.
Alternate Hypothesis: Increasing the weight of the minority class will increase specificity.

Hyperparameter Tuning #2 Results
Adjusting the class priors in favor of the minority class (Y=0: 40%; Y=1: 60%) significantly increased specificity (HP tuning #1 = 59% versus adjusted weights = 84%), while decreasing sensitivity (77%) and model accuracy (78%). Of note, balanced accuracy increased to 80%. As such, we are able to reject the null hypothesis: adjusting the weight of the minority class dramatically increased the decision tree's specificity.

Final Decision Tree Model
Among the models discussed, the final model, with adjusted weights for the minority class of the target variable, is the preferred model. It is important to note, however, that this model was trained on a dataset containing roughly 8,400 rows of synthetic data generated by SMOTE (dup_size=2) and had its class priors shifted in favor of the minority class. The complexity parameter increase (cp=0.05) was explored as a separate experiment; the final weighted model itself was fit with cp=0.01 (see the code below).

Visualization
Below is a visualization of the final decision tree.
Across the initial, SMOTE, and hyperparameter-tuned decision trees, the 'duration' feature was the most influential, followed by poutcome_success, with the remaining features contributing substantially less.

########################################## ONLY CODE BELOW###############################################################
set.seed(123)

#INITIAL DT MODEL!!!!!!!!!!!!!!!!!!

dt_model <- rpart(y_yes ~ ., data = trainData, method = "class", control = rpart.control(minsplit = 20, cp = 0.01))
rpart.plot(dt_model)

# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")

# Confusion matrix
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7733  706
##          1  239  364
##                                          
##                Accuracy : 0.8955         
##                  95% CI : (0.889, 0.9017)
##     No Information Rate : 0.8817         
##     P-Value [Acc > NIR] : 1.886e-05      
##                                          
##                   Kappa : 0.3825         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9700         
##             Specificity : 0.3402         
##          Pos Pred Value : 0.9163         
##          Neg Pred Value : 0.6036         
##              Prevalence : 0.8817         
##          Detection Rate : 0.8552         
##    Detection Prevalence : 0.9333         
##       Balanced Accuracy : 0.6551         
##                                          
##        'Positive' Class : 0              
## 
# Feature importance
print(dt_model$variable.importance)
##         duration poutcome_success  contact_unknown            pdays 
##     1110.9180804      685.8992109        3.9494818        0.8776626 
##         previous              age         campaign 
##        0.5831593        0.1443280        0.1443280
#SMOTE 1!!!!!!!!!!!!!!!!!!
set.seed(123)
trainData$y_yes <- as.factor(trainData$y_yes)

#SMOTE
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], trainData$y_yes, K = 5, dup_size = 1)

# Creation of new dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"  
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)

# New class distribution 
table(trainData_balanced$y_yes)
## 
##     0     1 
## 31950  8438
set.seed(123)
# Train decision tree
dt_model <- rpart(y_yes ~ ., data = trainData_balanced, method = "class", control = rpart.control(minsplit = 20, cp = 0.01))



# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")

# Confusion Matrix
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7563  573
##          1  409  497
##                                           
##                Accuracy : 0.8914          
##                  95% CI : (0.8848, 0.8977)
##     No Information Rate : 0.8817          
##     P-Value [Acc > NIR] : 0.001992        
##                                           
##                   Kappa : 0.4425          
##                                           
##  Mcnemar's Test P-Value : 1.976e-07       
##                                           
##             Sensitivity : 0.9487          
##             Specificity : 0.4645          
##          Pos Pred Value : 0.9296          
##          Neg Pred Value : 0.5486          
##              Prevalence : 0.8817          
##          Detection Rate : 0.8364          
##    Detection Prevalence : 0.8998          
##       Balanced Accuracy : 0.7066          
##                                           
##        'Positive' Class : 0               
## 
# Plot 
rpart.plot(dt_model, 
           type = 3,        
           extra = 104,     
           under = TRUE,    
           tweak = 1.2,     
           box.palette = "RdYlGn",  
           fallen.leaves = TRUE)  

#SMOTE 2!!!!!!!!!!!
set.seed(123)

trainData$y_yes <- as.factor(trainData$y_yes)

#SMOTE
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], trainData$y_yes, K = 5, dup_size = 2)

# new dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes" 
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)  

# class distribution
table(trainData_balanced$y_yes)
## 
##     0     1 
## 31950 12657
# Train dt
dt_model <- rpart(y_yes ~ ., data = trainData_balanced, method = "class", control = rpart.control(minsplit = 20, cp = 0.01))

# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")

# conf matrix
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7462  493
##          1  510  577
##                                           
##                Accuracy : 0.8891          
##                  95% CI : (0.8824, 0.8955)
##     No Information Rate : 0.8817          
##     P-Value [Acc > NIR] : 0.0146          
##                                           
##                   Kappa : 0.472           
##                                           
##  Mcnemar's Test P-Value : 0.6134          
##                                           
##             Sensitivity : 0.9360          
##             Specificity : 0.5393          
##          Pos Pred Value : 0.9380          
##          Neg Pred Value : 0.5308          
##              Prevalence : 0.8817          
##          Detection Rate : 0.8253          
##    Detection Prevalence : 0.8798          
##       Balanced Accuracy : 0.7376          
##                                           
##        'Positive' Class : 0               
## 
# Plot 
rpart.plot(dt_model, 
           type = 3,        
           extra = 104,     
           under = TRUE,    
           tweak = 1.2,    
           box.palette = "RdYlGn",  
           fallen.leaves = TRUE)  

#HYPERPARAMETER TUNING #1
#MODIFY COMPLEXITY PARAMETER
set.seed(123)

dt_model <- rpart(y_yes ~ ., data = trainData_balanced, method = "class",
                  control = rpart.control(minsplit = 20, cp = 0.05))
# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")

# confusion
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7272  442
##          1  700  628
##                                           
##                Accuracy : 0.8737          
##                  95% CI : (0.8667, 0.8805)
##     No Information Rate : 0.8817          
##     P-Value [Acc > NIR] : 0.9904          
##                                           
##                   Kappa : 0.4519          
##                                           
##  Mcnemar's Test P-Value : 2.849e-14       
##                                           
##             Sensitivity : 0.9122          
##             Specificity : 0.5869          
##          Pos Pred Value : 0.9427          
##          Neg Pred Value : 0.4729          
##              Prevalence : 0.8817          
##          Detection Rate : 0.8042          
##    Detection Prevalence : 0.8531          
##       Balanced Accuracy : 0.7496          
##                                           
##        'Positive' Class : 0               
## 
# Plot
rpart.plot(dt_model, 
           type = 3,       
           extra = 104,    
           under = TRUE,    
           tweak = 1.2,     
           box.palette = "RdYlGn",  
           fallen.leaves = TRUE)  

#HYPERPARAMETER TUNING #2 CHANGING WEIGHTS!!!!!!!!!!!
set.seed(123)

dt_model <- rpart(y_yes ~ ., 
                  data = trainData_balanced, 
                  method = "class",
                  parms = list(prior = c(0.4, 0.6)),  # shift class priors toward the minority class (Y=0: 0.4, Y=1: 0.6)
                  control = rpart.control(minsplit = 20, cp = 0.01))

# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")

# Confusion
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6121  169
##          1 1851  901
##                                           
##                Accuracy : 0.7766          
##                  95% CI : (0.7679, 0.7851)
##     No Information Rate : 0.8817          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3629          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7678          
##             Specificity : 0.8421          
##          Pos Pred Value : 0.9731          
##          Neg Pred Value : 0.3274          
##              Prevalence : 0.8817          
##          Detection Rate : 0.6770          
##    Detection Prevalence : 0.6956          
##       Balanced Accuracy : 0.8049          
##                                           
##        'Positive' Class : 0               
## 
# Plot DT
rpart.plot(dt_model, 
           type = 3,        
           extra = 104,    
           under = TRUE,    
           tweak = 1.2,     
           box.palette = "RdYlGn",  
           fallen.leaves = TRUE)  

Part III: Random Forest

| Model | Y=0 | Y=1 | TP | FN | FP | TN | Accuracy | Sensitivity | Specificity | Bal. Acc. |
|---|---|---|---|---|---|---|---|---|---|---|
| Initial Random Forest (RF) | 39822 | 5289 | 410 | 660 | 212 | 7760 | 0.9036 | 0.9734 | 0.3832 | 0.6783 |
| SMOTE #1 (K=5, dup_size=1) & RF | 31950 | 8438 | 445 | 625 | 234 | 7738 | 0.9050 | 0.9706 | 0.4159 | 0.6933 |
| SMOTE #2 (K=5, dup_size=2) & RF | 31950 | 12657 | 448 | 622 | 249 | 7723 | 0.9037 | 0.9688 | 0.4187 | 0.6937 |
| Hyperparameter tuning #1 (classwt=c(1,2)) | 31950 | 12657 | 530 | 540 | 371 | 7601 | 0.8992 | 0.9535 | 0.4953 | 0.7244 |
| Hyperparameter tuning #2 (classwt=c(1,3)) | 31950 | 12657 | 537 | 533 | 395 | 7577 | 0.8974 | 0.9505 | 0.5019 | 0.7262 |

Initial Random Forest
Comparable to the initial decision tree model, the initial random forest resulted in a model with high accuracy (90%) and sensitivity (97%), but only moderate balanced accuracy (68%) and low specificity (38%). Again, since class imbalance was not adequately addressed by the initial random forest, the experiments focus on balancing the minority class of the dependent variable while also attempting to improve specificity.

Experiments
The following models were run:
1.) Initial random forest
2.) Experiment 1 (SMOTE): SMOTE with dup_size=1 in an attempt to address class imbalance
3.) Experiment 1 (SMOTE): SMOTE with dup_size=2 in an attempt to address class imbalance
4.) Experiment 2 (hyperparameter tuning): increased weight for the minority class (classwt = c(1, 2))
5.) Experiment 2 (hyperparameter tuning): further increased weight for the minority class (classwt = c(1, 3))

Again, confusion matrices were used to assess the models. Specificity remains of the utmost importance, since the dependent variable suffers from considerable class imbalance (Y=0: ~89% versus Y=1: ~11%).

EXPERIMENT 1: SMOTE
Since predicting class Y=1 is important, SMOTE was again utilized to resample. SMOTE generates synthetic data using k-nearest neighbors of the minority class in order to overcome the majority-class imbalance. Below, SMOTE is used to generate additional rows in which Y=1 and the random forest models are rerun. The expected result is an increase in specificity and a decrease in sensitivity. Accuracy is also expected to decrease, since synthetic data is being generated.

SMOTE Hypothesis
Null Hypothesis: The use of SMOTE will not create a more balanced dataset and will not result in an improvement in specificity.
Alternate Hypothesis: The use of SMOTE will create a more balanced dataset, leading to an increase in specificity.

SMOTE Results
The initial model was a poor predictor of the minority class due to the class imbalance (balanced accuracy = 68%), with low specificity (38%), good sensitivity (97%), and good accuracy (90%), comparable to the initial decision tree model. Unlike the decision tree experiments, SMOTE did not improve specificity as much (initial = 38% versus SMOTE #2 = 42%), and the minority class was only slightly better balanced (balanced accuracy: SMOTE #2 = 69% versus initial = 68%), with minimal impact on sensitivity and accuracy. SMOTE was therefore not as effective in balancing the dataset for random forest as for decision trees. However, the null hypothesis can still be rejected, because SMOTE did improve specificity and the class balance.

EXPERIMENT 2: Hyperparameter Tuning
In an attempt to improve the class balance and specificity, hyperparameter tuning was utilized.
Hyperparameters were adjusted twice (a minimal weight-sweep sketch follows this list):
1.) Increased the weight of the minority class (classwt = c(1, 2))
2.) Further increased the weight of the minority class (classwt = c(1, 3))
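
Below is a minimal sketch of sweeping a few minority-class weights and recording test-set specificity. It assumes trainData_balanced and testData as defined in the code further down; the specific weights tried are illustrative:

set.seed(123)
for (w in c(1, 2, 3)) {
  rf <- randomForest(y_yes ~ ., data = trainData_balanced,
                     ntree = 100, classwt = c(1, w))
  preds <- predict(rf, newdata = testData)
  cm <- caret::confusionMatrix(preds, testData$y_yes)
  cat("classwt = c(1,", w, "): specificity =",
      round(cm$byClass["Specificity"], 4), "\n")
}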

Hyperparameter Tuning Hypothesis: Increase the Minority Class Weight
Null Hypothesis: Increasing the weight of the minority class will not increase specificity.
Alternate Hypothesis: Increasing the weight of the minority class will increase specificity.

Hyperparameter Tuning Results
The weighting improved specificity from 38% (initial model) to 50%, but this was still weaker than the final decision tree model (84%). As with the other models, the 'duration' feature was the most influential feature in the final random forest model, followed by poutcome_success. Although weighting increased balanced accuracy, the increase was not very large (weighted = 73%, SMOTE = 69%, initial model = 68%). We are still, however, able to reject the null hypothesis.

Final Random Forest Model
Of the random forest models above, the final model, with adjusted weights for the target-variable minority class, is favored.
It is important to note, however, that this model 1.) was trained on a dataset containing synthetic data generated by SMOTE and 2.) applied increased minority-class weights on top of that SMOTE-generated dataset to improve class balance and specificity. In other words, the final model combines SMOTE data with adjusted class weights.

########################################## ONLY CODE BELOW###############################################################

#initial RF
set.seed(123)
# clean column names: trim whitespace and make names syntactically valid
colnames(trainData) <- trimws(colnames(trainData))  
colnames(testData) <- trimws(colnames(testData))
colnames(trainData) <- make.names(colnames(trainData))  
colnames(testData) <- make.names(colnames(testData))

# Train RF model
rf_model <- randomForest(y_yes ~ ., data = trainData, ntree = 100, mtry = sqrt(ncol(trainData) - 1), importance = TRUE)

# Predictions 
rf_predictions <- predict(rf_model, newdata = testData, type = "class")

# confusion matrix
conf_matrix <- confusionMatrix(factor(rf_predictions), factor(testData$y_yes))
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7760  660
##          1  212  410
##                                           
##                Accuracy : 0.9036          
##                  95% CI : (0.8973, 0.9096)
##     No Information Rate : 0.8817          
##     P-Value [Acc > NIR] : 1.888e-11       
##                                           
##                   Kappa : 0.4355          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9734          
##             Specificity : 0.3832          
##          Pos Pred Value : 0.9216          
##          Neg Pred Value : 0.6592          
##              Prevalence : 0.8817          
##          Detection Rate : 0.8582          
##    Detection Prevalence : 0.9312          
##       Balanced Accuracy : 0.6783          
##                                           
##        'Positive' Class : 0               
## 

# Feature importance
print(importance(rf_model))
##                                0          1 MeanDecreaseAccuracy
## age                 19.256946455  9.8764595           23.2592877
## balance              2.207072030  4.9043589            5.2897120
## day                 25.674296666  2.4417613           25.9694724
## duration            55.268989858 99.1389350           87.8663890
## campaign             8.398792635  6.2730665           10.4628118
## pdays               13.484429807  9.9624594           15.2659462
## previous             6.884958037  5.9887055            7.4877267
## job_blue.collar      4.392803753  2.5880214            5.0773431
## job_entrepreneur     0.628089794  0.3553440            0.7493254
## job_housemaid        2.445014392 -2.3434614            0.8673570
## job_management       3.266437014  1.4669602            3.8311581
## job_retired          1.718451225  2.3962232            2.8974627
## job_self.employed    0.004851697 -0.5527589           -0.2594183
## job_services         2.159301219  0.1135077            1.9644682
## job_student          5.439205194  1.1827379            6.0150204
## job_technician       2.312954662  1.1110553            2.5486284
## job_unemployed       0.387435937 -1.4569323           -0.5483327
## job_unknown          1.953482526 -1.4909916            1.2639386
## marital_married      5.274191354  4.0475651            7.3539484
## marital_single       4.987173524  2.4868982            6.3379350
## education_secondary  0.477539883  0.5195440            0.8186026
## education_tertiary   4.851742773  3.3924724            5.7183920
## education_unknown   -0.102589961 -0.5151976           -0.3523125
## default_yes          0.641010985  2.6382661            2.1114791
## housing_yes         15.142366680 16.4947867           21.1485187
## loan_yes            -2.304014299 10.2540403            4.6931123
## contact_telephone    4.716429159  0.5906281            4.2907130
## contact_unknown     21.552401015  6.4277420           22.5738388
## month_aug           16.754644269 -1.7683791           16.7655979
## month_dec            7.838966414  4.6731467            8.8037796
## month_feb           14.438009607  0.2258086           14.4305664
## month_jan           15.735228042 -3.1363352           15.1970251
## month_jul           17.169187032  1.2876320           17.3416497
## month_jun           15.170957081 -1.3380552           15.3592556
## month_mar           15.402942706 21.1085927           21.2418963
## month_may           11.466287170  7.4654130           12.0426590
## month_nov           13.033046784 -1.2610798           12.9620166
## month_oct           15.183891145 12.0224645           17.6886436
## month_sep           12.746151446  7.8365523           13.5633357
## poutcome_other       2.659058836  2.9152952            3.1951152
## poutcome_success     5.011075672 40.7228659           18.3215721
## poutcome_unknown     7.298410344  3.8488444            7.2858040
##                     MeanDecreaseGini
## age                       584.081660
## balance                   572.136195
## day                       505.459817
## duration                 1847.481764
## campaign                  224.465591
## pdays                     294.629309
## previous                  147.859857
## job_blue.collar            54.948871
## job_entrepreneur           22.433592
## job_housemaid              21.489897
## job_management             68.102311
## job_retired                35.377496
## job_self.employed          28.797968
## job_services               41.295741
## job_student                33.514945
## job_technician             67.481750
## job_unemployed             30.835596
## job_unknown                 8.837568
## marital_married            73.618299
## marital_single             58.923226
## education_secondary        80.522398
## education_tertiary         76.049339
## education_unknown          35.477192
## default_yes                11.412613
## housing_yes               145.890260
## loan_yes                   62.238459
## contact_telephone          44.827778
## contact_unknown            94.706311
## month_aug                  64.774379
## month_dec                  27.218596
## month_feb                  59.358935
## month_jan                  39.782869
## month_jul                  64.712422
## month_jun                  73.708280
## month_mar                  91.195724
## month_may                  66.843562
## month_nov                  57.203945
## month_oct                  79.025797
## month_sep                  61.471510
## poutcome_other             26.397949
## poutcome_success          411.395694
## poutcome_unknown           46.885629
varImpPlot(rf_model)

set.seed(123)
#SMOTE #1

smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], trainData$y_yes, K = 5, dup_size = 1)

# Create new dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)  

# Train RF
rf_model <- randomForest(y_yes ~ ., data = trainData_balanced, ntree = 100, 
                         mtry = sqrt(ncol(trainData_balanced) - 1), importance = TRUE)

# Predictions
rf_predictions <- predict(rf_model, newdata = testData, type = "class")

# confusion
rf_conf_matrix <- confusionMatrix(rf_predictions, testData$y_yes)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7738  625
##          1  234  445
##                                          
##                Accuracy : 0.905          
##                  95% CI : (0.8988, 0.911)
##     No Information Rate : 0.8817         
##     P-Value [Acc > NIR] : 8.176e-13      
##                                          
##                   Kappa : 0.4592         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9706         
##             Specificity : 0.4159         
##          Pos Pred Value : 0.9253         
##          Neg Pred Value : 0.6554         
##              Prevalence : 0.8817         
##          Detection Rate : 0.8558         
##    Detection Prevalence : 0.9249         
##       Balanced Accuracy : 0.6933         
##                                          
##        'Positive' Class : 0              
## 

# Plot 
varImpPlot(rf_model)


table(trainData_balanced$y_yes)
## 
##     0     1 
## 31950  8438
#SMOTE 2
set.seed(123)
#  SMOTE
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], trainData$y_yes, K = 5, dup_size = 2)

# Create new dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)  

# Train RF
rf_model <- randomForest(y_yes ~ ., data = trainData_balanced, ntree = 100, 
                         mtry = sqrt(ncol(trainData_balanced) - 1), importance = TRUE)

# Predictions 
rf_predictions <- predict(rf_model, newdata = testData, type = "class")

# confusion
rf_conf_matrix <- confusionMatrix(rf_predictions, testData$y_yes)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7723  622
##          1  249  448
##                                           
##                Accuracy : 0.9037          
##                  95% CI : (0.8974, 0.9097)
##     No Information Rate : 0.8817          
##     P-Value [Acc > NIR] : 1.494e-11       
##                                           
##                   Kappa : 0.4563          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9688          
##             Specificity : 0.4187          
##          Pos Pred Value : 0.9255          
##          Neg Pred Value : 0.6428          
##              Prevalence : 0.8817          
##          Detection Rate : 0.8541          
##    Detection Prevalence : 0.9229          
##       Balanced Accuracy : 0.6937          
##                                           
##        'Positive' Class : 0               
## 

# Plot 
varImpPlot(rf_model)


table(trainData_balanced$y_yes)
## 
##     0     1 
## 31950 12657
#HYPERPARAMETER TUNING 1
set.seed(123)
rf_model <- randomForest(y_yes ~ ., 
                         data = trainData_balanced, 
                         ntree = 100, 
                         mtry = sqrt(ncol(trainData_balanced)), 
                         classwt = c(1, 2),  # upweight the minority class (Y=1)
                         importance = TRUE)

rf_predictions <- predict(rf_model, newdata = testData, type = "class")
rf_conf_matrix <- confusionMatrix(rf_predictions, testData$y_yes)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7601  540
##          1  371  530
##                                           
##                Accuracy : 0.8992          
##                  95% CI : (0.8929, 0.9054)
##     No Information Rate : 0.8817          
##     P-Value [Acc > NIR] : 6.665e-08       
##                                           
##                   Kappa : 0.4817          
##                                           
##  Mcnemar's Test P-Value : 2.605e-08       
##                                           
##             Sensitivity : 0.9535          
##             Specificity : 0.4953          
##          Pos Pred Value : 0.9337          
##          Neg Pred Value : 0.5882          
##              Prevalence : 0.8817          
##          Detection Rate : 0.8406          
##    Detection Prevalence : 0.9004          
##       Balanced Accuracy : 0.7244          
##                                           
##        'Positive' Class : 0               
## 
varImpPlot(rf_model)

#HYPERPARAMETER TUNING 2
set.seed(123)
rf_model <- randomForest(y_yes ~ ., 
                         data = trainData_balanced, 
                         ntree = 100, 
                         mtry = sqrt(ncol(trainData_balanced)), 
                         classwt = c(1, 3),  
                         importance = TRUE)

rf_predictions <- predict(rf_model, newdata = testData, type = "class")
rf_conf_matrix <- confusionMatrix(rf_predictions, testData$y_yes)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7577  533
##          1  395  537
##                                           
##                Accuracy : 0.8974          
##                  95% CI : (0.8909, 0.9035)
##     No Information Rate : 0.8817          
##     P-Value [Acc > NIR] : 1.326e-06       
##                                           
##                   Kappa : 0.4791          
##                                           
##  Mcnemar's Test P-Value : 6.884e-06       
##                                           
##             Sensitivity : 0.9505          
##             Specificity : 0.5019          
##          Pos Pred Value : 0.9343          
##          Neg Pred Value : 0.5762          
##              Prevalence : 0.8817          
##          Detection Rate : 0.8380          
##    Detection Prevalence : 0.8969          
##       Balanced Accuracy : 0.7262          
##                                           
##        'Positive' Class : 0               
## 
varImpPlot(rf_model)

Part IV: ADA Boost

| Model | Y=0 | Y=1 | TP | FN | FP | TN | Accuracy | Sensitivity | Specificity | Bal. Acc. |
|---|---|---|---|---|---|---|---|---|---|---|
| Initial AdaBoost (ADA) | 39822 | 5289 | 263 | 807 | 134 | 7838 | 0.8959 | 0.9832 | 0.2458 | 0.6145 |
| SMOTE #1 (K=5, dup_size=1) & ADA | 31950 | 8438 | 461 | 609 | 305 | 7667 | 0.8989 | 0.9617 | 0.4308 | 0.6963 |
| SMOTE #2 (K=5, dup_size=2) & ADA | 31950 | 12657 | 520 | 550 | 377 | 7595 | 0.8975 | 0.9527 | 0.4860 | 0.7193 |
| Hyperparameter tuning #1 (weights 4:1) | 31950 | 12657 | 507 | 563 | 333 | 7639 | 0.9009 | 0.9582 | 0.4738 | 0.7160 |
| Hyperparameter tuning #2 (weights 9:1) | 31950 | 12657 | 507 | 563 | 333 | 7639 | 0.9009 | 0.9582 | 0.4738 | 0.7160 |

Initial AdaBoost
Comparable to the initial decision tree and random forest models, the initial AdaBoost model had high accuracy (90%) and sensitivity (98%), but only moderate balanced accuracy (61%) and low specificity (25%). Of note, the initial AdaBoost specificity was roughly 10 percentage points lower than that of the initial decision tree and random forest models. Again, class imbalance was not adequately addressed by the initial AdaBoost model. As such, the experiments focus on balancing the minority class of the dependent variable while also attempting to improve specificity.

Experiments
The following models were run:
1.) Initial AdaBoost
2.) Experiment 1 (SMOTE): SMOTE with dup_size=1 in an attempt to address class imbalance
3.) Experiment 1 (SMOTE): SMOTE with dup_size=2 in an attempt to address class imbalance
4.) Experiment 2 (hyperparameter tuning): increased observation weight for the minority class (weights 4:1)
5.) Experiment 2 (hyperparameter tuning): further increased observation weight for the minority class (weights 9:1)

Again, confusion matrices were used to assess the models. Specificity remains of the utmost importance, since the dependent variable suffers from considerable class imbalance (Y=0: ~89% versus Y=1: ~11%) and the initial AdaBoost model failed to address this imbalance.

EXPERIMENT 1: SMOTE
Since predicting class Y=1 is important, SMOTE was again utilized to resample. SMOTE generates synthetic data using k-nearest neighbors of the minority class in order to overcome the majority-class imbalance. Below, SMOTE is used to generate additional rows in which Y=1 and the AdaBoost models are rerun. The expected result is an increase in specificity and a decrease in sensitivity. Accuracy is also expected to decrease, since synthetic data is being generated.

SMOTE Hypothesis
Null Hypothesis: The use of SMOTE will not effectively balance the dataset and will not result in an improvement in specificity.
Alternate Hypothesis: The use of SMOTE will create a more balanced dataset, leading to an increase in specificity.

SMOTE Results
SMOTE created roughly 4,200 (dup_size=1) and 8,400 (dup_size=2) synthetic instances of the minority class (Y=1). SMOTE improved specificity (initial = 25% versus SMOTE #2 = 49%) and the minority class was better balanced (initial AdaBoost balanced accuracy = 61% versus SMOTE #2 = 72%). Although sensitivity decreased with SMOTE, the decrease was negligible (~3%), and model accuracy remained essentially unchanged. As such, SMOTE was useful in balancing the class imbalance present in the dependent variable, and the null hypothesis can be rejected.

EXPERIMENT 2: Hyperparameter Tuning
In an attempt to improve specificity, hyperparameter tuning was utilized. Two experiments were run:
1.) Increased the observation weight of the minority class (weights 4:1)
2.) Further increased the observation weight of the minority class (weights 9:1)

Hyperparameter Tuning Hypothesis: Increase the Minority Class Weight
Null Hypothesis: Increasing the weight of the minority class will not increase specificity.
Alternate Hypothesis: Increasing the weight of the minority class will increase specificity.

Hyperparameter Tuning Results
The weight-adjusted hyperparameter tuning failed to improve specificity as expected; specificity in fact decreased slightly (~1% relative to SMOTE #2), making it ineffective for balancing the target variable. As such, for the AdaBoost model we cannot reject the null hypothesis. Of note, the second weighted model returned results identical to the first. This is consistent with the weights never actually being applied: in the code below, the weights vector tests y_yes against "MinorityClass", which matches neither factor level ("0"/"1"), so every observation weight is 1 in both runs.

Final AdaBoost Model
The SMOTE model was selected as the preferred AdaBoost model, since it significantly improved specificity (from 25% to 49%) while maintaining high accuracy and sensitivity. In contrast, weighting the minority class had no positive effect on specificity and produced identical results across different weights. Beyond the weights issue noted above, AdaBoost already reweights observations internally during boosting, which can make external class-weight adjustments less effective.
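
For context, below is a minimal sketch (toy vectors, illustrative numbers) of the observation reweighting AdaBoost.M1 performs internally on each boosting round, which is why it already concentrates on hard-to-classify rows such as the minority class:

# Toy setup: 6 observations; suppose the current stump misclassifies rows 2 and 5
w   <- rep(1/6, 6)                        # initial uniform observation weights
mis <- c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)

err   <- sum(w[mis])                      # weighted error of the weak learner
alpha <- 0.5 * log((1 - err) / err)       # learner's vote weight (Breiman's variant scales similarly)

w <- w * exp(alpha * ifelse(mis, 1, -1))  # upweight misclassified rows, downweight the rest
w <- w / sum(w)                           # renormalize to sum to 1
round(w, 3)
## 0.125 0.250 0.125 0.125 0.250 0.125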

########################################## ONLY CODE BELOW###############################################################
#initial ADA
set.seed(123)

trainData$y_yes <- factor(trainData$y_yes, levels = c(0, 1))
testData$y_yes <- factor(testData$y_yes, levels = c(0, 1))

# train control for boosting
train_control <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation

#  AdaBoost model 
ada_model <- train(
  y_yes ~ ., 
  data = trainData,
  method = "AdaBoost.M1",   # AdaBoost method in caret
  trControl = train_control,
  tuneGrid = expand.grid(mfinal = 100, maxdepth = 1, coeflearn = "Breiman")  # 100 weak learners (stumps)
)

# Predictions
ada_predictions <- predict(ada_model, newdata = testData)

# confusion
conf_matrix <- confusionMatrix(ada_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7838  807
##          1  134  263
##                                           
##                Accuracy : 0.8959          
##                  95% CI : (0.8895, 0.9022)
##     No Information Rate : 0.8817          
##     P-Value [Acc > NIR] : 1.04e-05        
##                                           
##                   Kappa : 0.3147          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9832          
##             Specificity : 0.2458          
##          Pos Pred Value : 0.9067          
##          Neg Pred Value : 0.6625          
##              Prevalence : 0.8817          
##          Detection Rate : 0.8668          
##    Detection Prevalence : 0.9561          
##       Balanced Accuracy : 0.6145          
##                                           
##        'Positive' Class : 0               
## 
# Feature importance
var_importance <- varImp(ada_model)
print(var_importance)
## AdaBoost.M1 variable importance
## 
##   only 20 most important variables shown (out of 42)
## 
##                    Overall
## duration          100.0000
## poutcome_success   30.4888
## contact_unknown    21.6696
## month_mar           5.6232
## housing_yes         5.4766
## pdays               2.6769
## age                 2.6350
## month_oct           2.2611
## month_may           1.5601
## balance             1.1913
## campaign            1.0611
## loan_yes            0.6342
## month_sep           0.5849
## marital_married     0.5319
## job_blue.collar     0.5197
## month_jul           0.4191
## month_nov           0.0000
## contact_telephone   0.0000
## month_feb           0.0000
## month_aug           0.0000
plot(var_importance)

#SMOTE #1
set.seed(123)

smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], trainData$y_yes, K = 5, dup_size = 1)
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)

#boost control
train_control <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation

# AdaBoost model 
ada_model <- train(
  y_yes ~ ., 
  data = trainData_balanced,
  method = "AdaBoost.M1",   # AdaBoost method in caret
  trControl = train_control,
  tuneGrid = expand.grid(mfinal = 100, maxdepth = 1, coeflearn = "Breiman")  # 100 weak learners (stumps)
)


ada_predictions <- predict(ada_model, newdata = testData)

# confusion
conf_matrix <- confusionMatrix(ada_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7667  609
##          1  305  461
##                                           
##                Accuracy : 0.8989          
##                  95% CI : (0.8925, 0.9051)
##     No Information Rate : 0.8817          
##     P-Value [Acc > NIR] : 1.158e-07       
##                                           
##                   Kappa : 0.4476          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9617          
##             Specificity : 0.4308          
##          Pos Pred Value : 0.9264          
##          Neg Pred Value : 0.6018          
##              Prevalence : 0.8817          
##          Detection Rate : 0.8479          
##    Detection Prevalence : 0.9153          
##       Balanced Accuracy : 0.6963          
##                                           
##        'Positive' Class : 0               
## 
# Feature importance
var_importance <- varImp(ada_model)
print(var_importance)
## AdaBoost.M1 variable importance
## 
##   only 20 most important variables shown (out of 42)
## 
##                      Overall
## duration            100.0000
## poutcome_success     47.4535
## contact_unknown      28.3437
## housing_yes          17.5030
## month_mar             6.0438
## marital_married       4.8515
## month_jul             3.8791
## month_may             3.8326
## education_secondary   3.6657
## education_tertiary    3.5538
## month_aug             2.8578
## month_oct             2.7972
## campaign              2.4225
## loan_yes              2.1898
## month_jan             1.7651
## month_nov             1.6598
## month_sep             1.4084
## pdays                 0.9504
## age                   0.6801
## job_blue.collar       0.5957
plot(var_importance)

table(trainData_balanced$y_yes)
## 
##     0     1 
## 31950  8438
#SMOTE 2
set.seed(123)

smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], trainData$y_yes, K = 5, dup_size = 2)
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)






# control for boost
train_control <- trainControl(method = "cv", number = 5) 

# daBoost model 
ada_model <- train(
  y_yes ~ ., 
  data = trainData_balanced,
  method = "AdaBoost.M1",   #
  trControl = train_control,
  tuneGrid = expand.grid(mfinal = 100, maxdepth = 1, coeflearn = "Breiman")  
)


ada_predictions <- predict(ada_model, newdata = testData)

# confusion
conf_matrix <- confusionMatrix(ada_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7595  550
##          1  377  520
##                                          
##                Accuracy : 0.8975         
##                  95% CI : (0.891, 0.9037)
##     No Information Rate : 0.8817         
##     P-Value [Acc > NIR] : 1.123e-06      
##                                          
##                   Kappa : 0.4717         
##                                          
##  Mcnemar's Test P-Value : 1.612e-08      
##                                          
##             Sensitivity : 0.9527         
##             Specificity : 0.4860         
##          Pos Pred Value : 0.9325         
##          Neg Pred Value : 0.5797         
##              Prevalence : 0.8817         
##          Detection Rate : 0.8400         
##    Detection Prevalence : 0.9008         
##       Balanced Accuracy : 0.7193         
##                                          
##        'Positive' Class : 0              
## 
# Feature importance
var_importance <- varImp(ada_model)
print(var_importance)
## AdaBoost.M1 variable importance
## 
##   only 20 most important variables shown (out of 42)
## 
##                      Overall
## duration            100.0000
## poutcome_success     48.2869
## contact_unknown      26.4591
## housing_yes          24.7824
## marital_married       9.9207
## education_secondary   9.1166
## month_mar             5.8022
## education_tertiary    5.1447
## month_aug             5.0308
## month_jul             4.9175
## month_may             4.5223
## month_oct             3.6235
## month_jan             3.5693
## loan_yes              2.8208
## campaign              2.8148
## marital_single        1.9306
## month_nov             1.8669
## job_retired           0.9932
## job_blue.collar       0.7114
## month_sep             0.5606
plot(var_importance)

table(trainData_balanced$y_yes)
## 
##     0     1 
## 31950 12657

#weight 1

#HYPERPARAMETER TUNING 1 WEIGHT
set.seed(123)


train_control <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation

ada_model <- train(
  y_yes ~ ., 
  data = trainData_balanced,
  method = "AdaBoost.M1",  
  trControl = train_control,
  weights = ifelse(trainData_balanced$y_yes == "MinorityClass", 4, 1),  # NOTE: y_yes has levels "0"/"1", so "MinorityClass" never matches and all weights are 1; to upweight the minority class, compare to "1"
  tuneGrid = expand.grid(mfinal = 100, maxdepth = 1, coeflearn = "Breiman")
)

ada_predictions <- predict(ada_model, newdata = testData)


conf_matrix <- confusionMatrix(ada_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7639  563
##          1  333  507
##                                          
##                Accuracy : 0.9009         
##                  95% CI : (0.8946, 0.907)
##     No Information Rate : 0.8817         
##     P-Value [Acc > NIR] : 3.583e-09      
##                                          
##                   Kappa : 0.4764         
##                                          
##  Mcnemar's Test P-Value : 2.004e-14      
##                                          
##             Sensitivity : 0.9582         
##             Specificity : 0.4738         
##          Pos Pred Value : 0.9314         
##          Neg Pred Value : 0.6036         
##              Prevalence : 0.8817         
##          Detection Rate : 0.8448         
##    Detection Prevalence : 0.9071         
##       Balanced Accuracy : 0.7160         
##                                          
##        'Positive' Class : 0              
## 
var_importance <- varImp(ada_model)
print(var_importance)
## AdaBoost.M1 variable importance
## 
##   only 20 most important variables shown (out of 42)
## 
##                      Overall
## duration            100.0000
## poutcome_success     52.5760
## contact_unknown      37.2597
## housing_yes          27.8386
## marital_married      13.4268
## education_secondary  10.3990
## month_mar             6.7924
## education_tertiary    5.6332
## month_jul             5.4873
## month_may             5.4068
## month_nov             5.4048
## month_aug             5.3463
## month_oct             4.5147
## loan_yes              3.5497
## campaign              3.1920
## month_jan             2.5875
## month_sep             1.4705
## marital_single        0.9854
## job_blue.collar       0.7762
## job_retired           0.7395
plot(var_importance)

#weight 2

#HYPERPARAMETER TUNING 2
set.seed(123)


train_control <- trainControl(method = "cv", number = 5)  


ada_model <- train(
  y_yes ~ ., 
  data = trainData_balanced,
  method = "AdaBoost.M1",  
  trControl = train_control,
  weights = ifelse(trainData_balanced$y_yes == "MinorityClass", 9, 1),  # as above, no rows match "MinorityClass", so all weights are again 1
  tuneGrid = expand.grid(mfinal = 100, maxdepth = 1, coeflearn = "Breiman")
)

ada_predictions <- predict(ada_model, newdata = testData)
conf_matrix <- confusionMatrix(ada_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7639  563
##          1  333  507
##                                          
##                Accuracy : 0.9009         
##                  95% CI : (0.8946, 0.907)
##     No Information Rate : 0.8817         
##     P-Value [Acc > NIR] : 3.583e-09      
##                                          
##                   Kappa : 0.4764         
##                                          
##  Mcnemar's Test P-Value : 2.004e-14      
##                                          
##             Sensitivity : 0.9582         
##             Specificity : 0.4738         
##          Pos Pred Value : 0.9314         
##          Neg Pred Value : 0.6036         
##              Prevalence : 0.8817         
##          Detection Rate : 0.8448         
##    Detection Prevalence : 0.9071         
##       Balanced Accuracy : 0.7160         
##                                          
##        'Positive' Class : 0              
## 
var_importance <- varImp(ada_model)
print(var_importance)
## AdaBoost.M1 variable importance
## 
##   only 20 most important variables shown (out of 42)
## 
##                      Overall
## duration            100.0000
## poutcome_success     52.5760
## contact_unknown      37.2597
## housing_yes          27.8386
## marital_married      13.4268
## education_secondary  10.3990
## month_mar             6.7924
## education_tertiary    5.6332
## month_jul             5.4873
## month_may             5.4068
## month_nov             5.4048
## month_aug             5.3463
## month_oct             4.5147
## loan_yes              3.5497
## campaign              3.1920
## month_jan             2.5875
## month_sep             1.4705
## marital_single        0.9854
## job_blue.collar       0.7762
## job_retired           0.7395
plot(var_importance)

PART V: Essay

| Model | Y=0 (train) | Y=1 (train) | TP | FN | FP | TN | Accuracy | Sensitivity | Specificity | Balanced Acc |
|---|---|---|---|---|---|---|---|---|---|---|
| **Decision Tree** | | | | | | | | | | |
| Initial Decision Tree (DT) | 39822 | 5289 | 364 | 706 | 239 | 7733 | 0.8955 | 0.9700 | 0.3402 | 0.6551 |
| SMOTE #1 (K=5, dup_size=1) & DT | 31950 | 8438 | 497 | 573 | 409 | 7563 | 0.8914 | 0.9487 | 0.4645 | 0.7066 |
| SMOTE #2 (K=5, dup_size=2) & DT | 31950 | 12657 | 577 | 493 | 510 | 7462 | 0.8891 | 0.9360 | 0.5393 | 0.7376 |
| Hyperparameter tuning #1 (cp=0.05) | 31950 | 12657 | 628 | 442 | 700 | 7272 | 0.8737 | 0.9122 | 0.5869 | 0.7496 |
| Hyperparameter tuning #2 (weight) | 31950 | 12657 | 901 | 169 | 1851 | 6121 | 0.7766 | 0.7678 | 0.8421 | 0.8049 |
| **Random Forest** | | | | | | | | | | |
| Initial Random Forest (RF) | 39822 | 5289 | 410 | 660 | 212 | 7760 | 0.9036 | 0.9734 | 0.3832 | 0.6783 |
| SMOTE #1 (K=5, dup_size=1) & RF | 31950 | 8438 | 445 | 625 | 234 | 7738 | 0.9050 | 0.9706 | 0.4159 | 0.6933 |
| SMOTE #2 (K=5, dup_size=2) & RF | 31950 | 12657 | 448 | 622 | 249 | 7723 | 0.9037 | 0.9688 | 0.4187 | 0.6937 |
| Hyperparameter tuning #1 (weight) | 31950 | 12657 | 530 | 540 | 371 | 7601 | 0.8992 | 0.9535 | 0.4953 | 0.7244 |
| Hyperparameter tuning #2 (weight) | 31950 | 12657 | 537 | 533 | 395 | 7577 | 0.8974 | 0.9505 | 0.5019 | 0.7319 |
| **AdaBoost** | | | | | | | | | | |
| Initial AdaBoost (ADA) | 39822 | 5289 | 236 | 807 | 134 | 7838 | 0.8959 | 0.9832 | 0.2458 | 0.6145 |
| SMOTE #1 (K=5, dup_size=1) & ADA | 31950 | 8438 | 461 | 609 | 305 | 7667 | 0.8989 | 0.9617 | 0.4308 | 0.6963 |
| SMOTE #2 (K=5, dup_size=2) & ADA | 31950 | 12657 | 520 | 550 | 377 | 7595 | 0.8975 | 0.9527 | 0.4860 | 0.7193 |
| Hyperparameter tuning #1 (weight 4:1) | 31950 | 12657 | 507 | 563 | 333 | 7639 | 0.9009 | 0.9582 | 0.4738 | 0.7160 |
| Hyperparameter tuning #2 (weight 9:1) | 31950 | 12657 | 507 | 563 | 333 | 7639 | 0.9009 | 0.9582 | 0.4738 | 0.7160 |

The overarching aim of this assignment was to address the class imbalance in the target variable's minority class and to increase specificity. The initial models produced high numbers of false negatives (type II errors) and low specificity, meaning they struggled to correctly identify the clients who did sign up for a term deposit.

The final decision tree model was the preferred model. It was built using:
1.) SMOTE to generate ~7,000 synthetic rows where Y=1
2.) a complexity parameter (cp) of 0.05
3.) an increased weight for the minority class of the target variable

In deciding which model was preferred, we assumed the business had unlimited resources and would therefore favor a model that maximizes true positives while minimizing false negatives. As a result, even though accuracy and sensitivity decreased compared to the initial model, the increase in true positives in the final decision tree expanded the pool of potential clients who could be contacted, increasing the likelihood of successful subscriptions.

SMOTE Results
The purpose of using SMOTE was to address the imbalance in the minority class of the target variable. Its impact varied across models:

  • Decision Trees: Balanced accuracy increased ~10%
  • AdaBoost: Balanced accuracy increased ~10%
  • Random Forest: Balanced accuracy increased by only ~2%

The varied results were expected, as SMOTE is more effective for models that rely on decision boundaries, such as decision trees and AdaBoost. Random forest, in contrast, is an ensemble method that already handles some class imbalance, so the impact of SMOTE was smaller: the imbalance was partially addressed by the initial model.

Use of SMOTE did have its drawbacks, however. Notably, SMOTE increased the variance of the decision tree and AdaBoost models, as indicated by improved specificity (decision tree: 34% to 54%; AdaBoost: 25% to 49%) alongside decreased or plateauing accuracy and sensitivity (decision tree accuracy: 89% to 88%; AdaBoost accuracy: unchanged at 89%; decision tree sensitivity: 97% to 93%; AdaBoost sensitivity: 98% to 95%). In general, for the decision tree and AdaBoost models, training on SMOTE data introduced possible overfitting and increased sensitivity to noise from the synthetic observations.
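For reference, below is a minimal sketch of how the SMOTE runs above could be reproduced with the smotefamily package (object names are illustrative, and all predictors are assumed to be numeric/one-hot encoded as in the earlier pre-processing):

library(smotefamily)

# Predictors must be supplied separately from the target
X <- trainData[, setdiff(names(trainData), "y_yes")]

# K = 5 nearest neighbors; dup_size controls how many synthetic
# minority rows are generated per original minority row
smote_out <- SMOTE(X = X, target = trainData$y_yes, K = 5, dup_size = 2)

trainData_balanced <- smote_out$data                      # predictors + new "class" column
trainData_balanced$y_yes <- factor(smote_out$data$class)  # restore the target as a factor
trainData_balanced$class <- NULL

table(trainData_balanced$y_yes)  # verify the new class balance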

Weight Results
Manipulating the weight of the minority class also had the end goal of balancing the minority class of the target variable. The effect of weighting varied across models (see the sketch after this list):

  • Decision Trees: Weighting decreased accuracy (78%) and sensitivity (77%), though both remained within acceptable ranges. However, weighting significantly increased specificity (84%) and improved balanced accuracy (80%). The drop in accuracy and sensitivity points to increased model variance due to overfitting the training data.

  • Random Forest: Weighting caused a minor decrease in accuracy and sensitivity (~1%), but specificity improved by ~10%. This increase in specificity was smaller than the ~16% specificity improvement in Decision Trees. The increase in specificity does, however, denote an increase in model variance.

  • AdaBoost: Weight adjustments had a minimal effect on accuracy, sensitivity, specificity, and balanced accuracy, since AdaBoost already reweights observations internally during boosting. As such, adjusting the class weights had negligible effects on variance and bias.
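As referenced above, here is a minimal sketch of the weighting mechanism using rpart (illustrative object names; the key detail is that the comparison must use the actual minority-class factor level, "1"):

library(rpart)

# Case-weight vector: rows of the minority class ("1") count 4x as much.
# A label that matches no factor level would silently leave every weight at 1.
w <- ifelse(trainData_balanced$y_yes == "1", 4, 1)

dt_weighted <- rpart(
  y_yes ~ .,
  data    = trainData_balanced,
  weights = w,                        # rpart accepts per-row case weights
  control = rpart.control(cp = 0.05)
)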

Final Model Interpreted
Below, the final decision tree model is visualized and its features are interpreted:

Duration: The duration of the call was the strongest predictor of successful subscription. The first split occurred at 204 seconds, meaning calls longer than 203 seconds were strongly associated with successful subscriptions.

Other significant features
Poutcome Success: If the poutcome_success indicator was less than 0.0014 (i.e., the previous campaign outcome was not a success), the client was less likely to subscribe (94% probability of not subscribing, covering 25% of total cases).

Housing Loan Status: Clients without a housing loan and with a call duration of more than 472 seconds had an 88% probability of subscribing (26% of total cases), while those with a housing loan were not as likely to subscribe.

Contact Method & Call Duration: If the contact method is unknown and the call duration exceeds 497 seconds, the likelihood of subscription increases to 72% (8% of cases), while shorter calls have a higher probability of non-subscription (94%).
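These split-by-split readings can be double-checked by printing the fitted tree's rules directly, as in this sketch (rpart.rules() comes from the rpart.plot package already used below):

library(rpart.plot)

# Print each leaf as a human-readable rule, with the predicted probability
# and (via cover) the percentage of cases each rule covers
rpart.rules(dt_model, roundint = FALSE, cover = TRUE)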

Recommendation for Data Scientists
When dealing with such data, recommendations for data scientists include:
1. Use SMOTE with caution with decision tree and AdaBoost models, since both model types rely heavily on decision boundaries.
2. There is no need to adjust weights for class imbalance of the target variable in AdaBoost models, since AdaBoost already adjusts observation weights internally.
3. Use balanced accuracy as a metric for imbalanced data sets (see the sketch below).
4. The experiments conducted here were just the tip of the iceberg. The results of this analysis are in no way conclusive, as the preferred model likely suffers from overfitting, among other problems.
5. Bias-variance (and, correspondingly, sensitivity-specificity) will always be a trade-off.
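A minimal sketch of recommendation 3, computing balanced accuracy from the caret confusion matrix produced above:

# Balanced accuracy is the mean of sensitivity and specificity, so it is
# not inflated by a dominant majority class the way raw accuracy is
sens <- conf_matrix$byClass["Sensitivity"]
spec <- conf_matrix$byClass["Specificity"]
(sens + spec) / 2

# caret also reports it directly
conf_matrix$byClass["Balanced Accuracy"]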

Recommendation for Business Problem
To increase the likelihood of securing client subscriptions, the business should prioritize longer calls, especially with clients whose previous campaign outcome was successful and who do not have housing loans. Additionally, since the final model had high specificity and high balanced accuracy, it should be used to prioritize daily work lists for client outreach.
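As a sketch of how the model could drive such a work list (new_clients is a hypothetical data frame of not-yet-contacted clients, pre-processed the same way as the training data):

# Score each client with the model's probability of subscribing ("1"),
# then rank the outreach list from most to least promising
scores <- predict(dt_model, newdata = new_clients, type = "prob")[, "1"]
work_list <- new_clients[order(scores, decreasing = TRUE), ]
head(work_list, 50)  # e.g., the top 50 clients to call today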

rpart.plot(dt_model, 
           type = 3,                 # separate split labels on the left and right branches
           extra = 104,              # show class probabilities plus % of observations
           under = TRUE,             # put that text under the node boxes
           tweak = 1.2,              # enlarge the text slightly
           box.palette = "RdYlGn",   # color leaves from red ("no") to green ("yes")
           fallen.leaves = TRUE)     # align all leaves at the bottom of the plot
## Warning: Cannot retrieve the data used to build the model (model.frame: object 'job_blue-collar' not found).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.
