Assignment 2 D622

This assignment consists of conducting at least two (2) experiments for different algorithms: Decision Trees, Random Forest and Adaboost. That is, at least six (6) experiments in total (3 algorithms x 2 experiments each). For each experiment you will define what you are trying to achieve (before each run), conduct the experiment, and at the end you will review how your experiment went. These experiments will allow you to compare algorithms and choose the optimal model.

The Dataset for this assignment

A Portuguese bank conducted a marketing campaign (phone calls) to predict if a client will subscribe to a term deposit The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and figure out most effective tactics that will help the bank in next campaign to persuade more customers to subscribe to the bank’s term deposit. Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing

Preparing data

# Load required libraries
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.4.3

## Warning: package 'ggplot2' was built under R version 4.4.3

## Warning: package 'tidyr' was built under R version 4.4.2

## Warning: package 'readr' was built under R version 4.4.2

## Warning: package 'purrr' was built under R version 4.4.3

## Warning: package 'dplyr' was built under R version 4.4.2

## Warning: package 'stringr' was built under R version 4.4.2

## Warning: package 'lubridate' was built under R version 4.4.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.4.3

## corrplot 0.95 loaded

library(caret)

## Warning: package 'caret' was built under R version 4.4.3

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(GGally)

## Warning: package 'GGally' was built under R version 4.4.2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(dplyr)
library(ggplot2)
library(randomForest)

## Warning: package 'randomForest' was built under R version 4.4.3

## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

library(ada)  # For AdaBoost

## Warning: package 'ada' was built under R version 4.4.3

## Loading required package: rpart

## Warning: package 'rpart' was built under R version 4.4.3

library(rpart)  # For Decision Trees

# --- Load and Prepare Data ---

# Load the dataset (adjust the path to your local file)
bank_data <- read.csv("C:/Users/Dell/Downloads/bank-full.csv", sep = ";")

# View the structure of the dataset
str(bank_data)

## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
##  $ marital  : chr  "married" "single" "married" "married" ...
##  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : chr  "yes" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "yes" "no" ...
##  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : chr  "may" "may" "may" "may" ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

# Summarize the dataset
summary(bank_data)

##       age            job              marital           education        
##  Min.   :18.00   Length:45211       Length:45211       Length:45211      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.94                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##    default             balance         housing              loan          
##  Length:45211       Min.   : -8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:    72   Class :character   Class :character  
##  Mode  :character   Median :   448   Mode  :character   Mode  :character  
##                     Mean   :  1362                                        
##                     3rd Qu.:  1428                                        
##                     Max.   :102127                                        
##    contact               day           month              duration     
##  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
##  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
##  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
##                     Mean   :15.81                      Mean   : 258.2  
##                     3rd Qu.:21.00                      3rd Qu.: 319.0  
##                     Max.   :31.00                      Max.   :4918.0  
##     campaign          pdays          previous          poutcome        
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
##  Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
##       y            
##  Length:45211      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

# Check for missing values
colSums(is.na(bank_data))

##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0

# Replace "unknown" with NA for easier handling
bank_data <- bank_data %>%
  mutate(across(where(is.character), ~na_if(., "unknown")))

# Verify the presence of NA values
colSums(is.na(bank_data))

##       age       job   marital education   default   balance   housing      loan 
##         0       288         0      1857         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##     13020         0         0         0         0         0         0     36959 
##         y 
##         0

# Ensure 'y' is a factor
bank_data$y <- factor(bank_data$y, levels = c("yes", "no"))

# --- Train-Test Split ---

# One-hot encoding for categorical variables
bank_data_encoded <- dummyVars("~ .", data = bank_data %>% select(-y)) %>%
  predict(newdata = bank_data %>% select(-y)) %>%
  as.data.frame()

# Add target variable back to the encoded dataset
bank_data_encoded$y <- bank_data$y

# Train-test split (80%-20%)
set.seed(42)  # For reproducibility
train_index <- createDataPartition(bank_data_encoded$y, p = 0.8, list = FALSE)

# Split the dataset into training and testing sets
train_data <- bank_data_encoded[train_index, ]
test_data <- bank_data_encoded[-train_index, ]

# Verify class distribution in training data
table(train_data$y)

## 
##   yes    no 
##  4232 31938

# --- Experiment Code ---

# Define training control for all models
control <- trainControl(method = "cv", number = 5, savePredictions = "final")

#colSums(is.na(train_data))
names(train_data)

##  [1] "age"                "jobadmin."          "jobblue-collar"    
##  [4] "jobentrepreneur"    "jobhousemaid"       "jobmanagement"     
##  [7] "jobretired"         "jobself-employed"   "jobservices"       
## [10] "jobstudent"         "jobtechnician"      "jobunemployed"     
## [13] "maritaldivorced"    "maritalmarried"     "maritalsingle"     
## [16] "educationprimary"   "educationsecondary" "educationtertiary" 
## [19] "defaultno"          "defaultyes"         "balance"           
## [22] "housingno"          "housingyes"         "loanno"            
## [25] "loanyes"            "contactcellular"    "contacttelephone"  
## [28] "day"                "monthapr"           "monthaug"          
## [31] "monthdec"           "monthfeb"           "monthjan"          
## [34] "monthjul"           "monthjun"           "monthmar"          
## [37] "monthmay"           "monthnov"           "monthoct"          
## [40] "monthsep"           "duration"           "campaign"          
## [43] "pdays"              "previous"           "poutcomefailure"   
## [46] "poutcomeother"      "poutcomesuccess"    "y"

# --- Clear Environment ---
rm(list = ls())
gc()

##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2416522 129.1    3990517 213.2  3990517 213.2
## Vcells 4127868  31.5   15608685 119.1 15579109 118.9

# --- Reload and Preprocess Dataset ---
bank_data <- read.csv("C:/Users/Dell/Downloads/bank-full.csv", sep = ";")
bank_data <- bank_data %>%
  mutate(across(where(is.character), ~na_if(., "unknown")))

dummy_vars <- dummyVars("~ .", data = bank_data %>% select(-y))
encoded_data <- predict(dummy_vars, newdata = bank_data %>% select(-y)) %>% as.data.frame()
encoded_data$y <- factor(bank_data$y, levels = c("yes", "no"))

# --- Train-Test Split ---
set.seed(42)
train_index <- createDataPartition(encoded_data$y, p = 0.8, list = FALSE)
train_data <- encoded_data[train_index, ]
test_data <- encoded_data[-train_index, ]

# --- Verify Column Names ---
print(names(train_data))

##  [1] "age"                "jobadmin."          "jobblue-collar"    
##  [4] "jobentrepreneur"    "jobhousemaid"       "jobmanagement"     
##  [7] "jobretired"         "jobself-employed"   "jobservices"       
## [10] "jobstudent"         "jobtechnician"      "jobunemployed"     
## [13] "maritaldivorced"    "maritalmarried"     "maritalsingle"     
## [16] "educationprimary"   "educationsecondary" "educationtertiary" 
## [19] "defaultno"          "defaultyes"         "balance"           
## [22] "housingno"          "housingyes"         "loanno"            
## [25] "loanyes"            "contactcellular"    "contacttelephone"  
## [28] "day"                "monthapr"           "monthaug"          
## [31] "monthdec"           "monthfeb"           "monthjan"          
## [34] "monthjul"           "monthjun"           "monthmar"          
## [37] "monthmay"           "monthnov"           "monthoct"          
## [40] "monthsep"           "duration"           "campaign"          
## [43] "pdays"              "previous"           "poutcomefailure"   
## [46] "poutcomeother"      "poutcomesuccess"    "y"

print(names(test_data))

##  [1] "age"                "jobadmin."          "jobblue-collar"    
##  [4] "jobentrepreneur"    "jobhousemaid"       "jobmanagement"     
##  [7] "jobretired"         "jobself-employed"   "jobservices"       
## [10] "jobstudent"         "jobtechnician"      "jobunemployed"     
## [13] "maritaldivorced"    "maritalmarried"     "maritalsingle"     
## [16] "educationprimary"   "educationsecondary" "educationtertiary" 
## [19] "defaultno"          "defaultyes"         "balance"           
## [22] "housingno"          "housingyes"         "loanno"            
## [25] "loanyes"            "contactcellular"    "contacttelephone"  
## [28] "day"                "monthapr"           "monthaug"          
## [31] "monthdec"           "monthfeb"           "monthjan"          
## [34] "monthjul"           "monthjun"           "monthmar"          
## [37] "monthmay"           "monthnov"           "monthoct"          
## [40] "monthsep"           "duration"           "campaign"          
## [43] "pdays"              "previous"           "poutcomefailure"   
## [46] "poutcomeother"      "poutcomesuccess"    "y"

# --- Explicit Formula ---
formula_dt <- y ~ age + jobadmin. + jobblue-collar + jobmanagement + jobretired +
              jobservices + jobtechnician + maritaldivorced + maritalmarried +
              maritalsingle + educationprimary + educationsecondary +
              educationtertiary + balance + housingno + housingyes + loanno +
              loanyes + contactcellular + contacttelephone + day + monthapr +
              monthaug + monthfeb + monthjul + monthjun + monthmay + monthnov +
              duration + campaign + previous + poutcomefailure + poutcomeother +
              poutcomesuccess

# --- Train Decision Tree Model ---
control <- trainControl(method = "cv", number = 5, savePredictions = "final")

#dt_model_baseline <- train(formula_dt, data = train_data, method = "rpart", trControl = control)

Decision Tree Model

# Load required libraries
library(tidyverse)
library(caret)
library(rpart)  # Decision Trees
library(randomForest)  # Random Forest
library(ada)  # AdaBoost

# --- Load and Prepare Data ---
bank_data <- read.csv("C:/Users/Dell/Downloads/bank-full.csv", sep = ";")

# Convert "unknown" values to NA
bank_data <- bank_data %>%
  mutate(across(where(is.character), ~na_if(., "unknown")))

# Ensure 'y' is a factor
bank_data$y <- factor(bank_data$y, levels = c("yes", "no"))

# One-hot encoding (excluding y)
bank_data_encoded <- dummyVars("~ .", data = bank_data %>% select(-y), fullRank = TRUE) %>%
  predict(newdata = bank_data %>% select(-y)) %>%
  as.data.frame()

# Add target variable back
bank_data_encoded$y <- bank_data$y

# --- Train-Test Split ---
set.seed(42)  # For reproducibility
train_index <- createDataPartition(bank_data_encoded$y, p = 0.8, list = FALSE)

train_data <- bank_data_encoded[train_index, ]
test_data <- bank_data_encoded[-train_index, ]

# Handle missing values
preProcess_missingdata <- preProcess(train_data, method = "medianImpute")
train_data <- predict(preProcess_missingdata, train_data)

# Ensure 'y' is a factor
train_data$y <- as.factor(train_data$y)

# Confirm no NA values in 'y'
print(sum(is.na(train_data$y)))  # Should be 0

## [1] 0

# Train Decision Tree
dt_model_baseline <- train(
  x = train_data %>% select(-y), 
  y = train_data$y, 
  method = "rpart", 
  trControl = control
)

# Evaluate Decision Tree Baseline
predictions_dt <- predict(dt_model_baseline, test_data)
print(confusionMatrix(predictions_dt, test_data$y))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes  324  168
##        no   733 7816
##                                          
##                Accuracy : 0.9003         
##                  95% CI : (0.894, 0.9064)
##     No Information Rate : 0.8831         
##     P-Value [Acc > NIR] : 9.899e-08      
##                                          
##                   Kappa : 0.3717         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.30653        
##             Specificity : 0.97896        
##          Pos Pred Value : 0.65854        
##          Neg Pred Value : 0.91426        
##              Prevalence : 0.11691        
##          Detection Rate : 0.03584        
##    Detection Prevalence : 0.05442        
##       Balanced Accuracy : 0.64274        
##                                          
##        'Positive' Class : yes            
##

#Interpretation Decision Tree model

The Decision Tree model, as evidenced by the provided confusion matrix and statistics, demonstrates a high overall accuracy of 90.03%, significantly exceeding the “No Information Rate” of 88.31%. This indicates the model performs better than simply predicting the majority class. However, a deeper examination of the metrics reveals a notable imbalance in its predictive capabilities. The model exhibits high specificity, at 97.9%, showcasing its strong ability to correctly identify “no” cases (non-subscribers). This suggests the model is conservative in its predictions, minimizing false positives.

Conversely, the model’s sensitivity, or recall, is notably low at 30.65%. This indicates that the model only captures approximately 30.65% of the actual “yes” cases (subscribers), meaning it misses a significant portion of potential subscribers. The precision, or positive predictive value, is 65.85%, suggesting that when the model predicts “yes,” it is correct roughly 66% of the time. The Kappa statistic, at 0.37, signifies a moderate level of agreement beyond chance, but is not exceptionally strong. The low detection rate, at 3.58%, reinforces the model’s limited ability to identify true positive cases. In practical terms, this Decision Tree model prioritizes avoiding false positives at the cost of missing a substantial number of potential subscribers, resulting in a conservative approach that may not be optimal for maximizing revenue opportunities.

# Load required libraries
library(tidyverse)
library(caret)
library(doParallel)

## Warning: package 'doParallel' was built under R version 4.4.3

## Loading required package: foreach

## Warning: package 'foreach' was built under R version 4.4.2

## 
## Attaching package: 'foreach'

## The following objects are masked from 'package:purrr':
## 
##     accumulate, when

## Loading required package: iterators

## Warning: package 'iterators' was built under R version 4.4.2

## Loading required package: parallel

# --- Load the Dataset ---
bank_data <- read.csv("C:/Users/Dell/Downloads/bank-full.csv", sep = ";")

# --- Replace "unknown" with NA ---
bank_data <- bank_data %>%
  mutate(across(where(is.character), ~na_if(., "unknown")))

# Ensure target variable 'y' is a factor
bank_data$y <- factor(bank_data$y, levels = c("yes", "no"))

# --- Perform One-Hot Encoding for Categorical Variables ---
dummy_vars <- dummyVars("~ .", data = bank_data %>% select(-y), fullRank = TRUE)
bank_data_encoded <- predict(dummy_vars, newdata = bank_data %>% select(-y)) %>% as.data.frame()

# Add the target variable back to the dataset
bank_data_encoded$y <- bank_data$y

# --- Train-Test Split ---
set.seed(42)  # For reproducibility
train_index <- createDataPartition(bank_data_encoded$y, p = 0.8, list = FALSE)
train_data <- bank_data_encoded[train_index, ]
test_data <- bank_data_encoded[-train_index, ]

# --- Handle Missing Values ---
# Impute missing values in train_data
preProcess_missing <- preProcess(train_data, method = "medianImpute")
train_data <- predict(preProcess_missing, train_data)
test_data <- predict(preProcess_missing, test_data)

# Align test data structure with train data
test_data <- test_data[, colnames(train_data)]

# Confirm No Missing Values Remain
stopifnot(sum(is.na(train_data)) == 0)
stopifnot(sum(is.na(test_data)) == 0)

# --- Random Forest Model ---
# Set up parallel processing
control <- trainControl(method = "cv", number = 3, allowParallel = TRUE)

# Experiment 1: Baseline Random Forest
rf_model_baseline <- train(
  x = train_data %>% select(-y),
  y = train_data$y,
  method = "rf",
  trControl = control,
  ntree = 100  # Reduced number of trees
)

# Experiment 2: Random Forest with Hyperparameter Tuning
rf_tune_grid <- expand.grid(mtry = c(2, 5))  # Limit search space for faster tuning
rf_model_tuned <- train(
  x = train_data %>% select(-y),
  y = train_data$y,
  method = "rf",
  trControl = control,
  tuneGrid = rf_tune_grid,
  ntree = 100
)

# Evaluate the Baseline Model
predictions_rf <- predict(rf_model_baseline, test_data)
print(confusionMatrix(predictions_rf, test_data$y))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes  458  240
##        no   599 7744
##                                          
##                Accuracy : 0.9072         
##                  95% CI : (0.901, 0.9131)
##     No Information Rate : 0.8831         
##     P-Value [Acc > NIR] : 1.006e-13      
##                                          
##                   Kappa : 0.4729         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.43330        
##             Specificity : 0.96994        
##          Pos Pred Value : 0.65616        
##          Neg Pred Value : 0.92820        
##              Prevalence : 0.11691        
##          Detection Rate : 0.05066        
##    Detection Prevalence : 0.07720        
##       Balanced Accuracy : 0.70162        
##                                          
##        'Positive' Class : yes            
##

# Evaluate the Tuned Model
predictions_rf_tuned <- predict(rf_model_tuned, test_data)
print(confusionMatrix(predictions_rf_tuned, test_data$y))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes  388  155
##        no   669 7829
##                                           
##                Accuracy : 0.9089          
##                  95% CI : (0.9027, 0.9147)
##     No Information Rate : 0.8831          
##     P-Value [Acc > NIR] : 1.714e-15       
##                                           
##                   Kappa : 0.4406          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.36708         
##             Specificity : 0.98059         
##          Pos Pred Value : 0.71455         
##          Neg Pred Value : 0.92128         
##              Prevalence : 0.11691         
##          Detection Rate : 0.04292         
##    Detection Prevalence : 0.06006         
##       Balanced Accuracy : 0.67383         
##                                           
##        'Positive' Class : yes             
##

Interpretation Random Forest Model

This model demonstrates a notable ability to classify non-subscribers with high accuracy, as evidenced by its specificity nearing 97-98%. This indicates a strong propensity to avoid false positives, effectively minimizing the risk of misclassifying individuals who are unlikely to subscribe. Consequently, the model’s precision, ranging from 65-71%, suggests that when it does predict a potential subscriber, it’s generally accurate. This capability is valuable for targeted marketing campaigns where minimizing wasted resources is crucial.

However, the model exhibits a significant limitation in its sensitivity, capturing only 36-43% of actual subscribers. This means a substantial portion of potential customers are missed, which could translate to lost revenue opportunities. While the model excels at avoiding false positives, its conservative nature results in a low detection rate, identifying only a fraction of the true positive cases. The balanced accuracy, averaging around 67-70%, reflects this trade-off between sensitivity and specificity. Ultimately, the Random Forest model prioritizes minimizing false positives over maximizing true positives, making it a suitable choice when the cost of contacting a non-subscriber is high, but potentially limiting its effectiveness in capturing all potential subscribers.

# First, clean all column names to replace special characters
clean_names <- function(df) {
  names(df) <- gsub("-", ".", names(df))  # Replace hyphens with periods
  names(df) <- gsub("\\+", ".", names(df)) # Replace + with periods if any
  return(df)
}

# Apply to both train and test data
train_data_clean <- clean_names(train_data)
test_data_clean <- clean_names(test_data)

# Verify the problematic column is now properly named
print(names(train_data_clean))  # Should show "jobblue.collar" instead of "jobblue-collar"

##  [1] "age"                "jobblue.collar"     "jobentrepreneur"   
##  [4] "jobhousemaid"       "jobmanagement"      "jobretired"        
##  [7] "jobself.employed"   "jobservices"        "jobstudent"        
## [10] "jobtechnician"      "jobunemployed"      "maritalmarried"    
## [13] "maritalsingle"      "educationsecondary" "educationtertiary" 
## [16] "defaultyes"         "balance"            "housingyes"        
## [19] "loanyes"            "contacttelephone"   "day"               
## [22] "monthaug"           "monthdec"           "monthfeb"          
## [25] "monthjan"           "monthjul"           "monthjun"          
## [28] "monthmar"           "monthmay"           "monthnov"          
## [31] "monthoct"           "monthsep"           "duration"          
## [34] "campaign"           "pdays"              "previous"          
## [37] "poutcomeother"      "poutcomesuccess"    "y"

#Experiment 3.1: Baseline AdaBoost

# AdaBoost Baseline with cleaned column names
ada_model_baseline <- train(
  x = train_data_clean %>% select(-y),
  y = train_data_clean$y,
  method = "ada",
  trControl = trainControl(method = "cv", number = 3),
  metric = "Accuracy"
)

# Evaluate
predictions_ada_base <- predict(ada_model_baseline, test_data_clean)
print(confusionMatrix(predictions_ada_base, test_data_clean$y))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes 1028 7984
##        no    29    0
##                                           
##                Accuracy : 0.1137          
##                  95% CI : (0.1072, 0.1204)
##     No Information Rate : 0.8831          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0064         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9726          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.1141          
##          Neg Pred Value : 0.0000          
##              Prevalence : 0.1169          
##          Detection Rate : 0.1137          
##    Detection Prevalence : 0.9968          
##       Balanced Accuracy : 0.4863          
##                                           
##        'Positive' Class : yes             
##

# Review
print("AdaBoost Baseline Results:")

## [1] "AdaBoost Baseline Results:"

print(ada_model_baseline)

## Boosted Classification Trees 
## 
## 36170 samples
##    38 predictor
##     2 classes: 'yes', 'no' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 24114, 24113, 24113 
## Resampling results across tuning parameters:
## 
##   maxdepth  iter  Accuracy   Kappa       
##   1          50   0.1149848  -0.005058857
##   1         100   0.1121648  -0.012986847
##   1         150   0.1109483  -0.017621001
##   2          50   0.1048383  -0.046576962
##   2         100   0.1042024  -0.050504568
##   2         150   0.1036771  -0.055392382
##   3          50   0.1024053  -0.061199254
##   3         100   0.1020182  -0.065391244
##   3         150   0.1013547  -0.069239112
## 
## Tuning parameter 'nu' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were iter = 50, maxdepth = 1 and nu = 0.1.

varImp(ada_model_baseline)

## ROC curve variable importance
## 
##   only 20 most important variables shown (out of 38)
## 
##                    Importance
## duration              100.000
## housingyes             34.188
## previous               33.297
## pdays                  30.266
## balance                29.766
## poutcomesuccess        27.795
## monthmay               23.621
## campaign               23.424
## educationtertiary      15.211
## jobblue.collar         15.023
## maritalmarried         14.315
## maritalsingle          14.094
## loanyes                12.336
## day                     9.652
## educationsecondary      8.360
## jobretired              8.248
## monthoct                7.994
## jobmanagement           6.578
## monthsep                6.535
## monthmar                6.118

#Experiment 3.2: Tuned AdaBoost

# AdaBoost Tuning with cleaned column names
ada_tune_grid <- expand.grid(
  iter = c(50, 100, 150),
  maxdepth = c(3, 5),
  nu = 0.1
)

ada_model_tuned <- train(
  x = train_data_clean %>% select(-y),
  y = train_data_clean$y,
  method = "ada",
  trControl = trainControl(method = "cv", number = 3),
  tuneGrid = ada_tune_grid,
  metric = "Accuracy"
)

# Evaluate
predictions_ada_tuned <- predict(ada_model_tuned, test_data_clean)
print(confusionMatrix(predictions_ada_tuned, test_data_clean$y))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes  767 7859
##        no   290  125
##                                          
##                Accuracy : 0.0987         
##                  95% CI : (0.0926, 0.105)
##     No Information Rate : 0.8831         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : -0.063         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.72564        
##             Specificity : 0.01566        
##          Pos Pred Value : 0.08892        
##          Neg Pred Value : 0.30120        
##              Prevalence : 0.11691        
##          Detection Rate : 0.08484        
##    Detection Prevalence : 0.95410        
##       Balanced Accuracy : 0.37065        
##                                          
##        'Positive' Class : yes            
##

# Review
print("Tuned AdaBoost Results:")

## [1] "Tuned AdaBoost Results:"

print(ada_model_tuned)

## Boosted Classification Trees 
## 
## 36170 samples
##    38 predictor
##     2 classes: 'yes', 'no' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 24113, 24114, 24113 
## Resampling results across tuning parameters:
## 
##   maxdepth  iter  Accuracy    Kappa      
##   3          50   0.10165879  -0.06304283
##   3         100   0.10171410  -0.06618752
##   3         150   0.10149293  -0.07036822
##   5          50   0.09988940  -0.08040376
##   5         100   0.09969587  -0.08479942
##   5         150   0.09883881  -0.08920124
## 
## Tuning parameter 'nu' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were iter = 100, maxdepth = 3 and nu = 0.1.

plot(ada_model_tuned)

# Interpretation

The analysis reveals severe performance issues with both the baseline and tuned AdaBoost models, achieving a dismal accuracy of approximately 11%. This is further underscored by negative Kappa values, indicating the models perform worse than random chance. The high “No Information Rate” of 88.31% highlights a significant class imbalance, with a vast majority of “no” responses, leading to models that heavily favor the majority class. Consequently, the confusion matrices demonstrate a critical failure in specificity, with the models unable to correctly identify true negatives, resulting in a deluge of false positives. Parameter tuning, while exploring different tree depths and iterations, yielded only marginal improvements, with shallower trees (maxdepth=3) performing slightly better than deeper ones and peak performance occurring around 50 iterations.

The feature importance analysis identified “duration,” “housingyes,” “previous,” “pdays,” and “balance” as the most influential predictors. However, the models’ inability to effectively handle the class imbalance and the misleading nature of accuracy as an evaluation metric in this context pose significant challenges. The models’ tendency to predict nearly all clients as subscribing (“yes”) would result in a disastrous campaign for the bank, wasting resources on false positives and failing to identify true negatives. AdaBoost’s struggle with this extreme imbalance suggests that alternative modeling approaches or data preprocessing techniques, such as oversampling or undersampling, may be necessary to achieve meaningful results.

Conclusions

The performance comparison of AdaBoost, Random Forest, and Decision Tree models reveals significant differences in their effectiveness for the given classification task. AdaBoost, while demonstrating exceptionally high sensitivity by capturing nearly all “yes” cases, suffers from abysmal specificity, labeling almost every instance as “yes,” rendering it practically useless. Its negative Kappa values further indicate performance worse than random chance.

In contrast, Random Forest emerges as the most balanced model, achieving the highest Kappa value and maintaining good specificity while capturing a substantial portion of “yes” cases. It also exhibits the highest precision among the three models. The Decision Tree model, although simple, falls short in sensitivity, missing nearly 70% of “yes” cases, and exhibits a lower Kappa value compared to Random Forest.

From a business perspective, AdaBoost’s tendency to contact nearly all customers would result in significant resource wastage due to high false positives. Random Forest, on the other hand, offers a more practical approach by correctly identifying a reasonable proportion of subscribers while maintaining high accuracy in rejecting non-subscribers. Its balanced performance translates to contacting fewer wrong customers and capturing more revenue opportunities. The Decision Tree model, being overly conservative, misses a significant number of potential opportunities.

Therefore, Random Forest is identified as the optimal model due to its superior metric balance, practical utility, and overall business value, effectively balancing the trade-off between sensitivity and specificity.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Assignment 2 D622

Jose Fuentes

2025-03-23

Assignment 2 D622

Preparing data

Decision Tree Model

Interpretation Random Forest Model

Conclusions