This assignment consists of conducting at least two (2) experiments for each of three algorithms: Decision Trees, Random Forest, and AdaBoost. That is, at least six (6) experiments in total (3 algorithms x 2 experiments each). For each experiment you will define what you are trying to achieve (before each run), conduct the experiment, and at the end review how it went. These experiments will allow you to compare the algorithms and choose the optimal model.
The Dataset for this assignment
A Portuguese bank conducted a marketing campaign (phone calls) to predict whether a client would subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics to help the bank persuade more customers to subscribe to its term deposit in the next campaign. Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing
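For reproducibility, the archive can also be fetched directly in R. A minimal sketch, assuming UCI's current download layout (the zip URL and inner file names are assumptions; adjust if the site layout changes):

# Download the Bank Marketing archive from UCI (URL layout assumed)
url <- "https://archive.ics.uci.edu/static/public/222/bank+marketing.zip"
download.file(url, destfile = "bank-marketing.zip", mode = "wb")
unzip("bank-marketing.zip")  # contains bank.zip, among others
unzip("bank.zip")            # extracts bank-full.csv used below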
# Load required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(dplyr)
library(ggplot2)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(ada) # For AdaBoost
## Loading required package: rpart
library(rpart) # For Decision Trees
# --- Load and Prepare Data ---
# Load the dataset (adjust the path to your local file)
bank_data <- read.csv("C:/Users/Dell/Downloads/bank-full.csv", sep = ";")
# View the structure of the dataset
str(bank_data)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
## $ marital : chr "married" "single" "married" "married" ...
## $ education: chr "tertiary" "secondary" "secondary" "unknown" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : chr "yes" "yes" "yes" "yes" ...
## $ loan : chr "no" "no" "yes" "no" ...
## $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : chr "may" "may" "may" "may" ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
# Summarize the dataset
summary(bank_data)
## age job marital education
## Min. :18.00 Length:45211 Length:45211 Length:45211
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :40.94
## 3rd Qu.:48.00
## Max. :95.00
## default balance housing loan
## Length:45211 Min. : -8019 Length:45211 Length:45211
## Class :character 1st Qu.: 72 Class :character Class :character
## Mode :character Median : 448 Mode :character Mode :character
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
## contact day month duration
## Length:45211 Min. : 1.00 Length:45211 Min. : 0.0
## Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 103.0
## Mode :character Median :16.00 Mode :character Median : 180.0
## Mean :15.81 Mean : 258.2
## 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :31.00 Max. :4918.0
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.0 Min. : 0.0000 Length:45211
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.0 Median : 0.0000 Mode :character
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
## y
## Length:45211
## Class :character
## Mode :character
##
##
##
# Check for missing values
colSums(is.na(bank_data))
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
# Replace "unknown" with NA for easier handling
bank_data <- bank_data %>%
  mutate(across(where(is.character), ~ na_if(., "unknown")))
# Verify the presence of NA values
colSums(is.na(bank_data))
## age job marital education default balance housing loan
## 0 288 0 1857 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 13020 0 0 0 0 0 0 36959
## y
## 0
# Ensure 'y' is a factor
bank_data$y <- factor(bank_data$y, levels = c("yes", "no"))
# --- Train-Test Split ---
# One-hot encoding for categorical variables
bank_data_encoded <- dummyVars("~ .", data = bank_data %>% select(-y)) %>%
  predict(newdata = bank_data %>% select(-y)) %>%
  as.data.frame()
# Add target variable back to the encoded dataset
bank_data_encoded$y <- bank_data$y
# Train-test split (80%-20%)
set.seed(42) # For reproducibility
train_index <- createDataPartition(bank_data_encoded$y, p = 0.8, list = FALSE)
# Split the dataset into training and testing sets
train_data <- bank_data_encoded[train_index, ]
test_data <- bank_data_encoded[-train_index, ]
# Verify class distribution in training data
table(train_data$y)
##
## yes no
## 4232 31938
# --- Experiment Code ---
# Define training control for all models
control <- trainControl(method = "cv", number = 5, savePredictions = "final")
#colSums(is.na(train_data))
names(train_data)
## [1] "age" "jobadmin." "jobblue-collar"
## [4] "jobentrepreneur" "jobhousemaid" "jobmanagement"
## [7] "jobretired" "jobself-employed" "jobservices"
## [10] "jobstudent" "jobtechnician" "jobunemployed"
## [13] "maritaldivorced" "maritalmarried" "maritalsingle"
## [16] "educationprimary" "educationsecondary" "educationtertiary"
## [19] "defaultno" "defaultyes" "balance"
## [22] "housingno" "housingyes" "loanno"
## [25] "loanyes" "contactcellular" "contacttelephone"
## [28] "day" "monthapr" "monthaug"
## [31] "monthdec" "monthfeb" "monthjan"
## [34] "monthjul" "monthjun" "monthmar"
## [37] "monthmay" "monthnov" "monthoct"
## [40] "monthsep" "duration" "campaign"
## [43] "pdays" "previous" "poutcomefailure"
## [46] "poutcomeother" "poutcomesuccess" "y"
# --- Clear Environment ---
rm(list = ls())
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2416522 129.1 3990517 213.2 3990517 213.2
## Vcells 4127868 31.5 15608685 119.1 15579109 118.9
# --- Reload and Preprocess Dataset ---
bank_data <- read.csv("C:/Users/Dell/Downloads/bank-full.csv", sep = ";")
bank_data <- bank_data %>%
  mutate(across(where(is.character), ~ na_if(., "unknown")))
dummy_vars <- dummyVars("~ .", data = bank_data %>% select(-y))
encoded_data <- predict(dummy_vars, newdata = bank_data %>% select(-y)) %>%
  as.data.frame()
encoded_data$y <- factor(bank_data$y, levels = c("yes", "no"))
# --- Train-Test Split ---
set.seed(42)
train_index <- createDataPartition(encoded_data$y, p = 0.8, list = FALSE)
train_data <- encoded_data[train_index, ]
test_data <- encoded_data[-train_index, ]
# --- Verify Column Names ---
print(names(train_data))
## [1] "age" "jobadmin." "jobblue-collar"
## [4] "jobentrepreneur" "jobhousemaid" "jobmanagement"
## [7] "jobretired" "jobself-employed" "jobservices"
## [10] "jobstudent" "jobtechnician" "jobunemployed"
## [13] "maritaldivorced" "maritalmarried" "maritalsingle"
## [16] "educationprimary" "educationsecondary" "educationtertiary"
## [19] "defaultno" "defaultyes" "balance"
## [22] "housingno" "housingyes" "loanno"
## [25] "loanyes" "contactcellular" "contacttelephone"
## [28] "day" "monthapr" "monthaug"
## [31] "monthdec" "monthfeb" "monthjan"
## [34] "monthjul" "monthjun" "monthmar"
## [37] "monthmay" "monthnov" "monthoct"
## [40] "monthsep" "duration" "campaign"
## [43] "pdays" "previous" "poutcomefailure"
## [46] "poutcomeother" "poutcomesuccess" "y"
print(names(test_data))
## [1] "age" "jobadmin." "jobblue-collar"
## [4] "jobentrepreneur" "jobhousemaid" "jobmanagement"
## [7] "jobretired" "jobself-employed" "jobservices"
## [10] "jobstudent" "jobtechnician" "jobunemployed"
## [13] "maritaldivorced" "maritalmarried" "maritalsingle"
## [16] "educationprimary" "educationsecondary" "educationtertiary"
## [19] "defaultno" "defaultyes" "balance"
## [22] "housingno" "housingyes" "loanno"
## [25] "loanyes" "contactcellular" "contacttelephone"
## [28] "day" "monthapr" "monthaug"
## [31] "monthdec" "monthfeb" "monthjan"
## [34] "monthjul" "monthjun" "monthmar"
## [37] "monthmay" "monthnov" "monthoct"
## [40] "monthsep" "duration" "campaign"
## [43] "pdays" "previous" "poutcomefailure"
## [46] "poutcomeother" "poutcomesuccess" "y"
# --- Explicit Formula ---
# Note: hyphenated column names such as jobblue-collar must be backquoted in a
# formula; otherwise R parses the hyphen as subtraction.
formula_dt <- y ~ age + jobadmin. + `jobblue-collar` + jobmanagement + jobretired +
  jobservices + jobtechnician + maritaldivorced + maritalmarried +
  maritalsingle + educationprimary + educationsecondary +
  educationtertiary + balance + housingno + housingyes + loanno +
  loanyes + contactcellular + contacttelephone + day + monthapr +
  monthaug + monthfeb + monthjul + monthjun + monthmay + monthnov +
  duration + campaign + previous + poutcomefailure + poutcomeother +
  poutcomesuccess
# --- Train Decision Tree Model ---
control <- trainControl(method = "cv", number = 5, savePredictions = "final")
#dt_model_baseline <- train(formula_dt, data = train_data, method = "rpart", trControl = control)
# Load required libraries
library(tidyverse)
library(caret)
library(rpart) # Decision Trees
library(randomForest) # Random Forest
library(ada) # AdaBoost
# --- Load and Prepare Data ---
bank_data <- read.csv("C:/Users/Dell/Downloads/bank-full.csv", sep = ";")
# Convert "unknown" values to NA
bank_data <- bank_data %>%
  mutate(across(where(is.character), ~ na_if(., "unknown")))
# Ensure 'y' is a factor
bank_data$y <- factor(bank_data$y, levels = c("yes", "no"))
# One-hot encoding (excluding y); fullRank = TRUE drops one dummy per factor
# to avoid perfectly collinear columns
bank_data_encoded <- dummyVars("~ .", data = bank_data %>% select(-y), fullRank = TRUE) %>%
  predict(newdata = bank_data %>% select(-y)) %>%
  as.data.frame()
# Add target variable back
bank_data_encoded$y <- bank_data$y
# --- Train-Test Split ---
set.seed(42) # For reproducibility
train_index <- createDataPartition(bank_data_encoded$y, p = 0.8, list = FALSE)
train_data <- bank_data_encoded[train_index, ]
test_data <- bank_data_encoded[-train_index, ]
# Handle missing values
preProcess_missingdata <- preProcess(train_data, method = "medianImpute")
train_data <- predict(preProcess_missingdata, train_data)
# Ensure 'y' is a factor
train_data$y <- as.factor(train_data$y)
# Confirm no NA values in 'y'
print(sum(is.na(train_data$y))) # Should be 0
## [1] 0
# Train Decision Tree
dt_model_baseline <- train(
  x = train_data %>% select(-y),
  y = train_data$y,
  method = "rpart",
  trControl = control
)
# Evaluate Decision Tree Baseline
predictions_dt <- predict(dt_model_baseline, test_data)
print(confusionMatrix(predictions_dt, test_data$y))
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 324 168
## no 733 7816
##
## Accuracy : 0.9003
## 95% CI : (0.894, 0.9064)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 9.899e-08
##
## Kappa : 0.3717
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.30653
## Specificity : 0.97896
## Pos Pred Value : 0.65854
## Neg Pred Value : 0.91426
## Prevalence : 0.11691
## Detection Rate : 0.03584
## Detection Prevalence : 0.05442
## Balanced Accuracy : 0.64274
##
## 'Positive' Class : yes
##
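The assignment calls for two experiments per algorithm, and only the baseline decision tree is run above. A minimal sketch of a second experiment that tunes rpart's complexity parameter cp (the grid values are illustrative; no results are reported for this run):

# Second decision-tree experiment (sketch): tune the complexity parameter cp
dt_tune_grid <- expand.grid(cp = c(0.001, 0.005, 0.01, 0.05))
dt_model_tuned <- train(
  x = train_data %>% select(-y),
  y = train_data$y,
  method = "rpart",
  trControl = control,
  tuneGrid = dt_tune_grid
)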
Interpretation: Decision Tree Model
The Decision Tree model, as evidenced by the provided confusion matrix and statistics, demonstrates a high overall accuracy of 90.03%, significantly exceeding the “No Information Rate” of 88.31%. This indicates the model performs better than simply predicting the majority class. However, a deeper examination of the metrics reveals a notable imbalance in its predictive capabilities. The model exhibits high specificity, at 97.9%, showcasing its strong ability to correctly identify “no” cases (non-subscribers). This suggests the model is conservative in its predictions, minimizing false positives.
Conversely, the model’s sensitivity, or recall, is notably low at 30.65%. This indicates that the model only captures approximately 30.65% of the actual “yes” cases (subscribers), meaning it misses a significant portion of potential subscribers. The precision, or positive predictive value, is 65.85%, suggesting that when the model predicts “yes,” it is correct roughly 66% of the time. The Kappa statistic, at 0.37, signifies a moderate level of agreement beyond chance, but is not exceptionally strong. The low detection rate, at 3.58%, reinforces the model’s limited ability to identify true positive cases. In practical terms, this Decision Tree model prioritizes avoiding false positives at the cost of missing a substantial number of potential subscribers, resulting in a conservative approach that may not be optimal for maximizing revenue opportunities.
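For reference, the headline metrics quoted above can be reproduced directly from the confusion-matrix counts; a short sketch (F1 is an addition not reported by confusionMatrix's default printout):

# Counts taken from the decision-tree confusion matrix above
TP <- 324; FN <- 733; FP <- 168; TN <- 7816

sensitivity <- TP / (TP + FN)  # 0.3065: share of actual subscribers found
specificity <- TN / (TN + FP)  # 0.9790: share of non-subscribers correctly rejected
precision   <- TP / (TP + FP)  # 0.6585: share of "yes" predictions that are correct
f1 <- 2 * precision * sensitivity / (precision + sensitivity)
round(c(sensitivity, specificity, precision, f1), 4)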
# Load required libraries
library(tidyverse)
library(caret)
library(doParallel)
## Loading required package: foreach
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loading required package: iterators
## Loading required package: parallel
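doParallel is attached here, but no backend is registered in this script, so allowParallel = TRUE below has no effect on its own. A minimal sketch of registering (and later releasing) a cluster; cl is a hypothetical name:

# Register a parallel backend so caret's allowParallel = TRUE takes effect
cl <- makeCluster(max(1, detectCores() - 1))
registerDoParallel(cl)
# ... train models ...
# stopCluster(cl)  # release the workers when done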
# --- Load the Dataset ---
bank_data <- read.csv("C:/Users/Dell/Downloads/bank-full.csv", sep = ";")
# --- Replace "unknown" with NA ---
bank_data <- bank_data %>%
mutate(across(where(is.character), ~na_if(., "unknown")))
# Ensure target variable 'y' is a factor
bank_data$y <- factor(bank_data$y, levels = c("yes", "no"))
# --- Perform One-Hot Encoding for Categorical Variables ---
dummy_vars <- dummyVars("~ .", data = bank_data %>% select(-y), fullRank = TRUE)
bank_data_encoded <- predict(dummy_vars, newdata = bank_data %>% select(-y)) %>% as.data.frame()
# Add the target variable back to the dataset
bank_data_encoded$y <- bank_data$y
# --- Train-Test Split ---
set.seed(42) # For reproducibility
train_index <- createDataPartition(bank_data_encoded$y, p = 0.8, list = FALSE)
train_data <- bank_data_encoded[train_index, ]
test_data <- bank_data_encoded[-train_index, ]
# --- Handle Missing Values ---
# Impute missing values in train_data
preProcess_missing <- preProcess(train_data, method = "medianImpute")
train_data <- predict(preProcess_missing, train_data)
test_data <- predict(preProcess_missing, test_data)
# Align test data structure with train data
test_data <- test_data[, colnames(train_data)]
# Confirm No Missing Values Remain
stopifnot(sum(is.na(train_data)) == 0)
stopifnot(sum(is.na(test_data)) == 0)
# --- Random Forest Model ---
# Training control: 3-fold CV; allowParallel uses a parallel backend if one is registered
control <- trainControl(method = "cv", number = 3, allowParallel = TRUE)
# Experiment 1: Baseline Random Forest
rf_model_baseline <- train(
  x = train_data %>% select(-y),
  y = train_data$y,
  method = "rf",
  trControl = control,
  ntree = 100 # Reduced number of trees
)
# Experiment 2: Random Forest with Hyperparameter Tuning
rf_tune_grid <- expand.grid(mtry = c(2, 5)) # Limit search space for faster tuning
rf_model_tuned <- train(
  x = train_data %>% select(-y),
  y = train_data$y,
  method = "rf",
  trControl = control,
  tuneGrid = rf_tune_grid,
  ntree = 100
)
# Evaluate the Baseline Model
predictions_rf <- predict(rf_model_baseline, test_data)
print(confusionMatrix(predictions_rf, test_data$y))
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 458 240
## no 599 7744
##
## Accuracy : 0.9072
## 95% CI : (0.901, 0.9131)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 1.006e-13
##
## Kappa : 0.4729
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.43330
## Specificity : 0.96994
## Pos Pred Value : 0.65616
## Neg Pred Value : 0.92820
## Prevalence : 0.11691
## Detection Rate : 0.05066
## Detection Prevalence : 0.07720
## Balanced Accuracy : 0.70162
##
## 'Positive' Class : yes
##
# Evaluate the Tuned Model
predictions_rf_tuned <- predict(rf_model_tuned, test_data)
print(confusionMatrix(predictions_rf_tuned, test_data$y))
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 388 155
## no 669 7829
##
## Accuracy : 0.9089
## 95% CI : (0.9027, 0.9147)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 1.714e-15
##
## Kappa : 0.4406
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.36708
## Specificity : 0.98059
## Pos Pred Value : 0.71455
## Neg Pred Value : 0.92128
## Prevalence : 0.11691
## Detection Rate : 0.04292
## Detection Prevalence : 0.06006
## Balanced Accuracy : 0.67383
##
## 'Positive' Class : yes
##
Interpretation: Random Forest Models
The Random Forest models (baseline and tuned) demonstrate a notable ability to classify non-subscribers, with specificity near 97-98%. This indicates a strong propensity to avoid false positives, minimizing the risk of misclassifying individuals who are unlikely to subscribe. The models' precision, ranging from 65-71%, suggests that when they do predict a potential subscriber, they are generally accurate. This capability is valuable for targeted marketing campaigns where minimizing wasted resources is crucial.
However, the model exhibits a significant limitation in its sensitivity, capturing only 36-43% of actual subscribers. This means a substantial portion of potential customers are missed, which could translate to lost revenue opportunities. While the model excels at avoiding false positives, its conservative nature results in a low detection rate, identifying only a fraction of the true positive cases. The balanced accuracy, averaging around 67-70%, reflects this trade-off between sensitivity and specificity. Ultimately, the Random Forest model prioritizes minimizing false positives over maximizing true positives, making it a suitable choice when the cost of contacting a non-subscriber is high, but potentially limiting its effectiveness in capturing all potential subscribers.
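If the bank cares more about catching subscribers, one low-cost lever is to lower the probability cutoff for the "yes" class instead of using the default 0.5. A sketch, not run here, assuming the fitted rf_model_baseline above; rf_probs, cutoff, and pred_custom are hypothetical names:

# Predict class probabilities instead of hard labels
rf_probs <- predict(rf_model_baseline, test_data, type = "prob")

# Lower the "yes" cutoff to trade specificity for sensitivity
cutoff <- 0.3  # illustrative value; choose it on validation data, not the test set
pred_custom <- factor(ifelse(rf_probs$yes >= cutoff, "yes", "no"),
                      levels = c("yes", "no"))
confusionMatrix(pred_custom, test_data$y)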
# First, clean all column names to replace special characters
clean_names <- function(df) {
  names(df) <- gsub("-", ".", names(df))   # Replace hyphens with periods
  names(df) <- gsub("\\+", ".", names(df)) # Replace "+" with periods, if any
  return(df)
}
# Apply to both train and test data
train_data_clean <- clean_names(train_data)
test_data_clean <- clean_names(test_data)
# Verify the problematic column is now properly named
print(names(train_data_clean)) # Should show "jobblue.collar" instead of "jobblue-collar"
## [1] "age" "jobblue.collar" "jobentrepreneur"
## [4] "jobhousemaid" "jobmanagement" "jobretired"
## [7] "jobself.employed" "jobservices" "jobstudent"
## [10] "jobtechnician" "jobunemployed" "maritalmarried"
## [13] "maritalsingle" "educationsecondary" "educationtertiary"
## [16] "defaultyes" "balance" "housingyes"
## [19] "loanyes" "contacttelephone" "day"
## [22] "monthaug" "monthdec" "monthfeb"
## [25] "monthjan" "monthjul" "monthjun"
## [28] "monthmar" "monthmay" "monthnov"
## [31] "monthoct" "monthsep" "duration"
## [34] "campaign" "pdays" "previous"
## [37] "poutcomeother" "poutcomesuccess" "y"
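Base R's make.names() performs the same sanitization for this dataset (it coerces names to syntactically valid R names, turning the hyphens into periods), so a one-line alternative to the custom helper would be:

# Equivalent alternative: coerce all names to syntactically valid R names
names(train_data) <- make.names(names(train_data))
names(test_data)  <- make.names(names(test_data))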
# Experiment 3.1: Baseline AdaBoost
# AdaBoost Baseline with cleaned column names
ada_model_baseline <- train(
  x = train_data_clean %>% select(-y),
  y = train_data_clean$y,
  method = "ada",
  trControl = trainControl(method = "cv", number = 3),
  metric = "Accuracy"
)
# Evaluate
predictions_ada_base <- predict(ada_model_baseline, test_data_clean)
print(confusionMatrix(predictions_ada_base, test_data_clean$y))
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 1028 7984
## no 29 0
##
## Accuracy : 0.1137
## 95% CI : (0.1072, 0.1204)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.0064
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9726
## Specificity : 0.0000
## Pos Pred Value : 0.1141
## Neg Pred Value : 0.0000
## Prevalence : 0.1169
## Detection Rate : 0.1137
## Detection Prevalence : 0.9968
## Balanced Accuracy : 0.4863
##
## 'Positive' Class : yes
##
# Review
print("AdaBoost Baseline Results:")
## [1] "AdaBoost Baseline Results:"
print(ada_model_baseline)
## Boosted Classification Trees
##
## 36170 samples
## 38 predictor
## 2 classes: 'yes', 'no'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 24114, 24113, 24113
## Resampling results across tuning parameters:
##
## maxdepth iter Accuracy Kappa
## 1 50 0.1149848 -0.005058857
## 1 100 0.1121648 -0.012986847
## 1 150 0.1109483 -0.017621001
## 2 50 0.1048383 -0.046576962
## 2 100 0.1042024 -0.050504568
## 2 150 0.1036771 -0.055392382
## 3 50 0.1024053 -0.061199254
## 3 100 0.1020182 -0.065391244
## 3 150 0.1013547 -0.069239112
##
## Tuning parameter 'nu' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were iter = 50, maxdepth = 1 and nu = 0.1.
varImp(ada_model_baseline)
## ROC curve variable importance
##
## only 20 most important variables shown (out of 38)
##
## Importance
## duration 100.000
## housingyes 34.188
## previous 33.297
## pdays 30.266
## balance 29.766
## poutcomesuccess 27.795
## monthmay 23.621
## campaign 23.424
## educationtertiary 15.211
## jobblue.collar 15.023
## maritalmarried 14.315
## maritalsingle 14.094
## loanyes 12.336
## day 9.652
## educationsecondary 8.360
## jobretired 8.248
## monthoct 7.994
## jobmanagement 6.578
## monthsep 6.535
## monthmar 6.118
# Experiment 3.2: Tuned AdaBoost
# AdaBoost Tuning with cleaned column names
ada_tune_grid <- expand.grid(
  iter = c(50, 100, 150),
  maxdepth = c(3, 5),
  nu = 0.1
)
ada_model_tuned <- train(
  x = train_data_clean %>% select(-y),
  y = train_data_clean$y,
  method = "ada",
  trControl = trainControl(method = "cv", number = 3),
  tuneGrid = ada_tune_grid,
  metric = "Accuracy"
)
# Evaluate
predictions_ada_tuned <- predict(ada_model_tuned, test_data_clean)
print(confusionMatrix(predictions_ada_tuned, test_data_clean$y))
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 767 7859
## no 290 125
##
## Accuracy : 0.0987
## 95% CI : (0.0926, 0.105)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.063
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.72564
## Specificity : 0.01566
## Pos Pred Value : 0.08892
## Neg Pred Value : 0.30120
## Prevalence : 0.11691
## Detection Rate : 0.08484
## Detection Prevalence : 0.95410
## Balanced Accuracy : 0.37065
##
## 'Positive' Class : yes
##
# Review
print("Tuned AdaBoost Results:")
## [1] "Tuned AdaBoost Results:"
print(ada_model_tuned)
## Boosted Classification Trees
##
## 36170 samples
## 38 predictor
## 2 classes: 'yes', 'no'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 24113, 24114, 24113
## Resampling results across tuning parameters:
##
## maxdepth iter Accuracy Kappa
## 3 50 0.10165879 -0.06304283
## 3 100 0.10171410 -0.06618752
## 3 150 0.10149293 -0.07036822
## 5 50 0.09988940 -0.08040376
## 5 100 0.09969587 -0.08479942
## 5 150 0.09883881 -0.08920124
##
## Tuning parameter 'nu' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were iter = 100, maxdepth = 3 and nu = 0.1.
plot(ada_model_tuned)
Interpretation: AdaBoost Models
The analysis reveals severe performance issues with both AdaBoost models: the baseline reaches only 11.37% test accuracy and the tuned model 9.87%, far below the "No Information Rate" of 88.31%, and both Kappa values are negative, indicating performance worse than random chance. Notably, the failure is not the majority-class bias one would expect from this imbalanced dataset (roughly 88% "no"): the models label nearly every client as "yes," the minority class, driving specificity to almost zero and flooding the predictions with false positives. This inversion suggests a problem beyond class imbalance alone, plausibly in how the ada method handles the factor-level ordering of the outcome, though that hypothesis would need to be verified (for example, by refitting with levels = c("no", "yes")). Parameter tuning did not rescue the models: cross-validated accuracy declined as trees got deeper, with the shallowest trees performing marginally better and roughly 50 iterations sufficing in the baseline grid, and the tuned model's test accuracy came out lower than the baseline's.
The feature importance analysis identified “duration,” “housingyes,” “previous,” “pdays,” and “balance” as the most influential predictors. However, the models’ inability to effectively handle the class imbalance and the misleading nature of accuracy as an evaluation metric in this context pose significant challenges. The models’ tendency to predict nearly all clients as subscribing (“yes”) would result in a disastrous campaign for the bank, wasting resources on false positives and failing to identify true negatives. AdaBoost’s struggle with this extreme imbalance suggests that alternative modeling approaches or data preprocessing techniques, such as oversampling or undersampling, may be necessary to achieve meaningful results.
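As a concrete next step for the resampling idea above, caret can rebalance classes inside each cross-validation fold via the sampling argument of trainControl. A minimal sketch using down-sampling (not run here, so no results are claimed; control_down and ada_model_down are hypothetical names):

# Down-sample the majority class within each CV resample
control_down <- trainControl(method = "cv", number = 3, sampling = "down")

ada_model_down <- train(
  x = train_data_clean %>% select(-y),
  y = train_data_clean$y,
  method = "ada",
  trControl = control_down,
  metric = "Accuracy"
)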
The performance comparison of AdaBoost, Random Forest, and Decision Tree models reveals significant differences in their effectiveness for the given classification task. AdaBoost, while demonstrating exceptionally high sensitivity by capturing nearly all “yes” cases, suffers from abysmal specificity, labeling almost every instance as “yes,” rendering it practically useless. Its negative Kappa values further indicate performance worse than random chance.
In contrast, Random Forest emerges as the most balanced model, achieving the highest Kappa value and maintaining good specificity while capturing a substantial portion of “yes” cases. It also exhibits the highest precision among the three models. The Decision Tree model, although simple, falls short in sensitivity, missing nearly 70% of “yes” cases, and exhibits a lower Kappa value compared to Random Forest.
From a business perspective, AdaBoost’s tendency to contact nearly all customers would result in significant resource wastage due to high false positives. Random Forest, on the other hand, offers a more practical approach by correctly identifying a reasonable proportion of subscribers while maintaining high accuracy in rejecting non-subscribers. Its balanced performance translates to contacting fewer wrong customers and capturing more revenue opportunities. The Decision Tree model, being overly conservative, misses a significant number of potential opportunities.
Therefore, Random Forest is identified as the optimal model due to its superior metric balance, practical utility, and overall business value, effectively balancing the trade-off between sensitivity and specificity.
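To make this comparison concrete, the headline test-set metrics can be collected into one table. A sketch, assuming the prediction objects from the earlier chunks are still in memory and the test partitions line up (the same seed and split were used throughout):

# Gather key test-set metrics for the three models into one table
cm_dt  <- confusionMatrix(predictions_dt, test_data$y)
cm_rf  <- confusionMatrix(predictions_rf, test_data$y)
cm_ada <- confusionMatrix(predictions_ada_base, test_data_clean$y)

comparison <- tibble(
  Model       = c("Decision Tree", "Random Forest", "AdaBoost"),
  Accuracy    = c(cm_dt$overall["Accuracy"], cm_rf$overall["Accuracy"], cm_ada$overall["Accuracy"]),
  Kappa       = c(cm_dt$overall["Kappa"], cm_rf$overall["Kappa"], cm_ada$overall["Kappa"]),
  Sensitivity = c(cm_dt$byClass["Sensitivity"], cm_rf$byClass["Sensitivity"], cm_ada$byClass["Sensitivity"]),
  Specificity = c(cm_dt$byClass["Specificity"], cm_rf$byClass["Specificity"], cm_ada$byClass["Specificity"])
)
print(comparison)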