Background

The assignment is to perform EDA and further experimentation on the UCI “Bank Marketing” dataset (detailed description at: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing).

The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’).

The classification goal is to predict whether the client will subscribe to a term deposit (variable y). We will use the bank-full.csv file.

Input predictors:

# bank client data:

1 - age (numeric)

2 - job : type of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”, “blue-collar”,“self-employed”,“retired”,“technician”,“services”)

3 - marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed)

4 - education (categorical: “unknown”,“secondary”,“primary”,“tertiary”)

5 - default: has credit in default? (binary: “yes”,“no”)

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (binary: “yes”,“no”)

8 - loan: has personal loan? (binary: “yes”,“no”)

# related with the last contact of the current campaign:

9 - contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”)

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)

12 - duration: last contact duration, in seconds (numeric)

# other attributes:

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)

Output variable (desired target):

17 - y - has the client subscribed a term deposit? (binary: “yes”,“no”)

Missing values: none

Data Preparation

1. Importing Libraries

# Import needed libraries

library(ggplot2)
library(readr) # to use the read_csv function
library(dplyr) # to use filter, mutate, arrange, etc.
library(tidyr) # to use pivot_longer function

library(e1071)  # For skewness function
library(corrplot)

library(ROSE)
library(smotefamily)

library(caret) # for createDataPartition, trainControl, train, varImp

library(rpart) # for decision tree rpart()
library(rpart.plot) # for decision tree rpart.plot()

# ROC Curve
library(pROC)

# Precision-Recall Curve
library(PRROC)

# randomForest
library(randomForest)

# XGBoost
library(xgboost)      

2. Data Ingestion and inspection

The data has 45211 observations and 17 variables. Some of the variables are not of the correct type; we need to convert them to the correct data types.

bank_raw <- read.csv("https://raw.githubusercontent.com/datanerddhanya/DATA622/refs/heads/main/bank-full.csv")

head(bank_raw)
##   age          job marital education default balance housing loan contact day
## 1  58   management married  tertiary      no    2143     yes   no unknown   5
## 2  44   technician  single secondary      no      29     yes   no unknown   5
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown   5
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown   5
## 5  33      unknown  single   unknown      no       1      no   no unknown   5
## 6  35   management married  tertiary      no     231     yes   no unknown   5
##   month duration campaign pdays previous poutcome  y
## 1   may      261        1    -1        0  unknown no
## 2   may      151        1    -1        0  unknown no
## 3   may       76        1    -1        0  unknown no
## 4   may       92        1    -1        0  unknown no
## 5   may      198        1    -1        0  unknown no
## 6   may      139        1    -1        0  unknown no

3. Change the predictors to the correct data types.

bank_transform <- bank_raw
bank_transform$job <- as.factor(bank_raw$job)
bank_transform$marital <- as.factor(bank_raw$marital)
bank_transform$education <- as.factor(bank_raw$education)
bank_transform$default <- as.factor(bank_raw$default)
bank_transform$balance <- as.integer(bank_raw$balance)
bank_transform$housing <- as.factor(bank_raw$housing)
bank_transform$loan <- as.factor(bank_raw$loan)
bank_transform$contact <- as.factor(bank_raw$contact)
bank_transform$month <- as.factor(bank_raw$month)
bank_transform$pdays <- as.integer(bank_raw$pdays)
bank_transform$poutcome <- as.factor(bank_raw$poutcome)
bank_transform$y <- as.factor(bank_raw$y)
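
The same result can be written more compactly. A minimal dplyr sketch, assuming every character column should become a factor (which holds here, since all non-numeric columns are categorical):

# equivalent compact conversion: every character column becomes a factor
bank_transform <- bank_raw |>
  mutate(across(where(is.character), as.factor))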

str(bank_transform)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

4. Target variable as a new numeric variable

Some analyses also need the target variable in numeric format; the conversion is kept below, commented out.

#bank_transform$y_numeric <- ifelse(bank_transform$y == "yes", 1, 0)
#bank_transform$y_numeric <- as.integer(bank_transform$y_numeric)

Pre-processing

1. Data Cleaning

There are no missing values and no duplicates. However, several categorical variables contain “unknown” values; the imputation steps below would fill in any values converted to NA.

# Count missing values per column
colSums(is.na(bank_transform ))  
##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0
# Replace missing numerical values with median
bank_final <- bank_transform |>
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))

# Replace missing categorical values with the most common value (mode)
for (col in names(bank_final)) {
  if (is.factor(bank_final[[col]])) {
    mode_val <- names(sort(table(bank_final[[col]]), decreasing = TRUE))[1]
    bank_final[[col]][is.na(bank_final[[col]])] <- mode_val
  }
}
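
As a quick sanity check (a minimal sketch, not part of the original pipeline), we can confirm that no missing values remain after imputation:

# verify that imputation left no NA values in any column
stopifnot(sum(is.na(bank_final)) == 0)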

2. Feature Engineering

# Create Age Groups
bank_final_add <- bank_final %>%
  mutate(age_group = case_when(
    age <= 30 ~ "18-30",
    age > 30 & age <= 40 ~ "31-40",
    age > 40 & age <= 50 ~ "41-50",
    age > 50 & age <= 60 ~ "51-60",
    age > 60 ~ "60+"
  ))

# Categorize Balance Levels
bank_final_add$balance_group <- cut(bank_final_add$balance, 
                       breaks = quantile(bank_final_add$balance, probs = seq(0, 1, 0.2)),
                       labels = c("Very Low", "Low", "Medium", "High", "Very High"),
                       include.lowest = TRUE) # keep the minimum balance in the first bin

# Categorize Contact Duration
bank_final_add <- bank_final_add %>%
  mutate(duration_category = case_when(
    duration < 100 ~ "Short",
    duration >= 100 & duration <= 300 ~ "Medium",
    duration > 300 ~ "Long"
  ))

# Convert categorical variables to factors
bank_final_add <- bank_final_add %>% mutate(across(where(is.character), as.factor))
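
To verify the engineered features, a short illustrative check of the new columns (the useNA argument surfaces any boundary cases left by cut()):

# inspect the distribution of each engineered feature
table(bank_final_add$age_group)
table(bank_final_add$balance_group, useNA = "ifany")
table(bank_final_add$duration_category)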

3. Imbalanced Data

# Check imbalance
table(bank_final_add$y)
## 
##    no   yes 
## 39922  5289
# Oversampling using ROSE (Random Over-Sampling Examples)
bank_final_balanced <- ROSE(y ~ ., data = bank_final, seed = 123)$data
table(bank_final_balanced$y)
## 
##    no   yes 
## 22885 22326
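
Since smotefamily is already loaded, SMOTE is a possible alternative to ROSE. A hedged sketch follows; note that smotefamily::SMOTE accepts numeric predictors only, so this illustrative version balances on just the numeric columns:

# alternative balancing with SMOTE on the numeric predictors only
num_cols <- bank_final_add |> select(where(is.numeric))
smote_out <- SMOTE(X = num_cols, target = bank_final_add$y, K = 5)
table(smote_out$data$class) # class counts after synthetic oversampling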

4. Split the data into train and test

# 80% train data, 20% test data; the partition indices must come from the
# same data frame that is being indexed
set.seed(1234)
sample_set <- createDataPartition(bank_final_add$y, p = 0.8, list = FALSE)

bank_train <- bank_final_add[sample_set, ]

bank_test <- bank_final_add[-sample_set, ]

Experiment using Decision Trees

Experiment 1. Changing the ratio of training to test data.

Objective: The performance of the decision tree can improve by adjusting the ratio of training to test data.

Variation: From the original preparation of partitioning 80% of the data as training and the remaining 20% as test, I changed the partitioning to 70% training and 30% test.

Evaluation metric: Accuracy, ROC-AUC.

Results/Run: The decision tree fitted on the training set shows that the duration feature is the most predictive of the final outcome, followed by poutcome. The model is fairly simple, using only two variables and a series of binary splits to make predictions; the percentages at the leaf nodes provide a measure of confidence in each prediction. The default model's accuracy is 0.900; after changing the ratio of training to test data, accuracy increased to 0.901. ROC-AUC = 0.69.

Conclusion/Recommendation: The performance of the decision tree improved after adjusting the ratio of training to test data. While accuracy is high, the model significantly underperforms on the minority class (“yes”), indicating a need for further refinement: false negatives for the ‘yes’ class are high (1004 misclassified), and the low specificity suggests the model struggles to identify positive cases. Although ROSE was used to address class imbalance, it did not resolve the issue. The recommendation is to adjust class weights more aggressively.

#fit the decision tree model
dt_exp1 <- rpart(y ~  ., data = bank_train, method = "class")

#Visualize the model
rpart.plot(dt_exp1, main="Default Decision Tree Model")

# predict for the test data
pred_dt_exp1 <- predict(dt_exp1, bank_test, type = "class")

# Generate the confusion matrix
cm_dt_exp1 <- confusionMatrix(pred_dt_exp1 , bank_test$y)

# display the accuracy
acc_dt_exp1 <- cm_dt_exp1$overall["Accuracy"]
paste0("Decision Tree Experiment 1: Accuracy = ", acc_dt_exp1)
## [1] "Decision Tree Experiment 1: Accuracy = 0.897920813979208"
# variation: change the ratio of training to test data
# 70% train data, 30% test data
sample_set <- createDataPartition(bank_final_add$y, p = 0.7, list = FALSE)
bank_train <- bank_final_add[sample_set, ]
bank_test <- bank_final_add[-sample_set, ]

#refit the model
dt_exp1 <- rpart(y ~ ., data = bank_train, method = "class")
# predict for the test data
pred_dt_exp1 <- predict(dt_exp1, bank_test, type = "class")

# Generate the confusion matrix
cm_dt_exp1 <- confusionMatrix(pred_dt_exp1 , bank_test$y)
cm_dt_exp1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  11637  1004
##        yes   339   582
##                                          
##                Accuracy : 0.901          
##                  95% CI : (0.8958, 0.906)
##     No Information Rate : 0.8831         
##     P-Value [Acc > NIR] : 1.678e-11      
##                                          
##                   Kappa : 0.4139         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9717         
##             Specificity : 0.3670         
##          Pos Pred Value : 0.9206         
##          Neg Pred Value : 0.6319         
##              Prevalence : 0.8831         
##          Detection Rate : 0.8581         
##    Detection Prevalence : 0.9321         
##       Balanced Accuracy : 0.6693         
##                                          
##        'Positive' Class : no             
## 
# display the ROC -AUC
roc_curve <- roc(response = bank_test$y, 
                 predictor = as.numeric(pred_dt_exp1))
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
tuned_auc <- auc(roc_curve)
plot(roc_curve, main = paste0("ROC curve with AUC = ",round(tuned_auc, 3)))
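
The ROC above is computed from hard class labels, which yields only a single operating point. A hedged refinement (a sketch, not the original analysis) uses the tree's predicted probabilities instead, tracing a fuller curve:

# ROC from class probabilities rather than hard labels
prob_dt_exp1 <- predict(dt_exp1, bank_test, type = "prob")[, "yes"]
roc_prob <- roc(response = bank_test$y, predictor = prob_dt_exp1)
auc(roc_prob)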

Experiment 2. Adjust Class Weights and using cross validation

Objective: To test whether the performance of the decision tree can improve by adjusting the class weights or by using cross-validation.

Variation: From the original class weights, I adjusted the class weights moderately (priors of 0.80/0.20).

Evaluation metric: Accuracy, ROC-AUC. The default model's accuracy is 0.900; after adjusting the class weights, accuracy drops slightly to 0.8912, but with improved minority-class prediction. The ROC-AUC score of 0.746 indicates acceptable performance.

Results/Run: The decision tree model now includes month along with the duration and poutcome features. The model is still fairly simple, using only three variables and a series of binary splits to make predictions; the percentages at the leaf nodes provide a measure of confidence in each prediction. After training the decision tree with cross-validation, the accuracy of the model did not improve.

Conclusion/Recommendation: The model reached a stable and promising state with adjusted class weights, with good accuracy (0.8912) and improved minority-class prediction (fewer false negatives for the ‘yes’ class). Cross-validation did not improve the accuracy of the model; its results suggest that the best complexity parameter (cp) is around 0.015, beyond which increasing complexity decreases the ROC score. Hence the dt_weighted model is better than the cross-validated dt_tune model.

# Adjust class weights moderately via class priors (factor-level order: no, yes)
dt_weighted <- rpart(y ~ ., 
                     data = bank_train, 
                     method = "class",
                     parms = list(prior = c(0.80, 0.20)))

#Visualize the model
rpart.plot(dt_weighted, main="Class weighted Decision Tree Model")

# predict for the test data
pred_dt_weighted <- predict(dt_weighted, bank_test, type = "class")

# Generate the confusion matrix
cm_dt_weighted <- confusionMatrix(pred_dt_weighted , bank_test$y)
cm_dt_weighted
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  11203   703
##        yes   773   883
##                                           
##                Accuracy : 0.8912          
##                  95% CI : (0.8858, 0.8964)
##     No Information Rate : 0.8831          
##     P-Value [Acc > NIR] : 0.001575        
##                                           
##                   Kappa : 0.483           
##                                           
##  Mcnemar's Test P-Value : 0.072495        
##                                           
##             Sensitivity : 0.9355          
##             Specificity : 0.5567          
##          Pos Pred Value : 0.9410          
##          Neg Pred Value : 0.5332          
##              Prevalence : 0.8831          
##          Detection Rate : 0.8261          
##    Detection Prevalence : 0.8779          
##       Balanced Accuracy : 0.7461          
##                                           
##        'Positive' Class : no              
## 
# display the accuracy
acc_dt_weighted <- cm_dt_weighted$overall["Accuracy"]
paste0("Decision Tree Experiment 2 (Class weighted): Accuracy = ", acc_dt_weighted)
## [1] "Decision Tree Experiment 2 (Class weighted): Accuracy = 0.891166494617313"
# ROC Curve

roc_curve <- roc(response = bank_test$y, 
                 predictor = as.numeric(pred_dt_weighted))
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
tuned_auc <- auc(roc_curve)
plot(roc_curve, main = paste0("ROC curve with AUC = ",round(tuned_auc, 3)))

# Evaluate Feature Importance
# varImp(dt_weighted)

# Feature Selection
# importantFeatures <- varImp(dt_weighted)
# selected_features <- rownames(importantFeatures)[importantFeatures$Overall > threshold]

bank_test <- na.omit(bank_test)
bank_train <- na.omit(bank_train)
# Cross-Validation
control <- trainControl(method = "cv", number = 10,
                        classProbs = TRUE,
                        summaryFunction = twoClassSummary)

dt_tune <- train(y ~ ., 
                 data = bank_train, 
                 method = "rpart",
                 metric = "ROC", # twoClassSummary reports ROC rather than Accuracy
                 trControl = control)
# Inspect the tuned model
plot(dt_tune)
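
The conclusion above notes an optimal cp near 0.015. A hedged extension of the tuning step (the grid bounds are illustrative assumptions) searches cp explicitly rather than relying on caret's default candidates:

# explicit complexity-parameter grid around the reported optimum (~0.015)
cp_grid <- expand.grid(cp = seq(0.001, 0.05, by = 0.002))
dt_tune_grid <- train(y ~ ., data = bank_train, method = "rpart",
                      metric = "ROC", trControl = control, tuneGrid = cp_grid)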

# predict for the test data
pred_dt_tune <- predict(dt_tune, bank_test)


# Generate the confusion matrix
cm_dt_tune <- confusionMatrix(pred_dt_tune , bank_test$y)
cm_dt_tune
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  11644  1041
##        yes   332   545
##                                           
##                Accuracy : 0.8988          
##                  95% CI : (0.8936, 0.9038)
##     No Information Rate : 0.8831          
##     P-Value [Acc > NIR] : 3.484e-09       
##                                           
##                   Kappa : 0.3919          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9723          
##             Specificity : 0.3436          
##          Pos Pred Value : 0.9179          
##          Neg Pred Value : 0.6214          
##              Prevalence : 0.8831          
##          Detection Rate : 0.8586          
##    Detection Prevalence : 0.9353          
##       Balanced Accuracy : 0.6580          
##                                           
##        'Positive' Class : no              
## 
# display the accuracy
acc_dt_tune <- cm_dt_tune$overall["Accuracy"]
paste0("Decision Tree Experiment 2 (Cross Validation): Accuracy = ", acc_dt_tune)
## [1] "Decision Tree Experiment 2 (Cross Validation): Accuracy = 0.898761244654181"
# ROC Curve

roc_curve <- roc(response = bank_test$y, 
                 predictor = as.numeric(pred_dt_tune))
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
tuned_auc <- auc(roc_curve)
plot(roc_curve, main = paste0("ROC curve with AUC = ",round(tuned_auc, 3)))
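
Experiment 1 recommended adjusting class weights more aggressively. Besides priors, rpart also accepts a loss matrix; the sketch below (the penalty of 4 is an illustrative assumption) makes a false negative for “yes” four times as costly as a false positive:

# loss matrix: rows are true classes (no, yes), columns are predictions;
# misclassifying a true "yes" as "no" costs 4, the reverse costs 1
dt_loss <- rpart(y ~ ., data = bank_train, method = "class",
                 parms = list(loss = matrix(c(0, 4, 1, 0), nrow = 2)))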

Comparison of both experiments

Experiment 1 showed the default decision tree and its slight improvement when changing the ratio of training to test data. The decision tree includes the duration and poutcome features. The model is fairly simple, using only two variables and a series of binary splits to make predictions.

Experiment 2 showed that the model can be further improved, reducing false negatives, by adjusting the class weights. Cross-validation did not improve performance. The decision tree now includes month along with the duration and poutcome features. The model is still fairly simple, using only three variables and a series of binary splits to make predictions.

Experiment using Random Forest

Experiment 3. Adjusted class weights with all features

Objective: To experiment with the standard Random Forest technique for comparison with the decision tree models.

Variation: From the original class weights, I adjusted the class weights moderately (no: 0.80, yes: 0.20) and kept the features the same.

Evaluation metric: The accuracy of the Random Forest is 0.9064; the ROC-AUC score is 0.927.

Results/Run: The Random Forest model indicates month, day, duration, and poutcome as the most important features, in that order. Duration, month, and balance have the highest Gini importance, in that order. Accuracy of 0.9067 seems high but needs careful interpretation; the Kappa score of 0.5008 suggests the model performs better than random guessing. The model showed maximum performance with all features, so feature selection was not attempted.

Conclusion/Recommendation: Random Forest performs better than the decision trees in both accuracy and ROC-AUC. The high ROC-AUC demonstrates excellent discrimination between the classes. It may still need refinement in handling class imbalance, so parameter tuning is warranted (see the tuning sketch after the code below).

# fit the random forest model and visualize variable importance
rf_exp1 <- randomForest(y ~ ., 
                         data = bank_train, 
                         ntree = 100, 
                         importance = TRUE,
                         classwt = c("no" = 0.80, "yes" = 0.20))
varImpPlot(rf_exp1, main = "Random Forest Default model with trees = 100")

#predict the test data
rf_exp1_pred1 <- predict(rf_exp1, bank_test)

#generate the confusion matrix
rf_cm1 <- confusionMatrix(rf_exp1_pred1, bank_test$y)
rf_cm1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  11520   802
##        yes   456   784
##                                           
##                Accuracy : 0.9072          
##                  95% CI : (0.9022, 0.9121)
##     No Information Rate : 0.8831          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5039          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9619          
##             Specificity : 0.4943          
##          Pos Pred Value : 0.9349          
##          Neg Pred Value : 0.6323          
##              Prevalence : 0.8831          
##          Detection Rate : 0.8494          
##    Detection Prevalence : 0.9086          
##       Balanced Accuracy : 0.7281          
##                                           
##        'Positive' Class : no              
## 
#ROC curve
rf_prob1 <- predict(rf_exp1, bank_test, type = "prob")[, 2]
rf_roc1 <- roc(bank_test$y, rf_prob1)

tuned_auc <- auc(rf_roc1 )
plot(rf_roc1 , main = paste0("ROC curve with AUC = ",round(tuned_auc, 3)))
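
The conclusion above calls for parameter tuning. One hedged option (the stepFactor and improve values are illustrative assumptions) is randomForest's own mtry search:

# search over mtry, starting from the classification default sqrt(p)
set.seed(123)
tune_res <- tuneRF(x = bank_train |> select(-y), y = bank_train$y,
                   mtryStart = floor(sqrt(ncol(bank_train) - 1)),
                   ntreeTry = 100, stepFactor = 1.5, improve = 0.01,
                   trace = FALSE)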

Experiment 4. Increase in number of trees

Objective: To test whether increasing the number of trees improves the performance of the random forest model.

Variation: From experiment 3's 100 trees, I increased the number of trees to 200; everything else was kept the same.

Evaluation metric: The accuracy of the Random Forest is 0.9082; the ROC-AUC score is 0.929.

Results/Run: The Random Forest output changed to indicate month, day, duration, and housing as the most important features, in that order. Duration, month, and balance remained the features with the highest Gini importance, in that order. Accuracy (0.9082) and ROC-AUC (0.929) both increased relative to experiment 3. The Kappa score (0.5074) remained almost the same.

Conclusion/Recommendation: Random Forest with 200 trees performed better than with 100 trees in both accuracy and ROC-AUC. The high ROC-AUC demonstrates excellent discrimination between the classes. It may still need refinement in handling class imbalance, so parameter tuning is warranted (a hedged check of the tree-count effect appears after the code below).

# fit the random forest model with more trees and visualize variable importance
rf_exp2 <- randomForest(y ~ ., 
                         data = bank_train, 
                         ntree = 200, 
                         importance = TRUE,
                         classwt = c("no" = 0.80, "yes" = 0.20))
varImpPlot(rf_exp2, main = "Random Forest model with increased trees (200)")

#predict the test data
rf_exp1_pred2 <- predict(rf_exp2, bank_test)

#generate the confusion matrix
rf_cm2 <- confusionMatrix(rf_exp1_pred2, bank_test$y)
rf_cm2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  11540   809
##        yes   436   777
##                                          
##                Accuracy : 0.9082         
##                  95% CI : (0.9032, 0.913)
##     No Information Rate : 0.8831         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.505          
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9636         
##             Specificity : 0.4899         
##          Pos Pred Value : 0.9345         
##          Neg Pred Value : 0.6406         
##              Prevalence : 0.8831         
##          Detection Rate : 0.8509         
##    Detection Prevalence : 0.9106         
##       Balanced Accuracy : 0.7268         
##                                          
##        'Positive' Class : no             
## 
#ROC curve
rf_prob2 <- predict(rf_exp2, bank_test, type = "prob")[, 2]
rf_roc2 <- roc(bank_test$y, rf_prob2)

tuned_auc <- auc(rf_roc2 )
plot(rf_roc2 , main = paste0("ROC curve with AUC = ",round(tuned_auc, 3)))
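
A quick hedged check on whether still more trees would help: the out-of-bag error curve from plot.randomForest typically flattens well before 200 trees, which would support the small gain seen here:

# OOB and per-class error rates as a function of the number of trees
plot(rf_exp2, main = "OOB error vs. number of trees")
legend("topright", legend = colnames(rf_exp2$err.rate), lty = 1:3, col = 1:3)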

Comparison of both experiments

Experiment 3 showed the random forest's improved performance over the decision trees, and its further improvement when adjusting the class weights. The Random Forest model indicates month, day, duration, and poutcome as the most important features, in that order. The accuracy of the Random Forest is 0.9064; the ROC-AUC score is 0.927.

Experiment 4 showed that the model can be further improved by increasing the number of trees. The Random Forest output changed to indicate month, day, duration, and housing as the most important features, in that order. Accuracy (0.9082) and ROC-AUC (0.929) both increased relative to experiment 3.

Experiment using Adaboost/XGBoost

As AdaBoost is computationally intensive, I experimented with XGBoost instead.

Experiment 5. Default XGBoost with moderate boosting rounds

Objective: To test whether increasing the number of boosting rounds improves the performance of the XGBoost model.

Variation: From the default setting of 100 boosting rounds, I increased the number to 200; everything else was kept the same.

Evaluation metric: Accuracy, ROC-AUC.

Results/Run: The XGBoost model with the default setting of 100 rounds showed lower accuracy than the Random Forest model: accuracy 0.9069 and ROC-AUC 0.929. Increasing the boosting rounds from 100 to 200 lowered both scores: accuracy 0.9043 and ROC-AUC 0.927.

Conclusion/Recommendation: XGBoost with 200 boosting rounds did not perform better than 100 rounds in either accuracy or ROC-AUC. I need to explore hyperparameter tuning using the XGBoost hyperparameters: adjust scale_pos_weight to handle the imbalanced classes (a hedged sketch appears after the code below), fine-tune max_depth and eta (the learning rate), and use cross-validation to find optimal parameters.

# One-hot encoding variables and convert to matrix
bank_train_xgb <- model.matrix(~ . - 1, data = bank_train |> select(-y))
bank_test_xgb <- model.matrix(~ . - 1, data = bank_test |> select(-y))


# Numeric labels (0/1) for the categorical target variable
train_labels_numeric <- as.numeric(bank_train$y) - 1  
test_labels_numeric <- as.numeric(bank_test$y) - 1

# Prepare DMatrix for XGBoost using numeric labels
dtrain <- xgb.DMatrix(data = bank_train_xgb, label = train_labels_numeric)
dtest <- xgb.DMatrix(data = bank_test_xgb, label = test_labels_numeric)

# Fit XGBoost model
xgb_exp1 <- xgboost(data = dtrain, nrounds = 100, objective = "binary:logistic", verbose = 0)

# Predict test values
xgb_pred1 <- predict(xgb_exp1, dtest)
xgb_pred1_class <- ifelse(xgb_pred1 > 0.5, 1, 0)

# generate confusion matrix
xgb_cm1 <- confusionMatrix(as.factor(xgb_pred1_class), as.factor(test_labels_numeric))
xgb_cm1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 11505   792
##          1   471   794
##                                           
##                Accuracy : 0.9069          
##                  95% CI : (0.9019, 0.9117)
##     No Information Rate : 0.8831          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5057          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9607          
##             Specificity : 0.5006          
##          Pos Pred Value : 0.9356          
##          Neg Pred Value : 0.6277          
##              Prevalence : 0.8831          
##          Detection Rate : 0.8483          
##    Detection Prevalence : 0.9067          
##       Balanced Accuracy : 0.7307          
##                                           
##        'Positive' Class : 0               
## 
# ROC Curve
xgb_roc1 <- roc(test_labels_numeric, xgb_pred1)


tuned_auc <- auc(xgb_roc1  )
plot(xgb_roc1 , main = paste0("ROC curve with AUC = ",round(tuned_auc, 3)))

# perform the same steps with boosting rounds = 200

# Fit XGBoost model
xgb_exp1 <- xgboost(data = dtrain, nrounds = 200, objective = "binary:logistic", verbose = 0)

# Predict test values
xgb_pred1 <- predict(xgb_exp1, dtest)
xgb_pred1_class <- ifelse(xgb_pred1 > 0.5, 1, 0)

# generate confusion matrix
xgb_cm1 <- confusionMatrix(as.factor(xgb_pred1_class), as.factor(test_labels_numeric))
xgb_cm1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 11476   798
##          1   500   788
##                                           
##                Accuracy : 0.9043          
##                  95% CI : (0.8992, 0.9092)
##     No Information Rate : 0.8831          
##     P-Value [Acc > NIR] : 1.418e-15       
##                                           
##                   Kappa : 0.4955          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9582          
##             Specificity : 0.4968          
##          Pos Pred Value : 0.9350          
##          Neg Pred Value : 0.6118          
##              Prevalence : 0.8831          
##          Detection Rate : 0.8462          
##    Detection Prevalence : 0.9050          
##       Balanced Accuracy : 0.7275          
##                                           
##        'Positive' Class : 0               
## 
# ROC Curve
xgb_roc1 <- roc(test_labels_numeric, xgb_pred1)


tuned_auc <- auc(xgb_roc1  )
plot(xgb_roc1 , main = paste0("ROC curve with AUC = ",round(tuned_auc, 3)))
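
The recommendation above mentions scale_pos_weight. A hedged sketch (the ratio-based weight is the commonly suggested starting point, not a tuned value) applies it to the same training matrix:

# weight the positive class by the negative-to-positive ratio in training
spw <- sum(train_labels_numeric == 0) / sum(train_labels_numeric == 1)
xgb_weighted <- xgboost(data = dtrain, nrounds = 100,
                        objective = "binary:logistic",
                        scale_pos_weight = spw, verbose = 0)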

Experiment 6. XGBoost with hyperparameter tuning

Objective: To test whether adjusting max_depth improves the performance of the XGBoost model.

Variation: From the default setting, I varied max_depth across 3, 6, and 9.

Evaluation metric: Accuracy, ROC-AUC.

Results/Run:

XGBoost with max_depth = 3: accuracy 0.9064, ROC-AUC 0.9212.

XGBoost with max_depth = 6: accuracy 0.9089, ROC-AUC 0.9315.

XGBoost with max_depth = 9: accuracy 0.9078, ROC-AUC 0.9315.

Conclusion/Recommendation: The model with a max depth of 6 demonstrates the best overall performance, showing a marginal improvement in both accuracy and ROC-AUC compared to the max_depth = 3 and max_depth = 9 models.

Increasing the max depth from 6 to 9 does not yield further improvements in model performance. This suggests that a max depth of 6 represents the optimal balance between model complexity and generalization.

The recommendation is to conduct additional hyperparameter tuning, focusing on the learning rate (eta), the subsample and colsample parameters, and the regularization terms (alpha, lambda); a hedged sketch of such a search appears after the code below.

# model with parameters with max_depth = 3
best_params <- list(
    objective = "binary:logistic",  
    max_depth = 3,
    eta = 0.1,
    subsample = 0.8,
    colsample_bytree = 0.8
  )


#  XGBoost model 
set.seed(123)
xgb_final <- xgb.train(
    params = best_params,
    data = dtrain,
    nrounds = 100,
    early_stopping_rounds = 10,
    watchlist = list(eval = dtest),
    verbose = FALSE)

# Predictions
xgb_pred_final <- predict(xgb_final, dtest)
xgb_pred_final_class <- ifelse(xgb_pred_final > 0.5, 1, 0)

# Confusion Matrix
xgb_cm_final <- confusionMatrix(as.factor(xgb_pred_final_class), 
                                 as.factor(test_labels_numeric))

#ROC curve
xgb_roc_final <- roc(test_labels_numeric, xgb_pred_final)


# Print results
print(" Confusion Matrix(max_depth = 3)")
## [1] " Confusion Matrix(max_depth = 3)"
print(xgb_cm_final)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 11654   947
##          1   322   639
##                                           
##                Accuracy : 0.9064          
##                  95% CI : (0.9014, 0.9113)
##     No Information Rate : 0.8831          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4535          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9731          
##             Specificity : 0.4029          
##          Pos Pred Value : 0.9248          
##          Neg Pred Value : 0.6649          
##              Prevalence : 0.8831          
##          Detection Rate : 0.8593          
##    Detection Prevalence : 0.9291          
##       Balanced Accuracy : 0.6880          
##                                           
##        'Positive' Class : 0               
## 
print(paste0(" AUC with max_depth = 3 is ", auc(xgb_roc_final), "\n"))
## [1] " AUC with max_depth = 3 is 0.921291221577244\n"
# model with parameters with max_depth = 9
best_params <- list(
    objective = "binary:logistic",  
    max_depth = 9,
    eta = 0.1,
    subsample = 0.8,
    colsample_bytree = 0.8
  )


#  XGBoost model 
set.seed(123)
xgb_final <- xgb.train(
    params = best_params,
    data = dtrain,
    nrounds = 100,
    early_stopping_rounds = 10,
    watchlist = list(eval = dtest),
    verbose = FALSE)

# Predictions
xgb_pred_final <- predict(xgb_final, dtest)
xgb_pred_final_class <- ifelse(xgb_pred_final > 0.5, 1, 0)

# Confusion Matrix
xgb_cm_final <- confusionMatrix(as.factor(xgb_pred_final_class), 
                                 as.factor(test_labels_numeric))

#ROC curve
xgb_roc_final <- roc(test_labels_numeric, xgb_pred_final)


# Print results
print(" Confusion Matrix(max_depth = 9)")
## [1] " Confusion Matrix(max_depth = 9)"
print(xgb_cm_final)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 11520   795
##          1   456   791
##                                           
##                Accuracy : 0.9078          
##                  95% CI : (0.9028, 0.9126)
##     No Information Rate : 0.8831          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5077          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9619          
##             Specificity : 0.4987          
##          Pos Pred Value : 0.9354          
##          Neg Pred Value : 0.6343          
##              Prevalence : 0.8831          
##          Detection Rate : 0.8494          
##    Detection Prevalence : 0.9081          
##       Balanced Accuracy : 0.7303          
##                                           
##        'Positive' Class : 0               
## 
print(paste0(" AUC with max_depth = 9 is ", auc(xgb_roc_final), "\n"))
## [1] " AUC with max_depth = 9 is 0.931505139324466\n"
# model with parameters with max_depth = 6
best_params <- list(
    objective = "binary:logistic",  
    max_depth = 6,
    eta = 0.1,
    subsample = 0.8,
    colsample_bytree = 0.8
  )


#  XGBoost model 
set.seed(123)
xgb_final <- xgb.train(
    params = best_params,
    data = dtrain,
    nrounds = 100,
    early_stopping_rounds = 10,
    watchlist = list(eval = dtest),
    verbose = FALSE)

# Predictions
xgb_pred_final <- predict(xgb_final, dtest)
xgb_pred_final_class <- ifelse(xgb_pred_final > 0.5, 1, 0)

# Confusion Matrix
xgb_cm_final <- confusionMatrix(as.factor(xgb_pred_final_class), 
                                 as.factor(test_labels_numeric))

#ROC curve
xgb_roc_final <- roc(test_labels_numeric, xgb_pred_final)


# Print results
print(" Confusion Matrix(max_depth = 6):")
## [1] " Confusion Matrix(max_depth = 6):"
print(xgb_cm_final)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 11554   814
##          1   422   772
##                                           
##                Accuracy : 0.9089          
##                  95% CI : (0.9039, 0.9137)
##     No Information Rate : 0.8831          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5057          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9648          
##             Specificity : 0.4868          
##          Pos Pred Value : 0.9342          
##          Neg Pred Value : 0.6466          
##              Prevalence : 0.8831          
##          Detection Rate : 0.8519          
##    Detection Prevalence : 0.9120          
##       Balanced Accuracy : 0.7258          
##                                           
##        'Positive' Class : 0               
## 
print(paste0(" AUC with max_depth = 6 is ", auc(xgb_roc_final), "\n"))
## [1] " AUC with max_depth = 6 is 0.931487238874554\n"

Comparison of both experiments

Experiment 5 showed default XGBoost with moderate boosting rounds and its lack of improvement when the boosting was increased: raising the boosting rounds from 100 to 200 lowered performance to an accuracy of 0.9043 and a ROC-AUC score of 0.927.

Experiment 6 showed that tuning the max_depth hyperparameter improves the performance of the XGBoost model. The model with a max depth of 6 demonstrates the best overall performance, showing a marginal improvement in both accuracy and ROC-AUC compared to the max_depth = 3 and max_depth = 9 models. Increasing the max depth from 6 to 9 yields no further improvement, suggesting that a max depth of 6 represents the optimal balance between model complexity and generalization. Accuracy: 0.9089; ROC-AUC: 0.9315.

Comparison between Decision Tree, Random Forest & XGBoost

Introduction

The objective of this study was to evaluate different machine learning models (Decision Trees, Random Forest, and XGBoost) on a classification problem. The experiments were designed to analyze the impact of the training-to-test ratio and class weight adjustments on the Decision Tree model; of class weight adjustments and the number of trees on the Random Forest model; and of hyperparameter tuning and the number of boosting rounds on the XGBoost model.

Bias-Variance Considerations

Decision Trees generally have low bias but high variance, leading to overfitting on training data. Random Forest, as an ensemble method, reduces variance by averaging multiple Decision Trees, improving generalization. XGBoost, a boosting method, builds models sequentially to correct errors, optimizing both bias and variance for better predictive performance.

Experiment Summaries and Results

Decision Tree Experiments

The Decision Tree experiments explored how data partitioning and class weight adjustments impact performance. Experiment 1 tested the effect of changing the training-to-test ratio from 80/20 to 70/30, resulting in a marginal accuracy increase from 0.900 to 0.901. However, the model struggled with false negatives for the minority class. Experiment 2 adjusted class weights, slightly reducing accuracy (0.8912) but improving the minority class prediction (AUC-ROC = 0.746). Cross-validation did not improve results, indicating that weighted class adjustments were more effective.

Random Forest Experiments

Random Forest significantly improved model performance over Decision Trees. Experiment 3 applied class weight adjustments while keeping all features, achieving 0.9064 accuracy and 0.927 AUC-ROC. The most predictive features were ‘month,’ ‘day,’ and ‘duration.’ Experiment 4 increased the number of trees from 100 to 200, slightly boosting accuracy to 0.9082 and AUC-ROC to 0.929. The results confirmed that increasing tree count marginally enhances predictive power while maintaining model stability.

XGBoost Experiments

XGBoost provided the best balance of accuracy and generalization. Experiment 5 tested an increase in boosting rounds from 100 to 200, which unexpectedly led to a slight decline in performance (accuracy: 0.9043, AUC-ROC: 0.927). Experiment 6 focused on hyperparameter tuning, varying max_depth from 3 to 9. The best results came from max_depth = 6, yielding an accuracy of 0.9089 and an AUC-ROC of 0.9315, indicating an optimal balance between model complexity and generalization.

Comparison of Model Performance

library(knitr)

# Define the data
results_table <- data.frame(
  Model = c("Decision Tree", "Decision Tree", "Random Forest", "Random Forest", "XGBoost", "XGBoost"),
  Experiment = c("Exp 1", "Exp 2", "Exp 3", "Exp 4", "Exp 5", "Exp 6"),
  Key_Variation = c("70/30 data split", "Adjusted class weights", "Adjusted class weights", "Increased trees to 200", "Increased boosting rounds", "Tuned max_depth = 6"),
  Accuracy = c(0.901, 0.8912, 0.9064, 0.9082, 0.9043, 0.9089),
  AUC_ROC = c(0.69, 0.746, 0.927, 0.929, 0.927, 0.9315)
)

# Display the table using kable
kable(results_table, format = "markdown", align = "l")
|Model         |Experiment |Key_Variation             |Accuracy |AUC_ROC |
|:-------------|:----------|:-------------------------|:--------|:-------|
|Decision Tree |Exp 1      |70/30 data split          |0.9010   |0.6900  |
|Decision Tree |Exp 2      |Adjusted class weights    |0.8912   |0.7460  |
|Random Forest |Exp 3      |Adjusted class weights    |0.9064   |0.9270  |
|Random Forest |Exp 4      |Increased trees to 200    |0.9082   |0.9290  |
|XGBoost       |Exp 5      |Increased boosting rounds |0.9043   |0.9270  |
|XGBoost       |Exp 6      |Tuned max_depth = 6       |0.9089   |0.9315  |

Best Performing Model

The XGBoost model with max_depth = 6 demonstrated the highest accuracy (0.9089) and AUC-ROC (0.9315), making it the optimal model. It provided better class discrimination compared to Decision Trees and Random Forest while maintaining generalization.

Conclusion and Recommendations

Decision Trees offer interpretability but struggle with variance and class imbalance. While simple to implement, they require careful tuning of class weights to improve minority class predictions.

Random Forest reduces variance and performs better than Decision Trees, offering improved generalization. However, class imbalance remains a challenge, and tuning hyperparameters such as tree count and feature selection can further optimize results.

XGBoost provides the best trade-off between bias and variance, with hyperparameter tuning yielding the highest accuracy and AUC-ROC. It is well-suited for structured data and imbalanced classification problems.

Recommendation for Data Science: XGBoost with max_depth = 6 should be the preferred model, with further tuning of learning rate, regularization, and feature selection to enhance performance.

Recommendation for Business Problem: Given the superior class discrimination and predictive power of XGBoost, it should be implemented to predict term deposit subscriptions and to improve decision-making accuracy. Business stakeholders should focus on the important features: longer contact duration, whether the previous campaign outcome was a success, and contacting clients during months such as May and August (see the importance sketch below).
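
To surface those drivers directly from the final model, a hedged sketch using xgboost's gain-based feature importance (xgb_final here is the max_depth = 6 model fitted above):

# rank features by gain and plot the top contributors
imp <- xgb.importance(model = xgb_final)
head(imp, 10)
xgb.plot.importance(imp, top_n = 10)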