The assignment is to perform EDA and further experimentation on the UCI “Bank Marketing” dataset (detailed description at: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing).
The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required in order to assess whether the product (bank term deposit) would be subscribed (‘yes’) or not (‘no’).
The classification goal is to predict whether the client will subscribe to a term deposit (variable y). We will use the bank-full.csv file.
Input predictors:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”, “blue-collar”,“self-employed”,“retired”,“technician”,“services”)
3 - marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed)
4 - education (categorical: “unknown”,“secondary”,“primary”,“tertiary”)
5 - default: has credit in default? (binary: “yes”,“no”)
6 - balance: average yearly balance, in euros (numeric)
7 - housing: has housing loan? (binary: “yes”,“no”)
8 - loan: has personal loan? (binary: “yes”,“no”)
# related to the last contact of the current campaign:
9 - contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”)
10 - day: last contact day of the month (numeric)
11 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
12 - duration: last contact duration, in seconds (numeric)
# other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15 - previous: number of contacts performed before this campaign and for this client (numeric)
16 - poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)
Output variable (desired target):
17 - y - has the client subscribed a term deposit? (binary: “yes”,“no”)
Missing values: none
# Import needed libraries
library(ggplot2)
library(readr) # to use the read_csv function
library(dplyr) # to use filter, mutate, arrange, etc.
library(tidyr) # to use the pivot_longer function
library(e1071) # for the skewness function
library(corrplot)
library(ROSE) # for ROSE over-sampling
library(smotefamily) # for SMOTE
library(caret) # for varImp, trainControl, createDataPartition
library(rpart) # for decision trees: rpart()
library(rpart.plot) # for decision tree plots: rpart.plot()
# ROC curve
library(pROC)
# Precision-recall curve
library(PRROC)
# Random forest
library(randomForest)
# XGBoost
library(xgboost)
The data analysis shows there are 45,211 observations and 17 variables. Some of the variables are not read in with the correct type, so we need to convert them to the appropriate data types.
bank_raw <- read.csv("https://raw.githubusercontent.com/datanerddhanya/DATA622/refs/heads/main/bank-full.csv")
head(bank_raw)
## age job marital education default balance housing loan contact day
## 1 58 management married tertiary no 2143 yes no unknown 5
## 2 44 technician single secondary no 29 yes no unknown 5
## 3 33 entrepreneur married secondary no 2 yes yes unknown 5
## 4 47 blue-collar married unknown no 1506 yes no unknown 5
## 5 33 unknown single unknown no 1 no no unknown 5
## 6 35 management married tertiary no 231 yes no unknown 5
## month duration campaign pdays previous poutcome y
## 1 may 261 1 -1 0 unknown no
## 2 may 151 1 -1 0 unknown no
## 3 may 76 1 -1 0 unknown no
## 4 may 92 1 -1 0 unknown no
## 5 may 198 1 -1 0 unknown no
## 6 may 139 1 -1 0 unknown no
bank_transform <- bank_raw
bank_transform$job <- as.factor(bank_raw$job)
bank_transform$marital <- as.factor(bank_raw$marital)
bank_transform$education <- as.factor(bank_raw$education)
bank_transform$default <- as.factor(bank_raw$default)
bank_transform$balance <- as.integer(bank_raw$balance)
bank_transform$housing <- as.factor(bank_raw$housing)
bank_transform$loan <- as.factor(bank_raw$loan)
bank_transform$contact <- as.factor(bank_raw$contact)
bank_transform$month <- as.factor(bank_raw$month)
bank_transform$pdays <- as.integer(bank_raw$pdays)
bank_transform$poutcome <- as.factor(bank_raw$poutcome)
bank_transform$y <- as.factor(bank_raw$y)
str(bank_transform)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
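Since every non-numeric column in the raw file is read in as character, the same conversion can be written more compactly. The following is a sketch of an equivalent one-step form using dplyr's across (bank_transform_alt is an illustrative name, not part of the pipeline above):
# Equivalent, more concise conversion (sketch)
bank_transform_alt <- bank_raw |>
  mutate(across(where(is.character), as.factor))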
For some analyses it is useful to have the target variable in numeric format as well; that conversion is kept below, commented out.
#bank_transform$y_numeric <- ifelse(bank_transform$y == "yes", 1, 0)
#bank_transform$y_numeric <- as.integer(bank_transform$y_numeric)
There are no missing values and no duplicate rows. However, several categorical variables contain “unknown” values, which could be recoded to NA and treated as missing.
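The counting step that follows only tallies NA values; the “unknown” entries remain ordinary factor levels. If one did want the imputation steps below to catch them, a recoding along these lines would work (a sketch only; bank_recoded is an illustrative name and this step is not run as part of this pipeline):
# Sketch: recode "unknown" factor levels to NA so they count as missing
bank_recoded <- bank_transform |>
  mutate(across(where(is.factor), ~ factor(na_if(as.character(.), "unknown"))))
colSums(is.na(bank_recoded))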
# Count missing values per column
colSums(is.na(bank_transform ))
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
# Replace missing numerical values with median
bank_final <- bank_transform |>
mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
# Replace missing categorical values with the most common value (mode)
for (col in names(bank_final)) {
if (is.factor(bank_final[[col]])) {
mode_val <- names(sort(table(bank_final[[col]]), decreasing = TRUE))[1]
bank_final[[col]][is.na(bank_final[[col]])] <- mode_val
}
}
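The same mode imputation can be expressed with a small helper and across; a sketch equivalent to the loop above (impute_mode is an illustrative helper name):
# Sketch: mode imputation via dplyr, equivalent to the loop above
impute_mode <- function(x) {
  mode_val <- names(which.max(table(x)))   # most common level
  replace(x, is.na(x), mode_val)
}
bank_final <- bank_final |>
  mutate(across(where(is.factor), impute_mode))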
# Create Age Groups
bank_final_add <- bank_final%>%
mutate(age_group = case_when(
age <= 30 ~ "18-30",
age > 30 & age <= 40 ~ "31-40",
age > 40 & age <= 50 ~ "41-50",
age > 50 & age <= 60 ~ "51-60",
age > 60 ~ "60+"
))
# Categorize Balance Levels (include.lowest keeps the minimum balance from becoming NA)
bank_final_add$balance_group <- cut(bank_final_add$balance,
breaks = quantile(bank_final_add$balance, probs = seq(0, 1, 0.2)),
labels = c("Very Low", "Low", "Medium", "High", "Very High"),
include.lowest = TRUE)
# Categorize Contact Duration
bank_final_add <- bank_final_add %>%
mutate(duration_category = case_when(
duration < 100 ~ "Short",
duration >= 100 & duration <= 300 ~ "Medium",
duration > 300 ~ "Long"
))
# Convert categorical variables to factors
bank_final_add <- bank_final_add %>% mutate(across(where(is.character), as.factor))
# Check imbalance
table(bank_final_add$y)
##
## no yes
## 39922 5289
# Oversampling using ROSE (Random Over-Sampling Examples)
bank_final_balanced <- ROSE(y ~ ., data = bank_final, seed = 123)$data
table(bank_final_balanced$y)
##
## no yes
## 22885 22326
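smotefamily is loaded above but not otherwise used. As an alternative to ROSE, SMOTE could generate synthetic minority examples; a sketch under the assumption that categorical predictors are one-hot encoded first, since smotefamily::SMOTE requires numeric inputs:
# Sketch: SMOTE as an alternative balancing step (numeric predictors required)
num_X <- as.data.frame(model.matrix(~ . - 1, data = bank_final |> select(-y)))
smote_out <- SMOTE(X = num_X, target = bank_final$y, K = 5)
table(smote_out$data$class) # balanced class counts; label column is "class"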
# 80% train data, 20% test data
set.seed(1234)
sample_set <- createDataPartition(bank_final_balanced$y, p = 0.8, list = FALSE)
# Note: these indices come from the ROSE-balanced data but are applied to the
# unbalanced bank_final_add, so the models below still train on the original
# class distribution
bank_train <- bank_final_add[sample_set, ]
bank_test <- bank_final_add[-sample_set, ]
Objective: The performance of the decision tree can improve by adjusting the ratio of training to test data.
Variation: From the original preparation of partitioning 80% of the data as training and the remaining 20% as test, I changed the partitioning to 70% training and 30% test.
Evaluation metrics: Accuracy, ROC-AUC.
Results/Run: The decision tree model fit on the training set shows that the duration feature is the most predictive of the final outcome, followed by the poutcome feature. This model is fairly simple, using only two variables and a series of binary splits to make predictions. The percentages at the leaf nodes provide a measure of confidence in the prediction. The default (80/20) model’s accuracy is 0.898; after changing the ratio of training to test data, accuracy increased slightly to 0.901. ROC-AUC = 0.69.
Conclusion/Recommendation: The performance of the decision tree improved marginally by adjusting the ratio of training to test data. While the accuracy is high, the model significantly underperforms on the minority class (“yes”), indicating a need for further refinement: there are many false negatives for the “yes” class (1,004 misclassified), and the low specificity suggests the model struggles to identify positive cases. Although ROSE was used to address class imbalance, it did not resolve the issue. The recommendation is to adjust class weights more aggressively.
#fit the decision tree model
dt_exp1 <- rpart(y ~ ., data = bank_train, method = "class")
#Visualize the model
rpart.plot(dt_exp1, main="Default Decision Tree Model")
# predict for the test data
pred_dt_exp1 <- predict(dt_exp1, bank_test, type = "class")
# Generate the confusion matrix
cm_dt_exp1 <- confusionMatrix(pred_dt_exp1 , bank_test$y)
# display the accuracy
acc_dt_exp1 <- cm_dt_exp1$overall["Accuracy"]
paste0("Decision Tree Experiment 1: Accuracy =", acc_dt_exp1,"/n" )
## [1] "Decision Tree Experiment 1: Accuracy =0.897920813979208/n"
# Variation: change the ratio of training to test data
# 70% train data, 30% test data
sample_set <- createDataPartition(bank_final_add$y, p = 0.7, list = FALSE)
bank_train <- bank_final_add[sample_set, ]
bank_test <- bank_final_add[-sample_set, ]
#refit the model
dt_exp1 <- rpart(y ~ ., data = bank_train, method = "class")
# predict for the test data
pred_dt_exp1 <- predict(dt_exp1, bank_test, type = "class")
# Generate the confusion matrix
cm_dt_exp1 <- confusionMatrix(pred_dt_exp1 , bank_test$y)
cm_dt_exp1
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 11637 1004
## yes 339 582
##
## Accuracy : 0.901
## 95% CI : (0.8958, 0.906)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 1.678e-11
##
## Kappa : 0.4139
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9717
## Specificity : 0.3670
## Pos Pred Value : 0.9206
## Neg Pred Value : 0.6319
## Prevalence : 0.8831
## Detection Rate : 0.8581
## Detection Prevalence : 0.9321
## Balanced Accuracy : 0.6693
##
## 'Positive' Class : no
##
# display the ROC-AUC
roc_curve <- roc(response = bank_test$y,
predictor = as.numeric(pred_dt_exp1))
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
tuned_auc <- auc(roc_curve)
plot(roc_curve, main = paste0("ROC curve with AUC = ",round(tuned_auc, 3)))
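The ROC above is computed from hard class labels, which yields a single operating point rather than a full curve. A sketch using the predicted class probabilities instead, which typically gives a smoother curve and a more faithful AUC:
# Sketch: ROC from predicted probabilities rather than hard labels
pred_prob_exp1 <- predict(dt_exp1, bank_test, type = "prob")[, "yes"]
roc_prob <- roc(response = bank_test$y, predictor = pred_prob_exp1)
auc(roc_prob)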
Objective: Test whether the performance of the decision tree can improve by adjusting the class weights or by cross-validation.
Variation: From the original class weights, I adjusted the class weights moderately.
Evaluation metrics: Accuracy, ROC-AUC. The baseline tree’s accuracy was 0.901; after adjusting the class weights, accuracy dropped slightly to 0.8912, but with improved minority-class prediction. The ROC-AUC score of 0.746 indicates acceptable performance.
Results/Run: The decision tree model now includes month along with the duration and poutcome features. This model is still fairly simple, using only three variables and a series of binary splits to make predictions. The percentages at the leaf nodes provide a measure of confidence in the prediction. After training the decision tree with cross-validation, the accuracy of the model did not improve.
Conclusion/Recommendation: The model reached a stable and promising state with adjusted class weights: good accuracy (0.8912) and improved minority-class prediction (fewer false negatives for the “yes” class). Cross-validation did not improve the accuracy of the model. The cross-validation results suggest that the best complexity parameter (cp) is around 0.015; beyond this point, increasing complexity decreases the ROC score. Hence the dt_weighted model is a better model than the cross-validated dt_tune model.
# Adjust class weights moderately by setting class priors
# (upweights "yes" relative to its ~12% observed prevalence)
dt_weighted <- rpart(y ~ .,
data = bank_train,
method = "class",
parms = list(prior = c(0.80, 0.20)))
#Visualize the model
rpart.plot(dt_weighted, main="Class weighted Decision Tree Model")
# predict for the test data
pred_dt_weighted <- predict(dt_weighted, bank_test, type = "class")
# Generate the confusion matrix
cm_dt_weighted <- confusionMatrix(pred_dt_weighted , bank_test$y)
cm_dt_weighted
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 11203 703
## yes 773 883
##
## Accuracy : 0.8912
## 95% CI : (0.8858, 0.8964)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 0.001575
##
## Kappa : 0.483
##
## Mcnemar's Test P-Value : 0.072495
##
## Sensitivity : 0.9355
## Specificity : 0.5567
## Pos Pred Value : 0.9410
## Neg Pred Value : 0.5332
## Prevalence : 0.8831
## Detection Rate : 0.8261
## Detection Prevalence : 0.8779
## Balanced Accuracy : 0.7461
##
## 'Positive' Class : no
##
# display the accuracy
acc_dt_weighted <- cm_dt_weighted$overall["Accuracy"]
paste0("Decision Tree Experiment 2 (Class weighted): Accuracy =", acc_dt_weighted,"/n" )
## [1] "Decision Tree Experiment 2 (Class weighted): Accuracy =0.891166494617313/n"
# ROC Curve
roc_curve <- roc(response = bank_test$y,
predictor = as.numeric(pred_dt_weighted))
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
tuned_auc <- auc(roc_curve)
plot(roc_curve, main = paste0("ROC curve with AUC = ",round(tuned_auc, 3)))
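Adjusting the priors is one lever; rpart also accepts an explicit loss matrix, which penalizes specific misclassifications directly. A sketch that charges four times as much for missing a “yes” (the factor of 4 is illustrative, not tuned):
# Sketch: penalize false negatives on "yes" via a loss matrix
# rows = true class (no, yes), columns = predicted class (no, yes)
dt_loss <- rpart(y ~ ., data = bank_train, method = "class",
                 parms = list(loss = matrix(c(0, 4, 1, 0), nrow = 2)))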
# Evaluate feature importance
# varImp(dt_weighted)
# Feature selection (threshold would need to be defined before running)
# importantFeatures <- varImp(dt_weighted)
# selected_features <- rownames(importantFeatures)[importantFeatures$Overall > threshold]
bank_test <- na.omit(bank_test)
bank_train <- na.omit(bank_train)
# Cross-Validation
control <- trainControl(method = "cv", number = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary)
dt_tune <- train(y ~ .,
data = bank_train,
method = "rpart",
trControl = control)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
# Inspect the tuned model
plot(dt_tune)
# predict for the test data
pred_dt_tune <- predict(dt_tune, bank_test)
# Generate the confusion matrix
cm_dt_tune <- confusionMatrix(pred_dt_tune , bank_test$y)
cm_dt_tune
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 11644 1041
## yes 332 545
##
## Accuracy : 0.8988
## 95% CI : (0.8936, 0.9038)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 3.484e-09
##
## Kappa : 0.3919
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9723
## Specificity : 0.3436
## Pos Pred Value : 0.9179
## Neg Pred Value : 0.6214
## Prevalence : 0.8831
## Detection Rate : 0.8586
## Detection Prevalence : 0.9353
## Balanced Accuracy : 0.6580
##
## 'Positive' Class : no
##
# display the accuracy
acc_dt_tune <- cm_dt_tune$overall["Accuracy"]
paste0("Decision Tree Experiment 2 (Cross Validation): Accuracy =", acc_dt_tune,"/n" )
## [1] "Decision Tree Experiment 2 (Cross Validation): Accuracy =0.898761244654181/n"
# ROC Curve
roc_curve <- roc(response = bank_test$y,
predictor = as.numeric(pred_dt_tune))
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
tuned_auc <- auc(roc_curve)
plot(roc_curve, main = paste0("ROC curve with AUC = ",round(tuned_auc, 3)))
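The default caret search only tries a handful of cp values. Since the narrative above places the best cp near 0.015, a finer grid around that region could be searched explicitly; a sketch (metric = "ROC" also avoids the warning printed earlier):
# Sketch: explicit cp grid search around the region found above
cp_grid <- expand.grid(cp = seq(0.001, 0.03, by = 0.002))
dt_tune_grid <- train(y ~ ., data = bank_train, method = "rpart",
                      trControl = control, tuneGrid = cp_grid, metric = "ROC")
dt_tune_grid$bestTune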
Experiment 1 showed the default decision tree and its improvement when changing the ratio of training to test data. The decision tree model includes the duration and poutcome features. This model is fairly simple, using only two variables and a series of binary splits to make predictions.
Experiment 2 showed that the model can be further improved to reduce false negatives by adjusting the class weights; cross-validation did not improve performance. The decision tree model now includes month along with the duration and poutcome features. This model is still fairly simple, using only three variables and a series of binary splits to make predictions.
Objective: Experiment with a standard Random Forest for comparison with the decision tree.
Variation: From the original class weights, I adjusted the class weights moderately (no: 0.80, yes: 0.20) and kept the features the same.
Evaluation metrics: The accuracy of the Random Forest is 0.9072; the AUC-ROC score is 0.927.
Results/Run: The Random Forest model indicates month, day, duration, and poutcome as the most important features, in that sequence; duration, month, and balance have the highest Gini importance, in that sequence. The accuracy of 0.9072 seems high but needs careful interpretation, while the Kappa score of 0.5039 suggests the model performs better than random guessing. The model showed maximum performance with all features, so I did not perform feature selection.
Conclusion/Recommendation: Random Forest performs better than the decision trees in both accuracy and AUC-ROC. The high AUC-ROC demonstrates excellent discrimination between the classes. It may still need refinement in handling class imbalance, hence the need to tune parameters.
# Fit the random forest model and visualize
rf_exp1 <- randomForest(y ~ .,
data = bank_train,
ntree = 100,
importance = TRUE,
classwt = c("no" = 0.80, "yes" = 0.20))
varImpPlot(rf_exp1, main = "Random Forest Default model with trees = 100")
#predict the test data
rf_exp1_pred1 <- predict(rf_exp1, bank_test)
#generate the confusion matrix
rf_cm1 <- confusionMatrix(rf_exp1_pred1, bank_test$y)
rf_cm1
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 11520 802
## yes 456 784
##
## Accuracy : 0.9072
## 95% CI : (0.9022, 0.9121)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5039
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9619
## Specificity : 0.4943
## Pos Pred Value : 0.9349
## Neg Pred Value : 0.6323
## Prevalence : 0.8831
## Detection Rate : 0.8494
## Detection Prevalence : 0.9086
## Balanced Accuracy : 0.7281
##
## 'Positive' Class : no
##
#ROC curve
rf_prob1 <- predict(rf_exp1, bank_test, type = "prob")[, 2]
rf_roc1 <- roc(bank_test$y, rf_prob1)
tuned_auc <- auc(rf_roc1 )
plot(rf_roc1 , main = paste0("ROC curve with AUC = ",round(tuned_auc, 3)))
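PRROC is loaded above but not otherwise used. With this level of class imbalance, a precision-recall curve is often more informative than ROC; a sketch using the probabilities just computed:
# Sketch: precision-recall curve for the random forest probabilities
pr <- pr.curve(scores.class0 = rf_prob1[bank_test$y == "yes"],
               scores.class1 = rf_prob1[bank_test$y == "no"],
               curve = TRUE)
plot(pr)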
Objective: Test whether increasing the number of trees improves the performance of the random forest model.
Variation: From Experiment 3’s 100 trees, I increased the count to 200; the rest was kept the same.
Evaluation metrics: The accuracy of the Random Forest is 0.9082; the AUC-ROC score is 0.929.
Results/Run: The Random Forest output changed to indicate month, day, duration, and housing as the most important features, in that sequence; duration, month, and balance remained highest in Gini importance, in that sequence. Accuracy (0.9082) and AUC-ROC (0.929) both increased from Experiment 3, while the Kappa score (0.505) remained almost the same.
Conclusion/Recommendation: Random Forest with 200 trees performed better than with 100 trees in both accuracy and AUC-ROC. The high AUC-ROC demonstrates excellent discrimination between the classes. It may still need refinement in handling class imbalance, hence the need to tune parameters.
# Fit the random forest model with 200 trees and visualize
rf_exp2 <- randomForest(y ~ .,
data = bank_train,
ntree = 200,
importance = TRUE,
classwt = c("no" = 0.80, "yes" = 0.20))
varImpPlot(rf_exp2, main = "Random Forest model with increased trees (200)")
# Predict the test data
rf_exp2_pred <- predict(rf_exp2, bank_test)
# Generate the confusion matrix
rf_cm2 <- confusionMatrix(rf_exp2_pred, bank_test$y)
rf_cm2
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 11540 809
## yes 436 777
##
## Accuracy : 0.9082
## 95% CI : (0.9032, 0.913)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.505
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9636
## Specificity : 0.4899
## Pos Pred Value : 0.9345
## Neg Pred Value : 0.6406
## Prevalence : 0.8831
## Detection Rate : 0.8509
## Detection Prevalence : 0.9106
## Balanced Accuracy : 0.7268
##
## 'Positive' Class : no
##
#ROC curve
rf_prob2 <- predict(rf_exp2, bank_test, type = "prob")[, 2]
rf_roc2 <- roc(bank_test$y, rf_prob2)
tuned_auc <- auc(rf_roc2 )
plot(rf_roc2 , main = paste0("ROC curve with AUC = ",round(tuned_auc, 3)))
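Beyond the tree count, mtry (the number of features tried at each split) is the other main randomForest knob; tuneRF searches it against out-of-bag error. A sketch, with illustrative parameter values:
# Sketch: tune mtry by OOB error
set.seed(123)
mtry_search <- tuneRF(x = bank_train |> select(-y), y = bank_train$y,
                      ntreeTry = 100, stepFactor = 1.5, improve = 0.01,
                      trace = FALSE)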
Experiment 3 showed the random forest’s improved performance over decision trees, with class weights adjusted. The Random Forest model indicates month, day, duration, and poutcome as the most important features, in that sequence. The accuracy of the Random Forest is 0.9072; the AUC-ROC score is 0.927.
Experiment 4 showed that the model can be further improved by increasing the number of trees. The output changed to indicate month, day, duration, and housing as the most important features, in that sequence. Accuracy (0.9082) and AUC-ROC (0.929) both increased from Experiment 3.
As AdaBoost is computationally intensive, I experimented with XGBoost instead.
Objective: Test whether increasing the number of boosting rounds improves the performance of the XGBoost model.
Variation: From the default of 100 boosting rounds, I increased the count to 200; the rest was kept the same.
Evaluation metrics: Accuracy and AUC-ROC score.
Results/Run: The XGBoost model with the default 100 rounds showed slightly less accuracy than the Random Forest model (accuracy: 0.9069, AUC-ROC: 0.929). Increasing the boosting rounds from 100 to 200 lowered both scores (accuracy: 0.9043, AUC-ROC: 0.927).
Conclusion/Recommendation: XGBoost with 200 boosting rounds did not perform better than 100 rounds in either accuracy or AUC-ROC. I need to explore hyperparameter tuning:
* Adjust scale_pos_weight to handle imbalanced classes
* Fine-tune max_depth and eta (learning rate)
* Use cross-validation to find optimal parameters
# One-hot encoding variables and convert to matrix
bank_train_xgb <- model.matrix(~ . - 1, data = bank_train |> select(-y))
bank_test_xgb <- model.matrix(~ . - 1, data = bank_test |> select(-y))
# Numeric labels (0/1) for the categorical target variable
train_labels_numeric <- as.numeric(bank_train$y) - 1
test_labels_numeric <- as.numeric(bank_test$y) - 1
# Prepare DMatrix for XGBoost using numeric labels
dtrain <- xgb.DMatrix(data = bank_train_xgb, label = train_labels_numeric)
dtest <- xgb.DMatrix(data = bank_test_xgb, label = test_labels_numeric)
# Fit XGBoost model
xgb_exp1 <- xgboost(data = dtrain, nrounds = 100, objective = "binary:logistic", verbose = 0)
# Predict test values
xgb_pred1 <- predict(xgb_exp1, dtest)
xgb_pred1_class <- ifelse(xgb_pred1 > 0.5, 1, 0)
# generate confusion matrix
xgb_cm1 <- confusionMatrix(as.factor(xgb_pred1_class), as.factor(test_labels_numeric))
xgb_cm1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 11505 792
## 1 471 794
##
## Accuracy : 0.9069
## 95% CI : (0.9019, 0.9117)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5057
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9607
## Specificity : 0.5006
## Pos Pred Value : 0.9356
## Neg Pred Value : 0.6277
## Prevalence : 0.8831
## Detection Rate : 0.8483
## Detection Prevalence : 0.9067
## Balanced Accuracy : 0.7307
##
## 'Positive' Class : 0
##
# ROC Curve
xgb_roc1 <- roc(test_labels_numeric, xgb_pred1)
tuned_auc <- auc(xgb_roc1 )
plot(xgb_roc1 , main = paste0("ROC curve with AUC = ",round(tuned_auc, 3)))
# Repeat with boosting rounds increased to 200
# Fit XGBoost model (renamed xgb_exp2 to avoid overwriting the first model)
xgb_exp2 <- xgboost(data = dtrain, nrounds = 200, objective = "binary:logistic", verbose = 0)
# Predict test values
xgb_pred2 <- predict(xgb_exp2, dtest)
xgb_pred2_class <- ifelse(xgb_pred2 > 0.5, 1, 0)
# Generate confusion matrix
xgb_cm2 <- confusionMatrix(as.factor(xgb_pred2_class), as.factor(test_labels_numeric))
xgb_cm2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 11476 798
## 1 500 788
##
## Accuracy : 0.9043
## 95% CI : (0.8992, 0.9092)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 1.418e-15
##
## Kappa : 0.4955
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9582
## Specificity : 0.4968
## Pos Pred Value : 0.9350
## Neg Pred Value : 0.6118
## Prevalence : 0.8831
## Detection Rate : 0.8462
## Detection Prevalence : 0.9050
## Balanced Accuracy : 0.7275
##
## 'Positive' Class : 0
##
# ROC Curve
xgb_roc2 <- roc(test_labels_numeric, xgb_pred2)
tuned_auc <- auc(xgb_roc2)
plot(xgb_roc2, main = paste0("ROC curve with AUC = ", round(tuned_auc, 3)))
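Before moving on to depth, the first recommendation above (scale_pos_weight) can be sketched directly: set it to the ratio of negative to positive training labels so the minority class carries more weight. A sketch, not a tuned model:
# Sketch: class-imbalance handling via scale_pos_weight
spw <- sum(train_labels_numeric == 0) / sum(train_labels_numeric == 1)
xgb_weighted <- xgboost(data = dtrain, nrounds = 100,
                        objective = "binary:logistic",
                        scale_pos_weight = spw, verbose = 0)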
Objective: Test whether adjusting max_depth improves the performance of the XGBoost model.
Variation: From the default setting, I varied max_depth across 3, 6, and 9.
Evaluation metrics: Accuracy and AUC-ROC score.
Results/Run: XGBoost with max_depth = 3: accuracy 0.9064, AUC-ROC 0.9212.
XGBoost with max_depth = 6: accuracy 0.9089, AUC-ROC 0.9315.
XGBoost with max_depth = 9: accuracy 0.9078, AUC-ROC 0.9315.
Conclusion/Recommendation: The model with max_depth = 6 demonstrates the best overall performance, improving accuracy over both the max_depth = 3 and max_depth = 9 models while matching the best AUC-ROC.
Increasing max_depth from 6 to 9 does not yield further improvement in performance, suggesting that max_depth = 6 represents the optimal balance between model complexity and generalization.
The recommendation is to conduct additional hyperparameter tuning, focusing on:
* Learning rate (eta)
* Subsample and colsample parameters
* Regularization terms (alpha, lambda)
# Model with max_depth = 3
best_params <- list(
objective = "binary:logistic",
max_depth = 3,
eta = 0.1,
subsample = 0.8,
colsample_bytree = 0.8
)
# XGBoost model
set.seed(123)
xgb_final <- xgb.train(
params = best_params,
data = dtrain,
nrounds = 100,
early_stopping_rounds = 10,
watchlist = list(eval = dtest),
verbose = FALSE)
# Predictions
xgb_pred_final <- predict(xgb_final, dtest)
xgb_pred_final_class <- ifelse(xgb_pred_final > 0.5, 1, 0)
# Confusion Matrix
xgb_cm_final <- confusionMatrix(as.factor(xgb_pred_final_class),
as.factor(test_labels_numeric))
#ROC curve
xgb_roc_final <- roc(test_labels_numeric, xgb_pred_final)
# Print results
print(" Confusion Matrix(max_depth = 3)")
## [1] " Confusion Matrix(max_depth = 3)"
print(xgb_cm_final)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 11654 947
## 1 322 639
##
## Accuracy : 0.9064
## 95% CI : (0.9014, 0.9113)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4535
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9731
## Specificity : 0.4029
## Pos Pred Value : 0.9248
## Neg Pred Value : 0.6649
## Prevalence : 0.8831
## Detection Rate : 0.8593
## Detection Prevalence : 0.9291
## Balanced Accuracy : 0.6880
##
## 'Positive' Class : 0
##
print(paste0(" AUC with max_depth = 3 is ", auc(xgb_roc_final), "\n"))
## [1] " AUC with max_depth = 3 is 0.921291221577244\n"
# Model with max_depth = 9
best_params <- list(
objective = "binary:logistic",
max_depth = 9,
eta = 0.1,
subsample = 0.8,
colsample_bytree = 0.8
)
# XGBoost model
set.seed(123)
xgb_final <- xgb.train(
params = best_params,
data = dtrain,
nrounds = 100,
early_stopping_rounds = 10,
watchlist = list(eval = dtest),
verbose = FALSE)
# Predictions
xgb_pred_final <- predict(xgb_final, dtest)
xgb_pred_final_class <- ifelse(xgb_pred_final > 0.5, 1, 0)
# Confusion Matrix
xgb_cm_final <- confusionMatrix(as.factor(xgb_pred_final_class),
as.factor(test_labels_numeric))
#ROC curve
xgb_roc_final <- roc(test_labels_numeric, xgb_pred_final)
# Print results
print(" Confusion Matrix(max_depth = 9)")
## [1] " Confusion Matrix(max_depth = 9)"
print(xgb_cm_final)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 11520 795
## 1 456 791
##
## Accuracy : 0.9078
## 95% CI : (0.9028, 0.9126)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5077
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9619
## Specificity : 0.4987
## Pos Pred Value : 0.9354
## Neg Pred Value : 0.6343
## Prevalence : 0.8831
## Detection Rate : 0.8494
## Detection Prevalence : 0.9081
## Balanced Accuracy : 0.7303
##
## 'Positive' Class : 0
##
print(paste0(" AUC with max_depth = 9 is ", auc(xgb_roc_final), "\n"))
## [1] " AUC with max_depth = 9 is 0.931505139324466\n"
# Model with max_depth = 6
best_params <- list(
objective = "binary:logistic",
max_depth = 6,
eta = 0.1,
subsample = 0.8,
colsample_bytree = 0.8
)
# XGBoost model
set.seed(123)
xgb_final <- xgb.train(
params = best_params,
data = dtrain,
nrounds = 100,
early_stopping_rounds = 10,
watchlist = list(eval = dtest),
verbose = FALSE)
# Predictions
xgb_pred_final <- predict(xgb_final, dtest)
xgb_pred_final_class <- ifelse(xgb_pred_final > 0.5, 1, 0)
# Confusion Matrix
xgb_cm_final <- confusionMatrix(as.factor(xgb_pred_final_class),
as.factor(test_labels_numeric))
#ROC curve
xgb_roc_final <- roc(test_labels_numeric, xgb_pred_final)
# Print results
print(" Confusion Matrix(max_depth = 6):")
## [1] " Confusion Matrix(max_depth = 6):"
print(xgb_cm_final)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 11554 814
## 1 422 772
##
## Accuracy : 0.9089
## 95% CI : (0.9039, 0.9137)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5057
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9648
## Specificity : 0.4868
## Pos Pred Value : 0.9342
## Neg Pred Value : 0.6466
## Prevalence : 0.8831
## Detection Rate : 0.8519
## Detection Prevalence : 0.9120
## Balanced Accuracy : 0.7258
##
## 'Positive' Class : 0
##
print(paste0(" AUC with max_depth = 6 is ", auc(xgb_roc_final), "\n"))
## [1] " AUC with max_depth = 6 is 0.931487238874554\n"
Experiment 5 showed the default XGBoost with moderate boosting rounds and its lack of improvement when boosting was increased: raising the rounds from 100 to 200 lowered both accuracy (0.9043) and AUC-ROC (0.927).
Experiment 6 showed that tuning the max_depth hyperparameter improves the performance of the XGBoost model. The model with max_depth = 6 demonstrates the best overall performance, improving on the max_depth = 3 and max_depth = 9 models. Increasing max_depth from 6 to 9 does not yield further improvement, suggesting that max_depth = 6 represents the optimal balance between model complexity and generalization. Accuracy: 0.9089; AUC-ROC: 0.9315.
The objective of this study was to evaluate different machine learning models (Decision Trees, Random Forest, and XGBoost) on a classification problem. The experiments were designed to analyze the impact of the training-test data ratio and class-weight adjustments on the Decision Tree model; class-weight adjustments and the number of trees on the Random Forest model; and hyperparameter tuning and the number of boosting rounds on the XGBoost model.
Decision Trees generally have low bias but high variance, leading to overfitting on training data. Random Forest, as an ensemble method, reduces variance by averaging multiple Decision Trees, improving generalization. XGBoost, a boosting method, builds models sequentially to correct errors, optimizing both bias and variance for better predictive performance.
The Decision Tree experiments explored how data partitioning and class weight adjustments impact performance. Experiment 1 tested the effect of changing the training-to-test ratio from 80/20 to 70/30, resulting in a marginal accuracy increase from 0.898 to 0.901. However, the model struggled with false negatives for the minority class. Experiment 2 adjusted class weights, slightly reducing accuracy (0.8912) but improving the minority class prediction (AUC-ROC = 0.746). Cross-validation did not improve results, indicating that weighted class adjustments were more effective.
Random Forest significantly improved model performance over Decision Trees. Experiment 3 applied class weight adjustments while keeping all features, achieving 0.9072 accuracy and 0.927 AUC-ROC. The most predictive features were ‘month,’ ‘day,’ and ‘duration.’ Experiment 4 increased the number of trees from 100 to 200, slightly boosting accuracy to 0.9082 and AUC-ROC to 0.929. The results confirmed that increasing the tree count marginally enhances predictive power while maintaining model stability.
XGBoost provided the best balance of accuracy and generalization. Experiment 5 tested an increase in boosting rounds from 100 to 200, which unexpectedly led to a slight decline in performance (accuracy: 0.9043, AUC-ROC: 0.927). Experiment 6 focused on hyperparameter tuning, varying max_depth from 3 to 9. The best results came from max_depth = 6, yielding an accuracy of 0.9089 and an AUC-ROC of 0.9315, indicating an optimal balance between model complexity and generalization.
library(knitr)
# Define the data
results_table <- data.frame(
Model = c("Decision Tree", "Decision Tree", "Random Forest", "Random Forest", "XGBoost", "XGBoost"),
Experiment = c("Exp 1", "Exp 2", "Exp 3", "Exp 4", "Exp 5", "Exp 6"),
Key_Variation = c("70/30 data split", "Adjusted class weights", "Adjusted class weights", "Increased trees to 200", "Increased boosting rounds", "Tuned max_depth = 6"),
Accuracy = c(0.901, 0.8912, 0.9072, 0.9082, 0.9043, 0.9089),
AUC_ROC = c(0.69, 0.746, 0.927, 0.929, 0.927, 0.9315)
)
# Display the table using kable
kable(results_table, format = "markdown", align = "l")
| Model | Experiment | Key_Variation | Accuracy | AUC_ROC |
|:--------------|:-----------|:--------------------------|:---------|:--------|
| Decision Tree | Exp 1 | 70/30 data split | 0.9010 | 0.6900 |
| Decision Tree | Exp 2 | Adjusted class weights | 0.8912 | 0.7460 |
| Random Forest | Exp 3 | Adjusted class weights | 0.9072 | 0.9270 |
| Random Forest | Exp 4 | Increased trees to 200 | 0.9082 | 0.9290 |
| XGBoost | Exp 5 | Increased boosting rounds | 0.9043 | 0.9270 |
| XGBoost | Exp 6 | Tuned max_depth = 6 | 0.9089 | 0.9315 |
The XGBoost model with max_depth = 6 demonstrated the highest accuracy (0.9089) and AUC-ROC (0.9315), making it the optimal model. It provided better class discrimination compared to Decision Trees and Random Forest while maintaining generalization.
Decision Trees offer interpretability but struggle with variance and class imbalance. While simple to implement, they require careful tuning of class weights to improve minority class predictions.
Random Forest reduces variance and performs better than Decision Trees, offering improved generalization. However, class imbalance remains a challenge, and tuning hyperparameters such as tree count and feature selection can further optimize results.
XGBoost provides the best trade-off between bias and variance, with hyperparameter tuning yielding the highest accuracy and AUC-ROC. It is well-suited for structured data and imbalanced classification problems.
Recommendation for Data Science: XGBoost with max_depth = 6 should be the preferred model, with further tuning of learning rate, regularization, and feature selection to enhance performance.
Recommendation for the Business Problem: Given the superior class discrimination and predictive power of XGBoost, it should be implemented to predict term deposit subscriptions and improve decision-making accuracy. Business stakeholders should focus on the most important features: longer contact durations, clients whose previous campaign outcome was a success, and contacts made during months such as May and August.