library(tidyverse)
library(rpart)
library(rpart.plot)
library(caret)
library(knitr)
library(randomForest)
library(xgboost)
df <- read.csv("bank-full.csv", sep = ";")
df$y <- as.factor(df$y)
# Set seed for reproducibility and create a 70/30 train-test split
set.seed(123)
trainIndex <- createDataPartition(df$y, p=0.7, list=FALSE)
trainData <- df[trainIndex, ]
testData <- df[-trainIndex, ]

Experiment 1 - Decision Tree

Define the objective of the experiment (hypothesis)

For this experiment, I wanted to get a better understanding of how the decision tree algorithm splits the data. To do this, I chose a maximum depth of 10. My hypothesis is that setting a high maximum depth will produce a more detailed split of the data and identify which features influence the target variable the most. With a maximum depth of 10, the decision tree should produce a complex but detailed structure that highlights the significant predictor variables. While this may lead to overfitting, the primary goal of this experiment is to get a better sense of the data.

Decide what will change, and what will stay the same

Based on the Exploratory Data Analysis (EDA), the dataset did not have any missing values. However, the variable ‘poutcome’ contained a substantial number of “unknown” entries. I will exclude this predictor variable from this experiment since it may skew the model’s splits. All of the other data and the hyperparameters for this decision tree will remain unchanged.
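As a sanity check on that decision, a minimal sketch (not part of the modeling pipeline) of how the share of "unknown" entries in poutcome could be confirmed before dropping the variable:

# Illustrative check only: proportion of each poutcome level in the raw data.
# The "unknown" level is expected to dominate, per the EDA.
round(prop.table(table(df$poutcome)), 3)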

Select the evaluation metric (what you want to measure)

For this experiment, I have decided to assess the model’s performance using accuracy on both the training and test datasets. The dataset was split 70/30 into training and test sets. Accuracy shows how well the model’s predictions match the actual labels. I will also visually inspect the tree structure using a plot to understand the splits and observe whether the extra depth yields more informative splits.
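For reference, accuracy here is simply the share of predictions that match the observed labels; a minimal sketch of an equivalent helper (the experiments below use caret’s confusionMatrix instead):

# Hypothetical helper, equivalent to confusionMatrix()$overall["Accuracy"]
accuracy <- function(predicted, actual) mean(predicted == actual)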

# Train a Decision Tree with maximum depth = 10
# Excluded the poutcome variable
tree_depth10 <- rpart(y ~ . - poutcome, 
                      data = trainData, 
                      method = "class", 
                      control = rpart.control(maxdepth = 10))

# Predict on the training data
depth10_train_preds <- predict(tree_depth10, trainData, type = "class")

# Predict on the test data
depth10_preds <- predict(tree_depth10, testData, type = "class")

# Evaluate the model's performance on training data
depth10_cm_train <- confusionMatrix(depth10_train_preds, trainData$y)
depth10_acc_train <- depth10_cm_train$overall["Accuracy"]
print(paste("Training Accuracy with max depth 10:", round(depth10_acc_train, 4)))
## [1] "Training Accuracy with max depth 10: 0.8942"
# Evaluate the model's performance on test data
depth10_cm <- confusionMatrix(depth10_preds, testData$y)
depth10_acc <- depth10_cm$overall["Accuracy"]
print(paste("Test Accuracy with max depth 10:", round(depth10_acc, 4)))
## [1] "Test Accuracy with max depth 10: 0.8945"
rpart.plot(tree_depth10, main = "Decision Tree with Max Depth 10")

Result evaluated & conclusion drawn

The decision tree model with a maximum depth of 10 achieved a training accuracy of 89.42% and a test accuracy of 89.45%, indicating it fits the data well. The near-identical accuracies on the training and test sets indicate that the model is not overfitting. Although the maximum depth was set at 10, the final tree only grew to 3 levels. Out of 15 predictor variables, the tree used only two, ‘duration’ and ‘month’, which indicates these two features provided the strongest signals for predicting whether a client subscribes to a term deposit. The remaining 13 predictors did not contribute additional predictive power, suggesting they either carried redundant information or had a weak relationship with the target variable. It is also worth noting that the duration variable appears in multiple branches, which confirms that it is a highly informative predictor in this dataset. This makes sense, as one would expect the duration of a call to correlate with whether the client subscribes to a term deposit. Overall, despite the high accuracy on the training and test data, the structure of this model looks a bit too simple, especially with just two predictors in the decision tree.
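The claim that only two predictors were used can be verified programmatically; a small sketch that lists the split variables stored in the fitted rpart object:

# rpart records the splitting variable of each node in frame$var,
# with "<leaf>" marking terminal nodes; drop it to get the predictors used.
used_vars <- setdiff(unique(as.character(tree_depth10$frame$var)), "<leaf>")
print(used_vars)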

Recommendations

Although the decision tree model achieved high training and test accuracy, its structure was quite simple, with only two predictors, “duration” and “month”. This suggests that these two features dominated the splitting process while the remaining predictors were overlooked. To encourage the tree to consider additional predictors and to uncover deeper interactions in the data, I recommend lowering the minimum split parameter (minsplit). By reducing minsplit, the tree will be allowed to split nodes that contain fewer observations, potentially resulting in a more complex structure with additional splits. This change could reveal whether other variables contribute valuable predictive power beyond “duration” and “month,” and might also help reduce overall node impurity.

# Create an empty data frame to store results
results_table <- data.frame(
  Algorithm = character(),
  Experiment = character(),
  Training_Accuracy = numeric(),
  Test_Accuracy = numeric(),
  stringsAsFactors = FALSE
)

#Append results for Decision Tree Experiment 1
results_table <- rbind(results_table,
  data.frame(
    Algorithm = "Decision Tree",
    Experiment = "Exp 1",
    Training_Accuracy = 89.42,
    Test_Accuracy = 89.45
  )
)
kable(results_table, digits = 2)
| Algorithm     | Experiment | Training_Accuracy | Test_Accuracy |
|:--------------|:-----------|------------------:|--------------:|
| Decision Tree | Exp 1      |             89.42 |         89.45 |

Experiment 2 - Decision Tree

Define the objective of the experiment (hypothesis)

Based on the results from the previous experiment, the objective of this experiment is to investigate the impact of lowering the minimum number of observations required for a split (minsplit) on the decision tree’s structure and performance. In the earlier model, only two variables, duration and month, were used, and some nodes exhibited high impurity. By reducing minsplit to 10 (from the default value of 20), the decision tree will be compelled to perform splits even when fewer samples are available. This should result in a more complex tree structure, which may reveal additional interactions among predictors that were previously ignored and could help reduce impurity in ambiguous nodes as well as improve accuracy.

Decide what will change, and what will stay the same

The major change will be the minsplit parameter. By default it is set to 20; reducing it to 10 allows the tree to split nodes with fewer observations. The rest of the dataset will remain unchanged, the maximum depth will remain at 10, and, as before, the poutcome variable will remain excluded because of the amount of “unknown” values it contains.

Select the evaluation metric (what you want to measure)

As before, the evaluation metric will be accuracy on both the training and test datasets, using the same 70/30 train-test split. We will also visually inspect the tree plot to assess the model’s complexity and node purity.

# Training a Decision Tree with maximum depth = 10 and lowered minsplit = 10
tree_depth10_lower_minsplit <- rpart(y ~ . - poutcome, 
                                     data = trainData, 
                                     method = "class", 
                                     control = rpart.control(maxdepth = 10, minsplit = 10))

# Predict on the training data
depth10_lower_minsplit_train_preds <- predict(tree_depth10_lower_minsplit, trainData, type = "class")

# Predict on the test data
depth10_lower_minsplit_preds <- predict(tree_depth10_lower_minsplit, testData, type = "class")

# Evaluate performance on training data
depth10_lower_minsplit_cm_train <- confusionMatrix(depth10_lower_minsplit_train_preds, trainData$y)
depth10_lower_minsplit_acc_train <- depth10_lower_minsplit_cm_train$overall["Accuracy"]
print(paste("Training Accuracy with lowered minsplit:", round(depth10_lower_minsplit_acc_train, 4)))
## [1] "Training Accuracy with lowered minsplit: 0.8942"
# Evaluate performance on test data
depth10_lower_minsplit_cm <- confusionMatrix(depth10_lower_minsplit_preds, testData$y)
depth10_lower_minsplit_acc <- depth10_lower_minsplit_cm$overall["Accuracy"]
print(paste("Test Accuracy with lowered minsplit:", round(depth10_lower_minsplit_acc, 4)))
## [1] "Test Accuracy with lowered minsplit: 0.8945"
# Plot the decision tree
rpart.plot(tree_depth10_lower_minsplit, main = "Decision Tree with max depth 10 & minsplit = 10")

Result evaluated & conclusion drawn

In this experiment, lowering the minsplit parameter to 10 (compared to the previous experiment’s default setting) resulted in a decision tree that achieved a training accuracy of 89.42% and a test accuracy of 89.45%. These nearly identical accuracies indicate that the model continues to generalize well and is not overfitting. While the goal was to create a more complex model with more predictors, the predictive performance of this model remained unchanged when compared to the first experiment. The final tree maintained the same structure, with splits driven by the same two predictors: duration and month. This indicates that these two predictors capture the most critical information needed for the prediction process. Overall, my hypothesis was not correct. Lowering the minsplit parameter did not uncover any further interactions among the remaining features nor did it reduce any impurity in the nodes.
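A quick sketch of how the “same structure” claim can be checked, comparing the split variables of the two fitted trees:

# Compare the predictors used by the Experiment 1 and Experiment 2 trees.
vars_exp1 <- setdiff(unique(as.character(tree_depth10$frame$var)), "<leaf>")
vars_exp2 <- setdiff(unique(as.character(tree_depth10_lower_minsplit$frame$var)), "<leaf>")
identical(sort(vars_exp1), sort(vars_exp2))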

Recommendation

Based on these results, I would recommend further exploration with data sampling and feature selection. While lowering minsplit did not alter the tree’s complexity or its reliance on the variables “duration” and “month,” experimenting with sampling techniques such as oversampling the minority class might help reveal additional interactions in the data. In the EDA, there was a disproportionate number of “no” values compared to “yes” values in the target variable; addressing this imbalance could have produced a different root split and different results. Also, applying feature selection or combining variables may offer new insights into the data and improve the model’s performance.
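A minimal sketch of the oversampling idea using caret’s upSample(), which duplicates minority-class rows until the classes are balanced; this was not run as part of the experiments above:

# Hedged sketch: balance the training classes by oversampling the minority class.
set.seed(123)
trainData_up <- upSample(x = trainData[, setdiff(names(trainData), "y")],
                         y = trainData$y, yname = "y")
table(trainData_up$y)  # both classes should now have equal counts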

results_table <- rbind(results_table,
  data.frame(
    Algorithm = "Decision Tree",
    Experiment = "Exp 2 (minsplit = 10)",
    Training_Accuracy = as.numeric(depth10_lower_minsplit_acc_train)*100,
    Test_Accuracy = as.numeric(depth10_lower_minsplit_acc)*100
  )
)
kable(results_table, digits = 2)
| Algorithm     | Experiment            | Training_Accuracy | Test_Accuracy |
|:--------------|:----------------------|------------------:|--------------:|
| Decision Tree | Exp 1                 |             89.42 |         89.45 |
| Decision Tree | Exp 2 (minsplit = 10) |             89.42 |         89.45 |

Experiment 3 - Random Forest

Define the objective of the experiment (hypothesis)

The goal of this experiment is to determine whether a Random Forest can improve predictive performance and incorporate additional predictors compared to a single decision tree. By employing a Random Forest with 500 trees, the model is expected to capture more complex interactions among the predictors. My hypothesis is that the Random Forest will achieve a higher accuracy than the decision tree models while also revealing the contribution of other predictors in the dataset.

Decide what will change, and what will stay the same

The model algorithm changes from a single decision tree to a Random Forest with many trees. Since this is the first Random Forest experiment, I will leave all of the parameters at their defaults (ntree set to 500) and decide what changes need to be made based on the result. The same train-test split and the exclusion of the poutcome variable will be maintained, consistent with the previous decision tree experiments.

Select the evaluation metric (what you want to measure)

The model’s performance will be measured using the training and test accuracy. We will also visually inspect the variable importance plot to see which predictors are influencing the predictions and whether they make sense intuitively.

# Train a Random Forest model with 500 trees, excluding the 'poutcome' variable
rf_model <- randomForest(y ~ . - poutcome, 
                         data = trainData, 
                         ntree = 500, 
                         importance = TRUE)

# Make predictions on the training data
rf_train_preds <- predict(rf_model, trainData, type = "class")

# Make predictions on the test data
rf_test_preds <- predict(rf_model, testData, type = "class")

# Evaluate performance on the training set
rf_cm_train <- confusionMatrix(rf_train_preds, trainData$y)
rf_train_acc <- rf_cm_train$overall["Accuracy"]
print(paste("Training Accuracy - Random Forest:", round(rf_train_acc, 4)))
## [1] "Training Accuracy - Random Forest: 0.9964"
# Evaluate performance on the test set
rf_cm_test <- confusionMatrix(rf_test_preds, testData$y)
rf_test_acc <- rf_cm_test$overall["Accuracy"]
print(paste("Test Accuracy - Random Forest:", round(rf_test_acc, 4)))
## [1] "Test Accuracy - Random Forest: 0.9008"
# Plot variable importance to see which features matter most
varImpPlot(rf_model, main = "Random Forest Variable Importance")

Result evaluated & conclusion drawn

In this experiment, the Random Forest model achieved an extremely high training accuracy of 99.64% and a test accuracy of 90.08%, showing that my initial hypothesis was correct. Although the training accuracy is almost perfect, it does raise the possibility of overfitting. The test performance indicates the model generalizes reasonably well to unseen data. However, there is a noticeable gap between the training and test accuracy (about 9.6 percentage points) that I find a bit concerning.

The variable importance plot reveals that duration is the most influential predictor. The MeanDecreaseAccuracy values show that duration and month are not the only variables that contribute to the model’s accuracy: removing the housing, day, contact, age, or pdays variable has almost the same impact on accuracy as removing the month variable. These findings suggest that the Random Forest not only produces a more accurate model but also incorporates a broader set of predictors than the single decision tree.
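The numeric scores behind the plot can be inspected directly; a small sketch sorting the importance matrix by MeanDecreaseAccuracy:

# importance() returns per-class importances plus MeanDecreaseAccuracy and
# MeanDecreaseGini when the forest was trained with importance = TRUE.
imp <- as.data.frame(importance(rf_model))
head(imp[order(-imp$MeanDecreaseAccuracy), ], 10)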

Recommendation

As mentioned previously, there is a ~10 percentage point gap between the training and test accuracy, which usually indicates overfitting and may warrant adjustments. One way to reduce overfitting is to increase regularization by reducing the number of trees (ntree), increasing the minimum node size (nodesize), or reducing the maximum tree depth. Based on these findings, I would recommend reducing the number of trees to 300 to see if the test accuracy increases. If a large gap between training and test accuracy remains, I would apply the other adjustments one at a time, as an iterative, step-by-step process.
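A hedged sketch of the adjustments mentioned above; the nodesize and maxnodes values are illustrative assumptions, and maxnodes is used because randomForest has no explicit maximum-depth argument:

# Not run here: fewer trees plus larger terminal nodes and a cap on tree size.
rf_regularized <- randomForest(y ~ . - poutcome,
                               data = trainData,
                               ntree = 300,     # reduced from 500, as recommended
                               nodesize = 10,   # assumed example value
                               maxnodes = 64,   # assumed example value
                               importance = TRUE)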

results_table <- rbind(results_table,
  data.frame(
    Algorithm = "Random Forest",
    Experiment = "Exp 3 (ntree = 500)",
    Training_Accuracy = as.numeric(rf_train_acc) * 100,
    Test_Accuracy = as.numeric(rf_test_acc) * 100
  )
)

kable(results_table, digits = 2)
| Algorithm     | Experiment            | Training_Accuracy | Test_Accuracy |
|:--------------|:----------------------|------------------:|--------------:|
| Decision Tree | Exp 1                 |             89.42 |         89.45 |
| Decision Tree | Exp 2 (minsplit = 10) |             89.42 |         89.45 |
| Random Forest | Exp 3 (ntree = 500)   |             99.64 |         90.08 |

Experiment 4 - Random Forest

Define the objective of the experiment (hypothesis)

The goal of this experiment is to assess whether reducing the number of trees in the Random Forest will lower overfitting from the previous model. By lowering the number of trees from 500 to 200, the model will be less prone to overfitting. Although this might reduce the training accuracy, it is expected to improve, or at least maintain, the test accuracy.

Decide what will change, and what will stay the same

The main change in this experiment is the reduction of the number of trees (ntree) in the Random Forest from 500 to 200. The dataset, train-test split, and the exclusion of the poutcome variable will remain unchanged.

Select the evaluation metric (what you want to measure)

The evaluation metrics will be the same as before: accuracy on the training and test sets. A variable importance plot will also be inspected to verify which predictors are driving the predictions.

# Train a Random Forest model with reduced number of trees (ntree = 200)
rf_model_reduced <- randomForest(y ~ . - poutcome, 
                                 data = trainData, 
                                 ntree = 200, 
                                 importance = TRUE)

# Make predictions on the training data
rf_reduced_train_preds <- predict(rf_model_reduced, trainData, type = "class")

# Make predictions on the test data
rf_reduced_test_preds <- predict(rf_model_reduced, testData, type = "class")

# Evaluate performance on the training set
rf_reduced_cm_train <- confusionMatrix(rf_reduced_train_preds, trainData$y)
rf_reduced_train_acc <- rf_reduced_cm_train$overall["Accuracy"]
print(paste("Training Accuracy - Reduced ntree Random Forest:", round(rf_reduced_train_acc, 4)))
## [1] "Training Accuracy - Reduced ntree Random Forest: 0.9958"
# Evaluate performance on the test set
rf_reduced_cm_test <- confusionMatrix(rf_reduced_test_preds, testData$y)
rf_reduced_test_acc <- rf_reduced_cm_test$overall["Accuracy"]
print(paste("Test Accuracy - Reduced ntree Random Forest:", round(rf_reduced_test_acc, 4)))
## [1] "Test Accuracy - Reduced ntree Random Forest: 0.901"
varImpPlot(rf_model_reduced, main = "Variable Importance - Reduced ntree Random Forest")

Result evaluated & conclusion drawn

In this experiment, the Random Forest model with a reduced number of trees achieved a training accuracy of 99.58% and a test accuracy of 90.10%. Although the gap between the training and test accuracy shrank slightly, it is still about 9.5 percentage points, indicating that even with fewer trees, the model is still overfitting to the training data.

The variable importance plot shows that duration remains the key predictor in the decision-making process. The day and housing variables appear to have roughly the same MeanDecreaseAccuracy as month, but duration is far and away the greatest contributor to the model’s accuracy. The hypothesis was partially supported: reducing the number of trees to 200 did slightly improve the test accuracy.

Recommendation

Based on these findings, I would recommend further measures to mitigate overfitting and improve generalization. Given the persistent gap between training and test accuracy, it is worth exploring adjustments to other hyperparameters, such as increasing the minimum node size (nodesize) or reducing the maximum tree depth. Additionally, incorporating techniques like cross-validation and oversampling the minority class may provide a more balanced view of model performance.
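A hedged sketch of the cross-validation idea using caret::train(); the 5-fold setup and mtry grid are illustrative assumptions rather than tuned choices, and this was not run as part of the experiments:

# 5-fold cross-validated Random Forest; the mtry values are placeholders.
ctrl <- trainControl(method = "cv", number = 5)
set.seed(123)
rf_cv <- train(y ~ . - poutcome,
               data = trainData,
               method = "rf",
               ntree = 200,
               trControl = ctrl,
               tuneGrid = data.frame(mtry = c(2, 4, 6)))
rf_cv$results  # cross-validated accuracy per mtry value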

results_table <- rbind(results_table,
  data.frame(
    Algorithm = "Random Forest",
    Experiment = "Exp 4 (ntree = 200)",
    Training_Accuracy = as.numeric(rf_reduced_train_acc) * 100,
    Test_Accuracy = as.numeric(rf_reduced_test_acc) * 100
  )
)

kable(results_table, digits = 2)
| Algorithm     | Experiment            | Training_Accuracy | Test_Accuracy |
|:--------------|:----------------------|------------------:|--------------:|
| Decision Tree | Exp 1                 |             89.42 |         89.45 |
| Decision Tree | Exp 2 (minsplit = 10) |             89.42 |         89.45 |
| Random Forest | Exp 3 (ntree = 500)   |             99.64 |         90.08 |
| Random Forest | Exp 4 (ntree = 200)   |             99.58 |         90.10 |

Experiment 5 - XGBoost

Define the objective of the experiment (hypothesis)

The goal of this experiment is to evaluate the performance of Extreme Gradient Boosting (XGBoost) using its default parameters. This experiment serves as a baseline for how well gradient boosting can capture complex interactions among predictors in comparison to the previous models. I chose XGBoost over AdaBoost for this dataset because XGBoost is known for its robustness and efficient handling of large and imbalanced data. By employing XGBoost with default parameters, we expect to achieve higher training and test accuracy than the Random Forest.

Decide what will change, and what will stay the same

The objective of this experiment is to create a baseline performance for XGBoost with the default parameters (max_depth = 6, eta = 0.3) and nrounds set to 100. As in the previous models, the dataset will remain the same, as will the train-test split (70/30) and the exclusion of the poutcome variable.
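For reference, a sketch of the same defaults written out explicitly; max_depth = 6 and eta = 0.3 are the xgboost package defaults, and nrounds = 100 is the value chosen for this baseline:

# Equivalent, explicit version of the default parameter list used below.
params_explicit <- list(
  objective = "binary:logistic",
  eval_metric = "error",
  max_depth = 6,          # xgboost default
  eta = 0.3,              # xgboost default
  subsample = 1,          # xgboost default
  colsample_bytree = 1    # xgboost default
)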

Select the evaluation metric (what you want to measure)

The evaluation metric will also be the same as the previous models. Measuring the training and test accuracy for all models keeps it uniform and is a good way to evaluate which model performed best. Again, there will be a visual inspection to assess which predictors are driving the model’s decisions.

# Prepare the training data: create model matrix and convert the target variable to numeric (0/1)
train_matrix <- model.matrix(y ~ . - poutcome, data = trainData)[, -1]
train_label <- as.numeric(trainData$y) - 1  # assuming 'y' factor levels are "no" and "yes"

# Prepare the test data similarly
test_matrix <- model.matrix(y ~ . - poutcome, data = testData)[, -1]
test_label <- as.numeric(testData$y) - 1

# Create DMatrix objects for XGBoost training and testing
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest  <- xgb.DMatrix(data = test_matrix, label = test_label)

# Set XGBoost parameters to their default-ish values
# (Note: Default parameters for xgboost in xgb.train are approximately max_depth=6, eta=0.3, subsample=1, colsample_bytree=1)
params <- list(
  objective = "binary:logistic",
  eval_metric = "error"
)

# Train the XGBoost model with 100 rounds using default parameters
set.seed(123)
xgb_model <- xgb.train(params = params, data = dtrain, nrounds = 100, verbose = 0)

# Make predictions on the training data
train_preds <- predict(xgb_model, dtrain)
train_preds_class <- ifelse(train_preds > 0.5, 1, 0)

# Make predictions on the test data
test_preds <- predict(xgb_model, dtest)
test_preds_class <- ifelse(test_preds > 0.5, 1, 0)

# Evaluate performance on the training set
cm_train <- confusionMatrix(as.factor(train_preds_class), as.factor(train_label))
xgb_train_acc <- cm_train$overall["Accuracy"]
print(paste("Training Accuracy - XGBoost:", round(as.numeric(xgb_train_acc), 4)))
## [1] "Training Accuracy - XGBoost: 0.9553"
# Evaluate performance on the test set
cm_test <- confusionMatrix(as.factor(test_preds_class), as.factor(test_label))
xgb_test_acc <- cm_test$overall["Accuracy"]
print(paste("Test Accuracy - XGBoost:", round(as.numeric(xgb_test_acc), 4)))
## [1] "Test Accuracy - XGBoost: 0.904"
# Plot variable importance
importance_matrix <- xgb.importance(model = xgb_model)
xgb.plot.importance(importance_matrix, main = "XGBoost Variable Importance")

Result evaluated & conclusion drawn

The XGBoost model achieved a training accuracy of 95.53% and a test accuracy of 90.40%. The training-to-test accuracy gap is 5.13 percentage points, which is notably smaller than in the previous Random Forest models. This indicates that XGBoost exhibits less overfitting and generalizes better to unseen data. As shown in the variable importance plot, duration remains by far the most influential predictor. Unlike the Random Forest, where month ranked second, here day is the second most influential predictor. This seems inconsistent with real-world intuition: it is odd that the day of the month would have any impact on the target variable. Nevertheless, this result confirms the hypothesis that a boosting method like XGBoost maintains strong predictive performance while also drawing on predictors that were underutilized in the single decision tree and Random Forest models.
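The gain values behind the importance plot can be checked numerically; a small sketch using the importance matrix already computed above (note that one-hot encoding splits month into separate dummy columns, while day and duration remain single numeric features):

# Top features by gain from the fitted XGBoost model.
head(importance_matrix, 5)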

Recommendation

Overall, I am satisfied with the performance of the XGBoost model. To further confirm the robustness of these results, I would recommend using cross validation. Incorporating cross validation will provide a more reliable estimate of the model’s generalizability and help fine-tune parameters to potentially further reduce any remaining overfitting.

results_table <- rbind(results_table,
  data.frame(
    Algorithm = "XGBoost",
    Experiment = "Exp 5 (Default Params)",
    Training_Accuracy = as.numeric(xgb_train_acc) * 100,
    Test_Accuracy = as.numeric(xgb_test_acc) * 100
  )
)

kable(results_table, digits = 2)
| Algorithm     | Experiment             | Training_Accuracy | Test_Accuracy |
|:--------------|:-----------------------|------------------:|--------------:|
| Decision Tree | Exp 1                  |             89.42 |         89.45 |
| Decision Tree | Exp 2 (minsplit = 10)  |             89.42 |         89.45 |
| Random Forest | Exp 3 (ntree = 500)    |             99.64 |         90.08 |
| Random Forest | Exp 4 (ntree = 200)    |             99.58 |         90.10 |
| XGBoost       | Exp 5 (Default Params) |             95.53 |         90.40 |

Experiment 6 - XGBoost

Define the Objective of the Experiment (Hypothesis)

This experiment will evaluate XGBoost using 5-fold cross-validation to obtain a more robust estimate of the model’s performance. By using cross-validation with early stopping, we will fine-tune the boosting process, specifically optimizing the number of rounds to reduce overfitting. I hypothesize that the gap between the training and test accuracy will be smaller than in the previous XGBoost model. The experiment will also identify the optimal number of boosting rounds needed to prevent unnecessary overfitting.

Decide What Will Change and What Will Stay the Same

Rather than training for a fixed 100 rounds, this experiment will employ 5-fold cross-validation with early stopping on the training data to determine the optimal number of boosting rounds. The final model will still be evaluated on the same 70/30 train-test split, and the dataset and the exclusion of the poutcome variable will remain unchanged.

Select the Evaluation Metric (What You Want to Measure)

The model’s performance will be assessed using the training and test accuracy of the final model. Also, a plot will be used to interpret which predictors drive the model’s decisions.

# Prepare the training data: create a model matrix and convert target variable to numeric (0/1)
train_matrix <- model.matrix(y ~ . - poutcome, data = trainData)[, -1]
train_label <- as.numeric(trainData$y) - 1  # Assuming 'y' factor levels ("no", "yes") are converted to 0/1

# Create DMatrix for XGBoost
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)

# Set XGBoost parameters (using defaults for other parameters)
params <- list(
  objective = "binary:logistic",
  eval_metric = "error"  # error: fraction of misclassified instances
)

# Perform 5-fold cross-validation with early stopping
set.seed(123)
cv_results <- xgb.cv(
  params = params,
  data = dtrain,
  nrounds = 100,
  nfold = 5,
  early_stopping_rounds = 10,
  verbose = 1,
  maximize = FALSE
)
## [1]  train-error:0.097033+0.000735   test-error:0.106859+0.004008 
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 10 rounds.
## 
## [2]  train-error:0.093984+0.001084   test-error:0.104490+0.003100 
## [3]  train-error:0.092309+0.001265   test-error:0.103921+0.004052 
## [4]  train-error:0.090548+0.001254   test-error:0.103194+0.004057 
## [5]  train-error:0.089932+0.001116   test-error:0.103289+0.004389 
## [6]  train-error:0.088834+0.001509   test-error:0.103289+0.003301 
## [7]  train-error:0.088028+0.001147   test-error:0.103100+0.003091 
## [8]  train-error:0.087625+0.001678   test-error:0.103099+0.003738 
## [9]  train-error:0.086338+0.001321   test-error:0.103415+0.002664 
## [10] train-error:0.085666+0.001569   test-error:0.103036+0.003662 
## [11] train-error:0.084781+0.001534   test-error:0.101930+0.003272 
## [12] train-error:0.083905+0.001761   test-error:0.101425+0.004103 
## [13] train-error:0.083375+0.001926   test-error:0.100761+0.004382 
## [14] train-error:0.082230+0.001594   test-error:0.100035+0.004575 
## [15] train-error:0.081424+0.001179   test-error:0.099908+0.004475 
## [16] train-error:0.080058+0.001668   test-error:0.099087+0.004233 
## [17] train-error:0.078944+0.001416   test-error:0.098771+0.003859 
## [18] train-error:0.078446+0.001557   test-error:0.098455+0.003314 
## [19] train-error:0.077554+0.001352   test-error:0.098518+0.003918 
## [20] train-error:0.076669+0.001223   test-error:0.097791+0.003744 
## [21] train-error:0.075745+0.001489   test-error:0.097886+0.003329 
## [22] train-error:0.075129+0.001477   test-error:0.098044+0.003539 
## [23] train-error:0.074007+0.001323   test-error:0.097760+0.003304 
## [24] train-error:0.072877+0.001469   test-error:0.098044+0.002957 
## [25] train-error:0.072080+0.001481   test-error:0.097823+0.002741 
## [26] train-error:0.071471+0.001458   test-error:0.098265+0.003490 
## [27] train-error:0.070824+0.001358   test-error:0.097633+0.003298 
## [28] train-error:0.070144+0.001331   test-error:0.097760+0.003150 
## [29] train-error:0.069552+0.001232   test-error:0.098107+0.004116 
## [30] train-error:0.068683+0.001346   test-error:0.097918+0.003661 
## [31] train-error:0.067933+0.001309   test-error:0.097696+0.003545 
## [32] train-error:0.067372+0.001397   test-error:0.097570+0.003722 
## [33] train-error:0.066440+0.001154   test-error:0.097033+0.003720 
## [34] train-error:0.065752+0.001509   test-error:0.097128+0.003482 
## [35] train-error:0.065350+0.001518   test-error:0.096749+0.003582 
## [36] train-error:0.064686+0.001449   test-error:0.096306+0.003877 
## [37] train-error:0.064015+0.001402   test-error:0.096369+0.004319 
## [38] train-error:0.063786+0.001313   test-error:0.096180+0.004319 
## [39] train-error:0.063454+0.001295   test-error:0.096338+0.003977 
## [40] train-error:0.062822+0.001114   test-error:0.096148+0.004491 
## [41] train-error:0.062427+0.001147   test-error:0.096148+0.004387 
## [42] train-error:0.061992+0.001193   test-error:0.096053+0.004157 
## [43] train-error:0.061274+0.000852   test-error:0.096306+0.004078 
## [44] train-error:0.060713+0.000813   test-error:0.096338+0.003865 
## [45] train-error:0.060073+0.000823   test-error:0.096527+0.003888 
## [46] train-error:0.059757+0.000832   test-error:0.096622+0.003287 
## [47] train-error:0.059528+0.000710   test-error:0.096591+0.003232 
## [48] train-error:0.058612+0.000564   test-error:0.096875+0.003146 
## [49] train-error:0.058193+0.000491   test-error:0.097159+0.003171 
## [50] train-error:0.057822+0.000746   test-error:0.097096+0.003171 
## [51] train-error:0.057277+0.000438   test-error:0.097033+0.003061 
## [52] train-error:0.056740+0.000615   test-error:0.096843+0.003263 
## Stopping. Best iteration:
## [42] train-error:0.061992+0.001193   test-error:0.096053+0.004157
# The cv_results object shows the best iteration based on test error.
best_nrounds <- cv_results$best_iteration
cat("Best number of rounds from CV:", best_nrounds, "\n")
## Best number of rounds from CV: 42
# Retrain the final XGBoost model using the optimal number of boosting rounds
xgb_model_cv <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = best_nrounds,
  verbose = 0
)

# Prepare test data similarly
test_matrix <- model.matrix(y ~ . - poutcome, data = testData)[, -1]
test_label <- as.numeric(testData$y) - 1
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)

# Make predictions on the training data
train_preds <- predict(xgb_model_cv, dtrain)
train_preds_class <- ifelse(train_preds > 0.5, 1, 0)

# Make predictions on the test data
test_preds <- predict(xgb_model_cv, dtest)
test_preds_class <- ifelse(test_preds > 0.5, 1, 0)

# Evaluate performance on the training set
cm_train <- confusionMatrix(as.factor(train_preds_class), as.factor(train_label))
xgb_train_acc <- cm_train$overall["Accuracy"]
cat("Training Accuracy - XGBoost (CV):", round(as.numeric(xgb_train_acc), 4), "\n")
## Training Accuracy - XGBoost (CV): 0.9333
# Evaluate performance on the test set
cm_test <- confusionMatrix(as.factor(test_preds_class), as.factor(test_label))
xgb_test_acc <- cm_test$overall["Accuracy"]
cat("Test Accuracy - XGBoost (CV):", round(as.numeric(xgb_test_acc), 4), "\n")
## Test Accuracy - XGBoost (CV): 0.9039
# Plot variable importance
importance_matrix <- xgb.importance(model = xgb_model_cv)
xgb.plot.importance(importance_matrix, main = "XGBoost (CV) Variable Importance")

Result evaluated & conclusion drawn

XGBoost was tuned using 5-fold cross-validation with early stopping, yielding an optimal model at 42 boosting rounds. The final model retrained with 42 rounds achieved a training accuracy of 93.33% and a test accuracy of 90.39% (test error = 0.0961). The roughly 3 percentage point gap between training and test performance shows that the cross-validated approach substantially reduced overfitting compared to the previous models. This indicates that the model generalizes well to unseen data, and it confirms the hypothesis that optimizing the number of boosting rounds through cross-validation can yield a more robust, balanced model. The variable importance plot also remains consistent with previous experiments: duration is by far the most influential predictor.

Recommendation

Based on these findings, I recommend adopting the XGBoost configuration using cross validation, as it has demonstrated a robust and balanced performance with only a modest gap between training and test accuracies. However, to further improve the model and lower the errors, additional hyperparameter tuning can be performed. For example, experimenting with different maximum tree depths, different learning rates, and regularization parameters such as lambda and alpha can further narrow the gap and reduce both training and test error. In addition, it may also be beneficial to oversample the minority class to ensure that the model fully uses all the available predictors.
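A hedged sketch of that tuning idea, looping a small illustrative grid of max_depth and eta values through xgb.cv and recording the best cross-validated test error; the grid values are assumptions for demonstration, not tuned choices:

# Small illustrative grid search over max_depth and eta via xgb.cv.
grid <- expand.grid(max_depth = c(3, 6), eta = c(0.1, 0.3))
set.seed(123)
cv_errors <- apply(grid, 1, function(g) {
  p <- list(objective = "binary:logistic", eval_metric = "error",
            max_depth = as.numeric(g["max_depth"]),
            eta = as.numeric(g["eta"]))
  cv <- xgb.cv(params = p, data = dtrain, nrounds = 100, nfold = 5,
               early_stopping_rounds = 10, verbose = 0)
  min(cv$evaluation_log$test_error_mean)
})
cbind(grid, cv_error = cv_errors)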

Summary of Experiments

results_table <- rbind(results_table,
  data.frame(
    Algorithm = "XGBoost (CV)",
    Experiment = "Exp 6",
    Training_Accuracy = as.numeric(xgb_train_acc) * 100,
    Test_Accuracy = as.numeric(xgb_test_acc) * 100
  )
)

# Display the results table
kable(results_table, digits = 2)
| Algorithm     | Experiment             | Training_Accuracy | Test_Accuracy |
|:--------------|:-----------------------|------------------:|--------------:|
| Decision Tree | Exp 1                  |             89.42 |         89.45 |
| Decision Tree | Exp 2 (minsplit = 10)  |             89.42 |         89.45 |
| Random Forest | Exp 3 (ntree = 500)    |             99.64 |         90.08 |
| Random Forest | Exp 4 (ntree = 200)    |             99.58 |         90.10 |
| XGBoost       | Exp 5 (Default Params) |             95.53 |         90.40 |
| XGBoost (CV)  | Exp 6                  |             93.33 |         90.39 |
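To make the overfitting comparison across experiments explicit, the summary table could be extended with a train-test gap column; a minimal sketch:

# Add the train-test accuracy gap (in percentage points) to the summary table.
results_table$Gap <- results_table$Training_Accuracy - results_table$Test_Accuracy
kable(results_table, digits = 2)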