library(tidyverse)
library(rpart)
library(rpart.plot)
library(caret)
library(knitr)
library(randomForest)
library(xgboost)
df <- read.csv("bank-full.csv", sep = ";")
df$y <- as.factor(df$y)
# Set seed for reproducibility and create a 70/30 train-test split
set.seed(123)
trainIndex <- createDataPartition(df$y, p=0.7, list=FALSE)
trainData <- df[trainIndex, ]
testData <- df[-trainIndex, ]
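Since createDataPartition performs a stratified split on y, a quick sanity check (a sketch, assuming the objects above are already in the workspace) can confirm that the class balance is preserved in both partitions:
# Sketch: verify the stratified split preserves the class balance of y
prop.table(table(trainData$y))
prop.table(table(testData$y))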
For this experiment, I wanted to get a better understanding of how the decision tree algorithm splits the data. To do this, I chose a maximum depth of 10. My hypothesis is that a high maximum depth will produce a more detailed set of splits and reveal which features influence the target variable the most. A depth of 10 should yield a complex but informative structure that highlights the significant predictor variables. While this may lead to overfitting, the primary goal of this experiment is to get a better sense of the data.
Based on the Exploratory Data Analysis (EDA), the dataset did not have any missing values. However, the variable ‘poutcome’ contained a substantial number of “unknown” entries. I will exclude this predictor variable from this experiment since it may skew the model’s splits. All of the other data and the hyperparameters for this decision tree will remain unchanged.
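As a quick illustration (a sketch, assuming df is loaded as above), the share of "unknown" entries in poutcome can be checked directly before deciding to drop the variable:
# Sketch: proportion of each poutcome level, including "unknown"
round(prop.table(table(df$poutcome)), 3)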
For this experiment, I decided to assess the model's performance using accuracy on both the training and test datasets, with the data split 70/30 into train and test sets. Accuracy shows how well the model's predictions match the actual labels. I will also visually inspect the tree structure using a plot to understand the splits and observe whether the extra depth yielded better splits.
# Train a Decision Tree with maximum depth = 10
# Excluded the poutcome variable
tree_depth10 <- rpart(y ~ . - poutcome,
data = trainData,
method = "class",
control = rpart.control(maxdepth = 10))
# Predict on the training data
depth10_train_preds <- predict(tree_depth10, trainData, type = "class")
# Predict on the test data
depth10_preds <- predict(tree_depth10, testData, type = "class")
# Evaluate the model's performance on training data
depth10_cm_train <- confusionMatrix(depth10_train_preds, trainData$y)
depth10_acc_train <- depth10_cm_train$overall["Accuracy"]
print(paste("Training Accuracy with max depth 10:", round(depth10_acc_train, 4)))
## [1] "Training Accuracy with max depth 10: 0.8942"
# Evaluate the model's performance on test data
depth10_cm <- confusionMatrix(depth10_preds, testData$y)
depth10_acc <- depth10_cm$overall["Accuracy"]
print(paste("Test Accuracy with max depth 10:", round(depth10_acc, 4)))
## [1] "Test Accuracy with max depth 10: 0.8945"
rpart.plot(tree_depth10, main = "Decision Tree with Max Depth 10")
The decision tree model with a maximum depth of 10 achieved a training accuracy of 89.42% and a test accuracy of 89.45%, indicating it fits the data well. The near-identical accuracies between the training and test sets indicate that the model is not overfitting. Although the maximum depth was set at 10, the final tree only grew to 3 levels. Out of 15 predictor variables, the tree only used two: ‘duration’ and ‘month’, which indicates these two features provided the strongest signals for predicting whether a client subscribes to a term deposit. The remaining 13 predictors did not contribute additional predictive power, suggesting they either carried redundant information or had a weak relationship with the target variable. It is also worth noting that the duration variable appears in multiple branches, which confirms that it is a highly informative predictor in this dataset. This makes sense, as one would expect the duration of a call to correlate with whether the client subscribes to a term deposit. Overall, despite the high accuracy on the training and test data, the structure of this model looks a bit too simple, especially with just two predictors in the decision tree.
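To verify which predictors the fitted tree actually used, one option (a sketch using standard rpart accessors) is to read the split variables out of the tree's frame; printcp reports the same information:
# Sketch: list the variables actually used for splits in the fitted tree
used_vars <- setdiff(unique(as.character(tree_depth10$frame$var)), "<leaf>")
print(used_vars)
# printcp(tree_depth10) also lists "Variables actually used in tree construction"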
Although the decision tree model achieved high training and test accuracy, its structure was quite simple, with only two predictors, “duration” and “month”. This suggests that these two features dominated the splitting process while the remaining predictors were overlooked. To encourage the tree to consider additional predictors and to uncover deeper interactions in the data, I recommend lowering the minimum split parameter (minsplit). By reducing minsplit, the tree will be allowed to split nodes that contain fewer observations, potentially resulting in a more complex structure with additional splits. This change could reveal whether other variables contribute valuable predictive power beyond “duration” and “month,” and might also help reduce overall node impurity.
# Create an empty data frame to store results
results_table <- data.frame(
Algorithm = character(),
Experiment = character(),
Training_Accuracy = numeric(),
Test_Accuracy = numeric(),
stringsAsFactors = FALSE
)
#Append results for Decision Tree Experiment 1
results_table <- rbind(results_table,
data.frame(
Algorithm = "Decision Tree",
Experiment = "Exp 1",
Training_Accuracy = 89.42,
Test_Accuracy = 89.45
)
)
kable(results_table, digits = 2)
Algorithm | Experiment | Training_Accuracy | Test_Accuracy |
---|---|---|---|
Decision Tree | Exp 1 | 89.42 | 89.45 |
Based on the results from the previous experiment, the objective of this experiment is to investigate the impact of lowering the minimum number of observations required for a split (minsplit) on the decision tree’s structure and performance. In the earlier model, only two variables, duration and month, were used, and some nodes exhibited high impurity. By reducing minsplit to 10 (from the default value of 20), the decision tree will be compelled to perform splits even when fewer samples are available. This should result in a more complex tree structure, which may reveal additional interactions among predictors that were previously ignored, help reduce impurity in ambiguous nodes, and improve accuracy.
The major change is the minsplit parameter. By default it is set to 20, but reducing it to 10 will allow more splits even with smaller node sizes. The rest of the dataset will remain unchanged. The maximum depth will also remain at 10 and, as before, the poutcome variable will remain excluded because of the amount of unknown values it contains.
As before, the evaluation metric will be accuracy on both the training and test datasets. Again, the train-test split will be 70/30. We will also do a visual inspection of the tree plot to assess the complexity and the purity of the model.
# Training a Decision Tree with maximum depth = 10 and lowered minsplit = 10
tree_depth10_lower_minsplit <- rpart(y ~ . - poutcome,
data = trainData,
method = "class",
control = rpart.control(maxdepth = 10, minsplit = 10))
# Predict on the training data
depth10_lower_minsplit_train_preds <- predict(tree_depth10_lower_minsplit, trainData, type = "class")
# Predict on the test data
depth10_lower_minsplit_preds <- predict(tree_depth10_lower_minsplit, testData, type = "class")
# Evaluate performance on training data
depth10_lower_minsplit_cm_train <- confusionMatrix(depth10_lower_minsplit_train_preds, trainData$y)
depth10_lower_minsplit_acc_train <- depth10_lower_minsplit_cm_train$overall["Accuracy"]
print(paste("Training Accuracy with lowered minsplit:", round(depth10_lower_minsplit_acc_train, 4)))
## [1] "Training Accuracy with lowered minsplit: 0.8942"
# Evaluate performance on test data
depth10_lower_minsplit_cm <- confusionMatrix(depth10_lower_minsplit_preds, testData$y)
depth10_lower_minsplit_acc <- depth10_lower_minsplit_cm$overall["Accuracy"]
print(paste("Test Accuracy with lowered minsplit:", round(depth10_lower_minsplit_acc, 4)))
## [1] "Test Accuracy with lowered minsplit: 0.8945"
# Plot the decision tree
rpart.plot(tree_depth10_lower_minsplit, main = "Decision Tree with max depth 10 & minsplit = 10")
In this experiment, lowering the minsplit parameter to 10 (compared to the previous experiment’s default setting) resulted in a decision tree that achieved a training accuracy of 89.42% and a test accuracy of 89.45%. These nearly identical accuracies indicate that the model continues to generalize well and is not overfitting. While the goal was to create a more complex model with more predictors, the predictive performance of this model remained unchanged when compared to the first experiment. The final tree maintained the same structure, with splits driven by the same two predictors: duration and month. This indicates that these two predictors capture the most critical information needed for the prediction process. Overall, my hypothesis was not correct: lowering the minsplit parameter did not uncover any further interactions among the remaining features, nor did it reduce impurity in the nodes.
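A quick structural check (a sketch, assuming both fitted trees are still in the workspace) can confirm whether the sequence of split variables is the same in the two models:
# Sketch: compare the split structure of the two trees
identical(as.character(tree_depth10$frame$var),
          as.character(tree_depth10_lower_minsplit$frame$var))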
Based on these results, I would recommend further exploration with data sampling and feature selection. While lowering minsplit did not alter the tree’s complexity or its reliance on the variables “duration” and “month,” experimenting with different sampling techniques like oversampling the minority class might help reveal additional interactions in the data. In the EDA, there was a disproportionate amount of no values compared to yes values in the target variable. Addressing this imbalance could yield different results with a different starting node. Also, applying feature selection to combine variables may offer new insights into the data and improve the model’s performance.
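As a minimal sketch of the oversampling idea (illustrative only; the resampled data were not used in these experiments), caret’s upSample can balance the training classes before refitting:
# Sketch: oversample the minority class in the training data with caret
train_up <- upSample(x = trainData[, setdiff(names(trainData), "y")],
                     y = trainData$y,
                     yname = "y")
table(train_up$y)  # classes are now balanced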
results_table <- rbind(results_table,
data.frame(
Algorithm = "Decision Tree",
Experiment = "Exp 2 (minsplit = 10)",
Training_Accuracy = as.numeric(depth10_lower_minsplit_acc_train)*100,
Test_Accuracy = as.numeric(depth10_lower_minsplit_acc)*100
)
)
kable(results_table, digits = 2)
Algorithm | Experiment | Training_Accuracy | Test_Accuracy |
---|---|---|---|
Decision Tree | Exp 1 | 89.42 | 89.45 |
Decision Tree | Exp 2 (minsplit = 10) | 89.42 | 89.45 |
The goal of this experiment is to determine whether using a Random Forest can improve the predictive performance and potentially incorporate additional predictors compared to a single decision tree. By employing a Random Forest with 500 trees, the model is expected to capture more complex interactions among the predictors and achieve higher accuracy than the decision trees. My hypothesis is that this Random Forest model will have a higher accuracy than the decision tree models while also showing the contribution of the other predictors in the dataset.
The model algorithm changed from a single decision tree to a random forest with many trees. Since this is the first Random Forest experiment, I will leave all of the parameters at their defaults (ntree set to 500) and see what changes need to be made based on the results. The same train-test split and the exclusion of the poutcome variable will be maintained, consistent with the previous decision tree experiments.
The model’s performance will be measured using the training and test accuracy. We will also do a visual inspection of the variable importance plot to see which predictors are influencing the predictions and whether that makes sense intuitively.
# Train a Random Forest model with 500 trees, excluding the 'poutcome' variable
rf_model <- randomForest(y ~ . - poutcome,
data = trainData,
ntree = 500,
importance = TRUE)
# Make predictions on the training data
rf_train_preds <- predict(rf_model, trainData, type = "class")
# Make predictions on the test data
rf_test_preds <- predict(rf_model, testData, type = "class")
# Evaluate performance on the training set
rf_cm_train <- confusionMatrix(rf_train_preds, trainData$y)
rf_train_acc <- rf_cm_train$overall["Accuracy"]
print(paste("Training Accuracy - Random Forest:", round(rf_train_acc, 4)))
## [1] "Training Accuracy - Random Forest: 0.9964"
# Evaluate performance on the test set
rf_cm_test <- confusionMatrix(rf_test_preds, testData$y)
rf_test_acc <- rf_cm_test$overall["Accuracy"]
print(paste("Test Accuracy - Random Forest:", round(rf_test_acc, 4)))
## [1] "Test Accuracy - Random Forest: 0.9008"
# Plot variable importance to see which features matter most
varImpPlot(rf_model, main = "Random Forest Variable Importance")
In this experiment, the Random Forest model achieved an extremely high training accuracy of 99.64% and a test accuracy of 90.08%, showing that my initial hypothesis was correct. Although the training accuracy is almost perfect, it does invite the possibility of overfitting. The test performance indicates the model still generalizes reasonably well to unseen data, but there is a noticeable gap between the training and testing accuracy (about 9.6 percentage points) that I find a bit concerning.
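Another way to gauge generalization without touching the test set (a sketch using the fitted rf_model) is the out-of-bag error that randomForest tracks during training, which is typically far less optimistic than re-predicting on the training data:
# Sketch: out-of-bag (OOB) error estimate from the fitted forest
oob_error <- rf_model$err.rate[rf_model$ntree, "OOB"]
print(paste("OOB error estimate:", round(oob_error, 4)))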
The variable importance plot reveals that duration is the most influential predictor. The MeanDecreaseAccuracy shows that duration and month are not the only variables that contribute to the accuracy of the model. Removing the housing, day, contact, age, or pdays variables has almost the same impact on the model’s accuracy as removing the month variable. These findings suggest that the Random Forest not only produces a more accurate model but also incorporates a broader set of predictors compared to the single decision tree.
As mentioned previously, there is a roughly 10 percentage point gap between the training and testing accuracy, which usually indicates overfitting and may warrant adjustments. One way to reduce overfitting is to increase regularization by reducing the number of trees (ntree), increasing the minimum node size (nodesize), or reducing the maximum tree depth. Based on these findings, I would recommend reducing the number of trees to 300 to see if the test accuracy increases. If there is still a large gap between the training and testing accuracy, I would apply the other methods to reduce overfitting, working step by step.
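As a rough illustration of these regularization options (a sketch with illustrative, untuned values), the nodesize and maxnodes arguments of randomForest can be combined with a smaller ntree:
# Sketch: a more regularized Random Forest (illustrative values, not tuned)
rf_regularized <- randomForest(y ~ . - poutcome,
                               data = trainData,
                               ntree = 300,     # fewer trees
                               nodesize = 10,   # larger terminal nodes -> shallower trees
                               maxnodes = 64,   # cap on terminal nodes per tree
                               importance = TRUE)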
results_table <- rbind(results_table,
data.frame(
Algorithm = "Random Forest",
Experiment = "Exp 3 (ntree = 500)",
Training_Accuracy = as.numeric(rf_train_acc) * 100,
Test_Accuracy = as.numeric(rf_test_acc) * 100
)
)
kable(results_table, digits = 2)
Algorithm | Experiment | Training_Accuracy | Test_Accuracy |
---|---|---|---|
Decision Tree | Exp 1 | 89.42 | 89.45 |
Decision Tree | Exp 2 (minsplit = 10) | 89.42 | 89.45 |
Random Forest | Exp 3 (ntree = 500) | 99.64 | 90.08 |
The goal of this experiment is to assess whether reducing the number of trees in the Random Forest will lower overfitting from the previous model. By lowering the number of trees from 500 to 200, the model will be less prone to overfitting. Although this might reduce the training accuracy, it is expected to improve, or at least maintain, the test accuracy.
The main change in this experiment is a reduction in the number of trees (ntree) in the Random Forest from 500 to 200. The dataset, train-test split, and the exclusion of the poutcome variable will remain unchanged.
The evaluation metrics will be the same as before: the training and test accuracy will be recorded. Also, a variable importance plot will be visually inspected to help verify which predictors are driving the predictions.
# Train a Random Forest model with reduced number of trees (ntree = 200)
rf_model_reduced <- randomForest(y ~ . - poutcome,
data = trainData,
ntree = 200,
importance = TRUE)
# Make predictions on the training data
rf_reduced_train_preds <- predict(rf_model_reduced, trainData, type = "class")
# Make predictions on the test data
rf_reduced_test_preds <- predict(rf_model_reduced, testData, type = "class")
# Evaluate performance on the training set
rf_reduced_cm_train <- confusionMatrix(rf_reduced_train_preds, trainData$y)
rf_reduced_train_acc <- rf_reduced_cm_train$overall["Accuracy"]
print(paste("Training Accuracy - Reduced ntree Random Forest:", round(rf_reduced_train_acc, 4)))
## [1] "Training Accuracy - Reduced ntree Random Forest: 0.9958"
# Evaluate performance on the test set
rf_reduced_cm_test <- confusionMatrix(rf_reduced_test_preds, testData$y)
rf_reduced_test_acc <- rf_reduced_cm_test$overall["Accuracy"]
print(paste("Test Accuracy - Reduced ntree Random Forest:", round(rf_reduced_test_acc, 4)))
## [1] "Test Accuracy - Reduced ntree Random Forest: 0.901"
varImpPlot(rf_model_reduced, main = "Variable Importance - Reduced ntree Random Forest")
In this experiment, the Random Forest model with a reduced number of trees achieved a training accuracy of 99.58% and a test accuracy of 90.10%. Although the gap between the training and test accuracy got slightly smaller, it is still roughly 9.5 percentage points, indicating that even with fewer trees the model is still overfitting to the training data.
The variable importance plot shows that duration is still the key predictor in the decision-making process. The day and housing variables have roughly the same MeanDecreaseAccuracy as month, but duration remains by far the greatest contributor to the accuracy of the Random Forest model. The hypothesis was partially correct in that reducing the number of trees to 200 did slightly improve the test accuracy.
Based on these findings, I would recommend further measures to mitigate overfitting and improve generalization. Given the persistent gap between training and test accuracy, it is worth exploring adjustments to other hyperparameters, such as increasing the minimum node size (nodesize) or reducing the maximum tree depth. Additionally, incorporating techniques like cross-validation and oversampling the minority class may provide a more balanced view of model performance.
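A minimal sketch of what cross-validating the Random Forest with caret could look like (illustrative only; by default method = "rf" tunes only mtry, and 5-fold CV is an assumption):
# Sketch: 5-fold cross-validation of a Random Forest via caret
ctrl <- trainControl(method = "cv", number = 5)
rf_cv <- train(y ~ . - poutcome,
               data = trainData,
               method = "rf",
               ntree = 200,
               trControl = ctrl)
print(rf_cv)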
results_table <- rbind(results_table,
data.frame(
Algorithm = "Random Forest",
Experiment = "Exp 4 (ntree = 200)",
Training_Accuracy = as.numeric(rf_reduced_train_acc) * 100,
Test_Accuracy = as.numeric(rf_reduced_test_acc) * 100
)
)
kable(results_table, digits = 2)
Algorithm | Experiment | Training_Accuracy | Test_Accuracy |
---|---|---|---|
Decision Tree | Exp 1 | 89.42 | 89.45 |
Decision Tree | Exp 2 (minsplit = 10) | 89.42 | 89.45 |
Random Forest | Exp 3 (ntree = 500) | 99.64 | 90.08 |
Random Forest | Exp 4 (ntree = 200) | 99.58 | 90.10 |
Experiment 5 - XGBoost
The goal of this experiment is to evaluate the performance of Extreme Gradient Boosting (XGBoost) using its default parameters. This experiment serves as a baseline for how well gradient boosting can capture complex interactions among predictors in comparison to the previous models. I chose XGBoost over AdaBoost for this dataset because XGBoost is known for its robustness and efficient handling of large and imbalanced data. By employing XGBoost with default parameters, we expect to achieve higher training and test accuracy than the Random Forest.
The objective of this experiment is to create a baseline performance for XGBoost with the default parameters (max_depth = 6, eta = 0.3) and nrounds set to 100. Like the previous models, the dataset will remain the same, as will the train-test split (70/30) and the exclusion of the poutcome variable.
The evaluation metrics will also be the same as for the previous models. Measuring training and test accuracy for all models keeps the comparison uniform and is a good way to evaluate which model performed best. Again, there will be a visual inspection to assess which predictors are driving the model’s decisions.
# Prepare the training data: create model matrix and convert the target variable to numeric (0/1)
train_matrix <- model.matrix(y ~ . - poutcome, data = trainData)[, -1]
train_label <- as.numeric(trainData$y) - 1 # assuming 'y' factor levels are "no" and "yes"
# Prepare the test data similarly
test_matrix <- model.matrix(y ~ . - poutcome, data = testData)[, -1]
test_label <- as.numeric(testData$y) - 1
# Create DMatrix objects for XGBoost training and testing
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)
# Set XGBoost parameters to their default-ish values
# (Note: Default parameters for xgboost in xgb.train are approximately max_depth=6, eta=0.3, subsample=1, colsample_bytree=1)
params <- list(
objective = "binary:logistic",
eval_metric = "error"
)
# Train the XGBoost model with 100 rounds using default parameters
set.seed(123)
xgb_model <- xgb.train(params = params, data = dtrain, nrounds = 100, verbose = 0)
# Make predictions on the training data
train_preds <- predict(xgb_model, dtrain)
train_preds_class <- ifelse(train_preds > 0.5, 1, 0)
# Make predictions on the test data
test_preds <- predict(xgb_model, dtest)
test_preds_class <- ifelse(test_preds > 0.5, 1, 0)
# Evaluate performance on the training set
cm_train <- confusionMatrix(as.factor(train_preds_class), as.factor(train_label))
xgb_train_acc <- cm_train$overall["Accuracy"]
print(paste("Training Accuracy - XGBoost:", round(as.numeric(xgb_train_acc), 4)))
## [1] "Training Accuracy - XGBoost: 0.9553"
# Evaluate performance on the test set
cm_test <- confusionMatrix(as.factor(test_preds_class), as.factor(test_label))
xgb_test_acc <- cm_test$overall["Accuracy"]
print(paste("Test Accuracy - XGBoost:", round(as.numeric(xgb_test_acc), 4)))
## [1] "Test Accuracy - XGBoost: 0.904"
# Plot variable importance
importance_matrix <- xgb.importance(model = xgb_model)
xgb.plot.importance(importance_matrix, main = "XGBoost Variable Importance")
The XGBoost model achieved a training accuracy of 95.53% and a test accuracy of 90.40%. The training-to-test accuracy gap is 5.13 percentage points, which is notably smaller than in the previous Random Forest models. This indicates that XGBoost exhibits less overfitting and generalizes better to unseen data. As shown in the variable importance plot, duration remains by far the most influential predictor. However, in this model day, rather than month, is the second most influential predictor. This seems inconsistent with intuition; it is odd that the day of the month would have much impact on the target variable. Still, this result confirms the hypothesis that a boosting method like XGBoost maintains strong predictive performance while also drawing on predictors that were underutilized in the single decision tree and Random Forest models.
Overall, I am satisfied with the performance of the XGBoost model. To further confirm the robustness of these results, I would recommend using cross validation. Incorporating cross validation will provide a more reliable estimate of the model’s generalizability and help fine-tune parameters to potentially further reduce any remaining overfitting.
results_table <- rbind(results_table,
data.frame(
Algorithm = "XGBoost",
Experiment = "Exp 5 (Default Params)",
Training_Accuracy = as.numeric(xgb_train_acc) * 100,
Test_Accuracy = as.numeric(xgb_test_acc) * 100
)
)
kable(results_table, digits = 2)
Algorithm | Experiment | Training_Accuracy | Test_Accuracy |
---|---|---|---|
Decision Tree | Exp 1 | 89.42 | 89.45 |
Decision Tree | Exp 2 (minsplit = 10) | 89.42 | 89.45 |
Random Forest | Exp 3 (ntree = 500) | 99.64 | 90.08 |
Random Forest | Exp 4 (ntree = 200) | 99.58 | 90.10 |
XGBoost | Exp 5 (Default Params) | 95.53 | 90.40 |
This experiment will evaluate XGBoost using 5-fold cross-validation to obtain a more robust estimate of the model’s performance. Cross-validation will be used to fine-tune the boosting process, specifically optimizing the number of rounds to reduce overfitting. By using cross-validation with early stopping, I hypothesize that the gap between the training and test accuracy will be lower than in the previous XGBoost model. The experiment will also identify the optimal number of boosting rounds needed to prevent unnecessary overfitting.
Instead of relying solely on a fixed 70/30 train-test split, this experiment will employ 5-fold cross-validation with early stopping to assess model performance. While the previous XGBoost model was trained for 100 rounds, this experiment will automatically determine the optimal number of boosting rounds based on cross-validation performance. The dataset and the exclusion of the poutcome variable will remain unchanged.
The model’s performance will be assessed using the training and test accuracy of the final model. Also, a plot will be used to interpret which predictors drive the model’s decisions.
# Prepare the training data: create a model matrix and convert target variable to numeric (0/1)
train_matrix <- model.matrix(y ~ . - poutcome, data = trainData)[, -1]
train_label <- as.numeric(trainData$y) - 1 # Assuming 'y' factor levels ("no", "yes") are converted to 0/1
# Create DMatrix for XGBoost
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
# Set XGBoost parameters (using defaults for other parameters)
params <- list(
objective = "binary:logistic",
eval_metric = "error" # error: fraction of misclassified instances
)
# Perform 5-fold cross-validation with early stopping
set.seed(123)
cv_results <- xgb.cv(
params = params,
data = dtrain,
nrounds = 100,
nfold = 5,
early_stopping_rounds = 10,
verbose = 1,
maximize = FALSE
)
## [1] train-error:0.097033+0.000735 test-error:0.106859+0.004008
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 10 rounds.
##
## [2] train-error:0.093984+0.001084 test-error:0.104490+0.003100
## [3] train-error:0.092309+0.001265 test-error:0.103921+0.004052
## [4] train-error:0.090548+0.001254 test-error:0.103194+0.004057
## [5] train-error:0.089932+0.001116 test-error:0.103289+0.004389
## [6] train-error:0.088834+0.001509 test-error:0.103289+0.003301
## [7] train-error:0.088028+0.001147 test-error:0.103100+0.003091
## [8] train-error:0.087625+0.001678 test-error:0.103099+0.003738
## [9] train-error:0.086338+0.001321 test-error:0.103415+0.002664
## [10] train-error:0.085666+0.001569 test-error:0.103036+0.003662
## [11] train-error:0.084781+0.001534 test-error:0.101930+0.003272
## [12] train-error:0.083905+0.001761 test-error:0.101425+0.004103
## [13] train-error:0.083375+0.001926 test-error:0.100761+0.004382
## [14] train-error:0.082230+0.001594 test-error:0.100035+0.004575
## [15] train-error:0.081424+0.001179 test-error:0.099908+0.004475
## [16] train-error:0.080058+0.001668 test-error:0.099087+0.004233
## [17] train-error:0.078944+0.001416 test-error:0.098771+0.003859
## [18] train-error:0.078446+0.001557 test-error:0.098455+0.003314
## [19] train-error:0.077554+0.001352 test-error:0.098518+0.003918
## [20] train-error:0.076669+0.001223 test-error:0.097791+0.003744
## [21] train-error:0.075745+0.001489 test-error:0.097886+0.003329
## [22] train-error:0.075129+0.001477 test-error:0.098044+0.003539
## [23] train-error:0.074007+0.001323 test-error:0.097760+0.003304
## [24] train-error:0.072877+0.001469 test-error:0.098044+0.002957
## [25] train-error:0.072080+0.001481 test-error:0.097823+0.002741
## [26] train-error:0.071471+0.001458 test-error:0.098265+0.003490
## [27] train-error:0.070824+0.001358 test-error:0.097633+0.003298
## [28] train-error:0.070144+0.001331 test-error:0.097760+0.003150
## [29] train-error:0.069552+0.001232 test-error:0.098107+0.004116
## [30] train-error:0.068683+0.001346 test-error:0.097918+0.003661
## [31] train-error:0.067933+0.001309 test-error:0.097696+0.003545
## [32] train-error:0.067372+0.001397 test-error:0.097570+0.003722
## [33] train-error:0.066440+0.001154 test-error:0.097033+0.003720
## [34] train-error:0.065752+0.001509 test-error:0.097128+0.003482
## [35] train-error:0.065350+0.001518 test-error:0.096749+0.003582
## [36] train-error:0.064686+0.001449 test-error:0.096306+0.003877
## [37] train-error:0.064015+0.001402 test-error:0.096369+0.004319
## [38] train-error:0.063786+0.001313 test-error:0.096180+0.004319
## [39] train-error:0.063454+0.001295 test-error:0.096338+0.003977
## [40] train-error:0.062822+0.001114 test-error:0.096148+0.004491
## [41] train-error:0.062427+0.001147 test-error:0.096148+0.004387
## [42] train-error:0.061992+0.001193 test-error:0.096053+0.004157
## [43] train-error:0.061274+0.000852 test-error:0.096306+0.004078
## [44] train-error:0.060713+0.000813 test-error:0.096338+0.003865
## [45] train-error:0.060073+0.000823 test-error:0.096527+0.003888
## [46] train-error:0.059757+0.000832 test-error:0.096622+0.003287
## [47] train-error:0.059528+0.000710 test-error:0.096591+0.003232
## [48] train-error:0.058612+0.000564 test-error:0.096875+0.003146
## [49] train-error:0.058193+0.000491 test-error:0.097159+0.003171
## [50] train-error:0.057822+0.000746 test-error:0.097096+0.003171
## [51] train-error:0.057277+0.000438 test-error:0.097033+0.003061
## [52] train-error:0.056740+0.000615 test-error:0.096843+0.003263
## Stopping. Best iteration:
## [42] train-error:0.061992+0.001193 test-error:0.096053+0.004157
# The cv_results object shows the best iteration based on test error.
best_nrounds <- cv_results$best_iteration
cat("Best number of rounds from CV:", best_nrounds, "\n")
## Best number of rounds from CV: 42
# Retrain the final XGBoost model using the optimal number of boosting rounds
xgb_model_cv <- xgb.train(
params = params,
data = dtrain,
nrounds = best_nrounds,
verbose = 0
)
# Prepare test data similarly
test_matrix <- model.matrix(y ~ . - poutcome, data = testData)[, -1]
test_label <- as.numeric(testData$y) - 1
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)
# Make predictions on the training data
train_preds <- predict(xgb_model_cv, dtrain)
train_preds_class <- ifelse(train_preds > 0.5, 1, 0)
# Make predictions on the test data
test_preds <- predict(xgb_model_cv, dtest)
test_preds_class <- ifelse(test_preds > 0.5, 1, 0)
# Evaluate performance on the training set
cm_train <- confusionMatrix(as.factor(train_preds_class), as.factor(train_label))
xgb_train_acc <- cm_train$overall["Accuracy"]
cat("Training Accuracy - XGBoost (CV):", round(as.numeric(xgb_train_acc), 4), "\n")
## Training Accuracy - XGBoost (CV): 0.9333
# Evaluate performance on the test set
cm_test <- confusionMatrix(as.factor(test_preds_class), as.factor(test_label))
xgb_test_acc <- cm_test$overall["Accuracy"]
cat("Test Accuracy - XGBoost (CV):", round(as.numeric(xgb_test_acc), 4), "\n")
## Test Accuracy - XGBoost (CV): 0.9039
# Plot variable importance
importance_matrix <- xgb.importance(model = xgb_model_cv)
xgb.plot.importance(importance_matrix, main = "XGBoost (CV) Variable Importance")
XGBoost was tuned using 5-fold cross-validation with early stopping, yielding an optimal model at 42 boosting rounds. At this iteration, the model achieved a training accuracy of 93.33% (train-error = 0.0620) and a test accuracy of 90.39% (test-error = 0.0961). The relatively small 3 percentage point gap between training and test performance demonstrates that the cross-validated approach has successfully reduced overfitting compared to previous models. This indicates that the model generalizes well to unseen data, and it confirms our hypothesis that incorporating cross-validation can yield a more robust, balanced estimator by optimizing the boosting rounds. Also, the variable importance plot remains consistent with previous experiments. As before, the plot shows that duration is by far the most influential predictor.
Based on these findings, I recommend adopting the XGBoost configuration using cross validation, as it has demonstrated a robust and balanced performance with only a modest gap between training and test accuracies. However, to further improve the model and lower the errors, additional hyperparameter tuning can be performed. For example, experimenting with different maximum tree depths, different learning rates, and regularization parameters such as lambda and alpha can further narrow the gap and reduce both training and test error. In addition, it may also be beneficial to oversample the minority class to ensure that the model fully uses all the available predictors.
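A minimal sketch of such a follow-up tuning run (illustrative parameter values, not tuned) could reuse the same xgb.cv setup with a shallower depth, a smaller learning rate, and explicit L1/L2 penalties:
# Sketch: cross-validate a more regularized XGBoost configuration (illustrative values)
params_tuned <- list(
objective = "binary:logistic",
eval_metric = "error",
max_depth = 4,   # shallower trees than the default of 6
eta = 0.1,       # lower learning rate
lambda = 1.5,    # L2 regularization
alpha = 0.5      # L1 regularization
)
set.seed(123)
cv_tuned <- xgb.cv(params = params_tuned, data = dtrain, nrounds = 300,
nfold = 5, early_stopping_rounds = 10, verbose = 0)
cv_tuned$best_iteration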
results_table <- rbind(results_table,
data.frame(
Algorithm = "XGBoost (CV)",
Experiment = "Exp 6",
Training_Accuracy = as.numeric(xgb_train_acc) * 100,
Test_Accuracy = as.numeric(xgb_test_acc) * 100
)
)
# Display the results table
kable(results_table, digits = 2)
Algorithm | Experiment | Training_Accuracy | Test_Accuracy |
---|---|---|---|
Decision Tree | Exp 1 | 89.42 | 89.45 |
Decision Tree | Exp 2 (minsplit = 10) | 89.42 | 89.45 |
Random Forest | Exp 3 (ntree = 500) | 99.64 | 90.08 |
Random Forest | Exp 4 (ntree = 200) | 99.58 | 90.10 |
XGBoost | Exp 5 (Default Params) | 95.53 | 90.40 |
XGBoost (CV) | Exp 6 | 93.33 | 90.39 |