Introduction

Continuing from the previous exploration of the bank marketing data, I will test and compare decision trees, random forests, and XGBoost to determine which model best predicts the target y, i.e., whether the customer subscribed to a term deposit.

library(tidyverse)
library(rpart)
library(rpart.plot)
library(caret)
library(randomForest)
library(xgboost)
bank_raw <- read.csv2(file="bank+marketing/bank/bank-full.csv")
bank <- bank_raw 
glimpse(bank)
## Rows: 45,211
## Columns: 17
## $ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job       <chr> "management", "technician", "entrepreneur", "blue-collar", "…
## $ marital   <chr> "married", "single", "married", "married", "single", "marrie…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", …
## $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
## $ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing   <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"…
## $ loan      <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
## $ contact   <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month     <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
## $ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome  <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …

The data will be preprocessed in the following ways: the “unknown” and “other” values in poutcome will be converted to NAs, so that the column cleanly encodes the previous campaign’s result, and the character columns will be converted to factors.

bank <- bank |>
  mutate(poutcome = na_if(poutcome, "unknown")) |>
  mutate(poutcome = na_if(poutcome, "other"))

chr_cols <- c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome", "y")
bank <- bank |> mutate(across(all_of(chr_cols), as.factor))

head(bank)
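
As a quick sanity check (a sketch; output not shown), poutcome should now be the only column containing NAs:

# count NAs per column; only poutcome should be non-zero after the
# na_if() conversions above
colSums(is.na(bank))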

Experiment

The data must first be partitioned into training and testing sets, at a standard 80/20 split. caret’s createDataPartition() samples within each class, so the proportions of y should be preserved in both sets, as the tables below confirm.

set.seed(123)

splitIndex <- createDataPartition(bank$y, p = 0.8, list = FALSE)

bank_train <- bank[splitIndex,]
bank_test <- bank[-splitIndex,]

round(prop.table(table(select(bank, y))), 2)
## y
##   no  yes 
## 0.88 0.12
round(prop.table(table(select(bank_train, y))), 2)
## y
##   no  yes 
## 0.88 0.12
round(prop.table(table(select(bank_test, y))), 2)
## y
##   no  yes 
## 0.88 0.12

After each experiment, the results will be documented in the following dataframe:

experiment_log <- data.frame(
  ID = integer(),
  Model = character(),
  Features = character(),
  Hyperparameters = character(),
  Train = numeric(),
  Test = numeric(),
  Notes = character(),
  stringsAsFactors = FALSE
)

Decision Trees

Experiment 1:

Objective: Decision trees recursively partition the data, splitting on the feature that best separates the classes at each step, so the goal here is to identify the variable with the greatest predictive power.

Variations: This is the first experiment, so it will be valuable to simply perform the test on the data as-is for a starting point.

Evaluation: A confusion matrix can be produced to view predicted values against the actual values, then used to calculate the predictive accuracy.

Experiment:

set.seed(123)
bank_dt1 <- rpart(y ~ ., method = "class", data = bank_train)

rpart.plot(bank_dt1)

# predict and evaluate on training data
dt1_train_pred <- predict(bank_dt1, bank_train, type = "class")
dt1_train_cm <- confusionMatrix(dt1_train_pred, bank_train$y)
dt1_train_cm$overall["Accuracy"]
##  Accuracy 
## 0.9002212
# predict and evaluate on testing data
dt1_test_pred <- predict(bank_dt1, bank_test, type = "class")
dt1_test_cm <- confusionMatrix(dt1_test_pred, bank_test$y)
dt1_test_cm$overall["Accuracy"]
##  Accuracy 
## 0.8949231

Review: The accuracy of this first model is fairly high, at roughly 90% on both the training and testing data, and the feature determined to be most predictive of y is duration. If the last contact with the customer lasted under 522 seconds, the predicted probability that the customer subscribed to a term deposit is only 8%. Interestingly, the decision tree selected only features related to marketing: duration, poutcome, and pdays.
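
For a numeric view of the same ranking, rpart stores per-variable importance scores on the fitted object; a minimal sketch using the bank_dt1 model above (output not shown):

# rpart's built-in importance scores, accumulated over primary and
# surrogate splits; larger values mean a larger contribution to the tree
round(bank_dt1$variable.importance, 1)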

dt1_log <- data.frame(
  ID = 1,
  Model = "Decision Tree",
  Features = "duration, poutcome, pdays",
  Hyperparameters = "none",
  Train = 0.90,
  Test = 0.89,
  Notes = "marketing features only"
)

experiment_log <- bind_rows(experiment_log, dt1_log)

Experiment 2:

Objective: To see whether a different set of features is selected for predictive power once the data is altered.

Variations: Feature selection will be adjusted; the duration column will be dropped.

Evaluation: For consistency, we will use the accuracy measure from the confusion matrix.

Experiment:

set.seed(123)

bank_sub_train <- bank_train |> select(!duration)
bank_sub_test <- bank_test |> select(!duration)

bank_dt2 <- rpart(y ~ ., method = "class", data = bank_sub_train)

rpart.plot(bank_dt2)

# predict and evaluate on training data
dt2_train_pred <- predict(bank_dt2, bank_sub_train, type = "class")
dt2_train_cm <- confusionMatrix(dt2_train_pred, bank_sub_train$y)
dt2_train_cm$overall["Accuracy"]
##  Accuracy 
## 0.8920652
# predict and evaluate on testing data
dt2_test_pred <- predict(bank_dt2, bank_sub_test, type = "class")
dt2_test_cm <- confusionMatrix(dt2_test_pred, bank_sub_test$y)
dt2_test_cm$overall["Accuracy"]
## Accuracy 
## 0.889614

Review: The accuracy barely changed, and this tree is similar in structure to the first experiment’s, indicating low variance overall. However, the model again chose only marketing-related features, which may ultimately be of limited use for business decision-making (e.g., selecting which customers to target).
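
One way to probe that low-variance claim is rpart’s complexity table, which reports cross-validated error at each stage of the tree; a sketch using bank_dt2 (output not shown):

# printcp() lists the complexity parameter (CP), number of splits, and
# cross-validated relative error (xerror) for each subtree
printcp(bank_dt2)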

dt2_log <- data.frame(
  ID = 2,
  Model = "Decision Tree",
  Features = "poutcome, pdays",
  Hyperparameters = "none",
  Train = 0.89,
  Test = 0.89,
  Notes = "dropped duration, minimal changes"
)

experiment_log <- bind_rows(experiment_log, dt2_log)

Random Forests

Experiment 3:

Objective: Random forests combine bagging with random feature selection at each split, so the algorithm should draw on different dimensions of the data rather than relying on a few dominant features.

Variations:

  • Imputation: random forests require handling of NA values, and na.roughfix() can be used as a starting point (see the sketch just below). Per the documentation, this method of imputation replaces the missing data accordingly: “For numeric variables, NAs are replaced with column medians. For factor variables, NAs are replaced with the most frequent levels (breaking ties at random). If object contains no NAs, it is returned unaltered.”

  • Number of trees: ntree is set to 100.
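
A minimal illustration of that behavior on the one column with NAs (a sketch; output not shown): the data.frame method of na.roughfix() should fold poutcome’s NAs into its most frequent level.

# level counts before and after imputation; the NA count should move into
# whichever of the remaining levels is most frequent
table(bank_train$poutcome, useNA = "ifany")
table(na.roughfix(bank_train["poutcome"])$poutcome)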

Evaluation: We will continue to use a confusion matrix, as well as the importance ranking of the features.

Experiment:

set.seed(123)

bank_rf1 <- randomForest(y ~ .,
                         data = bank_train,
                         ntree = 100,
                         na.action = na.roughfix)

importance(bank_rf1)
##           MeanDecreaseGini
## age              649.09374
## job              492.80733
## marital          145.62751
## education        185.48514
## default           13.96478
## balance          691.34660
## housing          131.72743
## loan              58.41977
## contact          138.05341
## day              583.41136
## month            877.18315
## duration        2074.21176
## campaign         259.83320
## pdays            324.76137
## previous         159.38886
## poutcome         504.25981
# predict and evaluate on training data
rf1_train_pred <- predict(bank_rf1, bank_train, type = "class")
rf1_train_cm <- confusionMatrix(rf1_train_pred, bank_train$y)
rf1_train_cm$overall["Accuracy"]
## Accuracy 
##        1
# predict and evaluate on testing data
rf1_test_pred <- predict(bank_rf1, bank_test, type = "class")
rf1_test_cm <- confusionMatrix(rf1_test_pred, bank_test$y)
rf1_test_cm$overall["Accuracy"]
## Accuracy 
## 0.852707

Review: The accuracy on the training set is 100% against a much lower 0.85 on the testing data, which indicates overfitting. Aside from duration again carrying the highest importance, as in the decision trees, the next highest-ranked features are month, balance, and age.
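
randomForest also ships visual diagnostics that complement these numbers; a sketch using the fitted bank_rf1 (plots not shown):

# dot chart of the MeanDecreaseGini ranking printed above
varImpPlot(bank_rf1, main = "bank_rf1 variable importance")
# OOB error as trees are added, to judge whether ntree = 100 is enough
plot(bank_rf1)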

rf1_log <- data.frame(
  ID = 3,
  Model = "Random Forest",
  Features = "all, with different ranking order from decision trees after 'duration'",
  Hyperparameters = "impute method, number of trees",
  Train = 1.00,
  Test = 0.85,
  Notes = "overfitting"
)

experiment_log <- bind_rows(experiment_log, rf1_log)

Experiment 4:

Objective: For the second random forest model, hyperparameters will be tuned with the aim of reducing overfitting.

Variations: The nodesize will be raised from the default of 1 to 5 to prevent very small leaves, and mtry will be reduced to 1 to limit the number of variables randomly sampled at each split. (An out-of-bag alternative to picking mtry by hand is sketched below.)
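
Rather than hand-picking mtry, the package’s OOB-based search could also be consulted; a sketch, assuming the objects above and imputing up front because tuneRF() takes x and y separately:

# search mtry by doubling/halving from the default, keeping a step only if
# it improves OOB error by at least 1%; column 17 is the target y
bank_imputed <- na.roughfix(bank_train)
set.seed(123)
tuneRF(x = bank_imputed[, -17], y = bank_imputed$y,
       ntreeTry = 100, stepFactor = 2, improve = 0.01)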

Evaluation: Both the confusion matrix and importance ranking will again be generated.

Experiment:

set.seed(123)

bank_rf2 <- randomForest(y ~ .,
                         data = bank_train,
                         ntree = 100,
                         mtry = 1,
                         nodesize = 5,
                         na.action = na.roughfix)

importance(bank_rf2)
##           MeanDecreaseGini
## age              98.966817
## job              72.274013
## marital          25.173125
## education        29.218580
## default           2.881819
## balance          65.588576
## housing          73.887030
## loan             18.202610
## contact          59.522897
## day              55.201453
## month           237.243081
## duration        479.610754
## campaign         36.825847
## pdays           123.013328
## previous         78.568812
## poutcome        220.029001
# predict and evaluate on training data
rf2_train_pred <- predict(bank_rf2, bank_train, type = "class")
rf2_train_cm <- confusionMatrix(rf2_train_pred, bank_train$y)
rf2_train_cm$overall["Accuracy"]
##  Accuracy 
## 0.8450349
# predict and evaluate on testing data
rf2_test_pred <- predict(bank_rf2, bank_test, type = "class")
rf2_test_cm <- confusionMatrix(rf2_test_pred, bank_test$y)
rf2_test_cm$overall["Accuracy"]
##  Accuracy 
## 0.8128981

Review: The overfitting of the previous experiment was mitigated, but as expected, lowering the variance came at a cost to accuracy. The order of importance also shifted to more closely match the features the decision tree experiments chose.
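
Printing the two fitted forests gives their out-of-bag (OOB) error estimates, a useful training-free point of comparison between the models (a sketch; output not shown):

# the print method reports the OOB estimate of the error rate plus a
# confusion matrix computed from out-of-bag votes
bank_rf1
bank_rf2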

rf2_log <- data.frame(
  ID = 4,
  Model = "Random Forest",
  Features = "ranked 'duration', 'month', and 'poutcome'",
  Hyperparameters = "leaf size, number of features randomly sampled",
  Train = 0.85,
  Test = 0.81,
  Notes = "less accurate, lowered variance"
)

experiment_log <- bind_rows(experiment_log, rf2_log)

XGBoost

Experiment 5:

Objective: The boosting ensemble model XGBoost will be tested to compare accuracy and generate an importance matrix to rank the features’ effects on the predictions.

Variations: The data will be transformed via one-hot encoding, expanding each factor column into numeric 0/1 indicator columns, since xgboost only accepts numeric input.

Evaluation: The importance rankings will be plotted visually and the confusion matrix will continue to be utilized.

Experiment:

# convert the target column from factor to a numeric label for xgboost,
# and apply one-hot encoding to the rest of the features
bank_train2 <- bank_train
bank_train2$y <- as.numeric(bank_train2$y) - 1
bank_var_train <- dummyVars(" ~ .", data = bank_train2[c(-17)])
bank_var_train <- data.frame(predict(bank_var_train, newdata = bank_train2))

bank_test2 <- bank_test
bank_test2$y <- as.numeric(bank_test2$y) - 1
bank_var_test <- dummyVars(" ~ .", data = bank_test2[c(-17)])
bank_var_test <- data.frame(predict(bank_var_test, newdata = bank_test2))
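
# sanity check: dummyVars was fitted separately on train and test, which
# only works if every factor level appears in both splits; this guards
# against silently misaligned columns
stopifnot(identical(colnames(bank_var_train), colnames(bank_var_test)))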

# convert the data to a matrix
train_matrix <- xgb.DMatrix(data = as.matrix(bank_var_train), label = bank_train2$y)
test_matrix  <- xgb.DMatrix(data = as.matrix(bank_var_test), label = bank_test2$y)

set.seed(123)

bank_xg1 <- xgboost(data = train_matrix,
                    objective = "binary:logistic",
                    nrounds = 100,
                    verbose = 0)

importance_matrix <- xgb.importance(model = bank_xg1)
xgb.plot.importance(importance_matrix)

# predict and evaluate on training data
xg1_train_pred <- predict(bank_xg1, train_matrix)

# convert probabilities to binary yes/no at threshold = 0.5,
# then to factors for the confusion matrix
xg1_train_pred_factor <- ifelse(xg1_train_pred > 0.5, "yes", "no")
xg1_train_pred_factor <- factor(xg1_train_pred_factor, levels = c("no", "yes"))
xg1_train_cm <- confusionMatrix(xg1_train_pred_factor, bank_train$y, positive = "yes")
xg1_train_cm$overall["Accuracy"]
##  Accuracy 
## 0.9556815
# predict and evaluate on testing data
xg1_test_pred <- predict(bank_xg1, test_matrix)
xg1_test_pred_factor <- ifelse(xg1_test_pred > 0.5, "yes", "no")
xg1_test_pred_factor <- factor(xg1_test_pred_factor, levels = c("no", "yes"))
xg1_test_cm <- confusionMatrix(xg1_test_pred_factor, bank_test$y, positive = "yes")
xg1_test_cm$overall["Accuracy"]
##  Accuracy 
## 0.9070899

Review: The accuracy is high on both the training and testing data, though the model slightly overfit the training set. Considerably more data wrangling was necessary than with the previous algorithms, but the importance matrix allows for feature exploration (and re-confirms the predictive relationship between duration and y).
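
Given the 88/12 class imbalance noted at the start, accuracy alone can flatter a model; the same confusionMatrix object also carries class-level metrics worth a glance (a sketch; output not shown):

# with positive = "yes", Sensitivity is the recall on subscribers
xg1_test_cm$byClass[c("Sensitivity", "Specificity", "Balanced Accuracy")]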

xg1_log <- data.frame(
  ID = 5,
  Model = "XGBoost",
  Features = "all",
  Hyperparameters = "nrounds = 100, defaults",
  Train = 0.96,
  Test = 0.91,
  Notes = "duration ranked first"
)

experiment_log <- bind_rows(experiment_log, xg1_log)

Experiment 6:

Objective: Since accuracy is already high, the aim is to use k-fold cross-validation to reduce overfitting and test whether the model generalizes well to unseen data.

Variations: XGBoost has a built-in cross-validation function, into which the following will be set or updated from default:

  • the number of folds, or subsets for training and validation, will be 5

  • the gamma, a regularization parameter that controls the minimum loss reduction required to make a split and helps prune unnecessary splits in trees, will be changed from 0 to 10 because we have a large dataset with many weak features

  • the min_child_weight, another regularization parameter that controls minimum sum of instance weights in a child node, will be changed from 1 to 10, again because the dataset is large

Evaluation: The importance and confusion matrices can be compared to the previous experiment.

Experiment:

set.seed(123)

params2 <- list(
  objective = "binary:logistic",
  min_child_weight = 10,
  gamma = 10
)

bank_xg2 <- xgb.cv(data = train_matrix,
                   params = params2,
                   nrounds = 100,
                   nfold = 5,
                   early_stopping_rounds = 10,
                   verbose = 0)

best_nrounds <- bank_xg2$best_iteration
best_nrounds
## [1] 55
best_n <- bank_xg2$best_ntreelimit
best_n
## [1] 55
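
# (a sketch) xgb.cv also keeps a per-round evaluation log; it is worth a
# look here, before bank_xg2 is reassigned to the final model below
head(bank_xg2$evaluation_log)
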
set.seed(123)

bank_xg2 <- xgboost(data = train_matrix,
                    params = params2,
                    nrounds = best_nrounds,
                    verbose = 0)

importance_matrix <- xgb.importance(model = bank_xg2)
xgb.plot.importance(importance_matrix)

# predict and evaluate on training data
xg2_train_pred <- predict(bank_xg2, train_matrix, ntreelimit = best_n)
# convert probabilities back to binary yes/no at threshold = 0.5,
# then back to factors for the confusion matrix
xg2_train_pred_factor <- ifelse(xg2_train_pred > 0.5, "yes", "no")
xg2_train_pred_factor <- factor(xg2_train_pred_factor, levels = c("no", "yes"))
xg2_train_cm <- confusionMatrix(xg2_train_pred_factor, bank_train$y, positive = "yes")
xg2_train_cm$overall["Accuracy"]
##  Accuracy 
## 0.9410285
# predict and evaluate on testing data
xg2_test_pred <- predict(bank_xg2, test_matrix, ntreelimit = best_n)
xg2_test_pred_factor <- ifelse(xg2_test_pred > 0.5, "yes", "no")
xg2_test_pred_factor <- factor(xg2_test_pred_factor, levels = c("no", "yes"))
xg2_test_cm <- confusionMatrix(xg2_test_pred_factor, bank_test$y, positive = "yes")
xg2_test_cm$overall["Accuracy"]
##  Accuracy 
## 0.9096339

Review: The accuracy was once again barely affected, and the importance ranking did not change either; the model should generalize well to future unseen data. The greatest change was that using cross-validation to choose the hyperparameters nearly halved the number of boosting rounds, so the final model is more efficient while achieving similar predictive accuracy.

xg2_log <- data.frame(
  ID = 6,
  Model = "XGBoost",
  Features = "all",
  Hyperparameters = "k-fold cross-validation, gamma, minimum child weight, nrounds = 55",
  Train = 0.94,
  Test = 0.91,
  Notes = "boosting rounds reduced significanty, similar accuracy"
)

experiment_log <- bind_rows(experiment_log, xg2_log)

Conclusion

knitr::kable(experiment_log, format = "pipe", padding = 0)
|ID|Model|Features|Hyperparameters|Train|Test|Notes|
|--:|:--|:--|:--|--:|--:|:--|
|1|Decision Tree|duration, poutcome, pdays|none|0.90|0.89|marketing features only|
|2|Decision Tree|poutcome, pdays|none|0.89|0.89|dropped duration, minimal changes|
|3|Random Forest|all, with different ranking order from decision trees after 'duration'|impute method, number of trees|1.00|0.85|overfitting|
|4|Random Forest|ranked 'duration', 'month', and 'poutcome'|leaf size, number of features randomly sampled|0.85|0.81|less accurate, lowered variance|
|5|XGBoost|all|nrounds = 100, defaults|0.96|0.91|duration ranked first|
|6|XGBoost|all|k-fold cross-validation, gamma, minimum child weight, nrounds = 55|0.94|0.91|boosting rounds reduced significantly, similar accuracy|

The experiment log above summarizes the results of all six tests. The different algorithms allowed for exploration of the various ways to train and evaluate models for classification.

The first two experiments used similar decision trees, which split the data into branches based on the “best” feature for information gain, or reduction in uncertainty. This was the simplest algorithm to implement and interpret, with a plot generated to visualize the decision branches. Accuracy was fairly high and close between the training and testing datasets, and it did not vary much between the two experiments, but the results were of somewhat limited use for business logic: because the algorithm selects only the most important features and ignores the rest, the variables relating to the customers and their banking history were left out. The selected features also changed (though not unexpectedly) when only a single variable was removed from the data. Still, the efficiency of implementation and processing made this algorithm a worthwhile baseline; running many experiments with different variations would be relatively quick and easy.

The second algorithm tested was random forest, an ensemble method that builds multiple trees in parallel and aggregates their outputs to improve generalization. The first experiment overfit the training data and lost accuracy on the testing set. The second experiment lowered mtry (the number of variables randomly sampled as candidates at each split) from the classification default of sqrt(p), which is 4 with p = 16 predictors, to just 1, and raised nodesize from 1 to 5 to limit small leaves. This fixed the overfitting and balanced the importance rankings, but it decreased accuracy to the lowest of all the experiments. With the second, more regularized model, the use of aggregation would most likely allow the best generalization on unseen data; ultimately, though, random forest was the least successful algorithm for this dataset.

The final experiments used XGBoost, or extreme gradient boosting, also an ensemble method, but one that builds trees sequentially, with each round correcting the errors of the previous ones. These experiments returned the highest accuracy on both training and testing data but required the most work to implement: the categorical columns had to be one-hot encoded for modeling, and the predicted probabilities converted back to factor labels for interpretation and evaluation. XGBoost also took the longest to train and was the least computationally efficient. The final experiment used k-fold cross-validation to help determine the optimal hyperparameters, another lengthy step to code and process, but one that proved very useful in producing a more optimal final model. Based on my experiments, I would recommend XGBoost as the best choice for this dataset: it made highly accurate predictions with low variance between the two experiments.