Continuing from the previous data exploration of the banking and marketing information, I will be testing and comparing decision trees, random forests, and XGBoost to determine which is the optimal model for predicting y.
library(tidyverse)
library(rpart)
library(rpart.plot)
library(caret)
library(randomForest)
library(xgboost)
bank_raw <- read.csv2(file="bank+marketing/bank/bank-full.csv")
bank <- bank_raw
glimpse(bank)
## Rows: 45,211
## Columns: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job <chr> "management", "technician", "entrepreneur", "blue-collar", "…
## $ marital <chr> "married", "single", "married", "married", "single", "marrie…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", …
## $ default <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"…
## $ loan <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
## $ contact <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
The data will be preprocessed in the following ways: the “unknown” and “other” values in poutcome will be converted to NAs, so that poutcome reflects only the known results of the previous campaign; the columns containing strings will be converted to factors.
bank <- bank |>
mutate(poutcome = na_if(poutcome, "unknown")) |>
mutate(poutcome = na_if(poutcome, "other"))
chr_cols <- c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome", "y")
bank <- bank |> mutate(across(all_of(chr_cols), as.factor))
head(bank)
The data must first be partitioned into training and testing sets at a standard 80/20 split.
set.seed(123)
splitIndex <- createDataPartition(bank$y, p = 0.8, list = FALSE)
bank_train <- bank[splitIndex,]
bank_test <- bank[-splitIndex,]
round(prop.table(table(select(bank, y))), 2)
## y
## no yes
## 0.88 0.12
round(prop.table(table(select(bank_train, y))), 2)
## y
## no yes
## 0.88 0.12
round(prop.table(table(select(bank_test, y))), 2)
## y
## no yes
## 0.88 0.12
After each experiment, the results will be documented in the following dataframe:
experiment_log <- data.frame(
ID = integer(),
Model = character(),
Features = character(),
Hyperparameters = character(),
Train = numeric(),
Test = numeric(),
Notes = character(),
stringsAsFactors = FALSE
)
Objective: Since decision trees recursively partition data to determine the best features upon which to split, the goal here is to find the variable with the greatest predictive power.
Variations: This is the first experiment, so it will be valuable to simply perform the test on the data as-is for a starting point.
Evaluation: A confusion matrix can be produced to view predicted values against the actual values, then used to calculate the predictive accuracy.
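For reference, accuracy is simply the proportion of correct predictions, i.e. the sum of the confusion matrix diagonal over the total count. A minimal sketch with hypothetical prediction and truth vectors (illustration only, not part of the experiment):
# hypothetical vectors, used only to show how accuracy relates to the confusion matrix
pred_demo <- factor(c("no", "no", "yes", "no", "yes"), levels = c("no", "yes"))
truth_demo <- factor(c("no", "yes", "yes", "no", "no"), levels = c("no", "yes"))
cm_demo <- table(Predicted = pred_demo, Actual = truth_demo)
sum(diag(cm_demo)) / sum(cm_demo)                           # accuracy computed by hand
confusionMatrix(pred_demo, truth_demo)$overall["Accuracy"]  # same value via caret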
Experiment:
set.seed(123)
bank_dt1 <- rpart(y ~ ., method = "class", data = bank_train)
rpart.plot(bank_dt1)
# predict and evaluate on training data
dt1_train_pred <- predict(bank_dt1, bank_train, type = "class")
dt1_train_cm <- confusionMatrix(dt1_train_pred, bank_train$y)
dt1_train_cm$overall["Accuracy"]
## Accuracy
## 0.9002212
# predict and evaluate on testing data
dt1_test_pred <- predict(bank_dt1, bank_test, type = "class")
dt1_test_cm <- confusionMatrix(dt1_test_pred, bank_test$y)
dt1_test_cm$overall["Accuracy"]
## Accuracy
## 0.8949231
Review: The accuracy of this first model is fairly high, at roughly 90%, and the feature determined to be most predictive of y is duration. If the last contact with the customer lasted less than 522 seconds, the probability that the customer subscribed to a term deposit is only 8%. Interestingly, the decision tree selected only features related to marketing as most predictive: duration, poutcome and pdays.
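The split points and node probabilities referenced above can also be read directly from the fitted tree, without the plot; output is not reproduced here.
# print the tree's splits, node counts and class probabilities as text
print(bank_dt1)
# the complexity table summarizes how error changes with each split
printcp(bank_dt1)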
dt1_log <- data.frame(
ID = 1,
Model = "Decision Tree",
Features = "duration, poutcome, pdays",
Hyperparameters = "none",
Train = 0.90,
Test = 0.89,
Notes = "marketing features only"
)
experiment_log <- bind_rows(experiment_log, dt1_log)
Objective: To see whether a different set of features will be selected for predictive power once the data is altered.
Variations: Feature selection will be adjusted; the duration column will be dropped.
Evaluation: For consistency, we will use the accuracy measure from the confusion matrix.
Experiment:
set.seed(123)
bank_sub_train <- bank_train |> select(!duration)
bank_sub_test <- bank_test |> select(!duration)
bank_dt2 <- rpart(y ~ ., method = "class", data = bank_sub_train)
rpart.plot(bank_dt2)
# predict and evaluate on training data
dt2_train_pred <- predict(bank_dt2, bank_sub_train, type = "class")
dt2_train_cm <- confusionMatrix(dt2_train_pred, bank_sub_train$y)
dt2_train_cm$overall["Accuracy"]
## Accuracy
## 0.8920652
# predict and evaluate on testing data
dt2_test_pred <- predict(bank_dt2, bank_sub_test, type = "class")
dt2_test_cm <- confusionMatrix(dt2_test_pred, bank_sub_test$y)
dt2_test_cm$overall["Accuracy"]
## Accuracy
## 0.889614
Review: The accuracy barely changed and this tree appears similar to the first experiment, indicating low variance overall. However, only marketing-related features were again chosen by the model, which may ultimately be of limited use for business decision-making (e.g., customer selection).
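If the non-marketing variables are of interest despite not appearing in the tree, the fitted rpart object retains an importance score for every variable it evaluated; a short sketch (output not shown):
# importance scores cover variables considered at any split, not only those that
# ended up in the final tree, so customer-related features can be ranked as well
round(bank_dt2$variable.importance, 1)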
dt2_log <- data.frame(
ID = 2,
Model = "Decision Tree",
Features = "poutcome, pdays",
Hyperparameters = "none",
Train = 0.89,
Test = 0.89,
Notes = "dropped duration, minimal changes"
)
experiment_log <- bind_rows(experiment_log, dt2_log)
Objective: By switching to random forest, a bagging ensemble method that samples features at random at each split and aggregates many trees, the model should draw on different dimensions of the data than a single decision tree.
Variations:
1. Imputation: Random forests require handling of NA values, and na.roughfix() can be used as a starting point. Per the documentation, this method of imputation replaces the missing data accordingly: “For numeric variables, NAs are replaced with column medians. For factor variables, NAs are replaced with the most frequent levels (breaking ties at random). If object contains no NAs, it is returned unaltered.” A small illustration follows after this list.
2. Number of trees: ntree is set to 100.
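As a small, hypothetical illustration of how na.roughfix() behaves (toy values, not from the bank data):
# toy data frame: one numeric and one factor column, each with a missing value
toy <- data.frame(
  balance = c(100, NA, 300, 500),
  poutcome = factor(c("failure", "failure", NA, "success"))
)
na.roughfix(toy)
# the numeric NA becomes the column median (300);
# the factor NA becomes the most frequent level ("failure")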
Evaluation: We will continue to use a confusion matrix, as well as the importance ranking of the features.
Experiment:
set.seed(123)
bank_rf1 <- randomForest(y ~ .,
data = bank_train,
ntree = 100,
na.action = na.roughfix)
importance(bank_rf1)
## MeanDecreaseGini
## age 649.09374
## job 492.80733
## marital 145.62751
## education 185.48514
## default 13.96478
## balance 691.34660
## housing 131.72743
## loan 58.41977
## contact 138.05341
## day 583.41136
## month 877.18315
## duration 2074.21176
## campaign 259.83320
## pdays 324.76137
## previous 159.38886
## poutcome 504.25981
# predict and evaluate on training data
rf1_train_pred <- predict(bank_rf1, bank_train, type = "class")
rf1_train_cm <- confusionMatrix(rf1_train_pred, bank_train$y)
rf1_train_cm$overall["Accuracy"]
## Accuracy
## 1
# predict and evaluate on testing data
rf1_test_pred <- predict(bank_rf1, bank_test, type = "class")
rf1_test_cm <- confusionMatrix(rf1_test_pred, bank_test$y)
rf1_test_cm$overall["Accuracy"]
## Accuracy
## 0.852707
Review: The accuracy on the training set is 100%, with a much lower 0.85 on the testing data, which indicates overfitting. Aside from the high importance value for duration, as with the decision tree, the next highest-ranked features are month, balance and age.
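It is also worth noting that the random forest object carries an out-of-bag (OOB) error estimate, computed on the observations each tree did not see during training; this gives a less optimistic picture than re-predicting on the training data (values not reproduced here):
# OOB confusion matrix with per-class error rates
bank_rf1$confusion
# OOB error rate after the final (100th) tree
tail(bank_rf1$err.rate[, "OOB"], 1)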
rf1_log <- data.frame(
ID = 3,
Model = "Random Forest",
Features = "all, with different ranking order from decision trees after 'duration'",
Hyperparameters = "impute method, number of trees",
Train = 1.00,
Test = 0.85,
Notes = "overfitting"
)
experiment_log <- bind_rows(experiment_log, rf1_log)
Objective: For the second random forest model, hyperparameters will be tuned with the aim of reducing overfitting.
Variations: The nodesize will be raised to a minimum of 5 to prevent overly small leaves. The mtry will be reduced to 1 to limit the number of variables randomly sampled at each split.
Evaluation: Both the confusion matrix and importance ranking will again be generated.
Experiment:
set.seed(123)
bank_rf2 <- randomForest(y ~ .,
data = bank_train,
ntree = 100,
mtry = 1,
nodesize = 5,
na.action = na.roughfix)
importance(bank_rf2)
## MeanDecreaseGini
## age 98.966817
## job 72.274013
## marital 25.173125
## education 29.218580
## default 2.881819
## balance 65.588576
## housing 73.887030
## loan 18.202610
## contact 59.522897
## day 55.201453
## month 237.243081
## duration 479.610754
## campaign 36.825847
## pdays 123.013328
## previous 78.568812
## poutcome 220.029001
# predict and evaluate on training data
rf2_train_pred <- predict(bank_rf2, bank_train, type = "class")
rf2_train_cm <- confusionMatrix(rf2_train_pred, bank_train$y)
rf2_train_cm$overall["Accuracy"]
## Accuracy
## 0.8450349
# predict and evaluate on testing data
rf2_test_pred <- predict(bank_rf2, bank_test, type = "class")
rf2_test_cm <- confusionMatrix(rf2_test_pred, bank_test$y)
rf2_test_cm$overall["Accuracy"]
## Accuracy
## 0.8128981
Review: The overfitting of the previous experiment was mitigated, but as expected, lowering the variance came at a cost to accuracy. The order of importance also shifted to more closely match the features the decision tree experiments chose.
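Rather than fixing mtry at 1, intermediate values could also be searched by out-of-bag error. One possible approach is randomForest::tuneRF, sketched below under the assumption that the predictors are imputed up front (tuneRF has no na.action argument); this sketch was not run as part of the logged experiments.
# impute missing values first, then search mtry by OOB error
train_imputed <- na.roughfix(bank_train)
set.seed(123)
tuneRF(x = train_imputed[, setdiff(names(train_imputed), "y")],
       y = train_imputed$y,
       ntreeTry = 100,
       stepFactor = 2,   # double/halve mtry at each step
       improve = 0.01,   # keep stepping while OOB error keeps improving
       trace = TRUE)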
rf2_log <- data.frame(
ID = 4,
Model = "Random Forest",
Features = "ranked 'duration', 'month', and 'poutcome'",
Hyperparameters = "leaf size, number of features randomly sampled",
Train = 0.85,
Test = 0.81,
Notes = "less accurate, lowered variance"
)
experiment_log <- bind_rows(experiment_log, rf2_log)
Objective: The boosting ensemble model XGBoost will be tested to compare accuracy and generate an importance matrix to rank the features’ effects on the predictions.
Variations: The data will be augmented via one-hot encoding, converting each level of the factor columns into a numeric 0/1 indicator column.
Evaluation: The importance rankings will be plotted visually and the confusion matrix will continue to be utilized.
Experiment:
# convert the target column from factor to a numeric label for xgboost,
# and apply one-hot encoding to the rest of the features
bank_train2 <- bank_train
bank_train2$y <- as.numeric(bank_train2$y) - 1
bank_var_train <- dummyVars(" ~ .", data = bank_train2[c(-17)])
bank_var_train <- data.frame(predict(bank_var_train, newdata = bank_train2))
bank_test2 <- bank_test
bank_test2$y <- as.numeric(bank_test2$y) - 1
bank_var_test <- dummyVars(" ~ .", data = bank_test2[c(-17)])
bank_var_test <- data.frame(predict(bank_var_test, newdata = bank_test2))
# convert the data to a matrix
train_matrix <- xgb.DMatrix(data = as.matrix(bank_var_train), label = bank_train2$y)
test_matrix <- xgb.DMatrix(data = as.matrix(bank_var_test), label = bank_test2$y)
set.seed(123)
bank_xg1 <- xgboost(data = train_matrix,
objective = "binary:logistic",
nrounds = 100,
verbose = 0)
importance_matrix <- xgb.importance(model = bank_xg1)
xgb.plot.importance(importance_matrix)
# predict and evaluate on training data
xg1_train_pred <- predict(bank_xg1, train_matrix)
# convert probabilities back to binary yes/no at threshold = 0.5,
xg1_train_pred_factor <- ifelse(xg1_train_pred > 0.5, "yes", "no")
# then back to factors for the confusion matrix
xg1_train_pred_factor <- factor(xg1_train_pred_factor, levels = c("no", "yes"))
xg1_train_cm <- confusionMatrix(xg1_train_pred_factor, bank_train$y, positive = "yes")
xg1_train_cm$overall["Accuracy"]
## Accuracy
## 0.9556815
# predict and evaluate on testing data
xg1_test_pred <- predict(bank_xg1, test_matrix)
xg1_test_pred_factor <- ifelse(xg1_test_pred > 0.5, "yes", "no")
xg1_test_pred_factor <- factor(xg1_test_pred_factor, levels = c("no", "yes"))
xg1_test_cm <- confusionMatrix(xg1_test_pred_factor, bank_test$y, positive = "yes")
xg1_test_cm$overall["Accuracy"]
## Accuracy
## 0.9070899
Review: The accuracy on both the training and testing sets is high, but the model slightly overfit the training data. A great deal more data wrangling was necessary than with the previous algorithms, but the importance matrix allows for feature exploration (and re-confirms the predictive relationship between duration and y).
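To see exactly where the gap between training and testing error opens up across the boosting rounds, xgb.train accepts a watchlist of evaluation sets; a sketch is below (for diagnostics only, not part of the logged experiment, and bank_xg_watch is a hypothetical name).
# track train vs. test classification error at every boosting round
set.seed(123)
bank_xg_watch <- xgb.train(data = train_matrix,
                           objective = "binary:logistic",
                           eval_metric = "error",
                           nrounds = 100,
                           watchlist = list(train = train_matrix, test = test_matrix),
                           verbose = 0)
# evaluation_log records train_error and test_error per round, so the point
# where the two curves diverge (overfitting) can be located
tail(bank_xg_watch$evaluation_log)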
xg1_log <- data.frame(
ID = 5,
Model = "XGBoost",
Features = "all",
Hyperparameters = "nrounds = 100, defaults",
Train = 0.96,
Test = 0.91,
Notes = "duration ranked first"
)
experiment_log <- bind_rows(experiment_log, xg1_log)
Objective: Since accuracy is already high, the aim is to use K-fold cross-validation to decrease overfitting and test whether the model can generalize well and be applied to unseen data.
Variations: XGBoost has a built-in cross-validation function, into which the following will be set or updated from the defaults:
1. the number of folds, or subsets for training and validation, will be 5;
2. gamma, a regularization parameter that controls the minimum loss reduction required to make a split and helps prune unnecessary splits in trees, will be changed from 0 to 10, because we have a large dataset with many weak features;
3. min_child_weight, another regularization parameter that controls the minimum sum of instance weights in a child node, will be changed from 1 to 10, again because the dataset is large.
Evaluation: The importance and confusion matrices can be compared to the previous experiment.
Experiment:
set.seed(123)
params2 <- list(
objective = "binary:logistic",
min_child_weight = 10,
gamma = 10
)
bank_xg2 <- xgb.cv(data = train_matrix,
params = params2,
nrounds = 100,
nfold = 5,
early_stopping_rounds = 10,
verbose = 0)
best_nrounds <- bank_xg2$best_iteration
best_nrounds
## [1] 55
best_n <- bank_xg2$best_ntreelimit
best_n
## [1] 55
set.seed(123)
bank_xg2 <- xgboost(data = train_matrix,
objective = "binary:logistic",
nrounds = best_nrounds,
verbose = 0)
importance_matrix <- xgb.importance(model = bank_xg2)
xgb.plot.importance(importance_matrix)
# predict and evaluate on training data
xg2_train_pred <- predict(bank_xg2, train_matrix)  # the refit already contains only best_nrounds trees
# convert probabilities back to binary yes/no at threshold = 0.5,
# then back to factors for the confusion matrix
xg2_train_pred_factor <- ifelse(xg2_train_pred > 0.5, "yes", "no")
xg2_train_pred_factor <- factor(xg2_train_pred_factor, levels = c("no", "yes"))
xg2_train_cm <- confusionMatrix(xg2_train_pred_factor, bank_train$y, positive = "yes")
xg2_train_cm$overall["Accuracy"]
## Accuracy
## 0.9410285
# predict and evaluate on testing data
xg2_test_pred <- predict(bank_xg2, test_matrix)  # as above, no iteration limit is needed
xg2_test_pred_factor <- ifelse(xg2_test_pred > 0.5, "yes", "no")
xg2_test_pred_factor <- factor(xg2_test_pred_factor, levels = c("no", "yes"))
xg2_test_cm <- confusionMatrix(xg2_test_pred_factor, bank_test$y, positive = "yes")
xg2_test_cm$overall["Accuracy"]
## Accuracy
## 0.9096339
Review: The accuracy was once again barely affected, and the importance rankings were essentially unchanged. The model should generalize well on future unseen data. The greatest change was that, by using cross-validation to determine the best hyperparameters, the number of boosting rounds was nearly halved, so the final model was more efficient while achieving similar predictive accuracy.
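One caveat on the refit above: only the tuned nrounds was carried over, while gamma and min_child_weight were supplied to xgb.cv alone. To apply those regularization settings to the final model as well, the same params list can be passed to the refit; a sketch (bank_xg2b is a hypothetical name, and the result is not part of the logged experiments):
# refit using both the tuned number of rounds and the regularization parameters
set.seed(123)
bank_xg2b <- xgboost(data = train_matrix,
                     params = params2,
                     nrounds = best_nrounds,
                     verbose = 0)
# evaluate the same way as above before deciding which refit to keep
xg2b_test_pred <- predict(bank_xg2b, test_matrix)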
xg2_log <- data.frame(
ID = 6,
Model = "XGBoost",
Features = "all",
Hyperparameters = "k-fold cross-validation, gamma, minimum child weight, nrounds = 55",
Train = 0.94,
Test = 0.91,
Notes = "boosting rounds reduced significanty, similar accuracy"
)
experiment_log <- bind_rows(experiment_log, xg2_log)
knitr::kable(experiment_log, format = "pipe", padding = 0)
ID | Model | Features | Hyperparameters | Train | Test | Notes |
---|---|---|---|---|---|---|
1 | Decision Tree | duration, poutcome, pdays | none | 0.90 | 0.89 | marketing features only |
2 | Decision Tree | poutcome, pdays | none | 0.89 | 0.89 | dropped duration, minimal changes |
3 | Random Forest | all, with different ranking order from decision trees after ‘duration’ | impute method, number of trees | 1.00 | 0.85 | overfitting |
4 | Random Forest | ranked ‘duration’, ‘month’, and ‘poutcome’ | leaf size, number of features randomly sampled | 0.85 | 0.81 | less accurate, lowered variance |
5 | XGBoost | all | nrounds = 100, defaults | 0.96 | 0.91 | duration ranked first |
6 | XGBoost | all | k-fold cross-validation, gamma, minimum child weight, nrounds = 55 | 0.94 | 0.91 | boosting rounds reduced significantly, similar accuracy |
The above experiment log summarizes the results of all 6 tests. The different algorithms allowed for exploration of the various ways to train and evaluate models for classification.
The first 2 experiments used similar decision trees, which split the data into branches based on the ‘best’ feature for information gain, or reduction in uncertainty. This was the simplest algorithm to implement and interpret, with a plot generated to visualize the decision branches. The accuracy was fairly high and close for the training and testing datasets and did not vary much between the two experiments, but the results were of somewhat limited use for business logic. Because the algorithm selects only the most important features and does not utilize the rest, the variables relating to the customers and their banking history were left out. The results also changed fairly dramatically (though not unexpectedly) when only a single variable was removed from the data. However, the efficiency of implementation and processing made this algorithm a worthwhile test for prediction: implementing many different experiments with different changes would be relatively quick and easy.
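As a small illustration of that splitting criterion (rpart uses Gini impurity by default, which plays the same role as the information gain described above), the impurity reduction for one candidate split can be computed by hand; the counts below are hypothetical and only mirror the dataset's rough 88/12 class mix.
# Gini impurity of a set of class labels
gini <- function(y) {
  p <- prop.table(table(y))
  1 - sum(p^2)
}
parent <- c(rep("no", 88), rep("yes", 12))  # node before the split
left <- c(rep("no", 70), rep("yes", 2))     # e.g. observations below a duration cutoff
right <- c(rep("no", 18), rep("yes", 10))   # e.g. observations above the cutoff
n <- length(parent)
# impurity reduction = parent impurity minus the weighted child impurities;
# the split with the largest reduction is chosen at each node
gini(parent) - (length(left) / n) * gini(left) - (length(right) / n) * gini(right)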
The second algorithm tested was random forest, an ensemble method that builds multiple trees in parallel and aggregates their outputs to improve generalization. The first experiment was overfit to the training data and lost accuracy on the testing set. The second experiment tested the effect of changing mtry (the number of variables randomly sampled as candidates at each split) from the default of sqrt(p) for classification (‘p’ being the number of predictor variables, here 16) to just 1, as well as increasing the nodesize to 5 from the default to limit small leaves. This fixed the overfitting issue and balanced the importance rankings but decreased the accuracy to the lowest of all the experiments. With the second, more optimized model, the use of aggregation would most likely allow for the best generalization on unseen data. Ultimately, random forest was the least successful for this dataset.
The final experiments used XGBoost, or extreme gradient boosting, also an ensemble method, but one in which trees are built sequentially, each correcting errors from previous iterations. These experiments returned the highest accuracy on both training and testing data but took the longest to implement, requiring the categorical columns to be one-hot encoded for modeling and the predictions converted back to factors for interpretation and evaluation. XGBoost also took the longest to train and was much less computationally efficient. The final experiment used K-fold cross-validation to help determine the optimal hyperparameters, another lengthy step to code and process, but one that was very useful in creating a more optimal final model. Based on my experiments, I would recommend this algorithm as the best choice for this dataset because it was able to make highly accurate predictions with low variance between the 2 experiments.