In this study, three models are applied to marketing campaign data from a Portuguese bank that marketed term deposit subscriptions to its customers.
The models are Decision Tree, Random Forest, and AdaBoost. Each model is fit at least twice in an effort to improve its predictive metrics.
library(caret)
library(rpart)
library(tidyverse)
library(randomForest)
library(adabag)
library(ada)
library(pROC)
library(reshape2)
library(fmsb)
library(patchwork)
library(rpart.plot)
library(ROSE)
DECISION TREE Hypothesis #1: Imputing more substantive values for particular fields will improve model performance on the F1 and ROC-AUC metrics. With more consistent data for both positive and negative cases, Precision and Recall should improve, which should be reflected in better F1 and ROC-AUC.
The dataset as-is contains placeholder values like “unknown” and “other” for missing values of certain categorical variables (e.g. education, contact, job). I’ll fit a decision tree to this version and compare its metrics to a mutated dataset in which those placeholders are replaced with the mode of the variable, computed within groups defined by a subset of the other predictors.
Imputing values manually
most <- function(vec) {
#return the most frequent (modal) value of a vector
mode <- names(which.max(table(vec)))
return(mode)
}
bank2.f.2 %>%
group_by(job,marital) %>%
mutate(contact2 = ifelse(contact=="unknown",most(contact),contact)) %>%
ungroup() -> bank2.f.2
bank2.f.2 %>%
group_by(marital,housing) %>%
mutate(job2 = ifelse(job=="unknown",most(job),job)) %>%
ungroup() -> bank2.f.2
bank2.f.2 %>%
group_by(marital,housing) %>%
mutate(education2 = ifelse(education=="unknown",most(education),education)) %>%
ungroup() %>%
mutate_if(is.character,as.factor)-> bank2.f.2
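As a quick sanity check (a minimal sketch; note that a group whose own mode is “unknown” keeps that value), the imputed columns should contain far fewer “unknown” values than the originals:
#count remaining "unknown" values, original vs. imputed columns
sapply(bank2.f.2[, c("contact","contact2","job","job2","education","education2")],
       function(col) sum(col == "unknown"))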
set.seed(123)
bank2.f.2 %>%
select(-c(education2,job2,contact2)) -> bank2.f.2.1
train_ix <- caret::createDataPartition(bank2.f.2.1$y, p=.8,list=FALSE,times=1)
bank2_train <- bank2.f.2.1[train_ix,]
bank2_test <- bank2.f.2.1[-train_ix,]
#with the same index in hand, apply index to mutated dataset
bank2.f.2 %>%
select(-c(education,job,contact)) -> bank2.f.2_m
bank2_train_m <- bank2.f.2_m[train_ix,]
bank2_test_m <- bank2.f.2_m[-train_ix,]
decisionTreeFunct <- function(train_set, test_set, xval = 5) {
#fit a classification tree with rpart's default complexity parameter;
#cost complexity pruning is explored separately in Hypothesis #2
tree_mod <-
rpart::rpart(
y ~ .,
method = "class",
xval = xval,
data = train_set,
#original dataset has no NAs
#na.action = na.roughfix
)
#model evaluation
default_pred <- predict(tree_mod,test_set,type="class")
cm <- confusionMatrix(default_pred, test_set$y, positive = "yes")
newlist <- list("cm" = cm, "model" = tree_mod)
return(newlist)
}
Fit decision trees with the same hyperparameters to each training/test set pair, then compare metric values (F1 and ROC-AUC).
dt_notreat_cm <- decisionTreeFunct(bank2_train,bank2_test)$cm
dt_treat_cm <- decisionTreeFunct(bank2_train_m,bank2_test_m)$cm
results.df <- data.frame(model =NA, F1 =NA, ROC_AUC=NA, Recall=NA)
results.df[1,"model"] = 'dt original'
results.df[2,"model"] = 'dt mutated'
results.df[1,"F1"] = dt_notreat_cm$byClass['F1']
results.df[2,"F1"] = dt_treat_cm$byClass['F1']
results.df[1,"Recall"] = dt_notreat_cm$byClass['Recall']
results.df[2,"Recall"] = dt_treat_cm$byClass['Recall']
library(ROCR)
dt_notreat_mdl <- decisionTreeFunct(bank2_train,bank2_test)$model
dt_treat_mdl <- decisionTreeFunct(bank2_train_m,bank2_test_m)$model
#without a newdata argument, predict() returns fitted probabilities for the
#training data, so the AUC values below are in-sample (training) AUCs
dt_notreat_mdl_prob <- predict(dt_notreat_mdl, type="prob")[,2]
dt_treat_mdl_prob <- predict(dt_treat_mdl, type="prob")[,2]
train_p <- prediction(dt_notreat_mdl_prob, bank2_train$y)
train_t_p <- prediction(dt_treat_mdl_prob, bank2_train_m$y)
r_auc_train1 <- performance(train_p, measure = "auc")@y.values[[1]]
r_auc_train2 <- performance(train_t_p, measure = "auc")@y.values[[1]]
results.df[1,"ROC_AUC"] = r_auc_train1
results.df[2,"ROC_AUC"] = r_auc_train2
results.df
## model F1 ROC_AUC Recall
## 1 dt original 0.4140088 0.7455501 0.3131504
## 2 dt mutated 0.3987055 0.7452799 0.2913907
Compared to the original dataset, F1 declined slightly and ROC-AUC held nearly steady. I was surprised by the results; however, given that relatively few records carried the ambiguous values, this turned out to be more of an exercise than a decisive experiment.
Here is a look at the tree that was fit.
rpart.plot::rpart.plot(dt_treat_mdl, main="Mutated Dataset Decision Tree Model")
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
## Call rpart.plot with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.
Hypothesis #2: The bias-variance trade-off of the tree model can be improved with cost complexity tuning. The cp parameter weights the complexity penalty used to prune the tree after it is grown. Cross-validation error is tabulated for the subtree associated with each cp value, and typically the tree with the lowest ‘xerror’ is selected. Lower cp values allow “bushier” trees, with the attendant risk of overfitting.
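For reference, the quantity rpart minimizes when pruning can be written roughly as follows (this is a paraphrase of the scaled form in the rpart vignette, not the package’s exact notation):

$$R_{cp}(T) \;=\; R(T) \;+\; cp \cdot |T| \cdot R(T_1)$$

where $R(T)$ is the misclassification risk of subtree $T$, $|T|$ is its number of splits, and $T_1$ is the root tree with no splits; smaller cp values penalize additional splits less, hence bushier trees.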
I will compare the Recall and ROC-AUC of a model with a tuned ‘cp’ parameter against those of the non-mutated model from Hypothesis #1.
To identify an appropriate cp value, I fit a model starting from cp = 0.001, a value that allows the tree to grow very bushy. Inspecting performance at each cp step, I’ll select the value that minimizes the cross-validated error (xerror).
ctrl <- rpart.control(cp=0.001)
tree_mod <-
rpart::rpart(
y ~ .,
method = "class",
xval = 5,
data = bank2_train,
control=ctrl,
#original dataset has no NAs
#na.action = na.roughfix
)
#model evaluation
#default_pred <- predict(tree_mod,bank2_test,type="class")
#confusionMatrix(default_pred, bank2_test$y, positive = "yes")$byClass
tree_mod$cptable
## CP nsplit rel error xerror xstd
## 1 0.030954631 0 1.0000000 1.0000000 0.01444464
## 2 0.027410208 2 0.9380907 0.9362004 0.01403522
## 3 0.023156900 3 0.9106805 0.9158790 0.01390058
## 4 0.014650284 5 0.8643667 0.8714556 0.01359869
## 5 0.008979206 6 0.8497164 0.8714556 0.01359869
## 6 0.007325142 8 0.8317580 0.8594045 0.01351493
## 7 0.004253308 9 0.8244329 0.8419187 0.01339194
## 8 0.002126654 10 0.8201796 0.8338847 0.01333483
## 9 0.001949433 12 0.8159263 0.8383743 0.01336679
## 10 0.001890359 17 0.8045841 0.8345936 0.01333989
## 11 0.001831285 19 0.8008034 0.8341210 0.01333652
## 12 0.001732829 24 0.7915879 0.8350662 0.01334326
## 13 0.001713138 27 0.7863894 0.8329395 0.01332809
## 14 0.001654064 31 0.7795369 0.8329395 0.01332809
## 15 0.001417769 43 0.7568526 0.8296314 0.01330445
## 16 0.001299622 44 0.7554348 0.8319943 0.01332134
## 17 0.001181474 46 0.7528355 0.8395558 0.01337518
## 18 0.001113962 53 0.7445652 0.8388469 0.01337015
## 19 0.001063327 64 0.7318053 0.8383743 0.01336679
## 20 0.001000000 75 0.7197543 0.8416824 0.01339026
Specifying a small cost complexity allowed the tree to grow significantly. I’ll then refit the tree at the cp value with the lowest 5-fold cross-validation error (xerror).
tree_mod$cptable[which.min(tree_mod$cptable[,4]),]
## CP nsplit rel error xerror xstd
## 0.001417769 43.000000000 0.756852552 0.829631380 0.013304451
#extract just the cp value (not the whole row) at the minimum cross-validated error
min_cp <- tree_mod$cptable[which.min(tree_mod$cptable[,"xerror"]), "CP"]
ctrl <- rpart.control(cp=min_cp)
tree_mod2 <-
rpart::rpart(
y ~ .,
method = "class",
xval = 5,
data = bank2_train,
control=ctrl,
#original dataset has no NAs
#na.action = na.roughfix
)
default_pred2 <- predict(tree_mod2,bank2_test,type="class")
cm_tuned <- confusionMatrix(default_pred2, bank2_test$y, positive = "yes")
results.df[3,'model'] = 'tuned dt'
results.df[3,'Recall'] = cm_tuned$byClass['Recall']
results.df[3,'F1'] = cm_tuned$byClass['F1']
Calculate ROC-AUC
#again an in-sample (training) AUC, since predict() is called without newdata
cp_model_prob <- predict(tree_mod2, type="prob")[,2]
cp_pred <- prediction(cp_model_prob, bank2_train$y)
results.df[3,'ROC_AUC'] = performance(cp_pred, measure = "auc")@y.values[[1]]
results.df
## model F1 ROC_AUC Recall
## 1 dt original 0.4140088 0.7455501 0.3131504
## 2 dt mutated 0.3987055 0.7452799 0.2913907
## 3 tuned dt 0.4726651 0.7931427 0.3926206
rpart.plot::rpart.plot(tree_mod2, main="Cost Complexity Tuned Model")
The tuned tree performed better on both ROC-AUC and Recall. However, the tree contains many nodes, which suggests overfitting; a control on the minimum number of observations a node must contain before it is split (e.g. rpart's minsplit) should be considered.
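A minimal sketch of what that could look like (the minsplit and minbucket values here are illustrative, not tuned):
#require more observations before a node may be split, and in each leaf
ctrl_ms <- rpart.control(cp = min_cp, minsplit = 100, minbucket = 30)
tree_mod3 <- rpart::rpart(y ~ ., method = "class", xval = 5,
                          data = bank2_train, control = ctrl_ms)
cm_ms <- confusionMatrix(predict(tree_mod3, bank2_test, type = "class"),
                         bank2_test$y, positive = "yes")
cm_ms$byClass[c('F1','Recall')]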
Hypothesis #3: Rebalancing the training data by inflating the positive outcomes (subscription = “yes”) will improve Recall but may negatively impact F1, compared to the tree fit in Hypothesis #2.
#ROSE builds a synthetic, class-balanced sample (a smoothed combination of over- and under-sampling)
rose_train_null <- ROSE(y ~ ., data = bank2_train, N = 26000, p = 0.5)$data
#confirm the class balance
table(rose_train_null$y)
##
## no yes
## 13008 12992
The new dataset has nearly balanced classes. I’ll now fit models on this dataset.
#DECISION TREE w/ROSE DATASET
rose.mdl <-
rpart::rpart(
y ~ .,
method = "class",
xval = 5,
data = rose_train_null,
control=ctrl, #reuse the cost-complexity control (cp) tuned in Hypothesis #2
#original dataset has no NAs
#na.action = na.roughfix
)
default_pred2 <- predict(rose.mdl,bank2_test,type="class")
rose.cm <- confusionMatrix(default_pred2, bank2_test$y, positive = "yes")
results.df[4,'model'] = 'dt balanced classes'
results.df[4,'F1'] = rose.cm$byClass['F1']
results.df[4,'Recall'] = rose.cm$byClass['Recall']
results.df
## model F1 ROC_AUC Recall
## 1 dt original 0.4140088 0.7455501 0.3131504
## 2 dt mutated 0.3987055 0.7452799 0.2913907
## 3 tuned dt 0.4726651 0.7931427 0.3926206
## 4 dt balanced classes 0.4865156 NA 0.8533586
Recall shot way up when the model was trained with a greater prevalence of positive cases, meaning more true positives and fewer false negatives. In other words: out of all the actual positive cases in the test set, training on the rebalanced dataset markedly improved the model’s ability to identify them.
A visualization of the model.
rpart.plot::rpart.plot(rose.mdl, main="ROSE Decision Tree Model")
RANDOM FOREST Hypothesis #1: Tuning various hyperparameters of a random forest can improve classification metrics compared with the default model.
I’ll compare F1 and Recall to select the better model.
#build rf model with randomForest package
rf_dflt <- randomForest(
y ~ .,
data = bank2_train,
#ntree = 100,
#mtry = 6,
#na.action = na.roughfix
)
Gather model performance stats on the default random forest model
rf_dflt_prob <- predict(rf_dflt, bank2_test, type = "prob")
rf_predict <- predict(rf_dflt, bank2_test, type="class")
cm_rf_dflt <- confusionMatrix(rf_predict, bank2_test$y, positive = "yes")
#reset the results table for the random forest models
results.df <- data.frame(model =NA, F1 =NA, ROC_AUC=NA, Recall=NA)
results.df[1,"model"] = 'rf default'
results.df[2,"model"] = 'rf tuned'
results.df[1,"Recall"] = cm_rf_dflt$byClass['Recall']
results.df[1,"F1"] = cm_rf_dflt$byClass['F1']
Conducting parameter tuning on mtry (the number of candidate features randomly sampled at each split) and ntree (the number of trees to fit).
grid <- expand_grid(
mtry = seq(from=2, to =5, by =1),
ntree = seq(from=50, to=300, by = 50))
#fit one random forest per row of the tuning grid
rf_funct <- function(ntree = 100, mtry = NA) {
if (is.na(mtry)){
#fall back to roughly randomForest's classification default of sqrt(number of predictors)
mtry = round(sqrt(ncol(bank2_train)),0)
}
set.seed(123)
model <- randomForest(
y ~ .,
data = bank2_train,
ntree = ntree,
mtry = mtry,
#na.action = na.roughfix
)
rf_predict <- predict(model, bank2_test)
cm <- confusionMatrix(rf_predict, bank2_test$y, positive = "yes")
return(cm)
#return(list("model"= model, "cm" =cm))
}
grid %>%
rowwise() %>%
mutate(model_output = list(rf_funct(ntree,mtry))) -> grid_results
grid_results %>%
rowwise() %>%
mutate(F1 = model_output$byClass['F1'],
Precision = model_output$byClass['Precision'],
Recall = model_output$byClass['Recall'],
Specficity = model_output$byClass['Specificity']) -> grid_results
With 24 models fit, I’ll survey the best performers on each metric, focusing on F1 and Recall.
#precision (confidence/reliability in predicted positives): TP/(TP+FP)
grid_results[which.max(grid_results$Precision),]
## # A tibble: 1 × 7
## # Rowwise:
## mtry ntree model_output F1 Precision Recall Specficity
## <dbl> <dbl> <list> <dbl> <dbl> <dbl> <dbl>
## 1 2 300 <cnfsnMtr> 0.431 0.679 0.316 0.980
#true positive rate: TP/(TP+FN) aka Sensitivity
grid_results[which.max(grid_results$Recall),]
## # A tibble: 1 × 7
## # Rowwise:
## mtry ntree model_output F1 Precision Recall Specficity
## <dbl> <dbl> <list> <dbl> <dbl> <dbl> <dbl>
## 1 5 50 <cnfsnMtr> 0.541 0.631 0.473 0.963
# TN/(TN+FP) -> prediction power of negative cases
grid_results[which.max(grid_results$Specficity),]
## # A tibble: 1 × 7
## # Rowwise:
## mtry ntree model_output F1 Precision Recall Specficity
## <dbl> <dbl> <list> <dbl> <dbl> <dbl> <dbl>
## 1 2 300 <cnfsnMtr> 0.431 0.679 0.316 0.980
#balance btwn precision & recall
grid_results[which.max(grid_results$F1),]
## # A tibble: 1 × 7
## # Rowwise:
## mtry ntree model_output F1 Precision Recall Specficity
## <dbl> <dbl> <list> <dbl> <dbl> <dbl> <dbl>
## 1 5 50 <cnfsnMtr> 0.541 0.631 0.473 0.963
The same model (mtry = 5, ntree = 50) delivers both the best F1 and the best Recall.
results.df[2,"Recall"] = grid_results[which.max(grid_results$Recall),'Recall']
results.df[2,"F1"] = grid_results[which.max(grid_results$Recall),'F1']
Conclusion: The random forest with tuned mtry and ntree parameters performs better on both F1 and Recall than the default.
Hypothesis #2: Fitting the previous model to the ROSE (balanced) dataset will improve Recall, but may lower F1.
rose.rf <- randomForest(
y ~ .,
data = rose_train_null,
ntree = 50,
mtry = 5,
#na.action = na.roughfix
)
Inspect performance
rf_rose_predict <- predict(rose.rf, bank2_test)
rose.cm <- confusionMatrix(rf_rose_predict, bank2_test$y, positive = "yes")
results.df[3,"model"] = 'rf balanced dataset'
results.df[3,"Recall"] = rose.cm$byClass['Recall']
results.df[3,"F1"] = rose.cm$byClass['F1']
results.df
## model F1 ROC_AUC Recall
## 1 rf default 0.5000000 NA 0.4143803
## 2 rf tuned 0.5405405 NA 0.4730369
## 3 rf balanced dataset 0.5773585 NA 0.8684957
F1 (the harmonic mean of Precision and Recall) was expected to trade off against Recall, but both actually improved on the balanced data; the large boost to Recall makes this approach worthwhile.
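As a quick arithmetic check of the Precision/Recall/F1 relationship, using the tuned model’s reported Precision (≈ 0.631) and Recall (≈ 0.473):
#F1 is the harmonic mean of Precision and Recall
p <- 0.631; r <- 0.473
2 * p * r / (p + r)   #≈ 0.541, matching the tuned rf's reported F1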
ADABOOST Hypothesis #1: The default model can be improved by adjusting some parameters (the number of iterations and the learning rate).
# ADABOOST
# parameters: iter and nu
# default number of boosting iterations = 50
# nu is the shrinkage parameter (effectively the learning rate); its default is 0.1,
# though the baseline below uses nu = 1 (no shrinkage)
adaboost_model <- ada::ada(y ~ ., data = bank2_train, iter = 50, nu = 1)
adaboost_pred <- predict(adaboost_model, bank2_test)
conf_matrix_ada <- confusionMatrix(adaboost_pred, bank2_test$y, positive = "yes")
conf_matrix_ada$byClass
## Sensitivity Specificity Pos Pred Value
## 0.41154210 0.96017034 0.57768924
## Neg Pred Value Precision Recall
## 0.92495174 0.57768924 0.41154210
## F1 Prevalence Detection Rate
## 0.48066298 0.11691185 0.04811415
## Detection Prevalence Balanced Accuracy
## 0.08328725 0.68585622
#Hypothesis: if we slow the learning rate sufficiently while increasing the number of iterations, we can improve the bias-variance trade-off
param_model <- ada::ada(y ~ ., data = bank2_train, iter = 150, nu = .5)
param_pred <- predict(param_model, bank2_test)
conf_matrix_param <- confusionMatrix(param_pred, bank2_test$y, positive = "yes")
conf_matrix_param$byClass
## Sensitivity Specificity Pos Pred Value
## 0.43235572 0.96518036 0.62176871
## Neg Pred Value Precision Recall
## 0.92776306 0.62176871 0.43235572
## F1 Prevalence Detection Rate
## 0.51004464 0.11691185 0.05054751
## Detection Prevalence Balanced Accuracy
## 0.08129632 0.69876804
Conclusion: To my surprise, the metrics improved only marginally over the default parameter set. Given this near-parity, I would weigh compute time against the number of iterations. These parameters could also be tuned further with trials across various combinations.
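A minimal sketch of such a trial, mirroring the grid pattern used for the random forest (the iter and nu values here are illustrative assumptions, and each fit can be slow):
ada_grid <- expand_grid(iter = c(50, 100, 150), nu = c(0.05, 0.1, 0.5, 1))
ada_grid %>%
  rowwise() %>%
  mutate(F1 = {
    #fit one AdaBoost model per grid row and score F1 on the held-out set
    mdl <- ada::ada(y ~ ., data = bank2_train, iter = iter, nu = nu)
    confusionMatrix(predict(mdl, bank2_test), bank2_test$y,
                    positive = "yes")$byClass['F1']
  }) %>%
  ungroup() -> ada_grid_results
ada_grid_results[which.max(ada_grid_results$F1), ]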
Hypothesis #2: The model can be further improved by scaling the numeric variables.
#Hypothesis #2: adaboost performs better on scaled numeric variables
pre_process <- preProcess(bank2_train, method = c("center", "scale"))
train_data_normalized <- predict(pre_process, bank2_train)
test_data_normalized <- predict(pre_process, bank2_test)
adaboost_model_3 <- ada(y ~ ., data = train_data_normalized, iter = 150, nu = .5)
ada_pred <- predict(adaboost_model_3, test_data_normalized)
conf_matrix_param <- confusionMatrix(ada_pred, bank2_test$y, positive = "yes")
conf_matrix_param$byClass
## Sensitivity Specificity Pos Pred Value
## 0.44370861 0.96442886 0.62284197
## Neg Pred Value Precision Recall
## 0.92905405 0.62284197 0.44370861
## F1 Prevalence Detection Rate
## 0.51823204 0.11691185 0.05187479
## Detection Prevalence Balanced Accuracy
## 0.08328725 0.70406873
The model fit on scaled values performed only marginally better than the one fit on the original scale. This is expected: the tree-based base learners used by AdaBoost split on thresholds and are largely insensitive to monotonic rescaling of numeric features.