In this study, three models are applied to marketing campaign data. The campaign was run by a Portuguese bank to sell term deposit subscriptions to prospective clients.

The models applied are Decision Tree, Random Forest, and AdaBoost. Each model is fit at least twice in an attempt to improve its prediction metrics.

library(caret)
library(rpart)
library(tidyverse)
library(randomForest)
library(adabag)
library(ada)
library(pROC)
library(reshape2)
library(fmsb)
library(patchwork)
library(rpart.plot)
library(ROSE)

DECISION TREE Hypothesis #1: Mutating particular fields (substituting placeholder values with something more substantive) will improve model performance on the F1 and ROC-AUC metrics. With more consistent data for both positive and negative cases, Precision and Recall should improve, which in turn should lift F1 and ROC-AUC.
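
For reference, F1 is the harmonic mean of Precision and Recall: F1 = 2 * (Precision * Recall) / (Precision + Recall). It rises only when both components hold up, which is why it pairs well with ROC-AUC here.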

The dataset as-is contains placeholder values like “unknown” and “other” for missing values of certain categorical variables (e.g. education, contact, job). I’ll fit a decision tree model to this version, then compare its metrics to those of a mutated dataset in which the mode value is imputed for those variables (computed within groups defined by a subset of the other predictors).

Imputing values manually

#return the most frequent value (mode) of a vector
most <- function(vec){
  mode <- names(which.max(table(vec)))
  return(mode)
}
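
To confirm the helper behaves as expected, a quick toy check (hypothetical values):

#the mode of c("a","a","b") is "a"
most(c("a", "a", "b"))
## [1] "a"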

#impute the within-group mode wherever contact is "unknown"
#(note: "unknown" itself counts toward the group mode here)
bank2.f.2 %>%
  group_by(job,marital) %>%
  mutate(contact2 = ifelse(contact=="unknown",most(contact),contact)) %>%
  ungroup() -> bank2.f.2


bank2.f.2 %>%
  group_by(marital,housing) %>%
  mutate(job2 = ifelse(job=="unknown",most(job),job)) %>%
  ungroup() -> bank2.f.2


bank2.f.2 %>%
  group_by(marital,housing) %>%
  mutate(education2 = ifelse(education=="unknown",most(education),education)) %>%
  ungroup() %>%
  mutate_if(is.character,as.factor)-> bank2.f.2
set.seed(123)

bank2.f.2 %>%
  select(-c(education2,job2,contact2)) -> bank2.f.2.1

train_ix <- caret::createDataPartition(bank2.f.2.1$y, p=.8,list=FALSE,times=1)

bank2_train <- bank2.f.2.1[train_ix,]
bank2_test <- bank2.f.2.1[-train_ix,]

#with the same index in hand, apply index to mutated dataset
bank2.f.2 %>%
  select(-c(education,job,contact)) -> bank2.f.2_m

bank2_train_m <- bank2.f.2_m[train_ix,]
bank2_test_m <- bank2.f.2_m[-train_ix,]
decisionTreeFunct <- function(train_set, test_set, xval = 5) {
  #fit a tree with rpart's default cp (0.01); cost complexity tuning follows in hypothesis #2
  tree_mod <-
    rpart::rpart(
      y ~ .,
      method = "class",
      xval = xval,
      data = train_set
      #original dataset has no NAs
      #na.action = na.roughfix
    )

  #model evaluation on the held-out test set
  default_pred <- predict(tree_mod, test_set, type = "class")

  cm <- confusionMatrix(default_pred, test_set$y, positive = "yes")

  return(list("cm" = cm, "model" = tree_mod))
}

Fit decision trees with the same hyperparameters to each training/test set pair, then compare metric values (F1 and ROC-AUC).

dt_notreat_cm <- decisionTreeFunct(bank2_train,bank2_test)$cm

#decision tree fit on the mutated dataset (mode-imputed values)
dt_treat_cm <- decisionTreeFunct(bank2_train_m,bank2_test_m)$cm

results.df <- data.frame(model =NA, F1 =NA, ROC_AUC=NA, Recall=NA)
results.df[1,"model"] = 'dt original'
results.df[2,"model"] =  'dt mutated'
results.df[1,"F1"] = dt_notreat_cm$byClass['F1']
results.df[2,"F1"] = dt_treat_cm$byClass['F1']
results.df[1,"Recall"] = dt_notreat_cm$byClass['Recall']
results.df[2,"Recall"] = dt_treat_cm$byClass['Recall']
library(ROCR)

dt_notreat_mdl <- decisionTreeFunct(bank2_train,bank2_test)$model
dt_treat_mdl <- decisionTreeFunct(bank2_train_m,bank2_test_m)$model

#without newdata, predict() returns fitted probabilities on the training set,
#so these AUCs are training-set AUCs (F1/Recall above are test-set metrics)
dt_notreat_mdl_prob <- predict(dt_notreat_mdl, type="prob")[,2]
dt_treat_mdl_prob <- predict(dt_treat_mdl, type="prob")[,2]

train_p <- prediction(dt_notreat_mdl_prob, bank2_train$y)
train_t_p <- prediction(dt_treat_mdl_prob, bank2_train_m$y)


r_auc_train1 <- performance(train_p, measure = "auc")@y.values[[1]]
r_auc_train2 <- performance(train_t_p, measure = "auc")@y.values[[1]] 

results.df[1,"ROC_AUC"] = r_auc_train1
results.df[2,"ROC_AUC"] = r_auc_train2

results.df
##         model        F1   ROC_AUC    Recall
## 1 dt original 0.4140088 0.7455501 0.3131504
## 2  dt mutated 0.3987055 0.7452799 0.2913907

Compared to the original dataset, F1 declined slightly and ROC-AUC held essentially steady. The results surprised me. However, given how few records carried ambiguous values, this turned out to be more of an exercise than a decisive experiment.
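
One way to check that claim is to tabulate the share of "unknown" per affected column (a quick sketch, using this dataset's column names):

bank2.f.2.1 %>%
  summarise(across(c(job, education, contact), ~ mean(.x == "unknown")))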

Here is a look at the tree that was fit.

rpart.plot::rpart.plot(dt_treat_mdl, main="Mutated Dataset Decision Tree Model")
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

Hypothesis #2: The bias-variance trade-off of the tree model can be improved with cost complexity tuning. The cp parameter governs post-pruning after a tree is grown: cross-validation is run for the subtree associated with each cp and the total error is tabulated. Typically the tree with the lowest ‘xerror’ is selected. Lower ‘cp’ values allow “bushier” trees - with the risk of overfitting.
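
For reference, rpart defines the cost complexity criterion as R_cp(T) = R(T) + cp * |T| * R(T1), where R(T) is the risk (misclassification error) of tree T, |T| its number of splits, and R(T1) the risk of the tree with no splits; for each cp, the subtree minimizing R_cp(T) is kept.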

I will compare the Recall and ROC-AUC of a model with a tuned ‘cp’ parameter with that of the non-mutated model from hypothesis #1.

In order to identify an appropriate cp value, I fit a model with cp set to .001, a value which allows the tree to grow very bushy. Inspecting performance at each cp step, I’ll pick the value that minimizes the cross-validated error (xerror).

ctrl <- rpart.control(cp=0.001)
tree_mod <- 
  rpart::rpart(
    y ~ ., 
    method = "class",
    xval = 5,
    data = bank2_train,
    control=ctrl,
    #original dataset has no NAs
    #na.action = na.roughfix
    )

#model evaluation
#default_pred <- predict(tree_mod,bank2_test,type="class")

#confusionMatrix(default_pred, bank2_test$y, positive = "yes")$byClass
tree_mod$cptable
##             CP nsplit rel error    xerror       xstd
## 1  0.030954631      0 1.0000000 1.0000000 0.01444464
## 2  0.027410208      2 0.9380907 0.9362004 0.01403522
## 3  0.023156900      3 0.9106805 0.9158790 0.01390058
## 4  0.014650284      5 0.8643667 0.8714556 0.01359869
## 5  0.008979206      6 0.8497164 0.8714556 0.01359869
## 6  0.007325142      8 0.8317580 0.8594045 0.01351493
## 7  0.004253308      9 0.8244329 0.8419187 0.01339194
## 8  0.002126654     10 0.8201796 0.8338847 0.01333483
## 9  0.001949433     12 0.8159263 0.8383743 0.01336679
## 10 0.001890359     17 0.8045841 0.8345936 0.01333989
## 11 0.001831285     19 0.8008034 0.8341210 0.01333652
## 12 0.001732829     24 0.7915879 0.8350662 0.01334326
## 13 0.001713138     27 0.7863894 0.8329395 0.01332809
## 14 0.001654064     31 0.7795369 0.8329395 0.01332809
## 15 0.001417769     43 0.7568526 0.8296314 0.01330445
## 16 0.001299622     44 0.7554348 0.8319943 0.01332134
## 17 0.001181474     46 0.7528355 0.8395558 0.01337518
## 18 0.001113962     53 0.7445652 0.8388469 0.01337015
## 19 0.001063327     64 0.7318053 0.8383743 0.01336679
## 20 0.001000000     75 0.7197543 0.8416824 0.01339026

Specifying a small cost complexity allowed the tree to grow significantly. I’ll now refit the tree at the cp value with the lowest 5-fold cross-validation error.

tree_mod$cptable[which.min(tree_mod$cptable[,4]),]
##           CP       nsplit    rel error       xerror         xstd 
##  0.001417769 43.000000000  0.756852552  0.829631380  0.013304451
#extract just the CP value (not the whole cptable row)
min_cp <- tree_mod$cptable[which.min(tree_mod$cptable[,"xerror"]), "CP"]

ctrl <- rpart.control(cp=min_cp)
tree_mod2 <- 
  rpart::rpart(
    y ~ ., 
    method = "class",
    xval = 5,
    data = bank2_train,
    control=ctrl,
    #original dataset has no NAs
    #na.action = na.roughfix
    )
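
As an aside, the same subtree can be recovered by pruning the already-grown large tree directly, which avoids a refit (a one-line sketch):

#prune the large tree at the selected cp; should match the refit above
tree_mod2_pruned <- rpart::prune(tree_mod, cp = min_cp)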

default_pred2 <- predict(tree_mod2,bank2_test,type="class")

cm_tuned <- confusionMatrix(default_pred2, bank2_test$y, positive = "yes")

results.df[3,'model'] = 'tuned dt'
results.df[3,'Recall'] = cm_tuned$byClass['Recall']
results.df[3,'F1'] = cm_tuned$byClass['F1']

Calculate ROC-AUC

#again, fitted probabilities come from the training set (no newdata supplied);
#tree_mod2 was fit on bank2_train, so use its labels
cp_model_prob <- predict(tree_mod2, type="prob")[,2]

cp_pred <- prediction(cp_model_prob, bank2_train$y)

results.df[3,'ROC_AUC'] = performance(cp_pred, measure = "auc")@y.values[[1]]

results.df
##         model        F1   ROC_AUC    Recall
## 1 dt original 0.4140088 0.7455501 0.3131504
## 2  dt mutated 0.3987055 0.7452799 0.2913907
## 3    tuned dt 0.4726651 0.7931427 0.3926206
rpart.plot::rpart.plot(tree_mod2, main="Cost Complexity Tuned Model")

The tuned tree performed better on ROC-AUC and Recall. However, it contains many nodes, which suggests overfitting. For this reason, a control on the minimum number of observations a node must contain before it is split should be considered; a sketch follows.
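
A minimal sketch of such a control (the minsplit/minbucket values here are illustrative, not tuned):

#require at least 100 observations in a node before attempting a split,
#and at least 30 in any terminal node
ctrl2 <- rpart.control(cp = min_cp, minsplit = 100, minbucket = 30)
tree_mod3 <- rpart::rpart(y ~ ., method = "class", data = bank2_train, control = ctrl2)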

Hypothesis #3: Rebalancing the training data by inflating the positive outcomes (subscription = “yes”) will improve Recall but may negatively impact F1 compared to the tree fit in hypothesis #2.

#ROSE draws a synthetic, rebalanced sample of size N via a smoothed bootstrap over both classes
rose_train_null <- ROSE(y ~ ., data = bank2_train, N = 26000, p = 0.5)$data

#verify class balance
table(rose_train_null$y)
## 
##    no   yes 
## 13008 12992

The new dataset has nearly balanced classes. I’ll now fit models on this dataset.

#DECISION TREE w/ROSE DATASET
#note: ctrl still holds the tuned cp from hypothesis #2
rose.mdl <-
  rpart::rpart(
    y ~ ., 
    method = "class",
    xval = 5,
    data = rose_train_null,
    control = ctrl
    #original dataset has no NAs
    #na.action = na.roughfix
    )

default_pred2 <- predict(rose.mdl,bank2_test,type="class")
rose.cm <- confusionMatrix(default_pred2, bank2_test$y, positive = "yes")

results.df[4,'model'] = 'dt balanced classes'
results.df[4,'F1'] = rose.cm$byClass['F1']
results.df[4,'Recall'] = rose.cm$byClass['Recall']

results.df
##                 model        F1   ROC_AUC    Recall
## 1         dt original 0.4140088 0.7455501 0.3131504
## 2          dt mutated 0.3987055 0.7452799 0.2913907
## 3            tuned dt 0.4726651 0.7931427 0.3926206
## 4 dt balanced classes 0.4865156        NA 0.8533586

Recall shot way up with the greater prevalence of positive cases in training, indicating more true positives and fewer false negatives. In other words: of all the actual positive cases in the test set, the model trained on the rebalanced dataset identified a much larger share.
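
As a sanity check, Recall can be recomputed directly from the confusion matrix counts:

#Recall = TP/(TP+FN); caret's table has predictions in rows, reference in columns
tab <- rose.cm$table
tab["yes", "yes"] / (tab["yes", "yes"] + tab["no", "yes"])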

A visualization of the model.

rpart.plot::rpart.plot(rose.mdl, main="Rose Decision Tree Model")

RANDOM FOREST Hypothesis #1: Tuning various hyperparameters of a random forest can improve classification metrics compared with the default model.

I’ll compare F1 and Recall to select the better model.

#build rf model with randomForest package
rf_dflt <- randomForest(
  y ~ ., 
  data = bank2_train, 
  #ntree = 100, 
  #mtry = 6, 
  #na.action = na.roughfix 
  )

Gather model performance stats on the default random forest model.

#test-set class probabilities (kept for reference; ROC-AUC is not tabulated for the rf models)
rf_dflt_prob <- predict(rf_dflt, bank2_test, type = "prob")

rf_predict <- predict(rf_dflt, bank2_test, type="class")
cm_rf_dflt <- confusionMatrix(rf_predict, bank2_test$y, positive = "yes")
#start a fresh results table for the random forest models
results.df <- data.frame(model = NA, F1 = NA, ROC_AUC = NA, Recall = NA)
results.df[1,"model"] = 'rf default'
results.df[2,"model"] =  'rf tuned'
results.df[1,"Recall"] =  cm_rf_dflt$byClass['Recall']
results.df[1,"F1"] =  cm_rf_dflt$byClass['F1']

Conducting parameter tuning on mtry (the number of predictors randomly sampled as split candidates at each node) and ntree (the number of trees to fit).

grid <- expand_grid(
  mtry = seq(from=2, to =5, by =1),
  ntree = seq(from=50, to=300, by = 50))

#iterate through each row of the parameter grid

rf_funct <- function(ntree = 100, mtry = NA) {
  if (is.na(mtry)){
    mtry = round(sqrt(ncol(bank2_train)),0)
  }
  set.seed(123)
  model <- randomForest(
    y ~ ., 
    data = bank2_train, 
    ntree = ntree, 
    mtry = mtry, 
    #na.action = na.roughfix 
    )
  
  rf_predict <- predict(model, bank2_test)
  cm <- confusionMatrix(rf_predict, bank2_test$y, positive = "yes")
  
  return(cm)
  #return(list("model"= model, "cm" =cm))
}


grid %>%
  rowwise() %>%
  mutate(model_output = list(rf_funct(ntree,mtry))) -> grid_results

grid_results %>%
  rowwise() %>%
  mutate(F1 = model_output$byClass['F1'], 
         Precision = model_output$byClass['Precision'], 
         Recall = model_output$byClass['Recall'], 
         Specificity = model_output$byClass['Specificity']) -> grid_results

With 24 models fit (4 mtry values × 6 ntree values), I’ll survey the best performers on F1 and Recall.

#precision (confidence/reliability in predicted positives): TP/(TP+FP)
grid_results[which.max(grid_results$Precision),]
## # A tibble: 1 × 7
## # Rowwise: 
##    mtry ntree model_output    F1 Precision Recall Specificity
##   <dbl> <dbl> <list>       <dbl>     <dbl>  <dbl>      <dbl>
## 1     2   300 <cnfsnMtr>   0.431     0.679  0.316      0.980
#true positive rate: TP/(TP+FN)  aka Sensitivity
grid_results[which.max(grid_results$Recall),]
## # A tibble: 1 × 7
## # Rowwise: 
##    mtry ntree model_output    F1 Precision Recall Specificity
##   <dbl> <dbl> <list>       <dbl>     <dbl>  <dbl>      <dbl>
## 1     5    50 <cnfsnMtr>   0.541     0.631  0.473      0.963
# TN/(TN+FP) -> prediction power of negative cases
grid_results[which.max(grid_results$Specificity),]
## # A tibble: 1 × 7
## # Rowwise: 
##    mtry ntree model_output    F1 Precision Recall Specificity
##   <dbl> <dbl> <list>       <dbl>     <dbl>  <dbl>      <dbl>
## 1     2   300 <cnfsnMtr>   0.431     0.679  0.316      0.980
#balance btwn precision & recall
grid_results[which.max(grid_results$F1),]
## # A tibble: 1 × 7
## # Rowwise: 
##    mtry ntree model_output    F1 Precision Recall Specificity
##   <dbl> <dbl> <list>       <dbl>     <dbl>  <dbl>      <dbl>
## 1     5    50 <cnfsnMtr>   0.541     0.631  0.473      0.963

The same model (mtry = 5, ntree = 50) delivers both the best F1 score and the best Recall.
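
To see the full ranking rather than single winners, the grid can be sorted (a quick sketch):

grid_results %>%
  ungroup() %>%
  select(mtry, ntree, F1, Precision, Recall) %>%
  arrange(desc(F1))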

results.df[2,"Recall"] =  grid_results[which.max(grid_results$Recall),'Recall']
results.df[2,"F1"] =  grid_results[which.max(grid_results$Recall),'F1']

Conclusion: The random forest with tuned ‘mtry’ and ‘ntree’ parameters performs better than the default on both F1 and Recall.

Hypothesis #2: Fitting the previous model to the ROSE (balanced) dataset will improve Recall, but may lower F1.

rose.rf <- randomForest(
    y ~ ., 
    data = rose_train_null, 
    ntree = 50, 
    mtry = 5, 
    #na.action = na.roughfix 
    )

Inspect performance

rf_rose_predict <- predict(rose.rf, bank2_test)
rose.cm <- confusionMatrix(rf_rose_predict, bank2_test$y, positive = "yes")

results.df[3,"model"] =  'rf balanced dataset'
results.df[3,"Recall"] =  rose.cm$byClass['Recall']
results.df[3,"F1"] =  rose.cm$byClass['F1']
results.df
##                 model        F1 ROC_AUC    Recall
## 1          rf default 0.5000000      NA 0.4143803
## 2            rf tuned 0.5405405      NA 0.4730369
## 3 rf balanced dataset 0.5773585      NA 0.8684957

There is a trade-off at work: Precision fell on the balanced data, yet F1 (the harmonic mean of Precision and Recall) still improved because Recall rose so sharply. The boost to Recall makes this approach worthwhile.

ADABOOST Hypothesis #1: The default model can be improved with some parameter adjustments (number of iterations and learning rate).

# ADABOOST
# parameters: iter (number of boosting iterations, default 50)
# and nu, the shrinkage/learning-rate parameter (package default 0.1)
# here nu = 1 (no shrinkage) serves as the baseline fit
adaboost_model <- ada::ada(y ~ ., data = bank2_train, iter = 50, nu = 1)

adaboost_pred <- predict(adaboost_model, bank2_test)

conf_matrix_ada <- confusionMatrix(adaboost_pred, bank2_test$y, positive = "yes")

conf_matrix_ada$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##           0.41154210           0.96017034           0.57768924 
##       Neg Pred Value            Precision               Recall 
##           0.92495174           0.57768924           0.41154210 
##                   F1           Prevalence       Detection Rate 
##           0.48066298           0.11691185           0.04811415 
## Detection Prevalence    Balanced Accuracy 
##           0.08328725           0.68585622

#Hypothesis: if we slow the learning rate sufficiently while increasing the number of iterations, we can improve the bias-variance trade-off

param_model <- ada::ada(y ~ ., data = bank2_train, iter = 150, nu = .5)

param_pred <- predict(param_model, bank2_test)

conf_matrix_param <- confusionMatrix(param_pred, bank2_test$y, positive = "yes")

conf_matrix_param$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##           0.43235572           0.96518036           0.62176871 
##       Neg Pred Value            Precision               Recall 
##           0.92776306           0.62176871           0.43235572 
##                   F1           Prevalence       Detection Rate 
##           0.51004464           0.11691185           0.05054751 
## Detection Prevalence    Balanced Accuracy 
##           0.08129632           0.69876804

Conclusion: to my surprise, the metrics improved only marginally over the baseline parameter set. Given this near-parity, I would weigh compute time against the number of iterations. These parameters could also be tuned further with trials across various combinations, as sketched below.
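
A sketch of such a trial, mirroring the random forest grid above (the iter/nu values are illustrative, and each fit is slow):

ada_grid <- expand_grid(iter = c(50, 100, 150), nu = c(0.1, 0.5, 1))

ada_grid %>%
  rowwise() %>%
  mutate(cm = list(confusionMatrix(
           predict(ada::ada(y ~ ., data = bank2_train, iter = iter, nu = nu),
                   bank2_test),
           bank2_test$y, positive = "yes")),
         F1 = cm$byClass['F1'],
         Recall = cm$byClass['Recall']) -> ada_grid_results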

Hypothesis #2: The model can be further improved by scaling the numeric variables.

#Hypothesis #2: adaboost performs better on scaled numeric variables
pre_process <- preProcess(bank2_train, method = c("center", "scale"))
train_data_normalized <- predict(pre_process, bank2_train)
test_data_normalized <- predict(pre_process, bank2_test)

adaboost_model_3 <- ada(y ~ ., data = train_data_normalized, iter = 150, nu = .5)
ada_pred <- predict(adaboost_model_3, test_data_normalized)

conf_matrix_param <- confusionMatrix(ada_pred, bank2_test$y, positive = "yes")
conf_matrix_param$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##           0.44370861           0.96442886           0.62284197 
##       Neg Pred Value            Precision               Recall 
##           0.92905405           0.62284197           0.44370861 
##                   F1           Prevalence       Detection Rate 
##           0.51823204           0.11691185           0.05187479 
## Detection Prevalence    Balanced Accuracy 
##           0.08328725           0.70406873

The model fit on scaled values performed essentially the same as the one fit on the original scale; the marginal gains in F1 and Recall likely reflect randomness in the fits (e.g. ada’s default bagging fraction resamples the training data) rather than the scaling itself, since tree-based learners are insensitive to monotone rescaling of numeric features.