Assignment 2

Author

Semyon Toybis

Assignment

Assignment 2 requires conducting experiments on the "Bank Marketing" data set from the UC Irvine Machine Learning Repository by training decision tree, random forest, and AdaBoost (or gradient boosting) models and varying the approach within each to compare the results. I have saved the data set to my working directory and will import it from there. Below I load the libraries that I initially expect to need. I skip exploratory data analysis, as this was done in assignment 1.

Below, I import the data set

bank_data <- read.csv('bank-full.csv', sep = ';')

Pre-processing

As discussed in assignment 1, there are a few pre-processing steps:

  • Imputing values for the “unknown” observations for features “contact”, “education”, and “job”

  • Converting categorical variables to dummy variables

  • Checking for features with near zero variance

  • Transforming continuous variables to control for outliers

  • Centering and scaling features will be done when fitting a model via the “caret” package

preproccessed_data <- bank_data

Converting unknown values

First, I replace the "unknown" observations in the "contact", "education", and "job" columns with NA:

preproccessed_data$contact[preproccessed_data$contact=='unknown'] <- NA
preproccessed_data$education[preproccessed_data$education=='unknown'] <- NA
preproccessed_data$job[preproccessed_data$job=='unknown'] <- NA

Next, I convert the categorical columns to factors:

preproccessed_data <- preproccessed_data |> mutate(across(where(is.character), as.factor))

Next, I perform KNN imputation to replace the NAs

library(VIM)

I use parallel computing later in the code to improve the run time.

library(foreach)
library(doParallel)
preproccessed_data <- kNN(data = preproccessed_data, variable = c('contact', 'education', 'job'), k = 5, imp_var = F)
prop.table(table(preproccessed_data$contact))

  cellular  telephone 
0.93052576 0.06947424 
prop.table(table(preproccessed_data$education))

  primary secondary  tertiary 
0.1586782 0.5382318 0.3030900 
prop.table(table(preproccessed_data$job))

       admin.   blue-collar  entrepreneur     housemaid    management 
   0.11486143    0.21651810    0.03302294    0.02789144    0.21050187 
      retired self-employed      services       student    technician 
   0.05091681    0.03505784    0.09247749    0.02081352    0.16891907 
   unemployed 
   0.02901949 

Dummy variables

The decision tree algorithm in the "rpart" package can handle categorical variables directly without conversion, so normally I could forgo converting the categorical features.

However, the data set is imbalanced and I need to use the SMOTE algorithm (discussed later) to create a balanced training set. Because SMOTE interpolates between numeric feature vectors, I will have to convert the categorical variables to dummy variables first.

dummy_variables <- dummyVars(~.-y, data = preproccessed_data, fullRank = T)
dummy_data <- predict(dummy_variables, newdata = preproccessed_data)
preproccessed_data <- cbind(as.data.frame(dummy_data), y = preproccessed_data$y)
colnames(preproccessed_data)
 [1] "age"                 "job.blue-collar"     "job.entrepreneur"   
 [4] "job.housemaid"       "job.management"      "job.retired"        
 [7] "job.self-employed"   "job.services"        "job.student"        
[10] "job.technician"      "job.unemployed"      "marital.married"    
[13] "marital.single"      "education.secondary" "education.tertiary" 
[16] "default.yes"         "balance"             "housing.yes"        
[19] "loan.yes"            "contact.telephone"   "day"                
[22] "month.aug"           "month.dec"           "month.feb"          
[25] "month.jan"           "month.jul"           "month.jun"          
[28] "month.mar"           "month.may"           "month.nov"          
[31] "month.oct"           "month.sep"           "duration"           
[34] "campaign"            "pdays"               "previous"           
[37] "poutcome.other"      "poutcome.success"    "poutcome.unknown"   
[40] "y"                  

Near zero variance

For a model like logistic regression, I would exclude variables with near zero variance. However, for these tree-based models, I will forgo doing so: a decision tree inherently focuses on the variables that most reduce impurity and does not estimate coefficients for individual variables.
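
As a quick diagnostic only (nothing is dropped based on it), caret's nearZeroVar() can list which of the dummy columns would be flagged; this sketch assumes caret is already loaded, as it is for dummyVars() above.

nzv_metrics <- nearZeroVar(preproccessed_data, saveMetrics = TRUE)  # frequency ratio and % unique values per column
rownames(nzv_metrics)[nzv_metrics$nzv]                              # columns caret would flag as near-zero variance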

Transforming continuous variables

First, I look at the continuous variables besides “day”

preproccessed_data |> select(age, balance, duration, campaign, previous) |> describe()
         vars     n    mean      sd median trimmed    mad   min    max  range
age         1 45211   40.94   10.62     39   40.25  10.38    18     95     77
balance     2 45211 1362.27 3044.77    448  767.21 664.20 -8019 102127 110146
duration    3 45211  258.16  257.53    180  210.87 137.88     0   4918   4918
campaign    4 45211    2.76    3.10      2    2.12   1.48     1     63     62
previous    5 45211    0.58    2.30      0    0.13   0.00     0    275    275
          skew kurtosis    se
age       0.68     0.32  0.05
balance   8.36   140.73 14.32
duration  3.14    18.15  1.21
campaign  4.90    39.24  0.01
previous 41.84  4506.16  0.01
preproccessed_data |>
  select(age, balance, duration, campaign, previous) |>
  pivot_longer(cols = everything(), names_to = 'variable', values_to = 'value') |>
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  facet_wrap(vars(variable), scales = 'free_y')

As discussed above, these features have outliers. However, decision trees are able to handle data sets with outliers - thus I will leave the features as they currently are without any transformations.
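
For reference, if a model sensitive to outliers were used instead, the heavy-tailed features could be capped at, say, the 1st and 99th percentiles. The helper below (cap_outliers, a hypothetical name) is only an illustrative sketch and is not applied to the data.

cap_outliers <- function(x, lower = 0.01, upper = 0.99) {
  # winsorize: clamp values below/above the chosen quantiles
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}
# example (not run): preproccessed_data$balance <- cap_outliers(preproccessed_data$balance)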

Below is the description of the “duration” feature from the data set description:

last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Thus, I will remove the feature from the data set

preproccessed_data$duration <- NULL

Centering and scaling data can be done during training via the Caret package.

Train-Test Split

As a reminder, the data set has an imbalance in the target variable:

prop.table(table(preproccessed_data$y))

       no       yes 
0.8830152 0.1169848 

Thus, I will apply the SMOTE algorithm to the training set to avoid biasing the model toward the majority class.

First, I create the train-test split using an 80%/20% proportion.

set.seed(123)

sample_set <- sample(nrow(preproccessed_data), round(nrow(preproccessed_data)*0.8), replace = FALSE)

train_set <- preproccessed_data[sample_set,]
test_set <- preproccessed_data[-sample_set,]

Below I check the proportions of the target

original data set:

round(prop.table(table(select(preproccessed_data, y), exclude = NULL)), 4) * 100
y
  no  yes 
88.3 11.7 

training data

round(prop.table(table(select(train_set, y), exclude = NULL)), 4) * 100
y
   no   yes 
88.28 11.72 

test data

round(prop.table(table(select(test_set, y), exclude = NULL)), 4) * 100
y
   no   yes 
88.39 11.61 

Next, I apply the SMOTE algorithm to generate synthetic data to make the training set balanced in the target

library(smotefamily)
set.seed(456)

smote_result<- SMOTE(X = train_set[, -39], target = train_set$y, 
                      K = 5, dup_size = 6.5)
train_set<-data.frame(smote_result$data)
names(train_set)[ncol(train_set)] <- "y"
train_set$y <- as.factor(train_set$y)
round(prop.table(table(select(train_set, y), exclude = NULL)), 4) * 100
y
   no   yes 
51.83 48.17 

The training set now has about an equal split in the target.
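
As a rough sanity check on these proportions (assuming smotefamily effectively generates about floor(dup_size) = 6 synthetic "yes" cases per original minority case; how the non-integer dup_size is handled is an assumption here):

n_train    <- length(sample_set)       # rows in the original (pre-SMOTE) training set
n_yes_orig <- round(0.1172 * n_train)  # "yes" cases before SMOTE (11.72% of the training set)
n_no_orig  <- n_train - n_yes_orig     # "no" cases are left untouched by SMOTE
n_yes_new  <- n_yes_orig * (1 + 6)     # original plus ~6 synthetic copies of each
c(no = n_no_orig, yes = n_yes_new, total = n_no_orig + n_yes_new)
# about 31,930 "no" and 29,673 "yes" in 61,603 rows, consistent with the 51.8%/48.2% split above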

Lastly, I will convert the dummy variables to Boolean values:

train_set <- train_set |> 
  mutate(across(where(is.numeric) & !all_of(c("age", "balance", "day", "pdays", "previous", "y")), ~ . == 1))
test_set <- test_set |> 
  mutate(across(where(is.numeric) & !all_of(c("age", "balance", "day", "pdays", "previous", "y")), ~ . == 1))
str(train_set)
'data.frame':   61603 obs. of  39 variables:
 $ age                : num  65 54 35 26 30 60 43 67 48 41 ...
 $ job.blue.collar    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ job.entrepreneur   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ job.housemaid      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ job.management     : logi  FALSE TRUE FALSE TRUE FALSE FALSE ...
 $ job.retired        : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
 $ job.self.employed  : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
 $ job.services       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ job.student        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ job.technician     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ job.unemployed     : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
 $ marital.married    : logi  TRUE TRUE FALSE FALSE FALSE TRUE ...
 $ marital.single     : logi  FALSE FALSE TRUE TRUE TRUE FALSE ...
 $ education.secondary: logi  TRUE FALSE TRUE FALSE FALSE FALSE ...
 $ education.tertiary : logi  FALSE TRUE FALSE TRUE TRUE TRUE ...
 $ default.yes        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ balance            : num  952 1303 127 3704 3343 ...
 $ housing.yes        : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
 $ loan.yes           : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ contact.telephone  : logi  FALSE FALSE TRUE FALSE FALSE TRUE ...
 $ day                : num  6 3 14 19 1 26 2 11 28 5 ...
 $ month.aug          : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
 $ month.dec          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ month.feb          : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...
 $ month.jan          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ month.jul          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ month.jun          : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
 $ month.mar          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ month.may          : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
 $ month.nov          : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
 $ month.oct          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ month.sep          : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
 $ campaign           : logi  TRUE TRUE TRUE FALSE TRUE TRUE ...
 $ pdays              : num  96 -1 -1 -1 -1 -1 -1 91 92 90 ...
 $ previous           : num  1 0 0 0 0 0 0 3 12 7 ...
 $ poutcome.other     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ poutcome.success   : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
 $ poutcome.unknown   : logi  FALSE TRUE TRUE TRUE TRUE TRUE ...
 $ y                  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
str(test_set)
'data.frame':   9042 obs. of  39 variables:
 $ age                : num  58 33 58 29 32 57 25 36 50 36 ...
 $ job.blue-collar    : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
 $ job.entrepreneur   : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...
 $ job.housemaid      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ job.management     : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
 $ job.retired        : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
 $ job.self-employed  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ job.services       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ job.student        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ job.technician     : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
 $ job.unemployed     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ marital.married    : logi  TRUE TRUE TRUE FALSE FALSE FALSE ...
 $ marital.single     : logi  FALSE FALSE FALSE TRUE TRUE FALSE ...
 $ education.secondary: logi  FALSE TRUE FALSE TRUE FALSE TRUE ...
 $ education.tertiary : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
 $ default.yes        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ balance            : num  2143 2 121 390 23 ...
 $ housing.yes        : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ loan.yes           : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...
 $ contact.telephone  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ day                : num  5 5 5 5 5 5 5 5 5 5 ...
 $ month.aug          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ month.dec          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ month.feb          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ month.jan          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ month.jul          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ month.jun          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ month.mar          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ month.may          : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ month.nov          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ month.oct          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ month.sep          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ campaign           : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ pdays              : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ previous           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome.other     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ poutcome.success   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ poutcome.unknown   : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ y                  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Decision tree

The first experiment will involve two decision tree models.

For this experiment, I will test two models - one that does not preprocess the data by centering and scaling and one that does.

My hypothesis is that there is no meaningful difference between the two models in terms of performance, because decision trees handle continuous variables without requiring scaling.

The evaluation metric for the training set will be accuracy, since we have a roughly balanced data set. The metric for the test set will be F1, since the test set is unbalanced.
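
For reference, the F1 score is the harmonic mean of precision and recall. The helper below is a minimal sketch of the calculation (the F1_Score() function from the MLmetrics package, used later, computes the same quantity directly from predictions); f1_from_counts is a hypothetical name, not part of any package.

f1_from_counts <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)  # correct positive predictions out of all positive predictions
  recall    <- tp / (tp + fn)  # actual positive instances that were correctly predicted
  2 * precision * recall / (precision + recall)
}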

First experiment

Below I train a decision tree model without centering and scaling:

tree1 <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'rpart',
  trControl = trainControl(method = 'cv', number = 10)
)
tree1
CART 

61603 samples
   38 predictor
    2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 55443, 55442, 55443, 55442, 55442, 55443, ... 
Resampling results across tuning parameters:

  cp          Accuracy   Kappa    
  0.05961649  0.7419124  0.4842386
  0.09355306  0.7034886  0.4103832
  0.34971186  0.5832351  0.1465104

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.05961649.

As seen above, the chosen model had a complexity parameter of about 0.06 and an accuracy of 0.74. Below is a visualization of the tree:

rpart.plot::rpart.plot(tree1$finalModel)

I will record the performance of the final model on the training data for comparison with other models

performance_metrics_training <- data.frame(Model=character(),
                                     Accuracy = numeric(),
                                     Kappa = numeric())
performance_metrics_training[nrow(performance_metrics_training) + 1, ] <-
  c('Tree1', round(getTrainPerf(tree1)[1], 4), round(getTrainPerf(tree1)[2], 4))

Next, I try predicting the values in the test set; however, I run into an issue. Two column names in the test set do not match the training set:

setdiff(names(train_set), names(test_set))
[1] "job.blue.collar"   "job.self.employed"
setdiff(names(test_set), names(train_set))
[1] "job.blue-collar"   "job.self-employed"

It seems the error is caused by an inconsistency in the column names (a period vs. a hyphen). I will change the names in the test set to match the training set.

colnames(test_set)[which(names(test_set) == "job.blue-collar")] <- "job.blue.collar"
colnames(test_set)[which(names(test_set) == "job.self-employed")] <- "job.self.employed"

setdiff(names(test_set), names(train_set))
character(0)
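
The likely cause is that wrapping the SMOTE output in data.frame() applies make.names() to the column names by default (check.names = TRUE), converting the hyphens to periods, while the test set kept the original hyphenated names. An equivalent, more general fix would be to sanitize both sets of names the same way:

# alternative (equivalent here): convert any syntactically invalid characters consistently,
# e.g. "job.blue-collar" becomes "job.blue.collar" on both sides
names(train_set) <- make.names(names(train_set))
names(test_set)  <- make.names(names(test_set))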

Next, I try predicting again:

tree1_predictions <- predict(tree1, test_set, type='raw')

Below are the performance metrics:

tree1CM <- confusionMatrix(tree1_predictions, test_set$y, positive = 'yes')
tree1CM
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  5613  489
       yes 2379  561
                                          
               Accuracy : 0.6828          
                 95% CI : (0.6731, 0.6924)
    No Information Rate : 0.8839          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1328          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.53429         
            Specificity : 0.70233         
         Pos Pred Value : 0.19082         
         Neg Pred Value : 0.91986         
             Prevalence : 0.11612         
         Detection Rate : 0.06204         
   Detection Prevalence : 0.32515         
      Balanced Accuracy : 0.61831         
                                          
       'Positive' Class : yes             
                                          

I also look at the F1 score:

library(MLmetrics)
tree1F1 <- F1_Score(y_pred = tree1_predictions, y_true = test_set$y, positive = "yes")
tree1F1
[1] 0.281203

This model has poor performance: the accuracy of 0.68 is worse than simply guessing "no" for every observation (the no-information rate of 0.88), the kappa of 0.13 is close to zero, and the F1 score is low as well.

I will create a data frame to track test-set performance metrics for comparison across models.

performance_metrics_test <- data.frame(Model=character(),
                                     Accuracy = numeric(),
                                     F1 = numeric(),
                                     Kappa = numeric(),
                                     Recall = numeric(),
                                     Precision = numeric())
performance_metrics_test[nrow(performance_metrics_test) + 1, ] <-
  c('Tree1', round(tree1CM$overall[1], 4), round(tree1F1, 4), round(tree1CM$overall[2], 4),
    round(tree1CM$byClass[1], 4), round(tree1CM$byClass[3], 4))
performance_metrics_test
  Model Accuracy     F1  Kappa Recall Precision
1 Tree1   0.6828 0.2812 0.1328 0.5343    0.1908

Second experiment

Below I train a decision tree model with centering and scaling:

tree2 <- train(
  y ~ .,
  data = train_set,
  preProcess = c('center','scale'),
  metric = 'Accuracy',
  method = 'rpart',
  trControl = trainControl(method = 'cv', number = 10)
)
tree2
CART 

61603 samples
   38 predictor
    2 classes: 'no', 'yes' 

Pre-processing: centered (38), scaled (38) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 55442, 55442, 55443, 55443, 55442, 55443, ... 
Resampling results across tuning parameters:

  cp          Accuracy   Kappa    
  0.05961649  0.7424314  0.4852607
  0.09355306  0.7032606  0.4099243
  0.34971186  0.6169986  0.2222288

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.05961649.

The chosen model had the same complexity parameter of about 0.06 and a cross-validation accuracy of 0.74.

Below is a plot of the model

rpart.plot::rpart.plot(tree2$finalModel)

I record the performance metrics on the training set:

performance_metrics_training[nrow(performance_metrics_training) + 1, ] <-
  c('Tree2', round(getTrainPerf(tree2)[1], 4), round(getTrainPerf(tree2)[2], 4))

Next, I try predicting the test set:

tree2_predictions <- predict(tree2, test_set, type='raw')

Below are the performance metrics:

tree2CM <- confusionMatrix(tree2_predictions, test_set$y, positive = 'yes')
tree2CM
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  5613  489
       yes 2379  561
                                          
               Accuracy : 0.6828          
                 95% CI : (0.6731, 0.6924)
    No Information Rate : 0.8839          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1328          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.53429         
            Specificity : 0.70233         
         Pos Pred Value : 0.19082         
         Neg Pred Value : 0.91986         
             Prevalence : 0.11612         
         Detection Rate : 0.06204         
   Detection Prevalence : 0.32515         
      Balanced Accuracy : 0.61831         
                                          
       'Positive' Class : yes             
                                          

Below is the F1 score:

tree2F1 <- F1_Score(y_pred = tree2_predictions, y_true = test_set$y, positive = "yes")
tree2F1
[1] 0.281203

I add the metrics to the performance data frame:

performance_metrics_test[nrow(performance_metrics_test) + 1, ] <-
  c('Tree2', round(tree2CM$overall[1], 4), round(tree2F1, 4), round(tree2CM$overall[2], 4),
    round(tree2CM$byClass[1], 4), round(tree2CM$byClass[3], 4))
performance_metrics_test
  Model Accuracy     F1  Kappa Recall Precision
1 Tree1   0.6828 0.2812 0.1328 0.5343    0.1908
2 Tree2   0.6828 0.2812 0.1328 0.5343    0.1908

My hypothesis was correct - the two models (one without centering and scaling of data and one with) have essentially the same performance.

Random Forest

The next experiment involves random forest models. Below, I look up the tunable parameters for the model.

modelLookup('rf')
  model parameter                         label forReg forClass probModel
1    rf      mtry #Randomly Selected Predictors   TRUE     TRUE      TRUE

There is only one tunable parameter - mtry, which is the number of randomly selected features to consider at each split.

The default value for mtry is the square root of the number of features in the data set when working on classification problems.
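
As a quick check of that default (the randomForest package uses floor(sqrt(p)) for classification, where p is the number of predictors):

floor(sqrt(ncol(train_set) - 1))  # 38 predictors after excluding the target, giving a default mtry of 6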

For this experiment, I will compare the output of a model with the default value for mtry vs one that tries several different values for mtry via hyper-parameter tuning.

A small value of mtry corresponds to having a wider variety of trees with significant differentiation between them.

My hypothesis is that the value of mtry from hyper-parameter tuning will be near the default value of six.

The evaluation metric for the training set will be accuracy, since we have a roughly balanced data set. The metric for the test set will be F1, since the test set is unbalanced.

First experiment

First, I will try training the random forest model without parameter tuning, using the square root of the number of features as the mtry parameter. Because I am not doing hyper-parameter tuning, I don’t need to do resampling. I will also use parallel computing to speed up the training process.

cluster00 <- makeCluster(detectCores()-2)
registerDoParallel(cluster00)
set.seed(1011)
rf0 <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'rf',
  trControl = trainControl(method = 'none'),
  tuneGrid = expand.grid(.mtry = 6)
)

stopCluster(cluster00)
rf0
Random Forest 

61603 samples
   38 predictor
    2 classes: 'no', 'yes' 

No pre-processing
Resampling: None 

While there is no single tree to visualize, I can visualize variable importance:

plot(varImp(rf0))

I record the performance metrics on the training data:

rf0_train_predictions <- predict(rf0, newdata = train_set)
rf0_train_cm <-confusionMatrix(rf0_train_predictions, train_set$y, positive = 'yes')
performance_metrics_training[nrow(performance_metrics_training) + 1, ] <-
  c('rf0', round(rf0_train_cm$overall[1], 4), round(rf0_train_cm$overall[2], 4))

Next, I try predicting the test set with the model:

rf0_predictions <- predict(rf0, test_set, type='raw')

Below are the performance metrics:

rf0CM <- confusionMatrix(rf0_predictions, test_set$y, positive = 'yes')
rf0CM
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  7648  727
       yes  344  323
                                          
               Accuracy : 0.8816          
                 95% CI : (0.8747, 0.8881)
    No Information Rate : 0.8839          
    P-Value [Acc > NIR] : 0.7605          
                                          
                  Kappa : 0.3144          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.30762         
            Specificity : 0.95696         
         Pos Pred Value : 0.48426         
         Neg Pred Value : 0.91319         
             Prevalence : 0.11612         
         Detection Rate : 0.03572         
   Detection Prevalence : 0.07377         
      Balanced Accuracy : 0.63229         
                                          
       'Positive' Class : yes             
                                          

Below is the F1 score:

rf0F1 <- F1_Score(y_pred = rf0_predictions, y_true = test_set$y, positive = "yes")
rf0F1
[1] 0.3762376

I add the metrics to the performance table:

performance_metrics_test[nrow(performance_metrics_test) + 1, ] <-
  c('RF0', round(rf0CM$overall[1], 4), round(rf0F1, 4), round(rf0CM$overall[2], 4),
    round(rf0CM$byClass[1], 4), round(rf0CM$byClass[3], 4))

Second experiment

Below I train a random forest model that tries different values for the mtry parameter.

Below I start the cluster, train the model, and close the cluster:

cluster0 <- makeCluster(detectCores()-2)
registerDoParallel(cluster0)

set.seed(789)
rf1 <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'rf',
  trControl = trainControl(method = 'cv', number = 10)
)

stopCluster(cluster0)
rf1
Random Forest 

61603 samples
   38 predictor
    2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 55443, 55443, 55442, 55443, 55443, 55443, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.8913527  0.7814750
  20    0.9245654  0.8486363
  38    0.9209617  0.8414435

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 20.

The model with the best performance metrics was one with mtry = 20. This model has better performance metrics on the training set than the decision tree models.

While there is no single tree to visualize, I can visualize variable importance:

plot(varImp(rf1))

I record the performance on the training set:

performance_metrics_training[nrow(performance_metrics_training) + 1, ] <-
  c('rf1', round(getTrainPerf(rf1)[1], 4), round(getTrainPerf(rf1)[2], 4))

Next, I try predicting the test set:

rf1_predictions <- predict(rf1, test_set, type='raw')

Below are the performance metrics:

rf1CM <- confusionMatrix(rf1_predictions, test_set$y, positive = 'yes')
rf1CM
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  7645  708
       yes  347  342
                                          
               Accuracy : 0.8833          
                 95% CI : (0.8765, 0.8899)
    No Information Rate : 0.8839          
    P-Value [Acc > NIR] : 0.5732          
                                          
                  Kappa : 0.3318          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.32571         
            Specificity : 0.95658         
         Pos Pred Value : 0.49637         
         Neg Pred Value : 0.91524         
             Prevalence : 0.11612         
         Detection Rate : 0.03782         
   Detection Prevalence : 0.07620         
      Balanced Accuracy : 0.64115         
                                          
       'Positive' Class : yes             
                                          

I also look at the F1 score:

rf1F1 <- F1_Score(y_pred = rf1_predictions, y_true = test_set$y, positive = "yes")
rf1F1
[1] 0.3933295

I add the metrics to the performance table:

performance_metrics_test[nrow(performance_metrics_test) + 1, ] <-
  c('RF1', round(rf1CM$overall[1], 4), round(rf1F1, 4), round(rf1CM$overall[2], 4),
    round(rf1CM$byClass[1], 4), round(rf1CM$byClass[3], 4))
performance_metrics_test
  Model Accuracy     F1  Kappa Recall Precision
1 Tree1   0.6828 0.2812 0.1328 0.5343    0.1908
2 Tree2   0.6828 0.2812 0.1328 0.5343    0.1908
3   RF0   0.8816 0.3762 0.3144 0.3076    0.4843
4   RF1   0.8833 0.3933 0.3318 0.3257    0.4964

Interestingly, hyper-parameter tuning picked a model with mtry = 20, well above the square root of the number of features, as it had the highest cross-validated accuracy among the candidate values. Both models still performed poorly on the test set, though the tuned model performed better than the untuned one, and both performed better than the decision tree models.

Boosted models

The next experiment involves boosted models. Below, I look up the tunable parameters for the model:

modelLookup('xgbTree')
    model        parameter                          label forReg forClass
1 xgbTree          nrounds          # Boosting Iterations   TRUE     TRUE
2 xgbTree        max_depth                 Max Tree Depth   TRUE     TRUE
3 xgbTree              eta                      Shrinkage   TRUE     TRUE
4 xgbTree            gamma         Minimum Loss Reduction   TRUE     TRUE
5 xgbTree colsample_bytree     Subsample Ratio of Columns   TRUE     TRUE
6 xgbTree min_child_weight Minimum Sum of Instance Weight   TRUE     TRUE
7 xgbTree        subsample           Subsample Percentage   TRUE     TRUE
  probModel
1      TRUE
2      TRUE
3      TRUE
4      TRUE
5      TRUE
6      TRUE
7      TRUE

There are seven tunable parameters for the xgbTree model. Tuning all of these parameters on a fairly large training set may take a significant amount of time and computing power, thus I will try tuning two parameters: nrounds, which is the number of boosting iterations, and max_depth, which is the maximum tree depth. I will use default values for the other parameters in order to manage run time.

For this experiment I will try two different cross-validation methods: k-fold cross-validation and random cross-validation (also known as leave-group-out cross-validation, or LGOCV). K-fold cross-validation partitions the data into disjoint folds up front, so each observation appears in exactly one validation fold. Random cross-validation instead draws a new random train/validation split for each repetition; because the splits are drawn independently, the same observation can appear in several validation sets (i.e., instances may be used more than once in validation).
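
To make the difference concrete, below is a minimal sketch of the two trainControl specifications used in the two experiments; note that for LGOCV the p argument is the proportion of the data used for training in each random split (10% in the call used later).

# k-fold CV: ten disjoint folds are created up front; each row is validated exactly once
ctrl_kfold <- trainControl(method = 'cv', number = 10)
# random CV (LGOCV): ten independent random splits; a row can appear in several validation sets
ctrl_lgocv <- trainControl(method = 'LGOCV', p = 0.1, number = 10)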

My hypothesis is that there will be no difference in model performance on either the training or the test set between the two cross-validation methods. I suspect this because the training set is large and synthetically balanced. The metric for the training set will be accuracy and the metric for the test set will be F1, since the test set is unbalanced.

First experiment

First, I will try training the xgbTree model using k-fold cross validation. I will use the default values for the tuning parameters besides nrounds and max depth. I will also use parallel computing to improve the run time.

tune_grid <- expand.grid(
  nrounds = c(100, 200),       
  max_depth = c(3, 6),         
  eta = 0.3,            
  gamma = 0.01,              
  colsample_bytree = 1,
  min_child_weight = 1,   
  subsample = 1        
)
cluster1 <- makeCluster(detectCores()-2)
registerDoParallel(cluster1)

set.seed(1213)


xgbTree1 <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'xgbTree',
  trControl = trainControl(method = 'cv', number = 10),
  tuneGrid = tune_grid
)

stopCluster(cluster1)
xgbTree1
eXtreme Gradient Boosting 

61603 samples
   38 predictor
    2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 55443, 55443, 55443, 55442, 55443, 55443, ... 
Resampling results across tuning parameters:

  max_depth  nrounds  Accuracy   Kappa    
  3          100      0.9171633  0.8337416
  3          200      0.9279257  0.8553095
  6          100      0.9308152  0.8611369
  6          200      0.9328118  0.8651448

Tuning parameter 'eta' was held constant at a value of 0.3
Tuning parameter 'gamma' was held constant at a value of 0.01
Tuning parameter 'colsample_bytree' was held constant at a value of 1
Tuning parameter 'min_child_weight' was held constant at a value of 1
Tuning parameter 'subsample' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were nrounds = 200, max_depth = 6,
 eta = 0.3, gamma = 0.01, colsample_bytree = 1, min_child_weight = 1 and
 subsample = 1.

Below I plot the variable importance:

plot(varImp(xgbTree1))

I record the performance on the training data:

performance_metrics_training[nrow(performance_metrics_training) + 1, ] <-
  c('xgbTree1', round(getTrainPerf(xgbTree1)[1], 4), round(getTrainPerf(xgbTree1)[2], 4))

Next, I try predicting the test set

xgbTree1_predictions <- predict(xgbTree1, test_set, type='raw')

Below are the performance metrics:

xgbTree1CM <- confusionMatrix(xgbTree1_predictions, test_set$y, positive = 'yes')
xgbTree1CM
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  7722  744
       yes  270  306
                                          
               Accuracy : 0.8879          
                 95% CI : (0.8812, 0.8943)
    No Information Rate : 0.8839          
    P-Value [Acc > NIR] : 0.1216          
                                          
                  Kappa : 0.3205          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.29143         
            Specificity : 0.96622         
         Pos Pred Value : 0.53125         
         Neg Pred Value : 0.91212         
             Prevalence : 0.11612         
         Detection Rate : 0.03384         
   Detection Prevalence : 0.06370         
      Balanced Accuracy : 0.62882         
                                          
       'Positive' Class : yes             
                                          

F1 score:

xgbTree1F1 <- F1_Score(y_pred = xgbTree1_predictions, y_true = test_set$y, positive = "yes")
xgbTree1F1
[1] 0.3763838

I add the metrics to the performance table:

performance_metrics_test[nrow(performance_metrics_test) + 1, ] <-
  c('xgbTree1', round(xgbTree1CM$overall[1], 4), round(xgbTree1F1, 4), round(xgbTree1CM$overall[2], 4),
    round(xgbTree1CM$byClass[1], 4), round(xgbTree1CM$byClass[3], 4))

Second experiment

Below, I re-run the model from above but use random cross validation.

cluster2 <- makeCluster(detectCores()-2)
registerDoParallel(cluster2)

set.seed(1213)


xgbTree2 <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'xgbTree',
  trControl = trainControl(method = 'LGOCV', p = 0.1, number = 10),
  tuneGrid = tune_grid
)

stopCluster(cluster2)
xgbTree2
eXtreme Gradient Boosting 

61603 samples
   38 predictor
    2 classes: 'no', 'yes' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (10 reps, 10%) 
Summary of sample sizes: 6161, 6161, 6161, 6161, 6161, 6161, ... 
Resampling results across tuning parameters:

  max_depth  nrounds  Accuracy   Kappa    
  3          100      0.9096389  0.8186690
  3          200      0.9161881  0.8318128
  6          100      0.9156849  0.8308264
  6          200      0.9156921  0.8308465

Tuning parameter 'eta' was held constant at a value of 0.3
Tuning parameter 'gamma' was held constant at a value of 0.01
Tuning parameter 'colsample_bytree' was held constant at a value of 1
Tuning parameter 'min_child_weight' was held constant at a value of 1
Tuning parameter 'subsample' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were nrounds = 200, max_depth = 3,
 eta = 0.3, gamma = 0.01, colsample_bytree = 1, min_child_weight = 1 and
 subsample = 1.

Variable importance:

plot(varImp(xgbTree2))

I record the performance on the training data:

performance_metrics_training[nrow(performance_metrics_training) + 1, ] <-
  c('xgbTree2', round(getTrainPerf(xgbTree2)[1], 4), round(getTrainPerf(xgbTree2)[2], 4))

Predicting the test set:

xgbTree2_predictions <- predict(xgbTree2, test_set, type='raw')

Performance metrics:

xgbTree2CM <- confusionMatrix(xgbTree2_predictions, test_set$y, positive = 'yes')
xgbTree2CM
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  7697  732
       yes  295  318
                                          
               Accuracy : 0.8864          
                 95% CI : (0.8797, 0.8929)
    No Information Rate : 0.8839          
    P-Value [Acc > NIR] : 0.2307          
                                          
                  Kappa : 0.3246          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.30286         
            Specificity : 0.96309         
         Pos Pred Value : 0.51876         
         Neg Pred Value : 0.91316         
             Prevalence : 0.11612         
         Detection Rate : 0.03517         
   Detection Prevalence : 0.06779         
      Balanced Accuracy : 0.63297         
                                          
       'Positive' Class : yes             
                                          

F1 score:

xgbTree2F1 <- F1_Score(y_pred = xgbTree2_predictions, y_true = test_set$y, positive = "yes")
xgbTree2F1
[1] 0.3824414

Performance table

performance_metrics_test[nrow(performance_metrics_test) + 1, ] <-
  c('xgbTree2', round(xgbTree2CM$overall[1], 4), round(xgbTree2F1, 4), round(xgbTree2CM$overall[2], 4),
    round(xgbTree2CM$byClass[1], 4), round(xgbTree2CM$byClass[3], 4))
performance_metrics_test
     Model Accuracy     F1  Kappa Recall Precision
1    Tree1   0.6828 0.2812 0.1328 0.5343    0.1908
2    Tree2   0.6828 0.2812 0.1328 0.5343    0.1908
3      RF0   0.8816 0.3762 0.3144 0.3076    0.4843
4      RF1   0.8833 0.3933 0.3318 0.3257    0.4964
5 xgbTree1   0.8879 0.3764 0.3205 0.2914    0.5312
6 xgbTree2   0.8864 0.3824 0.3246 0.3029    0.5188

There is essentially no performance difference between the two models on the test set.

Conclusion

Below are the final data frames showing performance on the training and test sets by model.

Training:

performance_metrics_training
     Model Accuracy  Kappa
1    Tree1   0.7419 0.4842
2    Tree2   0.7424 0.4853
3      rf0   0.9559 0.9115
4      rf1   0.9246 0.8486
5 xgbTree1   0.9328 0.8651
6 xgbTree2   0.9162 0.8318

Test:

performance_metrics_test
     Model Accuracy     F1  Kappa Recall Precision
1    Tree1   0.6828 0.2812 0.1328 0.5343    0.1908
2    Tree2   0.6828 0.2812 0.1328 0.5343    0.1908
3      RF0   0.8816 0.3762 0.3144 0.3076    0.4843
4      RF1   0.8833 0.3933 0.3318 0.3257    0.4964
5 xgbTree1   0.8879 0.3764 0.3205 0.2914    0.5312
6 xgbTree2   0.8864 0.3824 0.3246 0.3029    0.5188

The results of the models are seen above. Relative to the hypothesis, the conclusion for each experiment is:

  • Experiment 1: the hypothesis that there is no meaningful difference between a model that centers and scales the data and one that does not was correct

  • Experiment 2: the hypothesis that the model with mtry equal to the square root of the number of features would perform better than one tuned via hyper-parameter search was slightly incorrect: the model with the tuned mtry had worse accuracy on the training set but a slightly better F1 score on the test set

  • Experiment 3: the hypothesis that there would be no difference between a model that uses k-fold cross validation and random cross validation was correct for the test set, though the model with k-fold cross validation had better performance on the training set

The decision tree models had mediocre performance on the training set, while the random forest and xgbTree models had strong performance on the training set. However, all of the models performed poorly on the test set, which was imbalanced. Given the imbalance, we were interested in a model that can accurately predict the minority class ("yes") and therefore used the F1 score, which combines recall (the share of actual positive instances that were correctly predicted) and precision (the share of positive predictions that were correct). All of the models had an F1 score below 0.5, indicating poor performance. The model that performed best on the test set was RF1, with an F1 score of about 0.39. The gap between training and test performance suggests that the models have low bias but high variance.

Though the RF1 model had the highest performance on the test set, it still performed poorly overall, and thus I would not recommend it. Further experiments could consider more extensive hyper-parameter tuning. Additionally, I am curious how the models would have performed if I had not created a synthetically balanced training set; it is possible that there was too much synthetic data, which poorly represented the minority class in the test set. I would also try keeping the categorical variables as single columns rather than creating dummy variables, since decision tree models can handle categorical data directly.