Libraries and setup

Libraries used for the analysis shown here.

library(tidyverse)
library(caret)
library(mlbench)
library(ggplot2)
library(data.table)
library(randomForest)
library(JOUSBoost)
library(ada)
seed_num <- 42

Example - Random Forest vs. ADABoost

First, the cell below imports the BreastCancer dataset from the mlbench package and does a bit of data cleaning: it drops rows with missing values (NA) and removes the predictively useless Id column.

data("BreastCancer")
df <- BreastCancer
df <- na.omit(df)
df <- dplyr::select(df, -Id)  # namespaced, as in the later cells, to avoid masking by other packages
summary(df)
##   Cl.thickness   Cell.size     Cell.shape  Marg.adhesion  Epith.c.size
##  1      :139   1      :373   1      :346   1      :393   2      :376  
##  5      :128   10     : 67   2      : 58   2      : 58   3      : 71  
##  3      :104   3      : 52   10     : 58   3      : 58   4      : 48  
##  4      : 79   2      : 45   3      : 53   10     : 55   1      : 44  
##  10     : 69   4      : 38   4      : 43   4      : 33   6      : 40  
##  2      : 50   5      : 30   5      : 32   8      : 25   5      : 39  
##  (Other):114   (Other): 78   (Other): 93   (Other): 61   (Other): 65  
##   Bare.nuclei   Bl.cromatin  Normal.nucleoli    Mitoses          Class    
##  1      :402   3      :161   1      :432     1      :563   benign   :444  
##  10     :132   2      :160   10     : 60     2      : 35   malignant:239  
##  2      : 30   1      :150   3      : 42     3      : 33                  
##  5      : 30   7      : 71   2      : 36     10     : 14                  
##  3      : 28   4      : 39   8      : 23     4      : 12                  
##  8      : 21   5      : 34   6      : 22     7      :  9                  
##  (Other): 40   (Other): 68   (Other): 68     (Other): 17

The Class column records whether each tumor is benign or malignant; the remaining 9 columns are the predictor variables.

#df <- df  %>% mutate(
#  Class_new = case_when(
#    Class == 'benign' ~ -1,
#    TRUE ~ 1
#  )
#)
#df <- dplyr::select(df, -Class)
#df$Class <- df$Class_new
#df <- dplyr::select(df, -Class_new)
df$Class <- as.factor(df$Class)
table(df$Class)
## 
##    benign malignant 
##       444       239

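The cell below splits the data into training and testing datasets using an approximately 75/25 split. Each row is assigned at random with probability 0.75 of landing in the training set, so the proportions are not exact.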

set.seed(seed_num)
ind <- sample(2, nrow(df), replace = TRUE, prob = c(0.75, 0.25))
train <- df[ind==1,]
test <- df[ind==2,]
x_train <- dplyr::select(train, -Class)
y_train <- train$Class
x_test <- dplyr::select(test, -Class)
y_test <- test$Class
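
As a side note on the split above, the short base-R sketch below contrasts the probabilistic assignment used in the text with an exact-size split. The row count of 683 comes from the class table above (444 + 239); the name train_idx is an illustrative assumption, not part of the original analysis.

```r
set.seed(42)
n <- 683  # complete rows after na.omit(), per the class table above

# Approach used in the text: every row is independently assigned to group 1
# or 2, so the training share is 75% only on average.
ind <- sample(2, n, replace = TRUE, prob = c(0.75, 0.25))
table(ind)

# Alternative (assumption, not the author's method): draw exactly
# floor(0.75 * n) row indices for training.
train_idx <- sample(n, size = floor(0.75 * n))
length(train_idx)      # 512 training rows
n - length(train_idx)  # 171 test rows
```

In the run shown in this document, the probabilistic approach produced 519 training rows and 164 test rows, roughly a 76/24 split. If class balance between the two sets matters, caret's createDataPartition() offers a stratified alternative.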

The cell below trains an “out of the box” random forest classifier using the training data.

rf <- randomForest(Class~., data=train) 
print(rf)
## 
## Call:
##  randomForest(formula = Class ~ ., data = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 2.89%
## Confusion matrix:
##           benign malignant class.error
## benign       325         9  0.02694611
## malignant      6       179  0.03243243

Next, we use the random forest model to make predictions on the testing dataset and see how well it performed:

p_rf <- predict(rf, x_test)
confusionMatrix(p_rf, y_test)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       106         1
##   malignant      4        53
##                                         
##                Accuracy : 0.9695        
##                  95% CI : (0.9303, 0.99)
##     No Information Rate : 0.6707        
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.9319        
##                                         
##  Mcnemar's Test P-Value : 0.3711        
##                                         
##             Sensitivity : 0.9636        
##             Specificity : 0.9815        
##          Pos Pred Value : 0.9907        
##          Neg Pred Value : 0.9298        
##              Prevalence : 0.6707        
##          Detection Rate : 0.6463        
##    Detection Prevalence : 0.6524        
##       Balanced Accuracy : 0.9726        
##                                         
##        'Positive' Class : benign        
## 

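Since this is a case in which we really want to limit false negatives (cases in which a malignant tumor is classified as benign), the metric to watch is the fraction of malignant tumors that the model catches. Note that caret treats benign as the 'Positive' class here, so the reported sensitivity (0.9636) is the share of benign tumors classified correctly, while the reported specificity (0.9815) is the share of malignant tumors classified correctly. Passing positive = "malignant" to confusionMatrix() would swap the two. The results above show that the random forest model performed exceedingly well: 53 of the 54 malignant tumors in the test set (about 98%) were identified as such by the model.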
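
To make the link between the confusion matrix and the reported statistics explicit, the sketch below recomputes them by hand in base R. The counts are copied from the random forest output above; the variable names are illustrative assumptions.

```r
# Test-set counts copied from the random forest confusion matrix above.
# matrix() fills column by column, so each column is one Reference class.
cm <- matrix(c(106, 4, 1, 53), nrow = 2,
             dimnames = list(Prediction = c("benign", "malignant"),
                             Reference  = c("benign", "malignant")))

# With benign as the 'Positive' class, caret's Sensitivity is the share of
# benign tumors predicted benign, and Specificity is the share of malignant
# tumors predicted malignant.
benign_recall    <- cm["benign", "benign"] / sum(cm[, "benign"])
malignant_recall <- cm["malignant", "malignant"] / sum(cm[, "malignant"])
balanced_acc     <- (benign_recall + malignant_recall) / 2

round(c(sensitivity = benign_recall,
        specificity = malignant_recall,
        balanced_accuracy = balanced_acc), 4)
# sensitivity 0.9636, specificity 0.9815, balanced_accuracy 0.9726
```

The single false negative in the clinical sense (a malignant tumor predicted benign) is the 1 in the top-right cell of the matrix.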

The cell below implements an ADABoosting model using the same training data:

ada <- ada(Class~., data=train) 
summary(ada)
## Call:
## ada(Class ~ ., data = train)
## 
## Loss: exponential Method: discrete   Iteration: 50 
## 
## Training Results
## 
## Accuracy: 0.988 Kappa: 0.975
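
For intuition about what "Method: discrete" with exponential loss refers to, the sketch below implements a toy version of discrete AdaBoost in base R. It is not the ada package's implementation: it uses one-feature threshold stumps as weak learners, labels coded as -1/+1, and invented ten-point data, and all function names are hypothetical.

```r
# Weak learner: find the threshold stump with the lowest weighted error.
fit_stump <- function(x, y, w) {
  best <- list(err = Inf)
  for (thr in sort(unique(x))) {
    for (dir in c(1, -1)) {
      pred <- ifelse(x <= thr, dir, -dir)
      err <- sum(w[pred != y])  # weighted misclassification rate
      if (err < best$err) best <- list(err = err, thr = thr, dir = dir)
    }
  }
  best
}

predict_stump <- function(s, x) ifelse(x <= s$thr, s$dir, -s$dir)

adaboost_fit <- function(x, y, n_iter = 10) {
  w <- rep(1 / length(y), length(y))  # start with uniform observation weights
  stumps <- list()
  alphas <- numeric(0)
  for (m in seq_len(n_iter)) {
    s <- fit_stump(x, y, w)
    err <- max(s$err, 1e-10)            # guard against log(1 / 0)
    if (err >= 0.5) break               # weak learner no better than chance
    alpha <- 0.5 * log((1 - err) / err) # vote weight of this stump
    # Exponential-loss update: misclassified points gain weight.
    w <- w * exp(-alpha * y * predict_stump(s, x))
    w <- w / sum(w)
    stumps[[length(stumps) + 1]] <- s
    alphas <- c(alphas, alpha)
  }
  list(stumps = stumps, alphas = alphas)
}

adaboost_predict <- function(model, x) {
  score <- rep(0, length(x))
  for (m in seq_along(model$stumps)) {
    score <- score + model$alphas[m] * predict_stump(model$stumps[[m]], x)
  }
  ifelse(score >= 0, 1, -1)  # sign of the weighted vote
}

# Invented toy data that no single stump can separate.
x <- 0:9
y <- c(1, 1, 1, -1, -1, -1, 1, 1, 1, -1)
model <- adaboost_fit(x, y, n_iter = 3)
mean(adaboost_predict(model, x) == y)  # training accuracy of the ensemble
```

Each round reweights the observations so that the next stump concentrates on the points the previous ones got wrong, and the final prediction is a weighted vote across the stumps.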

We can now use the ADABoosting model to make predictions on the test set:

p_ada <- predict(ada, x_test)
confusionMatrix(p_ada, y_test)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       105         2
##   malignant      5        52
##                                          
##                Accuracy : 0.9573         
##                  95% CI : (0.914, 0.9827)
##     No Information Rate : 0.6707         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9047         
##                                          
##  Mcnemar's Test P-Value : 0.4497         
##                                          
##             Sensitivity : 0.9545         
##             Specificity : 0.9630         
##          Pos Pred Value : 0.9813         
##          Neg Pred Value : 0.9123         
##              Prevalence : 0.6707         
##          Detection Rate : 0.6402         
##    Detection Prevalence : 0.6524         
##       Balanced Accuracy : 0.9588         
##                                          
##        'Positive' Class : benign         
## 

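In this case, the ADABoost model did not improve on the random forest. It missed 2 of the 54 malignant tumors in the test set instead of 1 (specificity of 0.9630 vs. 0.9815, with benign as the positive class), and both sensitivity (0.9545 vs. 0.9636) and balanced accuracy (0.9588 vs. 0.9726) were slightly lower. It is worth noting, however, that with only 164 test rows these gaps amount to one or two tumors, so they should not be over-interpreted.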

Conclusions

In general, ADABoosting can be used to improve the performance of binary classification models, but that is not the behavior witnessed here: on this test split, the out-of-the-box random forest performed slightly better. Better performance will not always be exhibited on every dataset. As such, proper testing is necessary to determine when ADABoosting is an appropriate choice.
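
One way to make such testing less dependent on a single train/test split is k-fold cross-validation. The sketch below is a minimal base-R version under stated assumptions: cv_accuracy, majority_fit, and majority_predict are hypothetical helpers, and the demonstration uses invented toy data with a majority-class baseline so that it runs without any packages.

```r
# Hypothetical helper: k-fold cross-validated accuracy for any fit/predict pair.
cv_accuracy <- function(data, target, fit_fun, predict_fun, k = 5, seed = 42) {
  set.seed(seed)
  # Shuffle fold labels so each row lands in one of k nearly equal folds.
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  sapply(seq_len(k), function(i) {
    train <- data[fold != i, , drop = FALSE]
    test  <- data[fold == i, , drop = FALSE]
    model <- fit_fun(train)
    pred  <- predict_fun(model, test)
    mean(as.character(pred) == as.character(test[[target]]))
  })
}

# Invented toy data and a majority-class baseline, for demonstration only.
toy <- data.frame(
  x = 1:100,
  Class = factor(rep(c("benign", "malignant"), times = c(65, 35)))
)
majority_fit <- function(train) names(which.max(table(train$Class)))
majority_predict <- function(model, newdata) rep(model, nrow(newdata))

acc <- cv_accuracy(toy, "Class", majority_fit, majority_predict, k = 5)
mean(acc)  # average accuracy across the 5 folds
```

With the data from this document, fit_fun could be function(d) randomForest(Class ~ ., data = d) or function(d) ada(Class ~ ., data = d), with predict_fun set to function(m, d) predict(m, d), so that both models are scored on the same folds.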