Example - Random Forest vs. ADABoost

First, the cell below imports the BreastCancer dataset from the mlbench package and does a bit of data cleaning: it removes rows with missing values (NA) and the predictively useless Id column.

data("BreastCancer")
df <- BreastCancer
df <- na.omit(df)
df <- select(df, -Id)
summary(df)

##   Cl.thickness   Cell.size     Cell.shape  Marg.adhesion  Epith.c.size
##  1      :139   1      :373   1      :346   1      :393   2      :376  
##  5      :128   10     : 67   2      : 58   2      : 58   3      : 71  
##  3      :104   3      : 52   10     : 58   3      : 58   4      : 48  
##  4      : 79   2      : 45   3      : 53   10     : 55   1      : 44  
##  10     : 69   4      : 38   4      : 43   4      : 33   6      : 40  
##  2      : 50   5      : 30   5      : 32   8      : 25   5      : 39  
##  (Other):114   (Other): 78   (Other): 93   (Other): 61   (Other): 65  
##   Bare.nuclei   Bl.cromatin  Normal.nucleoli    Mitoses          Class    
##  1      :402   3      :161   1      :432     1      :563   benign   :444  
##  10     :132   2      :160   10     : 60     2      : 35   malignant:239  
##  2      : 30   1      :150   3      : 42     3      : 33                  
##  5      : 30   7      : 71   2      : 36     10     : 14                  
##  3      : 28   4      : 39   8      : 23     4      : 12                  
##  8      : 21   5      : 34   6      : 22     7      :  9                  
##  (Other): 40   (Other): 68   (Other): 68     (Other): 17

The Class variable in this dataset tells us whether a tumor is benign or malignant. The remaining 9 columns are the predictor variables.

#df <- df  %>% mutate(
#  Class_new = case_when(
#    Class == 'benign' ~ -1,
#    TRUE ~ 1
#  )
#)
#df <- dplyr::select(df, -Class)
#df$Class <- df$Class_new
#df <- dplyr::select(df, -Class_new)
df$Class <- as.factor(df$Class)
table(df$Class)

## 
##    benign malignant 
##       444       239

The cell below splits the data into training and testing datasets using an approximately 75/25 split (each row is assigned to the training set with probability 0.75).

set.seed(seed_num)
ind <- sample(2, nrow(df), replace = TRUE, prob = c(0.75, 0.25))
train <- df[ind==1,]
test <- df[ind==2,]
x_train <- dplyr::select(train, -Class)
y_train <- train$Class
x_test <- dplyr::select(test, -Class)
y_test <- test$Class
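
Because `sample()` assigns each row independently here, the split is only approximately 75/25 (519 training rows and 164 test rows out of 683 in the outputs of this example). If an exact split is wanted, one option is to sample row indices without replacement. The following is a minimal sketch on a stand-in data frame, not the code used in this example; the helper name `split_exact` and the `toy` data are assumptions made for illustration.

```r
# Hypothetical helper: put exactly round(p * n) rows in the training set.
split_exact <- function(data, p = 0.75, seed = 1) {
  set.seed(seed)
  n <- nrow(data)
  train_idx <- sample(n, size = round(p * n))  # sampled without replacement
  list(train = data[train_idx, , drop = FALSE],
       test  = data[-train_idx, , drop = FALSE])
}

# Stand-in data with the same number of rows as the cleaned dataset
toy <- data.frame(x = seq_len(683),
                  Class = rep(c("benign", "malignant"), c(444, 239)))
parts <- split_exact(toy)
nrow(parts$train)  # 512
nrow(parts$test)   # 171
```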

The cell below trains an “out of the box” random forest classifier using the training data.

rf <- randomForest(Class~., data=train) 
print(rf)

## 
## Call:
##  randomForest(formula = Class ~ ., data = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 2.89%
## Confusion matrix:
##           benign malignant class.error
## benign       325         9  0.02694611
## malignant      6       179  0.03243243
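
The summary figures in this printout can be re-derived from the out-of-bag (OOB) confusion matrix. The short check below copies the four counts from the output; the variable names are chosen here for illustration. It also shows where the 3 variables tried at each split come from, assuming the randomForest default for classification of floor(sqrt(p)) with p = 9 predictors.

```r
# OOB confusion matrix counts copied from the printout (rows = true class)
oob <- matrix(c(325,   9,
                  6, 179),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual    = c("benign", "malignant"),
                              predicted = c("benign", "malignant")))

# Overall OOB error: off-diagonal counts over all training rows
oob_error <- (sum(oob) - sum(diag(oob))) / sum(oob)
round(100 * oob_error, 2)   # 2.89

# Per-class error: misclassified rows over the row total
class_error <- 1 - diag(oob) / rowSums(oob)
round(class_error, 4)       # benign 0.0269, malignant 0.0324

# Default number of variables tried per split for classification
floor(sqrt(9))              # 3
```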

Next, we use the random forest model to make predictions on the testing dataset and see how well it performed:

p_rf <- predict(rf, x_test)
confusionMatrix(p_rf, y_test)

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       106         1
##   malignant      4        53
##                                         
##                Accuracy : 0.9695        
##                  95% CI : (0.9303, 0.99)
##     No Information Rate : 0.6707        
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.9319        
##                                         
##  Mcnemar's Test P-Value : 0.3711        
##                                         
##             Sensitivity : 0.9636        
##             Specificity : 0.9815        
##          Pos Pred Value : 0.9907        
##          Neg Pred Value : 0.9298        
##              Prevalence : 0.6707        
##          Detection Rate : 0.6463        
##    Detection Prevalence : 0.6524        
##       Balanced Accuracy : 0.9726        
##                                         
##        'Positive' Class : benign        
##

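Since this is a case in which we really want to limit false negatives (cases in which a malignant tumor is diagnosed as benign), the metric to watch is the share of malignant tumors that the model catches. Note that caret reports benign as the 'Positive' class above, so the sensitivity (0.9636) is the share of benign tumors classified correctly; the malignant detection rate is the reported specificity. The results show that the random forest model performed exceedingly well: 53 of the 54 malignant tumors in the test set (about 98%) were identified as such, with only one diagnosed as benign.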
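
To make the role of the 'Positive' class concrete, the sketch below re-derives the per-class rates from the four test-set counts in the random forest confusion matrix. The matrix `cm` is typed in by hand for illustration.

```r
# Test-set counts copied from the output (rows = prediction, columns = reference)
cm <- matrix(c(106,  1,
                 4, 53),
             nrow = 2, byrow = TRUE,
             dimnames = list(prediction = c("benign", "malignant"),
                             reference  = c("benign", "malignant")))

# Recall for each true class: correct predictions over the column total
recall <- diag(cm) / colSums(cm)
round(recall, 4)
# benign 0.9636 (reported as Sensitivity), malignant 0.9815 (reported as Specificity)

# Balanced accuracy is the mean of the two recalls
round(mean(recall), 4)  # 0.9726
```

In caret, calling `confusionMatrix(p_rf, y_test, positive = "malignant")` would instead treat malignant as the positive class, so that the malignant detection rate is reported as the sensitivity.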

The cell below implements an ADABoosting model using the same training data:

ada <- ada(Class~., data=train) 
summary(ada)

## Call:
## ada(Class ~ ., data = train)
## 
## Loss: exponential Method: discrete   Iteration: 50 
## 
## Training Results
## 
## Accuracy: 0.988 Kappa: 0.975
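
The line "Loss: exponential Method: discrete Iteration: 50" indicates discrete AdaBoost run for 50 boosting rounds. As a rough illustration of what such a procedure does, the sketch below implements textbook discrete AdaBoost with one-split trees (stumps) from rpart on a small simulated dataset. It is not the ada package's implementation, whose settings for tree depth, shrinkage, and subsampling may differ; the function names `adaboost_sketch` and `predict_sketch` and the `toy` data are hypothetical.

```r
library(rpart)  # recommended package shipped with standard R installations

# Simulated two-class data (hypothetical, not the BreastCancer data)
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1^2 + toy$x2^2 > 1.5, "pos", "neg"))

adaboost_sketch <- function(data, n_iter = 50) {
  y_num <- ifelse(data$y == "pos", 1, -1)  # labels coded as +1 / -1
  w <- rep(1 / nrow(data), nrow(data))     # start with uniform case weights
  stumps <- vector("list", n_iter)
  alphas <- numeric(n_iter)
  for (m in seq_len(n_iter)) {
    # Fit a depth-1 tree to the weighted data
    fit <- rpart(y ~ x1 + x2, data = data, weights = w,
                 control = rpart.control(maxdepth = 1, cp = -1, minsplit = 2))
    pred <- ifelse(predict(fit, data, type = "class") == "pos", 1, -1)
    err <- sum(w * (pred != y_num)) / sum(w)  # weighted error rate
    err <- min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
    alphas[m] <- 0.5 * log((1 - err) / err)   # vote weight of this stump
    w <- w * exp(-alphas[m] * y_num * pred)   # up-weight misclassified rows
    w <- w / sum(w)
    stumps[[m]] <- fit
  }
  list(stumps = stumps, alphas = alphas)
}

predict_sketch <- function(model, newdata) {
  scores <- rep(0, nrow(newdata))
  for (m in seq_along(model$stumps)) {
    pred <- ifelse(predict(model$stumps[[m]], newdata, type = "class") == "pos", 1, -1)
    scores <- scores + model$alphas[m] * pred  # weighted vote
  }
  factor(ifelse(scores > 0, "pos", "neg"), levels = c("neg", "pos"))
}

model <- adaboost_sketch(toy, n_iter = 10)
table(predicted = predict_sketch(model, toy), actual = toy$y)
```

Each round increases the weight of the rows the current stump gets wrong, so later stumps concentrate on the hard cases; the final prediction is the sign of the weighted vote.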

We can now use the ADABoosting model to make predictions on the test set:

p_ada <- predict(ada, x_test)
confusionMatrix(p_ada, y_test)

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       105         2
##   malignant      5        52
##                                          
##                Accuracy : 0.9573         
##                  95% CI : (0.914, 0.9827)
##     No Information Rate : 0.6707         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9047         
##                                          
##  Mcnemar's Test P-Value : 0.4497         
##                                          
##             Sensitivity : 0.9545         
##             Specificity : 0.9630         
##          Pos Pred Value : 0.9813         
##          Neg Pred Value : 0.9123         
##              Prevalence : 0.6707         
##          Detection Rate : 0.6402         
##    Detection Prevalence : 0.6524         
##       Balanced Accuracy : 0.9588         
##                                          
##        'Positive' Class : benign         
##

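In this case, we see that the ADABoost model did not improve on the random forest: sensitivity fell from 0.9636 to 0.9545, specificity (the malignant detection rate, since benign is the 'Positive' class) fell from 0.9815 to 0.9630, and balanced accuracy fell from 0.9726 to 0.9588. The gap is small, however: it amounts to one additional error in each class on a 164-row test set, and the two 95% confidence intervals for accuracy overlap substantially.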

Data 622 - Week 7 Discussion - ADA Boosting

William Jasmine

2024-03-19
