We will use the DescTools, caret, randomForest and ipred packages. If you do not already have the randomForest and ipred packages installed, you will first install them using the install.packages() function.
install.packages(c("randomForest", "ipred"))
Next, we load all of the necessary libraries for use in the session.
library(caret)
library(DescTools)
library(randomForest)
library(ipred)
In this lesson, we use the DirectMarketing.csv file, which contains information for 1000 customers of a business that sells products by mail-order catalog. The marketing manager of the company wants to use historical data about the customers (demographic and past marketing information) to predict whether a customer will spend (Amount) at or below average (Below_Avg) or above average (Above_Avg). The company prioritizes correctly predicting whether a customer will purchase more than the average customer, because it has a limited advertising budget and needs to make strategic marketing decisions.
Variables include:
- Age: age group of the customer (Young, Middle, Old)
- Gender: Male or Female
- OwnHome: whether the customer owns (Own) or rents (Rent) their home
- Married: married (Married) or single (Single)
- Location: whether the customer is located close (Close) or not (Far)
- Salary: annual salary
- Children: number of children (0 to 3)
- Catalogs: number of catalogs sent (6, 12, 18 or 24)
- Amount: amount spent, at or below average (Below_Avg) or above average (Above_Avg)
We use the read.csv() function to import the CSV file into R as a dataframe named DM. We set stringsAsFactors = FALSE to keep any character columns as-is.
DM <- read.csv(file = "DirectMarketing.csv",
stringsAsFactors = FALSE)
First, we can obtain high-level information about the DM dataframe to look at the variable types and to check for missing (NA) values.
Abstract(DM)
## ------------------------------------------------------------------------------
## DM
##
## data frame: 1000 obs. of 9 variables
## 1000 complete cases (100.0%)
##
## Nr ColName Class NAs Levels
## 1 Age character .
## 2 Gender character .
## 3 OwnHome character .
## 4 Married character .
## 5 Location character .
## 6 Salary integer .
## 7 Children integer .
## 8 Catalogs integer .
## 9 Amount character .
Next, we can convert the target class variable that we want to predict, Amount, to a nominal factor variable.
DM$Amount <- factor(x = DM$Amount)
We can plot the distribution using a bar plot, which is the default plot for the plot() function when applied to factor variables.
plot(x = DM$Amount,
main = "Amount Spent")
First, we can convert our nominal predictor variables to factor variables. We will set up a convenience vector of the names of the nominal variables (noms) so that we can more easily refer to them.
noms <- c("Gender", "OwnHome", "Married", "Location")
DM[ ,noms] <- lapply(X = DM[ ,noms],
FUN = factor)
Next, we can convert our ordinal variables. The Age variable takes on three levels, which we will need to specify in the correct order. Since the Children and Catalogs variables take on numbers, we can use the default ordering and convert them at the same time. We will set up a convenience vector of the names of the ordinal variables (ords) so that we can more easily refer to them.
ords <- c("Age", "Children", "Catalogs")
DM$Age <- factor(x = DM$Age,
levels = c("Young", "Middle", "Old"),
ordered = TRUE)
DM[,c("Children", "Catalogs")] <- lapply(X = DM[,c("Children", "Catalogs")],
FUN = factor,
ordered = TRUE)
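As an optional sanity check (not part of the original walkthrough), we can confirm that the factor levels were set in the intended order:
levels(DM$Age)   # should be "Young", "Middle", "Old" in that order
str(DM[ , ords]) # Age, Children and Catalogs should show as ordered factors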
We only have one numerical variable, Salary.
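If we want a quick look at its distribution before the summary below, one optional option is a base-R histogram:
hist(x = DM$Salary,
     main = "Distribution of Salary",
     xlab = "Salary")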
First, we can view summary() information for our prepared data.
summary(DM)
## Age Gender OwnHome Married Location
## Young :287 Female:506 Own :516 Married:502 Close:710
## Middle:508 Male :494 Rent:484 Single :498 Far :290
## Old :205
##
##
##
## Salary Children Catalogs Amount
## Min. : 10100 0:462 6 :252 Above_Avg:399
## 1st Qu.: 29975 1:267 12:282 Below_Avg:601
## Median : 53700 2:146 18:233
## Mean : 56104 3:125 24:233
## 3rd Qu.: 77025
## Max. :168800
Since we will be building ensemble classifiers made up of Decision Trees, we have the same data considerations as for Decision Trees. We know that Decision Trees can handle missing values (we can impute or eliminate them up-front, or tell the model how to handle them) and irrelevant and redundant variables, and that no rescaling or standardization is needed. For this reason, we can use the dataset as-is in our modeling, without any transformations.
If missing values were present, we could remove them up-front (na.omit()) or perform imputation, or we could handle them during the analysis. We can confirm that there are no missing values using the PlotMiss() function from the DescTools package.
PlotMiss(DM)
We use the createDataPartition() function from the caret package to identify the row numbers that we will include in our training set. Then, all other rows will be put in our testing set. We split the data using an 85/15 split (85% in training and 15% in testing). By using createDataPartition() we preserve the distribution of our outcome (Y) variable (Amount). Since the function takes a random sample, we initialize a random seed first for reproducibility.
set.seed(831) # initialize the random seed
# Generate the list of observations for the
# train dataframe
sub <- createDataPartition(y = DM$Amount, # target variable
p = 0.85, # proportion in training set
list = FALSE)
We subset the rows of the DM dataframe to include the row numbers in the sub object to create the train dataframe. We use all observations not in the sub object to create the test dataframe.
train <- DM[sub, ]
test <- DM[-sub, ]
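Because createDataPartition() samples within each class, the distribution of Amount should be approximately preserved in both sets. As a quick optional check, we could compare the class proportions:
prop.table(table(train$Amount)) # proportions in the training set
prop.table(table(test$Amount))  # proportions in the testing set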
We use the bagging() function in the ipred package to perform bagging. NA values are handled as in the rpart() function (NA predictor values are ignored and NA target values are removed row-wise).
set.seed(831) # initialize random seed
bag.mod <- bagging(formula = Amount ~ ., # use all other variables to predict Amount
nbagg = 500, # build 500 trees
data = train, # use train data
coob = TRUE) # compute out of bag (OOB) error
We can run a line of code with the name of the model to view basic information about it.
bag.mod
##
## Bagging classification trees with 500 bootstrap replications
##
## Call: bagging.data.frame(formula = Amount ~ ., data = train, nbagg = 500,
## coob = TRUE)
##
## Out-of-bag estimate of misclassification error: 0.1422
We use the predict() function to generate class predictions and do not specify anything for the newdata argument so that the out-of-bag estimate is computed. This allows us to honestly assess our training performance.
bagtrpreds <- predict(object = bag.mod)
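As an optional sanity check, the misclassification rate implied by these out-of-bag predictions can be computed directly; it should approximately match the OOB error reported by bag.mod above:
# proportion of OOB predictions that disagree with the actual classes
mean(bagtrpreds != train$Amount)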
We can use the confusionMatrix() function from the caret package to obtain a confusion matrix and obtain performance measures for our model applied to the training dataset (train). Since we prioritize predicting the Above_Avg class, we set positive = "Above_Avg".
bag_train_conf <- confusionMatrix(data = bagtrpreds, # predictions
reference = train$Amount, # actual
positive = "Above_Avg",
mode = "everything")
We will view the output when assessing goodness of fit.
We use the predict() function to generate class predictions for our testing set (test).
bagtepreds <- predict(object = bag.mod, # Bagging model
newdata = test, # testing data
type = "class")
We can use the confusionMatrix() function from the caret package to obtain a confusion matrix and obtain performance measures for our model applied to the testing dataset (test).
bag_test_conf <- confusionMatrix(data = bagtepreds, # predictions
reference = test$Amount, # actual
positive = "Above_Avg",
mode = "everything")
bag_test_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Above_Avg Below_Avg
## Above_Avg 46 12
## Below_Avg 13 78
##
## Accuracy : 0.8322
## 95% CI : (0.7624, 0.8884)
## No Information Rate : 0.604
## P-Value [Acc > NIR] : 1.443e-09
##
## Kappa : 0.6482
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7797
## Specificity : 0.8667
## Pos Pred Value : 0.7931
## Neg Pred Value : 0.8571
## Precision : 0.7931
## Recall : 0.7797
## F1 : 0.7863
## Prevalence : 0.3960
## Detection Rate : 0.3087
## Detection Prevalence : 0.3893
## Balanced Accuracy : 0.8232
##
## 'Positive' Class : Above_Avg
##
As shown, based on sensitivity and specificity, the model does better at predicting our negative class, Below_Avg, although the number of misclassifications per class is approximately the same (there are more observations of Below_Avg than Above_Avg). Our Accuracy is good and our Kappa value shows good agreement between the actual and predicted values. Overall, the model performs well, although the performance for the negative class, Below_Avg, is better than for our class of interest.
To assess whether the model is balanced, underfitting or overfitting, we compare the performance on the training and testing sets. We can use the cbind() function to compare them side-by-side. We want our model to be balanced, with consistent performance across the training and testing sets.
# Overall
cbind(Training = bag_train_conf$overall,
Testing = bag_test_conf$overall)
## Training Testing
## Accuracy 8.578143e-01 8.322148e-01
## Kappa 7.017555e-01 6.482199e-01
## AccuracyLower 8.325244e-01 7.623590e-01
## AccuracyUpper 8.805949e-01 8.883770e-01
## AccuracyNull 6.004700e-01 6.040268e-01
## AccuracyPValue 8.507076e-61 1.442926e-09
## McnemarPValue 2.753129e-01 1.000000e+00
# Class-Level
cbind(Training = bag_train_conf$byClass,
Testing = bag_test_conf$byClass)
## Training Testing
## Sensitivity 0.8029412 0.7796610
## Specificity 0.8943249 0.8666667
## Pos Pred Value 0.8348624 0.7931034
## Neg Pred Value 0.8721374 0.8571429
## Precision 0.8348624 0.7931034
## Recall 0.8029412 0.7796610
## F1 0.8185907 0.7863248
## Prevalence 0.3995300 0.3959732
## Detection Rate 0.3207991 0.3087248
## Detection Prevalence 0.3842538 0.3892617
## Balanced Accuracy 0.8486330 0.8231638
Based on the output above, we see that the model is performing consistently across the training and testing sets and we would describe the fit as balanced.
We use the randomForest() function in the randomForest package to build the model. We specify that 500 trees will be used with the ntree argument.
If NA values are present, the na.action argument can be used to identify how they should be handled (such as na.omit). The default is na.fail, which will give an error if NAs are present.
By default m (mtry) is equal to the square root of our number of predictor (X) variables, rounded down to the nearest integer.
floor(sqrt(ncol(train)-1))
## [1] 2
After building a default model, we will tune the model by changing the m (mtry) hyperparameter value.
set.seed(831) # initialize random seed
rf_mod <- randomForest(formula = Amount ~. , # use all other variables to predict Amount
data = train, # training data
importance = TRUE, # obtain variable importance
ntree = 500) # number of trees in forest
We can view basic information about our model by running a code line with the name of the randomForest object, rf_mod.
rf_mod
##
## Call:
## randomForest(formula = Amount ~ ., data = train, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 14.57%
## Confusion matrix:
## Above_Avg Below_Avg class.error
## Above_Avg 273 67 0.1970588
## Below_Avg 57 454 0.1115460
We can visualize the most important variables in the Random Forest model using the varImpPlot() function from the randomForest package. The plot displays the variables in decreasing order of importance, so our most important variables are shown at the top. The left plot displays the mean decrease in accuracy when the variable's out-of-bag values are permuted; the right plot displays the mean decrease in Gini Impurity when the variable is used to split.
varImpPlot(x = rf_mod, # randomForest object
main = "Variable Importance Plot") # title
As shown, our two most important variables based on both criteria are Salary and Catalogs.
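If we want the numeric values behind the plot, the importance() function from the randomForest package returns the importance matrix; because we fit the model with importance = TRUE, it should include the per-class and overall mean decrease in accuracy as well as the mean decrease in Gini Impurity.
importance(x = rf_mod) # variable importance matrix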
We use the predict() function to generate class predictions for our training set. Again, to obtain an honest estimate of our training performance, we do not specify anything for the newdata argument, so that the out-of-bag predictions are used.
base.RFpreds <- predict(object = rf_mod, # RF model
type = "class") # class predictions
Next, we can use the confusionMatrix() function from the caret package to obtain a confusion matrix and training performance measures.
RF_btrain_conf <- confusionMatrix(data = base.RFpreds, # predictions
reference = train$Amount, # actual
positive = "Above_Avg",
mode = "everything")
We will view the output when assessing goodness of fit.
To assess model performance, we look at how well the model performs on the testing data. First, we use the predict() function to generate class predictions for our testing set (test).
base.teRFpreds <- predict(object = rf_mod, # RF model
newdata = test, # testing data
type = "class")
We can use the confusionMatrix() function from the caret package to obtain a confusion matrix and obtain performance measures for our model applied to the testing dataset (test).
RF_btest_conf <- confusionMatrix(data = base.teRFpreds, # predictions
reference = test$Amount, # actual
positive = "Above_Avg",
mode = "everything")
RF_btest_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Above_Avg Below_Avg
## Above_Avg 47 14
## Below_Avg 12 76
##
## Accuracy : 0.8255
## 95% CI : (0.7549, 0.8827)
## No Information Rate : 0.604
## P-Value [Acc > NIR] : 4.605e-09
##
## Kappa : 0.6373
##
## Mcnemar's Test P-Value : 0.8445
##
## Sensitivity : 0.7966
## Specificity : 0.8444
## Pos Pred Value : 0.7705
## Neg Pred Value : 0.8636
## Precision : 0.7705
## Recall : 0.7966
## F1 : 0.7833
## Prevalence : 0.3960
## Detection Rate : 0.3154
## Detection Prevalence : 0.4094
## Balanced Accuracy : 0.8205
##
## 'Positive' Class : Above_Avg
##
As in the case of Bagging, we see around the same number of misclassifications for each class based on the confusion matrix. Again, we have higher Specificity than Sensitivity, although the values are pretty close. We have good overall performance based on our Accuracy and Kappa values.
To assess whether the model is balanced, underfitting or overfitting, we compare the performance on the training and testing sets. We can use the cbind() function to compare side-by-side.
# Overall
cbind(Training = RF_btrain_conf$overall,
Testing = RF_btest_conf$overall)
## Training Testing
## Accuracy 8.542891e-01 8.255034e-01
## Kappa 6.948145e-01 6.373338e-01
## AccuracyLower 8.287726e-01 7.548702e-01
## AccuracyUpper 8.773218e-01 8.827328e-01
## AccuracyNull 6.004700e-01 6.040268e-01
## AccuracyPValue 5.267649e-59 4.604956e-09
## McnemarPValue 4.189617e-01 8.445193e-01
# Class-Level
cbind(Training = RF_btrain_conf$byClass,
Testing = RF_btest_conf$byClass)
## Training Testing
## Sensitivity 0.8029412 0.7966102
## Specificity 0.8884540 0.8444444
## Pos Pred Value 0.8272727 0.7704918
## Neg Pred Value 0.8714012 0.8636364
## Precision 0.8272727 0.7704918
## Recall 0.8029412 0.7966102
## F1 0.8149254 0.7833333
## Prevalence 0.3995300 0.3959732
## Detection Rate 0.3207991 0.3154362
## Detection Prevalence 0.3877791 0.4093960
## Balanced Accuracy 0.8456976 0.8205273
Based on the output above for the overall and class-level performance, we see that the performance is consistent across our training and testing data sets and we would describe the model as balanced.
Next, we will tune the number of variables to randomly sample as potential variables to split on (m, the mtry argument). We use the tuneRF() function in the randomForest package. The output will typically be a plot, where we choose the mtry with the smallest OOB Error. By setting doBest = TRUE, the best mtry will be used to automatically create the best fitting, tuned RF model.
set.seed(831) # initialize random seed
tuneR <- tuneRF(x = train[, -9], # use all variables except the 9th (Amount) as predictors
y = train$Amount, # use Amount as the target variable
ntreeTry = 500, # 500 trees in the forest
doBest = TRUE) # use the best m value to create an RF model
## mtry = 2 OOB error = 14.57%
## Searching left ...
## mtry = 1 OOB error = 15.98%
## -0.09677419 0.05
## Searching right ...
## mtry = 4 OOB error = 13.4%
## 0.08064516 0.05
## mtry = 8 OOB error = 13.63%
## -0.01754386 0.05
Based on the plot, we see that the optimal mtry (m) value is 4, which is higher than our default value, 2.
We can run a code line with the name of the tuneRF object, tuneR, to obtain basic information about the model.
tuneR
##
## Call:
## randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1])
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 13.16%
## Confusion matrix:
## Above_Avg Below_Avg class.error
## Above_Avg 277 63 0.18529412
## Below_Avg 49 462 0.09589041
Again, we can use the varImpPlot() function to obtain variable importance information for our tuned model. Because the model returned by tuneRF() is not built with importance = TRUE, we only obtain one measure, the mean decrease in Gini Impurity. Based on the plot, we see that Salary and Catalogs are still our two most important variables.
varImpPlot(x = tuneR, main = "Variable Importance Plot")
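If we wanted both importance measures for the tuned model, one option (a sketch using the hypothetical object name rf_tuned_manual; results may differ slightly from tuneR because the random draws differ) is to refit randomForest() ourselves with the tuned mtry value:
set.seed(831) # initialize random seed
rf_tuned_manual <- randomForest(formula = Amount ~ ., # use all other variables to predict Amount
                                data = train,         # training data
                                mtry = 4,             # tuned value from tuneRF()
                                ntree = 500,          # number of trees in forest
                                importance = TRUE)    # obtain both importance measures
varImpPlot(x = rf_tuned_manual,
           main = "Variable Importance Plot (Manual Refit)")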
We obtain predictions based on our out-of-bag estimation using the predict() function to generate class predictions.
tune.trRFpreds <- predict(object = tuneR, # Tuned RF model
type = "class") # class predictions
We use the class predictions and actual class in the confusionMatrix() function to obtain training performance, which we will use to assess goodness of fit.
RF_ttrain_conf <- confusionMatrix(data = tune.trRFpreds, # predictions
reference = train$Amount, # actual
positive = "Above_Avg",
mode = "everything")
We will wait to view the performance until we assess goodness of fit.
To assess model performance, we look at how well the model performs on the testing data. First, we use the predict() function to generate class predictions for our testing set (test).
tune.teRFpreds <- predict(object = tuneR, # tuned RF model
newdata = test, # testing data
type = "class")
We use the class predictions and actual class in the confusionMatrix() function to obtain testing performance.
RF_ttest_conf <- confusionMatrix(data = tune.teRFpreds, # predictions
reference = test$Amount, # actual
positive = "Above_Avg",
mode = "everything")
RF_ttest_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Above_Avg Below_Avg
## Above_Avg 45 9
## Below_Avg 14 81
##
## Accuracy : 0.8456
## 95% CI : (0.7774, 0.8996)
## No Information Rate : 0.604
## P-Value [Acc > NIR] : 1.23e-10
##
## Kappa : 0.6725
##
## Mcnemar's Test P-Value : 0.4042
##
## Sensitivity : 0.7627
## Specificity : 0.9000
## Pos Pred Value : 0.8333
## Neg Pred Value : 0.8526
## Precision : 0.8333
## Recall : 0.7627
## F1 : 0.7965
## Prevalence : 0.3960
## Detection Rate : 0.3020
## Detection Prevalence : 0.3624
## Balanced Accuracy : 0.8314
##
## 'Positive' Class : Above_Avg
##
Based on the output, the tuned model shows an improvement in performance over the default model. Although our sensitivity is slightly lower, our specificity has increased, suggesting that this model is better at predicting the Below_Avg class level. Based on our F1, this model is doing a good job of predicting our positive class, and based on our overall performance measures of Accuracy and Kappa, we have a good model.
To assess whether the model is balanced, underfitting or overfitting, we compare the performance on the training and testing sets. We can use the cbind() function to compare side-by-side.
# Overall
cbind(Training = RF_ttrain_conf$overall,
Testing = RF_ttest_conf$overall)
## Training Testing
## Accuracy 8.683901e-01 8.456376e-01
## Kappa 7.238022e-01 6.725275e-01
## AccuracyLower 8.438097e-01 7.774373e-01
## AccuracyUpper 8.903844e-01 8.995602e-01
## AccuracyNull 6.004700e-01 6.040268e-01
## AccuracyPValue 2.120031e-66 1.230087e-10
## McnemarPValue 2.193026e-01 4.042485e-01
# Class-Level
cbind(Training = RF_ttrain_conf$byClass,
Testing = RF_ttest_conf$byClass)
## Training Testing
## Sensitivity 0.8147059 0.7627119
## Specificity 0.9041096 0.9000000
## Pos Pred Value 0.8496933 0.8333333
## Neg Pred Value 0.8800000 0.8526316
## Precision 0.8496933 0.8333333
## Recall 0.8147059 0.7627119
## F1 0.8318318 0.7964602
## Prevalence 0.3995300 0.3959732
## Detection Rate 0.3254994 0.3020134
## Detection Prevalence 0.3830787 0.3624161
## Balanced Accuracy 0.8594077 0.8313559
Based on the output, we see that the model is performing very consistently across the training and testing sets for both our overall and class-level performance measures.
The goal of our classification analysis is to build the best possible model, which the company should use to strategically target its customers. We will consider our Bagging model and our tuned Random Forest model. The next step is to compare these models so that we can make a recommendation to the company.
When choosing the model we want to consider performance, goodness of fit and any other relevant factors that can impact the choice, such as scalability, computational complexity and interpretability.
First, we can use the cbind() function to compare the testing performance of the Bagging and Random Forest models for the overall and class-level measures.
# Overall
cbind(Bagging = bag_test_conf$overall,
RF = RF_ttest_conf$overall)
## Bagging RF
## Accuracy 8.322148e-01 8.456376e-01
## Kappa 6.482199e-01 6.725275e-01
## AccuracyLower 7.623590e-01 7.774373e-01
## AccuracyUpper 8.883770e-01 8.995602e-01
## AccuracyNull 6.040268e-01 6.040268e-01
## AccuracyPValue 1.442926e-09 1.230087e-10
## McnemarPValue 1.000000e+00 4.042485e-01
# Class-Level
cbind(Bagging = bag_test_conf$byClass,
RF = RF_ttest_conf$byClass)
## Bagging RF
## Sensitivity 0.7796610 0.7627119
## Specificity 0.8666667 0.9000000
## Pos Pred Value 0.7931034 0.8333333
## Neg Pred Value 0.8571429 0.8526316
## Precision 0.7931034 0.8333333
## Recall 0.7796610 0.7627119
## F1 0.7863248 0.7964602
## Prevalence 0.3959732 0.3959732
## Detection Rate 0.3087248 0.3020134
## Detection Prevalence 0.3892617 0.3624161
## Balanced Accuracy 0.8231638 0.8313559
As shown, the models have pretty similar performance, although the Random Forest model has slightly better performance based on Accuracy, Kappa, Precision, F1 and Specificity. The Bagging model, however, has slightly higher Sensitivity/Recall, suggesting that it is doing a better job of predicting our class-level of interest, Above_Avg.
Next, we can consider the goodness of fit for each model. When comparing across methods, if one model is overfitting and another is balanced, we would prefer the model that is balanced. In this case, both models are balanced.
Since the performance of the models is very similar and both are balanced, to make a determination of which model to recommend to the company, we may want to consider additional priorities, such as scalability, implementation and interpretation. Both models are efficient, and can be easily implemented, particularly considering that the amount of preprocessing and transformation is minimal. The Random Forest performed better across most of our performance measures and the implementation using the randomForest package also allows us to get information about variable importance. For this reason, we may suggest that the company implement the Random Forest model because of this added interpretability.