Preliminary

We will now use one new package, the caretEnsemble package, to more easily compare our trained models. If you do not already have the caretEnsemble package installed, first install it using the install.packages() function.

install.packages("caretEnsemble")

In this lesson, we will use the DescTools, caret, and caretEnsemble packages. Next, we load these packages for use in the session.

library(DescTools)
library(caret)
library(caretEnsemble)

In the lesson that follows we will use the BostonHousing.csv file. The famous Boston Housing dataset contains data about different census tracts in Boston. The variable of interest is median_val, which indicates whether the median value of owner-occupied homes in the tract is greater than (Above) or less than (Below) the overall median value (30k). The census bureau wants to create a predictive model to predict the median_val variable for new census tracts.

The variables include:

  • crim: Per capita crime rate by town
  • zn: Proportion of residential land zoned for lots over 25,000 sq. ft.
  • indus: Proportion of nonretail business acres per town
  • chas: Charles River dummy variable (1 if tract bounds river, 0 otherwise)
  • nox: Nitric oxide concentration (parts per 10 million)
  • rm: Average number of rooms per dwelling
  • age: Proportion of owner-occupied units built prior to 1940
  • dis: Weighted distances to five Boston employment centers
  • rad: Index of accessibility to radial highways (1-8,24)
  • tax: Full-value property-tax rate per $10,000
  • ptratio: Pupil/teacher ratio by town
  • median_val: Above if the median home value is above the overall median, 30k, otherwise Below

We use the read.csv() function to import the CSV file into R as a dataframe named BH. We set stringsAsFactors = FALSE to keep any character columns as-is.

BH <- read.csv(file = "BostonHousing.csv",
                   stringsAsFactors = FALSE)

Data Exploration & Variable Preparation

First, we can obtain high-level information about the BH dataframe to look at the variable types and to check for missing (NA) values.

Abstract(BH)
## ------------------------------------------------------------------------------ 
## BH
## 
## data frame:  505 obs. of  13 variables
##      505 complete cases (100.0%)
## 
##   Nr  ColName     Class      NAs  Levels
##   1   crim        numeric    .          
##   2   zn          numeric    .          
##   3   indus       numeric    .          
##   4   chas        integer    .          
##   5   nox         numeric    .          
##   6   rm          numeric    .          
##   7   age         numeric    .          
##   8   dis         numeric    .          
##   9   rad         integer    .          
##   10  tax         integer    .          
##   11  ptratio     numeric    .          
##   12  b           numeric    .          
##   13  median_val  character  .

Preparing Target (Y) Variables

Next, we can convert our target class variable, median_val, to a nominal factor variable.

BH$median_val <- factor(x = BH$median_val)

We can plot the distribution of our output (Y) variable using a barplot, which is the default plot for the plot() function when plotting factor variables. As shown, there are more Below median census tracts than Above.

plot(BH$median_val, 
     main = "Median Value")

Preparing Predictor (X) Variables

All of our potential predictors are numeric, but based on the data description, chas is categorical (binary) and rad could be treated as either categorical or numeric. We will convert chas to a nominal factor and rad to an ordered (ordinal) factor.

BH$chas <- factor(x = BH$chas)
BH$rad <- factor(x = BH$rad,
                 ordered = TRUE)

We will set up convenience vectors based on whether a variable is categorical (fac, ordinal or nominal) or numeric (num) and combine these vectors to create a vector of our predictor variable names (preds).

fac <- c("chas", "rad") # categorical
num <- names(BH)[!names(BH) %in% c(fac, "median_val")] # numeric
preds <- c(fac, num)

Finally, we can get summary statistic information for our prepared data.

summary(BH)
##       crim                zn             indus       chas         nox        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   0:471   Min.   :0.3850  
##  1st Qu.: 0.08199   1st Qu.:  0.00   1st Qu.: 5.19   1: 34   1st Qu.:0.4490  
##  Median : 0.25387   Median :  0.00   Median : 9.69           Median :0.5380  
##  Mean   : 3.61380   Mean   : 11.39   Mean   :11.12           Mean   :0.5544  
##  3rd Qu.: 3.67822   3rd Qu.: 12.50   3rd Qu.:18.10           3rd Qu.:0.6240  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74           Max.   :0.8710  
##                                                                              
##        rm             age              dis              rad     
##  Min.   :3.561   Min.   :  2.90   Min.   : 1.130   24     :131  
##  1st Qu.:5.885   1st Qu.: 45.00   1st Qu.: 2.101   5      :115  
##  Median :6.208   Median : 77.30   Median : 3.216   4      :110  
##  Mean   :6.280   Mean   : 68.55   Mean   : 3.799   3      : 38  
##  3rd Qu.:6.619   3rd Qu.: 94.10   3rd Qu.: 5.212   6      : 26  
##  Max.   :8.725   Max.   :100.00   Max.   :12.127   2      : 24  
##                                                    (Other): 61  
##       tax           ptratio            b          median_val 
##  Min.   :187.0   Min.   :12.60   Min.   :  0.32   Above: 84  
##  1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.52   Below:421  
##  Median :330.0   Median :19.00   Median :391.45              
##  Mean   :407.7   Mean   :18.45   Mean   :356.68              
##  3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23              
##  Max.   :711.0   Max.   :22.00   Max.   :396.90              
## 

Data Preprocessing & Transformation

The models that we will create will be Decision Tree models, so we only need to handle missing values (either before or during analysis).

  1. Missing Values: First, we check for missing values. If missing values are present, we can either remove them row-wise (na.omit()) or perform imputation.

PlotMiss(x = BH)

We do not have any missing values.
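
If missing values had been present, a quick numeric check and a simple row-wise removal could look like the following (a sketch; not needed here, since the data are complete):

colSums(is.na(BH)) # count of NAs in each column (all zeros here)
# BH <- na.omit(BH) # drop any rows containing missing values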

Training & Testing Sets

We use the createDataPartition() function from the caret package to identify the row numbers that we will include in our training set. Then, all other rows will be put in our testing set. We split the data using an 80/20 split (80% in training and 20% in testing). By using createDataPartition() we preserve the distribution of our outcome (Y) variable (median_val). Since the function takes a random sample, we initialize a random seed first for reproducibility.

set.seed(831)
sub <- createDataPartition(y = BH$median_val, # target variable
                           p = 0.80, # % in training
                           list = FALSE)

Next, we subset the rows of the BH dataframe to include the row numbers in the sub object to create the train dataframe. We use all observations not in the sub object to create the test dataframe.

train <- BH[sub, ] 
test <- BH[-sub, ]

Class Imbalance

First, we can evaluate if class imbalance is present in our target (Y) variable, median_val.

Desc(train$median_val)
## ------------------------------------------------------------------------------ 
## train$median_val (factor - dichotomous)
## 
##   length      n    NAs unique
##      405    405      0      2
##          100.0%   0.0%       
## 
##        freq   perc  lci.95  uci.95'
## Above    68  16.8%   13.5%   20.7%
## Below   337  83.2%   79.3%   86.5%
## 
## ' 95%-CI (Wilson)

As shown, there is a large difference in frequency between our majority class, Below, and our minority class, Above.

Baseline Model

Before we use resampling methods to address the class imbalance, we will train a Decision Tree classification model as our baseline model for comparison.

We use the train() function from the caret package to tune a Decision Tree model using repeated 5-Fold Cross Validation, and search for our optimal cp value across the default grid of 5 cp values.

First, we set up our control object as input to the trControl argument in the train() function.

ctrl <- trainControl(method = "repeatedcv",
                     number = 5, # k = 5 folds
                     repeats = 3) # repeated 3 times

Next, we initialize our random seed and use the train() function to tune our DT model, specifying method = "rpart".

set.seed(831) # initialize random seed

DTFit <- train(x = train[ ,preds], # predictors
               y = train$median_val, # target
               method = "rpart", # use the rpart package
               trControl = ctrl, # control object
               tuneLength = 5) # try 5 cp values

We can view the average Accuracy and Kappa values across our cp values and identify the optimal value.

DTFit
## CART 
## 
## 405 samples
##  12 predictor
##   2 classes: 'Above', 'Below' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 323, 324, 325, 324, 324, 325, ... 
## Resampling results across tuning parameters:
## 
##   cp         Accuracy   Kappa    
##   0.0000000  0.9424029  0.7916790
##   0.1580882  0.9251578  0.7203295
##   0.3161765  0.9251578  0.7203295
##   0.4742647  0.9251578  0.7203295
##   0.6323529  0.9023033  0.5625388
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.

Next, we obtain testing predictions and performance for the baseline model. We will use these to compare the original model against the resampled models, which aim to correct for the class imbalance present in the target variable.

base_preds <- predict(object = DTFit, 
                      newdata = test)

base_CM <- confusionMatrix(data = base_preds, # predictions
                reference = test$median_val, # actual
                positive = "Above",
                mode = "everything")

Random Undersampling

We can use the downSample() function from the caret package to randomly undersample the training data to balance the target variable. The output of the function is a copy of the original dataframe, in which the number of observations in the majority class is reduced to be equal to the number of observations in the minority class. To produce a reproducible sample, we first initialize a random seed.

set.seed(831) # initialize random seed
train_ds <- downSample(x = train[ ,preds], # predictors
                       y = train$median_val, # target
                       yname = "median_val") # name of y variable

We can compare the distribution of the target variable, median_val, before and after random undersampling.
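
One way to do this (a sketch mirroring the plots used later in the oversampling section) is to place the two barplots side by side:

par(mfrow = c(1,2)) # set plot window to 1 row, 2 columns

plot(train$median_val, main = "Original")

plot(train_ds$median_val, main = "RUS")

par(mfrow = c(1,1)) # reset plot window to 1 row, 1 column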

Next, we can use our undersampled training data (train_ds) to train another DT model, using all of the same arguments as in the baseline model.

set.seed(831) # initialize random seed

DTFit_RUS <- train(x = train_ds[,preds], # predictors
                   y = train_ds$median_val, # target
                   method = "rpart", # use the rpart package
                   trControl = ctrl, # control object
                   tuneLength = 5) # try 5 cp values

We can view the average Accuracy and Kappa values across our cp values and identify the optimal value.

DTFit_RUS
## CART 
## 
## 136 samples
##  12 predictor
##   2 classes: 'Above', 'Below' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 108, 109, 109, 109, 109, 110, ... 
## Resampling results across tuning parameters:
## 
##   cp         Accuracy   Kappa    
##   0.0000000  0.8900895  0.7797734
##   0.2058824  0.8997015  0.7985788
##   0.4117647  0.8997015  0.7985788
##   0.6176471  0.8997015  0.7985788
##   0.8235294  0.6859585  0.3808883
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.6176471.

We obtain testing predictions and performance, which we will use to compare against the original model and the oversampled model.

RUS_preds <- predict(object = DTFit_RUS, 
                      newdata = test)

RUS_CM <- confusionMatrix(data = RUS_preds, 
                reference = test$median_val,
                positive = "Above",
                mode = "everything")

Random Oversampling

We use the upSample() function from the caret package to perform random oversampling on the training data to balance the target variable. The output of the function is a copy of the original dataframe, in which the number of observations in the minority class is increased to be equal to the number of observations in the majority class. To produce a reproducible sample, we first initialize a random seed.

set.seed(831) # initialize random seed

train_us <- upSample(x = train[ ,preds], # predictors
                     y = train$median_val, # target
                     yname = "median_val") # name of y variable

We can compare the distribution of the target variable, median_val, before and after random oversampling.

par(mfrow = c(1,2)) # set plot window to 1 row, 2 columns

plot(train$median_val, main = "Original")

plot(train_us$median_val, main = "ROS")

par(mfrow = c(1,1)) # reset plot window to 1 row, 1 column

Next, we can use our oversampled training data (train_us) to train another DT model, using all of the same arguments as in the baseline and undersampled models.

set.seed(831) # initialize random seed

DTFit_ROS <- train(x = train_us[,preds],
                   y = train_us$median_val,
                   method = "rpart", # use the rpart package
                   trControl = ctrl, # control object
                   tuneLength = 5) # try 5 cp values

We can view the average Accuracy and Kappa values across our cp values and identify the optimal value.

DTFit_ROS
## CART 
## 
## 674 samples
##  12 predictor
##   2 classes: 'Above', 'Below' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 538, 540, 539, 539, 540, 540, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.01335312  0.9481093  0.8962217
##   0.02373887  0.9342670  0.8685379
##   0.02670623  0.9313003  0.8626031
##   0.06231454  0.8966394  0.7932996
##   0.80415430  0.7522396  0.5054890
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01335312.

We obtain testing predictions and performance, which we will use to compare against the original model and the undersampled model.

ROS_preds <- predict(object = DTFit_ROS, 
                     newdata = test)

ROS_CM <- confusionMatrix(data = ROS_preds, 
                reference = test$median_val,
                positive = "Above",
                mode = "everything")

Comparing the Models

Finally, we can compare the 3 models.

Training Performance

We can use the training information from our cross-validation to compare performance. First, we use the resamples() function to combine the resampling results from our three models; loading the caretEnsemble package allows the train objects to be combined with c().

ci_res <- resamples(c(CI = DTFit,
                      Down = DTFit_RUS,
                      Up = DTFit_ROS))

Then, we can view summary information for our Accuracy and Kappa across training using the summary() function.

summary(ci_res)
## 
## Call:
## summary.resamples(object = ci_res)
## 
## Models: CI.rpart1, Down.rpart2, Up.rpart3 
## Number of resamples: 15 
## 
## Accuracy 
##                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## CI.rpart1   0.8765432 0.9375000 0.9506173 0.9424029 0.9625000 0.9634146    0
## Down.rpart2 0.8148148 0.8544974 0.8928571 0.8997015 0.9272487 1.0000000    0
## Up.rpart3   0.8897059 0.9338235 0.9481481 0.9481093 0.9701493 0.9851852    0
## 
## Kappa 
##                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## CI.rpart1   0.5913219 0.7549343 0.8169492 0.7916790 0.8584633 0.8667032    0
## Down.rpart2 0.6239554 0.7079716 0.7857143 0.7985788 0.8548009 1.0000000    0
## Up.rpart3   0.7794118 0.8676471 0.8962564 0.8962217 0.9402985 0.9703622    0
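
We can also visualize these resampling distributions. For example, caret provides lattice-based plots for resamples objects; a sketch using a box-and-whisker plot of Kappa:

bwplot(ci_res, metric = "Kappa") # distribution of Kappa across the 15 resamples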

As shown, our oversampled model has higher average Accuracy and Kappa than the original model during cross-validation. Next, we can evaluate the testing performance by comparing the confusionMatrix output from our models at both the overall and class level.

First, we can compare the overall performance. We use cbind() to combine the overall list component for each of the objects.

cbind(Base = base_CM$overall,
      Under = RUS_CM$overall,
      Over = ROS_CM$overall)
##                      Base      Under       Over
## Accuracy       0.91000000 0.91000000 0.89000000
## Kappa          0.65648855 0.68879668 0.65233881
## AccuracyLower  0.83601774 0.83601774 0.81169887
## AccuracyUpper  0.95801640 0.95801640 0.94379298
## AccuracyNull   0.84000000 0.84000000 0.84000000
## AccuracyPValue 0.03155943 0.03155943 0.10613832
## McnemarPValue  1.00000000 0.50498508 0.07044043

As shown, the undersampled model has the strongest performance for Kappa and the same Accuracy as the original model. To better assess how well each model predicts the class of interest, we need to evaluate the class-level performance.

cbind(Base = base_CM$byClass,
      Under = RUS_CM$byClass,
      Over = ROS_CM$byClass)
##                           Base     Under      Over
## Sensitivity          0.6875000 0.8125000 0.8750000
## Specificity          0.9523810 0.9285714 0.8928571
## Pos Pred Value       0.7333333 0.6842105 0.6086957
## Neg Pred Value       0.9411765 0.9629630 0.9740260
## Precision            0.7333333 0.6842105 0.6086957
## Recall               0.6875000 0.8125000 0.8750000
## F1                   0.7096774 0.7428571 0.7179487
## Prevalence           0.1600000 0.1600000 0.1600000
## Detection Rate       0.1100000 0.1300000 0.1400000
## Detection Prevalence 0.1500000 0.1900000 0.2300000
## Balanced Accuracy    0.8199405 0.8705357 0.8839286

Since the class level of interest to us is the minority class, Above, it is important that we are correctly predicting it, meaning we want high Sensitivity/Recall and F1. As shown, the resampled models show improvement over the baseline model, meaning that we are better able to predict our class of interest using the resampled models. The F1 is highest in the undersampled model and the Sensitivity/Recall is highest in the oversampled model. Since F1 balances Precision and Recall, we may prefer the undersampled model. However, since the undersampled training data is so small (136 observations), the model may not generalize well, and we would likely prefer the oversampled model in this case.

Feature Selection

Filter-Based Methods

One basic filter method of feature selection is to identify and remove variables that have near zero variance. We can use the nearZeroVar() function from the caret package. We set names = TRUE to return a vector of any near zero variance predictors, so that we can remove them from consideration.

nearZeroVar(x = train[,preds], 
            names = TRUE)
## character(0)

As shown, we do not have any variables to remove based on this feature selection criterion.

Other filter selection methods are based on the relationships between our individual predictors and the target variable.

Numeric Predictors

First, we can view grouped box plots (plot = "box") of our numeric variables, grouped by the target variable. We use the scale() function so that variables on different scales can be compared within a single plot.

featurePlot(x = scale(train[,num]), # scale for visualization
            y = train$median_val,
            plot = "box")

For numeric variables, we can test for differences in the mean values across the class levels of the target variable. We can use the anovaScores() function in the caret package to get p-values. If we have a small p-value, we can conclude that there is a statistically significant difference in the means, and these predictors may be good at differentiating between our classes. We use the sapply() function to apply the anovaScores() function to each of our numerical predictor variables (num) and return a vector of the p-values.

AScores <- sapply(X = train[,num],
             FUN = anovaScores,
             y = train$median_val)

Next, we sort the p-values in increasing order, so that the variables with the most significant p-values are first.

sort(AScores, decreasing = FALSE)
##           rm      ptratio           zn        indus          tax          nox 
## 4.300871e-50 1.109710e-22 6.027870e-14 1.777134e-13 4.515601e-09 5.399012e-06 
##          age            b         crim          dis 
## 1.948073e-04 1.557972e-03 1.766358e-03 4.359388e-02
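
If we wanted to use these scores as a filter, we could keep only the numeric predictors below a chosen significance level. A sketch, assuming a 0.05 cutoff (we continue to use the full preds vector in the remainder of this lesson):

num_sig <- names(AScores)[AScores < 0.05] # hypothetical filtered set of numeric predictors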

Categorical Predictors

For categorical predictor variables, we can use a Chi-Square Test.

chs <- lapply(train[ ,fac],
              FUN = chisq.test, # chi square test
              y = train$median_val)

Again, we can isolate the p-values and sort them in increasing order. As shown, chas can be removed, as the relationship between the variables is not statistically significant.

sort(sapply(X = chs,
            FUN = function(x) x$p.value),
     decreasing = FALSE)
##          rad         chas 
## 3.416338e-09 7.877104e-02

Model-Based Methods

For models that provide information about variable importance (DT, RF), we can use the varImp() function from the caret package. The varImp() function provides scaled importance measures (most important = 100).

imp <- varImp(DTFit_ROS) # using the oversampled model

We can use the plot() function on the varImp object. If you have a lot of variables, you can include the top argument to limit the number of top variables shown (e.g., top = 10).

plot(imp)

As shown, our most important variable is rm, followed by ptratio, indus, tax, and zn. Our least important variables are b and rad. We can remove these unimportant variables from our models and consider removing other low-importance variables from our list of potential predictors.
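
For illustration only, a reduced predictor vector dropping the least important variables might be built as follows (a hypothetical sketch; the remainder of this lesson continues to use the full preds vector):

preds_imp <- setdiff(preds, c("b", "rad")) # hypothetical reduced predictor set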

Wrapper Methods

One wrapper method that can be used for feature selection is Recursive Feature Elimination (RFE). We use the rfe() function from the caret package. First, we set up an rfeControl object. We set functions = rfFuncs, which means that random forest models will be used during RFE. We will perform 5-fold cross-validation.

RFE_ctrl <- rfeControl(functions = rfFuncs, # RF-based
                       method = "cv", # k-fold CV
                       number = 5) # 5 = k folds

Next, we can set our seed and perform RFE. We use the rfe() function on our training data (train). We will consider having a minimum of 4 variables, and will obtain performance for models including between 4 and the total number of predictors in the preds vector (sizes argument).

set.seed(831) # initialize random seed

# The sizes argument is used to identify the 
# number of predictors to consider retaining.
rfe_mod <- rfe(x = train[ ,preds], # predictor variables
               y = train$median_val, # target variable
               sizes = 4:length(preds), # minimum of 4 variables
               rfeControl = RFE_ctrl) # control object

We can view the output of performing RFE, including the optimal number of predictor variables and the corresponding Accuracy and Kappa values.

rfe_mod
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (5 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##          4   0.9506 0.8127   0.008573 0.03029         
##          5   0.9507 0.8179   0.019306 0.06997         
##          6   0.9483 0.8105   0.021765 0.07676         
##          7   0.9458 0.7997   0.018382 0.06328         
##          8   0.9507 0.8182   0.019139 0.06742         
##          9   0.9458 0.7997   0.018382 0.06328         
##         10   0.9531 0.8257   0.018200 0.06656         
##         11   0.9581 0.8417   0.013905 0.05280        *
##         12   0.9556 0.8339   0.018605 0.06774         
## 
## The top 5 variables (out of 11):
##    rm, ptratio, indus, tax, nox

Finally, we can save and view a list of our optimal predictors, which is available in the optVariables component of the output from running the rfe() function (rfe_mod). Then, instead of using the preds vector as input to other models, we can use this new vector of selected features, preds_fs.

preds_fs <- rfe_mod$optVariables
preds_fs
##  [1] "rm"      "ptratio" "indus"   "tax"     "nox"     "zn"      "crim"   
##  [8] "dis"     "rad"     "chas"    "b"

As shown, the age variable was not selected, and is omitted from the preds_fs vector.
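
For example, we could re-train the baseline Decision Tree using only the selected features (a sketch reusing the ctrl object defined earlier; DTFit_fs is a hypothetical name):

set.seed(831) # initialize random seed

DTFit_fs <- train(x = train[ ,preds_fs], # selected predictors only
                  y = train$median_val, # target
                  method = "rpart", # use the rpart package
                  trControl = ctrl, # control object
                  tuneLength = 5) # try 5 cp values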

Feature Selection can be useful in the case of irrelevant variables and in the event that a model is overfitting. While some models are robust to irrelevant variables, reducing the amount of noise in the model can greatly improve the performance of other models.