We will now use one new package, the caretEnsemble package, to more easily compare our trained models. If you do not already have the caretEnsemble package installed, you will first install it using the install.packages() function.
install.packages("caretEnsemble")
In this lesson, we will use the DescTools, caret, and caretEnsemble packages. Next, we load these packages for use in the session.
library(DescTools)
library(caret)
library(caretEnsemble)
In the lesson that follows we will use the BostonHousing.csv file. The famous Boston Housing dataset contains data about different census tracts in Boston and their median home values. The variable of interest is median_val, which indicates whether the median value of occupied homes in the tract is greater than (Above) or less than (Below) $30,000. The census bureau wants to create a predictive model to predict the median_val variable for new census tracts.
The variables are listed in the output of the Abstract() function below.
We use the read.csv() function to import the CSV file into R as a dataframe named BH. We set stringsAsFactors = FALSE to keep any character columns as-is.
BH <- read.csv(file = "BostonHousing.csv",
stringsAsFactors = FALSE)
First, we can obtain high-level information about the BH dataframe to look at the variable types and to check for missing (NA) values.
Abstract(BH)
## ------------------------------------------------------------------------------
## BH
##
## data frame: 505 obs. of 13 variables
## 505 complete cases (100.0%)
##
## Nr ColName Class NAs Levels
## 1 crim numeric .
## 2 zn numeric .
## 3 indus numeric .
## 4 chas integer .
## 5 nox numeric .
## 6 rm numeric .
## 7 age numeric .
## 8 dis numeric .
## 9 rad integer .
## 10 tax integer .
## 11 ptratio numeric .
## 12 b numeric .
## 13 median_val character .
Next, we can convert our target class variable that we want to predict, median_val, to a nominal factor variable.
BH$median_val <- factor(x = BH$median_val)
We can plot the distribution of our output (Y) variable using a barplot, which is the default plot for the plot() function when plotting factor variables. As shown, there are more Below median census tracts than Above.
plot(BH$median_val,
main = "Median Value")
All of our potential predictors are numeric, but based on the data description, chas is categorical (binary) and rad could either be treated as categorical or numeric. We will convert chas to a nominal variable and rad to an ordinal variable.
BH$chas <- factor(x = BH$chas)
BH$rad <- factor(x = BH$rad,
ordered = TRUE)
We will set up convenience vectors based on whether each variable is categorical (fac; ordinal or nominal) or numeric (num) and combine these vectors to create a vector of our predictor variables (preds).
fac <- c("chas", "rad") # categorical
num <- names(BH)[!names(BH) %in% c(fac, "median_val")] # numeric
preds <- c(fac, num)
Finally, we can get summary statistic information for our prepared data.
summary(BH)
## crim zn indus chas nox
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 0:471 Min. :0.3850
## 1st Qu.: 0.08199 1st Qu.: 0.00 1st Qu.: 5.19 1: 34 1st Qu.:0.4490
## Median : 0.25387 Median : 0.00 Median : 9.69 Median :0.5380
## Mean : 3.61380 Mean : 11.39 Mean :11.12 Mean :0.5544
## 3rd Qu.: 3.67822 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.6240
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :0.8710
##
## rm age dis rad
## Min. :3.561 Min. : 2.90 Min. : 1.130 24 :131
## 1st Qu.:5.885 1st Qu.: 45.00 1st Qu.: 2.101 5 :115
## Median :6.208 Median : 77.30 Median : 3.216 4 :110
## Mean :6.280 Mean : 68.55 Mean : 3.799 3 : 38
## 3rd Qu.:6.619 3rd Qu.: 94.10 3rd Qu.: 5.212 6 : 26
## Max. :8.725 Max. :100.00 Max. :12.127 2 : 24
## (Other): 61
## tax ptratio b median_val
## Min. :187.0 Min. :12.60 Min. : 0.32 Above: 84
## 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.52 Below:421
## Median :330.0 Median :19.00 Median :391.45
## Mean :407.7 Mean :18.45 Mean :356.68
## 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :711.0 Max. :22.00 Max. :396.90
##
The models that we will create will be Decision Tree models, so we only need to handle missing values (either before or during analysis). If missing values were present, we could remove the affected rows (e.g., using na.omit()) or perform imputation. We can check for missing values visually using the PlotMiss() function from the DescTools package.
PlotMiss(x = BH)
We do not have any missing values.
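Had any been present, a minimal sketch of the row-deletion option mentioned above (as opposed to imputation) would be:
# Hypothetical: keep only complete cases (only needed if PlotMiss() showed missing values)
BH <- na.omit(BH)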
We use the createDataPartition() function from the caret package to identify the row numbers that we will include in our training set. Then, all other rows will be put in our testing set. We split the data using an 80/20 split (80% in training and 20% in testing). By using createDataPartition() we preserve the distribution of our outcome (Y) variable (median_val). Since the function takes a random sample, we initialize a random seed first for reproducibility.
set.seed(831)
sub <- createDataPartition(y = BH$median_val, # target variable
p = 0.80, # % in training
list = FALSE)
Next, we subset the rows of the BH dataframe to include the row numbers in the sub object to create the train dataframe. We use all observations not in the sub object to create the test dataframe.
train <- BH[sub, ]
test <- BH[-sub, ]
First, we can evaluate if class imbalance is present in our target (Y) variable, median_val.
Desc(train$median_val)
## ------------------------------------------------------------------------------
## train$median_val (factor - dichotomous)
##
## length n NAs unique
## 405 405 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## Above 68 16.8% 13.5% 20.7%
## Below 337 83.2% 79.3% 86.5%
##
## ' 95%-CI (Wilson)
As shown, we have a significant difference in the frequency of our majority class (Below) and our minority class (Above).
Before we use resampling methods to address the class imbalance, we will train a Decision Tree classification model as our baseline model for comparison.
We use the train() function from the caret package to tune a Decision Tree model using repeated 5-Fold Cross Validation, and search for our optimal cp value across the default grid of 5 cp values.
First, we set up our control object as input to the trControl argument in the train() function.
ctrl <- trainControl(method = "repeatedcv",
number = 5, # k = 5 folds
repeats = 3) # repeated 3 times
Next, we initialize our random seed and use the train() function to tune our DT model, specifying method = "rpart".
set.seed(831) # initialize random seed
DTFit <- train(x = train[ ,preds], # predictors
y = train$median_val, # target
method = "rpart", # use the rpart package
trControl = ctrl, # control object
tuneLength = 5) # try 5 cp values
We can view the average Accuracy and Kappa values across our cp values and identify the optimal value.
DTFit
## CART
##
## 405 samples
## 12 predictor
## 2 classes: 'Above', 'Below'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 323, 324, 325, 324, 324, 325, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.0000000 0.9424029 0.7916790
## 0.1580882 0.9251578 0.7203295
## 0.3161765 0.9251578 0.7203295
## 0.4742647 0.9251578 0.7203295
## 0.6323529 0.9023033 0.5625388
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.
We will obtain testing predictions and performance, which we will use to compare the original model against resampled models, which will aim to correct for the class imbalance present in the target variable.
base_preds <- predict(object = DTFit,
newdata = test)
base_CM <- confusionMatrix(data = base_preds, # predictions
reference = test$median_val, # actual
positive = "Above",
mode = "everything")
We can use the downSample() function from the caret package to randomly undersample the training data to balance the target variable. The output of the function is a copy of the original dataframe, in which the number of observations in the majority class is reduced to be equal to the number of observations in the minority class. To produce a reproducible sample, we first initialize a random seed.
set.seed(831) # initialize random seed
train_ds <- downSample(x = train[ ,preds], # predictors
y = train$median_val, # target
yname = "median_val") # name of y variable
We can compare the distribution of the target variable, median_val, before and after random undersampling.
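For instance, a sketch of a side-by-side barplot comparison (mirroring the plot we use for oversampling below) could look like this:
par(mfrow = c(1,2)) # plot window with 1 row, 2 columns
plot(train$median_val, main = "Original")
plot(train_ds$median_val, main = "RUS")
par(mfrow = c(1,1)) # reset plot window to 1 row, 1 column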
Next, we can use our undersampled training data (train_ds) to train another DT model, using all of the same arguments as in the baseline model.
set.seed(831) # initialize random seed
DTFit_RUS <- train(x = train_ds[,preds], # predictors
y = train_ds$median_val, # target
method = "rpart", # use the rpart package
trControl = ctrl, # control object
tuneLength = 5) # try 5 cp values
We can view the average Accuracy and Kappa values across our cp values and identify the optimal value.
DTFit_RUS
## CART
##
## 136 samples
## 12 predictor
## 2 classes: 'Above', 'Below'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 108, 109, 109, 109, 109, 110, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.0000000 0.8900895 0.7797734
## 0.2058824 0.8997015 0.7985788
## 0.4117647 0.8997015 0.7985788
## 0.6176471 0.8997015 0.7985788
## 0.8235294 0.6859585 0.3808883
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.6176471.
We obtain testing predictions and performance, which we will use to compare against the original model and the oversampled model.
RUS_preds <- predict(object = DTFit_RUS,
newdata = test)
RUS_CM <- confusionMatrix(data = RUS_preds,
reference = test$median_val,
positive = "Above",
mode = "everything")
We use the upSample() function in the caret package to perform random oversampling on the training data to balance the target variable. The output of the function is a copy of the original dataframe, in which the number of observations in the minority class is increased to be equal to the number of observations in the majority class. To produce a reproducible sample, we first initialize a random seed.
set.seed(831) # initialize random seed
train_us <- upSample(x = train[ ,preds], # predictors
y = train$median_val, # target
yname = "median_val") # name of y variable
We can compare the distribution of the target variable, median_val, before and after random oversampling.
par(mfrow = c(1,2)) # plot window with 1 row, 2 columns
plot(train$median_val, main = "Original")
plot(train_us$median_val, main = "ROS")
par(mfrow = c(1,1)) # reset plot window to 1 row, 1 column
Next, we can use our oversampled training data (train_us) to train another DT model, using all of the same arguments as in the baseline and undersampled models.
set.seed(831) # initialize random seed
DTFit_ROS <- train(x = train_us[,preds],
y = train_us$median_val,
method = "rpart", # use the rpart package
trControl = ctrl, # control object
tuneLength = 5) # try 5 cp values
We can view the average Accuracy and Kappa values across our cp values and identify the optimal value.
DTFit_ROS
## CART
##
## 674 samples
## 12 predictor
## 2 classes: 'Above', 'Below'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 538, 540, 539, 539, 540, 540, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01335312 0.9481093 0.8962217
## 0.02373887 0.9342670 0.8685379
## 0.02670623 0.9313003 0.8626031
## 0.06231454 0.8966394 0.7932996
## 0.80415430 0.7522396 0.5054890
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01335312.
We obtain testing predictions and performance, which we will use to compare against the original model and the undersampled model.
ROS_preds <- predict(object = DTFit_ROS,
newdata = test)
ROS_CM <- confusionMatrix(data = ROS_preds,
reference = test$median_val,
positive = "Above",
mode = "everything")
Finally, we can compare the 3 models.
We can use the training information from our cross-validation to compare performance. First, we combine our three fitted models with c() (which the caretEnsemble package supports for train objects) and pass them to the resamples() function from the caret package to collect the resampling results.
ci_res <- resamples(c(CI = DTFit,
Down = DTFit_RUS,
Up = DTFit_ROS))
Then, we can view summary information for our Accuracy and Kappa across training using the summary() function.
summary(ci_res)
##
## Call:
## summary.resamples(object = ci_res)
##
## Models: CI.rpart1, Down.rpart2, Up.rpart3
## Number of resamples: 15
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CI.rpart1 0.8765432 0.9375000 0.9506173 0.9424029 0.9625000 0.9634146 0
## Down.rpart2 0.8148148 0.8544974 0.8928571 0.8997015 0.9272487 1.0000000 0
## Up.rpart3 0.8897059 0.9338235 0.9481481 0.9481093 0.9701493 0.9851852 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CI.rpart1 0.5913219 0.7549343 0.8169492 0.7916790 0.8584633 0.8667032 0
## Down.rpart2 0.6239554 0.7079716 0.7857143 0.7985788 0.8548009 1.0000000 0
## Up.rpart3 0.7794118 0.8676471 0.8962564 0.8962217 0.9402985 0.9703622 0
As shown, our oversampled model has higher Accuracy and Kappa than the original model. Next, we can evaluate the testing performance by comparing the confusionMatrix output from our models at the model and class-level.
First, we can compare the overall performance. We use cbind() to combine the overall list component for each of the objects.
cbind(Base = base_CM$overall,
Under = RUS_CM$overall,
Over = ROS_CM$overall)
## Base Under Over
## Accuracy 0.91000000 0.91000000 0.89000000
## Kappa 0.65648855 0.68879668 0.65233881
## AccuracyLower 0.83601774 0.83601774 0.81169887
## AccuracyUpper 0.95801640 0.95801640 0.94379298
## AccuracyNull 0.84000000 0.84000000 0.84000000
## AccuracyPValue 0.03155943 0.03155943 0.10613832
## McnemarPValue 1.00000000 0.50498508 0.07044043
As shown, the undersampled model has the strongest performance for Kappa and the same Accuracy as in the original model. To assess how well our models do, we need to evaluate the class-level performance.
cbind(Base = base_CM$byClass,
Under = RUS_CM$byClass,
Over = ROS_CM$byClass)
## Base Under Over
## Sensitivity 0.6875000 0.8125000 0.8750000
## Specificity 0.9523810 0.9285714 0.8928571
## Pos Pred Value 0.7333333 0.6842105 0.6086957
## Neg Pred Value 0.9411765 0.9629630 0.9740260
## Precision 0.7333333 0.6842105 0.6086957
## Recall 0.6875000 0.8125000 0.8750000
## F1 0.7096774 0.7428571 0.7179487
## Prevalence 0.1600000 0.1600000 0.1600000
## Detection Rate 0.1100000 0.1300000 0.1400000
## Detection Prevalence 0.1500000 0.1900000 0.2300000
## Balanced Accuracy 0.8199405 0.8705357 0.8839286
Since the class level of interest to us is the minority class, Above, it is important that we correctly predict it, i.e., that we have high Sensitivity/Recall and F1. As shown, the resampled models show improvement over the baseline model, meaning that we are better able to predict our class of interest using the resampled models. The F1 is highest in the undersampled model and the Sensitivity/Recall is highest in the oversampled model. Since F1 balances Precision and Recall, we may prefer the undersampled model. However, since the undersampled training data is so small (136 observations), the model may not generalize well, and we would likely prefer the oversampled model in this case.
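As a quick check of how F1 balances the two, we can recompute it from the Precision and Recall of the undersampled model (this is simply the standard F1 formula applied to the values shown above):
# F1 is the harmonic mean of Precision and Recall
prec <- RUS_CM$byClass["Precision"] # 0.684
rec <- RUS_CM$byClass["Recall"]     # 0.812
2 * prec * rec / (prec + rec)       # ~0.743, matching the F1 row above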
One basic filter method of feature selection is to identify and remove variables that have near zero variance. We can use the nearZeroVar() function from the caret package. We set names = TRUE to return a vector of any near zero variance predictors, so that we can remove them from consideration.
nearZeroVar(x = train[,preds],
names = TRUE)
## character(0)
As shown, we do not have any variables to remove based on this feature selection criterion.
Other filter selection methods are based on the relationships between our individual predictors and the target variable.
First, we can view grouped box plots (plot = "box") of our numeric variables, grouped by the target variable. We use the scale() function to better understand the distributions using a single plot.
featurePlot(x = scale(train[,num]), # scale for visualization
y = train$median_val,
plot = "box")
For numeric variables, we can test for differences in the mean values across the class levels of the target variable. We can use the anovaScores() function in the caret package to get p-values. If we have a small p-value, we can conclude that there is a statistically significant difference in the means, and these predictors may be good at differentiating between our classes. We use the sapply() function to apply the anovaScores() function to each of our numerical predictor variables (num) and return a vector of the p-values.
AScores <- sapply(X = train[,num],
FUN = anovaScores,
y = train$median_val)
Next, we sort the p-values in increasing order, so that the variables with the most significant p-values are first.
sort(AScores, decreasing = FALSE)
## rm ptratio zn indus tax nox
## 4.300871e-50 1.109710e-22 6.027870e-14 1.777134e-13 4.515601e-09 5.399012e-06
## age b crim dis
## 1.948073e-04 1.557972e-03 1.766358e-03 4.359388e-02
For categorical predictor variables, we can use a Chi-Square Test.
chs <- lapply(train[ ,fac],
FUN = chisq.test, # chi square test
y = train$median_val)
Again, we can isolate the p-values and sort them in increasing order. As shown, chas can be removed, as its relationship with the target variable is not statistically significant.
sort(sapply(X = chs,
FUN = function(x) x$p.value),
decreasing = FALSE)
## rad chas
## 3.416338e-09 7.877104e-02
For models that provide information about variable importance (DT, RF), we can use the varImp() function from the caret package. The varImp() function provides scaled importance measures (most important = 100).
imp <- varImp(DTFit_ROS) # using the oversampled model
We can use the plot() function on the varImp object. If you have a lot of variables, you can include the top argument to limit the number of top variables shown (e.g., top = 10).
plot(imp)
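For example, to limit the plot to the ten most important predictors:
plot(imp, top = 10) # show only the 10 most important variables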
As shown, our most important variable is rm, followed by ptratio, indus, tax, and zn. Our least important variables are b and rad. We can remove these unimportant variables from our models, and we can consider removing other low-importance variables from our list of potential predictors.
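For example, a minimal sketch of dropping the least important variables from our predictor vector (the name preds_imp is our own) might be:
# Hypothetical: remove the least important variables identified by varImp()
preds_imp <- setdiff(preds, c("b", "rad"))
preds_imp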
One wrapper method that can be used for feature selection is Recursive Feature Elimination (RFE). We use the rfe() function from the caret package. First, we set up an rfeControl object. We set functions = rfFuncs, which means that we will be using random forest models for the RFE model. We will perform 5-fold cross validation.
RFE_ctrl <- rfeControl(functions = rfFuncs, # RF-based
method = "cv", # k-fold CV
number = 5) # 5 = k folds
Next, we can set our seed and perform RFE. We use the rfe() function on our training data (train). We will consider having a minimum of 4 variables, and will obtain performance for models including between 4 and the total number of predictors in the preds vector (sizes argument).
set.seed(831) # initialize random seed
# The sizes argument is used to identify the
# number of predictors to consider retaining.
rfe_mod <- rfe(x = train[ ,preds], # predictor variables
y = train$median_val, # target variable
sizes = 4:length(preds), # minimum of 4 variables
rfeControl = RFE_ctrl) # control object
We can view the output of performing RFE, including the optimal number of predictor variables and the corresponding Accuracy and Kappa values.
rfe_mod
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (5 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 4 0.9506 0.8127 0.008573 0.03029
## 5 0.9507 0.8179 0.019306 0.06997
## 6 0.9483 0.8105 0.021765 0.07676
## 7 0.9458 0.7997 0.018382 0.06328
## 8 0.9507 0.8182 0.019139 0.06742
## 9 0.9458 0.7997 0.018382 0.06328
## 10 0.9531 0.8257 0.018200 0.06656
## 11 0.9581 0.8417 0.013905 0.05280 *
## 12 0.9556 0.8339 0.018605 0.06774
##
## The top 5 variables (out of 11):
## rm, ptratio, indus, tax, nox
Finally, we can save and view a list of our optimal predictors, which is available in the optVariables list component of the output from running the rfe() function (rfe_mod). Then, instead of using the preds vector as input to other models, we can use this new vector containing our selected features, preds_fs.
preds_fs <- rfe_mod$optVariables
preds_fs
## [1] "rm" "ptratio" "indus" "tax" "nox" "zn" "crim"
## [8] "dis" "rad" "chas" "b"
As shown, the age variable was not selected, and is omitted from the preds_fs vector.
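For example, a sketch of retuning the Decision Tree using only the selected features (the object name DTFit_FS is our own; the other arguments mirror the baseline model) could look like this:
set.seed(831) # initialize random seed
DTFit_FS <- train(x = train[ ,preds_fs], # selected predictors only
                  y = train$median_val,  # target
                  method = "rpart",      # use the rpart package
                  trControl = ctrl,      # control object
                  tuneLength = 5)        # try 5 cp values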
Feature Selection can be useful in the case of irrelevant variables and in the event that a model is overfitting. While some models are robust to irrelevant variables, reducing the amount of noise in the model can greatly improve the performance of other models.