This material is mostly based on Max Kuhn’s tutorial for the caret package, as well as the course he made together with Zachary Deane-Mayer for DataCamp.

Cross-validation for model performance estimation

model <- train(
  y ~ ., my_data,
  method = "lm",
  trControl = trainControl(
    method = "cv", number = 10,
    verboseIter = TRUE
  )
)

Example 1

# Fit lm model using 10-fold CV: model

library(caret)

model <- train(
  price ~ ., diamonds,
  method = "lm",
  trControl = trainControl(
    method = "cv", 
    number = 10,
    verboseIter = TRUE
  )
)
## + Fold01: intercept=TRUE 
## - Fold01: intercept=TRUE 
## + Fold02: intercept=TRUE 
## - Fold02: intercept=TRUE 
## + Fold03: intercept=TRUE 
## - Fold03: intercept=TRUE 
## + Fold04: intercept=TRUE 
## - Fold04: intercept=TRUE 
## + Fold05: intercept=TRUE 
## - Fold05: intercept=TRUE 
## + Fold06: intercept=TRUE 
## - Fold06: intercept=TRUE 
## + Fold07: intercept=TRUE 
## - Fold07: intercept=TRUE 
## + Fold08: intercept=TRUE 
## - Fold08: intercept=TRUE 
## + Fold09: intercept=TRUE 
## - Fold09: intercept=TRUE 
## + Fold10: intercept=TRUE 
## - Fold10: intercept=TRUE 
## Aggregating results
## Fitting final model on full training set
# Print model to console
model
## Linear Regression 
## 
## 53940 samples
##     9 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 48546, 48547, 48545, 48546, 48546, 48546, ... 
## Resampling results:
## 
##   RMSE     Rsquared 
##   1131.04  0.9195964
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## 
  • It is possible to perform more than one iteration of cross-validation.
  • Repeated cross-validation provides a better estimate of the test-set error.
  • Moreover, the whole CV procedure can be repeated. This takes longer but gives many more out-of-sample datasets to check and hence enables a more precise assessment of model performance.
  • One of the awesome things about the train() function in caret is how easy it is to run very different models or methods of cross-validation just by tweaking a few simple arguments to the function call.
  • For example, you could repeat your entire cross-validation procedure 5 times for greater confidence in your estimates of the model’s out-of-sample accuracy, e.g.:
trControl = trainControl(
  method = "cv", number = 5,
  repeats = 5 # verboseIter = TRUE
)
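
The fold-by-fold metrics behind these summaries are stored in the fitted object itself, so you can inspect the spread directly. A quick sketch, assuming model is the fitted object from Example 1 above:

# Per-resample metrics live in model$resample; their spread shows why more
# resamples yield a more precise estimate of out-of-sample performance
head(model$resample)    # RMSE, Rsquared, ... one row per fold
sd(model$resample$RMSE) # fold-to-fold variability of the RMSE estimate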

Example 2

  • Let’s compare estimates of the model accuracy using “lm” on the “Boston” data set.
  • First we’ll do 5-fold cross-validation and then repeat the whole procedure, but this time we’ll do five rounds of 5-fold cross-validation:
library(MASS)

set.seed(333)

# Fit lm model using 5-fold CV: model
model <- train(
  medv ~ ., Boston,
  method = "lm",
  trControl = trainControl(
    method = "cv", number = 5
    # verboseIter = TRUE
  )
)

# Print model to console
model
## Linear Regression 
## 
## 506 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 404, 404, 404, 406, 406 
## Resampling results:
## 
##   RMSE      Rsquared 
##   4.945624  0.7192787
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## 
# Fit lm model using 5 x 5-fold CV: model
model <- train(
  medv ~ ., Boston,
  method = "lm",
  trControl = trainControl(
    method = "cv", number = 5,
    repeats = 5 # verboseIter = TRUE
  )
)

# Print model to console
model
## Linear Regression 
## 
## 506 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 404, 405, 405, 405, 405, ... 
## Resampling results:
## 
##   RMSE      Rsquared 
##   4.911869  0.7177594
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## 

Further customisation of the trainControl function - using AUC (instead of accuracy) for model tuning

Example 1

library(mlbench) # This loads a collection of artificial and real-world machine learning benchmark problems, including, e.g., several data sets from the UCI repository.
library(caTools)
data("Sonar")
str(Sonar)
## 'data.frame':    208 obs. of  61 variables:
##  $ V1   : num  0.02 0.0453 0.0262 0.01 0.0762 0.0286 0.0317 0.0519 0.0223 0.0164 ...
##  $ V2   : num  0.0371 0.0523 0.0582 0.0171 0.0666 0.0453 0.0956 0.0548 0.0375 0.0173 ...
##  $ V3   : num  0.0428 0.0843 0.1099 0.0623 0.0481 ...
##  $ V4   : num  0.0207 0.0689 0.1083 0.0205 0.0394 ...
##  $ V5   : num  0.0954 0.1183 0.0974 0.0205 0.059 ...
##  $ V6   : num  0.0986 0.2583 0.228 0.0368 0.0649 ...
##  $ V7   : num  0.154 0.216 0.243 0.11 0.121 ...
##  $ V8   : num  0.16 0.348 0.377 0.128 0.247 ...
##  $ V9   : num  0.3109 0.3337 0.5598 0.0598 0.3564 ...
##  $ V10  : num  0.211 0.287 0.619 0.126 0.446 ...
##  $ V11  : num  0.1609 0.4918 0.6333 0.0881 0.4152 ...
##  $ V12  : num  0.158 0.655 0.706 0.199 0.395 ...
##  $ V13  : num  0.2238 0.6919 0.5544 0.0184 0.4256 ...
##  $ V14  : num  0.0645 0.7797 0.532 0.2261 0.4135 ...
##  $ V15  : num  0.066 0.746 0.648 0.173 0.453 ...
##  $ V16  : num  0.227 0.944 0.693 0.213 0.533 ...
##  $ V17  : num  0.31 1 0.6759 0.0693 0.7306 ...
##  $ V18  : num  0.3 0.887 0.755 0.228 0.619 ...
##  $ V19  : num  0.508 0.802 0.893 0.406 0.203 ...
##  $ V20  : num  0.48 0.782 0.862 0.397 0.464 ...
##  $ V21  : num  0.578 0.521 0.797 0.274 0.415 ...
##  $ V22  : num  0.507 0.405 0.674 0.369 0.429 ...
##  $ V23  : num  0.433 0.396 0.429 0.556 0.573 ...
##  $ V24  : num  0.555 0.391 0.365 0.485 0.54 ...
##  $ V25  : num  0.671 0.325 0.533 0.314 0.316 ...
##  $ V26  : num  0.641 0.32 0.241 0.533 0.229 ...
##  $ V27  : num  0.71 0.327 0.507 0.526 0.7 ...
##  $ V28  : num  0.808 0.277 0.853 0.252 1 ...
##  $ V29  : num  0.679 0.442 0.604 0.209 0.726 ...
##  $ V30  : num  0.386 0.203 0.851 0.356 0.472 ...
##  $ V31  : num  0.131 0.379 0.851 0.626 0.51 ...
##  $ V32  : num  0.26 0.295 0.504 0.734 0.546 ...
##  $ V33  : num  0.512 0.198 0.186 0.612 0.288 ...
##  $ V34  : num  0.7547 0.2341 0.2709 0.3497 0.0981 ...
##  $ V35  : num  0.854 0.131 0.423 0.395 0.195 ...
##  $ V36  : num  0.851 0.418 0.304 0.301 0.418 ...
##  $ V37  : num  0.669 0.384 0.612 0.541 0.46 ...
##  $ V38  : num  0.61 0.106 0.676 0.881 0.322 ...
##  $ V39  : num  0.494 0.184 0.537 0.986 0.283 ...
##  $ V40  : num  0.274 0.197 0.472 0.917 0.243 ...
##  $ V41  : num  0.051 0.167 0.465 0.612 0.198 ...
##  $ V42  : num  0.2834 0.0583 0.2587 0.5006 0.2444 ...
##  $ V43  : num  0.282 0.14 0.213 0.321 0.185 ...
##  $ V44  : num  0.4256 0.1628 0.2222 0.3202 0.0841 ...
##  $ V45  : num  0.2641 0.0621 0.2111 0.4295 0.0692 ...
##  $ V46  : num  0.1386 0.0203 0.0176 0.3654 0.0528 ...
##  $ V47  : num  0.1051 0.053 0.1348 0.2655 0.0357 ...
##  $ V48  : num  0.1343 0.0742 0.0744 0.1576 0.0085 ...
##  $ V49  : num  0.0383 0.0409 0.013 0.0681 0.023 0.0264 0.0507 0.0285 0.0777 0.0092 ...
##  $ V50  : num  0.0324 0.0061 0.0106 0.0294 0.0046 0.0081 0.0159 0.0178 0.0439 0.0198 ...
##  $ V51  : num  0.0232 0.0125 0.0033 0.0241 0.0156 0.0104 0.0195 0.0052 0.0061 0.0118 ...
##  $ V52  : num  0.0027 0.0084 0.0232 0.0121 0.0031 0.0045 0.0201 0.0081 0.0145 0.009 ...
##  $ V53  : num  0.0065 0.0089 0.0166 0.0036 0.0054 0.0014 0.0248 0.012 0.0128 0.0223 ...
##  $ V54  : num  0.0159 0.0048 0.0095 0.015 0.0105 0.0038 0.0131 0.0045 0.0145 0.0179 ...
##  $ V55  : num  0.0072 0.0094 0.018 0.0085 0.011 0.0013 0.007 0.0121 0.0058 0.0084 ...
##  $ V56  : num  0.0167 0.0191 0.0244 0.0073 0.0015 0.0089 0.0138 0.0097 0.0049 0.0068 ...
##  $ V57  : num  0.018 0.014 0.0316 0.005 0.0072 0.0057 0.0092 0.0085 0.0065 0.0032 ...
##  $ V58  : num  0.0084 0.0049 0.0164 0.0044 0.0048 0.0027 0.0143 0.0047 0.0093 0.0035 ...
##  $ V59  : num  0.009 0.0052 0.0095 0.004 0.0107 0.0051 0.0036 0.0048 0.0059 0.0056 ...
##  $ V60  : num  0.0032 0.0044 0.0078 0.0117 0.0094 0.0062 0.0103 0.0053 0.0022 0.004 ...
##  $ Class: Factor w/ 2 levels "M","R": 2 2 2 2 2 2 2 2 2 2 ...
set.seed(33)

# Create trainControl object: myControl
myControl <- trainControl(
  method = "repeatedCv",
  number = 5,
  repeats = 5,
  summaryFunction = twoClassSummary,
  classProbs = TRUE # IMPORTANT!
  # verboseIter = TRUE
)

# After creating a custom trainControl object, it's easy to fit caret models that use AUC rather than accuracy to tune and evaluate the model. You can just pass your custom trainControl object to the train() function via the trControl argument.

# Train glm with custom trainControl: model
model <- train(Class ~., Sonar, method = "glm", trControl = myControl)

# Print model to console
model
## Generalized Linear Model 
## 
## 208 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 166, 166, 166, 167, 167, 167, ... 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.7420305  0.7566798  0.6742105
## 
## 
# Let's make separate train and test sets
inTraining <- createDataPartition(Sonar$Class, p = .7, list = FALSE)

train <- Sonar[inTraining, ]
test <- Sonar[-inTraining, ]

# Train the model with the train set
model_1 <- train(Class ~ ., data = train, method = "glm", family = "binomial", trControl = myControl) # family = "binomial" explicitly tells train() to fit a logistic regression with "glm"

# Predict probabilities, calculate AUC, and draw the ROC curve
prediction_p <- predict(model_1, test, type = "prob")
colAUC(prediction_p, test$Class, plotROC = TRUE)

##                 M         R
## M vs. R 0.6964472 0.6964472
# Predict class, find the confusion matrix and calculate accuracy, sensitivity...
prediction_c <- predict(model_1, test)
confusionMatrix(prediction_c, test$Class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  M  R
##          M 22 11
##          R 11 18
##                                           
##                Accuracy : 0.6452          
##                  95% CI : (0.5134, 0.7626)
##     No Information Rate : 0.5323          
##     P-Value [Acc > NIR] : 0.04813         
##                                           
##                   Kappa : 0.2874          
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.6207          
##          Pos Pred Value : 0.6667          
##          Neg Pred Value : 0.6207          
##              Prevalence : 0.5323          
##          Detection Rate : 0.3548          
##    Detection Prevalence : 0.5323          
##       Balanced Accuracy : 0.6437          
##                                           
##        'Positive' Class : M               
## 
# Let's try another probability threshold and see how the model behaves

# Apply threshold of 0.9: p_class
p_class <- ifelse(prediction_p$M > 0.9, "M", "R")

# Create confusion matrix
confusionMatrix(factor(p_class, levels = levels(test$Class)), test$Class) # confusionMatrix() expects a factor with the same levels as the reference
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  M  R
##          M 22 10
##          R 11 19
##                                           
##                Accuracy : 0.6613          
##                  95% CI : (0.5299, 0.7767)
##     No Information Rate : 0.5323          
##     P-Value [Acc > NIR] : 0.02721         
##                                           
##                   Kappa : 0.3212          
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.6552          
##          Pos Pred Value : 0.6875          
##          Neg Pred Value : 0.6333          
##              Prevalence : 0.5323          
##          Detection Rate : 0.3548          
##    Detection Prevalence : 0.5161          
##       Balanced Accuracy : 0.6609          
##                                           
##        'Positive' Class : M               
## 
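
The trade-off between sensitivity and specificity becomes clearer if we sweep several cutoffs at once. A minimal sketch (not part of the original course), reusing prediction_p and test from above:

# Sweep a few probability cutoffs and compare raw accuracy at each
thresholds <- c(0.1, 0.3, 0.5, 0.7, 0.9)
sapply(thresholds, function(t) {
  pred <- factor(ifelse(prediction_p$M > t, "M", "R"),
                 levels = levels(test$Class))
  mean(pred == test$Class) # accuracy at cutoff t
})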

Random Forest with caret

Example

# Fit random forest: model
# Set seed
set.seed(33)
# Fit a model
model <- train(Class~.,
               data = Sonar,
               method = "ranger",
               trControl = trainControl(method = "cv", number = 5)
                             )
# Let's check the model
model
## Random Forest 
## 
## 208 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 166, 166, 166, 167, 167 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8367015  0.6672095
##   31    0.8077816  0.6095856
##   60    0.8029036  0.5987429
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
# Plot the results
plot(model)

  • We can control the “granularity” of the tuning grid via the tuneLength parameter, which defaults to 3. Let’s try tuneLength = 10:
model <- train(Class~.,
               data = Sonar,
               method = "ranger",
               trControl = trainControl(method = "cv", number = 5),
               tuneLength = 10
                             )
# Let's check the model
model
## Random Forest 
## 
## 208 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 165, 167, 167, 166, 167 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8133025  0.6191802
##    8    0.8088783  0.6122841
##   14    0.8237393  0.6417609
##   21    0.7947033  0.5824658
##   27    0.7993545  0.5925080
##   34    0.7993545  0.5927957
##   40    0.7993545  0.5928595
##   47    0.7993545  0.5929299
##   53    0.7994706  0.5929108
##   60    0.7945871  0.5828487
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 14.
# Plot the results
plot(model)

Custom tuning grids - Example

Pros and cons of custom tuning

  • Pass custom tuning grids to tuneGrid argument
  • Advantages
    • Most flexible method for fitting caret models
    • Complete control over how the model is fit
  • Disadvantages
    • Requires some knowledge of the model
    • Can dramatically increase run time
# Define a custom tuning grid
myGrid <- data.frame(mtry = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30))

# Fit a model with a custom tuning grid
set.seed(33)
model <- train(Class ~ ., 
               data = Sonar, 
               method = "ranger",
               trControl = trainControl(method = "cv", number = 5),
               tuneGrid = myGrid)

# Plot the results
plot(model)

model
## Random Forest 
## 
## 208 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 166, 166, 166, 167, 167 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8222997  0.6374971
##    3    0.8270616  0.6475681
##    4    0.8317073  0.6571388
##    5    0.8222997  0.6379683
##    6    0.8463415  0.6866811
##    7    0.8270616  0.6480316
##    8    0.8415796  0.6772098
##    9    0.8318235  0.6573567
##   10    0.8222997  0.6381326
##   20    0.8220674  0.6379680
##   30    0.7886179  0.5699430
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 6.

Introducing glmnet

Tuning glmnet models

  • Combination of lasso and ridge regression
  • Can fit a mix of the two models
  • alpha in [0, 1]: pure ridge (alpha = 0) to pure lasso (alpha = 1)
  • lambda in (0, infinity): size of the penalty

Example

# Load data
if (!exists("overfit")) {
                         url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1048/datasets/overfit.csv"
                         overfit <- read.csv(url)
                        }

# Make a custom trainControl - use ROC as the model selection criterion
myControl <- trainControl(
                           method = "cv", number = 10,
                           summaryFunction = twoClassSummary,
                           classProbs = TRUE # Super important!
                          )
# Fit a model
set.seed(33)

model <- train(y ~ ., overfit, method = "glmnet", trControl = myControl)

#Check the model
model
## glmnet 
## 
## 250 samples
## 200 predictors
##   2 classes: 'class1', 'class2' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 226, 225, 225, 225, 224, 225, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        ROC        Sens  Spec     
##   0.10   0.0001012745  0.4781703  0     0.9570652
##   0.10   0.0010127448  0.4782609  0     0.9657609
##   0.10   0.0101274483  0.4782609  0     0.9829710
##   0.55   0.0001012745  0.4416667  0     0.9574275
##   0.55   0.0010127448  0.4401268  0     0.9617754
##   0.55   0.0101274483  0.4938406  0     0.9829710
##   1.00   0.0001012745  0.4143116  0     0.9572464
##   1.00   0.0010127448  0.4101449  0     0.9659420
##   1.00   0.0101274483  0.4746377  0     0.9829710
## 
## ROC was used to select the optimal model using  the largest value.
## The final values used for the model were alpha = 0.55 and lambda
##  = 0.01012745.
# Plot results
plot(model)

# Print maximum ROC statistic
max(model[["results"]]$ROC)
## [1] 0.4938406

glmnet with custom trainControl and custom tuning grid

  • The glmnet model actually fits many models at once
  • This can be exploited by passing a large number of lambda values, which control the amount of penalization in the model - train() is “smart” enough to only fit one model per alpha value and pass all of the lambda values at once for simultaneous fitting
  • Many models are explored for the “price” of one

Example

myGrid <- expand.grid(
                       alpha = 0:1,
                       lambda = seq(0.0001, 1, length = 20)
                      )

# Fit the model
set.seed(42)
model <- train(y ~., data = overfit, method = "glmnet",
               tuneGrid = myGrid, trControl = myControl)

# Check the model
model
## glmnet 
## 
## 250 samples
## 200 predictors
##   2 classes: 'class1', 'class2' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 225, 226, 225, 224, 224, 225, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda      ROC        Sens  Spec     
##   0      0.00010000  0.3955616  0.0   0.9786232
##   0      0.05272632  0.4213768  0.0   1.0000000
##   0      0.10535263  0.4366848  0.0   1.0000000
##   0      0.15797895  0.4387681  0.0   1.0000000
##   0      0.21060526  0.4369565  0.0   1.0000000
##   0      0.26323158  0.4478261  0.0   1.0000000
##   0      0.31585789  0.4543478  0.0   1.0000000
##   0      0.36848421  0.4565217  0.0   1.0000000
##   0      0.42111053  0.4608696  0.0   1.0000000
##   0      0.47373684  0.4652174  0.0   1.0000000
##   0      0.52636316  0.4717391  0.0   1.0000000
##   0      0.57898947  0.4717391  0.0   1.0000000
##   0      0.63161579  0.4717391  0.0   1.0000000
##   0      0.68424211  0.4717391  0.0   1.0000000
##   0      0.73686842  0.4739130  0.0   1.0000000
##   0      0.78949474  0.4717391  0.0   1.0000000
##   0      0.84212105  0.4717391  0.0   1.0000000
##   0      0.89474737  0.4717391  0.0   1.0000000
##   0      0.94737368  0.4717391  0.0   1.0000000
##   0      1.00000000  0.4717391  0.0   1.0000000
##   1      0.00010000  0.3544384  0.1   0.9311594
##   1      0.05272632  0.5125906  0.0   1.0000000
##   1      0.10535263  0.5000000  0.0   1.0000000
##   1      0.15797895  0.5000000  0.0   1.0000000
##   1      0.21060526  0.5000000  0.0   1.0000000
##   1      0.26323158  0.5000000  0.0   1.0000000
##   1      0.31585789  0.5000000  0.0   1.0000000
##   1      0.36848421  0.5000000  0.0   1.0000000
##   1      0.42111053  0.5000000  0.0   1.0000000
##   1      0.47373684  0.5000000  0.0   1.0000000
##   1      0.52636316  0.5000000  0.0   1.0000000
##   1      0.57898947  0.5000000  0.0   1.0000000
##   1      0.63161579  0.5000000  0.0   1.0000000
##   1      0.68424211  0.5000000  0.0   1.0000000
##   1      0.73686842  0.5000000  0.0   1.0000000
##   1      0.78949474  0.5000000  0.0   1.0000000
##   1      0.84212105  0.5000000  0.0   1.0000000
##   1      0.89474737  0.5000000  0.0   1.0000000
##   1      0.94737368  0.5000000  0.0   1.0000000
##   1      1.00000000  0.5000000  0.0   1.0000000
## 
## ROC was used to select the optimal model using  the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.05272632.
# Print maximum ROC statistic
max((model$results)$ROC)
## [1] 0.5125906
# Plot the model
plot(model)
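
Because the final glmnet object stores the whole regularization path, we can also query it at an arbitrary penalty value without refitting. A small sketch:

# The full path lives in model$finalModel, so coefficients at any lambda
# are available without refitting
coef(model$finalModel, s = 0.05)        # coefficients at lambda = 0.05
plot(model$finalModel, xvar = "lambda") # the whole coefficient path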

Preprocessing with caret

Dealing with missing values: Median imputation

  • Most models require numbers, can’t handle missing data!
  • Common approach: remove rows with missing data
    • Can lead to biases in data
    • Generate over-confident models
  • Better strategy: median imputation!
    • Replace missing values with medians
    • Works well if data missing at random (MAR)!
library(mlbench)
library(dplyr)
library(purrr)
data("BreastCancer")

str(BreastCancer)
## 'data.frame':    699 obs. of  11 variables:
##  $ Id             : chr  "1000025" "1002945" "1015425" "1016277" ...
##  $ Cl.thickness   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
##  $ Cell.size      : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
##  $ Cell.shape     : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
##  $ Marg.adhesion  : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
##  $ Epith.c.size   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
##  $ Bare.nuclei    : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
##  $ Bl.cromatin    : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
##  $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
##  $ Mitoses        : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
##  $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
set.seed(33)
X <- select(BreastCancer, c(-Class, -Id))
X <- map(X, as.character) %>% map(as.numeric) %>% as.data.frame() # we need numeric values so we can use the appropriate "impute" options in the preProcess() function!
# Let's provide some "NA" values so we can play the "imputation" game
X[sample(1:nrow(X), 50), "Bare.nuclei"] <- NA
Y <- BreastCancer$Class # VERY IMPORTANT: y has to be a numeric or factor vector, not a data frame!
sum(is.na(X))
## [1] 64
sum(is.na(Y))
## [1] 0
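
Before handing imputation over to train(), we can see what medianImpute does in isolation by calling preProcess() directly. A quick sketch using the X we just created:

# Impute the missing values with column medians, outside of train()
pp <- preProcess(X, method = "medianImpute")
X_imputed <- predict(pp, X)
sum(is.na(X_imputed)) # should now be 0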
myControl <- trainControl(
                           method = "cv", number = 10,
                           summaryFunction = twoClassSummary,
                           classProbs = TRUE # Super important!
                           # verboseIter = TRUE
                          )

# Use the same seed as for the KNN model below so the CV folds match
set.seed(33)

model_median <- train(x = X,  y = Y, 
               method = "glm", 
               trControl = myControl, 
               preProcess = "medianImpute"
               )

# Check the model
model_median

Dealing with missing values: KNN imputation

  • Median imputation is fast, but…
  • Can produce incorrect results if data missing not at random
  • k-nearest neighbors (KNN) imputation
  • Imputes based on “similar” non-missing rows
set.seed(33)

model_knn <- train(
               x = X,  y = Y, 
               method = "glm", 
               preProcess = "knnImpute",
               trControl = myControl 
               )

# Check the model
model_knn
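
One caveat worth knowing: caret’s knnImpute triggers centering and scaling of the predictors (nearest-neighbour distances need a common scale), so the glm above is effectively fit on standardized inputs. A quick check:

# knnImpute standardizes the predictors as a side effect
pp_knn <- preProcess(X, method = "knnImpute")
pp_knn$method # lists center and scale alongside knnImpute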

Comparing models in caret

  • Let’s see how the models obtained by “median imputation” and “KNN imputation” compare
  • We’ll do it in three ways:
    • the distributions summarized in terms of the percentiles
    • the distributions summarized as box plots
    • and finally the distributions summarized as dot plots
# collect resamples
results <- resamples(list(median_impute = model_median, knn_impute = model_knn))

# summarize the distributions
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: median_impute, knn_impute 
## Number of resamples: 10 
## 
## ROC 
##                 Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
## median_impute 0.9873  0.9917 0.9937 0.9941  0.9975    1    0
## knn_impute    0.9792  0.9903 0.9955 0.9939  0.9998    1    0
## 
## Sens 
##                 Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
## median_impute 0.9556  0.9565 0.9783 0.9716  0.9783    1    0
## knn_impute    0.9348  0.9779 0.9783 0.9739  0.9783    1    0
## 
## Spec 
##                 Min. 1st Qu. Median  Mean 3rd Qu. Max. NA's
## median_impute 0.8333  0.9583 0.9583 0.946  0.9583    1    0
## knn_impute    0.8750  0.9167 0.9392 0.942  0.9583    1    0
# boxplots of results
bwplot(results)

# dot plots of results
dotplot(results)

dotplot(results, metric = "ROC")

Well, it seems that the model that uses KNN imputation for the missing values performs slightly better!

Combining preprocessing methods - Preprocessing cheat sheet

  • Start with median imputation
    • Try KNN imputation if data missing not at random
  • For linear models…
    • Center and scale
    • Try PCA and spatial sign (a short sketch follows the example below)
  • Tree-based models don’t need much preprocessing
set.seed(333)

# Fit glm with median imputation: model1
model1 <- train(
  x = X, y = Y,
  method = "glm",
  trControl = myControl,
  preProcess = "medianImpute"
)

# Print model1
model1
## Generalized Linear Model 
## 
## 699 samples
##   9 predictor
##   2 classes: 'benign', 'malignant' 
## 
## Pre-processing: median imputation (9) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 629, 630, 628, 629, 629, 629, ... 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.9943732  0.9737681  0.9501667
## 
## 
# Fit glm with median imputation and standardization: model2
model2 <- train(
  x = X, y = Y,
  method = "glm",
  trControl = myControl,
  preProcess = c("medianImpute", "center", "scale")
)

# Print model2
model2
## Generalized Linear Model 
## 
## 699 samples
##   9 predictor
##   2 classes: 'benign', 'malignant' 
## 
## Pre-processing: median imputation (9), centered (9), scaled (9) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 629, 629, 630, 629, 629, 629, ... 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.9945729  0.9758937  0.9418333
## 
## 
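
The cheat sheet above also suggests trying PCA and the spatial sign transform for linear models. A hedged sketch chaining them onto the same pipeline (the model name is ours, not from the original):

# Chain imputation, standardization, PCA and spatial sign;
# caret applies the steps in its own fixed order (imputation first)
model_pca_ss <- train(
  x = X, y = Y,
  method = "glm",
  trControl = myControl,
  preProcess = c("medianImpute", "center", "scale", "pca", "spatialSign")
)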

Handling low-information predictors with caret

  • Some variables don’t contain much information
  • Constant (i.e. no variance)
  • Nearly constant (i.e. low variance)
  • Easy for one fold of CV to end up with a constant column
  • Can cause problems for your models
  • Usually it is a good idea to remove extremely low variance variables
  • caret contains a utility function called nearZeroVar() for removing such variables to save time during modeling
    • nearZeroVar() takes each predictor in turn and looks at the ratio of the frequency of the most common value to that of the second most common value (compared against freqCut), and at the percentage of distinct values out of the total number of samples (compared against uniqueCut)
    • If the frequency ratio is greater than a pre-specified threshold and the unique value percentage is less than a threshold, we might consider a predictor to be near zero-variance
    • So nearZeroVar() not only removes predictors that have one unique value across samples (zero variance predictors), but also removes predictors that have both:
      1. few unique values relative to the number of samples and
      2. large ratio of the frequency of the most common value to the frequency of the second most common value (near-zero variance predictors).
    • By default, caret uses freqCut = 19 and uniqueCut = 10, which is fairly conservative
    • To be a little more aggressive, when calling nearZeroVar(), recommended values would be: freqCut = 2 and uniqueCut = 20
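
To make the freqCut / uniqueCut mechanics concrete, here is a tiny made-up illustration before the real example below:

# A toy data frame: one healthy and one near-zero-variance column
toy <- data.frame(
  ok  = rnorm(100),                # plenty of distinct values
  bad = c(rep(0, 96), 1, 1, 2, 2)  # freqRatio = 96/2 = 48 > 19; percentUnique = 3% < 10%
)
nearZeroVar(toy)                     # returns the column index of "bad"
nearZeroVar(toy, saveMetrics = TRUE) # the full metrics table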

Handling low-information predictors with caret: Example

Zero and near-zero variance predictors, also called constant and almost-constant predictors across samples, are often present in a dataset. One frequent reason for this is breaking a categorical variable with many categories into several dummy variables. Hence, when one of the categories has zero observations, it becomes a dummy variable full of zeroes. To illustrate this, take a look at what happens when we want to apply Linear Discriminant Analysis (LDA) to the German Credit Data.

library(MASS)

data(GermanCredit)

model <-  lda(Class ~ ., data = GermanCredit)
## Error in lda.default(x, grouping, ...): variables 26 44 appear to be constant within groups

If we take a closer look at the predictors indicated as problematic by lda, we can see what the problem is. Note that +1 is added to the index since lda does not count the target variable when reporting where the problem is.

colnames(GermanCredit)[26 + 1]
## [1] "Purpose.Vacation"
table(GermanCredit[26 + 1])
## 
##    0 
## 1000
colnames(GermanCredit)[44 + 1]
## [1] "Personal.Female.Single"
table(GermanCredit[44 + 1])
## 
##    0 
## 1000

Quick and dirty solution: throw data away

As we can see above, no loan was taken out to pay for a vacation and there are no single females in our dataset. A natural first choice is to remove predictors like those, and this is exactly what the nearZeroVar function from the caret package helps with. It will flag not only predictors that have one unique value across samples (zero variance predictors), but also, as explained, predictors that have both 1) few unique values relative to the number of samples and 2) a large ratio of the frequency of the most common value to the frequency of the second most common value (near-zero variance predictors).

So let’s filter out the variables (predictors) which satisfy these conditions.

x <- nearZeroVar(GermanCredit, saveMetrics = TRUE)
str(x)
## 'data.frame':    62 obs. of  4 variables:
##  $ freqRatio    : num  1.03 1 2.06 1.34 1.02 ...
##  $ percentUnique: num  3.3 92.1 0.4 0.4 5.3 0.4 0.2 0.2 0.2 0.2 ...
##  $ zeroVar      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ nzv          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
# Be aware that predictors are now used as row names for "x"!
head(row.names(x))
## [1] "Duration"                  "Amount"                   
## [3] "InstallmentRatePercentage" "ResidenceDuration"        
## [5] "Age"                       "NumberExistingCredits"
head(x)
##                           freqRatio percentUnique zeroVar   nzv
## Duration                   1.027933           3.3   FALSE FALSE
## Amount                     1.000000          92.1   FALSE FALSE
## InstallmentRatePercentage  2.060606           0.4   FALSE FALSE
## ResidenceDuration          1.340909           0.4   FALSE FALSE
## Age                        1.020000           5.3   FALSE FALSE
## NumberExistingCredits      1.900901           0.4   FALSE FALSE

We can see above that if we call the nearZeroVar function with the argument saveMetrics = TRUE, we have access to the frequency ratios and the percentages of unique values for each predictor, as well as flags that indicate whether the variables are considered zero-variance or near-zero-variance predictors. By default, a predictor is classified as near-zero variance if the percentage of unique values in the samples is less than 10% and the frequency ratio mentioned above is greater than 19 (95/5). As already mentioned, these default values can be changed by setting the arguments uniqueCut and freqCut.

Let’s explore which ones are the zero variance predictors:

x[x[,"zeroVar"] == TRUE, ]
##                        freqRatio percentUnique zeroVar  nzv
## Purpose.Vacation               0           0.1    TRUE TRUE
## Personal.Female.Single         0           0.1    TRUE TRUE

And which are the near zero variance predictors (these will also include the zero variance predictors):

x[x[, "nzv"] == TRUE, ]
##                                    freqRatio percentUnique zeroVar  nzv
## ForeignWorker                       26.02703           0.2   FALSE TRUE
## CreditHistory.NoCredit.AllPaid      24.00000           0.2   FALSE TRUE
## CreditHistory.ThisBank.AllPaid      19.40816           0.2   FALSE TRUE
## Purpose.DomesticAppliance           82.33333           0.2   FALSE TRUE
## Purpose.Repairs                     44.45455           0.2   FALSE TRUE
## Purpose.Vacation                     0.00000           0.1    TRUE TRUE
## Purpose.Retraining                 110.11111           0.2   FALSE TRUE
## Purpose.Other                       82.33333           0.2   FALSE TRUE
## SavingsAccountBonds.gt.1000         19.83333           0.2   FALSE TRUE
## Personal.Female.Single               0.00000           0.1    TRUE TRUE
## OtherDebtorsGuarantors.CoApplicant  23.39024           0.2   FALSE TRUE
## OtherInstallmentPlans.Stores        20.27660           0.2   FALSE TRUE
## Job.UnemployedUnskilled             44.45455           0.2   FALSE TRUE

Let’s now remove these predictors by hand, using slightly more aggressive values for freqCut and uniqueCut, and conduct the LDA once again:

# Identify near zero variance predictors: remove_cols
remove_cols <- nearZeroVar(subset(GermanCredit, select = -Class), 
                           names = TRUE, freqCut = 10, uniqueCut = 15) # Just take care not to drop out the Class variable

# Get all column names from GermanCredit: all_cols
all_cols <- names(GermanCredit)

# Remove from data: german_no_nzv
german_no_nzv <- GermanCredit[ , setdiff(all_cols, remove_cols)]
str(german_no_nzv)
## 'data.frame':    1000 obs. of  42 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ Class                                 : Factor w/ 2 levels "Bad","Good": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...
model1 <-  lda(Class ~ ., data = german_no_nzv)
model1
## Call:
## lda(Class ~ ., data = german_no_nzv)
## 
## Prior probabilities of groups:
##  Bad Good 
##  0.3  0.7 
## 
## Group means:
##      Duration   Amount InstallmentRatePercentage ResidenceDuration
## Bad  24.86000 3938.127                  3.096667          2.850000
## Good 19.20714 2985.457                  2.920000          2.842857
##           Age NumberExistingCredits NumberPeopleMaintenance Telephone
## Bad  33.96333              1.366667                1.153333 0.6233333
## Good 36.22429              1.424286                1.155714 0.5842857
##      CheckingAccountStatus.lt.0 CheckingAccountStatus.0.to.200
## Bad                   0.4500000                      0.3500000
## Good                  0.1985714                      0.2342857
##      CheckingAccountStatus.none CreditHistory.PaidDuly
## Bad                   0.1533333              0.5633333
## Good                  0.4971429              0.5157143
##      CreditHistory.Critical Purpose.NewCar Purpose.UsedCar
## Bad               0.1666667      0.2966667      0.05666667
## Good              0.3471429      0.2071429      0.12285714
##      Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.Business
## Bad                    0.1933333                0.2066667        0.1133333
## Good                   0.1757143                0.3114286        0.0900000
##      SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## Bad                   0.7233333                     0.11333333
## Good                  0.5514286                     0.09857143
##      SavingsAccountBonds.Unknown EmploymentDuration.lt.1
## Bad                    0.1066667               0.2333333
## Good                   0.2157143               0.1457143
##      EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
## Bad                  0.3466667                 0.1300000
## Good                 0.3357143                 0.1928571
##      EmploymentDuration.gt.7 Personal.Female.NotSingle
## Bad                0.2133333                 0.3633333
## Good               0.2700000                 0.2871429
##      Personal.Male.Single Personal.Male.Married.Widowed
## Bad             0.4866667                    0.08333333
## Good            0.5742857                    0.09571429
##      OtherDebtorsGuarantors.None Property.RealEstate Property.Insurance
## Bad                    0.9066667           0.2000000          0.2366667
## Good                   0.9071429           0.3171429          0.2300000
##      Property.CarOther Property.Unknown OtherInstallmentPlans.Bank
## Bad          0.3400000        0.2233333                  0.1900000
## Good         0.3285714        0.1242857                  0.1171429
##      OtherInstallmentPlans.None Housing.Rent Housing.Own Housing.ForFree
## Bad                   0.7466667    0.2333333   0.6200000      0.14666667
## Good                  0.8428571    0.1557143   0.7528571      0.09142857
##      Job.UnskilledResident Job.SkilledEmployee
## Bad              0.1866667           0.6200000
## Good             0.2057143           0.6342857
##      Job.Management.SelfEmp.HighlyQualified
## Bad                               0.1700000
## Good                              0.1385714
## 
## Coefficients of linear discriminants:
##                                                  LD1
## Duration                               -2.389263e-02
## Amount                                 -8.998711e-05
## InstallmentRatePercentage              -2.331679e-01
## ResidenceDuration                       1.431743e-04
## Age                                     8.438415e-03
## NumberExistingCredits                  -1.355913e-01
## NumberPeopleMaintenance                -1.414723e-01
## Telephone                              -2.233048e-01
## CheckingAccountStatus.lt.0             -9.307483e-01
## CheckingAccountStatus.0.to.200         -4.730931e-01
## CheckingAccountStatus.none              4.367377e-01
## CreditHistory.PaidDuly                  2.427631e-01
## CreditHistory.Critical                  7.624888e-01
## Purpose.NewCar                         -2.326781e-01
## Purpose.UsedCar                         9.178499e-01
## Purpose.Furniture.Equipment             3.429477e-01
## Purpose.Radio.Television                4.060980e-01
## Purpose.Business                        2.878719e-01
## SavingsAccountBonds.lt.100             -5.312015e-01
## SavingsAccountBonds.100.to.500         -3.062926e-01
## SavingsAccountBonds.Unknown             9.909503e-02
## EmploymentDuration.lt.1                 2.342099e-02
## EmploymentDuration.1.to.4               1.786391e-01
## EmploymentDuration.4.to.7               6.343731e-01
## EmploymentDuration.gt.7                 2.692400e-01
## Personal.Female.NotSingle               2.360033e-01
## Personal.Male.Single                    6.303865e-01
## Personal.Male.Married.Widowed           4.664764e-01
## OtherDebtorsGuarantors.None            -3.371464e-01
## Property.RealEstate                     2.750368e-01
## Property.Insurance                      4.056157e-02
## Property.CarOther                       1.615275e-02
## Property.Unknown                       -5.124192e-01
## OtherInstallmentPlans.Bank              1.218611e-02
## OtherInstallmentPlans.None              4.501033e-01
## Housing.Rent                           -3.241949e-01
## Housing.Own                             7.505786e-02
## Housing.ForFree                         3.359858e-01
## Job.UnskilledResident                  -2.615916e-01
## Job.SkilledEmployee                    -3.393272e-01
## Job.Management.SelfEmp.HighlyQualified -2.107179e-01
plot(model1)

# or by using the "train" function from "caret" 
model2 <- train(Class ~., data = german_no_nzv, method = "lda")
model2
## Linear Discriminant Analysis 
## 
## 1000 samples
##   41 predictor
##    2 classes: 'Bad', 'Good' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7422605  0.3524467
## 
## 
# and finally, using the "preProcess" argument inside the "train" function

model3 <- train(Class ~.,
                data = GermanCredit,
                method = "lda", 
                preProcess = "nzv"
                )
model3
## Linear Discriminant Analysis 
## 
## 1000 samples
##   61 predictor
##    2 classes: 'Bad', 'Good' 
## 
## Pre-processing: remove (13) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7387158  0.3477714
## 
## 

PCA with caret: using PCA as an alternative to nearZeroVar()

  • An alternative to removing low-variance predictors is to run PCA on your dataset. This is sometimes preferable because it does not throw out all of your data:
    • many different low variance predictors may end up combined into one high variance PCA variable, which might have a positive impact on your model’s accuracy
  • This is an especially good trick for linear models: the pca option in the preProcess argument will center and scale your data, combine low variance variables, and ensure that all of your predictors are orthogonal
    • This creates an ideal dataset for linear regression modeling, and can often improve the accuracy of your models
# Fit "lda" model using PCA: model
model4 <- train(Class ~.,
                data = GermanCredit,
                method = "lda", 
                preProcess = c("zv", "center", "scale", "pca")
               )

# Print model to console
model4
## Linear Discriminant Analysis 
## 
## 1000 samples
##   61 predictor
##    2 classes: 'Bad', 'Good' 
## 
## Pre-processing: centered (59), scaled (59), principal component
##  signal extraction (59), remove (2) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7515928  0.3702907
## 
## 

Comparing models - extensive example and more details

We’ll go through a real-world example:

# Summarize the target variables
library(caret)
library(C50)
data(churn)

str(churnTrain)
## 'data.frame':    3333 obs. of  20 variables:
##  $ state                        : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
##  $ account_length               : int  128 107 137 84 75 118 121 147 117 141 ...
##  $ area_code                    : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
##  $ international_plan           : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 1 2 ...
##  $ voice_mail_plan              : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 2 1 1 2 ...
##  $ number_vmail_messages        : int  25 26 0 0 0 0 24 0 0 37 ...
##  $ total_day_minutes            : num  265 162 243 299 167 ...
##  $ total_day_calls              : int  110 123 114 71 113 98 88 79 97 84 ...
##  $ total_day_charge             : num  45.1 27.5 41.4 50.9 28.3 ...
##  $ total_eve_minutes            : num  197.4 195.5 121.2 61.9 148.3 ...
##  $ total_eve_calls              : int  99 103 110 88 122 101 108 94 80 111 ...
##  $ total_eve_charge             : num  16.78 16.62 10.3 5.26 12.61 ...
##  $ total_night_minutes          : num  245 254 163 197 187 ...
##  $ total_night_calls            : int  91 103 104 89 121 118 118 96 90 97 ...
##  $ total_night_charge           : num  11.01 11.45 7.32 8.86 8.41 ...
##  $ total_intl_minutes           : num  10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
##  $ total_intl_calls             : int  3 3 5 7 3 6 7 6 4 5 ...
##  $ total_intl_charge            : num  2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
##  $ number_customer_service_calls: int  1 1 0 2 3 0 3 0 1 0 ...
##  $ churn                        : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
prop.table(table(churnTrain$churn))
## 
##       yes        no 
## 0.1449145 0.8550855
# Create train/test indices
set.seed(33)
myFolds <- createFolds(churnTrain$churn, k = 5) # This creates folds that preserve the class distribution!

# Check whether the class distributions are preserved inside the folds
i <- myFolds$Fold1
prop.table(table(churnTrain$churn[i]))
## 
##       yes        no 
## 0.1454273 0.8545727
# Use these folds to create a reusable trainControl object, so that we can exploit the exact same cross-validation folds for each model: myControl
myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE, # IMPORTANT!
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds
)

The first model that we are going to fit to our data is the “good, old” glmnet. So let’s recap the basic features of this algorithm:

  • Linear model with built-in variable selection
  • Great baseline model
  • Advantages:
    • Fits quickly
    • Ignores noisy variables
    • Provides interpretable coefficients

# Fit the model
set.seed(33)
model_glmnet <- train(
                      churn ~ ., 
                      data = churnTrain,
                      metric = "ROC",
                      method = "glmnet",
                      tuneGrid = expand.grid(
                                             alpha = 0:1,
                                             lambda = 0:10/10
                                            ),
                      trControl = myControl
                     )
## + Fold1: alpha=0, lambda=1 
## - Fold1: alpha=0, lambda=1 
## + Fold1: alpha=1, lambda=1 
## - Fold1: alpha=1, lambda=1 
## + Fold2: alpha=0, lambda=1 
## - Fold2: alpha=0, lambda=1 
## + Fold2: alpha=1, lambda=1 
## - Fold2: alpha=1, lambda=1 
## + Fold3: alpha=0, lambda=1 
## - Fold3: alpha=0, lambda=1 
## + Fold3: alpha=1, lambda=1 
## - Fold3: alpha=1, lambda=1 
## + Fold4: alpha=0, lambda=1 
## - Fold4: alpha=0, lambda=1 
## + Fold4: alpha=1, lambda=1 
## - Fold4: alpha=1, lambda=1 
## + Fold5: alpha=0, lambda=1 
## - Fold5: alpha=0, lambda=1 
## + Fold5: alpha=1, lambda=1 
## - Fold5: alpha=1, lambda=1 
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 0, lambda = 0.2 on full training set
# Check the model
model_glmnet
## glmnet 
## 
## 3333 samples
##   19 predictor
##    2 classes: 'yes', 'no' 
## 
## No pre-processing
## Resampling: Bootstrapped (5 reps) 
## Summary of sample sizes: 667, 667, 666, 667, 666 
## Resampling results across tuning parameters:
## 
##   alpha  lambda  ROC        Sens         Spec     
##   0      0.0     0.7447031  0.214815707  0.9580702
##   0      0.1     0.7620925  0.062115918  0.9933333
##   0      0.2     0.7630074  0.018115971  0.9987719
##   0      0.3     0.7628231  0.004138383  1.0000000
##   0      0.4     0.7624807  0.001034931  1.0000000
##   0      0.5     0.7621059  0.000000000  1.0000000
##   0      0.6     0.7617620  0.000000000  1.0000000
##   0      0.7     0.7614339  0.000000000  1.0000000
##   0      0.8     0.7611350  0.000000000  1.0000000
##   0      0.9     0.7608636  0.000000000  1.0000000
##   0      1.0     0.7607234  0.000000000  1.0000000
##   1      0.0     0.7008344  0.260359347  0.9450877
##   1      0.1     0.5586068  0.000000000  1.0000000
##   1      0.2     0.5000000  0.000000000  1.0000000
##   1      0.3     0.5000000  0.000000000  1.0000000
##   1      0.4     0.5000000  0.000000000  1.0000000
##   1      0.5     0.5000000  0.000000000  1.0000000
##   1      0.6     0.5000000  0.000000000  1.0000000
##   1      0.7     0.5000000  0.000000000  1.0000000
##   1      0.8     0.5000000  0.000000000  1.0000000
##   1      0.9     0.5000000  0.000000000  1.0000000
##   1      1.0     0.5000000  0.000000000  1.0000000
## 
## ROC was used to select the optimal model using  the largest value.
## The final values used for the model were alpha = 0 and lambda = 0.2.
# Plot the results
plot(model_glmnet)

# Plot the coefficients
plot(model_glmnet$finalModel)

The next algorithm that we are going to use to model the data is the random forest. Here is a quick review of this frequently used machine learning algorithm:

  • Slower to fit than glmnet
  • Less interpretable
  • Often (but not always) more accurate than glmnet
  • Easier to tune
  • Requires little preprocessing
  • Captures threshold effects and variable interactions

set.seed(33)

model_rf <- train(
                  churn ~ ., churnTrain,
                  metric = "ROC",
                  method = "ranger",
                  trControl = myControl
                  )
## + Fold1: mtry= 2 
## - Fold1: mtry= 2 
## + Fold1: mtry=35 
## - Fold1: mtry=35 
## + Fold1: mtry=69 
## - Fold1: mtry=69 
## + Fold2: mtry= 2 
## - Fold2: mtry= 2 
## + Fold2: mtry=35 
## - Fold2: mtry=35 
## + Fold2: mtry=69 
## - Fold2: mtry=69 
## + Fold3: mtry= 2 
## - Fold3: mtry= 2 
## + Fold3: mtry=35 
## - Fold3: mtry=35 
## + Fold3: mtry=69 
## - Fold3: mtry=69 
## + Fold4: mtry= 2 
## - Fold4: mtry= 2 
## + Fold4: mtry=35 
## - Fold4: mtry=35 
## + Fold4: mtry=69 
## - Fold4: mtry=69 
## + Fold5: mtry= 2 
## - Fold5: mtry= 2 
## + Fold5: mtry=35 
## - Fold5: mtry=35 
## + Fold5: mtry=69 
## - Fold5: mtry=69 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 35 on full training set
model_rf
## Random Forest 
## 
## 3333 samples
##   19 predictor
##    2 classes: 'yes', 'no' 
## 
## No pre-processing
## Resampling: Bootstrapped (5 reps) 
## Summary of sample sizes: 667, 667, 666, 667, 666 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens        Spec     
##    2    0.8675389  0.03725616  0.9995614
##   35    0.9019706  0.62472721  0.9775439
##   69    0.8988657  0.64645674  0.9728947
## 
## ROC was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 35.
plot(model_rf)

Comparing models

# Make a list containing models to be compared
model_list <- list(
                   glmnet = model_glmnet,
                   rf = model_rf
                  )

# Collect resamples from the CV folds
resamps <- resamples(model_list)
resamps
## 
## Call:
## resamples.default(x = model_list)
## 
## Models: glmnet, rf 
## Number of resamples: 5 
## Performance metrics: ROC, Sens, Spec 
## Time estimates for: everything, final model fit
# Summarize the results
summary(resamps)
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: glmnet, rf 
## Number of resamples: 5 
## 
## ROC 
##          Min. 1st Qu. Median  Mean 3rd Qu.   Max. NA's
## glmnet 0.7264  0.7617 0.7679 0.763  0.7759 0.7832    0
## rf     0.8958  0.8961 0.8995 0.902  0.9062 0.9121    0
## 
## Sens 
##           Min. 1st Qu.  Median    Mean 3rd Qu.    Max. NA's
## glmnet 0.01034 0.01036 0.01813 0.01812 0.02584 0.02591    0
## rf     0.58290 0.59430 0.61660 0.62470 0.64770 0.68220    0
## 
## Spec 
##          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## glmnet 0.9974  0.9982 0.9991 0.9988  0.9991 1.0000    0
## rf     0.9697  0.9715 0.9724 0.9775  0.9829 0.9912    0
# Box-and-whisker plot
bwplot(resamps)

# Or if we are interested only in ROC AUC
bwplot(resamps, metric = "ROC")

# Dot plot
dotplot(resamps, metric = "ROC")

# Density plot
densityplot(resamps, metric = "ROC") # A useful way to look for outlier folds with unusually high or low AUC

# Scatter plot
xyplot(resamps, metric = "ROC") # in this case showing that for all CV folds rf provides better performance in terms of ROC AUC
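
A natural follow-up (not shown in the original) is a formal test of the difference: caret can compare the fold-wise resampled metrics with paired t-tests via diff():

# Paired comparison of the resampled ROC values
difValues <- diff(resamps, metric = "ROC")
summary(difValues) # mean glmnet - rf ROC difference with a p-value
bwplot(difValues)  # distribution of the fold-wise differences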