### The caret package

In this post we will cover:

- cross-validation with the trainControl() function
- using AUC (instead of accuracy) for model tuning
- tuning random forest and glmnet models
- imputing missing values in caret
- comparing models trained on the same folds
- removing low-variance predictors with nearZeroVar(), with an example
- using PCA as an alternative to nearZeroVar()

This material is mostly based on Max Kuhn's tutorial for the caret package, as well as the course he made together with Zachary Deane-Mayer for DataCamp.

### Cross-validation with caret

caret supports various types of cross-validation. The type of cross-validation, as well as the number of folds, is specified with the trainControl() function, which is passed to the trControl argument of train():

model <- train(
y ~ ., my_data,
method = "lm",
trControl = trainControl(
method = "cv", number = 10,
verboseIter = TRUE
)
)
Note that you pass the method for modeling to the main train() function and the method for cross-validation to the trainControl() function.

# Fit lm model using 10-fold CV: model
library(caret)
model <- train(
price ~ ., diamonds,
method = "lm",
trControl = trainControl(
method = "cv",
number = 10,
verboseIter = TRUE
)
)
## + Fold01: intercept=TRUE
## - Fold01: intercept=TRUE
## + Fold02: intercept=TRUE
## - Fold02: intercept=TRUE
## + Fold03: intercept=TRUE
## - Fold03: intercept=TRUE
## + Fold04: intercept=TRUE
## - Fold04: intercept=TRUE
## + Fold05: intercept=TRUE
## - Fold05: intercept=TRUE
## + Fold06: intercept=TRUE
## - Fold06: intercept=TRUE
## + Fold07: intercept=TRUE
## - Fold07: intercept=TRUE
## + Fold08: intercept=TRUE
## - Fold08: intercept=TRUE
## + Fold09: intercept=TRUE
## - Fold09: intercept=TRUE
## + Fold10: intercept=TRUE
## - Fold10: intercept=TRUE
## Aggregating results
## Fitting final model on full training set
# Print model to console
model
## Linear Regression
##
## 53940 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 48546, 48547, 48545, 48546, 48546, 48546, ...
## Resampling results:
##
## RMSE Rsquared
## 1131.04 0.9195964
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
##
One of the great strengths of the train() function in caret is how easy it is to run very different models or methods of cross-validation just by tweaking a few simple arguments in the function call:

trControl = trainControl(
method = "cv", number = 5,
repeats = 5 # verboseIter = TRUE
)
library(MASS)
set.seed(333)
# Fit lm model using 5-fold CV: model
model <- train(
medv ~ ., Boston,
method = "lm",
trControl = trainControl(
method = "cv", number = 5
# verboseIter = TRUE
)
)
# Print model to console
model
## Linear Regression
##
## 506 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 404, 404, 404, 406, 406
## Resampling results:
##
## RMSE Rsquared
## 4.945624 0.7192787
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
##
# Fit lm model using 5 x 5-fold CV: model
# NB: with method = "cv" the repeats argument has no effect, which is why the
# output below reports plain 5-fold CV; see the "repeatedcv" sketch after it
model <- train(
medv ~ ., Boston,
method = "lm",
trControl = trainControl(
method = "cv", number = 5,
repeats = 5 # verboseIter = TRUE
)
)
# Print model to console
model
## Linear Regression
##
## 506 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 404, 405, 405, 405, 405
## Resampling results:
##
## RMSE Rsquared
## 4.911869 0.7177594
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
##
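For genuinely repeated cross-validation, the resampling method itself has to change to "repeatedcv". A minimal sketch, reusing the Boston data from above (model_rcv is just an illustrative name):

# Fit lm model using true 5 x 5-fold repeated CV
model_rcv <- train(
  medv ~ ., Boston,
  method = "lm",
  trControl = trainControl(
    method = "repeatedcv", # repeats is honored for this resampling method
    number = 5,
    repeats = 5
  )
)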
### trainControl function - using AUC (instead of accuracy) for model tuning

The trainControl() function in caret can be adjusted to use AUC (instead of accuracy) to tune the parameters of trained models. The twoClassSummary() convenience function allows this to be done easily. When using twoClassSummary(), be sure to always include the argument classProbs = TRUE, otherwise your model will throw an error! (AUC can't be calculated with just class predictions; class probabilities are needed as well.)

library(mlbench) # a collection of artificial and real-world machine learning benchmark problems, including, e.g., several data sets from the UCI repository
library(caTools)
data("Sonar")
str(Sonar)
## 'data.frame': 208 obs. of 61 variables:
## $ V1 : num 0.02 0.0453 0.0262 0.01 0.0762 0.0286 0.0317 0.0519 0.0223 0.0164 ...
## $ V2 : num 0.0371 0.0523 0.0582 0.0171 0.0666 0.0453 0.0956 0.0548 0.0375 0.0173 ...
## $ V3 : num 0.0428 0.0843 0.1099 0.0623 0.0481 ...
## $ V4 : num 0.0207 0.0689 0.1083 0.0205 0.0394 ...
## $ V5 : num 0.0954 0.1183 0.0974 0.0205 0.059 ...
## $ V6 : num 0.0986 0.2583 0.228 0.0368 0.0649 ...
## $ V7 : num 0.154 0.216 0.243 0.11 0.121 ...
## $ V8 : num 0.16 0.348 0.377 0.128 0.247 ...
## $ V9 : num 0.3109 0.3337 0.5598 0.0598 0.3564 ...
## $ V10 : num 0.211 0.287 0.619 0.126 0.446 ...
## $ V11 : num 0.1609 0.4918 0.6333 0.0881 0.4152 ...
## $ V12 : num 0.158 0.655 0.706 0.199 0.395 ...
## $ V13 : num 0.2238 0.6919 0.5544 0.0184 0.4256 ...
## $ V14 : num 0.0645 0.7797 0.532 0.2261 0.4135 ...
## $ V15 : num 0.066 0.746 0.648 0.173 0.453 ...
## $ V16 : num 0.227 0.944 0.693 0.213 0.533 ...
## $ V17 : num 0.31 1 0.6759 0.0693 0.7306 ...
## $ V18 : num 0.3 0.887 0.755 0.228 0.619 ...
## $ V19 : num 0.508 0.802 0.893 0.406 0.203 ...
## $ V20 : num 0.48 0.782 0.862 0.397 0.464 ...
## $ V21 : num 0.578 0.521 0.797 0.274 0.415 ...
## $ V22 : num 0.507 0.405 0.674 0.369 0.429 ...
## $ V23 : num 0.433 0.396 0.429 0.556 0.573 ...
## $ V24 : num 0.555 0.391 0.365 0.485 0.54 ...
## $ V25 : num 0.671 0.325 0.533 0.314 0.316 ...
## $ V26 : num 0.641 0.32 0.241 0.533 0.229 ...
## $ V27 : num 0.71 0.327 0.507 0.526 0.7 ...
## $ V28 : num 0.808 0.277 0.853 0.252 1 ...
## $ V29 : num 0.679 0.442 0.604 0.209 0.726 ...
## $ V30 : num 0.386 0.203 0.851 0.356 0.472 ...
## $ V31 : num 0.131 0.379 0.851 0.626 0.51 ...
## $ V32 : num 0.26 0.295 0.504 0.734 0.546 ...
## $ V33 : num 0.512 0.198 0.186 0.612 0.288 ...
## $ V34 : num 0.7547 0.2341 0.2709 0.3497 0.0981 ...
## $ V35 : num 0.854 0.131 0.423 0.395 0.195 ...
## $ V36 : num 0.851 0.418 0.304 0.301 0.418 ...
## $ V37 : num 0.669 0.384 0.612 0.541 0.46 ...
## $ V38 : num 0.61 0.106 0.676 0.881 0.322 ...
## $ V39 : num 0.494 0.184 0.537 0.986 0.283 ...
## $ V40 : num 0.274 0.197 0.472 0.917 0.243 ...
## $ V41 : num 0.051 0.167 0.465 0.612 0.198 ...
## $ V42 : num 0.2834 0.0583 0.2587 0.5006 0.2444 ...
## $ V43 : num 0.282 0.14 0.213 0.321 0.185 ...
## $ V44 : num 0.4256 0.1628 0.2222 0.3202 0.0841 ...
## $ V45 : num 0.2641 0.0621 0.2111 0.4295 0.0692 ...
## $ V46 : num 0.1386 0.0203 0.0176 0.3654 0.0528 ...
## $ V47 : num 0.1051 0.053 0.1348 0.2655 0.0357 ...
## $ V48 : num 0.1343 0.0742 0.0744 0.1576 0.0085 ...
## $ V49 : num 0.0383 0.0409 0.013 0.0681 0.023 0.0264 0.0507 0.0285 0.0777 0.0092 ...
## $ V50 : num 0.0324 0.0061 0.0106 0.0294 0.0046 0.0081 0.0159 0.0178 0.0439 0.0198 ...
## $ V51 : num 0.0232 0.0125 0.0033 0.0241 0.0156 0.0104 0.0195 0.0052 0.0061 0.0118 ...
## $ V52 : num 0.0027 0.0084 0.0232 0.0121 0.0031 0.0045 0.0201 0.0081 0.0145 0.009 ...
## $ V53 : num 0.0065 0.0089 0.0166 0.0036 0.0054 0.0014 0.0248 0.012 0.0128 0.0223 ...
## $ V54 : num 0.0159 0.0048 0.0095 0.015 0.0105 0.0038 0.0131 0.0045 0.0145 0.0179 ...
## $ V55 : num 0.0072 0.0094 0.018 0.0085 0.011 0.0013 0.007 0.0121 0.0058 0.0084 ...
## $ V56 : num 0.0167 0.0191 0.0244 0.0073 0.0015 0.0089 0.0138 0.0097 0.0049 0.0068 ...
## $ V57 : num 0.018 0.014 0.0316 0.005 0.0072 0.0057 0.0092 0.0085 0.0065 0.0032 ...
## $ V58 : num 0.0084 0.0049 0.0164 0.0044 0.0048 0.0027 0.0143 0.0047 0.0093 0.0035 ...
## $ V59 : num 0.009 0.0052 0.0095 0.004 0.0107 0.0051 0.0036 0.0048 0.0059 0.0056 ...
## $ V60 : num 0.0032 0.0044 0.0078 0.0117 0.0094 0.0062 0.0103 0.0053 0.0022 0.004 ...
## $ Class: Factor w/ 2 levels "M","R": 2 2 2 2 2 2 2 2 2 2 ...
set.seed(33)
# Create trainControl object: myControl
myControl <- trainControl(
method = "repeatedCv",
number = 5,
repeats = 5,
summaryFunction = twoClassSummary,
classProbs = TRUE # IMPORTANT!
# verboseIter = TRUE
)
# After creating a custom trainControl object, it's easy to fit caret models that use AUC rather than accuracy to tune and evaluate the model. You can just pass your custom trainControl object to the train() function via the trControl argument.
# Train glm with custom trainControl: model
model <- train(Class ~., Sonar, method = "glm", trControl = myControl)
# Print model to console
model
## Generalized Linear Model
##
## 208 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 166, 166, 166, 167, 167, 167, ...
## Resampling results:
##
## ROC Sens Spec
## 0.7420305 0.7566798 0.6742105
##
##
# Let's make separate train and test sets
inTraining <- createDataPartition(Sonar$Class, p = .7, list = FALSE)
train <- Sonar[inTraining, ]
test <- Sonar[-inTraining, ]
# Train the model with the train set
model_1 <- train(Class ~., data = train, method = "glm", family = "binomial", trControl = myControl) # the argument family = "binomial" explicitly tells train() to fit a logistic regression with "glm"
# Predict probabilities, calculate AUC, and draw the ROC curve
prediction_p <- predict(model_1, test, type = "prob")
colAUC(prediction_p, test$Class, plotROC = TRUE)
## M R
## M vs. R 0.6964472 0.6964472
# Predict class, find the confusion matrix and calculate accuracy, sensitivity...
prediction_c <- predict(model_1, test)
confusionMatrix(prediction_c, test$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction M R
## M 22 11
## R 11 18
##
## Accuracy : 0.6452
## 95% CI : (0.5134, 0.7626)
## No Information Rate : 0.5323
## P-Value [Acc > NIR] : 0.04813
##
## Kappa : 0.2874
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.6667
## Specificity : 0.6207
## Pos Pred Value : 0.6667
## Neg Pred Value : 0.6207
## Prevalence : 0.5323
## Detection Rate : 0.3548
## Detection Prevalence : 0.5323
## Balanced Accuracy : 0.6437
##
## 'Positive' Class : M
##
# Let's try another probability threshold and see how the model behaves
# Apply threshold of 0.9: p_class
p_class <- ifelse(prediction_p$M > 0.9, "M", "R")
# Create confusion matrix (recent caret versions require a factor rather than a character vector here)
confusionMatrix(factor(p_class, levels = levels(test$Class)), test$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction M R
## M 22 10
## R 11 19
##
## Accuracy : 0.6613
## 95% CI : (0.5299, 0.7767)
## No Information Rate : 0.5323
## P-Value [Acc > NIR] : 0.02721
##
## Kappa : 0.3212
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.6667
## Specificity : 0.6552
## Pos Pred Value : 0.6875
## Neg Pred Value : 0.6333
## Prevalence : 0.5323
## Detection Rate : 0.3548
## Detection Prevalence : 0.5161
## Balanced Accuracy : 0.6609
##
## 'Positive' Class : M
##
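The two confusion matrices above correspond to just two cutoffs. A small sketch, reusing prediction_p and test from above, scans a few cutoffs and reports the resulting accuracy, making the threshold/accuracy trade-off explicit (thresholds is an illustrative helper):

# Accuracy at several probability thresholds (sketch)
thresholds <- c(0.1, 0.3, 0.5, 0.7, 0.9)
sapply(thresholds, function(t) {
  pred <- factor(ifelse(prediction_p$M > t, "M", "R"), levels = levels(test$Class))
  mean(pred == test$Class) # accuracy at this cutoff
})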
### Random forests in caret

# Fit random forest: model
# Set seed
set.seed(33)
# Fit a model
model <- train(Class~.,
data = Sonar,
method = "ranger",
trControl = trainControl(method = "cv", number = 5)
)
# Let's check the model
model
## Random Forest
##
## 208 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 166, 166, 166, 167, 167
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8367015 0.6672095
## 31 0.8077816 0.6095856
## 60 0.8029036 0.5987429
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# Plot the results
plot(model)
The number of candidate values that train() explores for each tuning parameter is controlled by the tuneLength parameter, which by default has a value of 3. Let's try tuneLength = 10:

model <- train(Class ~ .,
data = Sonar,
method = "ranger",
trControl = trainControl(method = "cv", number = 5),
tuneLength = 10
)
# Let's check the model
model
## Random Forest
##
## 208 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 165, 167, 167, 166, 167
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8133025 0.6191802
## 8 0.8088783 0.6122841
## 14 0.8237393 0.6417609
## 21 0.7947033 0.5824658
## 27 0.7993545 0.5925080
## 34 0.7993545 0.5927957
## 40 0.7993545 0.5928595
## 47 0.7993545 0.5929299
## 53 0.7994706 0.5929108
## 60 0.7945871 0.5828487
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 14.
# Plot the results
plot(model)
### Pros and cons of custom tuning

A custom tuning grid gives you complete control over the models that are evaluated, but it also requires the most knowledge of how the model works, and a large grid can make training very slow.
# Define a custom tuning grid
myGrid <- data.frame(mtry = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30))
# Fit a model with a custom tuning grid
set.seed(33)
model <- train(Class ~ .,
data = Sonar,
method = "ranger",
trControl = trainControl(method = "cv", number = 5),
tuneGrid = myGrid)
# Plot the results
plot(model)
model
## Random Forest
##
## 208 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 166, 166, 166, 167, 167
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8222997 0.6374971
## 3 0.8270616 0.6475681
## 4 0.8317073 0.6571388
## 5 0.8222997 0.6379683
## 6 0.8463415 0.6866811
## 7 0.8270616 0.6480316
## 8 0.8415796 0.6772098
## 9 0.8318235 0.6573567
## 10 0.8222997 0.6381326
## 20 0.8220674 0.6379680
## 30 0.7886179 0.5699430
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
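One caveat: in more recent versions of caret, the "ranger" method exposes three tuning parameters (mtry, splitrule and min.node.size), and a custom tuneGrid must then supply all three; an mtry-only data frame like the one above will raise an error there. A sketch of a full grid under that assumption:

# Full ranger tuning grid for newer caret versions (sketch)
myGrid <- expand.grid(
  mtry = c(2, 5, 10, 20, 30),
  splitrule = "gini",  # split rule for classification
  min.node.size = 1    # ranger's default for classification
)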
### glmnet

glmnet fits penalized (ridge, lasso and elastic-net) generalized linear models; its two tuning parameters are alpha (the mix between ridge, alpha = 0, and lasso, alpha = 1) and lambda (the strength of the penalty).
# Load data
if (!exists("overfit")) {
url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1048/datasets/overfit.csv"
overfit <- read.csv(url)
}
# Make a custom trainControl - use ROC as the model selection criterion
myControl <- trainControl(
method = "cv", number = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE # Super important!
)
# Fit a model
set.seed(33)
model <- train(y ~ ., overfit, method = "glmnet", trControl = myControl)
# Check the model
model
## glmnet
##
## 250 samples
## 200 predictors
## 2 classes: 'class1', 'class2'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 226, 225, 225, 225, 224, 225, ...
## Resampling results across tuning parameters:
##
## alpha lambda ROC Sens Spec
## 0.10 0.0001012745 0.4781703 0 0.9570652
## 0.10 0.0010127448 0.4782609 0 0.9657609
## 0.10 0.0101274483 0.4782609 0 0.9829710
## 0.55 0.0001012745 0.4416667 0 0.9574275
## 0.55 0.0010127448 0.4401268 0 0.9617754
## 0.55 0.0101274483 0.4938406 0 0.9829710
## 1.00 0.0001012745 0.4143116 0 0.9572464
## 1.00 0.0010127448 0.4101449 0 0.9659420
## 1.00 0.0101274483 0.4746377 0 0.9829710
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.55 and lambda
## = 0.01012745.
# Plot results
plot(model)
# Print maximum ROC statistic
max(model[["results"]]$ROC)
## [1] 0.4938406
### glmnet with custom trainControl and custom tuning grid

A glmnet model actually fits many models at once: train() is "smart" enough to fit only one model per alpha value and to pass all of the lambda values at once for simultaneous fitting.

# Define a custom tuning grid
myGrid <- expand.grid(
alpha = 0:1,
lambda = seq(0.0001, 1, length = 20)
)
# Fit the model
set.seed(42)
model <- train(y ~., data = overfit, method = "glmnet",
tuneGrid = myGrid, trControl = myControl)
# Check the model
model
## glmnet
##
## 250 samples
## 200 predictors
## 2 classes: 'class1', 'class2'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 225, 226, 225, 224, 224, 225, ...
## Resampling results across tuning parameters:
##
## alpha lambda ROC Sens Spec
## 0 0.00010000 0.3955616 0.0 0.9786232
## 0 0.05272632 0.4213768 0.0 1.0000000
## 0 0.10535263 0.4366848 0.0 1.0000000
## 0 0.15797895 0.4387681 0.0 1.0000000
## 0 0.21060526 0.4369565 0.0 1.0000000
## 0 0.26323158 0.4478261 0.0 1.0000000
## 0 0.31585789 0.4543478 0.0 1.0000000
## 0 0.36848421 0.4565217 0.0 1.0000000
## 0 0.42111053 0.4608696 0.0 1.0000000
## 0 0.47373684 0.4652174 0.0 1.0000000
## 0 0.52636316 0.4717391 0.0 1.0000000
## 0 0.57898947 0.4717391 0.0 1.0000000
## 0 0.63161579 0.4717391 0.0 1.0000000
## 0 0.68424211 0.4717391 0.0 1.0000000
## 0 0.73686842 0.4739130 0.0 1.0000000
## 0 0.78949474 0.4717391 0.0 1.0000000
## 0 0.84212105 0.4717391 0.0 1.0000000
## 0 0.89474737 0.4717391 0.0 1.0000000
## 0 0.94737368 0.4717391 0.0 1.0000000
## 0 1.00000000 0.4717391 0.0 1.0000000
## 1 0.00010000 0.3544384 0.1 0.9311594
## 1 0.05272632 0.5125906 0.0 1.0000000
## 1 0.10535263 0.5000000 0.0 1.0000000
## 1 0.15797895 0.5000000 0.0 1.0000000
## 1 0.21060526 0.5000000 0.0 1.0000000
## 1 0.26323158 0.5000000 0.0 1.0000000
## 1 0.31585789 0.5000000 0.0 1.0000000
## 1 0.36848421 0.5000000 0.0 1.0000000
## 1 0.42111053 0.5000000 0.0 1.0000000
## 1 0.47373684 0.5000000 0.0 1.0000000
## 1 0.52636316 0.5000000 0.0 1.0000000
## 1 0.57898947 0.5000000 0.0 1.0000000
## 1 0.63161579 0.5000000 0.0 1.0000000
## 1 0.68424211 0.5000000 0.0 1.0000000
## 1 0.73686842 0.5000000 0.0 1.0000000
## 1 0.78949474 0.5000000 0.0 1.0000000
## 1 0.84212105 0.5000000 0.0 1.0000000
## 1 0.89474737 0.5000000 0.0 1.0000000
## 1 0.94737368 0.5000000 0.0 1.0000000
## 1 1.00000000 0.5000000 0.0 1.0000000
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.05272632.
# Print maximum ROC statistic
max((model$results)$ROC)
## [1] 0.5125906
# Plot the model
plot(model)
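Because the underlying glmnet object stores the entire regularization path, you can also inspect the coefficients at any penalty value without refitting. A small sketch using the model fitted above:

# Coefficients of the final glmnet fit at the selected penalty (sketch)
coef(model$finalModel, s = model$bestTune$lambda)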
### Median and KNN imputation in caret

Real-world data often contain missing values. caret can impute them as part of preprocessing, via the preProcess argument of train(). Below we compare median imputation with KNN imputation on the BreastCancer data.
library(mlbench)
library(dplyr)
library(purrr)
data("BreastCancer")
str(BreastCancer)
## 'data.frame': 699 obs. of 11 variables:
## $ Id : chr "1000025" "1002945" "1015425" "1016277" ...
## $ Cl.thickness : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
## $ Cell.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
## $ Cell.shape : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
## $ Marg.adhesion : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
## $ Epith.c.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
## $ Bare.nuclei : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
## $ Bl.cromatin : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
## $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
## $ Mitoses : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
## $ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
set.seed(33)
X <- select(BreastCancer, c(-Class, -Id))
X <- map(X, as.character) %>% map(as.numeric) %>% as.data.frame() # we need numeric values so we can use the "impute" options of the "preProcess" argument
# Let's provide some "NA" values so we can play the "imputation" game
X[sample(1:nrow(X), 50), "Bare.nuclei"] <- NA
Y <- BreastCancer$Class # VERY IMPORTANT: y has to be a numeric or factor vector, not a data frame!
sum(is.na(X))
## [1] 64
sum(is.na(Y))
## [1] 0
myControl <- trainControl(
method = "cv", number = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE # Super important!
# verboseIter = TRUE
)
model_median <- train(x = X, y = Y,
method = "glm",
trControl = myControl,
preProcess = "medianImpute"
)
set.seed(33)
model_knn <- train(
x = X, y = Y,
method = "glm",
preProcess = "knnImpute",
trControl = myControl
)
### Comparing the two imputation approaches
# collect resamples
results <- resamples(list(median_impute = model_median, knn_impute = model_knn))
# summarize the distributions
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: median_impute, knn_impute
## Number of resamples: 10
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## median_impute 0.9873 0.9917 0.9937 0.9941 0.9975 1 0
## knn_impute 0.9792 0.9903 0.9955 0.9939 0.9998 1 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## median_impute 0.9556 0.9565 0.9783 0.9716 0.9783 1 0
## knn_impute 0.9348 0.9779 0.9783 0.9739 0.9783 1 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## median_impute 0.8333 0.9583 0.9583 0.946 0.9583 1 0
## knn_impute 0.8750 0.9167 0.9392 0.942 0.9583 1 0
# boxplots of results
bwplot(results)
# dot plots of results
dotplot(results)
dotplot(results, metric = "ROC")
It seems that the model which uses KNN imputation for the missing values performs slightly better!
set.seed(333)
# Fit glm with median imputation: model1
model1 <- train(
x = X, y = Y,
method = "glm",
trControl = myControl,
preProcess = "medianImpute"
)
# Print model1
model1
## Generalized Linear Model
##
## 699 samples
## 9 predictor
## 2 classes: 'benign', 'malignant'
##
## Pre-processing: median imputation (9)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 629, 630, 628, 629, 629, 629, ...
## Resampling results:
##
## ROC Sens Spec
## 0.9943732 0.9737681 0.9501667
##
##
# Fit glm with median imputation and standardization: model2
model2 <- train(
x = X, y = Y,
method = "glm",
trControl = myControl,
preProcess = c("medianImpute", "center", "scale")
)
# Print model2
model2
## Generalized Linear Model
##
## 699 samples
## 9 predictor
## 2 classes: 'benign', 'malignant'
##
## Pre-processing: median imputation (9), centered (9), scaled (9)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 629, 629, 630, 629, 629, 629, ...
## Resampling results:
##
## ROC Sens Spec
## 0.9945729 0.9758937 0.9418333
##
##
### Low-variance predictors in caret

Predictors with little or no variance carry almost no information, and caret contains a utility function called nearZeroVar() for removing such variables to save time during modeling.

nearZeroVar() takes in x, i.e. one predictor variable at a time, then looks at the ratio of the most common value to the second most common value, freqCut, and at the percentage of distinct values out of the number of total samples, uniqueCut. nearZeroVar() not only removes predictors that have one unique value across samples (zero variance predictors), but also removes predictors that have both:

- very few unique values relative to the number of samples, and
- a large ratio of the frequency of the most common value to the frequency of the second most common value.

By default, caret uses freqCut = 19 and uniqueCut = 10, which is fairly conservative. To be more aggressive when calling nearZeroVar(), recommended values would be freqCut = 2 and uniqueCut = 20.
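To make these two metrics concrete, here is a tiny sketch with a hypothetical toy predictor (96 zeroes and 4 ones):

# Toy illustration of freqRatio and percentUnique (sketch)
toy <- data.frame(x = c(rep(0, 96), rep(1, 4)))
# freqRatio: 96 / 4 = 24 (> 19); percentUnique: 2 unique values / 100 samples = 2% (< 10%)
# so x is flagged as a near-zero variance predictor
nearZeroVar(toy, saveMetrics = TRUE)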
### nearZeroVar() in caret: Example

Zero and near-zero variance predictors, also called constant and almost constant predictors across samples, are often present in a dataset. One frequent reason for this is breaking a categorical variable with many categories into several dummy variables: when one of the categories has zero observations, it becomes a dummy variable full of zeroes. To illustrate this, take a look at what happens when we want to apply Linear Discriminant Analysis (LDA) to the German Credit Data.
library(MASS)
data(GermanCredit)
model <- lda(Class ~ ., data = GermanCredit)
## Error in lda.default(x, grouping, ...): variables 26 44 appear to be constant within groups
If we take a closer look at the predictors flagged as problematic by lda, we can see what the problem is. Note that +1 is added to the index, since lda does not count the target variable when reporting where the problem is.
colnames(GermanCredit)[26 + 1]
## [1] "Purpose.Vacation"
table(GermanCredit[26 + 1])
##
## 0
## 1000
colnames(GermanCredit)[44 + 1]
## [1] "Personal.Female.Single"
table(GermanCredit[44 + 1])
##
## 0
## 1000
### Quick and dirty solution: throw data away

As we can see above, no loan was taken out to pay for a vacation, and there are no single females in our dataset. A natural first choice is to remove predictors like those, and this is exactly what the nearZeroVar function from the caret package does. It will not only remove predictors that have one unique value across samples (zero variance predictors), but also, as explained, predictors that have both 1) few unique values relative to the number of samples and 2) a large ratio of the frequency of the most common value to the frequency of the second most common value (near-zero variance predictors).

So let's identify the variables (predictors) which satisfy these conditions.
x = nearZeroVar(GermanCredit, saveMetrics = TRUE)
str(x)
## 'data.frame': 62 obs. of 4 variables:
## $ freqRatio : num 1.03 1 2.06 1.34 1.02 ...
## $ percentUnique: num 3.3 92.1 0.4 0.4 5.3 0.4 0.2 0.2 0.2 0.2 ...
## $ zeroVar : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ nzv : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
# Be aware that predictors are now used as row names for "x"!
head(row.names(x))
## [1] "Duration" "Amount"
## [3] "InstallmentRatePercentage" "ResidenceDuration"
## [5] "Age" "NumberExistingCredits"
head(x)
## freqRatio percentUnique zeroVar nzv
## Duration 1.027933 3.3 FALSE FALSE
## Amount 1.000000 92.1 FALSE FALSE
## InstallmentRatePercentage 2.060606 0.4 FALSE FALSE
## ResidenceDuration 1.340909 0.4 FALSE FALSE
## Age 1.020000 5.3 FALSE FALSE
## NumberExistingCredits 1.900901 0.4 FALSE FALSE
We can see above that if we call the nearZeroVar function with the argument saveMetrics = TRUE, we have access to the frequency ratios and the percentages of unique values for each predictor, as well as flags that indicate whether the variables are considered zero variance or near-zero variance predictors. By default, a predictor is classified as near-zero variance if the percentage of unique values in the samples is less than 10% and the frequency ratio mentioned above is greater than 19 (95/5). As already mentioned, these default values can be changed by setting the arguments uniqueCut and freqCut.
Let’s explore which ones are the zero variance predictors:
x[x[,"zeroVar"] == TRUE, ]
## freqRatio percentUnique zeroVar nzv
## Purpose.Vacation 0 0.1 TRUE TRUE
## Personal.Female.Single 0 0.1 TRUE TRUE
And which are the near zero variance predictors (these will also include the zero variance predictors):
x[x[, "nzv"] == TRUE, ]
## freqRatio percentUnique zeroVar nzv
## ForeignWorker 26.02703 0.2 FALSE TRUE
## CreditHistory.NoCredit.AllPaid 24.00000 0.2 FALSE TRUE
## CreditHistory.ThisBank.AllPaid 19.40816 0.2 FALSE TRUE
## Purpose.DomesticAppliance 82.33333 0.2 FALSE TRUE
## Purpose.Repairs 44.45455 0.2 FALSE TRUE
## Purpose.Vacation 0.00000 0.1 TRUE TRUE
## Purpose.Retraining 110.11111 0.2 FALSE TRUE
## Purpose.Other 82.33333 0.2 FALSE TRUE
## SavingsAccountBonds.gt.1000 19.83333 0.2 FALSE TRUE
## Personal.Female.Single 0.00000 0.1 TRUE TRUE
## OtherDebtorsGuarantors.CoApplicant 23.39024 0.2 FALSE TRUE
## OtherInstallmentPlans.Stores 20.27660 0.2 FALSE TRUE
## Job.UnemployedUnskilled 44.45455 0.2 FALSE TRUE
Let’s now remove these predictors, by hand, and by using a bit more aggressive values for freqCut
and uniqueCut
and conduct the LDA once again:
# Identify near zero variance predictors: remove_cols
remove_cols <- nearZeroVar(subset(GermanCredit, select = -Class),
names = TRUE, freqCut = 10, uniqueCut = 15) # Just take care not to drop out the Class variable
# Get all column names from GermanCredit: all_cols
all_cols <- names(GermanCredit)
# Remove from data: german_no_nzv
german_no_nzv <- GermanCredit[ , setdiff(all_cols, remove_cols)]
str(german_no_nzv)
## 'data.frame': 1000 obs. of 42 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ Class : Factor w/ 2 levels "Bad","Good": 2 1 2 2 1 2 2 2 2 1 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
model1 <- lda(Class ~ ., data = german_no_nzv)
model1
## Call:
## lda(Class ~ ., data = german_no_nzv)
##
## Prior probabilities of groups:
## Bad Good
## 0.3 0.7
##
## Group means:
## Duration Amount InstallmentRatePercentage ResidenceDuration
## Bad 24.86000 3938.127 3.096667 2.850000
## Good 19.20714 2985.457 2.920000 2.842857
## Age NumberExistingCredits NumberPeopleMaintenance Telephone
## Bad 33.96333 1.366667 1.153333 0.6233333
## Good 36.22429 1.424286 1.155714 0.5842857
## CheckingAccountStatus.lt.0 CheckingAccountStatus.0.to.200
## Bad 0.4500000 0.3500000
## Good 0.1985714 0.2342857
## CheckingAccountStatus.none CreditHistory.PaidDuly
## Bad 0.1533333 0.5633333
## Good 0.4971429 0.5157143
## CreditHistory.Critical Purpose.NewCar Purpose.UsedCar
## Bad 0.1666667 0.2966667 0.05666667
## Good 0.3471429 0.2071429 0.12285714
## Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.Business
## Bad 0.1933333 0.2066667 0.1133333
## Good 0.1757143 0.3114286 0.0900000
## SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## Bad 0.7233333 0.11333333
## Good 0.5514286 0.09857143
## SavingsAccountBonds.Unknown EmploymentDuration.lt.1
## Bad 0.1066667 0.2333333
## Good 0.2157143 0.1457143
## EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
## Bad 0.3466667 0.1300000
## Good 0.3357143 0.1928571
## EmploymentDuration.gt.7 Personal.Female.NotSingle
## Bad 0.2133333 0.3633333
## Good 0.2700000 0.2871429
## Personal.Male.Single Personal.Male.Married.Widowed
## Bad 0.4866667 0.08333333
## Good 0.5742857 0.09571429
## OtherDebtorsGuarantors.None Property.RealEstate Property.Insurance
## Bad 0.9066667 0.2000000 0.2366667
## Good 0.9071429 0.3171429 0.2300000
## Property.CarOther Property.Unknown OtherInstallmentPlans.Bank
## Bad 0.3400000 0.2233333 0.1900000
## Good 0.3285714 0.1242857 0.1171429
## OtherInstallmentPlans.None Housing.Rent Housing.Own Housing.ForFree
## Bad 0.7466667 0.2333333 0.6200000 0.14666667
## Good 0.8428571 0.1557143 0.7528571 0.09142857
## Job.UnskilledResident Job.SkilledEmployee
## Bad 0.1866667 0.6200000
## Good 0.2057143 0.6342857
## Job.Management.SelfEmp.HighlyQualified
## Bad 0.1700000
## Good 0.1385714
##
## Coefficients of linear discriminants:
## LD1
## Duration -2.389263e-02
## Amount -8.998711e-05
## InstallmentRatePercentage -2.331679e-01
## ResidenceDuration 1.431743e-04
## Age 8.438415e-03
## NumberExistingCredits -1.355913e-01
## NumberPeopleMaintenance -1.414723e-01
## Telephone -2.233048e-01
## CheckingAccountStatus.lt.0 -9.307483e-01
## CheckingAccountStatus.0.to.200 -4.730931e-01
## CheckingAccountStatus.none 4.367377e-01
## CreditHistory.PaidDuly 2.427631e-01
## CreditHistory.Critical 7.624888e-01
## Purpose.NewCar -2.326781e-01
## Purpose.UsedCar 9.178499e-01
## Purpose.Furniture.Equipment 3.429477e-01
## Purpose.Radio.Television 4.060980e-01
## Purpose.Business 2.878719e-01
## SavingsAccountBonds.lt.100 -5.312015e-01
## SavingsAccountBonds.100.to.500 -3.062926e-01
## SavingsAccountBonds.Unknown 9.909503e-02
## EmploymentDuration.lt.1 2.342099e-02
## EmploymentDuration.1.to.4 1.786391e-01
## EmploymentDuration.4.to.7 6.343731e-01
## EmploymentDuration.gt.7 2.692400e-01
## Personal.Female.NotSingle 2.360033e-01
## Personal.Male.Single 6.303865e-01
## Personal.Male.Married.Widowed 4.664764e-01
## OtherDebtorsGuarantors.None -3.371464e-01
## Property.RealEstate 2.750368e-01
## Property.Insurance 4.056157e-02
## Property.CarOther 1.615275e-02
## Property.Unknown -5.124192e-01
## OtherInstallmentPlans.Bank 1.218611e-02
## OtherInstallmentPlans.None 4.501033e-01
## Housing.Rent -3.241949e-01
## Housing.Own 7.505786e-02
## Housing.ForFree 3.359858e-01
## Job.UnskilledResident -2.615916e-01
## Job.SkilledEmployee -3.393272e-01
## Job.Management.SelfEmp.HighlyQualified -2.107179e-01
plot(model1)
# or by using the "train" function from "caret"
model2 <- train(Class ~., data = german_no_nzv, method = "lda")
model2
## Linear Discriminant Analysis
##
## 1000 samples
## 41 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7422605 0.3524467
##
##
# and finally using the "preProcess" argument inside the "train" function
model3 <- train(Class ~.,
data = GermanCredit,
method = "lda",
preProcess = "nzv"
)
model3
## Linear Discriminant Analysis
##
## 1000 samples
## 61 predictor
## 2 classes: 'Bad', 'Good'
##
## Pre-processing: remove (13)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7387158 0.3477714
##
##
### caret: using PCA as an alternative to nearZeroVar()

Instead of throwing low-variance predictors away, you can let PCA absorb them: the "pca" option in the preProcess argument will center and scale your data, combine low variance variables, and ensure that all of your predictors are orthogonal.
# Fit "lda" model using PCA: model
model4 <- train(Class ~.,
data = GermanCredit,
method = "lda",
preProcess = c("zv", "center", "scale", "pca")
)
# Print model to console
model4
## Linear Discriminant Analysis
##
## 1000 samples
## 61 predictor
## 2 classes: 'Bad', 'Good'
##
## Pre-processing: centered (59), scaled (59), principal component
## signal extraction (59), remove (2)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7515928 0.3702907
##
##
We’ll go through a real-world example:
trainControl
object# Summarize the target variables
library(caret)
library(C50)
data(churn)
str(churnTrain)
## 'data.frame': 3333 obs. of 20 variables:
## $ state : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
## $ account_length : int 128 107 137 84 75 118 121 147 117 141 ...
## $ area_code : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
## $ international_plan : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 1 2 ...
## $ voice_mail_plan : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 2 1 1 2 ...
## $ number_vmail_messages : int 25 26 0 0 0 0 24 0 0 37 ...
## $ total_day_minutes : num 265 162 243 299 167 ...
## $ total_day_calls : int 110 123 114 71 113 98 88 79 97 84 ...
## $ total_day_charge : num 45.1 27.5 41.4 50.9 28.3 ...
## $ total_eve_minutes : num 197.4 195.5 121.2 61.9 148.3 ...
## $ total_eve_calls : int 99 103 110 88 122 101 108 94 80 111 ...
## $ total_eve_charge : num 16.78 16.62 10.3 5.26 12.61 ...
## $ total_night_minutes : num 245 254 163 197 187 ...
## $ total_night_calls : int 91 103 104 89 121 118 118 96 90 97 ...
## $ total_night_charge : num 11.01 11.45 7.32 8.86 8.41 ...
## $ total_intl_minutes : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
## $ total_intl_calls : int 3 3 5 7 3 6 7 6 4 5 ...
## $ total_intl_charge : num 2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
## $ number_customer_service_calls: int 1 1 0 2 3 0 3 0 1 0 ...
## $ churn : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
prop.table(table(churnTrain$churn))
##
## yes no
## 0.1449145 0.8550855
# Create train/test indices
set.seed(33)
myFolds <- createFolds(churnTrain$churn, k = 5) #This creates folds which preserve the class distribution!
# Check whether the class distributions are preserved inside the folds
i <- myFolds$Fold1
prop.table(table(churnTrain$churn[i]))
##
## yes no
## 0.1454273 0.8545727
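The check above looks only at Fold1; a quick sketch using purrr (loaded earlier) checks all five folds at once:

# Class distribution in every fold (sketch)
purrr::map(myFolds, ~ prop.table(table(churnTrain$churn[.x])))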
# Use these folds to create a reusable trainControl object, so that we can exploit the exact same cross-validation folds for each model: myControl
myControl <- trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE, # IMPORTANT!
verboseIter = TRUE,
savePredictions = TRUE,
index = myFolds
)
The first model that we are going to fit to our data is the "good old" glmnet. Let's recap the basic features of this algorithm:

- Linear model with built-in variable selection
- Great baseline model
- Advantages:
  - Fits quickly
  - Ignores noisy variables
  - Provides interpretable coefficients
# Fit the model
set.seed(33)
model_glmnet <- train(
churn ~ .,
data = churnTrain,
metric = "ROC",
method = "glmnet",
tuneGrid = expand.grid(
alpha = 0:1,
lambda = 0:10/10
),
trControl = myControl
)
## + Fold1: alpha=0, lambda=1
## - Fold1: alpha=0, lambda=1
## + Fold1: alpha=1, lambda=1
## - Fold1: alpha=1, lambda=1
## + Fold2: alpha=0, lambda=1
## - Fold2: alpha=0, lambda=1
## + Fold2: alpha=1, lambda=1
## - Fold2: alpha=1, lambda=1
## + Fold3: alpha=0, lambda=1
## - Fold3: alpha=0, lambda=1
## + Fold3: alpha=1, lambda=1
## - Fold3: alpha=1, lambda=1
## + Fold4: alpha=0, lambda=1
## - Fold4: alpha=0, lambda=1
## + Fold4: alpha=1, lambda=1
## - Fold4: alpha=1, lambda=1
## + Fold5: alpha=0, lambda=1
## - Fold5: alpha=0, lambda=1
## + Fold5: alpha=1, lambda=1
## - Fold5: alpha=1, lambda=1
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 0, lambda = 0.2 on full training set
# Check the model
model_glmnet
## glmnet
##
## 3333 samples
## 19 predictor
## 2 classes: 'yes', 'no'
##
## No pre-processing
## Resampling: Bootstrapped (5 reps)
## Summary of sample sizes: 667, 667, 666, 667, 666
## Resampling results across tuning parameters:
##
## alpha lambda ROC Sens Spec
## 0 0.0 0.7447031 0.214815707 0.9580702
## 0 0.1 0.7620925 0.062115918 0.9933333
## 0 0.2 0.7630074 0.018115971 0.9987719
## 0 0.3 0.7628231 0.004138383 1.0000000
## 0 0.4 0.7624807 0.001034931 1.0000000
## 0 0.5 0.7621059 0.000000000 1.0000000
## 0 0.6 0.7617620 0.000000000 1.0000000
## 0 0.7 0.7614339 0.000000000 1.0000000
## 0 0.8 0.7611350 0.000000000 1.0000000
## 0 0.9 0.7608636 0.000000000 1.0000000
## 0 1.0 0.7607234 0.000000000 1.0000000
## 1 0.0 0.7008344 0.260359347 0.9450877
## 1 0.1 0.5586068 0.000000000 1.0000000
## 1 0.2 0.5000000 0.000000000 1.0000000
## 1 0.3 0.5000000 0.000000000 1.0000000
## 1 0.4 0.5000000 0.000000000 1.0000000
## 1 0.5 0.5000000 0.000000000 1.0000000
## 1 0.6 0.5000000 0.000000000 1.0000000
## 1 0.7 0.5000000 0.000000000 1.0000000
## 1 0.8 0.5000000 0.000000000 1.0000000
## 1 0.9 0.5000000 0.000000000 1.0000000
## 1 1.0 0.5000000 0.000000000 1.0000000
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0 and lambda = 0.2.
# Plot the results
plot(model_glmnet)
# Plot the coefficients
plot(model_glmnet$finalModel)
The next algorithm that we are going to use for modeling the data is the random forest. Here is a quick review of this frequently used machine learning algorithm:

- Slower to fit than glmnet
- Less interpretable
- Often (but not always) more accurate than glmnet
- Easier to tune
- Requires little preprocessing
- Captures threshold effects and variable interactions
set.seed(33)
model_rf <- train(
churn ~ ., churnTrain,
metric = "ROC",
method = "ranger",
trControl = myControl
)
## + Fold1: mtry= 2
## - Fold1: mtry= 2
## + Fold1: mtry=35
## - Fold1: mtry=35
## + Fold1: mtry=69
## - Fold1: mtry=69
## + Fold2: mtry= 2
## - Fold2: mtry= 2
## + Fold2: mtry=35
## - Fold2: mtry=35
## + Fold2: mtry=69
## - Fold2: mtry=69
## + Fold3: mtry= 2
## - Fold3: mtry= 2
## + Fold3: mtry=35
## - Fold3: mtry=35
## + Fold3: mtry=69
## - Fold3: mtry=69
## + Fold4: mtry= 2
## - Fold4: mtry= 2
## + Fold4: mtry=35
## - Fold4: mtry=35
## + Fold4: mtry=69
## - Fold4: mtry=69
## + Fold5: mtry= 2
## - Fold5: mtry= 2
## + Fold5: mtry=35
## - Fold5: mtry=35
## + Fold5: mtry=69
## - Fold5: mtry=69
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 35 on full training set
model_rf
## Random Forest
##
## 3333 samples
## 19 predictor
## 2 classes: 'yes', 'no'
##
## No pre-processing
## Resampling: Bootstrapped (5 reps)
## Summary of sample sizes: 667, 667, 666, 667, 666
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.8675389 0.03725616 0.9995614
## 35 0.9019706 0.62472721 0.9775439
## 69 0.8988657 0.64645674 0.9728947
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 35.
plot(model_rf)
### Comparing models

After training multiple models on the exact same cross-validation folds, the resamples() function is your friend.

# Make a list containing models to be compared
model_list <- list(
glmnet = model_glmnet,
rf = model_rf
)
# Collect resamples from the CV folds
resamps <- resamples(model_list)
resamps
##
## Call:
## resamples.default(x = model_list)
##
## Models: glmnet, rf
## Number of resamples: 5
## Performance metrics: ROC, Sens, Spec
## Time estimates for: everything, final model fit
# Summarize the results
summary(resamps)
##
## Call:
## summary.resamples(object = resamps)
##
## Models: glmnet, rf
## Number of resamples: 5
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glmnet 0.7264 0.7617 0.7679 0.763 0.7759 0.7832 0
## rf 0.8958 0.8961 0.8995 0.902 0.9062 0.9121 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glmnet 0.01034 0.01036 0.01813 0.01812 0.02584 0.02591 0
## rf 0.58290 0.59430 0.61660 0.62470 0.64770 0.68220 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glmnet 0.9974 0.9982 0.9991 0.9988 0.9991 1.0000 0
## rf 0.9697 0.9715 0.9724 0.9775 0.9829 0.9912 0
# Box-and-whisker plot
bwplot(resamps)
# Or if we are interested only in ROC AUC
bwplot(resamps, metric = "ROC")
# Dot plot
dotplot(resamps, metric = "ROC")
# Density plot
densityplot(resamps, metric = "ROC") # A usefull way to look for outlier folds with unusually high or low AUC
# Scatter plot
xyplot(resamps, metric = "ROC") # in this case showing that for all cv folds rf provides better performance in terms of ROC AUC