My final post covers the caretEnsemble package. caretEnsemble lets the user train multiple models in a single call with the caretList function; the main drawback is the computing time this can take. The fitted models can then be combined with the caretStack function to make better predictions.
The dataset used in this example is the Orange Juice (OJ) data from the ISLR package, which was imputed and transformed beforehand. The post is short but powerful because of the insight gained from fitting multiple models with little trouble.
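Since the imputation and transformation happened in an earlier post, here is a minimal sketch of how a training set like trainData might be prepared. The split proportion and the absence of preprocessing steps are assumptions, not the exact code used:

library(caret)
library(ISLR)

# Assumed preparation: split the OJ data into training and test sets.
# The original imputation/transformation steps are not reproduced here.
data(OJ)
set.seed(143)
trainIndex = createDataPartition(OJ$Purchase, p = 0.8, list = FALSE)
trainData = OJ[trainIndex, ]
testData  = OJ[-trainIndex, ]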
# Box plots of each predictor, split by the response
featurePlot(x = trainData[, 1:18],
            y = trainData$Purchase,
            plot = "box",
            strip = strip.custom(par.strip.text = list(cex = .7)),
            scales = list(x = list(relation = "free"),
                          y = list(relation = "free")))

# Density plots of each predictor, split by the response
featurePlot(x = trainData[, 1:18],
            y = trainData$Purchase,
            plot = "density",
            strip = strip.custom(par.strip.text = list(cex = .7)),
            scales = list(x = list(relation = "free"),
                          y = list(relation = "free")))
I start by setting up a parallel backend to decrease the time it takes to train multiple models. I also create a train control using repeated cross-validation (repeatedcv). The models being fitted are Random Forest, AdaBoost, earth, xgbDART, and svmRadial. The caretList function works much like the train function in the caret package. In this example, we are trying to predict Purchase.
library("parallel")
library("doParallel")
library("caretEnsemble")
Mycluster = makeCluster(detectCores()-1)
registerDoParallel(Mycluster)
# Stacking Algorithms - Run multiple algos in one call.
trainControl = trainControl(method="repeatedcv",
number=5,
# repeats=3,
savePredictions=TRUE,
classProbs=TRUE, allowParallel = T)
algorithmList = c('rf', 'adaboost', 'earth', 'xgbDART', 'svmRadial')
set.seed(143)
models = caretList(Purchase ~ ., data=trainData, trControl=trainControl, methodList=algorithmList)
results = resamples(models)
stopCluster(Mycluster)
registerDoSEQ()
rm(Mycluster)
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: rf, adaboost, earth, xgbDART, svmRadial
## Number of resamples: 5
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rf 0.7791 0.7836 0.7965 0.8028 0.8070 0.8480 0
## adaboost 0.7368 0.7733 0.7907 0.7970 0.8187 0.8655 0
## earth 0.7791 0.7953 0.8070 0.8087 0.8081 0.8538 0
## xgbDART 0.7953 0.8081 0.8363 0.8320 0.8547 0.8655 0
## svmRadial 0.8012 0.8081 0.8256 0.8297 0.8421 0.8713 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rf 0.5380 0.5496 0.5710 0.5850 0.6004 0.6661 0
## adaboost 0.4551 0.5245 0.5599 0.5727 0.6186 0.7055 0
## earth 0.5304 0.5694 0.5896 0.5941 0.5977 0.6835 0
## xgbDART 0.5784 0.5955 0.6600 0.6479 0.6968 0.7089 0
## svmRadial 0.5761 0.5933 0.6252 0.6356 0.6642 0.7191 0
# Box plots to compare models
scales = list(x=list(relation="free"), y=list(relation="free"))
bwplot(results, scales=scales)
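Stacking tends to help most when the base models are not highly correlated with one another. Before building the ensemble, caretEnsemble's modelCor function can be used to check the pairwise correlations of the resampled accuracies:

# Correlation between the base models' resample results
modelCor(results)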
# Create the trainControl for the stacking meta-model
set.seed(143)
stackControl = trainControl(method = "repeatedcv",
                            number = 10,
                            repeats = 3,
                            savePredictions = TRUE,
                            classProbs = TRUE)
# Ensemble the predictions of `models` to form a new combined prediction based on glm
stack.glm = caretStack(models, method="glm", metric="Accuracy", trControl=stackControl)
Stacking the five models gives an accuracy of about 82%.
print(stack.glm)
## A glm ensemble of 2 base models: rf, adaboost, earth, xgbDART, svmRadial
##
## Ensemble results:
## Generalized Linear Model
##
## 857 samples
## 5 predictor
## 2 classes: 'CH', 'MM'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 772, 772, 772, 770, 771, 771, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8230177 0.6246467
#stack_predicteds <- predict(stack.glm, newdata=testData4)
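As a final step, the stacked model can be used to score new data, as hinted at in the commented-out line above. A sketch of that evaluation, assuming testData4 is a held-out test set prepared the same way as trainData:

# Predicted classes from the stacked ensemble on the test set
stack_predicted = predict(stack.glm, newdata = testData4)
confusionMatrix(stack_predicted, testData4$Purchase)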