Using JHU’s Coursera Data Science Machine Learning Course techniques:
Ensembling (combining) models can increase prediction accuracy at the cost of reduced interpretability.
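As a minimal illustration of the idea (a toy sketch with made-up numbers, not part of the analysis below), the simplest possible ensemble just averages two models' predictions:
pred_a <- c(95, 110, 130)   # hypothetical predictions from model A
pred_b <- c(100, 105, 120)  # hypothetical predictions from model B
(pred_a + pred_b) / 2       # the averaged (ensembled) prediction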
Using the Wage data (from the ISLR package) and removing the logwage variable, I’ll make several predictions, combine models, and estimate out-of-sample error.
library(ISLR)
library(ggplot2)
library(caret)
## Loading required package: lattice
data(Wage)
#Data splitting
Wg <- subset(Wage, select = -c(logwage,sex,region)) #remove logwage; sex and region are dropped because they cause rank deficiency in the GLM
set.seed(4)
nonvalidation <- createDataPartition(y = Wg$wage, p = .7, list = FALSE)
validationwage <- Wg[-nonvalidation,] # 30% of original data
WgTrainTest <- Wg[nonvalidation,] # other 70% of data
trainrows <- createDataPartition(WgTrainTest$wage, p = .7, list = FALSE)
trainwage <- WgTrainTest[trainrows,] # 49% of original data
testwage <- WgTrainTest[-trainrows,] # 21% of original data
rm(nonvalidation,WgTrainTest,trainrows,Wg,Wage) #declutter Environment
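As a quick sanity check (an addition to the original code), the sizes of the three splits can be confirmed directly:
c(train = nrow(trainwage), test = nrow(testwage), validation = nrow(validationwage))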
#Model 1: a simple GLM (sex and region were removed above because they caused rank deficiency)
mdl1 <- train(wage ~ ., method="glm", data = trainwage)
#Model 2: a random forest with 3-fold cross-validation
mdl2 <- train(wage ~ ., method = "rf", data = trainwage,
              trControl = trainControl(method = "cv", number = 3))
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
#repeatedcv was computationally taxing for my laptop
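For reference only, and as a sketch that was not run in this analysis, the repeated cross-validation control would look something like:
ctrl_repcv <- trainControl(method = "repeatedcv", number = 3, repeats = 5) # defined for reference; not used in the models above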
With these two models, we can plot where their predictions on the test data agree and where they differ.
predglm <- predict(mdl1, newdata = testwage) #beware, predict can give wrong output length
predrf <- predict(mdl2, newdata = testwage) #ensure you use "newdata" argument
g <- ggplot(data = testwage, aes(x = predglm, y = predrf, color = wage))
# note: predglm and predrf are picked up from the environment, not from columns of testwage
gg <- g + geom_point() + labs(title = "Prediction Comparison", x = "General Linear Model",
                              y = "Random Forest")
ggg <- gg + geom_abline(slope = 1) + xlim(50, 175) + ylim(50, 175)
ggg
There is a clear difference between the models: past a wage of about 100, the GLM consistently predicts higher values than the random forest. Both models understate the very high wages (the lightest blues).
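As a quick numeric complement to the plot (an added check, not in the original), the overall agreement between the two prediction vectors can be measured with a correlation:
cor(predglm, predrf) # linear agreement between the GLM and random forest test-set predictions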
Next, I create a new data frame from those test-set predictions (plus the true wages) and train a new model on it, producing the combined (ensembled) model.
dfrmpredict <- data.frame(predglm,predrf,wage = testwage$wage)
combinedfit <- train(wage~., method = "gam", data = dfrmpredict) #generalized additive model
## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.8-17. For overview type 'help("mgcv-package")'.
#note here: wage is predicted from the OTHER TWO MODELS' predictions instead of the original features
combinedpredictions <- predict(combinedfit, newdata = dfrmpredict)
So let’s compare the models’ test-set error.
errtest <- function(md, tst = testwage$wage){
  # error metric: square root of the sum of squared errors vs. the test-set wages
  sse <- sum((md - tst)^2)
  errtest <- sqrt(sse)
  return(errtest)
}
modelist <- list(predglm,predrf,combinedpredictions)
errortbl <- as.data.frame(lapply(modelist, errtest))
colnames(errortbl) <- c("GLM","RF","Combined")
errortbl
## GLM RF Combined
## 1 784.7474 800.7634 772.0725
While the GLM model performed better than the random forest, the combined GAM model showed a noticeable improvement over both. Keep in mind, though, that the combined model was fit and evaluated on the same test-set predictions, so its edge here may be optimistic.
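One small aside (an addition): errtest() returns the square root of the summed squared errors, so dividing by sqrt(n) converts the table to the more familiar per-observation RMSE without changing the ranking:
errortbl / sqrt(nrow(testwage)) # RMSE for each model on the test set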
# models on validation set (30% of original data)
predValglm <- predict(mdl1, newdata = validationwage) #use glm on validation wages
predValrf <- predict(mdl2, newdata = validationwage) #use rf on validation wages
dfrmVal <- data.frame(predglm = predValglm, predrf = predValrf)
#create a data frame of validation predictions
#NOTE: the column names must match the names used to build combinedfit (predglm, predrf)
#if the names in dfrmVal do not match, predict() stops and returns a warning about the
#number of rows
combinedValpredict <- predict(combinedfit, newdata = dfrmVal) #use the combined model
modelValist <- list(predValglm,predValrf,combinedValpredict)
errVal <- function(md, tst = validationwage$wage){
  # same metric as errtest(), but compared against the validation wages
  sse <- sum((md - tst)^2)
  errVal <- sqrt(sse)
  return(errVal)
}
errorValtbl <- as.data.frame(lapply(modelValist, errVal))
colnames(errorValtbl) <- c("GLM","RF","Combined")
errorValtbl
## GLM RF Combined
## 1 1003.497 1017.333 1003.958
Surprisingly, the combined model performed marginally worse than the GLM by itself on the validation set. In hindsight this is less shocking: the combiner was trained on the test-set predictions, so its apparent advantage there likely reflected some overfitting that did not carry over to new data.
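As an added aside (not part of the original analysis), a plain average of the two base models' validation predictions gives one more point of comparison, and it never requires fitting a combiner on the test set:
predValavg <- (predValglm + predValrf) / 2  # simple averaging ensemble
errVal(predValavg)                          # compare against the validation error table above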
Anyway, this is how you make and use ensembled models.