Using JHU’s Coursera Data Science Machine Learning Course techniques:

Ensembling (combining) models can increase prediction accuracy at the cost of interpretability.

Using the Wage data (from the ISLR package) and removing the logwage variable, I’ll make several predictions, combine models, and estimate out-of-sample error.

Data Gathering, Cleaning, and Splitting

library(ISLR)
library(ggplot2)
library(caret)
## Loading required package: lattice
data(Wage)

#Data splitting 

Wg <- subset(Wage, select = -c(logwage, sex, region)) #remove logwage plus sex and region, which cause rank deficiency (see Model Building)

set.seed(4)

nonvalidation <- createDataPartition(y = Wg$wage, p = .7, list = FALSE)

validationwage <- Wg[-nonvalidation,] # 30% of original data 
WgTrainTest <- Wg[nonvalidation,] # other 70% of data 

trainrows <- createDataPartition(WgTrainTest$wage, p = .7, list = FALSE) 

trainwage <- WgTrainTest[trainrows,] # 49% of original data 
testwage <- WgTrainTest[-trainrows,] # 21% of original data 


rm(nonvalidation,WgTrainTest,trainrows,Wg,Wage) #declutter Environment 
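
A quick sanity check on the split sizes (my addition; the counts can differ slightly from the exact percentages because createDataPartition balances the partitions on wage):

prop <- c(train = nrow(trainwage), test = nrow(testwage),
          validation = nrow(validationwage))
round(prop / sum(prop), 2) #should come out near .49/.21/.30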

Model Building

#Model 1 is a simple GLM
#(sex and region were removed above because they made the model rank deficient)
mdl1 <- train(wage ~ ., method = "glm", data = trainwage)

#Model 2 is a random forest model with 3-fold cross validation 
mdl2 <- train(wage ~ ., method = "rf", data = trainwage, 
              trControl = trainControl(method = "cv", number = 3)) 
              #repeatedcv was computationally taxing for my laptop 
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
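
For reference, the repeated cross-validation mentioned in the comment only changes the trainControl call; the fold and repeat counts below are illustrative, not settings I actually ran:

#illustrative only: 3 folds repeated 5 times is roughly 5x the work of plain cv
trainControl(method = "repeatedcv", number = 3, repeats = 5)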

With these two models, we can plot where their predictions on the test data agree and where they differ.

predglm <- predict(mdl1, newdata = testwage) #beware, predict can give the wrong output length
predrf <- predict(mdl2, newdata = testwage) #ensure you use the "newdata" argument 

plotdf <- data.frame(predglm, predrf, wage = testwage$wage) #keep everything ggplot needs in one frame
g <- ggplot(plotdf, aes(x = predglm, y = predrf, color = wage))
gg <- g + geom_point() + labs(title = "Prediction Comparison", x = "Generalized Linear Model",
                              y = "Random Forest")
ggg <- gg + geom_abline(slope = 1) + xlim(50, 175) + ylim(50, 175)
ggg

There is a clear difference between the models: past 100, the GLM consistently predicts higher values than the random forest. Both models understate the very highest wages (the lightest blues).
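
One quick way to put a number on the agreement visible in the plot (this check is my addition, not part of the original analysis):

cor(predglm, predrf) #linear agreement between the two sets of predictions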

Combining the Models

Creating a new data frame from the two models’ test-set predictions, then training a new model on those predictions to build the ensemble.

dfrmpredict <- data.frame(predglm,predrf,wage = testwage$wage)
combinedfit <- train(wage~., method = "gam", data = dfrmpredict) #generalized additive model 
## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.8-17. For overview type 'help("mgcv-package")'.
#note here, wage is predicted from the OTHER TWO MODELS' predictions instead of the original features 

combinedpredictions <- predict(combinedfit, newdata = dfrmpredict)
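
To confirm that the stacked model really is built on the two prediction columns rather than the original features, you can inspect the underlying GAM fit (a quick check of my own; the exact output depends on your caret/mgcv versions):

summary(combinedfit$finalModel) #the model terms should involve only predglm and predrf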

So let’s compare the models’ test error.

errtest <- function(md, tst = testwage$wage){
        sse <- sum((md - tst)^2) #sum of squared errors
        errtest <- sqrt(sse)     #root of the total SSE
        return(errtest)
}
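
Note that errtest() returns the square root of the total SSE, so its value grows with the number of rows; that is fine for comparing models on the same split, but it is not the RMSE. A per-observation version, if you prefer a scale-free metric, would look like this (rmsetest is my name, not used elsewhere in this post):

rmsetest <- function(md, tst = testwage$wage){
        sqrt(mean((md - tst)^2)) #average the squared errors before taking the root
}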

modelist <- list(predglm,predrf,combinedpredictions)


errortbl <- as.data.frame(lapply(modelist, errtest))
colnames(errortbl) <- c("GLM","RF","Combined")

errortbl
##        GLM       RF Combined
## 1 784.7474 800.7634 772.0725

On the test set, the GLM performed better than the random forest, and the combined GAM model showed a noticeable improvement over both.
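
As an aside, caret’s postResample() reports the conventional RMSE and R-squared directly, if you prefer those metrics (shown here for the combined predictions; output omitted):

postResample(pred = combinedpredictions, obs = testwage$wage)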

Checking Results on the Validation Set

# models on validation set (30% of original data)
predValglm <- predict(mdl1, newdata = validationwage) #use glm on validation wages
predValrf <- predict(mdl2, newdata = validationwage) #use rf on validation wages

dfrmVal <- data.frame(predglm = predValglm, predrf = predValrf) 
#create a data frame of validation predictions
#NOTE: names are replaced for predict() to use combinedfit model 

#if you do not match the names in dfrmVal to the names used to fit combinedfit 
#then predict() will fail with a complaint about the number of rows 

combinedValpredict <- predict(combinedfit,dfrmVal) #use the combined model
modelValist <- list(predValglm,predValrf,combinedValpredict)
#reuse errtest() from above, pointing its tst argument at the validation wages
errorValtbl <- as.data.frame(lapply(modelValist, errtest, tst = validationwage$wage))

colnames(errorValtbl) <- c("GLM","RF","Combined")

errorValtbl
##        GLM       RF Combined
## 1 1003.497 1017.333 1003.958

Surprisingly, the combined model performed marginally worse than the GLM by itself on the validation set. In hindsight this is plausible: the ensemble was trained on the test-set predictions, so part of its apparent advantage there was likely overfitting to that particular split.

Anyway, this is how you build, combine, and validate ensembled models.