require(ISLR)
## Loading required package: ISLR
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
require(ggplot2)
data(Wage)
Wage<-subset(Wage,select=-c(logwage))
set.seed(222)
inBuild<-createDataPartition(y=Wage$wage,p=.7,list=FALSE)
buildData<-Wage[inBuild,]
validation<-Wage[-inBuild,]
inTrain<-createDataPartition(y=buildData$wage,p=.7,list=FALSE)
training<-buildData[inTrain,]
testing<-buildData[-inTrain,]

Notice that in this data partition step we’ve partitioned the data into 3 groups instead of 2. This matters because we’ll use the training set to build the individual models, the testing set to fit the combined model, and the validation set to evaluate the final result on data that no model has seen.
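
As a quick sanity check on the split (my addition, using only the objects created above), training and testing should together make up buildData, while validation stays completely untouched until the end:

dim(training); dim(testing); dim(validation)
# training and testing partition buildData; validation never enters model building
stopifnot(nrow(training) + nrow(testing) == nrow(buildData))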

mod1<-train(wage~.,method="glm",data=training)
mod2<-train(wage~.,method="rf",
            data=training,
            trControl=trainControl(method="cv",number=3))
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin

We’ve created 2 models – mod1 is a generalized linear model, mod2 is a random forest model.
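
If you want a closer look at what caret fitted (optional, nothing below depends on it), each train object stores its resampling results and the underlying model:

mod1$results      # resampled performance of the glm
mod2$results      # random forest performance for each mtry value tried
mod2$finalModel   # the underlying randomForest object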

pred1<-predict(mod1,testing)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
pred2<-predict(mod2,testing)
qplot(pred1,pred2,col=wage,data=testing)

We use each model to predict wage on the testing set and plot the two sets of predictions against each other, coloured by the true wage. If the two models made identical predictions, every point would lie exactly on the diagonal; if they were also fully accurate, we would see a smooth gradient of low to high wage running from the bottom-left to the upper-right.
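
A simple numeric companion to the plot (my addition, not in the original analysis) is the correlation between the two sets of predictions; the less correlated they are, the more we can hope to gain by combining them:

cor(pred1,pred2)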

predDF<-data.frame(pred1,pred2,wage=testing$wage)
combModFit<-train(wage~.,method="gam",data=predDF)
## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.8-17. For overview type 'help("mgcv-package")'.
combPred<-predict(combModFit,predDF)
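
For reference (my addition), predDF is just a small data frame with the two sets of predictions and the true wage, and it is all the combined model ever sees:

head(predDF)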

The above code combines the two models by fitting a generalized additive model (via caret) that uses the two sets of predictions as its only inputs. Let’s check all three models and look at their errors on the testing set:

sqrt(sum((pred1-testing$wage)^2))
## [1] 1280.322
sqrt(sum((pred2-testing$wage)^2))
## [1] 825.1333
sqrt(sum((combPred-testing$wage)^2))
## [1] 641.7012

We can see that, of the first two models, the random forest was the better one, with a lower total error. However, when we combine the models we get a noticeably lower total error than even the random forest.
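
Since the same error measure is computed repeatedly in this document, a small helper keeps things tidy; err is a name I’m introducing here, not something from the lecture:

err <- function(pred, obs) sqrt(sum((pred - obs)^2))  # square root of the sum of squared errors
err(combPred, testing$wage)                           # identical to the value computed above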

pred1V<-predict(mod1,validation)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
pred2V<-predict(mod2,validation)
predVDF<-data.frame(pred1=pred1V,pred2=pred2V,wage=validation$wage)
combPredV<-predict(combModFit,predVDF)

Finally, we use the validation data set to test our models. Let’s check the same error measure (the square root of the sum of squared errors):

sqrt(sum((pred1V-validation$wage)^2))
## [1] 1034.019
sqrt(sum((pred2V-validation$wage)^2))
## [1] 1060.571
sqrt(sum((combPredV-validation$wage)^2))
## [1] 1193.752

Now we see that in this particular case the combined model is actually less accurate on the validation data set than either the random forest or the glm. This is probably a sign of over-fitting to the testing set, which is the data the combined model was trained on, but the idea behind stacking is the same.
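
For comparison (not part of the original analysis), an unweighted average of the two validation predictions is a blend with no parameters to estimate, so it cannot over-fit the testing set:

avgPredV <- (pred1V + pred2V)/2
sqrt(sum((avgPredV - validation$wage)^2))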

N.B. - In the lecture, this outcome with the combined model having greater error than the individual models did not occur. Either the seed chosen by the lecturer (not specified) led to a very different result than the 3 seeds I tried, or the data set has been updated/changed since the lecture series was published (2014).
