Cross-validation is a technique that can be used to choose the model with the best predictive power. It is considered a re-sampling technique because it relies on splitting the dataset into training and testing sets.
In the validation set approach we split our data once into a training set and a testing set. We will look at an example using the Auto dataset.
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ISLR)
# We are going to use the Auto dataset
ggplot(data=Auto, aes(x=horsepower, y=mpg))+
geom_point()
# observe the non-linearity
Since we are randomly splitting the set, we will need to set a seed so that we have reproducible results.
# since we are using sample, which has randomness
# we should set a seed to get reproducible results
set.seed(1)
# gives a 50-50 split of the indexes
train<-sample(392, 196)
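As a side note, the call above hard-codes the sample size (392 rows, 196 in the training set). Below is a sketch of an equivalent way to write the split without hard-coding; with the same seed it produces the same training indices.
# equivalent split without hard-coding the number of rows
set.seed(1)
n<-nrow(Auto)          # 392 observations
train<-sample(n, n/2)  # indices for a 50-50 split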
Now we fit the model on the training set and test how well it does on the “new” testing set.
# fit the model on the training set
lm.fit<-lm(mpg~horsepower, data=Auto, subset=train)
attach(Auto)
## The following object is masked from package:ggplot2:
##
## mpg
# predict for the whole dataset
# then keep only the residuals for observations that are not in the training set
mean((mpg-predict(lm.fit, Auto))[-train]^2)
## [1] 26.14142
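If the one-liner above feels terse, here is a sketch of the same test MSE computed step by step with the same fitted model; it should reproduce the value above.
# the same test MSE, spelled out step by step
test<-Auto[-train, ]                  # held-out observations
pred<-predict(lm.fit, newdata=test)   # predictions for the test set
mean((test$mpg-pred)^2)               # test mean squared error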
But recall that there was some curvature, so let's include polynomial terms and assess the error on the “new” test data.
# now consider different amounts of flexibility
# 2nd order polynomial
lm.fit2<-lm(mpg~poly(horsepower, 2), data=Auto, subset=train)
mean((mpg-predict(lm.fit2, Auto))[-train]^2)
## [1] 19.82259
# 3rd order polynomial
lm.fit3<-lm(mpg~poly(horsepower, 3), data=Auto, subset=train)
mean((mpg-predict(lm.fit3, Auto))[-train]^2)
## [1] 19.78252
We can actually use a loop to test many orders:
# we can put this code in a loop to test different amounts of flexibility
degreePoly<-10
polyMSE<-matrix(nrow=degreePoly, ncol=2)
colnames(polyMSE)<-c("degree", "MSE")
for(i in 1:degreePoly){
  polyMSE[i,1]<-i
  this.fit<-lm(mpg~poly(horsepower,i), data=Auto, subset=train)
  polyMSE[i,2]<-mean((mpg-predict(this.fit, Auto))[-train]^2)
}
polyDF<-as.data.frame(polyMSE)
head(polyDF)
## degree MSE
## 1 1 26.14142
## 2 2 19.82259
## 3 3 19.78252
## 4 4 19.99969
## 5 5 20.18225
## 6 6 20.82300
ggplot(data=polyDF, aes(x=degree, y=MSE))+
geom_point()+
geom_line()
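If we plan to reuse this loop, it can be wrapped in a small helper function. The sketch below introduces a hypothetical helper, polyTestMSE (not part of the original code), that assumes the same mpg and horsepower columns.
# hypothetical helper wrapping the loop above
polyTestMSE<-function(data, train, maxDegree=10){
  mse<-sapply(1:maxDegree, function(d){
    fit<-lm(mpg~poly(horsepower, d), data=data, subset=train)
    mean((data$mpg-predict(fit, data))[-train]^2)
  })
  data.frame(degree=1:maxDegree, MSE=mse)
}
head(polyTestMSE(Auto, train))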
However, remember that this is a random split. This means that these errors can change depending on which observations are included in the split.
# however, we should notice that this is dependent on the random split
# try a different random split
set.seed(2)
train<-sample(392, 196)
# fit models again
# 1st order polynomial
lm.fit<-lm(mpg~horsepower, data=Auto, subset=train)
mean((mpg-predict(lm.fit, Auto))[-train]^2)
## [1] 23.29559
# 2nd order polynomial
lm.fit2<-lm(mpg~poly(horsepower, 2), data=Auto, subset=train)
mean((mpg-predict(lm.fit2, Auto))[-train]^2)
## [1] 18.90124
# 3rd order polynomial
lm.fit3<-lm(mpg~poly(horsepower, 3), data=Auto, subset=train)
mean((mpg-predict(lm.fit3, Auto))[-train]^2)
## [1] 19.2574
We might be interested in how much the MSE for the testing set depends on which split we happen to use.
### Let's create many random splits and compare their results
# we can put this code in a double loop to test different amounts of flexibility
degreePoly<-10
splits<-10
splitMat<-matrix(nrow=degreePoly*splits, ncol=3)
colnames(splitMat)<-c("run","MSE", "degree")
for(i in 1:splits){
  a<-(i-1)*degreePoly+1
  b<-i*degreePoly
  set.seed(i*3)
  splitMat[a:b,1]<-i
  train<-sample(392, 196)
  for(j in 1:degreePoly){
    c<-a+(j-1)
    this.fit<-lm(mpg~poly(horsepower,j), data=Auto, subset=train)
    splitMat[c,2]<-mean((mpg-predict(this.fit, Auto))[-train]^2)
    splitMat[c,3]<-j
  }
}
splitDF<-as.data.frame(splitMat)
head(splitDF)
## run MSE degree
## 1 1 26.29134 1
## 2 1 21.50410 2
## 3 1 21.50825 3
## 4 1 21.43146 4
## 5 1 20.88538 5
## 6 1 20.65816 6
ggplot(data=splitDF, aes(x=degree, y=MSE, color=as.factor(run)))+
geom_point()+
geom_line()
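To summarize how much the curves move around, we can also average the MSE (and look at its spread) across the ten splits for each degree; a quick dplyr sketch is below.
# average and spread of the test MSE across the splits, by degree
splitDF %>%
  group_by(degree) %>%
  summarize(meanMSE=mean(MSE), sdMSE=sd(MSE))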
In leave-one-out cross-validation (LOOCV), we leave one observation out and fit a model on all the other observations. We do this n times, so this can be very computationally expensive! Instead of coding this from scratch, we will use built-in functions:
## LOOCV
# use the glm function instead of lm
glm.fit<-glm(mpg~horsepower, data=Auto)
coef(glm.fit)
## (Intercept) horsepower
## 39.9358610 -0.1578447
# compare this to the lm output
lm.fit<-lm(mpg~horsepower, data=Auto)
coef(lm.fit)
## (Intercept) horsepower
## 39.9358610 -0.1578447
# same!
# we will use the cross validation function in the boot package
#install.packages("boot")
library(boot)
glm.fit<-glm(mpg~horsepower, data=Auto)
cv.err<-cv.glm(Auto, glm.fit)
# cross validation results
# for LOOCV the two values agree up to two decimal places
# the second value is a bias-corrected estimate
cv.err$delta
## [1] 24.23151 24.23114
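To make the "n model fits" idea concrete, here is a from-scratch sketch of what LOOCV is doing for this model; it is slow, but it should match the first delta value above.
# manual LOOCV: fit n models, each time leaving out one observation
n<-nrow(Auto)
errs<-rep(0, n)
for(i in 1:n){
  fit.i<-lm(mpg~horsepower, data=Auto[-i, ])
  errs[i]<-(Auto$mpg[i]-predict(fit.i, Auto[i, ]))^2
}
mean(errs)  # should match the first value of cv.err$delta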
# repeat procedure of increasingly complex polynomial fits
# using a for loop
cv.error<-rep(0, 10)
for(i in 1:10){
  glm.fit<-glm(mpg~poly(horsepower, i), data=Auto)
  cv.error[i]<-cv.glm(Auto, glm.fit)$delta[1]
}
cvDF<-data.frame(degree=1:10, cv.error)
ggplot(data=cvDF, aes(x=degree, y=cv.error))+
geom_point()+
geom_line()
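For least-squares fits there is also a well-known shortcut that gives the LOOCV error from a single model fit, using the leverage (hat) values. A sketch for the degree-1 model is below; it should match the first entry of cv.error.
# LOOCV shortcut for least squares: mean((residual/(1 - leverage))^2)
fit1<-lm(mpg~horsepower, data=Auto)
mean((residuals(fit1)/(1-hatvalues(fit1)))^2)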
One of the downsides of LOOCV is that the n fitted models are highly correlated with one another. We can improve on this with k-fold cross-validation, in which we split the data into k folds, hold each fold out once as a test set, and fit the model on the remaining folds.
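Before handing this off to cv.glm with K=10, here is a minimal hand-rolled sketch of k-fold CV for the degree-1 model, just to show how the folds are assigned and used; cv.glm does all of this internally.
# hand-rolled 10-fold CV for the degree-1 model
k<-10
folds<-sample(rep(1:k, length.out=nrow(Auto)))  # random fold label for each row
fold.mse<-rep(0, k)
for(f in 1:k){
  fit<-lm(mpg~horsepower, data=Auto[folds!=f, ])
  pred<-predict(fit, Auto[folds==f, ])
  fold.mse[f]<-mean((Auto$mpg[folds==f]-pred)^2)
}
mean(fold.mse)  # roughly comparable to the cv.glm estimate for degree 1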
set.seed(17)
cv.error.k10<-rep(0, 10)
for(i in 1:10){
  glm.fit<-glm(mpg~poly(horsepower, i), data=Auto)
  cv.error.k10[i]<-cv.glm(Auto, glm.fit, K=10)$delta[1]
}
cvDF<-data.frame(degree=1:10, cv.error.k10)
ggplot(data=cvDF, aes(x=degree, y=cv.error.k10))+
geom_point()+
geom_line()
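Finally, if the cv.error vector from the LOOCV loop is still in the workspace, the two sets of estimates can be plotted together for comparison; this is just a sketch.
# compare LOOCV and 10-fold estimates (assumes cv.error still exists)
compareDF<-data.frame(degree=rep(1:10, 2),
                      error=c(cv.error, cv.error.k10),
                      method=rep(c("LOOCV", "10-fold CV"), each=10))
ggplot(data=compareDF, aes(x=degree, y=error, color=method))+
geom_point()+
geom_line()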