Cross-validation is a technique that can be used to choose the model with the best predictive power. It is considered a re-sampling technique because it relies on splitting the dataset into training and testing sets.
In the validation set approach we split our data once into a training set and a testing set. We will look at an example using the Auto dataset.
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ISLR)
# We are going to use the Auto dataset
ggplot(data=Auto, aes(x=horsepower, y=mpg))+
geom_point()
# observe the non-linearity
Since we are randomly splitting the set, we will need to set a seed so that we have reproducible results.
# since we are using sample, which has randomness
# we should set a seed to get reproducible results
set.seed(1)
# gives a 50-50 split of the indexes
train<-sample(392, 196)
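As a side note, the call above hard-codes the sample size (392 rows, 196 in the training set). Below is a sketch of an equivalent way to write the split without hard-coding; with the same seed it produces the same training indices.
# equivalent split without hard-coding the number of rows
set.seed(1)
n<-nrow(Auto)          # 392 observations
train<-sample(n, n/2)  # indices for a 50-50 split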
Now we fit the model on the training set and test how well it does on the “new” testing set.
# fit the model on the training set
lm.fit<-lm(mpg~horsepower, data=Auto, subset=train)
attach(Auto)
## The following object is masked from package:ggplot2:
##
## mpg
# predict for the whole dataset
# then keep only the residuals for observations that are not in the training set
mean((mpg-predict(lm.fit, Auto))[-train]^2)
## [1] 26.14142
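If the one-liner above feels terse, here is a sketch of the same test MSE computed step by step with the same fitted model; it should reproduce the value above.
# the same test MSE, spelled out step by step
test<-Auto[-train, ]                  # held-out observations
pred<-predict(lm.fit, newdata=test)   # predictions for the test set
mean((test$mpg-pred)^2)               # test mean squared error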
But recall that there was some curvature, so let's include polynomial terms and assess the error on the “new” test data.
# now consider different amounts of flexibility
# 2nd order polynomial
lm.fit2<-lm(mpg~poly(horsepower, 2), data=Auto, subset=train)
mean((mpg-predict(lm.fit2, Auto))[-train]^2)
## [1] 19.82259
# 3rd order polynomial
lm.fit3<-lm(mpg~poly(horsepower, 3), data=Auto, subset=train)
mean((mpg-predict(lm.fit3, Auto))[-train]^2)
## [1] 19.78252
We can actually use a loop to test many orders:
# we can put this code in a loop to test different amounts of flexibility
degreePoly<-10
polyMSE<-matrix(nrow=degreePoly, ncol=2)
colnames(polyMSE)<-c("degree", "MSE")
for(i in 1:degreePoly){
  polyMSE[i,1]<-i
  this.fit<-lm(mpg~poly(horsepower,i), data=Auto, subset=train)
  polyMSE[i,2]<-mean((mpg-predict(this.fit, Auto))[-train]^2)
}
polyDF<-as.data.frame(polyMSE)
head(polyDF)
## degree MSE
## 1 1 26.14142
## 2 2 19.82259
## 3 3 19.78252
## 4 4 19.99969
## 5 5 20.18225
## 6 6 20.82300
ggplot(data=polyDF, aes(x=degree, y=MSE))+
geom_point()+
geom_line()
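If we plan to reuse this loop, it can be wrapped in a small helper function. The sketch below introduces a hypothetical helper, polyTestMSE (not part of the original code), that assumes the same mpg and horsepower columns.
# hypothetical helper wrapping the loop above
polyTestMSE<-function(data, train, maxDegree=10){
  mse<-sapply(1:maxDegree, function(d){
    fit<-lm(mpg~poly(horsepower, d), data=data, subset=train)
    mean((data$mpg-predict(fit, data))[-train]^2)
  })
  data.frame(degree=1:maxDegree, MSE=mse)
}
head(polyTestMSE(Auto, train))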
However, remember that this is a random split. This means that these errors can change depending on which observations are included in the split.
# however, we should notice that this is dependent on the random split
# try a different random split
set.seed(2)
train<-sample(392, 196)
# fit models again
# 1st order polynomial
lm.fit<-lm(mpg~horsepower, data=Auto, subset=train)
mean((mpg-predict(lm.fit, Auto))[-train]^2)
## [1] 23.29559
# 2nd order polynomial
lm.fit2<-lm(mpg~poly(horsepower, 2), data=Auto, subset=train)
mean((mpg-predict(lm.fit2, Auto))[-train]^2)
## [1] 18.90124
# 3rd order polynomial
lm.fit3<-lm(mpg~poly(horsepower, 3), data=Auto, subset=train)
mean((mpg-predict(lm.fit3, Auto))[-train]^2)
## [1] 19.2574
We might be interested in how much the MSE for the testing set depends on which split we happen to use.
### Let's create many random splits and compare their results
# we can put this code in a double loop to test different amounts of flexibility
degreePoly<-10
splits<-10
splitMat<-matrix(nrow=degreePoly*splits, ncol=3)
colnames(splitMat)<-c("run","MSE", "degree")
for(i in 1:splits){
  a<-(i-1)*degreePoly+1
  b<-i*degreePoly
  set.seed(i*3)
  splitMat[a:b,1]<-i
  train<-sample(392, 196)
  for(j in 1:degreePoly){
    c<-a+(j-1)
    this.fit<-lm(mpg~poly(horsepower,j), data=Auto, subset=train)
    splitMat[c,2]<-mean((mpg-predict(this.fit, Auto))[-train]^2)
    splitMat[c,3]<-j
  }
}
splitDF<-as.data.frame(splitMat)
head(splitDF)
## run MSE degree
## 1 1 26.29134 1
## 2 1 21.50410 2
## 3 1 21.50825 3
## 4 1 21.43146 4
## 5 1 20.88538 5
## 6 1 20.65816 6
ggplot(data=splitDF, aes(x=degree, y=MSE, color=as.factor(run)))+
geom_point()+
geom_line()
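To summarize how much the curves move around, we can also average the MSE (and look at its spread) across the ten splits for each degree; a quick dplyr sketch is below.
# average and spread of the test MSE across the splits, by degree
splitDF %>%
  group_by(degree) %>%
  summarize(meanMSE=mean(MSE), sdMSE=sd(MSE))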
In leave-one-out cross-validation (LOOCV), we leave one observation out and fit a model on all the other observations. We do this n times, so this can be very computationally expensive! Instead of coding this from scratch, we will use built-in functions:
## LOOCV
# use the glm function instead of lm
glm.fit<-glm(mpg~horsepower, data=Auto)
coef(glm.fit)
## (Intercept) horsepower
## 39.9358610 -0.1578447
# compare this to the lm output
lm.fit<-lm(mpg~horsepower, data=Auto)
coef(lm.fit)
## (Intercept) horsepower
## 39.9358610 -0.1578447
# same!
# we will use the cross validation function in the boot package
#install.packages("boot")
library(boot)
glm.fit<-glm(mpg~horsepower, data=Auto)
cv.err<-cv.glm(Auto, glm.fit)
# cross validation results
# for LOOCV the two values agree up to two decimal places
# the second value is a bias-corrected estimate
cv.err$delta
## [1] 24.23151 24.23114
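To make the "n model fits" idea concrete, here is a from-scratch sketch of what LOOCV is doing for this model; it is slow, but it should match the first delta value above.
# manual LOOCV: fit n models, each time leaving out one observation
n<-nrow(Auto)
errs<-rep(0, n)
for(i in 1:n){
  fit.i<-lm(mpg~horsepower, data=Auto[-i, ])
  errs[i]<-(Auto$mpg[i]-predict(fit.i, Auto[i, ]))^2
}
mean(errs)  # should match the first value of cv.err$delta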
# repeat procedure of increasingly complex polynomial fits
# using a for loop
cv.error<-rep(0, 10)
for(i in 1:10){
  glm.fit<-glm(mpg~poly(horsepower, i), data=Auto)
  cv.error[i]<-cv.glm(Auto, glm.fit)$delta[1]
}
cvDF<-data.frame(degree=1:10, cv.error)
ggplot(data=cvDF, aes(x=degree, y=cv.error))+
geom_point()+
geom_line()
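For least-squares fits there is also a well-known shortcut that gives the LOOCV error from a single model fit, using the leverage (hat) values. A sketch for the degree-1 model is below; it should match the first entry of cv.error.
# LOOCV shortcut for least squares: mean((residual/(1 - leverage))^2)
fit1<-lm(mpg~horsepower, data=Auto)
mean((residuals(fit1)/(1-hatvalues(fit1)))^2)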
One of the downsides of LOOCV is that the n fitted models are highly correlated with one another. We can improve on this with k-fold cross-validation, in which we split the data into k folds, hold each fold out once as a test set, and fit the model on the remaining folds.
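Before handing this off to cv.glm with K=10, here is a minimal hand-rolled sketch of k-fold CV for the degree-1 model, just to show how the folds are assigned and used; cv.glm does all of this internally.
# hand-rolled 10-fold CV for the degree-1 model
k<-10
folds<-sample(rep(1:k, length.out=nrow(Auto)))  # random fold label for each row
fold.mse<-rep(0, k)
for(f in 1:k){
  fit<-lm(mpg~horsepower, data=Auto[folds!=f, ])
  pred<-predict(fit, Auto[folds==f, ])
  fold.mse[f]<-mean((Auto$mpg[folds==f]-pred)^2)
}
mean(fold.mse)  # roughly comparable to the cv.glm estimate for degree 1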
set.seed(17)
cv.error.k10<-rep(0, 10)
for(i in 1:10){
  glm.fit<-glm(mpg~poly(horsepower, i), data=Auto)
  cv.error.k10[i]<-cv.glm(Auto, glm.fit, K=10)$delta[1]
}
cvDF<-data.frame(degree=1:10, cv.error.k10)
ggplot(data=cvDF, aes(x=degree, y=cv.error.k10))+
geom_point()+
geom_line()
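Finally, if the cv.error vector from the LOOCV loop is still in the workspace, the two sets of estimates can be plotted together for comparison; this is just a sketch.
# compare LOOCV and 10-fold estimates (assumes cv.error still exists)
compareDF<-data.frame(degree=rep(1:10, 2),
                      error=c(cv.error, cv.error.k10),
                      method=rep(c("LOOCV", "10-fold CV"), each=10))
ggplot(data=compareDF, aes(x=degree, y=error, color=method))+
geom_point()+
geom_line()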