Using the stats and boot libraries in R perform a cross-validation experiment to observe the bias variance tradeoff. You’ll use the auto data set from previous assignments. This dataset has 392 observations across 5 variables. We want to fit a polynomial model of various degrees using the glm function in R and then measure the cross validation error using cv.glm function.
Fit various polynomial models to compute mpg as a function of the other four variables acceleration, weight, horsepower, and displacement using glm function. For example:
glm.fit=glm(mpg~poly(disp+hp+wt+acc,2), data=auto)
cv.err5[2]=cv.glm(auto,glm.fit,K=5)$delta[1]
will fit a 2nd degree polynomial function between mpg and the remaining 4 variables and perform 5 iterations of cross-validations. This result will be stored in a cv.err5 array. cv.glm returns the estimated cross validation error and its adjusted value in a variable called delta. Please see the help on cv.glm to see more information.
Once you have fit the various polynomials from degree 1 to 8, you can plot the crossvalidation error function as
degree=1:8
plot(degree,cv.err5,type='b')
# Load libraries
library(stats)
library(boot)
library(DT)
## Warning: package 'DT' was built under R version 3.3.3
# Setting working directory
# Note: Both .rmd and .data needs to be in same directory
set_directory <- function(directory) {
if (!is.null(directory))
setwd(directory)
}
# Read data and give column names
auto <- read.table("auto-mpg.data", header=FALSE,as.is=TRUE)
colnames(auto) <- c("disp", "hp","wt","acc","mpg")
datatable(auto, options = list(pageLength = 5))
set.seed(7)
## Warning in set.seed(7): '.Random.seed' is not an integer vector but of type
## 'NULL', so ignored
cv.err5 <- c()
for(i in 1:8){
model <- glm(mpg~poly(disp + hp + wt + acc, i), data = auto)
cv.err5[i] <- cv.glm(auto, model , K=5)$delta[1]
}
degrees <- 1:8
plot(degrees, cv.err5, type='b', main = 'Tradeoff Between Bias and Variance',
ylab = "Prediction Error", xlab='Polynomial Degree', col='blue')
Based on the cross-valiation error, we should consider selecting the 2nd or 3rd degree polynomial model as the providing the best fit.