DATA 605 Assignment 12

Using the stats and boot libraries in R perform a cross-validation experiment to observe the bias variance tradeoff. You’ll use the auto data set from previous assignments. This dataset has 392 observations across 5 variables. We want to fit a polynomial model of various degrees using the glm function in R and then measure the cross validation error using cv.glm function.

Fit various polynomial models to compute mpg as a function of the other four variables acceleration, weight, horsepower, and displacement using glm function. For example:

glm.fit=glm(mpg~poly(disp+hp+wt+acc,2), data=auto)
cv.err5[2]=cv.glm(auto,glm.fit,K=5)$delta[1]

will fit a 2nd degree polynomial function between mpg and the remaining 4 variables and perform 5 iterations of cross-validations. This result will be stored in a cv.err5 array. cv.glm returns the estimated cross validation error and its adjusted value in a variable called delta. Please see the help on cv.glm to see more information.

Once you have fit the various polynomials from degree 1 to 8, you can plot the crossvalidation error function as

degree=1:8
plot(degree,cv.err5,type='b')

Load libraries, set directory and prepare data set

# Load libraries
library(stats)
library(boot)
library(DT)

## Warning: package 'DT' was built under R version 3.3.3

# Setting working directory
# Note: Both .rmd and .data needs to be in same directory
set_directory <- function(directory) {
if (!is.null(directory))
   setwd(directory)
}

# Read data and give column names
auto <- read.table("auto-mpg.data", header=FALSE,as.is=TRUE)
colnames(auto) <- c("disp", "hp","wt","acc","mpg")
datatable(auto, options = list(pageLength = 5))

Poly Fit

set.seed(7)

## Warning in set.seed(7): '.Random.seed' is not an integer vector but of type
## 'NULL', so ignored

cv.err5 <- c()
for(i in 1:8){
  model <- glm(mpg~poly(disp + hp + wt + acc, i), data = auto)
  cv.err5[i] <- cv.glm(auto, model , K=5)$delta[1]
}

Plot

degrees <- 1:8
plot(degrees, cv.err5, type='b', main = 'Tradeoff Between Bias and Variance',
      ylab = "Prediction Error", xlab='Polynomial Degree', col='blue')

Based on the cross-valiation error, we should consider selecting the 2nd or 3rd degree polynomial model as the providing the best fit.

DATA 605 Assignment 12

Ahsanul Choudhury

April 30, 2017

Load libraries, set directory and prepare data set

Poly Fit

Plot