Problem set 1. LINEAR REGRESSION IN R

Using the stats and boot libraries in R perform a cross-validation experiment to observe the bias variance tradeoff. You’ll use the auto data set from previous assignments. This dataset has 392 observations across 5 variables. We want to fit a polynomial model of various degrees using the glm function in R and then measure the cross validation error using cv.glm function. Fit various polynomial models to compute mpg as a function of the other four variables acceleration, weight, horsepower, and displacement using glm function. For example:

glm.fit=glm(mpg~poly(disp+hp+wt+acc,2), data=auto) cv.err5[2]=cv.glm(auto,glm.fit,K=5)$delta[1]

will fit a 2nd degree polynomial function between mpg and the remaining 4 variables and perform 5 iterations of cross-validations. This result will be stored in a cv.err5 array. cv.glm returns the estimated cross validation error and its adjusted value in a variable called delta. Please see the help on cv.glm to see more information. Once you have fit the various polynomials from degree 1 to 8, you can plot the cross-validation error function as

degree=1:8 plot(degree,cv.err5,type=’b’)

For your assignment, please create an R-markdown document where you load the auto data set, perform the polynomial fit and then plot the resulting 5 fold cross validation curve. Your output should show the characteristic U-shape illustrating the tradeoff between bias and variance

library(stats)
library(boot)


cars <- read.table("https://raw.githubusercontent.com/bkreis84/Math/master/auto-mpg.data") 

colnames(cars) <- c("disp", "hp", "wgt", "acc", "mpg")

#Take a look at the output of this function
glm.fit=glm(mpg~poly(disp+hp+wgt+acc,8), data=cars)
glm.fit
## 
## Call:  glm(formula = mpg ~ poly(disp + hp + wgt + acc, 8), data = cars)
## 
## Coefficients:
##                     (Intercept)  poly(disp + hp + wgt + acc, 8)1  
##                         23.4459                        -129.0809  
## poly(disp + hp + wgt + acc, 8)2  poly(disp + hp + wgt + acc, 8)3  
##                         24.5509                          -0.6179  
## poly(disp + hp + wgt + acc, 8)4  poly(disp + hp + wgt + acc, 8)5  
##                         -2.3957                           2.8573  
## poly(disp + hp + wgt + acc, 8)6  poly(disp + hp + wgt + acc, 8)7  
##                         -3.7582                           6.9607  
## poly(disp + hp + wgt + acc, 8)8  
##                         -1.3778  
## 
## Degrees of Freedom: 391 Total (i.e. Null);  383 Residual
## Null Deviance:       23820 
## Residual Deviance: 6476  AIC: 2232
plot(glm.fit)

#empty array
cv.err5 <- c()


for (i in 1:8){
  glm.fit=glm(mpg~poly(disp+hp+wgt+acc,i), data=cars)
  cv.err5[i]=cv.glm(cars,glm.fit,K=5)$delta[1]
}

cv.err5
## [1] 18.79162 16.98504 16.88355 17.07307 17.15663 16.93668 17.26240 17.34832
#plot the cross validation error values that we obtained 
degree=1:8
plot(degree,cv.err5, type = 'b')

The bias, which is the error caused by using a less apt model (underfitting), decreases as we increase the complexity of the model, while the variance increases as we increase the complexity of the model as any sample of data may not be well represented by the model (overfitting). It looks as thought the third degree polynomial is the best choice as the cross validation error is lowest, indicating that we have reduced the cross validation error to the lowest point.