Instruction 1:

Using the stats and boot libraries in R, perform a cross-validation experiment to observe the bias-variance tradeoff. You'll use the auto dataset from previous assignments, which has 392 observations across 5 variables. We want to fit polynomial models of various degrees using the glm function in R and then measure the cross-validation error using the cv.glm function. Fit various polynomial models to compute mpg as a function of the other four variables (acceleration, weight, horsepower, and displacement) using the glm function.

options(warn = -1)   # suppress warning messages in the knitted output
library(knitr)       # kable() for table display
library(stats)       # glm()
library(boot)        # cv.glm()



autompg <- read.table("https://raw.githubusercontent.com/mascotinme/GitHub/master/MSDA%20605/auto-mpg.txt", col.names = c("displacement", "hp", "weight", "acceleration", "mpg"))



kable(head(autompg)) # A glimpse of the dataset
 displacement    hp   weight   acceleration   mpg
          307   130     3504           12.0    18
          350   165     3693           11.5    15
          318   150     3436           11.0    18
          304   150     3433           12.0    16
          302   140     3449           10.5    17
          429   198     4341           10.0    15
str(autompg) # The data structure
## 'data.frame':    392 obs. of  5 variables:
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ hp          : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
kable(summary(autompg)) # descriptive statistics of the variables
 displacement         hp             weight       acceleration        mpg
 Min.   : 68.0   Min.   : 46.0   Min.   :1613   Min.   : 8.00   Min.   : 9.00
 1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225   1st Qu.:13.78   1st Qu.:17.00
 Median :151.0   Median : 93.5   Median :2804   Median :15.50   Median :22.75
 Mean   :194.4   Mean   :104.5   Mean   :2978   Mean   :15.54   Mean   :23.45
 3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615   3rd Qu.:17.02   3rd Qu.:29.00
 Max.   :455.0   Max.   :230.0   Max.   :5140   Max.   :24.80   Max.   :46.60

Instruction 2:

glm.fit = glm(mpg ~ poly(disp + hp + wt + acc, 2), data = auto)

cv.err5[2] = cv.glm(auto, glm.fit, K = 5)$delta[1]

will fit a 2nd-degree polynomial between mpg and the remaining four variables and perform 5-fold cross-validation. The result is stored in the cv.err5 array. cv.glm returns the estimated cross-validation error and its adjusted value in a component called delta; see the help page for cv.glm for more information.
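To see what cv.glm returns before running the full loop, here is a minimal sketch, assuming the autompg data frame loaded above (the object names fit2 and cv.out are illustrative, not part of the assignment):

library(boot)

set.seed(1)
fit2 <- glm(mpg ~ poly(displacement + hp + weight + acceleration, 2), data = autompg)
cv.out <- cv.glm(autompg, fit2, K = 5)   # 5-fold cross-validation

cv.out$delta      # two values: the raw K-fold CV error and a bias-adjusted version
cv.out$delta[1]   # the raw estimate is the one stored for each degree below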

Once you have fit the various polynomials from degree 1 to 8, you can plot the cross-validation error as a function of degree:

degree = 1:8

plot(degree, cv.err5, type = 'b')

training <- autompg        # cv.glm does its own fold splitting,
crossvalidation <- autompg # so both names refer to the same full data set
set.seed(8)

cv.err <- numeric(8)  # 5-fold CV error for each polynomial degree

for (n in 1:8) {
  # Fit a degree-n polynomial of the summed predictors, as in the instructions
  fit <- glm(mpg ~ poly(displacement + hp + weight + acceleration, n), data = training)

  # Store the raw (unadjusted) 5-fold cross-validation error estimate
  cv.err[n] <- cv.glm(crossvalidation, fit, K = 5)$delta[1]
}


cv.err
## [1] 18.40003 16.95578 17.44431 17.05143 17.17140 17.44844 17.00235 16.97621

Perform the polynomial fits and then plot the resulting 5-fold cross-validation curve. Your output should show the characteristic U-shape illustrating the tradeoff between bias and variance.

degree <- 1:8
plot(degree, cv.err, type = 'b', xlab = "Polynomial Degree",
     ylab = "Cross-Validation Error",
     main = "Plot of Tradeoff Between Bias and Variance")
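As a quick follow-up for reading the curve, here is a sketch that uses the cv.err vector computed above to mark the degree with the smallest estimated error on the plot (the name best is illustrative):

best <- which.min(cv.err)                           # degree with the lowest 5-fold CV error
cv.err[best]
points(best, cv.err[best], pch = 19, col = "red")   # highlight that point on the existing plot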