Using the stats and boot libraries in R, perform a cross-validation experiment to observe the bias-variance tradeoff.

Load the required R packages

# stats and boot ship with R; ggplot2 may need: install.packages("ggplot2")
library(stats)
library(boot)
library(ggplot2)

Load the auto-mpg dataset

filepath <- "C:/Users/Mezue/Downloads/assign11/assign11/auto-mpg.data"

Auto_mpg <- read.table(filepath, header = FALSE, sep = "")

colnames(Auto_mpg) <- c('displacement', 'horsepower', 'weight', 'acceleration', 'mpg')
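
Before plotting, it is worth a quick sanity check that the file parsed as expected. This is a minimal sketch; adjust if your copy of the file has a different column layout.

# Quick sanity check: column types, dimensions, and value ranges
str(Auto_mpg)
summary(Auto_mpg)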

Plot the auto-mpg dataset

# poly() defaults to degree = 1; plot the combined predictor against mpg
qplot(poly(displacement + horsepower + weight + acceleration, degree = 1), mpg,
      data = Auto_mpg) +
  scale_x_continuous(name = "Polynomial: displacement + horsepower + weight + acceleration") +
  ggtitle("Auto-MPG Dataset")

We want to fit a polynomial model of various degrees using the glm function in R and then measure the cross-validation error.

The function poly returns orthogonal polynomials of the specified degree. The function glm fits generalized linear models. The function cv.glm calculates the estimated K-fold cross-validation prediction error for generalized linear models.
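
As a quick illustration of what poly returns, here is the orthogonal basis for a small vector (a sketch using toy data, not the assignment dataset):

# poly() returns a matrix whose columns are orthogonal polynomial terms
head(poly(1:10, degree = 3))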

Output of the cv.glm function is noted below:

K: The value of K used for the K-fold cross validation.

delta: A vector of length two. [1] The first component is the raw cross-validation estimate of prediction error. [2] The second component is the adjusted cross-validation estimate. The adjustment is designed to compensate for the bias introduced by not using leave-one-out cross-validation.
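
As a minimal example of these return values, here is a single call on a degree-1 fit; the exact numbers will vary with the random fold assignment:

# Illustrative single call to cv.glm with 5 folds
fit1 <- glm(mpg ~ poly(displacement + horsepower + weight + acceleration,
                       degree = 1), data = Auto_mpg)
cv1 <- cv.glm(Auto_mpg, fit1, K = 5)
cv1$K      # 5
cv1$delta  # raw and adjusted CV error estimates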

n <- 8
degrees <- 1:n
cv.err5 <- numeric(n)
set.seed(1)  # fix the random fold assignment so the CV errors are reproducible
for (i in degrees) {
  # Fit a degree-i orthogonal polynomial in the combined predictor
  glm.fit <- glm(mpg ~ poly(displacement + horsepower + weight + acceleration,
                            degree = i), data = Auto_mpg)
  # Store the raw 5-fold CV estimate of prediction error
  cv.err5[i] <- cv.glm(Auto_mpg, glm.fit, K = 5)$delta[1]
}
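
Since delta[2] adjusts for the bias of not using leave-one-out cross-validation, an optional check is to run cv.glm without K, which defaults to K = n, i.e. LOOCV. This is a sketch on a 2nd-degree fit; it refits the model once per observation, so it is noticeably slower:

# Leave-one-out CV for comparison with the 5-fold runs
fit2 <- glm(mpg ~ poly(displacement + horsepower + weight + acceleration,
                       degree = 2), data = Auto_mpg)
cv.glm(Auto_mpg, fit2)$delta  # the two components are (nearly) equal under LOOCV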

Plot the Resulting Five-Fold Cross-Validation Curve

plot(degrees, cv.err5, type = 'p', xlab = "Complexity (Degrees)",
     ylab = "Error", main = paste0("n = 1:", n), xaxt = "n", yaxt = "n")
mtext("At first bias decreases, but then variance increases", col = 4)
lines(degrees, cv.err5, lwd = 3, col = "blue")

#lines(predict(loess(cv.err5 ~ degrees)), col='red', lwd=2)
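
To make the model choice explicit rather than reading it off the curve, the minimizing degree can be extracted directly (a small sketch):

# Degree with the lowest estimated 5-fold CV error
best.degree <- degrees[which.min(cv.err5)]
best.degree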

Selected Model (1): 2nd Degree Polynomial

ggplot(Auto_mpg, aes(x = poly(displacement + horsepower + weight + acceleration,
                              degree = 1), y = mpg)) +
  geom_point() +
  scale_x_continuous(name = "Polynomial: displacement + horsepower + weight + acceleration") +
  stat_smooth(method = "glm", formula = y ~ poly(x, 2), size = 1) +
  ggtitle("Fit - 2nd Degree Polynomial Model")