Assignment Instructions

Using the stats and boot libraries in R, perform a cross-validation experiment to observe the bias-variance tradeoff. Load the auto-mpg data set, perform the polynomial fit, and then plot the resulting 5-fold cross-validation curve. The output should show the characteristic U-shape illustrating the tradeoff between bias and variance.

library(stats)
library(boot)

Load the Data

The auto-mpg data set from previous assignments (attached here) has 392 observations across five variables.

url <- "https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20605/auto-mpg.data"
auto.mpg <- read.table(url)
colnames(auto.mpg) <- c("displacement", "horsepower", "weight", "acceleration", "mpg")
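
As a quick sanity check (a brief sketch, assuming the file downloads as expected), the dimensions and structure can be confirmed against the 392 observations and five variables described above:

dim(auto.mpg)   # expect 392 rows and 5 columns
head(auto.mpg)  # inspect the first few observations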

Perform the Polynomial Fit

Fit polynomial models of increasing degree using the glm function in R. For each degree from 1 through 8, estimate the 5-fold cross-validation error using the cv.glm function.

n <- 8
degrees <- 1:n
cv.err5 <- numeric(n)
for (i in degrees) {
  # Orthogonal polynomial of degree i in the sum of the four predictors
  glm.fit <- glm(mpg ~ poly(displacement + horsepower + weight + acceleration, 
                            degree = i), data = auto.mpg)
  # Raw 5-fold cross-validation estimate of prediction error
  cv.err5[i] <- cv.glm(auto.mpg, glm.fit, K = 5)$delta[1]
}

The function poly returns orthogonal polynomials up to the specified degree. The function glm fits generalized linear models. The function cv.glm calculates the estimated \(K\)-fold cross-validation prediction error for generalized linear models. The returned value delta is a vector of length two: the first component is the raw cross-validation estimate of prediction error, and the second is the adjusted cross-validation estimate, which compensates for the bias introduced by not using leave-one-out cross-validation.
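
As a brief illustration of the delta vector (the degree-2 fit below is an arbitrary choice for demonstration, not part of the assignment), both components can be inspected directly:

glm.fit2 <- glm(mpg ~ poly(displacement + horsepower + weight + acceleration,
                           degree = 2), data = auto.mpg)
cv.glm(auto.mpg, glm.fit2, K = 5)$delta  # c(raw, adjusted); exact values vary with the random folds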

Plot the Resulting Five-Fold Cross-Validation Curve

plot(degrees, cv.err5, type = 'p', xlab = "Complexity (Degrees)", 
     ylab = "Error", main = paste0("n = 1:", n), xaxt = "n", yaxt = "n")
mtext("At first bias decreases, but then variance increases", col = 4)
lines(degrees, cv.err5, lwd = 3, col = "steelblue")
lines(degrees, predict(loess(cv.err5 ~ degrees)), col = 'red', lwd = 2)

Compare Other Polynomials and Curves

n <- 13
bounds <- 5:n
par(mfrow = c(3,3))
for (j in bounds) {
  degrees <- 1:j
  cv.err5 <- numeric(j)
  for (i in degrees) {
    glm.fit <- glm(mpg ~ poly(displacement + horsepower + weight + acceleration, 
                              degree = i), data = auto.mpg)
    cv.err5[i] <- cv.glm(auto.mpg, glm.fit, K = 5)$delta[1]
  }
  plot(degrees, cv.err5, type = 'p', xlab = "Complexity (Degrees)", 
       ylab = "Error", main = paste0("n = 1:", j))
  lines(degrees, cv.err5, lwd = 3, col = "steelblue")
  lines(degrees, predict(loess(cv.err5 ~ degrees)), col = 'red', lwd = 2)
}

data.frame(degrees, cv.err5)
##    degrees  cv.err5
## 1        1 18.44483
## 2        2 17.03545
## 3        3 16.94883
## 4        4 17.25613
## 5        5 17.04665
## 6        6 17.07333
## 7        7 17.05365
## 8        8 17.25689
## 9        9 17.80534
## 10      10 18.01060
## 11      11 17.20774
## 12      12 17.23849
## 13      13 36.60608
which.min(cv.err5)
## [1] 3

Because the fold assignment is random, the exact error values vary from run to run; in this run, the minimum cross-validation error of 16.9488338 occurs when the polynomial is of degree 3.
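
Since cv.glm assigns observations to folds at random, the minimizing degree can shift between runs. A minimal sketch of one way to make a run reproducible (the seed value 605 is an arbitrary choice, not part of the original assignment):

set.seed(605)  # fix the random fold assignment so results can be reproduced
cv.err5 <- numeric(8)
for (i in 1:8) {
  glm.fit <- glm(mpg ~ poly(displacement + horsepower + weight + acceleration,
                            degree = i), data = auto.mpg)
  cv.err5[i] <- cv.glm(auto.mpg, glm.fit, K = 5)$delta[1]
}
which.min(cv.err5)  # degree with the smallest estimated prediction error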