Using the stats and boot libraries in R, perform a cross-validation experiment to observe the bias-variance tradeoff. Load the auto-mpg data set, fit polynomial models of increasing degree, and then plot the resulting 5-fold cross-validation curve. The output should show the characteristic U-shape illustrating the tradeoff between bias and variance.
library(stats)
library(boot)
The auto-mpg data set from previous assignments (attached here) has 392 observations across five variables.
url <- "https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20605/auto-mpg.data"
auto.mpg <- read.table(url)
colnames(auto.mpg) <- c("displacement", "horsepower", "weight", "acceleration", "mpg")
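As a quick sanity check (an optional sketch; dim and str are base R functions, and the expected 392-by-5 shape assumes the file matches the description above), confirm what was read in:
dim(auto.mpg)   # should report 392 rows and 5 columns
str(auto.mpg)   # column types for the five variables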
Fit polynomial models of degrees 1 through 8 using the glm function in R. For each fit, estimate the 5-fold cross-validation error with the cv.glm function.
n <- 8
degrees <- 1:n
cv.err5 <- numeric(length(degrees))
for (i in degrees) {
  # Degree-i orthogonal polynomial in the sum of the four predictors
  glm.fit <- glm(mpg ~ poly(displacement + horsepower + weight + acceleration,
                            degree = i), data = auto.mpg)
  # Raw 5-fold cross-validation estimate of prediction error
  cv.err5[i] <- cv.glm(auto.mpg, glm.fit, K = 5)$delta[1]
}
The function poly returns orthogonal polynomials of the specified degree. The function glm is used to fit generalized linear models. The function cv.glm calculates the estimated \(K\)-fold cross-validation prediction error for generalized linear models. The returned value delta is a vector of length two. The first component is the raw cross-validation estimate of prediction error. The second component is the adjusted cross-validation estimate. The adjustment compensates for the bias introduced by not using leave-one-out cross-validation.
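To make the role of delta concrete, here is a minimal sketch (assuming the data are loaded and boot is attached as above) that fits a single degree-2 model and inspects both components of the returned error; the object names glm.fit2 and cv.out are illustrative:
glm.fit2 <- glm(mpg ~ poly(displacement + horsepower + weight + acceleration,
                           degree = 2), data = auto.mpg)
cv.out <- cv.glm(auto.mpg, glm.fit2, K = 5)
cv.out$delta   # first element: raw CV error; second: bias-adjusted CV error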
plot(degrees, cv.err5, type = 'p', xlab = "Complexity (Degrees)",
     ylab = "Error", main = paste0("n = 1:", n), xaxt = "n", yaxt = "n")
mtext("At first bias decreases, but then variance increases", col = 4)
lines(cv.err5, lwd = 3, col = "steelblue")
lines(predict(loess(cv.err5 ~ degrees)), col = 'red', lwd = 2)
Repeating the experiment for maximum degrees 5 through 13 shows how the shape of the curve changes as the range of model complexity grows.
n <- 13
bounds <- 5:n
par(mfrow = c(3, 3))
for (j in bounds) {
  degrees <- 1:j
  cv.err5 <- numeric(length(degrees))
  for (i in degrees) {
    glm.fit <- glm(mpg ~ poly(displacement + horsepower + weight + acceleration,
                              degree = i), data = auto.mpg)
    cv.err5[i] <- cv.glm(auto.mpg, glm.fit, K = 5)$delta[1]
  }
  plot(degrees, cv.err5, type = 'p', xlab = "Complexity (Degrees)",
       ylab = "Error", main = paste0("n = 1:", j))
  lines(cv.err5, lwd = 3, col = "steelblue")
  lines(predict(loess(cv.err5 ~ degrees)), col = 'red', lwd = 2)
}
data.frame(degrees, cv.err5)
## degrees cv.err5
## 1 1 18.44483
## 2 2 17.03545
## 3 3 16.94883
## 4 4 17.25613
## 5 5 17.04665
## 6 6 17.07333
## 7 7 17.05365
## 8 8 17.25689
## 9 9 17.80534
## 10 10 18.01060
## 11 11 17.20774
## 12 12 17.23849
## 13 13 36.60608
match(min(cv.err5), cv.err5)
## [1] 3
Because the folds are assigned at random, the exact errors vary from run to run; in this run, the minimum cross-validation error of 16.9488338 occurs when the polynomial is of degree 3.
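Since the fold assignment is random, the curve can be stabilized by averaging the 5-fold error over repeated runs for each degree. The sketch below is one way to do this and is not part of the original assignment; the seed and the number of repeats are arbitrary choices:
set.seed(605)   # arbitrary seed for reproducibility
reps <- 10      # number of repeated 5-fold CV runs per degree
avg.err5 <- sapply(1:8, function(d) {
  fit <- glm(mpg ~ poly(displacement + horsepower + weight + acceleration,
                        degree = d), data = auto.mpg)
  mean(replicate(reps, cv.glm(auto.mpg, fit, K = 5)$delta[1]))
})
which.min(avg.err5)   # degree with the lowest averaged CV error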