Using the stats and boot libraries in R perform a cross-validation experiment to observe the bias variance tradeof
#install.packages('stats')
library(stats)
library(boot)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
filepath <- c("C:/Users/Mezue/Downloads/assign11/assign11/auto-mpg.data")
Auto_mpg <-read.table(filepath,header = FALSE, sep = "")
colnames(Auto_mpg) <- c('displacement', 'horsepower', 'weight', 'acceleration', 'mpg')
Plot the auto-mpg dataset
qplot(poly(displacement+horsepower+weight+acceleration), mpg, data = Auto_mpg) +
scale_x_continuous(name = "Polynomial: displacement+horsepower+weight+acceleration") +
ggtitle("Auto-MPG Dataset")
We want to t a polynomial model of various degrees using the glm function in R and then measure the cross validation error
The function poly returns orthogonal polynomials of specified degrees. The function glm is used to fit generalized linear models. The function cv.glm calculates the estimated KK-fold cross-validation prediction error for generalized linear models. The returned value delta is a vector of length two. The first component is the raw cross-validation estimate of prediction error. The second component is the adjusted cross-validation estimate. The adjustment is compensates for the bias introduced by not using leave-one-out cross-validation.
Output of the cv.glm function is noted below:
K: The value of K used for the K-fold cross validation.
delta: A vector of length two. [1] The first component is the raw cross-validation estimate of prediction error. [2] The second component is the adjusted cross-validation estimate. The adjustment is designed to compensate for the bias introduced by not using leave-one-out cross-validation.
n <- 8
degrees <- 1:n
cv.err5 <- numeric()
for (i in degrees) {
glm.fit <- glm(mpg ~ poly(displacement + horsepower + weight + acceleration,
degree = i), data = Auto_mpg)
cv.err5[i - min(degrees) + 1] <- cv.glm(Auto_mpg, glm.fit, K = 5)$delta[1]
}
plot(degrees, cv.err5, type='p', xlab = "Complexity (Degrees)",
ylab = "Error", main = paste0("n = 1:", 8), xaxt = "n", yaxt = "n")
mtext("At first Bias decreases, but then Variance increases", col = 4)
lines(cv.err5, lwd = 3, col = "blue")
#lines(predict(loess(cv.err5 ~ degrees)), col='red', lwd=2)
Selected Model (1): 2nd Degree Polynomial
ggplot(Auto_mpg, aes(x = poly(displacement + horsepower + weight + acceleration), y = mpg)) + geom_point() +
scale_x_continuous(name = "Polynomial: displacement + horsepower + weight + acceleration") +
stat_smooth(method = "glm", formula = y ~ poly(x, 2), size = 1) +
ggtitle("Fit - 2nd Degree Polynomial Model")