Using the stats
and boot
libraries in R perform a cross-validation experiment to observe the bias variance tradeoff. Use the auto data set from previous assignments. This dataset has 392 observations across 5 variables. We want to fit a polynomial model of various degrees using the glm function in R and then measure the cross validation error using cv.glm function.
Fit various polynomial models to compute mpg as a function of the other four variables acceleration, weight, horsepower, and displacement using glm function.
For example:
glm.fit=glm(mpg~poly(disp+hp+wt+acc,2), data=auto)
cv.err5[2]=cv.glm(auto,glm.fit,K=5)$delta[1]]
will fit a 2nd degree polynomial function between mpg and the remaining 4 variables and perform 5 iterations of cross-validations. This result will be stored in a cv.err5 array. cv.glm returns the estimated cross validation error and its adjusted value in a variable called delta. See the help on cv.glm to see more information.
suppressWarnings(suppressMessages(library(stats)))
suppressWarnings(suppressMessages(library(boot)))
suppressWarnings(suppressMessages(library(ggplot2)))
Consider the modified auto-mpg data (obtained from the UC Irvine Machine Learning dataset). This dataset contains 5 columns: displacement, horsepower, weight, acceleration, mpg.
# read the auto-mpg dataset into a dataframe
setwd("C:/Users/keith/Google Drive/Data Science/CUNY/DATA605 - Comp Math/Week12 - Bias and Variance Trade-off")
auto_mpg <- read.table("auto-mpg.data", header=FALSE)
# set the column names
colnames(auto_mpg) <- c("disp", "hp", "wt", "acc", "mpg")
head(auto_mpg)
## disp hp wt acc mpg
## 1 307 130 3504 12.0 18
## 2 350 165 3693 11.5 15
## 3 318 150 3436 11.0 18
## 4 304 150 3433 12.0 16
## 5 302 140 3449 10.5 17
## 6 429 198 4341 10.0 15
str(auto_mpg)
## 'data.frame': 392 obs. of 5 variables:
## $ disp: num 307 350 318 304 302 429 454 440 455 390 ...
## $ hp : num 130 165 150 150 140 198 220 215 225 190 ...
## $ wt : num 3504 3693 3436 3433 3449 ...
## $ acc : num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
Plot the auto-mpg dataset using:
x = poly(disp+hp+wt+acc)
y = mpg
qplot(poly(disp+hp+wt+acc), mpg, data = auto_mpg) +
scale_x_continuous(name = "Polynomial: disp+hp+wt+acc") +
ggtitle("Auto-MPG Dataset")
Output of the cv.glm
function is noted below:
K
: The value of K used for the K-fold cross validation.
delta
: A vector of length two.
[1] The first component is the raw cross-validation estimate of prediction error.
[2] The second component is the adjusted cross-validation estimate. The adjustment is designed to compensate for the bias introduced by not using leave-one-out cross-validation.
set.seed(19)
degree <- 1:9
cv.err5 <- c()
cv.err5.adj <- c()
for (i in (degree)) {
glm.fit <- glm(mpg~poly(disp+hp+wt+acc, i), data=auto_mpg)
# probably a better way to do this
# append the predicted values for each model to the auto-mpg dataset
if (i==1){ auto_mpg$model1 <- predict(glm.fit, new_data= auto_mpg)}
if (i==2){ auto_mpg$model2 <- predict(glm.fit, new_data= auto_mpg)}
if (i==3){ auto_mpg$model3 <- predict(glm.fit, new_data= auto_mpg)}
if (i==4){ auto_mpg$model4 <- predict(glm.fit, new_data= auto_mpg)}
if (i==5){ auto_mpg$model5 <- predict(glm.fit, new_data= auto_mpg)}
if (i==6){ auto_mpg$model6 <- predict(glm.fit, new_data= auto_mpg)}
if (i==7){ auto_mpg$model7 <- predict(glm.fit, new_data= auto_mpg)}
if (i==8){ auto_mpg$model8 <- predict(glm.fit, new_data= auto_mpg)}
if (i==9){ auto_mpg$model9 <- predict(glm.fit, new_data= auto_mpg)}
cv <- cv.glm(auto_mpg, glm.fit, K=5)
cv.err5[i] <- cv$delta[1] # raw cross-validation estimate of prediction error
cv.err5.adj[i] <- cv$delta[2] # adjusted cross-validation estimate of prediction error
}
plot(degree, cv.err5, type='b', main = "Cross-validation Estimate of Prediction Error vs. Degree",
ylab = "Prediction Error")
“At its root, dealing with bias and variance is really about dealing with over- and under-fitting. Bias is reduced and variance is increased in relation to model complexity. As more and more parameters are added to a model, the complexity of the model rises and variance becomes our primary concern while bias steadily falls” (http://scott.fortmann-roe.com/docs/BiasVariance.html)
Based on the cross-valiation error, we should consider selecting the 2nd or 3rd degree polynomial model as the providing the best fit.
ggplot(auto_mpg, aes(x = poly(disp+hp+wt+acc), y = mpg)) + geom_point() +
scale_x_continuous(name = "Polynomial: disp+hp+wt+acc") +
geom_line(aes(x=poly(disp+hp+wt+acc), y=model2, colour="Degree 2"), size=2, alpha=.5) +
geom_line(aes(x=poly(disp+hp+wt+acc), y=model3, colour="Degree 3")) +
geom_line(aes(x=poly(disp+hp+wt+acc), y=model4, colour="Degree 4") ) +
geom_line(aes(x=poly(disp+hp+wt+acc), y=model5, colour="Degree 5") ) +
geom_line(aes(x=poly(disp+hp+wt+acc), y=model6, colour="Degree 6") ) +
geom_line(aes(x=poly(disp+hp+wt+acc), y=model7, colour="Degree 7") ) +
geom_line(aes(x=poly(disp+hp+wt+acc), y=model8, colour="Degree 8") ) +
geom_line(aes(x=poly(disp+hp+wt+acc), y=model9, colour="Degree 9") ) +
scale_colour_brewer(palette="Pastel2") +
guides(color=guide_legend(title="Model")) +
ggtitle("Auto MPG and Polynomial Fit")
Selected Model (1): 2nd Degree Polynomial
ggplot(auto_mpg, aes(x = poly(disp+hp+wt+acc), y = mpg)) + geom_point() +
scale_x_continuous(name = "Polynomial: disp+hp+wt+acc") +
stat_smooth(method = "glm", formula = y ~ poly(x, 2), size = 1) +
ggtitle("Fit - 2nd Degree Polynomial Model")
Selected Model (2): 3rd Degree Polynomial
ggplot(auto_mpg, aes(x = poly(disp+hp+wt+acc), y = mpg)) + geom_point() +
scale_x_continuous(name = "Polynomial: disp+hp+wt+acc") +
stat_smooth(method = "glm", formula = y ~ poly(x, 3), size = 1) +
ggtitle("Fit - 3rd Degree Polynomial Model")
Reference articles:
https://www.r-bloggers.com/fitting-polynomial-regression-in-r/
https://www.r-bloggers.com/cross-validation-for-predictive-analytics-using-r/
http://stackoverflow.com/questions/23334360/plot-polynomial-regression-curve-in-r
http://rstudio-pubs-static.s3.amazonaws.com/1690_b6906d2174654e339c33a07d020e6cc3.html
http://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/