DATA 605: Assignment 12

Bias Variance Trade-off in R

Using the stats and boot libraries in R perform a cross-validation experiment to observe the bias variance tradeoff. Use the auto data set from previous assignments. This dataset has 392 observations across 5 variables. We want to fit a polynomial model of various degrees using the glm function in R and then measure the cross validation error using cv.glm function.

Fit various polynomial models to compute mpg as a function of the other four variables acceleration, weight, horsepower, and displacement using glm function.

For example:

glm.fit=glm(mpg~poly(disp+hp+wt+acc,2), data=auto) 
cv.err5[2]=cv.glm(auto,glm.fit,K=5)$delta[1]]

will fit a 2nd degree polynomial function between mpg and the remaining 4 variables and perform 5 iterations of cross-validations. This result will be stored in a cv.err5 array. cv.glm returns the estimated cross validation error and its adjusted value in a variable called delta. See the help on cv.glm to see more information.

Load the required R packages

suppressWarnings(suppressMessages(library(stats)))
suppressWarnings(suppressMessages(library(boot)))
suppressWarnings(suppressMessages(library(ggplot2)))

Load the auto-mpg dataset

Consider the modified auto-mpg data (obtained from the UC Irvine Machine Learning dataset). This dataset contains 5 columns: displacement, horsepower, weight, acceleration, mpg.

# read the auto-mpg dataset into a dataframe

setwd("C:/Users/keith/Google Drive/Data Science/CUNY/DATA605 - Comp Math/Week12 - Bias and Variance Trade-off")

auto_mpg <-  read.table("auto-mpg.data", header=FALSE)

# set the column names
colnames(auto_mpg) <- c("disp", "hp", "wt", "acc", "mpg")

head(auto_mpg)

##   disp  hp   wt  acc mpg
## 1  307 130 3504 12.0  18
## 2  350 165 3693 11.5  15
## 3  318 150 3436 11.0  18
## 4  304 150 3433 12.0  16
## 5  302 140 3449 10.5  17
## 6  429 198 4341 10.0  15

str(auto_mpg)

## 'data.frame':    392 obs. of  5 variables:
##  $ disp: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ hp  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ wt  : num  3504 3693 3436 3433 3449 ...
##  $ acc : num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ mpg : num  18 15 18 16 17 15 14 14 14 15 ...

Plot the auto-mpg dataset using:

x = poly(disp+hp+wt+acc)
y = mpg

qplot(poly(disp+hp+wt+acc), mpg, data = auto_mpg) + 
    scale_x_continuous(name = "Polynomial: disp+hp+wt+acc") +
     ggtitle("Auto-MPG Dataset")

Fit the polynomial models using degrees 1 through 9 using the glm function in R and then measure the cross validation error using cv.glm function. While fiting the polynomial model, append the predicted values based on the model to the auto-mpg dataset.

Output of the cv.glm function is noted below:

K: The value of K used for the K-fold cross validation.

delta: A vector of length two.
[1] The first component is the raw cross-validation estimate of prediction error.
[2] The second component is the adjusted cross-validation estimate. The adjustment is designed to compensate for the bias introduced by not using leave-one-out cross-validation.

set.seed(19)

degree <- 1:9

cv.err5 <- c()
cv.err5.adj <- c()

for (i in (degree)) {

  glm.fit <- glm(mpg~poly(disp+hp+wt+acc, i), data=auto_mpg) 
  
  # probably a better way to do this
  # append the predicted values for each model to the auto-mpg dataset
  if (i==1){ auto_mpg$model1 <- predict(glm.fit, new_data= auto_mpg)}
  if (i==2){ auto_mpg$model2 <- predict(glm.fit, new_data= auto_mpg)}
  if (i==3){ auto_mpg$model3 <- predict(glm.fit, new_data= auto_mpg)}
  if (i==4){ auto_mpg$model4 <- predict(glm.fit, new_data= auto_mpg)}
  if (i==5){ auto_mpg$model5 <- predict(glm.fit, new_data= auto_mpg)}
  if (i==6){ auto_mpg$model6 <- predict(glm.fit, new_data= auto_mpg)}
  if (i==7){ auto_mpg$model7 <- predict(glm.fit, new_data= auto_mpg)}
  if (i==8){ auto_mpg$model8 <- predict(glm.fit, new_data= auto_mpg)}
  if (i==9){ auto_mpg$model9 <- predict(glm.fit, new_data= auto_mpg)}
  
  cv <- cv.glm(auto_mpg, glm.fit, K=5)
  
  cv.err5[i] <- cv$delta[1]     # raw cross-validation estimate of prediction error
  cv.err5.adj[i] <- cv$delta[2] # adjusted cross-validation estimate of prediction error
}

Plot the cross-valiation error versus the number of degrees used in the polynomial model

plot(degree, cv.err5, type='b', main = "Cross-validation Estimate of Prediction Error vs. Degree",
      ylab = "Prediction Error")

“At its root, dealing with bias and variance is really about dealing with over- and under-fitting. Bias is reduced and variance is increased in relation to model complexity. As more and more parameters are added to a model, the complexity of the model rises and variance becomes our primary concern while bias steadily falls” (http://scott.fortmann-roe.com/docs/BiasVariance.html)

Based on the cross-valiation error, we should consider selecting the 2nd or 3rd degree polynomial model as the providing the best fit.

Plot the Predicted Values for models 2 through 9.

ggplot(auto_mpg, aes(x = poly(disp+hp+wt+acc), y = mpg)) + geom_point() +
   scale_x_continuous(name = "Polynomial: disp+hp+wt+acc") +
  geom_line(aes(x=poly(disp+hp+wt+acc), y=model2, colour="Degree 2"), size=2, alpha=.5) + 
  geom_line(aes(x=poly(disp+hp+wt+acc), y=model3, colour="Degree 3")) + 
  geom_line(aes(x=poly(disp+hp+wt+acc), y=model4, colour="Degree 4") ) + 
  geom_line(aes(x=poly(disp+hp+wt+acc), y=model5, colour="Degree 5") ) + 
  geom_line(aes(x=poly(disp+hp+wt+acc), y=model6, colour="Degree 6") ) + 
  geom_line(aes(x=poly(disp+hp+wt+acc), y=model7, colour="Degree 7") ) + 
  geom_line(aes(x=poly(disp+hp+wt+acc), y=model8, colour="Degree 8") ) + 
  geom_line(aes(x=poly(disp+hp+wt+acc), y=model9, colour="Degree 9") ) + 
  scale_colour_brewer(palette="Pastel2") + 
  guides(color=guide_legend(title="Model")) + 
  ggtitle("Auto MPG and Polynomial Fit")

Selected Model (1): 2nd Degree Polynomial

ggplot(auto_mpg, aes(x = poly(disp+hp+wt+acc), y = mpg)) + geom_point() +
   scale_x_continuous(name = "Polynomial: disp+hp+wt+acc") +
    stat_smooth(method = "glm", formula = y ~ poly(x, 2), size = 1) + 
    ggtitle("Fit - 2nd Degree Polynomial Model")

Selected Model (2): 3rd Degree Polynomial

ggplot(auto_mpg, aes(x = poly(disp+hp+wt+acc), y = mpg)) + geom_point() +
   scale_x_continuous(name = "Polynomial: disp+hp+wt+acc") +
    stat_smooth(method = "glm", formula = y ~ poly(x, 3), size = 1) +
    ggtitle("Fit - 3rd Degree Polynomial Model")

Reference articles:

https://www.r-bloggers.com/fitting-polynomial-regression-in-r/
https://www.r-bloggers.com/cross-validation-for-predictive-analytics-using-r/
http://stackoverflow.com/questions/23334360/plot-polynomial-regression-curve-in-r
http://rstudio-pubs-static.s3.amazonaws.com/1690_b6906d2174654e339c33a07d020e6cc3.html
http://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/