ASSIGNMENT 12 - BIAS VARIANCE TRADEOFF IN R


Using the stats and boot libraries in R perform a cross-validation experiment to observe the bias variance tradeoff. You’ll use the auto data set from previous assignments. This dataset has 392 observations across 5 variables. We want to fit a polynomial model of various degrees using the glm function in R and then measure the cross validation error using cv.glm function. Fit various polynomial models to compute mpg as a function of the other four variables acceleration, weight, horsepower, and displacement using glm function.


The main purpose is to estimate the true behavior of mpg given the sample of data we have, D. If we tried to fit D with no error by itself, it would most likely overfit our estimation, i.e. give a large variance and low bias in our predictions of \(\hat{f}_i|x_i\). On the other hand, if we do not place enough emphasis on how well D fits the model, we will have a high bias and a low variance.

In order to find this sweet spot in between, i.e. an estimation that fits the model well across variations of our sample sets, we want to minimize both the variance and the bias. This is a trade-off because reducing one will in turn increase the other.

Given our sample D, we will split it into 5 sets of approximately the same size and fit a 1st - 8th degree polynomial. For each iteration, 1 of the 5 sets will be withheld from the estimation. At the end, this model will be applied to the withed set to weigh its goodness of fit. The resultant value is the prediction error of the cross validation.

While using this method, the best fit polynomial order changed frequently from 2 to 7. In order to increase the variation in the folds chosen for each polynomial and, hence, get a clearer cross validation error from this data set, we will run each cross validation 100 times.

auto <- read.csv(repo_url, header = FALSE, sep = "")
colnames(auto) <- c("displacement", "horsepower", "weight", "acceleration", "mpg")

#set.seed(8675309)

cv.err5 <- matrix(c(seq(1, 8, 1), rep(0, 8)), nrow = 8, ncol = 2)
iters <- 100

#loop data through 1st order - 8th order polynomial
#creating 5 folds at each itteration
for (j in 1:iters) {
    for (i in 1:nrow(cv.err5)) {
        tmp_fit <- glm(mpg~poly(
            displacement+horsepower+weight+acceleration,i),
            data=auto)
        cv.err5[i, 2] <- cv.err5[i, 2] + cv.glm(auto, tmp_fit, K = 5)$delta[1]
    }
}

cv.err5[,2] <- cv.err5[,2]/iters

min_cv.poly <- cv.err5[min(cv.err5[,2]) == cv.err5[,2], 1] 

A plot of the cross validation errors for each degree of polynomial is below.

plot(cv.err5[,1], cv.err5[,2], type = 'b')

Using the method outlined above the minimum cross validation error occurs with a polynomial of 2.

Looking at the change in fit relative to 1st order polynomial

Looking at the residual error using the 2nd degree polynomial.

cv.best <- glm(mpg~poly(displacement+horsepower+weight+acceleration,min_cv.poly),data=auto)
plot(cv.best$residuals~cv.best$fitted.values)
abline(h=0, lty = 3)

Comparing it to the previous assignment’s 1st degree polynomial

poly1 <- glm(mpg~poly(displacement+horsepower+weight+acceleration,1),data=auto)
plot(poly1$residuals~poly1$fitted.values)
abline(h=0, lty = 3)

The fit, for the data set at least, is a lot better using the polynomial of degree min_cv.poly. If this is applicable to the population at large is a more difficult assertion to make, but the cross validation method makes us a lot more sure of our estimations than simple linear regression techniques alone.