Using the stats and boot libraries in R, we perform a cross-validation experiment to observe the bias-variance tradeoff.

library(stats)
library(boot)

autodata <- read.table('auto-mpg.data', col.names = c('disp', 'hp', 'weight', 'accel', 'mpg'))
head(autodata)
##   disp  hp weight accel mpg
## 1  307 130   3504  12.0  18
## 2  350 165   3693  11.5  15
## 3  318 150   3436  11.0  18
## 4  304 150   3433  12.0  16
## 5  302 140   3449  10.5  17
## 6  429 198   4341  10.0  15
summary(autodata)
##       disp             hp            weight         accel      
##  Min.   : 68.0   Min.   : 46.0   Min.   :1613   Min.   : 8.00  
##  1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225   1st Qu.:13.78  
##  Median :151.0   Median : 93.5   Median :2804   Median :15.50  
##  Mean   :194.4   Mean   :104.5   Mean   :2978   Mean   :15.54  
##  3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615   3rd Qu.:17.02  
##  Max.   :455.0   Max.   :230.0   Max.   :5140   Max.   :24.80  
##       mpg       
##  Min.   : 9.00  
##  1st Qu.:17.00  
##  Median :22.75  
##  Mean   :23.45  
##  3rd Qu.:29.00  
##  Max.   :46.60

Now, let's fit polynomial models of degree 1 through 9 and measure their performance. We estimate the cross-validation error of each model with the cv.glm() function and plot those errors.

# Our training and cross-validation sets are both the whole data set; cv.glm() takes care of splitting it into folds.
trainingset = autodata
crossvalidset = autodata

set.seed(9)

cv.err <- c()

for(i in 1:9)
{
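  # Note: poly() is applied to the sum of the four predictors, so each model is a
  # degree-i polynomial in the single combined variable disp + hp + weight + accel.
  fit <- glm(mpg ~ poly(disp + hp + weight + accel, i), data=trainingset)

  # delta[1] is the raw (unadjusted) K-fold cross-validation estimate of prediction error
  cv.err[i] <- cv.glm(crossvalidset, fit, K=5)$delta[1]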
}
cv.err
## [1] 18.37795 17.04847 17.17806 17.19475 17.04791 17.29740 16.98568 16.75582
## [9] 18.38613
plot(x = 1:9, y = cv.err, type='b',xlab = "Polynomial Degree", ylab = "Cross Validation Error", main = "Bias / Variance Tradeoff")
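
As a small sketch of how the error vector can be read off numerically rather than by eye, the snippet below copies the nine values printed above and locates the smallest one. The name best.degree is introduced here only for illustration.

```r
# CV errors copied from the output above (degrees 1 to 9)
cv.err <- c(18.37795, 17.04847, 17.17806, 17.19475, 17.04791,
            17.29740, 16.98568, 16.75582, 18.38613)

# Index of the smallest error, which is also the polynomial degree
best.degree <- which.min(cv.err)
best.degree
## [1] 8

# Gap between each degree's error and the smallest error
round(cv.err - min(cv.err), 2)
```

The gaps show how close the other degrees come to the minimum, which matters when choosing between a simpler and a more flexible model.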

The plot of cross-validation errors reflects the relationship between bias and variance across the models, although the U shape is shallow. The degree-1 model under-fits (high bias, low variance) and has a cross-validation error of about 18.38. Moving to degree 2 lowers the error to about 17.05, and it then stays in a narrow band (roughly 16.76 to 17.30) through degree 8: the bias decreases without much of an increase in the variance. At degree 9 the error jumps back to about 18.39, which is the sign of an over-fit model (lower bias, higher variance). The lowest error occurs at degree 8 (16.76), but degrees 2 through 8 differ by only about 0.5, and the exact values depend on the random fold assignment. A low-degree polynomial such as degree 2 or 3 is therefore a reasonable choice: it comes close to the minimal cross-validation error with a much simpler model.
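
To make the K-fold procedure behind cv.glm() more concrete, here is a minimal hand-written sketch of 5-fold cross-validation. It is not the experiment above: it assumes the built-in mtcars data as a stand-in for auto-mpg.data, uses the single predictor wt, and the names folds, manual.cv and manual.err are hypothetical. Because the fold assignment differs, its numbers will not match those from cv.glm().

```r
# Minimal sketch of K-fold cross-validation written by hand.
# Assumption: the built-in mtcars data stands in for auto-mpg.data,
# with wt as the only predictor.
set.seed(9)

K <- 5
n <- nrow(mtcars)

# Randomly assign each row to one of K folds of near-equal size
folds <- sample(rep(1:K, length.out = n))

manual.cv <- function(degree) {
  sq.err <- numeric(n)
  for (k in 1:K) {
    held.out <- folds == k
    # Fit on the other K - 1 folds
    fit <- glm(mpg ~ poly(wt, degree), data = mtcars[!held.out, ])
    # Predict the held-out fold and record the squared errors
    pred <- predict(fit, newdata = mtcars[held.out, ])
    sq.err[held.out] <- (mtcars$mpg[held.out] - pred)^2
  }
  # Average squared prediction error over all rows
  mean(sq.err)
}

# Cross-validation error for polynomial degrees 1 to 4
manual.err <- sapply(1:4, manual.cv)
manual.err
```

Each row is predicted exactly once by a model that never saw it, so the averaged squared error estimates out-of-sample performance in the same spirit as the delta[1] value used above.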