library(stats)  # attached by default in R, loaded explicitly here
library(boot)   # provides cv.glm() for cross-validation
autodata <- read.table('auto-mpg.data', col.names = c('disp', 'hp', 'weight', 'accel', 'mpg'))
head(autodata)
## disp hp weight accel mpg
## 1 307 130 3504 12.0 18
## 2 350 165 3693 11.5 15
## 3 318 150 3436 11.0 18
## 4 304 150 3433 12.0 16
## 5 302 140 3449 10.5 17
## 6 429 198 4341 10.0 15
summary(autodata)
## disp hp weight accel
## Min. : 68.0 Min. : 46.0 Min. :1613 Min. : 8.00
## 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225 1st Qu.:13.78
## Median :151.0 Median : 93.5 Median :2804 Median :15.50
## Mean :194.4 Mean :104.5 Mean :2978 Mean :15.54
## 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615 3rd Qu.:17.02
## Max. :455.0 Max. :230.0 Max. :5140 Max. :24.80
## mpg
## Min. : 9.00
## 1st Qu.:17.00
## Median :22.75
## Mean :23.45
## 3rd Qu.:29.00
## Max. :46.60
Now, let's fit models for polynomials of degree 1 through 9 and measure their performance. We will estimate the cross-validation error for each model using the cv.glm() function and then plot those errors.
# Our training and cross-validation sets are the full data set;
# cv.glm() takes care of splitting the data into folds internally.
trainingset <- autodata
crossvalidset <- autodata
set.seed(9)  # for reproducible fold assignment
cv.err <- numeric(9)
for (i in 1:9) {
  # NOTE: poly(disp + hp + weight + accel, i) fits a degree-i orthogonal
  # polynomial in the SUM of the four predictors, not per-predictor terms.
  fit <- glm(mpg ~ poly(disp + hp + weight + accel, i), data = trainingset)
  # 5-fold cross-validation; delta[1] is the raw CV estimate of prediction error
  cv.err[i] <- cv.glm(crossvalidset, fit, K = 5)$delta[1]
}
cv.err
## [1] 18.37795 17.04847 17.17806 17.19475 17.04791 17.29740 16.98568 16.75582
## [9] 18.38613
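One caveat about the model formula: poly(disp + hp + weight + accel, i) builds a degree-i orthogonal polynomial in the sum of the four predictors, not separate polynomial terms for each one. If per-predictor terms are what you intend, a sketch of that variant looks like the following (its errors would differ from the cv.err values reported above):
# Hypothetical variant: a degree-i polynomial term for each predictor separately
cv.err.sep <- numeric(9)
for (i in 1:9) {
  fit.sep <- glm(mpg ~ poly(disp, i) + poly(hp, i) + poly(weight, i) + poly(accel, i),
                 data = trainingset)
  cv.err.sep[i] <- cv.glm(crossvalidset, fit.sep, K = 5)$delta[1]
}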
plot(x = 1:9, y = cv.err, type = "b",
     xlab = "Polynomial Degree", ylab = "Cross-Validation Error",
     main = "Bias / Variance Tradeoff")
The above plot of cross-validation errors shows the typical U-shaped curve of the bias/variance tradeoff. At degree 1 the model under-fits (high bias, low variance). As the polynomial degree increases, the bias decreases without much increase in variance. Pushing the degree higher eventually over-fits the model (bias keeps falling, but variance grows), which appears as the jump in cross-validation error at degree 9. Although degree 8 happens to give the numerically lowest error here (16.76), its improvement over the 2nd- or 3rd-degree fits is marginal, so a low-degree polynomial such as degree 3 is a reasonable choice: it keeps the cross-validation error near its minimum while holding both bias and variance low.
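We can confirm this reading of the plot directly from the cv.err vector computed above. A short check (which.min() is base R; the 2% tolerance in the second rule is an arbitrary choice for illustration):
# Degree with the lowest CV error (degree 8 for the errors printed above)
which.min(cv.err)
# Smallest degree whose error is within 2% of the minimum -- a simple
# parsimony rule favoring a low-degree fit (degree 2 for these errors)
which(cv.err <= 1.02 * min(cv.err))[1]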