IS 605 FUNDAMENTALS OF COMPUTATIONAL MATHEMATICS - WEEK 12 | Data Analytics
Instruction 1:
Using the stats and boot libraries in R, perform a cross-validation experiment to observe the bias-variance tradeoff. You'll use the auto data set from previous assignments. This dataset has 392 observations across 5 variables. We want to fit polynomial models of various degrees using the glm function in R and then measure the cross-validation error using the cv.glm function. Fit various polynomial models to compute mpg as a function of the other four variables (acceleration, weight, horsepower, and displacement) using the glm function.
options(warn = -1)
library(knitr)
library(stats)
library(boot)
autompg <- read.table("https://raw.githubusercontent.com/mascotinme/GitHub/master/MSDA%20605/auto-mpg.txt", col.names = c("displacement", "hp", "weight", "acceleration", "mpg"))
kable(head(autompg)) # A glimpse of the dataset

| displacement | hp | weight | acceleration | mpg |
|---|---|---|---|---|
| 307 | 130 | 3504 | 12.0 | 18 |
| 350 | 165 | 3693 | 11.5 | 15 |
| 318 | 150 | 3436 | 11.0 | 18 |
| 304 | 150 | 3433 | 12.0 | 16 |
| 302 | 140 | 3449 | 10.5 | 17 |
| 429 | 198 | 4341 | 10.0 | 15 |
str(autompg) # The data structure
## 'data.frame': 392 obs. of 5 variables:
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ hp : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
kable(summary(autompg)) # Descriptive statistics of the variables

| displacement | hp | weight | acceleration | mpg |
|---|---|---|---|---|
| Min. : 68.0 | Min. : 46.0 | Min. :1613 | Min. : 8.00 | Min. : 9.00 |
| 1st Qu.:105.0 | 1st Qu.: 75.0 | 1st Qu.:2225 | 1st Qu.:13.78 | 1st Qu.:17.00 |
| Median :151.0 | Median : 93.5 | Median :2804 | Median :15.50 | Median :22.75 |
| Mean :194.4 | Mean :104.5 | Mean :2978 | Mean :15.54 | Mean :23.45 |
| 3rd Qu.:275.8 | 3rd Qu.:126.0 | 3rd Qu.:3615 | 3rd Qu.:17.02 | 3rd Qu.:29.00 |
| Max. :455.0 | Max. :230.0 | Max. :5140 | Max. :24.80 | Max. :46.60 |
Instruction 2:
glm.fit = glm(mpg ~ poly(disp + hp + wt + acc, 2), data = auto)
cv.err5[2] = cv.glm(auto, glm.fit, K = 5)$delta[1]
These two lines fit a 2nd-degree polynomial function between mpg and the remaining 4 variables and perform 5-fold cross-validation. The result is stored in the cv.err5 array. cv.glm returns the estimated cross-validation error and its adjusted value in a component called delta. Please see the help page for cv.glm for more information.
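To make the shape of the value returned by cv.glm concrete, here is a minimal sketch on synthetic data. The data frame `toy`, its columns `x` and `y`, and the seed are assumptions made up for illustration; they are not part of the assignment or the auto data set.

```r
library(boot)

set.seed(1)  # arbitrary seed so the random fold assignment is repeatable

# Hypothetical toy data: y is quadratic in x plus Gaussian noise
toy <- data.frame(x = runif(100, -2, 2))
toy$y <- 1 + 2 * toy$x - 1.5 * toy$x^2 + rnorm(100, sd = 0.5)

toy.fit <- glm(y ~ poly(x, 2), data = toy)  # gaussian family by default
toy.cv  <- cv.glm(toy, toy.fit, K = 5)      # 5-fold cross-validation

toy.cv$delta     # vector of length 2: raw estimate, then adjusted estimate
toy.cv$delta[1]  # the raw estimate, the quantity stored in cv.err5
```

Indexing with `[1]` keeps only the raw estimate, which is why the assignment stores `$delta[1]` for each degree.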
Once you have fit the various polynomials from degree 1 to 8, you can plot the cross-validation error function as
degree = 1:8
plot(degree, cv.err5, type = 'b')
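training <- autompg          # data used to fit each model
crossvalidation <- autompg   # data passed to cv.glm (the same 392 rows)
set.seed(8)                  # fix the random fold assignment
cv.err <- numeric(8)         # 5-fold CV error for degrees 1 to 8
for (n in 1:8) {
  # Degree-n polynomial in the sum of the four predictors
  fit <- glm(mpg ~ poly(displacement + hp + weight + acceleration, n), data = training)
  # delta[1] is the raw cross-validation estimate of prediction error
  cv.err[n] <- cv.glm(crossvalidation, fit, K = 5)$delta[1]
}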
cv.err
## [1] 18.40003 16.95578 17.44431 17.05143 17.17140 17.44844 17.00235 16.97621
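One detail of the model formula is worth checking: inside poly() the `+` is ordinary arithmetic, so `poly(displacement + hp + weight + acceleration, n)` is a polynomial in the single summed variable rather than a separate polynomial per predictor. The sketch below illustrates this on synthetic data; the data frame `d` and its columns `a`, `b`, `s` and `y` are hypothetical names chosen for the example, and the per-predictor model `f3` is shown only for contrast, not as the assignment's intended method.

```r
set.seed(2)  # arbitrary seed for the synthetic data

# Hypothetical data with two predictors
d <- data.frame(a = rnorm(50), b = rnorm(50))
d$y <- d$a + d$b^2 + rnorm(50)
d$s <- d$a + d$b  # the sum, computed explicitly

f1 <- glm(y ~ poly(a + b, 2), data = d)            # polynomial in the sum
f2 <- glm(y ~ poly(s, 2), data = d)                # same model, sum precomputed
f3 <- glm(y ~ poly(a, 2) + poly(b, 2), data = d)   # one polynomial per predictor

all.equal(unname(coef(f1)), unname(coef(f2)))  # TRUE: f1 and f2 are the same fit
length(coef(f1))  # 3 coefficients: intercept plus 2 polynomial terms
length(coef(f3))  # 5 coefficients: intercept plus 2 terms per predictor
```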
Perform the polynomial fit and then plot the resulting 5-fold cross-validation curve. Your output should show the characteristic U-shape illustrating the tradeoff between bias and variance.
degree <- 1:8
plot(degree, y = cv.err, type='b',xlab = "Polynomial Degree", ylab = "Cross Validation Error", main = "Plot of Tradeoff Between Bias and Variance")
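As a complement to reading the curve by eye, the degree with the smallest cross-validation error can be picked out with which.min. The sketch below copies the eight values from the printed cv.err output into a new vector, `cv.err.reported` (a name introduced here for illustration), so that it runs on its own without refitting the models.

```r
# Cross-validation errors copied from the printed cv.err output (degrees 1 to 8)
cv.err.reported <- c(18.40003, 16.95578, 17.44431, 17.05143,
                     17.17140, 17.44844, 17.00235, 16.97621)

best.degree <- which.min(cv.err.reported)  # index of the smallest error
best.degree                                # 2
cv.err.reported[best.degree]               # 16.95578
```

Because the fold assignment is random, a different seed may shift these values and possibly the selected degree.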