The dataset used is Auto, which is part of the ISLR package.
library(ISLR)
dim(Auto) #check number of rows and columns in dataset. We notice 392 rows and 9 columns
## [1] 392 9
head(Auto) # check sample of dataset to see what it contains
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
set.seed(1)
tr = sample(392,196) # use sample() to randomly choose 196 of the 392 row indices. The dataset contains 392 rows and we wish to use 196 of them as the training set
lm.fit=lm(mpg~horsepower,data =Auto, subset=tr) # performing linear regression of mpg on horsepower using the training subset of the Auto data
attach(Auto)
mean((mpg-predict(lm.fit,Auto))[-tr]^2) # validation MSE: mean squared prediction error on the observations held out of the training set
## [1] 26.14142
lm.fit2 =lm(mpg~poly(horsepower,2),data=Auto,subset =tr) # performing regression on quadratic form
mean((mpg-predict(lm.fit2,Auto))[-tr]^2)
## [1] 19.82259
lm.fit3 =lm(mpg~poly(horsepower,3),data=Auto,subset =tr) # performing regression on 3rd order
mean((mpg-predict(lm.fit3,Auto))[-tr]^2)
## [1] 19.78252
We notice that the validation set error (test MSE) drops sharply when moving from the linear to the quadratic fit, with essentially no further improvement from the cubic fit.
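Note that these estimates depend on which observations happen to fall in the training set. As an illustration (tr2 and lm.fit.v are our names, and this code is not part of the original output), repeating the split with a different seed gives somewhat different validation errors:
set.seed(2) # a different seed gives a different random split
tr2 =sample(392,196)
lm.fit.v =lm(mpg~poly(horsepower,2),data=Auto,subset=tr2)
mean((Auto$mpg-predict(lm.fit.v,Auto))[-tr2]^2) # validation MSE for the quadratic fit on this new split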
The LOOCV estimate can be computed automatically for any generalized linear model using the glm() and cv.glm() functions. The cv.glm() function is part of the boot library.
library(boot)
glm.fit=glm(mpg~horsepower, data =Auto) # performing the regression of mpg on horsepower using dataset Auto
cv.err =cv.glm(Auto,glm.fit) #cv.glm() is used to perform the cross validation on Auto dataset
cv.err$delta # cv.glm() produces a list with several components. The cross-validation errors are stored in the delta vector: the first value is the standard CV estimate, the second is a bias-corrected version
## [1] 24.23151 24.23114
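For a least squares fit such as this one, the LOOCV error can also be obtained without refitting the model n times, using the shortcut formula CV(n) = (1/n) * sum(((yi - ŷi)/(1 - hi))^2), where hi is the leverage of observation i. A minimal sketch of this check (the object name lm.full is ours, not part of the original code):
lm.full =lm(mpg~horsepower,data=Auto)
mean(((Auto$mpg-fitted(lm.full))/(1-hatvalues(lm.full)))^2) # should reproduce the first element of cv.err$delta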
To compute the CV errors for polynomial regressions up to the 5th order, we use a for loop.
cv.err =rep(0,5) # creates a vector of five zeros in which we will store the CV errors
for (i in 1:5){
glm.fit = glm(mpg~poly(horsepower,i),data =Auto) # fit the regression of mpg on a degree-i polynomial of horsepower using the Auto dataset
cv.err[i] =cv.glm(Auto,glm.fit)$delta[1] # calculating the LOOCV error on the Auto dataset
}
cv.err
## [1] 24.23151 19.24821 19.33498 19.42443 19.03321
The above values are the LOOCV errors for polynomial fits up to the 5th order.
The cv.glm() function can also be used to implement k-fold CV. Using the same for loop as above with K = 10, we compute the CV errors corresponding to polynomial fits of orders one to ten.
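A quick plot of these errors (illustrative, not part of the original output) makes the sharp drop from the linear to the quadratic fit, and the flat behaviour afterwards, easy to see:
plot(1:5,cv.err,type="b",xlab="Polynomial degree",ylab="LOOCV error") # connect the points to show the trend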
set.seed(17)
cv.err.10 = rep(0,10) # initialize a vector of ten zeros in which we will store the CV errors
for(i in 1:10){
glm.fit=glm(mpg~poly(horsepower,i), data=Auto) #performing the regression of mpg on horsepower using dataset Auto for polynomial value =i
cv.err.10[i]=cv.glm(Auto,glm.fit,K=10)$delta[1] # defining K=10 means we are using 10 folds
}
cv.err.10
## [1] 24.20520 19.18924 19.30662 19.33799 18.87911 19.02103 18.89609
## [8] 19.71201 18.95140 19.50196
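One practical reason to prefer k-fold CV over LOOCV is computation: with cv.glm(), LOOCV refits the model n = 392 times, while 10-fold CV needs only 10 fits. A rough timing comparison (illustrative only; glm.fit.q is our name):
glm.fit.q =glm(mpg~poly(horsepower,2),data=Auto)
system.time(cv.glm(Auto,glm.fit.q)) # LOOCV: 392 model fits
system.time(cv.glm(Auto,glm.fit.q,K=10)) # 10-fold CV: 10 model fits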
Performing a bootstrap analysis in R entails only two steps. First, we must create a function that computes the statistic of interest. Second, we use the boot() function, which is part of the boot library, to perform the bootstrap by repeatedly sampling observations from the data set with replacement.
Let’s use the bootstrap approach to assess the variability of the estimates of β0 and β1 for the linear regression model that uses horsepower to predict mpg in the Auto data set.
# we first create a function to return the coefficient of linear regression model
boot.fn =function(data,index)
return(coef(glm(mpg~horsepower,data=data,subset=index))) # returns the coefficients of the regression of mpg on horsepower, fitted on the dataset passed in as "data", restricted to the rows passed in as "index"
# we check if our function is working fine
boot.fn(Auto,1:392) # the dataset is Auto and the subset is rows 1 to 392, i.e. all rows in the dataset
## (Intercept) horsepower
## 39.9358610 -0.1578447
Using the boot.fn() created above, we can manually create bootstrap estimates. Remember that the bootstrap works by resampling the training dataset with replacement.
set.seed(1)
boot.fn(Auto,sample(392,392,replace=T)) # draw a bootstrap sample of 392 rows from the 392 rows of Auto, with replacement, and refit the model on it
## (Intercept) horsepower
## 38.7387134 -0.1481952
If we ran this again without resetting the seed (or with a different seed), the bootstrap sample would be shuffled differently and the coefficient estimates would change. Here we reset the seed to the same value, so the same sample, and hence the same estimates, are reproduced:
set.seed(1)
boot.fn(Auto,sample(392,392,replace=T)) # same call as above; because set.seed(1) was reset to the same value, the same bootstrap sample is drawn
## (Intercept) horsepower
## 38.7387134 -0.1481952
Now we use the boot() function to compute the standard errors of 1,000 bootstrap estimates for the intercept and slope terms.
boot(Auto,boot.fn,1000) # run 1,000 bootstrap replications of boot.fn (the linear regression coefficients) on the Auto dataset
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Auto, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 39.9358610 0.0296667441 0.860440524
## t2* -0.1578447 -0.0003113047 0.007411218
This indicates that the bootstrap estimate of SE(β̂0) is 0.86, and that the bootstrap estimate of SE(β̂1) is 0.0074.
Since we are bootstrapping a linear regression, we can also obtain standard error estimates directly from the regression summary:
summary(glm(mpg~horsepower,data=Auto))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.9358610 0.717498656 55.65984 1.220362e-187
## horsepower -0.1578447 0.006445501 -24.48914 7.031989e-81
Comparing the standard error estimates from the bootstrap with those from the regression formulas, we notice that the bootstrap estimates are higher. We consider the bootstrap estimates to be more reliable because:
1. The standard errors from the regression formulas depend on the unknown parameter σ^2, the noise variance, which must itself be estimated from the data.
2. The formulas assume that the xi are fixed, so that all the variability comes from the variation in the errors εi.
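As a compact way of seeing the comparison discussed above, the following sketch puts the two sets of standard error estimates side by side (the object names boot.out, se.boot and se.formula are ours, not part of the original code):
boot.out =boot(Auto,boot.fn,1000) # rerun the bootstrap and keep the replicates
se.boot =apply(boot.out$t,2,sd) # bootstrap standard errors of the intercept and slope
se.formula =summary(lm(mpg~horsepower,data=Auto))$coef[,"Std. Error"] # formula-based standard errors
rbind(bootstrap=se.boot,formula=se.formula)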