Resampling Methods

Validation Set Approach

Dataset used: Auto, which is part of the ISLR package.

library(ISLR)
dim(Auto)   # check the number of rows and columns in the dataset; we see 392 rows and 9 columns
## [1] 392   9
head(Auto)  # inspect a sample of the dataset to see what it contains
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
set.seed(1)
tr = sample(392,196) # randomly choose 196 of the 392 row indices; the dataset has 392 rows and we use 196 of them as the training sample

lm.fit=lm(mpg~horsepower,data =Auto, subset=tr)  # linear regression of mpg on horsepower, fit on the training subset of the Auto data

attach(Auto)
mean((mpg-predict(lm.fit,Auto))[-tr]^2)  # test MSE on the validation set (the observations not in tr)
## [1] 26.14142
lm.fit2 =lm(mpg~poly(horsepower,2),data=Auto,subset =tr) # performing regression on quadratic form 
mean((mpg-predict(lm.fit2,Auto))[-tr]^2)
## [1] 19.82259
lm.fit3 =lm(mpg~poly(horsepower,3),data=Auto,subset =tr) # performing regression on 3rd order 
mean((mpg-predict(lm.fit3,Auto))[-tr]^2)
## [1] 19.78252

We notice that the validation-set MSE drops sharply when moving from the linear to the quadratic fit, while the cubic fit offers essentially no further improvement.
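
A quick sketch of my own (tr2 and lm.fit.b are hypothetical names, not part of the original lab): the validation-set estimate depends on the random split, so repeating the exercise with a different seed yields a different MSE.

set.seed(2)                        # a different random split
tr2 = sample(392,196)
lm.fit.b = lm(mpg~horsepower, data=Auto, subset=tr2)
mean((Auto$mpg - predict(lm.fit.b, Auto))[-tr2]^2)  # validation MSE will differ from the 26.14 above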

Leave-One-Out Cross-Validation

The LOOCV estimate can be computed automatically for any generalized linear model using the glm() and cv.glm() functions; cv.glm() is part of the boot library. Note that glm() without a family argument fits an ordinary linear regression, just like lm().
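
As a quick sanity check (my own addition, not in the original lab), we can confirm that glm() with no family argument gives the same coefficients as lm():

coef(glm(mpg~horsepower, data=Auto))  # gaussian family by default: ordinary least squares
coef(lm(mpg~horsepower, data=Auto))   # identical estimates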

library(boot)
glm.fit=glm(mpg~horsepower, data =Auto) # regression of mpg on horsepower; glm() is used so that cv.glm() can be applied
cv.err =cv.glm(Auto,glm.fit) # cv.glm() performs leave-one-out cross-validation on the Auto dataset by default
cv.err$delta # cv.glm() returns a list with several components; the cross-validation errors are in the delta vector
## [1] 24.23151 24.23114
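
To make explicit what cv.glm() is computing, here is a minimal manual LOOCV sketch of my own (n, loocv.errs, and fit.j are hypothetical names): fit the model n times, each time leaving out one observation and predicting it.

n = nrow(Auto)
loocv.errs = rep(0, n)
for (j in 1:n){
  fit.j = lm(mpg~horsepower, data=Auto[-j, ])                   # fit on all rows except row j
  loocv.errs[j] = (Auto$mpg[j] - predict(fit.j, Auto[j, ]))^2   # squared error on the held-out row
}
mean(loocv.errs)  # should match cv.err$delta[1], about 24.23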

To calculate the LOOCV errors for polynomial regressions up to 5th order, we use a for loop:

cv.err =rep(0,5)   # creates a vector of five zeros in which we will store the CV errors
for (i in 1:5){
  glm.fit = glm(mpg~poly(horsepower,i),data =Auto)  #performing the regression of mpg on horsepower using dataset Auto for polynomial value =i
  cv.err[i] =cv.glm(Auto,glm.fit)$delta[1] # calculating the LOOCV error on the Auto dataset
}
cv.err
## [1] 24.23151 19.24821 19.33498 19.42443 19.03321

The values above are the LOOCV errors for polynomial fits of orders one through five; the error drops sharply from the linear to the quadratic fit and then levels off.
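
To visualize this pattern, we can plot the errors against polynomial degree (a quick sketch, not part of the original output):

plot(1:5, cv.err, type="b", xlab="Polynomial degree", ylab="LOOCV error")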

k-fold Cross-Validation

The cv.glm() function can also be used to implement k-fold CV. Using a for loop like the one above with K = 10, we compute the CV errors corresponding to polynomial fits of orders one to ten.

set.seed(17)
cv.err.10 = rep(0,10) # initialize a vector of ten zeros in which we will store the CV errors
for(i in 1:10){
  glm.fit=glm(mpg~poly(horsepower,i), data=Auto) #performing the regression of mpg on horsepower using dataset Auto for polynomial value =i
  cv.err.10[i]=cv.glm(Auto,glm.fit,K=10)$delta[1] # defining K=10 means we are using 10 folds
}
cv.err.10
##  [1] 24.20520 19.18924 19.30662 19.33799 18.87911 19.02103 18.89609
##  [8] 19.71201 18.95140 19.50196
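
As a quick check (my own addition), we can ask which degree minimizes the estimated test error:

which.min(cv.err.10)  # degree 5 has the smallest 10-fold CV error (18.88) in this run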

Bootstrap

Performing a bootstrap analysis in R entails only two steps. First, we create a function that computes the statistic of interest. Second, we use the boot() function, which is part of the boot library, to perform the bootstrap by repeatedly sampling observations from the dataset with replacement.
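
As a minimal illustration of these two steps (a sketch of my own, not part of the original lab; mean.fn is a hypothetical helper), here is a bootstrap of the sample mean of mpg:

mean.fn = function(data, index) mean(data$mpg[index])  # step 1: the statistic of interest (mean mpg on the resampled rows)
set.seed(1)
boot(Auto, mean.fn, R = 100)  # step 2: boot() draws 100 bootstrap samples with replacement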

Let’s use the bootstrap approach to assess the variability of the estimates of β0 and β1 for the linear regression model that uses horsepower to predict mpg in the Auto dataset.

# we first create a function to return the coefficients of the linear regression model
boot.fn =function(data,index)
  return(coef(glm(mpg~horsepower,data=data,subset=index))) # returns the coefficients of the regression of mpg on horsepower, fit on the rows of "data" selected by "index"

# we check if our function is working fine

boot.fn(Auto,1:392)  # dataset is Auto and the subset is rows 1 to 392, i.e. the full dataset
## (Intercept)  horsepower 
##  39.9358610  -0.1578447

Using the boot.fn() function created above, we can manually create bootstrap estimates of the coefficients. Remember that the bootstrap works by resampling the training dataset with replacement.

set.seed(1)
boot.fn(Auto,sample(392,392,replace=T))  # draw a bootstrap sample: 392 row indices sampled from 1:392 with replacement
## (Intercept)  horsepower 
##  38.7387134  -0.1481952

Running this again would ordinarily give different estimates, since a new bootstrap sample is drawn each time. Because we reset the seed to 1 below, however, the same sample is drawn and the estimates come out identical:

set.seed(1)
boot.fn(Auto,sample(392,392,replace=T))  # with the seed reset to 1, the same bootstrap sample is drawn, so the estimates match those above
## (Intercept)  horsepower 
##  38.7387134  -0.1481952
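
To actually see different estimates, use a different seed (output omitted since it depends on the draw):

set.seed(2)  # a different seed draws a different bootstrap sample
boot.fn(Auto,sample(392,392,replace=T))  # the coefficient estimates will differ from those above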

Now we use the boot() function to compute the standard errors of 1,000 bootstrap estimates for the intercept and slope terms.

boot(Auto,boot.fn,1000) # use boot() with the Auto dataset and boot.fn to compute 1,000 bootstrap estimates of the regression coefficients
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##       original        bias    std. error
## t1* 39.9358610  0.0296667441 0.860440524
## t2* -0.1578447 -0.0003113047 0.007411218

This indicates that the bootstrap estimate of SE(β̂0) is 0.86 and the bootstrap estimate of SE(β̂1) is 0.0074.

Since we are bootstrapping a linear regression, we can also obtain standard error estimates directly from the regression output:

summary(glm(mpg~horsepower,data=Auto))$coef
##               Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) 39.9358610 0.717498656  55.65984 1.220362e-187
## horsepower  -0.1578447 0.006445501 -24.48914  7.031989e-81

Comparing the standard error estimates from the bootstrap with those from the regression output, we notice that the bootstrap estimates are higher. We consider the bootstrap estimates more accurate because:

1. The formula-based standard errors depend on the unknown noise variance σ^2, which is estimated from the residuals; because the true relationship is nonlinear, the residuals from a linear fit are inflated, and so is the estimate of σ^2.
2. The formulas assume the xi are fixed, so that all the variability comes from the variation in the errors εi.

The bootstrap approach does not rely on either assumption, so it likely gives a more accurate estimate of the standard errors of β̂0 and β̂1.
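
Since the quadratic model fits the data better (see the CV results above), its formula-based and bootstrap standard errors should agree more closely. Here is a sketch of my own to check this, following the same pattern as boot.fn above (boot.fn2 is a hypothetical helper; output omitted):

boot.fn2 = function(data,index)  # same idea as boot.fn, but for the quadratic model
  coef(lm(mpg~horsepower+I(horsepower^2),data=data,subset=index))
set.seed(1)
boot(Auto,boot.fn2,1000)  # bootstrap standard errors for the quadratic fit
summary(lm(mpg~horsepower+I(horsepower^2),data=Auto))$coef  # formula-based standard errors for comparison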