R Labs from ISL Book(Trevor&Hastie)

Chapter 5 - Resampling methods

In this lab, we explore the resampling techniques covered in this chapter. Some of the commands in this lab may take a while to run on your computer.

The Validation Set Approach

We explore the use of the validation set approach in order to estimate the test error rates that result from fitting various linear models on the Auto data set. Before we begin, we use:

set.seed() - in order to set a seed for R’s random number generator, so that we get precisely the same results for every run. It is generally a good idea to set a random seed when performing an analysis such as cross-validation that contains an element of randomness, so that the results obtained can be reproduced precisely at a later time.
sample() - is used to split the set of observations into two halves. In this case, by selecting a random subset of 196 observations out of the original 392 observations referred to as the training set here onwards.

library (ISLR)

## Warning: package 'ISLR' was built under R version 3.4.3

set.seed (3)
train=sample(392 ,196)

(Here we use a shortcut in the sample command; see ?sample for details.) We then use the subset option in lm() to fit a linear regression using only the observations corresponding to the training set.

lm.fit = lm(mpg∼horsepower ,data=Auto ,subset =train )

We now use:-

predict() - to estimate the response for all 392 observations, and we use the
mean() - to calculate the MSE of the 196 observations in the validation set. Note that the -train index below selects only the observations that are not in the training set.

attach (Auto)
mean((mpg -predict (lm.fit ,Auto))[-train ]^2)

## [1] 26.29134

Therefore, the estimated test MSE for the linear regression fit is 26.14. Similarly, we can use the poly() function to estimate the test error for the quadratic and cubic regressions.

lm.fit2=lm(mpg∼poly(horsepower ,2) ,data=Auto ,subset =train )
mean((mpg -predict (lm.fit2 ,Auto))[-train ]^2)

## [1] 21.5041

lm.fit3=lm(mpg∼poly(horsepower ,3) ,data=Auto ,subset =train )
mean((mpg -predict (lm.fit3 ,Auto))[-train ]^2)

## [1] 21.50825

These error rates are 19.82 and 19.78, respectively. If we choose a different training set instead, then we will obtain somewhat different errors on the validation set.

set.seed (2)
train=sample (392 ,196)
lm.fit =lm(mpg∼horsepower ,subset =train)
mean((mpg -predict (lm.fit ,Auto))[-train ]^2)

## [1] 23.29559

lm.fit2=lm(mpg∼poly(horsepower ,2) ,data=Auto ,subset =train )
mean((mpg -predict (lm.fit2 ,Auto))[-train ]^2)

## [1] 18.90124

lm.fit3=lm(mpg∼poly(horsepower ,3) ,data=Auto ,subset =train )
mean((mpg -predict (lm.fit3 ,Auto))[-train ]^2)

## [1] 19.2574

Using this split of the observations into a training set and a validation set, we find that the validation set error rates for the models with linear, quadratic, and cubic terms are 23.30, 18.90, and 19.26, respectively. These results are consistent with our previous findings: a model that predicts mpg using a quadratic function of horsepower performs better than a model that involves only a linear function of horsepower, and there is little evidence in favor of a model that uses a cubic function of horsepower.

Leave-One-Out Cross-Validation

The LOOCV estimate can be automatically computed for any generalized linear model using the glm() and cv.glm() functions. Earlier in Chapter 4 Lab we used:-

glm() - to perform logistic regression by passing in the family=“binomial” argument. But if we use glm() to fit a model without passing in the family argument, then it performs linear regression, just like the lm() function. So for instance,

glm.fit=glm(mpg∼horsepower ,data=Auto)
coef(glm.fit)

## (Intercept)  horsepower 
##  39.9358610  -0.1578447

and

lm.fit =lm(mpg∼horsepower ,data=Auto)
coef(lm.fit)

## (Intercept)  horsepower 
##  39.9358610  -0.1578447

yield identical linear regression models. In this lab, we will perform linear regression using the glm() function rather than the lm() function because the former can be used together with cv.glm(). The cv.glm() function is part of the boot library.

library (boot)
glm.fit=glm(mpg∼horsepower ,data=Auto)
cv.err =cv.glm(Auto ,glm.fit)
cv.err$delta

## [1] 24.23151 24.23114

cv.glm() - produces a list with several components. The two numbers in the delta vector contain the cross-validation results. In this case the numbers are identical (up to two decimal places) and correspond to the LOOCV statistic given in (5.1). Below, we discuss a situation in which the two numbers differ. Our cross-validation estimate for the test error is approximately 24.23.
We can repeat this procedure for increasingly complex polynomial fits. To automate the process, we use the for() function to initiate a for loop which iteratively fits polynomial regressions for polynomials of order i = 1 to i = 5, computes the associated cross-validation error, and stores it in the ith element of the vector cv.error. We begin by initializing the vector. This command will likely take a couple of minutes to run.

cv.error=rep (0,5)
for (i in 1:5){
  glm.fit=glm(mpg∼poly(horsepower ,i),data=Auto)
  cv.error[i]=cv.glm (Auto ,glm.fit)$delta [1]
  }
cv.error

## [1] 24.23151 19.24821 19.33498 19.42443 19.03321

As in Figure 5.4, we see a sharp drop in the estimated test MSE between the linear and quadratic fits, but then no clear improvement from using higher-order polynomials.

k-Fold Cross-Validation

The cv.glm() function can also be used to implement k-fold CV. Below we use k = 10, a common choice for k, on the Auto data set. We once again set a random seed and initialize a vector in which we will store the CV errors corresponding to the polynomial fits of orders one to ten.

set.seed (17)
cv.error.10 = rep(0 ,10)
for (i in 1:10) {
glm.fit=glm(mpg∼poly(horsepower ,i),data=Auto)
cv.error.10[i]=cv.glm(Auto ,glm.fit ,K=10)$delta [1]
 }
cv.error.10

##  [1] 24.20520 19.18924 19.30662 19.33799 18.87911 19.02103 18.89609
##  [8] 19.71201 18.95140 19.50196

Notice that the computation time is much shorter than that of LOOCV. (In principle, the computation time for LOOCV for a least squares linear model should be faster than for k-fold CV, due to the availability of the formula (5.2) for LOOCV; however, unfortunately the cv.glm() function does not make use of this formula.) We still see little evidence that using cubic or higher-order polynomial terms leads to lower test error than simply using a quadratic fit.
We saw in Section 5.3.2 that the two numbers associated with delta are essentially the same when LOOCV is performed. When we instead perform k-fold CV, then the two numbers associated with delta differ slightly. The first is the standard k-fold CV estimate, as in (5.3). The second is a bias-corrected version. On this data set, the two estimates are very similar to each other.

The Bootstrap

We illustrate the use of the bootstrap in the simple example of Section 5.2, as well as on an example involving estimating the accuracy of the linear regression model on the Auto data set.

Estimating the Accuracy of a Statistic of Interest

One of the great advantages of the bootstrap approach is that it can be applied in almost all situations. No complicated mathematical calculations are required. Performing a bootstrap analysis in R entails only two steps.
* First, we must create a function that computes the statistic of interest.
* Second, we use the boot() function, which is part of the boot library, to perform the bootstrap by repeatedly sampling observations from the data set with replacement.

The Portfolio data set in the ISLR package is described in Section 5.2. To illustrate the use of the bootstrap on this data, we must first create a function, alpha.fn(), which takes as input the (X, Y) data as well as a vector indicating which observations should be used to estimate α. The function then outputs the estimate for α based on the selected observations.

alpha.fn=function (data ,index){
  X=data$X[index]
  Y=data$Y[index]
  return ((var(Y)-cov (X,Y))/(var(X)+var(Y) -2* cov(X,Y)))
 }

This function returns, or outputs, an estimate for α based on applying eqn(5.7) to the observations indexed by the argument index. For instance, the following command tells R to estimate α using all 100 observations.

alpha.fn(Portfolio ,1:100)

## [1] 0.5758321

The next command uses the sample() function to randomly select 100 observations from the range 1 to 100, with replacement. This is equivalent to constructing a new bootstrap data set and recomputing ˆα based on the new data set.

set.seed (1)
alpha.fn(Portfolio ,sample (100 ,100 , replace =T))

## [1] 0.5963833

boot(Portfolio ,alpha.fn,R=1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Portfolio, statistic = alpha.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##      original        bias    std. error
## t1* 0.5758321 -7.315422e-05  0.08861826

We can implement a bootstrap analysis by

performing this alpha.fn() command above many times,
recording all of the corresponding estimates for α, and
computing the resulting standard deviation.

However, the boot() function automates this approach. In above code, using last line we produce R = 1,000 bootstrap estimates for α.
The final output shows that using the original data, ˆα = 0.5758, and that the bootstrap estimate for SE(ˆα) is 0.0886.

Estimating the Accuracy of a Linear Regression Model

The bootstrap approach can be used to

assess the variability of the coefficient estimates and predictions from a statistical learning method. Here we use the bootstrap approach in order to assess the variability of the estimates for β0 and β1, the intercept and slope terms for the linear regression model that uses horsepower to predict mpg in the Auto data set.

We will compare the estimates obtained using the bootstrap to those obtained using the formulas for SE(ˆβ0) and SE(ˆβ1) described in Section 3.1.2. We first create a simple function, boot.fn() which takes in the Auto data set as well as a set of indices for the observations, and returns the intercept and slope estimates for the linear regression model. We then apply this function to the full set of 392 observations in order to compute the estimates of β0 and β1 on the entire data set using the usual linear regression coefficient estimate formulas from Chapter 3. For a one line fn, we do not need the { and }.

boot.fn=function (data ,index )
   return (coef(lm(mpg∼horsepower ,data=data ,subset =index)))
boot.fn(Auto ,1:392)

## (Intercept)  horsepower 
##  39.9358610  -0.1578447

The boot.fn() function can also be used in order to create bootstrap estimates for the intercept and slope terms by randomly sampling from among the observations with replacement. Here we give two examples.

set.seed(1)
boot.fn(Auto ,sample (392 ,392 , replace =T))

## (Intercept)  horsepower 
##  38.7387134  -0.1481952

Next, we use:

boot() - to compute the standard errors of several (in this case 1,000) bootstrap estimates for the intercept and slope terms.

boot(Auto ,boot.fn ,1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##       original        bias    std. error
## t1* 39.9358610  0.0296667441 0.860440524
## t2* -0.1578447 -0.0003113047 0.007411218

This indicates that the bootstrap estimate for SE(ˆβ0) is 0.86, and that the bootstrap estimate for SE(ˆβ1) is 0.0074. As discussed in Section 3.1.2, standard formulas can be used to compute the standard errors for the regression coefficients in a linear model. These can be obtained using the summary() function.

summary(lm(mpg∼horsepower ,data=Auto))$coef

##               Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) 39.9358610 0.717498656  55.65984 1.220362e-187
## horsepower  -0.1578447 0.006445501 -24.48914  7.031989e-81

The standard error estimates for ˆβ0 and ˆβ1 obtained using the formulas from Section 3.1.2 are 0.717 for the intercept and 0.0064 for the slope. Interestingly, these are somewhat different from the estimates obtained using the bootstrap. Does this indicate a problem with the bootstrap? In fact, it suggests the opposite. Recall that the standard formulas given in Equation 3.8 on page 66 rely on certain assumptions. For example, they depend on the unknown parameter σ2, the noise variance.We then estimate σ2 using the RSS. Now although the formula for the standard errors do not rely on the linear model being correct, the estimate for σ2 does. We see in Figure 3.8 on page 91 that there is a non-linear relationship in the data, and so the residuals from a linear fit will be inflated, and so will ˆσ2. Secondly, the standard formulas assume (somewhat unrealistically) that the xi are fixed, and all the variability comes from the variation in the errors i. The bootstrap approach does not rely on any of these assumptions, and so it is likely giving a more accurate estimate of the standard errors of ˆβ0 and ˆβ1 than is the summary() function.

Below we compute the bootstrap standard error estimates and the standard linear regression estimates that result from fitting the quadratic model to the data. Since this model provides a good fit to the data (Figure 3.8), there is now a better correspondence between the bootstrap estimates and the standard estimates of SE(ˆβ0), SE(ˆβ1) and SE(ˆβ2).

boot.fn=function (data ,index )
  coefficients(lm(mpg∼horsepower + I( horsepower ^2) ,data=data ,
subset =index))
set.seed (1)
boot(Auto ,boot.fn ,1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original        bias     std. error
## t1* 56.900099702  6.098115e-03 2.0944855842
## t2* -0.466189630 -1.777108e-04 0.0334123802
## t3*  0.001230536  1.324315e-06 0.0001208339

summary (lm(mpg∼horsepower +I(horsepower ^2) ,data=Auto))$coef

##                     Estimate   Std. Error   t value      Pr(>|t|)
## (Intercept)     56.900099702 1.8004268063  31.60367 1.740911e-109
## horsepower      -0.466189630 0.0311246171 -14.97816  2.289429e-40
## I(horsepower^2)  0.001230536 0.0001220759  10.08009  2.196340e-21

R Labs from ISL Book(Trevor&Hastie)

Priyanka Gagneja

January 28, 2018