# An Introduction to Statistical Learning with Applications in R
# Chapter 5 Lab: Cross-Validation and the Bootstrap
# In this lab, we explore the resampling techniques covered in this chapter.
# Some of the commands in this lab may take a while to run on your computer.
# The Validation Set Approach
# We explore the use of the validation set approach in order to estimate the
# test error rates that result from fitting various linear models on the Auto
# data set.
# Before we begin, we use the set.seed() function in order to set a seed for
# R’s random number generator, so that the reader of this book will obtain
# precisely the same results as those shown below. It is generally a good idea
# to set a random seed when performing an analysis such as cross-validation
# that contains an element of randomness, so that the results obtained can
# be reproduced precisely at a later time.
# We begin by using the sample() function to split the set of observations
# into two halves, by selecting a random subset of 196 observations out of
# the original 392 observations. We refer to these observations as the training
# set.
# load the ISLR package, which contains the data sets used in the book
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.4
# load Auto data and display data editor window
fix(Auto)
# Note: the Auto data frame contains 392 observations of 9 variables.
# set seed for reproducibility
set.seed(1)
# use sample() to get training set
# this selects 196 indices from a dataset with
# 392 rows.
train <- sample(392,196)
# We then use the subset option in lm() to fit a linear regression using only
# the observations corresponding to the training set.
lm.fit <- lm(mpg~horsepower,data=Auto,subset=train)
# attach() allows you to reference variable names directly in your code
# without prefixing them with the data frame name. This lab follows that
# approach, but I generally avoid it because it is poor practice from a
# reproducibility perspective.
attach(Auto)
# We now use the predict() function to estimate the response for all 392
# observations, and we use the mean() function to calculate the MSE of the
# 196 observations in the validation set. Note that the -train index below
# selects only the observations that are not in the training set.
mean((mpg-predict(lm.fit,Auto))[-train]^2)
## [1] 23.26601
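# As a quick sanity check (not part of the original lab, just a sketch), the
# same validation MSE can be computed more explicitly by subsetting the
# predictions and the observed responses separately:
valid.pred <- predict(lm.fit,Auto)[-train]  # predictions for the held-out rows
valid.obs <- Auto$mpg[-train]               # observed mpg for the held-out rows
mean((valid.obs-valid.pred)^2)              # should match the value above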
# Therefore, the estimated test MSE for the linear regression fit is 23.27. We
# can use the poly() function to estimate the test error for the quadratic
# and cubic regressions.
lm.fit2 <- lm(mpg~poly(horsepower,2),data=Auto,subset=train)
mean((mpg-predict(lm.fit2,Auto))[-train]^2)
## [1] 18.71646
lm.fit3=lm(mpg~poly(horsepower,3),data=Auto,subset=train)
mean((mpg-predict(lm.fit3,Auto))[-train]^2)
## [1] 18.79401
# These error rates are 18.72 and 18.79, respectively. If we choose a different
# training set instead, then we will obtain somewhat different errors on the
# validation set.
# Let's choose a different seed and see how it affects the error rates.
set.seed(2)
train=sample(392,196)
lm.fit=lm(mpg~horsepower,data=Auto,subset=train)
mean((mpg-predict(lm.fit,Auto))[-train]^2)
## [1] 25.72651
lm.fit2=lm(mpg~poly(horsepower,2),data=Auto,subset=train)
mean((mpg-predict(lm.fit2,Auto))[-train]^2)
## [1] 20.43036
lm.fit3=lm(mpg~poly(horsepower,3),data=Auto,subset=train)
mean((mpg-predict(lm.fit3,Auto))[-train]^2)
## [1] 20.38533
# Using this split of the observations into a training set and a validation
# set, we find that the validation set error rates for the models with linear,
# quadratic, and cubic terms are 25.73, 20.43, and 20.39, respectively.
# These results are consistent with our previous findings: a model that
# predicts mpg using a quadratic function of horsepower performs better than
# a model that involves only a linear function of horsepower, and there is
# little evidence in favor of a model that uses a cubic function of horsepower.
# Leave-One-Out Cross-Validation (LOOCV)
# The LOOCV estimate can be automatically computed for any generalized
# linear model using the glm() and cv.glm() functions. In the lab for Chapter 4
# we used the glm() function to perform logistic regression by passing
# in the family="binomial" argument. But if we use glm() to fit a model
# without passing in the family argument, then it performs linear regression,
# just like the lm() function. So for instance,
glm.fit=glm(mpg~horsepower,data=Auto)
coef(glm.fit)
## (Intercept) horsepower
## 39.9358610 -0.1578447
# and
lm.fit=lm(mpg~horsepower,data=Auto)
coef(lm.fit)
## (Intercept) horsepower
## 39.9358610 -0.1578447
# yield identical linear regression models. In this lab, we will perform linear
# regression using the glm() function rather than the lm() function because
# the former can be used together with cv.glm(). The cv.glm() function is
# part of the boot library.
# load boot library
library(boot)
glm.fit=glm(mpg~horsepower,data=Auto)
cv.err=cv.glm(Auto,glm.fit)
cv.err$delta
## [1] 24.23151 24.23114
# The cv.glm() function produces a list with several components. The two
# numbers in the delta vector contain the cross-validation results. In this
# case the numbers are identical (up to two decimal places) and correspond
# to the LOOCV statistic given in (5.1). Below, we discuss a situation in
# which the two numbers differ. Our cross-validation estimate for the test
# error is approximately 24.23.
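# A minimal manual LOOCV sketch (not part of the original lab): hold out each
# observation in turn, refit the model, and average the squared prediction
# errors. This should agree with delta[1] above, at the cost of n model fits.
n <- nrow(Auto)
loocv.err <- rep(0,n)
for (j in 1:n){
  fit.j <- lm(mpg~horsepower,data=Auto[-j,])
  loocv.err[j] <- (Auto$mpg[j]-predict(fit.j,Auto[j,]))^2
}
mean(loocv.err)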
# We can repeat this procedure for increasingly complex polynomial fits.
# To automate the process, we use the for() function to initiate a for loop
# which iteratively fits polynomial regressions for polynomials of order i = 1
# to i = 5, computes the associated cross-validation error, and stores it in
# the ith element of the vector cv.error. We begin by initializing the vector.
# This command will likely take a couple of minutes to run.
cv.error=rep(0,5)
degree=1:5
for (i in degree){
  glm.fit=glm(mpg~poly(horsepower,i),data=Auto)
  cv.error[i]=cv.glm(Auto,glm.fit)$delta[1]
}
cv.error
## [1] 24.23151 19.24821 19.33498 19.42443 19.03321
plot(degree,cv.error,type="b")

# we see a sharp drop in the estimated test MSE between
# the linear (24.23) and quadratic (19.25) fits, but then no clear improvement
# from using higher-order polynomials.
# k-Fold Cross-Validation
# The cv.glm() function can also be used to implement k-fold CV. Below we
# use k = 10, a common choice for k, on the Auto data set. We once again set
# a random seed and initialize a vector in which we will store the CV errors
# corresponding to the polynomial fits of orders one to ten.
set.seed(17)
cv.error.10=rep(0,10)
degree=1:10
for (i in degree){
  glm.fit=glm(mpg~poly(horsepower,i),data=Auto)
  cv.error.10[i]=cv.glm(Auto,glm.fit,K=10)$delta[1]
}
cv.error.10
## [1] 24.27207 19.26909 19.34805 19.29496 19.03198 18.89781 19.12061 19.14666
## [9] 18.87013 20.95520
plot(degree,cv.error.10,type="b")

# Notice that the computation time is much shorter than that of LOOCV.
# (In principle, the computation time for LOOCV for a least squares linear
# model should be faster than for k-fold CV, due to the availability of the
# formula (5.2) for LOOCV; however, unfortunately the cv.glm() function
# does not make use of this formula.) We still see little evidence that using
# cubic or higher-order polynomial terms leads to lower test error than simply
# using a quadratic fit.
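# For completeness (not part of the original lab), here is a sketch of the
# LOOCV shortcut formula (5.2), which uses the leverage (hat) values from a
# single least squares fit instead of refitting the model n times. The names
# fit.ls and h below are introduced here for illustration only.
fit.ls <- lm(mpg~horsepower,data=Auto)
h <- hatvalues(fit.ls)                       # leverage of each observation
mean(((Auto$mpg-fitted(fit.ls))/(1-h))^2)    # should match the LOOCV value above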
# We saw in Section 5.3.2 that the two numbers associated with delta are
# essentially the same when LOOCV is performed. When we instead perform
# k-fold CV, then the two numbers associated with delta differ slightly. The
# first is the standard k-fold CV estimate, as in (5.3). The second is a
# bias-corrected version. On this data set, the two estimates are very similar
# to each other.
# The Bootstrap
# We illustrate the use of the bootstrap in the simple example of Section 5.2,
# as well as on an example involving estimating the accuracy of the linear
# regression model on the Auto data set.
# One of the great advantages of the bootstrap approach is that it can be
# applied in almost all situations. No complicated mathematical calculations
# are required. Performing a bootstrap analysis in R entails only two steps.
# First, we must create a function that computes the statistic of interest.
# Second, we use the boot() function, which is part of the boot library, to
# perform the bootstrap by repeatedly sampling observations from the data
# set with replacement.
# The Portfolio data set in the ISLR package is described in Section 5.2.
# To illustrate the use of the bootstrap on this data, we must first create
# a function, alpha.fn(), which takes as input the (X, Y) data as well as
# a vector indicating which observations should be used to estimate α. The
# function then outputs the estimate for α based on the selected observations.
alpha.fn=function(data,index){
  X=data$X[index]
  Y=data$Y[index]
  return((var(Y)-cov(X,Y))/(var(X)+var(Y)-2*cov(X,Y)))
}
# This function returns, or outputs, an estimate for α based on applying
# (5.7) to the observations indexed by the argument index. For instance, the
# following command tells R to estimate α using all 100 observations.
alpha.fn(Portfolio,1:100)
## [1] 0.5758321
# The next command uses the sample() function to randomly select 100 observations
# from the range 1 to 100, with replacement. This is equivalent
# to constructing a new bootstrap data set and recomputing the estimate of α
# based on the new data set.
set.seed(1)
alpha.fn(Portfolio,sample(100,100,replace=T))
## [1] 0.7368375
# We can implement a bootstrap analysis by performing this command many
# times, recording all of the corresponding estimates for α, and computing
# the resulting standard deviation. However, the boot() function automates
# this approach. Below we produce R = 1,000 bootstrap estimates for α.
boot.out <- boot(Portfolio,alpha.fn,R=1000)
boot.out
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Portfolio, statistic = alpha.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.5758321 -0.001695873 0.09366347
plot(boot.out)
# The final output shows that, using the original data, the estimate of α is
# 0.5758, and that the bootstrap estimate of its standard error is 0.09.
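# As a cross-check (not part of the original lab), the same analysis can be
# written as an explicit loop: resample B times, record each estimate of α,
# and take the standard deviation. The result should be close to the 0.09
# reported by boot(). The names B and alpha.boot are introduced here only
# for this sketch.
set.seed(1)
B <- 1000
alpha.boot <- rep(0,B)
for (b in 1:B){
  alpha.boot[b] <- alpha.fn(Portfolio,sample(100,100,replace=TRUE))
}
sd(alpha.boot)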
# Estimating the Accuracy of a Linear Regression Model
# The bootstrap approach can be used to assess the variability of the coefficient
# estimates and predictions from a statistical learning method. Here
# we use the bootstrap approach in order to assess the variability of the
# estimates for β0 and β1, the intercept and slope terms for the linear regression
# model that uses horsepower to predict mpg in the Auto data set. We
# will compare the estimates obtained using the bootstrap to those obtained
# using the formulas for SE(β0) and SE(β1) described in Section 3.1.2.
# We first create a simple function, boot.fn(), which takes in the Auto data
# set as well as a set of indices for the observations, and returns the intercept
# and slope estimates for the linear regression model. We then apply this
# function to the full set of 392 observations in order to compute the estimates
# of β0 and β1 on the entire data set using the usual linear regression
# coefficient estimate formulas from Chapter 3. Note that we do not need the
# { and } at the beginning and end of the function because it is only one line
# long.
boot.fn=function(data,index)
  return(coef(lm(mpg~horsepower,data=data,subset=index)))
boot.fn(Auto,1:392)
## (Intercept) horsepower
## 39.9358610 -0.1578447
# The boot.fn() function can also be used in order to create bootstrap estimates
# for the intercept and slope terms by randomly sampling from among
# the observations with replacement. Here we give two examples.
set.seed(1)
boot.fn(Auto,sample(392,392,replace=T))
## (Intercept) horsepower
## 40.3404517 -0.1634868
boot.fn(Auto,sample(392,392,replace=T))
## (Intercept) horsepower
## 40.1186906 -0.1577063
# Next, we use the boot() function to compute the standard errors of 1,000
# bootstrap estimates for the intercept and slope terms.
boot(Auto,boot.fn,1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Auto, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 39.9358610 0.0544513229 0.841289790
## t2* -0.1578447 -0.0006170901 0.007343073
# This indicates that the bootstrap estimate for SE(β0) is 0.84, and that
# the bootstrap estimate for SE(β1) is 0.0073. As discussed in Section 3.1.2,
# standard formulas can be used to compute the standard errors for the
# regression coefficients in a linear model. These can be obtained using the
# summary() function.
summary(lm(mpg~horsepower,data=Auto))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.9358610 0.717498656 55.65984 1.220362e-187
## horsepower -0.1578447 0.006445501 -24.48914 7.031989e-81
# The standard error estimates for β0 and β1 obtained using the formulas
# from Section 3.1.2 are 0.717 for the intercept and 0.0064 for the slope.
# Interestingly, these are somewhat different from the estimates obtained
# using the bootstrap. Does this indicate a problem with the bootstrap? In
# fact, it suggests the opposite. Recall that the standard formulas given in
# Equation 3.8 on page 66 rely on certain assumptions. For example, they
# depend on the unknown parameter σ², the noise variance. We then estimate
# σ² using the RSS. Now although the formulas for the standard errors do not
# rely on the linear model being correct, the estimate for σ² does. We see in
# Figure 3.8 on page 91 that there is a non-linear relationship in the data, and
# so the residuals from a linear fit will be inflated, and so will the estimate
# for σ². Secondly, the standard formulas assume (somewhat unrealistically)
# that the x_i are fixed, and all the variability comes from the variation in
# the errors ε_i. The bootstrap approach does not rely on any of these
# assumptions, and so it is likely giving a more accurate estimate of
# the standard errors of β0 and β1 than is the summary() function.
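# For reference (not part of the original lab, just a sketch): the
# formula-based standard errors reported by summary() come from the estimated
# covariance matrix of the coefficients, which in turn uses the RSS-based
# estimate of σ². The name lm.fit.full is introduced here for illustration.
lm.fit.full <- lm(mpg~horsepower,data=Auto)
summary(lm.fit.full)$sigma^2     # RSS-based estimate of sigma^2
sqrt(diag(vcov(lm.fit.full)))    # should match the SEs in the summary() output above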
# This code was not in the lab; it is added so we can see the linear and
# quadratic fits on a scatterplot of the data.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'Auto':
##
## mpg

ggplot(Auto, aes(x=horsepower, y=mpg)) +
  geom_point() +
  geom_smooth(method=lm, formula=y~x, se=FALSE) +
  geom_smooth(method=lm, color="green", formula=y~x+I(x^2), se=FALSE)

# Below we compute the bootstrap standard error estimates and the standard
# linear regression estimates that result from fitting the quadratic model
# to the data. Since this model provides a good fit to the data (Figure 3.8),
# there is now a better correspondence between the bootstrap estimates and
# the standard estimates of SE(β0), SE(β1) and SE(β2).
boot.fn=function(data,index)
  coefficients(lm(mpg~horsepower+I(horsepower^2),data=data,subset=index))
set.seed(1)
boot(Auto,boot.fn,1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Auto, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 56.900099702 3.511640e-02 2.0300222526
## t2* -0.466189630 -7.080834e-04 0.0324241984
## t3* 0.001230536 2.840324e-06 0.0001172164
summary(lm(mpg~horsepower+I(horsepower^2),data=Auto))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.900099702 1.8004268063 31.60367 1.740911e-109
## horsepower -0.466189630 0.0311246171 -14.97816 2.289429e-40
## I(horsepower^2) 0.001230536 0.0001220759 10.08009 2.196340e-21