# An Introduction to Statistical Learning with Applications in R

# Chapter 5 Lab: Cross-Validation and the Bootstrap

# In this lab, we explore the resampling techniques covered in this chapter.
# Some of the commands in this lab may take a while to run on your computer.

# The Validation Set Approach

# We explore the use of the validation set approach in order to estimate the
# test error rates that result from fitting various linear models on the Auto
# data set.
# Before we begin, we use the set.seed() function in order to set a seed for
# R’s random number generator, so that the reader of this book will obtain
# precisely the same results as those shown below. It is generally a good idea
# to set a random seed when performing an analysis such as cross-validation
# that contains an element of randomness, so that the results obtained can
# be reproduced precisely at a later time.
# We begin by using the sample() function to split the set of observations
# into two halves, by selecting a random subset of 196 observations out of
# the original 392 observations. We refer to these observations as the training
# set.

# load ISLR package which contains data frame from the book 
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.4
# load Auto data and display data editor window 
fix(Auto)
# COMMENTS: 
# 392 observations of 9 variables

# set seed for reproducibility
set.seed(1)

# use sample() to get training set
# this selects 196 indices from a dataset with 
# 392 rows.
train <- sample(392,196)


# We then use the subset option in lm() to fit a linear regression using only
# the observations corresponding to the training set.
lm.fit <- lm(mpg~horsepower,data=Auto,subset=train)

# attach() allows you to directly reference variable names in your
# code without having to reference the data frame name.  This
# lab uses this approach but I usually do not as it is not a good
# practice from a reproducibility perspective
attach(Auto)

# We now use the predict() function to estimate the response for all 392
# observations, and we use the mean() function to calculate the MSE of the
# 196 observations in the validation set. Note that the -train index below
# selects only the observations that are not in the training set.
mean((mpg-predict(lm.fit,Auto))[-train]^2)
## [1] 23.26601
# Therefore, the estimated test MSE for the linear regression fit is 23.27. We
# can use the poly() function to estimate the test error for the quadratic
# and cubic regressions
lm.fit2 <- lm(mpg~poly(horsepower,2),data=Auto,subset=train)
mean((mpg-predict(lm.fit2,Auto))[-train]^2)
## [1] 18.71646
lm.fit3=lm(mpg~poly(horsepower,3),data=Auto,subset=train)
mean((mpg-predict(lm.fit3,Auto))[-train]^2)
## [1] 18.79401
# These error rates are 18.72 and 18.79, respectively. If we choose a different
# training set instead, then we will obtain somewhat different errors on the
# validation set.

# lets choose a different starting seed and see what the impact
# is on the error rates
set.seed(2)
train=sample(392,196)
lm.fit=lm(mpg~horsepower,subset=train)
mean((mpg-predict(lm.fit,Auto))[-train]^2)
## [1] 25.72651
lm.fit2=lm(mpg~poly(horsepower,2),data=Auto,subset=train)
mean((mpg-predict(lm.fit2,Auto))[-train]^2)
## [1] 20.43036
lm.fit3=lm(mpg~poly(horsepower,3),data=Auto,subset=train)
mean((mpg-predict(lm.fit3,Auto))[-train]^2)
## [1] 20.38533
# Using this split of the observations into a training set and a validation
# set, we find that the validation set error rates for the models with linear,
# quadratic, and cubic terms are 25.73, 20.43, and 20.39, respectively.

# These results are consistent with our previous findings: a model that
# predicts mpg using a quadratic function of horsepower performs better than
# a model that involves only a linear function of horsepower, and there is
# little evidence in favor of a model that uses a cubic function of horsepower.

# Leave-One-Out Cross-Validation (LOOCV)

# The LOOCV estimate can be automatically computed for any generalized
# linear model using the glm() and cv.glm() functions. In the lab for Chapter 4
# we used the glm() function to perform logistic regression by passing
# in the family="binomial" argument. But if we use glm() to fit a model
# without passing in the family argument, then it performs linear regression,
# just like the lm() function. So for instance,
glm.fit=glm(mpg~horsepower,data=Auto)
coef(glm.fit)
## (Intercept)  horsepower 
##  39.9358610  -0.1578447
# and 
lm.fit=lm(mpg~horsepower,data=Auto)
coef(lm.fit)
## (Intercept)  horsepower 
##  39.9358610  -0.1578447
# yield identical linear regression models. In this lab, we will perform linear
# regression using the glm() function rather than the lm() function because
# the former can be used together with cv.glm(). The cv.glm() function is
# part of the boot library.

# load boot library
library(boot)
glm.fit=glm(mpg~horsepower,data=Auto)
cv.err=cv.glm(Auto,glm.fit)
cv.err$delta
## [1] 24.23151 24.23114
# The cv.glm() function produces a list with several components. The two
# numbers in the delta vector contain the cross-validation results. In this
# case the numbers are identical (up to two decimal places) and correspond
# to the LOOCV statistic given in (5.1). Below, we discuss a situation in
# which the two numbers differ. Our cross-validation estimate for the test
# error is approximately 24.23.

# We can repeat this procedure for increasingly complex polynomial fits.
# To automate the process, we use the for() function to initiate a for loop
# which iteratively fits polynomial regressions for polynomials of order i = 1
# to i = 5, computes the associated cross-validation error, and stores it in
# the ith element of the vector cv.error. We begin by initializing the vector.
# This command will likely take a couple of minutes to run.
cv.error=rep(0,5)
degree=1:5
for (i in degree){
  glm.fit=glm(mpg~poly(horsepower,i),data=Auto)
  cv.error[i]=cv.glm(Auto,glm.fit)$delta[1]
}
cv.error
## [1] 24.23151 19.24821 19.33498 19.42443 19.03321
plot(degree,cv.error,type="b")

# we see a sharp drop in the estimated test MSE between
# the linear (24.23) and quadratic (19.25) fits, but then no clear improvement 
# from using higher-order polynomials.

# k-Fold Cross-Validation

# The cv.glm() function can also be used to implement k-fold CV. Below we
# use k = 10, a common choice for k, on the Auto data set. We once again set
# a random seed and initialize a vector in which we will store the CV errors
# corresponding to the polynomial fits of orders one to ten.
set.seed(17)
cv.error.10=rep(0,10)
degree=1:10
for (i in degree){
  glm.fit=glm(mpg~poly(horsepower,i),data=Auto)
  cv.error.10[i]=cv.glm(Auto,glm.fit,K=10)$delta[1]
}
cv.error.10
##  [1] 24.27207 19.26909 19.34805 19.29496 19.03198 18.89781 19.12061 19.14666
##  [9] 18.87013 20.95520
plot(degree,cv.error.10,type="b")

# Notice that the computation time is much shorter than that of LOOCV.
# (In principle, the computation time for LOOCV for a least squares linear
# model should be faster than for k-fold CV, due to the availability of the
# formula (5.2) for LOOCV; however, unfortunately the cv.glm() function
# does not make use of this formula.) We still see little evidence that using
# cubic or higher-order polynomial terms leads to lower test error than simply
# using a quadratic fit.

# We saw in Section 5.3.2 that the two numbers associated with delta are
# essentially the same when LOOCV is performed. When we instead perform
# k-fold CV, then the two numbers associated with delta differ slightly. The
# first is the standard k-fold CV estimate, as in (5.3). The second is a bias
# corrected version. On this data set, the two estimates are very similar to
# each other.

# The Bootstrap

# We illustrate the use of the bootstrap in the simple example of Section 5.2,
# as well as on an example involving estimating the accuracy of the linear
# regression model on the Auto data set.

# One of the great advantages of the bootstrap approach is that it can be
# applied in almost all situations. No complicated mathematical calculations
# are required. Performing a bootstrap analysis in R entails only two steps.
# First, we must create a function that computes the statistic of interest.
# Second, we use the boot() function, which is part of the boot library, to
# perform the bootstrap by repeatedly sampling observations from the data
# set with replacement.

# The Portfolio data set in the ISLR package is described in Section 5.2.
# To illustrate the use of the bootstrap on this data, we must first create
# a function, alpha.fn(), which takes as input the (X, Y) data as well as
# a vector indicating which observations should be used to estimate α. The
# function then outputs the estimate for α based on the selected observations.
alpha.fn=function(data,index){
  X=data$X[index]
  Y=data$Y[index]
  return((var(Y)-cov(X,Y))/(var(X)+var(Y)-2*cov(X,Y)))
}

# This function returns, or outputs, an estimate for α based on applying
# (5.7) to the observations indexed by the argument index. For instance, the
# following command tells R to estimate α using all 100 observations.

alpha.fn(Portfolio,1:100)
## [1] 0.5758321
# The next command uses the sample() function to randomly select 100 observations
# from the range 1 to 100, with replacement. This is equivalent
# to constructing a new bootstrap data set and recomputing ˆα based on the
# new data set.
set.seed(1)
alpha.fn(Portfolio,sample(100,100,replace=T))
## [1] 0.7368375
# We can implement a bootstrap analysis by performing this command many
# times, recording all of the corresponding estimates for α, and computing
# the resulting standard deviation. However, the boot() function automates
# this approach. Below we produce R = 1, 000 bootstrap estimates for α.
boot.out <- boot(Portfolio,alpha.fn,R=1000)
boot.out
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Portfolio, statistic = alpha.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##      original       bias    std. error
## t1* 0.5758321 -0.001695873  0.09366347
plot(boot.out)


# The final output shows that using the original data, ˆα = 0.5758, and that
# the bootstrap estimate for SE(ˆα) is 0.09.

# Estimating the Accuracy of a Linear Regression Model

# The bootstrap approach can be used to assess the variability of the coefficient
# estimates and predictions from a statistical learning method. Here
# we use the bootstrap approach in order to assess the variability of the
# estimates for β0 and β1, the intercept and slope terms for the linear regression
# model that uses horsepower to predict mpg in the Auto data set. We
# will compare the estimates obtained using the bootstrap to those obtained
# using the formulas for SE(β0) and SE(β1) described in Section 3.1.2.

# We first create a simple function, boot.fn(), which takes in the Auto data
# set as well as a set of indices for the observations, and returns the intercept
# and slope estimates for the linear regression model. We then apply this
# function to the full set of 392 observations in order to compute the estimates
# of β0 and β1 on the entire data set using the usual linear regression
# coefficient estimate formulas from Chapter 3. Note that we do not need the
# { and } at the beginning and end of the function because it is only one line
# long.
boot.fn=function(data,index)
  return(coef(lm(mpg~horsepower,data=data,subset=index)))
boot.fn(Auto,1:392)
## (Intercept)  horsepower 
##  39.9358610  -0.1578447
# The boot.fn() function can also be used in order to create bootstrap estimates
# for the intercept and slope terms by randomly sampling from among
# the observations with replacement. Here we give two examples.
set.seed(1)
boot.fn(Auto,sample(392,392,replace=T))
## (Intercept)  horsepower 
##  40.3404517  -0.1634868
boot.fn(Auto,sample(392,392,replace=T))
## (Intercept)  horsepower 
##  40.1186906  -0.1577063
# Next, we use the boot() function to compute the standard errors of 1,000
# bootstrap estimates for the intercept and slope terms.
boot(Auto,boot.fn,1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##       original        bias    std. error
## t1* 39.9358610  0.0544513229 0.841289790
## t2* -0.1578447 -0.0006170901 0.007343073
# This indicates that the bootstrap estimate for SE(β0) is 0.84, and that
# the bootstrap estimate for SE(β1) is 0.0073. As discussed in Section 3.1.2,
# standard formulas can be used to compute the standard errors for the
# regression coefficients in a linear model. These can be obtained using the
# summary() function.
summary(lm(mpg~horsepower,data=Auto))$coef
##               Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) 39.9358610 0.717498656  55.65984 1.220362e-187
## horsepower  -0.1578447 0.006445501 -24.48914  7.031989e-81
# The standard error estimates for β0 and β1 obtained using the formulas
# from Section 3.1.2 are 0.717 for the intercept and 0.0064 for the slope.
# Interestingly, these are somewhat different from the estimates obtained
# using the bootstrap. Does this indicate a problem with the bootstrap? In
# fact, it suggests the opposite. Recall that the standard formulas given in
# Equation 3.8 on page 66 rely on certain assumptions. For example, they
# depend on the unknown parameter σ2, the noise variance.We then estimate
# σ2 using the RSS. Now although the formula for the standard errors do not
# rely on the linear model being correct, the estimate for σ2 does. We see in
# Figure 3.8 on page 91 that there is a non-linear relationship in the data, and
# so the residuals from a linear fit will be inflated, and so will the estimate
# for σ2. Secondly, the standard formulas assume (somewhat unrealistically) 
# that the xi are fixed, and all the variability comes from the variation in 
# the errors epsilon_i. The bootstrap approach does not rely on any of these 
# assumptions, and so it is likely giving a more accurate estimate of 
# the standard errors of β0 and β1 than is the summary() function.

# this code was not in lab -- added so we could see the plot
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'Auto':
## 
##     mpg

ggplot(Auto, aes(x=horsepower, y=mpg))+
  geom_point()+
  geom_smooth(method=lm,formula= y~x,se=FALSE) +
  geom_smooth(method=lm,color="green",formula= y~x+I(x^2),se=FALSE)

# Below we compute the bootstrap standard error estimates and the standard
# linear regression estimates that result from fitting the quadratic model
# to the data. Since this model provides a good fit to the data (Figure 3.8),
# there is now a better correspondence between the bootstrap estimates and
# the standard estimates of SE(β0), SE(β1) and SE(β2).

boot.fn=function(data,index)
  coefficients(lm(mpg~horsepower+I(horsepower^2),data=data,subset=index))
set.seed(1)
boot(Auto,boot.fn,1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original        bias     std. error
## t1* 56.900099702  3.511640e-02 2.0300222526
## t2* -0.466189630 -7.080834e-04 0.0324241984
## t3*  0.001230536  2.840324e-06 0.0001172164
summary(lm(mpg~horsepower+I(horsepower^2),data=Auto))$coef
##                     Estimate   Std. Error   t value      Pr(>|t|)
## (Intercept)     56.900099702 1.8004268063  31.60367 1.740911e-109
## horsepower      -0.466189630 0.0311246171 -14.97816  2.289429e-40
## I(horsepower^2)  0.001230536 0.0001220759  10.08009  2.196340e-21