5.3.1 The Validation Set Approach
library(ISLR)
set.seed(1)
train=sample(392,196) # randomly select half of the 392 observations as a training set
lm.fit=lm(mpg~horsepower,data=Auto,subset=train)
attach(Auto)
mean((mpg-predict(lm.fit,Auto))[-train]^2) # the -train index selects only the observations that are not in the training set
## [1] 26.14142
lm.fit2=lm(mpg~poly(horsepower,2),data=Auto,subset=train)
mean((mpg-predict(lm.fit2,Auto))[-train]^2)#19.82
## [1] 19.82259
lm.fit3=lm(mpg~poly(horsepower,3),data=Auto,subset=train)
mean((mpg-predict(lm.fit3,Auto))[-train]^2)#19.78
## [1] 19.78252
set.seed(2)
train=sample(392,196)
lm.fit=lm(mpg~horsepower,data=Auto,subset=train)
mean((mpg-predict(lm.fit,Auto))[-train]^2)#23.30
## [1] 23.29559
lm.fit2=lm(mpg~poly(horsepower,2),data=Auto,subset=train)
mean((mpg-predict(lm.fit2,Auto))[-train]^2)#18.90
## [1] 18.90124
lm.fit3=lm(mpg~poly(horsepower,3),data=Auto,subset=train)
mean((mpg-predict(lm.fit3,Auto))[-train]^2)#19.26
## [1] 19.2574
These results are consistent with our previous findings: a model that predicts mpg using a quadratic function of horsepower performs better than a model that involves only a linear function of horsepower, and there is little evidence in favor of a model that uses a cubic function of horsepower.
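The three fits above repeat the same pattern of steps. As a compact alternative, the following sketch wraps them in a helper (val.mse is a hypothetical name, not part of the lab) that computes the validation-set MSE for any polynomial degree:
val.mse=function(d,train){ # hypothetical helper: validation-set MSE for degree d
  fit=lm(mpg~poly(horsepower,d),data=Auto,subset=train)
  mean((Auto$mpg-predict(fit,Auto))[-train]^2)
}
sapply(1:3,val.mse,train=train) # reproduces the three validation errors above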
5.3.2 Leave-One-Out Cross-Validation
glm.fit=glm(mpg~horsepower,data=Auto)
coef(glm.fit)
## (Intercept) horsepower
## 39.9358610 -0.1578447
lm.fit=lm(mpg~horsepower,data=Auto)
coef(lm.fit)
## (Intercept) horsepower
## 39.9358610 -0.1578447
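Note that fitting a model with glm() and its default gaussian family yields exactly the same coefficient estimates as lm(). We use glm() here rather than lm() because glm() can be used together with cv.glm(), which is part of the boot library.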
library(boot)
glm.fit=glm(mpg~horsepower,data=Auto)
cv.err=cv.glm(Auto,glm.fit)
cv.err$delta
## [1] 24.23151 24.23114
The cv.glm() function produces a list with several components. The two numbers in the delta vector contain the cross-validation results: the first is the raw LOOCV estimate of the test MSE, and the second is a bias-corrected version. For LOOCV the two are essentially identical.
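As an aside (not in the original lab), for a least squares fit the LOOCV estimate can be computed from a single model fit using the leverage (hat) values, since CV = mean(((y - yhat)/(1 - h))^2); a minimal sketch:
h=hatvalues(glm.fit) # leverage of each observation in the fit above
mean(((Auto$mpg-fitted(glm.fit))/(1-h))^2) # equals cv.err$delta[1]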
# LOOCV error for polynomial fits of degree 1 through 5
cv.error=rep(0,5)
for(i in 1:5){
  glm.fit=glm(mpg~poly(horsepower,i),data=Auto)
  cv.error[i]=cv.glm(Auto,glm.fit)$delta[1]
}
cv.error
## [1] 24.23151 19.24821 19.33498 19.42443 19.03321
We see a sharp drop in the estimated test MSE between the linear and quadratic fits, but then no clear improvement from using higher-order polynomials.
5.3.3 k-Fold Cross-Validation
set.seed(17)
# 10-fold CV error for polynomial fits of degree 1 through 10
cv.error.10=rep(0,10)
for(i in 1:10){
  glm.fit=glm(mpg~poly(horsepower,i),data=Auto)
  cv.error.10[i]=cv.glm(Auto,glm.fit,K=10)$delta[1]
}
cv.error.10
## [1] 24.20520 19.18924 19.30662 19.33799 18.87911 19.02103 18.89609
## [8] 19.71201 18.95140 19.50196
When we instead perform k-fold CV, the two numbers associated with delta differ slightly. The first is the standard k-fold CV estimate; the second is a bias-corrected version.
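To see both numbers, print the full delta vector from a single k-fold fit; a small illustration using the quadratic model from above:
glm.fit=glm(mpg~poly(horsepower,2),data=Auto)
cv.glm(Auto,glm.fit,K=10)$delta # first element: standard 10-fold estimate; second: bias-corrected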
5.3.4 The Bootstrap
Performing a bootstrap analysis in R entails only two steps. First, we must create a function that computes the statistic of interest. Second, we use the boot() function, which is part of the boot library, to perform the bootstrap by repeatedly sampling observations from the data set with replacement.
Estimating the Accuracy of a Statistic of Interest
# alpha.fn() estimates alpha, the fraction of money invested in X that
# minimizes the variance of the portfolio, using the observations in 'index'
alpha.fn=function(data,index){
  X=data$X[index]
  Y=data$Y[index]
  return((var(Y)-cov(X,Y))/(var(X)+var(Y)-2*cov(X,Y)))
}
alpha.fn(Portfolio,1:100)
## [1] 0.5758321
set.seed(1)
alpha.fn(Portfolio,sample(100,100,replace=T)) # estimate alpha from one bootstrap sample drawn with replacement
## [1] 0.5963833
boot(Portfolio,alpha.fn,R=1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Portfolio, statistic = alpha.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.5758321 -7.315422e-05 0.08861826
The final output shows that, using the original data, alpha = 0.5758, and that the bootstrap estimate for SE(alpha) is 0.0886.
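The individual bootstrap replicates are stored in the t component of the object returned by boot(), so the reported standard error is simply their standard deviation. A quick sketch (boot.out is just an assumed name for the saved result):
boot.out=boot(Portfolio,alpha.fn,R=1000) # save the boot object this time
sd(boot.out$t[,1]) # roughly 0.089; the exact value depends on the random seed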
Estimating the Accuracy of a Linear Regression Model
# boot.fn() returns the intercept and slope estimates for the linear model,
# fit on the observations in 'index'
boot.fn=function(data,index)
  return(coef(lm(mpg~horsepower,data=data,subset=index)))
boot.fn(Auto,1:392)
## (Intercept) horsepower
## 39.9358610 -0.1578447
set.seed(1)
boot.fn(Auto,sample(392,392,replace=T))
## (Intercept) horsepower
## 38.7387134 -0.1481952
boot.fn(Auto,sample(392,392,replace=T))
## (Intercept) horsepower
## 40.0383086 -0.1596104
boot(Auto,boot.fn,1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Auto, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 39.9358610 0.02972191 0.860007896
## t2* -0.1578447 -0.00030823 0.007404467
summary(lm(mpg~horsepower,data=Auto))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.9358610 0.717498656 55.65984 1.220362e-187
## horsepower -0.1578447 0.006445501 -24.48914 7.031989e-81
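Note that the bootstrap standard error estimates (0.86 for the intercept and 0.0074 for the slope) are somewhat larger than the formula-based estimates reported by summary(). This does not indicate a problem with the bootstrap: the standard formulas rely on assumptions, such as a correctly specified linear model and a noise variance estimated from the residuals, that do not hold exactly here because the true relationship between mpg and horsepower is nonlinear. The bootstrap, which makes no such assumptions, is likely giving a more accurate answer.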
# repeat the bootstrap analysis for the quadratic model
boot.fn=function(data,index)
  coefficients(lm(mpg~horsepower+I(horsepower^2),data=data,subset=index))
set.seed(1)
boot(Auto,boot.fn,1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Auto, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 56.900099702 6.098115e-03 2.0944855842
## t2* -0.466189630 -1.777108e-04 0.0334123802
## t3* 0.001230536 1.324315e-06 0.0001208339
summary(lm(mpg~horsepower+I(horsepower^2),data=Auto))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.900099702 1.8004268063 31.60367 1.740911e-109
## horsepower -0.466189630 0.0311246171 -14.97816 2.289429e-40
## I(horsepower^2) 0.001230536 0.0001220759 10.08009 2.196340e-21
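Because the quadratic model provides a good fit to the data, there is now a much better correspondence between the bootstrap estimates and the standard estimates of the coefficient standard errors.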