The validation set approach: randomly select half of the observations as the training set (first split):
library(ISLR)
set.seed(1)
train <- sample(nrow(Auto), 0.5 * nrow(Auto))
lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
mean((Auto$mpg - predict(lm.fit, Auto))[-train] ^ 2)
[1] 26.14142
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
mean((Auto$mpg - predict(lm.fit2, Auto))[-train] ^ 2)
[1] 19.82259
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
mean((Auto$mpg - predict(lm.fit3, Auto))[-train] ^ 2)
[1] 19.78252
Now select a different training set, using a different random seed (second split):
set.seed(2)
train <- sample(nrow(Auto), 0.5 * nrow(Auto))
lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
mean((Auto$mpg - predict(lm.fit, Auto))[-train] ^ 2)
[1] 23.29559
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
mean((Auto$mpg - predict(lm.fit2, Auto))[-train] ^ 2)
[1] 18.90124
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
mean((Auto$mpg - predict(lm.fit3, Auto))[-train] ^ 2)
[1] 19.2574
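Notice that the error estimates change from split to split. As a rough sketch (not part of the original lab; the name val.mse below is ad hoc), the quadratic fit's validation MSE can be recomputed under several random splits to see how much it fluctuates:
# Sketch: validation MSE of the quadratic fit under ten different random splits
val.mse <- sapply(1:10, function(s) {
  set.seed(s)
  tr <- sample(nrow(Auto), 0.5 * nrow(Auto))
  fit <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = tr)
  mean((Auto$mpg - predict(fit, Auto))[-tr] ^ 2)
})
round(val.mse, 2)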
Leave-one-out cross-validation (LOOCV):
library(boot)
glm.fit <- glm(mpg ~ horsepower, data = Auto)
cv.error <- cv.glm(Auto, glm.fit)
cv.error$delta
[1] 24.23151 24.23114
The two numbers in delta are the standard CV estimate, as in Equation (5.3), and a bias-corrected version; for LOOCV they are essentially identical.
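For a least-squares fit, the LOOCV estimate can also be obtained without refitting the model n times, via the leverage-based shortcut of Equation (5.2). A quick check (an addition to the lab, not part of the original output; full.fit is an ad-hoc name):
full.fit <- lm(mpg ~ horsepower, data = Auto)
h <- hatvalues(full.fit)                             # leverage values h_i
mean(((Auto$mpg - fitted(full.fit)) / (1 - h)) ^ 2)  # should match the first component of cv.error$delta above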
LOOCV for polynomial regressions of increasing degree:
cv.err <- rep(0, 5)
for (i in 1:5) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.err[i] <- cv.glm(Auto, glm.fit)$delta[1]
}
cv.err
[1] 24.23151 19.24821 19.33498 19.42443 19.03321
Repeat for polynomials up to degree 10:
set.seed(17)
cv.err.10 <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.err.10[i] <- cv.glm(Auto, glm.fit)$delta[1]
}
cv.err.10
[1] 24.23151 19.24821 19.33498 19.42443 19.03321 18.97864 18.83305 18.96115 19.06863 19.49093
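Note that cv.glm() performs LOOCV by default when no K argument is supplied, so the loop above is still LOOCV even though the vector is named cv.err.10 and a seed is set. A 10-fold variant would pass K = 10 explicitly; a sketch (output not shown, and it will vary with the seed; cv.err.kfold is an ad-hoc name):
set.seed(17)
cv.err.kfold <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.err.kfold[i] <- cv.glm(Auto, glm.fit, K = 10)$delta[1]  # 10-fold CV error
}
cv.err.kfold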
Use the bootstrap to estimate \(\alpha\), the fraction invested in \(X\) that minimizes the variance of \(\alpha X + (1 - \alpha) Y\) (Equation (5.7)), on the Portfolio data:
alpha.fn <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}
alpha.fn(Portfolio, 1:100)
[1] 0.5758321
set.seed(1)
alpha.fn(Portfolio, sample(100, 100, replace = TRUE))
[1] 0.5963833
boot(Portfolio, alpha.fn, R = 1000)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Portfolio, statistic = alpha.fn, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 0.5758321 -7.315422e-05 0.08861826
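As a side note (not in the original lab; boot.out is an ad-hoc name), saving the boot object also allows a bootstrap confidence interval for \(\alpha\), for example a percentile interval:
set.seed(1)
boot.out <- boot(Portfolio, alpha.fn, R = 1000)
boot.ci(boot.out, conf = 0.95, type = "perc")  # percentile bootstrap CI for alpha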
Now estimate the accuracy of linear regression coefficients. First, the coefficients from an ordinary fit on all observations:
boot.fn <- function(data, index) coef(lm(mpg ~ horsepower, data = data, subset = index))
all.obs <- nrow(Auto)
boot.fn(Auto, 1:all.obs)
(Intercept) horsepower
39.9358610 -0.1578447
Next, coefficient estimates from two individual bootstrap resamples (sampling observations with replacement):
set.seed(1)
boot.fn(Auto, sample(all.obs, all.obs, replace = TRUE))
(Intercept) horsepower
38.7387134 -0.1481952
boot.fn(Auto, sample(all.obs, all.obs, replace = TRUE))
(Intercept) horsepower
40.0383086 -0.1596104
Compare the bootstrap standard errors with those given by the standard formulas of Equation (3.8):
boot(Auto, boot.fn, R = 1000)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Auto, statistic = boot.fn, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 39.9358610 0.02972191 0.860007896
t2* -0.1578447 -0.00030823 0.007404467
summary(lm(mpg ~ horsepower, data = Auto))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.9358610 0.717498656 55.65984 1.220362e-187
horsepower -0.1578447 0.006445501 -24.48914 7.031989e-81
The bootstrap standard errors are somewhat larger than the formula-based ones; the formulas in (3.8) rest on the linear-model and constant-variance assumptions, which are questionable for these data, so the bootstrap estimates are arguably the more reliable ones. Repeat the comparison with a quadratic model:
boot.fn2 <- function(data, index) coef(lm(mpg ~ horsepower + I(horsepower^2), data = data, subset = index))
set.seed(1)
boot(Auto, boot.fn2, R = 1000)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Auto, statistic = boot.fn2, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 56.900099702 6.098115e-03 2.0944855842
t2* -0.466189630 -1.777108e-04 0.0334123802
t3* 0.001230536 1.324315e-06 0.0001208339
summary(lm(mpg ~ horsepower + I(horsepower^2), data = Auto))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.900099702 1.8004268063 31.60367 1.740911e-109
horsepower -0.466189630 0.0311246171 -14.97816 2.289429e-40
I(horsepower^2) 0.001230536 0.0001220759 10.08009 2.196340e-21
Why do poly() and I() produce different results?
boot.fn2 <- function(data, index) coef(lm(mpg ~ poly(horsepower, 2), data = data, subset = index))
set.seed(1)
boot(Auto, boot.fn2, R = 1000)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Auto, statistic = boot.fn2, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 23.44592 0.003943212 0.2255528
t2* -120.13774 0.117312678 3.7008952
t3* 44.08953 0.047449584 4.3294215
summary(lm(mpg ~ poly(horsepower, 2), data = Auto))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.44592 0.2209163 106.13030 2.752212e-289
poly(horsepower, 2)1 -120.13774 4.3739206 -27.46683 4.169400e-93
poly(horsepower, 2)2 44.08953 4.3739206 10.08009 2.196340e-21
Answer: the equivalent of x + I(x^2) is poly(x, 2, raw = TRUE):
summary(lm(mpg ~ poly(horsepower, 2, raw = TRUE), data = Auto))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.900099702 1.8004268063 31.60367 1.740911e-109
poly(horsepower, 2, raw = TRUE)1 -0.466189630 0.0311246171 -14.97816 2.289429e-40
poly(horsepower, 2, raw = TRUE)2 0.001230536 0.0001220759 10.08009 2.196340e-21
poly(x, n) generates orthogonal polynomials of degree \(n\) rather than the raw powers \(x, x^2, \dots, x^n\). See G. Grothendieck's answer to "What does the R function poly really do?" for a detailed explanation.
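As a quick sanity check (an addition, not from the original lab; fit.orth and fit.raw are ad-hoc names), the orthogonal and raw parameterizations describe the same quadratic model, only in different bases, so their fitted values agree even though the coefficients differ:
fit.orth <- lm(mpg ~ poly(horsepower, 2), data = Auto)
fit.raw <- lm(mpg ~ poly(horsepower, 2, raw = TRUE), data = Auto)
all.equal(fitted(fit.orth), fitted(fit.raw))  # TRUE: identical fits, different coefficients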