library(ISLR)   # Default and Weekly data sets
library(boot)   # boot() and cv.glm()
lrf <- glm(default ~ balance + income, data = Default, family = 'binomial')
summary(lrf)
Call:
glm(formula = default ~ balance + income, family = "binomial",
data = Default)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4725 -0.1444 -0.0574 -0.0211 3.7245
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 1579.0 on 9997 degrees of freedom
AIC: 1585
Number of Fisher Scoring iterations: 8
vsd <- function(seed) {
  set.seed(seed)
  # Split Default in half: one half for training, the other for validation.
  isTraining <- sample(nrow(Default), 0.5 * nrow(Default))
  f5b <- glm(default ~ balance + income, data = Default, family = 'binomial', subset = isTraining)
  summary(f5b)
  # Predict default probabilities on the held-out half and classify at the 0.5 threshold.
  default.pred <- predict(f5b, Default[-isTraining, ], type = 'response')
  def.res <- rep('No', 0.5 * nrow(Default))
  def.res[default.pred > 0.5] <- 'Yes'
  # Return the validation-set error rate.
  mean(def.res != Default[-isTraining, "default"])
}
vsd(1)
[1] 0.0286
vsd(3)
[1] 0.0248
vsd(1)
[1] 0.0286
vsd(17)
[1] 0.0224
vsd(1)
[1] 0.0286
vsd(23)
[1] 0.0268
So the error rate depends on the random seed used for the split. Across the seeds tried above, the mean error rate is about 2.6%.
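A quick way to summarise this, using the vsd() helper and the seeds tried above (a minimal sketch):
mean(sapply(c(1, 3, 17, 23), vsd))   # averages to roughly 0.026 for these seeds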
set.seed(1)
isTraining <- sample(nrow(Default), 0.5 * nrow(Default))
f5d <- glm(default ~ balance + income + student, data = Default, family = 'binomial', subset = isTraining)
default.pred <- predict(f5d, Default[-isTraining, ], type = 'response')
def.res <- rep('No', 0.5 * nrow(Default))
def.res[default.pred > 0.5] <- 'Yes'
mean(def.res != Default[-isTraining, "default"])
[1] 0.0288
Including student does not reduce the test error rate: with this split it rises slightly, from 0.0286 to 0.0288.
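To check that this small gap is not an artifact of a single split, a helper analogous to vsd() can be re-run over the seeds used above (a sketch; vsd.student is a name introduced here):
vsd.student <- function(seed) {
  set.seed(seed)
  isTraining <- sample(nrow(Default), 0.5 * nrow(Default))
  f <- glm(default ~ balance + income + student, data = Default,
           family = 'binomial', subset = isTraining)
  pred <- predict(f, Default[-isTraining, ], type = 'response')
  res <- ifelse(pred > 0.5, 'Yes', 'No')
  mean(res != Default[-isTraining, "default"])
}
sapply(c(1, 3, 17, 23), vsd.student)   # compare with the vsd() results above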
set.seed(1)
f6a <- glm(default ~ balance + income, data = Default, family = 'binomial')
summary(f6a)
Call:
glm(formula = default ~ balance + income, family = "binomial",
data = Default)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4725 -0.1444 -0.0574 -0.0211 3.7245
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 1579.0 on 9997 degrees of freedom
AIC: 1585
Number of Fisher Scoring iterations: 8
boot.fn <- function(data, index) coef(glm(default ~ balance + income, data = data, family = 'binomial', subset = index))
boot(Default, boot.fn, R = 1000)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Default, statistic = boot.fn, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* -1.154047e+01 -4.952225e-02 4.136788e-01
t2* 5.647103e-03 2.144899e-05 2.175920e-04
t3* 2.080898e-05 2.943832e-07 4.910237e-06
The standard errors reported by glm() and estimated by boot() are essentially the same.
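To make the comparison explicit, the two sets of standard errors can be placed side by side (a sketch; b is a name introduced here, and the bootstrap values will differ slightly from run to run):
b <- boot(Default, boot.fn, R = 1000)
cbind(glm.se  = summary(f6a)$coefficients[, "Std. Error"],
      boot.se = apply(b$t, 2, sd))   # bootstrap SE = SD of the replicates in b$t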
lrf <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
lrf <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-1,], family = binomial)
prob.7c <- predict(lrf, Weekly[1,], type = 'response')
(if (prob.7c > 0.5) 'Up' else 'Down') == Weekly[1, 'Direction']
[1] FALSE
res <- rep(FALSE, nrow(Weekly))
for (i in 1:nrow(Weekly)) {
  # Fit on all observations except the i-th, then predict the i-th observation.
  lrf <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = binomial)
  prob.7d <- predict(lrf, Weekly[i, ], type = 'response')
  # TRUE when the 0.5-threshold prediction matches the observed Direction.
  res[i] <- (if (prob.7d > 0.5) 'Up' else 'Down') == Weekly[i, 'Direction']
}
mean(res)
[1] 0.5500459
So the LOOCV estimate of the test error rate is \(1 - 0.55 = 0.45\).
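As a cross-check (a sketch, not part of the manual loop above), cv.glm() can compute the same LOOCV misclassification rate when given a 0/1 cost function; it refits the model n times, so it takes a moment:
# cost() counts a prediction as wrong when the 0.5-threshold class disagrees with the label
cost <- function(r, pi) mean(abs(r - pi) > 0.5)
full.fit <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
cv.glm(Weekly, full.fit, cost)$delta[1]   # should agree with 1 - mean(res) above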
set.seed(1)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)
Here there are 100 observations generated from a single variable \(x\), and the model involves both \(x\) and \(x^2\), so \(n = 100\) and \(p = 2\). The model is \[ y = x - 2x^2 + \epsilon \]
## 8b
plot(x, y)
set.seed(1)
inp <- data.frame(x, y)
fit <- glm(y ~ x)
cv.glm(inp, fit)$delta
[1] 7.288162 7.284744
fit <- glm(y ~ x + I(x^2))
cv.glm(inp, fit)$delta
[1] 0.9374236 0.9371789
fit <- glm(y ~ poly(x, 3))
cv.glm(inp, fit)$delta
[1] 0.9566218 0.9562538
fit <- glm(y ~ poly(x, 4))
cv.glm(inp, fit)$delta
[1] 0.9539049 0.9534453
Note that the default value of the K argument of cv.glm() is n, so when K is not specified (as in the code above) cv.glm() performs LOOCV. See ?cv.glm for details.
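For comparison, K can be set explicitly to get k-fold cross-validation instead of LOOCV (a sketch; unlike LOOCV, the k-fold result depends on the random fold assignment, hence the seed):
set.seed(2)
fit <- glm(y ~ poly(x, 2), data = inp)
cv.glm(inp, fit, K = 10)$delta   # 10-fold CV estimate; varies slightly from run to run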
set.seed(5)
inp <- data.frame(x, y)
fit <- glm(y ~ x)
cv.glm(inp, fit)$delta
fit <- glm(y ~ poly(x, 2))
cv.glm(inp, fit)$delta
fit <- glm(y ~ poly(x, 3))
cv.glm(inp, fit)$delta
fit <- glm(y ~ poly(x, 4))
cv.glm(inp, fit)$delta
The results are identical to those in (c), because LOOCV involves no random splitting: each observation is left out exactly once, so unlike the validation-set approach the result does not depend on the seed.
y ~ poly(x, 2) has the smallest LOOCV error, as expected, because it matches the true model \(y = x - 2x^2 + \epsilon\).
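The four LOOCV errors can also be collected in one call, which makes the comparison explicit (a sketch; poly(x, 1) gives the same fit as y ~ x, and poly(x, 2) the same as x + I(x^2), so the values should match those above):
sapply(1:4, function(d) cv.glm(inp, glm(y ~ poly(x, d), data = inp))$delta[1])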
summary(fit)
Call:
glm(formula = y ~ poly(x, 4))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0550 -0.6212 -0.1567 0.5952 2.2267
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.55002 0.09591 -16.162 < 2e-16 ***
poly(x, 4)1 6.18883 0.95905 6.453 4.59e-09 ***
poly(x, 4)2 -23.94830 0.95905 -24.971 < 2e-16 ***
poly(x, 4)3 0.26411 0.95905 0.275 0.784
poly(x, 4)4 1.25710 0.95905 1.311 0.193
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.9197797)
Null deviance: 700.852 on 99 degrees of freedom
Residual deviance: 87.379 on 95 degrees of freedom
AIC: 282.3
Number of Fisher Scoring iterations: 2
The p-values show that the linear and quadratic terms are statistically significant, while the cubic and quartic terms are not.
From (c) we know that the LOOCV error drops sharply from the linear to the quadratic model (from 7.29 to 0.94), while adding cubic and quartic terms gives no further improvement (0.96 and 0.95). This suggests that only the linear and quadratic terms matter, which agrees with the coefficient p-values in the summary of glm(y ~ poly(x, 4)) above.
library(MASS)
mu.bar <- mean(Boston$medv)
So \(\hat\mu = 22.533\).
Based on the definition of the standard error, the standard error of the sample mean is \[ \sqrt{\frac{\sum_{i=1}^n (x_i - \hat\mu)^2}{n(n-1)}} \] So we have:
n <- nrow(Boston)
sqrt(sum((Boston$medv - mu.bar)^2) / (n*(n-1)))
[1] 0.4088611
Or use function sd, which gives the same result:
sd(Boston$medv) / sqrt(nrow(Boston))
[1] 0.4088611
boot.fn <- function(data, index) mean(data[index])
boot(Boston$medv, boot.fn, R = 1000)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Boston$medv, statistic = boot.fn, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 22.53281 0.008616008 0.4085605
The standard error from boot() is essentially the same as the one computed in (b) (0.4086 vs 0.4089).
Note that the statistic function passed to boot() (its second argument) must itself take two arguments: the first is the data (the first argument of boot()), and the second is a vector of indices. See ?boot for details.
mu.bar - 2 * 0.4086; mu.bar + 2 * 0.4086
[1] 21.71561
[1] 23.35001
t.test(Boston$medv)
One Sample t-test
data: Boston$medv
t = 55.111, df = 505, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
21.72953 23.33608
sample estimates:
mean of x
22.53281
The interval based on the bootstrap standard error and the t.test() interval are almost the same.
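Rather than hard-coding the bootstrap standard error, the interval can also be computed from a saved boot object (a sketch; boot.mu is a name introduced here, and its standard error will differ slightly from run to run):
boot.mu <- boot(Boston$medv, boot.fn, R = 1000)   # same call as in (c), saved this time
mu.bar + c(-2, 2) * sd(boot.mu$t[, 1])            # approximate 95% CI from the bootstrap SE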
median(Boston$medv)
[1] 21.2
med.se <- function(data, index) median(data[index])
boot(Boston$medv, med.se, R = 1000)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Boston$medv, statistic = med.se, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 21.2 -0.026 0.3785926
quantile(Boston$medv, 0.1)
10%
12.75
q0.1 <- function(data, index) quantile(data[index], c(0.1))
boot(Boston$medv, q0.1, 1000)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Boston$medv, statistic = q0.1, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 12.75 0.01775 0.5001926
The standard error of \(\hat\mu_{0.1}\) is about 0.50, which is fairly small relative to the estimate of 12.75.
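For a confidence interval rather than just a standard error, boot.ci() can be applied to a saved boot object (a sketch; boot.q is a name introduced here):
boot.q <- boot(Boston$medv, q0.1, R = 1000)    # same call as above, saved this time
boot.ci(boot.q, type = c("norm", "perc"))      # normal-approximation and percentile intervals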