Resampling methods involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. They may allow us to obtain information that would not be available from fitting the model only once using the original training sample.
Resampling approaches can be computationally expensive, because they involve fitting the same statistical method multiple times using different subsets of the training data. However, due to recent advances in computing power, the computational requirements of resampling methods generally are not prohibitive. In these exercises, we investigate two of the most commonly used resampling methods, cross-validation and the bootstrap.
We now review k-fold cross-validation. (a) Explain how k-fold cross-validation is implemented.
k-fold cross-validation involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds.
– Page 181, An Introduction to Statistical Learning, 2013.
The error (the MSE for regression, or the misclassification rate for classification) is then computed on the observations in the held-out fold. This procedure is repeated k times, with a different fold treated as the validation set each time, and the k resulting error estimates are averaged to give the k-fold CV estimate.
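For concreteness, here is a minimal hand-coded sketch of the procedure (not part of the original exercise; it uses the built-in mtcars data and a simple linear model purely for illustration):
set.seed(1)
k = 5
n = nrow(mtcars)
# randomly assign each observation to one of k folds of roughly equal size
folds = sample(rep(1:k, length.out = n))
cv.errors = rep(0, k)
for (j in 1:k) {
    # fit on the k - 1 folds that exclude the jth
    fit = lm(mpg ~ wt, data = mtcars[folds != j, ])
    # compute the MSE on the held-out jth fold
    pred = predict(fit, mtcars[folds == j, ])
    cv.errors[j] = mean((mtcars$mpg[folds == j] - pred)^2)
}
# the k-fold CV estimate is the average of the k fold errors
mean(cv.errors)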
(b) What are the advantages and disadvantages of k-fold cross-validation relative to:
The validation set approach?
The validation set approach is conceptually simple and easy to implement: you simply partition the existing training data into two sets. However, it has two drawbacks: (1) the estimate of the test error rate can be highly variable, depending on which observations happen to fall in the training and validation sets; and (2) because the model is fit on only a subset of the observations, the validation set error rate tends to overestimate the test error rate for the model fit on the entire data set.
LOOCV?
LOOCV is a special case of k-fold cross-validation with k = n. Thus, LOOCV is the most computationally intensive method, since the model must be fit n times (unless a computational shortcut is available, as it is for least squares linear regression). LOOCV also has lower bias but higher variance than k-fold CV with k < n.
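To illustrate the computational difference (a sketch, not part of the original exercise; a small simulated data set is used so that LOOCV remains fast), boot::cv.glm() performs LOOCV by default and k-fold CV when K is specified:
library(boot)
set.seed(1)
x = rnorm(100)
y = x - 2 * x^2 + rnorm(100)
sim = data.frame(x, y)
sim.fit = glm(y ~ x, data = sim)
cv.glm(sim, sim.fit)$delta[1]           # LOOCV: K defaults to n, so 100 model fits
cv.glm(sim, sim.fit, K = 10)$delta[1]   # 10-fold CV: only 10 model fits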
In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.
library(ISLR)
summary(Default)
 default    student       balance           income
 No :9667   No :7056   Min.   :   0.0   Min.   :  772
 Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340
                       Median : 823.6   Median :34553
                       Mean   : 835.4   Mean   :33517
                       3rd Qu.:1166.3   3rd Qu.:43808
                       Max.   :2654.3   Max.   :73554
attach(Default)
set.seed(1)
(a) Fit a logistic regression model that uses income and balance to predict default.
glm.fit = glm(default ~ income + balance, data = Default, family = binomial)
(b) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:
1. Split the sample set into a training set and a validation set.
2. Fit a multiple logistic regression model using only the training observations.
3. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.
4. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
FiveB = function() {
    # 1. randomly split the observations into a training half and a validation half
    train = sample(dim(Default)[1], dim(Default)[1]/2)
    # 2. fit the logistic regression model using only the training observations
    glm.fit = glm(default ~ income + balance, data = Default, family = binomial,
        subset = train)
    # 3. classify a validation observation as "Yes" if its posterior probability exceeds 0.5
    glm.pred = rep("No", dim(Default)[1]/2)
    glm.probs = predict(glm.fit, Default[-train, ], type = "response")
    glm.pred[glm.probs > 0.5] = "Yes"
    # 4. return the validation set error rate (fraction of misclassified observations)
    return(mean(glm.pred != Default[-train, ]$default))
}
FiveB()
[1] 0.0254
(c) Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.
FiveB()
[1] 0.0274
FiveB()
[1] 0.0244
FiveB()
[1] 0.0244
The estimated test error rate varies from split to split, hovering around 2.5% (roughly 0.024 to 0.028); this variability is the main drawback of the validation set approach noted earlier.
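To see this split-to-split variability more systematically, one could repeat the random split many times (a quick sketch, assuming the FiveB() function defined above; results will vary with the random seed):
errs = replicate(100, FiveB())
summary(errs)
sd(errs)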
(d) Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.
train = sample(dim(Default)[1], dim(Default)[1]/2)
glm.fit = glm(default ~ income + balance + student, data = Default, family = binomial,
subset = train)
glm.pred = rep("No", dim(Default)[1]/2)
glm.probs = predict(glm.fit, Default[-train, ], type = "response")
glm.pred[glm.probs > 0.5] = "Yes"
mean(glm.pred != Default[-train, ]$default)
[1] 0.0262
The estimated test error rate with the student dummy variable included (0.0262) is not lower than the rates obtained in (c), so adding the student dummy variable does not appear to reduce the test error rate.
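Note that the comparison above is based on different random splits for the two models, so part of the difference is simply split-to-split noise. A fairer check (a sketch, not part of the original solution) fits both models on the same split:
train = sample(dim(Default)[1], dim(Default)[1] / 2)
err.rate = function(formula) {
    fit = glm(formula, data = Default, family = binomial, subset = train)
    probs = predict(fit, Default[-train, ], type = "response")
    pred = ifelse(probs > 0.5, "Yes", "No")
    mean(pred != Default[-train, ]$default)
}
err.rate(default ~ income + balance)            # without the student dummy
err.rate(default ~ income + balance + student)  # with the student dummy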
We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.
(a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.
set.seed(1)
glm.fit = glm(default ~ income + balance, data = Default, family = binomial)
summary(glm.fit)
Call:
glm(formula = default ~ income + balance, family = binomial,
    data = Default)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4725  -0.1444  -0.0574  -0.0211   3.7245

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1579.0  on 9997  degrees of freedom
AIC: 1585

Number of Fisher Scoring iterations: 8
(b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.
boot.fn = function(data, index) {
    return(coef(glm(default ~ income + balance, data = data, family = binomial,
        subset = index)))
}
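As a quick sanity check (not part of the original exercise), applying boot.fn() to the full set of observation indices should reproduce the coefficient estimates from part (a):
boot.fn(Default, 1:dim(Default)[1])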
(c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.
library(boot)
boot(Default, boot.fn, 50)
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = Default, statistic = boot.fn, R = 50)

Bootstrap Statistics :
        original        bias     std. error
t1* -1.154047e+01 -5.661486e-02 4.847786e-01
t2*  2.080898e-05 -7.436578e-08 4.456965e-06
t3*  5.647103e-03  1.854126e-05 2.639029e-04
(d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.
The estimated standard errors obtained by the two methods are quite close, agreeing to roughly the second or third significant digit. (Only R = 50 bootstrap replicates were used here; a larger number, such as R = 1000, would give more stable bootstrap estimates.)
In Sections 5.3.2 and 5.3.3, we saw that the cv.glm() function can be used in order to compute the LOOCV test error estimate. Alternatively, one could compute those quantities using just the glm() and predict.glm() functions, and a for loop. You will now take this approach in order to compute the LOOCV error for a simple logistic regression model on the Weekly data set. Recall that in the context of classification problems, the LOOCV error is given in (5.4): CV(n) = (1/n) Σ Err_i, where Err_i = I(y_i ≠ ŷ_i).
detach(Default)
summary(Weekly)
      Year           Lag1
 Min.   :1990   Min.   :-18.1950
 1st Qu.:1995   1st Qu.: -1.1540
 Median :2000   Median :  0.2410
 Mean   :2000   Mean   :  0.1506
 3rd Qu.:2005   3rd Qu.:  1.4050
 Max.   :2010   Max.   : 12.0260
      Lag2               Lag3
 Min.   :-18.1950   Min.   :-18.1950
 1st Qu.: -1.1540   1st Qu.: -1.1580
 Median :  0.2410   Median :  0.2410
 Mean   :  0.1511   Mean   :  0.1472
 3rd Qu.:  1.4090   3rd Qu.:  1.4090
 Max.   : 12.0260   Max.   : 12.0260
      Lag4               Lag5
 Min.   :-18.1950   Min.   :-18.1950
 1st Qu.: -1.1580   1st Qu.: -1.1660
 Median :  0.2380   Median :  0.2340
 Mean   :  0.1458   Mean   :  0.1399
 3rd Qu.:  1.4090   3rd Qu.:  1.4050
 Max.   : 12.0260   Max.   : 12.0260
     Volume            Today
 Min.   :0.08747   Min.   :-18.1950
 1st Qu.:0.33202   1st Qu.: -1.1540
 Median :1.00268   Median :  0.2410
 Mean   :1.57462   Mean   :  0.1499
 3rd Qu.:2.05373   3rd Qu.:  1.4050
 Max.   :9.32821   Max.   : 12.0260
 Direction
 Down:484
 Up  :605
set.seed(1)
attach(Weekly)
(a) Fit a logistic regression model that predicts Direction using Lag1 and Lag2.
glm.fit = glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
summary(glm.fit)
Call:
glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.623  -1.261   1.001   1.083   1.506

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.22122    0.06147   3.599 0.000319 ***
Lag1        -0.03872    0.02622  -1.477 0.139672
Lag2         0.06025    0.02655   2.270 0.023232 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1496.2  on 1088  degrees of freedom
Residual deviance: 1488.2  on 1086  degrees of freedom
AIC: 1494.2

Number of Fisher Scoring iterations: 4
(b) Fit a logistic regression model that predicts Direction using Lag1 and Lag2 using all but the first observation.
glm.fit = glm(Direction ~ Lag1 + Lag2, data = Weekly[-1, ], family = binomial)
summary(glm.fit)
(c) Use the model from (b) to predict the direction of the first observation. You can do this by predicting that the first observation will go up if P(Direction = "Up" | Lag1, Lag2) > 0.5. Was this observation correctly classified?
predict.glm(glm.fit, Weekly[1, ], type = "response") > 0.5
1
TRUE
Weekly$Direction[1]
[1] Down
Levels: Down Up
We predict the Direction to be Up; the true Direction was Down, so the first observation was not correctly classified.
(d) Write a for loop from i = 1 to i = n, where n is the number of observations in the data set, that performs each of the following steps:
1. Fit a logistic regression model using all but the ith observation to predict Direction using Lag1 and Lag2.
2. Compute the posterior probability of the market moving up for the ith observation.
3. Use the posterior probability for the ith observation in order to predict whether or not the market moves up.
4. Determine whether or not an error was made in predicting the direction for the ith observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0.
count = rep(0, dim(Weekly)[1])
for (i in 1:(dim(Weekly)[1])) {
    # 1. fit the model on all observations except the ith
    glm.fit = glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = binomial)
    # 2.-3. predict "Up" for the ith observation if its posterior probability exceeds 0.5
    is_up = predict.glm(glm.fit, Weekly[i, ], type = "response") > 0.5
    is_true_up = Weekly[i, ]$Direction == "Up"
    # 4. record a 1 if the prediction was wrong
    if (is_up != is_true_up)
        count[i] = 1
}
sum(count)
[1] 490
490 errors were made out of the 1089 LOOCV fits.
(e) Take the average of the n numbers obtained in (d)iv in order to obtain the LOOCV estimate for the test error. Comment on the results.
mean(count)
[1] 0.4499541
LOOCV estimates a test error rate of about 45%, which is essentially no better than the naive strategy of always predicting Up (the majority class, with an error rate of 484/1089 ≈ 44.4%); Lag1 and Lag2 appear to have little predictive power.
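The baseline error rate of the always-Up rule can be verified directly (a one-line check, not part of the original solution):
mean(Weekly$Direction != "Up")  # fraction of weeks that are Down: 484 / 1089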