We now review k-fold cross-validation.
Explain how k-fold cross-validation is implemented.
For k-fold cross-validation, we randomly divide the data set into multiple, equal sized, groups (K different groups/folds). The next step is to remove the first part (this is our validation set) and fit the model on the remaining K-1 part. Then, we see how good the predictions are on the left-out part (calculate the MSE on the first part). This process will be repeated K different times and take out a different part each time as validation set. Finally, we average the K different MSE’s to get an estimated validation error rate for new observations.
What are the advantages and disadvantages of k-fold crossvalidation relative to:
From a conceptual viewpoint, the validation set approach is simple and is easy to implement. However, the validation MSE can vary highly. Further, in the validation approach, only a subset of the observations are used to fit the model (the training data), and statistical methods tend to perform worse when trained on fewer observations.
Leave-One-Out Cross Validation (LOOCV) has less bias because we fit the statistical learning method multiple times using our training data (n-1). This means that almost all of our data is used. Besides that, LOOCV also produces a less variable MSE. HOWEVER, LOOCV is computationally intensive because we fit each model as many times as we have observations (n). The more n, the more times we need to fit our model!
In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis
set.seed(1)
lm.fit <- glm(default ~ income + balance, family=binomial, data = Default)
summary(lm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:
i. Split the sample set into a training set and a validation set.
train <- sample(dim(Default)[1], dim(Default)[1] / 2)
lm.fit <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
summary(lm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5830 -0.1428 -0.0573 -0.0213 3.3395
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.194e+01 6.178e-01 -19.333 < 2e-16 ***
## income 3.262e-05 7.024e-06 4.644 3.41e-06 ***
## balance 5.689e-03 3.158e-04 18.014 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.8 on 4999 degrees of freedom
## Residual deviance: 803.3 on 4997 degrees of freedom
## AIC: 809.3
##
## Number of Fisher Scoring iterations: 8
prob1 <- predict(lm.fit, newdata = Default[-train, ], type = "response")
pred.lm1 <- rep("No", length(prob1))
pred.lm1[prob1 > 0.5] <- "Yes"
mean(pred.lm1 != Default[-train, ]$default)
## [1] 0.0254
Solution: The test error rate is 2.54% with the validation set approach.
train <- sample(dim(Default)[1], dim(Default)[1] / 2)
lm.fit <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
summary(lm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5420 -0.1329 -0.0512 -0.0176 3.7909
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.193e+01 6.379e-01 -18.709 < 2e-16 ***
## income 1.939e-05 6.953e-06 2.789 0.00528 **
## balance 5.918e-03 3.355e-04 17.641 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1490.52 on 4999 degrees of freedom
## Residual deviance: 774.41 on 4997 degrees of freedom
## AIC: 780.41
##
## Number of Fisher Scoring iterations: 8
prob1 <- predict(lm.fit, newdata = Default[-train, ], type = "response")
pred.lm1 <- rep("No", length(prob1))
pred.lm1[prob1 > 0.5] <- "Yes"
mean(pred.lm1 != Default[-train, ]$default)
## [1] 0.0274
train <- sample(dim(Default)[1], dim(Default)[1] / 2)
lm.fit <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
summary(lm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1634 -0.1446 -0.0553 -0.0203 3.3281
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.158e+01 6.008e-01 -19.281 < 2e-16 ***
## income 1.975e-05 6.775e-06 2.916 0.00355 **
## balance 5.723e-03 3.180e-04 17.996 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1543.58 on 4999 degrees of freedom
## Residual deviance: 816.44 on 4997 degrees of freedom
## AIC: 822.44
##
## Number of Fisher Scoring iterations: 8
prob1 <- predict(lm.fit, newdata = Default[-train, ], type = "response")
pred.lm1 <- rep("No", length(prob1))
pred.lm1[prob1 > 0.5] <- "Yes"
mean(pred.lm1 != Default[-train, ]$default)
## [1] 0.0244
train <- sample(dim(Default)[1], dim(Default)[1] / 2)
lm.fit <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
summary(lm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4027 -0.1517 -0.0624 -0.0233 3.6833
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.112e+01 5.816e-01 -19.120 <2e-16 ***
## income 1.638e-05 6.755e-06 2.425 0.0153 *
## balance 5.489e-03 3.067e-04 17.897 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1503.85 on 4999 degrees of freedom
## Residual deviance: 831.51 on 4997 degrees of freedom
## AIC: 837.51
##
## Number of Fisher Scoring iterations: 8
prob1 <- predict(lm.fit, newdata = Default[-train, ], type = "response")
pred.lm1 <- rep("No", length(prob1))
pred.lm1[prob1 > 0.5] <- "Yes"
mean(pred.lm1 != Default[-train, ]$default)
## [1] 0.0244
Solution: We get thee different estimates of the test error rate, 0.0274, 0.0244, 0.027. This tells us that, depending on the observations included in the training set and validation set, the test error rate can vary.
train <- sample(dim(Default)[1], dim(Default)[1] / 2)
lm.fit <- glm(default ~ income + balance + student, data = Default, family = "binomial", subset = train)
summary(lm.fit)
##
## Call:
## glm(formula = default ~ income + balance + student, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3850 -0.1413 -0.0568 -0.0212 3.7409
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.042e+01 6.921e-01 -15.058 <2e-16 ***
## income -5.472e-06 1.205e-05 -0.454 0.6498
## balance 5.638e-03 3.276e-04 17.212 <2e-16 ***
## studentYes -8.286e-01 3.468e-01 -2.389 0.0169 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1409.45 on 4999 degrees of freedom
## Residual deviance: 767.12 on 4996 degrees of freedom
## AIC: 775.12
##
## Number of Fisher Scoring iterations: 8
prob1 <- predict(lm.fit, newdata = Default[-train, ], type = "response")
pred.lm1 <- rep("No", length(prob1))
pred.lm1[prob1 > 0.5] <- "Yes"
mean(pred.lm1 != Default[-train, ]$default)
## [1] 0.0278
Solution: Including the “student” variable as a dummy variable led to a test error rate of 0.0264.This result is not different from our previous tests. Therefore, it seems that adding the variable does not lead to a reduction in the validation set estimate of the test error rate.
6. We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.
set.seed(1)
attach(Default)
lm.fit<- glm(default ~ income + balance, data = Default, family = "binomial")
summary(lm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Solution: The standard errors are: Intercept; 0.4348, income; 0.000004985, and balance; 0.0002274.
boot.fn <- function(data, index) {
fit1 <- glm(default ~ income + balance, data = data, family = "binomial", subset = index)
return (coef(fit1))
}
library(boot)
boot(Default, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2* 2.080898e-05 1.680317e-07 4.866284e-06
## t3* 5.647103e-03 1.855765e-05 2.298949e-04
Solution: The bootstrap standard errors for the coefficients are: t1; 0.4344722, t2; 0.000004866284, and t34; 0.0002298949.
The estimated standard error obtained using the glm() function and using the bootstrap function are almost identical.
We will now consider the Boston housing data set, from the ISLR2 library.
attach(Boston)
pop.mean <- mean(medv)
pop.mean
## [1] 22.53281
Solution: The estimated population mean for the “medv” variable is 22.53.
pop.se <- sd(medv) / sqrt(dim(Boston)[1])
pop.se
## [1] 0.4088611
Solution: The estimate of the standard error of ˆµ is 0.409.
set.seed(1)
boot.fn <- function(data, index) {
mu <- mean(data[index])
return (mu)
}
boot(medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
Solution: The estimated standard error of ˆµ using the bootstrap is 0.411. This result is very close to our result in (b), which was 0.409.
t.test(medv)
##
## One Sample t-test
##
## data: medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
The 95% confidence interval for the One sample t-test is 21.73 - 23.34. The mean is 22.53.
mu.conf.int <- c(22.53 - 2 * 0.4119, 22.53 + 2 * 0.4119)
mu.conf.int
## [1] 21.7062 23.3538
The 95% confidence interval for the mean of “medv” is 21.71 - 23.35. This result is a slightly larger, but very close range to our t-test result of 21.73 - 23.34.
medv.median <- median(medv)
medv.median
## [1] 21.2
Solution: The median for medv is 21.2.
boot.fn <- function(data, index) {
mu <- median(data[index])
return (mu)
}
boot(medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.0386 0.3770241
Solution: The estimated standard error of ˆµmed using the boostrap function is 21.2, which is identical to our solution in (e).
medv.percent <- quantile(medv, c(0.1))
medv.percent
## 10%
## 12.75
Solution: The estimate for the tenth percentile of medv in Boston census tracts is 12.75.
boot.fn <- function(data, index) {
mu <- quantile(data[index], c(0.1))
return (mu)
}
boot(medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0186 0.4925766
Solution: The estimate for the tenth percentile of medv in Boston census tracts using the bootstrap function is 12.75, which is identical to our solution in (g).