We now review k-fold cross-validation.
Explain how k-fold cross-validation is implemented. It involves randomly k-fold CV dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean squared error, MSE1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a validation set.
What are the advantages and disadvantages of k-fold crossvalidation relative to:
The validation set approach? The advantage of the k-fold cv to the validation set approach is that the validation set approach often overestimates the test error rate. The disadvantages of the k-fold cv to the validation set approach is that validation set approach is much simpler.
LOOCV? The advantages of k-fold CV to the LOOCV is that it reults in less computational problems. Also, often gives more accurate estimates of the test error rate rather than LOOCV. The disadvantage of the k-fold to the LOOCV is that the k-fold is more random.
In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.
library(ISLR2)
data("Default")
logreg= glm(default~income+balance, data=Default, family="binomial")
summary(logreg)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
set.seed(1)
train= sample(10000, 5000)
logreg2= glm(default~income+balance, data= Default, subset=train, family="binomial")
summary(logreg2)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5830 -0.1428 -0.0573 -0.0213 3.3395
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.194e+01 6.178e-01 -19.333 < 2e-16 ***
## income 3.262e-05 7.024e-06 4.644 3.41e-06 ***
## balance 5.689e-03 3.158e-04 18.014 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.8 on 4999 degrees of freedom
## Residual deviance: 803.3 on 4997 degrees of freedom
## AIC: 809.3
##
## Number of Fisher Scoring iterations: 8
prob <- predict(logreg2, Default[-train, ], type = "response")
pred <- rep("No", length(prob))
pred[prob > 0.5] <- "Yes"
mean(pred != Default[-train, ]$default)
## [1] 0.0254
2.54% of the observations are missclassified from the validation set.
train=sample(10000, 7500)
prob <- predict(logreg2, Default[-train, ], type = "response")
pred <- rep("No", length(prob))
pred[prob > 0.5] <- "Yes"
mean(pred != Default[-train, ]$default)
## [1] 0.0256
train=sample(10000, 7500)
prob <- predict(logreg2, Default[-train, ], type = "response")
pred <- rep("No", length(prob))
pred[prob > 0.5] <- "Yes"
mean(pred != Default[-train, ]$default)
## [1] 0.0252
train=sample(10000, 7500)
prob <- predict(logreg2, Default[-train, ], type = "response")
pred <- rep("No", length(prob))
pred[prob > 0.5] <- "Yes"
mean(pred != Default[-train, ]$default)
## [1] 0.0288
Their is some variation between the 3 different splits.
train=sample(10000, 7500)
logreg3= glm(default~income+balance+student, data= Default, subset=train, family="binomial")
prob <- predict(logreg2, Default[-train, ], type = "response")
pred <- rep("No", length(prob))
pred[prob > 0.5] <- "Yes"
mean(pred != Default[-train, ]$default)
## [1] 0.0256
Adding the dummy variable did not reduce the test error rate.
We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.
logreg3= glm(default~income+balance, data=Default, family="binomial")
summary(logreg3)$coefficients[ ,2]
## (Intercept) income balance
## 4.347564e-01 4.985167e-06 2.273731e-04
boot.fn <- function(data, index=1:nrow(data)) {
logreg4 <- glm(default ~ income + balance, data = data, family = "binomial", subset = index)
return (coef(logreg4))
}
library(boot)
boot(Default, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.956706e-02 4.358002e-01
## t2* 2.080898e-05 1.667950e-07 4.855307e-06
## t3* 5.647103e-03 1.860487e-05 2.306029e-04
The bootstrap estimated standard errors are 4.809310^(-6) and 2.328810^(-4).
We will now consider the Boston housing data set, from the ISLR2 library.
mean(Boston$medv)
## [1] 22.53281
sd(Boston$medv)/sqrt(length(Boston$medv))
## [1] 0.4088611
boot.fn=function(data, index){
mean(data[index])
}
b=boot(Boston$medv, boot.fn, 1000)
b
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.01691976 0.4185628
The standard errors are similar off by .002.
round(c(mean(Boston$medv) - 2*sd(b$t), mean(Boston$medv) + 2*sd(b$t)), 4)
## [1] 21.6957 23.3699
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
They are similar off by .01.
median(Boston$medv)
## [1] 21.2
boot.fn=function(data, index){
median(data[index])
}
boot(Boston$medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.0256 0.3687351
The standard error is .38828355.
quantile(Boston$medv, .1)
## 10%
## 12.75
boot.fn=function(data, index){
quantile(data[index], .1)
}
boot(Boston$medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0378 0.4903944
The standard error is slightly high at .5031813.