We now review k-fold cross-validation.
K-fold cross-validation starts by randomly dividing the observations in the data set into k number of groups. These groups are the validation set and the rest is the training set. We get the test error by then the k and the MSE estimates.
Some disadvantages of k-fold cross validation relative to the validation approach is that the test error can be variable depending on the observations in the train and validation sets. Another is that only a subset or part of the data is used to fit the model, which could result in an overestimation of the test data.
An advantages of LOOVC is that is yields less bias. Unlike the different MSE when running the validation set approch due to the random splitting process. A disadvantage of LOOVC is that it is extremely time consuming.
In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.
income
and balance
to predict default.
library(ISLR)
attach(Default)
set.seed(1)
fit.glm <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(fit.glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
train <- sample(dim(Default)[1], dim(Default)[1] / 2)
train_glm <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
summary(train_glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5830 -0.1428 -0.0573 -0.0213 3.3395
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.194e+01 6.178e-01 -19.333 < 2e-16 ***
## income 3.262e-05 7.024e-06 4.644 3.41e-06 ***
## balance 5.689e-03 3.158e-04 18.014 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.8 on 4999 degrees of freedom
## Residual deviance: 803.3 on 4997 degrees of freedom
## AIC: 809.3
##
## Number of Fisher Scoring iterations: 8
default
category if the posterior probability is greater than 0.5.probs <- predict(train_glm, newdata = Default[-train, ], type = "response")
pred_glm <- rep("No", length(probs))
pred_glm[probs > 0.5] <- "Yes"
mean(pred_glm != Default[-train, ]$default)
## [1] 0.0254
train2 <- sample(dim(Default)[1], dim(Default)[1] / 2)
train_glm2 <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train2)
probs2 <- predict(train_glm2, newdata = Default[-train2, ], type = "response")
pred_glm2 <- rep("No", length(probs2))
pred_glm2[probs2 > 0.5] <- "Yes"
mean(pred_glm2 != Default[-train2, ]$default)
## [1] 0.0274
train3 <- sample(dim(Default)[1], dim(Default)[1] / 2)
train_glm3 <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train3)
probs3 <- predict(train_glm3, newdata = Default[-train2, ], type = "response")
pred_glm3 <- rep("No", length(probs3))
pred_glm3[probs3 > 0.5] <- "Yes"
mean(pred_glm3 != Default[-train3, ]$default)
## [1] 0.0438
train4 <- sample(dim(Default)[1], dim(Default)[1] / 2)
train_glm4 <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train4)
probs4 <- predict(train_glm4, newdata = Default[-train4, ], type = "response")
pred_glm4 <- rep("No", length(probs4))
pred_glm4[probs4 > 0.5] <- "Yes"
mean(pred_glm4 != Default[-train4, ]$default)
## [1] 0.0244
By looking at these results, we can see that the error is different every time since the sample is random.
income
, balance
, and a dummy variable for student.
Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student
leads to a reduction in the test error rate.train_partd <- sample(dim(Default)[1], dim(Default)[1] / 2)
glm_wdummy_model <- glm(default ~ income + balance + student, data = Default, family = "binomial", subset = train_partd)
probs_wdummy <- predict(glm_wdummy_model, newdata = Default[-train_partd, ], type = "response")
pred_glm_wdummy <- rep("No", length(probs_wdummy))
pred_glm_wdummy[probs_wdummy > 0.5] <- "Yes"
mean(pred_glm_wdummy != Default[-train_partd, ]$default)
## [1] 0.0278
By adding a dummy variable for student
it did not seem to significantly reduce the MSE or reduce it at all.
We continue to consider the use of a logistic regression model to predict the probability of default using income
and balance
on the Default data set. In particular, we will now compute estimates for the standard errors of the income
and balance
logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.
income
and balance
in a multiple logistic regression model that uses both predictors.set.seed(100)
q6_glm <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(q6_glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
The standard errors of the coefficients intercept are 4.348e-01, 4.985e-06, and 2.274e-04.
boot_fn <- function(data, index) {
boot_model <- glm(default ~ income + balance, data = data, family = "binomial", subset = index)
return (coef(boot_model))
}
income
and balance
.library(boot)
boot(Default, boot_fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot_fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -2.927307e-02 4.446538e-01
## t2* 2.080898e-05 -3.481513e-08 4.916633e-06
## t3* 5.647103e-03 1.665079e-05 2.318937e-04
Based on this model, the standard errors of the coefficients of balance
and income
are 4.446538e-01, 4.916633e-06, and 2.318937e-04.
It looks like the standard errors of model models are extremely similar.
We will now consider the Boston housing data set, from the MASS library.
library(MASS)
attach(Boston)
mu_hat <- mean(medv)
mu_hat
## [1] 22.53281
se_mu_hat <- sd(medv) / sqrt(dim(Boston)[1])
se_mu_hat
## [1] 0.4088611
set.seed(100)
boot_fn_q9 <- function(data, index) {
mu <- mean(data[index])
return (mu)
}
boot(medv, boot_fn_q9, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot_fn_q9, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 -0.004865415 0.4192063
Comparing these results to part b, they are very similar: 0.4088611 and 0.4192063.
t.test(medv)
##
## One Sample t-test
##
## data: medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
con_mu_hat <- c(22.53 - 2 * 0.4119, 22.53 + 2 * 0.4119)
con_mu_hat
## [1] 21.7062 23.3538
The bootstrapping function is similar to the t test function.
med_mu_hat <- median(medv)
med_mu_hat
## [1] 21.2
boot_func <- function(data, index) {
mu <- median(data[index])
return (mu)
}
boot(medv, boot_func, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot_func, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.0207 0.3866536
The estimated value is 21.2, the same value found in part e. The standard error is also pretty small of 0.3866536.
hat10_percent <- quantile(medv, c(0.1))
hat10_percent
## 10%
## 12.75
boot_func2 <- function(data, index) {
mu <- quantile(data[index], c(0.1))
return (mu)
}
boot(medv, boot_func2, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot_func2, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.01775 0.5108458
When looking at the 10th percentile, we see a value of 12.75, the same value of part g. We also get a standard error of 0.5108458.