Load libraries
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.3
library(boot)
## Warning: package 'boot' was built under R version 4.0.4
library(MASS)
We now review k-fold cross-validation. (a) Explain how k-fold cross-validation is implemented.
The data is first divided into k parts. The first part is treated as a validation set, and the model is fit on k-1 parts This is repeated k times, using different parts as the validation set each time. The Mean Squared Error is calculated for each cycle which is then averaged to find the k-fold cross-validation estimate
The validation set approach? Advantages: low variability for k-fold cross validation compared to validation set approach Disadvantage: k-fold cross validation takes more computational power and expense compared to validation set approach
LOOCV? Advantages: k-fold cross validation has less variance compared to LOOCV k-fold cross validation less computationally expensive compared to LOOCV Disadvantages: k-fold cross validation can have higher bias than LOOCV k-fold cross validation has more bias compared to LOOCV
In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis. (a) Fit a logistic regression model that uses income and balance to predict default.
data("Default")
set.seed(1)
glm.fit <- glm(default ~ income + balance, data = Default, family = "binomial")
set.seed(1)
train2 <- sample(1:nrow(Default), 0.5*nrow(Default))
glm.fit2 <- glm(default ~ income + balance, data = Default, family = binomial, subset = train2)
glm.fit2.prob <- predict(glm.fit2, Default[-train2,], type = "response")
pred_class2 = ifelse(glm.fit2.prob > .5, "Yes", "No")
mean(pred_class2 != Default[-train2, ]$default)
## [1] 0.0254
set.seed(1)
#70-30 split
train3 <- sample(1:nrow(Default), 0.7*nrow(Default))
glm.fit3 <- glm(default ~ income + balance, data = Default, family = binomial, subset = train3)
glm.fit3.probs <- predict(glm.fit3, Default[-train3, ], type = "response")
pred_class3 <- ifelse(glm.fit3.probs > 0.5, "Yes", "No")
mean(pred_class3 != Default[-train3, ]$default)
## [1] 0.02666667
set.seed(1)
#80-20 split
train4 <- sample(1:nrow(Default), 0.8*nrow(Default))
glm.fit4 <- glm(default ~ income + balance, data = Default, family = binomial, subset = train4)
glm.fit4.probs <- predict(glm.fit4, Default[-train4, ], type = "response")
pred_class4 <- ifelse(glm.fit4.probs > 0.5, "Yes", "No")
mean(pred_class4 != Default[-train4, ]$default)
## [1] 0.026
set.seed(1)
#90-10 split
train5 <- sample(1:nrow(Default), 0.9*nrow(Default))
glm.fit5 <- glm(default ~ income + balance, data = Default, family = binomial, subset = train5)
glm.fit5.probs <- predict(glm.fit5, Default[-train5, ], type = "response")
pred_class5 <- ifelse(glm.fit5.probs > 0.5, "Yes", "No")
mean(pred_class5 != Default[-train5, ]$default)
## [1] 0.029
The three sets, with a 70-30 split, a 80-20 split, and a 90-10 split all had different test error rates of 0.02666667, 0.026, and 0.029 respectively. They are not significantly different and have a small range.
train6 <- sample(nrow(Default), 0.5*nrow(Default))
glm.fit6 <- glm(default ~ income + balance + student, data = Default, family = binomial, subset = train6)
summary(glm.fit6)
##
## Call:
## glm(formula = default ~ income + balance + student, family = binomial,
## data = Default, subset = train6)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1148 -0.1404 -0.0558 -0.0214 3.6412
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.036e+01 6.830e-01 -15.165 < 2e-16 ***
## income -7.013e-06 1.160e-05 -0.605 0.54534
## balance 5.657e-03 3.259e-04 17.358 < 2e-16 ***
## studentYes -9.754e-01 3.366e-01 -2.897 0.00376 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1429.88 on 4999 degrees of freedom
## Residual deviance: 777.79 on 4996 degrees of freedom
## AIC: 785.79
##
## Number of Fisher Scoring iterations: 8
glm.fit6.probs <- predict(glm.fit6, Default[-train6, ], type = "response")
pred_class6 <- ifelse(glm.fit6.probs > 0.5, "Yes", "No")
mean(pred_class6 != Default[-train6, ]$default)
## [1] 0.026
The test error rate for both models, with or without student, is 2.6% Adding student does not help or worsen the test error rate
We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.
set.seed(1)
glm.fit7 <- glm(default ~ income + balance, data = Default, family = binomial)
summary(glm.fit7)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
The standard error for income is 4.985e-06 and 2.274e-04 is balance.
boot.fn <- function(data, index){
return(coef(glm(default ~ income + balance, data=data, family="binomial", subset=index)))
}
boot(Default, boot.fn, 100)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 8.556378e-03 4.122015e-01
## t2* 2.080898e-05 -3.993598e-07 4.186088e-06
## t3* 5.647103e-03 -4.116657e-06 2.226242e-04
Using glm function, the standard error for income is 4.985e-06 and 2.274e-04 for balance. Using bootstrap, the coefficient for income is 4.186088e-06 and balance is 2.226242e-04 These are quite similar to each other
data("Boston")
set.seed(1)
mu <- mean(Boston$medv)
mu
## [1] 22.53281
stdev.mu <- sd(Boston$medv)/sqrt(length(Boston$medv))
stdev.mu
## [1] 0.4088611
set.seed(1)
boot.mu = function(data, index) return(mean(data[index]))
boot(Boston$medv, boot.mu, 500)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.mu, R = 500)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.004967589 0.4265264
The standard error in part b is 0.4088611 The standard error using bootstrap is 0.4265264 The errors are similar
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
ci = c(mu - 2 * stdev.mu, mu + 2 * stdev.mu)
ci
## [1] 21.71508 23.35053
The confidence interval using t-test is 21.72953, 23.33608 The confidence interval based on bootstrap is 21.71508, 23.35053 The confidence intervals are extremely similar
med.mu <- median(Boston$medv)
med.mu
## [1] 21.2
the median is 21.2
set.seed(1)
boot.fn <- function(data, index) {
return(median(data[index]))
}
boot(Boston$medv, boot.fn, 100)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.029 0.3461316
the standard error, 0.3461, for the median, 21.2, is small.
muhat.1 <- quantile(Boston$medv, c(0.1))
muhat.1
## 10%
## 12.75
10th percentile is 12.75
set.seed(1)
boot.fn2 <- function(data, index) return(quantile(data[index], c(0.1)))
boot(Boston$medv, boot.fn2, 100)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn2, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.008 0.5370477
the standard error estimate is about 0.5370 using bootstrap method for the tenth percentile of medv in Boston compared to the other estimates, it is higher