Load Libraries and attach datasets
library(ISLR2)
library(tidyverse)
library(boot)
attach(Default)
attach(Boston)
K-fold cross-validation (CV) is implemented in much the same way as leave-one-out CV (LOOCV), except that rather than leaving only one observation out at a time, a chunk of observations (called a fold) is left out. To create the folds, the data set is broken into equal chunks, most often either 5 or 10. In each iteration, one fold is held out and used to validate the model trained on the remaining folds. After the first iteration is complete, the fold that was previously held out is returned to the training data and the next fold is held out. This allows the model to be trained and validated on "different" data subsets as many times as there are folds.
K-fold CV has less bias than the single validation set approach because it uses multiple smaller validation sets, which allow for larger training sets. There is also less variability in the error estimate, and ultimately every available observation in the data set is used to train the model. The biggest disadvantage is that CV (either k-fold CV or LOOCV) requires more computational power than a single validation split. In addition, the process and results are harder to understand and explain.
K-fold CV has lower variance than LOOCV, and since it is run on folds of observations rather than on each individual observation, it requires fewer iterations and therefore less computational power. The downside is that k-fold CV can have more bias than LOOCV. However, using K = 5 or 10 provides a good balance in the bias-variance trade-off.
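As a quick illustration of the mechanics (not part of the exercises below), k-fold CV can be run directly with cv.glm() from the boot package loaded above. The seed and the 0/1-loss cost function here are choices made for this sketch only.
# Minimal sketch: 10-fold CV error rate for the logistic model on Default.
set.seed(1)  # illustrative seed, unrelated to the seed used in the exercises
fit_cv <- glm(default ~ income + balance, data = Default, family = 'binomial')
cost <- function(y, prob) mean(abs(y - prob) > 0.5)  # 0/1 loss at the 0.5 cutoff
cv.glm(Default, fit_cv, cost = cost, K = 10)$delta[1]  # cross-validated error estimate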
We now consider a logistic regression model to predict the probability of default using income and balance on the Default data set, and estimate the test error of this model using the validation set approach. A random seed is set before beginning the analysis.
set.seed(22)
First, we fit a logistic regression model that uses income and balance to predict default.
fit_glm <- glm(default ~ income + balance, data = Default, family = 'binomial')
summary(fit_glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Using a 75/25 (train/validate) split, the validation set error rate is 2.6%.
train_num <- sample(nrow(Default), nrow(Default) * .75)
train <- Default[train_num, ]
validate <- Default[-train_num, ]
fit_glm <- glm(default ~ income + balance, family = 'binomial', data = train)
summary(fit_glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4592 -0.1484 -0.0585 -0.0217 3.7113
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.140e+01 4.921e-01 -23.167 <2e-16 ***
## income 1.846e-05 5.657e-06 3.264 0.0011 **
## balance 5.624e-03 2.584e-04 21.761 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2245.8 on 7499 degrees of freedom
## Residual deviance: 1208.7 on 7497 degrees of freedom
## AIC: 1214.7
##
## Number of Fisher Scoring iterations: 8
An observation is classified into the default category if its posterior probability of default is greater than 0.5.
prob_glm <- predict(fit_glm, validate, type = 'response')
pred_glm <- rep('No', length(prob_glm))
pred_glm[prob_glm >.5] <- 'Yes'
table(pred_glm, validate$default)
##
## pred_glm No Yes
## No 2414 54
## Yes 11 21
1 - mean(pred_glm==validate$default)
## [1] 0.026
The results of the three iterations each return a slightly different error rate for the validation sets.
Split 1 used a 50/50 (train/validate) split and returned an error rate of 2.96%
Split 2 used a 65/35 (train/validate) split and returned an error rate of 2.65%
Split 3 used a 90/10 (train/validate) split and returned an error rate of 3.7%
Interestingly, the third iteration used the largest training set but performed the worst in this scenario. This is likely due to the small validation subset skewing the results: a different selection of validation data would more than likely yield a different error rate, and using a different seed would very likely produce slightly different error rates as well (a short sketch after the three splits below repeats the estimate over several seeds).
train_num <- sample(nrow(Default), nrow(Default) * .50)
train <- Default[train_num, ]
validate <- Default[-train_num, ]
fit_glm <- glm(default ~ income + balance, family = 'binomial', data = train)
prob_glm <- predict(fit_glm, validate, type = 'response')
pred_glm <- rep('No', length(prob_glm))
pred_glm[prob_glm >.5] <- 'Yes'
table(pred_glm, validate$default)
##
## pred_glm No Yes
## No 4803 120
## Yes 28 49
1 - mean(pred_glm==validate$default)
## [1] 0.0296
train_num <- sample(nrow(Default), nrow(Default) * .65)
train <- Default[train_num, ]
validate <- Default[-train_num, ]
fit_glm <- glm(default ~ income + balance, family = 'binomial', data = train)
prob_glm <- predict(fit_glm, validate, type = 'response')
pred_glm <- rep('No', length(prob_glm))
pred_glm[prob_glm >.5] <- 'Yes'
table(pred_glm, validate$default)
##
## pred_glm No Yes
## No 3369 82
## Yes 11 38
1 - mean(pred_glm==validate$default)
## [1] 0.02657143
train_num <- sample(nrow(Default), nrow(Default) * .9)
train <- Default[train_num, ]
validate <- Default[-train_num, ]
fit_glm <- glm(default ~ income + balance, family = 'binomial', data = train)
prob_glm <- predict(fit_glm, validate, type = 'response')
pred_glm <- rep('No', length(prob_glm))
pred_glm[prob_glm >.5] <- 'Yes'
table(pred_glm, validate$default)
##
## pred_glm No Yes
## No 953 33
## Yes 4 10
1 - mean(pred_glm==validate$default)
## [1] 0.037
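As a quick check of the split-to-split variability noted above, the 75/25 estimate can be repeated over a few arbitrary seeds. This sketch is a standalone aside: running it consumes random draws, so it is not part of the seed-22 pipeline used elsewhere in this write-up.
# Illustrative sketch: repeat the 75/25 validation split over several seeds to
# see how much the estimated error rate moves around (seeds 1-5 are arbitrary).
sapply(1:5, function(s) {
  set.seed(s)
  idx <- sample(nrow(Default), nrow(Default) * .75)
  fit <- glm(default ~ income + balance, family = 'binomial', data = Default[idx, ])
  prob <- predict(fit, Default[-idx, ], type = 'response')
  pred <- ifelse(prob > .5, 'Yes', 'No')
  1 - mean(pred == Default[-idx, 'default'])
})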
We now fit a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student, estimate its test error using the validation set approach, and comment on whether including the dummy variable leads to a reduction in the test error rate. I created stu_dum, a dummy variable for student where 1 = Yes and 0 = No, and then recreated the 75/25 split. There was no meaningful difference in the accuracy of the model, and including the dummy variable certainly did not lead to a reduction in the test error: the previous 75/25 split had an error rate of 2.6%, and the model with the dummy variable (below) also returns an error rate of 2.6%.
Default$stu_dum <- ifelse(Default$student == 'No', 0, 1)
train_num <- sample(nrow(Default), nrow(Default) * .75)
train <- Default[train_num, ]
validate <- Default[-train_num, ]
fit_glm <- glm(default ~ income + balance + stu_dum, family = 'binomial', data = train)
prob_glm <- predict(fit_glm, validate, type = 'response')
pred_glm <- rep('No', length(prob_glm))
pred_glm[prob_glm >.5] <- 'Yes'
table(pred_glm, validate$default)
##
## pred_glm No Yes
## No 2410 54
## Yes 11 25
1 - mean(pred_glm==validate$default)
## [1] 0.026
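Since student is already stored as a factor in Default, glm() would build an equivalent 0/1 dummy automatically if the variable were included directly. As a small illustrative check on the same training split, the coefficient glm() creates for studentYes equals the one estimated for stu_dum (fit_factor is a name introduced only for this sketch).
# Letting glm() dummy-code the student factor gives the same fit as stu_dum.
fit_factor <- glm(default ~ income + balance + student, family = 'binomial', data = train)
all.equal(coef(fit_glm)[['stu_dum']], coef(fit_factor)[['studentYes']])  # TRUE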
We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. A random seed is again set before beginning the analysis.
set.seed(22)
Using the summary() and glm() functions, the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors are \(4.985 \times 10^{-6}\) and \(2.274 \times 10^{-4}\), respectively.
summary(glm(default ~ income + balance, family = 'binomial', data = Default))
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
We next write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and outputs the coefficient estimates for income and balance in the multiple logistic regression model.
boot.fn <- function(data, index) {
  coef(glm(default ~ income + balance, family = 'binomial', data = data, subset = index))
}
Using the boot() function together with boot.fn(), the estimated standard errors of the logistic regression coefficients for income and balance are \(4.990 \times 10^{-6}\) and \(2.269 \times 10^{-4}\), respectively.
boot(Default,boot.fn,1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.552739e-02 4.326477e-01
## t2* 2.080898e-05 1.662128e-07 4.990171e-06
## t3* 5.647103e-03 1.479837e-05 2.268672e-04
Comparing the estimated standard errors obtained using the glm() function with those obtained using the bootstrap, the two sets of estimates are very similar: the bootstrap standard error is slightly higher for income and slightly lower for balance, but the differences are so small that they are more than likely negligible in this situation.
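For a side-by-side view of that comparison (illustrative only; glm_se, boot_se, and boot_out are names introduced for this sketch), the two sets of standard errors can be placed in one small table. Re-using the seed from above, the bootstrap replicates should match the boot() run shown earlier.
# Formula-based vs. bootstrap standard errors in a single table.
set.seed(22)
boot_out <- boot(Default, boot.fn, 1000)
glm_se <- summary(glm(default ~ income + balance, family = 'binomial',
                      data = Default))$coefficients[, 'Std. Error']
boot_se <- apply(boot_out$t, 2, sd)
rbind(glm_se, boot_se)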
We now turn to the Boston housing data set from the ISLR2 library. First, we provide an estimate for the population mean of medv; call this estimate \(\hat{\mu}\).
mu_hat <- mean(Boston$medv)
mu_hat
## [1] 22.53281
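Next, we estimate the standard error of \(\hat{\mu}\) using the usual formula, where \(s\) is the sample standard deviation of medv and \(n = 506\) is the number of census tracts:
\[
\widehat{SE}(\hat{\mu}) = \frac{s}{\sqrt{n}}
\]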
SEM <- sd(Boston$medv) / sqrt(nrow(Boston))
SEM
## [1] 0.4088611
The bootstrap estimate of the standard error of \(\hat{\mu}\) was slightly higher than the formula-based value when I ran 1,000 replicates, but slightly lower when I ran 10,000 replicates.
set.seed(22)
boot.fn <- function(data, index) {
  mean(data[index])
}
boot(Boston$medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 -0.01964348 0.4257536
boot(Boston$medv, boot.fn, 10000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 10000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 -0.00476083 0.4067331
The results from using the t.test() function and the bootstrap are very similar (identical down to the tenths place).
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
set.seed(22)
boot_SEM <- boot(Boston$medv, boot.fn, 10000)
CI_mu_hat <- c(mu_hat - (2*.4067331), mu_hat + (2*.4067331))
CI_mu_hat
## [1] 21.71934 23.34627
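As an aside (not required here), boot.ci() can compute an interval directly from the saved boot object; a percentile interval from boot_SEM should land close to the rough \(\hat{\mu} \pm 2 \cdot SE(\hat{\mu})\) interval above.
# Illustrative alternative: percentile bootstrap CI from the saved boot object.
boot.ci(boot_SEM, type = 'perc')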
Next, we provide an estimate, \(\hat{\mu}_{med}\), for the median value of medv in the population.
mu_hat_medv <- median(Boston$medv)
mu_hat_medv
## [1] 21.2
The bootstrap estimate of the median was the same, 21.2; however, its standard error is smaller than that of the mean, which would result in a narrower confidence interval overall.
set.seed(22)
boot.fn <- function(data, index) {
  median(data[index])
}
boot(Boston$medv, boot.fn, 10000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 10000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.01389 0.3814231
Finally, we provide an estimate for the tenth percentile of medv in Boston census tracts; call this quantity \(\hat{\mu}_{0.1}\). Using the quantile() function, the estimate for \(\hat{\mu}_{0.1}\) of medv in Boston census tracts is 12.75.
mu_hat_0.1 <- quantile(Boston$medv, probs = .1)
mu_hat_0.1
## 10%
## 12.75
Using the bootstrap, the estimated \(\hat{\mu}_{0.1}\) is again 12.75, and its standard error is relatively small overall at roughly 0.5.
set.seed(22)
boot.fn <- function(data, index) {
  quantile(data[index], 0.1)
}
boot(Boston$medv, boot.fn, 10000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 10000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 -0.005305 0.509029
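Applying the same rough two-standard-error rule used earlier for the mean gives an approximate interval for the tenth percentile (illustrative, plugging in the bootstrap standard error reported above).
# Approximate 95% interval for the tenth percentile, mirroring CI_mu_hat above.
c(mu_hat_0.1 - 2 * 0.509029, mu_hat_0.1 + 2 * 0.509029)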