In k-fold cross-validation, the observations are randomly divided into k roughly equal-sized folds. For each fold in turn, the model is fit on the other k − 1 folds and its error is evaluated on the held-out fold; the k resulting error estimates are then averaged.
Compared to the validation set approach, k-fold CV gives a less biased estimate of the test error, since each fit uses more of the data, and a less variable one, since the estimate is averaged over k splits rather than depending on a single one. Compared to LOOCV, k-fold CV is computationally cheaper and tends to have lower variance, although LOOCV produces a less biased estimate.
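As a concrete illustration of the procedure, here is a minimal sketch of 5-fold CV for the Default logistic model used below, with the fold construction and train/test loop spelled out by hand (assumes the Default data comes from the ISLR package):
library(ISLR)   # assumed source of the Default data set
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out=nrow(Default)))   # random fold labels
cv.errors <- rep(NA, k)
for (j in 1:k) {
  fit <- glm(default ~ income + balance, data=Default[folds != j,], family=binomial)
  probs <- predict(fit, Default[folds == j,], type="response")
  pred <- ifelse(probs > 0.5, "Yes", "No")
  cv.errors[j] <- mean(pred != Default$default[folds == j])
}
mean(cv.errors)   # k-fold estimate of the test error rate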
library(ISLR)   # provides the Default data set
library(boot)   # provides boot(), used later
glm.fit <- glm(default ~ income + balance, data=Default, family=binomial)
summary(glm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Answer: Both income and balance are statistically significant predictors of default (p = 2.99e-05 and p < 2e-16, respectively). Balance has by far the stronger association: its z value is about 24.8, versus 4.2 for income.
set.seed(1)   # fix the split so the result below is reproducible
train <- sample(1:nrow(Default), nrow(Default)/2)
glm.fit.train <- glm(default ~ income + balance, data=Default, subset=train, family=binomial)
glm.probs <- predict(glm.fit.train, Default[-train,], type="response")
glm.pred <- ifelse(glm.probs > 0.5, "Yes", "No")
error1 <- mean(glm.pred != Default[-train,]$default)
error1
## [1] 0.0254
Answer: The validation set error is about 2.5%, though the exact value depends on the random split.
# repeat the validation-set split three times with different seeds
error_rates <- rep(NA,3)
for (i in 1:3) {
  set.seed(i)
  train <- sample(1:nrow(Default), nrow(Default)/2)
  glm.fit.train <- glm(default ~ income + balance, data=Default, subset=train, family=binomial)
  glm.probs <- predict(glm.fit.train, Default[-train,], type="response")
  glm.pred <- ifelse(glm.probs > 0.5, "Yes", "No")
  error_rates[i] <- mean(glm.pred != Default[-train,]$default)
}
error_rates
## [1] 0.0254 0.0238 0.0264
Answer: The error rates across the three splits are slightly different but stable: 2.54%, 2.38%, and 2.64%.
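As a quick check (not in the original output), the spread across splits can be summarized directly:
mean(error_rates)   # average validation error over the three splits
sd(error_rates)     # split-to-split variability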
# same experiment, now adding the student dummy variable
error_rates_student <- rep(NA,3)
for (i in 1:3) {
  set.seed(i)
  train <- sample(1:nrow(Default), nrow(Default)/2)
  glm.fit.train <- glm(default ~ income + balance + student, data=Default, subset=train, family=binomial)
  glm.probs <- predict(glm.fit.train, Default[-train,], type="response")
  glm.pred <- ifelse(glm.probs > 0.5, "Yes", "No")
  error_rates_student[i] <- mean(glm.pred != Default[-train,]$default)
}
error_rates_student
## [1] 0.0260 0.0246 0.0272
Answer: Including the student variable does not reduce the test error; the rates come out marginally higher, at 2.60%, 2.46%, and 2.72%.
summary(glm.fit)$coefficients
##                  Estimate   Std. Error    z value      Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income       2.080898e-05 4.985167e-06   4.174178  2.990638e-05
## balance      5.647103e-03 2.273731e-04  24.836280 3.638120e-136
Answer: The standard errors from glm() are approximately 4.99e-06 for income, 2.27e-04 for balance, and 0.435 for the intercept.
boot.fn <- function(data, index){
  # refit the logistic model on the bootstrap sample and return its coefficients
  fit <- glm(default ~ income + balance, data=data, family=binomial, subset=index)
  return(coef(fit))
}
boot(Default, boot.fn, R=1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
##          original          bias    std. error
## t1* -1.154047e+01 -4.138024e-02  4.381705e-01
## t2*  2.080898e-05  2.071209e-07  4.860557e-06
## t3*  5.647103e-03  1.975295e-05  2.310331e-04
Answer: The bootstrap estimates are very close to the
glm() estimates, confirming their reliability. Both methods
yield similar results for the standard errors of income and
balance, suggesting that the model is stable.
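For a side-by-side view, the bootstrap replicates can be tabulated against the glm() standard errors directly. This is a small sketch; boot.out.coef is a name introduced here for the stored boot() result:
boot.out.coef <- boot(Default, boot.fn, R=1000)
# column 1: formula-based SEs from glm(); column 2: bootstrap SEs
cbind(glm.se=summary(glm.fit)$coefficients[, "Std. Error"],
      boot.se=apply(boot.out.coef$t, 2, sd))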
library(MASS)   # provides the Boston data set
mu_hat <- mean(Boston$medv)
mu_hat
## [1] 22.53281
Answer: The mean of medv is approximately
22.53.
se_mu <- sd(Boston$medv) / sqrt(nrow(Boston))
se_mu
## [1] 0.4088611
Answer: The standard error of the sample mean, s/sqrt(n), is about 0.409.
boot.mean.fn <- function(data, index){
  return(mean(data[index]))
}
boot.out <- boot(Boston$medv, boot.mean.fn, R=1000)
boot.out
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.mean.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.02053913 0.405457
Answer: The bootstrap SE of the mean (about 0.405) is very close to the formula-based estimate of 0.409.
boot_se <- sd(boot.out$t)
CI_boot <- c(mu_hat - 2 * boot_se, mu_hat + 2 * boot_se)
CI_boot
## [1] 21.72189 23.34372
t.test(Boston$medv)$conf.int
## [1] 21.72953 23.33608
## attr(,"conf.level")
## [1] 0.95
Answer: The bootstrap CI (21.72, 23.34) is nearly identical to the 95% t-test CI (21.73, 23.34).
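The boot package can also construct intervals directly from the replicates via boot.ci(); as a sketch using the boot.out object above, the normal-approximation and percentile variants are alternatives to the ad hoc ±2·SE rule:
boot.ci(boot.out, type=c("norm", "perc"))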
mu_med <- median(Boston$medv)
mu_med
## [1] 21.2
Answer: The median of medv is 21.2.
boot.median.fn <- function(data, index){
  return(median(data[index]))
}
boot.out.median <- boot(Boston$medv, boot.median.fn, R=1000)
boot.out.median
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.median.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.02225 0.3786632
Answer: The bootstrap SE for the median is around 0.38, somewhat smaller than the SE of the mean. Unlike the mean, the median has no simple standard-error formula, so the bootstrap is especially useful here.
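For comparison, the asymptotic approximation SE(median) ≈ 1/(2·f(m)·sqrt(n)) can be evaluated with a kernel density estimate of f at the sample median; this sketch is added here for illustration and is not part of the original analysis:
# estimate the density of medv at its median, then apply the formula
d <- density(Boston$medv)
f_m <- approx(d$x, d$y, xout=median(Boston$medv))$y
1 / (2 * f_m * sqrt(nrow(Boston)))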
mu_0.1 <- quantile(Boston$medv, 0.1)
mu_0.1
## 10%
## 12.75
Answer: The 10th percentile of medv is 12.75.
boot.quantile.fn <- function(data, index){
  return(quantile(data[index], 0.1))
}
boot.out.quantile <- boot(Boston$medv, boot.quantile.fn, R=1000)
boot.out.quantile
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.quantile.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0012 0.5098416
Answer: The bootstrap SE for the 10th percentile is around 0.51, small relative to the estimate of 12.75, so this quantile is estimated fairly precisely.