In k-fold cross-validation, the data is randomly split into k parts. The model is trained on k − 1 parts and tested on the remaining part. This process repeats k times, and the errors are averaged.
Compared to the validation set approach, k-fold reduces variance since it uses more data for training each time. Compared to LOOCV, k-fold is less computationally expensive and has lower variance, but LOOCV may produce less biased estimates.
glm.fit <- glm(default ~ income + balance, data=Default, family=binomial)
summary(glm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Answer: The logistic regression shows balance
is strongly associated with default, while income
is
not.
train <- sample(1:nrow(Default), nrow(Default)/2)
glm.fit.train <- glm(default ~ income + balance, data=Default, subset=train, family=binomial)
glm.probs <- predict(glm.fit.train, Default[-train,], type="response")
glm.pred <- ifelse(glm.probs > 0.5, "Yes", "No")
error1 <- mean(glm.pred != Default[-train,]$default)
error1
## [1] 0.0254
Answer: Validation error is about 2.54% to 3% depending on the split.
error_rates <- rep(NA,3)
for (i in 1:3) {
set.seed(i)
train <- sample(1:nrow(Default), nrow(Default)/2)
glm.fit.train <- glm(default ~ income + balance, data=Default, subset=train, family=binomial)
glm.probs <- predict(glm.fit.train, Default[-train,], type="response")
glm.pred <- ifelse(glm.probs > 0.5, "Yes", "No")
error_rates[i] <- mean(glm.pred != Default[-train,]$default)
}
error_rates
## [1] 0.0254 0.0238 0.0264
Answer: The error rates across the three splits are slightly different but stable. They came out to be 2.54%, 2.38%, and 2.64%
error_rates_student <- rep(NA,3)
for (i in 1:3) {
set.seed(i)
train <- sample(1:nrow(Default), nrow(Default)/2)
glm.fit.train <- glm(default ~ income + balance + student, data=Default, subset=train, family=binomial)
glm.probs <- predict(glm.fit.train, Default[-train,], type="response")
glm.pred <- ifelse(glm.probs > 0.5, "Yes", "No")
error_rates_student[i] <- mean(glm.pred != Default[-train,]$default)
}
error_rates_student
## [1] 0.0260 0.0246 0.0272
Answer: Including the student
variable does not
significantly reduce the test error. They came out to 2.60%, 2.46%, and
2.72%
summary(glm.fit)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income 2.080898e-05 4.985167e-06 4.174178 2.990638e-05
## balance 5.647103e-03 2.273731e-04 24.836280 3.638120e-136
Answer: The standard errors from glm()
are
approximately 4.98e-06 for income
and 2.27e-04 for
balance
.
boot.fn <- function(data, index){
fit <- glm(default ~ income + balance, data=data, family=binomial, subset=index)
return(coef(fit))
}
boot(Default, boot.fn, R=1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -4.138024e-02 4.381705e-01
## t2* 2.080898e-05 2.071209e-07 4.860557e-06
## t3* 5.647103e-03 1.975295e-05 2.310331e-04
Answer: Bootstrap estimates are very close to
glm()
estimates, confirming their reliability.
Answer: The bootstrap estimates are very close to the
glm()
estimates, confirming their reliability. Both methods
yield similar results for the standard errors of income
and
balance
, suggesting that the model is stable.
mu_hat <- mean(Boston$medv)
mu_hat
## [1] 22.53281
Answer: The mean of medv
is approximately
22.53.
se_mu <- sd(Boston$medv) / sqrt(nrow(Boston))
se_mu
## [1] 0.4088611
Answer: Standard error is about 0.41.
boot.mean.fn <- function(data, index){
return(mean(data[index]))
}
boot.out <- boot(Boston$medv, boot.mean.fn, R=1000)
boot.out
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.mean.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.02053913 0.405457
Answer: Bootstrap confirms similar SE to formula.
boot_se <- sd(boot.out$t)
CI_boot <- c(mu_hat - 2 * boot_se, mu_hat + 2 * boot_se)
CI_boot
## [1] 21.72189 23.34372
t.test(Boston$medv)$conf.int
## [1] 21.72953 23.33608
## attr(,"conf.level")
## [1] 0.95
Answer: Bootstrap CI and t-test CI are similar.
mu_med <- median(Boston$medv)
mu_med
## [1] 21.2
Answer: Median medv
is around 21.2.
boot.median.fn <- function(data, index){
return(median(data[index]))
}
boot.out.median <- boot(Boston$medv, boot.median.fn, R=1000)
boot.out.median
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.median.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.02225 0.3786632
Answer: Bootstrap SE for the median is around 0.38.
mu_0.1 <- quantile(Boston$medv, 0.1)
mu_0.1
## 10%
## 12.75
Answer: The 10th percentile is around 12.75.
boot.quantile.fn <- function(data, index){
return(quantile(data[index], 0.1))
}
boot.out.quantile <- boot(Boston$medv, boot.quantile.fn, R=1000)
boot.out.quantile
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.quantile.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0012 0.5098416
Answer: Bootstrap SE for the 10th percentile is around 0.5.