library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.2
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: ggplot2
## Loading required package: lattice
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
##
## melanoma
attach(Default)
attach(Boston)
K-fold cross-validation splits the data into K equal-sized folds. The model is trained on K-1 folds and tested on the remaining fold. This process repeats K times, with a different fold held out each time, and the K test results are averaged to give the final error estimate. This provides a check on how well the model generalizes to new data.
i. The validation set approach?
K-fold cross-validation is generally better than the validation set approach because every observation is used for both training and testing, so the error estimate depends much less on one particular split and is more reliable. It also makes better use of limited data, since each fold serves as training data K-1 times and as test data once. The disadvantage is computational cost: K-fold fits the model K times instead of once, which can matter on very large datasets.
ii. LOOCV?
Compared with LOOCV, K-fold cross-validation is faster and more practical because it fits the model only K times rather than n times. LOOCV has slightly lower bias, since each training set is nearly the full dataset, but its estimate tends to have higher variance because the n training sets are almost identical and the resulting fits are highly correlated. K-fold with K = 5 or 10 averages over less-correlated fits and usually gives a more stable estimate of test error.
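As an illustration only (not part of the assignment output), both procedures can be run with boot::cv.glm(); the misclassification cost below follows the example in ?cv.glm, and the object names cost_fn and cv_fit are my own.
# Sketch: 10-fold CV vs. LOOCV error for the logistic model on Default.
# cost_fn compares the 0/1 response supplied by cv.glm() with the fitted probabilities.
cost_fn <- function(actual, prob) mean(abs(actual - prob) > 0.5)
cv_fit <- glm(default ~ income + balance, data = Default, family = binomial)
cv.glm(Default, cv_fit, cost = cost_fn, K = 10)$delta[1]  # 10-fold: K refits
# cv.glm(Default, cv_fit, cost = cost_fn)$delta[1]        # LOOCV: n refits (slow on 10,000 rows)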
set.seed(135)
default.glm <- glm(default ~ income + balance, data = Default, family = binomial)
summary(default.glm)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
default_train_index <- createDataPartition(Default$default, p = 0.7, list = FALSE)
default_train <- Default[default_train_index, ]
default_test <- Default[-default_train_index, ]
default2_glm <- glm(default ~ income + balance, data = default_train , family = binomial)
summary(default2_glm)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = default_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.156e+01 5.132e-01 -22.522 < 2e-16 ***
## income 2.986e-05 5.893e-06 5.068 4.02e-07 ***
## balance 5.471e-03 2.624e-04 20.849 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2050.6 on 7000 degrees of freedom
## Residual deviance: 1141.4 on 6998 degrees of freedom
## AIC: 1147.4
##
## Number of Fisher Scoring iterations: 8
pred_probs <- predict(default2_glm, newdata = default_test, type = 'response')
predictions <- ifelse(pred_probs > 0.5, 'Yes', 'No')
mean(predictions != default_test$default)
## [1] 0.027009
The validation set error is 0.027009, meaning about 2.7% of the observations in the validation set were misclassified with seed 135.
set.seed(120)
default_train_index2 <- createDataPartition(Default$default, p = 0.7, list = FALSE)
default_train2 <- Default[default_train_index2, ]
default_test2 <- Default[-default_train_index2, ]
default3_glm <- glm(default ~ income + balance, data = default_train2, family = binomial)
pred_probs2 <- predict(default3_glm, newdata = default_test2, type = 'response')
predictions2 <- ifelse(pred_probs2 > 0.5, 'Yes', 'No')
mean(predictions2 != default_test2$default)
## [1] 0.02434145
The validation set error is 0.02434145, meaning about 2.43% of the observations in the validation set were misclassified with seed 120.
set.seed(92)
default_train_index3 <- createDataPartition(Default$default, p = 0.7, list = FALSE)
default_train3 <- Default[default_train_index3, ]
default_test3 <- Default[-default_train_index3, ]
default4_glm <- glm(default ~ income + balance, data = default_train3, family = binomial)
pred_probs3 <- predict(default4_glm, newdata = default_test3, type = 'response')
predictions3 <- ifelse(pred_probs3 > 0.5, 'Yes', 'No')
mean(predictions3 != default_test3$default)
## [1] 0.02567523
The validation set error is 0.02567523, meaning about 2.57% of the observations in the validation set were misclassified with seed 92.
set.seed(68)
default_train_index3 <- createDataPartition(Default$default, p = 0.7, list = FALSE)
default_train4 <- Default[default_train_index3, ]
default_test4 <- Default[-default_train_index3, ]
default5_glm <- glm(default ~ income + balance, data = default_train4, family = binomial)
pred_probs4 <- predict(default5_glm, newdata = default_test4, type = 'response')
predictions4 <- ifelse(pred_probs4 > 0.5, 'Yes', 'No')
mean(predictions4 != default_test4$default)
## [1] 0.02367456
The validation set error is 0.02367456, meaning about 2.37% of the observations in the validation set were misclassified with seed 68.
After repeating the process with three additional seeds, the validation error was 0.0243 for seed 120, 0.0257 for seed 92, and 0.0237 for seed 68, compared with 0.0270 for seed 135. The small differences in the errors are expected because of the different splits, but overall the model performed similarly each time.
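As a sketch only (the helper name val_error is mine), the repeated splits above could be wrapped in a small function and run over the four seeds in one call:
# Sketch: recompute the validation-set error for each seed used above.
val_error <- function(seed) {
  set.seed(seed)
  idx  <- createDataPartition(Default$default, p = 0.7, list = FALSE)
  fit  <- glm(default ~ income + balance, data = Default[idx, ], family = binomial)
  prob <- predict(fit, newdata = Default[-idx, ], type = 'response')
  mean(ifelse(prob > 0.5, 'Yes', 'No') != Default[-idx, ]$default)
}
sapply(c(135, 120, 92, 68), val_error)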
set.seed(28)
default_trainD_index <- createDataPartition(Default$default, p = 0.7, list = FALSE)
default_trainD <- Default[default_trainD_index, ]
default_testD <- Default[-default_trainD_index, ]
student_default_glm <- glm(default ~ income + balance + student, data = default_trainD, family = binomial)
summary(student_default_glm)
##
## Call:
## glm(formula = default ~ income + balance + student, family = binomial,
## data = default_trainD)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.139e+01 6.123e-01 -18.608 <2e-16 ***
## income 2.935e-06 1.037e-05 0.283 0.7773
## balance 6.034e-03 2.919e-04 20.668 <2e-16 ***
## studentYes -4.855e-01 2.933e-01 -1.655 0.0978 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2050.6 on 7000 degrees of freedom
## Residual deviance: 1043.3 on 6997 degrees of freedom
## AIC: 1051.3
##
## Number of Fisher Scoring iterations: 8
pred_probsD <- predict(student_default_glm, newdata = default_testD, type = 'response')
predictionsD <- ifelse(pred_probsD > 0.5, 'Yes', 'No')
mean(predictionsD != default_testD$default)
## [1] 0.03001
The validation set error is 0.03001, meaning about 3% of the observations in the validation set were misclassified with seed 28. Including the student variable did not reduce the test error relative to the model without it, suggesting that student adds little to the prediction of default once income and balance are included.
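As a quick sketch (reusing the objects defined above; fit_no_student is a name I've introduced), the two models can be compared on the same train/test split:
# Sketch: test error without the student dummy on the same split used above.
fit_no_student <- glm(default ~ income + balance, data = default_trainD, family = binomial)
prob_no_student <- predict(fit_no_student, newdata = default_testD, type = 'response')
err_no_student <- mean(ifelse(prob_no_student > 0.5, 'Yes', 'No') != default_testD$default)
c(without_student = err_no_student, with_student = mean(predictionsD != default_testD$default))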
set.seed(69)
glm_default <- glm(default ~ income + balance, data = Default, family = binomial)
summary(glm_default)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
# Statistic for boot(): coefficient estimates refit on the bootstrap sample given by index
boot.fn <- function(data, index)
  coef(glm(default ~ income + balance, data = data, subset = index, family = binomial))
boot(Default, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -4.813602e-02 4.310357e-01
## t2* 2.080898e-05 2.361554e-07 4.998989e-06
## t3* 5.647103e-03 2.426458e-05 2.256645e-04
The estimated standard errors from the glm() function and the bootstrap method are very similar, which suggests that the logistic regression model’s standard error estimates are reliable.
For income, the glm() function gave a standard error of 4.985e-06, while the bootstrap estimate was 4.999e-06, showing only a small difference.
For balance, the glm() function estimated 2.274e-04, while the bootstrap method gave 2.257e-04, again a very small difference.
These results indicate that the standard errors from glm() are stable, and the bootstrap approach confirms their accuracy with only minor variation.
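Only as a sketch: storing the bootstrap output in an object (boot_out, a name I've introduced) lets the two sets of standard errors be printed side by side.
# Sketch: compare glm() standard errors with bootstrap standard errors directly.
boot_out <- boot(Default, boot.fn, 1000)
cbind(glm_se  = summary(glm_default)$coefficients[, 'Std. Error'],
      boot_se = apply(boot_out$t, 2, sd))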
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio lstat
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median : 5.000 Median :330.0 Median :19.05 Median :11.36
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
µ_mean <- mean(Boston$medv)
print(µ_mean)
## [1] 22.53281
std_error_µ <- sd(Boston$medv) / sqrt(length(Boston$medv))
print(std_error_µ)
## [1] 0.4088611
set.seed(145)
boot.fn2 <- function(data, index) {
return(mean(data[index]))
}
boot_results <- boot(Boston$medv, boot.fn2, R = 1000)
boot_results
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn2, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.002390711 0.4126301
boot_se <- sd(boot_results$t)
print(boot_se)
## [1] 0.4126301
The bootstrap standard error (0.4126) is very close to the analytical estimate from (b) (0.4089), showing that both methods give similar results.
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
ci_boot <- c(µ_mean - 2 * boot_se, µ_mean + 2 * boot_se)
print(ci_boot)
## [1] 21.70755 23.35807
t.test: (21.72953, 23.33608); bootstrap: (21.70755, 23.35807).
The bootstrap and t-test give very similar confidence intervals.
µmed_median <- median(Boston$medv)
print(µmed_median)
## [1] 21.2
boot.fn3 <- function(data, index) {
return(median(data[index]))
}
set.seed(145)
boot_results_median <- boot(data = Boston$medv, statistic = boot.fn3, R = 1000)
print(boot_results_median)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn3, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.0144 0.3865062
Our bootstrap estimate of the standard error of µmed is 0.3865062, which is slightly smaller than the standard error estimated for the sample mean (0.4126).
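Applying the same ±2 SE convention used above for the mean, a rough interval for the median would be (a sketch only, not required by the question):
# Sketch: approximate confidence interval for the median from the bootstrap SE.
med_se <- sd(boot_results_median$t)
c(µmed_median - 2 * med_se, µmed_median + 2 * med_se)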
µ0.1 <- quantile(Boston$medv, 0.1)
µ0.1
## 10%
## 12.75
boot.fn4 <- function(data, index) {
return(quantile(data[index], 0.1))
}
set.seed(145)
boot_results_0.1 <- boot(data = Boston$medv, statistic = boot.fn4, R = 1000)
print(boot_results_0.1)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn4, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.00405 0.5086648
Our bootstrap estimate of the standard error of µ0.1 is 0.5086648, which is larger than the standard errors estimated for the mean and the median.
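To close, a short sketch collecting the three bootstrap standard errors computed above:
# Sketch: bootstrap standard errors for the mean, median, and 10th percentile of medv.
c(mean_se   = sd(boot_results$t),
  median_se = sd(boot_results_median$t),
  pct10_se  = sd(boot_results_0.1$t))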