In this approach, the dataset is randomly divided into k groups, or folds, each of similar size. One fold is designated as the validation set, while the remaining k - 1 folds are used for training the model. The validation set is then evaluated to compute the mean squared error (MSE1). This process is repeated k times, with each fold taking turns as the validation set. Consequently, k estimates of the test error are generated. Finally, the k-fold cross-validation estimate is obtained by averaging these computed values.
The validation set approach offers simplicity and straightforward implementation, making it accessible for many applications. However, it comes with drawbacks. The validation mean squared error (MSE) can exhibit significant variability, potentially affecting the reliability of model evaluation. Additionally, this method utilizes only a portion of the available observations for model training, which may limit its effectiveness, particularly with smaller datasets.
LOOCV offers the advantage of lower bias compared to the validation approach. Unlike the validation method, which produces varying MSE values upon repeated application due to the random splitting process, LOOCV consistently yields the same results when performed multiple times, as each iteration involves splitting based on one observation at a time. However, a notable disadvantage of LOOCV is its high computational intensity, as it requires fitting the model repeatedly for each individual observation, which can be resource-intensive, particularly with large datasets.
library(ISLR)
data("Default")
logit_model <- glm(default ~ income + balance, data = Default, family = binomial)
summary(logit_model)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
train <- sample(dim(Default)[1], dim(Default)[1]*0.50)
test <- Default[-train, ]
logit_model2 <- glm(default ~ income+ balance, data = Default, family = 'binomial', subset = train)
summary(logit_model2)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.147e+01 6.147e-01 -18.653 < 2e-16 ***
## income 2.229e-05 7.085e-06 3.146 0.00166 **
## balance 5.594e-03 3.194e-04 17.512 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1477.13 on 4999 degrees of freedom
## Residual deviance: 808.97 on 4997 degrees of freedom
## AIC: 814.97
##
## Number of Fisher Scoring iterations: 8
log_prob = predict(logit_model2, test, type = "response")
log_pred = rep("No", dim(Default)[1]*0.50)
log_pred[log_prob > 0.5] = "Yes"
table(log_pred, test$default)
##
## log_pred No Yes
## No 4814 112
## Yes 22 52
mean(log_pred !=test$default)
## [1] 0.0268
train <- sample(dim(Default)[1], dim(Default)[1]*0.50)
test <- Default[-train, ]
logit_model2 <- glm(default ~ income+ balance, data = Default, family = 'binomial', subset = train)
log_prob = predict(logit_model2, test, type = "response")
log_pred = rep("No", dim(Default)[1]*0.50)
log_pred[log_prob > 0.5] = "Yes"
table(log_pred, test$default)
##
## log_pred No Yes
## No 4816 116
## Yes 15 53
mean(log_pred !=test$default)
## [1] 0.0262
train <- sample(dim(Default)[1], dim(Default)[1]*0.50)
test <- Default[-train, ]
logit_model2 <- glm(default ~ income+ balance, data = Default, family = 'binomial', subset = train)
log_prob = predict(logit_model2, test, type = "response")
log_pred = rep("No", dim(Default)[1]*0.50)
log_pred[log_prob > 0.5] = "Yes"
table(log_pred, test$default)
##
## log_pred No Yes
## No 4813 123
## Yes 12 52
mean(log_pred !=test$default)
## [1] 0.027
train <- sample(dim(Default)[1], dim(Default)[1]*0.50)
test <- Default[-train, ]
logit_model2 <- glm(default ~ income+ balance, data = Default, family = 'binomial', subset = train)
log_prob = predict(logit_model2, test, type = "response")
log_pred = rep("No", dim(Default)[1]*0.50)
log_pred[log_prob > 0.5] = "Yes"
table(log_pred, test$default)
##
## log_pred No Yes
## No 4807 117
## Yes 27 49
mean(log_pred !=test$default)
## [1] 0.0288
It looks like the validation set error that is calculated is different every single time.
train <- sample(dim(Default)[1], dim(Default)[1]*0.50)
test <- Default[-train, ]
logit_model3 <- glm(default ~ income+ balance+student, data = Default, family = 'binomial', subset = train)
log_prob = predict(logit_model2, test, type = "response")
log_pred = rep("No", dim(Default)[1]*0.50)
log_pred[log_prob > 0.5] = "Yes"
table(log_pred, test$default)
##
## log_pred No Yes
## No 4802 112
## Yes 19 67
mean(log_pred !=test$default)
## [1] 0.0262
It looks like adding the dummy variable of student to the logistic model reduces the test error rate very slighlty to 0.0252.
logit_model4 <- glm(default ~ income + balance, data = Default, family = binomial)
summary(logit_model4)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
The estimated standard error for Income is 4.985e-06 , adn the estimated standard error for Balance is 2.274e-04.
boot.fn = function(data, index) return(coef(glm(default ~ balance + income, data = data, family = binomial, subset = index)))
library(boot)
boot(Default, boot.fn, 100)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.411727e-02 4.589800e-01
## t2* 5.647103e-03 1.203954e-05 2.459425e-04
## t3* 2.080898e-05 7.179969e-07 4.670885e-06
The estimated standard errors for the coefficients for income and balance are 2.294044e-04 and 3.875525e-06.
library(MASS)
data("Boston")
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
attach(Boston)
mean_medv = mean(medv)
mean_medv
## [1] 22.53281
st_mean = sd(medv)/sqrt(length(medv))
st_mean
## [1] 0.4088611
boot.fn2 = function(data, index) return(mean(data[index]))
boot2 = boot(medv, boot.fn2, 100)
boot2
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn2, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.0261581 0.3808429
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
CI_bos = c(22.53 - 2 * 0.4174872, 22.53 + 2 * 0.4174872)
CI_bos
## [1] 21.69503 23.36497
median_medv = median(medv)
median_medv
## [1] 21.2
boot.fn3 = function(data, index) return(median(data[index]))
boot3 = boot(medv, boot.fn3, 1000)
boot3
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn3, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.0082 0.3744569
ten_medv = quantile(medv, c(0.1))
ten_medv
## 10%
## 12.75
boot.fn4 = function(data, index) return(quantile(data[index], c(0.1)))
boot4 = boot(medv, boot.fn4, 1000)
boot4
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn4, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0045 0.4976871