K-fold cross-validation works by randomly dividing the observations into k roughly equal-sized groups, or folds. One fold is set aside as the holdout (validation) set while the remaining k-1 folds are used to fit the model, and the fitted model is then evaluated on the holdout fold. This process is repeated until each fold has served as the holdout, and the test error is estimated by averaging the MSEs from the k validation fits.
The validation set approach has the advantage of simplicity: a clean 70-30 or 80-20 train-test split is fast to compute and, with a large enough dataset, is an effective division of the observations. Its disadvantages stem from that same simplicity. With only one split, smaller datasets can end up unevenly represented, the error estimate depends heavily on which observations happen to fall in the validation set, and because the model is trained on only a fraction of the data it tends to overestimate the test error. The k-fold method instead trains and evaluates the model across several different subsets of the data.
Since LOOCV uses as many folds as there are observations, it places a much larger strain on computational power than k-fold cross-validation: where 10-fold CV needs only 10 model fits, LOOCV must fit the model once for every observation. Another issue with LOOCV is that the n training sets differ by only a single observation, so the resulting fitted models are highly correlated with one another; averaging the MSEs of these correlated fits produces a test error estimate with higher variance.
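Both procedures are available through cv.glm() in the boot package (used again later in this write-up). The following is a minimal sketch on simulated data; the data frame, formula, and fold count are purely illustrative and not part of the analysis below.
library(boot)
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 2 * df$x + rnorm(100)
fit <- glm(y ~ x, data = df)        # gaussian family by default, i.e. ordinary least squares
cv.10 <- cv.glm(df, fit, K = 10)    # 10-fold CV: 10 model fits
cv.loo <- cv.glm(df, fit)           # K defaults to n, i.e. LOOCV: 100 model fits
cv.10$delta[1]                      # cross-validated estimate of the test MSE
cv.loo$delta[1]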
data("Default")
glm.default.fit = glm(default ~ income + balance, Default, family = "binomial")
summary(glm.default.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
set.seed(12345)
sample = sample.split(Default$default, SplitRatio = .5)
train = subset(Default, sample == TRUE)
test = subset(Default, sample == FALSE)
glm.default.fitb = glm(default ~ income + balance, train, family = "binomial")
summary(glm.default.fitb)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9768 -0.1488 -0.0590 -0.0223 3.7002
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.144e+01 6.082e-01 -18.815 < 2e-16 ***
## income 2.276e-05 6.875e-06 3.311 0.00093 ***
## balance 5.548e-03 3.187e-04 17.406 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1456.95 on 4999 degrees of freedom
## Residual deviance: 810.91 on 4997 degrees of freedom
## AIC: 816.91
##
## Number of Fisher Scoring iterations: 8
glm.probs = predict(glm.default.fitb, newdata = test, type = "response")
glm.preds = rep("No", length(glm.probs))
glm.preds[glm.probs > 0.5] = "Yes"
table(glm.preds, test$default)
##
## glm.preds No Yes
## No 4812 108
## Yes 21 59
mean(glm.preds == test$default) #accuracy
## [1] 0.9742
mean(glm.preds != test$default) #Error/misclassification
## [1] 0.0258
Only about 2.6% of the test observations were misclassified by this logistic regression model.
set.seed(54321)
sample2 = sample.split(Default$default, SplitRatio = .5)
train2 = subset(Default, sample2 == TRUE)
test2 = subset(Default, sample2 == FALSE)
glm.default.fit2 = glm(default ~ income + balance, train2, family = "binomial")
glm.probs2 = predict(glm.default.fit2, newdata = test2, type = "response")
glm.preds2 = rep("No", length(glm.probs2))
glm.preds2[glm.probs2 > 0.5] = "Yes"
table(glm.preds2, test2$default)
##
## glm.preds2 No Yes
## No 4811 109
## Yes 22 58
1 - mean(glm.preds2 == test2$default)
## [1] 0.0262
set.seed(415661)
sample3 = sample.split(Default$default, SplitRatio = .5)
train3 = subset(Default, sample3 == TRUE)
test3 = subset(Default, sample3 == FALSE)
glm.default.fit3 = glm(default ~ income + balance, train3, family = "binomial")
glm.probs3 = predict(glm.default.fit3, newdata = test3, type = "response")
glm.preds3 = rep("No", length(glm.probs3))
glm.preds3[glm.probs3 > 0.5] = "Yes"
table(glm.preds3, test3$default)
##
## glm.preds3 No Yes
## No 4809 112
## Yes 24 55
1 - mean(glm.preds3 == test3$default)
## [1] 0.0272
set.seed(75982)
sample4 = sample.split(Default$default, SplitRatio = .5)
train4 = subset(Default, sample4 == TRUE)
test4 = subset(Default, sample4 == FALSE)
glm.default.fit4 = glm(default ~ income + balance, train4, family = "binomial")
glm.probs4 = predict(glm.default.fit4, newdata = test4, type = "response")
glm.preds4 = rep("No", length(glm.probs4))
glm.preds4[glm.probs4 > 0.5] = "Yes"
table(glm.preds4, test4$default)
##
## glm.preds4 No Yes
## No 4820 113
## Yes 13 54
1 - mean(glm.preds4 == test4$default)
## [1] 0.0252
Across the three additional splits, the test error rates (2.52% to 2.72%) and the classifications made by the models vary only slightly.
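As a quick check on that claim, the four validation-set error rates obtained so far can be summarized directly; this small sketch simply restates the numbers reported above.
err.rates <- c(0.0258, 0.0262, 0.0272, 0.0252)  # error rates from the four splits above
mean(err.rates)  # roughly 0.026
sd(err.rates)    # roughly 0.0008, so the estimate is fairly stable across splits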
set.seed(45127385)
sample = sample.split(Default$default, SplitRatio = .5)
train = subset(Default, sample == TRUE)
test = subset(Default, sample == FALSE)
glm.default.fit = glm(default ~ income + balance + student, train, family = "binomial")
glm.probs = predict(glm.default.fit, newdata = test, type = "response")
glm.preds = rep("No", length(glm.probs))
glm.preds[glm.probs > 0.5] = "Yes"
table(glm.preds, test$default)
##
## glm.preds No Yes
## No 4820 115
## Yes 13 52
1 - mean(glm.preds == test$default)
## [1] 0.0256
Adding the student variable does not appear to have a meaningful positive or negative effect on the model's predictive accuracy: the test error rate of 2.56% is in line with the splits that used only income and balance.
set.seed(123)
fit.glm <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(fit.glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
The standard error of the intercept is 0.4347564, while the standard errors of the income and balance coefficients are 4.985e-06 and 2.274e-04, respectively.
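For reference, the same standard errors can be pulled out of the summary object programmatically rather than read off the printout:
summary(fit.glm)$coefficients[, "Std. Error"]  # standard errors for (Intercept), income, balance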
# boot.fn returns the coefficient estimates from a logistic regression fit
# to the bootstrap sample selected by 'index'
boot.fn = function(data, index) {
  return(coef(glm(default ~ income + balance, data = data, family = "binomial", subset = index)))
}
boot(Default, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -2.754771e-02 4.204817e-01
## t2* 2.080898e-05 1.582518e-07 4.729534e-06
## t3* 5.647103e-03 1.296980e-05 2.217214e-04
The standard errors from 1000 bootstrap replicates are as follows: 0.4204817 for the intercept, and 4.729534e-06 and 2.217214e-04 for the income and balance coefficients, respectively.
The estimated standard errors from the two methods are fairly close in this situation. Given how similar they are, the standard glm() summary is the simpler choice here, since it avoids the computational cost of refitting the model 1000 times for the bootstrap.
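To make the comparison explicit, the bootstrap object could be stored and its standard errors placed next to the glm() ones. A rough sketch, where boot.out is an illustrative name rather than an object created above:
boot.out <- boot(Default, boot.fn, R = 1000)  # same boot.fn as above, with results kept
cbind(glm.se  = summary(fit.glm)$coefficients[, "Std. Error"],
      boot.se = apply(boot.out$t, 2, sd))     # column-wise SD of the bootstrap replicates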
data("Boston")
mu.hat = mean(Boston$medv)
mu.hat
## [1] 22.53281
se.hat <- sd(Boston$medv) / sqrt(dim(Boston)[1])
se.hat
## [1] 0.4088611
The standard error of the mean tells us how much the sample mean would vary if we drew a new sample from the population. In this case the standard error is quite low relative to the mean of 22.53.
boot.fn = function(data, index) {
return(mean(data[index]))
}
boot(Boston$medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.02171561 0.4163195
The bootstrapped standard error (0.4163195 in the run shown above; the exact value changes slightly from run to run) is very close to the formula-based estimate of 0.4088611. Below, a bootstrap standard-error estimate is plugged into the two-standard-error rule to form an approximate 95% confidence interval.
confidence.int = c(22.53 - 2 * 0.4331926, 22.53 + 2 * 0.4331926)
confidence.int
## [1] 21.66361 23.39639
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
Because the bootstrap standard error is slightly larger, the resulting confidence interval is slightly wider. Even so, the two intervals are nearly the same, with the corresponding bounds differing by less than a tenth.
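As an aside, the interval can also be computed without hard-coding the standard error, by storing the bootstrap object and calling boot.ci(); a minimal sketch, with boot.mean as an illustrative name:
boot.mean <- boot(Boston$medv, function(data, index) mean(data[index]), R = 1000)
boot.ci(boot.mean, type = "norm")  # normal-approximation 95% interval for the mean of medv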
med.hat <- median(Boston$medv)
med.hat
## [1] 21.2
boot.fn = function(data, index) {
return(median(data[index]))
}
boot(Boston$medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.01105 0.3887242
The bootstrap gives exactly the same median estimate of 21.2. Its standard error of 0.3887242 is quite low relative to the median itself.
mu01.hat <- quantile(Boston$medv, c(0.1))
mu01.hat
## 10%
## 12.75
boot.fn <- function(data, index) {
return(quantile(data[index], c(0.1)))
}
boot(Boston$medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 -0.0169 0.5202692
Looking at the 10th percentile of medv, the estimates from (g) and (h) agree at 12.75, and the bootstrap standard error of 0.5202692 is small relative to that value.