K-fold cross-validation (KFCV) is similar to LOOCV, except that instead of holding out a single observation at a time, we split the data into a small number of chunks (typically K = 5 or 10). Each fold is a random subset of roughly equal size (1/5 or 1/10 of the data, depending on the K chosen); in turn, each fold serves as the test set while the model is fit on the remaining folds. The test MSE is computed for each iteration, and the K estimates are averaged.
KFCV gives less variable test-error estimates than the validation set approach. This is because KFCV eventually uses every observation for both training and testing, whereas the validation set approach keeps the test data completely separate and fits the model on only part of the data.
KFCV is also computationally lighter than LOOCV: it requires only K model fits (5 or 10 "chunks") rather than n fits, one per row, so it runs much faster, and it still produces multiple error estimates that can be averaged. That said, LOOCV has lower bias and higher variance than KFCV when k < n.
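As a concrete illustration, here is a minimal sketch of 10-fold cross-validation using cv.glm() from the boot package. It assumes the Auto data from ISLR and a simple linear model of mpg on horsepower, neither of which is part of this exercise.
library(ISLR)   # Auto data
library(boot)   # cv.glm()
set.seed(1)
# glm() without a family argument fits ordinary least squares,
# which is what cv.glm() expects for a regression problem
glm.fit.cv <- glm(mpg ~ horsepower, data = Auto)
# K = 10: each fold is held out once while the other nine folds train the model
cv.err <- cv.glm(Auto, glm.fit.cv, K = 10)
cv.err$delta[1]   # cross-validated estimate of the test MSE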
In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.
library(ISLR)
attach(Default)
Fit a logistic regression model that uses income and balance to predict default.
glm.fit <- glm(default~income+balance, data=Default, family="binomial")
summary(glm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
set.seed(1)
train <- sample(nrow(Default), nrow(Default)/2)
str(train)
## int [1:5000] 1017 8004 4775 9725 8462 4050 8789 1301 8522 1799 ...
glm.fit2 <- glm(default~income+balance, data=Default, family="binomial", subset = train)
summary(glm.fit2)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5830 -0.1428 -0.0573 -0.0213 3.3395
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.194e+01 6.178e-01 -19.333 < 2e-16 ***
## income 3.262e-05 7.024e-06 4.644 3.41e-06 ***
## balance 5.689e-03 3.158e-04 18.014 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.8 on 4999 degrees of freedom
## Residual deviance: 803.3 on 4997 degrees of freedom
## AIC: 809.3
##
## Number of Fisher Scoring iterations: 8
Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.
glm.probs = predict(glm.fit2, newdata = Default[-train, ], type="response")
glm.pred=rep("No", length(train))
glm.pred[glm.probs>0.5] = "Yes"
table(glm.pred)
## glm.pred
## No Yes
## 4932 68
mean(glm.pred != Default[-train, ]$default)
## [1] 0.0254
set.seed(2)
train <- sample(nrow(Default), nrow(Default)/2)
str(train)
## int [1:5000] 4806 8465 5469 3276 3453 690 6787 4914 9015 5695 ...
glm.fit2 <- glm(default~income+balance, data=Default, family="binomial", subset = train)
summary(glm.fit2)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3702 -0.1628 -0.0673 -0.0259 3.6470
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.090e+01 5.749e-01 -18.955 <2e-16 ***
## income 1.622e-05 6.891e-06 2.354 0.0186 *
## balance 5.365e-03 3.049e-04 17.598 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1483.83 on 4999 degrees of freedom
## Residual deviance: 854.49 on 4997 degrees of freedom
## AIC: 860.49
##
## Number of Fisher Scoring iterations: 8
glm.probs = predict(glm.fit2, newdata = Default[-train, ], type="response")
glm.pred=rep("No", length(train))
glm.pred[glm.probs>0.5] = "Yes"
table(glm.pred)
## glm.pred
## No Yes
## 4920 80
mean(glm.pred != Default[-train, ]$default)
## [1] 0.0238
set.seed(3)
train <- sample(nrow(Default), nrow(Default)/2)
str(train)
## int [1:5000] 3770 8844 5095 6692 6842 8168 2923 5087 8584 788 ...
glm.fit2 <- glm(default~income+balance, data=Default, family="binomial", subset = train)
summary(glm.fit2)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5804 -0.1390 -0.0524 -0.0180 3.7789
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.204e+01 6.326e-01 -19.031 < 2e-16 ***
## income 2.462e-05 6.941e-06 3.547 0.00039 ***
## balance 5.894e-03 3.287e-04 17.929 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1536.99 on 4999 degrees of freedom
## Residual deviance: 793.34 on 4997 degrees of freedom
## AIC: 799.34
##
## Number of Fisher Scoring iterations: 8
glm.probs = predict(glm.fit2, newdata = Default[-train, ], type="response")
glm.pred=rep("No", length(train))
glm.pred[glm.probs>0.5] = "Yes"
table(glm.pred)
## glm.pred
## No Yes
## 4931 69
mean(glm.pred != Default[-train, ]$default)
## [1] 0.0264
set.seed(4)
train <- sample(nrow(Default), nrow(Default)/2)
str(train)
## int [1:5000] 5624 587 2867 1795 4167 684 4794 307 2360 6558 ...
glm.fit2 <- glm(default~income+balance, data=Default, family="binomial", subset = train)
summary(glm.fit2)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9240 -0.1468 -0.0558 -0.0191 3.5650
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.135e+01 6.096e-01 -18.613 <2e-16 ***
## income 9.504e-06 7.000e-06 1.358 0.175
## balance 5.772e-03 3.281e-04 17.591 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1477.13 on 4999 degrees of freedom
## Residual deviance: 792.65 on 4997 degrees of freedom
## AIC: 798.65
##
## Number of Fisher Scoring iterations: 8
glm.probs = predict(glm.fit2, newdata = Default[-train, ], type="response")
glm.pred=rep("No", length(train))
glm.pred[glm.probs>0.5] = "Yes"
table(glm.pred)
## glm.pred
## No Yes
## 4922 78
mean(glm.pred != Default[-train, ]$default)
## [1] 0.0256
Across the four splits run through the model, the test error estimates were 0.0254, 0.0238, 0.0264, and 0.0256. While these numbers do not vary much, some variability remains because each split uses a different random division into training and test data.
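Since the four fits above differ only in the random seed, the same comparison can be condensed into a short helper. This is a sketch: the function name val.err is ours, not part of the exercise, and it assumes the Default data from ISLR is loaded as above.
# Validation-set error for one random half/half split of Default
val.err <- function(seed) {
  set.seed(seed)
  train <- sample(nrow(Default), nrow(Default)/2)
  fit <- glm(default ~ income + balance, data = Default,
             family = "binomial", subset = train)
  probs <- predict(fit, newdata = Default[-train, ], type = "response")
  pred <- ifelse(probs > 0.5, "Yes", "No")
  mean(pred != Default[-train, ]$default)
}
sapply(1:4, val.err)   # should reproduce the four error estimates above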
Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.
set.seed(1)
train <- sample(nrow(Default), nrow(Default)/2)
str(train)
## int [1:5000] 1017 8004 4775 9725 8462 4050 8789 1301 8522 1799 ...
glm.fit3 <- glm(default~income+balance+student, data=Default, family="binomial", subset = train)
summary(glm.fit3)
##
## Call:
## glm(formula = default ~ income + balance + student, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5823 -0.1419 -0.0554 -0.0210 3.3961
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.134e+01 6.937e-01 -16.346 <2e-16 ***
## income 1.686e-05 1.122e-05 1.502 0.1331
## balance 5.767e-03 3.213e-04 17.947 <2e-16 ***
## studentYes -5.992e-01 3.324e-01 -1.803 0.0715 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.77 on 4999 degrees of freedom
## Residual deviance: 800.07 on 4996 degrees of freedom
## AIC: 808.07
##
## Number of Fisher Scoring iterations: 8
glm.probs3 = predict(glm.fit3, newdata = Default[-train, ], type="response")
glm.pred3=rep("No", length(train))
glm.pred3[glm.probs3>0.5] = "Yes"
table(glm.pred3)
## glm.pred3
## No Yes
## 4937 63
mean(glm.pred3 != Default[-train, ]$default)
## [1] 0.026
Adding student to the model does not appear to make a meaningful difference, better or worse: the test error rate (0.026) is roughly the same as for the previous versions of the model.
We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.
library(boot)
Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.
set.seed(1)
train <- sample(nrow(Default), nrow(Default)/2)
str(train)
## int [1:5000] 1017 8004 4775 9725 8462 4050 8789 1301 8522 1799 ...
glm.fit.Q6 <- glm(default~income+balance, data=Default, family="binomial")
summary(glm.fit.Q6)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income 2.080898e-05 4.985167e-06 4.174178 2.990638e-05
## balance 5.647103e-03 2.273731e-04 24.836280 3.638120e-136
Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.
boot.fn <- function(input, index){
return(coef(glm(default~income+balance, data=input, family="binomial", subset = index)))
}
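As a quick sanity check (a sketch, not required by the exercise), calling boot.fn() with the full set of row indices should reproduce the full-data coefficient estimates reported above.
boot.fn(Default, 1:nrow(Default))   # should match coef(glm.fit.Q6)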
Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.
boot(Default, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.912114e-02 4.347403e-01
## t2* 2.080898e-05 1.585717e-07 4.858722e-06
## t3* 5.647103e-03 1.856917e-05 2.300758e-04
detach(Default)
Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.
The standard errors for income and balance from glm() are 4.99e-6 and 2.27e-4, while the bootstrap gives 4.86e-6 and 2.30e-4. The two sets of estimates are very similar, which suggests that the standard-formula and bootstrap approaches are in close agreement for this model.
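For a side-by-side view, the replicates stored in the boot object can be used to compute the bootstrap standard errors directly. This is a sketch; boot.out is a name we introduce for a stored copy of the boot() result above.
boot.out <- boot(Default, boot.fn, 1000)             # same call as above, but stored
boot.se <- apply(boot.out$t, 2, sd)                  # bootstrap SEs: intercept, income, balance
glm.se <- summary(glm.fit.Q6)$coef[, "Std. Error"]   # standard-formula SEs from glm()
cbind(glm.se, boot.se)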
We will now consider the Boston housing data set, from the ISLR2 library.
library(ISLR2)
##
## Attaching package: 'ISLR2'
## The following objects are masked from 'package:ISLR':
##
## Auto, Credit
attach(Boston)
set.seed(1)
Based on this data set, provide an estimate for the population mean of medv. Call this estimate \(\hat{\mu}\) (mu.hat).
mu.hat <- mean(medv)
mu.hat
## [1] 22.53281
se <- sd(medv)/sqrt(length(medv))
se
## [1] 0.4088611
boot.fn.se <- function(data, index){
mu <- mean(data[index])
return(mu)
}
se.boot <- boot(medv, boot.fn.se, 1000)
se.boot
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn.se, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
The difference between the standard errors from question 9b (0.4089) and 9c (0.4107) is minimal; the formula-based and bootstrap approaches estimate the standard error of \(\hat{\mu}\) similarly.
Provide a 95% confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). Hint: You can approximate a 95% confidence interval using the formula [\(\hat{\mu}\) − 2SE(\(\hat{\mu}\)), \(\hat{\mu}\) + 2SE(\(\hat{\mu}\))].
medv.CI.low <- mu.hat - 2*se
medv.CI.high <- mu.hat + 2*se
medv.CI.low
## [1] 21.71508
medv.CI.high
## [1] 23.35053
t.test(medv)
##
## One Sample t-test
##
## data: medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
The 95% confidence interval from t.test() is very similar to the interval calculated by hand above.
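As an alternative to the ±2SE approximation, boot.ci() can compute confidence intervals directly from the bootstrap object se.boot created in 9c. This is a sketch; the normal and percentile intervals should land close to the t.test() interval above.
boot.ci(se.boot, conf = 0.95, type = c("norm", "perc"))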
Based on this data set, provide an estimate for the median value of medv in the population.
mu.med <- median(medv)
mu.med
## [1] 21.2
boot.fn.med <- function(data, index){
mu.med1 <- median(data[index])
return(mu.med1)
}
mu.med.boot <- boot(medv, boot.fn.med, 1000)
mu.med.boot
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn.med, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.0386 0.3770241
Both the bootstrap and the direct calculation give a median of 21.2 for medv (reassuring, since they should agree). The bootstrap estimates the standard error of the median to be 0.377, which is small relative to the median itself.
Based on this data set, provide an estimate for the tenth percentile of medv in Boston census tracts. Call this quantity \(\hat{\mu}_{0.1}\). (You can use the quantile() function.)
mu.tenth <- quantile(medv, probs = 0.1)
mu.tenth
## 10%
## 12.75
boot.fn.tenth <- function(data, index){
mu.tenth1 <- quantile(data[index], probs=0.1)
return(mu.tenth1)
}
mu.tenth.boot <- boot(medv, boot.fn.tenth, 1000)
mu.tenth.boot
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn.tenth, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0186 0.4925766
Both the direct estimate in (g) and the bootstrap give a tenth-percentile value of 12.75, and the bootstrap estimates its standard error to be 0.493, which is still fairly small relative to the estimate itself.
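The same reasoning gives a rough interval for the tenth percentile from its bootstrap standard error (a sketch, not asked for in the exercise).
se.tenth <- sd(mu.tenth.boot$t[, 1])                   # bootstrap SE of the tenth percentile
c(mu.tenth - 2 * se.tenth, mu.tenth + 2 * se.tenth)    # approximate 95% interval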
detach(Boston)