K-fold cross-validation works by splitting a sample of \(n\) observations into \(k\) non-overlapping groups of size \(n/k\). Each group in turn serves as the validation/test set, with the model trained on the remaining \(n - n/k\) observations. Finally, the test error is estimated by averaging the \(k\) resulting \(MSE\) values: \(CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} MSE_i\).
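As a concrete illustration, below is a minimal hand-rolled k-fold CV sketch; the simulated data (simDf), the linear model, and the fold assignment are all invented for the example.
set.seed(1)
n <- 100; k <- 5
simDf <- data.frame(x = rnorm(n))
simDf$y <- simDf$x + rnorm(n)                 # simulated data for the sketch
folds <- sample(rep(1:k, length.out = n))     # assign each observation to one of k folds
cv.mse <- sapply(1:k, function(j) {
  fit <- lm(y ~ x, data = simDf[folds != j, ])                       # train on the other k-1 folds
  mean((simDf$y[folds == j] - predict(fit, simDf[folds == j, ]))^2)  # MSE on held-out fold j
})
mean(cv.mse)                                  # average of the k MSE estimates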
The validation set approach exhibits much higher variability than the k-fold method: the model is fit on only a single subset of the data, so the estimate depends heavily on which observations fall in the training set, and it tends to overestimate the test error. On the other hand, the validation set approach is not as computationally expensive as k-fold, and with a large enough data set the problem of the training subset not being representative diminishes.
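For contrast, the validation set approach on the same simulated data amounts to a single split (continuing the sketch above):
tr <- sample(n, n / 2)                                    # one random 50/50 split
fit.vs <- lm(y ~ x, data = simDf[tr, ])                   # fit on the training half only
mean((simDf$y[-tr] - predict(fit.vs, simDf[-tr, ]))^2)    # single-split estimate of test MSE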
Leave-one-out cross-validation (LOOCV) suffers from two significant shortcomings compared to k-fold cross-validation. First, it is more computationally expensive, as it fits the model once per observation, or \(n\) times, rather than \(k\) times. Second, because each of the \(n\) models is fit on nearly the same data, their outputs are highly correlated, so the averaged estimate can have rather high variance, unlike k-fold, which is fit on \(k\) less-overlapping samples of size \(n - n/k\). Conversely, a benefit is that LOOCV has very little bias in its estimates, whereas k-fold introduces somewhat more bias because each training set is smaller. A good balance between variance and bias for k-fold is \(5 \le k \le 10\).
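Both LOOCV and k-fold CV are also available through boot::cv.glm; a sketch, reusing the simulated simDf from above:
library(boot)
sim.fit <- glm(y ~ x, data = simDf)      # gaussian glm, equivalent to lm here
cv.glm(simDf, sim.fit)$delta[1]          # LOOCV: the model is refit n times
cv.glm(simDf, sim.fit, K = 10)$delta[1]  # 10-fold CV: only 10 refits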
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.5.3
defaultDf<-Default
logit.fit <- glm(default ~ income + balance,family = 'binomial',data=defaultDf)
summary(logit.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = defaultDf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
set.seed(1)
train=sample(nrow(defaultDf),nrow(defaultDf)/2)
logit.fitTrain<-glm(default ~ income + balance, family = 'binomial', data = defaultDf, subset = train)
summary(logit.fitTrain)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = defaultDf, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3583 -0.1268 -0.0475 -0.0165 3.8116
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.208e+01 6.658e-01 -18.148 <2e-16 ***
## income 1.858e-05 7.573e-06 2.454 0.0141 *
## balance 6.053e-03 3.467e-04 17.457 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1457.0 on 4999 degrees of freedom
## Residual deviance: 734.4 on 4997 degrees of freedom
## AIC: 740.4
##
## Number of Fisher Scoring iterations: 8
predLogit <- predict(logit.fitTrain, defaultDf[-train, ], type = "response")  # predict() has no subset argument, so index the validation rows directly
preds.logit <- rep('No', length(predLogit))
preds.logit[predLogit > 0.5] <- 'Yes'
mean(preds.logit != defaultDf[-train, ]$default)
## [1] 0.0484
The model fit on the training sample has a test error rate of \(4.84\%\).
### (c)
set.seed(2)
train2=sample(nrow(defaultDf),nrow(defaultDf)/2)
logit.fitTrain2<-glm(default ~ income + balance, family = 'binomial', data = defaultDf, subset = train2)
summary(logit.fitTrain2)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = defaultDf, subset = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2043 -0.1385 -0.0552 -0.0203 3.7058
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.184e+01 6.403e-01 -18.492 < 2e-16 ***
## income 2.717e-05 7.183e-06 3.783 0.000155 ***
## balance 5.703e-03 3.266e-04 17.460 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1443.44 on 4999 degrees of freedom
## Residual deviance: 776.64 on 4997 degrees of freedom
## AIC: 782.64
##
## Number of Fisher Scoring iterations: 8
predLogit2 <- predict(logit.fitTrain2, defaultDf[-train2, ], type = "response")
preds.logit2 <- rep('No', length(predLogit2))
preds.logit2[predLogit2 > 0.5] <- 'Yes'
mean(preds.logit2 != defaultDf[-train2, ]$default)
## [1] 0.0475
set.seed(3)
train3=sample(nrow(defaultDf),nrow(defaultDf)/2)
logit.fitTrain3<-glm(default ~ income + balance, family = 'binomial', data = defaultDf, subset = train3)
summary(logit.fitTrain3)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = defaultDf, subset = train3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1014 -0.1433 -0.0569 -0.0206 3.7241
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.160e+01 6.055e-01 -19.162 < 2e-16 ***
## income 2.254e-05 6.972e-06 3.233 0.00123 **
## balance 5.660e-03 3.131e-04 18.079 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1530.39 on 4999 degrees of freedom
## Residual deviance: 812.77 on 4997 degrees of freedom
## AIC: 818.77
##
## Number of Fisher Scoring iterations: 8
predLogit3 <- predict(logit.fitTrain3, defaultDf[-train3, ], type = "response")
preds.logit3 <- rep('No', length(predLogit3))
preds.logit3[predLogit3 > 0.5] <- 'Yes'
mean(preds.logit3 != defaultDf[-train3, ]$default)
## [1] 0.0453
set.seed(4)
train4=sample(nrow(defaultDf),nrow(defaultDf)/2)
logit.fitTrain4<-glm(default ~ income + balance, family = 'binomial', data = defaultDf, subset = train4)
summary(logit.fitTrain4)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = defaultDf, subset = train4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4799 -0.1411 -0.0559 -0.0208 3.7223
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.156e+01 6.146e-01 -18.803 < 2e-16 ***
## income 2.004e-05 6.997e-06 2.864 0.00418 **
## balance 5.678e-03 3.244e-04 17.502 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1450.21 on 4999 degrees of freedom
## Residual deviance: 780.49 on 4997 degrees of freedom
## AIC: 786.49
##
## Number of Fisher Scoring iterations: 8
predLogit4 <- predict(logit.fitTrain4, defaultDf[-train4, ], type = "response")
preds.logit4 <- rep('No', length(predLogit4))
preds.logit4[predLogit4 > 0.5] <- 'Yes'
mean(preds.logit4 != defaultDf[-train4, ]$default)
## [1] 0.0472
Splitting the data into training and validation sets three more times yields results similar to the first split, with the new splits having test error rates of \(4.75\%\), \(4.53\%\), and \(4.72\%\), respectively.
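Since the three splits above all follow the same template, the procedure can be condensed into a single loop over the seeds; a sketch (names such as tr and p are introduced here for illustration):
sapply(2:4, function(s) {
  set.seed(s)
  tr  <- sample(nrow(defaultDf), nrow(defaultDf) / 2)
  fit <- glm(default ~ income + balance, family = "binomial",
             data = defaultDf, subset = tr)
  p   <- predict(fit, defaultDf[-tr, ], type = "response")      # validation-set probabilities
  mean(ifelse(p > 0.5, "Yes", "No") != defaultDf$default[-tr])  # test error for this split
})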
set.seed(5)
train5=sample(nrow(defaultDf),nrow(defaultDf)/2)
logit.fitTrain5<-glm(default ~ income + balance + student, family = 'binomial', data = defaultDf, subset = train5)
summary(logit.fitTrain5)
##
## Call:
## glm(formula = default ~ income + balance + student, family = "binomial",
## data = defaultDf, subset = train5)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4304 -0.1472 -0.0598 -0.0220 3.7046
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.033e+01 6.789e-01 -15.221 < 2e-16 ***
## income -4.264e-06 1.176e-05 -0.363 0.71684
## balance 5.624e-03 3.186e-04 17.652 < 2e-16 ***
## studentYes -9.811e-01 3.356e-01 -2.923 0.00346 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1470.4 on 4999 degrees of freedom
## Residual deviance: 805.1 on 4996 degrees of freedom
## AIC: 813.1
##
## Number of Fisher Scoring iterations: 8
predLogit5 <- predict(logit.fitTrain5, defaultDf[-train5, ], type = "response")
preds.logit5 <- rep('No', length(predLogit5))
preds.logit5[predLogit5 > 0.5] <- 'Yes'
mean(preds.logit5 != defaultDf[-train5, ]$default)
## [1] 0.0454
Adding the \(student\) variable does not seem to have a significant effect on the test error rate: at \(4.54\%\), it is similar to that of the other models.
logitFull.fit <- glm(default ~ income + balance,family = 'binomial',data=defaultDf)
summary(logitFull.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = defaultDf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
The standard errors for \(income\) and \(balance\) are \(4.985 \times 10^{-6}\) and \(2.274 \times 10^{-4}\), respectively.
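These can also be extracted programmatically from the summary's coefficient table rather than read off by eye:
summary(logitFull.fit)$coefficients[c("income", "balance"), "Std. Error"]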
library(boot)
boot.fn=function(data,index)
return(coef(glm(default ~ income + balance,data=data,subset=index,family = 'binomial')))
set.seed(1)
boot(defaultDf,boot.fn,100)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = defaultDf, statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 9.699111e-02 4.101121e-01
## t2* 2.080898e-05 6.715005e-08 4.127740e-06
## t3* 5.647103e-03 -5.733883e-05 2.105660e-04
The bootstrap provides an accurate estimate of the model coefficient standard errors: the bootstrapped SE for \(income\) is \(4.128 \times 10^{-6}\) vs. the model's \(4.985 \times 10^{-6}\), and the bootstrapped SE for \(balance\) is \(2.106 \times 10^{-4}\), close to the model's \(2.274 \times 10^{-4}\).
library(MASS)
## Warning: package 'MASS' was built under R version 3.5.3
bostonDf <- Boston
mu_hat<-mean(bostonDf$medv)
mu_hat
## [1] 22.53281
\(\hat{\mu}=22.53281\)
se_hat<-sd(bostonDf$medv)/sqrt(nrow(bostonDf))
se_hat
## [1] 0.4088611
The SE of \(\hat{\mu}\) is \(0.4088611\).
set.seed(1)
boot.mu <- function(data, index) {
mu <- mean(data[index])
return (mu)
}
boot(bostonDf$medv, boot.mu, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = bostonDf$medv, statistic = boot.mu, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.008517589 0.4119374
Using the bootstrap, the SE for \(\hat{\mu}\) is \(0.4119374\), which is quite close to the formula-based estimate of \(0.4088611\) from above.
t.test(bostonDf$medv)
##
## One Sample t-test
##
## data: bostonDf$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
mu_hat.CI <- c(22.53281 - 2 * 0.4119374, 22.53281 + 2 * 0.4119374)
mu_hat.CI
## [1] 21.70894 23.35668
The CI created via bootstrapping is very close to the one from the t-test: the bootstrap method gives bounds of \((21.70894, 23.35668)\) vs. the t-test's \((21.72953, 23.33608)\).
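The interval can also be computed without hand-typing the bootstrap SE by saving the boot() result and passing it to boot.ci(); a sketch, where boot.out is a name introduced here:
set.seed(1)
boot.out <- boot(bostonDf$medv, boot.mu, 1000)  # same bootstrap as above, kept as an object
boot.ci(boot.out, type = "norm")                # normal-approximation 95% CI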
med_hat <- median(bostonDf$medv)
med_hat
## [1] 21.2
\(\hat{\mu}_{med} = 21.2\)
set.seed(1)
boot.mu_med <- function(data, index) {
mu_med <- median(data[index])
return (mu_med)
}
boot(bostonDf$medv, boot.mu_med, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = bostonDf$medv, statistic = boot.mu_med, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.01615 0.3801002
The bootstrap yields the same median estimate of \(21.2\), with a standard error of \(0.3801002\).
med_hat_10p <- quantile(bostonDf$medv, c(0.1))
med_hat_10p
## 10%
## 12.75
The 10th percentile of the \(medv\) variable is \(12.75\).
set.seed(1)
boot.mu_10p <- function(data, index) {
mu_10p <- quantile(data[index], c(0.1))
return (mu_10p)
}
boot(bostonDf$medv, boot.mu_10p, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = bostonDf$medv, statistic = boot.mu_10p, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.01005 0.505056
As with the median, the 10th percentile calculated with bootstrapping matches the sample value of \(12.75\), with a standard error of \(0.505056\).