Review k-fold cross-validation.
Explain how k-fold cross-validation is implemented.
First, the data set is split into a number of groups (k) with the same number of observations. One group will be the validation set while the model is fit using the remaining observations (k-1).This is repeated k times - each group becomes the validation set once. Then, the mean square errors of each run through the data (each turn with a new validation set) is averaged to calculate the estimated test error rate.
What are the advantages and disadvantages of k-fold cross-validation relative to (a) the validation set approach, and (b) LOOCV?
Test Error of Logistic Regression from Chapter 4 (Using Validation Set Approach)
Logistic Regression Model using income and balance to predict default.
attach(Default)
set.seed(8)
logr.default.1 <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(logr.default.1)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Use Validation Set Approach to estimate test error rate:
train.default <- sample(dim(Default)[1], dim(Default)[1]/2)
lor.def.train <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train.default)
default.prob <- predict(lor.def.train, Default[-train.default,], type = "response")
def.fault.pred <- rep("No", length(default.prob))
def.fault.pred[default.prob > 0.5] = "Yes"
mean(def.fault.pred != Default[-train.default, ]$default)
## [1] 0.0272
Interpretation
The test error rate is 2.72%.
Repeat process from Part B 3 times. Use different splits of the observations into training and validation set.
train.default <- sample(dim(Default)[1], dim(Default)[1]/2)
lor.def.train <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train.default)
default.prob <- predict(lor.def.train, Default[-train.default,], type = "response")
def.fault.pred <- rep("No", length(default.prob))
def.fault.pred[default.prob > 0.5] = "Yes"
mean(def.fault.pred != Default[-train.default, ]$default)
## [1] 0.0258
train.default <- sample(dim(Default)[1], dim(Default)[1]/2)
lor.def.train <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train.default)
default.prob <- predict(lor.def.train, Default[-train.default,], type = "response")
def.fault.pred <- rep("No", length(default.prob))
def.fault.pred[default.prob > 0.5] = "Yes"
mean(def.fault.pred != Default[-train.default, ]$default)
## [1] 0.0288
train.default <- sample(dim(Default)[1], dim(Default)[1]/2)
lor.def.train <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train.default)
default.prob <- predict(lor.def.train, Default[-train.default,], type = "response")
def.fault.pred <- rep("No", length(default.prob))
def.fault.pred[default.prob > 0.5] = "Yes"
mean(def.fault.pred != Default[-train.default, ]$default)
## [1] 0.0268
Interpretation
The validation set approach is highly variable because every validation set we make randomizes the number of observations in the training and validation sets. This makes each of the previous 4 rounds of calculations are slightly different: * 2.72% * 2.58% * 2.88% * 2.68%
Logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error. Does the dummy variable lead to a lower error rate?
Default$student01 <- ifelse(Default$student == 'Yes', 1, 0)
train.default <- sample(dim(Default)[1], dim(Default)[1]/2)
lor.def.train <- glm(default ~ income + balance + student01, data = Default, family = "binomial", subset = train.default)
default.prob <- predict(lor.def.train, Default[-train.default,], type = "response")
def.fault.pred <- rep("No", length(default.prob))
def.fault.pred[default.prob > 0.5] = "Yes"
mean(def.fault.pred != Default[-train.default, ]$default)
## [1] 0.027
Interpretation
The test error for the new model is 2.7%. This means that including the new variable for student (student01) does not reduce the test error for the model.
Compute estimates for the standard errors of the income and balance coefficients in two different ways (Bootstrap and GLM).
Use summary() and glm() to estimate standard error.
library(boot)
set.seed(12)
default.fit.bs <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(default.fit.bs)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Interpretation
Estimated Standard Errors: * income: 4.985e-06 * balance: 2.274e-04
Bootstrap function (boot.fn()).
set.seed(12)
boot.fn <- function(data, index) {
default.fit.bs <- glm(default ~ income + balance, data = data, family = "binomial", subset = index)
return(coef(default.fit.bs))}
boot() and boot.fn() to estimate standard errors.
boot(data = Default, statistic = boot.fn, R = 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -8.387469e-03 4.484914e-01
## t2* 2.080898e-05 -1.550111e-07 5.010512e-06
## t3* 5.647103e-03 6.338097e-06 2.313711e-04
Interpretation
Estimated Standard Errors: * income: 5.010512e-06 * balance: 2.313711e-04
Comment on results.
The bootstrap standard errors are slightly less than those calculated using glm().
detach(Default)
Boston Data Set
library(MASS)
## Warning: package 'MASS' was built under R version 4.1.2
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
attach(Boston)
Estimate the population mean of medv.
mu.hat <- mean(medv)
mu.hat
## [1] 22.53281
Interpretation
Estimated mean for medv is 22.53.
Standard error of mu.hat.
mu.hat.stde <- sd(medv)/sqrt(length(medv))
mu.hat.stde
## [1] 0.4088611
Interpretation
Estimated standard error for mu.hat is 0.41. This is relatively small. Especially compared to the actual value of the mean.
Estimate standard error with bootstrap. Compare to results from Part B.
boot.mu.fn <- function(data, index){
mu <- mean(data[index])
return(mu)
}
boot(medv, boot.mu.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.mu.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.005427866 0.3967152
Interpretation
The bootstrap standard error is 0.3967, which is slightly less than in Part B.
95% confidence interval for the mean of medv. Compare to t.test(Boston$medv).
t.test(medv)
##
## One Sample t-test
##
## data: medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
#Confidence Interval
ci.mu.hat <- c(22.53281 - 2 * 0.3967, 22.53281 + 2 * 0.3967)
ci.mu.hat
## [1] 21.73941 23.32621
Interpretation
The confidence intervals calculated from the t.test and the bootstrap are very similar.
Estimate median of mu.hat.
mu.hat.med <- median(medv)
mu.hat.med
## [1] 21.2
Estimate standard error of mu hat median using bootstrap.
boot.med.fn <- function(data, index) {
med.mu <- median(data[index])
return(med.mu)
}
boot(medv, boot.med.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.med.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.02815 0.3773028
Interpretation
The median estimates are the same. Bootstrap calculated standard error to be 0.377.
Estimate tenth percentile of medv and call is mu.01 using quantile() function.
mu.01 <- quantile(medv, c(0.1))
mu.01
## 10%
## 12.75
Interpretation
12.75 is the tenth percentile of medv.
Use bootstrap to estimate standard error of mu.01 and comment on findings.
boot.01.fn <- function(data, index){
mu01 <- quantile(data[index], c(0.1))
return(mu01)
}
boot(medv, boot.01.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.01.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0326 0.4849664
Interpretation
The bootstrap percentile estimate is also 12.75 and the standard error estimate is 0.485.