Problem 3

Review k-fold cross-validation.

Part A

Explain how k-fold cross-validation is implemented.

First, the data set is split into a number of groups (k) with the same number of observations. One group will be the validation set while the model is fit using the remaining observations (k-1).This is repeated k times - each group becomes the validation set once. Then, the mean square errors of each run through the data (each turn with a new validation set) is averaged to calculate the estimated test error rate.

Part B

What are the advantages and disadvantages of k-fold cross-validation relative to (a) the validation set approach, and (b) LOOCV?

  • Validation Set Approach:This approach is simple and easy to implement. However, it tends to overstate the test error rate and is highly dependent on the observations that end up in the training and validation sets (meaning more variation).
  • LOOCV is similar to k-fold, but k is equal to the number of observations (rather than the number of groups). It’s more computationally intensive and complex than standard k-fold since the training process is repeated for every observation. This amount of training causes long run times (or potentially not being able to run the data because of a lack of computational power). However, LOOCV is less biased than k-fold. But it also has a larger variance.

Problem 5

Test Error of Logistic Regression from Chapter 4 (Using Validation Set Approach)

Part A

Logistic Regression Model using income and balance to predict default.

attach(Default)
set.seed(8)
logr.default.1 <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(logr.default.1)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

Part B

Use Validation Set Approach to estimate test error rate:

  • Split sample into training and validation sets.
train.default <- sample(dim(Default)[1], dim(Default)[1]/2)
  • Fit multiple logistic regression model using only training data.
lor.def.train <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train.default)
  • Obtain a prediction of default status for each individual in the validation set.
default.prob <- predict(lor.def.train, Default[-train.default,], type = "response")
def.fault.pred <- rep("No", length(default.prob))
def.fault.pred[default.prob > 0.5] = "Yes"
  • Compute Validation set error.
mean(def.fault.pred != Default[-train.default, ]$default)
## [1] 0.0272

Interpretation

The test error rate is 2.72%.

Part C

Repeat process from Part B 3 times. Use different splits of the observations into training and validation set.

  • 2nd Time
train.default <- sample(dim(Default)[1], dim(Default)[1]/2)
lor.def.train <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train.default)
default.prob <- predict(lor.def.train, Default[-train.default,], type = "response")
def.fault.pred <- rep("No", length(default.prob))
def.fault.pred[default.prob > 0.5] = "Yes"
mean(def.fault.pred != Default[-train.default, ]$default)
## [1] 0.0258
  • Test error rate: 2.58%
  • 3rd Time
train.default <- sample(dim(Default)[1], dim(Default)[1]/2)
lor.def.train <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train.default)
default.prob <- predict(lor.def.train, Default[-train.default,], type = "response")
def.fault.pred <- rep("No", length(default.prob))
def.fault.pred[default.prob > 0.5] = "Yes"
mean(def.fault.pred != Default[-train.default, ]$default)
## [1] 0.0288
  • Test error rate: 2.88%
  • 4th Time
train.default <- sample(dim(Default)[1], dim(Default)[1]/2)
lor.def.train <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train.default)
default.prob <- predict(lor.def.train, Default[-train.default,], type = "response")
def.fault.pred <- rep("No", length(default.prob))
def.fault.pred[default.prob > 0.5] = "Yes"
mean(def.fault.pred != Default[-train.default, ]$default)
## [1] 0.0268
  • Test error rate: 2.68%

Interpretation

The validation set approach is highly variable because every validation set we make randomizes the number of observations in the training and validation sets. This makes each of the previous 4 rounds of calculations are slightly different: * 2.72% * 2.58% * 2.88% * 2.68%

Part D

Logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error. Does the dummy variable lead to a lower error rate?

Default$student01 <- ifelse(Default$student == 'Yes', 1, 0)
train.default <- sample(dim(Default)[1], dim(Default)[1]/2)
lor.def.train <- glm(default ~ income + balance + student01, data = Default, family = "binomial", subset = train.default)
default.prob <- predict(lor.def.train, Default[-train.default,], type = "response")
def.fault.pred <- rep("No", length(default.prob))
def.fault.pred[default.prob > 0.5] = "Yes"
mean(def.fault.pred != Default[-train.default, ]$default)
## [1] 0.027

Interpretation

The test error for the new model is 2.7%. This means that including the new variable for student (student01) does not reduce the test error for the model.

Problem 6

Compute estimates for the standard errors of the income and balance coefficients in two different ways (Bootstrap and GLM).

Part A

Use summary() and glm() to estimate standard error.

library(boot)
set.seed(12)
default.fit.bs <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(default.fit.bs)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

Interpretation

Estimated Standard Errors: * income: 4.985e-06 * balance: 2.274e-04

Part B

Bootstrap function (boot.fn()).

set.seed(12)
boot.fn <- function(data, index) {
  default.fit.bs <- glm(default ~ income + balance, data = data, family = "binomial", subset = index)
  return(coef(default.fit.bs))}

Part C

boot() and boot.fn() to estimate standard errors.

boot(data = Default, statistic = boot.fn, R = 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -8.387469e-03 4.484914e-01
## t2*  2.080898e-05 -1.550111e-07 5.010512e-06
## t3*  5.647103e-03  6.338097e-06 2.313711e-04

Interpretation

Estimated Standard Errors: * income: 5.010512e-06 * balance: 2.313711e-04

Part D

Comment on results.

The bootstrap standard errors are slightly less than those calculated using glm().

detach(Default)

Problem 9

Boston Data Set

library(MASS)
## Warning: package 'MASS' was built under R version 4.1.2
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
attach(Boston)

Part A

Estimate the population mean of medv.

mu.hat <- mean(medv)
mu.hat
## [1] 22.53281

Interpretation

Estimated mean for medv is 22.53.

Part B

Standard error of mu.hat.

mu.hat.stde <- sd(medv)/sqrt(length(medv))
mu.hat.stde
## [1] 0.4088611

Interpretation

Estimated standard error for mu.hat is 0.41. This is relatively small. Especially compared to the actual value of the mean.

Part C

Estimate standard error with bootstrap. Compare to results from Part B.

boot.mu.fn <- function(data,  index){
  mu <- mean(data[index])
  return(mu)
}
boot(medv, boot.mu.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot.mu.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 22.53281 0.005427866   0.3967152

Interpretation

The bootstrap standard error is 0.3967, which is slightly less than in Part B.

Part D

95% confidence interval for the mean of medv. Compare to t.test(Boston$medv).

t.test(medv)
## 
##  One Sample t-test
## 
## data:  medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281
#Confidence Interval
ci.mu.hat <- c(22.53281 - 2 * 0.3967, 22.53281 + 2 * 0.3967)
ci.mu.hat
## [1] 21.73941 23.32621

Interpretation

The confidence intervals calculated from the t.test and the bootstrap are very similar.

Part E

Estimate median of mu.hat.

mu.hat.med <- median(medv)
mu.hat.med
## [1] 21.2

Part F

Estimate standard error of mu hat median using bootstrap.

boot.med.fn <- function(data, index) {
  med.mu <- median(data[index])
  return(med.mu)
}
boot(medv, boot.med.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot.med.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original   bias    std. error
## t1*     21.2 -0.02815   0.3773028

Interpretation

The median estimates are the same. Bootstrap calculated standard error to be 0.377.

Part G

Estimate tenth percentile of medv and call is mu.01 using quantile() function.

mu.01 <- quantile(medv, c(0.1))
mu.01
##   10% 
## 12.75

Interpretation

12.75 is the tenth percentile of medv.

Part H

Use bootstrap to estimate standard error of mu.01 and comment on findings.

boot.01.fn <- function(data, index){
  mu01 <- quantile(data[index], c(0.1))
  return(mu01)
}
boot(medv, boot.01.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot.01.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75  0.0326   0.4849664

Interpretation

The bootstrap percentile estimate is also 12.75 and the standard error estimate is 0.485.