Assignment #4

Problem #3

We now review k-fold cross-validation.

Explain how k-fold cross-validation is implemented.

For k-fold cross-validation, we randomly divide the data set into multiple, equal sized, groups (K different groups/folds). The next step is to remove the first part (this is our validation set) and fit the model on the remaining K-1 part. Then, we see how good the predictions are on the left-out part (calculate the MSE on the first part). This process will be repeated K different times and take out a different part each time as validation set. Finally, we average the K different MSE’s to get an estimated validation error rate for new observations.
What are the advantages and disadvantages of k-fold crossvalidation relative to:
- The validation set approach?
From a conceptual viewpoint, the validation set approach is simple and is easy to implement. However, the validation MSE can vary highly. Further, in the validation approach, only a subset of the observations are used to fit the model (the training data), and statistical methods tend to perform worse when trained on fewer observations.
- LOOCV?
Leave-One-Out Cross Validation (LOOCV) has less bias because we fit the statistical learning method multiple times using our training data (n-1). This means that almost all of our data is used. Besides that, LOOCV also produces a less variable MSE. HOWEVER, LOOCV is computationally intensive because we fit each model as many times as we have observations (n). The more n, the more times we need to fit our model!

Problem #5

In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis

Fit a logistic regression model that uses income and balance to predict default.

set.seed(1)
lm.fit <- glm(default ~ income + balance, family=binomial, data = Default)
summary(lm.fit)

## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:

i. Split the sample set into a training set and a validation set.

train <- sample(dim(Default)[1], dim(Default)[1] / 2)

Fit a multiple logistic regression model using only the training observations.

lm.fit <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
summary(lm.fit)

## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default, subset = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5830  -0.1428  -0.0573  -0.0213   3.3395  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.194e+01  6.178e-01 -19.333  < 2e-16 ***
## income       3.262e-05  7.024e-06   4.644 3.41e-06 ***
## balance      5.689e-03  3.158e-04  18.014  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1523.8  on 4999  degrees of freedom
## Residual deviance:  803.3  on 4997  degrees of freedom
## AIC: 809.3
## 
## Number of Fisher Scoring iterations: 8

Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.

prob1 <- predict(lm.fit, newdata = Default[-train, ], type = "response")
pred.lm1 <- rep("No", length(prob1))
pred.lm1[prob1 > 0.5] <- "Yes"

Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.

mean(pred.lm1 != Default[-train, ]$default)

## [1] 0.0254

Solution: The test error rate is 2.54% with the validation set approach.

Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

train <- sample(dim(Default)[1], dim(Default)[1] / 2)
lm.fit <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
summary(lm.fit)

## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default, subset = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5420  -0.1329  -0.0512  -0.0176   3.7909  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.193e+01  6.379e-01 -18.709  < 2e-16 ***
## income       1.939e-05  6.953e-06   2.789  0.00528 ** 
## balance      5.918e-03  3.355e-04  17.641  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1490.52  on 4999  degrees of freedom
## Residual deviance:  774.41  on 4997  degrees of freedom
## AIC: 780.41
## 
## Number of Fisher Scoring iterations: 8

prob1 <- predict(lm.fit, newdata = Default[-train, ], type = "response")
pred.lm1 <- rep("No", length(prob1))
pred.lm1[prob1 > 0.5] <- "Yes"
mean(pred.lm1 != Default[-train, ]$default)

## [1] 0.0274

train <- sample(dim(Default)[1], dim(Default)[1] / 2)
lm.fit <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
summary(lm.fit)

## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default, subset = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1634  -0.1446  -0.0553  -0.0203   3.3281  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.158e+01  6.008e-01 -19.281  < 2e-16 ***
## income       1.975e-05  6.775e-06   2.916  0.00355 ** 
## balance      5.723e-03  3.180e-04  17.996  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1543.58  on 4999  degrees of freedom
## Residual deviance:  816.44  on 4997  degrees of freedom
## AIC: 822.44
## 
## Number of Fisher Scoring iterations: 8

prob1 <- predict(lm.fit, newdata = Default[-train, ], type = "response")
pred.lm1 <- rep("No", length(prob1))
pred.lm1[prob1 > 0.5] <- "Yes"
mean(pred.lm1 != Default[-train, ]$default)

## [1] 0.0244

train <- sample(dim(Default)[1], dim(Default)[1] / 2)
lm.fit <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
summary(lm.fit)

## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default, subset = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4027  -0.1517  -0.0624  -0.0233   3.6833  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.112e+01  5.816e-01 -19.120   <2e-16 ***
## income       1.638e-05  6.755e-06   2.425   0.0153 *  
## balance      5.489e-03  3.067e-04  17.897   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.85  on 4999  degrees of freedom
## Residual deviance:  831.51  on 4997  degrees of freedom
## AIC: 837.51
## 
## Number of Fisher Scoring iterations: 8

prob1 <- predict(lm.fit, newdata = Default[-train, ], type = "response")
pred.lm1 <- rep("No", length(prob1))
pred.lm1[prob1 > 0.5] <- "Yes"
mean(pred.lm1 != Default[-train, ]$default)

## [1] 0.0244

Solution: We get thee different estimates of the test error rate, 0.0274, 0.0244, 0.027. This tells us that, depending on the observations included in the training set and validation set, the test error rate can vary.

Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

train <- sample(dim(Default)[1], dim(Default)[1] / 2)
lm.fit <- glm(default ~ income + balance + student, data = Default, family = "binomial", subset = train)
summary(lm.fit)

## 
## Call:
## glm(formula = default ~ income + balance + student, family = "binomial", 
##     data = Default, subset = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3850  -0.1413  -0.0568  -0.0212   3.7409  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.042e+01  6.921e-01 -15.058   <2e-16 ***
## income      -5.472e-06  1.205e-05  -0.454   0.6498    
## balance      5.638e-03  3.276e-04  17.212   <2e-16 ***
## studentYes  -8.286e-01  3.468e-01  -2.389   0.0169 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1409.45  on 4999  degrees of freedom
## Residual deviance:  767.12  on 4996  degrees of freedom
## AIC: 775.12
## 
## Number of Fisher Scoring iterations: 8

prob1 <- predict(lm.fit, newdata = Default[-train, ], type = "response")
pred.lm1 <- rep("No", length(prob1))
pred.lm1[prob1 > 0.5] <- "Yes"
mean(pred.lm1 != Default[-train, ]$default)

## [1] 0.0278

Solution: Including the “student” variable as a dummy variable led to a test error rate of 0.0264.This result is not different from our previous tests. Therefore, it seems that adding the variable does not lead to a reduction in the validation set estimate of the test error rate.

Problem #6

6. We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.

Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

set.seed(1)
attach(Default)
lm.fit<- glm(default ~ income + balance, data = Default, family = "binomial")
summary(lm.fit)

## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

Solution: The standard errors are: Intercept; 0.4348, income; 0.000004985, and balance; 0.0002274.

Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

boot.fn <- function(data, index) {
    fit1 <- glm(default ~ income + balance, data = data, family = "binomial", subset = index)
    return (coef(fit1))
}

Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

library(boot)
boot(Default, boot.fn, 1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2*  2.080898e-05  1.680317e-07 4.866284e-06
## t3*  5.647103e-03  1.855765e-05 2.298949e-04

Solution: The bootstrap standard errors for the coefficients are: t1; 0.4344722, t2; 0.000004866284, and t34; 0.0002298949.

Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

The estimated standard error obtained using the glm() function and using the bootstrap function are almost identical.

Problem #9

We will now consider the Boston housing data set, from the ISLR2 library.

Based on this data set, provide an estimate for the population mean of medv. Call this estimate ˆµ.

attach(Boston)
pop.mean <- mean(medv)
pop.mean

## [1] 22.53281

Solution: The estimated population mean for the “medv” variable is 22.53.

Provide an estimate of the standard error of ˆµ. Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.

pop.se <- sd(medv) / sqrt(dim(Boston)[1])
pop.se

## [1] 0.4088611

Solution: The estimate of the standard error of ˆµ is 0.409.

Now estimate the standard error of ˆµ using the bootstrap. How does this compare to your answer from (b)?

set.seed(1)
boot.fn <- function(data, index) {
    mu <- mean(data[index])
    return (mu)
}
boot(medv, boot.fn, 1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 22.53281 0.007650791   0.4106622

Solution: The estimated standard error of ˆµ using the bootstrap is 0.411. This result is very close to our result in (b), which was 0.409.

Based on your bootstrap estimate from (c), provide a 95 % confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). Hint: You can approximate a 95 % confidence interval using the formula [ˆµ − 2SE(ˆµ), µˆ + 2SE(ˆµ)].

t.test(medv)

## 
##  One Sample t-test
## 
## data:  medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

The 95% confidence interval for the One sample t-test is 21.73 - 23.34. The mean is 22.53.

mu.conf.int <- c(22.53 - 2 * 0.4119, 22.53 + 2 * 0.4119)
mu.conf.int

## [1] 21.7062 23.3538

The 95% confidence interval for the mean of “medv” is 21.71 - 23.35. This result is a slightly larger, but very close range to our t-test result of 21.73 - 23.34.

Based on this data set, provide an estimate, ˆµmed, for the median value of medv in the population.

medv.median <- median(medv)
medv.median

## [1] 21.2

Solution: The median for medv is 21.2.

We now would like to estimate the standard error of ˆµmed. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.

boot.fn <- function(data, index) {
    mu <- median(data[index])
    return (mu)
}
boot(medv, boot.fn, 1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2 -0.0386   0.3770241

Solution: The estimated standard error of ˆµmed using the boostrap function is 21.2, which is identical to our solution in (e).

Based on this data set, provide an estimate for the tenth percentile of medv in Boston census tracts. Call this quantity ˆµ0.1. (You can use the quantile() function.)

medv.percent <- quantile(medv, c(0.1))
medv.percent

##   10% 
## 12.75

Solution: The estimate for the tenth percentile of medv in Boston census tracts is 12.75.

Use the bootstrap to estimate the standard error of ˆµ0.1. Comment on your findings.

boot.fn <- function(data, index) {
    mu <- quantile(data[index], c(0.1))
    return (mu)
}
boot(medv, boot.fn, 1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75  0.0186   0.4925766

Solution: The estimate for the tenth percentile of medv in Boston census tracts using the bootstrap function is 12.75, which is identical to our solution in (g).

Assignment #4

Lisa Smith

2022-10-09

Problem #3

Problem #5

Problem #6

Problem #9