Load Libraries and attach datasets

library(ISLR2)
library(tidyverse)
library(boot)
attach(Default)
attach(Boston)





Problem 3.

We now review k-fold cross-validation.


(a) Explain how k-fold cross-validation is implemented.

K-fold cross-validation (CV) is implemented much the same way as leave-one-out CV (LOOCV), except that rather than leaving out only one observation at a time, a chunk of observations (called a fold) is left out. To create the folds, the data set is randomly divided into k roughly equal chunks, most often 5 or 10. In each iteration, one fold is left out and used to validate the model trained on the remaining k − 1 folds. After the first iteration is complete, the fold that was previously left out is returned to the training data and the next fold is left out. This way the model is trained and validated on “different” data subsets once per fold, and the k validation error estimates are averaged to give the CV estimate. A minimal sketch follows.
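

A minimal sketch of the procedure in R, assuming k = 5 and the Default data loaded above (the seed, fold assignment, and model here are illustrative, not part of the assignment):


k <- 5
set.seed(1)
# randomly assign each observation to one of k folds
folds <- sample(rep(1:k, length.out = nrow(Default)))
errors <- rep(NA, k)
for (i in 1:k) {
  # train on the k - 1 folds that are kept in
  fit <- glm(default ~ income + balance, family = 'binomial',
             data = Default[folds != i, ])
  # validate on the held-out fold
  prob <- predict(fit, Default[folds == i, ], type = 'response')
  pred <- ifelse(prob > .5, 'Yes', 'No')
  errors[i] <- mean(pred != Default$default[folds == i])
}
mean(errors)  # the k-fold CV estimate of the test error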


(b) What are the advantages and disadvantages of k-fold crossvalidation relative to:

i. The validation set approach?

K-fold CV has less bias than the validation set approach because its validation sets are smaller, leaving larger training sets, so each fitted model more closely resembles one trained on the full data. Its error estimate is also less variable, and ultimately every observation in the data set is used for both training and validation rather than the result depending on a single random split. The biggest disadvantage is that CV (either k-fold CV or LOOCV) uses more computational power than a single validation split. In addition, the process and results are somewhat harder to understand and explain. The sketch below illustrates the variability of single-split estimates.
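

A small sketch of the variability point: repeating the validation set split five times (the seed and 50/50 split are chosen here purely for illustration) gives five noticeably different error estimates, whereas k-fold CV averages over all folds:


set.seed(1)
# five different random 50/50 splits, five different error estimates
sapply(1:5, function(i) {
  idx <- sample(nrow(Default), nrow(Default) * .5)
  fit <- glm(default ~ income + balance, family = 'binomial',
             data = Default[idx, ])
  prob <- predict(fit, Default[-idx, ], type = 'response')
  mean(ifelse(prob > .5, 'Yes', 'No') != Default$default[-idx])
})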


ii. LOOCV?

K-fold CV has less variance than LOOCV, and since it is run on chunks of observations rather than on each individual observation, it requires fewer iterations (k model fits instead of n) and therefore less computational power. The downside is that k-fold CV can have more bias than LOOCV. However, using k = 5 or 10 provides a good balance in the bias-variance trade-off. The cv.glm() sketch below shows how cheap 10-fold CV is here.
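

A sketch using cv.glm() from the boot package loaded above, which automates the folds; with K = 10 it refits the model only 10 times, versus 10,000 refits for LOOCV on Default. The misclassification cost function follows the pattern shown in the cv.glm() help page; fit_all is a name introduced here:


# cost converts predicted probabilities into a misclassification rate
cost <- function(y, prob) mean(abs(y - prob) > 0.5)
fit_all <- glm(default ~ income + balance, family = 'binomial', data = Default)
cv.glm(Default, fit_all, cost, K = 10)$delta[1]  # 10-fold CV error estimate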






Problem 5.

In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.


set.seed(22)


(a) Fit a logistic regression model that uses income and balance to predict default.


fit_glm <- glm(default ~ income + balance, data = Default, family = 'binomial')
summary(fit_glm)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8


(b) Using the validation set approach, estimate the test error of this model.

The validation set error rate for this 75/25 (train/validate) split is 2.6%.


i. Split the sample set into a training set and a validation set.


train_num <- sample(nrow(Default), nrow(Default) * .75)  # row indices for the 75% training portion
train <- Default[train_num, ]      # training set
validate <- Default[-train_num, ]  # validation set


ii. Fit a multiple logistic regression model using only the training observations.


fit_glm <- glm(default ~ income + balance, family = 'binomial', data = train)
summary(fit_glm)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4592  -0.1484  -0.0585  -0.0217   3.7113  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.140e+01  4.921e-01 -23.167   <2e-16 ***
## income       1.846e-05  5.657e-06   3.264   0.0011 ** 
## balance      5.624e-03  2.584e-04  21.761   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2245.8  on 7499  degrees of freedom
## Residual deviance: 1208.7  on 7497  degrees of freedom
## AIC: 1214.7
## 
## Number of Fisher Scoring iterations: 8


iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.


prob_glm <- predict(fit_glm, validate, type = 'response')
pred_glm <- rep('No', length(prob_glm))
pred_glm[prob_glm >.5] <- 'Yes'
table(pred_glm, validate$default)
##         
## pred_glm   No  Yes
##      No  2414   54
##      Yes   11   21


iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.


1 - mean(pred_glm==validate$default)
## [1] 0.026


(c) Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

The results of the three iterations each return a slightly different error rate for the validation sets.

Split 1 used a 50/50 (train/validate) split and returned an error rate of 2.96%.
Split 2 used a 65/35 (train/validate) split and returned an error rate of 2.66%.
Split 3 used a 90/10 (train/validate) split and returned an error rate of 3.7%.

Interestingly, the third iteration used the largest training set but performed the poorest in this scenario. This is likely due to the small validation set skewing the results: a different selection of validation data would more than likely produce a different error rate. It is therefore very likely that using a different seed would result in slightly different error rates as well.


Split 1

train_num <- sample(nrow(Default), nrow(Default) * .50)
train <- Default[train_num, ]
validate <- Default[-train_num, ]

fit_glm <- glm(default ~ income + balance, family = 'binomial', data = train)

prob_glm <- predict(fit_glm, validate, type = 'response')
pred_glm <- rep('No', length(prob_glm))
pred_glm[prob_glm >.5] <- 'Yes'
table(pred_glm, validate$default)
##         
## pred_glm   No  Yes
##      No  4803  120
##      Yes   28   49
1 - mean(pred_glm==validate$default)
## [1] 0.0296


Split 2

train_num <- sample(nrow(Default), nrow(Default) * .65)
train <- Default[train_num, ]
validate <- Default[-train_num, ]

fit_glm <- glm(default ~ income + balance, family = 'binomial', data = train)

prob_glm <- predict(fit_glm, validate, type = 'response')
pred_glm <- rep('No', length(prob_glm))
pred_glm[prob_glm >.5] <- 'Yes'
table(pred_glm, validate$default)
##         
## pred_glm   No  Yes
##      No  3369   82
##      Yes   11   38
1 - mean(pred_glm==validate$default)
## [1] 0.02657143


Split 3

train_num <- sample(nrow(Default), nrow(Default) * .9)
train <- Default[train_num, ]
validate <- Default[-train_num, ]

fit_glm <- glm(default ~ income + balance, family = 'binomial', data = train)

prob_glm <- predict(fit_glm, validate, type = 'response')
pred_glm <- rep('No', length(prob_glm))
pred_glm[prob_glm >.5] <- 'Yes'
table(pred_glm, validate$default)
##         
## pred_glm  No Yes
##      No  953  33
##      Yes   4  10
1 - mean(pred_glm==validate$default)
## [1] 0.037


(d) Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

I created stu_dum, a dummy variable for student where 1 = Yes and 0 = No. I then recreated the 75/25 split and found that there was not a significant difference in the accuracy of the model. Including the dummy variable did not lead to a reduction in the test error rate: the previous 75/25 split had an error rate of 2.6%, and the model with the dummy variable (below) also had an error rate of 2.6%.


Default$stu_dum <- ifelse(Default$student == 'No', 0, 1)

train_num <- sample(nrow(Default), nrow(Default) * .75)
train <- Default[train_num, ]
validate <- Default[-train_num, ]

fit_glm <- glm(default ~ income + balance + stu_dum, family = 'binomial', data = train)

prob_glm <- predict(fit_glm, validate, type = 'response')
pred_glm <- rep('No', length(prob_glm))
pred_glm[prob_glm >.5] <- 'Yes'
table(pred_glm, validate$default)
##         
## pred_glm   No  Yes
##      No  2410   54
##      Yes   11   25
1 - mean(pred_glm==validate$default)
## [1] 0.026







Problem 6.

We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.


set.seed(22)


(a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

The standard errors for the coefficients associated with income and balance are \(4.985 \times 10^{-6}\) and \(2.274 \times 10^{-4}\), respectively.


summary(glm(default ~ income + balance, family = 'binomial', data = Default))
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8


(b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.


boot.fn <- function(data, index) {
    # refit the logistic regression on the observations selected by index
    # and return the coefficient estimates
    coef(glm(default ~ income + balance, family = 'binomial',
             data = data, subset = index))
}


(c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

Using the boot() function, the standard errors for the coefficients associated with income and balance are \(4.990 \times 10^{-6}\) and \(2.269 \times 10^{-4}\), respectively.

boot(Default,boot.fn,1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -3.552739e-02 4.326477e-01
## t2*  2.080898e-05  1.662128e-07 4.990171e-06
## t3*  5.647103e-03  1.479837e-05 2.268672e-04


(d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

The estimated standard errors from glm() and from the bootstrap are very similar, with the bootstrap standard error slightly higher for income and slightly lower for balance; however, the differences are so small that they are more than likely negligible in this situation. The sketch below lays the two side by side.
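

A sketch laying the two sets of standard errors side by side (se_glm, boot_out, and se_boot are names introduced here; without a fixed seed the bootstrap row will vary slightly from run to run):


se_glm <- summary(glm(default ~ income + balance, family = 'binomial',
                      data = Default))$coefficients[, 'Std. Error']
boot_out <- boot(Default, boot.fn, 1000)
se_boot <- apply(boot_out$t, 2, sd)  # sd of the bootstrap replicates
rbind(glm = se_glm, bootstrap = se_boot)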







Problem 9.

We will now consider the Boston housing data set, from the ISLR2 library.


(a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate \(\hat{\mu}\).


mu_hat <- mean(Boston$medv)

mu_hat
## [1] 22.53281


(b) Provide an estimate of the standard error of \(\hat{\mu}\). Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.


SEM <- sd(Boston$medv) / sqrt(nrow(Boston))

SEM
## [1] 0.4088611


(c) Now estimate the standard error of \(\hat{\mu}\) using the bootstrap. How does this compare to your answer from (b)?

The bootstrap standard error was slightly higher than the formula-based estimate from (b) when I ran the bootstrap with 1,000 resamples (0.4258 vs. 0.4089), but slightly lower with 10,000 resamples (0.4067).


set.seed(22)

boot.fn <- function(data, index) {
  mean(data[index])  # sample mean of the bootstrap resample
}

boot(Boston$medv, boot.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 22.53281 -0.01964348   0.4257536
boot(Boston$medv, boot.fn, 10000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 10000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 22.53281 -0.00476083   0.4067331


(d) Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained using t.test(medv). Hint: You can approximate a 95% confidence interval using the formula [\(\hat{\mu}\) − 2SE(\(\hat{\mu}\)), \(\hat{\mu}\) + 2SE(\(\hat{\mu}\))].

The results from using the t.test() function and the bootstrap are very similar (identical down to the tenths place).


t.test(Boston$medv)
## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281
set.seed(22)
boot_SEM <- boot(Boston$medv, boot.fn, 10000)
se_boot <- sd(boot_SEM$t)  # the bootstrap std. error printed above

CI_mu_hat <- c(mu_hat - 2 * se_boot, mu_hat + 2 * se_boot)

CI_mu_hat
## [1] 21.71934 23.34627
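
As a cross-check (a sketch, output not shown), boot.ci() can pull a percentile interval directly from the boot object:


boot.ci(boot_SEM, type = 'perc')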

(e) Based on this data set, provide an estimate, \(\hat{\mu}_{med}\), for the median value of medv in the population.


mu_hat_medv <- median(Boston$medv)

mu_hat_medv
## [1] 21.2


(f) We now would like to estimate the standard error of \(\hat{\mu}_{med}\). Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.

The bootstrap median was the same as the sample median, 21.2; however, its standard error (0.38) was smaller than the standard error of the mean (0.41), which would result in a narrower confidence interval overall, as sketched after the output below.


set.seed(22)
boot.fn <- function(data, index) {
    median(data[index])  # sample median of the bootstrap resample
}

boot(Boston$medv, boot.fn, 10000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 10000)
## 
## 
## Bootstrap Statistics :
##     original   bias    std. error
## t1*     21.2 -0.01389   0.3814231
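

To make the narrower-interval point concrete, here is a rough 95% interval for the median using the same 2-SE rule from (d); boot_med and se_med are names introduced for this sketch, while boot.fn and mu_hat_medv are defined above:


set.seed(22)
boot_med <- boot(Boston$medv, boot.fn, 10000)
se_med <- sd(boot_med$t)  # bootstrap std. error of the median
c(mu_hat_medv - 2 * se_med, mu_hat_medv + 2 * se_med)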


(g) Based on this data set, provide an estimate for the tenth percentile of medv in Boston census tracts. Call this quantity \(\hat{\mu}_{0.1}\). (You can use the quantile() function.)

Using the quantile() function, the estimate \(\hat{\mu}_{0.1}\) for medv in Boston’s census tracts is 12.75.


mu_hat_0.1 <- quantile(Boston$medv, probs = .1)

mu_hat_0.1
##   10% 
## 12.75


(h) Use the bootstrap to estimate the standard error of \(\hat{\mu}_{0.1}\). Comment on your findings.

The standard error is relatively small overall at roughly 0.51, and the bootstrap estimate of \(\hat{\mu}_{0.1}\) matches the sample estimate of 12.75.


set.seed(22)
boot.fn <- function(data, index) {
    quantile(data[index], 0.1)  # tenth percentile of the bootstrap resample
}

boot(Boston$medv, boot.fn, 10000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 10000)
## 
## 
## Bootstrap Statistics :
##     original    bias    std. error
## t1*    12.75 -0.005305    0.509029