3. We now review k-fold cross-validation.

  (a) Explain how k-fold cross-validation is implemented.

1. Randomly divide the dataset into k roughly equal-sized folds.
2. For each fold i, fit the model using the other k−1 folds as the training set and evaluate it on the i-th fold, which serves as the validation set.
3. Average the k resulting error estimates to obtain the overall estimate of the test error rate.
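As a concrete illustration of these three steps, here is a minimal manual k-fold cross-validation sketch in R for the logistic regression model used in question 5 below. The choice of k = 10, the seed, and the fold assignment are assumptions made only for this example, and it assumes library(ISLR2) and data("Default") have already been run as in question 5.

# Manual k-fold cross-validation sketch (k = 10 assumed; Default data from question 5)
k<-10
set.seed(1)
folds<-sample(rep(1:k,length.out=nrow(Default)))   # step 1: assign folds at random
cv.errors<-numeric(k)
for (i in 1:k) {
  # step 2: fit on the other k-1 folds, validate on fold i
  fit<-glm(default~income+balance,data=Default[folds!=i,],family=binomial)
  prob<-predict(fit,newdata=Default[folds==i,],type="response")
  pred<-ifelse(prob>0.5,"Yes","No")
  cv.errors[i]<-mean(pred!=Default$default[folds==i])
}
mean(cv.errors)   # step 3: average the k error estimates

In practice the cv.glm() function from the boot package automates this loop.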

  (b) What are the advantages and disadvantages of k-fold cross-validation relative to:
  i. The validation set approach?

Advantages: The test error estimate is more stable because every observation is used for both training and validation rather than depending on a single random split, so the estimate varies much less across different splits. It also tends to be less biased, since each model is fit on a larger fraction of the data than the half used by the validation set approach.

Disadvantages: It requires more computation, since the model must be fit k times rather than just once on a single training set.

  ii. LOOCV?

Advantages: It is computationally much cheaper, requiring only k model fits rather than n, which matters for larger datasets. Its test error estimate also tends to have lower variance, because the n training sets used by LOOCV are nearly identical and therefore produce highly correlated error estimates.

Disadvantages: Its test error estimate is somewhat more biased than the LOOCV estimate, because each training set contains only about (k−1)/k of the observations rather than n−1 of them; if k is chosen too small, this extra bias could distort model comparisons and mislead decision making.
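To make the computational comparison concrete, the cv.glm() function from the boot package can run both k-fold CV and LOOCV. The following is only a sketch, assuming the Default logistic regression model from question 5 and a simple misclassification cost function; both are choices made for this illustration, not part of the question.

library(boot)
cost<-function(y,p) mean(abs(y-p)>0.5)         # misclassification cost for a binary response
full.fit<-glm(default~income+balance,data=Default,family=binomial)
set.seed(1)
cv.glm(Default,full.fit,cost,K=10)$delta[1]    # 10-fold CV: the model is refit 10 times
# cv.glm(Default,full.fit,cost)$delta[1]       # LOOCV: the model is refit nrow(Default) = 10,000 times (slow)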

5. In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.

library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.3
data("Default")
set.seed(1)
  (a) Fit a logistic regression model that uses income and balance to predict default.
default_lrm.fit<-glm(default~income+balance,data=Default,family=binomial)
summary(default_lrm.fit)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8
  (b) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:
  i. Split the sample set into a training set and a validation set.
default.train<-sample(1:nrow(Default),nrow(Default)/2)
default.training<-Default[default.train,]
default.valid<-Default[-default.train,]
  ii. Fit a multiple logistic regression model using only the training observations.
lrmfit.train<- glm(default~income+balance,data=default.training,family=binomial)
  iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.
default.probs<-predict(lrmfit.train,newdata=default.valid,type="response")
default.pred<-ifelse(default.probs>0.5,"Yes","No")
  iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
default.actual<-default.valid$default
mean(default.pred != default.actual)
## [1] 0.0254
  (c) Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set.
errorcount<-numeric(3)
seed<-c(2,3,4)
for (i in 1:3) {
  set.seed(seed[i])
  repeat.train<-sample(1:nrow(Default),nrow(Default)/2)
  repeat.fit<-glm(default~income+balance,data=Default[repeat.train,],family=binomial)
  repeat.prob<-predict(repeat.fit,newdata=Default[-repeat.train,],type="response")
  repeat.pred<-ifelse(repeat.prob>0.5,"Yes","No")
  errorcount[i]<-mean(repeat.pred!=Default[-repeat.train,]$default)
}
errorcount
## [1] 0.0238 0.0264 0.0256
mean(errorcount)
## [1] 0.02526667

Comment on the results obtained.

The three repeated splits gave validation errors of 2.38%, 2.64%, and 2.56%, averaging about 2.53%, which is very close to the initial estimate of 2.54%. This suggests the validation set approach is fairly stable here, although the variability across splits shows the estimate is somewhat sensitive to exactly how the observations are divided.
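One way to gauge that sensitivity more directly is to repeat the split many times and look at the spread of the resulting error rates; a small sketch, where the 20 repetitions and the seeds are arbitrary choices made only for illustration:

# Repeat the validation set split 20 times and measure the spread of the error rate
val.errors<-sapply(1:20,function(s){
  set.seed(s)
  tr<-sample(1:nrow(Default),nrow(Default)/2)
  fit<-glm(default~income+balance,data=Default[tr,],family=binomial)
  prob<-predict(fit,newdata=Default[-tr,],type="response")
  mean(ifelse(prob>0.5,"Yes","No")!=Default$default[-tr])
})
sd(val.errors)   # standard deviation of the validation error across the 20 splits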

  (d) Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.
set.seed(1)
dummy.train<-sample(1:nrow(Default),nrow(Default)/2)
dummy.fit<-glm(default~income+balance+student,data=Default[dummy.train,],family=binomial)
dummy.prob<-predict(dummy.fit,newdata=Default[-dummy.train,], type="response")
dummy.pred<-ifelse(dummy.prob>0.5,"Yes","No")
mean(dummy.pred!=Default[-dummy.train,]$default)
## [1] 0.026

Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

No, adding the student dummy variable does not lead to a reduction in the test error rate. The validation error rises slightly from 2.54% to 2.60%, an increase of about 0.06 percentage points, so including student offers no apparent improvement.

6. We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.

set.seed(1)
  (a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.
default6.fit<-glm(default~income+balance,data=Default,family=binomial)
summary(default6.fit)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8
  (b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.
default.boot<-function(data,index){
  coef(glm(default~income+balance,data=data[index,],family=binomial))}
  (c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.
library(boot)
set.seed(1)
bootstrap<-boot(data=Default,statistic=default.boot,R=1000)
bootstrap
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = default.boot, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2*  2.080898e-05  1.680317e-07 4.866284e-06
## t3*  5.647103e-03  1.855765e-05 2.298949e-04
  (d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

The estimated standard errors from glm() (4.985e-06 for income and 2.274e-04 for balance) and from the bootstrap (4.866e-06 and 2.299e-04) are extremely close. This agreement suggests the logistic regression model is stable and that the model assumptions behind the formula-based standard errors in glm() are reasonable for these data.
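For a direct side-by-side comparison, the two sets of standard errors can also be extracted programmatically from the objects created above; a minimal sketch, where the column labels glm.se and boot.se are just names chosen for this example:

# Formula-based standard errors next to the bootstrap standard errors
cbind(glm.se=summary(default6.fit)$coefficients[,"Std. Error"],
      boot.se=apply(bootstrap$t,2,sd))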

9. We will now consider the Boston housing data set, from the ISLR2 library.

data("Boston")
  (a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate μ̂.
mu_hat<-mean(Boston$medv)
mu_hat
## [1] 22.53281
  (b) Provide an estimate of the standard error of μ̂.
n<-nrow(Boston)
stddev<-sd(Boston$medv)
mu_hat.se<-stddev/sqrt(n)
mu_hat.se
## [1] 0.4088611

Interpret this result.

Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.

The standard error of about 0.409 means that, since medv is measured in thousands of dollars, the sample mean of medv would be expected to vary by roughly $409 from sample to sample; relative to the estimated mean of $22,533, this indicates a fairly precise estimate.

  (c) Now estimate the standard error of μ̂ using the bootstrap.
mu_hat.boot<-function(data,index){
  mean(data[index])}
set.seed(1)
mu_hat.bootmean<-boot(Boston$medv,statistic=mu_hat.boot,R=1000)
mu_hat.bootmean
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = mu_hat.boot, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 22.53281 0.007650791   0.4106622

How does this compare to your answer from (b)?

Using the bootstrap to estimate the standard error of μ̂ gives 0.4107 (about $411), very close to the value from the theoretical formula in (b), 0.4089 (about $409). This minimal difference supports the reliability of the formula-based standard error for this dataset.

  (d) Based on your bootstrap estimate from (c), provide a 95 % confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv).
mu_hat.bootse<-sd(mu_hat.bootmean$t)
boot_ci<-c(mu_hat-2*mu_hat.bootse,mu_hat+2*mu_hat.bootse)
boot_ci
## [1] 21.71148 23.35413
t.test(Boston$medv)
## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

Compare it to the results obtained using t.test(Boston$medv).

Hint: You can approximate a 95 % confidence interval using the formula [μ̂ − 2SE(μ̂), μ̂ + 2SE(μ̂)].

The t.test() returns a 95% confidence interval of (21.73, 23.34) for the population mean of medv. This is very similar to the bootstrap-based interval of (21.71, 23.35), suggesting the normality assumption underlying the t-test is reasonable here and that both methods provide consistent estimates.
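As a further check, the boot package's boot.ci() function can compute confidence intervals directly from the boot() output; a brief sketch using the mu_hat.bootmean object from (c), showing two common interval types (the exact endpoints will differ slightly from the manual two-standard-error interval):

# Normal-approximation and percentile bootstrap intervals for the mean of medv
boot.ci(mu_hat.bootmean,conf=0.95,type=c("norm","perc"))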

  (e) Based on this data set, provide an estimate, μ̂med, for the median value of medv in the population.
mu_hat.med<-median(Boston$medv)
mu_hat.med
## [1] 21.2
  (f) We now would like to estimate the standard error of μ̂med. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap.
mu_hat.bootmed<-function(data,index){
  median(data[index])}
set.seed(1)
mu_hat.medboot<-boot(Boston$medv,statistic=mu_hat.bootmed,R=1000)
mu_hat.medboot
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = mu_hat.bootmed, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2 0.02295   0.3778075

Comment on your findings.

The sample median of medv is 21.2 ($21,200), with a bootstrap standard error of about 0.378 (~$378). The small bootstrap bias (~$23) indicates that the sample median lies close to the center of the bootstrap distribution. Overall, the bootstrap provides a reliable way to quantify the variability of the median, for which no simple standard error formula exists.

  (g) Based on this data set, provide an estimate for the tenth percentile of medv in Boston census tracts. Call this quantity μ̂0.1. (You can use the quantile() function.)
mu_hat0.1<-quantile(Boston$medv,0.1)
mu_hat0.1
##   10% 
## 12.75
  (h) Use the bootstrap to estimate the standard error of μ̂0.1.
mu_hat0.1.fnboot<-function(data,index){
  quantile(data[index],0.1)}
set.seed(1)
mu_hat0.1.boot<-boot(Boston$medv,statistic=mu_hat0.1.fnboot,R=1000)
mu_hat0.1.boot
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = mu_hat0.1.fnboot, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75  0.0339   0.4767526

Comment on your findings.

The tenth percentile of medv is estimated to be 12.75 ($12,750), with a bootstrap standard error of about 0.477 (~$477), indicating a moderate level of uncertainty in the estimate. The small bias (~$34) suggests the sample estimate is a good representation of the corresponding population quantile.