Chapter 05 (page 219): 3, 5, 6, 9

libraries:

library(ISLR)
library(ISLR2)
## 
## Attaching package: 'ISLR2'
## The following objects are masked from 'package:ISLR':
## 
##     Auto, Credit
library(boot)

Chapter 05 Problem 3: We now review k-fold cross-validation.

(a): Explain how k-fold cross-validation is implemented.

First, divide the data set into k roughly equal parts (these are the folds). Hold out the first fold, fit the model on the remaining k-1 folds, and compute the error of its predictions on the held-out fold. Repeat this k times, holding out a different fold each time, so that every fold serves once as the validation set. Finally, average the k resulting error estimates (e.g., the k MSEs) to obtain the k-fold cross-validation estimate of the test error. A code sketch of this procedure is given below.
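
As an illustration (not required by the problem), here is a minimal hand-coded sketch of 10-fold cross-validation. It assumes a simple linear model of mpg on horsepower from the Auto data set, which is available since ISLR2 is loaded above; the variable names are purely illustrative.

set.seed(1)
k <- 10
n <- nrow(Auto)
# randomly assign each observation to one of the k folds
folds <- sample(rep(1:k, length.out = n))
cv_errors <- rep(NA, k)
for (j in 1:k) {
  # fit on the k-1 folds that exclude fold j
  fit <- lm(mpg ~ horsepower, data = Auto[folds != j, ])
  # evaluate on the held-out fold j
  preds <- predict(fit, newdata = Auto[folds == j, ])
  cv_errors[j] <- mean((Auto$mpg[folds == j] - preds)^2)
}
mean(cv_errors)  # k-fold CV estimate of the test MSE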

(b): What are the advantages and disadvantages of k-fold cross validation relative to:

i. The validation set approach?

Advantages of k-fold: every observation is used for both fitting and validation, so no data is wasted on a single held-out set. The resulting estimate of model performance is more reliable, since the validation set estimate is highly variable across splits and tends to overestimate the test error because the model is trained on only part of the data.

Disadvantages of k-fold: requires more computation, since the model must be fit k times instead of once.

ii. LOOCV?

Advantages of k-fold: much less computation when the data set is large, since the model is fit only k times rather than n times. The estimate also tends to have lower variance, because the n LOOCV fits are trained on nearly identical data and are therefore highly correlated.

Disadvantages of k-fold: more bias than LOOCV, since each training set contains only (k-1)/k of the observations; the amount of bias depends on the choice of k. The cv.glm() sketch below illustrates the computational trade-off.
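
For reference, the boot package's cv.glm() function implements both approaches. A small sketch, again using the Auto data (not part of this problem), compares LOOCV with 10-fold CV; the error estimates are typically very similar, but the 10-fold version requires far fewer model fits.

set.seed(1)
glm_fit <- glm(mpg ~ horsepower, data = Auto)
cv.glm(Auto, glm_fit)$delta[1]          # LOOCV (default K = n)
cv.glm(Auto, glm_fit, K = 10)$delta[1]  # 10-fold CV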

Chapter 05 Problem 5: In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.

Setting seed:

set.seed(1)
head(Default)
##   default student   balance    income
## 1      No      No  729.5265 44361.625
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 5      No      No  785.6559 38463.496
## 6      No     Yes  919.5885  7491.559

(a): Fit a logistic regression model that uses income and balance to predict default.

lrm1 <- glm(default ~ income + balance, data = Default, family = binomial)
summary(lrm1)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

(b): Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:

  i. Split the sample set into a training set and a validation set.
sample <- sample(nrow(Default), nrow(Default)/2)

train <- Default[sample,]
val <- Default[-sample,]
  ii. Fit a multiple logistic regression model using only the training observations.
lrm1_train <- glm(default ~ income + balance, data = train, family = binomial)
summary(lrm1_train)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5830  -0.1428  -0.0573  -0.0213   3.3395  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.194e+01  6.178e-01 -19.333  < 2e-16 ***
## income       3.262e-05  7.024e-06   4.644 3.41e-06 ***
## balance      5.689e-03  3.158e-04  18.014  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1523.8  on 4999  degrees of freedom
## Residual deviance:  803.3  on 4997  degrees of freedom
## AIC: 809.3
## 
## Number of Fisher Scoring iterations: 8
  iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.
probs <- predict(lrm1_train, newdata = val, type="response")

pred <- ifelse(probs > 0.5, "Yes", "No")

pred<- factor(pred, levels = c("No", "Yes"))

head(pred)
##  1  2  3  5  8  9 
## No No No No No No 
## Levels: No Yes
  iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
val_error <- mean(pred != val$default)
val_error
## [1] 0.0254

(c): Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

70% training and 30% validation split

set.seed(1)
sample <- sample(nrow(Default), nrow(Default)*0.7)

train <- Default[sample,]
val <- Default[-sample,]
lrm1_train2 <- glm(default ~ income + balance, data = train, family = binomial)
summary(lrm1_train2)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4481  -0.1402  -0.0561  -0.0211   3.3484  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.167e+01  5.214e-01 -22.379  < 2e-16 ***
## income       2.560e-05  6.012e-06   4.258 2.06e-05 ***
## balance      5.574e-03  2.678e-04  20.816  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2030.3  on 6999  degrees of freedom
## Residual deviance: 1079.6  on 6997  degrees of freedom
## AIC: 1085.6
## 
## Number of Fisher Scoring iterations: 8
probs <- predict(lrm1_train2, newdata = val, type="response")

pred <- ifelse(probs > 0.5, "Yes", "No")

pred<- factor(pred, levels = c("No", "Yes"))

head(pred)
##  1 10 11 12 13 17 
## No No No No No No 
## Levels: No Yes
val_error <- mean(pred != val$default)
val_error
## [1] 0.02666667

80% training and 20% validation split

set.seed(1)
sample <- sample(nrow(Default), nrow(Default)*0.8)

train <- Default[sample,]
val <- Default[-sample,]
lrm1_train3 <- glm(default ~ income + balance, data = train, family = binomial)
summary(lrm1_train3)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4758  -0.1413  -0.0563  -0.0210   3.4620  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.168e+01  4.893e-01 -23.879  < 2e-16 ***
## income       2.547e-05  5.631e-06   4.523  6.1e-06 ***
## balance      5.613e-03  2.531e-04  22.176  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2313.6  on 7999  degrees of freedom
## Residual deviance: 1239.2  on 7997  degrees of freedom
## AIC: 1245.2
## 
## Number of Fisher Scoring iterations: 8
probs <- predict(lrm1_train3, newdata = val, type="response")

pred <- ifelse(probs > 0.5, "Yes", "No")

pred<- factor(pred, levels = c("No", "Yes"))

head(pred)
##  1 11 12 13 17 20 
## No No No No No No 
## Levels: No Yes
val_error <- mean(pred != val$default)
val_error
## [1] 0.026

60% training and 40% validation split

set.seed(1)
sample <- sample(nrow(Default), nrow(Default)*0.6)

train <- Default[sample,]
val <- Default[-sample,]
lrm1_train4 <- glm(default ~ income + balance, data = train, family = binomial)
summary(lrm1_train4)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4930  -0.1401  -0.0559  -0.0208   3.3468  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.175e+01  5.607e-01  -20.95  < 2e-16 ***
## income       2.614e-05  6.439e-06    4.06  4.9e-05 ***
## balance      5.643e-03  2.882e-04   19.58  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1793.96  on 5999  degrees of freedom
## Residual deviance:  937.47  on 5997  degrees of freedom
## AIC: 943.47
## 
## Number of Fisher Scoring iterations: 8
probs <- predict(lrm1_train4, newdata = val, type="response")

pred <- ifelse(probs > 0.5, "Yes", "No")

pred<- factor(pred, levels = c("No", "Yes"))

head(pred)
##  1 10 11 12 13 17 
## No No No No No No 
## Levels: No Yes
val_error <- mean(pred != val$default)
val_error
## [1] 0.025

Answer: The validation error rates for the different splits are all very similar (roughly 0.025 to 0.027), so the validation set estimate of the test error is fairly stable and does not depend strongly on how the observations are divided. The 60% training / 40% validation split happened to give the lowest error (0.025), but the differences between the splits are too small to be meaningful. A sketch that wraps these repeated steps into a single helper function is given below.
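
Since parts (b) and (c) repeat the same split/fit/predict/score steps, they could be wrapped in a small helper function. This is only a sketch (val_error_for_split is an illustrative name, not part of the assignment), and the results depend on the seed used.

val_error_for_split <- function(train_frac, seed = 1) {
  set.seed(seed)
  idx <- sample(nrow(Default), floor(nrow(Default) * train_frac))
  train <- Default[idx, ]
  val <- Default[-idx, ]
  fit <- glm(default ~ income + balance, data = train, family = binomial)
  probs <- predict(fit, newdata = val, type = "response")
  pred <- ifelse(probs > 0.5, "Yes", "No")
  mean(pred != val$default)  # validation set error rate
}
sapply(c(0.5, 0.6, 0.7, 0.8), val_error_for_split)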

(d): Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

Fitting the model with student added:

lrm2 <- glm(default ~ income + balance + student, data = train, family = binomial)

Estimating the test error:

probs2 <- predict(lrm2, newdata = val, type = "response")

pred2 <- ifelse(probs2>0.5, "Yes", "No")

val_error2 <- mean(pred2!=val$default)
val_error2
## [1] 0.026

Answer: Adding the student dummy variable does not reduce the test error; it actually went up very slightly. The model without the student variable had a validation error of 0.025 on the same 60%/40% split (and 0.0254 at the 50%/50% split), while the model with the student variable has a validation error of 0.026.

Chapter 05 Problem 6: We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.

set.seed(1)

(a): Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

log <- glm(default ~ income + balance, data = Default, family = binomial)
summary(log)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

(b): Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

boot.fn <- function(data, index) {
  re_data <- data[index, ]
  model <- glm(default ~ income + balance, data = re_data, family = binomial)
  return(coef(model)[c("income", "balance")])
}

(c): Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

bootstrap <- boot(data = Default, statistic = boot.fn, R = 1000)
print(bootstrap)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original       bias     std. error
## t1* 2.080898e-05 1.680317e-07 4.866284e-06
## t2* 5.647103e-03 1.855765e-05 2.298949e-04
boot_coef <- bootstrap$t0  # original coefficient estimates (the SEs are in the std. error column above)
boot_coef
##       income      balance 
## 2.080898e-05 5.647103e-03
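
Note that bootstrap$t0 contains the original coefficient estimates rather than the standard errors. If the bootstrap standard errors are wanted as a vector, they can be recovered from the replicates stored in bootstrap$t, for example:

# each column of bootstrap$t holds the 1,000 bootstrap replicates of one coefficient
apply(bootstrap$t, 2, sd)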

(d): Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

log_se <- summary(log)$coefficients[,"Std. Error"]
log_se
##  (Intercept)       income      balance 
## 4.347564e-01 4.985167e-06 2.273731e-04

Answer: The bootstrap standard errors for income (about 4.87e-06) and balance (about 2.30e-04) are very close to the standard errors reported by glm() (4.99e-06 and 2.27e-04). Since the bootstrap does not rely on the model's distributional assumptions and still gives essentially the same answer, the formula-based standard errors from glm() appear to be reliable for this data set.

Chapter 05 Problem 9: We will now consider the Boston housing data set, from the ISLR2 library.

head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7
set.seed(1)

(a): Based on this data set, provide an estimate for the population mean of medv. Call this estimate µ̂.

mu_hat <- mean(Boston$medv)
mu_hat
## [1] 22.53281

(b): Provide an estimate of the standard error of µ̂. Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.

sample_sd <- sd(Boston$medv)
num_obs <- length(Boston$medv)
SE_mu_hat <- sample_sd/sqrt(num_obs)
SE_mu_hat
## [1] 0.4088611

This standard error is small relative to the estimate, so the sample mean is a fairly precise estimate of the true population mean of medv. If we repeatedly drew samples of this size from the population and computed the mean each time, the standard deviation of those sample means would be approximately 0.4089.

(c): Now estimate the standard error of µ̂ using the bootstrap. How does this compare to your answer from (b)?

boot.fn2 <- function(data, index){
  mu_hat <- mean(data[index])
  return(mu_hat)
}
boot(Boston$medv, boot.fn2,1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn2, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 22.53281 0.007650791   0.4106622

The bootstrap estimate of the standard error of µ̂ is 0.4107, which is very similar to the formula-based estimate of 0.4089 from (b) (marginally larger, but the difference is negligible). This again indicates that the sample mean is a fairly precise estimate of the true population mean of medv: if we repeatedly sampled from the population and computed the mean each time, the standard deviation of those sample means would be approximately 0.41.

(d): Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). Hint: You can approximate a 95% confidence interval using the formula [µ̂ − 2·SE(µ̂), µ̂ + 2·SE(µ̂)].

CI_of_mu_hat <- c(22.5328 - 2 * 0.4107, 22.5328 + 2 * 0.4107)
CI_of_mu_hat
## [1] 21.7114 23.3542
t.test(Boston$medv)
## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

The bootstrap interval (21.71, 23.35) is very similar to the t.test() interval (21.73, 23.34).
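
Rather than typing the numbers in by hand, the interval can also be computed from the objects directly. A small sketch, assuming the bootstrap from (c) is re-run and saved this time (so the exact values will differ slightly from the hard-coded ones above because of the new bootstrap draw):

boot_mean <- boot(Boston$medv, boot.fn2, R = 1000)  # same call as in (c), but saved
boot_se_mean <- sd(boot_mean$t)                     # bootstrap SE of the sample mean
c(mu_hat - 2 * boot_se_mean, mu_hat + 2 * boot_se_mean)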

(e): Based on this data set, provide an estimate, µ̂_med, for the median value of medv in the population.

hat_mu_med <- median(Boston$medv)
hat_mu_med
## [1] 21.2

(f): We now would like to estimate the standard error of µ̂_med. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.

boot.fn.median <- function(data, index){
  return(median(data[index]))
}
boot_med<- boot(data = Boston$medv, statistic = boot.fn.median, R = 1000)
boot_med
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn.median, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2 -0.0386   0.3770241

The bootstrap estimate of the standard error of the median is 0.3770: if we repeatedly sampled from the population and computed the median each time, the standard deviation of those medians would be approximately 0.38. This is slightly smaller than the standard error of the mean (0.41), suggesting the median is a somewhat more stable estimate for this data set, likely because it is less sensitive to outlying medv values.

(g): Based on this data set, provide an estimate for the tenth percentile of medv in Boston census tracts. Call this quantity µ̂_0.1. (You can use the quantile() function.)

quantile(Boston$medv, 0.1)
##   10% 
## 12.75

(h): Use the bootstrap to estimate the standard error of µ̂_0.1. Comment on your findings.

boot.fn3 <- function(data, index) {
  return(quantile(data[index],0.1))
}
bootstrap_quant <- boot(data = Boston$medv, statistic = boot.fn3, R = 1000)
bootstrap_quant
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn3, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75  0.0186   0.4925766

The bootstrap estimate of the standard error of µ̂_0.1 is 0.4926, suggesting that if we repeatedly sampled from the population and computed the 10th percentile each time, the standard deviation of those estimates would be approximately 0.49.

Of the three quantities estimated above, the 10th percentile has the largest standard error and the median the smallest, so the median is the most stable estimate for this data set, with the mean in between.