3. We now review k-fold cross-validation.
  a. Explain how k-fold cross-validation is implemented.

ANSWER 3A: k-fold CV involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean squared error, MSE1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as the validation set. This process results in k estimates of the test error, MSE1, MSE2, . . . , MSEk, which are averaged to give the k-fold CV estimate, CV(k) = (MSE1 + · · · + MSEk)/k.
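
The procedure can be sketched directly in R. The following minimal example (k = 5, fitting a simple linear model on the ISLR Auto data purely as an illustration, not part of the exercise) mirrors the description above:

require(ISLR)

set.seed(1)
k <- 5
# randomly assign each observation to one of k folds of roughly equal size
folds <- sample(rep(1:k, length.out = nrow(Auto)))

cv_mse <- numeric(k)
for (j in 1:k) {
  fit <- lm(mpg ~ horsepower, data = Auto[folds != j, ])  # fit on the other k - 1 folds
  pred <- predict(fit, newdata = Auto[folds == j, ])      # predict the held-out fold
  cv_mse[j] <- mean((Auto$mpg[folds == j] - pred)^2)      # MSE_j
}
mean(cv_mse)  # k-fold CV estimate: the average of MSE_1, ..., MSE_k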

  b. What are the advantages and disadvantages of k-fold cross-validation relative to:
    i. The validation set approach?
    ii. LOOCV?

ANSWER 3B:

Advantages/Disadvantages of k-fold relative to the validation set approach:

The validation set approach is simple and cheap (one split, one fit), but its test error estimate is highly variable depending on which observations happen to fall in the validation set, and it tends to overestimate the test error because the model is trained on only a subset of the data. k-fold CV uses every observation for both training and validation, so its estimate has lower variance and lower bias; the cost is that the model must be fit k times rather than once.

Advantages/Disadvantages of k-fold relative to LOOCV:

LOOCV requires n model fits while k-fold CV requires only k (typically 5 or 10), so k-fold is far less expensive computationally. LOOCV has slightly lower bias, since each training set contains n − 1 observations versus roughly (k − 1)n/k for k-fold. However, the n LOOCV training sets are nearly identical, so the resulting error estimates are highly correlated and their average has higher variance than the k-fold estimate. The computational difference is sketched below.
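
To make the computational point concrete, here is a minimal sketch comparing LOOCV and 10-fold CV via cv.glm() from the boot package. The use of mpg ~ horsepower on the ISLR Auto data is purely illustrative:

require(ISLR)
require(boot)

glm_fit <- glm(mpg ~ horsepower, data = Auto)

# LOOCV: K defaults to n, so the model is refit nrow(Auto) times
cv.glm(Auto, glm_fit)$delta[1]

# 10-fold CV: only 10 refits, at the cost of slightly more bias
set.seed(1)
cv.glm(Auto, glm_fit, K = 10)$delta[1]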

5. In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.
  a. Fit a logistic regression model that uses income and balance to predict default.

ANSWER 5A:

require(ISLR)
## Loading required package: ISLR
log_res <- glm(formula = default ~ income + balance, data = Default, family ="binomial")
summary(log_res)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

  b. Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:
    i. Split the sample set into a training set and a validation set.

ANSWER 5B-i:

I have split the data 50/50 into training and validation sets, as shown:

require(ISLR)
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(123, sample.kind = "Rounding")
index <- createDataPartition(y = Default$default, p = 0.5, list = F)

train <- Default[index, ]
test <- Default[-index, ]
nrow(train) / nrow(Default)
## [1] 0.5001

    ii. Fit a multiple logistic regression model using only the training observations.

ANSWER 5B-ii:

require(ISLR)
require(caret)

log_res2 <- glm(default ~ income + balance, data = train, family = "binomial")
summary(log_res2)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1929  -0.1502  -0.0606  -0.0230   3.2824  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.121e+01  5.923e-01 -18.927  < 2e-16 ***
## income       1.923e-05  6.989e-06   2.752  0.00593 ** 
## balance      5.507e-03  3.132e-04  17.582  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1463.76  on 5000  degrees of freedom
## Residual deviance:  814.48  on 4998  degrees of freedom
## AIC: 820.48
## 
## Number of Fisher Scoring iterations: 8

    iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.

ANSWER 5B-iii:

The following shows the first 15 predictions stored in test_preds:

require(ISLR)
require(caret)

test_preds <- factor(ifelse(predict(log_res2, newdata = test, type = "response") > 0.5, "Yes", "No"))
test_preds[1:15]
##  2  3  4  7 12 13 14 15 17 18 19 20 21 25 30 
## No No No No No No No No No No No No No No No 
## Levels: No Yes

    iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.

ANSWER 5B-iv:

The validation set error is 0.02501, as shown below:

require(ISLR)
require(caret)

round(mean(test$default != test_preds), 5)
## [1] 0.02501

  c. Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

ANSWER 5C:

As shown below, I used a loop to perform three additional train/test splits, fit the logistic regression model on each training set, and stored the test errors in the vector log_df_err. The four estimates range from 0.02501 to 0.02741, with an average of 0.02611; the validation set estimate of the test error varies somewhat with the particular split, but stays close to 2.6%.

require(ISLR)
require(caret)

log_df_err <- c()
log_df_err[1] <- mean(test$default != test_preds)

set.seed(42, sample.kind = "Rounding")

for (i in 2:4) {
  index <- createDataPartition(y = Default$default, p = 0.5, list = F)
  train <- Default[index, ]
  test <- Default[-index, ]
  
  log_res <- glm(default ~ income + balance, data = train, family = "binomial")
  test_preds <- factor(ifelse(predict(log_res, newdata = test, type = "response") > 0.5, "Yes", "No"))
  
  log_df_err[i] <- mean(test$default != test_preds)
}

round(log_df_err, 5)
## [1] 0.02501 0.02581 0.02741 0.02621
mean(round(log_df_err, 5)) #average value for test error
## [1] 0.02611

  d. Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

ANSWER 5D:

As shown below, the studentYes coefficient is not significant, and including it also leaves income insignificant (the two predictors carry overlapping information). The test error is 0.02681, slightly higher than the average error from (c), so including a dummy variable for student does not lead to a reduction in the test error rate.

require(ISLR)
require(caret)

set.seed(43, sample.kind = "Rounding")

index <- createDataPartition(y = Default$default, p = 0.5, list = F)
train <- Default[index, ]
test <- Default[-index, ]

log_res5d <- glm(default ~ ., data = train, family = "binomial")
summary(log_res5d)
## 
## Call:
## glm(formula = default ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2165  -0.1350  -0.0509  -0.0183   3.2414  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.166e+01  7.470e-01 -15.603   <2e-16 ***
## studentYes  -1.845e-01  3.459e-01  -0.533    0.594    
## balance      5.858e-03  3.371e-04  17.379   <2e-16 ***
## income       1.329e-05  1.207e-05   1.100    0.271    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1463.76  on 5000  degrees of freedom
## Residual deviance:  762.82  on 4997  degrees of freedom
## AIC: 770.82
## 
## Number of Fisher Scoring iterations: 8
test_preds <- factor(ifelse(predict(log_res5d, newdata = test, type = "response") > 0.5, "Yes", "No"))

round(mean(test$default != test_preds), 5)
## [1] 0.02681

6. We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.
  a. Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

ANSWER 6A:

The estimated standard errors for the coefficients associated with income (4.985167e-06) and balance (2.273731e-04) are shown:

require(ISLR)
require(caret)

log_res6 <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(log_res6)$coefficients[, 2]
##  (Intercept)       income      balance 
## 4.347564e-01 4.985167e-06 2.273731e-04

  b. Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

ANSWER 6B:

The coefficient estimates for income (2.080898e-05) and balance (5.647103e-03) in the multiple logistic regression model are shown:

require(ISLR)
require(caret)

boot.fn <- function(data, index = 1:nrow(data)) {
  coef(glm(default ~ income + balance, data = data,
           subset = index, family = "binomial"))[-1]
}
boot.fn(Default)
##       income      balance 
## 2.080898e-05 5.647103e-03

  c. Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

ANSWER 6C:

The bootstrap estimates of SE(income) and SE(balance) are shown below:

require(tidyr)
## Loading required package: tidyr
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#require(ISLR)
#require(caret)
require(boot)
## Loading required package: boot
## 
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
## 
##     melanoma
require(ggplot2)

set.seed(101, sample.kind = "Rounding")

boot_res <- boot(data = Default, statistic = boot.fn, R = 1000)

boot_res
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original       bias     std. error
## t1* 2.080898e-05 9.797648e-08 4.710812e-06
## t2* 5.647103e-03 1.345215e-05 2.293757e-04
as.data.frame(boot_res$t) %>%
  rename(income = V1, balance = V2) %>%
  gather(key = "variable", value = "estimate") %>%
  ggplot(aes(x = estimate)) +
  geom_histogram(bins = 20, color = "blue", fill = "orange") +
  facet_wrap(~ variable, scales = "free_x") +
  labs(title = "1000 Bootstrap Parameter Estimates - 'balance' & 'income'",
       subtitle = paste0("SE(income) = ", formatC(sd(boot_res$t[, 1]), format = "e", digits = 6),
                         ", SE(balance) = ", formatC(sd(boot_res$t[, 2]), format = "e", digits = 6)),
       x = "Parameter Estimate",
       y = "Count")

  d. Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

ANSWER 6D:

The two sets of estimates are very close: the bootstrap gives SE(income) = 4.710812e-06 and SE(balance) = 2.293757e-04, versus 4.985167e-06 and 2.273731e-04 from the glm() formula. Since the bootstrap does not rely on the model-based assumptions behind the formula estimates, this agreement suggests the glm() standard errors are reliable for this data set.

require(tidyr)
require(dplyr)
#require(ISLR)
#require(caret)
require(boot)
require(ggplot2)

#boot():
sapply(data.frame(income = boot_res$t[ ,1], balance = boot_res$t[ ,2]), sd)
##       income      balance 
## 4.710812e-06 2.293757e-04
#glm():
summary(log_res6)$coefficients[2:3, 2]
##       income      balance 
## 4.985167e-06 2.273731e-04

9. We will now consider the Boston housing data set, from the MASS library.
  a. Based on this data set, provide an estimate for the population mean of medv. Call this estimate ˆμ.

ANSWER 9A:

The population mean of medv is estimated by the sample mean, ˆμ = 22.53281, as shown:

require(MASS)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
mean(Boston$medv)
## [1] 22.53281

  b. Provide an estimate of the standard error of ˆμ. Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.

ANSWER 9B:

SE(ˆμ) = 0.4088611. This means that across repeated samples of this size, the sample mean of medv would typically deviate from the population mean by about 0.41, i.e., about $409, since medv is measured in $1000s.

require(MASS)
sd(Boston$medv) / sqrt(length(Boston$medv))
## [1] 0.4088611

  c. Now estimate the standard error of ˆμ using the bootstrap. How does this compare to your answer from (b)?

ANSWER 9C:

The bootstrap estimate of the standard error of ˆμ is 0.4081538, as shown below. It is very close to the formula-based estimate from (b); both are about 0.41.

require(MASS)
require(boot)

boot.fn <- function(vector, index) {mean(vector[index])}
set.seed(66, sample.kind = "Rounding")
boot_res <- boot(data = Boston$medv, statistic = boot.fn, R = 1000)
boot_res 
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original    bias    std. error
## t1* 22.53281 0.0116587   0.4081538

  d. Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). Hint: You can approximate a 95% confidence interval using the formula [ˆμ − 2SE(ˆμ), ˆμ + 2SE(ˆμ)].

ANSWER 9D:

The bootstrap confidence interval (21.7165, 23.3491) is nearly identical to the t.test interval (21.72953, 23.33608), as shown below:

require(MASS)
require(boot)

boot_resSE <- sd(boot_res$t)
round(c(mean(Boston$medv) - 2*boot_resSE, mean(Boston$medv) + 2*boot_resSE), 4)
## [1] 21.7165 23.3491
t.test(Boston$medv)
## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

  e. Based on this data set, provide an estimate, ˆμmed, for the median value of medv in the population.

ANSWER 9E:

The population median of medv is estimated by the sample median, ˆμmed = 21.2, as shown:

require(MASS)
median(Boston$medv)
## [1] 21.2

  f. We now would like to estimate the standard error of ˆμmed. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.

ANSWER 9F:

The bootstrap estimate of the standard error of the median is 0.3700318, as shown below. It is somewhat smaller than the standard error of the mean from (c), and small relative to ˆμmed = 21.2, so the sample median is estimated quite precisely.

require(MASS)
require(boot)
boot.fn <- function(vector, index) {median(vector[index])}
set.seed(77, sample.kind = "Rounding")
(boot_res <- boot(data = Boston$medv, statistic = boot.fn, R = 1000))
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2  0.0094   0.3700318

  g. Based on this data set, provide an estimate for the tenth percentile of medv in Boston suburbs. Call this quantity ˆμ0.1. (You can use the quantile() function.)

ANSWER 9G:

The estimate for the tenth percentile of medv in Boston suburbs is ˆμ0.1 = 12.75, as shown:

require(MASS)
require(ISLR)
quantile(Boston$medv, 0.1)
##   10% 
## 12.75

  h. Use the bootstrap to estimate the standard error of ˆμ0.1. Comment on your findings.

ANSWER 9H:

The bootstrap standard error of ˆμ0.1 is 0.488873, somewhat larger than the standard errors of the median and mean above, but still small relative to the estimate of 12.75, as shown:

require(MASS)
require(ISLR)
require(boot)
boot.fn <- function(vector, index) {quantile(vector[index], 0.1)}
set.seed(77, sample.kind = "Rounding")
boot_res <- boot(data = Boston$medv, statistic = boot.fn, R = 1000)
boot_res
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75 0.02085    0.488873