Resampling methods involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. They may allow us to obtain information that would not be available from fitting the model only once using the original training sample.

Resampling approaches can be computationally expensive, because they involve fitting the same statistical method multiple times using different subsets of the training data. However, due to recent advances in computing power, the computational requirements of resampling methods generally are not prohibitive. In these exercises, we investigate two of the most commonly used resampling methods, cross-validation and the bootstrap.

Question 3.

We now review k-fold cross-validation. (a) Explain how k-fold cross-validation is implemented.

k-fold cross-validation involves randomly k-fold CV dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds.
–Page 181, An Introduction to Statistical Learning, 2013.

(b) What are the advantages and disadvantages of k-fold crossvalidation relative to:

  • The validation set approach?
    The validation set approach is conceptually simple and easily implemented as you are simply partitioning the existing training data into two sets. However, there are two drawbacks: (1.) the estimate of the test error rate can be highly variable depending on which observations are included in the training and validation sets. (2.) the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.

  • LOOCV?
    LOOCV is a special case of k-fold cross-validation with k = n. Thus, LOOCV is the most computationally intense method since the model must be fit n times. Also, LOOCV has higher variance, but lower bias, than k-fold CV.

Question 5.

In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.

library(ISLR)
summary(Default)
 default    student       balance           income     
 No :9667   No :7056   Min.   :   0.0   Min.   :  772  
 Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
                       Median : 823.6   Median :34553  
                       Mean   : 835.4   Mean   :33517  
                       3rd Qu.:1166.3   3rd Qu.:43808  
                       Max.   :2654.3   Max.   :73554  
attach(Default)
set.seed(1)

(a) Fit a logistic regression model that uses income and balance to predict default.

glm.fit = glm(default ~ income + balance, data = Default, family = binomial)

(b) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:
1. Split the sample set into a training set and a validation set.
2. Fit a multiple logistic regression model using only the training observations. 3. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.
4. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.

FiveB = function() {
    # i.
    train = sample(dim(Default)[1], dim(Default)[1]/2)
    # ii.
    glm.fit = glm(default ~ income + balance, data = Default, family = binomial, 
        subset = train)
    # iii.
    glm.pred = rep("No", dim(Default)[1]/2)
    glm.probs = predict(glm.fit, Default[-train, ], type = "response")
    glm.pred[glm.probs > 0.5] = "Yes"
    # iv.
    return(mean(glm.pred != Default[-train, ]$default))
}
FiveB()
[1] 0.0254

(c) Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

FiveB()
[1] 0.0274
FiveB()
[1] 0.0244
FiveB()
[1] 0.0244

The error rate seems to hover between 0.024 and 0.026.

(d) Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

train = sample(dim(Default)[1], dim(Default)[1]/2)
glm.fit = glm(default ~ income + balance + student, data = Default, family = binomial, 
    subset = train)
glm.pred = rep("No", dim(Default)[1]/2)
glm.probs = predict(glm.fit, Default[-train, ], type = "response")
glm.pred[glm.probs > 0.5] = "Yes"
mean(glm.pred != Default[-train, ]$default)
[1] 0.0262

The higher estimated test error rate including the student dummy variable indicates that adding the student dummy variable does not appear to lead to a reduction in the test error rate.

Question 6.

We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.

(a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

set.seed(1)
glm.fit = glm(default ~ income + balance, data = Default, family = binomial)
summary(glm.fit)

Call:
glm(formula = default ~ income + balance, family = binomial, 
    data = Default)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4725  -0.1444  -0.0574  -0.0211   3.7245  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1579.0  on 9997  degrees of freedom
AIC: 1585

Number of Fisher Scoring iterations: 8

(b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

boot.fn = function(data, index){
  return(coef(glm(default ~ income + balance, data = data, family = binomial, subset = index)))
}

(c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

library(boot)
boot(Default, boot.fn, 50)

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Default, statistic = boot.fn, R = 50)


Bootstrap Statistics :
         original        bias     std. error
t1* -1.154047e+01 -5.661486e-02 4.847786e-01
t2*  2.080898e-05 -7.436578e-08 4.456965e-06
t3*  5.647103e-03  1.854126e-05 2.639029e-04

(d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

The estimates from the two separate estimation methods are similar all the way to the second and third significant digits.

Question 7.

In Sections 5.3.2 and 5.3.3, we saw that the cv.glm() function can be used in order to compute the LOOCV test error estimate. Alternatively, one could compute those quantities using just the glm() and predict.glm() functions, and a for loop. You will now take this approach in order to compute the LOOCV error for a simple logistic regression model on the Weekly data set. Recall that in the context of classification problems, the LOOCV error is given in (5.4).

detach(Default)
summary(Weekly)
      Year           Lag1         
 Min.   :1990   Min.   :-18.1950  
 1st Qu.:1995   1st Qu.: -1.1540  
 Median :2000   Median :  0.2410  
 Mean   :2000   Mean   :  0.1506  
 3rd Qu.:2005   3rd Qu.:  1.4050  
 Max.   :2010   Max.   : 12.0260  
      Lag2               Lag3         
 Min.   :-18.1950   Min.   :-18.1950  
 1st Qu.: -1.1540   1st Qu.: -1.1580  
 Median :  0.2410   Median :  0.2410  
 Mean   :  0.1511   Mean   :  0.1472  
 3rd Qu.:  1.4090   3rd Qu.:  1.4090  
 Max.   : 12.0260   Max.   : 12.0260  
      Lag4               Lag5         
 Min.   :-18.1950   Min.   :-18.1950  
 1st Qu.: -1.1580   1st Qu.: -1.1660  
 Median :  0.2380   Median :  0.2340  
 Mean   :  0.1458   Mean   :  0.1399  
 3rd Qu.:  1.4090   3rd Qu.:  1.4050  
 Max.   : 12.0260   Max.   : 12.0260  
     Volume            Today         
 Min.   :0.08747   Min.   :-18.1950  
 1st Qu.:0.33202   1st Qu.: -1.1540  
 Median :1.00268   Median :  0.2410  
 Mean   :1.57462   Mean   :  0.1499  
 3rd Qu.:2.05373   3rd Qu.:  1.4050  
 Max.   :9.32821   Max.   : 12.0260  
 Direction 
 Down:484  
 Up  :605  
           
           
           
           
set.seed(1)
attach(Weekly)

(a) Fit a logistic regression model that predicts Direction using Lag1 and Lag2.

glm.fit = glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
summary(glm.fit)

Call:
glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.623  -1.261   1.001   1.083   1.506  

Coefficients:
            Estimate Std. Error z value
(Intercept)  0.22122    0.06147   3.599
Lag1        -0.03872    0.02622  -1.477
Lag2         0.06025    0.02655   2.270
            Pr(>|z|)    
(Intercept) 0.000319 ***
Lag1        0.139672    
Lag2        0.023232 *  
---
Signif. codes:  
  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’
  0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1496.2  on 1088  degrees of freedom
Residual deviance: 1488.2  on 1086  degrees of freedom
AIC: 1494.2

Number of Fisher Scoring iterations: 4

(b) Fit a logistic regression model that predicts Direction using Lag1 and Lag2 using all but the first observation.

glm.fit = glm(Direction ~ Lag1 + Lag2, data = Weekly[-1, ], family = binomial)
summary(glm.fit)

(c) Use the model from (b) to predict the direction of the first observation. You can do this by predicting that the first observation will go up if P(Direction="Up"|Lag1, Lag2) > 0.5. Was this observation correctly classified?

predict.glm(glm.fit, Weekly[1, ], type = "response") > 0.5
   1 
TRUE 
Weekly$Direction[1]
[1] Down
Levels: Down Up

We predict the Direction to be UP; the true Direction was DOWN.

(d) Write a for loop from i = 1 to i = n, where n is the number of observations in the data set, that performs each of the following steps:

1. Fit a logistic regression model using all but the ith observation to predict Direction using Lag1 and Lag2.
2. Compute the posterior probability of the market moving up for the ith observation.
3. Use the posterior probability for the ith observation in order to predict whether or not the market moves up.
4. Determine whether or not an error was made in predicting the direction for the ith observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0.

count = rep(0, dim(Weekly)[1])
for (i in 1:(dim(Weekly)[1])) {
    glm.fit = glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = binomial)
    is_up = predict.glm(glm.fit, Weekly[i, ], type = "response") > 0.5
    is_true_up = Weekly[i, ]$Direction == "Up"
    if (is_up != is_true_up) 
        count[i] = 1
}
sum(count)
[1] 490

490 errors.

(e) Take the average of the n numbers obtained in (d)iv in order to obtain the LOOCV estimate for the test error. Comment on the results.

mean(count)
[1] 0.4499541

LOOCV estimates a test error rate of 45%.

---
title: "Ch. 5 Resampling Methods"
output: 
  html_notebook:
    toc: true
    toc_float: true
---
Resampling methods involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. They may allow us to obtain information that would not be available from fitting the model only once using the original training sample.

Resampling approaches can be computationally expensive, because they involve fitting the same statistical method multiple times using different subsets of the training data. However, due to recent advances in computing power, the computational requirements of resampling methods generally are not prohibitive. In these exercises, we investigate two of the most commonly used resampling methods, cross-validation and the bootstrap.

### Question 3.    
**We now review *k*-fold cross-validation.**
**(a) Explain how *k*-fold cross-validation is implemented.**

> *k*-fold cross-validation involves randomly *k*-fold CV dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining *k* − 1 folds.  
--Page 181, [An Introduction to Statistical](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf) Learning, 2013.

**(b) What are the advantages and disadvantages of *k*-fold crossvalidation relative to:**

 - **The validation set approach?**  
The validation set approach is conceptually simple and easily implemented as you are simply partitioning the existing training data into two sets. However, there are two drawbacks: (1.) the estimate of the test error rate can be highly variable depending on which observations are included in the training and validation sets. (2.) the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.

 - **LOOCV?**  
LOOCV is a special case of *k*-fold cross-validation with *k* = *n*. Thus, LOOCV is the most computationally intense method since the model must be fit n times. Also, LOOCV has higher variance, but lower bias, than *k*-fold CV.

### Question 5.  
**In Chapter 4, we used logistic regression to predict the probability of `default` using `income` and `balance` on the `Default` data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.**

```{r}
library(ISLR)
summary(Default)
attach(Default)
set.seed(1)
```

**(a) Fit a logistic regression model that uses `income` and `balance` to predict `default`.**
```{r}
glm.fit = glm(default ~ income + balance, data = Default, family = binomial)
```

**(b) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:**  
**1.** Split the sample set into a training set and a validation set.  
**2.** Fit a multiple logistic regression model using only the training observations. 
**3.** Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the `default` category if the posterior probability is greater than 0.5.   
**4.** Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
 
```{r}
FiveB = function() {
  # i.
  train = sample(dim(Default)[1], dim(Default)[1]/2)
  # ii.
  glm.fit = glm(default ~ income + balance, data = Default, family = binomial, subset = train)
  # iii.
  glm.pred = rep("No", dim(Default)[1]/2)
  glm.probs = predict(glm.fit, Default[-train, ], type = "response")
  glm.pred[glm.probs > 0.5] = "Yes"
  # iv.
  return(mean(glm.pred != Default[-train,]$default))
}
FiveB()
```

**(c) Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.**

```{r}
FiveB()
FiveB()
FiveB()
```
The error rate seems to hover between 0.024 and 0.026.

**(d) Now consider a logistic regression model that predicts the probability of `default` using `income`, `balance`, and a dummy variable for `student`. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for `student` leads to a reduction in the test error rate.**

```{r}
train = sample(dim(Default)[1], dim(Default)[1]/2)
glm.fit = glm(default ~ income + balance + student, data = Default, family = binomial, 
    subset = train)
glm.pred = rep("No", dim(Default)[1]/2)
glm.probs = predict(glm.fit, Default[-train, ], type = "response")
glm.pred[glm.probs > 0.5] = "Yes"
mean(glm.pred != Default[-train, ]$default)
```
The higher estimated test error rate including the student dummy variable indicates that adding the student dummy variable does not appear to lead to a reduction in the test error rate.

### Question 6.  
**We continue to consider the use of a logistic regression model to predict the probability of `default` using `income` and `balance` on the `Default` data set. In particular, we will now compute estimates for the standard errors of the `income` and `balance` logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the `glm()` function. Do not forget to set a random seed before beginning your analysis.**

**(a) Using the `summary()` and `glm()` functions, determine the estimated standard errors for the coefficients associated with `income` and `balance` in a multiple logistic regression model that uses both predictors.**

```{r}
set.seed(1)
glm.fit = glm(default ~ income + balance, data = Default, family = binomial)
summary(glm.fit)
```

**(b) Write a function, `boot.fn()`, that takes as input the `Default` data set as well as an index of the observations, and that outputs the coefficient estimates for `income` and `balance` in the multiple logistic regression model.**

```{r}
boot.fn = function(data, index){
  return(coef(glm(default ~ income + balance, data = data, family = binomial, subset = index)))
}
```

**(c) Use the `boot()` function together with your `boot.fn()` function to estimate the standard errors of the logistic regression coefficients for `income` and `balance`.**

```{r}
library(boot)
boot(Default, boot.fn, 50)
```

**(d) Comment on the estimated standard errors obtained using the `glm()` function and using your bootstrap function.**

The estimates from the two separate estimation methods are similar all the way to the second and third significant digits.

### Question 7.  

**In Sections 5.3.2 and 5.3.3, we saw that the `cv.glm()` function can be used in order to compute the LOOCV test error estimate. Alternatively, one could compute those quantities using just the `glm()` and `predict.glm()` functions, and a for loop. You will now take this approach in order to compute the LOOCV error for a simple logistic regression model on the `Weekly` data set. Recall that in the context of classification problems, the LOOCV error is given in (5.4).**

```{r}
detach(Default)
summary(Weekly)
set.seed(1)
attach(Weekly)
```

**(a) Fit a logistic regression model that predicts Direction using `Lag1` and `Lag2`.**

```{r}
glm.fit = glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
summary(glm.fit)
```

**(b) Fit a logistic regression model that predicts Direction using `Lag1` and `Lag2` using all but the first observation.**

```{r}
glm.fit = glm(Direction ~ Lag1 + Lag2, data = Weekly[-1, ], family = binomial)
summary(glm.fit)
```

**(c) Use the model from (b) to predict the direction of the first observation. You can do this by predicting that the first observation will go up if *P*(`Direction="Up"|Lag1`, `Lag2`) > 0.5. Was this observation correctly classified?**

```{r}
predict.glm(glm.fit, Weekly[1, ], type = "response") > 0.5

Weekly$Direction[1]
```
We predict the `Direction` to be `UP`; the true `Direction` was `DOWN`.

__(d) Write a for loop from *i* = 1 to *i* = *n*, where *n* is the number of observations in the data set, that performs each of the following steps:__

__1.__ Fit a logistic regression model using all but the ith observation to predict `Direction` using `Lag1` and `Lag2`.  
__2.__ Compute the posterior probability of the market moving up for the *i*th observation.   
__3.__ Use the posterior probability for the *i*th observation in order to predict whether or not the market moves up.  
__4.__ Determine whether or not an error was made in predicting the direction for the *i*th observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0.
 
```{r}
count = rep(0, dim(Weekly)[1])
for (i in 1:(dim(Weekly)[1])) {
    glm.fit = glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = binomial)
    is_up = predict.glm(glm.fit, Weekly[i, ], type = "response") > 0.5
    is_true_up = Weekly[i, ]$Direction == "Up"
    if (is_up != is_true_up) 
        count[i] = 1
}
sum(count)
```
490 errors.

**(e) Take the average of the _n_ numbers obtained in (d)iv in order to obtain the LOOCV estimate for the test error. Comment on the results.**

```{r}
mean(count)
```
LOOCV estimates a test error rate of 45%.