In this assignment, I will be using Tidymodels instead of base R.

1 Exercise 1 (10 points)

Review of k-fold cross-validation.

  a. Explain how k-fold cross-validation is implemented.

k-fold CV involves randomly splitting the set of observations into \(k\) groups, or folds, of roughly equal size. The first fold serves as a validation set, and the method is fitted to the remaining \(k - 1\) folds. The mean squared error, \(MSE_1\), is then calculated on the observations in the held-out fold. This procedure is repeated \(k\) times, with a different fold serving as the validation set each time. This yields \(k\) estimates of the test error, \(MSE_1\), \(MSE_2\),…, \(MSE_k\), and the \(k\)-fold CV estimate is their average.
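A minimal sketch of this procedure, assuming a simple linear model on the built-in mtcars data (the model, predictors, and k = 5 are only illustrative):

library(tidymodels)

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)               # randomly assign rows to k = 5 folds
lm_spec <- linear_reg() %>% set_engine("lm")

mse_per_fold <- map_dbl(folds$splits, function(split) {
  mod <- fit(lm_spec, mpg ~ wt + hp, data = analysis(split))  # fit on the k - 1 analysis folds
  preds <- predict(mod, new_data = assessment(split))         # predict the held-out fold
  mean((assessment(split)$mpg - preds$.pred)^2)               # MSE_i on the held-out fold
})

mean(mse_per_fold)                             # k-fold CV estimate: average of MSE_1, ..., MSE_k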

  b. What are the advantages and disadvantages of k-fold cross-validation relative to
    • The validation set approach
    • LOOCV

1.1 k-Fold CV vs Validation set

1.1.1 Advantages

  • k-fold CV has much lower variability than the validation set approach, especially in small datasets.
  • Every observation is used both for training and for assessing model performance (each observation appears in exactly one validation fold).
  • When compared to a model fit on the entire data set, the validation set approach tends to overestimate the test error, because most models improve with more data and a sizable fraction of the observations is excluded from training entirely.

1.1.2 Disadvantages

  • k-fold CV is more involved to carry out than the validation set approach.
  • The validation set approach is faster because the model is trained and evaluated only once; k-fold CV can be far more time consuming for large data sets and for large values of k.

1.2 k-Fold CV vs Leave-One-Out Cross-Validation

1.2.1 Advantages

  • Compared to LOOCV, k-fold CV requires much less computation (see the sketch after this list). For example, with k = 10 and n = 1,000, k-fold CV fits only 10 models, whereas LOOCV fits 1,000; LOOCV is computationally intensive even for modest data sets.
  • Due to the bias-variance trade-off, there is some evidence that k-fold CV can provide a more accurate estimate of the test error rate than LOOCV (LOOCV has lower bias but higher variance).
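As a rough illustration of this computational difference, one can count how many resamples (and therefore model fits) each scheme creates. A minimal sketch using rsample's vfold_cv() and loo_cv() on a simulated data set with n = 1,000 rows (the data and seed are made up for illustration):

library(rsample)

set.seed(1)
dat <- data.frame(matrix(rnorm(1000 * 5), ncol = 5))

# 10-fold CV: the model would be fit 10 times
nrow(vfold_cv(dat, v = 10))

# LOOCV: the model would be fit once per observation, i.e. 1,000 times
nrow(loo_cv(dat))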

1.2.2 Disadvantages

  • Unlike LOOCV, k-fold CV involves randomness in how observations are assigned to folds, so repeating it can produce different results; LOOCV has no randomness in the training/validation splits and always produces the same, less variable MSE estimate.
  • LOOCV has lower bias than k-fold CV, so its estimate tends to be closer to the true test error; each k-fold training set is smaller than the LOOCV training sets, which introduces some bias.

2 Exercise 2 (10 points)

Denote whether the following statements are true or false. Explain your reasoning.

  a. When \(k = n\) the cross-validation estimator is approximately unbiased for the true prediction error.
  b. When \(k = n\) the cross-validation estimator will always have a low variance.

When \(k = n\), the cross-validation estimator is approximately unbiased for the expected prediction error, so statement a is true. Statement b is false: the estimator can have high variance because the \(n\) training sets are nearly identical to one another. The computational load is also sizable, requiring \(n\) applications of the learning method.

  c. Statistical transformations on the predictors, such as scaling and centering, must be done inside each fold.

True. Transformations whose parameters are estimated from the data, such as centering and scaling, should be learned inside each fold using only the analysis set; otherwise information from the held-out observations leaks into training and the resampled error estimate becomes optimistic. Re-estimating the preprocessing inside the cross-validation loop does slow things down, so transformations that do not depend on estimated statistics (for example, a log transform) can be applied once before cross-validation.
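A minimal tidymodels sketch of this idea, using a hypothetical linear model on mtcars (the model, data, and seed are only illustrative): because the centering and scaling steps live in a recipe, fit_resamples() re-estimates them on the analysis set of each fold.

library(tidymodels)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_center(all_numeric_predictors()) %>%   # centering statistics learned within each fold
  step_scale(all_numeric_predictors())        # scaling statistics learned within each fold

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg() %>% set_engine("lm"))

set.seed(2023)
fit_resamples(wf, resamples = vfold_cv(mtcars, v = 5)) %>%
  collect_metrics()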

3 Exercise 3 (20 points)

This exercise should be answered using the Weekly data set, which is part of the ISLR package. If you don’t have it installed already you can install it with

# install.packages("ISLR")

To load the data set run the following code

library(ISLR)
library(tidymodels)
data("Weekly")
# see the variable names
names(Weekly)
## [1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
## [7] "Volume"    "Today"     "Direction"
  a. Create a test and training set using initial_split(). Split proportion is up to you. Remember to set a seed!
# set a seed
set.seed(1)
# use 80% of the observations for the training set and 20% for the testing set
weekly_split <- initial_split(Weekly, prop = 0.8, strata = Direction)
weekly_split
## <Analysis/Assess/Total>
## <873/216/1089>
# training and testing sets
weekly_train <- training(weekly_split)
weekly_test <- testing(weekly_split)
  b. Create a logistic regression specification using logistic_reg(). Set the engine to glm.
# set a logistic regression specification
lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
lr_spec 
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
  c. Create a 5-fold cross-validation object using the training data, and fit the resampled folds with fit_resamples() and Direction as the response and the five lag variables plus Volume as predictors. Remember to set a seed before creating the k-fold object.
# create a workflow object
lr_wf <- workflow() %>%
  add_model(lr_spec) %>%
  add_formula(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume)

weekly_cv5 <- vfold_cv(data = weekly_train, v = 5, strata = Direction)

# fit the model within each of the folds
fit_weekly_cv5 <- fit_resamples(lr_wf, resamples = weekly_cv5,
                                control = control_resamples(save_pred = TRUE))
fit_weekly_cv5$.metrics
## [[1]]
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.571 Preprocessor1_Model1
## 2 roc_auc  binary         0.614 Preprocessor1_Model1
## 
## [[2]]
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.583 Preprocessor1_Model1
## 2 roc_auc  binary         0.540 Preprocessor1_Model1
## 
## [[3]]
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.554 Preprocessor1_Model1
## 2 roc_auc  binary         0.559 Preprocessor1_Model1
## 
## [[4]]
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.529 Preprocessor1_Model1
## 2 roc_auc  binary         0.477 Preprocessor1_Model1
## 
## [[5]]
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.569 Preprocessor1_Model1
## 2 roc_auc  binary         0.491 Preprocessor1_Model1
  d. Collect the performance metrics using collect_metrics(). Interpret.
fit_weekly_cv5 %>%
  collect_metrics(summarize = FALSE) %>%
  filter(.metric == "accuracy" )
## # A tibble: 5 x 5
##   id    .metric  .estimator .estimate .config             
##   <chr> <chr>    <chr>          <dbl> <chr>               
## 1 Fold1 accuracy binary         0.571 Preprocessor1_Model1
## 2 Fold2 accuracy binary         0.583 Preprocessor1_Model1
## 3 Fold3 accuracy binary         0.554 Preprocessor1_Model1
## 4 Fold4 accuracy binary         0.529 Preprocessor1_Model1
## 5 Fold5 accuracy binary         0.569 Preprocessor1_Model1
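For a summarized view, collect_metrics() without summarize = FALSE averages each metric across the five folds:

fit_weekly_cv5 %>%
  collect_metrics()

The per-fold accuracies above range from roughly 0.53 to 0.58, so the cross-validated accuracy is about 0.56, suggesting the model predicts Direction only slightly better than chance.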
  e. Fit the model on the whole training data set. Calculate the accuracy on the test set. How does this result compare to the results in d? Interpret.

On the test data, the accuracy is about 54%, which is close to the accuracies obtained on the five training folds; the cross-validation estimate therefore gives a reasonable indication of the model’s performance on unseen data.

final_fit_weekly_cv5 <- fit(lr_wf, data = weekly_train)

augment(final_fit_weekly_cv5, new_data = weekly_test) %>%
  accuracy(truth = Direction, estimate = .pred_class)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.542

4 Exercise 4 (15 points)

We will now derive the probability that a given observation is part of a bootstrap sample. Suppose that we obtain a bootstrap sample from a set of \(n\) observations.

Bootstrapping is a statistical procedure for generating multiple simulated samples from a single data set. For a variety of sample statistics, it allows us to calculate standard errors, construct confidence intervals, and perform hypothesis tests.
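As a small illustration (the statistic, data, and number of resamples here are arbitrary), the bootstrap standard error of a sample mean could be computed as:

library(purrr)

set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)     # original sample of n = 50 observations

# resample x with replacement many times and recompute the statistic each time
boot_means <- map_dbl(1:2000, ~ mean(sample(x, replace = TRUE)))
sd(boot_means)                        # bootstrap estimate of the standard error of the mean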

  a. What is the probability that the first bootstrap observation is not the \(j\)th observation from the original sample? Justify your answer.

There are \(n\) observations in the data set that can be drawn. Because each of the \(n\) observations has an equal chance of being chosen, the probability that the \(j\)th observation is chosen as the first bootstrap observation is \(\frac{1}{n}\). As a result, the probability that the first bootstrap observation is not the \(j\)th observation from the original sample is \(1 - \frac{1}{n}\).

  b. What is the probability that the second bootstrap observation is not the \(j\)th observation from the original sample? Justify your answer.

Because the bootstrap samples rows with replacement, the second draw is made from the full original sample, independently of the first. As in part a, each observation has an equal chance of being chosen, so the probability is still \(1 - \frac{1}{n}\).

  c. Argue that the probability that the \(j\)th observation is not in the bootstrap sample is \((1-1/n)^n\).

The draws are independent because the bootstrap sample is generated by sampling with replacement. For the \(j\)th observation to be absent from the bootstrap sample, it must not be the first bootstrap observation, not be the second bootstrap observation, and so on through the \(n\)th bootstrap observation. Multiplying the per-draw probabilities from parts a and b gives \(\left(1 - \frac{1}{n}\right)\left(1 - \frac{1}{n}\right)\cdots\left(1 - \frac{1}{n}\right) = \left(1-\frac{1}{n}\right)^n\).

  d. When \(n = 5\) what is the probability that the \(j\)th observation is in the bootstrap sample?

The probability that the \(j\)th observation is not in the bootstrap sample is \((1-1/n)^n\), so the probability that the \(j\)th observation is in the bootstrap sample is \(1-\left(1 - \frac{1}{n}\right)^n\). For \(n = 5\):

\[ 1-\left(1 - \frac{1}{5}\right)^5 = 1 - \left(\frac{4}{5}\right)^5 = 0.67232 \]

Ans: 0.67232

# create the probability function
probBootstrap <- function(n){
 ans <- 1-(1-1/n)^n 
 return(ans)
}
probBootstrap(5)
## [1] 0.67232
  e. When \(n = 100\) what is the probability that the \(j\)th observation is in the bootstrap sample?

Ans: 0.6339677

probBootstrap(100)
## [1] 0.6339677
  f. When \(n = 10,000\) what is the probability that the \(j\)th observation is in the bootstrap sample?

Ans: 0.632139

probBootstrap(10000)
## [1] 0.632139
  g. Create a plot that displays, for each integer value of \(n\) from 1 to 100,000, the probability that the \(j\)th observation is in the bootstrap sample. Comment on what you observe.

For \(n = 100\) and \(n = 10,000\), the probability is already close to 0.632, consistent with the previous answers. As \(n\) grows, the probability that the \(j\)th observation appears in a bootstrap sample levels off at about 0.632; in other words, each bootstrap sample contains roughly 63.2% of the distinct observations (some more than once), which is the basis of the "0.632 bootstrap" method. The horizontal reference line in the plot marks this limiting probability of 0.632.
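The 0.632 value is the limit of this probability as \(n\) grows:

\[
\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^{n} = e^{-1} \approx 0.368,
\qquad
\lim_{n \to \infty} \left[\,1 - \left(1 - \frac{1}{n}\right)^{n}\right] = 1 - e^{-1} \approx 0.632 .
\]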

tibble(n = 1:100000, prob = probBootstrap(n)) %>%
  ggplot(aes(x = n, y = prob)) +
  geom_point() +
  labs(x = "n (number of observations)",
       y = "P(jth observation is in the bootstrap sample)",
       title = "Probability that the jth Observation Appears in a Bootstrap Sample") +
  geom_hline(aes(yintercept = 1 - 1/exp(1), col = "0.632")) +
  theme_bw()

  h. We will now investigate numerically the probability that a bootstrap sample of size \(n = 100\) contains the \(j\)th observation. Here \(j = 4\). We repeatedly create bootstrap samples, and each time we record whether or not the fourth observation is contained in the bootstrap sample.
set.seed(69) # set a seed here
store <- integer(10000)
for (i in seq_along(store)) {
  store[i] <- sum(sample(seq_len(100), replace = TRUE) == 4) > 0
}
mean(store)
## [1] 0.6337

Comment on the results obtained.

The simulation agrees with the analytical result: about 63.4% of the bootstrap samples contain the 4th observation, close to the theoretical value \(1 - (1 - 1/100)^{100} \approx 0.634\) and to the limiting value of 0.632. Even for large data sets, there is always roughly a 63% chance that any particular observation appears in a given bootstrap sample.

5 Exercise 5 (10 points)

Suppose that we use some statistical learning method to make a prediction for the response \(Y\) for a particular value of the predictor \(X\).

  a. Carefully describe how we might estimate the standard deviation of our prediction.

We can use the bootstrap. Repeatedly draw bootstrap samples from the original data, refit the statistical learning method on each sample, and record the prediction at the given value of \(X\). The standard deviation of those bootstrap predictions is our estimate of the standard deviation of the prediction. The bootstrap can be used to quantify the variability of most techniques, even unsupervised ones.
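A minimal sketch of this procedure, assuming a hypothetical linear model on mtcars and a single new predictor value x0 (the model, data, and number of resamples are only illustrative):

library(rsample)
library(purrr)

set.seed(1)
x0 <- data.frame(wt = 3, hp = 150)    # hypothetical value of the predictor X

# refit the model on each bootstrap sample and predict at x0
boot_preds <- map_dbl(bootstraps(mtcars, times = 1000)$splits, function(split) {
  mod <- lm(mpg ~ wt + hp, data = analysis(split))
  predict(mod, newdata = x0)
})

sd(boot_preds)                        # bootstrap estimate of the standard deviation of the prediction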

  b. Does this procedure depend on which statistical learning method we are using?

No, the procedure itself does not depend on the learning method: the same resample, refit, and predict steps apply whichever technique is used, and only the model fitted within each bootstrap sample changes. In this case the task is supervised learning, meaning the data are already tagged with the correct answer, as in the previous questions and examples; regression and classification are supervised techniques, while clustering and association are unsupervised, and the bootstrap can be applied in either setting.

6 Exercise 6 (15 points)

This exercise should be answered using the Default data set, which is part of the ISLR package. If you don’t have it installed already you can install it with

# install.packages("ISLR")

To load the data set run the following code

library(ISLR)
data("Default")
head(Default)
##   default student   balance    income
## 1      No      No  729.5265 44361.625
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 5      No      No  785.6559 38463.496
## 6      No     Yes  919.5885  7491.559
  a. Use the parsnip package to fit a logistic regression on the Default data set. default is the response and income and balance are the predictors. Then use summary() on the fitted model to determine the estimated standard errors for the coefficients associated with income and balance. Comment on the estimated standard errors.

In the fitted model, the p-values for the intercept, \(\hat{\beta}_1\) (income), and \(\hat{\beta}_2\) (balance) are all highly significant. The smaller the standard error, the more precise the estimate: it indicates how far a coefficient estimate is likely to be from the true population value. The estimated standard error of the income coefficient is about 4.985e-06, that of the balance coefficient is about 2.274e-04, and that of the intercept is about 0.4348.

lr_spec %>%
  fit(default ~ income + balance, data = Default) -> lr_fit
lr_fit
## parsnip model object
## 
## Fit time:  43ms 
## 
## Call:  stats::glm(formula = default ~ income + balance, family = stats::binomial, 
##     data = data)
## 
## Coefficients:
## (Intercept)       income      balance  
##  -1.154e+01    2.081e-05    5.647e-03  
## 
## Degrees of Freedom: 9999 Total (i.e. Null);  9997 Residual
## Null Deviance:       2921 
## Residual Deviance: 1579  AIC: 1585
lr_fit %>%
  pluck("fit") -> lr_fit_fit
summary(lr_fit_fit)
## 
## Call:
## stats::glm(formula = default ~ income + balance, family = stats::binomial, 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8
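For reference, the same coefficient table, including the standard errors, can also be extracted in tidy form; this is the same tidy() call that the bootstrap code below applies to each resampled fit.

tidy(lr_fit)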
  b. Use the bootstraps() function from the rsample package to generate 25 bootstraps of Default.
default_boots <- bootstraps(Default, strata = default, times = 25)
default_boots 
## # Bootstrap sampling using stratification 
## # A tibble: 25 x 2
##    splits               id         
##    <list>               <chr>      
##  1 <split [10000/3688]> Bootstrap01
##  2 <split [10000/3640]> Bootstrap02
##  3 <split [10000/3658]> Bootstrap03
##  4 <split [10000/3705]> Bootstrap04
##  5 <split [10000/3708]> Bootstrap05
##  6 <split [10000/3739]> Bootstrap06
##  7 <split [10000/3704]> Bootstrap07
##  8 <split [10000/3684]> Bootstrap08
##  9 <split [10000/3678]> Bootstrap09
## 10 <split [10000/3648]> Bootstrap10
## # … with 15 more rows
  c. Run the following code. Change boots to the name of the bootstrapping object created in the previous question. This will take a minute or two to run. Comment on the results.
# This function takes a `bootstrapped` split object, and fits a logistic model
fit_lr_on_bootstrap <- function(split) {
    logistic_reg() %>%
    set_engine("glm") %>%
    set_mode("classification") %>%
    fit(default ~ income + balance, analysis(split))
}
# This code uses `map()` to fit the model inside each split. Then it uses
# `tidy()` to extract the parameter estimates from each fitted model
boot_models <-
  default_boots %>% # changed name
  mutate(model = map(splits, fit_lr_on_bootstrap),
         coef_info = map(model, tidy))
# This code extracts the estimates for each model that was fit
boot_coefs <-
  boot_models %>% 
  unnest(coef_info)
# This code calculates the standard deviation of each coefficient estimate across the bootstrap fits
sd_estimates <- boot_coefs %>%
  group_by(term) %>%
  summarise(std.error_est = sd(estimate))
sd_estimates
## # A tibble: 3 x 2
##   term        std.error_est
##   <chr>               <dbl>
## 1 (Intercept)    0.479     
## 2 balance        0.000243  
## 3 income         0.00000467
  d. Comment on the estimated standard errors obtained using the summary() function on the first model and the estimated standard errors you found using the above code.

The bootstrap standard errors are very close to the formula-based standard errors from summary(): roughly 0.479 versus 0.435 for the intercept, 0.000243 versus 0.000227 for balance, and 0.00000467 versus 0.00000499 for income. The diff column below gives the absolute difference between the two sets of estimates. Because these differences are small, the bootstrap, which relies only on resampling, agrees well with the standard errors obtained from the model’s formula-based calculation.

summary(lr_fit_fit)$coefficients %>%
  as_tibble(rownames = "term") %>%
  rename(std.error_real = "Std. Error") %>%
  select(term, std.error_real) -> tmp
# join by term so the bootstrap and summary() standard errors line up row by row
left_join(sd_estimates, tmp, by = "term") %>%
  mutate(diff = abs(std.error_est - std.error_real))
## # A tibble: 3 x 4
##   term        std.error_est std.error_real        diff
##   <chr>               <dbl>          <dbl>       <dbl>
## 1 (Intercept)    0.479          0.435      0.0440     
## 2 balance        0.000243       0.000227   0.0000157  
## 3 income         0.00000467     0.00000499 0.000000315
