Resampling is the process of repeatedly drawing samples from a training set, fitting a model on each sample, and then examining the extent to which these fitted models differ.
Cross-validation is an effective way to estimate the test error rate using only the training data. This is an important part of statistical learning, given that a separate test set is often unavailable. What happens in cross-validation is that a subset of the observations is held out, the model is fit on the remaining observations, and the fitted model is then used to predict the held-out observations; the resulting error serves as an estimate of the test error.
Different cross-validation methods carry out these steps in different ways, differing mostly in how the held-out portion of the data is selected.
The dataset is randomly divided into two parts (usually in half): a training set and a validation set. The model is fit on the training set and used to predict the responses in the validation set. This method is computationally inexpensive; however, it tends to be highly variable and to overestimate the test error rate. It is sensitive to which observations end up in which set, and since it splits the data in half, the model is fit on far fewer observations, which causes the statistical learning method to perform worse.
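As an illustration, here is a minimal Python sketch of the validation set approach, assuming synthetic data and a simple polynomial fit with numpy; the quadratic data-generating process and all variable names are illustrative choices, not part of the method itself.

```python
# A minimal sketch of the validation set approach on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-2, 2, size=n)
y = x**2 + rng.normal(scale=0.5, size=n)       # assumed quadratic relationship

# Randomly split the observations in half: training set and validation set.
idx = rng.permutation(n)
train, val = idx[: n // 2], idx[n // 2:]

# Fit a degree-2 polynomial on the training half only.
coefs = np.polyfit(x[train], y[train], deg=2)

# Estimate the test error as the MSE on the held-out validation half.
pred = np.polyval(coefs, x[val])
val_mse = np.mean((y[val] - pred) ** 2)
print(f"validation-set MSE estimate: {val_mse:.3f}")
```

Rerunning this with a different random split (a different seed) typically gives a noticeably different MSE estimate, which is exactly the high variability described above.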
This method also splits the dataset into two parts; however, instead of the two subsets being similar in size, the validation set contains only a single observation and the training set keeps the rest of the data. The learning method is fit on the \(n-1\) remaining observations, predicts the observation that was left out, and the MSE is computed on that single observation. This process is repeated \(n\) times, with each observation taking a turn as the one left out of the training set. The resulting cross-validation estimate is \[\mathrm{CV}_{(n)}=\frac{1}{n} \sum_{i=1}^{n} \mathrm{MSE}_{i}\]
This method tends not to overestimate the test error as much as the validation set approach, and it gives the same result every time it is applied, since there is no randomness in the splits: each training set is simply the full dataset minus one observation. However, it is computationally expensive, as it requires the model to be fit \(n\) times.
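A minimal Python sketch of LOOCV under the same assumed synthetic setup as above: each observation is left out once, the model is fit on the remaining \(n-1\) points, and the \(n\) single-observation MSEs are averaged.

```python
# A minimal sketch of LOOCV: fit on n-1 observations, score the one left out,
# and average the n resulting MSE_i values.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-2, 2, size=n)
y = x**2 + rng.normal(scale=0.5, size=n)

errors = np.empty(n)
for i in range(n):
    mask = np.ones(n, dtype=bool)
    mask[i] = False                               # leave observation i out
    coefs = np.polyfit(x[mask], y[mask], deg=2)   # fit on the other n-1 points
    errors[i] = (y[i] - np.polyval(coefs, x[i])) ** 2

cv_n = errors.mean()                              # CV_(n) = (1/n) * sum(MSE_i)
print(f"LOOCV estimate: {cv_n:.3f}")
```

Note that the loop fits the model \(n\) times, which is the computational cost mentioned above.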
This is a more general form of the Validation Set and LOOCV approaches, in which the dataset is randomly divided into \(k\) groups (folds) of roughly equal size. The first fold is treated as the validation set, and the model is fit on the remaining \(k-1\) folds. The trained model is then applied to the held-out fold, and the test MSE is computed. This process is repeated \(k\) times, with each fold being treated as the validation set exactly once. The cross-validation estimate is \[\mathrm{CV}_{(k)}=\frac{1}{k} \sum_{i=1}^{k} \mathrm{MSE}_{i}\]
This is considered the more general method because the Validation Set approach roughly corresponds to the case \(k=2\) and LOOCV is the case \(k=n\). The most common choices are \(k=5\) and \(k=10\). This method has a significant computational advantage over LOOCV, as the model only needs to be fit \(k\) times rather than \(n\). Because the folds are assigned randomly, the estimate varies somewhat from run to run, unlike the deterministic LOOCV, but its variability is significantly lower than that of the Validation Set approach.
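Below is a minimal Python sketch of \(k\)-fold cross-validation with \(k=5\), again on assumed synthetic data; the folds are formed by shuffling the indices and splitting them into \(k\) roughly equal groups.

```python
# A minimal sketch of k-fold cross-validation (k = 5 here).
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-2, 2, size=n)
y = x**2 + rng.normal(scale=0.5, size=n)

k = 5
folds = np.array_split(rng.permutation(n), k)     # k roughly equal folds

fold_mse = []
for held_out in folds:
    mask = np.ones(n, dtype=bool)
    mask[held_out] = False                        # hold out one fold
    coefs = np.polyfit(x[mask], y[mask], deg=2)   # fit on the other k-1 folds
    pred = np.polyval(coefs, x[held_out])
    fold_mse.append(np.mean((y[held_out] - pred) ** 2))

cv_k = np.mean(fold_mse)                          # CV_(k) = (1/k) * sum(MSE_i)
print(f"{k}-fold CV estimate: {cv_k:.3f}")
```

Here the model is fit only \(k=5\) times instead of \(n=100\), which is the computational advantage over LOOCV noted above.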
All of the cross-validation approaches mentioned above can be applied to any statistical learning method, which makes them highly versatile and useful.
Bootstrapping allows us to obtain distinct data sets by repeatedly sampling observations from the original dataset, rather than by repeatedly obtaining independent data sets from the population. The sampling in bootstrapping is performed with replacement, so the same observation can occur more than once in a bootstrap dataset. The standard error of a bootstrap estimate can be computed using the formula \[\operatorname{SE}_{B}(\hat{\alpha})=\sqrt{\frac{1}{B-1} \sum_{r=1}^{B}\left(\hat{\alpha}^{* r}-\frac{1}{B} \sum_{r^{\prime}=1}^{B} \hat{\alpha}^{* r^{\prime}}\right)^{2}}\]
where \(B\) is the number of bootstrap datasets and \(\hat{\alpha}^{*r}\) is the estimate of \(\alpha\) computed from the \(r\)-th bootstrap dataset. Overall, the bootstrap approach can be used to estimate the variability of the estimator \(\hat{\alpha}\) without drawing additional samples from the population.
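As a sketch of how \(\operatorname{SE}_{B}(\hat{\alpha})\) is computed in practice, the following Python snippet bootstraps an arbitrary statistic (the sample median here, standing in for \(\hat{\alpha}\)); the data and the choice of statistic are illustrative assumptions.

```python
# A minimal sketch of a bootstrap standard error estimate.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)   # assumed original dataset

B = 1000
boot_stats = np.empty(B)
for r in range(B):
    # Draw a bootstrap dataset with replacement, same size as the original.
    sample = rng.choice(data, size=data.size, replace=True)
    boot_stats[r] = np.median(sample)             # alpha-hat^{*r}

# SE_B(alpha-hat): sample standard deviation of the B bootstrap estimates,
# using the 1/(B-1) normalization from the formula above (ddof=1).
se_b = boot_stats.std(ddof=1)
print(f"bootstrap SE of the median: {se_b:.4f}")
```

The loop mirrors the formula directly: `boot_stats` holds the \(B\) bootstrap estimates \(\hat{\alpha}^{*r}\), and `std(ddof=1)` applies the \(\frac{1}{B-1}\) normalization around their mean.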