Resampling

Jake

06/07/2021

  • The goal of resampling is to repeatedly draw samples from a training set in order to refit and evaluate a model multiple times on a single training set.
  • We can estimate the test error rate through resampling methods, or through mathematical adjustments to the training error rate that account for overfitting.

Validation Set Approach

  • A simple approach to estimate the test error is to randomly divide the training set into two parts
    • A training set and a validation/test set
[Figure: Validation Set]

  • Fit the model on the training set, use the fitted model to predict the responses in the validation set, and compare these predictions to the actual values.
    • This provides an estimate of the test error rate
  • Pros:
    • Very simple and computationally quick way to estimate the test error
  • Cons
    • High variance between different validation splits
    • Training the model on less data means it tends to perform worse, so the validation approach can overestimate the test error rate of a model fit on the full data set
  • We can implement this in R (see the sketch after the output below)
    • sample() draws a random subset of observations to use as the training set
## [1] 21.58206
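
The code chunk that produced the estimate above is not shown in this rendering; a minimal sketch of the approach, assuming the Auto data set from the ISLR package with mpg regressed on horsepower (the exact error depends on the random split, so the seed is arbitrary):

```r
library(ISLR)

set.seed(1)
# sample() draws a random half of the row indices to serve as the training set
train <- sample(nrow(Auto), nrow(Auto) / 2)

# Fit the model on the training half only
lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)

# Validation-set estimate of the test MSE: mean squared prediction error
# on the held-out observations
mean((Auto$mpg - predict(lm.fit, Auto))[-train]^2)
```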

Cross Validation Methods

  • Cross validation methods split the data into several smaller parts (folds); the model is trained and tested repeatedly, each time holding out a different fold.

Leave-One-Out

  • Leave-one-out cross validation (LOOCV) involves splitting the set of observations into two parts, similar to the validation set approach. However, instead of creating two subsets of similar size, a single observation \((x_1,y_1)\) is used for the validation set and the remaining observations \(\{(x_2,y_2),...,(x_n,y_n)\}\) make up the training set.
    • Note that LOOCV results are deterministic: there is no randomness in the resampling process
    • Also note that this is a special case of k-fold CV where k=n
[Figure: LOOCV Set]

  • The model is fit n times, each time on a training set of n-1 observations, and produces an approximately unbiased estimate of the test error because each training set is nearly the whole data set.
    • This however leads to higher variance, as the training sets have substantial overlap, resulting in correlated error estimates.
  • The general formula is as follows, and can be extremely computationally expensive for most methods:

\[ CV_{(n)}=\frac{1}{n}\sum^n_{i=1}MSE_i\]

  • However, with least squares linear or polynomial regression there is a computational shortcut that makes the cost of LOOCV the same as a single model fit:

\[ CV_{(n)}=\frac{1}{n}\sum^n_{i=1}\left(\frac{y_i-\hat{y}_i}{1-h_{ii}}\right)^2\]
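
Here \(h_{ii}\) is the leverage of the \(i\)-th observation (the \(i\)-th diagonal entry of the hat matrix). A small sketch of the shortcut computed directly from a single fit, again assuming the Auto data from ISLR:

```r
library(ISLR)

# LOOCV via the leverage shortcut: each squared residual is inflated
# by 1 / (1 - h_ii), then averaged over all n observations
fit <- lm(mpg ~ horsepower, data = Auto)
mean((residuals(fit) / (1 - hatvalues(fit)))^2)
```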

  • Pros
    • Low bias as almost all of the training observations are used to train the models
    • Always have the same results as there is no randomness in the resampling method
    • Can be extremely computationally efficient for the right models
  • Cons
    • For models without such a shortcut it is an extremely computationally expensive approach
    • The n training sets are nearly identical, so the fitted models and their error estimates are highly correlated with each other, which raises the variance of the estimate
  • This can be implemented in R (see the sketch after the output below)
    • cv.glm() is part of the boot library
    • cv.err$delta gives the LOOCV error as well as a bias-adjusted version
## [1] 24.23151 24.23114
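
The call that produced the output above is not shown; a plausible reconstruction using glm() (which fits by least squares when no family is given) and cv.glm() from the boot library:

```r
library(ISLR)
library(boot)

glm.fit <- glm(mpg ~ horsepower, data = Auto)

# With no K argument, cv.glm() performs LOOCV
cv.err <- cv.glm(Auto, glm.fit)

# delta[1] is the raw LOOCV estimate, delta[2] a bias-corrected version
cv.err$delta
```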

k-Fold

  • K-Fold cross validation involves randomly dividing the observations into k groups/folds of approximately equal size.
    • In practice K-fold CV is very popular and often uses K=5 or 10
    • As this split is random there is variance between different computations of k-fold cross validation.
    • Each training set contains \(\frac{(k-1)n}{k}\) observations
[Figure: k-Fold Set]

  • The formula for the CV estimate is:

\[ CV_{(k)}=\frac{1}{k}\sum^k_{i=1}MSE_i \]

  • Pros
    • Computationally much cheaper than LOOCV
    • Less bias than the simple validation approach but more than LOOCV
    • Can give more accurate estimates of test error than LOOCV due to Bias-Variance tradeoff
      • LOOCV has higher variance than k-fold due to its highly positively correlated error estimates
      • Bias/Variance trade off associated with the choice of k in k-fold cross validation
  • Cons
    • Some variability in CV estimates, but much lower than validation set
  • In R we can use the same function as for LOOCV (see the sketch after the output below)
    • The K argument sets the number of folds
##  [1] 24.11219 19.17035 19.25040 19.37225 19.23547 19.30146 19.00148 19.06866
##  [9] 18.75777 19.42713
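
The ten values above look like 10-fold CV errors for polynomial fits of increasing degree; a sketch of how such a vector might be produced, assuming polynomials of mpg on horsepower of degree 1 to 10 (the seed, and hence the exact values, are arbitrary):

```r
library(ISLR)
library(boot)

set.seed(17)
cv.error.10 <- rep(0, 10)

# Record the 10-fold CV error for polynomial fits of degree 1 to 10
for (d in 1:10) {
  glm.fit <- glm(mpg ~ poly(horsepower, d), data = Auto)
  cv.error.10[d] <- cv.glm(Auto, glm.fit, K = 10)$delta[1]
}
cv.error.10
```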

CV on Classification Problems

  • Very similar to the regression CV, except rather than using the MSE to quantify the test error we use the misclassification rate (see the sketch after the formulas below)
  • LOOCV

\[ CV_{(n)}=\frac{1}{n}\sum^n_{i=1}\mathbb{I}(y_i\neq\hat{y}_i) \]

  • K-Fold

\[ CV_{(k)}=\frac{1}{k}\sum^k_{i=1}\mathrm{Err}_i, \qquad \mathrm{Err}_i=\frac{1}{n_i}\sum_{j\in\text{fold } i}\mathbb{I}(y_j\neq\hat{y}_j) \]
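
cv.glm() accepts a cost argument, so the same function can compute a cross-validated misclassification rate for a classifier fit with glm(..., family = binomial). A sketch, using a hypothetical binary response high (whether mpg exceeds its median) built from the Auto data:

```r
library(ISLR)
library(boot)

# Hypothetical binary response: is mpg above its median?
Auto$high <- as.numeric(Auto$mpg > median(Auto$mpg))

logit.fit <- glm(high ~ horsepower + weight, data = Auto, family = binomial)

# cost() receives the observed 0/1 responses r and the predicted
# probabilities pi; return the misclassification rate at a 0.5 threshold
cost <- function(r, pi) mean(abs(r - pi) > 0.5)

set.seed(1)
cv.glm(Auto, logit.fit, cost = cost, K = 10)$delta[1]
```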

Bootstrap

  • Bootstrapping is a useful tool to quantify the uncertainty associated with a given estimator or statistical learning method
    • This method can be easily applied to a wide range of statistical learning methods and is not constrained by the assumptions that analytic formulas for the same quantities may require
  • For real data we cannot generate new samples from the original population, but the bootstrap emulates that process so we can estimate variability, MSE, etc. without needing new samples
    • This is done by sampling from the original data set with replacement
[Figure: Bootstrapped Datasets]

  • For example, the bootstrap estimate of a statistic \(\alpha\) is the average of the estimates \(\hat{\alpha}^{*i}\) computed on each of the \(B\) bootstrap datasets
    • A similar thing can be done for the variance (or standard error)

\[ \hat{\alpha}^{*}=\frac{1}{B}\sum^B_{i=1}\hat{\alpha}^{*i}\]
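
As a concrete illustration of the formula, a minimal hand-rolled sketch that bootstraps the mean of a simulated numeric vector (the data and B are arbitrary; the same pattern works for the variance or any other statistic):

```r
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)  # stand-in data
B <- 1000

# Each bootstrap estimate is the statistic recomputed on a resample
# drawn with replacement from the original observations
alpha.star <- replicate(B, mean(sample(x, replace = TRUE)))

mean(alpha.star)  # bootstrap estimate of the mean
sd(alpha.star)    # bootstrap estimate of its standard error
```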

  • To implement this in R we write a function that computes the statistic of interest and pass it to boot() (see the reconstruction after the output below)
    • boot() is part of the boot library
    • Setting replace = TRUE in the sample() function results in sampling with replacement
## (Intercept)  horsepower 
##   40.298300   -0.159045
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##       original        bias    std. error
## t1* 39.9358610  0.0159811655 0.839614690
## t2* -0.1578447 -0.0001729074 0.007285645
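
The boot.fn() behind the call shown above is not included in this rendering; a reconstruction consistent with the printed output, assuming it returns the coefficients of the mpg ~ horsepower fit on a given set of row indices:

```r
library(ISLR)
library(boot)

# Return the coefficients of mpg ~ horsepower fit on the rows in `index`
boot.fn <- function(data, index) {
  coef(lm(mpg ~ horsepower, data = data, subset = index))
}

# A single bootstrap replicate: resample row indices with replacement
set.seed(1)
boot.fn(Auto, sample(nrow(Auto), nrow(Auto), replace = TRUE))

# 1,000 bootstrap replicates of the coefficient estimates
boot(Auto, boot.fn, R = 1000)
```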