- The goal of resampling is to repeatedly draw samples from a training set in order to refit and evaluate a model multiple times using a single training set.
- We can estimate the test error rate either through resampling methods or through mathematical adjustments to the training error rate that account for overfitting.
Validation Set Approach
- A simple approach to estimating the test error is to randomly divide the available observations into two parts
- A training set and a validation (hold-out) set
Validation Set
- Fit the model on the training set, use the fitted model to predict the responses for the observations in the validation set, and compare these predictions to the actual values.
- This provides an estimate of the test error rate
- Pros:
- Very simple and computationally quick way to estimate the test error
- Cons
- High variance between different validation splits (illustrated in the sketch after the code below)
- Fewer observations are used to train the model, so the fit tends to be worse and the validation error tends to overestimate the test error rate of the model fit on the full data set
- We can implement this in R
- sample() returns a vector of randomly chosen indices to use as the training set
# The Auto data set comes from the ISLR package
library(ISLR)
# Choose 196 random observations out of the 392 as the training set
train = sample(392, 196)
# The subset = option fits the model using only the training observations
lm.fit = lm(mpg ~ horsepower, data = Auto, subset = train)
# Compute the test MSE on the held-out (validation) observations
mean((Auto$mpg - predict(lm.fit, Auto))[-train]^2)
## [1] 21.58206
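- As a quick illustrative sketch, repeating the random split shows how much the validation estimate of the test MSE can vary between splits:
# Repeat the random split several times; each split gives a different MSE estimate
val.mse = rep(0, 5)
for (s in 1:5){
  train = sample(392, 196)
  lm.fit = lm(mpg ~ horsepower, data = Auto, subset = train)
  val.mse[s] = mean((Auto$mpg - predict(lm.fit, Auto))[-train]^2)
}
val.mse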
Cross Validation Methods
- Cross validation methods split the data into several smaller parts (folds); the model is repeatedly trained on some folds and tested on the remaining fold.
Leave-One-Out
- Leave-one-out cross validation (LOOCV) involves splitting the set of observations into two parts, similar to the validation set approach. However, instead of creating two subsets of similar size, a single observation \((x_1,y_1)\) is used for the validation set and the remaining observations \(\{(x_2,y_2),...,(x_n,y_n)\}\) make up the training set.
- Note that there is no randomness in the LOOCV results, since every observation takes its turn as the validation set and no random splitting is involved
- Also note that this is a special case of k-fold CV where k=n
LOOCV Set
- The statistical model is fit n times, each time on the n-1 training observations; this produces an approximately unbiased estimate of the test error because each training set is almost the entire data set.
- This, however, leads to higher variance, as the n training sets have substantial overlap, resulting in highly correlated error estimates.
- The general formula is as follows, and it can be extremely computationally expensive for most methods (n model fits):
\[ CV_{(n)}=\frac{1}{n}\sum^n_{i=1}MSE_i\]
- However, with least squares linear or polynomial regression there is a computational shortcut that makes the cost of LOOCV the same as a single model fit, where \(h_{ii}\) is the leverage of the \(i\)-th observation:
\[ CV_{(n)}=\frac{1}{n}\sum^n_{i=1}\left(\frac{y_i-\hat{y}_i}{1-h_{ii}}\right)^2\]
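- As an illustrative sketch, this shortcut can be computed directly from a single fit using hatvalues(); it should match the cv.glm() LOOCV estimate reported further below:
# LOOCV for the linear model via the leverage shortcut:
# one fit, then rescale each residual by 1/(1 - h_ii)
lm.fit = lm(mpg ~ horsepower, data = Auto)
h = hatvalues(lm.fit)
mean(((Auto$mpg - fitted(lm.fit)) / (1 - h))^2)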
- Pros
- Low bias as almost all of the training observations are used to train the models
- Always gives the same results, as there is no randomness in the resampling method
- Can be extremely computationally efficient for the right models
- Cons
- For models without the computational shortcut it is an extremely computationally expensive approach, since the model must be refit n times
- The n training sets are nearly identical, so the fitted models' errors are highly correlated with each other, inflating the variance of the estimate
- This can be implemented in R
- cv.glm() is part of the boot library
- cv.err$delta gives the LOOCV error as well as a bias-corrected version
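- A minimal sketch of the call producing the output below, assuming the same mpg ~ horsepower model as above fit with glm():
library(boot)
# glm() without a family argument fits by least squares, so cv.glm() can be used
glm.fit = glm(mpg ~ horsepower, data = Auto)
cv.err = cv.glm(Auto, glm.fit)
cv.err$delta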
## [1] 24.23151 24.23114
k-Fold
- K-Fold cross validation involves randomly dividing the observations into k groups/folds of approximately equal size.
- In practice k-fold CV is very popular, typically with K = 5 or K = 10
- As this split is random there is variance between different computations of k-fold cross validation.
- Each training set contains approximately \(\frac{(k-1)n}{k}\) observations
k-Fold Set
- The formula for the CV estimate is:
\[ CV_{(k)}=\frac{1}{k}\sum^k_{i=1}MSE_i \]
- Pros
- Computationally cheaper than LOOCV, requiring only k model fits
- Less bias than the simple validation approach, but more than LOOCV
- Can give more accurate estimates of the test error than LOOCV because of the bias-variance tradeoff
- LOOCV has higher variance than k-fold CV because its n fitted models are highly positively correlated
- There is a bias/variance trade-off associated with the choice of k in k-fold cross validation
- Cons
- Some variability in the CV estimates, but much lower than with the validation set approach
- In R we can use the same cv.glm() function as for LOOCV
- The K argument sets the number of folds
# Estimate the 10-fold CV error for polynomial fits of degree 1 to 10
cv.error.10 = rep(0, 10)
for (i in 1:10){
  glm.fit = glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error.10[i] = cv.glm(Auto, glm.fit, K = 10)$delta[1]
}
cv.error.10
## [1] 24.11219 19.17035 19.25040 19.37225 19.23547 19.30146 19.00148 19.06866
## [9] 18.75777 19.42713
CV on Classification Problems
- Very similar to regression CV, except that rather than using the MSE to quantify the test error, we use the fraction of misclassified observations
- LOOCV
\[ CV_{(n)}=\frac{1}{n}\sum^n_{i=1}\mathbb{I}(y_i\neq\hat{y}_i) \]
- K-Fold, where \(\text{Err}_i\) is the misclassification rate within fold \(i\) (of size \(n_i\))
\[ CV_{(k)}=\frac{1}{k}\sum^k_{i=1}\text{Err}_i,\qquad \text{Err}_i=\frac{1}{n_i}\sum_{j\in\text{fold }i}\mathbb{I}(y_j\neq\hat{y}_j) \]
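- As an illustrative sketch, k-fold CV for a classifier can be run with cv.glm() and a misclassification-rate cost function; the binary variable high (above-median mpg) is made up here purely for illustration:
library(ISLR)
library(boot)
# Hypothetical binary response for illustration: is mpg above its median?
Auto$high = as.numeric(Auto$mpg > median(Auto$mpg))
glm.fit = glm(high ~ horsepower + weight, data = Auto, family = binomial)
# cost() returns the fraction of observations misclassified at a 0.5 threshold
cost = function(y, prob) mean(abs(y - prob) > 0.5)
cv.glm(Auto, glm.fit, cost = cost, K = 10)$delta[1]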
Bootstrap
- Bootstrapping is a useful tool to quantify the uncertainty associated with a given estimator or statistical learning method
- This method can be easily applied to a wide range of statistical learning methods and is not constrained by the assumptions that closed-form formulas for the same quantities may require
- For real data we cannot generate new samples from the original population, but the bootstrap emulates that process so we can estimate variability, MSE, etc. without needing new samples
- This is done by sampling from the original data set with replacement
Bootstrapped Datasets
- An example: estimating a quantity \(\alpha\) by averaging its estimate over \(B\) bootstrap datasets
- A similar calculation gives the bootstrap estimate of the variance/standard error (see the second formula below)
\[ \hat{\alpha}^*=\frac{1}{B}\sum^B_{i=1}\hat{\alpha}^{*i}\]
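- The corresponding bootstrap estimate of the standard error is the sample standard deviation of the bootstrap replicates:
\[ SE_B(\hat{\alpha})=\sqrt{\frac{1}{B-1}\sum^B_{i=1}\left(\hat{\alpha}^{*i}-\hat{\alpha}^*\right)^2} \]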
- To implement this in R we need to create a function for boot() to call on each bootstrap sample
- boot() is part of the boot library
- Setting replace = T in the sample() function gives sampling with replacement
# boot.fn() returns the regression coefficients fit on the observations in index
boot.fn = function(data, index){
  return(coef(lm(mpg ~ horsepower, data = data, subset = index)))
}
# For a single bootstrap sample (drawn with replacement)
boot.fn(Auto, sample(392, 392, replace = T))
## (Intercept)  horsepower
##   40.298300   -0.159045
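- The full bootstrap with R = 1000 resamples is then run with boot(), matching the call shown in the output below:
# 1000 bootstrap replicates of the coefficient estimates
boot(data = Auto, statistic = boot.fn, R = 1000)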
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Auto, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 39.9358610 0.0159811655 0.839614690
## t2* -0.1578447 -0.0001729074 0.007285645