- The goal of resampling is to repeatedly draw samples from a training set in order to refit and evaluate a model multiple times using a single training set.
- We can estimate the test error rate either through resampling methods or through mathematical adjustments to the training error rate that account for overfitting.
Validation Set Approach
- A simple approach to estimating the test error is to randomly divide the available observations into two parts
- A training set and a validation (hold-out) set
Validation Set
- Fit the model on the training set, use the fitted model to predict the responses for the observations in the validation set, and compare these predictions to the actual values.
- This provides an estimate of the test error rate
- Pros:
- Very simple and computationally quick way to estimate the test error
- Cons
- High variance between different validation splits (illustrated in the sketch after the code below)
- Fewer observations are used to train the model, so the fit tends to be worse and the validation error tends to overestimate the test error rate of the model fit on the full data set
- We can implement this in R
- sample() returns a vector of randomly chosen indices to use as the training set
# The Auto data set comes from the ISLR package
library(ISLR)
# Choose 196 random observations out of the 392 as the training set
train = sample(392, 196)
# The subset = option fits the model using only the training observations
lm.fit = lm(mpg ~ horsepower, data = Auto, subset = train)
# Compute the test MSE on the held-out (validation) observations
mean((Auto$mpg - predict(lm.fit, Auto))[-train]^2)
## [1] 21.58206
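- As a quick illustrative sketch, repeating the random split shows how much the validation estimate of the test MSE can vary between splits:
# Repeat the random split several times; each split gives a different MSE estimate
val.mse = rep(0, 5)
for (s in 1:5){
  train = sample(392, 196)
  lm.fit = lm(mpg ~ horsepower, data = Auto, subset = train)
  val.mse[s] = mean((Auto$mpg - predict(lm.fit, Auto))[-train]^2)
}
val.mse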
Cross Validation Methods
- Cross validation methods split the data into several smaller parts (folds); the model is repeatedly trained on some folds and tested on the remaining fold.
Leave-One-Out
- Leave-one-out cross validation (LOOCV) involves splitting the set of observations into two parts, similar to the validation set approach. However, instead of creating two subsets of similar size, a single observation \((x_1,y_1)\) is used for the validation set and the remaining observations \(\{(x_2,y_2),...,(x_n,y_n)\}\) make up the training set.
- Note that there is no randomness in the LOOCV results, since every observation takes its turn as the validation set and no random splitting is involved
- Also note that this is a special case of k-fold CV where k=n
LOOCV Set
- The statistical model is fit n times, each time on the n-1 training observations; this produces an approximately unbiased estimate of the test error because each training set is almost the entire data set.
- This, however, leads to higher variance, as the n training sets have substantial overlap, resulting in highly correlated error estimates.
- The general formula is as follows, and it can be extremely computationally expensive for most methods (n model fits):
\[ CV_{(n)}=\frac{1}{n}\sum^n_{i=1}MSE_i\]
- However, with least squares linear or polynomial regression there is a computational shortcut that makes the cost of LOOCV the same as a single model fit, where \(h_{ii}\) is the leverage of the \(i\)-th observation:
\[ CV_{(n)}=\frac{1}{n}\sum^n_{i=1}\left(\frac{y_i-\hat{y}_i}{1-h_{ii}}\right)^2\]
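- As an illustrative sketch, this shortcut can be computed directly from a single fit using hatvalues(); it should match the cv.glm() LOOCV estimate reported further below:
# LOOCV for the linear model via the leverage shortcut:
# one fit, then rescale each residual by 1/(1 - h_ii)
lm.fit = lm(mpg ~ horsepower, data = Auto)
h = hatvalues(lm.fit)
mean(((Auto$mpg - fitted(lm.fit)) / (1 - h))^2)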
- Pros
- Low bias as almost all of the training observations are used to train the models
- Always gives the same results, as there is no randomness in the resampling method
- Can be extremely computationally efficient for the right models
- Cons
- For models without the computational shortcut it is an extremely computationally expensive approach, since the model must be refit n times
- The n training sets are nearly identical, so the fitted models' errors are highly correlated with each other, inflating the variance of the estimate
- This can be implemented in R
- cv.glm() is part of the boot library
- cv.err$delta gives the LOOCV error as well as a bias-corrected version
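- A minimal sketch of the call producing the output below, assuming the same mpg ~ horsepower model as above fit with glm():
library(boot)
# glm() without a family argument fits by least squares, so cv.glm() can be used
glm.fit = glm(mpg ~ horsepower, data = Auto)
cv.err = cv.glm(Auto, glm.fit)
cv.err$delta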
## [1] 24.23151 24.23114
k-Fold
- K-Fold cross validation involves randomly dividing the observations into k groups/folds of approximately equal size.
- In practice k-fold CV is very popular, typically with K = 5 or K = 10
- As this split is random there is variance between different computations of k-fold cross validation.
- Each training set contains approximately \(\frac{(k-1)n}{k}\) observations
k-Fold Set
- The formula for the CV estimate is:
\[ CV_{(k)}=\frac{1}{k}\sum^k_{i=1}MSE_i \]
- Pros
- Computationally cheaper than LOOCV, requiring only k model fits
- Less bias than the simple validation approach, but more than LOOCV
- Can give more accurate estimates of the test error than LOOCV because of the bias-variance tradeoff
- LOOCV has higher variance than k-fold CV because its n fitted models are highly positively correlated
- There is a bias/variance trade-off associated with the choice of k in k-fold cross validation
- Cons
- Some variability in the CV estimates, but much lower than with the validation set approach
- In R we can use the same cv.glm() function as for LOOCV
- The K argument sets the number of folds
# Estimate the 10-fold CV error for polynomial fits of degree 1 to 10
cv.error.10 = rep(0, 10)
for (i in 1:10){
  glm.fit = glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error.10[i] = cv.glm(Auto, glm.fit, K = 10)$delta[1]
}
cv.error.10
## [1] 24.11219 19.17035 19.25040 19.37225 19.23547 19.30146 19.00148 19.06866
## [9] 18.75777 19.42713
CV on Classification Problems
- Very similar to regression CV, except that rather than using the MSE to quantify the test error, we use the fraction of misclassified observations
- LOOCV
\[ CV_{(n)}=\frac{1}{n}\sum^n_{i=1}\mathbb{I}(y_i\neq\hat{y}_i) \]
- K-Fold, where \(\text{Err}_i\) is the misclassification rate within fold \(i\) (of size \(n_i\))
\[ CV_{(k)}=\frac{1}{k}\sum^k_{i=1}\text{Err}_i,\qquad \text{Err}_i=\frac{1}{n_i}\sum_{j\in\text{fold }i}\mathbb{I}(y_j\neq\hat{y}_j) \]
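- As an illustrative sketch, k-fold CV for a classifier can be run with cv.glm() and a misclassification-rate cost function; the binary variable high (above-median mpg) is made up here purely for illustration:
library(ISLR)
library(boot)
# Hypothetical binary response for illustration: is mpg above its median?
Auto$high = as.numeric(Auto$mpg > median(Auto$mpg))
glm.fit = glm(high ~ horsepower + weight, data = Auto, family = binomial)
# cost() returns the fraction of observations misclassified at a 0.5 threshold
cost = function(y, prob) mean(abs(y - prob) > 0.5)
cv.glm(Auto, glm.fit, cost = cost, K = 10)$delta[1]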
Bootstrap
- Bootstrapping is a useful tool to quantify the uncertainty associated with a given estimator or statistical learning method
- This method can be easily applied to a wide range of statistical learning methods and is not constrained by the assumptions that closed-form formulas for the same quantities may require
- For real data we cannot generate new samples from the original population, but the bootstrap emulates that process so we can estimate variability, MSE, etc. without needing new samples
- This is done by sampling from the original data set with replacement
Bootstrapped Datasets
- An example: estimating a quantity \(\alpha\) by averaging its estimate over \(B\) bootstrap datasets
- A similar calculation gives the bootstrap estimate of the variance/standard error (see the second formula below)
\[ \hat{\alpha}^*=\frac{1}{B}\sum^B_{i=1}\hat{\alpha}^{*i}\]
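- The corresponding bootstrap estimate of the standard error is the sample standard deviation of the bootstrap replicates:
\[ SE_B(\hat{\alpha})=\sqrt{\frac{1}{B-1}\sum^B_{i=1}\left(\hat{\alpha}^{*i}-\hat{\alpha}^*\right)^2} \]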
- To implement this in R we need to create a function for boot() to call on each bootstrap sample
- boot() is part of the boot library
- Setting replace = T in the sample() function gives sampling with replacement
# boot.fn() returns the regression coefficients fit on the observations in index
boot.fn = function(data, index){
  return(coef(lm(mpg ~ horsepower, data = data, subset = index)))
}
# For a single bootstrap sample (drawn with replacement)
boot.fn(Auto, sample(392, 392, replace = T))
## (Intercept)  horsepower
##   40.298300   -0.159045
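- The full bootstrap with R = 1000 resamples is then run with boot(), matching the call shown in the output below:
# 1000 bootstrap replicates of the coefficient estimates
boot(data = Auto, statistic = boot.fn, R = 1000)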
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Auto, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 39.9358610 0.0159811655 0.839614690
## t2* -0.1578447 -0.0001729074 0.007285645