As I described in one of my articles (http://rpubs.com/niroshaR/378607), if we have a large enough sample, we randomly split the data set into three parts: a training set, a validation set, and a test set, and we use the validation set error to evaluate our models. The model with the lowest validation set error is selected as the final model. However, relying on a single training/validation/test split can produce biased results. Resampling methods (cross-validation and the bootstrap), which draw repeated samples from the data, overcome this issue. There are three common ways to estimate the validation set error:
Validation set approach
Leave-One-Out cross validation
K-fold cross validation
1. Validation set approach
Train the model on one set (the training set) and evaluate its performance on the other set (the test set if the data are split into two sets, or the validation set if split into three).
Compute the validation set error (simply the MSE for a quantitative outcome): \[ \text{MSE} =\frac{1}{n}\sum_{i=1}^n(y_i-x_i^T \hat\beta)^2 \]
where \(\hat\beta\) is the coefficient vector estimated from the training set and the sum runs over the \(n\) validation set observations.
The validation set approach can have high variance, because the error estimate depends heavily on which observations happen to fall in the validation set at each split. As a result, this approach often overestimates the test error, with the extent depending on the variability of the data set.
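As a minimal sketch, the validation set approach can be carried out in base R. The built-in mtcars data set and the model mpg ~ wt + hp are placeholders chosen for illustration, not part of the original discussion:

```r
set.seed(123)                                  # reproducible split
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))  # 70/30 split (arbitrary choice)

train <- mtcars[train_idx, ]
valid <- mtcars[-train_idx, ]

fit  <- lm(mpg ~ wt + hp, data = train)        # fit on the training set only
pred <- predict(fit, newdata = valid)

mse <- mean((valid$mpg - pred)^2)              # validation set MSE
mse
```

Re-running this with a different seed changes the split, and typically the MSE as well, which is exactly the variance problem described above.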
2. Leave-one-out cross-validation
Leave-one-out cross-validation is very similar to the validation set approach, except that it considers \(n\) different splits: each split uses 1 observation as the validation set and the remaining \(n-1\) observations as the training set. The model is fit on the \(n-1\) observations and evaluated on the single held-out observation, and this is repeated for all \(n\) observations.
Compute the cross-validation error by averaging the \(n\) squared errors: \[ CV_{(n)} =\frac{1}{n}\sum_{i=1}^n(y_i-x_i^T \hat\beta_{(i)})^2 \] where \(\hat\beta_{(i)}\) is the coefficient vector estimated with the \(i^{th}\) observation left out.
When comparing candidate models, select the one with the lowest \(CV_{(n)}\) value.
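A sketch of LOOCV written as an explicit loop, mirroring the \(CV_{(n)}\) formula above (same placeholder model and data as before):

```r
n <- nrow(mtcars)
sq_err <- numeric(n)

for (i in seq_len(n)) {
  fit_i  <- lm(mpg ~ wt + hp, data = mtcars[-i, ])             # fit without observation i
  pred_i <- predict(fit_i, newdata = mtcars[i, , drop = FALSE])
  sq_err[i] <- (mtcars$mpg[i] - pred_i)^2                      # held-out squared error
}

cv_n <- mean(sq_err)   # CV_(n): average of the n squared errors
cv_n
```

Because each of the \(n\) fits leaves out only one observation, LOOCV has no randomness in the splits, but it requires fitting the model \(n\) times.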
3. K-fold cross-validation
The data set is randomly divided into \(k\) folds of roughly equal size. Each fold is held out once as the validation set while the model is fit on the remaining \(k-1\) folds, giving \(k\) error estimates that are averaged: \[ CV_{(k)} =\frac{1}{k}\sum_{i=1}^k\text{MSE}_i \] where \(\text{MSE}_i\) is the mean squared error computed on the \(i^{th}\) held-out fold.
Ex: with 5-fold cross-validation, the model is fit five times, each time leaving out a different fifth of the data.
Most of the time, \(k\) is chosen between 5 and 10, but there is no formal rule for selecting it, and the size of the data set also matters. For small data sets, \(k\)-fold cross-validation tends to have larger variance than the other methods, but this is negligible for large data sets. Large values of \(k\) are computationally expensive but have low bias, whereas small values of \(k\) are computationally efficient but have higher bias. Hence, a middle value of \(k\) is an appropriate compromise between bias and computational cost.
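A sketch of 5-fold cross-validation with the same placeholder model; the fold assignment below is one simple way to do it, and the boot package's cv.glm() gives an equivalent estimate:

```r
set.seed(123)
k <- 5
n <- nrow(mtcars)
folds <- sample(rep(1:k, length.out = n))      # random fold assignment

fold_mse <- numeric(k)
for (j in 1:k) {
  train <- mtcars[folds != j, ]                # k-1 folds for training
  valid <- mtcars[folds == j, ]                # one fold held out
  fit <- lm(mpg ~ wt + hp, data = train)
  fold_mse[j] <- mean((valid$mpg - predict(fit, newdata = valid))^2)  # MSE_j
}

cv_k <- mean(fold_mse)   # CV_(k): average MSE across the k folds
cv_k

# Equivalent estimate via the boot package (if installed):
# library(boot)
# cv.glm(mtcars, glm(mpg ~ wt + hp, data = mtcars), K = 5)$delta[1]
```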
References
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112). New York: Springer.
By: Nirosha Rathnayake, Ph.D. Biostatistics Student, UNMC, Omaha, NE