As I described in one of my articles (http://rpubs.com/niroshaR/378607), if we have a large enough sample, we randomly split the data set into three parts: a training set, a validation set, and a test set, and we use the validation set error to evaluate our models. The model with the lowest validation set error is selected as the final model. However, relying on a single training/validation/test split can produce biased results. Resampling methods (cross-validation and the bootstrap), which draw repeated samples from the data, overcome this issue. There are three common ways to estimate the validation set error:
Validation set approach
Leave-One-Out cross validation
K-fold cross validation
1. Validation set approach
Train the model on one set (the training set) and evaluate its performance on the other set (the test set if the data are split into two sets, or the validation set if split into three).
Compute the validation set error (simply the MSE for a quantitative outcome): \[ \text{MSE} =\frac{1}{n}\sum_{i=1}^n(y_i-x_i^T \hat\beta)^2 \]
where \(\hat\beta\) is the coefficient vector estimated from the training set and the sum runs over the \(n\) validation set observations.
The validation set approach can have high variance, because the error estimate depends heavily on which observations happen to fall in the validation set at each split. As a result, this approach often overestimates the test error, with the extent depending on the variability of the data set.
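As a minimal sketch, the validation set approach can be carried out in base R. The built-in mtcars data set and the model mpg ~ wt + hp are placeholders chosen for illustration, not part of the original discussion:

```r
set.seed(123)                                  # reproducible split
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))  # 70/30 split (arbitrary choice)

train <- mtcars[train_idx, ]
valid <- mtcars[-train_idx, ]

fit  <- lm(mpg ~ wt + hp, data = train)        # fit on the training set only
pred <- predict(fit, newdata = valid)

mse <- mean((valid$mpg - pred)^2)              # validation set MSE
mse
```

Re-running this with a different seed changes the split, and typically the MSE as well, which is exactly the variance problem described above.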
2. Leave-one-out cross-validation
Leave-one-out cross-validation is very similar to the validation set approach, except that it considers \(n\) different splits: each split uses 1 observation as the validation set and the remaining \(n-1\) observations as the training set. The model is fit on the \(n-1\) observations and evaluated on the single held-out observation, and this is repeated for all \(n\) observations.
Compute the cross-validation error by averaging the \(n\) squared errors: \[ CV_{(n)} =\frac{1}{n}\sum_{i=1}^n(y_i-x_i^T \hat\beta_{(i)})^2 \] where \(\hat\beta_{(i)}\) is the coefficient vector estimated with the \(i^{th}\) observation left out.
When comparing candidate models, select the one with the lowest \(CV_{(n)}\) value.
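A sketch of LOOCV written as an explicit loop, mirroring the \(CV_{(n)}\) formula above (same placeholder model and data as before):

```r
n <- nrow(mtcars)
sq_err <- numeric(n)

for (i in seq_len(n)) {
  fit_i  <- lm(mpg ~ wt + hp, data = mtcars[-i, ])             # fit without observation i
  pred_i <- predict(fit_i, newdata = mtcars[i, , drop = FALSE])
  sq_err[i] <- (mtcars$mpg[i] - pred_i)^2                      # held-out squared error
}

cv_n <- mean(sq_err)   # CV_(n): average of the n squared errors
cv_n
```

Because each of the \(n\) fits leaves out only one observation, LOOCV has no randomness in the splits, but it requires fitting the model \(n\) times.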
3. K-fold cross-validation
The data set is randomly divided into \(k\) folds of roughly equal size. Each fold is held out once as the validation set while the model is fit on the remaining \(k-1\) folds, giving \(k\) error estimates that are averaged: \[ CV_{(k)} =\frac{1}{k}\sum_{i=1}^k\text{MSE}_i \] where \(\text{MSE}_i\) is the mean squared error computed on the \(i^{th}\) held-out fold.
Ex: with 5-fold cross-validation, the model is fit five times, each time leaving out a different fifth of the data.
Most of the time, \(k\) is chosen between 5 and 10, but there is no formal rule for selecting it, and the size of the data set also matters. For small data sets, \(k\)-fold cross-validation tends to have larger variance than the other methods, but this is negligible for large data sets. Large values of \(k\) are computationally expensive but have low bias, whereas small values of \(k\) are computationally efficient but have higher bias. Hence, a middle value of \(k\) is an appropriate compromise between bias and computational cost.
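A sketch of 5-fold cross-validation with the same placeholder model; the fold assignment below is one simple way to do it, and the boot package's cv.glm() gives an equivalent estimate:

```r
set.seed(123)
k <- 5
n <- nrow(mtcars)
folds <- sample(rep(1:k, length.out = n))      # random fold assignment

fold_mse <- numeric(k)
for (j in 1:k) {
  train <- mtcars[folds != j, ]                # k-1 folds for training
  valid <- mtcars[folds == j, ]                # one fold held out
  fit <- lm(mpg ~ wt + hp, data = train)
  fold_mse[j] <- mean((valid$mpg - predict(fit, newdata = valid))^2)  # MSE_j
}

cv_k <- mean(fold_mse)   # CV_(k): average MSE across the k folds
cv_k

# Equivalent estimate via the boot package (if installed):
# library(boot)
# cv.glm(mtcars, glm(mpg ~ wt + hp, data = mtcars), K = 5)$delta[1]
```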
References
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112). New York: Springer.
By: Nirosha Rathnayake, Ph.D. Biostatistics Student, UNMC, Omaha, NE