1 Load the data and packages

# Load data
data(iris)

# Take a look at data 
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Load packages (library() is preferred over require() for loading;
# dplyr is already included in the tidyverse)
library(caret)
library(tidyverse)

2 Motivation for cross-validation

Evaluating a model on the same observations used to fit it gives an overly optimistic picture of its predictive ability. Cross-validation estimates how well the model generalizes to unseen data by repeatedly holding out part of the data for testing.

3 Model performance metrics

To determine whether a fitted model is performing well, we need to use observations that were not used during training. The test set therefore serves as the unseen data: the values of the dependent variable (y) are predicted, and model accuracy is evaluated from the differences between the actual and predicted values of the dependent variable.
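
Throughout this tutorial, accuracy is summarized with three metrics from caret: RMSE, R-squared, and MAE. As a quick sketch of what these helpers compute (note that caret's R2() uses the squared correlation between actual and predicted values by default):

# hand-rolled versions of the three metrics, for reference; equivalent to
# caret's RMSE(), R2() (with its default "corr" formula), and MAE()
rmse <- function(pred, obs) sqrt(mean((obs - pred)^2))  # root mean squared error
r2   <- function(pred, obs) cor(obs, pred)^2            # squared correlation
mae  <- function(pred, obs) mean(abs(obs - pred))       # mean absolute error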

4 Four commonly used cross-validation methods

Each of the methods below is conducted in four steps:
* Data splitting: split the data set into different subsets.
* Training: build the model on the training data set.
* Testing: apply the resultant model to the unseen data (testing data set) to predict the outcome of new observations.
* Evaluating: calculate prediction error using the model performance metrics.

4.1 Validation set approach

  • In this approach, the available data is divided into two subsets: a training set and a validation (test) set. The training set is used to fit the model, and the validation set is used to evaluate its performance. It is the most basic and straightforward cross-validation method. Its main weakness is that the estimated prediction error can depend heavily on which observations happen to land in the test set: if the test set is not representative of the entire data, the method can give a misleading estimate of model performance (a sketch illustrating this variance follows the evaluation step below). Below is a figure illustrating the validation set approach.

A figure for validation set approach

### Data splitting

# set seed to generate a reproducible random sample
set.seed(123)

# create training and testing sets by row index; the training set contains 80%
# of the data, sampled within each Species (stratified)
# 'list = FALSE' returns the sampled indices as a matrix rather than a list
train.index.vsa <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train.vsa <- iris[train.index.vsa,]
test.vsa <- iris[-train.index.vsa,]

# see how the subsets are randomized
role <- rep('train', nrow(iris))
role[-train.index.vsa] <- 'test'
ggplot(data = cbind(iris,role)) + geom_point(aes(x = Sepal.Length,
                                                 y = Petal.Width,
                                                 color = role))

### Training
# regress Petal.Width on all remaining variables (including the Species factor)
model.vsa <- lm(Petal.Width ~ ., data = train.vsa)


### Testing
# predict Petal.Width for the held-out test observations
predictions.vsa <- model.vsa %>% predict(test.vsa)


### Evaluating
data.frame(RMSE = RMSE(predictions.vsa, test.vsa$Petal.Width),
           R2 = R2(predictions.vsa, test.vsa$Petal.Width),
           MAE = MAE(predictions.vsa, test.vsa$Petal.Width))
##        RMSE        R2      MAE
## 1 0.1675093 0.9497864 0.128837
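
To see how sensitive this estimate is to the particular split, here is a small sketch (not part of the original analysis) that repeats the 80/20 split under several different seeds and recomputes the test RMSE:

# the spread of RMSE across seeds illustrates the variance
# of the validation set approach
rmse.by.seed <- sapply(1:5, function(s) {
  set.seed(s)
  idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
  fit <- lm(Petal.Width ~ ., data = iris[idx, ])
  RMSE(predict(fit, iris[-idx, ]), iris$Petal.Width[-idx])
})
rmse.by.seed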

4.2 Leave-one-out cross-validation: LOOCV

  • The way LOOCV works is a little different from the other approaches. It starts by taking a single observation as the validation set. The model is then trained on the remaining (n - 1) observations and the prediction error for the held-out observation is calculated; these steps are repeated for every data point. The overall prediction error is the average of all these test errors (a by-hand version of this loop is sketched after the results below). Below is a figure illustrating the LOOCV approach.

    A figure for LOOCV

### Data splitting: leave one out
train.loocv <- trainControl(method = "LOOCV")

### Training
model.loocv <- train(Petal.Width ~.,
                     data = iris,
                     method = "lm",
                     trControl = train.loocv)

### Present results
print(model.loocv)
## Linear Regression 
## 
## 150 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 149, 149, 149, 149, 149, 149, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.1705606  0.9496003  0.1268164
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
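
For intuition, the same procedure can be written out as an explicit loop. This sketch mirrors what train() does internally and should reproduce the RMSE reported above, since LOOCV involves no randomness:

# by-hand LOOCV: leave out row i, fit on the remaining n - 1 rows,
# and record the prediction error for the held-out row
errors.loocv <- sapply(seq_len(nrow(iris)), function(i) {
  fit <- lm(Petal.Width ~ ., data = iris[-i, ])
  predict(fit, iris[i, ]) - iris$Petal.Width[i]
})
sqrt(mean(errors.loocv^2))   # RMSE over all 150 held-out predictions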

4.3 K-fold cross-validation

  • In KFCV, the dataset is divided into k subsets, or “folds”, of approximately equal size. The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, with each fold used exactly once as the test set. The performance of the model is evaluated on each fold, and the overall performance is calculated as the average over all k folds (a by-hand version of this loop is sketched after the results below). Below is a figure illustrating the K-fold approach.

  • One important consideration in KFCV is the choice of k. A larger value of k reduces the variance of the estimated performance, but increases the computational cost of the method. A smaller value of k reduces the computational cost, but may result in a higher variance in the estimated performance. The commonly used values for k are 5 and 10, but the optimal value of k depends on the specific problem and dataset.

    A figure for K-fold CV

### Data splitting

# set seed to generate a reproducible random sample
set.seed(123)
# the number of folds K is set to 5
train.kfold <- trainControl(method = "cv", number = 5)

### Training
model.kfold <- train(Petal.Width ~.,
                     data = iris,
                     method = "lm",
                     trControl = train.kfold)

### Present results
print(model.kfold)
## Linear Regression 
## 
## 150 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 122, 120, 118, 121, 119 
## Resampling results:
## 
##   RMSE       Rsquared   MAE    
##   0.1704321  0.9514251  0.12891
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
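
As with LOOCV, the procedure can be made explicit. The sketch below builds the folds with caret's createFolds() and averages the per-fold RMSE; the exact numbers will differ from the train() output above because the random folds differ, but the procedure is the same:

# by-hand 5-fold CV: train on 4 folds, evaluate on the held-out fold,
# then average the per-fold RMSE
set.seed(123)
folds <- createFolds(iris$Petal.Width, k = 5)   # held-out row indices per fold
fold.rmse <- sapply(folds, function(idx) {
  fit <- lm(Petal.Width ~ ., data = iris[-idx, ])
  RMSE(predict(fit, iris[idx, ]), iris$Petal.Width[idx])
})
mean(fold.rmse)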

4.4 Repeated K-fold cross-validation

  • This method simply repeats the K-fold CV a number of times (here, 5-fold CV repeated 3 times) and averages the results, giving a more stable estimate of model performance; the individual resample estimates can also be inspected, as shown after the results below.

### Data splitting

# set seed to generate a reproducible random sample
set.seed(123)
# 5-fold CV, repeated 3 times
train.rkfold <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

### Training
model.rkfold <- train(Petal.Width ~.,
                      data = iris,
                      method = "lm",
                      trControl = train.rkfold)

### Present results
print(model.rkfold)
## Linear Regression 
## 
## 150 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 122, 120, 118, 121, 119, 119, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE      
##   0.168445  0.9525634  0.1266377
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
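
Each fold in each repeat produces its own resample estimate (5 × 3 = 15 here); the table above reports their averages. They can be inspected individually via the train object's resample element:

# per-resample performance across the 15 fold/repeat combinations
head(model.rkfold$resample)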

5 Cross-validation and panel data sets

Randomly shuffled folds are not appropriate for panel or time-series data: the model would be trained on future observations and tested on past ones. Nested cross-validation with a sliding (fixed-size) or expanding training window preserves the temporal ordering, as the two figures below illustrate; a minimal caret sketch follows them.

A figure for nested CV with sliding window

A figure for nested CV with expanding window
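
caret supports these rolling-origin schemes through trainControl(method = "timeslice"). Since iris has no time dimension, the sketch below uses a small simulated series purely for illustration; initialWindow, horizon, and fixedWindow are the key parameters:

# sketch: rolling-origin evaluation on a simulated time series
set.seed(123)
ts.data <- data.frame(time = 1:100, y = cumsum(rnorm(100)))
train.slice <- trainControl(method = "timeslice",
                            initialWindow = 60,  # rows in the first training window
                            horizon = 10,        # rows in each test window
                            fixedWindow = TRUE)  # TRUE = sliding, FALSE = expanding
model.slice <- train(y ~ time, data = ts.data,
                     method = "lm",
                     trControl = train.slice)
print(model.slice)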
