# Load data
data(iris)
# Take a look at the data
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Load packages
library(caret)      # createDataPartition(), trainControl(), train(), RMSE(), R2(), MAE()
library(tidyverse)  # includes dplyr (for the %>% pipe) and ggplot2 (for plotting)
Cross-validation is a family of methods for measuring how well a model makes predictions. In particular, it estimates the performance of a given predictive model on an unseen data set. It can also be regarded as a resampling method, since the model is fitted multiple times on different subsets of the data in order to evaluate its performance.

Cross-validation is also a common way to detect overfitting. Because it approximates how well the model will perform on new data, we may conclude that the model is overfitting if it does much better on the training set than on the test set. For example, if the R-squared on the training set is 0.95 but the cross-validated R-squared is only 0.5, that is a red flag for overfitting, because a large part of the apparent performance does not come from a true relationship in the data.
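To make that red flag concrete, here is a minimal, self-contained sketch (separate from the iris walkthrough below) that fits a deliberately over-flexible polynomial to a small simulated data set; the variable names, polynomial degree, and split are arbitrary illustrative choices, and the exact numbers depend on the random seed.
# hypothetical overfitting demo: a degree-15 polynomial fitted to 20 training points
set.seed(1)
toy <- data.frame(x = runif(30))
toy$y <- sin(2 * pi * toy$x) + rnorm(30, sd = 0.3)
idx <- sample(nrow(toy), 20)                            # 20 rows for training
fit <- lm(y ~ poly(x, 15), data = toy[idx, ])           # very flexible fit
r2.train <- summary(fit)$r.squared                      # typically close to 1
r2.test  <- R2(predict(fit, toy[-idx, ]), toy$y[-idx])  # typically far lower
c(train = r2.train, test = r2.test)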
To determine whether a model performs well, we need observations that were not used while training the model. The test set therefore serves as the unseen data: the values of the dependent variable (y) are predicted for these observations, and model accuracy is evaluated from the difference between the actual and predicted values of the dependent variable.
Each method below is conducted in four steps:

* Data splitting: split the data set into different subsets.
* Training: build the model on the training data set.
* Testing: apply the resulting model to the unseen data (the testing data set) to predict the outcome of new observations.
* Evaluating: calculate the prediction error using model performance metrics.
## The validation set approach
*Figure: the validation set approach*
### Data splitting
# set seed to generate a reproducible random sample
set.seed(123)
# create the training and testing sets by index; the training set contains 80% of the data
# 'list = FALSE' returns a matrix whose rows are the indices of the sampled observations
train.index.vsa <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train.vsa <- iris[train.index.vsa, ]
test.vsa  <- iris[-train.index.vsa, ]
# see how the subsets are randomized
role <- rep('train', nrow(iris))
role[-train.index.vsa] <- 'test'
ggplot(data = cbind(iris, role)) + geom_point(aes(x = Sepal.Length,
                                                  y = Petal.Width,
                                                  color = role))
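Because createDataPartition() samples within each level of the supplied factor, the 80/20 split preserves the class balance; a quick check on the objects created above:
# stratified split: each species keeps the same proportion in both subsets
table(train.vsa$Species)  # 40 of each species (80% of 50) in the training set
table(test.vsa$Species)   # the remaining 10 of each species in the test set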
### Training
# regress Petal.Width on all other variables in the training set
model.vsa <- lm(Petal.Width ~ ., data = train.vsa)
### Testing
# predict Petal.Width for the held-out test observations
predictions.vsa <- model.vsa %>% predict(test.vsa)
### Evaluating
# compare predicted and actual Petal.Width on the test set
data.frame(RMSE = RMSE(predictions.vsa, test.vsa$Petal.Width),
           R2   = R2(predictions.vsa, test.vsa$Petal.Width),
           MAE  = MAE(predictions.vsa, test.vsa$Petal.Width))
##        RMSE        R2      MAE
## 1 0.1675093 0.9497864 0.128837
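For reference, the three metrics reported by caret's helper functions can be computed directly from the predictions; a short sketch using the objects above (caret's R2() defaults to the squared correlation between predictions and observations):
err <- test.vsa$Petal.Width - predictions.vsa           # prediction errors
c(RMSE = sqrt(mean(err^2)),                             # root mean squared error
  MAE  = mean(abs(err)),                                # mean absolute error
  R2   = cor(predictions.vsa, test.vsa$Petal.Width)^2)  # squared correlation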
## Leave-one-out cross-validation (LOOCV)
LOOCV works a little differently from the other approaches. It starts by taking a single observation as the validation set. The model is then trained on the remaining (n-1) observations and the prediction error is calculated, and these steps are repeated for every data point. The overall prediction error is the average of all these test errors. Below is a graphical illustration of the LOOCV approach.
*Figure: the LOOCV approach*
### Data splitting: leave one out
# each resample leaves exactly one observation out as the validation set
train.loocv <- trainControl(method = "LOOCV")
### Training
model.loocv <- train(Petal.Width ~ .,
                     data = iris,
                     method = "lm",
                     trControl = train.loocv)
### Present results
print(model.loocv)
## Linear Regression
##
## 150 samples
## 4 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 149, 149, 149, 149, 149, 149, ...
## Resampling results:
##
##   RMSE       Rsquared   MAE
##   0.1705606  0.9496003  0.1268164
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
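What train() is doing here can also be written as an explicit loop. The sketch below (which uses only the iris data loaded above) should reproduce essentially the same RMSE and MAE as caret's summary:
# hand-rolled LOOCV: hold out one row at a time, refit, and collect the error
errors.loocv <- sapply(seq_len(nrow(iris)), function(i) {
  fit  <- lm(Petal.Width ~ ., data = iris[-i, ])           # train on the other n - 1 rows
  pred <- predict(fit, newdata = iris[i, , drop = FALSE])  # predict the held-out row
  iris$Petal.Width[i] - pred                               # held-out prediction error
})
c(RMSE = sqrt(mean(errors.loocv^2)), MAE = mean(abs(errors.loocv)))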
## K-fold cross-validation (KFCV)
In KFCV, the data set is divided into k subsets, or "folds", of approximately equal size. The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, with each fold used exactly once as the test set. The model's performance is evaluated on each fold, and the overall performance is the average over all k folds. Below is a graphical illustration of the K-fold approach.

One important consideration in KFCV is the choice of k. A larger value of k means each model is trained on more of the data, which tends to reduce the bias of the estimated performance, but it increases the computational cost. A smaller value of k is cheaper to compute, but each training set is smaller, so the estimate can be more pessimistic (biased). The commonly used values are k = 5 and k = 10, but the optimal value of k depends on the specific problem and data set.
*Figure: K-fold cross-validation*
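Before handing everything to train(), it can help to look at the folds themselves. The sketch below uses caret's createFolds() (the same family of helpers caret relies on to build its resampling indices) on the iris data loaded above:
# preview a 5-fold split: a list of 5 held-out index vectors
set.seed(123)
folds <- createFolds(iris$Petal.Width, k = 5)
sapply(folds, length)  # roughly 30 observations per fold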
### Data splitting
# set seed to generate a reproducible random sample
set.seed(123)
# the number of folds k is set to 5
train.kfold <- trainControl(method = "cv", number = 5)
### Training
model.kfold <- train(Petal.Width ~ .,
                     data = iris,
                     method = "lm",
                     trControl = train.kfold)
### Present results
print(model.kfold)
## Linear Regression
##
## 150 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 122, 120, 118, 121, 119
## Resampling results:
##
##   RMSE       Rsquared   MAE
##   0.1704321  0.9514251  0.12891
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## Repeated K-fold cross-validation
Repeated K-fold cross-validation runs the K-fold procedure several times, each time with a different random assignment of observations to folds, and averages the results; this reduces the noise that comes from any single random partition.
### Data splitting
# set seed to generate a reproducible random sample
set.seed(123)
# 5 folds, and the whole cross-validation is repeated 3 times
train.rkfold <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
### Training
model.rkfold <- train(Petal.Width ~ .,
                      data = iris,
                      method = "lm",
                      trControl = train.rkfold)
### Present results
print(model.rkfold)
## Linear Regression
##
## 150 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 122, 120, 118, 121, 119, 119, ...
## Resampling results:
##
##   RMSE      Rsquared   MAE
##   0.168445  0.9525634  0.1266377
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
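The 5 folds repeated 3 times give 15 resampled estimates, and the figures printed above are their averages. The per-resample values are stored on the fitted object; a quick look using model.rkfold from above:
nrow(model.rkfold$resample)  # 15 resamples (5 folds x 3 repeats)
head(model.rkfold$resample)  # per-resample RMSE, Rsquared and MAE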
*Figure: nested cross-validation with a sliding window*
*Figure: nested cross-validation with an expanding window*