# Load data
data(iris)
# Take a look at the data
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Load packages
library(caret)      # createDataPartition(), trainControl(), train(), RMSE(), R2(), MAE()
library(tidyverse)  # includes dplyr (for the %>% pipe) and ggplot2 (for plotting)
Cross-validation is a family of methods for measuring how well a model makes predictions. In particular, it estimates the performance of a given predictive model on an unseen data set. It can also be regarded as a resampling method, since the model is fitted multiple times on different subsets of the data in order to evaluate its performance.

Cross-validation is also a common way to detect overfitting. Because it approximates how well the model will perform on new data, we may conclude that the model is overfitting if it does much better on the training set than on the test set. For example, if the R-squared on the training set is 0.95 but the cross-validated R-squared is only 0.5, that is a red flag for overfitting, because a large part of the apparent performance does not come from a true relationship in the data.
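To make that red flag concrete, here is a minimal, self-contained sketch (separate from the iris walkthrough below) that fits a deliberately over-flexible polynomial to a small simulated data set; the variable names, polynomial degree, and split are arbitrary illustrative choices, and the exact numbers depend on the random seed.
# hypothetical overfitting demo: a degree-15 polynomial fitted to 20 training points
set.seed(1)
toy <- data.frame(x = runif(30))
toy$y <- sin(2 * pi * toy$x) + rnorm(30, sd = 0.3)
idx <- sample(nrow(toy), 20)                            # 20 rows for training
fit <- lm(y ~ poly(x, 15), data = toy[idx, ])           # very flexible fit
r2.train <- summary(fit)$r.squared                      # typically close to 1
r2.test  <- R2(predict(fit, toy[-idx, ]), toy$y[-idx])  # typically far lower
c(train = r2.train, test = r2.test)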
To determine whether a model performs well, we need observations that were not used while training the model. The test set therefore serves as the unseen data: the values of the dependent variable (y) are predicted for these observations, and model accuracy is evaluated from the difference between the actual and predicted values of the dependent variable.
Each method below is conducted in four steps:

* Data splitting: split the data set into different subsets.
* Training: build the model on the training data set.
* Testing: apply the resulting model to the unseen data (the testing data set) to predict the outcome of new observations.
* Evaluating: calculate the prediction error using model performance metrics.
## The validation set approach
*Figure: the validation set approach*
### Data splitting
# set seed to generate a reproducible random sample
set.seed(123)
# create the training and testing sets by index; the training set contains 80% of the data
# 'list = FALSE' returns a matrix whose rows are the indices of the sampled observations
train.index.vsa <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train.vsa <- iris[train.index.vsa, ]
test.vsa  <- iris[-train.index.vsa, ]
# see how the subsets are randomized
role <- rep('train', nrow(iris))
role[-train.index.vsa] <- 'test'
ggplot(data = cbind(iris, role)) + geom_point(aes(x = Sepal.Length,
                                                  y = Petal.Width,
                                                  color = role))
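Because createDataPartition() samples within each level of the supplied factor, the 80/20 split preserves the class balance; a quick check on the objects created above:
# stratified split: each species keeps the same proportion in both subsets
table(train.vsa$Species)  # 40 of each species (80% of 50) in the training set
table(test.vsa$Species)   # the remaining 10 of each species in the test set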
### Training
# regress Petal.Width on all other variables in the training set
model.vsa <- lm(Petal.Width ~ ., data = train.vsa)
### Testing
# predict Petal.Width for the held-out test observations
predictions.vsa <- model.vsa %>% predict(test.vsa)
### Evaluating
# compare predicted and actual Petal.Width on the test set
data.frame(RMSE = RMSE(predictions.vsa, test.vsa$Petal.Width),
           R2   = R2(predictions.vsa, test.vsa$Petal.Width),
           MAE  = MAE(predictions.vsa, test.vsa$Petal.Width))
##        RMSE        R2      MAE
## 1 0.1675093 0.9497864 0.128837
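For reference, the three metrics reported by caret's helper functions can be computed directly from the predictions; a short sketch using the objects above (caret's R2() defaults to the squared correlation between predictions and observations):
err <- test.vsa$Petal.Width - predictions.vsa           # prediction errors
c(RMSE = sqrt(mean(err^2)),                             # root mean squared error
  MAE  = mean(abs(err)),                                # mean absolute error
  R2   = cor(predictions.vsa, test.vsa$Petal.Width)^2)  # squared correlation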
## Leave-one-out cross-validation (LOOCV)
LOOCV works a little differently from the other approaches. It starts by taking a single observation as the validation set. The model is then trained on the remaining (n-1) observations and the prediction error is calculated, and these steps are repeated for every data point. The overall prediction error is the average of all these test errors. Below is a graphical illustration of the LOOCV approach.
*Figure: the LOOCV approach*
### Data splitting: leave one out
# each resample leaves exactly one observation out as the validation set
train.loocv <- trainControl(method = "LOOCV")
### Training
model.loocv <- train(Petal.Width ~ .,
                     data = iris,
                     method = "lm",
                     trControl = train.loocv)
### Present results
print(model.loocv)
## Linear Regression
##
## 150 samples
## 4 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 149, 149, 149, 149, 149, 149, ...
## Resampling results:
##
##   RMSE       Rsquared   MAE
##   0.1705606  0.9496003  0.1268164
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
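What train() is doing here can also be written as an explicit loop. The sketch below (which uses only the iris data loaded above) should reproduce essentially the same RMSE and MAE as caret's summary:
# hand-rolled LOOCV: hold out one row at a time, refit, and collect the error
errors.loocv <- sapply(seq_len(nrow(iris)), function(i) {
  fit  <- lm(Petal.Width ~ ., data = iris[-i, ])           # train on the other n - 1 rows
  pred <- predict(fit, newdata = iris[i, , drop = FALSE])  # predict the held-out row
  iris$Petal.Width[i] - pred                               # held-out prediction error
})
c(RMSE = sqrt(mean(errors.loocv^2)), MAE = mean(abs(errors.loocv)))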
## K-fold cross-validation (KFCV)
In KFCV, the data set is divided into k subsets, or "folds", of approximately equal size. The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, with each fold used exactly once as the test set. The model's performance is evaluated on each fold, and the overall performance is the average over all k folds. Below is a graphical illustration of the K-fold approach.

One important consideration in KFCV is the choice of k. A larger value of k means each model is trained on more of the data, which tends to reduce the bias of the estimated performance, but it increases the computational cost. A smaller value of k is cheaper to compute, but each training set is smaller, so the estimate can be more pessimistic (biased). The commonly used values are k = 5 and k = 10, but the optimal value of k depends on the specific problem and data set.
*Figure: K-fold cross-validation*
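Before handing everything to train(), it can help to look at the folds themselves. The sketch below uses caret's createFolds() (the same family of helpers caret relies on to build its resampling indices) on the iris data loaded above:
# preview a 5-fold split: a list of 5 held-out index vectors
set.seed(123)
folds <- createFolds(iris$Petal.Width, k = 5)
sapply(folds, length)  # roughly 30 observations per fold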
### Data splitting
# set seed to generate a reproducible random sample
set.seed(123)
# the number of folds k is set to 5
train.kfold <- trainControl(method = "cv", number = 5)
### Training
model.kfold <- train(Petal.Width ~ .,
                     data = iris,
                     method = "lm",
                     trControl = train.kfold)
### Present results
print(model.kfold)
## Linear Regression
##
## 150 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 122, 120, 118, 121, 119
## Resampling results:
##
##   RMSE       Rsquared   MAE
##   0.1704321  0.9514251  0.12891
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## Repeated K-fold cross-validation
Repeated K-fold cross-validation runs the K-fold procedure several times, each time with a different random assignment of observations to folds, and averages the results; this reduces the noise that comes from any single random partition.
### Data splitting
# set seed to generate a reproducible random sample
set.seed(123)
# 5 folds, and the whole cross-validation is repeated 3 times
train.rkfold <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
### Training
model.rkfold <- train(Petal.Width ~ .,
                      data = iris,
                      method = "lm",
                      trControl = train.rkfold)
### Present results
print(model.rkfold)
## Linear Regression
##
## 150 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 122, 120, 118, 121, 119, 119, ...
## Resampling results:
##
##   RMSE      Rsquared   MAE
##   0.168445  0.9525634  0.1266377
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
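The 5 folds repeated 3 times give 15 resampled estimates, and the figures printed above are their averages. The per-resample values are stored on the fitted object; a quick look using model.rkfold from above:
nrow(model.rkfold$resample)  # 15 resamples (5 folds x 3 repeats)
head(model.rkfold$resample)  # per-resample RMSE, Rsquared and MAE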
*Figure: nested cross-validation with a sliding window*
*Figure: nested cross-validation with an expanding window*