Cross Validation

Many statistical learning methods do not have a closed-form expression for the true error of their predictions. However, resampling methods such as cross-validation and the bootstrap make it possible to obtain approximate error estimates even for complex and computationally intensive models. In this section, we discuss model assessment using different cross-validation (CV) techniques in statistical learning.

1. The Validation Set Approach

Suppose that we would like to estimate the test error associated with fitting a particular statistical learning method on a set of observations. The validation set approach is a technique that divides the available set of observations into two parts: a training set and a validation (or test) set. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the test set. The following R code illustrates how to perform the validation set approach.

# VALIDATION SET APPROACH
# 70% Train and 30% Test

setwd("C:\\Users\\Asus\\Documents\\UP Files\\UPV Subjects\\Stat 197 (Intro to BI)")
Advertising <- read.csv(".\\Advertising.csv")

names(Advertising)
## [1] "X"         "TV"        "Radio"     "Newspaper" "Sales"
# Randomly select 70% of the data as Training
set.seed(123)
train.index <- sample(c(1:200), 140, replace=FALSE) 

train <- Advertising[train.index,]
test <- Advertising[-train.index,]

# Generate out-of-sample errors
lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data=train)
predicted <- predict(lm.fit, newdata=test)
sq.er <- (predicted - test$Sales)^2

# Test MSE
MSE <- mean(sq.er)
MSE
## [1] 2.321357
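
Note that the test MSE obtained this way depends on which observations happen to fall in the training and test sets. The following sketch repeats the split under a few arbitrary seeds (chosen only for illustration) to show this variability.

# Test MSE across several random 70/30 splits
MSE.split <- vector()
for(s in 1:5){
  set.seed(s)
  train.index <- sample(c(1:200), 140, replace=FALSE)
  train <- Advertising[train.index,]
  test <- Advertising[-train.index,]
  
  lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data=train)
  predicted <- predict(lm.fit, newdata=test)
  MSE.split <- c(MSE.split, mean((predicted - test$Sales)^2))
}
MSE.split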

2. Leave-One-Out Cross-Validation

Leave-one-out cross-validation (LOOCV) also splits the data into two parts. However, instead of creating two subsets of comparable size, a single observation \((x_i, y_i)\) is used as the validation set, and the remaining observations make up the training set. The learning method is fit on the \(n-1\) training observations, and a prediction \(\hat y_i\) is made for the excluded (left-out) observation. Repeating this for every observation yields \(n\) squared errors, which are averaged to obtain the LOOCV estimate of the test error. This iterative process is shown in the following R code.
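
In symbols, with \(MSE_i = (y_i - \hat y_i)^2\) denoting the squared error for the \(i\)-th left-out observation, the LOOCV estimate of the test MSE is

\[ CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} MSE_i. \]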

# LEAVE-ONE-OUT CROSS VALIDATION

sq.er <- vector()

for(i in 1:nrow(Advertising)){
  test <- Advertising[i,]
  train <- Advertising[-i,]
  
  lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data=train)
  predicted <- predict(lm.fit, newdata=test)
  
  sq.er.i <- (predicted - test$Sales)^2
  sq.er <- c(sq.er, sq.er.i)
}

MSE <- mean(sq.er)
MSE
## [1] 2.9469
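
For a model fit with glm(), the same LOOCV estimate can be obtained more conveniently with cv.glm() from the boot package. The following is only a minimal sketch, assuming the boot package is installed; glm() with its default gaussian family fits the same linear model as lm().

# LEAVE-ONE-OUT CROSS VALIDATION VIA boot::cv.glm
library(boot)

glm.fit <- glm(Sales ~ TV + Radio + Newspaper, data=Advertising)

# K defaults to n (leave-one-out); delta[1] is the raw CV estimate of the MSE
cv.err <- cv.glm(Advertising, glm.fit)
cv.err$delta[1]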

3. k-Fold Cross-Validation

An alternative to LOOCV is k-fold CV. This approach involves randomly dividing the set of observations into \(k\) groups, or folds, of approximately equal size. For each of the \(k\) iterations, one fold is used as the validation set and the learning method is fit on the remaining \(k-1\) folds. The final CV estimate of the test \(MSE\) is the average of the \(k\) fold-specific \(MSE\)s.
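
Written out, with \(MSE_i\) denoting the test \(MSE\) computed on the \(i\)-th held-out fold,

\[ CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} MSE_i. \]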

# 10-FOLD CROSS-VALIDATION

k <- 10
MSE.k <- vector()

# Note: Fold assignment is usually random.
# Here the rows are first shuffled (see set.seed/sample below), and then
# consecutive blocks of 20 shuffled rows are used as the 10 folds.
lower <- c(1, 21, 41, 61, 81, 101, 121, 141, 161, 181)
upper <- c(20, 40, 60, 80, 100, 120, 140, 160, 180, 200)

set.seed(178)
index <- sample(c(1:200), 200, replace=FALSE)
dat <- Advertising[index,]

for(i in 1:k) {
  fold <- c(lower[i]: upper[i])
  train <- dat[-fold,]
  test <- dat[fold,]
  
  lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data=train)
  predicted <- predict(lm.fit, newdata=test)
  
  sq.er.fold <- (predicted - test$Sales)^2
  MSE.fold <- mean(sq.er.fold)
  MSE.k <- c(MSE.k, MSE.fold)
}

MSE.k
##  [1] 1.413442 2.612986 2.734047 1.677236 2.406883 2.949237 3.906837 3.538674
##  [9] 6.808757 1.319590
mean(MSE.k)
## [1] 2.936769
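
As with LOOCV, cv.glm() from the boot package can also perform k-fold CV through its K argument. A minimal sketch, again assuming boot is installed; because the folds are assigned at random, the estimate will differ slightly from the manual computation above.

# 10-FOLD CROSS-VALIDATION VIA boot::cv.glm
library(boot)

glm.fit <- glm(Sales ~ TV + Radio + Newspaper, data=Advertising)

set.seed(178)
cv.err.10 <- cv.glm(Advertising, glm.fit, K=10)
cv.err.10$delta[1]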

Validation Set in Classification Models

The three cross-validation techniques above work well with quantitative outcome variables. However, with a categorical outcome variable, LOOCV and k-fold CV may not be applicable, especially when the data set is not very large. In applying the validation set approach, the relative proportions of 1's and 0's should be preserved in both the training and test sets, i.e., instead of simple random sampling without replacement (SRSWOR), stratified random sampling with proportional allocation should be implemented.

The following R code illustrates how to perform cross-validation in a classification problem.

# Preserve the proportion of 1s and 0s for train and test
library(caret)
library(ISLR2)
data(Default)

set.seed(123)
train.index <- createDataPartition(Default$default, p=0.7, list=FALSE)

train <- Default[train.index,]
test <- Default[-train.index,]
  

table(train$default)
## 
##   No  Yes 
## 6767  234
table(test$default)
## 
##   No  Yes 
## 2900   99
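# Proportion of defaulters in the training and test sets (should be similar)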
234/(234+6767)
## [1] 0.0334238
99/(99+2900)
## [1] 0.033011
# Logistic Regression 
logit.fit <- glm(default ~ income + balance + student, 
                 data = train, family = binomial)

# Out-of-sample Classification Performance
prob.predict <- predict(logit.fit, newdata=test, type="response")
predicted <- ifelse(prob.predict > 0.5, "Yes", "No")
actual <- test$default

table(predicted, actual)
##          actual
## predicted   No  Yes
##       No  2884   63
##       Yes   16   36
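
From this confusion matrix we can summarize the out-of-sample classification performance. The short sketch below computes the overall accuracy and misclassification rate from the predicted and actual objects created above.

# Overall accuracy and misclassification (error) rate on the test set
mean(predicted == actual)
mean(predicted != actual)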