Cross-Validation

Many statistical learning approaches do not have a closed-form expression for the true error of their predictions. However, resampling methods such as cross-validation and the bootstrap make it possible to obtain approximate error estimates even for complex and computationally intensive models. In this section, we discuss model assessment using different cross-validation (CV) techniques in statistical learning.

1. The Validation Set Approach

Suppose that we would like to estimate the test error associated with fitting a particular statistical learning method on a set of observations. The validation set approach is a technique that divides the available observations into two parts: a training set and a validation (test) set. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the test set. The following R code illustrates how to perform the validation set approach in R.

library(ISLR2)
library(MASS)

setwd("C:\\Users\\Asus\\Documents\\UP Files\\UPV Subjects\\Stat 197 (Intro to BI)")
Advertising <- read.csv(".\\Advertising.csv")
# VALIDATION SET APPROACH
# 70% Train and 30% Test

set.seed(175)
train.index <- sample(c(1:200), 140, replace=FALSE)

train <- Advertising[train.index,]
test <- Advertising[-train.index,]

lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data=train)
lm.pred <- predict(lm.fit, newdata=test)

# Calculate Test MSE
SE.test <- (lm.pred - test$Sales)^2
MSE.test <- mean(SE.test)
MSE.test
## [1] 2.88188
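
Note that this estimate depends on which observations happen to fall in the training and test sets. A quick way to see this is to repeat the 70/30 split under a few different seeds; the sketch below is only illustrative, and the seed values and the object name val.mse are arbitrary choices rather than part of the analysis above.

# Repeat the 70/30 split under different seeds to see how the estimate varies
val.mse <- sapply(c(1, 2, 3, 4, 5), function(s) {
  set.seed(s)
  idx <- sample(c(1:200), 140, replace=FALSE)
  fit <- lm(Sales ~ TV + Radio + Newspaper, data=Advertising[idx,])
  mean((predict(fit, newdata=Advertising[-idx,]) - Advertising$Sales[-idx])^2)
})
val.mse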

2. Leave-One-Out Cross-Validation

Leave-one-out cross-validation (LOOCV) also splits the data into two parts. However, instead of creating two subsets of comparable size, a single observation \((x_i, y_i)\) is used for validation, and the remaining observations make up the training set. The learning method is fit on the \(n-1\) training observations, and a prediction \(\hat{y}_i\) is made for the excluded (left-out) observation. This process is repeated for each of the \(n\) observations, as implemented in the R code below.
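The LOOCV estimate of the test MSE is the average of the \(n\) resulting squared errors,

\[ CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} MSE_i, \qquad MSE_i = (y_i - \hat{y}_i)^2. \]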

# LEAVE-ONE-OUT CROSS VALIDATION

SE <- vector()

for(i in 1:200) {
  # Leave out observation i for validation; fit on the remaining n-1 rows
  train <- Advertising[-i,]
  test <- Advertising[i,]
  
  lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data=train)
  lm.pred <- predict(lm.fit, newdata=test)
  
  # Squared error for the left-out observation
  SE.i <- (lm.pred - test$Sales)^2
  SE <- rbind(SE, SE.i)
}

MSE.LOOCV <- mean(SE)
MSE.LOOCV
## [1] 2.9469
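
As a cross-check, the same LOOCV computation is available through cv.glm() from the boot package, after fitting the model with glm() under the default Gaussian family (which is equivalent to lm()). This is only a sketch of that alternative route; the object names glm.fit and cv.err are illustrative, and the first component of delta should agree with MSE.LOOCV above.

library(boot)

# Fit the same linear model with glm() so that cv.glm() can be used
glm.fit <- glm(Sales ~ TV + Radio + Newspaper, data=Advertising)

# K defaults to n, i.e., LOOCV; delta[1] is the raw CV estimate of the test MSE
cv.err <- cv.glm(Advertising, glm.fit)
cv.err$delta[1]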

3. k-Fold Cross-Validation

An alternative to LOOCV is k-fold CV. This approach involves randomly dividing the set of observations into \(k\) groups, or folds, of approximately equal size. For each of the \(k\) iterations, one fold is used as the validation set and the learning method is fit on the remaining \(k-1\) folds. The final CV estimate of the test MSE is the average of the \(k\) fold-specific MSEs.
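In symbols,

\[ CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} MSE_i, \]

where \(MSE_i\) is the mean squared error computed on the observations in the \(i\)-th held-out fold.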

# 10-FOLD CROSS-VALIDATION

k <- 10
MSE.k <- vector()

# Note: Usually the assignment of observations to folds is random
# Note: For illustration, we use the original order of the data to form the folds
#       (a randomized assignment is sketched after this chunk)
lower <- c(1, 21, 41, 61, 81, 101, 121, 141, 161, 181)
upper <- c(20, 40, 60, 80, 100, 120, 140, 160, 180, 200)

for(i in 1:k) {
  # Hold out fold i; fit on the remaining k-1 folds
  fold <- c(lower[i]:upper[i])
  train <- Advertising[-fold,]
  test <- Advertising[fold,]
  
  lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data=train)
  lm.pred <- predict(lm.fit, newdata=test)
  
  SE.fold <- (lm.pred - test$Sales)^2
  MSE.fold <- mean(SE.fold)
  MSE.k <- rbind(MSE.k, MSE.fold)
}

MSE.kfold <- mean(MSE.k)
MSE.kfold
## [1] 3.059968
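
In practice the folds are assigned at random rather than by row order. One simple way to do this is to permute fold labels with sample(); the sketch below is illustrative, and the object names folds and MSE.k.rand (as well as the reuse of seed 175) are arbitrary choices.

# Randomly assign each of the 200 observations to one of the k folds
set.seed(175)
folds <- sample(rep(1:k, length.out=nrow(Advertising)))

MSE.k.rand <- vector()
for(i in 1:k) {
  # Hold out the observations whose fold label equals i
  train <- Advertising[folds != i,]
  test <- Advertising[folds == i,]
  
  lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data=train)
  lm.pred <- predict(lm.fit, newdata=test)
  
  MSE.k.rand[i] <- mean((lm.pred - test$Sales)^2)
}

mean(MSE.k.rand)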

Validation Set in Classification Models

The three cross-validation techniques above work well with quantitative outcome variables. For categorical outcome variables, however, LOOCV and k-fold CV may not be appropriate, especially when the data set is not very large. When applying the validation set approach, the relative proportions of 1's and 0's should be preserved in both the training and test sets; that is, instead of simple random sampling without replacement (SRSWOR), stratified random sampling with proportional allocation should be used.

The following R code illustrates how to perform cross-validation in a classification problem.

library(caret)
# Use the Stock Market Data
names(Smarket)
## [1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
## [7] "Volume"    "Today"     "Direction"
head(Smarket)
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today Direction
## 1 2001  0.381 -0.192 -2.624 -1.055  5.010 1.1913  0.959        Up
## 2 2001  0.959  0.381 -0.192 -2.624 -1.055 1.2965  1.032        Up
## 3 2001  1.032  0.959  0.381 -0.192 -2.624 1.4112 -0.623      Down
## 4 2001 -0.623  1.032  0.959  0.381 -0.192 1.2760  0.614        Up
## 5 2001  0.614 -0.623  1.032  0.959  0.381 1.2057  0.213        Up
## 6 2001  0.213  0.614 -0.623  1.032  0.959 1.3491  1.392        Up
table(Smarket$Direction)
## 
## Down   Up 
##  602  648
dim(Smarket)
## [1] 1250    9
# Convert "Up" to 1 and "Down" to 0
Smarket$Direct <- ifelse(Smarket$Direction == "Up", 1, 0)

# Create Partitions with Proportional Allocation
# 70% Train and 30% Test
set.seed(123)
train.index <- createDataPartition(Smarket$Direct, p=0.7, list=FALSE) 

train <- Smarket[train.index,]
test <- Smarket[-train.index,]

# Fit Logistic Reg Model
logi.fit <- glm(Direct ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, 
                data=train, family = binomial)

# Out-of-sample Correct Classification
logi.pred <- predict(logi.fit, newdata=test, type="response")
logi.class <- ifelse(logi.pred > 0.5, "Up", "Down")

table(logi.class, test$Direction)
##           
## logi.class Down  Up
##       Down   40  38
##       Up    142 155
mean(logi.class == test$Direction)
## [1] 0.52
# In-sample correct classification
logi.pred.train <- predict(logi.fit, type="response")
logi.class.train <- ifelse(logi.pred.train > 0.5, "Up", "Down")

table(logi.class.train, train$Direction)
##                 
## logi.class.train Down  Up
##             Down   86  80
##             Up    334 375
mean(logi.class.train == train$Direction)
## [1] 0.5268571
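
For a data set of this size, k-fold CV can also be applied to the classification problem while keeping the Up/Down proportions roughly constant across folds. The sketch below uses createFolds() from caret, which stratifies on a factor outcome; the choice of 10 folds, the seed, and the object names folds and acc are illustrative rather than part of the analysis above.

# STRATIFIED 10-FOLD CROSS-VALIDATION FOR THE LOGISTIC MODEL
set.seed(123)
folds <- createFolds(Smarket$Direction, k=10)   # list of held-out index sets

acc <- vector()
for(i in 1:10) {
  test.index <- folds[[i]]
  
  fit <- glm(Direct ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
             data=Smarket[-test.index,], family=binomial)
  prob <- predict(fit, newdata=Smarket[test.index,], type="response")
  pred.class <- ifelse(prob > 0.5, "Up", "Down")
  
  # Out-of-sample correct classification rate for fold i
  acc[i] <- mean(pred.class == Smarket$Direction[test.index])
}

mean(acc)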