Many statistical learning methods have no closed-form expression for the true error of their predictions. However, resampling methods such as cross-validation and the bootstrap make it possible to obtain approximate error estimates even for complex and computationally intensive models. In this section, we discuss model assessment using different cross-validation (CV) techniques in statistical learning.
Suppose that we would like to estimate the test error associated with fitting a particular statistical learning method on a set of observations. The validation set approach is a technique that divides the available set of observations into two parts: a training set and a validation (test) set. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the test set. The following R code illustrates the validation set approach.
library(ISLR2)   # data sets from "An Introduction to Statistical Learning", incl. Smarket
library(MASS)
# Set the working directory and load the Advertising data
setwd("C:\\Users\\Asus\\Documents\\UP Files\\UPV Subjects\\Stat 197 (Intro to BI)")
Advertising <- read.csv(".\\Advertising.csv")
# VALIDATION SET APPROACH
# 70% Train and 30% Test
set.seed(175)
# Randomly select 140 of the 200 observations for the training set
train.index <- sample(1:200, 140, replace=FALSE)
train <- Advertising[train.index,]
test <- Advertising[-train.index,]
# Fit the model on the training set and predict on the test set
lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data=train)
lm.pred <- predict(lm.fit, newdata=test)
# Calculate Test MSE
SE.test <- (lm.pred - test$Sales)^2
MSE.test <- mean(SE.test)
MSE.test
## [1] 2.88188
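Note that the validation set estimate of the test MSE depends on the particular random split. A minimal sketch that repeats the split under a few arbitrary seed values illustrates this variability:

# Repeat the 70/30 split under different (arbitrary) seeds
for(s in c(1, 50, 100)) {
  set.seed(s)
  idx <- sample(1:200, 140, replace=FALSE)
  fit <- lm(Sales ~ TV + Radio + Newspaper, data=Advertising[idx,])
  pred <- predict(fit, newdata=Advertising[-idx,])
  print(mean((pred - Advertising[-idx,]$Sales)^2))   # validation MSE for this split
}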
Leave-one-out cross-validation (LOOCV) also splits the data into two parts. However, instead of creating two subsets of comparable size, a single observation \((x_i, y_i)\) is used for validation, and the remaining observations make up the training set. The learning method is fit on the \(n-1\) observations, and a prediction \(\hat{y}_i\) is made for the excluded (left-out) observation. Repeating this for each of the \(n\) observations and averaging the resulting squared errors yields the LOOCV estimate of the test MSE. This iterative process is shown in the following R code.
# LEAVE-ONE-OUT CROSS VALIDATION
n <- nrow(Advertising)   # n = 200
SE <- numeric(n)
for(i in 1:n) {
  # Hold out observation i; fit on the remaining n-1 observations
  train <- Advertising[-i,]
  test <- Advertising[i,]
  lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data=train)
  lm.pred <- predict(lm.fit, newdata=test)
  SE[i] <- (lm.pred - test$Sales)^2
}
MSE.LOOCV <- mean(SE)
MSE.LOOCV
## [1] 2.9469
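For models fit with glm(), the same LOOCV estimate can be computed without an explicit loop using cv.glm() from the boot package. A minimal sketch, assuming the boot package is available (a glm() fit with the default Gaussian family is equivalent to lm()):

library(boot)
# cv.glm() defaults to K = n, i.e., LOOCV; delta[1] is the raw CV estimate
glm.fit <- glm(Sales ~ TV + Radio + Newspaper, data=Advertising)
cv.glm(Advertising, glm.fit)$delta[1]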
An alternative to LOOCV is \(k\)-fold CV. This approach involves randomly dividing the set of observations into \(k\) groups, or folds, of approximately equal size. In each of the \(k\) iterations, one fold is used as the validation set and the learning method is fit on the remaining \(k-1\) folds. The final estimate of the test error is the average of the MSEs computed on the \(k\) held-out folds.
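In symbols, the \(k\)-fold CV estimate of the test MSE is
\[
CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} MSE_i,
\]
where \(MSE_i\) is the mean squared error computed on the \(i\)-th held-out fold; LOOCV is the special case with \(k = n\).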
# 10-FOLD CROSS-VALIDATION
k <- 10
MSE.k <- numeric(k)
# Note: Usually the assignment of observations to folds is random
# Note: For illustration, we use the original order of the data as the folds
lower <- c(1, 21, 41, 61, 81, 101, 121, 141, 161, 181)
upper <- c(20, 40, 60, 80, 100, 120, 140, 160, 180, 200)
for(i in 1:k) {
  # Hold out fold i; fit on the remaining k-1 folds
  fold <- lower[i]:upper[i]
  train <- Advertising[-fold,]
  test <- Advertising[fold,]
  lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data=train)
  lm.pred <- predict(lm.fit, newdata=test)
  MSE.k[i] <- mean((lm.pred - test$Sales)^2)
}
MSE.kfold <- mean(MSE.k)
MSE.kfold
## [1] 3.059968
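When the folds are assigned at random rather than by the original row order, cv.glm() from the boot package again offers a shortcut. A minimal sketch (the seed below is arbitrary):

library(boot)
set.seed(175)   # arbitrary seed; cv.glm() assigns the folds at random
glm.fit <- glm(Sales ~ TV + Radio + Newspaper, data=Advertising)
cv.glm(Advertising, glm.fit, K=10)$delta[1]   # 10-fold CV estimate of the test MSE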
The three cross-validation techniques above work well with quantitative outcome variables. With categorical outcome variables, however, LOOCV and k-fold CV may not be applicable, especially when the data set is not very large, since a random subset may then contain few (or even no) observations from one of the classes. In applying the validation set approach, the relative proportions of 1's and 0's should be preserved in both the training and test sets, i.e., instead of SRSWOR (simple random sampling without replacement), stratified random sampling with proportional allocation should be implemented.
The following R code illustrates how to perform cross-validation in a classification problem.
library(caret)
# Use the Stock Market Data
names(Smarket)
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
head(Smarket)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 2001 0.381 -0.192 -2.624 -1.055 5.010 1.1913 0.959 Up
## 2 2001 0.959 0.381 -0.192 -2.624 -1.055 1.2965 1.032 Up
## 3 2001 1.032 0.959 0.381 -0.192 -2.624 1.4112 -0.623 Down
## 4 2001 -0.623 1.032 0.959 0.381 -0.192 1.2760 0.614 Up
## 5 2001 0.614 -0.623 1.032 0.959 0.381 1.2057 0.213 Up
## 6 2001 0.213 0.614 -0.623 1.032 0.959 1.3491 1.392 Up
table(Smarket$Direction)
##
## Down   Up
##  602  648
dim(Smarket)
## [1] 1250 9
# Convert "Up" to 1 and "Down" to 0
Smarket$Direct <- ifelse(Smarket$Direction == "Up", 1, 0)
# Create Partitions with Proportional Allocation
# 70% Train and 30% Test
set.seed(123)
train.index <- createDataPartition(Smarket$Direct, p=0.7, list=FALSE)
train <- Smarket[train.index,]
test <- Smarket[-train.index,]
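# Illustrative sanity check: createDataPartition samples within groups of the
# outcome, so the Up/Down proportions should be roughly preserved in both parts
prop.table(table(train$Direction))
prop.table(table(test$Direction))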
# Fit Logistic Reg Model
logi.fit <- glm(Direct ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data=train, family = binomial)
# Out-of-sample Correct Classification
logi.pred <- predict(logi.fit, newdata=test, type="response")
logi.class <- ifelse(logi.pred > 0.5, "Up", "Down")
table(logi.class, test$Direction)
##
## logi.class Down  Up
##       Down   40  38
##       Up    142 155
mean(logi.class == test$Direction)
## [1] 0.52
# In-sample correct classification
logi.pred.train <- predict(logi.fit, type="response")
logi.class.train <- ifelse(logi.pred.train > 0.5, "Up", "Down")
table(logi.class.train, train$Direction)
##
## logi.class.train Down  Up
##             Down   86  80
##             Up    334 375
mean(logi.class.train == train$Direction)
## [1] 0.5268571
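A k-fold CV estimate of the correct classification rate can also be obtained directly through caret's train() function with a cross-validation control object. A minimal sketch, assuming caret's built-in "cv" resampling method (the seed is arbitrary):

# 10-FOLD CV FOR THE LOGISTIC CLASSIFIER (sketch using caret)
set.seed(123)
cv.ctrl <- trainControl(method="cv", number=10)
logi.cv <- train(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                 data=Smarket, method="glm", family=binomial,
                 trControl=cv.ctrl)
logi.cv$results   # cross-validated Accuracy (and Kappa) over the 10 folds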