Many statistical learning methods have no closed-form expression for their true prediction error. However, resampling methods such as cross-validation and the bootstrap make it possible to obtain approximate error estimates even for complex, computationally intensive models. In this section, we discuss model assessment using different cross-validation (CV) techniques in statistical learning.
Suppose that we would like to estimate the test error associated with
fitting a particular statistical learning method on a set of
observations. The validation set approach is a technique that
divides the available set of observations into two parts: a
training set and a validation (test) set. The model is
fit on the training set, and the fitted model is used to predict the
responses for the observations in the test set. The following R code
illustrates the validation set approach.
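The quantity computed below is the test mean squared error (MSE): the average squared difference between the observed responses and the predictions over the \(n_{test}\) held-out observations,
\[
MSE_{test} = \frac{1}{n_{test}} \sum_{i \in \text{test}} \left( y_i - \hat{y}_i \right)^2 .
\]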
# VALIDATION SET APPROACH
# 70% Train and 30% Test
setwd("C:\\Users\\Asus\\Documents\\UP Files\\UPV Subjects\\Stat 197 (Intro to BI)")
Advertising <- read.csv(".\\Advertising.csv")
names(Advertising)
## [1] "X" "TV" "Radio" "Newspaper" "Sales"
# Randomly select 70% of the data as Training
set.seed(123)
train.index <- sample(c(1:200), 140, replace=FALSE)
train <- Advertising[train.index,]
test <- Advertising[-train.index,]
# Generate out-of-sample errors
lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data=train)
predicted <- predict(lm.fit, newdata=test)
sq.er <- (predicted - test$Sales)^2
# Test MSE
MSE <- mean(sq.er)
MSE
## [1] 2.321357
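As a side note, the split above hard-codes the sample size of the Advertising data (\(n = 200\)). A minimal sketch of the same 70/30 split written in terms of nrow(), so it carries over to data sets of any size:
# Same 70/30 split without hard-coding n = 200
set.seed(123)
n <- nrow(Advertising)
train.index <- sample(1:n, size=floor(0.7*n), replace=FALSE)
train <- Advertising[train.index,]
test <- Advertising[-train.index,]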
Leave-one-out cross-validation (LOOCV) also splits the data into
two parts. However, instead of creating two subsets of comparable size,
a single observation \((x_i, y_i)\) is
used for validation, and the remaining observations make up the
training set. The learning method is fit on the \(n-1\) observations, and a prediction \(\hat{y}_i\) is made for the excluded
(left-out) observation. This iterative process is shown in the following
R code.
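Repeating this for all \(n\) observations yields \(n\) squared prediction errors, and the LOOCV estimate of the test MSE is their average:
\[
CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 .
\]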
# LEAVE-ONE-OUT CROSS VALIDATION
sq.er <- vector()
for(i in 1:nrow(Advertising)){
  test <- Advertising[i,]
  train <- Advertising[-i,]
  lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data=train)
  predicted <- predict(lm.fit, newdata=test)
  sq.er.i <- (predicted - test$Sales)^2
  sq.er <- c(sq.er, sq.er.i)
}
MSE <- mean(sq.er)
MSE
## [1] 2.9469
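For least squares fits, the same LOOCV estimate can be obtained without an explicit loop. The sketch below shows two equivalent routes, cv.glm() from the boot package (whose default K equals \(n\)) and the exact leverage shortcut for linear regression; both reproduce the value of 2.9469 computed above.
# LOOCV WITHOUT AN EXPLICIT LOOP
library(boot)
# Route 1: cv.glm() with the default K = n performs LOOCV;
# delta[1] is the raw CV estimate of the test MSE
glm.fit <- glm(Sales ~ TV + Radio + Newspaper, data=Advertising)
cv.glm(Advertising, glm.fit)$delta[1]
# Route 2: for least squares the leverage shortcut is exact,
# CV(n) = mean(((y - yhat)/(1 - h))^2)
lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data=Advertising)
mean((residuals(lm.fit)/(1 - hatvalues(lm.fit)))^2)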
An alternative to LOOCV is k-fold CV. This approach randomly divides the set of observations into \(k\) groups, or folds, of approximately equal size. For each of the \(k\) iterations, one fold is used as the validation set and the learning method is fit on the remaining \(k-1\) folds. The final CV estimate is the average of the \(k\) fold-specific MSEs.
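Writing \(MSE_i\) for the mean squared error computed on the \(i\)-th held-out fold, the k-fold CV estimate is
\[
CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} MSE_i .
\]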
# 10-FOLD CROSS-VALIDATION
k <- 10
MSE.k <- vector()
# Note: Usually the assignment of observations to folds is random
# Note: Here we shuffle the rows once, then treat consecutive
#       blocks of 20 shuffled rows as the 10 folds
lower <- c(1, 21, 41, 61, 81, 101, 121, 141, 161, 181)
upper <- c(20, 40, 60, 80, 100, 120, 140, 160, 180, 200)
set.seed(178)
index <- sample(c(1:200), 200, replace=FALSE)
dat <- Advertising[index,]
for(i in 1:k) {
  fold <- c(lower[i]:upper[i])
  train <- dat[-fold,]
  test <- dat[fold,]
  lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data=train)
  predicted <- predict(lm.fit, newdata=test)
  sq.er.fold <- (predicted - test$Sales)^2
  MSE.fold <- mean(sq.er.fold)
  MSE.k <- c(MSE.k, MSE.fold)
}
MSE.k
## [1] 1.413442 2.612986 2.734047 1.677236 2.406883 2.949237 3.906837 3.538674
## [9] 6.808757 1.319590
mean(MSE.k)
## [1] 2.936769
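The same 10-fold estimate can also be obtained with cv.glm() by setting K = 10. Since cv.glm() assigns observations to folds at random, the value will differ slightly from the 2.936769 above.
# 10-FOLD CV VIA boot::cv.glm
library(boot)
set.seed(178)
glm.fit <- glm(Sales ~ TV + Radio + Newspaper, data=Advertising)
# delta[1] is the raw 10-fold CV estimate of the test MSE
cv.glm(Advertising, glm.fit, K=10)$delta[1]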
The three cross-validation techniques above were illustrated with a quantitative outcome variable. With a categorical outcome, however, a purely random split can be problematic, especially when the data set is not very large or one class is rare: a validation set or fold may end up with few or no observations from the minority class. In applying the validation set approach, the relative proportions of 1s and 0s should therefore be preserved in both the training and test sets, i.e., instead of simple random sampling without replacement (SRSWOR), stratified random sampling with proportional allocation should be implemented. The following R code illustrates how to perform cross-validation in a classification problem.
# Preserve the proportion of 1s and 0s for train and test
library(caret)
library(ISLR2)
data(Default)
set.seed(123)
train.index <- createDataPartition(Default$default, p=0.7, list=FALSE)
train <- Default[train.index,]
test <- Default[-train.index,]
table(train$default)
##
## No Yes
## 6767 234
table(test$default)
##
## No Yes
## 2900 99
# Proportion of "Yes" (defaults) in the training set
234/(234+6767)
## [1] 0.0334238
# Proportion of "Yes" (defaults) in the test set
99/(99+2900)
## [1] 0.033011
# Logistic Regression
logit.fit <- glm(default ~ income + balance + student,
data = train, family = binomial)
# Out-of-sample Classification Performance
prob.predict <- predict(logit.fit, newdata=test, type="response")
predicted <- ifelse(prob.predict > 0.5, "Yes", "No")
actual <- test$default
table(predicted, actual)
## actual
## predicted No Yes
## No 2884 63
## Yes 16 36
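From the confusion matrix, the out-of-sample misclassification rate is \((63 + 16)/2999 \approx 0.026\), i.e., about 97.4% accuracy. The same stratification idea extends to k-fold CV: createFolds() from caret balances the class distribution across folds. A minimal sketch of stratified 10-fold CV of the misclassification rate:
# STRATIFIED 10-FOLD CV FOR THE CLASSIFIER
set.seed(123)
# createFolds() preserves the class proportions in each fold
folds <- createFolds(Default$default, k=10)
err <- sapply(folds, function(idx) {
  fit <- glm(default ~ income + balance + student,
             data=Default[-idx,], family=binomial)
  prob <- predict(fit, newdata=Default[idx,], type="response")
  pred <- ifelse(prob > 0.5, "Yes", "No")
  mean(pred != Default$default[idx])  # fold misclassification rate
})
mean(err)  # CV estimate of the test error rate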