The primary purpose of predictive modeling is to predict the future with low error. A predictive model overfits the historical data if it predicts the past well and the future poorly. The thresholds for “low error”, “well”, and “poorly” depend on the business problem that the model seeks to solve. Regardless of the thresholds, overfitting is a common problem in predictive modeling. This post explains how to measure overfitting.
Overfitting is the difference in prediction performance between the historical data and future data. Unfortunately, modelers cannot travel into the future to collect data. A practical approximation of future data is to hold out a sample of the historical data. The hold-out data set, also called a test data set, is not used in the model development or training process.
Let’s create a historical data set with 2,000 observations and the following variables:
Latent Risk: the dependent variable (what we’re trying to predict). The variable is continuous from \(-\infty\) to \(+\infty\). In our data set, Latent Risk is a function of only two explanatory variables and an irreducible error that is randomly drawn from the logistic distribution: \[LatentRisk = 10 + 50(LoanAge) - 2(LoanAge^2) - 10(FICO) + \epsilon\]
Loan Age: explains latent risk
FICO Score: explains latent risk
100 noise variables: these variables have no relationship with the dependent variable. They are part of the data set to demonstrate what happens when the modeler includes them in the model (namely, overfitting).
set.seed(1985)
epsilon <- rlogis(n = 2000, location = 0, scale = 85) ## Irreducible Error
LoanAge <- rpois(n = 2000, lambda = 10) ## Loan Age
FICO <- rpois(n = 2000, lambda = 500) ## FICO score
LatentRisk <- 10 + 50*LoanAge - 2*LoanAge^2 - 10*FICO + epsilon ## Latent Risk
Now let’s create 100 noise variables and add them to the final historical data set.
## create (2000 * 100) = 200,000 random numbers
randomNumbers <- rnorm(n = 200000, mean = 100, sd = 500)
## store the randomNumbers into a matrix with 2000 rows and 100 columns
noiseVars <- matrix(data=randomNumbers, nrow=2000, ncol=100)
## convert noiseVars into a data frame
noiseVars.df <- as.data.frame(noiseVars)
## create final historical data set
importantVars.df <- data.frame(LatentRisk=LatentRisk, LoanAge=LoanAge, FICO=FICO)
historical.df <- cbind(importantVars.df, noiseVars.df)
Let’s pretend that historical.df was handed to the modeler, and that the modeler was told (by an authoritative source) that the data is clean and applicable to the business problem. Given this pretend scenario, the modeler’s first job is to split the data into training and test sets.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(1983)
train <- createDataPartition(historical.df$LatentRisk, p=0.5, list = FALSE)
trainingData <- historical.df[train,]
testData <- historical.df[-train,]
dim(trainingData)
## [1] 1000 103
dim(testData)
## [1] 1000 103
As mentioned earlier, overfitting is the difference in prediction performance between the historical data and future data. In practice, the historical data is split between training and test, with the assumption that the test data set behaves like “future data”. In practical terms,
Overfitting is the difference in prediction performance between the training data and the test data.
\[Overfitting = Performance_{Training} - Performance_{Test}\]
Equivalently,
Overfitting is the difference in prediction error between the training data and the test data.
\[Overfitting = Error_{Test} - Error_{Training}\]
In regression problems, a popular performance metric is \(R^2\). A popular error metric is root mean squared error, \(RMSE\). The subsequent examples will focus on \(R^2\), which is calculated using the following formula:
\[R^2 = corr(predicted, actual)^2\]
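Since this calculation is repeated for the training and test data below (inline, via cor()), it can also be expressed as a small helper function. The sketch below also includes an \(RMSE\) helper for the error-based definition of overfitting; the names calcR2 and calcRMSE are arbitrary and not from any package.
calcR2 <- function(predicted, actual) cor(predicted, actual)^2 ## squared correlation
calcRMSE <- function(predicted, actual) sqrt(mean((predicted - actual)^2)) ## root mean squared error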
With these definitions out of the way, let’s find a model where \(Overfitting \approx 0\). Since historical.df was simulated with the following equation, we know that the true model contains only two explanatory variables, with Loan Age entering both linearly and quadratically.
\[LatentRisk = 10 + 50(LoanAge) - 2(LoanAge^2) - 10(FICO) + \epsilon\]
Let’s fit a linear model using only the training data.
perfect.mod <- lm(LatentRisk ~ LoanAge + I(LoanAge^2) + FICO, data=trainingData)
summary(perfect.mod)
##
## Call:
## lm(formula = LatentRisk ~ LoanAge + I(LoanAge^2) + FICO, data = trainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -771.02 -91.72 0.13 101.87 521.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.6502 120.9526 -0.047 0.963
## LoanAge 64.2273 7.6453 8.401 < 2e-16 ***
## I(LoanAge^2) -2.5678 0.3605 -7.123 2.02e-12 ***
## FICO -10.1391 0.2245 -45.158 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 158.6 on 996 degrees of freedom
## Multiple R-squared: 0.6865, Adjusted R-squared: 0.6856
## F-statistic: 727 on 3 and 996 DF, p-value: < 2.2e-16
perfect.mod.R2 <- cor(perfect.mod$fitted.values, trainingData$LatentRisk)^2
perfect.mod.R2
## [1] 0.6865036
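As a sanity check, the same value can be read directly from the fitted lm object; it should match the Multiple R-squared reported in the summary above.
summary(perfect.mod)$r.squared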
The training \(R^2\) looks amazing. Can we trust it? Let’s calculate the \(R^2\) for the test data set.
perfect.mod.test.pred <- predict(perfect.mod, newdata=testData)
perfect.mod.test.R2 <- cor(perfect.mod.test.pred, testData$LatentRisk)^2
perfect.mod.test.R2
## [1] 0.6492291
The prediction performance in the test data set is high, but not as high as the prediction performance in the training data set. Now we can calculate the severity of overfitting.
Overfitting.perfect.mod <- perfect.mod.R2 - perfect.mod.test.R2
Overfitting.perfect.mod
## [1] 0.03727448
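The same comparison can be made with the error metric mentioned earlier. Here is a hedged sketch using the calcRMSE helper defined above (the exact values depend on the simulated data, so no output is shown):
perfect.mod.train.RMSE <- calcRMSE(perfect.mod$fitted.values, trainingData$LatentRisk)
perfect.mod.test.RMSE <- calcRMSE(perfect.mod.test.pred, testData$LatentRisk)
perfect.mod.test.RMSE - perfect.mod.train.RMSE ## Error_Test - Error_Training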
Unfortunately, the true model is almost never known. As a result, modelers often use algorithms to search for patterns. When used blindly, these algorithms can lead to severe overfitting. One such algorithm is stepwise regression.
Stepwise regression searches for the “best” subset of explanatory variables by iteratively adding and removing them. In the classical version, the decision is based on p-values: an explanatory variable with a low p-value is added to the model, while one with a high p-value is removed. The stepAIC function from the MASS package, used below, performs the same kind of search but ranks candidate models with a penalized information criterion (here with a penalty of k = 4 per parameter) rather than raw p-values.
Let’s apply it to the entire training data set. There are 102 candidate explanatory variables in the training set, and the search algorithm will choose the ones it considers “best”.
library(MASS)
all.vars.mod <- lm(LatentRisk ~ ., data=trainingData)
stepwise.mod <- stepAIC(all.vars.mod, trace=FALSE, k=4)
summary(stepwise.mod)
##
## Call:
## lm(formula = LatentRisk ~ LoanAge + FICO + V3 + V76 + V88, data = trainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -740.68 -93.55 -5.96 108.88 553.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 293.66143 115.10172 2.551 0.0109 *
## LoanAge 10.87000 1.58375 6.863 1.18e-11 ***
## FICO -10.23410 0.22876 -44.738 < 2e-16 ***
## V3 -0.02352 0.01039 -2.264 0.0238 *
## V76 0.02098 0.01017 2.063 0.0394 *
## V88 -0.02196 0.01008 -2.179 0.0296 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 161.6 on 994 degrees of freedom
## Multiple R-squared: 0.675, Adjusted R-squared: 0.6734
## F-statistic: 413 on 5 and 994 DF, p-value: < 2.2e-16
All the variables in the stepwise model are significant at the 5% level. Unfortunately, stepwise regression dropped the squared Loan Age term and included three of the random noise variables (V3, V76, and V88) in the model.
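One way to flag the selected noise variables programmatically is a small sketch like the following; it should return V3, V76, and V88, matching the summary above.
selectedVars <- names(coef(stepwise.mod))[-1] ## drop the intercept
setdiff(selectedVars, c("LoanAge", "I(LoanAge^2)", "FICO")) ## anything left over is noise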
Let’s calculate overfitting.
stepwise.R2 <- cor(stepwise.mod$fitted.values, trainingData$LatentRisk)^2
stepwise.test.pred <- predict(stepwise.mod, newdata=testData)
stepwise.R2.test <- cor(stepwise.test.pred, testData$LatentRisk)^2
stepwise.R2
## [1] 0.6750367
stepwise.R2.test
## [1] 0.6208776
Overfitting.stepwise <- stepwise.R2 - stepwise.R2.test
Overfitting.stepwise
## [1] 0.05415905
Compared to the perfect model, the stepwise model overfits by a factor of roughly 1.45.
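That factor is simply the ratio of the two overfitting measures computed above:
Overfitting.stepwise / Overfitting.perfect.mod
## [1] 1.452979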
As the “perfect” model showed, even when the true model is known to the modeler, some overfitting may still occur (i.e., the model performs better in the training data than in the test data). In a future post, we will discuss sources of overfitting.
In reality, the true model is almost never known, and algorithms that search for patterns can make overfitting worse. In a future post, we will discuss algorithms that account for overfitting.