The primary purpose of predictive modeling is to predict the future with low error. A predictive model overfits the historical data if the model predicts the past well and the future poorly. The thresholds for “low error”, “well”, and “poorly” depend on the business problem that the model seeks to solve. Regardless of the thresholds, overfitting is a common problem in predictive modeling. This post explains how to measure overfitting.

Summary

Overfitting is the difference in prediction performance between the historical data and future data. Unfortunately, modelers cannot travel into the future to collect data. One approximation of future data is to hold out a sample of the historical data. The hold-out data set, also called a test data set, is not used in the model development or training process.

Simulate a Historical Data Set

Let’s create a historical data set with 2,000 observations and the following variables:

  1. Latent Risk: the dependent variable (what we’re trying to predict). The variable is continuous from \(-\infty\) to \(+\infty\). In our data set, Latent Risk is a function of only 2 explanatory variables and an irreducible error that is randomly drawn from the logistic distribution: \[LatentRisk = 10 + 50(LoanAge) - 2(LoanAge^2) - 10(FICO) + \epsilon\]

  2. Loan Age: explains latent risk

  3. FICO Score: explains latent risk

  4. 100 noise variables: these variables have no relationship with the dependent variable. They are included in the data set to demonstrate what happens when the modeler adds them to the model (i.e., overfitting).

set.seed(1985)

epsilon <- rlogis(n = 2000, location = 0, scale = 85)  ## Irreducible Error

LoanAge <- rpois(n = 2000, lambda = 10)   ## Loan Age

FICO <- rpois(n = 2000, lambda = 500)     ## FICO score

LatentRisk <- 10 + 50*LoanAge - 2*LoanAge^2 - 10*FICO + epsilon ## Latent Risk

Now let’s create 100 noise variables and add them to the final historical data set.

## create (2000 * 100) = 200,000 random numbers

randomNumbers <- rnorm(n = 200000, mean = 100, sd = 500)

## store the randomNumbers into a matrix with 2000 rows and 100 columns

noiseVars <- matrix(data=randomNumbers, nrow=2000, ncol=100)

## convert noiseVars into a data frame

noiseVars.df <- as.data.frame(noiseVars)

## create final historical data set

importantVars.df <- data.frame(LatentRisk=LatentRisk, LoanAge=LoanAge, FICO=FICO)

historical.df <- cbind(importantVars.df, noiseVars.df)

Split Data Set into Training and Test Data Sets

Let’s pretend that historical.df was handed to the modeler, and that the modeler was told by an authoritative source that the data is clean and applicable to the business problem. Given this pretend scenario, the modeler’s first job is to split the data into training and test sets.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(1983)

train <- createDataPartition(historical.df$LatentRisk, p=0.5, list = FALSE)

trainingData <- historical.df[train,]

testData <- historical.df[-train,]

dim(trainingData)
## [1] 1000  103
dim(testData)
## [1] 1000  103

Formulas for Overfitting

As mentioned earlier, overfitting is the difference in prediction performance between the historical data and future data. In practice, the historical data is split into training and test sets, with the assumption that the test data set behaves like “future data”. In practical terms,

Overfitting is the difference in prediction performance between the training data and the test data.

\[Overfitting = Performance_{Training} - Performance_{Test}\]

Equivalently,

Overfitting is the difference in prediction error between the training data and the test data.

\[Overfitting = Error_{Test} - Error_{Training}\]

Formula for Prediction Performance

In regression problems, a popular performance metric is \(R^2\), and a popular error metric is root mean squared error (\(RMSE\)). The subsequent examples focus on \(R^2\), which this post computes as the squared correlation between the predicted and actual values:

\[R^2 = corr(predicted, actual)^2\]
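Both quantities are easy to wrap in small helper functions. The sketch below is purely illustrative; the helper names r2, rmse, and overfitting do not appear elsewhere in this post, which computes the same quantities inline.

## squared correlation between predictions and actuals
r2 <- function(predicted, actual) cor(predicted, actual)^2

## root mean squared error
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))

## overfitting: training performance minus test performance
overfitting <- function(perf_train, perf_test) perf_train - perf_test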

If the true model is known, then overfitting should be close to 0

With these definitions out of the way, let’s find a model where \(Overfitting \approx 0\). Since historical.df was simulated with the following equation, we know that the true model contains only 2 explanatory variables.

\[LatentRisk = 10 + 50(LoanAge) - 2(LoanAge^2) - 10(FICO) + \epsilon\]

Let’s fit a linear model using only the training data.

perfect.mod <- lm(LatentRisk ~ LoanAge + I(LoanAge^2) + FICO, data=trainingData)

summary(perfect.mod)
## 
## Call:
## lm(formula = LatentRisk ~ LoanAge + I(LoanAge^2) + FICO, data = trainingData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -771.02  -91.72    0.13  101.87  521.95 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -5.6502   120.9526  -0.047    0.963    
## LoanAge       64.2273     7.6453   8.401  < 2e-16 ***
## I(LoanAge^2)  -2.5678     0.3605  -7.123 2.02e-12 ***
## FICO         -10.1391     0.2245 -45.158  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 158.6 on 996 degrees of freedom
## Multiple R-squared:  0.6865, Adjusted R-squared:  0.6856 
## F-statistic:   727 on 3 and 996 DF,  p-value: < 2.2e-16
perfect.mod.R2 <- cor(perfect.mod$fitted.values, trainingData$LatentRisk)^2

perfect.mod.R2
## [1] 0.6865036

The training \(R^2\) looks amazing. Can we trust it? Let’s calculate the \(R^2\) for the test data set.

perfect.mod.test.pred <- predict(perfect.mod, newdata=testData)

perfect.mod.test.R2 <- cor(perfect.mod.test.pred, testData$LatentRisk)^2

perfect.mod.test.R2
## [1] 0.6492291

The prediction performance in the test data set is high, but not as high as the prediction performance in the training data set. Now we can calculate the severity of overfitting.

Overfitting.perfect.mod <- perfect.mod.R2 - perfect.mod.test.R2

Overfitting.perfect.mod
## [1] 0.03727448

In reality, the true model is very, very rarely known

Unfortunately, the true model is almost never known. As a result, modelers often use algorithms to search for patterns. When used blindly, these algorithms can lead to severe overfitting. One such algorithm is stepwise regression.

Stepwise regression uses p-values to determine whether a variable should be added to or removed from a regression model. An explanatory variable with a low p-value should be added to (or kept in) the model, while an explanatory variable with a high p-value should be removed from the model.
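As an aside (an illustrative sketch, not part of the analysis below), base R’s drop1() and add1() report the per-term F-test p-values that a manual backward or forward step would consult. Applied to the perfect.mod fit from the previous section:

## p-values for dropping each term currently in perfect.mod
drop1(perfect.mod, test = "F")

## p-value for adding one noise variable (V1, chosen arbitrarily) to perfect.mod
add1(perfect.mod, scope = ~ . + V1, test = "F")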

Let’s use stepwise regression on the entire training data set, which contains 102 explanatory variables. We use stepAIC() from the MASS package to perform the search; setting the penalty to k = 4 makes each addition or deletion roughly equivalent to a 5% significance test on a single term, since \(\chi^2_{0.95,1} \approx 3.84\).

library(MASS)

all.vars.mod <- lm(LatentRisk ~ ., data=trainingData)

stepwise.mod <- stepAIC(all.vars.mod, trace=FALSE, k=4)

summary(stepwise.mod)
## 
## Call:
## lm(formula = LatentRisk ~ LoanAge + FICO + V3 + V76 + V88, data = trainingData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -740.68  -93.55   -5.96  108.88  553.04 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 293.66143  115.10172   2.551   0.0109 *  
## LoanAge      10.87000    1.58375   6.863 1.18e-11 ***
## FICO        -10.23410    0.22876 -44.738  < 2e-16 ***
## V3           -0.02352    0.01039  -2.264   0.0238 *  
## V76           0.02098    0.01017   2.063   0.0394 *  
## V88          -0.02196    0.01008  -2.179   0.0296 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 161.6 on 994 degrees of freedom
## Multiple R-squared:  0.675,  Adjusted R-squared:  0.6734 
## F-statistic:   413 on 5 and 994 DF,  p-value: < 2.2e-16

All the variables in the stepwise model are significant at the 5% level. Unfortunately, stepwise regression included three noise variables (V3, V76, and V88) and dropped the quadratic LoanAge term.

Let’s calculate overfitting.

stepwise.R2 <- cor(stepwise.mod$fitted.values, trainingData$LatentRisk)^2

stepwise.test.pred <- predict(stepwise.mod, newdata=testData)

stepwise.R2.test <- cor(stepwise.test.pred, testData$LatentRisk)^2

stepwise.R2
## [1] 0.6750367
stepwise.R2.test
## [1] 0.6208776
Overfitting.stepwise <- stepwise.R2 - stepwise.R2.test

Overfitting.stepwise
## [1] 0.05415905

Compared to the perfect model, the stepwise model overfits by a factor of roughly 1.45.
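That factor is simply the ratio of the two overfitting measures:

Overfitting.stepwise / Overfitting.perfect.mod
## [1] 1.452979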

Conclusion

  1. Even if the true model is known to the modeler, some overfitting may still occur (i.e., the model performs better in the training data than in the test data). In a future post, we will discuss sources of overfitting.

  2. In reality, the true model is almost never known. Algorithms that search for patterns may increase overfitting. In a future post, we will discuss algorithms that account for overfitting.