Training & Testing Sets
When first taught regression, most students fit their models to the totality of their data. If one is solely interested in describing past behavior, this is proper. However, if one is interested in predicting future behavior, it is better practice to reserve some of the available data for testing. That held-out portion stands in for the “unknown” future data the model is ultimately intended to predict.
Experiment
We will create synthetic data representing both past and future observations. A quarter of the data will be the “true” future data. This will never be seen during the modeling process. The remaining 75% reflects the “observed” data—the full complement of what is available to the data scientist.
Two models will be built from this data: one using all of it, and a second using a 75/25 split into training and testing sets. The selected models from each group will then be compared against the “future” data.
Raw Data
We will use a simple linear model as our example: \[ y_i = 3.2 + 1.6x_{1i} + 3.4x_{2i} - 0.8x_{3i} + \epsilon_i \]
We will fit the model using glm with a Gaussian family (identity link).
set.seed(245)

# Generate 400 observations from the known model
x1 <- runif(400, 0, 100)
x2 <- rnorm(400, 0, 10)
x3 <- rexp(400, 1/5)
y <- 3.2 + 1.6 * x1 + 3.4 * x2 - 0.8 * x3 + rnorm(400, 0, 10)

# Hold out 100 observations (25%) as the unseen "future" data
futIDX <- sample(400, 100)
x1f <- x1[futIDX]
x1p <- x1[-futIDX]
x2f <- x2[futIDX]
x2p <- x2[-futIDX]
x3f <- x3[futIDX]
x3p <- x3[-futIDX]
yf <- y[futIDX]
yp <- y[-futIDX]

# Split the remaining 300 "observed" points 75/25 into training and testing sets
testIDX <- sample(300, 300 * 0.25)
x1trn <- x1p[-testIDX]
x1tst <- x1p[testIDX]
x2trn <- x2p[-testIDX]
x2tst <- x2p[testIDX]
x3trn <- x3p[-testIDX]
x3tst <- x3p[testIDX]
ytrn <- yp[-testIDX]
ytst <- yp[testIDX]
Linear Regression Model
This is almost unfair with regard to the training/testing split. Normally, the validation set is used to fine-tune the hyperparameters, and in simple linear regression there are none. If the second model returns better results, it is very strong evidence that there is value in simply fitting to less data to prevent overfitting!
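Since there is nothing to tune, the train call in the next section is, in effect, a thin wrapper around a single glm fit. Purely as a point of reference, and assuming the variables created above, a direct base-R fit on the training portion would look something like this sketch:
# Reference only: the equivalent plain glm fit on the training portion
fit <- glm(y ~ x1 + x2 + x3, family = gaussian,
           data = data.frame(y = ytrn, x1 = x1trn, x2 = x2trn, x3 = x3trn))
coef(fit)  # intercept and slopes, directly comparable to the true values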
Training
library(caret)  # provides trainControl, train and defaultSummary
trc <- trainControl(method = 'cv', number = 10)  # 10-fold cross-validation
# Model 1: fit to all 300 "observed" points
m1 <- train(data.frame(x1 = x1p, x2 = x2p, x3 = x3p), yp, method = 'glm',
            trControl = trc, family = gaussian)
# Model 2: fit to the 225-point training set only
m2 <- train(data.frame(x1 = x1trn, x2 = x2trn, x3 = x3trn), ytrn, method = 'glm',
            trControl = trc, family = gaussian)
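Because trc requested 10-fold cross-validation, each train object also carries resampled error estimates. For a quick look before the full summaries, these sit in the results element of the fitted objects (a small check, not part of the original output):
m1$results  # cross-validated RMSE, Rsquared and MAE for the full-data fit
m2$results  # the same for the training-set fit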
summary(m1)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -32.062 -7.117 -0.262 7.537 26.648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.15370 1.33089 3.872 0.000133 ***
## x1 1.59122 0.02007 79.288 < 2e-16 ***
## x2 3.41773 0.05915 57.781 < 2e-16 ***
## x3 -0.93259 0.13253 -7.037 1.37e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 106.271)
##
## Null deviance: 1089307 on 299 degrees of freedom
## Residual deviance: 31456 on 296 degrees of freedom
## AIC: 2257.1
##
## Number of Fisher Scoring iterations: 2
summary(m2)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -30.8402 -7.3048 -0.1341 7.4669 26.4330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.16134 1.55799 3.313 0.00108 **
## x1 1.59303 0.02327 68.466 < 2e-16 ***
## x2 3.38712 0.07074 47.882 < 2e-16 ***
## x3 -0.95062 0.14965 -6.352 1.19e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 105.9028)
##
## Null deviance: 765467 on 224 degrees of freedom
## Residual deviance: 23405 on 221 degrees of freedom
## AIC: 1693.6
##
## Number of Fisher Scoring iterations: 2
Both models recover the parameters decently, although the intercept is somewhat off. The latter is almost always due to process risk, the natural vagaries of stochastic simulation.
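To see this side by side, the underlying glm fits can be pulled out of the train wrappers via finalModel and their coefficients lined up against the true generating values; a small sketch:
# Estimated coefficients versus the true generating values
true_beta <- c(3.2, 1.6, 3.4, -0.8)   # intercept, x1, x2, x3
round(rbind(True = true_beta,
            Full = coef(m1$finalModel),
            Split = coef(m2$finalModel)), 3)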
Testing
# Score both models on the never-seen "future" data
m1p <- predict(m1, data.frame(x1 = x1f, x2 = x2f, x3 = x3f))
m2p <- predict(m2, data.frame(x1 = x1f, x2 = x2f, x3 = x3f))
defaultSummary(data.frame(pred = m1p, obs = yf))
## RMSE Rsquared MAE
## 8.7674098 0.9785777 7.5048848
defaultSummary(data.frame(pred = m2p, obs = yf))
## RMSE Rsquared MAE
## 8.7576504 0.9786794 7.4934065
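The metrics returned by defaultSummary are easy to verify by hand; for example, the RMSE and MAE on the future set are simply (shown for illustration):
sqrt(mean((m1p - yf)^2))   # RMSE of the full-data model on the future set
mean(abs(m2p - yf))        # MAE of the split model on the future set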
The second model, fit on less data, shows slightly better accuracy on the “future” data, which points to the full-data fit having overfit its sample. In a future blog post, we may investigate the results when there are hyperparameters that can be tuned.