This document demonstrates time series cross-validation using the caret package. The methodology follows Rob Hyndman’s recommendation for time series cross-validation. Here’s a graphical summary of the method, which we apply to the economics data set from ggplot2.
Time series cross-validation answers two important questions:

1. How well will the model predict future (out-of-sample) observations?
2. Which tuning-parameter values give the best out-of-sample performance?

These are the exact questions that K-fold cross-validation answers for cross-sectional data sets. Unlike K-fold cross-validation, however, each hold-out set (here, n = 12) consists of adjacent observations. Hence, before doing time series cross-validation, we need to make sure that the data set is sorted by time (i.e., oldest observation to newest observation).
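To make the slicing scheme concrete, here is a minimal illustrative sketch using caret’s createTimeSlices(), the helper behind method = "timeslice". The toy window sizes (5 and 3) are arbitrary; the real analysis below uses 36 and 12.

library(caret)
# Toy example: 20 observations, first training window of 5, forecast the next 3.
slices <- createTimeSlices(1:20, initialWindow = 5, horizon = 3, fixedWindow = FALSE)
slices$train[[1]]  # 1 2 3 4 5    -- first training window
slices$test[[1]]   # 6 7 8        -- the adjacent hold-out block
slices$train[[2]]  # 1 2 3 4 5 6  -- the window grows by one observation per slice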
library(tidyverse)
library(lubridate)
library(caret)
library(doParallel)
library(lattice)
library(pdp)
economics <- ggplot2::economics
knitr::kable(head(economics))
| date | pce | pop | psavert | uempmed | unemploy |
|---|---|---|---|---|---|
| 1967-07-01 | 506.7 | 198712 | 12.6 | 4.5 | 2944 |
| 1967-08-01 | 509.8 | 198911 | 12.6 | 4.7 | 2945 |
| 1967-09-01 | 515.6 | 199113 | 11.9 | 4.6 | 2958 |
| 1967-10-01 | 512.2 | 199311 | 12.9 | 4.9 | 3143 |
| 1967-11-01 | 517.4 | 199498 | 12.8 | 4.7 | 3066 |
| 1967-12-01 | 525.1 | 199657 | 11.8 | 4.8 | 3018 |
knitr::kable(tail(economics))
| date | pce | pop | psavert | uempmed | unemploy |
|---|---|---|---|---|---|
| 2014-11-01 | 12051.4 | 319564.2 | 7.3 | 13.0 | 9090 |
| 2014-12-01 | 12062.0 | 319746.2 | 7.6 | 12.9 | 8717 |
| 2015-01-01 | 12046.0 | 319928.6 | 7.7 | 13.2 | 8903 |
| 2015-02-01 | 12082.4 | 320074.5 | 7.9 | 12.9 | 8610 |
| 2015-03-01 | 12158.3 | 320230.8 | 7.4 | 12.0 | 8504 |
| 2015-04-01 | 12193.8 | 320402.3 | 7.6 | 11.5 | 8526 |
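The data run from July 1967 to April 2015 and are already in chronological order. Since time slicing assumes this ordering, a one-line guard (not in the original analysis, but cheap insurance) makes the assumption explicit:

# Stop early if the rows are not sorted oldest to newest.
stopifnot(!is.unsorted(economics$date))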
Here are the variable descriptions from the ?economics documentation.
| Variable Name | Description |
|---|---|
| date | Month of data collection |
| psavert | personal savings rate |
| pce | personal consumption expenditures, in billions of dollars |
| unemploy | number of unemployed in thousands |
| uempmed | median duration of unemployment, in weeks |
| pop | total population, in thousands |
Suppose we want to forecast the number of unemployed (unemploy) for the next 12 months. As with most time series regression models, we assume that the independent variables are all exogenous.
The caret package makes time series cross-validation very easy. First, we create a categorical variable to store the month of the year (because there may be seasonality). Second, to make the analysis reproducible, we set up seed values.
economics$month <- as.factor(month(economics$date))
#### creating sampling seeds ####
set.seed(123)
# One seed vector per resample, plus one entry for the final model fit.
seeds <- vector(mode = "list", length = 528)
# Each resample gets one seed per candidate tuning-parameter combination
# (tuneLength = 5 below).
for(i in 1:527) seeds[[i]] <- sample.int(1000, 5)
## For the last model:
seeds[[528]] <- sample.int(1000, 1)
#########
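Where does 528 come from? With method = "timeslice" and a growing window, caret creates one resample per possible slice, and the seeds list needs one extra entry for the final model fit:

nrow(economics)    # 574 monthly observations
574 - 36 - 12 + 1  # 527 resamples (initialWindow = 36, horizon = 12)
527 + 1            # 528 entries in the seeds list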
Third, we set up the computer for parallel processing.
registerDoParallel(cores=3)
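Three cores is a hard-coded choice; a common alternative (not part of the original analysis) is to size the worker pool from the machine itself and leave one core free:

# Alternative: registerDoParallel(cores = max(1, parallel::detectCores() - 1))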
Then, we tell caret that we want to measure model performance and select tuning parameters based on time series cross-validation.
myTimeControl <- trainControl(method = "timeslice",
                              initialWindow = 36,   # first training set: 36 months
                              horizon = 12,         # hold out the next 12 months
                              fixedWindow = FALSE,  # growing, not rolling, window
                              allowParallel = TRUE,
                              seeds = seeds)
Then, we fit nine different models. For models that have tuning parameters, we tell caret to try five different combinations; for linear regression, which has no tuning parameters, tuneLength is simply ignored.
tuneLength.num <- 5
glmnet.mod <- train(unemploy ~ . - date,
data = economics,
method = "glmnet",
family = "gaussian",
trControl = myTimeControl,
tuneLength=tuneLength.num,
metric='RMSE')
pois.mod <- train(unemploy ~ . - date,
data = economics,
method = "glmnet",
family = "poisson",
trControl = myTimeControl,
tuneLength=tuneLength.num,
metric='RMSE')
lm.mod <- train(unemploy ~ . - date,
data = economics,
method = "lm",
trControl = myTimeControl,
tuneLength=tuneLength.num,
metric='RMSE')
earth.mod <- train(unemploy ~ . - date,
data = economics,
method = "earth",
trControl = myTimeControl,
tuneLength=tuneLength.num,
metric='RMSE')
gam.mod <- train(unemploy ~ . - date,
data = economics,
method = "gam",
trControl = myTimeControl,
tuneLength=tuneLength.num,
metric='RMSE')
rpart.mod <- train(unemploy ~ . - date,
data = economics,
method = "rpart",
trControl = myTimeControl,
tuneLength=tuneLength.num,
metric='RMSE')
party.mod <- train(unemploy ~ . - date,
data = economics,
method = "ctree",
trControl = myTimeControl,
tuneLength=tuneLength.num,
metric='RMSE')
# For random forest we use the x/y interface so that month stays a single
# factor rather than being expanded into dummy variables by the formula.
trainX <- economics[, c("pce", "pop", "psavert", "uempmed", "month")]
trainY <- economics[, "unemploy", drop = TRUE]
rf.mod <- train(x = trainX,
                y = trainY,
                method = "rf",
                trControl = myTimeControl,
                tuneLength = tuneLength.num,
                metric = 'RMSE')
## note (caret output from the rf fit): only 4 unique complexity parameters in default grid. Truncating the grid to 4.
gbm.mod <- train(unemploy ~ . - date,
data = economics,
method = "gbm",
trControl = myTimeControl,
tuneLength=tuneLength.num,
verbose=FALSE,
metric='RMSE')
Finally, we select the best model based on cross-validated \(RMSE\). For time series problems, you should avoid using \(R^2\) as a selection criterion because of the spurious regression problem: regressing one time series on another when both have strong deterministic or stochastic trends leads to severely inflated \(R^2\) values, even when the series are unrelated.
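To see why, here is a small simulation (illustrative only, separate from the analysis): two independent random walks have no true relationship, yet regressing one on the other routinely yields a large \(R^2\).

set.seed(42)
rw1 <- cumsum(rnorm(500))         # random walk 1
rw2 <- cumsum(rnorm(500))         # random walk 2, independent of rw1
summary(lm(rw2 ~ rw1))$r.squared  # frequently large despite no relationship

With that caveat, we compare the models on RMSE: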
resamps <- resamples(list(glmnet = glmnet.mod,
glmnet.pois = pois.mod,
lm = lm.mod,
earth=earth.mod,
gbm=gbm.mod,
gam=gam.mod,
rf=rf.mod,
rpart=rpart.mod,
party=party.mod))
ss <- summary(resamps)
knitr::kable(ss[[3]]$RMSE)
| | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s |
|---|---|---|---|---|---|---|---|
| glmnet | 179.53451 | 451.9324 | 707.3154 | 925.1820 | 1245.841 | 3197.874 | 0 |
| glmnet.pois | 163.91939 | 426.4876 | 693.8729 | 959.0219 | 1237.839 | 3752.513 | 0 |
| lm | 177.66650 | 540.4961 | 960.0611 | 1099.7038 | 1596.653 | 3482.444 | 0 |
| earth | 97.77150 | 396.1588 | 689.6046 | 882.1214 | 1119.180 | 3862.246 | 0 |
| gbm | 128.78262 | 337.2110 | 524.9515 | 835.5737 | 1106.732 | 4546.295 | 7 |
| gam | 117.26407 | 459.9492 | 716.8455 | 895.4578 | 1174.755 | 7463.320 | 12 |
| rf | 67.00126 | 339.5902 | 555.4110 | 803.3343 | 1027.352 | 4658.564 | 0 |
| rpart | 104.96674 | 566.3252 | 958.1386 | 1188.4609 | 1458.437 | 5859.784 | 0 |
| party | 106.16768 | 435.3236 | 798.1830 | 993.5814 | 1229.761 | 5243.687 | 0 |
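Beyond eyeballing the table, caret’s diff() method for resamples objects gives pairwise model comparisons with paired t-tests, if a more formal check is wanted:

# Optional: pairwise differences in cross-validated RMSE across models.
difs <- diff(resamps, metric = "RMSE")
summary(difs)

A dotplot gives the same comparison visually: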
trellis.par.set(caretTheme())
dotplot(resamps, metric = "RMSE")
The random forest model is the winner. However, the partial dependence plots below are a little difficult to interpret.
lapply(c("pce", "pop", "psavert", "uempmed", "month"), function(x) {
  partial(rf.mod, pred.var = x, plot = TRUE, train = trainX)
})
(Five partial dependence plots, one each for pce, pop, psavert, uempmed, and month.)
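If the lattice output is hard to read, partial() can also return the underlying grid of values for direct inspection; a sketch for one predictor:

pd.pce <- partial(rf.mod, pred.var = "pce", train = trainX)  # data frame, no plot
head(pd.pce)  # columns: pce (grid values) and yhat (average prediction)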