Summary

The document demonstrates time series cross-validation using the caret package. The methodology is consistent with Rob Hyndman’s recommendation for how to do time series cross-validation. Here’s a graphical summary of the method that will be applied to the economics data set from ggplot2.

Time series cross-validation answers two important questions:

  1. If we have 9 different models, which one is the best?
  2. Some of the 9 models may require tuning during model training, which tuning parameter values should we choose?

These are the exact questions that K-fold cross-validation answers for cross-sectional data sets. Unlike K-fold cross-validation, the hold-out data sets (n = 12) are adjacent observations. Hence, before doing time series cross-validation, we need to make sure that the data set is sorted appropriately by time (i.e., oldest observation to newest observation).

library(tidyverse)
library(lubridate)
library(caret)
library(doParallel)
library(lattice)
library(pdp)

economics <- ggplot2::economics

knitr::kable(head(economics))
date pce pop psavert uempmed unemploy
1967-07-01 506.7 198712 12.6 4.5 2944
1967-08-01 509.8 198911 12.6 4.7 2945
1967-09-01 515.6 199113 11.9 4.6 2958
1967-10-01 512.2 199311 12.9 4.9 3143
1967-11-01 517.4 199498 12.8 4.7 3066
1967-12-01 525.1 199657 11.8 4.8 3018
knitr::kable(tail(economics))
date pce pop psavert uempmed unemploy
2014-11-01 12051.4 319564.2 7.3 13.0 9090
2014-12-01 12062.0 319746.2 7.6 12.9 8717
2015-01-01 12046.0 319928.6 7.7 13.2 8903
2015-02-01 12082.4 320074.5 7.9 12.9 8610
2015-03-01 12158.3 320230.8 7.4 12.0 8504
2015-04-01 12193.8 320402.3 7.6 11.5 8526

Here are the variable names from the ?economics documentation.

Variable Name Description
date Month of data collection
psavert personal savings rate
pce personal consumption expenditures, in billions of dollars
unemploy number of unemployed in thousands
uempmed median duration of unemployment, in week
pop total population, in thousands

Suppose we want to forecast the number of unemployed (unemploy) for the next 12-months. As with most time series models, we assume that the independent variables are all exogenous.

Caret

The caret package makes time series cross validation very easy. First, we are going to calculate a categorical variable to store the month of the year (because there may be seasonality). Second, in order to make the research reproducible, we are going to set up seed values.

economics$month <- as.factor(month(economics$date))

#### creating sampling seeds ####
set.seed(123)
seeds <- vector(mode = "list", length = 528)
for(i in 1:527) seeds[[i]] <- sample.int(1000, 5)

## For the last model:
seeds[[528]] <- sample.int(1000, 1)

#########

Third, we set up the computer for parallel processing.

registerDoParallel(cores=3)

Then, we tell caret that we want to measure model performance and select tuning parameters based on time series cross-validation.

myTimeControl <- trainControl(method = "timeslice",
                              initialWindow = 36,
                              horizon = 12,
                              fixedWindow = FALSE,
                              allowParallel = TRUE,
                              seeds = seeds)

Then, we try 9 different models. Some models require tuning parameters, so we tell caret to try 5 different combinations of tuning parameters for each model that requires tuning parameters. For linear regression models, there are no tuning parameters.

Regularized Linear Regression

tuneLength.num <- 5

glmnet.mod <- train(unemploy ~ . - date,
                    data = economics,
                    method = "glmnet",
                    family = "gaussian",
                    trControl = myTimeControl,
                    tuneLength=tuneLength.num,
                    metric='RMSE')

Regularized Poisson Regression

pois.mod <- train(unemploy ~ . - date,
                    data = economics,
                    method = "glmnet",
                    family = "poisson",
                    trControl = myTimeControl,
                    tuneLength=tuneLength.num,
                    metric='RMSE')

Linear Regression

lm.mod <- train(unemploy ~ . - date,
                  data = economics,
                  method = "lm",
                  trControl = myTimeControl,
                  tuneLength=tuneLength.num,
                  metric='RMSE')

Multivariate Adaptive Regression Splines (Gaussian)

earth.mod <- train(unemploy ~ . - date,
                data = economics,
                method = "earth",
                trControl = myTimeControl,
                tuneLength=tuneLength.num,
                metric='RMSE')

Generalized Additive Model

gam.mod <- train(unemploy ~ . - date,
                   data = economics,
                   method = "gam",
                   trControl = myTimeControl,
                   tuneLength=tuneLength.num,
                   metric='RMSE')

Regression Tree

rpart.mod <- train(unemploy ~ . - date,
                 data = economics,
                 method = "rpart",
                 trControl = myTimeControl,
                 tuneLength=tuneLength.num,
                 metric='RMSE')

C-Tree

party.mod <- train(unemploy ~ . - date,
                   data = economics,
                   method = "ctree",
                   trControl = myTimeControl,
                   tuneLength=tuneLength.num,
                   metric='RMSE')

Random Forest

trainX <- economics[,c("pce","pop","psavert","uempmed","month")]
trainY <- economics[,"unemploy",drop=TRUE]
  
rf.mod <- train(x=trainX,
                y=trainY,
                method = "rf",
                trControl = myTimeControl,
                tuneLength=tuneLength.num,
                metric='RMSE')
## note: only 4 unique complexity parameters in default grid. Truncating the grid to 4 .

Gradient Boosted Trees

gbm.mod <- train(unemploy ~ . - date,
                 data = economics,
                 method = "gbm",
                 trControl = myTimeControl,
                 tuneLength=tuneLength.num,
                 verbose=FALSE,
                 metric='RMSE')

Model Selection

Finally, we select the best model based on cross-validation \(RMSE\). For time series problems, you should avoid using \(R^2\) as a selection criteria due to the spurious regression problem: regressing two time series that have strong deterministic or stochastic trends lead to severely inflated \(R^2\) values.

resamps <- resamples(list(glmnet = glmnet.mod,
                          glmnet.pois = pois.mod,
                          lm = lm.mod,
                          earth=earth.mod,
                          gbm=gbm.mod,
                          gam=gam.mod,
                          rf=rf.mod,
                          rpart=rpart.mod,
                          party=party.mod))
ss <- summary(resamps)

knitr::kable(ss[[3]]$RMSE)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
glmnet 179.53451 451.9324 707.3154 925.1820 1245.841 3197.874 0
glmnet.pois 163.91939 426.4876 693.8729 959.0219 1237.839 3752.513 0
lm 177.66650 540.4961 960.0611 1099.7038 1596.653 3482.444 0
earth 97.77150 396.1588 689.6046 882.1214 1119.180 3862.246 0
gbm 128.78262 337.2110 524.9515 835.5737 1106.732 4546.295 7
gam 117.26407 459.9492 716.8455 895.4578 1174.755 7463.320 12
rf 67.00126 339.5902 555.4110 803.3343 1027.352 4658.564 0
rpart 104.96674 566.3252 958.1386 1188.4609 1458.437 5859.784 0
party 106.16768 435.3236 798.1830 993.5814 1229.761 5243.687 0
trellis.par.set(caretTheme())
dotplot(resamps, metric = "RMSE")

Final Model

The random forest model is the winner. However, the sensitivity plots are a little difficult to explain.

lapply(c("pce","pop","psavert","uempmed","month"), function(x){
    partial(rf.mod, pred.var=x, plot=TRUE, train=trainX)
  }
)
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]