This document demonstrates time series cross-validation using the caret package. The methodology follows Rob Hyndman’s recommendation for time series cross-validation. Here’s a graphical summary of the method, which we apply to the economics data set from ggplot2.
Time series cross-validation answers two important questions:

1. How well will the model predict future (out-of-sample) observations?
2. Which tuning-parameter values give the best out-of-sample performance?

These are the exact questions that K-fold cross-validation answers for cross-sectional data sets. Unlike K-fold cross-validation, however, each hold-out set (here, n = 12) consists of adjacent observations. Hence, before doing time series cross-validation, we need to make sure that the data set is sorted by time (i.e., oldest observation to newest observation).
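To make the slicing scheme concrete, here is a minimal illustrative sketch using caret’s createTimeSlices(), the helper behind method = "timeslice". The toy window sizes (5 and 3) are arbitrary; the real analysis below uses 36 and 12.

library(caret)
# Toy example: 20 observations, first training window of 5, forecast the next 3.
slices <- createTimeSlices(1:20, initialWindow = 5, horizon = 3, fixedWindow = FALSE)
slices$train[[1]]  # 1 2 3 4 5    -- first training window
slices$test[[1]]   # 6 7 8        -- the adjacent hold-out block
slices$train[[2]]  # 1 2 3 4 5 6  -- the window grows by one observation per slice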
library(tidyverse)
library(lubridate)
library(caret)
library(doParallel)
library(lattice)
library(pdp)
economics <- ggplot2::economics
knitr::kable(head(economics))
| date | pce | pop | psavert | uempmed | unemploy |
|---|---|---|---|---|---|
| 1967-07-01 | 506.7 | 198712 | 12.6 | 4.5 | 2944 |
| 1967-08-01 | 509.8 | 198911 | 12.6 | 4.7 | 2945 |
| 1967-09-01 | 515.6 | 199113 | 11.9 | 4.6 | 2958 |
| 1967-10-01 | 512.2 | 199311 | 12.9 | 4.9 | 3143 |
| 1967-11-01 | 517.4 | 199498 | 12.8 | 4.7 | 3066 |
| 1967-12-01 | 525.1 | 199657 | 11.8 | 4.8 | 3018 |
knitr::kable(tail(economics))
| date | pce | pop | psavert | uempmed | unemploy |
|---|---|---|---|---|---|
| 2014-11-01 | 12051.4 | 319564.2 | 7.3 | 13.0 | 9090 |
| 2014-12-01 | 12062.0 | 319746.2 | 7.6 | 12.9 | 8717 |
| 2015-01-01 | 12046.0 | 319928.6 | 7.7 | 13.2 | 8903 |
| 2015-02-01 | 12082.4 | 320074.5 | 7.9 | 12.9 | 8610 |
| 2015-03-01 | 12158.3 | 320230.8 | 7.4 | 12.0 | 8504 |
| 2015-04-01 | 12193.8 | 320402.3 | 7.6 | 11.5 | 8526 |
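The data run from July 1967 to April 2015 and are already in chronological order. Since time slicing assumes this ordering, a one-line guard (not in the original analysis, but cheap insurance) makes the assumption explicit:

# Stop early if the rows are not sorted oldest to newest.
stopifnot(!is.unsorted(economics$date))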
Here are the variable descriptions from the ?economics documentation.
| Variable Name | Description |
|---|---|
| date | Month of data collection |
| psavert | personal savings rate |
| pce | personal consumption expenditures, in billions of dollars |
| unemploy | number of unemployed in thousands |
| uempmed | median duration of unemployment, in weeks |
| pop | total population, in thousands |
Suppose we want to forecast the number of unemployed (unemploy) for the next 12 months. As with most time series regression models, we assume that the independent variables are all exogenous.
The caret package makes time series cross-validation very easy. First, we create a categorical variable to store the month of the year (because there may be seasonality). Second, to make the analysis reproducible, we set up seed values.
economics$month <- as.factor(month(economics$date))
#### creating sampling seeds ####
set.seed(123)
# One seed vector per resample, plus one entry for the final model fit.
seeds <- vector(mode = "list", length = 528)
# Each resample gets one seed per candidate tuning-parameter combination
# (tuneLength = 5 below).
for(i in 1:527) seeds[[i]] <- sample.int(1000, 5)
## For the last model:
seeds[[528]] <- sample.int(1000, 1)
#########
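Where does 528 come from? With method = "timeslice" and a growing window, caret creates one resample per possible slice, and the seeds list needs one extra entry for the final model fit:

nrow(economics)    # 574 monthly observations
574 - 36 - 12 + 1  # 527 resamples (initialWindow = 36, horizon = 12)
527 + 1            # 528 entries in the seeds list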
Third, we set up the computer for parallel processing.
registerDoParallel(cores=3)
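Three cores is a hard-coded choice; a common alternative (not part of the original analysis) is to size the worker pool from the machine itself and leave one core free:

# Alternative: registerDoParallel(cores = max(1, parallel::detectCores() - 1))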
Then, we tell caret that we want to measure model performance and select tuning parameters based on time series cross-validation.
myTimeControl <- trainControl(method = "timeslice",
                              initialWindow = 36,   # first training set: 36 months
                              horizon = 12,         # hold out the next 12 months
                              fixedWindow = FALSE,  # growing, not rolling, window
                              allowParallel = TRUE,
                              seeds = seeds)
Then, we fit nine different models. For models that have tuning parameters, we tell caret to try five different combinations; for linear regression, which has no tuning parameters, tuneLength is simply ignored.
tuneLength.num <- 5
glmnet.mod <- train(unemploy ~ . - date,
data = economics,
method = "glmnet",
family = "gaussian",
trControl = myTimeControl,
tuneLength=tuneLength.num,
metric='RMSE')
pois.mod <- train(unemploy ~ . - date,
data = economics,
method = "glmnet",
family = "poisson",
trControl = myTimeControl,
tuneLength=tuneLength.num,
metric='RMSE')
lm.mod <- train(unemploy ~ . - date,
data = economics,
method = "lm",
trControl = myTimeControl,
tuneLength=tuneLength.num,
metric='RMSE')
earth.mod <- train(unemploy ~ . - date,
data = economics,
method = "earth",
trControl = myTimeControl,
tuneLength=tuneLength.num,
metric='RMSE')
gam.mod <- train(unemploy ~ . - date,
data = economics,
method = "gam",
trControl = myTimeControl,
tuneLength=tuneLength.num,
metric='RMSE')
rpart.mod <- train(unemploy ~ . - date,
data = economics,
method = "rpart",
trControl = myTimeControl,
tuneLength=tuneLength.num,
metric='RMSE')
party.mod <- train(unemploy ~ . - date,
data = economics,
method = "ctree",
trControl = myTimeControl,
tuneLength=tuneLength.num,
metric='RMSE')
# For random forest we use the x/y interface so that month stays a single
# factor rather than being expanded into dummy variables by the formula.
trainX <- economics[, c("pce", "pop", "psavert", "uempmed", "month")]
trainY <- economics[, "unemploy", drop = TRUE]
rf.mod <- train(x = trainX,
                y = trainY,
                method = "rf",
                trControl = myTimeControl,
                tuneLength = tuneLength.num,
                metric = 'RMSE')
## note (caret output from the rf fit): only 4 unique complexity parameters in default grid. Truncating the grid to 4.
gbm.mod <- train(unemploy ~ . - date,
data = economics,
method = "gbm",
trControl = myTimeControl,
tuneLength=tuneLength.num,
verbose=FALSE,
metric='RMSE')
Finally, we select the best model based on cross-validated \(RMSE\). For time series problems, you should avoid using \(R^2\) as a selection criterion because of the spurious regression problem: regressing one time series on another when both have strong deterministic or stochastic trends leads to severely inflated \(R^2\) values, even when the series are unrelated.
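To see why, here is a small simulation (illustrative only, separate from the analysis): two independent random walks have no true relationship, yet regressing one on the other routinely yields a large \(R^2\).

set.seed(42)
rw1 <- cumsum(rnorm(500))         # random walk 1
rw2 <- cumsum(rnorm(500))         # random walk 2, independent of rw1
summary(lm(rw2 ~ rw1))$r.squared  # frequently large despite no relationship

With that caveat, we compare the models on RMSE: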
resamps <- resamples(list(glmnet = glmnet.mod,
glmnet.pois = pois.mod,
lm = lm.mod,
earth=earth.mod,
gbm=gbm.mod,
gam=gam.mod,
rf=rf.mod,
rpart=rpart.mod,
party=party.mod))
ss <- summary(resamps)
knitr::kable(ss[[3]]$RMSE)
| | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s |
|---|---|---|---|---|---|---|---|
| glmnet | 179.53451 | 451.9324 | 707.3154 | 925.1820 | 1245.841 | 3197.874 | 0 |
| glmnet.pois | 163.91939 | 426.4876 | 693.8729 | 959.0219 | 1237.839 | 3752.513 | 0 |
| lm | 177.66650 | 540.4961 | 960.0611 | 1099.7038 | 1596.653 | 3482.444 | 0 |
| earth | 97.77150 | 396.1588 | 689.6046 | 882.1214 | 1119.180 | 3862.246 | 0 |
| gbm | 128.78262 | 337.2110 | 524.9515 | 835.5737 | 1106.732 | 4546.295 | 7 |
| gam | 117.26407 | 459.9492 | 716.8455 | 895.4578 | 1174.755 | 7463.320 | 12 |
| rf | 67.00126 | 339.5902 | 555.4110 | 803.3343 | 1027.352 | 4658.564 | 0 |
| rpart | 104.96674 | 566.3252 | 958.1386 | 1188.4609 | 1458.437 | 5859.784 | 0 |
| party | 106.16768 | 435.3236 | 798.1830 | 993.5814 | 1229.761 | 5243.687 | 0 |
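Beyond eyeballing the table, caret’s diff() method for resamples objects gives pairwise model comparisons with paired t-tests, if a more formal check is wanted:

# Optional: pairwise differences in cross-validated RMSE across models.
difs <- diff(resamps, metric = "RMSE")
summary(difs)

A dotplot gives the same comparison visually: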
trellis.par.set(caretTheme())
dotplot(resamps, metric = "RMSE")
The random forest model is the winner. However, the partial dependence plots below are a little difficult to interpret.
lapply(c("pce", "pop", "psavert", "uempmed", "month"), function(x) {
  partial(rf.mod, pred.var = x, plot = TRUE, train = trainX)
})
(Five partial dependence plots, one each for pce, pop, psavert, uempmed, and month.)
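If the lattice output is hard to read, partial() can also return the underlying grid of values for direct inspection; a sketch for one predictor:

pd.pce <- partial(rf.mod, pred.var = "pce", train = trainX)  # data frame, no plot
head(pd.pce)  # columns: pce (grid values) and yhat (average prediction)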