The document demonstrates time series cross-validation using the caret package. The methodology is generally consistent with Rob Hyndman’s recommendation for how to do time series cross-validation.

Here’s a graphical summary of the method that will be applied to the economics data set from ggplot2.

timeseries_cv

Time series cross-validation answers two important questions:

  1. If we have 10 different models, which one is the best?

  2. Some of the 10 models may require tuning during model training, which tuning parameter values should we choose?

These are the exact questions that K-fold cross-validation answers for cross-sectional data sets. Unlike K-fold cross-validation, the hold-out data sets (n = 12) are adjacent observations. Hence, before doing time series cross-validation, we need to make sure that the data set is sorted appropriately by time (i.e., oldest observation to newest observation).

economics <- ggplot2::economics

knitr::kable(head(economics))
date pce pop psavert uempmed unemploy
1967-06-30 507.8 198712 9.8 4.5 2944
1967-07-31 510.9 198911 9.8 4.7 2945
1967-08-31 516.7 199113 9.0 4.6 2958
1967-09-30 513.3 199311 9.8 4.9 3143
1967-10-31 518.5 199498 9.7 4.7 3066
1967-11-30 526.2 199657 9.4 4.8 3018
knitr::kable(tail(economics))
date pce pop psavert uempmed unemploy
473 2006-10-31 9410.8 300836 -0.9 8.2 6826
474 2006-11-30 9478.5 301070 -1.1 7.3 6849
475 2006-12-31 9540.3 301296 -0.9 8.1 7017
476 2007-01-31 9610.6 301481 -1.0 8.1 6865
477 2007-02-28 9653.0 301684 -0.7 8.5 6724
478 2007-03-31 9705.0 301913 -1.3 8.7 6801

Here are the variable names from the ?economics documentation.

Variable Name Description
date Month of data collection
psavert personal savings rate
pce personal consumption expenditures, in billions of dollars
unemploy number of unemployed in thousands
uempmed median duration of unemployment, in week
pop total population, in thousands

Suppose we want to forecast the number of unemployed (unemploy) for each month-end between now and next year. In other words, we want to build a model that can give us accurate forecasts for the next 12-months. As with most time series models, we assume that the independent variables are all exogenous.

The caret package makes time series cross-validation very easy. First, we are going to calculate a categorical variable to store the month of the year (because there may be seasonality). Second, in order to make the research reproducible, we are going to set up seed values.

library(lubridate)

economics$month <- as.factor(month(economics$date))

#### creating sampling seeds ####
set.seed(123)
seeds <- vector(mode = "list", length = 432)
for(i in 1:431) seeds[[i]] <- sample.int(1000, 5)

## For the last model:
seeds[[432]] <- sample.int(1000, 1)

#########

Third, we set up the computer for parallel processing.

library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoParallel(cores=2)

Then, we tell caret that we want to measure model performance and select tuning parameters based on time series cross-validation.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked _by_ '.GlobalEnv':
## 
##     economics
myTimeControl <- trainControl(method = "timeslice",
                              initialWindow = 36,
                              horizon = 12,
                              fixedWindow = FALSE,
                              allowParallel = TRUE,
                              seeds = seeds)

Then, we try 10 different models. Some models require tuning parameters, so we tell caret to try 5 different combinations of tuning parameters for each model that requires tuning parameters. For linear regression models, there are no tuning parameters.

tuneLength.num <- 5

glmnet.mod <- train(unemploy ~ . - date,
                    data = economics,
                    method = "glmnet",
                    family = "gaussian",
                    trControl = myTimeControl,
                    tuneLength=tuneLength.num)
## Loading required package: glmnet
## Loading required package: Matrix
## Loaded glmnet 2.0-2
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
pois.mod <- train(unemploy ~ . - date,
                    data = economics,
                    method = "glmnet",
                    family = "poisson",
                    trControl = myTimeControl,
                  tuneLength=tuneLength.num)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
lm.mod <- train(unemploy ~ . - date,
                  data = economics,
                  method = "lm",
                  trControl = myTimeControl,
                tuneLength=tuneLength.num)

earth.mod <- train(unemploy ~ . - date,
                data = economics,
                method = "earth",
                trControl = myTimeControl,
                tuneLength=tuneLength.num)
## Loading required package: earth
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
earth.pois.mod <- train(unemploy ~ . - date,
                        data = economics,
                        method = "earth",
                        glm=list(family=poisson),
                        trControl = myTimeControl,
                        tuneLength=tuneLength.num)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
gam.mod <- train(unemploy ~ . - date,
                   data = economics,
                   method = "gam",
                   trControl = myTimeControl,
                   tuneLength=tuneLength.num)
## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.8-6. For overview type 'help("mgcv-package")'.
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
rpart.mod <- train(unemploy ~ . - date,
                 data = economics,
                 method = "rpart",
                 trControl = myTimeControl,
                 tuneLength=tuneLength.num)
## Loading required package: rpart
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
party.mod <- train(unemploy ~ . - date,
                   data = economics,
                   method = "ctree",
                   trControl = myTimeControl,
                   tuneLength=tuneLength.num)
## Loading required package: party
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
rf.mod <- train(unemploy ~ . - date,
                data = economics,
                method = "rf",
                trControl = myTimeControl,
                tuneLength=tuneLength.num)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
gbm.mod <- train(unemploy ~ . - date,
                 data = economics,
                 method = "gbm",
                 distribution="poisson",
                 trControl = myTimeControl,
                 tuneLength=tuneLength.num,
                 verbose=FALSE)
## Loading required package: gbm
## Loading required package: survival
## 
## Attaching package: 'survival'
## 
## The following object is masked from 'package:caret':
## 
##     cluster
## 
## Loading required package: splines
## Loaded gbm 2.1.1
## Loading required package: plyr
## 
## Attaching package: 'plyr'
## 
## The following object is masked from 'package:modeltools':
## 
##     empty
## 
## The following object is masked from 'package:lubridate':
## 
##     here
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.

Finally, we select the best model based on cross-validation \(R^2\).

### model performance ###

resamps <- resamples(list(glmnet = glmnet.mod,
                          glmnet.pois = pois.mod,
                          lm = lm.mod,
                          earth=earth.mod,
                          earth.pois=earth.pois.mod,
                          gbm=gbm.mod,
                          gam=gam.mod,
                          rf=rf.mod,
                          rpart=rpart.mod,
                          party=party.mod))
resamps
## 
## Call:
## resamples.default(x = list(glmnet = glmnet.mod, glmnet.pois = pois.mod,
##  = gbm.mod, gam = gam.mod, rf = rf.mod, rpart = rpart.mod, party
##  = party.mod))
## 
## Models: glmnet, glmnet.pois, lm, earth, earth.pois, gbm, gam, rf, rpart, party 
## Number of resamples: 431 
## Performance metrics: RMSE, Rsquared 
## Time estimates for: everything, final model fit
ss <- summary(resamps)

knitr::kable(ss[[3]]$Rsquared)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
glmnet 0.0000000 0.07989 0.1935 0.2835 0.4517 0.9520 0
glmnet.pois 0.0000009 0.05208 0.1748 0.2777 0.4479 0.9595 0
lm 0.0000193 0.08146 0.2168 0.3064 0.5026 0.9511 0
earth 0.0000007 0.14110 0.3416 0.4096 0.6830 0.9890 0
earth.pois 0.0006496 0.13110 0.3901 0.4054 0.6272 0.9921 276
gbm 0.0000016 0.05422 0.1818 0.2556 0.3944 0.8898 7
gam 0.0000001 0.08569 0.3467 0.3878 0.6938 0.9571 12
rf 0.0000349 0.03126 0.1362 0.2116 0.3274 0.9332 0
rpart 0.0000023 0.02449 0.1142 0.1923 0.2842 0.8225 288
party 0.0000020 0.03997 0.1427 0.2215 0.3597 0.8473 115
library(lattice)

trellis.par.set(caretTheme())
dotplot(resamps, metric = "Rsquared")

The earth model is the winner. The sensitivity plots below demonstrate that the model is intuitive:

  1. Higher population leads to higher number of unemployed
  2. Lower consumption leads to higher number of unemployed
  3. Higher unemployment duration leads to higher number of unemployed
library(earth)
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
plotmo(earth.mod$finalModel)
##  grid:    pce      pop psavert uempmed month2 month3 month4 month5 month6
##       3082.45 242515.5     7.6     6.9      0      0      0      0      0
##  month7 month8 month9 month10 month11 month12
##       0      0      0       0       0       0