The document demonstrates time series cross-validation using the `caret`

package. The methodology is generally consistent with Rob Hyndman’s recommendation for how to do time series cross-validation.

Here’s a graphical summary of the method that will be applied to the `economics`

data set from `ggplot2`

.

Time series cross-validation answers two important questions:

If we have 10 different models, which one is the best?

Some of the 10 models may require tuning during model training, which tuning parameter values should we choose?

These are the exact questions that K-fold cross-validation answers for cross-sectional data sets. Unlike K-fold cross-validation, the hold-out data sets (n = 12) are **adjacent** observations. Hence, before doing time series cross-validation, we need to make sure that the data set is sorted appropriately by time (i.e., oldest observation to newest observation).

```
economics <- ggplot2::economics
knitr::kable(head(economics))
```

date | pce | pop | psavert | uempmed | unemploy |
---|---|---|---|---|---|

1967-06-30 | 507.8 | 198712 | 9.8 | 4.5 | 2944 |

1967-07-31 | 510.9 | 198911 | 9.8 | 4.7 | 2945 |

1967-08-31 | 516.7 | 199113 | 9.0 | 4.6 | 2958 |

1967-09-30 | 513.3 | 199311 | 9.8 | 4.9 | 3143 |

1967-10-31 | 518.5 | 199498 | 9.7 | 4.7 | 3066 |

1967-11-30 | 526.2 | 199657 | 9.4 | 4.8 | 3018 |

`knitr::kable(tail(economics))`

date | pce | pop | psavert | uempmed | unemploy | |
---|---|---|---|---|---|---|

473 | 2006-10-31 | 9410.8 | 300836 | -0.9 | 8.2 | 6826 |

474 | 2006-11-30 | 9478.5 | 301070 | -1.1 | 7.3 | 6849 |

475 | 2006-12-31 | 9540.3 | 301296 | -0.9 | 8.1 | 7017 |

476 | 2007-01-31 | 9610.6 | 301481 | -1.0 | 8.1 | 6865 |

477 | 2007-02-28 | 9653.0 | 301684 | -0.7 | 8.5 | 6724 |

478 | 2007-03-31 | 9705.0 | 301913 | -1.3 | 8.7 | 6801 |

Here are the variable names from the `?economics`

documentation.

Variable Name | Description |
---|---|

date | Month of data collection |

psavert | personal savings rate |

pce | personal consumption expenditures, in billions of dollars |

unemploy | number of unemployed in thousands |

uempmed | median duration of unemployment, in week |

pop | total population, in thousands |

Suppose we want to forecast the number of unemployed (`unemploy`

) for each month-end between now and next year. In other words, we want to build a model that can give us accurate forecasts for the next 12-months. As with most time series models, we assume that the independent variables are all exogenous.

The `caret`

package makes time series cross-validation very easy. First, we are going to calculate a categorical variable to store the month of the year (because there may be seasonality). Second, in order to make the research reproducible, we are going to set up seed values.

```
library(lubridate)
economics$month <- as.factor(month(economics$date))
#### creating sampling seeds ####
set.seed(123)
seeds <- vector(mode = "list", length = 432)
for(i in 1:431) seeds[[i]] <- sample.int(1000, 5)
## For the last model:
seeds[[432]] <- sample.int(1000, 1)
#########
```

Third, we set up the computer for parallel processing.

`library(doParallel)`

```
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
```

`registerDoParallel(cores=2)`

Then, we tell `caret`

that we want to measure model performance and select tuning parameters based on time series cross-validation.

`library(caret)`

```
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
##
## The following object is masked _by_ '.GlobalEnv':
##
## economics
```

```
myTimeControl <- trainControl(method = "timeslice",
initialWindow = 36,
horizon = 12,
fixedWindow = FALSE,
allowParallel = TRUE,
seeds = seeds)
```

Then, we try 10 different models. Some models require tuning parameters, so we tell `caret`

to try 5 different combinations of tuning parameters for each model that requires tuning parameters. For linear regression models, there are no tuning parameters.

```
tuneLength.num <- 5
glmnet.mod <- train(unemploy ~ . - date,
data = economics,
method = "glmnet",
family = "gaussian",
trControl = myTimeControl,
tuneLength=tuneLength.num)
```

```
## Loading required package: glmnet
## Loading required package: Matrix
## Loaded glmnet 2.0-2
```

```
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
```

```
pois.mod <- train(unemploy ~ . - date,
data = economics,
method = "glmnet",
family = "poisson",
trControl = myTimeControl,
tuneLength=tuneLength.num)
```

```
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
```

```
lm.mod <- train(unemploy ~ . - date,
data = economics,
method = "lm",
trControl = myTimeControl,
tuneLength=tuneLength.num)
earth.mod <- train(unemploy ~ . - date,
data = economics,
method = "earth",
trControl = myTimeControl,
tuneLength=tuneLength.num)
```

```
## Loading required package: earth
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
```

```
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
```

```
earth.pois.mod <- train(unemploy ~ . - date,
data = economics,
method = "earth",
glm=list(family=poisson),
trControl = myTimeControl,
tuneLength=tuneLength.num)
```

```
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
```

```
gam.mod <- train(unemploy ~ . - date,
data = economics,
method = "gam",
trControl = myTimeControl,
tuneLength=tuneLength.num)
```

```
## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.8-6. For overview type 'help("mgcv-package")'.
```

```
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
```

```
rpart.mod <- train(unemploy ~ . - date,
data = economics,
method = "rpart",
trControl = myTimeControl,
tuneLength=tuneLength.num)
```

`## Loading required package: rpart`

```
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
```

```
party.mod <- train(unemploy ~ . - date,
data = economics,
method = "ctree",
trControl = myTimeControl,
tuneLength=tuneLength.num)
```

```
## Loading required package: party
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Loading required package: sandwich
```

```
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
```

```
rf.mod <- train(unemploy ~ . - date,
data = economics,
method = "rf",
trControl = myTimeControl,
tuneLength=tuneLength.num)
```

```
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
```

```
gbm.mod <- train(unemploy ~ . - date,
data = economics,
method = "gbm",
distribution="poisson",
trControl = myTimeControl,
tuneLength=tuneLength.num,
verbose=FALSE)
```

```
## Loading required package: gbm
## Loading required package: survival
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: splines
## Loaded gbm 2.1.1
## Loading required package: plyr
##
## Attaching package: 'plyr'
##
## The following object is masked from 'package:modeltools':
##
## empty
##
## The following object is masked from 'package:lubridate':
##
## here
```

```
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
```

Finally, we select the best model based on cross-validation \(R^2\).

```
### model performance ###
resamps <- resamples(list(glmnet = glmnet.mod,
glmnet.pois = pois.mod,
lm = lm.mod,
earth=earth.mod,
earth.pois=earth.pois.mod,
gbm=gbm.mod,
gam=gam.mod,
rf=rf.mod,
rpart=rpart.mod,
party=party.mod))
resamps
```

```
##
## Call:
## resamples.default(x = list(glmnet = glmnet.mod, glmnet.pois = pois.mod,
## = gbm.mod, gam = gam.mod, rf = rf.mod, rpart = rpart.mod, party
## = party.mod))
##
## Models: glmnet, glmnet.pois, lm, earth, earth.pois, gbm, gam, rf, rpart, party
## Number of resamples: 431
## Performance metrics: RMSE, Rsquared
## Time estimates for: everything, final model fit
```

```
ss <- summary(resamps)
knitr::kable(ss[[3]]$Rsquared)
```

Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s | |
---|---|---|---|---|---|---|---|

glmnet | 0.0000000 | 0.07989 | 0.1935 | 0.2835 | 0.4517 | 0.9520 | 0 |

glmnet.pois | 0.0000009 | 0.05208 | 0.1748 | 0.2777 | 0.4479 | 0.9595 | 0 |

lm | 0.0000193 | 0.08146 | 0.2168 | 0.3064 | 0.5026 | 0.9511 | 0 |

earth | 0.0000007 | 0.14110 | 0.3416 | 0.4096 | 0.6830 | 0.9890 | 0 |

earth.pois | 0.0006496 | 0.13110 | 0.3901 | 0.4054 | 0.6272 | 0.9921 | 276 |

gbm | 0.0000016 | 0.05422 | 0.1818 | 0.2556 | 0.3944 | 0.8898 | 7 |

gam | 0.0000001 | 0.08569 | 0.3467 | 0.3878 | 0.6938 | 0.9571 | 12 |

rf | 0.0000349 | 0.03126 | 0.1362 | 0.2116 | 0.3274 | 0.9332 | 0 |

rpart | 0.0000023 | 0.02449 | 0.1142 | 0.1923 | 0.2842 | 0.8225 | 288 |

party | 0.0000020 | 0.03997 | 0.1427 | 0.2215 | 0.3597 | 0.8473 | 115 |

```
library(lattice)
trellis.par.set(caretTheme())
dotplot(resamps, metric = "Rsquared")
```

The earth model is the winner. The sensitivity plots below demonstrate that the model is intuitive:

- Higher population leads to higher number of unemployed
- Lower consumption leads to higher number of unemployed
- Higher unemployment duration leads to higher number of unemployed

`library(earth)`

```
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
```

`plotmo(earth.mod$finalModel)`

```
## grid: pce pop psavert uempmed month2 month3 month4 month5 month6
## 3082.45 242515.5 7.6 6.9 0 0 0 0 0
## month7 month8 month9 month10 month11 month12
## 0 0 0 0 0 0
```