• 1 Introduction
    • 1.1 Getting the data
    • 1.2 Initial plot
    • 1.3 Making it interactive
    • 1.4 Preparing for modelling
      • 1.4.1 Splitting the data
    • 1.5 Let’s model
      • 1.5.1 Autoarima
      • 1.5.2 Exponential smoothing (ets)
      • 1.5.3 Prophet
      • 1.5.4 Random forest
      • 1.5.5 Prophet-boost
      • 1.5.6 MARS
    • 1.6 Model evaluation
    • 1.7 Applying models to test data
    • 1.8 Forecast ahead
    • References

1 Introduction

This document walks through time series modelling and forecasting with the modeltime R package which gives access to a wide range of time series models with easy tooling to fit, evaluate, forecast and visualisations.

To illustrate its use we will use data from the UK Covid19 Dashboard for daily positivity rates in PCR tests.

This blog is based on https://www.business-science.io/code-tools/2020/06/29/introducing-modeltime.html.

1.1 Getting the data

The first step is to get the data. This can easily obtained from the dashboard API as a csv file.

df <- read_csv("https://api.coronavirus.data.gov.uk/v2/data?areaType=nation&areaCode=E92000001&metric=uniqueCasePositivityBySpecimenDateRollingSum&format=csv", show_col_types = F)

df1 <- read_csv("https://api.coronavirus.data.gov.uk/v2/data?areaType=nation&areaCode=E92000001&metric=uniquePeopleTestedBySpecimenDateRollingSum&format=csv", show_col_types = F)

head(df)
## # A tibble: 6 × 5
##   areaCode  areaName areaType date       uniqueCasePositivityBySpecimenDateRoll…
##   <chr>     <chr>    <chr>    <date>                                       <dbl>
## 1 E92000001 England  nation   2022-01-11                                    26  
## 2 E92000001 England  nation   2022-01-10                                    27.9
## 3 E92000001 England  nation   2022-01-09                                    28.8
## 4 E92000001 England  nation   2022-01-08                                    29.7
## 5 E92000001 England  nation   2022-01-07                                    30  
## 6 E92000001 England  nation   2022-01-06                                    31

1.2 Initial plot

We can use the plot_time_series function to chart the data but first we’ll reduce the dataset to date and value fields.

library(cowplot)

df <- df %>%
  select(date, uniqueCasePositivityBySpecimenDateRollingSum) %>%
  set_names(c("date", "value"))

df1 <- df1 %>%
  select(date, uniquePeopleTestedBySpecimenDateRollingSum) %>%
  set_names(c("date", "value"))

a <- df %>%
  plot_time_series(date, value, .interactive = FALSE) +
  ggtitle("Test positivity time series")

b <- df1 %>%
  plot_time_series(date, value, .interactive = FALSE) +
  ggtitle("People tested")

cowplot::plot_grid(a, b)

1.3 Making it interactive

Setting .interactive=TRUE converts the plot to a plotly chart which adds interactivity.

df %>%
  plot_time_series(date, value, .interactive = TRUE, .smooth_span = .3, .plotly_slider = TRUE)  ## adds a date slider to zoom in or out of the data
Apr 2020Jul 2020Oct 2020Jan 2021Apr 2021Jul 2021Oct 2021Jan 2022010203040
Time Series Plot

1.4 Preparing for modelling

1.4.1 Splitting the data

We’ll split the data in to training and test datasets using the time_series_split function.

splits <- df %>%
  time_series_split(assess = "3 months", cumulative = TRUE) ## using last 3 months as test data, and all previous data as training data

splits
## <Analysis/Assess/Total>
## <614/89/703>

Lets visualise:

splits %>%
  tk_time_series_cv_plan() %>%    ## converts splits to data frame
  plot_time_series_cv_plan(date, value, .interactive = FALSE) ## shows test and train data

1.5 Let’s model

Its now time to fit some models. For this exercise we’ll fit six models and compare their performance:

  • Autoarima - a conventional time series model (autoregressive moving average)

  • exponential smoothing - widely used to fit time series

  • Prophet - a modern and powerful time series algorithm developed by Facebook Taylor and Letham (2017)

  • Random forest - a widely used machine learning model

  • Prophet + xgboost - a state of the art algorithm combining prophet and xgboost - a very effective boosted tree model. Jain and Prasad (2020)

  • Spline model - Multivariate adaptive regression spline (MARS) - from the earth package.

1.5.1 Autoarima

We are using a model specification to set up the model - this form is used in all models. There are 3 elements:

  • Specifying the model ‘type’ - in this case arima_reg

  • Setting model engine to undertake the analysis - auto_arima

  • Supplying a formula value ~ date which means value as a function of date - using the training data from the splits data frame.

model_fit_arima <- arima_reg() %>%
  set_engine("auto_arima") %>%
  fit(value ~ date, training(splits))

Note it identifies that the data is a daily dataset. The output of this step is:

model_fit_arima
## parsnip model object
## 
## Series: outcome 
## ARIMA(3,1,2)(0,0,1)[7] 
## 
## Coefficients:
##           ar1     ar2     ar3     ma1     ma2    sma1
##       -0.4559  0.3453  0.8995  1.3602  0.9084  0.0578
## s.e.   0.0337  0.0458  0.0489  0.0409  0.0723  0.0429
## 
## sigma^2 estimated as 0.02734:  log likelihood=235.18
## AIC=-456.37   AICc=-456.18   BIC=-425.44

1.5.2 Exponential smoothing (ets)

model_fit_ets <- exp_smoothing() %>%
    set_engine(engine = "ets") %>%
    fit(value ~ date, data = training(splits))

model_fit_ets
## parsnip model object
## 
## ETS(A,Ad,N) 
## 
## Call:
##  forecast::ets(y = outcome, model = model_ets, damped = damping_ets,  
## 
##  Call:
##      alpha = alpha, beta = beta, gamma = gamma) 
## 
##   Smoothing parameters:
##     alpha = 0.9831 
##     beta  = 0.9831 
##     phi   = 0.9395 
## 
##   Initial states:
##     l = 0.9122 
##     b = 0.1923 
## 
##   sigma:  0.1676
## 
##      AIC     AICc      BIC 
## 1755.081 1755.219 1781.601

1.5.3 Prophet

We can use similar code for prophet:

model_fit_prophet <- prophet_reg(seasonality_yearly = TRUE, seasonality_weekly = TRUE) %>%
  set_engine("prophet") %>%
  fit(value ~ date, training(splits))
model_fit_prophet
## parsnip model object
## 
## PROPHET Model
## - growth: 'linear'
## - n.changepoints: 25
## - changepoint.range: 0.8
## - yearly.seasonality: 'TRUE'
## - weekly.seasonality: 'TRUE'
## - daily.seasonality: 'auto'
## - seasonality.mode: 'additive'
## - changepoint.prior.scale: 0.05
## - seasonality.prior.scale: 10
## - holidays.prior.scale: 10
## - logistic_cap: NULL
## - logistic_floor: NULL
## - extra_regressors: 0

1.5.3.1 Feature engineering

At this point we can also generate additional data features using functions from tidymodels

In this case:

  • step_timeseries_signature indexes the data by date and extract features for machine learning.

  • step_rm removes some columns

  • step_fourier applies a Fourier transform to the date

  • step_dummy converts all categorical variables to a series of columns with 1 where the feature is present and 0 where it isn’t (e.g. day of the week). This is format that many machine learning algorithms need for categorical variables.

The specification merely contains the details of what should be done with the data; the prep and juice steps actually apply it

recipe_spec <- recipe(value ~ date, training(splits)) %>%
  step_timeseries_signature(date) %>%
  step_rm(contains("am.pm"), contains("hour"), contains("minute"),
          contains("second"), contains("xts")) %>%
  step_fourier(date, period = 365, K = 5) %>%
  step_dummy(all_nominal())

recipe_spec %>% prep() %>% juice()
## # A tibble: 614 × 47
##    date       value date_index.num date_year date_year.iso date_half
##    <date>     <dbl>          <dbl>     <int>         <int>     <int>
##  1 2020-02-08   1       1581120000      2020          2020         1
##  2 2020-02-09   0.9     1581206400      2020          2020         1
##  3 2020-02-10   0.6     1581292800      2020          2020         1
##  4 2020-02-11   0.6     1581379200      2020          2020         1
##  5 2020-02-12   0.4     1581465600      2020          2020         1
##  6 2020-02-13   0.4     1581552000      2020          2020         1
##  7 2020-02-14   0.3     1581638400      2020          2020         1
##  8 2020-02-15   0.2     1581724800      2020          2020         1
##  9 2020-02-16   0.2     1581811200      2020          2020         1
## 10 2020-02-17   0.2     1581897600      2020          2020         1
## # … with 604 more rows, and 41 more variables: date_quarter <int>,
## #   date_month <int>, date_day <int>, date_wday <int>, date_mday <int>,
## #   date_qday <int>, date_yday <int>, date_mweek <int>, date_week <int>,
## #   date_week.iso <int>, date_week2 <int>, date_week3 <int>, date_week4 <int>,
## #   date_mday7 <int>, date_sin365_K1 <dbl>, date_cos365_K1 <dbl>,
## #   date_sin365_K2 <dbl>, date_cos365_K2 <dbl>, date_sin365_K3 <dbl>,
## #   date_cos365_K3 <dbl>, date_sin365_K4 <dbl>, date_cos365_K4 <dbl>, …

The features created include:

  • date.index.num - seconds since 1/1/1970

  • date_half - six month split

  • date_mday - day in the month

We can now use the enhanced dataset for the other algorithms:

1.5.4 Random forest

model_spec_rf <- rand_forest(trees = 500, min_n = 50) %>%
  set_engine("randomForest")

workflow_fit_rf <- workflow() %>%
  add_model(model_spec_rf) %>%
  add_recipe(recipe_spec %>% step_rm(date)) %>%
  fit(training(splits))

workflow_fit_rf
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
## 
## • step_timeseries_signature()
## • step_rm()
## • step_fourier()
## • step_dummy()
## • step_rm()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
##  randomForest(x = maybe_data_frame(x), y = y, ntree = ~500, nodesize = min_rows(~50,      x)) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 15
## 
##           Mean of squared residuals: 6.089694
##                     % Var explained: 89.58

1.5.5 Prophet-boost

model_spec_prophet_boost <- prophet_boost(seasonality_yearly = TRUE) %>%
  set_engine("prophet_xgboost") 

workflow_fit_prophet_boost <- workflow() %>%
  add_model(model_spec_prophet_boost) %>%
  add_recipe(recipe_spec) %>%
  fit(training(splits))

workflow_fit_prophet_boost 
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: prophet_boost()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
## 
## • step_timeseries_signature()
## • step_rm()
## • step_fourier()
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## PROPHET w/ XGBoost Errors
## ---
## Model 1: PROPHET
##  - growth: 'linear'
##  - n.changepoints: 25
##  - changepoint.range: 0.8
##  - yearly.seasonality: 'TRUE'
##  - weekly.seasonality: 'auto'
##  - daily.seasonality: 'auto'
##  - seasonality.mode: 'additive'
##  - changepoint.prior.scale: 0.05
##  - seasonality.prior.scale: 10
##  - holidays.prior.scale: 10
##  - logistic_cap: NULL
##  - logistic_floor: NULL
## 
## ---
## Model 2: XGBoost Errors
## 
## xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0, 
##     colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1, 
##     subsample = 1, objective = "reg:squarederror"), data = x$data, 
##     nrounds = 15, watchlist = x$watchlist, verbose = 0, nthread = 1)

1.5.6 MARS

model_spec_mars <- mars(mode = "regression") %>%
    set_engine("earth") 

workflow_fit_mars <- workflow() %>%
  add_model(model_spec_mars) %>%
  add_recipe(recipe_spec) %>%
  fit(training(splits))

workflow_fit_mars
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: mars()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
## 
## • step_timeseries_signature()
## • step_rm()
## • step_fourier()
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Selected 22 of 29 terms, and 9 of 46 predictors
## Termination condition: RSq changed by less than 0.001 at 29 terms
## Importance: date, date_index.num, date_sin365_K4, date_yday, ...
## Number of terms at each degree of interaction: 1 21 (additive model)
## GCV 0.4625768    RSS 245.6335    GRSq 0.9921103    RSq 0.9931544

1.6 Model evaluation

Now lets pool results to make comparison between the different models. The modeltable function provides a facility for this.

model_table <- modeltime_table(
  model_fit_arima, 
  model_fit_ets, 
  model_fit_prophet,
  workflow_fit_rf,
  workflow_fit_prophet_boost, 
  workflow_fit_mars
) 

model_table
## # Modeltime Table
## # A tibble: 6 × 3
##   .model_id .model     .model_desc              
##       <int> <list>     <chr>                    
## 1         1 <fit[+]>   ARIMA(3,1,2)(0,0,1)[7]   
## 2         2 <fit[+]>   ETS(A,AD,N)              
## 3         3 <fit[+]>   PROPHET                  
## 4         4 <workflow> RANDOMFOREST             
## 5         5 <workflow> PROPHET W/ XGBOOST ERRORS
## 6         6 <workflow> EARTH

1.7 Applying models to test data

We can now apply our models to the test data. In modeltime this is called calibration.

calibration_table <- model_table %>%
  modeltime_calibrate(testing(splits))

We can pass the output of the previous step to model_time_forecast to model the “held-out” portion based on our training data.

calibration_table %>%
  modeltime_forecast(actual_data = df) %>%
  plot_modeltime_forecast(.interactive = TRUE)
Jul 2020Jan 2021Jul 2021Jan 2022−100−50050100150200250
LegendACTUAL1_ARIMA(3,1,2)(0,0,1)[7]2_ETS(A,AD,N)3_PROPHET4_RANDOMFOREST5_PROPHET W/ XGBOOST ERRORS6_EARTHForecast Plot

Neither ets or arima predict the increase (due to omicron) in cases. Both prophet models seem to capture the rising trend but not the pattern. The random forest model seems to mirror the pattern quite well, as does the mars model. We can quantify the accuracy of our models using modeltime_accuracy:

calibration_table %>%
  modeltime_accuracy() %>%
  table_modeltime_accuracy(.interactive = FALSE)
Accuracy Table
.model_id .model_desc .type mae mape mase smape rmse rsq
1 ARIMA(3,1,2)(0,0,1)[7] Test 5.23 27.73 13.90 32.00 8.22 0.18
2 ETS(A,AD,N) Test 5.22 27.74 13.88 31.96 8.20 0.19
3 PROPHET Test 9.34 88.76 24.84 57.36 9.97 0.87
4 RANDOMFOREST Test 5.07 25.93 13.49 32.80 7.90 0.87
5 PROPHET W/ XGBOOST ERRORS Test 5.60 54.34 14.90 39.71 6.28 0.89
6 EARTH Test 44.26 282.09 117.66 92.29 58.26 0.75

The r-squared values for arima and ets are very low but pretty good for the other models. EARTH has the lowest overall root mean squared error (rmse) of 2.5 - this means on average the model predicts the daily positivity rate +- 2.5%.

1.8 Forecast ahead

The final step is to take our best models and look ahead to see what they predict. Lets forecast the next 6 months.

calibration_table %>%
  # Remove ARIMA model with low accuracy
  filter(!.model_id %in% c(1:2)) %>%
  
  # Refit and Forecast Forward
  modeltime_refit(df) %>%
  modeltime_forecast(h = "6 months", actual_data = df) %>%
  plot_modeltime_forecast(.interactive = TRUE, .plotly_slider = TRUE)
Jul 2020Jan 2021Jul 2021Jan 2022Jul 2022−50050100150
LegendACTUAL3_PROPHET4_RANDOMFOREST5_PROPHET W/ XGBOOST ERRORS6_EARTHForecast Plot

What does this show?

References

Jain, Garima, and Rajeev Ranjan Prasad. 2020. “Machine Learning, Prophet and XGBoost Algorithm: Analysis of Traffic Forecasting in Telecom Networks with Time Series Data.” 2020 8th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), June. https://doi.org/10.1109/icrito48877.2020.9197864.
Taylor, Sean J., and Benjamin Letham. 2017. “Forecasting at Scale.” https://doi.org/10.7287/peerj.preprints.3190v2.