modeltime
This document walks through time series modelling and forecasting with the modeltime
R package, which gives access to a wide range of time series models with easy tooling to fit, evaluate, forecast and visualise them.
To illustrate its use, we will take data from the UK Covid-19 Dashboard for daily positivity rates in PCR tests.
This blog is based on https://www.business-science.io/code-tools/2020/06/29/introducing-modeltime.html.
The first step is to get the data. This can easily be obtained from the dashboard API as a csv file.
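The code below assumes the standard tidy time series stack is loaded; a minimal setup (the comments note which functions each package supplies):

library(tidyverse)   ## read_csv(), dplyr verbs, set_names()
library(tidymodels)  ## recipes, workflows, parsnip model specifications
library(modeltime)   ## modeltime_table(), calibration and forecasting
library(timetk)      ## plot_time_series(), time_series_split()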
<- read_csv("https://api.coronavirus.data.gov.uk/v2/data?areaType=nation&areaCode=E92000001&metric=uniqueCasePositivityBySpecimenDateRollingSum&format=csv", show_col_types = F)
df
<- read_csv("https://api.coronavirus.data.gov.uk/v2/data?areaType=nation&areaCode=E92000001&metric=uniquePeopleTestedBySpecimenDateRollingSum&format=csv", show_col_types = F)
df1
head(df)
## # A tibble: 6 × 5
## areaCode areaName areaType date uniqueCasePositivityBySpecimenDateRoll…
## <chr> <chr> <chr> <date> <dbl>
## 1 E92000001 England nation 2022-01-11 26
## 2 E92000001 England nation 2022-01-10 27.9
## 3 E92000001 England nation 2022-01-09 28.8
## 4 E92000001 England nation 2022-01-08 29.7
## 5 E92000001 England nation 2022-01-07 30
## 6 E92000001 England nation 2022-01-06 31
We can use the plot_time_series
function to chart the data, but first we’ll reduce each dataset to date and value fields.
library(cowplot)

df <- df %>%
  select(date, uniqueCasePositivityBySpecimenDateRollingSum) %>%
  set_names(c("date", "value"))

df1 <- df1 %>%
  select(date, uniquePeopleTestedBySpecimenDateRollingSum) %>%
  set_names(c("date", "value"))

a <- df %>%
  plot_time_series(date, value, .interactive = FALSE) +
  ggtitle("Test positivity time series")

b <- df1 %>%
  plot_time_series(date, value, .interactive = FALSE) +
  ggtitle("People tested")

cowplot::plot_grid(a, b)
Setting .interactive=TRUE
converts the plot to a plotly
chart which adds interactivity.
df %>%
  plot_time_series(date, value, .interactive = TRUE, .smooth_span = .3, .plotly_slider = TRUE) ## adds a date slider to zoom in or out of the data
We’ll split the data into training and test datasets using the time_series_split function.
splits <- df %>%
  time_series_split(assess = "3 months", cumulative = TRUE) ## use the last 3 months as test data, and all previous data as training data
splits
## <Analysis/Assess/Total>
## <614/89/703>
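The output reads training/test/total: 614 rows of training data, 89 rows held out for testing, 703 in all. A quick check using rsample’s training() and testing() accessors (which we rely on throughout):

nrow(training(splits)) ## 614
nrow(testing(splits))  ## 89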
Let’s visualise the split:
splits %>%
  tk_time_series_cv_plan() %>% ## converts splits to a data frame
  plot_time_series_cv_plan(date, value, .interactive = FALSE) ## shows test and train data
It’s now time to fit some models. For this exercise we’ll fit six models and compare their performance:
Auto-ARIMA - a conventional time series model (autoregressive integrated moving average)
Exponential smoothing - widely used to fit time series
Prophet - a modern and powerful time series algorithm developed at Facebook (Taylor and Letham 2017)
Random forest - a widely used machine learning model
Prophet + xgboost - a state-of-the-art algorithm combining prophet with xgboost, a very effective boosted tree model (Jain and Prasad 2020)
Spline model - multivariate adaptive regression splines (MARS), from the earth
package.
We use a model specification to set up each model - the same form is used for all of them. There are 3 elements:
Specifying the model ‘type’ - in this case arima_reg
Setting the model engine that undertakes the analysis - auto_arima
Supplying a formula, value ~ date,
which means value as a function of date - fitted using the training data from the splits
object.
model_fit_arima <- arima_reg() %>%
  set_engine("auto_arima") %>%
  fit(value ~ date, training(splits))
Note it identifies that the data is a daily dataset - the [7] in the fitted model is a weekly (7-day) seasonal period. The output of this step is:
model_fit_arima
## parsnip model object
##
## Series: outcome
## ARIMA(3,1,2)(0,0,1)[7]
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 sma1
## -0.4559 0.3453 0.8995 1.3602 0.9084 0.0578
## s.e. 0.0337 0.0458 0.0489 0.0409 0.0723 0.0429
##
## sigma^2 estimated as 0.02734: log likelihood=235.18
## AIC=-456.37 AICc=-456.18 BIC=-425.44
model_fit_ets <- exp_smoothing() %>%
  set_engine(engine = "ets") %>%
  fit(value ~ date, data = training(splits))
model_fit_ets
## parsnip model object
##
## ETS(A,Ad,N)
##
## Call:
## forecast::ets(y = outcome, model = model_ets, damped = damping_ets,
##
## Call:
## alpha = alpha, beta = beta, gamma = gamma)
##
## Smoothing parameters:
## alpha = 0.9831
## beta = 0.9831
## phi = 0.9395
##
## Initial states:
## l = 0.9122
## b = 0.1923
##
## sigma: 0.1676
##
## AIC AICc BIC
## 1755.081 1755.219 1781.601
We can use similar code for prophet:
model_fit_prophet <- prophet_reg(seasonality_yearly = TRUE, seasonality_weekly = TRUE) %>%
  set_engine("prophet") %>%
  fit(value ~ date, training(splits))
model_fit_prophet
## parsnip model object
##
## PROPHET Model
## - growth: 'linear'
## - n.changepoints: 25
## - changepoint.range: 0.8
## - yearly.seasonality: 'TRUE'
## - weekly.seasonality: 'TRUE'
## - daily.seasonality: 'auto'
## - seasonality.mode: 'additive'
## - changepoint.prior.scale: 0.05
## - seasonality.prior.scale: 10
## - holidays.prior.scale: 10
## - logistic_cap: NULL
## - logistic_floor: NULL
## - extra_regressors: 0
At this point we can also generate additional data features using recipe steps from tidymodels and timetk.
In this case:
step_timeseries_signature
indexes the data by date and extracts a set of calendar features for machine learning
step_rm
removes some columns
step_fourier
adds Fourier series terms (sine and cosine features) based on the date
step_dummy
converts all categorical variables (e.g. day of the week) to a series of columns with 1 where the feature is present and 0 where it isn’t. This is the format that many machine learning algorithms need for categorical variables.
The specification merely contains the details of what should be done with the data; the prep
and juice
steps actually apply it:
recipe_spec <- recipe(value ~ date, training(splits)) %>%
  step_timeseries_signature(date) %>%
  step_rm(contains("am.pm"), contains("hour"), contains("minute"),
          contains("second"), contains("xts")) %>%
  step_fourier(date, period = 365, K = 5) %>%
  step_dummy(all_nominal())

recipe_spec %>% prep() %>% juice()
## # A tibble: 614 × 47
## date value date_index.num date_year date_year.iso date_half
## <date> <dbl> <dbl> <int> <int> <int>
## 1 2020-02-08 1 1581120000 2020 2020 1
## 2 2020-02-09 0.9 1581206400 2020 2020 1
## 3 2020-02-10 0.6 1581292800 2020 2020 1
## 4 2020-02-11 0.6 1581379200 2020 2020 1
## 5 2020-02-12 0.4 1581465600 2020 2020 1
## 6 2020-02-13 0.4 1581552000 2020 2020 1
## 7 2020-02-14 0.3 1581638400 2020 2020 1
## 8 2020-02-15 0.2 1581724800 2020 2020 1
## 9 2020-02-16 0.2 1581811200 2020 2020 1
## 10 2020-02-17 0.2 1581897600 2020 2020 1
## # … with 604 more rows, and 41 more variables: date_quarter <int>,
## # date_month <int>, date_day <int>, date_wday <int>, date_mday <int>,
## # date_qday <int>, date_yday <int>, date_mweek <int>, date_week <int>,
## # date_week.iso <int>, date_week2 <int>, date_week3 <int>, date_week4 <int>,
## # date_mday7 <int>, date_sin365_K1 <dbl>, date_cos365_K1 <dbl>,
## # date_sin365_K2 <dbl>, date_cos365_K2 <dbl>, date_sin365_K3 <dbl>,
## # date_cos365_K3 <dbl>, date_sin365_K4 <dbl>, date_cos365_K4 <dbl>, …
The features created include:
date_index.num
- seconds since 1/1/1970
date_half
- six month split
date_mday
- day in the month
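To eyeball a few of these features directly (a quick illustrative check, not part of the original workflow):

recipe_spec %>%
  prep() %>%
  juice() %>%
  select(date, date_index.num, date_half, date_mday) %>%
  head()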
We can now use the enhanced dataset for the other algorithms:
model_spec_rf <- rand_forest(trees = 500, min_n = 50) %>%
  set_engine("randomForest")

workflow_fit_rf <- workflow() %>%
  add_model(model_spec_rf) %>%
  add_recipe(recipe_spec %>% step_rm(date)) %>%
  fit(training(splits))
workflow_fit_rf
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
##
## • step_timeseries_signature()
## • step_rm()
## • step_fourier()
## • step_dummy()
## • step_rm()
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call:
## randomForest(x = maybe_data_frame(x), y = y, ntree = ~500, nodesize = min_rows(~50, x))
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 15
##
## Mean of squared residuals: 6.089694
## % Var explained: 89.58
model_spec_prophet_boost <- prophet_boost(seasonality_yearly = TRUE) %>%
  set_engine("prophet_xgboost")

workflow_fit_prophet_boost <- workflow() %>%
  add_model(model_spec_prophet_boost) %>%
  add_recipe(recipe_spec) %>%
  fit(training(splits))
workflow_fit_prophet_boost
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: prophet_boost()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
##
## • step_timeseries_signature()
## • step_rm()
## • step_fourier()
## • step_dummy()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## PROPHET w/ XGBoost Errors
## ---
## Model 1: PROPHET
## - growth: 'linear'
## - n.changepoints: 25
## - changepoint.range: 0.8
## - yearly.seasonality: 'TRUE'
## - weekly.seasonality: 'auto'
## - daily.seasonality: 'auto'
## - seasonality.mode: 'additive'
## - changepoint.prior.scale: 0.05
## - seasonality.prior.scale: 10
## - holidays.prior.scale: 10
## - logistic_cap: NULL
## - logistic_floor: NULL
##
## ---
## Model 2: XGBoost Errors
##
## xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0,
## colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1,
## subsample = 1, objective = "reg:squarederror"), data = x$data,
## nrounds = 15, watchlist = x$watchlist, verbose = 0, nthread = 1)
model_spec_mars <- mars(mode = "regression") %>%
  set_engine("earth")

workflow_fit_mars <- workflow() %>%
  add_model(model_spec_mars) %>%
  add_recipe(recipe_spec) %>%
  fit(training(splits))
workflow_fit_mars
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: mars()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
##
## • step_timeseries_signature()
## • step_rm()
## • step_fourier()
## • step_dummy()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Selected 22 of 29 terms, and 9 of 46 predictors
## Termination condition: RSq changed by less than 0.001 at 29 terms
## Importance: date, date_index.num, date_sin365_K4, date_yday, ...
## Number of terms at each degree of interaction: 1 21 (additive model)
## GCV 0.4625768 RSS 245.6335 GRSq 0.9921103 RSq 0.9931544
Now let’s pool the results to compare the different models. The modeltime_table
function provides a facility for this.
model_table <- modeltime_table(
  model_fit_arima,
  model_fit_ets,
  model_fit_prophet,
  workflow_fit_rf,
  workflow_fit_prophet_boost,
  workflow_fit_mars
)
model_table
## # Modeltime Table
## # A tibble: 6 × 3
## .model_id .model .model_desc
## <int> <list> <chr>
## 1 1 <fit[+]> ARIMA(3,1,2)(0,0,1)[7]
## 2 2 <fit[+]> ETS(A,AD,N)
## 3 3 <fit[+]> PROPHET
## 4 4 <workflow> RANDOMFOREST
## 5 5 <workflow> PROPHET W/ XGBOOST ERRORS
## 6 6 <workflow> EARTH
We can now apply our models to the test data. In modeltime
this is called calibration.
calibration_table <- model_table %>%
  modeltime_calibrate(testing(splits))
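Printing calibration_table at this point shows what calibration adds: a .type column and a .calibration_data list-column holding each model’s test-set predictions and residuals.

calibration_table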
We can pass the output of the previous step to modeltime_forecast
to model the “held-out” portion based on our training data.
calibration_table %>%
  modeltime_forecast(actual_data = df) %>%
  plot_modeltime_forecast(.interactive = TRUE)
Neither ets nor arima predicts the increase in cases (due to omicron). Both prophet models seem to capture the rising trend but not the pattern. The random forest model seems to mirror the pattern quite well, as does the mars model. We can quantify the accuracy of our models using modeltime_accuracy:
calibration_table %>%
  modeltime_accuracy() %>%
  table_modeltime_accuracy(.interactive = FALSE)
Accuracy Table

| .model_id | .model_desc               | .type | mae   | mape   | mase   | smape | rmse  | rsq  |
|-----------|---------------------------|-------|-------|--------|--------|-------|-------|------|
| 1         | ARIMA(3,1,2)(0,0,1)[7]    | Test  | 5.23  | 27.73  | 13.90  | 32.00 | 8.22  | 0.18 |
| 2         | ETS(A,AD,N)               | Test  | 5.22  | 27.74  | 13.88  | 31.96 | 8.20  | 0.19 |
| 3         | PROPHET                   | Test  | 9.34  | 88.76  | 24.84  | 57.36 | 9.97  | 0.87 |
| 4         | RANDOMFOREST              | Test  | 5.07  | 25.93  | 13.49  | 32.80 | 7.90  | 0.87 |
| 5         | PROPHET W/ XGBOOST ERRORS | Test  | 5.60  | 54.34  | 14.90  | 39.71 | 6.28  | 0.89 |
| 6         | EARTH                     | Test  | 44.26 | 282.09 | 117.66 | 92.29 | 58.26 | 0.75 |
The r-squared values for arima and ets are very low but pretty good for the other models, although EARTH, despite its excellent in-sample fit, performs worst on the test data by some distance. Prophet + xgboost has the lowest overall root mean squared error (rmse) of 6.28 - roughly speaking, its predictions of the daily positivity rate are out by around 6 percentage points on a typical day.
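As a sanity check, we can reproduce the prophet + xgboost rmse by hand from its calibration data (a minimal sketch, assuming the .calibration_data tibbles hold .actual and .prediction columns, as in current modeltime versions):

calibration_table %>%
  filter(.model_id == 5) %>%      ## prophet + xgboost
  pull(.calibration_data) %>%
  purrr::pluck(1) %>%             ## the test-set predictions for that model
  summarise(rmse = sqrt(mean((.actual - .prediction)^2)))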
The final step is to take our best models and look ahead to see what they predict. Let’s forecast the next 6 months.
calibration_table %>%
  # Remove the ARIMA and ETS models, which had low accuracy
  filter(!.model_id %in% c(1:2)) %>%
  # Refit and forecast forward
  modeltime_refit(df) %>%
  modeltime_forecast(h = "6 months", actual_data = df) %>%
  plot_modeltime_forecast(.interactive = TRUE, .plotly_slider = TRUE)
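Incidentally, if you want the raw numbers rather than the chart, drop the plotting step - modeltime_forecast returns an ordinary tibble (with .index and .value columns) that you can inspect or export:

future_forecast <- calibration_table %>%
  filter(!.model_id %in% c(1:2)) %>%
  modeltime_refit(df) %>%
  modeltime_forecast(h = "6 months", actual_data = df)

head(future_forecast)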
What does this show?