modeltime
This document walks through time series modelling and forecasting with the modeltime R package, which gives access to a wide range of time series models with easy tooling to fit, evaluate, forecast and visualise them.
To illustrate its use we will take daily positivity rates in PCR tests from the UK Covid-19 Dashboard.
This blog is based on https://www.business-science.io/code-tools/2020/06/29/introducing-modeltime.html.
The first step is to get the data. This can easily be obtained from the dashboard API as a csv file.
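This post assumes the following packages are loaded (a minimal setup; install them first if needed):
library(tidyverse)  ## read_csv plus the dplyr and ggplot2 verbs used below
library(timetk)     ## plot_time_series, time_series_split, step_timeseries_signature
library(modeltime)  ## model specifications, modeltime_table, calibration, forecasting
library(tidymodels) ## recipes, workflows and parsnip engines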
df <- read_csv("https://api.coronavirus.data.gov.uk/v2/data?areaType=nation&areaCode=E92000001&metric=uniqueCasePositivityBySpecimenDateRollingSum&format=csv", show_col_types = F)
df1 <- read_csv("https://api.coronavirus.data.gov.uk/v2/data?areaType=nation&areaCode=E92000001&metric=uniquePeopleTestedBySpecimenDateRollingSum&format=csv", show_col_types = F)
head(df)
## # A tibble: 6 × 5
## areaCode areaName areaType date uniqueCasePositivityBySpecimenDateRoll…
## <chr> <chr> <chr> <date> <dbl>
## 1 E92000001 England nation 2022-01-11 26
## 2 E92000001 England nation 2022-01-10 27.9
## 3 E92000001 England nation 2022-01-09 28.8
## 4 E92000001 England nation 2022-01-08 29.7
## 5 E92000001 England nation 2022-01-07 30
## 6 E92000001 England nation 2022-01-06 31
We can use the plot_time_series function to chart the data, but first we’ll reduce the dataset to date and value fields.
library(cowplot)
df <- df %>%
select(date, uniqueCasePositivityBySpecimenDateRollingSum) %>%
set_names(c("date", "value"))
df1 <- df1 %>%
select(date, uniquePeopleTestedBySpecimenDateRollingSum) %>%
set_names(c("date", "value"))
a <- df %>%
plot_time_series(date, value, .interactive = FALSE) +
ggtitle("Test positivity time series")
b <- df1 %>%
plot_time_series(date, value, .interactive = FALSE) +
ggtitle("People tested")
cowplot::plot_grid(a, b)
Setting .interactive = TRUE converts the plot to a plotly chart, which adds interactivity.
df %>%
plot_time_series(date, value, .interactive = TRUE, .smooth_span = .3, .plotly_slider = TRUE) ## adds a date slider to zoom in or out of the data
We’ll split the data into training and test datasets using the time_series_split function.
splits <- df %>%
time_series_split(assess = "3 months", cumulative = TRUE) ## using last 3 months as test data, and all previous data as training data
splits
## <Analysis/Assess/Total>
## <614/89/703>
The split holds 614 observations for training, 89 for testing, 703 in total. Let’s visualise:
splits %>%
tk_time_series_cv_plan() %>% ## converts splits to data frame
plot_time_series_cv_plan(date, value, .interactive = FALSE) ## shows test and train data
It’s now time to fit some models. For this exercise we’ll fit six models and compare their performance:
Auto ARIMA - a conventional time series model (autoregressive integrated moving average)
Exponential smoothing (ETS) - widely used to fit time series
Prophet - a modern and powerful time series algorithm developed at Facebook, Taylor and Letham (2017)
Random forest - a widely used machine learning model
Prophet + xgboost - a state-of-the-art algorithm combining prophet with xgboost, a very effective boosted tree model, Jain and Prasad (2020)
Spline model - Multivariate adaptive regression spline (MARS) - from the earth package.
We use a model specification to set up each model - the same form is used for all the models. There are three elements:
Specifying the model ‘type’ - in this case arima_reg
Setting the model engine to undertake the analysis - auto_arima
Supplying a formula value ~ date, which means value as a function of date - using the training data from the splits data frame.
model_fit_arima <- arima_reg() %>%
set_engine("auto_arima") %>%
fit(value ~ date, training(splits))
Note it identifies that the data is a daily dataset - the [7] in the output below is the weekly seasonal period. The output of this step is:
model_fit_arima
## parsnip model object
##
## Series: outcome
## ARIMA(3,1,2)(0,0,1)[7]
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 sma1
## -0.4559 0.3453 0.8995 1.3602 0.9084 0.0578
## s.e. 0.0337 0.0458 0.0489 0.0409 0.0723 0.0429
##
## sigma^2 estimated as 0.02734: log likelihood=235.18
## AIC=-456.37 AICc=-456.18 BIC=-425.44
model_fit_ets <- exp_smoothing() %>%
set_engine(engine = "ets") %>%
fit(value ~ date, data = training(splits))
model_fit_ets
## parsnip model object
##
## ETS(A,Ad,N)
##
## Call:
## forecast::ets(y = outcome, model = model_ets, damped = damping_ets,
##     alpha = alpha, beta = beta, gamma = gamma)
##
## Smoothing parameters:
## alpha = 0.9831
## beta = 0.9831
## phi = 0.9395
##
## Initial states:
## l = 0.9122
## b = 0.1923
##
## sigma: 0.1676
##
## AIC AICc BIC
## 1755.081 1755.219 1781.601
We can use similar code for prophet:
model_fit_prophet <- prophet_reg(seasonality_yearly = TRUE, seasonality_weekly = TRUE) %>%
set_engine("prophet") %>%
fit(value ~ date, training(splits))
model_fit_prophet
## parsnip model object
##
## PROPHET Model
## - growth: 'linear'
## - n.changepoints: 25
## - changepoint.range: 0.8
## - yearly.seasonality: 'TRUE'
## - weekly.seasonality: 'TRUE'
## - daily.seasonality: 'auto'
## - seasonality.mode: 'additive'
## - changepoint.prior.scale: 0.05
## - seasonality.prior.scale: 10
## - holidays.prior.scale: 10
## - logistic_cap: NULL
## - logistic_floor: NULL
## - extra_regressors: 0
At this point we can also generate additional data features using recipe steps from timetk and the tidymodels recipes package.
In this case:
step_timeseries_signature indexes the data by date and extracts features for machine learning.
step_rm removes some columns
step_fourier adds Fourier series terms (sine and cosine features) derived from the date
step_dummy converts all categorical variables (e.g. day of the week) to a series of columns with 1 where the feature is present and 0 where it isn’t. This is the format that many machine learning algorithms need for categorical variables.
The specification merely contains the details of what should be done with the data; the prep and juice steps actually apply it.
recipe_spec <- recipe(value ~ date, training(splits)) %>%
step_timeseries_signature(date) %>%
step_rm(contains("am.pm"), contains("hour"), contains("minute"),
contains("second"), contains("xts")) %>%
step_fourier(date, period = 365, K = 5) %>%
step_dummy(all_nominal())
recipe_spec %>% prep() %>% juice()
## # A tibble: 614 × 47
## date value date_index.num date_year date_year.iso date_half
## <date> <dbl> <dbl> <int> <int> <int>
## 1 2020-02-08 1 1581120000 2020 2020 1
## 2 2020-02-09 0.9 1581206400 2020 2020 1
## 3 2020-02-10 0.6 1581292800 2020 2020 1
## 4 2020-02-11 0.6 1581379200 2020 2020 1
## 5 2020-02-12 0.4 1581465600 2020 2020 1
## 6 2020-02-13 0.4 1581552000 2020 2020 1
## 7 2020-02-14 0.3 1581638400 2020 2020 1
## 8 2020-02-15 0.2 1581724800 2020 2020 1
## 9 2020-02-16 0.2 1581811200 2020 2020 1
## 10 2020-02-17 0.2 1581897600 2020 2020 1
## # … with 604 more rows, and 41 more variables: date_quarter <int>,
## # date_month <int>, date_day <int>, date_wday <int>, date_mday <int>,
## # date_qday <int>, date_yday <int>, date_mweek <int>, date_week <int>,
## # date_week.iso <int>, date_week2 <int>, date_week3 <int>, date_week4 <int>,
## # date_mday7 <int>, date_sin365_K1 <dbl>, date_cos365_K1 <dbl>,
## # date_sin365_K2 <dbl>, date_cos365_K2 <dbl>, date_sin365_K3 <dbl>,
## # date_cos365_K3 <dbl>, date_sin365_K4 <dbl>, date_cos365_K4 <dbl>, …
The features created include:
date_index.num - seconds since 1/1/1970
date_half - six month split
date_mday - day in the month
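With 47 columns the tibble print is truncated; one quick way to see every generated feature (using the recipe_spec defined above) is glimpse:
recipe_spec %>% prep() %>% juice() %>% glimpse() ## one row per column, with type and sample values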
We can now use the enhanced dataset for the other algorithms:
model_spec_rf <- rand_forest(trees = 500, min_n = 50) %>%
set_engine("randomForest")
workflow_fit_rf <- workflow() %>%
add_model(model_spec_rf) %>%
add_recipe(recipe_spec %>% step_rm(date)) %>%
fit(training(splits))
workflow_fit_rf
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
##
## • step_timeseries_signature()
## • step_rm()
## • step_fourier()
## • step_dummy()
## • step_rm()
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call:
## randomForest(x = maybe_data_frame(x), y = y, ntree = ~500, nodesize = min_rows(~50, x))
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 15
##
## Mean of squared residuals: 6.089694
## % Var explained: 89.58
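As an aside (not in the original post), we can pull the underlying randomForest object out of the fitted workflow to see which of the engineered features the forest leans on:
workflow_fit_rf %>%
  extract_fit_engine() %>%      ## returns the underlying randomForest object
  randomForest::importance()    ## node-purity importance for each feature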
model_spec_prophet_boost <- prophet_boost(seasonality_yearly = TRUE) %>%
set_engine("prophet_xgboost")
workflow_fit_prophet_boost <- workflow() %>%
add_model(model_spec_prophet_boost) %>%
add_recipe(recipe_spec) %>%
fit(training(splits))
workflow_fit_prophet_boost
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: prophet_boost()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
##
## • step_timeseries_signature()
## • step_rm()
## • step_fourier()
## • step_dummy()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## PROPHET w/ XGBoost Errors
## ---
## Model 1: PROPHET
## - growth: 'linear'
## - n.changepoints: 25
## - changepoint.range: 0.8
## - yearly.seasonality: 'TRUE'
## - weekly.seasonality: 'auto'
## - daily.seasonality: 'auto'
## - seasonality.mode: 'additive'
## - changepoint.prior.scale: 0.05
## - seasonality.prior.scale: 10
## - holidays.prior.scale: 10
## - logistic_cap: NULL
## - logistic_floor: NULL
##
## ---
## Model 2: XGBoost Errors
##
## xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0,
## colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1,
## subsample = 1, objective = "reg:squarederror"), data = x$data,
## nrounds = 15, watchlist = x$watchlist, verbose = 0, nthread = 1)
model_spec_mars <- mars(mode = "regression") %>%
set_engine("earth")
workflow_fit_mars <- workflow() %>%
add_model(model_spec_mars) %>%
add_recipe(recipe_spec) %>%
fit(training(splits))
workflow_fit_mars
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: mars()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
##
## • step_timeseries_signature()
## • step_rm()
## • step_fourier()
## • step_dummy()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Selected 22 of 29 terms, and 9 of 46 predictors
## Termination condition: RSq changed by less than 0.001 at 29 terms
## Importance: date, date_index.num, date_sin365_K4, date_yday, ...
## Number of terms at each degree of interaction: 1 21 (additive model)
## GCV 0.4625768 RSS 245.6335 GRSq 0.9921103 RSq 0.9931544
Now let’s pool the results to compare the different models. The modeltime_table function provides a facility for this.
model_table <- modeltime_table(
model_fit_arima,
model_fit_ets,
model_fit_prophet,
workflow_fit_rf,
workflow_fit_prophet_boost,
workflow_fit_mars
)
model_table
## # Modeltime Table
## # A tibble: 6 × 3
## .model_id .model .model_desc
## <int> <list> <chr>
## 1 1 <fit[+]> ARIMA(3,1,2)(0,0,1)[7]
## 2 2 <fit[+]> ETS(A,AD,N)
## 3 3 <fit[+]> PROPHET
## 4 4 <workflow> RANDOMFOREST
## 5 5 <workflow> PROPHET W/ XGBOOST ERRORS
## 6 6 <workflow> EARTH
We can now apply our models to the test data. In modeltime this is called calibration.
calibration_table <- model_table %>%
modeltime_calibrate(testing(splits))
We can pass the output of the previous step to modeltime_forecast to model the “held-out” portion based on our training data.
calibration_table %>%
modeltime_forecast(actual_data = df) %>%
plot_modeltime_forecast(.interactive = TRUE)
Neither ets nor arima predicts the increase in cases (due to omicron). Both prophet models seem to capture the rising trend but not the pattern. The random forest model seems to mirror the pattern quite well, as does the mars model. We can quantify the accuracy of our models using modeltime_accuracy:
calibration_table %>%
modeltime_accuracy() %>%
table_modeltime_accuracy(.interactive = FALSE)
Accuracy Table

| .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
|---|---|---|---|---|---|---|---|---|
| 1 | ARIMA(3,1,2)(0,0,1)[7] | Test | 5.23 | 27.73 | 13.90 | 32.00 | 8.22 | 0.18 |
| 2 | ETS(A,AD,N) | Test | 5.22 | 27.74 | 13.88 | 31.96 | 8.20 | 0.19 |
| 3 | PROPHET | Test | 9.34 | 88.76 | 24.84 | 57.36 | 9.97 | 0.87 |
| 4 | RANDOMFOREST | Test | 5.07 | 25.93 | 13.49 | 32.80 | 7.90 | 0.87 |
| 5 | PROPHET W/ XGBOOST ERRORS | Test | 5.60 | 54.34 | 14.90 | 39.71 | 6.28 | 0.89 |
| 6 | EARTH | Test | 44.26 | 282.09 | 117.66 | 92.29 | 58.26 | 0.75 |
The r-squared values for arima and ets are very low but pretty good for the other models. The prophet + xgboost model has the lowest overall root mean squared error (rmse) of 6.28 - this means on average the model predicts the daily positivity rate within roughly +- 6.3 percentage points.
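Since modeltime_accuracy returns an ordinary tibble, we can also rank the models directly, for example by rmse (a small sketch):
calibration_table %>%
  modeltime_accuracy() %>%
  arrange(rmse) ## lowest rmse (best) first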
The final step is to take our best models and look ahead to see what they predict. Let’s forecast the next 6 months.
calibration_table %>%
# Remove the ARIMA and ETS models, which had low accuracy
filter(!.model_id %in% c(1:2)) %>%
# Refit and Forecast Forward
modeltime_refit(df) %>%
modeltime_forecast(h = "6 months", actual_data = df) %>%
plot_modeltime_forecast(.interactive = TRUE, .plotly_slider = TRUE)
What does this show?