by Martynas Rakickis
I decided to look at the time series forecasting using autoregressive integrated moving average (ARIMA) and Exponential smoothing state space (ETS) methods, implemented in fable and fabletools packages and I am using tsibble to handle the time series data itself. This work references work done by Rob J Hyndman and George Athanasopoulos in “Forecasting: Principles and Practice”
The package list that I will be using is here:
library(tsibble)
library(fable)
library(fabletools)
library(ggplot2)
library(dplyr)
library(feasts)
The data I am using is gathered from Eurostat: https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=avia_paoc&lang=en and it contains number of passengers flown in air flights by their reporting country.
I will start by reading in the CSV file, converting data frame into tibble, changing time field to be assumed as a monthly date, converting tibble to a tsibble (time series tibble) and removing non existing values.
readr::read_csv("datasets/Europe_Flights.csv") %>%mutate(Time = yearmonth(Time)) %>% as_tsibble(index = Time , key = Country) %>% filter(!is.na(Passengers_on_board)) -> Flights_ts
##
## -- Column specification --------------------------------------------------------
## cols(
## Time = col_date(format = ""),
## Country = col_character(),
## Passengers_on_board = col_double()
## )
print(head(Flights_ts))
## # A tsibble: 6 x 3 [1M]
## # Key: Country [1]
## Time Country Passengers_on_board
## <mth> <chr> <dbl>
## 1 2005 Jan Austria 1317150
## 2 2005 Feb Austria 1287914
## 3 2005 Mar Austria 1538968
## 4 2005 Apr Austria 1512452
## 5 2005 May Austria 1756279
## 6 2005 Jun Austria 1880917
Firstly, lets explore all the time series that we have in the dataset
Flights_ts %>% fabletools::autoplot(Passengers_on_board) +
facet_wrap(~Country, scales = "free_y", ncol = 3) +
theme(legend.position = "none") +
scale_x_yearmonth(name = "Time (Monthly)") +
ggtitle(label = "Air Passengers Carried" )
Some interesting trends emerge - all time series exhibit seasonality, some of the time series have a reduced number of observations (Croatia), and some (Austria) seem to have missing values. The missing value counts and their whereabouts in a tsibble can be found with
tsibble::count_gaps(Flights_ts)
count_gaps() will find missing values in between already existing time series observations.
Let’s take a look at the time series plots for Belgium’s flight data, as I will be examining it in more detail:
Flights_ts %>% filter(Country == "Belgium") -> Belgium_ts
Belgium_ts %>% autoplot(Passengers_on_board) +
scale_y_continuous(breaks = seq.int(from = 1e+06, to = 3.5e+06, by = .5e+06), labels = paste0( (seq.int(from = 1e+06, to = 3.5e+06, by = .5e+06)) / 1e+06 , "M"), name = "Air Passengers") +
ggtitle(label = "Air Passengers Carried in Belgium" ) +
scale_x_yearmonth(name = "Time (Monthly)")
And let’s take a look a the box plot + dot plot combination of each month:
ggplot(Belgium_ts,aes( x = lubridate::month(Time), y = Passengers_on_board, group = lubridate::month(Time))) + geom_boxplot() +
geom_dotplot(binaxis='y', stackdir='center', dotsize = 3, fill="blue", method = "histodot", binwidth = .1e+05) +
scale_y_continuous(breaks = seq.int(from = 1e+06, to = 3.5e+06, by = .5e+06), labels = paste0( (seq.int(from = 1e+06, to = 3.5e+06, by = .5e+06)) / 1e+06 , "M"), name = "Air Passengers") + ggtitle(label = "Air Passengers Carried in Belgium" ) + scale_x_yearmonth(breaks = 1:12, labels = 1:12, name = "Month")
So the Belgium_ts time series follows a yearly periodicity, with maximum at July and low points during January - February.
Let’s try fitting three models - seasonal ARIMA, seasonal ETS and seasonal naive model with drift. The last one can be thought as a benchmark model to see if my produced models have adequate performance.
First of, I will create training set, by excluding last 12 months, which we will use to fit the model and compare the forecast against the test dataset.
Belgium_ts %>% slice(1:(n()-12)) -> Belgium_train_ts
print(tail(Belgium_train_ts))
## # A tsibble: 6 x 3 [1M]
## # Key: Country [1]
## Time Country Passengers_on_board
## <mth> <chr> <dbl>
## 1 2018 Jan Belgium 2194692
## 2 2018 Feb Belgium 2200309
## 3 2018 Mar Belgium 2598359
## 4 2018 Apr Belgium 3065140
## 5 2018 May Belgium 3096099
## 6 2018 Jun Belgium 3115544
For the ETS model, let’s look at the time series decomposition plots to gather details for the time series.
Belgium_ts %>% model(feasts::STL(Passengers_on_board ~ season(window = Inf))) %>% components() %>% autoplot()
ETS models consist of three main components - error, trend and seasonality. From the decomposition plot it looks that we have positive additive trend (linear increase), seasonality fluctuates with increasing amplitude, error or the remainder part has changing variance as well. From these findings we could use ETS(M,A,M) model. Lets check. which model will be fitted based on AICc criterion and the default settings:
model(Belgium_train_ts, ETS_auto = ETS(Passengers_on_board, ic = "aicc", opt_crit = "lik")) %>% report()
## Series: Passengers_on_board
## Model: ETS(M,Ad,M)
## Smoothing parameters:
## alpha = 0.4836018
## beta = 0.001453945
## gamma = 0.0001078952
## phi = 0.9788698
##
## Initial states:
## l b s1 s2 s3 s4 s5 s6
## 1450875 11770.68 0.7975377 0.8620338 1.055935 1.159935 1.239187 1.283274
## s7 s8 s9 s10 s11 s12
## 1.103711 1.097651 1.017486 0.8907403 0.7524164 0.7400935
##
## sigma^2: 0.0018
##
## AIC AICc BIC
## 4528.594 4533.378 4584.171
It seems I was not far off with my model assessment, during the fitting process only a dampening parameter was added to the trend component, compared to my estimation.
Ok, now let’s fit an ARIMA model with default settings:
model(Belgium_train_ts, ARIMA_auto = ARIMA((Passengers_on_board) , ic = "aicc", stepwise = F)) %>% report()
## Series: Passengers_on_board
## Model: ARIMA(1,0,0)(2,1,1)[12] w/ drift
## Transformation: (Passengers_on_board)
##
## Coefficients:
## ar1 sar1 sar2 sma1 constant
## 0.6280 -0.2271 -0.2734 -0.5206 57344.30
## s.e. 0.0651 0.1506 0.1179 0.1477 4285.96
##
## sigma^2 estimated as 9.48e+09: log likelihood=-1938.61
## AIC=3889.22 AICc=3889.81 BIC=3907.29
Here I put stepwise = F, for the model fitting algorithm to search through a larger list of potential models for the best fit.
And finally, let’s fit a seasonal naive model with drift (to account for the positive trend).
model(Belgium_train_ts, snaive = SNAIVE(Passengers_on_board ~ drift())) %>% report()
## Series: Passengers_on_board
## Model: SNAIVE w/ drift
##
## Drift: 107665.3667 (se: 12829.7566)
## sigma^2: 24690398272.0593
Let’s put all these models together, so that comparing them would be easier
model(Belgium_train_ts
,ETS_auto = ETS(Passengers_on_board, ic = "aicc", opt_crit = "lik")
,ARIMA_auto = ARIMA((Passengers_on_board) , ic = "aicc", stepwise = F)
,snaive = SNAIVE(Passengers_on_board ~ drift())) -> Belgium_mable
print(Belgium_mable)
## # A mable: 1 x 4
## # Key: Country [1]
## Country ETS_auto ARIMA_auto snaive
## <chr> <model> <model> <model>
## 1 Belgium <ETS(M,Ad,M)> <ARIMA(1,0,0)(2,1,1)[12] w/ drift> <SNAIVE w/ drift>
Producing forecasts, based on the models in the previous step gives the following result and lets take a look at the forecasts
forecast(Belgium_mable, h = 12) -> Belgium_fable
autoplot(Belgium_fable, colour ="brown") +
autolayer(slice(Belgium_ts, (n()-96):n()), colour = "black") +
facet_wrap(~.model, ncol = 1, nrow = 3) +
scale_y_continuous(name = "Air Passengers")
## Plot variable not specified, automatically selected `.vars = Passengers_on_board`
snaive() returns forecasts and prediction intervals from an ARIMA(0,0,0)(0,1,0)m model where m is the seasonal period The performance of these models is following:
accuracy(Belgium_fable, Belgium_ts)
Usually in time series forecasting absolute measures are used to grade the model performance with Root Mean Square Error (RMSE) being one of the more commonly ones used. So from the produced models, Arima and SNAIVE have produced better forecasts, when considering RMSE, while ETS lags behind with its performance.
When working with non stationary time series, it sometimes helps to transform the data to simplify the model and produce better forecasts. One such family of transformations is Box-Cox transformation, which is described by Rob J Hyndman and George Athanasopoulos here To use Box-Cox transformation, we will need to get λ parameter value for our time series. A good value of λ is one which makes the size of the seasonal variation about the same across the whole series, as that makes the forecasting model simpler
Belgium_ts %>% features(features = guerrero) %>% pull(lambda_guerrero) -> Belgium_lambda
## Feature variable not specified, automatically selected `.var = Passengers_on_board`
print(Belgium_lambda)
## [1] 0.112981
So we have λ value and we can use time series transformation in conjunction with model fitting. Below are the new models from the transformed time series:
Belgium_train_ts %>%
model(ETS_bc = ETS(box_cox(Passengers_on_board, Belgium_lambda) , ic = "aicc", opt_crit = "lik"),
ARIMA_bc = ARIMA(box_cox(Passengers_on_board, Belgium_lambda) , ic = "aicc", stepwise = F),
snaive_bc = SNAIVE(box_cox(Passengers_on_board, Belgium_lambda) ~ drift())
) -> Belgium_bc_mable
print(Belgium_bc_mable)
## # A mable: 1 x 4
## # Key: Country [1]
## Country ETS_bc ARIMA_bc snaive_bc
## <chr> <model> <model> <model>
## 1 Belgium <ETS(M,A,A)> <ARIMA(3,0,0)(2,1,0)[12] w/ drift> <SNAIVE w/ drift>
select(Belgium_bc_mable, Country, ETS_bc) %>% report()
## Series: Passengers_on_board
## Model: ETS(M,A,A)
## Transformation: box_cox(Passengers_on_board, Belgium_lambda)
## Smoothing parameters:
## alpha = 0.5228754
## beta = 0.0001012281
## gamma = 0.0004214208
##
## Initial states:
## l b s1 s2 s3 s4 s5 s6
## 35.08869 0.0241381 -1.01236 -0.6807899 0.4058893 0.8640478 1.169602 1.351197
## s7 s8 s9 s10 s11 s12
## 0.5703719 0.5481228 0.1145506 -0.5430074 -1.340906 -1.446718
##
## sigma^2: 0
##
## AIC AICc BIC
## 362.4822 366.7322 414.9713
select(Belgium_bc_mable, Country, ARIMA_bc) %>% report()
## Series: Passengers_on_board
## Model: ARIMA(3,0,0)(2,1,0)[12] w/ drift
## Transformation: box_cox(Passengers_on_board, Belgium_lambda)
##
## Coefficients:
## ar1 ar2 ar3 sar1 sar2 constant
## 0.5357 -0.0587 0.1960 -0.6414 -0.4256 0.1765
## s.e. 0.0806 0.0923 0.0821 0.0755 0.0794 0.0207
##
## sigma^2 estimated as 0.06147: log likelihood=-4.44
## AIC=22.89 AICc=23.68 BIC=43.96
select(Belgium_bc_mable, Country, snaive_bc) %>% report()
## Series: Passengers_on_board
## Model: SNAIVE w/ drift
## Transformation: box_cox(Passengers_on_board, Belgium_lambda)
##
## Drift: 0.2673 (se: 0.0314)
## sigma^2: 0.1476
Comparing the models produced after Box-Cox transformations with their original counteparts, we can see following differences:
1. ETS model lost its dampened trend parameter
2. ARIMA model changed from ARIMA(0,0,5)(0,1,1)[12] w/ drift to ARIMA(3,0,2)(0,1,1)[12] w/o drift
3. Seasonal naive model remained mostly the same
Now lets see their performance
forecast(Belgium_bc_mable, h = 12) -> Belgium_bc_fable
accuracy(rbind(Belgium_bc_fable, Belgium_fable), Belgium_ts) %>% arrange(RMSE)
## Warning: `rbind.fbl_ts()` was deprecated in fabletools 0.2.0.
## Please use `bind_rows()` instead.
And lets see the forecast for the best performing model based on RMSE
Belgium_bc_fable %>% filter(.model == "snaive_bc") %>% autoplot(color = "red") + autolayer(filter_index(Belgium_ts, "2015" ~. ), color = "blue") +
scale_x_yearmonth(name = "Time (Monthly)") +
ggtitle(label = "Air Passengers Carried - Seasonal Naive forecast" ) +
scale_y_continuous(breaks = seq.int(from = 1e+06, to = 3.5e+06, by = .5e+06), labels = paste0( (seq.int(from = 1e+06, to = 3.5e+06, by = .5e+06)) / 1e+06 , "M"), name = "Air Passengers")
## Plot variable not specified, automatically selected `.vars = Passengers_on_board`
It seems the seasonal naive model still the better performer, with ARIMA trailing behind a little bit.
Residuals in time series forecasting are the leftovers, which the model cannot explain after the model has been fitted to a time series. As per Hyndman and Athanasopoulos The residuals are uncorrelated. If there are correlations between residuals, then there is information left in the residuals which should be included in the forecast model. 2) The residuals have zero mean. If the residuals have a mean other than zero, then the forecasts are biased.
Having that in mind, let’s take a look of two better performing models - SNAIVE_bc, ARIMA_bc and examine their residuals:
Belgium_bc_mable %>% select(Country, snaive_bc) %>%
gg_tsresiduals() + ggtitle(label = "SNAIVE")
Belgium_bc_mable %>% select(Country, ARIMA_bc) %>%
gg_tsresiduals() + ggtitle(label = "ARIMA")
From the plots it seems both models more or less satisfy the second property (zero mean), but when looking at the residual correlation, SNAIVE model clearly shows remaining correlation in the residuals, while ARIMA model residuals satisfy the first property (residuals are uncorrelated). Generally, one would want to produce a model that satisfies both properties, though in practice sometimes it is not possible to achieve both.
For reproducibility, I include my sessionInfo() snapshot
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] feasts_0.1.7 dplyr_1.0.4 ggplot2_3.3.3 fable_0.3.0
## [5] fabletools_0.3.0 tsibble_1.0.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.0 xfun_0.21 bslib_0.2.4
## [4] urca_1.3-0 purrr_0.3.4 lattice_0.20-41
## [7] colorspace_2.0-0 vctrs_0.3.6 generics_0.1.0
## [10] htmltools_0.5.1.1 yaml_2.2.1 utf8_1.1.4
## [13] rlang_0.4.10 jquerylib_0.1.3 pillar_1.5.0
## [16] glue_1.4.2 withr_2.4.1 DBI_1.1.1
## [19] distributional_0.2.2 lifecycle_1.0.0 stringr_1.4.0
## [22] munsell_0.5.0 anytime_0.3.9 gtable_0.3.0
## [25] progressr_0.7.0 evaluate_0.14 labeling_0.4.2
## [28] knitr_1.31 fansi_0.4.2 highr_0.8
## [31] Rcpp_1.0.6 readr_1.4.0 scales_1.1.1
## [34] jsonlite_1.7.2 farver_2.0.3 hms_1.0.0
## [37] digest_0.6.27 stringi_1.5.3 numDeriv_2016.8-1.1
## [40] grid_4.0.3 cli_2.3.0 tools_4.0.3
## [43] magrittr_2.0.1 sass_0.3.1 tibble_3.0.6
## [46] crayon_1.4.1 tidyr_1.1.2 pkgconfig_2.0.3
## [49] ellipsis_0.3.1 lubridate_1.7.9.2 assertthat_0.2.1
## [52] rmarkdown_2.7 rstudioapi_0.13 R6_2.5.0
## [55] nlme_3.1-152 compiler_4.0.3