2022-09-19

A tidy forecasting workflow

The process of producing forecasts can be split up into a few fundamental steps.

  1. Preparing data
  2. Data visualisation
  3. Specifying a model
  4. Model estimation
  5. Accuracy & performance evaluation
  6. Producing forecasts

A tidy forecasting workflow

Example

To illustrate the process, we will fit linear trend models to national GDP data stored in global_economy.

1 - Data Preparation (tidy)

  • The first step in forecasting is to prepare data in the correct format.

  • It involves loading the data, identifying missing values, filtering the time series, and other pre-processing tasks (a small sketch follows this list).

  • The functionality provided by tsibble and other packages in the tidyverse simplifies this step.

  • Each data set is unique, so this preparation step is essential to understanding the data’s features and should always be done before models are estimated.
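
  • A minimal sketch of typical tidying checks, assuming the fpp3 meta-package (which loads tsibble and the tidyverse) is attached:
library(fpp3)

# Check for implicit gaps in the yearly index, one row per country
global_economy %>%
  has_gaps()

# Restrict attention to one country before modelling
global_economy %>%
  filter(Country == "Brazil")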

1 - Data Preparation (tidy)

  • We create GDP per capita to be able to compare GDP across countries
gdppc <- global_economy %>%
  mutate(GDP_per_capita = GDP/Population) %>%
  select(Year, Country, GDP, Population, GDP_per_capita)
gdppc
## # A tsibble: 15,150 x 5 [1Y]
## # Key:       Country [263]
##     Year Country             GDP Population GDP_per_capita
##    <dbl> <fct>             <dbl>      <dbl>          <dbl>
##  1  1960 Afghanistan  537777811.    8996351           59.8
##  2  1961 Afghanistan  548888896.    9166764           59.9
##  3  1962 Afghanistan  546666678.    9345868           58.5
##  4  1963 Afghanistan  751111191.    9533954           78.8
##  5  1964 Afghanistan  800000044.    9731361           82.2
##  6  1965 Afghanistan 1006666638.    9938414          101. 
##  7  1966 Afghanistan 1399999967.   10152331          138. 
##  8  1967 Afghanistan 1673333418.   10372630          161. 
##  9  1968 Afghanistan 1373333367.   10604346          130. 
## 10  1969 Afghanistan 1408888922.   10854428          130. 
## # … with 15,140 more rows

2 - Data Visualisation

  • Visualisation is an essential step in understanding the data: it helps us identify common patterns and, subsequently, specify an appropriate model.
gdppc %>%
  filter(Country=="Brazil") %>%
  autoplot(GDP_per_capita) +
    labs(title = "GDP per capita for Brazil", y = "$US")

Define a model (specify)

  • There are many different time series models that can be used for forecasting.

  • Specifying an appropriate model for the data is essential for producing appropriate forecasts.

  • We can specify different models using the model() function in the fable package.

  • This function uses a formula (y ~ x) interface. The response variable(s) are specified on the left of the formula, and the structure of the model is written on the right.

Define a model (specify)

gdppc %>%
  filter(Country== "Brazil") %>%
      model(TSLM(GDP_per_capita ~ trend()))
## # A mable: 1 x 2
## # Key:     Country [1]
##   Country `TSLM(GDP_per_capita ~ trend())`
##   <fct>                            <model>
## 1 Brazil                            <TSLM>
  • In this case, we are using the TSLM model (time series linear model)
  • The response variable is GDP_per_capita and it is being modelled using trend()
  • The special functions used to define the model’s structure vary between models (as each model can support different structures).

Train the Model (estimate)

  • Once an appropriate model is specified, the next step is to estimate (train) the model given our data.

  • One or more model specifications can be estimated using the model() function.

estim<-gdppc %>%
      model(trend_model = TSLM(GDP_per_capita ~ trend()))
estim
## # A mable: 263 x 2
## # Key:     Country [263]
##    Country             trend_model
##    <fct>                   <model>
##  1 Afghanistan              <TSLM>
##  2 Albania                  <TSLM>
##  3 Algeria                  <TSLM>
##  4 American Samoa           <TSLM>
##  5 Andorra                  <TSLM>
##  6 Angola                   <TSLM>
##  7 Antigua and Barbuda      <TSLM>
##  8 Arab World               <TSLM>
##  9 Argentina                <TSLM>
## 10 Armenia                  <TSLM>
## # … with 253 more rows

Train the Model (estimate)

  • Again, we fit a linear trend model to the GDP per capita data for each combination of key variables (countries) in the tsibble.

  • In this example, it will fit a model to each of the 263 countries in the dataset. The resulting object is a model table or a “mable”.

  • The trend_model column contains information about the fitted model for each country (each cell corresponds to a fitted model). A sketch of inspecting one of these fitted models follows the printout below.

estim
## # A mable: 263 x 2
## # Key:     Country [263]
##    Country             trend_model
##    <fct>                   <model>
##  1 Afghanistan              <TSLM>
##  2 Albania                  <TSLM>
##  3 Algeria                  <TSLM>
##  4 American Samoa           <TSLM>
##  5 Andorra                  <TSLM>
##  6 Angola                   <TSLM>
##  7 Antigua and Barbuda      <TSLM>
##  8 Arab World               <TSLM>
##  9 Argentina                <TSLM>
## 10 Armenia                  <TSLM>
## # … with 253 more rows
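
  • A sketch of inspecting one of the fitted models, assuming the estim mable above; tidy() returns the estimated coefficients (here an intercept and a trend slope) for the selected country:
estim %>%
  filter(Country == "Brazil") %>%
  tidy()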

Check model performance (evaluate)

  • Once a model has been fitted, it is important to check how well it has performed on the data.

  • There are several diagnostic tools available to check model behavior, and also accuracy measures that allow one model to be compared against another.

  • It is an essential step to make sure your model accounts for the behavior of the data.

Produce forecasts (forecast)

  • With an appropriate model specified, estimated and checked, it is time to produce the forecasts using the function forecast().

  • We have to specify the number of future observations to forecast. For example, forecasts of the next 10 observations can be generated using h = 10.

  • We can also use natural language; e.g., h = "2 years" can be used to predict two years into the future.
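
  • For instance, a minimal sketch using the estim mable from the previous step (annual data, so two years means two steps ahead):
estim %>% forecast(h = "2 years")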

fore<-estim %>% forecast(h = 3)
fore
## # A fable: 789 x 5 [1Y]
## # Key:     Country, .model [263]
##    Country        .model       Year   GDP_per_capita  .mean
##    <fct>          <chr>       <dbl>           <dist>  <dbl>
##  1 Afghanistan    trend_model  2018     N(526, 9653)   526.
##  2 Afghanistan    trend_model  2019     N(534, 9689)   534.
##  3 Afghanistan    trend_model  2020     N(542, 9727)   542.
##  4 Albania        trend_model  2018  N(4716, 476419)  4716.
##  5 Albania        trend_model  2019  N(4867, 481086)  4867.
##  6 Albania        trend_model  2020  N(5018, 486012)  5018.
##  7 Algeria        trend_model  2018  N(4410, 643094)  4410.
##  8 Algeria        trend_model  2019  N(4489, 645311)  4489.
##  9 Algeria        trend_model  2020  N(4568, 647602)  4568.
## 10 American Samoa trend_model  2018 N(12491, 652926) 12491.
## # … with 779 more rows

Produce forecasts (forecast)

  • forecast() generates a forecast table, or fable.

  • Each row corresponds to one forecast period for each country.

  • The GDP_per_capita column contains the forecast distribution, while the .mean column contains the point forecast. The point forecast is the mean (or average) of the forecast distribution.

fore
## # A fable: 789 x 5 [1Y]
## # Key:     Country, .model [263]
##    Country        .model       Year   GDP_per_capita  .mean
##    <fct>          <chr>       <dbl>           <dist>  <dbl>
##  1 Afghanistan    trend_model  2018     N(526, 9653)   526.
##  2 Afghanistan    trend_model  2019     N(534, 9689)   534.
##  3 Afghanistan    trend_model  2020     N(542, 9727)   542.
##  4 Albania        trend_model  2018  N(4716, 476419)  4716.
##  5 Albania        trend_model  2019  N(4867, 481086)  4867.
##  6 Albania        trend_model  2020  N(5018, 486012)  5018.
##  7 Algeria        trend_model  2018  N(4410, 643094)  4410.
##  8 Algeria        trend_model  2019  N(4489, 645311)  4489.
##  9 Algeria        trend_model  2020  N(4568, 647602)  4568.
## 10 American Samoa trend_model  2018 N(12491, 652926) 12491.
## # … with 779 more rows

Produce forecasts (forecast)

  • The forecasts can be plotted along with the historical data using autoplot()
estim %>% forecast(h = 3) %>% filter(Country == "Brazil") %>%
  autoplot(gdppc) +
  labs(y = "$US", title = "GDP per capita for Brazil")

Forecast Methods

  • We will see four simple forecasting methods as benchmarks. To illustrate them, we will use quarterly Australian clay brick production between 1970 and 2004.
bricks <- aus_production %>%
  filter_index("1970 Q1" ~ "2004 Q4") %>%
  select(Bricks)
  • The filter_index() function is a convenient shorthand for extracting a section of a time series.

MEAN(y): Average method

  • Forecast of all future values is equal to mean of historical data \(\{y_1,\dots,y_T\}\).
  • Forecasts: \(\hat{y}_{T+h|T} = \bar{y} = (y_1+\dots+y_T)/T\)

MEAN(y): Average method

bricks %>% 
  model(MEAN(Bricks))%>%
    forecast(h = 20) %>%
      autoplot(bricks, level = NULL)

# level = NULL removes the prediction intervals

NAIVE(y): Naïve method

  • Forecasts equal to last observed value.

  • Forecasts: \(\hat{y}_{T+h|T} =y_T\).

  • Consequence of efficient market hypothesis.

  • Naïve forecasts are optimal when data follow a random walk (see Section 9.1); they are also called random walk forecasts, and the RW() function can be used instead of NAIVE().

NAIVE(y): Naïve method

bricks %>%
  model(NAIVE(Bricks)) %>%
  forecast(h = "5 years") %>%
  autoplot(bricks, level = NULL)

SNAIVE(y~lag(m)): Seasonal naïve method

  • A similar method is useful for highly seasonal data.
  • We set each forecast to be equal to the last observed value from the same season (e.g., the same month of the previous year).
  • Forecasts: \(\hat{y}_{T+h|T} =y_{T+h-m(k+1)}\), where \(m=\) seasonal period and \(k\) is the integer part of \((h-1)/m\). (i.e., the number of complete years in the forecast period prior to time \(T+h\))
  • For example, with monthly data, the forecast for all future February values is equal to the last observed February value, etc
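
  • A small worked case: with monthly data (\(m=12\)) and a horizon of \(h=14\), \(k\), the integer part of \((14-1)/12\), equals 1, so

\[\hat{y}_{T+14|T} = y_{T+14-12(1+1)} = y_{T-10},\]

  which is the most recent observed value from the same calendar month as \(T+14\).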

SNAIVE(y~lag(m)): Seasonal naïve method

Seasonal naïve method

bricks %>%
  model(SNAIVE(Bricks ~ lag("year"))) %>%
  forecast(h = "5 years") %>%
  autoplot(bricks, level = NULL) +
  labs(title = "Clay brick production in Australia")

Seasonal naïve method

bricks %>%
  model(SNAIVE(Bricks ~ lag("5 years"))) %>%
  forecast(h = "5 years") %>%
  autoplot(bricks, level = NULL) +
  labs(title = "Clay brick production in Australia")

  • The lag() argument lets us choose the seasonal lag that the forecasts are based on. For some time series there is more than one seasonal period, and then the required lag must be specified.

RW(y ~ drift()): Drift method

  • A variation on the naïve method is to allow the forecasts to increase or decrease over time.

  • The amount of change over time (called the drift) is set to be the average change seen in the historical data.

  • Forecasts equal to last value plus average change.

\(\hat{y}_{T+h|T} = y_{T} + \frac{h}{T-1}\sum_{t=2}^T (y_t-y_{t-1}) = y_T + \frac{h}{T-1}(y_T -y_1)\)

  • Equivalent to extrapolating a line drawn between first and last observations.
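
  • A quick numerical check of that equivalence, assuming the bricks series defined earlier (no missing values in the 1970 Q1 to 2004 Q4 window):
y <- bricks$Bricks
mean(diff(y))                          # average one-step change (the drift)
(tail(y, 1) - y[1]) / (length(y) - 1)  # slope of the line joining first and last observations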

Drift method

bricks %>%
  model(RW(Bricks ~ drift())) %>%
  forecast(h = "5 years") %>%
  autoplot(bricks, level = NULL) +
  geom_line(data = slice(bricks, range(cumsum(!is.na(Bricks)))),
            aes(y=Bricks), linetype = "dashed", colour = "#0072B2") +
  labs(title = "Clay brick production in Australia")

Drift method

All 4 models

brick_fit <- aus_production %>%
  filter_index("1970 Q1" ~ "2004 Q4") %>%
  model(
    Seasonal_naive = SNAIVE(Bricks),
    Naive = NAIVE(Bricks),
    Drift = RW(Bricks ~ drift()),
    Mean = MEAN(Bricks)
  )
## # A mable: 1 x 4
##   Seasonal_naive   Naive         Drift    Mean
##          <model> <model>       <model> <model>
## 1       <SNAIVE> <NAIVE> <RW w/ drift>  <MEAN>

Producing forecasts

brick_fc <- brick_fit %>%
  forecast(h = "5 years")
## # A fable: 80 x 4 [1Q]
## # Key:     .model [4]
##   .model         Quarter       Bricks .mean
##   <chr>            <qtr>       <dist> <dbl>
## 1 Seasonal_naive 2005 Q1 N(409, 3026)   409
## 2 Seasonal_naive 2005 Q2 N(423, 3026)   423
## 3 Seasonal_naive 2005 Q3 N(428, 3026)   428
## 4 Seasonal_naive 2005 Q4 N(397, 3026)   397
## # … with 76 more rows

Visualising forecasts

brick_fc %>%
  autoplot(aus_production, level = NULL) +
  labs(title = "Clay brick production in Australia",
       y = "Millions of bricks") +
  guides(colour = guide_legend(title = "Forecast"))

Apple closing stock price

# Extract training data
apple_stock <- gafa_stock %>%
  filter(Symbol == "AAPL") %>%
  mutate(trading_day = row_number()) %>%
  update_tsibble(index=trading_day, regular=TRUE)
  • Because stock prices are not observed every day, we first set up a new time index based on the trading days rather than calendar days.

Apple closing stock price

# Specify, estimate and forecast
apple_stock %>%
  model(
    Mean = MEAN(Close),
    Naive = NAIVE(Close),
    Drift = RW(Close ~ drift())
  ) %>%
  forecast(h=42) %>%
  autoplot(apple_stock, level = NULL) +
  labs(title = "Apple closing stock price", y="$US") +
  guides(colour=guide_legend(title="Forecast"))

Fitted values

  • Each observation in a time series can be forecast using all previous observations.

  • We call these fitted values (\(\hat{y}_{t|t-1}\)) – forecast of \(y_t\) based on observations \(y_1,\dots,y_{t-1}\).

  • Sometimes drop the subscript: \(\hat{y}_t \equiv \hat{y}_{t|t-1}\).

  • Fitted values are often not true forecasts because any parameters involved in the forecasting method are estimated using all available observations in the time series.

  • For example, if we use the mean method, the fitted values are given by

\[ \hat{y}_t = \hat{c}\]

  • where \(\hat{c}\) is the average computed over all available observations, including those at times after \(t\).

  • When the estimate involves observations after time \(t\), the fitted values are not true forecasts.

  • The naïve or seasonal naïve forecasts do not involve any parameters, and so fitted values are true forecasts in such cases.
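
  • A small sketch for the mean method, assuming the bricks series from earlier: every fitted value returned by augment() is the same overall average \(\hat{c}\), so these fitted values are not true forecasts.
bricks %>%
  model(MEAN(Bricks)) %>%
  augment() %>%
  select(Quarter, Bricks, .fitted)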

Residuals

  • “Residuals” in a time series model are what is left over after fitting a model.

  • The residuals are equal to the difference between the observations and the corresponding fitted values: \(e_t = y_t-\hat{y}_{t|t-1}\).

  • A good forecasting method will yield innovation residuals with the following properties:

  1. The innovation residuals are uncorrelated. If there are correlations between innovation residuals, then there is information left in the residuals which should be used in computing forecasts.

  2. The innovation residuals have zero mean. If they have a mean other than zero, then the forecasts are biased.

Residuals

  • Models that do not satisfy these properties can be improved.

  • However, that does not mean that forecasting methods that satisfy these properties cannot be improved.

  • Checking these properties is important in order to see whether a method is using all of the available information, but it is not a good way to select a forecasting method.

  • Adjusting for bias is easy: if the residuals have mean \(m\), then simply subtract \(m\) from all forecasts and the bias problem is solved.

  • Fixing the correlation problem is harder (Ch 10).

Residuals

  • It is also useful (but not necessary) for the residuals to have the following two properties:
  1. The innovation residuals have constant variance. This is known as homoscedasticity
  2. The innovation residuals are normally distributed.
  • These two properties make the calculation of prediction intervals easier.
  • Sometimes there is little you can do to ensure that your innovation residuals have constant variance and a normal distribution (an alternative approach to obtaining prediction intervals is then necessary).

Residuals

  • It is important to note the distinction between residuals and innovation residuals.

  • If a transformation has been used in the model, then it is often useful to look at residuals on the transformed scale (innovation residuals)

  • For example, suppose we modelled the logarithms of the data, \(w_t=\log(y_t)\); the innovation residuals would be \(w_t-\hat{w}_t\), whereas the residuals are given by \(y_t-\hat{y}_t\).

  • The fitted values and residuals can be checked via augment().
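
  • A sketch of the distinction, assuming the eggs column of the prices data (used again later in these notes): with a log transformation, .innov is on the log scale \((w_t-\hat{w}_t)\) and differs from .resid \((y_t-\hat{y}_t)\).
prices %>%
  filter(!is.na(eggs)) %>%
  model(RW(log(eggs) ~ drift())) %>%
  augment()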

Example

  • Apple Stock price (Close)
apple_stock %>% autoplot(Close)

Example

fit <- apple_stock %>% model(NAIVE(Close))
augment(fit)
## # A tsibble: 1,258 x 7 [1]
## # Key:       Symbol, .model [1]
##    Symbol .model       trading_day Close .fitted .resid .innov
##    <chr>  <chr>              <int> <dbl>   <dbl>  <dbl>  <dbl>
##  1 AAPL   NAIVE(Close)           1  79.0    NA   NA     NA    
##  2 AAPL   NAIVE(Close)           2  77.3    79.0 -1.74  -1.74 
##  3 AAPL   NAIVE(Close)           3  77.7    77.3  0.421  0.421
##  4 AAPL   NAIVE(Close)           4  77.1    77.7 -0.556 -0.556
##  5 AAPL   NAIVE(Close)           5  77.6    77.1  0.489  0.489
##  6 AAPL   NAIVE(Close)           6  76.6    77.6 -0.991 -0.991
##  7 AAPL   NAIVE(Close)           7  76.1    76.6 -0.511 -0.511
##  8 AAPL   NAIVE(Close)           8  76.5    76.1  0.399  0.399
##  9 AAPL   NAIVE(Close)           9  78.1    76.5  1.52   1.52 
## 10 AAPL   NAIVE(Close)          10  79.6    78.1  1.57   1.57 
## # … with 1,248 more rows

Example

  • We estimate a naive forecast \(\hat{y}_{t|t-1} = y_{t-1}\)

  • \(e_t = y_t - \hat{y}_{t|t-1} = y_t-y_{t-1}\)

  • .fitted contains the fitted values;

  • .resid contains the residuals;

  • .innov contains the “innovation residuals” which, in this case, are identical to the regular residuals.

  • Why are there NAs in the first row of the fitted table?

Example

augment(fit) %>%
  ggplot(aes(x = trading_day)) +
  geom_line(aes(y = Close, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted"))

  • The two curves overlap each other, right? Let’s take a closer look.

Example

augment(fit) %>%
  filter(trading_day > 1100) %>%
  ggplot(aes(x = trading_day)) +
  geom_line(aes(y = Close, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted"))

Example - Residual

augment(fit) %>%
  autoplot(.resid) +
  labs(y = "$US",
       title = "Residuals from naïve method")

Example

augment(fit) %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 150) +
  labs(title = "Histogram of residuals")

Example

augment(fit) %>%
  ACF(.resid) %>%
  autoplot() + labs(title = "ACF of residuals")

Example

  • The naïve method produces forecasts that appear to account for all available information.

  • The mean of the residuals is close to zero (see the histogram), but the distribution may not be normal: the right tail seems a little too long, even when we ignore the outlier.

  • There is some significant correlation in the residual series (see the ACF plot).

  • The variation of the residuals does not stay the same across the historical data (it changes towards the end of the sample).

  • This forecast method might not be good, and the prediction intervals that are computed assuming a normal distribution may be inaccurate.

Example

  • gg_tsresiduals() function
gg_tsresiduals(fit)

Facebook Example

fb_stock <- gafa_stock %>%
  filter(Symbol == "FB") %>%
  mutate(trading_day = row_number()) %>%
  update_tsibble(index=trading_day, regular=TRUE)

fit2 <- fb_stock %>% model(NAIVE(Close))
augment(fit2)
## # A tsibble: 1,258 x 7 [1]
## # Key:       Symbol, .model [1]
##    Symbol .model       trading_day Close .fitted .resid .innov
##    <chr>  <chr>              <int> <dbl>   <dbl>  <dbl>  <dbl>
##  1 FB     NAIVE(Close)           1  54.7    NA   NA     NA    
##  2 FB     NAIVE(Close)           2  54.6    54.7 -0.150 -0.150
##  3 FB     NAIVE(Close)           3  57.2    54.6  2.64   2.64 
##  4 FB     NAIVE(Close)           4  57.9    57.2  0.720  0.720
##  5 FB     NAIVE(Close)           5  58.2    57.9  0.310  0.310
##  6 FB     NAIVE(Close)           6  57.2    58.2 -1.01  -1.01 
##  7 FB     NAIVE(Close)           7  57.9    57.2  0.720  0.720
##  8 FB     NAIVE(Close)           8  55.9    57.9 -2.03  -2.03 
##  9 FB     NAIVE(Close)           9  57.7    55.9  1.83   1.83 
## 10 FB     NAIVE(Close)          10  57.6    57.7 -0.140 -0.140
## # … with 1,248 more rows

Facebook Example

gg_tsresiduals(fit2)

ACF of residuals

  • We assume that the residuals are white noise (uncorrelated, mean zero, constant variance). If they aren’t, then there is information left in the residuals that should be used in computing forecasts.

  • So a standard residual diagnostic is to check the ACF of the residuals of a forecasting method.

  • We expect these to look like white noise.

  • But we can do a more formal test for autocorrelation:

\(r_k =\) autocorrelation of residual at lag \(k\)

Consider the whole set of \(r_{k}\) values, and develop a test to see whether the set is significantly different from a zero set (instead of testing each one individually): a portmanteau test.

Portmanteau tests

Box-Pierce test

\[Q = T \sum_{k=1}^\ell r_k^2\] where \(\ell\) is max lag being considered and \(T\) is number of observations.

  • If each \(r_k\) close to zero, \(Q\) will be small.
  • If some \(r_k\) values large (positive or negative), \(Q\) will be large.

Portmanteau tests

Ljung-Box test - preferred method

\[Q^* = T(T+2) \sum_{k=1}^\ell (T-k)^{-1}r_k^2\] where \(\ell\) is max lag being considered and \(T\) is number of observations.

  • My preferences: \(\ell=10\) for non-seasonal data, \(\ell=2m\) for seasonal data (where \(m\) is the seasonal period).

  • However, the test is not good when \(\ell\) is large, so if these values are larger than \(T/5\), then use \(\ell = T/5\).

  • The null hypothesis for both tests is that the residuals are not distinguishable from a white noise series (so non-significant results are what we hope for).

Portmanteau tests

  • If data are WN, \(Q^*\) has \(\chi^2\) distribution with \((\ell - K)\) degrees of freedom where \(K=\) no. parameters in model.
  • When applied to raw data, set \(K=0\).
  • In the features() calls below, lag \(= \ell\) and dof \(= K\).
#Apple Stocks 
augment(fit) %>%
  features(.resid, ljung_box, lag=10, dof=0)
## # A tibble: 1 × 4
##   Symbol .model       lb_stat lb_pvalue
##   <chr>  <chr>          <dbl>     <dbl>
## 1 AAPL   NAIVE(Close)    25.8   0.00399
  • The results are significant (i.e., the p-value is less than 5%). Thus, we can conclude that the residuals are distinguishable from a white noise series.

Portmanteau tests

#Facebook Stocks
augment(fit2) %>%
  features(.resid, ljung_box, lag=10, dof=0)
## # A tibble: 1 × 4
##   Symbol .model       lb_stat lb_pvalue
##   <chr>  <chr>          <dbl>     <dbl>
## 1 FB     NAIVE(Close)    12.1     0.276
  • The results are not significant (i.e., the p-value is large). Thus, we can conclude that the residuals are not distinguishable from a white noise series.

Drift Model

fit_drift <- apple_stock %>% model(Drift = RW(Close ~ drift()))
augment(fit_drift)
## # A tsibble: 1,258 x 7 [1]
## # Key:       Symbol, .model [1]
##    Symbol .model trading_day Close .fitted .resid .innov
##    <chr>  <chr>        <int> <dbl>   <dbl>  <dbl>  <dbl>
##  1 AAPL   Drift            1  79.0    NA   NA     NA    
##  2 AAPL   Drift            2  77.3    79.1 -1.80  -1.80 
##  3 AAPL   Drift            3  77.7    77.3  0.359  0.359
##  4 AAPL   Drift            4  77.1    77.8 -0.618 -0.618
##  5 AAPL   Drift            5  77.6    77.2  0.426  0.426
##  6 AAPL   Drift            6  76.6    77.7 -1.05  -1.05 
##  7 AAPL   Drift            7  76.1    76.7 -0.574 -0.574
##  8 AAPL   Drift            8  76.5    76.2  0.336  0.336
##  9 AAPL   Drift            9  78.1    76.6  1.46   1.46 
## 10 AAPL   Drift           10  79.6    78.1  1.50   1.50 
## # … with 1,248 more rows

Drift Model

gg_tsresiduals(fit_drift)

Drift Model

augment(fit_drift) %>%
  features(.resid, ljung_box, lag=10, dof=1)
## # A tibble: 1 × 4
##   Symbol .model lb_stat lb_pvalue
##   <chr>  <chr>    <dbl>     <dbl>
## 1 AAPL   Drift     25.8   0.00219

Distributional forecasts and prediction intervals

  • We express the uncertainty in our forecasts using a probability distribution
  • A forecast \(\hat{y}_{T+h|T}\) is (usually) the mean of the conditional distribution \(y_{T+h} \mid y_1, \dots, y_{T}\).
  • Most time series models produce normally distributed forecasts.
  • The forecast distribution describes the probability of observing any future value.

Forecast distributions

Assuming residuals are normal, uncorrelated, sd = \(\hat\sigma\):

Mean: \(y_{T+h|T} \sim N(\bar{y}, (1 + 1/T)\hat{\sigma}^2)\)

Naïve: \(y_{T+h|T} \sim N(y_T, h\hat{\sigma}^2)\)

Seasonal naïve: \(y_{T+h|T} \sim N(y_{T+h-m(k+1)}, (k+1)\hat{\sigma}^2)\)

Drift: \(y_{T+h|T} \sim N(y_T + \frac{h}{T-1}(y_T - y_1),h\frac{T+h}{T}\hat{\sigma}^2)\)

where \(k\) is the integer part of \((h-1)/m\).

Note that when \(h=1\) and \(T\) is large, these all give the same approximate forecast variance: \(\hat{\sigma}^2\).

Prediction intervals

  • A prediction interval gives a region within which we expect \(y_{T+h}\) to lie with a specified probability.
  • Assuming forecast errors are normally distributed, then a 95% PI is

\[\hat{y}_{T+h|T} \pm 1.96 \hat\sigma_h\]

where \(\hat\sigma_h\) is the st dev of the \(h\)-step distribution.

  • When \(h=1\), \(\hat\sigma_h\) can be estimated from the residuals.

Prediction intervals

brick_fit <- aus_production %>%
  filter_index("1970 Q1" ~ "2004 Q4") %>%
  model(Naive = NAIVE(Bricks))

brick_fc <- brick_fit %>%
  forecast(h = 1)

brick_fc %>% hilo(level = 95)
## # A tsibble: 1 x 5 [1Q]
## # Key:       .model [1]
##   .model Quarter       Bricks .mean        `95%`
##   <chr>    <qtr>       <dist> <dbl>       <hilo>
## 1 Naive  2005 Q1 N(397, 1960)   397 [310, 484]95

One-step prediction intervals

When forecasting one step ahead, the standard deviation of the forecast distribution can be estimated using the standard deviation of the residuals given by:

\[ \hat{\sigma} = \sqrt{\frac{1}{T-K-M}\sum_{t=1}^{T}e_t^2}\]

  • \(K\) is the number of parameters estimated in the forecasting method.

  • \(M\) is the number of missing values in the residuals. (For example, \(M = 1\) for a naive forecast, because we can’t forecast the first observation.)

  • When \(T\) is large, the corrections for \(K\) and \(M\) make little difference.
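
  • A rough hand computation of \(\hat{\sigma}\) for the naïve brick model fitted above (\(K=0\), \(M=1\)); the result should be close to \(\sqrt{1960}\approx 44.3\) from the hilo() output, up to the exact denominator used internally:
e <- augment(brick_fit) %>% pull(.resid)
sqrt(sum(e^2, na.rm = TRUE) / sum(!is.na(e)))  # divide by T - K - M = T - 1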

h-step forecast standard deviation

  • For the four benchmark methods, it is possible to mathematically derive the forecast standard deviation under the assumption of uncorrelated residuals.

  • If \(\hat{\sigma}\) is the residual standard deviation (\(h=1\)), then \(\hat{\sigma}_h\) follows from the forecast distributions given earlier:

Mean: \(\hat{\sigma}_h = \hat{\sigma}\sqrt{1 + 1/T}\)

Naïve: \(\hat{\sigma}_h = \hat{\sigma}\sqrt{h}\)

Seasonal naïve: \(\hat{\sigma}_h = \hat{\sigma}\sqrt{k+1}\)

Drift: \(\hat{\sigma}_h = \hat{\sigma}\sqrt{h(T+h)/T}\)

  • where \(m\) is the seasonal period and \(k\) is the integer part of \((h-1)/m\).

h-step forecast standard deviation

brick_fc <- brick_fit %>%
  forecast(h = 1)

brick_fc %>% hilo(level = 95)
## # A tsibble: 1 x 5 [1Q]
## # Key:       .model [1]
##   .model Quarter       Bricks .mean        `95%`
##   <chr>    <qtr>       <dist> <dbl>       <hilo>
## 1 Naive  2005 Q1 N(397, 1960)   397 [310, 484]95
  • \(\hat{\sigma} = \sqrt{1960}\)

  • So \(\hat{\sigma}_1 = \sqrt{1960} \times \sqrt{1} = 44.27\)

  • Interval: \(397 \pm 1.96\times44.27 = 397 \pm 86.8 = [310, 484]\)

2-step forecast standard deviation

brick_fc <- brick_fit %>%
  forecast(h = 2)

brick_fc %>% hilo(level = 95)
## # A tsibble: 2 x 5 [1Q]
## # Key:       .model [1]
##   .model Quarter       Bricks .mean        `95%`
##   <chr>    <qtr>       <dist> <dbl>       <hilo>
## 1 Naive  2005 Q1 N(397, 1960)   397 [310, 484]95
## 2 Naive  2005 Q2 N(397, 3920)   397 [274, 520]95
  • Interval: \(397 \pm 1.96\times\sqrt{3920} = 397 \pm 123 = [274, 520]\)

Prediction intervals

bricks <- aus_production %>%
  filter_index("1970 Q1" ~ "2004 Q4")

bricks %>%
  model(Naive = NAIVE(Bricks)) %>%
  forecast(h = 30) %>%
  autoplot(bricks)

Prediction intervals from bootstrapped residuals

  • When a normal distribution for the residuals is an unreasonable assumption, one alternative is to use bootstrapping (residuals must be uncorrelated with constant variance).

  • The goal of the bootstrap is to create a set of future values (scenarios) and then compute a statistic; in our case, we are interested in the prediction intervals.

  1. Estimate a model (drift, naive, etc.).
  2. Calculate the residuals: \(e_t = y_t-\hat{y}_t\)
  • To create a future value (scenario) for \(y\), we have to forecast it \((y_{T+k})\).

  • The future value based on the one-step forecast:

\[y_{T+1} = \hat{y}_{T+1|T} + e_{T+1}\]

  • \(\hat{y}_{T+1|T}\) is the one-step forecast (from a model) and \(e_{T+1}\) is unknown because \(y_{T+1}\) does not exist yet.

Prediction intervals from bootstrapped residuals

  • Assuming future errors will be similar to past errors, we can replace \(e_{T+1}\) by sampling (with replacement) from the collection of errors (residuals).

  • Adding the new simulated observation to our data set, we can repeat the process to obtain:

\[y_{T+2} = \hat{y}_{T+2|T+1} + e_{T+2}\]

  • where \(e_{T+2}\) is another draw from the collection of residuals.

  • Continuing in this way, we can simulate an entire set of future values for our time series.

Bootstrapped residuals

  • We can use the generate() function.
fit <- bricks %>%
  model(NAIVE(Bricks))
sim <- fit %>% generate(h = 30, times = 5, bootstrap = TRUE)
sim
## # A tsibble: 150 x 4 [1Q]
## # Key:       .model, .rep [5]
##    .model        Quarter .rep   .sim
##    <chr>           <qtr> <chr> <dbl>
##  1 NAIVE(Bricks) 2005 Q1 1      445.
##  2 NAIVE(Bricks) 2005 Q2 1      427.
##  3 NAIVE(Bricks) 2005 Q3 1      381.
##  4 NAIVE(Bricks) 2005 Q4 1      399.
##  5 NAIVE(Bricks) 2006 Q1 1      373.
##  6 NAIVE(Bricks) 2006 Q2 1      394.
##  7 NAIVE(Bricks) 2006 Q3 1      379.
##  8 NAIVE(Bricks) 2006 Q4 1      346.
##  9 NAIVE(Bricks) 2007 Q1 1      304.
## 10 NAIVE(Bricks) 2007 Q2 1      274.
## # … with 140 more rows
# We generated five possible sample paths (scenarios) for the next 30 quarters.

Bootstrapped residuals

bricks %>%
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Bricks)) +
  geom_line(aes(y = .sim, colour = as.factor(.rep)),
    data = sim) +
  guides(colour = "none")

Bootstrapped residuals

  • Then we can compute prediction intervals by calculating percentiles of the future sample paths for each forecast horizon.
bk <- fit %>% forecast(h = 30, bootstrap = TRUE, times=5000)
bk
## # A fable: 30 x 4 [1Q]
## # Key:     .model [1]
##    .model        Quarter       Bricks .mean
##    <chr>           <qtr>       <dist> <dbl>
##  1 NAIVE(Bricks) 2005 Q1 sample[5000]  398.
##  2 NAIVE(Bricks) 2005 Q2 sample[5000]  398.
##  3 NAIVE(Bricks) 2005 Q3 sample[5000]  397.
##  4 NAIVE(Bricks) 2005 Q4 sample[5000]  397.
##  5 NAIVE(Bricks) 2006 Q1 sample[5000]  397.
##  6 NAIVE(Bricks) 2006 Q2 sample[5000]  397.
##  7 NAIVE(Bricks) 2006 Q3 sample[5000]  398.
##  8 NAIVE(Bricks) 2006 Q4 sample[5000]  397.
##  9 NAIVE(Bricks) 2007 Q1 sample[5000]  397.
## 10 NAIVE(Bricks) 2007 Q2 sample[5000]  396.
## # … with 20 more rows

Bootstrapped residuals

  • Notice that the forecast distribution is now represented as a simulation with 5000 sample paths.

  • There is no normality assumption, so the prediction intervals are not symmetric.

  • The .mean column is the mean of the bootstrap samples, so it may be slightly different from the results obtained without a bootstrap.

autoplot(bk, bricks)
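
  • A sketch of the same percentile calculation done by hand, assuming the naive fit object for bricks defined above and a larger number of simulated paths from generate():
fit %>%
  generate(h = 30, times = 5000, bootstrap = TRUE) %>%
  as_tibble() %>%
  group_by(Quarter) %>%
  summarise(lower = quantile(.sim, 0.025),   # 95% interval from the sample paths
            upper = quantile(.sim, 0.975))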

Prediction intervals

  • Point forecasts are often useless without a measure of uncertainty (such as prediction intervals).
  • Prediction intervals require a stochastic model (with random errors, etc.) or a bootstrap approach.
  • For most models, prediction intervals get wider as the forecast horizon increases.
  • Use the level argument to control coverage.
  • Check residual assumptions before believing them.
  • Prediction intervals are usually too narrow due to unaccounted uncertainty.

Forecasting with transformations

  • Forecasting from a model with transformations:
  1. Produce forecasts of the transformed data.
  2. Reverse the transformation (or back-transform) to obtain forecasts on the original scale.
eggs <- prices %>%
  filter(!is.na(eggs)) %>% select(eggs)
eggs %>% autoplot() +
  labs(title="Annual egg prices",
       y="$US (adjusted for inflation)")

Modelling with transformations

  • We observe some drift, and we can transform the data by taking logs.
fit <- eggs %>%
  model(RW(log(eggs) ~ drift()))
fit
## # A mable: 1 x 1
##   `RW(log(eggs) ~ drift())`
##                     <model>
## 1             <RW w/ drift>

Forecasting with transformations

fc <- fit %>%
  forecast(h = 50)
fc
## # A fable: 50 x 4 [1Y]
## # Key:     .model [1]
##    .model                   year             eggs .mean
##    <chr>                   <dbl>           <dist> <dbl>
##  1 RW(log(eggs) ~ drift())  1994 t(N(4.1, 0.018))  61.8
##  2 RW(log(eggs) ~ drift())  1995 t(N(4.1, 0.036))  61.4
##  3 RW(log(eggs) ~ drift())  1996 t(N(4.1, 0.054))  61.0
##  4 RW(log(eggs) ~ drift())  1997 t(N(4.1, 0.073))  60.5
##  5 RW(log(eggs) ~ drift())  1998 t(N(4.1, 0.093))  60.1
##  6 RW(log(eggs) ~ drift())  1999    t(N(4, 0.11))  59.7
##  7 RW(log(eggs) ~ drift())  2000    t(N(4, 0.13))  59.3
##  8 RW(log(eggs) ~ drift())  2001    t(N(4, 0.15))  58.9
##  9 RW(log(eggs) ~ drift())  2002    t(N(4, 0.17))  58.6
## 10 RW(log(eggs) ~ drift())  2003    t(N(4, 0.19))  58.2
## # … with 40 more rows

Forecasting with transformations

fc %>% autoplot(eggs) +
  labs(title="Annual egg prices",
       y="US$ (adjusted for inflation)")

Bias adjustment

  • Back-transformed point forecasts are medians.

  • Back-transformed PIs have the correct coverage.

  • For many purposes, this is acceptable, although the mean is usually preferable (medians do not add up, whereas means do).

  • We can adjust the point forecasts back to the mean (bias-adjusted forecasts).

Bias adjustment

fc %>%
  autoplot(eggs, level = c(80, 90), point_forecast = lst(mean, median)) +
  labs(title="Annual egg prices",
       y="US$ (adjusted for inflation)")

Forecasting and decomposition

\[y_t = \hat{S}_t + \hat{A}_t\]

  • \(\hat{A}_t\) is the seasonally adjusted component \((\hat{A}_t =\hat{T}_t + \hat{R}_t)\)

  • \(\hat{S}_t\) is the seasonal component.

  • Forecast both terms separately.

  • The seasonal component is (usually) unchanging (or changing extremely slowly), so it can be forecast by simply taking the last year of the estimated component (a seasonal naïve method).

  • Forecast \(\hat{A}_t\) using non-seasonal time series method.

  • Combine forecasts of \(\hat{S}_t\) and \(\hat{A}_t\) to get forecasts of original data.

US Retail Employment

us_retail_employment <- us_employment %>%
  filter(year(Month) >= 1990, Title == "Retail Trade") %>%
  select(-Series_ID)

autoplot(us_retail_employment)

US Retail Employment

dcmp <- us_retail_employment %>%
  model(STL(Employed)) %>%
  components() %>% select(-.model)
dcmp
## # A tsibble: 357 x 6 [1M]
##       Month Employed  trend season_year remainder season_adjust
##       <mth>    <dbl>  <dbl>       <dbl>     <dbl>         <dbl>
##  1 1990 Jan   13256. 13288.      -33.0      0.836        13289.
##  2 1990 Feb   12966. 13269.     -258.     -44.6          13224.
##  3 1990 Mar   12938. 13250.     -290.     -22.1          13228.
##  4 1990 Apr   13012. 13231.     -220.       1.05         13232.
##  5 1990 May   13108. 13211.     -114.      11.3          13223.
##  6 1990 Jun   13183. 13192.      -24.3     15.5          13207.
##  7 1990 Jul   13170. 13172.      -23.2     21.6          13193.
##  8 1990 Aug   13160. 13151.       -9.52    17.8          13169.
##  9 1990 Sep   13113. 13131.      -39.5     22.0          13153.
## 10 1990 Oct   13185. 13110.       61.6     13.2          13124.
## # … with 347 more rows

US Retail Employment

  • For the seasonally adjusted component, any non-seasonal forecasting method may be used. We will use the naive method in this example.
dcmp %>%
  model(NAIVE(season_adjust)) %>%
  forecast() %>%
  autoplot(dcmp) +
  labs(title = "Naive forecasts of seasonally adjusted data")

US Retail Employment

  • The previous graph shows naïve forecasts of the seasonally adjusted US retail employment data.

  • These are then “reseasonalised” by adding in the seasonal naïve forecasts of the seasonal component.

  • We can use the decomposition_model() function instead, which computes forecasts via any additive decomposition, using other model functions to forecast each of the decomposition’s components.

  • Seasonal components will be forecast automatically using SNAIVE() by default.

  • The function will also reseasonalise the forecasts for you.

  • The variances from both the seasonally adjusted and seasonal forecasts are combined to form the prediction intervals.

US Retail Employment

us_retail_employment %>%
  model(stlf = decomposition_model(
    STL(Employed ~ trend(window = 7), robust = TRUE),
    NAIVE(season_adjust)
  )) %>%
  forecast() %>%
  autoplot(us_retail_employment)

Evaluating forecast accuracy

  • It is important to evaluate forecast accuracy using genuine forecasts.

  • The accuracy of forecasts can only be determined by considering how well a model performs on new data that were not used when fitting the model.

  • A model which fits the training data well will not necessarily forecast well.

  • A perfect fit can always be obtained by using a model with enough parameters.

  • Over-fitting a model to data is just as bad as failing to identify a systematic pattern in the data.

Evaluating forecast accuracy

  • So we divide the data into two sub-samples: training and test.

  • The test set must not be used for any aspect of model development or calculation of forecasts.

  • Forecast accuracy is based only on the test set.

  • The test set should be at least as large as the maximum forecast horizon required.

Forecast errors

Forecast “error”: the difference between an observed value and its forecast.

\[ e_{T+h} = y_{T+h} - \hat{y}_{T+h|T}, \]

  • training data {\(y_{1},..., y_T\)}; test data {\(y_{T+1}, y_{T+2}, ...\)}
  • Residuals come from the training data, while forecast errors come from the test data.
  • Unlike residuals, forecast errors on the test set involve multi-step forecasts.
  • These are true forecast errors as the test data is not used in computing \(\hat{y}_{T+h|T}\).

Measures of forecast accuracy

train <- aus_production %>%
  filter(between(year(Quarter), 1992, 2007))

beer <- aus_production %>%
  filter(year(Quarter) >= 1992)

beer_fc_plot <- train %>%
  model(
    Mean = MEAN(Beer),
    Naive = NAIVE(Beer),
    Seasonal_naive = SNAIVE(Beer),
    Drift = RW(Beer ~ drift())
  ) %>% forecast(h=10) %>% autoplot(beer, level = NULL) +
    labs(title = "Forecasts for quarterly beer production",
         y = "Megalitres") +
    guides(colour=guide_legend(title="Forecast"))

Measures of forecast accuracy

beer_fc_plot

Measures of forecast accuracy

  • What is the most accurate model? How can we tell?

  • For forecasts (and forecast errors) on data with the same scale, we look at:

Mean Absolute error: \(MAE = mean(|e_t|)\)

Root mean squared error: \(RMSE = \sqrt{mean({e_t}^2)}\)

  • MAE is easily interpretable, but it is less popular.
  • RMSE is more sensitive to outliers (it penalises large errors).

Measures of forecast accuracy

  • To compare forecasts across series on different scales (a unit-free measure), we look at:

Mean absolute percentage error: \[MAPE= 100\times mean(|e_{T+h}|/ |y_{T+h}|)\]

  • The MAPE can be undefined if \(y_t=0\), and takes extreme values if \(y_t\) is close to zero.

Measures of forecast accuracy

Mean Absolute Scaled Error \[ MASE = \text{mean}(|e_{T+h}|/Q) \] where \(Q\) is a stable measure of the scale of the time series \(\{y_t\}\).

For non-seasonal time series, \[ Q = (T-1)^{-1}\sum_{t=2}^T |y_t-y_{t-1}| \]

works well. Then MASE is equivalent to MAE relative to a naïve method.

Measures of forecast accuracy

Mean Absolute Scaled Error

\[ MASE = mean(|e_{T+h}|/Q) \]

where \(Q\) is a stable measure of the scale of the time series \(\{y_t\}\).

For seasonal time series, \[ Q = (T-m)^{-1}\sum_{t=m+1}^T |y_t-y_{t-m}| \]

Then MASE is equivalent to MAE relative to a seasonal naïve method.

If MASE is less than one, the forecasts are better than the average one-step (seasonal) naïve forecast; if it is more than one, they are worse.
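
A small sketch of the scaling factor for the quarterly beer training data defined earlier (\(m = 4\)); dividing the test-set MAE by this \(Q\) gives the MASE:

m <- 4
y <- train$Beer
Q <- mean(abs(diff(y, lag = m)))  # mean |y_t - y_{t-m}| over the training data
Q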

Measures of forecast accuracy

Measures of forecast accuracy

recent_production <- aus_production %>%
  filter(year(Quarter) >= 1992)

train <- recent_production %>%
  filter(year(Quarter) <= 2007)

beer_fit <- train %>%
  model(
    Mean = MEAN(Beer),
    Naive = NAIVE(Beer),
    Seasonal_naive = SNAIVE(Beer),
    Drift = RW(Beer ~ drift())
  )
beer_fc <- beer_fit %>%
  forecast(h = 10)

Measures of forecast accuracy

accuracy(beer_fit)
# Provides the accuracy based on the residuals (training set)
## # A tibble: 4 × 6
##   .model         .type     RMSE   MAE  MAPE  MASE
##   <chr>          <chr>    <dbl> <dbl> <dbl> <dbl>
## 1 Drift          Training  65.3  54.8 12.2   3.83
## 2 Mean           Training  43.6  35.2  7.89  2.46
## 3 Naive          Training  65.3  54.7 12.2   3.83
## 4 Seasonal_naive Training  16.8  14.3  3.31  1
accuracy(beer_fc, recent_production)
# Provides the accuracy of the prediction errors, based on the test set
## # A tibble: 4 × 6
##   .model         .type  RMSE   MAE  MAPE  MASE
##   <chr>          <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Drift          Test   64.9  58.9 14.6  4.12 
## 2 Mean           Test   38.4  34.8  8.28 2.44 
## 3 Naive          Test   62.7  57.4 14.2  4.01 
## 4 Seasonal_naive Test   14.3  13.4  3.17 0.937
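
  • A hand check of the test-set errors for the seasonal naïve model, assuming the beer_fc and recent_production objects above; the results should match the Seasonal_naive test row from accuracy():
beer_fc %>%
  filter(.model == "Seasonal_naive") %>%
  as_tibble() %>%
  select(Quarter, .mean) %>%
  left_join(as_tibble(recent_production) %>% select(Quarter, Beer),
            by = "Quarter") %>%
  mutate(e = Beer - .mean) %>%                       # forecast errors on the test set
  summarise(RMSE = sqrt(mean(e^2)), MAE = mean(abs(e)))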

Evaluating distributional forecast accuracy

  • We can also evaluate distributional forecast accuracy.
  • We can evaluate a given quantile forecast: quantile scores.
  • We can evaluate a given prediction interval: the Winkler score.
  • We can evaluate the whole forecast distribution: the Continuous Ranked Probability Score (CRPS).

Time series cross-validation

  • A more sophisticated version of training/test sets is time series cross-validation.

  • There are a series of test sets, each consisting of a single observation.

  • The corresponding training set consists only of all observations that occurred prior to the observation that forms the test set.

Time series cross-validation

[Diagram: the traditional single train/test split compared with the expanding training sets and one-observation test sets used in time series cross-validation]

Time series cross-validation

  • Forecast accuracy averaged over test sets.

  • Also known as “evaluation on a rolling forecasting origin”

  • We can also produce cross-validation for multi-step forecast (\(h=2,3,4,...\))

Time series cross-validation

  • Let’s create the training sets using the stretch_tsibble() function.

  • Stretch with a minimum length of 3, growing by 1 each step.

fb_stretch <- fb_stock %>%
  stretch_tsibble(.init = 3, .step = 1) 
## # A tsibble: 790,650 x 4 [1]
## # Key:       .id [1,255]
##   Date       Close trading_day   .id
##   <date>     <dbl>       <int> <int>
## 1 2014-01-02  54.7           1     1
## 2 2014-01-03  54.6           2     1
## 3 2014-01-06  57.2           3     1
## 4 2014-01-02  54.7           1     2
## 5 2014-01-03  54.6           2     2
## 6 2014-01-06  57.2           3     2
## 7 2014-01-07  57.9           4     2
## # … with 790,643 more rows

Time series cross-validation

Estimate RW w/ drift models for each window.

fit_cv <- fb_stretch %>%
  model(RW(Close ~ drift()))
## # A mable: 1,255 x 3
## # Key:     .id, Symbol [1,255]
##     .id Symbol `RW(Close ~ drift())`
##   <int> <chr>                <model>
## 1     1 FB             <RW w/ drift>
## 2     2 FB             <RW w/ drift>
## 3     3 FB             <RW w/ drift>
## 4     4 FB             <RW w/ drift>
## # … with 1,251 more rows

Time series cross-validation

Produce one-step-ahead forecasts from all models.

fc_cv <- fit_cv %>%
  forecast(h=1)
## # A fable: 1,255 x 5 [1]
## # Key:     .id, Symbol [1,255]
##     .id Symbol trading_day      Close .mean
##   <int> <chr>        <dbl>     <dist> <dbl>
## 1     1 FB               4 N(58, 3.9)  58.4
## 2     2 FB               5   N(59, 2)  59.0
## 3     3 FB               6 N(59, 1.5)  59.1
## 4     4 FB               7 N(58, 1.8)  57.7
## # … with 1,251 more rows

Time series cross-validation

# Cross-validated
fc_cv %>% accuracy(fb_stock)
# Training set
fb_stock %>% model(RW(Close ~ drift())) %>% accuracy()
                   RMSE    MAE   MAPE
Cross-validation  2.418  1.469  1.266
Training          2.414  1.465  1.261
  • As expected, the accuracy measures from the residuals are smaller, as the corresponding “forecasts” are based on a model fitted to the entire data set, rather than being true forecasts.

  • A good way to choose the best forecasting model is to find the model with the smallest RMSE computed using time series cross-validation.
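
  • As a closing sketch, the same cross-validation folds can be used to compare several benchmark methods (assuming the fb_stretch and fb_stock objects above); the method with the lowest cross-validated RMSE would be preferred.
fb_stretch %>%
  model(
    Naive = NAIVE(Close),
    Drift = RW(Close ~ drift())
  ) %>%
  forecast(h = 1) %>%
  accuracy(fb_stock)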