library(fpp3)
https://robjhyndman.com/seminars/uwa2017/
https://robjhyndman.com/seminars/isi2019workshop/
https://github.com/robjhyndman/ETC3550Slides/tree/fable
https://github.com/rstudio-conf-2020/time-series-forecasting/blob/master/materials/labs.R
We can think of a time series as comprising three components: a trend-cycle component, a seasonal component, and a remainder component (containing anything else in the time series).
The purpose of these adjustments and transformations is to simplify the patterns in the historical data by removing known sources of variation, or by making the pattern more consistent across the whole data set. Simpler patterns are usually easier to model and lead to more accurate forecasts.
Some of the variation seen in seasonal data may be due to simple calendar effects. In such cases, it is usually much easier to remove the variation before doing any further analysis.
Any data that are affected by population changes can be adjusted to give per-capita data.
global_economy %>%
filter(Country == "Australia") %>%
autoplot(GDP / Population) + labs(y = "GDP per capita ($US)")
Data which are affected by the value of money are best adjusted before modelling.
To make these adjustments, a price index is used. If \(z_t\) denotes the price index and \(y_t\) denotes the original house price in year \(t\), then \(x_t = (y_t / z_t) \times z_{2000}\) gives the adjusted house price at year 2000 dollar values. Price indexes are often constructed by government agencies. For consumer goods, a common price index is the Consumer Price Index (or CPI).
print_retail <- aus_retail %>%
filter(Industry == "Newspaper and book retailing") %>%
group_by(Industry) %>%
index_by(Year = year(Month)) %>%
summarise(Turnover = sum(Turnover))
aus_economy <- global_economy %>%
filter(Code == "AUS")
print_retail %>%
left_join(aus_economy, by = "Year") %>%
mutate(Adjusted_turnover = Turnover / CPI * 100) %>%
pivot_longer(c(Turnover, Adjusted_turnover), values_to = "Turnover") %>%
ggplot(aes(x = Year, y = Turnover)) +
geom_line() +
facet_grid(name ~ ., scales = "free_y") +
labs(y = "$A", title = "Turnover for the Australian print media industry")
By adjusting for inflation using the CPI, we can see that Australia’s newspaper and book retailing industry has been in decline much longer than the original data suggests. The adjusted turnover is in 2010 Australian dollars, as CPI is 100 in 2010 in this data set.
If the data shows variation that increases or decreases with the level of the series, then a transformation can be useful.
A logarithmic transformation is often useful. If we denote the original observations as \(y_1, …, y_T\) and the transformed observations as \(w_1, …, w_T\), then \(w_t = \log(y_t)\).
Logarithms are useful because they are interpretable: changes in a log value are relative (or percentage) changes on the original scale. So if log base 10 is used, then an increase of 1 on the log scale corresponds to a multiplication of 10 on the original scale. If any value of the original series is zero or negative, then logarithms are not possible.
Square roots and cube roots can be used. These are called power transformations because they can be written in the form \(w_t = y^p_t\).
A useful family of transformations, that includes both logarithms and power transformations, is the family of Box-Cox transformations:
\[\begin{align*} w_t = \begin{cases} \log(y_t) & \text{if $\lambda=0$}; \\ \text{sign}(y_t)(|y_t|^\lambda-1)/\lambda & \text{otherwise}. \end{cases} \end{align*}\]
This allows for negative values of \(y_t\) provided \(\lambda > 0\).
The logarithm in a Box-Cox transformation is always a natural logarithm (i.e., to base \(e\)). So if \(λ = 0\), natural logarithms are used, but if \(λ ≠ 0\) a power transformation is used.
If \(λ = 1\) , then \(w_t = y_t − 1\), so the transformed data is shifted downwards but there is no change in the shape of the time series. For all other values of \(λ\), the time series will change shape.
A good value of \(\lambda\) is one which makes the size of the seasonal variation about the same across the whole series, as that makes the forecasting model simpler.
The guerrero feature can be used to choose a value of lambda for you.
lambda <- aus_production %>%
features(Gas, features = guerrero) %>%
pull(lambda_guerrero)
lambda
[1] 0.1204864
aus_production %>% autoplot(box_cox(Gas, lambda))
Additive decomposition: magnitude of the seasonal fluctuations, or the variation around the trend-cycle, does not vary with the level of the time series.
\(y_{t} = S_{t} + T_{t} + R_t,\)
Multiplicative decomposition: variation in the seasonal pattern, or the variation around the trend-cycle, appears to be proportional to the level of the time series (common with economic time series).
\(y_{t} = S_{t} \times T_{t} \times R_t,\)
When a log transformation has been used:
\(y_{t} = S_{t} \times T_{t} \times R_t\) is equivalent to \(\log y_{t} = \log S_{t} + \log T_{t} + \log R_t\)
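A quick numerical sketch of this equivalence, using toy numbers rather than real data:

# Toy check: multiplying components on the original scale = adding logged components
seas <- 1.2; trend <- 100; remainder <- 0.98
y <- seas * trend * remainder
log(y)                                   # log of the multiplicative combination
log(seas) + log(trend) + log(remainder)  # sum of the logged components (identical)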
The data shows the total monthly number of persons in thousands employed in the retail sector across the US since 1990.
us_retail_employment <- us_employment %>%
filter(year(Month) >= 1990, Title == "Retail Trade") %>%
select(-Series_ID)
autoplot(us_retail_employment, Employed) +
labs(y = "Persons (thousands)", title = "Total employment in US retail")
STL decomposition method:
dcmp <- us_retail_employment %>%
model(STL(Employed))
head(components(dcmp))
The output above shows the components of an STL decomposition. The original data is shown (as Employed), followed by the estimated components. This output forms a “dable” or decomposition table. The header to the table shows that the Employed series has been decomposed additively.
The trend column (containing the trend-cycle \(T_t\)) follows the overall movement of the series, ignoring any seasonality and random fluctuations.
autoplot(us_retail_employment, Employed, color = "gray") +
autolayer(components(dcmp), trend, color = "red") +
labs(y = "Persons (thousands)", title = "Total employment in US retail")
components(dcmp) %>% autoplot()
The three components are shown separately in the bottom three panels of Figure 3.4. These components can be added together to reconstruct the data shown in the top panel. Notice that the seasonal component changes over time, so that any two consecutive years have similar patterns, but years far apart may have different seasonal patterns. The remainder component shown in the bottom panel is what is left over when the seasonal and trend-cycle components have been subtracted from the data.
The large grey bar in the bottom panel shows that the variation in the remainder component is smallest compared to the variation in the data, which has a bar about one quarter the size.
If the seasonal component is removed from the original data, the resulting values are the “seasonally adjusted” data. For an additive decomposition, the seasonally adjusted data are given by \(y_t - S_t\), and for multiplicative data, the seasonally adjusted values are obtained using \(y_t/S_t\).
autoplot(us_retail_employment, Employed, color = "gray") +
autolayer(components(dcmp), season_adjust, color = "blue") +
labs(y = "Persons (thousands)", title = "Total employment in US retail")
Seasonally adjusted series contain the remainder component as well as the trend-cycle. Therefore, they are not “smooth,” and “downturns” or “upturns” can be misleading. If the purpose is to look for turning points in a series, and interpret any changes in direction, then it is better to use the trend-cycle component rather than the seasonally adjusted data.
The first step in a classical decomposition is to use a moving average method to estimate the trend-cycle.
A moving average of order \(m\) can be written as:
\[\begin{equation} \hat{T}_{t} = \frac{1}{m} \sum_{j=-k}^k y_{t+j}, \tag{3.1} \end{equation}\]
where \(m=2k+1\). That is, the estimate of the trend-cycle at time \(t\) is obtained by averaging values of the time series within \(k\) periods of \(t\).
We call this an \(m\)-MA, meaning a moving average of order \(m\). This is easily computed using slide_dbl() from the slider package, which applies a function to “sliding” time windows. In this case, we use the mean() function with a window of size 5.
global_economy %>%
filter(Country == "Australia") %>%
autoplot(Exports) +
labs(y = "% of GDP", title = "Total Australian exports")
aus_exports <- global_economy %>%
filter(Country == "Australia") %>%
mutate(
`5-MA` = slider::slide_dbl(Exports, mean, .before = 2, .after = 2, .complete = TRUE)
)
autoplot(aus_exports, Exports) +
autolayer(aus_exports, `5-MA`, color = "red") +
labs(y = "Exports (% of GDP)", title = "Total Australian exports") +
guides(colour = guide_legend(title = "series"))
The order of the moving average determines the smoothness of the trend-cycle estimate. Simple moving averages such as these are usually of an odd order (e.g., 3, 5, 7, etc.).
### Moving averages of moving averages
It is possible to apply a moving average to a moving average. One reason for doing this is to make an even-order moving average symmetric.
beer <- aus_production %>%
filter(year(Quarter) >= 1992) %>%
select(Quarter, Beer)
beer_ma <- beer %>%
mutate(
`4-MA` = slider::slide_dbl(Beer, mean, .before = 1, .after = 2, .complete = TRUE),
`2x4-MA` = slider::slide_dbl(`4-MA`, mean, .before = 1, .after = 0, .complete = TRUE)
)
head(beer_ma)
tail(beer_ma)
When a 2-MA follows a moving average of an even order (such as 4), it is called a “centred moving average of order 4.” This is because the results are now symmetric.
The most common use of centred moving averages is for estimating the trend-cycle from seasonal data.
In general, a \(2 \times m\)-MA is equivalent to a weighted moving average of order \(m + 1\) where all observations take the weight \(1/m\), except for the first and last terms which take weights \(1/(2m)\). If the seasonal period is even and of order \(m\), we use a \(2 \times m\)-MA to estimate the trend-cycle. If the seasonal period is odd and of order \(m\), we use an \(m\)-MA to estimate the trend-cycle.
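As a quick check of this equivalence using the quarterly beer data from above, the 2x4-MA can be reproduced as a weighted 5-term moving average with weights 1/8, 1/4, 1/4, 1/4, 1/8 (a sketch; the two columns agree wherever both are defined):

# Check: a 2x4-MA equals a weighted 5-MA with weights 1/8, 1/4, 1/4, 1/4, 1/8
beer_ma %>%
  mutate(
    `w5-MA` = slider::slide_dbl(
      Beer, ~ sum(.x * c(1/8, 1/4, 1/4, 1/4, 1/8)),
      .before = 2, .after = 2, .complete = TRUE
    )
  ) %>%
  select(Quarter, `2x4-MA`, `w5-MA`) %>%
  head(10)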
us_retail_employment_ma <- us_retail_employment %>%
mutate(
`12-MA` = slider::slide_dbl(Employed, mean, .before = 5, .after = 6, .complete = TRUE),
`2x12-MA` = slider::slide_dbl(`12-MA`, mean, .before = 1, .after = 0, .complete = TRUE)
)
autoplot(us_retail_employment_ma, Employed, color = "gray") +
autolayer(us_retail_employment_ma, vars(`2x12-MA`), color = "red") +
labs(y = "Persons (thousands)",
title = "Total employment in US retail")
There are two forms of classical decomposition: an additive decomposition and a multiplicative decomposition. We assume that the seasonal component is constant from year to year.
Step 1: If \(m\) is an even number, compute the trend-cycle component \(\hat{T}_t\) using a \(2 \times m\)-MA. If \(m\) is an odd number, compute the trend-cycle component \(\hat{T}_t\) using an \(m\)-MA.
Step 2: Calculate the detrended series: \(y_t - \hat{T}_t\).
Step 3: To estimate the seasonal component for each season, simply average the detrended values for that season.
Step 4: The remainder component is calculated by subtracting the estimated seasonal and trend-cycle components: \(\hat{R}_t = y_t - \hat{T}_t - \hat{S}_t\)
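Before using the built-in function, here is a rough hand computation of Steps 2–4 for the monthly US retail employment data, reusing the 2x12-MA from above as Step 1. This is only a sketch: classical_decomposition() additionally adjusts the seasonal indices so that they sum to zero and handles end points more carefully.

# Steps 2-4 of an additive classical decomposition, by hand
us_retail_employment_ma %>%
  as_tibble() %>%
  mutate(detrended = Employed - `2x12-MA`) %>%             # Step 2: detrend
  group_by(month = month(as.Date(Month))) %>%
  mutate(seasonal = mean(detrended, na.rm = TRUE)) %>%     # Step 3: average detrended values by month
  ungroup() %>%
  mutate(remainder = detrended - seasonal) %>%             # Step 4: remainder
  select(Month, Employed, `2x12-MA`, seasonal, remainder) %>%
  head()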
us_retail_employment %>%
model(classical_decomposition(Employed, type = "additive")) %>%
components() %>%
autoplot() +
labs(title = "Classical additive decomposition of total US retail employment")
A classical multiplicative decomposition is similar, except that the subtractions are replaced by divisions.
This method is based on classical decomposition, but includes many extra steps and features in order to overcome the drawbacks of classical decomposition:
- trend-cycle estimates are available for all observations including the end points
- the seasonal component is allowed to vary slowly over time
- it has some sophisticated methods for handling trading day variation, holiday effects and the effects of known predictors
- it tends to be highly robust to outliers and level shifts in the time series
x11_dcmp <- us_retail_employment %>%
model(x11 = feasts:::X11(Employed, type = "additive")) %>%
components()
autoplot(x11_dcmp) +
labs(title = "Additive X11 decomposition of US retail employment in the US")
The X11 trend-cycle has captured the sudden fall in the data due to the 2007-2008 global financial crisis better than the classical and STL decompositions. The unusual observation in 1996 is now more clearly seen in the remainder component.
library(ggplot2)
x11_dcmp %>%
ggplot(aes(x = Month)) +
geom_line(aes(y = Employed, colour = "Data")) +
geom_line(aes(y = season_adjust, colour = "Seasonally Adjusted")) +
geom_line(aes(y = trend, colour = "Trend")) +
labs(y = "Persons (thousands)",
title = "Total employment in US retail") +
scale_colour_manual(
values = c("gray", "blue", "red"),
breaks = c("Data", "Seasonally Adjusted", "Trend"))
It can be useful to use seasonal plots and seasonal sub-series plots of the seasonal component. These help us to visualise the variation in the seasonal component over time.
x11_dcmp %>%
gg_subseries(seasonal)
“SEATS” stands for “Seasonal Extraction in ARIMA Time Series”. The procedure works only with quarterly and monthly data.
seats_dcmp <- us_retail_employment %>%
model(seats = feasts:::SEATS(Employed)) %>%
components()
autoplot(seats_dcmp) +
labs(title = "SEATS decomposition of total US retail employment")
STL is a versatile and robust method for decomposing time series. STL is an acronym for “Seasonal and Trend decomposition using Loess,” while Loess is a method for estimating nonlinear relationships.
STL has several advantages over the classical, SEATS and X11 decomposition methods: unlike SEATS and X11, it will handle any type of seasonality (not only monthly and quarterly data); the seasonal component is allowed to change over time, at a rate controlled by the user; the smoothness of the trend-cycle can also be controlled by the user; and it can be made robust to outliers.
STL has some disadvantages. In particular, it does not handle trading day or calendar variation automatically, and it only provides facilities for additive decompositions.
It is possible to obtain a multiplicative decomposition by first taking logs of the data, then back-transforming the components. Decompositions between additive and multiplicative can be obtained using a Box-Cox transformation of the data with \(0<λ<1\). A value of \(λ=0\) corresponds to the multiplicative decomposition while \(λ=1\) is equivalent to an additive decomposition.
us_retail_employment %>%
model(STL(Employed ~ trend(window = 7) + season(window = "periodic"),
robust = TRUE)) %>%
components() %>%
autoplot()
The two main parameters to be chosen when using STL are the trend-cycle window trend(window = ?) and the seasonal window season(window = ?). These control how rapidly the trend-cycle and seasonal components can change; smaller values allow for more rapid changes. Both windows should be odd numbers: the trend window is the number of consecutive observations used when estimating the trend-cycle, and the seasonal window is the number of consecutive years used in estimating each value in the seasonal component. Setting the seasonal window to be infinite is equivalent to forcing the seasonal component to be periodic, i.e., identical across years (season(window = "periodic")).
By default, the STL() function provides a convenient automated STL decomposition using a seasonal window of season(window=13), and the trend window chosen automatically from the seasonal period. The default setting for monthly data is trend(window=21). This usually gives a good balance between overfitting the seasonality and allowing it to slowly change over time.
The feasts package includes functions for computing FEatures And Statistics from Time Series (hence the name).
library(feasts)
The features can be computed using the features() function.
tourism %>%
features(Trips, list(mean = mean)) %>%
arrange(mean)
In the code above, list() is used to give a name (mean) to the resulting feature column.
A common short summary of a data set is to compute five summary statistics: the minimum, first quartile, median, third quartile and maximum.
tourism %>% features(Trips, quantile)
All the autocorrelations of a series can be considered features of that series.
The feat_acf() function computes a selection of the autocorrelations:
tourism %>% features(Trips, feat_acf)
## STL features
strength of trend:
\[\begin{align*} F_T = \max\left(0, 1 - \frac{\text{Var}(R_t)}{\text{Var}(T_t+R_t)}\right). \end{align*}\]
strength of seasonality:
\[\begin{align*} F_S = \max\left(0, 1 - \frac{\text{Var}(R_t)}{\text{Var}(S_{t}+R_t)}\right). \end{align*}\]
A value of \(F_S\) close to 0 indicates little or no seasonality, while a value close to 1 indicates strong seasonality.
These measures can be useful, for example, when you have a large collection of time series, and you need to find the series with the most trend or the most seasonality.
Other useful features based on STL include the timing of peaks and troughs — which month or quarter contains the largest seasonal component and which contains the smallest seasonal component.
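As a sketch, the two strength measures can be computed directly from the STL components obtained earlier (the dcmp object from the US retail employment example). feat_stl() reports them as trend_strength and seasonal_strength_year; its values may differ slightly because it fits its own STL decomposition.

# Hand computation of F_T and F_S from the STL components in `dcmp`
components(dcmp) %>%
  as_tibble() %>%
  summarise(
    trend_strength    = max(0, 1 - var(remainder) / var(trend + remainder)),
    seasonal_strength = max(0, 1 - var(remainder) / var(season_year + remainder))
  )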
tourism %>%
features(Trips, feat_stl)
We can then use these features in plots to identify which series are heavily trended and which are most seasonal.
tourism %>%
features(Trips, feat_stl) %>%
ggplot(aes(x = trend_strength, y = seasonal_strength_year, col = Purpose)) +
geom_point() +
facet_wrap(vars(State))
Clearly, holiday series are most seasonal which is unsurprising. The strongest trends tend to be in Western Australia and Victoria.
tourism %>%
features(Trips, feat_stl) %>%
filter(seasonal_strength_year == max(seasonal_strength_year)) %>%
left_join(tourism, by = c("State", "Region", "Purpose")) %>%
ggplot(aes(x = Quarter, y = Trips)) +
geom_line() +
facet_grid(vars(State, Region, Purpose))
The feat_stl() function returns several more features.
All of the features included in the feasts package can be computed in one line like this.
tourism_features <- tourism %>%
features(Trips, feature_set(pkgs = "feasts"))
tourism_features
This gives 48 features for every combination of the three key variables (Region, State and Purpose). We can treat this tibble like any data set and analyse it to find interesting observations or groups of observations.
We show all features that involve seasonality, along with the Purpose variable.
library(glue)
tourism_features %>%
select_at(vars(contains("season"), Purpose)) %>%
mutate(
seasonal_peak_year = seasonal_peak_year +
4*(seasonal_peak_year==0),
seasonal_trough_year = seasonal_trough_year +
4*(seasonal_trough_year==0),
seasonal_peak_year = glue("Q{seasonal_peak_year}"),
seasonal_trough_year = glue("Q{seasonal_trough_year}"),
) %>%
GGally::ggpairs(mapping = aes(colour = Purpose))
Here, the Purpose variable is mapped to colour. There is a lot of information in this figure, and we will highlight just a few things we can learn.
A useful way to handle many more variables is to use a dimension reduction technique such as principal components. This gives linear combinations of variables that explain the most variation in the original data. We can compute the principal components of the tourism features as follows.
library(broom)
pcs <- tourism_features %>%
select(-State, -Region, -Purpose) %>%
prcomp(scale = TRUE) %>%
augment(tourism_features)
pcs %>%
ggplot(aes(x = .fittedPC1, y = .fittedPC2, col = Purpose)) +
geom_point() +
theme(aspect.ratio = 1)
The first step in forecasting is to prepare data in the correct format. This process may involve loading in data, identifying missing values, filtering the time series, and other pre-processing tasks.
We will model GDP per capita over time; so first, we must compute the relevant variable.
gdppc <- global_economy %>%
mutate(GDP_per_capita = GDP / Population)
Looking at your data allows you to identify common patterns, and subsequently specify an appropriate model.
gdppc %>%
filter(Country == "Sweden") %>%
autoplot(GDP_per_capita) +
labs(y = "$US", title = "GDP per capita for Sweden")
Specifying an appropriate model for the data is essential for producing appropriate forecasts. Models in fable are specified using model functions, which each use a formula (y ~ x) interface. The response variable(s) are specified on the left of the formula, and the structure of the model is written on the right.
TSLM(GDP_per_capita ~ trend())
<TSLM model definition>
The model function is TSLM() (time series linear model), the response variable is GDP_per_capita and it is being modelled using trend() (a “special” function specifying a linear trend when it is used within TSLM()).
We next train the model on some data. One or more model specifications can be estimated using the model() function.
fit <- gdppc %>%
model(trend_model = TSLM(GDP_per_capita ~ trend()))
7 errors (1 unique) encountered for trend_model
[7] 0 (non-NA) cases
This fits a linear trend model to the GDP per capita data for each combination of key variables in the tsibble. The resulting object is a model table or a “mable”.
fit
Once a model has been fitted, it is important to check how well it has performed on the data. There are several diagnostic tools available to check model behaviour, and also accuracy measures that allow one model to be compared against another.
With an appropriate model specified, estimated and checked, it is time to produce the forecasts using forecast().
fit %>% forecast(h = "3 years")
This is a forecasting table, or “fable”. Each row corresponds to one forecast period for each country. The GDP_per_capita column contains the forecasting distribution, while the .mean column contains the point forecast. The point forecast is the mean (or average) of the forecasting distribution.
fit %>%
forecast(h = "3 years") %>%
filter(Country == "Sweden") %>%
autoplot(gdppc) +
labs(y = "$US", title = "GDP per capita for Sweden")
Quarterly Australian clay brick production between 1970 and 2004
bricks <- aus_production %>% filter_index("1970 Q1" ~ "2004 Q4")
Mean method: the forecasts of all future values are equal to the average (or “mean”) of the historical data.
\[\begin{align*} \hat{y}_{T+h|T} = \bar{y} = (y_{1}+\dots+y_{T})/T. \end{align*}\]
bricks %>%
model(MEAN(Bricks)) %>%
forecast(h = "3 years") %>%
autoplot(bricks, level = NULL)
For naïve forecasts, we simply set all forecasts to be the value of the last observation.
\[\begin{align*} \hat{y}_{T+h|T} = y_{T} \end{align*}\]
bricks %>%
model(NAIVE(Bricks)) %>%
forecast(h = "3 years") %>%
autoplot(bricks, level = NULL)
A similar method is useful for highly seasonal data. In this case, we set each forecast to be equal to the last observed value from the same season of the year (e.g., the same month of the previous year). Formally, the forecast for time \(T + h\) is written as
\[\begin{align*} \hat{y}_{T+h|T} = y_{T+h-m(k+1)}, \end{align*}\]
where \(m =\) the seasonal period, and \(k\) is the integer part of \((h−1)/m\)
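For example, with quarterly data (\(m = 4\)): for \(h = 1\) we get \(k = 0\), so the forecast is \(y_{T-3}\); for \(h = 4\), \(k = 0\) and the forecast is \(y_{T}\); for \(h = 5\), \(k = 1\) and the forecast is again \(y_{T-3}\). In other words, the forecasts simply repeat the last observed year of data.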
bricks %>%
model(SNAIVE(Bricks ~ lag("year"))) %>%
forecast(h = "5 years") %>%
autoplot(bricks, level = NULL)
A variation on the naïve method is to allow the forecasts to increase or decrease over time, where the amount of change over time (called the drift) is set to be the average change seen in the historical data.
\[\begin{align*} \hat{y}_{T+h|T} = y_{T} + \frac{h}{T-1}\sum_{t=2}^T (y_{t}-y_{t-1}) = y_{T} + h \left( \frac{y_{T} -y_{1}}{T-1}\right). \end{align*}\]
bricks %>%
model(RW(Bricks ~ drift())) %>%
forecast(h = "3 years") %>%
autoplot(bricks, level = NULL)
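A minimal numerical check of the drift formula against fable's point forecasts (a sketch, using the bricks series defined above):

# Drift forecasts by hand vs. fable's point forecasts
fc_drift <- bricks %>%
  model(RW(Bricks ~ drift())) %>%
  forecast(h = 4)
y <- bricks$Bricks
n_obs <- nrow(bricks)
manual <- y[n_obs] + (1:4) * (y[n_obs] - y[1]) / (n_obs - 1)  # y_T + h * average change
cbind(fable = fc_drift$.mean, manual = manual)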
# Set training data from 1992 to 2006
train <- aus_production %>% filter_index("1992 Q1" ~ "2006 Q4")
# Fit the models
beer_fit <- train %>%
model(
Mean = MEAN(Beer),
`Naïve` = NAIVE(Beer),
`Seasonal naïve` = SNAIVE(Beer)
)
# Generate forecasts for 14 quarters
beer_fc <- beer_fit %>% forecast(h = 14)
# Plot forecasts against actual values
beer_fc %>%
autoplot(train, level = NULL) +
autolayer(filter_index(aus_production, "2007 Q1" ~ .), color = "black") +
labs(y = "Megalitres", title = "Forecasts for quarterly beer production") +
guides(colour = guide_legend(title = "Forecast"))
Plot variable not specified, automatically selected `.vars = Beer`
In this case, only the seasonal naïve forecasts are close to the observed values from 2007 onwards.
# Re-index based on trading days
google_stock <- gafa_stock %>%
filter(Symbol == "GOOG", year(Date) >= 2015) %>%
mutate(day = row_number()) %>%
update_tsibble(index = day, regular = TRUE)
# Filter the year of interest
google_2015 <- google_stock %>% filter(year(Date) == 2015)
# Fit the models
google_fit <- google_2015 %>%
model(
Mean = MEAN(Close),
`Naïve` = NAIVE(Close),
Drift = NAIVE(Close ~ drift())
)
# Produce forecasts for the trading days in January 2016
google_jan_2016 <- google_stock %>%
filter(yearmonth(Date) == yearmonth("2016 Jan"))
google_fc <- google_fit %>% forecast(new_data = google_jan_2016)
# Plot the forecasts
google_fc %>%
autoplot(google_2015, level = NULL) +
autolayer(google_jan_2016, Close, color = "black") +
labs(x = "Day", y = "Closing Price (US$)",
title = "Google stock prices (Jan 2015 - Jan 2016)") +
guides(colour = guide_legend(title = "Forecast"))
Each observation in a time series can be forecast using all previous observations. We call these fitted values and they are denoted by \(\hat{y}_{t|t-1}\). Fitted values almost always involve one-step forecasts.
The “residuals” in a time series model are what is left over after fitting a model.
\(e_{t} = y_{t}-\hat{y}_{t}.\)
If a transformation has been used in the model, then it is often useful to look at residuals on the transformed scale. We call these “innovation residuals”.
The fitted values and residuals from a model can be obtained using the augment() function.
augment(beer_fit)
There are three new columns added to the original data: .fitted (the fitted values), .resid (the response residuals) and .innov (the innovation residuals).
Residuals are useful in checking whether a model has adequately captured the information in the data. For this purpose, we use innovation residuals.
A good forecasting method will yield innovation residuals with the following properties: the innovation residuals are uncorrelated, and they have zero mean.
Checking these properties is important in order to see whether a method is using all of the available information, but it is not a good way to select a forecasting method.
In addition to these essential properties, it is useful (but not necessary) for the innovation residuals to also have constant variance and to be normally distributed.
Using the naïve method for the Google stock price:
aug <- google_2015 %>%
model(NAIVE(Close)) %>%
augment()
autoplot(aug, .innov) +
labs(x = "Day", y = "Residual", title = "Residuals from naïve method")
aug %>%
ggplot(aes(x = .innov)) +
geom_histogram() +
labs(title = "Histogram of residuals")
aug %>%
ACF(.innov) %>%
autoplot() +
labs(title = "ACF of residuals")
These graphs show that the naïve method produces forecasts that appear to account for all available information. The mean of the residuals is close to zero and there is no significant correlation in the residuals series. The time plot of the residuals shows that the variation of the residuals stays much the same across the historical data, apart from the one outlier, and therefore the residual variance can be treated as constant. The histogram suggests that the residuals may not be normal — the right tail seems a little too long, even when we ignore the outlier. Consequently, forecasts from this method will probably be quite good, but prediction intervals that are computed assuming a normal distribution may be inaccurate.
google_2015 %>%
model(NAIVE(Close)) %>%
gg_tsresiduals()
In addition to looking at the ACF plot, we can also do a more formal test for autocorrelation by considering a whole set of \(r_k\) values as a group, rather than treating each one separately.
A test for a group of autocorrelations is called a portmanteau test.
Box-Pierce test
\[\begin{align*} Q = T \sum_{k=1}^\ell r_k^2, \end{align*}\]
If each \(r_k\) is close to zero, then \(Q\) will be small. If some \(r_k\) values are large (positive or negative), then Q will be large.
Use:
- \(\ell=10\) for non-seasonal data
- \(\ell=2m\) for seasonal data (where \(m\) is the seasonal period)
- if \(\ell > T/5\), then use \(\ell=T/5\)
Ljung-Box test
\[\begin{align*} Q^* = T(T+2) \sum_{k=1}^\ell (T-k)^{-1}r_k^2. \end{align*}\]
Again, large values of \(Q^∗\) suggest that the autocorrelations do not come from a white noise series.
If the autocorrelations did come from a white noise series, then both \(Q\) and \(Q^∗\) would have a \(χ^2\) distribution with \((ℓ − K)\) degrees of freedom, where \(K\) is the number of parameters in the model. If they are calculated from raw data (rather than the residuals from a model), then set \(K = 0\).
# lag = ℓ, dof = K
aug %>% features(.innov, box_pierce, lag = 10, dof = 0)
aug %>% features(.innov, ljung_box, lag = 10, dof = 0)
For both \(Q\) and \(Q^∗\), the results are not significant (i.e., the \(p\) -values are relatively large). Thus, we can conclude that the residuals are not distinguishable from a white noise series.
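As a sketch, the Ljung-Box statistic can be reproduced by hand from the residual autocorrelations of the naïve model (using aug from above); small differences from ljung_box() may arise from how missing values are handled.

# Ljung-Box by hand with l = 10, K = 0
r <- aug %>% ACF(.innov, lag_max = 10) %>% pull(acf)
T_len <- sum(!is.na(aug$.innov))                 # effective series length
Q_star <- T_len * (T_len + 2) * sum(r^2 / (T_len - 1:10))
Q_star
1 - pchisq(Q_star, df = 10)                      # p-value (compare with ljung_box, dof = 0)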
An alternative simple approach that may be appropriate for forecasting the Google daily closing stock price is the drift method. The tidy() function shows the one estimated parameter, the drift coefficient.
fit <- google_2015 %>% model(RW(Close ~ drift()))
tidy(fit)
#K = 1
augment(fit) %>% features(.innov, ljung_box, lag = 10, dof = 1)
The residuals from the drift method are indistinguishable from a white noise series.
https://otexts.com/fpp3/prediction-intervals.html
The value of prediction intervals is that they express the uncertainty in the forecasts.
Prediction intervals can easily be computed for you when using the fable package. For example, here is the output when using the naïve method for the Google stock price.
google_2015 %>%
model(NAIVE(Close)) %>%
forecast(h = 10) %>%
hilo()
The hilo() function converts the forecast distributions into intervals. By default, 80% and 95% prediction intervals are returned, although other options are possible via the level argument.
google_2015 %>%
model(NAIVE(Close)) %>%
forecast(h = 10) %>%
autoplot(google_2015)
When forecasting from a model with transformations, we first produce forecasts of the transformed data. Then, we need to reverse the transformation (or back-transform) to obtain forecasts on the original scale. The reverse Box-Cox transformation is given by
\[\begin{equation} \tag{5.2} y_{t} = \begin{cases} \exp(w_{t}) & \text{if $\lambda=0$};\\ \text{sign}(\lambda w_t+1)|\lambda w_t+1|^{1/\lambda} & \text{otherwise}. \end{cases} \end{equation}\]
The fable package will automatically back-transform the forecasts whenever a transformation has been used in the model definition.
If a transformation has been used, then the prediction interval is first computed on the transformed scale, then the end points are back-transformed to give a prediction interval on the original scale.
One issue with using mathematical transformations such as Box-Cox transformations is that the back-transformed point forecast will not be the mean of the forecast distribution. In fact, it will usually be the median of the forecast distribution (assuming that the distribution on the transformed space is symmetric). For many purposes, this is acceptable, although the mean is usually preferable.
For a Box-Cox transformation, the back-transformed mean is given (approximately) by
\[\begin{equation} \tag{5.3} \hat{y}_{T+h|T} = \begin{cases} \exp(\hat{w}_{T+h|T})\left[1 + \frac{\sigma_h^2}{2}\right] & \text{if $\lambda=0$;}\\ (\lambda \hat{w}_{T+h|T}+1)^{1/\lambda}\left[1 + \frac{\sigma_h^2(1-\lambda)}{2(\lambda \hat{w}_{T+h|T}+1)^{2}}\right] & \text{otherwise;} \end{cases} \end{equation}\]
The difference between the simple back-transformed forecast and the mean is called the bias.
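A toy sketch of the \(\lambda=0\) (log) case, with hypothetical numbers: the back-transformed point forecast \(e^{\hat{w}}\) is the forecast median, while the bias-adjusted mean multiplies it by \(1 + \sigma_h^2/2\).

# Toy illustration of bias adjustment after a log transformation (numbers are hypothetical)
w_hat  <- 4.5    # point forecast on the log scale
sigma2 <- 0.25   # forecast variance on the log scale
exp(w_hat)                       # back-transformed median
exp(w_hat) * (1 + sigma2 / 2)    # approximate back-transformed mean (equation 5.3)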
load(file = "datasets/prices.rda")
prices %>%
filter(!is.na(eggs)) %>%
model(RW(log(eggs) ~ drift())) %>%
forecast(h = 50) %>%
autoplot(prices %>% filter(!is.na(eggs)),
level = 80, point_forecast = lst(mean, median)
)
Ignoring unknown aesthetics: linetype
Bias-adjusted forecast means are automatically computed in the fable package. The forecast median (the point forecast prior to bias adjustment) can be obtained using the median() function on the distribution column.
Additive decomposition: \(y_t = \hat{S}_t + \hat{A}_t\), where \(\hat{A}_t = \hat{T}_t+\hat{R}_{t}\) is the seasonally adjusted component.
Multiplicative decomposition: \(y_t = \hat{S}_t\hat{A}_t\), where \(\hat{A}_t = \hat{T}_t\hat{R}_{t}\).
To forecast a decomposed time series, we forecast the seasonal component, \(\hat{S}_t\), and the seasonally adjusted component \(\hat{A}_t\), separately.
A seasonal naïve method is used for the seasonal component, since the seasonal component is assumed to be unchanging, or changing extremely slowly.
To forecast the seasonally adjusted component, any non-seasonal forecasting method may be used.
#Employment in the US retail sector
us_retail_employment <- us_employment %>%
filter(year(Month) >= 1990, Title == "Retail Trade")
dcmp <- us_retail_employment %>%
model(STL(Employed ~ trend(window = 7), robust = TRUE)) %>%
components() %>%
select(-.model)
dcmp %>%
model(NAIVE(season_adjust)) %>%
forecast() %>%
autoplot(dcmp) +
labs(y = "New orders index",
title = "Naive forecasts of seasonally adjusted data")
This shows naïve forecasts of the seasonally adjusted US retail employment data. These are then “reseasonalised” by adding in the seasonal naïve forecasts of the seasonal component.
This is made easy with the decomposition_model() model function, which allows you to compute forecasts via any additive decomposition, using other model functions to forecast each of the decomposition’s components. The function will also do the reseasonalising for you.
fit_dcmp <- us_retail_employment %>%
model(stlf = decomposition_model(
STL(Employed ~ trend(window = 7), robust = TRUE),
NAIVE(season_adjust)
))
fit_dcmp %>%
forecast() %>%
autoplot(us_retail_employment)
The upper and lower limits of the prediction intervals on the seasonally adjusted data are “reseasonalised” by adding in the forecasts of the seasonal component.
The ACF of the residuals, shown in Figure 5.14, displays significant autocorrelations. These are due to the naïve method not capturing the changing trend in the seasonally adjusted series.
fit_dcmp %>% gg_tsresiduals()
It is common practice to separate the available data into two portions, training and test data, where the training data is used to estimate any parameters of a forecasting method and the test data is used to evaluate its accuracy.
The size of the test set is typically about 20% of the total sample.
The filter() function is useful when extracting a portion of a time series:
aus_production %>% filter(year(Quarter) >= 1995)
Another useful function is slice():
# last 20 observations (5 years)
aus_production %>%
slice(n()-19:0)
# first year of data from each time series
aus_retail %>%
group_by(State, Industry) %>%
slice(1:12)
A forecast “error” is the difference between an observed value and its forecast; it is the unpredictable part of an observation.
\(e_{T+h} = y_{T+h} - \hat{y}_{T+h|T},\)
The forecast errors are on the same scale as the data. Accuracy measures that are based only on \(e_t\) are therefore scale-dependent and cannot be used to make comparisons between series that involve different units.
The two most commonly used scale-dependent measures are based on the absolute errors or squared errors:
\[\begin{align*} \text{Mean absolute error: MAE} & = \text{mean}(|e_{t}|),\\ \text{Root mean squared error: RMSE} & = \sqrt{\text{mean}(e_{t}^2)}. \end{align*}\]
A forecast method that minimises the MAE will lead to forecasts of the median, while minimising the RMSE will lead to forecasts of the mean. Consequently, the RMSE is also widely used, despite being more difficult to interpret.
The percentage error is given by \(p_{t} = 100 e_{t}/y_{t}\). Percentage errors have the advantage of being unit-free, and so are frequently used to compare forecast performances between data sets.
\[\begin{align*} \text{Mean absolute percentage error: MAPE} = \text{mean}(|p_{t}|). \end{align*}\]
Measures based on percentage errors have the disadvantage of being infinite or undefined if \(y_t=0\) for any \(t\) in the period of interest, and having extreme values if any \(y_t\) is close to zero.
The “symmetric” MAPE (sMAPE) is sometimes used, although it is not recommended:
\[\begin{align*} \text{sMAPE} = \text{mean}\left(200|y_{t} - \hat{y}_{t}|/(y_{t}+\hat{y}_{t})\right). \end{align*}\]
### Scaled errors
Hyndman & Koehler (2006) proposed scaling the errors based on the training MAE from a simple forecast method.
For a non-seasonal time series, a useful way to define a scaled error uses naïve forecasts:
\[\begin{align*} q_{j} = \frac{\displaystyle e_{j}} {\displaystyle\frac{1}{T-1}\sum_{t=2}^T |y_{t}-y_{t-1}|}. \end{align*}\]
A scaled error is independent of the scale of the data. It is less than one if it arises from a better forecast than the average one-step naïve forecast computed on the training data, and greater than one if the forecast is worse than the average one-step naïve forecast computed on the training data.
For a seasonal time series, a scaled error can be defined using seasonal naïve forecasts:
\[\begin{align*} q_{j} = \frac{\displaystyle e_{j}} {\displaystyle\frac{1}{T-m}\sum_{t=m+1}^T |y_{t}-y_{t-m}|}. \end{align*}\]
The mean absolute scaled error is simply:
\[\begin{align*} \text{MASE} = \text{mean}(|q_{j}|). \end{align*}\]
The root mean squared scaled error (RMSSE) is given by:
\[\begin{align*} \text{RMSSE} = \sqrt{\text{mean}(q_{j}^2)}, \end{align*}\]
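A toy sketch of these definitions, with made-up quarterly numbers (\(m = 4\)):

# Toy computation of MAE, RMSE, MAPE and MASE (all numbers are made up)
y_train <- c(100, 120, 90, 110, 105, 125, 95, 115)  # training data, seasonal period m = 4
y_test  <- c(108, 128)                               # held-out observations
y_hat   <- c(104, 122)                               # forecasts of the test set
e <- y_test - y_hat

mean(abs(e))                    # MAE
sqrt(mean(e^2))                 # RMSE
mean(abs(100 * e / y_test))     # MAPE
# MASE: scale by the in-sample seasonal naive MAE (differences at lag m = 4)
q <- e / mean(abs(y_train[5:8] - y_train[1:4]))
mean(abs(q))                    # MASE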
recent_production <- aus_production %>% filter(year(Quarter) >= 1992)
beer_train <- recent_production %>% filter(year(Quarter) <= 2007)
beer_fit <- beer_train %>%
model(
Mean = MEAN(Beer),
`Naïve` = NAIVE(Beer),
`Seasonal naïve` = SNAIVE(Beer),
Drift = RW(Beer ~ drift())
)
beer_fc <- beer_fit %>%
forecast(h = 10)
beer_fc %>%
autoplot(
aus_production %>% filter(year(Quarter) >= 1992),
level = NULL
) +
labs(y = "Megalitres", title = "Forecasts for quarterly beer production") +
guides(colour = guide_legend(title = "Forecast"))
We compute the forecast accuracy measures for this period.
accuracy(beer_fc, recent_production)
google_fit <- google_2015 %>%
model(
Mean = MEAN(Close),
`Naïve` = NAIVE(Close),
Drift = RW(Close ~ drift())
)
google_fc <- google_fit %>%
forecast(google_jan_2016)
google_fc %>%
autoplot(bind_rows(google_2015, google_jan_2016), level = NULL) +
labs(x = "Day", y = "Closing Price (US$)",
title = "Google stock prices (Jan 2015 - Jan 2016)") +
guides(colour = guide_legend(title = "Forecast"))
accuracy(google_fc, google_stock)
google_fc %>%
filter(.model == "Naïve") %>%
autoplot(bind_rows(google_2015, google_jan_2016), level = 80)
The lower limit of this prediction interval gives the 10th percentile (or 0.1 quantile) of the forecast distribution, so we would expect the actual value to lie below the lower limit about 10% of the time, and to lie above the lower limit about 90% of the time. When we compare the actual value to this percentile, we need to allow for the fact that it is more likely to be above than below.
The quantile score is:
\[\begin{align*} Q_{p,t} = \begin{cases} 2(1 - p) \big(f_{p,t} - y_{t}\big), & \text{if $y_{t} < f_{p,t}$}\\ 2p \big(y_{t} - f_{p,t}\big), & \text{if $y_{t} \ge f_{p,t}$} \end{cases} \end{align*}\]
A low value of \(Q_{p,t}\) indicates a better estimate of the quantile.
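For example (hypothetical numbers): if the 10% quantile forecast is \(f_{0.10,t} = 745\) and the observation turns out to be \(y_t = 742\), then since \(y_t < f_{p,t}\),
\[\begin{align*} Q_{0.10,t} = 2(1 - 0.10)(745 - 742) = 5.4. \end{align*}\]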
This is easily computed using accuracy with the quantile_score function:
google_fc %>%
filter(.model == "Naïve", Date == "2016-01-04") %>%
accuracy(google_stock, list(qs = quantile_score), probs = 0.10)
It is often of interest to evaluate a prediction interval, rather than a few quantiles.
For observations that fall within the interval, the Winkler score is simply the length of the interval. So low scores are associated with narrow intervals. However, if the observation falls outside the interval, the penalty applies, with the penalty proportional to how far the observation is outside the interval.
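For reference, the standard definition for a \(100(1-\alpha)\%\) prediction interval with bounds \(\ell_{t}\) and \(u_{t}\) is
\[\begin{align*} W_{\alpha,t} = \begin{cases} (u_{t}-\ell_{t}) + \frac{2}{\alpha}(\ell_{t}-y_{t}) & \text{if $y_{t} < \ell_{t}$;}\\ (u_{t}-\ell_{t}) & \text{if $\ell_{t} \le y_{t} \le u_{t}$;}\\ (u_{t}-\ell_{t}) + \frac{2}{\alpha}(y_{t}-u_{t}) & \text{if $y_{t} > u_{t}$.} \end{cases} \end{align*}\]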
This is easily computed using accuracy with the winkler_score function:
google_fc %>%
filter(.model == "Naïve", Date == "2016-01-04") %>%
accuracy(google_stock, list(winkler = winkler_score), level = 80)
#> # A tibble: 1 x 4
#> .model Symbol .type winkler
#> <chr> <chr> <chr> <dbl>
#> 1 Naïve GOOG Test 55.7
Often we are interested in the whole forecasting distribution, rather than particular quantiles or prediction intervals. In that case, we can average the quantile scores over all values of \(p\) to obtain the Continuous Ranked Probability Score or CRPS.
A CRPS value is a little like a weighted absolute error computed from the entire forecast distribution, where the weighting takes account of the probabilities.
google_fc %>%
accuracy(google_stock, list(crps = CRPS))
Here, the Naïve method is giving better distributional forecasts than the Drift or Mean methods.
As with point forecasts, it is useful to be able to compare the distributional forecast accuracy of several methods across series on different scales.
For example, if we use the Naïve method as a benchmark, and also compute forecasts using the Drift method, we can compute the CRPS skill score of the Drift method relative to the Naïve method as
\[\begin{align*} \frac{\text{CRPS}_{N} - \text{CRPS}_{D}}{\text{CRPS}_{N}} \end{align*}\]
This gives the proportion that the Drift method improves over the Naïve method based on CRPS. It is easy to compute using the accuracy() function.
google_fc %>%
accuracy(google_stock, list(skill = skill_score(CRPS)))
#> # A tibble: 3 x 4
#> .model Symbol .type skill
#> <chr> <chr> <chr> <dbl>
#> 1 Drift GOOG Test -0.266
#> 2 Mean GOOG Test -1.90
#> 3 Naïve GOOG Test 0
Of course, the skill score for Naïve is 0 because it can’t improve on itself. The other two methods have larger CRPS values than Naïve, so the skill scores are negative; the Drift method is 26.6% worse than the Naïve method.
When the data are seasonal, skill_score() will use a Seasonal Naïve benchmark method rather than the Naïve benchmark. This will work even when the benchmark forecast is not included in the fable object as skill_score() computes the benchmark forecasts that are needed. It is important that the data provided to accuracy() include only the training and test data to ensure the same training data are used for the benchmark forecasts.
https://otexts.com/fpp3/tscv.html
A more sophisticated version of training/test sets is time series cross-validation.
# Time series cross-validation accuracy
google_2015_tr <- google_2015 %>%
stretch_tsibble(.init = 3, .step = 1)
google_2015_tr
The .id column provides a new key indicating the various training sets.
# TSCV accuracy
google_2015_tr %>%
model(RW(Close ~ drift())) %>%
forecast(h = 1) %>%
accuracy(google_2015)
The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
1 observation is missing at 253
# Training set accuracy
google_2015 %>%
model(RW(Close ~ drift())) %>%
accuracy()
As expected, the accuracy measures from the residuals are smaller, as the corresponding “forecasts” are based on a model fitted to the entire data set, rather than being true forecasts.
A good way to choose the best forecasting model is to find the model with the smallest RMSE computed using time series cross-validation.
The code below evaluates the forecasting performance of 1- to 8-step-ahead drift forecasts. The plot shows that the forecast error increases as the forecast horizon increases, as we would expect.
google_2015_tr <- google_2015 %>%
stretch_tsibble(.init = 3, .step = 1)
fc <- google_2015_tr %>%
model(RW(Close ~ drift())) %>%
forecast(h = 8) %>%
group_by(.id) %>%
mutate(h = row_number()) %>%
ungroup()
fc %>%
accuracy(google_2015, by = c("h", ".model")) %>%
ggplot(aes(x = h, y = RMSE)) +
geom_point()
The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
8 observations are missing between 253 and 260