Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:
Australian Population (global_economy)
Bricks (aus_production)
NSW Lambs (aus_livestock)
Household wealth (hh_budget)
Australian takeaway food turnover (aus_retail).
global_economy %>%
filter(Country == "Australia") %>%
autoplot(Population)
The data show a clear trend and no seasonality, so the random walk with drift method is appropriate.
global_economy %>%
filter(Country == "Australia") %>%
model(RW(Population ~ drift())) %>%
forecast(h = "10 years") %>%
autoplot(global_economy)
aus_production %>%
filter(!is.na(Bricks)) %>%
autoplot(Bricks) +
labs(title = "Clay brick production")
This data appears to have more seasonality than trend, so of the models available, seasonal naive is most appropriate.
aus_production %>%
filter(!is.na(Bricks)) %>%
model(SNAIVE(Bricks)) %>%
forecast(h = "5 years") %>%
autoplot(aus_production)
## Warning: Removed 20 row(s) containing missing values (geom_path).
nsw_lambs <- aus_livestock %>%
filter(State == "New South Wales", Animal == "Lambs")
nsw_lambs %>%
autoplot(Count)
This data appears to have more seasonality than trend, so of the models available, seasonal naive is most appropriate.
nsw_lambs %>%
model(SNAIVE(Count)) %>%
forecast(h = "5 years") %>%
autoplot(nsw_lambs)
hh_budget %>%
autoplot(Wealth)
Annual data with an upward trend, so we can use a random walk with drift.
hh_budget %>%
model(RW(Wealth ~ drift())) %>%
forecast(h = "5 years") %>%
autoplot(hh_budget)
takeaway <- aus_retail %>%
filter(Industry == "Takeaway food services") %>%
summarise(Turnover = sum(Turnover))
takeaway %>% autoplot(Turnover)
This data has strong seasonality and strong trend, so we will use a seasonal naive model with drift.
takeaway %>%
model(SNAIVE(Turnover ~ drift())) %>%
forecast(h = "5 years") %>%
autoplot(takeaway)
This is actually not one of the four benchmark methods discussed in the book, but is sometimes a useful benchmark when there is strong seasonality and strong trend.
The corresponding equation is \[ \hat{y}_{T+h|T} = y_{T+h-m(k+1)} + \frac{h}{T-m}\sum_{t=m+1}^T(y_t - y_{t-m}), \] where \(m=12\) and \(k\) is the integer part of \((h-1)/m\) (i.e., the number of complete years in the forecast period prior to time \(T+h\)).
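As a quick check of this formula (a sketch; it reuses the takeaway series created above, and relies on fable estimating the drift as the mean of the seasonal differences), the hand-computed one-step forecast should match the fable point forecast:
y <- takeaway$Turnover
n <- length(y)
m <- 12
# Drift estimate: the average year-on-year (seasonal) difference
b <- sum(y[(m + 1):n] - y[1:(n - m)]) / (n - m)
# Hand-computed h = 1 forecast: y_{T+1-m} + b
y[n + 1 - m] + b
takeaway %>%
  model(SNAIVE(Turnover ~ drift())) %>%
  forecast(h = 1) %>%
  pull(.mean)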
Use the Facebook stock price (data set gafa_stock) to do
the following:
fb_stock <- gafa_stock %>%
filter(Symbol == "FB")
fb_stock %>%
autoplot(Close)
An upward trend is evident until mid-2018, after which the closing stock
price drops.
The data must be made regular before they can be modelled, since stocks are not traded every calendar day. We will use trading days as our regular index.
fb_stock <- fb_stock %>%
mutate(trading_day = row_number()) %>%
update_tsibble(index = trading_day, regular = TRUE)
Time to model a random walk with drift.
fb_stock %>%
model(RW(Close ~ drift())) %>%
forecast(h = 100) %>%
autoplot(fb_stock)
Prove that drift forecasts are an extrapolation of the line between the first and last observations. First, we will demonstrate this graphically.
fb_stock %>%
model(RW(Close ~ drift())) %>%
forecast(h = 100) %>%
autoplot(fb_stock) +
geom_line(
aes(y = Close),
linetype = "dashed", colour = "blue",
data = fb_stock %>% filter(trading_day %in% range(trading_day))
)
To prove it algebraically, note that \[ \hat{y}_{T+h|T} = y_T + h\left(\frac{y_T-y_1}{T-1}\right), \]
which is a straight line with slope \((y_T-y_1)/(T-1)\) that goes through the point \((T,y_T)\).
Therefore, it must also go through the point \((1,c)\) where \[ (y_T-c)/(T-1) = (y_T - y_1) / (T-1), \] so \(c=y_1\).
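The same fact can be checked numerically (a sketch): the drift point forecasts should sit exactly on the line through the first and last observations.
y1 <- first(fb_stock$Close)
yT <- last(fb_stock$Close)
n <- nrow(fb_stock)
fc_drift <- fb_stock %>%
  model(RW(Close ~ drift())) %>%
  forecast(h = 100)
# Point forecasts equal y_T + h * (y_T - y_1) / (T - 1)
all.equal(fc_drift$.mean, yT + (1:100) * (yT - y1) / (n - 1))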
Try using other appropriate benchmark methods. The naive method is the most appropriate alternative here: the data are non-seasonal, so the seasonal naive method does not apply, and the mean forecast is terrible for this type of data.
fb_stock %>%
model(NAIVE(Close)) %>%
forecast(h = 100) %>%
autoplot(fb_stock)
The naive method also appears to be appropriate.
Produce forecasts for the 7 Victorian series in
aus_livestock using SNAIVE(). Plot the
resulting forecasts including the historical data. Is this a reasonable
benchmark for these series?
aus_livestock %>%
filter(State == "Victoria") %>%
model(SNAIVE(Count)) %>%
forecast(h = "5 years") %>%
autoplot(aus_livestock)
Most point forecasts look reasonable from the seasonal naive method.
Some series are more seasonal than others, and for the series with very weak seasonality it may be better to consider using a naive or drift method.
The prediction intervals in some cases go below zero, so perhaps a log transformation would have been better for these series.
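A sketch of that suggestion (assuming all counts are strictly positive; otherwise log(Count + 1) could be used): modelling on the log scale gives back-transformed prediction intervals that cannot fall below zero.
aus_livestock %>%
  filter(State == "Victoria") %>%
  model(SNAIVE(log(Count))) %>%  # log() requires strictly positive counts
  forecast(h = "5 years") %>%
  autoplot(aus_livestock)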
Are the following statements true or false? Explain your answer.
"Good forecast methods should have normally distributed residuals." False.
Although many good forecasting methods produce normally distributed residuals, this is not required to produce good forecasts.
Other forecasting methods may use other distributions; this is just less common, as they can be more difficult to work with.
"A model with small residuals will give good forecasts." False.
It is possible to produce a model with small residuals by making a highly complicated (overfitted) model that fits the data extremely well.
This highly complicated model will often perform very poorly when forecasting new data.
"The best measure of forecast accuracy is MAPE." False.
There is no single best measure of accuracy; often you will want to look at a collection of accuracy measures, as they can reveal different things about your forecast errors.
MAPE in particular has some substantial disadvantages: extreme values can result when \(y_t\) is close to zero, and it assumes that the unit being measured has a meaningful zero.
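A tiny made-up example illustrates the near-zero problem: a single small actual value dominates the MAPE.
actual <- c(100, 120, 0.1)
fc <- c(110, 115, 1)
mean(abs((actual - fc) / actual)) * 100
## [1] 304.7222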
"If your model doesn't forecast well, you should make it more complicated." False.
There are many reasons why a model may not forecast well, and making the model more complicated can make the forecasts worse.
The model specified should capture the structures that are evident in the data. Adding terms unrelated to those structures may improve the in-sample fit (the residuals), but the forecasting performance of the model will not improve.
Adding missing features relevant to the data (such as including a seasonal pattern that exists in the data) should improve forecast performance.
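For example (a sketch using the quarterly beer production in aus_production): the series is strongly seasonal, so adding that structure via SNAIVE() should beat NAIVE() on the test set.
beer_train <- aus_production %>%
  filter(year(Quarter) <= 2007)
beer_train %>%
  model(naive = NAIVE(Beer), snaive = SNAIVE(Beer)) %>%
  forecast(h = "2 years") %>%
  accuracy(aus_production)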
"Always choose the model with the best forecast accuracy as measured on the test set." False.
There are many measures of forecast accuracy, and the appropriate model is the one which is best suited to the forecasting task. For instance, you may be interested in choosing a model which forecasts well for predictions exactly one year ahead. In this case, using cross-validated accuracy could be a more useful approach to evaluating accuracy.
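A sketch of that idea, using the cross-validation pattern from the book and the monthly myseries object created in the next exercise: keep only the 12-step-ahead forecast from each origin before computing accuracy.
myseries %>%
  stretch_tsibble(.init = 48, .step = 1) %>%
  model(SNAIVE(Turnover)) %>%
  forecast(h = 12) %>%
  group_by(.id) %>%
  mutate(h = row_number()) %>%
  ungroup() %>%
  as_fable(response = "Turnover", distribution = Turnover) %>%
  filter(h == 12) %>%  # keep only forecasts exactly one year ahead
  accuracy(myseries)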
Select one of the time series as follows (but choose your own seed value):
set.seed(12345678)
myseries <- aus_retail %>%
filter(`Series ID` == sample(aus_retail$`Series ID`, 1))
myseries_train <- myseries %>%
filter(year(Month) < 2011)
autoplot(myseries, Turnover) +
autolayer(myseries_train, Turnover, colour = "red")
# or
autoplot(myseries, Turnover) +
  geom_line(data = myseries_train, aes(x = Month, y = Turnover), colour = "red")
The plot (above) indicates that the training data has been extracted correctly.
Fit a seasonal naive model using SNAIVE() applied to your training data (myseries_train).
fit <- myseries_train %>%
model(SNAIVE(Turnover))
dt_test <- myseries %>%
  filter(year(Month) >= 2011)
# or
dt_test2 <- anti_join(myseries, myseries_train)
## Joining, by = c("State", "Industry", "Series ID", "Month", "Turnover")
fc <- fit %>%
forecast(dt_test)
fc %>% autoplot(myseries)
bind_rows(
accuracy(fit),
accuracy(fc, myseries)
) %>%
select(-State, -Industry, -.model)
## # A tibble: 2 × 9
## .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Training 0.439 1.21 0.915 5.23 12.4 1 1 0.768
## 2 Test 0.836 1.55 1.24 5.94 9.06 1.36 1.28 0.601
The accuracy on the training data is substantially better than the out-of-sample forecast accuracy. This is common, and especially evident in this example as the model has failed to capture the trend in the data. This can be seen in the mean error, which is above zero as the model predictions do not account for the upward trend.
myseries_accuracy <- function(data, last_training_year) {
  train <- data %>%
    filter(year(Month) <= last_training_year)
  fit <- train %>%
    model(SNAIVE(Turnover))
  fc <- fit %>%
    forecast(new_data = anti_join(data, train))
  bind_rows(
    accuracy(fit),
    accuracy(fc, data)
  ) %>%
    mutate(last_training_year = last_training_year) %>%
    select(last_training_year, .type, ME:ACF1)
}
as.list(2011:2017) %>%
purrr::map_dfr(myseries_accuracy, data = myseries) %>%
ggplot(aes(x = last_training_year, y = RMSE, group = .type)) +
geom_line(aes(col = .type))
## Joining, by = c("State", "Industry", "Series ID", "Month", "Turnover")
The accuracy on the training data is almost unchanged as the training set grows. The test-set RMSE varies more: it is averaged over the whole forecast horizon, and with less training data the forecast horizons are longer, so the errors tend to be larger.
Consider the number of pigs slaughtered in New South Wales (data set
aus_livestock).
nsw_pigs <- aus_livestock %>%
filter(State == "New South Wales", Animal == "Pigs")
nsw_pigs %>%
autoplot(Count)
Data generally follows a downward trend; however, there are some periods where the number of pigs slaughtered changes rapidly.
nsw_pigs %>% gg_season(Count, labels = "right")
nsw_pigs %>% gg_subseries(Count)
Some seasonality is apparent, with notable increases in December and decreases during January, February and April.
# Withhold the last 6 years (72 months) as a test set
nsw_pigs_train <- nsw_pigs %>% slice(1:486)
fit <- nsw_pigs_train %>%
model(
mean = MEAN(Count),
naive = NAIVE(Count),
snaive = SNAIVE(Count),
drift = RW(Count ~ drift())
)
fit %>%
forecast(h = "6 years") %>%
accuracy(nsw_pigs)
## # A tibble: 4 × 12
## .model Animal State .type ME RMSE MAE MPE MAPE MASE RMSSE
## <chr> <fct> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 drift Pigs New South … Test -4685. 8091. 6967. -7.36 10.1 0.657 0.557
## 2 mean Pigs New South … Test -39360. 39894. 39360. -55.9 55.9 3.71 2.75
## 3 naive Pigs New South … Test -6138. 8941. 7840. -9.39 11.4 0.740 0.615
## 4 snaive Pigs New South … Test -5838. 10111. 8174. -8.81 11.9 0.771 0.696
## # … with 1 more variable: ACF1 <dbl>
The drift method performed best on every accuracy measure, although its forecast errors had a larger first-order autocorrelation.
fit %>%
select(drift) %>%
gg_tsresiduals()
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).
The residuals do not appear to be white noise as the ACF plot contains many significant lags. It is also clear that the seasonal component is not captured by the drift method, as there exists a strong positive auto-correlation at lag 12 (1 year). The histogram appears to have a slightly long left tail.
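A Ljung-Box test confirms this (a sketch; dof = 1 accounts for the estimated drift parameter): a tiny p-value means the residuals are distinguishable from white noise.
fit %>%
  select(drift) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 24, dof = 1)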
We will use the bricks data from aus_production
(Australian quarterly clay brick production 1956–2005) for this
exercise.
tidy_bricks <- aus_production %>%
filter(!is.na(Bricks))
tidy_bricks %>%
model(STL(Bricks)) %>%
components() %>%
autoplot()
The seasonal variation grows with the level of the series (a multiplicative pattern), so a log transformation should be used.
dcmp <- tidy_bricks %>%
model(STL(log(Bricks))) %>%
components()
dcmp %>%
autoplot()
Seasonality varies slightly.
dcmp <- tidy_bricks %>%
model(stl = STL(log(Bricks) ~ season(window = "periodic"))) %>%
components()
dcmp %>% autoplot()
The seasonality looks fairly stable, so I’ve used a periodic season (window). The decomposition still performs well when the seasonal component is fixed. The remainder term does not appear to contain a substantial amount of seasonality.
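To back up the claim about the remainder (a sketch), the STL seasonal strength feature of the remainder should be close to zero:
dcmp %>%
  as_tsibble() %>%
  features(remainder, feat_stl)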
dcmp %>%
as_tsibble() %>%
autoplot(season_adjust)
fit <- dcmp %>%
select(-.model) %>%
model(naive = NAIVE(season_adjust)) %>%
forecast(h = "5 years")
dcmp %>%
as_tsibble() %>%
autoplot(season_adjust) + autolayer(fit)
Use decomposition_model() to reseasonalise the results, giving forecasts for the original data.
fit <- tidy_bricks %>%
  model(stl_mdl = decomposition_model(STL(log(Bricks)), NAIVE(season_adjust)))
fit %>%
forecast(h = "5 years") %>%
autoplot(tidy_bricks)
Compare forecasts from decomposition_model() with those from SNAIVE(), using a test set comprising the last 2 years of data. Which is better?
tidy_bricks_train <- tidy_bricks %>%
  slice(1:(n() - 8))
fit <- tidy_bricks_train %>%
model(
stl_mdl = decomposition_model(STL(log(Bricks)), NAIVE(season_adjust)),
snaive = SNAIVE(Bricks)
)
fc <- fit %>%
forecast(h = "2 years")
fc %>%
autoplot(tidy_bricks, level = NULL)
The decomposition forecasts appear to more closely follow the actual future data.
fc %>%
accuracy(tidy_bricks)
## # A tibble: 2 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 snaive Test 2.75 20 18.2 0.395 4.52 0.504 0.407 -0.0503
## 2 stl_mdl Test 0.368 18.1 15.1 -0.0679 3.76 0.418 0.368 0.115
The STL decomposition forecasts are more accurate than the seasonal naive forecasts across all accuracy measures.
tourism contains quarterly visitor nights (in thousands)
from 1998 to 2017 for 76 regions of Australia.
Extract data from the Gold Coast region using filter() and aggregate total overnight trips (sum over Purpose) using summarise(). Call this new dataset gc_tourism.
gc_tourism <- tourism %>%
  filter(Region == "Gold Coast") %>%
  summarise(Trips = sum(Trips))
gc_tourism
## # A tsibble: 80 x 2 [1Q]
## Quarter Trips
## <qtr> <dbl>
## 1 1998 Q1 827.
## 2 1998 Q2 681.
## 3 1998 Q3 839.
## 4 1998 Q4 820.
## 5 1999 Q1 987.
## 6 1999 Q2 751.
## 7 1999 Q3 822.
## 8 1999 Q4 914.
## 9 2000 Q1 871.
## 10 2000 Q2 780.
## # … with 70 more rows
Using slice() or filter(), create three training sets for this data excluding the last 1, 2 and 3 years. For example, gc_train_1 <- gc_tourism %>% slice(1:(n()-4)).
gc_train_1 <- gc_tourism %>% slice(1:(n() - 4))
gc_train_2 <- gc_tourism %>% slice(1:(n() - 8))
gc_train_3 <- gc_tourism %>% slice(1:(n() - 12))
Compute one year of forecasts for each training set using the seasonal naive (SNAIVE()) method. Call these gc_fc_1, gc_fc_2 and gc_fc_3, respectively.
gc_fc <- bind_cols(
  gc_train_1 %>% model(gc_fc_1 = SNAIVE(Trips)),
  gc_train_2 %>% model(gc_fc_2 = SNAIVE(Trips)),
  gc_train_3 %>% model(gc_fc_3 = SNAIVE(Trips))
) %>% forecast(h = "1 year")
gc_fc %>% autoplot(gc_tourism)
Use accuracy() to compare the test set forecast accuracy using MAPE. Comment on these.
gc_fc %>% accuracy(gc_tourism)
## # A tibble: 3 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 gc_fc_1 Test 75.1 167. 154. 6.36 15.1 2.66 2.36 -0.410
## 2 gc_fc_2 Test 12.0 43.1 39.5 1.14 4.32 0.670 0.599 -0.792
## 3 gc_fc_3 Test 35.8 91.4 83.9 3.56 9.07 1.46 1.30 0.239
The second set of forecasts is the most accurate (as can be seen in the previous plot).
From gafa_stock, select the Apple closing stock price and plot it. As before, we re-index by trading day so that the series is regular.
apple_stock <- gafa_stock %>%
  filter(Symbol == "AAPL") %>%
  mutate(trading_day = row_number()) %>%
  update_tsibble(index = trading_day, regular = TRUE)
apple_stock %>%
autoplot(Close)
apple_stretch <- apple_stock %>%
  stretch_tsibble(.init = 10, .step = 1) %>%
  # Drop the final fold, which has no future observation left to forecast
  filter(.id != max(.id))
apple_stretch
## # A tsibble: 790,608 x 10 [1]
## # Key: .id, Symbol [1,248]
## Symbol Date Open High Low Close Adj_Close Volume trading_day .id
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 AAPL 2014-01-02 79.4 79.6 78.9 79.0 67.0 5.87e7 1 1
## 2 AAPL 2014-01-03 79.0 79.1 77.2 77.3 65.5 9.81e7 2 1
## 3 AAPL 2014-01-06 76.8 78.1 76.2 77.7 65.9 1.03e8 3 1
## 4 AAPL 2014-01-07 77.8 78.0 76.8 77.1 65.4 7.93e7 4 1
## 5 AAPL 2014-01-08 77.0 77.9 77.0 77.6 65.8 6.46e7 5 1
## 6 AAPL 2014-01-09 78.1 78.1 76.5 76.6 65.0 6.98e7 6 1
## 7 AAPL 2014-01-10 77.1 77.3 75.9 76.1 64.5 7.62e7 7 1
## 8 AAPL 2014-01-13 75.7 77.5 75.7 76.5 64.9 9.46e7 8 1
## 9 AAPL 2014-01-14 76.9 78.1 76.8 78.1 66.1 8.31e7 9 1
## 10 AAPL 2014-01-15 79.1 80.0 78.8 79.6 67.5 9.79e7 10 1
## # … with 790,598 more rows
fit_cv <- apple_stretch %>%
model(RW(Close ~ drift()))
fc_cv <- fit_cv %>%
forecast(h = 1)
fc_cv
## # A fable: 1,248 x 6 [1]
## # Key: .id, Symbol, .model [1,248]
## .id Symbol .model trading_day Close .mean
## <int> <chr> <chr> <dbl> <dist> <dbl>
## 1 1 AAPL RW(Close ~ drift()) 11 N(80, 1.2) 79.7
## 2 2 AAPL RW(Close ~ drift()) 12 N(79, 1.1) 79.2
## 3 3 AAPL RW(Close ~ drift()) 13 N(77, 1.4) 77.1
## 4 4 AAPL RW(Close ~ drift()) 14 N(78, 1.4) 78.4
## 5 5 AAPL RW(Close ~ drift()) 15 N(79, 1.3) 78.8
## 6 6 AAPL RW(Close ~ drift()) 16 N(79, 1.2) 79.5
## 7 7 AAPL RW(Close ~ drift()) 17 N(78, 1.3) 77.9
## 8 8 AAPL RW(Close ~ drift()) 18 N(79, 1.2) 78.6
## 9 9 AAPL RW(Close ~ drift()) 19 N(72, 3.5) 72.0
## 10 10 AAPL RW(Close ~ drift()) 20 N(71, 3.3) 71.1
## # … with 1,238 more rows
fit_cv2 <- apple_stretch %>%
  model(MEAN(Close))
fc_cv2 <- fit_cv2 %>%
forecast(h = 1)
fc_cv2
## # A fable: 1,248 x 6 [1]
## # Key: .id, Symbol, .model [1,248]
## .id Symbol .model trading_day Close .mean
## <int> <chr> <chr> <dbl> <dist> <dbl>
## 1 1 AAPL MEAN(Close) 11 N(78, 1.3) 77.6
## 2 2 AAPL MEAN(Close) 12 N(78, 1.4) 77.7
## 3 3 AAPL MEAN(Close) 13 N(78, 1.3) 77.7
## 4 4 AAPL MEAN(Close) 14 N(78, 1.2) 77.7
## 5 5 AAPL MEAN(Close) 15 N(78, 1.2) 77.8
## 6 6 AAPL MEAN(Close) 16 N(78, 1.3) 77.9
## 7 7 AAPL MEAN(Close) 17 N(78, 1.2) 77.9
## 8 8 AAPL MEAN(Close) 18 N(78, 1.2) 78.0
## 9 9 AAPL MEAN(Close) 19 N(78, 3) 77.7
## 10 10 AAPL MEAN(Close) 20 N(77, 4.9) 77.3
## # … with 1,238 more rows
fc_cv %>% accuracy(apple_stock)
## # A tibble: 1 × 11
## .model Symbol .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 RW(Close ~ … AAPL Test -0.0158 2.10 1.41 -0.0139 1.06 1.00 1.00 0.0330
fc_cv2 %>% accuracy(apple_stock)
## # A tibble: 1 × 11
## .model Symbol .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 MEAN(Close) AAPL Test 26.9 37.6 28.5 16.9 18.6 20.2 17.9 0.996
For one-step-ahead cross-validated forecasts, the random walk with drift method is far more accurate than the mean method on every measure.
Using the prices data set, plot the annual wheat prices (do not forget to remove the NA values).
wheat <- prices %>%
  filter(!is.na(wheat)) %>%
  select(wheat)
wheat %>%
autoplot(wheat) +
labs(
title = "Annual wheat prices",
y = "$US (adjusted for inflation)")
fit <- wheat %>%
model(rwdrift = RW(wheat ~ drift()))
fit
## # A mable: 1 x 1
## rwdrift
## <model>
## 1 <RW w/ drift>
fc <- fit %>%
forecast(h = 50)
fc
## # A fable: 50 x 4 [1Y]
## # Key: .model [1]
## .model year wheat .mean
## <chr> <dbl> <dist> <dbl>
## 1 rwdrift 1997 N(124, 4389) 124.
## 2 rwdrift 1998 N(120, 8823) 120.
## 3 rwdrift 1999 N(116, 13302) 116.
## 4 rwdrift 2000 N(112, 17826) 112.
## 5 rwdrift 2001 N(108, 22394) 108.
## 6 rwdrift 2002 N(104, 27007) 104.
## 7 rwdrift 2003 N(100, 31665) 99.9
## 8 rwdrift 2004 N(96, 36368) 95.9
## 9 rwdrift 2005 N(92, 41116) 91.9
## 10 rwdrift 2006 N(88, 45908) 87.8
## # … with 40 more rows
fc %>% autoplot(wheat) +
labs(
title = "Annual wheat prices",
y = "US$ (adjusted for inflation)"
)
fcb <- fit %>%
  generate(h = 30, times = 100, bootstrap = TRUE)
fcb
## # A tsibble: 3,000 x 4 [1Y]
## # Key: .model, .rep [100]
## .model year .rep .sim
## <chr> <dbl> <chr> <dbl>
## 1 rwdrift 1997 1 139.
## 2 rwdrift 1998 1 157.
## 3 rwdrift 1999 1 -22.0
## 4 rwdrift 2000 1 -130.
## 5 rwdrift 2001 1 -255.
## 6 rwdrift 2002 1 -264.
## 7 rwdrift 2003 1 -382.
## 8 rwdrift 2004 1 -427.
## 9 rwdrift 2005 1 -489.
## 10 rwdrift 2006 1 -533.
## # … with 2,990 more rows
wheat %>%
ggplot(aes(x = year)) +
geom_line(aes(y = wheat)) +
geom_line(aes(y = .sim, colour = as.factor(.rep)),
data = fcb) +
guides(colour = "none")
Bootstrapped intervals are useful when the residuals are not normally distributed; this technique only assumes that the residuals are uncorrelated and have constant variance.
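Bootstrapped prediction intervals can also be obtained directly from forecast() (a sketch), rather than by plotting simulated paths:
fit %>%
  forecast(h = 30, bootstrap = TRUE, times = 1000) %>%
  autoplot(wheat)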