624 Homework 3

library(fpp3)

## Warning: package 'fpp3' was built under R version 4.4.2

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr

## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──

## ✔ tibble      3.2.1     ✔ tsibble     1.1.6
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.1     ✔ feasts      0.4.1
## ✔ lubridate   1.9.3     ✔ fable       0.4.1
## ✔ ggplot2     3.5.1

## Warning: package 'tsibble' was built under R version 4.4.2

## Warning: package 'tsibbledata' was built under R version 4.4.2

## Warning: package 'feasts' was built under R version 4.4.2

## Warning: package 'fabletools' was built under R version 4.4.2

## Warning: package 'fable' was built under R version 4.4.2

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0     ✔ readr   2.1.5
## ✔ purrr   1.0.2     ✔ stringr 1.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Introduction

In this homework assignment I will be submitting exercises 5.1, 5.2, 5.3, 5.4 and 5.7 from the Hyndman online Forecasting book.

Question 5.1

Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

Australian Population (global_economy) Bricks (aus_production) NSW Lambs (aus_livestock) Household wealth (hh_budget). Australian takeaway food turnover (aus_retail).

Australian Population (global_economy)

aus_population <- global_economy %>%
  filter(Code == "AUS") %>%
  select(Population)

aus_population %>%
  autoplot()

## Plot variable not specified, automatically selected `.vars = Population`

Because of the strong positive trend in the population of our data I will use the RW(y ~ drift()) forecasting method.

aus_population %>%
  model(Drift = RW(Population ~ drift())) %>%
  forecast(h = 10) %>%
  autoplot(aus_population)

Bricks (aus_production)

bricks <- aus_production %>%
  select(Bricks) %>%
  na.omit()

bricks %>%
  autoplot()

## Plot variable not specified, automatically selected `.vars = Bricks`

bricks %>%
  gg_season()

## Plot variable not specified, automatically selected `y = Bricks`

Due to strong quarterly seasonality I will forecast the bricks production with SNAIVE(y) with a one year lag (accounting for the seasonal variance of the same quarter one year ago).

bricks %>%
  model(SNAIVE(Bricks ~ lag("1 year"))) %>%
  forecast(h = 10) %>%
  autoplot(bricks, level = 95)

## NSW Lambs (aus_livestock)

lambs <- aus_livestock %>%
  filter(Animal == "Lambs",
         State == "New South Wales") %>%
  select(Count)

lambs %>%
  autoplot()

## Plot variable not specified, automatically selected `.vars = Count`

lambs %>%
  gg_season()

## Plot variable not specified, automatically selected `y = Count`

With no apparent seasonal pattern I will opt to use the NAIVE() model to forecast.

lambs %>%
  model(NAIVE(Count)) %>%
  forecast(h = 36) %>%
  autoplot(lambs)

Household wealth (hh_budget)

hhwealth <- hh_budget %>%
  select(Wealth)

hhwealth %>%
  autoplot()

## Plot variable not specified, automatically selected `.vars = Wealth`

For the household wealth data I will forecast with the RW(y ~ drift()) drift method, there is a positive trend here.

hhwealth %>%
  model(Drift = RW(Wealth ~ drift())) %>%
  forecast(h = 10) %>%
  autoplot(hhwealth)

Australian takeaway food turnover (aus_retail)

aus_takeaway <- aus_retail %>%
  filter(Industry == 'Cafes, restaurants and takeaway food services') %>%
  group_by(Industry) %>%
  summarise(Turnover = sum(Turnover))

aus_takeaway %>%
  autoplot(Turnover)

Although we have seasonality in the data above I will use the drift method RW(y ~ drift()), as the trend here seems to have a bigger impact on the turnover.

aus_takeaway %>%
  model(Drift = RW(Turnover ~ drift())) %>%
  forecast(h = 30) %>%
  autoplot(aus_takeaway)

Question 5.2

Use the Facebook stock price (data set gafa_stock) to do the following:

Produce a time plot of the series.
Produce forecasts using the drift method and plot them.
Show that the forecasts are identical to extending the line drawn between the first and last observations.
Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

a. Produce a time plot of the series.

facebook <- gafa_stock %>%
  filter(Symbol == 'FB') %>%
  select(Close)

facebook %>% autoplot()

## Plot variable not specified, automatically selected `.vars = Close`

b. Produce forecasts using the drift method and plot them.

# Creating new time index to account for trading days rather than calendar days
facebook <- facebook %>%
  mutate(trading_day = row_number()) %>%
  update_tsibble(index = trading_day, regular = TRUE)

facebook %>%
  model(Drift = RW(Close ~ drift())) %>%
  forecast(h = 100) %>%
  autoplot(facebook)

c. Show that the forecasts are identical to extending the line drawn between the first and last observations.

facebook %>%
  model(Drift = RW(Close ~ drift())) %>%
  forecast(h = 100) %>%
  autoplot(facebook) +
  geom_abline(intercept = facebook[[1,1]], slope = (tail(facebook$Close, n=1) - facebook[[1,1]]) / length(facebook$Close))

d. Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

facebook %>%
  model( Mean = MEAN(Close),
         Naive = NAIVE(Close),
         Drift = RW(Close ~ drift())) %>%
  forecast(h = 100) %>%
  autoplot(facebook)

Determining the best model fit for financial stock data is tricky as there are fundamental considerations that need to taken into account. The RW drift model assumes that the stock price will continue to increase due to the historical increases and that may not be the case if the fundamental outlook for the company changes. I would say that if consensus stock price estimates are significantly higher than current valuations the drift model might be appropriate. Without fundamental data available The best model is probably the naive model where the next day’s price is most likely the most recent day’s price.

Question 5.3

Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

# Extract data of interest
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals()

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Look at some forecasts
fit |> forecast() |> autoplot(recent_production)

What do you conclude?

The innovation residuals seem uncorrelated, have constant variance, and the histogram of residuals creates a nearly normal plot. The acf plot seems that it could have some seasonality in it, however I cannot determine by the eye if there is a pattern. I would conclude that the residual variance looks like white noise. A better model or modifications such as accounting for possible outlier events could produce a more accurate forecast.

Question 5.4

Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

Australian Exports

# Extract data of interest
aus_economy <- global_economy |>
    filter(Country == "Australia") 

# Define and estimate a model
fit <- aus_economy  |> model(NAIVE(Exports))

# Look at the residuals
fit |> gg_tsresiduals()

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

# Look at some forecasts
fit %>% forecast() %>% autoplot(aus_economy)

I chose to use the naive model as the data is yearly and there cannot be seasonality. The innovation residuals show no correlation, relatively constant variance, and a normal distribution for the histogram of residuals. In the acf plot we can see that there is greater absolute values in more recent lags - indicating that variance might be increasing, but other than this concern the naive model seems to be a good fit.

Bricks

fit <- aus_production %>%
  filter(!is.na(Bricks)) %>%
  model(SNAIVE(Bricks ~ lag("year")))
# Look at the residuals
fit %>% gg_tsresiduals()

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).

fit %>% forecast() %>% autoplot(aus_production)

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

The acf plot above stands out to me immediately signaling that although a seasonal naive model was used there is still seasonality in the model. The innovation residuals also shows an increase in variance after about 1975, and a left sided tail in the residual histogram. This model is not a good fit for forecasting as not all the seasonality is accounted for. Running the seasonal naive model for data post 1980 might yield a better fit, as we can see that this is when the variance in bricks production increases.

Question 5.7

For your retail time series (from Exercise 7 in Section 2.10):

Create a training dataset consisting of observations before 2011 using

set.seed(1234)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries_train <- myseries |>
  filter(year(Month) < 2011)

Check that your data have been split appropriately by producing the following plot.

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).

fit <- myseries_train |>
  model(SNAIVE(Turnover))

Check the residuals.

fit |> gg_tsresiduals()

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).

Do the residuals appear to be uncorrelated and normally distributed?

Apparent in the acf plot there is correlation, and the residuals are nearly normal, with a right sided tail ended around 10 where the left tail ends around -5.

Produce forecasts for the test data

fc <- fit |>
  forecast(new_data = anti_join(myseries, myseries_train))

## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`

fc |> autoplot(myseries)

Compare the accuracy of your forecasts against the actual values.

fit |> accuracy()

## # A tibble: 1 × 12
##   State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Tasmania Cafes, … SNAIV… Trai…  1.33  2.90  2.22  6.31  10.7     1     1 0.800

fc |> accuracy(myseries)

## # A tibble: 1 × 12
##   .model    State Industry .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>     <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Tasm… Cafes, … Test   7.12  9.13  7.58  13.2  14.4  3.42  3.15 0.863

All metrics are worse on the test set when compared to the training set. When looking at the plot of data in hindsight, the trend of the data has a much bigger impact on the turnover amount and thus a naive method would not be appropriate. The naive method assumes that the most likely next observation is the most recent observation - and so an increase in turnover could not be considered in the model.

How sensitive are the accuracy measures to the amount of training data used?

The accuracy measures are highly sensitive to the amount of training data used. More or less data would change the nature of the model’s forecast by changing the seasonality, trend, variance, residuals, etc., captured. The model’s accuracy measures would then be changed due to the changed forecast.