Data 624 Homework 3

library(fpp3)

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr

## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.2 ──

## ✔ tibble      3.2.1     ✔ tsibble     1.1.6
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.1     ✔ feasts      0.4.2
## ✔ lubridate   1.9.4     ✔ fable       0.5.0
## ✔ ggplot2     4.0.0

## Warning: package 'fable' was built under R version 4.5.2

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

Introduction

In this report, I answer questions 1, 2, 3, 4, and 7 from Section 5.11 of Hyndman and Athanasopoulos's online textbook Forecasting: Principles and Practice (3rd edition).

Question 1

Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

Australian Population (global_economy)
Bricks (aus_production)
NSW Lambs (aus_livestock)
Household wealth (hh_budget).
Australian takeaway food turnover (aus_retail).

aus_pop <- global_economy |>
  filter(Country == "Australia") |>
  select(Year, Population)

aus_pop |>
  model(RWdrift = RW(Population ~ drift())) |>
  forecast(h = 10) |>
  autoplot(aus_pop)

bricks_clean <- aus_production |>
  select(Quarter, Bricks) |>
  filter(!is.na(Bricks))

fit_bricks <- bricks_clean |>
  model(SNaive = SNAIVE(Bricks))

fc_bricks <- fit_bricks |>
  forecast(h = "2 years")

autoplot(fc_bricks, bricks_clean) +
  labs(title = "Bricks — Seasonal Naive Forecast (2 years)",
       x = "Quarter", y = "Bricks")

nsw_lambs <- aus_livestock |>
  filter(State == "New South Wales", Animal == "Lambs") |>
  select(Month, Count)

nsw_lambs |>
  model(SNaive = SNAIVE(Count)) |>
  forecast(h = "2 years") |>
  autoplot(nsw_lambs)

hh_budget |>
  model(RWdrift = RW(Wealth ~ drift())) |>
  forecast(h = 10) |>
  autoplot(hh_budget)

takeaway_aus <- aus_retail |>
  filter(Industry == "Cafes, restaurants and takeaway food services") |>
  index_by(Month) |>                         
  summarise(Turnover = sum(Turnover, na.rm = TRUE))

# forecast two years ahead and plot the forecasts on top of the aggregated series
takeaway_aus |>
  model(SNaive = SNAIVE(Turnover)) |>
  forecast(h = "2 years") |>
  autoplot(takeaway_aus) +
  labs(title = "Australian Takeaway Food Turnover: Seasonal Naive Forecast")

In this case I summed monthly turnover across all states so that the series represents turnover for Australia as a whole.

Australian population and household wealth exhibit clear long-term trends with no seasonality, so I used random walk models with drift. Brick production, NSW lamb slaughter, and takeaway food turnover display strong seasonal patterns, making seasonal naïve forecasts more appropriate. These models provide simple baseline forecasts that preserve the dominant features of each series.
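To make the difference between the three benchmarks concrete, here is a minimal base R sketch of their point forecasts. The toy vector `y`, the period `m`, and the horizon `h` are hypothetical values chosen for illustration; they are not taken from any of the series above.

```r
# Toy quarterly series (hypothetical values, for illustration only)
y <- c(10, 14, 12, 16, 11, 15, 13, 18)
m <- 4            # seasonal period (quarterly data)
h <- 4            # forecast horizon
n <- length(y)

# NAIVE: every forecast equals the last observation
naive_fc <- rep(y[n], h)

# SNAIVE: each forecast equals the last observed value from the same season
snaive_fc <- y[n - m + ((seq_len(h) - 1) %% m) + 1]

# Drift: last observation plus h times the average historical change
drift_fc <- y[n] + seq_len(h) * (y[n] - y[1]) / (n - 1)

naive_fc
snaive_fc
drift_fc
```

NAIVE and drift ignore the seasonal position entirely, while SNAIVE ignores any trend, which is why the choice depends on which feature dominates the series.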

Question 2

Use the Facebook stock price (data set gafa_stock) to do the following:

Produce a time plot of the series.

facebook <- gafa_stock |>
  filter(Symbol == "FB")

autoplot(facebook, Close) +
  labs(title = "Facebook Stock Daily Closing Price",
       x = "Date", y = "Closing Price [USD]")

The Facebook stock price shows an overall upward trend with substantial day-to-day volatility and no clear seasonality, which is typical for financial time series.

Produce forecasts using the drift method and plot them.

# facebook trading-day data
fb <- gafa_stock |>
  filter(Symbol == "FB") |>
  select(Date, Close) |>
  arrange(Date) |>
  as_tibble()

# daily calendar and join prices onto it
fb_daily_tbl <- tibble(Date = seq(min(fb$Date), max(fb$Date), by = "day")) |>
  left_join(fb, by = "Date") |>
  tidyr::fill(Close, .direction = "down")   

# regular tsibble
fb_daily <- fb_daily_tbl |>
  as_tsibble(index = Date) |>
  tsibble::update_tsibble(regular = TRUE)

fc_drift <- fb_daily |>
  model(RWdrift = RW(Close ~ drift())) |>
  forecast(h = 30)

autoplot(fc_drift, fb_daily) +
  labs(title = "Facebook Closing Price — Drift Forecast (30 days, daily regularized)",
       x = "Date", y = "Close (USD)")

Show that the forecasts are identical to extending the line drawn between the first and last observations.

# extract means of drift forecast
fc_tbl <- fc_drift |>
  as_tibble() |>
  select(Date, drift_mean = .mean)

y1 <- fb_daily |> as_tibble() |> slice(1) |> pull(Close)
yT <- fb_daily |> as_tibble() |> slice(n()) |> pull(Close)
n  <- nrow(fb_daily)
h  <- nrow(fc_tbl)

slope <- (yT - y1) / (n - 1)  

line_df <- bind_rows(
  fb_daily |> as_tibble() |> select(Date),
  fc_tbl |> select(Date)
) |>
  mutate(t = row_number() - 1,
         line_value = y1 + slope * t)

# plot of actual, drift forecast mean, and dashed first–last extension
ggplot() +
  geom_line(data = fb_daily |> as_tibble(), aes(Date, Close)) +
  geom_line(data = fc_tbl, aes(Date, drift_mean)) +
  geom_line(data = line_df, aes(Date, line_value),
            linetype = "dashed", linewidth = 1.1) +
  labs(title = "Drift forecast equals extension of the first–last line",
       x = "Date", y = "Close (USD)")

The drift forecast mean lies directly on top of the dashed extension, showing that the drift method simply extends the line between the first and last observations.
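The same result follows algebraically. As a short derivation (writing T for the number of observations, which is `n` in the code above), the drift forecast adds h times the average historical change to the last value, and the sum of changes telescopes:

```latex
\hat{y}_{T+h \mid T}
  = y_T + \frac{h}{T-1} \sum_{t=2}^{T} (y_t - y_{t-1})
  = y_T + h \, \frac{y_T - y_1}{T-1}
```

The right-hand side is the line through the first and last observations evaluated h steps past the end of the sample, which is what `slope` and `line_value` compute.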

Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

fit_bench <- fb_daily |>
  model(
    Naive   = NAIVE(Close),
    Drift   = RW(Close ~ drift()),
    SNaive  = SNAIVE(Close)   
  )

fc_bench <- fit_bench |>
  forecast(h = 30)

autoplot(fc_bench, fb_daily) +
  labs(title = "Facebook Close — Benchmark Forecasts (30 days)",
       x = "Date", y = "Close (USD)")

Of the benchmark methods I compared, the seasonal naïve method is not appropriate because stock prices don’t have stable seasonal patterns. The drift method extrapolates a linear trend based on the first and last observations, as discussed above, so it is sensitive to the particular sample period. The naïve method is more appropriate for stock prices because it assumes tomorrow’s price is best predicted by today’s price, consistent with the common random-walk view of financial markets. Because of this, I would choose the naïve method as the best benchmark for this series.
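One way to back this judgment with numbers would be a holdout comparison. The sketch below is not part of the exercise and makes several assumptions: it re-indexes the series by trading day (instead of the calendar-day fill used earlier), holds out an arbitrary 30 trading days, and swaps SNAIVE for MEAN because a trading-day index has no seasonal period. The names `fb_td`, `fb_train`, `fb_test`, `fit_holdout`, and `acc` are introduced here for illustration.

```r
library(fpp3)

# Re-index by trading day so the series is regular without filling weekends
fb_td <- gafa_stock |>
  filter(Symbol == "FB") |>
  mutate(day = row_number()) |>
  update_tsibble(index = day, regular = TRUE)

# Hold out the last 30 trading days (the window length is an arbitrary choice)
n_obs    <- nrow(fb_td)
fb_train <- fb_td |> filter(day <= n_obs - 30)
fb_test  <- fb_td |> filter(day >  n_obs - 30)

# Fit the benchmarks on the training portion only
fit_holdout <- fb_train |>
  model(
    Mean  = MEAN(Close),
    Naive = NAIVE(Close),
    Drift = RW(Close ~ drift())
  )

# Forecast the held-out days and score against the actual closing prices
acc <- fit_holdout |>
  forecast(new_data = fb_test) |>
  accuracy(fb_td)

acc |> select(.model, RMSE, MAE)
```

A single holdout window is only indicative; the ranking could change with a different split.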

Question 3

Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

# Extract data of interest
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)

# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))

# Look at the residuals
fit |> gg_tsresiduals()

## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_rug()`).

# Look at some forecasts
fit |> forecast() |> autoplot(recent_production)

What do you conclude?

The seasonal naïve model does a decent job of picking up the strong quarterly pattern in beer production, which is visible in the repeating peaks and troughs of the plot. However, the residual diagnostics show that the errors do not behave like white noise. The ACF plot has a large spike at lag 4, which suggests that some seasonal structure is left in the residuals. The residual time plot also shows stretches where the errors stay mostly positive or mostly negative rather than scattering randomly around zero. So, while SNAIVE works as a simple baseline and captures the main seasonality, the remaining patterns in the residuals suggest that a more advanced model would probably give better forecasts.
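The visual check could be complemented with a formal portmanteau test. The sketch below refits the same model and applies a Ljung-Box test to the innovation residuals. It assumes the fpp3 packages are installed, and `lag = 8` (two seasonal periods) is a rule-of-thumb choice made here, not something the exercise specifies.

```r
library(fpp3)

recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)

fit <- recent_production |> model(SNAIVE(Beer))

# Ljung-Box test on the innovation residuals;
# lag = 8 is an assumed rule of thumb (twice the seasonal period)
lb <- augment(fit) |>
  features(.innov, ljung_box, lag = 8)

lb
```

A small `lb_pvalue` would support the conclusion that the residuals are distinguishable from white noise.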

Question 4

Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

aus_exports <- global_economy |>
  filter(Country == "Australia") |>
  select(Year, Exports)

autoplot(aus_exports, Exports)

This is an annual series, so there is no within-year seasonal period for SNAIVE() to use. I will use NAIVE() instead.

# fit the model
fit_exports <- aus_exports |> model(NAIVE(Exports))

# residual diagnostics
fit_exports |> gg_tsresiduals()

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_rug()`).

# forecast
fit_exports |> forecast() |> autoplot(aus_exports)

The forecast is simply a continuation of the last observed value. The residual diagnostics suggest that, while the model is simple, there may still be some structure left in the data, particularly if exports exhibit longer-term movements. Overall, NAIVE provides a reasonable baseline for this series.

bricks <- aus_production |>
  filter(!is.na(Bricks))

bricks |> autoplot(Bricks) +
  labs(title = "Australian Clay Brick Production", y = "Millions of bricks")

# fit the model
fit_bricks <- bricks |> model(SNAIVE(Bricks))

# residual diagnostic
fit_bricks |> gg_tsresiduals()

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_rug()`).

# forecast
fit_bricks |> forecast() |> autoplot(bricks)

I used the seasonal naïve model since this is a quarterly series with an obvious seasonal pattern. Although it captures the quarterly seasonal pattern, the residual diagnostics show that the errors are not white noise. The ACF plot displays several significant spikes, indicating there is still autocorrelation. The residual time plot also shows periods of large shocks and clustering, which suggests that while SNAIVE works as a baseline, it does not fully explain the underlying dynamics of the bricks series. A more complex model would probably provide a better fit.

Question 7

For your retail time series (from Exercise 7 in Section 2.10):

Create a training dataset consisting of observations before 2011 using

set.seed(1234)

myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))

myseries_train <- myseries |>
  filter(year(Month) < 2011)

Check that your data have been split appropriately by producing the following plot.

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

The red portion (train data) confirms that the training data includes all observations before 2011. The split looks correct.

Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).

fit <- myseries_train |>
  model(SNAIVE(Turnover))

Check the residuals.

fit |> gg_tsresiduals()

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_rug()`).

Do the residuals appear to be uncorrelated and normally distributed?

Although the residuals are roughly centered near zero, they clearly do not behave like white noise. The residual time plot shows long runs of positive and negative values, indicating dependence over time. This is confirmed by the ACF plot, which shows strong autocorrelation at many lags. The histogram is roughly bell-shaped, but it is not perfectly normal. So, the seasonal naïve model does not adequately capture the structure in the data, and a more complex model would likely produce better forecasts.

Produce forecasts for the test data

fc <- fit |>
  forecast(new_data = anti_join(myseries, myseries_train))

## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`

fc |> autoplot(myseries)

Compare the accuracy of your forecasts against the actual values.

fit |> accuracy()

## # A tibble: 1 × 12
##   State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Tasmania Cafes, … SNAIV… Trai…  1.33  2.90  2.22  6.31  10.7     1     1 0.800

fc |> accuracy(myseries)

## # A tibble: 1 × 12
##   .model    State Industry .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>     <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Tasm… Cafes, … Test   7.12  9.13  7.58  13.2  14.4  3.42  3.15 0.863

The seasonal naïve model has much smaller errors on the training data (RMSE ≈ 2.90, MAE ≈ 2.22) than on the test data (RMSE ≈ 9.13, MAE ≈ 7.58), and the test MASE of about 3.42 means the test errors are more than three times the in-sample seasonal naïve error. The mean error also rises from about 1.33 to 7.12, so the forecasts sit systematically below the actual values in the test period. The model therefore does not generalize well to unseen observations. This is consistent with the residual diagnostics, which show strong autocorrelation across many lags and visible patterns over time rather than random scatter. While the histogram of residuals is roughly bell-shaped, which is loosely consistent with normality, the serial dependence suggests the model is underfitting the data. Overall, although SNAIVE captures the seasonal pattern, it does not account well for the remaining structure in the series.
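As a quick sanity check on how the scaled metrics relate to the unscaled ones, the sketch below recomputes the test MASE and RMSSE from the rounded values printed in the two accuracy tables. It relies on the fact that the scaling factor is the in-sample seasonal naïve error, which is why the training MASE and RMSSE are exactly 1. The variable names are introduced here for illustration.

```r
# Rounded values copied from the accuracy() output above
train_mae  <- 2.22
train_rmse <- 2.90
test_mae   <- 7.58
test_rmse  <- 9.13

# Scaled errors: test error divided by the in-sample seasonal naive error
test_mase  <- test_mae  / train_mae
test_rmsse <- test_rmse / train_rmse

round(c(MASE = test_mase, RMSSE = test_rmsse), 2)
```

The ratios agree with the reported 3.42 and 3.15 up to rounding of the inputs.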