Introduction

The following RMD contains answers to exercises in chapter 5 of the Forecasting: Principles and Practice textbook for CUNY SPS DATA 624 Spring 2025 https://otexts.com/fpp3/toolbox-exercises.html. This chapter focuses on different tools of a forecaster. The 5.11 Exercises answered here include 5.1, 5.2, 5.3, 5.4 and 5.7.

Load Libraries

library(fpp3)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(tidyr)

5.1

Prompt: Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

Australian Population (global_economy)
Bricks (aus_production) NSW Lambs (aus_livestock)
Household wealth (hh_budget)
Australian takeaway food turnover (aus_retail)

Australian Population (global_economy)

# Select the data
aus_pop <- global_economy |>
  filter(Country == "Australia") |>
  select(Year, Population) |>
  as_tsibble(index = Year)

# Plot the data
autoplot(aus_pop, Population) +
  labs(title = "Australian Population Data", y = "Population")

NAIVE(y): Not best. In the above graph, the trend is upward, so the Naive(y) would not be effective, as that model forecasts equal to last observed value.

SNAIVE(y): Not best. The data here is yearly so there are no seasons to forecast on.

RW(y ~ drift()): Best! This model forecasts equal to last value plus average change in the data over time, which accommodates for the upward trend the data shows.

# Fit the model with Drift
aus_pop_fit <- aus_pop |>
  model(
    Drift = RW(Population ~ drift())
  )

# Generate forecasts for 5 years
aus_pop_fc <- aus_pop_fit |> forecast(h = "5 years")

autoplot(aus_pop, Population) +
  autolayer(aus_pop_fc) +
  labs(
    title = "Drift Forecasted Australian Population Data", 
    y = "Population",
    x = "Year"
  )

Bricks (aus_production)

# Select the data
bricks_ts <- aus_production |>
  select(Quarter, Bricks) |> 
  filter(!is.na(Bricks)) |>
  as_tsibble(index = Quarter)


# Plot the data
autoplot(bricks_ts, Bricks) +
  labs(title = "Bricks Data", y = "Bricks")

NAIVE(y): Not best. In the above graph, there is seasonality. This means a linear forecast would not be most effective.

SNAIVE(y): Best! The data here has significant quarterly seasonality, so this is the best simple method to use

RW(y ~ drift()): Not best. This model is not effective for the same reasons that the NAIVE(y) is ineffective - linear forecasting is not what this data needs.

# Fit the model with SNAIVE
bricks_fit <- bricks_ts |>
  model(
    Seasonal_naive = SNAIVE(Bricks ~ lag("year"))
  )

# Generate forecasts for 5 years
bricks_fc <- bricks_fit |> forecast(h = "5 years")

# Plot bricks forecast
autoplot(bricks_ts, Bricks) +
  autolayer(bricks_fc, level = NULL) +
  labs(
    title = "Seasonal Naive Forecasted Bricks Data", 
    y = "Bricks",
    x = "Year and Quarter"
  )

NSW Lambs (aus_livestock)

# Select the data
lamb_ts <- aus_livestock |>
  filter(Animal == "Lambs") |>
  filter(Month >= as.Date("2010-01-01")) |>
  filter(Count > 0) |> # I am assuming the 0 rows are times when the lambs were not observed, rather than 0 were observed
  select(Month, Count) |> 
  as_tsibble(index = Month)

# Filtering out State is killing me
lamb_ct <- lamb_ts |>
  select(Month, Count, -State)

# Select a smaller cut of the data
lamb_ts_18 <- lamb_ts |>
  filter(Month >= as.Date("2018-01-01"))

# Plot around the dumb State that won't filter out
ggplot(lamb_ts, aes(x = Month, y = Count)) +
  geom_line() + 
  labs(
    title = "Lambs Data 2010-2018", 
    x = "Month",
    y = "Count"
  )

# Second plot for the smaller set of data
ggplot(lamb_ts_18, aes(x = Month, y = Count)) +
  geom_line() + 
  labs(
    title = "Lambs Data in 2018", 
    x = "Month",
    y = "Count"
  )

NAIVE(y): Not best. In the above graph, there is seasonality. This means a linear forecast would not be most effective.

SNAIVE(y): Best! Looking at the Lambs Data in 2018 graph, we can see the data has seasonality within each year, so this is the best simple method to use. The seasonality seems to happen month-over-month.

RW(y ~ drift()): Not best. This model is not effective for the same reasons that the NAIVE(y) is ineffective - linear forecasting is not what this data needs.

# Fit the model with SNAIVE
lamb_fit <- lamb_ts |>
  model(
    Seasonal_naive = SNAIVE(Count ~ lag("12 months"))
  )

# Generate forecasts for 5 years
lamb_fc <- lamb_fit |> forecast(h = "3 years")

# Plot the data
ggplot() +
  # original lamb_ts data
  geom_line(data = lamb_ts, aes(x = Month, y = Count)) +
  # forecasted lamb_fc data
  geom_line(data = lamb_fc, aes(x = Month, y = .mean), color = "blue") + 
  labs(
    title = "Seasonal Naive Forecasted Lambs Data", 
    x = "Month", 
    y = "Count"
  ) +
  theme_minimal()

Household wealth (hh_budget)

# Select the data
hh_wealth <- hh_budget |>
  select(Year, Wealth) |>
  as_tsibble(index = Year)

# Plot the data
autoplot(hh_wealth, Wealth) +
  labs(
    title = "Countries' Wealth Data", 
    y = "Wealth (% net disposable income",
    x = "Year"
    )

NAIVE(y): Not best. In the above graph, there are upward trends for all countries, so the Naive(y) would not be super effective, as that model forecasts equal to last observed value. Now, there could definitely be an argument made that this model is the best, as the countries seem to experience cycles, so assuming the data will increase in the future may not be most accurate.

SNAIVE(y): Not best. The data here is yearly so there are no seasons to forecast on.

RW(y ~ drift()): Best! This model forecasts equal to last value plus average change in the data over time, which accommodates for the upward trend each country shows. It is not perfect, as it looks to overshoot the expected increase in Wealth over time.

# Fit the model with Drift
hh_wealth_fit <- hh_wealth |>
  model(
    Drift = RW(Wealth ~ drift())
  )

# Generate forecasts for 10 years
hh_wealth_fc <- hh_wealth_fit |> forecast(h = "10 years")

autoplot(hh_wealth, Wealth) +
  autolayer(hh_wealth_fc, level = NULL) +
  labs(
    title = "Drift Forecasted Wealth Data for Countries", 
    y = "Wealth",
    x = "Year"
  )

Australian takeaway food turnover (aus_retail)

# Select the data
aus_tt <- aus_retail |>
  filter(Industry == "Takeaway food services") |>
  select(Month, Turnover) |>
  index_by(Month) |>
  summarise(Average_Turnover = mean(Turnover, na.rm = TRUE)) |>
  as_tsibble(index = Month)

# Plot the data
autoplot(aus_tt, Average_Turnover) +
  labs(
    title = "Average Turnover for Takeaway Food", 
    y = "Wealth (% net disposable income",
    x = "Year"
    )

NAIVE(y): Not best. In the above graph, there are upward trends and seasonality, so a horizontal line will not be most effective.

SNAIVE(y): Good. The data here is monthly and has fluctuations. This makes SNAIVE(y) a good choice. There is also an upward trend, which will not be accounted for here. I will graph both Seasonal Naive and Drift.

RW(y ~ drift()): Good. This model forecasts equal to last value plus average change in the data over time, which accommodates for the upward trend. This will take the trend into account, but leave out seasonality. I will graph both Seasonal Naive and Drift.

# Fit the model with Drift
aus_tt_fit <- aus_tt |>
  model(
    Drift = RW(Average_Turnover ~ drift())
  )

# Generate forecasts for 5 years
aus_tt_fc <- aus_tt_fit |> forecast(h = "5 years")

autoplot(aus_tt, Average_Turnover) +
  autolayer(aus_tt_fc, level = NULL) +
  labs(
    title = "Drift Forecasted Turnover for Takeaway", 
    y = "Average Turnover",
    x = "Month"
  )

# Fit the model with SNAIVE
aus_tt_fit_sn <- aus_tt |>
  model(
    Seasonal_naive = SNAIVE(Average_Turnover ~ lag("12 months"))
  )

# Generate forecasts for 5 years
aus_tt_fc_sn <- aus_tt_fit_sn |> forecast(h = "5 years")

# Plot bricks forecast
autoplot(aus_tt, Average_Turnover) +
  autolayer(aus_tt_fc_sn, level = NULL) +
  labs(
    title = "Seasonal Naive Forecasted Turnover for Takeaway", 
    y = "Average Turnover",
    x = "Month and Year"
  )

After visualizing both models, the Seasonal Naive model does a much more accurate job of forecasting.

5.2

Prompt: Use the Facebook stock price (data set gafa_stock) to do the following:

Produce a time plot of the series.
Produce forecasts using the drift method and plot them.
Show that the forecasts are identical to extending the line drawn between the first and last observations.
Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

Produce a time plot of the series.

# Filter and tidy all relevant parts of the data
fb_stock <- gafa_stock |>
  filter(Symbol == "FB") |>
  select(Date, Open, High, Low, Adj_Close) |>
  pivot_longer(cols = c(Open, High, Low, Adj_Close),
               names_to = "Timing",
               values_to = "Price")

# Plot the data
autoplot(fb_stock) +
  labs(
    title = "Facebook Stock", 
    y = "Stock Price",
    x = "Year"
  )

The data for different timings is similar. Going forward, we will just take a look at the Adj_Close stat.

Produce forecasts using the drift method and plot them.

# Filter relevant parts of the data and make the Date column is in Date format
fb_close <- gafa_stock |>
  filter(Symbol == "FB") |>
  select(Date, Adj_Close) |>
  mutate(Date = ymd(Date)) |> 
  as_tsibble(index = Date)

# Fill missing dates
fb_close <- fb_close |>
  fill_gaps()

# Fit the model with Drift
fb_close_fit <- fb_close |>
  model(
    Drift = RW(Adj_Close ~ drift())
  )

# Generate forecasts for 3 years
fb_close_fc <- fb_close_fit |> forecast(h = "3 years")

autoplot(fb_close, Adj_Close) +
  autolayer(fb_close_fc, level = NULL) +
  labs(
    title = "Drift Forecasted Facebook Closing Stock Prices", 
    y = "Closing Stock Prices",
    x = "Year"
  )

The data in the facebook stock prices is irregularly separated. I first tried to take the average adjusted close value within a month and drop the date value, so the data would be regularly separated monthly. That was a challenge, so I pivoted to using fill_gaps() to make the day-to-day values more regular.

Show that the forecasts are identical to extending the line drawn between the first and last observations.

# Get the first and last valid observations
first_obs <- fb_close[1, ]
last_obs <- fb_close[nrow(fb_close), ]

# Calculate the slope (m) and intercept (b) for the line y = mx + b
slope <- (last_obs$Adj_Close - first_obs$Adj_Close) / as.numeric(last_obs$Date - first_obs$Date)
intercept <- first_obs$Adj_Close

# Create a data frame for the linear line
line_data <- data.frame(
  Date = seq(from = first_obs$Date, to = last_obs$Date, by = "days"),
  Adj_Close = intercept + slope * as.numeric(seq(from = first_obs$Date, to = last_obs$Date, by = "days") - as.numeric(first_obs$Date))
)

# Plot the data with the linear line
ggplot() +
  geom_line(data = fb_close, aes(x = Date, y = Adj_Close)) + 
  geom_line(data = line_data, aes(x = Date, y = Adj_Close), color = "red", linetype = "dashed") +  
  autolayer(fb_close_fc, level = NULL) +
  labs(
    title = "Linear Forecast of Facebook Closing Stock Prices",
    y = "Closing Stock Price",
    x = "Date"
  ) +
  theme_minimal()

The linear line extends from the first observation to the last observation. For some reason, the visualization appears to have the forecasted line as a higher slope. This may be due to the filled in dates.

Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

Seasonal Naive - 12 months

# Fit the model with SNAIVE
fb_close_fit_sn <- fb_close |>
  model(
    Seasonal_naive = SNAIVE(Adj_Close ~ lag("12 months"))
  )

# Generate forecasts for 3 years
fb_close_fc_sn <- fb_close_fit_sn |> forecast(h = "3 years")

# Plot bricks forecast
autoplot(fb_close, Adj_Close) +
  autolayer(fb_close_fc_sn, level = NULL) +
  labs(
    title = "Seasonal Naive Forecast of Facebook Closing Stock Prices",
    y = "Closing Stock Price",
    x = "Date"
  )

The above graph takes the yearly seasonality seriously. This future forecast does not look like it matches the trend of the historical data closely.

Seasonal Naive - 1 month

# Fit the model with SNAIVE
fb_close_fit_sn <- fb_close |>
  model(
    Seasonal_naive = SNAIVE(Adj_Close ~ lag("1 months"))
  )

# Generate forecasts for 3 years
fb_close_fc_sn <- fb_close_fit_sn |> forecast(h = "3 years")

# Plot bricks forecast
autoplot(fb_close, Adj_Close) +
  autolayer(fb_close_fc_sn, level = NULL) +
  labs(
    title = "Seasonal Naive Forecast of Facebook Closing Stock Prices",
    y = "Closing Stock Price",
    x = "Date"
  )

The above graph takes the seasonality of the past month seriously. This future forecast does not look like it matches the trend of the historical data closely.

Mean

# Fit the model with SNAIVE
fb_close_fit_mn <- fb_close |>
  model(
    Mean = MEAN(Adj_Close),
  )

# Generate forecasts for 3 years
fb_close_fc_mn <- fb_close_fit_mn |> forecast(h = "3 years")

# Plot bricks forecast
autoplot(fb_close, Adj_Close) +
  autolayer(fb_close_fc_mn, level = NULL) +
  labs(
    title = "Mean Forecast of Facebook Closing Stock Prices",
    y = "Closing Stock Price",
    x = "Date"
  )

The textbook mentions that the mean is a good estimate for stock prices. In this case, it is possible for the stock to trend up or down in the future, so to minimize incorrect guessing, the most updated price is the best guess at future prices.

5.3

Prompt: Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

# Extract data of interest
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals()

# Look a some forecasts
fit |> forecast() |> autoplot(recent_production)

# Extract data of interest
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals()

# Look a some forecasts
fit |> forecast() |> autoplot(recent_production)

What do you conclude?

Answer: It looks like the residuals have some skewness. The mean of the Innovation Residuals graph does not look like it is zero, indicated skewness. The lag plot shows the lag of 4 extends outside of the bounds majorly. The histogram should follow a normal distribution if the residuals were random, which it does not.

The above all leads to the conclusion that the residuals are not simply white noise, the Seasonal Naive model hasn’t fully captured the data’s patterns, and the residuals still contain useful information.

5.4

Prompt: Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

Australian Exports from global_economy

# Extract data of interest
aus_exp <- global_economy |>
  filter(Country == "Australia") |>
  select(Year, Exports) |> 
  drop_na() |>
  as_tsibble(index = Year)

# Define and estimate a model
aus_exp_fit <- aus_exp |> model(NAIVE(Exports))

# Look at the residuals
aus_exp_fit |> gg_tsresiduals()

# Generate forecast
aus_exp_fc <- aus_exp_fit |> forecast(h = "5 years")

# Plot naive forecast
autoplot(aus_exp, Exports) + 
  autolayer(aus_exp_fc, series = "Forecast", level = NULL) +
  labs(title = "Naive Forecast for Australian Exports", y = "Exports", x = "Year")

Bricks from aus_production

# Select the data
bricks_ts <- aus_production |>
  select(Quarter, Bricks) |> 
  filter(!is.na(Bricks)) |>
  as_tsibble(index = Quarter)

# Fit the model with SNAIVE
bricks_fit <- bricks_ts |>
  model(
    Seasonal_naive = SNAIVE(Bricks ~ lag("year"))
  )

# Generate forecasts for 5 years
bricks_fc <- bricks_fit |> forecast(h = "5 years")

# Plot bricks forecast
autoplot(bricks_ts, Bricks) +
  autolayer(bricks_fc, level = NULL) +
  labs(
    title = "Seasonal Naive Forecasted Bricks Data", 
    y = "Bricks",
    x = "Year and Quarter"
  )

5.7

Prompt: For your retail time series (from Exercise 7 in Section 2.10):

Create a training dataset consisting of observations before 2011 using

set.seed(64)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

head(myseries)

## # A tsibble: 6 x 5 [1M]
## # Key:       State, Industry [1]
##   State    Industry                         `Series ID`    Month Turnover
##   <chr>    <chr>                            <chr>          <mth>    <dbl>
## 1 Victoria Other specialised food retailing A3349799R   1982 Apr     34.9
## 2 Victoria Other specialised food retailing A3349799R   1982 May     34.6
## 3 Victoria Other specialised food retailing A3349799R   1982 Jun     34.6
## 4 Victoria Other specialised food retailing A3349799R   1982 Jul     35.2
## 5 Victoria Other specialised food retailing A3349799R   1982 Aug     33.8
## 6 Victoria Other specialised food retailing A3349799R   1982 Sep     35.4

myseries_train <- myseries |>
  filter(year(Month) < 2011)

Check that your data have been split appropriately by producing the following plot.

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).

fit <- myseries_train |>
  model(SNAIVE(Turnover))

Check the residuals.

fit |> gg_tsresiduals()

Do the residuals appear to be uncorrelated and normally distributed?

Answer: The residuals look to be normally distributed with right skewness. The lag plot includes the first 8 values extend outside the boundary. This means the residuals are correlated.

Produce forecasts for the test data

fc <- fit |>
  forecast(new_data = anti_join(myseries, myseries_train))
fc |> autoplot(myseries)

Compare the accuracy of your forecasts against the actual values.

fit |> accuracy()

## # A tibble: 1 × 12
##   State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Victoria Other s… SNAIV… Trai…  4.84  11.6  7.85  5.02  7.76     1     1 0.751

fc |> accuracy(myseries)

## # A tibble: 1 × 12
##   .model    State Industry .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>     <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Vict… Other s… Test   12.4  29.0  24.8  5.67  13.6  3.16  2.50 0.795

How sensitive are the accuracy measures to the amount of training data used?

Answer: The accuracy measures change depending on the amount of training data used. Too little data can lead to high variance, while too much data may introduce bias if old trends are no longer relevant. Metrics like RMSE and MAE generally improve with more data, but only if the added data reflects current patterns. Residual autocorrelation (ACF1) should decrease with more training data, but if it remains high, the model may be missing key structures. More training data isn’t always better, finding the right balance is key to improving forecast accuracy.

The high ACF1 value of 0.795 suggests strong correlation in residuals, indicating the model may not fully capture trends. This is important because a high ACF1 means the residuals are not behaving like white noise, implying that the model is missing important patterns.

Homework03

Kaylie Evans

2025-02-23