Part I (70 points). End-to-End Forecasting Analysis for Quarterly Beer Sales

R code to load ausbeer (fpp3)

  Consider the quarterly Australian beer production series ausbeer constructed from aus_production
  in the fpp3 package:
# Load packages and data
library(fpp3)
library(tsibble)
library(tsibbledata)

ausbeer <- aus_production %>%
  select(Quarter, Beer) %>%
  filter(!is.na(Beer))


print(ausbeer)
## # A tsibble: 218 x 2 [1Q]
##    Quarter  Beer
##      <qtr> <dbl>
##  1 1956 Q1   284
##  2 1956 Q2   213
##  3 1956 Q3   227
##  4 1956 Q4   308
##  5 1957 Q1   262
##  6 1957 Q2   228
##  7 1957 Q3   236
##  8 1957 Q4   320
##  9 1958 Q1   272
## 10 1958 Q2   233
## # ℹ 208 more rows
ausbeer %>%
  autoplot(Beer) +
  labs(
    title = "Quarterly Australian Beer Production",
    x = "Quarter",
    y = "Megalitres"
  ) +
  theme_minimal()

(a) Preliminary exploration. Produce a time series plot of yt over the available sample. Describe qualitatively:

Trend
Beer production showed a long-run upward trend during the earlier decades up to the late 1970s/early 1980s. In more recent years the trend appears to flatten or decline slightly. 

Seasonality 
A distinct seasonal pattern can be observed each year. Specifically, the highest output is in Q4, while Q2 typically sees the lowest production levels. 
The series has very strong quarterly seasonality, and the size of the seasonal swings stays about the same over time, so the seasonality is additive, not multiplicative.

Irregularities 
There appear to be no major outliers.
The variance is roughly stable over the sample, with at most a slight increase.
The only hint of a structural break is the change in trend around the 1980s, when the long-run growth levels off and then turns down slightly.

Based on the plot, reasonable model classes are ETS, ARIMA/SARIMA, and possibly dynamic regression.

ETS: The level changes over time (rising for several decades, then easing), and there is strong, consistent additive seasonality with stable amplitude. ETS(A,A,A) or ETS(A,N,A) would therefore be appropriate, since they can capture changes in level together with additive seasonality.

ARIMA/SARIMA: The series is non-stationary, so differencing is needed. There is a clear seasonal pattern that repeats every four quarters, so a SARIMA model with seasonal period 4 is appropriate. The variance is stable, so no variance-stabilizing transformation is needed. SARIMA models can capture both the trend and the seasonal patterns shown in the plot.

Dynamic Regression: Using dynamic regression makes sense if we have relevant external data, such as economic indicators or prices. Since no such variables are available from the plot alone, ETS or SARIMA are the better starting points, although a purely deterministic version (trend plus Fourier terms with ARIMA errors) is sketched below.
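For completeness, a sketch of such a specification in fable is shown below: a deterministic trend plus Fourier terms (K = 2 is the maximum number of Fourier pairs for quarterly data) with ARIMA errors. This is illustrative only and is not part of the model comparison that follows.

# Sketch only: regression with trend and Fourier terms plus ARIMA errors.
# In the actual comparison this would be fitted to the training window defined in part (c).
fit_dynreg <- ausbeer %>%
  model(DynReg = ARIMA(Beer ~ trend() + fourier(K = 2) + PDQ(0, 0, 0)))
report(fit_dynreg)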

(b) Autocorrelation and unit root.

  1. Compute and plot the ACF (and PACF, if helpful) of the original series (or a simple transformation such as log yt, if you choose to transform). Interpret the behavior at small lags and at the seasonal lags (multiples of 4). What do these patterns suggest about trend and quarterly seasonality?
library(fpp3)

ausbeer <- aus_production %>%
  select(Quarter, Beer) %>%
  filter(!is.na(Beer))

ausbeer %>% ACF(Beer) %>% autoplot() +
  labs(title = "ACF of Quarterly Beer Production")

ausbeer %>% PACF(Beer) %>% autoplot() +
  labs(title = "PACF of Quarterly Beer Production")

At short non-seasonal lags (1, 2, 3, …), the ACF dies out very slowly, which suggests the series is nonstationary and contains a long-term trend.

At the seasonal lags (4, 8, 12, …), the ACF shows large spikes, confirming strong quarterly seasonality. The PACF shows significant spikes at lag 1 and again at lag 4, suggesting a mix of short-term and seasonal dependence.
  2. Perform a unit root test (e.g., ADF) on the series you plan to model (original or transformed). Clearly state H0 and H1, report the test statistic and p-value, and conclude whether non-seasonal differencing appears necessary.
# KPSS unit root test
ausbeer %>%
  features(Beer, unitroot_kpss)
## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1      2.20        0.01
# How many non-seasonal differences are suggested?
ausbeer %>%
  features(Beer, unitroot_ndiffs)
## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      1

Note that the code above uses the KPSS test rather than the ADF test, so the hypotheses are the reverse of the usual unit root test. H0: the series is level-stationary. H1: the series is nonstationary (contains a unit root or trend).

The KPSS test produced a statistic of 2.20 with a reported p-value of 0.01 (the p-value is bounded below at 0.01 in this table-based implementation), so we reject the null hypothesis of stationarity. The series is therefore nonstationary, and non-seasonal differencing appears necessary; this agrees with unitroot_ndiffs suggesting one non-seasonal difference.
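Since the prompt mentions the ADF test specifically, a complementary check could be run as sketched below. This assumes the tseries package is installed (it is not used elsewhere in this document); note that the ADF null is the opposite of the KPSS null (H0: unit root, H1: stationary), so failing to reject here would agree with the KPSS conclusion that differencing is needed.

# Sketch: ADF test as a cross-check (requires the tseries package)
library(tseries)
adf.test(ausbeer$Beer)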

(c) Training and test periods (h = 8). You will use the last h = 8 quarters as a holdout test set.

  1. Explicitly define the training window and the test window using calendar dates (e.g., Training: . . . ; Test: . . . ).
train <- ausbeer %>% filter(Quarter <= yearquarter("2008 Q2"))
test  <- ausbeer %>% filter(Quarter >  yearquarter("2008 Q2"))

train
## # A tsibble: 210 x 2 [1Q]
##    Quarter  Beer
##      <qtr> <dbl>
##  1 1956 Q1   284
##  2 1956 Q2   213
##  3 1956 Q3   227
##  4 1956 Q4   308
##  5 1957 Q1   262
##  6 1957 Q2   228
##  7 1957 Q3   236
##  8 1957 Q4   320
##  9 1958 Q1   272
## 10 1958 Q2   233
## # ℹ 200 more rows
test 
## # A tsibble: 8 x 2 [1Q]
##   Quarter  Beer
##     <qtr> <dbl>
## 1 2008 Q3   410
## 2 2008 Q4   488
## 3 2009 Q1   415
## 4 2009 Q2   398
## 5 2009 Q3   419
## 6 2009 Q4   488
## 7 2010 Q1   414
## 8 2010 Q2   374
Training window: 1956 Q1 to 2008 Q2 (210 quarters).
Test window: 2008 Q3 to 2010 Q2 (8 quarters).
  2. Briefly justify why a fixed test window of h = 8 is appropriate here. What kind of forecasting task (e.g., medium-term planning vs. very short-term) does this horizon correspond to in an applied context?
The fixed test set spans two full years, i.e. two complete seasonal cycles, which is enough to check whether the quarterly pattern is forecast correctly. An h = 8 horizon corresponds to medium-term planning, such as production scheduling and capacity decisions, rather than very short-term operational forecasting.

(d) Competing forecasting models (4 methods). Using the training data only, fit at least four distinct forecasting models, including:

• an ETS model with a seasonal component (e.g., ETS(A,A,A) or ETS(A,A,M)), • a non-seasonal ARIMA model (if appropriate after differencing), • a seasonal ARIMA (SARIMA) model with quarterly seasonality (proper trend and seasonal terms), • one additional model of your choice (e.g., dynamic regression with trend and Fourier terms plus ARIMA errors, or a seasonal naıve benchmark).

For each model, clearly write down its structure (e.g., ETS(A,A,A), ARIMA(p, d, q),SARIMA(p, d, q) × (P, D, Q)4, or regression + ARIMA errors) and briefly justify your specification choices using the preliminary analyses from parts (a)–(b).

ETS model with a seasonal component: ETS(A,A,A)

The time plot shows a clear trend and consistent seasonal patterns each quarter, with the seasonal changes being roughly the same over time.

fit_ets <- train %>%
  model(ETS = ETS(Beer ~ error("A") + trend("A") + season("A")))
fit_ets
## # A mable: 1 x 1
##            ETS
##        <model>
## 1 <ETS(A,A,A)>

Non-seasonal ARIMA: ARIMA(1,1,1) (estimated with drift)

Unit root tests and a slowly decaying ACF suggest non-stationarity, requiring d = 1. The non-seasonal ARIMA model excludes seasonal terms for later comparison with a fully seasonal model.

fit_arima_ns <- train %>%
  model(ARIMA_NS = ARIMA(Beer ~ pdq(1,1,1) + PDQ(0,0,0)))
fit_arima_ns
## # A mable: 1 x 1
##                  ARIMA_NS
##                   <model>
## 1 <ARIMA(1,1,1) w/ drift>

Seasonal ARIMA (SARIMA): SARIMA(p,1,q) × (P,1,Q)[4]

The ACF shows strong spikes at the seasonal lags on top of a slowly decaying non-seasonal pattern, so both a regular and a seasonal difference are applied, with MA terms at the first non-seasonal and seasonal lags (an airline-style specification).

fit_sarima <- train %>%
  model(SARIMA = ARIMA(Beer ~ pdq(0,1,1) + PDQ(0,1,1)))
fit_sarima
## # A mable: 1 x 1
##                     SARIMA
##                    <model>
## 1 <ARIMA(0,1,1)(0,1,1)[4]>

Additional model: Seasonal Naïve Benchmark

The seasonal naïve method is a standard benchmark for quarterly seasonal data: each forecast simply repeats the value from the same quarter of the previous year, so it captures the seasonality with no estimated parameters.

fit_snaive <- train %>%
  model(SNaive = SNAIVE(Beer ~ lag("year")))
fit_snaive
## # A mable: 1 x 1
##     SNaive
##    <model>
## 1 <SNAIVE>

(e) Model comparison via information criteria (training). For each of your four models, compute AIC, AICc, and BIC using the training data.

  1. Tabulate the three criteria for all models.
library(fpp3)

ausbeer <- aus_production %>%
  select(Quarter, Beer) %>%
  filter(!is.na(Beer))

train <- ausbeer %>%
  filter(Quarter <= yearquarter("2008 Q2"))

# All models on training data

fits <- train %>%
  model(
    ETS      = ETS(Beer ~ error("A") + trend("A") + season("A")),
    ARIMA_NS = ARIMA(Beer ~ pdq(d = 1) + PDQ(0,0,0)),    # non-seasonal ARIMA
    SARIMA   = ARIMA(Beer ~ pdq(0,1,1) + PDQ(0,1,1))     # SARIMA with quarterly seasonality
  )
# (The seasonal naive benchmark is not included here because glance() does not
# return likelihood-based information criteria for SNAIVE.)


info_tbl <- glance(fits) %>%
  select(.model, AIC, AICc, BIC)

info_tbl
## # A tibble: 3 × 4
##   .model     AIC  AICc   BIC
##   <chr>    <dbl> <dbl> <dbl>
## 1 ETS      2304. 2305. 2335.
## 2 ARIMA_NS 1824. 1824. 1844.
## 3 SARIMA   1735. 1735. 1745.
  2. Identify which model each criterion prefers and comment on any differences in ranking. Explain in a sentence or two how AIC, AICc, and BIC differ in how strongly they penalize model complexity.
All three criteria prefer the SARIMA model, so there is no disagreement in the ranking here: SARIMA has the lowest AIC, AICc, and BIC, followed by the non-seasonal ARIMA and then ETS.
AIC applies the lightest penalty (two per estimated parameter), so it is the most willing to choose slightly more complex models if they fit better.
AICc is a small-sample correction of AIC that penalizes extra parameters somewhat more heavily when the sample is short relative to the number of parameters.
BIC has the strictest penalty, which grows with the logarithm of the sample size, making it favor simpler models more than AIC or AICc. (A small numerical illustration of these penalties is sketched at the end of this part.)
  3. Select a preferred model based on these criteria (and any additional considerations you judge important) and clearly state your choice.
The SARIMA model is the preferred choice for forecasting quarterly Australian beer production: it has the lowest AIC, AICc, and BIC, and it captures both the trend (through differencing) and the strong quarterly seasonality, whereas ETS(A,A,A) and the non-seasonal ARIMA did not fit as well. One caveat is that information criteria are only strictly comparable within a model class (and, for ARIMA models, across the same orders of differencing), so the ETS-versus-ARIMA comparison should also be judged by the out-of-sample results in part (g). On balance, we select the SARIMA model for our forecasts.
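As a small numerical illustration of how the three penalties differ (not part of the required output): loglik, k, and n below are placeholder values, with n = 210 chosen to match the training sample size.

# Illustration only: how AIC, AICc, and BIC penalize k parameters with n observations
ic_values <- function(loglik, k, n) {
  aic  <- -2 * loglik + 2 * k
  aicc <- aic + (2 * k * (k + 1)) / (n - k - 1)
  bic  <- -2 * loglik + k * log(n)
  c(AIC = aic, AICc = aicc, BIC = bic)
}

# Two hypothetical models with identical fit but different complexity
ic_values(loglik = -860, k = 3, n = 210)
ic_values(loglik = -860, k = 8, n = 210)

With the same log-likelihood, the extra five parameters cost 10 additional points under AIC but about 5 × log(210) ≈ 26.7 additional points under BIC, which is why BIC leans hardest toward parsimony.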

(f) Residual diagnostics for the selected model. Focusing on your final chosen model:

  1. Plot the residuals over time and the residual ACF. Comment on any remaining autocorrelation, non-constant variance, or obvious model misspecification.

    The residuals hover around zero without any significant long-term trends, indicating that the SARIMA model has successfully addressed both trend and seasonality. The variance remains fairly consistent throughout the sample, with no evident signs of varying dispersion. While there are a couple of notable spikes in the residuals during the mid-1970s and mid-1980s, these are not recurrent and do not suggest any fundamental problems with the model.

sarima_fit <- train %>%
  model(
    SARIMA = ARIMA(Beer ~ pdq(0,1,1) + PDQ(0,1,1))
  )

# residuals
sarima_resid <- residuals(sarima_fit)

# Plot residuals 
sarima_resid %>%
  autoplot(.resid) +
  labs(
    title = "Residuals over time: SARIMA(0,1,1) × (0,1,1)[4]",
    x = "Quarter",
    y = "Residual"
  )

sarima_resid %>%
  ACF(.resid) %>%
  autoplot() +
  labs(
    title = "Residual ACF: SARIMA(0,1,1) × (0,1,1)[4]",
    x = "Lag",
    y = "ACF"
  )

  2. Conduct a Ljung–Box test on the residuals. State H0 and H1, report the p-value, and interpret the result in terms of whether the residuals are consistent with white noise.

If the residual Ljung–Box test rejects the null hypothesis of white noise for your preferred model, go back to part (d), adjust your model choices, and repeat the comparison until you obtain a model whose residuals are consistent with white noise.

H0: Residuals are white noise (no remaining autocorrelation up to the tested lag).
H1: Residuals are not white noise (there is remaining autocorrelation).

With lb_stat = 20.8 and a p-value of about 0.002 (see the output below), we reject the null hypothesis at the 5% level: there is evidence of remaining autocorrelation, so the residuals are not fully consistent with white noise. Following the instruction above, this points back to part (d) to consider a richer specification (a sketch of an automatic re-specification is given after the output); we nevertheless retain the SARIMA(0,1,1)(0,1,1)[4] model for the remaining parts and note this as a limitation.
lb_test <- sarima_resid %>%
  features(.resid, ljung_box, lag = 8, dof = 2)  

lb_test
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 SARIMA    20.8   0.00203
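A possible refinement, sketched below rather than run as part of the output above, is to let ARIMA() search the seasonal and non-seasonal orders automatically on the training data and then re-check the residuals; dof should be set to the number of estimated coefficients in whatever model is selected.

# Sketch: automatic order selection, then re-test the innovation residuals
refit <- train %>%
  model(SARIMA_auto = ARIMA(Beer))
report(refit)
augment(refit) %>%
  features(.innov, ljung_box, lag = 8, dof = 2)  # adjust dof to the selected model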

(g) Forecast accuracy on the test set. Using only the models estimated on the training data, generate h = 8-step-ahead forecasts for the test period.

ausbeer <- aus_production %>%
  select(Quarter, Beer) %>%
  filter(!is.na(Beer))

train <- ausbeer %>%
  filter(Quarter <= yearquarter("2008 Q2"))

test <- ausbeer %>%
  filter(Quarter > yearquarter("2008 Q2"))

# Fit all models 
fits <- train %>%
  model(
    ETS      = ETS(Beer ~ error("A") + trend("A") + season("A")),
    ARIMA_NS = ARIMA(Beer ~ pdq(d = 1) + PDQ(0,0,0)),         # non-seasonal ARIMA
    SARIMA   = ARIMA(Beer ~ pdq(0,1,1) + PDQ(0,1,1)),         # SARIMA(0,1,1)x(0,1,1)[4]
    SNaive   = SNAIVE(Beer ~ lag("year"))                     # seasonal naive
  )

# 8-step-ahead forecasts
h <- 8
fc <- fits %>%
  forecast(h = h)

# Accuracy on the test set
# (MASE and RMSSE come out as NaN below because only the test data are supplied;
# passing the full series, e.g. accuracy(fc, ausbeer), would provide the training
# data needed to scale those measures.)
acc <- fc %>%
  accuracy(test)

acc
## # A tibble: 4 × 10
##   .model   .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA_NS Test   9.32  15.9  13.8  2.01  3.16   NaN   NaN 0.132
## 2 ETS      Test   7.89  13.2  11.6  1.79  2.73   NaN   NaN 0.202
## 3 SARIMA   Test   7.57  12.9  11.5  1.70  2.70   NaN   NaN 0.198
## 4 SNaive   Test   6.5   14.6  13.2  1.39  3.12   NaN   NaN 0.237
  1. For each model, compute at least two error metrics on the test set, such as RMSE and MAE: Summarize your results in a small table.

    The SARIMA model has the lowest RMSE (12.9) and MAE (11.5) on the test set and therefore the best out-of-sample accuracy of the four models; ETS is a close second, while the non-seasonal ARIMA and the seasonal naïve benchmark perform worst on these metrics. (A compact RMSE/MAE summary table is sketched after this list.)

  2. Compare the models’ test-set performance and identify which one performs best outof-sample. Does this agree with the information-criterion ranking from part (e)? Discuss briefly.

The out-of-sample ranking agrees with the information-criterion results from part (e): SARIMA had the lowest AIC, AICc, and BIC, and it also achieves the best test-set accuracy. Both the in-sample fit and the out-of-sample forecasting performance therefore point to SARIMA as the preferred model.
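For reference, one way to pull the requested small RMSE/MAE table out of the accuracy output above (a sketch; other metrics could be added in the same way):

# Small summary table: RMSE and MAE on the test set, best model first
acc %>%
  select(.model, RMSE, MAE) %>%
  arrange(RMSE)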

(h) Forecast plot with 80% prediction intervals. For your final chosen model:

  1. Produce a time plot showing: • the training data, • the test data (actuals), • the h = 8 forecasts, • 80% forecast intervals. Make sure the training and forecast periods are clearly labeled.
library(fpp3)

# SARIMA model on the training data
sarima_fit <- train %>%
  model(
    SARIMA = ARIMA(Beer ~ pdq(0,1,1) + PDQ(0,1,1))
  )

# 8-step-ahead forecasts with 80% intervals
h <- 8
sarima_fc <- sarima_fit %>%
  forecast(h = h, level = 80)

# Plot
ausbeer %>%
  filter(Quarter >= yearquarter("2000 Q1")) %>%  # zoom in to recent years
  autoplot(Beer) +
  autolayer(
    sarima_fc,
    level = 80
  ) +
  labs(
    title = "SARIMA(0,1,1) × (0,1,1)[4] forecasts with 80% intervals",
    x = "Quarter",
    y = "Beer production (megalitres)"
  ) +
  guides(colour = guide_legend(title = "Series")) +
  annotate(
    "text",
    x = yearquarter("2007 Q1"),
    y = max(ausbeer$Beer, na.rm = TRUE),
    label = "Training period",
    hjust = 0,
    size = 3
  ) +
  annotate(
    "text",
    x = yearquarter("2009 Q1"),
    y = max(ausbeer$Beer, na.rm = TRUE),
    label = "Test (actuals) & 8-step forecasts",
    hjust = 0,
    size = 3
  )

  2. Comment on whether the 80% intervals appear empirically reasonable: do most of the test observations fall within them, and is the interval width plausible given the series' variability and seasonal behavior?
The 80% prediction intervals generated by the SARIMA model seem empirically sound. During the 8-quarter testing period, the majority of actual observations fall within the 80% bands, with at most one point being near or slightly beyond the interval. This aligns with the expected nominal 80% coverage over just eight forecasts. Additionally, the width of the intervals appears reasonable.
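A quick empirical check of the 80% coverage could be done as sketched below (not part of the required output); unpack_hilo() and the resulting 80%_lower / 80%_upper column names are assumptions about fabletools' hilo representation.

# Sketch: what fraction of the 8 test quarters fall inside the 80% intervals?
sarima_fc %>%
  hilo(level = 80) %>%
  unpack_hilo(`80%`) %>%
  as_tibble() %>%
  left_join(as_tibble(test) %>% rename(actual = Beer), by = "Quarter") %>%
  summarise(coverage_80 = mean(actual >= `80%_lower` & actual <= `80%_upper`))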

(i) Interpretation and practical implications. Suppose this series represents quarterly beer sales for a producer making production and inventory decisions.

  1. Interpret the main features of your selected model (trend, seasonality, and any autoregressive/moving-average components) in substantive, non-technical terms.

    Trend: sales grew steadily over the earlier decades and have been roughly flat in recent years, indicating a mature market rather than one that is changing rapidly. Seasonality: there is a consistent pattern that repeats every year, with some quarters (notably the fourth quarter, around the holidays) seeing higher demand and others (the second quarter) seeing lower sales. Short-term fluctuations: unexpected spikes or drops in demand do occur and carry over into the next quarter and the same quarter a year later, but these effects are temporary and the series returns to its usual pattern.

  2. Discuss at least two practical implications of your forecasts: for example, how the forecasted seasonal peaks and troughs might inform production scheduling, inventory planning, or capacity decisions.

Production scheduling around seasonal peaks and troughs:
Because the data reveal a clear and predictable seasonal pattern, the producer can plan ahead by increasing production and building inventory before high‑demand quarters. Conversely, they can scale back production ahead of forecasted low‑demand quarters, which helps avoid excess stock, storage costs, and capital being unnecessarily tied up in inventory.

Capacity and scheduling:
Management can make temporary capacity adjustments, like hiring seasonal staff or securing extra storage, based on the model’s prediction of recurring high-demand quarters and short-lived extreme shocks, rather than making permanent expansions.
  3. Briefly note one limitation of your analysis (e.g., structural breaks, omitted covariates, model uncertainty) and how you might address it in a more advanced or real-world setting.
One important limitation is that the model is univariate and assumes that past seasonal patterns and trends will persist unchanged. In a more advanced or real-world setting, we would expand the analysis to include relevant covariates such as prices, promotions, and economic indicators, and explicitly check for structural change, for example by using regression with ARIMA errors or other multivariate time-series methods, to make the results more reliable and easier to interpret.

1. Complex time series behavior.

Describe two concrete ways in which real-world time series can be more complicated than the assumptions of a simple linear trend + seasonality model (e.g., structural breaks,changing seasonality, nonlinear effects, intermittent demand). For each one, explain how it can lead to systematically incorrect forecasts if it is ignored.

Structural breaks: Real-world data can experience sudden events, like natural disasters or pandemics, that produce temporary spikes or drops a simple linear model cannot capture. Such events may also cause permanent structural breaks that alter the underlying trend or seasonal pattern; if the break is ignored, the model keeps extrapolating the pre-break behavior, so its forecasts are systematically biased afterwards.

Changing or Evolving Seasonality: Seasonal trends can change over time, like how holiday sales might start earlier due to online shopping or become less clear as shopping habits evolve. If these changes are overlooked, predictions can be off, leading to repeated mistakes in estimating when sales will peak or drop, especially if the timing shifts by even a month.

2. Unexpected events and model limitations.

Give two examples of “unexpected events” (shocks) that standard statistical or machine learning forecasting models are unlikely to predict. For each example, explain one practical strategy you would use to manage this uncertainty (e.g., scenario analysis, judgmental overrides, stress testing, or adding external indicators).

Sudden regulatory or policy change - We experienced this in real time when tariffs were raised on imported goods. This kind of unexpected policy shock cannot be predicted by standard statistical or machine learning forecasting models, which rely on historical patterns. To manage this uncertainty, I would use scenario analysis combined with judgmental overrides: construct alternative demand scenarios (e.g., “no tariff change”, “moderate tariff increase”, “large tariff increase”), estimate the likely impact on sales volumes and costs under each scenario using expert judgment and any available external data, and then manually adjust the model’s baseline forecasts accordingly. 

Major macroeconomic or geopolitical shock - An unexpected event could be a sudden economic or political crisis, like a financial crash, a war, or a pandemic, that quickly changes how people behave and disrupts supply. Standard forecasting methods based on historical data are usually poor at predicting when such shocks will occur or how large they will be. I would implement stress testing and scenario analysis, creating scenarios (such as a severe recession or a quick recovery) informed by external indicators such as unemployment, and adjust the baseline forecasts under each scenario to prepare inventory, capacity, and financial plans for a range of possible outcomes.

3. Interpreting incorrect forecasts.

Your manager claims: “The model is bad because last quarter’s forecast was off by 20%.” As the forecasting expert, how would you respond? Explain how you would:

  1. evaluate whether this error is actually unusually large (what diagnostics, error metrics, or comparisons you would use), and

  2. decide whether to change the model, the data pipeline, or stakeholder expectations.

    A single 20% miss does not automatically mean the model is bad. First, we would check whether an error of this size is unusual compared with the model's historical errors, for example its past MAPE or the distribution of one-step-ahead percentage errors from time series cross-validation (sketched below), and whether there were data issues or special one-off events that quarter.

    If it is a one-off shock and everything else looks normal, we keep the model and remind stakeholders that forecasts carry uncertainty. If large errors keep appearing in the same direction over several periods, that is a sign the model (or the data pipeline feeding it) may genuinely be wrong, and that is when we should consider changing the model, fixing the pipeline, or resetting expectations.

4. Communicating with non-experts.

Non-technical stakeholders often interpret a forecast as a single “exact” number. Describe two common misunderstandings non-experts may have about forecasts (for example, ignoring prediction intervals or confusing model uncertainty with data quality). For each misunderstanding, propose one way to communicate or visualize the forecasts (e.g., intervals, fan charts, scenario bands, verbal framing) that would reduce the risk of misinterpretation

  Many people assume a forecast is a precise number rather than an informed estimate, so they may not realize that some variation around that number is normal. Displaying prediction intervals or fan charts helps communicate that forecasts always include a range of possible outcomes.

Another common misunderstanding is that a wide range indicates poor data or a faulty model. In reality, wide intervals often reflect highly variable or unpredictable real-world conditions. Presenting forecasts as clear scenarios help reinforce that uncertainty is a natural part of forecasting and does not imply analytical errors.