ADEC7406 Module 8 Discussion

Author

Fabian Yang

Part I

Preparation

library(fpp3)
library(tsibble)
library(fredr)

# Load FRED ALTSALES data
altsales <- fredr::fredr(
  series_id = "ALTSALES",
  observation_start = as.Date("2010-01-01"),
  observation_end   = as.Date("2020-12-01")
) %>%
  transmute(
    Month = yearmonth(date),
    ALTSALES = value
  ) %>%
  as_tsibble(index = Month)

ACF and PACF

altsales %>%
  gg_tsdisplay(ALTSALES, plot_type = "partial")

The ACF of the original ALTSALES series decays very slowly and remains significantly positive across many lags, which suggests that the series is non-stationary. The PACF shows a very large spike at lag 1, while most later lags are relatively small. This pattern is consistent with a trending or highly persistent series rather than a stationary process. Therefore, first differencing is needed.

First Differencing

altsales_diff <- altsales %>%
  mutate(ALTSALES_diff = difference(ALTSALES))

altsales_diff %>%
  filter(!is.na(ALTSALES_diff)) %>%
  gg_tsdisplay(ALTSALES_diff, plot_type = "partial")

The first-differenced ALTSALES series appears much more stable than the original series, fluctuating around zero without a clear trend. In addition, the ACF and PACF of the differenced series no longer show the slow decay observed in the original data. Most autocorrelations are small, with only a few significant spikes at the early lags. This suggests that one difference is sufficient to achieve approximate stationarity, so I set d = 1 and do not difference the series again.

Manual ARIMA Fit

fit_manual <- altsales %>%
  model(manual = ARIMA(ALTSALES ~ pdq(1,1,1)))

report(fit_manual)
Series: ALTSALES 
Model: ARIMA(1,1,1)(0,0,1)[12] 

Coefficients:
          ar1     ma1    sma1
      -0.6828  0.8890  0.3652
s.e.   0.1062  0.0628  0.1317

sigma^2 estimated as 0.543:  log likelihood=-145.31
AIC=298.61   AICc=298.93   BIC=310.11
fit_auto <- altsales %>%
  model(auto = ARIMA(ALTSALES))

report(fit_auto)
Series: ALTSALES 
Model: ARIMA(2,1,2)(0,0,1)[12] 

Coefficients:
         ar1     ar2      ma1      ma2    sma1
      0.0761  0.3881  -0.0140  -0.7712  0.4175
s.e.  0.1320  0.1289   0.0852   0.0818  0.1226

sigma^2 estimated as 0.5091:  log likelihood=-140.45
AIC=292.91   AICc=293.59   BIC=310.16

Although I specified pdq(1,1,1) for the manual model, the reported output became ARIMA(1,1,1)(0,0,1)[12]. This means that seasonal terms were still selected by the algorithm.

The manually selected model and the automatically selected model are broadly consistent in their main conclusions. Both suggest that the ALTSALES series requires first differencing, so d = 1 is appropriate, which matches the evidence from the ACF and PACF of the original and differenced series. My manual choice, ARIMA(1,1,1), was a simpler specification based on the general short-run dependence visible after first differencing. By contrast, the automatic procedure selected ARIMA(2,1,2), indicating that the data may contain somewhat richer short-run dynamics than my initial visual inspection suggested. When I estimated the models in fable, both were also augmented with a seasonal MA component, which implies that the series still contains a modest seasonal pattern at the monthly frequency. Overall, the two approaches do not contradict each other: the manual model captures the same broad structure of the data, while the automatic model refines that structure by selecting a slightly more flexible specification with better information-criterion values.

Forecasting

altsales_all <- fredr::fredr(
  series_id = "ALTSALES",
  observation_start = as.Date("2010-01-01"),
  observation_end   = as.Date("2022-01-01")
) %>%
  transmute(
    Month = yearmonth(date),
    ALTSALES = value
  ) %>%
  as_tsibble(index = Month)

train <- altsales_all %>%
  filter(Month <= yearmonth("2020 Dec"))

actual <- altsales_all %>%
  filter(Month >= yearmonth("2021 Jan"),
         Month <= yearmonth("2022 Jan"))

fit_auto <- train %>%
  model(auto = ARIMA(ALTSALES))

fc_auto <- fit_auto %>%
  forecast(h = "13 months")

fc_plot <- fc_auto %>%
  hilo(level = 95) %>%
  unpack_hilo(`95%`)

ggplot() +
  geom_ribbon(
    data = fc_plot,
    aes(x = Month, ymin = `95%_lower`, ymax = `95%_upper`,
        fill = "95% prediction interval"),
    alpha = 0.25
  ) +
  geom_line(
    data = train,
    aes(x = Month, y = ALTSALES,
        color = "Historical values",
        linetype = "Historical values"),
    linewidth = 0.8
  ) +
  geom_line(
    data = actual,
    aes(x = Month, y = ALTSALES,
        color = "Actual values",
        linetype = "Actual values"),
    linewidth = 0.8
  ) +
  geom_line(
    data = fc_plot,
    aes(x = Month, y = .mean,
        color = "Forecast values",
        linetype = "Forecast values"),
    linewidth = 0.8
  ) +
  labs(
    title = "ALTSALES Forecast",
    subtitle = "Historical values, actual values, forecasts, and 95% prediction intervals",
    x = "Month",
    y = "ALTSALES",
    color = NULL,
    linetype = NULL,
    fill = NULL
  ) +
  scale_color_manual(
    values = c(
      "Historical values" = "black",
      "Actual values" = "black",
      "Forecast values" = "blue"
    )
  ) +
  scale_linetype_manual(
    values = c(
      "Historical values" = "solid",
      "Actual values" = "dashed",
      "Forecast values" = "solid"
    )
  ) +
  scale_fill_manual(
    values = c("95% prediction interval" = "skyblue")
  ) +
  theme_grey() +
  theme(
    legend.position = "bottom"
  ) +
  guides(
    color = guide_legend(nrow = 1),
    linetype = guide_legend(nrow = 1),
    fill = guide_legend(nrow = 1)
  )

Part II

For this part, I decided to use HOUSTNSA (new privately-owned housing units started: total units) data from FRED. This time series should show strong seasonality, as it is not seasonally adjusted.

houstnsa <- fredr::fredr(
  series_id = "HOUSTNSA",
  observation_start = as.Date("2010-01-01"),
  observation_end   = as.Date("2020-12-01")
) %>%
  transmute(
    Month = yearmonth(date),
    HOUSTNSA = value
  ) %>%
  as_tsibble(index = Month)

fit_sarima <- houstnsa %>%
  model(sarima = ARIMA(HOUSTNSA))

report(fit_sarima)
Series: HOUSTNSA 
Model: ARIMA(1,0,0)(2,1,0)[12] w/ drift 

Coefficients:
         ar1     sar1     sar2  constant
      0.4828  -0.5394  -0.3010    6.4323
s.e.  0.0855   0.1083   0.0999    0.8367

sigma^2 estimated as 77.37:  log likelihood=-431.56
AIC=873.12   AICc=873.65   BIC=887.06

The fitted model is ARIMA(1,0,0)(2,1,0)[12] with drift. This means that the series includes one non-seasonal autoregressive term, two seasonal autoregressive terms, one seasonal difference, and a drift term. The fact that d = 0 but D = 1 suggests that the main non-stationarity in the series comes from seasonality rather than from an ordinary trend. In other words, the yearly seasonal pattern must be removed, but an additional regular difference is not necessary.

The coefficient ar1 = 0.4828 indicates moderate positive short-run persistence, meaning that movements from one month tend to carry over somewhat into the next month. The seasonal coefficients sar1 = -0.5394 and sar2 = -0.3010 show that the series also depends on seasonal dynamics from one and two years earlier, with both effects being negative. This suggests some seasonal correction or reversal after seasonal differencing. The drift term, 6.4323, indicates that even after accounting for seasonal structure, the series still has a positive average underlying change over time. Since the coefficients are large relative to their standard errors, they appear to be meaningfully different from zero. Overall, the model suggests that housing starts are driven by both short-run monthly persistence and strong annual seasonal structure.

Part III

White Noise (WN)

This process (the equation) is defined by: \[\begin{equation} y_t = \varepsilon_t, \quad \varepsilon_t \sim WN(0, \sigma^2). \end{equation}\]

Here, \(y_t\) is the observed time series at time \(t\), and \(\varepsilon_t\) is a white noise error term with mean zero and constant variance \(\sigma^2\). This means that each observation \(y_t\) is simply a random shock drawn from a normal distribution with no dependence on past values. The process is stationary, as the mean and variance do not change over time, and there is no autocorrelation between observations.

Random Walk (RW)

This process is defined by \[\begin{equation} y_t = y_{t-1} + \varepsilon_t. \end{equation}\]

Here, \(y_t\) is the observed time series at time \(t\), and \(\varepsilon_t\) is a white noise error term. The random walk process implies that the current value of the series is equal to the previous value plus a random shock. This process is non-stationary because its mean and variance change over time, and it exhibits strong autocorrelation, as each observation depends on all past observations through the cumulative sum of shocks.

Random Walk with Drift

This process is defined by \[\begin{equation} y_t = \delta + y_{t-1} + \varepsilon_t. \end{equation}\]

Here, \(y_t\) is the observed time series at time \(t\), \(\delta\) is a constant drift term, and \(\varepsilon_t\) is a white noise error term. The random walk with drift process implies that the series has a deterministic trend (drift) in addition to the random walk behavior. Like the random walk, this process is non-stationary due to the presence of the drift and the cumulative effect of shocks, which causes the mean and variance to change over time.

Derivation

We define the first difference of a series: \[\begin{equation} \Delta y_t = y_t - y_{t-1}. \end{equation}\] Then, we substitute the RW equation into the differencing formula: \[\begin{align} \Delta y_t &= y_t - y_{t-1} \\ &= (y_{t-1} + \varepsilon_t)- y_{t-1} \\ &= \varepsilon_t. \end{align}\]

Therefore, after differencing, the RW process becomes just the shock term itself. And because \(\varepsilon_t\) is white noise, we have \(\Delta y_t \sim WN(0, \sigma^2)\). This shows that the first difference of a random walk process is a white noise process.

And, for RW w/ drift, the same logic gives us: \[\begin{equation} \Delta y_t = \delta + \varepsilon_t. \end{equation}\]

So differencing removes the accumulated level component, but the constant drift remains.

Plot

set.seed(123)

n <- 200
eps <- rnorm(n, mean = 0, sd = 1)
drift <- 0.3

sim_data <- tibble(
  time = 1:n,
  `White Noise` = eps,
  `Random Walk` = cumsum(eps),
  `Random Walk with Drift` = cumsum(drift + eps)
) %>%
  pivot_longer(-time, names_to = "series", values_to = "value") %>%
  as_tsibble(key = series, index = time)

sim_data %>%
  autoplot(value) +
  facet_grid(series ~ ., scales = "free_y") +
  labs(
    title = "Simulated Time Series Processes",
    x = "Time",
    y = "Value"
  ) + 
  theme(legend.position = "bottom")