library(quantmod)
library(rugarch)
library(ggplot2)
library(xts)
library(knitr)
library(kableExtra)
set.seed(42)

Discussion 14
Part I. Understanding the Four Models and How They Connect
When I first looked at these four models together, my instinct was to treat them as four separate things to memorize. But the more I worked through the equations, the more I realized they are actually a sequence — each one fixing a limitation of the one before it. I will try to write them out that way, because that framing made them click for me.
ARIMA(p, d, q)
ARIMA is where I think most of us started in this course. The basic idea is that a variable \(y_t\) can be explained by its own past values (the AR part), its own past errors (the MA part), and differencing to handle a trend that makes the series non-stationary (the I part).
In backshift operator notation, the full model is:
\[\phi_p(B)(1 - B)^d \, y_t = c + \theta_q(B) \, \varepsilon_t\]
where \((1 - B)^d\) applies \(d\) rounds of differencing to remove the trend, and:
\[\phi_p(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p\] \[\theta_q(B) = 1 + \theta_1 B + \theta_2 B^2 + \cdots + \theta_q B^q\]
Writing it out without the operator notation (after differencing, let \(\tilde{y}_t = \Delta^d y_t\)):
\[\tilde{y}_t = c + \phi_1 \tilde{y}_{t-1} + \cdots + \phi_p \tilde{y}_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}\]
The key assumption that I want to flag here — because it becomes the whole motivation for ARCH later — is that \(\varepsilon_t \overset{iid}{\sim} (0, \sigma^2)\). The error variance \(\sigma^2\) is treated as fixed and constant over time. This is fine for a lot of series, but for financial returns or anything with “calm and stormy” periods, it is a really strong assumption that tends not to hold.
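To make that constant-variance assumption concrete, here is a minimal sketch on simulated data — the coefficients (ar = 0.6, ma = 0.3) are made up for illustration and have nothing to do with the assignment. The point is that base R's arima() reports a single \(\sigma^2\) for the entire sample:

# Minimal sketch: simulate an ARMA(1,1) with constant error variance
# (ar = 0.6, ma = 0.3 are made-up illustrative values)
set.seed(123)
sim <- arima.sim(model = list(ar = 0.6, ma = 0.3), n = 500)

# Fit an ARIMA(1,0,1); the single sigma^2 in the printed output is
# exactly the constant-variance assumption flagged above
fit_arima <- arima(sim, order = c(1, 0, 1))
fit_arima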
ADL(p, s) — Autoregressive Distributed Lag
The ADL model felt like a natural extension to me once I thought of it as: what if ARIMA had a covariate? Instead of only using past values of \(y_t\) itself, we also bring in current and lagged values of some external variable \(x_t\).
\[y_t = \alpha_0 + \sum_{i=1}^{p} \phi_i \, y_{t-i} + \sum_{j=0}^{s} \beta_j \, x_{t-j} + \varepsilon_t\]
The \(\beta_j\) coefficients are the distributed lag part — they capture how the effect of \(x\) on \(y\) plays out over \(s\) periods rather than all at once. A practically useful derived quantity is the long-run multiplier, which gives the total cumulative effect of a permanent one-unit increase in \(x\):
\[\lambda = \frac{\displaystyle\sum_{j=0}^{s} \beta_j}{1 - \displaystyle\sum_{i=1}^{p} \phi_i}\]
One thing that helped me understand ADL is that if you set all \(\beta_j = 0\), you just get an AR(p) model back. ADL does not change the structure of the error term at all — it still assumes \(\varepsilon_t \overset{iid}{\sim} (0, \sigma^2)\), so the same constant-variance limitation from ARIMA applies here.
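Because ADL is just a linear regression with lags, it can be fit with plain lm(). The sketch below is on simulated data with made-up coefficients (\(\phi_1 = 0.4\), \(\beta_0 = 0.5\), \(\beta_1 = 0.3\)), purely to illustrate the long-run multiplier formula:

# Minimal ADL(1,1) sketch on simulated data (illustrative values only)
set.seed(123)
n <- 300
x <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))
y <- numeric(n)
for (t in 2:n) {
  y[t] <- 0.2 + 0.4 * y[t - 1] + 0.5 * x[t] + 0.3 * x[t - 1] + rnorm(1)
}

d <- data.frame(
  y  = y[2:n],
  y1 = y[1:(n - 1)],   # y_{t-1}
  x0 = x[2:n],         # x_t
  x1 = x[1:(n - 1)]    # x_{t-1}
)
adl <- lm(y ~ y1 + x0 + x1, data = d)

# Long-run multiplier: sum of beta_j over (1 - sum of phi_i)
b <- coef(adl)
unname((b["x0"] + b["x1"]) / (1 - b["y1"]))
# should land near (0.5 + 0.3) / (1 - 0.4) = 1.33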
ARCH(q) — Autoregressive Conditional Heteroskedasticity
This is where I had to slow down and re-read a few times. The ARCH model does not replace the mean equation — it adds a second equation specifically for the variance. What Engle (1982) noticed was that even if residuals look uncorrelated on average, their squared values often are correlated. Big errors tend to follow big errors. That is volatility clustering, and ARCH is designed to model it.
The model has two parts:
Mean equation:
\[y_t = \mu + \varepsilon_t, \qquad \varepsilon_t = \sigma_t \, z_t, \qquad z_t \overset{iid}{\sim} N(0,1)\]
Variance equation:
\[\sigma_t^2 = \omega + \sum_{i=1}^{q} \alpha_i \, \varepsilon_{t-i}^2\]
So the conditional variance at time \(t\) depends on how large the squared errors were over the past \(q\) periods. The bigger last period’s shock \(\varepsilon_{t-1}^2\), the higher the forecasted variance today. That is the mechanism capturing clustering.
The constraints needed for this to make sense are \(\omega > 0\), \(\alpha_i \geq 0\), and \(\sum \alpha_i < 1\) for covariance stationarity. When those hold, the unconditional variance is:
\[\sigma^2 = \frac{\omega}{1 - \displaystyle\sum_{i=1}^{q} \alpha_i}\]
One limitation I noticed: if volatility stays elevated for a long time, you need a large \(q\) to capture that with ARCH alone, which means a lot of parameters. That is exactly what GARCH fixes.
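The mechanism is easy to see by simulating the two equations directly. This is a hand-rolled sketch with made-up parameters (\(\omega = 0.2\), \(\alpha_1 = 0.7\)), not the assignment data:

# Hand-rolled ARCH(1) simulation straight from the two equations
set.seed(123)
n       <- 1000
omega_  <- 0.2    # illustrative values, not estimates
alpha1_ <- 0.7
eps  <- numeric(n)
sig2 <- numeric(n)
sig2[1] <- omega_ / (1 - alpha1_)   # start at the unconditional variance

for (t in 2:n) {
  sig2[t] <- omega_ + alpha1_ * eps[t - 1]^2   # variance equation
  eps[t]  <- sqrt(sig2[t]) * rnorm(1)          # eps_t = sigma_t * z_t
}

# Sample variance should be near the unconditional variance
# omega / (1 - alpha_1) = 0.2 / 0.3 = 0.667
var(eps)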
GARCH(p, q) — Generalized ARCH
GARCH, from Bollerslev (1986), adds lagged conditional variances \(\sigma_{t-j}^2\) to the variance equation. Intuitively, if yesterday’s variance was high, today’s is probably also high — we do not need to rely only on how big the shocks were.
Mean equation (same structure as ARCH):
\[y_t = \mu + \varepsilon_t, \qquad \varepsilon_t = \sigma_t \, z_t, \qquad z_t \overset{iid}{\sim} N(0,1)\]
Variance equation:
\[\sigma_t^2 = \omega + \underbrace{\sum_{i=1}^{q} \alpha_i \, \varepsilon_{t-i}^2}_{\text{ARCH terms}} + \underbrace{\sum_{j=1}^{p} \beta_j \, \sigma_{t-j}^2}_{\text{GARCH terms}}\]
The constraints are \(\omega > 0\), \(\alpha_i \geq 0\), \(\beta_j \geq 0\), and for stationarity:
\[\sum_{i=1}^{q} \alpha_i + \sum_{j=1}^{p} \beta_j < 1\]
The sum on the left-hand side of this inequality is what I will call the persistence \(\pi\). When \(\pi\) is close to 1, a shock to variance decays slowly — the model is saying volatility is sticky. When \(\pi\) is well below 1, the variance mean-reverts quickly to its long-run level:
\[\bar{\sigma}^2 = \frac{\omega}{1 - \pi} = \frac{\omega}{1 - \displaystyle\sum_{i=1}^{q} \alpha_i - \displaystyle\sum_{j=1}^{p} \beta_j}\]
What made GARCH click for me is that ARCH(q) is literally a special case of GARCH(p,q) with \(p = 0\). Adding the \(\beta\) terms is like giving the model a memory for its own past uncertainty, not just past shocks. In practice GARCH(1,1) does most of the work that ARCH(20 or 30) would need to do.
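Extending the hand-rolled ARCH sketch above with a single \(\beta\) term shows why persistence matters. Again the parameter values (\(\omega = 0.05\), \(\alpha_1 = 0.10\), \(\beta_1 = 0.85\), so \(\pi = 0.95\)) are made up for illustration:

# Hand-rolled GARCH(1,1) simulation: one extra beta term
set.seed(123)
n    <- 2000
w_   <- 0.05
a1_  <- 0.10
b1_  <- 0.85   # persistence pi = a1_ + b1_ = 0.95
eps  <- numeric(n)
sig2 <- numeric(n)
sig2[1] <- w_ / (1 - a1_ - b1_)   # start at the long-run variance

for (t in 2:n) {
  sig2[t] <- w_ + a1_ * eps[t - 1]^2 + b1_ * sig2[t - 1]
  eps[t]  <- sqrt(sig2[t]) * rnorm(1)
}

# With pi = 0.95 a variance shock has a half-life of ~13.5 periods
log(0.5) / log(0.95)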
How the Four Models Connect
ARIMA to ADL: ARIMA only looks inward — it uses \(y\)’s own history. ADL opens a window to the outside by adding \(x_t\) and its lags. The error structure is unchanged in both.
ARIMA/ADL to ARCH: Both ARIMA and ADL treat variance as constant. ARCH says: what if the variance itself follows an autoregressive process? It takes the residuals from a mean model (which could be an ARIMA or ADL model) and regresses their squared values on past squared values to model time-varying uncertainty.
ARCH to GARCH: ARCH with lagged squared errors only is often not parsimonious enough. GARCH adds lagged conditional variances, turning what would require ARCH(q) with a very large \(q\) into a compact model — GARCH(1,1) needs only three variance parameters (\(\omega\), \(\alpha_1\), \(\beta_1\)). ARCH(q) is a restricted version of GARCH(p,q) where all \(\beta_j = 0\) (see the spec sketch after the table below).
| Model | Time-varying variance | Exogenous regressors | Lagged variance terms |
|---|---|---|---|
| ARIMA | No | No | No |
| ADL | No | Yes | No |
| ARCH | Yes | No | No |
| GARCH | Yes | Optional | Yes |
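That nesting can even be written directly as code. As a quick illustration (not something the assignment requires), an ARCH(1) in rugarch should be expressible as a GARCH spec with zero \(\beta\) lags:

# ARCH(1) expressed as a GARCH spec with zero beta lags
arch1_spec <- ugarchspec(
  variance.model = list(model = "sGARCH", garchOrder = c(1, 0)),
  mean.model     = list(armaOrder = c(0, 0), include.mean = TRUE)
)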
Part II. Fitting an AR(1)-GARCH(2,1) on S&P 500 Returns
For Part II I decided to use S&P 500 daily returns: financial time series are where GARCH is most commonly applied, and they give a concrete sense of what the coefficients mean in a real setting. I chose a GARCH(2,1) model — meaning two lagged variance terms (\(\beta_1\), \(\beta_2\)) and one lagged squared shock term (\(\alpha_1\)) — mostly to work through a model with more than one lagged variance term and to practice reading a non-trivial parameter table.
Setup
Getting the Data
getSymbols("^GSPC",
from = "2015-01-01",
to = "2024-12-31",
auto.assign = TRUE)

[1] "GSPC"
prices <- Ad(GSPC)
returns <- na.omit(diff(log(prices))) * 100
returns_df <- data.frame(
date = index(returns),
r = as.numeric(returns)
)
cat("Number of observations:", nrow(returns_df), "\n")Number of observations: 2514
cat("Date range:", format(min(returns_df$date)),
"to", format(max(returns_df$date)), "\n")Date range: 2015-01-05 to 2024-12-30
cat("Mean return:", round(mean(returns_df$r), 4), "%\n")Mean return: 0.0419 %
cat("Std deviation:", round(sd(returns_df$r), 4), "%\n")Std deviation: 1.1272 %
I compute log returns as \(r_t = 100 \times \ln(P_t / P_{t-1})\); the scaling by 100 puts everything in percentage points, which keeps the coefficient values in a readable range.
Time Series Plot
ggplot(returns_df, aes(x = date, y = r)) +
geom_line(linewidth = 0.35, color = "#1D9E75") +
geom_hline(yintercept = 0, linetype = "dashed",
color = "gray50", linewidth = 0.4) +
labs(
title = "S&P 500 daily log returns, 2015 to 2024",
subtitle = "The March 2020 COVID shock is the most visible volatility cluster",
x = NULL,
y = "Log return (%)"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(size = 13, face = "plain"),
plot.subtitle = element_text(size = 11, color = "gray40")
)

Even before fitting any model, you can see from this plot why constant variance is not going to work here. The March 2020 window has swings of 8 to 10 percent in a single day, while most of 2017 barely moves a percent. That visual pattern — quiet then stormy then quiet again — is exactly what GARCH is built for.
Checking for ARCH Effects
Before committing to GARCH, I wanted to verify that there actually are ARCH effects in the data rather than just assuming it. A simple way to see this is by comparing the ACF of the raw returns to the ACF of the squared returns.
par(mfrow = c(1, 2))
acf(returns_df$r,
lag.max = 30,
main = "ACF of returns",
col = "#1D9E75")
acf(returns_df$r^2,
lag.max = 30,
main = "ACF of squared returns",
col = "#D85A30")par(mfrow = c(1, 1))The ACF of raw returns is basically flat — there is little serial correlation in the returns themselves. But the ACF of squared returns is strongly and persistently autocorrelated across many lags. That tells me the size of returns is predictable even if the direction is not, which is the defining feature of ARCH effects. GARCH is the right tool here.
Writing Out the Equations Before Estimating
I am fitting an AR(1)-GARCH(2,1). The AR(1) in the mean equation captures any small serial correlation in the returns, and the GARCH(2,1) in the variance equation models the clustering.
Mean equation:
\[r_t = \mu + \phi_1 \, r_{t-1} + \varepsilon_t\]
\[\varepsilon_t = \sigma_t \, z_t, \qquad z_t \overset{iid}{\sim} N(0,1)\]
Variance equation:
\[\sigma_t^2 = \omega + \alpha_1 \, \varepsilon_{t-1}^2 + \beta_1 \, \sigma_{t-1}^2 + \beta_2 \, \sigma_{t-2}^2\]
The \(\alpha_1\) term captures how much a large shock yesterday amplifies today’s variance — this is the news impact. The \(\beta_1\) and \(\beta_2\) terms capture how much yesterday’s and the day before’s variance level carries forward — this is the persistence. The intercept \(\omega\) anchors the whole thing to a long-run baseline.
A Note on Software vs Textbook Notation
This tripped me up when I first wrote the ugarchspec call. In the textbook, GARCH(\(p\), \(q\)) means \(p\) lagged variance terms (the \(\beta\)’s) and \(q\) lagged squared shock terms (the \(\alpha\)’s). But in rugarch, garchOrder = c(first, second) where the first number is the \(\alpha\) order and the second is the \(\beta\) order — the opposite of how I read the textbook notation.
So to fit GARCH(2,1) — meaning 2 \(\beta\) terms and 1 \(\alpha\) term — I need to write garchOrder = c(1, 2):
| What I want (textbook) | What I write in rugarch |
|---|---|
| GARCH(\(p=2\), \(q=1\)): 2 beta lags, 1 alpha lag | garchOrder = c(1, 2) |
Model Estimation
spec <- ugarchspec(
variance.model = list(
model = "sGARCH",
garchOrder = c(1, 2) # alpha order = 1 | beta order = 2
),
mean.model = list(
armaOrder = c(1, 0), # AR(1) mean
include.mean = TRUE
),
distribution.model = "norm"
)
fit <- ugarchfit(spec = spec, data = returns)
print(fit)
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : sGARCH(1,2)
Mean Model : ARFIMA(1,0,0)
Distribution : norm
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu 0.080858 0.013935 5.80239 0.000000
ar1 -0.051820 0.022350 -2.31853 0.020421
omega 0.041902 0.007698 5.44337 0.000000
alpha1 0.197684 0.031514 6.27281 0.000000
beta1 0.679682 0.190897 3.56046 0.000370
beta2 0.090073 0.164452 0.54771 0.583888
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu 0.080858 0.013644 5.92638 0.000000
ar1 -0.051820 0.020428 -2.53672 0.011190
omega 0.041902 0.013018 3.21893 0.001287
alpha1 0.197684 0.055154 3.58419 0.000338
beta1 0.679682 0.299744 2.26754 0.023357
beta2 0.090073 0.250842 0.35908 0.719535
LogLikelihood : -3227.834
Information Criteria
------------------------------------
Akaike 2.5727
Bayes 2.5866
Shibata 2.5726
Hannan-Quinn 2.5777
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 1.620 0.2031
Lag[2*(p+q)+(p+q)-1][2] 1.772 0.3018
Lag[4*(p+q)+(p+q)-1][5] 2.797 0.4840
d.o.f=1
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 0.02559 0.8729
Lag[2*(p+q)+(p+q)-1][8] 1.84594 0.8819
Lag[4*(p+q)+(p+q)-1][14] 4.26510 0.8538
d.o.f=3
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[4] 1.474 0.500 2.000 0.2246
ARCH Lag[6] 2.141 1.461 1.711 0.4604
ARCH Lag[8] 3.395 2.368 1.583 0.4723
Nyblom stability test
------------------------------------
Joint Statistic: 1.9137
Individual Statistics:
mu 0.1357
ar1 0.3242
omega 0.3876
alpha1 0.2941
beta1 0.5899
beta2 0.6109
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 1.49 1.68 2.12
Individual Statistic: 0.35 0.47 0.75
Sign Bias Test
------------------------------------
t-value prob sig
Sign Bias 3.2420 0.0012025 ***
Negative Sign Bias 0.2261 0.8211555
Positive Sign Bias 0.3546 0.7229510
Joint Effect 19.6619 0.0001994 ***
Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
group statistic p-value(g-1)
1 20 126.5 6.562e-18
2 30 141.4 9.878e-17
3 40 155.6 7.484e-16
4 50 158.6 1.728e-13
Elapsed time : 0.1656311
Extracting the Coefficients
cv <- coef(fit)
mu_h <- round(cv["mu"], 6)
phi1 <- round(cv["ar1"], 6)
omega <- round(cv["omega"], 6)
a1 <- round(cv["alpha1"], 6)
b1 <- round(cv["beta1"], 6)
b2 <- round(cv["beta2"], 6)
pers <- round(a1 + b1 + b2, 6)
lr_var <- round(omega / (1 - pers), 6)
lr_sd <- round(sqrt(lr_var), 4)
half_life <- round(log(0.5) / log(pers), 1)
param_table <- data.frame(
Symbol = c("$\\mu$", "$\\phi_1$", "$\\omega$",
"$\\alpha_1$", "$\\beta_1$", "$\\beta_2$"),
rugarch.label = c("mu", "ar1", "omega",
"alpha1", "beta1", "beta2"),
Estimate = c(mu_h, phi1, omega, a1, b1, b2)
)
kable(param_table, escape = FALSE,
caption = "Estimated AR(1)-GARCH(2,1) coefficients") |>
kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

| Symbol | rugarch label | Estimate |
|---|---|---|
| $\mu$ | mu | 0.080858 |
| $\phi_1$ | ar1 | -0.051820 |
| $\omega$ | omega | 0.041902 |
| $\alpha_1$ | alpha1 | 0.197684 |
| $\beta_1$ | beta1 | 0.679682 |
| $\beta_2$ | beta2 | 0.090073 |
Updated Equations with Estimated Coefficients
With the numbers in hand, I can now substitute back into the equations I wrote before fitting. This is where the reconciliation between software output and textbook notation matters — the label alpha1 in the rugarch output corresponds to \(\alpha_1\) in my variance equation, beta1 to \(\beta_1\), and beta2 to \(\beta_2\).
Updated mean equation:
cat(sprintf(
  "$$\\hat{r}_t = %.6f %+.6f \\, r_{t-1} + \\varepsilon_t$$",
  mu_h, phi1
))

\[\hat{r}_t = 0.080858 - 0.051820 \, r_{t-1} + \varepsilon_t\]
Updated variance equation:
cat(sprintf(
"$$\\hat{\\sigma}_t^2 = %.6f \n + %.6f \\, \\varepsilon_{t-1}^2 \n + %.6f \\, \\sigma_{t-1}^2 \n + %.6f \\, \\sigma_{t-2}^2$$",
omega, a1, b1, b2
))\[\hat{\sigma}_t^2 = 0.041902 + 0.197684 \, \varepsilon_{t-1}^2 + 0.679682 \, \sigma_{t-1}^2 + 0.090073 \, \sigma_{t-2}^2\]
Conditional Volatility Plot
cond_vol <- as.numeric(sigma(fit))
vol_df <- data.frame(date = index(returns), vol = cond_vol)
ggplot(vol_df, aes(x = date, y = vol)) +
geom_line(linewidth = 0.4, color = "#D85A30") +
labs(
title = "Estimated conditional volatility — AR(1)-GARCH(2,1)",
subtitle = "S&P 500 daily log returns, 2015 to 2024",
x = NULL,
y = expression(hat(sigma)[t] ~ "(% per day)")
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(size = 13, face = "plain"),
plot.subtitle = element_text(size = 11, color = "gray40")
)

Seeing this plot was actually the most satisfying part of the assignment for me. The model is capturing exactly what I saw in the raw returns plot — the COVID spike in 2020, the elevated regime in late 2022, and the relatively calm stretches in between. The model is not just fitting the data; it is telling a story about when uncertainty was high and when it was not.
Interpreting the Coefficients
Volatility clustering
Looking at the conditional volatility plot, the clustering is unmistakable. The March 2020 spike is by far the most prominent, but there are also sustained elevated periods in late 2018 and through 2022. The reason GARCH captures this is that once \(\sigma_t^2\) rises, it feeds back into \(\sigma_{t+1}^2\) through the \(\beta\) terms, keeping variance elevated until the shocks subside.
Persistence
cat(sprintf(
"$$\\pi = \\alpha_1 + \\beta_1 + \\beta_2 = %.6f + %.6f + %.6f = %.6f$$",
a1, b1, b2, pers
))

\[\pi = \alpha_1 + \beta_1 + \beta_2 = 0.197684 + 0.679682 + 0.090073 = 0.967439\]
This is the persistence of the variance process. A value close to 1 means shocks to variance take a very long time to decay — once volatility rises, the model predicts it will stay elevated for many periods. For S&P 500 returns this is a well-known empirical finding: equity volatility is highly persistent, which is why GARCH(1,1) or GARCH(2,1) has been the workhorse model in finance for decades.
Long-run variance and volatility
cat(sprintf(
"$$\\bar{\\sigma}^2 = \\frac{\\omega}{1 - \\pi} = \\frac{%.6f}{1 - %.6f} = %.6f$$",
omega, pers, lr_var
))

\[\bar{\sigma}^2 = \frac{\omega}{1 - \pi} = \frac{0.041902}{1 - 0.967439} = 1.286877\]
cat(sprintf(
"$$\\bar{\\sigma} = \\sqrt{%.6f} \\approx %.4f\\%%\\text{ per day}$$",
lr_var, lr_sd
))

\[\bar{\sigma} = \sqrt{1.286877} \approx 1.1344\%\text{ per day}\]
This is the baseline level that conditional volatility tends to return to when there are no unusual shocks. It is not the volatility on any given day — that fluctuates — but the average level the process is gravitating toward.
Mean reversion and half-life
Mean reversion in the variance context means: after a shock pushes \(\sigma_t^2\) above the long-run level, how quickly does it return? One way to quantify this is the half-life of a shock — the number of trading days for half of the initial deviation to dissipate:
cat(sprintf(
"$$\\text{Half-life} = \\frac{\\ln(0.5)}{\\ln(\\pi)} = \\frac{\\ln(0.5)}{\\ln(%.6f)} \\approx %.1f \\text{ trading days}$$",
pers, half_life
))

\[\text{Half-life} = \frac{\ln(0.5)}{\ln(\pi)} = \frac{\ln(0.5)}{\ln(0.967439)} \approx 20.9 \text{ trading days}\]
For context, a half-life of around 20 to 30 trading days means roughly one to one-and-a-half calendar months for the worst of a shock to work its way out of the variance forecast. That matches intuition from watching markets: volatility after a major event like a Fed surprise or a geopolitical shock does not vanish overnight but does generally settle within a few weeks.
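One way to see the half-life arithmetic play out, rather than just computing it, is to forecast \(\hat{\sigma}_t\) forward from the end of the sample and watch it glide toward the long-run level. A quick sketch (the 60-day horizon is my arbitrary choice):

# Forecast sigma 60 days ahead and watch it revert toward
# the long-run daily volatility (~1.13%) computed above
fc <- ugarchforecast(fit, n.ahead = 60)
fc_sigma <- as.numeric(sigma(fc))

plot(fc_sigma, type = "l",
     xlab = "Days ahead", ylab = "Forecast sigma (% per day)",
     main = "GARCH volatility forecast reverting to the long-run level")
abline(h = lr_sd, lty = 2)   # long-run level computed earlier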
What the individual beta coefficients tell us
The split between \(\beta_1\) and \(\beta_2\) in a GARCH(2,1) tells you something about the shape of the decay. If \(\beta_1\) dominates, most of the persistence comes from the most recent variance period. If \(\beta_2\) is also meaningful, the model is saying that variance two days ago still has independent explanatory power beyond what yesterday captured. In practice for equity returns, \(\beta_1\) tends to carry most of the weight, but having \(\beta_2\) in the model allows for slightly more flexible decay patterns rather than the strict geometric decline implied by GARCH(1,1).
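A natural follow-up question — though not one the assignment asks — is whether \(\beta_2\) earns its keep at all, given its insignificant t-statistic in the fit output. One quick check is to refit with a single \(\beta\) lag and compare information criteria:

# Refit as AR(1)-GARCH(1,1) and compare information criteria
# (lower is better)
spec_11 <- ugarchspec(
  variance.model = list(model = "sGARCH", garchOrder = c(1, 1)),
  mean.model     = list(armaOrder = c(1, 0), include.mean = TRUE),
  distribution.model = "norm"
)
fit_11 <- ugarchfit(spec = spec_11, data = returns)

ic <- cbind(infocriteria(fit), infocriteria(fit_11))
colnames(ic) <- c("GARCH(2,1)", "GARCH(1,1)")
ic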
The .qmd source file and knitted HTML are published on RPubs.