Module 9 Discussion

Author

Jincheng Xie

Loading Libraries

library(fpp3)
library(fredr)
library(lmtest)
library(urca)

fredr_set_key("bdfddd4beeaf18c36abe8754e4f3929b")

Part I: ETS vs. ARIMA — Are They Equivalent?

I used the U.S. Total Retail Sales (RSXFS) dataset from FRED. It has an obvious upward trend, strong seasonality, and a massive drop during COVID, so it’s a good test case to see how ETS and ARIMA handle complex patterns differently.

Pulling and preparing the data

retail_raw <- fredr(
  series_id = "RSXFS",
  observation_start = as.Date("2005-01-01"),
  observation_end = as.Date("2023-12-01"),
  frequency = "m"
)

retail <- retail_raw |>
  mutate(Month = yearmonth(date)) |>
  as_tsibble(index = Month) |>
  select(Month, value)

autoplot(retail, value) +
  labs(
    title = "U.S. Total Retail Sales (Excluding Food Services)",
    y = "Millions of Dollars",
    x = NULL
  ) +
  theme_minimal()

Train-test split

I’ll hold out the last 24 months (2022–2023) as the test set and train on everything before that.

train <- retail |> filter(Month < yearmonth("2022-01"))
test  <- retail |> filter(Month >= yearmonth("2022-01"))

cat("Training set:", nrow(train), "observations\n")
Training set: 204 observations
cat("Test set:    ", nrow(test), "observations\n")
Test set:     24 observations

Fitting ETS and ARIMA

fit_models <- train |>
  model(
    ets   = ETS(value),
    arima = ARIMA(value)
  )

report(fit_models |> select(ets))
Series: value 
Model: ETS(M,A,N) 
  Smoothing parameters:
    alpha = 0.8243817 
    beta  = 0.0003153464 

  Initial states:
     l[0]     b[0]
 291670.6 1704.589

  sigma^2:  4e-04

     AIC     AICc      BIC 
4733.874 4734.177 4750.465 
report(fit_models |> select(arima))
Series: value 
Model: ARIMA(0,1,0)(0,0,1)[12] w/ drift 

Coefficients:
         sma1   constant
      -0.1723  1216.3615
s.e.   0.0759   480.8599

sigma^2 estimated as 67337762:  log likelihood=-2116.78
AIC=4239.56   AICc=4239.68   BIC=4249.5
glance(fit_models) |>
  select(.model, AIC, AICc, BIC)
# A tibble: 2 × 4
  .model   AIC  AICc   BIC
  <chr>  <dbl> <dbl> <dbl>
1 ets    4734. 4734. 4750.
2 arima  4240. 4240. 4250.

Residual diagnostics

fit_models |>
  select(ets) |>
  gg_tsresiduals() +
  ggtitle("ETS Residual Diagnostics")

fit_models |>
  select(arima) |>
  gg_tsresiduals() +
  ggtitle("ARIMA Residual Diagnostics")

Forecasting and accuracy comparison

fc <- fit_models |>
  forecast(h = nrow(test))

fc |>
  autoplot(retail |> filter(Month >= yearmonth("2018-01")), level = 80) +
  labs(
    title = "ETS vs ARIMA: 24-Month Forecast on Test Data",
    y = "Millions of Dollars",
    x = NULL
  ) +
  theme_minimal()

accuracy_table <- fc |>
  accuracy(retail)

accuracy_table |>
  select(.model, RMSE, MAE, MAPE, MASE)
# A tibble: 2 × 5
  .model   RMSE    MAE  MAPE  MASE
  <chr>   <dbl>  <dbl> <dbl> <dbl>
1 arima  29171. 28429.  4.86 1.42 
2 ets    15811. 14262.  2.45 0.713

Discussion — Are ETS and ARIMA equivalent?

Even though they can sometimes produce similar forecasts on simple datasets, ETS and ARIMA are not computationally equivalent, and the results above perfectly highlight why. Looking at the accuracy table, the ETS model significantly outperformed ARIMA on the test set, cutting the RMSE almost in half (15,811 vs. 29,170) and dropping the MAPE to 2.44%.

The plot shows exactly where the divergence happens. The ARIMA model essentially forecasted a flattened trend for 2022-2024, whereas ETS successfully tracked the continued upward trajectory. This comes down to their structural differences: ARIMA tries to achieve stationarity through differencing and then models the remaining autocorrelations. In doing so, it likely “differenced away” the strong post-COVID recovery trend. ETS, instead, uses a state-space framework to explicitly update the level, trend, and seasonality components at each step. For this specific retail data, ETS’s ability to smooth and adapt to that shifting trend made it much more robust than ARIMA’s linear filtering.

Part II: Dynamic Regression

Next, I tested if adding an external regressor actually helps. I pulled the University of Michigan Consumer Sentiment Index (UMCSENT) because consumer confidence should logically tie into retail spending. The goal is to see if sentiment captures any variance that the base ARIMA misses.

I’ll use the University of Michigan Consumer Sentiment Index (UMCSENT) as the exogenous regressor. The logic is that consumer confidence should be a leading or concurrent indicator of retail spending.

Getting the regressor data

sentiment_raw <- fredr(
  series_id = "UMCSENT",
  observation_start = as.Date("2005-01-01"),
  observation_end = as.Date("2023-12-01"),
  frequency = "m"
)

sentiment <- sentiment_raw |>
  mutate(Month = yearmonth(date)) |>
  as_tsibble(index = Month) |>
  select(Month, sentiment = value)

Merging and exploring

combined <- retail |>
  left_join(sentiment, by = "Month") |>
  drop_na()

combined |>
  pivot_longer(cols = c(value, sentiment), names_to = "series") |>
  ggplot(aes(x = Month, y = value)) +
  geom_line() +
  facet_wrap(~series, scales = "free_y", ncol = 1) +
  labs(title = "Retail Sales and Consumer Sentiment Over Time") +
  theme_minimal()

combined |>
  ggplot(aes(x = sentiment, y = value)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title = "Retail Sales vs. Consumer Sentiment",
    x = "Consumer Sentiment Index",
    y = "Retail Sales (Millions $)"
  ) +
  theme_minimal()

Train-test split (combined data)

train_c <- combined |> filter(Month < yearmonth("2022-01"))
test_c  <- combined |> filter(Month >= yearmonth("2022-01"))

Fitting models

I’ll fit three models on the training data: a plain ARIMA, an ETS, and a dynamic regression (ARIMA errors + consumer sentiment).

fit_all <- train_c |>
  model(
    ets          = ETS(value),
    arima        = ARIMA(value),
    dynamic_reg  = ARIMA(value ~ sentiment)
  )

report(fit_all |> select(dynamic_reg))
Series: value 
Model: LM w/ ARIMA(0,1,1)(0,0,1)[12] errors 

Coefficients:
          ma1     sma1  sentiment  intercept
      -0.1112  -0.1503   510.9204  1281.1119
s.e.   0.0765   0.0759   125.9177   420.1076

sigma^2 estimated as 62288745:  log likelihood=-2107.82
AIC=4225.63   AICc=4225.94   BIC=4242.2

Checking residuals of the dynamic regression

fit_all |>
  select(dynamic_reg) |>
  gg_tsresiduals() +
  ggtitle("Dynamic Regression Residual Diagnostics")

augment(fit_all |> select(dynamic_reg)) |>
  features(.innov, ljung_box, lag = 24)
# A tibble: 1 × 3
  .model      lb_stat lb_pvalue
  <chr>         <dbl>     <dbl>
1 dynamic_reg    24.0     0.462

A non-significant Ljung-Box test (p > 0.05) would confirm that the residuals look like white noise, meaning the model has adequately captured both the regression relationship and the autocorrelation structure.

Forecasting

For the dynamic regression forecast, we need the future values of the regressor. In a real forecasting situation we’d need to forecast sentiment separately, but since this is a backtest we can use the actual observed sentiment values from the test set.

fc_all <- fit_all |>
  forecast(new_data = test_c)

fc_all |>
  autoplot(combined |> filter(Month >= yearmonth("2018-01")), level = 80) +
  labs(
    title = "Model Comparison: 24-Month Forecast",
    y = "Millions of Dollars",
    x = NULL
  ) +
  theme_minimal()

Accuracy comparison across all models

acc <- fc_all |>
  accuracy(combined)

acc |>
  select(.model, RMSE, MAE, MAPE, MASE) |>
  arrange(RMSE)
# A tibble: 3 × 5
  .model        RMSE    MAE  MAPE  MASE
  <chr>        <dbl>  <dbl> <dbl> <dbl>
1 ets         15811. 14262.  2.45 0.713
2 arima       29171. 28429.  4.86 1.42 
3 dynamic_reg 31821. 30980.  5.30 1.55 

Discussion — Does the external regressor help?

No. Adding the consumer sentiment index actually hurt the model’s performance. As the accuracy table shows, the RMSE jumped from 29,170 (plain ARIMA) to 31,821 (dynamic regression), and the MAPE worsened to 5.30%. (ETS remains the overall winner out of all models built).

As the prompt notes, an external regressor only helps if it captures variation that the ARIMA structure misses. Here, the historical lags of retail sales clearly already contain the underlying macroeconomic signal. Because consumer sentiment just moves in tandem with the broader economy, its predictive power was redundant. Instead of adding “fresh” information, forcing this variable into the model just introduced noise and overcomplicated things, ultimately degrading the out-of-sample forecast.

Part III: Granger Causality

Concept

Granger causality is not about true cause-and-effect. It’s a statistical test for whether past values of one variable help predict future values of another, beyond what the second variable’s own past can do. The classic example: a rooster’s crow “Granger-causes” sunrise because it reliably precedes it — but nobody would argue the rooster actually makes the sun come up.

The spurious correlation problem

As the assignment points out with the Tyler Vigen site, if we just run tests on non-stationary, trending data, we’ll find “significant” but completely spurious correlations everywhere. To avoid this and test for genuine predictive relationships, I differenced both the retail sales and consumer sentiment series to make them stationary before running the Granger causality tests.

Setting up two series

I’ll test whether consumer sentiment Granger-causes retail sales. Intuitively, if people feel more optimistic, they might spend more in subsequent months.

# Work with the combined dataset from Part II
# First check stationarity with ADF tests

cat("=== ADF Test: Retail Sales (levels) ===\n")
=== ADF Test: Retail Sales (levels) ===
summary(ur.df(combined$value, type = "trend", selectlags = "AIC"))

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression trend 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)

Residuals:
   Min     1Q Median     3Q    Max 
-55830  -2352   -131   1834  59893 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) 6120.24268 4290.35708   1.427   0.1551  
z.lag.1       -0.02395    0.01636  -1.464   0.1446  
tt            41.53563   21.81505   1.904   0.0582 .
z.diff.lag    -0.05671    0.06728  -0.843   0.4001  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8155 on 222 degrees of freedom
Multiple R-squared:  0.02244,   Adjusted R-squared:  0.009227 
F-statistic: 1.698 on 3 and 222 DF,  p-value: 0.1682


Value of test-statistic is: -1.464 3.6097 2.1099 

Critical values for test statistics: 
      1pct  5pct 10pct
tau3 -3.99 -3.43 -3.13
phi2  6.22  4.75  4.07
phi3  8.43  6.49  5.47
cat("=== ADF Test: Consumer Sentiment (levels) ===\n")
=== ADF Test: Consumer Sentiment (levels) ===
summary(ur.df(combined$sentiment, type = "trend", selectlags = "AIC"))

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression trend 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.1429  -2.1937   0.1718   2.7736   9.7746 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  4.8212648  1.9274680   2.501  0.01309 * 
z.lag.1     -0.0618856  0.0228685  -2.706  0.00734 **
tt           0.0005227  0.0044666   0.117  0.90695   
z.diff.lag   0.0470588  0.0676132   0.696  0.48716   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.38 on 222 degrees of freedom
Multiple R-squared:  0.03217,   Adjusted R-squared:  0.0191 
F-statistic:  2.46 on 3 and 222 DF,  p-value: 0.06359


Value of test-statistic is: -2.7061 2.4869 3.6647 

Critical values for test statistics: 
      1pct  5pct 10pct
tau3 -3.99 -3.43 -3.13
phi2  6.22  4.75  4.07
phi3  8.43  6.49  5.47
combined_diff <- combined |>
  mutate(
    d_retail    = difference(value),
    d_sentiment = difference(sentiment)
  ) |>
  drop_na()

# Verify stationarity after differencing
cat("=== ADF Test: Retail Sales (first difference) ===\n")
=== ADF Test: Retail Sales (first difference) ===
summary(ur.df(combined_diff$d_retail, type = "drift", selectlags = "AIC"))

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression drift 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
   Min     1Q Median     3Q    Max 
-54254  -2226   -338   1759  57643 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1662.65322  555.86705   2.991  0.00309 ** 
z.lag.1       -1.24554    0.09637 -12.925  < 2e-16 ***
z.diff.lag     0.17273    0.06612   2.612  0.00961 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8108 on 222 degrees of freedom
Multiple R-squared:  0.545, Adjusted R-squared:  0.5409 
F-statistic:   133 on 2 and 222 DF,  p-value: < 2.2e-16


Value of test-statistic is: -12.9249 83.5264 

Critical values for test statistics: 
      1pct  5pct 10pct
tau2 -3.46 -2.88 -2.57
phi1  6.52  4.63  3.81
cat("=== ADF Test: Consumer Sentiment (first difference) ===\n")
=== ADF Test: Consumer Sentiment (first difference) ===
summary(ur.df(combined_diff$d_sentiment, type = "drift", selectlags = "AIC"))

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression drift 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.6386  -2.5201   0.3799   2.9302   9.2232 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.13373    0.28800  -0.464 0.642861    
z.lag.1     -1.22746    0.09189 -13.358  < 2e-16 ***
z.diff.lag   0.24690    0.06564   3.761 0.000216 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.315 on 222 degrees of freedom
Multiple R-squared:  0.5193,    Adjusted R-squared:  0.515 
F-statistic: 119.9 on 2 and 222 DF,  p-value: < 2.2e-16


Value of test-statistic is: -13.3577 89.2247 

Critical values for test statistics: 
      1pct  5pct 10pct
tau2 -3.46 -2.88 -2.57
phi1  6.52  4.63  3.81

Visual check of differenced series

combined_diff |>
  pivot_longer(
    cols = c(d_retail, d_sentiment), 
    names_to = "series", 
    values_to = "diff_value"
  ) |>
  ggplot(aes(x = Month, y = diff_value)) +
  geom_line() +
  facet_wrap(~series, scales = "free_y", ncol = 1) +
  labs(title = "Differenced Series (Stationary)") +
  theme_minimal()

Running the Granger causality tests

# Does sentiment Granger-cause retail sales?
# Testing with multiple lag orders to be thorough

cat("=== Does Consumer Sentiment Granger-cause Retail Sales? ===\n\n")
=== Does Consumer Sentiment Granger-cause Retail Sales? ===
for (k in c(3, 6, 12)) {
  cat(paste0("--- Lag order: ", k, " ---\n"))
  gt <- grangertest(d_retail ~ d_sentiment, order = k, data = combined_diff)
  print(gt)
  cat("\n")
}
--- Lag order: 3 ---
Granger causality test

Model 1: d_retail ~ Lags(d_retail, 1:3) + Lags(d_sentiment, 1:3)
Model 2: d_retail ~ Lags(d_retail, 1:3)
  Res.Df Df      F  Pr(>F)  
1    217                    
2    220 -3 3.4336 0.01784 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

--- Lag order: 6 ---
Granger causality test

Model 1: d_retail ~ Lags(d_retail, 1:6) + Lags(d_sentiment, 1:6)
Model 2: d_retail ~ Lags(d_retail, 1:6)
  Res.Df Df      F  Pr(>F)  
1    208                    
2    214 -6 2.0575 0.05962 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

--- Lag order: 12 ---
Granger causality test

Model 1: d_retail ~ Lags(d_retail, 1:12) + Lags(d_sentiment, 1:12)
Model 2: d_retail ~ Lags(d_retail, 1:12)
  Res.Df  Df      F Pr(>F)
1    190                  
2    202 -12 1.1268 0.3406
# And the reverse direction: does retail Granger-cause sentiment?
cat("=== Does Retail Sales Granger-cause Consumer Sentiment? ===\n\n")
=== Does Retail Sales Granger-cause Consumer Sentiment? ===
for (k in c(3, 6, 12)) {
  cat(paste0("--- Lag order: ", k, " ---\n"))
  gt <- grangertest(d_sentiment ~ d_retail, order = k, data = combined_diff)
  print(gt)
  cat("\n")
}
--- Lag order: 3 ---
Granger causality test

Model 1: d_sentiment ~ Lags(d_sentiment, 1:3) + Lags(d_retail, 1:3)
Model 2: d_sentiment ~ Lags(d_sentiment, 1:3)
  Res.Df Df      F Pr(>F)
1    217                 
2    220 -3 1.9444 0.1234

--- Lag order: 6 ---
Granger causality test

Model 1: d_sentiment ~ Lags(d_sentiment, 1:6) + Lags(d_retail, 1:6)
Model 2: d_sentiment ~ Lags(d_sentiment, 1:6)
  Res.Df Df      F Pr(>F)
1    208                 
2    214 -6 1.0716 0.3807

--- Lag order: 12 ---
Granger causality test

Model 1: d_sentiment ~ Lags(d_sentiment, 1:12) + Lags(d_retail, 1:12)
Model 2: d_sentiment ~ Lags(d_sentiment, 1:12)
  Res.Df  Df      F Pr(>F)
1    190                  
2    202 -12 0.8547 0.5941

Interpretation

Interestingly, the tests show no significant Granger causality in either direction at these lag lengths (all p-values > 0.05). This suggests that past changes in consumer sentiment don’t consistently help predict future changes in retail sales beyond what retail’s own history already tells us (and vice versa).

Honestly, this aligns perfectly with my dynamic regression findings in Part II—the external regressor didn’t add any predictive value there either. Even if we had found a significant result, as the spurious correlation examples illustrate, Granger causality is strictly about predictive precedence, not true causation. It wouldn’t mean changing people’s sentiment magically causes higher sales, just that one series might lead the other. But in this specific case, they don’t even reliably predict each other once the trends are removed.