library(fpp3)
library(fredr)
library(lmtest)
library(urca)
fredr_set_key("bdfddd4beeaf18c36abe8754e4f3929b")Module 9 Discussion
Loading Libraries
Part I: ETS vs. ARIMA — Are They Equivalent?
I used the U.S. Total Retail Sales (RSXFS) dataset from FRED. It has an obvious upward trend, strong seasonality, and a massive drop during COVID, so it’s a good test case to see how ETS and ARIMA handle complex patterns differently.
Pulling and preparing the data
retail_raw <- fredr(
series_id = "RSXFS",
observation_start = as.Date("2005-01-01"),
observation_end = as.Date("2023-12-01"),
frequency = "m"
)
retail <- retail_raw |>
mutate(Month = yearmonth(date)) |>
as_tsibble(index = Month) |>
select(Month, value)
autoplot(retail, value) +
labs(
title = "U.S. Total Retail Sales (Excluding Food Services)",
y = "Millions of Dollars",
x = NULL
) +
theme_minimal()Train-test split
I’ll hold out the last 24 months (2022–2023) as the test set and train on everything before that.
train <- retail |> filter(Month < yearmonth("2022-01"))
test <- retail |> filter(Month >= yearmonth("2022-01"))
cat("Training set:", nrow(train), "observations\n")Training set: 204 observations
cat("Test set: ", nrow(test), "observations\n")Test set: 24 observations
Fitting ETS and ARIMA
fit_models <- train |>
model(
ets = ETS(value),
arima = ARIMA(value)
)
report(fit_models |> select(ets))Series: value
Model: ETS(M,A,N)
Smoothing parameters:
alpha = 0.8243817
beta = 0.0003153464
Initial states:
l[0] b[0]
291670.6 1704.589
sigma^2: 4e-04
AIC AICc BIC
4733.874 4734.177 4750.465
report(fit_models |> select(arima))Series: value
Model: ARIMA(0,1,0)(0,0,1)[12] w/ drift
Coefficients:
sma1 constant
-0.1723 1216.3615
s.e. 0.0759 480.8599
sigma^2 estimated as 67337762: log likelihood=-2116.78
AIC=4239.56 AICc=4239.68 BIC=4249.5
glance(fit_models) |>
select(.model, AIC, AICc, BIC)# A tibble: 2 × 4
.model AIC AICc BIC
<chr> <dbl> <dbl> <dbl>
1 ets 4734. 4734. 4750.
2 arima 4240. 4240. 4250.
Residual diagnostics
fit_models |>
select(ets) |>
gg_tsresiduals() +
ggtitle("ETS Residual Diagnostics")fit_models |>
select(arima) |>
gg_tsresiduals() +
ggtitle("ARIMA Residual Diagnostics")Forecasting and accuracy comparison
fc <- fit_models |>
forecast(h = nrow(test))
fc |>
autoplot(retail |> filter(Month >= yearmonth("2018-01")), level = 80) +
labs(
title = "ETS vs ARIMA: 24-Month Forecast on Test Data",
y = "Millions of Dollars",
x = NULL
) +
theme_minimal()accuracy_table <- fc |>
accuracy(retail)
accuracy_table |>
select(.model, RMSE, MAE, MAPE, MASE)# A tibble: 2 × 5
.model RMSE MAE MAPE MASE
<chr> <dbl> <dbl> <dbl> <dbl>
1 arima 29171. 28429. 4.86 1.42
2 ets 15811. 14262. 2.45 0.713
Discussion — Are ETS and ARIMA equivalent?
Even though they can sometimes produce similar forecasts on simple datasets, ETS and ARIMA are not computationally equivalent, and the results above perfectly highlight why. Looking at the accuracy table, the ETS model significantly outperformed ARIMA on the test set, cutting the RMSE almost in half (15,811 vs. 29,170) and dropping the MAPE to 2.44%.
The plot shows exactly where the divergence happens. The ARIMA model essentially forecasted a flattened trend for 2022-2024, whereas ETS successfully tracked the continued upward trajectory. This comes down to their structural differences: ARIMA tries to achieve stationarity through differencing and then models the remaining autocorrelations. In doing so, it likely “differenced away” the strong post-COVID recovery trend. ETS, instead, uses a state-space framework to explicitly update the level, trend, and seasonality components at each step. For this specific retail data, ETS’s ability to smooth and adapt to that shifting trend made it much more robust than ARIMA’s linear filtering.
Part II: Dynamic Regression
Next, I tested if adding an external regressor actually helps. I pulled the University of Michigan Consumer Sentiment Index (UMCSENT) because consumer confidence should logically tie into retail spending. The goal is to see if sentiment captures any variance that the base ARIMA misses.
I’ll use the University of Michigan Consumer Sentiment Index (UMCSENT) as the exogenous regressor. The logic is that consumer confidence should be a leading or concurrent indicator of retail spending.
Getting the regressor data
sentiment_raw <- fredr(
series_id = "UMCSENT",
observation_start = as.Date("2005-01-01"),
observation_end = as.Date("2023-12-01"),
frequency = "m"
)
sentiment <- sentiment_raw |>
mutate(Month = yearmonth(date)) |>
as_tsibble(index = Month) |>
select(Month, sentiment = value)Merging and exploring
combined <- retail |>
left_join(sentiment, by = "Month") |>
drop_na()
combined |>
pivot_longer(cols = c(value, sentiment), names_to = "series") |>
ggplot(aes(x = Month, y = value)) +
geom_line() +
facet_wrap(~series, scales = "free_y", ncol = 1) +
labs(title = "Retail Sales and Consumer Sentiment Over Time") +
theme_minimal()combined |>
ggplot(aes(x = sentiment, y = value)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
labs(
title = "Retail Sales vs. Consumer Sentiment",
x = "Consumer Sentiment Index",
y = "Retail Sales (Millions $)"
) +
theme_minimal()Train-test split (combined data)
train_c <- combined |> filter(Month < yearmonth("2022-01"))
test_c <- combined |> filter(Month >= yearmonth("2022-01"))Fitting models
I’ll fit three models on the training data: a plain ARIMA, an ETS, and a dynamic regression (ARIMA errors + consumer sentiment).
fit_all <- train_c |>
model(
ets = ETS(value),
arima = ARIMA(value),
dynamic_reg = ARIMA(value ~ sentiment)
)
report(fit_all |> select(dynamic_reg))Series: value
Model: LM w/ ARIMA(0,1,1)(0,0,1)[12] errors
Coefficients:
ma1 sma1 sentiment intercept
-0.1112 -0.1503 510.9204 1281.1119
s.e. 0.0765 0.0759 125.9177 420.1076
sigma^2 estimated as 62288745: log likelihood=-2107.82
AIC=4225.63 AICc=4225.94 BIC=4242.2
Checking residuals of the dynamic regression
fit_all |>
select(dynamic_reg) |>
gg_tsresiduals() +
ggtitle("Dynamic Regression Residual Diagnostics")augment(fit_all |> select(dynamic_reg)) |>
features(.innov, ljung_box, lag = 24)# A tibble: 1 × 3
.model lb_stat lb_pvalue
<chr> <dbl> <dbl>
1 dynamic_reg 24.0 0.462
A non-significant Ljung-Box test (p > 0.05) would confirm that the residuals look like white noise, meaning the model has adequately captured both the regression relationship and the autocorrelation structure.
Forecasting
For the dynamic regression forecast, we need the future values of the regressor. In a real forecasting situation we’d need to forecast sentiment separately, but since this is a backtest we can use the actual observed sentiment values from the test set.
fc_all <- fit_all |>
forecast(new_data = test_c)
fc_all |>
autoplot(combined |> filter(Month >= yearmonth("2018-01")), level = 80) +
labs(
title = "Model Comparison: 24-Month Forecast",
y = "Millions of Dollars",
x = NULL
) +
theme_minimal()Accuracy comparison across all models
acc <- fc_all |>
accuracy(combined)
acc |>
select(.model, RMSE, MAE, MAPE, MASE) |>
arrange(RMSE)# A tibble: 3 × 5
.model RMSE MAE MAPE MASE
<chr> <dbl> <dbl> <dbl> <dbl>
1 ets 15811. 14262. 2.45 0.713
2 arima 29171. 28429. 4.86 1.42
3 dynamic_reg 31821. 30980. 5.30 1.55
Discussion — Does the external regressor help?
No. Adding the consumer sentiment index actually hurt the model’s performance. As the accuracy table shows, the RMSE jumped from 29,170 (plain ARIMA) to 31,821 (dynamic regression), and the MAPE worsened to 5.30%. (ETS remains the overall winner out of all models built).
As the prompt notes, an external regressor only helps if it captures variation that the ARIMA structure misses. Here, the historical lags of retail sales clearly already contain the underlying macroeconomic signal. Because consumer sentiment just moves in tandem with the broader economy, its predictive power was redundant. Instead of adding “fresh” information, forcing this variable into the model just introduced noise and overcomplicated things, ultimately degrading the out-of-sample forecast.
Part III: Granger Causality
Concept
Granger causality is not about true cause-and-effect. It’s a statistical test for whether past values of one variable help predict future values of another, beyond what the second variable’s own past can do. The classic example: a rooster’s crow “Granger-causes” sunrise because it reliably precedes it — but nobody would argue the rooster actually makes the sun come up.
The spurious correlation problem
As the assignment points out with the Tyler Vigen site, if we just run tests on non-stationary, trending data, we’ll find “significant” but completely spurious correlations everywhere. To avoid this and test for genuine predictive relationships, I differenced both the retail sales and consumer sentiment series to make them stationary before running the Granger causality tests.
Setting up two series
I’ll test whether consumer sentiment Granger-causes retail sales. Intuitively, if people feel more optimistic, they might spend more in subsequent months.
# Work with the combined dataset from Part II
# First check stationarity with ADF tests
cat("=== ADF Test: Retail Sales (levels) ===\n")=== ADF Test: Retail Sales (levels) ===
summary(ur.df(combined$value, type = "trend", selectlags = "AIC"))
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################
Test regression trend
Call:
lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
Residuals:
Min 1Q Median 3Q Max
-55830 -2352 -131 1834 59893
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6120.24268 4290.35708 1.427 0.1551
z.lag.1 -0.02395 0.01636 -1.464 0.1446
tt 41.53563 21.81505 1.904 0.0582 .
z.diff.lag -0.05671 0.06728 -0.843 0.4001
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8155 on 222 degrees of freedom
Multiple R-squared: 0.02244, Adjusted R-squared: 0.009227
F-statistic: 1.698 on 3 and 222 DF, p-value: 0.1682
Value of test-statistic is: -1.464 3.6097 2.1099
Critical values for test statistics:
1pct 5pct 10pct
tau3 -3.99 -3.43 -3.13
phi2 6.22 4.75 4.07
phi3 8.43 6.49 5.47
cat("=== ADF Test: Consumer Sentiment (levels) ===\n")=== ADF Test: Consumer Sentiment (levels) ===
summary(ur.df(combined$sentiment, type = "trend", selectlags = "AIC"))
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################
Test regression trend
Call:
lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
Residuals:
Min 1Q Median 3Q Max
-16.1429 -2.1937 0.1718 2.7736 9.7746
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.8212648 1.9274680 2.501 0.01309 *
z.lag.1 -0.0618856 0.0228685 -2.706 0.00734 **
tt 0.0005227 0.0044666 0.117 0.90695
z.diff.lag 0.0470588 0.0676132 0.696 0.48716
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.38 on 222 degrees of freedom
Multiple R-squared: 0.03217, Adjusted R-squared: 0.0191
F-statistic: 2.46 on 3 and 222 DF, p-value: 0.06359
Value of test-statistic is: -2.7061 2.4869 3.6647
Critical values for test statistics:
1pct 5pct 10pct
tau3 -3.99 -3.43 -3.13
phi2 6.22 4.75 4.07
phi3 8.43 6.49 5.47
combined_diff <- combined |>
mutate(
d_retail = difference(value),
d_sentiment = difference(sentiment)
) |>
drop_na()
# Verify stationarity after differencing
cat("=== ADF Test: Retail Sales (first difference) ===\n")=== ADF Test: Retail Sales (first difference) ===
summary(ur.df(combined_diff$d_retail, type = "drift", selectlags = "AIC"))
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################
Test regression drift
Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
Residuals:
Min 1Q Median 3Q Max
-54254 -2226 -338 1759 57643
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1662.65322 555.86705 2.991 0.00309 **
z.lag.1 -1.24554 0.09637 -12.925 < 2e-16 ***
z.diff.lag 0.17273 0.06612 2.612 0.00961 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8108 on 222 degrees of freedom
Multiple R-squared: 0.545, Adjusted R-squared: 0.5409
F-statistic: 133 on 2 and 222 DF, p-value: < 2.2e-16
Value of test-statistic is: -12.9249 83.5264
Critical values for test statistics:
1pct 5pct 10pct
tau2 -3.46 -2.88 -2.57
phi1 6.52 4.63 3.81
cat("=== ADF Test: Consumer Sentiment (first difference) ===\n")=== ADF Test: Consumer Sentiment (first difference) ===
summary(ur.df(combined_diff$d_sentiment, type = "drift", selectlags = "AIC"))
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################
Test regression drift
Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
Residuals:
Min 1Q Median 3Q Max
-16.6386 -2.5201 0.3799 2.9302 9.2232
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.13373 0.28800 -0.464 0.642861
z.lag.1 -1.22746 0.09189 -13.358 < 2e-16 ***
z.diff.lag 0.24690 0.06564 3.761 0.000216 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.315 on 222 degrees of freedom
Multiple R-squared: 0.5193, Adjusted R-squared: 0.515
F-statistic: 119.9 on 2 and 222 DF, p-value: < 2.2e-16
Value of test-statistic is: -13.3577 89.2247
Critical values for test statistics:
1pct 5pct 10pct
tau2 -3.46 -2.88 -2.57
phi1 6.52 4.63 3.81
Visual check of differenced series
combined_diff |>
pivot_longer(
cols = c(d_retail, d_sentiment),
names_to = "series",
values_to = "diff_value"
) |>
ggplot(aes(x = Month, y = diff_value)) +
geom_line() +
facet_wrap(~series, scales = "free_y", ncol = 1) +
labs(title = "Differenced Series (Stationary)") +
theme_minimal()Running the Granger causality tests
# Does sentiment Granger-cause retail sales?
# Testing with multiple lag orders to be thorough
cat("=== Does Consumer Sentiment Granger-cause Retail Sales? ===\n\n")=== Does Consumer Sentiment Granger-cause Retail Sales? ===
for (k in c(3, 6, 12)) {
cat(paste0("--- Lag order: ", k, " ---\n"))
gt <- grangertest(d_retail ~ d_sentiment, order = k, data = combined_diff)
print(gt)
cat("\n")
}--- Lag order: 3 ---
Granger causality test
Model 1: d_retail ~ Lags(d_retail, 1:3) + Lags(d_sentiment, 1:3)
Model 2: d_retail ~ Lags(d_retail, 1:3)
Res.Df Df F Pr(>F)
1 217
2 220 -3 3.4336 0.01784 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
--- Lag order: 6 ---
Granger causality test
Model 1: d_retail ~ Lags(d_retail, 1:6) + Lags(d_sentiment, 1:6)
Model 2: d_retail ~ Lags(d_retail, 1:6)
Res.Df Df F Pr(>F)
1 208
2 214 -6 2.0575 0.05962 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
--- Lag order: 12 ---
Granger causality test
Model 1: d_retail ~ Lags(d_retail, 1:12) + Lags(d_sentiment, 1:12)
Model 2: d_retail ~ Lags(d_retail, 1:12)
Res.Df Df F Pr(>F)
1 190
2 202 -12 1.1268 0.3406
# And the reverse direction: does retail Granger-cause sentiment?
cat("=== Does Retail Sales Granger-cause Consumer Sentiment? ===\n\n")=== Does Retail Sales Granger-cause Consumer Sentiment? ===
for (k in c(3, 6, 12)) {
cat(paste0("--- Lag order: ", k, " ---\n"))
gt <- grangertest(d_sentiment ~ d_retail, order = k, data = combined_diff)
print(gt)
cat("\n")
}--- Lag order: 3 ---
Granger causality test
Model 1: d_sentiment ~ Lags(d_sentiment, 1:3) + Lags(d_retail, 1:3)
Model 2: d_sentiment ~ Lags(d_sentiment, 1:3)
Res.Df Df F Pr(>F)
1 217
2 220 -3 1.9444 0.1234
--- Lag order: 6 ---
Granger causality test
Model 1: d_sentiment ~ Lags(d_sentiment, 1:6) + Lags(d_retail, 1:6)
Model 2: d_sentiment ~ Lags(d_sentiment, 1:6)
Res.Df Df F Pr(>F)
1 208
2 214 -6 1.0716 0.3807
--- Lag order: 12 ---
Granger causality test
Model 1: d_sentiment ~ Lags(d_sentiment, 1:12) + Lags(d_retail, 1:12)
Model 2: d_sentiment ~ Lags(d_sentiment, 1:12)
Res.Df Df F Pr(>F)
1 190
2 202 -12 0.8547 0.5941
Interpretation
Interestingly, the tests show no significant Granger causality in either direction at these lag lengths (all p-values > 0.05). This suggests that past changes in consumer sentiment don’t consistently help predict future changes in retail sales beyond what retail’s own history already tells us (and vice versa).
Honestly, this aligns perfectly with my dynamic regression findings in Part II—the external regressor didn’t add any predictive value there either. Even if we had found a significant result, as the spurious correlation examples illustrate, Granger causality is strictly about predictive precedence, not true causation. It wouldn’t mean changing people’s sentiment magically causes higher sales, just that one series might lead the other. But in this specific case, they don’t even reliably predict each other once the trends are removed.