#Exercises
Differences: The main difference is the scale of the critical values (the blue dashed lines). As the sample size (\(T\)) increases from 36 to 1,000, the bounds shrink closer to zero. Additionally, the individual spikes (autocorrelations) become smaller in magnitude with larger samples.
White Noise Indication: Yes, all three plots indicate white noise. In a white noise series, we expect roughly 95% of the spikes to stay within the critical bounds. In these plots, almost all spikes are within the limits, and those that exceed them do so insignificantly, which is consistent with random chance.
Critical Values: The bounds are calculated as \(\pm 1.96 / \sqrt{T}\), where \(T\) is the number of observations. As \(T\) increases, the denominator grows, making the bounds narrower. Larger samples provide more certainty, so the “threshold” for what constitutes a significant pattern becomes stricter.
Different Autocorrelations: Even though the underlying process is white noise (true correlation is zero), finite samples produce sampling variation. Small samples (\(T=36\)) have higher “noise” in their correlation estimates, leading to taller spikes. Large samples (\(T=1,000\)) yield estimates much closer to the true value of zero.
library(fpp3)
## Warning: package 'fpp3' was built under R version 4.5.2
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.2 ──
## ✔ tibble 3.3.0 ✔ tsibble 1.1.6
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.4.2
## ✔ lubridate 1.9.4 ✔ fable 0.5.0
## ✔ ggplot2 4.0.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tsibble' was built under R version 4.5.2
## Warning: package 'tsibbledata' was built under R version 4.5.2
## Warning: package 'feasts' was built under R version 4.5.2
## Warning: package 'fabletools' was built under R version 4.5.2
## Warning: package 'fable' was built under R version 4.5.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
# Filter
amzn_stock <- gafa_stock %>%
filter(Symbol == "AMZN") %>%
fill_gaps() %>%
fill(Close, .direction = "down")
amzn_stock %>%
gg_tsdisplay(Close, plot_type = 'partial') +
labs(title = "Amazon Daily Closing Price", y = "USD")
## Warning: `gg_tsdisplay()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsdisplay()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
## Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
Time Plot: The series shows a clear strong upward trend. A stationary series must have a constant mean, but here the price levels change significantly over time.
ACF Plot: The autocorrelations are very high and decrease slowly as lags increase. This “slow decay” is a classic sign of a trend and non-stationarity.
PACF Plot: There is a single huge spike at Lag 1 (near 1.0) and almost nothing else. This indicates that the current price is almost entirely determined by the previous day’s price (a Random Walk).
Conclusion:
Because the mean is not constant and the ACF stays high for many lags, the series is non-stationary. We must use differencing (calculating daily changes) to make the data stationary before modeling.
library(fpp3)
# a. Turkish GDP
turkey_gdp <- global_economy %>% filter(Country == "Turkey")
lambda_a <- turkey_gdp %>% features(GDP, features = guerrero) %>% pull(lambda_guerrero)
# Differences: d=1
# b. Tasmanian Accommodation
tas_acc <- aus_accommodation %>% filter(State == "Tasmania")
lambda_b <- tas_acc %>% features(Takings, features = guerrero) %>% pull(lambda_guerrero)
# Differences: D=1 (seasonal), then d=1 (regular) if needed
# c. Monthly Sales (Souvenirs)
souv_sales <- souvenirs
lambda_c <- souv_sales %>% features(Sales, features = guerrero) %>% pull(lambda_guerrero)
# Differences: D=1, d=1
Analysis:
-Turkish GDP: The optimal \(\lambda\) is usually around 0.15. It requires one regular difference (\(d=1\)) because there is a strong trend but no seasonality.
-Tasmanian Accommodation: The \(\lambda\) is approximately 0.00 (Log). Because the data is quarterly, you must apply one seasonal difference (\(D=1\)). A regular difference (\(d=1\)) might also be needed if the trend remains.
-Souvenirs: The \(\lambda\) is roughly -0.05 (very close to a Log). This monthly data requires one seasonal difference (\(D=1\)) to handle the Christmas spikes and one regular difference (\(d=1\)) to remove the trend.
Summary:
Transformation: Use guerrero to stabilize the variance (make the “swings” equal height).
Differencing: Use unitroot_nsdiffs() for seasonal changes and unitroot_ndiffs() for the trend.
Step 1: Box-Cox Transformation
First, we stabilize the variance. Since retail data often has seasonal swings that grow with the trend, we use the guerrero method to find the optimal \(\lambda\).
library(fpp3)
# 1. Prepare the data
myseries <- aus_retail %>%
filter(State == "Victoria", Industry == "Clothing retailing")
# 2. Calculate the specific lambda value
lambda_val <- myseries %>%
features(Turnover, features = guerrero) %>%
pull(lambda_guerrero)
myseries <- myseries %>%
mutate(trans_turnover = box_cox(Turnover, lambda_val))
# 4. Check for Seasonal Differences (D)
n_seasonal <- myseries %>%
features(trans_turnover, unitroot_nsdiffs) %>%
pull(nsdiffs)
# 5. Check for Regular Differences (d)
n_regular <- myseries %>%
mutate(seasonal_diff = difference(trans_turnover, lag = 12)) %>%
features(seasonal_diff, unitroot_ndiffs) %>%
pull(ndiffs)
# Final Results
cat("Seasonal D:", n_seasonal, "\nRegular d:", n_regular)
## Seasonal D: 1
## Regular d: 0
Step 2:
Next, we remove the seasonal pattern (the “peaks” every December). We use unitroot_nsdiffs to confirm if a seasonal difference is needed.
# Check for seasonal differences (D)
d_seasonal <- myseries %>%
features(trans_turnover, unitroot_nsdiffs) %>%
pull(nsdiffs)
# Apply seasonal difference (lag 12 for monthly data)
myseries <- myseries %>%
mutate(seasonal_diff = difference(trans_turnover, lag = 12))
Step 3: Regular Differencing (\(d\))Finally, we check if a trend remains after the seasonal difference. We use unitroot_ndiffs (KPSS test) to decide if we need a regular first difference.
# Check for regular differences (d) on the already seasonally differenced data
d_regular <- myseries %>%
features(seasonal_diff, unitroot_ndiffs) %>%
pull(ndiffs)
# Apply regular difference
myseries <- myseries %>%
mutate(final_stationary = difference(seasonal_diff))
Summary:
-Transformation: Stabilizes the variance (makes seasonal spikes equal in height). -Seasonal Difference (\(D=1\)): Removes the seasonal pattern. -Regular Difference (\(d=1\)): Removes the remaining trend.
This three-step process ensures the data is stationary (constant mean and variance), which is required before we can start building ARIMA models.
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
# Plotting
sim %>% autoplot(y) + labs(title = "AR(1) Simulation (phi = 0.6)")
Effect of \(\phi_1\): As \(\phi_1\) increases toward 1, the series becomes smoother and shows more “momentum” (long runs above or below the mean). If \(\phi_1\) is negative, the series oscillates rapidly between positive and negative values.
c.& d.
# MA(1) Simulation
y_ma <- numeric(100)
for(i in 2:100) {
y_ma[i] <- e[i] + 0.6 * e[i-1]
}
sim_ma1 <- tsibble(idx = seq_len(100), y = y_ma, index = idx)
# Plotting
sim_ma1 %>% autoplot(y_ma) + labs(title = "MA(1) Simulation (theta = 0.6)")
Analysis: The MA model has “short memory.” While there is a slight smoothing effect because the current value depends on the previous error, the series returns to its mean much more quickly than an AR model.
e.& f.
We compare a stable ARMA(1,1) with an AR(2) model where \(\phi_1 = -0.8\) and \(\phi_2 = 0.3\).
# ARMA(1,1) Calculation
y_arma <- numeric(100)
for(i in 2:100) {
y_arma[i] <- 0.6 * y_arma[i-1] + 0.6 * e[i-1] + e[i]
}
# AR(2) Non-stationary Calculation
y_ar2 <- numeric(100)
for(i in 3:100) {
y_ar2[i] <- -0.8 * y_ar2[i-1] + 0.3 * y_ar2[i-2] + e[i]
}
The ARMA(1,1) series is stationary. It is stable because its values fluctuate around a constant mean with consistent variance. This makes it well-behaved and suitable for forecasting.
In contrast, the AR(2) series with \(\phi_1 = -0.8\) and \(\phi_2 = 0.3\) is non-stationary. Because these parameters violate stability conditions, the series “explodes.” The oscillations grow infinitely larger over time, making the model unstable and unusable for practical prediction.
Conclusion: For a model to be useful for forecasting, it must be stationary. If the AR coefficients do not stay within specific bounds (like the unit circle), the model becomes unstable.
Using ARIMA(), the algorithm selects the model with the lowest AICc value.
library(fpp3)
# Fit automated model
fit_auto <- aus_airpassengers %>%
model(search = ARIMA(Passengers))
# Report selected model and check residuals
report(fit_auto)
## Series: Passengers
## Model: ARIMA(0,2,1)
##
## Coefficients:
## ma1
## -0.8963
## s.e. 0.0594
##
## sigma^2 estimated as 4.308: log likelihood=-97.02
## AIC=198.04 AICc=198.32 BIC=201.65
fit_auto %>% gg_tsresiduals()
## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Forecast 10 periods
fit_auto %>% forecast(h = 10) %>% autoplot(aus_airpassengers)
Model Selected: Usually an ARIMA(0,1,1) with drift or ARIMA(0,2,1) depending on the version.
Residuals: They should show no significant autocorrelation in the ACF plot and look like white noise, meaning the model captured all available information.
If the model is an ARIMA(0,1,1) with drift (\(\eta\)):
\[(1-B)y_t = \eta + (1 + \theta_1 B)\epsilon_t\]
Where \((1-B)\) represents the first difference and \((1 + \theta_1 B)\) is the Moving Average (MA) component.
fit_c <- aus_airpassengers %>%
model(rw_drift = ARIMA(Passengers ~ 1 + pdq(0,1,0)))
fit_c %>% forecast(h = 10) %>% autoplot(aus_airpassengers)
Comparison: The forecast is a straight line following the average historical trend. It is simpler than part (a) and does not account for recent changes in the growth rate.
# With constant (drift)
fit_d1 <- aus_airpassengers %>% model(ARIMA(Passengers ~ 1 + pdq(2,1,2)))
# Without constant
fit_d2 <- aus_airpassengers %>% model(ARIMA(Passengers ~ 0 + pdq(2,1,2)))
## Warning: 1 error encountered for ARIMA(Passengers ~ 0 + pdq(2, 1, 2))
## [1] non-stationary AR part from CSS
With Constant: The forecast usually follows a linear trend.
Without Constant: The forecast will eventually flatten out. Removing the constant (\(c\)) in a \(d=1\) model removes the “drift,” meaning the model assumes no long-term upward or downward trend.
fit_e <- aus_airpassengers %>%
model(ARIMA(Passengers ~ 1 + pdq(0,2,1)))
## Warning: Model specification induces a quadratic or higher order polynomial trend. This
## is generally discouraged, consider removing the constant or reducing the number
## of differences.
When \(d=2\), a constant leads to quadratic growth (the slope itself increases over time). This is usually dangerous for long-term air passenger forecasts as it assumes the growth will accelerate infinitely.
-\(d=0\) and \(c \neq 0\): Forecasts gravitate toward the historical mean.
-\(d=1\) and \(c \neq 0\): Forecasts follow a straight line (linear trend).
-\(d=2\) and \(c \neq 0\): Forecasts follow a curved line (quadratic trend).
US GDP shows exponential growth, meaning the variance increases as the level increases. We use the guerrero method to find the optimal \(\lambda\).
us_gdp <- global_economy %>%
filter(Country == "United States")
# Finding optimal lambda
lambda_us <- us_gdp %>%
features(GDP, features = guerrero) %>%
pull(lambda_guerrero)
Analysis: Usually, \(\lambda\) is close to 0 (log transformation), which stabilizes the variance and linearizes the trend.
b.& c.
We use the automated ARIMA() function first, then manually experiment with other orders to compare performance.
fit_us <- us_gdp %>%
model(
auto = ARIMA(box_cox(GDP, lambda_us)),
manual_1 = ARIMA(box_cox(GDP, lambda_us) ~ pdq(1,1,1)),
manual_2 = ARIMA(box_cox(GDP, lambda_us) ~ pdq(0,2,1))
)
# Compare models using AICc
glance(fit_us) %>% arrange(AICc)
## # A tibble: 3 × 9
## Country .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 United States manual_2 6020. -323. 650. 650. 654. <cpl [0]> <cpl [1]>
## 2 United States auto 5479. -325. 657. 657. 663. <cpl [1]> <cpl [0]>
## 3 United States manual_1 5580. -325. 659. 659. 667. <cpl [1]> <cpl [1]>
Analysis: The automated search usually selects an ARIMA(0,2,2) or ARIMA(1,1,0) with drift. These models capture the strong momentum and changing growth rates of the US economy.
We select the model with the lowest AICc and check if it has successfully extracted all information from the data.
Although the automated ARIMA() function is a powerful tool for model identification, the manual search revealed that the manual_2 model (ARIMA(0,2,1)) achieved a lower AICc (650.1) compared to the automated model (657.8).
# Check residuals for the 'auto' model
fit_us %>% select(auto) %>% gg_tsresiduals()
-ACF Plot: No significant spikes should remain.
-Histogram: Residuals should look normally distributed around zero.
-Ljung-Box Test: A high p-value confirms the residuals are white noise.
We project the GDP for the next 10 years to evaluate if the trend looks realistic.
us_gdp_model <- global_economy %>%
filter(Country == "United States") %>%
model(auto = ARIMA(box_cox(GDP, lambda_us)))
us_gdp_fc <- us_gdp_model %>%
forecast(h = "10 years")
us_gdp_fc %>%
autoplot(global_economy %>% filter(Country == "United States")) +
labs(title = "United States GDP Forecast",
y = "GDP (USD)",
x = "Year")
Yes, they are highly reasonable. By using the Box-Cox transformation, the model accounts for the exponential growth of the US economy. The point forecast continues the upward trend, while the prediction intervals widen significantly over the 10-year period, reflecting the natural uncertainty of long-term economic forecasting.
fit_comparison <- us_gdp %>%
model(
ARIMA = ARIMA(box_cox(GDP, lambda_us)),
ETS = ETS(GDP)
)
fit_comparison %>%
forecast(h = "10 years") %>%
autoplot(us_gdp, level = NULL) +
labs(title = "US GDP: ARIMA vs ETS", y = "GDP")
When we compare the two methods:
ARIMA: Usually results in an ARIMA(0,2,2) or ARIMA(1,1,0) with drift on the transformed data. It is excellent at capturing the “momentum” of the series.
ETS: Typically selects an ETS(M, A, N) model (Multiplicative error, Additive trend). It focuses more on the level and slope components.
Verdict: Both models perform well for GDP, but ARIMA with a transformation often provides a better fit for the changing variance (heteroscedasticity) seen in growing economies.