Module 8 Discussion_ARIMA

Author

You-Jin,Tsai

library(fpp3)
library(fredr)
library(tidyverse)
library(patchwork)
library(knitr)
library(kableExtra)
library(gridExtra)
library(writexl)
library(tseries)
library(forecast)   
library(quantmod)
library(feasts)
library(broom)

Sys.setlocale("LC_TIME", "en_US.UTF-8")
remove(list=ls())
fredr_set_key("11c23965cf4414274d293d0c36ec7507")

1. ARIMA Modeling for TSMC Stock

1.1 Manual Selection

Load data

# load data
getSymbols("TSM", src = "yahoo", from = "2021-01-01")

[1] "TSM"

tsm <- tibble(
  date = index(TSM),
  close = as.numeric(TSM$TSM.Close)
) %>% 
  as_tsibble(index = date)

Time series plot

It is clear from a simple eyeball test that TSMC’s stock price is not stationary, especially due to the obvious instability in its mean.

tsm_clean <- tsm %>%
  fill_gaps() %>%               
  fill(close, .direction = "down") 


autoplot(tsm_clean, close) +  
  labs(title = "TSMC closing price (Cleaned)",
       subtitle="Daily, Gaps filled with Last Observation Carried Forward",
       x = "Year", y = "Price(USD)")

Make time series stationary

I applied differencing to the TSMC stock price data. Visually, aside from a few outliers, the series generally appears to be stationary.

tsm_diff <- tsm_clean |>
  mutate(diff_close = difference(close))

autoplot(tsm_diff, diff_close) + 
  labs(title = "Change in TSMC price (First Difference)",
       x = "Year", y = "Price Change (USD)")

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).

ACF & PACF plot

Since the PACF cuts off after lag 1, I will test an AR(1) model. The ACF also shows a cut-off after lag 1, suggesting an MA(1) model. Combining these with the first-order differencing, my manual model selections are ARIMA(1,1,0) and ARIMA(0,1,1)

tsm_diff %>%
  filter(!is.na(diff_close)) %>%
  gg_tsdisplay(diff_close, plot_type = 'partial') +
  labs(title = "ACF and PACF for TSMC Stock Change")

Warning: `gg_tsdisplay()` was deprecated in feasts 0.4.2.
ℹ Please use `ggtime::gg_tsdisplay()` instead.

1.2 Automated Selection and Comparison

Both manual models showed nearly identical performance, with AICc values far lower than the automated model. The computer’s choice of ARIMA(1,2,0) implies that it prioritized eliminating volatility spikes through a second difference ($d=2$). In my assessment, this was a case of over-differencing. Since $d=1$ already rendered the series stationary, the extra differencing step only degraded the model’s quality, as evidenced by the significant jump in AICc.

# Model Fitting
tsm_models <- tsm_clean %>%
  model(
    manual_110 = ARIMA(close ~ pdq(1, 1, 0)),
    manual_011 = ARIMA(close ~ pdq(0, 1, 1)),
    auto_model = ARIMA(close) 
  )

# Display auto model
report(tsm_models %>% select(auto_model))

Series: close 
Model: ARIMA(1,2,0) 

Coefficients:
          ar1
      -0.5250
s.e.   0.0194

sigma^2 estimated as 16.91:  log likelihood=-5433.33
AIC=10870.67   AICc=10870.68   BIC=10881.79

# Model comparison

glance(tsm_models) %>% 
  select(.model, AIC, AICc, BIC) %>%
  arrange(AICc)

1.3 Predict future value

I chose the ARIMA(0,1,1) model for my final prediction and retrained it after setting aside the final 30 days of data as a test set. Although the actual prices fall within the 95% prediction interval, the model does not provide a precise daily forecast. It successfully captures the overall volatility and boundary of the price movement, even if it cannot predict the exact ‘next day’ value with high precision.

train_data <- tsm_clean %>% slice(1:(n() - 30))
test_data  <- tsm_clean %>% slice((n() - 29):n())

# Re-fit the best model on training data
final_fit <- train_data %>%
  model(Best_Model = ARIMA(close ~ pdq(0, 1, 1)))

# Forecast for the next 30 days
final_fc <- final_fit %>% forecast(h = 30)

final_fc %>%
  autoplot(train_data %>% filter_index("2025-10" ~ .), level = 95) +
  autolayer(test_data, close, color = "red", linewidth = 1) + 
  labs(title = "TSMC Price Forecast vs Actual",
       subtitle = "Model: ARIMA(0,1,1) | Red line shows actual observed prices",
       x = "Date", y = "Price (USD)") +
  theme_minimal()

2. Seasonal ARIMA Modeling for US Total Vehicle Sales

2.1 Seasonal Data Visualization and Automated Model Fitting

Load Data

# U.S. Total Vehicle Sales (Not Seasonally Adjusted)
# Period: 2021-01-01 to 2025-12-01
TVC <- fredr(
  series_id = "TOTALNSA",
  observation_start = as.Date("2021-01-01"),
  observation_end = as.Date("2025-12-01")
) |>
  # Convert units from Thousands to Millions for better readability
  transmute(month = yearmonth(date), sales = value / 1000) |>
  as_tsibble(index = month)

Time series plot

The chart reveals that domestic auto sales follow a distinct seasonal cycle, with high points in March and December and a significant seasonal low in January.

autoplot(TVC,sales)+
  labs(title = "U.S. Total Vehicle Sales (NSA)",
       x = "Month", y = "Sales (Millions of Units)")

Fit Model

The automated selection identified an ARIMA(2,0,0)(0,1,1) model. The inclusion of the seasonal differencing term ($D=1$) confirms that the series has a clear seasonal component. This means the model accounts for annual cycles by differencing the data at a lag of 12.

fit_car <- TVC %>%
  model(auto_sarima = ARIMA(sales))

report(fit_car)

Series: sales 
Model: ARIMA(2,0,0)(0,1,1)[12] 

Coefficients:
         ar1     ar2     sma1
      0.5274  0.2592  -0.5849
s.e.  0.1381  0.1382   0.2578

sigma^2 estimated as 0.008879:  log likelihood=43.99
AIC=-79.97   AICc=-79.04   BIC=-72.49

2.2 Interpretation of SARIMA Parameters

Mathematical Equation

The $ARIMA(2,0,0)(0,1,1)_{12}$ model for car sales is expressed as:

$$(1 - 0.5274 B - 0.2592 B^2)(1 - B^{12}) y_t = (1 - 0.5849 B^{12}) \epsilon_t$$

$y_t$: Domestic car sales.
$B$: Backshift operator (e.g., $B y_t = y_{t-1}$).
$\epsilon_t$: White noise (random error).

Component Breakdown

Seasonal Differencing $(1 - B^{12})$: Since $D=1$, the model analyzes the year-over-year change ($y_t - y_{t-12}$) to remove the annual seasonal cycle.
AR(2) terms: These capture short-term momentum from the previous two months.
SMA(1) term: This adjusts the forecast based on the random “shock” from the same month last year.

Interpretation of Coefficients

ar1 (0.5274) & ar2 (0.2592): Both are positive, indicating strong sales momentum. High sales in the past two months typically lead to higher sales this month.
sma1 (-0.5849): This negative coefficient represents a seasonal correction. If sales unexpectedly spiked last year, the model “corrects” this year’s forecast downward to align with the long-term seasonal average.

3. White Noise (WN), Random Walk (RW), and RW with Drift

3.1 Mathematical Equations and Relationships

(1) White Noise (WN)

$$y_t = \epsilon_t$$

Terms: $\epsilon_t$ is a random shock.
Assumptions: $\epsilon_t \sim iid(0, \sigma^2)$, meaning it has a constant mean of zero, constant variance, and no autocorrelation.

(2) Random Walk (RW)

$$y_t = y_{t-1} + \epsilon_t$$

Relationship to white noise: If we take the first difference ($\Delta$):

$$\Delta y_t = y_t - y_{t-1} = \epsilon_t$$

Difference of random walk = White Noise.

(3) Random Walk with Drift

$$y_t = c + y_{t-1} + \epsilon_t$$

Terms: $c$ is the drift constant (representing a systematic trend).
Relationship to white noise: If we take the first difference ($\Delta$):

$$\Delta y_t = y_t - y_{t-1} = c + \epsilon_t$$

Difference of random walk with drift = White Noise + Constant.

3.2 Time series plot

set.seed(99)
n <- 200
WN <- rnorm(n)
RW <- cumsum(WN)
RW_drift <- 0.9 + cumsum(0.3 + WN) 

plot.ts(cbind(WN, RW, RW_drift), main="WN vs RW vs RW with Drift")

The simulation highlights the fundamental differences between these three processes. White Noise (WN) is pure randomness; it is stationary with no memory, meaning past shocks do not affect the future, and its forecast is always a constant mean.

In contrast, a Random Walk (RW) is the accumulation of noise, resulting in permanent memory where every shock shifts the series forever. It is non-stationary, and its forecast is “flat” from the last observed value.

Adding a constant creates a Random Walk with Drift, which introduces a systematic upward or downward trend. While it also has permanent memory, its forecast follows a clear trending path based on the drift.