Homework 6

library(fpp3)
Registered S3 method overwritten by 'tsibble':
  method               from 
  as_tibble.grouped_df dplyr
── Attaching packages ──────────────────────────────────────────── fpp3 1.0.2 ──
✔ tibble      3.3.1     ✔ tsibble     1.1.6
✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
✔ tidyr       1.3.2     ✔ feasts      0.4.2
✔ lubridate   1.9.4     ✔ fable       0.5.0
✔ ggplot2     4.0.1     
── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
✖ lubridate::date()    masks base::date()
✖ dplyr::filter()      masks stats::filter()
✖ tsibble::intersect() masks base::intersect()
✖ tsibble::interval()  masks lubridate::interval()
✖ dplyr::lag()         masks stats::lag()
✖ tsibble::setdiff()   masks base::setdiff()
✖ tsibble::union()     masks base::union()
library(forecast)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 

1) Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.

a) Explain the differences among these figures. Do they all indicate that the data are white noise?

All three of the chart are within the bounds of confidence. Therefore it’s most likely white noise

b)Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

The critical values at are at different distances from the mean because of the large amount of data. The larger the sample size the more precise estimates. The autocorrelations are different in each figure because the random number in a given plot create random white noise.

2) A classic example of a non-stationary series are stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

gafa_stock |>
  filter(Symbol == "AMZN") |>
  gg_tsdisplay(Close, plot_type='partial')
Warning: `gg_tsdisplay()` was deprecated in feasts 0.4.2.
ℹ Please use `ggtime::gg_tsdisplay()` instead.
Warning: Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.

The plot shows that all amazon closing price is non stationary. A stationary time series would have an AFC plot that quickly converge to zero. The ACF show strong correlation between each point well outside of confidence. This shows that past value have a strong influence on future values so the graph isn’t stationary

#3) For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.

##3a) Turkish GDP from global_economy.

global_economy |>
  filter (Country == "Turkey") |>
  gg_tsdisplay(GDP, plot_type='partial')

turkey_gdp <- global_economy |> filter(Country == "Turkey")

lambda <- turkey_gdp |> 
  features(GDP, features = guerrero) |> 
  pull(lambda_guerrero)

turkey_gdp |> 
  gg_tsdisplay(difference(box_cox(GDP, lambda)), plot_type = 'partial')
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

First we stabilize the variance with box-cox then we remove the strong upward trend the data has. There’s no seasonality with the data so we don’t need to account for it.

##3b)Accommodation takings in the state of Tasmania from aus_accommodation.

aus_accommodation |>
  filter(State == "Tasmania") |>
  autoplot(Takings)

tas_acc <- aus_accommodation |> 
  filter(State == "Tasmania")

lambda <- tas_acc |> 
  features(Takings, features = guerrero) |> 
  pull(lambda_guerrero)

ns_diffs <- tas_acc |> 
  mutate(log_takings = box_cox(Takings, lambda)) |> 
  features(log_takings, unitroot_nsdiffs)

n_diffs <- tas_acc |> 
  mutate(log_takings = box_cox(Takings, lambda)) |> 
  mutate(d_log_takings = difference(log_takings, 4)) |>
  features(d_log_takings, unitroot_ndiffs)

tas_acc |> 
  gg_tsdisplay(difference(difference(box_cox(Takings, lambda), 4)), 
               plot_type = 'partial') +
  labs(title = "Stationary Tasmania Accommodation Takings (d=1, D=1)")
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_point()`).

This is very similar to 3a but required me to find the difference twice because of the seasonality spikes.

##3c)Monthly sales from souvenirs.

souvenirs |>
  autoplot()
Plot variable not specified, automatically selected `.vars = Sales`

souvenirs |> gg_tsdisplay(difference(difference(log(Sales)), lag = 12), plot_type="partial")
Warning: Removed 13 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 13 rows containing missing values or values outside the scale range
(`geom_point()`).

#5) For your retail data (from Exercise 7 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.

myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries |>
  autoplot()
Plot variable not specified, automatically selected `.vars = Turnover`

lambda <- myseries |>
  features(Turnover, features = guerrero) |>
  pull(lambda_guerrero)

suggested_D <- myseries |>
  mutate(log_turnover = box_cox(Turnover, lambda)) |>
  features(log_turnover, unitroot_nsdiffs)

suggested_d <- myseries |>
  mutate(log_turnover = box_cox(Turnover, lambda)) |>
  mutate(d_seasonal = difference(log_turnover, 12)) |>
  features(d_seasonal, unitroot_ndiffs)

myseries |>
  gg_tsdisplay(difference(difference(box_cox(Turnover, lambda), 12)), plot_type = 'partial')
Warning: Removed 13 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 13 rows containing missing values or values outside the scale range
(`geom_point()`).

#6) Simulate and plot some data from simple ARIMA models.

##6a)Use the following R code to generate data from an AR(1) model with

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

sim |> 
  gg_tsdisplay(y, plot_type = 'partial') +
  labs(title = "Simulated AR(1) Process (phi = 0.6)")

##6b)Produce a time plot for the series. How does the plot change as you change ϕ1

sim_ar <- function(phi) {
  y <- numeric(100)
  e <- rnorm(100)
  for(i in 2:100) y[i] <- phi*y[i-1] + e[i]
  tsibble(idx = seq_len(100), y = y, index = idx) |>
    autoplot(y) + labs(title = paste0("phi = ", phi))
}

library(patchwork)
sim_ar(0.9) / sim_ar(-0.9)

The amount of time it takes for the time series to increase as the phi does. As the time series the phi decrease so does the amount of time until the time series peaks.

##6c)Write your own code to generate data from an MA(1) model with θ1 = 0.6 and σ2 = 1.

e <- rnorm(100)
y <- numeric(100)

y[1] <- e[1]

for(i in 2:100) {
  y[i] <- e[i] + 0.6 * e[i-1]
}

sim_ma <- tsibble(idx = seq_len(100), y = y, index = idx)

sim_ma
# A tsibble: 100 x 2 [1]
     idx       y
   <int>   <dbl>
 1     1 -0.770 
 2     2 -0.0853
 3     3 -0.561 
 4     4 -0.471 
 5     5 -0.584 
 6     6 -2.15  
 7     7 -0.0910
 8     8  0.281 
 9     9 -0.350 
10    10 -0.442 
# ℹ 90 more rows

6d)Produce a time plot for the series. How does the plot change as you change θ1?

sim_ma <- function(theta) {
  e <- rnorm(100)
  y <- numeric(100)
  y[1] <- e[1]
  for(i in 2:100) y[i] <- e[i] + theta*e[i-1]
  
  tsibble(idx = seq_len(100), y = y, index = idx) |>
    autoplot(y) + 
    labs(title = paste0("MA(1) with theta = ", theta), y = "y") +
    ylim(-4, 4)
}

sim_ma(0.9) / sim_ma(-0.9)

As the theta increase toward 1 the plot becomes smooth with more time in between each peak and as you decrease the theta the peaks become smaller with less time inbetween peaks.

##6e) Generate data from an ARMA(1,1) model with

y <- numeric(100)
e <- rnorm(100)

y[1] <- e[1]


for(i in 2:100) {
  y[i] <- 0.6 * y[i-1] + e[i] + 0.6 * e[i-1]
}

sim_arma <- tsibble(idx = seq_len(100), y = y, index = idx)

sim_arma
# A tsibble: 100 x 2 [1]
     idx       y
   <int>   <dbl>
 1     1  0.738 
 2     2 -0.0546
 3     3 -0.427 
 4     4 -0.254 
 5     5  0.202 
 6     6  0.185 
 7     7  0.983 
 8     8  1.22  
 9     9  0.896 
10    10  0.889 
# ℹ 90 more rows

##6f)Generate data from an AR(2) model with ϕ1=−0.8, ϕ2=0.3 and σ2=1.(Note that these parameters will give a non-stationary series.)

y <- numeric(100)
e <- rnorm(100)

y[1] <- 0
y[2] <- 0

for(i in 3:100) {
  y[i] <- -0.8 * y[i-1] + 0.3 * y[i-2] + e[i]
}

sim_ar2 <- tsibble(idx = seq_len(100), y = y, index = idx)

sim_ar2
# A tsibble: 100 x 2 [1]
     idx     y
   <int> <dbl>
 1     1  0   
 2     2  0   
 3     3 -1.62
 4     4  3.12
 5     5 -3.51
 6     6  1.72
 7     7 -2.38
 8     8  2.71
 9     9 -2.64
10    10  2.43
# ℹ 90 more rows

6g) Graph the latter two series and compare them.

p1 <- sim_arma |> autoplot(y) + labs(title = "ARMA(1,1): Stationary")
p2 <- sim_ar2 |> autoplot(y) + labs(title = "AR(2): Explosive/Non-Stationary")

# Display plots together
p1 / p2

The ARMA(1,1) is a stable system that could model real world stuff like stable economic growth and AR(2) create a feedback loop that amplifies every shock making it terrible to use in forecasting since the system can easily spiral out of control

#7)Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.

##7a)Use ARIMA() to find an appropriate ARIMA model. What model was selected. Check that the residuals look like white noise. Plot forecasts for the next 10 periods.

fit <- aus_airpassengers |>
  model(arima_model = ARIMA(Passengers))

# 2. Report the selected model
report(fit)
Series: Passengers 
Model: ARIMA(0,2,1) 

Coefficients:
          ma1
      -0.8963
s.e.   0.0594

sigma^2 estimated as 4.308:  log likelihood=-97.02
AIC=198.04   AICc=198.32   BIC=201.65
# 3. Check residuals
fit |> gg_tsresiduals()
Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
ℹ Please use `ggtime::gg_tsresiduals()` instead.

# 4. Perform Ljung-Box test
fit |> augment() |>
  features(.innov, ljung_box, lag = 10, dof = 1)
# A tibble: 1 × 3
  .model      lb_stat lb_pvalue
  <chr>         <dbl>     <dbl>
1 arima_model    6.70     0.669
# 5. Generate and plot 10-period forecast
fit |>
  forecast(h = 10) |>
  autoplot(aus_airpassengers) +
  labs(title = "10-Year Forecast for Australian Air Passengers",
       y = "Passengers (millions)")

##7b)Write the model in terms of the backshift operator.

\[(1-B)^2 y_t = (1 + \theta_1 B) \epsilon_t\]

##7c)Plot forecasts from an ARIMA(0,1,0) model with drift and compare these to part a.

fit_comp <- aus_airpassengers |>
  model(
    auto_arima = ARIMA(Passengers),
    rw_drift = ARIMA(Passengers ~ 1 + pdq(0,1,0))
  )

fc_comp <- fit_comp |> forecast(h = 10)

fc_comp |>
  autoplot(aus_airpassengers, level = NULL) + # level = NULL hides intervals for clarity
  labs(title = "Comparison: Automated ARIMA vs. Random Walk with Drift",
       y = "Passengers (millions)") +
  guides(colour = guide_legend(title = "Model"))

##7d)Plot forecasts from an ARIMA(2,1,2) model with drift and compare these to parts a and c. Remove the constant and see what happens.

fit_97d <- aus_airpassengers |>
  model(
    auto_arima = ARIMA(Passengers),             # Part (a)
    rw_drift = ARIMA(Passengers ~ 1 + pdq(0,1,0)), # Part (c)
    arima212_drift = ARIMA(Passengers ~ 1 + pdq(2,1,2)), # Part (d)
    arima212_no_drift = ARIMA(Passengers ~ 0 + pdq(2,1,2)) # No constant
  )
Warning: 1 error encountered for arima212_no_drift
[1] non-stationary AR part from CSS
fc_97d <- fit_97d |> forecast(h = 10)

fc_97d |>
  autoplot(aus_airpassengers, level = NULL) +
  labs(title = "Comparison of ARIMA Models", y = "Passengers (millions)")
Warning: Removed 10 rows containing missing values or values outside the scale range
(`geom_line()`).

##7e) Plot forecasts from an ARIMA(0,2,1) model with a constant. What happens?

fit_97e <- aus_airpassengers |>
  model(
    no_constant = ARIMA(Passengers ~ pdq(0,2,1)),
    with_constant = ARIMA(Passengers ~ 1 + pdq(0,2,1))
  )
Warning: Model specification induces a quadratic or higher order polynomial trend.  This
is generally discouraged, consider removing the constant or reducing the number
of differences.
# Forecast for 10 years
fc_97e <- fit_97e |> forecast(h = 10)

# Plot to see the curvature
fc_97e |>
  autoplot(aus_airpassengers, level = NULL) +
  labs(title = "ARIMA(0,2,1) With vs. Without Constant",
       subtitle = "Note the quadratic (curved) acceleration when the constant is added",
       y = "Passengers (millions)")

#8)For the United States GDP series (from global_economy):

##8a) if necessary, find a suitable Box-Cox transformation for the data;

global_economy |> 
  filter(Country == "United States") |>
  autoplot()
Plot variable not specified, automatically selected `.vars = GDP`

usa <- global_economy |> 
  filter(Country == "United States")

usa |> 
  gg_tsdisplay(difference(log(GDP)), plot_type = "partial")
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

##8b)Fit a suitable ARIMA model to the transformed data using ARIMA();

us_gdp <- global_economy |> 
  filter(Country == "United States")

# Fit the automated ARIMA model
fit <- us_gdp |>
  model(arima_auto = ARIMA(GDP))

# View the selected model details
report(fit)
Series: GDP 
Model: ARIMA(0,2,2) 

Coefficients:
          ma1      ma2
      -0.4206  -0.3048
s.e.   0.1197   0.1078

sigma^2 estimated as 2.615e+22:  log likelihood=-1524.08
AIC=3054.15   AICc=3054.61   BIC=3060.23

##8c)Try some other plausible models by experimenting with the orders chosen;

global_economy |>
  filter(Country == "United States") |>
  features(box_cox(GDP, lambda), unitroot_ndiffs)
# A tibble: 1 × 2
  Country       ndiffs
  <fct>          <int>
1 United States      2
global_economy |>
  filter(Country == "United States") |>
  gg_tsdisplay(difference(GDP, 1),
               plot_type='partial', lag=36) +
  labs(title = "Differenced", y="")
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

fit <- global_economy |>
  filter(Country == "United States") |>
  select(GDP, Year) |>
  model(
    arima011 = ARIMA(box_cox(GDP,lambda) ~ pdq(0,1,1)),
    arima110 = ARIMA(box_cox(GDP,lambda) ~ pdq(1,1,0)),
    arima111 = ARIMA(box_cox(GDP,lambda) ~ pdq(1,1,1)),
    auto = ARIMA(box_cox(GDP,lambda), stepwise = FALSE, approx = FALSE)
  )
Warning: It looks like you're trying to fully specify your ARIMA model but have not said
if a constant should be included. You can include a constant using `ARIMA(y~1)`
to the formula or exclude it by adding `ARIMA(y~0)`.
Warning: 1 error encountered for arima111
[1] Could not find an appropriate ARIMA model.
This is likely because automatic selection does not select models with characteristic roots that may be numerically unstable.
For more details, refer to https://otexts.com/fpp3/arima-r.html#plotting-the-characteristic-roots
fit |> pivot_longer(everything(), names_to = "Model name",
                     values_to = "Orders")
# A mable: 4 x 2
# Key:     Model name [4]
  `Model name`         Orders
  <chr>               <model>
1 arima011     <ARIMA(0,1,1)>
2 arima110     <ARIMA(1,1,0)>
3 arima111       <NULL model>
4 auto         <ARIMA(0,2,2)>

8d) choose what you think is the best model and check the residual diagnostics;

I think the best model is ARIMA(1,1,0) with drift

fit %>%
  select(arima110) %>%
  augment() %>%
  filter(!is.na(.innov)) %>%
  gg_tsdisplay(.innov, plot_type = "hist")

8e)produce forecasts of your fitted model. Do the forecasts look reasonable?

us_gdp_fit <- global_economy |>
  filter(Country == "United States") |>
  model(best_arima = ARIMA(GDP ~ 1 + pdq(1,1,0)))
Warning: 1 error encountered for best_arima
[1] system is computationally singular: reciprocal condition number = 1.1301e-18
# Generate Forecast for 10 periods (years)
us_gdp_fc <- us_gdp_fit |>
  forecast(h = 10)

# Plot the Forecast
us_gdp_fc |>
  autoplot(global_economy |> filter(Country == "United States")) +
  labs(title = "10-Year GDP Forecast for the US",
       y = "GDP (Current USD)",
       x = "Year")
Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
-Inf
Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
-Inf
Warning: Removed 10 rows containing missing values or values outside the scale range
(`geom_line()`).

Both forecast look responsible. Both see the strong linear trend and continue it forward making this a reasonable and robust forecast for this dataset.

##8f)compare the results with what you would obtain using ETS() (with no transformation).

us_gdp <- global_economy |> 
  filter(Country == "United States")

# Fit both ARIMA and ETS
fit_comparison <- us_gdp |>
  model(
    arima = ARIMA(GDP ~ 1 + pdq(1,1,0)),
    ets = ETS(GDP)
  )
Warning: 1 error encountered for arima
[1] system is computationally singular: reciprocal condition number = 1.1301e-18
# Generate 10-year forecasts
fc_comparison <- fit_comparison |> forecast(h = 10)

# Plot both forecasts together
fc_comparison |>
  autoplot(us_gdp, level = NULL) +
  labs(title = "US GDP Forecast: ARIMA vs. ETS",
       y = "GDP (Current USD)") +
  guides(colour = guide_legend(title = "Model Type"))
Warning: Removed 10 rows containing missing values or values outside the scale range
(`geom_line()`).