Data 624 Homework #6

9.1

Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.

Explain the differences among these figures. Do they all indicate that the data are white noise?

The three ACF plots look different mainly because the sample sizes are different. For the series with only 36 observations, the sample ACF values vary more and appear more irregular because the smaller sample size makes the estimated autocorrelations more sensitive to random variation. For the series with 360 observations, the ACF values are closer to zero overall, although there is still some random fluctuation. For the series with 1,000 observations, the ACF values are very close to zero at almost all lags, which is what we would expect from white noise since white noise has no true autocorrelation. Even so, all three figures are still consistent with white noise. In a white noise series, the sample ACF is not expected to be exactly zero at every lag, but instead should bounce randomly around zero. The key point is that there is no clear pattern, such as a slow decay, repeating seasonal spikes, or many large significant autocorrelations. It is also normal for a few spikes to appear slightly larger just by chance, especially when many lags are shown.

Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

The critical values are different in each plot because they depend on the sample size. In an ACF plot, the approximate significance bounds are given by \(\pm 1.96 / \sqrt{n}\). This means that a smaller sample size gives wider bounds, while a larger sample size gives narrower bounds. That is why the blue dashed lines are farther from zero for the series with 36 observations and closer to zero for the series with 1,000 observations. The autocorrelations are different in each figure because each one comes from a different random sample of white noise. Even though white noise has no true autocorrelation, the sample ACF values are only estimates, so they can still vary randomly from one sample to another.

9.2

A classic example of a non-stationary series are stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock). along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

# keep Amazon stock data
amazon_stock <- gafa_stock %>%
  filter(Symbol == "AMZN")

# plot the daily closing prices using the actual date
amazon_stock %>%
  autoplot(Close) +
  labs(
    title = "Amazon Daily Closing Prices",
    y = "Closing Price",
    x = "Date"
  )

# create a regular observation index for the ACF and PACF plots
amazon_stock_indexed <- amazon_stock %>%
  as_tibble() %>%
  mutate(obs = row_number()) %>%
  select(obs, Close) %>%
  as_tsibble(index = obs)

# plot the ACF and PACF using the regular observation index
amazon_stock_indexed %>%
  ggtime::gg_tsdisplay(Close, plot_type = "partial") +
  labs(
    title = "ACF and PACF of Amazon Daily Closing Prices",
    y = "Closing Price",
    x = "Lag"
  )

The time plot of Amazon daily closing prices shows that the series is non stationary because the values do not stay around a constant mean. Instead, the series moves up and down over time in a wandering pattern, which is common for stock prices. This shows that the level of the series changes over time rather than staying stable.

The ACF plot also shows that the series is non stationary because the autocorrelations remain very large and positive for many lags and decrease only slowly. For a stationary series, the ACF would usually drop much faster. A slow decay in the ACF is a common sign that the series has trend like behavior or behaves like a random walk.

The PACF plot shows one very large spike at lag 1, followed by much smaller values at later lags. This suggests that nearby observations are strongly related, which is also typical of a non stationary stock price series.

Overall, these plots suggest that the Amazon closing price series is non stationary and should be differenced. Taking the first difference would remove the changing level and make the series more stable over time.

9.3

For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data a. Turkish GDP from global_economy.

# keep Turkey GDP data
turkey_gdp <- global_economy %>%
  filter(Code == "TUR")

# estimate a Box-Cox lambda
lambda_turkey <- turkey_gdp %>%
  features(GDP, guerrero) %>%
  pull(lambda_guerrero)

lambda_turkey

## [1] 0.1572187

# plot the original GDP series
turkey_gdp %>%
  autoplot(GDP) +
  labs(
    title = "Turkish GDP",
    y = "GDP",
    x = "Year"
  )

# create a Box-Cox transformed series
turkey_gdp_bc <- turkey_gdp %>%
  mutate(GDP_bc = box_cox(GDP, lambda_turkey))

# plot the transformed series after one regular difference
turkey_gdp_bc %>%
  mutate(GDP_diff = difference(GDP_bc)) %>%
  filter(!is.na(GDP_diff)) %>%
  ggtime::gg_tsdisplay(GDP_diff, plot_type = "partial") +
  labs(
    title = "Box Cox transformed and differenced Turkish GDP",
    y = "Transformed GDP",
    x = "Year"
  )

# check the suggested number of regular differences
turkey_gdp_bc %>%
  features(GDP_bc, unitroot_ndiffs)

## # A tibble: 1 × 2
##   Country ndiffs
##   <fct>    <int>
## 1 Turkey       1

For Turkish GDP, an appropriate choice is a Box Cox transformation with lambda about 0.16, which is close to a log transformation, and one regular difference. The GDP series shows strong growth over time and a changing level, so the transformation helps stabilize the variance and first differencing removes the trend. After one regular difference, the series looks much closer to stationary.

Accommodation takings in the state of Tasmania from aus_accommodation.

# keep Tasmania accommodation takings
tas_accom <- aus_accommodation %>%
  filter(State == "Tasmania")

# estimate a Box-Cox lambda
lambda_tas <- tas_accom %>%
  features(Takings, guerrero) %>%
  pull(lambda_guerrero)

lambda_tas

## [1] 0.001819643

# plot the original takings series
tas_accom %>%
  autoplot(Takings) +
  labs(
    title = "Tasmania Accommodation Takings",
    y = "Takings",
    x = "Time"
  )

# create a Box-Cox transofrmed series
tas_accom_bc <- tas_accom %>%
  mutate(Takings_bc = box_cox(Takings, lambda_tas))

# plot the series after one seasonal difference
tas_accom_bc %>%
  mutate(season_diff = difference(Takings_bc, 4)) %>%
  filter(!is.na(season_diff)) %>%
  ggtime::gg_tsdisplay(season_diff, plot_type = "partial") +
  labs(
    title = "Box Cox transformed and seasonally differenced Tasmania accommodation takings",
    y = "Transformed takings",
    x = "Time"
  )

# check the suggested number of differences
tas_accom_bc %>%
  features(Takings_bc, unitroot_nsdiffs)

## # A tibble: 1 × 2
##   State    nsdiffs
##   <chr>      <int>
## 1 Tasmania       1

tas_accom_bc %>%
  mutate(season_diff = difference(Takings_bc, 4)) %>%
  features(season_diff, unitroot_ndiffs)

## # A tibble: 1 × 2
##   State    ndiffs
##   <chr>     <int>
## 1 Tasmania      0

For Tasmania accommodation takings, an appropriate choice is a log transformation and one seasonal difference of lag 4. The estimated Box Cox lambda is very close to 0, so a log transformation is reasonable. The series shows a clear quarterly seasonal pattern and an upward movement over time. The output shows that one seasonal difference is needed, and after seasonal differencing no regular difference is needed. This means the seasonal difference is enough to make the series much closer to stationary.

Monthly sales from souvenirs.

# estimate a Box-Cox lambda
lambda_souvenirs <- souvenirs %>%
  features(Sales, guerrero) %>%
  pull(lambda_guerrero)

lambda_souvenirs

## [1] 0.002118221

# plot the original sales series
souvenirs %>%
  autoplot(Sales) +
  labs(
    title = "Monthly Souvenir Sales",
    y = "Sales",
    x = "Month"
  )

# create a Box-Cox transformed series
souvenirs_bc <- souvenirs %>%
  mutate(Sales_bc = box_cox(Sales, lambda_souvenirs))

# check the suggested number of seasonal differences
souvenirs_bc %>%
  features(Sales_bc, unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1

# check whether a regular difference is needed after seasonal differencing
souvenirs_bc %>%
  mutate(season_diff = difference(Sales_bc, 12)) %>%
  features(season_diff, unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

# plot the series after one seasonal difference
souvenirs_bc %>%
  mutate(season_diff = difference(Sales_bc, 12)) %>%
  filter(!is.na(season_diff)) %>%
  ggtime::gg_tsdisplay(season_diff, plot_type = "partial") +
  labs(
    title = "Box Cox transformed and seasonally differenced souvenir sales",
    y = "Transformed sales",
    x = "Month"
  )

For the souvenirs series, an appropriate choice is a log transformation and one seasonal difference of lag 12. The estimated Box Cox lambda is very close to 0, so a log transformation is appropriate. The series shows increasing variation, a strong upward trend, and clear yearly seasonality. The seasonal difference removes the yearly seasonal pattern, and the differencing check shows that no additional regular difference is needed. After this step, the series is much closer to stationary.

9.5

For your retail data (from exercise 7 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.

set.seed(0523)
my_id <- sample(unique(aus_retail$`Series ID`), 1)

myseries <- aus_retail %>%
  filter(`Series ID` == my_id)

# show the selected series
myseries %>%
  distinct(State, Industry, `Series ID`)

## # A tibble: 1 × 3
##   State    Industry                                         `Series ID`
##   <chr>    <chr>                                            <chr>      
## 1 Tasmania Hardware, building and garden supplies retailing A3349590A

# plot the original series to look for trend and seasonality
myseries %>%
  autoplot(Turnover) +
  labs(
    title = "Selected Australian Retail Series",
    y = "Turnover",
    x = "Month"
  )

# estimate a Box Cox lambda using Guerrero's method
lambda_retail <- myseries %>%
  features(Turnover, guerrero) %>%
  pull(lambda_guerrero)

lambda_retail

## [1] 0.4449179

# create the transformed series
myseries_bc <- myseries |>
  mutate(Turnover_bc = box_cox(Turnover, lambda_retail))

# Plot the transformed series
myseries_bc |>
  autoplot(Turnover_bc) +
  labs(
    title = "Box-Cox Transformed Retail Series",
    y = "Transformed Turnover",
    x = "Month"
  )

# check how many seasonal differences are needed
myseries_bc %>%
  features(Turnover_bc, unitroot_nsdiffs)

## # A tibble: 1 × 3
##   State    Industry                                         nsdiffs
##   <chr>    <chr>                                              <int>
## 1 Tasmania Hardware, building and garden supplies retailing       1

# after one seasonal difference, check how many regular differences are needed
myseries_bc %>%
  mutate(season_diff = difference(Turnover_bc, 12)) %>%
  features(season_diff, unitroot_ndiffs)

## # A tibble: 1 × 3
##   State    Industry                                         ndiffs
##   <chr>    <chr>                                             <int>
## 1 Tasmania Hardware, building and garden supplies retailing      0

# plot the transformed series after one seasonal difference
myseries_bc %>%
  mutate(season_diff = difference(Turnover_bc, 12)) %>%
  filter(!is.na(season_diff)) %>%
  ggtime::gg_tsdisplay(season_diff, plot_type = "partial") +
  labs(
    title = "Seasonally Differenced Retail Series",
    y = "Seasonally Differenced Turnover",
    x = "Month"
  )

The selected retail series is for Tasmania hardware, building and garden supplies retailing. The original plot shows a clear upward movement over time and strong monthly seasonality. The estimated Box Cox lambda is about 0.445, so a Box Cox transformation is appropriate to help stabilize the variance. After transforming the data, the seasonal differencing check suggests that one seasonal difference is needed. Then, after applying one seasonal difference of lag 12, the regular differencing check shows that no regular difference is needed. Therefore, the appropriate choice for this series is a Box Cox transformation with lambda about 0.445 and one seasonal difference of lag 12.

9.6

Simulate and plot some data from simple ARIMA models.

Use the following R code to generate data from an AR(1) model with \(\phi_1 = 0.6\) and \(\sigma^2 = 1\). The process starts with \(y_1 = 0\). y <- numeric(100) e <- rnorm(100) for(i in 2:100) y[i] <- 0.6*y[i-1] + e[i] sim <- tsibble(idx = seq_len(100), y = y, index = idx)

set.seed(0524)

y <- numeric(100)

e <- rnorm(100)

for (i in 2:100) {
  y[i] <- 0.6 * y[i - 1] + e[i]
}

sim <- tsibble(
  idx = seq_len(100),
  y = y,
  index = idx
)

head(sim)

## # A tsibble: 6 x 2 [1]
##     idx      y
##   <int>  <dbl>
## 1     1  0    
## 2     2  0.782
## 3     3  0.169
## 4     4 -0.958
## 5     5  0.355
## 6     6 -0.524

# plot the simulated AR(1) series
sim %>%
  autoplot(y) +
  labs(
    title = "Simulated AR(1) Series",
    subtitle = expression(phi[1] == 0.6 ~ "," ~ sigma^2 == 1),
    x = "Time",
    y = "y"
  )

# plot the ACF and PACF of the simulated series
sim %>%
  ggtime::gg_tsdisplay(y, plot_type = "partial") +
  labs(
    title = "Simulated AR(1) Series with ACF and PACF",
    x = "Lag"
  )

The data were simulated from an AR(1) model with \(\phi_1 = 0.6\) and \(\sigma^2 = 1\). Since the coefficient is positive and less than 1, the series is stationary and shows short run persistence. This means nearby observations tend to be positively related. In the time plot, the values move up and down around a roughly constant mean rather than showing a long term trend. In the ACF plot, the correlations gradually decrease as the lag increases, which is a common pattern for an AR(1) process. In the PACF plot, there is one main spike at lag 1 and the remaining lags are close to zero, which is also consistent with an AR(1) model.

Produce a time plot for the series. How does the plot change as you change ϕ1?

# choose several values of phi_1 to compare
phi_values <- c(-0.8, 0.2, 0.6, 0.9)

# create an empty list to store each simulated series
sim_list <- list()

# simulate an AR(1) series for each phi_1 value
for (j in seq_along(phi_values)) {
  
  # create empty series and random errors
  y <- numeric(100)
  e <- rnorm(100)
  
  # simulate the AR(1) process
  for (i in 2:100) {
    y[i] <- phi_values[j] * y[i - 1] + e[i]
  }
  
  # store results in a tibble
  sim_list[[j]] <- tibble(
    idx = 1:100,
    y = y,
    phi = as.factor(phi_values[j])
  )
}

# combine all simulated series together
sim_all <- bind_rows(sim_list) %>%
  as_tsibble(index = idx, key = phi)

# plot the simulated series
sim_all %>%
  ggplot(aes(x = idx, y = y)) +
  geom_line() +
  facet_wrap(~phi, scales = "free_y") +
  labs(
    title = "Simulated AR(1) Series for Different Values of phi_1",
    x = "Time",
    y = "y"
  )

As the value of \(\phi_1\) changes, the behavior of the AR(1) series changes. When \(\phi_1\) is small and positive, such as 0.2, the series looks more random because the dependence between nearby observations is weak. When \(\phi_1\) becomes larger and positive, such as 0.6 or 0.9, the series becomes smoother and shows stronger persistence. High values are more likely to be followed by high values, and low values are more likely to be followed by low values. When \(\phi_1\) is close to 1, the series tends to stay in the same direction for longer periods and changes more slowly. When \(\phi_1\) is negative, such as -0.8, the series tends to alternate more from one time point to the next, which creates a zig-zag pattern. Overall, larger positive values of \(\phi_1\) create stronger persistence, while negative values create more oscillation.

Write your own code to generate data from an MA(1) model with θ1 = 0.6 and σ^2 = 1.

# generate 100 random normal errors
e <- rnorm(100)

# create an empty vector for the MA(1) series
y <- numeric(100)

# set the first value equal to the first error term
y[1] <- e[1]

# simulate the MA(1)
for (i in 2:100) {
  y[i] <- e[i] + 0.6 * e[i - 1]
}

# put the simulated data into a tsibble
sim_ma1 <- tibble(
  idx = 1:100,
  y = y
) %>%
  as_tsibble(index = idx)

head(sim_ma1)

## # A tsibble: 6 x 2 [1]
##     idx      y
##   <int>  <dbl>
## 1     1 -0.287
## 2     2 -1.76 
## 3     3  0.624
## 4     4  1.85 
## 5     5  1.24 
## 6     6  0.226

# plot the simulated MA(1) series
sim_ma1 |>
  autoplot(y) +
  labs(
    title = "Simulated MA(1) Series",
    subtitle = expression(theta[1] == 0.6 ~ "," ~ sigma^2 == 1),
    x = "Time",
    y = "y"
  )

The data were generated from an MA(1) model with \(\theta_1 = 0.6\) and \(\sigma^2 = 1\). In this model, each observation depends on the current error term and the previous error term. Since the coefficient is positive, nearby observations tend to move together over short periods. The series still looks mostly random, but it shows some short run dependence rather than behaving like pure white noise.

Produce a time plot for the series. How does the plot change as you change θ1?

# choose several values of theta_1 to compare
theta_values <- c(-0.8, 0.2, 0.6, 0.9)

# create an empty list to store each simulated series
sim_list <- list()

# simulate an MA(1) series for each theta_1 value
for (j in seq_along(theta_values)) {
  
  # create random errors
  e <- rnorm(100)
  
  # create an empty series
  y <- numeric(100)
  
  # first value uses only the current error
  y[1] <- e[1]
  
  # simulate the MA(1) process
  for (i in 2:100) {
    y[i] <- e[i] + theta_values[j] * e[i - 1]
    
  }
  
  # store results in a tibble
  sim_list[[j]] <- tibble(
    idx = 1:100,
    y = y,
    theta = as.factor(theta_values[j])
  )
}

# combine all simulated series
sim_all <- bind_rows(sim_list) %>%
  as_tsibble(index = idx, key = theta)

# plot the simulated MA(1) series
sim_all %>%
  ggplot(aes(x = idx, y = y)) +
  geom_line() +
  facet_wrap(~theta, scales = "free_y") +
  labs(
    title = "Simulated MA(1) Series for Different Values of theta_1",
    x = "Time",
    y = "y"
  )

As the value of \(\theta_1\) changes, the overall series still looks like short term random fluctuation, but the dependence between nearby observations changes. When \(\theta_1\) is close to 0, the series looks more like white noise because each value depends mostly on the current error term. As \(\theta_1\) becomes larger in the positive direction, nearby observations tend to move together more, so the series looks slightly smoother over short stretches. When \(\theta_1\) is negative, consecutive values are more likely to move in opposite directions, so the series looks more jagged or alternating. Unlike the AR(1) model, changing \(\theta_1\) in an MA(1) model mainly affects the short term pattern rather than creating long lasting persistence.

e, Generate data from an ARMA(1,1) model with \(\phi_1 = 0.6\), \(\theta_1 = 0.6\), and \(\sigma^2 = 1\).

# create 100 random normal errors
e<- rnorm(100)

# create an empty vector for the ARMA(1,1) series
y <- numeric(100)

# set the starting value
y[1] <- 0

# simulate the ARMA(1, 1) process:
for (i in 2:100) {
  y[i] <- 0.6 * y[i - 1] + e[i] + 0.6 * e[i - 1]
}

# put the simulated data into a tsibble
sim_arma11 <- tibble(
  idx = 1:100,
  y = y
) %>%
  as_tsibble(index = idx)

head(sim_arma11)

## # A tsibble: 6 x 2 [1]
##     idx        y
##   <int>    <dbl>
## 1     1  0      
## 2     2  0.00474
## 3     3 -0.678  
## 4     4 -1.88   
## 5     5 -4.23   
## 6     6 -4.35

# plot the simulated ARMA(1,1) series
sim_arma11 |>
  autoplot(y) +
  labs(
    title = "Simulated ARMA(1,1) Series",
    subtitle = expression(phi[1] == 0.6 ~ "," ~ theta[1] == 0.6 ~ "," ~ sigma^2 == 1),
    x = "Time",
    y = "y"
  )

The data were generated from an ARMA(1,1) model with \(\phi_1 = 0.6\), \(\theta_1 = 0.6\), and \(\sigma^2 = 1\). This means each observation depends on both the previous observation and the previous error term. Compared with a pure AR model or a pure MA model, the series shows a combination of persistence and short run dependence. In the time plot, the series moves around a roughly constant level, so it appears stationary.

Generate data from an AR(2) model with ϕ1 = -0.8, ϕ2 = 0.3 and σ^2 = 1.

# generate 100 random normal errors
e <- rnorm(100)

# create an empty vector for the AR(2) series
y <- numeric(100)

# set starting values
y[1] <- 0
y[2] <- 0

# simulate the AR(2) process
for (i in 3:100) {
  y[i] <- -0.8 * y[i - 1] + 0.3 * y[i - 2] + e[i]
}

# put the simulated data into a tsibble
sim_ar2 <- tibble(
  idx = 1:100,
  y = y
) %>%
  as_tsibble(index = idx)

head(sim_ar2)

## # A tsibble: 6 x 2 [1]
##     idx      y
##   <int>  <dbl>
## 1     1  0    
## 2     2  0    
## 3     3 -0.521
## 4     4  1.11 
## 5     5 -0.412
## 6     6 -0.588

# plot the simulated AR(2) series
sim_ar2 %>%
  autoplot(y) +
  labs(
    title = "Simulated AR(2) Series",
    subtitle = expression(phi[1] == -0.8 ~ "," ~ phi[2] == 0.3 ~ "," ~ sigma^2 == 1),
    x = "Time",
    y = "y"
  )

The data were generated from an AR(2) model with \(\phi_1 = -0.8\), \(\phi_2 = 0.3\), and \(\sigma^2 = 1\). The time plot shows that the series is not stationary because the oscillations become larger over time instead of staying around a constant level. This matches the note in the question that these parameter values produce a non stationary series.

Graph the latter two series and compare them.

# generate random errors for the ARMA(1,1) series
e1 <- rnorm(100)

# create empty vector for ARMA(1,1)
y1 <- numeric(100)
y1[1] <- 0

# simulate ARMA(1,1)
for (i in 2:100) {
  y1[i] <- 0.6 * y1[i - 1] + e1[i] + 0.6 * e1[i - 1]
}

# generate random errors for the AR(2) series
e2 <- rnorm(100)

# create empty vector for AR(2)
y2 <- numeric(100)
y2[1] <- 0
y2[2] <- 0

# simulate AR(2)
for (i in 3:100) {
 y2[i] <- -0.8 * y2[i - 1] + 0.3 * y2[i - 2] + e2[i] 
}

# combine both series into one tsibble
sim_compare <- bind_rows(
  tibble(idx = 1:100, y = y1, series = "ARMA(1,1)"),
  tibble(idx = 1:100, y = y2, series = "AR(2")
) %>%
  as_tsibble(index = idx, key = series)

# plot the two series
sim_compare %>%
  ggplot(aes(x = idx, y = y)) +
  geom_line() +
  facet_wrap(~series, scales = "free_y") +
  labs(
    title = "Comparison of ARMA(1,1) and AR(2) Series",
    x = "Time",
    y = "y"
  )

The ARMA(1,1) series looks more stable and fluctuates around a roughly constant mean, which is consistent with a stationary process. It shows short run dependence, but the movements remain fairly controlled over time. In contrast, the AR(2) series looks much less stable because the parameters produce a non stationary process. Its oscillations become larger over time, and the series does not settle around a constant mean. Overall, the ARMA(1,1) series appears more regular and stationary, while the AR(2) series appears more unstable.

9.7

Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970 - 2011. a. Use `ARIMA() to find an appropriate ARIMA model. What model was selected. Check that the residuals look like white noise. Plot forecasts for the next 10 periods.

# plot the original series
aus_airpassengers %>%
  autoplot(Passengers) +
  labs(
    title = "Australian Air Passengers",
    y = "Passengers in millions",
    x = "Year"
  )

# fit an automatic ARIMA model
fit <- aus_airpassengers %>%
  model(ARIMA(Passengers))

report(fit)

## Series: Passengers 
## Model: ARIMA(0,2,1) 
## 
## Coefficients:
##           ma1
##       -0.8963
## s.e.   0.0594
## 
## sigma^2 estimated as 4.308:  log likelihood=-97.02
## AIC=198.04   AICc=198.32   BIC=201.65

# check whether residuals look like white noise
fit %>%
  ggtime::gg_tsresiduals()

# Ljung-Box test for residual autocorrelation
augment(fit) %>%
  features(.innov, ljung_box, lag = 24, dof = 0)

## # A tibble: 1 × 3
##   .model            lb_stat lb_pvalue
##   <chr>               <dbl>     <dbl>
## 1 ARIMA(Passengers)    24.0     0.463

# forecast the next 10 periods
fc <- fit %>%
  forecast(h = 10)

# plot the forecasts 
fc %>%
  autoplot(aus_airpassengers) +
  labs(
    title = "Forecasts for Australian Air Passengers",
    y = "Passengers in millions",
    x = "Year"
  )

The ARIMA() function selected ARIMA(0,2,1) for the aus_airpassengers series. The residual plots suggest that the residuals behave like white noise because they are centered around zero and do not show any clear pattern over time. The ACF of the residuals also does not show strong significant spikes. The Ljung Box test gives a p value of 0.463, which is greater than 0.05, so there is no strong evidence of autocorrelation remaining in the residuals. This suggests that the model is adequate. The forecasts for the next 10 periods continue the upward movement in the series, with prediction intervals becoming wider over time.

Write the model in terms of the backshift operator.

Let \(B\) be the backshift operator, where \(By_t = y_{t-1}\). Since the chosen model is ARIMA(0,2,1), its backshift form is

\[ (1-B)^2 y_t = (1+\theta_1 B)\varepsilon_t \]

From the output, the estimated value of \(\theta_1\) is \(-0.8963\). Plugging that in gives

\[ (1-B)^2 y_t = (1 - 0.8963B)\varepsilon_t \]

This tells us that the data are differenced twice, and the differenced series follows an MA(1) process.

Plot forecasts from an ARIMA (0,1,0) model with drift and compare these to part a.

# fit the model from part a 
fit_a <- aus_airpassengers %>%
  model(auto = ARIMA(Passengers))

# fit ARIMA(0,1,0) with drift
fit_c <- aus_airpassengers %>%
  model(rw_drift = ARIMA(Passengers ~ pdq(0,1,0) + 1))

# forecast the next 10 periods
fc_a <- fit_a %>%
  forecast(h = 10)

fc_c <- fit_c %>%
  forecast(h = 10)

# plot forecast together
bind_rows(fc_a, fc_c) %>%
  autoplot(aus_airpassengers, level = NULL) +
  labs(
    title = "Forecast Comparison: Part a and ARIMA(0,1,0) with Drift",
    y = "Passengers in millions",
    x = "Year"
  )

The ARIMA(0,1,0) model with drift is a random walk with drift. Its forecasts continue forward in a roughly straight line, with prediction intervals widening over time. Compared with the model in part a, this model is simpler because it uses only one difference and a drift term. The forecasts are smoother and more linear than the forecasts from the automatically selected ARIMA model.

Plot forecasts from an ARIMA (2,1,2) model with drift and compare these to parts a and c. Remove the constant and see what happens?

# fit ARIMA(2,1,2) with a constant
fit_d_const <- aus_airpassengers %>%
  model(arima212_const = ARIMA(Passengers ~ pdq(2,1,2) + 1))

# fit ARIMA(2,1,2) without constant
fit_d_noconst <- aus_airpassengers %>%
  model(arima212_noconst = ARIMA(Passengers ~ pdq(2,1,2) + 0))

## Warning: 1 error encountered for arima212_noconst
## [1] non-stationary AR part from CSS

# forecast the next 10 periods
fc_d_const <- fit_d_const %>%
  forecast(h = 10)

fc_d_noconst <- fit_d_noconst %>%
  forecast(h = 10)

# compare part a, part c, and ARIMA(2,1,2) with constant
bind_rows(fc_a, fc_c, fc_d_const) %>%
  autoplot(aus_airpassengers, level = NULL) +
  labs(
    title = "Forecast Comparison: Parts a, c, and ARIMA(2,1,2) with Constant",
    y = "Passengers in millions",
    x = "Year"
  )

The ARIMA(2,1,2) model with a constant is more flexible than the models in parts a and c, so it may follow the past data more closely. However, because it has more parameters, it may also be less stable and may overfit the series. When I removed the constant, the ARIMA(2,1,2) model did not fit successfully and returned a non stationary AR warning. This suggests that the version without a constant is not appropriate for this series.

Plot forecasts from an ARIMA(0,2,1) model with constant. What happens?

# fit ARIMA(0,2,1) with a constant
fit_e <- aus_airpassengers %>%
  model(arima021_const = ARIMA(Passengers ~ pdq(0,2,1) + 1))

## Warning: Model specification induces a quadratic or higher order polynomial trend.  This
## is generally discouraged, consider removing the constant or reducing the number
## of differences.

# forecast the next 10 periods
fc_e <- fit_e %>%
  forecast(h = 10)

# plot forecasts
fc_e %>%
  autoplot(aus_airpassengers, level = NULL) +
  labs(
    title = "Forecasts from ARIMA(0,2,1) with a Constant",
    y = "Passengers (millions)",
    x = "Year"
  )

When a constant is included in an ARIMA(0,2,1) model, the forecasts show a strong curved pattern rather than a simple linear continuation. This happens because the model differences the series twice, so adding a constant after second differencing implies a quadratic trend in the original series. As a result, the forecasts can increase too quickly and may look unrealistic. This is why adding a constant to a model with second differencing is usually not a good idea.

9.8

For the United States GDP series (from global_economy):

# keep the United States GDP series
us_gdp <- global_economy %>%
  filter(Code == "USA")

if necessary, find a suitable Box-Cox transformation for the data;

# plot the original GDP series 
us_gdp %>%
  autoplot(GDP) +
  labs(
    title = "United States GDP",
    y = "GDP",
    x = "Year"
  )

# Use a log transformation because the variation increases as GDP gets larger
lambda <- 0

us_gdp %>%
  mutate(log_GDP = box_cox(GDP, lambda)) %>%
  autoplot(log_GDP) +
  labs(
    title = "Log Transformed United States GDP",
    y = "log(GDP)",
    x = "Year"
  )

The GDP series shows strong long run growth, and the variation gets larger as the level gets larger. Because of this, a log transformation is a reasonable choice. A log transformation is the same as a Box Cox transformation with lambda equal to 0, and it helps make the variance more stable.

fit a suitable ARIMA model to the transformed data using ARIMA();

# fit an automatic ARIMA model to the transformed series
fit_auto <- us_gdp %>%
  model(ARIMA(log(GDP)))

# show the selected model 
report(fit_auto)

## Series: GDP 
## Model: ARIMA(0,2,1) 
## Transformation: log(GDP) 
## 
## Coefficients:
##           ma1
##       -0.6353
## s.e.   0.1138
## 
## sigma^2 estimated as 0.0004278:  log likelihood=139.76
## AIC=-275.53   AICc=-275.3   BIC=-271.48

The automatic ARIMA model selected for the log transformed GDP series is ARIMA(0,2,1).

try some other plausible models by experimenting with the orders chosen;

# fit several plausible ARIMA models for comparison
fits <- us_gdp %>%
  model(
    auto = ARIMA(log(GDP)),
    arima010 = ARIMA(log(GDP) ~ pdq(0,1,0) + 1),
    arima011 = ARIMA(log(GDP) ~ pdq(0,1,1) + 1),
    arima110 = ARIMA(log(GDP) ~ pdq(1,1,0) + 1),
    arima111 = ARIMA(log(GDP) ~ pdq(1,1,1) + 1)
  )

# compare the models using information criteria
glance(fits)

## # A tibble: 5 × 9
##   Country       .model     sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots 
##   <fct>         <chr>       <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>   
## 1 United States auto     0.000428    140. -276. -275. -271. <cpl [0]> <cpl [1]>
## 2 United States arima010 0.000760    125. -246. -245. -242. <cpl [0]> <cpl [0]>
## 3 United States arima011 0.000552    134. -263. -262. -257. <cpl [0]> <cpl [1]>
## 4 United States arima110 0.000451    140. -275. -274. -268. <cpl [1]> <cpl [0]>
## 5 United States arima111 0.000414    143. -278. -278. -270. <cpl [1]> <cpl [1]>

From the models I tried, ARIMA(1,1,1) has the lowest AICc, so it looks like the best choice among these models.

choose what you think is the best model and check the residual diagnostics;

# Choose ARIMA(1,1,1) as the final model because it has the lowest AICc
best_fit <- us_gdp %>%
  model(ARIMA_log = ARIMA(log(GDP) ~ pdq(1,1,1) + 1))

# residual diagnostics
best_fit %>%
  ggtime::gg_tsresiduals()

# Ljung-Box test for remaining autocorrelation
augment(best_fit) %>%
  features(.innov, ljung_box, lag = 10, dof = 2)

## # A tibble: 1 × 4
##   Country       .model    lb_stat lb_pvalue
##   <fct>         <chr>       <dbl>     <dbl>
## 1 United States ARIMA_log    4.11     0.847

I chose ARIMA(1,1,1) as the final model because it has the lowest AICc among the models I compared. To check the model, I looked at the residual plots and the Ljung Box test. The residuals stay centered around zero, show no clear pattern over time, and the Ljung Box p value is not small, so the model appears to fit the data reasonably well.

produce forecasts of your fitted model. Do the forecasts look reasonable?

# forecast the next 10 years
fc_arima <- best_fit %>%
  forecast(h = 10)

# plot the forecasts
fc_arima %>%
  autoplot(us_gdp) +
  labs(
    title = "ARIMA Forecasts for United States GDP",
    y = "GDP",
    x = "Year"
  )

The forecasts look reasonable because they continue the long run upward movement in GDP and do not show sudden jumps. Since the model is fitted to log GDP, the forecast path looks smoother and more stable.

compare the results with what you would obtain using ETS() (with no transformation).

# fit an ETS model on the original GDP series with no transformation
fit_ets <- us_gdp %>%
  model(ETS_raw = ETS(GDP))

# forecast the next 10 years
fc_ets <- fit_ets %>%
  forecast(h = 10)

# plot both forecasts together
bind_rows(fc_arima, fc_ets) %>%
  autoplot(us_gdp, level = NULL) +
  labs(
    title = "Comparison of ARIMA and ETS Forecasts for United States GDP",
    y = "GDP",
    x = "Year"
  )

Both models give forecasts that continue the upward movement in GDP. The ARIMA model is fitted to log GDP, while the ETS model is fitted to the original GDP series. Because of that, the ARIMA forecast handles the changing variance better and looks smoother. The ETS forecast is still reasonable, but I would prefer the ARIMA model here because it fits the transformed data and matches the long run growth pattern well.

Data 624 Homework #6

Jayden Jiang

2026-03-17

9.1

9.2

9.3

9.5

9.6

9.7

9.8