Homework 6

Author

Naomi Buell

library(tidyverse)
library(fpp3)
library(scales)
library(ggplot2)
library(feasts)
library(patchwork)
library(latex2exp)

set.seed(9219)

9.1

Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.

Explain the differences among these figures. Do they all indicate that the data are white noise?

Left: ACF for a white noise series of 36 numbers. Middle: ACF for a white noise series of 360 numbers. Right: ACF for a white noise series of 1,000 numbers.

Since all data lies between the two blue lines, meaning the autocorrelations are not significantly different from zero, these appear to be white noise.

Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

The critical values are at different distances from the mean due to the size of the data; the larger the data, the tighter the range of critical values.

9.2

A classic example of a non-stationary series are stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

# get amazon closing prices
amazon <- gafa_stock |>
    filter(Symbol == "AMZN") |>
    select(Date, Close) |>
    # Re-index based on trading days
    mutate(day = row_number()) |>
    # Divide day by the number of trading days in a year to plot years since 2014-01-02
    mutate(years_since = (day - 1) / 252) |>
    update_tsibble(index = years_since, regular = TRUE)

# Get the minimum date in the tsibble
min_date <- min(amazon$Date)

# lot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF
amazon |>
    gg_tsdisplay(Close, plot_type = "partial") +
    labs(
        title = "Time plot, ACF, and PACF of Amazon stock price",
        y = "Closing price",
        x = paste("Years since", min_date)
    )

The time plot shows that the series is non-stationary because it has an obvious upward trend since 2014-01-02. The ACF shows a slow drop-off, which is a sign of a non-stationary series. The PACF plot shows a strong correlation with the first lag, which drops off quickly in subsequent lags. This is an indication that the data is non-stationary and should be differenced.

9.3

For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.

Turkish GDP from global_economy.

# Extract data of interest
turk_gdp <- global_economy |>
    filter(Country == "Turkey") |>
    select(GDP)

# Find lambda using the Box-Cox transformation
lambda <- turk_gdp |>
    features(GDP, features = guerrero) |>
    pull(lambda_guerrero)

# Transform the data using the Box-Cox transformation
turk_gdp_transformed <- turk_gdp |>
    mutate(GDP = box_cox(GDP, lambda))

The appropriate Box-Cox transformation uses λ = 0.1572187. Next, I find the order of differencing needed, if any, with a KPSS test:

\(H_0\): Data is stationary.
\(H_A\): Data is not stationary.

turk_gdp_transformed |>
    features(GDP, unitroot_kpss)

# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1      1.50        0.01

In this case, the p-value is shown as 0.01 (and therefore it may be smaller than that), indicating that the \(H_0\) is rejected. That is, the data are not stationary. We can difference the data, and apply the test again:

\(H_0\): Differenced data is stationary.
\(H_A\): Differenced data is not stationary.

# get optimal diffs
ndiffs <- turk_gdp_transformed |>
    features(GDP, unitroot_ndiffs) |>
    pull(ndiffs)

turk_gdp_diffed <- turk_gdp_transformed |>
    mutate(diff_gdp = difference(GDP, differences = ndiffs))

# test diffs
turk_gdp_diffed |>
    features(diff_gdp, unitroot_kpss)

# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1    0.0889         0.1

The p-value is reported as 0.1 (and so it could be larger than that), so we fail to reject the \(H_0\) that differenced data are stationary, using an order of difference of 1. I plot the transformed and differenced data below.

# Plot the transformed data
turk_gdp_diffed |>
    autoplot(diff_gdp) +
    labs(
        title = "Transformed, differenced Turkish GDP",
        x = "Year",
        y = ""
    ) +
    theme_classic()

The data appears stationary.

Accommodation takings in the state of Tasmania from aus_accommodation.

# Extract data of interest
tasm_acc <- aus_accommodation |>
    filter(State == "Tasmania") |>
    select(Takings)

# Find lambda using the Box-Cox transformation
lambda <- tasm_acc |>
    features(Takings, features = guerrero) |>
    pull(lambda_guerrero)

# Transform the data using the Box-Cox transformation
tasm_acc_transformed <- tasm_acc |>
    mutate(Takings = box_cox(Takings, lambda))

The appropriate Box-Cox transformation uses λ = 0.0018196. Next, I perform the KPSS test where

\(H_0\): Data is stationary.
\(H_A\): Data is not stationary.

tasm_acc_transformed |>
    features(Takings, unitroot_kpss)

# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1      1.84        0.01

Since the p-value is low (≤ 0.01), we reject the \(H_0\) that the data are stationary, and try again with differenced data:

\(H_0\): Seasonally differenced data is stationary.
\(H_A\): Seasonally differenced data is not stationary.

# get optimal diffs
nsdiffs <- tasm_acc_transformed |>
    features(Takings, unitroot_nsdiffs) |>
    pull(nsdiffs)

tasm_acc_sdiffed <- tasm_acc_transformed |>
    mutate(sdiff_take = difference(Takings, differences = 12 * nsdiffs))

# test diffs
tasm_acc_sdiffed |>
    features(sdiff_take, unitroot_kpss)

# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1    0.0658         0.1

The p-value is reported as 0.1 (and so it could be larger than that), so we fail to reject the \(H_0\) that the seasonally differenced data are stationary, using an order of seasonal difference of 1. I plot the transformed and differenced data below.

tasm_acc_sdiffed |>
    autoplot(sdiff_take) +
    labs(
        x = "Year",
        y = "",
        title = "Transformed, seasonally differenced Tasmanian takings"
    ) +
    theme_classic()

The data appears stationary.

9.5

For your retail data (from Exercise 7 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.

# get random id and pull series
random_id <- sample(aus_retail$`Series ID`, 1)

myseries <- aus_retail |>
    filter(`Series ID` == random_id)

# get name for labels
industry_name <- myseries |>
    pull(Industry) |>
    unique()

turnover <- myseries |> select(Turnover)

I transform the data and check if differencing is needed by testing:

\(H_0\): Data is stationary.
\(H_A\): Data is not stationary.

# Find lambda using the Box-Cox transformation
lambda <- turnover |>
    features(Turnover, features = guerrero) |>
    pull(lambda_guerrero)

# Transform the data using the Box-Cox transformation
turnover_transformed <- turnover |>
    mutate(Turnover = box_cox(Turnover, lambda))

turnover_transformed |>
    features(Turnover, unitroot_kpss)

# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1      7.38        0.01

Since the p-value is low (≤ 0.01), we reject the \(H_0\) that the data are stationary, and try again with differenced data:

\(H_0\): Seasonally differenced data is stationary.
\(H_A\): Seasonally differenced data is not stationary.

# Get seasonal diff
nsdiffs <- turnover_transformed |>
    features(Turnover, unitroot_nsdiffs) |>
    pull(nsdiffs)

print(paste("The optimal order of seasonal differences =", nsdiffs))

[1] "The optimal order of seasonal differences = 1"

turnover_sdiffed <- turnover_transformed |>
    mutate(sdiff_turn = difference(Turnover, 12 * nsdiffs))

# test diffs
turnover_sdiffed |>
    features(sdiff_turn, unitroot_kpss)

# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1    0.0838         0.1

# Get diff
ndiffs <- turnover_sdiffed |>
    features(sdiff_turn, unitroot_ndiffs) |>
    pull(ndiffs)

print(paste("The optimal order of differences =", ndiffs))

[1] "The optimal order of differences = 0"

The p-value is reported as 0.1 (and so it could be larger than that), so we fail to reject the \(H_0\) that seasonally differenced data are stationary, using an order of difference of 1. We do not need to difference the data beyond this seasonal difference. I plot the transformed and differenced data below.

turnover_sdiffed |>
    autoplot(sdiff_turn) +
    labs(
        x = "",
        title = "Retail turnover",
        subtitle = industry_name,
        y = "Monthly retail turnover, differenced"
    ) +
    theme_classic() +
    scale_x_yearmonth(breaks = "2 year", labels = label_date("%Y")) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

The data appears stationary.

9.6

Simulate and plot some data from simple ARIMA models.

Use the following R code to generate data from an AR(1) model with \(\phi_1\)=0.6 and σ^₂=1. The process starts with y₁=0.

I modify the code provided by the text for use as a function.

ar_1 <- function(phi_1) {
    # Generate AR(1) process
    y <- numeric(100)
    e <- rnorm(100)

    for (t in 2:100) {
        y[t] <- phi_1 * y[t - 1] + e[t]
    }

    # Create tsibble and plot
    sim <- tsibble(idx = seq_len(100), y = y, index = idx)

    plot <- sim |>
        autoplot(y) +
        labs(
            title = TeX(paste(
                "Time plot of AR(1) model with $\\phi_1$ = ",
                phi_1
            )),
            x = "Index",
            y = expression(y[t])
        ) +
        theme_minimal()

    return(plot)
}

Produce a time plot for the series. How does the plot change as you change \(\phi_1\)?

ar_1(.6)

We do not include c when generating the data from the AR(1) model above (c = 0). When \(\phi_1\)<0, as in the plot above, \(y_t\) tends to oscillate around the mean.

When I change \(\phi_1\)=0, \(y_t\) is equivalent to white noise:

ar_1(0)

When I change \(\phi_1\)=1, \(y_t\) is equivalent to a random walk:

ar_1(1)

Write your own code to generate data from an MA(1) model with \(\theta_1\)=0.6 and \({\sigma}^2\)=1.

ma_1 <- function(theta) {
    # Generate AR(1) process
    y <- numeric(100)
    e <- rnorm(100)

    for (t in 2:100) {
        y[t] <- e[t] + theta * e[t - 1]
    }

    # Create tsibble and plot
    sim <- tsibble(idx = seq_len(100), y = y, index = idx)

    plot <- sim |>
        autoplot(y) +
        labs(
            title = TeX(paste(
                "$\\theta_1$ = ",
                theta
            )),
            x = "Index",
            y = expression(y[t])
        ) +
        theme_minimal()

    return(plot)
}

Produce a time plot for the series. How does the plot change as you change \(\theta_1\)?

# Plot the time series
ma_1(.6)

As I increase \(\theta_1\), the plots appear to have less variability and may be slightly more centered around the mean:

plots <- list()
for (theta in c(-1, -0.6, 0, 0.6, 1)) {
    plots[[as.character(theta)]] <- ma_1(theta)
}
wrap_plots(plots) +
    plot_annotation(title = "Time plots of MA(1) with different θ1 values")

Generate data from an ARMA(1,1) model with \(\phi_1\) = 0.6, \(\theta_1\) = 0.6, and \({\sigma}^2\) = 1.

arma_1_1 <- function(phi, theta) {
    # Generate AR(1) process
    y <- numeric(100)
    e <- rnorm(100)

    for (t in 2:100) {
        y[t] <- phi * y[t - 1] + e[t] + theta * e[t - 1]
    }

    # Create tsibble and plot
    sim <- tsibble(idx = seq_len(100), y = y, index = idx)

    plot <- sim |>
        gg_tsdisplay(y, plot_type = "partial") +
        labs(
            title = "ARMA(1,1) Model",
            subtitle = TeX(paste(
                "$\\phi_1$ = ",
                phi,
                ", $\\theta_1$ = ",
                theta
            )),
            x = "Index",
            y = expression(y[t])
        )

    return(plot)
}

arma_1_1(.6, .6)

Generate data from an AR(2) model with \(\phi_1 = -0.8\), \(\phi_2 = 0.3\), and \({\sigma}^2 = 1\). (Note that these parameters will give a non-stationary series.)

ar_2 <- function(phi_1, phi_2) {
    # Generate AR(1) process
    y <- numeric(100)
    e <- rnorm(100)

    for (t in 3:100) {
        y[t] <- phi_1 * y[t - 1] + phi_2 * y[t - 2] + e[t]
    }

    # Create tsibble and plot
    sim <- tsibble(idx = seq_len(100), y = y, index = idx)

    plot <- sim |>
        gg_tsdisplay(y, plot_type = "partial") +
        labs(
            title = "AR(2) Model",
            subtitle = TeX(paste(
                "$\\phi_1$ = ",
                phi_1,
                ", $\\phi_2$ = ",
                phi_2
            )),
            x = "Index",
            y = expression(y[t])
        )

    return(plot)
}
ar_2(-0.8, 0.3)

Graph the latter two series and compare them.

Both are graphed above. The ARMA(1,1) model is relatively stationary, random and centered around 0, while the AR(2) model has an obvious pattern, oscilating around zero with an increasing magnitude with time.The ARMA(1,1) model’s autocorrelation decreases with no pattern, but the AR(2)’s oscillating autocorrelation suggests seasonality in the data generated. The PACF of the ARMA(1,1) model is positive at the first lag and truncated thereafter, while the PACF of the AR(2) shows only a negative first lag and no other values.

9.7

Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.

Use ARIMA() to find an appropriate ARIMA model. What model was selected. Check that the residuals look like white noise. Plot forecasts for the next 10 periods.

# Plot the data
aus_airpassengers |>
    autoplot() +
    labs(
        title = "Total annual air passengers in Australia, 1970-2016",
        y = "Passengers (in millions)",
        x = "Year"
    ) +
    theme_classic()

There is no evidence of changing variance, so we will not do a Box-Cox transformation.

To address the non-stationarity, we will take a difference of the data. The differenced data are shown below.

# get optimal diffs
ndiffs <- aus_airpassengers |>
    features(Passengers, unitroot_ndiffs) |>
    pull(ndiffs)

# Plot the differenced data
aus_airpassengers |>
    gg_tsdisplay(
        difference(Passengers, differences = ndiffs),
        plot_type = "partial"
    )

By taking an order of difference of 2, these now appear to be stationary.

fit <- aus_airpassengers |>
    model(ARIMA(Passengers))
report(fit)

Series: Passengers 
Model: ARIMA(0,2,1) 

Coefficients:
          ma1
      -0.8963
s.e.   0.0594

sigma^2 estimated as 4.308:  log likelihood=-97.02
AIC=198.04   AICc=198.32   BIC=201.65

# Get ma1 coefficient (theta)
theta <- tidy(fit) |>
    pull(estimate)

The model selected is an ARIMA(0,2,1) model.

fit |>
    gg_tsresiduals()

The ACF plot of the residuals from the ARIMA(0,2,1) model shows that all autocorrelations are within the threshold limits, indicating that the residuals are behaving like white noise.

I plot the forecast for the next 10 periods below.

fit |>
    forecast(h = 10) |>
    autoplot(aus_airpassengers) +
    labs(
        title = "Ten year forecast of annual air passengers in Australia",
        subtitle = "ARIMA(0,2,1)",
        y = "Passengers (in millions)",
        x = "Year"
    ) +
    theme_classic()

Write the model in terms of the backshift operator.

\[ (1 - B)^2 y_t = (1 + -0.896B)\epsilon_t \]

Plot forecasts from an ARIMA(0,1,0) model with drift and compare these to part a.

aus_airpassengers |>
    model(
        stepwise = ARIMA(Passengers),
        arima010 = ARIMA(Passengers ~ pdq(0, 1, 0))
    ) |>
    forecast(h = 10) |>
    autoplot(aus_airpassengers) +
    labs(
        title = "Ten year forecast of annual air passengers in Australia",
        y = "Passengers (in millions)",
        x = "Year"
    ) +
    theme_classic()

The model selected in the stepwise search process, ARIMA(0,2,1), forecasts higher annual air passengers than the ARIMA(0,1,0) model with drift.

Plot forecasts from an ARIMA(2,1,2) model with drift and compare these to parts a and c. Remove the constant and see what happens.

aus_airpassengers |>
    filter(Year < 2012) |> # Filtering data to prior to 2012 for numeric stability
    model(
        stepwise_a = ARIMA(Passengers),
        arima010_c = ARIMA(Passengers ~ pdq(0, 1, 0)),
        arima_212_d = ARIMA(Passengers ~ pdq(2, 1, 2)),
        arima212_no_cons_d = ARIMA(Passengers ~ 0 + pdq(2, 1, 2))
    ) |>
    forecast(h = 10) |>
    autoplot(aus_airpassengers, alpha = .5) +
    labs(
        title = "Ten-year forecast of annual air passengers in Australia after 2012",
        y = "Passengers (in millions)",
        x = "Year"
    ) +
    theme_classic()

Warning: 1 error encountered for arima212_no_cons_d
[1] non-stationary AR part from CSS

Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
-Inf

Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
-Inf

Warning: Removed 10 rows containing missing values or values outside the scale range
(`geom_line()`).

The forecasts from the ARIMA(2,1,2) model with drift are below the automatically selected ARIMA(0,2,1) and above the ARIMA(0,1,0) model forecasts. Without a constant, this model was not able to produce any forecasts. We get a warning message for a non-stationary autoregressive part.

Plot forecasts from an ARIMA(0,2,1) model with a constant. What happens?

aus_airpassengers |>
    model(
        stepwise_a = ARIMA(Passengers),
        arima212_cons_e = ARIMA(Passengers ~ 1 + pdq(0, 1, 1))
    ) |>
    forecast(h = 10) |>
    autoplot(aus_airpassengers) +
    labs(
        title = "Ten-year forecast of annual air passengers in Australia",
        y = "Passengers (in millions)",
        x = "Year"
    ) +
    theme_classic()

The ARIMA(0,2,1) model with a constant generates lower forecasts than the automatically selected ARIMA(0,2,1) model.

9.8

For the United States GDP series (from global_economy):

us_gdp <- global_economy |>
    filter(Country == "United States") |>
    select(GDP)

if necessary, find a suitable Box-Cox transformation for the data;

lambda <- us_gdp |>
    features(GDP, features = guerrero) |>
    pull(lambda_guerrero)

# Transform the data using the Box-Cox transformation
us_gdp_transformed <- us_gdp |>
    mutate(GDP = box_cox(GDP, lambda))

The appropriate Box-Cox transformation uses λ = 0.2819443.

fit a suitable ARIMA model to the transformed data using ARIMA()

us_gdp_transformed |>
    autoplot() +
    theme_classic()

To address the non-stationarity, as evidenced from the trend in the graph above, we will take a difference of the data. The differenced data are shown below.

# get optimal diffs
ndiffs <- us_gdp_transformed |>
    features(GDP, unitroot_ndiffs) |>
    pull(ndiffs)

# Difference gdp data
us_gdp_differenced <- us_gdp_transformed |>
    mutate(diff_gdp = difference(GDP, differences = ndiffs))

# Plot the differenced data
us_gdp_differenced |>
    gg_tsdisplay(
        diff_gdp,
        plot_type = "partial"
    )

# Test the differenced data for stationarity
us_gdp_differenced |>
    features(diff_gdp, unitroot_kpss)

# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1     0.208         0.1

us_gdp_differenced |> autoplot(diff_gdp) + theme_classic()

By taking an order of difference of 1, these now appear to be stationary, since the p-value is greater than 0.1, so we fail to reject the \(H_0\) that differenced data are stationary. The data now appears stationary in the graph above.

fit <- us_gdp_differenced |>
    model(ARIMA(diff_gdp))
report(fit)

Series: diff_gdp 
Model: ARIMA(1,0,0) w/ mean 

Coefficients:
         ar1  constant
      0.4586  118.1789
s.e.  0.1198    9.5047

sigma^2 estimated as 5381:  log likelihood=-325.32
AIC=656.65   AICc=657.09   BIC=662.83

The suitable model for the transformed US GDP data is ARIMA(1,0,0) w/ mean.

try some other plausible models by experimenting with the orders chosen;

Based on the patterns in the ACF and PACF of the differenced data graphed above, I try an ARIMA(p = 0, d = 0, q = 1) model since there is a significant spike at lag q in the ACF, but none beyond lag q, where q = 1. I also try another automated model selection that works harder than the default stepwise procedure to search a larger model space.

fit2 <- us_gdp_differenced |>
    model(
        stepwise = ARIMA(diff_gdp),
        search = ARIMA(diff_gdp, stepwise = FALSE),
        arima001 = ARIMA(diff_gdp ~ pdq(0, 0, 1))
    )

Both of the automated searches suggested the same model: ARIMA(1,0,0) w/ mean.

choose what you think is the best model and check the residual diagnostics;

report(fit2)

# A tibble: 3 × 8
  .model   sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots 
  <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>   
1 stepwise  5381.   -325.  657.  657.  663. <cpl [1]> <cpl [0]>
2 search    5381.   -325.  657.  657.  663. <cpl [1]> <cpl [0]>
3 arima001  5587.   -326.  659.  659.  665. <cpl [0]> <cpl [1]>

Based on all residual diagnostic measures reported above, the two automated selection processes select the best model, the ARIMA(1,0,0) w/ mean.

fit |> gg_tsresiduals()

The residuals graphed above appear to be white noise.

produce forecasts of your fitted model. Do the forecasts look reasonable?

# Fit ARIMA model (letting it handle differencing)
fit <- us_gdp |>
    model(ARIMA(box_cox(GDP, lambda))) |>
    forecast(h = 10) |>
    mutate(.mean = inv_box_cox(.mean, lambda)) |> # Invert Box-Cox transformation
    autoplot(us_gdp) +
    theme_classic() +
    labs(title = "10-year forecasts of US GDP")

Yes, they look reasonable.

compare the results with what you would obtain using ETS() (with no transformation).

us_gdp |>
    model(
        arima = ARIMA(box_cox(GDP, lambda)),
        ets = ETS(box_cox(GDP, lambda))
    ) |> # Letting ARIMA handle differencing
    forecast(h = 10) |>
    mutate(.mean = inv_box_cox(.mean, lambda)) |> # Invert Box-Cox transformation
    autoplot(us_gdp, alpha = .5) +
    theme_classic() +
    labs(title = "10-year forecasts of US GDP")

us_gdp |>
    model(
        arima = ARIMA(box_cox(GDP, lambda)),
        ets = ETS(box_cox(GDP, lambda))
    ) |>
    report()

# A tibble: 2 × 11
  .model       sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots    MSE   AMSE
  <chr>         <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>    <dbl>  <dbl>
1 arima  5479.          -325.  657.  657.  663. <cpl [1]> <cpl [0]>   NA     NA 
2 ets       0.0000350   -367.  744.  745.  754. <NULL>    <NULL>    5865. 21390.
# ℹ 1 more variable: MAE <dbl>

The ARIMA model forecasts are slightly higher than the ETS, but have much narrower CIs than the ETS. The AICc of the ARIMA suggests that it’s the better model.