DATA 624: Assignment 06

Author

Curtis Elsasser

Published

March 23, 2025

Setup

library(forecast)
library(fpp3)
library(zoo)

set.seed(314159)

Assignment 06

Do the exercises 9.1, 9.2, 9.3, 9.5, 9.6, 9.7, 9.8 in Hyndman. Please submit both the Rpubs link as well as your .rmd file.

Exercise 9.1

Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.

Figure 9.32: Left: ACF for a white noise series of 36 numbers. Middle: ACF for a white noise series of 360 numbers. Right: ACF for a white noise series of 1,000 numbers.
  a. Explain the differences among these figures. Do they all indicate that the data are white noise?

No, not all of them; they don’t suggest the same amount of correlation, as we can see from the ACF ranges. The ACF for the 36 random numbers shows a correlation range of roughly [-0.5, 0.5], while the ACF for the 360 shows less correlation, with a range of roughly [-0.1, 0.1], and the 1,000 random numbers show even less. The ACF for 36 random numbers does not suggest white noise; it actually suggests fairly strong correlation, which is not expected from a white noise series. The ACF for 360 random numbers suggests weak correlation, which is more consistent with white noise. The ACF for 1,000 random numbers shows even weaker correlation, which is what we would expect from a white noise series.

  b. Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

The critical values sit at different distances from the mean because of the way the critical threshold is calculated. The threshold is typically \(\pm 2/\sqrt{n}\), where \(n\) is the sample size, which for our three samples means that:

  • For a sample of 36 numbers, the critical threshold is \(\pm 2/\sqrt{36} = \pm 0.3333\).
  • For a sample of 360 numbers, the critical threshold is \(\pm 2/\sqrt{360} = \pm 0.1054\).
  • For a sample of 1,000 numbers, the critical threshold is \(\pm 2/\sqrt{1000} = \pm 0.0632\).

As the sample size increases, the critical threshold approaches zero.
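We can check these numbers directly in R (a trivial sketch; just the \(\pm 2/\sqrt{n}\) arithmetic):

# Critical thresholds (+/- 2/sqrt(n)) for the three sample sizes
n <- c(36, 360, 1000)
round(2 / sqrt(n), 4)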

The reason the autocorrelations differ, I believe, is effectively the law of large numbers. As the sample size increases, the sample mean and variance converge to the population mean and variance. But what is the population when it comes to the ACF and a uniform-distribution random number generator? The population is a theoretically infinite sequence of random numbers generated from a uniform distribution. So what does the ACF have to work with in a finite sample? Let’s imagine that we eliminate all of the numbers between a given lag. For a sample of 36 numbers, a lag of 20 is a little weird; beyond lag 18 I’m not sure what it’s looking at. But for lags 1 through 18, the ACF can have as few as 2 numbers to work with for a lag, and the chances of finding spurious positive or negative correlation are fairly high. Let’s look at the equation:

\[ \text{ACF}(k) = \frac{\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2} \]

Let’s let R generate two random numbers that we shall consider to be a single lag and calculate its ACF.

r <- rnorm(2)
rm <- mean(r)
print((r[1] - rm) * (r[2] - rm) / ((r[1] - rm)^2 + (r[2] - rm)^2))
[1] -0.5

Our ACF for lag 1 is -0.5. In fact, with only two numbers it will always be exactly -0.5: the two deviations from the mean are equal and opposite, so the numerator is \(-(r_1 - \bar{r})^2\) while the denominator is \(2(r_1 - \bar{r})^2\).

For a sample of 360 numbers, the ACF has 18 numbers to work with for every lag, and the odds of getting a correlated sequence of random numbers are much lower than for 36 numbers. For a sample of 1,000 numbers, the ACF has 50 numbers to work with for every lag, and the odds of a correlated sequence are lower still. So, as the sample size increases, the ACF converges to zero.
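A quick simulation makes the same point. This is only a sketch (plain rnorm() draws rather than the book’s exact series), but the largest autocorrelation should shrink roughly in step with the \(2/\sqrt{n}\) threshold as n grows.

# Sketch: largest absolute autocorrelation of a white noise series at each
# sample size, next to the 2/sqrt(n) critical threshold
sapply(c(36, 360, 1000), function(n) {
  x <- rnorm(n)
  c(n = n,
    max_acf = max(abs(acf(x, plot = FALSE)$acf[-1])),
    threshold = 2 / sqrt(n))
})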

Exercise 9.2

A classic example of a non-stationary series are stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

Let’s have a look at our data first. We are getting warnings below saying, “…the provided data has an irregular interval,” which I imagine is due to missing weekends and holidays. And the ACF plot is bonkers.

gafa_stock |>
  filter(Symbol == "AMZN") |>
  head(n = 10)
# A tsibble: 10 x 8 [!]
# Key:       Symbol [1]
   Symbol Date        Open  High   Low Close Adj_Close  Volume
   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl>
 1 AMZN   2014-01-02  399.  399.  394.  398.      398. 2137800
 2 AMZN   2014-01-03  398.  403.  396.  396.      396. 2210200
 3 AMZN   2014-01-06  396.  397   388.  394.      394. 3170600
 4 AMZN   2014-01-07  395.  398.  394.  398.      398. 1916000
 5 AMZN   2014-01-08  398.  403   396.  402.      402. 2316500
 6 AMZN   2014-01-09  404.  407.  398.  401.      401. 2103000
 7 AMZN   2014-01-10  403.  404.  394.  398.      398. 2679500
 8 AMZN   2014-01-13  398.  400.  388.  391.      391. 2844900
 9 AMZN   2014-01-14  392.  399.  391.  398.      398. 2340100
10 AMZN   2014-01-15  399.  399.  393.  396.      396. 2678300

Let’s experiment. Let’s create a parallel imputed timeseries and see if we get different results.

amzn <- gafa_stock |>
  filter(Symbol == "AMZN")

# Create a complete sequence of dates from the min to the max in the data
dates <- seq(min(amzn$Date), max(amzn$Date), by = "day")

# Merge with our amzn data to fill in missing dates
amzn <- merge(amzn, data.frame(Date = dates), all = TRUE)

# And now we shall fill missing values with LOCF (last observation carried forward)
amzn <- amzn |>
  arrange(Date) |>
  na.locf()

# Huh, that worked. But our dataset is a data.frame. Let's convert it.
amzn <- as_tsibble(amzn, index = Date)
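An alternative to imputing calendar days is to re-index the series by trading day, so the tsibble becomes regular without inventing weekend prices. A sketch (amzn_trading is just a name made up here; the day column is a row counter):

# Sketch: re-index the AMZN series by trading day instead of calendar date
amzn_trading <- gafa_stock |>
  filter(Symbol == "AMZN") |>
  mutate(day = row_number()) |>
  update_tsibble(index = day, regular = TRUE)

We’ll stick with the LOCF-imputed version below.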

Plot the original data

gafa_stock |>
  filter(Symbol == "AMZN") |>
  ggplot(aes(x = Date, y = Close)) +
  geom_line() +
  labs(
    title = "Daily Closing Prices for Amazon Stock", 
    x = "Date", 
    y = "Closing Price"
  )

Plot the ACF for the original data

gafa_stock |>
  filter(Symbol == "AMZN") |>
  ACF(Close) |>
  autoplot() +
  labs(
    title = "ACF for Daily Closing Prices for Amazon Stock"
  )
Warning: Provided data has an irregular interval, results should be treated
with caution. Computing ACF by observation.

That is bonkers. Either we have total correlation at every lag or something is awry. Let’s try plotting the ACF for our imputed data. Oh, wait! It’s not 1.0 across the board. It descends from 1.0 at lag 1 to ~0.85 at lag 30. To say the least, the ACF is well above the significance line at every lag. This suggests that the series is non-stationary and should be differenced.

Before we realized that the ACF wasn’t bonkers, we had pursued imputation to see if we would get different results. If there are differences between the non-imputed ACF and the imputed ACF, they are too minuscule for a plot to reveal.

amzn |>
  ACF(Close) |>
  autoplot() +
  labs(
    title = "ACF for Daily Closing Prices for Amazon Stock (Imputed)"
  )

The PACF was our first reassurance that the ACF was not bonkers. Let’s have a look at it for the original data.

gafa_stock |>
  filter(Symbol == "AMZN") |>
  PACF(Close) |>
  autoplot() +
  labs(
    title = "PACF for Daily Closing Prices for Amazon Stock"
  )
Warning: Provided data has an irregular interval, results should be treated
with caution. Computing ACF by observation.

Ahhhh, when we remove the intervening effects of the earlier lags, the plot reveals which lags are actually significant. The PACF also shows that the series is non-stationary and should be differenced.

And now the PACF with our imputed version:

amzn |>
  PACF(Close) |>
  autoplot() +
  labs(
    title = "PACF for Daily Closing Prices for Amazon Stock (Imputed)"
  )

We do get slightly different results with the imputed data, especially at lag 2. That makes sense: with LOCF, at least 3 consecutive days out of every 7 (Friday, Saturday and Sunday) have identical closing prices.

Exercise 9.3

For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.

  a. Turkish GDP from global_economy.
data <- global_economy |>
  filter(Code == "TUR")

Let’s have a look at him.

data |>
  ggplot(aes(x = Year, y = GDP)) +
  geom_line() +
  labs(
    title = "Turkish GDP",
    x = "Year",
    y = "GDP"
  )

He’s pretty exponential. Let’s give the guerrero method a try.

lambda <- data |>
  features(GDP, guerrero)
data <- data |>
  mutate(GDP = box_cox(GDP, lambda$lambda_guerrero))

Let’s see how we did.

data |>
  ggplot(aes(x = Year, y = GDP)) +
  geom_line() +
  labs(
    title = "Turkish GDP",
    x = "Year",
    y = "GDP"
  )

Not bad. There doesn’t look to be much rhyme or reason to his ups and downs, but there is a strong, linear trend. Let’s apply a first difference and see if we can make him stationary.

data <- data |>
  mutate(GDP = difference(GDP, lag = 1))

Let’s see how we did.

data |>
  filter(!is.na(GDP)) |>
  ggplot(aes(x = Year, y = GDP)) +
  geom_line() +
  labs(
    title = "Differenced Turkish GDP",
    x = "Year",
    y = "GDP"
  )

Looks like we are getting closer. Let’s check the ACF to see whether it passes the white noise test.

data |>
  ACF(GDP) |>
  autoplot() +
  labs(
    title = "ACF for Turkish GDP"
  )

We are below the significance line at every lag, but only just. Lag 4 is a little bothersome, but I don’t think any further differencing is going to reduce the ACF; it looks like we have a stationary series. We could try a seasonal difference, but it doesn’t make much sense here: the data is annual, and although the ACF shows some correlation at lag 4, I don’t think it’s enough to justify one.
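As a cross-check, we can ask the unit-root features how many ordinary differences they would recommend for the Box-Cox transformed series; my expectation, though it’s an assumption until run, is that unitroot_ndiffs() agrees that one is enough.

# Sketch: differences recommended for the Box-Cox transformed Turkish GDP
global_economy |>
  filter(Code == "TUR") |>
  mutate(GDP = box_cox(GDP, lambda$lambda_guerrero)) |>
  features(GDP, unitroot_ndiffs)

With that, let’s move on. There are a lot of questions below.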

  b. Accommodation takings in the state of Tasmania from aus_accommodation.
data <- aus_accommodation |>
  filter(State == "Tasmania")

Let’s see what he looks like.

data |>
  ggplot(aes(x = Date, y = Takings)) +
  geom_line() +
  labs(
    title = "Accommodation Takings in Tasmania"
  )

The trend is slightly sinusoidal, but there is a strong linear trend, and we don’t need an ACF plot to tell us that there is a very strong annual seasonal component. So, we need an ordinary difference to remove the trend and a seasonal difference to remove the seasonal pattern.

data <- data |>
  mutate(Takings = difference(Takings, lag = 1)) |>
  filter(!is.na(Takings))

Now let’s attempt to remove the seasonality. The data is quarterly, and the seasonality looks yearly, so let’s apply a seasonal difference with lag 4.

data <- data |>
  mutate(Takings = difference(Takings, lag = 4)) |>
  filter(!is.na(Takings))

Let’s see how the trend line looks now.

data |>
  ggplot(aes(x = Date, y = Takings)) +
  geom_line() +
  labs(
    title = "Differenced Accommodation Takings in Tasmania"
  )

That looks pretty good. Let’s look at the ACF.

data |>
  ACF(Takings) |>
  autoplot() +
  labs(
    title = "ACF for Differenced Accommodation Takings in Tasmania"
  )

Hmmm, it’s showing some correlation at lags 2 and 3? Let’s run a KPSS test to see if we are stationary.

data |>
  features(Takings, unitroot_kpss)
# A tibble: 1 × 3
  State    kpss_stat kpss_pvalue
  <chr>        <dbl>       <dbl>
1 Tasmania    0.0467         0.1

The KPSS statistic is small and the p-value is reported at its upper bound of 0.1, so we cannot reject the null hypothesis of stationarity: he says we are stationary.
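As a quick cross-check, the unit-root features can suggest differencing orders for the raw takings directly; whether they recommend exactly the one ordinary plus one seasonal difference we applied above is an assumption until run.

# Sketch: differencing orders suggested for the raw Tasmanian takings
aus_accommodation |>
  filter(State == "Tasmania") |>
  features(Takings, list(unitroot_nsdiffs, unitroot_ndiffs))

Let’s move on.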

  c. Monthly sales from souvenirs.

Let’s see what he looks like.

souvenirs |>
  ggplot(aes(x = Month, y = Sales)) +
  geom_line() +
  labs(
    title = "Sales from Souvenirs",
    x = "Month",
    y = "Sales"
  )

This one will benefit from a Box-Cox transformation for two reasons:

  1. It has a slight exponential trend.
  2. It has a strong seasonal component that is growing at an exponential rate.

Let’s apply the guerrero method to find a suitable Box-Cox transformation.

lambda <- souvenirs |>
  features(Sales, guerrero)
data <- souvenirs |>
  mutate(Sales = box_cox(Sales, lambda$lambda_guerrero))

Let’s have a look.

data |>
  ggplot(aes(x = Month, y = Sales)) +
  geom_line() +
  labs(
    title = "Sales from Souvenirs (Box-Cox)",
    x = "Month",
    y = "Sales"
  )

guerrero did a pretty good job, but there is still a trend. Let’s apply a first difference to see if we can level him out.

data <- data |>
  mutate(Sales = difference(Sales, lag = 1)) |>
  filter(!is.na(Sales))

One more look (hoping):

data |>
  ggplot(aes(x = Month, y = Sales)) +
  geom_line() +
  labs(
    title = "Differenced Sales from Souvenirs",
    x = "Month",
    y = "Sales"
  )

That looks much better! Now to tackle the seasonality. The data is monthly and the seasonality looks yearly, so let’s apply a seasonal difference with lag 12.

data <- data |>
  mutate(Sales = difference(Sales, lag = 12)) |>
  filter(!is.na(Sales))

Let’s plot an ACF and see how it looks.

data |>
  ACF(Sales) |>
  autoplot() +
  labs(
    title = "ACF for Differenced Sales from Souvenirs"
  )

Hmmm, there is a fair amount of correlation (~ -0.4) at lag 1. Once again, let’s run a KPSS test to see if we are stationary.

data |>
  features(Sales, unitroot_kpss)
# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1    0.0381         0.1

And KPSS says that we are stationary.
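One distinction worth keeping in mind: KPSS tests for stationarity, not white noise, so the lag 1 correlation we saw can coexist with a stationary verdict. A Ljung-Box test on the differenced series (a sketch) would quantify that leftover autocorrelation, and it may well reject white noise here; that’s fine, since that is exactly what the MA terms of an ARIMA model would be there to soak up.

# Sketch: portmanteau (Ljung-Box) test for leftover autocorrelation
# in the differenced souvenir sales
data |>
  features(Sales, ljung_box, lag = 24)

Let’s move on.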

Exercise 9.5

For your retail data (from Exercise 7 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.

Hey, I remember this dataset! Let’s see what it looks like.

data <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))

data |>
  ggplot(aes(x = Month, y = Turnover)) +
  geom_line() +
  labs(
    title = "Retail Turnover",
    x = "Month",
    y = "Turnover"
  )

Oh, this line looks a little reptilian. Its trend does have an upward curve, but it’s not strongly exponential: in the beginning it’s linear, then it levels off, then rises more quickly, and towards the end it fires its boosters and is off. Let’s give guerrero a try.

lambda <- data |>
  features(Turnover, guerrero)
print(lambda$lambda_guerrero)
[1] 0.08473322
data <- data |>
  mutate(Turnover = box_cox(Turnover, lambda$lambda_guerrero))

The lambda is close to zero (a near-log transformation), so guerrero also detected an exponential tendency. Let’s see how he did.

data |>
  ggplot(aes(x = Month, y = Turnover)) +
  geom_line() +
  labs(
    title = "Retail Turnover (Box-Cox)",
    x = "Month",
    y = "Turnover"
  )

Wow, guerrero did a good job of straightening out the line and making the variability constant. But we still have a linear trend. Let’s apply a first difference to see if we can get him stationary.

data <- data |>
  mutate(Turnover = difference(Turnover, lag = 1)) |>
  filter(!is.na(Turnover))
data |>
  ggplot(aes(x = Month, y = Turnover)) +
  geom_line() +
  labs(
    title = "Differenced Retail Turnover",
    x = "Month",
    y = "Turnover"
  )

It’s certainly level. The seasonal pattern has been somewhat obscured, but we know it’s still there. Let’s apply a seasonal difference to remove it.

data <- data |>
  mutate(Turnover = difference(Turnover, lag = 12)) |>
  filter(!is.na(Turnover))

Let’s look at it through KPSS

data |>
  features(Turnover, unitroot_kpss)
# A tibble: 1 × 4
  State           Industry       kpss_stat kpss_pvalue
  <chr>           <chr>              <dbl>       <dbl>
1 South Australia Food retailing    0.0114         0.1

The KPSS p-value is reported at its upper bound of 0.1, which means the null hypothesis of stationarity cannot be rejected. So, we are stationary.
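For completeness, we can also confirm that no further ordinary differencing is called for; I’d expect unitroot_ndiffs() to return 0 on the doubly differenced series, but that’s an assumption until run.

# Sketch: further differences suggested for the already differenced turnover
data |>
  features(Turnover, unitroot_ndiffs)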

Exercise 9.6

Simulate and plot some data from simple ARIMA models.

  a. Use the following R code to generate data from an AR(1) model with \(\phi_1 = 0.6\) and \(\sigma^2 = 1\). The process starts with \(y_1 = 0\).
run_sim_ar <- function(phi = 0.6, sigma_sq = 1, e = NULL) {
  y <- numeric(100)
  if(is.null(e))
    e <- rnorm(100, sd = sqrt(sigma_sq))
  for(i in 2:100)
    y[i] <- phi*y[i-1] + e[i]
  return(tsibble(idx = seq_len(100), y = y, index = idx))
}
  b. Produce a time plot for the series. How does the plot change as you change \(\phi_1\)?
# let's use the same random error for all series so that we isolate the effect of phi
e <- rnorm(100)
df = data.frame(
  idx = run_sim_ar()$idx,
  `phi=0.0` = run_sim_ar(phi = 0.0, e = e)$y,
  `phi=0.2` = run_sim_ar(phi = 0.2, e = e)$y,
  `phi=0.4` = run_sim_ar(phi = 0.4, e = e)$y,
  `phi=0.6` = run_sim_ar(phi = 0.6, e = e)$y,
  `phi=0.8` = run_sim_ar(phi = 0.8, e = e)$y,
  `phi=0.9` = run_sim_ar(phi = 0.9, e = e)$y
) 
df |>
  pivot_longer(cols = -idx, names_to = "phi", values_to = "y") |>
  ggplot(mapping = aes(x = idx, y = y, color = phi)) +
  geom_line() +
  labs(
    title = "AR(1) Model",
    x = "Time",
    y = "Value"
  ) +
  facet_wrap(~phi, ncol = 2)

The plot becomes increasingly biased as phi approaches 1.0. We used the same error component for each plot so that we could isolate the effect of phi. As phi approaches 1.0, the series leans more and more heavily on its previous value, so it wanders further from zero for longer stretches. The series with phi = 0.0 looks an awful lot like noise, while the series with phi = 0.9 drifts well into negative territory with this particular error draw. Let’s look at the means.

summary(df)
      idx            phi.0.0            phi.0.2           phi.0.4       
 Min.   :  1.00   Min.   :-2.30219   Min.   :-2.6750   Min.   :-3.0555  
 1st Qu.: 25.75   1st Qu.:-0.79987   1st Qu.:-0.7021   1st Qu.:-0.7894  
 Median : 50.50   Median :-0.02475   Median :-0.1176   Median :-0.1989  
 Mean   : 50.50   Mean   :-0.08163   Mean   :-0.1009   Mean   :-0.1305  
 3rd Qu.: 75.25   3rd Qu.: 0.53431   3rd Qu.: 0.5514   3rd Qu.: 0.5666  
 Max.   :100.00   Max.   : 2.22818   Max.   : 2.3886   Max.   : 2.5302  
    phi.0.6           phi.0.8           phi.0.9       
 Min.   :-3.6765   Min.   :-4.5943   Min.   :-6.1635  
 1st Qu.:-0.8831   1st Qu.:-1.1916   1st Qu.:-1.7654  
 Median :-0.1505   Median :-0.1036   Median :-0.2752  
 Mean   :-0.1830   Mean   :-0.3228   Mean   :-0.6222  
 3rd Qu.: 0.6326   3rd Qu.: 0.7204   3rd Qu.: 0.9378  
 Max.   : 2.8152   Max.   : 3.2870   Max.   : 3.3500  

And the means confirm that the series drifts disproportionately in the negative direction as phi increases.
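Another way to see the effect of phi is through the theoretical AR(1) variance, \(\sigma^2 / (1 - \phi_1^2)\), which blows up as phi approaches 1. Here is a sketch comparing it with the sample variances of our simulated series; they will only roughly agree, since the series are short and share one error draw.

# Sketch: theoretical AR(1) variance vs the sample variance of each series
phis <- c(0, 0.2, 0.4, 0.6, 0.8, 0.9)
rbind(
  theoretical = 1 / (1 - phis^2),
  simulated = sapply(df[, -1], var)
)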

  c. Write your own code to generate data from an MA(1) model with \(\theta_1 = 0.6\) and \(\sigma^2 = 1\).
run_sim_ma <- function(theta = 0.6, sigma_sq = 1, e = NULL) {
  y <- numeric(100)
  if(is.null(e))
    e <- rnorm(100, sd = sqrt(sigma_sq))
  for(i in 2:100)
    y[i] <- e[i] + theta*e[i-1]
  return(tsibble(idx = seq_len(100), y = y, index = idx))
}
  d. Produce a time plot for the series. How does the plot change as you change \(\theta_1\)?
# let's use the same random error for all series so that we isolate the effect of theta
e <- rnorm(100)
df = data.frame(
  idx = run_sim_ma()$idx,
  `theta=0.0` = run_sim_ma(theta = 0.0, e = e)$y,
  `theta=0.2` = run_sim_ma(theta = 0.2, e = e)$y,
  `theta=0.4` = run_sim_ma(theta = 0.4, e = e)$y,
  `theta=0.6` = run_sim_ma(theta = 0.6, e = e)$y,
  `theta=0.8` = run_sim_ma(theta = 0.8, e = e)$y,
  `theta=0.9` = run_sim_ma(theta = 0.9, e = e)$y
) 
df |>
  pivot_longer(cols = -idx, names_to = "theta", values_to = "y") |>
  ggplot(mapping = aes(x = idx, y = y, color = theta)) +
  geom_line() +
  labs(
    title = "MA(1) Model",
    x = "Time",
    y = "Value"
  ) +
  facet_wrap(~theta, ncol = 2)

Again, we used a common error sequence for all of the thetas so that we could isolate the effect of theta. I was about to say that it looks like an amplifier, but that’s not entirely correct. Or maybe it is correct and just not a very faithful amplifier. Let’s try superimposing the plots to see if that makes the distortion clearer.

df |>
  pivot_longer(cols = -idx, names_to = "theta", values_to = "y") |>
  ggplot(mapping = aes(x = idx, y = y, color = theta)) +
  geom_line() +
  labs(
    title = "MA(1) Model",
    x = "Time",
    y = "Value"
  )

Not only is it a very cool plot, but it does a good job of showing the distortion. It’s most pronounced at the peaks and valleys: the higher the theta, the higher the peaks and the lower the valleys.
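The distortion also has a clean theoretical fingerprint: for an MA(1), the only non-zero autocorrelation is at lag 1 and equals \(\theta_1 / (1 + \theta_1^2)\). A sketch comparing that with the sample lag 1 autocorrelations (again, only a rough match with 100 points):

# Sketch: theoretical MA(1) lag-1 autocorrelation vs the sample value
thetas <- c(0, 0.2, 0.4, 0.6, 0.8, 0.9)
rbind(
  theoretical = thetas / (1 + thetas^2),
  simulated = sapply(df[, -1], function(y) acf(y, plot = FALSE)$acf[2])
)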

  e. Generate data from an ARMA(1,1) model with \(\phi_1 = 0.6\), \(\theta_1 = 0.6\) and \(\sigma^2 = 1\).
run_sim_arma <- function(phi = 0.6, theta = 0.6, sigma_sq = 1, e = NULL) {
  y <- numeric(100)
  if(is.null(e))
    e <- rnorm(100, sd = sqrt(sigma_sq))
  for(i in 2:100)
    y[i] <- phi*y[i-1] + e[i] + theta*e[i-1]
  return(tsibble(idx = seq_len(100), y = y, index = idx))
}

This will be interesting to see. Let’s use the same trick of using the same error series for all of the plots so that we can isolate the effects of phi and theta.

e <- rnorm(100)
df = data.frame(
  idx = run_sim_arma()$idx,
  `phi=0.0, theta=0.0` = run_sim_arma(phi = 0.0, theta = 0.0, e = e)$y,
  `phi=0.2, theta=0.2` = run_sim_arma(phi = 0.2, theta = 0.2, e = e)$y,
  `phi=0.4, theta=0.4` = run_sim_arma(phi = 0.4, theta = 0.4, e = e)$y,
  `phi=0.6, theta=0.6` = run_sim_arma(phi = 0.6, theta = 0.6, e = e)$y,
  `phi=0.8, theta=0.8` = run_sim_arma(phi = 0.8, theta = 0.8, e = e)$y,
  `phi=1.0, theta=1.0` = run_sim_arma(phi = 1.0, theta = 1.0, e = e)$y
)
df |>
  pivot_longer(cols = -idx, names_to = "phi_theta", values_to = "y") |>
  ggplot(mapping = aes(x = idx, y = y, color = phi_theta)) +
  geom_line() +
  labs(
    title = "ARMA(1,1) Model",
    x = "Time",
    y = "Value"
  ) +
  facet_wrap(~phi_theta, ncol = 2)

The effects of phi are way more obvious than the effects of theta; phi steals the show. In particular, the phi = 1.0 panel has a unit root in its AR part, so it behaves like a random walk and wanders away from zero.

  f. Generate data from an AR(2) model with \(\phi_1 = 0.6\), \(\phi_2 = -0.8\) and \(\sigma^2 = 1\). (Note that these parameters will give a non-stationary series.)
run_sim_ar2 <- function(phi1 = 0.6, phi2 = -0.8, sigma_sq = 1, e = NULL) {
  y <- numeric(100)
  if(is.null(e))
    e <- rnorm(100, sd = sqrt(sigma_sq))
  for(i in 3:100) {
    y[i] <- phi1*y[i-1] + phi2*y[i-2] + e[i]
  }
  return(tsibble(idx = seq_len(100), y = y, index = idx))
}

Oh, we are plotting for question g. We won’t spoil the surprise. But we will have a peek using our scheme above. We shall use the following to determine the value of phi2: \(\phi_2=\frac{\phi_1}{2}\).

# We will use the same error component as we did for the ARMA(1,1) model
# Let's set phi2 = phi1/2.
df = data.frame(
  idx = run_sim_ar2()$idx,
  `phi1=0.0, phi2=0.0` = run_sim_ar2(phi1 = 0.0, phi2 = 0.0, e = e)$y,
  `phi1=0.2, phi2=0.1` = run_sim_ar2(phi1 = 0.2, phi2 = 0.1, e = e)$y,
  `phi1=0.4, phi2=0.2` = run_sim_ar2(phi1 = 0.4, phi2 = 0.2, e = e)$y,
  `phi1=0.6, phi2=0.3` = run_sim_ar2(phi1 = 0.6, phi2 = 0.3, e = e)$y,
  `phi1=0.8, phi2=0.4` = run_sim_ar2(phi1 = 0.8, phi2 = 0.4, e = e)$y
)
df |>
  pivot_longer(cols = -idx, names_to = "phi", values_to = "y") |>
  ggplot(mapping = aes(x = idx, y = y, color = phi)) +
  geom_line() +
  labs(
    title = "AR(2) Model (phi2 = phi1/2)",
    x = "Time",
    y = "Value"
  ) +
  facet_wrap(~phi, ncol = 2, scales = "free_y")

Pretty crazy. It doesn’t take much to push the formula over the edge (phi1 = 0.8, phi2 = 0.4), which makes sense: one of the AR(2) stationarity conditions is \(\phi_1 + \phi_2 < 1\), and 0.8 + 0.4 = 1.2 violates it.
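We can confirm this from the roots of the AR polynomial \(1 - \phi_1 z - \phi_2 z^2\): the process is stationary only if every root lies outside the unit circle. A sketch (check_ar2 is just a helper name made up here):

# Sketch: stationarity check via the roots of the AR(2) polynomial
check_ar2 <- function(phi1, phi2) {
  all(Mod(polyroot(c(1, -phi1, -phi2))) > 1)
}
check_ar2(0.6, 0.3)  # TRUE: stationary, the well-behaved panel
check_ar2(0.8, 0.4)  # FALSE: a root inside the unit circle, the panel that blows up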

  g. Graph the latter two series and compare them.
e <- rnorm(100)
df = data.frame(
  idx = run_sim_ar2()$idx,
  e = e,
  `AR(2)` = run_sim_ar2(phi1 = 0.6, phi2 = -0.8, e = e)$y,
  `ARMA(1,1)` = run_sim_arma(phi = 0.6, theta = 0.6, e = e)$y
)
df |>
  pivot_longer(cols = -idx, names_to = "name", values_to = "y") |>
  ggplot(mapping = aes(x = idx, y = y, color = name)) +
  geom_area(mapping = aes(fill = name), alpha = 0.2) +
  labs(
    title = "AR(2) vs ARMA(1,1)",
    x = "Time",
    y = "Value"
  )

summary(df)
      idx               e                AR.2.            ARMA.1.1.      
 Min.   :  1.00   Min.   :-2.16953   Min.   :-3.10514   Min.   :-3.3733  
 1st Qu.: 25.75   1st Qu.:-0.66048   1st Qu.:-1.10201   1st Qu.:-1.1667  
 Median : 50.50   Median :-0.10414   Median :-0.20388   Median :-0.4256  
 Mean   : 50.50   Mean   :-0.09002   Mean   :-0.06205   Mean   :-0.4002  
 3rd Qu.: 75.25   3rd Qu.: 0.37771   3rd Qu.: 1.12483   3rd Qu.: 0.4540  
 Max.   :100.00   Max.   : 1.85255   Max.   : 2.64083   Max.   : 1.8936  

Once again the AR series is biased, but not as much as it was with AR(1). AR(2) pushes the peaks and valleys to the extremes, yet its mean is closer to 0 than e’s, even though its max and min are further from 0 than e’s. ARMA(1,1), by contrast, has a smoothing effect around 0 but ends up with a mean further below 0.

Exercise 9.7

Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.

aus_airpassengers |>
  ggplot(mapping = aes(x = Year, y = Passengers)) +
  geom_line() +
  labs(
    title = "Total Number of Passengers from Australian Air Carriers",
    x = "Year",
    y = "Passengers (millions)"
  )

  a. Use ARIMA() to find an appropriate ARIMA model. What model was selected? Check that the residuals look like white noise. Plot forecasts for the next 10 periods.
aus_airpassengers |>
  model(ARIMA(Passengers ~ pdq(,,) + PDQ(,,)))
# A mable: 1 x 1
  `ARIMA(Passengers ~ pdq(, , ) + PDQ(, , ))`
                                      <model>
1                              <ARIMA(0,2,1)>

The model selected was ARIMA(0,2,1). Let’s check the residuals.

aus_airpassengers |>
  model(ARIMA(Passengers ~ pdq(0,2,1) + PDQ(0,0,0))) |>
  gg_tsresiduals()

The distribution of the residuals is approximately normal with a slight lean to the left. The ACF shows lags that are well within the significance lines.
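As a numeric complement to the residual plots, we can run a Ljung-Box test on the innovation residuals; I’d expect a comfortably large p-value if they really are white noise, though that’s an assumption until run.

# Sketch: Ljung-Box test on the innovation residuals of the ARIMA(0,2,1) fit
aus_airpassengers |>
  model(ARIMA(Passengers ~ pdq(0,2,1) + PDQ(0,0,0))) |>
  augment() |>
  features(.innov, ljung_box, lag = 10, dof = 1)

With that, we shall proceed with the forecast.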

fit <- aus_airpassengers |>
  model(ARIMA(Passengers ~ pdq(0,2,1) + PDQ(0,0,0)))

fit |>
  report() |>
  forecast(h = 10) |>
  autoplot(aus_airpassengers) +
  labs(
    title = "Forecast for Total Number of Passengers from Australian Air Carriers",
    x = "Year",
    y = "Passengers (millions)"
  )
Series: Passengers 
Model: ARIMA(0,2,1) 

Coefficients:
          ma1
      -0.8963
s.e.   0.0594

sigma^2 estimated as 4.308:  log likelihood=-97.02
AIC=198.04   AICc=198.32   BIC=201.65

  b. Write the model in terms of the backshift operator.

Let’s recall that \(By_{t} = y_{t-1}\). Now let’s think. The model we found was ARIMA(0,2,1) with no seasonal component. This means we have second order differencing and a first order moving average.

First order differencing may be written as: \[y_t^{'} = y_t - y_{t-1} = (1 - B)y_t\]

Second order differencing may be written as: \[y_t^{''} = y_t - 2y_{t-1} + y_{t-2} = (1 - B)^2y_t\]

The first order moving average applies to the errors of the differenced series: \[y_t^{''} = e_t + \theta_1 e_{t-1} = (1 + \theta_1 B)e_t\]

Putting the pieces together, ARIMA(0,2,1) in backshift notation is: \[(1 - B)^2 y_t = (1 + \theta_1 B)e_t\]

  c. Plot forecasts from an ARIMA(0,1,0) model with drift and compare these to part a.

I struggled with drift. I tried to force it in a number of ways, none of which worked, and yet drift is included in the fit below. The reason, I believe, is that when the constant is left unspecified, ARIMA() decides automatically whether to include it, and with one order of differencing an included constant shows up as drift.

aus_airpassengers |>
  model(ARIMA(Passengers ~ pdq(0,1,0) + PDQ(0,0,0))) |>
  report() |>
  forecast(h = 10) |>
  autoplot(aus_airpassengers) +
  labs(
    title = "Forecast for Passengers from Australian Air Carriers (ARIMA(0,1,0) with Drift)",
    x = "Year",
    y = "Passengers (millions)"
  )
Series: Passengers 
Model: ARIMA(0,1,0) w/ drift 

Coefficients:
      constant
        1.4191
s.e.    0.3014

sigma^2 estimated as 4.271:  log likelihood=-98.16
AIC=200.31   AICc=200.59   BIC=203.97

The AIC is almost identical. The prediction interval for ARIMA(0,2,1) is conical, while the prediction interval for ARIMA(0,1,0) is somewhat bell shaped. The ARIMA(0,2,1) model has a slightly more pronounced upward trend than the ARIMA(0,1,0) model, and the mouth of its prediction interval is narrower than ARIMA(0,1,0)’s.

  d. Plot forecasts from an ARIMA(2,1,2) model with drift and compare these to parts a and c. Remove the constant and see what happens.
aus_airpassengers |>
  model(ARIMA(Passengers ~ 1 + pdq(2,1,2) + PDQ(0,0,0))) |>
  report() |>
  forecast(h = 10) |>
  autoplot(aus_airpassengers) +
  labs(
    title = "Forecast for Passengers from Australian Air Carriers (ARIMA(2,1,2) with Drift)",
    x = "Year",
    y = "Passengers (millions)"
  )
Series: Passengers 
Model: ARIMA(2,1,2) w/ drift 

Coefficients:
          ar1      ar2     ma1     ma2  constant
      -0.5518  -0.7327  0.5895  1.0000    3.2216
s.e.   0.1684   0.1224  0.0916  0.1008    0.7224

sigma^2 estimated as 4.031:  log likelihood=-96.23
AIC=204.46   AICc=206.61   BIC=215.43

The AIC score is going up. The forecast appears to carry some of the noise from the training data. The prediction interval is the same shape as the ARIMA(0,1,0) model’s, only its outline is wavy due to the noise in the forecast, which I believe is coming from the MA(2) component.

Let’s try removing the constant, which should remove the drift.

aus_airpassengers |>
  model(ARIMA(Passengers ~ 0 + pdq(2,1,2) + PDQ(0,0,0)))
Warning: 1 error encountered for ARIMA(Passengers ~ 0 + pdq(2, 1, 2) + PDQ(0, 0, 0))
[1] non-stationary AR part from CSS
# A mable: 1 x 1
  `ARIMA(Passengers ~ 0 + pdq(2, 1, 2) + PDQ(0, 0, 0))`
                                                <model>
1                                          <NULL model>

What happened is that I encountered an error:

Warning: 1 error encountered for ARIMA(Passengers ~ 0 + pdq(2, 1, 2) + PDQ(0, 0, 0)) [1] non-stationary AR part from CSS

This means the model could not be estimated. Judging from the message, the preliminary (CSS) estimates of the AR(2) coefficients fall outside the stationarity region once the constant is removed, so fable returns a NULL model. We’ll come back to this if we have time.

  e. Plot forecasts from an ARIMA(0,2,1) model with a constant. What happens?
aus_airpassengers |>
  model(ARIMA(Passengers ~ 1 + pdq(0,1,2) + PDQ(0,0,0))) |>
  report() |>
  forecast(h = 10) |>
  autoplot(aus_airpassengers) +
  labs(
    title = "Forecast for Passengers from Australian Air Carriers (ARIMA(0,2,1) with Constant)",
    x = "Year",
    y = "Passengers (millions)"
  )
Series: Passengers 
Model: ARIMA(0,1,2) w/ drift 

Coefficients:
          ma1     ma2  constant
      -0.0321  0.0957    1.4176
s.e.   0.1466  0.1521    0.3180

sigma^2 estimated as 4.427:  log likelihood=-97.95
AIC=203.9   AICc=204.87   BIC=211.21

Ah, the noise component goes away. Wait, I thought it was the MA(2) component that was causing the noise, but the wavy forecast seems to have been coming from the AR(2) component. Perhaps the waviness wasn’t coming from noise at all.

Exercise 9.8

For the United States GDP series (from global_economy):

Let’s isolate the data

data <- global_economy |>
  filter(Code == "USA")

And now let’s have a look

data |>
  ggplot(aes(x = Year, y = GDP)) +
  geom_line() +
  labs(
    title = "United States GDP",
    x = "Year",
    y = "GDP"
  )

Oh my gosh, he’s a textbook example of exponential growth. Let’s apply the guerrero method to see what lambda he thinks our Box-Cox transformation should use.

  a. if necessary, find a suitable Box-Cox transformation for the data;
lambda <- data |>
  features(GDP, guerrero)
print(lambda$lambda_guerrero)
[1] 0.2819443
data <- data |>
  mutate(GDP = box_cox(GDP, lambda$lambda_guerrero))

Let’s see how we did.

data |>
  ggplot(aes(x = Year, y = GDP)) +
  geom_line() +
  labs(
    title = "United States GDP (Box-Cox)",
    x = "Year",
    y = "GDP"
  )

  b. fit a suitable ARIMA model to the transformed data using ARIMA();
data |>
  model(ARIMA(GDP ~ pdq(,,) + PDQ(0,0,0))) |>
  report()
Series: GDP 
Model: ARIMA(1,1,0) w/ drift 

Coefficients:
         ar1  constant
      0.4586  118.1822
s.e.  0.1198    9.5047

sigma^2 estimated as 5479:  log likelihood=-325.32
AIC=656.65   AICc=657.1   BIC=662.78
  c. try some other plausible models by experimenting with the orders chosen;

Let’s try the same model without drift

data |>
  model(ARIMA(GDP ~ 0 + pdq(1,1,0) + PDQ(0,0,0))) |>
  report()
Series: GDP 
Model: ARIMA(1,1,0) 

Coefficients:
         ar1
      0.9287
s.e.  0.0428

sigma^2 estimated as 7038:  log likelihood=-333.85
AIC=671.69   AICc=671.91   BIC=675.78

This model has a higher AIC and BIC than the ARIMA(1,1,0) with drift.

Let’s look at the residuals

data |>
  model(ARIMA(GDP ~ 0 + pdq(1,1,0) + PDQ(0,0,0))) |>
  gg_tsresiduals()

It appears to be slightly more normally distributed than ARIMA(1,1,0) with drift (see below), but the ACF shows that lag 1 is showing a fair amount of correlation. So, we shall stick with ARIMA(1,1,0) with drift for the time being.

data |>
  model(ARIMA(GDP ~ pdq(0,1,0) + PDQ(0,0,0))) |>
  report()
Series: GDP 
Model: ARIMA(0,1,0) w/ drift 

Coefficients:
      constant
      220.3489
s.e.   10.8051

sigma^2 estimated as 6774:  log likelihood=-331.77
AIC=667.53   AICc=667.76   BIC=671.62

His AIC, AICc and BIC are all higher than ARIMA(1,1,0) with drift’s, so the fit is worse across the board.

data |>
  model(ARIMA(GDP ~ pdq(1,1,1) + PDQ(0,0,0))) |>
  report()
Series: GDP 
Model: ARIMA(1,1,1) w/ drift 

Coefficients:
         ar1      ma1  constant
      0.4736  -0.0190  114.8547
s.e.  0.2851   0.3286    9.3486

sigma^2 estimated as 5580:  log likelihood=-325.32
AIC=658.64   AICc=659.41   BIC=666.82

This model’s AIC, AICc and BIC are also higher, but only marginally. Either way, the added complexity of the MA term doesn’t warrant selecting ARIMA(1,1,1) over ARIMA(1,1,0).
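Since we are comparing several candidates, it is tidier to fit them side by side and pull the information criteria into a single table; a sketch (the model names are just labels made up here):

# Sketch: compare the candidate ARIMA models in one table
data |>
  model(
    ar1_drift    = ARIMA(GDP ~ pdq(1,1,0) + PDQ(0,0,0)),
    ar1_no_drift = ARIMA(GDP ~ 0 + pdq(1,1,0) + PDQ(0,0,0)),
    rw_drift     = ARIMA(GDP ~ pdq(0,1,0) + PDQ(0,0,0)),
    arma11_drift = ARIMA(GDP ~ pdq(1,1,1) + PDQ(0,0,0))
  ) |>
  glance() |>
  select(.model, AIC, AICc, BIC)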

  d. choose what you think is the best model and check the residual diagnostics;

We shall select the first model, ARIMA(1,1,0) with drift, as it has the lowest AIC, AICc and BIC. Let’s check the residuals.

data |>
  model(ARIMA(GDP ~ pdq(1,1,0) + PDQ(0,0,0))) |>
  gg_tsresiduals()

If the residuals can be considered normally distributed, they are heavily skewed to the left. The ACF shows that the residuals are well within the significance lines.

  e. produce forecasts of your fitted model. Do the forecasts look reasonable?
data |>
  model(ARIMA(GDP ~ pdq(1,1,0) + PDQ(0,0,0))) |>
  forecast(h = 10) |>
  autoplot(data) +
  labs(
    title = "Forecast for United States GDP (ARIMA(1,1,0) w/drift)",
    x = "Year",
    y = "GDP"
  )

The forecast looks very reasonable. Not only is it in line with the apparent trend, but the confidence interval is relatively narrow when compared to those we saw above.

  f. compare the results with what you would obtain using ETS() (with no transformation).
global_economy |>
  filter(Code == "USA") |>
  model(ETS(GDP ~ error("A") + trend("M") + season("N"))) |>
  report()
Series: GDP 
Model: ETS(A,M,N) 
  Smoothing parameters:
    alpha = 0.9998995 
    beta  = 0.485905 

  Initial states:
         l[0]     b[0]
 453235324559 1.101298

  sigma^2:  3.191607e+22

     AIC     AICc      BIC 
3246.770 3247.924 3257.072 

Oh my gosh, this model is off the charts. Its AIC, AICc and BIC are all roughly five times higher than the ARIMA(1,1,0) model’s, though in fairness the ARIMA was fit to Box-Cox transformed GDP, so the information criteria aren’t directly comparable. Let’s see what the residuals look like.

global_economy |>
  filter(Code == "USA") |>
  model(ETS(GDP ~ error("A") + trend("M") + season("N"))) |>
  gg_tsresiduals()

The residuals look good up to the year 2000, at which point they lose their noise-like behavior. The distribution looks more normal than ARIMA(1,1,0)’s residuals. The ACF is borderline: lags 3 and 7 both marginally exceed the critical limit.
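Given that the information criteria aren’t comparable across the transformed ARIMA and the untransformed ETS, a fairer comparison is training accuracy on the original GDP scale. A sketch: refit the ARIMA with the Box-Cox transformation inside the model formula so that fable back-transforms the fitted values automatically, then compare with accuracy().

# Sketch: compare ETS and ARIMA on the original GDP scale via training accuracy
global_economy |>
  filter(Code == "USA") |>
  model(
    arima = ARIMA(box_cox(GDP, lambda$lambda_guerrero) ~ pdq(1,1,0) + PDQ(0,0,0)),
    ets   = ETS(GDP ~ error("A") + trend("M") + season("N"))
  ) |>
  accuracy()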