Question 9.1

Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.

  1. Explain the differences among these figures. Do they all indicate that the data are white noise?

  2. Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

Question 9.1a Answer

In each of the plots, as the length of the white noise series increases, the bounds (dashed blue lines) narrow. The bounds mark the region where 95% of the autocorrelations of a white noise series should lie: \(\pm2/\sqrt{T}\) (Section 2.9 - White noise), where \(T\) is the length of the time series. Since none of the black spikes exceed the bounds, all of the graphs indicate that the respective series they plot are white noise.
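
As a quick arithmetic check of that formula (a minimal sketch, not part of the original figure), the bounds for the three series lengths are:

# 95% critical values for the ACF of white noise: +/- 2/sqrt(T)
round(2 / sqrt(c(36, 360, 1000)), 3)
## [1] 0.333 0.105 0.063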

Question 9.1b Answer

In section 2.9 - White noise, it says the following:

“For a white noise series, we expect 95% of the spikes in the ACF to lie within ±2/√T where T is the length of the time series.”

This is why, as the length of the time series increases, the critical values move closer to the mean of zero. Since each series is white noise, the sample autocorrelations reflect only estimation error, which shrinks as the series gets longer; this is also why the spikes differ across the figures even though all three series are white noise.

Question 9.2

A classic example of a non-stationary series are stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

Question 9.2 Answer

amzn <- gafa_stock %>%
  filter(Symbol == "AMZN") %>%
  select(Close, Date)

amzn %>%
  autoplot()
## Plot variable not specified, automatically selected `.vars = Close`

Stock prices typically behave like random walks. As we can see from the plot above, the series trends upward until about three-quarters of the way through 2018, after which it begins to trend downward.

amzn %>%
  gg_tsdisplay(Close, plot_type = 'partial')
## Warning: Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.

amzn |>
  features(Close, unitroot_kpss)
## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1      14.0        0.01

As we can see, we are dealing with non-stationary data. The ACF plot above decays very slowly, and the first lag of the PACF is very large while the remaining lags are small, which is the classic signature of a random walk. I've also added the KPSS test, which reports a p-value of 0.01 (the smallest value it reports, meaning the true p-value is 0.01 or less), so we reject the null hypothesis of stationarity and conclude that the series should be differenced.
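
As a follow-up sketch (output omitted here), unitroot_ndiffs() can confirm how many first differences are needed, and the KPSS test can be re-run on the differenced closing prices:

# How many first differences are required?
amzn |>
  features(Close, unitroot_ndiffs)

# Re-run the KPSS test on the differenced series
amzn |>
  mutate(diff_close = difference(Close)) |>
  features(diff_close, unitroot_kpss)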

Question 9.3

For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.

  1. Turkish GDP from global_economy

  2. Accommodation takings in the state of Tasmania from aus_accommodation.

  3. Monthly sales from souvenirs.

Question 9.3a Answer

turkish_gdp <- global_economy %>%
  filter(Country == "Turkey") %>%
  select(Year, GDP)

turkish_gdp %>%
  autoplot()
## Plot variable not specified, automatically selected `.vars = GDP`

turkish_gdp %>% 
  ACF(GDP) %>%
  autoplot()

As we can see from the time series plot above, we have an increasing trend. It also looks like the data has a bit of a power/exponential shape to it. So we will be applying a Box-Cox Transformation to the dataset. We can also see from the ACF plot that we have a gradual decrease, which is characteristic of non-stationary data (section 9.1 - Stationarity and differencing).

lambda <- turkish_gdp |>
  features(GDP, features = guerrero) |>
  pull(lambda_guerrero)

turkish_gdp <- turkish_gdp %>%
  mutate(transformed_gdp = box_cox(GDP, lambda))

turkish_gdp |>
  autoplot(transformed_gdp) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed Turkey GDP with $\\lambda$ = ",
         round(lambda,2))))

The output above still shows us an upward trend. Let’s try taking the first difference of the transformed GDP.

turkish_gdp <- turkish_gdp |>
  mutate(diff_transformed_gdp = difference(transformed_gdp))
turkish_gdp %>%
  autoplot(diff_transformed_gdp)
## Warning: Removed 1 row containing missing values (`geom_line()`).

turkish_gdp %>% 
  ACF(diff_transformed_gdp) %>%
  autoplot()

turkish_gdp %>%
  features(diff_transformed_gdp, ljung_box, lag = 10)
## # A tibble: 1 × 2
##   lb_stat lb_pvalue
##     <dbl>     <dbl>
## 1    5.84     0.829

It looks like by transforming and then taking the first difference, we have a white noise series, as shown in the plot above. Also, the p-value from the Ljung-Box Q statistic is 0.829, which suggests that the yearly change in transformed Turkish GDP is essentially a random amount, uncorrelated with that of previous years.
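
The choice of a single first difference can also be confirmed with the unit root test from section 9.1 (a quick sketch; given the plots above we would expect it to return 1, output omitted):

turkish_gdp |>
  features(transformed_gdp, unitroot_ndiffs)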

Question 9.3b Answer

tasmania_takings <- aus_accommodation %>%
  filter(State == "Tasmania") %>%
  select(Date, Takings)

tasmania_takings %>%
  autoplot(Takings)

tasmania_takings %>% 
  ACF(Takings) %>%
  autoplot()

So we have some pretty obvious seasonality in the data, along with a general upward trend. This is to be expected, since the tourism industry is seasonal in nature. The seasonality is also clear in the ACF plot: lags 4, 8, 12, and 16 show a slowly decaying pattern. We can also see from the time series plot that there is a subtle increase in the variance as the level increases. Therefore, a Box-Cox transformation is necessary.

lambda <- tasmania_takings |>
  features(Takings, features = guerrero) |>
  pull(lambda_guerrero)

tasmania_takings <- tasmania_takings %>%
  mutate(transformed_takings = box_cox(Takings, lambda))

tasmania_takings |>
  autoplot(transformed_takings) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed Tasmania Takings with $\\lambda$ = ",
         round(lambda,4))))

The plot above shows that the variance has been reduced by the transformation. We will now apply the appropriate differencing.

tasmania_takings |>
  features(transformed_takings, unitroot_nsdiffs)
## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1
tasmania_takings |>
  mutate(diff_transformed_takings = difference(transformed_takings, 4)) |>
  features(diff_transformed_takings, unitroot_ndiffs)
## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

unitroot_nsdiffs() returns 1, telling us that one seasonal difference is required. unitroot_ndiffs() then returns 0, telling us that a further first difference is NOT required after the seasonal difference.

tasmania_takings <- tasmania_takings |>
  mutate(diff_transformed_takings = difference(transformed_takings, 4))
tasmania_takings %>% 
  gg_tsdisplay(diff_transformed_takings, plot_type = 'partial')
## Warning: Removed 4 rows containing missing values (`geom_line()`).
## Warning: Removed 4 rows containing missing values (`geom_point()`).

The data appears stationary after the seasonal difference. The ACF has significant spikes at lags 3 and 4 but nothing after that, and the last significant spike in the PACF is at lag 4, so the seasonally differenced data seems to be stationary.

Question 9.3c Answer

souvenirs %>%
  gg_tsdisplay(plot_type = 'partial')
## Plot variable not specified, automatically selected `y = Sales`

We have two significant spikes in the ACF plot shown above, at lags 1 and 12, indicating that a seasonal difference is probably required. Additionally, the time series plot shows the variance of the series increasing with the level, which we should rectify first with a Box-Cox transformation.

lambda <- souvenirs |>
  features(Sales, features = guerrero) |>
  pull(lambda_guerrero)

souvenirs <- souvenirs %>%
  mutate(transformed_sales = box_cox(Sales, lambda))

souvenirs |>
  autoplot(transformed_sales) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed Sales with $\\lambda$ = ",
         round(lambda,4))))

The resulting plot shows stabilized variance. Now we can proceed to perform the differencing.

souvenirs |>
  features(transformed_sales, unitroot_nsdiffs)
## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1
souvenirs |>
  mutate(diff_transformed_sales = difference(transformed_sales, 12)) |>
  features(diff_transformed_sales, unitroot_ndiffs)
## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

Again, just like the previous part of this question, unitroot_nsdiffs indicates that one seasonal difference is required, while unitroot_ndiffs indicates that no further first difference is needed after the seasonal difference.

souvenirs <- souvenirs |>
  mutate(diff_transformed_sales = difference(transformed_sales, 12))
souvenirs %>% 
  gg_tsdisplay(diff_transformed_sales, plot_type = 'partial')
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).

The ACF and PACF spikes drop inside the dashed blue bounds after two lags, which suggests that the seasonally differenced and transformed Sales data from the souvenirs dataset is stationary.

Question 9.5

For your retail data (from Exercise 8 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.

Question 9.5 Answer

set.seed(23987)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1)) %>%
  select(Month, Turnover)
myseries %>%
  gg_tsdisplay(plot_type = 'partial')
## Plot variable not specified, automatically selected `y = Turnover`

The variance increases with the level of the series as shown in the plot above, so a Box-Cox transformation should be performed on the data before we proceed.

lambda <- myseries |>
  features(Turnover, features = guerrero) |>
  pull(lambda_guerrero)

myseries <- myseries %>%
  mutate(transformed_turnover = box_cox(Turnover, lambda))

myseries |>
  autoplot(transformed_turnover) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed Turnover with $\\lambda$ = ",
         round(lambda,4))))

After Box-Cox transforming the data, we can see that the variance of the series has stabilized. Now we will proceed with selecting the appropriate seasonal differencing.

myseries |>
  features(transformed_turnover, unitroot_nsdiffs)
## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1
myseries |>
  mutate(diff_transformed_turnover = difference(transformed_turnover, 12)) |>
  features(diff_transformed_turnover, unitroot_ndiffs)
## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

The results of the tests above indicate that a single seasonal difference is sufficient. The differenced and transformed series is plotted below.

myseries <- myseries |>
  mutate(diff_transformed_turnover = (difference(transformed_turnover, 12)))
myseries %>% 
  gg_tsdisplay(diff_transformed_turnover, plot_type = 'partial', lag_max = 36)
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).

As we can see from the plot above, the series shows broadly white-noise behavior, indicative of stationarity. The ACF spikes decrease rapidly and are within the dashed blue bounds by lag 7, suggesting that the transformed and seasonally differenced Turnover data from myseries is stationary.
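
To back the visual impression with a test, a Ljung-Box statistic could be computed on the differenced series (a sketch mirroring the check used in Question 9.3a; lag = 24 is a common choice for monthly data, and the output is omitted here):

myseries |>
  features(diff_transformed_turnover, ljung_box, lag = 24)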

Question 9.6

Simulate and plot some data from simple ARIMA models.

  1. Use the following R code to generate data from an AR(1) model with \(ϕ_1=0.6\) and \(σ^2 = 1\). The process starts with \(y_1=0\).
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
  2. Produce a time plot for the series. How does the plot change as you change \(ϕ_1\)?

  3. Write your own code to generate data from an MA(1) model with \(θ_1=0.6\) and \(σ^2=1\).

  4. Produce a time plot for the series. How does the plot change as you change \(θ_1\)?

  5. Generate data from an ARMA(1,1) model with \(ϕ_1=0.6\), \(θ_1=0.6\) and \(σ^2=1\).

  6. Generate data from an AR(2) model with \(ϕ_1=−0.8\), \(ϕ_2=0.3\) and \(σ^2=1\). (Note that these parameters will give a non-stationary series.)

  7. Graph the latter two series and compare them.

Question 9.6b Answer

sim_df <- data.frame(idx = seq(1, 100))
e <- rnorm(100)
for(phi in seq(0, 1, by = 0.1)){
  y <- numeric(100)
  for(i in 2:100)
    y[i] <- phi*y[i-1] + e[i]
  sim_df[ , paste0("phi=", phi)] <- y
}
sim_df %>%
  pivot_longer(-idx) %>%
  ggplot(aes(x = idx, y = value)) + 
  geom_line() +
  facet_grid(name ~ ., scales = "free_y")

The plot above shows that as we increase \(\phi_1\) from 0 to 1 in the AR(1) model, the series changes from pure white noise to something smoother and more persistent: swings away from the mean last longer and grow larger, and as \(\phi_1\) approaches 1 the series starts to resemble a random walk.
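
This is consistent with the theory: the stationary variance of an AR(1) process is \(\sigma^2/(1-\phi_1^2)\), which grows without bound as \(\phi_1 \to 1\). A quick check with \(\sigma^2 = 1\):

phi <- seq(0, 0.9, by = 0.1)
# Stationary variance of an AR(1) with sigma^2 = 1: 1 / (1 - phi^2)
round(1 / (1 - phi^2), 2)
## [1] 1.00 1.01 1.04 1.10 1.19 1.33 1.56 1.96 2.78 5.26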

Question 9.6c and d Answer

sim_df <- data.frame(idx = seq(1, 100))
e <- rnorm(100)
for(theta in seq(0, 1, by = 0.1)){
  y <- numeric(100)  # reset the series for each theta (otherwise values leak in from the previous chunk)
  for(i in 2:100)
    y[i] <- theta*e[i-1] + e[i]
  sim_df[ , paste0("theta=", theta)] <- y
}
sim_df %>%
  pivot_longer(-idx) %>%
  ggplot(aes(x = idx, y = value)) + 
  geom_line() +
  facet_grid(name ~ ., scales = "free_y")

The plot above shows that as we increase \(\theta_1\), neighbouring observations become more positively correlated, so the series looks smoother and less like pure white noise, while remaining stationary.
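
This matches the theory: the lag-one autocorrelation of an MA(1) process is \(\rho_1 = \theta_1/(1+\theta_1^2)\), which rises from 0 to 0.5 as \(\theta_1\) goes from 0 to 1. A quick check:

theta <- seq(0, 1, by = 0.1)
# Theoretical lag-1 autocorrelation of an MA(1): theta / (1 + theta^2)
round(theta / (1 + theta^2), 3)
## [1] 0.000 0.099 0.192 0.275 0.345 0.400 0.441 0.470 0.488 0.497 0.500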

Question 9.6e Answer

y <- numeric(100)
e <- rnorm(100)
theta = 0.6
phi = 0.6
for(i in 2:100)
  y[i] <- theta*e[i-1] + phi*y[i-1] + e[i]
sim_arma_11 <- tsibble(idx = seq_len(100), y = y, index = idx)
sim_arma_11 %>%
  autoplot()
## Plot variable not specified, automatically selected `.vars = y`

Question 9.6f Answer

y <- numeric(100)
e <- rnorm(100)
phi_1 = -0.8
phi_2 = 0.3
for(i in 3:100)
  y[i] <- phi_1*y[i-1] + phi_2*y[i-2] + e[i]
sim_ar_2 <- tsibble(idx = seq_len(100), y = y, index = idx)
sim_ar_2 %>%
  autoplot()
## Plot variable not specified, automatically selected `.vars = y`

Question 9.6g Answer

sim_arma_11 %>%
  gg_tsdisplay(y, plot_type='partial') +
  labs(title = "ARMA (1, 1) model")

sim_ar_2 %>%
  gg_tsdisplay(y, plot_type='partial') +
  labs(title = "AR(2) model")

The ARMA(1,1) series (top output) appears stationary: the line plot looks like correlated noise, and the ACF and PACF spikes die out quickly. The AR(2) series, however, is clearly non-stationary: it oscillates with ever-increasing amplitude, and its ACF shows large, slowly decaying alternating spikes. \(\phi_1\) is less than zero, and section 9.3 - Autoregressive models states that when \(\phi_1 < 0\), \(y_t\) tends to oscillate around the mean, which is why we see the alternating pattern in the plot above.
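
A sketch of the check: for an AR(2) model, stationarity requires \(-1 < \phi_2 < 1\), \(\phi_1 + \phi_2 < 1\) and \(\phi_2 - \phi_1 < 1\) (section 9.3). Here \(\phi_2 - \phi_1 = 0.3 - (-0.8) = 1.1 > 1\), so the conditions fail. Equivalently, the characteristic roots can be inspected directly (using phi_1 and phi_2 from the chunk above):

# For stationarity, the roots of 1 - phi_1*z - phi_2*z^2 must all lie
# outside the unit circle; here one root has modulus of about 0.93 < 1,
# confirming the non-stationarity.
abs(polyroot(c(1, -phi_1, -phi_2)))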

Question 9.7

Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.

  1. Use ARIMA() to find an appropriate ARIMA model. What model was selected. Check that the residuals look like white noise. Plot forecasts for the next 10 periods.

  2. Write the model in terms of the backshift operator.

  3. Plot forecasts from an ARIMA(0,1,0) model with drift and compare these to part a.

  4. Plot forecasts from an ARIMA(2,1,2) model with drift and compare these to parts a and c. Remove the constant and see what happens.

  5. Plot forecasts from an ARIMA(0,2,1) model with a constant. What happens?

Question 9.7a Answer

aus_airpassengers %>%
  gg_tsdisplay(plot_type = 'partial')
## Plot variable not specified, automatically selected `y = Passengers`

There is an obvious increasing trend in the data, as shown in the plot above. We also have slowly decaying spikes in the ACF plot, and a giant spike at the first lag of the PACF followed by much smaller spikes. All of these indicate that the data is non-stationary.

fit <- aus_airpassengers %>%
  model(ARIMA(Passengers))

report(fit)
## Series: Passengers 
## Model: ARIMA(0,2,1) 
## 
## Coefficients:
##           ma1
##       -0.8963
## s.e.   0.0594
## 
## sigma^2 estimated as 4.308:  log likelihood=-97.02
## AIC=198.04   AICc=198.32   BIC=201.65
fit |> forecast(h=10) |>
  autoplot(aus_airpassengers) +
  labs(y = "Air Transport Passengers (millions)", title = "Air Transport Passengers Australia
")

fit %>%
  gg_tsresiduals(lag_max = 16)

An ARIMA(0,2,1) was selected by the automatic ARIMA selection method. As we can see in the residual diagnostics above, the ACF no longer shows a slow decay in its spikes. We do, however, have a giant spike around 2010 in the residuals. This is probably a result of the state of the world economy at the time, and it is probably why the histogram of residuals is right-skewed.

Question 9.7b Answer

\((1-B)^2y_t = (1 + \theta_1B)e_t\), where \(\theta_1 = -0.8963\); the fitted ARIMA(0,2,1) model has no constant, so \(c = 0\).
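
Expanding the backshift polynomial gives the equivalent recursive form:

\(y_t = 2y_{t-1} - y_{t-2} + e_t + \theta_1 e_{t-1}\)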

Question 9.7c Answer

arima_010_fit <- aus_airpassengers %>%
  model(
    ARIMA(Passengers ~ pdq(0,1,0))
  )

report(arima_010_fit)
## Series: Passengers 
## Model: ARIMA(0,1,0) w/ drift 
## 
## Coefficients:
##       constant
##         1.4191
## s.e.    0.3014
## 
## sigma^2 estimated as 4.271:  log likelihood=-98.16
## AIC=200.31   AICc=200.59   BIC=203.97
arima_010_fit %>%
  forecast(h=10) %>%
  autoplot(aus_airpassengers)

While you can't really tell from the forecasts themselves whether this model outperforms the part a model, we can use the AIC and BIC. Since the AIC and BIC values for this model are larger than those for the part a model, the automatically selected ARIMA model is still the better-performing model.

Question 9.7d Answer

arima_212_fit <- aus_airpassengers %>%
  model(
    ARIMA(Passengers ~ 0 + pdq(2,1,2))
  )
## Warning: 1 error encountered for ARIMA(Passengers ~ 0 + pdq(2, 1, 2))
## [1] non-stationary AR part from CSS
report(arima_212_fit)
## Series: Passengers 
## Model: NULL model 
## NULL model
arima_212_fit %>%
  forecast(h=10) %>%
  autoplot(aus_airpassengers)
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning: Removed 10 rows containing missing values (`()`).

We don't get a forecast, and the report shows a NULL model: with the constant removed, estimation fails because the AR part is non-stationary (as the warning above indicates), so this specification gives us nothing to compare with.
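
For the first half of the question (the with-drift fit that the comparison calls for), the constant can be kept in the model; a sketch, assuming fable's formula syntax where including 1 forces the constant (output omitted):

arima_212_drift_fit <- aus_airpassengers %>%
  model(
    ARIMA(Passengers ~ 1 + pdq(2,1,2))
  )

arima_212_drift_fit %>%
  forecast(h=10) %>%
  autoplot(aus_airpassengers)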

Question 9.7e Answer

arima_021_fit <- aus_airpassengers %>%
  model(
    ARIMA(Passengers ~ pdq(0,2,1))
  )

report(arima_021_fit)
## Series: Passengers 
## Model: ARIMA(0,2,1) 
## 
## Coefficients:
##           ma1
##       -0.8963
## s.e.   0.0594
## 
## sigma^2 estimated as 4.308:  log likelihood=-97.02
## AIC=198.04   AICc=198.32   BIC=201.65
arima_021_fit %>%
  forecast(h=10) %>%
  autoplot(aus_airpassengers)

Because the code above does not force a constant, ARIMA() dropped it and refitted exactly the model selected automatically in part a, which is why all the reported values are identical.
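
To answer what the question actually asks, the constant has to be forced into the model (a sketch, again assuming the 1 + formula syntax). With \(d = 2\) and a constant, the long-term forecasts follow a quadratic trend, and fable should warn that this is generally discouraged:

arima_021_const_fit <- aus_airpassengers %>%
  model(
    ARIMA(Passengers ~ 1 + pdq(0,2,1))
  )

arima_021_const_fit %>%
  forecast(h=10) %>%
  autoplot(aus_airpassengers)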

Question 9.8

For the United States GDP series (from global_economy):

  1. if necessary, find a suitable Box-Cox transformation for the data;

  2. fit a suitable ARIMA model to the transformed data using ARIMA();

  3. try some other plausible models by experimenting with the orders chosen;

  4. choose what you think is the best model and check the residual diagnostics;

  5. produce forecasts of your fitted model. Do the forecasts look reasonable?

  6. compare the results with what you would obtain using ETS() (with no transformation).

Question 9.8a Answer

us_gdp <- global_economy %>%
  filter(Country == "United States") %>%
  select(GDP, Year)
lambda <- us_gdp |>
  features(GDP, features = guerrero) |>
  pull(lambda_guerrero)

us_gdp <- us_gdp %>%
  mutate(transformed_gdp = box_cox(GDP, lambda))

us_gdp |>
  autoplot(transformed_gdp) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed GDP with $\\lambda$ = ",
         round(lambda,4))))

Question 9.8b Answer

fit <- us_gdp %>%
  model(
    arima = ARIMA(transformed_gdp)
  )
report(fit)
## Series: transformed_gdp 
## Model: ARIMA(1,1,0) w/ drift 
## 
## Coefficients:
##          ar1  constant
##       0.4586  118.1822
## s.e.  0.1198    9.5047
## 
## sigma^2 estimated as 5479:  log likelihood=-325.32
## AIC=656.65   AICc=657.1   BIC=662.78
fit %>%
  gg_tsresiduals()

Question 9.8c Answer

If we use the unit root tests detailed in section 9.1, we can determine how many differences should be applied to the transformed data (the series is annual, so seasonal differencing does not apply). Below we try an ARIMA(2,1,1) with drift as a plausible alternative to the automatically selected model.

us_gdp %>%
  features(transformed_gdp, unitroot_ndiffs)
## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      1
us_gdp %>%
  gg_tsdisplay(GDP, plot_type = 'partial')

us_gdp_alt_fit <- us_gdp %>%
  model(
    ARIMA(transformed_gdp ~ pdq(2,1,1))
  )

report(us_gdp_alt_fit)
## Series: transformed_gdp 
## Model: ARIMA(2,1,1) w/ drift 
## 
## Coefficients:
##          ar1      ar2      ma1  constant
##       1.1661  -0.2792  -0.7356   24.3346
## s.e.  0.3418   0.2108   0.3077    2.5572
## 
## sigma^2 estimated as 5647:  log likelihood=-325.14
## AIC=660.29   AICc=661.46   BIC=670.5
us_gdp_alt_fit %>%
  gg_tsresiduals()

Question 9.8d Answer

Since the AIC and BIC values for the model in part c are larger than those for the part b model, the automatically selected ARIMA model is still the better-performing model. The residual ACF plots for both models show spikes within the dashed blue bounds.
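
The two candidates can also be compared side by side in a single mable (a sketch using glance(); output omitted):

us_gdp %>%
  model(
    arima_auto = ARIMA(transformed_gdp),
    arima_211  = ARIMA(transformed_gdp ~ pdq(2,1,1))
  ) %>%
  glance() %>%
  select(.model, AIC, AICc, BIC)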

Question 9.8e Answer

fit %>%
  forecast(h = 12) %>%
  autoplot(us_gdp)

The forecasts shown on the graph above seem reasonable and follow the overall trend of the original data.

Question 9.8f Answer

us_gdp %>%
  model(
    ETS(GDP)
  ) %>%
  forecast(h = 12) %>%
  autoplot(us_gdp)

As we can see from the graph above, the prediction intervals are much wider than those for the ARIMA forecasts in part e. Note that AICc values are not comparable between ARIMA and ETS models, so a proper ranking would need a test set or time series cross-validation.
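
A sketch of such a comparison, following the cross-validation approach from section 9.10 of the text (stretch_tsibble() builds expanding training windows; lambda comes from part a, and the output is omitted here):

us_gdp %>%
  stretch_tsibble(.init = 10) %>%
  model(
    ets   = ETS(GDP),
    arima = ARIMA(box_cox(GDP, lambda))
  ) %>%
  forecast(h = 1) %>%
  accuracy(us_gdp) %>%
  select(.model, RMSE, MAE)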