624 hw6

library(tsibble)

## Warning: package 'tsibble' was built under R version 4.4.3

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr

## 
## Attaching package: 'tsibble'

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

library(fpp3)

## Warning: package 'fpp3' was built under R version 4.4.3

## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.2 ──

## ✔ tibble      3.2.1     ✔ ggplot2     3.5.1
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.1     ✔ feasts      0.4.2
## ✔ lubridate   1.9.3     ✔ fable       0.5.0

## Warning: package 'dplyr' was built under R version 4.4.2

## Warning: package 'ggplot2' was built under R version 4.4.2

## Warning: package 'tsibbledata' was built under R version 4.4.3

## Warning: package 'feasts' was built under R version 4.4.3

## Warning: package 'fabletools' was built under R version 4.4.3

## Warning: package 'fable' was built under R version 4.4.3

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()     masks base::date()
## ✖ dplyr::filter()       masks stats::filter()
## ✖ tsibble::intersect()  masks base::intersect()
## ✖ lubridate::interval() masks tsibble::interval()
## ✖ dplyr::lag()          masks stats::lag()
## ✖ tsibble::setdiff()    masks base::setdiff()
## ✖ tsibble::union()      masks base::union()

9.1

Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.

Explain the differences among these figures. Do they all indicate that the data are white noise?

The figures show different critical values and ACF distributions. They all indicate that there is white noise since the spikes are within the critical values which are shown with the blue dashed lines.

Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

The critical values are at different distances since the length of the time series determines how close we are to the mean. The larger the length, the closer the critical values are to the mean. The ACF values will decrease along with critical values since the size of a random data et increases and probability of correlation decreases.

9.2

A classic example of a non-stationary series are stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

amazon_close <- gafa_stock |> 
  filter(Symbol == "AMZN") |>
  select(Date, Close)


amazon_close |> 
  gg_tsdisplay(Close, plot_type = 'partial')

## Warning: `gg_tsdisplay()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsdisplay()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
## Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.

The plots show that the series is non-stationary since the data shows a trend while a stationary series would not have a trend. There is an upward trend until 2018. The acf plot suggest that the series is non-stationary because the values are large and have a very slow decline while a stationary series acf would have a more rapid decline. The pacf plot also indicates a non-stationary series since it is sinusoidal.

9.3

For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.

a. Turkish GDP from global_economy.

# Filter and view data 

turkish_gdp <- global_economy |> 
  filter(Code == "TUR") |>
  select(Year, GDP)

turkish_gdp |> autoplot()

## Plot variable not specified, automatically selected `.vars = GDP`

# Box plot transformation
# Lambda
lambda_turk <- turkish_gdp |> 
  features(GDP, features = guerrero) |> 
  pull(lambda_guerrero)

# Box-plot 
turkish_gdp_bc <- turkish_gdp |>
  mutate(bc = box_cox(GDP, lambda_turk)) |>
  select(Year, bc)

print(lambda_turk)

## [1] 0.1572187

# Seasonal differencing test 
turkish_gdp_bc |>
  features(bc, unitroot_nsdiffs)

No seasonal differencing are noted so I will test of appropriate order of differencing next.

# test for order of differencing

turkish_gdp_bc |> 
  features(bc, unitroot_ndiffs)

The appropriate box-cox transformation has a lambda of 0.1572 and one order of differencing with no seasonal differencing.

# Plot transformed bc data

turkish_gdp_bc |>
  mutate(diff_bc = difference(bc)) |>
  autoplot(diff_bc)

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

b. Accommodation takings in the state of Tasmania from aus_accommodation.

# Select and plot data
tasmania_takings <- aus_accommodation |>
  filter(State == "Tasmania") |>
  select(Date, Takings)

tasmania_takings |>
  autoplot(Takings)

# Find lambda
lambda_tasmania <- tasmania_takings |>
  features(Takings, features = guerrero) |> 
  pull(lambda_guerrero)

# Box-cox transformation
tasmania_takings_bc <- tasmania_takings |>
  mutate(bc = box_cox(Takings, lambda_tasmania)) |>
  select(Date, bc)

print(lambda_tasmania)

## [1] 0.001819643

# Seasonal differencing test 
tasmania_takings_bc |>
  features(bc, unitroot_nsdiffs)

Seasonal differencing is suggested

# Apply seasonal differencing and test for order of differencing 

tasmania_takings_bc |>
  mutate(seas_diff = difference(bc, 4)) |>
  features(seas_diff, unitroot_ndiffs)

The appropriate box-cox transformation is a lambda of 0.0018 and one seasonal difference. Data is plotted below to make sure it looks stationary.

tasmania_takings_bc |>
  mutate(seas_diff = difference(bc, 4)) |>
  autoplot(seas_diff)

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

c. Monthly sales from souvenirs.

# Select and plot data 
monthly_souvenirs <- souvenirs

monthly_souvenirs |>
  autoplot(Sales)

# Find lambda
lambda_souvenirs <- monthly_souvenirs |>
  features(Sales, features = guerrero) |>
  pull(lambda_guerrero)

# Box-cox transformation
monthly_souvenirs_bc <- monthly_souvenirs |>
  mutate(bc = box_cox(Sales, lambda_souvenirs)) |>
  select(Month, bc)

print(lambda_souvenirs)

## [1] 0.002118221

# Seasonal differencing test
monthly_souvenirs_bc |> 
  features(bc, unitroot_nsdiffs)

Seasonal differencing is suggested

# Apply seasonal differencing and test for order of differencing
monthly_souvenirs_bc |>
  mutate(seas_diff = difference(bc, 12)) |>
  features(seas_diff, unitroot_ndiffs)

The appropriate box cox transformation has a lambda of 0.0021 and one seasonal difference. The data is plotted below to ensure data looks stationary.

monthly_souvenirs_bc |> 
  mutate(seas_diff = difference(bc, 12)) |>
  autoplot(seas_diff)

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

9.5

For your retail data (from Exercise 7 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.

# code from textbook 
set.seed(12345678)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

# select data and plot
myseries <- myseries |> 
  select(Month, Turnover)

myseries |>
  autoplot(Turnover)

The data does not appear to show constant variation so a boxcox transformation is appopriate.

# Find lambda
lambda_myseries <- myseries |>
  features(Turnover, features = guerrero) |>
  pull(lambda_guerrero)

# Box-cox transformation
myseries_bc <- myseries |> 
  mutate(bc = box_cox(Turnover, lambda_myseries)) |>
  select(Month, bc)

# Plot
myseries_bc |>
  autoplot(bc)

The data now appears to show constant variation.

# Test for seasonal differencing 
myseries_bc |>
  features(bc, unitroot_nsdiffs)

Seasonal differencing is suggested.

# Apply seasonal differencing and test for order of differencing

myseries_bc |>
  mutate(seas_diff = difference(bc, 12)) |>
  features(seas_diff, unitroot_ndiffs)

No additional differencing suggested. The data is plotted below to make sure data looks stationary.

myseries_bc |>
  mutate(seas_diff = difference(bc, 12)) |>
  autoplot(seas_diff)

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

9.6

Simulate and plot some data from simple ARIMA models.

a. Use the following R code to generate data from an AR(1) model with ϕ1=0.6 and σ2=1. The process starts with y1=0.

# Code provided by textbook 
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

b. Produce a time plot for the series. How does the plot change as you change ϕ1?

sim |> autoplot(y)

# Time plot after changing value of phi1 
y_1 <- numeric(100)
y_2 <- numeric(100)
y_3 <- numeric(100)
e <-rnorm(100)
for(i in 2:100) {
  y_1[i] <- 0.6*y_1[i-1] + e[i]
  y_2[i] <- 0.95*y_1[i-1] + e[i]
  y_3[i] <- 0.3*y_1[i-1] + e[i]
}

sim_multi <- tsibble(idx = seq_len(100), "phi = 0.6" = y_1, "phi = 0.99" = y_2, "phi=0.3" = y_3, index = idx)

sim_long <- sim_multi |>
  pivot_longer(!idx, names_to = "phi", values_to = "y")

autoplot(sim_long, y)

The plots don’t seem to be too different from each other but the higher phi value (0.99) seems to result in more extreme peaks and lows.

c. Write your own code to generate data from an MA(1) model with θ1=0.6 and σ2=1.

y_ma <- numeric(100)
e_ma <- rnorm(100)
for(i in 2:100)
  y_ma[i] <- e_ma[i] + 0.6*e_ma[i-1]
sim_ma <- tsibble(idx = seq_len(100), y_ma = y_ma, index=idx)

d. Produce a time plot for the series. How does the plot change as you change θ1?

sim_ma |> autoplot(y_ma)

# Time plot with variety of theta_1
y_1 <- numeric(100)
y_2 <- numeric(100)
y_3 <- numeric(100)
e <- rnorm(100)
for(i in 2:100) {
  y_1[i] <- e[i] + 0.6*e[i-1]
  y_2[i] <- e[i] + 0.99*e[i-1]
  y_3[i] <- e[i] + 0.1*e[i-1]
}

sim_ma_multi <- tsibble(
  idx = seq_len(100), 'theta = 0.6' = y_1, 'theta = 0.99' = y_2, 'theta = 0.1' = y_3, index = idx)

sim_ma_long <- sim_ma_multi |>
  pivot_longer(!idx, names_to = 'theta', values_to = 'y')

autoplot(sim_ma_long, y)

The theta value closer to 1 seems to have more extreme peaks and lows in the data plot than the other values.

e. Generate data from an ARMA(1,1) model with ϕ1=0.6, θ1=0.6 and σ2=1.

y_arma <- numeric(100)
e_arma <- rnorm(100)
for(i in 2:100)
  y_arma[i] <- 0.6*y[i-1] + 0.6*e_arma[i-1] + e_arma[i]
sim_arma <- tsibble(idx = seq_len(100), y_arma =y_arma, index=idx)

f. Generate data from an AR(2) model with ϕ1=−0.8, ϕ2=0.3 and σ2=1. (Note that these parameters will give a non-stationary series.)

y_ar2 <- numeric(100)
e_ar2 <- rnorm(100)
for(i in 3:100)
  y_ar2[i] <- -0.8*y_ar2[i-1] + 0.3*y_ar2[i-2] + e_ar2[i]
sim_ar2 <- tsibble(idx = seq_len(100), y_ar2 =y_ar2, index=idx)

g. Graph the latter two series and compare them.

combined_data <- tsibble(
  idx = seq_len(100), y_ar2 = y_ar2, y_arma = y_arma, index = idx)
combined_data <- combined_data |>
  pivot_longer(!idx, names_to = 'model', values_to = 'y')

combined_data |> autoplot(y)

The AR2 model shows increasing variance over time. The ARMA1,1 model appears very stable and actually just looks like a straight line at 0. This is likely due to the extreme values of AR2. ARMA1,1 is plotted below to verify.

sim_arma |> autoplot(y_arma)

9.7

Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.

a. Use ARIMA() to find an appropriate ARIMA model. What model was selected. Check that the residuals look like white noise. Plot forecasts for the next 10 periods.

# plot data
autoplot(aus_airpassengers, Passengers)

Since there is no seasonal variation, a non-seasonal ARIMA would likely be appropriate.

passengers_arima <- aus_airpassengers |>
  model(ARIMA(Passengers))
report(passengers_arima)

## Series: Passengers 
## Model: ARIMA(0,2,1) 
## 
## Coefficients:
##           ma1
##       -0.8963
## s.e.   0.0594
## 
## sigma^2 estimated as 4.308:  log likelihood=-97.02
## AIC=198.04   AICc=198.32   BIC=201.65

The (0,2,1) model was selected

aus_airpassengers |>
  model(ARIMA(Passengers)) |>
  gg_tsresiduals()

## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The residuals histogram is a relatively normal curve but there is an outlier on the right side. The acf plot shows all values within the critical values. The time plot of the residuals has a few spikes but the variance is mostly constant.

passengers_arima |>
  forecast(h=10) |>
  autoplot(aus_airpassengers)

b. Write the model in terms of the backshift operator.

The model in terms of backshift operator can be written as \[ (1-B)^2 y_t = (1 - 0.8963B)\epsilon_t \]

c. Plot forecasts from an ARIMA(0,1,0) model with drift and compare these to part a.

passengers_arima_c <- aus_airpassengers |>
  model("pass_a"= ARIMA(Passengers ~ pdq(0,2,1)),
        "pass_c" = ARIMA(Passengers ~ 1 + pdq(0,1,0)))
passengers_arima_c |> forecast(h=10) |>
  autoplot(aus_airpassengers, level = NULL)

The ARIMA 0,1,0 with drift model has less steep slope than the 0,2,1 model. The 0,1,0 model seems to be a good model when looking at the shape of the data as a whole but the 0,2,1 seems to be a better model when considering the more recent steeper slope.

d. Plot forecasts from an ARIMA(2,1,2) model with drift and compare these to parts a and c. Remove the constant and see what happens.

passengers_arima_d <- aus_airpassengers |>
  model("pass_a" = ARIMA(Passengers ~ pdq(0,2,1)), 
        "pass_c" = ARIMA(Passengers~ 1 + pdq(0,1,0)), 
        "pass_d" = ARIMA(Passengers ~ 1 + pdq(2,1,2))) 

passengers_arima_d |> 
  forecast(h=10) |>
  autoplot(aus_airpassengers, level=NULL)

The 2,1,2 drift model seem to perform similarly to the 0,1,0 drift model.

# Remove constant 
passengers_arima_d2 <- aus_airpassengers |>
  model('no_constant' = ARIMA(Passengers ~ 0 + pdq(2,1,2)))

## Warning: 1 error encountered for no_constant
## [1] non-stationary AR part from CSS

passengers_arima_d2 |>
  forecast(h=10) |>
  autoplot(aus_airpassengers, level= NULL)

## Warning: Removed 10 rows containing missing values or values outside the scale range
## (`geom_line()`).

The removal of the constant creates an error and unable to project a model.

e. Plot forecasts from an ARIMA(0,2,1) model with a constant. What happens?

passengers_arima_e <- aus_airpassengers |>
  model("pass_a" = ARIMA(Passengers ~ pdq(0,2,,1)),
        "pass_e" = ARIMA(Passengers ~ 1 + pdq(0,2,1)))

## Warning: Model specification induces a quadratic or higher order polynomial trend.  This
## is generally discouraged, consider removing the constant or reducing the number
## of differences.

passengers_arima_e |>
  forecast(h=10) |>
  autoplot(aus_airpassengers, level = NULL)

This model results in a warning that a quadratic or higher order polynomial trend was induced and that removing the constant or reducing the number of differences should be considered. The graph shows a very steep projection compared to the other models.

9.8

For the United States GDP series (from global_economy):

a. if necessary, find a suitable Box-Cox transformation for the data;

# Select and plot data
us_gdp <- global_economy |>
  filter(Code == "USA") |>
  select(Year, GDP)

autoplot(us_gdp, GDP)

# Find lambda 
lambda_gdp <- us_gdp |>
  features(GDP, features = guerrero) |>
  pull(lambda_guerrero)
print(lambda_gdp)

## [1] 0.2819443

# box cox transformation
us_gdp_bc <- us_gdp |>
  mutate(bc = box_cox(GDP, lambda_gdp)) |>
  select(Year, bc)
us_gdp_bc |> autoplot(bc)

A box-cox transformation with a lambda of 0.2819 is appropriate for this data

b. fit a suitable ARIMA model to the transformed data using ARIMA();

The data does not appear to be seasonal so a non-appropriate ARIMA model is appropriate.

us_gdp_arima <- us_gdp_bc |>
  model(ARIMA(bc))
report(us_gdp_arima)

## Series: bc 
## Model: ARIMA(1,1,0) w/ drift 
## 
## Coefficients:
##          ar1  constant
##       0.4586  118.1822
## s.e.  0.1198    9.5047
## 
## sigma^2 estimated as 5479:  log likelihood=-325.32
## AIC=656.65   AICc=657.1   BIC=662.78

c. try some other plausible models by experimenting with the orders chosen;

I will try the Hyndman-Khandakar algorithm.

# Test order for differencing 
us_gdp_bc |> features(bc, unitroot_ndiffs)

# model
us_gdp_models <- us_gdp_bc |>
  model('ARIMA010' = ARIMA(bc ~ 1 + pdq(0,1,0)),
        'ARIMA212' = ARIMA(bc ~ 1 + pdq(2,1,2)),
        'ARIMA110' = ARIMA(bc ~ 1 + pdq(1,1,0)),
        'ARIMA011' = ARIMA(bc ~ 1 + pdq(0,1,1)))

glance(us_gdp_models)

d. choose what you think is the best model and check the residual diagnostics;

glance(us_gdp_models) |>
  arrange(AICc)

The best model has the lowest AICc which happens to be the ARIMA 1,1,0 model in this case.

us_gdp_bc |> 
  model(ARIMA(bc ~ 1 + pdq(1,1,0))) |>
  gg_tsresiduals()

The residuals show a relatively constant variance and close to normal distribution. All acf points are also within the critical value shown by the blue dashed line.

e. produce forecasts of your fitted model. Do the forecasts look reasonable?

us_gdp_bc |>
  model(ARIMA(bc ~ 1 + pdq(1,1,0))) |>
  forecast(h=10) |>
  autoplot(us_gdp_bc)

The forecast seems reasonable since it follows the overall slope of the existing data.

f. compare the results with what you would obtain using ETS() (with no transformation).

us_gdp_bc |>
  model("ARIMA" = ARIMA(bc ~ 1+ pdq(1,1,0)), 
        "ETS" = ETS(bc ~ error("A") + trend("A") + season("N"))) |>
  forecast(h=10) |> 
  autoplot(us_gdp_bc, level = NULL)

The results of the forecasts seem similar but the ARIMA model projects slightly higher values than ETS.

us_gdp_bc |>
  slice(-n()) |>
  stretch_tsibble(.init = 10) |>
  model("ARIMA" = ARIMA(bc ~ 1 + pdq(1,1,0)),
        "ETS" =ETS(bc ~ error("A") + trend("A") + season("N"))) |>
  forecast(h=1) |>
  accuracy(us_gdp_bc) |>
  select(.model, RMSE:MAPE)

The ARIMA model has lower RMSE, MAE, and MAPE but higher MPE. Since models with lower RMSE are seen as better, ARIMA is the better model in this case.