Homework 6

exercises 9.1, 9.2, 9.3, 9.5, 9.6, 9.7, 9.8

9.1 Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.

Explain the differences among these figures. Do they all indicate that the data are white noise?

All three plots indicate the data are white noise because none of the spikes go over the blue dotted line. The three ACF plots all have a different threshold for white noise (critical value), as indicated by the line. The first is at about ~.3, the second is ~.1 and the third sits very close to zero.

Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

The blue dotted line represents a 95% confidence interval (95% of the spikes should fall within if the data is white noise). The thresholds are different because they’re partially decided by the number of data points/length of the time series. So, a longer time series will have a different critical value. The plots look different because white noise does not look the same in every time series.

From section 2.9: “For white noise series, we expect each autocorrelation to be close to zero. Of course, they will not be exactly equal to zero as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within
±1.96/√T where T is the length of the time series”

9.2 A classic example of a non-stationary series are stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'ggplot2' was built under R version 4.5.2

## Warning: package 'tibble' was built under R version 4.5.2

## Warning: package 'readr' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(fpp3)

## Warning: package 'fpp3' was built under R version 4.5.2

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.2 ──
## ✔ tsibble     1.1.6     ✔ feasts      0.4.2
## ✔ tsibbledata 0.4.1     ✔ fable       0.5.0

## Warning: package 'tsibble' was built under R version 4.5.2

## Warning: package 'tsibbledata' was built under R version 4.5.2

## Warning: package 'feasts' was built under R version 4.5.2

## Warning: package 'fabletools' was built under R version 4.5.2

## Warning: package 'fable' was built under R version 4.5.2

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

amzn <- gafa_stock |>
  filter(Symbol == "AMZN")

autoplot(amzn, Close)

#side-by-side
amzn |>
  gg_tsdisplay((Close), plot_type='partial')

## Warning: `gg_tsdisplay()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsdisplay()` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
## Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.

amzn |> PACF((Close)) |>
    autoplot() + labs(subtitle = "Amazon closing stock price")

## Warning: Provided data has an irregular interval, results should be treated
## with caution. Computing ACF by observation.

amzn |> ACF((Close)) |>
    autoplot() + labs(subtitle = "Amazon closing stock price")

## Warning: Provided data has an irregular interval, results should be treated
## with caution. Computing ACF by observation.

#differenced plot

amzn |>
  gg_tsdisplay(difference(Close), plot_type='partial')

## Warning: Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
## Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

In both the original ACF and PACF plots, the data is not white noise (significant # of peaks and dips above and below the line). This means the data has not been sufficiently differenced and is not stationary.

The ACF starts very close to 1 and shows a very slow decay, which means daily values are extremely related (this is common for a stock price, which is typically a random walk). The PACF removes the influence of intermediate lags. After lag 1, several lags go outside the line, indicating that the data is not stationary.

This data is clearly not seasonal. It’s a random walk, and to difference the data and make it stationary, like the google stock example, we can use the changes in closing stock price.

9.3 For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.

Turkish GDP from global_economy.

library(urca)

## Warning: package 'urca' was built under R version 4.5.2

#plot the time series
turkey <- global_economy |> filter (Country == "Turkey") |> mutate(adj_gdp = GDP/Population)

#The data is clearly not stable
autoplot(turkey, adj_gdp)

turkey |> PACF((adj_gdp)) |>
    autoplot()

#ACF shows a slow decay
turkey |> ACF((adj_gdp)) |>
    autoplot()

#The data is not seasonal because it is yearly

##Let's try differencing year on year

#this could use a box-cox of .225
turkey |> features(adj_gdp, features = guerrero)

## # A tibble: 1 × 2
##   Country lambda_guerrero
##   <fct>             <dbl>
## 1 Turkey            0.225

#this calls for one difference
turkey |> features(box_cox(adj_gdp, 0.225), unitroot_ndiffs)

## # A tibble: 1 × 2
##   Country ndiffs
##   <fct>    <int>
## 1 Turkey       1

#looks a lot more stable
turkey |> autoplot(difference(box_cox(adj_gdp, 0.225)))

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

#without the box-cox
#looks a little more stationary (though there are some bigger swings at the end, variance grows over time)

autoplot(turkey, difference(adj_gdp))

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

turkey |>
  gg_tsdisplay(difference(box_cox(adj_gdp, 0.225)), plot_type='partial')

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

turkey |> PACF(difference(box_cox(adj_gdp, 0.225))) |>
    autoplot()

turkey |> ACF(difference(box_cox(adj_gdp, 0.225))) |>
    autoplot()

#null hypothesis is not rejected. The data is stationary.

turkey |>
    mutate(diff_close = difference(box_cox(adj_gdp, .225))) |>
  features(diff_close, unitroot_kpss)

## # A tibble: 1 × 3
##   Country kpss_stat kpss_pvalue
##   <fct>       <dbl>       <dbl>
## 1 Turkey     0.0931         0.1

#Here we want a high p-value, which confirms that the data is white noise

turkey |>
  mutate(diff_close = difference(box_cox(adj_gdp, .225))) |>
  features(diff_close, ljung_box, lag = 10)

## # A tibble: 1 × 3
##   Country lb_stat lb_pvalue
##   <fct>     <dbl>     <dbl>
## 1 Turkey     5.88     0.825

Accommodation takings in the state of Tasmania from aus_accommodation.

tasmania <- aus_accommodation |> filter (State == "Tasmania")

#This is very seasonal (quarterly) data trending upward
autoplot(tasmania, Takings)

#let's try it with a box-cox

tasmania |> features(Takings, features = guerrero)

## # A tibble: 1 × 2
##   State    lambda_guerrero
##   <chr>              <dbl>
## 1 Tasmania         0.00182

#calls for one difference
tasmania |> features(box_cox(Takings, 0.00182), unitroot_ndiffs)

## # A tibble: 1 × 2
##   State    ndiffs
##   <chr>     <int>
## 1 Tasmania      1

#this doesn't look tremendously stationary 

tasmania |>
    autoplot(box_cox(Takings, .00182) |> difference(4))

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

#this looks a little more stable


tasmania |>
    autoplot(box_cox(Takings, .00182) |> difference(4) |> difference())

## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).

tasmania |>
    mutate(diff_data = box_cox(Takings, 0.00182) |> difference(4) |> difference()) |>
    features(diff_data, unitroot_kpss)

## # A tibble: 1 × 3
##   State    kpss_stat kpss_pvalue
##   <chr>        <dbl>       <dbl>
## 1 Tasmania    0.0474         0.1

#with just seasonal difference 
tasmania |>
    mutate(diff_data = box_cox(Takings, 0.00182) |> difference(4) ) |>
    features(diff_data, unitroot_kpss)

## # A tibble: 1 × 3
##   State    kpss_stat kpss_pvalue
##   <chr>        <dbl>       <dbl>
## 1 Tasmania     0.198         0.1

#this works -- null hypothesis is not rejected and data is stationary


#two differences also works - visually, I would go with 2 differences
#the plot looks much more stable
#however, unitroot_ndiffs says 1 difference is fine, as does unitroot_KPSS
tasmania |>
    mutate(diff_data = box_cox(Takings, 0.00182) |> difference(4) |> difference()) |>
    features(diff_data, unitroot_kpss)

## # A tibble: 1 × 3
##   State    kpss_stat kpss_pvalue
##   <chr>        <dbl>       <dbl>
## 1 Tasmania    0.0474         0.1

Monthly sales from souvenirs.

#since this is monthly, I'm guessing we need to take a difference of 12

#data is extremely seasonal but also trends upward and the seasonal swings get 
#bigger over time
autoplot(souvenirs)

## Plot variable not specified, automatically selected `.vars = Sales`

#calls for one difference 
souvenirs |> features(Sales, unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1

#Let's use the suggested lambda value
souvenirs |> features(Sales, features = guerrero)

## # A tibble: 1 × 1
##   lambda_guerrero
##             <dbl>
## 1         0.00212

#looks a little more stationary

souvenirs |> autoplot(
  box_cox(Sales, 0.00212) |> difference(12)
)

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

#check the ACF
#2 spikes outside the lines
souvenirs |> ACF(
  box_cox(Sales, 0.00212) |> difference(12)
  ) |> autoplot()

#unitroot kpss

souvenirs |>
    mutate(diff_data = box_cox(Sales, 0.00212) |> difference(12) ) |>
    features(diff_data, unitroot_kpss)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1     0.240         0.1

#.1 means the null hypothesis is not rejected, the data are stationary

#checking without the difference, just in case
souvenirs |>
    mutate(diff_data = box_cox(Sales, 0.00212)  ) |>
    features(diff_data, unitroot_kpss)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1      1.79        0.01

#autocorrelation is still remaining
souvenirs |>
  mutate(diff_data = box_cox(Sales, 0.00212) |> difference(12) ) |>
  features(diff_data, ljung_box, lag = 10)

## # A tibble: 1 × 2
##   lb_stat    lb_pvalue
##     <dbl>        <dbl>
## 1    53.7 0.0000000559

#taking a second difference makes the data a lot more statoinary:

souvenirs |> autoplot(
    box_cox(Sales, 0.00212) |> difference(12) |> difference()
)

## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_line()`).

#still passes the kpss test (of course), and **looks** much mor stationary
#single-difference looks closer to d) house sales in section 9.1, which is non-stationary
#the single-differenced version passes the KPSS test

souvenirs |>
    mutate(diff_data = box_cox(Sales, 0.00212) |> difference(12) |> difference()) |>
    features(diff_data, unitroot_kpss)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0381         0.1

#checking the auto-arima, which also suggests a difference of 1
souvenirs |>
  model(auto_arima = ARIMA(box_cox(Sales, 0.00212))) |>
  report()

## Series: Sales 
## Model: ARIMA(0,0,2)(2,1,0)[12] w/ drift 
## Transformation: box_cox(Sales, 0.00212) 
## 
## Coefficients:
##          ma1     ma2     sar1     sar2  constant
##       0.3465  0.5815  -0.3264  -0.2845    0.4494
## s.e.  0.0922  0.1203   0.1403   0.1414    0.0434
## 
## sigma^2 estimated as 0.03168:  log likelihood=22.89
## AIC=-33.78   AICc=-32.49   BIC=-20.12

9.5 for your retail data (from Exercise 7 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.

aus_retail_3 <- aus_retail |> 
  filter (Industry == "Cafes, restaurants and catering services") |> 
  summarize(Turnover = sum(Turnover))

aus_retail_3 |> autoplot(Turnover)

#tons of trend and seasonality

#how many differences do I need?
aus_retail_3 |>
  features(Turnover, unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1

aus_retail_3 |> features(Turnover, guerrero)

## # A tibble: 1 × 1
##   lambda_guerrero
##             <dbl>
## 1           0.281

#.281

#looks fairly stationary
aus_retail_3 |>
    autoplot(box_cox(Turnover, 0.281) |> difference(12))

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

#data is stationary with a transformation and a difference
aus_retail_3 |>
    mutate(seas_diff = box_cox(Turnover, 0.281) |> difference(12) ) |>
    features(seas_diff, unitroot_kpss)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0837         0.1

#ACF shows slow decay and seasonality
aus_retail_3 |> ACF(
  box_cox(Turnover, 0.281) |> difference(12)
  ) |> autoplot()

#PACF still has several spikes outside the lines
aus_retail_3 |> PACF(
  box_cox(Turnover, 0.281) |> difference(12)
  ) |> autoplot()

#unitroot_ndiffs says we are done differencing 
aus_retail_3 |>
    mutate(bc_sales = box_cox(Turnover, 0.281) |> difference(12)) |>
    features(bc_sales, unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

9.6Simulate and plot some data from simple ARIMA models.

Use the following R code to generate data from an AR(1) model with
phi 1 = .6 and
σ2=1. The process starts with y1=0.

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

Produce a time plot for the series. How does the plot change as you change fi 1?

autoplot(sim)

## Plot variable not specified, automatically selected `.vars = y`

#this should be a random walk

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 1*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

autoplot(sim) + labs(title = "phi value of 1")

## Plot variable not specified, automatically selected `.vars = y`

#staionary, maaaybe white noise
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- .1*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

autoplot(sim) + labs(title = "phi value of .1")

## Plot variable not specified, automatically selected `.vars = y`

#>1 should be parabolic
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 1.2*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

autoplot(sim) + labs(title = "phi value of 1.2")

## Plot variable not specified, automatically selected `.vars = y`

#a little closer to a random walk
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- .75*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

autoplot(sim) + labs(title = "phi value of .75")

## Plot variable not specified, automatically selected `.vars = y`

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- .9*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

autoplot(sim) + labs(title = "phi value of .9")

## Plot variable not specified, automatically selected `.vars = y`

#closer to white noise
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- .25*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

autoplot(sim) + labs(title = "phi value of .25")

## Plot variable not specified, automatically selected `.vars = y`

Write your own code to generate data from an MA(1) model with phi = .6 and sigma squared = 1

#using a past error
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

autoplot(sim)

## Plot variable not specified, automatically selected `.vars = y`

d. Produce a time plot for the series. How does the plot change as you change phi?

#no longer parabolic

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 1.5*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

autoplot(sim) + labs(title = "phi of 1.5")

## Plot variable not specified, automatically selected `.vars = y`

#changing phi to 1 - is it still a random walk ?

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 1*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

autoplot(sim) + labs(title = "phi of 1")

## Plot variable not specified, automatically selected `.vars = y`

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- .8*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

autoplot(sim) + labs(title = "phi of .8")

## Plot variable not specified, automatically selected `.vars = y`

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- .75*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

autoplot(sim) + labs(title = "phi of .75")

## Plot variable not specified, automatically selected `.vars = y`

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- .5*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

autoplot(sim) + labs(title = "phi of .5")

## Plot variable not specified, automatically selected `.vars = y`

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- .4*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

autoplot(sim) + labs(title = "phi of .4")

## Plot variable not specified, automatically selected `.vars = y`

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- .3*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

autoplot(sim) + labs(title = "phi of .3")

## Plot variable not specified, automatically selected `.vars = y`

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- .2*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

autoplot(sim) + labs(title = "phi of .2")

## Plot variable not specified, automatically selected `.vars = y`

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- .1*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

autoplot(sim) + labs(title = "phi of .1")

## Plot variable not specified, automatically selected `.vars = y`

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

autoplot(sim) + labs(title = "phi of 0")

## Plot variable not specified, automatically selected `.vars = y`

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- -.5*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

autoplot(sim) + labs(title = "phi of -.5")

## Plot variable not specified, automatically selected `.vars = y`

It’s kind of hard to see the differences, but the series has different size swings depending on the value of phi. It stays stationary, even at values larger than 1, though values >1 tend to have bigger swings.

e. Generate data from an ARMA(1,1) model with
ϕ 1 = 0.6 ,
θ 1 = 0.6 and
σ 2 = 1 .

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + 0.6*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

autoplot(sim)

## Plot variable not specified, automatically selected `.vars = y`

Generate data from an AR(2) model with
ϕ 1 = − 0.8 ,
ϕ 2 = 0.3 and
σ 2 = 1 . (Note that these parameters will give a non-stationary series.)

y <- numeric(100)
e <- rnorm(100)
for(i in 3:100)
  y[i] <- -0.8*y[i-1] + 0.3*y[i-2] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
autoplot(sim)

## Plot variable not specified, automatically selected `.vars = y`

Graph the latter two series and compare them.

The first graph is non-stationary, close to a random walk. The second starts off fairly stable, but quickly diverges, getting farther and farther from the x axis in both directoins.

9.7 Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.

aus_fit <- aus_airpassengers |> 
  model(search = ARIMA(Passengers, stepwise=FALSE))

aus_fit

## # A mable: 1 x 1
##           search
##          <model>
## 1 <ARIMA(0,2,1)>

#0,2,1!
#This will handle the differencing

Use ARIMA() to find an appropriate ARIMA model. What model was selected.

Check that the residuals look like white noise.

aus_fit |>
  select(search) |>
  gg_tsresiduals()

## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

They look pretty close to white noise– the histogram is normal-ish, and the ACF doesn’t include any spikes outside the line.

Plot forecasts for the next 10 periods.

aus_fit |>
  forecast(h=10) |>
  filter(.model=='search') |>
  autoplot(aus_airpassengers)

Write the model in terms of the backshift operator.

for two differences:

(1-B)^2

(1-B) * (1-B)

1 - B - B + B^2

(1 - 2B + B^2)yt

for the single MA:

(1 + theta1B)et

combining them:

(1 - 2B + B^2)yt = (1 + theta1B)et

Plot forecasts from an ARIMA(0,1,0) model with drift and compare these to part a.

#the first 1 (constant) is the drift
aus_fit_2 <- aus_airpassengers |> 
  model(arima010 = ARIMA(Passengers ~ 1 + pdq(0,1,0))
        )

aus_fit_2 |>
  forecast(h=10) |>
  filter(.model=='arima010') |>
  autoplot(aus_airpassengers)

They look similar, the forecasts for the 010 model are generally a little lower/more conservative. Both models trend upward in a fairly straight line. The drift in the 010 allows for the forecast to follow the upward trend.

Plot forecasts from an ARIMA(2,1,2) model with drift and compare these to parts a and c.

Remove the constant and see what happens.

#this one is a little wobblier, forecasting some ups and (small) downs. 
aus_fit_3 <- aus_airpassengers |> 
  model(arima212 = ARIMA(Passengers ~ 1 + pdq(2,1,2))
        )

aus_fit_3 |>
  forecast(h=10) |>
  filter(.model=='arima212') |>
  autoplot(aus_airpassengers)

#removing the constant gives me an error -- there's no appropriate ARIMA model

aus_fit_4 <- aus_airpassengers |> 
  model(arima212nd = ARIMA(Passengers ~ pdq(2,1,2))
        )

## Warning: It looks like you're trying to fully specify your ARIMA model but have not said
## if a constant should be included. You can include a constant using `ARIMA(y~1)`
## to the formula or exclude it by adding `ARIMA(y~0)`.

## Warning: 1 error encountered for arima212nd
## [1] Could not find an appropriate ARIMA model.
## This is likely because automatic selection does not select models with characteristic roots that may be numerically unstable.
## For more details, refer to https://otexts.com/fpp3/arima-r.html#plotting-the-characteristic-roots

Plot forecasts from an ARIMA(0,2,1) model with a constant. What happens?

aus_fit_5 <- aus_airpassengers |> 
  model(arima021 = ARIMA(Passengers ~ pdq(0,2,1))
        )

aus_fit_5 |>
  forecast(h=10) |>
  filter(.model=='arima021') |>
  autoplot(aus_airpassengers)

This looks similar to the initial models, with an upward trajectory, straight-ish line. This may be because of the two differences.

9.8

For the United States GDP series (from global_economy):
1. if necessary, find a suitable Box-Cox transformation for the data;

usa <- global_economy |> 
  filter (Code == "USA") |>
  mutate(gdp_percap = GDP/Population)

usa |> autoplot(gdp_percap)

#data is clearly not stable


usa |> features(gdp_percap, features = guerrero)

## # A tibble: 1 × 2
##   Country       lambda_guerrero
##   <fct>                   <dbl>
## 1 United States           0.393

usa |>
    mutate(diff_data = box_cox(gdp_percap, 0.393) ) |>
    features(diff_data, unitroot_kpss)

## # A tibble: 1 × 3
##   Country       kpss_stat kpss_pvalue
##   <fct>             <dbl>       <dbl>
## 1 United States      1.55        0.01

#yearly data = single difference
usa |>
    mutate(diff_data = box_cox(gdp_percap, 0.393) |> difference()) |>
    autoplot(diff_data)

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

#KPSS value is .1 here, so the data is stationary
usa |>
    mutate(diff_data = box_cox(gdp_percap, 0.393) |> difference()) |>
    features(diff_data, unitroot_kpss)

## # A tibble: 1 × 3
##   Country       kpss_stat kpss_pvalue
##   <fct>             <dbl>       <dbl>
## 1 United States     0.197         0.1

fit a suitable ARIMA model to the transformed data using ARIMA();

usa_fit <- usa |> 
  model(search = ARIMA(box_cox(gdp_percap, 0.393), stepwise=FALSE))

usa_fit

## # A mable: 1 x 2
## # Key:     Country [1]
##   Country                        search
##   <fct>                         <model>
## 1 United States <ARIMA(1,1,0) w/ drift>

#110 w drift


usa_fit_21 <- usa |> 
  model(arima110 = ARIMA(box_cox(gdp_percap, 0.393) ~ 1 + pdq(1,1,0))
        )
#10 year forecast
usa_fit_21 |>
  forecast(h=10) |>
  filter(.model=='arima110') |>
  autoplot(usa)

#check the residuals:
usa_fit_21 |> gg_tsresiduals()

#mostly normally distributed (a little lumpy and concentrated at the middle) 
#no lags outside the lines
#bigger residuals for economic downturns

try some other plausible models by experimenting with the orders chosen;

usa |>
  gg_tsdisplay(difference(box_cox(gdp_percap, 0.393)), plot_type = 'partial')

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

#AR model - the PACF indicates a  1,1 model (already used), ACF shows MA 11
#and 1,1,1 model to combine both. Will do drift and no drift + a 110 model without
#drift

usa_fit_2 <- usa |> 
  model(
        arima011 = ARIMA(box_cox(gdp_percap, 0.393) ~ pdq(0,1,1)),
        arima111 = ARIMA(box_cox(gdp_percap, 0.393) ~ pdq(1,1,1)),
        arima011_d = ARIMA(box_cox(gdp_percap, 0.393) ~ 1 + pdq(0,1,1)),
        arima111_d = ARIMA(box_cox(gdp_percap, 0.393) ~ 1 + pdq(1,1,1)),
        arima110_nd = ARIMA(box_cox(gdp_percap, 0.393) ~ pdq(1,1,0))
  )

#stepwise - just checking, though 1,1 is at the beginning of the grid
#so the result shouldn't be different
#This still gives 110 w drift

usa_fit_3 <- usa |> 
  model(stepwise = ARIMA(box_cox(gdp_percap, 0.393)))

usa_fit_3

## # A mable: 1 x 2
## # Key:     Country [1]
##   Country                      stepwise
##   <fct>                         <model>
## 1 United States <ARIMA(1,1,0) w/ drift>

usa_fit_2 |>
  forecast(h=10) |>
  filter(.model=='arima011') |>
  autoplot(usa) + labs(title = 'ARIMA 011')

usa_fit_2 |>
  forecast(h=10) |>
  filter(.model=='arima011_d') |>
  autoplot(usa) + labs(title = 'ARIMA 011 with drift')

usa_fit_2 |>
  forecast(h=10) |>
  filter(.model=='arima111') |>
  autoplot(usa)  + labs(title = 'ARIMA 111')

usa_fit_2 |>
  forecast(h=10) |>
  filter(.model=='arima111_d') |>
  autoplot(usa)  + labs(title = 'ARIMA 111 with drift')

usa_fit_2 |>
  forecast(h=10) |>
  filter(.model=='arima110_nd') |>
  autoplot(usa)  + labs(title = 'ARIMA 110, no drift')

choose what you think is the best model and check the residual diagnostics;

I liked the 011 model with drift because the ACF cut off pretty cleanly.

Residual diagnostics look pretty good. There are big dips for outliers in the 80s and in 2008/09. Histogram looks normal and none of the lags are outside.

usa_fit_2 |>
  select(arima011_d) |>
  gg_tsresiduals()

Overall, the drift doesn’t seem to change the plot much (maybe at all), possibly because it has such a strong upward trend.

produce forecasts of your fitted model. Do the forecasts look reasonable?

usa_fit_2 |>
  forecast(h=10) |>
  filter(.model=='arima011_d') |>
  autoplot(usa) + labs(title = 'ARIMA 011 with drift')

Plots for all models are above – all the forecasts continue the upward trajectory and look reasonable to me. The drift doesn’t change much (or anything), as noted above.

The 011 model in particular continues the upward trajectory and looks to have a narrower confidence interval than the others.

compare the results with what you would obtain using ETS() (with no transformation).

#checking the model

fit_usa_ets <- usa |>
  model(ETS(gdp_percap))
report(fit_usa_ets)

## Series: gdp_percap 
## Model: ETS(M,A,N) 
##   Smoothing parameters:
##     alpha = 0.9998989 
##     beta  = 0.5425435 
## 
##   Initial states:
##      l[0]    b[0]
##  2889.565 148.697
## 
##   sigma^2:  4e-04
## 
##      AIC     AICc      BIC 
## 924.4653 925.6192 934.7675

#MAN (no seasonality for yearly data)
#forecast

usafit_gdp <- usa |>
  model(ETS(gdp_percap ~ error("M") + trend("A") + season("N")))

#10 years

fit_usa <- usafit_gdp |>
  forecast(h = 10)

fit_usa |>
  autoplot(usa) + labs(title = "ETS M,A,N")

#residuals for this one look good too
usafit_gdp |> 
  gg_tsresiduals()

This has an upward trajectory as well, and a much wider confidence interval than ARIMA.

The estimates are actually similar, with the ARIMA 011 drift model topping out at about 80000 and the ETS MAN model hitting a similar spot in year 10. However, the confidence interval gets much wider for the ETS model. This is likely because, with exponential smoothing and an MAN model, getting farther from the last prediction means a higher multiplicative error.

It’s worth noting that AIMA 011 is equivalent to ETS ANN, which was not used here.

Homework 6

Samantha Barbaro

2026-03-15