Data 624: Homework 6

Ask

Do the exercises 9.1, 9.2, 9.3, 9.5, 9.6, 9.7, 9.8 in Hyndman. Please submit both the Rpubs link as well as your .rmd file.

9.1

Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.

Explain the differences among these figures. Do they all indicate that the data are white noise?

Among these figures the boundaries and the length of the lags decrease as the number of random numbers increase. All the data is between the blue dashed lines therefore it indicates that the data are white noise.

Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

The critical values are at different distances from the mean of zero because of the amount of data in each time series. The larger the series the smaller the critical value and the less correlated the values become. For the smaller set of random numbers, there is a stronger positive or negative correlation between the lag values. The autocorrelations differ in each figure because of the amount of random numbers in the given plot. It is all white noise however, because the numbers are random.

9.2

A classic example of a non-stationary series are stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

gafa_stock |>
  filter(Symbol == "AMZN") |>
  gg_tsdisplay(Close, plot_type='partial') +
  labs(title = "Amazon Closing Stock Price")

## Warning: Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
## Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.

The plots above show that the Amazon Closing Price is non stationary because the ACF plot has lag values that are large and decreasing slowly. A stationary time series would have an ACF plot that quickly converges to zero. The time series plot shows an increasing trend, therefore it is not stationary. The PACF for a stationary time series would be 0 for all of the lags. The PACF graph in this example shows a lag value at 1.

9.3

For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.

Turkish GDP from global_economy.

global_economy |>
  filter (Country == "Turkey") |>
  gg_tsdisplay(GDP, plot_type='partial')

lambda <- global_economy |>
  filter(Country == "Turkey") |>
  features(GDP, features = guerrero) |>
  pull(lambda_guerrero)

global_economy |>
  filter(Country == "Turkey") |>
  features(box_cox(GDP, lambda), unitroot_ndiffs)

## # A tibble: 1 × 2
##   Country ndiffs
##   <fct>    <int>
## 1 Turkey       1

The above code shows that the appropriate number of differencing to make the data stationary is one.

Accommodation takings in the state of Tasmania from aus_accommodation.

aus_accommodation |>
  filter(State == "Tasmania") |>
  autoplot(Takings)

lambda <- aus_accommodation |>
  filter(State == "Tasmania") |>
  features(Takings, features = guerrero) |>
  pull(lambda_guerrero)

aus_accommodation |>
  filter(State == "Tasmania") |>
  features(box_cox(Takings, lambda), unitroot_ndiffs)

## # A tibble: 1 × 2
##   State    ndiffs
##   <chr>     <int>
## 1 Tasmania      1

The data needs to be differenced once to become stationary.

Monthly sales from souvenirs.

souvenirs |>
  autoplot()

## Plot variable not specified, automatically selected `.vars = Sales`

lambda <- souvenirs |>
  features(Sales, features = guerrero) |>
  pull(lambda_guerrero)

souvenirs |>
  features(box_cox(Sales, lambda), unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      1

The amount of times this time series needs to be differenced is once.

9.5

For your retail data (from Exercise 7 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.

set.seed(12345678)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries |>
  autoplot()

## Plot variable not specified, automatically selected `.vars = Turnover`

The series has trend and seasonality and in order to use an ARIMA model. Because of the variance in the data I will apply the BoxCox transformation.

myseries |>
  features(Turnover, unitroot_ndiffs)

## # A tibble: 1 × 3
##   State              Industry                                            ndiffs
##   <chr>              <chr>                                                <int>
## 1 Northern Territory Clothing, footwear and personal accessory retailing      1

# transformed plot
myseries %>%
  gg_tsdisplay(difference(Turnover, 12), plot_type='partial', lag = 36)

## Warning: Removed 12 rows containing missing values (`geom_line()`).

## Warning: Removed 12 rows containing missing values (`geom_point()`).

lambda <- myseries |>
  features(Turnover, features = guerrero) |>
  pull(lambda_guerrero)

myseries |>
  features(box_cox(Turnover, lambda), unitroot_ndiffs)

## # A tibble: 1 × 3
##   State              Industry                                            ndiffs
##   <chr>              <chr>                                                <int>
## 1 Northern Territory Clothing, footwear and personal accessory retailing      1

# Box Cox 
myseries %>%
  gg_tsdisplay(difference(box_cox(Turnover,lambda), 12), plot_type='partial', lag = 36)

## Warning: Removed 12 rows containing missing values (`geom_line()`).

## Warning: Removed 12 rows containing missing values (`geom_point()`).

The box cox transformation shows a more stationary time series after a singe difference.

9.6

Simulate and plot some data from simple ARIMA models.

Use the following R code to generate data from an AR(1) model with phi_1=0.6 and delta^2 = 1. The process starts with y_1 = 0.

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

sim

## # A tsibble: 100 x 2 [1]
##      idx      y
##    <int>  <dbl>
##  1     1  0    
##  2     2 -2.38 
##  3     3 -1.68 
##  4     4 -0.118
##  5     5 -0.823
##  6     6 -1.25 
##  7     7  0.234
##  8     8  1.24 
##  9     9  3.30 
## 10    10  1.45 
## # ℹ 90 more rows

Produce a time plot for the series. How does the plot change as you change phi_1?

sim |>
  autoplot()

## Plot variable not specified, automatically selected `.vars = y`

for(i in 2:100)
  y[i] <- 0.9*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

sim |>
  autoplot()

## Plot variable not specified, automatically selected `.vars = y`

for(i in 2:100)
  y[i] <- 0.2*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

sim |>
  autoplot()

## Plot variable not specified, automatically selected `.vars = y`

The smaller the phi value the period of the time series (The difference between the peaks of the waves) decreases, as the phi value increases, the period also increases.

9.7

Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.

Use ARIMA() to find an appropriate ARIMA model. What model was selected. Check that the residuals look like white noise. Plot forecasts for the next 10 periods.

fit <- aus_airpassengers |>
  filter(Year < 2012) |>
  model(ARIMA(Passengers))

report(fit)

## Series: Passengers 
## Model: ARIMA(0,2,1) 
## 
## Coefficients:
##           ma1
##       -0.8756
## s.e.   0.0722
## 
## sigma^2 estimated as 4.671:  log likelihood=-87.8
## AIC=179.61   AICc=179.93   BIC=182.99

The model chosen is ARIMA(0,2,1)

fit |>
  forecast(h=10) |>
  autoplot(aus_airpassengers)

fit |>
  gg_tsresiduals()

The residuals show a normal distribution and the ACF plot shows lags that are within the blue lines.

Write the model in terms of the backshift operator

\(y_t = -0.8756e_{t-1} +e_t\)

\((1-B)^dy_t = (1- 0.8756B)e_t\)

Plot forecasts from an ARIMA(0,1,0) model with drift and compare these to part a.

fit1 <-aus_airpassengers %>%
  filter(Year < 2012) %>%
  model(ARIMA(Passengers ~ pdq(0,1,0)))

fit1 %>% 
  forecast(h=10) %>%
  autoplot(aus_airpassengers)

fit1 %>% 
  gg_tsresiduals() +
  labs(title = "Residuals for ARIMA(0,1,0)")

The residuals are more normally distributed and the ACF shows more positively correlated values. Part A forecasts values higher than the actual time series. The forecast in part b shows the forecast below the actual values in the time series.

Plot forecasts from an ARIMA(2,1,2) model with drift and compare these to parts a and c. Remove the constant and see what happens.

fit2 <-aus_airpassengers %>%
  filter(Year < 2012) %>%
  model(ARIMA(Passengers ~ 1+  pdq(2,1,2)))

fit2 %>% 
  forecast(h=10) %>%
  autoplot(aus_airpassengers)

fit2 %>% 
  gg_tsresiduals() +
  labs(title = "Residuals for ARIMA(2,1,2)")

When the constant is removed, a warning is produced because of the non-stationary AR part of from the CSS.

# Remove the constant


fit3 <- aus_airpassengers %>%
  filter(Year < 2012) %>%
  model(ARIMA(Passengers ~ 0 + pdq(2,1,2)))

## Warning: 1 error encountered for ARIMA(Passengers ~ 0 + pdq(2, 1, 2))
## [1] non-stationary AR part from CSS

report(fit3)

## Series: Passengers 
## Model: NULL model 
## NULL model

# Fix the warning
# https://stackoverflow.com/questions/7233288/non-stationary-seasonal-ar-part-from-css-error-in-r Referenced for the error - changed method to Maximum Likelihood.
#  The coefficients should be between -1 and 1 for the process to be stationary.

fit4 <- aus_airpassengers %>%
  filter(Year < 2012) %>%
  model(ARIMA(Passengers ~ 0 + pdq(2,1,2), method = "ML"))
report(fit4)

## Series: Passengers 
## Model: ARIMA(2,1,2) 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2
##       1.8128  -0.8154  -1.9642  0.9999
## s.e.  0.1321   0.1320   0.1295  0.1310
## 
## sigma^2 estimated as 4.203:  log likelihood=-88.19
## AIC=186.37   AICc=188.09   BIC=194.94

The inclusion of a constant in a non-stationary ARIMA model is equivalent to inducing a polynomial trend of order \(d\) in the forecasts. If the constant is to be ommitted the forecast include a polynomial trend \(d-1\).

fit5 <- aus_airpassengers %>%
  filter(Year < 2012) %>%
  model(ARIMA(Passengers ~ 0 + pdq(2,0,2)))
report(fit5)

## Series: Passengers 
## Model: ARIMA(2,0,2) 
## 
## Coefficients:
##          ar1     ar2     ma1    ma2
##       0.1590  0.8369  1.0569  0.128
## s.e.  0.2553  0.2546  0.2658  0.179
## 
## sigma^2 estimated as 6.64:  log likelihood=-100.21
## AIC=210.43   AICc=212.09   BIC=219.12

Plot forecasts from an ARIMA(0,2,1) model with a constant. What happens?

fit4 <- aus_airpassengers %>%
  filter(Year < 2012) %>%
  model(ARIMA(Passengers ~ 1 +  pdq(0,2,1)))

## Warning: Model specification induces a quadratic or higher order polynomial trend. 
## This is generally discouraged, consider removing the constant or reducing the number of differences.

report(fit4)

## Series: Passengers 
## Model: ARIMA(0,2,1) w/ poly 
## 
## Coefficients:
##           ma1  constant
##       -1.0000    0.0721
## s.e.   0.0709    0.0260
## 
## sigma^2 estimated as 4.086:  log likelihood=-85.74
## AIC=177.48   AICc=178.15   BIC=182.55

A warning is produced that the model specification induces a quadratic or higher order trend. It then encourages me to remove the constant or reduce the number of differences.

9.8

For the United States GDP series (from global_economy):

global_economy |> 
  filter(Country == "United States") |>
  autoplot()

## Plot variable not specified, automatically selected `.vars = GDP`

if necessary, find a suitable Box-Cox transformation for the data;

lambda <-global_economy |>
  filter(Country == "United States") |>
  features(GDP, features = guerrero) |>
  pull(lambda_guerrero)

fit a suitable ARIMA model to the transformed data using ARIMA();

fit <- global_economy |>
  filter(Country == "United States") |>
  model(ARIMA(box_cox(GDP, lambda))) 

report(fit)

## Series: GDP 
## Model: ARIMA(1,1,0) w/ drift 
## Transformation: box_cox(GDP, lambda) 
## 
## Coefficients:
##          ar1  constant
##       0.4586  118.1822
## s.e.  0.1198    9.5047
## 
## sigma^2 estimated as 5479:  log likelihood=-325.32
## AIC=656.65   AICc=657.1   BIC=662.78

try some other plausible models by experimenting with the orders chosen;

global_economy |>
  filter(Country == "United States") |>
  features(box_cox(GDP, lambda), unitroot_ndiffs)

## # A tibble: 1 × 2
##   Country       ndiffs
##   <fct>          <int>
## 1 United States      1

global_economy |>
  filter(Country == "United States") |>
  gg_tsdisplay(difference(GDP, 1),
               plot_type='partial', lag=36) +
  labs(title = "Differenced", y="")

## Warning: Removed 1 row containing missing values (`geom_line()`).

## Warning: Removed 1 rows containing missing values (`geom_point()`).

Here we see in the pcf graph that only 1 of the lags is out of the significance boundary, so we can select q to be 1. In the ACF graph the first lag is out of the significant limit, so we can select p to be 1. Using the unitdiff function, we found that the optimal value for d is 1.

fit <- global_economy |>
  filter(Country == "United States") |>
  select(GDP, Year) |>
  model(
    arima011 = ARIMA(box_cox(GDP,lambda) ~ pdq(0,1,1)),
    arima110 = ARIMA(box_cox(GDP,lambda) ~ pdq(1,1,0)),
    arima111 = ARIMA(box_cox(GDP,lambda) ~ pdq(1,1,1)),
    auto = ARIMA(box_cox(GDP,lambda), stepwise = FALSE, approx = FALSE)
  )

fit |> pivot_longer(everything(), names_to = "Model name",
                     values_to = "Orders")

## # A mable: 4 x 2
## # Key:     Model name [4]
##   `Model name`                  Orders
##   <chr>                        <model>
## 1 arima011     <ARIMA(0,1,1) w/ drift>
## 2 arima110     <ARIMA(1,1,0) w/ drift>
## 3 arima111     <ARIMA(1,1,1) w/ drift>
## 4 auto         <ARIMA(1,1,0) w/ drift>

glance(fit) |> arrange(AICc) |> select(.model:BIC)

## # A tibble: 4 × 6
##   .model   sigma2 log_lik   AIC  AICc   BIC
##   <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 arima110  5479.   -325.  657.  657.  663.
## 2 auto      5479.   -325.  657.  657.  663.
## 3 arima011  5689.   -326.  659.  659.  665.
## 4 arima111  5580.   -325.  659.  659.  667.

choose what you think is the best model and check the residual diagnostics;

fit |> select(arima110) %>% gg_tsresiduals()

produce forecasts of your fitted model. Do the forecasts look reasonable?

US_GDP <- global_economy |>
  filter(Country == "United States") |>
  select(GDP, Year) 
fit |>
  forecast(h=10) |>
  filter(.model == 'arima110') |>
  autoplot(US_GDP) + 
  labs(title = "Forecasted United States GDP using ARIMA(1,1,0)") +
  theme(plot.title = element_text(hjust = 0.5))

compare the results with what you would obtain using ETS() (with no transformation).

US_ETS <- US_GDP |>
  model(ETS(GDP))
report(US_ETS)

## Series: GDP 
## Model: ETS(M,A,N) 
##   Smoothing parameters:
##     alpha = 0.9990876 
##     beta  = 0.5011949 
## 
##   Initial states:
##          l[0]        b[0]
##  448093333334 64917355687
## 
##   sigma^2:  7e-04
## 
##      AIC     AICc      BIC 
## 3190.787 3191.941 3201.089

The AICc for arima110 is 657.1005. The AICc for the ETS model is 3191.941, which is significantly higher.