Exercise 9.1

Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.

  1. Explain the differences among these figures. Do they all indicate that the data are white noise?

The difference between these figures is that the autocorrelations shrink toward zero as the sample of random numbers increases. With 36 numbers there may be some apparent autocorrelation by chance, but as the sample size grows the sample autocorrelations settle close to zero. All three graphs indicate that the data are white noise. If there were a trend in the data, you would expect the ACF to start high and decay slowly as the lag increases, and there are no periodic spikes that would indicate seasonality.

  2. Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

The critical values are at different distances from zero because they are computed as ±1.96/√T, where T is the length of the series, so larger samples give narrower bounds. The autocorrelations are different in each graph because each is a different random sample; for white noise the sample autocorrelations vary randomly around zero.
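A quick numeric sketch of those bounds for the three sample sizes in Figure 9.32 (the only assumption is the standard ±1.96/√T white-noise limits):

# White-noise critical values are +/- 1.96 / sqrt(T)
sapply(c(36, 360, 1000), function(T) 1.96 / sqrt(T))
## [1] 0.32666667 0.10330107 0.06198064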

Exercise 9.2

A classic example of a non-stationary series are stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

The initial plot of the time series shows no seasonality, but there is a clear upward trend in the data.

amzn_df <- gafa_stock %>% filter(Symbol == 'AMZN') %>% select(Symbol, Date, Close)

amzn_df |> autoplot()
## Plot variable not specified, automatically selected `.vars = Close`

When we look at the ACF we see a pronounced autocorrelation that decays only slowly as the lag increases, which indicates that the data have a trend and are not stationary.

amzn_df |> ACF(Close) |>
  autoplot() + labs(subtitle = "Amazon closing stock price")
## Warning: Provided data has an irregular interval, results should be treated
## with caution. Computing ACF by observation.

“While the ACF measures the correlation between two points with a given lag, it doesn’t account for the influence of other intervening observations. The PACF remedies this by measuring the correlation between two points, controlling for the values at all shorter lags.” Article about ACF vs PACF

With the PACF we see a dominant spike at lag 1, with smaller spikes around one week, one month, and five weeks. I would start by taking the difference between consecutive observations.

amzn_df |> PACF(Close) |>
  autoplot() + labs(subtitle = "Amazon closing stock price")
## Warning: Provided data has an irregular interval, results should be treated
## with caution. Computing ACF by observation.
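As a quick check (a sketch not in the original write-up; the same irregular-interval warning applies), the ACF of the first-differenced series should look close to white noise if one difference is enough:

amzn_df |> mutate(diff_close = difference(Close)) |>
  ACF(diff_close) |>
  autoplot() + labs(subtitle = "Differenced Amazon closing stock price")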

Exercise 9.3

For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.

a. Turkish GDP from global_economy.
turk_hw <- global_economy |> filter(Country == 'Turkey') |> select(Country, Year, GDP)
turk_hw |> autoplot()
## Plot variable not specified, automatically selected `.vars = GDP`

library(caret)  # preProcess() below comes from the caret package

turk_trans <- as.data.frame(turk_hw)
bc_trans <- preProcess(turk_trans["GDP"], method = c("BoxCox"))
transformed <- predict(bc_trans, turk_trans["GDP"])

turk_trans$GDP_t <- transformed$GDP

turk_trans <- as_tsibble(turk_trans, index = 'Year')

turk_trans |> autoplot(GDP_t)

It looks like first-order differencing is appropriate for the Turkish series: after differencing, the ACF and PACF plots show no worrying autocorrelation.

turk_trans |> gg_tsdisplay(difference(GDP_t) , plot_type='partial')
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
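As a cross-check (a sketch using the feasts unit-root feature; I would expect it to return 1 here):

turk_trans |> features(GDP_t, unitroot_ndiffs)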

b. Accommodation takings in the state of Tasmania from aus_accommodation.
tas_hw <- aus_accommodation |> filter(State == 'Tasmania') |> select(Date, State, Takings)
tas_hw |> autoplot()
## Plot variable not specified, automatically selected `.vars = Takings`

tas_trans <- as.data.frame(tas_hw)
bc_trans <- preProcess(tas_trans["Takings"], method = c("BoxCox"))
transformed <- predict(bc_trans, tas_trans["Takings"])

tas_trans$Takings_t <- transformed$Takings

tas_trans <- as_tsibble(tas_trans, index = 'Date')

tas_trans |> autoplot(Takings_t)

The Tasmanian series may require two rounds of differencing: a seasonal difference first (lag 4, since the data are quarterly) and then potentially an additional first difference.

tas_trans |> gg_tsdisplay(difference(Takings_t) , plot_type='partial')
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
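A sketch of that two-step differencing, with the seasonal unit-root feature as a cross-check:

tas_trans |> features(Takings_t, unitroot_nsdiffs)

tas_trans |> gg_tsdisplay(difference(difference(Takings_t, 4)), plot_type='partial')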

c. Monthly sales from souvenirs.
souvenirs_hw <- souvenirs
souvenirs_hw |> autoplot()
## Plot variable not specified, automatically selected `.vars = Sales`

souvenirs_trans <- as.data.frame(souvenirs_hw)
bc_trans <- preProcess(souvenirs_trans["Sales"], method = c("BoxCox"))
transformed <- predict(bc_trans, souvenirs_trans["Sales"])

souvenirs_trans$Sales_t <- transformed$Sales

souvenirs_trans <- as_tsibble(souvenirs_trans, index = 'Month')

souvenirs_trans |> autoplot(Sales_t)

The souvenirs series will require a seasonal difference (lag 12, since the data are monthly) and potentially a first difference as well.

souvenirs_trans |> gg_tsdisplay(difference(Sales_t) , plot_type='partial')
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
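The display above only applies a first difference; here is a sketch of the seasonal-plus-first differencing suggested above:

souvenirs_trans |> gg_tsdisplay(difference(difference(Sales_t, 12)), plot_type='partial')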

Exercise 9.5

For your retail data (from Exercise 7 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.

set.seed(555)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries |> autoplot()
## Plot variable not specified, automatically selected `.vars = Turnover`

Turnover_trans <- as.data.frame(myseries)
bc_trans <- preProcess(Turnover_trans["Turnover"], method = c("BoxCox"))
transformed <- predict(bc_trans, Turnover_trans["Turnover"])

Turnover_trans$Turnover_t <- transformed$Turnover

Turnover_trans <- as_tsibble(Turnover_trans, index = 'Month')

Turnover_trans |> autoplot(Turnover_t)

Turnover_trans |> gg_tsdisplay(difference(Turnover_t) , plot_type='partial')
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

The KPSS p-value of 0.1 (the largest value the test reports) means we cannot reject the null hypothesis of stationarity, so after one round of differencing the data are stationary.

Turnover_trans |>  mutate(diff_turn = difference(Turnover_t)) |>
  features(diff_turn, unitroot_kpss)
## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0146         0.1
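As a cross-check (a sketch; for monthly retail data unitroot_nsdiffs would also be worth running), the unit-root feature should agree that one difference is enough:

Turnover_trans |> features(Turnover_t, unitroot_ndiffs)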

Exercise 9.6

Simulate and plot some data from simple ARIMA models.

a. Use the following R code to generate data from an AR(1) model with ϕ1=0.6 and σ2=1. The process starts with y1=0.
set.seed(999)
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
 b. Produce a time plot for the series. How does the plot change as you change ϕ1?
sim |> autoplot()
## Plot variable not specified, automatically selected `.vars = y`

When we change ϕ1 to 0.9 the series becomes much more persistent: it wanders further from zero and reverts more slowly, looking closer to a random walk. (The stationary variance σ2/(1−ϕ1^2) grows as ϕ1 approaches 1, so there is no actual trend, just longer and larger excursions.)

set.seed(999)
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.9*y[i-1] + e[i]
sim2 <- tsibble(idx = seq_len(100), y = y, index = idx)

sim2 |> autoplot()
## Plot variable not specified, automatically selected `.vars = y`
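A sketch comparing several values of ϕ1 on the same noise (the helper names are mine):

library(dplyr)
library(ggplot2)

set.seed(999)
e <- rnorm(100)

# Run the AR(1) recursion for several phi values and stack the results
ar1_sims <- bind_rows(lapply(c(0.1, 0.6, 0.9), function(phi) {
  y <- numeric(100)
  for (i in 2:100) y[i] <- phi * y[i - 1] + e[i]
  tibble(idx = seq_len(100), phi = phi, y = y)
}))

ggplot(ar1_sims, aes(idx, y)) +
  geom_line() +
  facet_grid(phi ~ ., labeller = label_both)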

 c. Write your own code to generate data from an MA(1) model with θ1=0.6 and σ2=1
 
set.seed(999)
y <- numeric(100)
e <- rnorm(100)  # mean 0, sd 1, so sigma^2 = 1
for(i in 2:100)
  y[i] <- e[i] + 0.6*e[i-1]  # MA(1): current error plus lagged error
sim2 <- tsibble(idx = seq_len(100), y = y, index = idx)

sim2 |> autoplot()
## Plot variable not specified, automatically selected `.vars = y`

 d. Produce a time plot for the series. How does the plot change as you change θ1?
 

When we change θ1 to 0.9, consecutive values become more strongly correlated (the lag-1 autocorrelation of an MA(1) is θ1/(1+θ1^2)) and the series variance (1+θ1^2)σ2 increases slightly, so the plot shows somewhat smoother, larger swings. Unlike the AR(1) case, each shock still dies out after one period.

set.seed(999)
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- e[i] + 0.9*e[i-1]
sim2 <- tsibble(idx = seq_len(100), y = y, index = idx)

sim2 |> autoplot()
## Plot variable not specified, automatically selected `.vars = y`

 e. Generate data from an ARMA(1,1) model with ϕ1=0.6, θ1=0.6 and σ2=1
set.seed(999)
y <- numeric(100)
e <- rnorm(100)  # sigma^2 = 1, mean 0
for(i in 2:100)
  y[i] <- 0.6 * y[i-1] + e[i] + 0.6 * e[i-1]
sim3 <- tsibble(idx = seq_len(100), y = y, index = idx)
 f. Generate data from an AR(2) model with ϕ1=−0.8, ϕ2=0.3 and σ2=1 (Note that these parameters will give a non-stationary series.)
 
set.seed(999)
y <- numeric(100)
e <- rnorm(100)
for(i in 3:100)
  y[i] <- -0.8 * y[i-1] + 0.3 * y[i-2] + e[i] 
sim4 <- tsibble(idx = seq_len(100), y = y, index = idx)
 g. Graph the latter two series and compare them.
 
The last model (the AR(2) with ϕ1=−0.8, ϕ2=0.3 and σ2=1) looks very different from the ARMA(1,1). Because these parameters violate the stationarity conditions, the series oscillates with an amplitude that grows over time, while the ARMA(1,1) series fluctuates around a stable mean.
sim3 |> autoplot()
## Plot variable not specified, automatically selected `.vars = y`

sim4 |> autoplot()
## Plot variable not specified, automatically selected `.vars = y`

Exercise 9.7

Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.

a. Use ARIMA() to find an appropriate ARIMA model. What model was selected? Check that the residuals look like white noise. Plot forecasts for the next 10 periods.
airpass_hw <- aus_airpassengers |> filter(Year > 1969 & Year < 2012)
airpass_hw |> autoplot()
## Plot variable not specified, automatically selected `.vars = Passengers`
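The rest of part a is missing above, so here is a sketch (fit_a is my name; on this series ARIMA() typically selects an ARIMA(0,2,1)): fit the automatic model, check the residuals, and plot ten periods of forecasts.

fit_a <- airpass_hw |> model(ARIMA(Passengers))
report(fit_a)

fit_a |> gg_tsresiduals()

fit_a |> forecast(h = 10) |> autoplot(airpass_hw)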

b. Write the model in terms of the backshift operator.

This backshift notation is something I’m still wrapping my head around. Maybe we could go over this in class?
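Until then, a sketch of the answer, assuming part a selects an ARIMA(0,2,1): writing B for the backshift operator (B yt = yt−1), the model is (1 − B)^2 yt = (1 + θ1 B) εt, i.e. the twice-differenced series is an MA(1) in the errors.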

c. Plot forecasts from an ARIMA(0,1,0) model with drift and compare these to part a.
fit <- airpass_hw |> model(ARIMA(Passengers ~ pdq(0,1,0)))

report(fit)
## Series: Passengers 
## Model: ARIMA(0,1,0) w/ drift 
## 
## Coefficients:
##       constant
##         1.3669
## s.e.    0.3319
## 
## sigma^2 estimated as 4.629:  log likelihood=-89.08
## AIC=182.17   AICc=182.48   BIC=185.59
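A sketch of the forecast plot the question asks for: with d = 1 the constant acts as drift, so the point forecasts rise in a straight line with slope equal to the estimated drift (about 1.37 million passengers per year).

fit |> forecast(h = 10) |> autoplot(airpass_hw)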
d. Plot forecasts from an ARIMA(2,1,2) model with drift and compare these to parts a and c. Remove the constant and see what happens.
fit <- airpass_hw |> model(ARIMA(Passengers ~ pdq(2,1,2)))

report(fit)
## Series: Passengers 
## Model: ARIMA(2,1,2) w/ drift 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2  constant
##       1.4694  -0.5103  -1.5736  0.6780    0.0650
## s.e.  0.3780   0.3558   0.3081  0.2688    0.0294
## 
## sigma^2 estimated as 4.748:  log likelihood=-87.74
## AIC=187.47   AICc=189.94   BIC=197.75

We can also make the constant disappear by increasing d to 2, since fable only includes a constant by default when the total amount of differencing is at most one; note, though, that this changes the model as well. The direct way to drop the constant from the ARIMA(2,1,2) is shown after the report.

fit <- airpass_hw |> model(ARIMA(Passengers ~ pdq(2,2,2)))

report(fit)
## Series: Passengers 
## Model: ARIMA(2,2,2) 
## 
## Coefficients:
##           ar1      ar2     ma1      ma2
##       -1.0882  -0.3507  0.0614  -0.7214
## s.e.   0.2148   0.2092  0.1915   0.1706
## 
## sigma^2 estimated as 4.672:  log likelihood=-86.49
## AIC=182.99   AICc=184.75   BIC=191.43
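The direct way to drop the constant (a sketch; fit_nc is my name) keeps pdq(2,1,2) and puts 0 in the model formula. In my experience fable can fail to estimate this particular specification and return a null model with a warning, which seems to be the "see what happens" the exercise is probing.

fit_nc <- airpass_hw |> model(ARIMA(Passengers ~ 0 + pdq(2,1,2)))
report(fit_nc)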
e. Plot forecasts from an ARIMA(0,2,1) model with a constant. What happens?
fit <- airpass_hw |> model(ARIMA(Passengers ~ pdq(0,2,1)))

report(fit)
## Series: Passengers 
## Model: ARIMA(0,2,1) 
## 
## Coefficients:
##           ma1
##       -0.8756
## s.e.   0.0722
## 
## sigma^2 estimated as 4.671:  log likelihood=-87.8
## AIC=179.61   AICc=179.93   BIC=182.99
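Note that the report above shows no constant: with d = 2, fable omits it unless it is requested explicitly. A sketch that forces the constant in (fit_e is my name): fable warns that a constant with two orders of differencing induces a quadratic or higher-order polynomial trend, which is generally discouraged, and the resulting forecasts curve upward.

fit_e <- airpass_hw |> model(ARIMA(Passengers ~ 1 + pdq(0,2,1)))

fit_e |> forecast(h = 10) |> autoplot(airpass_hw)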

Exercise 9.8

For the United States GDP series (from global_economy):

a. if necessary, find a suitable Box-Cox transformation for the data;
usa_df <- global_economy |> filter(Country == 'United States') |> select(Country, Year, GDP)
usa_df |> autoplot()
## Plot variable not specified, automatically selected `.vars = GDP`

usa_df_trans <- usa_df %>% select(Year, GDP) %>% as.data.frame()
bc_trans <- preProcess(usa_df_trans["GDP"], method = c("BoxCox"))
transformed <- predict(bc_trans, usa_df_trans["GDP"])

usa_df_trans$GDP_t <- transformed$GDP

usa_df_trans <- as_tsibble(usa_df_trans, index = 'Year')

usa_df_trans |> autoplot(GDP_t)

b. fit a suitable ARIMA model to the transformed data using ARIMA();
library(urca)

fit <- usa_df_trans |> model(ARIMA(GDP_t))

report(fit)
## Series: GDP_t 
## Model: ARIMA(1,1,0) w/ drift 
## 
## Coefficients:
##          ar1  constant
##       0.4879   10.3092
## s.e.  0.1170    0.8619
## 
## sigma^2 estimated as 45.19:  log likelihood=-188.6
## AIC=383.19   AICc=383.64   BIC=389.32
c. try some other plausible models by experimenting with the orders chosen;
usa_df_trans |> gg_tsdisplay(difference(GDP_t), plot_type='partial')
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

fit_2 <- usa_df_trans %>%
  model(
    "ARIMA(2,1,0)" = ARIMA(GDP_t ~ 1 + pdq(2,1,0)),
    "ARIMA(2,1,1)" = ARIMA(GDP_t ~ 1 + pdq(2,1,1)),
    "ARIMA(1,1,1)" = ARIMA(GDP_t ~ 1 + pdq(1,1,1))       
      )
d. choose what you think is the best model and check the residual diagnostics;

All of these models are surprisingly close in their AIC values; the best AIC still belongs to the model selected for me automatically, the ARIMA(1,1,0) with drift from part b (AIC = 383.19).

glance(fit_2) %>%
  arrange (AIC) %>%
  select(.model:BIC)
## # A tibble: 3 × 6
##   .model       sigma2 log_lik   AIC  AICc   BIC
##   <chr>         <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA(1,1,1)   45.8   -188.  385.  386.  393.
## 2 ARIMA(2,1,0)   45.9   -189.  385.  386.  393.
## 3 ARIMA(2,1,1)   46.6   -188.  387.  388.  397.

The residuals look like white noise, and the histogram of the residuals is approximately normal.

fit |> gg_tsresiduals()
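A formal residual check to back up the plots (a sketch; I set dof = 2 for the AR coefficient and the drift term, though some references count only the ARMA coefficients):

augment(fit) |> features(.innov, ljung_box, lag = 10, dof = 2)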

e. produce forecasts of your fitted model. Do the forecasts look reasonable?
forecast_US <- fit |> forecast(h = 10)
forecast_US |> autoplot(usa_df_trans)

f. compare the results with what you would obtain using ETS() (with no transformation).

It's hard to compare the forecasts directly because the ARIMA model is built on transformed data. The point forecasts look very similar, but the prediction intervals are much wider for the ETS() model with no transformation. I also don't think it's appropriate to compare AIC values across models built on different response scales (transformed vs. untransformed).
fit_3 <- usa_df %>%
  model(ETS(GDP))

forecast_US <- fit_3 |> forecast(h = 10)
forecast_US |> autoplot(usa_df)
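One way to make the comparison fair (a sketch with illustrative names and an arbitrary holdout cutoff): refit both models on a training window, write the Box-Cox transformation into the ARIMA formula so that fable back-transforms the forecasts automatically, and compare test-set accuracy on the original scale.

# Guerrero-selected lambda for the Box-Cox transformation
lambda <- usa_df |>
  features(GDP, features = guerrero) |>
  pull(lambda_guerrero)

train <- usa_df |> filter(Year <= 2007)  # hold out the last decade

fits <- train |>
  model(
    arima = ARIMA(box_cox(GDP, lambda)),
    ets   = ETS(GDP)
  )

fits |> forecast(h = 10) |> accuracy(usa_df)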