HW6 DATA624

Author

Semyon Toybis

Assignment

We are required to complete questions 9.1, 9.2, 9.3, 9.5, 9.6, 9.7, and 9.8 from chapter 9 of “Forecasting: Principles and Practice” Third Edition by Rob Hyndman and George Athanasopoulos.

9.1

A

The plots above are autocorrelation (ACF) plots for three different time series of random numbers, showing the autocorrelation coefficient at various lags. The plots suggest that the data are white noise, as there are no significant autocorrelation spikes at any lag (all the bars fall within the two blue critical-value lines). We should also check that the mean of the residuals is close to zero, which would be consistent with white noise.

B

The critical values are at different distances from zero because the time series have different numbers of observations: for a 95% confidence level, the bounds are ±1.96 divided by the square root of the number of observations, so the more observations there are, the narrower the bounds become.
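In symbols, for a series of length T, the bounds plotted as the blue lines are:

\[ \pm \frac{1.96}{\sqrt{T}} \]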

The autocorrelations differ in each figure because these are three distinct time series, each with its own random fluctuations, which produce different sample autocorrelations at different lags.

9.2

Below is a plot of Amazon’s closing price and the ACF and PACF plots:

amzn_data <- gafa_stock |> filter(Symbol=='AMZN') |> mutate(day = row_number()) |>
  update_tsibble(index = day, regular = TRUE)

amzn_data |> select(Close) |>
  gg_tsdisplay(plot_type = "partial")
Plot variable not specified, automatically selected `y = Close`

The plot of the closing price shows a positive trend in Amazon’s closing stock price. The presence of a trend means the time series is not stationary.

The ACF plot shows autocorrelations that are large, positive, and decay very slowly, which also means the time series is not stationary (for a stationary time series, the ACF drops to zero relatively quickly).
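As a quick check (a sketch, assuming the fpp3 packages are loaded), one could difference the closing price and re-examine the ACF, which should now drop to zero quickly:

amzn_data |>
  mutate(diff_close = difference(Close)) |>  # first difference of the closing price
  ACF(diff_close) |>
  autoplot()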

9.3

A - Turkish GDP

First, I plot the time series:

turkeyGDP <- global_economy |> filter(Country=='Turkey') |> select(GDP) |> mutate(newGDP = GDP/1000000000) |> select(-GDP)


turkeyGDP |> autoplot(newGDP)

Next, I perform a Box-Cox transformation.
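For reference, for positive values y_t, the transformation is log(y_t) when lambda = 0 and otherwise:

\[ w_t = \frac{y_t^{\lambda} - 1}{\lambda} \]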

lambdaTurkeyGDP <- turkeyGDP |>
  features(newGDP, features = guerrero) |>
  pull(lambda_guerrero)

turkeyGDP$BoxCoxGDP <- box_cox(turkeyGDP$newGDP, lambdaTurkeyGDP)

turkeyGDP |> autoplot(BoxCoxGDP)

There is still an upward trend, so I try one order of differencing:

turkeyGDP <- turkeyGDP |> mutate(BoxCoxDifferenced = difference(BoxCoxGDP))

turkeyGDP |> autoplot(BoxCoxDifferenced)
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).

I suspect that the time series is now stationary. I perform a KPSS test:

turkeyGDP |>
  features(BoxCoxDifferenced, unitroot_kpss)
# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1    0.0889         0.1

The reported p-value of 0.1 (the KPSS test caps reported p-values at 0.1) means we cannot reject the null hypothesis of stationarity: the data appear stationary after one order of differencing applied to the Box-Cox transformed data.
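As a cross-check (a sketch; I would expect it to agree with the single difference applied above), the unitroot_ndiffs feature estimates the number of first differences required:

turkeyGDP |>
  features(BoxCoxGDP, unitroot_ndiffs)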

B - Tasmania accommodations takings

First, I plot the time series:

tasmaniaTakings <- aus_accommodation |> filter(State=='Tasmania') |> select(Takings)


tasmaniaTakings |> autoplot(Takings)

There is a positive trend and seasonality, as well as non-constant variance.

Next, I perform a Box-Cox transformation:

lambdaTasmaniaTakings <- tasmaniaTakings |>
  features(Takings, features = guerrero) |>
  pull(lambda_guerrero)

tasmaniaTakings$BoxCoxTakings <- box_cox(tasmaniaTakings$Takings, lambdaTasmaniaTakings)

tasmaniaTakings |> autoplot(BoxCoxTakings)

There is still an upward trend and seasonality, so I try one order of seasonal differencing:

tasmaniaTakings |> gg_tsdisplay(difference(BoxCoxTakings, 4),
                                plot_type='partial', lag=36) +
  labs(title="Seasonally differenced", y="")
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).

This appears stationary, so I implement the seasonal differencing and run a KPSS test.

tasmaniaTakings <- tasmaniaTakings |> mutate(BoxCoxSeasonalDifferenced = difference(BoxCoxTakings, 4))

KPSS test:

tasmaniaTakings |>
  features(BoxCoxSeasonalDifferenced, unitroot_kpss)
# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1     0.196         0.1

The reported p-value of 0.1 again indicates that we cannot reject stationarity: the data appear stationary after one order of seasonal differencing applied to the Box-Cox transformed data.
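As a cross-check (a sketch; it should corroborate the single seasonal difference), the unitroot_nsdiffs feature estimates the number of seasonal differences required:

tasmaniaTakings |>
  features(BoxCoxTakings, unitroot_nsdiffs)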

C - Monthly souvenir sales

First, I plot the time series:

souvenirs |> autoplot(Sales)

There is a positive trend and seasonality, as well as non-constant variance.

Next, I perform a Box-Cox transformation:

lambdaMonthlySales <- souvenirs |>
  features(Sales, features = guerrero) |>
  pull(lambda_guerrero)

souvenirs$BoxCoxSales <- box_cox(souvenirs$Sales, lambdaMonthlySales)

souvenirs |> autoplot(BoxCoxSales)

There is still an upward trend and seasonality, so I try one order of seasonal differencing:

souvenirs |> gg_tsdisplay(difference(BoxCoxSales, 12),
                                plot_type='partial', lag=36) +
  labs(title="Seasonally differenced", y="")
Warning: Removed 12 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 12 rows containing missing values or values outside the scale range
(`geom_point()`).

This looks closer to stationary, so I implement the seasonal differencing and run a KPSS test.

souvenirs <- souvenirs |> mutate(BoxCoxSeasonalDifferenced = difference(BoxCoxSales, 12))

KPSS test:

souvenirs |>
  features(BoxCoxSeasonalDifferenced, unitroot_kpss)
# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1     0.944        0.01

The reported p-value of 0.01 (reported p-values are floored at 0.01) means we reject the null hypothesis of stationarity, so the data are not yet stationary. I try one order of first differencing on the seasonally differenced data:

souvenirs |> gg_tsdisplay(difference(BoxCoxSeasonalDifferenced, 1),
                                plot_type='partial', lag=36) +
  labs(title="Difference of seasonally differenced", y="")
Warning: Removed 13 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 13 rows containing missing values or values outside the scale range
(`geom_point()`).

I implement this transformation and perform a KPSS test:

souvenirs <- souvenirs |> mutate(DifferencedBoxCoxSeasonalDifferenced = difference(BoxCoxSeasonalDifferenced, 1))

KPSS test:

souvenirs |>
  features(DifferencedBoxCoxSeasonalDifferenced, unitroot_kpss)
# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1     0.111         0.1

The reported p-value of 0.1 indicates that the data appear stationary after one order of first differencing on the seasonally differenced, Box-Cox transformed sales data.
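As a cross-check (a sketch), the unit-root features can estimate both orders: the seasonal order from the transformed series, then the first-difference order from the seasonally differenced series:

souvenirs |>
  features(BoxCoxSales, unitroot_nsdiffs)

souvenirs |>
  features(BoxCoxSeasonalDifferenced, unitroot_ndiffs)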

9.5

For this question, I select ‘Liquor retailing’ in ‘New South Wales’:

aus_liquor_NSW <- aus_retail |> filter(Industry=='Liquor retailing', State=='New South Wales') 

aus_liquor_NSW |> autoplot(Turnover)

There is a positive trend, seasonality, and non-constant variance. I start with a Box-Cox transformation.

lambdaLiquorNSW <- aus_liquor_NSW |>
  features(Turnover, features = guerrero) |>
  pull(lambda_guerrero)

aus_liquor_NSW$BoxCoxTurnover <- box_cox(aus_liquor_NSW$Turnover, lambdaLiquorNSW)

aus_liquor_NSW |> autoplot(BoxCoxTurnover)

The variance now seems more constant, but the series is not yet stationary, so I try one order of seasonal differencing:

aus_liquor_NSW |> gg_tsdisplay(difference(BoxCoxTurnover, 12),
                                plot_type='partial', lag=36) +
  labs(title="Seasonally differenced", y="")
Warning: Removed 12 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 12 rows containing missing values or values outside the scale range
(`geom_point()`).

The ACF plot suggests the data are still not stationary. I implement the seasonal difference and then apply one order of first differencing to the seasonally differenced data:

aus_liquor_NSW <- aus_liquor_NSW |> mutate(BoxCoxSeasonalDifferenced = difference(BoxCoxTurnover, 12))

aus_liquor_NSW <- aus_liquor_NSW |> mutate(DifferencedBoxCoxSeasonalDifferenced = difference(BoxCoxSeasonalDifferenced, 1))


aus_liquor_NSW |> gg_tsdisplay(DifferencedBoxCoxSeasonalDifferenced,
                                plot_type='partial', lag=36) +
  labs(title="Difference of Seasonally differenced", y="")
Warning: Removed 13 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 13 rows containing missing values or values outside the scale range
(`geom_point()`).

I perform a KPSS test:

aus_liquor_NSW |>
  features(DifferencedBoxCoxSeasonalDifferenced, unitroot_kpss)
# A tibble: 1 × 4
  State           Industry         kpss_stat kpss_pvalue
  <chr>           <chr>                <dbl>       <dbl>
1 New South Wales Liquor retailing    0.0152         0.1

The reported p-value of 0.1 means we cannot reject the null of stationarity: the data appear stationary after one order of first differencing on the seasonally differenced, Box-Cox transformed turnover data.

9.6

A

I use the code provided to generate the data, and I also create two alternative versions of y that I will use later:

y <- numeric(100)
y2 <- y
y3 <- y
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
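Note that rnorm() draws new random values on each run, so the plots below will vary; to make the simulations reproducible, one would fix a seed before generating e (a sketch; the seed value is arbitrary):

set.seed(624)  # arbitrary seed, chosen for illustration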

B

I produce a time plot of the series:

sim |> autoplot(y) + labs(title = 'phi 0.6')

Next, I create two alternative data sets with different phi values: 0.3 and 0.9:

for(i in 2:100)
  y2[i] <- 0.3*y2[i-1] + e[i]
sim2 <- tsibble(idx = seq_len(100), y = y2, index = idx)

sim2 |> autoplot(y)  + labs(title = 'phi 0.3')

for(i in 2:100)
  y3[i] <- 0.9*y3[i-1] + e[i]
sim3 <- tsibble(idx = seq_len(100), y = y3, index = idx)

sim3 |> autoplot(y) + labs(title = 'phi 0.9')

The closer phi is to 1, the closer the series comes to a random walk. A higher phi places more weight on recent values, producing a smoother, more persistent series that wanders further from zero than the series with lower phi values.
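This matches the variance of a stationary AR(1) process, which grows without bound as phi approaches 1 (sigma squared is the error variance):

\[ \operatorname{Var}(y_t) = \frac{\sigma^2}{1 - \phi_1^2} \]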

C, D

I leverage the code above to generate data from an MA(1) model with theta equal to 0.6 and sigma squared = 1:

y_ma <- numeric(100)
y2_ma <- y_ma
y3_ma <- y_ma
e_ma <- rnorm(100)
for(i in 2:100)
  y_ma[i] <- e_ma[i] + 0.6 * e_ma[i-1]
sim_ma <- tsibble(idx = seq_len(100), y = y_ma, index = idx)

sim_ma |> autoplot(y)  + labs(title = 'theta 0.6')

Next, I try creating a time series with two different theta coefficients: 0.3 and 0.9

for(i in 2:100)
  y2_ma[i] <- e_ma[i] + 0.3 * e_ma[i-1]
sim2_ma <- tsibble(idx = seq_len(100), y = y2_ma, index = idx)

sim2_ma |> autoplot(y) + labs(title = 'theta 0.3')

for(i in 2:100)
  y3_ma[i] <- e_ma[i] + 0.9 * e_ma[i-1]
sim3_ma <- tsibble(idx = seq_len(100), y = y3_ma, index = idx)

sim3_ma |> autoplot(y) + labs(title = 'theta 0.9')

The series with the higher theta appears to have more variance than those with lower values (the plot with theta 0.9 has larger peak-to-trough swings than the other plots).
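This also matches theory: the variance of an MA(1) process is

\[ \operatorname{Var}(y_t) = (1 + \theta_1^2)\,\sigma^2 \]

so a larger theta yields a larger variance.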

E, F, G - ARMA(1,1), AR(2) and plot

I generate data from an ARMA(1,1) model with ϕ1=0.6, θ1=0.6 and σ2=1, and from an AR(2) model with ϕ1=−0.8, ϕ2=0.3 and σ2=1:

z <- numeric(100)
y_arma <- z
y_ar2 <- z
e_2 <- rnorm(100)

for(i in 2:100) {
  y_arma[i] <- 0.6*y_arma[i-1] + e_2[i] + 0.6*e_2[i-1]
}

for(i in 3:100) {
  y_ar2[i] <- -0.8*y_ar2[i-1] + 0.3*y_ar2[i-2] + e_2[i]
}



sim_arma <- tsibble(idx = seq_len(100), y = y_arma, index = idx)
sim_ar2 <- tsibble(idx = seq_len(100), y = y_ar2, index = idx)

combined <- full_join(as_tibble(sim_arma), as_tibble(sim_ar2), by = 'idx', suffix = c('arma','ar2'))

combined |>
  pivot_longer(cols = c(yarma, yar2), names_to = 'model') |>
  ggplot(aes(x = idx, y = value, color = model)) +
  geom_line()

With ϕ1 = −0.8 and ϕ2 = 0.3, the stationarity conditions for an AR(2) process are violated, so the AR(2) series oscillates with steadily growing amplitude, while the ARMA(1,1) series remains stationary and fluctuates around zero. (The AR(2) loop also starts one index later, at i = 3, since it needs two lagged values.)
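For reference, a stationary AR(2) requires:

\[ -1 < \phi_2 < 1, \qquad \phi_1 + \phi_2 < 1, \qquad \phi_2 - \phi_1 < 1 \]

Here ϕ2 − ϕ1 = 0.3 − (−0.8) = 1.1 > 1, violating the third condition.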

9.7 - aus_airpassengers

First I plot the time series

aus_airpassengers |> autoplot()
Plot variable not specified, automatically selected `.vars = Passengers`

A

I fit an ARIMA model and check the residuals.

ausair_fit <- aus_airpassengers |>
  model(ARIMA(Passengers))

ausair_fit |> gg_tsresiduals()

The residuals look like white noise: there are no significant autocorrelations, the residuals are centered around zero (though there are some outliers), and there is no discernible pattern in the residuals. Below are the model details:

report(ausair_fit)
Series: Passengers 
Model: ARIMA(0,2,1) 

Coefficients:
          ma1
      -0.8963
s.e.   0.0594

sigma^2 estimated as 4.308:  log likelihood=-97.02
AIC=198.04   AICc=198.32   BIC=201.65

Next, I generate the forecast for the next ten periods:

ausair_fit |> forecast(h=10) |> autoplot(aus_airpassengers)

B

The model is an ARIMA(0,2,1) with a theta of -0.8963. The equation in terms of the backshift operator is below:

\[ (1 - B)^2 X_t = (1 - 0.8963 B) \epsilon_t \]

C, D

We are asked to plot forecasts from an ARIMA(0,1,0) model with drift and an ARIMA(2,1,2) model with drift, and to compare these with the automatically selected model from part A.

ausair_fit_several <- aus_airpassengers |>
  model(auto = ARIMA(Passengers),
        ARIMA010 = ARIMA(Passengers ~ 1 + pdq(0,1,0)),
        ARIMA212 = ARIMA(Passengers ~ 1 + pdq(2,1,2)))

Next, I plot the forecasts for the models:

ausair_fit_several |> forecast(h=10) |> autoplot(aus_airpassengers, level = NULL)

The ARIMA(0,1,0) with drift and ARIMA(2,1,2) with drift models generate more moderate forecasts than the automatically selected ARIMA(0,2,1) model without drift.

I also try removing the constants:

aus_airpassengers |>
  model(auto = ARIMA(Passengers),
        ARIMA010 = ARIMA(Passengers ~ pdq(0,1,0)),
        ARIMA212 = ARIMA(Passengers ~ 0 + pdq(2,1,2))) |>
  forecast(h=10) |> autoplot(aus_airpassengers, level = NULL)
Warning: 1 error encountered for ARIMA212
[1] non-stationary AR part from CSS
Warning: Removed 10 rows containing missing values or values outside the scale range
(`geom_line()`).

The ARIMA(2,1,2) model without a constant fails to estimate: the error above reports a non-stationary AR part from CSS, so no forecasts are generated for that model.

E

We are tasked with plotting forecasts from an ARIMA(0,2,1) model with a constant. Note that including a constant when the order of differencing is greater than one is discouraged, as it induces a quadratic or higher-order polynomial trend. I also include the automatically selected ARIMA for comparison purposes.
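With d = 2 and a constant c, the model is

\[ (1 - B)^2 y_t = c + (1 + \theta_1 B)\,\epsilon_t \]

and integrating the constant twice produces the quadratic trend.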

aus_airpassengers |>
  model(auto = ARIMA(Passengers),
        ARIMA021Constant = ARIMA(Passengers ~ 1 + pdq(0,2,1))) |>
  forecast(h=10) |> autoplot(aus_airpassengers, level = NULL)
Warning: Model specification induces a quadratic or higher order polynomial trend. 
This is generally discouraged, consider removing the constant or reducing the number of differences.

The forecast from the model with a constant increases at a faster rate than that of the automatically selected ARIMA(0,2,1) without a constant.

9.8

First, I plot the series

usGDP <- global_economy |> filter(Country == 'United States') |> select(GDP) |> mutate(GDP_in_billions = GDP/ 1000000000 )

usGDP |> autoplot(GDP_in_billions)

A

We are tasked with finding a suitable Box-Cox transformation if necessary.

lambdaUSGDP <- usGDP |>
  features(GDP_in_billions, features = guerrero) |>
  pull(lambda_guerrero)

usGDP$BoxCoxGDP <- box_cox(usGDP$GDP_in_billions, lambdaUSGDP)

usGDP |> autoplot(BoxCoxGDP)

The variance seems more stable, so I will continue with the Box-Cox transformed data. The lambda value is:

lambdaUSGDP
[1] 0.2819443

B, C

We are tasked with fitting an automatically selected ARIMA model as well as other plausible models. I start by identifying a model manually.

The time series needs to be differenced to achieve stationarity. Below is what one order of differencing looks like:

usGDP |> autoplot(difference(BoxCoxGDP))
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).

I conduct a KPSS test to check for stationarity:

usGDP |> mutate(diffBoxCoxGDP = difference(BoxCoxGDP)) |>
  features(diffBoxCoxGDP, unitroot_kpss)
# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1     0.208         0.1

The reported p-value of 0.1 indicates that we cannot reject the null hypothesis of stationarity: the data appear stationary after one order of differencing.

Next, I plot the differenced time series and examine the ACF and PACF plots:

usGDP |> mutate(diffBoxCoxGDP = difference(BoxCoxGDP)) |>
  gg_tsdisplay(diffBoxCoxGDP, plot_type = "partial")
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

The last significant partial autocorrelation is at lag 1, which suggests an AR term of order 1. Combined with the single order of differencing, the manually selected model is an ARIMA(1,1,0).

Next, I fit the manual model and an automatically selected model to the data:

usGDP_fit <- usGDP |>
  model(arima110 = ARIMA(BoxCoxGDP ~ 0 + pdq(1,1,0)),
        auto = ARIMA(BoxCoxGDP, stepwise = FALSE))

usGDP_fit
# A mable: 1 x 2
        arima110                    auto
         <model>                 <model>
1 <ARIMA(1,1,0)> <ARIMA(1,1,0) w/ drift>

Interestingly, the automatically selected ARIMA model has the same order as the manually selected model, though it includes a constant. When the constant is non-zero and d is equal to one, the long-term forecasts will follow a straight line.
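In backshift notation, the selected ARIMA(1,1,0) with drift can be written as (with c the constant reported below):

\[ (1 - \phi_1 B)(1 - B)\, y_t = c + \epsilon_t \]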

Below are the details of the ARIMA models:

usGDP_fit |> select(arima110) |> report()
Series: BoxCoxGDP 
Model: ARIMA(1,1,0) 

Coefficients:
         ar1
      0.9287
s.e.  0.0428

sigma^2 estimated as 0.05921:  log likelihood=-0.81
AIC=5.61   AICc=5.83   BIC=9.7
usGDP_fit |> select(auto) |> report()
Series: BoxCoxGDP 
Model: ARIMA(1,1,0) w/ drift 

Coefficients:
         ar1  constant
      0.4586    0.3428
s.e.  0.1198    0.0276

sigma^2 estimated as 0.0461:  log likelihood=7.72
AIC=-9.43   AICc=-8.98   BIC=-3.3

D

Above, we see that the AICc for the automatically selected model is lower than that of the manual model, suggesting it is the better model.

I check the residual diagnostics of both models below:

usGDP_fit |> select(arima110) |> gg_tsresiduals()

usGDP_fit |> select(auto) |> gg_tsresiduals()

Residuals from both models appear to be white noise. The manually selected model has one significant autocorrelation spike, but with 95% bounds roughly one spike in twenty lags is expected by chance. We can confirm with a Ljung-Box test using lag = 10:

usGDP_fit |> select(arima110) |> augment() |> features(.innov, ljung_box, lag = 10)
# A tibble: 1 × 3
  .model   lb_stat lb_pvalue
  <chr>      <dbl>     <dbl>
1 arima110    10.8     0.370
usGDP_fit |> select(auto) |> augment() |> features(.innov, ljung_box, lag = 10)
# A tibble: 1 × 3
  .model lb_stat lb_pvalue
  <chr>    <dbl>     <dbl>
1 auto      3.81     0.955

Both have large p-values, so we can conclude that the residuals are not distinguishable from white noise.

E

The automatically fitted model has a lower AICc, so I will use it to make the forecast.

usGDP_fit |> select(auto) |> forecast(h=10) |> autoplot(usGDP)

The forecast seems reasonable based on the observed values of the time series: it continues the upward trend in GDP.

F

We are tasked with comparing the ARIMA model with an ETS model on the non-transformed data. I adapt the textbook code to compare the models using time series cross-validation:

usGDP |> slice(-n()) |> stretch_tsibble(.init = 10) |>
  model(
    ETS(GDP_in_billions),
    ARIMA(GDP_in_billions)
  ) |> forecast(h=1) |>
  accuracy(usGDP) |> select(.model, RMSE:MAPE)
# A tibble: 2 × 5
  .model                  RMSE   MAE   MPE  MAPE
  <chr>                  <dbl> <dbl> <dbl> <dbl>
1 ARIMA(GDP_in_billions)  223.  137. 0.512  1.74
2 ETS(GDP_in_billions)    194.  129. 0.547  1.74

When modeling the non-transformed GDP data, the ETS model is more accurate, with lower RMSE and MAE (the MAPE values are essentially identical).

Below, I generate a forecast for ten years using the ETS model:

usGDP |> model(ETS(GDP_in_billions)) |> forecast(h=10) |> autoplot(usGDP)

Below is what the forecast for the ARIMA model looks like:

usGDP |> model(ARIMA(GDP_in_billions)) |> forecast(h=10) |> autoplot(usGDP)