Objective

Assignment 6 involves answering questions 9.1, 9.2, 9.3, 9.5, 9.6, 9.7, and 9.8 from the textbook Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos.
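
All of the code in this document assumes the fpp3 meta-package is attached; it loads tsibble, feasts, and fable along with the textbook datasets used below (gafa_stock, global_economy, aus_accommodation, souvenirs, aus_retail, aus_airpassengers).

# Attach fpp3: tsibble/feasts/fable plus the textbook datasets
library(fpp3)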

Question 9.1

Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.

  1. Explain the differences among these figures. Do they all indicate that the data are white noise?

  2. Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

Part A

What is immediately noticeable is that the widths of the confidence bounds vary between the three figures, which reflects the difference in sample size for each plot.

The left and middle plots don't display any significant autocorrelations: all the values lie within the bounds and show no pattern. Those series therefore resemble white noise.

The plot on the right shows spikes at lags 5, 10, and 15, suggesting a pattern in the data. This series displays some autocorrelation and does not resemble white noise.

Part B

Smaller samples show more variability in their estimated autocorrelations, which is indicated by wider confidence bounds: the critical values lie at ±1.96/√T, where T is the number of observations, so the bounds shrink toward zero as the sample size grows. The estimated autocorrelations in the larger samples are correspondingly closer to zero.

The autocorrelations differ between the figures, even though each series is white noise, because each plot is computed from a different random sample; sampling variation alone produces different (small) autocorrelation estimates.
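
As a quick arithmetic check (my own sketch, not part of the original figures), the bounds for the three sample sizes can be computed directly:

# Critical values at +/- 1.96/sqrt(T) for T = 36, 360 and 1,000
# (roughly +/- 0.33, 0.10 and 0.06)
qnorm(0.975) / sqrt(c(36, 360, 1000))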

Question 9.2

A classic example of a non-stationary series are stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

amzn_stock <- gafa_stock %>%
  filter(Symbol == "AMZN") %>%
  fill_gaps() %>%
  as_tsibble(index = Date)
autoplot(amzn_stock, Close)

In the time plot, we can observe an upward trend, a sign of non-stationarity.

Looking at the ACF plot, we can see that the autocorrelations start high and decay very slowly. A stationary series would show only small, insignificant fluctuations, unlike what we see here. This consistent, slowly decaying pattern tells us the series does not behave like white noise and is non-stationary.

In the PACF plot we notice a huge spike at lag 1, which suggests a trend and is another sign of non-stationarity.

Differencing would address the strong upward trend in the data and the patterns noted above.

amzn_stock %>%
  gg_tsdisplay(Close, plot_type = 'partial', lag = 180)
## Warning: Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
## Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
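
As an additional check (my addition, not in the original output), the KPSS-based unitroot_ndiffs feature estimates how many first differences are needed:

# Suggested number of first differences for stationarity (sketch; if
# fill_gaps() introduced missing prices, they may need filling first)
amzn_stock %>%
  features(Close, unitroot_ndiffs)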

Question 9.3

For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.

  1. Turkish GDP from global_economy.
  2. Accommodation takings in the state of Tasmania from aus_accommodation.
  3. Monthly sales from souvenirs.

Turkish GDP

In the Turkish GDP data we can see a clear upward trend, and there are significant autocorrelations in the ACF and PACF plots.

turkish_gdp <- global_economy %>% 
  filter(Code == "TUR") 

autoplot(turkish_gdp)
## Plot variable not specified, automatically selected `.vars = GDP`

turkish_gdp %>%
   gg_tsdisplay(GDP, plot_type = 'partial', lag = 5) +
  labs(title = "Original Series", y = "")

lambda <- turkish_gdp %>%
  features(GDP, features = guerrero) %>%
  pull(lambda_guerrero)

diff_series <- turkish_gdp %>%
    mutate(GDP_diff = difference(box_cox(GDP, lambda), lag = 1))

In the transformed and differenced series, the trend has been removed and the autocorrelations are no longer significant.

diff_series %>%
  gg_tsdisplay(GDP_diff, plot_type = 'partial', lag = 5) +
  labs(title = "Differenced Series", y = "")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
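
A unit-root test (my addition) backs this up: after the Box-Cox transformation and one difference, the KPSS test should no longer reject stationarity.

# KPSS test on the differenced series; a p-value above 0.05 is
# consistent with stationarity
diff_series %>%
  features(GDP_diff, unitroot_kpss)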

Tasmania Accommodation

We can observe both an upward trend and the presence of seasonality. There are also many significant autocorrelations.

tas_accommodation <- aus_accommodation %>%
    filter(State == "Tasmania")

autoplot(tas_accommodation, Takings)

tas_accommodation %>%
  gg_tsdisplay(Takings, plot_type = 'partial', lag = 20) +
  labs(title = "Original Series", y = "")

lambda2 <- tas_accommodation %>%
  features(Takings, features = guerrero) %>%
  pull(lambda_guerrero)

diff_series2 <- tas_accommodation %>%
    mutate(takings_diff = difference(box_cox(Takings, lambda2), lag = 1))

It seems that while the first difference removes the trend, significant autocorrelations remain and the seasonal pattern is still visible, so a seasonal difference is needed as well (see the sketch below).

diff_series2 %>%
  gg_tsdisplay(takings_diff, plot_type = 'partial', lag = 20) +
  labs(title = "Differenced Series", y = "")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
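
Since the data are quarterly, a seasonal difference at lag 4 on top of the first difference should remove the remaining pattern; a sketch:

# Seasonal difference (lag 4, quarterly) after the Box-Cox
# transformation and first difference
diff_series2 %>%
  mutate(takings_sdiff = difference(takings_diff, lag = 4)) %>%
  gg_tsdisplay(takings_sdiff, plot_type = 'partial', lag = 20)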

Monthly sales from souvenirs

autoplot(souvenirs)
## Plot variable not specified, automatically selected `.vars = Sales`

The series shows a growing trend and strong seasonality, with seasonal swings that increase with the level of the series.

souvenirs %>%
  gg_tsdisplay(Sales, plot_type = 'partial', lag = 60) +
  labs(title = "Original Series", y = "")

lambda3 <- souvenirs %>%
  features(Sales, features = guerrero) %>%
  pull(lambda_guerrero)

diff_series3 <- souvenirs %>%
    mutate(sales_diff = difference(box_cox(Sales, lambda3), lag = 1))

This transformation and first difference removed the upward trend; however, seasonality is still present and some autocorrelations remain significant. A seasonal difference at lag 12 would be the next step (see the sketch below).

diff_series3 %>%
  gg_tsdisplay(sales_diff, plot_type = 'partial', lag = 20) +
  labs(title = "Differenced Series", y = "")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
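
As with the Tasmanian takings, a seasonal difference (lag 12 for monthly data) on top of the first difference should handle the remaining seasonality; a sketch:

# Seasonal difference (lag 12, monthly) after the Box-Cox
# transformation and first difference
diff_series3 %>%
  mutate(sales_sdiff = difference(sales_diff, lag = 12)) %>%
  gg_tsdisplay(sales_sdiff, plot_type = 'partial', lag = 24)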

Question 9.5

For your retail data (from Exercise 7 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.

Response

Observing the retail data, there is both an upward trend and seasonality. Looking at the peaks, we can see that after 2005 they grow larger, so more than one round of differencing may be needed.

set.seed(12345678)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))
autoplot(myseries,Turnover)

myseries_five <- myseries %>%
  filter(year(Month) <= 1993)

autoplot(myseries_five,Turnover)

I performed two rounds of differencing: one to remove the trend and a second, seasonal difference to remove the seasonal pattern. The trend was successfully removed, but some significant autocorrelations remain, such as the spikes at lags 1 and 12, so some seasonal structure may still be present; the unit-root check after the plots gives a formal confirmation.

# differencing to remove trend
first_diff <- myseries %>%
  mutate(turnover_diff = difference(Turnover, lag = 1))

# differencing to remove seasonality
second_diff <- first_diff %>%
  mutate(turnover_diff2 = difference(turnover_diff, lag = 12))
second_diff %>%
  gg_tsdisplay(turnover_diff2, plot_type = 'partial', lag = 60) +
  labs(title = "Differenced Series", y = "")
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).
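
As a formal check (my addition), the unit-root features suggest how many seasonal and ordinary differences are required:

# Suggested number of seasonal differences
myseries %>%
  features(Turnover, unitroot_nsdiffs)

# Suggested number of first differences after seasonal differencing
# (drop the leading NAs created by difference() if needed)
myseries %>%
  mutate(turnover_sdiff = difference(Turnover, lag = 12)) %>%
  features(turnover_sdiff, unitroot_ndiffs)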

Question 9.6

Simulate and plot some data from simple ARIMA models

  a. Use the following R code to generate data from an AR(1) model with ϕ1 = 0.6 and σ^2 = 1. The process starts with y1 = 0.

Response

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
autoplot(sim)
## Plot variable not specified, automatically selected `.vars = y`

  b. Produce a time plot for the series. How does the plot change as you change ϕ1?

Response

Comparing ϕ1 = 0.3 with ϕ1 = 0.9, the series with 0.9 is much smoother and more persistent, with long runs away from zero, while with 0.3 the series looks much closer to white noise.

simulate_ar1 <- function(phi1,n = 100) {
  y <- numeric(n)
  e <- rnorm(n)
  for(i in 2:n) {
    y[i] <- phi1 * y[i - 1] + e[i]
  }
  tsibble(idx = seq_len(n), y = y, index = idx)
}

sim1 <- simulate_ar1(phi1 = 0.3)


autoplot(sim1) + ggtitle("Time Plot for AR(1) with phi1 = 0.3")
## Plot variable not specified, automatically selected `.vars = y`

sim2 <- simulate_ar1(phi1 = 0.9)
autoplot(sim2) + ggtitle("Time Plot for AR(1) with phi1 = 0.9")
## Plot variable not specified, automatically selected `.vars = y`

  c. Write your own code to generate data from an MA(1) model with θ1 = 0.6 and σ^2 = 1.

Response

theta1 <- 0.6
sigma2 <- 1
n <- 100  

e <- rnorm(n, mean = 0, sd = sqrt(sigma2))
y <- numeric(n)
y[1] <- e[1]  # first value has no preceding error term

for(i in 2:n) {
  y[i] <- e[i] + theta1 * e[i-1]
}


sim_ma1 <- tsibble(idx = seq_len(n), y = y, index = idx)

  d. Produce a time plot for the series. How does the plot change as you change θ1?

As θ1 increases, the fluctuations become more pronounced: the variance of an MA(1) process is (1 + θ1^2)σ^2, so larger θ1 produces larger swings, and neighbouring values move together more because the lag-1 autocorrelation θ1/(1 + θ1^2) increases for 0 < θ1 < 1.

# theta at 0.6
autoplot(sim_ma1)
## Plot variable not specified, automatically selected `.vars = y`

#theta at 0.1
theta1 <- 0.1
sigma2 <- 1
n <- 100  

e <- rnorm(n, mean = 0, sd = sqrt(sigma2))
y <- numeric(n)
y[1] <- e[1]  # first value has no preceding error term

for(i in 2:n) {
  y[i] <- e[i] + theta1 * e[i-1]
}


sim_ma2 <- tsibble(idx = seq_len(n), y = y, index = idx)

autoplot(sim_ma2)
## Plot variable not specified, automatically selected `.vars = y`

#theta at 0.9
theta1 <- 0.9
sigma2 <- 1
n <- 100  

e <- rnorm(n, mean = 0, sd = sqrt(sigma2))
y <- numeric(n)
y[1] <- e[1]  # first value has no preceding error term

for(i in 2:n) {
  y[i] <- e[i] + theta1 * e[i-1]
}


sim_ma3 <- tsibble(idx = seq_len(n), y = y, index = idx)

autoplot(sim_ma3)
## Plot variable not specified, automatically selected `.vars = y`

  e. Generate data from an ARMA(1,1) model with ϕ1 = 0.6, θ1 = 0.6 and σ^2 = 1.

Response

phi1 <- 0.6  
theta1 <- 0.6  
sigma2 <- 1  
n <- 100  


e <- rnorm(n, mean = 0, sd = sqrt(sigma2))
y <- numeric(n)
y[1] <- e[1]  # start from the first error (y0 = 0, e0 = 0 assumed)

for(i in 2:n) {
  y[i] <- phi1 * y[i - 1] + e[i] + theta1 * e[i - 1]
}

sim_arma <- tsibble(idx = seq_len(n), y = y, index = idx)

  f. Generate data from an AR(2) model with ϕ1 = −0.8, ϕ2 = 0.3 and σ^2 = 1.

phi1 <- -0.8  
phi2 <- 0.3   
sigma2 <- 1   
n <- 100      

e <- rnorm(n, mean = 0, sd = sqrt(sigma2))
y <- numeric(n)
y[1] <- e[1]  
y[2] <- e[2]  

for(i in 3:n) {
  y[i] <- phi1 * y[i - 1] + phi2 * y[i - 2] + e[i]
}

sim_ar2 <- tsibble(idx = seq_len(n), y = y, index = idx)

  g. Graph the latter two series and compare them.

The ARMA(1,1) series is stable and displays consistent fluctuations around zero. In the AR(2) series, the values stay close to zero for the first half of the series and then start to oscillate with growing amplitude towards the end. This is expected: the parameters violate the stationarity conditions for an AR(2) process (here ϕ2 − ϕ1 = 1.1 > 1), so the process is non-stationary and its oscillations grow over time.

autoplot(sim_arma)
## Plot variable not specified, automatically selected `.vars = y`

autoplot(sim_ar2)
## Plot variable not specified, automatically selected `.vars = y`

Question 9.7

Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.

  a. Use ARIMA() to find an appropriate ARIMA model. What model was selected? Check that the residuals look like white noise. Plot forecasts for the next 10 periods.

Response

Using the ARIMA() function for automatic selection, ARIMA(0,2,1) was chosen as the most appropriate model.

aus <- aus_airpassengers %>%
  filter(Year <= 2011)
fit <- aus %>%
  model(arima = ARIMA(Passengers))

report(fit)
## Series: Passengers 
## Model: ARIMA(0,2,1) 
## 
## Coefficients:
##           ma1
##       -0.8756
## s.e.   0.0722
## 
## sigma^2 estimated as 4.671:  log likelihood=-87.8
## AIC=179.61   AICc=179.93   BIC=182.99

Looking at the residuals, there are no significant autocorrelations and the histogram appears close to a normal distribution. There are hints of a pattern in the ACF, but nothing exceeds the bounds, so the residuals are consistent with white noise; the Ljung-Box test below supports this.

fit %>%
  gg_tsresiduals()
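
A Ljung-Box test (my addition) gives a formal check; a large p-value supports the white-noise hypothesis:

# Ljung-Box test on the innovation residuals; dof = 1 for the single
# MA coefficient
fit %>%
  augment() %>%
  features(.innov, ljung_box, lag = 10, dof = 1)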

forecasts <- fit %>% forecast(h = 10)

autoplot(forecasts, aus)

  b. Write the model in terms of the backshift operator.

Response

\[ (1 - B)^2 y_t = (1 - 0.8756 B) e_t \]

  c. Plot forecasts from an ARIMA(0,1,0) model with drift and compare these to part a.

The ARIMA(0,2,1) model from part a has a lower AIC and BIC than the ARIMA(0,1,0) model with drift (179.61 vs 182.17 and 182.99 vs 185.59, respectively), so the automatically selected model still fits better. The drift forecasts are plotted below the model report.

fit_drift <- aus %>%
  model(ARIMA(Passengers ~ pdq(0,1,0)))

report(fit_drift)
## Series: Passengers 
## Model: ARIMA(0,1,0) w/ drift 
## 
## Coefficients:
##       constant
##         1.3669
## s.e.    0.3319
## 
## sigma^2 estimated as 4.629:  log likelihood=-89.08
## AIC=182.17   AICc=182.48   BIC=185.59
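
For the comparison, the drift-model forecasts can be plotted the same way as in part a (a sketch):

fit_drift %>%
  forecast(h = 10) %>%
  autoplot(aus)
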
  d. Plot forecasts from an ARIMA(2,1,2) model with drift and compare these to parts a and c. Remove the constant and see what happens.

Response

In comparison to part c, the ARIMA(2,1,2) model has a higher AIC (187.47) and BIC (197.75), telling us that it doesn't fit as well as the ARIMA(0,1,0) model with drift.

The ARIMA(0,2,1) model from part a is still the most parsimonious, and it has the lowest AIC and BIC of the three.

fit_drift_2 <- aus %>%
  model(ARIMA(Passengers ~ pdq(2,1,2)))

report(fit_drift_2)
## Series: Passengers 
## Model: ARIMA(2,1,2) w/ drift 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2  constant
##       1.4694  -0.5103  -1.5736  0.6780    0.0650
## s.e.  0.3780   0.3558   0.3081  0.2688    0.0294
## 
## sigma^2 estimated as 4.748:  log likelihood=-87.74
## AIC=187.47   AICc=189.94   BIC=197.75
forecast <- fit_drift_2 %>%
  forecast(h=10)

autoplot(forecast, aus)

Removing the constant produces an error: the CSS estimation finds a non-stationary AR part, so no model is returned (a NULL model).

fit_arima_no_constant <- aus %>%
  model(ARIMA(Passengers ~ 0 + pdq(2,1,2)))
## Warning: 1 error encountered for ARIMA(Passengers ~ 0 + pdq(2, 1, 2))
## [1] non-stationary AR part from CSS
report(fit_arima_no_constant)
## Series: Passengers 
## Model: NULL model 
## NULL model

  e. Plot forecasts from an ARIMA(0,2,1) model with a constant. What happens?

Response

When fitting the model, we get a warning telling us to remove the constant or reduce the number of differences: with two differences, a constant induces a quadratic (or higher order) polynomial trend in the forecasts, which is generally discouraged.

fit_arima_021 <- aus %>%
  model(ARIMA(Passengers ~ 1 + pdq(0,2,1)))  
## Warning: Model specification induces a quadratic or higher order polynomial trend. 
## This is generally discouraged, consider removing the constant or reducing the number of differences.
report(fit_arima_021)
## Series: Passengers 
## Model: ARIMA(0,2,1) w/ poly 
## 
## Coefficients:
##           ma1  constant
##       -1.0000    0.0721
## s.e.   0.0709    0.0260
## 
## sigma^2 estimated as 4.086:  log likelihood=-85.74
## AIC=177.48   AICc=178.15   BIC=182.55
forecasts_021 <- fit_arima_021 %>%
forecast(h = 10)

autoplot(forecasts_021, aus)

Question 9.8

For the United States GDP series (from global_economy):

  a. if necessary, find a suitable Box-Cox transformation for the data.

Response

We can see that there is an upward trend. The Box-Cox transformation alone doesn't make the data stationary, but it does reduce some of the fluctuation.

gsp <- global_economy %>%
  filter(Code == "USA") %>%
  select(GDP)
autoplot(gsp)
## Plot variable not specified, automatically selected `.vars = GDP`

lambda <- gsp %>% features(GDP, features = guerrero) %>% pull(lambda_guerrero)

gsp %>% autoplot(box_cox(GDP, lambda))

  b. fit a suitable ARIMA model to the transformed data using ARIMA().

Response

arima_bc <- gsp %>%
  model(ARIMA(box_cox(GDP, lambda)))

report(arima_bc)
## Series: GDP 
## Model: ARIMA(1,1,0) w/ drift 
## Transformation: box_cox(GDP, lambda) 
## 
## Coefficients:
##          ar1  constant
##       0.4586  118.1822
## s.e.  0.1198    9.5047
## 
## sigma^2 estimated as 5479:  log likelihood=-325.32
## AIC=656.65   AICc=657.1   BIC=662.78
forecasts <- arima_bc %>%
  forecast(h = 10) 

autoplot(forecasts, gsp)

  c. try some other plausible models by experimenting with the orders chosen.

Response

fit_models <- gsp %>%
  model(
    arima = ARIMA(GDP),
    arima112 = ARIMA(GDP ~ pdq(1,1,2)),
    arima021 = ARIMA(GDP ~ pdq(0,2,1)),
    arima012 = ARIMA(GDP ~ pdq(0,1,2)),
    arima222 = ARIMA(GDP ~ pdq(2,2,2))
  )
## Warning: It looks like you're trying to fully specify your ARIMA model but have not said if a constant should be included.
## You can include a constant using `ARIMA(y~1)` to the formula or exclude it by adding `ARIMA(y~0)`.
## Warning: 1 error encountered for arima112
## [1] Could not find an appropriate ARIMA model.
## This is likely because automatic selection does not select models with characteristic roots that may be numerically unstable.
## For more details, refer to https://otexts.com/fpp3/arima-r.html#plotting-the-characteristic-roots
accuracy(fit_models)
## # A tibble: 5 × 10
##   .model   .type           ME      RMSE       MAE     MPE   MAPE    MASE   RMSSE
##   <chr>    <chr>        <dbl>     <dbl>     <dbl>   <dbl>  <dbl>   <dbl>   <dbl>
## 1 arima    Training   3.68e10   1.56e11   9.87e10   1.04    1.74   0.289   0.383
## 2 arima112 Training NaN       NaN       NaN       NaN     NaN    NaN     NaN    
## 3 arima021 Training   2.62e10   1.66e11   1.04e11   0.819   1.69   0.305   0.408
## 4 arima012 Training   1.29e11   2.19e11   1.57e11   2.41    2.78   0.462   0.538
## 5 arima222 Training   2.89e10   1.43e11   9.26e10   1.06    1.74   0.271   0.351
## # ℹ 1 more variable: ACF1 <dbl>
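
Training accuracy can favour over-fitted models, so it is also worth comparing information criteria (my addition; note that AICc values are only directly comparable between models with the same order of differencing):

# Information criteria for the fitted models
glance(fit_models) %>%
  select(.model, AIC, AICc, BIC) %>%
  arrange(AICc)
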
  d. choose what you think is the best model and check the residual diagnostics.

Response

The ARIMA(2,2,2) model has the lowest RMSE (roughly 1.43 × 10^11) and MASE, and performs well on MAPE (1.74).

Looking at the residuals, there don't appear to be any noticeable patterns, the distribution is close to normal, and the residuals appear to behave like white noise.

fit_models %>%
  select(arima222) %>%
  gg_tsresiduals()

  e. produce forecasts of your fitted model. Do the forecasts look reasonable?

Response

forecast_arima222 <- fit_models %>%
  select(arima222) %>%  
  forecast(h = 10)          

# Plot the forecast along with the historical data
autoplot(forecast_arima222, gsp)

The forecasts continue the strong upward trend of the historical series, so they look reasonable.

  f. compare the results with what you would obtain using ETS() (with no transformation).

Response

The ARIMA(2,2,2) model is still the best, as it has the lowest RMSE and MAE; the ETS model's training errors are larger on both measures. A visual comparison of the two sets of forecasts is sketched after the accuracy table.

fit_models2 <- gsp %>%
  model(
    arima = ARIMA(GDP),
    arima222 = ARIMA(GDP ~ pdq(2,2,2)),
    ets = ETS(GDP)
  ) 

accuracy(fit_models2)
## # A tibble: 3 × 10
##   .model   .type             ME    RMSE     MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr>    <chr>          <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 arima    Training     3.68e10 1.56e11 9.87e10 1.04   1.74 0.289 0.383 -2.40e-2
## 2 arima222 Training     2.89e10 1.43e11 9.26e10 1.06   1.74 0.271 0.351  4.57e-4
## 3 ets      Training     2.10e10 1.67e11 1.05e11 0.484  1.91 0.307 0.409  1.53e-1
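
As a visual comparison (a sketch), the ARIMA(2,2,2) and ETS forecasts can be plotted together:

# Forecast 10 years ahead with both models and overlay on the history
fit_models2 %>%
  select(arima222, ets) %>%
  forecast(h = 10) %>%
  autoplot(gsp, level = NULL)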