Data 624: Assignment 6

ARIMA Modeling:

Instructions Do the exercises 9.1, 9.2, 9.3, 9.5, 9.6, 9.7, 9.8 in Hyndman.
Please submit both the Rpubs link as well as your .rmd file.

9.1

A. Explain the differences among these figures. Do they all indicate that the data are white noise?

As we examine the graphs, we see that the spikes in the lags diminish as n (the length of the time series) increases, which also leads to a decrease in the significance threshold. As n increases, the bounded area (represented by the blue-dotted line) decreases. All the lag spikes fall within the bounded area of ±1.96n±n1.96. This indicates that the auto-correlations are white noise.

B.Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

The critical values can be described by the formula ±1.96n±n1.96, where n is the length of the time series. As n increases, the critical values become smaller and closer to zero because the standard error of the autocorrelation estimates decreases. Therefore, a larger sample size n correlates to narrower confidence intervals around the mean of zero, leading to critical values that reflect this increased precision.

9.2

A classic example of a non-stationary series are stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

gafa_stock |>
  filter(Symbol == "AMZN") |>
  gg_tsdisplay(Close, plot_type='partial') +
  labs(title = "AMZN Closing Stock Price")

## Warning: Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
## Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.

From the quick plot of the time series for AMZN closing prices, we can see that there’s a clear trend in the data. The plot shows a continuous upward curve, indicating the presence of a trend. For data to be considered stationary, it should look more horizontal, with no clear seasonality or trend. This signals the need for at least one order of differencing.

The ACF and PACF plots back this up. For truly stationary data, the ACF should drop to zero quickly, but here we see the ACF decreasing slowly, which confirms non-stationarity. The PACF shows a significant spike at lag 1, indicating strong autocorrelation, and then smaller significant spikes at regular intervals, possibly hinting at some seasonality.

Altogether, this shows that the data should be differenced to remove the trend and make it stationary.

9.3

For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.

A. Turkish Data

global_economy |>
  filter(Country == "Turkey") |>
  gg_tsdisplay(GDP, plot_type='partial') +
  labs(title = "Non-transformed Turkish GDP")

lambda_turkey <- global_economy |>
  filter(Country == "Turkey") |>
  features(GDP, features = guerrero) |>
  pull(lambda_guerrero) |>
  round(4)  


global_economy |>
  filter(Country == "Turkey") |>
  features(GDP, unitroot_ndiffs)

## # A tibble: 1 × 2
##   Country ndiffs
##   <fct>    <int>
## 1 Turkey       1

global_economy |>
  filter(Country == "Turkey") |>
  features(GDP, unitroot_nsdiffs)

## # A tibble: 1 × 2
##   Country nsdiffs
##   <fct>     <int>
## 1 Turkey        0

global_economy |>
  filter(Country == "Turkey") |>
  features(GDP, unitroot_kpss)

## # A tibble: 1 × 3
##   Country kpss_stat kpss_pvalue
##   <fct>       <dbl>       <dbl>
## 1 Turkey       1.22        0.01

# Print the lambda value
print(paste("lambda =", lambda_turkey))

## [1] "lambda = 0.1572"

turkey_transform <- global_economy |>
  filter(Country == "Turkey") |>
  mutate(GDP_transformed_diff = difference(box_cox(GDP, lambda_turkey)))  # Apply Box-Cox and differencing

# Plot the transformed and differenced data
turkey_transform |>
  gg_tsdisplay(GDP_transformed_diff) +
  labs(title = TeX(paste0("Differenced GDP with $\\lambda$ = ", lambda_turkey)))

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

The Turkish data needs one order of difference and does not require a seasonal difference

B. Accommodation takings in the state of Tasmania from aus_accommodation.

## Visualize data ACF and APCF
aus_accommodation |>
  filter(State == "Tasmania") |>
  gg_tsdisplay(Takings, plot_type='partial',) +
  labs(title = "Tasmania Accomodation Takings")

# Lambda
lambda_tasmania <-aus_accommodation |>
  filter(State == "Tasmania") |>
  features(Takings, features = guerrero) |>
  pull(lambda_guerrero)|>
  round(4)
print(paste("lamda=", lambda_tasmania))

## [1] "lamda= 0.0018"

aus_accommodation |>
  filter(State == "Tasmania") |>
  features(box_cox(Takings,lambda_tasmania), unitroot_ndiffs)

## # A tibble: 1 × 2
##   State    ndiffs
##   <chr>     <int>
## 1 Tasmania      1

aus_accommodation |>
  filter(State == "Tasmania") |>
  features(box_cox(Takings,lambda_tasmania), unitroot_nsdiffs)

## # A tibble: 1 × 2
##   State    nsdiffs
##   <chr>      <int>
## 1 Tasmania       1

The Tasmania data need both a first difference and seasonal difference. I chose M=4 for this quarterly time series.

aus_accommodation %>%
  filter(State == "Tasmania") %>%
  gg_tsdisplay(difference(box_cox(Takings,lambda_tasmania))|>
                            difference(4), plot_type='partial',
               ) +
  labs(title = TeX(paste0("Differenced Tasmania Accomodation Takings with $\\lambda$ = ",
         lambda_tasmania)))

## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).

C. Monthly from souvenirs

# Plot Data Souvenirs monthlysales and ACF/PACF
souvenirs |>
  gg_tsdisplay(Sales, plot_type = 'partial') + 
  labs(title = "Monthly Souvenirs sales")

# Lambda
lambda_monthly_souvenirs <-souvenirs |>
 features(Sales, features = guerrero) |>
  pull(lambda_guerrero)|>
  round(4)
print(paste("LAmbda= ",lambda_monthly_souvenirs))

## [1] "LAmbda=  0.0021"

# Unit Root Test seasonality
souvenirs|>
  features(Sales, unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1

souvenirs|>
  features(
    Sales, unitroot_ndiffs
  )

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      1

souvenirs|>
  gg_tsdisplay(difference(box_cox(Sales,lambda_monthly_souvenirs))|>
                            difference(12), plot_type='partial',
               ) +
  labs(title = TeX(paste0("Differenced Monthly sales  with $\\lambda$ = ",
         lambda_monthly_souvenirs)))

## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).

The Souvenirs data needs a first difference as well as a seasonal difference, went with m=12 since it is monthly data.

9.5

For your retail data (from Exercise 7 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.

set.seed(3231)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1)) 

myseries_train <- myseries %>%
  filter(year(Month) < 2011)
autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

myseries|>
  gg_tsdisplay(Turnover, plot_type = "partial")

my_series_lamda<- myseries|>
  features(Turnover, features = guerrero) |>
  pull(lambda_guerrero)|>
  round(4)
print(my_series_lamda)

## [1] 0.0902

myseries|>
  features(Turnover, unitroot_ndiffs)

## # A tibble: 1 × 3
##   State    Industry       ndiffs
##   <chr>    <chr>           <int>
## 1 Tasmania Food retailing      1

myseries|>
  features(Turnover, unitroot_nsdiffs)

## # A tibble: 1 × 3
##   State    Industry       nsdiffs
##   <chr>    <chr>            <int>
## 1 Tasmania Food retailing       1

We need one difference and a seasonal difference

myseries|>
  gg_tsdisplay(difference(box_cox(Turnover, my_series_lamda))|>
                 difference(12), plot_type='partial',
               ) +
  labs(title = TeX(paste0("Differenced Monthly Turnover with $\\lambda$ = ",
         my_series_lamda)))

## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).

9.6

Simulate and plot some data from simple ARIMA models.

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

B. Plot the series

sim %>% autoplot(y)+
  labs(title = "AR(1) Model with Phi = 0.6", x = "Time", y = "Value")

Changing the ϕ1 to -0.6

for(i in 2:100)
  y[i] <- -1*0.6*y[i-1] + e[i]
sim2 <- tsibble(idx = seq_len(100), y = y, index = idx)
sim2 |>
  autoplot(y) +
  labs(title = "AR(1) Model with Phi = -0.6", x = "Time", y = "Value")

AR(1) model with ϕ=−0.6= -0.6ϕ=−0.6. As expected, we can see the alternating pattern in the time series due to the negative autocorrelation (ϕ=−0.6= -0.6ϕ=−0.6).

C ,D

MA model

# Simulate an MA(1) process with theta = 0.6
y <- numeric(100)  # Initialize the vector
e <- rnorm(100)  # Generate white noise

# Simulate the MA(1) process
for(i in 2:100) {
  y[i] <- e[i] + 0.6 * e[i-1]  # MA(1) model with theta = 0.6
}

# Create a tsibble
sim_4 <- tsibble(idx = seq_len(100), y = y, index = idx)

# Plot the data
MA_1<-sim_4 |>
  gg_tsdisplay(y, plot_type = "partial") +
  labs(title = "MA(1) Model with Theta = 0.6", x = "Time", y = "Value")
MA_1

The ACF plot shows a significant spike at lag 1 with values dropping close to zero at higher lags, which is typical of an MA(1) process. The PACF plot has no significant spikes beyond lag 1, further confirming the suitability of the MA(1) model for this data.

Generate data from an ARMA(1,1) model with ϕ1=0.6ϕ1=0.6, θ1=0.6θ1=0.6 and σ2=1σ2=1.

n <- 100  # Length of the time series
phi <- 0.6  # AR(1) coefficient
theta <- 0.6  # MA(1) coefficient
sigma <- 1  # Standard deviation of the white noise (sigma^2 = 1)

# Generate white noise with standard deviation sigma = 1
e <- rnorm(n, mean = 0, sd = sigma)

# Simulate an ARMA(1,1) model with AR(1) and MA(1) components
y <- numeric(n)
for(i in 2:n) {
  y[i] <- phi * y[i-1] + e[i] + theta * e[i-1]  # ARMA(1,1) process
}

# Create a tsibble for the simulated data
library(tsibble)
sim_5 <- tsibble(idx = seq_len(n), y = y, index = idx)

# Plot the simulated data
arma_1_1<-sim_5|>
  gg_tsdisplay(y, plot_type = "partial")+
  labs(title = expression(paste(
    "Simulated ARMA(1,1) Process with ", phi[1], " = 0.6, ",
    theta[1], " = 0.6, and ", sigma^2, " = 1")),
       x = "Time", y = "Value")
arma_1_1

Generate AR(2) with

0.8ϕ1=−0.8, ϕ2=0.3ϕ2=0.3 and σ2=1σ2=1

phi1= -0.8
phi2=  0.3

e <- rnorm(n, mean = 0, sd = sigma)

# Initialize a vector for the AR(2) process
y <- numeric(n)

e <- rnorm(n, mean = 0, sd = sigma)

# Initialize a vector for the AR(2) process
y <- numeric(n)

# Simulate the AR(2) process:
#y[i] depends on y[i-1], y[i-2], and white noise e[i]
for(i in 3:n) {
  y[i] <- phi1 * y[i-1] + phi2 * y[i-2] + e[i]  # AR(2) model
}
sim_Ar2 <- tsibble(idx = seq_len(n), y = y, index = idx)

sim_Ar2|>
  gg_tsdisplay(y, plot_type = "partial")

The ARMA(1,1) plot indicates stationarity with its stable, mean-reverting behavior. The rapid decline in the ACF and the sharp cut-off in the PACF after lag 1 confirm this.

In contrast, the AR(2) plot shows a non-stationary process due to its unstable, diverging pattern and persistent auto-correlations over several lags.

9.7

Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.
1. Use ARIMA() to find an appropriate ARIMA model. What model was selected. Check that the residuals look like white noise. Plot forecasts for the next 10 periods.
2. Write the model in terms of the backshift operator.
3. Plot forecasts from an ARIMA(0,1,0) model with drift and compare these to part a.
4. Plot forecasts from an ARIMA(2,1,2) model with drift and compare these to parts a and c. Remove the constant and see what happens.
5. Plot forecasts from an ARIMA(0,2,1) model with a constant. What happens?

fit <- aus_airpassengers %>%
  filter(Year < 2012) %>%
  model(ARIMA(Passengers)) 

report(fit)

## Series: Passengers 
## Model: ARIMA(0,2,1) 
## 
## Coefficients:
##           ma1
##       -0.8756
## s.e.   0.0722
## 
## sigma^2 estimated as 4.671:  log likelihood=-87.8
## AIC=179.61   AICc=179.93   BIC=182.99

fit|>
  gg_tsresiduals()

fit %>% 
  forecast(h=10) %>%
  autoplot(aus_airpassengers) +
  labs(title = "Aus Aircraft Passengers with ARIMA(0,2,1)", y = "Passengers Mil")

The residuals and ACF for the selected ARIMA(0,2,1) model indicate white noise, and the forecast aligns well with the test data, falling within the confidence intervals.

yt=−0.8756ϵt−1+ϵtyt=−0.8756ϵt−1+ϵt

(1−B)2yt=(1−0.87B)ϵt

fitdrift <-aus_airpassengers %>%
  filter(Year < 2012) %>%
  model(ARIMA(Passengers ~ pdq(0,1,0)))
fitdrift|>
  forecast(h=10)|>
  autoplot(aus_airpassengers)+
  labs(
    title = "AUS Arircraft Passenges with Arima(0,1,0)",
    y = "Passengers Mil"
  )

fitdrift|>
  gg_tsresiduals()+
  labs(
    title = "AUS Arircraft Passenges with Arima(0,1,0)")

fit212drift<- aus_airpassengers|>
  filter(Year <2012)|>
  model(ARIMA(Passengers ~ pdq(2,1,2)))
  report(fit212drift)

## Series: Passengers 
## Model: ARIMA(2,1,2) w/ drift 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2  constant
##       1.4694  -0.5103  -1.5736  0.6780    0.0650
## s.e.  0.3780   0.3558   0.3081  0.2688    0.0294
## 
## sigma^2 estimated as 4.748:  log likelihood=-87.74
## AIC=187.47   AICc=189.94   BIC=197.75

fit212drift|>
  gg_tsresiduals()+
  labs(
    title = "AUS Arircraft Passenges with Arima(2,1,2) W/ Drift",)

fit212drift|>
  forecast(h=10)|>
  autoplot(aus_airpassengers)+
  labs(
    title = "AUS Arircraft Passenges with Arima(2,1,2) W/ Drift",
    y = "Passengers Mil"
  )

fit_no_constant212<-aus_airpassengers|>
  filter(Year < 2012)|>
  model(ARIMA(Passengers ~ 0 +pdq(2,1,2)))

## Warning: 1 error encountered for ARIMA(Passengers ~ 0 + pdq(2, 1, 2))
## [1] non-stationary AR part from CSS

report(fit_no_constant212)

## Series: Passengers 
## Model: NULL model 
## NULL model

When we remove the Constant we have a null model.

fit_021C<- aus_airpassengers|>
  filter(Year<2012)|>
  model(ARIMA(Passengers ~ 1+ pdq(0,2,1)))

## Warning: Model specification induces a quadratic or higher order polynomial trend. 
## This is generally discouraged, consider removing the constant or reducing the number of differences.

report(fit_021C)

## Series: Passengers 
## Model: ARIMA(0,2,1) w/ poly 
## 
## Coefficients:
##           ma1  constant
##       -1.0000    0.0721
## s.e.   0.0709    0.0260
## 
## sigma^2 estimated as 4.086:  log likelihood=-85.74
## AIC=177.48   AICc=178.15   BIC=182.55

We receive a warning because the model uses second differencing (d=2d=2d=2), which implies that we’re modeling a trend of at least quadratic order. In this case, ARIMA(0,2,1) assumes a trend that grows at a nonlinear rate, but no explicit polynomial trend has been included in the model

fit_021C|>
 forecast(h=50) %>%
  autoplot(aus_airpassengers) +
  labs(title = "Australian Aircraft Passengers with ARIMA(0,2,1) with constant", 
       y = "Passengers Mil")

“As we can see, the inclusion of the constant is causing the forecast values to explode in a quadratic curve, which is not ideal for real-world forecasting. This kind of growth is unlikely for the number of passengers and suggests that the model may need refinement to avoid unrealistic projections.

fit_021C|>
  gg_tsresiduals()+
  labs(title = "Australian Aircraft Passengers with ARIMA(0,2,1) with constant")

9.8

For the United States GDP series (from global_economy):
1. if necessary, find a suitable Box-Cox transformation for the data;
2. fit a suitable ARIMA model to the transformed data using ARIMA();
3. try some other plausible models by experimenting with the orders chosen;
4. choose what you think is the best model and check the residual diagnostics;
5. produce forecasts of your fitted model. Do the forecasts look reasonable?
6. compare the results with what you would obtain using ETS() (with no transformation).

usa_gdp<- global_economy|>
  filter(Country == "United States")
usa_gdp|>
  autoplot()

## Plot variable not specified, automatically selected `.vars = GDP`

usa_gdp|>
  autoplot(sqrt(GDP))

lambda_usagdp <- usa_gdp |>
  features(GDP, features = guerrero) |>
  pull(lambda_guerrero)|>
  round(4)
print(lambda_usagdp)

## [1] 0.2819

usa_gdp|>
  mutate(transformed_gdp = box_cox( GDP,lambda_usagdp))|>
  gg_tsdisplay(transformed_gdp, plot_type = "partial")+
  labs(
    title = paste("transformed US GDP with a lamda", lambda_usagdp
  ))

fitUSA<- usa_gdp|>
  model(ARIMA(box_cox(GDP,lambda_usagdp)))
report(fitUSA)

## Series: GDP 
## Model: ARIMA(1,1,0) w/ drift 
## Transformation: box_cox(GDP, lambda_usagdp) 
## 
## Coefficients:
##          ar1  constant
##       0.4586  118.0287
## s.e.  0.1198    9.4923
## 
## sigma^2 estimated as 5465:  log likelihood=-325.25
## AIC=656.5   AICc=656.95   BIC=662.63

usa_gdp|>
  features(box_cox(GDP, lambda_usagdp), unitroot_ndiffs)

## # A tibble: 1 × 2
##   Country       ndiffs
##   <fct>          <int>
## 1 United States      1

usa_gdp|>
  features(box_cox(GDP, lambda_usagdp), unitroot_nsdiffs)

## # A tibble: 1 × 2
##   Country       nsdiffs
##   <fct>           <int>
## 1 United States       0

usa_gdp|>
    features(box_cox(GDP, lambda_usagdp), unitroot_kpss)

## # A tibble: 1 × 3
##   Country       kpss_stat kpss_pvalue
##   <fct>             <dbl>       <dbl>
## 1 United States      1.55        0.01

usa_fit <- usa_gdp |>
  filter(Year< 2005)|>
  model(
    arima_110 = ARIMA(box_cox(GDP, lambda_usagdp) ~ pdq(1,1,0)),
    arima_120 = ARIMA(box_cox(GDP, lambda_usagdp) ~ pdq(1,2,0)),
    arima_210 = ARIMA(box_cox(GDP, lambda_usagdp) ~ pdq(2,1,0)),
    arima_212 = ARIMA(box_cox(GDP, lambda_usagdp) ~1+ pdq(2,1,2)),
    arima_111 = ARIMA(box_cox(GDP, lambda_usagdp) ~ pdq(1,1,1))
  )

glance(usa_fit) %>% arrange(AICc) %>% select(.model:BIC)

## # A tibble: 5 × 6
##   .model    sigma2 log_lik   AIC  AICc   BIC
##   <chr>      <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 arima_120  4982.   -244.  491.  492.  495.
## 2 arima_111  4187.   -245.  497.  498.  504.
## 3 arima_110  4365.   -246.  498.  498.  503.
## 4 arima_210  4412.   -246.  499.  500.  506.
## 5 arima_212  4087.   -244.  499.  502.  510.

 usa_fit %>%  
   select(arima_212) %>%   
   gg_tsresiduals() + 
   ggtitle("Residuals for ARIMA(1,1,0)")

The histogram of residuals is centered around zero, although there seems to be some skewness on the left side. This suggests that the residuals might not be perfectly normally distributed, but the deviation is minor.

The ACF plot of the residuals shows no significant spikes beyond the confidence bounds. This suggests that the residuals are uncorrelated, which is what we expect from a good model (i.e., the residuals resemble white noise).

usa_fit %>%
  forecast(h=25) %>%
  filter(.model=='arima_212') %>%
  autoplot(usa_gdp)+
  labs(
    title = paste("GDP boxcox lamda:",lambda_usagdp," Arima model (2,1,2) \nconstant 1,\nTrained < 2005"
  ))

usa_fit %>%
  forecast(h=25) %>%
 
  autoplot(usa_gdp,level =NULL, )+
  labs(title = paste("GDP boxcox lamda:",lambda_usagdp, "\nForcasts for USA transformed and trained  < 2005 "))

fit_usaets <- usa_gdp|>
  filter(Year<2005)|>
  model(ETS(GDP))

report(fit_usaets)

## Series: GDP 
## Model: ETS(M,A,N) 
##   Smoothing parameters:
##     alpha = 0.9203841 
##     beta  = 0.5312597 
## 
##   Initial states:
##          l[0]        b[0]
##  448093333334 63642663151
## 
##   sigma^2:  8e-04
## 
##      AIC     AICc      BIC 
## 2436.235 2437.773 2445.268

fit_usaets|>
  forecast(h = 25)|>
  autoplot(usa_gdp,)+
  labs(
    title = "non transformed forcast ETSMAM\nTrained < 2005"
  )

The ETS model using (MAN) generates a much wider forecast due to the multiplicative errors. While this proves advantageous in the short run—capturing events like the 2008 recession within the confidence intervals—it leads to a much more unpredictable forecast as the horizon (H) increases

Data 624: Assignment 6

Darwhin Gomez

2024-10-21

ARIMA Modeling:

9.1

9.2

9.3

9.5

9.6

9.7

9.8