Homework 6

Victor Torres

2024-10-18

Do the exercises 9.1, 9.2, 9.3, 9.5, 9.6, 9.7, 9.8 in Hyndman. Please submit both the Rpubs link as well as your .rmd file.

library(fpp3)
## Warning: package 'fpp3' was built under R version 4.4.1
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.0 ──
## ✔ tibble      3.2.1     ✔ tsibble     1.1.5
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.1     ✔ feasts      0.3.2
## ✔ lubridate   1.9.3     ✔ fable       0.3.4
## ✔ ggplot2     3.5.1     ✔ fabletools  0.4.2
## Warning: package 'tsibble' was built under R version 4.4.1
## Warning: package 'tsibbledata' was built under R version 4.4.1
## Warning: package 'feasts' was built under R version 4.4.1
## Warning: package 'fabletools' was built under R version 4.4.1
## Warning: package 'fable' was built under R version 4.4.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

1. Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.

library(imager)
## Warning: package 'imager' was built under R version 4.4.1
## Loading required package: magrittr
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
## 
##     extract
## 
## Attaching package: 'imager'
## The following object is masked from 'package:magrittr':
## 
##     add
## The following object is masked from 'package:tidyr':
## 
##     fill
## The following object is masked from 'package:dplyr':
## 
##     where
## The following objects are masked from 'package:stats':
## 
##     convolve, spectrum
## The following object is masked from 'package:graphics':
## 
##     frame
## The following object is masked from 'package:base':
## 
##     save.image
im = load.image("/Users/vitug/OneDrive/Desktop/CUNY Masters/DATA_624/homework6.png")
plot(im)

a. Explain the differences among these figures. Do they all indicate that the data are white noise?

The three figures differ in the size of the spikes around the mean of zero and in the width of the bounded region. As the length of the series increases, the spikes shrink and the bounded region narrows. All three indicate white noise, since the autocorrelations lie within the blue dashed critical bounds, which are drawn at ±1.96/√T, where T is the length of the time series.

b. Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

The critical values sit at different distances from zero because they are computed as ±1.96/√T: as the series samples more random numbers, the bounds get narrower and the critical values move closer to zero. The autocorrelations differ from figure to figure because each is estimated from a different random sample; for white noise the sample autocorrelations are themselves random, and longer series produce estimates closer to the true value of zero.
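To illustrate (a sketch of my own, not part of the exercise): simulating white noise at three lengths and plotting the ACFs shows the critical bounds at ±1.96/√T narrowing as T grows.

# simulate white noise of increasing length; ACF bounds are +/- 1.96/sqrt(T)
set.seed(1)
wn <- bind_rows(
  tibble(series = "T = 36",   idx = 1:36,   y = rnorm(36)),
  tibble(series = "T = 360",  idx = 1:360,  y = rnorm(360)),
  tibble(series = "T = 1000", idx = 1:1000, y = rnorm(1000))
) |>
  as_tsibble(index = idx, key = series)
wn |> ACF(y) |> autoplot()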

2. A classic example of a non-stationary series is stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

gafa_stock %>%
  filter(Symbol == "AMZN") %>%
  gg_tsdisplay(Close, plot_type='partial') +
  labs(title = "Amazon Closing Stock Price")
## Warning: Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
## Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.

Based on the plots above, the Amazon closing price is non-stationary: the ACF has large values that decay very slowly, whereas the ACF of a stationary series drops to zero quickly. The time plot shows a clear increasing trend, another sign of non-stationarity. The PACF, which would be near zero at all lags for white noise, instead has a single large spike at lag 1, indicating the series should be differenced.
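As an additional check (my addition, not required by the exercise), a KPSS unit-root test can confirm the need for differencing; note the irregular trading-day index may again trigger a warning.

# KPSS test and suggested number of differences for the closing price (sketch)
gafa_stock |>
  filter(Symbol == "AMZN") |>
  features(Close, list(unitroot_kpss, unitroot_ndiffs))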

3. For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.

a. Turkish GDP from global_economy.

global_economy |>
  filter(Country == "Turkey") |>
  gg_tsdisplay(GDP, plot_type='partial') +
  labs(title = "Non-transformed Turkish GDP")

# calculate lambda
lambda <- global_economy |>
  filter(Country == "Turkey") |>
  features(GDP, features = guerrero) |>
  pull(lambda_guerrero)
# unit root test
global_economy |>
  filter(Country == "Turkey") |>
  features(box_cox(GDP, lambda), unitroot_ndiffs)
## # A tibble: 1 × 2
##   Country ndiffs
##   <fct>    <int>
## 1 Turkey       1
global_economy |>
  filter(Country == "Turkey") |>
  gg_tsdisplay(difference(box_cox(GDP,lambda)),plot_type = "partial") +
  labs(title = "Transformed Turkish GDP")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

Based on the graphs and the unit-root test above, one order of differencing is needed to make the Box-Cox-transformed Turkish GDP stationary.
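As a double check (my addition), a KPSS test on the differenced, transformed series should fail to reject stationarity (p-value around 0.1):

# KPSS test after one difference of the Box-Cox-transformed GDP (sketch)
global_economy |>
  filter(Country == "Turkey") |>
  features(difference(box_cox(GDP, lambda)), unitroot_kpss)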

b. Accommodation takings in the state of Tasmania from aus_accommodation.

aus_accommodation |>
  filter(State == "Tasmania") |>
  gg_tsdisplay(Takings, plot_type='partial') +
  labs(title = "Non-transformed Tasmania Accomodation Takings")

# calculate lambda
lambda <- aus_accommodation |>
  filter(State == "Tasmania") |>
  features(Takings, features = guerrero) |>
  pull(lambda_guerrero)
# unit root test
aus_accommodation |>
  filter(State == "Tasmania") |>
  features(box_cox(Takings, lambda), unitroot_ndiffs) 
## # A tibble: 1 × 2
##   State    ndiffs
##   <chr>     <int>
## 1 Tasmania      1
aus_accommodation |>
  filter(State == "Tasmania") |>
  gg_tsdisplay(difference(box_cox(Takings,lambda)),plot_type = "partial") +
  labs(title = "Transformed Tasmania Accomodation Takings")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

The series shows an increasing trend, growing variance and strong seasonality, all signs of non-stationarity. Based on the graphs and the unit-root test above, one order of differencing is needed after the Box-Cox transformation. After differencing, the ACF decays quickly and the PACF is truncated after the second lag, but the remaining seasonal spikes suggest a further (seasonal) difference might be needed to centre the data better around zero; a quick check is sketched below.
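Whether a seasonal difference is also warranted can be checked directly (my addition):

# suggested number of seasonal differences for the transformed takings (sketch)
aus_accommodation |>
  filter(State == "Tasmania") |>
  features(box_cox(Takings, lambda), unitroot_nsdiffs)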

c. Monthly sales from souvenirs.

souvenirs |>
  gg_tsdisplay(Sales, plot_type='partial', lag = 40) +
  labs(title = "Non-transformed Monthly Souvenir Sales")

# calculate lambda
lambda <- souvenirs |>
  features(Sales, features = guerrero) |>
  pull(lambda_guerrero)
# unit root test
souvenirs |>
  features(box_cox(Sales, lambda), unitroot_ndiffs) 
## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      1
souvenirs |>
  gg_tsdisplay(difference(box_cox(Sales,lambda)),plot_type = "partial") +
  labs(title = "Transformed Tasmania Accomodation Takings")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

This series trends upward with sharp peaks each year, showing strong seasonality. After applying the Box-Cox transformation and differencing, the trend is removed and the yearly peaks flip into troughs. The ACF showed several significant lags before differencing; after reviewing the plots and the unit-root test, one order of differencing is needed.
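Given the monthly seasonality, it is worth checking for a seasonal difference here too (my addition):

# suggested number of seasonal differences for the transformed sales (sketch)
souvenirs |>
  features(box_cox(Sales, lambda), unitroot_nsdiffs)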

5. For your retail data (from Exercise 7 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.

set.seed(12271979)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries %>%
  autoplot(Turnover) + labs(title = "Retail Data")

## calculate lambda
Saleslambda <- myseries |>
  features(Turnover, features = guerrero) |>
  pull(lambda_guerrero)
# applying BoxCox transformation
myseries %>%
  gg_tsdisplay(box_cox(Turnover,Saleslambda),plot_type = "partial")

# applying difference
myseries %>%
  gg_tsdisplay(difference(box_cox(Turnover,Saleslambda),12),plot_type = "partial",lag =36)
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).

This time series has an increasing trend and strong seasonality; applying the Box-Cox transformation stabilises the variance. As the graphs above show, after the seasonal difference the ACF spikes drop off after lag 12, but the PACF still shows significant ups and downs, so the series may still be non-stationary and could need a further first difference.
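To check whether a further first difference is needed after the seasonal one (my addition):

# suggested number of first differences after the lag-12 seasonal difference (sketch)
myseries %>%
  features(difference(box_cox(Turnover, Saleslambda), 12), unitroot_ndiffs)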

6. Simulate and plot some data from simple ARIMA models.

a. Use the following R code to generate data from an AR(1) model with ϕ₁ = 0.6 and σ² = 1. The process starts with y₁ = 0.

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
sim
## # A tsibble: 100 x 2 [1]
##      idx      y
##    <int>  <dbl>
##  1     1  0    
##  2     2  1.17 
##  3     3  1.69 
##  4     4 -0.113
##  5     5  0.724
##  6     6 -1.45 
##  7     7 -1.28 
##  8     8 -2.31 
##  9     9 -0.128
## 10    10 -0.327
## # ℹ 90 more rows

b. Produce a time plot for the series. How does the plot change as you change ϕ₁?

for(i in 2:100)
  y[i] <- 0.9*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

sim |>
  autoplot()
## Plot variable not specified, automatically selected `.vars = y`

for(i in 2:100)
  y[i] <- 0.2*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)

sim |>
  autoplot()
## Plot variable not specified, automatically selected `.vars = y`

As the graphs above show, smaller values of ϕ₁ make the series fluctuate more rapidly around zero, so it looks closer to white noise; values of ϕ₁ closer to one produce smoother, longer excursions away from zero (at ϕ₁ = 1 the series would become a random walk).
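A compact way to compare several ϕ₁ values side by side (a sketch of my own; `sim_ar1` is a hypothetical helper, and the errors are shared across series so only ϕ₁ changes):

# simulate AR(1) series for several phi values with shared errors (sketch)
sim_ar1 <- function(phi, e){
  y <- numeric(length(e))
  for (i in 2:length(e)) y[i] <- phi * y[i-1] + e[i]
  y
}
set.seed(42)
e <- rnorm(100)
purrr::map_dfr(c(0.2, 0.6, 0.9), function(p)
  tibble(phi = paste0("phi = ", p), idx = 1:100, y = sim_ar1(p, e))) |>
  as_tsibble(index = idx, key = phi) |>
  autoplot(y)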

c. Write your own code to generate data from an MA(1) model with θ₁ = 0.6 and σ² = 1.

# create a function to simulate an MA(1) series
my_own <- function(theta, sigmasq){
  set.seed(25)
  y <- numeric(100)
  e <- rnorm(100, mean = 0, sd = sqrt(sigmasq))  # sd = sqrt(sigma^2), not the mean
  y[1] <- e[1]
  for (i in 2:100)
    # moving average formula: y_t = e_t + theta * e_{t-1}
    y[i] <- e[i] + theta * e[i-1]
  tsibble(idx = seq_len(100), y = y, index = idx)
}
my_own(0.6,1) %>%
  autoplot() + labs(title = "Moving Average Time Plot with Theta = 0.6")
## Plot variable not specified, automatically selected `.vars = y`

d. Produce a time plot for the series. How does the plot change as you change θ₁?

my_own(0.1,1) %>%
  autoplot() + labs(title = "Moving Average with theta = 0.1")
## Plot variable not specified, automatically selected `.vars = y`

It is hard to see big differences between the two graphs; still, smaller values of θ₁ leave the series closer to white noise, with sharper peaks and dips, while larger values of θ₁ smooth consecutive observations together so the peaks are less pronounced.
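This matches the theory (my addition): the lag-1 autocorrelation of an MA(1) is θ₁/(1 + θ₁²), so a larger θ₁ ties neighbouring observations more strongly together.

# theoretical lag-1 autocorrelation for the two theta values
theta <- c(0.1, 0.6)
theta / (1 + theta^2)  # approx. 0.099 and 0.441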

e. Generate data from an ARMA(1,1) model with ϕ₁ = 0.6, θ₁ = 0.6 and σ² = 1.

# ARMA(1,1) simulation function
arma <- function(theta, phi, sigmasq){
  set.seed(23)
  y <- numeric(100)
  e <- rnorm(100, mean = 0, sd = sqrt(sigmasq))  # sd = sqrt(sigma^2), not the mean
  y[1] <- e[1]
  for (i in 2:100)
    # ARMA(1,1) formula: y_t = phi * y_{t-1} + e_t + theta * e_{t-1}
    y[i] <- phi * y[i-1] + e[i] + theta * e[i-1]
  tsibble(idx = seq_len(100), y = y, index = idx)
}

f. Generate data from an AR(2) model with ϕ₁ = −0.8, ϕ₂ = 0.3 and σ² = 1. (Note that these parameters will give a non-stationary series.)

# AR(2) simulation function
arma2 <- function(phi1, phi2, sigmasq){
  set.seed(10)
  y <- numeric(100)
  e <- rnorm(100, mean = 0, sd = sqrt(sigmasq))  # sd = sqrt(sigma^2), not the mean
  y[1] <- e[1]
  y[2] <- phi1 * y[1] + e[2]
  for (i in 3:100)
    # AR(2) formula: y_t = phi1 * y_{t-1} + phi2 * y_{t-2} + e_t
    y[i] <- phi1 * y[i-1] + phi2 * y[i-2] + e[i]
  tsibble(idx = seq_len(100), y = y, index = idx)
}

g. Graph the latter two series and compare them.

arma(0.6,0.6,1) %>%
  gg_tsdisplay(plot_type = "partial") + labs(title = "ARMA(1,1) plot")
## Plot variable not specified, automatically selected `y = y`

arma2(-0.8,0.3,1) %>%
  gg_tsdisplay(plot_type = "partial") + labs(title = "AR(2)")
## Plot variable not specified, automatically selected `y = y`

The ARMA(1,1) series appears stationary: it looks random, its ACF dies out by around lag 5, and its PACF drops sharply after the second lag and is truncated afterwards. The AR(2) series is not stationary: it oscillates about the mean and its variance grows roughly exponentially as the index increases. Its ACF alternates in sign with spikes that grow beyond the critical bounds, while its PACF shows a large negative spike at lag 1 and stays within the dotted bounds afterwards.
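The non-stationarity can also be verified algebraically (my addition): an AR(2) is stationary only if all roots of 1 − ϕ₁z − ϕ₂z² lie outside the unit circle, and here one root lies inside it.

# roots of 1 - phi1*z - phi2*z^2 with phi1 = -0.8, phi2 = 0.3;
# any modulus below 1 means the series is non-stationary
Mod(polyroot(c(1, 0.8, -0.3)))  # approx. 0.93 and 3.59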

7. Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.

head(aus_airpassengers)
## # A tsibble: 6 x 2 [1Y]
##    Year Passengers
##   <dbl>      <dbl>
## 1  1970       7.32
## 2  1971       7.33
## 3  1972       7.80
## 4  1973       9.38
## 5  1974      10.7 
## 6  1975      11.1

a. Use ARIMA() to find an appropriate ARIMA model. What model was selected? Check that the residuals look like white noise. Plot forecasts for the next 10 periods.

arima_aus <- aus_airpassengers |>
  filter(Year < 2012) |>
  model(ARIMA(Passengers))
report(arima_aus)
## Series: Passengers 
## Model: ARIMA(0,2,1) 
## 
## Coefficients:
##           ma1
##       -0.8756
## s.e.   0.0722
## 
## sigma^2 estimated as 4.671:  log likelihood=-87.8
## AIC=179.61   AICc=179.93   BIC=182.99

The model selected by ARIMA() is ARIMA(0,2,1).

arima_aus %>% 
  gg_tsresiduals() +
  labs(title = "Residuals for ARIMA(0,2,1)")

The residuals look like white noise: the ACF shows no significant spikes and the histogram looks roughly normal.
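A Ljung-Box test supports this (my addition); a large p-value is consistent with white noise:

# Ljung-Box test on the innovation residuals (dof = 1 for the MA coefficient)
augment(arima_aus) |>
  features(.innov, ljung_box, lag = 10, dof = 1)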

arima_aus %>% 
  forecast(h=10) %>%
  autoplot(aus_airpassengers) +
  labs(title = "Forecast of Australian Aircraft Passengers for next 10 periods", y = "Passengers (in millions)")

b. Write the model in terms of the backshift operator.

With the estimated coefficient θ₁ = −0.8756, the model for the twice-differenced series is yₜ″ = εₜ − 0.8756εₜ₋₁, or in backshift notation:

(1 − B)²yₜ = (1 − 0.8756B)εₜ

c. Plot forecasts from an ARIMA(0,1,0) model with drift and compare these to part a.

arima_aus2 <- aus_airpassengers %>%
  filter(Year < 2012) %>%
  model(
        arima010 = ARIMA(Passengers ~ pdq(0,1,0)))
arima_aus2 %>%
  forecast(h=10) %>%
  autoplot(aus_airpassengers) +
  labs(title = "Forecast with ARIMA(0,1,0)")


The forecasts from part (a) climb more steeply, sitting at or above the actual post-2011 values, while the ARIMA(0,1,0) with drift forecasts here increase only linearly and fall below the actual values in the time series.

d. Plot forecasts from an ARIMA(2,1,2) model with drift and compare these to parts a and c. Remove the constant and see what happens.

arima_aus3 <- aus_airpassengers %>%
  filter(Year < 2012) %>%
  model(
        arima212 = ARIMA(Passengers ~ pdq(2,1,2)))
arima_aus3 %>%
  forecast(h=10) %>%
  autoplot(aus_airpassengers) +
  labs(title = "Forecast with ARIMA(2,1,2)")

report(arima_aus3)
## Series: Passengers 
## Model: ARIMA(2,1,2) w/ drift 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2  constant
##       1.4694  -0.5103  -1.5736  0.6780    0.0650
## s.e.  0.3780   0.3558   0.3081  0.2688    0.0294
## 
## sigma^2 estimated as 4.748:  log likelihood=-87.74
## AIC=187.47   AICc=189.94   BIC=197.75
## attempt to remove the constant by adding a 0 to the response
## (note: this does NOT actually remove the constant; see below)
arima_aus4 <- aus_airpassengers %>%
  filter(Year < 2012) %>%
  model(
        arima212 = ARIMA(Passengers + 0 ~ pdq(2,1,2)))
arima_aus4 %>%
  forecast(h=10) %>%
  autoplot(aus_airpassengers) +
  labs(title = "Forecast with ARIMA(2,1,2) without Constant")

report(arima_aus4)
## Series: Passengers 
## Model: ARIMA(2,1,2) w/ drift 
## Transformation: Passengers + 0 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2  constant
##       1.4694  -0.5103  -1.5736  0.6780    0.0650
## s.e.  0.3780   0.3558   0.3081  0.2688    0.0294
## 
## sigma^2 estimated as 4.748:  log likelihood=-87.74
## AIC=187.47   AICc=189.94   BIC=197.75

The forecasts are essentially unchanged, and the report shows why: `Passengers + 0` places the 0 on the response side of the formula, where it acts as a (do-nothing) transformation, so the constant was never actually removed and the fitted model is still ARIMA(2,1,2) w/ drift with identical coefficients.
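To actually drop the constant, the 0 must go inside the model formula. A minimal sketch (my addition; with these orders, fable may warn that no suitable model can be estimated without a constant):

# remove the constant by placing the 0 in the model formula (sketch)
aus_airpassengers %>%
  filter(Year < 2012) %>%
  model(arima212_nc = ARIMA(Passengers ~ 0 + pdq(2,1,2))) %>%
  report()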

e. Plot forecasts from an ARIMA(0,2,1) model with a constant. What happens?

## note: as written, no constant is included; with d = 2, ARIMA() omits it by default
arima_aus5 <- aus_airpassengers %>%
  filter(Year < 2012) %>%
  model(
        arima021 = ARIMA(Passengers ~ pdq(0,2,1)))
arima_aus5 %>%
  forecast(h=10) %>%
  autoplot(aus_airpassengers) +
  labs(title = "Forecast with ARIMA(0,2,1)")

report(arima_aus5)
## Series: Passengers 
## Model: ARIMA(0,2,1) 
## 
## Coefficients:
##           ma1
##       -0.8756
## s.e.   0.0722
## 
## sigma^2 estimated as 4.671:  log likelihood=-87.8
## AIC=179.61   AICc=179.93   BIC=182.99

As specified, this model does not actually include a constant, which is why the report matches part (a) exactly: the same sigma², log likelihood, AICc and BIC. To include a constant with two orders of differencing, it has to be added to the formula explicitly, as sketched below.
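A minimal sketch (my addition) of forcing the constant in; fable should warn that a constant with d = 2 induces a quadratic or higher-order polynomial trend, which is generally discouraged:

# include a constant explicitly with `1 +` in the model formula (sketch)
aus_airpassengers %>%
  filter(Year < 2012) %>%
  model(arima021_c = ARIMA(Passengers ~ 1 + pdq(0,2,1))) %>%
  forecast(h = 10) %>%
  autoplot(aus_airpassengers) +
  labs(title = "ARIMA(0,2,1) with a constant")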

8. For the United States GDP series (from global_economy):

US <- global_economy %>%
  filter(Country == 'United States')
## Plot the United States GDP
US %>%
  autoplot(GDP) + labs(title = "United States GDP")

a. if necessary, find a suitable Box-Cox transformation for the data;

# calculate lambda
USAlambda <- US |>
  features(GDP, features = guerrero) |>
  pull(lambda_guerrero)
US %>%
  autoplot(box_cox(GDP,USAlambda))

b. fit a suitable ARIMA model to the transformed data using ARIMA();

US %>%
  gg_tsdisplay(box_cox(GDP,USAlambda),plot_type = "partial")

Based on the plots above, the data needs transforming and differencing to become stationary: the time plot shows a strong increasing trend, and the ACF has significant spikes at every lag.

US %>%
  gg_tsdisplay(difference(box_cox(GDP,USAlambda)),plot_type = "partial")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

# find the best model
US_fit <- US %>%
  model(
    arima011 = ARIMA(box_cox(GDP,USAlambda) ~ pdq(0,1,1)),
    arima110 = ARIMA(box_cox(GDP,USAlambda) ~ pdq(1,1,0)),
    stepwise = ARIMA(box_cox(GDP,USAlambda)),
    search   = ARIMA(box_cox(GDP,USAlambda), stepwise = FALSE)
  )
glance(US_fit) %>%
  arrange(AICc) %>%
  select(.model:BIC)
## # A tibble: 4 × 6
##   .model   sigma2 log_lik   AIC  AICc   BIC
##   <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 arima110  5479.   -325.  657.  657.  663.
## 2 stepwise  5479.   -325.  657.  657.  663.
## 3 search    5479.   -325.  657.  657.  663.
## 4 arima011  5689.   -326.  659.  659.  665.

I would say arima110 is the best option here: it has the lowest AICc, tied with the stepwise and exhaustive searches, which appear to have selected the same ARIMA(1,1,0) with drift model.
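To see what was actually fitted (my addition):

# inspect the chosen model's coefficients
US_fit %>%
  select(arima110) %>%
  report()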

c. try some other plausible models by experimenting with the orders chosen;

US_fit2 <- US %>%
  model(
    arima112 = ARIMA(box_cox(GDP,USAlambda) ~ pdq(1,1,2)),
    arima210 = ARIMA(box_cox(GDP,USAlambda) ~ pdq(2,1,0)),
    arima211 = ARIMA(box_cox(GDP,USAlambda) ~ pdq(2,1,1)),
    stepwise = ARIMA(box_cox(GDP,USAlambda)),
    search   = ARIMA(box_cox(GDP,USAlambda), stepwise = FALSE)
  )
glance(US_fit2) %>%
  arrange(AICc) %>%
  select(.model:BIC)
## # A tibble: 5 × 6
##   .model   sigma2 log_lik   AIC  AICc   BIC
##   <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 stepwise  5479.   -325.  657.  657.  663.
## 2 search    5479.   -325.  657.  657.  663.
## 3 arima210  5580.   -325.  659.  659.  667.
## 4 arima112  5630.   -325.  660.  661.  670.
## 5 arima211  5647.   -325.  660.  661.  671.

None of the manually chosen orders improves on the automatic selection: the stepwise and exhaustive searches still have the lowest AICc, and among the new candidates arima210 comes closest. I therefore stay with the ARIMA(1,1,0) with drift model from part (b).

d. choose what you think is the best model and check the residual diagnostics;

US_fit %>%
  select(arima110) %>%
  gg_tsresiduals()

The residual histogram looks slightly left-skewed, but the diagnostics otherwise look good: there are no trends or large spikes in the residuals and no significant autocorrelation.

e. produce forecasts of your fitted model. Do the forecasts look reasonable?

US_fit %>%
  forecast(h = 10) %>%
  autoplot(US, level = NULL) + labs(title = "US GDP Forecast")

All four sets of forecasts look reasonable, continuing the series' increasing trend.

f. compare the results with what you would obtain using ETS() (with no transformation).

US_fit3 <- US |>
  model(ETS(GDP),
        arima011 = ARIMA(GDP ~ pdq(0,1,1)))
glance(US_fit3) %>%
  arrange(AICc) %>%
  select(.model:BIC)
## # A tibble: 2 × 6
##   .model     sigma2 log_lik   AIC  AICc   BIC
##   <chr>       <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 arima011 7.74e+22  -1583. 3170. 3170. 3174.
## 2 ETS(GDP) 6.78e- 4  -1590. 3191. 3192. 3201.

Based on the table above, the ARIMA(0,1,1) model has the lower AICc, so it performs slightly better than ETS even without any transformation; that said, information criteria are not strictly comparable across different model classes, so a test-set or cross-validation comparison would be a safer way to choose between them.
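A quick visual comparison of the two models' forecasts (my addition):

# compare ETS and ARIMA(0,1,1) forecasts of US GDP
US_fit3 |>
  forecast(h = 10) |>
  autoplot(US, level = NULL) +
  labs(title = "ETS vs ARIMA(0,1,1) Forecasts of US GDP")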