Data 624 HW6

10/22/2021

Gabe Abreu

Chapter 9 Exercises (9.1, 9.2, 9.3, 9.5, 9.6, 9.7, 9.8)

9.1

Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.

  a. Explain the differences among these figures. Do they all indicate that the data are white noise?

None of the three plots shows spikes beyond the critical value bounds. Since the bounds are never crossed, all three plots indicate that the data are white noise.

  b. Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

As the number of observations increases, the sampling variability of the autocorrelations decreases. The critical values are approximately ±1.96/√T, so by that mathematical relationship the bounds shrink as the sample size T grows.

White noise is defined as a series with a mean of zero. In a small sample, the individual autocorrelation estimates fluctuate visibly around zero, while in large samples those small differences shrink and the estimates sit much closer to zero.
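
As a quick sanity check (my own addition, not part of the original answer), the bounds ±1.96/√T can be computed directly for the three sample sizes:

# approximate 95% critical bounds for the ACF of a white noise series
n_obs <- c(36, 360, 1000)
round(1.96 / sqrt(n_obs), 3)  # roughly 0.327, 0.103, 0.062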

9.2

A classic example of a non-stationary series are stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

amz_close <- gafa_stock %>% filter(Symbol == "AMZN")  # daily Amazon prices
amz_close %>% gg_tsdisplay(Close, plot_type = 'partial')
## Warning: Provided data has an irregular interval, results should be treated with
## caution. Computing ACF by observation.

## Warning: Provided data has an irregular interval, results should be treated with
## caution. Computing ACF by observation.

The series is non-stationary due to the clear upward trend shown in the time plot. The ACF shows an r1 value of essentially 1.00, and the autocorrelations barely decrease rather than dropping to zero. The large r1 value alone signals the non-stationary nature of this data. The PACF plot shows a strong correlation at the first lag followed by correlations that are not significant. This series would benefit from differencing to make it stationary.
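
As a sketch of the remedy (not part of my original output), first-differencing the closing price and re-plotting should remove most of the autocorrelation:

# re-examine the ACF/PACF after a first difference (sketch)
amz_close %>% gg_tsdisplay(difference(Close), plot_type = 'partial')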

9.3

For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.

  a. Turkish GDP from global_economy
turkey_data <- global_economy %>%
               filter(Country == "Turkey")

Let’s examine the ACF and PACF plots to see if the data is non-stationary.

turkey_data %>% gg_tsdisplay(GDP, plot_type = 'partial')

The ACF values decrease slowly and remain well above the critical value range, indicating this is not a stationary data series. The PACF shows a strong correlation at lag 1 followed by insignificant correlations, suggesting an autoregressive term in the data. Given the single significant lag, I would say p = 1.

BoxCox Transformation

trans_turkey <- dplyr::select(turkey_data, GDP)
# box_cox() is from fabletools; BoxCox.lambda() (Guerrero method) is from forecast
trans_turkey$GDP <- box_cox(trans_turkey$GDP, BoxCox.lambda(trans_turkey$GDP))
trans_turkey %>% gg_tsdisplay(GDP, plot_type = 'partial')

The non-stationary attributes of the series are still apparent and need to be addressed with differencing.

Use the ndiffs function to determine the order of differencing necessary.

unitroot_ndiffs(trans_turkey$GDP)
## ndiffs 
##      1

First order differencing will most likely make the data series stationary.

ggtsdisplay(diff(trans_turkey$GDP))

The ACF and PACF plots show a stationary series.
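
For consistency with the later parts, the KPSS statistic on the differenced series can confirm this (a sketch; output omitted):

unitroot_kpss(diff(trans_turkey$GDP))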

  b. Accommodation takings in the state of Tasmania from aus_accommodation
tasmanian_takings <- aus_accommodation %>% filter(State == 'Tasmania')
tasmanian_takings %>% gg_tsdisplay(Takings, plot_type = 'partial')

The graph shows a series with slightly increasing variation as the level increases, making it a good candidate for a BoxCox transformation. The ACF plot also shows values above the critical value range in a repeating pattern, indicating seasonality. The PACF has values above the critical value range at lags 1 and 3.

BoxCox Transformation

trans_tas <- dplyr::select(tasmanian_takings, Takings)
trans_tas$Takings <- box_cox(trans_tas$Takings, BoxCox.lambda(trans_tas$Takings))
trans_tas %>% gg_tsdisplay(Takings, plot_type = 'partial')

The non-stationary attributes of the series are still apparent and need to be addressed with differencing. The increasing variation, however, appears to have been fixed by the BoxCox transformation.

Let’s verify with the KPSS test:

unitroot_kpss(trans_tas$Takings)
##   kpss_stat kpss_pvalue 
##    1.837249    0.010000

Use the ndiffs function to determine the order of differencing necessary.

unitroot_ndiffs(trans_tas$Takings)
## ndiffs 
##      1

First order differencing will most likely make the data series stationary.

ggtsdisplay(diff(trans_tas$Takings))

The data still does not look clearly stationary to me; let’s check with the KPSS test.

unitroot_kpss(diff(trans_tas$Takings))
##   kpss_stat kpss_pvalue 
##   0.2573405   0.1000000

The KPSS statistic is much smaller than the value obtained right after the transformation, and the p-value is reported at its ceiling of 0.1, so we fail to reject stationarity. The data is now stationary.
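
One caveat (my own addition): since the ACF showed clear quarterly seasonality, it may be worth checking whether a seasonal difference is also suggested, for example with the unitroot_nsdiffs feature from feasts:

# number of seasonal differences suggested for the transformed series (sketch)
trans_tas %>% features(Takings, unitroot_nsdiffs)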

  c. Monthly sales from souvenirs
souvenirs %>% gg_tsdisplay(Sales, plot_type = 'partial')

This is an interesting data series: the tail end shows increasing variation with the level and a mild upward trend. The ACF plot has two lags above the critical value range, as does the PACF plot.

trans_souvenirs <- souvenirs
trans_souvenirs$Sales <- BoxCox(souvenirs$Sales, BoxCox.lambda(souvenirs$Sales))
trans_souvenirs %>% gg_tsdisplay(Sales, plot_type = 'partial')

Check for non-stationarity and differencing

unitroot_kpss(trans_souvenirs$Sales)
##   kpss_stat kpss_pvalue 
##    1.823545    0.010000

The KPSS statistic is large, and the ACF plot shows that the series is still non-stationary and will need differencing.

unitroot_ndiffs(trans_souvenirs$Sales)
## ndiffs 
##      1

The ndiffs function suggests one order of differencing.

ggtsdisplay(diff(trans_souvenirs$Sales))

unitroot_kpss(diff(trans_souvenirs$Sales))
##   kpss_stat kpss_pvalue 
##  0.06153858  0.10000000

The kpss_stat is much smaller and within an acceptable range for stationary data.
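
Because souvenirs is monthly with strong seasonality, a seasonal (lag-12) difference may also be worth examining. A sketch, where lambda_souv is my own helper computed with the guerrero feature:

lambda_souv <- souvenirs %>%
  features(Sales, features = guerrero) %>%
  pull(lambda_guerrero)
# ACF/PACF after a seasonal difference of the transformed series (sketch)
souvenirs %>%
  gg_tsdisplay(difference(box_cox(Sales, lambda_souv), 12), plot_type = 'partial')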

9.5

For your retail data (from Exercise 8 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.

set.seed(128)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

Autoplot the series to remind ourselves what the time series looks like.

ggtsdisplay(myseries$Turnover)

Given our familiarity with this series, the variance increasing with the level calls for a BoxCox transformation, and the strong autocorrelation and trend make it evident that the series is non-stationary.

transformed_series <- myseries
transformed_series$Turnover <- box_cox(myseries$Turnover, BoxCox.lambda(myseries$Turnover))

Let’s view the series after the BoxCox transformation

ggtsdisplay(transformed_series$Turnover)

Examine the transformed series’ KPSS statistic and the order of differencing needed.

unitroot_kpss(transformed_series$Turnover)
##   kpss_stat kpss_pvalue 
##    7.225846    0.010000

Very high KPSS statistic; this series is definitely not stationary.

unitroot_ndiffs(transformed_series$Turnover)
## ndiffs 
##      1
ggtsdisplay(diff(transformed_series$Turnover))

unitroot_kpss(diff(transformed_series$Turnover))
##   kpss_stat kpss_pvalue 
##  0.01172902  0.10000000

After first order differencing, the series produces a kpss_stat value well within the acceptable range for a stationary series.
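
Since the retail series is monthly and strongly seasonal, one could also check whether a seasonal difference is suggested (a sketch using the transformed series from above):

transformed_series %>% features(Turnover, unitroot_nsdiffs)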

9.6

Simulate and plot some data from simple ARIMA models.

  a. Use the following R code to generate data from an AR(1) model with ϕ₁ = 0.6 and σ² = 1. The process starts with y₁ = 0.
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
  b. Produce a time plot for the series. How does the plot change as you change ϕ₁?

Recreate the code from above as a function that accepts different values of ϕ₁ for expediency:

ar_generator <- function(phi) {
    y <- ts(numeric(100))
    e <- rnorm(100)

    # AR(1): each value is phi times the previous value plus white noise
    for(i in 2:100)
        y[i] <- phi*y[i-1] + e[i]

    return(y)
}

Phi = 0.2

autoplot(ar_generator(0.2))

Phi = 0.4

autoplot(ar_generator(0.4))

Phi = 0.6

autoplot(ar_generator(0.6))

Phi = 0.8

autoplot(ar_generator(0.8))

Phi = 1.0

autoplot(ar_generator(1.0))

As ϕ₁ increases, the runs in one direction become longer and more trend-like; at ϕ₁ = 1 the process becomes a random walk. The lower ϕ₁ is, the more the plot resembles white noise.

  c. Write your own code to generate data from an MA(1) model with θ₁ = 0.6 and σ² = 1.
ma1 <- function(theta, n=100L){
        y <- numeric(n)
        e <- rnorm(n)
        y[1] <- e[1]  # no prior error term exists for the first observation

        # MA(1): each value is white noise plus theta times the previous error
        for(i in 2:n)
            y[i] <- theta*e[i-1] + e[i]

        tsibble(idx = seq_len(n), y = y, index = idx)
}
  d. Produce a time plot for the series. How does the plot change as you change θ₁?

theta = 0

autoplot(ma1(0))
## Plot variable not specified, automatically selected `.vars = y`

theta = 0.6

autoplot(ma1(0.6))
## Plot variable not specified, automatically selected `.vars = y`

theta = 1

autoplot(ma1(1))
## Plot variable not specified, automatically selected `.vars = y`

  e. Generate data from an ARMA(1,1) model with ϕ₁ = 0.6, θ₁ = 0.6 and σ² = 1.
a1 <- function(phi, theta, n=100){
    y <- numeric(n)
    e <- rnorm(n)
    for(i in 2:n)
        y[i] <- phi*y[i-1] + theta*e[i-1] + e[i]
    tsibble(idx = seq_len(n), y = y, index = idx)
}

arghma1 <- a1(0.6,0.6)
  f. Generate data from an AR(2) model with ϕ₁ = −0.8, ϕ₂ = 0.3 and σ² = 1. (Note that these parameters will give a non-stationary series.)
a2 <- function(phi1, phi2, n=100){
    y <- numeric(n)
    e <- rnorm(n)
    for(i in 3:n)
        y[i] <- phi1*y[i-1] + phi2*y[i-2] + e[i]
    tsibble(idx = seq_len(n), y = y, index = idx)
}

arghma2 <- a2(-0.8,0.3)
  g. Graph the latter two series and compare them.

In this first plot, the ARMA(1,1) series tends to revert to its mean despite large swings in either direction.

arghma1 %>% autoplot(y)

This plot shows oscillations whose variance and amplitude grow as the series progresses, resembling an expanding sound wave, which confirms the AR(2) series is non-stationary.

arghma2 %>% autoplot(y)
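
To confirm the non-stationarity noted in part (f), we can inspect the roots of the AR characteristic polynomial 1 − ϕ₁z − ϕ₂z², here 1 + 0.8z − 0.3z²; stationarity requires all roots to lie outside the unit circle:

# moduli of the characteristic roots; coefficients in increasing order of power
Mod(polyroot(c(1, 0.8, -0.3)))  # roughly 0.93 and 3.59
# one root lies inside the unit circle, so the AR(2) process is non-stationary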

9.7

Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011

  a. Use ARIMA() to find an appropriate ARIMA model. What model was selected? Check that the residuals look like white noise. Plot forecasts for the next 10 periods.
autoplot(aus_airpassengers)
## Plot variable not specified, automatically selected `.vars = Passengers`

fit <- aus_airpassengers %>%
        model(ARIMA(Passengers))

AIC = 198.04, AICc = 198.32

report(fit)
## Series: Passengers 
## Model: ARIMA(0,2,1) 
## 
## Coefficients:
##           ma1
##       -0.8963
## s.e.   0.0594
## 
## sigma^2 estimated as 4.308:  log likelihood=-97.02
## AIC=198.04   AICc=198.32   BIC=201.65

The ARIMA function selected an ARIMA(0,2,1) model.

Let’s investigate the residuals.

fit %>% gg_tsresiduals()

The residuals look like white noise.

Now it’s time to perform the forecast

fit %>% forecast(h=10) %>% autoplot(aus_airpassengers)

  b. Write the model in terms of the backshift operator.

ARIMA(0,2,1) in terms of the backshift operator: (1 − B)²yₜ = (1 + θ₁B)εₜ, with θ₁ = −0.8963 from the fitted model.

source: https://online.stat.psu.edu/stat510/lesson/4/4.1

  c. Plot forecasts from an ARIMA(0,1,0) model with drift and compare these to part a.

Compared to part (a), the ACF lags are noticeably more positive and the AIC is slightly higher. The residuals still appear to be white noise. The confidence interval for part (a) is slightly wider than the range for this plot.

fit2 <- Arima(aus_airpassengers$Passengers, order = c(0,1,0), include.drift = TRUE)
fit2 %>% forecast(h=10) %>% autoplot()

AIC = 200.31, AICc = 200.59

fit2 
## Series: aus_airpassengers$Passengers 
## ARIMA(0,1,0) with drift 
## 
## Coefficients:
##        drift
##       1.4191
## s.e.  0.3014
## 
## sigma^2 estimated as 4.271:  log likelihood=-98.16
## AIC=200.31   AICc=200.59   BIC=203.97
# passing the model object itself, checkresiduals(fit2), would avoid the
# degrees-of-freedom warning below and also report a Ljung-Box test
checkresiduals(fit2$residuals)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
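
For a more direct comparison, a fable-based sketch (the model labels are my own) that overlays the part (a) and part (c) forecasts on one plot:

aus_airpassengers %>%
  model(auto  = ARIMA(Passengers),                       # part (a): ARIMA(0,2,1)
        drift = ARIMA(Passengers ~ 1 + pdq(0,1,0))) %>%  # part (c)
  forecast(h = 10) %>%
  autoplot(aus_airpassengers, level = 95)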

  d. Plot forecasts from an ARIMA(2,1,2) model with drift and compare these to parts a and c. Remove the constant and see what happens.

The forecast doesn’t appear as smooth compared to the plots in parts (a) and (c). The AIC is also higher than for the previous models.

Plot with drift

fit3 <- Arima(aus_airpassengers$Passengers, order = c(2,1,2), include.drift = TRUE) 
fit3 %>% forecast(h=10) %>% autoplot()

AIC = 204.46, AICc = 206.61

fit3 
## Series: aus_airpassengers$Passengers 
## ARIMA(2,1,2) with drift 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2   drift
##       -0.5518  -0.7327  0.5895  1.0000  1.4101
## s.e.   0.1684   0.1224  0.0916  0.1008  0.3162
## 
## sigma^2 estimated as 4.031:  log likelihood=-96.23
## AIC=204.46   AICc=206.61   BIC=215.43
checkresiduals(fit3$residuals)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

Plot with constant removed

#fit4 <- Arima(aus_airpassengers$Passengers, order = c(2,1,2), include.drift = FALSE) 

#fit4 %>% forecast(h=10) %>% autoplot()

Trying to remove the constant causes an estimation error, meaning the constant is necessary: without the drift term the fitted AR part is non-stationary, so Arima() refuses to return a model.
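
A sketch that reproduces the attempt without halting the knit, wrapping the call in try():

# the commented-out call above errors; try() captures it so the document knits
res <- try(Arima(aus_airpassengers$Passengers, order = c(2,1,2),
                 include.drift = FALSE))
if (inherits(res, "try-error")) print("model estimation failed")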

  e. Plot forecasts from an ARIMA(0,2,1) model with a constant. What happens?

It produces the same plot as part (a). With two orders of differencing, Arima() does not actually include the constant (a constant combined with d = 2 would induce a quadratic trend), so the fitted model is unchanged.

fit4 <- Arima(aus_airpassengers$Passengers, order = c(0,2,1), include.constant = TRUE) 
fit4 %>% forecast(h=10) %>% autoplot()

9.8

For the United States GDP series (from global_economy):

  a. if necessary, find a suitable Box-Cox transformation for the data;
us_gdp <- global_economy %>%
  filter(Code == "USA")
ggtsdisplay(us_gdp$GDP)

There is a clear trend and strong autocorrelation. Even though I don’t think a BoxCox transformation is strictly necessary, I will perform one anyway, since the previous examples showed a transformation can help build a better model.

(I also attempted this problem without the transformation, and the transformed model performed better.)
  b. fit a suitable ARIMA model to the transformed data using ARIMA();
lambda <- us_gdp %>%
            features(GDP, features = guerrero) %>%
            pull(lambda_guerrero)
fit <- us_gdp %>% model(ARIMA(box_cox(GDP, lambda)))
report(fit)
## Series: GDP 
## Model: ARIMA(1,1,0) w/ drift 
## Transformation: box_cox(GDP, lambda) 
## 
## Coefficients:
##          ar1  constant
##       0.4586  118.1822
## s.e.  0.1198    9.5047
## 
## sigma^2 estimated as 5479:  log likelihood=-325.32
## AIC=656.65   AICc=657.1   BIC=662.78
gg_tsresiduals(fit)

  c. try some other plausible models by experimenting with the orders chosen;

Based on ndiffs and the resulting ACF plots, this series needs a second order of differencing.

unitroot_ndiffs(us_gdp$GDP)
## ndiffs 
##      2
ggtsdisplay(diff(diff(us_gdp$GDP)))

us_fit <- us_gdp %>%
  model(arima020 = ARIMA(box_cox(GDP, lambda) ~ pdq(0,2,0)),
        arima021 = ARIMA(box_cox(GDP, lambda) ~ pdq(0,2,1)),
        arima121 = ARIMA(box_cox(GDP, lambda) ~ pdq(1,2,1)),
        arima122 = ARIMA(box_cox(GDP, lambda) ~ pdq(1,2,2)),
        arima123 = ARIMA(box_cox(GDP, lambda) ~ pdq(1,2,3)),
        arima221 = ARIMA(box_cox(GDP, lambda) ~ pdq(2,2,1)),
        arima222 = ARIMA(box_cox(GDP, lambda) ~ pdq(2,2,2)),
        arima321 = ARIMA(box_cox(GDP, lambda) ~ pdq(3,2,1)),
        arima322 = ARIMA(box_cox(GDP, lambda) ~ pdq(3,2,2)),
        arima323 = ARIMA(box_cox(GDP, lambda) ~ pdq(3,2,3)),
        stepwise = ARIMA(box_cox(GDP, lambda)),
        search = ARIMA(box_cox(GDP, lambda), stepwise=FALSE))
## Warning in sqrt(diag(best$var.coef)): NaNs produced

## Warning in sqrt(diag(best$var.coef)): NaNs produced

## Warning in sqrt(diag(best$var.coef)): NaNs produced
glance(us_fit) %>% arrange(AICc)
## # A tibble: 12 x 9
##    Country       .model   sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots 
##    <fct>         <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>   
##  1 United States arima121  5761.   -321.  648.  649.  655. <cpl [1]> <cpl [1]>
##  2 United States arima021  6020.   -323.  650.  650.  654. <cpl [0]> <cpl [1]>
##  3 United States arima221  5834.   -321.  650.  651.  658. <cpl [2]> <cpl [1]>
##  4 United States arima122  5845.   -321.  650.  651.  658. <cpl [1]> <cpl [2]>
##  5 United States arima321  5919.   -321.  652.  653.  662. <cpl [3]> <cpl [1]>
##  6 United States arima123  5964.   -321.  652.  653.  662. <cpl [1]> <cpl [3]>
##  7 United States arima222  5971.   -321.  652.  654.  662. <cpl [2]> <cpl [2]>
##  8 United States arima323  5279.   -319.  652.  654.  666. <cpl [3]> <cpl [3]>
##  9 United States arima322  6001.   -321.  654.  655.  666. <cpl [3]> <cpl [2]>
## 10 United States stepwise  5479.   -325.  657.  657.  663. <cpl [1]> <cpl [0]>
## 11 United States search    5479.   -325.  657.  657.  663. <cpl [1]> <cpl [0]>
## 12 United States arima020  7201.   -328.  658.  658.  660. <cpl [0]> <cpl [0]>

Interestingly, the majority of my manually chosen models beat the automated models, including the best-fit model from part (b) (ARIMA(1,1,0) w/ drift). The automatically chosen model used only one order of differencing, while I opted for two, which did make a difference in the AIC values. (One caveat: AICc values are only strictly comparable between models with the same order of differencing, so this comparison should be read with some caution.)

  d. choose what you think is the best model and check the residual diagnostics;

The arima121 model produced the lowest AICc value and is therefore the selected model. Let’s evaluate the residuals.

us_fit %>%
  dplyr::select(arima121) %>%
  gg_tsresiduals()

The residuals look like white noise on my chosen model arima121.

augment(us_fit) %>%
  filter(.model=='arima121') %>%
  # dof should equal the number of estimated parameters; ARIMA(1,2,1) has two
  # (ar1 and ma1), so arguably dof = 2 rather than 3 here
  features(.innov, ljung_box, lag = 10, dof = 3)
## # A tibble: 1 x 4
##   Country       .model   lb_stat lb_pvalue
##   <fct>         <chr>      <dbl>     <dbl>
## 1 United States arima121    4.36     0.738

The large p-value means we fail to reject the null hypothesis, consistent with the residuals being white noise.

  e. produce forecasts of your fitted model. Do the forecasts look reasonable?
us_fit %>%
  forecast(h=5) %>%
  filter(.model=='arima121') %>%
  autoplot(us_gdp)

The forecast certainly looks reasonable: there is a strong upward trend that is predicted to continue.

  f. compare the results with what you would obtain using ETS() (with no transformation).
us_gdp %>%
    model(ETS(GDP)) %>%
    forecast(h = 5) %>%
    autoplot(us_gdp)

The forecast produced by the ETS model is very broad compared to the ARIMA model; its prediction interval covers a much larger range, while ARIMA gives a more confined prediction. A tighter interval is useful because a data scientist/analyst can judge more quickly whether the model is wrong than with broad estimates.
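
To make the interval comparison concrete, a sketch (assuming lambda from part (b) is still in scope) that overlays both models’ five-year forecasts:

us_gdp %>%
  model(arima121 = ARIMA(box_cox(GDP, lambda) ~ pdq(1,2,1)),
        ets      = ETS(GDP)) %>%
  forecast(h = 5) %>%
  autoplot(us_gdp, level = 80)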