Data 624 Assignment 6: ARIMA

The following figure shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.

a. Explain the differences among these figures. Do they all indicate that the data are white noise?

These pictures show the correlation between different lags of the series (shown on the x-axis).
The y-axis (the correlation) has the same scale for each plot, but the x axis shows an increasing number of lags as the series gets longer.
If the data are white noise (random) then we expect the correlations to be below the blue line, which indicates a significant lag.
For all the plots, the correlations of the lags shown are all below the significance level so they are all indicitive of white noise.

For a white noise series, we expect 95% of the spikes in the ACF to lie within ±2/T−√, where T is the length of the time series. That is why, as T gets larger, the range between the dashed lines around the mean of zero in the diagrams is getting narrower. The diagrams do have some spikes touching or going slightly beyond the 95% interval border lines and, counted together, none of them make up more than 5% of T values. Therefore all 3 series can be regarded as white noise.

In other words, if the vast majority of the spikes are within the blue dashed line you likely have white noise - This is the case with all three plots.

b. Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

The reason why the critical values are at different distances from the mean of zero is because there is a random autocorrelation with some positive and negative values around the zero line.

Given that the 3 series are composed of randomly chosen numbers, we expect the values and subsequently the autocorrelation values (in magnitude and direction) to also be random. Therefore, we would not expect all three graphs to be identical.

A classic example of a non-stationary series is the daily closing IBM stock price series (data set ibmclose). Use R to plot the daily closing prices for IBM stock and the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

autoplot(ibmclose) + theme_fivethirtyeight() + 
  labs(title = "IBM Closing Prices", subtitle = 'Daily Data')

ggtsdisplay(ibmclose)

ACF plot shows that the autocorrelation values are bigger than critical values (blue line) and decreases slowly over time. Also, r1 in the PACF is large and positive at about the value 1, indicating a lag value can be utilized to forecast the series. This also means that the series is non-stationary. Generally, the PACF plot shows that there is a strong correlation between IBM stock data and their 1 lagged values.

To achieve stationary data, IBM stock data would need to be differenced. Differencing wil stabilize the mean of a time series by removing changes in the level of a time series. Therefore it will eliminate or reduce trend and seasonality, thus making the series more stationary.

For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.

usnetelec

The plot shows a series that increases postively in a linear fashion. The lamda of .51 indicates a square-root transform could be utilized, but it did not appear to change the plot very much. The series displays no seasonality, therefore first differencing would be appropriate.

autoplot(var) + theme_fivethirtyeight() + labs(title = varStr, subtitle = '1949-2003')

(lambda <- BoxCox.lambda(var))

## [1] 0.5167714

#(nsdiffs(var))  This produces error indicating non-seasonal data.

(ndiffs(var))

## [1] 1

(frequency(var))

## [1] 1

autoplot(BoxCox(var, lambda)) + theme_fivethirtyeight() + labs(title = varStr)

ggtsdisplay(diff(var))

Box.test(diff(var))

## 
##  Box-Pierce test
## 
## data:  diff(var)
## X-squared = 0.80522, df = 1, p-value = 0.3695

usgdp

In this case the lambda transform value of 0.36 and resulting transform (Log or square root) did straighten out the plot - linear. As a result, this series lends itself to a linear regression analysis. The ndiffs function below indicates no seasonality and similar to the prior series first diffencing would be appropriate to achieve white noise.

autoplot(var) + theme_fivethirtyeight() + labs(title = varStr, subtitle = 'Quarterly 1947-2006')

(lambda <- BoxCox.lambda(var))

## [1] 0.366352

(nsdiffs(var))

## [1] 0

(ndiffs(var))

## [1] 2

(frequency(var))

## [1] 4

autoplot(BoxCox(var, lambda)) + theme_fivethirtyeight() + labs(title = varStr)

ggtsdisplay(diff(var))

Box.test(diff(var))

## 
##  Box-Pierce test
## 
## data:  diff(var)
## X-squared = 38.693, df = 1, p-value = 4.959e-10

mcopper

Our lambda transform value of copper is 0.19 (near zero) so a log transform would be employed. The series displays an increasing trend and also may have some out-liers around the great-recession that could be influencing the series. The series appears to have monthly seasonality which seems to be supported somewhat by the polar plot below. Finally, the plots below indicate that the box-cox transform and first differencing made the series (near) stationary.

autoplot(var) + theme_fivethirtyeight() + labs(title = varStr, subtitle = '1960 - 2007')

(lambda <- BoxCox.lambda(var))

## [1] 0.1919047

(nsdiffs(var))

## [1] 0

(ndiffs(var))

## [1] 1

(frequency(var))

## [1] 12

autoplot(BoxCox(var, lambda)) + theme_fivethirtyeight() + labs(title = varStr)

ggtsdisplay(diff(var))

Box.test(diff(var))

## 
##  Box-Pierce test
## 
## data:  diff(var)
## X-squared = 38.705, df = 1, p-value = 4.93e-10

ggseasonplot(var, polar = TRUE)

enplanements

The emplanements series shows an upward trend and strong seasonality. The variability of the series also appears to increase over time. The box-cox transform value of -0.22 results and resulting transform (log) improve the series plot reducing variability (smaller highs and lows). The nsdiffs function (value of 1) indicates a seasonal difference. In turn, we apply the ndiff function which suggests first differencing. We also see that frequency of the series is annual. Accordingly, the plot with box-cox transform, seasonal differncing and a lag of 1 (BoxCox(BoxCox.lambda(var)) %>% diff(12) %>% diff(1)) appears to yield the most stationary series.

autoplot(var) + theme_fivethirtyeight() + labs(title = varStr, subtitle = '1996-2000')

(lambda <- BoxCox.lambda(var))

## [1] -0.2269461

(nsdiffs(var))

## [1] 1

(ndiffs(var))

## [1] 1

(frequency(var))

## [1] 12

autoplot(BoxCox(var, lambda)) + theme_fivethirtyeight() + labs(title = varStr)

ggtsdisplay(diff(var))

ggtsdisplay(var %>% BoxCox(BoxCox.lambda(var))  %>% diff(12) %>% diff(1))

Box.test(diff(var))

## 
##  Box-Pierce test
## 
## data:  diff(var)
## X-squared = 12.811, df = 1, p-value = 0.0003446

ggseasonplot(var, polar = TRUE)

visitors

Similar to the emplanements series the visior’s lambda transform value calls for a log transform. The nsdiffs and ndiffs results of 1 and 1, respectively indicate that a first difference after a seasonal difference will result in a stationary series.

autoplot(var) + theme_fivethirtyeight() + labs(title = varStr, subtitle = 'May 1985 - April 2005')

(lambda <- BoxCox.lambda(var))

## [1] 0.2775249

(nsdiffs(var))

## [1] 1

(ndiffs(var))

## [1] 1

(frequency(var))

## [1] 12

autoplot(BoxCox(var, lambda)) + theme_fivethirtyeight() + labs(title = varStr)

ggtsdisplay(diff(var))

Box.test(diff(var))

## 
##  Box-Pierce test
## 
## data:  diff(var)
## X-squared = 28.739, df = 1, p-value = 8.283e-08

ggtsdisplay(var %>% BoxCox(BoxCox.lambda(var))  %>% diff(12) %>% diff(1))

For your retail data (from Exercise 3 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.

The lambda value of 0.19 calls for the log transforms and clearly straightens out the data series. The nsdiff value of 1 as well as the plots indicate seasonality. Therefore, single seasonal differencing along with the box-cox transform should lead yield a stionary series.

retaildata <- readxl::read_excel("retail.xlsx", skip = 1)
myts <- ts(retaildata[,"A3349335T"], frequency=12, start=c(1982,4))
myts %>% ggtsdisplay()

lambda <- BoxCox.lambda(myts)
lambda

## [1] 0.193853

tdata <- BoxCox(myts, lambda)
tdata %>% ggtsdisplay()

tdata %>% diff(lag = frequency(tdata)) %>% ggtsdisplay()

tdata %>% ndiffs()

## [1] 1

tdata %>% diff(lag = frequency(tdata)) %>% diff() %>%  ggtsdisplay()

The low-value lambda transformation straightens the plot line

Use R to simulate and plot some data from simple ARIMA models.

Use the following R code to generate data from an AR(1) model with ϕ1=0.6 and σ2=1. The process starts with y1=0.

set.seed(123)

ts_AR <- function(p){
  y <- ts(numeric(100))
  e <- rnorm(100)
  for(i in 2:100)
    y[i] <- p*y[i-1] + e[i]
  return (y)
}

ts_AR06 <- ts_AR(0.6)
ts_AR08 <- ts_AR(0.8)
ts_AR10 <- ts_AR(1)

Produce a time plot for the series. How does the plot change as you change ϕ1. ϕ1=0

Time plot for series with ϕ = 0.6

ts_AR06 %>% 
  ggtsdisplay()

Time plot for series with ϕ = 0.8

ts_AR08 %>% 
  ggtsdisplay()

Time plot for series with ϕ = 1.0

ts_AR10 %>% 
  ggtsdisplay()

autoplot(ts_AR06, size = 1, series = "0.6") +
  autolayer(ts_AR08, size = 1, series = "0.8") +
  autolayer(ts_AR10, size = 1, series = "1.0") +
  theme_fivethirtyeight() + labs(title = "AR(1) Models", subtitle = 'p = 0.6, 0.8 and 1.0')

Write your own code to generate data from an MA(1) model with theta1 = 0.6 and sigma^2 = 1.

set.seed(123)

ts_MA <- function(t) {
    y <- ts(numeric(100))
    e <- rnorm(100)
    for(i in 2:100)
      y[i] <- t * e[i-1] + e[i]
    return (y)
}

Produce a time plot for the series. How does the plot change as you change theta1?

Time plot for series with ϕ = 0.6

ts_MA(0.6) %>% 
  ggtsdisplay()

Time plot for series with ϕ = 0.8

ts_MA(0.8) %>% 
  ggtsdisplay()

Time plot for series with ϕ = 1.0

ts_MA(1) %>% 
  ggtsdisplay()

autoplot(ts_MA(0.6), size = 1, series = "0.6") +
  autolayer(ts_MA(0.8), size = 1, series = "0.8") +
  autolayer(ts_MA(1), size = 1, series = "1.0") +
  theme_fivethirtyeight() + labs(title = "MA(1) Models", subtitle = 't = 0.6, 0.8 and 1.0')

Generate data from an ARMA(1,1) model with phi1 = 0.6, theta1 = 0.6 and sigma^2 = 1.

set.seed(123)

ts_ARMA <- function(p, t) {
    y <- ts(numeric(100))
    e <- rnorm(100)
    for(i in 2:100)
      y[i] <- p * y[i-1] + t *e[i-1] + e[i]
    return (y)
}

Generate data from an AR(2) model with phi1 = -0.8, phi2 = 0.3 and sigma^2 = 1. (Note that these parameters will give a non-stationary series.)

set.seed(123)
ts_AR2 <- function(p1, p2) {
    y <- ts(numeric(100))
    e <- rnorm(100)
    for(i in 3:100)
      y[i] <- p1*y[i-1] + p2*y[i-2] + e[i]
    return (y)
}

Graph the latter two series and compare them.

Both ARIMA(1,1) and ARIMA(2) are scattered around zero, however ARMA(1,1) look more like white noise, whereas AR(2)’s variance increase with time to form a bugle/horn like graph .

ARMA(1,1)

ts_ARMA(0.6, 0.6) %>% 
  ggtsdisplay()

AR(2)

ts_AR2(-0.8, 0.3) %>% 
  ggtsdisplay()

Consider the number of women murdered each year (per 100,000 standard population) in the United States. (Data set wmurders).

By studying appropriate graphs of the series in R, find an appropriate ARIMA(p,d,q) model for these data.

wmurders %>% 
  ggtsdisplay()

Initially, the series shows an positive upward trend. It then levels off from the 70s through the 90s and then begins a steady decline into the 2000s. There is also a upward spike in early 2000 that temporarily interrupts the steady decline.

wmurders %>% 
  ndiffs()

## [1] 2

wmurders %>% 
  diff() %>% 
  Box.test()

## 
##  Box-Pierce test
## 
## data:  .
## X-squared = 0.39628, df = 1, p-value = 0.529

wmurders %>% 
  diff() %>% 
  diff() %>% 
  Box.test()

## 
##  Box-Pierce test
## 
## data:  .
## X-squared = 24.722, df = 1, p-value = 6.623e-07

wmurders %>% 
  diff() %>% 
  ggtsdisplay()

Alternative Forecasts

my_fit1 <- Arima(wmurders, order = c(2,1,0))
my_fit1

## Series: wmurders 
## ARIMA(2,1,0) 
## 
## Coefficients:
##           ar1     ar2
##       -0.0572  0.2967
## s.e.   0.1277  0.1275
## 
## sigma^2 estimated as 0.04265:  log likelihood=9.48
## AIC=-12.96   AICc=-12.48   BIC=-6.99

my_fit2 <- Arima(wmurders, order = c(0,1,2))
my_fit2

## Series: wmurders 
## ARIMA(0,1,2) 
## 
## Coefficients:
##           ma1     ma2
##       -0.0660  0.3712
## s.e.   0.1263  0.1640
## 
## sigma^2 estimated as 0.0422:  log likelihood=9.71
## AIC=-13.43   AICc=-12.95   BIC=-7.46

checkresiduals(my_fit2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2)
## Q* = 9.7748, df = 8, p-value = 0.2812
## 
## Model df: 2.   Total lags used: 10

Should you include a constant in the model? Explain.

No, I don’t believe a constant should be added to the model. The time series does not appear to a have a consistent trend. Therefore the introduction of a constant could cause drift and undermine the model.

Write this model in terms of the backshift operator.

The model is characterized by (1−B)yt

Fit the model using R and examine the residuals. Is the model satisfactory?

The residuals plot for the ARIMA(0,1,2) above appears to be white noise, which is indicative of a good fit.

Forecast three times ahead. Check your forecasts by hand to make sure that you know how they have been calculated.

my_fit3 <-  forecast(my_fit2, h= 3)

Hand Calculation

r <- resid(my_fit2)
l <- length(r)
et <- r[l]
et_1 <- r[l - 1]
ma1 <- coef(my_fit2)["ma1"]
ma2 <- coef(my_fit2)["ma2"]


my_f1_byhand <- wmurders[length(wmurders)] + ma1 * et + ma2 * et_1
my_f2_byhand <- my_f1_byhand + ma2 * et
my_f3_byhand <- my_f2_byhand

Compare Forecast

my_fit3

##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2005       2.458450 2.195194 2.721707 2.055834 2.861066
## 2006       2.477101 2.116875 2.837327 1.926183 3.028018
## 2007       2.477101 1.979272 2.974929 1.715738 3.238464

c(my_f1_byhand, my_f2_byhand, my_f3_byhand)

##      ma1      ma1      ma1 
## 2.458450 2.477101 2.477101

Create a plot of the series with forecasts and prediction intervals for the next three periods shown.

autoplot(my_fit3) + theme_fivethirtyeight() + labs(title = "Women's Murders Per Year", subtitle = 'per 100,000 standard population')

Does auto.arima give the same model you have chosen? If not, which model do you think is better?

The auto.arima model performed worse than the manually selected model.

auto_fit <- auto.arima(wmurders, seasonal = FALSE, stepwise = FALSE, approximation = FALSE)
auto_fit

## Series: wmurders 
## ARIMA(0,2,3) 
## 
## Coefficients:
##           ma1     ma2      ma3
##       -1.0154  0.4324  -0.3217
## s.e.   0.1282  0.2278   0.1737
## 
## sigma^2 estimated as 0.04475:  log likelihood=7.77
## AIC=-7.54   AICc=-6.7   BIC=0.35

Data 624 Assignment 6: ARIMA

Jim Mundy

usnetelec

usgdp

mcopper

enplanements

visitors

Time plot for series with ϕ = 0.6

Time plot for series with ϕ = 0.8

Time plot for series with ϕ = 1.0

Time plot for series with ϕ = 0.6

Time plot for series with ϕ = 0.8

Time plot for series with ϕ = 1.0

Both ARIMA(1,1) and ARIMA(2) are scattered around zero, however ARMA(1,1) look more like white noise, whereas AR(2)’s variance increase with time to form a bugle/horn like graph .

ARMA(1,1)

AR(2)

By studying appropriate graphs of the series in R, find an appropriate ARIMA(p,d,q) model for these data.

Alternative Forecasts

Should you include a constant in the model? Explain.

No, I don’t believe a constant should be added to the model. The time series does not appear to a have a consistent trend. Therefore the introduction of a constant could cause drift and undermine the model.

Write this model in terms of the backshift operator.

The model is characterized by (1−B)yt

Fit the model using R and examine the residuals. Is the model satisfactory?

The residuals plot for the ARIMA(0,1,2) above appears to be white noise, which is indicative of a good fit.

Forecast three times ahead. Check your forecasts by hand to make sure that you know how they have been calculated.

Hand Calculation

Compare Forecast

Create a plot of the series with forecasts and prediction intervals for the next three periods shown.

Does auto.arima give the same model you have chosen? If not, which model do you think is better?

The auto.arima model performed worse than the manually selected model.