ARIMA models
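
The code chunks below assume the fpp2 package (which attaches forecast and ggplot2 and provides the textbook data sets) and the urca package (for ur.kpss()) are loaded:

library(fpp2)   # forecast, ggplot2 and the data sets used below
library(urca)   # ur.kpss() unit root test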

8.1

Figure 8.31 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.

a.

Explain the differences among these figures. Do they all indicate that the data are white noise?

All three ACF plots are consistent with white noise because at least 95% of the spikes lie within \(\pm2/\sqrt{T}\), where T is the length of the time series; the blue dashed lines mark these bounds. The figures differ in the number of lags shown and in the width of the bounds: the first plot (T = 36) has much wider bounds and larger spikes than the other two.

b.

Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

The critical values lie at \(\pm2/\sqrt{T}\), where T is the length of the time series, so as T increases the bounds become narrower. The autocorrelations differ in each figure because each is computed from a different finite random sample; with more observations, the sample autocorrelations of white noise are closer to zero.
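
A quick check of the bound \(\pm2/\sqrt{T}\) for the three sample sizes (assuming the series lengths in Figure 8.31 are 36, 360 and 1,000):

round(2/sqrt(c(36, 360, 1000)), 3)
## [1] 0.333 0.105 0.063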

8.2

A classic example of a non-stationary series is the daily closing IBM stock price series (data set ibmclose). Use R to plot the daily closing prices for IBM stock and the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

tsdisplay(ibmclose)

The IBM closing price plot shows an upward and then a downward trend over the period. A stationary time series is one whose properties do not depend on the time at which the series is observed, so this series is clearly not stationary. The ACF decays very slowly, whereas for stationary data the ACF drops to zero relatively quickly. The PACF has one large significant spike at lag 1, with the remaining spikes lying within the bounds. All of this indicates that the series should be differenced.
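
As a quick sketch of the follow-up check, ndiffs() suggests how many differences are needed and the differenced series can then be inspected:

ndiffs(ibmclose)
tsdisplay(diff(ibmclose))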

8.3

For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.

a. usnetelec

ggtsdisplay(usnetelec)

ustrans<-BoxCox(usnetelec,BoxCox.lambda(usnetelec))
ndiffs(ustrans)
## [1] 2

Two orders of differencing are needed to obtain stationary data.

diff(diff(ustrans))%>%
    tsdisplay()

Compared with the previous plot, the upward trend has been removed and about 95% of the spikes now lie within the blue dashed bounds.

b. usgdp

ggtsdisplay(usgdp)

gdptrans<-BoxCox(usgdp,BoxCox.lambda(usgdp))
ndiffs(gdptrans)
## [1] 1
diff(gdptrans)%>%
    tsdisplay()

One order of differencing is needed to obtain stationary data. The differenced series shows no upward trend, and apart from lags 1, 2 and 12 the spikes lie within the bounds.

c. mcopper

tsdisplay(mcopper)

mcoppertrans<-BoxCox(mcopper,BoxCox.lambda(mcopper))
ndiffs(mcoppertrans)
## [1] 1
tsdisplay(diff(mcoppertrans))

The data are stationary after the Box-Cox transformation and one order of differencing; about 95% of the spikes lie within the bounds.

d. enplanements

tsdisplay(enplanements)

entrans<-BoxCox(enplanements,BoxCox.lambda(enplanements))
ndiffs(entrans)
## [1] 1
tsdisplay(diff(entrans))

The original series has an upward trend and clear seasonality. ndiffs() indicates one order of first differencing, but after that difference more than half of the ACF spikes still lie outside the bounds, so a seasonal difference is also needed (see the check below).
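
As a cross-check, nsdiffs() from the forecast package returns the recommended number of seasonal differences for the transformed series:

nsdiffs(entrans)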

tsdisplay(diff(diff(entrans),lag=12))

After the seasonal difference the data appear stationary. In the ACF, apart from lags 1 and 12, most spikes lie within the bounds, and in the PACF about 95% of the spikes are within the bounds.

e. visitors

tsdisplay(visitors)

visitorsTrans<-BoxCox(visitors,BoxCox.lambda(visitors))
ndiffs(visitorsTrans)
## [1] 1
tsdisplay(diff(diff(visitorsTrans),lag=12))

As with the enplanements data, the visitors series has an upward trend and strong seasonality. ndiffs() suggests one order of first differencing, and we apply one seasonal difference as well (see the check below). After the Box-Cox transformation and differencing, the ACF and PACF plots suggest the differenced series resembles white noise.
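
The same nsdiffs() cross-check applies here:

nsdiffs(visitorsTrans)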

8.5

For your retail data (from Exercise 3 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.

retaildata <- readxl::read_excel("retail.xlsx", skip=1)
myts <- ts(retaildata[,"A3349873A"],
  frequency=12, start=c(1982,4))

autoplot(myts)

retailTrans<-BoxCox(myts, lambda = BoxCox.lambda(myts))
ndiffs(retailTrans)
## [1] 1

The plot shows an upward trend and seasonality over the period. We therefore apply a Box-Cox transformation, then one order of first differencing and one order of seasonal differencing.
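
A cross-check of the seasonal order with nsdiffs():

nsdiffs(retailTrans)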

mytsDif<-diff(diff(retailTrans),lag = 12)
tsdisplay(mytsDif)

mytsDif %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0138 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

After the Box-Cox transformation and differencing, the series resembles white noise. The KPSS test statistic (0.0138) is well below all of the critical values, so we cannot reject the null hypothesis of stationarity. The data are stationary.

8.6

Use R to simulate and plot some data from simple ARIMA models.

a.

Use the following R code to generate data from an AR(1) model with \(\phi_1=0.6\) and \(\sigma^2=1\). The process starts with \(y_1=0\).

y <- ts(numeric(100))
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
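
An equivalent simulation can also be produced with the built-in arima.sim() function (the seed below is arbitrary, chosen only for reproducibility):

set.seed(123)  # arbitrary seed for reproducibility
y_sim <- arima.sim(model = list(ar = 0.6), n = 100)  # AR(1) with phi = 0.6, sigma^2 = 1
autoplot(y_sim)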

b.

Produce a time plot for the series. How does the plot change as you change \(\phi_1\)?

y <- ts(numeric(100))
e <- rnorm(100)   # the same errors are reused below so the plots are comparable

for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
autoplot(y) + ggtitle('phi = 0.6')

for(i in 2:100)
  y[i] <- 1*y[i-1] + e[i]
autoplot(y) + ggtitle('phi = 1')

for(i in 2:100)
  y[i] <- 0.1*y[i-1] + e[i]
autoplot(y) + ggtitle('phi = 0.1')

Changing \(\phi_1\) results in different time series patterns. The larger \(\phi_1\) is, the larger the variance of the series and the more persistent it looks. With \(\phi_1=1\) the model becomes a random walk, which is no longer stationary: the series covers a much wider range of y values, with long upward and downward swings.
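
This follows from the variance of a stationary AR(1) process, \(\operatorname{Var}(y_t)=\sigma^2/(1-\phi_1^2)\): about \(1.01\) for \(\phi_1=0.1\) and \(1.56\) for \(\phi_1=0.6\), while for \(\phi_1=1\) the variance is unbounded (a random walk).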

c.

Write your own code to generate data from an MA(1) model with \(\theta_1=0.6\) and \(\sigma^2=1\)

y <- ts(numeric(100))
e<- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*e[i-1]+e[i]

d.

Produce a time plot for the series. How does the plot change as you change \(\theta_1\)?

y <- ts(numeric(100))
e <- rnorm(100)

for(i in 2:100)
  y[i] <- 0.6*e[i-1] + e[i]
autoplot(y) + ggtitle("Theta = 0.6")

for(i in 2:100)
  y[i] <- 1.5*e[i-1] + e[i]
autoplot(y) + ggtitle("Theta = 1.5")

for(i in 2:100)
  y[i] <- 0.05*e[i-1] + e[i]
autoplot(y) + ggtitle("Theta = 0.05")

Changing \(\theta_1\) also results in different time series patterns. The larger \(\theta_1\) is, the larger the variance of the series; the range of y values clearly increases when \(\theta_1=1.5\).
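
This matches the variance of an MA(1) process, \(\operatorname{Var}(y_t)=(1+\theta_1^2)\sigma^2\): about \(1.00\) for \(\theta_1=0.05\), \(1.36\) for \(\theta_1=0.6\) and \(3.25\) for \(\theta_1=1.5\).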

e.

Generate data from an ARMA(1,1) model with \(\phi_1=0.6 ,\theta_1=0.6\) and \(\sigma^2=1\).

y1 <- ts(numeric(100))
e <- rnorm(100,sd=1)

for(i in 2:100)
  y1[i] <- 0.6*y1[i-1] +0.6*e[i-1]+ e[i]

f.

Generate data from an AR(2) model with \(\phi_1=-0.8,\ \phi_2=0.3\) and \(\sigma^2=1\). (Note that these parameters will give a non-stationary series.)

y2 <- ts(numeric(100))
e <- rnorm(100,sd=1)

for(i in 3:100)
  y2[i] <- -0.8*y2[i-1] + 0.3*y2[i-2]+e[i]

g.

Graph the latter two series and compare them.

autoplot(y1) + ggtitle('ARMA(1,1): phi = 0.6 & theta = 0.6')

autoplot(y2) + ggtitle('AR(2): phi_1 = -0.8 & phi_2 = 0.3')

tsdisplay(y1)

tsdisplay(y2)

The ARMA(1,1) series appears stationary: there is no visible trend or seasonality, although the ACF shows that roughly a third of the spikes lie outside the bounds, so the data are not white noise. Because \(|\theta_1|<1\), the most recent observations receive more weight than observations from the distant past, so the process is invertible. The AR(2) series y2 oscillates between positive and negative values (as expected when \(\phi_1<0\)) and its amplitude grows over time, confirming that these parameters give a non-stationary series.
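
As a quick sketch of why these AR(2) parameters are non-stationary, we can check the roots of the characteristic polynomial \(1-\phi_1 B-\phi_2 B^2\); stationarity requires all roots to lie outside the unit circle:

# Roots of 1 + 0.8B - 0.3B^2 (phi_1 = -0.8, phi_2 = 0.3)
Mod(polyroot(c(1, 0.8, -0.3)))

One of the moduli is about 0.93, inside the unit circle, which confirms the non-stationarity noted in part f.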

8.7

Consider wmurders, the number of women murdered each year (per 100,000 standard population) in the United States.

a.

By studying appropriate graphs of the series in R, find an appropriate ARIMA(p,d,q) model for these data.

tsdisplay(wmurders)

ndiffs(wmurders)
## [1] 2
wmurders_dif<-diff(diff(wmurders))
tsdisplay(wmurders_dif)

wmurders_dif%>% ur.kpss()%>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 3 lags. 
## 
## Value of test-statistic is: 0.0458 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

ndiffs() suggests two orders of differencing to make the data stationary, so we take d = 2; the KPSS statistic on the twice-differenced series (0.0458) is below the critical values, consistent with stationarity. The ACF shows significant spikes at lags 1 and 2 but none beyond lag 2, while the PACF has one significant spike at lag 1 and then decays. This suggests starting with an ARIMA(0,2,2) model.

b.

Should you include a constant in the model? Explain.

No, we should not include a constant. Including a constant in a non-stationary ARIMA model induces a polynomial trend of order d in the forecast function; if the constant is omitted, the forecast function includes a polynomial trend of order d-1. Here d = 2, so a constant would imply a quadratic trend, which is dangerous for forecasting; for d > 1 the constant is normally omitted.

c.

Write this model in terms of the backshift operator.

The ARIMA(0,2,2) model with no constant can be written in backshift notation as

\((1-B)^2 y_t=(1+\theta_1 B+\theta_2 B^2)\varepsilon_t\)

d.

Fit the model using R and examine the residuals. Is the model satisfactory?

fit<-Arima(wmurders,order=c(0,2,2))
fit
## Series: wmurders 
## ARIMA(0,2,2) 
## 
## Coefficients:
##           ma1     ma2
##       -1.0181  0.1470
## s.e.   0.1220  0.1156
## 
## sigma^2 estimated as 0.04702:  log likelihood=6.03
## AIC=-6.06   AICc=-5.57   BIC=-0.15
checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,2,2)
## Q* = 11.764, df = 8, p-value = 0.1621
## 
## Model df: 2.   Total lags used: 10

In the residual ACF plot, all spikes lie within the bounds; the residuals look like white noise and are close to normally distributed. The Ljung-Box test (p = 0.16) also indicates no remaining autocorrelation. The model is satisfactory.

e.

Forecast three times ahead. Check your forecasts by hand to make sure that you know how they have been calculated.

arimaFor<-forecast(fit,h=3)
arimaFor
##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2005       2.480525 2.202620 2.758430 2.055506 2.905544
## 2006       2.374890 1.985422 2.764359 1.779250 2.970531
## 2007       2.269256 1.772305 2.766206 1.509235 3.029276

We can use formula (8.3) in the textbook, substituting the estimated coefficients, and then iterate the model forward with future errors set to zero.

\((1-B)^2 y_t=(1-1.0181B+0.1470B^2)e_t\)

\(y_t=2y_{t-1}-y_{t-2}+e_t-1.0181e_{t-1}+0.1470e_{t-2}\)

years <- length(wmurders)
e <- fit$residuals
# One step ahead: the future error e[T+1] is set to zero
y1 <- 2*wmurders[years] - wmurders[years - 1] - 1.0181*e[years] + 0.1470*e[years - 1]
# Two steps ahead: only the theta_2 term with e[T] remains
y2 <- 2*y1 - wmurders[years] + 0.1470*e[years]
# Three steps ahead: all error terms are unknown and set to zero
y3 <- 2*y2 - y1
y1
## [1] 2.480523
y2
## [1] 2.374887
y3
## [1] 2.269252

f.

Create a plot of the series with forecasts and prediction intervals for the next three periods shown.

autoplot(arimaFor)

g.

Does auto.arima() give the same model you have chosen? If not, which model do you think is better?

fit2<-auto.arima(wmurders)
fit2
## Series: wmurders 
## ARIMA(1,2,1) 
## 
## Coefficients:
##           ar1      ma1
##       -0.2434  -0.8261
## s.e.   0.1553   0.1143
## 
## sigma^2 estimated as 0.04632:  log likelihood=6.44
## AIC=-6.88   AICc=-6.39   BIC=-0.97
checkresiduals(fit2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,2,1)
## Q* = 12.419, df = 8, p-value = 0.1335
## 
## Model df: 2.   Total lags used: 10

auto.arima() selects a different model, ARIMA(1,2,1). Its residuals also look like white noise, with no spikes outside the bounds. The two models are very similar, but the auto.arima() model has the lower AICc (-6.39 vs -5.57), so it seems slightly better than the model chosen by hand.