Figure 8.24 shows the ACFs for 36 random numbers, 360 random numbers and for 1,000 random numbers.
- Explain the differences among these figures. Do they all indicate the data are white noise?
Answer
These figures show the autocorrelation of each series at different lags (shown on the x axis). The y axis (the correlation) has the same scale in each plot, but the x axis shows an increasing number of lags as the series gets longer.
If the data are white noise (random), then we expect the autocorrelations to stay within the blue dashed lines, which mark the bounds beyond which a spike would be considered significant.
In all three plots the autocorrelations lie within these bounds; no spike exceeds the critical-value range, so each series is consistent with white noise.
- Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?
Answer
The formula for the critical values is \(\pm 1.96/\sqrt{T-d}\), where \(T\) is the sample size and \(d\) is the amount of differencing used. As the sample size \(T\) increases, the critical values get smaller, which is why the critical-value band narrows from one figure to the next as the series gets longer. The sample autocorrelations themselves differ between the figures because each is computed from a different random sample; for white noise the estimated autocorrelations are random quantities that shrink towards zero only as the sample size grows.
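As a quick numerical check, the three critical-value bounds can be computed directly; a minimal sketch (with d = 0, since no differencing was applied):
T <- c(36, 360, 1000)
round(1.96 / sqrt(T), 3)   # approximately 0.327, 0.103 and 0.062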
A classic example of a non-stationary series is the daily closing IBM stock prices (data set ibmclose). Use R to plot the daily closing prices for IBM stock and the ACF and PACF. Explain how each plot shows the series is non-stationary and should be differenced.
library(fma) # for ibmclose data
## Warning: package 'fma' was built under R version 3.4.4
## Loading required package: forecast
## Warning: package 'forecast' was built under R version 3.4.3
plot(ibmclose)
par(mfrow=c(1,2))
plot(acf(ibmclose))
plot(pacf(ibmclose))
The time plot shows a clear trend throughout the series. The ACF shows significant autocorrelations at every lag displayed, decaying only very slowly, so the data should be differenced to remove this autocorrelation.
Such a slowly decaying ACF is a sign of a non-stationary, autoregressive-like series. The PACF, with a single large spike at lag 1, suggests the series behaves like an AR(1) process with a coefficient near one, and one difference should be enough to make it stationary.
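A quick automated check with ndiffs() (from the forecast package, which fma loads) can be used to confirm how many first differences are needed; a minimal sketch:
ndiffs(ibmclose)           # suggested number of first differences
tsdisplay(diff(ibmclose))  # the differenced series should look much closer to white noise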
For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.
a. usnetelec b. usgdp c. mcopper d. enplanements e. visitors
library(expsmooth) # for data import
## Warning: package 'expsmooth' was built under R version 3.4.4
plot(usnetelec); plot(BoxCox(usnetelec,lambda=1/2)) # Square-root transformation
plot(usgdp); plot(BoxCox(usgdp,lambda=0)) # Log transformation
plot(mcopper); plot(BoxCox(mcopper,lambda=1/3)) # cube root transformation
plot(enplanements); plot(BoxCox(enplanements,lambda=0)) # Log transformation
plot(visitors); plot(BoxCox(visitors,lambda=1/3)) # cube root transformation
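The lambda values above were chosen by eye; as a cross-check, BoxCox.lambda() (forecast package) estimates lambda automatically. A minimal sketch:
# Estimated Box-Cox lambda for each series (Guerrero method by default)
sapply(list(usnetelec = usnetelec, usgdp = usgdp, mcopper = mcopper,
            enplanements = enplanements, visitors = visitors), BoxCox.lambda)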
For the enplanements data, write down the differences you chose above using backshift operator notation.
acf(enplanements)
pacf(enplanements)
acf(diff(enplanements))   # lag-1 (first) difference
pacf(diff(enplanements))
There is clearly a seasonal component in this series, which is also visible in the ACF plot. After taking a lag-1 (first) difference, most of the autocorrelation has been removed. In backshift notation, the first difference of the logged series chosen above is \((1 - B)\log y_t\); removing the remaining seasonality with a seasonal difference as well would give \((1 - B)(1 - B^{12})\log y_t\).
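The required orders of differencing can also be checked programmatically; a sketch using nsdiffs() and ndiffs() from the forecast package:
nsdiffs(log(enplanements))                 # seasonal differences suggested
ndiffs(diff(log(enplanements), lag = 12))  # first differences needed after seasonal differencing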
Use R to simulate and plot some data from simple ARIMA models.
- Use the following R code to generate data from an AR(1) model with \(\phi_1 = 0.6\) and \(\sigma^2 = 1\). The process starts with \(y_0 = 0\). R code:
y <- ts(numeric(100))
e <- rnorm(100)
for(i in 2:100)
y[i] <- 0.6*y[i-1] + e[i]
set.seed(25)
y <- ts(numeric(100))
e <- rnorm(100)
for(i in 2:100)
y[i] <- 0.6*y[i-1] + e[i]
plot(y)
- Produce a time plot for the series. How does the plot change as you change \(\phi_1\)?
for(i in 2:100)
y[i] <- 0.3*y[i-1] + e[i]
plot(y)
for(i in 2:100)
y[i] <- 0.8*y[i-1] + e[i]
plot(y)
- Write your own code to generate data from an MA(1) model with \(\theta_1 = 0.6\) and \(\sigma^2 = 1\).
theta=0.6
for(i in 2:100)
y[i] <- theta*e[i-1] + e[i]
plot(y)
- Produce a time plot for the series. How does the plot change as you change \(\theta_1\)?
theta=0.8
for(i in 2:100)
y[i] <- theta*e[i-1] + e[i]
plot(y)
theta=0.2
for(i in 2:100)
y[i] <- theta*e[i-1] + e[i]
plot(y)
- Generate data from an ARMA(1,1) model with \(\phi_1 = 0.6\), \(\theta_1 = 0.6\) and \(\sigma^2 = 1\).
phi=0.6 ; theta=0.6
for(i in 2:100)
y[i] <- phi*y[i-1] + theta*e[i-1] + e[i]
plot(y)
- Generate data from an AR(2) model with \(\phi_1 = -0.8\), \(\phi_2 = 0.3\) and \(\sigma^2 = 1\). (Note that these parameters will give a non-stationary series.)
phi1=-0.8 ; phi2=0.3
y <- ts(numeric(100))   # reset y before simulating the AR(2) process
for(i in 3:100)
y[i] <- phi1*y[i-1] + phi2*y[i-2] + e[i]
plot(y)
- Graph the latter two series and compare them.
Answer
The plots show that the ARMA(1,1) model produces a stable, stationary-looking series, whereas the AR(2) series with these parameters oscillates with ever-increasing variation, confirming that it is non-stationary.
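Because the loops above keep overwriting y, a small sketch that stores both series (reusing e from the set.seed(25) draw) makes the side-by-side comparison easier:
y.arma <- ts(numeric(100)); y.ar2 <- ts(numeric(100))
for(i in 2:100) y.arma[i] <- 0.6*y.arma[i-1] + 0.6*e[i-1] + e[i]
for(i in 3:100) y.ar2[i] <- -0.8*y.ar2[i-1] + 0.3*y.ar2[i-2] + e[i]
par(mfrow=c(1,2))
plot(y.arma, main="ARMA(1,1)")
plot(y.ar2, main="AR(2), non-stationary")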
Consider the number of women murdered each year (per 100,000 standard population) in the United States (data set wmurders).
library(fpp)
## Warning: package 'fpp' was built under R version 3.4.4
## Loading required package: lmtest
## Warning: package 'lmtest' was built under R version 3.4.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.4.2
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: tseries
## Warning: package 'tseries' was built under R version 3.4.3
data(wmurders)
tsdisplay(wmurders)
- By studying appropriate graphs of the series in R, find an appropriate ARIMA(p,d,q) model for these data.
This is clearly not stationary. Let’s start with taking the first difference.
dif1.wmurders <- diff(wmurders)
tsdisplay(dif1.wmurders)
This looks much better, but the ACF and PACF still show some significant spikes at a lag of two. Let’s try a unit root test.
adf.test(dif1.wmurders)
##
## Augmented Dickey-Fuller Test
##
## data: dif1.wmurders
## Dickey-Fuller = -3.7688, Lag order = 3, p-value = 0.02726
## alternative hypothesis: stationary
kpss.test(dif1.wmurders)
##
## KPSS Test for Level Stationarity
##
## data: dif1.wmurders
## KPSS Level = 0.58729, Truncation lag parameter = 1, p-value =
## 0.02379
These tests are telling us different things: an ADF p-value \(< 0.05\) indicates a stationary series, but a KPSS p-value \(< 0.05\) indicates a non-stationary series. Let's take another difference and see what happens.
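A quick automated cross-check with ndiffs() (which uses the KPSS test by default) also suggests how many first differences are needed; a minimal sketch:
ndiffs(wmurders)   # suggested number of first differences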
wmurders.d2 <- diff(diff(wmurders))   # second-order difference, i.e. diff(wmurders, differences = 2)
tsdisplay(diff(wmurders,2))           # note: diff(x, 2) is a lag-2 difference, not a second-order difference
# unit root tests
adf.test(diff(wmurders,2))
##
## Augmented Dickey-Fuller Test
##
## data: diff(wmurders, 2)
## Dickey-Fuller = -3.9819, Lag order = 3, p-value = 0.01705
## alternative hypothesis: stationary
kpss.test(diff(wmurders,2))
##
## KPSS Test for Level Stationarity
##
## data: diff(wmurders, 2)
## KPSS Level = 0.72551, Truncation lag parameter = 1, p-value =
## 0.01123
Now both unit root tests indicate a stationary series. In the ACF and PACF plots, the large spike at lag 1 tells us we need either p or q to be 1.
ar.fit<-auto.arima(wmurders)
summary(ar.fit)
## Series: wmurders
## ARIMA(1,2,1)
##
## Coefficients:
## ar1 ma1
## -0.2434 -0.8261
## s.e. 0.1553 0.1143
##
## sigma^2 estimated as 0.04632: log likelihood=6.44
## AIC=-6.88 AICc=-6.39 BIC=-0.97
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.01065956 0.2072523 0.1528734 -0.2149476 4.335214 0.9400996
## ACF1
## Training set 0.02176343
As anticipated, auto.arima() selects an ARIMA(1,2,1) model, the candidate with the lowest AICc.
tsdisplay(residuals(ar.fit))
# portmanteau test - tests whether the residuals are white noise
# Box.test(residuals(ar.fit), lag=24, fitdf=2, type="Ljung")
- Should you include a constant in the model? Explain.
Answer
No. With two orders of differencing (d = 2), including a constant would make the long-run forecasts follow a quadratic trend, which is not appropriate for these data.
- Write this model in terms of the backshift operator.
Answer
The fitted ARIMA(1,2,1) model in backshift notation is \((1 - \phi_1 B)(1 - B)^2 y_t = (1 + \theta_1 B)\varepsilon_t\), with \(\phi_1 = -0.2434\) and \(\theta_1 = -0.8261\).
- Fit the model using R and examine the residuals. Is the model satisfactory?
The model was fitted in part (a) above; the residual plots from tsdisplay(residuals(ar.fit)) show no significant autocorrelation, so the model appears satisfactory.
- Forecast three times ahead. Check your forecasts by hand to make sure you know how they have been calculated.
forecast(ar.fit,h=3)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2005 2.470660 2.194836 2.746484 2.048824 2.892496
## 2006 2.363106 1.986351 2.739862 1.786908 2.939304
## 2007 2.252833 1.765391 2.740276 1.507354 2.998313
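To check the first forecast by hand, the ARIMA(1,2,1) model can be expanded as \(y_t = (2+\phi_1)y_{t-1} - (1+2\phi_1)y_{t-2} + \phi_1 y_{t-3} + \varepsilon_t + \theta_1\varepsilon_{t-1}\). A minimal sketch using the fitted coefficients (assuming ar.fit and wmurders from above):
phi <- coef(ar.fit)["ar1"]; theta <- coef(ar.fit)["ma1"]
yw <- as.numeric(wmurders); n <- length(yw)
res <- as.numeric(residuals(ar.fit))
# one-step forecast: future errors are zero, the last residual feeds the MA term
(2 + phi)*yw[n] - (1 + 2*phi)*yw[n-1] + phi*yw[n-2] + theta*res[n]
# this should reproduce the 2005 point forecast shown above (about 2.47)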
- Create a plot of the series with forecasts and prediction intervals for the next three periods shown.
plot(forecast(ar.fit,h=3))
- Does auto.arima give the same model you have chosen? If not, which model do you think is better?
Answer
I used auto.arima() to build the model above, and it is essentially the model anticipated from the analysis of the series in part (a).
Consider the quarterly number of international tourists to Australia for the period 1999–2010. (Data set austourists.)
library(fpp)
data(austourists)
- Describe the time plot.
tsdisplay(austourists)
- What can you learn from the ACF graph?
Answer
The ACF of the quarterly Australian tourist data decays exponentially with lag. There is also clear seasonality, with a peak every first quarter followed by a dip two quarters later.
- What can you learn from the PACF graph?
Answer
The PACF of the quarterly Australian tourist data shows significant dependence on only about three past observations, which may be removed with differencing.
par(mfrow=c(1,2))
acf(diff(austourists))
pacf(diff(austourists))
ndiffs(austourists)
## [1] 1
# freq of ts = 4 here
par(mfrow=c(1,2))
acf(diff(austourists, differences=4))    # a fourth-order difference, not a seasonal difference
pacf(diff(austourists, differences=4))
Taking four successive first differences (differences = 4) does not improve the plots; what the series needs is a seasonal difference (lag = 4).
- Produce plots of the seasonally differenced data (1−B4)Yt. What model do these graphs suggest?
plot(diff(austourists, lag=4, differences=1))
par(mfrow=c(1,2))
acf(diff(austourists,lag=4, differences=1))
pacf(diff(austourists,lag=4, differences=1))
Now only a p = 1 or q = 1 relationship appears to be left in the seasonally differenced series.
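A minimal sketch fitting both candidate models (seasonal difference plus either p = 1 or q = 1), so their AICc values can be compared with auto.arima() in the next part:
fit.ar <- Arima(austourists, order=c(1,0,0), seasonal=c(0,1,0), include.drift=TRUE)
fit.ma <- Arima(austourists, order=c(0,0,1), seasonal=c(0,1,0), include.drift=TRUE)
c(AICc.ar = fit.ar$aicc, AICc.ma = fit.ma$aicc)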
- Does auto.arima give the same model that you chose? If not, which model do you think is better?
auto.arima(austourists)
## Series: austourists
## ARIMA(1,0,0)(1,1,0)[4] with drift
##
## Coefficients:
## ar1 sar1 drift
## 0.4493 -0.5012 0.4665
## s.e. 0.1368 0.1293 0.1055
##
## sigma^2 estimated as 5.606: log likelihood=-99.47
## AIC=206.95 AICc=207.97 BIC=214.09
auto.arima() chose a seasonal ARIMA model, as anticipated: ARIMA(1,0,0)(1,1,0)[4] with drift.
- Write the model in terms of the backshift operator, and then without using the backshift operator.
Answer
In backshift notation the fitted model is \((1 - \phi_1 B)(1 - \Phi_1 B^{4})(1 - B^{4})y_t = c + \varepsilon_t\), with \(\phi_1 = 0.4493\), \(\Phi_1 = -0.5012\) and \(c\) the drift term. Expanding the backshift polynomials gives the model without the backshift operator: \(y_t = c + \phi_1 y_{t-1} + (1 + \Phi_1)y_{t-4} - \phi_1(1 + \Phi_1)y_{t-5} - \Phi_1 y_{t-8} + \phi_1\Phi_1 y_{t-9} + \varepsilon_t\).
Consider the total net generation of electricity (in billion kilowatt hours) by the U.S. electric industry (monthly for the period 1985–1996). (Data set usmelec.) In general there are two peaks per year: in mid-summer and mid-winter.
- Examine the 12-month moving average of this series to see what kind of trend is involved.
data(usmelec)
movAvg <- ma(usmelec, order=12)
plot(usmelec, col='gray', main="Electricity Net Generation",
ylab="Billions of kilowatt hours (kWh)")
lines(movAvg)
There is an increase in electricity generation over time, and there is also a strong seasonal component.
- Do the data need transforming? If so, find a suitable transformation.
There is an increase in variance over time. Hence, it needs to be transformed suitably to stabilize the variance.
lambda <- BoxCox.lambda(usmelec)
netusage <- BoxCox(usmelec, lambda=lambda)
plot(netusage)
plot(log(usmelec) - log(movAvg)) # Log transformation
plot(netusage - BoxCox(movAvg, lambda=lambda)) # Box-Cox transformation
Looking closely (especially at the y-axis scale), the Box-Cox transformation stabilizes the variance the most.
- Are the data stationary? If not, find an appropriate differencing which yields stationary data.
No, the data are not stationary, and they are strongly seasonal, so some differencing is required.
tsdisplay(diff(netusage, 1))
tsdisplay(diff(netusage, lag=12,differences = 1))
tsdisplay(diff(diff(netusage, lag=12,differences = 1)))
The first difference of the seasonally differenced series looks the most stationary.
adf.test(diff(netusage, 12))
## Warning in adf.test(diff(netusage, 12)): p-value smaller than printed p-
## value
##
## Augmented Dickey-Fuller Test
##
## data: diff(netusage, 12)
## Dickey-Fuller = -5.054, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
kpss.test(diff(netusage, 12))
## Warning in kpss.test(diff(netusage, 12)): p-value smaller than printed p-
## value
##
## KPSS Test for Level Stationarity
##
## data: diff(netusage, 12)
## KPSS Level = 0.80167, Truncation lag parameter = 4, p-value = 0.01
adf.test(diff(diff(netusage, 12)))
## Warning in adf.test(diff(diff(netusage, 12))): p-value smaller than printed
## p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(diff(netusage, 12))
## Dickey-Fuller = -10.551, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
kpss.test(diff(diff(netusage, 12)))
## Warning in kpss.test(diff(diff(netusage, 12))): p-value greater than
## printed p-value
##
## KPSS Test for Level Stationarity
##
## data: diff(diff(netusage, 12))
## KPSS Level = 0.012984, Truncation lag parameter = 4, p-value = 0.1
Both unit root tests now suggest that the series is stationary after a seasonal difference followed by a first difference.
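The same conclusion can be reached with the automated helpers nsdiffs() and ndiffs(); a sketch:
nsdiffs(netusage)                # seasonal differences suggested
ndiffs(diff(netusage, lag=12))   # first differences needed after the seasonal difference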
- Identify a couple of ARIMA models that might be useful in describing the time series. Which of your models is the best according to their AIC values?
# Fit one ARIMA model on the Box-Cox scale (lambda estimated above) and return its AICc
test.arima <- function(t.series, params){
  order <- as.numeric(params[1:3])      # (p, d, q)
  seasonal <- as.numeric(params[4:6])   # (P, D, Q)
  df <- data.frame(model=paste0("ARIMA(",
                                paste0(params, collapse=","),
                                ")"),
                   AICc=Arima(t.series,
                              order=order,
                              seasonal=seasonal,
                              lambda=lambda)$aicc)
  return(df)
}
# Grid of candidate (p,d,q)(P,D,Q) orders to compare by AICc
gridSearch <- expand.grid(c(0, 1, 2), #p
                          c(2), #d
                          c(0, 1), #q
                          c(1, 2), #P
                          c(1), #D
                          c(0, 1) #Q
                          )
df.list <- apply(gridSearch, MARGIN=1, FUN=function(x) {test.arima(usmelec, x)})
df <- do.call(rbind, df.list)
df
## model AICc
## 1 ARIMA(0,2,0,1,1,0) -3607.482
## 2 ARIMA(1,2,0,1,1,0) -3754.593
## 3 ARIMA(2,2,0,1,1,0) -3876.260
## 4 ARIMA(0,2,1,1,1,0) -4007.760
## 5 ARIMA(1,2,1,1,1,0) -4041.012
## 6 ARIMA(2,2,1,1,1,0) -4075.363
## 7 ARIMA(0,2,0,2,1,0) -3657.899
## 8 ARIMA(1,2,0,2,1,0) -3811.595
## 9 ARIMA(2,2,0,2,1,0) -3918.264
## 10 ARIMA(0,2,1,2,1,0) -4057.933
## 11 ARIMA(1,2,1,2,1,0) -4091.944
## 12 ARIMA(2,2,1,2,1,0) -4120.432
## 13 ARIMA(0,2,0,1,1,1) -3750.161
## 14 ARIMA(1,2,0,1,1,1) -3900.239
## 15 ARIMA(2,2,0,1,1,1) -3996.371
## 16 ARIMA(0,2,1,1,1,1) -4137.666
## 17 ARIMA(1,2,1,1,1,1) -4165.397
## 18 ARIMA(2,2,1,1,1,1) -4188.174
## 19 ARIMA(0,2,0,2,1,1) -3752.429
## 20 ARIMA(1,2,0,2,1,1) -3904.609
## 21 ARIMA(2,2,0,2,1,1) -3996.740
## 22 ARIMA(0,2,1,2,1,1) -4140.124
## 23 ARIMA(1,2,1,2,1,1) -4168.028
## 24 ARIMA(2,2,1,2,1,1) -4189.558
By AICc, the best model appears to be ARIMA(2,2,1)(2,1,1)[12], which has the lowest (most negative) AICc in the table above.
- Estimate the parameters of your best model and do diagnostic testing on the residuals. Do the residuals resemble white noise? If not, try to find another ARIMA model which fits better.
fit <- Arima(usmelec, order=c(2, 2, 1),
seasonal=c(2, 1, 1),
lambda=lambda)
tsdisplay(fit$residuals)
adf.test(fit$residuals)
## Warning in adf.test(fit$residuals): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: fit$residuals
## Dickey-Fuller = -10.743, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
kpss.test(fit$residuals)
## Warning in kpss.test(fit$residuals): p-value greater than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: fit$residuals
## KPSS Level = 0.033955, Truncation lag parameter = 4, p-value = 0.1
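The unit root tests only address stationarity of the residuals; a Ljung-Box portmanteau test is the more direct check that they resemble white noise. A sketch (fitdf = 6 accounts for the 2 + 1 + 2 + 1 estimated ARMA coefficients):
Box.test(residuals(fit), lag=24, fitdf=6, type="Ljung-Box")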
- Forecast the next 15 years of generation of electricity by the U.S. electric industry. Get the latest figures from http://data.is/zgRWCO to check on the accuracy of your forecasts.
downloaded <- read.csv('electricity-overview.csv')
names(downloaded) <- c("month", "elec")
actual.ts <- ts(downloaded$elec, start=c(1973, 1), frequency = 12)
plot(actual.ts)
fcast <- forecast(fit, h=15*12)
plot(fcast)
lines(actual.ts, col='red')
- How many years of forecasts do you think are sufficiently accurate to be usable?
Answer
Judging from the width of the prediction intervals, forecasts for roughly the next five years may be usable; beyond that they become too uncertain.
For the mcopper data:
- if necessary, find a suitable Box-Cox transformation for the data;
- fit a suitable ARIMA model to the transformed data using auto.arima();
- try some other plausible models by experimenting with the orders chosen;
- choose what you think is the best model and check the residual diagnostics;
- produce forecasts of your fitted model. Do the forecasts look reasonable?
- compare the results with what you would obtain using ets() (with no transformation).
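No worked answer is included for this exercise; a minimal sketch of the steps, assuming the fpp and forecast packages are loaded (the ARIMA(0,1,1) alternative is only an illustrative choice, not a recommendation):
lambda.mc <- BoxCox.lambda(mcopper)                            # (a) transformation
fit.auto  <- auto.arima(mcopper, lambda=lambda.mc)             # (b) automatic ARIMA fit
fit.alt   <- Arima(mcopper, order=c(0,1,1), lambda=lambda.mc)  # (c) a plausible alternative order
c(fit.auto$aicc, fit.alt$aicc)                                 # (d) keep the lower AICc, then check residuals
tsdisplay(residuals(fit.auto))
plot(forecast(fit.auto, h=24))                                 # (e) do the forecasts look reasonable?
plot(forecast(ets(mcopper), h=24))                             # (f) compare with ets() on the raw data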
Choose one of the following seasonal time series: condmilk, hsales, uselec
- Do the data need transforming? If so, find a suitable transformation.
- Are the data stationary? If not, find an appropriate differencing which yields stationary data.
- Identify a couple of ARIMA models that might be useful in describing the time series. Which of your models is the best according to their AIC values?
- Estimate the parameters of your best model and do diagnostic testing on the residuals. Do the residuals resemble white noise? If not, try to find another ARIMA model which fits better.
- Forecast the next 24 months of data using your preferred model.
- Compare the forecasts obtained using ets().
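No worked answer is included here either; a sketch for one of the series (hsales, chosen purely for illustration), assuming the fma and forecast packages are loaded:
lambda.h <- BoxCox.lambda(hsales)                    # transformation, if needed
nsdiffs(hsales); ndiffs(diff(hsales, lag=12))        # differencing required
fit.h <- auto.arima(hsales, lambda=lambda.h)         # one candidate; try nearby orders with Arima()
tsdisplay(residuals(fit.h))                          # residual diagnostics
plot(forecast(fit.h, h=24))                          # 24-month forecasts
plot(forecast(ets(hsales), h=24))                    # compare with ets()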
For the same time series you used in exercise Q10, try using a non-seasonal model applied to the seasonally adjusted data obtained from STL. The stlf() function will make the calculations easy (with method=“arima”). Compare the forecasts with those obtained in exercise Q10. Which do you think is the best approach?
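A sketch of the STL-plus-ARIMA approach for the same series (hsales, as above), using stlf() with method="arima":
fc.stl <- stlf(hsales, method="arima", h=24)   # non-seasonal ARIMA fitted to the seasonally adjusted series
plot(fc.stl)
accuracy(fc.stl)                               # compare in-sample accuracy with the Q10 model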