library(forecast)
## Warning: package 'forecast' was built under R version 3.4.4
library(quantmod)
## Warning: package 'quantmod' was built under R version 3.4.4
## Loading required package: xts
## Warning: package 'xts' was built under R version 3.4.4
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.4.3
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: TTR
## Warning: package 'TTR' was built under R version 3.4.3
## Version 0.4-0 included new data defaults. See ?getSymbols.
library(tseries)
## Warning: package 'tseries' was built under R version 3.4.4
library(timeSeries)
## Warning: package 'timeSeries' was built under R version 3.4.4
## Loading required package: timeDate
## Warning: package 'timeDate' was built under R version 3.4.3
## 
## Attaching package: 'timeSeries'
## The following object is masked from 'package:zoo':
## 
##     time<-
library(forecast)
library(xts)

Data

We are using the AirPassenger data to find our own ARIMA model. First, we have to look to make sure there is no trend in our data–we don’t want the distance between peaks and pits growing over time.

plot.ts(AirPassengers)

This looks bad, so we can try a log transformation.

plot.ts(log(AirPassengers))

Filter

Now, we can filter our data so that we can see the seasonality.

plot.ts(filter(log(AirPassengers), sides=2, filter=rep(1/3,3)))

Differencing

It looks like there is a yearly trend, so we will try differencing with 12.

diff1 <- diff(log(AirPassengers), lag = 12)
plot.ts(diff1) 

adf.test(diff1)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff1
## Dickey-Fuller = -3.1899, Lag order = 5, p-value = 0.09265
## alternative hypothesis: stationary

Because our adf test has a small p-value, we want to have a D of 1. Looking at the plot, there is not really a trend any more, so d will be 0.

Conclusions on differencing: D=1, d=0.

ACF

If there are spikes at the beginning, then we want to have q equal a number. If there are spikes every multiple, then we want Q to equal a number.

Acf(diff1)

Based on this information, we want a q of 1 or 2. We want a Q of 1.

Conclusion: q=1 or 2, Q=1

PACF

We can use PACF for the autoregressive component. If there are spikes at the beginning, then p will not be 0. If there are spikes at multiples, then P will not be zero.

Pacf(diff1)

Based on this graph, I would say p=1 and P=0.

Conclusion: p=1, P=0.

Estimating Model

Based on our testing earlier, we will run an ARIMA of (1, 0, 1) x (0, 1, 1) [12]

mod1 <- arima(log(AirPassengers), order = c(1,0,1), season = list( order = c( 0,1,1), period=12))
summary(mod1) 
## 
## Call:
## arima(x = log(AirPassengers), order = c(1, 0, 1), seasonal = list(order = c(0, 
##     1, 1), period = 12))
## 
## Coefficients:
##          ar1      ma1     sma1
##       0.9959  -0.3994  -0.5526
## s.e.  0.0050   0.0898   0.0741
## 
## sigma^2 estimated as 0.001345:  log likelihood = 245.61,  aic = -483.22
## 
## Training set error measures:
##                       ME       RMSE        MAE        MPE      MAPE
## Training set 0.003012636 0.03514401 0.02638549 0.05667535 0.4783111
##                   MASE       ACF1
## Training set 0.2912769 0.01751614
Box.test(mod1$residuals, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  mod1$residuals
## X-squared = 0.045108, df = 1, p-value = 0.8318

We want to check that none of our test stats are over 2; if they are over 2 then that will make our p-values too big. We can do this manually; by taking \(\theta\) / SE(\(\theta\)) or \(\phi\) / SE(\(\phi\)). However, we can use the Ljung-Box funtion instead. Which we see at the bottom of this output. It tells us that our p-value is 0.8381. This is good, that means that we have no evidence that our model is inadequate.

If we wanted to do it by hand, we can. We will show an example with ar1.

(teststat <- 0.9959/0.0050)
## [1] 199.18
2*pt(abs(teststat), df = 144-3, lower.tail = FALSE)
## [1] 1.109161e-174

This p-value is small, so we know it should stay in the model.

Even though this model is good, we might as well try adding more components to see what we get.

mod2 <-arima(log(AirPassengers), order = c(2,1,1), season = list( order = c(1, 1, 2), period = 12))
summary(mod2)
## 
## Call:
## arima(x = log(AirPassengers), order = c(2, 1, 1), seasonal = list(order = c(1, 
##     1, 2), period = 12))
## 
## Coefficients:
##          ar1     ar2      ma1     sar1    sma1     sma2
##       0.5467  0.2597  -0.9654  -0.9515  0.3821  -0.4916
## s.e.  0.0965  0.0954   0.0472   0.2510  0.2992   0.1984
## 
## sigma^2 estimated as 0.001293:  log likelihood = 246.52,  aic = -479.03
## 
## Training set error measures:
##                         ME       RMSE        MAE         MPE      MAPE
## Training set -0.0007550448 0.03432059 0.02519184 -0.01036797 0.4566572
##                   MASE         ACF1
## Training set 0.2780999 -0.005528303
Box.test(mod2$residuals, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  mod2$residuals
## X-squared = 0.0044933, df = 1, p-value = 0.9466
(teststat <- .3821/.2992)
## [1] 1.277072
2*pt(abs(teststat), df = 144-3, lower.tail = FALSE)
## [1] 0.2036758

This p-value is too big, so we will take this out of our model. After we fit it, we can compare it to our original one.

mod3 <-arima(log(AirPassengers), order = c(2,1,1), season = list( order = c(1, 1, 1), period = 12))
summary(mod3)
## 
## Call:
## arima(x = log(AirPassengers), order = c(2, 1, 1), seasonal = list(order = c(1, 
##     1, 1), period = 12))
## 
## Coefficients:
##          ar1     ar2      ma1     sar1     sma1
##       0.5552  0.2530  -0.9653  -0.0598  -0.5168
## s.e.  0.0956  0.0949   0.0466   0.1551   0.1367
## 
## sigma^2 estimated as 0.001305:  log likelihood = 246.21,  aic = -480.42
## 
## Training set error measures:
##                         ME      RMSE        MAE         MPE      MAPE
## Training set -0.0007815228 0.0344809 0.02534988 -0.01083164 0.4593521
##                   MASE         ACF1
## Training set 0.2798445 -0.002774659
Box.test(mod3$residuals, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  mod3$residuals
## X-squared = 0.0011319, df = 1, p-value = 0.9732

Now that the new model is set up, we can use AIC to compare them.

AIC(mod1)
## [1] -483.2233
AIC(mod3)
## [1] -480.4212

Obviously, mod3 is much smaller.

In conclusion, our hand-built ARIMA model is (2, 1, 1) x (1, 1, 1,) [12].

Auto ARIMA

There is a function in R that can auto fit an ARIMA model. We will compare the one we built to the one that R gives us.

(autoMod <- auto.arima(log(AirPassengers)))
## Series: log(AirPassengers) 
## ARIMA(0,1,1)(0,1,1)[12] 
## 
## Coefficients:
##           ma1     sma1
##       -0.4018  -0.5569
## s.e.   0.0896   0.0731
## 
## sigma^2 estimated as 0.001371:  log likelihood=244.7
## AIC=-483.4   AICc=-483.21   BIC=-474.77

R’s is a little difference in the autoregressive section. We can use AIC to compare which model is better.

AIC(mod3)
## [1] -480.4212
AIC(autoMod)
## [1] -483.3991

They are quite close, so we want to go with the smaller model. Our final decision should be to use the model that R gave us.

Forecasting

The last subject we will use with our ARIMA models is to forecast the data into the future.

plot(forecast(autoMod, h = 12), main = "auto.arima")

plot(forecast(mod3, h = 12), main = "home-cooked")

When clicking between the two, we can’t find too much of a difference.