Today we learned more about ARIMA models, looking at how to work with seasonal data.

library(forecast)
library(quantmod)
library(tseries)
library(timeSeries)
library(xts)

1. Plotting data

I am going to use the Air Passenger data. We have to plot our data first. The seasonal variation needs to be roughly constant over time, and in the raw series it grows with the level of the series, so I applied a log transformation to the data.

data("AirPassengers")
is.ts(AirPassengers) # TRUE tells us the data is already a ts object, so we do not need to convert it
## [1] TRUE
plot(AirPassengers)

Air<-log(AirPassengers)
plot(Air)
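(As an aside: if you are not sure a log is the right transformation, forecast's BoxCox.lambda() will estimate one from the data. This is just a quick sketch, not something we did in class; lambda and the transformed plot are only for illustration.)

# A log transform is the Box-Cox transformation with lambda = 0, so an
# estimated lambda near 0 supports using log() here.
lambda <- BoxCox.lambda(AirPassengers)
lambda
plot(BoxCox(AirPassengers, lambda))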

Now it looks pretty good! We can see that there is definitely seasonality in our data, so we will incorporate that into our model.

Decomposing Time Series

Decomposing a time series just gives us another way to look at our data. It is kind of nice to see the trend, seasonal, and random components laid out separately.

TScomp <- decompose(Air)  # decompose the log series so the components are additive
plot(TScomp)

# looking at the data without the seasonal component
noseason <- Air - TScomp$seasonal
plot(noseason)
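decompose() uses simple moving averages and keeps the seasonal pattern fixed. As a sketch of an alternative (not covered today), base R's stl() does a loess-based decomposition; s.window = "periodic" makes it behave like decompose():

stlcomp <- stl(Air, s.window = "periodic")  # loess-based decomposition of the log series
plot(stlcomp)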

2. Differences

First we will use differences at lag S (here S = 12 for monthly data) to remove the seasonality.

diff1 <- diff(Air, lag = 12)
plot.ts(diff1)  # want mean 0 and constant variance

adf.test(diff1) # p-value > .05, so we cannot reject non-stationarity; try a second seasonal difference
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff1
## Dickey-Fuller = -3.1899, Lag order = 5, p-value = 0.09265
## alternative hypothesis: stationary
# D = 2
diff2 <- diff(diff1, lag = 12)
plot.ts(diff2)  # want mean 0 and constant variance

adf.test(diff2) # p-value is now < .05, so we reject non-stationarity: the series looks stationary
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff2
## Dickey-Fuller = -3.5215, Lag order = 4, p-value = 0.04333
## alternative hypothesis: stationary

After two seasonal differences (D = 2), the series no longer looks like it has a trend, so we will not do any additional differencing for trend (d = 0).
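As a sanity check on the differencing orders, the forecast package has nsdiffs() and ndiffs(), which estimate D and d automatically. A quick sketch; they use different tests than the ADF test above, so their suggestions may not match what we picked by hand:

nsdiffs(Air)                  # suggested number of seasonal differences (D)
ndiffs(diff(Air, lag = 12))   # suggested d after one seasonal difference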

3. ACF and PACF

ACF stands for auto-correlation function. It tells us the correlation of two points separated by a specified lag.
- By looking at the spikes at low lags, we can estimate the non-seasonal moving average term (q)
- By looking at the spikes at multiples of S, we can estimate the seasonal moving average term (Q)

PACF stands for partial auto-correlation function. It is like the ACF, but it controls for the values at shorter lags (the in-between values).
- By looking at the spikes at low lags, we can estimate the non-seasonal auto-regressive term (p)
- By looking at the spikes at multiples of S, we can estimate the seasonal auto-regressive term (P)

Acf(diff2)

Pacf(diff2)

Looking at the ACF:
- non-seasonal MA: we could use 4, maybe even 5, terms for q
- seasonal MA: Q >= 1, could be 2

Looking at the PACF:
- non-seasonal AR: p >= 1, probably 1, maybe 2
- seasonal AR: P >= 1, probably 2, maybe even 3
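To make the spikes at multiples of S easier to read, it can help to extend the number of lags shown; a quick sketch:

# With monthly data (S = 12), lag.max = 48 shows four seasonal cycles,
# so the seasonal spikes at lags 12, 24, 36, 48 stand out.
par(mfrow = c(2, 1))
Acf(diff2, lag.max = 48)
Pacf(diff2, lag.max = 48)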

4. Estimate the model

Let’s start by seeing how a fairly simple model does: ARIMA(1,0,1)(1,2,1)[12].

modair <- arima(Air, order = c(1, 0, 1), seasonal = list(order = c(1, 2, 1), period = 12))

5. Examine the Model (and maybe make new ones)

summary(modair)
## 
## Call:
## arima(x = Air, order = c(1, 0, 1), seasonal = list(order = c(1, 2, 1), period = 12))
## 
## Coefficients:
##          ar1      ma1     sar1     sma1
##       0.9303  -0.4175  -0.4505  -1.0000
## s.e.  0.0448   0.0991   0.0861   0.1038
## 
## sigma^2 estimated as 0.001415:  log likelihood = 203.55,  aic = -397.1
## 
## Training set error measures:
##                        ME       RMSE        MAE         MPE      MAPE
## Training set -0.001134586 0.03439361 0.02451314 -0.01830164 0.4413303
##                   MASE        ACF1
## Training set 0.2706075 -0.02443055

We want all the test statistics (coefficient divided by standard error) to be far from 0; anything between -2 and 2 is not significant at roughly the 5% level. It looks like in this model all the test statistics are far enough from 0.
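If you would rather compute the test statistics than eyeball them, they come straight from the fitted object (arima() stores the coefficients in $coef and their covariance matrix in $var.coef); a small sketch:

# z-statistics: coefficient divided by its standard error
modair$coef / sqrt(diag(modair$var.coef))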

Now let’s test whether our model is adequate. In this case we actually want the p-value to be high, because failing to reject the null means the model looks adequate.
H0: model is adequate (residuals are white noise)
Ha: model is not adequate

Box.test(modair$residuals, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  modair$residuals
## X-squared = 0.08775, df = 1, p-value = 0.7671

A high p-value! Our model looks adequate.
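One caveat: Box.test() defaults to lag = 1, so it only checks the first residual autocorrelation. A common refinement (a sketch, not what we did in class) is to test a block of lags and subtract the number of estimated coefficients with fitdf:

# Jointly test the first 24 residual autocorrelations; fitdf = 4 because
# modair has four estimated coefficients (ar1, ma1, sar1, sma1).
Box.test(modair$residuals, lag = 24, type = "Ljung", fitdf = 4)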

This model is not bad, but let's see if we can make it better. We saw from the ACF that we could try more terms for q.

modair2 <- arima(Air, order = c(1, 0, 2), seasonal = list(order = c(1, 2, 1), period = 12))
summary(modair2)
## 
## Call:
## arima(x = Air, order = c(1, 0, 2), seasonal = list(order = c(1, 2, 1), period = 12))
## 
## Coefficients:
##          ar1      ma1     ma2     sar1     sma1
##       0.9270  -0.4199  0.0169  -0.4508  -1.0000
## s.e.  0.0492   0.1035  0.0854   0.0863   0.1039
## 
## sigma^2 estimated as 0.001414:  log likelihood = 203.57,  aic = -395.14
## 
## Training set error measures:
##                        ME       RMSE        MAE         MPE      MAPE
## Training set -0.001183983 0.03437866 0.02447487 -0.01915085 0.4406654
##                   MASE        ACF1
## Training set 0.2701851 -0.01756902
Box.test(modair2$residuals, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  modair2$residuals
## X-squared = 0.045381, df = 1, p-value = 0.8313

The test statistic for the second moving average term (ma2) is too close to 0, so it is not significant and we are better off not including it.

Back to the simpler model!

AIC(modair,modair2)
##         df       AIC
## modair   5 -397.1033
## modair2  6 -395.1425

We can see that according to AIC (lower is better) the simpler model is better anyway.

Auto-ARIMA

automod <- auto.arima(Air)
automod 
## Series: Air 
## ARIMA(0,1,1)(0,1,1)[12] 
## 
## Coefficients:
##           ma1     sma1
##       -0.4018  -0.5569
## s.e.   0.0896   0.0731
## 
## sigma^2 estimated as 0.001371:  log likelihood=244.7
## AIC=-483.4   AICc=-483.21   BIC=-474.77
Box.test(automod$residuals, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  automod$residuals
## X-squared = 0.030651, df = 1, p-value = 0.861

The model it gave us is not too far off from what I had found!

AIC(automod,modair)
## Warning in AIC.default(automod, modair): models are not all fitted to the
## same number of observations
##         df       AIC
## automod  3 -483.3991
## modair   5 -397.1033

The automod looks much better than mine by AIC, though the warning reminds us the two models are not fitted to the same number of observations (they use different amounts of differencing), so the AIC values are not strictly comparable.
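By default auto.arima() uses a stepwise search with likelihood approximations to keep things fast, so it can miss models. If you want it to look harder, you can turn the shortcuts off; a sketch (automod2 is my name, and this can be slow):

automod2 <- auto.arima(Air, stepwise = FALSE, approximation = FALSE)
automod2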

Forecasting

I tried some forecasting using both the automod and my model.

par(mfrow=c(2,1))
plot(forecast(automod, h = 12), main = "auto.arima")
plot(forecast(modair, h = 12), main = "mine")

Pretty close huh!
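To compare the forecasts with numbers instead of just plots, one option is a holdout test: fit on everything except the last year and score the 12-month-ahead forecasts against what actually happened. A sketch, assuming the usual 1949-1960 span of AirPassengers (train, test, fit1, fit2 are my names):

# Hold out 1960, refit both specifications on 1949-1959, then compare
# out-of-sample accuracy with forecast::accuracy().
train <- window(Air, end = c(1959, 12))
test  <- window(Air, start = c(1960, 1))
fit1 <- auto.arima(train)
fit2 <- arima(train, order = c(1, 0, 1),
              seasonal = list(order = c(1, 2, 1), period = 12))
accuracy(forecast(fit1, h = 12), test)
accuracy(forecast(fit2, h = 12), test)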

Overview

Today we went more in depth. It was interesting to see how you determine p, d, and q by hand instead of just using auto.arima, though auto.arima is very nice to use most of the time. It was also good to learn about working with seasonal data; so much data includes seasonality that it is important to know how to handle it.