In Thursday’s class, we continued talking about ARIMA models, but this time we looked at adding seasonality. We do this when we see a seasonal pattern in our data. The form for this type of model is ARIMA(p,d,q) x (P,D,Q)S, where lowercase p, d, and q are the nonseasonal AR, differencing, and MA orders, and capital P, D, and Q are the corresponding seasonal orders. S is the length of one seasonal cycle; for example, S = 12 means monthly data with a yearly seasonal pattern.
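In backshift notation (with \(B y_t = y_{t-1}\)), this model can be written compactly as
\[
\phi_p(B)\,\Phi_P(B^S)\,(1-B)^d\,(1-B^S)^D\, y_t = \theta_q(B)\,\Theta_Q(B^S)\,\varepsilon_t,
\]
where \(\phi_p\) and \(\theta_q\) are the nonseasonal AR and MA polynomials, \(\Phi_P\) and \(\Theta_Q\) are their seasonal counterparts, and \(\varepsilon_t\) is white noise.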
We will now look at an example of how to create a seasonal ARIMA model by hand instead of using the auto.arima() function.
library(forecast)
library(quantmod)
library(tseries)
library(timeSeries)
library(xts)
data(wineind)
We are going to use the wineind dataset of Australian total wine sales from January 1980 to August 1994; we drop the first two years of observations before modeling.
We begin by looking at a plot of our data.
wineind <- wineind[-c(1:24)]  # drop the first 24 months (1980-1981)
wineind <- ts(wineind, frequency = 12, start = c(1982, 1))  # trimmed series now starts January 1982
par(mfrow = c(2, 1))  # two stacked panels: raw series above, differenced series below
plot.ts(wineind)
Because the series appears to have roughly constant variance over time, we will not transform it.
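If the variance did grow with the level of the series, a log or Box-Cox transformation would be the usual remedy. A minimal sketch using the forecast package, shown for illustration only since we do not transform here:
lambda <- BoxCox.lambda(wineind)  # estimate a variance-stabilizing lambda
wineind_bc <- BoxCox(wineind, lambda)  # lambda = 0 corresponds to a log transform
plot.ts(wineind_bc)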
We will now determine what order of differencing our model needs. We do this using the diff() function, passing our data set and a lag as arguments. Setting the lag to 12 takes a seasonal difference, comparing each observation with the one twelve months earlier:
diff1 <- diff(wineind, lag = 12)  # seasonal (lag-12) difference
plot.ts(diff1)
adf.test(diff1)  # augmented Dickey-Fuller test for stationarity
## Warning in adf.test(diff1): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff1
## Dickey-Fuller = -4.0358, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
We also ran adf.test() to check whether the differenced series is stationary. The test came back with a low p-value of 0.01, so we reject the null hypothesis of non-stationarity in favor of the alternative that this series is stationary. Since one seasonal difference was enough, we use d = 0 and D = 1.
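As a cross-check on these choices, the forecast package can estimate the number of differences directly; a quick sketch (we did not run this as part of the analysis above):
ndiffs(wineind)  # suggested number of nonseasonal differences d
nsdiffs(wineind)  # suggested number of seasonal differences D (needs a seasonal ts; frequency is 12 here)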
To choose our values of p, q, P, and Q, we use the autocorrelation function (ACF) and partial autocorrelation function (PACF).
To determine our value of q, we look for significant spikes at low lags of the ACF; to determine Q, we look for spikes at lags that are multiples of S in the ACF. Similarly, p comes from spikes at low lags of the PACF, and P from spikes at lags that are multiples of S in the PACF.
Acf(diff1)
Pacf(diff1)
Looking at our ACF graph, we see no significant spikes at low lags, so we determine that q = 0. We do have a spike at lag 12, however, so we set Q = 1.
Looking at our PACF graph, we likewise see no significant spikes at low lags, so we determine that p = 0; the spike at lag 12 leads us to set P = 1.
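As an aside, forecast's tsdisplay() draws the series, ACF, and PACF in a single figure, which makes this kind of inspection quicker:
tsdisplay(diff1, main = "Seasonally differenced wine sales")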
Fitting the model is now straightforward: we create it with our determined p, d, q, P, D, Q, and S values, as follows:
mod1 <- arima(wineind, order = c(0,0,0), seasonal = list(order = c(1,1,1), period = 12))
We will now look at the specifics of the model we have created. To run diagnostics, we analyze the residuals using a hypothesis test based on the Ljung-Box statistic.
If our ARIMA model is doing its job, it should account for the dependence among the \(y_t\) values, leaving residuals that are small and uncorrelated; the Ljung-Box statistic will then be small too.
In this test, our null hypothesis is that the model is adequate. The alternative hypothesis is that it is inadequate (meaning the residuals are still autocorrelated, which produces a large test statistic).
If we reject the null hypothesis, we go back to the drawing board and revisit the ACF and PACF to improve the model (to find better p, d, q, P, D, Q).
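Concretely, for \(n\) residuals with sample autocorrelations \(\hat{\rho}_k\), the Ljung-Box statistic over the first \(h\) lags is
\[
Q = n(n+2) \sum_{k=1}^{h} \frac{\hat{\rho}_k^{\,2}}{n-k},
\]
which is approximately \(\chi^2\)-distributed under the null hypothesis (Box.test's fitdf argument can reduce the degrees of freedom to account for estimated parameters). Large residual autocorrelations inflate \(Q\) and drive the p-value down.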
summary(mod1)
##
## Call:
## arima(x = wineind, order = c(0, 0, 0), seasonal = list(order = c(1, 1, 1), period = 12))
##
## Coefficients:
## sar1 sma1
## 0.3807 -0.9305
## s.e. 0.1344 0.2266
##
## sigma^2 estimated as 5796736: log likelihood = -1295.62, aic = 2597.23
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 421.0175 2310.659 1698.323 0.8636952 6.650661 0.3376614
## ACF1
## Training set 0.03255104
Box.test(mod1$residuals, type = "Ljung")
##
## Box-Ljung test
##
## data: mod1$residuals
## X-squared = 0.16425, df = 1, p-value = 0.6853
Based on the large p-value of this test (0.6853), we fail to reject the null hypothesis: the model appears adequate, meaning the residuals behave like white noise. The sar1 estimate (0.3807) is the smaller of the two seasonal coefficients, though, so we will try removing it to see whether a simpler model suffices:
mod2 <- arima(wineind, order = c(0,0,0), seasonal = list(order = c(0,1,1), period = 12))
summary(mod2)
##
## Call:
## arima(x = wineind, order = c(0, 0, 0), seasonal = list(order = c(0, 1, 1), period = 12))
##
## Coefficients:
## sma1
## -0.5330
## s.e. 0.1232
##
## sigma^2 estimated as 6486448: log likelihood = -1298.62, aic = 2601.25
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 333.4408 2444.26 1821.747 0.5345291 7.190648 0.3622007
## ACF1
## Training set 0.04947764
Box.test(mod2$residuals, type = "Ljung")  # check the simplified model's residuals
This also looks good: mod2's residuals show little first-order autocorrelation (its ACF1 is only about 0.05 in the summary above).
Out of curiosity, we will compare our results with those of the automated process:
automod <- auto.arima(wineind)
automod
## Series: wineind
## ARIMA(3,0,2)(0,1,1)[12]
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 sma1
## 0.1275 0.5764 0.2441 -0.1338 -0.6116 -0.6845
## s.e. 0.2519 0.2594 0.0916 0.2524 0.2305 0.0896
##
## sigma^2 estimated as 5797471: log likelihood=-1289.33
## AIC=2592.65 AICc=2593.5 BIC=2613.24
AIC(automod); AIC(mod2)
## [1] 2592.653
## [1] 2601.245
Box.test(automod$residuals)
##
## Box-Pierce test
##
## data: automod$residuals
## X-squared = 0.03366, df = 1, p-value = 0.8544
Judging by AIC (2601.245 for our model versus 2592.653 for auto.arima's), we were pretty close!
par(mfrow=c(2,1))
plot(forecast(automod, h = 12), main = "auto.arima")
plot(forecast(mod2, h = 12), main = "home-cooked")
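The forecast objects also contain the numeric predictions if we want them rather than the plots; a small sketch:
fc <- forecast(mod2, h = 12)  # 12-month-ahead forecasts
fc$mean  # point forecasts
fc$lower  # lower prediction-interval bounds (80% and 95% by default)
fc$upper  # upper bounds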
In class, we also learned about the decompose() function in R, which splits a time series into trend, seasonal, and remainder components. We will use it on our dataset:
dec <- decompose(wineind)
plot(dec)
In the above graph, the first panel shows our observed data. The second shows the estimated overall trend, which does not appear very pronounced here. The third panel shows the seasonal component, and the fourth shows the residuals.
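Note that decompose() estimates the trend with simple moving averages; stl() in base R's stats package performs a similar loess-based decomposition that is often more flexible. A sketch:
plot(stl(wineind, s.window = "periodic"))  # loess-based seasonal-trend decomposition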