Intro

Today we learned about decomposing time series that exhibit trend and seasonal effects, and how to use an ARIMA model when a seasonal component is present in a time series.

Decomposing Time Series

For an example of decomposing a time series, we will look at birth data from the state of New York.

births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")  # read the monthly births data
birthstimeseries <- ts(births, frequency=12, start=c(1946,1))        # monthly series starting January 1946

Now that we have our time series data set, we can use the ‘decompose’ function to separate the series into its trend, seasonal, and irregular components. We will then plot the decomposition to see the original data alongside the estimated trend and seasonal components.

TScomp <- decompose(birthstimeseries)
plot(TScomp)

From the four plots we see that there is a positive trend and a seasonal component to the births time series data. We can now get rid of the seasonal component.

noseason <- birthstimeseries - TScomp$seasonal
plot(noseason)

Now that the seasonal component has been subtracted, we see only the trend and irregular components. There is an obvious increasing trend in births as time goes on.
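As a quick sanity check (not part of the original walkthrough), we could quantify this upward trend by regressing the seasonally adjusted series on its time index; the slope estimates the average change in births per year.

# hedged sketch: fit a straight line to the seasonally adjusted series
# time(noseason) returns the time index in years (1946.000, 1946.083, ...)
trendfit <- lm(noseason ~ time(noseason))
coef(trendfit)  # slope = estimated average change in births per year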

Seasonal ARIMA

We now add a layer to the ARIMA model we were introduced to last class, so that we can build ARIMA models when there is an obvious seasonal pattern in the time series data. We learned the parameters p, d, and q last time, but now we also have P, D, and Q: P is the seasonal auto-regressive order, D is the number of seasonal differences, and Q is the seasonal moving average order. The standard notation for seasonal ARIMA is (p,d,q)x(P,D,Q)S, where S indicates the length of the season (e.g. S=12 for monthly data, S=4 for quarterly). Now we will walk through the steps of creating a seasonal ARIMA model and compare it to the model that R creates.
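For reference (this compact form is standard, though we did not write it out in class), a seasonal ARIMA (p,d,q)x(P,D,Q)S model can be written with the backshift operator \(B\) as \[\phi_p(B)\,\Phi_P(B^S)\,(1-B)^d\,(1-B^S)^D\,y_t = \theta_q(B)\,\Theta_Q(B^S)\,\varepsilon_t,\] where \(\phi_p\) and \(\theta_q\) are the non-seasonal AR and MA polynomials and \(\Phi_P\) and \(\Theta_Q\) are their seasonal counterparts. The models we fit below are exactly this form with S=12.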

Step 1

First we will plot the data to see what type of seasonality is present.

library(forecast)
library(quantmod)
library(tseries)
library(timeSeries)
library(xts)
births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
births <- births[-c(1:24)]  # drop the first two years of observations
births <- ts(births, frequency=12, start=c(1948,1))  # series now starts January 1948
par(mfrow=c(2,1))
plot.ts(births)  # raw monthly series
plot.ts(filter(births, sides=2, filter=rep(1/3,3)))  # centered 3-month moving average

par(mfrow=c(1,1))
(wgts <- c(.5, rep(1,11), .5)/12)  # symmetric weights for a centered 12-month moving average
##  [1] 0.04166667 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333
##  [7] 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333
## [13] 0.04166667
plot.ts(filter(births, sides=2, filter = wgts))

The smoothed plots make the repeating annual pattern in this monthly data clear, so we will need to use seasonal differencing to eliminate that pattern.

Step 2

First we take seasonal differences to remove the seasonal component. For monthly data the seasonal difference is \(y_t-y_{t-12}\). Since we know that this is monthly data, we will use lag=12 to indicate this.

diff1 <- diff(births, lag = 12)
plot.ts(diff1) 

adf.test(diff1)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff1
## Dickey-Fuller = -3.4166, Lag order = 5, p-value = 0.05488
## alternative hypothesis: stationary

Looking at the plot, it appears that we have eliminated the seasonal pattern that was present in the original data. The ADF test p-value of about 0.055 is borderline, but close enough to rejecting a unit root that we will treat the differenced series as stationary. The next step would normally be to remove any remaining trend with ordinary (non-seasonal) differencing, but in this case the seasonal differencing appears to have eliminated the positive trend as well, so that step is not necessary. Our difference values are D=1 and d=0.
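As a cross-check (not part of the original walkthrough), the forecast package can estimate the differencing orders directly: nsdiffs() suggests the number of seasonal differences and ndiffs() the number of ordinary differences remaining afterward.

# hedged sketch: automatic estimates of D and d
nsdiffs(births)                 # suggested number of seasonal differences (D)
ndiffs(diff(births, lag = 12))  # ordinary differences still needed afterward (d)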

Step 3

Now that we have our difference values, we will make preliminary choices for p, q, P, and Q by looking at the autocorrelation function (Acf) plot and the partial autocorrelation function (Pacf) plot. First we look at the Acf plot, which helps us choose values for both the seasonal and non-seasonal moving average orders.

Acf(diff1)

For the non-seasonal value, we look for spikes at low lags, meaning lines that extend past the dotted blue significance bound. In our case the first two lags are spikes, with the second one much closer to the blue line than the first, so for now we will say that q>=1. For the seasonal moving average order, we look for spikes at lags that are multiples of our S, which is 12, so we check lags 12 and 24. There is a spike at 12 but not at 24, so we conclude that Q=1. Now we will look at the Pacf plot to choose our auto-regressive orders.

Pacf(diff1)

As with the Acf, for the non-seasonal AR order we look for spikes at low lags. We see one spike, so we will choose p=1. For the seasonal AR order we look at lags that are multiples of S=12. We see one spike at 12 and one just before 24, so we will choose our P value to be >=1.
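The visual reading above can also be checked numerically (a sketch, not shown in class): the dotted blue bound that Acf and Pacf draw is approximately \(\pm 1.96/\sqrt{n}\), so we can list which lags exceed it.

# hedged sketch: which lags exceed the approximate 95% significance bound?
n <- length(diff1)
bound <- qnorm(0.975)/sqrt(n)
which(abs(acf(diff1, lag.max = 24, plot = FALSE)$acf[-1]) > bound)  # ACF spikes
which(abs(pacf(diff1, lag.max = 24, plot = FALSE)$acf) > bound)     # PACF spikes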

Step 4

Now we will set up an ARIMA model with the preliminary values we found in the previous steps. For this first trial we will use (1,0,1)x(1,1,1)12.

mod1 <- arima(births, order = c(1,0,1), seasonal = list(order = c(1,1,1), period = 12))

Step 5

Now we will examine our model and compare it to others. If the model is adequate, the residuals will look like white noise and the Ljung-Box statistic will be small (its p-value large). For the Ljung-Box test, the null hypothesis is that the model is adequate (no remaining autocorrelation in the residuals), and the alternative is that the model is inadequate.

summary(mod1) 
## 
## Call:
## arima(x = births, order = c(1, 0, 1), seasonal = list(order = c(1, 1, 1), period = 12))
## 
## Coefficients:
##          ar1      ma1     sar1     sma1
##       0.9884  -0.1564  -0.2199  -0.8820
## s.e.  0.0287   0.1747   0.1004   0.1498
## 
## sigma^2 estimated as 0.3924:  log likelihood = -136.81,  aic = 283.61
## 
## Training set error measures:
##                      ME      RMSE       MAE       MPE     MAPE      MASE
## Training set 0.06656964 0.5998188 0.4452406 0.2299257 1.727419 0.3675936
##                   ACF1
## Training set 0.0410978
Box.test(mod1$residuals, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  mod1$residuals
## X-squared = 0.24832, df = 1, p-value = 0.6183

With a p-value of 0.6183, we do not have significant evidence to conclude our model is inadequate, but we will try one more model for comparison. To test whether a particular component is significant, we divide its coefficient by its standard error to get a test statistic, then use the t distribution to find the p-value and decide whether the component is worth including. We will look at ma1, because it appears to have a small test statistic.

teststat <- -.1564/.1747  # ma1 coefficient divided by its standard error
pval <- 2*pt(abs(teststat), lower.tail = FALSE, df=140)  # two-sided p-value
pval
## [1] 0.3721906
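A quicker way to screen every coefficient at once (a normal-approximation variant of the same idea, not shown in class) uses the coefficient covariance matrix stored in the fitted model.

# hedged sketch: z-statistics and two-sided p-values for all coefficients
zstats <- coef(mod1)/sqrt(diag(mod1$var.coef))
round(2*pnorm(abs(zstats), lower.tail = FALSE), 4)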

At a p-value of 0.372, the ma1 component is not worth keeping. Now we will try a model with the values (1,0,0)x(1,1,1)12.

mod2 <- arima(births, order = c(1,0,0), seasonal = list(order = c(1,1,1), period = 12))
Box.test(mod2$residuals, type = "Ljung")  # large p-value: no evidence of residual autocorrelation
## 
##  Box-Ljung test
## 
## data:  mod2$residuals
## X-squared = 0.62922, df = 1, p-value = 0.4276
summary(mod2)
## 
## Call:
## arima(x = births, order = c(1, 0, 0), seasonal = list(order = c(1, 1, 1), period = 12))
## 
## Coefficients:
##          ar1     sar1     sma1
##       0.9750  -0.2263  -0.8595
## s.e.  0.0326   0.0998   0.1280
## 
## sigma^2 estimated as 0.3986:  log likelihood = -137.27,  aic = 282.55
## 
## Training set error measures:
##                      ME      RMSE      MAE     MPE     MAPE      MASE
## Training set 0.07991759 0.6045416 0.450681 0.28466 1.748639 0.3720852
##                     ACF1
## Training set -0.06542041

All of the remaining components appear to be significant to the model, so we will now compare the two models using AIC.

AIC(mod1);AIC(mod2)
## [1] 283.6117
## [1] 282.5484

Our mod2 has a lower AIC by about one, so of the models we created, we conclude that mod2 is the better one. One way to see how we did is to give R the same time series data and let the auto.arima command come up with its own ARIMA model to compare against ours.

automod <- auto.arima(births)
automod 
## Series: births 
## ARIMA(1,0,0)(2,1,1)[12] with drift 
## 
## Coefficients:
##          ar1     sar1     sar2     sma1   drift
##       0.6988  -0.4160  -0.3503  -0.6846  0.0447
## s.e.  0.0631   0.1222   0.1093   0.1281  0.0033
## 
## sigma^2 estimated as 0.3287:  log likelihood=-122.05
## AIC=256.09   AICc=256.77   BIC=273.39

Our mod2 differs from the auto.arima model in the seasonal AR order, where auto.arima chose a value of two instead of one, and in auto.arima's inclusion of a drift term.
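For completeness (an extra check, not in the original notes), the same Ljung-Box diagnostic can be applied to the auto.arima fit to confirm its residuals also look like white noise.

# hedged sketch: residual diagnostic for the automatically selected model
Box.test(residuals(automod), type = "Ljung")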

Forecasting

Now we will use both our model and the auto.arima model to forecast births 12 months ahead.

par(mfrow=c(2,1))
plot(forecast(automod, h = 12), main = "auto.arima")
plot(forecast(mod2, h = 12), main = "home-cooked")

The forecasts look similar, but the auto.arima model has a much narrower interval at the same confidence level, which is pretty convincing evidence that the auto.arima model provides a more precise forecast.
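Narrow intervals alone do not guarantee accuracy, so a more direct comparison (a sketch, not performed in class) would hold out the final 12 months, refit both models on the remaining data, and score their forecasts against the held-out year.

# hedged sketch: out-of-sample comparison on the last year of data
train <- window(births, end = c(1958, 12))
test  <- window(births, start = c(1959, 1))
fit_auto <- auto.arima(train)
fit_ours <- arima(train, order = c(1,0,0),
                  seasonal = list(order = c(1,1,1), period = 12))
accuracy(forecast(fit_auto, h = 12), test)  # test-set rows give out-of-sample error
accuracy(forecast(fit_ours, h = 12), test)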