In today’s class, we covered decomposition of time series and creating ARIMA models by hand.
We can use the decompose function to break a time series down into trend, seasonal, and random components. One of the datasets we looked at was for the number of births per month in New York City. We use the ts function to first create our time series model. As always, we need to load a few libraries before we begin.
library(forecast)
## Warning: package 'forecast' was built under R version 3.3.3
library(quantmod)
## Warning: package 'quantmod' was built under R version 3.3.3
## Warning: package 'xts' was built under R version 3.3.3
## Warning: package 'TTR' was built under R version 3.3.3
library(tseries)
## Warning: package 'tseries' was built under R version 3.3.3
library(timeSeries)
## Warning: package 'timeSeries' was built under R version 3.3.3
library(xts)
births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
birthstimeseries <- ts(births, frequency=12, start=c(1946,1))
plot(birthstimeseries)
This is the plot of our time series model. Now we can decompose it.
TScomp <- decompose(birthstimeseries)
plot(TScomp)
The output consists of four plots. The top plot is our original time series. The second plot shows overall trends. The third plot shows the seasonal component, and the last plot shows the random component.
We can see there is definitely a seasonal component as the seasonal plot has a jagged pattern.
In the previous class we used R to automatically give us an ARIMA model for our time series data. Now let’s try creating one by hand to understand the process. Our original time series had a lot of noise and seasonality, so let’s try smoothing it out first.
births <- births[-c(1:24)]
births <- ts(births, frequency=12, start=c(1946,1))
par(mfrow=c(2,1))
plot.ts(births)
plot.ts(filter(births, sides=2, filter=rep(1/3,3)))
par(mfrow=c(1,1))
(wgts <- c(.5, rep(1,11), .5)/12)
## [1] 0.04166667 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333
## [7] 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333
## [13] 0.04166667
plot.ts(filter(births, sides=2, filter = wgts))
Now we need to take differences to eliminate seasonality and trends, since we saw both in our original plots. We use the diff function to do this. The lag to eliminate seasonality in this case is 12, since we have monthly observations and there are 12 months in the year.
differences1 <- diff(births, lag = 12)
plot.ts(differences1)
The trend seems to be mostly gone. Just to be sure, we can use the adf.test function to check. If the p-value is insignificant, then we should be alright.
adf.test(differences1)
##
## Augmented Dickey-Fuller Test
##
## data: differences1
## Dickey-Fuller = -3.4166, Lag order = 5, p-value = 0.05488
## alternative hypothesis: stationary
The p-value looks good. Next we will look at the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF). We will use the Acf and Pacf functions, giving our differences as the argument.
par(mfrow=c(1,2))
Acf(differences1)
Pacf(differences1)
We use these functions to determine the p and q values for the nonseasonal component. We look for spikes at low lags in both cases. If we see spikes at low lags in the ACF, then the nonseasonal q is greater than or equal to 1. If we see spikes at low lags in the PACF, then the nonseasonal p is greater than or equal to 1.
In our case, we can see both functions have spikes at low lags, so both nonseasonal p and q are at least 1.
We use the same functions to determine the P and Q values for our seasonal component, except we are looking at spikes at lags of multiples of 12 (lag 12, 24, 36, 48, etc). In the ACF, we can see a spike at lag 12, so our Q is at least 1. In the PACF, we also see a spike at lag 12, so our P is at least 1.
Recall that ARIMA models have values p, d, q for the nonseasonal component and P, D, Q for the seasonal component. So far our model has (p,d,q) as (1,0,1). d is 0 because we did not need to take a difference to eliminate trend. Our seasonal component (P,D,Q) is (1,1,1). D is 1 in this case because we needed to take a difference to eliminate seasonality.
Now we can create a model.
ManualArimaMod <- arima(births, order = c(1,0,1), season = list( order = c( 1,1,1), period=12))
summary(ManualArimaMod)
##
## Call:
## arima(x = births, order = c(1, 0, 1), seasonal = list(order = c(1, 1, 1), period = 12))
##
## Coefficients:
## ar1 ma1 sar1 sma1
## 0.9884 -0.1564 -0.2199 -0.8820
## s.e. 0.0287 0.1747 0.1004 0.1498
##
## sigma^2 estimated as 0.3924: log likelihood = -136.81, aic = 283.61
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.06656964 0.5998188 0.4452406 0.2299257 1.727419 0.3675936
## ACF1
## Training set 0.0410978
To test our model, we should use the Ljung Box test. The function is Box.test and the arguments are our model’s residuals and the type, which we will specify as “Ljung.”
Box.test(ManualArimaMod$residuals, type = "Ljung")
##
## Box-Ljung test
##
## data: ManualArimaMod$residuals
## X-squared = 0.24832, df = 1, p-value = 0.6183
Typically large p-values are not a good thing, however this is an exception. A large p-value for the Ljung Box test means the model is not inadequate, whereas a small p-value would mean the model is inadequate and we should change it.
We can use the model we created to forecast the number of births in future months.
plot(forecast(ManualArimaMod, h = 12), main = "Manual")
Comparing it to the automatically created model:
AutoArimaMod <- auto.arima(births)
plot(forecast(AutoArimaMod, h = 12), main = "Automated")
In practice, using auto.arima is almost always a more efficient way to create an ARIMA model. We can see the automated model is better than the one we created by hand, and is much faster than going through each step to look at the time series data. However, it is still important to know how to create an ARIMA model by hand as a way of understanding how it works.