In class we looked at how to decompose a time series. We used the births data, the number of births per month in New York City, which we read in with scan().

births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
birthstimeseries <- ts(births, frequency=12, start=c(1946,1))

To initialize this as a time series, call ts() with the data, the frequency, and the starting point as inputs. Let's plot the series.

plot(birthstimeseries)

We see some trend, some seasonal variation, and surely some unexplained variation as well. We can see each of these components graphed using the decompose() function with the time series as the input.

TScomp <- decompose(birthstimeseries)
plot(TScomp)
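
decompose() returns these pieces as a list, so we can also inspect them directly. A quick look (output not shown):

str(TScomp, max.level = 1)  # x, seasonal, trend, random, figure, type
round(TScomp$figure, 2)     # the twelve estimated monthly seasonal effects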

From here, we can do some interesting things! We could remove a component from our time series. For example, let's lose the seasonal component:

noseason <- birthstimeseries - TScomp$seasonal  # only the trend and irregular components remain
plot(noseason)
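
In the same spirit, we could subtract the trend instead; decompose() leaves NA at both ends of the trend estimate, so the plot will have short gaps there.

notrend <- birthstimeseries - TScomp$trend  # seasonal and irregular components remain
plot(notrend)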

library(forecast)
library(quantmod)
library(tseries)
library(timeSeries)
library(xts)

Now, to bring ARIMA into this… First we re-read the data, drop the first two years, and plot the raw series alongside a smoothed version.

births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
births <- births[-c(1:24)]  # drop the first two years (24 months)
births <- ts(births, frequency=12, start=c(1948,1))  # series now starts in Jan 1948
par(mfrow=c(2,1))
plot.ts(births) 
plot.ts(filter(births, sides=2, filter=rep(1/3,3)))  # 3-point centered moving average

par(mfrow=c(1,1))
(wgts <- c(.5, rep(1,11), .5)/12)  # 13-term centered MA weights, half weight on the endpoints
##  [1] 0.04166667 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333
##  [7] 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333
## [13] 0.04166667
plot.ts(filter(births, sides=2, filter = wgts))  # 12-month centered moving average

We are essentially trying to see the structure without all of the "noise". The 3-point moving average smooths out the month-to-month jitter while keeping the seasonal pattern visible; the 13-term yearly average smooths over the seasonality entirely and leaves the trend. (This 13-term filter is in fact the same one decompose() uses to estimate the trend for monthly data.)

The next step is all about the "diffs", which remove seasonality and trend until only the random component is left. A lag-12 difference subtracts from each month the value of the same month one year earlier, which takes out a stable seasonal pattern.

diff1 <- diff(births, lag = 12)
plot.ts(diff1) 
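
A quick sanity check on what diff() with lag = 12 computes: each value is a month minus the same month one year earlier, so the first difference should equal the 13th observation minus the 1st (output not shown).

diff1[1]                 # first seasonal difference
births[13] - births[1]   # same month, one year apart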

adf.test(diff1) #stationary? good enough for me.
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff1
## Dickey-Fuller = -3.4166, Lag order = 5, p-value = 0.05488
## alternative hypothesis: stationary

The augmented Dickey-Fuller test on the diffs has a p-value of about 0.055, just above the usual 0.05 cutoff, so the evidence for stationarity is borderline; we will treat the differenced series as stationary (no trend remaining) and move on.
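
As a cross-check (output not shown), we could run the ADF test on the undifferenced series to see how much the seasonal differencing helped, and tseries also offers kpss.test(), whose null hypothesis is stationarity, the reverse of the ADF null, so here a large p-value is the reassuring outcome.

adf.test(births)  # before differencing, for comparison
kpss.test(diff1)  # null is stationarity: a large p-value supports our conclusion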

Now we are looking at autocorrelation.

Acf(diff1) 

Pacf(diff1) 

If there are significant spikes at low lags, the corresponding order is at least one: ACF spikes point to the MA order q, and PACF spikes point to the AR order p. Both plots show them, so our ARIMA model so far is (1+, 0, 1+).
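
Rather than eyeballing the plots, we can flag the spikes numerically. A rough sketch using base R's acf()/pacf() and the usual +/- 1.96/sqrt(n) band (an approximation, not a formal test):

n <- length(diff1)
band <- 1.96 / sqrt(n)
which(abs(acf(diff1, plot = FALSE)$acf[-1]) > band)  # drop lag 0, then flag ACF spikes
which(abs(pacf(diff1, plot = FALSE)$acf) > band)     # PACF has no lag-0 term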

As we know from ARIMA models, this only covers the non-seasonal side. Let's take a look at P and Q, the seasonal orders.

Acf(diff1) 

Pacf(diff1) 

# Our model thus far: P >= 1, D = 1 (one seasonal difference), Q >= 1

Now we are looking for spikes at the frequency of our time series (lag 12). Both plots have spikes there, so the seasonal orders P and Q are also at least one.
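
The default plots only show the first seasonal lag or two; passing lag.max makes the spikes at lags 12, 24, and 36 easier to judge:

Acf(diff1, lag.max = 48)
Pacf(diff1, lag.max = 48)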

The next step is to estimate a time series model using our ARIMA orders. The notation is ARIMA(p, d, q) x (P, D, Q)[S]; here that is ARIMA(1, 0, 1) x (1, 1, 1)[12].

mod1 <- arima(births, order = c(1,0,1), seasonal = list(order = c(1,1,1), period = 12))

The next step consists of model building and examination.

summary(mod1) 
## 
## Call:
## arima(x = births, order = c(1, 0, 1), seasonal = list(order = c(1, 1, 1), period = 12))
## 
## Coefficients:
##          ar1      ma1     sar1     sma1
##       0.9884  -0.1564  -0.2199  -0.8820
## s.e.  0.0287   0.1747   0.1004   0.1498
## 
## sigma^2 estimated as 0.3924:  log likelihood = -136.81,  aic = 283.61
## 
## Training set error measures:
##                      ME      RMSE       MAE       MPE     MAPE      MASE
## Training set 0.06656964 0.5998188 0.4452406 0.2299257 1.727419 0.3675936
##                   ACF1
## Training set 0.0410978
Box.test(mod1$residuals, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  mod1$residuals
## X-squared = 0.24832, df = 1, p-value = 0.6183
# no evidence the model is inadequate! (no significant autocorrelation left in the residuals)

This is a Box-Ljung test: a large p-value means we find no significant autocorrelation in the residuals, i.e., no evidence that the model is inadequate.
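
The default test only looks at the first lag; it is worth checking more lags as well, and (if your version of forecast has it) checkresiduals() bundles a residual plot, an ACF, and a Ljung-Box test into one call. A sketch, output not shown:

Box.test(mod1$residuals, lag = 24, type = "Ljung", fitdf = 4)  # fitdf = 4 estimated ARMA coefficients
checkresiduals(mod1)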

The ma1 coefficient in mod1 (-0.1564) is small relative to its standard error (0.1747), so let's try dropping the MA term.

mod2 <- arima(births, order = c(1,0,0), seasonal = list(order = c(1,1,1), period = 12))
Box.test(mod2$residuals, type = "Ljung") 
## 
##  Box-Ljung test
## 
## data:  mod2$residuals
## X-squared = 0.62922, df = 1, p-value = 0.4276
summary(mod2) 
## 
## Call:
## arima(x = births, order = c(1, 0, 0), seasonal = list(order = c(1, 1, 1), period = 12))
## 
## Coefficients:
##          ar1     sar1     sma1
##       0.9750  -0.2263  -0.8595
## s.e.  0.0326   0.0998   0.1280
## 
## sigma^2 estimated as 0.3986:  log likelihood = -137.27,  aic = 282.55
## 
## Training set error measures:
##                      ME      RMSE      MAE     MPE     MAPE      MASE
## Training set 0.07991759 0.6045416 0.450681 0.28466 1.748639 0.3720852
##                     ACF1
## Training set -0.06542041
(teststat <- -0.2263/0.0998)  # t-statistic for sar1: estimate / standard error
## [1] -2.267535
2*pt(abs(teststat), df = 144-3, lower.tail = FALSE)  # two-sided p-value; 144 observations minus 3 coefficients
## [1] 0.02487795
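The p-value of about 0.025 says sar1 is significantly different from zero, so the seasonal AR term earns its keep. The same by-hand test can be run on every coefficient at once, pulling standard errors from the estimated covariance matrix (a sketch using normal quantiles, which is the usual convention for ARIMA coefficients; output not shown):

z <- mod2$coef / sqrt(diag(mod2$var.coef))  # estimate / standard error for each coefficient
2 * pnorm(abs(z), lower.tail = FALSE)       # two-sided p-values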
mod3 <- arima(births, order = c(2,0,0), seasonal = list(order = c(1,1,1), period = 12))  # add a second non-seasonal AR term
mod3 
## 
## Call:
## arima(x = births, order = c(2, 0, 0), seasonal = list(order = c(1, 1, 1), period = 12))
## 
## Coefficients:
##          ar1     ar2     sar1     sma1
##       0.9275  0.0513  -0.2237  -0.8671
## s.e.  0.0876  0.0881   0.0999   0.1333
## 
## sigma^2 estimated as 0.3965:  log likelihood = -137.1,  aic = 284.21
mod4 <- arima(births, order = c(1,0,0), seasonal = list(order = c(2,1,1), period = 12))  # add a second seasonal AR term instead
mod4 
## 
## Call:
## arima(x = births, order = c(1, 0, 0), seasonal = list(order = c(2, 1, 1), period = 12))
## 
## Coefficients:
##          ar1     sar1     sar2     sma1
##       0.9677  -0.4796  -0.3846  -0.6243
## s.e.  0.0268   0.1138   0.1022   0.1227
## 
## sigma^2 estimated as 0.3673:  log likelihood = -131.72,  aic = 273.44
AIC(mod4); AIC(mod2) 
## [1] 273.4391
## [1] 282.5484
mod5 <- arima(births, order = c(1,0,0), seasonal = list(order = c(2,1,2), period = 12))  # add a second seasonal MA term too
mod5 
## 
## Call:
## arima(x = births, order = c(1, 0, 0), seasonal = list(order = c(2, 1, 2), period = 12))
## 
## Coefficients:
##          ar1     sar1     sar2     sma1    sma2
##       0.9612  -0.2486  -0.3563  -0.8751  0.2615
## s.e.  0.0290   0.2787   0.1310   0.3022  0.2498
## 
## sigma^2 estimated as 0.3631:  log likelihood = -131.09,  aic = 274.19
AIC(mod5); AIC(mod4)
## [1] 274.1868
## [1] 273.4391
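With this many candidates floating around, one line keeps the comparison honest (output not shown):

sapply(list(mod1 = mod1, mod2 = mod2, mod3 = mod3, mod4 = mod4, mod5 = mod5), AIC)
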
mod4
## 
## Call:
## arima(x = births, order = c(1, 0, 0), seasonal = list(order = c(2, 1, 1), period = 12))
## 
## Coefficients:
##          ar1     sar1     sar2     sma1
##       0.9677  -0.4796  -0.3846  -0.6243
## s.e.  0.0268   0.1138   0.1022   0.1227
## 
## sigma^2 estimated as 0.3673:  log likelihood = -131.72,  aic = 273.44
Box.test(mod4$residuals)  # default type is Box-Pierce
## 
##  Box-Pierce test
## 
## data:  mod4$residuals
## X-squared = 1.3432, df = 1, p-value = 0.2465

This is our final model. After all of these tests and interpretations (which I don't fully understand), I do know how to forecast! First, let's compare what we did to auto.arima(), which does all of this for us.

automod <- auto.arima(births)
automod
## Series: births 
## ARIMA(1,0,0)(2,1,1)[12] with drift 
## 
## Coefficients:
##          ar1     sar1     sar2     sma1   drift
##       0.6988  -0.4160  -0.3503  -0.6846  0.0447
## s.e.  0.0631   0.1222   0.1093   0.1281  0.0033
## 
## sigma^2 estimated as 0.3287:  log likelihood=-122.05
## AIC=256.09   AICc=256.77   BIC=273.39
AIC(automod); AIC(mod4)
## [1] 256.0934
## [1] 273.4391
Box.test(automod$residuals)
## 
##  Box-Pierce test
## 
## data:  automod$residuals
## X-squared = 0.23843, df = 1, p-value = 0.6253

These models are actually quite similar: auto.arima() picked the same orders we did, ARIMA(1,0,0)(2,1,1)[12], but added a drift term and reached a noticeably lower AIC (256.1 vs 273.4). auto.arima() is also much easier; let's compare their forecasts.

par(mfrow=c(2,1))
plot(forecast(automod, h = 12), main = "auto.arima")
plot(forecast(mod4, h = 12), main = "home-cooked")
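
Beyond eyeballing the plots, we can line the point forecasts up numerically (a sketch, output not shown):

fc_auto <- forecast(automod, h = 12)
fc_home <- forecast(mod4, h = 12)
round(cbind(auto = fc_auto$mean, home = fc_home$mean), 2)  # point forecasts side by side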

We are not professionals, but auto.arima was coded by some. I think we did okay for rookies!