On Tuesday we covered seasonality and how to build our ARIMA models ourselves. To test whether a single coefficient belongs in the model, the hypotheses are
\[H_0: \theta = 0 \;(\text{or } \phi = 0) \qquad H_a: \theta \neq 0 \;(\text{or } \phi \neq 0)\]
The test statistic for these hypotheses is
\[\hat{\theta}/SE(\hat{\theta}) \quad \text{or} \quad \hat{\phi}/SE(\hat{\phi}),\]
which follows a T(df) distribution.
The df for these ARIMA tests is df = n - (number of theta's + number of phi's).
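As a quick worked illustration of that test (a minimal sketch with made-up numbers, not estimates from any model fit in this guide):
theta_hat <- 0.42     # made-up MA coefficient estimate
se_theta  <- 0.15     # its made-up standard error
n <- 100; k <- 3      # made-up sample size and total number of theta/phi parameters
(tstat <- theta_hat / se_theta)
2 * pt(abs(tstat), df = n - k, lower.tail = FALSE)   # two-sided p-value
With |t| = 2.8 on 97 degrees of freedom, the p-value is well under 0.05, so we would reject H0 and keep that coefficient in the model.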
Diagnostics: Residual analysis tells us how well the ARIMA fits. If the ARIMA is doing well, the residuals will be small (and roughly uncorrelated), the test statistic will be small, and the p-value will be high.
Interestingly enough, in these tests we want to fail to reject H0, because of what the hypotheses mean here. H0: "The model adequately models our time series data." Ha: "The model is inadequate."
There are several tools we can use to test, build, and improve our ARIMA models: the Box-Pierce statistic (the original), the Ljung-Box statistic (better), both computed with Box.test(mod1$residuals), plus the ACF and PACF.
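A quick sketch of what those residual diagnostics look like in R (here mod is a stand-in for whatever fitted arima() object you have; the models fit later in this guide are called mod1, mod2, and so on, and Acf() comes from the forecast package loaded below):
Box.test(mod$residuals, type = "Ljung")  # Ljung-Box: a high p-value means no evidence of inadequacy
Box.test(mod$residuals)                  # default type is the original Box-Pierce statistic
Acf(mod$residuals)                       # residual ACF should show no large spikes if the model fits well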
To build an ARIMA that includes seasonal components we add an x(P,D,Q)S piece to the second half. The nonseasonal p, d, q were explained in the last R guide. The setup looks like this:
ARIMA(p,d,q)x(P,D,Q)S
Here S is the length of the seasonal period: S = 12 for monthly data, S = 4 for quarterly data. P is the seasonal AR piece, D is the seasonal differencing (I) piece, and Q is the seasonal MA piece.
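In R, this notation maps directly onto the arima() call. For example, an ARIMA(1,0,0)x(1,1,1)12 (sketched here with a placeholder series x; the real fits come later in this guide) would be:
# x is a placeholder ts object with frequency 12
arima(x, order = c(1, 0, 0),
      seasonal = list(order = c(1, 1, 1), period = 12))
The nonseasonal (p,d,q) goes in order, and the seasonal (P,D,Q) together with S = period goes in the seasonal list.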
Now that we know how to test the data, let's practice using the data on births in NYC.
births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")   # monthly births in NYC
birthstimeseries <- ts(births, frequency=12, start=c(1946,1))          # monthly series starting Jan 1946
TScomp <- decompose(birthstimeseries)                                  # split into trend, seasonal, and random parts
plot(TScomp)
As the plots show, there is a lot of seasonality in the number of births in NYC, and the trend is increasing. Now that we know there is seasonality, we can subtract the seasonal component.
seasonofbirths <- birthstimeseries - TScomp$seasonal   # seasonally adjusted series
plot(seasonofbirths)
With the seasonal component removed, the upward-sloping trend in the number of births in NYC is even clearer.
library(forecast)
library(xts)
library(timeSeries)
library(tseries)
library(quantmod)
This section was taken from Knudson's explanation.
Seasonal vs. nonseasonal differences:
- First seasonal differences: y(t) - y(t-S), e.g. y(t) - y(t-4) for quarterly data or y(t) - y(t-12) for monthly data. These are what we use to settle on D.
- Nonseasonal differences: y(t) - y(t-1), used to remove an upward/downward trend (these settle d). Take these second, and only if the trend is still non-flat after the seasonal differences, because the seasonal differencing may already take care of it.
Steps:
1. Plot the data to see if you have seasonality. If, along with seasonality, your data has an increasing trend, transform before going forward (log() seems to work nicely).
2. Take differences to identify D and d (see the sketch after this list): if seasonality and no trend, use D >= 1 and d = 0; if trend but no seasonality, use D = 0 and d >= 1; if both seasonality and trend, take D >= 1 first and then check whether the resulting differences still have a trend (if yes, d >= 1).
3. Examine the ACF and PACF to choose preliminary values for p, q, P, Q. (The ACF, or autocorrelation function, tells you the correlation of two points separated by a specified lag; the PACF is like the ACF, but controls for values at shorter lags.)
- ACF: spikes at low lags suggest a nonseasonal moving average (q); spikes at lags that are multiples of S (12 for monthly, 4 for quarterly, ...) suggest a seasonal MA term (Q).
- PACF: spikes at low lags suggest a nonseasonal AR term (p); spikes at lags S, 2S, 3S, etc. suggest a seasonal AR term (P).
4. Estimate the model, run diagnostics and hypothesis tests, and use AIC to compare when a few models are close.
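A minimal sketch of steps 1 and 2 in R, where x stands in for any monthly series (the log() line is the optional transform, commented out):
# x is a placeholder monthly ts object (S = 12)
# x <- log(x)              # step 1: optional transform before differencing
d12 <- diff(x, lag = 12)   # step 2: first seasonal difference (counts toward D)
plot.ts(d12)               # still trending? then also take a nonseasonal difference...
d12_1 <- diff(d12)         # ...which counts toward d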
births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
births <- births[-c(1:24)]                           # drop the first two years (24 months)
births <- ts(births, frequency=12, start=c(1948,1))  # after the drop, the series starts in Jan 1948
par(mfrow=c(2,1))
plot.ts(births)
plot.ts(filter(births, sides=2, filter=rep(1/3,3)))  # centered 3-point moving average
par(mfrow=c(1,1))
(wgts <- c(.5, rep(1,11), .5)/12)   # weights for a centered 12-month moving average
## [1] 0.04166667 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333
## [7] 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333
## [13] 0.04166667
plot.ts(filter(births, sides=2, filter = wgts))
diff1 <- diff(births, lag = 12)   # first seasonal difference
plot.ts(diff1)
adf.test(diff1)                   # augmented Dickey-Fuller test for stationarity
##
## Augmented Dickey-Fuller Test
##
## data: diff1
## Dickey-Fuller = -3.4166, Lag order = 5, p-value = 0.05488
## alternative hypothesis: stationary
diff2 <- diff(diff1, lag=12)      # second seasonal difference (computed for comparison; not used below)
Acf(diff1) #spikes at low lags ==> nonseasonal MA (q > =1)
Pacf(diff1) # spike at low lag ==> nonseasonal AR (p >= 1)##partial ACF here
mod1 <- arima(births, order = c(1,0,1), seasonal = list(order = c(1,1,1), period = 12))
# order = the nonseasonal (p,d,q); seasonal = list of the seasonal (P,D,Q) and period = S
summary(mod1)
##
## Call:
## arima(x = births, order = c(1, 0, 1), seasonal = list(order = c(1, 1, 1), period = 12))
##
## Coefficients:
## ar1 ma1 sar1 sma1
## 0.9884 -0.1564 -0.2199 -0.8820
## s.e. 0.0287 0.1747 0.1004 0.1498
##
## sigma^2 estimated as 0.3924: log likelihood = -136.81, aic = 283.61
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.06656964 0.5998188 0.4452406 0.2299257 1.727419 0.3675936
## ACF1
## Training set 0.0410978
Box.test(mod1$residuals, type = "Ljung")
##
## Box-Ljung test
##
## data: mod1$residuals
## X-squared = 0.24832, df = 1, p-value = 0.6183
# no evidence the model is inadequate! (resids small enough)
# but mod1's ma1 is not significantly diff from 0, so let's try removing it
mod2 <- arima(births, order = c(1,0,0), seasonal = list(order = c(1,1,1), period = 12))
Box.test(mod2$residuals, type = "Ljung") #yay
##
## Box-Ljung test
##
## data: mod2$residuals
## X-squared = 0.62922, df = 1, p-value = 0.4276
summary(mod2) # most Wald test stats look decently far from 0
##
## Call:
## arima(x = births, order = c(1, 0, 0), seasonal = list(order = c(1, 1, 1), period = 12))
##
## Coefficients:
## ar1 sar1 sma1
## 0.9750 -0.2263 -0.8595
## s.e. 0.0326 0.0998 0.1280
##
## sigma^2 estimated as 0.3986: log likelihood = -137.27, aic = 282.55
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.07991759 0.6045416 0.450681 0.28466 1.748639 0.3720852
## ACF1
## Training set -0.06542041
# let's test the parameter with Wald test stat closest to 0
(teststat <- -0.2263/0.0998)
## [1] -2.267535
2*pt(abs(teststat), df = 144-3, lower.tail = FALSE) #p value
## [1] 0.02487795
# therefore, reject null that this parameter = 0
# we should keep this parameter estimate in the model
# and not remove the seasonal ar parameter
### Let's see if we can improve our model.
## we said p >=1, P >=1, Q >=1 and we tried all EQUAL to 1
# ARIMA (2, 0, 0) x (1, 1, 1) 12
mod3 <- arima(births, order = c(2,0,0), seasonal = list(order = c(1,1,1), period = 12))
mod3 # ar2 not significantly diff from 0
##
## Call:
## arima(x = births, order = c(2, 0, 0), seasonal = list(order = c(1, 1, 1), period = 12))
##
## Coefficients:
## ar1 ar2 sar1 sma1
## 0.9275 0.0513 -0.2237 -0.8671
## s.e. 0.0876 0.0881 0.0999 0.1333
##
## sigma^2 estimated as 0.3965: log likelihood = -137.1, aic = 284.21
# ARIMA (1, 0, 0) x (2, 1, 1) 12
mod4 <- arima(births, order = c(1,0,0), seasonal = list(order = c(2,1,1), period = 12))
mod4 # sar2 IS significantly diff from 0
##
## Call:
## arima(x = births, order = c(1, 0, 0), seasonal = list(order = c(2, 1, 1), period = 12))
##
## Coefficients:
## ar1 sar1 sar2 sma1
## 0.9677 -0.4796 -0.3846 -0.6243
## s.e. 0.0268 0.1138 0.1022 0.1227
##
## sigma^2 estimated as 0.3673: log likelihood = -131.72, aic = 273.44
mod5 <- arima(births, order = c(1,0,0), seasonal = list(order = c(2,1,2), period = 12))
mod5 # sma2 NOT significantly diff from 0 # so forget mod5. Simpler (mod4) is better
##
## Call:
## arima(x = births, order = c(1, 0, 0), seasonal = list(order = c(2, 1, 2), period = 12))
##
## Coefficients:
## ar1 sar1 sar2 sma1 sma2
## 0.9612 -0.2486 -0.3563 -0.8751 0.2615
## s.e. 0.0290 0.2787 0.1310 0.3022 0.2498
##
## sigma^2 estimated as 0.3631: log likelihood = -131.09, aic = 274.19
AIC(mod5); AIC(mod4) #plus they have the same AIC (roughly anyway)
## [1] 274.1868
## [1] 273.4391
Box.test(mod4$residuals, type = "Ljung") # model not inadequate :)
##
## Box-Ljung test
##
## data: mod4$residuals
## X-squared = 1.3714, df = 1, p-value = 0.2416
# So our final mod is mod4
######### For fun, let's compare what we got to auto.arima
automod <- auto.arima(births)
automod # pretty similar
## Series: births
## ARIMA(1,0,0)(2,1,1)[12] with drift
##
## Coefficients:
## ar1 sar1 sar2 sma1 drift
## 0.6988 -0.4160 -0.3503 -0.6846 0.0447
## s.e. 0.0631 0.1222 0.1093 0.1281 0.0033
##
## sigma^2 estimated as 0.3287: log likelihood=-122.05
## AIC=256.09 AICc=256.77 BIC=273.39
AIC(automod); AIC(mod4)
## [1] 256.0934
## [1] 273.4391
Box.test(automod$residuals)
##
## Box-Pierce test
##
## data: automod$residuals
## X-squared = 0.23843, df = 1, p-value = 0.6253
######### Forecasting
par(mfrow=c(2,1))
plot(forecast(automod, h = 12), main = "auto.arima")
plot(forecast(mod4, h = 12), main = "home-cooked")
# til we're better chefs, let's just eat at the restaurant
# auto.arima isn't Michelin rated, but it's better than we are
# after just 2 days of training
So, at least for this series, we could not hand-build an ARIMA as good as the computer's, so for now it makes sense to let auto.arima handle it.