Likelihood ratio test (wilks test)

We can compare 2 nested models in terms of model fit, where we test the null that
1 model is a special case of the other (ie one is contained within the other)

We find the test statistic by doing 2 * (log likelihood of the complicated model - log likelihood of the simpler model)
Note that the likelyhood(complicated) > likelihood (simpler)

The steps we take to test the hypothesis are:
1) fit models
2) find log likelihood using logLik(model name)
3) compare using the command pchisq(test stat) which is a chi square, and note that the number of degrees of freedom is the difference in the number of parameters used.

For this output, we have 2 options. A high test stat indicates a low p value ==> the complicated model is better
A low test stat indicates a high p value ==> the simpler model is better since it’s easier to interpret.

library(forecast)
## Warning: package 'forecast' was built under R version 3.4.4
library(quantmod)
## Warning: package 'quantmod' was built under R version 3.4.4
## Loading required package: xts
## Warning: package 'xts' was built under R version 3.4.4
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.4.4
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
library(tseries)
## Warning: package 'tseries' was built under R version 3.4.4
library(timeSeries)
## Warning: package 'timeSeries' was built under R version 3.4.4
## Loading required package: timeDate
## Warning: package 'timeDate' was built under R version 3.4.4
## 
## Attaching package: 'timeSeries'
## The following object is masked from 'package:zoo':
## 
##     time<-
library(forecast)
library(xts)

A Motivating Example

I’ll use the time series data gas. we will take the log of this data because it has increasing seasonal variance

gdata <- log(gas)
plot.ts(gdata)

Now that there’s somewhat reasonably constant seasonal variance we can create a new model.
Like we did last time, we find the P,D and Q. we always start with D.

diff1 <- diff(gdata, lag = 4)
plot.ts(gdata)

adf.test(diff1)
## Warning in adf.test(diff1): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff1
## Dickey-Fuller = -12.65, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

There’s still an upward trend but it is slight. In this case, we will define D = 1 and d=0

Now we must find ps and qs.

Acf(diff1)

Pacf(diff1)

Here we can see that there are spikes. I would say that q >=1, Q >= 1, p >= 1 and P >= 1. Now we have to make arima models using these numbers. We will start our first model with p and q equal to 1, then increase them each by 1.
ARIMA (p, d, q) x (P, D, Q) S

mod1 = arima(gdata, order = c(1,0,1), season = list( order = c(1,1,1), period=4))
mod2 = arima(gdata, order = c(2,0,2), season = list( order = c(1,1,1), period=4))
mod1
## 
## Call:
## arima(x = gdata, order = c(1, 0, 1), seasonal = list(order = c(1, 1, 1), period = 4))
## 
## Coefficients:
##          ar1     ma1     sar1     sma1
##       0.9993  0.2272  -0.3207  -0.9882
## s.e.  0.0031  0.0355   0.0447   0.0246
## 
## sigma^2 estimated as 0.007763:  log likelihood = 469.12,  aic = -928.24
mod2
## 
## Call:
## arima(x = gdata, order = c(2, 0, 2), seasonal = list(order = c(1, 1, 1), period = 4))
## 
## Coefficients:
##          ar1      ar2      ma1     ma2     sar1    sma1
##       1.6043  -0.8304  -0.5533  0.3869  -0.1506  -0.543
## s.e.  0.0768   0.0871   0.1670  0.0418   0.0588   0.095
## 
## sigma^2 estimated as 0.006206:  log likelihood = 526.7,  aic = -1039.4

Our hypotheses are below:

H_0: ??=0, ??=0 H_alternative: ?? =/= 0 o ?? =/= 0

The null hypothesis is that we don’t need Q = 2 and P = 2 and our alternative is that we do need the more “complicated” model.

The next step is finding the log likelihoods and computing the test statistic.

teststat <- 2 * (as.numeric(logLik(mod2))-as.numeric(logLik(mod1)))
teststat
## [1] 115.1602

This seems like a high test stat and we will compare to the chi square distribution. Make sure the df = # parameters different, in this case it is 2.

pchisq(teststat,df = 2, lower.tail=FALSE)
## [1] 9.846443e-26

This is a low p value and therefore we should use the more complicated model with p = 2 and q = 2.

additionally, we know there’s a function that finds the arima for us. we can compare that.

gasmod <- auto.arima(gdata)
gasmod
## Series: gdata 
## ARIMA(0,1,1)(0,1,1)[12] 
## 
## Coefficients:
##           ma1     sma1
##       -0.3304  -0.8256
## s.e.   0.0462   0.0473
## 
## sigma^2 estimated as 0.002553:  log likelihood=719.45
## AIC=-1432.91   AICc=-1432.86   BIC=-1420.49

then we find the test stat and use the pchisq function again.

gteststat <- 2 * (as.numeric(logLik(gasmod))-as.numeric(logLik(mod2)))
gteststat
## [1] 385.5069

gasmod has 2 coefficients, and mod2 has 6, so we will test with df = 4

pchisq(gteststat,df=4, lower.tail = FALSE)
## [1] 3.762545e-82

Here we have a very low p value again, and we will use the more complicated model. In this case the more complicated model is not the auto.arima, but instead our mod2.