We can compare 2 nested models in terms of model fit, where we test the null that
1 model is a special case of the other (ie one is contained within the other)
We find the test statistic by doing 2 * (log likelihood of the complicated model - log likelihood of the simpler model)
Note that the likelyhood(complicated) > likelihood (simpler)
The steps we take to test the hypothesis are:
1) fit models
2) find log likelihood using logLik(model name)
3) compare using the command pchisq(test stat) which is a chi square, and note that the number of degrees of freedom is the difference in the number of parameters used.
For this output, we have 2 options. A high test stat indicates a low p value ==> the complicated model is better
A low test stat indicates a high p value ==> the simpler model is better since it’s easier to interpret.
library(forecast)
## Warning: package 'forecast' was built under R version 3.4.4
library(quantmod)
## Warning: package 'quantmod' was built under R version 3.4.4
## Loading required package: xts
## Warning: package 'xts' was built under R version 3.4.4
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.4.4
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
library(tseries)
## Warning: package 'tseries' was built under R version 3.4.4
library(timeSeries)
## Warning: package 'timeSeries' was built under R version 3.4.4
## Loading required package: timeDate
## Warning: package 'timeDate' was built under R version 3.4.4
##
## Attaching package: 'timeSeries'
## The following object is masked from 'package:zoo':
##
## time<-
library(forecast)
library(xts)
I’ll use the time series data gas. we will take the log of this data because it has increasing seasonal variance
gdata <- log(gas)
plot.ts(gdata)
Now that there’s somewhat reasonably constant seasonal variance we can create a new model.
Like we did last time, we find the P,D and Q. we always start with D.
diff1 <- diff(gdata, lag = 4)
plot.ts(gdata)
adf.test(diff1)
## Warning in adf.test(diff1): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff1
## Dickey-Fuller = -12.65, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
There’s still an upward trend but it is slight. In this case, we will define D = 1 and d=0
Now we must find ps and qs.
Acf(diff1)
Pacf(diff1)
Here we can see that there are spikes. I would say that q >=1, Q >= 1, p >= 1 and P >= 1. Now we have to make arima models using these numbers. We will start our first model with p and q equal to 1, then increase them each by 1.
ARIMA (p, d, q) x (P, D, Q) S
mod1 = arima(gdata, order = c(1,0,1), season = list( order = c(1,1,1), period=4))
mod2 = arima(gdata, order = c(2,0,2), season = list( order = c(1,1,1), period=4))
mod1
##
## Call:
## arima(x = gdata, order = c(1, 0, 1), seasonal = list(order = c(1, 1, 1), period = 4))
##
## Coefficients:
## ar1 ma1 sar1 sma1
## 0.9993 0.2272 -0.3207 -0.9882
## s.e. 0.0031 0.0355 0.0447 0.0246
##
## sigma^2 estimated as 0.007763: log likelihood = 469.12, aic = -928.24
mod2
##
## Call:
## arima(x = gdata, order = c(2, 0, 2), seasonal = list(order = c(1, 1, 1), period = 4))
##
## Coefficients:
## ar1 ar2 ma1 ma2 sar1 sma1
## 1.6043 -0.8304 -0.5533 0.3869 -0.1506 -0.543
## s.e. 0.0768 0.0871 0.1670 0.0418 0.0588 0.095
##
## sigma^2 estimated as 0.006206: log likelihood = 526.7, aic = -1039.4
Our hypotheses are below:
H_0: ??=0, ??=0 H_alternative: ?? =/= 0 o ?? =/= 0
The null hypothesis is that we don’t need Q = 2 and P = 2 and our alternative is that we do need the more “complicated” model.
The next step is finding the log likelihoods and computing the test statistic.
teststat <- 2 * (as.numeric(logLik(mod2))-as.numeric(logLik(mod1)))
teststat
## [1] 115.1602
This seems like a high test stat and we will compare to the chi square distribution. Make sure the df = # parameters different, in this case it is 2.
pchisq(teststat,df = 2, lower.tail=FALSE)
## [1] 9.846443e-26
This is a low p value and therefore we should use the more complicated model with p = 2 and q = 2.
additionally, we know there’s a function that finds the arima for us. we can compare that.
gasmod <- auto.arima(gdata)
gasmod
## Series: gdata
## ARIMA(0,1,1)(0,1,1)[12]
##
## Coefficients:
## ma1 sma1
## -0.3304 -0.8256
## s.e. 0.0462 0.0473
##
## sigma^2 estimated as 0.002553: log likelihood=719.45
## AIC=-1432.91 AICc=-1432.86 BIC=-1420.49
then we find the test stat and use the pchisq function again.
gteststat <- 2 * (as.numeric(logLik(gasmod))-as.numeric(logLik(mod2)))
gteststat
## [1] 385.5069
gasmod has 2 coefficients, and mod2 has 6, so we will test with df = 4
pchisq(gteststat,df=4, lower.tail = FALSE)
## [1] 3.762545e-82
Here we have a very low p value again, and we will use the more complicated model. In this case the more complicated model is not the auto.arima, but instead our mod2.