We learned about model comparisons today in class. We looked at the likelihood ratio test. We need nested models for the Wald test. The LRT is similar to the F-test as it measures goodness of fit using liklihood. We will compare the test stat to a chi squared distribution. The degrees of freedom is the difference in the number of parameters in the two models being tested. We will look at the Iris data set to show this. We want to know if we should include species data in our model.

simpler <- lm(Petal.Length ~ Sepal.Length, data =iris)
biggermod <- lm(Petal.Length ~ Sepal.Length + Species, data =iris)
summary(biggermod)
## 
## Call:
## lm(formula = Petal.Length ~ Sepal.Length + Species, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.76390 -0.17875  0.00716  0.17461  0.79954 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.70234    0.23013  -7.397 1.01e-11 ***
## Sepal.Length       0.63211    0.04527  13.962  < 2e-16 ***
## Speciesversicolor  2.21014    0.07047  31.362  < 2e-16 ***
## Speciesvirginica   3.09000    0.09123  33.870  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2826 on 146 degrees of freedom
## Multiple R-squared:  0.9749, Adjusted R-squared:  0.9744 
## F-statistic:  1890 on 3 and 146 DF,  p-value: < 2.2e-16

We cant use t test so we will use the liklihood ratios.

logLik(simpler)
## 'log Lik.' -190.5675 (df=3)
logLik(biggermod)
## 'log Lik.' -21.23709 (df=5)
mydf <- length(coef(biggermod)) - length(coef(simpler)) 

(teststat<- -2*as.numeric(logLik(simpler) - logLik(biggermod)))
## [1] 338.6608
pchisq(teststat, df = mydf, lower.tail = FALSE)
## [1] 2.888934e-74

We have a very low p value so we should include species. This means that species adds a lot of useful data after accounting for sepal length.

Now we will look at an ARIMA model

library(forecast)
## Warning: package 'forecast' was built under R version 3.4.4
library(quantmod)
## Warning: package 'quantmod' was built under R version 3.4.4
## Loading required package: xts
## Warning: package 'xts' was built under R version 3.4.4
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.4.4
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
library(tseries)
## Warning: package 'tseries' was built under R version 3.4.4
library(timeSeries)
## Warning: package 'timeSeries' was built under R version 3.4.4
## Loading required package: timeDate
## Warning: package 'timeDate' was built under R version 3.4.4
## 
## Attaching package: 'timeSeries'
## The following object is masked from 'package:zoo':
## 
##     time<-
library(forecast)
library(xts)

births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
births <- births[-c(1:24)]
births <- ts(births, frequency=12, start=c(1946,1))

Now let’s build our models and test them against each other

mod1 <- arima(births, order = c(1,0,1), season = list( order = c( 1,1,1), period=12))
mod2 <- arima(births, order = c(2,0,2), season = list( order = c( 1,1,1), period=12))

mod1
## 
## Call:
## arima(x = births, order = c(1, 0, 1), seasonal = list(order = c(1, 1, 1), period = 12))
## 
## Coefficients:
##          ar1      ma1     sar1     sma1
##       0.9884  -0.1564  -0.2199  -0.8820
## s.e.  0.0287   0.1747   0.1004   0.1498
## 
## sigma^2 estimated as 0.3924:  log likelihood = -136.81,  aic = 283.61
mod2
## 
## Call:
## arima(x = births, order = c(2, 0, 2), seasonal = list(order = c(1, 1, 1), period = 12))
## 
## Coefficients:
##          ar1      ar2      ma1      ma2     sar1     sma1
##       1.2898  -0.2902  -0.5257  -0.2915  -0.2450  -0.8684
## s.e.  0.1738   0.1735   0.1660   0.1090   0.1029   0.1309
## 
## sigma^2 estimated as 0.3295:  log likelihood = -125.31,  aic = 264.62
mydf <- length(coef(mod2)) - length(coef(mod1)) 

(teststat<- -2*as.numeric(logLik(mod1) - logLik(mod2)))
## [1] 22.98939
pchisq(teststat, df = mydf, lower.tail = FALSE)
## [1] 1.018396e-05

We should include the ar2 and ma2 because we have evidence to think that they are signficant and important.

We can now compare time series models and add this to our knowledge of model building with a new thing to compare on.