LL24

In class on Tuesday, we went over how to do some tests that we didn’t cover throughout the rest of the class, especially using the Likelihood Ratio test.

First we worked in the context of linear regression, using the petal length data.

simpler <- lm(Petal.Length ~ Sepal.Length, data =iris)
biggermod <- lm(Petal.Length ~ Sepal.Length + Species, data =iris)

summary(biggermod) #why can't we use this to decide?

## 
## Call:
## lm(formula = Petal.Length ~ Sepal.Length + Species, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.76390 -0.17875  0.00716  0.17461  0.79954 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.70234    0.23013  -7.397 1.01e-11 ***
## Sepal.Length       0.63211    0.04527  13.962  < 2e-16 ***
## Speciesversicolor  2.21014    0.07047  31.362  < 2e-16 ***
## Speciesvirginica   3.09000    0.09123  33.870  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2826 on 146 degrees of freedom
## Multiple R-squared:  0.9749, Adjusted R-squared:  0.9744 
## F-statistic:  1890 on 3 and 146 DF,  p-value: < 2.2e-16

We can’t use the individual t-tests to decide whether or not to include species in our data because species is in some way included in the intercept.

# Could use the F-test to decide
anova(biggermod, simpler)

## Analysis of Variance Table
## 
## Model 1: Petal.Length ~ Sepal.Length + Species
## Model 2: Petal.Length ~ Sepal.Length
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    146  11.657                                  
## 2    148 111.459 -2   -99.802 624.99 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The partial f-test returns another super low p-value. I consider this method trustworthy, but let’s see what the Wilk’s test does.

# Or could use the LRT (Wilks test)
logLik(simpler)

## 'log Lik.' -190.5675 (df=3)

logLik(biggermod)

## 'log Lik.' -21.23709 (df=5)

mydf <- length(coef(biggermod)) - length(coef(simpler))

The logLik teststatistic is larger for the more complicated model. Should we use the more complicated model?

(teststat<- -2*as.numeric(logLik(simpler) - logLik(biggermod)))

## [1] 338.6608

pchisq(teststat, df = mydf, lower.tail = FALSE)

## [1] 2.888934e-74

#reject the null: the more complicated model (biggmod) fits better

The low p-val says that we should reject the null hypothesis and thus the simpler model. The bigger model fits better.

This Wilk’s test can be used in time series context too.

#install.packages("forecast")
#install.packages("uantmod")
#install.packages("tseries")
#install.packages("timeSeries")
#install.packages("xts")

library(forecast)
library(quantmod)

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: TTR

## Version 0.4-0 included new data defaults. See ?getSymbols.

library(tseries)
library(timeSeries)

## Loading required package: timeDate

## 
## Attaching package: 'timeSeries'

## The following object is masked from 'package:zoo':
## 
##     time<-

library(xts)

#### NY births data
births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
births <- births[-c(1:24)]
births <- ts(births, frequency=12, start=c(1946,1))

# In TSDecomp.R we decided to use an ARIMA model with
# p>=1, d=0, q>=1 (nonseasonal)
# P>=1, D>=1, Q>=1 (seasonal)

# Fit models
mod1 <- arima(births, order = c(1,0,1), season = list( order = c( 1,1,1), period=12))
mod2 <- arima(births, order = c(2,0,2), season = list( order = c( 1,1,1), period=12))

Using the results from the last Learning Log, we created a bigger model to compare the built model too. We should expect to see that our model is better because we did a bunch of fun stuff to get to that model.

mod1

## 
## Call:
## arima(x = births, order = c(1, 0, 1), seasonal = list(order = c(1, 1, 1), period = 12))
## 
## Coefficients:
##          ar1      ma1     sar1     sma1
##       0.9884  -0.1564  -0.2199  -0.8820
## s.e.  0.0287   0.1747   0.1004   0.1498
## 
## sigma^2 estimated as 0.3924:  log likelihood = -136.81,  aic = 283.61

mod2

## 
## Call:
## arima(x = births, order = c(2, 0, 2), seasonal = list(order = c(1, 1, 1), period = 12))
## 
## Coefficients:
##          ar1      ar2      ma1      ma2     sar1     sma1
##       1.2898  -0.2902  -0.5257  -0.2915  -0.2450  -0.8684
## s.e.  0.1738   0.1735   0.1660   0.1090   0.1029   0.1309
## 
## sigma^2 estimated as 0.3295:  log likelihood = -125.31,  aic = 264.62

The AIC is minimized in the more complicated model (by more than 10!).

mydf <- length(coef(mod2)) - length(coef(mod1))

(teststat<- -2*as.numeric(logLik(mod1) - logLik(mod2)))

## [1] 22.98939

pchisq(teststat, df = mydf, lower.tail = FALSE)

## [1] 1.018396e-05

#reject the null: the more complicated model (mod2) fits better

Well, the results of our last Learning Log were beaten by the (2, 0, 2) ARIMA model that we just created today.

Let’s try this with new data!

data("AirPassengers")

AirPassengers <- AirPassengers[-c(1:24)]
AirPassengers <- ts(AirPassengers, frequency=12, start=c(1949,1))

Mod1 <- arima(AirPassengers, order = c(1,0,1), season = list( order = c( 1,1,1), period=12))
Mod2 <- arima(AirPassengers, order = c(2,0,2), season = list( order = c( 1,1,1), period=12))

## Warning in arima(AirPassengers, order = c(2, 0, 2), season = list(order =
## c(1, : possible convergence problem: optim gave code = 1

Mod1

## 
## Call:
## arima(x = AirPassengers, order = c(1, 0, 1), seasonal = list(order = c(1, 1, 
##     1), period = 12))
## 
## Coefficients:
##          ar1      ma1     sar1    sma1
##       0.9743  -0.3106  -0.9824  0.9394
## s.e.  0.0196   0.0987   0.0905  0.1742
## 
## sigma^2 estimated as 145.8:  log likelihood = -425.03,  aic = 860.06

Mod2

## 
## Call:
## arima(x = AirPassengers, order = c(2, 0, 2), seasonal = list(order = c(1, 1, 
##     1), period = 12))
## 
## Coefficients:
##          ar1     ar2     ma1      ma2     sar1    sma1
##       0.5986  0.3592  0.0613  -0.0331  -0.9899  0.9556
## s.e.  0.5354  0.5226  0.5336   0.2159   0.0790  0.1859
## 
## sigma^2 estimated as 144.3:  log likelihood = -424.78,  aic = 863.56

The AIC is minimized in the simpler model, which is perfect because we prefer the simpler model.

Mydf <- length(coef(Mod2)) - length(coef(Mod1))

(teststat2<- -2*as.numeric(logLik(Mod1) - logLik(Mod2)))

## [1] 0.4919102

pchisq(teststat2, df = Mydf, lower.tail = FALSE)

## [1] 0.7819573

We do not reject the null hypothesis because the test stat is small and the p-val is large.

LL24

Carter Hunt

May 4, 2018