We learned about model comparisons today in class. We looked at the likelihood ratio test. We need nested models for the Wald test. The LRT is similar to the F-test as it measures goodness of fit using liklihood. We will compare the test stat to a chi squared distribution. The degrees of freedom is the difference in the number of parameters in the two models being tested. We will look at the Iris data set to show this. We want to know if we should include species data in our model.
simpler <- lm(Petal.Length ~ Sepal.Length, data =iris)
biggermod <- lm(Petal.Length ~ Sepal.Length + Species, data =iris)
summary(biggermod)
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length + Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.76390 -0.17875 0.00716 0.17461 0.79954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.70234 0.23013 -7.397 1.01e-11 ***
## Sepal.Length 0.63211 0.04527 13.962 < 2e-16 ***
## Speciesversicolor 2.21014 0.07047 31.362 < 2e-16 ***
## Speciesvirginica 3.09000 0.09123 33.870 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2826 on 146 degrees of freedom
## Multiple R-squared: 0.9749, Adjusted R-squared: 0.9744
## F-statistic: 1890 on 3 and 146 DF, p-value: < 2.2e-16
We cant use t test so we will use the liklihood ratios.
logLik(simpler)
## 'log Lik.' -190.5675 (df=3)
logLik(biggermod)
## 'log Lik.' -21.23709 (df=5)
mydf <- length(coef(biggermod)) - length(coef(simpler))
(teststat<- -2*as.numeric(logLik(simpler) - logLik(biggermod)))
## [1] 338.6608
pchisq(teststat, df = mydf, lower.tail = FALSE)
## [1] 2.888934e-74
We have a very low p value so we should include species. This means that species adds a lot of useful data after accounting for sepal length.
Now we will look at an ARIMA model
library(forecast)
## Warning: package 'forecast' was built under R version 3.4.4
library(quantmod)
## Warning: package 'quantmod' was built under R version 3.4.4
## Loading required package: xts
## Warning: package 'xts' was built under R version 3.4.4
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.4.4
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
library(tseries)
## Warning: package 'tseries' was built under R version 3.4.4
library(timeSeries)
## Warning: package 'timeSeries' was built under R version 3.4.4
## Loading required package: timeDate
## Warning: package 'timeDate' was built under R version 3.4.4
##
## Attaching package: 'timeSeries'
## The following object is masked from 'package:zoo':
##
## time<-
library(forecast)
library(xts)
births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
births <- births[-c(1:24)]
births <- ts(births, frequency=12, start=c(1946,1))
Now let’s build our models and test them against each other
mod1 <- arima(births, order = c(1,0,1), season = list( order = c( 1,1,1), period=12))
mod2 <- arima(births, order = c(2,0,2), season = list( order = c( 1,1,1), period=12))
mod1
##
## Call:
## arima(x = births, order = c(1, 0, 1), seasonal = list(order = c(1, 1, 1), period = 12))
##
## Coefficients:
## ar1 ma1 sar1 sma1
## 0.9884 -0.1564 -0.2199 -0.8820
## s.e. 0.0287 0.1747 0.1004 0.1498
##
## sigma^2 estimated as 0.3924: log likelihood = -136.81, aic = 283.61
mod2
##
## Call:
## arima(x = births, order = c(2, 0, 2), seasonal = list(order = c(1, 1, 1), period = 12))
##
## Coefficients:
## ar1 ar2 ma1 ma2 sar1 sma1
## 1.2898 -0.2902 -0.5257 -0.2915 -0.2450 -0.8684
## s.e. 0.1738 0.1735 0.1660 0.1090 0.1029 0.1309
##
## sigma^2 estimated as 0.3295: log likelihood = -125.31, aic = 264.62
mydf <- length(coef(mod2)) - length(coef(mod1))
(teststat<- -2*as.numeric(logLik(mod1) - logLik(mod2)))
## [1] 22.98939
pchisq(teststat, df = mydf, lower.tail = FALSE)
## [1] 1.018396e-05
We should include the ar2 and ma2 because we have evidence to think that they are signficant and important.
We can now compare time series models and add this to our knowledge of model building with a new thing to compare on.