Likelihood Ratio Tests

Likelihood ratio tests are used to compare the goodness of fit of two statistical models. The LRT compares two hierarchically nested models to determine whether or not adding complexity to your model (i.e., adding more parameters) makes your model significantly more accurate. The “hierarchically nested models” simply means that the complex model differs only from the simpler (or “nested”) model by the addition of one or more parameters.

In summary, the LRT tells us if it is beneficial to add parameters to our model, or if we should stick with our simpler model.

In their most basic form, the hypotheses for the LRT are:

H0: You should use the nested model.

Ha: You should use the complex model.

Thus, if you reject the H0, you can conclude that the complex model is significantly more accurate than the nested model, and you would choose to use the complex model. If you fail to reject the H0, you can conclude that the complex model is NOT significantly more accurate than the nested model, so you would choose to use the nested model instead.

The test statistic for the LRT follows a chi-squared distribution with degrees of freedom equal to the difference in dimensionality of your models. The equation for the test statistic is provided below:

-2 * [loglikelihood(nested)-loglikelihood(complex)]

——————————————————————————————————

Using R for Likelihood Ratio Tests

Before you begin: Download the package “lmtest” and call on that library in order to access the lrtest() function later.

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

We begin by reading in our dataset. For this example, we are reading in data regarding student performance based on a variety of factors.

#grades <- read.csv(file="/Users/tommyanderson/Documents/School/Fall 2018/STAT 314/Presentation/grades.csv", header=TRUE, sep=",")
grades <- read.csv("http://cknudson.com/data/grades.csv", header=TRUE, sep=",")

We may want to get a look at the pairwise correlation in our dataset. Use the code below to get this view and see how different variables may be correlated with one other.

plot(grades)

As you can see, different variables are correlated in a variety of ways with other variables. Suppose that we would like to develop a model for final grades. Which variables would help us create the best model? Should we use all of the variables or just some of them? We will use the LRT to find out!

Let’s create our models. First, it is a good idea to get the names of the variables in our dataset. It is also a good idea to become familar with the our dataset. Using the View() function opens a window where we can view our dataset in tabular form. [The View() function is commented out here for easier formatting.]

names(grades)
## [1] "age"        "traveltime" "studytime"  "absences"   "per1grade" 
## [6] "per2grade"  "finalgrade"
#View(grades)

Next, we create our models using variables that we think may be good predictors of final grades.

#glm stands for "generalized linear model"
nested <- glm(finalgrade~absences,data=grades)
complex <- glm(finalgrade~absences+age,data=grades)

Finally, we compute our test statistic. In order to do this, we find the loglikelihoods of each model and plug them into the formula -2 * [loglikelihood(nested)-loglikelihood(complex)]. Our test statistic follows a chi-squared distribution with degrees of freedom equal to the difference in the number of free parameters between the complex model and the nested model. With this information, we may calculate the p-value, and if it is less than our significance level, we reject the null hypothesis.

[Notice the logLik() function gives us degrees of freedom, so we may calculate the degrees of freedom for our test statistic by hand and use it in the p-value calculation.]

(A <- logLik(nested))
## 'log Lik.' -1678.742 (df=3)
(B <- logLik(complex))
## 'log Lik.' -1675.847 (df=4)
#Since the logLik() function gives more information than the numeric value, use the as.numeric() function to isolate the numeric value
(teststat <- -2 * (as.numeric(A)-as.numeric(B)))
## [1] 5.791478
#df = 4 - 3 = 1
(p.val <- pchisq(teststat, df = 1, lower.tail = FALSE))
## [1] 0.01610404

A common significance level to use is .05. Under that significance level, we would reject the null hypothesis and conclude that we should use the more complex model. Let’s look at another example. Notice the degrees of freedom for our test statistic. This is hard-coded into the p-value calculation below, so don’t forget to change it from the previous example.

nested <- glm(finalgrade~per1grade+per2grade,data=grades)
complex <- glm(finalgrade~per1grade+per2grade+absences+age+traveltime,data=grades)

(A <- logLik(nested))
## 'log Lik.' -1070.655 (df=4)
(B <- logLik(complex))
## 'log Lik.' -1067.172 (df=7)
(teststat <- -2 * (as.numeric(A)-as.numeric(B)))
## [1] 6.966092
#df = 7 - 4 = 3
(p.val <- pchisq(teststat, df = 3, lower.tail = FALSE))
## [1] 0.07298641

With a .05 significance level, we would fail to reject the null hypothesis. This means we should use the nested model instead of the complex model. This makes sense when you look at our pairwise correlations above. The period 1 grades and period 2 grades appear to be clearly positively correlated with final grades, while other variables show weaker correlation.

Now let’s use the lrtest() function to perform the LRT. Remember, we can use this function because we downloaded the package “lmtest” and called the library “lmtest”. [We will begin by repeating the first example.]

nested <- glm(finalgrade~absences,data=grades)
complex <- glm(finalgrade~absences+age,data=grades)

lrtest(nested, complex)
## Likelihood ratio test
## 
## Model 1: finalgrade ~ absences
## Model 2: finalgrade ~ absences + age
##   #Df  LogLik Df  Chisq Pr(>Chisq)  
## 1   3 -1678.7                       
## 2   4 -1675.8  1 5.7915     0.0161 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

First, let’s comment on the structure of this output versus what we saw from our first two examples. Here, we see each model listed out, followed by a table that shows each model’s degrees of freedom and loglikelihood. To the right of these values, we are given the degrees of freedom of our test statistic, the value of our test statistic, and then our p-value. Notice all of these values match up with our results in the first example. In conclusion, either method will work. With that being said, if you take the time to download the lmtest package, the lrtest() function is easy, quick, and intuitive.

Now commenting on these specific results, we will reject the null hypothesis at the .05 significance level. However, if we were to set the significance level to be .01, we would fail to reject the null hypothesis. This tells us that the complex model is significantly more accurate under an alpha equal to .05, but it is relatively close to the threshold where we would use the nested model instead. Compare this situation to the situation below.

nested <- glm(finalgrade~studytime,data=grades)
complex <- glm(finalgrade~studytime+per1grade+per2grade,data=grades)

lrtest(nested,complex)
## Likelihood ratio test
## 
## Model 1: finalgrade ~ studytime
## Model 2: finalgrade ~ studytime + per1grade + per2grade
##   #Df  LogLik Df Chisq Pr(>Chisq)    
## 1   3 -1666.2                        
## 2   5 -1070.2  2  1192  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This null hypothesis would be rejected at nearly every significance level. Thus, we know that we should definitely use the complex model as it increases the accuracy of our model by a substantial amount. Now we may want to find out if we need all three of those variables, or if two will be sufficient.

nested <- glm(finalgrade~studytime+per1grade,data=grades)
complex <- glm(finalgrade~studytime+per1grade+per2grade,data=grades)

lrtest(nested,complex)
## Likelihood ratio test
## 
## Model 1: finalgrade ~ studytime + per1grade
## Model 2: finalgrade ~ studytime + per1grade + per2grade
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   4 -1308.3                         
## 2   5 -1070.2  1 476.13  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we would reject the null hypothesis, so we choose to use the complex model (i.e., we should use all three variables in our model). Thus, an advantage of the LRT is that you can test a complex model with many more parameters than the nested model, and if you reject the null hypothesis, you can use further analysis to determine exactly which variables would create the best model.