G tests for categorical variables

Introduction
G test of goodness of fit
G test of independence
Conclusion

knitr::opts_chunk$set(include = TRUE)

library(DescTools)

Introduction

G tests are useful hypothesis tests for categorical variables. They are alternatives to the \(\chi^2\) test of goodness of fit and the \(\chi^2\) test of independence.

As such, they compare the observed count of each level of one or more categorical variables against the count expected under the null hypothesis.

G test of goodness of fit

This hypothesis test looks at the observed counts for a single categorical variable. Unlike the exact test of goodness of fit, it calculates a test statistic and then the probability of finding a statistic at least as extreme. Whereas the exact test is a binomial test, the G test of goodness of fit is a log-likelihood ratio test.

The null hypothesis of this test states that the observed counts are equal to the predicted counts. The G statistic for this test is given in equation (1) under the G test of independence.

The DescTools package provides the GTest() function, which takes a vector of observed counts (x) and a vector of expected probabilities (p).

In the example below, the variable has two levels. We observe \(175\) cases of one level and \(190\) of the other, and we expected a \(50\):\(50\) split.

obs <- c(175, 190)
GTest(x = obs,
      p = c(0.5, 0.5),
      correct = "none")

## 
##  Log likelihood ratio (G-test) goodness of fit test
## 
## data:  obs
## G = 0.61661, X-squared df = 1, p-value = 0.4323

The p value is well above the conventional \(0.05\) threshold, so the observed counts do not differ significantly from the expected \(50\):\(50\) split.
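To make the connection to equation (1) concrete, the statistic reported by GTest() can be reproduced by hand. The following is a minimal sketch in base R, assuming no correction is applied and that the p value comes from the upper tail of a \(\chi^2\) distribution with one fewer degree of freedom than there are categories. The names p_null, expected, g_stat, deg_free and p_value are chosen here for illustration only.

```r
# Observed counts from the example above and the 50:50 null proportions
obs <- c(175, 190)
p_null <- c(0.5, 0.5)

# Expected counts under the null hypothesis
expected <- sum(obs) * p_null

# G statistic from equation (1): twice the sum of observed * ln(observed / expected)
g_stat <- 2 * sum(obs * log(obs / expected))

# Degrees of freedom: number of categories minus one
deg_free <- length(obs) - 1

# Upper-tail probability from the chi-squared distribution
p_value <- pchisq(g_stat, df = deg_free, lower.tail = FALSE)

round(c(G = g_stat, p = p_value), 5)
```

The rounded values should agree with the GTest() output above. Note that this sketch does not handle observed counts of zero, for which the logarithm is undefined.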

G test of independence

The G test is used as an alternative to the \(\chi^2\) test of independence. The latter is in fact an approximation of the log-likelihood ratio on which the G test is based. The equation for the G statistic is given in (1).

\[G=2 \sum_{i=1}^n \text{observed}_i \cdot \ln {\left( \frac{\text{observed}_i}{\text{expected}_i} \right)} \tag{1}\]

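Here \(n\) is the number of cells in the contingency table (or the number of categories in the goodness of fit test) and \(i\) indexes those cells, so that \(\text{observed}_i\) and \(\text{expected}_i\) are the observed and expected counts in cell \(i\). The \(\ln\) refers to the natural logarithm.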
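Equation (1) needs an expected count for every cell. In the goodness of fit test these come from the stated probabilities, but in the test of independence they are usually derived from the table margins. As a short worked note, with \(R_j\), \(C_k\), \(N\), \(r\) and \(c\) being symbols introduced here for illustration (row total, column total, grand total, number of rows and number of columns), the standard expressions are:

\[\text{expected}_{jk} = \frac{R_j \, C_k}{N}, \qquad \text{df} = (r - 1)(c - 1) \tag{2}\]

For the \(2 \times 3\) table analysed in this section, the first cell has a row total of \(102\), a column total of \(44\) and a grand total of \(171\), giving an expected count of \(102 \times 44 / 171 \approx 26.25\) and \((2 - 1)(3 - 1) = 2\) degrees of freedom.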

The GTest() function is available in the DescTools library and is the same function as we used before. It takes an observed (contingency) table as input when we want to use it as a test of independence.

In the example below, we use exactly the same simulated research project as in the tutorial on tests for categorical variables (where we used the \(\chi^2\) test of independence and found a p value of \(0.01124\)).

obs <- rbind(c(33, 44, 25),
             c(11, 28, 30))
DescTools::GTest(obs,
                 correct = "none")

## 
##  Log likelihood ratio (G-test) test of independence without
##  correction
## 
## data:  obs
## G = 9.1435, X-squared df = 2, p-value = 0.01034

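We find a small p value (below the conventional \(0.05\) threshold), reject the null hypothesis of independence, and conclude that the two variables are dependent. The result is close to the p value of \(0.01124\) from the \(\chi^2\) test on the same data.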
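The same result can be checked by hand, which also shows how close the G statistic is to Pearson's \(\chi^2\) statistic on these data. This is a minimal sketch in base R, assuming expected counts are derived from the row and column totals and that no correction is applied. The variable names are illustrative, and chisq.test() is the base R function for the \(\chi^2\) test.

```r
# Contingency table from the example above
obs <- rbind(c(33, 44, 25),
             c(11, 28, 30))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# G statistic from equation (1), summed over all cells of the table
g_stat <- 2 * sum(obs * log(obs / expected))

# Degrees of freedom: (rows - 1) * (columns - 1)
deg_free <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Upper-tail probability from the chi-squared distribution
g_p <- pchisq(g_stat, df = deg_free, lower.tail = FALSE)

# Pearson chi-squared test on the same table, for comparison
chi <- chisq.test(obs, correct = FALSE)

round(c(G = g_stat, G_p = g_p,
        X2 = unname(chi$statistic), X2_p = chi$p.value), 5)
```

The two statistics and their p values should be similar but not identical, since the \(\chi^2\) statistic only approximates the log-likelihood ratio.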

Conclusion

These two tests are very similar to their \(\chi^2\) test counterparts. The DescTools package makes them easy to implement in R.

Dr Juan H Klopper
