Hypothesis testing: chi-square tests

Author

Shota Momma

In the previous section, we exclusively dealt with Bernouli trials, where the only possible outcomes are ‘success’ or ‘failure.’ But most of the concepts we learned can be generalized to other types of experiments. In this section, we will learn about a couple of different statistical tests based \(\chi^2\) (‘chi-square’) distribution. The primary goal here is to understand how the concepts we learned in the last section generalize. Along the way, we will learn about some new important concepts that will be useful for understanding more complex statistics used in the contemporary linguistics and psycholinguistics.

Is a die fair?: \(\chi^2\) goodness-of-fit test

Instead of evaluating whether a coin is fair, suppose you want to evaluate whether a die is fair. Assuming we are not playing Dungeon and Dragons, throwing a die usually has 6 possible outcomes. If the die is fair, they are all equi-probable. Like in the case of testing whether a coin is fair, we can define two competing hypotehses: H\(_0\) and H\(_1\).

H\(_0\): P = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
H\(_1\): P ≠ (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)

Suppose you threw a die 120 times. What is the expected frequency of each outcome? Since every face of a die is equally likely by assumption, each face is expected 20 times. Now, suppose you got the following result:

1	2	3	4	5	6
22	24	13	36	13	12

When we were testing if a coin is fair, our test statistic was simply the number of successes/heads. But this time, our test statsitic is a value that indicates how devient the observed result is from the expected result, known as \(\chi^2\) (‘chi-square’) statistic. To compute \(\chi^2\), you (1) subtract the expected frequency of each outcome from the observed frequency of each outcome, (2) square each value, (3) divide the squared value by the expected frequency of each outcome, and finally (4) sum the resulting values up. In a mathmatical formula:

\[ \chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} \]

Where \(O_i\) is the observed frequency of \(i\)th outcome, and \(E_i\) is the expected frequency of \(i\)th outcome, and \(k\) is the number of possible outcomes. Given the observed result above:

n = 120
num_faces <- 6

observed <- c(22,24,13,36,13,12)
expected <- rep(n*1/num_faces, num_faces)

chi_square <- 0
for (i in 1:length(expected)){
    val <- ((observed[i]-expected[i])^2)/expected[i]
    chi_square <- chi_square + val
}
chi_square

[1] 21.9

So… what does this mean? Well, just like the number of successes in repeated Bernoulli trials follows a theoretical distribution, chi-square value also follows a theoretical distribution, known as \(\chi^2\) distribution. \(\chi^2\) distribution has one parameter, degrees of freedom. Degrees of freedom are expressed by a number of independently randomly varying values that can be assigned to a statistical distribution. Because there are 6 possible outcomes whose frequencies/probabilities are randomly and independently varying, you might be tempted to say the relevant \(\chi^2\) distribution has 6 degree of freedom. But remember that probabilities must sum to 1, meaning that only 5 values are independently varying (once you fix the probabilities of 5 outcomes, the probaiblity of the last outcome is automatically determined). Here is a plot showing the shape of chi-square distribution with three degrees of freedom: 3,4, and 5. Note that \(\chi^2\) value is continuous, unlike number of successes in binomial distribution.

curve(dchisq(x, df=5), col='black', from=0,to=20)

curve(dchisq(x, df=3), xlab = 'value', ylab = 'p', col = rgb(0, 0, 1, 0.8), from=0, to = 20)
curve(dchisq(x, df=4), xlab = 'value', ylab = 'p', col = rgb(1, 0, 0, 0.8), from=0, to = 20, add = TRUE)
curve(dchisq(x, df=5), xlab = 'value', ylab = 'p', col = rgb(0, 1, 0, 0.8), from=0, to = 20, add = TRUE)

text(5, 0.20, "df = 3", col=rgb(0, 0, 1, 1), cex = 1.5)
text(6, 0.17, "df = 4", col=rgb(1, 0, 0, 1), cex = 1.5)
text(7, 0.14, "df = 5", col=rgb(0, 1, 0, 1), cex = 1.5)

Given that we are testing a die, \(\chi^2\) with df = 5 is the distribution we compare the observed result against. Just like before, we first want to define the critical region. Given \(\alpha\) = 0.05, this is the right tail region with 5% of the total probability mass. Because \(\chi^2\) distribution is right skewed (not symmetric), we consider only the right tail.

Just like binomial distribution, \(\chi^2\) distribution has four associated R functions: dchisq; pchisq; qchisq; and rchisq (see the node below). Like before, we can idenitfy the critical value (the edge of the critical region) using the quartile function, qchisq.

qchisq(0.05,5,lower.tail = FALSE)

[1] 11.0705

Graphically:

curve(dchisq(x, df=5), col='black', from=0,to=20)
abline(v = 11.075, col="red", lwd=3, lty=2)
text(13.5, 0.08, " -> critical region", col="red", cex = 1)

Let’s verify that the probability mass of \(\chi^2\) value greater than 11.0705 sums to 0.05, using the cumulative density function.

pchisq(11.0705, 5, lower.tail = FALSE)

[1] 0.04999996

This means that a \(\chi^2\) value equal to or greater than 11.0705 would correspond to a p-value less than 0.05. The observed result had the \(\chi^2\) value of 21.9, which is well above the threshold. Let’s compute the exact p-value, using the cumulative density function.

pchisq(chi_square, 5, lower.tail = FALSE)

[1] 0.0005470186

This agrees with the output of the built-in \(\chi^2\) test in R:

chisq.test(observed)


    Chi-squared test for given probabilities

data:  observed
X-squared = 21.9, df = 5, p-value = 0.000547

In a statistical jargon, the test you just run is an example of Pearson’s \(\chi^2 goodness-of-fit\) test, so named because it is evaluating the goodness of fit between your data and a theoretical distribution.

\(\chi^2\) test of independence

The \(\chi^2\) goodness-of-fit test is used when comparing observed data on a categorical scale against a theoretical distribution. But suppose that you have two dice and you want to test if these two dice behave alike. In such cases, we use a slightly different test that still uses \(\chi^2\) distribution: \(\chi^2\) test of independence. (sometimes called \(\chi^2\) test of association.).

Like before we start with defining H\(_0\) and H\(_1\). We are testing if two dice behave alike. Informally:

H\(_0\): p(die A = 1) = p (die B = 1), p(die A = 2) = p(die B = 2), etc.
H\(_1\): p(die A = 1) ≠ p (die B = 1) OR p(die A = 2) = p(die B = 2) OR… etc.

This time, let’s start by randomly generate data to brush up your R skill. Assume we have two dice, A and B, and we throw die A 120 times and die B 160 times. Let’s rig each die in different ways so they have different biases. The probability of 6 is twice as large with A, and the probability of 1 is twice as large with B, so we know that these dice actually do not behave alike.

You can use ‘sample’ function to randomly generate integer between 1-6 with different biases. This is effectively simulating two dice with desired biases:

set.seed(1234)

n1 = 120
n2 = 160

d1 = sample(1:6, n1, replace = TRUE, prob = c(1/7,1/7,1/7,1/7,1/7,2/7))
d2 = sample(1:6, n2, replace = TRUE, prob = c(2/7,1/7,1/7,1/7,1/7,1/7))

Now, we can create a single dataframe that contain all observations.

lab <- c(rep('A',n1),rep('B',n2))
df <- data.frame(lab, c(d1,d2))
colnames(df) <- c('die', 'outcome')

We can now create a contingency table (I also transposed it for easier exposition):

observed <- t(table(df$die,df$outcome))
observed

Now, can we reject the null hypothesis given the data? To answer this question, we need to calculate expected frequencies in each cell under the null hypotheis, just like in the goodness-of-fit test. To compute expected frequencies, we first estimate expected probabilities of each face of a die. To do this, we first compute the total frequency of each face.

totalf <- observed[,1] + observed[,2]
data.frame(totalf)

We can then convert these frequencies into probabilities by dividing each frequency with the total number:

probs <- totalf/sum(totalf)
data.frame(probs)

      probs
1 0.1750000
2 0.1142857
3 0.1428571
4 0.1785714
5 0.1214286
6 0.2678571

Then, if we multiply each of these probabilities with the number of times die A and die B were thrown (120 and 140, respectively), we get:

expected_a <- probs*n1
expected_b <- probs*n2
expected <- t(rbind(expected_a, expected_b))
expected

  expected_a expected_b
1   21.00000   28.00000
2   13.71429   18.28571
3   17.14286   22.85714
4   21.42857   28.57143
5   14.57143   19.42857
6   32.14286   42.85714

Now we have both expected and observed frequecy for each cell in the contingency table. Now we can compute the \(\chi^2\) value much the same way as before, except we use the for loop twice because we have to sum over rows (faces) and columns (die A vs. die B):

chisquare2 <- 0
for (i in 1:6){
    for (j in 1:2){
        val <- (observed[i,j] - expected[i,j])^2/expected[i,j]
        chisquare2 <- chisquare2 + val
    }
}

Just to summarize what we just did in a mathmatical formula: \[ \chi^2 = \sum_{i=1}^r \sum_{j=1}^c \frac{(E_{ij} - O_{ij})^2}{E_{ij}} \]

where \(r\) is the number of rows and \(c\) is the number of columns in the contingency table. \(E_{ij}\) and \(O_{ij}\) are expected and observed frequencies \(i\)th row and \(j\)th column.

We got the \(\chi^2\) value of 19.00398. Now, we need to figure out the degrees of freedom to compare this against the theoretical \(\chi^2\) distribution with appropriate degrees of freedom. You can figure out degrees of freedom using the following formula: \[ df = (r - 1)(c - 1) \]

Where r = the number of rows (faces) and c = the number of columns (die A vs. die B). Using this, we get (6-1)(2-1) = 5 (if you are curious about how to derive this formula, see the degrees of freedom section below).

Now that we have the test statistic (\(\chi^2\) value) and degrees of freedom, we can compute the p-value. Just like before, we can use the pchisq function to compute the probability mass at and to the right of the obtained \(\chi^2\) value:

pchisq(chisquare2, df = 5, lower.tail = FALSE)

[1] 0.00191886

This matches with the p-balue based on the pre-built function in R:

chisq.test(observed)


    Pearson's Chi-squared test

data:  observed
X-squared = 19.004, df = 5, p-value = 0.001919

Degrees of freedom

The following is a bit involved, so you can skip it if you are satisfied with the formula for determining the degree of freedom for the \(\chi^2\) test of independence. But why \((r-1)(c-1)\)? Here’s an answer (taken from here).

First, see a table of \(O_{ij}\) - \(E_{ij}\) (a term in the equation for computing the \(\chi^2\) value) filling each cell:

observed - expected

   
              A           B
  1  -8.0000000   8.0000000
  2  -1.7142857   1.7142857
  3  -0.1428571   0.1428571
  4  -2.4285714   2.4285714
  5  -2.5714286   2.5714286
  6  14.8571429 -14.8571429

Notice that the sum of values in each row (and in each column) must be 0. This constraint means that the values of the first column can be expressed in terms of the values of the second column. Because (\(O_{i1}\) \(-\) \(E_{i1}\)) + (\(O_{i2}\) \(-\) \(E_{i2}\)) = 0, (\(O_{i1}\) \(-\) \(E_{i1}\)) = \(-\)(\(O_{i2}\) \(-\) \(E_{i2}\)). To be explicit:

\(-\)(\(O_{12}\) \(-\) \(E_{12}\)) = -8.00
\(-\)(\(O_{22}\) \(-\) \(E_{22}\)) = -1.71
\(-\)(\(O_{32}\) \(-\) \(E_{32}\)) = -0.14
\(-\)(\(O_{42}\) \(-\) \(E_{42}\)) = -2.42
\(-\)(\(O_{52}\) \(-\) \(E_{52}\)) = -2.57
\(-\)(\(O_{62}\) \(-\) \(E_{62}\)) = 14.85

Because these values must sum to 0, if 5 of these values are fixed, the last value will be automatically fixed, i.e., 5 degree of freedom.

Suppose you have three columns instead (e.g., when comparing three different dice: A, B and C), the constraint is that the (\(O - E\)) values in each row must sum to 0. This means that a column can be expressed in terms of the other two columns, i.e., \(O_{i1} = - O_{i2} - O_{i3}\). This means that for each row we have 2 values (the number of columns minus 1) to fix. There are 5 rows to fix, and therefore, there are 5 * 2 = 10 degrees of freedom if we were comparing three different dice using chi-square test. This agrees with the formula above: (6-1)(3-1) = 5*2 = 10.

Distribution of p-values under the null hypothesis

One thing you should know is that, when the test-statistic is continuous and if the null hypothesis is true, the distribution of p-value follows the uniform distribution, which means that it’s equally likely to get the p-value of 0.00000001 and the p-value of 0.59 (for reasons I don’t go into, p-values under the null hypothesis is only quasi-uniform when the test statistics is discrete).

Linguistic example: speech errors

Garrett reported the following data about the frequencies of sound exchange errors (‘feed the dog’ -> ‘deed the fog’) and independent word exchange errors (‘enough of both’ -> ‘both of enough’). Here, ‘distance’ means the number of words intervening the two exchanged units.

sound <- data.frame(
  distance = c(0,1,2,3),
  frequency = c(51,66,9,2)
)

word <- data.frame(
  distance = c(0,1,2,3),
  frequency = c(0,49,31,17)
)

sound

  distance frequency
1        0        51
2        1        66
3        2         9
4        3         2

word

  distance frequency
1        0         0
2        1        49
3        2        31
4        3        17

Test if two error types are different in terms of how likely an error occurs at different distances. Do not use the chisq.test() function.