Hypotheisis testing: binomial test

Author

Shota Momma

Overview

In this section, we will discuss hypothesis testing under the frequentist tradition of statistics. We extensively discuss statistics of flipping a coin many times because it’s simple and easy to understand (I love simplicity!). Although flipping a coin is much simpler than what we have to deal with in real life, many of the basic concepts we discuss here generalize to more complex cases.

Is a coin fair?

Suppose you flipped a coin 6 times and saw the head 6 times. Would you be surprised? Do you think the coin was biased? The world is an uncertain place, and the head coming up 6 out of 6 times can happen by sheer chance. How do we decide if the coin is biased?

Formally, this question can be translated into the problem of evaluating two competing hypotheses expressed in terms of probabilities of ‘success’ (head) and ‘failure’ (tail):

  • H\(_0\) (null hypothesis): p(head) = 0.5
  • H\(_1\) (alternative hypothesis): p(head) ≠ 0.5

To conclude that the coin is biased is to reject H\(_0\). The question is: when are we justified to reject H\(_0\)?

Well, you probably have heard something like the following: “When you have a p-value of less than 0.05, you reject the null hypothesis!” In R, it’s easy to compute a p-value for a result of repeated Bernoulli trials (trials with exactly two possible outcomes whose probabilities do not change over time) using the function that implements the binomial test. In fact, you only need one line of code.

binom.test(x = 6, n = 6, p = 0.5)

    Exact binomial test

data:  6 and 6
number of successes = 6, number of trials = 6, p-value = 0.03125
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.5407419 1.0000000
sample estimates:
probability of success 
                     1 

Yay - R spit out a p-value right there, and it’s 0.03125! It’s smaller than 0.05, so we failed to reject the null hypothesis, and we are done!

The goal here is to understand what is going on under the hood. Along the way, you will learn various important concepts in hypothesis testing and elementary probability theory.

Excercise: Bernoulli’s trial, named after the Swiss mathematician Jacob Bernoulli, is defined as a random experiment with exactly two possible outcomes where the probability of ‘success’ and ‘failure’ does not change over time. Flipping a coin is a stereotypical example of the Bernoulli trials, but the idea is much more general. Can you think of something that can be modeled as Bernoulli trials in your area of study?

Bernouli trials and binomial distribution

It’s actually quite easy to compute the p-value of a repeated coin-flip experiment by enumerating all possible outcomes when the number of trials is small. Let’s try this brute force method first to understand what’s going on.

Suppose we flip a coin 4 times. There are \(2^4\) = 16 possible outcomes. Let’s enumerate them all.

  • 0 head (1 way): TTTT
  • 1 head (4 ways): HTTT, THTT, TTHT, TTTH
  • 2 heads (6 ways): HHTT, HTHT, HTTH, THHT, THTH, TTHH
  • 3 heads (4 ways): HHHT, HHTH, HTHH, THHH
  • 4 heads (1 way): HHHH

Importantly, extreme values are less likely. There are more ways to get 2 heads than 1 or 3 heads, and there are more ways to get 1 or 3 heads than 0 or 4 heads. Converting the frequency count into probability, we get the following:

  • 0 heads: 1/16 = 0.0625
  • 1 heads: 4/16 = 0.25
  • 2 heads: 6/16 = 0.375
  • 3 heads: 4/16 = 0.25
  • 4 heads: 1/16 = 0.0625

Graphically:

n = 4
p = 0.5
x= 0:n

plot(x, dbinom(x, n, p),ylab = 'p',type='h')

This gives us the theoretical distribution of the number of successes assuming a fair coin.

The number of successes in repeated Bernoulli trials, which is our test statistic (a value over which we run a statistical test), follows binomial distribution. Binomial distribution has two parameters: the probability of success (\(\theta\)) and the number of repeated Bernoulli trials (\(n\)). Thus, if we flip a coin 4 times and if the coin is fair (i.e., p(success) = 0.5), the test statistic should follow a binomial distribution with \(\theta\) = 0.5 and \(N\) = 4. Often, you will see a mathematical notation like the following:

\[ X \sim Binomial(\theta, N) \]

This expression says that \(X\) is a random variable (a variable whose value is randomly drawn from a distribution) that follows the binomial distribution with \(\theta\) (probability of success) and \(N\) (number of trials). It’s likely useful to be familiar with this notation when you self-study statistics and probability theory.

Excercise: Create a graph like above for the following distribution: Binomial(0.65, 120).

P-values

Actually, we already have everything we need to compute the p-value associated with the result. P-values express the probability of getting the results…

  • at least as extreme as the observed result
  • assuming that the null hypothesis is true

0 head and 4 heads are equally extreme results. So, to compute the p-value, we simply need to sum up the probability of 0 head and 4 heads: 0625 + 0.0625 = 0.125. This is the same as the p-value that R spits out.

binom.test(4,4,p=0.5)

    Exact binomial test

data:  4 and 4
number of successes = 4, number of trials = 4, p-value = 0.125
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.3976354 1.0000000
sample estimates:
probability of success 
                     1 

Note: P-value is constantly misunderstood (to the extent that it makes me nervous when I have to explain it). A common misunderstanding is that p-values express the probability that the null hypothesis is true given the observed result, that is, p( null | result). This is wrong. P-values express the probability of getting the results at least as extreme as the observed result given the null hypothesis, that is, p( results at least as extreme as the observed result | null ). Another common misunderstanding is that small p-values suggest a large effect. This is also incorrect. If there is a real effect, even if it’s minuscule, p-values become infinitesimally smaller as you accumulate more observations. So you can get an extremely small p-value for a practically negligible effect.

This approach to computing p-values is transparent but enumerating and counting all possible outcomes become quickly infeasible as the number of trials increases. If you flip a coin 10 times, you already have \(2^{10}\) = 1026 possible outcomes. Fortunately, we can use Probability Mass Function (PMF) for the binomial distribution, which allows you to compute the probability of any \(x\) (number of successes) with any \(n\) (number of trials) and any \(p\) (the probability of success). Here is PMF for binomial distribution.

\[P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}\]

\(\binom{n}{x}\) is called binomial coefficient and is sometimes pronounced ‘n choose x’. and can be computed by the following formula.

\[ \binom{n}{x} = \frac{n!}{x! (n - x)}\]

For example, suppose that you flipped a coin 10 times and the head came up 8 times. Assuming a fair coin, you can compute the probability of getting 8 heads by simply plugging in values to PMF.

\[P(X = 8) = \frac{10!}{8! (2)} (0.5)^8 (0.5)^{2} = 0.04394531\]

Note: when the sample space, a set of possible outcomes, of the test statistic is continuous, a function that returns the probability of an outcome is called Probability Density Function (PDF) instead. Here, the sample space can only be integers, so it is discrete; hence PMF, not PDF.

R also has a built-in function for computing a probability density of any number of successes with any number of trials and any probability of success, namely, dbinom(x, n, p). The name of the function is because it computes probability density (actually, mass for when the sample space is discrete) assuming a binomial distribution (as you might have guessed, this naming convention generalizes to other types of statistical distribution, e.g., dnorm for computing a density of a value x assuming a normal distribution). If we want to compute the probability mass at \(x\) = 8 given \(n\) = 10 and \(p\) = 0.5, you can do:

dbinom(8, 10, 0.5)
[1] 0.04394531

Using the PMF, we can compute the probability of any \(x\) given any \(n\) and \(p\). This means we can compute p-values. Again, remember that the p-value expresses the probability of the results at least as extreme as the observed results assuming the null hypothesis is true.

If we observe 8 heads out of 10 coin flips, 0 heads, 1 heads, 2 heads, 8 heads, 9 heads, and 10 heads are all results that are at least as extreme as the observed results. So summing up the probabilities gives you:

dbinom(0, 10, 0.5) + dbinom(1, 10, 0.5) + dbinom(2, 10, 0.5) + dbinom(8, 10, 0.5) + dbinom(9, 10, 0.5) + dbinom(10, 10, 0.5)
[1] 0.109375

Equivalently:

sum(dbinom(0:2, 10, 0.5)) + sum(dbinom(8:10, 10, 0.5))
[1] 0.109375

Equivalently, you can also use the pbinom function twice, which allows you to compute cumulative mass (the sum of probability mass over a range of x values) at both tails. If you declare ‘lower.tail = FALSE,’ the function computes the cumulative mass to the right of a given value x.

pbinom(2, 10, 0.5) + pbinom(7,10,0.5,lower.tail=FALSE)
[1] 0.109375

Note that for the second call of the pbinom function, I specified the x to be 7. This is because when lower.tail is set to be FALSE, the function only sum over the value greater than (not greater than or equal to) x.

Now, let’s confirm the result using the pre-built function in R:

binom.test(x = 8, n = 10, p = 0.5)

    Exact binomial test

data:  8 and 10
number of successes = 8, number of trials = 10, p-value = 0.1094
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.4439045 0.9747893
sample estimates:
probability of success 
                   0.8 

Excercise: Compute the p-value of 16 heads out of 42 coin flips. Do this in two ways: (1) using the pre-build function binom.test and using the dbinom function and summation. Compare the results.

Decision making and inferential errors

We have a p-value, now what? We need some threshold for rejecting H\(_0\), to make a decision. The threshold for rejecting H\(_0\) is defined by the \(\alpha\) value. To explain what \(\alpha\) value is, I first need to talk about two types of inferential errors: Type 1 and Type 2 errors. If we are concerned with whether to reject H\(_0\), there are four possible situations:

  1. H\(_0\) is actually false and we corretly rejected H\(_0\) (Hit),
  2. H\(_0\) is actually true and we failed to reject H\(_0\) (Correct rejection).
  3. H\(_0\) is actually true but we erroneously rejected H\(_0\) (False positive, aka. Type 1 error).
  4. H\(_0\) is actually false but we failed to reject H\(_0\) (False negative, aka. Type 2 error)

\(\alpha\) is a Type 1 error rate that you are willing to tolerate. In psychology and linguistics (and many other fields), \(\alpha\) is conventionally set to be 0.05. That is, by convention, we usually tolerate a 5% rate of Type 1 error. 0.05 is completely arbitrary. There are no deep reasons, mathematical or otherwise, behind this convention.

\(\beta\) expresses the probability of Type 2 error. You might have heard of the term power, the probability of correctly rejecting H\(_0\) when H\(_0\) is false. This is just 1 - \(\beta\). Usually, people are less willing to tolerate Type 1 errors (false positive) than Type 2 errors (false negative). We will discuss how to compute power given a specific hypothesis below, but before that, I need to discuss the distinction between one-tailed and two-tailed tests.

One-sided vs. two-sided tests

So far, we used the following null and alternative hypotheses:

  • H\(_0\) (null hypothesis): p(head) = 0.5
  • H\(_1\) (alternative hypothesis): p(head) ≠ 0.5

But in some cases, we might want to test if a coin is biased in a specific direction, e.g., we may suspect that a coin is biased for the head. In such cases, we can define H\(_1\) and H\(_0\) differently, and the appropriate statistical test differs a bit. If we want to test if a coin is biased for heads, we define H\(_0\) and H\(_1\) as follows:

  • H\(_0\) (null hypothesis): p(head) <= 0.5
  • H\(_1\) (alternative hypothesis): p(head) > 0.5

When we were testing a non-directional hypothesis, we defined at least as extreme in both directions, that is, both small and large values of X. So when we had 8 heads out of 10 coin flips, we summed up the probability of 0, 1, 2, 8, 9 and 10 heads. This gives us the p-value from a two-sided (or two-tailed) test, which is the default in R’s binom.test function. But if we define the alternative hypothesis as “a coin is biased for heads” or “a coin is biased for tails”, the at least as extreme is defined in one direction. For instance, if the alternative is that the coin is biased for heads, and if we got 8 heads out of 10 coin flips, we sum up the probabilities of 8, 9, and 10 heads. This gives the p-value from a one-sided test.

Let’s compute the p-value using the one-sided test. We can compute the p-value given 8 heads out of 10 coin flips by simply summing up the probability mass of 8 heads, 9 heads, and 10 heads, each of which can be computed using the dbinom function.

dbinom(8, 10, 0.5) + dbinom(9, 10, 0.5) + dbinom(10, 10, 0.5)
[1] 0.0546875

This returns the same value as the output of the exact binomial test function in R when we specify the directionality of the alternative hypothesis.

binom.test(8,10,0.5,alternative = "greater")

    Exact binomial test

data:  8 and 10
number of successes = 8, number of trials = 10, p-value = 0.05469
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
 0.4930987 1.0000000
sample estimates:
probability of success 
                   0.8 

As you might have noticed, the p-value we got (0.0546875) is exactly half of the p-value we got from the two-sided test (0.109375). This is of course not a coincidence. Because binomial distribution is symmetric, p-values are exactly twice as large when using two-sided tests compared to when using one-sided tests. Note that when the relevant distribution is asymmetric, there is no such thing as two-sided tests.

Note: You will always get a smaller p-value if you use a one-sided test, which means that it’s easier to argue for a positive result with one-sided tests. However, it is usually NOT okay to use one-sided tests just because. You have to provide strong a priori reason to expect a directional effect AND make a case that the alternative hypothesis in the other direction is completely uninteresting theoretically.

Power

As we just discussed, power is defined as 1 - \(\beta\), that is, the probability of correctly rejecting a false null hypothesis. When you design an experiment, it’s useful to have a guesstimate of how likely you can reject a null hypothesis if it’s indeed false.

To calculate power, you need to specify three things:

  1. \(\alpha\).
  2. a guess of the size of the effect, e.g., p(head) = 0.6
  3. sample size. e.g.,, n = 100

First, assuming a fair coin and assuming that we use the two-tailed test, we want to find the value of x where the cumulative probability mass is just below 0.025 (half the \(\alpha\) level for two-tailed tests). You can find this out by using the qbinom() function, which gives you the value of x in a specific quartile.

qbinom(.025, 100, .5)
[1] 40

Let’s see if the cumulative probability of X \(\le\) is 40 using the pbinom function, which allows you to compute the cumulative probability of X \(\le\) x.

pbinom(40, 100, .5)
[1] 0.02844397

This tells you that the cumulative probability of X \(\le\) 40 is 0.028. This is not quite what we want, because we want to find out the value x where the cumulative probability of X \(\ge\) x is just below 0.025, and 0.028 is not below 0.25. This is simply due to how quartile is computed in R. x must be an integer, and qbinom() function simply gives you the value of x where the cumulative probability of X \(\le\) x is closest to 0.025. What we want instead is the value of x where the cumulative probability of X \(\le\) x is lower than AND closest to 0.025, which is 39:

pbinom(39, 100, .5)
[1] 0.0176001

This means that 39 is the value of x that gives you the cumulative probability of X \(\le\) x just below 0.25 (0.0176001 to be precise). If we subtract 39 from 100 (n), we get 61. Because binomial distribution is symmetric, the cumulative probability of X \(\ge\) 61 is also 0.0176001. The critical region given \(\alpha\) = 0.05 and binomial(100,0.5) is the region outside this range [39, 61] (including these values). Graphically, the regions outside the dotted vertical lines are critical regions (in the context of two-tailed tests):

n = 100
p1 = 0.5
x = 0:n

plot(x, dbinom(x,n,0.5),type = 'h',ylab = 'p', xlim = c(0,n), ylim = c(0,0.1))
abline(v = 39, col="red", lwd=3, lty=2)
abline(v = 61, col="red", lwd=3, lty=2)
text(25, 0.08, "critical region <- ", col="red", cex = 1)
text(75, 0.08, "-> critical region ", col="red", cex = 1)

Power is the probability of a randomly drawn value from the distribution under the alternative hypothesis falling in the critical region of the distribution under the null hypothesis. Thus, we can compute power by simply summing the probabilities mass given \(X\) = 0, 1, 2, …,39 and \(X\) = 61, 62, 63, …100, assuming the distribution under the alternative hypothesis: binomial(100, 0.6). In R:

sum(dbinom(0:39, 100, .6)) + sum(dbinom(61:100, 100, .6))
[1] 0.4620934

Equivalently, we can use pbinom() function to compute the cumulative probability of \(X\) <= 39 and \(X\) >= 61. Note that in R, if ‘lower.tail = true,’ it computes the cumulative probability mass to the right of a certain value, but NOT including that value Thus, we have to set the first argument the pbinom() function to be 60, not 61.

pbinom(39, 100, 0.6,lower.tail =TRUE) + pbinom(60,100,0.6,lower.tail =FALSE)
[1] 0.4620934

This means that the probability of correctly rejecting the null hypothesis is 0.4620834 (and the probability of failing to rejecting the false null hypothesis is 1 minus that). That is, we have about 46% chance of correctly rejecting the false null hypothesis (that p(head) = 0.5) if p(head) = 0.6.

There are two ways to increase power. First, we can increase the number of trials. For instance, let’s increase the number of trials to 200, keeping everything else constant. First, let’s compute the 25% quartile of binomial(200, 0.5).

qbinom(.025, 200, .5)
[1] 86

And let’s check if the cumulative probability of X \(\le\) 86 is below 0.025.

pbinom(86, 200, .5)
[1] 0.02798287

Again, this isn’t quite what we want, as the cumulative probability of X \(\le\) 86 isn’t below 0.025. Instead, we want 85:

pbinom(85, 200, .5)
[1] 0.0200186

We can subtract 85 from 200 and get 115. The critical region of the distribution under the null hypothesis is the region outside the range [85, 115] (including these values).

Now, let’s compute the summed probability of \(X\) being in the critical region of the null distribution, assuming the distribution under the alternative hypothesis: binom(200, 0.6).

sum(dbinom(0:85, 200, .6)) + sum(dbinom(115:200, 200, .6))
[1] 0.7868483

This gives us the power of 0.7868483. That is, we have about a 79% chance of correctly rejecting the false null hypothesis if we flip the coin 200 times and if \(\theta = 0.6\). Generally, increasing the number of trials increases power.

The second way is to increase the effect size. Of course, we often have little control over this. But for illustration, let’s compute power with 100 trials, assuming that p(head) = 0.7. We’ve already computed the critical region of binomial(100, 0.5) with \(\alpha = 0.5\), which is the region outside the range [39, 61]. So we just need to compute the sum probability of X being in the critical region of the distribution under the null hypothesis assuming binomial(100, 0.7)

sum(dbinom(0:39, 100, .70)) + sum(dbinom(61:100, 100, .70))
[1] 0.9790114

This means that we have about a 98% chance of rejecting the null hypothesis if it is false.

We can also compute power based on simulation. Simulation-based power calculation is a method that can be generalized to other types of statistical tests, and it’s going to be useful. So let’s try doing it as an exercise here. The idea is simple: we run a whole bunch of ‘simulated experiments,’ and compute the proportion of results that turned out to be significant.

First, let’s simulate 100 coin flips 10000 times, assuming that p(head) = 0.6. Here, the function rbinom(), which draws a random sample from a binomial distribution, comes in handy.

set.seed(1)
obs = 100000
n = 100
p = 0.6

vec <- rbinom(obs, n, p)

This gives you the results of 10000 simulated experiments, each corresponding to 100 coin flips. We can now submit the result of each simulated experiment to the binomial test, and compute the proportion of experiments that yielded significant results. One way to do this in R is:

pvals <- c()
for (i in 1:length(vec)){
    pval <- binom.test(vec[i], 100)$p.value
    pvals[i] <- pval   
}
length(pvals[pvals < 0.05])/length(vec)
[1] 0.46402

This shows that in 46215 out of 100000 simulated experiments, we correctly rejected the false null hypothesis. This agrees pretty well with the power we computed above: 0.46.

Exercise: You hypothesized that p(head) = 0.32. Using simulations, compute the power of an experiment where \(N\) is 13. Assume \(\alpha\) = 0.01 and H\({_0}\): p(head) = 0.5.

Multiple comparisons problem

Assuming that \(\alpha\) is 0.05, if you run an experiment 20 times, you have about 64% chance of getting a result whose p-value is less than 0.05 even if there is nothing interesting going on. The probability of getting no significant result is (1-0.05)^20 = 0.3584859, so the probability of getting at least one significant result is 1 - 0.3584859 = 0.6415141.

Let’s simulate 10 flips of a truly fair coin 20 times and run the binomial test 20 times.

set.seed(1)

n = 10
vec <- rbinom(20,n,0.5)
count = 0

ps <- c()
for (v in vec){
    p = binom.test(v,n,p=0.5)$p.val
    count = count + 1
    ps[count] = p
}
ps[ps < 0.05]
[1] 0.02148438

As you can see, we got a p-value of less than 0.05 once. This is not surprising, because there is about a 64% chance of getting a false positive. Suppose a scientist ran the same experiment 20 times and selectively reported the result that produced a p-value of less than 0.05. You see how detrimental this is to science.

The multiple comparisons problem is unfortunately extremely common in our and neighboring fields. Let me introduce an extreme version of this problem for the purpose of illustration. In an fMRI experiment, significance testing is often run for each \(voxel\) (a three-dimensional counterpart of a pixel) to see if an area of the brain ‘lights up’ (‘lights up’ means a manipulation had a ‘significant’ effect). This means that a whole bunch of significance tests will be run, and it is practically guaranteed that some voxels ‘light up.’ In fact, if you scan a dead salmon using fMRI, you will indeed find that a dead salmon ‘lights’ up..

In linguistics and psycholinguistics, we often use methods like self-paced reading and eye-tracking while reading. These methods also suffer from the multiple comparisons problem, because these methods will produce multiple measurements over which you can perform statistical tests. For instance, in a self-paced reading study, you get a reading time at the critical region, critical region + 1, critical region +2, and so forth. In an eye-tracking, while reading study, you get measures like ‘first fixation,’ ‘first path reading time,’ ‘regression rate,’ ‘total reading time,’ etc. If you run a significance test for each of these measurements, the probability of getting a false positive result is greater than 0.05, i.e., the Type 1 error rate is inflated. See, for example, this paper for a discussion on the multiple comparison problems in eye-tracking while reading.

There are ways to counteract the multiple comparisons problem. A most widely known solution is to simply have a more stringent decision threshold (i.e., set the \(\alpha\) threshold to be smaller) as you run more statistical tests. The most widely known version of this is perhaps the Bonferroni correction, named after an Italian mathematician Carlo Emilio Bonferroni. To apply Bonferroni correction, you simply divide the alpha value by the number of statistical tests you run. In the coin flip example, since we run the binomial test 20 times, \(\alpha\) is set to be 0.05/20 = 0.0025.

Excercise: patient XYZ

Suppose that you want to test if an individual patient with aphasia, XYZ, is capable of showing a better-than-chance performance in comprehending object relative clauses, e.g., the boy that the girl kissed wears a blue shirt. You administered a sentence-picture matching task to XYZ. In this task, XYZ was asked to pick a picture that correctly depict a sentence. In each trial, XYZ was presented with two pictures to choose from, the one that matches with the sentence (e.g., a picture where a boy kissing a girl is wearing a blue shirt) and the other not (e.g., the picture where a girl kissing a boy is wearing a blue shirt).

Suppose that XYZ went through 80 trials like this, and you found that they picked the correct picture 55 times.

  1. Using R, (a) identify the critical region of the binomial distribution under the null hypothesis and (b) compute the p-value of the observed result using a two-tailed test. Do not use the binom.test function.

  2. In the current scenario, would you believe that using a one-tailed test is justified? Why or why not?

  3. Suppose that, before you run this experiment, you estimated, based on some quantitative theory that predicts the performance of individual patients with aphasia, that XYZ’s rate of success is 0.55 (only slightly above chance). How many trials are necessary to achieve 80% power if you use a two-tailed test? Assume that \(\alpha\) is set to be 0.05. Compute power in two ways: (i) using the qbinom, dbinom, and/or pbinom functions and (ii) by simulation.

  4. Suppose that you believe in a natural category of aphasia: aphasia X. XYZ suffers from this type of aphasia. You found 10 other patients suffering from it, and want to know if this population (patients with aphasia X) has better-than-chance performance on the picture-matching task. Each patient did 80 trials of the picture-matching task described above, and this resulted in 800 observations in total. You counted the total number of ‘success’ trials and found out that you got 600 success trials. What is wrong with running a binomial test like the following?

binom.test(600, 800, 0.5, 'two.sided')

    Exact binomial test

data:  600 and 800
number of successes = 600, number of trials = 800, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.7184813 0.7796637
sample estimates:
probability of success 
                  0.75