Hypothesis testing

Alban Guillaumet, Troy University

“…a hypothesis test tells us whether the observed data are consistent with the null hypothesis, and a confidence interval tells us which hypotheses are consistent with the data.”

- William C. Blackwelder

Objectives

  • Intro to hypothesis testing

  • Null distribution & P-value

  • Binomial distribution

  • Errors in Hypothesis Testing

Hypothesis testing (An example)

  • In 1954, test of Salk's polio vaccine on elementary school students

  • About 400,000 students in total (treatment + control)

  • Of those who received the vaccine, 0.016% developed paralytic polio, whereas 0.057% of the control group developed the disease.

  • Did the vaccine work?

Hypothesis testing (An example)

  • Hypothesis testing uses probability to answer this question.

  • The null hypothesis is that the vaccine didn't work, and that the difference between groups arose purely by chance.

  • Evaluating the null hypothesis involves calculating the probability, under the assumption that the vaccine has no effect, of getting a difference between groups as large as or larger than the one observed.

  • In this case, this probability was very small, thus the null hypothesis was rejected.
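As a rough illustration of that calculation, the sketch below runs a two-sample test of proportions in R. The slides give only the combined sample size, so the equal split into 200,000 children per group is an assumption made here, and the case counts are simply back-calculated from the reported percentages. `prop.test()` is a chi-squared approximation and is not necessarily the analysis used in 1954.

```r
# ASSUMPTION: 200,000 children per group; the slides report only the total (~400,000)
n_group <- c(vaccine = 200000, control = 200000)

# Case counts implied by the reported rates (0.016% and 0.057%) under that assumption
cases <- round(n_group * c(0.016, 0.057) / 100)

# H0: the proportion developing paralytic polio is the same in both groups
test <- prop.test(x = cases, n = n_group)

print(cases)
print(test$p.value)
```

Under these assumed group sizes the P-value is far below any conventional significance level, in line with the conclusion stated above.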

Hypothesis testing (Definitions)

Definition: Hypothesis testing compares data to what we would expect to see if a specific null hypothesis were true. If the data are too unusual, compared to what we would expect to see if the null hypothesis were true, then the null hypothesis is rejected.

Hypothesis testing (Definitions)

Definition: A null hypothesis is a specific statement about a population parameter made for the purpose of argument. A good null hypothesis is a statement that would be interesting to reject.

  • For example, there is no effect, no preference, no correlation, or no difference.

Definition: The alternative hypothesis includes all other feasible values for the population parameter besides the value stated in the null hypothesis.

Hypothesis testing (Problem #25)

Can parents distinguish their own children by smell alone? To investigate, Porter and Moore (1981) gave new T-shirts to children of nine mothers. Each child wore his or her shirt to bed for three consecutive nights. During the day, from waking until bedtime, the shirts were kept in individually sealed plastic bags. No scented soaps or perfumes were used during the study. Each mother was then given the shirt of her child and that of another, randomly chosen child and asked to identify her own by smell.

Discuss: What is the null hypothesis? alternative hypothesis?

Answer: With \( p \) the probability of choosing correctly,
\[ H_{0}: \ p = 0.5 \] \[ H_{A}: \ p \neq 0.5 \ \text{(two-sided)} \]

Hypothesis testing (how it's done)

Definition: The test statistic is a number calculated from the data that is used to evaluate how compatible the data are with the result expected under the null hypothesis.

Definition: The null distribution is the sampling distribution (i.e., the probability distribution) of outcomes for the test statistic under the assumption that the null hypothesis is true.

Definition: The \( P \)-value is the probability of obtaining the data (or data showing as great or greater a difference from the null hypothesis) if the null hypothesis were true.

Hypothesis testing (how it's done)

  • State the hypotheses.
  • Compute the test statistic.
  • Determine the \( P \)-value.
  • Draw the appropriate conclusions.

Hypothesis testing (Problem #25)

Can parents distinguish their own children by smell alone? To investigate, Porter and Moore (1981) gave new T-shirts to children of nine mothers. Each child wore his or her shirt to bed for three consecutive nights. During the day, from waking until bedtime, the shirts were kept in individually sealed plastic bags. No scented soaps or perfumes were used during the study. Each mother was then given the shirt of her child and that of another, randomly chosen child and asked to identify her own by smell. Eight of nine mothers identified their children correctly.

Discuss: What test statistic should you use? And what is the expectation?

Answer: The number of mothers with correct identifications. Under \( H_{0} \), the expectation is \( 9 \times 0.5 = 4.5 \).

Hypothesis testing (Problem #25)

Figure: the null distribution for the number of mothers, out of nine, guessing correctly.

Discuss: If \( H_{0} \) were true, what is the probability of exactly eight correct identifications?

Answer: Pr[number correct = 8] = 0.018
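The null distribution can also be tabulated directly. This is a minimal sketch that assumes the nine guesses are independent with success probability 0.5 under \( H_{0} \), so that the number correct follows a binomial distribution; the variable names are chosen here for illustration.

```r
n_mothers <- 9    # number of mothers in the study
p0 <- 0.5         # probability of a correct guess under H0

# Probability of each possible number of correct identifications (0 to 9)
k <- 0:n_mothers
null_dist <- data.frame(correct = k,
                        prob = dbinom(k, size = n_mothers, prob = p0))

print(round(null_dist, 3))

# Probability of exactly eight correct identifications
pr_eight <- null_dist$prob[null_dist$correct == 8]
print(round(pr_eight, 3))
```

The last line gives 0.018, the value quoted in the answer.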

Discuss: If \( H_{0} \) were true, what is the probability of obtaining eight or more correct identifications?

Answer: Pr[number correct \( \geq \) 8] = 0.018 + 0.002 = 0.02

Discuss: What is the \( P \)-value?

Answer: \( P = 2\times(0.02) = 0.04 \)
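The same arithmetic can be checked without rounding. The sketch below sums the upper tail of the binomial null distribution and doubles it; doubling is appropriate here because the null distribution is symmetric when \( p = 0.5 \).

```r
# Pr[number correct >= 8] under H0: nine trials, success probability 0.5
upper_tail <- sum(dbinom(8:9, size = 9, prob = 0.5))

# Equivalent calculation from the cumulative distribution: Pr[X > 7]
upper_tail_check <- pbinom(7, size = 9, prob = 0.5, lower.tail = FALSE)

# Two-sided P-value: count equally extreme outcomes in the lower tail as well
p_value <- 2 * upper_tail

print(upper_tail)
print(p_value)
```

The unrounded P-value is 0.0390625, which rounds to the 0.04 used on the slide.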

The null distribution...

Discuss: Propose a strategy to generate the null distribution

Let's make an experiment!

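# Results of the class experiment (one value per trial)
x = c(3,4,3,4,3,6,5,5,5,6,6,5,3,4,3,6,6,4,5,4,1,6,3,7,3,3,6,6,4,6,3,6)
n = length(x)
hist(x, breaks = 10, col = "gray", main = "Null distribution", xlab = "")
# Proportion of trials in which the value 5 was obtained
round(sum(x == 5) / n, 3)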
[1] 0.156
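One possible strategy is to let the computer do the guessing. The sketch below assumes that, under \( H_{0} \), each of the nine choices behaves like a fair coin flip; the helper `one_experiment()` and the seed are illustrative choices, not part of the original slides.

```r
set.seed(1)        # arbitrary seed, only to make the sketch reproducible
n_mothers <- 9     # guesses per simulated experiment
n_rep <- 10000     # number of simulated experiments

# One experiment: nine fair coin flips (1 = correct guess), return the number correct
one_experiment <- function() {
  sum(sample(c(0, 1), size = n_mothers, replace = TRUE))
}

sim_counts <- replicate(n_rep, one_experiment())

# Relative frequency of each outcome approximates the null distribution
sim_dist <- table(factor(sim_counts, levels = 0:n_mothers)) / n_rep
print(round(sim_dist, 3))
```

With only a few dozen repetitions the estimated frequencies are noisy; they settle down as the number of repetitions grows.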


The null distribution...

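# Simulate one million experiments under H0: 9 trials each, success probability 0.5
n = 1000000
x = rbinom(n, 9, 0.5)
hist(x, breaks = 10, col = "gray", main = "Null distribution", xlab = "")
# Proportion of simulated experiments with exactly 5 successes
round(sum(x == 5) / n, 3)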
[1] 0.246


What is the sampling distribution for the number of successes?

Definition: The binomial distribution provides the probability distribution for the number of “successes” in a fixed number of independent trials (\( n \)), when the probability of success (\( p \)) is the same in each trial.

Binomial distribution

If we have \( n \) trials, and the probability of success in each trial is \( p \), we have: \[ \mathrm{Pr}[X \ \mathrm{successes}] = \binom{n}{X} p^{X}(1-p)^{n-X}, \] where \[ \binom{n}{X} = \frac{n!}{X!(n-X)!}, \] and \[ n! = n\times(n-1)\times(n-2)\cdots 2\times 1. \]

Why?

Binomial distribution

To figure out Pr[\( X \) successes], first ask

Question: “What are all different outcomes of \( X \) successes in \( n \) trials?”

Example: Suppose \( n=3 \) and \( X=2 \).

\[ 2 \ \mathrm{successes} = \{SSF, SFS, FSS\} \]

\[ \mathrm{Pr}[SSF] = \mathrm{Pr}[S]\times \mathrm{Pr}[S]\times \mathrm{Pr}[F] = p^2(1-p) \]

\[ \mathrm{Pr}[SFS] = \mathrm{Pr}[S]\times \mathrm{Pr}[F]\times \mathrm{Pr}[S] = p^2(1-p) \]

\[ \mathrm{Pr}[FSS] = \mathrm{Pr}[F]\times \mathrm{Pr}[S]\times \mathrm{Pr}[S] = p^2(1-p) \]

\[ \mathrm{Pr}[SSF] = \mathrm{Pr}[SFS] = \mathrm{Pr}[FSS] = p^2(1-p) = p^X(1-p)^{n-X} \]

How many ways are there to have 2 successes in 3 trials?

\[ \binom{3}{2} = \frac{3!}{2!(3-2)!}= \frac{3\times 2\times 1}{(2\times 1)\times 1}=3 \]
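The count can be verified by brute force. The sketch below lists every possible sequence of three trials and keeps those with exactly two successes; the object names are illustrative.

```r
# All 2^3 = 8 sequences of successes (S) and failures (F) in three trials
outcomes <- expand.grid(rep(list(c("S", "F")), 3), stringsAsFactors = FALSE)
sequences <- apply(outcomes, 1, paste, collapse = "")

# Number of successes in each sequence
n_success <- rowSums(outcomes == "S")

# Sequences with exactly two successes
two_successes <- sequences[n_success == 2]
print(two_successes)

# The binomial coefficient gives the same count without enumeration
print(choose(3, 2))
```

Enumeration becomes impractical quickly (there are \( 2^n \) sequences), which is why the binomial coefficient is used instead.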

Therefore!

\[ \mathrm{Pr}[X \ \mathrm{successes}] = \binom{n}{X}p^{X}(1-p)^{n-X} \]

\[ \mathrm{Pr}[2 \ \mathrm{successes} \mid n = 3] = \binom{3}{2}p^2(1-p) \]

\[ \mathrm{Pr}[5 \ \mathrm{successes} \mid n = 9, \ p=0.5] = \binom{9}{5}0.5^5(0.5)^4 \]

( p = factorial(9)/( factorial(5)*factorial(4))*0.5^5*0.5^4 )
[1] 0.2460938
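The formula can be wrapped in a small function and compared with R's built-in binomial density. `binom_pr()` is a hypothetical helper written for this sketch; `choose()` and `dbinom()` are base R functions.

```r
# Binomial probability written out from the formula:
# Pr[X successes] = choose(n, X) * p^X * (1 - p)^(n - X)
binom_pr <- function(X, n, p) {
  choose(n, X) * p^X * (1 - p)^(n - X)
}

by_formula <- binom_pr(5, n = 9, p = 0.5)
by_dbinom <- dbinom(5, size = 9, prob = 0.5)

print(by_formula)
print(by_dbinom)
```

Both calls should return the value computed above with `factorial()`.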

P-value...

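# x and n come from the simulation of the null distribution above
hist(x, breaks = 10, col = "gray", main = "Null distribution", xlab = "")
# Two-sided P-value: proportion of simulated outcomes at least as extreme as 8 (i.e., 0, 1, 8, or 9)
P = sum(x %in% c(0, 1, 8, 9)) / n ; ( round(P, 5) )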
[1] 0.03873


binom.test(8, 9, 0.5, alternative="two.sided")

    Exact binomial test

data:  8 and 9
number of successes = 8, number of trials = 9, p-value = 0.03906
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.5175035 0.9971909
sample estimates:
probability of success 
             0.8888889 
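The printed summary is a formatted view of a list returned by `binom.test()`. The sketch below shows how the individual pieces can be extracted for further use; the object name `res` is arbitrary.

```r
# Exact binomial test for 8 successes in 9 trials against p0 = 0.5
res <- binom.test(8, 9, p = 0.5, alternative = "two.sided")

p_value <- res$p.value           # two-sided P-value
p_hat <- unname(res$estimate)    # observed proportion of successes
ci <- res$conf.int               # 95% confidence interval for p

print(p_value)
print(p_hat)
print(ci)
```

The exact P-value (0.0390625) is close to, but not identical to, the simulated value because the simulation is subject to sampling error.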

Binomial test - Definitions

Definition: The binomial test uses data to test whether a population proportion (\( p \)) matches a null expectation (\( p_{0} \)) for the proportion.

Definition: The null hypothesis \( H_{0} \) and alternative hypothesis \( H_{A} \) for a binomial test are given by:

    \( H_{0} \): Relative frequency of successes in population is \( p_{0} \).
    \( H_{A} \): Relative frequency of successes in population is not \( p_{0} \).

Hypothesis testing (Problem #25)

So, P = 0.04. What now?

Definition: The significance level, \( \alpha \), is the probability used as a criterion for rejecting the null hypothesis. If the \( P \)-value is less than or equal to \( \alpha \), then the null hypothesis is rejected. If the \( P \)-value is greater than \( \alpha \), then the null hypothesis is not rejected.

Definition: A result is considered statistically significant when \( P \)-value \( \leq \alpha \).

Definition: A result is considered not statistically significant when \( P \)-value \( > \alpha \).
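The decision rule in these definitions can be written as a one-line comparison. `decide()` is a hypothetical helper introduced only for this sketch, and the default \( \alpha = 0.05 \) is a common convention rather than a requirement.

```r
# Hypothetical helper: apply the rule "reject H0 when the P-value is <= alpha"
decide <- function(p_value, alpha = 0.05) {
  if (p_value <= alpha) "reject H0" else "do not reject H0"
}

decision_a <- decide(0.04)                # alpha = 0.05
decision_b <- decide(0.04, alpha = 0.01)  # stricter significance level

print(c(decision_a, decision_b))
```

The same P-value leads to different conclusions under the two significance levels, which is why \( \alpha \) should be fixed before looking at the data.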

Hypothesis testing (Problem #25)

Discuss: Given \( \alpha = 0.05 \), \( \{H_{0}: \ p = 0.5\} \), and \( P \)-value of 0.04, what is the appropriate conclusion?

Answer: Reject \( H_{0} \). There is evidence that mothers identify their own children by smell more often than expected by chance.

What the P-value is, and is not...

  • The P-value is NOT the probability that the null hypothesis is true.

  • If the data are consistent with the null hypothesis, it means that we failed to reject it, but we can NOT say that it is true!

  • The P-value is the probability of observing a result as extreme as or more extreme than the one observed, assuming the null hypothesis is true.

  • If the data are NOT consistent with the null hypothesis, we reject it and say that the data support the alternative hypothesis.

Caveats

  • Statistical significance is NOT the same as biological importance.

  • Effect sizes (e.g., mean difference, correlation between 2 variables) are important and need to be reported systematically together with the P-value.

    • Large sample sizes can lead to statistically significant results, even though the effect size is small!
    • Conversely, if a test fails to reject a null hypothesis (e.g., p = 0.50), but the 95% confidence interval is wide (e.g., 0.24 < p < 0.76), then we realize we do not have enough information to draw a strong conclusion. We would have a different interpretation if 0.49 < p < 0.51!
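To make the link between sample size, significance, and interval width concrete, the sketch below applies the binomial test to two made-up data sets with the same observed proportion (0.52) but very different sample sizes. The counts are hypothetical and chosen only for illustration.

```r
# Hypothetical data: same observed proportion (0.52), different sample sizes
small <- binom.test(52, 100, p = 0.5)
large <- binom.test(5200, 10000, p = 0.5)

# P-values: the same small effect is significant only in the large sample
print(c(small = small$p.value, large = large$p.value))

# Widths of the 95% confidence intervals
width_small <- diff(small$conf.int)
width_large <- diff(large$conf.int)
print(c(small = width_small, large = width_large))
```

Reporting the estimate and its confidence interval alongside the P-value shows both how large the effect is and how precisely it has been measured.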

Errors in Hypothesis Testing

Definition: Type I error is rejecting a true null hypothesis. The probability of a Type I error is: \[ \mathrm{Pr[Reject} \ H_{0} \ | \ H_{0} \ \mathrm{is \ true}] = \alpha \]

Definition: Type II error is failing to reject a false null hypothesis. The probability of a Type II error is: \[ \mathrm{Pr[Do \ not \ reject} \ H_{0} \ | \ H_{0} \ \mathrm{is \ false}] = \beta \]
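For a discrete test statistic, the realized Type I error rate can fall below the nominal \( \alpha \). The sketch below works this out for a two-sided binomial test with nine trials, \( p_{0} = 0.5 \), and \( \alpha = 0.05 \); these settings mirror the T-shirt example, and the variable names are illustrative.

```r
n <- 9         # number of trials
p0 <- 0.5      # null value of the proportion
alpha <- 0.05  # nominal significance level

# Two-sided P-value for every possible outcome 0..n
k <- 0:n
p_values <- sapply(k, function(x) binom.test(x, n, p = p0)$p.value)

# Outcomes that lead to rejection of H0
reject <- p_values <= alpha
print(k[reject])

# Probability of landing in the rejection region when H0 is true
type1_rate <- sum(dbinom(k[reject], size = n, prob = p0))
print(type1_rate)
```

Here \( H_{0} \) is rejected only for 0, 1, 8, or 9 successes, so the realized Type I error rate is about 0.039, slightly below the nominal 0.05.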

Errors in Hypothesis Testing - Power

Definition: The power of a statistical test is given by \[ \mathrm{Pr[Reject} \ H_{0} \ | \ H_{0} \ \mathrm{is \ false}] = 1-\beta \]

A study has more power if the sample size is large, if the true discrepancy from the null hypothesis is large, or if the variability in the population is low.
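The first two claims can be checked numerically for the binomial test. In this sketch, `binom_power()` is a hypothetical helper that finds the rejection region of a two-sided exact test and then sums the probability of that region under an assumed true proportion; the sample sizes and true proportions are arbitrary illustrations.

```r
# Hypothetical helper: power of a two-sided exact binomial test
binom_power <- function(n, p_true, p0 = 0.5, alpha = 0.05) {
  k <- 0:n
  # P-value for each possible outcome, computed under H0: p = p0
  p_values <- sapply(k, function(x) binom.test(x, n, p = p0)$p.value)
  reject <- p_values <= alpha
  # Probability of falling in the rejection region when the truth is p_true
  sum(dbinom(k[reject], size = n, prob = p_true))
}

# Larger sample size, same true proportion
power_n9 <- binom_power(9, p_true = 0.8)
power_n50 <- binom_power(50, p_true = 0.8)

# Same sample size, larger discrepancy from the null value
power_p09 <- binom_power(9, p_true = 0.9)

print(round(c(n9 = power_n9, n50 = power_n50, n9_p09 = power_p09), 3))
```

With nine trials and a true proportion of 0.8 the test rejects \( H_{0} \) less than half the time, so a non-significant result from a small study is weak evidence for the null hypothesis.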