James Scott (UT-Austin)
Reference: “Data Science” Chapter 7
Unless you're from a narrow strip of land from Connecticut to Maine, you probably dislike the New England Patriots.
First of all, they win too much.
Then there's Tom Brady, their star quarterback…
“The more hydrated I am, the less likely I am to get sunburned.” –TB12
And Bill Belichick, their coach…
And of course, the cheating!
But could even the Patriots cheat at the pre-game coin toss?
Many people think so!
For a 25-game stretch during the 2014 and 2015 NFL seasons, the Patriots won the pre-game coin toss 19 out of 25 times, for a suspiciously high winning percentage of 76%.
Shannon Sharpe: “This proves that either God or the devil is a Patriots fan, and it can't possibly be God.”
“Use the Force…”
But before we invoke religion or the Force to explain this fact, let's consider the innocent explanation first: blind luck.
To the code in patriots.R! Let's simulate some coin flips.
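Here's a minimal sketch of the kind of simulation patriots.R might run (the number of simulations and the seed are my own choices; the actual script may differ):

set.seed(1)                                    # for reproducibility
n_sims <- 100000                               # number of simulated 25-toss stretches
wins <- rbinom(n_sims, size = 25, prob = 0.5)  # win counts for a fair coin (the innocent explanation)
mean(wins >= 19)                               # fraction of stretches at least as lopsided as 19/25; roughly 0.007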
This simple example has all the major elements of hypothesis testing: (1) a null hypothesis \( H_0 \) (the coin toss is fair); (2) a test statistic \( T \) (the number of tosses the Patriots won); (3) the probability distribution of \( T \) assuming \( H_0 \) is true; and (4) a way of using the observed test statistic to judge whether the data are consistent with \( H_0 \).
All hypothesis-testing problems have these same four elements.
Within this basic framework, there are two schools of thought about how to proceed.
Suppose our observed test statistic is \( t_{ob} \). In step 4, we should report the quantity
\[ p = P(T \geq t_{ob} \mid H_0) \]
Fisher called this the \( p \)-value: the probability that, if the null hypothesis were true, we would observe a test statistic \( T \) at least as large as the value we actually observed (\( t_{ob} \)).
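For the Patriots, this can be computed exactly from the binomial distribution rather than by simulation; here's a one-line R illustration (my own sketch, not necessarily how the book's code does it):

1 - pbinom(18, size = 25, prob = 0.5)  # P(X >= 19) for a fair coin: about 0.0073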
The \( p \)-value summarizes the strength of evidence provided by the data against the null hypothesis.
According to Fisher: job done! Report the \( p \)-value and let your readers make whatever they will of it.
“Mwa-ha-ha-ha-ha-ha!”
“You're on your own, suckahs!”
The biggest advantage of \( p \)-values is that they provide a sliding scale of evidence against the null hypothesis: small \( p \)-values mean stronger evidence.
The biggest problem with \( p \)-values is that they are very hard to interpret correctly.
The second-biggest problem with \( p \)-values is that people seem to insist on turning them into a binary thing. E.g. is \( p < 0.05 \)?
“I got a \( p \)-value of 0.02, so there's a 2% chance that the null hypothesis is right.”
Wrong: the \( p \)-value is \( P(T \geq t_{ob} \mid H_0) \), a statement about the data given the null, not \( P(H_0 \mid \mbox{data}) \).
Remember: conditional probabilities aren't symmetric!
“Bob got a \( p \)-value of 0.1, but I got a \( p \)-value of 0.01. My null hypothesis is ten times less likely to be true than Bob's is.”
Wrong: a \( p \)-value is not the probability that the null hypothesis is true, so the ratio of two \( p \)-values says nothing about the relative plausibility of two null hypotheses.
“I got a \( p \)-value of 0.02. There's only a 2% chance I would have observed my test statistic if the null hypothesis were true.”
Wrong:
Remember: the \( p \)-value is the probability of observing the test statistic you actually observed, or any more extreme test statistic, assuming \( H_0 \) is true.
Because \( p \)-values are hard to interpret, people tend to impose arbitrary cut-offs for what counts as a “significant” p-value.
Psychologists, for example, will generally publish research findings for which \( p < 0.05 \).
Then again, psychologists will believe anything.
Physicists are a bit more skeptical; they generally publish results only when \( p < 0.000001 \).
For example, the \( p \)-value in the paper announcing the discovery of the Higgs boson was \( 1.7 \times 10^{-9} \).
They spent a lot of time and money (> $1 billion) collecting more data, even after the evidence was really, really strong.
This brings us to the second school of thought. Nobody except Fisher seems to know how to interpret \( p \)-values, and rejecting a null hypothesis isn't meaningful unless we have some alternative hypothesis in mind. Since people will inevitably use a \( p \)-value to make a binary decision (“null” versus “alternative”), we should formalize that decision process.
The Neyman-Pearson approach is aimed at quantifying (and controlling) the error probabilities associated with a hypothesis test.
In Neyman-Pearson testing, we have a modified sequence of steps. Steps 1-3 are the same as before, but before looking at the observed test statistic \( t_{ob} \) for your actual data, continue as follows.
4a. Specify a rejection region \( R \subset \mathcal{T} \).
4b. Calculate \( \alpha = P(T \in R \mid H_0) \). This is called the alpha level or size of the test.
4c. Calculate the power of your test as \( P(T \in R \mid H_A) \).
4d. Check whether your observed test statistic, \( t_{ob} \), falls in \( R \). If so, reject \( H_0 \) in favor of \( H_A \). If not, retain \( H_0 \).
The test is characterized by two properties: its \( \alpha \) level, the probability of rejecting \( H_0 \) when \( H_0 \) is actually true, and its power, the probability of rejecting \( H_0 \) when \( H_A \) is actually true.
Neyman and Pearson are basically asking you: would you buy a car without a warranty?
Then you shouldn't test a hypothesis without one, either!
\( \alpha \) and power (or \( \beta = 1 - \mbox{power} \)) serve as the test's “warranty,” or specific guarantee of performance: \( \alpha \) caps how often the test cries wolf when the null is true, and the power tells you how often the test catches a real effect when the alternative is true.
These are knowable in advance. As with cars, so too with hypothesis tests: always check the whole warranty! If someone only tells you the \( \alpha \) level of a test and omits the power, it's like only giving a warranty on part of the car.
At the end of a Neyman-Pearson test, you report two things: your decision (reject or retain \( H_0 \)) and the warranty of the test you ran (its \( \alpha \) level and its power).
No \( p \)-values! (This was Fisher's criticism: no matter how strong the evidence against the null, an NP test ends up reporting the same thing for any \( T \in R \).)
The difficulty of conducting a Neyman-Pearson test depends upon the alternative hypothesis.
Let's go back to the Patriots problem. Our test statistic is \( X \), the number of successful coin flips in 25 tries. Suppose that \( p \) is the true probability that the Patriots will win the coin toss. Consider testing the two hypotheses \( H_0: p = 1/2 \) versus \( H_A: p = 2/3 \).
Suppose we decide to reject \( H_0 \) if \( X \geq 17 \). In this case the power is easy to calculate: it's just \( P(X \geq 17) \) when \( X \sim \mbox{Binom}(N=25, p=2/3) \).
Let's look at power.R (part 1).
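Here's a sketch of the kind of calculation part 1 of power.R performs (the actual script may differ; the numbers come straight from the binomial distribution):

N <- 25
cutoff <- 17                                           # rejection region: X >= 17

alpha <- 1 - pbinom(cutoff - 1, size = N, prob = 0.5)  # P(X >= 17) under H0: p = 1/2
alpha                                                  # about 0.054

power <- 1 - pbinom(cutoff - 1, size = N, prob = 2/3)  # P(X >= 17) under HA: p = 2/3
power                                                  # about 0.54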
Suppose now that we follow the Patriots for a 50-flip stretch and count the number of times \( X \) they win the coin toss. As before, our null hypothesis is that \( X \sim \mbox{Binom}(N, p=0.5) \).
Follow the steps of an NP test, assuming a specific (simple) alternative hypothesis.
Compare this to the more realistic situation where our alternative hypothesis isn't so specific: \( H_A: p > 0.5 \), i.e. the Patriots are favored to win the toss, but we don't say by how much.
This is called a “composite alternative hypothesis.” It “hedges its bets,” i.e. it doesn't make any specific predictions except that the null hypothesis is wrong.
Now the power of the test isn't just a single number.
Rather, it's a function, or a power curve:
\[ \mbox{Power}(p) = P(X \geq 17 \mid p) \]
where \( p \) is the assumed binomial success probability, with \( X \sim \mbox{Binom}(N=25, p) \) and the same rejection region \( X \geq 17 \) as before.
Back to power.R (part 2).
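Here's a sketch of what part 2 might compute (again, the real power.R may differ): \( P(X \geq 17 \mid p) \) over a grid of \( p \) values, plotted as a power curve.

N <- 25
cutoff <- 17
p_grid <- seq(0.5, 1, by = 0.01)                       # grid of possible true values of p
power_curve <- 1 - pbinom(cutoff - 1, size = N, prob = p_grid)
plot(p_grid, power_curve, type = "l",
     xlab = "True probability p of winning the toss",
     ylab = "Power(p)")
abline(h = 1 - pbinom(cutoff - 1, size = N, prob = 0.5), lty = 2)  # the alpha level, i.e. the power at p = 0.5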
Here is the general framework. Suppose we take data points \( X_1, \ldots, X_N \), where each \( X_i \) comes from some parametric probability distribution \( p(X \mid \theta) \), and that the null hypothesis is \( H_0: \theta = \theta_0 \). We summarize the data with a test statistic
\[ T = T(X_1, \ldots, X_N) \, . \]
Suppose our observed test statistic is \( t_{ob} \).
In Fisher's approach, we report the \( p \)-value
\[ p = P(T \geq t_{ob} \mid H_0) \]
or, more generally,
\[ p = P(T \in \Gamma(t_{ob}) \mid H_0) \, , \]
where \( \Gamma(t_{ob}) \) is the set of possible test statistics at least as extreme as the one we actually observed.
In Neyman-Pearson testing, we instead pre-specify a rejection region, typically of the form
\[ R = \{ T \in \mathcal{T}: T \geq c \} \quad \mbox{or} \quad R = \{ T \in \mathcal{T}: |T| \geq c \} \, , \]
and compute the test's \( \alpha \) level,
\[ \alpha = P(T \in R \mid H_0) = P(T \in R \mid \theta = \theta_0) \, . \]
An alternative hypothesis takes the form \( H_A: \theta \in \Theta_A \), where \( \Theta_A \) is some subset of the parameter space not containing \( \theta_0 \).
A simple alternative is one where \( \Theta_A \) contains a single value, whereas a composite alternative contains multiple possible values. For example, \( H_A: \theta \neq 0 \) and \( H_A: p > 0.5 \) are both composite alternatives.
The power of the test at some specific \( \theta_a \in \Theta_A \) is defined as
\[ \mbox{Power}(\theta_a) = P(T \in R \mid \theta = \theta_a) = 1 - \beta(\theta_a) \, , \]
where \( \beta(\theta_a) \) is the type II error rate at \( \theta_a \).
Return to our example where we follow the Patriots for a 50-flip stretch and count the number of times \( X \) they win the coin toss. Clearly \( X \sim \mbox{Binom}(N=50, p) \), and our null hypothesis is that \( p = 0.5 \).
Follow the steps of an NP test for two different rejection regions: \( R_1 = \{X : X \geq 30\} \) and \( R_2 = \{X : X \geq 34\} \). For each of these two rejection regions, check the warranty! That is, calculate the \( \alpha \) level and the power (as a function of \( p \)) for each test, as in the sketch below.
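Here's one way you might check that warranty in R (a sketch of my own, not a worked solution from the text):

N <- 50
cutoffs <- c(30, 34)                                    # R1: X >= 30, R2: X >= 34

alpha <- 1 - pbinom(cutoffs - 1, size = N, prob = 0.5)  # alpha level of each test under H0: p = 0.5
alpha

p_grid <- seq(0.5, 1, by = 0.01)                        # composite alternative: p > 0.5
power1 <- 1 - pbinom(cutoffs[1] - 1, size = N, prob = p_grid)
power2 <- 1 - pbinom(cutoffs[2] - 1, size = N, prob = p_grid)
plot(p_grid, power1, type = "l", ylim = c(0, 1), xlab = "True p", ylab = "Power(p)")
lines(p_grid, power2, lty = 2)
legend("bottomright", legend = c("Reject if X >= 30", "Reject if X >= 34"), lty = c(1, 2))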
In Neyman-Pearson testing, it's really important that you specify the rejection region \( R \) in advance, before seeing the data.
In particular, you absolutely, positively cannot look at the data first and then pick a rejection region (or an \( \alpha \) level) that makes your results look as impressive as possible.
That's what cheaters like Tom Brady do. It voids the warranty usually enjoyed by a Neyman-Pearson test.
Why is this cheating?
Because the two key probabilities in NP testing assume that \( R \) is fixed and pre-specified, and that only the test statistic \( T \) is random from sample to sample:
\[ \alpha = P(T \in R \mid H_0) \quad \mbox{and} \quad \mbox{Power}(\theta_a) = P(T \in R \mid \theta = \theta_a) \]
If you choose \( R \) based on the data, then the rejection region itself is a random variable, and all probability statements are off. This voids the warranty!
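To make this concrete, here's a toy simulation of my own (not from the text): an honest analyst pre-commits to the one-sided region \( \{X \geq 17\} \) for 25 tosses, while a cheater waits to see the data and then tests in whichever direction looks most impressive, which amounts to using the region \( \{X \geq 17 \mbox{ or } X \leq 8\} \).

set.seed(1)
x <- rbinom(100000, size = 25, prob = 0.5)  # many 25-toss stretches generated under H0
mean(x >= 17)                               # honest, pre-specified region: close to the advertised alpha of about 0.054
mean(x >= 17 | x <= 8)                      # direction chosen after seeing the data: roughly twice as large

The advertised \( \alpha \) no longer describes how often the data-snooping procedure rejects a true null.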
Be like a physicist here.
Physicists (especially particle physicists) are the most honest statisticians in the observable universe. They always specify \( R \) in advance, and they never change their \( \alpha \) level to make their data look maximally impressive.
In my experience, people in most other fields are sloppy, dishonest, or both when it comes to setting \( \alpha \) levels.
Remember the difference between Fisher and Neyman-Pearson, and don't conflate them: Fisher reports a \( p \)-value as a sliding scale of evidence against the null, while Neyman-Pearson pre-specifies a rejection region and reports a binary decision together with the test's \( \alpha \) level and power.
In fact, this is a great litmus test for probing the depth of someone's statistical knowledge: ask them “What's the difference between the Fisherian and Neyman-Pearson frameworks for hypothesis testing?”
Most people who can compute a \( p \)-value and think they understand statistics will look at you with a blank stare if you ask them this question.
This is like a biologist who can't explain the difference between Darwin's and Lamarck's views on evolution. Hold yourself to a higher standard.
Be careful about using \( p \)-values.
And finally, the single most important “best practice” of hypothesis testing is:
Don't do hypothesis testing.
Any time you're about to calculate a \( p \)-value, ask yourself: do I really need to? Wouldn't I be better off reporting a confidence interval instead? Usually the answer is yes!