04 October, 2016


Introduction

This is a short introduction to statistical hypothesis testing. It contains three essential parts and one non-essential, included to give the curious reader more knowledge of the "behind-the-scenes" work of statistical hypothesis testing.

The three essential parts are

  1. Explanation of what a statistical hypothesis is, including some examples,
  2. How testing of a statistical hypothesis is done,
  3. Review of some of the most common issues with hypothesis testing.

A brief summary is included to provide a quick overview.

If you have any comments or questions, please feel free to send me an email.

Enjoy!

Statistical Hypotheses

What is a statistical hypothesis?

A STATISTICAL HYPOTHESIS is an assumption. It can be regarding the slope of a regression line, the proportion of a population with a certain disease, or a difference in efficacy between two different treatments.

In statistics, we write the assumption we want to test, often referred to as the NULL HYPOTHESIS, as "\(H_0:\) our assumption". This makes it easy to refer to the hypothesis throughout the analysis, as we can construct sentences such as "under \(H_0\), \(X\) is normally distributed" or "with a p-value of \(0.01\), we reject \(H_0\)".

For every null hypothesis, there is an ALTERNATIVE HYPOTHESIS. This is, as the name suggests, the alternative to the null: the opposite of the assumption made in the null hypothesis. This is an important aspect, as changing the alternative hypothesis might change which test is more appropriate to use.

The null hypothesis is often formulated in a slightly unintuitive way. For example, if we want to test for a difference in means in two groups, the null hypothesis is "\(H_0:\) no difference" and the alternative "\(H_1:\) some difference". This is because "no difference" means something very specific – the difference is \(0\) – whereas "some difference" includes all possible differences.

Example I

We wish to study the endothelial cell loss of children aged 0-5 and children aged 6-10 after cataract surgery. The question we want to answer is: "is there a difference?"

If we let \(\mu_0=\) 'average of cell loss of children aged 0-5', and \(\mu_1=\) 'average of cell loss of children aged 6-10', the answer can be formulated as: "is \(\mu_0\) different from \(\mu_1\)?"

We can now formulate our statistical hypothesis. The null hypothesis would in this case be \[H_0: \mu_0 = \mu_1\] which we would want to test against the alternative \[H_1: \mu_0 \neq \mu_1\]

As we will see later, it is useful to rephrase this as \[H_0: \mu_0 - \mu_1 = 0\] and \[H_1: \mu_0 - \mu_1 \neq 0.\]
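To make this concrete, here is a minimal sketch in Python of how such a hypothesis could be tested with a two-sample t-test (scipy's ttest_ind). The cell-loss values are invented purely for illustration and are not from any real study.

```python
# Sketch: testing H0: mu_0 - mu_1 = 0 with a two-sample t-test.
# The cell-loss values are hypothetical, made up for illustration only.
import numpy as np
from scipy import stats

loss_age_0_5 = np.array([5.1, 7.3, 6.2, 8.0, 5.9, 6.8])   # hypothetical cell loss (%), ages 0-5
loss_age_6_10 = np.array([4.2, 5.0, 4.8, 5.6, 4.1, 5.3])  # hypothetical cell loss (%), ages 6-10

t_stat, p_value = stats.ttest_ind(loss_age_0_5, loss_age_6_10)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # small p-value => reject H0: mu_0 = mu_1
```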

Reference

Example II

Researchers want to determine if a history of arthritis is a risk factor for dry eye syndrome in the Beaver Dam Eye Study Cohort. That is, we ask: "do people with a history of arthritis have increased odds of dry eye syndrome?"

This can be formulated as "is the odds of dry eye syndrome in the group of people with a history of arthritis the same as the odds of dry eye syndrome in the group of people without?"

We will often answer such a question using odds ratios. Since an odds ratio of \(1\) is the same as no difference in odds, the null hypothesis is \(H_0: \text{OR} = 1\) and the alternative hypothesis \(H_1: \text{OR} \neq 1\).
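As a minimal sketch of what testing \(H_0: \text{OR} = 1\) could look like in Python, the counts below form a hypothetical 2x2 table (not taken from the Beaver Dam Eye Study), and Fisher's exact test is used as a simple stand-in for the logistic regression discussed later.

```python
# Sketch: testing H0: OR = 1 from a hypothetical 2x2 table (invented counts).
from scipy.stats import fisher_exact

#                  dry eye   no dry eye
table = [[ 90,       310],   # history of arthritis
         [150,       950]]   # no history of arthritis

odds_ratio, p_value = fisher_exact(table)           # sample OR = (90*950)/(310*150)
print(f"OR = {odds_ratio:.2f}, p = {p_value:.3f}")  # small p-value => reject H0: OR = 1
```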

Reference

Example III

In this hypothetical study, a researcher wants to determine the effect of a given treatment. The researcher uses logMAR as her main outcome. To find a difference between the two groups, one could simply use a t-test. However, this clever researcher knows that other factors, such as age and duration of disease, impact the outcome, so instead she opts for a linear regression model, where it is possible to adjust for such factors.

Her hypothesis is now formulated in terms of regression coefficients, i.e. how much the logMAR changes when we go from one treatment group to the other. The null hypothesis is then \(H_0: \beta_{Treatment} = 0\), which is to say that there is no effect of the treatment, while the alternative hypothesis is \(H_1: \beta_{Treatment} \neq 0\), i.e. there is an effect.
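A minimal sketch of what this could look like in Python, using statsmodels and simulated data (the treatment, age, and duration values are all made up): the p-value attached to the treatment coefficient is the test of \(H_0: \beta_{Treatment} = 0\).

```python
# Sketch: testing H0: beta_Treatment = 0 in a linear model adjusting for age and
# disease duration. All data are simulated for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 60
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),   # 0 = control, 1 = treated
    "age": rng.uniform(40, 80, n),
    "duration": rng.uniform(1, 10, n),    # years of disease
})
# Simulate a logMAR outcome with a small treatment effect plus noise
df["logmar"] = (0.6 - 0.10 * df["treatment"] + 0.004 * df["age"]
                + 0.010 * df["duration"] + rng.normal(0, 0.1, n))

fit = smf.ols("logmar ~ treatment + age + duration", data=df).fit()
print(fit.params["treatment"], fit.pvalues["treatment"])  # estimate and p-value for beta_Treatment
```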

(For more information on linear regression and the use of these models, check out the link below.)

Reference

Why it's important

Formulating your research question as a statistical hypothesis is not just rewriting it with \(H_0\)'s and means and odds ratios.

It helps you think about what you are actually looking for – what measure should you use to quantify the differences you're looking for?

Furthermore, it helps you figure out what test is appropriate. In Example I, we found that the research question could be formulated in terms of means. This indicates that a t-test/ANOVA is a good place to start.

In Example II, we found that we were actually interested in the odds ratio. So we need to somehow estimate probabilities, since we need these to calculate the odds ratio. Logistic regression might be a great method to use in a situation like that.

All methods come with pros and cons, so thinking ahead about which method you want to use might help you counteract some of its pitfalls through a well-designed study.

Testing a Statistical Hypothesis

Testing a Statistical Hypothesis

As mentioned, a statistical hypothesis is an assumption. So to test a statistical hypothesis is to ask: is the assumption violated?

This is not a trivial question. As an example, assume we want to test the hypothesis \(H_0:\) there's no effect of treatment A as compared to treatment B. This can be done using a t-test to check if the mean outcome of treatment A is the same as the mean outcome of treatment B. I.e. \(H_0: \mu_A - \mu_B = 0\), where \(\mu_A\) is the mean outcome of treatment A and \(\mu_B\) is the mean outcome of treatment B.

Now, even if it actually is true that the treatments have the exact same effect, when sampling patients who have been treated with treatment A and patients who have been treated with treatment B, we will practically always get two different sample means.

Hence, the question becomes: is the difference in the means big enough for us to say there is an actual difference between the treatments? Or is it small enough that it can be discarded?

Testing a Statistical Hypothesis

P-values

To assess if a difference is 'big enough', we use p-values.

Assuming all assumptions are valid, it can be determined how the difference of the means should behave – in other words, its distribution is known. This means we can actually calculate how likely it is, if the experiment were repeated, that something "more extreme", or "less compliant" with the assumptions, would be observed.

This probability is what is called the p-value: it is the chance of collecting data that fit worse with the assumption than the data at hand. I.e. if the p-value is very small, the data at hand do not fit the assumption very well, and hence the assumption (the null hypothesis) is rejected. On the other hand, if the p-value is very large, the data are in agreement with the null hypothesis, and therefore it is NOT rejected.
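One way to make "more extreme than the data at hand" concrete is a permutation simulation (a different route to the same idea as the t-test used elsewhere in this text). Under the null hypothesis of no group difference, the group labels are exchangeable, so we can shuffle them many times and see how often the shuffled difference in means is at least as large as the observed one. The data below are simulated purely for illustration.

```python
# Sketch: approximating a p-value by permutation. Under H0 the group labels are
# exchangeable, so shuffling them shows how often chance alone produces a
# difference at least as large as the one observed. Data are simulated.
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.normal(3.0, 2.0, 18)   # hypothetical measurements, group A
group_b = rng.normal(1.0, 2.0, 18)   # hypothetical measurements, group B

observed = abs(group_a.mean() - group_b.mean())
pooled = np.concatenate([group_a, group_b])

n_perm = 10_000
more_extreme = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = abs(pooled[:18].mean() - pooled[18:].mean())
    more_extreme += diff >= observed

print("approximate p-value:", more_extreme / n_perm)
```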

(For a more in-depth description of this idea, see Appendix I - Testing a Hypothesis)

Potential pitfalls

Potential pitfalls

There are a few things one has to be aware of when performing hypothesis tests:

  • Type-I, Type-II errors, Power (or lack thereof)
  • Statistical significance vs. clinical significance
  • Association vs. Causation
  • P-hacking/multiple comparisons
  • Ignoring key assumptions

Potential pitfalls

Type-I, Type-II errors, Power

When testing a hypothesis, one of two errors can happen:

  • Type-I errors (also called false positives)
    • Test results tell us there is a significant difference when there actually is not
  • Type-II errors (false negatives).
    • Test results tell us there is NO significant difference when there actually is

Power is closely related to type-II errors. The power of a test is the probability of seeing a significant effect IF it is actually there. Therefore, low power means low probability of detecting true effects (more on statistical power).

The point is: a significant p-value does not guarantee a REAL effect! (And vice versa.)
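As a rough illustration of how power behaves, the sketch below uses statsmodels' power calculator for a two-sample t-test; the effect size and sample sizes are arbitrary values chosen for the example.

```python
# Sketch: power of a two-sample t-test for an assumed standardized effect size
# of 0.5, at different sample sizes. Values are illustrative only.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n_per_group in (10, 30, 100):
    power = analysis.solve_power(effect_size=0.5, nobs1=n_per_group,
                                 alpha=0.05, ratio=1.0, alternative="two-sided")
    print(f"n = {n_per_group:>3} per group -> power = {power:.2f}")
```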

Potential pitfalls

Statistical significance vs. clinical significance

It is very important to distinguish between 'statistical significance' and 'clinical significance'. This is a particular problem in very large studies.

[Large study -> small standard errors -> very small differences come out as significant.]

This is because 'statistically significant' simply means that the effect is 'far from zero compared to its standard error'. So, if the standard error is very small, even the smallest differences between groups will show up as significant.

To avoid reporting clinically irrelevant findings as significant, it is important to consider effect sizes and confidence intervals. These help paint a much more detailed picture.
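As a minimal sketch of what 'consider effect sizes and confidence intervals' can mean in practice, the Python snippet below reports a mean difference, Cohen's d, and a 95% confidence interval for two invented samples, instead of a p-value alone.

```python
# Sketch: reporting an effect size (Cohen's d) and a 95% CI for a difference in
# means. The two samples are invented for illustration only.
import numpy as np
from scipy import stats

a = np.array([10.2, 11.1, 9.8, 10.6, 10.9, 11.4, 10.1, 10.7])
b = np.array([ 9.9, 10.3, 9.6, 10.4, 10.0, 10.8,  9.7, 10.2])

diff = a.mean() - b.mean()
dof = len(a) + len(b) - 2
sp = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / dof)  # pooled SD
cohens_d = diff / sp
se = sp * np.sqrt(1 / len(a) + 1 / len(b))        # standard error of the difference
t_crit = stats.t.ppf(0.975, df=dof)
print(f"difference = {diff:.2f}, Cohen's d = {cohens_d:.2f}, "
      f"95% CI = ({diff - t_crit * se:.2f}, {diff + t_crit * se:.2f})")
```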

Potential pitfalls

Association vs. Causation

Statistical significance never implies a cause-effect relationship, merely an association.

Therefore, never forget to interpret your results in the right context. Consider your study design carefully when drawing any conclusions. Could there be any confounding factors?

Potential pitfalls

P-hacking/multiple comparisons

When we test for significance at a level of 0.05, it means that there is a five percent chance of a false positive when the null hypothesis is actually true.

What if we test 20 effects at once? Let us assume we have 20 variables, all of which we know do not have any effect on our outcome. Assuming they are all independent (which they probably aren't), the probability of all 20 results being true negatives is \(0.95^{20} =\) 0.358.

I.e., there is a 0.642 chance of at least one false positive (i.e. rejecting the null hypothesis when it is in fact true). So, even when we KNOW there are no effects, we still have a good chance of finding something that's statistically significant.

The moral of the story is: even if there are NO effects, if you throw in enough variables, you are almost certain to find a false positive.
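The arithmetic above, and the "good chance of finding something" claim, can be checked directly. The sketch below first reproduces the \(1 - 0.95^{20}\) calculation and then simulates repeatedly testing 20 pure-noise variables; the numbers of experiments and tests are arbitrary.

```python
# Sketch: the chance of at least one false positive among 20 independent tests,
# first analytically, then by simulating t-tests on pure noise.
import numpy as np
from scipy import stats

print(1 - 0.95 ** 20)   # about 0.642

rng = np.random.default_rng(42)
n_experiments, n_tests = 1_000, 20
at_least_one = 0
for _ in range(n_experiments):
    # 20 t-tests comparing two groups drawn from the SAME distribution (no true effect)
    pvals = [stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
             for _ in range(n_tests)]
    at_least_one += min(pvals) < 0.05

print(at_least_one / n_experiments)   # should land near 0.642
```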

Potential pitfalls

P-hacking/multiple comparisons (cont.)

What can we do about it?

  • Avoid exploratory analyses,
  • Keep an eye on any inconsistencies in effect sizes (anything contradicting your expectations should be considered a red flag),
  • Adjust p-values for multiple tests using for example the Benjamini-Hochberg or the Bonferroni procedure.

The third point above is especially important. For a more in-depth look at why, how, and when to adjust for multiple comparisons, see this excellent discussion of it.
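As a minimal sketch of the third point, statsmodels can adjust a set of p-values with either procedure; the raw p-values below are arbitrary illustration values.

```python
# Sketch: adjusting p-values for multiple testing with Bonferroni and
# Benjamini-Hochberg. The raw p-values are arbitrary.
from statsmodels.stats.multitest import multipletests

raw_pvals = [0.001, 0.008, 0.020, 0.041, 0.180, 0.430]

for method in ("bonferroni", "fdr_bh"):   # Bonferroni, Benjamini-Hochberg
    reject, adjusted, _, _ = multipletests(raw_pvals, alpha=0.05, method=method)
    print(method, [round(p, 3) for p in adjusted], list(reject))
```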

Potential pitfalls

Ignoring key assumptions

Contrary to popular belief, statistics is NOT an exact science. It's all about decisions and assumptions, and then justifying those. We constantly have to make decisions such as…

… which model to use…
… which variables to include…
… are any transformations of the data appropriate…

… and check that common assumptions are met, such as…

… are the data normally distributed?
… are the explanatory variables correlated?
… is the variance constant across different groups?

None of the above can be guaranteed beyond doubt to be true, so we have to use our best judgement of the data at hand to find out if any assumptions are violated, and if so, whether we should try to correct for it.

Potential pitfalls

Ignoring key assumptions (cont.)

When testing a hypothesis, we test to see if the data behave as we would expect IF all assumptions made were true.

When we reject the null hypothesis, we conclude that one or more assumptions are wrong.

BUT if we have not validated the assumptions we make in our test, we cannot be sure that the assumption that should be rejected is the null hypothesis – it might as well be that the data are not normally distributed, for example.

Example I revisited

When testing the hypothesis that the endothelial cell loss of children aged 0-5 and children aged 6-10 after cataract surgery is the same, we want to use a standard t-test. This assumes

  1. the data from each of the two groups follow normal distributions,
  2. the standard deviation of the data is the same in the two groups,
  3. the two samples are independent.

If we test the hypothesis and get a very small p-value, it might as well be because one of the three assumptions above is false, not because the null hypothesis is. Therefore, it is important to make sure that the assumptions made when performing a test are not violated.
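A minimal sketch of how such checks could look in Python, using the same kind of invented cell-loss numbers as before: formal tests of normality and equal variance, and Welch's t-test as a fallback when the equal-variance assumption is doubtful. With real data, these checks should be accompanied by plots (histograms, QQ-plots).

```python
# Sketch: checking the t-test assumptions on hypothetical data before trusting
# the p-value.
import numpy as np
from scipy import stats

loss_age_0_5 = np.array([5.1, 7.3, 6.2, 8.0, 5.9, 6.8])   # hypothetical data
loss_age_6_10 = np.array([4.2, 5.0, 4.8, 5.6, 4.1, 5.3])  # hypothetical data

w1, p_norm_1 = stats.shapiro(loss_age_0_5)                   # normality, group 1 (assumption 1)
w2, p_norm_2 = stats.shapiro(loss_age_6_10)                  # normality, group 2 (assumption 1)
stat, p_var = stats.levene(loss_age_0_5, loss_age_6_10)      # equal spread (assumption 2)
print(p_norm_1, p_norm_2, p_var)
# Independence (assumption 3) cannot be checked from the numbers alone; it rests on the study design.

# If the equal-variance assumption looks doubtful, Welch's t-test drops it:
print(stats.ttest_ind(loss_age_0_5, loss_age_6_10, equal_var=False).pvalue)
```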

Summary

Summary

We always have two hypotheses:

  1. The null hypothesis: rejecting or not rejecting this answers the overall question that we are interested in
  2. The alternative hypothesis: the alternative of the null. This is what we will default to if the null is rejected

The null hypothesis is either rejected or not rejected based on a p-value.

  • Smaller p-values mean a lower chance, under the null hypothesis, of seeing an outcome at least as extreme as the one observed, and hence less faith in the null hypothesis

It is important to know what the null hypothesis is: the interpretation of the p-value is very closely related to the formulation of the null hypothesis.

  • If \(H_0: \mu_A - \mu_B = 0\), a large p-value indicates that the data are consistent with there being no difference between the means
  • If \(H_0: \mu_A - \mu_B \leq 0\), a large p-value indicates that the data are consistent with \(\mu_A\) NOT being larger than \(\mu_B\).

Summary

(cont.)

Always remember:

  • false positives are real
  • false negatives are real
  • tests never imply causation
  • statistical significance does not necessarily mean clinical significance.

Therefore:

  • ALWAYS consider adjusting for multiple tests
  • ALWAYS consider effect sizes and confidence intervals (when applicable)
  • ALWAYS use your intuition
  • ALWAYS seek help when in doubt!

Appendix I - Testing a Hypothesis

Testing a Hypothesis

The strategy

All hypothesis tests follow the same quite simple recipe:

  1. State your null and alternative hypotheses,
  2. Find a test statistic to quantify how far from the assumption in your hypothesis the data are,
  3. Assess whether the value of your test statistic is significantly far from what you assume it to be in your null hypothesis.

Let us take a look at an example of this.

How to test a hypothesis?

We have collected data from 18 subjects in each of two groups, A and B. We have good reason to believe that the measured attribute is normally distributed.

We want to find out if the means of the measurements from the two groups are the same. Our null hypothesis is therefore \(H_0:\) "no difference in the means" and our alternative hypothesis \(H_1:\) "there is a difference".

We will use \(\mu_A\) to denote the mean of group A, and \(\mu_B\) to denote the mean of group B. So, the null hypothesis is \(H_0: \mu_A = \mu_B\) and the alternative is \(H_1: \mu_A \neq \mu_B\).

How to test a hypothesis?

With our hypothesis stated, we now look at the data. The question we seek to answer is: "are the means far enough apart that we can confidently say that they must be different?"

First, we simply look at the difference between the two means. We find that they are 2.415 apart. Is that enough for us to say that they are significantly different?

How to test a hypothesis?

Obviously, we need more than that. So, we take another look at the data and find the values of the means. Now we can see that the mean in group A is 3.272 and in group B is 0.857. I.e. the mean in group A is \(\approx\) 3.8 times that in group B. Surely that's a significant difference, right?

How to test a hypothesis?

Let us add the actual observations. This gives us an idea of the variation of the data. As we see, the data vary quite a bit compared to the difference between the two means.

Even though the mean in group A is 3.8 times that in group B, this figure indicates that there is not necessarily a significant difference. The two groups almost cover the exact same range, so how could we possibly conclude they have different means?

How to test a hypothesis?

In some sense, we have now moved from comparing two numbers (the two averages) to comparing two distributions (the two sample distributions).

The sample distributions give us a much more complete picture of the population the samples are drawn from. So, comparing the distributions is much more powerful than comparing the averages alone.

As mentioned earlier, it is fair to assume that these data are normally distributed. Based on the two samples, we can estimate means and standard deviations and then draw the sample distributions.

Can we really say that these for sure are different, and not just shifted by chance?

How to test a hypothesis?

This is probably the closest we will get to answering the question: Sure, there is a difference between the means, but the two distributions overlap quite a bit, so maybe this difference is visible simply by chance? Maybe if we had the opportunity to get 1000 subjects in each sample instead of just 18, we would see that the two curves are exactly the same. Or maybe they would be significantly different. The thing is, it is very hard to say based on this line of reasoning.

What if we rephrase the hypothesis as \(H_0: \mu_A - \mu_B = 0\)? Clearly, this is the same hypothesis, but now we'll consider the difference of the means instead.

Since we assumed that the samples A and B are from normal distributions, the means of the samples are normally distributed, and so is the difference in means. IF THE NULL HYPOTHESIS IS TRUE, then the difference between the means is normally distributed with mean \(0\).

How to test a hypothesis?

So, we draw the normal distribution we think the difference of means comes from. That is, the mean is \(0\) and the standard deviation is estimated from the data.

Why is this useful? Because it tells us what to expect if the hypothesis is true. If the null is true, then we would expect the difference in means to be 'close to the middle'. The height of the curve at a given point can be interpreted as how likely it is to observe values near that point. I.e., it would be very unlikely to see a difference in means of 4 or more, while any difference less than \(2.5\) would be expected.

How to test a hypothesis?

We can now actually assess whether the observed value is 'unusually' far away from the assumed value, i.e. \(0\). Below, \(\pm\) the observed difference is marked. Everything filled with grey would be even more extreme deviations from the assumption.

The question we really want to ask is "how far are we from the assumption?" This is not quite possible to answer, but a slightly different question can be answered: "what is the probability of getting something 'further away' from the assumption than what we got?"

This probability is what we call the p-value.
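In code, the grey-area calculation boils down to a tail probability of the null distribution. The sketch below uses the observed difference from this example, but the standard error is an assumed value chosen just for illustration.

```python
# Sketch: two-sided p-value as the area further from 0 than the observed
# difference, under H0: difference ~ Normal(0, se_diff).
from scipy import stats

observed_diff = 2.415   # observed difference in means (from the example)
se_diff = 1.45          # ASSUMED standard error of the difference, for illustration only

p_value = 2 * stats.norm.sf(abs(observed_diff), loc=0, scale=se_diff)
print(round(p_value, 3))   # the total grey area on both sides
```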

How to test a hypothesis?

Now we have a simple way of answering the question: "is there a difference between the two group means?"

If there is no difference, then the probability of observing something at least as far from \(0\) as what we got, the total area of the two grey regions, should not be particularly small. If there is a difference, we would expect it to be very unlikely to get something 'further away' from zero.

In this case, the total area of the grey parts of the graph is 0.099 (our p-value).

So, if the null hypothesis is true, there is almost a \(10\%\) chance of getting something even further away from our assumption than what we observed.

How to test a hypothesis?

Let's recap:

  1. We stated our null hypothesis: there is no difference (\(H_0: \mu_A - \mu_B = 0\)) and our alternative hypothesis: there is a difference (\(H_1: \mu_A - \mu_B \neq 0\)),
  2. We found a test statistic that quantifies how far away from the null hypothesis the observed data seem to be: \(\mu_A - \mu_B\),
  3. We assessed how far our test statistic is from the assumption in \(H_0\). This was done by calculating the probability of being 'further away', in other words, we calculated the p-value.

Luckily, lots of really clever people have thought about these things for a long, long time. So, in general, once the null and alternative hypotheses are stated, we pick an appropriate test, ask our favorite statistical software "how far from our null are we?" and it will give us an answer (in terms of a p-value).

It's as simple as that!

HOWEVER…

… interpreting that p-value is not always easy! It depends on how the null hypothesis is formulated, and when we ask a computer, it is important to make sure that the way we formulate \(H_0\) is the same as the way our favorite software does.
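As a final sketch of why this matters, the snippet below runs the same (simulated) data through scipy's t-test twice: once with the default two-sided alternative and once with a one-sided alternative. Same data, different \(H_1\), different p-value.

```python
# Sketch: the software's formulation of H1 matters. scipy's ttest_ind defaults
# to a two-sided alternative; a one-sided alternative gives a different p-value
# for the same (simulated) data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(3.3, 2.5, 18)   # simulated measurements, group A
group_b = rng.normal(0.9, 2.5, 18)   # simulated measurements, group B

two_sided = stats.ttest_ind(group_a, group_b, alternative="two-sided")  # H1: mu_A != mu_B
one_sided = stats.ttest_ind(group_a, group_b, alternative="greater")    # H1: mu_A > mu_B
print(two_sided.pvalue, one_sided.pvalue)
```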