Null hypothesis significance testing is a central tool in science. Although this paradigm has come under severe scrutiny and skepticism in recent years, in part due to problems with the interpretation of p-values, an enormous amount of progress can be, and has been, achieved with null hypothesis testing. In fact, much of modern science over the last century has relied heavily on this paradigm. So, despite the problems one can encounter in the NHST framework, it still contains many amazingly powerful inferential tools. An ounce of caution and some common sense can help you avoid its pitfalls and still arrive at interesting conclusions from your data.
What I aim to do in this lesson is not only show you how to perform many common NHSTs with R, but also how to properly interpret your results. Whether you are a fan of NHST or not, remember that there is no substitute for critical thinking. I will argue that the problems lie not with the NHST framework itself, but in the failure to recognize the limits of one’s own data and of what any given hypothesis test can say. Recognizing those limits can help you maximize what NHSTs can reveal about your data while also avoiding overconfidence in your results.
Before we dig into an example, I think it will be helpful to first explain the concept behind null hypothesis testing. When we conduct experiments as researchers, we are often interested in whether we see a difference in some response variable across groups in our data. For example, think about the effectiveness of a new vaccine, whether your parents’ education level influences your expected college GPA, or variability in gene expression across different populations of a particular species.
The null hypothesis is essentially a Devil’s advocate stance, or an innocent-until-proven-guilty position, that posits there will be no difference between the groups we are interested in. The alternative hypothesis, by contrast, posits that there will be a difference (and sometimes we specify which direction we expect that difference to take). When we have sufficient evidence against the null hypothesis, we reject the null.
However, we can’t just look at data and come to a conclusion for or against the null hypothesis. We need test statistics to help us out.
Suppose we set up a trial to measure hand-eye coordination in a population of volunteers by making them throw three-pointers before and after giving them a new super pill that will instantly turn them into star athletes. But of course we know, and will keep it a closely guarded secret, that the pill is a placebo.
We know for a fact that the null hypothesis is true (i.e. there will be no difference in three-pointer ability before or after taking the placebo). Why? Because we made it that way on purpose by using a placebo.
However, there is always a chance that we will find a difference between our groups completely at random, or perhaps due to imperfections in the experimental design. For example, if we let participants throw three-pointers immediately after taking the placebo, maybe their average really will go up, but only because they had a chance to warm up during the initial trial.
No matter the reason, it’s basically inconceivable that two samples of data would ever be exactly the same. So how could we ever tell whether two (or more) groups are actually different? Or, put another way, how do I know whether a treatment actually worked? This brings us to the concept of statistical significance.
Significance in statistics is a tricky subject. Put simply, significance is the threshold that helps you determine whether you should reject your null hypothesis. We usually express significance thresholds by setting an alpha-level. The alpha-level corresponds to the Type I error rate, or the probability of rejecting the null when it is in fact true (the false positive rate). We typically compare the results from our experiments to our stated alpha-level by calculating a p-value.
A p-value measures only one thing: the estimated probability that repeated experiments would obtain a result as extreme as, or more extreme than, yours, assuming that the null is true. That last piece is crucial: assuming that the null is true. This means that even if there really is no difference between groups, in (hypothetical) repeated studies you should still see your results (or more extreme ones) some percentage of the time, just due to the random nature of the universe.
Alpha-levels and p-values work together to help you confront your hypotheses with data. You use the alpha-level to set the standard for how unlikely your data must be before rejecting the null hypothesis. The p-value tells you how extreme your data are, under the assumption that the null is actually true. Comparing the p-value to the alpha-level is how we determine statistical significance and decide whether to reject the null hypothesis.
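To make this concrete, here is a small simulation sketch (with made-up numbers) of what an alpha of 0.05 buys you: if we repeatedly draw two samples from the exact same distribution, so that the null really is true, roughly 5% of the resulting p-values should still fall at or below 0.05.
set.seed(42)
p_under_null <- replicate(10000, t.test(rnorm(25), rnorm(25))$p.value) #Both groups come from the same distribution, so the null is true
mean(p_under_null <= 0.05) #Proportion of "significant" results; should be close to the alpha of 0.05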
For completely arbitrary reasons having to do with a possibly apocryphal story about a lady tasting tea, our old buddy R.A. Fisher gave us 0.05 as an alpha threshold for rejecting the null. Although Fisher may never have intended for this threshold to take on the life that it has, it is now all but enshrined in modern science as the gold standard for rejecting null hypotheses. That is, when our p-value is at or below 0.05, we typically feel comfortable calling our experiments “successful” or declaring an effect “real”. But a non-significant result in the statistical sense hardly means there is no information of value there. Furthermore, it is easy to either misinterpret or over-interpret what a p-value really means. A p-value cannot tell you anything about the mechanistic importance of a parameter, or the size of an effect, or how confident you are in the results. It simply measures how likely you are to obtain your results in repeated experiments in a world where the null hypothesis is true.
Therefore, in addition to reporting a p-value, it is important to provide a suite of additional statistics in your work. Things like sample sizes, effect sizes, confidence intervals, and statistical power are all just as important as the p-value itself and I will show you how to calculate those things shortly.
To start, let’s load the packages and the data we will use in today’s examples. The first data set I want to work with is the SandwichAnts data set from the Stat2Data package. This data set is from an experiment done by a student who counted the number of ants on different sandwich types in laboratory trials. There are 48 observations on 5 variables in this data set.
library(ggplot2)
library(dplyr)
library(tidyr)
library(Stat2Data)
data("SandwichAnts")
d <- as_tibble(SandwichAnts) #Rename data frame for ease of typing and convert to tibble
head(d)
## # A tibble: 6 x 5
## Trial Bread Filling Butter Ants
## <int> <fct> <fct> <fct> <int>
## 1 1 WholeWheat HamPickles no 34
## 2 2 MultiGrain PeanutButter yes 47
## 3 3 Rye HamPickles yes 67
## 4 4 MultiGrain HamPickles yes 63
## 5 5 WholeWheat HamPickles no 65
## 6 6 WholeWheat HamPickles yes 58
Suppose we hypothesized before running the experiments that more ants would be attracted to sandwiches with butter on them than to those that were un-buttered. The data include a variable called Butter, which is a binary indicator telling us whether the sandwich had been buttered. Let’s use that information to classify our data into groups and investigate our hypothesis. We can test our hypothesis with a simple two-sample t-test. A t-test is one of the most basic of all statistical tests, and it simply helps you decide whether a difference exists between two independent distributions.
The first thing we might want to do is visualize the data.
ggplot(d, aes(x = Butter, y = Ants, fill = Butter)) +
geom_boxplot() +
scale_fill_manual(values = c("grey", "gold2")) +
theme_classic()
At first glance, the plot seems to suggest there may be a real difference between our two groups. Let’s set up our null and alternative hypotheses and then conduct our test.
H0: There is no difference in the mean number of ants between groups.
Ha: There is a difference in the mean number of ants between groups.
t.test(Ants ~ Butter, d)
##
## Welch Two Sample t-test
##
## data: Ants by Butter
## t = -2.6051, df = 45.956, p-value = 0.01233
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -19.056335 -2.443665
## sample estimates:
## mean in group no mean in group yes
## 38.125 48.875
qt(0.975, 46) #Calculates critical t-value for 95% confidence on 46 df
## [1] 2.012896
Along with our p-value, t.test() spits out our t-statistic, the degrees of freedom, a 95 percent confidence interval for the difference between group means, and the sample mean for each group. I calculated the critical t-value by hand because the t.test() output does not give it to you. You may not need to know the critical t-value, but for the sake of completeness, now you know how to find it.
Our results gave us the estimated mean in each group, but we are interested in the difference between the two groups. And not only are we interested in that difference, we also want to know the effect size. To get the effect size I will calculate a statistic called Cohen’s D, which is simply the difference in sample means divided by the pooled standard deviation.
delta <- abs(38.125 - 48.875)
delta
## [1] 10.75
st.dev <- sd(d$Ants)
delta/st.dev
## [1] 0.7096222
Another way to get Cohen’s D is with the cohensD() function in the lsr package. Let’s check my work.
library(lsr)
cohensD(Ants ~ Butter, d)
## [1] 0.7520397
Not bad. The small discrepancy arises because my by-hand version used the overall standard deviation of Ants, whereas cohensD() uses the pooled within-group standard deviation by default. Either way, we got an effect size of ~ 0.71 - 0.75, which is decent. Cohen suggested that D values of 0.2, 0.5, and 0.8 correspond to small, medium, and large effect sizes, respectively.
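For reference, here is a quick sketch of the pooled version by hand, which should line up with the cohensD() value (delta is the mean difference we calculated above).
d %>%
  group_by(Butter) %>%
  summarise(n = n(), v = var(Ants)) %>% #Per-group sample sizes and variances
  summarise(sd_pooled = sqrt(sum((n - 1) * v) / (sum(n) - 2))) %>% #Pooled standard deviation
  mutate(cohens_d = delta / sd_pooled) #Should match the cohensD() output above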
Okay, bringing this all together: our t-statistic was |-2.61| (remember, the test is two-sided, so it’s the absolute value we care about), which exceeded the critical t-value of 2.01. We had roughly 46 degrees of freedom and a p-value of about 0.01 (which is smaller than the conventional 0.05 significance level). The mean difference between groups was 10.75 [95% CI: 2.44 - 19.06] and the effect size (Cohen’s D) was about 0.75.
If I were to interpret these results for a report, I would conclude that, based on the evidence presented above, ants like butter. In fact, buttered sandwiches attracted on average about 28% [95% CI: 6 - 50%] more ants than un-buttered sandwiches (the mean difference and its confidence interval divided by the un-buttered mean). Perhaps we should omit the butter, then, if we are planning a picnic?
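If you want to reproduce those percentages, here is the arithmetic, with the un-buttered group mean used as the baseline (a judgment call on my part):
means <- tapply(d$Ants, d$Butter, mean) #Group means: no = 38.125, yes = 48.875
ci <- t.test(Ants ~ Butter, d)$conf.int #95% CI for the difference in means
(means["yes"] - means["no"]) / means["no"] #~0.28, i.e. about 28% more ants with butter
rev(abs(ci)) / means["no"] #~0.06 to 0.50, the corresponding 95% CI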
Another extremely useful NHST is the Analysis of Variance (ANOVA). It is basically an extension of the two-sample t-test (and both are basically special cases of linear models, but never mind that for now). The basic utility of the ANOVA is that it allows you to compare the means of more than two independent groups.
ANOVAs follow the same basic assumptions as most NHSTs, which is to say normally distributed data, similar variances, and independent groups. There are also all kinds of modifications to the ANOVA to accommodate specific study designs or questions. But for this example, I’m just going to stick with the simplest form, which is a one-way ANOVA.
Let’s look at an example about the fillings that ants might prefer. In this experiment, the sandwiches had three different kinds of fillings.
levels(d$Filling)
## [1] "HamPickles" "PeanutButter" "Vegemite"
Before running our ANOVA, let’s summarize and visualize the data.
d %>%
group_by(Filling) %>%
summarise(count = n(),
mean = mean(Ants),
ci = qnorm(0.975)*sd(Ants)/sqrt(count)) %>% #Calculation for 95% CI
ggplot(aes(x = Filling, y = mean, group = 1)) +
geom_point(size = 3) +
geom_line() +
geom_errorbar(aes(ymin = mean - ci, ymax = mean + ci), width = 0.25) +
theme_classic()
Again, visualizing the data seems to suggest that the groups differ. Let’s put that to the test with an ANOVA. We can do that with the aov() function.
a <- aov(Ants ~ Filling, d)
summary(a)
## Df Sum Sq Mean Sq F value Pr(>F)
## Filling 2 3721 1860 11.85 7.35e-05 ***
## Residuals 45 7065 157
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It looks like the ANOVA suggests that a difference exists among our sandwich groups. Typically with an ANOVA, there are a couple of key bits of information that you should report along with the p-value: you definitely need the F-statistic and the degrees of freedom for the main effect and the residuals. So, for example, if I were to write up these results I would report F(2, 45) = 11.85, p < 0.001.
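Since I mentioned the ANOVA’s assumptions earlier, here is a minimal sketch of one way you might check them on the fitted model (graphical checks plus one formal test; other approaches are perfectly reasonable).
par(mfrow = c(1, 2))
plot(a, which = 1:2) #Residuals vs. fitted values and a normal Q-Q plot
bartlett.test(Ants ~ Filling, d) #Formal test for homogeneity of variance across fillings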
But wait, we almost forgot about effect size. For an ANOVA, you can report eta-squared as a measure of effect size. Eta-squared is similar to the R-squared statistic you might see in a linear regression and you can pretty much interpret it the same way. It tells you the proportion of variance you can attribute to the variables tested. To calculate this statistic, we will use the etaSquared() function from the lsr package. The etaSquared() function will take the fitted aov() object as an input.
etaSquared(a)
## eta.sq eta.sq.part
## Filling 0.3449379 0.3449379
To interpret: we can attribute about 34% of the variance in our data to the type of filling in the sandwich.
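As a sanity check, you can get essentially the same number straight from the ANOVA table above, because eta-squared is just the sum of squares for Filling divided by the total sum of squares.
3721 / (3721 + 7065) #SS_Filling / SS_total using the (rounded) values from summary(a); roughly 0.345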
The ANOVA does a great job of telling us that a difference exists among groups, but we need to do some additional digging to find out where that difference occurred. To do this, we can use a post-hoc test. A common post-hoc pairwise comparison test is Tukey’s range test (Tukey’s HSD). Note that post-hoc tests like Tukey’s HSD are typically only run once the ANOVA itself has returned a significant result. Let’s go ahead and compute it. The TukeyHSD() command simply takes the fitted ANOVA object as its argument.
t <- TukeyHSD(a)
t
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Ants ~ Filling, data = d)
##
## $Filling
## diff lwr upr p adj
## PeanutButter-HamPickles -15.125 -25.86202 -4.387985 0.0038299
## Vegemite-HamPickles -20.875 -31.61202 -10.137985 0.0000698
## Vegemite-PeanutButter -5.750 -16.48702 4.987015 0.4037366
The output gives us the pairwise difference in means along with the associated 95% confidence intervals and p-values. From this table we can see that both peanut butter and vegemite sandwiches had significantly fewer ants on them than the ham and pickles sandwiches.
Let’s make a nice plot to visualize these differences.
as.data.frame(t$Filling) %>%
tibble::rownames_to_column("Contrast") %>%
ggplot(aes(x = diff, y = Contrast)) +
geom_point(size = 3) +
geom_linerange(aes(xmin = lwr, xmax = upr), size = 1.5, alpha = 0.5) +
geom_vline(aes(xintercept = 0), lty = 2) +
theme_classic()
You might ask: why not just run multiple pairwise t-tests to look for differences between groups? The reason is because each NHST has its own alpha-level (Type I error rate) and running multiple tests increases our risk of committing a Type I error somewhere along the way.
It’s like taking random draws from a deck of cards for a game of poker. Getting pocket aces is rare. In fact, there is only about a 0.45% chance (a probability of roughly 0.0045) that you will draw two aces in any given hand. In other words, there is a 99.55% chance that you will NOT draw pocket aces. However, the more hands of poker you play, the more likely it is that you will draw pocket aces at least once. For example, let’s calculate the probability of drawing pocket aces at least one time in 100 hands.
1 - (0.9955^100) #1 minus the probability of NOT drawing pocket aces in 100 hands
## [1] 0.3630191
The probability of drawing pocket aces at least once in 100 hands is about 36%. Therefore, if you play a lot of poker you might not be too surprised to draw that hand.
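In case you are wondering where that 0.45% per-hand figure comes from, it is just a product of two probabilities:
(4/52) * (3/51) #Probability of being dealt two aces in a single hand, ~0.0045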
The same goes for statistical significance tests. If we had three groups we insisted on comparing with pairwise t-tests, we would have to perform three tests. If we set our alpha at 0.05, our inflated Type I error rate would be:
1 - (0.95^3)
## [1] 0.142625
Around 14%. Depending on the circumstance, we might not want to accept that 14% probability. Maybe this example is a bit dramatic, but would you fly on a plane that caught fire 14% of the time? This is the principle behind post-hoc tests like Tukey’s HSD: they try to correct for that inflated Type I error rate. One way Tukey’s HSD does this is by relying on a distribution with fatter tails (the studentized range distribution), which is to say it is more skeptical of extreme values than most conventional test distributions.
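If you are curious, here is a minimal sketch of that idea using the studentized range distribution that Tukey’s HSD is built on, plugged in with our example’s 3 groups and 45 residual degrees of freedom:
qt(0.975, df = 45) #Critical value for a single two-sided t-test at alpha = 0.05
qtukey(0.95, nmeans = 3, df = 45) / sqrt(2) #Tukey's HSD critical value expressed on the t scale; notably larger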
Now what about our statistical power? First, what is statistical power?
Technically speaking, statistical power is the probability that we reject our null hypothesis when it is false. We can think of power as the probability of detecting an effect if one really exists. Another way of looking at it might be to take 1 - power, which would correspond to the probability of failing to detect a difference when one exists (Type II error). In practice, one should really conduct a power analysis before starting an experiment to determine how large a sample size must be to see an effect of a desired size within an acceptable amount of error.
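If that definition feels abstract, here is a rough simulation sketch of power using made-up numbers: a true effect of d = 0.8 and 20 samples per group.
set.seed(1)
p_sim <- replicate(5000, t.test(rnorm(20, mean = 0), rnorm(20, mean = 0.8))$p.value) #The effect is real here by construction
mean(p_sim < 0.05) #Proportion of experiments that detect the effect, i.e. the power; should land near the pwr.t.test() result for n = 20 shown below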
Sometimes you might be tempted (or asked) to perform a power analysis post-hoc to determine how much power your study design had to detect an effect in the first place. This is always a bad idea. The reason is that the observed effect size from your experiment is never guaranteed to equal the true effect size in the population (which is unknown). And because statistical power is tied to effect size (and to p-values), post-hoc power analyses are essentially uninformative. A more detailed explanation on this subject can be read in this excellent blog post.
The main takeaway here is: do your power analysis before you start your experiment. No amount of fancy statistics can salvage a study that was poorly designed to begin with.
So, instead of setting a bad example and performing a power analysis for the SandwichAnts experiment post-hoc, I will instead give you an example of what you should do prior to an experiment. A convenient R package called pwr will help us do just that. In this power analysis I am going to imagine an experiment where I would like to use a two-sided, two-sample t-test to detect an effect size of at least 0.8 (large), at a significance level of 0.05, with a power of at least 0.9. This would set up a pretty robust study. As you will see, pwr.t.test() will tell me exactly how many samples I’m going to need per group to achieve my desired power to detect my desired effect.
library(pwr)
pwr.t.test(d = 0.8, sig.level = 0.05, power = 0.9, type = "two.sample", alternative = "two.sided")
##
## Two-sample t test power calculation
##
## n = 33.82555
## d = 0.8
## sig.level = 0.05
## power = 0.9
## alternative = two.sided
##
## NOTE: n is number in *each* group
Now I know I need at least 34 n per group to meet my study design’s specifications. However, sometimes we are limited to a fixed number of samples for whatever reason and we might still want to know how much power our experiment is going to have. Suppose I know I will only have 20 n per group for a forthcoming experiment. How will that affect my power?
pwr.t.test(d = 0.8, sig.level = 0.05, n = 20, type = "two.sample", alternative = "two.sided")
##
## Two-sample t test power calculation
##
## n = 20
## d = 0.8
## sig.level = 0.05
## power = 0.6934042
## alternative = two.sided
##
## NOTE: n is number in *each* group
Now I know that if I only had 20 n per group, my power would drop to about 0.69, and I would need to take that fact into account when writing up my results. The pwr package includes a lot of different functions, including pwr.anova.test() for ANOVAs and pwr.t2n.test() for t-tests with unequal group sizes, among others. I encourage you to explore the pwr package for your own work and choose the right function for your own study design.
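For example, a hypothetical one-way ANOVA design with three groups, a medium effect size (Cohen’s f = 0.25), an alpha of 0.05, and 80% power would look like this:
pwr.anova.test(k = 3, f = 0.25, sig.level = 0.05, power = 0.8) #Returns the required n per group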
Thanks for sticking with it until the end!
As I briefly mentioned, ANOVAs are basically just extensions of simple linear models. This means we could have also just used R’s linear model function lm() to arrive at the same conclusions.
model <- lm(Ants ~ Filling, d)
summary(model)
##
## Call:
## lm(formula = Ants ~ Filling, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.500 -10.094 2.938 9.500 22.375
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.500 3.133 17.717 < 2e-16 ***
## FillingPeanutButter -15.125 4.430 -3.414 0.00136 **
## FillingVegemite -20.875 4.430 -4.712 2.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.53 on 45 degrees of freedom
## Multiple R-squared: 0.3449, Adjusted R-squared: 0.3158
## F-statistic: 11.85 on 2 and 45 DF, p-value: 7.351e-05
The lm() output returns an F-statistic just like aov(). Compare the lm() output F-statistic (11.85, p < 0.001) to the aov() F-statistic (11.85, p < 0.001). Also notice that lm() returns an R-squared value (0.34) and compare that to the eta-squared we calculated earlier (0.34).
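If you want to convince yourself they really are the same model, you can ask the fitted lm() object for its ANOVA table directly; it should reproduce the aov() table above.
anova(model) #Same F-statistic, degrees of freedom, and p-value as summary(a)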
An advantage of fitting the data with a linear model, rather than running it as an ANOVA, is that we can also look at each of the coefficients from the lm() output and interpret them directly.
model$coefficients
## (Intercept) FillingPeanutButter FillingVegemite
## 55.500 -15.125 -20.875
In this model, the intercept represents the first level of our Filling variable, which was HamPickles. Therefore the intercept is the model’s estimated mean for that group. To get the estimated means for the other groups, we simply add their coefficients to the intercept.
#PeanutButter
55.5 + (-15.125)
## [1] 40.375
#Vegemite
55.5 + (-20.875)
## [1] 34.625
Now we have the model-implied estimate of the mean for each group, but we still want the pairwise differences between groups. Unfortunately, TukeyHSD() only accepts fitted ANOVA objects as input. We can still generate our contrasts and look at the pairwise differences, though, by calculating the estimated marginal means: the model-implied group means, averaged over any other variables in the model. We can accomplish this with the emmeans package.
library(emmeans)
emmeans(model, pairwise ~ Filling)
## $emmeans
## Filling emmean SE df lower.CL upper.CL
## HamPickles 55.5 3.13 45 49.2 61.8
## PeanutButter 40.4 3.13 45 34.1 46.7
## Vegemite 34.6 3.13 45 28.3 40.9
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## HamPickles - PeanutButter 15.12 4.43 45 3.414 0.0038
## HamPickles - Vegemite 20.88 4.43 45 4.712 0.0001
## PeanutButter - Vegemite 5.75 4.43 45 1.298 0.4037
##
## P value adjustment: tukey method for comparing a family of 3 estimates
The emmeans() output includes a contrasts table, similar to the TukeyHSD() output, which tells you exactly where the pairwise differences occurred along with their associated p-values (adjusted here with the Tukey method).