I. How do we design an experiment?
Muriel’s claim seemed crazy. The British scholars and researchers who heard her laughed. No way, they said, could she do it.
What Muriel Tasted
Think about adding cool milk to steaming hot tea. The first few drops of milk get scalded, leaving a hint of a burnt taste, because they have been heated too quickly. If the cool milk is already in the cup, the first drops of hot tea warm the milk more gradually. Surprisingly, Dr. Bristol’s discriminating taste buds could recognize that tiny bit of difference in the taste of the tea.
Ron appeared thoughtful. Muriel said tea tastes better when the milk is added to the cup before the tea. Muriel was convinced adding the tea first, then the milk, was inferior. She claimed she could taste the difference.
Ron wondered, “How could we test this scientifically?” Ron decided Muriel could do a blind taste test. He would have 8 cups of tea prepared, 4 each way, out of Muriel’s sight. Dr. Muriel Bristol, a botanist colleague who worked at the same research lab, would try to separate the cups into two groups, accurately finding the ones that had the milk added first.
Ronald Fisher, the godfather of statistics, later told this story of how his null hypothesis idea was born. He framed a hypothesis in which Dr. Bristol was randomly guessing. What if she just had a coin flip’s chance of getting each cup right? He couldn’t prove that hypothesis, but if Dr. Bristol were extremely accurate it would provide strong evidence that the null hypothesis was false. Fisher knew he could calculate coin flip probabilities and compare them to the observed data from the taste test.
Fisher’s book The Design of Experiments (1935) doesn’t mention the results of the taste test, but reports suggest Dr. Bristol was successful in all 8 cases. Fisher was less concerned with the actual experiment than with the innovative idea of using a negating approach to hypothesis testing. His inspiration for the breakthrough was testing Bristol’s taste buds.
Initializing RStudio
The Mosaic package was created by statistics instructors to help students learn to code in R. Its commands are streamlined to be more intuitive. Execute the code block below to load Mosaic (required each session).
library(mosaic)
II. Null Hypothesis
A null hypothesis is one that can only be disproven, never proven. What was Fisher’s hypothesis? A charlatan who made Dr. Bristol’s claim, if tested, would be blindly guessing. The charlatan would have a 50% chance of guessing each cup correctly. Let’s alter the experiment slightly and test it ourselves.
Original Distribution
In his book, Fisher used the hypergeometric distribution to calculate Muriel’s chances of properly sorting the 8 cups into two groups of four. Because \(\binom{8}{4}=70\), the chance of randomly choosing exactly the 4 milk-first cups out of the 8 is 1 in 70.
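The 1-in-70 figure can be checked in base R. The sketch below is only an illustration, not a reproduction of the book's own working: it counts the possible selections with `choose` and then asks `dhyper` for the hypergeometric probability of picking all 4 milk-first cups when 4 cups are chosen from 4 milk-first and 4 tea-first cups. The variable names are made up for this example.

```r
# Number of ways to choose which 4 of the 8 cups are "milk first"
n_ways <- choose(8, 4)

# Chance that a random selection of 4 cups is exactly the right one
p_all_correct <- 1 / n_ways

# Same probability from the hypergeometric distribution:
# dhyper(x, m, n, k) = P(x milk-first cups among the k chosen),
# with m milk-first and n tea-first cups available
p_hyper <- dhyper(4, m = 4, n = 4, k = 4)

print(n_ways)
print(p_all_correct)
print(p_hyper)
```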
Dr. Bristol will taste eight cups of tea. For each one, we will flip a fair coin. For heads, we pour milk first. For tails, it’s tea first. Each cup could be made either way. We could have 8 cups that were milk-first, or none. A charlatan would, in essence, have to guess the results of 8 coin flips. At each stage, the charlatan would have a fifty-fifty chance of success, so the probability of the charlatan getting all of them right would be: \[\left(\frac{1}{2}\right)^{8}=\frac{1}{2^8} = 0.00390625\]
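As a quick check of the arithmetic for this coin-flip version of the experiment, the same number can be obtained two ways in base R. This is a minimal sketch with illustrative variable names: a direct power, and `dbinom` for 8 successes in 8 fair trials.

```r
# Probability of matching all 8 independent fair coin flips
p_direct <- 0.5^8

# Same value from the binomial distribution: 8 successes in 8 trials
p_binom <- dbinom(8, size = 8, prob = 0.5)

# Number of equally likely heads/tails patterns for 8 cups
n_patterns <- 2^8

print(p_direct)
print(p_binom)
print(n_patterns)
```

With 256 possible patterns instead of 70, a perfect score is harder to reach by luck in the coin-flip design than in the original 4-and-4 design.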
If Dr. Bristol were to get all 8 correct, this would be strong, odds-defying evidence she had not guessed, that she was actually tasting a difference. The idea of the null hypothesis was born.
Given Fisher’s null hypothesis that assumed random guessing, we have a couple of different choices of how to determine the probabilities.
- Empirical: Use randomized trials to estimate probabilities.
- Theoretical: Assume a probability distribution, use a probability density function (pdf) to calculate probabilities.
Traditional statistics relies upon assumed distributions and pdf’s, most typically bell-shaped pdf’s. We live at a statistically interesting moment in history: randomization methods are gaining traction in scientific research publications, mainly because the computing power they require is readily available in virtually all research settings. Perhaps in the next decade or so, randomization will replace traditional methods as the gold standard for research-level statistics.
III. Empirical Probabilities with Randomization
RStudio has tools both to conduct the random trials and then collect and analyze the results. Since the tea-first or milk-first choice was determined by coin flipping, random coin flips simulate trying to match the pattern of heads and tails. Here’s the R code for 8 coin flips.
#library(mosaic)
rflip(8)
Flipping 8 coins [ Prob(Heads) = 0.5 ] ...
H T T H H H T H
Number of Heads: 5 [Proportion Heads: 0.625]
Our plan is to repeatedly flip 8 coins and track what happens for each group of 8. Mosaic’s do function behaves as one would expect. In the example below, it repeats the rflip(8) process 10,000 times, which should produce accurate empirical estimates of the probabilities.
coins = do(10000) * rflip(8)
tally(~ heads, data = coins)
heads
0 1 2 3 4 5 6 7 8
37 335 1112 2229 2683 2199 1083 281 41
The pattern is bell-shaped as can be seen in the histogram.
histogram(~heads, data = coins,
width = 1,
type = "count")

Let \(x\) be the number of successful guesses out of 8. Using the tallies, we can calculate empirical probabilities.
What is the probability of randomly guessing and getting all 8 correct? About 0.4%. \[P(x = 8)=\frac{41}{10000}\]
What is the probability of randomly guessing and getting 7 or more correct? About 3.2%. \[P(x \geq 7)=\frac{281+41}{10000}=\frac{322}{10000}\]
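The same proportions can be computed in code rather than read off the tally by hand. The sketch below is an assumption-laden stand-in: it uses base R's `rbinom` in place of `do(10000) * rflip(8)` so that it runs without Mosaic, and the seed is arbitrary, so the counts will not match the tally above exactly.

```r
set.seed(2024)  # arbitrary seed so the run is repeatable

# Each entry is the number of heads in one set of 8 fair coin flips
# (base-R stand-in for do(10000) * rflip(8))
heads <- rbinom(10000, size = 8, prob = 0.5)

# Empirical probabilities are proportions of the 10,000 trials
p_all_8  <- mean(heads == 8)
p_7_plus <- mean(heads >= 7)

print(p_all_8)
print(p_7_plus)
```

Taking the mean of a TRUE/FALSE comparison gives the fraction of trials where it holds, which is exactly the count divided by 10,000.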
What should the cutoff be? We can all probably agree that if Dr. Bristol gets all 8 correct (which she did), there is ample evidence she is not a charlatan. But are 7 correct guesses enough?
The chosen cutoff for what constitutes evidence against the null hypothesis is called the level of significance, denoted by the Greek letter alpha (\(\alpha\)). In modern statistics, we tend to use \(\alpha = 0.05\) as a default in the absence of any other information.
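To see how a level of significance turns into a cutoff, one can compare each empirical upper-tail probability to \(\alpha\). The sketch below copies the tallies from the simulation output shown earlier; the variable names (`counts`, `upper_tail`, `cutoff`) are illustrative only.

```r
alpha <- 0.05  # default level of significance

# Tallies for heads = 0, 1, ..., 8, copied from the simulation output above
counts <- c(37, 335, 1112, 2229, 2683, 2199, 1083, 281, 41)
names(counts) <- 0:8

# Empirical P(x >= k) for each k: add up the tallies from k upward
upper_tail <- rev(cumsum(rev(counts))) / sum(counts)

# Smallest number of correct cups whose upper-tail probability is below alpha
cutoff <- min(as.integer(names(upper_tail))[upper_tail < alpha])

print(round(upper_tail, 4))
print(cutoff)
```

With these tallies, 7 or more correct falls below \(\alpha = 0.05\), while 6 or more correct (1405 of 10,000 trials) does not.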
IV. Theoretical Probabilities
Since we are modeling Muriel’s guessing with coin flips, the Binomial distribution with 50% chance of success applies. The R function choose produces the binomial coefficients we need.
Let \(x\in \{0,1,2,\dots,7,8\}\) be a random variable indicating the number of successful matches out of eight attempts. If \(x=7\), for example, we can calculate the theoretical probability using a term of the binomial expansion. \[P(x=7)=\binom{8}{7}(.5)^7(.5)^1\] We can evaluate this probability using R code.
choose(8,7)*.5^7*.5^1
[1] 0.03125
Let’s have R create a T-Chart of \(x\)-values and probabilities \(p(x)\).
# Creating the function: binomial probability of t successes in 8 fair trials
pFun <- function(t) {
  choose(8, t) * (.5)^t * (.5)^(8 - t)
}
# Creating the T-chart
x = 0:8
tChart = data.frame(x, pFun(x))
names(tChart) = c("x", "p(x)")
print(tChart)
R Coding. Creating T-charts for pdf’s is not part of this course. Creating a function like pFun, however, will be used in the Exercises.
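Before relying on a hand-built function like pFun, it can help to run two sanity checks. The sketch below is one possible way to do that, not part of the original lesson: it confirms that the probabilities over all outcomes add up to 1 and that they agree with R's built-in binomial function `dbinom`. The names `total` and `matches_dbinom` are made up for this example.

```r
# Binomial probability of t successes in 8 fair-coin trials (same formula as pFun above)
pFun <- function(t) {
  choose(8, t) * (.5)^t * (.5)^(8 - t)
}

x <- 0:8

# The probabilities over all possible outcomes should add up to 1
total <- sum(pFun(x))

# R's built-in binomial function should give the same values
matches_dbinom <- isTRUE(all.equal(pFun(x), dbinom(x, size = 8, prob = 0.5)))

print(total)
print(matches_dbinom)
```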
The theoretical probability that Dr. Bristol earns 7 or more successes with random guessing is calculated using function pFun(k), a probability function we created above to do calculations based on the binomial theorem for \(k\) successes out of 8 trials.
pFun(7)+pFun(8)
[1] 0.03515625
We can also sum over a range of values by placing a colon between the start and stop values, which passes the whole sequence 7, 8 to pFun at once. (This will be helpful in the Exercises.)
sum(pFun(7:8))
[1] 0.03515625
What is the probability of randomly guessing and getting all 8 correct? About 0.4%. \[P(x = 8)=\binom{8}{8}(.5)^8=\frac{1}{256}=.00390625\]
What is the probability of randomly guessing and getting 7 or more correct? About 3.5%. \[P(x \geq 7)=\frac{\binom{8}{7}+\binom{8}{8}}{256}=\frac{9}{256}=.03515625\]
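The empirical tally (322 of 10,000) and the theoretical value (9/256) for 7 or more correct differ slightly. One hedged way to judge whether that gap is just simulation noise is to express it in units of the standard error of a simulated proportion. This is a rough sketch under the usual binomial assumptions, with illustrative variable names.

```r
n_trials    <- 10000
empirical   <- 322 / n_trials  # from the tallies: 281 + 41 trials with 7 or 8 heads
theoretical <- 9 / 256         # exact binomial value

# Standard error of a simulated proportion when the true probability is `theoretical`
se <- sqrt(theoretical * (1 - theoretical) / n_trials)

# Gap between simulation and theory, in standard-error units
z <- (empirical - theoretical) / se

print(round(c(empirical = empirical, theoretical = theoretical, se = se, z = z), 4))
```

A gap of fewer than two standard errors is the kind of discrepancy one would expect from chance alone in 10,000 trials.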
V. Conclusion
What would Fisher do? Fisher had two suggestions to provide even more evidence for Bristol’s claim. First, having Bristol taste test more than 8 cups of tea would provide even more evidence to falsify the null. If Bristol were only 90% accurate but tasted 20 cups of tea, the probability of random guessing doing at least as well would drop by more than a factor of 10. Second, Fisher encouraged validation studies, repeating the same experiment multiple times to compare and contrast the results.
Fisher’s simple advice is often ignored by modern researchers. Too many studies with small sample sizes get published, and too few validation studies are performed. The results have led to the replication crisis in psychology and the reproducibility crisis in medicine. Only when we design good experiments and replicate them within different populations do we get valid scientific research results.
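The effect of tasting more cups can be checked directly. The sketch below makes an assumption about how to read the example: "90% accurate on 20 cups" is taken to mean 18 or more correct, and the comparison is against getting all 8 of 8 correct in the coin-flip design. Variable names are illustrative.

```r
# P(18 or more correct out of 20) under random guessing
# (assumption: "90% accurate" is read as at least 18 of 20)
p_20 <- sum(dbinom(18:20, size = 20, prob = 0.5))

# P(all 8 correct out of 8) under random guessing, for comparison
p_8 <- dbinom(8, size = 8, prob = 0.5)

print(p_20)
print(p_8 / p_20)  # how many times smaller the 20-cup probability is
```

Under these assumptions the 20-cup probability is 211 in 1,048,576, so a slightly imperfect score on a larger test is much harder to explain by luck than a perfect score on a small one.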
VI. Exercises
1. Using the Mosaic function rflip, flip 16 coins and count the number of Heads. Repeat 10,000 times using the Mosaic function do. Using a histogram with the type parameter set to “count,” estimate the probability that Dr. Bristol would get at least 14 correct by random guessing if she tasted 16 cups of tea, each of which was randomly chosen to be “tea first” or “milk first.”
2. Find the theoretical probability in Exercise 1 by altering the code block that created the pFun function. Use your new pFun function and R’s sum function for the calculations. How does your theoretical calculation compare to your empirical estimate in Exercise 1?
pFun <- function(t) {
  choose(16, t) * (.5)^t * (.5)^(16 - t)
}
3. For 32 coin flips (and 10,000 randomized draws), estimate the probability that Dr. Bristol would get at least 24 correct if she were guessing at random. Use a histogram with the type parameter set to “count.” Update pFun to find the theoretical probability and compare the two results.
4. How many successes out of 100 would Dr. Bristol need to have before you would believe she was not guessing at random? Explain your reasoning based on empirical or theoretical calculations.
