This document supplements the Hypothesis Testing JOLT, helping students develop intuition about hypothesis testing and the relationships among sample size, effect size, power, and other essential concepts.
Each part contains guiding questions you should try to answer in writing. Don't just think about the answer: put it on paper! Also, copy the graphs and doodle over them as you answer the questions.
When answering each new set of guiding questions, reread your answers to the previous ones.
We use only the Prototype A hypothesis-testing task from the JOLT to develop intuitions. If you want to learn more, try simulating other tasks from the JOLT, and pose (and answer) your own guiding questions!
All simulations are written in very straightforward R code. There are more efficient ways to do this (e.g., the coin package and some tidymodels packages handle it much better); this code exists only to make acquiring intuitions easier. Most code chunks are hidden by default. Press “code” to open them.
Email A: 243 students received it. The proportion who started homework: 0.56
Email B: 257 students received it. The proportion who started homework: 0.39
Do you think there is ACTUALLY a difference in the outcome (the proportion of students starting homework), or was it due to random variation? Just intuitively. What factors are you using to decide? E.g., the difference between the proportions for email A vs. B? The number of participants?
Simulations can help us understand what is going on in an artificial world where we know everything, and let us compare that with what we observe in the real world.
Knowing that each student either started the homework or not (we are dealing with a Bernoulli distribution), we can generate artificial data with these proportions and observe how the sample proportions (or rather, the difference between the proportions for email A and email B) behave.
We will generate \(K = 1000\) samples (artificially repeating the experiment \(K\) times) to see the pattern, compute the difference in proportions for each simulation, and plot them.
We will start by assuming that we sent \(n = 500\) emails in total (243 for A and 257 for B) in each simulated experiment.
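If dist_bernoulli() below is unfamiliar: drawing \(n\) Bernoulli outcomes with success probability \(p\) is just flipping a biased coin \(n\) times, which base R can do as well (a tiny illustration, not part of the simulation itself):

# ten Bernoulli draws with p = 0.56: each element is 1 (started homework) or 0
rbinom(10, size = 1, prob = 0.56)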
library(ggdist)
library(distributional)
library(dplyr)
library(ggplot2)
library(stringr)

# true (population) proportions of homework starters for each email
emailA_p <- 0.56
emailB_p <- 0.39
emailA_distr <- dist_bernoulli(emailA_p)
emailB_distr <- dist_bernoulli(emailB_p)
real_diff <- emailA_p - emailB_p

# sample sizes and the number of simulated experiments
n_a <- 243
n_b <- 257
K <- 1000

sim_results <- tibble(
  difference = numeric(K)
)

# repeat the experiment K times, storing the difference in sample proportions
for (i in 1:K) {
  draw_a <- generate(emailA_distr, n_a)[[1]]
  draw_b <- generate(emailB_distr, n_b)[[1]]
  sim_results[i, "difference"] <- mean(draw_a) - mean(draw_b)
}
sim_results %>%
  ggplot(aes(x = difference)) +
  stat_dotsinterval() +
  xlim(-0.2, 0.6) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank()) +
  ggtitle(
    str_glue("Sample differences (real diff = {real_diff}, n={n_a+n_b})")
  ) +
  geom_vline(xintercept = real_diff, color = "darkgoldenrod3") +
  geom_vline(xintercept = 0, color = "blue4") -> p500
p500
The gold line is the real (population) difference between the proportions of homework starters for email A and email B (in the real world we never observe this real difference); the blue line marks zero difference.
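As a quick numeric companion to the picture (a small addition, using the sim_results from the chunk above), we can count the share of simulated experiments in which the difference came out at or below zero, i.e. where the sample would wrongly suggest email A is no better; with \(n = 500\) this share should be close to zero:

# share of the K simulated experiments with a difference <= 0
mean(sim_results$difference <= 0)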
# same population proportions, but a smaller experiment: n = 100 total
emailA_p <- 0.56
emailB_p <- 0.39
emailA_distr <- dist_bernoulli(emailA_p)
emailB_distr <- dist_bernoulli(emailB_p)
real_diff <- emailA_p - emailB_p

n_a <- 43
n_b <- 57
K <- 1000

sim_results <- tibble(
  difference = numeric(K)
)

for (i in 1:K) {
  draw_a <- generate(emailA_distr, n_a)[[1]]
  draw_b <- generate(emailB_distr, n_b)[[1]]
  sim_results[i, "difference"] <- mean(draw_a) - mean(draw_b)
}
sim_results %>%
  ggplot(aes(x = difference)) +
  stat_dotsinterval() +
  xlim(-0.2, 0.6) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank()) +
  ggtitle(
    str_glue("Sample differences (real diff = {real_diff}, n={n_a+n_b})")
  ) +
  geom_vline(xintercept = real_diff, color = "darkgoldenrod3") +
  geom_vline(xintercept = 0, color = "blue4") -> p100
p100
# an even smaller experiment: n = 50 total
emailA_p <- 0.56
emailB_p <- 0.39
emailA_distr <- dist_bernoulli(emailA_p)
emailB_distr <- dist_bernoulli(emailB_p)
real_diff <- emailA_p - emailB_p

n_a <- 21
n_b <- 29
K <- 1000

sim_results <- tibble(
  difference = numeric(K)
)

for (i in 1:K) {
  draw_a <- generate(emailA_distr, n_a)[[1]]
  draw_b <- generate(emailB_distr, n_b)[[1]]
  sim_results[i, "difference"] <- mean(draw_a) - mean(draw_b)
}
sim_results %>%
  ggplot(aes(x = difference)) +
  stat_dotsinterval() +
  xlim(-0.2, 0.6) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank()) +
  ggtitle(
    str_glue("Sample differences (real diff = {real_diff}, n={n_a+n_b})")
  ) +
  geom_vline(xintercept = real_diff, color = "darkgoldenrod3") +
  geom_vline(xintercept = 0, color = "blue4") -> p50
p50
library(patchwork)
# stack the three plots to compare how the spread changes with sample size
p500 / p100 / p50
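The shrinking spread has a closed-form counterpart: the standard error of a difference in two independent proportions is \(\sqrt{p_A(1-p_A)/n_A + p_B(1-p_B)/n_B}\). The sketch below (an addition to the original; the helper se_diff() is ours) computes it for the three sample sizes, so you can compare it with the width of the dot plots above:

# theoretical standard error of the difference in two independent proportions
se_diff <- function(p_a, p_b, n_a, n_b) {
  sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
}
se_diff(0.56, 0.39, 243, 257)  # n = 500: ~0.044
se_diff(0.56, 0.39, 43, 57)    # n = 100: ~0.099
se_diff(0.56, 0.39, 21, 29)    # n = 50:  ~0.141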
Now let us try to develop a simple prototype of hypothesis testing using simulations.
This setup is different: we don’t know much about what happens in the population and can only rely on the data at hand.
Our experiment had \(n=400\) participants (\(K=1000\) is still the number of simulations).
Say our email A experiment results are:
emailA_res <- c(1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1)
so \(p_a = 0.54\) and the number of students is \(n_a = 170\),
and for email B:
emailB_res <- c(0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1)
so \(p_b = 0.51\) and the number of students is \(n_b = 230\).
So the observed difference in proportions is observed_diff = 0.03. Is email A really more effective?
Let’s start building our simulation from a slightly simplistic definition of the p-value: the p-value is the probability of obtaining data with a difference in proportions equal to or more extreme than the one observed, provided that \(H_0\) is true (there is no real difference in proportions).
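Reading “more extreme” in the two-sided (absolute-value) sense, this can be written as
\[
p\text{-value} = P\left(\left|\hat{p}_A - \hat{p}_B\right| \ge \left|\text{observed diff}\right| \;\middle|\; H_0\right).
\]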
In real life, we don’t know whether \(H_0\) is true, i.e. we don’t know the real (population) proportions for email A and email B; that is why we test!
But using only our experimental data, we can simulate this assumption (that there is no real difference in proportions)!
One way to do it is to randomly interchange observations between the conditions (email A and email B), keeping the group sizes fixed.
# observed difference in proportions from the experiment
observed_diff <- mean(emailA_res) - mean(emailB_res)

# pool the outcomes; shuffling the condition labels (with group sizes
# fixed) simulates a world where the email version makes no difference
res <- c(emailA_res, emailB_res)
condition <- rep(c("A", "B"), c(length(emailA_res), length(emailB_res)))

sim_results <- tibble(
  difference = numeric(K)
)

for (i in 1:K) {
  shuffled <- sample(condition)  # random permutation of the labels
  prop_A <- mean(res[shuffled == "A"])
  prop_B <- mean(res[shuffled == "B"])
  sim_results[i, "difference"] <- prop_A - prop_B
}
sim_results %>%
  ggplot(aes(x = difference)) +
  stat_dotsinterval() +
  xlim(-0.15, 0.15) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank()) +
  ggtitle(
    "Difference distribution under H0 (n=400)"
  ) +
  geom_vline(xintercept = 0, color = "blue4") -> p4
p4
Let’s look at the graph again:
p4
What do we need to add to perform a visual hypothesis test?
Well, we haven’t drawn the observed difference yet, so we can’t make conclusions. Let’s add it!
p4 + geom_vline(xintercept = observed_diff, color = "darkgoldenrod3") -> p5
p5
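The visual test has a direct numeric counterpart (a sketch using the objects defined above, with the two-sided reading of “more extreme”): the simulated p-value is the share of label-shuffled datasets whose difference is at least as extreme as the observed one.

# simulated two-sided p-value: share of shuffled datasets at least as
# extreme (in absolute value) as the observed difference of 0.03
mean(abs(sim_results$difference) >= abs(observed_diff))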
Now say our previous results were only part of the data from our experiment, and the observed difference in proportions stays the same (0.03).