About this document

This document supplements the Hypothesis Testing JOLT. It is meant to help students develop intuition about hypothesis testing and the relationships between sample size, effect size, power, and other essential concepts.

Each part contains guiding questions you should try to answer in writing. Do not just think about the answer – put it on paper! Also, copy the graphs and doodle over them to answer the questions.

When answering each new set of guiding questions, reread your answers to the previous ones.

We take only the Prototype A task from the JOLT and use it to develop intuitions. If you want to learn more, try simulating other tasks from the JOLT, and add and answer your own guiding questions!

All simulations are written in deliberately straightforward R code. There are more efficient ways to do this (e.g. the coin package and some tidymodels packages handle it much better); this code is only meant to make acquiring intuitions easier. Most code chunks are hidden by default. Press “code” to open them.

Task (Prototype A from JOLT)

Email A: 243 students received it. The proportion who started homework: 0.56

Email B: 257 students received it. The proportion who started homework: 0.39

Do you think there is ACTUALLY a difference in the outcome (the proportion of students starting homework), or was this due to random variation? Just intuitively. What factors are you using in trying to decide? E.g., the difference between the proportion/mean for email A vs B? The number of participants?

Simulations can help us understand what is going on in an artificial world where we know everything and help us compare this with what we observe in the real world.

What would happen if these proportions (0.56, 0.39) were true for all students?

Knowing that each student either started homework or not (we are dealing with a Bernoulli distribution), we can generate artificial data with these proportions and observe how sample proportions (or, actually, the difference between the proportions for email A and email B) would behave.

We will generate \(K = 1000\) samples (artificially repeat the experiment \(K\) times) to see the pattern, find the difference in proportions for each simulation, and draw them.

(Assuming n = 500)

We start by assuming that we send \(n=500\) emails (243 for A and 257 for B, as in the task) in each simulated experiment.

library(ggdist)          # stat_dotsinterval() for the dot plots
library(distributional)  # dist_bernoulli() and generate()
library(dplyr)
library(ggplot2)
library(stringr)

# True (population) proportions of homework starters
emailA_p <- 0.56
emailB_p <- 0.39
emailA_distr <- dist_bernoulli(emailA_p)
emailB_distr <- dist_bernoulli(emailB_p)

real_diff <- emailA_p - emailB_p

# Group sizes from the task (243 + 257 = 500 emails in total)
n_a <- 243
n_b <- 257

K <- 1000  # number of simulated experiments

sim_results <- tibble(
  difference = numeric(K)
)

for (i in 1:K) {
  draw_a <- generate(emailA_distr, n_a)[[1]]
  draw_b <- generate(emailB_distr, n_b)[[1]]
  # Difference in sample proportions for this simulated experiment
  sim_results[i, "difference"] <- mean(draw_a) - mean(draw_b)
}

sim_results %>%
  ggplot(aes(x = difference)) +
  stat_dotsinterval() +
  xlim(-0.2, 0.6) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank()) +
  ggtitle(
    str_glue("Sample differences (real diff = {real_diff}, n={n_a+n_b})")
  ) +
  geom_vline(xintercept = real_diff, color = "darkgoldenrod3") +
  geom_vline(xintercept = 0, color = "blue4") -> p500

p500

The golden line is the real (population) difference between the proportions of homework starters for emailA and emailB (in the real world we never observe the real difference); the blue line marks zero difference between them.

Guiding questions:

  • What does each gray dot on the graph represent?
  • What does a zero difference between the proportions of homework starters for emailA and emailB mean?
  • What would a difference of 0.3 mean?

(Assuming n = 100)

# Same simulation, now with a five times smaller total sample
emailA_p <- 0.56
emailB_p <- 0.39
emailA_distr <- dist_bernoulli(emailA_p)
emailB_distr <- dist_bernoulli(emailB_p)

real_diff <- emailA_p - emailB_p

n_a <- 43
n_b <- 57

K <- 1000

sim_results <- tibble(
  difference = numeric(K)
)

for (i in 1:K) {
  draw_a <- generate(emailA_distr, n_a)[[1]]
  draw_b <- generate(emailB_distr, n_b)[[1]]
  sim_results[i, "difference"] <- mean(draw_a) - mean(draw_b)
}

sim_results %>%
  ggplot(aes(x = difference)) +
  stat_dotsinterval() +
  xlim(-0.2, 0.6) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank()) +
  ggtitle(
    str_glue("Sample differences (real diff = {real_diff}, n={n_a+n_b})")
  ) +
  geom_vline(xintercept = real_diff, color = "darkgoldenrod3") +
  geom_vline(xintercept = 0, color = "blue4") -> p100

p100

Guiding questions:

  • What do the dots to the left of (and on) the blue line represent?

(Assuming n = 50)

# Same simulation, now with a ten times smaller total sample
emailA_p <- 0.56
emailB_p <- 0.39
emailA_distr <- dist_bernoulli(emailA_p)
emailB_distr <- dist_bernoulli(emailB_p)

real_diff <- emailA_p - emailB_p

n_a <- 21
n_b <- 29

K <- 1000

sim_results <- tibble(
  difference = numeric(K)
)

for (i in 1:K) {
  draw_a <- generate(emailA_distr, n_a)[[1]]
  draw_b <- generate(emailB_distr, n_b)[[1]]
  sim_results[i, "difference"] <- mean(draw_a) - mean(draw_b)
}

sim_results %>%
  ggplot(aes(x = difference)) +
  stat_dotsinterval() +
  xlim(-0.2, 0.6) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank()) +
  ggtitle(
    str_glue("Sample differences (real diff = {real_diff}, n={n_a+n_b})")
  ) +
  geom_vline(xintercept = real_diff, color = "darkgoldenrod3") +
  geom_vline(xintercept = 0, color = "blue4") -> p50

p50

Guiding questions:

  • What is the approximate proportion of dots to the left of (and on) the blue line? (Remember, \(K = 1000\).) A one-line check follows below.
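
If you want to check your visual estimate against the simulation itself, one line is enough (sim_results at this point holds the \(n=50\) draws):

# Share of simulated differences at or to the left of the blue (zero) line
mean(sim_results$difference <= 0)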

Now let’s compare the graphs in more detail

library(patchwork)

p500 / p100 / p50

Guiding questions:

  • As \(n\) decreases, what happens to the proportion of dots to the left of (and on) the blue line? Why? Talk it through aloud
  • How do the overall distribution shapes differ? Why?
  • What is \(n\) (the total sample size) responsible for?
  • What is \(K\) (the number of simulations) responsible for?
  • How would you approach testing hypotheses based on such simulations? Write down some draft steps
  • How would the graphs change if the real proportions were 0.56 and 0.52? Try drawing the changes over the old graphs or sketching new ones
  • Consider the idea of power – the chance of a “true positive” detection, i.e. of the test we perform successfully detecting an actual real effect. What do you think power depends on? (A simulation sketch follows this list.)
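
To make power concrete, here is a minimal simulation sketch (not part of the JOLT task): we repeat the artificial experiment many times and count how often a standard two-sample proportion test (R’s built-in prop.test) detects the effect at \(\alpha = 0.05\). The function name estimate_power and the choice of prop.test are ours, for illustration only.

# Sketch: power = the share of simulated experiments in which the test
# detects the real effect (true proportions are assumed known here)
estimate_power <- function(p_a, p_b, n_a, n_b, K = 1000, alpha = 0.05) {
  rejections <- replicate(K, {
    x_a <- rbinom(1, n_a, p_a)  # homework starters in group A
    x_b <- rbinom(1, n_b, p_b)  # homework starters in group B
    # suppressWarnings: prop.test warns when counts are small
    suppressWarnings(prop.test(c(x_a, x_b), c(n_a, n_b)))$p.value < alpha
  })
  mean(rejections)
}

estimate_power(0.56, 0.39, 243, 257)  # large n: power is high
estimate_power(0.56, 0.39, 21, 29)    # small n: power is much lower
estimate_power(0.56, 0.52, 243, 257)  # small effect: power drops too

Play with the proportions and sample sizes and watch how power reacts.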

Simulation (permutation)-based hypothesis testing

Now let us try to develop a simple prototype of hypothesis testing using simulations.

This setup is different: we don’t know much about what happens in the population and can only rely on the data at hand.

Our experiment had \(n=400\) participants (\(K=1000\) is still the number of simulations).

Say our experiment results for emailA are:

emailA_res <- c(1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1)

\(p_a=\) 0.54, and number of students \(n_a=\) 170.

and for emailB:

emailB_res <- c(0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1)

\(p_b=\) 0.51, and number of students \(n_b=\) 230.
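
The code below refers to the observed difference, so let’s compute it from the data explicitly:

# Observed difference in sample proportions: about 0.54 - 0.51 = 0.03
observed_diff <- mean(emailA_res) - mean(emailB_res)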

So the observed difference in proportions is observed_diff = 0.03. Is email A really more efficient?

Let’s start building our simulation from a slightly simplified definition of the p-value: the p-value is the probability of getting data with a difference in proportions the same as, or more extreme than, the observed one, provided that \(H_0\) is true (there is no real difference in proportions).

In real life, we don’t know whether \(H_0\) is true, i.e. we don’t know the real (population) proportions for emailA and emailB – that is why we test!

But knowing only our experimental data, we can simulate this assumption (that there is no real difference in proportions)!

One way to do it is to randomly interchange observations between conditions (emailA and emailB).

Guiding question:

  • If we randomly interchange observations between the conditions (emailA and emailB), what difference between the proportions do we expect?

# Pool all observations together; under H0 the email label carries no
# information, so we can shuffle the labels and see what differences
# arise from random variation alone.
res <- c(emailA_res, emailB_res)
labels <- rep(c("A", "B"), times = c(length(emailA_res), length(emailB_res)))

sim_results <- tibble(
  difference = numeric(K)
)

for (i in 1:K) {
  # Random permutation of the labels, keeping the original group sizes (170/230)
  condition <- sample(labels)

  sim_results[i, "difference"] <- mean(res[condition == "A"]) - mean(res[condition == "B"])
}

sim_results %>%
  ggplot(aes(x = difference)) +
  stat_dotsinterval() +
  xlim(-0.15, 0.15) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank()) +
  ggtitle(
    "difference distribution (n=400)"
  ) +
  geom_vline(xintercept = 0, color = "blue4") -> p4

p4

Guiding questions:

  • What does each gray dot represent?
  • What does the blue line represent?
  • What does zero difference between proportions mean?
  • What does this distribution represent?
  • How would you use this for testing hypotheses?

Let’s now figure out how to test the hypothesis…

Let’s look at the graph again:

p4

What do we need to add to do a visual hypothesis test?

Well, we don’t have the observed difference on the plot to draw conclusions from. Let’s add it!

p4 + geom_vline(xintercept = observed_diff, color = "darkgoldenrod3") -> p5

p5

Guiding questions:

  • Remind yourself: what does this gray dot distribution represent?
  • What would you say about the result of this hypothesis test and the strength of evidence it provides?
  • Where would you put the golden line if your task were to draw a graph showing strong evidence for a real difference between emailA and emailB efficiency?
  • Look back at the p-value definition. How would you geometrically calculate it? (A code sketch follows this list.)
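
Here is one way to turn the last question into numbers – a sketch of the usual permutation p-value calculation (whether to count one or both tails is a choice you should be able to justify):

# One-sided p-value: share of shuffled differences at least as large as observed
mean(sim_results$difference >= observed_diff)

# Two-sided version: differences as extreme in either direction also count
mean(abs(sim_results$difference) >= abs(observed_diff))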

Sample size and back to power

Say our previous results were only a part of the data from our experiment, and the observed difference in proportions is still the same (0.03).

Guiding questions:

  • Estimate an approximate p-value from the graph.
  • Has your hypothesis test conclusion changed, looking at this new data?
  • Why? What influences your conclusion?
  • Which email is better?
  • If we designed an emailC that we know is better than both A and B and compared it to the worse of the two, what would the graphs for n=500 and n=5000 look like? Doodle over the current graphs. Explain your answer. (A simulation sketch follows this list.)
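
We don’t have the fuller dataset here, but you can preview the effect of sample size with a small sketch: simulate the null distribution of differences at the current \(n=400\) and at a hypothetical tenfold sample, reusing the pooled proportion from the data (the helper null_diff and the tenfold sizes are our assumptions for illustration):

# Under H0 both groups share the pooled proportion of homework starters
p_pool <- mean(c(emailA_res, emailB_res))

null_diff <- function(n_a, n_b, K = 1000) {
  replicate(K, mean(rbinom(n_a, 1, p_pool)) - mean(rbinom(n_b, 1, p_pool)))
}

sd(null_diff(170, 230))    # spread of the null distribution at n = 400
sd(null_diff(1700, 2300))  # roughly sqrt(10) times narrower at n = 4000

The same observed difference of 0.03 sits much further out in the narrower distribution – which is why the same effect size can flip a conclusion when the sample grows.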

Overall guiding questions