Introduction

This experiment recruits participants to a chatbot survey using Facebook ads. After clicking on the Facebook ad, a participant is randomized into one of 5 treatment arms (the appendix considers 7 treatment arms).

The power analysis varies along two dimensions:

  • hypothesized outcomes for each treatment arm (and hence, treatment effects)
  • number of observations (clicks)

There are three outcomes, which are correlated: Pr(unvax complete|click), Pr(unvax website click|complete), and Pr(unvax useful free text|complete). We have an estimate of 14% for Pr(unvax complete|click) from the pilots and a rough estimate of 50% for Pr(unvax useful free text|complete), but we do not have an estimate for Pr(unvax website click|complete).

In the following power calculations, the hypothesized level of Pr(unvax complete|click) ranges from 0.04 to 0.20, with treatment effects ranging from 2 to 8 percentage points when comparing adjacent treatments (i.e. treatment 1 to 2, or 2 to 3). The hypothesized level of Pr(unvax useful free text|complete) ranges from 0.35 to 0.55, with treatment effects ranging from 2 to 10 percentage points when comparing adjacent treatments. The hypothesized level of Pr(unvax website click|complete) ranges from 0.05 to 0.35, with treatment effects ranging from 3 to 10 percentage points when comparing adjacent treatments. I do not anticipate that Pr(unvax website click|complete) will necessarily reach a high of 0.35, but the outcome we ultimately use in its place (chosen based on pilot results) should.

For the number of observations, we consider 50k to 150k clicks in 25k intervals.

Hypotheses

The pre-analysis plan specifies two research questions, with five hypotheses (one in two parts) to answer them. I list the research questions and accompanying hypotheses below.

  1. Is a chatbot a more effective tool for eliciting information than a traditional survey?
    • T1 = T4
  2. What quality of a chatbot makes it a more effective tool?
    • T1 = T2
    • T2 = T3
    • T3 = T4
    • T4 = T5
    • (T2 - T1) = (T4 - T5)

Note: The current code only includes the first four hypotheses for research question 2. It will be updated soon.

Key Takeaways

  • We are well-powered to answer research question 1 with 50k clicks for all outcomes, assuming a 2-3pp difference between adjacent treatment arms (so an 8 to 11pp difference between treatments 1 and 4).
  • For Pr(unvax complete|click), we are well-powered to detect 2pp differences between adjacent treatment arms at 50k clicks.
  • For Pr(unvax useful free text|complete), we are well-powered to detect 5pp differences between adjacent treatment arms at 75k clicks.
  • For Pr(unvax website click|complete), we are well-powered to detect 5pp differences between adjacent treatment arms at 50k clicks and 3pp differences at 150k clicks.

I recommend that the main experiment aim for 75k clicks, at an approximate cost of $11k at $0.146 per click (plus mobile airtime costs, which would amount to no more than $1,500 at the most optimistic completion rate of 20%).

Power Calculations

To assess the power for testing multiple correlated hypotheses, I simulate data, test the hypotheses, correct the p-values, and repeat for r repetitions, then calculate the proportion of repetitions in which I reject each hypothesis. The variables needed in the simulated data are the treatment assignment and the three outcomes.
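As a rough sketch, the whole procedure fits in a few lines. Here simulate_data() and test_hypotheses() are hypothetical placeholders for the data-generation and testing steps implemented later in this section, and the Holm correction stands in for any of the corrections considered:

# Skeleton of the simulation-based power calculation (a sketch only;
# simulate_data() and test_hypotheses() are placeholders for the
# data-generation and testing steps implemented below).
power_simulation <- function(r, n, alpha = 0.05){
  rejections <- replicate(r, {
    df <- simulate_data(n)                # draw treatment and outcomes
    p  <- test_hypotheses(df)             # one p-value per hypothesis
    p.adjust(p, method = "holm") < alpha  # correct, then test at alpha
  })
  rowMeans(rejections)                    # power: share of repetitions rejecting
}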

The data generation process is as follows (a condensed sketch in R follows the list):

  • Assign each observation to a treatment group. (treatment variable)
  • Take a draw from a binomial distribution with probability equal to the hypothesized unvax completion probability for the observation's treatment group. (completion variable)
  • For observations that completed the survey, take a draw from a binomial distribution with probability equal to the hypothesized unvax website click probability for their treatment group. (webclick variable)
  • For observations that completed the survey, take a draw from a binomial distribution with probability equal to the hypothesized free text proportion for their treatment group. (free text variable)
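A condensed sketch of this process is below. It is illustrative only: the romano_wolf() function later in this section implements it in full, including unequal assignment shares across arms (this sketch assigns arms uniformly), and outcomes is the 3 x 5 hypothesized-probability matrix defined under Define Outcome Matrix.

# Minimal sketch of the data generation process; `outcomes` is the 3 x 5
# matrix of hypothesized probabilities defined later in this document.
n  <- 75000
df <- data.frame(treat = sample(1:5, n, replace = TRUE))
df$complete <- rbinom(n, 1, outcomes[1, df$treat])
# webclick and useful are only observed for completed surveys
df$webclick <- ifelse(df$complete == 1, rbinom(n, 1, outcomes[2, df$treat]), NA)
df$useful   <- ifelse(df$complete == 1, rbinom(n, 1, outcomes[3, df$treat]), NA)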

Power is calculated using the following parameters:

  • \(\alpha=0.05\)
  • replications \(r=1000\)
  • number of observations (clicks) \(n=\{50k, 75k, 100k, 125k, 150k\}\), for a cost of roughly \(\{7k, 11k, 15k, 18k, 22k\}\) dollars at $0.146 per click.

Results are reported for uncorrected p-values and for p-values corrected using Bonferroni, Holm, and Hochberg.
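All three corrections are available in base R through p.adjust(); a quick illustration with made-up p-values:

# Illustrative p-values, corrected three ways with base R's p.adjust()
p <- c(0.001, 0.008, 0.039, 0.041, 0.042)
p.adjust(p, method = "bonferroni")
p.adjust(p, method = "holm")
p.adjust(p, method = "hochberg")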

Hypotheses

Hypothesis 1. Participants in Treatment 4 provide more information, higher-quality information, and are more receptive to new information about the vaccine compared to participants in Treatment 1.

Hypothesis 2. Participants in Treatment 3 provide more information, higher-quality information, and are more receptive to new information about the vaccine compared to participants in Treatment 2.

Hypothesis 3. Participants in Treatment 4 provide more information, higher-quality information, and are more receptive to new information about the vaccine compared to participants in Treatment 3.

Hypothesis 4. Participants in Treatment 4 provide more information, higher-quality information, and are more receptive to new information about the vaccine compared to participants in Treatment 5.

Hypothesis 5. (A) Participants in Treatment 2 provide more information, higher-quality information, and are more receptive to new information about the vaccine compared to participants in Treatment 1. (B) Moreover, the difference between participants in Treatment 2 and participants in Treatment 1 is greater than the difference between participants in Treatment 4 and participants in Treatment 5.

Romano-Wolf Correction

We run power calculations on a specific set of outcomes using the Romano-Wolf correction, which we will also use later in the experimental analysis to correct for multiple hypothesis testing. The specific tests are:

  • T1 vs T2, T1 vs T4, T2 vs T3, T3 vs T4, T4 vs T5, and (T2 - T1) vs (T5 - T4), each applied to all three outcomes (18 hypotheses in total)

The function takes three arguments:

  • N: the number of observations in the sample
  • B: the number of bootstrap subsamples to draw
  • n: the size of the bootstrap subsamples

In the output below, N = 75000, B = 1000, and n = 75000. We can adjust these later; these values just keep the run time low.

The function is divided into three parts:

  1. Compute t-statistics and p-values for each combination of treatment comparison and outcome (i.e. treatments A and B for outcome y1) from the full data set.

  2. Draw B bootstrap samples of size n from the full data set and compute a test statistic for each outcome-treatment comparison.

  3. Apply the Romano-Wolf correction itself.

Define Outcome Matrix

outcomes <- matrix(c(0.08, 0.12, 0.14, 0.16, 0.14,   # Pr(unvax complete|click)
                     0.05, 0.15, 0.20, 0.25, 0.20,   # Pr(unvax web click|complete)
                     0.35, 0.45, 0.50, 0.55, 0.50),  # Pr(useful free text|complete)
                   nrow = 3, byrow = TRUE)

outcomes %>%
  kable(caption = "Outcomes Matrix") %>%
  kable_styling()
Outcomes Matrix

  Outcome                          T1     T2     T3     T4     T5
  Pr(unvax complete|click)        0.08   0.12   0.14   0.16   0.14
  Pr(unvax web click|complete)    0.05   0.15   0.20   0.25   0.20
  Pr(useful free text|complete)   0.35   0.45   0.50   0.55   0.50
### Romano-Wolf correction
# Packages used below:
library(dplyr)     # pipes, mutate(), filter()
library(multcomp)  # glht() for the linear-combination hypothesis test

romano_wolf <- function(N, B, n){

  # treatments
  t <- 1:5
  
  # make data frame of size N
  df <- data.frame(matrix(nrow = N, ncol = 0))
  
  # treatment assignment (unequal arm shares: 3/14, 3/14, 2/14, 4/14, 2/14)
  df$treat <- sample(t, N, replace = TRUE, prob = c(3/14, 3/14, 2/14, 4/14, 2/14))
  
  # generate data with completion outcome
  df <- df %>% 
    mutate(complete = rbinom(n(),1,outcomes[1,treat])) 
  
  completes <- df %>%
    filter(complete ==1) %>%
    mutate(webclick = rbinom(n(),1,outcomes[2,treat]), 
           useful = rbinom(n(),1,outcomes[3,treat]))

  noncompletes <- df %>%
    filter(complete ==0) %>%
    mutate(webclick = NA,
           useful = NA)
  
  
  d <-rbind(completes, noncompletes) %>% 
    .[sample(1:nrow(.)), ] #random order
  
  ### Romano Wolf Starts Here
  
  # We test T1 vs T2, T1 vs T4, T2 vs T3, T3 vs T4, T4 vs T5, and
  # (T2 - T1) vs (T5 - T4) for each of the three outcomes.
  # That makes a total of 18 hypotheses.
  # We will need the estimate and t-stat (or z-stat) of each parameter about
  # which we are conducting a hypothesis test, and eventually the standard errors.
  
  comparisons = cbind(combn(t,2)[, c(3, 5, 8, 10, 1)],  # pairs (1,4), (2,3), (3,4), (4,5), (1,2)
                      c("2-1", "4-5"))                  # plus the difference-in-differences contrast
  num_hyp = nrow(outcomes) * ncol(comparisons)
  
  orig_results = data.frame(matrix(NA,num_hyp,3))
  colnames(orig_results) = c("estimates","ts","p_val")
  
  
  ### step 1: compute t statistic/p estimates for each combination of treatment and outcome comparisons (ie treatment A and B for outcome y1)  from the full data set
  
  i = 1
  for (y in c("complete", "webclick", "useful")){
    for (comp in 1:ncol(comparisons)){
      # We create the name of the hypothesis we are testing, for example y1_A_vs_B #
      rownames(orig_results)[i] = paste(y,"_",comparisons[2,comp],"_vs_",comparisons[1,comp],sep="")
      
      # Run the t.test and extract the estimate and t-statistic #
      
      if(comp <6){
        test = t.test(d[d$treat==comparisons[1,comp], y],
                      d[d$treat==comparisons[2,comp], y])
        
        orig_results[i,"estimates"] = test$estimate[1] - test$estimate[2] 
        # We use absolute values for the t-statistic since we want to test two-sided hypothesis # 
        orig_results[i,"ts"] = abs(test$statistic)
        # Also get the p-values to compare
        orig_results[i,"p_val"] = test$p.value
        
      } else{
          
        formula <- as.formula(paste(y, "~ 0 + treat"))
        
          model = lm(formula, d %>% 
               # remove observations with treatments not relevant to diff in diff
               filter(treat != 3) %>% 
               # make numeric categories a character so lm doesn't treat them as continuous
               mutate(treat = paste("W", treat, sep = ""))) 
  
        # run hypothesis test for diff in diff
        glht_mod = glht(model, "(treatW2 - treatW1) - (treatW5 - treatW4) = 0")

        orig_results[i,"estimates"] <- summary(glht_mod)$test$coefficients
        orig_results[i,"ts"]  <- summary(glht_mod)$test$tstat
        orig_results[i,"p_val"] <- summary(glht_mod)$test$pvalues
        
        }
      i = i+1}}
  
  # We sort the hypotheses by the size of the original t-statistic #
  sorted_orig = orig_results[order(-orig_results$ts),]
  # We check if we reject the null hypothesis using the unadjusted p-values
  alpha = 0.05
  sorted_orig$reject = sorted_orig$p_val < alpha
  sorted_orig
  
  
  ### bootstrapping
  
  bootstrapped_statistics = data.frame(matrix(NA,B,nrow(sorted_orig)))
  # Important, we keep the order of the sorted set of hypotheses #
  colnames(bootstrapped_statistics) = rownames(sorted_orig)
  
  set.seed(777)
  # In each of the bootstrap rounds we do the following:
  for (b in 1:B){
      
      # We sample with replacement n indices from the N rows of the full data set
      idx = sample(1:N, n, TRUE)
  
      data_boot = d[idx,]
      
    # We repeat exactly the same procedure as before to get the bootstrap estimates and standard errors 
    for (y in c("complete","webclick","useful")){
      
      
      for (comp in 1:ncol(comparisons)){
        # By getting the name of the hypothesis we are testing we don't need to worry about order, we match by name
        hyp = paste(y, "_", comparisons[2,comp],"_vs_",comparisons[1,comp],sep="")
        # Run the t.test on the bootstrap sample
        
        if(comp < 6){
        
          test = t.test(data_boot[data_boot$treat==comparisons[1,comp], y],
                        data_boot[data_boot$treat==comparisons[2,comp], y])
          # The null statistic is computed by subtracting the original estimate
          # from the bootstrap estimate and dividing by the standard error of
          # the bootstrapped estimate.
          # Remember the absolute value since we are testing a two-sided hypothesis.
          bootstrapped_statistics[b,hyp] = abs((test$estimate[1] - test$estimate[2] - sorted_orig[hyp,"estimates"])/test$stderr)
        } else {
          
          formula <- as.formula(paste(y, "~ 0 + treat"))
          
            model = lm(formula, data_boot %>% 
                 # remove observations with treatments not relevant to diff in diff
                 filter(treat != 3) %>% 
                 # make numeric categories a character so lm doesn't treat them as continuous
                 mutate(treat = paste("W", treat, sep = ""))) 
    
          # run hypothesis test for diff in diff
          glht_mod = glht(model, "(treatW2 - treatW1) - (treatW5 - treatW4) = 0")
          bootstrapped_statistics[b,hyp]   <- abs((summary(glht_mod)$test$coefficients - sorted_orig[hyp,"estimates"])/summary(glht_mod)$test$sigma)
          }
       }
    }
  }

  ### The actual Romano-Wolf Correction Step
  
  # Now we compute the "max" statistics and the empirical quantiles #
  max_stats = data.frame(matrix(NA,B,num_hyp))
  colnames(max_stats) = colnames(bootstrapped_statistics)
  for (h in 1:(num_hyp-1)){
    max_stats[,h] = apply(bootstrapped_statistics[,h:num_hyp], MARGIN=1, FUN=max)}
  max_stats[,num_hyp] = bootstrapped_statistics[,num_hyp]
  # And the empirical quantiles given the significance level #
  Cs = unname(apply(max_stats, MARGIN=2, FUN=quantile, probs=c(1-alpha)))
    
  # And finally we run the rejection algorithm
  sorted_orig$adjust_rejected = NA
  # R will index the hypothesis we need to test at each round
  R = 1
  # r will be the counter of rejected hypotheses at each round
  r = Inf
  # The while loop stops if no hypotheses are rejected in a round or if we reject all.
  while (r > 0 & R <= num_hyp){
    # We compare the original t-statistics of the hypotheses we haven't tested at each round with the quantiles
    sorted_orig$adjust_rejected[R:num_hyp] = (sorted_orig$ts[R:num_hyp] > Cs[R])
    r = sum(sorted_orig$adjust_rejected[R:num_hyp])
    R = r + R}
  # For reference, print the 5th-largest original t-statistic and its empirical
  # critical value (if the statistic were below the critical value, we would
  # fail to reject there and the algorithm would stop).
  print(sorted_orig[5,"ts"])
  print(Cs[5])
  
  sorted_orig[c("reject","adjust_rejected")]
  
  
  # We can also compute the adjusted p-values
  sorted_orig$adj_p_val = NA
  sorted_orig$adj_p_val[1] = (sum(max_stats[,1]>=sorted_orig$ts[1])+1)/(B+1)
  for (h in 2:num_hyp){
    sorted_orig$adj_p_val[h] = max((sum(max_stats[,h]>=sorted_orig$ts[h])+1)/(B+1), sorted_orig$adj_p_val[h-1])}
  # Compare them to Bonferroni's p-values (note that the minimum p-value in RW is 1/(B+1))
  sorted_orig$p_val_bonf = p.adjust(sorted_orig$p_val,"bonferroni")
  return(round(sorted_orig[c("p_val","adj_p_val","p_val_bonf")],5))
  
}
results <- romano_wolf(N = 75000,B = 1000, n=75000)
## [1] 10.40109
## [1] 2.820481
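This call covers 75k clicks only; repeating it across the other click counts from the parameters list could look like the following sketch (click_counts is our name, and run time grows with N and B):

# Sketch: repeat the Romano-Wolf calculation for each click count considered.
# Not run here to keep run time low; B could also be reduced while exploring.
click_counts <- c(50000, 75000, 100000, 125000, 150000)
all_results  <- lapply(click_counts, function(k) romano_wolf(N = k, B = 1000, n = k))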
power_calc <- results %>%
  as.data.frame() %>%
  rownames_to_column(var = "comparison") %>%
  mutate(treatments = str_extract(string = comparison, pattern = "[0-9]\\_vs\\_[0-9]|[0-9]-[0-9]\\_vs\\_[0-9]-[0-9]"), 
         outcomes = str_remove(string = comparison, pattern = paste("_", treatments, sep= ""))) %>%
  .[, c("outcomes", "p_val", "adj_p_val", "p_val_bonf", "treatments")] %>% 
  pivot_longer(cols = c("p_val", "adj_p_val", "p_val_bonf"), names_to = "p_val_type")%>%
  pivot_wider(names_from = treatments, values_from  = value) %>%
  mutate(outcomes = factor(outcomes, levels = c("complete", "webclick", "useful"))) %>% 
  arrange(outcomes) %>%
  mutate(outcomes = c(rep('Unvax completes|Click',3), rep('Unvax web clicks|Complete', 3),
                       rep('Unvax useful freetext|Complete', 3)), 
         p_val_type = ifelse(p_val_type == "p_val", "Original p-value", ifelse(p_val_type == "adj_p_val", "Romano-Wolf p-value", "Bonferroni p-value"))) %>%
  rename(correction = p_val_type) %>%
  .[, c("outcomes","correction", "2_vs_1", "4_vs_1", "3_vs_2", "4_vs_3", "5_vs_4", "4-5_vs_2-1")]
 
colnames(power_calc) <- c("Outcome","Correction",  'T1 v. T2',  'T1 v. T4', 'T2 v. T3','T3 v. T4', 'T4 v. T5','T2 - T1 v. T5 - T4')

kable(power_calc,  format = "html", caption = paste("Power Calculations: ",75000/1000," Thousand Clicks"), digits = 5)%>%
kable_styling()%>%
collapse_rows(columns = 1)%>%
  scroll_box(height = "100%")
Power Calculations: 75 Thousand Clicks

  Outcome                          Correction            T1 v. T2   T1 v. T4   T2 v. T3   T3 v. T4   T4 v. T5   T2 - T1 v. T5 - T4
  Unvax completes|Click            Original p-value       0.00000      0.000    0.00000    0.00000    0.00003                0.000
                                   Romano-Wolf p-value    0.00100      0.001    0.00100    0.00100    0.00300                0.001
                                   Bonferroni p-value     0.00000      0.000    0.00001    0.00003    0.00060                0.000
  Unvax web clicks|Complete        Original p-value       0.00000      0.000    0.00006    0.00027    0.57304                0.000
                                   Romano-Wolf p-value    0.00100      0.001    0.00300    0.00400    0.59341                0.001
                                   Bonferroni p-value     0.00000      0.000    0.00109    0.00487    1.00000                0.000
  Unvax useful freetext|Complete   Original p-value       0.00000      0.000    0.00416    0.00148    0.00001                0.000
                                   Romano-Wolf p-value    0.00100      0.001    0.01399    0.01399    0.00100                0.001
                                   Bonferroni p-value     0.00001      0.000    0.07482    0.02658    0.00011                0.000