library(tidyverse)
## ── Attaching packages ───────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ──────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Let’s start by convincing ourselves that t-tests have the appropriate false positive rate. Run 10,000 t-tests with standard, normally-distributed data from a made-up 30-person, single-measurement experiment (the command for sampling from a normal distribution is rnorm). What’s the mean number of “significant” results?

First do this using a for loop.

#collect data for experiment 
testing_t_test=data.frame(
  value=rnorm(n=30, mean=0, sd=1),
  group=rep(1:2, each=15))

#t-test for one experiment
t_test_result=t.test(value~group, data=testing_t_test)
pvalue=(t_test_result$p.value)
is_significant=(pvalue <.05)
#10,000 experiments
is_significant_vector=logical(10000)  #pre-allocate the results vector
for (i in 1:10000) {
  testing_t_test=data.frame(
    value=rnorm(n=30, mean=0, sd=1),
    group=rep(1:2, each=15))
  
  #t-test for each experiment
  t_test_result=t.test(value~group, data=testing_t_test)
  pvalue=(t_test_result$p.value)
  is_significant=(pvalue <.05)
  is_significant_vector[i]=is_significant
}
sum(is_significant_vector)/10000
## [1] 0.0491

To use replicate, we first wrap a single experiment in a function that returns whether its result came out “significant”:

run_experiment= function(){
  testing_t_test=data.frame(
    value=rnorm(n=30, mean=0, sd=1),
    group=rep(1:2, each=15))
  
  #t-test for each experiment
  t_test_result=t.test(value~group, data=testing_t_test)
  pvalue=(t_test_result$p.value)
  is_significant=(pvalue <.05)

  return(is_significant)
}

Next, do this using the replicate function:

sum(replicate(10000, run_experiment()))
## [1] 485
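
That is 485/10,000 ≈ 4.85% false positives, again close to the nominal 5% rate. Since the tidyverse is already loaded, the same simulation can also be written with purrr; this is just an optional sketch (n_sims is my own variable name), and the exact count will vary a bit from run to run.

#optional purrr version of the same simulation (not required for the exercise)
#map_lgl calls run_experiment once per element and collects the logical results
n_sims=10000
mean(map_lgl(1:n_sims, ~ run_experiment()))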

Ok, that was a bit boring. Let’s try something more interesting - let’s implement a p-value sniffing simulation, in the style of Simmons, Nelson, & Simonsohn (2011).

Consider this scenario: you have done an experiment, again with 30 participants (one observation each, just for simplicity). The question is whether their performance is above chance. You aren’t going to check the p-value every trial, but let’s say you run 30 participants and then look: if the p-value falls in the range .05 < p < .25, you optionally run 30 more, add those data, and test again. But if the original p-value is < .05, you call it a day, and if the original is > .25, you also stop.

First, write a function that implements this sampling regime.

#sampling regime: double the sample only when the first p-value falls between .05 and .25

run_experiment= function(){
  #first batch of 30 observations (two groups of 15)
  testing_t_test=data.frame(
    value=rnorm(n=30, mean=0, sd=1),
    group=rep(1:2, each=15))
  
  #second batch of 30, used only if we decide to double the sample
  testing_t_test2=data.frame(
    value=rnorm(n=30, mean=0, sd=1),
    group=rep(1:2, each=15))
  
  #t-test on the first batch
  t_test_result=t.test(value~group, data=testing_t_test)
  pvalue=(t_test_result$p.value)
  is_significant=(pvalue <.05)
  is_gt_twenty_five=(pvalue >.25)
  
  #stop if p < .05 (declare success) or p > .25 (give up)
  if(is_significant | is_gt_twenty_five) {
    return (is_significant)
  }
  #otherwise double the sample and re-test on all 60 observations
  else {
    testing_t_test_sixty=rbind(testing_t_test, testing_t_test2)
    t_test_result=t.test(value~group, data=testing_t_test_sixty)
    pvalue=(t_test_result$p.value)
    is_significant=(pvalue < .05)
    return(is_significant)
  }
}

Now call this function 10k times and find out what happens.

sum(replicate(10000, run_experiment()))
## [1] 704

Is there an inflation of false positives? How bad is it?

There is an inflation in false positive rates: the rate rises from approximately 5% to about 7% (704 out of 10,000 simulated experiments). In other words, optionally adding participants when the result is “almost significant” produces roughly 200 extra false positives per 10,000 experiments.
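
As a quick sanity check (my own addition, not part of the original answer), a binomial test of 704 “significant” results in 10,000 runs against the nominal rate of .05 confirms that this inflation is far too large to be run-to-run simulation noise:

#sanity check: is 704/10,000 consistent with a true rate of .05? (binom.test is base R)
binom.test(x=704, n=10000, p=0.05)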

Now modify this code so that you can investigate this “double the sample” rule in a bit more depth. Let’s see what happens when you double the sample ANY time p > .05 (not just when p < .25), or when you do it only if p < .5 or < .75. How do these choices affect the false positive rate?

HINT: Try to do this by making the function double.sample take the upper p value as an argument, so that you can pass this through dplyr.

HINT 2: You may need more samples. Find out by looking at how the results change from run to run.

#same sampling regime as above, but with the upper p-value cutoff passed in as an argument

run_experiment= function(gt_condition){
  #first batch of 30 observations (two groups of 15)
  testing_t_test=data.frame(
    value=rnorm(n=30, mean=0, sd=1),
    group=rep(1:2, each=15))
  
  #second batch of 30, used only if we decide to double the sample
  testing_t_test2=data.frame(
    value=rnorm(n=30, mean=0, sd=1),
    group=rep(1:2, each=15))
  
  #t-test on the first batch
  t_test_result=t.test(value~group, data=testing_t_test)
  pvalue=(t_test_result$p.value)
  is_significant=(pvalue <.05)
  is_above_cutoff=(pvalue > gt_condition)
  
  #stop if p < .05 (declare success) or p is above the upper cutoff (give up)
  if(is_significant | is_above_cutoff) {
    return (is_significant)
  }
  #otherwise double the sample and re-test on all 60 observations
  else {
    testing_t_test_sixty=rbind(testing_t_test, testing_t_test2)
    t_test_result=t.test(value~group, data=testing_t_test_sixty)
    pvalue=(t_test_result$p.value)
    is_significant=(pvalue < .05)
    return(is_significant)
  }
}

sum(replicate(100000, run_experiment(.25)))
## [1] 6933
sum(replicate(100000, run_experiment(.75)))
## [1] 8165
sum(replicate(100000, run_experiment(1)))
## [1] 8181
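
As the first hint suggests, this sweep can also be written in tidyverse style, mapping over the upper cutoffs with dplyr and purrr. This is just a sketch (the tibble, the column names, and n_sims are my own choices), and each estimate will vary somewhat from run to run, which is why more simulations help:

#tidyverse sketch of the cutoff sweep (equivalent to the calls above, plus .5)
n_sims=100000
tibble(p_upper=c(.25, .5, .75, 1)) %>%
  mutate(false_positive_rate=
           map_dbl(p_upper, ~ mean(replicate(n_sims, run_experiment(.x)))))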

What do you conclude on the basis of this simulation? How bad is this kind of data-dependent policy?

Adding participants on the basis of a null result is highly concerning: this kind of data-dependent stopping substantially inflates the false positive rate. If you need to add participants to achieve a significant result, then your finding is unlikely to be robust. Researchers should be cautious and thoughtful about choosing their sample size upfront; it is bad practice to make this decision after seeing the result. Preregistering the sample size and analysis plan would mitigate this problem. Interestingly, once the upper p-value cutoff reaches about .75, the false positive rate seems to asymptote (at roughly 8% in these simulations).