This is problem set #3, in which we want you to integrate your knowledge of data wrangling with some basic simulation skills and some linear modeling.

For ease of reading, please separate your answers from our text by marking our text with the > character (indicating quotes).

library(tidyverse)
## ── Attaching packages ────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Let’s start by convincing ourselves that t-tests have the appropriate false positive rate. Run 10,000 t-tests with standard, normally-distributed data from a made-up 30-person, single-measurement experiment (the command for sampling from a normal distribution is rnorm). What’s the mean number of “significant” results?

The mean number of significant results is about 500 out of 10,000 (~5%), which matches the nominal false positive rate of α = .05.

First do this using a for loop.

# Run 10,000 one-sample t-tests on null (standard normal) data and count
# how many come out "significant" at p < .05
num_sig_for <- 0
for (i in 1:10000) {
  norm_data <- rnorm(30)               # 30 draws from a standard normal
  p_val <- t.test(norm_data)$p.value   # one-sample t-test against mu = 0
  if (p_val < .05) {
    num_sig_for <- num_sig_for + 1
  }
}
mean_for <- num_sig_for / 10000        # proportion of significant results
mean_for
## [1] 0.0495

Next, do this using the replicate function:

# Wrap a single simulated experiment in a function that returns 1 if the
# t-test is significant and 0 otherwise
fn <- function() {
  norm_data <- rnorm(30)
  if (t.test(norm_data)$p.value < 0.05) {
    return(1)
  }
  return(0)
}
vec <- replicate(10000, fn())
num_sig_rep <- sum(vec)
mean_rep <- num_sig_rep / 10000
mean_rep
## [1] 0.0493
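
Both estimates sit close to the nominal 5%; the small run-to-run differences are just Monte Carlo noise. As a rough sketch (assuming the true rate is .05), the binomial standard error of the estimate over 10,000 simulations is:

# Binomial standard error of the estimated rate over 10,000 simulations,
# assuming the true false positive rate is .05
se <- sqrt(.05 * (1 - .05) / 10000)
se                                # about 0.0022
c(.05 - 2 * se, .05 + 2 * se)     # rough 95% range: ~0.046 to ~0.054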

Ok, that was a bit boring. Let’s try something more interesting: let’s implement a p-value sniffing simulation, in the style of Simmons, Nelson, & Simonsohn (2011).

Consider this scenario: you have done an experiment, again with 30 participants (one observation each, just for simplicity). The question is whether their performance is above chance. You aren’t going to check the p-value after every trial; instead, let’s say you run all 30 and then test. If the p-value falls in the range .05 < p < .25, you optionally run 30 more participants, add those data, and test again. But if the original p-value is < .05, you call it a day, and if it is > .25, you also stop.

First, write a function that implements this sampling regime.

# Simulate one experiment with optional doubling: if the first p-value falls
# between .05 and upper_p, collect 30 more observations and re-test.
# Returns 1 if the final test is significant, 0 otherwise.
double.sample <- function(upper_p) {
  norm_data <- rnorm(30)
  p_val <- t.test(norm_data)$p.value
  if (p_val < upper_p && p_val > .05) {
    norm_data <- c(norm_data, rnorm(30))   # add a second batch of 30
    p_val <- t.test(norm_data)$p.value     # re-test the combined data
  }
  if (p_val < .05) {
    return(1)
  } else {
    return(0)
  }
}
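
As a quick sanity check (a sketch, not part of the assignment): setting upper_p to .05 makes the doubling condition impossible, so the function should collapse back to the single-test procedure and recover roughly the nominal 5% rate.

# Sanity check: with upper_p = .05 the second batch is never collected,
# so this should return approximately 0.05
mean(replicate(10000, double.sample(.05)))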

Now call this function 10k times and find out what happens.

vec <- replicate(10000, double.sample(.25))   # doubling allowed when .05 < p < .25
num_sig_rep <- sum(vec)
mean_rep <- num_sig_rep / 10000
mean_rep
## [1] 0.0732

Is there an inflation of false positives? How bad is it?

Yes. The false positive rate is now about 7.3% (roughly 730 significant results out of 10,000), compared with the nominal 5%.
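
Using the run above (the exact value will vary from simulation to simulation), the relative inflation over the nominal rate is roughly:

# Relative inflation of the false positive rate for this run
(0.0732 - 0.05) / 0.05   # about a 46% increase over the nominal 5%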

Now modify this code so that you can investigate this “double the sample” rule in a bit more depth. Let’s see what happens when you double the sample ANY time p > .05 (not just when p < .25), or when you do it only if p < .5 or < .75. How do these choices affect the false positive rate?

HINT: Try to do this by making the function double.sample take the upper p value as an argument, so that you can pass this through dplyr.

HINT 2: You may need more samples. Find out by looking at how the results change from run to run.

# upper_p = .5: double the sample whenever .05 < p < .5
# (more simulations here to reduce run-to-run noise; see HINT 2)
sum_50 <- sum(replicate(30000, double.sample(.5)))
sum_50
## [1] 2370
mean_50 <- sum_50 / 30000
mean_50
## [1] 0.079
# upper_p = .75: double the sample whenever .05 < p < .75
sum_75 <- sum(replicate(30000, double.sample(.75)))
sum_75
## [1] 2501
mean_75 <- sum_75 / 30000
mean_75
## [1] 0.08336667
# upper_p = 1: double the sample ANY time p > .05
sum_100 <- sum(replicate(50000, double.sample(1)))
sum_100
## [1] 4073
mean_100 <- sum_100 / 50000
mean_100
## [1] 0.08146
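
Following the first hint, the same sweep can also be written as a single pipeline; below is a sketch using purrr/dplyr (the number of simulations and the column names are illustrative choices, and tidyverse is already loaded above):

# Sketch: sweep the upper p-value cutoff in one pipeline
n_sims <- 30000
tibble(upper_p = c(.25, .5, .75, 1)) %>%
  mutate(false_pos_rate = map_dbl(upper_p,
                                  ~ mean(replicate(n_sims, double.sample(.x)))))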

What do you conclude on the basis of this simulation? How bad is this kind of data-dependent policy?

When a p-value is trending toward significance, adding more data has a large capacity to change the result. By doubling the sample whenever the p-value falls between .05 and .25, we are nearly 50% more likely to find a significant result (about 7.3% versus the nominal 5%). However, as the initial result gets further from significance, the extra data has a diminishing effect: when we double the sample for any p > .05, the false positive rate only rises to about 8.1%, not a proportional increase over the 7.3% we saw with the .25 cutoff. This is because adding data when the p-value is far from significant is much less likely to push it below .05 than adding data when it is already trending in the “right” direction. In summary, this data-dependent policy is bad regardless of the upper p-value threshold we use, because we end up counting more results as significant than we should; the marginal inflation simply shrinks as the threshold increases. It highlights the importance of fixing the sample size (or the stopping rule) before beginning to run a study.