This is problem set #4, in which we want you to integrate your knowledge of data wrangling with some basic simulation skills. It’s a short problem set to help you get your feet wet in testing statistical concepts through “making up data” rather than consulting a textbook or doing math.
For ease of reading, please separate your answers from our text by marking our text with the > character (indicating quotes).
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.7
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
set.seed(0)
Let’s start by convincing ourselves that t-tests have the appropriate false positive rate. Run 10,000 t-tests on standard normally distributed data from a made-up 30-person, single-measurement experiment (the command for sampling from a normal distribution is rnorm). What’s the proportion of “significant” results (\(p < .05\)) that you see?
First do this using a for loop.
# set up things
n_experiments = 10000
n_subjects = 30
experiment_alpha = .05
for_loop_results = numeric(n_experiments)  # pre-allocate rather than growing with c()
# do things
for (i_experiment in 1:n_experiments) {
  for_loop_results[i_experiment] = t.test(rnorm(n_subjects))$p.value
}
# say things
paste('The probability of getting a significant result in the experiments above was', mean(for_loop_results <= experiment_alpha))
## [1] "The probability of getting a significant result in the experiments above was 0.0478"
Next, do this using the replicate function:
# do things
replicate_results = replicate(n_experiments, t.test(rnorm(n_subjects))$p.value, simplify = TRUE)
# say things
paste('The probability of getting a significant result in the experiments above was', mean(replicate_results <= experiment_alpha))
## [1] "The probability of getting a significant result in the experiments above was 0.0494"
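As a quick sanity check (a small addition, not part of the prompt): with 10,000 simulated experiments, the Monte Carlo standard error of a proportion near .05 is about \(\sqrt{.05 \times .95 / 10000} \approx .002\), so both estimates above sit within sampling error of the nominal 5% rate.
# Monte Carlo standard error of the estimated false positive rate
# (roughly 0.0022 for alpha = .05 and 10,000 simulated experiments)
sqrt(experiment_alpha * (1 - experiment_alpha) / n_experiments)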
Ok, that was a bit boring. Let’s try something more interesting: let’s implement a p-value sniffing simulation, in the style of Simmons, Nelson, & Simonsohn (2011).
Consider this scenario: you have done an experiment, again with 30 participants (one observation each, just for simplicity). The question is whether their performance is above chance. You aren’t going to check the p-value after every participant, but let’s say you test after running 30. If the p-value falls in the range .05 < p < .25, you optionally run 30 more participants, add those data, and test again. But if the original p-value is < .05, you call it a day, and if it is > .25, you also stop.
First, write a function that implements this sampling regime.
# define the bad thing
double.sample <- function(n_subjects, upper_bound, lower_bound) {
  # run a single experiment (still a good thing)
  experiment_sample = rnorm(n_subjects)
  single_test = t.test(experiment_sample)$p.value
  # do the bad thing: if p lands in the "maybe" zone, double the sample
  if (single_test > lower_bound & single_test < upper_bound) {
    experiment_sample = c(experiment_sample, rnorm(n_subjects))
  }
  # now do the bad thing with math
  double_test = t.test(experiment_sample)$p.value
  # publish the bad thing
  return(double_test)
}
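A quick spot check of the function (a sketch, not required output): a single call should return one p-value produced by the peeking procedure.
# one simulated experiment under the peeking rule; returns a single p-value
double.sample(n_subjects, upper_bound = 0.25, lower_bound = 0.05)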
Now call this function 10k times and find out what happens.
# define params
lower_bound = experiment_alpha
upper_bound = 0.25
# the god move
double_test_distribution = replicate(n_experiments, double.sample(n_subjects, upper_bound, lower_bound))
# the math move
double_test_significance = mean(double_test_distribution <= experiment_alpha)
# the student move
paste('The probability of getting a significant result in the experiments above was', double_test_significance)
## [1] "The probability of getting a significant result in the experiments above was 0.0716"
Is there an inflation of false positives? How bad is it?
Yes, there is inflation: the false alarm rate increases from the nominal ~.05 to ~.07. This is squarely in the “bad” quadrant of good and bad things.
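To put a number on “how bad” (a small addition, using the quantities already computed above):
# ratio of the observed false positive rate to the nominal alpha
# (roughly 1.4 for the run shown above)
double_test_significance / experiment_alpha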
Now modify this code so that you can investigate this “double the sample” rule in a bit more depth. Let’s see what happens when you double the sample ANY time p > .05 (not just when p < .25), or when you do it only if p < .5 or p < .75. How do these choices affect the false positive rate?
HINT: Try to do this by making the function double.sample take the upper p value as an argument, so that you can pass this through dplyr.
HINT 2: You may need more samples. Find out by looking at how the results change from run to run.
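To follow HINT 2, one option (a sketch; the repeat count of 5 is arbitrary) is to re-estimate the false positive rate several times and look at the run-to-run spread; if that spread is large relative to the differences between policies, increase n_experiments.
# re-run the 10,000-experiment estimate a few times and compare
# (slow-ish: each repeat is another n_experiments simulated experiments)
fp_repeats = replicate(5, mean(replicate(n_experiments,
  double.sample(n_subjects, 0.25, experiment_alpha)) <= experiment_alpha))
fp_repeats
sd(fp_repeats)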
sample_strategies = data.frame(lower_bound = rep(experiment_alpha, 4),
                               upper_bound = c(.25, .50, .75, 1))
sample_strategies$false_positives = NA_real_
for (i_strategy in 1:nrow(sample_strategies)) {
  # employ a sample strategory
  results = replicate(n_experiments,
                      double.sample(n_subjects,
                                    sample_strategies$upper_bound[i_strategy],
                                    sample_strategies$lower_bound[i_strategy]))
  # save our results
  sample_strategies$false_positives[i_strategy] = mean(results <= experiment_alpha)
}
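Equivalently, per HINT 1, the same sweep can be pushed through the tidyverse; here is a sketch using map2_dbl inside mutate (the name sample_strategies_tidy is just illustrative, and the values will differ slightly because the simulations are re-drawn):
# tidyverse-style version of the same sweep over upper bounds
sample_strategies_tidy = sample_strategies %>%
  mutate(false_positives = map2_dbl(upper_bound, lower_bound,
    ~ mean(replicate(n_experiments,
                     double.sample(n_subjects, .x, .y)) <= experiment_alpha)))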
ggplot(data = sample_strategies, aes(x = upper_bound, y = false_positives)) +
  geom_line() + geom_point() +
  ggtitle("Relationship between the upper bound (in our sampling policy) and false alarms")
What do you conclude on the basis of this simulation? How bad is this kind of data-dependent policy?
The false positive rate increases with the upper bound on the “collect more data” rule. That is, the more liberal we are in deciding “let’s just collect some more data,” the more likely we are to find a significant effect, even though every sample is drawn from a null distribution. This is “very bad.”