This is problem set #3, in which we want you to integrate your knowledge of data wrangling with some basic simulation skills and some linear modeling.>
For ease of reading, please separate your answers from our text by marking our text with the
>character (indicating quotes).
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.7
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Let’s start by convincing ourselves that t-tests have the appropriate false positive rate. Run 10,000 t-tests with standard, normally-distributed data from a made up 30-person, single-measurement experiment (the command for sampling from a normal distribution is rnorm). What’s the mean number of “significant” results?
First do this using a for loop.
#running 10000 t-tests
set.seed(508)
sigResults <- 0
for (i in 1:10000) {
x <- rnorm(30)
p <- t.test(x)$p.
if(p<0.05) sigResults <- sigResults + 1
}
sigResults
## [1] 483
ANSWER: 483 out of 10000
Next, do this using the replicate function:
#running 10000 t-test's with replicate function
set.seed(508)
results <- replicate(10000, t.test (rnorm(30))$p. <0.05)
sum(results)
## [1] 483
Answer: 483 out of 10000
Ok, that was a bit boring. Let’s try something more interesting - let’s implement a p-value sniffing simulation, in the style of Simons, Nelson, & Simonsohn (2011).
Consider this scenario: you have done an experiment, again with 30 participants (one observation each, just for simplicity). The question is whether their performance is above chance. You aren’t going to check the p-value every trial, but let’s say you run 30 - then if the p-value is within the range p < .25 and p > .05, you optionally run 30 more and add those data, then test again. But if the original p value is < .05, you call it a day, and if the original is > .25, you also stop.
First, write a function that implements this sampling regime.
#addition of more samples if p-value is less than 0.05
double.sample <- function () {
}
run_experiment= function(){
testing_t_test=data.frame(
value=rnorm(n=30, mean=0, sd=1),
group=rep(1:2, each=15))
testing_t_test2=data.frame(
value=rnorm(n=30, mean=0, sd=1),
group=rep(1:2, each=15))
#running t-test for each experiment
t_test_result=t.test(value~group, data=testing_t_test)
pvalue=(t_test_result$p.value)
is_significant=(pvalue <.05)
is_gt_twenty_five=(pvalue >.25)
if(is_significant | is_gt_twenty_five) {
return (is_significant)
}
#repeat if
else {
testing_t_test_sixty=rbind(testing_t_test, testing_t_test2)
t_test_result=t.test(value~group, data=testing_t_test_sixty)
pvalue=(t_test_result$p.value)
is_signficant=(pvalue < .05)
return(is_signficant)
}
}
Now call this function 10k times and find out what happens.
sum(replicate(10000, run_experiment()))
## [1] 763
Is there an inflation of false positives? How bad is it?
Yes, there is an inflation. There is now 711 out of 10000 after adding more participapnts, in comparison to approximately 500 out of 10000. The false positive rate increased from 5% to 7%.
Now modify this code so that you can investigate this “double the sample” rule in a bit more depth. Let’s see what happens when you double the sample ANY time p > .05 (not just when p < .25), or when you do it only if p < .5 or < .75. How do these choices affect the false positive rate?
HINT: Try to do this by making the function double.sample take the upper p value as an argument, so that you can pass this through dplyr.
HINT 2: You may need more samples. Find out by looking at how the results change from run to run.
#doubling sample when p > 0.5
set.seed(508)
p_resample <- .05
double.sample <- function () {
x <- rnorm(30)
p <- t.test(x)$p.
if(p > p_resample) {
x <- c(x, rnorm(30))
p <- t.test(x)$p.
}
p
}
results <- replicate(10000, double.sample() < 0.05)
sum(results)
## [1] 821
#doubling sample when 0.05 < p < 0.5
set.seed(508)
p_resample <- c(0.05, 0.5)
double.sample <- function () {
x <- rnorm(30)
p <- t.test(x)$p.
if(p > p_resample[1] & p < p_resample[2]) {
x <- c(x, rnorm(30))
p <- t.test(x)$p.
}
p
}
results <- replicate(10000, double.sample() < 0.05)
sum(results)
## [1] 779
set.seed(508)
p_resample <- c(0.05, 0.75)
double.sample <- function () {
x <- rnorm(30)
p <- t.test(x)$p.
if(p > p_resample[1] & p < p_resample[2]) {
x <- c(x, rnorm(30))
p <- t.test(x)$p.
}
p
}
results <- replicate(10000, double.sample() < 0.05)
sum(results)
## [1] 821
What do you conclude on the basis of this simulation? How bad is this kind of data-dependent policy?
By adding more participants to the sample we see a consistent 2-3% increase from the original false positive rate of approximately 5%. When the p-value does not reach significance (more than 0.05) after resampling, the false positive rate goes up to approximately 8.2% - 821 out of 10000. When the p-value is calculated between 0.05 and 0.5 after resampling, the false positive rate increases to about 7.7% - 779 out of 10000. Finally, when the p-value is calculated between 0.05 and 0.75 after resempling, the false positive rate increases to 8.2% - 821 out of 10000.