This is problem set #4, in which we want you to integrate your knowledge of data wrangling with some basic simulation skills. It’s a short problem set designed to consolidate your ggplot2 skills and then get your feet wet in testing statistical concepts through “making up data” rather than consulting a textbook or doing math.

For ease of reading, please separate your answers from our text by marking our text with the > character (indicating quotes).

Part 1: ggplot practice

This part is a warmup; it should be relatively straightforward ggplot2 practice.

Load data from Frank, Vul, & Saxe (2011, Infancy), a study in which we measured infants’ looking to hands in moving scenes. There were infants from 3 months all the way to about two years, and there were two movie conditions (Faces_Medium, in which kids played on a white background, and Faces_Plus, in which the backgrounds were more complex and the people in the videos were both kids and adults). An eye-tracker measured children’s attention to faces. This version of the dataset only includes two conditions and only shows the amount of looking at hands (other variables were measured as well).

library(tidyverse)
theme_set(theme_minimal())
fvs <- read_csv("data/FVS2011-hands.csv")
## Parsed with column specification:
## cols(
##   subid = col_double(),
##   age = col_double(),
##   condition = col_character(),
##   hand.look = col_double()
## )

First, use ggplot to plot a histogram of the ages of children in the study. NOTE: this is a repeated measures design, so you can’t just take a histogram of every measurement.

fvs %>% 
  group_by(subid) %>% 
  summarize(age = age[1]) %>%  # one age per child (repeated measures)
  ggplot(aes(x = age)) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = seq(1, max(fvs$age))) +
  labs(x = "Age (months)", y = "Frequency")

Second, make a scatter plot showing hand looking as a function of age and condition. Add appropriate smoothing lines. Take the time to fix the axis labels and make the plot look nice.

fvs %>% 
  ggplot(aes(x = age, y = hand.look, color = condition)) +
  geom_point(alpha = .3) +
  geom_smooth(se = FALSE, size = .2) +              # loess smooth
  geom_smooth(method = "lm", linetype = "dotted") + # linear trend
  scale_x_continuous(breaks = seq(1, max(fvs$age))) +
  labs(x = "Age", y = "Hand looking time", color = "Condition")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Looking at differences between conditions within subjects:

fvs %>% 
  ggplot(aes(x = age, y = hand.look, color = factor(subid))) +
  geom_line() +  # one line per subject, connecting their two conditions
  theme(legend.position = "none") +
  scale_x_continuous(breaks = seq(1, max(fvs$age))) +
  labs(x = "Age (months)", y = "Hand looking time")

# 
# lmer(hand.look ~ age*condition + (1|subid), fvs) %>% 
#   summary

What do you conclude from this pattern of data?

As children get older, they look at hands more. This effect is more pronounced in the more complex Faces_Plus scenes.

What statistical analyses would you perform here to quantify these differences?

I would model hand looking as a function of age, condition, and their interaction, with subject ID as a random intercept (as sketched below). There are not enough data per subject to also include random slopes.
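
Here is the commented-out model from above as a runnable sketch, assuming the lme4 package is installed:

library(lme4)
# fixed effects for age, condition, and their interaction;
# a random intercept per subject accounts for the repeated measures
lmer(hand.look ~ age * condition + (1 | subid), fvs) %>% 
  summary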

Part 2: Simulation

Let’s start by convincing ourselves that t-tests have the appropriate false positive rate. Run 10,000 t-tests with standard, normally-distributed data from a made up 30-person, single-measurement experiment (the command for sampling from a normal distribution is rnorm).

The goal of these t-tests is to determine, based on 30 observations, whether the underlying distribution (in this case a normal distribution with mean 0 and standard deviation 1) has a mean that is different from 0. In reality, the mean is not different from 0 (we sampled using rnorm’s defaults), but sometimes the 30 observations we get in our experiment will suggest that the mean is higher or lower. In that case, we’ll get a “significant” result and incorrectly reject the null hypothesis of mean 0.
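
For concreteness, a single such made-up experiment looks like this (the seed is arbitrary, purely for reproducibility):

set.seed(251)                  # arbitrary seed
one_experiment <- rnorm(30)    # 30 observations from N(0, 1)
t.test(one_experiment)$p.value # two-sided p-value for H0: mean = 0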

What’s the proportion of “significant” results (\(p < .05\)) that you see?

Setting constants and useful functions

REPS <- 10000
SAMP_SZ <- 30

# proportion of p-values falling in either extreme tail (p < alpha/2 or
# p > 1 - alpha/2); under a uniform null this has expectation alpha
prop_positive <- function(x, alpha = .05){
  mean(x < (alpha / 2) | x > (1 - (alpha / 2)))
}
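
As a quick sanity check on this helper: uniform draws stand in for p-values under the null, so the result should land near alpha:

# null p-values are uniform on [0, 1], so expect approximately .05
prop_positive(runif(1e5))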

First do this using a for loop.

p_vals <- numeric(REPS)

t1 <- Sys.time()
for(i in seq_along(p_vals)){
  data <- rnorm(SAMP_SZ)
  p_vals[i] <- t.test(data)$p.value
}
t2 <- Sys.time()
t2 - t1
## Time difference of 0.675077 secs
prop_positive(p_vals)
## [1] 0.0502

Next, do this using the replicate function:

t1 <- Sys.time()
p_vals <- replicate(REPS, t.test(rnorm(SAMP_SZ))$p.value)
t2 <- Sys.time()
t2 - t1
## Time difference of 0.920558 secs
prop_positive(p_vals)
## [1] 0.0512

How does this compare to the intended false-positive rate of \(\alpha=0.05\)?

Both estimates (0.0502 and 0.0512) are very close to the nominal rate. This is what we expect: under the null hypothesis, p-values are uniformly distributed, so “significant” results occur about 5% of the time.
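
To put a number on “approximately”: the Monte Carlo standard error of a proportion estimated from REPS = 10,000 null experiments is

# standard error of the estimated false-positive rate
sqrt(.05 * (1 - .05) / REPS)

which is about .0022, so both estimates are well within sampling error of .05.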

Ok, that was a bit boring. Let’s try something more interesting: let’s implement a p-value sniffing simulation, in the style of Simmons, Nelson, & Simonsohn (2011).

Consider this scenario: you have done an experiment, again with 30 participants (one observation each, just for simplicity). The question is whether the true mean is different from 0. You aren’t going to check the p-value every trial, but let’s say you run 30 participants and then test: if the p-value falls between .05 and .25, you optionally run 30 more, add those data, and test again; if the original p-value is < .05, you call it a day; and if it is > .25, you also stop.

First, write a function that implements this sampling regime.

# original recursive version
# too memory inefficient for the very expensive simulations later on :(
# double.sample <- function(dat = numeric(0), 
#                           samp_sz = SAMP_SZ, 
#                           alpha = .05, 
#                           upper_threshold = .25){
#   dat <- c(dat, rnorm(samp_sz))
#   p_val <- t.test(dat)$p.value
#   if(p_val < alpha | p_val > upper_threshold){
#     return(p_val)
#   } else(
#     double.sample(dat, samp_sz, alpha, upper_threshold)
#   )
# }


#' double.sample
#'
#' @param samp_sz size of the initial sample
#' @param block_sz size of block to tack on when adding more participants
#' @param alpha alpha threshold for significance
#' @param upper_threshold upper threshold for when to stop adding participants
#' @param killswitch stop adding participants once the sample reaches this size
#'
#' @return a list with the data as the first element and the p value as the second
double.sample <- function(samp_sz = SAMP_SZ,
                          block_sz = SAMP_SZ,
                          alpha = .05, 
                          upper_threshold = .25,
                          killswitch = Inf){
  dat <- rnorm(samp_sz)
  p_val <- t.test(dat)$p.value
  while(p_val > alpha && p_val < upper_threshold){
    dat <- c(dat, rnorm(block_sz))
    p_val <- t.test(dat)$p.value
    if(killswitch <= length(dat)) break
  }
  return(list(data=dat, pvalue=p_val))
}
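
A quick illustrative run to see what the function returns (the seed is arbitrary):

set.seed(42)           # arbitrary seed, for reproducibility
one_run <- double.sample()
length(one_run$data)   # total participants collected at stopping
one_run$pvalue         # p-value at stopping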

Now call this function 10k times and find out what happens.

# false positive rate for optional stopping with 30 subs
ss30 <- replicate(REPS, list(double.sample())) 
ss30 %>% 
  map(~.$pvalue) %>% 
  unlist %>% 
  prop_positive
## [1] 0.0706
# what does the distribution of sample sizes look like
ss30 %>% 
  map(~length(.$data)) %>% 
  unlist %>% 
  table
## .
##   30   60   90  120  150  180  210  240  270  300  360  390  420  480  510 
## 8018 1328  368  140   70   34   12   12    9    2    2    2    1    1    1
# Highly suspect p-curve
ss30 %>%
  map(~.$pvalue) %>% 
  unlist %>% 
  data.frame(pvalue=.) %>%
  ggplot(aes(x = pvalue, y = ..count..)) +
  stat_bin(geom="line", binwidth = .01) +
  scale_x_continuous(breaks=seq(0,1,.05))

# optional stopping with additions of 2 participants at a time
ss30_b2 <- replicate(REPS, list(double.sample(block_sz = 2)))
ss30_b2 %>% 
  map(~.$pvalue) %>% 
  unlist %>% 
  prop_positive
## [1] 0.0541
# distribution of sample sizes
ss30_b2 %>% 
  map(~length(.$data)) %>% 
  unlist %>% 
  table
## .
##   30   32   34   36   38   40   42   44   46   48   50   52   54   56   58 
## 8030  548  334  245  182  141  103   92   57   51   42   29   23   21   17 
##   60   62   64   66   68   70   72   74   76   78   80   82   84   86   90 
##   14    9   11    5    8    6    1    4    5    2    3    2    2    2    2 
##   94   96   98  100  108  112  124 
##    1    2    2    1    1    1    1
# optional stopping with a reduced significance threshold
ss30_b5_a005 <- replicate(REPS, list(double.sample(samp_sz = 30, block_sz = 5, alpha = .005)))
ss30_b5_a005 %>% 
  map(~.$pvalue) %>% 
  unlist %>% 
  prop_positive(alpha = .05)
## [1] 0.0527
ss30_b5_a005 %>%
  map(~.$pvalue) %>% 
  unlist %>% 
  data.frame(pvalue=.) %>%
  ggplot(aes(x = pvalue, y = ..count..)) +
  stat_bin(geom="line", binwidth = .01) +
  scale_x_continuous(breaks=seq(0,1,.05))

# distribution of sample sizes
ss30_b5_a005 %>% 
  map(~length(.$data)) %>% 
  unlist %>% 
  table
## .
##   30   35   40   45   50   55   60   65   70   75   80   85   90   95  100 
## 7465  691  398  278  175  147  123   85   67   46   51   40   47   34   29 
##  105  110  115  120  125  130  135  140  145  150  155  160  165  170  175 
##   23   16   20   17   17   15   10   13   13    7    6   13   10    7   10 
##  180  185  190  195  200  205  210  215  220  225  230  235  240  245  250 
##    6    2    9    2    3    3    1    1    6    2    6    1    5    2    2 
##  255  260  265  270  275  285  290  295  300  305  310  315  325  330  335 
##    1    2    5    2    4    2    1    3    2    1    2    2    1    1    2 
##  340  345  350  360  365  375  380  395  400  405  410  435  440  445  450 
##    5    2    2    2    1    3    1    1    1    1    2    1    1    1    1 
##  455  470  520  530  540  600  615  635  650  785  805 1075 1190 1240 2300 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
## 2495 2600 3295 4485 6135 
##    1    1    1    1    1

Is there an inflation of false positives? How bad is it?

Yes. If you add participants in blocks of 30, the false positive rate rises from the nominal .05 to about .07 (roughly 2 percentage points). Adding participants in blocks of 2 inflates it less in this run, to about .054. However, if you also lower the stopping threshold to .005, the proportion of p < .05 results balances out to about .05 :)

Now modify this code so that you can investigate this “double the sample” rule in a bit more depth. In the previous question, the researcher doubles the sample only when they think they got “close” to a significant result, i.e. when their not-significant p is less than 0.25. What if the researcher was more optimistic? See what happens in these 3 other scenarios: an upper threshold of .5, an upper threshold of .75, and an upper threshold of 1 (i.e., the researcher keeps adding data whenever the result is not yet significant).

How do these choices affect the false positive rate?

HINT: Try to do this by making the function double.sample take the upper p value as an argument, so that you can pass this through dplyr.

HINT 2: You may need more samples. Find out by looking at how the results change from run to run.
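
For instance, the thresholds could be swept in one purrr pipeline. This is only a sketch: thresholds and fp_rates are names I made up, the rep count is kept small for speed, and upper_threshold = 1 is left out because without the killswitch it can run for a very long time.

thresholds <- c(.25, .5, .75)
fp_rates <- map_dbl(thresholds, function(u) {
  prop_positive(replicate(1000, double.sample(upper_threshold = u)$pvalue))
})
tibble(upper_threshold = thresholds, false_positive_rate = fp_rates)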

u5 <- replicate(REPS*5, list(double.sample(upper_threshold = .5)))
u5 %>% 
  map(~.$pvalue) %>% 
  unlist %>% 
  prop_positive()
## [1] 0.09022
u75 <- replicate(REPS*5, list(double.sample(upper_threshold = .75)))
u75 %>% 
  map(~.$pvalue) %>% 
  unlist %>% 
  prop_positive
## [1] 0.13864
u75 %>% 
  map(~length(.$data)) %>% 
  unlist %>% 
  table
## .
##    30    60    90   120   150   180   210   240   270   300   330   360 
## 14951  9830  6235  4142  2859  2005  1565  1152   889   727   628   503 
##   390   420   450   480   510   540   570   600   630   660   690   720 
##   407   339   318   263   247   213   192   171   137   131   111   114 
##   750   780   810   840   870   900   930   960   990  1020  1050  1080 
##   100    92    76    76    70    74    46    55    50    51    48    36 
##  1110  1140  1170  1200  1230  1260  1290  1320  1350  1380  1410  1440 
##    53    37    32    35    27    33    37    28    25    20    20    24 
##  1470  1500  1530  1560  1590  1620  1650  1680  1710  1740  1770  1800 
##    19    16    21    19    13    22    25    14    12    17    14    16 
##  1830  1860  1890  1920  1950  1980  2010  2040  2070  2100  2130  2160 
##    15     9     9    10    11    16     5    16     8     9     9     5 
##  2190  2220  2250  2280  2310  2340  2370  2400  2430  2460  2490  2520 
##     8     9    11     8     7    11     7     4     7     6     6     3 
##  2550  2580  2610  2640  2670  2700  2730  2760  2790  2820  2850  2880 
##     3     3     4     5     3     6     6     1     4     1     7     5 
##  2910  2940  2970  3000  3030  3060  3090  3120  3150  3180  3210  3240 
##     4     7     5     3     3     3     4     1     3     4     1     1 
##  3270  3300  3330  3360  3390  3420  3450  3480  3510  3540  3570  3600 
##     1     1     5     5     1     3     6     2     2     2     2     4 
##  3630  3660  3690  3720  3750  3780  3810  3840  3870  3900  3930  3960 
##     4     4     2     1     2     2     1     3     3     3     2     3 
##  3990  4020  4050  4080  4110  4170  4200  4230  4260  4320  4350  4380 
##     1     2     2     1     1     1     2     2     2     2     1     1 
##  4410  4440  4470  4530  4590  4620  4680  4710  4740  4770  4830  4860 
##     1     3     2     3     2     1     1     2     4     1     3     3 
##  4920  4980  5070  5100  5130  5190  5220  5280  5310  5340  5370  5400 
##     2     3     3     1     3     1     1     1     1     1     1     1 
##  5430  5460  5550  5580  5640  5670  5700  5730  5760  5790  5820  6120 
##     2     1     2     1     1     1     1     1     1     1     3     2 
##  6150  6420  6570  6720  6780  6870  6930  6990  7050  7320  7350  7470 
##     1     2     1     1     1     2     1     1     1     1     2     1 
##  7560  7590  7740  7770  7860  7950  8160  8190  8640  8700  9000  9420 
##     1     1     1     1     1     1     1     1     1     2     1     1 
##  9480  9660  9690  9720  9750 10020 10290 10320 10380 10410 10440 10500 
##     1     2     1     1     1     1     1     1     1     1     2     1 
## 10560 10800 10890 11040 11070 11160 11310 11610 11850 11880 11970 13260 
##     1     1     1     1     1     1     1     1     1     1     1     1 
## 13680 15030 16020 17700 19020 19290 19500 19590 20010 20640 22470 26370 
##     1     1     1     1     1     1     1     1     1     1     1     1 
## 30360 31110 32280 35550 35910 43320 53820 55140 60480 74520 75150 85050 
##     1     1     1     1     1     1     1     1     1     1     1     1
u75 %>% 
  map(~length(.$data)) %>% 
  unlist %>% 
  qplot(binwidth = 10) +
  scale_x_continuous(limits = c(0,1000))
## Warning: Removed 1232 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

# With upper_threshold = 1, a run only stops once p < .05, which can take
# an extremely long time, so I added a "killswitch": the simulation stops
# collecting participants once the sample size reaches that number.
u1_k1000 <- replicate(REPS*3, list(double.sample(upper_threshold = 1, killswitch = 1000)))
u1_k1000 %>%
  map(~.$pvalue) %>%
  unlist %>%
  prop_positive
## [1] 0.0993
# resulting sample sizes
u1_k1000 %>%
  map(~length(.$data)) %>%
  unlist %>%
  table
## .
##    30    60    90   120   150   180   210   240   270   300   330   360 
##  1493  1018   712   572   482   365   350   297   267   243   232   224 
##   390   420   450   480   510   540   570   600   630   660   690   720 
##   187   188   169   151   142   125   139   121   107   103    99    98 
##   750   780   810   840   870   900   930   960   990  1020 
##   102    88    88    84    89    73    73    87    67 21365
# another very sketchy p curve
u1_k1000 %>% 
  map(~.$pvalue) %>% 
  unlist %>% 
  data.frame(pvalue=.) %>%
  ggplot(aes(x = pvalue, y = ..count..)) +
  stat_bin(geom="line", binwidth = .01) +
  scale_x_continuous(breaks=seq(0,1,.05))

What do you conclude on the basis of this simulation? How bad is this kind of data-dependent policy?

Optional stopping inflates your false positive rate, and the inflation grows as you raise the upper threshold for calling it quits: about .07 at a threshold of .25, .09 at .5, and .14 at .75; at a threshold of 1 the rate was about .10, though that run was capped by the 1000-participant killswitch. That said, if you abide by an upper threshold of .25 and lower your alpha to .005, the rate of p < .05 results balances out to about .05.