Data peeking without cheating

Matthew McBee
21 October, 2016

Introduction

Research is difficult!

Constraints, incentives, and normative practices:

  • Resources, time, and subject pools are limited
  • Publication is strongly incentivized
  • Papers are evaluated in terms of the results
  • File-drawer bias

What is Data Peeking?

Data peeking is interim data analysis.

Reasons for data peeking:

  • Pilot studies
  • Conferences (abstract, posters, presentations)
  • Curiosity / Excitement
  • Inform decision-making about the study

Decisions based on interim analysis

End Study Early if:

  • p < .05 at interim check
  • p indicates no or weak evidence

Continue Study If:

  • p is non-significant but trending towards significance

Legitimate Reasons for Data Peeking or Optional Stopping

  • Desire to optimally allocate resources
    • When significant equals publishable, no incentive to continue after p<.05
    • Since non-significant often means non-publishable, no incentive to further invest in studies that seem to be “failing”
  • Ambiguity of non-significance: a non-significant result may reflect either
    • a true \( H_0 \), or
    • inadequate statistical power

“One could easily argue that psychological researchers have an ethical obligation to repeatedly analyze accumulating data, given that continuing data collection whenever the desired level of confidence is reached, or whenever it is sufficiently clear the expected effects are not present, is a waste of the time of participants and the money provided by tax-payers.” (Lakens, 2014, p.3)

Implications of Data-Peeking in the Frequentist Statistical Framework

Null Hypothesis Significance Testing (NHST) framework

  • NHST remains dominant in psychology
  • Paradigm is a mixture of Fisher and Neyman-Pearson approaches, favoring N-P.
    • Fisher: p-value is continuous measure of evidence against \( H_0 \)
    • Neyman-Pearson: formalized decision-making process (reject or fail to reject \( H_0 \)) based on \( \alpha \).
  • The only real virtue of p-values under the Neyman-Pearson approach is long-run error control.

Fragility of p-values

  • Performance of p-values for measuring evidence (Fisher) or controlling Type-I error rates (N-P) depends strongly on researcher intentions.
  • p-values are only valid under a limited set of circumstances
    • Confirmatory design
    • Single test (or adjustment for multiple testing)
    • Assumptions satisfied
    • Most important: future actions must not be conditioned on the results of interim hypothesis tests

Data peeking

The universe exists in two states:

  • before seeing the data
  • after seeing the data

Data peeking is harmless (relative to NHST) if seeing the interim result does not, in any way, change your future behavior.

\[ \textbf{Is this realistic?} \]

\[ \textbf{Really?} \]

Optional Stopping

Optional stopping is the practice of iteratively adding subjects and updating statistical hypothesis tests.

Iterate until p-value becomes “decisive.”

  • p < .05 or
  • p is strongly non-significant

(Alternatively: iterate until p<.05).

Optional Stopping

  • Optional stopping is a form of p-hacking
  • p-values follow a uniform distribution under \( H_0 \)
  • Type-I error rate approaches 100% given unrestricted iteration
  • Even a few uncorrected peeks result in notable inflation of Type-I error rate

Optional stopping destroys the error control property of p-values.
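
A minimal simulation sketch of this inflation (illustrative settings: \( H_0 \) true, a peek after every added subject from \( n \)=10 to \( n \)=100):

    # Simulate Type-I error inflation from optional stopping.
    # H0 is true (delta = 0); a one-sample t-test is rerun after every
    # added subject from n = 10 to n = 100, stopping at the first p < .05.
    set.seed(1)
    peeked_reject <- replicate(2000, {
      x <- rnorm(100)                    # all data generated under H0
      any(sapply(10:100, function(n) t.test(x[1:n])$p.value < .05))
    })
    mean(peeked_reject)                  # far above the nominal .05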

Frequentist Solution

Sequential Analysis

  • Interim analysis is allowed in frequentist statistics if \( \alpha \) is adjusted for multiple comparisons
  • The technique for adjusting \( \alpha \) is called sequential analysis (Lakens, 2014).
  • Widely used in clinical trials

Sequential Analysis

  • If sample sizes for interim analyses can be prespecified, we find boundary critical values \( c_1 \) and \( c_2 \) such that, after collecting \( n \) and then \( N \) observations,

    \[ p(Z_n \geq c_1) + p(Z_n < c_1,\; Z_N \geq c_2) = \frac{\alpha}{2} \]
  • Pocock bounds find equal critical values for all interim tests.

  • O'Brien-Fleming bounds set a higher critical value for the early tests than the later ones.

  • Both solutions require equal \( n \) between interim analyses.

Example: Pocock bounds

  • If a researcher plans \( n_{max} \) = 120 and one interim analysis at \( n=60 \),
  • the Pocock bounds are \( \alpha_1 = \alpha_2 = .0294 \) (a sketch follows below)
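
A sketch of this calculation using the gsDesign R package (one option among several; Lakens (2014) describes others):

    # Two-look Pocock design: one interim analysis at n = 60, final at n = 120.
    # test.type = 2 requests a two-sided symmetric design; alpha is one-sided .025.
    library(gsDesign)
    d <- gsDesign(k = 2, test.type = 2, alpha = 0.025, sfu = "Pocock")
    d$upper$bound                      # z critical values at each look
    2 * (1 - pnorm(d$upper$bound))     # two-sided p thresholds, about .0294 each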

Spending functions

  • Sometimes it is not possible to preplan interim analyses at specific levels of \( n \)
  • (However, you must preplan the number of peeks)
  • In this case you can use a spending function to decide how to allocate \( \alpha \) across interim analyses
  • WinLD and the R package GroupSeq can do this.
  • You cannot decide to run more tests than planned or do additional data collection if the final planned test is non-significant.
  • Power is somewhat reduced compared with a single-analysis strategy (a sketch follows below)
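
A sketch of the spending-function approach, again with gsDesign (the look timings here are illustrative assumptions):

    # Lan-DeMets O'Brien-Fleming-type alpha-spending with unequally spaced looks:
    # interim analyses at 40% and 75% of the planned sample, final look at 100%.
    library(gsDesign)
    d <- gsDesign(k = 3, test.type = 2, alpha = 0.025,
                  sfu = sfLDOF, timing = c(0.40, 0.75, 1))
    2 * (1 - pnorm(d$upper$bound))     # two-sided alpha spent at each look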

Bayesian Solution

Bayesian Statistics

  • Computationally challenging but easier to interpret
  • Three core concepts
    • Prior distribution
    • Likelihood (data)
    • Posterior distribution
  • The frequentist approach is about controlling error rates
  • The Bayesian approach is about learning from data

Bayesian Statistics

\[ \underbrace{p(\theta \mid y)}_{\textit{posterior}} \;\propto\; \underbrace{p(y \mid \theta)}_{\textit{likelihood}} \;\, \underbrace{p(\theta)}_{\textit{prior}} \]
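
The coin-flip examples that follow are conjugate beta-binomial updates; a minimal sketch (the Beta(2, 2) "weak" prior is an illustrative assumption):

    # Conjugate update: Beta(a, b) prior + binomial data -> Beta(a + heads, b + tails)
    a <- 2; b <- 2                             # weak prior pseudo-counts (assumed)
    heads <- 8; n <- 10                        # data: 8 heads in 10 flips
    curve(dbeta(x, a, b), 0, 1, lty = 2,
          xlab = "Pr(heads)", ylab = "density")              # prior (dashed)
    curve(dbeta(x, a + heads, b + (n - heads)), add = TRUE)  # posterior (solid)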

Weak Prior, Weak Data

(8 heads in 10 coin flips)

[Figure: prior and posterior for 8 heads in 10 flips under a weak prior]

Weak Prior, Strong Data

(80 heads in 100 coin flips)

[Figure: prior and posterior for 80 heads in 100 flips under a weak prior]

Strong Prior, Weak Data

(8 heads in 10 coin flips)

[Figure: prior and posterior for 8 heads in 10 flips under a strong prior]

Strong Prior, Strong Data

(800 heads in 1,000 coin flips)

[Figure: prior and posterior for 800 heads in 1,000 flips under a strong prior]

Uninformative (flat) Prior, Weak Data

(4 heads in 5 coin flips)

[Figure: prior and posterior for 4 heads in 5 flips under a flat prior]

Bayes Factors

Given two hypotheses \( H_0 \) and \( H_1 \), and some data \( x \).

\[ \underbrace{\frac{p(H_0 \mid x)}{p(H_1 \mid x)}}_{\textit{posterior odds}} \;=\; \underbrace{\frac{p(x \mid H_0)}{p(x \mid H_1)}}_{\textit{Bayes factor}} \;\times\; \underbrace{\frac{p(H_0)}{p(H_1)}}_{\textit{prior odds}} \]

Bayes Factors

The Bayes Factor describes how the relative probability of \( H_1 \) versus \( H_0 \) should be updated given the data.

\( H_1 \) and \( H_0 \) are represented by competing priors.

Usually the prior for \( H_0 \) is strongly concentrated at zero, while the prior for \( H_1 \) is spread over the parameter space.

Bayes Factors: Software Options

  • JASP. Free software with SPSS-like point-and-click interface. Download from jasp-stats.org.

    • Currently cannot do any data manipulation in JASP.
    • Produces nice APA-formatted tables.
    • Reads SPSS .sav files as well as comma-separated (.csv) and tab-delimited formats
  • R package BayesFactor.

  • With these tools, Bayes Factors can only be computed for relatively simple models such as t-tests, ANOVA, and linear regression (a minimal example follows below).
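
A minimal example with the BayesFactor package (the simulated data and effect size are assumptions):

    # Default two-sample Bayes Factor t-test; reports BF10 (evidence for H1 over H0).
    library(BayesFactor)
    set.seed(1)
    x <- rnorm(50, mean = 0.4)           # simulated "treatment" group
    y <- rnorm(50, mean = 0)             # simulated "control" group
    ttestBF(x = x, y = y)                # default ("medium") Cauchy prior on effect size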

Bayes Factor Priors

  • The specific nature of these priors matters for computing the Bayes Factor. However, there are certain defaults that have been shown to work well in most circumstances (Morey & Rouder, 2015).

    • The prior for \( H_0 \) is a point prior at \( d \)=0.
    • R's BayesFactor package has three selectable priors for \( H_1 \).
    • The default \( H_1 \) prior in JASP is the same as the BayesFactor default but is user-adjustable.
    • 'Wide' priors for \( H_1 \) are more conservative (favor \( H_0 \)); the three options are compared in the sketch below
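
A sketch comparing the three selectable priors on the same simulated data (Cauchy scales: medium = \( \sqrt{2}/2 \), wide = 1, ultrawide = \( \sqrt{2} \)):

    # Wider H1 priors yield smaller BF10 for the same data (more conservative).
    library(BayesFactor)
    set.seed(1)
    x <- rnorm(50, mean = 0.4); y <- rnorm(50)
    sapply(c("medium", "wide", "ultrawide"),
           function(r) extractBF(ttestBF(x = x, y = y, rscale = r))$bf)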

Selectable priors in BayesFactor package

[Figure: the three selectable \( H_1 \) prior distributions in the BayesFactor package]

Interpreting Bayes Factors

Jeffreys (1961)

\( \log_{10}(BF) \)   Interpretation
0 to 0.5              barely worth mentioning
0.5 to 1.0            substantial
1.0 to 1.5            strong
1.5 to 2.0            very strong
> 2.0                 decisive

Interpreting Bayes Factors

Kass & Raftery (1995)

\( 2\ln(BF) \)   Interpretation
0 to 2           not worth more than a bare mention
2 to 6           positive
6 to 10          strong
> 10             very strong

Monitoring Bayes Factors

  • Unlike frequentist p-values, one can update the computed BF after each subject
  • There is no harm in repeated testing so long as the stopping criteria are prespecified
  • Monitoring the BF during data collection displays how evidence accumulates (a sketch follows below)
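
A sketch of such monitoring (one-sample simulated data with an assumed true effect of \( \delta = .2 \); BF recomputed every five subjects):

    # Track BF10 as the sample grows; dashed lines mark prespecified stopping
    # bounds of BF10 = 10 and BF01 = 10 (i.e., BF10 = 1/10).
    library(BayesFactor)
    set.seed(42)
    x <- rnorm(200, mean = 0.2)
    n_seq <- seq(10, 200, by = 5)
    bf <- sapply(n_seq, function(n) extractBF(ttestBF(x[1:n]))$bf)
    plot(n_seq, log10(bf), type = "l", xlab = "n", ylab = "log10(BF10)")
    abline(h = log10(c(10, 1/10)), lty = 2)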

Accumulating Evidence for H0

[Figure: BF trajectory accumulating evidence for \( H_0 \)]

Accumulating Evidence for H1

[Figure: BF trajectory accumulating evidence for \( H_1 \)]

Accumulating Evidence for H1

[Figure: BF trajectory accumulating evidence for \( H_1 \), second example]

Equivocal Evidence

[Figure: BF trajectory with equivocal evidence]

Optional stopping with p-values (ex 1)

\( \delta=0 \), stop collecting data when p < .05 or n = 200

[Figure: p-value trajectory under optional stopping, example 1]

Seeing the future (ex 1)

\( \delta=0 \), stop collecting data when BF > 10 or n = 200

[Figure: BF trajectory with BF-based stopping, example 1]

Optional stopping with p-values (ex 2)

\( \delta=0 \), stop collecting data when p < .05 or n = 200

[Figure: p-value trajectory under optional stopping, example 2]

Seeing the future (ex 2)

\( \delta=0 \), stop collecting data when BF > 10 or n = 200

[Figure: BF trajectory with BF-based stopping, example 2]

Optional stopping with p-values (ex 3)

\( \delta=0 \), stop collecting data when p < .05 or n = 200

[Figure: p-value trajectory under optional stopping, example 3]

Seeing the future (ex 3)

\( \delta=0 \), stop collecting data when BF > 10 or n = 200

[Figure: BF trajectory with BF-based stopping, example 3]

Optional stopping with p-values (ex 4)

\( \delta=.2 \), stop collecting data when p < .05 or n = 200

[Figure: p-value trajectory under optional stopping, example 4]

Seeing the future (ex 4)

\( \delta=.2 \), stop collecting data when BF > 10 or n = 200

[Figure: BF trajectory with BF-based stopping, example 4]

Preregistration

  • Preregister your interim data analysis plan to assure reviewers and readers that you did not p-hack.
  • Examples:
    • “We will update the Bayes Factor after every five subjects. We will terminate data collection when either \( BF_{10} \) or \( BF_{01} \) reaches 10 or when we reach \( n \)=200.”
    • “We will use sequential analysis to compute Pocock critical values for hypothesis tests at \( n \)=50, \( n \)=100, and \( n \)=150. We will end data collection either at \( n \)=150 or after reaching statistical significance in an interim analysis.”
  • Preregistration is easy and quick

References

  • Etz, A., Gronau, Q. F., Dablander, F., Edelsbrunner, P. A., & Baribault, B. (2016). How to become a Bayesian in eight easy steps: An annotated reading list. Preprint posted on the Open Science Framework. https://osf.io/cpvfk/

  • JASP Team (2016). JASP (Version 0.7.5.6) [Computer software]

  • Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford, UK: Oxford University Press.

  • Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773-795.

  • Lakens, D. (2014). Performing high-powered studies efficiently with sequential analysis. Social Science Research Network. doi: http://dx.doi.org/10.2139/ssrn.2333729

  • Morey, R. D., & Rouder, J. N. (2015). BayesFactor: Computation of Bayes factors for common designs (R package). https://CRAN.R-project.org/package=BayesFactor
