P Values: Practices and Alternatives

Tyler McDaniel
6/6/2018

Part 1: Errors and Misuse

Why talk about P values?

emoji

Reproducibility

  • What does reproducibility mean in the context of our work?
  • What are challenges to making our work more reproducible?
  • Do we replicate others' work? What type of replication?

Psychology

psychology

  • 97% of original studies statistically significant
  • 36% of replication studies statistically significant

Psychology

psychology2

  • Success better predicted by strength of original evidence than characteristics of research teams

pcurve

Psychology

desperate

Medicine

medicine

Medicine

medicine2

  • One third of highly cited studies were contradicted or found stronger effects than subsequent studies
  • More prevalent in non-random and/or small studies

Criminology

criminology

  • Systematic reviews demonstrate heterogeneous findings
  • Problematic for “what works” policy recommendations

Prevention Science

prevention

  • Push to report non-significant and negative results

Children's Health

conflict

  • “More efforts are needed in order to ensure transparency.”

Reproducibility

  • Methods reproducibility: same data, same tools
  • Results reproducibility: same methods, new study
  • Inferential reproducibility: investigate claims

Reproducibility

  • Can we pre-register research plans?
  • Can we publish data/code?
  • What types of reproducibility are useful at Child Trends?
  • Can we improve methodological training/tools?
  • Can we incentivize good research?

So what is a P value?

  • Probability that one would get a result as extreme (as the current one)
  • or more extreme,
  • assuming the null hypothesis is true,
  • if the experiment were replicated infinitely,
  • under perfect conditions.
  • (Or something like that)
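
The definition above can be made concrete with a small simulation. This is a minimal sketch under assumed inputs, not part of the original talk: a hypothetical coin is flipped 100 times and lands heads 60 times, the null hypothesis is a fair coin, and the P value is estimated as the share of simulated null experiments that land at least as far from 50 heads as the observed result.

```python
import random


def simulated_p_value(observed_heads, n_flips, n_replications=20_000, seed=0):
    """Estimate a two-sided P value for a fair-coin null by brute-force replication."""
    rng = random.Random(seed)
    expected = n_flips / 2
    observed_distance = abs(observed_heads - expected)

    as_extreme = 0
    for _ in range(n_replications):
        # One replication of the experiment, assuming the null (fair coin) is true.
        heads = sum(rng.random() < 0.5 for _ in range(n_flips))
        # Count results as extreme as the observed one, or more extreme.
        if abs(heads - expected) >= observed_distance:
            as_extreme += 1
    return as_extreme / n_replications


# Hypothetical data: 60 heads in 100 flips.
p = simulated_p_value(observed_heads=60, n_flips=100)
print(f"Simulated two-sided P value: {p:.3f}")
```

The estimate should land near the exact binomial answer of about 0.057, which is just above the conventional 0.05 threshold. Note that the code never computes the probability that the coin is fair; it only describes how surprising the data would be if it were.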

P Value Thresholds

  • \( 0.05 \) (social sciences, biomedical sciences, etc.)
  • \( 3 \times 10^{-7} \) (high energy physics)
  • \( 5 \times 10^{-8} \) (genomics)
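
As a rough sketch of where these thresholds come from (the interpretations below are the conventional ones, not claims made in the slides): 0.05 is close to the two-sided tail beyond 1.96 standard deviations of a normal distribution, the physics threshold is close to the one-sided tail beyond 5 standard deviations, and the genomics threshold is commonly described as 0.05 Bonferroni-corrected for about one million independent tests.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def one_sided_tail(z):
    """Probability of a standard normal draw exceeding z."""
    return 1.0 - standard_normal.cdf(z)


def two_sided_tail(z):
    """Probability of a standard normal draw exceeding z in absolute value."""
    return 2.0 * one_sided_tail(z)


social_science = two_sided_tail(1.96)   # close to 0.05
physics = one_sided_tail(5.0)           # the "five sigma" rule, close to 3e-7
genomics = 0.05 / 1_000_000             # assumed: about one million independent tests

print(f"two-sided, z = 1.96: {social_science:.4f}")
print(f"one-sided, z = 5:    {physics:.2e}")
print(f"0.05 / 1,000,000:    {genomics:.0e}")
```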

The Origins of P Values

  • Fisher invented the P value to use in agricultural decisions
  • “This is an arbitrary, but convenient, level of significance for the practical investigator, but it does not mean that he allows himself to be deceived once every twenty experiments. The test of significance only tells him what to ignore, namely all experiments in which significant results are not obtained. He should only claim that a phenomenon is experimentally demonstrable when he knows how to design an experiment so that it will rarely fail to give a significant result. Consequently, isolated significant results which he does not know how to reproduce are left in suspense pending further investigation.” (Fisher, 1929)

This is pretty boring

But the implications are quite interesting!

veggie

ASA Statement

asa

ASA Statement

  • Can indicate how compatible data are with given model
  • Does not measure probability that hypothesis is true or that data were produced by chance alone
  • Decisions should not be based only on whether a P value passes a threshold
  • Proper inference requires transparency
  • Different from effect size and importance of results
  • Should not be used as the sole measure of evidence for a hypothesis

Discussion

  • How do we define P values?
  • How do we interpret (non)significant results?
  • What do * vs. *** mean?

Part 2: Alternative Approaches

Why 0.05?

redefine

  • In fields where P < 0.05 is considered the threshold for significance, changing the standard to 0.005 would make P hacking and naive data manipulation more difficult

Why 0.05?

  • \( \frac{P(H_1 | x_{obs})}{P(H_0 | x_{obs})} = \frac{f(x_{obs} | H_1)}{f(x_{obs} | H_0)} \times \frac{P(H_1)}{P(H_0)} = \text{Bayes Factor} \times \text{Prior Odds} \)
  • For psychology, the Bayesian prior odds of \( H_1 \) relative to \( H_0 \) are roughly 1:10
  • A two-sided P value of 0.05 corresponds to a Bayes Factor of roughly 2.5 to 3.4 in favor of \( H_1 \)
  • This means that, according to a Bayesian, the posterior odds of \( H_1 \) relative to \( H_0 \) for a study deemed “significant” might be only \( 3 \times \frac{1}{10} = \frac{3}{10} \): the null is still more likely than the alternative!
  • Alternatively, a two-sided P value of 0.005 corresponds to a Bayes Factor of between 14 and 26
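
The arithmetic on this slide can be checked in a few lines. This is a minimal sketch that takes the slide's numbers as given (prior odds of 1:10, Bayes Factors of about 3, 14, and 26) and adds one step the slide leaves implicit: converting posterior odds to a posterior probability via odds / (1 + odds). The function names are illustrative.

```python
def posterior_odds(bayes_factor, prior_odds):
    """Posterior odds of H1 vs. H0 = Bayes Factor x prior odds."""
    return bayes_factor * prior_odds


def odds_to_probability(odds):
    """Convert odds of H1 vs. H0 into the probability of H1."""
    return odds / (1.0 + odds)


prior_odds = 1 / 10  # the slide's rough figure for psychology

# Bayes Factors quoted on the slide: about 3 for P = 0.05, 14 to 26 for P = 0.005.
results = {}
for bayes_factor in (3, 14, 26):
    odds = posterior_odds(bayes_factor, prior_odds)
    results[bayes_factor] = odds_to_probability(odds)
    print(f"BF = {bayes_factor:>2}: posterior odds = {odds:.1f}, "
          f"P(H1 | data) = {results[bayes_factor]:.2f}")
```

Under these assumptions, a result that just reaches P = 0.05 leaves \( H_1 \) with a posterior probability of only about 0.23, while P = 0.005 moves it to somewhere between about 0.58 and 0.72.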

How Easy is P-Hacking?

P Value Thresholds

  • Are P value thresholds in our work immutable?
  • Where might P hacking occur?
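
One simple form of P hacking is testing several outcomes and reporting whichever one crosses the threshold. The sketch below is an illustration under stated assumptions, not an analysis from the talk: every null hypothesis is true, the tests are independent, and each P value is therefore uniformly distributed between 0 and 1. It compares the analytic chance of at least one "significant" result, \( 1 - (1 - \alpha)^k \), with a simulation.

```python
import random


def familywise_rate(n_tests, alpha=0.05):
    """Chance that at least one of n independent true-null tests has P < alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulated_rate(n_tests, alpha=0.05, n_studies=20_000, seed=1):
    """Simulate studies that run n_tests null tests and report if any is 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Under a true null, a P value from a continuous test is uniform on [0, 1].
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_studies


for k in (1, 5, 10, 20):
    print(f"{k:>2} tests: analytic = {familywise_rate(k):.3f}, "
          f"simulated = {simulated_rate(k):.3f}")
```

With ten outcomes to choose from, the chance of finding at least one P value below 0.05 is about 40% even when nothing is going on.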

Some Examples: But First, Some Context

  • Racial disparities in math for Black and Hispanic youth have been linked to school segregation (Ready & Silander, 2011)
  • Three times as many Black and Hispanic students as white students attend intensely segregated schools, which are associated with high levels of poverty, higher teacher mobility, less qualified teachers, and fewer resources (Orfield & Lee, 2005)
  • Extremely few Black and Hispanic 4th-8th graders live in districts where test scores are at or above the national average (Reardon, 2017)

Model Robustness

Algebra Prop_wh Mftotal Security Mentalhealth APClasses Corporal Teachers Athletics
. . . . . . . . .
. . . . . . . . .
  • Example: how does the number of certified algebra teachers vary with the proportion of white students?
  • \( y = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n + \epsilon \)
  • 7 possible control variables
  • We have \( 2^7 = 128 \) possible models
  • But this could quickly get unwieldy: \( 2^{15} = 32{,}768 \)!
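
The model count above can be reproduced by enumerating every subset of the control variables. This sketch assumes, based only on the column header shown on this slide, that Algebra is the outcome, Prop_wh is the predictor of interest, and the remaining seven columns are the optional controls. It lists the specifications without fitting them, since no data are included here.

```python
from itertools import combinations

# Assumed roles, inferred from the column header on the slide.
outcome = "Algebra"
focal_predictor = "Prop_wh"
controls = ["Mftotal", "Security", "Mentalhealth", "APClasses",
            "Corporal", "Teachers", "Athletics"]


def all_specifications(candidate_controls):
    """Yield every subset of the controls, from none of them to all of them."""
    for size in range(len(candidate_controls) + 1):
        for subset in combinations(candidate_controls, size):
            yield subset


specs = list(all_specifications(controls))
formulas = [f"{outcome} ~ " + " + ".join([focal_predictor, *subset])
            for subset in specs]

print(len(formulas))   # one model per subset of the 7 controls
print(formulas[0])     # the model with no controls
print(formulas[-1])    # the model with all 7 controls
print(2 ** 15)         # the count with 15 candidate controls
```

Each control is either in or out of a model, so the number of specifications doubles with every added control. That doubling is why exhaustive robustness checks get expensive quickly.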

Model Robustness

reg

Model Robustness

mrobust

Model Robustness

graph

Model Robustness

robust

Model Robustness

# net install mrobust, replace
# ssc describe mrobust

Model Robustness

  • How can we choose better models?
  • Potential downsides to model averaging?
  • How does this relate to P Values and reproducibility?

Data Visualization

Discussion

  • What resources are most helpful in conceptualizing data?
  • What are challenges related to visualizing or explaining P values?
  • What types of tools and interpretations are desired by partners and funders?

Resources