Matthew T. McBee
April 22, 2017
These slides are available at www.rpubs.com/mmcbee.
Replication Crisis in psychology and other health / behavioral sciences
Nullius In Verba
The Circle of Life
According to Karl Popper, the truth of a theory can never be confirmed, no matter how many of its predictions are borne out.
But a single false prediction is “logically decisive”: it is sufficient to prove that a theory is wrong.
So the best way for science to progress is for research to provide sensitive tests that can expose a theory’s errors.
Popper argued that falsifiability is what distinguishes science from non-science.
The number and specificity of hypotheses that can be derived from a theory are a measure of its falsifiability.
Strong theories have many ways of being wrong.
Falsificationist orientation is a state of mind.
No softball empirical research!
Logically decisive falsifications are rare in fields where results are statistical probabilities, not certainties.
However, falsifying evidence accumulates over time and becomes convincing.
Gozer the Destructor
Choose and Perish
Typically \(H_1\), the alternative or research hypothesis, is what the theory says should happen.
Its foil is \(H_0\), the null.
\(H_1: \mu_1-\mu_2 \ne 0\)
\(H_0: \mu_1-\mu_2 = 0\)
By rejecting \(H_0\) we support \(H_1\).
The procedure underlying statistical tests is as follows (see the code sketch after these steps):
Assume that \(H_0\) is true.
Calculate a p-value that describes how unusual your data are given \(H_0\).
If p is smaller than a pre-specified criterion (\(\alpha\)), then reject \(H_0\). We have achieved statistical significance.
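A minimal R sketch of this procedure, using simulated data (the group names and the true effect size of 0.4 are hypothetical):

```r
# Sketch of the standard testing procedure with simulated data.
set.seed(1)
alpha     <- 0.05
treatment <- rnorm(50, mean = 0.4)  # treatment group scores
control   <- rnorm(50, mean = 0.0)  # control group scores

# Assume H0 (equal means) and compute how unusual the data are under it.
p <- t.test(treatment, control)$p.value

# Reject H0 if p falls below the pre-specified criterion alpha.
ifelse(p < alpha, "reject H0", "fail to reject H0")
```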
Rejecting \(H_0\) is interpreted as supporting or strengthening a theory.
Failing to reject \(H_0\) is interpreted as weakening it.
Cohen (1994) coined the phrase “nil hypothesis” to describe the typical null of zero effect.
Testing the nil hypothesis means testing whether the data are consistent with what the theory doesn’t predict (zero) rather than what it does!
The theory’s actual predictions (to the extent that it has any!) never get tested.
We can test a different kind of “nil” null hypothesis when we can at least commit to a directional hypothesis.
\(H_1: \mu_1-\mu_2 > 0\)
\(H_0: \mu_1-\mu_2 \le 0\)
This is usually presented as a way of increasing statistical power rather than improving inference.
Is this really the best that we can do?
In the absence of specific predictions, we can only test the nil hypothesis.
Paradoxically, our test becomes less falsificationist as sample size increases!
“In the physical sciences, the usual result of an improvement in experimental design, instrumentation, or numerical mass of data, is to increase the difficulty of the ‘observational hurdle’ which the physical theory of interest must successfully surmount; whereas, in psychology and some of the allied behavior sciences, the usual effect of such improvement in experimental precision is to provide an easier hurdle for the theory to surmount.” – Meehl (1967), p. 103
Non-directional “nil hypothesis” significance test for correlation.
\(H_0: \rho = 0\)
\(H_1: \rho \ne 0\)
Statistically significant estimated correlation coefficients versus sample size
In nil hypothesis tests, statistically significant results are interpreted as supportive of a theory.
Interpretation of estimated correlation coefficients by sample size
As the sample size (\(n\)) approaches \(\infty\), the theory rules out less of the parameter space.
Interpretation of estimated correlation coefficients by sample size
If we can at least derive the sign of the correlation from theory, we can do a directional hypothesis test.
\(H_1: \rho > 0\)
\(H_0: \rho \le 0\)
Statistically significant correlation coefficients versus sample size, directional case
In both of these cases, the statistical test becomes less strict as sample size increases!
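A quick R sketch makes this concrete: the smallest \(|r|\) reaching \(p < .05\) shrinks steadily as \(n\) grows (the two-sided nil test is shown; the directional version behaves the same way).

```r
# Smallest |r| that is statistically significant at alpha = .05
# (two-sided nil test), as a function of sample size n.
r_crit <- function(n, alpha = 0.05) {
  t_crit <- qt(1 - alpha / 2, df = n - 2)
  t_crit / sqrt(t_crit^2 + n - 2)
}
round(sapply(c(20, 100, 1000, 10000), r_crit), 3)
# Roughly 0.444, 0.197, 0.062, 0.020: the hurdle keeps shrinking.
```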
Imagine that we could derive some quantitative expectations from psychological theories:
For example:
The population standardized group mean difference between the treatment and control groups should be \(\delta = 0.4\).
The population correlation coefficient between self-efficacy and effort in mathematics should be \(\rho=.6\).
Assume that our theory implies that the correlation between two variables should be \(\rho=.5\).
\(H_0: \rho = .5\)
\(H_1: \rho \ne .5\)
How does this change things?
Everything changes!
When testing point predictions:
Rejecting \(H_0\) weakens support for the theory.
Failing to reject \(H_0\) strengthens support for the theory.
Statistical significance becomes “significantly different from what the theory predicts”, not “significantly different from zero.”
Most importantly, our hypothesis tests become much more falsificationist.
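As a sketch, a point prediction such as \(H_0: \rho = .5\) can be tested with the standard Fisher \(z\)-transformation; the \(r\) and \(n\) below are hypothetical.

```r
# Testing the point prediction H0: rho = .5 via the Fisher
# z-transformation; r and n are hypothetical.
r    <- 0.38   # observed sample correlation
n    <- 120    # sample size
rho0 <- 0.5    # theory-implied value

z_stat <- (atanh(r) - atanh(rho0)) * sqrt(n - 3)
p <- 2 * pnorm(-abs(z_stat))
p  # small p: the data differ significantly from what the theory predicts
```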
Interpretation of estimated correlation coefficients by sample size
The paradox is now resolved: larger samples lead to stronger, stricter tests of theories.
Decision errors become reversed:
Type-I error is now the risk of obtaining evidence against a true theory. It is fixed and controlled by the choice of \(\alpha\).
Type-II error is now the risk of failing to obtain evidence against a false theory.
Vague, qualitative theories
No units for measurement
Lack of psychological constants
Create better, more specific theories!
Hypothesize in terms of standardized units
Test bounded or interval predictions
Often a theory will not be capable of predicting specific values for parameters, but it can rule out certain ranges.
For example:
“Patients undergoing cognitive-behavioral therapy will improve by at least \(\delta=0.2\) in symptom expression after six sessions.”
“The correlation between socioeconomic status and educational attainment is at least \(\rho=0.5\).”
\(H_0: \rho \ge 0.5\)
\(H_1: \rho < 0.5\)
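A one-sided Fisher \(z\) test is one way to implement this bounded prediction (a sketch; \(r\) and \(n\) are hypothetical):

```r
# One-sided test of the bounded prediction H0: rho >= .5
# against H1: rho < .5; r and n are hypothetical.
r <- 0.41
n <- 200
z_stat <- (atanh(r) - atanh(0.5)) * sqrt(n - 3)
p <- pnorm(z_stat)  # lower tail: evidence that rho < .5
p                   # small p would count against the theory
```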
Statistically significant correlation coefficients versus sample size, bounded prediction
We use theory to place a lower and an upper bound on a parameter that we cannot derive precisely from theory. For example:
Hypotheses
“A moderate to strong relationship will exist between intelligence and performance on an insight problem solving task.”
Using Cohen’s guidelines, call “moderate” \(\rho=0.5\) and “strong” \(\rho=0.8\).
\(H_0: \rho \ge 0.5\) and \(\rho \le 0.8\)
\(H_1: \rho < 0.5\) or \(\rho > 0.8\)
Interpretation of estimated correlation coefficients by sample size, interval prediction case
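One way to implement the interval test, previewing the confidence-interval approach on the next slides: check whether a 95% CI overlaps the predicted range (a sketch with simulated data and hypothetical variable names).

```r
# Checking the interval prediction .5 <= rho <= .8 against a 95% CI;
# the data are simulated and the variable names are hypothetical.
set.seed(3)
iq      <- rnorm(100)
insight <- 0.6 * iq + rnorm(100, sd = 0.8)

ci <- cor.test(iq, insight)$conf.int
ci
# Reject the theory only if the CI lies entirely below .5 or
# entirely above .8; any overlap with [.5, .8] is compatible.
ci[1] <= 0.8 & ci[2] >= 0.5
```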
There is a direct correspondence between the 95% confidence interval and a statistical test with \(\alpha=.05\).
If the value under \(H_0\) is in the confidence interval, the hypothesis test is non-significant.
You can exploit this fact to implement these tests!
If the theory-implied value is contained in the confidence interval, you should not reject the theory!
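A sketch of this duality with a one-sample \(t\)-test, where the correspondence is exact:

```r
# CI-test duality: mu0 outside the 95% CI <=> p < .05 for H0: mu = mu0.
set.seed(2)
x  <- rnorm(40, mean = 0.3)
tt <- t.test(x, mu = 0, conf.level = 0.95)
tt$conf.int  # if 0 lies inside this interval...
tt$p.value   # ...then this p-value exceeds .05, and vice versa
```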
Most statistical applications do not report standard errors or confidence intervals for correlation coefficients.
R can do it using the CIr() function in the package psychometric.
Example
According to theory, the population correlation (\(\rho\)) should be in the range of \([0.2, 0.5]\).
We collect some data (\(n=98\)) and calculate \(r=0.57\). The function CIr() will report the 95% confidence interval.
```r
library(psychometric)  # provides CIr()
CIr(r = 0.57, n = 98, level = 0.95)
## [1] 0.4189640 0.6903431
```
Since the upper bound of our prediction (\(\rho=0.5\)) is inside the confidence interval \([0.420, 0.690]\), we do not reject \(H_0\).
Our conclusion: this finding is compatible with the theory.
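The overlap check from this example can be scripted directly (a sketch reusing the same numbers):

```r
library(psychometric)  # provides CIr()
ci <- CIr(r = 0.57, n = 98, level = 0.95)

# Theory-implied range is [0.2, 0.5]; any overlap between that range
# and the CI means the finding is compatible with the theory.
ci[1] <= 0.5 & ci[2] >= 0.2
```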
For SPSS users, use the free online calculator at http://vassarstats.net/rho.html to calculate the CI.
Vassarstats online calculator
R users can use the cohen.d() function in the effsize library to calculate confidence intervals around Cohen’s d effect sizes.
Example
According to theory, the population group mean difference between the treatment group and control group should be at least \(\delta=0.7\).
We collect some data (\(n=160\)) from a randomized between-groups experiment. The function cohen.d() will calculate the estimated \(d\) effect size and the 95% confidence interval.
```r
library(effsize)  # provides cohen.d()
cohen.d(Y ~ grp, data = data)
## 
## Cohen's d
## 
## d estimate: 0.3060114 (small)
## 95 percent confidence interval:
##        inf        sup 
## 0.02406885 0.58795387
```
The hypothesized lower-bound value of \(\delta=0.7\) lies above the upper limit of the estimated confidence interval.
Thus we have evidence against the theory.
Daniel Lakens has compiled some great resources on how to calculate confidence intervals around effect sizes.
http://daniellakens.blogspot.com/2014/06/calculating-confidence-intervals-for.html
Exploratory Software for Confidence Intervals (ESCI; Cumming, 2016) runs under Excel and calculates confidence intervals around many effect sizes.
Open Science is necessary but insufficient
Falsificationism is not optional
“Nil” hypothesis testing is not falsificationist
Ideally \(H_0\) should be a point prediction
Boundary and interval predictions
Confidence intervals for inference
Science is not self-correcting. We have to correct it.