Kirby Arinder
8/23/2023
We do lots of stuff at PEER.
Today is about one subset of that stuff: Causal inference.
There will be three sections of today's talk:
Let's break it down! It's an inference, and it's a causal one.
(And that's why they pay me the big bucks, right there.)
Inferences are most of what we do at PEER, to my way of thinking: deriving unknown facts from known facts.
Even when simply fact-finding, we are tasked with a kind of inference:
That's tough. But in general, it's something like:
Maybe one of the hardest problems facing humanity!
But we approach it sideways, like humans do, and we get by pretty well.
It's not the same as:
In our context, causal inference typically happens when either:
Some actions are performative
Some actions have small noise and very high expected success rates
Generally, we're in the world of “social science” here!
Probably the most common case of this for PEER is comparing outcomes.
You've seen this:
Intuitive (but insufficient) evaluation:
“Our program definitely works. Just look at Timmy! He went through it, and turned his whole life around!”
This is not, by itself, evidence of a program's effectiveness.
This is proverbially true. But why?
With enough participants and normal conditions, you're effectively guaranteed to have some decent results even in a pretty bad program.
This is because of noise: The inevitable random(-ish) variation in the outcomes of causal processes!
The noise is that bell curve we just saw!
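(A toy illustration I probably won't do live: the Python sketch below simulates a program with zero true effect. The outcome scale, sample size, and spread are all made up. Even so, the best few participants look like success stories, which is exactly why "just look at Timmy" isn't evidence.)

```python
# Toy simulation: a "program" with zero true effect still produces
# impressive-looking individual outcomes, purely from noise.
# All numbers (scale, sample size, spread) are made up.
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0                     # the program does nothing, by construction
n_participants = 500
baseline_mean, noise_sd = 50, 15    # arbitrary outcome scale

outcomes = baseline_mean + true_effect + rng.normal(0, noise_sd, n_participants)

print(f"Mean outcome:                {outcomes.mean():.1f}")
print(f"Best participant ('Timmy'):  {outcomes.max():.1f}")
print(f"Share scoring above 75:      {(outcomes > 75).mean():.0%}")
```

By construction the program did nothing, yet "Timmy" still scores far above average. That's noise, not effectiveness.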
Imagine two programs doing the same thing.
We normally believe that the program averaging 80 units is better.
A mean is a mean.
As long as you don't – heh – mean anything more by it, compare all you want!
It's only when you want to attribute causal distinction that this all becomes important.
Statistical context simpliciter isn't enough!
Hoo boy is this a long digression, which I may not do live, but:
Sampling methods vs. permutation methods
The upshot: The traditional t-test and its siblings aren't designed for this.
Most of the intro-stats methods aren't!
“The odds of a random sample containing values as extreme as the ones you saw” is meaningless outside of the context of a random sample.
Numeric comparisons – X is bigger than Y – require statistical context if they are to undergird causal inferences.
Specifically, they require context that appropriately measures relevant noise.
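(More of that digression, also optional live: here's a toy permutation test in Python. The outcome data and group sizes are made up purely for illustration. The noise measure comes from reshuffling the group labels on the observed data itself, rather than from the random-sampling story behind the traditional t-test.)

```python
# Toy permutation test: is the difference between two programs' mean outcomes
# bigger than the noise we'd see if the group labels didn't matter?
# Outcome data and group sizes are hypothetical.
import numpy as np

rng = np.random.default_rng(1)

program_a = rng.normal(80, 15, 40)   # hypothetical outcomes under program A
program_b = rng.normal(72, 15, 40)   # hypothetical outcomes under program B

observed_diff = program_a.mean() - program_b.mean()

pooled = np.concatenate([program_a, program_b])
n_a = len(program_a)

# Reshuffle the labels many times: how often does a label-blind split of the
# same data produce a difference at least as large as the one we observed?
perm_diffs = []
for _ in range(10_000):
    shuffled = rng.permutation(pooled)
    perm_diffs.append(shuffled[:n_a].mean() - shuffled[n_a:].mean())
perm_diffs = np.array(perm_diffs)

p_value = np.mean(np.abs(perm_diffs) >= abs(observed_diff))
print(f"Observed difference: {observed_diff:.1f}")
print(f"Permutation p-value: {p_value:.3f}")
```

If the observed difference is large relative to the label-shuffled differences, the two processes look distinct; if not, noise alone could explain what we saw.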
So that's how to say whether two causal processes are causally similar or distinct.
This can be valuable, even if we don't want to hypothesize about why they are similar or distinct!
But sometimes we do.
Or, better yet, a sizeable set of large, well-powered RCTs, preregistered and conducted at multiple sites!
But why randomized, controlled trials?
The short answer: RCTs are our best method of establishing that A causes B.
Imagine you’re a researcher for a shoe company; you’re testing a running shoe that is supposed to shave time off a sprint.
So you set up a test: Runners in your shoes versus runners in some different shoe.
After statistical analysis, you find that the group in your shoe crossed the finish line significantly before the other group.
But wait: You had your group running 100m, while the comparison group ran 200m!
This comparison wasn’t fair; even if the results are good, we can't say they were because of the shoe.
This is the essence of controlling for confounding variables: basic fairness in comparisons.
(Statistical) control = making sure everybody has the same starting line before comparing them.
There are several ways to control for confounding variables. For instance:
(Obviously this last is just for the sake of the example, and would not be appropriate in a real setting)
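(A toy sketch of one such method, regression adjustment, with entirely made-up data; it isn't necessarily one of the methods listed on the slide. The idea: put the confounder, here a hypothetical "fitness" score, into the model alongside the shoe, so the comparison is made from the same starting line.)

```python
# Toy sketch of statistical control via regression adjustment.
# "fitness" is a hypothetical confounder: fitter runners are more likely to be
# given the new shoe AND run faster anyway. All numbers are made up.
import numpy as np

rng = np.random.default_rng(2)
n = 500

fitness = rng.normal(0, 1, n)
# Unfair assignment: the fitter you are, the likelier you get the new shoe.
new_shoe = (rng.uniform(size=n) < 1 / (1 + np.exp(-2 * fitness))).astype(float)
# True shoe effect is -0.3 seconds; fitness speeds runners up on its own.
sprint_time = 14.0 - 0.3 * new_shoe - 1.0 * fitness + rng.normal(0, 0.5, n)

# Naive comparison of group means overstates the shoe's benefit.
naive = sprint_time[new_shoe == 1].mean() - sprint_time[new_shoe == 0].mean()
print(f"Naive difference:     {naive:.2f} s")

# Regression adjustment: include the confounder in the model.
X = np.column_stack([np.ones(n), new_shoe, fitness])
coef, *_ = np.linalg.lstsq(X, sprint_time, rcond=None)
print(f"Adjusted shoe effect: {coef[1]:.2f} s")   # recovers roughly -0.3
```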
These methods of control can be very sophisticated. But there's a problem:
“… the golden rule of causal analysis: No causal claim can be established by a purely statistical method, be it propensity scores, regression, stratification, or any other distribution-based design.”
-Judea Pearl, “Causality,” p. 350
Well-conducted random assignment guarantees that all possible confounding variables are distributed among conditions only by chance – which is to say, there’s no systematic correlation between any trait and group membership.
Which means the groups, overall, start and finish on the same lines….
Which lets us assume that if they finish at different times, it's because of the program.
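(One last toy sketch, again with made-up numbers: under coin-flip assignment, a trait like that hypothetical "fitness" score ends up with nearly the same average in both groups, so it can't be what explains a difference in how they finish.)

```python
# Toy sketch of what randomization buys us: under coin-flip assignment,
# any trait (here, the same made-up "fitness" score) ends up with nearly the
# same distribution in both groups, so it can't explain an outcome difference.
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

fitness = rng.normal(0, 1, n)        # a trait that affects the outcome
assignment = rng.integers(0, 2, n)   # randomized: coin flip per participant

print("Mean fitness, treatment group:", round(fitness[assignment == 1].mean(), 3))
print("Mean fitness, control group:  ", round(fitness[assignment == 0].mean(), 3))
# Both means sit near zero; the only systematic difference between the groups
# is the treatment itself, which is what licenses the causal attribution.
```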
Randomized controlled trials are:
As compared to nonrandomized evaluation.
RCTs frequently conflict with less rigorous preliminary trials:
Gold is rare. What if we don't have any and still need to act?
Research quality is idiosyncratic and multidimensional. But here's a helpful tool:
The Maryland Scientific Methods Scale.
Described by Farrington et al. (2002) in Evidence-Based Crime Prevention.
It's a five-point ordinal scale – 1 is the worst, 5 is the best!
It rates our general ability to draw conclusions from the study.
It's definitely not safe to make inferences from any trial below level 3….
But beware even at that level!
(Of course, hypothesis-generation is fair game.)
Everything said so far assumes that the research is well-conducted.
There is a crisis of reproducibility in science, especially social science!
Some have gone so far as to suggest that most published research is false.
This problem affects random and nonrandom studies alike.
Attributing effects to specific causes requires a lot.
We probably won't be asked to do it at PEER.
But we're asked to evaluate others' attempts all the time!
Remembering these guidelines should help.
Coalition for Evidence-Based Policy (2013). Randomized Controlled Trials Commissioned by the Institute of Education Sciences Since 2002: How Many Found Positive Versus Weak or No Effects. Retrieved from http://coalition4evidence.org/wp-content/uploads/2013/06/IES-Commissioned-RCTs-positive-vs-weak-or-null-findings-7-2013.pdf
Farrington, D.P., Gottfredson, D.C., Sherman, L.W., & Welsh, B.C. (2002). The Maryland Scientific Methods Scale. In Farrington, D.P., MacKenzie, D.L., Sherman, L.W., & Welsh, B.C. (Eds.), Evidence-Based Crime Prevention (pp. 13-21). London: Routledge.
Ioannidis, J.P.A. (2005). Contradicted and Initially Stronger Effects in Highly Cited Clinical Research. Journal of the American Medical Association, 294(2), 218-228.
Manzi, J. (2012). Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics, and Society. New York: Perseus Books Group.
Pearl, J. (2009). Causality (2nd ed.). Cambridge: Cambridge University Press.
Zia, M. I., Siu, L. L., Pond, G. R., & Chen, E. X. (2005). Comparison of Outcomes of Phase II Studies and Subsequent Randomized Control Studies Using Identical Chemotherapeutic Regimens. Journal of Clinical Oncology, 23(28), 6982-6991.