Causal inference

Kirby Arinder
8/23/2023

Overview

We do lots of stuff at PEER.

Today is about one subset of that stuff: Causal inference.

Overview

There will be three sections of today's talk:

  • What is causal inference?
  • One common issue: Comparing outcomes of causal processes
  • Another common issue: Attributing effects to causes

I. What is causal inference?

Let's break it down! It's

  • Inference about
  • Causation.

(And that's why they pay me the big bucks, right there.)

Okay, inferences are pretty clear.

They're most of what we do at PEER, to my way of thinking. Deriving unknown facts from known facts.

Even when simply fact-finding, we are tasked with a kind of inference:

  • Putative facts A-G are from source S, which is unreliable;
  • Putative facts H-J are from source T, which is reliable;
  • Thus, we should (provisionally) accept H-J and reject A-G.

But what is a cause, anyway?

That's tough. But in general, it's something like:

  • B causes A if and only if
  • A follows B, necessarily, or
  • A follows B, in counterfactual-supporting ways.

Knowing that this is the case is very, very hard.

Maybe one of the hardest problems facing humanity!

But we approach it sideways, like humans do, and we get by pretty well.
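
One way to make the counterfactual idea concrete is the potential-outcomes framing: imagine, for each person, the outcome with the cause present and the outcome with it absent, and call the difference the causal effect. A minimal Python sketch, with every number invented for illustration (in reality we only ever observe one of the two outcomes, which is exactly why this is hard):

  # Toy potential-outcomes illustration; every number here is made up.
  # For each person we imagine two outcomes: one if "treated" (B happens),
  # one if not. The individual causal effect is the difference.
  people = [
      {"name": "A", "outcome_if_treated": 7, "outcome_if_untreated": 5},
      {"name": "B", "outcome_if_treated": 6, "outcome_if_untreated": 6},
      {"name": "C", "outcome_if_treated": 9, "outcome_if_untreated": 4},
  ]

  for p in people:
      print(p["name"], "individual effect:",
            p["outcome_if_treated"] - p["outcome_if_untreated"])

  # Average effect over this tiny, imaginary population:
  ate = sum(p["outcome_if_treated"] - p["outcome_if_untreated"]
            for p in people) / len(people)
  print("average effect:", ate)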

Examples and nonexamples

It's not the same as:

  • A is in legal compliance with B
  • A is financially feasible given (or in accordance with the norms of) B
  • A is to be expected under the theory of B

Examples and nonexamples

In our context, causal inference typically happens when either:

  • The government takes an action, and we want to know the likely results, or
  • We observe an outcome, and we want to know whether a government action can be given credit.

Digression: Performatives and certainties

Some actions are performative

  • E.g., licensure

Some actions have small noise and very high expected success rates

  • E.g., new buildings (for some purposes)

Generally, we're in the world of “social science” here!

II. Comparing outcomes of causal processes

Probably the most common case of this for PEER is comparing outcomes.

You've seen this:

  • “Our program pilot did better than the comparison.” (synchronic comparison)
  • “Recidivism rates have come down over time.” (diachronic comparison)

The usual arguments

Intuitive (but insufficient) evaluation:

  • Anecdote
  • Simple numerical comparison

Evaluation by anecdote

“Our program definitely works. Just look at Timmy! He went through it, and turned his whole life around!”

This is not, by itself, evidence of a program's effectiveness.

This is proverbially true. But why?

Here are Timmy and a couple of his friends.

But here's the rest of Timmy's program.

Causal processes and noise

With enough participants and normal conditions, you're effectively guaranteed to have some decent results even in a pretty bad program.

This is because of noise: The inevitable random(-ish) variation in the outcomes of causal processes!

The noise is that bell curve we just saw!
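
Here's a tiny simulation of that point, assuming (purely for illustration) a program with a small true average effect and normally distributed noise around it:

  import random

  random.seed(0)

  # Assumed numbers, for illustration only: a weak program (average effect
  # of 10 units) with lots of noise (standard deviation of 15).
  true_average_effect = 10
  noise_sd = 15
  participants = 500

  outcomes = [random.gauss(true_average_effect, noise_sd)
              for _ in range(participants)]

  # Even under this weak program, a handful of "Timmys" do very well:
  standouts = [x for x in outcomes if x > 40]
  print(len(standouts), "of", participants, "participants gained more than 40 units")
  print("mean outcome:", round(sum(outcomes) / len(outcomes), 1))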

What if we get more numbers?

Imagine two programs doing the same thing.

  • One of them achieves 40 units of effect on average.
  • The other achieves 80 units of effect on average.

We normally believe that the 80-unit program is better.

Here's what we probably imagine...

But here's what could be happening!
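
A quick way to picture "what could be happening" is to simulate the two programs with the stated means but wide, overlapping noise. The spread and group sizes are assumptions, chosen just to make the point:

  import random

  random.seed(1)

  # The means (40 and 80) come from the example above; the spread (sd = 60)
  # and the group size (30) are assumptions for illustration.
  program_a = [random.gauss(40, 60) for _ in range(30)]
  program_b = [random.gauss(80, 60) for _ in range(30)]

  print("mean A:", round(sum(program_a) / len(program_a), 1))
  print("mean B:", round(sum(program_b) / len(program_b), 1))

  # With this much noise, plenty of individual participants in the
  # "40-unit" program out-perform participants in the "80-unit" program:
  wins = sum(1 for a in program_a for b in program_b if a > b)
  print("A beats B in", wins, "of", len(program_a) * len(program_b), "pairwise comparisons")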

Another digression

A mean is a mean.

As long as you don't – heh – mean anything more by it, compare all you want!

It's only when you want to attribute a causal distinction that this all becomes important.

Digression, continued

Statistical context simpliciter isn't enough!

Hoo boy is this a long digression, which I may not do live, but:

Sampling methods vs. permutation methods

Digression, continued more

The upshot: The traditional t-test and its siblings aren't designed for this.

Most of the intro-stats methods aren't!

“The odds of a random sample containing values as extreme as the ones you saw” is meaningless outside of the context of a random sample.

So the takeaway:

Numeric comparisons – X is bigger than Y – require statistical context if they are to undergird causal inferences.

Specifically, they require context that appropriately measures relevant noise.
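
One way to supply that context – the permutation idea from the digression above – is to shuffle the group labels many times and ask how often a gap as large as the observed one shows up from relabeling alone. A minimal sketch, with made-up outcome data:

  import random

  random.seed(2)

  # Made-up outcome data for two programs, for illustration only.
  group_a = [52, 61, 47, 70, 58, 66, 49, 73]
  group_b = [68, 75, 62, 81, 77, 70, 85, 79]

  observed = sum(group_b) / len(group_b) - sum(group_a) / len(group_a)

  pooled = group_a + group_b
  n_a = len(group_a)
  extreme = 0
  n_shuffles = 10000

  for _ in range(n_shuffles):
      random.shuffle(pooled)
      new_a, new_b = pooled[:n_a], pooled[n_a:]
      diff = sum(new_b) / len(new_b) - sum(new_a) / len(new_a)
      if abs(diff) >= abs(observed):
          extreme += 1

  # How often does relabeling alone produce a gap this big?
  print("observed difference:", round(observed, 1))
  print("permutation p-value:", extreme / n_shuffles)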

III. Attributing effects to (specific) causes

So that's how to say that two causal processes are causally similar or distinct.

This can be valuable, even if we don't want to hypothesize about why they are similar or distinct!

But sometimes we do.

Enter the randomized controlled trial

Or, better yet, a sizeable set of large, well-powered RCTs, preregistered and conducted at multiple sites!

But why randomized, controlled trials?

RCTs and causation

The short answer: RCTs are our best method of establishing that A causes B.

Imagine you're a researcher for a shoe company; you're testing a running shoe that's supposed to shave time off a runner's sprint.

So you set up a test: Runners in your shoes versus runners in some different shoe.

Shoe trials

After statistical analysis, we find the group with your shoe crossed the finish line significantly before the other group.

But wait: You had your group running 100m, while the comparison group ran 200m!

This comparison wasn’t fair; even if the results are good, we can't say they were because of the shoe.

Statistical control and fairness

This is the essence of controlling for confounding variables: basic fairness in comparisons.

(Statistical) control = making sure everybody has the same starting line before comparing them.

Control for confounders continued

There are several ways to control for confounding variables. For instance:

  • Simple physical setup of the trial
    • Don’t use different length tracks
  • Various mathematical methods
    • Multiply short-track group time by two

(Obviously this last is just for the sake of the example, and would not be appropriate in a real setting)
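
As a sketch of what a "mathematical method" of control can look like, here is a toy regression adjustment: include the confounder (track length) as a variable, so the shoe comparison is made holding track length fixed. All of the numbers and effect sizes are invented for the example:

  import numpy as np

  rng = np.random.default_rng(3)
  n = 200

  # Invented data: finish time depends mostly on track length (the
  # confounder) and a little on the shoe. Runners on the short track are
  # much more likely to be wearing the new shoe, as in the unfair trial.
  track_length = rng.choice([100.0, 200.0], size=n)           # metres
  p_new_shoe = np.where(track_length == 100.0, 0.8, 0.2)
  new_shoe = (rng.random(n) < p_new_shoe).astype(float)       # 1 = new shoe
  time = 0.11 * track_length - 0.4 * new_shoe + rng.normal(0, 0.5, size=n)

  # Naive comparison, ignoring track length:
  naive = time[new_shoe == 1].mean() - time[new_shoe == 0].mean()

  # Adjusted comparison: regress time on the shoe AND the track length.
  X = np.column_stack([np.ones(n), new_shoe, track_length])
  coef, *_ = np.linalg.lstsq(X, time, rcond=None)

  print(f"naive shoe effect:    {naive:.2f} s")
  print(f"adjusted shoe effect: {coef[1]:.2f} s (holding track length fixed)")

The naive gap looks enormous because it mostly reflects the track, not the shoe; the adjusted estimate lands near the (invented) true shoe effect.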

Control for confounders continued

These methods of control can be very sophisticated. But there's a problem:

  • You have to know that a confounding variable exists in order to control for it.
  • And it's impossible to know ahead of time what all the confounding variables are.

A relevant quote

“… the golden rule of causal analysis: No causal claim can be established by a purely statistical method, be it propensity scores, regression, stratification, or any other distribution-based design.”

– Judea Pearl, Causality (2009), p. 350

RCTs and causation

Well-conducted random assignment ensures that, in expectation, every possible confounding variable – known or unknown – is distributed evenly across conditions, which is to say there's no systematic correlation between any trait and group membership.

Which means the groups, overall, start from the same line….

Which lets us assume that if they finish at different times, it's because of the program.
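
A tiny simulation of why that works, with assumed numbers: give each participant a hidden trait we never thought to measure, assign groups by coin flip, and see that the trait comes out roughly balanced anyway.

  import random

  random.seed(4)

  # 1,000 hypothetical participants, each with a hidden trait (say, prior
  # motivation) that we never measured and never controlled for.
  participants = [{"hidden_trait": random.random()} for _ in range(1000)]

  # Random assignment: each participant flips a fair coin.
  for p in participants:
      p["group"] = random.choice(["program", "comparison"])

  def average_trait(group_name):
      group = [p for p in participants if p["group"] == group_name]
      return sum(p["hidden_trait"] for p in group) / len(group)

  # The two averages come out close even though we never controlled for
  # the trait; that balance is what licenses the causal comparison.
  print("program group:   ", round(average_trait("program"), 3))
  print("comparison group:", round(average_trait("comparison"), 3))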

Back to the question: Why RCTs?

Randomized controlled trials are:

  • Epistemically preferable
    • they enable causal inferences
  • Practically preferable
    • the math for control and testing is far simpler
  • Legally preferable

As compared to nonrandomized evaluation.

And the difference isn't small.

RCTs frequently conflict with less rigorous preliminary trials:

  • In medicine: 50-80% of positive results in initial clinical trials are overturned by subsequent RCTs (Ioannidis (2005), Zia et al. (2005))
  • In business: 80-90% of new products and strategies tested under RCTs by Google and Microsoft have found no significant effects (Manzi (2012))
  • In education: 91% of rigorous RCTs commissioned by the Institute of Education Sciences showed weak or no positive effects (CEBP (2013))

But sometimes, the perfect is the enemy of the good.

Gold is rare. What if we don't have any and still need to act?

Research quality is idiosyncratic and multidimensional. But here's a helpful tool:

The Maryland Scientific Methods scale.

The MSM scale?

Described by Farrington et al. (2002) in Evidence-based Crime Prevention.

It's a five-point ordinal scale – 1 is the worst, 5 is the best!

It rates our general ability to draw conclusions from the study.

  • Or said another way: it rates what threats to our desired conclusions are ruled out.

The MSM scale (and threats at each level)

  1. Simple descriptive association
    • threats: causal direction, confounders
  2. Pre-post testing
    • threats: confounders
  3. Control group
    • threats: nonequivalence of groups
  4. Control group plus high-quality statistical controls
    • threats: inadequate control
  5. Randomized control group
    • threats: inappropriate implementation and analysis

The MSM scale

It's definitely not safe to make inferences from any trial below level 3….

But beware even at that level!

(Of course, hypothesis-generation is fair game.)

But it takes more than just this.

Everything said so far assumes that the research is well-conducted.

There is a crisis of reproducibility in science, especially social science!

Some have gone so far as to suggest that most published research is false.

This problem affects random and nonrandom studies alike.

How does the problem happen?

There are several ways.

And this isn't an exhaustive list! Practices like these can make even crazy results seem scientifically justified.
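
One such practice, sketched here as an assumed illustration, is running many comparisons and reporting only the one that clears the significance threshold. A minimal sketch of how pure noise then looks like a finding:

  import random
  import statistics

  random.seed(5)

  n_per_group = 50
  n_outcomes = 20      # twenty different outcome measures, none truly affected
  significant = 0

  for _ in range(n_outcomes):
      # Both groups are drawn from the SAME distribution: no real effect.
      group_a = [random.gauss(0, 1) for _ in range(n_per_group)]
      group_b = [random.gauss(0, 1) for _ in range(n_per_group)]
      diff = statistics.mean(group_b) - statistics.mean(group_a)
      se = (statistics.variance(group_a) / n_per_group
            + statistics.variance(group_b) / n_per_group) ** 0.5
      if abs(diff / se) > 1.96:          # the usual "p < .05" threshold
          significant += 1

  # On average, about one in twenty such comparisons clears the threshold
  # by chance alone; report only those, and noise looks like a discovery.
  print(significant, "'significant' findings out of", n_outcomes, "with no real effect")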

So what's our takeaway?

Attributing effects to specific causes requires a lot.

We probably won't be asked to do it at PEER.

But we're asked to evaluate others' attempts all the time!

Remembering these guidelines should help.

References without direct links

Coalition for Evidence-Based Policy (2013). Randomized Controlled Trials Commissioned by the Institute of Education Sciences Since 2002: How Many Found Positive Versus Weak or No Effects. Retrieved from http://coalition4evidence.org/wp-content/uploads/2013/06/IES-Commissioned-RCTs-positive-vs-weak-or-null-findings-7-2013.pdf

Farrington, D.P., Gottfredson, D.C., Sherman, L.W., & Welsh, B.C. (2002). The Maryland Scientific Methods Scale. In Farrington, D.P., MacKenzie, D.L., Sherman, L.W., & Welsh, B.C. (Eds.), Evidence-Based Crime Prevention (pp. 13-21). London: Routledge.

Ioannidis, J.P.A. (2005). Contradicted and Initially Stronger Effects in Highly Cited Clinical Research. Journal of the American Medical Association, 294(2), 218-228.

Manzi, J. (2012). Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics, and Society. New York: Perseus Books Group.

Pearl, J. (2009). Causality (2nd ed.). Cambridge: Cambridge University Press.

Zia, M. I., Siu, L. L., Pond, G. R., & Chen, E. X. (2005). Comparison of Outcomes of Phase II Studies and Subsequent Randomized Control Studies Using Identical Chemotherapeutic Regimens. Journal of Clinical Oncology, 23(28), 6982-6991.

This presentation

Questions?