Causal Inference

Ozan Aksoy

January 20, 2026

Potential outcomes, counterfactuals and A Christmas Carol

The potential outcome framework

Unit Control Treatm. Effect
\(i\) \(Y_i^C\) \(Y_i^T\) \(\delta_i\)
1 8 9 1
2 5 3 -2
3 6 4 -2
4 6 2 -4
5 15 18 3
6 13 16 3
7 8 9 1
8 2 0 -2
9 4 3 -1
10 2 0 -2
Mean 6.9 6.4 -0.5

“The science table”

\(D = T\): Treatment

\(D = C\): Control

\(Y = Y^T\) if \(D = T\)

\(Y = Y^C\) if \(D = C\)

Average causal effect:

\(\bar{\delta} = 6.4 - 6.9 = -0.5\)

Potential vs. observed outcomes

Unit Status PO Cont PO Treat Observed
\(i\) \(D\) \(Y_i^C\) \(Y_i^T\) \(Y_{obs}\)
1 T ? 9 9
2 C 5 ? 5
3 C 6 ? 6
4 C 6 ? 6
5 T ? 18 18
6 T ? 16 16
7 T ? 9 9
8 C 2 ? 2
9 C 4 ? 4
10 C 2 ? 2
Mean 4.2 13

Observed “effect”:

\(13-4.2=8.8^{**}\)!!

Wrong conclusions:

New treatment adds 9 years of life

If everyone were treated, average life would be 13

Random sample doesn’t help.

Fundamental problem of causal inference

  • For subject \(i\), the causal effect of the treatment is the difference between two outcomes:

  • \(\delta_{i} = Y_i^T-Y_i^C\) (\(Y_i^T\): \(i\)’s PO in treatment, \(Y_i^C\): \(i\)’s PO in control)

  • But only one of the two potential outcomes is realised/observed

  • (Unless Christmas spirits help…)

Group \(D\) \(Y_i^T\) \(Y_i^C\)
Treatment T Observable Counterfactual
Control C Counterfactual Observable

“Naive” estimator for the treatment effect

\(\hat{\delta}_{\text{naive}} = \text{Avrg}(Y_i^{obs} \mid D = T) - \text{Avrg}(Y_i^{obs} \mid D = C)\)

(Observed difference between treatment and control)

Unit Status PO Cont PO Treat Observed
\(i\) \(D\) \(Y_i^C\) \(Y_i^T\) \(Y_{obs}\)
1 T ? 9 9
2 C 5 ? 5
3 C 6 ? 6
4 C 6 ? 6
5 T ? 18 18
6 T ? 16 16
7 T ? 9 9
8 C 2 ? 2
9 C 4 ? 4
10 C 2 ? 2
Mean 4.2 13
  • Naive estimator for the example:

\(\hat{\delta}_{naive} = \frac{9+18+16+9}{4} - \frac{5+6+6+2+4+2}{6} = 8.8\)

  • What is wrong?
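The naive estimate can be reproduced in a few lines. This is a minimal sketch that applies the switching rule (\(Y = Y^T\) if \(D = T\), \(Y = Y^C\) if \(D = C\)) to the science table from the slides; the variable and function names are illustrative choices, not part of the original material.

```python
# Science table from the slides: (unit, D, Y^C, Y^T).
# In real data only one potential outcome per unit is ever observed.
science = [
    (1, "T", 8, 9), (2, "C", 5, 3), (3, "C", 6, 4), (4, "C", 6, 2),
    (5, "T", 15, 18), (6, "T", 13, 16), (7, "T", 8, 9), (8, "C", 2, 0),
    (9, "C", 4, 3), (10, "C", 2, 0),
]


def observed_outcome(d, y_c, y_t):
    """Switching rule: Y = Y^T if D = T, Y = Y^C if D = C."""
    return y_t if d == "T" else y_c


def mean(values):
    return sum(values) / len(values)


# Keep only what an analyst would actually see: D and the observed outcome.
y_obs = [(d, observed_outcome(d, y_c, y_t)) for _, d, y_c, y_t in science]
treated = [y for d, y in y_obs if d == "T"]
control = [y for d, y in y_obs if d == "C"]

naive = mean(treated) - mean(control)
print(f"Avrg(Y_obs | D=T) = {mean(treated):.1f}")  # 13.0
print(f"Avrg(Y_obs | D=C) = {mean(control):.1f}")  # 4.2
print(f"naive estimate    = {naive:.1f}")          # 8.8
```

The code never touches the counterfactual column for any unit, which is exactly why the result can be far from the true average effect.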

ATE, ATT, ATC

  • Individual Causal / Treatment Effect:
    \(\delta_i = Y_i^T - Y_i^C\)

  • Average Treatment Effect (ATE) for the entire population:

    \[ \text{ATE} = \text{Average}(\delta) = E\!\left[Y_i^T - Y_i^C\right] = \color{red}{E[Y_i^T] - E[Y_i^C]} \]

  • Average Treatment Effect for the Treated (ATT):

    \[ \text{ATT} = \text{Average}(\delta \mid D = T) = E(Y_i^T - Y_i^C \mid D = T) = \color{red}{E(Y_i^T \mid D = T) - E(Y_i^C \mid D = T)} \]

  • Average Treatment Effect for the Controls (ATC) or sometimes ATUT:

    \[ \text{ATC} = \text{Average}(\delta \mid D = C) = E(Y_i^T - Y_i^C \mid D = C) = \color{red}{E(Y_i^T \mid D = C) - E(Y_i^C \mid D = C)} \]

ATE, ATT, ATC: numerical example

Potential outcomes
Unit Status Control Treatm. Effect
\(i\) \(D\) \(Y_i^C\) \(Y_i^T\) \(\delta_i\)
1 T 8 9 1
2 C 5 3 -2
3 C 6 4 -2
4 C 6 2 -4
5 T 15 18 3
6 T 13 16 3
7 T 8 9 1
8 C 2 0 -2
9 C 4 3 -1
10 C 2 0 -2
  • ATE = ?
  • ATT = ?
  • ATC = ?

ATE, ATT, ATC: numerical example

Unit Status Control Treatm. Effect
\(i\) \(D\) \(Y_i^C\) \(Y_i^T\) \(\delta_i\)
1 T 8 9 1
2 C 5 3 -2
3 C 6 4 -2
4 C 6 2 -4
5 T 15 18 3
6 T 13 16 3
7 T 8 9 1
8 C 2 0 -2
9 C 4 3 -1
10 C 2 0 -2
Mean 6.9 6.4 -0.5
  • ATE \(= E[Y_i^T] - E[Y_i^C] = 6.4 - 6.9 = -0.5\)

  • ATT \(= \text{Avrg}(\delta \mid D = T) = \frac{1 + 3 + 3 + 1}{4} = 2\)

  • ATC \(= \text{Avrg}(\delta \mid D = C) = \frac{-2 - 2 - 4 - 2 - 1 - 2}{6} = -2.17\)

  • Beware: these are not readily obtainable in observational data
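As a check on the arithmetic, the three estimands can be computed directly from the full science table. This sketch is only possible because the example reveals both potential outcomes for every unit; the names used are illustrative.

```python
# Science table from the slides: (unit, D, Y^C, Y^T).
science = [
    (1, "T", 8, 9), (2, "C", 5, 3), (3, "C", 6, 4), (4, "C", 6, 2),
    (5, "T", 15, 18), (6, "T", 13, 16), (7, "T", 8, 9), (8, "C", 2, 0),
    (9, "C", 4, 3), (10, "C", 2, 0),
]


def mean(values):
    return sum(values) / len(values)


# Individual effects delta_i = Y_i^T - Y_i^C, grouped by treatment status.
effects = {"T": [], "C": []}
for _, d, y_c, y_t in science:
    effects[d].append(y_t - y_c)

ate = mean(effects["T"] + effects["C"])  # average over everyone
att = mean(effects["T"])                 # average over the treated
atc = mean(effects["C"])                 # average over the controls

print(f"ATE = {ate:.2f}")  # -0.50
print(f"ATT = {att:.2f}")  # 2.00
print(f"ATC = {atc:.2f}")  # -2.17
```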

Naive estimator versus the estimand

Estimand: Average Treatment Effect (ATE)

\[ \begin{align} \widehat{\delta}_{\text{naive}} &= \underbrace{\text{Avrg}(\delta)}_{\text{ATE}} \\ &\quad + \underbrace{\text{Avrg}(Y_i^C \mid D = T) - \text{Avrg}(Y_i^C \mid D = C)}_{\text{Selection (baseline) bias}} \\ &\quad + (1 - \pi) \times \underbrace{\left[ \text{Avrg}(\delta \mid D = T) - \text{Avrg}(\delta \mid D = C) \right]}_{\text{ATT - ATC (differential treatment effect) bias} } \end{align} \]

(\(\pi =\) proportion of the sample in the treatment group. The more people are treated, the smaller the differential treatment effect bias, because the 'naive' estimate is then already based more heavily on those who are treated.)
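A short derivation of this decomposition, written with the same symbols and assuming all averages are taken over the same sample. First add and subtract \(\text{Avrg}(Y_i^C \mid D = T)\):

\[ \begin{align} \widehat{\delta}_{\text{naive}} &= \text{Avrg}(Y_i^T \mid D = T) - \text{Avrg}(Y_i^C \mid D = C) \\ &= \underbrace{\text{Avrg}(Y_i^T \mid D = T) - \text{Avrg}(Y_i^C \mid D = T)}_{\text{ATT}} + \underbrace{\text{Avrg}(Y_i^C \mid D = T) - \text{Avrg}(Y_i^C \mid D = C)}_{\text{Selection (baseline) bias}} \end{align} \]

Since \(\text{ATE} = \pi \, \text{ATT} + (1 - \pi) \, \text{ATC}\), it follows that \(\text{ATT} = \text{ATE} + (1 - \pi)(\text{ATT} - \text{ATC})\). Substituting this for ATT in the second line gives the three-term expression.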

Naive estimate vs. ATE: Selection Baseline Bias

Potential outcomes
Unit Status Control Treatm. Effect
\(i\) \(D\) \(Y_i^C\) \(Y_i^T\) \(\delta_i\)
1 T 8 9 1
2 C 5 3 -2
3 C 6 4 -2
4 C 6 2 -4
5 T 15 18 3
6 T 13 16 3
7 T 8 9 1
8 C 2 0 -2
9 C 4 3 -1
10 C 2 0 -2
Average 6.9 6.4 -0.5
  • Selection Baseline Bias: even in the absence of treatment, those in the treatment group are different from those in the control group
  • Even without the drug, those in the treatment group would have lived longer
  • \(\frac{44}{4} - \frac{25}{6} = 11 - 4.2 = 6.8\)

Naive estim. vs. ATE: Differential Treatm. Effect

Potential outcomes
Unit Status Control Treatm. Effect
\(i\) \(D\) \(Y_i^C\) \(Y_i^T\) \(\delta_i\)
1 T 8 9 1
2 C 5 3 -2
3 C 6 4 -2
4 C 6 2 -4
5 T 15 18 3
6 T 13 16 3
7 T 8 9 1
8 C 2 0 -2
9 C 4 3 -1
10 C 2 0 -2
Average 6.9 6.4 -0.5
  • Differential Treatment Effect: difference in the treatment effect between those in the treatment and those in the control group
  • The drug has a different effect on the treatment and the control group
  • \(0.6 \times (\text{ATT} - \text{ATC}) = 0.6 \times (2 + 2.17) = 2.5\)

Naive estimator vs. ATE: numerical example

\[ \begin{align} 8.8 &= -0.5 && (\text{ATE}) \\ &\quad + \frac{44}{4} - \frac{25}{6} && (= 6.8:\ \text{selection (baseline) bias}) \\ &\quad + 0.6 \cdot (2 + 2.17) && (= 2.5:\ \text{ATT--ATC bias}) \end{align} \]
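The same bookkeeping can be verified numerically. A minimal sketch, using the science table from the slides and illustrative variable names, that computes each term separately and checks that they add up to the naive estimate:

```python
# Science table from the slides: (unit, D, Y^C, Y^T).
science = [
    (1, "T", 8, 9), (2, "C", 5, 3), (3, "C", 6, 4), (4, "C", 6, 2),
    (5, "T", 15, 18), (6, "T", 13, 16), (7, "T", 8, 9), (8, "C", 2, 0),
    (9, "C", 4, 3), (10, "C", 2, 0),
]


def mean(values):
    return sum(values) / len(values)


t_rows = [(y_c, y_t) for _, d, y_c, y_t in science if d == "T"]
c_rows = [(y_c, y_t) for _, d, y_c, y_t in science if d == "C"]
pi = len(t_rows) / len(science)  # share of the sample that is treated

# Naive estimate: observed treated mean minus observed control mean.
naive = mean([y_t for _, y_t in t_rows]) - mean([y_c for y_c, _ in c_rows])

ate = mean([y_t - y_c for y_c, y_t in t_rows + c_rows])
att = mean([y_t - y_c for y_c, y_t in t_rows])
atc = mean([y_t - y_c for y_c, y_t in c_rows])

# Bias terms from the decomposition.
selection = mean([y_c for y_c, _ in t_rows]) - mean([y_c for y_c, _ in c_rows])
differential = (1 - pi) * (att - atc)

print(f"naive        = {naive:.2f}")         # 8.83
print(f"ATE          = {ate:.2f}")           # -0.50
print(f"selection    = {selection:.2f}")     # 6.83
print(f"differential = {differential:.2f}")  # 2.50
assert abs(naive - (ate + selection + differential)) < 1e-9
```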

Randomisation as the solution

  • Independence: \(D\) is assigned to individuals at random, so it is unrelated to their potential outcomes
  • In this case, selection bias \(= 0\) and \(ATT - ATC = 0\) in expectation, thus:
  • \(\widehat{\delta}_{\text{naive}} = \underbrace{\text{Avrg}(\delta)}_{\text{ATE}}\)
  • So randomisation solves various biases and the fundamental problem of causal inference (on average)

Random assignment has been called the gold standard for causal inference: it guarantees the necessary assumptions for causal inference hold by design.
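The "on average" claim can be checked exactly on the ten-unit example. The sketch below assumes complete randomisation with exactly four treated units (the group sizes from the example; the design itself is an assumption made for illustration) and enumerates every possible assignment instead of simulating.

```python
from itertools import combinations

# Potential outcomes for units 1..10 from the science table.
y_c = [8, 5, 6, 6, 15, 13, 8, 2, 4, 2]
y_t = [9, 3, 4, 2, 18, 16, 9, 0, 3, 0]
units = range(len(y_c))


def mean(values):
    return sum(values) / len(values)


def naive_estimate(treated_idx):
    """Difference in observed means for one particular assignment."""
    treated = set(treated_idx)
    t_obs = [y_t[i] for i in units if i in treated]
    c_obs = [y_c[i] for i in units if i not in treated]
    return mean(t_obs) - mean(c_obs)


# Every way of choosing 4 treated units out of 10, each equally likely.
estimates = [naive_estimate(s) for s in combinations(units, 4)]

print(len(estimates))               # 210
print(round(mean(estimates), 6))    # -0.5, the ATE
print(min(estimates), max(estimates))
```

Averaged over all assignments the naive estimate equals the ATE, but any single randomisation can still land far from it, which is why sample size and design matter.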

What is the difference between random assignment and random sampling???

  • BUT: we cannot run experiments all the time.
  • What if we need to work with observational data?

Key assumption when randomising: SUTVA

When relying on the Rubin/Neyman Causal Model with potential outcomes, we rely on SUTVA:

  • SUTVA: Stable Unit Treatment Value Assumption.

  • An observation’s outcome is not affected by other observations’ assignments.

    • Whether or not your neighbor is treated cannot affect your potential outcomes (e.g. make your \(\delta_i\) larger).
    • This means that there cannot be any spill-over effects or general equilibrium effects.
  • Example: A randomised controlled trial of immunization may violate SUTVA because immunization also has a group effect.

Forms of validity


The naive estimator is biased for our estimand (ATE) through two channels: baseline differences (under the control condition) and differential response to the treatment (under the exposure condition).

When exposure is randomised properly, we know that who ends up in each treatment arm has nothing to do with their potential outcomes!

This is why we generally say that randomised experiments are great for internal validity: we can rule out systematic bias in our study sample!

This does not imply that our results are externally valid, i.e., that they apply to people outside our study! We need further assumptions to move from one to another.

Types of experiments

  • Laboratory experiments: Usually conducted with a small sample (of undergraduate psychology students), often involving games on a computer. Helpful for cognitive/behavioral questions.

  • Field experiments: In order to obtain more externally valid results, experiments conducted in the field (i.e., under real-world conditions) are the way to go. Definitely more expensive though. Audit studies are a particular type of field experiment.

  • Survey experiments: One can randomize treatment conditions in a survey to evaluate how participants change their responses to a given stimulus. Vignettes and list experiments are examples of this approach.

  • (Bonus) Quasi-experiments: Researchers usually use the term quasi-experiments for real-world situations that offer as-if random variation in a treatment of interest. For example, earthquakes, changes in laws, dates of birth, etc.

How to randomise?


How much dispersion (i.e. uncertainty) there is in the distribution of our estimate depends on the level at which randomization (i.e. treatment assignment) happens: the individual or the cluster/group level.

The more aggregated the assignment, the more uncertainty. So why would we want to randomize at the cluster level?

Conditional randomization (i.e., blocking) increases efficiency when we have variables that are highly predictive of the outcome of interest.

One extreme of this is randomization in matched pairs: for each pair of individuals with similar covariates, we randomly assign one to treatment and one to control.
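The three assignment schemes mentioned here can be sketched with the standard library. The unit IDs, cluster labels, pairs, and function names below are hypothetical and chosen only for illustration.

```python
import random


def complete_randomisation(units, n_treated, rng):
    """Individual level: pick exactly n_treated units at random."""
    treated = set(rng.sample(units, n_treated))
    return {u: "T" if u in treated else "C" for u in units}


def cluster_randomisation(cluster_of, n_treated_clusters, rng):
    """Cluster level: whole clusters are assigned together."""
    clusters = sorted(set(cluster_of.values()))
    treated = set(rng.sample(clusters, n_treated_clusters))
    return {u: "T" if c in treated else "C" for u, c in cluster_of.items()}


def matched_pair_randomisation(pairs, rng):
    """Matched pairs: within each pair, one unit gets T and the other C."""
    assignment = {}
    for a, b in pairs:
        first, second = rng.sample([a, b], 2)
        assignment[first], assignment[second] = "T", "C"
    return assignment


rng = random.Random(2026)  # fixed seed so the sketch is reproducible
units = list(range(1, 11))

# Hypothetical clusters (e.g. schools) and hypothetical matched pairs.
cluster_of = {u: "ABCDE"[(u - 1) // 2] for u in units}
pairs = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]

individual = complete_randomisation(units, 4, rng)
clustered = cluster_randomisation(cluster_of, 2, rng)
paired = matched_pair_randomisation(pairs, rng)
print(individual)
print(clustered)
print(paired)
```

With cluster assignment there are only five independent draws instead of ten, which is the intuition behind the extra uncertainty at higher levels of aggregation.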

Blocking


Similar to the intuition for stratified random sampling in the context of surveys, blocking may increase precision in experimental design

Precision gains are similar to increasing the sample size

  • Collect background information on covariates relevant to the outcome

  • Pre-stratify your sample, then randomise within blocks

    • This ensures that, with respect to the blocked factors, both treatment arms are identical

    • It is essentially the same as running a separate experiment in each stratum

Blocking


For estimation, obtain block-specific effects, and average according to population shares. With \(J\) strata:

\[ \tau_{\text{block}} = \sum_{j=1}^{J} \frac{N_j}{N} \tau_j \]
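A minimal sketch of this estimator, taking \(\tau_j\) to be the difference in observed means within block \(j\) (one common choice, assumed here). The toy data and block labels are made up for illustration and do not come from the slides.

```python
def mean(values):
    return sum(values) / len(values)


def blocked_estimate(data):
    """data: list of (block, D, y_obs). Returns sum_j (N_j / N) * tau_j."""
    n = len(data)
    blocks = sorted({block for block, _, _ in data})
    total = 0.0
    for b in blocks:
        treated = [y for block, d, y in data if block == b and d == "T"]
        control = [y for block, d, y in data if block == b and d == "C"]
        tau_j = mean(treated) - mean(control)  # block-specific effect
        n_j = len(treated) + len(control)      # block size
        total += n_j / n * tau_j               # weight by block share
    return total


# Hypothetical data: two blocks of unequal size.
toy = [
    ("young", "T", 12), ("young", "T", 14),
    ("young", "C", 10), ("young", "C", 12),
    ("old", "T", 5), ("old", "C", 4),
]

estimate = blocked_estimate(toy)
print(f"{estimate:.3f}")  # 1.667 = (4/6) * 2 + (2/6) * 1
```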

Short activity (3 mins)

It may be hard to imagine an experiment that would be relevant for the type of questions we care about.

Some even say that experiments tend to emphasize “small” versus “big” questions, promoting incremental/testable policies.

However, there are examples of experiments addressing big and difficult questions. Can you think of an example or a proposal? (Hint: see https://graemeblair.com/teaching/UCLA_PS200E_Syllabus.pdf)

What is your estimand?

Researchers tend to formalize their effect of interest as regression coefficients (i.e., their hypotheses are formulated within a statistical model)

This is too restrictive!

Potential outcomes offer a way to formalize what we mean by a causal effect outside any statistical model. Graphical models provide a way to formalize our assumptions without parametric restrictions.

This allows us to clearly separate what we want (a certain estimand), what needs to be true for us to get it (identifying assumptions), the statistical machinery that transforms data into an answer to our question (an estimator), and the particular answer we get (our empirical estimate).

What is your estimand?

Statistics/ML vs Causal Inference

Statistics/ML

  • Passive observation of the data generating process
  • Estimand: Joint probabilities, CEF \[P(Y,X)\] \[E(Y|X=x)\]
  • Focus on asymptotics / out of sample prediction
  • Estimation problem: variance-bias tradeoff
  • Pearl: “deep learning is just curve fitting”

Causal Inference

  • Prediction under interventions on the DGP
  • Estimand: interventional quantities \[P(Y|do(x))\] \[E(Y|do(x)) - E(Y|do(x'))\] \[= E(Y_x) - E(Y_{x'})\]
  • Identification problem: consistency (infinite sample)
  • Estimation problem: in general, focus on bias over variance (but changing)
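To make the contrast between \(E(Y|X=x)\) and \(E(Y|do(x))\) concrete, here is a small exact calculation. Everything in it is a hypothetical toy model invented for illustration: a binary confounder \(Z\) that raises both the chance of \(X = 1\) and the outcome, with \(Z\) assumed to be the only confounder so that averaging over \(P(z)\) gives the interventional quantity.

```python
from fractions import Fraction as F

# Hypothetical data generating process (not from the slides):
p_z = {0: F(1, 2), 1: F(1, 2)}           # P(Z = z)
p_x1_given_z = {0: F(1, 5), 1: F(4, 5)}  # P(X = 1 | Z = z)


def mean_y(x, z):
    """E[Y | X = x, Z = z]: X adds 1, Z adds 4."""
    return 1 * x + 4 * z


def p_x_given_z(x, z):
    return p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]


def e_y_given_x(x):
    """Seeing: E[Y | X = x] = sum_z E[Y | x, z] * P(z | x)."""
    p_x = sum(p_x_given_z(x, z) * p_z[z] for z in p_z)
    return sum(mean_y(x, z) * p_x_given_z(x, z) * p_z[z] / p_x for z in p_z)


def e_y_do_x(x):
    """Doing: E[Y | do(x)] = sum_z E[Y | x, z] * P(z), Z the only confounder."""
    return sum(mean_y(x, z) * p_z[z] for z in p_z)


assoc = e_y_given_x(1) - e_y_given_x(0)
causal = e_y_do_x(1) - e_y_do_x(0)
print(assoc, causal)  # 17/5 1
```

The observed contrast (3.4) is more than three times the interventional one (1) because people with \(X = 1\) are disproportionately those with \(Z = 1\).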

The ladder of causation

Estimand | Activity | Field/Discipline | Questions | Example
\(\mathbf{P(Y \vert X)}\) | Seeing, Observing | Stats, Machine Learning | What would I believe about Y if I see X? | What is the expected income of a college graduate in a given field?
\(\mathbf{P(Y \vert do(x))}\) | Doing, Intervening | Experiments, Policy evaluation | What would happen to Y if I change X? | How would income levels change in response to college expansion?
\(\mathbf{P(Y_x \vert x',y')}\) | Imagining, Retrospecting | Structural Models | What would have happened to Y had I done X instead of X’? Why? | What would my parents’ income have been had they graduated from college, given that they didn’t go?

Why should we care about causal inference?


Most social science questions are in fact causal

The social sciences are experiencing what some authors have described as the rise of “causal empiricism” (Samii, 2016), a “credibility revolution” (Angrist and Pischke, 2010), or simply a “causal revolution” (Pearl and MacKenzie, 2018)

In artificial intelligence/ML, causality has been deemed “the next frontier” and “the next most important thing”

The enormous progress in the last decades has been facilitated by the development of mathematical frameworks that provide researchers with tools to handle causal questions: Potential Outcomes and the Structural Causal Model

What should you expect from this class?


This class is designed as a first course in causal inference, so we focus on essentials:

  • Familiarize yourself with the most widely used causal inference frameworks

  • Understand the role of randomisation to tackle causal questions

  • Use potential outcomes (and the do-operator) to formalize causal estimands

  • Use directed acyclic graphs (DAGs) to encode qualitative assumptions and derive testable implications

  • Selection on observables (regression, imputation, matching, weighting, doubly robust methods and flexible estimation using machine learning)

  • Difference-in-differences, synthetic controls, and extensions

  • Instrumental variables and regression discontinuity designs

  • Sensitivity analysis

Logistics


  • Lectures: Wed 2-4pm

  • Exercises: hands-on practices that accompany the lectures (self-study), released weekly

  • Bi-weekly drop-ins: optional, lecturer and TA present for any questions, Fri 12-1pm weeks 2/4/6/8

  • Summative assessments: to be released bi-weekly

We are in it together!

Ask questions, engage with the material, immerse yourself

Get confused and frustrated but carry on

Help each other (except on summative assessments)

Good methodological skills will empower you as a sociologist!

Climbing the causal ladder can change you (as did Ebenezer Scrooge)

Correlation vs causation


Selection bias