While randomized experiments (also called randomized controlled trials, RCTs) are considered the gold standard for evaluating treatment effects because they eliminate selection bias, they can still fail in practice. The failure doesn’t necessarily come from the randomization itself, but from how the treatment is delivered, accepted, or analyzed.
Researchers wanted to study the effect of job search assistance on unemployed individuals. Participants were randomly assigned to either:
Later, researchers followed up and collected data on:
Because the initial assignment to groups was random, comparing outcomes across these groups should, in theory, give an unbiased estimate of the treatment effect.
But There Was a Problem
Many participants in the treatment group did not actually receive the treatment.
This is where things go wrong. Just because someone is assigned to receive a treatment doesn’t mean they accept or use it.
This situation creates a compliance issue.
❗ Why is this a problem?
Because now we have two different types of “treatment” groups:
So, if we now try to compare:
we’re no longer using randomization. We’re comparing self-selected groups, and that opens the door to bias.
🔄 Two possible (but flawed) approaches:
Exclude the non-compliers (those assigned to treatment who didn't take it), and compare only:
Include the non-compliers as part of the control group, and compare them all with those who received treatment.
🤯 Why both are problematic:
Option 1 (exclude non-compliers): Maybe people who refused treatment did so because they didn’t believe it would help them. If that’s true, the remaining people who did receive treatment might already be more optimistic or motivated. → This leads to overestimating how effective the treatment is.
Option 2 (include non-compliers in the control group): Now you’re mixing people who chose not to take the treatment into the group that never got offered the treatment. → This again distorts the estimate and overstates the benefit.
Bottom Line:
Even when random assignment is done correctly, the actual delivery and acceptance of treatment can create biases. Randomization protects against selection bias only if everyone complies with their assignment, which often doesn’t happen in real-world studies.
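The two flawed comparisons can be illustrated with a small simulation. This is a minimal sketch under made-up assumptions, not the study's data: an unobserved trait called `motivated` (hypothetical) raises outcomes and also determines who takes up the offered treatment, and the effect of actually receiving treatment is set to 1. The last line compares the groups exactly as they were randomized (often called an intention-to-treat comparison), which is the only comparison here that still relies on randomization.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
tau = 1.0                                  # assumed effect of actually receiving treatment

motivated = rng.random(n) < 0.5            # hypothetical unobserved trait
z = rng.random(n) < 0.5                    # random assignment to the treatment group
d = z & motivated                          # only motivated assignees take the treatment
y = 2.0 * motivated + tau * d + rng.normal(0.0, 1.0, n)

# Option 1: drop non-compliers, compare treated units with the assigned control group.
per_protocol = y[d].mean() - y[~z].mean()
# Option 2: fold non-compliers into the control group.
as_treated = y[d].mean() - y[~d].mean()
# Comparison of the groups as randomized (effect of being offered treatment).
as_randomized = y[z].mean() - y[~z].mean()

print(per_protocol, as_treated, as_randomized)
```

In this setup both flawed comparisons land well above the assumed effect of 1, because everyone who receives treatment is motivated. The as-randomized comparison is close to 0.5, the effect of the offer when half of the assignees take it up.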
In later parts of the lecture, the video will introduce remedies for these kinds of problems, such as:
The lecture discusses foundational ideas in causal inference, particularly how we define, estimate, and interpret treatment effects using potential outcomes. It starts with the philosophical stance that causal relationships can be meaningful even at the individual level. That means we can talk about how a specific treatment affects a specific person, not just populations.
One key principle is that this individual-level causal view does not conflict with deterministic thinking. Even if we assume that each person has a fixed outcome under treatment and under control, we can still talk meaningfully about cause and effect. For example, even if people are deterministically affected by treatment, we can focus on isolating one cause at a time while holding others constant. This allows for a coherent framework to discuss causality, even without full knowledge of all possible factors.
The lecture then introduces counterfactual reasoning. To evaluate a treatment effect, we imagine a person, say John, and consider two scenarios: one in which he takes a medicine and one in which he does not. There are four logical combinations: he gets better in both scenarios, he gets better in neither, he gets better only with the medicine, or he gets better only without it. By comparing these potential outcomes, we define the individual treatment effect.
The framework used here is the Rubin Causal Model or potential outcomes framework. Each individual has two potential outcomes: one under treatment and one under control. However, in real life, we only get to observe one of these outcomes for any individual. This is known as the Fundamental Problem of Causal Inference. If we could see both outcomes, estimating treatment effects would be easy.
To formalize this, let i represent a unit (e.g., John). For each unit i, we define:
We can only observe one of these for each individual, depending on whether they were treated or not. The observed outcome \(Y_i\) is determined by their treatment status \(Z_i\). If they were treated, we observe \(Y_i(1)\); if not, we observe \(Y_i(0)\).
The individual treatment effect is \(Y_i(1) - Y_i(0)\), but since we can only see one outcome, we focus instead on estimating average treatment effects.
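The notation can be made concrete with a toy table. The numbers below are invented for illustration. In real data only the `observed` column is available, so `unit_effects` and `ate` could not be computed directly.

```python
# Hypothetical potential outcomes for five units (never fully observable in practice).
y1 = [7, 5, 6, 9, 4]   # Y_i(1): outcome if treated
y0 = [5, 5, 3, 8, 6]   # Y_i(0): outcome if not treated
z = [1, 0, 1, 0, 0]    # Z_i: treatment actually received

# Observed outcome: Y_i = Y_i(1) if Z_i = 1, otherwise Y_i(0).
observed = [a if t == 1 else b for a, b, t in zip(y1, y0, z)]

# Unit-level effects need both potential outcomes, so they are unknowable in practice.
unit_effects = [a - b for a, b in zip(y1, y0)]
ate = sum(unit_effects) / len(unit_effects)

print(observed)      # [7, 5, 6, 8, 6]
print(unit_effects)  # [2, 0, 3, 1, -2]
print(ate)           # 0.8
```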
Several types of average treatment effects are introduced:
Average treatment effects are useful in many real-world decisions. For instance, a doctor might want to know how a surgery affects a particular patient. Although that individual-level effect is unknown, knowing how the surgery performs on average helps guide decisions. Similarly, a policymaker may want to understand how effective a training program is on average before deciding whether to scale it up.
To estimate these effects, we rely on study design. In randomized experiments, treatment assignment is random, so we can often estimate average treatment effects without bias. In observational studies, individuals choose whether to receive treatment, so we must rely on additional assumptions like unconfoundedness.
An important assumption is the Stable Unit Treatment Value Assumption (SUTVA). It includes two ideas:
The lecture also explains that potential outcomes can be viewed differently depending on whether we’re dealing with a finite or infinite population. In a finite population, potential outcomes are considered fixed, and we estimate fixed quantities. In an infinite or super-population framework, outcomes are treated as random variables.
Finally, the lecture notes that not all causal estimands depend only on the marginal distributions of outcomes. Some, like the proportion of individuals who benefit from treatment (i.e., those with \(Y_i(1) > Y_i(0)\)), depend on the joint distribution of the potential outcomes. Estimating these quantities is more challenging and may require additional assumptions.
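A toy example shows why the joint distribution matters. The two hypothetical populations below have identical marginal distributions of \(Y(1)\) and \(Y(0)\), and therefore the same average effect, yet the share of units that benefit differs.

```python
# Each pair is (Y_i(1), Y_i(0)) for one unit; both populations are invented.
pop_a = [(1, 0), (1, 0), (0, 1), (0, 1)]
pop_b = [(1, 1), (1, 1), (0, 0), (0, 0)]

def marginals(pop):
    """Sorted values of Y(1) and of Y(0), ignoring how they are paired."""
    return sorted(y1 for y1, _ in pop), sorted(y0 for _, y0 in pop)

def ate(pop):
    return sum(y1 - y0 for y1, y0 in pop) / len(pop)

def share_benefiting(pop):
    return sum(y1 > y0 for y1, y0 in pop) / len(pop)

print(marginals(pop_a) == marginals(pop_b))             # True
print(ate(pop_a), ate(pop_b))                           # 0.0 0.0
print(share_benefiting(pop_a), share_benefiting(pop_b)) # 0.5 0.0
```

Because a randomized experiment reveals only the two marginal distributions, it cannot distinguish these populations without further assumptions.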
In conclusion, causal inference allows us to define and estimate the effect of treatments even when we cannot observe counterfactual outcomes. With appropriate assumptions and methods, we can estimate meaningful average effects that are useful in practice, particularly in randomized experiments or well-designed observational studies.
This lesson builds on the potential outcomes framework for causal inference. The framework allows us to define individual treatment effects and, based on those, various forms of average treatment effects, specifically:
These treatment effects can be estimated without bias or consistently under certain assumptions, especially in randomized experiments and, to a lesser extent, in observational studies when certain conditions are met.
To begin understanding these conditions, the instructor draws an analogy with random sampling from a population—a concept familiar from basic statistics. If units \(i\) are independently and identically distributed (i.i.d.) and the expected value of the outcome \(Y\) exists and is finite, then the sample mean is an unbiased and consistent estimator of the population mean.
Applying this idea to causal inference:
To formalize this, suppose we observe a dataset of \((Y_i, Z_i)\), where \(Z_i = 1\) means the unit is treated and \(Z_i = 0\) means untreated. If we want to estimate \(E[Y(1)]\), the mean of the outcomes among treated units will only give us that value if the treated units are a random sample from the population in terms of their potential outcomes. This is only true if potential outcomes are independent of treatment assignment, which leads to the ignorability assumption or unconfoundedness.
This assumption can be stated as \((Y(0), Y(1)) \perp Z\). That is, potential outcomes are independent of treatment assignment.
In randomized experiments, this condition is guaranteed by design. Treatment is assigned randomly—like flipping a coin—so that it is not related to the potential outcomes. As a result, comparisons between treated and untreated groups can be attributed to the treatment itself.
In contrast, in observational studies, treatment assignment is not random. People (or their doctors, for example) choose whether to take the treatment. That choice may be influenced by factors like age, health status, or beliefs about treatment efficacy. For example:
If these beliefs are accurate, and variables like age are related to both treatment assignment and potential outcomes, then simply comparing treated and untreated outcomes will give a biased estimate of the treatment effect.
This is where confounding arises. Confounding occurs when a third variable affects both the treatment assignment and the outcome. In the example, age is a confounder. It influences both treatment choice and outcome. Therefore, without accounting for age, we cannot attribute differences in outcomes solely to the treatment.
To address this issue, researchers can use stratification. For example:
The working assumption is that, within each stratum, treatment assignment is independent of the potential outcomes: \((Y(0), Y(1)) \perp Z \mid X\). This condition is weaker than full ignorability and often more realistic in observational studies. It means treatment is not randomly assigned overall, but is as good as random given some covariates. If this condition holds, we can estimate conditional ATEs within subgroups and then compute an overall ATE as a weighted average of these subgroup ATEs.
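A small simulation illustrates confounding by age and the stratified fix. Everything here is an assumption made for the sketch: a binary `old` indicator, treatment probabilities of 0.8 for old and 0.2 for young units, and a constant true effect of 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
old = rng.random(n) < 0.5                         # hypothetical binary age covariate
z = rng.random(n) < np.where(old, 0.8, 0.2)       # older units are treated more often
y = np.where(old, 3.0, 1.0) + 1.0 * z + rng.normal(0.0, 1.0, n)  # true effect is 1

# Naive comparison mixes the age difference into the treatment contrast.
naive = y[z].mean() - y[~z].mean()

# Stratify on age, then average the stratum effects with stratum-size weights.
stratified = 0.0
for s in (False, True):
    in_s = old == s
    effect_s = y[in_s & z].mean() - y[in_s & ~z].mean()
    stratified += in_s.mean() * effect_s

print(naive, stratified)
```

The naive difference comes out near 2.2 in this setup, while the stratified estimate is close to the assumed effect of 1. The fix works only because age is the sole confounder by construction.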
This is the bridge between randomized experiments and observational studies. In a randomized trial, treatment is assigned by a known mechanism, and we know ignorability holds. In an observational study, we do not know the treatment mechanism. However, if we adjust for variables that influence both treatment and outcome (e.g., through regression, stratification, matching, etc.), we may approximate a conditionally randomized design and estimate causal effects.
However, unlike randomized trials, we cannot directly test whether the ignorability assumption holds in observational studies. Because we never observe both potential outcomes for any unit, we can’t verify that treatment and outcomes are truly independent, even given covariates. This makes causal inference from observational studies inherently more fragile.
Moreover, real-world problems are more complex than the age-only example. There may be many covariates that influence treatment and outcome. Researchers might miss some of these, or be unable to measure them accurately, making full adjustment impossible. Still, including as many relevant covariates as possible is generally advised.
The upcoming modules will delve deeper into identifying and estimating the three average treatment effects (SATE, the finite-population ATE, abbreviated FPATE here and FATE in later lessons, and ATE) in both randomized and observational study settings. Later, the course will explore what happens when unconfoundedness fails, and present alternative assumptions and methods (e.g., instrumental variables, sensitivity analysis) to still estimate causal effects.
In summary:
In this module, the focus is on understanding randomized experiments as a foundation for causal inference, especially because they serve as a bridge to observational studies. The lectures and slides introduce core concepts like assignment mechanisms and how they relate to potential outcomes and covariates.
1. Review of Prior Concepts: Previously, we introduced:
2. Why Study Randomized Experiments: Randomized experiments help us understand when and how we can make causal claims. The ability to make valid causal inferences relies on the assignment rule—the method by which units are assigned to treatment or control. Randomization guarantees (under proper assumptions) that treatment assignment is independent of potential outcomes, a property called unconfoundedness or ignorability.
3. Notation Used:
The assignment rule is the probability of any particular vector \(Z = z\), given covariates and potential outcomes: \(\Pr(Z = z \mid X, Y(0), Y(1))\)
4. Desired Properties of Assignment Rules: An ideal assignment rule has these properties:
If all these hold, then the assignment is said to be strongly ignorable, as per Rosenbaum and Rubin (1983).
This condition allows the use of the propensity score \(e(X_i) = \Pr(Z_i = 1 | X_i)\), which is the probability of treatment given covariates. This concept is crucial in both randomized and observational studies.
5. Types of Randomized Experiments:
a. Bernoulli Randomized Experiment (Coin Tossing)
b. Completely Randomized Experiment
c. Randomized Block Experiment
d. Paired Randomized Experiment
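The four designs can be sketched as assignment functions. This is a minimal illustration using the usual textbook definitions: independent coin flips, a fixed number of treated units, a fixed number of treated units within each block, and exactly one treated unit per pair. The function names and arguments are hypothetical.

```python
import random

def bernoulli_assignment(n, p, rng):
    """Each unit is treated independently with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def complete_assignment(n, n_treated, rng):
    """Exactly n_treated of the n units are treated, all subsets equally likely."""
    z = [1] * n_treated + [0] * (n - n_treated)
    rng.shuffle(z)
    return z

def block_assignment(blocks, n_treated_per_block, rng):
    """Complete randomization carried out separately inside each block."""
    z = [0] * len(blocks)
    for label, k in n_treated_per_block.items():
        members = [i for i, b in enumerate(blocks) if b == label]
        for i in rng.sample(members, k):
            z[i] = 1
    return z

def paired_assignment(pairs, n, rng):
    """Exactly one unit in each pair (i, j) is treated, chosen by a fair coin."""
    z = [0] * n
    for i, j in pairs:
        treated = i if rng.random() < 0.5 else j
        z[treated] = 1
    return z

rng = random.Random(0)
print(bernoulli_assignment(6, 0.5, rng))
print(complete_assignment(6, 3, rng))
print(block_assignment(["a", "a", "a", "b", "b", "b", "b"], {"a": 1, "b": 2}, rng))
print(paired_assignment([(0, 1), (2, 3)], 4, rng))
```

Only the Bernoulli design leaves the number of treated units random. The other three fix it in advance, overall, per block, or per pair.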
6. Importance of Understanding Assignment Mechanisms: Understanding the rules for how treatment is assigned is fundamental to correctly identifying and estimating treatment effects. These randomized mechanisms ensure that differences in outcomes between treated and control groups can be attributed to the treatment itself, not to pre-existing differences in covariates or selection bias.
Moreover, in observational studies, we try to emulate these randomized designs (e.g., via stratification or propensity score matching), but we cannot control the assignment mechanism. This makes randomized experiments a benchmark and reference point for identifying causal effects.
The key idea in randomization-based inference is that potential outcomes are treated as fixed constants, not as random variables. This is unlike classical statistical methods where outcomes are assumed to have probability distributions. Instead, randomness comes solely from the assignment mechanism, which is probabilistic—i.e., from how units are randomly assigned to treatment or control.
We typically begin by evaluating a sharp null hypothesis: that treatment has absolutely no effect on any unit. This is stronger than assuming that the average treatment effect is zero because it assumes each unit’s treatment effect is zero.
The formal framework introduces notation:
Since under the null hypothesis, the outcomes are the same for any assignment, we can calculate the test statistic for the observed assignment and compare it to what we would have observed under all other possible assignments. The probability of getting a value as extreme or more extreme than the observed statistic gives the p-value.
For randomized designs where all assignments are equally likely (such as completely randomized experiments), the p-value is simply the proportion of all assignments where the test statistic is as extreme as or more extreme than the observed one.
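For a small completely randomized experiment, the p-value can be computed by enumerating every assignment. The sketch below assumes the absolute difference in means as the test statistic. The data and function names are made up for illustration.

```python
from itertools import combinations

def diff_in_means(y, treated):
    """Mean outcome of the treated indices minus mean outcome of the rest."""
    t = [y[i] for i in treated]
    c = [y[i] for i in range(len(y)) if i not in treated]
    return sum(t) / len(t) - sum(c) / len(c)

def exact_p_value(y, treated_obs):
    """Share of equally likely assignments with a statistic at least as extreme."""
    n, n1 = len(y), len(treated_obs)
    t_obs = abs(diff_in_means(y, set(treated_obs)))
    assignments = list(combinations(range(n), n1))
    # Under the sharp null the observed outcomes y do not change with the assignment.
    extreme = sum(abs(diff_in_means(y, set(a))) >= t_obs - 1e-12 for a in assignments)
    return extreme / len(assignments)

y = [12, 15, 14, 9, 8, 10]                 # hypothetical observed outcomes
p = exact_p_value(y, treated_obs=(0, 1, 2))
print(p)                                   # 0.1
```

Here the observed assignment puts the three largest outcomes in the treatment group, so only it and its mirror image are as extreme, giving 2 of 20 assignments.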
Several test statistics are commonly used in randomization inference:
More generally, one can test a null hypothesis of constant treatment effect \(H_0: \tau_i = \tau\) for all units i. Under this assumption, we can impute missing potential outcomes and compute test statistics accordingly. Testing for different values of \(\tau\) gives a confidence interval.
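Test inversion can be sketched as follows. Under \(H_0: \tau_i = \tau_0\), subtracting \(\tau_0\) from each treated unit's outcome imputes \(Y_i(0)\) for every unit, after which the sharp-null test applies to the imputed values. The grid of candidate values, the 5% level, and the data are assumptions chosen for this example.

```python
from itertools import combinations

def p_value_constant_effect(y, z, tau0):
    """Exact randomization p-value for H0: Y_i(1) - Y_i(0) = tau0 for all i."""
    n, n1 = len(y), sum(z)
    y0 = [yi - tau0 * zi for yi, zi in zip(y, z)]   # imputed Y_i(0) under H0

    def stat(treated):
        t = [y0[i] for i in treated]
        c = [y0[i] for i in range(n) if i not in treated]
        return abs(sum(t) / len(t) - sum(c) / len(c))

    observed = stat({i for i, zi in enumerate(z) if zi == 1})
    draws = [stat(set(a)) for a in combinations(range(n), n1)]
    return sum(d >= observed - 1e-12 for d in draws) / len(draws)

y = [12, 15, 14, 16, 9, 8, 10, 7]          # hypothetical observed outcomes
z = [1, 1, 1, 1, 0, 0, 0, 0]
grid = [k / 2 for k in range(-10, 31)]     # candidate tau0 values from -5 to 15
accepted = [t for t in grid if p_value_constant_effect(y, z, t) > 0.05]
interval = (min(accepted), max(accepted))
print(interval)
```

The interval collects the values of \(\tau_0\) that are not rejected. Its resolution is limited by the grid spacing, and with only 70 possible assignments the smallest attainable p-value is 2/70.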
Several practical considerations influence the choice of test statistic:
For large sample sizes (e.g., n = 500), the number of possible assignments becomes astronomically large. In such cases, full enumeration is infeasible. Two practical solutions are:
Advantages of randomization inference include:
However, there are limitations:
Ultimately, researchers often wish to summarize treatment effects even when they vary across units—leading to the next topic: estimating the Sample Average Treatment Effect (SATE) using randomization-based inference.
The main goal is to estimate the sample average treatment effect (SATE), defined as the average difference between potential outcomes under treatment and control across all units in the sample:
\[ \text{SATE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i(1) - Y_i(0) \right) \]
This allows for unit-level heterogeneity in treatment effects (i.e., \(Y_i(1) - Y_i(0)\) may differ across individuals).
In a completely randomized design, \(n_1\) units are assigned to treatment and \(n_0 = n - n_1\) to control.
Estimator:
\[ \hat{\tau} = \bar{Y}_1 - \bar{Y}_0 = \frac{1}{n_1} \sum_{i:Z_i=1} Y_i - \frac{1}{n_0} \sum_{i:Z_i=0} Y_i \]
This is the difference in observed sample means between the treatment and control groups.
Unbiasedness: The estimator is unbiased for SATE because:
In block randomization:
SATE estimator: The overall estimate is a weighted average of block-specific treatment effects:
\[ \hat{\tau}_{\text{SATE}} = \sum_{s=1}^{S} w_s \left( \bar{Y}_{1,s} - \bar{Y}_{0,s} \right) \]
where \(w_s\) is the proportion of the total sample in block \(s\), and \(\bar{Y}_{1,s}\), \(\bar{Y}_{0,s}\) are treatment/control group means within that block.
Each block-specific estimator is unbiased for the block-level SATE, so the overall estimator remains unbiased.
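The weighted average can be written out directly. This is a sketch with invented data, where `block` holds a hypothetical block label for each unit.

```python
def block_sate_estimate(y, z, block):
    """Weighted average of within-block differences in means (weights n_s / n)."""
    n = len(y)
    estimate = 0.0
    for s in sorted(set(block)):
        treated = [yi for yi, zi, bi in zip(y, z, block) if bi == s and zi == 1]
        control = [yi for yi, zi, bi in zip(y, z, block) if bi == s and zi == 0]
        w_s = (len(treated) + len(control)) / n
        estimate += w_s * (sum(treated) / len(treated) - sum(control) / len(control))
    return estimate

y = [10, 8, 7, 5, 20, 14, 15, 13, 12, 11]
z = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
block = ["a"] * 4 + ["b"] * 6

# Block a: 9 - 6 = 3 with weight 0.4; block b: 17.5 - 12.5 = 5 with weight 0.6.
print(block_sate_estimate(y, z, block))    # approximately 4.2
```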
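The variance of \(\hat{\tau}\) under complete randomization is:
\[ \operatorname{Var}(\hat{\tau}) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_0^2}{n_0} - \frac{1}{n} \cdot \operatorname{Var}(Y_i(1) - Y_i(0)) \]
Here \(\sigma_1^2\), \(\sigma_0^2\), and \(\operatorname{Var}(Y_i(1) - Y_i(0))\) are the variances of \(Y_i(1)\), \(Y_i(0)\), and the unit-level effects across the \(n\) sample units (computed with the \(n - 1\) divisor). The third term enters with a minus sign: heterogeneity in the unit-level effects reduces the variance of \(\hat{\tau}\).
However, we cannot estimate the third term directly, since we never observe both \(Y_i(1)\) and \(Y_i(0)\) for any individual.
Because the third term is unobservable and non-negative, dropping it yields a conservative estimator that can only overstate the variance: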
\[ \hat{V}(\hat{\tau}) = \frac{s_1^2}{n_1} + \frac{s_0^2}{n_0} \]
where \(s_1^2\) and \(s_0^2\) are the sample variances of observed outcomes in the treatment and control groups, respectively.
This estimator is unbiased if treatment effects are constant across units.
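These properties can be checked by brute force on a toy sample in which both potential outcomes are known, which is never the case in practice. The sketch enumerates every complete randomization and compares the exact variance of \(\hat{\tau}\) with the standard result \(\sigma_1^2/n_1 + \sigma_0^2/n_0 - \operatorname{Var}(Y_i(1) - Y_i(0))/n\), using \(n - 1\) divisors, and with the average of the conservative estimator. All numbers are invented.

```python
from itertools import combinations
from statistics import mean, variance   # variance() uses the n - 1 divisor

y1 = [7, 5, 6, 9, 4, 8]                 # hypothetical Y_i(1)
y0 = [5, 5, 3, 8, 6, 4]                 # hypothetical Y_i(0)
n, n1 = len(y1), 3
n0 = n - n1
effects = [a - b for a, b in zip(y1, y0)]
sate = mean(effects)

estimates, conservative = [], []
for treated in combinations(range(n), n1):
    t = [y1[i] for i in treated]
    c = [y0[i] for i in range(n) if i not in treated]
    estimates.append(mean(t) - mean(c))
    conservative.append(variance(t) / n1 + variance(c) / n0)

# Exact variance of the estimator over all equally likely assignments.
true_var = mean([(e - sate) ** 2 for e in estimates])
formula = variance(y1) / n1 + variance(y0) / n0 - variance(effects) / n

print(mean(estimates), sate)            # equal: the estimator is unbiased
print(true_var, formula)                # equal: the formula is exact
print(mean(conservative))               # at least as large as true_var
```

Since the unit-level effects vary in this toy sample, the conservative estimator overstates the variance on average. With constant effects the two would coincide.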
Using the conservative variance estimate, one can form a test statistic:
\[ Z = \frac{\hat{\tau}}{\sqrt{\hat{V}(\hat{\tau})}} \]
For large samples, \(Z\) approximately follows a standard normal distribution under the null hypothesis. This allows:
\[ \hat{\tau} \pm 1.96 \cdot \sqrt{\hat{V}(\hat{\tau})} \]
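The test and interval can be computed in a few lines. This is a sketch with a hypothetical function name and made-up data. The sample is far too small for the normal approximation to be trusted and serves only to show the arithmetic.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def neyman_inference(y, z, level=0.95):
    """Difference in means, conservative standard error, z statistic, p-value, CI."""
    treated = [yi for yi, zi in zip(y, z) if zi == 1]
    control = [yi for yi, zi in zip(y, z) if zi == 0]
    tau_hat = mean(treated) - mean(control)
    se = sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    z_stat = tau_hat / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))   # two-sided
    crit = NormalDist().inv_cdf(0.5 + level / 2)        # about 1.96 for level=0.95
    return tau_hat, se, z_stat, p_value, (tau_hat - crit * se, tau_hat + crit * se)

y = [12, 15, 14, 16, 9, 8, 10, 7]
z = [1, 1, 1, 1, 0, 0, 0, 0]
tau_hat, se, z_stat, p_value, ci = neyman_inference(y, z)
print(tau_hat, se, z_stat, p_value, ci)
```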
In the first lesson, the focus is on introducing randomized experiments and distinguishing different randomization schemes. These include:
The key idea is that the assignment mechanism is the source of randomness, while the potential outcomes for each unit are considered fixed constants.
In the second lesson, the concept of randomization-based inference is introduced. Unlike classical inference that treats outcomes as random, here randomness comes solely from the treatment assignment. The process involves:
Defining a sharp null hypothesis – e.g., no unit experiences any treatment effect, i.e., \(Y_i(1) = Y_i(0)\) for all \(i\).
Under the null, we know all units’ outcomes under all assignments, so we can:
This method was famously illustrated by Fisher's tea-tasting experiment, in which a lady (Muriel Bristol) claimed she could tell whether milk or tea had been poured first. Given the treatment assignment and her responses, one can compute an exact p-value under the sharp null.
This framework allows for testing more general hypotheses as well, such as constant effects across units (e.g., \(Y_i(1) - Y_i(0) = \tau\) for all \(i\)) and constructing confidence intervals for the treatment effect by identifying values of \(\tau\) that are not rejected.
The third lesson extends the randomization-based approach from hypothesis testing to estimation of the Sample Average Treatment Effect (SATE).
For a completely randomized experiment, the estimator is simply the difference in sample means between treatment and control groups: \(\bar{Y}_1 - \bar{Y}_0 = \frac{1}{n_1} \sum_{i:Z_i=1} Y_i - \frac{1}{n_0} \sum_{i:Z_i=0} Y_i\)
This estimator is unbiased for the SATE because of the random assignment mechanism, even though the potential outcomes are fixed.
In the case of block randomized experiments, the SATE can be estimated as a weighted average of treatment effects within each block, with weights proportional to block sizes.
The variance of the estimator has three components:
Since we can’t observe both potential outcomes for any individual, the third term is unidentifiable. However, if treatment effects are constant, this term drops out.
A conservative estimator of the variance (still valid even when effects vary) is: \(\hat{V}(\bar{Y}_1 - \bar{Y}_0) = \frac{s_0^2}{n_0} + \frac{s_1^2}{n_1}\) where \(s_0^2\) and \(s_1^2\) are the sample variances in the control and treatment groups, respectively.
With large samples, one can use the normal approximation: \(\frac{\bar{Y}_1 - \bar{Y}_0}{\sqrt{\frac{s_0^2}{n_0} + \frac{s_1^2}{n_1}}} \sim N(0,1)\) to construct confidence intervals and perform hypothesis tests.
In conclusion, randomization inference provides a powerful framework with strong internal validity, minimal distributional assumptions, and clear logic grounded in the assignment mechanism. While it may have limitations in generalizing beyond the sample and in handling effect heterogeneity, it lays a foundational methodology for causal inference using experimental data.
In previous modules, potential outcomes were treated as fixed constants, and randomness came only from the random treatment assignment. We estimated the sample average treatment effect (SATE) and tested hypotheses using randomization-based methods. In this lesson, we instead suppose that the sample itself was randomly drawn from a finite population of size \(N\). This introduces a second source of randomness: the sample selection process.
Each unit \(i\) has two potential outcomes: one under treatment, \(y_i(1)\), and one under control, \(y_i(0)\). As always in causal inference, we observe only one of the two for each unit. This is a missing data problem: half of the potential outcomes are unobserved.
In this extended setting, the estimator remains the same — the difference in means between treated and control units. But now, this estimator is a random variable due to both:
We define a binary sampling indicator \(T_i\), where \(T_i = 1\) if unit \(i\) is included in the sample and \(T_i = 0\) otherwise. The sample average treatment effect, which the estimator targets once the sample is drawn, can then be expressed as \(\frac{1}{n} \sum_{i=1}^{N} T_i \, (y_i(1) - y_i(0))\). This form shows explicitly that the quantity being estimated depends on the random sampling process.
This estimator is unbiased for the finite population average treatment effect (FATE), which is \(\frac{1}{N} \sum_{i=1}^{N} (y_i(1) - y_i(0))\), the average over all \(N\) units.
To verify the unbiasedness, we compute the expectation of the estimator over both sources of randomness:
We can interchange the order of expectations (by the law of iterated expectations), and we find that the expected value of the estimator equals the FATE. Therefore, the estimator is unbiased for the true average treatment effect in the population.
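The two sources of randomness can be simulated directly. The sketch below uses an invented finite population of 20 units, repeatedly draws a simple random sample of 10, and then completely randomizes 5 of them to treatment. The average of the estimates should land close to the FATE.

```python
import random

rng = random.Random(0)
N, n, n1 = 20, 10, 5
y0 = list(range(N))                          # hypothetical Y_i(0) for the population
y1 = [y0[i] + i % 3 for i in range(N)]       # hypothetical Y_i(1); effects are 0, 1, or 2
fate = sum(a - b for a, b in zip(y1, y0)) / N

estimates = []
for _ in range(20_000):
    sample = rng.sample(range(N), n)         # first source: which units have T_i = 1
    treated = set(rng.sample(sample, n1))    # second source: which sampled units have Z_i = 1
    t = [y1[i] for i in sample if i in treated]
    c = [y0[i] for i in sample if i not in treated]
    estimates.append(sum(t) / len(t) - sum(c) / len(c))

average_estimate = sum(estimates) / len(estimates)
print(fate, average_estimate)
```

Any single estimate can be far from the FATE because both the sample and the assignment vary. Only the average over many replications settles near it.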
However, while the estimator is unbiased, hypothesis testing becomes more complex. The randomization test that worked under fixed samples is not valid here, because the distribution of the estimator now depends on both sampling and treatment assignment.
To perform inference about FATE, we also need to consider the variance of the estimator. This variance consists of three terms:
The third term is not identifiable from the observed data because we never observe both potential outcomes for the same unit. If we assume constant treatment effects across units, this third term drops out, and we can compute a conservative variance estimate using the sample variances from the treatment and control groups.
Using this variance estimate, we can construct confidence intervals and test hypotheses using a normal approximation. Under the null hypothesis of no average treatment effect, the standardized difference-in-means statistic is approximately standard normal.
The lesson then transitions to model-based inference. In this framework, potential outcomes are treated as random variables rather than fixed constants. This is more familiar to most students, as it aligns with standard statistical modeling — like assuming outcomes are generated by some probabilistic process (e.g., normal distributions).
Under model-based inference, the sample is considered a random sample from an infinite or large super-population. The treatment assignment is assumed to be independent of the potential outcomes. The estimator remains the difference in sample means. Under the assumption of independent and identically distributed (i.i.d.) potential outcomes, this estimator is still unbiased for the population average treatment effect (PATE).
The lesson shows how, under these assumptions, the expected value of the estimator equals the expectation of the difference in potential outcomes, and variance estimates follow from standard statistical theory.
Finally, the same reasoning is extended to block randomized designs. In this case, treatment assignment is independent of potential outcomes within blocks, but the probability of assignment may depend on observed covariates (i.e., blocking variables). The estimator — a weighted average of treatment effects within each block — remains unbiased under this framework.
Here is a comparison table summarizing the key differences between the Finite Population Average Treatment Effect (FATE) and the Average Treatment Effect (ATE):
Feature | Finite Population ATE (FATE) | Average Treatment Effect (ATE) |
---|---|---|
Definition | Average treatment effect over a fixed set of N units | Expected treatment effect over a conceptual (infinite) superpopulation |
Mathematical Form | \(\text{FATE} = \frac{1}{N} \sum_{i=1}^{N} (Y_i(1) - Y_i(0))\) | \(\text{ATE} = \mathbb{E}[Y(1) - Y(0)]\) |
Population Type | Finite, fixed population of N units | Infinite or hypothetical superpopulation |
Potential Outcomes | Treated as fixed constants | Treated as random variables |
Sources of Randomness | Two sources: sampling from the finite population and treatment assignment | Sampling from the superpopulation (potential outcomes are random draws) and treatment assignment |
Estimator | Sample mean difference: \(\bar{Y}_1 - \bar{Y}_0\) | Same estimator, interpreted as an estimate of a population quantity |
Inference Target | Causal effect in the fixed, finite population of N units | Causal effect generalizable to a broader population |
Typical Use Cases | Small-scale evaluations, surveys, census-like experiments | Clinical trials, economic studies, policy evaluations with generalizability |
Confidence Intervals & Testing | Based on finite population corrections or randomization-based inference | Based on model-based inference (e.g., CLT, regression assumptions) |
Estimator Bias | Estimator is unbiased for FATE under random sampling and assignment | Estimator is unbiased for ATE under i.i.d. and ignorability assumptions |
In the previous lesson, we discussed model-based inference in the context of completely randomized and block-randomized experiments. Now we reformulate this idea using linear regression models. This allows us to explore causal relationships within a regression framework.
We begin with the causal regression model for the potential outcomes. For each subject \(i\), and for each possible treatment assignment \(z = 0\) (control) or \(z = 1\) (treatment), we define:
\[ Y_i(z) = \alpha + \tau z + \varepsilon_i(z) \]
This model posits that each subject has two potential outcomes: one if they receive treatment and one if they do not. The error term \(\varepsilon_i(z)\) captures individual variation not explained by the treatment. To ensure the model is identifiable, we assume \(E[\varepsilon_i(z)] = 0\).
This model gives us a clean interpretation: the intercept \(\alpha\) is the expected outcome under control (i.e., \(E[Y(0)]\)), and the treatment effect \(\tau\) is the expected difference between the treated and untreated outcomes (i.e., \(E[Y(1) - Y(0)]\)).
However, in practice, we never observe both potential outcomes for the same individual. We only observe the outcome under the actual treatment assignment. That is, we observe:
\[ Y_i = Y_i(Z_i) \]
So, researchers instead fit a regression model on observed data:
\[ Y_i = \alpha^* + \tau^* Z_i + v_i \]
Here, \(\alpha^*\) and \(\tau^*\) are regression coefficients estimated from the data, and \(v_i\) is the regression residual. To identify these coefficients, we assume that the error term \(v_i\) has mean zero conditional on treatment assignment: \(E[v_i | Z_i] = 0\). Under this assumption:
\[ \alpha^* = E[Y_i | Z_i = 0], \quad \tau^* = E[Y_i | Z_i = 1] - E[Y_i | Z_i = 0] \]
The question then becomes: how do \(\alpha^*\) and \(\tau^*\) relate to the original causal parameters \(\alpha\) and \(\tau\)? By rewriting the observed outcome as a function of potential outcomes:
\[ Y_i = Z_i Y_i(1) + (1 - Z_i) Y_i(0) \]
and substituting the causal model in, we get:
\[ Y_i = \alpha + \tau Z_i + \varepsilon_i(Z_i) \]
where \(\varepsilon_i(Z_i) = Z_i \varepsilon_i(1) + (1 - Z_i) \varepsilon_i(0)\).
Taking expectations conditional on treatment, we find:
\[ \alpha^* = \alpha + E[\varepsilon_i(0) | Z_i = 0], \quad \tau^* = \tau + E[\varepsilon_i(1) | Z_i = 1] - E[\varepsilon_i(0) | Z_i = 0] \]
Therefore, in general, \(\alpha^* \neq \alpha\) and \(\tau^* \neq \tau\). Equality is guaranteed when the potential outcomes (or equivalently, the potential errors) are independent of the treatment assignment, because then \(E[\varepsilon_i(z) | Z_i = z] = E[\varepsilon_i(z)] = 0\). This is precisely the case in a completely randomized experiment.
In contrast, in an observational study, treatment assignment may depend on the potential outcomes. For example, consider a study of depressed patients choosing whether to take medication. Suppose patients who believe the drug will help them choose to take it, and those who believe it won’t help choose not to. This self-selection violates the independence assumption.
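Suppose half of the patients take the drug and improve from a mood score of 5 to 10, and the other half do not take the drug and remain at 8. If the second group's belief is accurate, so that the drug would not have changed their score, the true average treatment effect is:
\[ 0.5 \times (10 - 5) + 0.5 \times (8 - 8) = 2.5 \]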
However, the observed group means are 10 (treated) and 8 (untreated), so the regression estimate \(\tau^* = 2\), which underestimates the effect.
Now suppose the treated patients improve from 5 to 15 instead. Then the true treatment effect is 5, but the observed difference in group means is \(15 - 8 = 7\), which overestimates the effect.
This example illustrates how observational data can lead to biased treatment effect estimates when treatment assignment is confounded with unobserved factors.
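The arithmetic of both scenarios can be reproduced in a few lines. The sketch assumes, as the example implies, that patients who decline the drug would not have been helped by it, so \(Y(1) = Y(0) = 8\) for them. The dictionary layout and function names are hypothetical.

```python
def scenario(y1_takers):
    """Two equally sized groups; only the first chooses to take the drug."""
    return [
        {"share": 0.5, "y0": 5, "y1": y1_takers, "takes_drug": True},
        {"share": 0.5, "y0": 8, "y1": 8, "takes_drug": False},  # assumed: no effect
    ]

def true_ate(groups):
    return sum(g["share"] * (g["y1"] - g["y0"]) for g in groups)

def observed_difference(groups):
    """Mean observed outcome of takers minus mean observed outcome of non-takers."""
    takers = [g for g in groups if g["takes_drug"]]
    others = [g for g in groups if not g["takes_drug"]]
    mean_treated = sum(g["share"] * g["y1"] for g in takers) / sum(g["share"] for g in takers)
    mean_untreated = sum(g["share"] * g["y0"] for g in others) / sum(g["share"] for g in others)
    return mean_treated - mean_untreated

first, second = scenario(10), scenario(15)
print(true_ate(first), observed_difference(first))    # 2.5 2.0
print(true_ate(second), observed_difference(second))  # 5.0 7.0
```

The direction of the bias flips between the two scenarios even though the selection mechanism is the same, so self-selection bias cannot be signed without knowing the unobserved potential outcomes.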
A different way to understand this is through the ordinary least squares (OLS) estimator in the observed regression model. The estimates are:
\[ \hat{\alpha}^* = \bar{Y}_0, \quad \hat{\tau}^* = \bar{Y}_1 - \bar{Y}_0 \]
Here \(\hat{\tau}^*\) is an unbiased estimate of:
\[ E[Y(1) | Z = 1] - E[Y(0) | Z = 0] \]
But what we actually want is:
\[ E[Y(1)] - E[Y(0)] \]
These are only equal when treatment assignment is independent of potential outcomes, as in randomized experiments.
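That OLS on a single binary regressor reproduces the group means can be confirmed numerically. This sketch uses made-up data and NumPy's least-squares solver.

```python
import numpy as np

y = np.array([12.0, 15.0, 14.0, 16.0, 9.0, 8.0, 10.0, 7.0])   # hypothetical outcomes
z = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])

design = np.column_stack([np.ones_like(z), z])     # intercept and treatment dummy
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
alpha_hat, tau_hat = coef

control_mean = y[z == 0].mean()
mean_difference = y[z == 1].mean() - control_mean
print(alpha_hat, control_mean)       # both 8.5 up to rounding
print(tau_hat, mean_difference)      # both 5.75 up to rounding
```

The regression adds nothing beyond the difference in means here. Whether that difference has a causal interpretation depends on how \(Z\) was assigned, not on the fitting method.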
We can extend this analysis to stratified models, where the investigator is interested in estimating treatment effects within strata defined by covariates. For each stratum \(s\), we define:
\[ Y_{si}(z) = \alpha_s + \tau_s z + \varepsilon_{si}(z), \quad E[\varepsilon_{si}(z) | S = s] = 0 \]
In the observed data, this becomes:
\[ Y_{si} = \alpha_s^* + \tau_s^* Z_{si} + v_{si}, \quad E[v_{si} | Z_{si}, S = s] = 0 \]
Again, in completely randomized or block-randomized studies, or in observational studies where treatment assignment is unconfounded within strata, we get:
\[ \alpha_s = \alpha_s^*, \quad \tau_s = \tau_s^* \]
To recover the overall average treatment effect (ATE), we average \(\tau_s\) across the strata using the distribution of \(S\):
\[ \text{ATE} = E_S[\tau_S] = \sum_{s} \Pr(S = s)\, \tau_s \]
In summary, model-based inference using linear regression can identify causal effects if treatment assignment is independent of potential outcomes (as in randomized studies) or if confounding is appropriately controlled (as in stratified observational studies). Otherwise, regression estimates may be biased and mislead conclusions.
In the previous lesson, we reformulated model-based inference in terms of linear regression. Now, in this lesson, we extend that framework by adding covariates—variables like age, gender, or baseline characteristics—which are often collected in both completely randomized experiments and observational studies.
In a completely randomized experiment, we already know that the difference in sample means (i.e., \(\bar{Y}_1 - \bar{Y}_0\)) provides an unbiased estimator of the Average Treatment Effect (ATE), which is \(E[Y(1) - Y(0)]\).
Still, researchers often perform a regression of the outcomes on treatment assignment and covariates, using a model like:
\[ Y_i = \alpha^* + \tau^* Z_i + \beta^{*\top} X_i + v_i \]
This regression helps in two ways:
Ordinary Least Squares (OLS) estimation ensures that the residuals sum to zero and are orthogonal to each regressor:
\[ \sum_i v_i = 0,\quad \sum_i Z_i v_i = 0,\quad \sum_i X_i v_i = 0 \]
From this, we get:
For the control group (\(Z = 0\)):
\[ \bar{Y}_0 = \hat{\alpha}^* + \hat{\beta}^{*\top} \bar{X}_0 \]
For the treatment group (\(Z = 1\)):
\[ \bar{Y}_1 = \hat{\alpha}^* + \hat{\tau}^* + \hat{\beta}^{*\top} \bar{X}_1 \]
So, taking the difference gives:
\[ \bar{Y}_1 - \bar{Y}_0 = \hat{\tau}^* + \hat{\beta}^{*\top}(\bar{X}_1 - \bar{X}_0) \]
This expression shows the difference in outcomes is not just the treatment effect, but also includes the covariate imbalance term.
In completely randomized experiments, treatment assignment is independent of both potential outcomes and covariates. Therefore the covariate means agree in expectation, \(E[\bar{X}_1] = E[\bar{X}_0]\), and the difference simplifies:
\[ E[\hat{\tau}^*] = E[\bar{Y}_1 - \bar{Y}_0] = \text{ATE} \]
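Thus, both \(\hat{\tau}^*\) and the raw mean difference are unbiased estimators of the treatment effect (for the adjusted estimator, strictly speaking, only up to a small finite-sample bias that vanishes as the sample grows). However, because the regression model uses covariates to explain more of the outcome variation, its variance is typically smaller, which makes it statistically more efficient.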
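The decomposition above can be checked numerically. The following is a minimal sketch on simulated data; the data-generating process, the sample size, and the true effect of 2 are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=(n, 2))                        # baseline covariates
z = rng.permutation(np.repeat([0, 1], n // 2))     # completely randomized assignment
# Assumed outcome model with a true treatment effect of 2.
y = 1.0 + 2.0 * z + x @ np.array([1.5, -0.5]) + rng.normal(size=n)

# OLS of y on (1, z, x).
design = np.column_stack([np.ones(n), z, x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
tau_hat, beta_hat = coef[1], coef[2:]

diff_means = y[z == 1].mean() - y[z == 0].mean()
imbalance = x[z == 1].mean(axis=0) - x[z == 0].mean(axis=0)

# The normal equations imply: diff_means = tau_hat + beta_hat' (xbar_1 - xbar_0).
print(diff_means, tau_hat + beta_hat @ imbalance)
print("adjusted estimate:", tau_hat)
```

The two printed numbers on the first line agree up to floating-point error in any sample, because the identity is algebraic rather than statistical. Only the size of the imbalance term depends on how well randomization balanced the covariates.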
Now an important clarification: even though we use this linear regression model, we are not assuming that:
\[ E[Y_i | Z_i, X_i] = \alpha^* + \tau^* Z_i + \beta^{*\top} X_i \]
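In other words, we’re not saying the model is the true conditional expectation function. Instead, we only need the residual to be uncorrelated with the regressors, which holds by construction for the least squares fit:
\[ E[v_i] = 0,\quad E[Z_i v_i] = 0,\quad E[X_i v_i] = 0 \]
This is a weaker assumption than a correctly specified conditional mean, and it is sufficient for the estimator to recover the ATE under randomization.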
In observational studies, researchers also often run this kind of regression. However, the situation is more delicate.
Let’s suppose that:
\[ Y_i(z) = g(z, X_i) + \varepsilon_i(z),\quad \text{with } E[\varepsilon_i(z) | X_i] = 0 \]
This allows for the possibility that the outcome depends nonlinearly or interactively on both treatment and covariates. Then the conditional average treatment effect at covariate value \(X\), and the overall ATE, are:
\[ \text{ATE}(X) = g(1, X) - g(0, X),\quad \text{and } \text{ATE} = E[\text{ATE}(X)] \]
In this case, the linear regression estimator \(\hat{\tau}^*\) approximates:
\[ [E[Y(1) | Z=1] - \hat{\beta}^{*\top} E[X | Z=1]] - [E[Y(0) | Z=0] - \hat{\beta}^{*\top} E[X | Z=0]] \]
Which can differ from the true ATE if:
Let’s take a special case to explore this further.
Suppose:
\[ g(1, X) = g(0, X) + \tau \]
This means the treatment effect is constant across all values of \(X\), which is an “additive treatment effect” model. Then:
\[ \hat{\tau}^* = \tau + \text{Bias} \]
Where the bias is due to:
\[ E[g(0, X) - \hat{\beta}^{*\top} X | Z=1] - E[g(0, X) - \hat{\beta}^{*\top} X | Z=0] \]
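This bias disappears if either \(g(0, X)\) is linear in \(X\), so that \(g(0, X) - \hat{\beta}^{*\top} X\) does not vary with \(X\), or the distribution of \(X\) is the same in the treated and control groups.
However, if neither of these conditions holds, then even in the additive case, linear regression can be biased. And this bias can be large if the covariate distributions differ substantially between the groups and \(g(0, X)\) is far from linear.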
This is especially relevant in observational studies, where covariate imbalance is common and the correct functional form of \(g(z, X)\) is usually unknown.
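A small simulation makes the additive-effect bias concrete. This is a sketch under assumed conditions: a single covariate, \(g(0, X) = X^2\), a constant effect \(\tau = 1\), and a treatment probability that rises with \(X^2\). None of these choices come from a real study.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-(x**2 - 1)))        # treatment is more likely when |x| is large
z = rng.binomial(1, e)
tau = 1.0
y = x**2 + tau * z + rng.normal(size=n)  # g(0, x) = x^2, additive effect tau


def ols_tau(y, z, covariates):
    """Coefficient on z from an OLS fit of y on (1, z, covariates)."""
    design = np.column_stack([np.ones(len(y)), z, covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


tau_linear = ols_tau(y, z, x)                             # linear in x: misspecified
tau_flexible = ols_tau(y, z, np.column_stack([x, x**2]))  # includes x^2: correct form
print(tau_linear, tau_flexible)
```

Here the linear adjustment overstates the effect because treated units have systematically larger \(X^2\), which the linear term in \(X\) cannot absorb. Adding the quadratic term removes the bias in this example only because it happens to match the assumed functional form.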
In completely randomized experiments, we observed that including covariates in a regression model leads to an adjusted estimator. This estimator accounts for differences in the means of covariates between the treatment and control groups. Despite this adjustment, the estimator remains unbiased for the Average Treatment Effect (ATE). In fact, it generally has lower variance than the unadjusted difference-in-means estimator.
Importantly, this unbiasedness does not rely on the regression model being correctly specified. Nor does it require the treatment effect to be constant across different values of the covariates. Heterogeneous treatment effects across covariate values are allowed. Thus, even if the regression model is misspecified, the adjusted estimator remains unbiased in randomized experiments.
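In observational studies, the bias of linear regression adjustment described above remains a concern. To address this bias, two remedies have been proposed: modeling the regression function more flexibly, and matching treated and control units on covariates.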
Both methods—flexible modeling and matching—become difficult in high-dimensional covariate spaces. However, computational advances over the past 30 years have enabled significant progress on both fronts. To understand these developments and their role in practice, it is helpful to start with one foundational concept: the propensity score, as introduced by Rosenbaum and Rubin.
The propensity score is defined as the probability of receiving treatment given covariates, \(e(X) = P(Z = 1 \mid X)\). Under the assumptions of strong ignorability—that treatment assignment is unconfounded given \(X\), and that \(0 < e(X) < 1\)—Rosenbaum and Rubin showed that:
Given the same propensity score, the distribution of covariates is the same in both the treatment and control groups.
Therefore, conditioning on the propensity score is sufficient to remove confounding, just like conditioning on all covariates.
This is a powerful result because even if modeling \(E[Y \mid X, Z]\) directly is difficult due to high-dimensional \(X\), modeling \(E[Y \mid e(X), Z]\) using the one-dimensional scalar \(e(X)\) becomes much more feasible.
Using the estimated propensity score, one can:
In all these methods, the key property is that within groups of similar propensity scores, the covariates are balanced between treatment and control groups. This allows for unbiased estimation of treatment effects, assuming that all relevant confounders are included in the propensity score model.
However, propensity scores are not magical. If important confounders are omitted from the model used to estimate the score, balance will not be achieved, and bias will persist. Therefore, good subject-matter knowledge is essential in selecting covariates.
1. Overview of Propensity Score Uses
In observational studies, propensity scores are used to estimate treatment effects via four main strategies:
These techniques aim to control for confounding when treatment assignment is not randomized.
2. Motivation for Using Propensity Scores
3. Subclassification on Propensity Score
Definitions:
Let \(\bar{Y}_{1s}\) and \(\bar{Y}_{0s}\) be the mean outcomes in stratum \(s\) for treated and control units, respectively.
The ATE is estimated as:
\[ \hat{ATE} = \sum_{s=1}^{S} \frac{n_s}{n} (\bar{Y}_{1s} - \bar{Y}_{0s}) \]
The ATT (Average Treatment Effect on the Treated) is estimated as:
\[ \hat{ATT} = \sum_{s=1}^{S} \frac{n_{1s}}{n_1} (\bar{Y}_{1s} - \bar{Y}_{0s}) \]
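The two estimators can be sketched as follows. The function name `subclassify`, the use of score quantiles to form strata, and the toy data are illustrative assumptions; here \(n_s\) is the number of units in stratum \(s\) and \(n_{1s}\) the number of treated units in it.

```python
import numpy as np


def subclassify(y, z, e_hat, n_strata=5):
    """Subclassification estimates of the ATE and ATT on an estimated score."""
    y = np.asarray(y, float)
    z = np.asarray(z, int)
    e_hat = np.asarray(e_hat, float)
    # Interior quantiles of the estimated score define the strata.
    cuts = np.quantile(e_hat, np.linspace(0, 1, n_strata + 1)[1:-1])
    stratum = np.digitize(e_hat, cuts)            # labels 0 .. n_strata - 1
    n, n1 = len(y), z.sum()
    ate = att = 0.0
    for s in range(n_strata):
        in_s = stratum == s
        treated, control = in_s & (z == 1), in_s & (z == 0)
        if not treated.any() or not control.any():
            raise ValueError(f"stratum {s} lacks treated or control units")
        diff = y[treated].mean() - y[control].mean()
        ate += in_s.sum() / n * diff              # weight n_s / n
        att += treated.sum() / n1 * diff          # weight n_1s / n_1
    return ate, att


# Toy data with two obvious strata (within-stratum effects 2 and 4).
e_hat = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
z = [1, 0, 0, 0, 1, 1, 1, 0]
y = [3, 1, 1, 1, 6, 6, 6, 2]
ate_hat, att_hat = subclassify(y, z, e_hat, n_strata=2)
print(ate_hat, att_hat)
```

The ATT is larger than the ATE in this toy example because most treated units sit in the high-score stratum, where the effect is larger.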
4. Key Considerations in Subclassification
Estimated Propensity Scores are used to form subclasses because the true scores are unknown.
Bias-Variance Tradeoff:
With five subclasses of roughly equal size, subclassification typically removes about 90% of the bias present in an unadjusted comparison.
5. Practical Steps for Subclassification
Estimate Propensity Scores:
Handle Insufficient Overlap:
At extreme ends of the propensity score distribution, there may be too few treated or control observations (known as poor overlap).
Solutions:
Form Strata and Check Balance:
6. Regression Within Subclasses
To further improve precision and account for any residual imbalance, a regression model can be used within each subclass:
\[ Y_i = \alpha^*_s + \tau^*_s Z_i + \beta^{*\top}_s X_i + \nu_i \]
Where:
The overall ATE and ATT are computed as weighted averages of the subclass estimates \(\tau^*_s\), using either:
Variance estimates are similarly aggregated across subclasses.
Weighting Using Propensity Scores (IPTW – Inverse Probability of Treatment Weighting)
Purpose: To create a pseudo-population where the distribution of covariates is the same between treated and control groups.
Weights:
IPTW Estimator (ATE):
\[ \hat{\text{ATE}} = \frac{1}{n} \sum_{i=1}^n \left[ \frac{Z_i Y_i}{e(X_i)} - \frac{(1 - Z_i) Y_i}{1 - e(X_i)} \right] \]
Alternative (normalized) estimator:
\[ \hat{\text{ATE}}' = \frac{ \sum \frac{Z_i Y_i}{e(X_i)} }{ \sum \frac{Z_i}{e(X_i)} } - \frac{ \sum \frac{(1 - Z_i) Y_i}{1 - e(X_i)} }{ \sum \frac{1 - Z_i}{1 - e(X_i)} } \]
Key Insight: Weighting is a more refined version of sub-classification. Sub-classification applies coarse weights using block averages; IPTW uses individual-level weights based on exact scores.
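Both weighting estimators are short to write down. The sketch below assumes the propensity scores are already available; the function names and the four-unit toy data are hypothetical.

```python
import numpy as np


def iptw_ate(y, z, e):
    """Unnormalized IPTW estimate of the ATE."""
    y, z, e = (np.asarray(a, float) for a in (y, z, e))
    return float(np.mean(z * y / e - (1 - z) * y / (1 - e)))


def iptw_ate_normalized(y, z, e):
    """Normalized IPTW estimate: weights are rescaled to sum to one per arm."""
    y, z, e = (np.asarray(a, float) for a in (y, z, e))
    w1, w0 = z / e, (1 - z) / (1 - e)
    return float(np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0))


y = [4, 2, 6, 1]
z = [1, 0, 1, 0]
e = [0.5, 0.5, 0.8, 0.2]
print(iptw_ate(y, z, e), iptw_ate_normalized(y, z, e))
```

One practical reason to prefer the normalized version is that it does not change when a constant is added to every outcome, whereas the unnormalized version generally does when the weights do not sum to \(n\) in each arm.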
Challenges in Practice
ATT Estimation via Weighting
Estimating the Average Treatment Effect on the Treated (ATT):
\[ \hat{\text{ATT}} = \bar{Y}_1 - \frac{ \sum (1 - Z_i) e(X_i) Y_i / (1 - e(X_i)) }{ \sum (1 - Z_i) e(X_i) / (1 - e(X_i)) } \]
Treated units’ outcomes are used directly; control units are reweighted to match the propensity distribution in the treated group.
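A minimal sketch of this ATT estimator, again assuming the propensity scores are given; `ipw_att` and the toy data are hypothetical names and values.

```python
import numpy as np


def ipw_att(y, z, e):
    """ATT by weighting controls with the odds e / (1 - e)."""
    y, z, e = (np.asarray(a, float) for a in (y, z, e))
    w0 = (1 - z) * e / (1 - e)            # zero for treated units
    return float(y[z == 1].mean() - np.sum(w0 * y) / np.sum(w0))


y = [4, 2, 6, 1]
z = [1, 0, 1, 0]
e = [0.5, 0.5, 0.8, 0.2]
print(ipw_att(y, z, e))
```

Controls with a high propensity score look most like treated units, so the odds weight gives them more influence in the comparison mean.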
Double Robustness
In observational studies, unlike randomized controlled trials, individuals are not randomly assigned to treatment or control groups. This creates the potential for confounding bias because individuals who choose treatment may differ systematically from those who do not. To adjust for this and attempt to estimate causal effects, researchers use various methods, one of which is matching.
Matching is a strategy designed to make the treated and control groups more comparable by aligning individuals with similar observed characteristics (covariates). The central idea is to mimic randomization by ensuring that the distribution of covariates is similar between treated and untreated units. This allows for a more valid comparison of outcomes.
One foundational concept related to matching is the propensity score, which is the probability that a unit receives treatment given its observed covariates. This scalar summary of multivariate covariates greatly simplifies the problem: instead of matching on many variables simultaneously, one can match units based on their estimated propensity scores. According to Rosenbaum and Rubin (1983), if treated and control units have the same propensity score, their covariate distributions should be balanced. This is the key justification for matching on the propensity score.
Matching comes in several forms. In exact matching, one attempts to match each treated unit with a control unit that has identical values for all covariates. While theoretically ideal, this is often impractical in real datasets with many covariates. Instead, researchers turn to approximate matching, where similarity is defined via a distance metric. For continuous covariates, Euclidean distance may be used, but this ignores differing variances and covariate correlations. Therefore, Mahalanobis distance is generally preferred, as it standardizes for variance and accounts for correlation between variables. It is defined as the distance between two covariate vectors using the inverse of the pooled sample covariance matrix as a weighting factor.
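The Mahalanobis distance described above can be sketched as follows. The helper names are hypothetical, and the pooling formula used here (degrees-of-freedom weighted average of the two group covariance matrices) is one common convention, assumed for illustration.

```python
import numpy as np


def pooled_cov(x_treated, x_control):
    """Pooled sample covariance of the covariates across the two groups."""
    x_treated = np.asarray(x_treated, float)
    x_control = np.asarray(x_control, float)
    n1, n0 = len(x_treated), len(x_control)
    s1 = np.cov(x_treated, rowvar=False)
    s0 = np.cov(x_control, rowvar=False)
    return ((n1 - 1) * s1 + (n0 - 1) * s0) / (n1 + n0 - 2)


def mahalanobis(x_i, x_j, cov):
    """sqrt((x_i - x_j)' cov^{-1} (x_i - x_j))."""
    d = np.asarray(x_i, float) - np.asarray(x_j, float)
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))


# With unequal variances, a gap of 2 on a high-variance covariate counts for less.
print(mahalanobis([2, 0], [0, 0], np.diag([4.0, 1.0])))   # scaled distance
print(mahalanobis([2, 0], [0, 0], np.eye(2)))             # Euclidean distance
```

With an identity covariance matrix the measure reduces to Euclidean distance, which is why Euclidean distance is only appropriate when covariates are on comparable scales and uncorrelated.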
When using propensity score matching, distance is defined in terms of the absolute difference in estimated propensity scores between treated and control units. Matching can be done either with replacement (where the same control can be matched to multiple treated units) or without replacement (each control is used only once). Matching with replacement generally provides better matches but introduces dependency across matched pairs, complicating variance estimation.
One common implementation is 1:1 nearest neighbor matching, where each treated unit is paired with the nearest available control unit based on the distance metric. More generally, 1:k matching pairs each treated unit with multiple control units to reduce variance.
Once matching is done, the treatment effect for the treated (ATT) is estimated as the average difference in outcomes between treated units and their matched controls. Mathematically, for a treated unit \(i\), if \(Y_i^{(T)}\) is the outcome of the treated unit and \(\bar{Y}_i^{(C)}\) is the average outcome of its matched controls, the ATT estimate is:
\[ \widehat{ATT} = \frac{1}{n_T} \sum_{i=1}^{n_T} \left(Y_i^{(T)} - \bar{Y}_i^{(C)}\right) \]
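A minimal sketch of 1:k nearest-neighbor matching with replacement on a scalar score, followed by the ATT formula above. The function name `match_att` and the five-unit data set are hypothetical; the score is assumed to be an already estimated propensity score.

```python
import numpy as np


def match_att(y, z, score, k=1):
    """ATT from 1:k nearest-neighbor matching (with replacement) on a score."""
    y = np.asarray(y, float)
    z = np.asarray(z, int)
    score = np.asarray(score, float)
    treated = np.flatnonzero(z == 1)
    controls = np.flatnonzero(z == 0)
    effects = []
    for i in treated:
        dist = np.abs(score[controls] - score[i])
        nearest = controls[np.argsort(dist, kind="stable")[:k]]
        effects.append(y[i] - y[nearest].mean())   # Y_i^(T) minus matched-control mean
    return float(np.mean(effects))


score = [0.30, 0.70, 0.25, 0.60, 0.90]
z = [1, 1, 0, 0, 0]
y = [5, 8, 3, 4, 10]
print(match_att(y, z, score, k=1), match_att(y, z, score, k=2))
```

Moving from one to two matches per treated unit pulls in more distant controls, which illustrates the trade-off: more matches reduce variance but can worsen match quality.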
This estimate is approximately unbiased if unconfoundedness holds and matching achieves balance on covariates. Therefore, it is essential to assess covariate balance after matching. One method is to compute the standardized difference for each covariate before and after matching. If the standardized differences are close to zero after matching (typically below 0.1 in absolute value, i.e., 10% of a standard deviation), then the matching is considered successful. Another approach is to compare summary metrics such as the Mahalanobis distance or the distribution of estimated propensity scores.
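The standardized difference for a single covariate can be sketched as below. The pooled standard deviation used here (square root of the average of the two group variances) is one common definition, assumed for illustration; the function name is hypothetical.

```python
import numpy as np


def standardized_difference(x, z):
    """Difference in covariate means divided by a pooled standard deviation."""
    x = np.asarray(x, float)
    z = np.asarray(z, int)
    x1, x0 = x[z == 1], x[z == 0]
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
    return float((x1.mean() - x0.mean()) / pooled_sd)


x = [1.0, 3.0, 0.0, 2.0]
z = [1, 1, 0, 0]
print(standardized_difference(x, z))
```

Computing this quantity once on the full sample and once on the matched sample gives the before-and-after comparison described above.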
In practice, estimating the propensity score itself requires modeling, typically via logistic regression. The model must be specified carefully; mis-specification can reduce balance and increase bias. Unlike IPW, where model mis-specification can directly bias treatment effect estimates, in matching the goal is to obtain balanced samples, not to recover the true propensity function. If initial matching fails to achieve balance, researchers often revise the model (e.g., include higher-order terms or interactions) and repeat the matching.
Matching methods have evolved over time. Beyond simple nearest neighbor matching, newer techniques aim to optimize balance directly. These include:
A notable development in variance estimation for matching-based estimators comes from the work of Abadie and Imbens. They propose variance formulas that account for the fact that control units may be matched to multiple treated units and for the randomness induced by the matching process. In the case of matching with replacement, their variance estimator includes terms for how frequently each control unit is used.
It is also worth noting that while matching is often used to estimate the ATT (the effect of treatment on the treated), it can also be applied to estimate the average treatment effect (ATE) or the effect on the untreated (ATU). The choice depends on the research question and the structure of the available data.
Overview and Motivation
So far, we’ve discussed methods that estimate treatment effects using covariates or the propensity score, but without explicitly modeling the regression function (i.e., the expected outcome given covariates and treatment). This might seem surprising because, under the assumption of unconfoundedness (i.e., treatment assignment is independent of potential outcomes given covariates), modeling the regression function can directly identify the average treatment effect (ATE). Specifically, if we can estimate the expected potential outcomes \(\mathbb{E}[Y(1) | X]\) and \(\mathbb{E}[Y(0) | X]\), then taking the difference gives us the conditional treatment effect, and averaging those over the population gives us ATE.
Regression Imputation and Its Use
In practice, we can:
This approach also allows estimation of:
Importantly, to estimate SATT, we only need to model the control group. That is, for treated individuals, we already observe \(Y(1)\), and only need to impute \(Y(0)\).
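As a minimal sketch of this idea, the code below fits a linear model for the outcome on control units only and uses it to impute \(Y(0)\) for treated units. The linear form, the function name `satt_by_imputation`, and the toy data are assumptions for illustration.

```python
import numpy as np


def satt_by_imputation(y, z, x):
    """SATT: observed Y(1) for treated minus Y(0) imputed from a control-only fit."""
    y = np.asarray(y, float)
    z = np.asarray(z, int)
    x = np.asarray(x, float).reshape(len(y), -1)
    design = np.column_stack([np.ones(len(y)), x])
    # Fit the outcome model using control units only.
    coef, *_ = np.linalg.lstsq(design[z == 0], y[z == 0], rcond=None)
    y0_imputed = design[z == 1] @ coef
    return float(np.mean(y[z == 1] - y0_imputed))


x = [0, 1, 2, 3, 4]
z = [0, 0, 0, 1, 1]
y = [1, 3, 5, 10, 12]       # controls follow y = 1 + 2x exactly
print(satt_by_imputation(y, z, x))
```

In this toy data the treated units lie outside the covariate range of the controls, so the imputation is pure extrapolation. The answer is only as good as the assumed linear form.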
Challenges with Misspecification
A common practice is to model the regression function with linear regression. However, this is risky:
This dual vulnerability is what initially motivated the use of propensity score methods to reduce reliance on potentially misspecified outcome models.
Non-Parametric and Modern Solutions
Recent advances have led to non-parametric approaches for estimating:
Such approaches reduce reliance on parametric assumptions.
Double Robustness
A key breakthrough is the concept of double robustness. This refers to estimators that are consistent if either:
If both are correct, the estimator is efficient (i.e., it attains the semiparametric efficiency bound, the smallest asymptotic variance achievable without further modeling assumptions).
This leads to the Doubly Robust Estimator (DRE) of the treatment effect.
Formulation of the Double Robust Estimator
Let:
The doubly robust estimator for \(\mathbb{E}[Y(1)]\) (the mean potential outcome under treatment) includes two terms:
A weighted average term that resembles inverse probability weighting (IPW):
\[ \mathbb{E}\left[\frac{Z_i Y_i}{e(X_i)}\right] \]
A correction (augmentation) term that uses the regression prediction \(g(X_i)\), the model for \(\mathbb{E}[Y \mid X, Z = 1]\):
\[ -\,\mathbb{E}\left[\frac{Z_i - e(X_i)}{e(X_i)} \cdot g(X_i) \right] \]
If the propensity score is correctly specified, the second term vanishes (because the expected value of \(Z_i - e(X_i)\) given \(X_i\) is zero), and the first term gives a consistent estimate.
If the regression model is correctly specified, the sum of the two terms can be rewritten as \(\mathbb{E}[g(X_i)] + \mathbb{E}[Z_i (Y_i - g(X_i)) / e(X_i)]\). The residual \(Y_i - g(X_i)\) has mean zero among treated units given \(X_i\), so the second piece vanishes and \(\mathbb{E}[g(X_i)]\) gives a consistent estimate.
Thus, the estimator is robust to misspecification in either model.
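The double robustness property can be illustrated with a short simulation. The sketch uses the augmented IPW form \(\frac{1}{n}\sum_i \left[ g(X_i) + Z_i (Y_i - g(X_i)) / e(X_i) \right]\) for \(\mathbb{E}[Y(1)]\). The data-generating process, the deliberately wrong working models (a constant score of 0.5 and a zero regression function), and the function name are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
x = rng.normal(size=n)
e_true = 1 / (1 + np.exp(-x))            # true propensity score
z = rng.binomial(1, e_true)
y = x + 2.0 * z + rng.normal(size=n)     # Y(1) = 2 + x + noise, so E[Y(1)] = 2
g_true = 2.0 + x                         # true E[Y | X, Z = 1]


def aipw_mean_treated(y, z, e, g1):
    """Augmented IPW estimate of E[Y(1)]: mean of g1 + Z (Y - g1) / e."""
    return float(np.mean(g1 + z * (y - g1) / e))


e_wrong = np.full(n, 0.5)                # ignores x entirely
g_wrong = np.zeros(n)                    # ignores x and the treatment effect

ps_right = aipw_mean_treated(y, z, e_true, g_wrong)    # only the score is right
or_right = aipw_mean_treated(y, z, e_wrong, g_true)    # only the regression is right
both_wrong = aipw_mean_treated(y, z, e_wrong, g_wrong)
print(ps_right, or_right, both_wrong)
```

The first two estimates land near the true value of 2, while the third does not, which is the pattern the double robustness argument predicts. The same construction with \(1 - Z\), \(1 - e(X)\), and a model for \(\mathbb{E}[Y \mid X, Z = 0]\) gives \(\mathbb{E}[Y(0)]\).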
Practical Implications and Tools
Summary
The integration of machine learning (ML) into causal inference—particularly for estimating treatment effects—is a relatively new but rapidly growing field. Traditionally, treatment effect estimation relied on parametric models like linear regression, logistic regression, or Cox models. But these come with strong assumptions about the relationship between covariates and outcomes.
ML, in contrast, offers flexible, non-parametric tools (e.g., random forests, boosting, neural networks) that can capture complex, nonlinear relationships without needing to specify them in advance.
1. Early Work: Bayesian Additive Regression Trees (BART)
One of the earliest applications of ML to causal inference is Jennifer Hill’s (2011) work using Bayesian Additive Regression Trees (BART). Her method focuses on estimating the regression function \(g(Z, X) = \mathbb{E}[Y \mid X, Z]\), rather than the propensity score.
Key benefits of BART:
But: this approach does not use the propensity score, so it’s not doubly robust.
2. The Problem of Regularization-Induced Bias
Modern ML methods like LASSO, random forests, and boosting often regularize (i.e., shrink) model parameters to prevent overfitting.
However, when these regularized models are used directly to estimate treatment effects (especially regression-based methods like \(Y_i = \theta Z_i + g(X_i) + \varepsilon_i\)), they can produce biased estimates, particularly if:
This bias can shrink more slowly than the \(1/\sqrt{n}\) sampling error, so naïve plug-in estimates may fail to be \(\sqrt{n}\)-consistent and the usual confidence intervals can be invalid.
3. Double/Debiased Machine Learning (DML)
To fix this, Chernozhukov et al. proposed the Double/Debiased Machine Learning (DML) framework. The key idea is to combine ML models for both the regression function and the propensity score, and then correct for the regularization bias. This results in:
How DML works (simplified):
This technique is particularly powerful in high-dimensional settings or where flexible modeling is essential.
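A rough sketch of one DML estimator, the cross-fitted "partialling-out" estimator for the partially linear model \(Y_i = \theta Z_i + g(X_i) + \varepsilon_i\), is given below. To keep it self-contained, the nuisance functions \(\mathbb{E}[Y \mid X]\) and \(\mathbb{E}[Z \mid X]\) are fit by least squares on cubic polynomial features. That is a stand-in assumption: in practice these would be flexible ML learners. The simulated data and all function names are hypothetical.

```python
import numpy as np


def poly_features(x, degree=3):
    x = np.asarray(x, float).reshape(-1)      # single covariate, for illustration
    return np.column_stack([x**d for d in range(degree + 1)])


def fit_predict(x_train, t_train, x_test):
    """Stand-in nuisance learner: least squares on polynomial features."""
    coef, *_ = np.linalg.lstsq(poly_features(x_train), t_train, rcond=None)
    return poly_features(x_test) @ coef


def dml_plr(y, z, x, n_folds=2, seed=0):
    """Cross-fitted residual-on-residual estimate of theta."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds
    y_res, z_res = np.empty(len(y)), np.empty(len(y))
    for k in range(n_folds):
        test, train = folds == k, folds != k
        # Nuisance models are fit on the other folds, then applied to this fold.
        y_res[test] = y[test] - fit_predict(x[train], y[train], x[test])
        z_res[test] = z[test] - fit_predict(x[train], z[train], x[test])
    return float(z_res @ y_res / (z_res @ z_res))


rng = np.random.default_rng(3)
n = 4000
x = rng.normal(size=n)
z = rng.binomial(1, 1 / (1 + np.exp(-x))).astype(float)
theta = 1.5
y = theta * z + x + x**2 + rng.normal(size=n)

naive = y[z == 1].mean() - y[z == 0].mean()   # ignores confounding by x
theta_hat = dml_plr(y, z, x)
print(naive, theta_hat)
```

The two ingredients that matter are the residualization of both outcome and treatment (which makes the estimate insensitive to small errors in either nuisance fit) and the sample splitting (which keeps overfitting in the nuisance fits from leaking into the final regression).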
4. Targeted Maximum Likelihood Estimation (TMLE)
Another ML-based causal inference approach is Targeted Maximum Likelihood Estimation (TMLE).
TMLE works in two steps:
TMLE:
Summary: What to Take Away
If you’re entering causal inference using machine learning, it’s critical to:
1. What is the Unconfoundedness Assumption?
The unconfoundedness assumption (also known as ignorability or selection on observables) states that treatment assignment is independent of the potential outcomes given observed covariates:
\[ (Y(0), Y(1)) \perp Z \mid X \]
This means that after controlling for covariates \(X\), treatment assignment \(Z\) behaves like random assignment. It’s essential for making causal inference in observational studies, allowing us to estimate causal effects without randomized experiments.
Key point: This assumption is untestable in practice, because for each unit, we only observe one of the two potential outcomes, \(Y(0)\) or \(Y(1)\), but never both.
2. Assessing Unconfoundedness (Two Main Approaches)
Even though we cannot test unconfoundedness directly, proxy assessments have been proposed to evaluate its plausibility.
Approach 1: Use Control Outcomes (Placebo Outcomes)
Challenges:
Bottom line: Simple and intuitive, but selecting a truly valid control outcome requires careful judgment and domain knowledge.
Approach 2: Use Multiple Control Groups
Example:
Cautions:
Bottom line: Useful for checking balance across groups, but effectiveness depends on careful selection of diverse control groups.
3. Sensitivity Analysis: What if Unconfoundedness Fails?
Instead of testing whether unconfoundedness holds (which we can’t), sensitivity analysis asks:
How would our conclusions change if the assumption is violated?
Rosenbaum’s Sensitivity Analysis (Randomization Inference)
Applies to matched or paired studies. The idea is to simulate departures from random assignment.
Setup:
Procedure:
Interpretation:
Extensions:
4. Model-Based Sensitivity Analysis Using Bias Formulas
We can also quantify bias due to unobserved confounding.
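Suppose there’s an unmeasured binary confounder \(U\) whose effect on the outcome is the same in both treatment groups. The bias in the estimated treatment effect is then:
\[ \text{Bias} = \left\{ \mathbb{E}[Y \mid X, Z=z, U=1] - \mathbb{E}[Y \mid X, Z=z, U=0] \right\} \times \left[ P(U=1 \mid X, Z=1) - P(U=1 \mid X, Z=0) \right] \]
This shows that bias depends on the effect of \(U\) on the outcome and on the difference in the prevalence of \(U\) between the treatment groups.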
Sensitivity analysis methods like those by VanderWeele and Arah use such formulas to simulate how much unmeasured confounding would be required to explain away the estimated effect.
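As a worked illustration, the simplest version of such a bias formula (a binary \(U\) whose effect on the outcome, called `gamma` here, is the same in both treatment groups) is a product of two numbers. The observed effect of 1.0 and the prevalences 0.6 and 0.4 below are made-up values.

```python
def confounding_bias(gamma, p_u_treated, p_u_control):
    """Bias = gamma * (P(U=1 | Z=1) - P(U=1 | Z=0)) for a binary U, no U-Z interaction."""
    return gamma * (p_u_treated - p_u_control)


observed_effect = 1.0   # hypothetical estimate adjusted for X only
for gamma in (0.5, 1.0, 2.0, 5.0):
    bias = confounding_bias(gamma, p_u_treated=0.6, p_u_control=0.4)
    print(f"gamma={gamma}: bias={bias:.2f}, corrected effect={observed_effect - bias:.2f}")
```

With a 20-point difference in prevalence, the confounder would need to shift the outcome by 5 units to explain away an observed effect of 1. Whether a confounder that strong is plausible is a subject-matter judgment.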
5. Worst-Case Scenario: Bounds
Drawback:
Summary and Implications
Unconfoundedness is central to modern causal inference in observational studies.
It is not testable, but there are tools to assess its plausibility or quantify sensitivity.
Two assessment tools:
Two sensitivity strategies:
Worst-case bounds provide identification without unconfoundedness but are often too vague.
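To see why worst-case bounds tend to be wide, here is a minimal sketch of no-assumption bounds on the ATE for an outcome known to lie in \([y_{\min}, y_{\max}]\), in the spirit of Manski (1990). Each unobserved potential outcome is replaced by the smallest or largest possible value. The function name and the toy data are hypothetical.

```python
import numpy as np


def worst_case_bounds(y, z, y_min, y_max):
    """No-assumption bounds on the ATE for an outcome bounded in [y_min, y_max]."""
    y = np.asarray(y, float)
    z = np.asarray(z, int)
    p = z.mean()                                   # share of treated units
    m1, m0 = y[z == 1].mean(), y[z == 0].mean()
    # E[Y(1)] is observed for treated units and unknown for controls.
    mu1_lo, mu1_hi = p * m1 + (1 - p) * y_min, p * m1 + (1 - p) * y_max
    # E[Y(0)] is observed for controls and unknown for treated units.
    mu0_lo, mu0_hi = (1 - p) * m0 + p * y_min, (1 - p) * m0 + p * y_max
    return mu1_lo - mu0_hi, mu1_hi - mu0_lo


y = [1, 1, 0, 1]
z = [1, 1, 0, 0]
lower, upper = worst_case_bounds(y, z, y_min=0.0, y_max=1.0)
print(lower, upper)
```

The interval always has width \(y_{\max} - y_{\min}\), so for a binary outcome it necessarily contains zero. That is the sense in which these bounds are often too wide to be informative on their own.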
Holland, P. (1986), “Statistics and Causal Inference,” (with discussion), Journal of the American Statistical Association, 81, 945-970.
Rosenbaum, P. R. (2002), Observational Studies, New York: Springer-Verlag.
Rosenbaum, P. R. (2010), Design of Observational Studies, New York: Springer-Verlag.
Rosenbaum, P. R. (2017), Observation and Experiment, Cambridge, MA: Harvard University Press.
Imbens, G. W., and Rubin, D. B. (2015), Causal Inference for Statistics, Social and Biomedical Sciences: An Introduction, New York: Cambridge University Press.
Imbens, G. W. (2004), “Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review,” Review of Economics and Statistics, 86, 1-29.
McCaffrey, D.F., Ridgeway, G., and Morral, A.R. (2004), “Propensity Score Estimation with Boosted Regression for Evaluating Causal Effects in Observational Studies,” Psychological Methods, 9, 403-425.
Rosenbaum, P.R., and Rubin, D.B. (1983), “The Central Role of the Propensity Score in Observational Studies for Causal Effects,” Biometrika 70, 41-55.
Cochran, W.G., and Rubin, D.B. (1973), “Controlling Bias in Observational Studies: A Review,” Sankhya, 35, 417-446.
Abadie, A., Drukker, D., Herr, J.L., and Imbens, G.W. (2004), “Implementing Matching Estimators for Average Treatment Effects in Stata,” The Stata Journal, 4, 290-311.
Rosenbaum, P.R. (2012), “Optimal Matching of an Optimally Chosen Subset in Observational Studies,” Journal of Computational and Graphical Statistics, 21, 57-71.
Sekhon, J. S. (2011), “Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching Package for R,” Journal of Statistical Software, 42(7), 1-52.
Stuart, E. A. (2010), “Matching Methods for Causal Inference: A Review and a Look Forward,” Statistical Science, 25, 1-21.
Zubizarreta, J.R., Paredes, R.D., and Rosenbaum, P.R. (2014), “Matching for Balance, Pairing for Heterogeneity in an Observational Study of the Effectiveness of For-Profit and Not-For-Profit High Schools in Chile,” The Annals of Applied Statistics, 8, 204-231.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018), “Double/Debiased Machine Learning for Treatment and Structural Parameters,” Econometrics Journal, 21, C1-C68.
Chipman, H. A., George, E. I., and McCulloch, R. E. (2010), “BART: Bayesian Additive Regression Trees,” The Annals of Applied Statistics, 4, 266-298.
Glynn, A.N., and Quinn, K.M. (2010), “An Introduction to the Augmented Inverse Propensity Weighted Estimator,” Political Analysis, 18, 36-56.
Hill, J.L. (2011), “Bayesian Nonparametric Modeling for Causal Inference,” Journal of Computational and Graphical Statistics, 20, 217-240.
Kang, J.D.Y., and Schafer, J.L. (2007), “Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data,” Statistical Science, 22, 523-539.
Lee, B.K., Lessler, J., and Stuart, E.A. (2010), “Improving Propensity Score Weighting Using Machine Learning,” Statistics in Medicine, 29, 337-346.
Liu, W., Kuramoto, S.J., and Stuart, E.A. (2013), “An Introduction to Sensitivity Analysis for Unobserved Confounding in Non-Experimental Prevention Research,” Prevention Science, 14, 570-580.
Lopez, M.J., and Gutman, R. (2017), “Estimation of Causal Effects with Multiple Treatments: A Review and New Ideas,” Statistical Science, 32, 432-454.
Manski, C.F. (1990), “Nonparametric Bounds on Treatment Effects,” American Economic Association Papers and Proceedings, 80, 319-323.
Richardson, A., Hudgens, M.G., Gilbert, P., and Fine, J.P. (2014), “Nonparametric Bounds and Sensitivity Analysis of Treatment Effects,” Statistical Science, 29, 596-618.
Robins J.M. (1989). “The Analysis of Randomized and Non-Randomized AIDS Treatment Trials Using a New Approach to Causal Inference in Longitudinal Studies.” Pp. 113-159 in L. Sechrest, H. Freeman, and A. Mulley (Eds.), Health Service Research Methodology: A Focus on AIDS. Washington, D.C.: U.S. Public Health Service, National Center for Health Services Research.
Scharfstein, D.O., Rotnitzky, A., and Robins, J.M. (1999), “Adjusting for Non-ignorable Drop-Out Using Semiparametric Non-Response Models,” (with discussion), Journal of the American Statistical Association, 94, 1096-1146.
van der Laan, M.J., and Rose, S. (2011), Targeted Learning: Causal Inference for Observational and Experimental Data, New York: Springer.