Essentials of Causal Inference using Observational Data

Notes

Deck for the original full 2-hour course: link
This one-hour version focuses on concepts and keeps technical details to a minimum (as best I can)
If you have problems with the math display, try Firefox or Chrome

Outline

Outline: Key Concepts

Overview
- Simpson's Paradox: Correlation vs. Causation
- Limitation: “Tobacco gene” and Fisher's comment
- Causality vs. predictive modelling
Potential outcome framework and missing data in data analysis
Ignorability of treatment assignment

Outline: Technical Methodologies

Regression modelling
Matching: the art of re-weighting samples
Propensity function and dimension reduction
Combining regression modelling with reweighting: Doubly robust estimate

The priority is to cover the key concepts. We will get to the technical methodologies if time permits, and will likely survey them rather than go deep

Overview

Introduction: Simpson's Paradox

Berkeley gender bias case

	Applicants	Admitted
Men	8442	44%
Women	4321	35%

Men more likely to be admitted than women?

Deep dive:

Department	Men		Women
Department	Applicants	Admitted	Applicants	Admitted
A	825	62%	108	82%
B	560	63%	25	68%
C	325	37%	593	34%
D	417	33%	375	35%
E	191	28%	393	24%
F	272	6%	341	7%

The departments are sorted by acceptance rate. Most women applied to departments with acceptance rates below 40%, while most men applied to departments with higher acceptance rates.
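To make the reversal concrete, here is a small base R sketch that recomputes the pooled acceptance rates from the six departments in the table above. The data frame and its column names are made up for this illustration; the numbers are copied from the table.

```r
# Department-level data copied from the table above.
admissions <- data.frame(
  dept       = c("A", "B", "C", "D", "E", "F"),
  men_n      = c(825, 560, 325, 417, 191, 272),
  men_rate   = c(0.62, 0.63, 0.37, 0.33, 0.28, 0.06),
  women_n    = c(108, 25, 593, 375, 393, 341),
  women_rate = c(0.82, 0.68, 0.34, 0.35, 0.24, 0.07)
)

# Pooled acceptance rate = applicant-weighted average of the department rates.
men_pooled   <- with(admissions, weighted.mean(men_rate, men_n))
women_pooled <- with(admissions, weighted.mean(women_rate, women_n))

# Number of departments in which women have the higher acceptance rate.
women_ahead <- sum(admissions$women_rate > admissions$men_rate)

# Share of each gender applying to the two high-acceptance departments (A, B).
in_AB <- admissions$dept %in% c("A", "B")
men_share_AB   <- sum(admissions$men_n[in_AB]) / sum(admissions$men_n)
women_share_AB <- sum(admissions$women_n[in_AB]) / sum(admissions$women_n)

print(round(c(men = men_pooled, women = women_pooled), 3))
print(women_ahead)
print(round(c(men = men_share_AB, women = women_share_AB), 3))
```

Pooled over these six departments, men are accepted at about 46% and women at about 30%, yet women have the higher rate in 4 of the 6 departments. The difference in where each group applies (about 53% of men versus about 7% of women to A and B) drives the pooled gap.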

Why paradox?

This seems like a paradox because of the following deduction:

1. Men are more likely to be accepted
2. The above is causal
3. Since it is causal, men are more likely to be accepted in all departments
4. Contradiction

Step 2 is wrong, and step 3 is also subtle

Causation and Correlation

Causation (causality): the relationship between an event (the cause) and a second event (the effect), where the second event is understood as a consequence of the first.
Correlation: a statistical relationship between two random variables or two sets of data.
The Berkeley case shows that being male correlates with a higher acceptance rate
Key difference: causation implies the result of an intervention (Pearl). It has much greater generalization power than correlation, i.e. it applies to all conditions (departments), etc.

Correlation does not imply causation

causation vs correlation

Tobacco and Lung cancer

Connection between tobacco and lung cancer has long been observed
Causal link finally established in 1998:

Although the link between cigarette smoke and cancer has been understood for decades, the so-called “smoking gun” came to light in 1998. Researchers found out how benzopyrene affects a gene known as “p53”. The gene acts as a screening agent to make sure that healthy cells divide while unhealthy ones die off. In the cases of cigarette smokers, benzopyrene “shuts off” the p53 mechanism, allowing unhealthy and malignant cells to grow and spread throughout the body.(Source: http://www.lung-cancer.com/tobacco.html)

Fisher's Comment

Wikipedia: Fisher was opposed to the conclusions of Richard Doll and A.B. Hill(1950, British Medical Journal) that smoking caused lung cancer. He compared the correlations in their papers to a correlation between the import of apples and the rise of divorce in order to show that correlation does not imply causation.

A compelling argument: what if there exists a gene that

Directly causes a person's addiction to tobacco
Directly causes a higher lung cancer risk

In this case the real cause is the “tobacco gene”. There is no causal link between usage of tobacco and lung cancer. All the correlation between the two can be explained away by the “tobacco gene”
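Fisher's argument can be played out in a toy simulation. The sketch below invents a world in which a hypothetical gene drives both smoking and cancer while smoking itself does nothing. All probabilities are made-up numbers chosen only for illustration.

```r
set.seed(1)
n <- 100000

# Hypothetical "tobacco gene": carried by 30% of people (made-up number).
gene <- rbinom(n, 1, 0.3)

# The gene raises the chance of smoking and, separately, the cancer risk.
# Smoking itself has NO effect on cancer in this simulated world.
smoke  <- rbinom(n, 1, ifelse(gene == 1, 0.8, 0.2))
cancer <- rbinom(n, 1, ifelse(gene == 1, 0.20, 0.02))

# Helper: cancer rate of smokers minus cancer rate of non-smokers,
# restricted to the people selected by the logical vector `keep`.
rate_diff <- function(keep) {
  mean(cancer[keep & smoke == 1]) - mean(cancer[keep & smoke == 0])
}

naive_diff        <- rate_diff(rep(TRUE, n))  # gene ignored
diff_carriers     <- rate_diff(gene == 1)     # carriers only
diff_non_carriers <- rate_diff(gene == 0)     # non-carriers only

print(round(c(naive = naive_diff, carriers = diff_carriers,
              non_carriers = diff_non_carriers), 3))
```

Ignoring the gene, smokers show a clearly higher cancer rate. Once we compare smokers and non-smokers with the same gene status, the difference is close to zero. The catch is that this comparison requires observing the gene.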

Fisher's Comment Cont'd

Fisher's case is hypothetical, but hard to dispute (how to dispute?)

The hypothetical gene is an extreme case of a so-called unobserved confounder
In the Berkeley admission case, we've seen the department that students applied to is also a confounder. It was observed in the second table.

Final Note Before We Jump in

“Are you trying to tell us that it's so easy to make mistakes when drawing any conclusion?”

Yes and No! The takeaway message is:

Causal inference is hard
Correlation does not mean causation

Don't get me wrong.

Causality is not necessary for many applications. Correlation is very useful in predictive modeling and in many cases it is all we need. In fact, there is a whole area of statistics called multivariate analysis that studies correlation between many factors. Its modern name? (Statistical) Machine Learning!

But Causality Sends a Much Stronger Message

Causality implies the effect of intervention. Crucial for economics, social science, medical study, etc.

Democratic president better for the economy?

http://finance.yahoo.com/news/The-Market-And-Presidential-investopedia-2715775352.html

More Gun Control Brings Lower Crime Rate?

http://en.wikipedia.org/wiki/Legalized_abortion_and_crime_effect

And much more… Fun books to read:

Freakonomics (Levitt and Dubner)
The Undercover Economist (Tim Harford)

Potential Outcome Framework

Potential Outcome Framework (Donald Rubin)

Suppose we are interested in the causal effect of a treatment \( T \) on the measurement \( Y \)
The potential outcomes are a pair of random variables \( (Y^{(0)},Y^{(1)}) \).
- \( Y^{(0)} \) is the value of the measurement if the treatment was not applied
- \( Y^{(1)} \) is the value of the measurement if the treatment was applied
In our applications \( Y \) is some user-level metric. Only one of \( (Y^{(0)},Y^{(1)}) \) is observed. The unobserved one is called the counterfactual

\( \tau = Y^{(1)}-Y^{(0)} \) is the (random) treatment effect
For each user \( i \), the potential outcome pair \( (Y^{(1)}_i, Y^{(0)}_i) \) is drawn from the distribution of \( (Y^{(1)}, Y^{(0)}) \)
Average Treatment Effect on \( N \) samples \[ \mu = \frac{\sum_{i=1}^N (Y^{(1)}_i-Y^{(0)}_i)}{N} \]
The above formula is not useful in practice: \( Y^{(1)}_i-Y^{(0)}_i \) cannot be computed because we only observe one of the potential outcomes
An alternative perspective: we have missing data. For each data point, the counterfactual is missing! (Very useful, more after 2 slides)

In most cases, we are interested in the treatment effect on the larger population and we only have some samples from this population.
Average Treatment Effect (ATE) \[ \mu = E(\tau) = E(Y^{(1)}-Y^{(0)}), \] where the expectation is taken over the distribution of \( (Y^{(1)}, Y^{(0)}) \), which represents the whole population
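The missing-data view can be seen in a tiny simulated table. This is only a sketch: the outcome distributions are arbitrary assumptions, and `users`, `y0_obs` and `y1_obs` are names invented for the illustration.

```r
set.seed(2)
n <- 8

y0 <- round(rnorm(n, mean = 10), 1)       # outcome without treatment
y1 <- y0 + round(rnorm(n, mean = 1), 1)   # outcome with treatment
ta <- rbinom(n, 1, 0.5)                   # treatment assignment

users <- data.frame(
  ta     = ta,
  y0_obs = ifelse(ta == 0, y0, NA),  # counterfactual (NA) for treated users
  y1_obs = ifelse(ta == 1, y1, NA)   # counterfactual (NA) for untreated users
)
print(users)

# The sample ATE needs both potential outcomes for every user, so only an
# oracle holding y0 and y1 can evaluate it directly.
oracle_ate <- mean(y1 - y0)
print(oracle_ate)

# In the observed data, no row has both outcomes.
n_complete <- sum(complete.cases(users[, c("y0_obs", "y1_obs")]))
print(n_complete)
```

Every row has exactly one missing entry, so the per-user difference \( Y^{(1)}_i-Y^{(0)}_i \) is never available from the data alone.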

Treated and Untreated

In an observational study, the treated population is the subset of the whole population for which \( Y^{(1)} \) is observed.
Similarly, the untreated population is the subset for which \( Y^{(0)} \) is observed.
For those two populations, the distribution of \( (Y^{(1)}, Y^{(0)}) \) might be different.
Treatment Effect on Treated (TT): \[ \mu_{T} = E_{T}(\tau) = E_{T}(Y^{(1)}-Y^{(0)}) \]
Treatment Effect on Untreated (TUT): \[ \mu_{UT} = E_{UT}(\tau) = E_{UT}(Y^{(1)}-Y^{(0)}) \]

Missing Data in Data Analysis

Given \( N \) measurements \( Y_i, i=1,\dots,N \), the average is \( \sum Y_i /N \)
Suppose some of those \( N \) observations are missing
Problem: what should replace \( \sum Y_i /N \)?

Missing Completely at Random (MCAR)

Missing completely at random requires the missing data to be the outcome of a purely random coin-tossing process, i.e., each sample has the same fixed probability of being missing.

Imagine a mischievous oracle who has the complete set of observations. She tosses a coin for each one to decide whether to delete it, and hands only the remaining ones to us

If observations are MCAR, and \( Y_i, i = 1, \dots, n, n\le N \) is the set of observed values, then \[ E(\sum_{i=1}^n Y_i /n)=\sum_{i=1}^N Y_i /N \]

In other words, the non-missing data represent the same distribution as the whole population
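The identity above can be checked numerically. In this sketch the oracle's complete data are fixed, and only the coin tosses are repeated. The exponential distribution and the 60% keep probability are arbitrary assumptions.

```r
set.seed(3)
N <- 1000
y <- rexp(N, rate = 1 / 5)   # the oracle's complete data, fixed from here on
full_mean <- mean(y)

# The oracle keeps each observation with the same probability 0.6,
# and we average what is left. Repeat the deletion many times.
observed_means <- replicate(2000, {
  keep <- rbinom(N, 1, 0.6) == 1
  mean(y[keep])
})

print(c(full = full_mean, average_of_observed_means = mean(observed_means)))
```

Any single observed mean differs from the full mean, but averaged over many rounds of deletion the two agree closely, which is what unbiasedness means here.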

Controlled Experiment and MCAR

Average Treatment Effect on \( N \) samples \[ \mu = \frac{\sum_{i=1}^N (Y^{(1)}_i-Y^{(0)}_i)}{N} \]
The problem reduces to estimating \( \sum_{i=1}^N Y^{(1)}_i/N \) and \( \sum_{i=1}^N Y^{(0)}_i/N \)
If the treated (resp. untreated) samples are randomly selected from the \( N \) samples, then the missing values of \( Y^{(1)} \) (resp. \( Y^{(0)} \)) are MCAR
\( \sum Y^T_i/N_T - \sum Y^C_i/N_C \), the difference between the observed treatment and control averages, is an unbiased estimator of the ATE
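A quick simulation of a randomized experiment illustrates the estimator. The outcome model below is an assumption made only for the sketch. Because the data are simulated, we can compare the estimate with the sample ATE, which is unknown in real data.

```r
set.seed(4)
N  <- 20000
y0 <- rnorm(N, mean = 10, sd = 2)
y1 <- y0 + rnorm(N, mean = 1, sd = 1)   # effect varies by user, averages near 1
sample_ate <- mean(y1 - y0)             # known only because we simulated it

ta <- rbinom(N, 1, 0.5)                 # coin-flip assignment
y  <- ifelse(ta == 1, y1, y0)           # what we actually observe

# Difference between the observed treatment and control averages.
estimate <- mean(y[ta == 1]) - mean(y[ta == 0])

print(round(c(sample_ate = sample_ate, estimate = estimate), 3))
```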

Conditional Controlled Experiment

Instead of using the same traffic split for everyone, let us consider the following design:

Browser	Treatment	Control
IE	60%	40%
FireFox	10%	90%
Chrome	90%	10%
Safari	50%	50%

For each browser, we can estimate the treatment effect as in an ordinary controlled experiment. Then we can summarize the treatment effect over all browsers using a reference browser distribution.

(Remark: this example is essentially the Berkeley admission example in Simpson's paradox)

Conditional Controlled Experiment

In general, we can split traffic by a function of \( X \) (e.g. browser, or multiple attributes): for each value of \( X \), a sample is assigned to treatment with probability \( e(X) \).

For each value of \( X \), we can estimate treatment effect \( \tau(X) \) as usual.

In the end we summarize them by \[ \tau = \sum \tau(X_i)P(X = X_i), \] where the sum runs over the possible values \( X_i \) of \( X \), or in the continuous case \[ \tau = E(\tau(X)), \] where the expectation is taken over the distribution of \( X \)
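The per-browser procedure can be sketched as follows. The split probabilities come from the browser table above. The equal browser mix, the baseline metric per browser and the constant effect of 0.5 are assumptions invented for this illustration.

```r
set.seed(5)
n <- 200000
browsers <- c("IE", "FireFox", "Chrome", "Safari")
e_x    <- c(IE = 0.6, FireFox = 0.1, Chrome = 0.9, Safari = 0.5)  # from the table
base_y <- c(IE = 1, FireFox = 2, Chrome = 5, Safari = 3)          # assumed baselines

x  <- sample(browsers, n, replace = TRUE)
ta <- rbinom(n, 1, unname(e_x[x]))
y  <- unname(base_y[x]) + 0.5 * ta + rnorm(n)   # assumed true effect: 0.5

# Naive comparison that ignores the browser.
naive <- mean(y[ta == 1]) - mean(y[ta == 0])

# tau(X): ordinary treatment-minus-control difference within each browser.
tau_x <- sapply(browsers, function(b) {
  mean(y[x == b & ta == 1]) - mean(y[x == b & ta == 0])
})
# P(X = x): here the reference distribution is the observed browser mix.
p_x <- sapply(browsers, function(b) mean(x == b))

tau <- sum(tau_x * p_x)
print(round(c(naive = naive, stratified = tau), 3))
```

The naive difference is far from 0.5 because treatment traffic is dominated by the browser with the highest baseline. The stratified summary recovers a value close to the assumed effect.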

Conditional Controlled Experiment and Matching

In other words: if the treatment vs. control split depends on some variable \( X \), we should match on \( X \), estimate the treatment effect at each level and then combine

Matching translates observational data analysis into experimental data analysis

Missing At Random (MAR)

Suppose for each measurement \( Y_i \), we also measure \( X_i \). Unlike \( Y_i \), \( X_i \) is observed for every sample.

Missing at random assumes that the probability of \( Y \) being missing depends solely on the value of \( X \), i.e., the oracle has observed all \( (X_i, Y_i) \); for each value of \( X_i \) she has a corresponding coin, and she tosses that coin to decide whether to erase \( Y_i \).

Many methods have been developed for the MAR scenario to estimate \( \sum_{i=1}^N Y_i /N \) (or in general \( E(Y) \)).
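One of the simplest MAR methods is to stratify on \( X \). The sketch below uses a binary \( X \) and made-up keep probabilities. It is one illustration of the idea, not a survey of the methods.

```r
set.seed(6)
N <- 50000
x <- rbinom(N, 1, 0.5)                        # fully observed covariate
y <- rnorm(N, mean = ifelse(x == 1, 10, 0))   # y depends strongly on x

# The oracle keeps y with probability 0.9 when x == 1 and 0.2 when x == 0.
p_keep   <- ifelse(x == 1, 0.9, 0.2)
observed <- rbinom(N, 1, p_keep) == 1

naive <- mean(y[observed])   # over-represents x == 1

# Average the observed y within each level of x, then recombine using the
# distribution of x, which is known because x is never missing.
levels_x  <- sort(unique(x))
mean_by_x <- sapply(levels_x, function(v) mean(y[observed & x == v]))
p_x       <- sapply(levels_x, function(v) mean(x == v))
adjusted  <- sum(mean_by_x * p_x)

print(round(c(full = mean(y), naive = naive, adjusted = adjusted), 3))
```

The structure is the same as the conditional controlled experiment: estimate within each level of \( X \), then combine with the distribution of \( X \).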

How is this related to causal inference? (If you feel it is obvious now, then I have (mostly) finished my job for this talk, the rest is just technicality, and I can't do the rest without more technical notions either)

Ignorability

Ignorability of Treatment Assignment

Potential outcomes, treatment assignment and covariates: \( (Y^{(1)},Y^{(0)}, TA, X) \)

Ignorability: \( (Y^{(1)},Y^{(0)}) \perp \! \! \! \perp TA | X \)

Ignorability (missing data perspective): \( Y^{(1)} \) and \( Y^{(0)} \) are missing at random conditional on the covariate \( X \)

Ignorability (controlled experiment perspective): for each value of \( X \), there exists an equivalent conditional controlled experiment

Central Role of Ignorability

Finding an \( X \) that satisfies ignorability is central to causal inference with observational data.

It relates causality to conditional randomized experiment (conditioned on \( X \)), for which causal inference is easy. All techniques are based on this key observation.

Does an \( X \) that satisfies ignorability always exist?

Trivial case: Let \( X = TA \), then \( (Y^{(1)},Y^{(0)}) \perp \! \! \! \perp TA | TA \) trivially (conditioned on \( TA \), \( TA \) is not random, so it is independent of any random variable). The choice of \( X \) above is useless. Why? (Hint: what can we get from a controlled experiment with a 100%/0% design?)

Limitations

Finding a useful \( X \) that provides ignorability is a difficult problem. It requires domain knowledge: we need to identify the key confounders, which is very difficult.

When too many variables are added to \( X \), the corresponding conditional experiment given \( X \) might be close to a 100%/0% design, i.e. hard to match. In such cases \( \tau(X) \) cannot be reliably estimated and thus the approach might be flawed.

An equally important problem is how to test the ignorability condition.

In general there is no direct test. There are indirect tests. (Hint: AA test)

Randomized Experiments

If \( TA \) is a randomized assignment, then \( (Y^{(1)},Y^{(0)}) \perp \! \! \! \perp TA \)

A randomized experiment provides unconditional ignorability

On the other hand, conditional ignorability implies equivalent conditional randomized experiments.

Technical Methodologies

Regression Modeling

Regression is a popular tool used to “adjust for confounders”.

Linear Model:

\( Y = \theta_0 + \mu TA + \theta \times X + \epsilon \)

This is an explicit model for \( (Y^{(1)},Y^{(0)}) | (TA, X) \). In particular, \( Y^{(1)} - Y^{(0)} = \mu \), and \( Y^{(0)} = \theta_0+ \theta \times X + \epsilon \) where \( \epsilon \) is noise around 0, uncorrelated with \( X \) and \( TA \).

Theorem: Under the assumption of the linear model, \( X \) provides ignorability. Moreover, \( \mu \) is the treatment effect and can be estimated without bias by least squares estimation.
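When the linear model really holds, `lm` recovers \( \mu \) even though treatment assignment depends on \( X \). The coefficients below (\( \theta_0 = 2 \), \( \mu = 1 \), \( \theta = 3 \)) and the assignment rule are assumptions made for this sketch.

```r
set.seed(7)
n  <- 5000
x  <- rnorm(n)
ta <- rbinom(n, 1, plogis(1.5 * x))   # assignment depends on x (confounding)
y  <- 2 + 1 * ta + 3 * x + rnorm(n)   # linear model with mu = 1, theta = 3

# Naive difference in means, ignoring x.
naive <- mean(y[ta == 1]) - mean(y[ta == 0])

# Least squares fit of the linear model; the coefficient on ta estimates mu.
fit    <- lm(y ~ ta + x)
mu_hat <- unname(coef(fit)["ta"])

print(round(c(naive = naive, mu_hat = mu_hat), 3))
```

The naive difference absorbs the effect of \( X \) and lands far from 1, while the regression coefficient is close to it. This only works because the simulated data follow the assumed linear model exactly.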

Assumptions of Regression Modelling

Linear regression is widely used for causal inference but its assumptions are too strong for most applications.
It assumes a linear relationship
It requires the treatment effect to be the same for all individuals.
In most cases the treatment effect does vary from one individual to another, and all we care about is the average treatment effect (ATE) or the treatment effect on the treated (TT).
More sophisticated modeling can allow some sort of heterogeneous treatment effect (add interaction terms between \( X \) and \( TA \), etc.). But the key issue remains: regression modelling puts strong assumptions on the model, and causal inference can be sensitive to the validity of the model.

Data non-overlap and Model Overfit

We can add more and more terms to the linear model to control more covariates. Keep adding until we get treatment effect close to 0? Can we claim ignorability and the linear model is good to go?

Adding too many control terms to a linear model can backfire.

When there are too many control terms, then conditioned on any fixed value of those covariates there might be only treated or only untreated samples in our data. Just as a 100%/0% design does not help, this issue makes the linear model unstable.
Multicollinearity and matrix degeneracy.
Model over-fitting.

Project a metric from one population to another

A common problem:

You have observed some metric \( Y \) from Population B.
You want to project this metric to Population A, for which \( Y \) is not observed.
You do know A and B are different, but you believe their differences are all due to differences in the distribution of some covariate \( X \), i.e., given the same \( X \) value, the subpopulations in A and in B are indistinguishable.
In statistics, the likelihood ratio is used to shift one distribution to another (mathematicians call it the Radon-Nikodym derivative)

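set.seed(100)
a <- rpois(10000,2) # X in population A: X ~ Poisson(2)
b <- rpois(1000,4)  # X in population B: X ~ Poisson(4)
# ExactUniMatchLR is a helper whose definition is not shown in this deck
matchxy <- ExactUniMatchLR(a,b) # compute likelihood ratio weights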


mean(a)

[1] 2

mean(b)

[1] 3.947

mean(b*matchxy$right$LR)

[1] 2

The likelihood ratio weight is doing its job!

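set.seed(100)
#a <- rpois(10000,2) # X in population A: X ~ Poisson(2)
ya <- a^2+rnorm(10000) # Y = X^2 + normal noise
#b <- rpois(1000,4)  # X in population B: X ~ Poisson(4)
yb <- b^2+rnorm(10000) # same model for Y; b has length 1000, so b^2 is recycled to length 10000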
mean(ya)

[1] 6.026

mean(yb)

[1] 19.27

mean(yb*matchxy$right$LR)

[1] 6.028

If we observe X for both population A and population B, we can infer the mean of Y for population A from the mean of Y for population B!
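Since the helper used above is not shown, here is a minimal base R sketch of one way such weights could be computed for a discrete \( X \). `empirical_lr` is a hypothetical function written for this illustration. It is an assumption about the idea, not the implementation of `ExactUniMatchLR`, and its numbers will not match the output above exactly.

```r
set.seed(100)
a <- rpois(10000, 2)   # X in population A
b <- rpois(1000, 4)    # X in population B

# Hypothetical helper: empirical likelihood ratio P_A(X = x) / P_B(X = x),
# evaluated at each element of `to_reweight` and normalised to average 1.
empirical_lr <- function(target, to_reweight) {
  values   <- sort(unique(to_reweight))
  p_target <- sapply(values, function(v) mean(target == v))
  p_source <- sapply(values, function(v) mean(to_reweight == v))
  lr <- (p_target / p_source)[match(to_reweight, values)]
  lr / mean(lr)
}

lr <- empirical_lr(a, b)
yb <- b^2 + rnorm(length(b))   # metric observed only in population B

# Reweighted B should look like A, both for X itself and for Y = X^2 + noise.
print(c(mean_a = mean(a), reweighted_b = mean(b * lr)))
print(c(target = mean(a^2), reweighted_yb = mean(yb * lr)))
```

Values of \( X \) that appear in A but never in B get no weight at all, so this only works when B covers the range of A. That is the same overlap issue as the 100%/0% design.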

Matching

You already know the idea. Think about the conditional randomized experiment.

Given each value of \( X \), you calculate treatment effect for this value of \( X \) by treating it as a randomized experiment.
\[ \tau = \sum \tau(X_i)P(X = X_i) \]

This is exactly what matching means: match treated and untreated users based on the value of \( X \). This approach is intuitive. But when \( X \) is continuous or has many values, the procedure becomes cumbersome. Using weighted samples is the modern perspective. It is actually more natural once you understand it, and more advanced techniques are easier to approach from the weighted sample perspective.

Matching as weighted samples

We have observed \( Y^{(0)} \) for the untreated group. If we can project the average of \( Y^{(0)} \) from the untreated group to the treated group, then we can estimate the treatment effect on the treated (TT).
Similarly, we have observed \( Y^{(1)} \) for the treated group. If we can project the average of \( Y^{(1)} \) from the treated group to the untreated group, we can get the treatment effect on the untreated (TUT).
We can get the average treatment effect as \( \text{TT} \times (\text{treated rate}) + \text{TUT} \times (\text{untreated rate}) \)
Ignorability condition => We can do this projection by using weighted samples.
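The three steps above can be sketched with a discrete \( X \). The simulated model is an assumption for illustration: the effect \( 1 + X \) differs across users, so TT, TUT and ATE are different numbers (2.4, 1.6 and 2 under this setup). `lr_weights` is a hypothetical helper.

```r
set.seed(8)
n  <- 100000
x  <- sample(0:2, n, replace = TRUE)          # discrete covariate
ta <- rbinom(n, 1, c(0.2, 0.5, 0.8)[x + 1])   # assignment depends on x only
y0 <- x + rnorm(n)
y1 <- y0 + 1 + x                              # effect 1 + x grows with x
y  <- ifelse(ta == 1, y1, y0)                 # observed outcome

# Hypothetical helper: weights that make the x-distribution of `x_source`
# look like that of `x_target` (empirical likelihood ratio on values 0:2).
lr_weights <- function(x_target, x_source) {
  p_target <- table(factor(x_target, levels = 0:2)) / length(x_target)
  p_source <- table(factor(x_source, levels = 0:2)) / length(x_source)
  as.numeric(p_target / p_source)[x_source + 1]
}

treated <- ta == 1
w_ut <- lr_weights(x[treated], x[!treated])   # project untreated -> treated
w_t  <- lr_weights(x[!treated], x[treated])   # project treated -> untreated

tt  <- mean(y[treated]) - weighted.mean(y[!treated], w_ut)
tut <- weighted.mean(y[treated], w_t) - mean(y[!treated])
ate <- tt * mean(treated) + tut * mean(!treated)

print(round(c(TT = tt, TUT = tut, ATE = ate), 3))
```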

Propensity Function and Dimension Reduction

What if \( X \) is multivariate? Matching directly will be hard, if not impossible.

It turns out that it suffices to match on a one-dimensional variable called the propensity \[ e(X) = P(treated|X) \]

If \( X \) makes \( TA \) ignorable, then \( e(X) \) alone also makes \( TA \) ignorable. This is dimension reduction with no information loss.

The propensity function quantifies the probability of being treated given the covariate \( X \).
\( e(X) \) can be estimated in many ways: generalized linear model (logistic regression), decision tree, genetic matching, etc.
We can use \( e(X) \) to calculate the likelihood ratio: when we estimate the treatment effect on the treated, every untreated sample is weighted proportionally to \( e(X)/(1-e(X)) \)
We can also match on the propensity function (matching on a continuous variable requires kernel density estimation)
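The weight \( e(X)/(1-e(X)) \) follows from Bayes' rule: \( P(X|treated)/P(X|untreated) \) equals \( e(X)/(1-e(X)) \) times a constant that does not depend on \( X \). Below is a minimal sketch with two covariates, using `glm` for the propensity. The simulated model is an assumption, and the logistic form is correct here only by construction.

```r
set.seed(9)
n  <- 50000
x1 <- rnorm(n)
x2 <- rnorm(n)
ta <- rbinom(n, 1, plogis(0.8 * x1 - 0.5 * x2))   # true propensity is logistic
y0 <- x1 + x2 + rnorm(n)
y1 <- y0 + 1 + x1                                 # effect 1 + x1 varies by user
y  <- ifelse(ta == 1, y1, y0)

# Estimate e(X) = P(treated | X) with logistic regression.
ps_fit <- glm(ta ~ x1 + x2, family = binomial())
e_hat  <- fitted(ps_fit)

treated <- ta == 1
true_tt <- mean((y1 - y0)[treated])               # known only in simulation

naive <- mean(y[treated]) - mean(y[!treated])

# Untreated samples weighted proportionally to e(X) / (1 - e(X)).
w      <- e_hat[!treated] / (1 - e_hat[!treated])
tt_hat <- mean(y[treated]) - weighted.mean(y[!treated], w)

print(round(c(true_tt = true_tt, naive = naive, weighted = tt_hat), 3))
```

`weighted.mean` normalizes the weights, so the constant factor in the likelihood ratio never has to be computed.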

Caveats

Directly using the propensity function to weight samples usually does not perform well

Model imperfection, e.g. logistic regression assumes a logit relationship.
\( e(X) \) could be close to 1 for some samples; the weight is then close to infinity, making the estimation of TT unstable. Similarly, for TUT, when \( e(X) \) is close to 0, the weight \( (1-e(X))/e(X) \) is close to infinity and TUT cannot be accurately estimated. (unbiased but high variance)
Matching on propensity function is better.
But there are still better ways…

Regression with weighted Samples

The weighted samples perspective naturally leads to other applications. One extension of the regression model is to use weighted samples instead of giving all samples the same weight.

This leads to the so-called doubly robust estimator:

If the regression model is correct, then this approach gives an unbiased estimator.
If the weights from the propensity function/likelihood ratio are correct, this approach also gives an unbiased estimator
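As a rough illustration of the second property, the sketch below fits the same linear regression with and without propensity weights, on simulated data where the regression is deliberately misspecified (the outcome depends on \( X^2 \)) but the propensity model is correct. The data-generating model is an assumption, and this plain weighted least squares fit is only one simple form of the idea.

```r
set.seed(10)
n  <- 50000
x  <- runif(n, -1, 2)                # bounded x keeps the weights bounded
ta <- rbinom(n, 1, plogis(x))
y  <- 3 * x^2 + 1 * ta + rnorm(n)    # true effect is 1; y is not linear in x

# Propensity model (correctly specified here by construction).
e_hat <- fitted(glm(ta ~ x, family = binomial()))

# Weights targeting the treated population:
# 1 for treated samples, e/(1-e) for untreated samples.
w <- ifelse(ta == 1, 1, e_hat / (1 - e_hat))

# Same (misspecified) regression, with and without the weights.
unweighted <- unname(coef(lm(y ~ ta + x))["ta"])
weighted   <- unname(coef(lm(y ~ ta + x, weights = w))["ta"])

print(round(c(unweighted = unweighted, weighted = weighted), 3))
```

After weighting, treated and untreated samples have approximately the same distribution of \( X \), so the coefficient on `ta` stays close to the true effect even though the \( X^2 \) term is missing from the regression.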