Alex Deng
8/7/2013
The priority is to cover key concepts. We will get to technical methodologies if time permits, and will likely survey them rather than go deep.
Berkeley gender bias case
Gender | Applicants | Admitted |
---|---|---|
Men | 8442 | 44% |
Women | 4321 | 35% |
Men more likely to be admitted than women?
Deep dive:
Department | Men: Applicants | Men: Admitted | Women: Applicants | Women: Admitted |
---|---|---|---|---|
A | 825 | 62% | 108 | 82% |
B | 560 | 63% | 25 | 68% |
C | 325 | 37% | 593 | 34% |
D | 417 | 33% | 375 | 35% |
E | 191 | 28% | 393 | 24% |
F | 272 | 6% | 341 | 7% |
The departments are sorted by acceptance rate. Most women applied to departments with an acceptance rate below 40%, while most men applied to departments with a higher acceptance rate.
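The reversal can be checked directly from the department table. The sketch below pools the six departments and then re-averages both genders over one shared applicant mix. The choice of reference mix (men plus women per department) is an assumption made here for illustration. Note that these six departments cover only part of the applicants in the first table, so the pooled rates will not reproduce 44% and 35% exactly.

```python
# Department-level figures copied from the table above:
# (men applicants, men admit rate, women applicants, women admit rate)
departments = {
    "A": (825, 0.62, 108, 0.82),
    "B": (560, 0.63, 25, 0.68),
    "C": (325, 0.37, 593, 0.34),
    "D": (417, 0.33, 375, 0.35),
    "E": (191, 0.28, 393, 0.24),
    "F": (272, 0.06, 341, 0.07),
}

def pooled_rate(applicants, rates):
    """Overall admission rate = applicant-weighted average of department rates."""
    return sum(n * r for n, r in zip(applicants, rates)) / sum(applicants)

men_n = [d[0] for d in departments.values()]
men_r = [d[1] for d in departments.values()]
women_n = [d[2] for d in departments.values()]
women_r = [d[3] for d in departments.values()]

# Each gender weighted by its own choice of departments.
pooled_men = pooled_rate(men_n, men_r)
pooled_women = pooled_rate(women_n, women_r)

# Departments in which women were admitted at a higher rate than men.
women_ahead = [name for name, d in departments.items() if d[3] > d[1]]

# Same department rates, but both genders weighted by one shared
# reference mix of applicants (assumption: men + women per department).
reference = [m + w for m, w in zip(men_n, women_n)]
adjusted_men = pooled_rate(reference, men_r)
adjusted_women = pooled_rate(reference, women_r)

print(f"pooled:   men {pooled_men:.1%}, women {pooled_women:.1%}")
print(f"adjusted: men {adjusted_men:.1%}, women {adjusted_women:.1%}")
print("women ahead in departments:", women_ahead)
```

Pooled, men come out ahead. Once both genders are averaged over the same department mix, the ordering flips, which is the same "match, then combine" idea used later in the talk.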
This seems like a paradox because of the following deduction:
Step 2 is wrong, and step 3 is also subtle.
Although the link between cigarette smoke and cancer has been understood for decades, the so-called “smoking gun” came to light in 1998. Researchers found out how benzopyrene affects a gene known as “p53”. The gene acts as a screening agent to make sure that healthy cells divide while unhealthy ones die off. In the cases of cigarette smokers, benzopyrene “shuts off” the p53 mechanism, allowing unhealthy and malignant cells to grow and spread throughout the body. (Source: http://www.lung-cancer.com/tobacco.html)
Wikipedia: Fisher was opposed to the conclusions of Richard Doll and A.B. Hill (1950, British Medical Journal) that smoking caused lung cancer. He compared the correlations in their papers to a correlation between the import of apples and the rise of divorce in order to show that correlation does not imply causation.
A compelling argument: what if there exists a gene that
In this case the real cause is the “tobacco gene”. There is no causal link between usage of tobacco and lung cancer. All the correlation between the two can be explained away by the “tobacco gene”.
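To see how such a gene could produce the correlation on its own, here is a small simulation of Fisher's hypothetical world. Everything in it is an assumption made for illustration: the carrier rate, the smoking rates and the cancer rates are invented, and smoking is deliberately given no effect on cancer.

```python
import random

random.seed(0)

def simulate_person():
    """One person in a world where smoking has NO effect on cancer."""
    gene = random.random() < 0.3                         # hypothetical carrier rate
    smokes = random.random() < (0.8 if gene else 0.2)    # the gene drives smoking
    cancer = random.random() < (0.20 if gene else 0.02)  # the gene drives cancer; smoking is not used
    return gene, smokes, cancer

people = [simulate_person() for _ in range(200_000)]

def cancer_rate(rows):
    return sum(c for _, _, c in rows) / len(rows)

def smoking_gap(rows):
    """Cancer rate of smokers minus cancer rate of non-smokers."""
    smokers = [r for r in rows if r[1]]
    others = [r for r in rows if not r[1]]
    return cancer_rate(smokers) - cancer_rate(others)

raw_gap = smoking_gap(people)                                  # ignores the gene
gap_carriers = smoking_gap([r for r in people if r[0]])        # within carriers
gap_noncarriers = smoking_gap([r for r in people if not r[0]]) # within non-carriers

print(f"raw gap {raw_gap:.3f}")
print(f"gap among carriers {gap_carriers:.3f}, among non-carriers {gap_noncarriers:.3f}")
```

Smokers show a clearly higher cancer rate overall, yet the gap essentially vanishes once carriers and non-carriers are compared separately. The data alone cannot tell this world apart from one where smoking is the cause, unless the gene is observed.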
Fisher's case is hypothetical, but hard to dispute (how to dispute?)
“Are you trying to tell us that it's so easy to make mistakes when drawing any conclusion?”
Yes and No! The takeaway message is:
Don't get me wrong.
Causality is not necessary for many applications. Correlation is very useful in predictive modeling, and in many cases it is all we need. In fact, there is a whole area of statistics, multivariate analysis, devoted to studying correlation between many factors. Its modern name? (Statistical) Machine Learning!
Causality implies the effect of intervention. Crucial for economics, social science, medical study, etc.
Democratic president better for economics?
More Gun Control Brings Lower Crime Rate?
And much more… Fun books to read:
Missing completely at random (MCAR) requires the missing data to be the outcome of a purely random coin-tossing process, i.e., each sample has the same fixed probability of being missing.
Imagine a mischievous oracle who has the complete set of observations, tosses a coin to delete some of them, and hands only the remainder to us.
If observations are MCAR, and \( Y_i, i = 1, \dots, n, n\le N \) are the observed values, then \[ E\left(\sum_{i=1}^n Y_i /n\right)=\sum_{i=1}^N Y_i /N \]
In other words, the non-missing data represent the same distribution as the whole population.
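A quick simulation of the oracle may help. The population, the 40% keep probability and the "biased" deletion rule below are all hypothetical choices for this sketch. The second rule is included only as a contrast, to show what happens when the coin depends on the value being deleted.

```python
import random

random.seed(1)

N = 100_000
population = [random.gauss(10.0, 3.0) for _ in range(N)]  # hypothetical Y_1..Y_N

# MCAR: the oracle tosses the same coin for every sample (keep with probability 0.4).
observed = [y for y in population if random.random() < 0.4]

# Not MCAR: large values are much more likely to survive than small ones.
biased = [y for y in population if random.random() < (0.7 if y > 10.0 else 0.1)]

full_mean = sum(population) / N
observed_mean = sum(observed) / len(observed)
biased_mean = sum(biased) / len(biased)

print(f"full {full_mean:.3f}, MCAR sample {observed_mean:.3f}, biased sample {biased_mean:.3f}")
```

The MCAR sample mean tracks the full mean closely, while the value-dependent deletion shifts it upward.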
Instead of having the same traffic splitting design, let us consider the following design:
Browser | Treatment | Control |
---|---|---|
IE | 60% | 40% |
FireFox | 10% | 90% |
Chrome | 90% | 10% |
Safari | 50% | 50% |
For each browser, we can estimate the treatment effect as in a normal controlled experiment. Then we can summarize the treatment effect for all browsers together using a reference browser distribution.
(Remark: this example is essentially the Berkeley admission example from Simpson's paradox.)
In general, suppose we split traffic by a function of \( X \) (e.g. browser, or multiple attributes), i.e. for each value of \( X \), there is a probability \( e(X) \) of being assigned to treatment.
For each value of \( X \), we can estimate treatment effect \( \tau(X) \) as usual.
In the end we summarize them by \[ \tau = \sum \tau(X_i)P(X = X_i) \] or in continuous case \[ \tau = E(\tau(X)) \] where the expectation is taken over the distribution of \( X \)
In other words: if the treatment vs. control split depends on some variable \( X \), we should match on \( X \), estimate the treatment effect at each level, and then combine.
Matching translates observational data analysis into experimental data analysis.
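The browser design can be worked through with exact expected values, no simulation needed. The treatment probabilities come from the table above. The browser shares, the per-browser control means and the constant lift of 0.5 are hypothetical numbers invented for this sketch.

```python
# Treatment probability e(X) per browser, taken from the table above.
treatment_prob = {"IE": 0.60, "FireFox": 0.10, "Chrome": 0.90, "Safari": 0.50}

# Hypothetical numbers, invented for this sketch only.
browser_share = {"IE": 0.40, "FireFox": 0.20, "Chrome": 0.30, "Safari": 0.10}  # P(X = x)
control_mean = {"IE": 1.0, "FireFox": 2.0, "Chrome": 4.0, "Safari": 3.0}       # E[Y | control, x]
true_effect = 0.5                                                               # same lift in every browser

treated_mean = {b: control_mean[b] + true_effect for b in control_mean}

def group_mean(means, weights):
    """Average of per-browser means, weighted by each browser's share of the group."""
    total = sum(weights.values())
    return sum(means[b] * weights[b] for b in means) / total

# Share of all traffic that is (browser, treated) and (browser, control).
treated_mix = {b: browser_share[b] * treatment_prob[b] for b in browser_share}
control_mix = {b: browser_share[b] * (1 - treatment_prob[b]) for b in browser_share}

# Naive: pool everything, ignoring that the split differs by browser.
naive = group_mean(treated_mean, treated_mix) - group_mean(control_mean, control_mix)

# Matched: tau(x) within each browser, then tau = sum_x tau(x) * P(X = x).
tau_by_browser = {b: treated_mean[b] - control_mean[b] for b in browser_share}
matched = sum(tau_by_browser[b] * browser_share[b] for b in browser_share)

print(f"naive estimate {naive:.3f}, matched estimate {matched:.3f}")
```

The naive difference is more than twice the true lift, because Chrome (high baseline in this made-up data) is over-represented in treatment. Matching on browser recovers 0.5.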
Suppose for each measurement \( Y_i \), we also measure \( X_i \). Unlike \( Y_i \), \( X_i \) is always observed.
Missing at random (MAR) assumes the probability of \( Y \) being missing depends solely on the value of \( X \), i.e., the oracle has observed all \( (X_i, Y_i) \); for each value of \( X_i \) she has a corresponding coin, and she tosses that coin to decide whether to erase \( Y_i \).
Many methods have been developed for the MAR scenario to estimate \( \sum_{i=1}^N Y_i /N \) (or in general \( E(Y) \)).
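One of the simplest such methods is to weight each surviving \( Y_i \) by the inverse of its coin's keep probability. The sketch below assumes a binary \( X \) and made-up means and keep probabilities. It is meant as an illustration of the idea, not as the specific method the talk has in mind.

```python
import random

random.seed(2)

# Hypothetical setup: X is a binary covariate that is always observed.
# Y depends on X, and so does the chance that Y survives.
mean_y = {0: 5.0, 1: 9.0}     # E[Y | X]
keep_prob = {0: 0.9, 1: 0.2}  # P(Y observed | X): one coin per value of X

N = 100_000
xs = [int(random.random() < 0.5) for _ in range(N)]
ys = [random.gauss(mean_y[x], 1.0) for x in xs]
seen = [random.random() < keep_prob[x] for x in xs]

full_mean = sum(ys) / N                          # what the oracle knows
obs = [(x, y) for x, y, s in zip(xs, ys, seen) if s]
naive_mean = sum(y for _, y in obs) / len(obs)   # ignores how the data went missing

# Estimate each coin's keep probability from the data, then weight by its inverse.
n_x = {v: xs.count(v) for v in (0, 1)}
n_obs_x = {v: sum(1 for x, _ in obs if x == v) for v in (0, 1)}
est_keep = {v: n_obs_x[v] / n_x[v] for v in (0, 1)}
weighted_mean = sum(y / est_keep[x] for x, y in obs) / N

print(f"full {full_mean:.3f}, naive {naive_mean:.3f}, weighted {weighted_mean:.3f}")
```

The naive mean is pulled toward the group whose values are rarely deleted. The weighted mean undoes this, because within each value of \( X \) the deletion is a fair coin toss.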
How is this related to causal inference? (If you feel it is obvious now, then I have (mostly) finished my job for this talk, the rest is just technicality, and I can't do the rest without more technical notions either)
Potential outcomes, treatment assignment and covariates: \( (Y^{(1)},Y^{(0)}, TA, X) \)
Ignorability: \( (Y^{(1)},Y^{(0)}) \perp \! \! \! \perp TA | X \)
Ignorability (missing data perspective): \( Y^{(1)} \) and \( Y^{(0)} \) are missing at random conditional on the covariate \( X \)
Ignorability (controlled experiment perspective): for each value of \( X \), there exists an equivalent conditional controlled experiment
Finding \( X \) to satisfy ignorability is the central problem of causal inference with observational data.
It relates causality to a conditional randomized experiment (conditioned on \( X \)), for which causal inference is easy. All techniques are based on this key observation.
Does such \( X \) always exist to satisfy ignorability?
Trivial case: Let \( X = TA \), then \( (Y^{(1)},Y^{(0)}) \perp \! \! \! \perp TA | TA \) trivially (conditioned on \( TA \), \( TA \) is not random, so it is independent of any random variable). This choice of \( X \) is useless. Why? (Hint: what can we get from a controlled experiment with a 100%/0% design?)
Finding a useful \( X \) that provides ignorability is a difficult problem. It requires domain knowledge to identify the key confounders.
When too many variables are added to \( X \), the corresponding conditional experiment might be close to a 100%/0% design, i.e. hard to match. In that case \( \tau(X) \) cannot be reliably estimated, and the approach might be flawed.
An equally important problem is how to test the ignorability condition.
In general there is no direct test, but indirect tests exist. (Hint: AA test)
If \( TA \) is a randomized assignment, then \( (Y^{(1)},Y^{(0)}) \perp \! \! \! \perp TA \).
A randomized experiment provides unconditional ignorability.
On the other hand, conditional ignorability implies equivalent conditional randomized experiments.
Regression is a popular tool used to “adjust for confounders”.
Linear Model:
This is an explicit model for \( (Y^{(1)},Y^{(0)}) | (TA, X) \). In particular, \( Y^{(1)} - Y^{(0)} = \mu \), and \( Y^{(0)} = \theta_0+ \theta \times X + \epsilon \), where \( \epsilon \) is zero-mean noise uncorrelated with \( X \) and \( TA \).
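The two statements can be combined into a single regression equation for the observed outcome. This assumes the usual convention, not spelled out here, that \( TA \in \{0, 1\} \) and that we observe \( Y = Y^{(1)} \) for treated users and \( Y = Y^{(0)} \) otherwise:

\[
\begin{aligned}
Y &= TA \times Y^{(1)} + (1 - TA) \times Y^{(0)} \\
  &= Y^{(0)} + TA \times \bigl(Y^{(1)} - Y^{(0)}\bigr) \\
  &= \theta_0 + \mu \times TA + \theta \times X + \epsilon
\end{aligned}
\]

So \( \mu \) is simply the coefficient on \( TA \) when \( Y \) is regressed on \( TA \) and \( X \).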
Theorem: Under the assumptions of the linear model, \( X \) provides ignorability. Moreover, \( \mu \) is the treatment effect and can be estimated without bias by least squares.
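A small simulation can illustrate the theorem when the linear model really holds. The parameter values and the assignment rule below are hypothetical. To stay dependency-free, the coefficient on \( TA \) is computed by partialling \( X \) out of both \( Y \) and \( TA \) (the Frisch-Waugh-Lovell identity), which gives the same number as a least squares fit of \( Y \) on \( (1, TA, X) \).

```python
import random

random.seed(3)

# Hypothetical parameters for the linear model in the text.
theta0, theta, mu = 1.0, 2.0, 0.5
n = 20_000

xs = [random.gauss(0.0, 1.0) for _ in range(n)]
# Assignment depends on X (a confounder): larger X means more likely treated.
tas = [1.0 if random.random() < (0.8 if x > 0 else 0.2) else 0.0 for x in xs]
ys = [theta0 + mu * ta + theta * x + random.gauss(0.0, 1.0) for x, ta in zip(xs, tas)]

def mean(v):
    return sum(v) / len(v)

def residuals(target, regressor):
    """Residuals of a simple least squares fit of target on (1, regressor)."""
    mt, mr = mean(target), mean(regressor)
    slope = sum((r - mr) * (t - mt) for r, t in zip(regressor, target)) / sum(
        (r - mr) ** 2 for r in regressor
    )
    return [(t - mt) - slope * (r - mr) for r, t in zip(regressor, target)]

# Naive comparison of treated and untreated, ignoring X.
treated = [y for y, ta in zip(ys, tas) if ta == 1.0]
untreated = [y for y, ta in zip(ys, tas) if ta == 0.0]
naive = mean(treated) - mean(untreated)

# Least squares coefficient on TA in Y ~ 1 + TA + X, via partialling out X.
ry, rta = residuals(ys, xs), residuals(tas, xs)
mu_hat = sum(a * b for a, b in zip(rta, ry)) / sum(a * a for a in rta)

print(f"true mu {mu}, naive difference {naive:.3f}, regression estimate {mu_hat:.3f}")
```

The regression estimate lands close to the true \( \mu \) only because the data were generated from exactly this linear model.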
Linear regression is widely used for causal inference but its assumptions are too strong for most applications.
It assumes a linear relationship
It requires the treatment effect for all individuals to be the same.
In most cases the treatment effect does vary from one individual to another, and all we care about is the average treatment effect (ATE) or the treatment effect on the treated (TT).
More sophisticated modeling can allow some heterogeneity in the treatment effect (add interaction terms between \( X \) and \( TA \), etc.). But the key issue remains: regression modeling puts strong assumptions on the model, and causal inference can be sensitive to the validity of the model.
We can add more and more terms to the linear model to control for more covariates. Keep adding until we get a treatment effect close to 0? Can we then claim ignorability and that the linear model is good to go?
Adding too many control terms to a linear model can backfire.
A common problem:
set.seed(100)
a <- rpois(10000,2) # X in population A ~ Poisson(2)
b <- rpois(1000,4) # X in population B ~ Poisson(4)
matchxy <- ExactUniMatchLR(a,b) # compute likelihood ratio weights
mean(a)
[1] 2
mean(b)
[1] 3.947
mean(b*matchxy$right$LR)
[1] 2
The likelihood ratio weight is doing its job!
set.seed(100)
#a <- rpois(10000,2) # X in population A ~ Poisson(2)
ya <- a^2+rnorm(10000) # Y = X^2 + normal noise
#b <- rpois(1000,4) # X in population B ~ Poisson(4)
yb <- b^2+rnorm(1000) # one noise draw per element of b
mean(ya)
[1] 6.026
mean(yb)
[1] 19.27
mean(yb*matchxy$right$LR)
[1] 6.028
If we observe X for both population A and population B, and Y depends on X in the same way in both, we can infer the mean of Y for population A from the mean of Y for population B!
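For readers without access to `ExactUniMatchLR`, the same experiment can be sketched in plain Python. The weighting function below is a stand-in written for this sketch: it uses the empirical frequency ratio \( \hat P_A(X = x) / \hat P_B(X = x) \) as the likelihood ratio, which is an assumption about what the original function computes. The Poisson sampler truncates the support at 30, which is harmless for these rates.

```python
import math
import random
from collections import Counter

random.seed(4)

def poisson_sample(lam, size, cutoff=30):
    """Draw Poisson(lam) values; the support is truncated at `cutoff` for simplicity."""
    support = range(cutoff + 1)
    pmf = [math.exp(-lam) * lam**k / math.factorial(k) for k in support]
    return random.choices(support, weights=pmf, k=size)

a = poisson_sample(2, 10_000)  # X in population A
b = poisson_sample(4, 1_000)   # X in population B

# Empirical likelihood ratio P_A(X = x) / P_B(X = x) for each value seen in B.
# Stand-in for ExactUniMatchLR, whose implementation is not shown in the talk.
freq_a, freq_b = Counter(a), Counter(b)
lr = {x: (freq_a[x] / len(a)) / (freq_b[x] / len(b)) for x in freq_b}
weights = [lr[x] for x in b]

mean_a = sum(a) / len(a)
mean_b = sum(b) / len(b)
weighted_mean_b = sum(x * w for x, w in zip(b, weights)) / len(b)

# Y = X^2 + noise in both populations: reweighting B recovers the mean of Y in A.
ya = [x**2 + random.gauss(0, 1) for x in a]
yb = [x**2 + random.gauss(0, 1) for x in b]
mean_ya = sum(ya) / len(ya)
weighted_mean_yb = sum(y * w for y, w in zip(yb, weights)) / len(b)

print(f"mean(a) {mean_a:.3f}, mean(b) {mean_b:.3f}, weighted mean(b) {weighted_mean_b:.3f}")
print(f"mean(ya) {mean_ya:.3f}, weighted mean(yb) {weighted_mean_yb:.3f}")
```

The weights only need \( X \). Once they are computed, they can be reused for any \( Y \) whose relationship to \( X \) is shared by the two populations.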
You already know the idea. Think about the conditional randomized experiment.
This is exactly what matching means: match treated and untreated users based on the value of \( X \). This approach is intuitive, but when \( X \) is continuous or takes many values, the procedure becomes cumbersome. Using weighted samples is the modern perspective: it is more natural once you understand it, and more advanced techniques are easier to approach from it.
We have observed \( Y^{(0)} \) for the untreated group. If we can project the average of \( Y^{(0)} \) from the untreated group to the treated group, then we can estimate the treatment effect on the treated (TT).
Similarly, we have observed \( Y^{(1)} \) for the treated group. If we can project the average of \( Y^{(1)} \) from the treated group to the untreated group, we can get the treatment effect on the untreated (TUT).
We can get the average treatment effect as \( TT \times (\text{treated rate}) + TUT \times (\text{untreated rate}) \).
Ignorability condition => We can do this projection by using weighted samples.
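The projection can be sketched with the same reweighting trick. In the hypothetical setup below, \( X \) takes three values, both the treatment probability and the treatment effect grow with \( X \), and ignorability holds by construction. With these made-up numbers the true values are TT = 2.4, TUT = 1.6 and ATE = 2.0.

```python
import random
from collections import Counter

random.seed(5)

# Hypothetical setup: X in {0, 1, 2}; treatment probability and effect both grow with X.
propensity = {0: 0.2, 1: 0.5, 2: 0.8}  # e(x) = P(treated | X = x)
n = 60_000
rows = []
for _ in range(n):
    x = random.choice((0, 1, 2))
    treated = random.random() < propensity[x]
    y0 = x + random.gauss(0, 1)  # potential outcome without treatment
    y1 = y0 + 1 + x              # potential outcome with treatment
    rows.append((x, treated, y1 if treated else y0))  # only one of the two is observed

def x_distribution(group):
    counts = Counter(x for x, _, _ in group)
    return {x: c / len(group) for x, c in counts.items()}

def projected_mean(source, target):
    """Mean of Y in `source`, reweighted so its X distribution matches `target`."""
    p_src, p_tgt = x_distribution(source), x_distribution(target)
    return sum(y * p_tgt[x] / p_src[x] for x, _, y in source) / len(source)

def mean_y(group):
    return sum(y for _, _, y in group) / len(group)

treated_rows = [r for r in rows if r[1]]
untreated_rows = [r for r in rows if not r[1]]

tt = mean_y(treated_rows) - projected_mean(untreated_rows, treated_rows)
tut = projected_mean(treated_rows, untreated_rows) - mean_y(untreated_rows)
treated_rate = len(treated_rows) / n
ate = tt * treated_rate + tut * (1 - treated_rate)

print(f"TT {tt:.3f}, TUT {tut:.3f}, ATE {ate:.3f}")
```

TT and TUT differ here because users who are more likely to be treated also benefit more, which is exactly the situation a single regression coefficient cannot describe.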
What if \( X \) is multivariate? Matching directly will be hard, if not impossible.
It turns out that it suffices to match on a single one-dimensional variable called the propensity \[ e(X) = P(\text{treated}|X) \]
If \( X \) makes \( TA \) ignorable, then \( e(X) \) alone also makes \( TA \) ignorable. This is dimension reduction with no information loss.
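A sketch of why this holds, assuming \( TA \in \{0, 1\} \) with \( TA = 1 \) meaning treated. Because \( e(X) \) is a function of \( X \), we can first condition on \( X \) and then average:

\[
\begin{aligned}
P\bigl(TA = 1 \mid Y^{(1)}, Y^{(0)}, e(X)\bigr)
  &= E\bigl[\, P(TA = 1 \mid Y^{(1)}, Y^{(0)}, X) \;\big|\; Y^{(1)}, Y^{(0)}, e(X) \bigr] \\
  &= E\bigl[\, P(TA = 1 \mid X) \;\big|\; Y^{(1)}, Y^{(0)}, e(X) \bigr] \\
  &= E\bigl[\, e(X) \;\big|\; Y^{(1)}, Y^{(0)}, e(X) \bigr] = e(X)
\end{aligned}
\]

The second line uses ignorability given \( X \). The same argument without the potential outcomes gives \( P(TA = 1 \mid e(X)) = e(X) \), so once \( e(X) \) is known the potential outcomes carry no further information about \( TA \), i.e. \( (Y^{(1)},Y^{(0)}) \perp \! \! \! \perp TA | e(X) \).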
Directly using the propensity function to weight samples usually does not perform well.
The weighted-samples perspective naturally leads to other applications. One extension of the regression model is to use weighted samples instead of giving all samples the same weight.
This leads to the so-called doubly robust estimator:
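One widely used estimator of this kind is the augmented inverse-propensity-weighted (AIPW) form. It is shown here as an illustration and may differ in detail from the weighted-regression version the talk refers to. The symbols \( \hat m_1(X) \) and \( \hat m_0(X) \) are introduced only for this sketch: they denote fitted regression predictions of \( Y^{(1)} \) and \( Y^{(0)} \) given \( X \), and \( \hat e(X) \) is the estimated propensity.

\[
\hat\tau_{DR} = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat m_1(X_i) - \hat m_0(X_i)
  + \frac{TA_i \, \bigl(Y_i - \hat m_1(X_i)\bigr)}{\hat e(X_i)}
  - \frac{(1 - TA_i) \, \bigl(Y_i - \hat m_0(X_i)\bigr)}{1 - \hat e(X_i)} \right]
\]

The name comes from the fact that this estimate of the average treatment effect remains consistent if either the regression model or the propensity model is correctly specified, not necessarily both.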