\[\\[0.5in]\]
Are the following causal questions or association questions?
Question: Which cancer treatments are best for which patients?
\[\\[0.5in]\]
\[\\[0.5in]\]
\[\\[0.5in]\]
\[\\[0.5in]\]
Historically, there has been a lot of writing about what causation is
Causation in statistics: Will changing \(X\) also change \(Y\)?
Association: Does knowing \(X\) help to predict \(Y\)?
If \(X\) causes \(Y\), then \(X\not\perp Y\) (\(X\) and \(Y\) are associated)
But the converse is not true: it is possible to have \(X\not\perp Y\) even though neither \(X\) nor \(Y\) causes the other
How can statistics be misleading on causation?
New infectious disease with high mortality rate
Scientists develop an experimental treatment and give it to doctors to try out
## treatment symptoms mortality
## 687 experimental severe alive
## 239 experimental severe died
## 530 placebo mild died
## 846 placebo mild alive
## 346 experimental mild died
## 634 placebo mild alive
## 509 placebo mild alive
## 475 experimental severe alive
## 213 placebo severe died
## 431 placebo mild alive
## 632 experimental mild alive
## 630 experimental severe alive
## 318 placebo severe died
## 991 placebo mild alive
## 648 placebo severe alive
## 880 placebo mild alive
## 136 experimental severe died
## 655 experimental severe alive
## 497 placebo mild alive
## 358 placebo mild alive
##
## experimental placebo
## 487 513
##
## mild severe
## 514 486
##
## alive died
## 754 246
table(df$treatment, df$mortality)[,2]/table(df$treatment)
##
## experimental placebo
## 0.2648871 0.2280702
The death rate is higher in the experimental treatment group than the placebo group
Question: Can we conclude that the experimental drug is not effective?
\[\\[0.5in]\]
table(df$treatment, df$symptoms, df$mortality)[,,2]/table(df$treatment, df$symptoms)
##
## mild severe
## experimental 0.07352941 0.33903134
## placebo 0.16402116 0.40740741
Someone with severe symptoms is much more likely to die than someone with mild symptoms
But, within both symptom groups, the experimental treatment was associated with a lower death rate
Question: What might this indicate? (select all that apply)
\[\\[0.5in]\]
table(df$treatment, df$symptoms)
##
## mild severe
## experimental 136 351
## placebo 378 135
chisq.test(df$treatment, df$symptoms)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: df$treatment and df$symptoms
## X-squared = 207.58, df = 1, p-value < 2.2e-16
Those with severe symptoms were more likely to be on the experimental treatment
Timing matters here! Was severity taken before or after treatment?
Neither scenario below is conclusive from the data, but knowing the time ordering can help rule one out
Example: Sleeping with shoes on is associated with waking up with a headache
Example: Yellow teeth and cancer are confounded by smoking
Confounding: \(X\) and \(Y\) are confounded if they are both influenced by a third variable
Example from last section: Assume that severity was taken before treatment was given
There are two “open” paths from treatment to mortality:
Looking at the total association between treatment and mortality (ignoring severity) will include associations from both paths
Controlling for severity “blocks” the treatment <- severity -> mortality pathway
We’re left with the direct causal path treatment -> mortality
Diagram above: A and B are confounded by C and have no direct causal relationship
\(X = 2C + \epsilon_X \Rightarrow C = 0.5X - 0.5\epsilon_X\), so \(Y = 3C + \epsilon_Y \approx 0.5 \cdot 3 X + \text{noise} = 1.5X + \text{noise}\)
size <- 500
C <- 10*runif(size)       # confounder
X <- 2*C + rnorm(size)    # X is driven by C but has no effect on Y
Y <- 3*C + rnorm(size)    # Y is driven only by C
summary(lm(Y~X))
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6491 -1.2526 -0.0418 1.2540 4.1460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6102 0.1522 4.01 7.01e-05 ***
## X 1.4342 0.0131 109.52 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.74 on 498 degrees of freedom
## Multiple R-squared: 0.9601, Adjusted R-squared: 0.9601
## F-statistic: 1.199e+04 on 1 and 498 DF, p-value: < 2.2e-16
summary(lm(Y~X+C))
##
## Call:
## lm(formula = Y ~ X + C)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.81847 -0.63492 0.02683 0.65114 2.43640
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03361 0.08757 0.384 0.701
## X -0.03018 0.04536 -0.665 0.506
## C 3.04750 0.09315 32.717 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.981 on 497 degrees of freedom
## Multiple R-squared: 0.9874, Adjusted R-squared: 0.9873
## F-statistic: 1.941e+04 on 2 and 497 DF, p-value: < 2.2e-16
Diagram above: X influences Y but the effect is confounded by Z (denoted C in the code below). That is, failing to account for the confounder in a regression will lead to an incorrect causal parameter estimate
parameter = (confounding bias) + (causal effect) = 1.5 + 5 = 6.5
Y <- 3*C + 5*X + rnorm(size)
summary(lm(Y~X))
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.261 -1.139 -0.012 1.183 5.458
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.65110 0.14861 4.381 1.44e-05 ***
## X 6.43596 0.01279 503.296 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.7 on 498 degrees of freedom
## Multiple R-squared: 0.998, Adjusted R-squared: 0.998
## F-statistic: 2.533e+05 on 1 and 498 DF, p-value: < 2.2e-16
summary(lm(Y~X+C))
##
## Call:
## lm(formula = Y ~ X + C)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2887 -0.6351 -0.0185 0.6507 2.9811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.09275 0.08704 1.066 0.287
## X 5.01797 0.04509 111.295 <2e-16 ***
## C 2.95101 0.09258 31.875 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.975 on 497 degrees of freedom
## Multiple R-squared: 0.9994, Adjusted R-squared: 0.9994
## F-statistic: 3.853e+05 on 2 and 497 DF, p-value: < 2.2e-16
Note: this covariate-adjustment arithmetic only works out exactly because the simulated data are linear. In the real world, we typically can’t assume linearity, and the math above does not carry over when the true relationships aren’t linear.
In most real-world datasets, there will always be the possibility of latent confounding
Sampling Bias occurs when a sample is collected in such a way that some members of the intended population have a lower or higher sampling probability than others.
Sampling bias can lead to incorrect estimates when variables of interest influence sampling
Example: In 1936, the American magazine The Literary Digest sent out two million surveys to its readers and predicted that Alf Landon would beat the incumbent president, Franklin Roosevelt, by a landslide, but the opposite happened. This was because its readers over-represented Republicans.
Example: A researcher analysed data from 257 hospitalized individuals and detected an association between locomotor disease and respiratory disease (odds ratio 4.06). The researcher repeated the analysis in a sample of 2783 individuals from the general population and found no association (odds ratio 1.06)
size <- 100
locomotor <- 10*runif(size)     # locomotor disease score
respiratory <- 10*runif(size)   # respiratory disease score, independent of locomotor
# hospitalization is more likely when either disease score is high (a collider)
hospitalized <- rbinom(size, 1, (locomotor+respiratory)/20)==1
df_all <- data.frame(locomotor, respiratory, hospitalized)
library(ggplot2)
ggplot(df_all[hospitalized==TRUE,], aes(respiratory, locomotor)) + geom_point() +
geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
ggplot(df_all, aes(respiratory, locomotor)) + geom_point(aes(color=hospitalized)) +
stat_smooth(method=lm, se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
Sampling bias is an example of collider bias
Here we are conditioning on hospitalization
Even though locomotor and respiratory are independent, conditioning on a collider induces an association
Conditioning on a collider “opens” an association pathway
If \(A\) and \(B\) are random variables that are associated (dependent), \(A\not\perp B\), there are 3 basic ways this can happen
Direct causation:
OR
\(A\) and \(B\) share an unmeasured confounder:
Collider bias: \(A\) and \(B\) both influence sampling
If we only have data for \(A\) and \(B\), we cannot distinguish between these cases without more assumptions
If there are more variables, we can figure out more using conditional independence
| Structure | Path type | \(A,C\) independent/dependent? | \(A,C\) conditionally independent/dependent given \(B\) |
|---|---|---|---|
| Chain | \(A \rightarrow B \rightarrow C\) | \(A\not\perp C\) | \(A\perp C|B\) |
| Other Chain | \(A \leftarrow B \leftarrow C\) | \(A\not\perp C\) | \(A\perp C|B\) |
| Fork (Confounder) | \(A \leftarrow B \rightarrow C\) | \(A\not\perp C\) | \(A\perp C|B\) |
| Collider | \(A \rightarrow B \leftarrow C\) | \(A\perp C\) | \(A\not\perp C|B\) |
Question: If I want to use inference (causal or otherwise) to quantify the effect of \(X\) on \(Y\), which regression model should I use?
\[\\[0.5in]\]
For GLMs and causal inference, we should only control for variables that can cause our outcome
For GLMs and causal inference, we should not control for variables that are caused by the outcome (this can lead to collider bias)
Markov Blanket Image from Wikipedia
In the image above, the Markov Blanket of \(A\) is the set of white variables within the large circle
Questions:
In the graph above, \(D\) and \(G\) are associated (\(D \not\!\perp \!\!\! \perp G\))
\[\\[0.5in]\] In the graph above, \(D \perp \!\!\! \perp G |E\)
\[\\[0.5in]\] In the graph above, \(D \perp \!\!\! \perp G |E, H\)
\[\\[0.5in]\]
In the graph above, which variables should be used for inference on \(G\) as the outcome? (text using capitals, no space, no comma, alphabetical order)
\[\\[0.5in]\]
In the graph above, which variables should be used for predicting \(G\)? (text using capitals, no space, no comma, alphabetical order)
\[\\[0.5in]\] Using the graph above, which variables should be significant in the following regression \(G\sim A+B+C+D+E+F+H+I\) (text using capitals, no space, no comma, alphabetical order)
The Manpower Demonstration Research Corporation (MDRC) ran a federally and privately funded program, implemented in the mid-1970s, that provided 6-18 months of work experience to individuals who faced economic and social problems prior to enrollment
The program took place in 1976 and 1977
Those selected to join the program participated in various types of work, such as restaurant and construction jobs
Pre-treatment information was collected - earnings, education, age, ethnicity, marital status
All observations here are from men
Goal: Determine causal efficacy of program on earnings
Must control for potentially nonlinear confounding
library(ggplot2)
library(twang)   # provides the lalonde data used below (and the ps() function used later)
data(lalonde)
dim(lalonde)
## [1] 614 10
names(lalonde)
## [1] "treat" "age" "educ" "black" "hispan" "married"
## [7] "nodegree" "re74" "re75" "re78"
table(lalonde$treat)
##
## 0 1
## 429 185
lalonde$treat <- ifelse(lalonde$treat == 1, TRUE, FALSE)
#age
tapply(lalonde$age, lalonde$treat, mean)
## FALSE TRUE
## 28.03030 25.81622
t.test(lalonde$age ~ lalonde$treat)
##
## Welch Two Sample t-test
##
## data: lalonde$age by lalonde$treat
## t = 2.9911, df = 510.57, p-value = 0.002914
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
## 0.7598127 3.6683610
## sample estimates:
## mean in group FALSE mean in group TRUE
## 28.03030 25.81622
ggplot(lalonde, aes(x=age, fill=treat)) + geom_density(alpha=0.25)
#educ
tapply(lalonde$educ, lalonde$treat, mean)
## FALSE TRUE
## 10.23543 10.34595
t.test(lalonde$educ ~ lalonde$treat)
##
## Welch Two Sample t-test
##
## data: lalonde$educ by lalonde$treat
## t = -0.54676, df = 485.37, p-value = 0.5848
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
## -0.5076687 0.2866393
## sample estimates:
## mean in group FALSE mean in group TRUE
## 10.23543 10.34595
ggplot(lalonde, aes(x=educ, fill=treat)) + geom_density(alpha=0.25)
#black
table(lalonde$black, lalonde$treat)
##
## FALSE TRUE
## 0 342 29
## 1 87 156
chisq.test(lalonde$black, lalonde$treat)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: lalonde$black and lalonde$treat
## X-squared = 219.04, df = 1, p-value < 2.2e-16
ggplot(lalonde, aes(x=black, fill=treat)) + geom_bar(position = 'dodge')
#hispan
table(lalonde$hispan, lalonde$treat)
##
## FALSE TRUE
## 0 368 174
## 1 61 11
chisq.test(lalonde$hispan, lalonde$treat)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: lalonde$hispan and lalonde$treat
## X-squared = 7.7664, df = 1, p-value = 0.005323
ggplot(lalonde, aes(x=hispan, fill=treat)) + geom_bar(position = 'dodge')
#married
table(lalonde$married, lalonde$treat)
##
## FALSE TRUE
## 0 209 150
## 1 220 35
chisq.test(lalonde$married, lalonde$treat)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: lalonde$married and lalonde$treat
## X-squared = 54.428, df = 1, p-value = 1.613e-13
ggplot(lalonde, aes(x=married, fill=treat)) + geom_bar(position = 'dodge')
#nodegree
table(lalonde$nodegree, lalonde$treat)
##
## FALSE TRUE
## 0 173 54
## 1 256 131
chisq.test(lalonde$nodegree, lalonde$treat)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: lalonde$nodegree and lalonde$treat
## X-squared = 6.4107, df = 1, p-value = 0.01134
ggplot(lalonde, aes(x=nodegree, fill=treat)) + geom_bar(position = 'dodge')
#re74
tapply(lalonde$re74, lalonde$treat, mean)
## FALSE TRUE
## 5619.237 2095.574
t.test(lalonde$re74 ~ lalonde$treat)
##
## Welch Two Sample t-test
##
## data: lalonde$re74 by lalonde$treat
## t = 7.2456, df = 475.99, p-value = 1.748e-12
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
## 2568.067 4479.258
## sample estimates:
## mean in group FALSE mean in group TRUE
## 5619.237 2095.574
ggplot(lalonde, aes(x=re74, fill=treat)) + geom_density(alpha=0.25)
#re75
tapply(lalonde$re75, lalonde$treat, mean)
## FALSE TRUE
## 2466.484 1532.055
t.test(lalonde$re75 ~ lalonde$treat)
##
## Welch Two Sample t-test
##
## data: lalonde$re75 by lalonde$treat
## t = 3.2776, df = 356.22, p-value = 0.00115
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
## 373.742 1495.116
## sample estimates:
## mean in group FALSE mean in group TRUE
## 2466.484 1532.055
ggplot(lalonde, aes(x=re75, fill=treat)) + geom_density(alpha=0.25)
\[\\[0.5in]\]
\[\\[0.5in]\]
Let \(Y_i^1\) be the outcome under treatment for observation \(i\) and let \(Y_i^0\) be the outcome without treatment for observation \(i\).
Counterfactual results would be the results if the treatment were changed
Common example: Does taking vitamin C prevent sickness?
The causal effect for observation \(i\) is \[Y_i^1 - Y_i^0\]
Unfortunately, it’s not possible to observe any individual’s outcome both with and without treatment
## A Y Y0 Y1
## 1 0 0 0 NA
## 2 0 0 0 NA
## 3 0 0 0 NA
## 4 0 0 0 NA
## 5 1 0 NA 0
## 6 1 1 NA 1
## 7 1 1 NA 1
## 8 1 1 NA 1
In a population, the average causal effect (ATE) is \[\text{ATE} = E\left[Y^1 - Y^0\right] = E[Y^1] - E[Y^0]\]
Estimated as \[\widehat{\text{ATE}} = \frac{1}{n} \sum_{i=1}^n\left(Y_i^1 - Y_i^0\right)\]
Question: Is \(E(Y^1|A=1)\) different from \(E(Y|A=1)\)?
\[\\[0.5in]\]
\(E(Y^1|A=1)\) is the causal effect of treatment while \(E(Y|A=1)\) is the association
\(Y^1 = Y^{A=1}\) and \(Y^0 = Y^{A=0}\) assumes that \(A\) is not influenced by any variables, measured or latent: we are able to choose the counterfactual scenario
No other variables influence \(A\) when we consider \(Y^1, Y^0\)
ATE can be interpreted as the average difference in outcome within the population that is attributed to \(A\)
There may be other reasons \(Y\) differs from person to person, like age, severity, etc
Conditional Average Causal Effect (CATE): if \(Z\) is another covariate, \[\text{CATE}_z = E[Y^1|Z=z] - E[Y^0|Z=z]\] is the average causal effect for group \(Z=z\).
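To make the ATE and CATE concrete, here is a minimal simulation sketch (an illustration only: in a simulation we can see both potential outcomes, which is never true in real data) where the effect of treatment differs by a binary covariate \(Z\):
set.seed(1)
n <- 10000
Z <- rbinom(n, 1, 0.5)      # binary covariate
Y0 <- rnorm(n)              # potential outcome without treatment
Y1 <- Y0 + 1 + 2*Z          # treatment effect is 1 when Z=0 and 3 when Z=1
mean(Y1 - Y0)               # ATE, roughly 2
tapply(Y1 - Y0, Z, mean)    # CATE_0 roughly 1, CATE_1 roughly 3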
When does a parameter estimate have a causal interpretation?
Consider vaccine trials: a population is randomized to receive a test vaccine or placebo
Single or double blind: participants (and, in a double-blind trial, also the researchers) are not told which group they are in
Because of randomization, confounding is not possible and, with a large enough sample, the two groups will be statistically equivalent other than vaccine treatment
Any difference in infection acquisition can be attributed to vaccination.
Can estimate the ATE: \[\begin{align} \text{ATE} &= E[Y^{\text{vaccine}}] - E[Y^{\text{placebo}}] \\ \widehat{\text{ATE}} &= \frac{1}{n} \sum_{i=1}^n Y_i^{\text{vaccine}} - \frac{1}{m} \sum_{j=1}^m Y_j^{\text{placebo}} \end{align}\]
\(H_0: \text{ATE} = 0,\ H_1: \text{ATE} > 0\)
Two sample t-test is sufficient
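A minimal sketch of that logic on hypothetical simulated data (not real trial data): because assignment is randomized, the simple two-sample comparison estimates the ATE.
set.seed(1)
n <- 2000
vaccine <- rbinom(n, 1, 0.5)    # randomized assignment
risk <- rnorm(n)                # other causes of infection, independent of assignment by design
infected <- rbinom(n, 1, plogis(-1 - 1.5*vaccine + risk))
t.test(infected ~ vaccine)      # difference in infection rates estimates the ATE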
set.seed(1234)
lalonde$random <- sample(rep(c(TRUE, FALSE), length.out=dim(lalonde)[1]))
ggplot(lalonde, aes(x=treat, fill=random)) + geom_bar(position = 'dodge')
ggplot(lalonde, aes(x=age, fill=random)) + geom_density(alpha=0.25)
ggplot(lalonde, aes(x=black, fill=random)) + geom_bar(position = 'dodge')
ggplot(lalonde, aes(x=hispan, fill=random)) + geom_bar(position = 'dodge')
ggplot(lalonde, aes(x=married, fill=random)) + geom_bar(position = 'dodge')
ggplot(lalonde, aes(x=nodegree, fill=random)) + geom_bar(position = 'dodge')
ggplot(lalonde, aes(x=re74, fill=random)) + geom_density(alpha=0.25)
ggplot(lalonde, aes(x=re78, fill=random)) + geom_density(alpha=0.25)
Questions:
What type of bias does causal inference attempt to eliminate? (select all that apply)
\[\\[0.5in]\]
Randomizing treatments ensures there is no (select all that apply)
\[\\[0.5in]\] Average treatment effect (ATE) quantifies what? (text/conversation)
\[\\[0.5in]\]
If \(Y\) is the outcome and \(A\) the treatment, how can one interpret \(E[Y^1 | A=0]\)?
\[\\[0.5in]\]
A UM researcher wants to understand the impact of a low-calorie vs a low-fat/low-carb diet on weight loss among UM students.
If she advertises enrollment for her study among UM athletes and follows each participant, collecting diet information and weight over time, what are possible sources of statistical association between diet and weight loss in her data? (select all that apply)
\[\\[0.5in]\]
If she advertises enrollment for her study among UM athletes, randomizes diet type for each participant, and collects weight over time, what are possible sources of statistical association between diet and weight loss in her data? (select all that apply)
\[\\[0.5in]\] Given that there is confounding, how can we remove its bias from our parameter estimate? (select all that apply)
\[\\[0.5in]\]
In observational data, the population receiving a treatment is difficult to compare to the other groups because of possible confounding
In the first example, the treatment and non-treatment populations had different severity levels, so are hard to compare directly
table(df$treatment, df$symptoms)
##
## mild severe
## experimental 136 351
## placebo 378 135
chisq.test(df$treatment, df$symptoms)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: df$treatment and df$symptoms
## X-squared = 207.58, df = 1, p-value < 2.2e-16
ggplot(df, aes(x=symptoms, fill=treatment)) + geom_bar(position = 'dodge')
This plot shows that symptoms and treatment are associated
If severity was used to assign treatment and symptoms influence outcomes, then there will be confounding
On the other hand, if symptoms and treatment were independent, then it is not possible for symptoms to confound
If symptoms and treatment are independent, then the treatment and placebo populations can be viewed as generated from the same (symptom) distribution
Similarly, for comparison, we want the treatment and placebo populations to be generated from the same distribution
This blocks any association between confounders and treatment
Propensity scores (PS) indicate the propensity for an observation to be in the treatment group
Propensity-score weights are sometimes called inverse probability of treatment weights (IPTW)
If a covariate influences (causes) an observation’s chances of getting treatment, it may confound the result
Question: To be a confounder, what must also happen? (text)
\[\\[0.5in]\]
PS methods use (observed) variables to weight observations according to their probability of receiving treatment; for a treated observation, the weight is \[w_i = \frac{1}{P(A=1 | C = x_i)}\]
Question: If a variable is unobserved (not in the data), can it still confound?
\[\\[0.5in]\]
PS tries to make the populations look as if treatment were randomized
In standard regression, each observation contributes the same weight to the parameter estimates
PS makes it so that each observation contributes a weight which is inversely proportional to its probability of treatment
Question: Which plot is more likely to use inverse probability treatment weights? (choose A or B)
\[\\[0.5in]\]
Another way to think about inverse probability weighting is as balancing covariates, so that each variable’s distribution looks similar across treatment groups after weighting
Side note: Sometimes people are matched on covariates.
What is a possible drawback of matching? (conversation)
\[\\[0.5in]\]
Two-Steps:
Fit propensity scores for treatment (\(A\)) using pre-treatment covariates
Use propensity score to estimate ATE (average treatment effect) of \(A\) on \(Y\)
Assumptions:
Sufficient overlap: For all \(i\) \[0 < P(A=1|C=x_i) < 1\]
No unknown confounders: \(A \perp \!\!\! \perp Y^t | X\) for \(t=0,1\)
Probability of Treatment: For each observation, \(i\), let \[p_i = P(A=1|C=x_i) = P(i \text{ gets treatment}|x_i)\]
the associated weight is \[w_i = \begin{cases} \frac{1}{p_i} & \text{when }A=1 \\ \frac{1}{1-p_i} & \text{when }A=0 \end{cases}\]
Propensity score Theorem \[(Y^0,Y^1) \perp A | X \Rightarrow (Y^0, Y^1) \perp A | w(X)\]
Note: This is saying that getting treatment or not is independent of what the response would have been in a counterfactual setting
Once the weights are estimated, the ATE for a binary treatment is estimated as \[\widehat{\text{ATE}} = \frac{\sum_{i=1}^n A_iY_iw_i}{\sum_{i=1}^n A_iw_i} -\frac{\sum_{i=1}^n (1-A_i)Y_iw_i}{\sum_{i=1}^n (1-A_i)w_i}\]
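A minimal end-to-end sketch of this estimator on simulated data (illustration only, with a single measured confounder and a correctly specified propensity model):
set.seed(1)
n <- 5000
C <- rnorm(n)                                 # measured confounder
A <- rbinom(n, 1, plogis(C))                  # treatment probability depends on C
Y <- 2*A + 3*C + rnorm(n)                     # true causal effect of A on Y is 2
p <- fitted(glm(A ~ C, family = binomial()))  # estimated propensity scores
w <- ifelse(A == 1, 1/p, 1/(1 - p))           # inverse probability weights
mean(Y[A == 1]) - mean(Y[A == 0])             # naive difference, biased by confounding
sum(A*Y*w)/sum(A*w) - sum((1 - A)*Y*w)/sum((1 - A)*w)  # weighted ATE estimate, close to 2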
Why not use regression to control for confounders? From McCaffrey et al. (2013):
By summarizing all pretreatment variables to a single score, propensity scores are an important dimension reduction tool for evaluating treatment effects. This characteristic of propensity scores is particularly advantageous over standard adjustment methods when there exists a potentially large number of pretreatment covariates.
Propensity score methods derive from a formal model for causal inference, the potential outcomes framework, so that causal questions can be well defined and explicitly specified and not conflated with the modeling approach as they are with traditional regression approaches.
Propensity score methods do not require modeling the mean for the outcome. This can help avoid bias from misspecification of that model.
Propensity score methods avoid extrapolating beyond the observed data unlike parametric regression modeling for outcomes, which extrapolate whenever the treatment and control groups are disparate on pretreatment variables.
Propensity score adjustments can be implemented using only the pretreatment covariates and treatment assignments of study participants without any use of the outcomes. This feature of propensity score adjustments is valuable because it eliminates the potential for the choice of model specification for pretreatment variables to be influenced by its impact on the estimated treatment effect.
There is no one standard method for model selection in the context of estimating propensity scores for IPTW for multiple treatments
It’s much more common to use non-parametric methods for estimating probabilities
Note that in this context, we care more about the prediction value than the interpretation of parameter estimates
Let’s compare logistic regression and generalized boosted models (GBM) to estimate observation weights
Question: Why might logistic regression be a bad choice for estimating weights?
\[\\[0.5in]\]
Logistic Regression
prop.mod <- glm(treat ~ age + educ+ black + hispan + married + nodegree + re74 + re75, data=lalonde, family=binomial())
summary(prop.mod)
##
## Call:
## glm(formula = treat ~ age + educ + black + hispan + married +
## nodegree + re74 + re75, family = binomial(), data = lalonde)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7645 -0.4736 -0.2862 0.7508 2.7169
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.729e+00 1.017e+00 -4.649 3.33e-06 ***
## age 1.578e-02 1.358e-02 1.162 0.24521
## educ 1.613e-01 6.513e-02 2.477 0.01325 *
## black 3.065e+00 2.865e-01 10.699 < 2e-16 ***
## hispan 9.836e-01 4.257e-01 2.311 0.02084 *
## married -8.321e-01 2.903e-01 -2.866 0.00415 **
## nodegree 7.073e-01 3.377e-01 2.095 0.03620 *
## re74 -7.178e-05 2.875e-05 -2.497 0.01253 *
## re75 5.345e-05 4.635e-05 1.153 0.24884
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 751.49 on 613 degrees of freedom
## Residual deviance: 487.84 on 605 degrees of freedom
## AIC: 505.84
##
## Number of Fisher Scoring iterations: 5
treat_prop_logistic <- predict(prop.mod, newdata = lalonde[,-1], type = 'response')
lalonde$logistic_prob <- treat_prop_logistic
lalonde$logistic_weight <- ifelse(lalonde$treat, 1/treat_prop_logistic, 1/(1-treat_prop_logistic))  # 1/p_i for treated, 1/(1-p_i) for controls, matching the weight definition above
head(lalonde)
## treat age educ black hispan married nodegree re74 re75 re78 random
## 1 TRUE 37 11 1 0 1 1 0 0 9930.0460 FALSE
## 2 TRUE 22 9 0 1 0 1 0 0 3595.8940 TRUE
## 3 TRUE 30 12 1 0 0 0 0 0 24909.4500 FALSE
## 4 TRUE 27 11 1 0 0 1 0 0 7506.1460 FALSE
## 5 TRUE 33 8 1 0 0 1 0 0 289.7899 TRUE
## 6 TRUE 22 9 1 0 0 1 0 0 4056.4940 FALSE
## logistic_prob logistic_weight
## 1 0.6387699 1.565509
## 2 0.2246342 4.451682
## 3 0.6782439 1.474396
## 4 0.7763241 1.288122
## 5 0.7016387 1.425235
## 6 0.6990699 1.430472
hist(lalonde$logistic_weight)
boxplot(logistic_prob ~ treat, data=lalonde)
Boosting
A problem with logistic regression for estimating weights is that it can be challenging to get a good model fit
Boosting makes this a lot easier and automatic
We already talked about AdaBoost and a little about general boosting
For reference, in AdaBoost:
each weak learner is fitted to the data with observation weights that depend on the previous model’s performance
all of the weak learners are then weighted (scored) according to their performance
For general boosting
Observations are not weighted
We use learners (models) like regressions or trees that can be more complex than a weak learner
Similar to AdaBoost, we must choose the number of sequential learners to use, \(M\)
Each learner tries to predict the error of the previous model
Error: in the literature, the error is the negative gradient of the loss function with respect to the model evaluated at an observation: \[-\frac{\partial L(y_i,f(x_i))}{\partial f(x_i)}\]
For continuous outcomes (squared-error loss), this is \(y_i-f(x_i)\), where \(f\) is the current model
For binary outcomes, this is \(I(y_i=1)-p(x_i)\), where \(p\) is the model’s predicted probability (from, e.g., a logistic regression or decision tree)
For categorical outcomes, this is \(I(y_i=\text{Category }k)-p_k(x_i)\), where \(p_k\) models the probability of category \(k\)
Learning rate: to avoid overfitting, we shrink the contribution of each learner by a factor \(0<\nu<1\)
Trees are the most common learner to use for boosting, usually with between 8 and 32 terminal leaves
Image from geeksforgeeks
Gradient Tree Boosting Algorithm from ESL
Initialize \(f_0(x)=\arg\min_\gamma \sum_{i=1}^n L(y_i,\gamma)\)
For \(m=1\) to \(M\):
compute the pseudo-residuals \(r_{im} = -\left[\frac{\partial L(y_i,f(x_i))}{\partial f(x_i)}\right]_{f=f_{m-1}}\)
fit a regression tree to the \(r_{im}\), giving terminal regions \(R_{jm}\), \(j=1,\dots,J_m\)
for each region compute \(\gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i)+\gamma)\)
update \(f_m(x) = f_{m-1}(x) + \nu\sum_{j=1}^{J_m}\gamma_{jm} I(x\in R_{jm})\)
Output \(\hat f(x)=f_M(x)\)
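A bare-bones version of this loop for a continuous outcome, using small rpart trees as the learners (a sketch for intuition only; this is not how gbm/twang are implemented):
library(rpart)
set.seed(1)
n <- 500
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.3)
d <- data.frame(x = x, y = y)
M <- 100                # number of sequential learners
nu <- 0.1               # learning rate
f <- rep(mean(y), n)    # f_0: the constant minimizing squared error loss
for (m in 1:M) {
  d$r <- y - f                                  # negative gradient for squared error loss
  tree <- rpart(r ~ x, data = d, control = rpart.control(maxdepth = 2, cp = 0))
  f <- f + nu * predict(tree, d)                # shrunken update
}
mean((y - f)^2)         # training error after M boosting rounds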
Generalized Boosting for Propensity Scores
For causal inference, we want to weight the observation to make the treatment and control groups look as if they were randomized
Using inverse probabilities from the boosting model, we want to use as many trees as we need to achieve balance in the covariates
Ideally, we want the distributions of each covariate in the treatment and control groups to be similar (mean, variance, skew, shape, etc)
Clearly this is rarely possible, so we might restrict balance to mean and/or standard deviation for example
Assessing covariate balance with standardized bias estimation \[SB_k = \frac{|\bar X_{k1} - \bar X_{k0}|}{\hat \sigma_k}\]
Generally, standardized mean differences of less than 0.20 are considered small, 0.40 moderate, and 0.60 large
This cutoff can change between fields and between investigators
McCaffrey et al.: \(SB>0.2\) is problematic
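A small helper for this check (a sketch; the std.eff.sz column from twang’s bal.table() below reports a similar quantity): compute the standardized bias for one covariate, optionally after weighting.
# standardized bias for covariate x across treatment groups, with optional weights w
std_bias <- function(x, treat, w = rep(1, length(x))) {
  m1 <- weighted.mean(x[treat == 1], w[treat == 1])
  m0 <- weighted.mean(x[treat == 0], w[treat == 0])
  abs(m1 - m0) / sd(x)   # scaled by the overall SD (one common choice; definitions vary)
}
std_bias(lalonde$re74, lalonde$treat)                            # unweighted
std_bias(lalonde$re74, lalonde$treat, lalonde$logistic_weight)   # after logistic weighting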
Below, the twang R package automatically adds more trees until a stopping rule based on balance is met
twang chooses propensity scores based on the boosted model with the best balance
boosted.mod <- ps(treat ~ age + educ + black + hispan + married + nodegree + re74 + re75,
data=lalonde,
estimand = "ATE",
n.trees = 5000,
interaction.depth=2,
perm.test.iters=0,
verbose=FALSE,
stop.method = c("es.mean"))
summary(boosted.mod)
## n.treat n.ctrl ess.treat ess.ctrl max.es mean.es max.ks
## unw 185 429 185.00000 429.0000 1.3085999 0.4432341 0.6404460
## es.mean.ATE 185 429 60.12492 193.2837 0.5527156 0.1873184 0.2705063
## max.ks.p mean.ks iter
## unw NA 0.2702451 NA
## es.mean.ATE NA 0.1168750 2488
summary(boosted.mod$gbm.obj,
n.trees=boosted.mod$desc$es.mean.ATE$n.trees,
plot=FALSE)
## var rel.inf
## black black 56.02950248
## re74 re74 16.57601323
## age age 16.48551967
## re75 re75 4.22031848
## educ educ 3.17248606
## married married 2.97585135
## nodegree nodegree 0.44543160
## hispan hispan 0.09487713
lalonde$boosted <- get.weights(boosted.mod)
## Warning in get.weights(boosted.mod): No stop.method specified. Using es.mean.ATE
hist(lalonde$boosted)
plot(boosted.mod)
plot(boosted.mod, plots=2)
plot(boosted.mod, plots=3)
bal.table(boosted.mod)
## $unw
## tx.mn tx.sd ct.mn ct.sd std.eff.sz stat p ks
## age 25.816 7.155 28.030 10.787 -0.224 -2.994 0.003 0.158
## educ 10.346 2.011 10.235 2.855 0.042 0.547 0.584 0.111
## black 0.843 0.365 0.203 0.403 1.309 19.371 0.000 0.640
## hispan 0.059 0.237 0.142 0.350 -0.257 -3.413 0.001 0.083
## married 0.189 0.393 0.513 0.500 -0.656 -8.607 0.000 0.324
## nodegree 0.708 0.456 0.597 0.491 0.231 2.716 0.007 0.111
## re74 2095.574 4886.620 5619.237 6788.751 -0.544 -7.254 0.000 0.447
## re75 1532.055 3219.251 2466.484 3291.996 -0.284 -3.282 0.001 0.288
## ks.pval
## age 0.003
## educ 0.081
## black 0.000
## hispan 0.339
## married 0.000
## nodegree 0.081
## re74 0.000
## re75 0.000
##
## $es.mean.ATE
## tx.mn tx.sd ct.mn ct.sd std.eff.sz stat p ks
## age 25.369 7.050 27.478 10.057 -0.213 -2.732 0.006 0.135
## educ 10.679 2.035 10.318 2.682 0.137 1.173 0.241 0.090
## black 0.634 0.483 0.364 0.482 0.553 3.080 0.002 0.271
## hispan 0.117 0.323 0.117 0.322 0.000 0.003 0.998 0.000
## married 0.317 0.467 0.430 0.496 -0.229 -1.318 0.188 0.113
## nodegree 0.593 0.493 0.601 0.490 -0.016 -0.099 0.921 0.008
## re74 3034.535 5240.245 4587.694 6420.822 -0.240 -1.874 0.061 0.168
## re75 1784.285 3374.853 2145.958 3190.482 -0.110 -0.940 0.348 0.151
## ks.pval
## age 0.377
## educ 0.852
## black 0.002
## hispan 1.000
## married 0.604
## nodegree 1.000
## re74 0.152
## re75 0.243
Want \(E[Y^1] - E[Y^0]\)
Treatment population mean estimate for \(t=0,1\): \[\widehat\mu_t = \frac{\sum_{i=1}^n I(T_i=t) Y_i w_i(t)}{\sum_{i=1}^n I(T_i=t) w_i(t)}\]
Estimate \(E[Y^1] - E[Y^0]\) as \[\widehat{\text{ATE}} = \widehat\mu_1-\widehat\mu_0\]
We can use a weighted t-test to evaluate \(\widehat{\text{ATE}}\)
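For example, with the boosted weights stored in lalonde$boosted above, the weighted group means can be computed directly; their difference should match the svyglm treatment coefficient reported below.
mu1 <- with(lalonde, sum(treat * re78 * boosted) / sum(treat * boosted))
mu0 <- with(lalonde, sum((1 - treat) * re78 * boosted) / sum((1 - treat) * boosted))
mu1 - mu0    # estimated ATE of the program on 1978 earnings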
McCaffrey et al. suggest using svyglm, which achieves the same goal
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
design <- svydesign(ids=~1, weights=~boosted, data=lalonde)
glm1 <- svyglm(re78 ~ treat, design=design)
summary(glm1)
##
## Call:
## svyglm(formula = re78 ~ treat, design = design)
##
## Survey design:
## svydesign(ids = ~1, weights = ~boosted, data = lalonde)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6646.9 388.3 17.118 <2e-16 ***
## treatTRUE -720.4 866.5 -0.831 0.406
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 49530170)
##
## Number of Fisher Scoring iterations: 2
This indicates no evidence of a causal effect of being in the work program on earnings
What would we have gotten if we had used linear regression?
summary(lm(re78 ~ treat + age + educ+ black + hispan + married + nodegree + re74 + re75, data=lalonde))
##
## Call:
## lm(formula = re78 ~ treat + age + educ + black + hispan + married +
## nodegree + re74 + re75, data = lalonde)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13595 -4894 -1662 3929 54570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.651e+01 2.437e+03 0.027 0.9782
## treatTRUE 1.548e+03 7.813e+02 1.982 0.0480 *
## age 1.298e+01 3.249e+01 0.399 0.6897
## educ 4.039e+02 1.589e+02 2.542 0.0113 *
## black -1.241e+03 7.688e+02 -1.614 0.1071
## hispan 4.989e+02 9.419e+02 0.530 0.5966
## married 4.066e+02 6.955e+02 0.585 0.5590
## nodegree 2.598e+02 8.474e+02 0.307 0.7593
## re74 2.964e-01 5.827e-02 5.086 4.89e-07 ***
## re75 2.315e-01 1.046e-01 2.213 0.0273 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6948 on 604 degrees of freedom
## Multiple R-squared: 0.1478, Adjusted R-squared: 0.1351
## F-statistic: 11.64 on 9 and 604 DF, p-value: < 2.2e-16
According to this model, the work program boosted post-treatment earnings by $1548 on average, which is borderline significant (p = 0.048)
Question: Which model should we trust?
\[\\[0.5in]\]
Goal: Quantify causal effect of \(A\) on \(Y\)
Below: Let \(A\) be the treatment, \(Y\) the outcome, \(X\) measured confounders, \(W\) non-confounding variables, \(U\) unmeasured confounders
Question: If there are variables like \(U\) (unmeasured confounders), is it possible to use propensity score to determine the causal effect of \(A\) on \(Y\)?
\[\\[0.5in]\]
Question: Assume there are no unmeasured confounders, will using \(W\) and \(X\) to estimate propensity scores give an accurate causal estimate?
\[\\[0.5in]\]
Question: Assume there are no unmeasured confounders, will only using \(X\) to estimate propensity scores give an accurate causal estimate?
\[\\[0.5in]\]
If there are unmeasured confounders, propensity score methods will not give accurate causal estimates for the effect of treatment on outcome
Including non-confounding, pre-treatment variables in the propensity score analysis does not bias causal estimates for the effect of treatment on outcome
Below: Let \(A\) be the treatment, \(Y\) the outcome, \(X\) measured confounders, \(W\) non-confounding variables, \(M\) mediating variables
Question: Will using \(W\), \(X\), and \(M\) to estimate propensity scores give an accurate causal estimate?
\[\\[0.5in]\]
Including a mediating variable in the propensity score analysis will block the causal pathway through that variable
Generally, we do not want to do this but there may be times where it is warranted
Example: Vaccines may influence our behavior but also may prevent sickness
If we want to remove the effect of behavior from the analysis, we can control for it (include it in our propensity score analysis)
Consider the following Bayesian network where the numbers indicate linear parameters, for example \(A = 2 Z + C + \epsilon_A\)
Want to determine the causal effect of \(A\) on \(Y\)
\[\\[0.5in]\]
Question: Regressing \(Y\) on \(A\) (in R, \(Y \sim A\)) would include confounding in the parameter corresponding to \(A\)
\[\\[0.5in]\]
Question: Regressing \(Y\) on \(Z\) (in R, \(Y\sim Z\)) would include confounding in the parameter corresponding to \(Z\)
\[\\[0.5in]\]
\[\\[0.5in]\]
\[\\[0.5in]\]
set.seed(1234)
n_obs <- 100000
Z <- 400 * rnorm(n_obs)             # instrument: affects A but not Y directly and not C
C <- 500 * rnorm(n_obs)             # unmeasured confounder of A and Y
A <- 2 * Z + 1 * C + rnorm(n_obs)   # treatment
Y <- 3 * A + 5 * C + rnorm(n_obs)   # outcome; the true causal effect of A on Y is 3
summary(lm(Y ~ A ))
##
## Call:
## lm(formula = Y ~ A)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9576 -1427 -3 1426 9368
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.496372 6.672603 0.374 0.708
## A 4.398125 0.007078 621.397 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2110 on 99998 degrees of freedom
## Multiple R-squared: 0.7943, Adjusted R-squared: 0.7943
## F-statistic: 3.861e+05 on 1 and 99998 DF, p-value: < 2.2e-16
summary(lm(Y ~ Z))
##
## Call:
## lm(formula = Y ~ Z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18625.1 -2692.4 -18.7 2678.5 16980.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.81128 12.58789 1.018 0.309
## Z 6.02354 0.03149 191.304 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3981 on 99998 degrees of freedom
## Multiple R-squared: 0.2679, Adjusted R-squared: 0.2679
## F-statistic: 3.66e+04 on 1 and 99998 DF, p-value: < 2.2e-16
summary(lm(A ~ Z))
##
## Call:
## lm(formula = A ~ Z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2327.96 -336.31 -2.43 334.94 2122.81
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.603158 1.573487 1.019 0.308
## Z 2.002944 0.003936 508.897 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 497.6 on 99998 degrees of freedom
## Multiple R-squared: 0.7214, Adjusted R-squared: 0.7214
## F-statistic: 2.59e+05 on 1 and 99998 DF, p-value: < 2.2e-16
summary(lm(Y ~ Z))$coefficients[2,1]/summary(lm(A ~ Z))$coefficients[2,1]
## [1] 3.007345
Three steps (as in the code above):
Regress the outcome \(Y\) on the instrument \(Z\)
Regress the treatment \(A\) on the instrument \(Z\)
Estimate the causal effect of \(A\) on \(Y\) as the ratio of the two slope coefficients (a two-stage least squares sketch follows the assumptions below)
Assumptions:
The instrument does not affect any confounders
The instrument only affects the outcome through the treatment
Linearity
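With a single instrument, the ratio above is equivalent to two-stage least squares; a sketch using the simulated Z, A, Y from above (note that the standard errors from this manual second stage are not valid; a dedicated routine such as AER::ivreg handles that):
A_hat <- fitted(lm(A ~ Z))      # first stage: the part of A explained by the instrument
coef(lm(Y ~ A_hat))["A_hat"]    # second stage: slope estimates the causal effect of A on Y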
Minimum wage is the legal minimum that an employer can pay a worker
This topic has a contentious history
The US established a minimum wage during the Great Depression
But, the federal minimum wage is not tied to inflation, so the actual value of minimum wage decreases over time if not increased by politicians
Image from medium.com
In 1992, in a survey of the American Economic Association, 79% of economists responded that having a minimum wage increases unemployment among young and low-skilled workers
Using the classical supply and demand model, the idea driving this reasoning was that if the price of labor is increased, then the employers will buy less labor, resulting in fewer jobs
On the other hand, paying employees meager wages while making considerable profit can seem like exploitation
Does this classical model accurately describe the minimum wage and number of jobs dynamic?
Just like any model, the supply and demand model requires assumptions
Further, consider factories vs restaurants:
In 1992, New Jersey raised its minimum wage from $4.25/hr to $5.05/hr while Pennsylvania kept its minimum wage at $4.25/hr
David Card and Alan Krueger at Princeton University used this opportunity to study employment at fast-food restaurants in both states before the April increase and again several months later
The increase in the wage floor did not lead to jobs being lost in New Jersey; employment in the restaurants they looked at went up
Results in this paper
Some economists (46%) now think that some labor markets are better characterized by a monopsony market structure: one dominant buyer purchases from many sellers, and so is able to lower prices (wages)
By increasing wages, more people may have been willing to work
A later study in 2017, looking at aggregate employment data on average hours and earnings, found that increasing the minimum wage in Seattle to $15 led to reduced worker hours and, subsequently, pay
Another study in 2018 in Seattle looking at tax data found that after the increase, workers gained earnings on a weekly basis but only because they worked more jobs
These studies show the impact that data can have on policy
This material was taken from this article
Using natural randomization: In 1973 the Eldfell volcano in Iceland on the island of Heimaey erupted, destroying about 400 homes.
The Icelandic government compensated those who lost their homes; many never returned.
An economics paper showed that, among people less than 25 years old at the time of the eruption, those who had moved averaged four more years of schooling and earned $27,000 more per year than those from families who had kept their home.
Because losing a home was naturally randomized, this paper used instrumental variables
The treatment was moving away and the outcome was later earnings
Losing home \(\rightarrow\) moving away \(\rightarrow\) later earnings
Instrument \(\rightarrow\) Treatment \(\rightarrow\) Effect
Instrument can only influence effect through treatment (no direct arrow from instrument to effect)
We are not going to focus on instrumental variables analysis here
Used when there is a natural experiment
There are times when researchers have a good understanding of the causal relationships between variables
When this is the case, this information can (and should) be used to create the model
Consider the game of baseball and prediction
Question: Assuming linearity holds in the graph above, which methods would accurately model \(Y\)? (choose all that apply)
\[\\[0.5in]\]
Question: Again assuming linearity holds in the graph above, which model would be best for accurately modeling \(Y\) (choose one)
\[\\[0.5in]\]
Now, we want to quantify the causal impact of \(A\) on \(Y\) without assuming linearity
Question: Would using \(C1, C2, C3, C4, C5\) for propensity score weighting accurately quantify the effect of \(A\) on \(Y\)?
\[\\[0.5in]\]
Question: Would using \(C1, C2, C3, C4\) for propensity score weighting accurately quantify the effect of \(A\) on \(Y\)?
\[\\[0.5in]\]
Question: Would using \(C1, C4\) for propensity score weighting accurately quantify the effect of \(A\) on \(Y\)?
\[\\[0.5in]\]
Question: Would using \(C1, C2, C4\) for propensity score weighting accurately quantify the effect of \(A\) on \(Y\)?
\[\\[0.5in]\] - Question: Would using \(C5\) as an instrumental variable accurately quantify the effect of \(A\) on \(Y\)?
\[\\[0.5in]\] - Question: If we want to predict \(Y\), which variables do we not need?
\[\\[0.5in]\]
We want to isolate the causal effect of A on Y
Would conditioning on \(C2\) only accurately quantify the causal effect of A on Y?
\[\\[0.5in]\]
Would conditioning on \(C3\) accurately quantify the causal effect of A on Y?
\[\\[0.5in]\]
Would conditioning on \(C1\) and \(D1\) accurately quantify the causal effect of A on Y?
\[\\[0.5in]\]
Would conditioning on \(C1\) and \(D2\) accurately quantify the causal effect of A on Y?
\[\\[0.5in]\]
Would conditioning on \(C1\) and \(M2\) accurately quantify the causal effect of A on Y?
\[\\[0.5in]\]
Researchers frequently use directed acyclic graphs (DAG) to visualize causal relationships between variables in a dataset
directed - all edges in graph have direction (all edges are arrows)
acyclic - arrow directions do not create a loop
DAGs are sometimes called Bayesian networks
DAG below
A, B, C, D, E, F, G, H, I represent variables in a dataset
Arrows indicate a direct causal relationship
A is the parent of D
D is the child of A
B is an ancestor of I
I is a descendant of B
Here: we are interested in paths between variables
Note: paths can have arrows pointing either direction
One path between D and F is \[D\leftarrow A\rightarrow E \rightarrow G \leftarrow F\]
Another is \[D \rightarrow H \leftarrow G \rightarrow I \leftarrow F\]
These are of interest if we want to know if D and F are associated or causally linked
Somewhat technical assumption: in general, we assume arrows between nodes indicate a causal relationship
See a text on Probabilistic Graphical Models for more detail
The joint distribution can be factored as \[P(A, B, C, D, E) = P(A) P(B|A) P(C) P(D|B,C) P(E|C)\]
For each variable, only need to condition on its immediate parents
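As a sketch of what this factorization means operationally, we can forward-sample the network one variable at a time given only its parents (a hypothetical linear-Gaussian version; the notes do not specify the actual conditional distributions):
set.seed(1)
n <- 1000
A <- rnorm(n)              # P(A)
B <- A + rnorm(n)          # P(B | A)
C <- rnorm(n)              # P(C)
D <- B + C + rnorm(n)      # P(D | B, C)
E <- C + rnorm(n)          # P(E | C)
cor(A, E)                  # near zero: the only path A -> B -> D <- C -> E is blocked by the collider D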
DAGs give information about conditional independence relationships among variables
This DAG has one path from A to E: \(A \rightarrow B \rightarrow D \leftarrow C \rightarrow E\)
How can we use this DAG to determine statistical associations between variables/nodes?
With 3 nodes:
| Structure | Path type | \(A,C\) independent/dependent? | \(A,C\) conditionally independent/dependent given \(B\) |
|---|---|---|---|
| Chain | \(A \rightarrow B \rightarrow C\) | \(A\not\!\perp \!\!\! \perp C\) | \(A\perp \!\!\! \perp C|B\) |
| Other Chain | \(A \leftarrow B \leftarrow C\) | \(A\not\!\perp \!\!\! \perp C\) | \(A\perp \!\!\! \perp C|B\) |
| Fork (Confounder) | \(A \leftarrow B \rightarrow C\) | \(A\not\!\perp \!\!\! \perp C\) | \(A\perp \!\!\! \perp C|B\) |
| Collider | \(A \rightarrow B \leftarrow C\) | \(A\perp \!\!\! \perp C\) | \(A\not\!\perp \!\!\! \perp C|B\) |
Rule: If a path has no colliders, then yes and we say that the path is open. If there is a collider on the path, then no and we say that the path is blocked
What about conditioning?
Questions:
Rule: Conditioning on a collider opens a path. Conditioning on a non-collider blocks a path
Note: A variable/node can be a collider on one path but not on another.
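A quick simulated check of these rules (a sketch): two independent variables that share a common effect are marginally independent, but regressing on (conditioning on) the collider induces a dependence.
set.seed(1)
n <- 2000
A <- rnorm(n)
C <- rnorm(n)                        # A and C are independent
B <- A + C + rnorm(n)                # B is a collider: A -> B <- C
summary(lm(A ~ C))$coefficients      # C is not significant (path blocked)
summary(lm(A ~ C + B))$coefficients  # C becomes significant (conditioning on B opens the path)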
How can we show that if \(D \leftarrow C \rightarrow E\) then \(D \perp E | C\)?
We use Bayesian networks because they help us to quickly understand marginal and conditional independence between the variables
\(X\) and \(Y\) are independent iff (def) \(\mathbb P(X, Y) = \mathbb P(X) \mathbb P(Y)\)
\(X\) and \(Y\) are conditionally independent given \(Z\) iff (def) \(\mathbb P(X, Y |Z) = \mathbb P(X | Z) \mathbb P(Y | Z)\)
\(D \perp E | C\) if and only if \(P(D,E | C) = P(D|C) P(E|C)\) \[P(D,E|C) = \frac{P(D,E,C)}{P(C)} = \frac{P(C)P(D|C)P(E|C)}{P(C)} = P(D|C)P(E|C)\]
What about \(A \rightarrow B \rightarrow C\)?
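A sketch of the analogous argument for the chain, using the factorization \(P(A,B,C) = P(A)P(B|A)P(C|B)\): \[P(A,C|B) = \frac{P(A,B,C)}{P(B)} = \frac{P(A)P(B|A)P(C|B)}{P(B)} = \frac{P(A,B)}{P(B)}P(C|B) = P(A|B)P(C|B),\] so \(A \perp \!\!\! \perp C | B\).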
Markov Assumption
We are implicitly using the Markov assumption here
Specifically, we are assuming that different paths won’t perfectly cancel out (this no-cancellation condition is often called faithfulness)
I want a good score on my test so I continue studying
With the additional studying, I sleep less
With less sleep, my performance isn’t as good
With the extra studying, I know the material better
Can my lack of sleep and extra studying perfectly cancel out?
Markov assumption says no
Causal discovery is the process of estimating the causal network from the data
Assume we have a dataset with variables \(A, B, C, D, E\) where we do not know the associated Bayesian network
Question: If \(A \perp \!\!\! \perp B\) can \(A\) and \(B\) be directly connected in the unknown Bayesian network?
\[\\[0.5in]\]
Question: If \(A \perp \!\!\! \perp B | \{\text{other variables}\}\) can \(A\) and \(B\) be directly connected in the unknown Bayesian network?
\[\\[0.5in]\]
The PC algorithm exploits this fact by running marginal and conditional independence tests
The algorithm begins by creating a node for each variable and connecting each variable to every other variable to form a complete graph
\[\\[0.5in]\]
Next, for each edge (i.e., each pair of connected variables), run an independence test
If significantly associated, leave edge in place
If not significant, remove edge
Next, for each remaining edge, run a conditional independence test, conditioning on one variable at a time
To make the algorithm run faster, conditioning variables are chosen from neighboring variables
Again, remove edges when there is a non-significant test
Next condition on two variables at a time
Stop when the size of conditioning set is larger than number of neighbors
This undirected structure is sometimes called the skeleton
Next, search for non-looping paths of 3 variables
If the triple must be a collider (a v-structure), orient the edges into the middle node
Look for \(Y\) structures
Create complete, undirected graph
Pairwise Conditional Independence Testing
library(pcalg)
library(Rgraphviz)
## Loading required package: graph
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, cbind, colnames, do.call,
## duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect,
## is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax,
## pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames,
## sapply, setdiff, sort, table, tapply, union, unique, unsplit
data(lalonde)
pc.fit <- pc(suffStat = list(C = cor(lalonde), n = dim(lalonde)[1]), indepTest = gaussCItest, ## indep.test: partial correlations
alpha=0.01, labels = names(lalonde), verbose = FALSE)
plot(pc.fit, main="Lalonde Estimated Bayesian Network")
| | Total (n) | Treatment (n1) | Control (n2) | p-value |
|---|---|---|---|---|
| Covariate 1 | mean (sd) | mean (sd) | mean (sd) | <0.001 |
| Covariate 2 | % | % | % | 0.34 |
| Covariate 3 | median (q1, q3) | median (q1, q3) | median (q1, q3) | 0.02 |