Association vs Causation

\[\\[0.5in]\]

\[\\[0.5in]\]

\[\\[0.5in]\]

\[\\[0.5in]\]

\[\\[0.5in]\]

Simpson’s paradox

##        treatment symptoms mortality
## 687 experimental   severe     alive
## 239 experimental   severe      died
## 530      placebo     mild      died
## 846      placebo     mild     alive
## 346 experimental     mild      died
## 634      placebo     mild     alive
## 509      placebo     mild     alive
## 475 experimental   severe     alive
## 213      placebo   severe      died
## 431      placebo     mild     alive
## 632 experimental     mild     alive
## 630 experimental   severe     alive
## 318      placebo   severe      died
## 991      placebo     mild     alive
## 648      placebo   severe     alive
## 880      placebo     mild     alive
## 136 experimental   severe      died
## 655 experimental   severe     alive
## 497      placebo     mild     alive
## 358      placebo     mild     alive
## 
## experimental      placebo 
##          487          513
## 
##   mild severe 
##    514    486
## 
## alive  died 
##   754   246
table(df$treatment, df$mortality)[,2]/table(df$treatment)
## 
## experimental      placebo 
##    0.2648871    0.2280702

\[\\[0.5in]\]

table(df$treatment, df$symptoms, df$mortality)[,,2]/table(df$treatment, df$symptoms)
##               
##                      mild     severe
##   experimental 0.07352941 0.33903134
##   placebo      0.16402116 0.40740741

\[\\[0.5in]\]

table(df$treatment, df$symptoms)
##               
##                mild severe
##   experimental  136    351
##   placebo       378    135
chisq.test(df$treatment, df$symptoms)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  df$treatment and df$symptoms
## X-squared = 207.58, df = 1, p-value < 2.2e-16
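
A minimal sketch of how a dataset like df above could be simulated (the probabilities here are made up for illustration): severity drives both treatment assignment and mortality, so the marginal death rates can reverse the stratified ones.

# hypothetical Simpson's-paradox simulation (illustrative probabilities only)
set.seed(1)
n <- 1000
symptoms  <- ifelse(runif(n) < 0.5, "severe", "mild")
p_treat   <- ifelse(symptoms == "severe", 0.7, 0.27)        # severe cases more often treated
treatment <- ifelse(runif(n) < p_treat, "experimental", "placebo")
p_die     <- ifelse(symptoms == "severe", 0.40, 0.15) -
             ifelse(treatment == "experimental", 0.07, 0)   # treatment helps within each stratum
mortality <- ifelse(runif(n) < p_die, "died", "alive")
sim <- data.frame(treatment, symptoms, mortality)

table(sim$treatment, sim$mortality)[, "died"] / table(sim$treatment)   # marginal death rates
table(sim$treatment, sim$symptoms, sim$mortality)[, , "died"] /
  table(sim$treatment, sim$symptoms)                                   # stratified death rates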

Confounding

Confounding Simulations

  • Diagram above: \(X\) and \(Y\) are confounded by \(C\) and have no direct causal relationship

  • \(X = 2C + \epsilon_X \Rightarrow C = 0.5X - 0.5\epsilon_X\), so \(Y = 3C + \text{noise} \approx 0.5 \cdot 3X + \text{noise} = 1.5X + \text{noise}\)

size <- 500
C <- 10*runif(size)
X <- 2*C + rnorm(size)
Y <- 3*C + rnorm(size)

summary(lm(Y~X))
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6491 -1.2526 -0.0418  1.2540  4.1460 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.6102     0.1522    4.01 7.01e-05 ***
## X             1.4342     0.0131  109.52  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.74 on 498 degrees of freedom
## Multiple R-squared:  0.9601, Adjusted R-squared:  0.9601 
## F-statistic: 1.199e+04 on 1 and 498 DF,  p-value: < 2.2e-16
summary(lm(Y~X+C))
## 
## Call:
## lm(formula = Y ~ X + C)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.81847 -0.63492  0.02683  0.65114  2.43640 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03361    0.08757   0.384    0.701    
## X           -0.03018    0.04536  -0.665    0.506    
## C            3.04750    0.09315  32.717   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.981 on 497 degrees of freedom
## Multiple R-squared:  0.9874, Adjusted R-squared:  0.9873 
## F-statistic: 1.941e+04 on 2 and 497 DF,  p-value: < 2.2e-16

  • Diagram above: \(X\) influences \(Y\) but the effect is confounded by \(C\). That is, failing to account for \(C\) in a regression will lead to an incorrect causal parameter estimate

  • parameter = (confounding bias) + (causal effect) = 1.5 + 5 = 6.5

Y <- 3*C + 5*X + rnorm(size)

summary(lm(Y~X))
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.261 -1.139 -0.012  1.183  5.458 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.65110    0.14861   4.381 1.44e-05 ***
## X            6.43596    0.01279 503.296  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.7 on 498 degrees of freedom
## Multiple R-squared:  0.998,  Adjusted R-squared:  0.998 
## F-statistic: 2.533e+05 on 1 and 498 DF,  p-value: < 2.2e-16
summary(lm(Y~X+C))
## 
## Call:
## lm(formula = Y ~ X + C)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2887 -0.6351 -0.0185  0.6507  2.9811 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.09275    0.08704   1.066    0.287    
## X            5.01797    0.04509 111.295   <2e-16 ***
## C            2.95101    0.09258  31.875   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.975 on 497 degrees of freedom
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9994 
## F-statistic: 3.853e+05 on 2 and 497 DF,  p-value: < 2.2e-16
  • Note: the covariate causal-effect math only works here because the simulated data are linear. In the real world, we typically can’t assume linearity, and this math doesn’t work if the true relationships aren’t linear.

  • In most real-world datasets, there will always be the possibility of latent confounding

Sampling Bias

size <- 100
locomotor <- 10*runif(size)
respiratory <- 10*runif(size)
hospitalized <- rbinom(size, 1, (locomotor+respiratory)/20)==0
df_all <- data.frame(locomotor, respiratory, hospitalized)
library(ggplot2)
ggplot(df_all[df_all$hospitalized, ], aes(respiratory, locomotor)) + geom_point() + 
  geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

ggplot(df_all, aes(respiratory, locomotor)) + geom_point(aes(color=hospitalized)) +
  stat_smooth(method=lm, se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

Statistical Dependency and Causation

  • If \(A\) and \(B\) are random variables and associated (dependent), \(A\not\perp B\), it’s thought that there are three ways this can happen

    • Direct causation: \(A \rightarrow B\) or \(B \rightarrow A\)

    • \(A\) and \(B\) share an unmeasured confounder:

    • Collider bias: \(A\) and \(B\) both influence sampling

    • If we only have data for \(A\) and \(B\), we cannot distinguish between these cases without more assumptions

  • If there are more variables, we can figure out more using conditional independence

    • Think about whether knowing one variable gives information about another
    • Variables: Drank too much alcohol last night, Woke up with a hangover, Woke up with shoes on

    • Hangover\(\perp\)Shoes?
    • Hangover\(\perp\)Shoes\(|\)Alcohol?
      • Given that someone drank too much alcohol last night, does knowing that they woke up with their shoes on give any information about if they have a hangover?
    • Variables: Burglary, Earthquake, Home motion detector

    • Burglary\(\perp\)Earthquake
    • Burglary\(\perp\)Earthquake\(|\)Detector?
      • Given that the home motion detector alerted, does knowing that there was no earthquake give any information about a burglary?
| Structure         | Path type                         | \(A,C\) independent/dependent? | \(A,C\) given \(B\)?        |
|-------------------|-----------------------------------|--------------------------------|-----------------------------|
| Chain             | \(A \rightarrow B \rightarrow C\) | \(A\not\perp C\)               | \(A\perp C \mid B\)         |
| Other Chain       | \(A \leftarrow B \leftarrow C\)   | \(A\not\perp C\)               | \(A\perp C \mid B\)         |
| Fork (Confounder) | \(A \leftarrow B \rightarrow C\)  | \(A\not\perp C\)               | \(A\perp C \mid B\)         |
| Collider          | \(A \rightarrow B \leftarrow C\)  | \(A\perp C\)                   | \(A\not\perp C \mid B\)     |
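
  • The fork and collider rows of this table can be checked by simulation; below is a small sketch with made-up linear relationships, using residual correlations as a stand-in for conditioning on \(B\)

# simulation check of the fork and collider rows above (illustrative only)
set.seed(42)
n <- 10000

# Fork: A <- B -> C
B <- rnorm(n)
A <- B + rnorm(n)
C <- B + rnorm(n)
cor(A, C)                                # marginally dependent (clearly nonzero)
cor(resid(lm(A ~ B)), resid(lm(C ~ B)))  # ~0: independent given B

# Collider: A -> B <- C
A <- rnorm(n)
C <- rnorm(n)
B <- A + C + rnorm(n)
cor(A, C)                                # ~0: marginally independent
cor(resid(lm(A ~ B)), resid(lm(C ~ B)))  # clearly nonzero: dependent given B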

GLMs, Causal Inference, and Prediction

Question: If I want to use inference (causal or otherwise) to quantify the effect of \(X\) on \(Y\), which regression model should I use?

\[\\[0.5in]\]

Causal Inference

  • If we do not control for all confounding variables, any association between \(X\) and \(Y\) present through the confounder pathway (red in image below) will likely be included in the parameter quantified by a model
  • In any inference, we want to quantify the true causal parameter
    • Causal inference models make this easier by relaxing some assumptions

  • The assumption of linear relationships within the data is likely not true (or at least hard to verify)
  • Causal inference models make it possible to control for confounding variables in a non-linear setting
  • Goal: Estimate a causal parameter of interest (when there are not linear relationships)
    • Causal parameters in our simulations are the parameters in the Bayesian network (DAG)
    • Causal parameter interpretation: if we were to change \(X\), we should be able to use the causal parameter to accurately predict \(Y\)
    • Recall: when we did not control for confounding variables, the parameter estimates did not match the DAG
  • Assumptions: no latent confounding (we have all confounding variables included in the data)

Prediction

  • use all useful information in data to make a prediction
  • we can use causes and effects
  • we also want to use colliders because they provide information
  • General rule: use Markov Blanket
  • For prediction, all variables in the Markov blanket are useful, but not all of them are causal (see the short simulation below)
    • If we change causes, we should see changes in the outcome
    • If we change collider variables, we should not
  • Note: Prediction methods try to find all variables in Markov blanket
    • This will depend on signal/noise ratio and model assumptions
  • Given Markov Blanket, all other variables are conditionally independent from outcome
  • If we know causal Bayesian network underlying a dataset, we should include all variables in Markov blanket and no others
    • Including variables not in Markov Blanket will be noise variables (given the Markov Blanket)

Markov Blanket Image from Wikipedia

In the image above, the Markov Blanket of \(A\) is the set of white variables within the large circle
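
  • A small sketch of this point under a made-up linear DAG, \(X \rightarrow Y \rightarrow D \leftarrow W\): the collider \(D\) is in the Markov blanket of the outcome \(Y\), so it improves prediction, but adjusting for it distorts the coefficient on the cause \(X\)

# sketch: X -> Y -> D <- W (made-up linear DAG); D is in Y's Markov blanket
set.seed(7)
n <- 1000
X <- rnorm(n)                      # cause of Y
Y <- 2*X + rnorm(n)                # outcome
W <- rnorm(n)                      # another independent cause of D
D <- Y + W                         # collider between Y and W

summary(lm(Y ~ X))$r.squared       # prediction from the cause alone
summary(lm(Y ~ X + D))$r.squared   # adding D improves prediction
coef(lm(Y ~ X))["X"]               # close to 2, the causal effect in this DAG
coef(lm(Y ~ X + D))["X"]           # attenuated: no longer the causal effect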

Questions:

In the graph above, \(D\) and \(G\) are associated (\(D \not\!\perp \!\!\! \perp G\))

  • A. True
  • B. False
  • C. Not sure

\[\\[0.5in]\] In the graph above, \(D \perp \!\!\! \perp G |E\)

  • A. True
  • B. False
  • C. Not sure

\[\\[0.5in]\] In the graph above, \(D \perp \!\!\! \perp G |E, H\)

  • A. True
  • B. False
  • C. Not sure

\[\\[0.5in]\]

In the graph above, which variables should be used for inference on \(G\) as the outcome? (text using capitals, no space, no comma, alphabetical order)

\[\\[0.5in]\]

In the graph above, which variables should be used for predicting \(G\)? (text using capitals, no space, no comma, alphabetical order)

\[\\[0.5in]\] Using the graph above, which variables should be significant in the following regression \(G\sim A+B+C+D+E+F+H+I\) (text using capitals, no space, no comma, alphabetical order)

Evaluating Work Training Programs

library(ggplot2)
library(twang)   # provides the lalonde data and the ps() function used later
data(lalonde)
dim(lalonde)
## [1] 614  10
names(lalonde)
##  [1] "treat"    "age"      "educ"     "black"    "hispan"   "married" 
##  [7] "nodegree" "re74"     "re75"     "re78"
table(lalonde$treat)
## 
##   0   1 
## 429 185
lalonde$treat <- ifelse(lalonde$treat == 1, TRUE, FALSE)

#age
tapply(lalonde$age, lalonde$treat, mean)
##    FALSE     TRUE 
## 28.03030 25.81622
t.test(lalonde$age ~ lalonde$treat)
## 
##  Welch Two Sample t-test
## 
## data:  lalonde$age by lalonde$treat
## t = 2.9911, df = 510.57, p-value = 0.002914
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  0.7598127 3.6683610
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            28.03030            25.81622
ggplot(lalonde, aes(x=age, fill=treat)) +  geom_density(alpha=0.25)

#educ
tapply(lalonde$educ, lalonde$treat, mean)
##    FALSE     TRUE 
## 10.23543 10.34595
t.test(lalonde$educ ~ lalonde$treat)
## 
##  Welch Two Sample t-test
## 
## data:  lalonde$educ by lalonde$treat
## t = -0.54676, df = 485.37, p-value = 0.5848
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  -0.5076687  0.2866393
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            10.23543            10.34595
ggplot(lalonde, aes(x=educ, fill=treat)) + geom_density(alpha=0.25)

#black
table(lalonde$black, lalonde$treat)
##    
##     FALSE TRUE
##   0   342   29
##   1    87  156
chisq.test(lalonde$black, lalonde$treat)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  lalonde$black and lalonde$treat
## X-squared = 219.04, df = 1, p-value < 2.2e-16
ggplot(lalonde, aes(x=black, fill=treat)) + geom_bar(position = 'dodge')

#hispan
table(lalonde$hispan, lalonde$treat)
##    
##     FALSE TRUE
##   0   368  174
##   1    61   11
chisq.test(lalonde$hispan, lalonde$treat)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  lalonde$hispan and lalonde$treat
## X-squared = 7.7664, df = 1, p-value = 0.005323
ggplot(lalonde, aes(x=hispan, fill=treat)) +  geom_bar(position = 'dodge')

#married
table(lalonde$married, lalonde$treat)
##    
##     FALSE TRUE
##   0   209  150
##   1   220   35
chisq.test(lalonde$married, lalonde$treat)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  lalonde$married and lalonde$treat
## X-squared = 54.428, df = 1, p-value = 1.613e-13
ggplot(lalonde, aes(x=married, fill=treat)) +  geom_bar(position = 'dodge')

#nodegree
table(lalonde$nodegree, lalonde$treat)
##    
##     FALSE TRUE
##   0   173   54
##   1   256  131
chisq.test(lalonde$nodegree, lalonde$treat)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  lalonde$nodegree and lalonde$treat
## X-squared = 6.4107, df = 1, p-value = 0.01134
ggplot(lalonde, aes(x=nodegree, fill=treat)) +  geom_bar(position = 'dodge')

#re74
tapply(lalonde$re74, lalonde$treat, mean)
##    FALSE     TRUE 
## 5619.237 2095.574
t.test(lalonde$re74 ~ lalonde$treat)
## 
##  Welch Two Sample t-test
## 
## data:  lalonde$re74 by lalonde$treat
## t = 7.2456, df = 475.99, p-value = 1.748e-12
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  2568.067 4479.258
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            5619.237            2095.574
ggplot(lalonde, aes(x=re74, fill=treat)) +  geom_density(alpha=0.25)

#re75
tapply(lalonde$re75, lalonde$treat, mean)
##    FALSE     TRUE 
## 2466.484 1532.055
t.test(lalonde$re75 ~ lalonde$treat)
## 
##  Welch Two Sample t-test
## 
## data:  lalonde$re75 by lalonde$treat
## t = 3.2776, df = 356.22, p-value = 0.00115
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##   373.742 1495.116
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            2466.484            1532.055
ggplot(lalonde, aes(x=re75, fill=treat)) +  geom_density(alpha=0.25)

\[\\[0.5in]\]

\[\\[0.5in]\]

Quantifying Causal Effect with Counterfactuals

##   A Y Y0 Y1
## 1 0 0  0 NA
## 2 0 0  0 NA
## 3 0 0  0 NA
## 4 0 0  0 NA
## 5 1 0 NA  0
## 6 1 1 NA  1
## 7 1 1 NA  1
## 8 1 1 NA  1
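
A small sketch reconstructing the table above in R and computing the naive difference in observed group means, which equals the ATE only when treatment is free of confounding (e.g., randomized).

# potential-outcomes bookkeeping: only one of (Y0, Y1) is ever observed per unit
A  <- c(0, 0, 0, 0, 1, 1, 1, 1)
Y0 <- c(0, 0, 0, 0, NA, NA, NA, NA)   # unobserved for treated units
Y1 <- c(NA, NA, NA, NA, 0, 1, 1, 1)   # unobserved for control units
Y  <- ifelse(A == 1, Y1, Y0)          # the observed outcome
data.frame(A, Y, Y0, Y1)

# naive estimate: difference in observed means
mean(Y[A == 1]) - mean(Y[A == 0])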

\[\\[0.5in]\]

Randomized Controlled Experiments or Trials

set.seed(1234)
lalonde$random <- sample(rep(c(TRUE, FALSE), length.out=dim(lalonde)[1]))
ggplot(lalonde, aes(x=treat, fill=random)) + geom_bar(position = 'dodge')

ggplot(lalonde, aes(x=age, fill=random)) +  geom_density(alpha=0.25)

ggplot(lalonde, aes(x=black, fill=random)) + geom_bar(position = 'dodge')

ggplot(lalonde, aes(x=hispan, fill=random)) + geom_bar(position = 'dodge')

ggplot(lalonde, aes(x=married, fill=random)) + geom_bar(position = 'dodge')

ggplot(lalonde, aes(x=nodegree, fill=random)) + geom_bar(position = 'dodge')

ggplot(lalonde, aes(x=re74, fill=random)) +  geom_density(alpha=0.25)

ggplot(lalonde, aes(x=re78, fill=random)) +  geom_density(alpha=0.25)

Questions:

What type of bias does causal inference attempt to eliminate? (select all that apply)

\[\\[0.5in]\]

Randomizing treatments ensures there is no (select all that apply)

\[\\[0.5in]\] Average treatment effect (ATE) quantifies what? (text/conversation)

\[\\[0.5in]\]

If \(Y\) is the outcome and \(A\) the treatment, how can one interpret \(E[Y^1 | A=0]\)?

\[\\[0.5in]\]

A UM researcher wants to understand the impact of a low-calorie vs a low-fat/low-carb diet on weight loss among UM students.

If she advertises enrollment for her study among UM athletes and follows each participant, collecting diet information and weight over time, what are possible sources of statistical association between diet and weight loss in her data? (select all that apply)

\[\\[0.5in]\]

If she advertises enrollment for her study among UM athletes, randomizes diet type for each participant, and collects weight over time, what are possible sources of statistical association between diet and weight loss in her data? (select all that apply)

\[\\[0.5in]\] Given that there is confounding, how can we remove its bias from our parameter estimate? (select all that apply)

\[\\[0.5in]\]

Observation weighting when treatment not randomized

table(df$treatment, df$symptoms)
##               
##                mild severe
##   experimental  136    351
##   placebo       378    135
chisq.test(df$treatment, df$symptoms)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  df$treatment and df$symptoms
## X-squared = 207.58, df = 1, p-value < 2.2e-16
ggplot(df, aes(x=symptoms, fill=treatment)) +  geom_bar(position = 'dodge')

\[\\[0.5in]\]

\[\\[0.5in]\]

\[\\[0.5in]\]

\[\\[0.5in]\]

Propensity Score Method

Two steps:

  1. Fit propensity scores for treatment (\(A\)) using pre-treatment covariates

  2. Use propensity score to estimate ATE (average treatment effect) of \(A\) on \(Y\)

Assumptions:

  1. Sufficient overlap: For all \(i\) \[0 < P(A=1|X=x_i) < 1\]

  2. No unknown confounders: \(A \perp \!\!\! \perp Y^t | X\) for \(t=0,1\)

Probability of Treatment: For each observation, \(i\), let \[p_i = P(A=1|X=x_i) = P(i \text{ gets treatment}|x_i)\]

the associated weight is \[w_i = \begin{cases} \frac{1}{p_i} & \text{when }A=1 \\ \frac{1}{1-p_i} & \text{when }A=0 \end{cases}\]

Propensity score Theorem \[(Y^0,Y^1) \perp A | X \Rightarrow (Y^0, Y^1) \perp A | w(X)\]

  • Note: This is saying that, once we condition on the weights (propensity scores), receiving treatment or not is independent of what the response would have been in either counterfactual setting

  • Once the weights are estimated, the ATE for a binary treatment is estimated as \[\widehat{\text{ATE}} = \frac{\sum_{i=1}^n A_iY_iw_i}{\sum_{i=1}^n A_iw_i} -\frac{\sum_{i=1}^n (1-A_i)Y_iw_i}{\sum_{i=1}^n (1-A_i)w_i}\] (a short sketch of this estimator appears after the list below)

  • Why not use regression to control for confounders? From McCaffrey et al. (2013):

  1. By summarizing all pretreatment variables to a single score, propensity scores are an important dimension reduction tool for evaluating treatment effects. This characteristic of propensity scores is particularly advantageous over standard adjustment methods when there exists a potentially large number of pretreatment covariates.

  2. Propensity score methods derive from a formal model for causal inference, the potential outcomes framework, so that causal questions can be well defined and explicitly specified and not conflated with the modeling approach as they are with traditional regression approaches.

  3. Propensity score methods do not require modeling the mean for the outcome. This can help avoid bias from misspecification of that model.

  4. Propensity score methods avoid extrapolating beyond the observed data unlike parametric regression modeling for outcomes, which extrapolate whenever the treatment and control groups are disparate on pretreatment variables.

  5. Propensity score adjustments can be implemented using only the pretreatment covariates and treatment assignments of study participants without any use of the outcomes. This feature of propensity score adjustments is valuable because it eliminates the potential for the choice of model specification for pretreatment variables to be influenced by its impact on the estimated treatment effect.
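
  • A minimal sketch of the weighted ATE estimator written above (here \(A\) is a 0/1 treatment vector, \(Y\) the outcome, and \(w\) the weights \(w_i\); the commented usage assumes propensity scores have already been estimated)

# IPTW estimate of the ATE, following the formula above
ate_iptw <- function(A, Y, w) {
  sum(A * Y * w) / sum(A * w) - sum((1 - A) * Y * w) / sum((1 - A) * w)
}

# hypothetical usage once propensity scores p are available:
# p <- predict(prop.mod, type = "response")
# w <- ifelse(A == 1, 1 / p, 1 / (1 - p))
# ate_iptw(A, Y, w)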

Estimating Weights

  • There is no one standard method for model selection in the context of estimating propensity scores for IPTW for multiple treatments

  • It’s much more common to use non-parametric methods for estimating probabilities

  • Note that in this context, we care more about the prediction value than the interpretation of parameter estimates

  • Let’s compare logistic regression and generalized boosted models (GBM) to estimate observation weights

  • Question: Why might logistic regression be a bad choice for estimating weights?

    • A: Not cross-validated
    • B: Assumes \(\text{logit}(E[Y|X])\) is a linear function of parameters
    • C: It’s difficult to test many possible interactions
    • D: B and C
    • E: All of the above

\[\\[0.5in]\]

Logistic Regression

  • This example is for teaching purposes; it is generally recommended not to use logistic regression for estimating propensity scores

prop.mod <- glm(treat ~ age + educ + black + hispan + married + nodegree + re74 + re75, data=lalonde, family=binomial())
summary(prop.mod)
## 
## Call:
## glm(formula = treat ~ age + educ + black + hispan + married + 
##     nodegree + re74 + re75, family = binomial(), data = lalonde)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7645  -0.4736  -0.2862   0.7508   2.7169  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.729e+00  1.017e+00  -4.649 3.33e-06 ***
## age          1.578e-02  1.358e-02   1.162  0.24521    
## educ         1.613e-01  6.513e-02   2.477  0.01325 *  
## black        3.065e+00  2.865e-01  10.699  < 2e-16 ***
## hispan       9.836e-01  4.257e-01   2.311  0.02084 *  
## married     -8.321e-01  2.903e-01  -2.866  0.00415 ** 
## nodegree     7.073e-01  3.377e-01   2.095  0.03620 *  
## re74        -7.178e-05  2.875e-05  -2.497  0.01253 *  
## re75         5.345e-05  4.635e-05   1.153  0.24884    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 751.49  on 613  degrees of freedom
## Residual deviance: 487.84  on 605  degrees of freedom
## AIC: 505.84
## 
## Number of Fisher Scoring iterations: 5
treat_prop_logistic <- predict(prop.mod, newdata = lalonde[,-1], type = 'response')
lalonde$logistic_prob <- treat_prop_logistic
lalonde$logistic_weight <- ifelse(lalonde$treat, 1/treat_prop_logistic, 1/(1-treat_prop_logistic))  # 1/p_i for treated, 1/(1-p_i) for control, matching w_i above
head(lalonde)
##   treat age educ black hispan married nodegree re74 re75       re78 random
## 1  TRUE  37   11     1      0       1        1    0    0  9930.0460  FALSE
## 2  TRUE  22    9     0      1       0        1    0    0  3595.8940   TRUE
## 3  TRUE  30   12     1      0       0        0    0    0 24909.4500  FALSE
## 4  TRUE  27   11     1      0       0        1    0    0  7506.1460  FALSE
## 5  TRUE  33    8     1      0       0        1    0    0   289.7899   TRUE
## 6  TRUE  22    9     1      0       0        1    0    0  4056.4940  FALSE
##   logistic_prob logistic_weight
## 1     0.6387699        2.768319
## 2     0.2246342        1.289714
## 3     0.6782439        3.107944
## 4     0.7763241        4.470754
## 5     0.7016387        3.351642
## 6     0.6990699        3.323031
hist(lalonde$logistic_weight)

boxplot(logistic_prob ~ treat, data=lalonde)

  • Assessing covariate balance after weighting \[PSB_k = \frac{| \bar X_{k1} - \bar X_{k0}|}{\widehat \sigma_k}\] where \[\bar X_{kt} = \frac{\sum_{i=1}^n I(A=t) X_{ki} w_i }{ \sum_{i=1}^n I(A=t) w_i}\]
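
  • A small sketch of this balance measure for a single covariate (the choice of \(\widehat\sigma_k\) varies by convention; the unweighted sample standard deviation is used here as an assumption)

# weighted standardized bias for one covariate, following PSB_k above
std_bias <- function(x, A, w, sigma = sd(x)) {
  m1 <- sum((A == 1) * x * w) / sum((A == 1) * w)   # weighted treated mean
  m0 <- sum((A == 0) * x * w) / sum((A == 0) * w)   # weighted control mean
  abs(m1 - m0) / sigma
}

# e.g., balance of age under the logistic-regression weights created above:
# std_bias(lalonde$age, lalonde$treat, lalonde$logistic_weight)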

Boosting

  • A problem with logistic regression for estimating weights is that it can be challenging to get a good model fit

  • Boosting makes this a lot easier and automatic

  • We already talked about AdaBoost and a little about general boosting

For reference, in AdaBoost

  • each weak learner is fitted to the data with observation weights that depend on the previous model’s performance

  • all of the weak learners are then weighted according to their performance

For general boosting

  • Observations are not weighted

  • We use learners (models), like regressions or trees, that are more complex than a weak learner

  • Similar to AdaBoost, we must choose the number of sequential learners to use, \(M\)

  • Each learner tries to predict the error of the previous model

  • Error: in the literature, the error is the negative gradient of the loss function with respect to the model evaluated at an observation: \[-\frac{\partial L(y_i,f(x_i))}{\partial f(x_i)}\]

  • For a continuous outcome (squared-error loss), this is \(y_i-f(x_i)\), where \(f\) is the current model

  • For a binary outcome, this is \(I(y_i=1)-p(x_i)\), where \(p\) is a probability model such as a logistic regression or decision tree

  • For categorical outcomes, this is \(I(y_i=\text{Category }k)-p_k(x_i)\), where \(p_k\) models the probability of category \(k\)

  • Learning rate: to avoid overfitting, we reduce the contribution of each learner by a factor \(0<\nu<1\)

  • Trees are the most common learner to use for boosting, usually with between 8 and 32 terminal leaves

Image from geeksforgeeks

Gradient Tree Boosting Algorithm from ESL

  1. Initialize \(f_0(x)=\arg\min_\gamma \sum_{i=1}^n L(y_i,\gamma)\)

  2. For \(m=1\) to \(M\):

    1. For \(i=1,\dots,n\) compute the error \[r_{im}=-\left[\frac{\partial L(y_i,f(x_i))}{\partial f(x_i)}\right]_{f=f_{m-1}}\]
    2. Fit a regression tree to the errors, \(r_{im}\) with terminal regions \(R_{jm}\) where \(j=1,\dots, J_m\)
    3. For \(j=1,\dots, J_m\) compute \[\gamma_{jm}=\arg\min_\gamma \sum_{x_i\in R_{jm}} L(y_i,f_{m-1}(x_i)+\gamma)\]
    4. Update \(f_m(x)=f_{m-1}(x)+\nu\sum_{j=1}^{J_m} \gamma_{jm}I(x\in R_{jm})\)
  3. Output \(\hat f(x)=f_M(x)\)
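
A bare-bones illustration of this algorithm for a continuous outcome with squared-error loss, where the negative gradient is just the residual and the terminal-node means from each tree fit play the role of the \(\gamma_{jm}\). It uses rpart as the tree learner on simulated data; this is a sketch for intuition, not a substitute for the gbm package that twang uses below.

library(rpart)

# toy gradient boosting with squared-error loss on simulated data
set.seed(123)
n <- 500
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)
dat <- data.frame(x, y)

M  <- 200                        # number of sequential learners
nu <- 0.1                        # learning rate
f  <- rep(mean(y), n)            # step 1: best constant fit

for (m in 1:M) {
  r    <- y - f                                          # step 2a: residuals = negative gradient
  tree <- rpart(r ~ x, data = data.frame(x = x, r = r),
                control = rpart.control(maxdepth = 2))   # step 2b: small regression tree
  f    <- f + nu * predict(tree, newdata = dat)          # steps 2c-2d: damped update
}

mean((y - f)^2)                  # training error of the boosted fit
mean((y - mean(y))^2)            # error of the initial constant fit, for comparison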

Generalized Boosting for Propensity Scores

  • For causal inference, we want to weight the observation to make the treatment and control groups look as if they were randomized

  • Using inverse probabilities from boosting model, we want to use as many trees as we need to achieve balance in the covariates

  • Ideally, we want the distributions of each covariate in the treatment and control groups to be similar (mean, variance, skew, shape, etc)

  • Clearly this is rarely possible, so we might restrict balance to mean and/or standard deviation for example

  • Assessing covariate balance with standardize bias estimation \[SB_k = \frac{|\bar X_{k1} - \bar X_{k0}|}{\hat \sigma_k}\]

  • Generally, standardized mean differences of less than 0.20 are considered small, 0.40 are considered moderate, and 0.60 are considered large

  • This cutoff can change within fields and between investigators

  • McCaffrey et al.: \(SB>0.2\) is problematic

  • Below, the twang R package automatically adds more trees until a stopping rule based on balance is met

  • twang chooses propensity scores based on the boosted model with the best balance

  • see twang documentation

boosted.mod <- ps(treat ~ age + educ + black + hispan + married + nodegree + re74 + re75,
                  data=lalonde,
                  estimand = "ATE",
                  n.trees = 5000, 
                  interaction.depth=2, 
                  perm.test.iters=0, 
                  verbose=FALSE, 
                  stop.method = c("es.mean"))
summary(boosted.mod)
##             n.treat n.ctrl ess.treat ess.ctrl    max.es   mean.es    max.ks
## unw             185    429 185.00000 429.0000 1.3085999 0.4432341 0.6404460
## es.mean.ATE     185    429  60.12492 193.2837 0.5527156 0.1873184 0.2705063
##             max.ks.p   mean.ks iter
## unw               NA 0.2702451   NA
## es.mean.ATE       NA 0.1168750 2488
summary(boosted.mod$gbm.obj,
        n.trees=boosted.mod$desc$es.mean.ATE$n.trees, 
        plot=FALSE)
##               var     rel.inf
## black       black 56.02950248
## re74         re74 16.57601323
## age           age 16.48551967
## re75         re75  4.22031848
## educ         educ  3.17248606
## married   married  2.97585135
## nodegree nodegree  0.44543160
## hispan     hispan  0.09487713
lalonde$boosted <- get.weights(boosted.mod)
## Warning in get.weights(boosted.mod): No stop.method specified.  Using es.mean.ATE
hist(lalonde$boosted)

plot(boosted.mod)

plot(boosted.mod, plots=2)

plot(boosted.mod, plots=3)

bal.table(boosted.mod)
## $unw
##             tx.mn    tx.sd    ct.mn    ct.sd std.eff.sz   stat     p    ks
## age        25.816    7.155   28.030   10.787     -0.224 -2.994 0.003 0.158
## educ       10.346    2.011   10.235    2.855      0.042  0.547 0.584 0.111
## black       0.843    0.365    0.203    0.403      1.309 19.371 0.000 0.640
## hispan      0.059    0.237    0.142    0.350     -0.257 -3.413 0.001 0.083
## married     0.189    0.393    0.513    0.500     -0.656 -8.607 0.000 0.324
## nodegree    0.708    0.456    0.597    0.491      0.231  2.716 0.007 0.111
## re74     2095.574 4886.620 5619.237 6788.751     -0.544 -7.254 0.000 0.447
## re75     1532.055 3219.251 2466.484 3291.996     -0.284 -3.282 0.001 0.288
##          ks.pval
## age        0.003
## educ       0.081
## black      0.000
## hispan     0.339
## married    0.000
## nodegree   0.081
## re74       0.000
## re75       0.000
## 
## $es.mean.ATE
##             tx.mn    tx.sd    ct.mn    ct.sd std.eff.sz   stat     p    ks
## age        25.369    7.050   27.478   10.057     -0.213 -2.732 0.006 0.135
## educ       10.679    2.035   10.318    2.682      0.137  1.173 0.241 0.090
## black       0.634    0.483    0.364    0.482      0.553  3.080 0.002 0.271
## hispan      0.117    0.323    0.117    0.322      0.000  0.003 0.998 0.000
## married     0.317    0.467    0.430    0.496     -0.229 -1.318 0.188 0.113
## nodegree    0.593    0.493    0.601    0.490     -0.016 -0.099 0.921 0.008
## re74     3034.535 5240.245 4587.694 6420.822     -0.240 -1.874 0.061 0.168
## re75     1784.285 3374.853 2145.958 3190.482     -0.110 -0.940 0.348 0.151
##          ks.pval
## age        0.377
## educ       0.852
## black      0.002
## hispan     1.000
## married    0.604
## nodegree   1.000
## re74       0.152
## re75       0.243

Estimating Average Treatment Effect (ATE)

library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart
design <- svydesign(ids=~1, weights=~boosted, data=lalonde)
glm1 <- svyglm(re78 ~ treat, design=design)
summary(glm1)
## 
## Call:
## svyglm(formula = re78 ~ treat, design = design)
## 
## Survey design:
## svydesign(ids = ~1, weights = ~boosted, data = lalonde)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6646.9      388.3  17.118   <2e-16 ***
## treatTRUE     -720.4      866.5  -0.831    0.406    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 49530170)
## 
## Number of Fisher Scoring iterations: 2
summary(lm(re78 ~ treat + age + educ+  black + hispan + married + nodegree + re74 + re75, data=lalonde))
## 
## Call:
## lm(formula = re78 ~ treat + age + educ + black + hispan + married + 
##     nodegree + re74 + re75, data = lalonde)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13595  -4894  -1662   3929  54570 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.651e+01  2.437e+03   0.027   0.9782    
## treatTRUE    1.548e+03  7.813e+02   1.982   0.0480 *  
## age          1.298e+01  3.249e+01   0.399   0.6897    
## educ         4.039e+02  1.589e+02   2.542   0.0113 *  
## black       -1.241e+03  7.688e+02  -1.614   0.1071    
## hispan       4.989e+02  9.419e+02   0.530   0.5966    
## married      4.066e+02  6.955e+02   0.585   0.5590    
## nodegree     2.598e+02  8.474e+02   0.307   0.7593    
## re74         2.964e-01  5.827e-02   5.086 4.89e-07 ***
## re75         2.315e-01  1.046e-01   2.213   0.0273 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6948 on 604 degrees of freedom
## Multiple R-squared:  0.1478, Adjusted R-squared:  0.1351 
## F-statistic: 11.64 on 9 and 604 DF,  p-value: < 2.2e-16

\[\\[0.5in]\]

Unmeasured confounding and mediating variables

  • Goal: Quantify causal effect of \(A\) on \(Y\)

  • Below: Let \(A\) be the treatment, \(Y\) the outcome, \(X\) measured confounders, \(W\) non-confounding variables, \(U\) unmeasured confounders

  • Question: If there are variables like \(U\) (unmeasured confounders), is it possible to use propensity score to determine the causal effect of \(A\) on \(Y\)?

    • A. Yes
    • B. No
    • C. Not sure

\[\\[0.5in]\]

  • Question: Assume there are no unmeasured confounders, will using \(W\) and \(X\) to estimate propensity scores give an accurate causal estimate?

    • A. Yes
    • B. No
    • C. Not sure

\[\\[0.5in]\]

  • Question: Assume there are no unmeasured confounders, will only using \(X\) to estimate propensity scores give an accurate causal estimate?

    • A. Yes
    • B. No
    • C. Not sure

\[\\[0.5in]\]

  • If there are unmeasured confounders, propensity score methods will not give accurate causal estimates for the effect of treatment on outcome

  • Including non-confounding, pre-treatment variables in the propensity score analysis does not bias causal estimates for the effect of treatment on outcome

  • Below: Let \(A\) be the treatment, \(Y\) the outcome, \(X\) measured confounders, \(W\) non-confounding variables, \(M\) mediating variables

  • Question: Will using \(W\), \(X\), and \(M\) to estimate propensity scores give an accurate causal estimate?

    • A. Yes
    • B. No
    • C. Not sure

\[\\[0.5in]\]

  • Including a mediating variable in the propensity score analysis will block the causal pathway through that variable

  • Generally, we do not want to do this, but there may be times where it is warranted

  • Example: Vaccines may influence our behavior but also may prevent sickness

  • If we want to remove the effect of behavior from the analysis, we can control for it (include it in our propensity score analysis)

Causal Inference Models Using Instrumental Variables

  • Consider the following Bayesian network where the numbers indicate linear parameters, for example \(A = 2 Z + C + \epsilon_A\)

  • Want to determine the causal effect of \(A\) on \(Y\)

  • Question: From the Bayesian network, what is the causal effect of \(A\) on \(Y\)? (numeric answer)

\[\\[0.5in]\]

  • Question: Regressing \(Y\) on \(A\) (in R, \(Y \sim A\)) would include confounding in the parameter corresponding to \(A\)

    • A. True
    • B. False
    • C. Not sure

\[\\[0.5in]\]

  • Question: Regressing \(Y\) on \(Z\) (in R, \(Y\sim Z\)) would include confounding in the parameter corresponding to \(Z\)

    • A. True
    • B. False
    • C. Not sure

\[\\[0.5in]\]

  • Question: Regressing \(Y\) on \(Z\) (in R, \(Y\sim Z\)) should have what value for the parameter corresponding to \(Z\)? (numeric)

\[\\[0.5in]\]

  • Question: Regressing \(A\) on \(Z\) (in R, \(A\sim Z\)) should have what value for the parameter corresponding to \(Z\)? (numeric)

\[\\[0.5in]\]

set.seed(1234)
n_obs <- 100000

Z <- 400 * rnorm(n_obs)
C <- 500 * rnorm(n_obs)

A <- 2 * Z + 1 * C + rnorm(n_obs)
Y <- 3 * A + 5 * C + rnorm(n_obs)

summary(lm(Y ~ A ))
## 
## Call:
## lm(formula = Y ~ A)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9576  -1427     -3   1426   9368 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.496372   6.672603   0.374    0.708    
## A           4.398125   0.007078 621.397   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2110 on 99998 degrees of freedom
## Multiple R-squared:  0.7943, Adjusted R-squared:  0.7943 
## F-statistic: 3.861e+05 on 1 and 99998 DF,  p-value: < 2.2e-16
summary(lm(Y ~ Z))
## 
## Call:
## lm(formula = Y ~ Z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18625.1  -2692.4    -18.7   2678.5  16980.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.81128   12.58789   1.018    0.309    
## Z            6.02354    0.03149 191.304   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3981 on 99998 degrees of freedom
## Multiple R-squared:  0.2679, Adjusted R-squared:  0.2679 
## F-statistic: 3.66e+04 on 1 and 99998 DF,  p-value: < 2.2e-16
summary(lm(A ~ Z))
## 
## Call:
## lm(formula = A ~ Z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2327.96  -336.31    -2.43   334.94  2122.81 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.603158   1.573487   1.019    0.308    
## Z           2.002944   0.003936 508.897   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 497.6 on 99998 degrees of freedom
## Multiple R-squared:  0.7214, Adjusted R-squared:  0.7214 
## F-statistic: 2.59e+05 on 1 and 99998 DF,  p-value: < 2.2e-16
summary(lm(Y ~ Z))$coefficients[2,1]/summary(lm(A ~ Z))$coefficients[2,1]
## [1] 3.007345

Instrumental Variables Big Picture

Three steps:

  1. Regress Treatment onto Instrument (i.e. treatment as dependent variable) to get \(\widehat a\)
  2. Regress Outcome onto Instrument (i.e. outcome as dependent variable) to estimate \(\widehat{ab}\)
  3. Divide \(\frac{\widehat{ab}}{\widehat a}\) to get \(\widehat b\)

Assumptions:

  • The instrument does not affect any confounders

  • The instrument only affects the outcome through the treatment

  • Linearity
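
  • A short sketch of the three steps, re-using the \(Z\), \(A\), and \(Y\) simulated above; the second block shows the equivalent two-stage least squares form (with a single instrument and treatment, the two give the same estimate)

# step 1: instrument -> treatment; step 2: instrument -> outcome; step 3: ratio
a_hat  <- coef(lm(A ~ Z))["Z"]
ab_hat <- coef(lm(Y ~ Z))["Z"]
ab_hat / a_hat                     # recovers the causal effect (about 3)

# two-stage least squares: replace A by its fitted values from step 1
A_hat <- fitted(lm(A ~ Z))
coef(lm(Y ~ A_hat))["A_hat"]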

Minimum wage: Causal Inference and Economics

  • Minimum wage is the legal minimum that an employer can pay a worker

  • This topic has a contentious history

  • The US established a minimum wage during the Great Depression

  • But the federal minimum wage is not tied to inflation, so the real (inflation-adjusted) value of the minimum wage decreases over time if politicians do not raise it

Image from medium.com

  • In a 1992 survey of the American Economic Association, 79% of economists responded that having a minimum wage increases unemployment among young and low-skilled workers

  • Using the classical supply and demand model, the reasoning was that if the price of labor is increased, then employers will buy less labor, resulting in fewer jobs

  • On the other hand, paying employees meager wages while making considerable profit can seem like exploitation

  • Does this classical model accurately describe the minimum wage and number of jobs dynamic?

  • Just like any model, the supply and demand model requires assumptions

  • Further, consider factories vs restaurants:

    • Factories face international competition to keep prices low and can more easily automate jobs
    • Restaurants face local competition and cannot easily automate jobs
  • In 1992, New Jersey raised its minimum wage from $4.25/hr to $5.05/hr while Pennsylvania kept its minimum wage at $4.25/hr

  • David Card and Alan Krueger at Princeton University used this opportunity to study employment at fast-food restaurants in both states before the April increase and again several months later

  • The increase in the wage floor did not lead to jobs being lost in New Jersey; employment in the restaurants they looked at went up

  • Results in this paper

  • Some economists (46%) now think that some labor markets are better characterized by a monopsony market structure: one dominant buyer purchases from many sellers, and so is able to lower prices (wages)

  • By increasing wages, more people may have been willing to work at the higher wage

  • A later study in 2017, looking at aggregate employment data on average hours and earnings, found that increasing the minimum wage in Seattle to $15 led to reduced worker hours and, subsequently, reduced pay

  • Another study in 2018 in Seattle looking at tax data found that after the increase, workers gained earnings on a weekly basis but only because they worked more jobs

  • These studies show the impact that data can have on policy

  • This material was taken from this article

Earnings after moving

  • Using natural randomization: In 1973 the Eldfell volcano on the Icelandic island of Heimaey erupted, destroying about 400 homes.

  • The Icelandic government compensated those who lost their homes; many never returned.

  • An economics paper showed that, among people less than 25 years old at the time of the eruption, those who had moved averaged four more years of schooling and earned $27,000 more per year than those from families who had kept their home.

  • Because losing one’s home was naturally randomized, this paper used instrumental variables

  • The treatment was moving away and the outcome was later earnings

  • Losing home \(\rightarrow\) moving away \(\rightarrow\) later earnings

  • Instrument \(\rightarrow\) Treatment \(\rightarrow\) Effect

  • Instrument can only influence effect through treatment (no direct arrow from instrument to effect)

  • We are not going to focus on instrumental variables analysis here

  • Used when there is a natural experiment

Regression with knowledge of causal structure

  • There are times when researchers have a good understanding of the causal relationships between variables

  • When this is the case, this information can (and should) be used to build the model

  • Consider the game of baseball and prediction

  • Question: Assuming linearity holds in the graph above, what methods would accurately model \(Y\)? (choose all that apply)

    • A. GLM using only \(A\)
    • B. GLM using \(A\) and \(X\)
    • C. Instrumental variable analysis with \(X\) as instrument
    • D. Propensity scoring using \(X\)
    • E. Random forest using \(X\) and \(A\)

\[\\[0.5in]\]

  • Question: Again assuming linearity holds in the graph above, which model would be best for accurately modeling \(Y\) (choose one)

    • A. GLM using only \(A\)
    • B. GLM using \(A\) and \(X\)
    • C. Instrumental variable analysis with \(X\) as instrument
    • D. Propensity scoring using \(X\)
    • E. Random forest using \(X\) and \(A\)

\[\\[0.5in]\]

  • Now, we want to quantify the causal impact of \(A\) on \(Y\) without assuming linearity

  • Question: Would using \(C1, C2, C3, C4, C5\) for propensity score weighting accurately quantify the effect of \(A\) on \(Y\)?

    • A. Yes
    • B. No

\[\\[0.5in]\]

  • Question: Would using \(C1, C2, C3, C4\) for propensity score weighting accurately quantify the effect of \(A\) on \(Y\)?

    • A. Yes
    • B. No

\[\\[0.5in]\]

  • Question: Would using \(C1, C4\) for propensity score weighting accurately quantify the effect of \(A\) on \(Y\)?

    • A. Yes
    • B. No

\[\\[0.5in]\]

  • Question: Would using \(C1, C2, C4\) for propensity score weighting accurately quantify the effect of \(A\) on \(Y\)?

    • A. Yes
    • B. No

\[\\[0.5in]\]

  • Question: Would using \(C5\) as an instrumental variable accurately quantify the effect of \(A\) on \(Y\)?

    • A. Yes
    • B. No

\[\\[0.5in]\]

  • Question: If we want to predict \(Y\), which variables do we not need?

    • A. \(C1\)
    • B. \(C2\)
    • C. \(C3\)
    • D. \(C4\)
    • E. \(C5\)

\[\\[0.5in]\]

  • We want to isolate the causal effect of A on Y

  • Would conditioning on \(C2\) only accurately quantify the causal effect of A on Y?

    • A. Yes
    • B. No

\[\\[0.5in]\]

  • Would conditioning on \(C3\) accurately quantify the causal effect of A on Y?

    • A. Yes
    • B. No

\[\\[0.5in]\]

  • Would conditioning on \(C1\) and \(D1\) accurately quantify the causal effect of A on Y?

    • A. Yes
    • B. No

\[\\[0.5in]\]

  • Would conditioning on \(C1\) and \(D2\) accurately quantify the causal effect of A on Y?

    • A. Yes
    • B. No

\[\\[0.5in]\]

  • Would conditioning on \(C1\) and \(M2\) accurately quantify the causal effect of A on Y?

    • A. Yes
    • B. No

\[\\[0.5in]\]

Graphical Causal Models and D-separation

  • Researchers frequently use directed acyclic graphs (DAG) to visualize causal relationships between variables in a dataset

  • directed - all edges in graph have direction (all edges are arrows)

  • acyclic - arrow directions do not create a loop

  • DAGs are sometimes called Bayesian networks

  • DAG below

  • A, B, C, D, E, F, G, H, I represent variables in a dataset

  • Arrows indicate a direct causal relationship

  • A is the parent of D

  • D is the child of A

  • B is an ancestor of I

  • I is a descendant of B

  • Here: we are interested in paths between variables

  • Note: paths can have arrows pointing either direction

  • One path between D and F is \[D\leftarrow A\rightarrow E \rightarrow G \leftarrow F\]

  • Another is \[D \rightarrow H \leftarrow G \rightarrow I \leftarrow F\]

  • These are of interest if we want to know if D and F are associated or causally linked

  • Somewhat technical assumption: in general, we assume arrows between nodes indicate a causal relationship

  • See a text on Probabilistic Graphical Models for more detail

Factorization and conditional independence

  • Bayesian networks are an intuitive way to visualize probability factorizations

  • The joint distribution can be factored as \[P(A, B, C, D, E) = P(A) P(B|A) P(C) P(D|B,C) P(E|C)\]

  • For each variable, we only need to condition on its immediate parents

  • DAGs give information about conditional independence relationships among variables

  • This DAG has one path from A to E: \(A \rightarrow B \rightarrow D \leftarrow C \rightarrow E\)

  • How can we use this DAG to determine statistical associations between variables/nodes?

  • With 3 nodes:

    • Chain: \(A \rightarrow B \rightarrow C\)
    • Other Chain: \(A \leftarrow B \leftarrow C\)
    • Fork: \(A \leftarrow B \rightarrow C\)
    • Collider: \(A \rightarrow B \leftarrow C\)
| Structure         | Path type                         | \(A,C\) independent/dependent?   | \(A,C\) given \(B\)?                   |
|-------------------|-----------------------------------|----------------------------------|----------------------------------------|
| Chain             | \(A \rightarrow B \rightarrow C\) | \(A\not\!\perp \!\!\! \perp C\)  | \(A\perp \!\!\! \perp C \mid B\)       |
| Other Chain       | \(A \leftarrow B \leftarrow C\)   | \(A\not\!\perp \!\!\! \perp C\)  | \(A\perp \!\!\! \perp C \mid B\)       |
| Fork (Confounder) | \(A \leftarrow B \rightarrow C\)  | \(A\not\!\perp \!\!\! \perp C\)  | \(A\perp \!\!\! \perp C \mid B\)       |
| Collider          | \(A \rightarrow B \leftarrow C\)  | \(A\perp \!\!\! \perp C\)        | \(A\not\!\perp \!\!\! \perp C \mid B\) |
  • Questions:
    1. Are A and B associated?
    2. Are A and D associated?
    3. Are D and E associated?
    4. Are B and C associated?
    5. Are A and E associated?

Rule: If a path has no colliders, then yes and we say that the path is open. If there is a collider on the path, then no and we say that the path is blocked

  • What about conditioning?

  • Questions:

    1. Are A and D associated given B?
    2. Are D and E associated given C?
    3. Are B and C associated given D?
    4. Are A and C associated given D?
    5. Are A and E associated given D?
    6. Are A and E associated given B?
    7. Are A and E associated given B, D?

    Rule: Conditioning on a collider opens a path. Conditioning on a non-collider blocks a path

  • Note: A variable/node can be a collider on one path but not on another.

  • How can we show that if \(D \leftarrow C \rightarrow E\) then \(D \perp E | C\)?

  • We use Bayesian networks because they help us to quickly understand marginal and conditional independence between the variables

  • \(X\) and \(Y\) are independent iff (def) \(\mathbb P(X, Y) = \mathbb P(X) \mathbb P(Y)\)

  • \(X\) and \(Y\) are conditionally independent given \(Z\) iff (def) \(\mathbb P(X, Y |Z) = \mathbb P(X | Z) \mathbb P(Y | Z)\)

  • \(D \perp E | C\) if and only if \(P(D,E | C) = P(D|C) P(E|C)\) \[P(D,E|C) = \frac{P(D,E,C)}{P(C)} = \frac{P(C)P(D|C)P(E|C)}{P(C)} = P(D|C)P(E|C)\]
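
  • The same claim can be checked numerically; a small sketch with binary variables and made-up probabilities for the fork \(D \leftarrow C \rightarrow E\)

# numerical check that D and E are independent given C in a fork (illustrative probabilities)
set.seed(2024)
n <- 1e6
C <- rbinom(n, 1, 0.5)
D <- rbinom(n, 1, ifelse(C == 1, 0.8, 0.2))
E <- rbinom(n, 1, ifelse(C == 1, 0.7, 0.3))

mean(D[E == 1]) - mean(D[E == 0])                      # marginal association: clearly nonzero
mean(D[E == 1 & C == 1]) - mean(D[E == 0 & C == 1])    # ~0 within C = 1
mean(D[E == 1 & C == 0]) - mean(D[E == 0 & C == 0])    # ~0 within C = 0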

What about \(A \rightarrow B \rightarrow C\)?

Markov Assumption

  • We are implicitly using the Markov assumption here

  • Specifically, we are also assuming that different paths won’t perfectly cancel out (this no-cancellation condition is usually called faithfulness)

  • I want a good score on my test so I continue studying

  • With the additional studying, I sleep less

  • With less sleep, my performance isn’t as good

  • With the extra studying, I know the material better

  • Can my lack of sleep and extra studying perfectly cancel out?

  • Markov assumption says no

Causal Discovery using PC algorithm

\[\\[0.5in]\]

\[\\[0.5in]\]

\[\\[0.5in]\]

  1. Create a complete, undirected graph

  2. Pairwise Conditional Independence Testing

  3. For all non-looping triples:

  4. Find all Y-structures in the graph:

library(pcalg)
library(Rgraphviz)
## Loading required package: graph
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, cbind, colnames, do.call,
##     duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax,
##     pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames,
##     sapply, setdiff, sort, table, tapply, union, unique, unsplit
data(lalonde)
pc.fit <- pc(suffStat = list(C = cor(lalonde), n = dim(lalonde)[1]), indepTest = gaussCItest, ## indep.test: partial correlations
             alpha=0.01, labels = names(lalonde), verbose = FALSE)
plot(pc.fit, main="Lalonde Estimated Bayesian Network")

Writing tips

|             | Total (n)       | Treatment (n1)  | Control (n2)    | p-value |
|-------------|-----------------|-----------------|-----------------|---------|
| Covariate 1 | mean (sd)       | mean (sd)       | mean (sd)       | <0.001  |
| Covariate 2 | %               | %               | %               | 0.34    |
| Covariate 3 | median (q1, q3) | median (q1, q3) | median (q1, q3) | 0.02    |