Causal Analysis Briefly

dummy slide

Introduction

Knowledge from data

Aim: “Research to improve human health”

Confounding

Outline

Causal statistical analysis in health science

-unobserved confounding?
-average causal effect?
-objective causal sensitivity?

What is it about and how to compute?

The problem

association

(Smoking): How harmful is the exposure?
(Genes): How about unobserved confounders - eg. genetic effects?

A. Hill: Proceedings of the Royal Society of Medicine, 1965

Strength: Quantification of association.
Consistency: Repeatedly observed.
Specificity: Certain type of disease, but not others.
Temporality: Cause precedes consequence.
Biological gradient: Dose-response relationship.
Biological plausible.
Coherent with laboratory evidence.
Experimental or semi-experimental: If exposure was removed, does that prevent the disease?
Analogy with known exposure-disease causal effect.

Causal analysis - Main aims

To target the exposure-outcome relation taking unobserved confounding into account (using ‘counterfactuals’).

To estimate the average causal effect (by ‘g-computation’)

To gauge the magnitude of unobserved confounding explaining away the average causal effect (obtaining objective sensitivity)

Methodology

Synopsis

We review the randomized design of study to introduce causal assumptions for analyzing observational studies using counterfactuals.

Principle of Randomization

randomization

Randomization negates the effect of confounders.
-the treated would have experienced the same outcome as the untreated and vice versa would they have been exchanged.
Can this be used for observational studies? Yes, in some cases.

Regression analysis - measuring effect

\[ \underbrace{E(Y|X_1,\ldots ,X_p)}_{\text{expected Y given covariates}} = \underbrace{\alpha+\beta_1X_1+\ldots +\beta_pX_p}_{\text{predictor}} \]

For association, the effect of treatment \(X_1\) given covariates \(X_2,\ldots,X_p\) is

\[ \underbrace{E(Y|X_1=1)-E(Y|X_1=0)}_{\text{treated versus untreated}} = \underbrace{\beta_1}_{\text{effect}} \]

Causation versus Association

Exchangeability and Consistency

Each individual has two potential outcomes, \[Y^{a=1},\quad Y^{a=0}\]
Our target is the causal effect \[E(Y^{a=1})-E(Y^{a=0})\]
- but counterfactuals are never seen(!)
Two assumptions: Exchangeability and Consistency

Exchangeability: The potential outcome, \(Y^{a}\) is independent of treatment allocation \(A\) for all levels \(a\) of \(A\).

-in other words, the treated would have experienced the same outcome as the untreated and vice versa would they have been exchanged.

Consistency: If an individual is given treatment \(a\), that is, \(A=a\), then the potential outcome, \(Y^{a}\) equals the outcome, \(Y^{a}=Y\).

Exchangeability and Consistency cont’d

\[ \begin{align*} \underbrace{E(Y^{a=1})-E(Y^{a=0})}_{\text{causal effect}} &= \underbrace{E(Y^{a=1}|A=1)-E(Y^{a=0}|A=0)}_{\text{exchangeability and consistency}} \\ &=\underbrace{E(Y|A=1)-E(Y|A=0)}_{\text{association effect}} \end{align*} \]

So association is causation(!)

Comments

Exchangeability, consistency and also an assumption of positivity are the premises for causal analysis.
If fulfilled, we obtain the causal interpretation of desired targets.
Untestable assumptions.
It is an obligation to aim for fulfillment - we consider this.

Causal analysis in observational studies

Synopsis

We illustrate the need to actual do causal analysis. First we review the instrumental variables approach as remedy to obtain causal interpretation of treatment effect.

The problem

Example

How harmful is drinking alchohol? (coronary heart disease (J-shape), hypertension, cancers,..)
How about unobserved lifestyle factors?
Causal meaning is doubtful.

The problem - instrumental variables

Example IV

How harmful is drinking alchohol? (coronary heart disease (J-shape), hypertension, cancers,..)
How about unobserved lifestyle factors?
ALDH2 is a genetic variant - the instrumental variable.

Instrumental variables

Example IV

True model: \(E(Y|X,G,Z,U)=\alpha+\beta_XX+\beta_UU+\beta_ZZ\)
Target: \(E(Y^{x=1})-E(Y^{x=0})\) is the causal effect.
Can we estimate this?

Causal analysis - computation

Example IV

True model: \(E(Y|X,G,Z,U)=1+0.5X-1U+0.3Z\)
We simulate such data and perform causal analysis.

Simulate data

set.seed(160322)
n=10000
U=rnorm(n)
G=rbinom(n,1,0.5)
Z=rbinom(n,1,0.5)
alp0=1;alp1=0.75;alp2=-0.5;alp3=0.3
mu=alp0+alp1*G+alp2*U+alp3*Z
X=rnorm(n,mu,1)

bet0=1;bet1=0.5;bet2=-1;bet3=0.3
thet=bet0+bet1*X+bet2*U+bet3*Z
Y=rnorm(n,thet,0.5)

The true effect is \(\beta_X=0.5\)

True: \(E(Y|X,G,Z,U)=1+0.5X-1U+0.3Z\)

## 
## Call:
## lm(formula = Y ~ X + Z + U)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8000 -0.3293 -0.0013  0.3336  1.6806 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.99043    0.00928   106.8   <2e-16 ***
## X            0.50733    0.00456   111.3   <2e-16 ***
## Z            0.29846    0.00990    30.2   <2e-16 ***
## U           -0.99422    0.00538  -184.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.49 on 9996 degrees of freedom
## Multiple R-squared:  0.888,  Adjusted R-squared:  0.888 
## F-statistic: 2.63e+04 on 3 and 9996 DF,  p-value: <2e-16

Problem is that U is unobserved!
So we cannot just do the above regression.

The true effect is \(\beta_X=0.5\)

Need to do the regression without U:

## 
## Call:
## lm(formula = Y ~ X + G + Z)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.194 -0.680  0.004  0.679  3.750 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.60555    0.01968   30.77   <2e-16 ***
## X            0.89711    0.00909   98.68   <2e-16 ***
## G           -0.30181    0.02148  -14.05   <2e-16 ***
## Z            0.17157    0.02056    8.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.02 on 9996 degrees of freedom
## Multiple R-squared:  0.513,  Adjusted R-squared:  0.513 
## F-statistic: 3.52e+03 on 3 and 9996 DF,  p-value: <2e-16

This is the observational data standard regression.
-biased estimate of \(\beta_X\)
-there is an effect of G now.

The true effect is \(\beta_X=0.5\)

But biased estimate of \(\beta_X\) and significant effect of \(G\)

## 
## Call:
## lm(formula = Y ~ G)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.302 -0.980  0.009  0.973  5.934 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.7089     0.0204    83.9   <2e-16 ***
## G             0.3740     0.0290    12.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.45 on 9998 degrees of freedom
## Multiple R-squared:  0.0164, Adjusted R-squared:  0.0163 
## F-statistic:  167 on 1 and 9998 DF,  p-value: <2e-16

But \(G\) is independent of \(Y\) conditional on \(X\)

The true effect is \(\beta_X=0.5\)

We can estimate the correct effect of X without using U
It is called ‘2SLS regression’
Two Stage Least Squares regression

2SLS regression

* Do the regression of X on G (and Z), and 
* Calculate the predicted values: $\hat{X}$
* Do the regression of Y on $\hat{X}$ (and Z)

hat.X=fitted(lm(X~G+Z))
summary(lm(Y~hat.X+Z))

## 
## Call:
## lm(formula = Y ~ hat.X + Z)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.526 -0.963  0.001  0.941  5.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.0023     0.0555   18.07   <2e-16 ***
## hat.X         0.4942     0.0382   12.93   <2e-16 ***
## Z             0.2937     0.0310    9.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.43 on 9997 degrees of freedom
## Multiple R-squared:  0.0395, Adjusted R-squared:  0.0393 
## F-statistic:  205 on 2 and 9997 DF,  p-value: <2e-16

We get the correct estimate of 0.50, hurra!
Standard errors are (slightly) incorrect.

2SLS regression

* Do the regression of X on G (and Z), and 
* Calculate the predicted values: $\hat{X}$
* Do the regression of Y on $\hat{X}$ (and Z)

## 
## Call:
## lm(formula = Y ~ hat.X + Z)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.526 -0.963  0.001  0.941  5.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.0023     0.0555   18.07   <2e-16 ***
## hat.X         0.4942     0.0382   12.93   <2e-16 ***
## Z             0.2937     0.0310    9.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.43 on 9997 degrees of freedom
## Multiple R-squared:  0.0395, Adjusted R-squared:  0.0393 
## F-statistic:  205 on 2 and 9997 DF,  p-value: <2e-16

We get the correct estimate of 0.50, hurra!
Standard errors are (slightly) incorrect.

2SLS regression with correct standard errors:

## 
## Call:
## ivreg(formula = Y ~ Z + X | Z + G)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6502 -0.7582  0.0099  0.7483  4.4193 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.0023     0.0432    23.2   <2e-16 ***
## Z             0.2937     0.0241    12.2   <2e-16 ***
## X             0.4942     0.0297    16.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.11 on 9997 degrees of freedom
## Multiple R-Squared: 0.418,   Adjusted R-squared: 0.418 
## Wald test:  339 on 2 and 9997 DF,  p-value: <2e-16

2SLS regression diagnostics

## 
## Call:
## ivreg(formula = Y ~ Z + X | Z + G)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6502 -0.7582  0.0099  0.7483  4.4193 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.0023     0.0432    23.2   <2e-16 ***
## Z             0.2937     0.0241    12.2   <2e-16 ***
## X             0.4942     0.0297    16.6   <2e-16 ***
## 
## Diagnostic tests:
##                   df1  df2 statistic p-value    
## Weak instruments    1 9997      1117  <2e-16 ***
## Wu-Hausman          1 9996       197  <2e-16 ***
## Sargan              0   NA        NA      NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.11 on 9997 degrees of freedom
## Multiple R-Squared: 0.418,   Adjusted R-squared: 0.418 
## Wald test:  339 on 2 and 9997 DF,  p-value: <2e-16

The F-statistic is 1117.1 (large!). So not a weak instrument
The Wu-Hausman test compares the naive estimator to the 2SLS-estimator, and they are significantly different.
This points to that there is unobserved confounding.

Comments

The 2SLS regression recovered the true estimate
Standard regression for observed data yielded biased effects
but highly significant effects… to be published?
We need to do causal analysis.
assumptions in this?

Instrumental variables assumptions

Example IV

\(\textrm {G}\) and \(\textrm {X}\) are associated.
\(\textrm {G}\) is independent of the unmeasured confounders \(\textrm {U}\).
\(\textrm {G}\) is independent of the outcome given \(\textrm {X}\) and \(\textrm {U}\)

-often \(\textrm {G}\) is an inherited genetic variant ensuring Mendelian randomization(!)

Average causal treatment effect

Synopsis

We obtain the average causal treatment effect from the g-computation formula.

Target

The expected risk of outcome \(Y=1\) if everybody have had the exposure \(X=1\) versus the risk of \(Y=1\) if everybody have not had the exposure \(X=0\).

Example: Lungcancer remission

Remission from lungcancer treatment studied through family history, sex, immunology markers, length of stay and cancerstage.
Logistic regression model provides odds-rati of effects.
Average family history effect on remission?

Example: Lungcancer remission

library(lme4)
library(tidyverse)
library(readstata13)
library(mets)

D<-read.dta13('lungcancer.dta')
    
D<- D %>% mutate( sex=factor(sex), cancerstage=factor(cancerstage), familyhx=factor(familyhx))

m5 <- logitATE(remission ~ familyhx+sex+ il6 + crp + cancerstage + 
               lengthofstay +experience + cluster(did), 
               data = D, treat.model=familyhx ~sex+ cancerstage + lengthofstay 
               +experience + cluster(did))

Example: Lungcancer remission

## 
##     n events
##  8525   2521
## 
##  407 clusters
## coeffients:
##                Estimate  Std.Err     2.5%    97.5% P-value
## (Intercept)    -1.43573  0.39656 -2.21297 -0.65850    0.00
## familyhxyes    -0.86541  0.07457 -1.01155 -0.71926    0.00
## sexmale         0.05470  0.05076 -0.04479  0.15420    0.28
## il6            -0.03928  0.00819 -0.05533 -0.02322    0.00
## crp            -0.01822  0.00805 -0.03400 -0.00243    0.02
## cancerstageII  -0.23451  0.06116 -0.35438 -0.11464    0.00
## cancerstageIII -0.56350  0.08436 -0.72884 -0.39816    0.00
## cancerstageIV  -1.50431  0.13377 -1.76650 -1.24212    0.00
## lengthofstay   -0.05498  0.02963 -0.11306  0.00309    0.06
## experience      0.08622  0.01969  0.04763  0.12482    0.00
## 
## exp(coeffients):
##                Estimate  2.5% 97.5%
## (Intercept)       0.238 0.109  0.52
## familyhxyes       0.421 0.364  0.49
## sexmale           1.056 0.956  1.17
## il6               0.961 0.946  0.98
## crp               0.982 0.967  1.00
## cancerstageII     0.791 0.702  0.89
## cancerstageIII    0.569 0.482  0.67
## cancerstageIV     0.222 0.171  0.29
## lengthofstay      0.947 0.893  1.00
## experience        1.090 1.049  1.13
## 
## Average Treatment effects (G-formula) :
##             Estimate Std.Err    2.5%   97.5% P-value
## treat-1       0.1735  0.0655  0.0452  0.3019    0.01
## treat-0       0.3237  0.0762  0.1742  0.4731    0.00
## differenceG  -0.1501  0.0550 -0.2580 -0.0422    0.01
## 
## Average Treatment effects (double robust) :
##              Estimate  Std.Err     2.5%    97.5% P-value
## treat-1       0.17527  0.01019  0.15531  0.19524       0
## treat-0       0.32391  0.00563  0.31288  0.33494       0
## differenceDR -0.14864  0.01155 -0.17128 -0.12599       0
## 
## Average Treatment effects on Treated/Non-Treated (DR) :
##     Estimate Std.Err    2.5%   97.5% P-value
## ATT  -0.1409  0.0105 -0.1614 -0.1205       0
## ATC  -0.2981  0.0078 -0.3134 -0.2828       0

Average family history effect on remission?

##             Estimate Std.Err    2.5%   97.5%  P-value
## treat-1        0.174  0.0655  0.0452  0.3019 8.05e-03
## treat-0        0.324  0.0762  0.1742  0.4731 2.18e-05
## differenceG   -0.150  0.0550 -0.2580 -0.0422 6.39e-03

There is a lower change of remission if family history of lung cancer. It is overall of \(15\%\) with a 95% CI. of \((6\%,26\%)\).

Average causal effect

\[ \begin{align*} \underbrace{E(Y^{a=1})}_{\text{causal effect}} &= \underbrace{E(Y^{a=1}|A=1,Z=z)}_{\text{causal assumptions}} \\ &=\underbrace{E(Y|A=1,Z=z)}_{\text{association effect}} \\ &=\underbrace{\sum_{z}P(Y=1\mid A=1,Z=z))P(A=1\mid Z=z))P(Z=z)}_{\text{g-computation}} \\ \end{align*} \]

All confounders \(Z\) are known. Similarly we obtain \(E(Y^{a=0})\) and the difference, \(E(Y^{a=1})-E(Y^{a=0})\), the average family effect.

Objective sensitivity analysis

Synopsis

We introduce e-values as a measure of influence that a set of unobserved confounders must have to explain away the treatment effect result.

The problem

association

(Smoking): How harmful is the exposure?
(Genes): How about unobserved confounders - eg. genetic effects?

Finding - causal?

In the Nordic twin cancer study, Thorax 2016 it is found that:

“Among smoking discordant pairs, the pairwise HR for lung cancer of the ever smoker twin compared to the never smoker co-twin was 5.4 (95% CI 2.1 to 14.0) in MZ pairs and 5.0 (95% CI 3.2 to 7.9) in DZ pairs.”

Bias? Confounding may explain all of the apparent association.
We calculate the magnitude of confounding that would be necessary to fully explain the estimated ratio of 5.0

E values for objective sensitivity

library(EValue)
evalues.RR(5.0, lo=3.2, hi=7.9)

##          point lower upper
## RR        5.00  3.20   7.9
## E-values  9.47  5.85    NA

The E-value of 9.47 tells us that a hypothetical set of confounders would have to be associated with a 9-fold increase in the risk of lung cancer, and must be 9 times more prevalent in smokers than non-smokers, to explain the observed risk ratio.

E values for objective sensitivity

If the strength of one of these relationships were weaker, the other would have to be stronger for the causal effect of smoking on lung cancer to be truly null.

E values for objective sensitivity

bias_plot(5.0, xmax = 20)

Summary

Causal analysis for unobserved confounding \(\surd\)
Untestable, but more general assumptions
Causal treatment effect obtainable \(\surd\)
Objective sensitivity analysis: E-values.

Thank You!

The presentation is at rpubs.com/jhjelmborg/SDU_causal_briefly.

References are on the next slide.