\[ \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\Expec{\text{E}} \def\logit{\text{logit}} \def\diag{\text{diag}} \]
Aim: “Research to improve human health”
Causal statistical analysis in health science
What is it about and how to compute?
To target the exposure-outcome relation taking unobserved confounding into account (using ‘counterfactuals’).
To estimate the average causal effect (by ‘g-computation’)
To gauge the magnitude of unobserved confounding explaining away the average causal effect (obtaining objective sensitivity)
We review the randomized design of study to introduce causal assumptions for analyzing observational studies using counterfactuals.
\[ \underbrace{E(Y|X_1,\ldots ,X_p)}_{\text{expected Y given covariates}} = \underbrace{\alpha+\beta_1X_1+\ldots +\beta_pX_p}_{\text{predictor}} \]
For association, the effect of treatment \(X_1\) given covariates \(X_2,\ldots,X_p\) is
\[ \underbrace{E(Y|X_1=1)-E(Y|X_1=0)}_{\text{treated versus untreated}} = \underbrace{\beta_1}_{\text{effect}} \]
Each individual has two potential outcomes, \[Y^{a=1},\quad Y^{a=0}\]
Our target is the causal effect \[E(Y^{a=1})-E(Y^{a=0})\]
Two assumptions: Exchangeability and Consistency
Exchangeability: The potential outcome, \(Y^{a}\) is independent of treatment allocation \(A\) for all levels \(a\) of \(A\).
-in other words, the treated would have experienced the same outcome as the untreated and vice versa would they have been exchanged.
Consistency: If an individual is given treatment \(a\), that is, \(A=a\), then the potential outcome, \(Y^{a}\) equals the outcome, \(Y^{a}=Y\).
\[ \begin{align*} \underbrace{E(Y^{a=1})-E(Y^{a=0})}_{\text{causal effect}} &= \underbrace{E(Y^{a=1}|A=1)-E(Y^{a=0}|A=0)}_{\text{exchangeability and consistency}} \\ &=\underbrace{E(Y|A=1)-E(Y|A=0)}_{\text{association effect}} \end{align*} \]
So association is causation(!)
We illustrate the need to actual do causal analysis. First we review the instrumental variables approach as remedy to obtain causal interpretation of treatment effect.
##
## Call:
## lm(formula = Y ~ X + Z + U)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8000 -0.3293 -0.0013 0.3336 1.6806
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.99043 0.00928 106.8 <2e-16 ***
## X 0.50733 0.00456 111.3 <2e-16 ***
## Z 0.29846 0.00990 30.2 <2e-16 ***
## U -0.99422 0.00538 -184.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.49 on 9996 degrees of freedom
## Multiple R-squared: 0.888, Adjusted R-squared: 0.888
## F-statistic: 2.63e+04 on 3 and 9996 DF, p-value: <2e-16
##
## Call:
## lm(formula = Y ~ X + G + Z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.194 -0.680 0.004 0.679 3.750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.60555 0.01968 30.77 <2e-16 ***
## X 0.89711 0.00909 98.68 <2e-16 ***
## G -0.30181 0.02148 -14.05 <2e-16 ***
## Z 0.17157 0.02056 8.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.02 on 9996 degrees of freedom
## Multiple R-squared: 0.513, Adjusted R-squared: 0.513
## F-statistic: 3.52e+03 on 3 and 9996 DF, p-value: <2e-16
But biased estimate of \(\beta_X\) and significant effect of \(G\)
##
## Call:
## lm(formula = Y ~ G)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.302 -0.980 0.009 0.973 5.934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7089 0.0204 83.9 <2e-16 ***
## G 0.3740 0.0290 12.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.45 on 9998 degrees of freedom
## Multiple R-squared: 0.0164, Adjusted R-squared: 0.0163
## F-statistic: 167 on 1 and 9998 DF, p-value: <2e-16
* Do the regression of X on G (and Z), and
* Calculate the predicted values: $\hat{X}$
* Do the regression of Y on $\hat{X}$ (and Z)
##
## Call:
## lm(formula = Y ~ hat.X + Z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.526 -0.963 0.001 0.941 5.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0023 0.0555 18.07 <2e-16 ***
## hat.X 0.4942 0.0382 12.93 <2e-16 ***
## Z 0.2937 0.0310 9.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.43 on 9997 degrees of freedom
## Multiple R-squared: 0.0395, Adjusted R-squared: 0.0393
## F-statistic: 205 on 2 and 9997 DF, p-value: <2e-16
* Do the regression of X on G (and Z), and
* Calculate the predicted values: $\hat{X}$
* Do the regression of Y on $\hat{X}$ (and Z)
##
## Call:
## lm(formula = Y ~ hat.X + Z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.526 -0.963 0.001 0.941 5.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0023 0.0555 18.07 <2e-16 ***
## hat.X 0.4942 0.0382 12.93 <2e-16 ***
## Z 0.2937 0.0310 9.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.43 on 9997 degrees of freedom
## Multiple R-squared: 0.0395, Adjusted R-squared: 0.0393
## F-statistic: 205 on 2 and 9997 DF, p-value: <2e-16
##
## Call:
## ivreg(formula = Y ~ Z + X | Z + G)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6502 -0.7582 0.0099 0.7483 4.4193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0023 0.0432 23.2 <2e-16 ***
## Z 0.2937 0.0241 12.2 <2e-16 ***
## X 0.4942 0.0297 16.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.11 on 9997 degrees of freedom
## Multiple R-Squared: 0.418, Adjusted R-squared: 0.418
## Wald test: 339 on 2 and 9997 DF, p-value: <2e-16
##
## Call:
## ivreg(formula = Y ~ Z + X | Z + G)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6502 -0.7582 0.0099 0.7483 4.4193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0023 0.0432 23.2 <2e-16 ***
## Z 0.2937 0.0241 12.2 <2e-16 ***
## X 0.4942 0.0297 16.6 <2e-16 ***
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 1 9997 1117 <2e-16 ***
## Wu-Hausman 1 9996 197 <2e-16 ***
## Sargan 0 NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.11 on 9997 degrees of freedom
## Multiple R-Squared: 0.418, Adjusted R-squared: 0.418
## Wald test: 339 on 2 and 9997 DF, p-value: <2e-16
-often \(\textrm {G}\) is an inherited genetic variant ensuring Mendelian randomization(!)
We obtain the average causal treatment effect from the g-computation formula.
The expected risk of outcome \(Y=1\) if everybody have had the exposure \(X=1\) versus the risk of \(Y=1\) if everybody have not had the exposure \(X=0\).
Remission from lungcancer treatment studied through family history, sex, immunology markers, length of stay and cancerstage.
Logistic regression model provides odds-rati of effects.
Average family history effect on remission?
library(lme4)
library(tidyverse)
library(readstata13)
library(mets)
D<-read.dta13('lungcancer.dta')
D<- D %>% mutate( sex=factor(sex), cancerstage=factor(cancerstage), familyhx=factor(familyhx))
m5 <- logitATE(remission ~ familyhx+sex+ il6 + crp + cancerstage +
lengthofstay +experience + cluster(did),
data = D, treat.model=familyhx ~sex+ cancerstage + lengthofstay
+experience + cluster(did))##
## n events
## 8525 2521
##
## 407 clusters
## coeffients:
## Estimate Std.Err 2.5% 97.5% P-value
## (Intercept) -1.43573 0.39656 -2.21297 -0.65850 0.00
## familyhxyes -0.86541 0.07457 -1.01155 -0.71926 0.00
## sexmale 0.05470 0.05076 -0.04479 0.15420 0.28
## il6 -0.03928 0.00819 -0.05533 -0.02322 0.00
## crp -0.01822 0.00805 -0.03400 -0.00243 0.02
## cancerstageII -0.23451 0.06116 -0.35438 -0.11464 0.00
## cancerstageIII -0.56350 0.08436 -0.72884 -0.39816 0.00
## cancerstageIV -1.50431 0.13377 -1.76650 -1.24212 0.00
## lengthofstay -0.05498 0.02963 -0.11306 0.00309 0.06
## experience 0.08622 0.01969 0.04763 0.12482 0.00
##
## exp(coeffients):
## Estimate 2.5% 97.5%
## (Intercept) 0.238 0.109 0.52
## familyhxyes 0.421 0.364 0.49
## sexmale 1.056 0.956 1.17
## il6 0.961 0.946 0.98
## crp 0.982 0.967 1.00
## cancerstageII 0.791 0.702 0.89
## cancerstageIII 0.569 0.482 0.67
## cancerstageIV 0.222 0.171 0.29
## lengthofstay 0.947 0.893 1.00
## experience 1.090 1.049 1.13
##
## Average Treatment effects (G-formula) :
## Estimate Std.Err 2.5% 97.5% P-value
## treat-1 0.1735 0.0655 0.0452 0.3019 0.01
## treat-0 0.3237 0.0762 0.1742 0.4731 0.00
## differenceG -0.1501 0.0550 -0.2580 -0.0422 0.01
##
## Average Treatment effects (double robust) :
## Estimate Std.Err 2.5% 97.5% P-value
## treat-1 0.17527 0.01019 0.15531 0.19524 0
## treat-0 0.32391 0.00563 0.31288 0.33494 0
## differenceDR -0.14864 0.01155 -0.17128 -0.12599 0
##
## Average Treatment effects on Treated/Non-Treated (DR) :
## Estimate Std.Err 2.5% 97.5% P-value
## ATT -0.1409 0.0105 -0.1614 -0.1205 0
## ATC -0.2981 0.0078 -0.3134 -0.2828 0
## Estimate Std.Err 2.5% 97.5% P-value
## treat-1 0.174 0.0655 0.0452 0.3019 8.05e-03
## treat-0 0.324 0.0762 0.1742 0.4731 2.18e-05
## differenceG -0.150 0.0550 -0.2580 -0.0422 6.39e-03
There is a lower change of remission if family history of lung cancer. It is overall of \(15\%\) with a 95% CI. of \((6\%,26\%)\).
\[ \begin{align*} \underbrace{E(Y^{a=1})}_{\text{causal effect}} &= \underbrace{E(Y^{a=1}|A=1,Z=z)}_{\text{causal assumptions}} \\ &=\underbrace{E(Y|A=1,Z=z)}_{\text{association effect}} \\ &=\underbrace{\sum_{z}P(Y=1\mid A=1,Z=z))P(A=1\mid Z=z))P(Z=z)}_{\text{g-computation}} \\ \end{align*} \]
All confounders \(Z\) are known. Similarly we obtain \(E(Y^{a=0})\) and the difference, \(E(Y^{a=1})-E(Y^{a=0})\), the average family effect.
We introduce e-values as a measure of influence that a set of unobserved confounders must have to explain away the treatment effect result.
In the Nordic twin cancer study, Thorax 2016 it is found that:
“Among smoking discordant pairs, the pairwise HR for lung cancer of the ever smoker twin compared to the never smoker co-twin was 5.4 (95% CI 2.1 to 14.0) in MZ pairs and 5.0 (95% CI 3.2 to 7.9) in DZ pairs.”
Bias? Confounding may explain all of the apparent association.
We calculate the magnitude of confounding that would be necessary to fully explain the estimated ratio of 5.0
## point lower upper
## RR 5.00 3.20 7.9
## E-values 9.47 5.85 NA
If the strength of one of these relationships were weaker, the other would have to be stronger for the causal effect of smoking on lung cancer to be truly null.
The presentation is at rpubs.com/jhjelmborg/SDU_causal_briefly.
References are on the next slide.
Comments