Recognize what to control for in DAGs
Estimation methods
Steps to causal inference
Define causal effects
Specify models
Specify and defend assumptions
Treatment A: the exposure or intervention of interest
A is a random variable; lowercase a refers to a realization of A
\(A_i\) is the treatment status of subject i
Could be continuous
Could be time to event
Could be multidimensional
Potential outcomes: \(Y^a\) for a general (e.g. continuous) treatment value a
\(Y^1\) and \(Y^0\) for binary treatment
i.e. \(Y^1\): time until the individual would get the flu if they received the vaccine
\(Y^0\): time until the individual would get the flu if they did not receive the vaccine
Observed outcome Y is the outcome under the treatment actually received, i.e. \(Y = Y^A\)
Causal inference is a missing data problem: which potential outcome we observe is decided by the treatment received
A has a causal effect on Y for an individual if \(Y^1 \neq Y^0\)
Can't determine the individual causal effect of, say, ibuprofen without knowing the counterfactual (untreated) outcome
However, we can make estimates at the population level (average causal effects)
\(E(Y^1 - Y^0)\)
Average outcome if everyone had been treated versus if no one had been treated
Typically not equal to \(E(Y \mid A = 1) - E(Y \mid A = 0)\), which compares two subpopulations, unless we randomize
Pearl uses the notation \(E(Y \mid do(A = 1)) - E(Y \mid do(A = 0))\)
The do notation means we "set" treatment; it is equivalent to the potential-outcome notation above
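A quick simulated illustration of this gap under confounding (a sketch; all names and numbers are made up):

```r
set.seed(1)
n  <- 1e5
L  <- rbinom(n, 1, 0.5)             # confounder
A  <- rbinom(n, 1, 0.2 + 0.6 * L)   # treatment depends on L
Y0 <- rnorm(n, 10 + 5 * L)          # potential outcome without treatment
Y1 <- Y0 + 2                        # true individual causal effect is 2
Y  <- ifelse(A == 1, Y1, Y0)        # observed outcome (consistency)

mean(Y1 - Y0)                       # average causal effect: about 2
mean(Y[A == 1]) - mean(Y[A == 0])   # confounded contrast: about 5
```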
Causal risk ratio: \(E(Y^1)/E(Y^0)\)
Causal effect in a subpopulation: \(E(Y^1 - Y^0 \mid V = v)\)
Removing an exposure among the exposed, a hypothetical "what if": \(E(Y^1 - Y^0 \mid A = 1)\)
Quantile causal effects: \(F_{Y^1}^{-1}(p) - F_{Y^0}^{-1}(p)\)
Not a valid causal contrast: comparing the same treatment across two separate groups, without strong assumptions
Not valid: comparing separate subpopulations, even when their treatments differ
Confounding arises when there are variables that affect both the treatment decision and the outcome
Also possible in randomized trials when there is noncompliance
Creates a problem because there is a lack of comparability between groups
e.g. severe injury affects the likelihood of getting surgery, but not deterministically
Treatment shifts the potential outcome \(Y^a\) upward
There will be separate distributions for a = 0 and a = 1, but each curve might get pulled in one direction by L
Controllable confounding
No confounding
Confounding not controllable using standard methods but controllable using others (e.g. instrumental variables)
Could use a natural randomizer
Difference-in-differences tries to difference out unmeasured confounding variables
Uncontrollable confounding
L is the set of baseline (pre-treatment) covariates
If we control for confounding, we can assume any remaining differences are random
Controllable if the bias is overt (not hidden):
\(f(Y^0, Y^1 \mid L, A) = f(Y^0, Y^1 \mid L)\)
\(f(A \mid L, Y^0, Y^1) = f(A \mid L)\)
Basically means conditional randomization
Hidden bias: treatment depends on the potential outcomes even after conditioning on L
DAG terminology
```r
library(dagitty)
library(ggdag)

dagified <- dagify(z ~ x,
                   y ~ z,
                   exposure = "x",
                   outcome = "y")
dag <- tidy_dagitty(dagified)
ggdag(dag, layout = "circle") +
  theme(panel.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
```
In this DAG:
X is a parent of Z and an ancestor of Y
Y is a child of Z and a descendant of X
This is an example of a chain
Forks
A \(\leftarrow\) Z \(\rightarrow\) B
A and B are not independent, since information flows from Z to both
A and B are associated with each other
In the case of a collider G
A \(\rightarrow\) G \(\leftarrow\) B
A and B both affect G
A and B are marginally independent
Paths can be blocked by conditioning on nodes in the path
Conditioning on a chain or fork node stops the flow of dependence
i.e. temperature \(\rightarrow\) icy sidewalks \(\rightarrow\) slipping
We can block the association between temperature and slipping by conditioning on icy sidewalks
A path is d-separated by a set of nodes C if:
it contains a chain (D \(\rightarrow\) E \(\rightarrow\) F) and the middle node E is in C
OR
it contains a fork (D \(\leftarrow\) E \(\rightarrow\) F) and the middle node E is in C
OR
it contains an inverted fork, i.e. a collider (D \(\rightarrow\) E \(\leftarrow\) F), and neither E nor any of its descendants is in C
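A quick dagitty check of the icy-sidewalk chain above (a sketch; the node names are mine):

```r
library(dagitty)

g <- dagitty("dag { temp -> ice -> slip }")
dseparated(g, "temp", "slip", list())  # FALSE: the chain is open
dseparated(g, "temp", "slip", "ice")   # TRUE: conditioning on ice blocks the path
```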
Backdoor paths from treatment A to outcome Y are paths from A to Y that begin with an arrow going into A
These confound the relationship between A and Y
They need to be blocked to draw causal conclusions about the effect of A on Y
Set of variables L is sufficient to control for confounding if:
It blocks all backdoor paths from treatment to the outcome
It does not include any descendants of treatment
This is the backdoor path criterion
```r
dagified <- dagify(y ~ a,
                   a ~ w,
                   y ~ w,
                   w ~ z,
                   a ~ z,
                   exposure = "a",
                   outcome = "y")
dag <- tidy_dagitty(dagified)
ggdag(dag, layout = "circle") +
  theme(panel.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
```
In this example we can block the backdoor paths from a to y by controlling for W, or for W and Z
Can also control for other risk factors affecting y for the sake of efficiency gains; this does not bias the causal estimate
Conditioning on a collider opens a path between the variables leading into it (example 4 in the slides)
```r
# note: no a -> x3 edge; x3 is a pre-treatment variable here
dagified <- dagify(y ~ a,
                   y ~ x2,
                   x3 ~ x2,
                   x3 ~ x1,
                   a ~ x1,
                   y ~ x3,
                   exposure = "a",
                   outcome = "y")
dag <- tidy_dagitty(dagified)
ggdag(dag, layout = "circle") +
  theme(panel.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
```
Here we can block the backdoor path from x1 (a \(\leftarrow\) x1 \(\rightarrow\) x3 \(\rightarrow\) y) by conditioning on x3, but x3 is a collider, so conditioning on it opens another backdoor path through x2. We must then also condition on x2. (Conditioning on x1 alone would also work; see the check below.)
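We can ask dagitty to verify this (a sketch against the DAG above):

```r
library(dagitty)

g <- dagitty("dag {
  x1 -> a ; x1 -> x3 ; x2 -> x3 ; x2 -> y ; x3 -> y ; a -> y
}")
# minimal sufficient adjustment sets; this should list { x1 } and { x2, x3 }
adjustmentSets(g, exposure = "a", outcome = "y")
```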
Adjust for all pre-treatment variables
Select a minimal set using the backdoor criterion
This requires knowing all variables that directly affect the treatment decision and the outcome
Use the disjunctive cause criterion
Then we just need to control for all potential causes of the treatment, the outcome, or both
Debate on data-driven approaches
Could discard many weak confounders
Inferential challenge of getting standard errors right after data-driven selection
Identifying assumptions
Ignorability
Positivity
Consistency
Standardization: \(E(Y^a) = \int E(Y \mid A = a, L = l)\, dF_L(l)\)
Could estimate \(E(Y^a)\) by using models for \(E(Y \mid A = a, L)\) and \(f(L)\)
\(E(Y^a) = \int E(Y \mid A = a, L; \theta)\, f(L; \eta)\, dL\)
Can fit a regression model (e.g. logistic regression for binary Y) for \(Y \mid A, L\)
Get predicted values of Y for each \(L_i\) with A set to a
Average over the empirical distribution of L
Do this for a = 0 and a = 1
This averages over the marginal distribution of L (a sketch follows below)
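A compact sketch of this recipe on simulated data (all names and coefficients illustrative):

```r
set.seed(2)
n <- 5000
L <- rbinom(n, 1, 0.5)
A <- rbinom(n, 1, plogis(-1 + 2 * L))            # treatment depends on L
Y <- rbinom(n, 1, plogis(-1 + 1 * A + 1.5 * L))  # outcome depends on A and L

fit <- glm(Y ~ A + L, family = binomial)  # outcome model for E(Y | A, L)
dat <- data.frame(A, L)
# predict for every subject with A set to 1, then to 0, and average over L
mean(predict(fit, transform(dat, A = 1), type = "response")) -
  mean(predict(fit, transform(dat, A = 0), type = "response"))
```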
The propensity score is the conditional probability of treatment, \(P(A = 1 \mid L)\)
IPTW (inverse probability of treatment weighting)
Can weight by the inverse of the probability of the treatment actually received
This creates a pseudo-population in which the confounder is balanced between treated and untreated
Once we have balance, we can just take weighted sample means
Basically "erasing the arrow" from the confounder into treatment, but you have to weight on the right Ls
If the probability of a particular treatment given L is small, the weight will be large; stabilized weights can help
Large weights lead to large SEs
Weights become very large when positivity is nearly violated
We can use trimming or weight truncation to deal with large weights
Weight truncation means setting a maximum weight (a sketch follows below)
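For example (a sketch, assuming a weight vector wt like the one computed later):

```r
wt_trunc <- pmin(wt, quantile(wt, 0.99))  # cap weights at their 99th percentile
wt_max10 <- pmin(wt, 10)                  # or at a fixed maximum, e.g. 10
```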
Augmented IPTW (AIPTW): doubly robust and more efficient than IPTW
Matching is an alternative to standardization and IPTW estimation
Idea is to make the data more like what you would expect from a randomized trial
Match treated subjects to control subjects on the confounders
Advantages of matching:
Controlling for confounders is achieved at the design phase, without looking at the outcome
Matching will reveal a lack of overlap in the covariate distributions
Once the data are matched, they are essentially treated as if from an RCT
Matching requires calculating a distance between \(L_i\) and \(L_j\)
Can do this with exact matching
Mahalanobis distance (sketched below)
Mahalanobis distance with a caliper (if the distance is too large, we force a non-match)
Propensity score: match on the estimated probability of treatment
Propensity score with a caliper
Can combine measures, such as Mahalanobis distance with a propensity score caliper, or KNN matching
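A base R sketch of the Mahalanobis distance between two subjects' covariate vectors (toy data; note that mahalanobis() returns the squared distance):

```r
set.seed(4)
X <- matrix(rnorm(200), ncol = 2)  # toy covariate matrix, 100 subjects
S <- cov(X)                        # sample covariance of the covariates
# distance between subject 1 and subject 2
sqrt(mahalanobis(X[1, ], center = X[2, ], cov = S))
```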
If there is poor overlap, we'd have to extrapolate; we'd lose lots of data if we trimmed instead.
For subjects i and j, we can then build a distance matrix between \(e_i\) and \(e_j\) (the propensity scores of i and j)
Nearest Neighbor matching
Optimal matching
Sparse optimal matching
Somewhat stratified
Optimal matching within blocks
After matching you can analyze outcome data
```r
library(sandwich)
library(knitr)  # for kable()

n <- 200
L <- rbinom(n, 1, 0.5)
A <- rbinom(n, 1, (0.3 + 0.3 * L))
Y <- rnorm(n, (35 + 10 * L + 20 * A), 5)
mydata <- data.frame(cbind(A, L, Y))
kable(head(mydata))
```
| A | L | Y |
|---|---|---|
| 1 | 0 | 65.93717 |
| 0 | 1 | 44.00917 |
| 1 | 1 | 67.89592 |
| 0 | 0 | 43.63981 |
| 0 | 1 | 43.23738 |
| 1 | 1 | 70.90363 |
Let's look at the unadjusted difference of means:
```r
#unadjusted
#mean of Y for treated subjects
mean(Y[A==1])
## [1] 63.88793
#mean of Y for untreated subjects
mean(Y[A==0])
## [1] 38.31022
#difference
mean(Y[A==1])-mean(Y[A==0])
## [1] 25.57771
```
```r
reg<-lm(Y~A+L,data=mydata)
summary(reg)
## 
## Call:
## lm(formula = Y ~ A + L, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.6897  -3.1837   0.1299   3.6035  11.2976 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  34.5494     0.5490   62.94   <2e-16 ***
## A            20.4373     0.7932   25.77   <2e-16 ***
## L            11.4975     0.7932   14.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.004 on 197 degrees of freedom
## Multiple R-squared:  0.8846, Adjusted R-squared:  0.8834
## F-statistic: 754.9 on 2 and 197 DF,  p-value: < 2.2e-16
```
```r
#can do by hand since one binary covariate
pr.l <- prop.table(table(mydata$L))
kable(pr.l)
```
| Var1 | Freq |
|---|---|
| 0 | 0.465 |
| 1 | 0.535 |
```r
tab.out <- aggregate(Y ~ A + L, mydata, mean)
kable(tab.out)
```
| A | L | Y |
|---|---|---|
| 0 | 0 | 34.47951 |
| 1 | 0 | 55.22618 |
| 0 | 1 | 46.19054 |
| 1 | 1 | 66.41428 |
We can compute the conditional means and weight them by the probability of each covariate value. This corrects the confounding bias in the unadjusted treatment comparison.
```r
((mean(mydata$Y[mydata$A==1 & mydata$L==1]) -
    mean(mydata$Y[mydata$A==0 & mydata$L==1]))*pr.l[2]) +
  (mean(mydata$Y[mydata$A==1 & mydata$L==0]) -
     mean(mydata$Y[mydata$A==0 & mydata$L==0]))*pr.l[1]
##       1 
## 20.4669
```
```r
p.score <- glm(A ~ as.factor(L), data=mydata, family=binomial)
p.a <- ifelse(mydata$A == 0, 1 - predict(p.score, type = "response"),
              predict(p.score, type = "response"))
kable(table(p.a))
```
| p.a | Freq |
|---|---|
| 0.225806451613194 | 21 |
| 0.327102803738384 | 35 |
| 0.672897196261616 | 72 |
| 0.774193548386806 | 72 |
We get four probabilities, one for each combination of treatment status and L.
```r
mydata$w <- 1/p.a
kable(table(mydata$w))
```
| Var1 | Freq |
|---|---|
| 1.29166666666715 | 72 |
| 1.48611111111126 | 72 |
| 3.05714285714224 | 35 |
| 4.42857142856573 | 21 |
```r
summary(mydata$w)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.292   1.292   1.486   2.000   3.057   4.429
sd(mydata$w)
## [1] 1.046915
mean(mydata$w*as.numeric(mydata$A==1)*mydata$Y) -
  mean(mydata$w*as.numeric(mydata$A==0)*mydata$Y)
## [1] 20.4669
```
```r
msm <- lm(Y ~ A, data = mydata, weights = w)
```
Weights create a pseudo population:
```r
SE <- sqrt(diag(vcovHC(msm, type="HC0")))  # robust standard errors
beta <- coef(msm)
lcl <- beta-1.96*SE
ucl <- beta+1.96*SE
cbind(beta, lcl, ucl)[2,]
##     beta      lcl      ucl 
## 20.46690 18.10841 22.82540
```
```r
library(tableone)
library(Matching)
library(geepack)
library(stats)

load(url("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/rhc.sav"))
kable(head(rhc))
```
| cat1 | cat2 | ca | sadmdte | dschdte | dthdte | lstctdte | death | cardiohx | chfhx | dementhx | psychhx | chrpulhx | renalhx | liverhx | gibledhx | malighx | immunhx | transhx | amihx | age | sex | edu | surv2md1 | das2d3pc | t3d30 | dth30 | aps1 | scoma1 | meanbp1 | wblc1 | hrt1 | resp1 | temp1 | pafi1 | alb1 | hema1 | bili1 | crea1 | sod1 | pot1 | paco21 | ph1 | swang1 | wtkilo1 | dnr1 | ninsclas | resp | card | neuro | gastr | renal | meta | hema | seps | trauma | ortho | adld3p | urin1 | race | income | ptid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| COPD | NA | Yes | 11142 | 11151 | NA | 11382 | No | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 70.25098 | Male | 12.000000 | 0.6409912 | 23.50000 | 30 | No | 46 | 0 | 41 | 22.0976562 | 124 | 10 | 38.69531 | 68.0000 | 3.500000 | 58.00000 | 1.0097656 | 1.1999512 | 145 | 4.000000 | 40 | 7.359375 | No RHC | 64.69995 | No | Medicare | Yes | Yes | No | No | No | No | No | No | No | No | 0 | NA | white | Under $11k | 00005 |
| MOSF w/Sepsis | NA | No | 11799 | 11844 | 11844 | 11844 | Yes | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 78.17896 | Female | 12.000000 | 0.7549996 | 14.75195 | 30 | No | 50 | 0 | 63 | 28.8984375 | 137 | 38 | 38.89844 | 218.3125 | 2.599609 | 32.50000 | 0.6999512 | 0.5999756 | 137 | 3.299805 | 34 | 7.329102 | RHC | 45.69998 | No | Private & Medicare | No | No | No | No | No | No | No | Yes | No | No | NA | 1437 | white | Under $11k | 00007 |
| MOSF w/Malignancy | MOSF w/Sepsis | Yes | 12083 | 12143 | NA | 12400 | No | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 46.09198 | Female | 14.069916 | 0.3169999 | 18.13672 | 30 | No | 82 | 0 | 57 | 0.0499954 | 130 | 40 | 36.39844 | 275.5000 | 3.500000 | 21.09766 | 1.0097656 | 2.5996094 | 146 | 2.899902 | 16 | 7.359375 | RHC | 0.00000 | No | Private | No | Yes | No | No | No | No | No | No | No | No | NA | 599 | white | $25-$50k | 00009 |
| ARF | NA | No | 11146 | 11183 | 11183 | 11182 | Yes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 75.33197 | Female | 9.000000 | 0.4409790 | 22.92969 | 30 | No | 48 | 0 | 55 | 23.2968750 | 58 | 26 | 35.79688 | 156.6562 | 3.500000 | 26.29688 | 0.3999634 | 1.6999512 | 117 | 5.799805 | 30 | 7.459961 | No RHC | 54.59998 | No | Private & Medicare | Yes | No | No | No | No | No | No | No | No | No | NA | NA | white | $11-$25k | 00010 |
| MOSF w/Sepsis | NA | No | 12035 | 12037 | 12037 | 12036 | Yes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 67.90997 | Male | 9.945259 | 0.4369998 | 21.05078 | 2 | Yes | 72 | 41 | 65 | 29.6992188 | 125 | 27 | 34.79688 | 478.0000 | 3.500000 | 24.00000 | 1.0097656 | 3.5996094 | 126 | 5.799805 | 17 | 7.229492 | RHC | 78.39996 | Yes | Medicare | No | Yes | No | No | No | No | No | No | No | No | NA | 64 | white | Under $11k | 00011 |
| COPD | NA | No | 12389 | 12396 | NA | 12590 | No | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 86.07794 | Female | 8.000000 | 0.6650000 | 17.50000 | 30 | No | 38 | 0 | 115 | 18.0000000 | 134 | 36 | 39.19531 | 184.1875 | 3.099609 | 30.50000 | 1.0097656 | 1.3999023 | 138 | 5.399414 | 68 | 7.299805 | No RHC | 54.89999 | No | Medicare | Yes | No | No | No | No | No | No | No | No | No | 0 | 242 | white | Under $11k | 00012 |
```r
ARF<-as.numeric(rhc$cat1=='ARF')
CHF<-as.numeric(rhc$cat1=='CHF')
Cirr<-as.numeric(rhc$cat1=='Cirrhosis')
colcan<-as.numeric(rhc$cat1=='Colon Cancer')
Coma<-as.numeric(rhc$cat1=='Coma')
COPD<-as.numeric(rhc$cat1=='COPD')
lungcan<-as.numeric(rhc$cat1=='Lung Cancer')
MOSF<-as.numeric(rhc$cat1=='MOSF w/Malignancy')
sepsis<-as.numeric(rhc$cat1=='MOSF w/Sepsis')
female<-as.numeric(rhc$sex=='Female')
died<-as.numeric(rhc$death=='Yes')
age<-rhc$age
treatment<-as.numeric(rhc$swang1=='RHC')
meanbp1<-rhc$meanbp1

mydata<-data.frame(cbind(ARF,CHF,Cirr,colcan,Coma,lungcan,MOSF,sepsis,
                         age,female,meanbp1,treatment,died))
kable(head(mydata))
```
| ARF | CHF | Cirr | colcan | Coma | lungcan | MOSF | sepsis | age | female | meanbp1 | treatment | died |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 70.25098 | 0 | 41 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 78.17896 | 1 | 63 | 1 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 46.09198 | 1 | 57 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 75.33197 | 1 | 55 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 67.90997 | 0 | 65 | 1 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 86.07794 | 1 | 115 | 0 | 0 |
```r
xvars<-c("ARF","CHF","Cirr","colcan","Coma","lungcan","MOSF","sepsis",
         "age","female","meanbp1")

table1<- CreateTableOne(vars=xvars,strata="treatment", data=mydata, test=FALSE)
print(table1,smd=TRUE)
##                      Stratified by treatment
##                       0             1             SMD   
##   n                    3551          2184               
##   ARF (mean (SD))      0.45 (0.50)   0.42 (0.49)   0.059
##   CHF (mean (SD))      0.07 (0.25)   0.10 (0.29)   0.095
##   Cirr (mean (SD))     0.05 (0.22)   0.02 (0.15)   0.145
##   colcan (mean (SD))   0.00 (0.04)   0.00 (0.02)   0.038
##   Coma (mean (SD))     0.10 (0.29)   0.04 (0.20)   0.207
##   lungcan (mean (SD))  0.01 (0.10)   0.00 (0.05)   0.095
##   MOSF (mean (SD))     0.07 (0.25)   0.07 (0.26)   0.018
##   sepsis (mean (SD))   0.15 (0.36)   0.32 (0.47)   0.415
##   age (mean (SD))     61.76 (17.29) 60.75 (15.63)  0.061
##   female (mean (SD))   0.46 (0.50)   0.41 (0.49)   0.093
##   meanbp1 (mean (SD)) 84.87 (38.87) 68.20 (34.24)  0.455
```
```r
mean(died[treatment==1])
## [1] 0.6804029
mean(died[treatment==0])
## [1] 0.6296818
```
```r
outmodel<-glm(died~treatment+ARF+CHF+Cirr+colcan+Coma+lungcan+MOSF+
                sepsis+age+female+meanbp1,
              family=binomial(),data=mydata)

#now get predicted values for each person if they had been treated
mydata1<-mydata
mydata1$treatment<-1
pred1<-predict(outmodel,mydata1,type='response')

#get predicted values for each person if they had not been treated
mydata0<-mydata
mydata0$treatment<-0
pred0<-predict(outmodel,mydata0,type='response')

mean(pred1)
## [1] 0.6797657
mean(pred0)
## [1] 0.6298412
```
```r
psmodel<-glm(treatment~ARF+CHF+Cirr+colcan+Coma+lungcan+MOSF+
               sepsis+age+female+meanbp1,
             family=binomial(),data=mydata)

#show coefficients etc
summary(psmodel)
## 
## Call:
## glm(formula = treatment ~ ARF + CHF + Cirr + colcan + Coma + 
##     lungcan + MOSF + sepsis + age + female + meanbp1, family = binomial(), 
##     data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7013  -1.0097  -0.6336   1.1814   2.4791  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.7299670  0.1997692  -3.654 0.000258 ***
## ARF          1.2931956  0.1487784   8.692  < 2e-16 ***
## CHF          1.6804704  0.1715672   9.795  < 2e-16 ***
## Cirr         0.5234506  0.2181458   2.400 0.016416 *  
## colcan       0.0295468  1.0985361   0.027 0.978542    
## Coma         0.7013451  0.1854937   3.781 0.000156 ***
## lungcan     -0.0869570  0.5039331  -0.173 0.863000    
## MOSF         1.3046587  0.1772705   7.360 1.84e-13 ***
## sepsis       2.0433604  0.1545437  13.222  < 2e-16 ***
## age         -0.0031374  0.0017289  -1.815 0.069567 .  
## female      -0.1697903  0.0583574  -2.909 0.003620 ** 
## meanbp1     -0.0109824  0.0008217 -13.366  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7621.4  on 5734  degrees of freedom
## Residual deviance: 6983.6  on 5723  degrees of freedom
## AIC: 7007.6
## 
## Number of Fisher Scoring iterations: 4

#create propensity score
pscore<-psmodel$fitted.values
```
We can output quick histograms to make sure there is enough overlap.
```r
#unstabilized weights
wt<-ifelse(treatment==1,1/pscore,1/(1-pscore))
hist(wt)
summary(wt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.046   1.405   1.721   2.001   2.280  21.606
```
```r
#stabilized weights
#probability of treatment (unconditional)
probt<-mean(treatment)
sw<-ifelse(treatment==1,probt/pscore,(1-probt)/(1-pscore))
hist(sw)
summary(sw)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4917  0.7579  0.8990  1.0004  1.0997  8.2279
```
```r
weighted.mean(died[treatment==1],wt[treatment==1])
## [1] 0.6820133
weighted.mean(died[treatment==0],wt[treatment==0])
## [1] 0.6304638
```
```r
#now do weighted regression (get same estimates)
id<-1:length(died)
msm1 <- geeglm(as.numeric(died) ~ treatment,
               id = id,
               weights = wt,
               family = binomial(link=logit))
summary(msm1)
## 
## Call:
## geeglm(formula = as.numeric(died) ~ treatment, family = binomial(link = logit), 
##     weights = wt, id = id)
## 
## Coefficients:
##             Estimate Std.err   Wald Pr(>|W|)    
## (Intercept)  0.53421 0.03602 219.95  < 2e-16 ***
## treatment    0.22883 0.06486  12.45 0.000419 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)        1 0.01235
## 
## Correlation: Structure = independence
## Number of clusters: 5735  Maximum cluster size: 1
head(msm1$fitted.values)
##     [,1]
## 1 0.6305
## 2 0.6820
## 3 0.6820
## 4 0.6305
## 5 0.6820
## 6 0.6305
```
```r
#using stabilized weights
msm2 <- geeglm(as.numeric(died) ~ treatment,
               id = id,
               weights = sw,
               data=mydata,
               family = binomial(link=logit))
summary(msm2)
## 
## Call:
## geeglm(formula = as.numeric(died) ~ treatment, family = binomial(link = logit), 
##     data = mydata, weights = sw, id = id)
## 
## Coefficients:
##             Estimate Std.err  Wald Pr(>|W|)    
## (Intercept)   0.5342  0.0360 219.9  < 2e-16 ***
## treatment     0.2288  0.0649  12.4  0.00042 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)        1  0.0106
## 
## Correlation: Structure = independence
## Number of clusters: 5735  Maximum cluster size: 1
head(msm2$fitted.values)
##    [,1]
## 1 0.630
## 2 0.682
## 3 0.682
## 4 0.630
## 5 0.682
## 6 0.630
```
```r
greedymatch<-Match(Tr=treatment,M=1,X=mydata[xvars],replace=FALSE)
matched<-mydata[unlist(greedymatch[c("index.treated","index.control")]), ]

#get table 1 for matched data with standardized differences
matchedtab1<-CreateTableOne(vars=xvars, strata ="treatment",
                            data=matched, test = FALSE)
print(matchedtab1, smd = TRUE)
##                      Stratified by treatment
##                       0             1             SMD   
##   n                    2184          2184               
##   ARF (mean (SD))      0.42 (0.49)   0.42 (0.49)   0.006
##   CHF (mean (SD))      0.10 (0.29)   0.10 (0.29)  <0.001
##   Cirr (mean (SD))     0.02 (0.15)   0.02 (0.15)  <0.001
##   colcan (mean (SD))   0.00 (0.02)   0.00 (0.02)  <0.001
##   Coma (mean (SD))     0.04 (0.20)   0.04 (0.20)  <0.001
##   lungcan (mean (SD))  0.00 (0.05)   0.00 (0.05)  <0.001
##   MOSF (mean (SD))     0.07 (0.26)   0.07 (0.26)  <0.001
##   sepsis (mean (SD))   0.24 (0.43)   0.32 (0.47)   0.177
##   age (mean (SD))     61.53 (16.15) 60.75 (15.63)  0.049
##   female (mean (SD))   0.44 (0.50)   0.41 (0.49)   0.042
##   meanbp1 (mean (SD)) 73.12 (34.28) 68.20 (34.24)  0.144
```
```r
#outcome analysis
y_trt<-matched$died[matched$treatment==1]
y_con<-matched$died[matched$treatment==0]
mean(y_trt)
## [1] 0.68
mean(y_con)
## [1] 0.626

#pairwise difference
diffy<-y_trt-y_con

#paired t-test
t.test(diffy)
## 
##  One Sample t-test
## 
## data:  diffy
## t = 4, df = 2000, p-value = 9e-05
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.0271 0.0810
## sample estimates:
## mean of x 
##     0.054
```
```r
#McNemar test
kable(table(y_trt,y_con))
```
| y_trt \ y_con | 0 | 1 |
|---|---|---|
| 0 | 303 | 395 |
| 1 | 513 | 973 |
```r
mcnemar.test(matrix(c(973,513,395,303),2,2))
## 
##  McNemar's Chi-squared test with continuity correction
## 
## data:  matrix(c(973, 513, 395, 303), 2, 2)
## McNemar's chi-squared = 20, df = 1, p-value = 1e-04
```
This is a case of mediation: an intervening variable affects the outcome but is not a confounder.
```r
dag3 <- dagify(y ~ x + m,
               m ~ x)
dag_med <- tidy_dagitty(dag3)
ggdag(dag_med, layout = "circle") +
  theme(panel.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
```
The traditional approach is to fit two regression models (Baron and Kenny), as sketched below
The mediation counterfactuals are usually unobservable directly
Identification involves integrating over the mediator's distribution under a different treatment level
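A minimal sketch of the two-regression approach on simulated data (all names and coefficients illustrative):

```r
set.seed(3)
n <- 1000
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)             # mediator model: x -> m
y <- 0.3 * x + 0.7 * m + rnorm(n)   # outcome model: x and m -> y

fit_m <- lm(m ~ x)      # step 1: regress mediator on exposure
fit_y <- lm(y ~ x + m)  # step 2: regress outcome on exposure and mediator
c(indirect = unname(coef(fit_m)["x"] * coef(fit_y)["m"]),  # ~ 0.35
  direct   = unname(coef(fit_y)["x"]))                     # ~ 0.30
```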