12 逆概率加权和边缘结构模型
(需要注意,A和Y都可以是连续变量或者分类变量)
戒烟对体重增加的因果效应均值是多少?在本 章,我们会讲述如何使用逆概率加权从观察性数据中估计这一效应。我们将用现实世界中 NHEFS 的数据来估计戒烟对体重增加的效应。NHEFS 表示“美国国家健 康与营养调查数据 1 期:流行病学跟踪研究”。NHEFS 是由美国国家健康数据中心、老龄化研究 所、以及其他美国公共卫生服务机构共同发起的研究。NHEFS 的详尽介绍及数据能在以下网址中 找到:wwwn.cdc.gov/nchs/nhanes/nhefs/。
关于数据集的信息如链接所示: https://github.com/IBM/causallib/blob/master/causallib/datasets/data/nhefs/NHEFS_codebook.csv
12.1 因果性为题
我们的目标是估计戒烟 A (治疗变量)对体重增加 Y (结局变量)的因果效应均值,因而我 们从 NHEFS 数据中选取了 1566 名 25 至 74 岁的参与人员,他们有第一次访问时的数据,并在 10 年后被再次随访。如果他们在随访之前戒烟了,那就被称为治疗组 A = 1 ,如果没有则是非治疗 组 A = 0 。在初次访问和随访中都记录了每个人的体重(kg),因而可以计算出体重增加了多少。
我们定义戒烟效果的因果效应定义为这两者的差别,也就是E(Ya1)-E(Ya0)
通常而言,相关性差值E(Y|A=1)-E(Y|A=0)与E(Ya1)-E(Ya0)有区别。如果戒烟者和非戒烟者的特征分布不一样, 那么相关性的差值就没有因果意义。(我们可以分析治疗组和非治疗组的特征分布是不是一样。)
#####################################################
# PROGRAM 12.1
# Descriptive statistics from NHEFS data (Table 12.1)
#####################################################
#install.packages("readxl") # install package if required
library("readxl")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.3 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
nhefs <- read.csv("/Users/milin/Downloads/nhefs.csv")
nhefs$cens <- ifelse(is.na(nhefs$wt82), 1, 0) # 如果没有值就记为1
nhefs.nmv <- nhefs[which(!is.na(nhefs$wt82)),] #删除缺失值 provisionally ignore subjects with missing values for weight in 1982
lm(wt82_71 ~ qsmk, data=nhefs.nmv) # qsmk 表示是否是治疗组
##
## Call:
## lm(formula = wt82_71 ~ qsmk, data = nhefs.nmv)
##
## Coefficients:
## (Intercept) qsmk
## 1.984 2.541
predict(lm(wt82_71 ~ qsmk, data=nhefs.nmv), data.frame(qsmk=1)) # Smoking cessation
## 1
## 4.525079
predict(lm(wt82_71 ~ qsmk, data=nhefs.nmv), data.frame(qsmk=0)) # No smoking cessation
## 1
## 1.984498
# Table
summary(nhefs.nmv[which(nhefs.nmv$qsmk==0),]$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.00 33.00 42.00 42.79 51.00 72.00
summary(nhefs.nmv[which(nhefs.nmv$qsmk==0),]$wt71)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 40.82 59.19 68.49 70.30 79.38 151.73
summary(nhefs.nmv[which(nhefs.nmv$qsmk==0),]$smokeintensity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 15.00 20.00 21.19 30.00 60.00
summary(nhefs.nmv[which(nhefs.nmv$qsmk==0),]$smokeyrs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 15.00 23.00 24.09 32.00 64.00
summary(nhefs.nmv[which(nhefs.nmv$qsmk==1),]$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.00 35.00 46.00 46.17 56.00 74.00
summary(nhefs.nmv[which(nhefs.nmv$qsmk==1),]$wt71)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 39.58 60.67 71.21 72.35 81.08 136.98
summary(nhefs.nmv[which(nhefs.nmv$qsmk==1),]$smokeintensity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 10.0 20.0 18.6 25.0 80.0
summary(nhefs.nmv[which(nhefs.nmv$qsmk==1),]$smokeyrs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 15.00 26.00 26.03 35.00 60.00
table(nhefs.nmv$qsmk, nhefs.nmv$sex)
##
## 0 1
## 0 542 621
## 1 220 183
prop.table(table(nhefs.nmv$qsmk, nhefs.nmv$sex), 1)
##
## 0 1
## 0 0.4660361 0.5339639
## 1 0.5459057 0.4540943
table(nhefs.nmv$qsmk, nhefs.nmv$race)
##
## 0 1
## 0 993 170
## 1 367 36
prop.table(table(nhefs.nmv$qsmk, nhefs.nmv$race), 1)
##
## 0 1
## 0 0.85382631 0.14617369
## 1 0.91066998 0.08933002
table(nhefs.nmv$qsmk, nhefs.nmv$education)
##
## 1 2 3 4 5
## 0 210 266 480 92 115
## 1 81 74 157 29 62
prop.table(table(nhefs.nmv$qsmk, nhefs.nmv$education), 1)
##
## 1 2 3 4 5
## 0 0.18056750 0.22871883 0.41272571 0.07910576 0.09888220
## 1 0.20099256 0.18362283 0.38957816 0.07196030 0.15384615
table(nhefs.nmv$qsmk, nhefs.nmv$exercise)
##
## 0 1 2
## 0 237 485 441
## 1 63 176 164
prop.table(table(nhefs.nmv$qsmk, nhefs.nmv$exercise), 1)
##
## 0 1 2
## 0 0.2037833 0.4170249 0.3791917
## 1 0.1563275 0.4367246 0.4069479
table(nhefs.nmv$qsmk, nhefs.nmv$active)
##
## 0 1 2
## 0 532 527 104
## 1 170 188 45
prop.table(table(nhefs.nmv$qsmk, nhefs.nmv$active), 1)
##
## 0 1 2
## 0 0.4574377 0.4531384 0.0894239
## 1 0.4218362 0.4665012 0.1116625
从结果中可以看到,戒烟者和非戒烟者在性别、种族、教育程度、初次访问时的体重以及 吸烟频率等特征方面并不相同。如果这些变量是混杂变量,那我们就需要在分析中调整这些变 量。在本书第十八章我们将讨论混杂变量的选取。在此处,我们假设初次访问时的 9 个变量就足 以代表所有混杂,这些变量包括:性别(sex1,0:男性,1:女性),年龄(age,单位:年), 种族(race,0:白人,1:其他),教育程度(education,5 个分类),吸烟频率 (smokeintensity,单位:支/天),烟龄(smokeyrs,单位:年),每日体力情况(active,3 个分类),运动量(exercise,3 个分类),以及初次访问时的体重(wt71,单位:kg)。也 即,我们常用来表示混杂变量的 L 此时是一个含 9 个变量的向量。下一节我们将用逆概率加权的 方法来调整这些变量。
12.2 使用模型计算逆概率权重
为了得到p(A=1|L)的参数估计值,可以使用逻辑回归模型,用来预测控制了混杂变量之后戒烟的概率。
在虚拟人群中,我们通过加权最小二乘法拟合(饱和)线性均值模型E[Y | A]=θ θ+A,从而估计E [Y−| A=1] E=[Y | A 0].
有两步:1. 根据A与L构建逻辑回归模型,罗辑回归模型的预测值的倒数就是L的逆概率。(计算逆概率) 2.使用加权最小二乘拟合A与Y的线性回归模型。(构建回归模型)
#####################################################
# PROGRAM 12.2
# Estimating IP weights
# Data from NHEFS
#####################################################
# Estimation of ip weights via a logistic model
fit <- glm(qsmk ~ sex + race + age + I(age^2) +
as.factor(education) + smokeintensity +
I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) +
as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2),
family = binomial(), data = nhefs.nmv)
summary(fit)
##
## Call:
## glm(formula = qsmk ~ sex + race + age + I(age^2) + as.factor(education) +
## smokeintensity + I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) +
## as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2),
## family = binomial(), data = nhefs.nmv)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5127 -0.7907 -0.6387 0.9832 2.3729
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.2425191 1.3808360 -1.624 0.104369
## sex -0.5274782 0.1540496 -3.424 0.000617 ***
## race -0.8392636 0.2100665 -3.995 6.46e-05 ***
## age 0.1212052 0.0512663 2.364 0.018068 *
## I(age^2) -0.0008246 0.0005361 -1.538 0.124039
## as.factor(education)2 -0.0287755 0.1983506 -0.145 0.884653
## as.factor(education)3 0.0864318 0.1780850 0.485 0.627435
## as.factor(education)4 0.0636010 0.2732108 0.233 0.815924
## as.factor(education)5 0.4759606 0.2262237 2.104 0.035384 *
## smokeintensity -0.0772704 0.0152499 -5.067 4.04e-07 ***
## I(smokeintensity^2) 0.0010451 0.0002866 3.647 0.000265 ***
## smokeyrs -0.0735966 0.0277775 -2.650 0.008061 **
## I(smokeyrs^2) 0.0008441 0.0004632 1.822 0.068398 .
## as.factor(exercise)1 0.3548405 0.1801351 1.970 0.048855 *
## as.factor(exercise)2 0.3957040 0.1872400 2.113 0.034571 *
## as.factor(active)1 0.0319445 0.1329372 0.240 0.810100
## as.factor(active)2 0.1767840 0.2149720 0.822 0.410873
## wt71 -0.0152357 0.0263161 -0.579 0.562625
## I(wt71^2) 0.0001352 0.0001632 0.829 0.407370
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1786.1 on 1565 degrees of freedom
## Residual deviance: 1676.9 on 1547 degrees of freedom
## AIC: 1714.9
##
## Number of Fisher Scoring iterations: 4
p.qsmk.obs <- ifelse(nhefs.nmv$qsmk == 0, 1 - predict(fit, type = "response"),
predict(fit, type = "response"))
nhefs.nmv$w <- 1/p.qsmk.obs
summary(nhefs.nmv$w)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.054 1.230 1.373 1.996 1.990 16.700
sd(nhefs.nmv$w)
## [1] 1.474787
#install.packages("geepack") # install package if required
library("geepack")
## Warning: package 'geepack' was built under R version 4.1.2
msm.w <- geeglm(wt82_71 ~ qsmk, data=nhefs.nmv, weights=w,id = seqn,corstr="independence")
summary(msm.w)
##
## Call:
## geeglm(formula = wt82_71 ~ qsmk, data = nhefs.nmv, weights = w,
## id = seqn, corstr = "independence")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 1.7800 0.2247 62.73 2.33e-15 ***
## qsmk 3.4405 0.5255 42.87 5.86e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 65.06 4.221
## Number of clusters: 1566 Maximum cluster size: 1
beta <- coef(msm.w)
SE <- coef(summary(msm.w))[,2]
lcl <- beta-qnorm(0.975)*SE
ucl <- beta+qnorm(0.975)*SE
cbind(beta, lcl, ucl)
## beta lcl ucl
## (Intercept) 1.780 1.340 2.22
## qsmk 3.441 2.411 4.47
# no association between sex and qsmk in pseudo-population
xtabs(nhefs.nmv$w ~ nhefs.nmv$sex + nhefs.nmv$qsmk)
## nhefs.nmv$qsmk
## nhefs.nmv$sex 0 1
## 0 763.6 763.6
## 1 801.7 797.2
# "check" for positivity (White women)
table(nhefs.nmv$age[nhefs.nmv$race == 0 & nhefs.nmv$sex == 1],
nhefs.nmv$qsmk[nhefs.nmv$race == 0 & nhefs.nmv$sex == 1])
##
## 0 1
## 25 24 3
## 26 14 5
## 27 18 2
## 28 20 5
## 29 15 4
## 30 14 5
## 31 11 5
## 32 14 7
## 33 12 3
## 34 22 5
## 35 16 5
## 36 13 3
## 37 14 1
## 38 6 2
## 39 19 4
## 40 10 4
## 41 13 3
## 42 16 3
## 43 14 3
## 44 9 4
## 45 12 5
## 46 19 4
## 47 19 4
## 48 19 4
## 49 11 3
## 50 18 4
## 51 9 3
## 52 11 3
## 53 11 4
## 54 17 9
## 55 9 4
## 56 8 7
## 57 9 2
## 58 8 4
## 59 5 4
## 60 5 4
## 61 5 2
## 62 6 5
## 63 3 3
## 64 7 1
## 65 3 2
## 66 4 0
## 67 2 0
## 69 6 2
## 70 2 1
## 71 0 1
## 72 2 2
## 74 0 1
12.3 逆概率稳定权重
不过,除此之外,我们还有其他方法来构建虚拟人群,从而让其中的 A 和 L 相互独立。比 如,我们可以构建一个无论 L 是什么, A = 1 的概率是 50%、 A = 0 的概率也是 50%的虚拟人群, 这样一个虚拟人群的逆概率权重是0.5/ f (A| L)。
逆概率权重为 p / f ( A | L ) 时 同理,其中0< p≤1。权重WA =1/ f (A|L)只是p=1时的一个特例。
让我们进一步往前推理。调整混杂的目的是让治疗变量 A 的概率不取决于混杂变量 L 。于是 我们可以构建一个虚拟人群来达到这个目的,而其中每个人接受治疗 A 的概率是否相同则无关紧 要,只要这个概率不取决于 L 就行。
比如,一个常用做法是让虚拟人群中接受治疗的概率等于原 本样本中治疗的概率Pr[A =1],对于未接受治疗的概率同理。因而,此时的逆概率权重是:治疗组是Pr[A=1]/ f (A|L),非治疗组是Pr[A=0]/ f (A|L)。或者写作f (A)/ f (A|L)。
分子中的 f (A)被称为稳定因子,会缩小 f (A)/ f (A| L)的取值范围。
W A =1/ f (A| L)被称为非稳定权重,而SW A = f (A)/ f (A| L)被称为稳定权重。
#####################################################
# PROGRAM 12.3
# Estimating stabilized IP weights
# Data from NHEFS
#####################################################
# estimation of denominator of ip weights
denom.fit <- glm(qsmk ~ as.factor(sex) + as.factor(race) + age + I(age^2) +
as.factor(education) + smokeintensity +
I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) +
as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2),
family = binomial(), data = nhefs.nmv)
summary(denom.fit)
##
## Call:
## glm(formula = qsmk ~ as.factor(sex) + as.factor(race) + age +
## I(age^2) + as.factor(education) + smokeintensity + I(smokeintensity^2) +
## smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) +
## wt71 + I(wt71^2), family = binomial(), data = nhefs.nmv)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.513 -0.791 -0.639 0.983 2.373
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.242519 1.380836 -1.62 0.10437
## as.factor(sex)1 -0.527478 0.154050 -3.42 0.00062 ***
## as.factor(race)1 -0.839264 0.210067 -4.00 6.5e-05 ***
## age 0.121205 0.051266 2.36 0.01807 *
## I(age^2) -0.000825 0.000536 -1.54 0.12404
## as.factor(education)2 -0.028776 0.198351 -0.15 0.88465
## as.factor(education)3 0.086432 0.178085 0.49 0.62744
## as.factor(education)4 0.063601 0.273211 0.23 0.81592
## as.factor(education)5 0.475961 0.226224 2.10 0.03538 *
## smokeintensity -0.077270 0.015250 -5.07 4.0e-07 ***
## I(smokeintensity^2) 0.001045 0.000287 3.65 0.00027 ***
## smokeyrs -0.073597 0.027777 -2.65 0.00806 **
## I(smokeyrs^2) 0.000844 0.000463 1.82 0.06840 .
## as.factor(exercise)1 0.354841 0.180135 1.97 0.04885 *
## as.factor(exercise)2 0.395704 0.187240 2.11 0.03457 *
## as.factor(active)1 0.031944 0.132937 0.24 0.81010
## as.factor(active)2 0.176784 0.214972 0.82 0.41087
## wt71 -0.015236 0.026316 -0.58 0.56262
## I(wt71^2) 0.000135 0.000163 0.83 0.40737
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1786.1 on 1565 degrees of freedom
## Residual deviance: 1676.9 on 1547 degrees of freedom
## AIC: 1715
##
## Number of Fisher Scoring iterations: 4
pd.qsmk <- predict(denom.fit, type = "response")
# estimation of numerator of ip weights
numer.fit <- glm(qsmk~1, family = binomial(), data = nhefs.nmv)
summary(numer.fit)
##
## Call:
## glm(formula = qsmk ~ 1, family = binomial(), data = nhefs.nmv)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.771 -0.771 -0.771 1.648 1.648
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0598 0.0578 -18.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1786.1 on 1565 degrees of freedom
## Residual deviance: 1786.1 on 1565 degrees of freedom
## AIC: 1788
##
## Number of Fisher Scoring iterations: 4
pn.qsmk <- predict(numer.fit, type = "response")
nhefs.nmv$sw <- ifelse(nhefs.nmv$qsmk == 0, ((1-pn.qsmk)/(1-pd.qsmk)),
(pn.qsmk/pd.qsmk))
summary(nhefs.nmv$sw)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.331 0.867 0.950 0.999 1.079 4.298
msm.sw <- geeglm(wt82_71 ~ qsmk, data=nhefs.nmv, weights=sw, id=seqn,
corstr="independence")
summary(msm.sw)
##
## Call:
## geeglm(formula = wt82_71 ~ qsmk, data = nhefs.nmv, weights = sw,
## id = seqn, corstr = "independence")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 1.780 0.225 62.7 2.3e-15 ***
## qsmk 3.441 0.525 42.9 5.9e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 60.7 3.71
## Number of clusters: 1566 Maximum cluster size: 1
beta <- coef(msm.sw)
SE <- coef(summary(msm.sw))[,2]
lcl <- beta-qnorm(0.975)*SE
ucl <- beta+qnorm(0.975)*SE
cbind(beta, lcl, ucl)
## beta lcl ucl
## (Intercept) 1.78 1.34 2.22
## qsmk 3.44 2.41 4.47
# no association between sex and qsmk in pseudo-population
xtabs(nhefs.nmv$sw ~ nhefs.nmv$sex + nhefs.nmv$qsmk)
## nhefs.nmv$qsmk
## nhefs.nmv$sex 0 1
## 0 567 197
## 1 595 205
如果稳定权重和非稳定权重都能得到同样的结果,我们为什么还要使用稳定权重?这是因为 稳定权重的通常会给出更窄的置信区间。然而,只有当(逆概率加权)模型不是饱和的时候,稳 定权重的优越性才会显现出来。在我们的例子中,E[Y | A]=θ θ1*A是饱和的,因为治疗变量A只有两个可能取值。在大多数情形中(比如时异性或连续性治疗),加权模型就不再是饱和的, 从而我们会更常用稳定结局
如果治疗变量 A 是一个二分变量,那么这个模型就是一个饱和模型。
12.4 边缘结构模型
让我们思考下面这个在治疗取值为 a 时的线性模型 
这个模型和我们之前所讲的模型都不一样,此处模型的因变量是反事实结局,因而我们不可能观 测到。因此,这个模型不能用现实数据拟合。这一模型的因变量是反事实结局的边缘均值,因而 这一模型被称为边缘结构均值模型。
治疗变量经常是多分类或连续的。比如,如果我们的治疗变量 A 是“吸烟频率的变 化”,通过随访时的吸烟频率减去初次访问时的吸烟频率计算得到,那它就是一个连续变量,可 以有许多取值,比如可以是-25,表示某人吸烟频率降低了 25 支/天,也可以是 25,表示某人吸 烟频率增加了 25 支/天。在初次访问时,有 1162 人每天吸烟数少于等于 25 支。假设我们的目标 是估计这 1162 人中不同吸烟频率变化所对应的体重变化,
那我们想估计的是E(ya0)-E(ya1),其中a0和a1可以是任意值。
因为治疗变量 A 可能有许多不同取值,此时饱和模型就变得不切实际。我们需要使用非饱和 模型,并假设治疗 A 和结局 Y 的剂量反应曲线形式。如果我们觉得抛物线能恰当描述这一关系, 那我们的边缘结构模型就可以是:

假设我们想估计吸烟频率提高 20 支/天相比于频率没有变化所对应因果效应均值,即E(Ya20)-E(ya0),根据我们的 模型E(Y20)=β0+β1*20+β2*400
,因此,E(Y20-E(Y0)=β1*20+β2*400
,因此我们需要估计β1和β2,因此我们需要使用逆概率权重SWA去构建一个虚拟人群,其中L不再是混杂变量,然后使用这个虚拟人群中拟合相关性模型E(Y|A)=θ0+θ1*A+θ2*A^2
为了计算权重SWA=f(A)/f(A|L),我们需要先估计f(A|L),对于二分类的A,可以使用逻辑回归来估计他。对于连续变量A,f(A|L)是概率密度函数,遗憾的是,一般情况下很能正确估计一个概率密度函数,因而对连续 性治疗变量使用逆概率加权方法存在风险。
,我们假设概率密度 f (A| L)是一个 正态分布,u=E(A|L),标准差是σ,接下来我们用线性回归来估计不同L取值下 的均值E[A|L]和残差标准差σ,我们假设分子中的密度f (A)也是正态分布的。因为效应估计对 条件密度 f (A| L)的模型选择非常敏感,所以我们对连续变量使用逆概率加权时需要非常小心。
#####################################################
# PROGRAM 12.4
# Estimating the parameters of a marginal structural mean model
# with a continuous treatment Data from NHEFS
#####################################################
# Analysis restricted to subjects reporting <=25 cig/day at baseline
nhefs.nmv.s <- subset(nhefs.nmv, smokeintensity <=25)
# estimation of denominator of ip weights
den.fit.obj <- lm(smkintensity82_71 ~ as.factor(sex) +
as.factor(race) + age + I(age^2) +
as.factor(education) + smokeintensity + I(smokeintensity^2) +
smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) + wt71 +
I(wt71^2), data = nhefs.nmv.s)
p.den <- predict(den.fit.obj, type = "response")
dens.den <- dnorm(nhefs.nmv.s$smkintensity82_71, p.den, summary(den.fit.obj)$sigma) # 逆概率权重的分母
# estimation of numerator of ip weights
num.fit.obj <- lm(smkintensity82_71 ~ 1, data = nhefs.nmv.s) # 逆概率加权的分子
p.num <- predict(num.fit.obj, type = "response")
dens.num <- dnorm(nhefs.nmv.s$smkintensity82_71, p.num, summary(num.fit.obj)$sigma)
nhefs.nmv.s$sw.a = dens.num/dens.den
summary(nhefs.nmv.s$sw.a)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.19 0.89 0.97 1.00 1.05 5.10
msm.sw.cont <- geeglm(wt82_71 ~ smkintensity82_71 + I(smkintensity82_71*smkintensity82_71),
data=nhefs.nmv.s, weights=sw.a, id=seqn, corstr="independence")
summary(msm.sw.cont)
##
## Call:
## geeglm(formula = wt82_71 ~ smkintensity82_71 + I(smkintensity82_71 *
## smkintensity82_71), data = nhefs.nmv.s, weights = sw.a, id = seqn,
## corstr = "independence")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 2.00452 0.29512 46.13 1.1e-11 ***
## smkintensity82_71 -0.10899 0.03154 11.94 0.00055 ***
## I(smkintensity82_71 * smkintensity82_71) 0.00269 0.00242 1.24 0.26489
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 60.5 4.5
## Number of clusters: 1162 Maximum cluster size: 1
beta <- coef(msm.sw.cont)
SE <- coef(summary(msm.sw.cont))[,2]
lcl <- beta-qnorm(0.975)*SE
ucl <- beta+qnorm(0.975)*SE
cbind(beta, lcl, ucl)
## beta lcl ucl
## (Intercept) 2.00452 1.42610 2.58295
## smkintensity82_71 -0.10899 -0.17080 -0.04718
## I(smkintensity82_71 * smkintensity82_71) 0.00269 -0.00204 0.00743
边缘结构模型的治疗变量也可以是一个二分变量。比如,我们想估计戒烟 A (1:是,0: 否)对死亡(1:是,0:否)的因果效应,我们可以用以下边缘结构 logistic 模型:

其中 exp (α1 ) 是戒烟者比上非戒烟者的死亡因果性比值比。在我们之前所述假设成立的情况下,
我们可以在逆概率权重构建的虚拟人群拟合logistic模型logitPr(D=1|A)=θ0+θ1*A 估计上述边缘结构logistic模型中的参数。
#####################################################
# PROGRAM 12.5
# Estimating the parameters of a marginal structural logistic model
# Data from NHEFS
#####################################################
table(nhefs.nmv$qsmk, nhefs.nmv$death) # qsmk表示是否是是治疗组 # death 表示是是否死亡
##
## 0 1
## 0 963 200
## 1 312 91
# First, estimation of stabilized weights sw.a (same as in PROGRAM 12.3)
# Second, fit logistic model below # nhefs.nmv.s/nhefs.nmv
msm.logistic <- geeglm(death ~ qsmk, data=nhefs.nmv.s, weights=sw.a, # sw /sw.a
id=seqn, family=binomial(), corstr="independence")
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(msm.logistic)
##
## Call:
## geeglm(formula = death ~ qsmk, family = binomial(), data = nhefs.nmv.s,
## weights = sw.a, id = seqn, corstr = "independence")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -1.5026 0.0981 234.80 <2e-16 ***
## qsmk 0.0369 0.1736 0.05 0.83
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 1 0.082
## Number of clusters: 1162 Maximum cluster size: 1
beta <- coef(msm.logistic)
SE <- coef(summary(msm.logistic))[,2]
lcl <- beta-qnorm(0.975)*SE
ucl <- beta+qnorm(0.975)*SE
cbind(beta, lcl, ucl)
## beta lcl ucl
## (Intercept) -1.5026 -1.695 -1.310
## qsmk 0.0369 -0.303 0.377
12.5 效应修饰与边缘结构模型
当我们想估计的参数是目标人群中的因果效应均值时,边缘结构模型不会包含其他协变量。 然而,当效应修饰存在时,我们可以在边缘结构模型中放入其他协变量,即使这些协变量不是混 杂变量。假设戒烟的效应对不同性别V (1:女性,0:男性)不一样。 为了验证这一假设,我们 需要在边缘结构均值模型中加入协变量V :
如果β2不等于0,那么存在加法效应修饰。严格而言,这不再是边缘模型,因为存在其他协变量V, 然而我们依然称之为边缘结构模型。
我们可以在逆概率权重构建的虚拟人群中拟合线性模型```E(Y|A,V)=θ0+θ1*A+θ2*A*V+θ3*V```
从而估计上述边缘模型中的参数值。此时向量 L 需要包含V ——即使V 不是混杂变量——以及其 他可能影响互换性的变量。
因为我们现在要考虑的是V 的每一分层中治疗的因果效应,所以我们稳定权重的分子可以是 f (A)或f (A|V)。如果分母是f(A|V),置信区间会更加窄一些。
我们可以这样理解:因为分子分母中都含有V ,所以分子和分母的值更加接近,从而使得
逆概率权重更加稳定,因而 95%置信区间更窄。
此时,在估计权重分子的时候,我们会在 logistic模型中加入协变量V 。
V 是 L 的一个子集,选择什么样的变量作为V 需要反映出研究者真切关心的研究问题。比如,只有当研究者确信V 是一个效应修饰因子,并且对V 的每一分层中的因果效应非常感兴趣时,V 才能被放入边缘结构模型之中。如果研究者决定把 L 中的所有变量全部放入边 缘结构模型之中,那逆概率加权就不再是必须的,因为此时正确设定的未加权回归模型也能完全 控制 L 带来的混杂出于这一原因,我们将包含了所有变量 L 的边缘结构模型 称为仿制的边缘结构模型。
需要注意的是:如果我们要探索两个治疗变量 A 和 B 之间的交互作用,我们需要把 A 和 B 都放入边缘结构 模型之中,在计算逆概率权重时,分母中会包含这两个变量的联合分布。我们会对 A 和 B 都假设 互换性、正数性和一致性成立。
```r
#####################################################
# PROGRAM 12.6
# Assessing effect modification by sex using a marginal structural mean model
# Data from NHEFS
#####################################################
table(nhefs.nmv$sex)
##
## 0 1
## 762 804
# estimation of denominator of ip weights
denom.fit <- glm(qsmk ~ as.factor(sex) + as.factor(race) + age + I(age^2) +
as.factor(education) + smokeintensity +
I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) +
as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2),
family = binomial(), data = nhefs.nmv) # 计算权重的分母
summary(denom.fit)
##
## Call:
## glm(formula = qsmk ~ as.factor(sex) + as.factor(race) + age +
## I(age^2) + as.factor(education) + smokeintensity + I(smokeintensity^2) +
## smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) +
## wt71 + I(wt71^2), family = binomial(), data = nhefs.nmv)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.513 -0.791 -0.639 0.983 2.373
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.242519 1.380836 -1.62 0.10437
## as.factor(sex)1 -0.527478 0.154050 -3.42 0.00062 ***
## as.factor(race)1 -0.839264 0.210067 -4.00 6.5e-05 ***
## age 0.121205 0.051266 2.36 0.01807 *
## I(age^2) -0.000825 0.000536 -1.54 0.12404
## as.factor(education)2 -0.028776 0.198351 -0.15 0.88465
## as.factor(education)3 0.086432 0.178085 0.49 0.62744
## as.factor(education)4 0.063601 0.273211 0.23 0.81592
## as.factor(education)5 0.475961 0.226224 2.10 0.03538 *
## smokeintensity -0.077270 0.015250 -5.07 4.0e-07 ***
## I(smokeintensity^2) 0.001045 0.000287 3.65 0.00027 ***
## smokeyrs -0.073597 0.027777 -2.65 0.00806 **
## I(smokeyrs^2) 0.000844 0.000463 1.82 0.06840 .
## as.factor(exercise)1 0.354841 0.180135 1.97 0.04885 *
## as.factor(exercise)2 0.395704 0.187240 2.11 0.03457 *
## as.factor(active)1 0.031944 0.132937 0.24 0.81010
## as.factor(active)2 0.176784 0.214972 0.82 0.41087
## wt71 -0.015236 0.026316 -0.58 0.56262
## I(wt71^2) 0.000135 0.000163 0.83 0.40737
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1786.1 on 1565 degrees of freedom
## Residual deviance: 1676.9 on 1547 degrees of freedom
## AIC: 1715
##
## Number of Fisher Scoring iterations: 4
pd.qsmk <- predict(denom.fit, type = "response")
# estimation of numerator of ip weights
numer.fit <- glm(qsmk~as.factor(sex), family = binomial(), data = nhefs.nmv) # 计算权重的分母
summary(numer.fit)
##
## Call:
## glm(formula = qsmk ~ as.factor(sex), family = binomial(), data = nhefs.nmv)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.825 -0.825 -0.719 1.576 1.720
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9016 0.0799 -11.28 <2e-16 ***
## as.factor(sex)1 -0.3202 0.1160 -2.76 0.0058 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1786.1 on 1565 degrees of freedom
## Residual deviance: 1778.4 on 1564 degrees of freedom
## AIC: 1782
##
## Number of Fisher Scoring iterations: 4
pn.qsmk <- predict(numer.fit, type = "response")
nhefs.nmv$sw.a <- ifelse(nhefs.nmv$qsmk == 0, ((1-pn.qsmk)/(1-pd.qsmk)),
(pn.qsmk/pd.qsmk))
summary(nhefs.nmv$sw.a)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.29 0.88 0.96 1.00 1.08 3.80
sd(nhefs.nmv$sw.a)
## [1] 0.271
# Estimating parameters of a marginal structural mean model
msm.emm <- geeglm(wt82_71~as.factor(qsmk) + as.factor(sex)
+ as.factor(qsmk):as.factor(sex), data=nhefs.nmv,
weights=sw.a, id=seqn, corstr="independence")
summary(msm.emm)
##
## Call:
## geeglm(formula = wt82_71 ~ as.factor(qsmk) + as.factor(sex) +
## as.factor(qsmk):as.factor(sex), data = nhefs.nmv, weights = sw.a,
## id = seqn, corstr = "independence")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 1.78445 0.30984 33.17 8.5e-09 ***
## as.factor(qsmk)1 3.52198 0.65707 28.73 8.3e-08 ***
## as.factor(sex)1 -0.00872 0.44882 0.00 0.98
## as.factor(qsmk)1:as.factor(sex)1 -0.15948 1.04608 0.02 0.88
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 60.8 3.71
## Number of clusters: 1566 Maximum cluster size: 1
beta <- coef(msm.emm)
SE <- coef(summary(msm.emm))[,2]
lcl <- beta-qnorm(0.975)*SE
ucl <- beta+qnorm(0.975)*SE
cbind(beta, lcl, ucl)
## beta lcl ucl
## (Intercept) 1.78445 1.177 2.392
## as.factor(qsmk)1 3.52198 2.234 4.810
## as.factor(sex)1 -0.00872 -0.888 0.871
## as.factor(qsmk)1:as.factor(sex)1 -0.15948 -2.210 1.891
12.6 删失和缺失值
当我们估计戒烟 A 对增重 Y 的因果效应时,我们将分析局限于 1566 名有体重变化信息的人群 中。然而,还有 63(上面的代码中我们删除了这一部分数据) 名人员虽然符合入组标准,但是因为在随访中没有体重信息而被排除于分析之 外。只选择没有结局缺失值的人——也即删失有缺失值的人——可能会引入选择偏移.
让我们用删失变量 C 表示在随访中的体重测量:1 表示没有测量(即被删失),0 表示有测 量(即没有被删失)。显然,我们的分析只能在没有被删失的人群中进行,即 C = 0 的人群中。
也就是说,我们在12.2小节和12.4小节中的拟合的(加权)回归模型不是
因为失访导致的删失可能会引入选择偏移,所以我们希望不存在删失,并估计不存在删失时
的因果效应。
主要关注如何计算Wac和SWac
```r
#####################################################
# PROGRAM 12.7
# Estimating IP weights to adjust for selection bias due to censoring
# Data from NHEFS
#####################################################
table(nhefs$qsmk, nhefs$cens)
##
## 0 1
## 0 1163 38
## 1 403 25
summary(nhefs[which(nhefs$cens==0),]$wt71)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 39.6 59.5 69.2 70.8 79.8 151.7
summary(nhefs[which(nhefs$cens==1),]$wt71)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 36.2 63.1 72.1 76.6 87.9 169.2
# estimation of denominator of ip weights for A
denom.fit <- glm(qsmk ~ as.factor(sex) + as.factor(race) + age + I(age^2) +
as.factor(education) + smokeintensity +
I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) +
as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2),
family = binomial(), data = nhefs)
summary(denom.fit)
##
## Call:
## glm(formula = qsmk ~ as.factor(sex) + as.factor(race) + age +
## I(age^2) + as.factor(education) + smokeintensity + I(smokeintensity^2) +
## smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) +
## wt71 + I(wt71^2), family = binomial(), data = nhefs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.465 -0.804 -0.646 1.058 2.355
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.988902 1.241279 -1.60 0.10909
## as.factor(sex)1 -0.507522 0.148232 -3.42 0.00062 ***
## as.factor(race)1 -0.850231 0.205872 -4.13 3.6e-05 ***
## age 0.103013 0.048900 2.11 0.03515 *
## I(age^2) -0.000605 0.000507 -1.19 0.23297
## as.factor(education)2 -0.098320 0.190655 -0.52 0.60607
## as.factor(education)3 0.015699 0.170714 0.09 0.92673
## as.factor(education)4 -0.042526 0.264276 -0.16 0.87216
## as.factor(education)5 0.379663 0.220395 1.72 0.08495 .
## smokeintensity -0.065156 0.014759 -4.41 1.0e-05 ***
## I(smokeintensity^2) 0.000846 0.000276 3.07 0.00216 **
## smokeyrs -0.073371 0.026996 -2.72 0.00657 **
## I(smokeyrs^2) 0.000838 0.000443 1.89 0.05867 .
## as.factor(exercise)1 0.291412 0.173554 1.68 0.09314 .
## as.factor(exercise)2 0.355052 0.179929 1.97 0.04846 *
## as.factor(active)1 0.010875 0.129832 0.08 0.93324
## as.factor(active)2 0.068312 0.208727 0.33 0.74346
## wt71 -0.012848 0.022283 -0.58 0.56423
## I(wt71^2) 0.000121 0.000135 0.89 0.37096
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1876.3 on 1628 degrees of freedom
## Residual deviance: 1766.7 on 1610 degrees of freedom
## AIC: 1805
##
## Number of Fisher Scoring iterations: 4
pd.qsmk <- predict(denom.fit, type = "response")
# estimation of numerator of ip weights for A
numer.fit <- glm(qsmk~1, family = binomial(), data = nhefs)
summary(numer.fit)
##
## Call:
## glm(formula = qsmk ~ 1, family = binomial(), data = nhefs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.781 -0.781 -0.781 1.635 1.635
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0318 0.0563 -18.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1876.3 on 1628 degrees of freedom
## Residual deviance: 1876.3 on 1628 degrees of freedom
## AIC: 1878
##
## Number of Fisher Scoring iterations: 4
pn.qsmk <- predict(numer.fit, type = "response")
# estimation of denominator of ip weights for C
denom.cens <- glm(cens ~ as.factor(qsmk) + as.factor(sex) +
as.factor(race) + age + I(age^2) +
as.factor(education) + smokeintensity +
I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) +
as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2),
family = binomial(), data = nhefs)
summary(denom.cens)
##
## Call:
## glm(formula = cens ~ as.factor(qsmk) + as.factor(sex) + as.factor(race) +
## age + I(age^2) + as.factor(education) + smokeintensity +
## I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) + as.factor(exercise) +
## as.factor(active) + wt71 + I(wt71^2), family = binomial(),
## data = nhefs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.097 -0.287 -0.207 -0.157 2.996
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.014466 2.576106 1.56 0.1192
## as.factor(qsmk)1 0.516867 0.287716 1.80 0.0724 .
## as.factor(sex)1 0.057313 0.330278 0.17 0.8622
## as.factor(race)1 -0.012271 0.452489 -0.03 0.9784
## age -0.269729 0.117465 -2.30 0.0217 *
## I(age^2) 0.002884 0.001114 2.59 0.0096 **
## as.factor(education)2 -0.440788 0.419399 -1.05 0.2933
## as.factor(education)3 -0.164688 0.370547 -0.44 0.6567
## as.factor(education)4 0.138447 0.569797 0.24 0.8080
## as.factor(education)5 -0.382382 0.560181 -0.68 0.4949
## smokeintensity 0.015712 0.034732 0.45 0.6510
## I(smokeintensity^2) -0.000113 0.000606 -0.19 0.8517
## smokeyrs 0.078597 0.074958 1.05 0.2944
## I(smokeyrs^2) -0.000557 0.001032 -0.54 0.5894
## as.factor(exercise)1 -0.971471 0.387810 -2.51 0.0122 *
## as.factor(exercise)2 -0.583989 0.372313 -1.57 0.1168
## as.factor(active)1 -0.247479 0.325455 -0.76 0.4470
## as.factor(active)2 0.706583 0.396458 1.78 0.0747 .
## wt71 -0.087887 0.040012 -2.20 0.0281 *
## I(wt71^2) 0.000635 0.000226 2.81 0.0049 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 533.36 on 1628 degrees of freedom
## Residual deviance: 465.36 on 1609 degrees of freedom
## AIC: 505.4
##
## Number of Fisher Scoring iterations: 7
pd.cens <- 1-predict(denom.cens, type = "response")
# estimation of numerator of ip weights for C
numer.cens <- glm(cens~as.factor(qsmk), family = binomial(), data = nhefs)
summary(numer.cens)
##
## Call:
## glm(formula = cens ~ as.factor(qsmk), family = binomial(), data = nhefs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.347 -0.254 -0.254 -0.254 2.628
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.421 0.165 -20.75 <2e-16 ***
## as.factor(qsmk)1 0.641 0.264 2.43 0.015 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 533.36 on 1628 degrees of freedom
## Residual deviance: 527.76 on 1627 degrees of freedom
## AIC: 531.8
##
## Number of Fisher Scoring iterations: 6
pn.cens <- 1-predict(numer.cens, type = "response")
nhefs$sw.a <- ifelse(nhefs$qsmk == 0, ((1-pn.qsmk)/(1-pd.qsmk)),
(pn.qsmk/pd.qsmk))
nhefs$sw.c <- pn.cens/pd.cens
nhefs$sw <- nhefs$sw.c*nhefs$sw.a
summary(nhefs$sw.a)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.33 0.86 0.95 1.00 1.08 4.21
sd(nhefs$sw.a)
## [1] 0.284
summary(nhefs$sw.c)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.94 0.98 0.99 1.01 1.01 7.58
sd(nhefs$sw.c)
## [1] 0.178
summary(nhefs$sw)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.35 0.86 0.94 1.01 1.08 12.86
sd(nhefs$sw)
## [1] 0.411
msm.sw <- geeglm(wt82_71~qsmk, data=nhefs,
weights=sw, id=seqn, corstr="independence")
summary(msm.sw)
##
## Call:
## geeglm(formula = wt82_71 ~ qsmk, data = nhefs, weights = sw,
## id = seqn, corstr = "independence")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 1.662 0.233 51.0 9.3e-13 ***
## qsmk 3.496 0.526 44.2 2.9e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 61.8 3.83
## Number of clusters: 1566 Maximum cluster size: 1
beta <- coef(msm.sw)
SE <- coef(summary(msm.sw))[,2]
lcl <- beta-qnorm(0.975)*SE
ucl <- beta+qnorm(0.975)*SE
cbind(beta, lcl, ucl)
## beta lcl ucl
## (Intercept) 1.66 1.21 2.12
## qsmk 3.50 2.47 4.53
13标准化和参数G-公式
在实践中,研究者可以在逆概率加权和标准化两种方法中选择一种用来估计因果效应。这两 种方法都基于同样的可识别性条件,然而它们的模型假设却不一样。
在上一章,我们用逆概率加权估计了戒烟 A (1:是,0:否)对增重 Y 的(单位:kg)因果 效应。本章我们会估计同样的因果效应,不过将用另一种方法,即标准化。我们的分析人群同样 是 NHFES 中的 1629 名参与者,他们在 25-74 岁之间,参加了初访,并在 10 年之后参加了随访。 在这些人当中,1566 名人员在初访和随访中都有体重测量,因而没有被删失( C = 0 )。
13.1 标准化
标准化均值是每一分层中条件均值的加权平均,而权重是每一分层的出现概率 Pr [L = l ]。也 就是说,在计算标准化均值的时候,人数最多的分层占有的权重最多。
我们可以把治疗为 a 且未被删失的标准化均值表述为:

如果L是连续的,我们需要把Pr[L = l]替换为概率密度函数 fL (l),同时把求和符合替换为积分 符号。
13.2 通过模型估计结局均值
理想情况下,我们可以用非参数的方法估计条件均值E[Y | A=1,C 0,L= l].
但在高维数据中——比如我们戒烟的例子——我们不可能用非参数的方法估计 E[Y | A=1,C 0,L= l]。在我=们的例子中,L有上百万个分层,而我们只有403名戒烟者。因 此我们需要借助模型。
为了得到上百万个L分层中E[Y|A=a,C 0,L= l]的参数=估计值,我们需要拟合一个线性 模型,其中的因变量是增重,模型中会包含所有混杂变量。对连续性变量,诸如年龄、初访时体 重、烟龄和吸烟频率,模型中会放入它们的一次项和二次项。也就是说,我们假设条件均值和这 些连续性协变量的关系可以用抛物线表示。我们在模型中也包括了戒烟和吸烟频率的乘积项。也 就是说,我们认为戒烟的效果会因吸烟频率的不同而不同,并且这一关系是线性的,同时戒烟的 效果独立于其他变量。
################################################################
# PROGRAM 13.1
# Estimating the mean outcome within levels of treatment
# and confounders: Data from NHEFS
################################################################
#install.packages("readxl") # install package if required
library("readxl")
nhefs <- read.csv("/Users/milin/Downloads/nhefs.csv")
# some preprocessing of the data
nhefs$cens <- ifelse(is.na(nhefs$wt82), 1, 0)
fit <- glm(wt82_71 ~ qsmk + sex + race + age + I(age*age) + as.factor(education)
+ smokeintensity + I(smokeintensity*smokeintensity) + smokeyrs
+ I(smokeyrs*smokeyrs) + as.factor(exercise) + as.factor(active)
+ wt71 + I(wt71*wt71) + qsmk*smokeintensity, data=nhefs)
summary(fit)
##
## Call:
## glm(formula = wt82_71 ~ qsmk + sex + race + age + I(age * age) +
## as.factor(education) + smokeintensity + I(smokeintensity *
## smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) +
## as.factor(active) + wt71 + I(wt71 * wt71) + qsmk * smokeintensity,
## data = nhefs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -42.06 -4.17 -0.34 3.89 44.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.588166 4.313036 -0.37 0.71276
## qsmk 2.559594 0.809149 3.16 0.00159 **
## sex -1.430272 0.468958 -3.05 0.00233 **
## race 0.560110 0.581889 0.96 0.33591
## age 0.359635 0.163319 2.20 0.02781 *
## I(age * age) -0.006101 0.001726 -3.53 0.00042 ***
## as.factor(education)2 0.790444 0.607000 1.30 0.19304
## as.factor(education)3 0.556312 0.556102 1.00 0.31728
## as.factor(education)4 1.491570 0.832270 1.79 0.07330 .
## as.factor(education)5 -0.194977 0.741369 -0.26 0.79259
## smokeintensity 0.049136 0.051725 0.95 0.34229
## I(smokeintensity * smokeintensity) -0.000991 0.000938 -1.06 0.29110
## smokeyrs 0.134369 0.091712 1.47 0.14309
## I(smokeyrs * smokeyrs) -0.001866 0.001544 -1.21 0.22683
## as.factor(exercise)1 0.295975 0.535153 0.55 0.58030
## as.factor(exercise)2 0.353913 0.558859 0.63 0.52665
## as.factor(active)1 -0.947569 0.409934 -2.31 0.02094 *
## as.factor(active)2 -0.261378 0.684558 -0.38 0.70265
## wt71 0.045502 0.083371 0.55 0.58530
## I(wt71 * wt71) -0.000965 0.000525 -1.84 0.06600 .
## qsmk:smokeintensity 0.046663 0.035145 1.33 0.18446
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 53.6)
##
## Null deviance: 97176 on 1565 degrees of freedom
## Residual deviance: 82763 on 1545 degrees of freedom
## (63 observations deleted due to missingness)
## AIC: 10701
##
## Number of Fisher Scoring iterations: 2
nhefs$predicted.meanY <- predict(fit, nhefs) # 得到了条件均值E[Y | A=1,C 0,L= l]的结果
nhefs[which(nhefs$seqn==24770), c("predicted.meanY", "qsmk", "sex",
"race", "age", "education",
"smokeintensity", "smokeyrs", "exercise",
"active", "wt71")]
## predicted.meanY qsmk sex race age education smokeintensity smokeyrs
## 1582 0.342 0 0 0 26 4 15 12
## exercise active wt71
## 1582 1 0 112
summary(nhefs$predicted.meanY[nhefs$cens==0])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -10.88 1.12 3.04 2.64 4.51 9.88
summary(nhefs$wt82_71[nhefs$cens==0])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -41.3 -1.5 2.6 2.6 6.7 48.5
有了这些假设后,我们得到A和L所有可能组合之下的E[Y|A=a,C 0,L= l],同时也能。
得到数据中 403 名戒烟者和 1163 名非戒烟者的预测值。比如,如果某参与人员的信息是:非戒烟 者,男性,白人,26 岁,大学肄业,12 年烟龄,吸烟频率 15 支/天,适当运动,体力良好,初访 时体重 112kg,那么我们就能得到他的增重预测值是 0.34kg(碰巧编号为 24770 的参与者满足以 上条件,你可以在看看他的预测值是多少)
。总而言之,在这 1566 名参与人员中,增重预测均值 是 2.6kg,这与我们观测到的增重值相同,而取值范围是-41.3 到 48.5kg。
我们的目标是估计A=a时的,∑E[Y|A=a,C=0,L =l]*Pr[L l]。里的求和符合在更 l 正式的情况下应该被写作积分符号,因为 L 中的变量基本上是连续的,不能用简单的概率表示。不过无论用什么符号表示,我们现在已经估计了 A 和 L 所有组合下的 E [Y | A = a, C =0, L= l ]。
下一步我们需要把这些均值根据 L 中的取值 l 进行标准化。
13.3 根据混杂变量的分布对结局均值进行标准化
标准化均值是各条件均值E[Y|A=a,C 0,L= l]的加权=平均。当L中的所有变量都是离散 的时候,条件均值对应的权重就是 L = l 的人数比例,即 Pr [L = l ]。原则上,我们可以非参数地 计算这一概率,只需用 L = l 的人数除以总人数即可,我们在 2.3 小节中就是这么做的。然而,在 高维数据中,我们需要用到其他方法。
幸运的是,我们其实并不需要去估计 Pr [L = l ]。我们只需要估计每个个体 i 在 L = l 时的条件,均值E[Y|A=a,C 0,L= l]即可,然后再计算平均值:

其中n是研究的总参与人数。这是因为

也可以被写作双重期望: 
接下来我们将讲述如何在治疗组( A = 1 )和非治疗组( A = 0 )中估计 
且不需要计算Pr[L=l].
这个方法有四步:扩充数据,结局回归,预测,以 及标准化。
################################################################
# PROGRAM 13.2
# Standardizing the mean outcome to the baseline confounders
# Data from Table 2.2
################################################################
id <- c("Rheia", "Kronos", "Demeter", "Hades", "Hestia", "Poseidon",
"Hera", "Zeus", "Artemis", "Apollo", "Leto", "Ares", "Athena",
"Hephaestus", "Aphrodite", "Cyclope", "Persephone", "Hermes",
"Hebe", "Dionysus")
N <- length(id)
L <- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
A <- c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1)
Y <- c(0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0)
interv <- rep(-1, N)
observed <- cbind(L, A, Y, interv) # di一组
untreated <- cbind(L, rep(0, N), rep(NA, N), rep(0, N)) # 第二组,非治疗组,A=0
treated <- cbind(L, rep(1, N), rep(NA, N), rep(1, N))# 第三组,治疗组 ,A=1
table22 <- as.data.frame(rbind(observed, untreated, treated)) # 按照行合并数据集
table22$id <- rep(id, 3) # 设置ID
# 这属于第一步,扩充数据
# 第二部,用这个新数据集去拟合结局均值和治疗 A 以及混杂变量 L 的回归模型。注意到,其实只有第一个板块中的数据(也即实 际数据)会被用来拟合模型,这是因为第二和第三个板块中的结局数据都是缺失值。
glm.obj <- glm(Y~ A*L, data = table22)
summary(glm.obj)
##
## Call:
## glm(formula = Y ~ A * L, data = table22)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6667 -0.2500 0.0417 0.3333 0.7500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.50e-01 2.55e-01 0.98 0.34
## A -4.16e-16 3.61e-01 0.00 1.00
## L 4.17e-01 3.90e-01 1.07 0.30
## A:L 3.24e-16 4.96e-01 0.00 1.00
##
## (Dispersion parameter for gaussian family taken to be 0.26)
##
## Null deviance: 5.0000 on 19 degrees of freedom
## Residual deviance: 4.1667 on 16 degrees of freedom
## (40 observations deleted due to missingness)
## AIC: 35.39
##
## Number of Fisher Scoring iterations: 2
# 第三部,下一步我们需要用第一个板块拟合的模型去预测第二个和第三个板块的结局。(也就是,我 们将第二和第三个板块中的 L 与 A 和回归系数结合起来,从而得到结局 Y 的预测值。)
table22$predicted.meanY <- predict(glm.obj, table22)
# 第四步 , 我们将会计算第二个板块中的预测值均值。
# 因为 60%的人是 L = 1 ,40%的是 L = 0 ,所 以 L = 1 的均值就会有更多权重。也就是说,第二个板块中的预测值均值就是非治疗组的标准化结 局均值。同理,第三个板块中的预测值均值是治疗组的标准化结局均值。这样一来,我们的计算 就结束了。
mean(table22$predicted.meanY[table22$interv==-1])
## [1] 0.5
mean(table22$predicted.meanY[table22$interv==0])
## [1] 0.5
mean(table22$predicted.meanY[table22$interv==1])
## [1] 0.5
################################################################
在实践中,用经验分布经计算标准化均值是一个可行的办法,尤其在 L 是高维向量的情况 下。我们戒烟例子中所用的方法步骤和前几段所描述的大致相同。我们需要在原数据中添加第二 和第三个板块,拟合E[Y|A=a,C 0,L= l]的模型=,然后生成预测值。第二个板块中预测值的均值是1.66,第三个板块的值是5.18 ,因果效应E(Ya1,c0)-E(Ya0,c0) 就是3.5。
# PROGRAM 13.3
# Standardizing the mean outcome to the baseline confounders:
# Data from NHEFS
################################################################
# create a dataset with 3 copies of each subject
nhefs$interv <- -1 # 1st copy: equal to original one
# 用这个变量表示分组
interv0 <- nhefs # 2nd copy: treatment set to 0, outcome to missing
interv0$interv <- 0 # 表示第二组
interv0$qsmk <- 0 # 设置A为0
interv0$wt82_71 <- NA
interv1 <- nhefs # 3rd copy: treatment set to 1, outcome to missing
interv1$interv <- 1 # 表示第三组
interv1$qsmk <- 1 # 表示A为1
interv1$wt82_71 <- NA #
onesample <- rbind(nhefs, interv0, interv1) # combining datasets # 合并数据集
# linear model to estimate mean outcome conditional on treatment and confounders
# parameters are estimated using original observations only (nhefs)
# parameter estimates are used to predict mean outcome for observations with
# treatment set to 0 (interv=0) and to 1 (interv=1)
std <- glm(wt82_71 ~ qsmk + sex + race + age + I(age*age)
+ as.factor(education) + smokeintensity
+ I(smokeintensity*smokeintensity) + smokeyrs
+ I(smokeyrs*smokeyrs) + as.factor(exercise)
+ as.factor(active) + wt71 + I(wt71*wt71) + I(qsmk*smokeintensity),
data=onesample)# 构建模型
summary(std)
##
## Call:
## glm(formula = wt82_71 ~ qsmk + sex + race + age + I(age * age) +
## as.factor(education) + smokeintensity + I(smokeintensity *
## smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) +
## as.factor(active) + wt71 + I(wt71 * wt71) + I(qsmk * smokeintensity),
## data = onesample)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -42.06 -4.17 -0.34 3.89 44.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.588166 4.313036 -0.37 0.71276
## qsmk 2.559594 0.809149 3.16 0.00159 **
## sex -1.430272 0.468958 -3.05 0.00233 **
## race 0.560110 0.581889 0.96 0.33591
## age 0.359635 0.163319 2.20 0.02781 *
## I(age * age) -0.006101 0.001726 -3.53 0.00042 ***
## as.factor(education)2 0.790444 0.607000 1.30 0.19304
## as.factor(education)3 0.556312 0.556102 1.00 0.31728
## as.factor(education)4 1.491570 0.832270 1.79 0.07330 .
## as.factor(education)5 -0.194977 0.741369 -0.26 0.79259
## smokeintensity 0.049136 0.051725 0.95 0.34229
## I(smokeintensity * smokeintensity) -0.000991 0.000938 -1.06 0.29110
## smokeyrs 0.134369 0.091712 1.47 0.14309
## I(smokeyrs * smokeyrs) -0.001866 0.001544 -1.21 0.22683
## as.factor(exercise)1 0.295975 0.535153 0.55 0.58030
## as.factor(exercise)2 0.353913 0.558859 0.63 0.52665
## as.factor(active)1 -0.947569 0.409934 -2.31 0.02094 *
## as.factor(active)2 -0.261378 0.684558 -0.38 0.70265
## wt71 0.045502 0.083371 0.55 0.58530
## I(wt71 * wt71) -0.000965 0.000525 -1.84 0.06600 .
## I(qsmk * smokeintensity) 0.046663 0.035145 1.33 0.18446
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 53.6)
##
## Null deviance: 97176 on 1565 degrees of freedom
## Residual deviance: 82763 on 1545 degrees of freedom
## (3321 observations deleted due to missingness)
## AIC: 10701
##
## Number of Fisher Scoring iterations: 2
onesample$predicted_meanY <- predict(std, onesample) # 进行预测
# estimate mean outcome in each of the groups interv=0, and interv=1
# this mean outcome is a weighted average of the mean outcomes in each combination
# of values of treatment and confounders, that is, the standardized outcome
mean(onesample[which(onesample$interv==-1),]$predicted_meanY) # 计算
## [1] 2.56
mean(onesample[which(onesample$interv==0),]$predicted_meanY)
## [1] 1.66
mean(onesample[which(onesample$interv==1),]$predicted_meanY)
## [1] 5.18
我们需要自助法计算置信区间,置信区间为(2.6,4.5)
################################################################
# PROGRAM 13.4
# Computing the 95% confidence interval of the standardized means
# and their difference: Data from NHEFS
################################################################
#install.packages("boot") # install package if required
library(boot) # 之后好好学学这个R包
# function to calculate difference in means
# indices -i表示按行,frequencies-f表示按照频率,weights-w 表示权重 ,这个参数主要设置如何抽样
standardization <- function(data, indices) {
# create a dataset with 3 copies of each subject
d <- data[indices,] # 1st copy: equal to original one,
d$interv <- -1
d0 <- d # 2nd copy: treatment set to 0, outcome to missing
d0$interv <- 0
d0$qsmk <- 0
d0$wt82_71 <- NA
d1 <- d # 3rd copy: treatment set to 1, outcome to missing
d1$interv <- 1
d1$qsmk <- 1
d1$wt82_71 <- NA
d.onesample <- rbind(d, d0, d1) # combining datasets
# linear model to estimate mean outcome conditional on treatment and confounders
# parameters are estimated using original observations only (interv= -1)
# parameter estimates are used to predict mean outcome for observations with set
# treatment (interv=0 and interv=1)
fit <- glm(wt82_71 ~ qsmk + sex + race + age + I(age*age) +
as.factor(education) + smokeintensity +
I(smokeintensity*smokeintensity) + smokeyrs + I(smokeyrs*smokeyrs) +
as.factor(exercise) + as.factor(active) + wt71 + I(wt71*wt71),
data=d.onesample)
d.onesample$predicted_meanY <- predict(fit, d.onesample)
# estimate mean outcome in each of the groups interv=-1, interv=0, and interv=1
return(c(mean(d.onesample$predicted_meanY[d.onesample$interv==-1]),
mean(d.onesample$predicted_meanY[d.onesample$interv==0]),
mean(d.onesample$predicted_meanY[d.onesample$interv==1]),
mean(d.onesample$predicted_meanY[d.onesample$interv==1])-
mean(d.onesample$predicted_meanY[d.onesample$interv==0])))
}
# bootstrap
results <- boot(data=nhefs, statistic=standardization, R=5)
# generating confidence intervals
se <- c(sd(results$t[,1]), sd(results$t[,2]),
sd(results$t[,3]), sd(results$t[,4]))
mean <- results$t0 # t0 是什么
ll <- mean - qnorm(0.975)*se
ul <- mean + qnorm(0.975)*se
bootstrap <- data.frame(cbind(c("Observed", "No Treatment", "Treatment",
"Treatment - No Treatment"), mean, se, ll, ul))
bootstrap
## V1 mean se ll
## 1 Observed 2.56188497106103 0.249348302062059 2.07317127941318
## 2 No Treatment 1.65212306626746 0.166328637079571 1.32612492799386
## 3 Treatment 5.11474489549346 0.512935487268879 4.10940981405396
## 4 Treatment - No Treatment 3.46262182922601 0.418650414071998 2.64208209553211
## ul
## 1 3.05059866270888
## 2 1.97812120454105
## 3 6.12007997693297
## 4 4.2831615629199
13.4 逆概率加权or 标准化
在进行因果推理的时候,可以尝试使用多种方法进行分析。逆概率加权均值和标准化 均值,只有在不使用模型的时候,才是相等的,否则不一定相等。
在实践中,或多或少的模型错误设定总是不可避免,而错误设定会带来偏移。治疗模型(逆 概率加权)和结局模型(标准化)的错误设定给效应估计带来的偏移影响不尽相同。因此,逆概 率加权得到的效应估计会和标准化得到的效应估计有所区别。如果逆概率加权和标准化两种方法得到结果差距很大,会让我们警觉至少其中一种方法存在严重的模型错误设定。如果两种方法得 到的结果差距不大,虽然不能完全排除存在模型错误设定,但可以侧面说明模型错误设定带来的影响不是很严重——这是因为两种方法的模型都错误设定且导致的偏移影响差不多,这一概率实 在太低了。
。如果我们只对人群的一个子 集感兴趣,那么我们应该将计算局限在这个子集当中。比如,如果我们对性别的修饰效应感兴 趣,我们需要在女性和男性中分别计算标准化均值。逆概率加权和标准化两种方法都能用来计算 人群子集的因果效应。
13.5 如何看待估计值
本章和上一章的分析只是我们在现实数据中的初步尝试。在逆概率加权和标准化方法中,我 们得到同样的结果:如果每个参与人员都戒烟,将会平均增重 5.2kg,而如果每个参与人员都不戒烟,将会平均增重 1.7kg。两种方法得到的戒烟因果效应均值都是 3.5kg(95%置信区间:2.5 到 4.5)。
不同方法得到同样的结果,这让我们的结果更加可信,这是因为每种方法的模型假设都不一 样。然而,观测性研究得到的效应估计总是有严重缺陷。我们对目标人群的效应估计依然需要其他 假设条件。我们将这些条件分成三组。
1.第一,可识别性条件,即互换性、正数性和一致性 2. 第二,所有变量都应该被正确测量。 3. 第三,所有数据分析中的模型都应该是正确设定的。如果连续变量年龄的真实函数形式不是抛物线,而是另一种复杂曲线,那么即使 L 中 的所有变量都是正确测定的,但因为模型设定错误,逆概率加权依然不能调整所有混杂。模型错 误设定带来的效果就如混杂变量的测量误差一样。
(因果推断的有效性依赖于以下几个条件:互换性,正数性,一致性。并不涉及测量误差和 模型错误设定。)
保证以上条件成立,或者近似成立,是研究者的主要任务。如果这些条件不一定成立,那数 据分析就是无意义的。问题是,没有人能保证这些条件都能完美成立。未测的混杂变量、不重叠 的混杂变量分布、劣定的干预措施、误测的变量、以及错误设定的模型都是我们在效应估计中常 见的问题。其中某些问题可以用经验方法解决,但是其他问题则需要专业判断,因而会被其他人 批评。遗憾的是,我们的数据并不能反驳这些批评。比如,我们可以尝试使用不同的模型,但是 我们不可能去调整没有测量的变量。 因果推断依赖于以上条件。这些条件看起来很美好,但是却不能实证地验证。因果推断的重 要问题是我们不可能观察到所有的反事实数据,因此我们用上述假设去近似得到这些数据。我们越偏离这些假设,那我们的效应估计也就越偏离真实因果效应。因此,对观察性研究中得到的因 果推断必须保持谨慎怀疑。实际中,我们需要依据上面提到的假设,对现实中的因果推断进行逐 条讨论,从而保证因果推断的严肃性。我们需要像对待效应估计一样,严肃对待这些条件假设。
14 G-估计和结构嵌入模型
逆概率加权、标准化和 G-估算经常被统称为 G-方法。因为它们被用于广义(Generalized) 的治疗效果比较,且可以涉及不同时间下的治疗。
14.1 互换性
有界互换性意味着,在 L 取值相同的子人群中,非戒烟者 ( A = 0 )如果戒烟了,那么他们的增重均值会和戒烟者( A = 1 )一样。换句话说,有界互换性 意味着现实中的戒烟者和非戒烟者如果戒烟状态一样,那么这两个组的结局分布相同。
把有界互换性表述为概率形式将有助于我们理解接下来要说的 G-估算。具体而言,假设我们 要拟合下述参数模型来预测接受治疗的概率:

当然,我们在现实世界中不能拟合这个模型,因为我们不知道所有人的反事实结局Ya0.
假设我们知道了、拟合了上述模型,并且有界互换性成立、模型也设定正确,那么参数α1 会是什 么样呢?正确答案是 0,这是因为ya0并不能用来预测参与人员是 否接受了治疗。接下来,我们将介绍 G-估算的另外一半:结构模型。
14.2 结局均值和结构嵌入模型
我们想估计在L的每一分层中,治疗A的因果效应均值,也就是: 
我们可以写作:

对因果效应构建结构模型,就有: 
一般而言, L 可能存在效应修饰作用.在有界互换性下,结构模型 也能被表述成: 
这被称为结构嵌入均值模型。与这些参数模型相 比,结构嵌入模型是半参数化的,因为这个模型中没有截距项和 L 的效应项,也即没有 β0 和 β3 L 两项。
当我们使用 G-估算的时候,我们需要先使用逆概率加权调整删失。在实践中,这意味 着我们需要先使用逆概率加权去构建一个虚拟人群,其中没有人被删失,然后再在虚拟人群中使 用G-估算。
我们假设控制了 L 之 后,被删失的和未被删失的人群是可互换的。在这一假设之下,虚拟人群中的结构模型将是:

对于连续性变量,模型需要假设治疗 A 对结局 Y 的剂量反应函数。比如:

或者其他任意光滑函数,比如样条
14.3 保序性
假设我们知道所有人的Y a=1 和Y a=0 ,我们可以就根据Y a=1 和 Y a=0 进行排序,并得到两份排序名单。如果两份名单的顺序是一样的,那我们就把这称为“保序 性”。
如果在加法尺度上,治疗 A 对结局 Y 的效应在所有个体中都一样,那我们会说加成保序性成 立。比如,如果戒烟让每个人都增重3kg,那根据Ya=0的排序也就等于根据Ya=1的排序。
在结构嵌入模型中,我们只会在意 L 分 层中的加成保序性。如果在 L 的所有分层中, A 对 Y 的效应都相等,那么我们会说条件加成保序 性成立。
一个(条件加成)保序的结构模型如下所示:

其中ψ1+ψ2l是L=l的时候因果效应的大小。也就是说,对于个体i,Ya1等于Ya0+ψ1+ψ2l。个体在没有治疗的时候,反事实结局Ya0移动ψ1+ψ2l个单位就能够得到有治疗时候的反事实结局。
对大多数治疗和结局而言,个体的因果效应并不是一个常数,甚至不可能接近常数。因而 (条件加成)保序性几乎不可能成立。
因为保序性不可能成立,所以我们的因果推断方法不能依赖于保序性。实际上,本书所讨论 的方法都不需要保序性。
基于保序性的模型需要一个非常强的假设:在 L 的同一分层中,个体的治疗因果效应是一个 常数。而我们在现实中并不希望使用这么一个不切实际的模型。不过在下一小节我们将用保序性 介绍 G-估算,这只是因为用保序性能更好地解释 G-估算,并且不管保序性是否成立,G-估算的具 体步骤都是一样的。也就是说 G-估算不一定需要保序性假设。同时要注意到,保序性结构模型也 是一个结构均值模型——个体从Y a=0 移动到Y a=1 距离的均值等于个体移动的距离。
14.4 G-估算
假设我们的目标是估计结构嵌入模型

为了简便,这里只考虑又一个参数的模型。
我们同时也假设加成保序性成立,也即对所有个体i有。 
因而个体的因果效 应ψ 1=β1,这也是我们想估计的值。我们进一步将保序性模型中的下角标 i 去 掉, 
这是因为这一模型对所有个体都适用。接下来我们通过移项得到

G-估算的第一步是将模型和观测数据联系起来。因而,保序结构模型就意味着每个个体的反事实结局 Y a=0 可以表示为观测数据和未知参数ψ1的一个函数

如果这个模型是正确的,并且我们知道ψ 1 的值,那我们就能计算每个个体没有治疗时的反事 实结局Y a=0 。现在我们还不知道ψ1 的值,我们的目标就是估计ψ1 。
# 计算权重
################################################################
# PROGRAM 14.1
# Preprocessing, ranks of extreme observations, IP weights for censoring
# Data from NHEFS
################################################################
#install.packages("readxl") # install package if required
library("readxl")
nhefs <- read.csv("/Users/milin/Downloads/nhefs.csv")
# some processing of the data
nhefs$cens <- ifelse(is.na(nhefs$wt82), 1, 0) # 标记缺失值
# ranking of extreme observations
#install.packages("Hmisc")
library(Hmisc)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:boot':
##
## aml
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
describe(nhefs$wt82_71) # 对数据集进行描述性分析
## nhefs$wt82_71
## n missing distinct Info Mean Gmd .05 .10
## 1566 63 1510 1 2.638 8.337 -9.752 -6.292
## .25 .50 .75 .90 .95
## -1.478 2.604 6.690 11.117 14.739
##
## lowest : -41.3 -30.5 -30.1 -29.0 -26.0, highest: 34.0 37.0 37.7 47.5 48.5
# estimation of denominator of ip weights for C
# C的分母权重
cw.denom <- glm(cens==0 ~ qsmk + sex + race + age + I(age^2)
+ as.factor(education) + smokeintensity + I(smokeintensity^2)
+ smokeyrs + I(smokeyrs^2) + as.factor(exercise)
+ as.factor(active) + wt71 + I(wt71^2),
data = nhefs, family = binomial("logit"))
summary(cw.denom)
##
## Call:
## glm(formula = cens == 0 ~ qsmk + sex + race + age + I(age^2) +
## as.factor(education) + smokeintensity + I(smokeintensity^2) +
## smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) +
## wt71 + I(wt71^2), family = binomial("logit"), data = nhefs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.996 0.157 0.207 0.287 1.097
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.014466 2.576106 -1.56 0.1192
## qsmk -0.516867 0.287716 -1.80 0.0724 .
## sex -0.057313 0.330278 -0.17 0.8622
## race 0.012271 0.452489 0.03 0.9784
## age 0.269729 0.117465 2.30 0.0217 *
## I(age^2) -0.002884 0.001114 -2.59 0.0096 **
## as.factor(education)2 0.440788 0.419399 1.05 0.2933
## as.factor(education)3 0.164688 0.370547 0.44 0.6567
## as.factor(education)4 -0.138447 0.569797 -0.24 0.8080
## as.factor(education)5 0.382382 0.560181 0.68 0.4949
## smokeintensity -0.015712 0.034732 -0.45 0.6510
## I(smokeintensity^2) 0.000113 0.000606 0.19 0.8517
## smokeyrs -0.078597 0.074958 -1.05 0.2944
## I(smokeyrs^2) 0.000557 0.001032 0.54 0.5894
## as.factor(exercise)1 0.971471 0.387810 2.51 0.0122 *
## as.factor(exercise)2 0.583989 0.372313 1.57 0.1168
## as.factor(active)1 0.247479 0.325455 0.76 0.4470
## as.factor(active)2 -0.706583 0.396458 -1.78 0.0747 .
## wt71 0.087887 0.040012 2.20 0.0281 *
## I(wt71^2) -0.000635 0.000226 -2.81 0.0049 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 533.36 on 1628 degrees of freedom
## Residual deviance: 465.36 on 1609 degrees of freedom
## AIC: 505.4
##
## Number of Fisher Scoring iterations: 7
nhefs$pd.c <- predict(cw.denom, nhefs, type="response") # 进行预测
nhefs$wc <- ifelse(nhefs$cens==0, 1/nhefs$pd.c, NA) # observations with cens=1 only contribute to censoring models
##################################################################
# PROGRAM 14.2
# G-estimation of a 1-parameter structural nested mean model
# Brute force search
# Data from NHEFS
##################################################################
####################################################
# G-estimation: Checking one possible value of psi #
####################################################
#install.packages("geepack")
library("geepack")
nhefs$psi <- 3.446
nhefs$Hpsi <- nhefs$wt82_71 - nhefs$psi*nhefs$qsmk
fit <- geeglm(qsmk ~ sex + race + age + I(age*age) + as.factor(education)
+ smokeintensity + I(smokeintensity*smokeintensity) + smokeyrs
+ I(smokeyrs*smokeyrs) + as.factor(exercise) + as.factor(active)
+ wt71 + I(wt71*wt71) + Hpsi, family=binomial, data=nhefs,
weights=wc, id=seqn, corstr="independence")
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(fit)
##
## Call:
## geeglm(formula = qsmk ~ sex + race + age + I(age * age) + as.factor(education) +
## smokeintensity + I(smokeintensity * smokeintensity) + smokeyrs +
## I(smokeyrs * smokeyrs) + as.factor(exercise) + as.factor(active) +
## wt71 + I(wt71 * wt71) + Hpsi, family = binomial, data = nhefs,
## weights = wc, id = seqn, corstr = "independence")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -2.40e+00 1.33e+00 3.27 0.07060 .
## sex -5.14e-01 1.54e-01 11.19 0.00082 ***
## race -8.61e-01 2.10e-01 16.83 4.1e-05 ***
## age 1.15e-01 5.02e-02 5.26 0.02178 *
## I(age * age) -7.59e-04 5.30e-04 2.06 0.15162
## as.factor(education)2 -2.89e-02 1.96e-01 0.02 0.88286
## as.factor(education)3 8.77e-02 1.73e-01 0.26 0.61133
## as.factor(education)4 6.64e-02 2.70e-01 0.06 0.80565
## as.factor(education)5 4.71e-01 2.25e-01 4.40 0.03604 *
## smokeintensity -7.83e-02 1.46e-02 28.63 8.7e-08 ***
## I(smokeintensity * smokeintensity) 1.07e-03 2.65e-04 16.37 5.2e-05 ***
## smokeyrs -7.11e-02 2.64e-02 7.26 0.00705 **
## I(smokeyrs * smokeyrs) 8.15e-04 4.49e-04 3.30 0.06938 .
## as.factor(exercise)1 3.36e-01 1.83e-01 3.38 0.06584 .
## as.factor(exercise)2 3.80e-01 1.89e-01 4.05 0.04419 *
## as.factor(active)1 3.41e-02 1.34e-01 0.06 0.79878
## as.factor(active)2 2.13e-01 2.12e-01 1.01 0.31431
## wt71 -7.66e-03 2.56e-02 0.09 0.76496
## I(wt71 * wt71) 8.66e-05 1.58e-04 0.30 0.58423
## Hpsi -1.90e-06 8.84e-03 0.00 0.99983
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 0.997 0.0672
## Number of clusters: 1566 Maximum cluster size: 1
##########################################################
# G-estimation: Checking multiple possible values of psi #
##########################################################
#install.packages("geepack")
grid <- seq(from = 2,to = 5, by = 0.1)
j = 0
Hpsi.coefs <- cbind(rep(NA,length(grid)), rep(NA, length(grid)))
colnames(Hpsi.coefs) <- c("Estimate", "p-value")
for (i in grid){
psi = i
j = j+1
nhefs$Hpsi <- nhefs$wt82_71 - psi * nhefs$qsmk
gest.fit <- geeglm(qsmk ~ sex + race + age + I(age*age) + as.factor(education)
+ smokeintensity + I(smokeintensity*smokeintensity) + smokeyrs
+ I(smokeyrs*smokeyrs) + as.factor(exercise) + as.factor(active)
+ wt71 + I(wt71*wt71) + Hpsi, family=binomial, data=nhefs,
weights=wc, id=seqn, corstr="independence")
Hpsi.coefs[j,1] <- summary(gest.fit)$coefficients["Hpsi", "Estimate"]
Hpsi.coefs[j,2] <- summary(gest.fit)$coefficients["Hpsi", "Pr(>|W|)"]
}
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
Hpsi.coefs
## Estimate p-value
## [1,] 0.026722 0.00177
## [2,] 0.024895 0.00358
## [3,] 0.023066 0.00696
## [4,] 0.021234 0.01303
## [5,] 0.019401 0.02342
## [6,] 0.017565 0.04043
## [7,] 0.015725 0.06701
## [8,] 0.013883 0.10663
## [9,] 0.012036 0.16288
## [10,] 0.010186 0.23898
## [11,] 0.008331 0.33705
## [12,] 0.006471 0.45743
## [13,] 0.004607 0.59823
## [14,] 0.002737 0.75520
## [15,] 0.000862 0.92210
## [16,] -0.001018 0.90854
## [17,] -0.002904 0.74436
## [18,] -0.004797 0.59219
## [19,] -0.006695 0.45717
## [20,] -0.008600 0.34236
## [21,] -0.010511 0.24868
## [22,] -0.012428 0.17524
## [23,] -0.014352 0.11984
## [24,] -0.016283 0.07958
## [25,] -0.018221 0.05135
## [26,] -0.020165 0.03222
## [27,] -0.022116 0.01968
## [28,] -0.024074 0.01171
## [29,] -0.026039 0.00679
## [30,] -0.028011 0.00385
## [31,] -0.029989 0.00213
##################################################################
14.5 两个或多个参数的结构嵌入模型
例如,我们认为吸烟频率V能够修饰戒烟的因果效应,此时结构嵌入模型是: 
其对应的保序模型则是: 
因此我们可以拟合如下logistic 模型:

在这个模型中,我们需要找到可能的ψ 1† 和ψ 2† 组合,使得参数 α1 和 α 2 都等于 0。
# PROGRAM 14.3
# G-estimation for 2-parameter structural nested mean model
# Closed form estimator
# Data from NHEFS
##################################################################
##########################################################
# G-estimation: Closed form estimator linear mean models #
##########################################################
logit.est <- glm(qsmk ~ sex + race + age + I(age^2) + as.factor(education)
+ smokeintensity + I(smokeintensity^2) + smokeyrs
+ I(smokeyrs^2) + as.factor(exercise) + as.factor(active)
+ wt71 + I(wt71^2), data = nhefs, weight = wc,
family = binomial())
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(logit.est)
##
## Call:
## glm(formula = qsmk ~ sex + race + age + I(age^2) + as.factor(education) +
## smokeintensity + I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) +
## as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2),
## family = binomial(), data = nhefs, weights = wc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.529 -0.808 -0.650 1.029 2.417
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.40e+00 1.31e+00 -1.83 0.06743 .
## sex -5.14e-01 1.50e-01 -3.42 0.00062 ***
## race -8.61e-01 2.06e-01 -4.18 2.9e-05 ***
## age 1.15e-01 4.95e-02 2.33 0.01992 *
## I(age^2) -7.59e-04 5.14e-04 -1.48 0.13953
## as.factor(education)2 -2.89e-02 1.93e-01 -0.15 0.88079
## as.factor(education)3 8.77e-02 1.73e-01 0.51 0.61244
## as.factor(education)4 6.64e-02 2.66e-01 0.25 0.80301
## as.factor(education)5 4.71e-01 2.21e-01 2.13 0.03314 *
## smokeintensity -7.83e-02 1.49e-02 -5.27 1.4e-07 ***
## I(smokeintensity^2) 1.07e-03 2.78e-04 3.85 0.00012 ***
## smokeyrs -7.11e-02 2.71e-02 -2.63 0.00862 **
## I(smokeyrs^2) 8.15e-04 4.45e-04 1.83 0.06722 .
## as.factor(exercise)1 3.36e-01 1.75e-01 1.92 0.05467 .
## as.factor(exercise)2 3.80e-01 1.82e-01 2.09 0.03637 *
## as.factor(active)1 3.41e-02 1.30e-01 0.26 0.79337
## as.factor(active)2 2.13e-01 2.06e-01 1.04 0.30033
## wt71 -7.66e-03 2.46e-02 -0.31 0.75530
## I(wt71^2) 8.66e-05 1.51e-04 0.57 0.56586
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1872.2 on 1565 degrees of freedom
## Residual deviance: 1755.6 on 1547 degrees of freedom
## (63 observations deleted due to missingness)
## AIC: 1719
##
## Number of Fisher Scoring iterations: 4
nhefs$pqsmk <- predict(logit.est, nhefs, type = "response")
describe(nhefs$pqsmk)
## nhefs$pqsmk
## n missing distinct Info Mean Gmd .05 .10
## 1629 0 1629 1 0.2622 0.1302 0.1015 0.1261
## .25 .50 .75 .90 .95
## 0.1780 0.2426 0.3251 0.4221 0.4965
##
## lowest : 0.0514 0.0516 0.0544 0.0558 0.0593, highest: 0.6721 0.6864 0.7139 0.7333 0.7891
summary(nhefs$pqsmk)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.051 0.178 0.243 0.262 0.325 0.789
# solve sum(w_c * H(psi) * (qsmk - E[qsmk | L])) = 0
# for a single psi and H(psi) = wt82_71 - psi * qsmk
# this can be solved as psi = sum( w_c * wt82_71 * (qsmk - pqsmk)) / sum(w_c * qsmk * (qsmk - pqsmk))
nhefs.c <- nhefs[which(!is.na(nhefs$wt82)),]
with(nhefs.c, sum(wc*wt82_71*(qsmk-pqsmk)) / sum(wc*qsmk*(qsmk - pqsmk)))
## [1] 3.45
#############################################################
# G-estimation: Closed form estimator for 2-parameter model #
#############################################################
diff = with(nhefs.c, qsmk - pqsmk)
diff2 = with(nhefs.c, wc * diff)
lhs = matrix(0,2,2)
lhs[1,1] = with(nhefs.c, sum(qsmk * diff2))
lhs[1,2] = with(nhefs.c, sum(qsmk * smokeintensity * diff2))
lhs[2,1] = with(nhefs.c, sum(qsmk * smokeintensity * diff2))
lhs[2,2] = with(nhefs.c, sum(qsmk * smokeintensity * smokeintensity * diff2))
rhs = matrix(0,2,1)
rhs[1] = with(nhefs.c, sum(wt82_71 * diff2))
rhs[2] = with(nhefs.c, sum(wt82_71 * smokeintensity * diff2))
psi = t(solve(lhs,rhs))
psi
## [,1] [,2]
## [1,] 2.86 0.03
15 结局回归和倾向性评分
已经有无数著作详细介绍并运用了这两种方法。然而,这两种方法只在较简单的情形中适 用。当情形更复杂一些,比如涉及时异性治疗的时候,这两种方法就不再适用。在本书第三部分 我们将主要讨论 G-方法。本章将介绍结局回归和倾向性评分这两种方法。不过谨记,这两种方法 不适用于复杂的纵向数据。
15.1 结局回归
在前三章,我们讨论了逆概率加权,标准化,以及 G-估算三种方法,并用它们估计了戒烟 A (治疗)对增重Y (结局)的因果效应。
rm(list = ls())
#################################################################################
# Program 15.1
# Estimating the average causal effect within levels of confounders
# under the assumption of effect-measure modification by smoking intensity ONLY
# Data from NHEFS
#################################################################################
#install.packages("readxl") # install package if required
library("readxl")
nhefs <- read.csv("/Users/milin/Downloads/nhefs.csv")
nhefs$cens <- ifelse(is.na(nhefs$wt82), 1, 0)
# regression on covariates, allowing for some effect modification
fit <- glm(wt82_71 ~ qsmk + sex + race + age + I(age*age) + as.factor(education)
+ smokeintensity + I(smokeintensity*smokeintensity) + smokeyrs
+ I(smokeyrs*smokeyrs) + as.factor(exercise) + as.factor(active)
+ wt71 + I(wt71*wt71) + I(qsmk*smokeintensity), data=nhefs)
summary(fit)
##
## Call:
## glm(formula = wt82_71 ~ qsmk + sex + race + age + I(age * age) +
## as.factor(education) + smokeintensity + I(smokeintensity *
## smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) +
## as.factor(active) + wt71 + I(wt71 * wt71) + I(qsmk * smokeintensity),
## data = nhefs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -42.06 -4.17 -0.34 3.89 44.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.588166 4.313036 -0.37 0.71276
## qsmk 2.559594 0.809149 3.16 0.00159 **
## sex -1.430272 0.468958 -3.05 0.00233 **
## race 0.560110 0.581889 0.96 0.33591
## age 0.359635 0.163319 2.20 0.02781 *
## I(age * age) -0.006101 0.001726 -3.53 0.00042 ***
## as.factor(education)2 0.790444 0.607000 1.30 0.19304
## as.factor(education)3 0.556312 0.556102 1.00 0.31728
## as.factor(education)4 1.491570 0.832270 1.79 0.07330 .
## as.factor(education)5 -0.194977 0.741369 -0.26 0.79259
## smokeintensity 0.049136 0.051725 0.95 0.34229
## I(smokeintensity * smokeintensity) -0.000991 0.000938 -1.06 0.29110
## smokeyrs 0.134369 0.091712 1.47 0.14309
## I(smokeyrs * smokeyrs) -0.001866 0.001544 -1.21 0.22683
## as.factor(exercise)1 0.295975 0.535153 0.55 0.58030
## as.factor(exercise)2 0.353913 0.558859 0.63 0.52665
## as.factor(active)1 -0.947569 0.409934 -2.31 0.02094 *
## as.factor(active)2 -0.261378 0.684558 -0.38 0.70265
## wt71 0.045502 0.083371 0.55 0.58530
## I(wt71 * wt71) -0.000965 0.000525 -1.84 0.06600 .
## I(qsmk * smokeintensity) 0.046663 0.035145 1.33 0.18446
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 53.6)
##
## Null deviance: 97176 on 1565 degrees of freedom
## Residual deviance: 82763 on 1545 degrees of freedom
## (63 observations deleted due to missingness)
## AIC: 10701
##
## Number of Fisher Scoring iterations: 2
# (step 1) build the contrast matrix with all zeros
# this function builds the blank matrix
# install.packages("multcomp") # install packages if necessary
library("multcomp")
## Warning: package 'multcomp' was built under R version 4.1.2
## Loading required package: mvtnorm
## Loading required package: TH.data
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
##
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
##
## geyser
makeContrastMatrix <- function(model, nrow, names) {
m <- matrix(0, nrow = nrow, ncol = length(coef(model)))
colnames(m) <- names(coef(model))
rownames(m) <- names
return(m)
}
K1 <- makeContrastMatrix(fit, 2, c('Effect of Quitting Smoking at Smokeintensity of 5',
'Effect of Quitting Smoking at Smokeintensity of 40'))
# (step 2) fill in the relevant non-zero elements
K1[1:2, 'qsmk'] <- 1
K1[1:2, 'I(qsmk * smokeintensity)'] <- c(5, 40)
# (step 3) check the contrast matrix
K1
## (Intercept) qsmk sex race
## Effect of Quitting Smoking at Smokeintensity of 5 0 1 0 0
## Effect of Quitting Smoking at Smokeintensity of 40 0 1 0 0
## age I(age * age)
## Effect of Quitting Smoking at Smokeintensity of 5 0 0
## Effect of Quitting Smoking at Smokeintensity of 40 0 0
## as.factor(education)2
## Effect of Quitting Smoking at Smokeintensity of 5 0
## Effect of Quitting Smoking at Smokeintensity of 40 0
## as.factor(education)3
## Effect of Quitting Smoking at Smokeintensity of 5 0
## Effect of Quitting Smoking at Smokeintensity of 40 0
## as.factor(education)4
## Effect of Quitting Smoking at Smokeintensity of 5 0
## Effect of Quitting Smoking at Smokeintensity of 40 0
## as.factor(education)5
## Effect of Quitting Smoking at Smokeintensity of 5 0
## Effect of Quitting Smoking at Smokeintensity of 40 0
## smokeintensity
## Effect of Quitting Smoking at Smokeintensity of 5 0
## Effect of Quitting Smoking at Smokeintensity of 40 0
## I(smokeintensity * smokeintensity)
## Effect of Quitting Smoking at Smokeintensity of 5 0
## Effect of Quitting Smoking at Smokeintensity of 40 0
## smokeyrs
## Effect of Quitting Smoking at Smokeintensity of 5 0
## Effect of Quitting Smoking at Smokeintensity of 40 0
## I(smokeyrs * smokeyrs)
## Effect of Quitting Smoking at Smokeintensity of 5 0
## Effect of Quitting Smoking at Smokeintensity of 40 0
## as.factor(exercise)1
## Effect of Quitting Smoking at Smokeintensity of 5 0
## Effect of Quitting Smoking at Smokeintensity of 40 0
## as.factor(exercise)2
## Effect of Quitting Smoking at Smokeintensity of 5 0
## Effect of Quitting Smoking at Smokeintensity of 40 0
## as.factor(active)1
## Effect of Quitting Smoking at Smokeintensity of 5 0
## Effect of Quitting Smoking at Smokeintensity of 40 0
## as.factor(active)2 wt71
## Effect of Quitting Smoking at Smokeintensity of 5 0 0
## Effect of Quitting Smoking at Smokeintensity of 40 0 0
## I(wt71 * wt71)
## Effect of Quitting Smoking at Smokeintensity of 5 0
## Effect of Quitting Smoking at Smokeintensity of 40 0
## I(qsmk * smokeintensity)
## Effect of Quitting Smoking at Smokeintensity of 5 5
## Effect of Quitting Smoking at Smokeintensity of 40 40
# (step 4) estimate the contrasts, get tests and confidence intervals for them
estimates1 <- glht(fit, K1)
summary(estimates1)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: glm(formula = wt82_71 ~ qsmk + sex + race + age + I(age * age) +
## as.factor(education) + smokeintensity + I(smokeintensity *
## smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) +
## as.factor(active) + wt71 + I(wt71 * wt71) + I(qsmk * smokeintensity),
## data = nhefs)
##
## Linear Hypotheses:
## Estimate Std. Error
## Effect of Quitting Smoking at Smokeintensity of 5 == 0 2.793 0.668
## Effect of Quitting Smoking at Smokeintensity of 40 == 0 4.426 0.848
## z value Pr(>|z|)
## Effect of Quitting Smoking at Smokeintensity of 5 == 0 4.18 5.8e-05 ***
## Effect of Quitting Smoking at Smokeintensity of 40 == 0 5.22 3.6e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
confint(estimates1)
##
## Simultaneous Confidence Intervals
##
## Fit: glm(formula = wt82_71 ~ qsmk + sex + race + age + I(age * age) +
## as.factor(education) + smokeintensity + I(smokeintensity *
## smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) +
## as.factor(active) + wt71 + I(wt71 * wt71) + I(qsmk * smokeintensity),
## data = nhefs)
##
## Quantile = 2.23
## 95% family-wise confidence level
##
##
## Linear Hypotheses:
## Estimate lwr upr
## Effect of Quitting Smoking at Smokeintensity of 5 == 0 2.793 1.304 4.282
## Effect of Quitting Smoking at Smokeintensity of 40 == 0 4.426 2.537 6.315
# regression on covariates, not allowing for effect modification
fit2 <- glm(wt82_71 ~ qsmk + sex + race + age + I(age*age) + as.factor(education)
+ smokeintensity + I(smokeintensity*smokeintensity) + smokeyrs
+ I(smokeyrs*smokeyrs) + as.factor(exercise) + as.factor(active)
+ wt71 + I(wt71*wt71), data=nhefs)
summary(fit2)
##
## Call:
## glm(formula = wt82_71 ~ qsmk + sex + race + age + I(age * age) +
## as.factor(education) + smokeintensity + I(smokeintensity *
## smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) +
## as.factor(active) + wt71 + I(wt71 * wt71), data = nhefs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -42.33 -4.22 -0.32 3.81 44.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.658618 4.313773 -0.38 0.70067
## qsmk 3.462622 0.438454 7.90 5.4e-15 ***
## sex -1.465050 0.468341 -3.13 0.00179 **
## race 0.586412 0.581695 1.01 0.31356
## age 0.362662 0.163343 2.22 0.02655 *
## I(age * age) -0.006138 0.001726 -3.56 0.00039 ***
## as.factor(education)2 0.818526 0.606782 1.35 0.17755
## as.factor(education)3 0.571500 0.556121 1.03 0.30427
## as.factor(education)4 1.508517 0.832378 1.81 0.07013 .
## as.factor(education)5 -0.170826 0.741329 -0.23 0.81779
## smokeintensity 0.065153 0.050311 1.29 0.19551
## I(smokeintensity * smokeintensity) -0.001047 0.000937 -1.12 0.26426
## smokeyrs 0.133393 0.091732 1.45 0.14610
## I(smokeyrs * smokeyrs) -0.001827 0.001544 -1.18 0.23682
## as.factor(exercise)1 0.320682 0.534962 0.60 0.54896
## as.factor(exercise)2 0.362879 0.558956 0.65 0.51630
## as.factor(active)1 -0.942957 0.410021 -2.30 0.02159 *
## as.factor(active)2 -0.258037 0.684722 -0.38 0.70634
## wt71 0.037364 0.083166 0.45 0.65330
## I(wt71 * wt71) -0.000916 0.000524 -1.75 0.08043 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 53.6)
##
## Null deviance: 97176 on 1565 degrees of freedom
## Residual deviance: 82857 on 1546 degrees of freedom
## (63 observations deleted due to missingness)
## AIC: 10701
##
## Number of Fisher Scoring iterations: 2
##################################################
15.2 倾向性评分
# Program 15.2
# Estimating and plotting the propensity score
# Data from NHEFS
##################################################
fit3 <- glm(qsmk ~ sex + race + age + I(age*age) + as.factor(education)
+ smokeintensity + I(smokeintensity*smokeintensity) + smokeyrs
+ I(smokeyrs*smokeyrs) + as.factor(exercise) + as.factor(active)
+ wt71 + I(wt71*wt71), data=nhefs, family=binomial())
summary(fit3)
##
## Call:
## glm(formula = qsmk ~ sex + race + age + I(age * age) + as.factor(education) +
## smokeintensity + I(smokeintensity * smokeintensity) + smokeyrs +
## I(smokeyrs * smokeyrs) + as.factor(exercise) + as.factor(active) +
## wt71 + I(wt71 * wt71), family = binomial(), data = nhefs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.465 -0.804 -0.646 1.058 2.355
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.988902 1.241279 -1.60 0.10909
## sex -0.507522 0.148232 -3.42 0.00062 ***
## race -0.850231 0.205872 -4.13 3.6e-05 ***
## age 0.103013 0.048900 2.11 0.03515 *
## I(age * age) -0.000605 0.000507 -1.19 0.23297
## as.factor(education)2 -0.098320 0.190655 -0.52 0.60607
## as.factor(education)3 0.015699 0.170714 0.09 0.92673
## as.factor(education)4 -0.042526 0.264276 -0.16 0.87216
## as.factor(education)5 0.379663 0.220395 1.72 0.08495 .
## smokeintensity -0.065156 0.014759 -4.41 1.0e-05 ***
## I(smokeintensity * smokeintensity) 0.000846 0.000276 3.07 0.00216 **
## smokeyrs -0.073371 0.026996 -2.72 0.00657 **
## I(smokeyrs * smokeyrs) 0.000838 0.000443 1.89 0.05867 .
## as.factor(exercise)1 0.291412 0.173554 1.68 0.09314 .
## as.factor(exercise)2 0.355052 0.179929 1.97 0.04846 *
## as.factor(active)1 0.010875 0.129832 0.08 0.93324
## as.factor(active)2 0.068312 0.208727 0.33 0.74346
## wt71 -0.012848 0.022283 -0.58 0.56423
## I(wt71 * wt71) 0.000121 0.000135 0.89 0.37096
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1876.3 on 1628 degrees of freedom
## Residual deviance: 1766.7 on 1610 degrees of freedom
## AIC: 1805
##
## Number of Fisher Scoring iterations: 4
nhefs$ps <- predict(fit3, nhefs, type="response")
summary(nhefs$ps[nhefs$qsmk==0])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.053 0.169 0.227 0.245 0.304 0.658
summary(nhefs$ps[nhefs$qsmk==1])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.062 0.220 0.289 0.312 0.381 0.793
# # plotting the estimated propensity score
# install.packages("ggplot2") # install packages if necessary
# install.packages("dplyr")
library("ggplot2")
library("dplyr")
library(tidyverse)
ggplot(nhefs, aes(x = ps, fill = qsmk)) + geom_density(alpha = 0.2) +
xlab('Probability of Quitting Smoking During Follow-up') +
ggtitle('Propensity Score Distribution by Treatment Group') +
scale_fill_discrete('') +
theme(legend.position = 'bottom', legend.direction = 'vertical')

# alternative plot with histograms
nhefs <- nhefs %>% mutate(qsmklabel = ifelse(qsmk == 1,
yes = 'Quit Smoking 1971-1982',
no = 'Did Not Quit Smoking 1971-1982'))
ggplot(nhefs, aes(x = ps, fill = as.factor(qsmk), color = as.factor(qsmk))) +
geom_histogram(alpha = 0.3, position = 'identity', bins=15) +
facet_grid(as.factor(qsmk) ~ .) +
xlab('Probability of Quitting Smoking During Follow-up') +
ggtitle('Propensity Score Distribution by Treatment Group') +
scale_fill_discrete('') +
scale_color_discrete('') +
theme(legend.position = 'bottom', legend.direction = 'vertical')

# attempt to reproduce plot from the book
nhefs %>%
mutate(ps.grp = round(ps/0.05) * 0.05) %>%
group_by(qsmk, ps.grp) %>%
dplyr::summarize(n = n()) %>%
ungroup() %>%
mutate(n2 = ifelse(qsmk == 0, yes = n, no = -1*n)) %>%
ggplot(aes(x = ps.grp, y = n2, fill = as.factor(qsmk))) +
geom_bar(stat = 'identity', position = 'identity') +
geom_text(aes(label = n, x = ps.grp, y = n2 + ifelse(qsmk == 0, 8, -8))) +
xlab('Probability of Quitting Smoking During Follow-up') +
ylab('N') +
ggtitle('Propensity Score Distribution by Treatment Group') +
scale_fill_discrete('') +
scale_x_continuous(breaks = seq(0, 1, 0.05)) +
theme(legend.position = 'bottom', legend.direction = 'vertical',
axis.ticks.y = element_blank(),
axis.text.y = element_blank())
## `summarise()` has grouped output by 'qsmk'. You can override using the `.groups` argument.

############################################
15.3 倾向性分层和标准化
# Program 15.3
# Stratification on the propensity score
# Data from NHEFS
############################################
# calculation of deciles
nhefs$ps.dec <- cut(nhefs$ps,
breaks=c(quantile(nhefs$ps, probs=seq(0,1,0.1))),
labels=seq(1:10),
include.lowest=TRUE)
#install.packages("psych") # install package if required
library("psych")
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following object is masked from 'package:boot':
##
## logit
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describeBy(nhefs$ps, list(nhefs$ps.dec, nhefs$qsmk))
##
## Descriptive statistics by group
## : 1
## : 0
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 151 0.1 0.02 0.11 0.1 0.02 0.05 0.13 0.08 -0.55 -0.53 0
## ------------------------------------------------------------
## : 2
## : 0
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 136 0.15 0.01 0.15 0.15 0.01 0.13 0.17 0.04 -0.04 -1.23 0
## ------------------------------------------------------------
## : 3
## : 0
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 134 0.18 0.01 0.18 0.18 0.01 0.17 0.19 0.03 -0.08 -1.34 0
## ------------------------------------------------------------
## : 4
## : 0
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 129 0.21 0.01 0.21 0.21 0.01 0.19 0.22 0.02 -0.04 -1.13 0
## ------------------------------------------------------------
## : 5
## : 0
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 120 0.23 0.01 0.23 0.23 0.01 0.22 0.25 0.03 0.24 -1.22 0
## ------------------------------------------------------------
## : 6
## : 0
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 117 0.26 0.01 0.26 0.26 0.01 0.25 0.27 0.03 -0.11 -1.29 0
## ------------------------------------------------------------
## : 7
## : 0
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 120 0.29 0.01 0.29 0.29 0.01 0.27 0.31 0.03 -0.23 -1.19 0
## ------------------------------------------------------------
## : 8
## : 0
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 112 0.33 0.01 0.33 0.33 0.02 0.31 0.35 0.04 0.15 -1.1 0
## ------------------------------------------------------------
## : 9
## : 0
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 96 0.38 0.02 0.38 0.38 0.02 0.35 0.42 0.06 0.13 -1.15 0
## ------------------------------------------------------------
## : 10
## : 0
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 86 0.49 0.06 0.47 0.48 0.05 0.42 0.66 0.24 1.1 0.47 0.01
## ------------------------------------------------------------
## : 1
## : 1
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 12 0.1 0.02 0.11 0.1 0.03 0.06 0.13 0.07 -0.5 -1.36 0.01
## ------------------------------------------------------------
## : 2
## : 1
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 27 0.15 0.01 0.15 0.15 0.01 0.13 0.17 0.03 -0.03 -1.34 0
## ------------------------------------------------------------
## : 3
## : 1
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 29 0.18 0.01 0.18 0.18 0.01 0.17 0.19 0.03 0.01 -1.34 0
## ------------------------------------------------------------
## : 4
## : 1
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 34 0.21 0.01 0.21 0.21 0.01 0.19 0.22 0.02 -0.31 -1.23 0
## ------------------------------------------------------------
## : 5
## : 1
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 43 0.23 0.01 0.23 0.23 0.01 0.22 0.25 0.03 0.11 -1.23 0
## ------------------------------------------------------------
## : 6
## : 1
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 45 0.26 0.01 0.26 0.26 0.01 0.25 0.27 0.03 0.2 -1.12 0
## ------------------------------------------------------------
## : 7
## : 1
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 43 0.29 0.01 0.29 0.29 0.01 0.27 0.31 0.03 0.16 -1.25 0
## ------------------------------------------------------------
## : 8
## : 1
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 51 0.33 0.01 0.33 0.33 0.02 0.31 0.35 0.04 0.11 -1.19 0
## ------------------------------------------------------------
## : 9
## : 1
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 67 0.38 0.02 0.38 0.38 0.03 0.35 0.42 0.06 0.19 -1.27 0
## ------------------------------------------------------------
## : 10
## : 1
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 77 0.52 0.08 0.51 0.51 0.08 0.42 0.79 0.38 0.88 0.81 0.01
# function to create deciles easily
decile <- function(x) {
return(factor(quantcut(x, seq(0, 1, 0.1), labels = FALSE)))
}
# regression on PS deciles, allowing for effect modification
for (deciles in c(1:10)) {
print(t.test(wt82_71~qsmk, data=nhefs[which(nhefs$ps.dec==deciles),]))
}
##
## Welch Two Sample t-test
##
## data: wt82_71 by qsmk
## t = 0.006, df = 12, p-value = 1
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -5.28 5.31
## sample estimates:
## mean in group 0 mean in group 1
## 4.00 3.98
##
##
## Welch Two Sample t-test
##
## data: wt82_71 by qsmk
## t = -3, df = 37, p-value = 0.004
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -6.85 -1.45
## sample estimates:
## mean in group 0 mean in group 1
## 2.90 7.05
##
##
## Welch Two Sample t-test
##
## data: wt82_71 by qsmk
## t = -5, df = 36, p-value = 6e-05
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -9.47 -3.61
## sample estimates:
## mean in group 0 mean in group 1
## 2.61 9.16
##
##
## Welch Two Sample t-test
##
## data: wt82_71 by qsmk
## t = -1, df = 45, p-value = 0.2
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -5.683 0.999
## sample estimates:
## mean in group 0 mean in group 1
## 3.47 5.82
##
##
## Welch Two Sample t-test
##
## data: wt82_71 by qsmk
## t = -3, df = 74, p-value = 0.002
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -6.75 -1.51
## sample estimates:
## mean in group 0 mean in group 1
## 2.10 6.23
##
##
## Welch Two Sample t-test
##
## data: wt82_71 by qsmk
## t = -2, df = 51, p-value = 0.03
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -8.752 -0.335
## sample estimates:
## mean in group 0 mean in group 1
## 1.85 6.39
##
##
## Welch Two Sample t-test
##
## data: wt82_71 by qsmk
## t = -3, df = 85, p-value = 0.001
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -6.90 -1.73
## sample estimates:
## mean in group 0 mean in group 1
## 1.56 5.88
##
##
## Welch Two Sample t-test
##
## data: wt82_71 by qsmk
## t = -3, df = 75, p-value = 0.009
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -6.240 -0.901
## sample estimates:
## mean in group 0 mean in group 1
## 0.285 3.855
##
##
## Welch Two Sample t-test
##
## data: wt82_71 by qsmk
## t = -2, df = 129, p-value = 0.06
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -4.6814 0.0797
## sample estimates:
## mean in group 0 mean in group 1
## -0.895 1.405
##
##
## Welch Two Sample t-test
##
## data: wt82_71 by qsmk
## t = -2, df = 143, p-value = 0.1
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -5.02 0.54
## sample estimates:
## mean in group 0 mean in group 1
## -0.504 1.736
# regression on PS deciles, not allowing for effect modification
fit.psdec <- glm(wt82_71 ~ qsmk + as.factor(ps.dec), data = nhefs)
summary(fit.psdec)
##
## Call:
## glm(formula = wt82_71 ~ qsmk + as.factor(ps.dec), data = nhefs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -43.54 -3.93 -0.08 4.23 46.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.750 0.609 6.16 9.3e-10 ***
## qsmk 3.500 0.457 7.66 3.3e-14 ***
## as.factor(ps.dec)2 -0.739 0.861 -0.86 0.391
## as.factor(ps.dec)3 -0.618 0.861 -0.72 0.473
## as.factor(ps.dec)4 -0.520 0.858 -0.61 0.544
## as.factor(ps.dec)5 -1.488 0.859 -1.73 0.083 .
## as.factor(ps.dec)6 -1.623 0.868 -1.87 0.062 .
## as.factor(ps.dec)7 -1.985 0.868 -2.29 0.022 *
## as.factor(ps.dec)8 -3.445 0.875 -3.94 8.6e-05 ***
## as.factor(ps.dec)9 -5.154 0.885 -5.83 6.9e-09 ***
## as.factor(ps.dec)10 -4.840 0.883 -5.48 4.9e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 58.4)
##
## Null deviance: 97176 on 1565 degrees of freedom
## Residual deviance: 90848 on 1555 degrees of freedom
## (63 observations deleted due to missingness)
## AIC: 10827
##
## Number of Fisher Scoring iterations: 2
confint.lm(fit.psdec)
## 2.5 % 97.5 %
## (Intercept) 2.56 4.9449
## qsmk 2.60 4.3970
## as.factor(ps.dec)2 -2.43 0.9498
## as.factor(ps.dec)3 -2.31 1.0710
## as.factor(ps.dec)4 -2.20 1.1633
## as.factor(ps.dec)5 -3.17 0.1966
## as.factor(ps.dec)6 -3.32 0.0789
## as.factor(ps.dec)7 -3.69 -0.2825
## as.factor(ps.dec)8 -5.16 -1.7286
## as.factor(ps.dec)9 -6.89 -3.4188
## as.factor(ps.dec)10 -6.57 -3.1087
#######################################################################################
15.4 倾向性评分匹配
# Program 15.4 ####
# Standardization using the propensity score ####
# Data from NHEFS ####
#######################################################################################
#install.packages("boot") # install package if required
library("boot")
# standardization by propensity score, agnostic regarding effect modification
std.ps <- function(data, indices) {
d <- data[indices,] # 1st copy: equal to original one`
# calculating propensity scores
ps.fit <- glm(qsmk ~ sex + race + age + I(age*age)
+ as.factor(education) + smokeintensity
+ I(smokeintensity*smokeintensity) + smokeyrs
+ I(smokeyrs*smokeyrs) + as.factor(exercise)
+ as.factor(active) + wt71 + I(wt71*wt71),
data=d, family=binomial())
d$pscore <- predict(ps.fit, d, type="response")
# create a dataset with 3 copies of each subject
d$interv <- -1 # 1st copy: equal to original one`
d0 <- d # 2nd copy: treatment set to 0, outcome to missing
d0$interv <- 0
d0$qsmk <- 0
d0$wt82_71 <- NA
d1 <- d # 3rd copy: treatment set to 1, outcome to missing
d1$interv <- 1
d1$qsmk <- 1
d1$wt82_71 <- NA
d.onesample <- rbind(d, d0, d1) # combining datasets
std.fit <- glm(wt82_71 ~ qsmk + pscore + I(qsmk*pscore), data=d.onesample)
d.onesample$predicted_meanY <- predict(std.fit, d.onesample)
# estimate mean outcome in each of the groups interv=-1, interv=0, and interv=1
return(c(mean(d.onesample$predicted_meanY[d.onesample$interv==-1]),
mean(d.onesample$predicted_meanY[d.onesample$interv==0]),
mean(d.onesample$predicted_meanY[d.onesample$interv==1]),
mean(d.onesample$predicted_meanY[d.onesample$interv==1])-
mean(d.onesample$predicted_meanY[d.onesample$interv==0])))
}
# bootstrap
results <- boot(data=nhefs, statistic=std.ps, R=5)
# generating confidence intervals
se <- c(sd(results$t[,1]), sd(results$t[,2]),
sd(results$t[,3]), sd(results$t[,4]))
mean <- results$t0
ll <- mean - qnorm(0.975)*se
ul <- mean + qnorm(0.975)*se
bootstrap <- data.frame(cbind(c("Observed", "No Treatment", "Treatment",
"Treatment - No Treatment"), mean, se, ll, ul))
bootstrap
## V1 mean se ll
## 1 Observed 2.63384609228479 0.215654312114019 2.21117140743055
## 2 No Treatment 1.71983636149842 0.212527259152412 1.30329058782669
## 3 Treatment 5.35072300362993 0.32326309173114 4.71713898630582
## 4 Treatment - No Treatment 3.6308866421315 0.311509003584114 3.02034021424668
## ul
## 1 3.05652077713903
## 2 2.13638213517016
## 3 5.98430702095403
## 4 4.24143307001633
#####################
# regression on the propensity score (linear term) p.qsmk 改为 ps
model6 <- glm(wt82_71 ~ qsmk + ps, data = nhefs)
summary(model6)
##
## Call:
## glm(formula = wt82_71 ~ qsmk + ps, data = nhefs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -43.31 -4.01 -0.07 4.24 47.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.594 0.483 11.58 < 2e-16 ***
## qsmk 3.551 0.457 7.76 1.5e-14 ***
## ps -14.822 1.758 -8.43 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 58.3)
##
## Null deviance: 97176 on 1565 degrees of freedom
## Residual deviance: 91099 on 1563 degrees of freedom
## (63 observations deleted due to missingness)
## AIC: 10815
##
## Number of Fisher Scoring iterations: 2
# standarization on the propensity score
# (step 1) create two new datasets, one with all treated and one with all untreated
treated <- nhefs
treated$qsmk <- 1
untreated <- nhefs
untreated$qsmk <- 0
# (step 2) predict values for everyone in each new dataset based on above model
treated$pred.y <- predict(model6, treated)
untreated$pred.y <- predict(model6, untreated)
# (step 3) compare mean weight loss had all been treated vs. that had all been untreated
mean1 <- mean(treated$pred.y, na.rm = TRUE)
mean0 <- mean(untreated$pred.y, na.rm = TRUE)
mean1
## [1] 5.25
mean0
## [1] 1.7
mean1 - mean0
## [1] 3.55
# (step 4) bootstrap a confidence interval
# number of bootstraps
nboot <- 100
# set up a matrix to store results
boots <- data.frame(i = 1:nboot,
mean1 = NA,
mean0 = NA,
difference = NA)
# loop to perform the bootstrapping p.qsmk 改为ps
nhefs <- subset(nhefs, !is.na(ps) & !is.na(wt82_71))
for(i in 1:nboot) {
# sample with replacement
sampl <- nhefs[sample(1:nrow(nhefs), nrow(nhefs), replace = TRUE), ]
# fit the model in the bootstrap sample
bootmod <- glm(wt82_71 ~ qsmk + ps, data = sampl)
# create new datasets
sampl.treated <- sampl %>%
mutate(qsmk = 1)
sampl.untreated <- sampl %>%
mutate(qsmk = 0)
# predict values
sampl.treated$pred.y <- predict(bootmod, sampl.treated)
sampl.untreated$pred.y <- predict(bootmod, sampl.untreated)
# output results
boots[i, 'mean1'] <- mean(sampl.treated$pred.y, na.rm = TRUE)
boots[i, 'mean0'] <- mean(sampl.untreated$pred.y, na.rm = TRUE)
boots[i, 'difference'] <- boots[i, 'mean1'] - boots[i, 'mean0']
# once loop is done, print the results
if(i == nboot) {
cat('95% CI for the causal mean difference\n')
cat(mean(boots$difference) - 1.96*sd(boots$difference),
',',
mean(boots$difference) + 1.96*sd(boots$difference))
}
}
## 95% CI for the causal mean difference
## 2.54 , 4.57
# a more flexible and elegant way to do this is to write a function
# to perform the model fitting, prediction, bootstrapping, and reporting all at once
# view the code contained in the file mstandardize.R to learn more
# load the code for the mstandardize() function
# (you may need to change the filepath)
# source('/Users/milin/Downloads/causalInferenceWhatIfRcode_CIpart2/chapter15.R')
# # performt the standardization
# mstandardize(formula = wt82_71 ~ qsmk + decile(p.qsmk),
# family = 'gaussian',
# trt = 'qsmk',
# nboot = 100,
# data = nhefs)
16 工具变量
16.1 工具变量的三个条件
##################################################################
# Program 16.1
# Estimating the average causal using the standard IV estimator
# via the calculation of sample averages
# Data from NHEFS
##################################################################
#install.packages("readxl") # install package if required
library("readxl")
nhefs <- read.csv("/Users/milin/Downloads/nhefs.csv")
# some preprocessing of the data
nhefs$cens <- ifelse(is.na(nhefs$wt82), 1, 0)
summary(nhefs$price82)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.5 1.7 1.8 1.8 1.9 2.1 92
# for simplicity, ignore subjects with missing outcome or missing instrument
nhefs.iv <- nhefs[which(!is.na(nhefs$wt82) & !is.na(nhefs$price82)),]
nhefs.iv$highprice <- ifelse(nhefs.iv$price82>=1.5, 1, 0)
table(nhefs.iv$highprice, nhefs.iv$qsmk)
##
## 0 1
## 0 33 8
## 1 1065 370
t.test(wt82_71 ~ highprice, data=nhefs.iv)
##
## Welch Two Sample t-test
##
## data: wt82_71 by highprice
## t = -0.1, df = 42, p-value = 0.9
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -3.13 2.83
## sample estimates:
## mean in group 0 mean in group 1
## 2.54 2.69
######################################################################
16.2 工具变量的效应估计
# PROGRAM 16.2
# Estimating the average causal effect using the standard IV estimator
# via two-stage-least-squares regression
# Data from NHEFS
######################################################################
#install.packages ("sem") # install package if required
library(sem)
model1 <- tsls(wt82_71 ~ qsmk, ~ highprice, data = nhefs.iv)
summary(model1)
##
## 2SLS Estimates
##
## Model Formula: wt82_71 ~ qsmk
##
## Instruments: ~highprice
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -43.3 -4.0 0.0 0.0 4.2 46.5
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.07 5.09 0.41 0.68
## qsmk 2.40 19.84 0.12 0.90
##
## Residual standard error: 7.856 on 1474 degrees of freedom
confint(model1) # note the wide confidence intervals
## 2.5 % 97.5 %
## (Intercept) -7.9 12.0
## qsmk -36.5 41.3
######################################################################
16.3 工具变量的第四个条件:同质性
# Program 16.3
# Estimating the average causal using the standard IV estimator
# via additive marginal structural models
# Data from NHEFS
######################################################################
########################################################
# G-estimation: Checking one possible value of psi
# See Chapter 14 for program that checks several values
# and computes 95% confidence intervals
########################################################
nhefs.iv$psi <- 2.396
nhefs.iv$Hpsi <- nhefs.iv$wt82_71-nhefs.iv$psi*nhefs.iv$qsmk
#install.packages("geepack") # install package if required
library("geepack")
g.est <- geeglm(highprice ~ Hpsi, data=nhefs.iv, id=seqn, family=binomial(),
corstr="independence")
summary(g.est)
##
## Call:
## geeglm(formula = highprice ~ Hpsi, family = binomial(), data = nhefs.iv,
## id = seqn, corstr = "independence")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 3.56e+00 1.65e-01 463 <2e-16 ***
## Hpsi 2.75e-07 2.27e-02 0 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 1 0.761
## Number of clusters: 1476 Maximum cluster size: 1
beta <- coef(g.est)
SE <- coef(summary(g.est))[,2]
lcl <- beta-qnorm(0.975)*SE
ucl <- beta+qnorm(0.975)*SE
cbind(beta, lcl, ucl)
## beta lcl ucl
## (Intercept) 3.56e+00 3.2315 3.8792
## Hpsi 2.75e-07 -0.0446 0.0446
######################################################################
16.4 另一种第四个条件:单调性
# Program 16.4
# Estimating the average causal using the standard IV estimator
# with altnerative proposed instruments
# Data from NHEFS
######################################################################
summary(tsls(wt82_71 ~ qsmk, ~ ifelse(price82 >= 1.6, 1, 0), data = nhefs.iv))
##
## 2SLS Estimates
##
## Model Formula: wt82_71 ~ qsmk
##
## Instruments: ~ifelse(price82 >= 1.6, 1, 0)
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -55.6 -13.5 7.6 0.0 12.5 56.4
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.89 42.25 -0.19 0.85
## qsmk 41.28 164.95 0.25 0.80
##
## Residual standard error: 18.605 on 1474 degrees of freedom
summary(tsls(wt82_71 ~ qsmk, ~ ifelse(price82 >= 1.7, 1, 0), data = nhefs.iv))
##
## 2SLS Estimates
##
## Model Formula: wt82_71 ~ qsmk
##
## Instruments: ~ifelse(price82 >= 1.7, 1, 0)
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -54.4 -13.4 -8.4 0.0 18.1 75.3
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.2 48.1 0.27 0.78
## qsmk -40.9 187.7 -0.22 0.83
##
## Residual standard error: 20.591 on 1474 degrees of freedom
summary(tsls(wt82_71 ~ qsmk, ~ ifelse(price82 >= 1.8, 1, 0), data = nhefs.iv))
##
## 2SLS Estimates
##
## Model Formula: wt82_71 ~ qsmk
##
## Instruments: ~ifelse(price82 >= 1.8, 1, 0)
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -49.4 -8.3 -3.4 0.0 7.3 60.5
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.09 7.29 1.11 0.27
## qsmk -21.10 28.43 -0.74 0.46
##
## Residual standard error: 13.019 on 1474 degrees of freedom
summary(tsls(wt82_71 ~ qsmk, ~ ifelse(price82 >= 1.9, 1, 0), data = nhefs.iv))
##
## 2SLS Estimates
##
## Model Formula: wt82_71 ~ qsmk
##
## Instruments: ~ifelse(price82 >= 1.9, 1, 0)
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -47.2 -6.3 -1.4 0.0 5.5 54.4
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.96 6.07 0.98 0.33
## qsmk -12.81 23.67 -0.54 0.59
##
## Residual standard error: 10.364 on 1474 degrees of freedom
######################################################################
16.5 再谈工具变量的三个条件
# Program 16.5
# Estimating the average causal using the standard IV estimator
# Conditional on baseline covariates
# Data from NHEFS
######################################################################
model2 <- tsls(wt82_71 ~ qsmk + sex + race + age + smokeintensity + smokeyrs +
as.factor(exercise) + as.factor(active) + wt71,
~ highprice + sex + race + age + smokeintensity + smokeyrs + as.factor(exercise) +
as.factor(active) + wt71, data = nhefs.iv)
summary(model2)
##
## 2SLS Estimates
##
## Model Formula: wt82_71 ~ qsmk + sex + race + age + smokeintensity + smokeyrs +
## as.factor(exercise) + as.factor(active) + wt71
##
## Instruments: ~highprice + sex + race + age + smokeintensity + smokeyrs + as.factor(exercise) +
## as.factor(active) + wt71
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -42.2 -4.3 -0.6 0.0 3.9 46.7
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.28033 2.33540 7.40 2.3e-13 ***
## qsmk -1.04229 29.98737 -0.03 0.972
## sex -1.64439 2.63083 -0.63 0.532
## race -0.18325 4.65039 -0.04 0.969
## age -0.16364 0.24055 -0.68 0.496
## smokeintensity 0.00577 0.14550 0.04 0.968
## smokeyrs 0.02584 0.16142 0.16 0.873
## as.factor(exercise)1 0.49875 2.17124 0.23 0.818
## as.factor(exercise)2 0.58183 2.18315 0.27 0.790
## as.factor(active)1 -1.17015 0.60747 -1.93 0.054 .
## as.factor(active)2 -0.51228 1.30845 -0.39 0.695
## wt71 -0.09795 0.03627 -2.70 0.007 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.716 on 1464 degrees of freedom
17 因果推断中的生存分析
17.1 危害和风险
##################################################################
# PROGRAM 17.1
# Nonparametric estimation of survival curves
# Data from NHEFS
##################################################################
library("readxl")
nhefs <- read.csv("/Users/milin/Downloads/nhefs.csv")
# some preprocessing of the data
nhefs$survtime <- ifelse(nhefs$death==0, 120,
(nhefs$yrdth-83)*12+nhefs$modth) # yrdth ranges from 83 to 92
table(nhefs$death, nhefs$qsmk)
##
## 0 1
## 0 985 326
## 1 216 102
summary(nhefs[which(nhefs$death==1),]$survtime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 35.0 61.0 61.1 86.8 120.0
#install.packages("survival")
#install.packages("ggplot2") # for plots
#install.packages("survminer") # for plots
library("survival")
library("ggplot2")
library("survminer")
## Loading required package: ggpubr
survdiff(Surv(survtime, death) ~ qsmk, data=nhefs)
## Call:
## survdiff(formula = Surv(survtime, death) ~ qsmk, data = nhefs)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## qsmk=0 1201 216 237.5 1.95 7.73
## qsmk=1 428 102 80.5 5.76 7.73
##
## Chisq= 7.7 on 1 degrees of freedom, p= 0.005
fit <- survfit(Surv(survtime, death) ~ qsmk, data=nhefs)
ggsurvplot(fit, data = nhefs, xlab="Months of follow-up",
ylab="Survival probability",
main="Product-Limit Survival Estimates", risk.table = TRUE)

##################################################################
17.2 从危害和风险
# PROGRAM 17.2
# Parametric estimation of survival curves via hazards model
# Data from NHEFS
##################################################################
# creation of person-month data
#install.packages("splitstackshape")
library("splitstackshape")
nhefs.surv <- expandRows(nhefs, "survtime", drop=F)
nhefs.surv$time <- sequence(rle(nhefs.surv$seqn)$lengths)-1
nhefs.surv$event <- ifelse(nhefs.surv$time==nhefs.surv$survtime-1 &
nhefs.surv$death==1, 1, 0)
nhefs.surv$timesq <- nhefs.surv$time^2
# fit of parametric hazards model
hazards.model <- glm(event==0 ~ qsmk + I(qsmk*time) + I(qsmk*timesq) +
time + timesq, family=binomial(), data=nhefs.surv)
summary(hazards.model)
##
## Call:
## glm(formula = event == 0 ~ qsmk + I(qsmk * time) + I(qsmk * timesq) +
## time + timesq, family = binomial(), data = nhefs.surv)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.725 0.055 0.060 0.063 0.078
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.00e+00 2.31e-01 30.29 <2e-16 ***
## qsmk -3.35e-01 3.97e-01 -0.85 0.40
## I(qsmk * time) -1.21e-02 1.50e-02 -0.80 0.42
## I(qsmk * timesq) 1.61e-04 1.25e-04 1.29 0.20
## time -1.96e-02 8.41e-03 -2.33 0.02 *
## timesq 1.26e-04 6.69e-05 1.88 0.06 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4655.3 on 176763 degrees of freedom
## Residual deviance: 4631.3 on 176758 degrees of freedom
## AIC: 4643
##
## Number of Fisher Scoring iterations: 9
# creation of dataset with all time points under each treatment level
qsmk0 <- data.frame(cbind(seq(0, 119),0,(seq(0, 119))^2))
qsmk1 <- data.frame(cbind(seq(0, 119),1,(seq(0, 119))^2))
colnames(qsmk0) <- c("time", "qsmk", "timesq")
colnames(qsmk1) <- c("time", "qsmk", "timesq")
# assignment of estimated (1-hazard) to each person-month */
qsmk0$p.noevent0 <- predict(hazards.model, qsmk0, type="response")
qsmk1$p.noevent1 <- predict(hazards.model, qsmk1, type="response")
# computation of survival for each person-month
qsmk0$surv0 <- cumprod(qsmk0$p.noevent0)
qsmk1$surv1 <- cumprod(qsmk1$p.noevent1)
# some data management to plot estimated survival curves
hazards.graph <- merge(qsmk0, qsmk1, by=c("time", "timesq"))
hazards.graph$survdiff <- hazards.graph$surv1-hazards.graph$surv0
# plot
ggplot(hazards.graph, aes(x=time, y=surv)) +
geom_line(aes(y = surv0, colour = "0")) +
geom_line(aes(y = surv1, colour = "1")) +
xlab("Months") +
scale_x_continuous(limits = c(0, 120), breaks=seq(0,120,12)) +
scale_y_continuous(limits=c(0.6, 1), breaks=seq(0.6, 1, 0.2)) +
ylab("Survival") +
ggtitle("Survival from hazards model") +
labs(colour="A:") +
theme_bw() +
theme(legend.position="bottom")

##################################################################
17.3 删失数据
# PROGRAM 17.3
# Estimation of survival curves via IP weighted hazards model
# Data from NHEFS
##################################################################
# estimation of denominator of ip weights
p.denom <- glm(qsmk ~ sex + race + age + I(age*age) + as.factor(education)
+ smokeintensity + I(smokeintensity*smokeintensity)
+ smokeyrs + I(smokeyrs*smokeyrs) + as.factor(exercise)
+ as.factor(active) + wt71 + I(wt71*wt71),
data=nhefs, family=binomial())
nhefs$pd.qsmk <- predict(p.denom, nhefs, type="response")
# estimation of numerator of ip weights
p.num <- glm(qsmk ~ 1, data=nhefs, family=binomial())
nhefs$pn.qsmk <- predict(p.num, nhefs, type="response")
# computation of estimated weights
nhefs$sw.a <- ifelse(nhefs$qsmk==1, nhefs$pn.qsmk/nhefs$pd.qsmk,
(1-nhefs$pn.qsmk)/(1-nhefs$pd.qsmk))
summary(nhefs$sw.a)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.33 0.86 0.95 1.00 1.08 4.21
# creation of person-month data
nhefs.ipw <- expandRows(nhefs, "survtime", drop=F)
nhefs.ipw$time <- sequence(rle(nhefs.ipw$seqn)$lengths)-1
nhefs.ipw$event <- ifelse(nhefs.ipw$time==nhefs.ipw$survtime-1 &
nhefs.ipw$death==1, 1, 0)
nhefs.ipw$timesq <- nhefs.ipw$time^2
# fit of weighted hazards model
ipw.model <- glm(event==0 ~ qsmk + I(qsmk*time) + I(qsmk*timesq) +
time + timesq, family=binomial(), weight=sw.a,
data=nhefs.ipw)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(ipw.model)
##
## Call:
## glm(formula = event == 0 ~ qsmk + I(qsmk * time) + I(qsmk * timesq) +
## time + timesq, family = binomial(), data = nhefs.ipw, weights = sw.a)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7.186 0.053 0.059 0.064 0.145
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.896996 0.220760 31.24 <2e-16 ***
## qsmk 0.179384 0.439873 0.41 0.683
## I(qsmk * time) -0.018946 0.016403 -1.16 0.248
## I(qsmk * timesq) 0.000210 0.000135 1.56 0.120
## time -0.018888 0.008053 -2.35 0.019 *
## timesq 0.000118 0.000064 1.85 0.065 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4643.9 on 176763 degrees of freedom
## Residual deviance: 4626.2 on 176758 degrees of freedom
## AIC: 4634
##
## Number of Fisher Scoring iterations: 9
# creation of survival curves
ipw.qsmk0 <- data.frame(cbind(seq(0, 119),0,(seq(0, 119))^2))
ipw.qsmk1 <- data.frame(cbind(seq(0, 119),1,(seq(0, 119))^2))
colnames(ipw.qsmk0) <- c("time", "qsmk", "timesq")
colnames(ipw.qsmk1) <- c("time", "qsmk", "timesq")
# assignment of estimated (1-hazard) to each person-month */
ipw.qsmk0$p.noevent0 <- predict(ipw.model, ipw.qsmk0, type="response")
ipw.qsmk1$p.noevent1 <- predict(ipw.model, ipw.qsmk1, type="response")
# computation of survival for each person-month
ipw.qsmk0$surv0 <- cumprod(ipw.qsmk0$p.noevent0)
ipw.qsmk1$surv1 <- cumprod(ipw.qsmk1$p.noevent1)
# some data management to plot estimated survival curves
ipw.graph <- merge(ipw.qsmk0, ipw.qsmk1, by=c("time", "timesq"))
ipw.graph$survdiff <- ipw.graph$surv1-ipw.graph$surv0
# plot
ggplot(ipw.graph, aes(x=time, y=surv)) +
geom_line(aes(y = surv0, colour = "0")) +
geom_line(aes(y = surv1, colour = "1")) +
xlab("Months") +
scale_x_continuous(limits = c(0, 120), breaks=seq(0,120,12)) +
scale_y_continuous(limits=c(0.6, 1), breaks=seq(0.6, 1, 0.2)) +
ylab("Survival") +
ggtitle("Survival from IP weighted hazards model") +
labs(colour="A:") +
theme_bw() +
theme(legend.position="bottom")

##################################################################
17.4 生存分析中的逆概率加权
# PROGRAM 17.4
# Estimating of survival curves via g-formula
# Data from NHEFS
##################################################################
# fit of hazards model with covariates
gf.model <- glm(event==0 ~ qsmk + I(qsmk*time) + I(qsmk*timesq)
+ time + timesq + sex + race + age + I(age*age)
+ as.factor(education) + smokeintensity
+ I(smokeintensity*smokeintensity) + smkintensity82_71
+ smokeyrs + I(smokeyrs*smokeyrs) + as.factor(exercise)
+ as.factor(active) + wt71 + I(wt71*wt71),
data=nhefs.surv, family=binomial())
summary(gf.model)
##
## Call:
## glm(formula = event == 0 ~ qsmk + I(qsmk * time) + I(qsmk * timesq) +
## time + timesq + sex + race + age + I(age * age) + as.factor(education) +
## smokeintensity + I(smokeintensity * smokeintensity) + smkintensity82_71 +
## smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) +
## as.factor(active) + wt71 + I(wt71 * wt71), family = binomial(),
## data = nhefs.surv)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.316 0.024 0.039 0.064 0.330
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.27e+00 1.38e+00 6.72 1.8e-11 ***
## qsmk 5.96e-02 4.15e-01 0.14 0.88592
## I(qsmk * time) -1.49e-02 1.51e-02 -0.99 0.32382
## I(qsmk * timesq) 1.70e-04 1.24e-04 1.37 0.17164
## time -2.27e-02 8.44e-03 -2.69 0.00714 **
## timesq 1.17e-04 6.71e-05 1.75 0.08002 .
## sex 4.37e-01 1.41e-01 3.10 0.00193 **
## race -5.24e-02 1.73e-01 -0.30 0.76257
## age -8.75e-02 5.91e-02 -1.48 0.13854
## I(age * age) 8.13e-05 5.47e-04 0.15 0.88186
## as.factor(education)2 1.40e-01 1.57e-01 0.89 0.37098
## as.factor(education)3 4.33e-01 1.53e-01 2.84 0.00450 **
## as.factor(education)4 2.35e-01 2.79e-01 0.84 0.39975
## as.factor(education)5 3.75e-01 2.39e-01 1.57 0.11612
## smokeintensity -1.63e-03 1.43e-02 -0.11 0.90943
## I(smokeintensity * smokeintensity) -7.18e-05 2.39e-04 -0.30 0.76374
## smkintensity82_71 -1.69e-03 6.50e-03 -0.26 0.79540
## smokeyrs -1.68e-02 3.06e-02 -0.55 0.58415
## I(smokeyrs * smokeyrs) -5.28e-05 4.24e-04 -0.12 0.90100
## as.factor(exercise)1 1.47e-01 1.79e-01 0.82 0.41230
## as.factor(exercise)2 -1.50e-01 1.76e-01 -0.85 0.39318
## as.factor(active)1 -1.60e-01 1.30e-01 -1.23 0.21805
## as.factor(active)2 -2.29e-01 1.88e-01 -1.22 0.22177
## wt71 6.22e-02 1.90e-02 3.27 0.00107 **
## I(wt71 * wt71) -4.05e-04 1.13e-04 -3.58 0.00034 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4655.3 on 176763 degrees of freedom
## Residual deviance: 4185.7 on 176739 degrees of freedom
## AIC: 4236
##
## Number of Fisher Scoring iterations: 10
# creation of dataset with all time points for
# each individual under each treatment level
gf.qsmk0 <- expandRows(nhefs, count=120, count.is.col=F)
gf.qsmk0$time <- rep(seq(0, 119), nrow(nhefs))
gf.qsmk0$timesq <- gf.qsmk0$time^2
gf.qsmk0$qsmk <- 0
gf.qsmk1 <- gf.qsmk0
gf.qsmk1$qsmk <- 1
gf.qsmk0$p.noevent0 <- predict(gf.model, gf.qsmk0, type="response")
gf.qsmk1$p.noevent1 <- predict(gf.model, gf.qsmk1, type="response")
#install.packages("dplyr")
library("dplyr")
gf.qsmk0.surv <- gf.qsmk0 %>% group_by(seqn) %>% mutate(surv0 = cumprod(p.noevent0))
gf.qsmk1.surv <- gf.qsmk1 %>% group_by(seqn) %>% mutate(surv1 = cumprod(p.noevent1))
gf.surv0 <- aggregate(gf.qsmk0.surv, by=list(gf.qsmk0.surv$time), FUN=mean)[c("qsmk", "time", "surv0")]
gf.surv1 <- aggregate(gf.qsmk1.surv, by=list(gf.qsmk1.surv$time), FUN=mean)[c("qsmk", "time", "surv1")]
gf.graph <- merge(gf.surv0, gf.surv1, by=c("time"))
gf.graph$survdiff <- gf.graph$surv1-gf.graph$surv0
# plot
ggplot(gf.graph, aes(x=time, y=surv)) +
geom_line(aes(y = surv0, colour = "0")) +
geom_line(aes(y = surv1, colour = "1")) +
xlab("Months") +
scale_x_continuous(limits = c(0, 120), breaks=seq(0,120,12)) +
scale_y_continuous(limits=c(0.6, 1), breaks=seq(0.6, 1, 0.2)) +
ylab("Survival") +
ggtitle("Survival from g-formula") +
labs(colour="A:") +
theme_bw() +
theme(legend.position="bottom")

##################################################################
17.5 生存分析中的G公式
# PROGRAM 17.5
# Estimating of median survival time ratio via a structural nested AFT model
# Data from NHEFS
##################################################################
# some preprocessing of the data
nhefs <- read.csv("/Users/milin/Downloads/nhefs.csv")
nhefs$survtime <- ifelse(nhefs$death==0, NA, (nhefs$yrdth-83)*12+nhefs$modth) # * yrdth ranges from 83 to 92
# model to estimate E[A|L]
modelA <- glm(qsmk ~ sex + race + age + I(age*age)
+ as.factor(education) + smokeintensity
+ I(smokeintensity*smokeintensity) + smokeyrs
+ I(smokeyrs*smokeyrs) + as.factor(exercise)
+ as.factor(active) + wt71 + I(wt71*wt71),
data=nhefs, family=binomial())
nhefs$p.qsmk <- predict(modelA, nhefs, type="response")
d <- nhefs[!is.na(nhefs$survtime),] # select only those with observed death time
n <- nrow(d)
# define the estimating function that needs to be minimized
sumeef <- function(psi){
# creation of delta indicator
if (psi>=0){
delta <- ifelse(d$qsmk==0 |
(d$qsmk==1 & psi <= log(120/d$survtime)),
1, 0)
} else if (psi < 0) {
delta <- ifelse(d$qsmk==1 |
(d$qsmk==0 & psi > log(d$survtime/120)), 1, 0)
}
smat <- delta*(d$qsmk-d$p.qsmk)
sval <- sum(smat, na.rm=T)
save <- sval/n
smat <- smat - rep(save, n)
# covariance
sigma <- t(smat) %*% smat
if (sigma == 0){
sigma <- 1e-16
}
estimeq <- sval*solve(sigma)*t(sval)
return(estimeq)
}
res <- optimize(sumeef, interval = c(-0.2,0.2))
psi1 <- res$minimum
objfunc <- as.numeric(res$objective)
# Use simple bisection method to find estimates of lower and upper 95% confidence bounds
increm <- 0.1
for_conf <- function(x){
return(sumeef(x) - 3.84)
}
if (objfunc < 3.84){
# Find estimate of where sumeef(x) > 3.84
# Lower bound of 95% CI
psilow <- psi1
testlow <- objfunc
countlow <- 0
while (testlow < 3.84 & countlow < 100){
psilow <- psilow - increm
testlow <- sumeef(psilow)
countlow <- countlow + 1
}
# Upper bound of 95% CI
psihigh <- psi1
testhigh <- objfunc
counthigh <- 0
while (testhigh < 3.84 & counthigh < 100){
psihigh <- psihigh + increm
testhigh <- sumeef(psihigh)
counthigh <- counthigh + 1
}
# Better estimate using bisection method
if ((testhigh > 3.84) & (testlow > 3.84)){
# Bisection method
left <- psi1
fleft <- objfunc - 3.84
right <- psihigh
fright <- testhigh - 3.84
middle <- (left + right) / 2
fmiddle <- for_conf(middle)
count <- 0
diff <- right - left
while (!(abs(fmiddle) < 0.0001 | diff < 0.0001 | count > 100)){
test <- fmiddle * fleft
if (test < 0){
right <- middle
fright <- fmiddle
} else {
left <- middle
fleft <- fmiddle
}
middle <- (left + right) / 2
fmiddle <- for_conf(middle)
count <- count + 1
diff <- right - left
}
psi_high <- middle
objfunc_high <- fmiddle + 3.84
# lower bound of 95% CI
left <- psilow
fleft <- testlow - 3.84
right <- psi1
fright <- objfunc - 3.84
middle <- (left + right) / 2
fmiddle <- for_conf(middle)
count <- 0
diff <- right - left
while(!(abs(fmiddle) < 0.0001 | diff < 0.0001 | count > 100)){
test <- fmiddle * fleft
if (test < 0){
right <- middle
fright <- fmiddle
} else {
left <- middle
fleft <- fmiddle
}
middle <- (left + right) / 2
fmiddle <- for_conf(middle)
diff <- right - left
count <- count + 1
}
psi_low <- middle
objfunc_low <- fmiddle + 3.84
psi <- psi1
}
}
c(psi, psi_low, psi_high)
## [1] -0.0504 -0.2231 0.3331