reference books

causal inference in statistics
fundamentals of causal inference with R
causal inference -what if

content

因果效应的定义
随机试验 Randomized experiments
观察性研究 Observational studies
效应修饰 Effect modification
交互作用 Interaction
因果效应的图像表达 Graphical representation of causal effects
混杂 Confounding
选择偏移 Selection bias
测量偏移 Measurement bias
随机性偏移 Random variability
统计模型
逆概率加权和边缘结构模型 IP weighting and marginal structural models
标准化和参数G-公式 Standardization and the parametric g-formula
G-估算和结构嵌入模型 G-estimation of structural nested models
结局回归和倾向性评分 Outcome regression and propensity scores
工具变量估计 Instrumental variable estimation
因果推断中的生存分析
因果推断中的变量选择
时异治疗 Time-varying treatments
治疗-混杂反馈 Treatment-confounder feedback
时异治疗的G-方法 G-methods for time-varying treatments
靶标实验 Target trial emulation
补充：中介效应分析 mediation analysis

代码链接：https://remlapmot.github.io/cibookex-r/outcome-regression-and-propensity-scores.html#program-15.4

1 因果效应的定义

E(Ya1=1)不等于E(Ya0=1),则称人群中A对于Y有因果效应，

1.1 个体的因果效应

我在野外，感觉非常饿，这个时候看到一个挺漂亮的蘑菇，然后我吃了，接着我中毒而亡。这个时候我们会说蘑菇导致了我的死亡，很明显这是一个因果关系，因为我吃了蘑菇，然后我死了。

这个时候我们脑海中就会想象，如果我们吃，我就不会死。

通过这个例子我们可以来说明我们是如何推理个体的因果效应的：如果一件事情A，我们做了或者没有做，会产生两个结局，如果这两个结局不一样，那么我们认为时间A对于结果有因果效应。如果结局一样，则认为时间A对于结局没有因果效应。通常而言时间A可能是某种干预，或者某种措施，亦或者某种治疗，等等。

为了将这个说法数学话，我们用一些简单的符号，A表示某种措施，他的取值是0和1，0表示不采取措施，1表示采取措施。Y表示结局，也有两个值0和1 ，例如1表示死亡，0表示存活。我们记Ya1 表示当A为1的时候Y的值，同理Ya0表示a等于0的时候Y的值。

这个时候个体的因果效应（causal effect for an individ- ual）可以定义为：如果Ya1=Ya0，那么A对于Y没有因果效应，反之则有。

通常而言Ya0和Ya1被称为值潜在结局 potential outcomes（我们一般只能观察到某一种结局）或者反事实结局counterfactual outcomes（一些结局可能从来不会出现，有反事实）。

个体的因果效应定义为不同反事实结果的对比，但是，每一个人的结局只有一个，其他反事实解决无法被观察到。所以这里的结论是，因为数据的缺失个体的因果效应是不可被识别的。

1.2 因果效应的均值

首先来定义因果效应的均值。假设我们有这样一份数据。

Ya0 <- c(0,1,0,0,0,1,0,0,1,1,0,1,1,0,0,0,1,1,1,1)
Ya1 <- c(1,0,0,0,0,0,0,1,1,0,1,1,1,1,1,1,1,0,0,0)

这个数据集表示了20个人在采取某种措施后的不同反事实结局（我们先假设我们能搜集到这样的数据）。从结果中可以看到，如果采取措施，有10个人的Y=1，概率为0.5 。如果不采取措施，那么同样有10个人Y=1，概率为0.5 。

在这里我们定义因果效应的均值：在人群中，如果P（Ya1=1）！=P（Ya0=1），那么时间A就对结果Y有因果效应。因果效应的均值为E（Ya1-Ya0）根据这个定义，上面那个例子不具有因果效应，且因果效应的均值为0。

（下文将因果效的均值简称为因果效应，因果效应均值的0假设，简称为零因果假设）

1.3 因果效应的度量以及偏差

E(Ya1)-E(Ya0)
E(Ya1)/E(Ya0)

我们需要注意，我们通常而言是通过样本去估计总体。抽样带来的误差是随机的，在大数定律成立的前提下，抽样的统计量是关于总体统计量的一个一致估计，也就是说，样本越多，抽样的结果越接近真实的结果。

因为我们只是接近整体的结果，所以我们不能完全肯定的下结论，需要通过一些统计的方法来判断。

除了抽样带来的误差之外，还可能存在另外一个问题导致误差，个体的反事实结局可能不是固定的，例如我吃了蘑菇，我有90% 的概率会死，不吃有10%概率会死，这种情况我们不管搜集多少数据，结局都是不可预测的，这种情况称之为不确定性反事实（nondeterministic counterfactuals）。

我们先忽略抽样的随机性以及假设解决都是确定的。

1.4 其他

干扰：个体之间的结局不会相互干扰。（教育，传染病等领域常常会出现干扰）

不同形式的治疗对于结果是无影响的，那么反事实结局是良定的。（很多时候，A在事实上不是唯一的，例如动手乎，我们无法保证手术的每一个过程都是一样的，可能医生不一样，护士状态不好）

2 随机试验

我们想象一个行为艺术，你在一条商业街抬头看向天空，时候会吸引其他人和你一样看向天空。也就是说，我们想要知道，在特定的人群中，我们想要知道，一个行为是否能够影响一个结局。

假设我们想要用科学的方法研究这个问题，我们可以当每一个人靠近，我们就抛硬币，如果正面朝上，我们就看向天空，反之看前方。重复上千次。然后比较我抬头看的时候，其他人也抬头看的次数和我看前方的时候，其他人抬头看次的次数。如果二者有显著的差异，则可以说我抬头看向天空对于他人是否抬头看有显著差异。

在这里我们做了一个不严谨的随机试验。在随机试验中，我们是否采取行动是通过一个随机机制（抛硬币）决定的。随机很重要，如果行动是基于某种规则，那么最后的结果无法确定是因为规则还是因为行动。

2.1随机

在现实中，我们无法得到个体的所有反事实结局，因此无法直接计算因果效应。但是有了随机实验，我们可以得到真是因果效应的一致性估计。

在真实世界中，我们只能得到类似这样的数据：

我们假设有足够多的人，对每一个人我们抛硬币，正面表示将这个分分到A组，反之分到B组。然后我们给A组的人某种措施(A=1)，B组给安慰剂(A=0)。实验结束后，我们计算每一组的死亡率。假设A组的死亡率为p(Y=1|A=1)=0.3,B组的死亡率是p(Y=1|A=0)=0.6。相关性风险比是0.3/0.6=0.5 因为我们的实验是随机的，那意味着，如果们将B组的人采取某种措施，A采取安慰剂，最终的相关性风险比应该是不变的（注意，这里说的相关性）。我们可以称A组和B组是可互换的（exchangeable）。

因此对于A组而言,p(Y=1|A=0)就等于B组的p(Y=1|A=0),因此对于一个完美的随机试验而言，相关性就是因果性

2.2 条件随机

我们看一组数据，如图所示：

A表示治疗，Y表示结局，L表示是否重症。我们考虑两种实验，第一种就是将人随机分配到治疗组和对照组，这种实验叫做边缘随机实验（marginal randomized experiment），第二种则，则是分别随机从重症人群和轻症人数中抽取一定人数，这种实验叫做条件随机实验（conditional randomized experiment ），因为这里我们的分组多了一个因素，L。

边缘随机试验能够保证互换性，而条件随机试验则不能。

事实上，条件随机实验可以看做两个边缘随机实验的组合，我们首先考虑重症的实验，在这个实验中，治疗组和对照组都是重症病人，两个组的反事实分享应该是相等的，也就是： p(Ya=1|A=1,L=1)=p(Ya=1|A=0.L=1),也就是当L确定的时候，治疗组和对照组是可以互换的。也就是说，虽然条件随机实验无法保证互换性，但是在条件确定的情况下，可以保证互换性，这种互换性叫做条件互换性（conditional exchangeability）。

在边缘随机试验中，因果性就是香港过关行，因此： p(Ya1=1)/p(Ya0=1)=P(Y=1|A=1)/p(Y=1|A=0).

而对于条件随机试验，我们通常有两种方式计算因果效应，第一种是在子集中计算因果效应，因为每一个子集都是边缘随机实验，虽然每一个子集的因果效应可能是不相同的，这种方法叫做分层分析。

2.3 标准化和逆概率加权。

标准化的思路很简单，假设p(Y=1|l=0,A=1)=1/4,p(Y=1|l=1,A=1)=2/3,l=0的人数占比是40%，l=1的人数占比是60%，那么加权之后的结果:

p(Ya1=1)=1/4*40%+2/3*60%

同理可以计算出p(Ya0=1).

总而言之，边缘反事实风险p（Ya1）是每一层反事实风险p（Ya1｜L=0）和p（Ya=1｜L=1）的加权平均

然后我们来看逆概率加权，首先看如下的图。

因果性风险比的分母Pr[Ya0=1]是整个人群都无治疗时的反事实死亡风险。现在让我们来计算这个风险。在图中，8 个轻症( L = 0 )被试中有 4 个在非治疗组，他们中有 1 个死亡。如果这 8 个人都在非治疗组，会有多少人死亡?2 个。因为总人数翻倍，因果性风险比不会改变，所以死亡人数也会翻倍。同理，在图中，12 个重症( L = 1 )被试中有 3 个在非治疗组，他们中有 2 个死亡。如果所有重症被试都在非治疗组，将会有 8 个死亡。因而，如果整个人群中的 20 个人都在非治疗组，会有 2 + 8 = 10 个人死亡。所以因果性风险比的分母 Pr[Ya0=1]=10/20 0.5=。图2.2的第一个树状图显示了所有人没有被治疗时的情况。当然，以上计算都需要在 L = 0 和 L = 1 时互换性成立，也即在每一层中，治疗组在无治疗时的反事实死亡风险，和非治疗组观测到的风险一样。

让我们来思考一下为什么这个方法是正确的。图中的两个树状图是当整个人群中的所有人都接受治疗或未接受治疗时的假想人群。这两个假想人群在有界互换性成立的情况下会是正确的。这两个假想人群可以合在一起创造了一个更大的假想人群，其中的每个个体都同时接受和未接受治疗。这个假想人群是原人群人数的两倍，我们称之为虚拟人群(pseudo-population)

上述方法被称作逆概率加权(inverse probability weighting，逆概率也简写作 IP)

（这种方法本质就是，如果人全部在治疗/非治疗组，那会有多少人，并根据这个假设来进行计算整体的因果效应）

L=0中的4个非治疗组被用来对应虚拟人群中的8个人，也就是说，这四个人的权重是2，也就是等于1/0.5 。而0.5指的是L=0时分配到非治疗组的的概率。同理L=1的0个治疗组被用来模拟虚拟人群中的12个人，每个人的权重是1.33=1/0.75 ，1.33也等价于12/9。

权重也称为逆概率权重：WA = 1/f（A｜L）

标准化和逆概率加权在数学上是等价的，逆概率加权用的是给定写变量L后接受治疗A的概率；标准化则使用协变量L的概率和给定A和L后结局Y 的概率。

知道了逆概率权重之后，就可以直接计算结果了，例如p(Ya1=1) = (12+61.333)/(20)

3 观察性研究

让我们再思考一下上一章开篇提出的问题:“一个人抬头向天上看会影响其他行人也抬头向天上看吗?”上一章我们讨论了随机试验，但你觉得要你抬头看那么多次实在是太累了，并且对你的颈椎也不好。于是你决定采取一种新的研究方法:先找到一个站着但没有抬头向上看的行人，再找一个向第一个人走去但也没有抬头向上看的人，然后观察这两人在接下来 10 秒内的行为。如此这般重复上千次。然后你比较第一个人抬头看之后第二个人也抬头看的人数比例，与第一个人未抬头看但第二个人却也抬头看的人数比例。在这个研究中，研究者观察并记录相关的数据，这样的研究被称作观察性研究。

不过批评者们会说，这两个行人都抬头向上看并不是因为第一个人影响了第二个人，而可能是因为他俩都同时听到了天空中的惊雷声，因此这一研究不足以证明第一个人抬头看会影响第二个人也抬头看。这一批评不适用于随机试验，这也是为什么随机试验对于因果推断理论非常重要的原因。然而，在实践中，用随机试验来估计因果效应经常受到一定的限制。许多科学研究都不是试验。许多人类知识都来源于观察性的研究，比如进化论、板块构造论、全球变暖、甚至还有天文学

3.1 可识别性

一个理想的随机试验能够用来识别并量化因果效应均值，这是因为对治疗组的随机分配保证了互换性。与之相比，观察性研究就没有那么有说服力，主要原因是因为治疗组的分配不是随机的。

虽然随机试验对因果推断有根本优势，但我们大多数时候只能用观察性研究来回答因果性的问题。

观察性研究中因果推断的核心，是我们能将观察性研究视作条件随机试验。

如果以下条件成立，则我们可以将观察性研究视为条件随机试验:

我们要比较的治疗方式，是良定的，并且能与我们数据中的治疗形式相对应。
治疗取值的条件概率，虽然不由研究者决定，但是只由已有的协变量L决定。
在控制了L的基础上，任一治疗取值的概率都应该大于0。

第一个条件就是第一章所说的一致性，第二个条件就是第二章所说的互换性，点三个是指对满足Pr[L=l] 0≠的l，有f[a|l]>0。条件随机试验能够保证正数性(positivity)成立。

我们将会看到这三个条件在大多数情况下很难达成，这也解释了为什么人们总是对观测性研究中的因果推断持怀疑态度。然而，如果观察性研究能和条件随机试验能够类比，那我们就可以用上一章介绍的方法——逆概率加权和标准化——来计算观察性研究中的因果效应。在此，我们将这三个条件称作可识别性条件或可识别性假设。

更重要的是，在一个理想的随机试验中，试验设计保证了可识别性的三个条件成立。但显然，这三个条件不可能时时成立。如果可识别性条件有一条不成立，那就不能将观察性研究比作条件随机试验。在这种情况下，也还有其他可能的方法分析观察性研究的数据做出因果推断，不过这些方法可能需要其他不同的可识别性假设。

3.2 互换性

研究人员可以用他们的专业知识来提高他们研究中有界互换性假设的可信度。他们可以收集许多相关的变量 L (比如，既是结局独立预测因素又是治疗决定因素的变量)的数据，然后就能假设只要控制了这些变量，有界互换性就能近似成立。遗憾的是，不管 L 中有多少个变量，我们都不能检验这个假设是否成立，也就使得观察性研究的因果推断不是一件万无一失的事。其因果推断的有效性依赖于研究人员的专业知识。这些专业知识，表现在通过控制已知的变量从而确保互换性，将和数据一起共同确认观察性研究中的因果效应。

3.3 正数性

一些研究人员想开展一项试验来计算心脏移植 A 对 5 年死亡率 Y 的平均效应。在这里，不言自明的是，研究人员会将一些被试分配到治疗组 A = 1 ，而将另一些被试分配到非治疗组 A = 0 。没有正经的研究人员会将所有被试都被分配到同一组， A = 1 或者 A = 0 。如果所有的被试都接受同一种治疗，那我们就不可能计算因果效应。因而，基本可以肯定的是，我们每一个治疗组中，都会有一定数量的被试。换句话说，我们需要确保每一种治疗的概率大于 0。这一性质被称为正数性。

在试验中我们不需要强调太多正数性，因为试验设计肯定能保证正数性成立。在观察性研究中，正数性和互换性都不能得到保证。

3.4 一致性与反事实结局

一致性意味着我们观测到的治疗组每个被试的结局，等于这个被试如果接受了治疗时的反事实结局;我们观测到的非治疗组每个被试的结局，等于这个被试如果未接受治疗时的反事实结局。即对于A=a，有Ya =Y。这个表述似乎是显然的，一些读者可能会不禁发问:在什么情况下一致性才会不成立?毕竟，如果我服用了阿司匹林( A =1)，然后我死了(Y =1)，不是就说明我的反事实结局 Ya1 就是 1?因而许多人认为一致性非常简单。但这种表明上的“简单”具有很强的欺骗性。让我们将一致性分成两个部分:(1)通过 a 定义的反事实结局 Y a ，和(2)反事实结局与我们观测到的结局的联系。这一小节我们将讨论第一个部分。

(Robins 和 Greenland(2000)认为良定的反事实情境，或其他数学上等价的概念是因果推断的必要条件)

让我们回到心脏移植 A 对 5 年死亡率 Y 的因果效应的随机试验。在招募被试前，研究人员在研究方案中详细描述了两种治疗方式:心脏移植( A = 1 )和药物治疗( A = 0 )。比如，研究人员详细描述了心脏移植组( A = 1 )被试的术前准备、麻醉、手术方式、术后护理以及免疫抑制方法。如果这个方案没有详细描述这些过程，那么每个医生可能会采用不一样的方式进行移植手术或术后的免疫抑制。

如果同一治疗的不同形式有不一样的因果效应，那问题就会随之产生。比如，在我们心脏移植的试验中，如果手术是由传统方式进行，那其因果效应势必会和采用新式手术方法有所不同。因此，当我们说“心脏移植对死亡率的因果效应”时，我们需要指明治疗 A 的具体形式 a 。

这里很好理解，字面上的‘措施’的含义可能和实际上措施的含义不一致。

在观察性研究中，研究者只能尽可能精确地去定义研究中的 a 。虽然这对医疗干预措施，比如心脏移植而言相对直接明了，但是对于现实世界中的其他干预措施而言则不是那么直截了当。x

劣定的反事实结局会让我们的因果性问题变得空洞。如果我们想研究运动的影响，我们就需要认真定义运动时长、强度、形式等等等。

但如果有人问，“你怎么知道我们的定义是足够良定的呢?”很遗憾，答案是，我们不知道。判断一个定义是否良定，需要研究者用自己的专业知识进行判断。这一判断应在研究人员中取得共识。我们现在认为跑步的方向不会对健康造成影响，但未来的研究可能会认为我们错了，比如未来发现，人体跑步向左倾斜时对健康不利。在历史中的每一个时间点，撰写研究方案的研究者只能用他当时能有的所有知识来尽可能消除模糊保证精确。然而，在所有关乎因果的问题中，模糊总是存在的。一个因果性问题的模糊程度能够用更详尽的叙述减少，但是不能完全消除。在观察性研究中，涉及到生物(比如体重或者胆固醇)或者社会(比如社会经济地位)因素的问题时，模糊程度都会格外地高。

以上讨论阐述了因果推断的一个基本特点:因果性问题的清晰程度取决于我们的专业知识和我们所拥有的信息。我们现今觉得有意义的因果性问题，可能在未来发现了能影响结局的更具体因素之后，就变得模糊而无意义。我们再一次强调，良定的治疗方式，其定义取决于大多数研究者的共识，而这会随时间改变。

现在，读者可能已经意识到，在清晰定义一个治疗方式的过程中，我们可能会改变最开始的问题。我们最开始只是在讨论一项关于肥胖效果的研究，但最终我们居然讨论起运动的方式。我们越是想对我们的研究提供一个良定的因果性阐释，我们离最开始的问题就越远。但这不是一件坏事:不断地完善我们的问题，直到它在现有知识下不再模糊不明，是因果推断的基本要素之一。我们将治疗方式定义得越明确，我们就越能避免将来研究者互相交流时的歧义和误解，尤其是当不同的研究得到的数值估计不一样的时候。

3.5 一致性：反事实世界与观测数据

总而言之，劣定的肥胖会使我们对因果效应估计值的阐释模糊不清(上一小节)，但是足够良定的肥胖在我们的现实数据中又不存在的话，也会使我们的因果推断不清不楚。我们需要详细描述在一个人群中存在的实际治疗的各种特征，将其与我们想研究的“治疗”匹配起来。这项任务可能会在试验研究中相对简单直接，但在一些观察性研究中，尤其是涉及生物和社会因素的观察性研究中，这项任务可能会非常困难，甚至不可能。

3.6 靶标实验（The target trial）

在整本书中，因果效应一词指两种不同治疗取值下的反事实结局的比较。因此，对每一种因果效应，我们都能用一个假想的随机试验去量化它。这个假想的随机试验，我们称之为靶标试验 (target trial)。然而真正开展一项靶标试验是不现实的，我们只能用观察性研究中得到的数据进行因果推断。换句话说，我们将观察性研究中的因果推断过程，视作在模拟一项靶标试验。如果这种模拟是成功的，那我们从观察性研究中得到的数值估计，就和从靶标试验中(如果我们能实际进行的话)得到的没有差别。

靶标试验，或者其他同义词，是因果推断框架的核心之一。Dorn(1953)，Cochran (1972)，Rubin(1974)，Feinstein(1971)和 Dawid(1986)等人都使用过这个概念。 Robins(1986)将这一概念应用于时异性治疗中

因此，“你想模拟什么样的随机试验”，是观察性研究中因果推断的核心问题之一。对每一种我们希望在观察性研究中计算的因果效应，我们需要描述:(1)靶标试验是什么;(2)怎么用观察性数据模拟靶标试验。

我们可以通过详细说明靶标试验(假想的)研究计划中的重点要素，来具体化这项靶标试验。这些因素包括:入组标准、干预(或治疗)策略、结局定义、随访方式、效果比较以及数据分析。

将观察性研究中的因果推断类比成一项靶标试验这一概念并不被所有人认可。一些研究者认为“ A 对 Y 的因果效应”这个定义，不管 A 和 Y 是什么(只要 A 在 Y 之前)，就已经是良定的了。

有些人认为：我们不必去知道一项观察性研究中某个因果效应的具体估计值，我们只需要知道这个因果效应是否存在就足够了。肥胖和死亡率之间的强相关性表明某些对肥胖的干预措施能降低死亡风险。因此，我们能得到的有价值信息是，如果我们对所有肥胖者进行强制干预，把他们的体重控制在正常范围，那我们就能预防许多不必要的死亡。

这一观点非常具有吸引力，但也很危险。接受这一观点会带来一个问题:劣定的治疗形式会阻碍观察性研究中对互换性和正数性的思考。

当一项观察性研究的数据不能用来模拟靶标试验的时候，是不是也就意味着一切都白费了?不完全是这样。这种情况下，观测所得的数据，依然能用来做非因果性的预测。我们知道肥胖的人群有更高的死亡率，因此肥胖是死亡率的一个预测因素，换句话说，肥胖和死亡相关。这将有助于我们辨认什么样的人有较高的死亡率。不过就算我们知道肥胖和死亡相关，我们却并不知道肥胖对死亡有没有因果性的影响:肥胖和死亡的关系，可能就和携带打火机与肺癌的关系一样。因此，肥胖和死亡率之间的相关性会进一步催生出种种假设与研究去探寻背后的机制，但其证据并不足以推荐所有人都去减肥。

当我们从因果性问题回退到预测性问题时，我们就回避了许多随机试验不能回答的问题。但另一方面，因果推断是我们的终极目标，仅仅预测则不能满足我们的希望。

4 修饰效应

我们迄今的讨论都是围绕整个人群中的因果效应。然而，有一些问题却只关乎一部分人而非所有人。再考虑一下这个因果性的问题:“一个人抬头向天上看会影响其他行人也抬头向天上看吗?”也许你觉得更有意义的做法，是把行人分成本市居民和外来游客，然后分别在这两个群体中衡量“抬头看”这个行为的因果效应。是在整个人群中计算因果效应，还是在一个子群体中进行计算，取决于我们的研究目标。在某些情况下，你并不关心不同群体中的效果差异。比如，你是一名政策制定者，正在考虑在全国范围内推行氟化水项目。因为这一项目会影响到全国的所有家庭，所以你的主要关注点在整个人群中的效果，而不是某一特定群体中的效果。而当你的干预措施是针对某些特定群体时，你的关注点可能就是这项干预措施在整个人群不同子群体中的差异。

整个人群中的因果效应均值为零，在这个人群的子群体中，因果效应均值不一定为零。

我们可以给效应修饰下一个定义了。如果 A对Y 的因果效应均值在V 的不同取值下不尽相同，则我们称V 是 A对Y 因果效应的修饰因子。

4.1 通过分层识别修饰效应

分层分析是识别效应修饰的最直接方式。为了识别V 是否修饰 A对Y 的效应，我们可以计算 V 的不同取值(每一取值是一个分层)下 A 对 Y 的因果效应。

假设，在我们的研究人群中，国籍V 会修饰治疗 A对死亡Y 的因果效应。然而，对修饰作用背后的机理，我们却没有做任何解释。事实上，国籍可能不是修饰作用的根本原因，只是这个根本原因的一个标志物。比如，希腊的心外科手术技术要强于罗马，因而研究者就会在数据中发现国籍修饰了治疗的效应。也因此，一项提高罗马心外科手术技术的干预措施也许就能消除国籍的修饰作用。当这种情况发生的时候，为了强调两者之间的区别，我们会将国籍称作修饰因子替代物，将心外科手术技术称作因果性修饰因子。

4.2 为什么要关注效应修饰

第一点，如果V 修饰了治疗 A对结局Y 的效应，那治疗的因果效应均值就会根据人群中V 的出现频率而变化。如果在一个人群中计算得到的因果效应能被外推到另一个人群，我们会说这个因果推断在不同人群中具有可移植性。因果效应的可移植性是一个不能被验证的假设，需要具体问题具体分析。比如，一些研究者认为给尼日尔每户家庭额外 100 美元带来的健康效应收益，(无论在加法尺度上还是乘法尺度上)都不能移植到荷兰。

第二点，检验效应修饰存在与否，有助于确定能从干预措施中获益最多的群体。例如治疗A对Y的因果效应均值为零。然而治疗A在男性(V =0)中是有益的，但在女性(V =1)中是有害的。如果医生知道性别有质的效应修饰作用，在没有其他额外信息的情况下，他只会给男性进行治疗。

第三点，也是最后一点，识别效应修饰有助于我们了解背后的生物学、社会学或其他方面的具体机制。比如，如果我们知道有没有做过包皮手术对 HIV 感染有修饰作用，这将有助于我们了解这个病的传染机理。在我们描述两个变量的交互作用之前，我们要先识别效应修饰。

通过匹配调整变量：匹配的目标是在人群中构建一个子群体，变量L的分布在这个子群体的治疗组和非治疗组相同。不会计算某些分层中的因果效应成为限制。

根据因果效应的不同类型，这些方法分为两类：1标准化和逆概率加权用来计算边缘和条件效应。2分层分析，限制和匹配只是计算人群子群体中的条件效应。

4.3 总结

只有人群特征清晰的群体才能用于有意义的因果推断。这两个条件在满足一系列前提标准的试验人群中会自动满足。然而，这两个条件在观察性研究中却不会自动满足。在观察性研究中，研究者需要尽可能清晰定义我们关心的因果效应，以及用来计算这个效应的人群。否则，不同方法得到的不同效应估计会产生极大的误解。

4.4 可移植性

我们从一个人群中得到的因果效应，大多时候会被用于对其他人群(称之为目标人群)的决策证据。假设我们在互换性、正数性以及一致性的假设下正确得到了我们研究人群中的因果效应均值，那在目标人群中，这个治疗的效应会是一样的吗?换句话说，我们能把研究人群中的治疗效应移植到目标人群中吗?这个问题的回答取决于两个人群的人群特征。具体而言，一个因果效应从一个人群到另一个人群的可移植性，取决于这两个人群的人群特征是否相似。

• 效应修饰:效应修饰因子会影响治疗的因果效应。如果两个人群的效应修饰因子分布不同，那同一治疗在两个人群中的因果效应均值也会不同。 • 治疗形式:治疗的因果效应也取决于其不同形式在两个人群中的分布。如果分布不同，效应均值也会不同。 • 干扰:在本书正文中，我们都假定没有干扰(参见精讲点1.1)。然而干扰在现实中确是实际存在的，一个个体的结局可能会影响到其他个体的结局。因此，两个人群中不同的社会交往形式会使治疗的因果效应不同。

可以通过以下方式提高因果推断在不同人群间的可移植性:(1)将关注点限制在特定分层中的人群;(2)利用研究人群中每一分层的因果效应，构造符合目标人群特征的因果效应均值。比如，我们书中例子里，涉及了四个分层，即罗马女性、希腊女性、罗马男性、希腊男性。我们可以对每一分层的因果效应进行加权平均，从而构造出满足不同性别比例、国籍比例的目标人群，并得到其中的因果效应均值，而权重就是每一分层在目标人群中所占的比例。然而，我们依然不能保证在构造所得人群中的因果效应均值，等于实际目标人群的真实值，这是因为未知的修饰因子、干扰现象以及治疗形式可能在研究人群与目标人群中分布不同。

5 交互作用

再思考一下这个因果性的问题:“一个人抬头向天上看会影响其他行人也抬头向天上看吗?”我们迄今的讨论仅局限于一个单一动作(抬头看)在整个人群或子群体中的因果效应。但是，有一些问题，则是关于两个或多个同时动作的因果效应。除了你是否抬头向天上看之外，我们再增加一项:你是否赤身裸体地站在大街上。现在我们的问题变成:如果你穿着衣服站在大街上，你抬头向天上看会影响其他行人也抬头向天上看吗?如果你赤身裸体地站在大街上呢?如果这两个情境中的因果效应有差别，我们会说这两项动作(抬头看和赤身裸体)相互交互，从而对结局产生影响。当多个干预措施存在的时候，识别是否有交互作用能让我们找出最有效的干预策略。

5.1 交互作用的定义

假设在我们的心脏移植例子中，在分配心脏移植治疗之前，每个被试被分为服用维生素组 ( E = 1 )和不服用维生素组( E = 0 )。于是我们能将被试分成四个组:有维生素有移植组 (E =1, A 1)=、有维生素无移植组(E =1, A 0)=、无维生素有移植组(E = 0, A 1)=、无维生素无移植组( E = 0, A 0 )=。于是有四种治疗组合，每个被试有四个反事实结局: Ya1,e1、Ya1,e0、Ya0,e1、Ya0,e0。与之前的定义相同，个体的反事实结局Ya,e 表示这个个体在 A 和 E 的取值分别为 a 和 e 时我们观测到的结局。我们将这种涉及多种治疗(或干预)的措施称为联合干预。

我们现在可以在反事实结局的框架下对交互作用下一个定义:对于两种不同的治疗(或干预)A和E，如果A对结局Y的因果效应在E分别为0和1时而不同，则A和E之间存在交互作用。

比如，在我们的例子中，如果心脏移植对死亡率的因果效应在服用维生素组和不服用维生素组中不同，则心脏移植 A 和维生素 E 之间有交互作用。

用风险差来度量因果效应时，如果在人群中有：

让我们回顾一下效应修饰的定义:如果 A对Y 的因果效应均值在V 的不同取值下不尽相同，则V 是一个效应修饰因子。注意，当提及效应修饰时，我们说的是对 A的因果效应的修饰，而不是V 的因果效应。因而，当我们说V 修饰了 A的因果效应时，我们并不是将V 和 A放在同等的地位进行考虑，而只有 A才是我们所关注的干预(或治疗)变量。

5.2 识别交互作用

在前几章，我们讨论了在整个人群或其中一个子群体中，识别治疗(或干预) A 对结局 Y 的因果效应所需要的三个前提条件，分别是互换性、正数性和一致性。因为交互作用涉及两个(或多个)干预(或治疗) A 和 E 的联合效应，因而识别交互作用需要互换性、正数性和一致性对这两个干预(或治疗)都成立。

假设研究者是随机、无条件进行维生素 E 分组的，则正数性和一致性成立，且服用组 E = 1 和未服用组 E = 0 可互换。也就是说，如果所有被试都被分配到治疗组 A = 1 和服用组 E = 1 ，我们观测到的结局，会和如果现实中所有服用组 E = 1 被试都被分配到治疗组 A = 1 时的结局一样。

换句话说，如果治疗 E 是随机分配的，则交互作用和效应修饰两个概念重合。

就算治疗(或干预) E 不是由研究者分配的，我们依然需要计算四个不同的边缘风险 p（Ya,e=1）来检验A和E之间的交互作用。

总而言之，检验 A 和 E 之间的交互作用，需要互换性、正数性和一致性对于联合治疗(或干预) AE 成立，这个联合治疗(或干预)有四种不同取值，分别是 00,01,10,11。此时，我们就可以用标准化或逆概率加权来估计这两个治疗(或干预)的联合效应，并衡量它们间的交互作用

5.3 反事实回应类型和交互作用

每个个体可以根据他们命定的反事实结局分成不同的类型。注定型，指不管有没有接受治疗，都会发生我们关注的结局.免疫型，指不管有没有接受治疗，都不会发生我们关注的结局.受益型，指接受了治疗，就不会发生我们关注的结局，而不接受就会发生.受害型，指接受了治疗，才会发生我们关注的结局，而不接受就不会发生.

当涉及两种不同的二分治疗 A 和 E 时，联合治疗共有 4 种不同取值，因而每个被试有 4 种情 59 形的反事实结局，也因此总共将有 16 种不同的回应类型。

在表中的 16 种类型中，我们已经发现其中 6 种(第 1、4、6、11、13、16)有一个共同特点:如果一个个体属于这 6 个类型中的一种，那对他来说，治疗 A 对 Y 的因果效应不管 E 的取值如何都是一样的，同时，治疗 E 对 Y 的因果效应不管 A 的取值如何也都是一样的。如果一个人群是由这 6 种类型的个体构成的，那有 E 时 A 的因果效应，和没有 E 时 A 的因果效应就会是一样。也就是说，如果一个人群中的所有个体都是第 1、4、6、11、13 和 16 类型的，那么 A 和 E 之间在加法尺度上就不存在交互作用。

如果 A 和 E 之间存在加法交互作用，那对一个人群中的某些个体而言。我们就不能在不知道 E 的情况下识别 A = a 的反事实结局，反之亦然。也就是说，如果存在交互作用，那人群中的某些个体属于以下三类中的一种:

只会在4种不同治疗组合中的一种出现我们关注的结局(表中的第8、12、14和 15)。
会在4种不同治疗组合中的两种出现我们关注的结局，其中一个组合的两种治疗，和另一个组合中对应的治疗取值正好相反(表的第 7 和 10)。
会在4种不同治疗组合中的三种出现我们关注的结局(表的第2、3、5和9)。

5.4 充分成因

还是以我们心脏移植的研究为例。只考虑一种治疗 A 时，我们知道有些人接受了治疗会死，有些人不接受治疗会死，有些人不管接不接受治疗都会死，而还有一些人不管接不接受治疗都不会死。这林林总总的回应类型反映了治疗 A 不是唯一决定结局 Y 的变量。

只考虑在现实中实际接受治疗的人，他们中有一些人死了，但有些人没死，说明仅有治疗不足以充分说明是否会发生结局，还有一些其他的背景因素也会影响结局。在一个很简化的例子中，假设只有对麻药过敏(U1 =1)的人在接受心脏移植(A=1)后才会死亡，则我们将治疗 (A=1)和对麻药过敏(U1 =1)称作结局Y的最小充分成因。

现在让我们考虑在现实中实际未接受治疗的人，同样，他们中有一些人死了，但有些人没死，说明仅靠没有治疗不足以充分说明是否会发生结局，还有一些其他的背景因素也会影响结局。在一个很简化的例子中，假设只有射血分数小于20%(U2 =1)的人在未接受心脏移植 (A=0)的情况下才会死亡，则未接受治疗(A=0)和低射血分数(U2 =1)是结局Y的另一个最小充分成因。

当然，也有被试在没有U1 和U2 的情况下，不管接不接受心脏移植治疗，都会去世。这些 “注定型”个体也有他们自己的最小充分成因。比如，所有这些个体在研究开始前都有胰腺癌 (U0 =1)，因而不管他们接不接受治疗，他们都会去世。因而胰腺癌(U0 =1)就是结局Y的另一个最小充分成因。

至此，我们描述了结局的3个充分成因:A=1且U1 =1，A=0且U2 =1，U0 =1。每一个充分成因都有一个或多个成分，比如第一个中的A=1和U1 =1。

其中一个整圆表示一个充分成因，圆内的每一部分表示一个成分。充分成因经常用来表示一个结局的充分条件和这个条件的所有成分。

用图来表示充分成因可以形象化地展示效应修饰的一个重要推论:如同第四章所讨论的一样，治疗 A 的因果效应大小取决于人群中修饰因子的分布状态。让我们来想象两种假想情境。在第一种情境中，人群中只有1%的人有U1 =1。在第二种情境中，人群中有10%的人有U1 =1。而 U0 和U2 在两种情境中分布相同。如果我们在两种情境中都开展一次心脏移植的试验，我们会发现第一种情境里里心脏移植 A 对死亡 Y 的因果效应均值，要大于第二种情境，这是因为第二种情境里有更多的人会在接受治疗后去世。在这个例子中，3个充分成因中的一个，A=1且U1 =1，在第二种情境里的人数比例是第一种里的 10 倍，而其余 2 个在两种情境中的出现频率是一样的。充分成因的概念可以用图像的方法来表示，而这个图像也可以用来解释交互作用。我们先考虑有两种治疗 A 和 E 时的情况，于是我们有以下 9 种可能的充分成因:

有A时(不管有没有E)
无A时(不管有没有E)
有E时(不管有没有A)
无E时(不管有没有A)
同时有A和E时
有A无E时
有E无A时
同时无A和E时
通过其他途径(不管有没有A或E)

我们给每一种充分成因补充一个背景因素，记作U0 至U8。下图展示了两种治疗A和E下的这 9 种不同充分成因。在下一节我们会进一步用图解释交互作用。

不是每一个涉及一个二分结局和两种不同治疗的情境都有 9 种不同的充分成因。比如在我们的例子中，可能出现只要服用维生素E =1，那不管有没有接受心脏移植手术，都不会死的情况。此时，图5.2中E=1的3个充分成因就不再成立。这3个充分成因表示，在U3 =1时，如果被试服用维生素( E = 1 )就会死亡，也就是说，如果他们不服用维生素( E = 0 )，那他们就不会死亡。

5.5 充分成因的交互作用

“ A 和 E 的交互作用”这一说法暗示这两种不同的治疗在机理上共同协作从而造成了结局。然而，在反事实框架下，交互作用的定义不涉及任何机理相关的内容在我们心脏移植的例子中，如果心脏移植 A 的因果效应在服用维生素 E 和不服用的被试中不一样的话，那 A 和 E 之间就有交互作用。也就是说，交互作用是通过反事实结局的对比而定义的，我们也就能通过随机试验来检验交互作用，而无需知道背后物理学、化学、生物学、社会学等等各方面的机理。

这一小节我们将介绍交互作用的另一种定义。这一定义不依赖于反事实结局的对比，而是建立在充分成因之上，更能用于解释交互作用的机理。

只有当A和E同时出现在一个充分成因中，A和E之间才会有交互作用。比如，一个U5 =1的个体，只有同时服用维生素( E = 1 )和接受心脏移植( A = 1 )才死亡，但只接受两种治疗中的一种则不会。因而，只要存在一个U5 =1的个体，A和E之间就存在交互作用。

充分成因中的交互作用可以分成协同和拮抗两种。如果在一个充分成因中， A = 1 且 E = 1 ，则A和E之间的交互作用是协同的;而如果A=1且E =0(或A=0且E =1)，则是拮抗的。当然，也可以认为A和E之间的协同是A和非E(或非A和E)之间的拮抗。

与反事实框架下交互作用的定义不同，充分成因中交互作用的定义需要涉及到 A 和 E 的机理，也即我们需要相关知识去识别是否存在交互作用。但有时候也会出现这种情况:就算我们对充分成因和其各成分所知不多，我们依然发现有交互作用的存在。

5.6 反事实框架和充分成因框架

充分成因框架和反事实框架能够处理不同的问题。充分成因模型考虑了事件、动作、状态等不同因素，这些不同的因素共同作用从而造成了我们关注的结局。这一模型能够给一个特定的效应作出解释。它能够解决诸如“哪些不同的条件能导致同样的结局?”等问题。而反事实模型则主要关注一个特定的治疗(或干预)，并对这个治疗(或干预)的不同因果效应给出解释。反事实框架能够解决诸如“如果我们对一因素进行干预会导致什么样的后果?”等问题。与充分成因模型不同，反事实模型不需要任何涉及作用机理的知识。

反事实框架解决的问题是“会发生什么?”，而充分成因框架解决的问题是“怎么会发生?”。

虽然充分成因框架是有益于我们的阐释，但是尚未有方法将其整合到数据分析当中。

6 因果效应的图像表示

因果推断需要借助研究者的专业知识和不可检验的假设去勾勒治疗、结局以及其他变量之间的关系。

在因果推断中使用图像，能让我们更好地遵守这一重要准则: 在你做出结论之前，先说出你假设了什么。

6.1 因果图

图由 3 个节点和 3 条边组成。每个节点表示一个随机变量( L ， A ， Y )，每条边都有一个方向。我们按照惯例，从左至右表示时间顺序，因而， L 在时间上先于 A 和 Y ， A 在时间上先于 Y 。如同前几章， L 表示重症程度， A 表示心脏移植， Y 表示死亡。

从 A 到 Y 的箭头表示我们知道至少在一个个体中， A 对 Y 有直接因果效应(即中间不存在任何其他变量)。同理，如果 A 和 Y 之间不存在任何箭头，那对任何个体， A 对 Y 都没有直接因果效应。

在我们的图 6.1 中，有箭头从 L 指向 A ，表示重症程度会影响接受心脏移植的概率。

类似图的因果图被称为有向无环图(Directed acyclic graphs，简称为 DAG)，大多时候，会被简称为 DAG 图.

“有向”指的是每条边都有一个方向:因为箭头是从 L 指向 A ，所以 L 是 A 的一个诱因，而不是反过来。“无环”指的是没有一个封闭的环形回路:任何变量都不能是自己的诱因，也不能间接通过其他变量成为自己的诱因。

在因果性有向无环图中，如果控制了变量 A 的直接诱因，那除去 A 作为诱因的变量，A 和图中的其他变量相互独立。这是因果性有向无环图的基本定义之一。这一假设被称为因果性马尔科夫假设(causal Markov assumption)，同时暗示了在一个因果性 DAG 图中，如果任意两个变量有共同诱因，那这个共同诱因也必须出现在同一图中。

假设治疗组的分配不再取决于重症程度 L ，也即 L 不再是 A 和 Y 的共同诱因，也就没必要出现在图中，图 6.2 反映了这一情况，此时这是一项边缘随机试验。

最近发展出来的另一种有向无环图——单一世界干涉图(Single world intervention graph，简称 SWIG 图)——通过将反事实变量融入到图中，统一了反事实框架和因果推断的图像表示。会在后续进行介绍

6.2 因果图与边缘独立

考虑以下两个例子。在第一个例子中，你知道阿司匹林 A 能预防心脏病 Y ，即p（Ya1=1）！=p（Ya0=1）。

在第二个例子中，你知道携带打火机 A 其实对肺癌 Y 没有因果效应(无论是预防性的还是诱发性的)，p（Ya1=1）=p（Ya0=1）。然而吸烟L对携带打火机A和肺癌Y同时具有因果效应。

。在图 6.3 中，没有箭头从 A 指向 Y ，表明携带打火机对肺癌没有因果效应;而 L 是 A 和 Y 的一个共同诱因。

我们在画图 6.2 和图 6.3 的时候运用到了我们对这几个变量之间因果关系的理解。同时，除去因果关系，这两幅图也表示了各变量之间的相关性。在这两幅图中， A 和 Y 都是相关的，不管这两者之间有没有箭头。这一点蕴含了因果图论的重要推论之一。

我们先来看图 6.2 表示的随机试验。在直觉上，我们当然会期待由箭头连接的两个变量 A 和 Y 相关。这也是因果图论的结论之一:如果 A 对 Y 有因果效应，那么 A 和 Y 相关。

对于吸烟的例子，即使A对Y没有因果效应，但是知道A的信息能提高我们预测结局 Y 的能力。我们不能因为 A 和 Y 相关就结论说 A 对 Y 有因果效应。在这里，因果图论再一次验证了我们的直觉。用图论的术语来说， A 和 Y 相关，是因为有信息途经共同诱因 L，从 A流动到了Y (或者从Y 到 A)。

让我们再考虑第三个例子。假设我们知道某个基因型 A 对我们是否抽烟 Y 没有因果效应，即： p（Ya1=1）=p（Ya0=1）而基因型A和抽烟Y都对心脏病L有因果效应。图6.4表示了这一情况。A和Y之间没有箭头，表明A对Y没有因果效应。同时A和Y都是L的诱因。

L作为A和Y 共有的结果，被称为对撞变量(Collider)，其路径可以表述为 A → L ← Y ，在这个路径中，两个箭头在 L 这个节点处相撞，故采用对撞这一命名。

A 和 Y 是否相关?用图论的术语来说，和其他变量不同，对撞变量会阻断其所在路径中信息的流动。在路径 A → L ← Y 中，因为对撞变量 L 阻断了这条路径，所以 A 和 Y 相互独立。总而言之，如果一个变量是另一个变量的诱因，或两个变量有共同诱因，那么这两个变量相关，否则这两个变量相互(边缘)独立。

6.3 因果图与条件独立

在图 6.2 中，我们料到阿司匹林 A 和心脏病 Y 相关，因为阿司匹林对心脏病有因果效应。假设我们有了其他新知识:阿司匹林 A 能影响心脏病 Y 是因为阿司匹林 A 能降低血小板凝集 B 。我们可以将这一新知识画成因果图，如图 6.5 所示，血小板凝集 B (1 表示高，0 表示低)是 A 对 Y 效应的中介变量。

在因果图中引入第三个变量后，我们就面临一个新问题:在 B 的每一层取值中(也即控制了B 之后)， A 和 Y 还相关吗?或者，换句话说:当我们有了 B 的信息后，我们依然需要 A 的信息来预测 Y 吗?答案是不需要了。在=图中，变量= B 周围的方框阻断了路径 A → B → Y 。

让我们回到图 6.3。在上一节我们说携带打火机 A 和肺癌 Y 相关，是因为路径 A ← L → Y 是开放的，相关性的信息能从 Y 流到 A 。现在，我们想知道，在控制了 L 之后， A 和 Y 是否依然相关。

答案也是不相关。

6.4 因果图的正数性和一致性

因为因果图定性地包含了变量间因果关系的知识，所以它可以用来帮助我们分析因果推断中出现的问题。

不管我们用了什么样的表示方法(反事实或图像)，互换性、正数性和一致性都是因果推断所需的前提假设。如果这些条件中有一条不成立，那数据分析得到的估计值就不能阐释为因果效应。在下一小节(同时还有第七章和第八章)，我们将讨论互换性在图中如何表述。现在我们先来讨论正数性和互换性。

正数性在图中可以大致表述为:从节点 L 到节点 A 的箭头不是命定的。

一致性的第一个部分 ——良定的干预——意味着从干预 A 到结局 Y 的箭头对应一个可能是假想的、但是不含糊的干预措施。

6.5 偏移的结构性分类

当我们的数据无限、但依然不足以识别或计算因果效应时，我们会说存在系统性偏移。如果干预与结局之间的相关性不是来源于干预的因果效应，那么我们会将这种相关性称为系统性偏移。

偏移的一个主要来源是治疗组与非治疗组之间互换性的缺失。以下两种不同的因果结构中会出现互换性的缺失:

存在共同诱因:当干预和结局有共同诱因时，相关性量度通常会和效应量度不同。许多流行病学家用“混杂”一词表示这一情形。
控制共同后果:来源于这一结构的偏移被大多数流行病学家称作“选择偏移”。

不同类型的系统性偏移，即混杂、选择偏移和测量偏移

6.6 效应修饰的结构

识别偏移的可能来源是因果图的主要用途之一:我们可以根据已有的专业知识画出各变量间的结构，然后找出干预和结局间相关性的可能源头。。然而，因果图却难以描述我们在第四章讨论的效应修饰作用。

总的来说，因果图并不能表示两个变量之间的交互作用。然而，如果我们在因果图中加入表示充分成因的增强型节点(即有命定的箭头从干预指向充分成因)，我们就能在因果图中包含交互作用的信息。有交互作用存在的时候，控制共同后果会影响相关性的大小和方向。我们将在第八章讨论这一增强型因果图。

7. 混杂 CONFOUNDING

假设某研究者进行了一次观察性研究来探索以下问题:“一个人抬头向天上看会影响其他行人抬头向上看吗?”他发现第一个行人抬头向上看和第二个行人也抬头向上看之间存在相关性。然而，这名研究者发现当行人听到雷声时都会抬头向天上看。因此，到底是第一个行人的行为让第二个行人抬头向天上看，还是雷声让第二个行人向天上看?这名研究者只能说雷声的存在混杂了第一个人抬头向上看的因果效应。在随机试验中，治疗组的分配都是随机的。但是在观察性研究中，治疗与否却受到许多因素的影响。如果这些因素影响了结局出现的风险，那么这些因素的效应就会和治疗的效应混杂在一起。此时我们会说存在混杂，这也是治疗组和非治疗组之间互换性缺失的体现。混杂的存在经常被视为观察性研究的重要缺陷之一。在有混杂的时候，即使我们的研究样本无限大，我们常说的 “相关性不等于因果性”也依然是正确的。本章将给出混杂的正式定义，同时将讨论调整混杂的不同方法。

7.1 混杂的结构

如果图 7.1 不包含共同诱因 L ，那么治疗和结局之间的路径就只剩 A → Y ，因而 A 和 Y 之间所有的相关性都来自A对Y的因果效应。此时相关性就是因果性。而共同诱因L的存在是 A 和 Y 之间相关性的另一个来源，我们将其称为 A 对 Y 因果效应的混杂变量。因为混杂以及混杂变量的存在，相关性风险比不再等于因果性风险比，此时相关性不再是因果性。

偏移都来源于治疗(或干预) A 和结局Y的共同诱因L或U。

7.2 混杂与互换性

在实践中，如果我们认为混杂很可能存在，那我们就要面对一个严峻的问题:我们能否找到一系列已测的变量 L ，只要控制了这些变量，有界互换性就能成立?回答这一问题并不轻松。在本章，我们将论述，如果我们知道数据所对应的 DAG 图，我们就能回答上述问题。假设我们知道了数据背后真实的 DAG 图(不要在意我们是怎么知道的)，那我们应该怎么使用这幅 DAG 图来判断有界互换性取决于哪些变量呢?我们有两种可行的方法:(1)DAG 图中的后门准则; (2)将 DAG 图转化为 SWIG 图。虽然使用 SWIG 图更为直观，但它也需要更多的前期准备。

现在让我们正式介绍一下后门准则(即互换性)和混杂的关系。以下两种情境满足后门准则:

治疗和结局没有共同诱因。在图6.2中，治疗和结局没有共同诱因，因而也就没有后门路径需要被阻断。此时满足后门准则的变量集合为空集，即不存在混杂。
治疗和结局有共同诱因，但是一个不含A下游变量的变量集合L足够阻断所有的后门路径。在图 7.1 中，满足后门准则的变量集合是 L 。此时存在混杂，但是不存在需要调整未知变量(这也不可能实现)的残余混杂。我们将此称为“不存在未知混杂”。 L 被称为“混杂调整的充分集合”。

后门准则并不能告诉我们混杂的方向与强度。很可能未阻断的后门路径所引起的相关性很弱 (比如 L 对 A 和 Y 的效果都不强)，那引入的偏移就很小。在实践中，我们经常要考虑偏移的方向与大小。

7.3 混杂与后门准则

现在我们来讲述应用后门准则去判定 A 对 Y 的因果效应是否可识别，如果可以，需要控制哪些变量以保证有界互换性。谨记本章因果性 DAG 图中的节点代表的都是零误差测量的变量。在图7.1中，治疗A和结局Y有一个共同诱因L，也即A和L之间有一条通过L的开放后门路径。但我们控制住 L 的话，我们就能阻断这条后门路径。因而，如果研究者收集了每个个体的 L 信息，那在给定 L 的情况下，也就不存在未知或未测的混杂了。在图7.2中，治疗A和结局Y有一个未知的共同诱因U，也即A和L之间有一条通过U的开放后门路径。理论上，我们可以通过控制U阻断这条后门路径，从而消除混杂。但与L、A和Y不同的是，U 是一个未知或未测量的变量。不过此时我们也可以通过控制 L 阻断这条后门路径。因而，那在给定 L 的情况下，这一情境不存在未知或未测的混杂。

8 选择偏移

指的是因为对研究人群有选择的分析而引起的偏移。

9 测量偏移

如果数据的测量导致治疗和结局之间相关的增强或减弱，则称为测量偏移。

10 随机性变异

因为随机性导致的偏移。

11 统计模型

略，主要设计回归模型

12 逆概率加权和边缘结构模型

（需要注意，A和Y都可以是连续变量或者分类变量）

戒烟对体重增加的因果效应均值是多少?在本章，我们会讲述如何使用逆概率加权从观察性数据中估计这一效应。我们将用现实世界中 NHEFS 的数据来估计戒烟对体重增加的效应。NHEFS 表示“美国国家健康与营养调查数据 1 期:流行病学跟踪研究”。NHEFS 是由美国国家健康数据中心、老龄化研究所、以及其他美国公共卫生服务机构共同发起的研究。NHEFS 的详尽介绍及数据能在以下网址中找到:wwwn.cdc.gov/nchs/nhanes/nhefs/。

关于数据集的信息如链接所示： https://github.com/IBM/causallib/blob/master/causallib/datasets/data/nhefs/NHEFS_codebook.csv

12.1 因果性为题

我们的目标是估计戒烟 A (治疗变量)对体重增加 Y (结局变量)的因果效应均值，因而我们从 NHEFS 数据中选取了 1566 名 25 至 74 岁的参与人员，他们有第一次访问时的数据，并在 10 年后被再次随访。如果他们在随访之前戒烟了，那就被称为治疗组 A = 1 ，如果没有则是非治疗组 A = 0 。在初次访问和随访中都记录了每个人的体重(kg)，因而可以计算出体重增加了多少。

我们定义戒烟效果的因果效应定义为这两者的差别，也就是E（Ya1）-E（Ya0）

通常而言，相关性差值E（Y｜A=1）-E（Y｜A=0）与E（Ya1）-E（Ya0）有区别。如果戒烟者和非戒烟者的特征分布不一样，那么相关性的差值就没有因果意义。（我们可以分析治疗组和非治疗组的特征分布是不是一样。）

#####################################################
# PROGRAM 12.1
# Descriptive statistics from NHEFS data (Table 12.1)
#####################################################

#install.packages("readxl") # install package if required
library("readxl")
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.3     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

nhefs <- read.csv("/Users/milin/Downloads/nhefs.csv")
nhefs$cens <- ifelse(is.na(nhefs$wt82), 1, 0)  # 如果没有值就记为1

nhefs.nmv <- nhefs[which(!is.na(nhefs$wt82)),] #删除缺失值 provisionally ignore subjects with missing values for weight in 1982

lm(wt82_71 ~ qsmk, data=nhefs.nmv) # qsmk 表示是否是治疗组

## 
## Call:
## lm(formula = wt82_71 ~ qsmk, data = nhefs.nmv)
## 
## Coefficients:
## (Intercept)         qsmk  
##       1.984        2.541

predict(lm(wt82_71 ~ qsmk, data=nhefs.nmv), data.frame(qsmk=1)) # Smoking cessation

##        1 
## 4.525079

predict(lm(wt82_71 ~ qsmk, data=nhefs.nmv), data.frame(qsmk=0)) # No smoking cessation

##        1 
## 1.984498

# Table
summary(nhefs.nmv[which(nhefs.nmv$qsmk==0),]$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.00   33.00   42.00   42.79   51.00   72.00

summary(nhefs.nmv[which(nhefs.nmv$qsmk==0),]$wt71)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   40.82   59.19   68.49   70.30   79.38  151.73

summary(nhefs.nmv[which(nhefs.nmv$qsmk==0),]$smokeintensity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   15.00   20.00   21.19   30.00   60.00

summary(nhefs.nmv[which(nhefs.nmv$qsmk==0),]$smokeyrs)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   15.00   23.00   24.09   32.00   64.00

summary(nhefs.nmv[which(nhefs.nmv$qsmk==1),]$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.00   35.00   46.00   46.17   56.00   74.00

summary(nhefs.nmv[which(nhefs.nmv$qsmk==1),]$wt71)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   39.58   60.67   71.21   72.35   81.08  136.98

summary(nhefs.nmv[which(nhefs.nmv$qsmk==1),]$smokeintensity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    10.0    20.0    18.6    25.0    80.0

summary(nhefs.nmv[which(nhefs.nmv$qsmk==1),]$smokeyrs)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   15.00   26.00   26.03   35.00   60.00

table(nhefs.nmv$qsmk, nhefs.nmv$sex)

##    
##       0   1
##   0 542 621
##   1 220 183

prop.table(table(nhefs.nmv$qsmk, nhefs.nmv$sex), 1)

##    
##             0         1
##   0 0.4660361 0.5339639
##   1 0.5459057 0.4540943

table(nhefs.nmv$qsmk, nhefs.nmv$race)

##    
##       0   1
##   0 993 170
##   1 367  36

prop.table(table(nhefs.nmv$qsmk, nhefs.nmv$race), 1)

##    
##              0          1
##   0 0.85382631 0.14617369
##   1 0.91066998 0.08933002

table(nhefs.nmv$qsmk, nhefs.nmv$education)

##    
##       1   2   3   4   5
##   0 210 266 480  92 115
##   1  81  74 157  29  62

prop.table(table(nhefs.nmv$qsmk, nhefs.nmv$education), 1)

##    
##              1          2          3          4          5
##   0 0.18056750 0.22871883 0.41272571 0.07910576 0.09888220
##   1 0.20099256 0.18362283 0.38957816 0.07196030 0.15384615

table(nhefs.nmv$qsmk, nhefs.nmv$exercise)

##    
##       0   1   2
##   0 237 485 441
##   1  63 176 164

prop.table(table(nhefs.nmv$qsmk, nhefs.nmv$exercise), 1)

##    
##             0         1         2
##   0 0.2037833 0.4170249 0.3791917
##   1 0.1563275 0.4367246 0.4069479

table(nhefs.nmv$qsmk, nhefs.nmv$active)

##    
##       0   1   2
##   0 532 527 104
##   1 170 188  45

prop.table(table(nhefs.nmv$qsmk, nhefs.nmv$active), 1)

##    
##             0         1         2
##   0 0.4574377 0.4531384 0.0894239
##   1 0.4218362 0.4665012 0.1116625

从结果中可以看到，戒烟者和非戒烟者在性别、种族、教育程度、初次访问时的体重以及吸烟频率等特征方面并不相同。如果这些变量是混杂变量，那我们就需要在分析中调整这些变量。在本书第十八章我们将讨论混杂变量的选取。在此处，我们假设初次访问时的 9 个变量就足以代表所有混杂，这些变量包括:性别(sex1，0:男性，1:女性)，年龄(age，单位:年)，种族(race，0:白人，1:其他)，教育程度(education，5 个分类)，吸烟频率 (smokeintensity，单位:支/天)，烟龄(smokeyrs，单位:年)，每日体力情况(active，3 个分类)，运动量(exercise，3 个分类)，以及初次访问时的体重(wt71，单位:kg)。也即，我们常用来表示混杂变量的 L 此时是一个含 9 个变量的向量。下一节我们将用逆概率加权的方法来调整这些变量。

12.2 使用模型计算逆概率权重

为了得到p（A=1｜L）的参数估计值，可以使用逻辑回归模型，用来预测控制了混杂变量之后戒烟的概率。

在虚拟人群中，我们通过加权最小二乘法拟合(饱和)线性均值模型E[Y | A]=θ θ+A，从而估计E [Y−| A=1] E=[Y | A 0].

有两步：1. 根据A与L构建逻辑回归模型，罗辑回归模型的预测值的倒数就是L的逆概率。（计算逆概率） 2.使用加权最小二乘拟合A与Y的线性回归模型。（构建回归模型）

#####################################################
# PROGRAM 12.2
# Estimating IP weights
# Data from NHEFS
#####################################################

# Estimation of ip weights via a logistic model
fit <- glm(qsmk ~ sex + race + age + I(age^2) + 
  as.factor(education) + smokeintensity + 
  I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) + 
  as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2), 
           family = binomial(), data = nhefs.nmv)
summary(fit)

## 
## Call:
## glm(formula = qsmk ~ sex + race + age + I(age^2) + as.factor(education) + 
##     smokeintensity + I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) + 
##     as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2), 
##     family = binomial(), data = nhefs.nmv)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5127  -0.7907  -0.6387   0.9832   2.3729  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -2.2425191  1.3808360  -1.624 0.104369    
## sex                   -0.5274782  0.1540496  -3.424 0.000617 ***
## race                  -0.8392636  0.2100665  -3.995 6.46e-05 ***
## age                    0.1212052  0.0512663   2.364 0.018068 *  
## I(age^2)              -0.0008246  0.0005361  -1.538 0.124039    
## as.factor(education)2 -0.0287755  0.1983506  -0.145 0.884653    
## as.factor(education)3  0.0864318  0.1780850   0.485 0.627435    
## as.factor(education)4  0.0636010  0.2732108   0.233 0.815924    
## as.factor(education)5  0.4759606  0.2262237   2.104 0.035384 *  
## smokeintensity        -0.0772704  0.0152499  -5.067 4.04e-07 ***
## I(smokeintensity^2)    0.0010451  0.0002866   3.647 0.000265 ***
## smokeyrs              -0.0735966  0.0277775  -2.650 0.008061 ** 
## I(smokeyrs^2)          0.0008441  0.0004632   1.822 0.068398 .  
## as.factor(exercise)1   0.3548405  0.1801351   1.970 0.048855 *  
## as.factor(exercise)2   0.3957040  0.1872400   2.113 0.034571 *  
## as.factor(active)1     0.0319445  0.1329372   0.240 0.810100    
## as.factor(active)2     0.1767840  0.2149720   0.822 0.410873    
## wt71                  -0.0152357  0.0263161  -0.579 0.562625    
## I(wt71^2)              0.0001352  0.0001632   0.829 0.407370    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1786.1  on 1565  degrees of freedom
## Residual deviance: 1676.9  on 1547  degrees of freedom
## AIC: 1714.9
## 
## Number of Fisher Scoring iterations: 4

p.qsmk.obs <- ifelse(nhefs.nmv$qsmk == 0, 1 - predict(fit, type = "response"),
                     predict(fit, type = "response"))

nhefs.nmv$w <- 1/p.qsmk.obs
summary(nhefs.nmv$w)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.054   1.230   1.373   1.996   1.990  16.700

sd(nhefs.nmv$w)

## [1] 1.474787

#install.packages("geepack") # install package if required
library("geepack")

## Warning: package 'geepack' was built under R version 4.1.2

msm.w <- geeglm(wt82_71 ~ qsmk, data=nhefs.nmv, weights=w,id = seqn,corstr="independence")
summary(msm.w)

## 
## Call:
## geeglm(formula = wt82_71 ~ qsmk, data = nhefs.nmv, weights = w, 
##     id = seqn, corstr = "independence")
## 
##  Coefficients:
##             Estimate Std.err  Wald Pr(>|W|)    
## (Intercept)   1.7800  0.2247 62.73 2.33e-15 ***
## qsmk          3.4405  0.5255 42.87 5.86e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)    65.06   4.221
## Number of clusters:   1566  Maximum cluster size: 1

beta <- coef(msm.w)
SE <- coef(summary(msm.w))[,2]
lcl <- beta-qnorm(0.975)*SE 
ucl <- beta+qnorm(0.975)*SE
cbind(beta, lcl, ucl)

##              beta   lcl  ucl
## (Intercept) 1.780 1.340 2.22
## qsmk        3.441 2.411 4.47

# no association between sex and qsmk in pseudo-population
xtabs(nhefs.nmv$w ~ nhefs.nmv$sex + nhefs.nmv$qsmk)

##              nhefs.nmv$qsmk
## nhefs.nmv$sex     0     1
##             0 763.6 763.6
##             1 801.7 797.2

# "check" for positivity (White women)
table(nhefs.nmv$age[nhefs.nmv$race == 0 & nhefs.nmv$sex == 1], 
      nhefs.nmv$qsmk[nhefs.nmv$race == 0 & nhefs.nmv$sex == 1])

##     
##       0  1
##   25 24  3
##   26 14  5
##   27 18  2
##   28 20  5
##   29 15  4
##   30 14  5
##   31 11  5
##   32 14  7
##   33 12  3
##   34 22  5
##   35 16  5
##   36 13  3
##   37 14  1
##   38  6  2
##   39 19  4
##   40 10  4
##   41 13  3
##   42 16  3
##   43 14  3
##   44  9  4
##   45 12  5
##   46 19  4
##   47 19  4
##   48 19  4
##   49 11  3
##   50 18  4
##   51  9  3
##   52 11  3
##   53 11  4
##   54 17  9
##   55  9  4
##   56  8  7
##   57  9  2
##   58  8  4
##   59  5  4
##   60  5  4
##   61  5  2
##   62  6  5
##   63  3  3
##   64  7  1
##   65  3  2
##   66  4  0
##   67  2  0
##   69  6  2
##   70  2  1
##   71  0  1
##   72  2  2
##   74  0  1

12.3 逆概率稳定权重

不过，除此之外，我们还有其他方法来构建虚拟人群，从而让其中的 A 和 L 相互独立。比如，我们可以构建一个无论 L 是什么， A = 1 的概率是 50%、 A = 0 的概率也是 50%的虚拟人群，这样一个虚拟人群的逆概率权重是0.5/ f (A| L)。

逆概率权重为 p / f ( A | L ) 时同理，其中0< p≤1。权重WA =1/ f (A|L)只是p=1时的一个特例。

让我们进一步往前推理。调整混杂的目的是让治疗变量 A 的概率不取决于混杂变量 L 。于是我们可以构建一个虚拟人群来达到这个目的，而其中每个人接受治疗 A 的概率是否相同则无关紧要，只要这个概率不取决于 L 就行。

比如，一个常用做法是让虚拟人群中接受治疗的概率等于原本样本中治疗的概率Pr[A =1]，对于未接受治疗的概率同理。因而，此时的逆概率权重是:治疗组是Pr[A=1]/ f (A|L)，非治疗组是Pr[A=0]/ f (A|L)。或者写作f (A)/ f (A|L)。

分子中的 f (A)被称为稳定因子，会缩小 f (A)/ f (A| L)的取值范围。

W A =1/ f (A| L)被称为非稳定权重，而SW A = f (A)/ f (A| L)被称为稳定权重。

#####################################################
# PROGRAM 12.3
# Estimating stabilized IP weights
# Data from NHEFS
#####################################################

# estimation of denominator of ip weights
denom.fit <- glm(qsmk ~ as.factor(sex) + as.factor(race) + age + I(age^2) + 
  as.factor(education) + smokeintensity + 
  I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) + 
  as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2), 
                 family = binomial(), data = nhefs.nmv)
summary(denom.fit)

## 
## Call:
## glm(formula = qsmk ~ as.factor(sex) + as.factor(race) + age + 
##     I(age^2) + as.factor(education) + smokeintensity + I(smokeintensity^2) + 
##     smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) + 
##     wt71 + I(wt71^2), family = binomial(), data = nhefs.nmv)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.513  -0.791  -0.639   0.983   2.373  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -2.242519   1.380836   -1.62  0.10437    
## as.factor(sex)1       -0.527478   0.154050   -3.42  0.00062 ***
## as.factor(race)1      -0.839264   0.210067   -4.00  6.5e-05 ***
## age                    0.121205   0.051266    2.36  0.01807 *  
## I(age^2)              -0.000825   0.000536   -1.54  0.12404    
## as.factor(education)2 -0.028776   0.198351   -0.15  0.88465    
## as.factor(education)3  0.086432   0.178085    0.49  0.62744    
## as.factor(education)4  0.063601   0.273211    0.23  0.81592    
## as.factor(education)5  0.475961   0.226224    2.10  0.03538 *  
## smokeintensity        -0.077270   0.015250   -5.07  4.0e-07 ***
## I(smokeintensity^2)    0.001045   0.000287    3.65  0.00027 ***
## smokeyrs              -0.073597   0.027777   -2.65  0.00806 ** 
## I(smokeyrs^2)          0.000844   0.000463    1.82  0.06840 .  
## as.factor(exercise)1   0.354841   0.180135    1.97  0.04885 *  
## as.factor(exercise)2   0.395704   0.187240    2.11  0.03457 *  
## as.factor(active)1     0.031944   0.132937    0.24  0.81010    
## as.factor(active)2     0.176784   0.214972    0.82  0.41087    
## wt71                  -0.015236   0.026316   -0.58  0.56262    
## I(wt71^2)              0.000135   0.000163    0.83  0.40737    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1786.1  on 1565  degrees of freedom
## Residual deviance: 1676.9  on 1547  degrees of freedom
## AIC: 1715
## 
## Number of Fisher Scoring iterations: 4

pd.qsmk <- predict(denom.fit, type = "response")

# estimation of numerator of ip weights
numer.fit <- glm(qsmk~1, family = binomial(), data = nhefs.nmv)
summary(numer.fit)

## 
## Call:
## glm(formula = qsmk ~ 1, family = binomial(), data = nhefs.nmv)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.771  -0.771  -0.771   1.648   1.648  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.0598     0.0578   -18.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1786.1  on 1565  degrees of freedom
## Residual deviance: 1786.1  on 1565  degrees of freedom
## AIC: 1788
## 
## Number of Fisher Scoring iterations: 4

pn.qsmk <- predict(numer.fit, type = "response")

nhefs.nmv$sw <- ifelse(nhefs.nmv$qsmk == 0, ((1-pn.qsmk)/(1-pd.qsmk)),
                    (pn.qsmk/pd.qsmk))

summary(nhefs.nmv$sw)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.331   0.867   0.950   0.999   1.079   4.298

msm.sw <- geeglm(wt82_71 ~ qsmk, data=nhefs.nmv, weights=sw, id=seqn,
                corstr="independence")
summary(msm.sw)

## 
## Call:
## geeglm(formula = wt82_71 ~ qsmk, data = nhefs.nmv, weights = sw, 
##     id = seqn, corstr = "independence")
## 
##  Coefficients:
##             Estimate Std.err Wald Pr(>|W|)    
## (Intercept)    1.780   0.225 62.7  2.3e-15 ***
## qsmk           3.441   0.525 42.9  5.9e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)     60.7    3.71
## Number of clusters:   1566  Maximum cluster size: 1

beta <- coef(msm.sw)
SE <- coef(summary(msm.sw))[,2]
lcl <- beta-qnorm(0.975)*SE 
ucl <- beta+qnorm(0.975)*SE
cbind(beta, lcl, ucl)

##             beta  lcl  ucl
## (Intercept) 1.78 1.34 2.22
## qsmk        3.44 2.41 4.47

# no association between sex and qsmk in pseudo-population
xtabs(nhefs.nmv$sw ~ nhefs.nmv$sex + nhefs.nmv$qsmk)

##              nhefs.nmv$qsmk
## nhefs.nmv$sex   0   1
##             0 567 197
##             1 595 205

如果稳定权重和非稳定权重都能得到同样的结果，我们为什么还要使用稳定权重?这是因为稳定权重的通常会给出更窄的置信区间。然而，只有当(逆概率加权)模型不是饱和的时候，稳定权重的优越性才会显现出来。在我们的例子中，E[Y | A]=θ θ1*A是饱和的，因为治疗变量A只有两个可能取值。在大多数情形中(比如时异性或连续性治疗)，加权模型就不再是饱和的，从而我们会更常用稳定结局

如果治疗变量 A 是一个二分变量，那么这个模型就是一个饱和模型。

12.4 边缘结构模型

让我们思考下面这个在治疗取值为 a 时的线性模型

这个模型和我们之前所讲的模型都不一样，此处模型的因变量是反事实结局，因而我们不可能观测到。因此，这个模型不能用现实数据拟合。这一模型的因变量是反事实结局的边缘均值，因而这一模型被称为边缘结构均值模型。

治疗变量经常是多分类或连续的。比如，如果我们的治疗变量 A 是“吸烟频率的变化”，通过随访时的吸烟频率减去初次访问时的吸烟频率计算得到，那它就是一个连续变量，可以有许多取值，比如可以是-25，表示某人吸烟频率降低了 25 支/天，也可以是 25，表示某人吸烟频率增加了 25 支/天。在初次访问时，有 1162 人每天吸烟数少于等于 25 支。假设我们的目标是估计这 1162 人中不同吸烟频率变化所对应的体重变化，

那我们想估计的是E(ya0)-E(ya1),其中a0和a1可以是任意值。

因为治疗变量 A 可能有许多不同取值，此时饱和模型就变得不切实际。我们需要使用非饱和模型，并假设治疗 A 和结局 Y 的剂量反应曲线形式。如果我们觉得抛物线能恰当描述这一关系，那我们的边缘结构模型就可以是:

假设我们想估计吸烟频率提高 20 支/天相比于频率没有变化所对应因果效应均值，即E(Ya20)-E(ya0),根据我们的模型E(Y20)=β0+β1*20+β2*400,因此，E(Y20-E(Y0)=β1*20+β2*400,因此我们需要估计β1和β2，因此我们需要使用逆概率权重SWA去构建一个虚拟人群，其中L不再是混杂变量，然后使用这个虚拟人群中拟合相关性模型E（Y｜A）=θ0+θ1*A+θ2*A^2

为了计算权重SWA=f（A）/f（A｜L），我们需要先估计f（A｜L），对于二分类的A，可以使用逻辑回归来估计他。对于连续变量A，f（A｜L）是概率密度函数，遗憾的是，一般情况下很能正确估计一个概率密度函数，因而对连续性治疗变量使用逆概率加权方法存在风险。

，我们假设概率密度 f (A| L)是一个正态分布，u=E（A｜L），标准差是σ，接下来我们用线性回归来估计不同L取值下的均值E[A|L]和残差标准差σ，我们假设分子中的密度f (A)也是正态分布的。因为效应估计对条件密度 f (A| L)的模型选择非常敏感,所以我们对连续变量使用逆概率加权时需要非常小心。

#####################################################
# PROGRAM 12.4
# Estimating the parameters of a marginal structural mean model
# with a continuous treatment Data from NHEFS
#####################################################

# Analysis restricted to subjects reporting <=25 cig/day at baseline
nhefs.nmv.s <- subset(nhefs.nmv, smokeintensity <=25)

# estimation of denominator of ip weights
den.fit.obj <- lm(smkintensity82_71 ~ as.factor(sex) + 
  as.factor(race) + age + I(age^2) + 
  as.factor(education) + smokeintensity + I(smokeintensity^2) +
  smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) + wt71  + 
  I(wt71^2), data = nhefs.nmv.s)
p.den <- predict(den.fit.obj, type = "response")
dens.den <- dnorm(nhefs.nmv.s$smkintensity82_71, p.den, summary(den.fit.obj)$sigma) # 逆概率权重的分母

# estimation of numerator of ip weights
num.fit.obj <- lm(smkintensity82_71 ~ 1, data = nhefs.nmv.s) # 逆概率加权的分子
p.num <- predict(num.fit.obj, type = "response")
dens.num <- dnorm(nhefs.nmv.s$smkintensity82_71, p.num, summary(num.fit.obj)$sigma)

nhefs.nmv.s$sw.a = dens.num/dens.den
summary(nhefs.nmv.s$sw.a)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.19    0.89    0.97    1.00    1.05    5.10

msm.sw.cont <- geeglm(wt82_71 ~ smkintensity82_71 + I(smkintensity82_71*smkintensity82_71), 
                 data=nhefs.nmv.s, weights=sw.a, id=seqn, corstr="independence")
summary(msm.sw.cont)

## 
## Call:
## geeglm(formula = wt82_71 ~ smkintensity82_71 + I(smkintensity82_71 * 
##     smkintensity82_71), data = nhefs.nmv.s, weights = sw.a, id = seqn, 
##     corstr = "independence")
## 
##  Coefficients:
##                                          Estimate  Std.err  Wald Pr(>|W|)    
## (Intercept)                               2.00452  0.29512 46.13  1.1e-11 ***
## smkintensity82_71                        -0.10899  0.03154 11.94  0.00055 ***
## I(smkintensity82_71 * smkintensity82_71)  0.00269  0.00242  1.24  0.26489    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)     60.5     4.5
## Number of clusters:   1162  Maximum cluster size: 1

beta <- coef(msm.sw.cont)
SE <- coef(summary(msm.sw.cont))[,2]
lcl <- beta-qnorm(0.975)*SE 
ucl <- beta+qnorm(0.975)*SE
cbind(beta, lcl, ucl)

##                                              beta      lcl      ucl
## (Intercept)                               2.00452  1.42610  2.58295
## smkintensity82_71                        -0.10899 -0.17080 -0.04718
## I(smkintensity82_71 * smkintensity82_71)  0.00269 -0.00204  0.00743

边缘结构模型的治疗变量也可以是一个二分变量。比如，我们想估计戒烟 A (1:是，0: 否)对死亡(1:是，0:否)的因果效应，我们可以用以下边缘结构 logistic 模型:

其中 exp (α1 ) 是戒烟者比上非戒烟者的死亡因果性比值比。在我们之前所述假设成立的情况下，

我们可以在逆概率权重构建的虚拟人群拟合logistic模型logitPr(D=1|A)=θ0+θ1*A 估计上述边缘结构logistic模型中的参数。

#####################################################
# PROGRAM 12.5
# Estimating the parameters of a marginal structural logistic model
# Data from NHEFS
#####################################################

table(nhefs.nmv$qsmk, nhefs.nmv$death) # qsmk表示是否是是治疗组 # death 表示是是否死亡

##    
##       0   1
##   0 963 200
##   1 312  91

# First, estimation of stabilized weights sw.a (same as in PROGRAM 12.3)
# Second, fit logistic model below # nhefs.nmv.s/nhefs.nmv
msm.logistic <- geeglm(death ~ qsmk, data=nhefs.nmv.s, weights=sw.a, # sw /sw.a
                      id=seqn, family=binomial(), corstr="independence")

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

summary(msm.logistic)

## 
## Call:
## geeglm(formula = death ~ qsmk, family = binomial(), data = nhefs.nmv.s, 
##     weights = sw.a, id = seqn, corstr = "independence")
## 
##  Coefficients:
##             Estimate Std.err   Wald Pr(>|W|)    
## (Intercept)  -1.5026  0.0981 234.80   <2e-16 ***
## qsmk          0.0369  0.1736   0.05     0.83    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)        1   0.082
## Number of clusters:   1162  Maximum cluster size: 1

beta <- coef(msm.logistic)
SE <- coef(summary(msm.logistic))[,2]
lcl <- beta-qnorm(0.975)*SE 
ucl <- beta+qnorm(0.975)*SE
cbind(beta, lcl, ucl)

##                beta    lcl    ucl
## (Intercept) -1.5026 -1.695 -1.310
## qsmk         0.0369 -0.303  0.377

12.5 效应修饰与边缘结构模型

当我们想估计的参数是目标人群中的因果效应均值时，边缘结构模型不会包含其他协变量。然而，当效应修饰存在时，我们可以在边缘结构模型中放入其他协变量，即使这些协变量不是混杂变量。假设戒烟的效应对不同性别V (1:女性，0:男性)不一样。为了验证这一假设，我们需要在边缘结构均值模型中加入协变量V :


如果β2不等于0，那么存在加法效应修饰。严格而言，这不再是边缘模型，因为存在其他协变量V， 然而我们依然称之为边缘结构模型。

我们可以在逆概率权重构建的虚拟人群中拟合线性模型```E（Y｜A，V）=θ0+θ1*A+θ2*A*V+θ3*V```
从而估计上述边缘模型中的参数值。此时向量 L 需要包含V ——即使V 不是混杂变量——以及其 他可能影响互换性的变量。

因为我们现在要考虑的是V 的每一分层中治疗的因果效应，所以我们稳定权重的分子可以是 f (A)或f (A|V)。如果分母是f（A｜V），置信区间会更加窄一些。
我们可以这样理解:因为分子分母中都含有V ，所以分子和分母的值更加接近，从而使得
 逆概率权重更加稳定，因而 95%置信区间更窄。
 此时，在估计权重分子的时候，我们会在 logistic模型中加入协变量V 。
 
 V 是 L 的一个子集，选择什么样的变量作为V 需要反映出研究者真切关心的研究问题。比如，只有当研究者确信V 是一个效应修饰因子，并且对V 的每一分层中的因果效应非常感兴趣时，V 才能被放入边缘结构模型之中。如果研究者决定把 L 中的所有变量全部放入边 缘结构模型之中，那逆概率加权就不再是必须的，因为此时正确设定的未加权回归模型也能完全 控制 L 带来的混杂出于这一原因，我们将包含了所有变量 L 的边缘结构模型 称为仿制的边缘结构模型。
 
 需要注意的是：如果我们要探索两个治疗变量 A 和 B 之间的交互作用，我们需要把 A 和 B 都放入边缘结构 模型之中，在计算逆概率权重时，分母中会包含这两个变量的联合分布。我们会对 A 和 B 都假设 互换性、正数性和一致性成立。


```r
#####################################################
# PROGRAM 12.6
# Assessing effect modification by sex using a marginal structural mean model
# Data from NHEFS
#####################################################

table(nhefs.nmv$sex)

## 
##   0   1 
## 762 804

# estimation of denominator of ip weights
denom.fit <- glm(qsmk ~ as.factor(sex) + as.factor(race) + age + I(age^2) + 
  as.factor(education) + smokeintensity + 
  I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) + 
  as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2), 
                 family = binomial(), data = nhefs.nmv) # 计算权重的分母
summary(denom.fit)

## 
## Call:
## glm(formula = qsmk ~ as.factor(sex) + as.factor(race) + age + 
##     I(age^2) + as.factor(education) + smokeintensity + I(smokeintensity^2) + 
##     smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) + 
##     wt71 + I(wt71^2), family = binomial(), data = nhefs.nmv)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.513  -0.791  -0.639   0.983   2.373  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -2.242519   1.380836   -1.62  0.10437    
## as.factor(sex)1       -0.527478   0.154050   -3.42  0.00062 ***
## as.factor(race)1      -0.839264   0.210067   -4.00  6.5e-05 ***
## age                    0.121205   0.051266    2.36  0.01807 *  
## I(age^2)              -0.000825   0.000536   -1.54  0.12404    
## as.factor(education)2 -0.028776   0.198351   -0.15  0.88465    
## as.factor(education)3  0.086432   0.178085    0.49  0.62744    
## as.factor(education)4  0.063601   0.273211    0.23  0.81592    
## as.factor(education)5  0.475961   0.226224    2.10  0.03538 *  
## smokeintensity        -0.077270   0.015250   -5.07  4.0e-07 ***
## I(smokeintensity^2)    0.001045   0.000287    3.65  0.00027 ***
## smokeyrs              -0.073597   0.027777   -2.65  0.00806 ** 
## I(smokeyrs^2)          0.000844   0.000463    1.82  0.06840 .  
## as.factor(exercise)1   0.354841   0.180135    1.97  0.04885 *  
## as.factor(exercise)2   0.395704   0.187240    2.11  0.03457 *  
## as.factor(active)1     0.031944   0.132937    0.24  0.81010    
## as.factor(active)2     0.176784   0.214972    0.82  0.41087    
## wt71                  -0.015236   0.026316   -0.58  0.56262    
## I(wt71^2)              0.000135   0.000163    0.83  0.40737    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1786.1  on 1565  degrees of freedom
## Residual deviance: 1676.9  on 1547  degrees of freedom
## AIC: 1715
## 
## Number of Fisher Scoring iterations: 4

pd.qsmk <- predict(denom.fit, type = "response")

# estimation of numerator of ip weights
numer.fit <- glm(qsmk~as.factor(sex), family = binomial(), data = nhefs.nmv) # 计算权重的分母
summary(numer.fit)

## 
## Call:
## glm(formula = qsmk ~ as.factor(sex), family = binomial(), data = nhefs.nmv)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.825  -0.825  -0.719   1.576   1.720  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.9016     0.0799  -11.28   <2e-16 ***
## as.factor(sex)1  -0.3202     0.1160   -2.76   0.0058 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1786.1  on 1565  degrees of freedom
## Residual deviance: 1778.4  on 1564  degrees of freedom
## AIC: 1782
## 
## Number of Fisher Scoring iterations: 4

pn.qsmk <- predict(numer.fit, type = "response")

nhefs.nmv$sw.a <- ifelse(nhefs.nmv$qsmk == 0, ((1-pn.qsmk)/(1-pd.qsmk)),
                     (pn.qsmk/pd.qsmk))

summary(nhefs.nmv$sw.a)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.29    0.88    0.96    1.00    1.08    3.80

sd(nhefs.nmv$sw.a)

## [1] 0.271

# Estimating parameters of a marginal structural mean model
msm.emm <- geeglm(wt82_71~as.factor(qsmk) + as.factor(sex) 
                  + as.factor(qsmk):as.factor(sex), data=nhefs.nmv, 
                  weights=sw.a, id=seqn, corstr="independence")
summary(msm.emm)

## 
## Call:
## geeglm(formula = wt82_71 ~ as.factor(qsmk) + as.factor(sex) + 
##     as.factor(qsmk):as.factor(sex), data = nhefs.nmv, weights = sw.a, 
##     id = seqn, corstr = "independence")
## 
##  Coefficients:
##                                  Estimate  Std.err  Wald Pr(>|W|)    
## (Intercept)                       1.78445  0.30984 33.17  8.5e-09 ***
## as.factor(qsmk)1                  3.52198  0.65707 28.73  8.3e-08 ***
## as.factor(sex)1                  -0.00872  0.44882  0.00     0.98    
## as.factor(qsmk)1:as.factor(sex)1 -0.15948  1.04608  0.02     0.88    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)     60.8    3.71
## Number of clusters:   1566  Maximum cluster size: 1

beta <- coef(msm.emm)
SE <- coef(summary(msm.emm))[,2]
lcl <- beta-qnorm(0.975)*SE 
ucl <- beta+qnorm(0.975)*SE
cbind(beta, lcl, ucl)

##                                      beta    lcl   ucl
## (Intercept)                       1.78445  1.177 2.392
## as.factor(qsmk)1                  3.52198  2.234 4.810
## as.factor(sex)1                  -0.00872 -0.888 0.871
## as.factor(qsmk)1:as.factor(sex)1 -0.15948 -2.210 1.891

12.6 删失和缺失值

当我们估计戒烟 A 对增重 Y 的因果效应时，我们将分析局限于 1566 名有体重变化信息的人群中。然而，还有 63(上面的代码中我们删除了这一部分数据) 名人员虽然符合入组标准，但是因为在随访中没有体重信息而被排除于分析之外。只选择没有结局缺失值的人——也即删失有缺失值的人——可能会引入选择偏移.

让我们用删失变量 C 表示在随访中的体重测量:1 表示没有测量(即被删失)，0 表示有测量(即没有被删失)。显然，我们的分析只能在没有被删失的人群中进行，即 C = 0 的人群中。

也就是说，我们在12.2小节和12.4小节中的拟合的(加权)回归模型不是


 因为失访导致的删失可能会引入选择偏移，所以我们希望不存在删失，并估计不存在删失时
的因果效应。

主要关注如何计算Wac和SWac

```r
#####################################################
# PROGRAM 12.7
# Estimating IP weights to adjust for selection bias due to censoring
# Data from NHEFS
#####################################################

table(nhefs$qsmk, nhefs$cens)

##    
##        0    1
##   0 1163   38
##   1  403   25

summary(nhefs[which(nhefs$cens==0),]$wt71)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    39.6    59.5    69.2    70.8    79.8   151.7

summary(nhefs[which(nhefs$cens==1),]$wt71)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    36.2    63.1    72.1    76.6    87.9   169.2

# estimation of denominator of ip weights for A
denom.fit <- glm(qsmk ~ as.factor(sex) + as.factor(race) + age + I(age^2) + 
  as.factor(education) + smokeintensity + 
  I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) + 
  as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2), 
                 family = binomial(), data = nhefs)
summary(denom.fit)

## 
## Call:
## glm(formula = qsmk ~ as.factor(sex) + as.factor(race) + age + 
##     I(age^2) + as.factor(education) + smokeintensity + I(smokeintensity^2) + 
##     smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) + 
##     wt71 + I(wt71^2), family = binomial(), data = nhefs)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.465  -0.804  -0.646   1.058   2.355  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -1.988902   1.241279   -1.60  0.10909    
## as.factor(sex)1       -0.507522   0.148232   -3.42  0.00062 ***
## as.factor(race)1      -0.850231   0.205872   -4.13  3.6e-05 ***
## age                    0.103013   0.048900    2.11  0.03515 *  
## I(age^2)              -0.000605   0.000507   -1.19  0.23297    
## as.factor(education)2 -0.098320   0.190655   -0.52  0.60607    
## as.factor(education)3  0.015699   0.170714    0.09  0.92673    
## as.factor(education)4 -0.042526   0.264276   -0.16  0.87216    
## as.factor(education)5  0.379663   0.220395    1.72  0.08495 .  
## smokeintensity        -0.065156   0.014759   -4.41  1.0e-05 ***
## I(smokeintensity^2)    0.000846   0.000276    3.07  0.00216 ** 
## smokeyrs              -0.073371   0.026996   -2.72  0.00657 ** 
## I(smokeyrs^2)          0.000838   0.000443    1.89  0.05867 .  
## as.factor(exercise)1   0.291412   0.173554    1.68  0.09314 .  
## as.factor(exercise)2   0.355052   0.179929    1.97  0.04846 *  
## as.factor(active)1     0.010875   0.129832    0.08  0.93324    
## as.factor(active)2     0.068312   0.208727    0.33  0.74346    
## wt71                  -0.012848   0.022283   -0.58  0.56423    
## I(wt71^2)              0.000121   0.000135    0.89  0.37096    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1876.3  on 1628  degrees of freedom
## Residual deviance: 1766.7  on 1610  degrees of freedom
## AIC: 1805
## 
## Number of Fisher Scoring iterations: 4

pd.qsmk <- predict(denom.fit, type = "response")

# estimation of numerator of ip weights for A
numer.fit <- glm(qsmk~1, family = binomial(), data = nhefs)
summary(numer.fit)

## 
## Call:
## glm(formula = qsmk ~ 1, family = binomial(), data = nhefs)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.781  -0.781  -0.781   1.635   1.635  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.0318     0.0563   -18.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1876.3  on 1628  degrees of freedom
## Residual deviance: 1876.3  on 1628  degrees of freedom
## AIC: 1878
## 
## Number of Fisher Scoring iterations: 4

pn.qsmk <- predict(numer.fit, type = "response")

# estimation of denominator of ip weights for C
denom.cens <- glm(cens ~ as.factor(qsmk) + as.factor(sex) + 
  as.factor(race) + age + I(age^2) + 
  as.factor(education) + smokeintensity + 
  I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) + 
  as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2), 
                  family = binomial(), data = nhefs)
summary(denom.cens)

## 
## Call:
## glm(formula = cens ~ as.factor(qsmk) + as.factor(sex) + as.factor(race) + 
##     age + I(age^2) + as.factor(education) + smokeintensity + 
##     I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) + as.factor(exercise) + 
##     as.factor(active) + wt71 + I(wt71^2), family = binomial(), 
##     data = nhefs)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.097  -0.287  -0.207  -0.157   2.996  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)   
## (Intercept)            4.014466   2.576106    1.56   0.1192   
## as.factor(qsmk)1       0.516867   0.287716    1.80   0.0724 . 
## as.factor(sex)1        0.057313   0.330278    0.17   0.8622   
## as.factor(race)1      -0.012271   0.452489   -0.03   0.9784   
## age                   -0.269729   0.117465   -2.30   0.0217 * 
## I(age^2)               0.002884   0.001114    2.59   0.0096 **
## as.factor(education)2 -0.440788   0.419399   -1.05   0.2933   
## as.factor(education)3 -0.164688   0.370547   -0.44   0.6567   
## as.factor(education)4  0.138447   0.569797    0.24   0.8080   
## as.factor(education)5 -0.382382   0.560181   -0.68   0.4949   
## smokeintensity         0.015712   0.034732    0.45   0.6510   
## I(smokeintensity^2)   -0.000113   0.000606   -0.19   0.8517   
## smokeyrs               0.078597   0.074958    1.05   0.2944   
## I(smokeyrs^2)         -0.000557   0.001032   -0.54   0.5894   
## as.factor(exercise)1  -0.971471   0.387810   -2.51   0.0122 * 
## as.factor(exercise)2  -0.583989   0.372313   -1.57   0.1168   
## as.factor(active)1    -0.247479   0.325455   -0.76   0.4470   
## as.factor(active)2     0.706583   0.396458    1.78   0.0747 . 
## wt71                  -0.087887   0.040012   -2.20   0.0281 * 
## I(wt71^2)              0.000635   0.000226    2.81   0.0049 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 533.36  on 1628  degrees of freedom
## Residual deviance: 465.36  on 1609  degrees of freedom
## AIC: 505.4
## 
## Number of Fisher Scoring iterations: 7

pd.cens <- 1-predict(denom.cens, type = "response")

# estimation of numerator of ip weights for C
numer.cens <- glm(cens~as.factor(qsmk), family = binomial(), data = nhefs)
summary(numer.cens)

## 
## Call:
## glm(formula = cens ~ as.factor(qsmk), family = binomial(), data = nhefs)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.347  -0.254  -0.254  -0.254   2.628  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -3.421      0.165  -20.75   <2e-16 ***
## as.factor(qsmk)1    0.641      0.264    2.43    0.015 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 533.36  on 1628  degrees of freedom
## Residual deviance: 527.76  on 1627  degrees of freedom
## AIC: 531.8
## 
## Number of Fisher Scoring iterations: 6

pn.cens <- 1-predict(numer.cens, type = "response")

nhefs$sw.a <- ifelse(nhefs$qsmk == 0, ((1-pn.qsmk)/(1-pd.qsmk)),
                     (pn.qsmk/pd.qsmk))
nhefs$sw.c <- pn.cens/pd.cens
nhefs$sw <- nhefs$sw.c*nhefs$sw.a

summary(nhefs$sw.a)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.33    0.86    0.95    1.00    1.08    4.21

sd(nhefs$sw.a)

## [1] 0.284

summary(nhefs$sw.c)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.94    0.98    0.99    1.01    1.01    7.58

sd(nhefs$sw.c)

## [1] 0.178

summary(nhefs$sw)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.35    0.86    0.94    1.01    1.08   12.86

sd(nhefs$sw)

## [1] 0.411

msm.sw <- geeglm(wt82_71~qsmk, data=nhefs, 
                  weights=sw, id=seqn, corstr="independence")
summary(msm.sw)

## 
## Call:
## geeglm(formula = wt82_71 ~ qsmk, data = nhefs, weights = sw, 
##     id = seqn, corstr = "independence")
## 
##  Coefficients:
##             Estimate Std.err Wald Pr(>|W|)    
## (Intercept)    1.662   0.233 51.0  9.3e-13 ***
## qsmk           3.496   0.526 44.2  2.9e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)     61.8    3.83
## Number of clusters:   1566  Maximum cluster size: 1

beta <- coef(msm.sw)
SE <- coef(summary(msm.sw))[,2]
lcl <- beta-qnorm(0.975)*SE 
ucl <- beta+qnorm(0.975)*SE
cbind(beta, lcl, ucl)

##             beta  lcl  ucl
## (Intercept) 1.66 1.21 2.12
## qsmk        3.50 2.47 4.53

13标准化和参数G-公式

在实践中，研究者可以在逆概率加权和标准化两种方法中选择一种用来估计因果效应。这两种方法都基于同样的可识别性条件，然而它们的模型假设却不一样。

在上一章，我们用逆概率加权估计了戒烟 A (1:是，0:否)对增重 Y 的(单位:kg)因果效应。本章我们会估计同样的因果效应，不过将用另一种方法，即标准化。我们的分析人群同样是 NHFES 中的 1629 名参与者，他们在 25-74 岁之间，参加了初访，并在 10 年之后参加了随访。在这些人当中，1566 名人员在初访和随访中都有体重测量，因而没有被删失( C = 0 )。

13.1 标准化

标准化均值是每一分层中条件均值的加权平均，而权重是每一分层的出现概率 Pr [L = l ]。也就是说，在计算标准化均值的时候，人数最多的分层占有的权重最多。

我们可以把治疗为 a 且未被删失的标准化均值表述为:

如果L是连续的，我们需要把Pr[L = l]替换为概率密度函数 fL (l)，同时把求和符合替换为积分符号。

13.2 通过模型估计结局均值

理想情况下，我们可以用非参数的方法估计条件均值E[Y | A=1,C 0,L= l].

但在高维数据中——比如我们戒烟的例子——我们不可能用非参数的方法估计 E[Y | A=1,C 0,L= l]。在我=们的例子中，L有上百万个分层，而我们只有403名戒烟者。因此我们需要借助模型。

为了得到上百万个L分层中E[Y|A=a,C 0,L= l]的参数=估计值，我们需要拟合一个线性模型，其中的因变量是增重，模型中会包含所有混杂变量。对连续性变量，诸如年龄、初访时体重、烟龄和吸烟频率，模型中会放入它们的一次项和二次项。也就是说，我们假设条件均值和这些连续性协变量的关系可以用抛物线表示。我们在模型中也包括了戒烟和吸烟频率的乘积项。也就是说，我们认为戒烟的效果会因吸烟频率的不同而不同，并且这一关系是线性的，同时戒烟的效果独立于其他变量。

################################################################
# PROGRAM 13.1
# Estimating the mean outcome within levels of treatment 
# and confounders: Data from NHEFS
################################################################

#install.packages("readxl") # install package if required
library("readxl")
nhefs <- read.csv("/Users/milin/Downloads/nhefs.csv")

# some preprocessing of the data
nhefs$cens <- ifelse(is.na(nhefs$wt82), 1, 0)

fit <- glm(wt82_71 ~ qsmk + sex + race + age + I(age*age) + as.factor(education)
           + smokeintensity + I(smokeintensity*smokeintensity) + smokeyrs
           + I(smokeyrs*smokeyrs) + as.factor(exercise) + as.factor(active)
           + wt71 + I(wt71*wt71) + qsmk*smokeintensity, data=nhefs)
summary(fit)

## 
## Call:
## glm(formula = wt82_71 ~ qsmk + sex + race + age + I(age * age) + 
##     as.factor(education) + smokeintensity + I(smokeintensity * 
##     smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) + 
##     as.factor(active) + wt71 + I(wt71 * wt71) + qsmk * smokeintensity, 
##     data = nhefs)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -42.06   -4.17   -0.34    3.89   44.61  
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        -1.588166   4.313036   -0.37  0.71276    
## qsmk                                2.559594   0.809149    3.16  0.00159 ** 
## sex                                -1.430272   0.468958   -3.05  0.00233 ** 
## race                                0.560110   0.581889    0.96  0.33591    
## age                                 0.359635   0.163319    2.20  0.02781 *  
## I(age * age)                       -0.006101   0.001726   -3.53  0.00042 ***
## as.factor(education)2               0.790444   0.607000    1.30  0.19304    
## as.factor(education)3               0.556312   0.556102    1.00  0.31728    
## as.factor(education)4               1.491570   0.832270    1.79  0.07330 .  
## as.factor(education)5              -0.194977   0.741369   -0.26  0.79259    
## smokeintensity                      0.049136   0.051725    0.95  0.34229    
## I(smokeintensity * smokeintensity) -0.000991   0.000938   -1.06  0.29110    
## smokeyrs                            0.134369   0.091712    1.47  0.14309    
## I(smokeyrs * smokeyrs)             -0.001866   0.001544   -1.21  0.22683    
## as.factor(exercise)1                0.295975   0.535153    0.55  0.58030    
## as.factor(exercise)2                0.353913   0.558859    0.63  0.52665    
## as.factor(active)1                 -0.947569   0.409934   -2.31  0.02094 *  
## as.factor(active)2                 -0.261378   0.684558   -0.38  0.70265    
## wt71                                0.045502   0.083371    0.55  0.58530    
## I(wt71 * wt71)                     -0.000965   0.000525   -1.84  0.06600 .  
## qsmk:smokeintensity                 0.046663   0.035145    1.33  0.18446    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 53.6)
## 
##     Null deviance: 97176  on 1565  degrees of freedom
## Residual deviance: 82763  on 1545  degrees of freedom
##   (63 observations deleted due to missingness)
## AIC: 10701
## 
## Number of Fisher Scoring iterations: 2

nhefs$predicted.meanY <- predict(fit, nhefs) # 得到了条件均值E[Y | A=1,C 0,L= l]的结果 

nhefs[which(nhefs$seqn==24770), c("predicted.meanY", "qsmk", "sex", 
                                 "race", "age", "education", 
                                 "smokeintensity", "smokeyrs", "exercise",
                                 "active", "wt71")]

##      predicted.meanY qsmk sex race age education smokeintensity smokeyrs
## 1582           0.342    0   0    0  26         4             15       12
##      exercise active wt71
## 1582        1      0  112

summary(nhefs$predicted.meanY[nhefs$cens==0])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -10.88    1.12    3.04    2.64    4.51    9.88

summary(nhefs$wt82_71[nhefs$cens==0])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -41.3    -1.5     2.6     2.6     6.7    48.5

有了这些假设后，我们得到A和L所有可能组合之下的E[Y|A=a,C 0,L= l]，同时也能。

得到数据中 403 名戒烟者和 1163 名非戒烟者的预测值。比如，如果某参与人员的信息是:非戒烟者，男性，白人，26 岁，大学肄业，12 年烟龄，吸烟频率 15 支/天，适当运动，体力良好，初访时体重 112kg，那么我们就能得到他的增重预测值是 0.34kg(碰巧编号为 24770 的参与者满足以上条件，你可以在看看他的预测值是多少)

。总而言之，在这 1566 名参与人员中，增重预测均值是 2.6kg，这与我们观测到的增重值相同，而取值范围是-41.3 到 48.5kg。

我们的目标是估计A=a时的，∑E[Y|A=a,C=0,L =l]*Pr[L l]。里的求和符合在更 l 正式的情况下应该被写作积分符号，因为 L 中的变量基本上是连续的，不能用简单的概率表示。不过无论用什么符号表示，我们现在已经估计了 A 和 L 所有组合下的 E [Y | A = a, C =0, L= l ]。

下一步我们需要把这些均值根据 L 中的取值 l 进行标准化。

13.3 根据混杂变量的分布对结局均值进行标准化

标准化均值是各条件均值E[Y|A=a,C 0,L= l]的加权=平均。当L中的所有变量都是离散的时候，条件均值对应的权重就是 L = l 的人数比例，即 Pr [L = l ]。原则上，我们可以非参数地计算这一概率，只需用 L = l 的人数除以总人数即可，我们在 2.3 小节中就是这么做的。然而，在高维数据中，我们需要用到其他方法。

幸运的是，我们其实并不需要去估计 Pr [L = l ]。我们只需要估计每个个体 i 在 L = l 时的条件，均值E[Y|A=a,C 0,L= l]即可，然后再计算平均值：

其中n是研究的总参与人数。这是因为

也可以被写作双重期望:

接下来我们将讲述如何在治疗组( A = 1 )和非治疗组( A = 0 )中估计

且不需要计算Pr[L=l].

这个方法有四步:扩充数据，结局回归，预测，以及标准化。

################################################################
# PROGRAM 13.2
# Standardizing the mean outcome to the baseline confounders
# Data from Table 2.2
################################################################

id <- c("Rheia", "Kronos", "Demeter", "Hades", "Hestia", "Poseidon", 
  "Hera", "Zeus", "Artemis", "Apollo", "Leto", "Ares", "Athena", 
  "Hephaestus", "Aphrodite", "Cyclope", "Persephone", "Hermes", 
  "Hebe", "Dionysus")
N <- length(id)
L <- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
A <- c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1)
Y <- c(0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0)
interv <- rep(-1, N)
observed <- cbind(L, A, Y, interv) # di一组
untreated <- cbind(L, rep(0, N), rep(NA, N), rep(0, N)) # 第二组，非治疗组，A=0
treated <- cbind(L, rep(1, N), rep(NA, N), rep(1, N))# 第三组，治疗组 ，A=1
table22 <- as.data.frame(rbind(observed, untreated, treated)) # 按照行合并数据集
table22$id <- rep(id, 3) # 设置ID
# 这属于第一步，扩充数据

# 第二部，用这个新数据集去拟合结局均值和治疗 A 以及混杂变量 L 的回归模型。注意到，其实只有第一个板块中的数据(也即实 际数据)会被用来拟合模型，这是因为第二和第三个板块中的结局数据都是缺失值。

glm.obj <- glm(Y~ A*L, data = table22)
summary(glm.obj)

## 
## Call:
## glm(formula = Y ~ A * L, data = table22)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6667  -0.2500   0.0417   0.3333   0.7500  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.50e-01   2.55e-01    0.98     0.34
## A           -4.16e-16   3.61e-01    0.00     1.00
## L            4.17e-01   3.90e-01    1.07     0.30
## A:L          3.24e-16   4.96e-01    0.00     1.00
## 
## (Dispersion parameter for gaussian family taken to be 0.26)
## 
##     Null deviance: 5.0000  on 19  degrees of freedom
## Residual deviance: 4.1667  on 16  degrees of freedom
##   (40 observations deleted due to missingness)
## AIC: 35.39
## 
## Number of Fisher Scoring iterations: 2

# 第三部，下一步我们需要用第一个板块拟合的模型去预测第二个和第三个板块的结局。(也就是，我 们将第二和第三个板块中的 L 与 A 和回归系数结合起来，从而得到结局 Y 的预测值。)
table22$predicted.meanY <- predict(glm.obj, table22)


# 第四步 ， 我们将会计算第二个板块中的预测值均值。
# 因为 60%的人是 L = 1 ，40%的是 L = 0 ，所 以 L = 1 的均值就会有更多权重。也就是说，第二个板块中的预测值均值就是非治疗组的标准化结 局均值。同理，第三个板块中的预测值均值是治疗组的标准化结局均值。这样一来，我们的计算 就结束了。
mean(table22$predicted.meanY[table22$interv==-1])

## [1] 0.5

mean(table22$predicted.meanY[table22$interv==0])

## [1] 0.5

mean(table22$predicted.meanY[table22$interv==1])

## [1] 0.5

################################################################

在实践中，用经验分布经计算标准化均值是一个可行的办法，尤其在 L 是高维向量的情况下。我们戒烟例子中所用的方法步骤和前几段所描述的大致相同。我们需要在原数据中添加第二和第三个板块，拟合E[Y|A=a,C 0,L= l]的模型=，然后生成预测值。第二个板块中预测值的均值是1.66，第三个板块的值是5.18 ，因果效应E（Ya1，c0）-E（Ya0，c0）就是3.5。

# PROGRAM 13.3
# Standardizing the mean outcome to the baseline confounders:
# Data from NHEFS
################################################################

# create a dataset with 3 copies of each subject
nhefs$interv <- -1 # 1st copy: equal to original one
# 用这个变量表示分组

interv0 <- nhefs # 2nd copy: treatment set to 0, outcome to missing 
interv0$interv <- 0 # 表示第二组
interv0$qsmk <- 0 # 设置A为0
interv0$wt82_71 <- NA

interv1 <- nhefs # 3rd copy: treatment set to 1, outcome to missing
interv1$interv <- 1 # 表示第三组
interv1$qsmk <- 1 # 表示A为1
interv1$wt82_71 <- NA #

onesample <- rbind(nhefs, interv0, interv1) # combining datasets # 合并数据集

# linear model to estimate mean outcome conditional on treatment and confounders
# parameters are estimated using original observations only (nhefs)
# parameter estimates are used to predict mean outcome for observations with 
# treatment set to 0 (interv=0) and to 1 (interv=1)

std <- glm(wt82_71 ~ qsmk + sex + race + age + I(age*age)
           + as.factor(education) + smokeintensity 
           + I(smokeintensity*smokeintensity) + smokeyrs 
           + I(smokeyrs*smokeyrs) + as.factor(exercise) 
           + as.factor(active) + wt71 + I(wt71*wt71) + I(qsmk*smokeintensity), 
           data=onesample)# 构建模型
summary(std)

## 
## Call:
## glm(formula = wt82_71 ~ qsmk + sex + race + age + I(age * age) + 
##     as.factor(education) + smokeintensity + I(smokeintensity * 
##     smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) + 
##     as.factor(active) + wt71 + I(wt71 * wt71) + I(qsmk * smokeintensity), 
##     data = onesample)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -42.06   -4.17   -0.34    3.89   44.61  
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        -1.588166   4.313036   -0.37  0.71276    
## qsmk                                2.559594   0.809149    3.16  0.00159 ** 
## sex                                -1.430272   0.468958   -3.05  0.00233 ** 
## race                                0.560110   0.581889    0.96  0.33591    
## age                                 0.359635   0.163319    2.20  0.02781 *  
## I(age * age)                       -0.006101   0.001726   -3.53  0.00042 ***
## as.factor(education)2               0.790444   0.607000    1.30  0.19304    
## as.factor(education)3               0.556312   0.556102    1.00  0.31728    
## as.factor(education)4               1.491570   0.832270    1.79  0.07330 .  
## as.factor(education)5              -0.194977   0.741369   -0.26  0.79259    
## smokeintensity                      0.049136   0.051725    0.95  0.34229    
## I(smokeintensity * smokeintensity) -0.000991   0.000938   -1.06  0.29110    
## smokeyrs                            0.134369   0.091712    1.47  0.14309    
## I(smokeyrs * smokeyrs)             -0.001866   0.001544   -1.21  0.22683    
## as.factor(exercise)1                0.295975   0.535153    0.55  0.58030    
## as.factor(exercise)2                0.353913   0.558859    0.63  0.52665    
## as.factor(active)1                 -0.947569   0.409934   -2.31  0.02094 *  
## as.factor(active)2                 -0.261378   0.684558   -0.38  0.70265    
## wt71                                0.045502   0.083371    0.55  0.58530    
## I(wt71 * wt71)                     -0.000965   0.000525   -1.84  0.06600 .  
## I(qsmk * smokeintensity)            0.046663   0.035145    1.33  0.18446    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 53.6)
## 
##     Null deviance: 97176  on 1565  degrees of freedom
## Residual deviance: 82763  on 1545  degrees of freedom
##   (3321 observations deleted due to missingness)
## AIC: 10701
## 
## Number of Fisher Scoring iterations: 2

onesample$predicted_meanY <- predict(std, onesample) # 进行预测

# estimate mean outcome in each of the groups interv=0, and interv=1
# this mean outcome is a weighted average of the mean outcomes in each combination 
# of values of treatment and confounders, that is, the standardized outcome
mean(onesample[which(onesample$interv==-1),]$predicted_meanY) # 计算

## [1] 2.56

mean(onesample[which(onesample$interv==0),]$predicted_meanY)

## [1] 1.66

mean(onesample[which(onesample$interv==1),]$predicted_meanY)

## [1] 5.18

我们需要自助法计算置信区间，置信区间为（2.6，4.5）

################################################################
# PROGRAM 13.4
# Computing the 95% confidence interval of the standardized means 
# and their difference: Data from NHEFS
################################################################

#install.packages("boot") # install package if required
library(boot) # 之后好好学学这个R包

# function to calculate difference in means
# indices -i表示按行，frequencies-f表示按照频率，weights-w 表示权重 ，这个参数主要设置如何抽样
standardization <- function(data, indices) {
  # create a dataset with 3 copies of each subject
  d <- data[indices,] # 1st copy: equal to original one，
  d$interv <- -1
  d0 <- d # 2nd copy: treatment set to 0, outcome to missing
  d0$interv <- 0
  d0$qsmk <- 0
  d0$wt82_71 <- NA
  d1 <- d # 3rd copy: treatment set to 1, outcome to missing
  d1$interv <- 1
  d1$qsmk <- 1
  d1$wt82_71 <- NA
  d.onesample <- rbind(d, d0, d1) # combining datasets
  
  # linear model to estimate mean outcome conditional on treatment and confounders
  # parameters are estimated using original observations only (interv= -1)
  # parameter estimates are used to predict mean outcome for observations with set 
  # treatment (interv=0 and interv=1)
  fit <- glm(wt82_71 ~ qsmk + sex + race + age + I(age*age) + 
               as.factor(education) + smokeintensity + 
               I(smokeintensity*smokeintensity) + smokeyrs + I(smokeyrs*smokeyrs) +
               as.factor(exercise) + as.factor(active) + wt71 + I(wt71*wt71), 
             data=d.onesample)
  
  d.onesample$predicted_meanY <- predict(fit, d.onesample)
  
  # estimate mean outcome in each of the groups interv=-1, interv=0, and interv=1
  return(c(mean(d.onesample$predicted_meanY[d.onesample$interv==-1]),
           mean(d.onesample$predicted_meanY[d.onesample$interv==0]),
           mean(d.onesample$predicted_meanY[d.onesample$interv==1]),
           mean(d.onesample$predicted_meanY[d.onesample$interv==1])-
             mean(d.onesample$predicted_meanY[d.onesample$interv==0])))
}

# bootstrap
results <- boot(data=nhefs, statistic=standardization, R=5)

# generating confidence intervals
se <- c(sd(results$t[,1]), sd(results$t[,2]), 
        sd(results$t[,3]), sd(results$t[,4]))
mean <- results$t0 # t0 是什么
ll <- mean - qnorm(0.975)*se
ul <- mean + qnorm(0.975)*se

bootstrap <- data.frame(cbind(c("Observed", "No Treatment", "Treatment", 
                                "Treatment - No Treatment"), mean, se, ll, ul))
bootstrap

##                         V1             mean                se               ll
## 1                 Observed 2.56188497106103 0.249348302062059 2.07317127941318
## 2             No Treatment 1.65212306626746 0.166328637079571 1.32612492799386
## 3                Treatment 5.11474489549346 0.512935487268879 4.10940981405396
## 4 Treatment - No Treatment 3.46262182922601 0.418650414071998 2.64208209553211
##                 ul
## 1 3.05059866270888
## 2 1.97812120454105
## 3 6.12007997693297
## 4  4.2831615629199

13.4 逆概率加权or 标准化

在进行因果推理的时候，可以尝试使用多种方法进行分析。逆概率加权均值和标准化均值，只有在不使用模型的时候，才是相等的，否则不一定相等。

在实践中，或多或少的模型错误设定总是不可避免，而错误设定会带来偏移。治疗模型(逆概率加权)和结局模型(标准化)的错误设定给效应估计带来的偏移影响不尽相同。因此，逆概率加权得到的效应估计会和标准化得到的效应估计有所区别。如果逆概率加权和标准化两种方法得到结果差距很大，会让我们警觉至少其中一种方法存在严重的模型错误设定。如果两种方法得到的结果差距不大，虽然不能完全排除存在模型错误设定，但可以侧面说明模型错误设定带来的影响不是很严重——这是因为两种方法的模型都错误设定且导致的偏移影响差不多，这一概率实在太低了。

。如果我们只对人群的一个子集感兴趣，那么我们应该将计算局限在这个子集当中。比如，如果我们对性别的修饰效应感兴趣，我们需要在女性和男性中分别计算标准化均值。逆概率加权和标准化两种方法都能用来计算人群子集的因果效应。

13.5 如何看待估计值

本章和上一章的分析只是我们在现实数据中的初步尝试。在逆概率加权和标准化方法中，我们得到同样的结果:如果每个参与人员都戒烟，将会平均增重 5.2kg，而如果每个参与人员都不戒烟，将会平均增重 1.7kg。两种方法得到的戒烟因果效应均值都是 3.5kg(95%置信区间:2.5 到 4.5)。

不同方法得到同样的结果，这让我们的结果更加可信，这是因为每种方法的模型假设都不一样。然而，观测性研究得到的效应估计总是有严重缺陷。我们对目标人群的效应估计依然需要其他假设条件。我们将这些条件分成三组。

1.第一，可识别性条件，即互换性、正数性和一致性 2. 第二，所有变量都应该被正确测量。 3. 第三，所有数据分析中的模型都应该是正确设定的。如果连续变量年龄的真实函数形式不是抛物线，而是另一种复杂曲线，那么即使 L 中的所有变量都是正确测定的，但因为模型设定错误，逆概率加权依然不能调整所有混杂。模型错误设定带来的效果就如混杂变量的测量误差一样。

（因果推断的有效性依赖于以下几个条件:互换性，正数性，一致性。并不涉及测量误差和模型错误设定。）

保证以上条件成立，或者近似成立，是研究者的主要任务。如果这些条件不一定成立，那数据分析就是无意义的。问题是，没有人能保证这些条件都能完美成立。未测的混杂变量、不重叠的混杂变量分布、劣定的干预措施、误测的变量、以及错误设定的模型都是我们在效应估计中常见的问题。其中某些问题可以用经验方法解决，但是其他问题则需要专业判断，因而会被其他人批评。遗憾的是，我们的数据并不能反驳这些批评。比如，我们可以尝试使用不同的模型，但是我们不可能去调整没有测量的变量。因果推断依赖于以上条件。这些条件看起来很美好，但是却不能实证地验证。因果推断的重要问题是我们不可能观察到所有的反事实数据，因此我们用上述假设去近似得到这些数据。我们越偏离这些假设，那我们的效应估计也就越偏离真实因果效应。因此，对观察性研究中得到的因果推断必须保持谨慎怀疑。实际中，我们需要依据上面提到的假设，对现实中的因果推断进行逐条讨论，从而保证因果推断的严肃性。我们需要像对待效应估计一样，严肃对待这些条件假设。

14 G-估计和结构嵌入模型

逆概率加权、标准化和 G-估算经常被统称为 G-方法。因为它们被用于广义(Generalized) 的治疗效果比较，且可以涉及不同时间下的治疗。

14.1 互换性

有界互换性意味着，在 L 取值相同的子人群中，非戒烟者 ( A = 0 )如果戒烟了，那么他们的增重均值会和戒烟者( A = 1 )一样。换句话说，有界互换性意味着现实中的戒烟者和非戒烟者如果戒烟状态一样，那么这两个组的结局分布相同。

把有界互换性表述为概率形式将有助于我们理解接下来要说的 G-估算。具体而言，假设我们要拟合下述参数模型来预测接受治疗的概率:

当然，我们在现实世界中不能拟合这个模型，因为我们不知道所有人的反事实结局Ya0.

假设我们知道了、拟合了上述模型，并且有界互换性成立、模型也设定正确，那么参数α1 会是什么样呢?正确答案是 0，这是因为ya0并不能用来预测参与人员是否接受了治疗。接下来，我们将介绍 G-估算的另外一半:结构模型。

14.2 结局均值和结构嵌入模型

我们想估计在L的每一分层中，治疗A的因果效应均值,也就是：

我们可以写作：

对因果效应构建结构模型，就有:

一般而言， L 可能存在效应修饰作用.在有界互换性下，结构模型也能被表述成:

这被称为结构嵌入均值模型。与这些参数模型相比，结构嵌入模型是半参数化的，因为这个模型中没有截距项和 L 的效应项，也即没有 β0 和 β3 L 两项。

当我们使用 G-估算的时候，我们需要先使用逆概率加权调整删失。在实践中，这意味着我们需要先使用逆概率加权去构建一个虚拟人群，其中没有人被删失，然后再在虚拟人群中使用G-估算。

我们假设控制了 L 之后，被删失的和未被删失的人群是可互换的。在这一假设之下，虚拟人群中的结构模型将是:

对于连续性变量，模型需要假设治疗 A 对结局 Y 的剂量反应函数。比如:

或者其他任意光滑函数，比如样条

14.3 保序性

假设我们知道所有人的Y a=1 和Y a=0 ，我们可以就根据Y a=1 和 Y a=0 进行排序，并得到两份排序名单。如果两份名单的顺序是一样的，那我们就把这称为“保序性”。

如果在加法尺度上，治疗 A 对结局 Y 的效应在所有个体中都一样，那我们会说加成保序性成立。比如，如果戒烟让每个人都增重3kg，那根据Ya=0的排序也就等于根据Ya=1的排序。

在结构嵌入模型中，我们只会在意 L 分层中的加成保序性。如果在 L 的所有分层中， A 对 Y 的效应都相等，那么我们会说条件加成保序性成立。

一个(条件加成)保序的结构模型如下所示:

其中ψ1+ψ2l是L=l的时候因果效应的大小。也就是说，对于个体i，Ya1等于Ya0+ψ1+ψ2l。个体在没有治疗的时候，反事实结局Ya0移动ψ1+ψ2l个单位就能够得到有治疗时候的反事实结局。

对大多数治疗和结局而言，个体的因果效应并不是一个常数，甚至不可能接近常数。因而 (条件加成)保序性几乎不可能成立。

因为保序性不可能成立，所以我们的因果推断方法不能依赖于保序性。实际上，本书所讨论的方法都不需要保序性。

基于保序性的模型需要一个非常强的假设:在 L 的同一分层中，个体的治疗因果效应是一个常数。而我们在现实中并不希望使用这么一个不切实际的模型。不过在下一小节我们将用保序性介绍 G-估算，这只是因为用保序性能更好地解释 G-估算，并且不管保序性是否成立，G-估算的具体步骤都是一样的。也就是说 G-估算不一定需要保序性假设。同时要注意到，保序性结构模型也是一个结构均值模型——个体从Y a=0 移动到Y a=1 距离的均值等于个体移动的距离。

14.4 G-估算

假设我们的目标是估计结构嵌入模型

为了简便，这里只考虑又一个参数的模型。

我们同时也假设加成保序性成立，也即对所有个体i有。

因而个体的因果效应ψ 1=β1,这也是我们想估计的值。我们进一步将保序性模型中的下角标 i 去掉，

这是因为这一模型对所有个体都适用。接下来我们通过移项得到

G-估算的第一步是将模型和观测数据联系起来。因而，保序结构模型就意味着每个个体的反事实结局 Y a=0 可以表示为观测数据和未知参数ψ1的一个函数

如果这个模型是正确的，并且我们知道ψ 1 的值，那我们就能计算每个个体没有治疗时的反事实结局Y a=0 。现在我们还不知道ψ1 的值，我们的目标就是估计ψ1 。

# 计算权重
################################################################
# PROGRAM 14.1
# Preprocessing, ranks of extreme observations, IP weights for censoring
# Data from NHEFS
################################################################

#install.packages("readxl") # install package if required
library("readxl")
nhefs <- read.csv("/Users/milin/Downloads/nhefs.csv")

# some processing of the data
nhefs$cens <- ifelse(is.na(nhefs$wt82), 1, 0) # 标记缺失值

# ranking of extreme observations
#install.packages("Hmisc")
library(Hmisc)

## Loading required package: lattice

## 
## Attaching package: 'lattice'

## The following object is masked from 'package:boot':
## 
##     melanoma

## Loading required package: survival

## 
## Attaching package: 'survival'

## The following object is masked from 'package:boot':
## 
##     aml

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

describe(nhefs$wt82_71) # 对数据集进行描述性分析

## nhefs$wt82_71 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1566       63     1510        1    2.638    8.337   -9.752   -6.292 
##      .25      .50      .75      .90      .95 
##   -1.478    2.604    6.690   11.117   14.739 
## 
## lowest : -41.3 -30.5 -30.1 -29.0 -26.0, highest:  34.0  37.0  37.7  47.5  48.5

# estimation of denominator of ip weights for C
# C的分母权重 
cw.denom <- glm(cens==0 ~ qsmk + sex + race + age + I(age^2) 
                     + as.factor(education) + smokeintensity + I(smokeintensity^2) 
                     + smokeyrs + I(smokeyrs^2) + as.factor(exercise) 
                     + as.factor(active) + wt71 + I(wt71^2), 
                     data = nhefs, family = binomial("logit"))

summary(cw.denom)

## 
## Call:
## glm(formula = cens == 0 ~ qsmk + sex + race + age + I(age^2) + 
##     as.factor(education) + smokeintensity + I(smokeintensity^2) + 
##     smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) + 
##     wt71 + I(wt71^2), family = binomial("logit"), data = nhefs)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.996   0.157   0.207   0.287   1.097  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)   
## (Intercept)           -4.014466   2.576106   -1.56   0.1192   
## qsmk                  -0.516867   0.287716   -1.80   0.0724 . 
## sex                   -0.057313   0.330278   -0.17   0.8622   
## race                   0.012271   0.452489    0.03   0.9784   
## age                    0.269729   0.117465    2.30   0.0217 * 
## I(age^2)              -0.002884   0.001114   -2.59   0.0096 **
## as.factor(education)2  0.440788   0.419399    1.05   0.2933   
## as.factor(education)3  0.164688   0.370547    0.44   0.6567   
## as.factor(education)4 -0.138447   0.569797   -0.24   0.8080   
## as.factor(education)5  0.382382   0.560181    0.68   0.4949   
## smokeintensity        -0.015712   0.034732   -0.45   0.6510   
## I(smokeintensity^2)    0.000113   0.000606    0.19   0.8517   
## smokeyrs              -0.078597   0.074958   -1.05   0.2944   
## I(smokeyrs^2)          0.000557   0.001032    0.54   0.5894   
## as.factor(exercise)1   0.971471   0.387810    2.51   0.0122 * 
## as.factor(exercise)2   0.583989   0.372313    1.57   0.1168   
## as.factor(active)1     0.247479   0.325455    0.76   0.4470   
## as.factor(active)2    -0.706583   0.396458   -1.78   0.0747 . 
## wt71                   0.087887   0.040012    2.20   0.0281 * 
## I(wt71^2)             -0.000635   0.000226   -2.81   0.0049 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 533.36  on 1628  degrees of freedom
## Residual deviance: 465.36  on 1609  degrees of freedom
## AIC: 505.4
## 
## Number of Fisher Scoring iterations: 7

nhefs$pd.c <- predict(cw.denom, nhefs, type="response") # 进行预测
nhefs$wc <- ifelse(nhefs$cens==0, 1/nhefs$pd.c, NA)  #  observations with cens=1 only contribute to censoring models


##################################################################

#  PROGRAM 14.2
# G-estimation of a 1-parameter structural nested mean model
# Brute force search
# Data from NHEFS
##################################################################

####################################################
# G-estimation: Checking one possible value of psi #
####################################################
#install.packages("geepack")
library("geepack")

nhefs$psi <- 3.446
nhefs$Hpsi <- nhefs$wt82_71 - nhefs$psi*nhefs$qsmk

fit <- geeglm(qsmk ~ sex + race + age + I(age*age) + as.factor(education)
           + smokeintensity + I(smokeintensity*smokeintensity) + smokeyrs
           + I(smokeyrs*smokeyrs) + as.factor(exercise) + as.factor(active)
           + wt71 + I(wt71*wt71) + Hpsi, family=binomial, data=nhefs,
           weights=wc, id=seqn, corstr="independence")

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

summary(fit)

## 
## Call:
## geeglm(formula = qsmk ~ sex + race + age + I(age * age) + as.factor(education) + 
##     smokeintensity + I(smokeintensity * smokeintensity) + smokeyrs + 
##     I(smokeyrs * smokeyrs) + as.factor(exercise) + as.factor(active) + 
##     wt71 + I(wt71 * wt71) + Hpsi, family = binomial, data = nhefs, 
##     weights = wc, id = seqn, corstr = "independence")
## 
##  Coefficients:
##                                     Estimate   Std.err  Wald Pr(>|W|)    
## (Intercept)                        -2.40e+00  1.33e+00  3.27  0.07060 .  
## sex                                -5.14e-01  1.54e-01 11.19  0.00082 ***
## race                               -8.61e-01  2.10e-01 16.83  4.1e-05 ***
## age                                 1.15e-01  5.02e-02  5.26  0.02178 *  
## I(age * age)                       -7.59e-04  5.30e-04  2.06  0.15162    
## as.factor(education)2              -2.89e-02  1.96e-01  0.02  0.88286    
## as.factor(education)3               8.77e-02  1.73e-01  0.26  0.61133    
## as.factor(education)4               6.64e-02  2.70e-01  0.06  0.80565    
## as.factor(education)5               4.71e-01  2.25e-01  4.40  0.03604 *  
## smokeintensity                     -7.83e-02  1.46e-02 28.63  8.7e-08 ***
## I(smokeintensity * smokeintensity)  1.07e-03  2.65e-04 16.37  5.2e-05 ***
## smokeyrs                           -7.11e-02  2.64e-02  7.26  0.00705 ** 
## I(smokeyrs * smokeyrs)              8.15e-04  4.49e-04  3.30  0.06938 .  
## as.factor(exercise)1                3.36e-01  1.83e-01  3.38  0.06584 .  
## as.factor(exercise)2                3.80e-01  1.89e-01  4.05  0.04419 *  
## as.factor(active)1                  3.41e-02  1.34e-01  0.06  0.79878    
## as.factor(active)2                  2.13e-01  2.12e-01  1.01  0.31431    
## wt71                               -7.66e-03  2.56e-02  0.09  0.76496    
## I(wt71 * wt71)                      8.66e-05  1.58e-04  0.30  0.58423    
## Hpsi                               -1.90e-06  8.84e-03  0.00  0.99983    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)    0.997  0.0672
## Number of clusters:   1566  Maximum cluster size: 1

##########################################################
# G-estimation: Checking multiple possible values of psi #
##########################################################

#install.packages("geepack")

grid <- seq(from = 2,to = 5, by = 0.1)
j = 0
Hpsi.coefs <- cbind(rep(NA,length(grid)), rep(NA, length(grid)))
colnames(Hpsi.coefs) <- c("Estimate", "p-value")

for (i in grid){
  psi = i
  j = j+1
  nhefs$Hpsi <- nhefs$wt82_71 - psi * nhefs$qsmk 
  
  gest.fit <- geeglm(qsmk ~ sex + race + age + I(age*age) + as.factor(education)
                  + smokeintensity + I(smokeintensity*smokeintensity) + smokeyrs
                  + I(smokeyrs*smokeyrs) + as.factor(exercise) + as.factor(active)
                  + wt71 + I(wt71*wt71) + Hpsi, family=binomial, data=nhefs,
                  weights=wc, id=seqn, corstr="independence")
  Hpsi.coefs[j,1] <- summary(gest.fit)$coefficients["Hpsi", "Estimate"]
  Hpsi.coefs[j,2] <- summary(gest.fit)$coefficients["Hpsi", "Pr(>|W|)"]
}

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

Hpsi.coefs

##        Estimate p-value
##  [1,]  0.026722 0.00177
##  [2,]  0.024895 0.00358
##  [3,]  0.023066 0.00696
##  [4,]  0.021234 0.01303
##  [5,]  0.019401 0.02342
##  [6,]  0.017565 0.04043
##  [7,]  0.015725 0.06701
##  [8,]  0.013883 0.10663
##  [9,]  0.012036 0.16288
## [10,]  0.010186 0.23898
## [11,]  0.008331 0.33705
## [12,]  0.006471 0.45743
## [13,]  0.004607 0.59823
## [14,]  0.002737 0.75520
## [15,]  0.000862 0.92210
## [16,] -0.001018 0.90854
## [17,] -0.002904 0.74436
## [18,] -0.004797 0.59219
## [19,] -0.006695 0.45717
## [20,] -0.008600 0.34236
## [21,] -0.010511 0.24868
## [22,] -0.012428 0.17524
## [23,] -0.014352 0.11984
## [24,] -0.016283 0.07958
## [25,] -0.018221 0.05135
## [26,] -0.020165 0.03222
## [27,] -0.022116 0.01968
## [28,] -0.024074 0.01171
## [29,] -0.026039 0.00679
## [30,] -0.028011 0.00385
## [31,] -0.029989 0.00213

##################################################################

14.5 两个或多个参数的结构嵌入模型

例如，我们认为吸烟频率V能够修饰戒烟的因果效应，此时结构嵌入模型是：

其对应的保序模型则是：

因此我们可以拟合如下logistic 模型：

在这个模型中，我们需要找到可能的ψ 1† 和ψ 2† 组合，使得参数 α1 和 α 2 都等于 0。

#  PROGRAM 14.3
# G-estimation for 2-parameter structural nested mean model
# Closed form estimator
# Data from NHEFS
##################################################################

##########################################################
# G-estimation: Closed form estimator linear mean models #
##########################################################
logit.est <- glm(qsmk ~ sex + race + age + I(age^2) + as.factor(education) 
                 + smokeintensity + I(smokeintensity^2) + smokeyrs 
                 + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) 
                 + wt71 + I(wt71^2), data = nhefs, weight = wc, 
                 family = binomial())

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

summary(logit.est)

## 
## Call:
## glm(formula = qsmk ~ sex + race + age + I(age^2) + as.factor(education) + 
##     smokeintensity + I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) + 
##     as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2), 
##     family = binomial(), data = nhefs, weights = wc)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.529  -0.808  -0.650   1.029   2.417  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -2.40e+00   1.31e+00   -1.83  0.06743 .  
## sex                   -5.14e-01   1.50e-01   -3.42  0.00062 ***
## race                  -8.61e-01   2.06e-01   -4.18  2.9e-05 ***
## age                    1.15e-01   4.95e-02    2.33  0.01992 *  
## I(age^2)              -7.59e-04   5.14e-04   -1.48  0.13953    
## as.factor(education)2 -2.89e-02   1.93e-01   -0.15  0.88079    
## as.factor(education)3  8.77e-02   1.73e-01    0.51  0.61244    
## as.factor(education)4  6.64e-02   2.66e-01    0.25  0.80301    
## as.factor(education)5  4.71e-01   2.21e-01    2.13  0.03314 *  
## smokeintensity        -7.83e-02   1.49e-02   -5.27  1.4e-07 ***
## I(smokeintensity^2)    1.07e-03   2.78e-04    3.85  0.00012 ***
## smokeyrs              -7.11e-02   2.71e-02   -2.63  0.00862 ** 
## I(smokeyrs^2)          8.15e-04   4.45e-04    1.83  0.06722 .  
## as.factor(exercise)1   3.36e-01   1.75e-01    1.92  0.05467 .  
## as.factor(exercise)2   3.80e-01   1.82e-01    2.09  0.03637 *  
## as.factor(active)1     3.41e-02   1.30e-01    0.26  0.79337    
## as.factor(active)2     2.13e-01   2.06e-01    1.04  0.30033    
## wt71                  -7.66e-03   2.46e-02   -0.31  0.75530    
## I(wt71^2)              8.66e-05   1.51e-04    0.57  0.56586    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1872.2  on 1565  degrees of freedom
## Residual deviance: 1755.6  on 1547  degrees of freedom
##   (63 observations deleted due to missingness)
## AIC: 1719
## 
## Number of Fisher Scoring iterations: 4

nhefs$pqsmk <- predict(logit.est, nhefs, type = "response")
describe(nhefs$pqsmk)

## nhefs$pqsmk 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1629        0     1629        1   0.2622   0.1302   0.1015   0.1261 
##      .25      .50      .75      .90      .95 
##   0.1780   0.2426   0.3251   0.4221   0.4965 
## 
## lowest : 0.0514 0.0516 0.0544 0.0558 0.0593, highest: 0.6721 0.6864 0.7139 0.7333 0.7891

summary(nhefs$pqsmk)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.051   0.178   0.243   0.262   0.325   0.789

# solve sum(w_c * H(psi) * (qsmk - E[qsmk | L]))  = 0
# for a single psi and H(psi) = wt82_71 - psi * qsmk
# this can be solved as psi = sum( w_c * wt82_71 * (qsmk - pqsmk)) / sum(w_c * qsmk * (qsmk - pqsmk))

nhefs.c <- nhefs[which(!is.na(nhefs$wt82)),]
with(nhefs.c, sum(wc*wt82_71*(qsmk-pqsmk)) / sum(wc*qsmk*(qsmk - pqsmk)))

## [1] 3.45

#############################################################
# G-estimation: Closed form estimator for 2-parameter model #
#############################################################

diff = with(nhefs.c, qsmk - pqsmk)
diff2 = with(nhefs.c, wc * diff)

lhs = matrix(0,2,2)
lhs[1,1] = with(nhefs.c, sum(qsmk * diff2))
lhs[1,2] = with(nhefs.c, sum(qsmk * smokeintensity  * diff2))
lhs[2,1] = with(nhefs.c, sum(qsmk * smokeintensity * diff2))
lhs[2,2] = with(nhefs.c, sum(qsmk * smokeintensity * smokeintensity * diff2))

rhs = matrix(0,2,1)
rhs[1] = with(nhefs.c, sum(wt82_71 * diff2))
rhs[2] = with(nhefs.c, sum(wt82_71 * smokeintensity * diff2))

psi = t(solve(lhs,rhs))
psi

##      [,1] [,2]
## [1,] 2.86 0.03

15 结局回归和倾向性评分

已经有无数著作详细介绍并运用了这两种方法。然而，这两种方法只在较简单的情形中适用。当情形更复杂一些，比如涉及时异性治疗的时候，这两种方法就不再适用。在本书第三部分我们将主要讨论 G-方法。本章将介绍结局回归和倾向性评分这两种方法。不过谨记，这两种方法不适用于复杂的纵向数据。

15.1 结局回归

在前三章，我们讨论了逆概率加权，标准化，以及 G-估算三种方法，并用它们估计了戒烟 A (治疗)对增重Y (结局)的因果效应。

rm(list = ls())
#################################################################################
# Program 15.1                                                                  
# Estimating the average causal effect within levels of confounders             
# under the assumption of effect-measure modification by smoking intensity ONLY 
# Data from NHEFS                                                               
#################################################################################

#install.packages("readxl") # install package if required
library("readxl")

nhefs <- read.csv("/Users/milin/Downloads/nhefs.csv")
nhefs$cens <- ifelse(is.na(nhefs$wt82), 1, 0)

# regression on covariates, allowing for some effect modification
fit <- glm(wt82_71 ~ qsmk + sex + race + age + I(age*age) + as.factor(education)
           + smokeintensity + I(smokeintensity*smokeintensity) + smokeyrs
           + I(smokeyrs*smokeyrs) + as.factor(exercise) + as.factor(active)
           + wt71 + I(wt71*wt71) + I(qsmk*smokeintensity), data=nhefs)
summary(fit)

## 
## Call:
## glm(formula = wt82_71 ~ qsmk + sex + race + age + I(age * age) + 
##     as.factor(education) + smokeintensity + I(smokeintensity * 
##     smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) + 
##     as.factor(active) + wt71 + I(wt71 * wt71) + I(qsmk * smokeintensity), 
##     data = nhefs)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -42.06   -4.17   -0.34    3.89   44.61  
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        -1.588166   4.313036   -0.37  0.71276    
## qsmk                                2.559594   0.809149    3.16  0.00159 ** 
## sex                                -1.430272   0.468958   -3.05  0.00233 ** 
## race                                0.560110   0.581889    0.96  0.33591    
## age                                 0.359635   0.163319    2.20  0.02781 *  
## I(age * age)                       -0.006101   0.001726   -3.53  0.00042 ***
## as.factor(education)2               0.790444   0.607000    1.30  0.19304    
## as.factor(education)3               0.556312   0.556102    1.00  0.31728    
## as.factor(education)4               1.491570   0.832270    1.79  0.07330 .  
## as.factor(education)5              -0.194977   0.741369   -0.26  0.79259    
## smokeintensity                      0.049136   0.051725    0.95  0.34229    
## I(smokeintensity * smokeintensity) -0.000991   0.000938   -1.06  0.29110    
## smokeyrs                            0.134369   0.091712    1.47  0.14309    
## I(smokeyrs * smokeyrs)             -0.001866   0.001544   -1.21  0.22683    
## as.factor(exercise)1                0.295975   0.535153    0.55  0.58030    
## as.factor(exercise)2                0.353913   0.558859    0.63  0.52665    
## as.factor(active)1                 -0.947569   0.409934   -2.31  0.02094 *  
## as.factor(active)2                 -0.261378   0.684558   -0.38  0.70265    
## wt71                                0.045502   0.083371    0.55  0.58530    
## I(wt71 * wt71)                     -0.000965   0.000525   -1.84  0.06600 .  
## I(qsmk * smokeintensity)            0.046663   0.035145    1.33  0.18446    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 53.6)
## 
##     Null deviance: 97176  on 1565  degrees of freedom
## Residual deviance: 82763  on 1545  degrees of freedom
##   (63 observations deleted due to missingness)
## AIC: 10701
## 
## Number of Fisher Scoring iterations: 2

# (step 1) build the contrast matrix with all zeros
# this function builds the blank matrix 
# install.packages("multcomp") # install packages if necessary
library("multcomp")

## Warning: package 'multcomp' was built under R version 4.1.2

## Loading required package: mvtnorm

## Loading required package: TH.data

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

## 
## Attaching package: 'TH.data'

## The following object is masked from 'package:MASS':
## 
##     geyser

makeContrastMatrix <- function(model, nrow, names) {
  m <- matrix(0, nrow = nrow, ncol = length(coef(model)))
  colnames(m) <- names(coef(model))
  rownames(m) <- names
  return(m)
}
K1 <- makeContrastMatrix(fit, 2, c('Effect of Quitting Smoking at Smokeintensity of 5',
                                      'Effect of Quitting Smoking at Smokeintensity of 40'))
# (step 2) fill in the relevant non-zero elements 
K1[1:2, 'qsmk'] <- 1
K1[1:2, 'I(qsmk * smokeintensity)'] <- c(5, 40)

# (step 3) check the contrast matrix
K1

##                                                    (Intercept) qsmk sex race
## Effect of Quitting Smoking at Smokeintensity of 5            0    1   0    0
## Effect of Quitting Smoking at Smokeintensity of 40           0    1   0    0
##                                                    age I(age * age)
## Effect of Quitting Smoking at Smokeintensity of 5    0            0
## Effect of Quitting Smoking at Smokeintensity of 40   0            0
##                                                    as.factor(education)2
## Effect of Quitting Smoking at Smokeintensity of 5                      0
## Effect of Quitting Smoking at Smokeintensity of 40                     0
##                                                    as.factor(education)3
## Effect of Quitting Smoking at Smokeintensity of 5                      0
## Effect of Quitting Smoking at Smokeintensity of 40                     0
##                                                    as.factor(education)4
## Effect of Quitting Smoking at Smokeintensity of 5                      0
## Effect of Quitting Smoking at Smokeintensity of 40                     0
##                                                    as.factor(education)5
## Effect of Quitting Smoking at Smokeintensity of 5                      0
## Effect of Quitting Smoking at Smokeintensity of 40                     0
##                                                    smokeintensity
## Effect of Quitting Smoking at Smokeintensity of 5               0
## Effect of Quitting Smoking at Smokeintensity of 40              0
##                                                    I(smokeintensity * smokeintensity)
## Effect of Quitting Smoking at Smokeintensity of 5                                   0
## Effect of Quitting Smoking at Smokeintensity of 40                                  0
##                                                    smokeyrs
## Effect of Quitting Smoking at Smokeintensity of 5         0
## Effect of Quitting Smoking at Smokeintensity of 40        0
##                                                    I(smokeyrs * smokeyrs)
## Effect of Quitting Smoking at Smokeintensity of 5                       0
## Effect of Quitting Smoking at Smokeintensity of 40                      0
##                                                    as.factor(exercise)1
## Effect of Quitting Smoking at Smokeintensity of 5                     0
## Effect of Quitting Smoking at Smokeintensity of 40                    0
##                                                    as.factor(exercise)2
## Effect of Quitting Smoking at Smokeintensity of 5                     0
## Effect of Quitting Smoking at Smokeintensity of 40                    0
##                                                    as.factor(active)1
## Effect of Quitting Smoking at Smokeintensity of 5                   0
## Effect of Quitting Smoking at Smokeintensity of 40                  0
##                                                    as.factor(active)2 wt71
## Effect of Quitting Smoking at Smokeintensity of 5                   0    0
## Effect of Quitting Smoking at Smokeintensity of 40                  0    0
##                                                    I(wt71 * wt71)
## Effect of Quitting Smoking at Smokeintensity of 5               0
## Effect of Quitting Smoking at Smokeintensity of 40              0
##                                                    I(qsmk * smokeintensity)
## Effect of Quitting Smoking at Smokeintensity of 5                         5
## Effect of Quitting Smoking at Smokeintensity of 40                       40

# (step 4) estimate the contrasts, get tests and confidence intervals for them
estimates1 <- glht(fit, K1)
  summary(estimates1)

## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: glm(formula = wt82_71 ~ qsmk + sex + race + age + I(age * age) + 
##     as.factor(education) + smokeintensity + I(smokeintensity * 
##     smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) + 
##     as.factor(active) + wt71 + I(wt71 * wt71) + I(qsmk * smokeintensity), 
##     data = nhefs)
## 
## Linear Hypotheses:
##                                                         Estimate Std. Error
## Effect of Quitting Smoking at Smokeintensity of 5 == 0     2.793      0.668
## Effect of Quitting Smoking at Smokeintensity of 40 == 0    4.426      0.848
##                                                         z value Pr(>|z|)    
## Effect of Quitting Smoking at Smokeintensity of 5 == 0     4.18  5.8e-05 ***
## Effect of Quitting Smoking at Smokeintensity of 40 == 0    5.22  3.6e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

  confint(estimates1)

## 
##   Simultaneous Confidence Intervals
## 
## Fit: glm(formula = wt82_71 ~ qsmk + sex + race + age + I(age * age) + 
##     as.factor(education) + smokeintensity + I(smokeintensity * 
##     smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) + 
##     as.factor(active) + wt71 + I(wt71 * wt71) + I(qsmk * smokeintensity), 
##     data = nhefs)
## 
## Quantile = 2.23
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##                                                         Estimate lwr   upr  
## Effect of Quitting Smoking at Smokeintensity of 5 == 0  2.793    1.304 4.282
## Effect of Quitting Smoking at Smokeintensity of 40 == 0 4.426    2.537 6.315

# regression on covariates, not allowing for effect modification
fit2 <- glm(wt82_71 ~ qsmk + sex + race + age + I(age*age) + as.factor(education)
           + smokeintensity + I(smokeintensity*smokeintensity) + smokeyrs
           + I(smokeyrs*smokeyrs) + as.factor(exercise) + as.factor(active)
           + wt71 + I(wt71*wt71), data=nhefs)
  
summary(fit2)

## 
## Call:
## glm(formula = wt82_71 ~ qsmk + sex + race + age + I(age * age) + 
##     as.factor(education) + smokeintensity + I(smokeintensity * 
##     smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) + 
##     as.factor(active) + wt71 + I(wt71 * wt71), data = nhefs)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -42.33   -4.22   -0.32    3.81   44.67  
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        -1.658618   4.313773   -0.38  0.70067    
## qsmk                                3.462622   0.438454    7.90  5.4e-15 ***
## sex                                -1.465050   0.468341   -3.13  0.00179 ** 
## race                                0.586412   0.581695    1.01  0.31356    
## age                                 0.362662   0.163343    2.22  0.02655 *  
## I(age * age)                       -0.006138   0.001726   -3.56  0.00039 ***
## as.factor(education)2               0.818526   0.606782    1.35  0.17755    
## as.factor(education)3               0.571500   0.556121    1.03  0.30427    
## as.factor(education)4               1.508517   0.832378    1.81  0.07013 .  
## as.factor(education)5              -0.170826   0.741329   -0.23  0.81779    
## smokeintensity                      0.065153   0.050311    1.29  0.19551    
## I(smokeintensity * smokeintensity) -0.001047   0.000937   -1.12  0.26426    
## smokeyrs                            0.133393   0.091732    1.45  0.14610    
## I(smokeyrs * smokeyrs)             -0.001827   0.001544   -1.18  0.23682    
## as.factor(exercise)1                0.320682   0.534962    0.60  0.54896    
## as.factor(exercise)2                0.362879   0.558956    0.65  0.51630    
## as.factor(active)1                 -0.942957   0.410021   -2.30  0.02159 *  
## as.factor(active)2                 -0.258037   0.684722   -0.38  0.70634    
## wt71                                0.037364   0.083166    0.45  0.65330    
## I(wt71 * wt71)                     -0.000916   0.000524   -1.75  0.08043 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 53.6)
## 
##     Null deviance: 97176  on 1565  degrees of freedom
## Residual deviance: 82857  on 1546  degrees of freedom
##   (63 observations deleted due to missingness)
## AIC: 10701
## 
## Number of Fisher Scoring iterations: 2

##################################################

15.2 倾向性评分

# Program 15.2 
# Estimating and plotting the propensity score
# Data from NHEFS                            
##################################################

fit3 <- glm(qsmk ~ sex + race + age + I(age*age) + as.factor(education)
            + smokeintensity + I(smokeintensity*smokeintensity) + smokeyrs
            + I(smokeyrs*smokeyrs) + as.factor(exercise) + as.factor(active)
            + wt71 + I(wt71*wt71), data=nhefs, family=binomial())
summary(fit3)

## 
## Call:
## glm(formula = qsmk ~ sex + race + age + I(age * age) + as.factor(education) + 
##     smokeintensity + I(smokeintensity * smokeintensity) + smokeyrs + 
##     I(smokeyrs * smokeyrs) + as.factor(exercise) + as.factor(active) + 
##     wt71 + I(wt71 * wt71), family = binomial(), data = nhefs)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.465  -0.804  -0.646   1.058   2.355  
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        -1.988902   1.241279   -1.60  0.10909    
## sex                                -0.507522   0.148232   -3.42  0.00062 ***
## race                               -0.850231   0.205872   -4.13  3.6e-05 ***
## age                                 0.103013   0.048900    2.11  0.03515 *  
## I(age * age)                       -0.000605   0.000507   -1.19  0.23297    
## as.factor(education)2              -0.098320   0.190655   -0.52  0.60607    
## as.factor(education)3               0.015699   0.170714    0.09  0.92673    
## as.factor(education)4              -0.042526   0.264276   -0.16  0.87216    
## as.factor(education)5               0.379663   0.220395    1.72  0.08495 .  
## smokeintensity                     -0.065156   0.014759   -4.41  1.0e-05 ***
## I(smokeintensity * smokeintensity)  0.000846   0.000276    3.07  0.00216 ** 
## smokeyrs                           -0.073371   0.026996   -2.72  0.00657 ** 
## I(smokeyrs * smokeyrs)              0.000838   0.000443    1.89  0.05867 .  
## as.factor(exercise)1                0.291412   0.173554    1.68  0.09314 .  
## as.factor(exercise)2                0.355052   0.179929    1.97  0.04846 *  
## as.factor(active)1                  0.010875   0.129832    0.08  0.93324    
## as.factor(active)2                  0.068312   0.208727    0.33  0.74346    
## wt71                               -0.012848   0.022283   -0.58  0.56423    
## I(wt71 * wt71)                      0.000121   0.000135    0.89  0.37096    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1876.3  on 1628  degrees of freedom
## Residual deviance: 1766.7  on 1610  degrees of freedom
## AIC: 1805
## 
## Number of Fisher Scoring iterations: 4

nhefs$ps <- predict(fit3, nhefs, type="response")

summary(nhefs$ps[nhefs$qsmk==0])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.053   0.169   0.227   0.245   0.304   0.658

summary(nhefs$ps[nhefs$qsmk==1])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.062   0.220   0.289   0.312   0.381   0.793

# # plotting the estimated propensity score
# install.packages("ggplot2") # install packages if necessary
# install.packages("dplyr")
library("ggplot2")
library("dplyr")
library(tidyverse)
ggplot(nhefs, aes(x = ps, fill = qsmk)) + geom_density(alpha = 0.2) +
  xlab('Probability of Quitting Smoking During Follow-up') +
  ggtitle('Propensity Score Distribution by Treatment Group') +
  scale_fill_discrete('') +
  theme(legend.position = 'bottom', legend.direction = 'vertical')

# alternative plot with histograms
nhefs <- nhefs %>% mutate(qsmklabel = ifelse(qsmk == 1,
                       yes = 'Quit Smoking 1971-1982',
                       no = 'Did Not Quit Smoking 1971-1982'))
ggplot(nhefs, aes(x = ps, fill = as.factor(qsmk), color = as.factor(qsmk))) +
  geom_histogram(alpha = 0.3, position = 'identity', bins=15) +
  facet_grid(as.factor(qsmk) ~ .) +
  xlab('Probability of Quitting Smoking During Follow-up') +
  ggtitle('Propensity Score Distribution by Treatment Group') +
  scale_fill_discrete('') +
  scale_color_discrete('') +
  theme(legend.position = 'bottom', legend.direction = 'vertical')

# attempt to reproduce plot from the book
nhefs %>%
  mutate(ps.grp = round(ps/0.05) * 0.05) %>%
  group_by(qsmk, ps.grp) %>%
  dplyr::summarize(n = n()) %>%
  ungroup() %>%
  mutate(n2 = ifelse(qsmk == 0, yes = n, no =  -1*n)) %>%
  ggplot(aes(x = ps.grp, y = n2, fill = as.factor(qsmk))) +
  geom_bar(stat = 'identity', position = 'identity') +
  geom_text(aes(label = n, x = ps.grp, y = n2 + ifelse(qsmk == 0, 8, -8))) +
  xlab('Probability of Quitting Smoking During Follow-up') +
  ylab('N') +
  ggtitle('Propensity Score Distribution by Treatment Group') +
  scale_fill_discrete('') +
  scale_x_continuous(breaks = seq(0, 1, 0.05)) +
  theme(legend.position = 'bottom', legend.direction = 'vertical',
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank())

## `summarise()` has grouped output by 'qsmk'. You can override using the `.groups` argument.

############################################

15.3 倾向性分层和标准化

# Program 15.3         
# Stratification on the propensity score 
# Data from NHEFS  
############################################

# calculation of deciles
nhefs$ps.dec <- cut(nhefs$ps, 
                    breaks=c(quantile(nhefs$ps, probs=seq(0,1,0.1))),
                    labels=seq(1:10),
                    include.lowest=TRUE)

#install.packages("psych") # install package if required
library("psych")

## 
## Attaching package: 'psych'

## The following object is masked from 'package:Hmisc':
## 
##     describe

## The following object is masked from 'package:boot':
## 
##     logit

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

describeBy(nhefs$ps, list(nhefs$ps.dec, nhefs$qsmk))

## 
##  Descriptive statistics by group 
## : 1
## : 0
##    vars   n mean   sd median trimmed  mad  min  max range  skew kurtosis se
## X1    1 151  0.1 0.02   0.11     0.1 0.02 0.05 0.13  0.08 -0.55    -0.53  0
## ------------------------------------------------------------ 
## : 2
## : 0
##    vars   n mean   sd median trimmed  mad  min  max range  skew kurtosis se
## X1    1 136 0.15 0.01   0.15    0.15 0.01 0.13 0.17  0.04 -0.04    -1.23  0
## ------------------------------------------------------------ 
## : 3
## : 0
##    vars   n mean   sd median trimmed  mad  min  max range  skew kurtosis se
## X1    1 134 0.18 0.01   0.18    0.18 0.01 0.17 0.19  0.03 -0.08    -1.34  0
## ------------------------------------------------------------ 
## : 4
## : 0
##    vars   n mean   sd median trimmed  mad  min  max range  skew kurtosis se
## X1    1 129 0.21 0.01   0.21    0.21 0.01 0.19 0.22  0.02 -0.04    -1.13  0
## ------------------------------------------------------------ 
## : 5
## : 0
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 120 0.23 0.01   0.23    0.23 0.01 0.22 0.25  0.03 0.24    -1.22  0
## ------------------------------------------------------------ 
## : 6
## : 0
##    vars   n mean   sd median trimmed  mad  min  max range  skew kurtosis se
## X1    1 117 0.26 0.01   0.26    0.26 0.01 0.25 0.27  0.03 -0.11    -1.29  0
## ------------------------------------------------------------ 
## : 7
## : 0
##    vars   n mean   sd median trimmed  mad  min  max range  skew kurtosis se
## X1    1 120 0.29 0.01   0.29    0.29 0.01 0.27 0.31  0.03 -0.23    -1.19  0
## ------------------------------------------------------------ 
## : 8
## : 0
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 112 0.33 0.01   0.33    0.33 0.02 0.31 0.35  0.04 0.15     -1.1  0
## ------------------------------------------------------------ 
## : 9
## : 0
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 96 0.38 0.02   0.38    0.38 0.02 0.35 0.42  0.06 0.13    -1.15  0
## ------------------------------------------------------------ 
## : 10
## : 0
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 86 0.49 0.06   0.47    0.48 0.05 0.42 0.66  0.24  1.1     0.47 0.01
## ------------------------------------------------------------ 
## : 1
## : 1
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 12  0.1 0.02   0.11     0.1 0.03 0.06 0.13  0.07 -0.5    -1.36 0.01
## ------------------------------------------------------------ 
## : 2
## : 1
##    vars  n mean   sd median trimmed  mad  min  max range  skew kurtosis se
## X1    1 27 0.15 0.01   0.15    0.15 0.01 0.13 0.17  0.03 -0.03    -1.34  0
## ------------------------------------------------------------ 
## : 3
## : 1
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 29 0.18 0.01   0.18    0.18 0.01 0.17 0.19  0.03 0.01    -1.34  0
## ------------------------------------------------------------ 
## : 4
## : 1
##    vars  n mean   sd median trimmed  mad  min  max range  skew kurtosis se
## X1    1 34 0.21 0.01   0.21    0.21 0.01 0.19 0.22  0.02 -0.31    -1.23  0
## ------------------------------------------------------------ 
## : 5
## : 1
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 43 0.23 0.01   0.23    0.23 0.01 0.22 0.25  0.03 0.11    -1.23  0
## ------------------------------------------------------------ 
## : 6
## : 1
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 45 0.26 0.01   0.26    0.26 0.01 0.25 0.27  0.03  0.2    -1.12  0
## ------------------------------------------------------------ 
## : 7
## : 1
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 43 0.29 0.01   0.29    0.29 0.01 0.27 0.31  0.03 0.16    -1.25  0
## ------------------------------------------------------------ 
## : 8
## : 1
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 51 0.33 0.01   0.33    0.33 0.02 0.31 0.35  0.04 0.11    -1.19  0
## ------------------------------------------------------------ 
## : 9
## : 1
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 67 0.38 0.02   0.38    0.38 0.03 0.35 0.42  0.06 0.19    -1.27  0
## ------------------------------------------------------------ 
## : 10
## : 1
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 77 0.52 0.08   0.51    0.51 0.08 0.42 0.79  0.38 0.88     0.81 0.01

# function to create deciles easily
decile <- function(x) {
  return(factor(quantcut(x, seq(0, 1, 0.1), labels = FALSE)))
}

# regression on PS deciles, allowing for effect modification
for (deciles in c(1:10)) {
  print(t.test(wt82_71~qsmk, data=nhefs[which(nhefs$ps.dec==deciles),]))
}

## 
##  Welch Two Sample t-test
## 
## data:  wt82_71 by qsmk
## t = 0.006, df = 12, p-value = 1
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -5.28  5.31
## sample estimates:
## mean in group 0 mean in group 1 
##            4.00            3.98 
## 
## 
##  Welch Two Sample t-test
## 
## data:  wt82_71 by qsmk
## t = -3, df = 37, p-value = 0.004
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -6.85 -1.45
## sample estimates:
## mean in group 0 mean in group 1 
##            2.90            7.05 
## 
## 
##  Welch Two Sample t-test
## 
## data:  wt82_71 by qsmk
## t = -5, df = 36, p-value = 6e-05
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -9.47 -3.61
## sample estimates:
## mean in group 0 mean in group 1 
##            2.61            9.16 
## 
## 
##  Welch Two Sample t-test
## 
## data:  wt82_71 by qsmk
## t = -1, df = 45, p-value = 0.2
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -5.683  0.999
## sample estimates:
## mean in group 0 mean in group 1 
##            3.47            5.82 
## 
## 
##  Welch Two Sample t-test
## 
## data:  wt82_71 by qsmk
## t = -3, df = 74, p-value = 0.002
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -6.75 -1.51
## sample estimates:
## mean in group 0 mean in group 1 
##            2.10            6.23 
## 
## 
##  Welch Two Sample t-test
## 
## data:  wt82_71 by qsmk
## t = -2, df = 51, p-value = 0.03
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -8.752 -0.335
## sample estimates:
## mean in group 0 mean in group 1 
##            1.85            6.39 
## 
## 
##  Welch Two Sample t-test
## 
## data:  wt82_71 by qsmk
## t = -3, df = 85, p-value = 0.001
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -6.90 -1.73
## sample estimates:
## mean in group 0 mean in group 1 
##            1.56            5.88 
## 
## 
##  Welch Two Sample t-test
## 
## data:  wt82_71 by qsmk
## t = -3, df = 75, p-value = 0.009
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -6.240 -0.901
## sample estimates:
## mean in group 0 mean in group 1 
##           0.285           3.855 
## 
## 
##  Welch Two Sample t-test
## 
## data:  wt82_71 by qsmk
## t = -2, df = 129, p-value = 0.06
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -4.6814  0.0797
## sample estimates:
## mean in group 0 mean in group 1 
##          -0.895           1.405 
## 
## 
##  Welch Two Sample t-test
## 
## data:  wt82_71 by qsmk
## t = -2, df = 143, p-value = 0.1
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -5.02  0.54
## sample estimates:
## mean in group 0 mean in group 1 
##          -0.504           1.736

# regression on PS deciles, not allowing for effect modification
fit.psdec <- glm(wt82_71 ~ qsmk + as.factor(ps.dec), data = nhefs)
summary(fit.psdec)

## 
## Call:
## glm(formula = wt82_71 ~ qsmk + as.factor(ps.dec), data = nhefs)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -43.54   -3.93   -0.08    4.23   46.77  
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.750      0.609    6.16  9.3e-10 ***
## qsmk                   3.500      0.457    7.66  3.3e-14 ***
## as.factor(ps.dec)2    -0.739      0.861   -0.86    0.391    
## as.factor(ps.dec)3    -0.618      0.861   -0.72    0.473    
## as.factor(ps.dec)4    -0.520      0.858   -0.61    0.544    
## as.factor(ps.dec)5    -1.488      0.859   -1.73    0.083 .  
## as.factor(ps.dec)6    -1.623      0.868   -1.87    0.062 .  
## as.factor(ps.dec)7    -1.985      0.868   -2.29    0.022 *  
## as.factor(ps.dec)8    -3.445      0.875   -3.94  8.6e-05 ***
## as.factor(ps.dec)9    -5.154      0.885   -5.83  6.9e-09 ***
## as.factor(ps.dec)10   -4.840      0.883   -5.48  4.9e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 58.4)
## 
##     Null deviance: 97176  on 1565  degrees of freedom
## Residual deviance: 90848  on 1555  degrees of freedom
##   (63 observations deleted due to missingness)
## AIC: 10827
## 
## Number of Fisher Scoring iterations: 2

confint.lm(fit.psdec)

##                     2.5 %  97.5 %
## (Intercept)          2.56  4.9449
## qsmk                 2.60  4.3970
## as.factor(ps.dec)2  -2.43  0.9498
## as.factor(ps.dec)3  -2.31  1.0710
## as.factor(ps.dec)4  -2.20  1.1633
## as.factor(ps.dec)5  -3.17  0.1966
## as.factor(ps.dec)6  -3.32  0.0789
## as.factor(ps.dec)7  -3.69 -0.2825
## as.factor(ps.dec)8  -5.16 -1.7286
## as.factor(ps.dec)9  -6.89 -3.4188
## as.factor(ps.dec)10 -6.57 -3.1087

#######################################################################################

15.4 倾向性评分匹配

# Program 15.4                                                                  ####
# Standardization using the propensity score                                    ####
# Data from NHEFS                                                               ####
#######################################################################################

#install.packages("boot") # install package if required
library("boot")

# standardization by propensity score, agnostic regarding effect modification 
std.ps <- function(data, indices) {
  d <- data[indices,] # 1st copy: equal to original one`
  # calculating propensity scores
  ps.fit <- glm(qsmk ~ sex + race + age + I(age*age) 
                + as.factor(education) + smokeintensity
                + I(smokeintensity*smokeintensity) + smokeyrs
                + I(smokeyrs*smokeyrs) + as.factor(exercise)
                + as.factor(active) + wt71 + I(wt71*wt71),
                data=d, family=binomial())
  d$pscore <- predict(ps.fit, d, type="response")
  
  # create a dataset with 3 copies of each subject
  d$interv <- -1 # 1st copy: equal to original one`
  d0 <- d # 2nd copy: treatment set to 0, outcome to missing
  d0$interv <- 0
  d0$qsmk <- 0
  d0$wt82_71 <- NA
  d1 <- d # 3rd copy: treatment set to 1, outcome to missing
  d1$interv <- 1
  d1$qsmk <- 1
  d1$wt82_71 <- NA
  d.onesample <- rbind(d, d0, d1) # combining datasets

  std.fit <- glm(wt82_71 ~ qsmk + pscore + I(qsmk*pscore), data=d.onesample)
  d.onesample$predicted_meanY <- predict(std.fit, d.onesample)

  # estimate mean outcome in each of the groups interv=-1, interv=0, and interv=1
  return(c(mean(d.onesample$predicted_meanY[d.onesample$interv==-1]),
           mean(d.onesample$predicted_meanY[d.onesample$interv==0]),
           mean(d.onesample$predicted_meanY[d.onesample$interv==1]),
           mean(d.onesample$predicted_meanY[d.onesample$interv==1])-
             mean(d.onesample$predicted_meanY[d.onesample$interv==0])))
}

# bootstrap
results <- boot(data=nhefs, statistic=std.ps, R=5)

# generating confidence intervals
se <- c(sd(results$t[,1]), sd(results$t[,2]), 
        sd(results$t[,3]), sd(results$t[,4]))
mean <- results$t0
ll <- mean - qnorm(0.975)*se
ul <- mean + qnorm(0.975)*se

bootstrap <- data.frame(cbind(c("Observed", "No Treatment", "Treatment", 
                                "Treatment - No Treatment"), mean, se, ll, ul))
bootstrap

##                         V1             mean                se               ll
## 1                 Observed 2.63384609228479 0.215654312114019 2.21117140743055
## 2             No Treatment 1.71983636149842 0.212527259152412 1.30329058782669
## 3                Treatment 5.35072300362993  0.32326309173114 4.71713898630582
## 4 Treatment - No Treatment  3.6308866421315 0.311509003584114 3.02034021424668
##                 ul
## 1 3.05652077713903
## 2 2.13638213517016
## 3 5.98430702095403
## 4 4.24143307001633

#####################
# regression on the propensity score (linear term) p.qsmk 改为 ps
model6 <- glm(wt82_71 ~ qsmk + ps, data = nhefs)
summary(model6)

## 
## Call:
## glm(formula = wt82_71 ~ qsmk + ps, data = nhefs)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -43.31   -4.01   -0.07    4.24   47.16  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.594      0.483   11.58  < 2e-16 ***
## qsmk           3.551      0.457    7.76  1.5e-14 ***
## ps           -14.822      1.758   -8.43  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 58.3)
## 
##     Null deviance: 97176  on 1565  degrees of freedom
## Residual deviance: 91099  on 1563  degrees of freedom
##   (63 observations deleted due to missingness)
## AIC: 10815
## 
## Number of Fisher Scoring iterations: 2

# standarization on the propensity score
# (step 1) create two new datasets, one with all treated and one with all untreated
treated <- nhefs
  treated$qsmk <- 1

untreated <- nhefs
  untreated$qsmk <- 0

# (step 2) predict values for everyone in each new dataset based on above model
treated$pred.y <- predict(model6, treated)
untreated$pred.y <- predict(model6, untreated)

# (step 3) compare mean weight loss had all been treated vs. that had all been untreated
mean1 <- mean(treated$pred.y, na.rm = TRUE)
mean0 <- mean(untreated$pred.y, na.rm = TRUE)
mean1

## [1] 5.25

mean0

## [1] 1.7

mean1 - mean0

## [1] 3.55

# (step 4) bootstrap a confidence interval
# number of bootstraps
nboot <- 100
# set up a matrix to store results
boots <- data.frame(i = 1:nboot,
                    mean1 = NA,
                    mean0 = NA,
                    difference = NA)
# loop to perform the bootstrapping p.qsmk 改为ps
nhefs <- subset(nhefs, !is.na(ps) & !is.na(wt82_71))
for(i in 1:nboot) {
  # sample with replacement
  sampl <- nhefs[sample(1:nrow(nhefs), nrow(nhefs), replace = TRUE), ]
  
  # fit the model in the bootstrap sample
  bootmod <- glm(wt82_71 ~ qsmk + ps, data = sampl)
  
  # create new datasets
  sampl.treated <- sampl %>%
    mutate(qsmk = 1)
  
  sampl.untreated <- sampl %>%
    mutate(qsmk = 0)
  
  # predict values
  sampl.treated$pred.y <- predict(bootmod, sampl.treated)
  sampl.untreated$pred.y <- predict(bootmod, sampl.untreated)
  
  # output results 
  boots[i, 'mean1'] <- mean(sampl.treated$pred.y, na.rm = TRUE)
  boots[i, 'mean0'] <- mean(sampl.untreated$pred.y, na.rm = TRUE)
  boots[i, 'difference'] <- boots[i, 'mean1'] - boots[i, 'mean0']
  
  # once loop is done, print the results
  if(i == nboot) {
    cat('95% CI for the causal mean difference\n')
    cat(mean(boots$difference) - 1.96*sd(boots$difference), 
        ',',
        mean(boots$difference) + 1.96*sd(boots$difference))
  }
}

## 95% CI for the causal mean difference
## 2.54 , 4.57

# a more flexible and elegant way to do this is to write a function 
# to perform the model fitting, prediction, bootstrapping, and reporting all at once
# view the code contained in the file mstandardize.R to learn more

# load the code for the mstandardize() function 
# (you may need to change the filepath)
# source('/Users/milin/Downloads/causalInferenceWhatIfRcode_CIpart2/chapter15.R') 

# # performt the standardization
# mstandardize(formula = wt82_71 ~ qsmk + decile(p.qsmk), 
#              family = 'gaussian',
#              trt = 'qsmk', 
#              nboot = 100, 
#              data = nhefs)

16 工具变量

16.1 工具变量的三个条件

##################################################################
# Program 16.1
# Estimating the average causal using the standard IV estimator
# via the calculation of sample averages
# Data from NHEFS
##################################################################

#install.packages("readxl") # install package if required
library("readxl")
nhefs <- read.csv("/Users/milin/Downloads/nhefs.csv")

# some preprocessing of the data
nhefs$cens <- ifelse(is.na(nhefs$wt82), 1, 0)
summary(nhefs$price82)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.5     1.7     1.8     1.8     1.9     2.1      92

# for simplicity, ignore subjects with missing outcome or missing instrument
nhefs.iv <- nhefs[which(!is.na(nhefs$wt82) & !is.na(nhefs$price82)),]
nhefs.iv$highprice <- ifelse(nhefs.iv$price82>=1.5, 1, 0)

table(nhefs.iv$highprice, nhefs.iv$qsmk)

##    
##        0    1
##   0   33    8
##   1 1065  370

t.test(wt82_71 ~ highprice, data=nhefs.iv)

## 
##  Welch Two Sample t-test
## 
## data:  wt82_71 by highprice
## t = -0.1, df = 42, p-value = 0.9
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -3.13  2.83
## sample estimates:
## mean in group 0 mean in group 1 
##            2.54            2.69

######################################################################

16.2 工具变量的效应估计

# PROGRAM 16.2
# Estimating the average causal effect using the standard IV estimator 
# via two-stage-least-squares regression
# Data from NHEFS
######################################################################

#install.packages ("sem") # install package if required
library(sem) 

model1 <- tsls(wt82_71 ~ qsmk, ~ highprice, data = nhefs.iv)
summary(model1)

## 
##  2SLS Estimates
## 
## Model Formula: wt82_71 ~ qsmk
## 
## Instruments: ~highprice
## 
## Residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -43.3    -4.0     0.0     0.0     4.2    46.5 
## 
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)     2.07       5.09    0.41     0.68
## qsmk            2.40      19.84    0.12     0.90
## 
## Residual standard error: 7.856 on 1474 degrees of freedom

confint(model1)  # note the wide confidence intervals

##             2.5 % 97.5 %
## (Intercept)  -7.9   12.0
## qsmk        -36.5   41.3

######################################################################

16.3 工具变量的第四个条件：同质性

# Program 16.3
# Estimating the average causal using the standard IV estimator
# via additive marginal structural models
# Data from NHEFS
######################################################################

########################################################
# G-estimation: Checking one possible value of psi      
# See Chapter 14 for program that checks several values 
# and computes 95% confidence intervals                 
########################################################

nhefs.iv$psi <- 2.396
nhefs.iv$Hpsi <- nhefs.iv$wt82_71-nhefs.iv$psi*nhefs.iv$qsmk

#install.packages("geepack") # install package if required
library("geepack")
g.est <- geeglm(highprice ~ Hpsi, data=nhefs.iv, id=seqn, family=binomial(),
                corstr="independence")
summary(g.est)

## 
## Call:
## geeglm(formula = highprice ~ Hpsi, family = binomial(), data = nhefs.iv, 
##     id = seqn, corstr = "independence")
## 
##  Coefficients:
##             Estimate  Std.err Wald Pr(>|W|)    
## (Intercept) 3.56e+00 1.65e-01  463   <2e-16 ***
## Hpsi        2.75e-07 2.27e-02    0        1    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)        1   0.761
## Number of clusters:   1476  Maximum cluster size: 1

beta <- coef(g.est)
SE <- coef(summary(g.est))[,2]
lcl <- beta-qnorm(0.975)*SE 
ucl <- beta+qnorm(0.975)*SE
cbind(beta, lcl, ucl)

##                 beta     lcl    ucl
## (Intercept) 3.56e+00  3.2315 3.8792
## Hpsi        2.75e-07 -0.0446 0.0446

######################################################################

16.4 另一种第四个条件：单调性

# Program 16.4
# Estimating the average causal using the standard IV estimator
# with altnerative proposed instruments
# Data from NHEFS
######################################################################

summary(tsls(wt82_71 ~ qsmk, ~ ifelse(price82 >= 1.6, 1, 0), data = nhefs.iv))

## 
##  2SLS Estimates
## 
## Model Formula: wt82_71 ~ qsmk
## 
## Instruments: ~ifelse(price82 >= 1.6, 1, 0)
## 
## Residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -55.6   -13.5     7.6     0.0    12.5    56.4 
## 
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -7.89      42.25   -0.19     0.85
## qsmk           41.28     164.95    0.25     0.80
## 
## Residual standard error: 18.605 on 1474 degrees of freedom

summary(tsls(wt82_71 ~ qsmk, ~ ifelse(price82 >= 1.7, 1, 0), data = nhefs.iv))

## 
##  2SLS Estimates
## 
## Model Formula: wt82_71 ~ qsmk
## 
## Instruments: ~ifelse(price82 >= 1.7, 1, 0)
## 
## Residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -54.4   -13.4    -8.4     0.0    18.1    75.3 
## 
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)     13.2       48.1    0.27     0.78
## qsmk           -40.9      187.7   -0.22     0.83
## 
## Residual standard error: 20.591 on 1474 degrees of freedom

summary(tsls(wt82_71 ~ qsmk, ~ ifelse(price82 >= 1.8, 1, 0), data = nhefs.iv))

## 
##  2SLS Estimates
## 
## Model Formula: wt82_71 ~ qsmk
## 
## Instruments: ~ifelse(price82 >= 1.8, 1, 0)
## 
## Residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -49.4    -8.3    -3.4     0.0     7.3    60.5 
## 
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)     8.09       7.29    1.11     0.27
## qsmk          -21.10      28.43   -0.74     0.46
## 
## Residual standard error: 13.019 on 1474 degrees of freedom

summary(tsls(wt82_71 ~ qsmk, ~ ifelse(price82 >= 1.9, 1, 0), data = nhefs.iv))

## 
##  2SLS Estimates
## 
## Model Formula: wt82_71 ~ qsmk
## 
## Instruments: ~ifelse(price82 >= 1.9, 1, 0)
## 
## Residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -47.2    -6.3    -1.4     0.0     5.5    54.4 
## 
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)     5.96       6.07    0.98     0.33
## qsmk          -12.81      23.67   -0.54     0.59
## 
## Residual standard error: 10.364 on 1474 degrees of freedom

######################################################################

16.5 再谈工具变量的三个条件

# Program 16.5
# Estimating the average causal using the standard IV estimator
# Conditional on baseline covariates
# Data from NHEFS
######################################################################

model2 <- tsls(wt82_71 ~ qsmk + sex + race + age + smokeintensity + smokeyrs + 
                      as.factor(exercise) + as.factor(active) + wt71,
             ~ highprice + sex + race + age + smokeintensity + smokeyrs + as.factor(exercise) +
               as.factor(active) + wt71, data = nhefs.iv)
summary(model2)

## 
##  2SLS Estimates
## 
## Model Formula: wt82_71 ~ qsmk + sex + race + age + smokeintensity + smokeyrs + 
##     as.factor(exercise) + as.factor(active) + wt71
## 
## Instruments: ~highprice + sex + race + age + smokeintensity + smokeyrs + as.factor(exercise) + 
##     as.factor(active) + wt71
## 
## Residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -42.2    -4.3    -0.6     0.0     3.9    46.7 
## 
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          17.28033    2.33540    7.40  2.3e-13 ***
## qsmk                 -1.04229   29.98737   -0.03    0.972    
## sex                  -1.64439    2.63083   -0.63    0.532    
## race                 -0.18325    4.65039   -0.04    0.969    
## age                  -0.16364    0.24055   -0.68    0.496    
## smokeintensity        0.00577    0.14550    0.04    0.968    
## smokeyrs              0.02584    0.16142    0.16    0.873    
## as.factor(exercise)1  0.49875    2.17124    0.23    0.818    
## as.factor(exercise)2  0.58183    2.18315    0.27    0.790    
## as.factor(active)1   -1.17015    0.60747   -1.93    0.054 .  
## as.factor(active)2   -0.51228    1.30845   -0.39    0.695    
## wt71                 -0.09795    0.03627   -2.70    0.007 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.716 on 1464 degrees of freedom

17 因果推断中的生存分析

17.1 危害和风险

##################################################################
# PROGRAM 17.1
# Nonparametric estimation of survival curves
# Data from NHEFS
##################################################################

library("readxl")
nhefs <- read.csv("/Users/milin/Downloads/nhefs.csv")

# some preprocessing of the data 
nhefs$survtime <- ifelse(nhefs$death==0, 120, 
                         (nhefs$yrdth-83)*12+nhefs$modth) # yrdth ranges from 83 to 92

table(nhefs$death, nhefs$qsmk)

##    
##       0   1
##   0 985 326
##   1 216 102

summary(nhefs[which(nhefs$death==1),]$survtime)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    35.0    61.0    61.1    86.8   120.0

#install.packages("survival")
#install.packages("ggplot2") # for plots
#install.packages("survminer") # for plots
library("survival")
library("ggplot2")
library("survminer")

## Loading required package: ggpubr

survdiff(Surv(survtime, death) ~ qsmk, data=nhefs)

## Call:
## survdiff(formula = Surv(survtime, death) ~ qsmk, data = nhefs)
## 
##           N Observed Expected (O-E)^2/E (O-E)^2/V
## qsmk=0 1201      216    237.5      1.95      7.73
## qsmk=1  428      102     80.5      5.76      7.73
## 
##  Chisq= 7.7  on 1 degrees of freedom, p= 0.005

fit <- survfit(Surv(survtime, death) ~ qsmk, data=nhefs)
ggsurvplot(fit, data = nhefs, xlab="Months of follow-up",
           ylab="Survival probability",
           main="Product-Limit Survival Estimates", risk.table = TRUE)

##################################################################

17.2 从危害和风险

# PROGRAM 17.2
# Parametric estimation of survival curves via hazards model
# Data from NHEFS
##################################################################

# creation of person-month data
#install.packages("splitstackshape")
library("splitstackshape")
nhefs.surv <- expandRows(nhefs, "survtime", drop=F) 
nhefs.surv$time <- sequence(rle(nhefs.surv$seqn)$lengths)-1
nhefs.surv$event <- ifelse(nhefs.surv$time==nhefs.surv$survtime-1 & 
                             nhefs.surv$death==1, 1, 0)
nhefs.surv$timesq <- nhefs.surv$time^2

# fit of parametric hazards model
hazards.model <- glm(event==0 ~ qsmk + I(qsmk*time) + I(qsmk*timesq) + 
                       time + timesq, family=binomial(), data=nhefs.surv)
summary(hazards.model)

## 
## Call:
## glm(formula = event == 0 ~ qsmk + I(qsmk * time) + I(qsmk * timesq) + 
##     time + timesq, family = binomial(), data = nhefs.surv)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.725   0.055   0.060   0.063   0.078  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       7.00e+00   2.31e-01   30.29   <2e-16 ***
## qsmk             -3.35e-01   3.97e-01   -0.85     0.40    
## I(qsmk * time)   -1.21e-02   1.50e-02   -0.80     0.42    
## I(qsmk * timesq)  1.61e-04   1.25e-04    1.29     0.20    
## time             -1.96e-02   8.41e-03   -2.33     0.02 *  
## timesq            1.26e-04   6.69e-05    1.88     0.06 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4655.3  on 176763  degrees of freedom
## Residual deviance: 4631.3  on 176758  degrees of freedom
## AIC: 4643
## 
## Number of Fisher Scoring iterations: 9

# creation of dataset with all time points under each treatment level
qsmk0 <- data.frame(cbind(seq(0, 119),0,(seq(0, 119))^2))
qsmk1 <- data.frame(cbind(seq(0, 119),1,(seq(0, 119))^2))

colnames(qsmk0) <- c("time", "qsmk", "timesq")
colnames(qsmk1) <- c("time", "qsmk", "timesq")

# assignment of estimated (1-hazard) to each person-month */
qsmk0$p.noevent0 <- predict(hazards.model, qsmk0, type="response")
qsmk1$p.noevent1 <- predict(hazards.model, qsmk1, type="response")

# computation of survival for each person-month
qsmk0$surv0 <- cumprod(qsmk0$p.noevent0)
qsmk1$surv1 <- cumprod(qsmk1$p.noevent1)

# some data management to plot estimated survival curves
hazards.graph <- merge(qsmk0, qsmk1, by=c("time", "timesq"))
hazards.graph$survdiff <- hazards.graph$surv1-hazards.graph$surv0

# plot
ggplot(hazards.graph, aes(x=time, y=surv)) + 
  geom_line(aes(y = surv0, colour = "0")) + 
  geom_line(aes(y = surv1, colour = "1")) + 
  xlab("Months") + 
  scale_x_continuous(limits = c(0, 120), breaks=seq(0,120,12)) +
  scale_y_continuous(limits=c(0.6, 1), breaks=seq(0.6, 1, 0.2)) +
  ylab("Survival") + 
  ggtitle("Survival from hazards model") + 
  labs(colour="A:") +
  theme_bw() + 
  theme(legend.position="bottom")

##################################################################

17.3 删失数据

# PROGRAM 17.3
# Estimation of survival curves via IP weighted hazards model
# Data from NHEFS
##################################################################

# estimation of denominator of ip weights
p.denom <- glm(qsmk ~ sex + race + age + I(age*age) + as.factor(education)
               + smokeintensity + I(smokeintensity*smokeintensity)
               + smokeyrs + I(smokeyrs*smokeyrs) + as.factor(exercise)
               + as.factor(active) + wt71 + I(wt71*wt71), 
               data=nhefs, family=binomial())
nhefs$pd.qsmk <- predict(p.denom, nhefs, type="response")

# estimation of numerator of ip weights
p.num <- glm(qsmk ~ 1, data=nhefs, family=binomial())
nhefs$pn.qsmk <- predict(p.num, nhefs, type="response")

# computation of estimated weights
nhefs$sw.a <- ifelse(nhefs$qsmk==1, nhefs$pn.qsmk/nhefs$pd.qsmk,
                     (1-nhefs$pn.qsmk)/(1-nhefs$pd.qsmk))
summary(nhefs$sw.a)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.33    0.86    0.95    1.00    1.08    4.21

# creation of person-month data
nhefs.ipw <- expandRows(nhefs, "survtime", drop=F) 
nhefs.ipw$time <- sequence(rle(nhefs.ipw$seqn)$lengths)-1
nhefs.ipw$event <- ifelse(nhefs.ipw$time==nhefs.ipw$survtime-1 & 
                            nhefs.ipw$death==1, 1, 0)
nhefs.ipw$timesq <- nhefs.ipw$time^2

# fit of weighted hazards model
ipw.model <- glm(event==0 ~ qsmk + I(qsmk*time) + I(qsmk*timesq) + 
                   time + timesq, family=binomial(), weight=sw.a,
                 data=nhefs.ipw)

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

summary(ipw.model)

## 
## Call:
## glm(formula = event == 0 ~ qsmk + I(qsmk * time) + I(qsmk * timesq) + 
##     time + timesq, family = binomial(), data = nhefs.ipw, weights = sw.a)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -7.186   0.053   0.059   0.064   0.145  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       6.896996   0.220760   31.24   <2e-16 ***
## qsmk              0.179384   0.439873    0.41    0.683    
## I(qsmk * time)   -0.018946   0.016403   -1.16    0.248    
## I(qsmk * timesq)  0.000210   0.000135    1.56    0.120    
## time             -0.018888   0.008053   -2.35    0.019 *  
## timesq            0.000118   0.000064    1.85    0.065 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4643.9  on 176763  degrees of freedom
## Residual deviance: 4626.2  on 176758  degrees of freedom
## AIC: 4634
## 
## Number of Fisher Scoring iterations: 9

# creation of survival curves
ipw.qsmk0 <- data.frame(cbind(seq(0, 119),0,(seq(0, 119))^2))
ipw.qsmk1 <- data.frame(cbind(seq(0, 119),1,(seq(0, 119))^2))

colnames(ipw.qsmk0) <- c("time", "qsmk", "timesq")
colnames(ipw.qsmk1) <- c("time", "qsmk", "timesq")

# assignment of estimated (1-hazard) to each person-month */
ipw.qsmk0$p.noevent0 <- predict(ipw.model, ipw.qsmk0, type="response")
ipw.qsmk1$p.noevent1 <- predict(ipw.model, ipw.qsmk1, type="response")

# computation of survival for each person-month
ipw.qsmk0$surv0 <- cumprod(ipw.qsmk0$p.noevent0)
ipw.qsmk1$surv1 <- cumprod(ipw.qsmk1$p.noevent1)

# some data management to plot estimated survival curves
ipw.graph <- merge(ipw.qsmk0, ipw.qsmk1, by=c("time", "timesq"))
ipw.graph$survdiff <- ipw.graph$surv1-ipw.graph$surv0

# plot
ggplot(ipw.graph, aes(x=time, y=surv)) + 
  geom_line(aes(y = surv0, colour = "0")) + 
  geom_line(aes(y = surv1, colour = "1")) + 
  xlab("Months") + 
  scale_x_continuous(limits = c(0, 120), breaks=seq(0,120,12)) +
  scale_y_continuous(limits=c(0.6, 1), breaks=seq(0.6, 1, 0.2)) +
  ylab("Survival") + 
  ggtitle("Survival from IP weighted hazards model") + 
  labs(colour="A:") +
  theme_bw() + 
  theme(legend.position="bottom")

##################################################################

17.4 生存分析中的逆概率加权

# PROGRAM 17.4
# Estimating of survival curves via g-formula
# Data from NHEFS
##################################################################

# fit of hazards model with covariates
gf.model <- glm(event==0 ~ qsmk + I(qsmk*time) + I(qsmk*timesq)
                + time + timesq + sex + race + age + I(age*age)
                + as.factor(education) + smokeintensity 
                + I(smokeintensity*smokeintensity) + smkintensity82_71 
                + smokeyrs + I(smokeyrs*smokeyrs) + as.factor(exercise) 
                + as.factor(active) + wt71 + I(wt71*wt71), 
                data=nhefs.surv, family=binomial())
summary(gf.model)

## 
## Call:
## glm(formula = event == 0 ~ qsmk + I(qsmk * time) + I(qsmk * timesq) + 
##     time + timesq + sex + race + age + I(age * age) + as.factor(education) + 
##     smokeintensity + I(smokeintensity * smokeintensity) + smkintensity82_71 + 
##     smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) + 
##     as.factor(active) + wt71 + I(wt71 * wt71), family = binomial(), 
##     data = nhefs.surv)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -4.316   0.024   0.039   0.064   0.330  
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         9.27e+00   1.38e+00    6.72  1.8e-11 ***
## qsmk                                5.96e-02   4.15e-01    0.14  0.88592    
## I(qsmk * time)                     -1.49e-02   1.51e-02   -0.99  0.32382    
## I(qsmk * timesq)                    1.70e-04   1.24e-04    1.37  0.17164    
## time                               -2.27e-02   8.44e-03   -2.69  0.00714 ** 
## timesq                              1.17e-04   6.71e-05    1.75  0.08002 .  
## sex                                 4.37e-01   1.41e-01    3.10  0.00193 ** 
## race                               -5.24e-02   1.73e-01   -0.30  0.76257    
## age                                -8.75e-02   5.91e-02   -1.48  0.13854    
## I(age * age)                        8.13e-05   5.47e-04    0.15  0.88186    
## as.factor(education)2               1.40e-01   1.57e-01    0.89  0.37098    
## as.factor(education)3               4.33e-01   1.53e-01    2.84  0.00450 ** 
## as.factor(education)4               2.35e-01   2.79e-01    0.84  0.39975    
## as.factor(education)5               3.75e-01   2.39e-01    1.57  0.11612    
## smokeintensity                     -1.63e-03   1.43e-02   -0.11  0.90943    
## I(smokeintensity * smokeintensity) -7.18e-05   2.39e-04   -0.30  0.76374    
## smkintensity82_71                  -1.69e-03   6.50e-03   -0.26  0.79540    
## smokeyrs                           -1.68e-02   3.06e-02   -0.55  0.58415    
## I(smokeyrs * smokeyrs)             -5.28e-05   4.24e-04   -0.12  0.90100    
## as.factor(exercise)1                1.47e-01   1.79e-01    0.82  0.41230    
## as.factor(exercise)2               -1.50e-01   1.76e-01   -0.85  0.39318    
## as.factor(active)1                 -1.60e-01   1.30e-01   -1.23  0.21805    
## as.factor(active)2                 -2.29e-01   1.88e-01   -1.22  0.22177    
## wt71                                6.22e-02   1.90e-02    3.27  0.00107 ** 
## I(wt71 * wt71)                     -4.05e-04   1.13e-04   -3.58  0.00034 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4655.3  on 176763  degrees of freedom
## Residual deviance: 4185.7  on 176739  degrees of freedom
## AIC: 4236
## 
## Number of Fisher Scoring iterations: 10

# creation of dataset with all time points for 
# each individual under each treatment level
gf.qsmk0 <- expandRows(nhefs, count=120, count.is.col=F) 
gf.qsmk0$time <- rep(seq(0, 119), nrow(nhefs))
gf.qsmk0$timesq <- gf.qsmk0$time^2
gf.qsmk0$qsmk <- 0

gf.qsmk1 <- gf.qsmk0
gf.qsmk1$qsmk <- 1

gf.qsmk0$p.noevent0 <- predict(gf.model, gf.qsmk0, type="response")
gf.qsmk1$p.noevent1 <- predict(gf.model, gf.qsmk1, type="response")

#install.packages("dplyr")
library("dplyr")
gf.qsmk0.surv <- gf.qsmk0 %>% group_by(seqn) %>% mutate(surv0 = cumprod(p.noevent0))
gf.qsmk1.surv <- gf.qsmk1 %>% group_by(seqn) %>% mutate(surv1 = cumprod(p.noevent1))

gf.surv0 <- aggregate(gf.qsmk0.surv, by=list(gf.qsmk0.surv$time), FUN=mean)[c("qsmk", "time", "surv0")]
gf.surv1 <- aggregate(gf.qsmk1.surv, by=list(gf.qsmk1.surv$time), FUN=mean)[c("qsmk", "time", "surv1")]

gf.graph <- merge(gf.surv0, gf.surv1, by=c("time"))
gf.graph$survdiff <- gf.graph$surv1-gf.graph$surv0

# plot
ggplot(gf.graph, aes(x=time, y=surv)) + 
  geom_line(aes(y = surv0, colour = "0")) + 
  geom_line(aes(y = surv1, colour = "1")) + 
  xlab("Months") + 
  scale_x_continuous(limits = c(0, 120), breaks=seq(0,120,12)) +
  scale_y_continuous(limits=c(0.6, 1), breaks=seq(0.6, 1, 0.2)) +
  ylab("Survival") + 
  ggtitle("Survival from g-formula") + 
  labs(colour="A:") +
  theme_bw() + 
  theme(legend.position="bottom")

##################################################################

17.5 生存分析中的G公式

# PROGRAM 17.5
# Estimating of median survival time ratio via a structural nested AFT model
# Data from NHEFS
##################################################################

# some preprocessing of the data
nhefs <- read.csv("/Users/milin/Downloads/nhefs.csv")
nhefs$survtime <- ifelse(nhefs$death==0, NA, (nhefs$yrdth-83)*12+nhefs$modth) # * yrdth ranges from 83 to 92

# model to estimate E[A|L]
modelA <- glm(qsmk ~ sex + race + age + I(age*age)
              + as.factor(education) + smokeintensity
              + I(smokeintensity*smokeintensity) + smokeyrs
              + I(smokeyrs*smokeyrs) + as.factor(exercise)
              + as.factor(active) + wt71 + I(wt71*wt71),
              data=nhefs, family=binomial())

nhefs$p.qsmk <- predict(modelA, nhefs, type="response") 
d <- nhefs[!is.na(nhefs$survtime),] # select only those with observed death time
n <- nrow(d)

# define the estimating function that needs to be minimized
sumeef <- function(psi){
  
  # creation of delta indicator
  if (psi>=0){
    delta <- ifelse(d$qsmk==0 | 
                      (d$qsmk==1 & psi <= log(120/d$survtime)), 
                    1, 0)
  } else if (psi < 0) {
    delta <- ifelse(d$qsmk==1 | 
                      (d$qsmk==0 & psi > log(d$survtime/120)), 1, 0)
  }
  
  smat <- delta*(d$qsmk-d$p.qsmk)
  sval <- sum(smat, na.rm=T)
  save <- sval/n
  smat <- smat - rep(save, n)
  
  # covariance
  sigma <- t(smat) %*% smat
  if (sigma == 0){
    sigma <- 1e-16
  }
  estimeq <- sval*solve(sigma)*t(sval)
  return(estimeq)
}

res <- optimize(sumeef, interval = c(-0.2,0.2))
psi1 <- res$minimum
objfunc <- as.numeric(res$objective)


# Use simple bisection method to find estimates of lower and upper 95% confidence bounds
increm <- 0.1
for_conf <- function(x){
  return(sumeef(x) - 3.84)
}

if (objfunc < 3.84){
  # Find estimate of where sumeef(x) > 3.84
  
  # Lower bound of 95% CI
  psilow <- psi1
  testlow <- objfunc
  countlow <- 0
  while (testlow < 3.84 & countlow < 100){
    psilow <- psilow - increm
    testlow <- sumeef(psilow)
    countlow <- countlow + 1
  }
  
  # Upper bound of 95% CI
  psihigh <- psi1
  testhigh <- objfunc
  counthigh <- 0
  while (testhigh < 3.84 & counthigh < 100){
    psihigh <- psihigh + increm
    testhigh <- sumeef(psihigh)
    counthigh <- counthigh + 1
  }
  
  # Better estimate using bisection method
  if ((testhigh > 3.84) & (testlow > 3.84)){
    
    # Bisection method
    left <- psi1
    fleft <- objfunc - 3.84
    right <- psihigh
    fright <- testhigh - 3.84
    middle <- (left  + right) / 2
    fmiddle <- for_conf(middle)
    count <- 0
    diff <- right - left
    
    while (!(abs(fmiddle) < 0.0001 | diff < 0.0001 | count > 100)){
      test <- fmiddle * fleft
      if (test < 0){
        right <- middle
        fright <- fmiddle
      } else {
        left <- middle
        fleft <- fmiddle
      }
      middle <- (left + right) / 2
      fmiddle <- for_conf(middle)
      count <- count + 1
      diff <- right - left
    }
    
    psi_high <- middle
    objfunc_high <- fmiddle + 3.84
    
    # lower bound of 95% CI
    left <- psilow
    fleft <- testlow - 3.84
    right <- psi1
    fright <- objfunc - 3.84
    middle <- (left + right) / 2
    fmiddle <- for_conf(middle)
    count <- 0
    diff <- right - left
    
    while(!(abs(fmiddle) < 0.0001 | diff < 0.0001 | count > 100)){
      test <- fmiddle * fleft
      if (test < 0){
        right <- middle
        fright <- fmiddle
      } else {
        left <- middle
        fleft <- fmiddle
      }
      middle <- (left + right) / 2
      fmiddle <- for_conf(middle)
      diff <- right - left
      count <- count + 1
    }
    psi_low <- middle
    objfunc_low <- fmiddle + 3.84
    psi <- psi1
  }
}
c(psi, psi_low, psi_high)

## [1] -0.0504 -0.2231  0.3331

因果推断与R语言

MiLin

2/5/2022