knitr::opts_chunk$set(echo = TRUE)

1 简介

对高维模型的分析变得越来越重要。高维模型需要估计的参数数量相对于样本量很大。这些模型自然地出现在现代数据集中,这些数据集具有许多可用于单个观察的测量特征,例如人口普查数据、扫描仪数据和文本数据。这样的模型也会很自然地出现,即使是在只有少量测量特征的数据中,当观察到的变量进入模型时,其确切的函数形式是未知的,而我们根据原始特征创建了许多技术变量,如字典。本场景中的示例包括具有非参数干扰函数的半参数模型。更一般地说,当试图对复杂现象建模时,往往会出现与样本量相关的许多参数的模型。

随着这些数据集在经济学和其他数据科学领域的可用性的增加,分析这些数据的新方法已经被开发出来。R包hdm包含了最近开发的用于高维近似稀疏模型的方法的实现,主要依赖于lasso和post-lasso的形式以及相关的估计和推断方法。方法如下图所示计量经济学的应用,但也有用的其他学科,如医学,生物学,社会学或心理学。

这个包中实现的方法与其他包中已经可用的方法有以下四种主要方式:

  • 1)首先,我们提供了一个Lasso回归的版本,该版本明确处理并允许非高斯误差和异方差误差。

  • 2)其次,我们实现了λ的选择套索回归。为了强调这一选择,我们称这个包中的Lasso实现为“strict”Lasso(=rlasso)。函数名中的前缀r应该强调这一点。在高维设置中,交叉验证非常流行;但它缺乏一个理论理由使用在现在的环境下,一些理论建议λ的选择往往是不可行的。

  • 3)第三,对于高维近似稀疏模型中出现的各种低维因果/结构参数,我们提供了有效的估计量一致有效的置信区间。例如,在高维稀疏回归模型中,我们为目标变量(如处理变量或政策变量)的回归系数提供了有效的估计量和一致有效的置信区间。这里的目标变量指的是不相关的对象,例如预先设定的回归系数。我们还提供了平均处理效果(ATE)和平均治疗效果(ATET)的估计和置信区间,并将这些参数扩展到生性设置

  • 4)第四,基于Belloni、Chernozhukov和Kato(2014)开发的方法和理论,给出了高维近似稀疏模型中估计系数的联合/一致置信区间。他们提出了回归系数在高维稀疏z-估计问题中的一致有效置信域,其中包括中值、均值和许多其他回归问题作为特例。在这篇文章中,我们把这个方法应用到拉索回归的系数中,并通过一个实证例子来强调这个方法。

2 开始

# install.packages("hdm")
library(tidyverse)
## -- Attaching packages ------------------------------------- tidyverse 1.2.1 --
## √ ggplot2 3.2.1     √ purrr   0.3.3
## √ tibble  2.1.3     √ dplyr   0.8.3
## √ tidyr   1.0.0     √ stringr 1.4.0
## √ readr   1.3.1     √ forcats 0.4.0
## -- Conflicts ---------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(DT)
library(hdm)

3 使用近似稀疏性进行预测

函数rlasso实现了Lasso和post-Lasso,其中前缀表示这些是Lasso和post-Lasso的严格版本。默认选项是post-lasso, post=TRUE。这个函数返回一个S3类rlasso对象,其中提供了predict、print、summary等方法。

3.1 数据构造

set.seed(12345)
n = 100 #sample size
p = 100 # number of variables
s = 3 # nubmer of variables with non-zero coefficients

X = matrix(rnorm(n * p), ncol = p)
beta = c(rep(5, s), rep(0, p - s))
Y = X %*% beta + rnorm(n)

3.2 Lasso

lasso.reg = rlasso(Y ~ X, post = FALSE) # use lasso, not-Post-lasso

sum.lasso <- summary(lasso.reg, all = FALSE) # can also do print(lasso.reg, all=FALSE)
## 
## Call:
## rlasso.formula(formula = Y ~ X, post = FALSE)
## 
## Post-Lasso Estimation:  FALSE 
## 
## Total number of variables: 100
## Number of selected variables: 11 
## 
## Residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.09008 -0.45801 -0.01237  0.50291  2.25098 
## 
##             Estimate
## (Intercept)    0.057
## 1              4.771
## 2              4.693
## 3              4.766
## 13            -0.045
## 15            -0.047
## 16            -0.005
## 19            -0.092
## 22            -0.027
## 40            -0.011
## 61             0.114
## 100           -0.025
## 
## Residual standard error: 0.8039
## Multiple R-squared:  0.9913
## Adjusted R-squared:  0.9902
## Joint significance test:
##  the sup score statistic for joint significance test is 64.02 with a p-value of     0
yhat.lasso = predict(lasso.reg) #in-sample prediction

Xnew = matrix(rnorm(n * p), ncol = p) # new X

Ynew = Xnew %*% beta + rnorm(n) #new Y

yhat.lasso.new = predict(lasso.reg, newdata = Xnew) #out-of-sample prediction

3.3 Post—Lasso

post.lasso.reg = rlasso(Y ~ X, post = TRUE) #now use post-lasso

print(post.lasso.reg, all = FALSE) # or use summary(post.lasso.reg, all=FALSE)
## 
## Call:
## rlasso.formula(formula = Y ~ X, post = TRUE)
## 
## (Intercept)            1            2            3  
##      0.0341       4.9241       4.8579       4.9644
summary(post.lasso.reg,all = FALSE)
## 
## Call:
## rlasso.formula(formula = Y ~ X, post = TRUE)
## 
## Post-Lasso Estimation:  TRUE 
## 
## Total number of variables: 100
## Number of selected variables: 3 
## 
## Residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.94738 -0.50541 -0.01237  0.51283  2.11575 
## 
##             Estimate
## (Intercept)    0.034
## 1              4.924
## 2              4.858
## 3              4.964
## 
## Residual standard error: 0.833
## Multiple R-squared:  0.9906
## Adjusted R-squared:  0.9903
## Joint significance test:
##  the sup score statistic for joint significance test is 64.02 with a p-value of     0

在上面的示例中,总体显著性的supscore统计值为64.02,pvalue为0。这意味着在水平α= 0:05零假设被拒绝和模型具有解释力。

yhat.postlasso = predict(post.lasso.reg) #in-sample prediction

yhat.postlasso.new = predict(post.lasso.reg, newdata = Xnew) #out-of-sample p

MAE <- apply(cbind(abs(Ynew - yhat.lasso.new), abs(Ynew - yhat.postlasso.new)), 2, mean)

names(MAE) <- c("lasso MAE", "Post-lasso MAE")

print(MAE, digits = 2) # MAE for Lasso and Post-Lasso
##      lasso MAE Post-lasso MAE 
##           0.91           0.79

4 目标回归系数的推论

4.1 Intuition for the Orthogonality Principle in Linear Models via Partialling Out

4.1.1 In low-dimensional settings

set.seed(1)
n = 5000
p = 20
X = matrix(rnorm(n * p), ncol = p)

colnames(X) <- c("d",paste("X",1:19,sep = ""))

X %>% datatable()
xnames = colnames(X)[-1]

beta = rep(1, 20)

y = X %*% beta + rnorm(n)  # 构造的y  

dat = data.frame(y = y, X)

We can estimate α0 by running full least squares:

# full fit
fmla = as.formula(paste("y ~ ", paste(colnames(X), collapse = "+")))

full.fit = lm(fmla, data = dat)

summary(full.fit)$coef["d", 1:2]
##   Estimate Std. Error 
## 0.97807455 0.01371225

Another way to estimate α0 is to first partial out the x-variables from yi and di, and run least squares on the residuals:

fmla.y = as.formula(paste("y ~ ", paste(xnames, collapse = "+")))

fmla.d = as.formula(paste("d ~ ", paste(xnames, collapse = "+")))

# partial fit via ols
rY = lm(fmla.y, data = dat)$res

rD = lm(fmla.d, data = dat)$res

partial.fit.ls = lm(rY ~ rD)

summary(partial.fit.ls)$coef["rD", 1:2]
##   Estimate Std. Error 
## 0.97807455 0.01368616

可以看到,估计值是相同的,而标准误差几乎是相同的。事实上,由于与分拆方法相关的估计方程的正交性,标准误差是渐近等价的。

4.1.2 In high-dimensional settings

# partial fit via post-lasso
rY = rlasso(fmla.y, data = dat)$res

rD = rlasso(fmla.d, data = dat)$res

partial.fit.postlasso = lm(rY ~ rD)

summary(partial.fit.postlasso)$coef["rD", 1:2]
##   Estimate Std. Error 
## 0.97273870 0.01368677

我们可以看到,这个估计值和标准误差几乎与前面给出的估计值相同。事实上,它们在低维环境中渐进地与先前的估计等价,其优点是,与先前的估计不同,它们在高维环境中仍然有效。

基于Lasso或post-Lasso的分出的正交估计方程方法由函数rlassoEffect实现,使用method= “partialling out”:

Eff = rlassoEffect(X[, -1], y, X[, 1], method = "partialling out")

summary(Eff)$coef[, 1:2]
##  Estimate. Std. Error 
## 0.97273870 0.01368677
summary(Eff)
## [1] "Estimates and significance testing of the effect of target variables"
##      Estimate. Std. Error t value Pr(>|t|)    
## [1,]   0.97274    0.01369   71.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

另一种正交估计方程方法-基于协变量的双重选择,由函数rlassoEffect实现,使用method= “double selection”。算法上的方法如下:

    1. Select controls xij’s that predict yi by Lasso.
    1. Select controls xij’s that predict di by Lasso.
    1. Run OLS of yi on di and the union of controls selected in steps 1 and 2
Eff = rlassoEffect(X[, -1], y, X[, 1], method = "double selection")

summary(Eff)$coef[, 1:2]
##  Estimate. Std. Error 
## 0.97807455 0.01415624
summary(Eff)
## [1] "Estimates and significance testing of the effect of target variables"
##    Estimate. Std. Error t value Pr(>|t|)    
## d1   0.97807    0.01416   69.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Estimate. Std. Error: 0.97807455 0.01415624,与method= “partialling out”不同

4.2 Inference: Confidence Intervals and Significance Testing

rlassoEffects函数rlassoEffect的一个包装函数,它对单个目标回归变量进行推断。在Belloni、Chernozhukov和Hansen(2014)中给出了理论基础,在Belloni、Chernozhukov和Kato(2014)中给出了更一般的一类模型。函数rlassoEffects可以以rlassoEffects(x, y,index)的形 式使用,其中x是一个矩阵,y表示结果变量,index指定进行推理的x的变量。这可以通过一个整数向量(变量的位置)、一个逻辑向量或变量的名称来实现。

另一种用法是rlassoEffects(formula, data,I),其中I是一个单边公式,它指定为其进行推断的变量。有关进一步的详细信息,请 参阅函数的帮助页面和下面的示例,其中显示了这两种方法的用法.

set.seed(1)
n = 100 #sample size
p = 100 # number of variables
s = 3 # nubmer of non-zero variables
X = matrix(rnorm(n * p), ncol = p)

colnames(X) <- paste("X", 1:p, sep = "")

colnames(X)
##   [1] "X1"   "X2"   "X3"   "X4"   "X5"   "X6"   "X7"   "X8"   "X9"   "X10" 
##  [11] "X11"  "X12"  "X13"  "X14"  "X15"  "X16"  "X17"  "X18"  "X19"  "X20" 
##  [21] "X21"  "X22"  "X23"  "X24"  "X25"  "X26"  "X27"  "X28"  "X29"  "X30" 
##  [31] "X31"  "X32"  "X33"  "X34"  "X35"  "X36"  "X37"  "X38"  "X39"  "X40" 
##  [41] "X41"  "X42"  "X43"  "X44"  "X45"  "X46"  "X47"  "X48"  "X49"  "X50" 
##  [51] "X51"  "X52"  "X53"  "X54"  "X55"  "X56"  "X57"  "X58"  "X59"  "X60" 
##  [61] "X61"  "X62"  "X63"  "X64"  "X65"  "X66"  "X67"  "X68"  "X69"  "X70" 
##  [71] "X71"  "X72"  "X73"  "X74"  "X75"  "X76"  "X77"  "X78"  "X79"  "X80" 
##  [81] "X81"  "X82"  "X83"  "X84"  "X85"  "X86"  "X87"  "X88"  "X89"  "X90" 
##  [91] "X91"  "X92"  "X93"  "X94"  "X95"  "X96"  "X97"  "X98"  "X99"  "X100"
beta = c(rep(3, s), rep(0, p - s))
y = 1 + X %*% beta + rnorm(n)       # 1可以看作截距项

data = data.frame(cbind(y, X))

colnames(data)[1] <- "y"
fm = paste("y ~", paste(colnames(X), collapse = "+"))
fm = as.formula(fm)
fm
## y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 + X11 + 
##     X12 + X13 + X14 + X15 + X16 + X17 + X18 + X19 + X20 + X21 + 
##     X22 + X23 + X24 + X25 + X26 + X27 + X28 + X29 + X30 + X31 + 
##     X32 + X33 + X34 + X35 + X36 + X37 + X38 + X39 + X40 + X41 + 
##     X42 + X43 + X44 + X45 + X46 + X47 + X48 + X49 + X50 + X51 + 
##     X52 + X53 + X54 + X55 + X56 + X57 + X58 + X59 + X60 + X61 + 
##     X62 + X63 + X64 + X65 + X66 + X67 + X68 + X69 + X70 + X71 + 
##     X72 + X73 + X74 + X75 + X76 + X77 + X78 + X79 + X80 + X81 + 
##     X82 + X83 + X84 + X85 + X86 + X87 + X88 + X89 + X90 + X91 + 
##     X92 + X93 + X94 + X95 + X96 + X97 + X98 + X99 + X100

我们可以对一组感兴趣的变量进行推理,例如第一、第二、第三和第五十:

lasso.effect = rlassoEffects(fm, I = ~X1 + X2 + X3 + X50, data = data)
print(lasso.effect)
## 
## Call:
## rlassoEffects.formula(formula = fm, data = data, I = ~X1 + X2 + 
##     X3 + X50)
## 
## Coefficients:
##      X1       X2       X3      X50  
## 2.94448  3.04127  2.97540  0.07196
summary(lasso.effect)
## [1] "Estimates and significance testing of the effect of target variables"
##     Estimate. Std. Error t value Pr(>|t|)    
## X1    2.94448    0.08815  33.404   <2e-16 ***
## X2    3.04127    0.08389  36.253   <2e-16 ***
## X3    2.97540    0.07804  38.127   <2e-16 ***
## X50   0.07196    0.07765   0.927    0.354    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(lasso.effect)
##           2.5 %    97.5 %
## X1   2.77171308 3.1172421
## X2   2.87685121 3.2056979
## X3   2.82244962 3.1283583
## X50 -0.08022708 0.2241377

若不指示推断变量I,观察结果如何,会报错!!!

联合置信区间

confint(lasso.effect, level = 0.95, joint = TRUE)
##          2.5 %    97.5 %
## X1   2.7279477 3.1610075
## X2   2.8371214 3.2454278
## X3   2.7833176 3.1674903
## X50 -0.1154509 0.2593615
plot(lasso.effect, main = "Confidence Intervals")
## Warning: Ignoring unknown aesthetics: width, h

4.3 应用: 性别对薪资的影响

在劳动经济学中,一个重要的问题是工资如何与雇员的性别相关。我们利用2012年美国人口普查数据,联合分析了性别和其他变量与性别的交互作用对工资的影响。因变量为工资的对数,目标变量为女性(结合其他变量)。所有其他变量表示其他一些社会经济特征,如婚姻状况、教育和经历。有关变量的详细描述,请参阅帮助页面。首先,我们加载并准备数据:

library(hdm)
data(cps2012)
dim(cps2012)
## [1] 29217    23
cps2012 %>% datatable()
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
X <- model.matrix(~-1 + female + female:(widowed + divorced + separated + nevermarried +
hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3) + +(widowed +
divorced + separated + nevermarried + hsd08 + hsd911 + hsg + cg + ad + mw + so +
we + exp1 + exp2 + exp3)^2, data = cps2012)

dim(X)
## [1] 29217   136
X <- X[, which(apply(X, 2, var) != 0)] # exclude all constant variables
dim(X)
## [1] 29217   116
index.gender <- grep("female", colnames(X))
y <- cps2012$lnw

目标参数,即所有与性别相关的系数(即与其他变量的交互作用)的参数估计由以下命令计算和汇总

effects.female <- rlassoEffects(x = X, y = y, index = index.gender)
summary(effects.female)
## [1] "Estimates and significance testing of the effect of target variables"
##                     Estimate. Std. Error t value Pr(>|t|)    
## female              -0.154923   0.050162  -3.088 0.002012 ** 
## female:widowed       0.136095   0.090663   1.501 0.133325    
## female:divorced      0.136939   0.022182   6.174 6.68e-10 ***
## female:separated     0.023303   0.053212   0.438 0.661441    
## female:nevermarried  0.186853   0.019942   9.370  < 2e-16 ***
## female:hsd08         0.027810   0.120914   0.230 0.818092    
## female:hsd911       -0.119335   0.051880  -2.300 0.021435 *  
## female:hsg          -0.012890   0.019223  -0.671 0.502518    
## female:cg            0.010139   0.018327   0.553 0.580114    
## female:ad           -0.030464   0.021806  -1.397 0.162405    
## female:mw           -0.001063   0.019192  -0.055 0.955811    
## female:so           -0.008183   0.019357  -0.423 0.672468    
## female:we           -0.004226   0.021168  -0.200 0.841760    
## female:exp1          0.004935   0.007804   0.632 0.527139    
## female:exp2         -0.159519   0.045300  -3.521 0.000429 ***
## female:exp3          0.038451   0.007861   4.891 1.00e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

最后,我们估计和绘制置信区间,首先“点态化”,然后联合置信区间。

joint.CI <- confint(effects.female, level = 0.95, joint = TRUE)
joint.CI
##                           2.5 %      97.5 %
## female              -0.29422452 -0.01562204
## female:widowed      -0.13367117  0.40586213
## female:divorced      0.07479695  0.19908182
## female:separated    -0.11671664  0.16332216
## female:nevermarried  0.12925782  0.24444915
## female:hsd08        -0.37450228  0.43012291
## female:hsd911       -0.26902488  0.03035480
## female:hsg          -0.06513949  0.03935993
## female:cg           -0.04168175  0.06195886
## female:ad           -0.09583693  0.03490944
## female:mw           -0.05470794  0.05258106
## female:so           -0.06272074  0.04635406
## female:we           -0.06562874  0.05717648
## female:exp1         -0.01649540  0.02636592
## female:exp2         -0.28414464 -0.03489402
## female:exp3          0.01684777  0.06005338
 plot(effects.female, joint=TRUE, level=0.95) # plot of the effects
## Warning: Ignoring unknown aesthetics: width, h

这一分析使我们更深入地了解性别歧视与其他社会经济变量之间的关系

4.4 应用:在有许多混合因素的线性模型中估计处理效应

部分实证增长文献侧重于估计初始(滞后)人均GDP(国内生产总值)水平对人均GDP增长率的影响。

特别是,经典的索洛-斯旺-拉姆齐增长模型的一个关键预测是趋同假说,该假说认为,在一系列制度和社会特征的条件下,较贫穷国家通常增长更快,因此往往会赶上较富裕国家。描述这些特征的协变量包括衡量教育和科学政策、市场机构的实力、贸易开放程度、储蓄率等的变量。

library(knitr)
knitr::include_graphics("hdm.png")

data(GrowthData)
dim(GrowthData)
## [1] 90 63
colnames(GrowthData)
##  [1] "Outcome"   "intercept" "gdpsh465"  "bmp1l"     "freeop"   
##  [6] "freetar"   "h65"       "hm65"      "hf65"      "p65"      
## [11] "pm65"      "pf65"      "s65"       "sm65"      "sf65"     
## [16] "fert65"    "mort65"    "lifee065"  "gpop1"     "fert1"    
## [21] "mort1"     "invsh41"   "geetot1"   "geerec1"   "gde1"     
## [26] "govwb1"    "govsh41"   "gvxdxe41"  "high65"    "highm65"  
## [31] "highf65"   "highc65"   "highcm65"  "highcf65"  "human65"  
## [36] "humanm65"  "humanf65"  "hyr65"     "hyrm65"    "hyrf65"   
## [41] "no65"      "nom65"     "nof65"     "pinstab1"  "pop65"    
## [46] "worker65"  "pop1565"   "pop6565"   "sec65"     "secm65"   
## [51] "secf65"    "secc65"    "seccm65"   "seccf65"   "syr65"    
## [56] "syrm65"    "syrf65"    "teapri65"  "teasec65"  "ex1"      
## [61] "im1"       "xr65"      "tot1"
y = GrowthData[, 1, drop = F]
d = GrowthData[, 3, drop = F]
X = as.matrix(GrowthData)[, -c(1, 2, 3)]
varnames = colnames(GrowthData)
varnames %>% class()
## [1] "character"

首先使用线性模型估计效应

xnames = varnames[-c(1, 2, 3)] # names of X variables

dandxnames = varnames[-c(1, 2)] # names of D and X variables

# create formulas by pasting names (this saves typing times)
fmla = as.formula(paste("Outcome ~ ", paste(dandxnames, collapse = "+")))
ls.effect = lm(fmla, data = GrowthData)
summary(ls.effect)
## 
## Call:
## lm(formula = fmla, data = GrowthData)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.040338 -0.011298 -0.000863  0.011813  0.043247 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  2.472e-01  7.845e-01   0.315  0.75506   
## gdpsh465    -9.378e-03  2.989e-02  -0.314  0.75602   
## bmp1l       -6.886e-02  3.253e-02  -2.117  0.04329 * 
## freeop       8.007e-02  2.079e-01   0.385  0.70300   
## freetar     -4.890e-01  4.182e-01  -1.169  0.25214   
## h65         -2.362e+00  8.573e-01  -2.755  0.01019 * 
## hm65         7.071e-01  5.231e-01   1.352  0.18729   
## hf65         1.693e+00  5.032e-01   3.365  0.00223 **
## p65          2.655e-01  1.643e-01   1.616  0.11727   
## pm65         1.370e-01  1.512e-01   0.906  0.37284   
## pf65        -3.313e-01  1.651e-01  -2.006  0.05458 . 
## s65          3.908e-02  1.855e-01   0.211  0.83469   
## sm65        -3.067e-02  1.168e-01  -0.263  0.79479   
## sf65        -1.799e-01  1.181e-01  -1.523  0.13886   
## fert65       6.881e-03  2.705e-02   0.254  0.80108   
## mort65      -2.335e-01  8.174e-01  -0.286  0.77729   
## lifee065    -1.491e-02  1.933e-01  -0.077  0.93906   
## gpop1        9.702e-01  1.812e+00   0.535  0.59663   
## fert1        8.838e-03  3.504e-02   0.252  0.80271   
## mort1        6.656e-02  6.848e-01   0.097  0.92326   
## invsh41      7.446e-02  1.084e-01   0.687  0.49797   
## geetot1     -7.151e-01  1.680e+00  -0.426  0.67364   
## geerec1      6.300e-01  2.447e+00   0.257  0.79874   
## gde1        -4.436e-01  1.671e+00  -0.265  0.79263   
## govwb1       3.375e-01  4.380e-01   0.770  0.44748   
## govsh41      4.632e-01  1.925e+00   0.241  0.81165   
## gvxdxe41    -7.934e-01  2.059e+00  -0.385  0.70296   
## high65      -7.525e-01  9.057e-01  -0.831  0.41311   
## highm65     -3.903e-01  6.812e-01  -0.573  0.57131   
## highf65     -4.177e-01  5.615e-01  -0.744  0.46308   
## highc65     -2.216e+00  1.481e+00  -1.496  0.14575   
## highcm65     2.797e-01  6.582e-01   0.425  0.67412   
## highcf65     3.921e-01  7.660e-01   0.512  0.61278   
## human65      2.337e+00  3.307e+00   0.707  0.48559   
## humanm65    -1.209e+00  1.619e+00  -0.747  0.46121   
## humanf65    -1.104e+00  1.685e+00  -0.655  0.51763   
## hyr65        5.491e+01  2.389e+01   2.299  0.02918 * 
## hyrm65       1.294e+01  2.317e+01   0.558  0.58112   
## hyrf65       9.093e+00  1.767e+01   0.515  0.61088   
## no65         3.721e-02  1.320e-01   0.282  0.78006   
## nom65       -2.120e-02  6.496e-02  -0.326  0.74661   
## nof65       -1.686e-02  6.700e-02  -0.252  0.80319   
## pinstab1    -4.997e-02  3.092e-02  -1.616  0.11729   
## pop65        1.032e-07  1.318e-07   0.783  0.44027   
## worker65     3.408e-02  1.562e-01   0.218  0.82887   
## pop1565     -4.655e-01  4.713e-01  -0.988  0.33176   
## pop6565     -1.357e+00  6.349e-01  -2.138  0.04139 * 
## sec65       -1.089e-02  3.077e-01  -0.035  0.97201   
## secm65       3.344e-03  1.512e-01   0.022  0.98251   
## secf65      -2.304e-03  1.580e-01  -0.015  0.98847   
## secc65      -4.915e-01  7.290e-01  -0.674  0.50570   
## seccm65      2.596e-01  3.557e-01   0.730  0.47150   
## seccf65      2.207e-01  3.733e-01   0.591  0.55924   
## syr65       -7.556e-01  7.977e+00  -0.095  0.92521   
## syrm65       3.109e-01  3.897e+00   0.080  0.93698   
## syrf65       7.593e-01  4.111e+00   0.185  0.85479   
## teapri65     3.955e-05  7.700e-04   0.051  0.95941   
## teasec65     2.497e-04  1.171e-03   0.213  0.83274   
## ex1         -5.804e-01  2.418e-01  -2.400  0.02329 * 
## im1          5.914e-01  2.503e-01   2.363  0.02531 * 
## xr65        -1.038e-04  5.417e-05  -1.916  0.06565 . 
## tot1        -1.279e-01  1.126e-01  -1.136  0.26561   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03074 on 28 degrees of freedom
## Multiple R-squared:  0.8871, Adjusted R-squared:  0.6411 
## F-statistic: 3.607 on 61 and 28 DF,  p-value: 0.0002003

其次,我们通过post-lasso的partialling out来估计效果:

dX = as.matrix(cbind(d, X))
lasso.effect = rlassoEffect(x = X, y = y, d = d, method = "partialling out")
summary(lasso.effect)
## [1] "Estimates and significance testing of the effect of target variables"
##      Estimate. Std. Error t value Pr(>|t|)    
## [1,]  -0.04981    0.01394  -3.574 0.000351 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

第三,我们用双重选择的方法来估计效果:

dX = as.matrix(cbind(d, X))
doublesel.effect = rlassoEffect(x = X, y = y, d = d, method = "double selection")
summary(doublesel.effect)
## [1] "Estimates and significance testing of the effect of target variables"
##          Estimate. Std. Error t value Pr(>|t|)   
## gdpsh465  -0.05001    0.01579  -3.167  0.00154 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(xtable)
table = rbind(summary(ls.effect)$coef["gdpsh465", 1:2],
              summary(lasso.effect)$coef[,1:2], 
              summary(doublesel.effect)$coef[, 1:2])
colnames(table) = c("Estimate", "Std. Error") #names(summary(full.fit)$coef)[1:2]
rownames(table) = c("full reg via ols", "partial regvia post-lasso ", "partial reg via double selection")
tab = xtable(table, digits = c(2, 2, 5))
tab
## % latex table generated in R 3.6.1 by xtable 1.8-4 package
## % Sat Nov 02 16:06:03 2019
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrr}
##   \hline
##  & Estimate & Std. Error \\ 
##   \hline
## full reg via ols & -0.01 & 0.02989 \\ 
##   partial regvia post-lasso  & -0.05 & 0.01394 \\ 
##   partial reg via double selection & -0.05 & 0.01579 \\ 
##    \hline
## \end{tabular}
## \end{table}

我们看到OLS估计是有噪声的,这并不奇怪,因为p与n是可比较的。偏回归方法,基于两个投影方程中协变量的Lasso选择,相比之下,得到更精确的估计,这支持了条件收敛的假设。我们注意到偏回归方法是正交估计方程方法的一个特例。

5 高维环境下的工具变量估计

在许多应用环境中,研究者对估计变量的(结构)效应感兴趣(处理变量),但该变量是内生性的,即与误差项相关。在这种情况下,工具变量(IV)方法被用来识别因果参数。

5.1 估计和推断

与formula一起使用的一般模式是function(formula, data,…),其中formula由两部分组成。这些公式是模式y d + x | x + z,其中y为结果变量, x为外生变量,d为内生变量(如果允许多个变量取决于具体函数),z为工具变量。函数的一种更原始的用法是将所需的一组变量简单地转换为矩阵:function(x=x, d=d, y=y, z=z)。在下面的一些例子中,显示了这两种选择。

5.2 应用:经济发展与制度建设

由于制度与产出之间的同时性,估计制度对产出的因果效应就变得复杂了:具体来说,更好的制度可能导致更高的收入,但更高的收入也可能导致更好的制度的发展。为了帮助克服这种同时性,Acemoglu, Johnson和Robinson(2001)使用早期欧洲定居者的死亡率作为制度质量的工具。

定居者在他们更可能建立长期定居点的地方建立更好的制度,他们更可能长期定居的地方与定居者在最初殖民化时的死亡率有关。 工具变量的外生性的动机是,GDP虽然持续存在,但不太可能受到上个世纪或更早时期死亡率的强烈影响,除非通过制度。

data(AJR)
AJR %>% dim
## [1] 64 11
AJR %>% datatable()
y = AJR$GDP            # 结果变量
d = AJR$Exprop         # 内生变量
z = AJR$logMort        # 工具变量
x = model.matrix(~-1 + (Latitude + Latitude2 + Africa + Asia + Namer + Samer)^2,data = AJR) # 控制变量

# AJR.Xselect = rlassoIV(x=x, d=d, y=y, z=z, select.X=TRUE, select.Z=FALSE)
AJR.Xselect = rlassoIV(GDP ~ Exprop + (Latitude + Latitude2 + Africa + Asia + Namer +Samer)^2 | logMort + (Latitude + Latitude2 + Africa + Asia + Namer + Samer)^2,data = AJR, select.X = TRUE, select.Z = FALSE)

summary(AJR.Xselect)
## [1] "Estimation and significance testing of the effect of target variables in the IV regression model"
##        coeff.    se. t-value p-value   
## Exprop 0.8450 0.2699   3.131 0.00174 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(AJR.Xselect)
##            2.5 %   97.5 %
## Exprop 0.3159812 1.374072

理解上面的过程是很有趣的。本质上,它使用post-Lasso将xi从yi、di和zi中分离出来,并将2SLS应用于剩余量。

让我们在这个例子中更详细地研究一下分区。我们可以先试着用OLS进行分拆:

# parialling out by linear model
fmla.y = GDP ~ (Latitude + Latitude2 + Africa + Asia + Namer + Samer)^2
fmla.d = Exprop ~ (Latitude + Latitude2 + Africa + Asia + Namer + Samer)^2
fmla.z = logMort ~ (Latitude + Latitude2 + Africa + Asia + Namer + Samer)^2

rY = lm(fmla.y, data = AJR)$res
rD = lm(fmla.d, data = AJR)$res
rZ = lm(fmla.z, data = AJR)$res

# ivfit.lm = tsls(y=rY,d=rD, x=NULL, z=rZ, intercept=FALSE)
ivfit.lm = tsls(rY ~ rD | rZ, intercept = FALSE)

print(cbind(ivfit.lm$coef, ivfit.lm$se), digits = 7)
##        [,1]     [,2]
## rD 1.267209 1.730537

看到估计值有很大的标准误差。不精确是可以预料的。接下来,我们将OLS操作符替换为post-Lasso来进行划分

# parialling out by lasso
rY = rlasso(fmla.y, data = AJR)$res
rD = rlasso(fmla.d, data = AJR)$res
rZ = rlasso(fmla.z, data = AJR)$res

# ivfit.lasso = tsls(y=rY,d=rD, x=NULL, z=rZ, intercept=FALSE)
ivfit.lasso = tsls(rY ~ rD | rZ, intercept = FALSE)
print(cbind(ivfit.lasso$coef, ivfit.lasso$se), digits = 5)
##       [,1]    [,2]
## rD 0.84503 0.26993

5.3 应用:征地决定对经济结果的影响

在这里,我们调查了在国家征用权(政府征用私人财产)案件中支持原告的决定对经济结果的影响。司法判决与潜在经济后果之间可能存在的内生性,使此类判决的效果分析变得复杂。为了解决潜在的内生性问题,我们采用了一种工具变量策略,该策略基于随机分配法官到做出决定的联邦上诉委员会。

因为法官是随机分配给三个法官小组的,所以法官的确切身份和他们的人口统计数据是随机分配的,取决于联邦巡回法院法官在一个特定的巡回年的特征分布。在这种随机分配下,联邦上诉委员会法官的特征只能通过法官的决定与房地产价格相关;因此,法官的特征很可能满足工具变量排除限制。

data(EminentDomain)
z <- as.matrix(EminentDomain$logGDP$z)
x <- as.matrix(EminentDomain$logGDP$x)
y <- EminentDomain$logGDP$y
d <- EminentDomain$logGDP$d
x <- x[, apply(x, 2, mean, na.rm = TRUE) > 0.05] 
z <- z[, apply(z, 2, mean, na.rm = TRUE) > 0.05] 

首先,我们使用简单的OLS和2SLS两种工具来估计处理变量的效果:

ED.ols = lm(y ~ cbind(d, x))
ED.2sls = tsls(y = y, d = d, x = x, z = z[, 1:2], intercept = FALSE)

接下来,we estimate the model with selection on the instruments

lasso.IV.Z = rlassoIV(x = x, d = d, y = y, z = z, select.X = FALSE, select.Z = TRUE)
# or lasso.IV.Z = rlassoIVselectZt(x=X, d=d, y=y, z=z)
summary(lasso.IV.Z)
## [1] "Estimates and significance testing of the effect of target variables in the IV regression model"
##    coeff.    se. t-value p-value
## d1 0.4146 0.2902   1.428   0.153
confint(lasso.IV.Z)
##         2.5 %    97.5 %
## d1 -0.1542764 0.9834796

最后,我们对x和z变量进行选择

lasso.IV.XZ = rlassoIV(x = x, d = d, y = y, z = z, select.X = TRUE, select.Z = TRUE)

summary(lasso.IV.XZ)
## Estimates and Significance Testing of the effect of target variables in the IV regression model 
##      coeff.      se. t-value p-value
## d1 -0.02383  0.12851  -0.185   0.853
confint(lasso.IV.XZ)
##         2.5 %    97.5 %
## d1 -0.2757029 0.2280335

通过比较我们看到的结果,OLS估计表明,联邦巡回法院支持原告上诉的判决的影响是显著正的,但是考虑潜在内生性的2SLS估计使结果不显著。对仪器进行选择也会得到类似的结果。对x和z变量进行选择会导致非常不精确的估计。最后,我们比较了所有的结果。

library(xtable)
table = matrix(0, 4, 2)
table[1, ] = summary(ED.ols)$coef[2, 1:2]
table[2, ] = cbind(ED.2sls$coef[1], ED.2sls$se[1])
table[3, ] = summary(lasso.IV.Z)[, 1:2]
## [1] "Estimates and significance testing of the effect of target variables in the IV regression model"
##    coeff.    se. t-value p-value
## d1 0.4146 0.2902   1.428   0.153
table[4, ] = summary(lasso.IV.XZ)[, 1:2]
## Estimates and Significance Testing of the effect of target variables in the IV regression model 
##      coeff.      se. t-value p-value
## d1 -0.02383  0.12851  -0.185   0.853
colnames(table) = c("Estimate", "Std. Error")
rownames(table) = c("ols regression", "IV estimation ", "selection on Z", "selection on X and Z")
tab = xtable(table, digits = c(2, 2, 7))
tab
## % latex table generated in R 3.6.1 by xtable 1.8-4 package
## % Sat Nov 02 16:06:06 2019
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrr}
##   \hline
##  & Estimate & Std. Error \\ 
##   \hline
## ols regression & 0.01 & 0.0098659 \\ 
##   IV estimation  & -0.01 & 0.0337664 \\ 
##   selection on Z & 0.41 & 0.2902492 \\ 
##   selection on X and Z & -0.02 & 0.1285065 \\ 
##    \hline
## \end{tabular}
## \end{table}

6 在高维环境下处理效果的推断

6.1. Treatment Effects Parameters

knitr::include_graphics("hdm2.png")

6.2 Estimation and Inference of Treatment effects

6.3 Application: 401(k) plan participation

虽然401(k)计划被广泛用作退休储蓄的工具是显而易见的,但其对资产的影响却不那么明显。确定参加401(k)计划对积累资产影响的关键问题是储蓄者的异质性和对参与状态的非随机选择。

data(pension)
y = pension$tw
d = pension$p401
z = pension$e401
X = pension[, c("i2", "i3", "i4", "i5", "i6", "i7", "a2", "a3", "a4", "a5", "fsize",
"hs", "smcol", "col", "marr", "twoearn", "db", "pira", "hown")] # simple model
xvar = c("i2", "i3", "i4", "i5", "i6", "i7", "a2", "a3", "a4", "a5", "fsize", "hs",
"smcol", "col", "marr", "twoearn", "db", "pira", "hown")

xpart = paste(xvar, collapse = "+")
pension.ate = rlassoATE(X,d,y)

summary(pension.ate)
## Estimation and significance testing of the treatment effect 
## Type: ATE 
## Bootstrap: not applicable 
##    coeff.   se. t-value  p-value    
## TE  10180  1931   5.273 1.34e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pension.atet = rlassoATET(X,d,y)

summary(pension.atet)
## Estimation and significance testing of the treatment effect 
## Type: ATET 
## Bootstrap: not applicable 
##    coeff.   se. t-value p-value    
## TE  12628  2944   4.289 1.8e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For LATE and LATET we estimate the effect of 401(k) participation (d) with plan eligibility (z) as instrument.

pension.late = rlassoLATE(X, d, y, z, always_takers = FALSE)

summary(pension.late)
## Estimation and significance testing of the treatment effect 
## Type: LATE 
## Bootstrap: not applicable 
##    coeff.   se. t-value p-value    
## TE  12250  2745   4.463 8.1e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pension.latet = rlassoLATET(X, d, y, z, always_takers = FALSE)

summary(pension.latet)
## Estimation and significance testing of the treatment effect 
## Type: LATET 
## Bootstrap: not applicable 
##    coeff.   se. t-value  p-value    
## TE  15323  3645   4.204 2.63e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

最后,我们估计一个包含所有交互效应的模型:

# generate all interactions of X's
xvar2 = paste("(", paste(xvar, collapse = "+"), ")^2", sep = "")

forminteract = formula(paste("tw ~", xvar2, " + p401", "|", xvar2, sep = ""))

forminteract
## tw ~ (i2 + i3 + i4 + i5 + i6 + i7 + a2 + a3 + a4 + a5 + fsize + 
##     hs + smcol + col + marr + twoearn + db + pira + hown)^2 + 
##     p401 | (i2 + i3 + i4 + i5 + i6 + i7 + a2 + a3 + a4 + a5 + 
##     fsize + hs + smcol + col + marr + twoearn + db + pira + hown)^2
formZinteract = formula(paste("tw ~", xvar2, " + p401", "|", xvar2, " + e401", sep = ""))
pension.ate= rlassoATE(forminteract, data = pension); summary(pension.ate)
## Estimation and significance testing of the treatment effect 
## Type: ATE 
## Bootstrap: not applicable 
##    coeff.   se. t-value p-value    
## TE  10420  1901   5.482 4.2e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pension.atet= rlassoATET(forminteract, data = pension); summary(pension.atet)
## Estimation and significance testing of the treatment effect 
## Type: ATET 
## Bootstrap: not applicable 
##    coeff.   se. t-value  p-value    
## TE  12515  2901   4.313 1.61e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pension.late= rlassoLATE(formZinteract, data = pension, always_takers = FALSE);summary(pension.late) 
## Estimation and significance testing of the treatment effect 
## Type: LATE 
## Bootstrap: not applicable 
##    coeff.   se. t-value p-value    
## TE  11844  2726   4.344 1.4e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pension.latet= rlassoLATET(formZinteract, data = pension,always_takers = FALSE); summary(pension.latet) 
## Estimation and significance testing of the treatment effect 
## Type: LATET 
## Bootstrap: not applicable 
##    coeff.   se. t-value  p-value    
## TE  15313  3789   4.041 5.32e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
table= matrix(0, 4, 2) 
table[1,]=summary(pension.ate)[,1:2] 
## Estimation and significance testing of the treatment effect 
## Type: ATE 
## Bootstrap: not applicable 
##    coeff.   se. t-value p-value    
## TE  10420  1901   5.482 4.2e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
table[2,]= summary(pension.atet)[,1:2] 
## Estimation and significance testing of the treatment effect 
## Type: ATET 
## Bootstrap: not applicable 
##    coeff.   se. t-value  p-value    
## TE  12515  2901   4.313 1.61e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
table[3,]=summary(pension.late)[,1:2] 
## Estimation and significance testing of the treatment effect 
## Type: LATE 
## Bootstrap: not applicable 
##    coeff.   se. t-value p-value    
## TE  11844  2726   4.344 1.4e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
table[4,]= summary(pension.latet)[,1:2]
## Estimation and significance testing of the treatment effect 
## Type: LATET 
## Bootstrap: not applicable 
##    coeff.   se. t-value  p-value    
## TE  15313  3789   4.041 5.32e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
colnames(table)= c('Estimate', 'Std. Error') 
rownames(table)= c('ATE', 'ATET ','LATE', 'LATET') 
tab= xtable(table, digits=c(2, 2,2))