knitr::opts_chunk$set(echo = TRUE)
对高维模型的分析变得越来越重要。高维模型需要估计的参数数量相对于样本量很大。这些模型自然地出现在现代数据集中,这些数据集具有许多可用于单个观察的测量特征,例如人口普查数据、扫描仪数据和文本数据。这样的模型也会很自然地出现,即使是在只有少量测量特征的数据中,当观察到的变量进入模型时,其确切的函数形式是未知的,而我们根据原始特征创建了许多技术变量,如字典。本场景中的示例包括具有非参数干扰函数的半参数模型。更一般地说,当试图对复杂现象建模时,往往会出现与样本量相关的许多参数的模型。
随着这些数据集在经济学和其他数据科学领域的可用性的增加,分析这些数据的新方法已经被开发出来。R包hdm包含了最近开发的用于高维近似稀疏模型的方法的实现,主要依赖于lasso和post-lasso的形式以及相关的估计和推断方法。方法如下图所示计量经济学的应用,但也有用的其他学科,如医学,生物学,社会学或心理学。
这个包中实现的方法与其他包中已经可用的方法有以下四种主要方式:
1)首先,我们提供了一个Lasso回归的版本,该版本明确处理并允许非高斯误差和异方差误差。
2)其次,我们实现了λ的选择套索回归。为了强调这一选择,我们称这个包中的Lasso实现为“strict”Lasso(=rlasso)。函数名中的前缀r应该强调这一点。在高维设置中,交叉验证非常流行;但它缺乏一个理论理由使用在现在的环境下,一些理论建议λ的选择往往是不可行的。
3)第三,对于高维近似稀疏模型中出现的各种低维因果/结构参数,我们提供了有效的估计量和一致有效的置信区间。例如,在高维稀疏回归模型中,我们为目标变量(如处理变量或政策变量)的回归系数提供了有效的估计量和一致有效的置信区间。这里的目标变量指的是不相关的对象,例如预先设定的回归系数。我们还提供了平均处理效果(ATE)和平均治疗效果(ATET)的估计和置信区间,并将这些参数扩展到生性设置
4)第四,基于Belloni、Chernozhukov和Kato(2014)开发的方法和理论,给出了高维近似稀疏模型中估计系数的联合/一致置信区间。他们提出了回归系数在高维稀疏z-估计问题中的一致有效置信域,其中包括中值、均值和许多其他回归问题作为特例。在这篇文章中,我们把这个方法应用到拉索回归的系数中,并通过一个实证例子来强调这个方法。
# install.packages("hdm")
library(tidyverse)
## -- Attaching packages ------------------------------------- tidyverse 1.2.1 --
## √ ggplot2 3.2.1 √ purrr 0.3.3
## √ tibble 2.1.3 √ dplyr 0.8.3
## √ tidyr 1.0.0 √ stringr 1.4.0
## √ readr 1.3.1 √ forcats 0.4.0
## -- Conflicts ---------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(DT)
library(hdm)
函数rlasso实现了Lasso和post-Lasso,其中前缀表示这些是Lasso和post-Lasso的严格版本。默认选项是post-lasso, post=TRUE。这个函数返回一个S3类rlasso对象,其中提供了predict、print、summary等方法。
set.seed(12345)
n = 100 #sample size
p = 100 # number of variables
s = 3 # nubmer of variables with non-zero coefficients
X = matrix(rnorm(n * p), ncol = p)
beta = c(rep(5, s), rep(0, p - s))
Y = X %*% beta + rnorm(n)
lasso.reg = rlasso(Y ~ X, post = FALSE) # use lasso, not-Post-lasso
sum.lasso <- summary(lasso.reg, all = FALSE) # can also do print(lasso.reg, all=FALSE)
##
## Call:
## rlasso.formula(formula = Y ~ X, post = FALSE)
##
## Post-Lasso Estimation: FALSE
##
## Total number of variables: 100
## Number of selected variables: 11
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.09008 -0.45801 -0.01237 0.50291 2.25098
##
## Estimate
## (Intercept) 0.057
## 1 4.771
## 2 4.693
## 3 4.766
## 13 -0.045
## 15 -0.047
## 16 -0.005
## 19 -0.092
## 22 -0.027
## 40 -0.011
## 61 0.114
## 100 -0.025
##
## Residual standard error: 0.8039
## Multiple R-squared: 0.9913
## Adjusted R-squared: 0.9902
## Joint significance test:
## the sup score statistic for joint significance test is 64.02 with a p-value of 0
yhat.lasso = predict(lasso.reg) #in-sample prediction
Xnew = matrix(rnorm(n * p), ncol = p) # new X
Ynew = Xnew %*% beta + rnorm(n) #new Y
yhat.lasso.new = predict(lasso.reg, newdata = Xnew) #out-of-sample prediction
post.lasso.reg = rlasso(Y ~ X, post = TRUE) #now use post-lasso
print(post.lasso.reg, all = FALSE) # or use summary(post.lasso.reg, all=FALSE)
##
## Call:
## rlasso.formula(formula = Y ~ X, post = TRUE)
##
## (Intercept) 1 2 3
## 0.0341 4.9241 4.8579 4.9644
summary(post.lasso.reg,all = FALSE)
##
## Call:
## rlasso.formula(formula = Y ~ X, post = TRUE)
##
## Post-Lasso Estimation: TRUE
##
## Total number of variables: 100
## Number of selected variables: 3
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.94738 -0.50541 -0.01237 0.51283 2.11575
##
## Estimate
## (Intercept) 0.034
## 1 4.924
## 2 4.858
## 3 4.964
##
## Residual standard error: 0.833
## Multiple R-squared: 0.9906
## Adjusted R-squared: 0.9903
## Joint significance test:
## the sup score statistic for joint significance test is 64.02 with a p-value of 0
在上面的示例中,总体显著性的supscore统计值为64.02,pvalue为0。这意味着在水平α= 0:05零假设被拒绝和模型具有解释力。
yhat.postlasso = predict(post.lasso.reg) #in-sample prediction
yhat.postlasso.new = predict(post.lasso.reg, newdata = Xnew) #out-of-sample p
MAE <- apply(cbind(abs(Ynew - yhat.lasso.new), abs(Ynew - yhat.postlasso.new)), 2, mean)
names(MAE) <- c("lasso MAE", "Post-lasso MAE")
print(MAE, digits = 2) # MAE for Lasso and Post-Lasso
## lasso MAE Post-lasso MAE
## 0.91 0.79
set.seed(1)
n = 5000
p = 20
X = matrix(rnorm(n * p), ncol = p)
colnames(X) <- c("d",paste("X",1:19,sep = ""))
X %>% datatable()
xnames = colnames(X)[-1]
beta = rep(1, 20)
y = X %*% beta + rnorm(n) # 构造的y
dat = data.frame(y = y, X)
We can estimate α0 by running full least squares:
# full fit
fmla = as.formula(paste("y ~ ", paste(colnames(X), collapse = "+")))
full.fit = lm(fmla, data = dat)
summary(full.fit)$coef["d", 1:2]
## Estimate Std. Error
## 0.97807455 0.01371225
Another way to estimate α0 is to first partial out the x-variables from yi and di, and run least squares on the residuals:
fmla.y = as.formula(paste("y ~ ", paste(xnames, collapse = "+")))
fmla.d = as.formula(paste("d ~ ", paste(xnames, collapse = "+")))
# partial fit via ols
rY = lm(fmla.y, data = dat)$res
rD = lm(fmla.d, data = dat)$res
partial.fit.ls = lm(rY ~ rD)
summary(partial.fit.ls)$coef["rD", 1:2]
## Estimate Std. Error
## 0.97807455 0.01368616
可以看到,估计值是相同的,而标准误差几乎是相同的。事实上,由于与分拆方法相关的估计方程的正交性,标准误差是渐近等价的。
# partial fit via post-lasso
rY = rlasso(fmla.y, data = dat)$res
rD = rlasso(fmla.d, data = dat)$res
partial.fit.postlasso = lm(rY ~ rD)
summary(partial.fit.postlasso)$coef["rD", 1:2]
## Estimate Std. Error
## 0.97273870 0.01368677
我们可以看到,这个估计值和标准误差几乎与前面给出的估计值相同。事实上,它们在低维环境中渐进地与先前的估计等价,其优点是,与先前的估计不同,它们在高维环境中仍然有效。
基于Lasso或post-Lasso的分出的正交估计方程方法由函数rlassoEffect实现,使用method= “partialling out”:
Eff = rlassoEffect(X[, -1], y, X[, 1], method = "partialling out")
summary(Eff)$coef[, 1:2]
## Estimate. Std. Error
## 0.97273870 0.01368677
summary(Eff)
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## [1,] 0.97274 0.01369 71.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
另一种正交估计方程方法-基于协变量的双重选择,由函数rlassoEffect实现,使用method= “double selection”。算法上的方法如下:
Eff = rlassoEffect(X[, -1], y, X[, 1], method = "double selection")
summary(Eff)$coef[, 1:2]
## Estimate. Std. Error
## 0.97807455 0.01415624
summary(Eff)
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## d1 0.97807 0.01416 69.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Estimate. Std. Error: 0.97807455 0.01415624,与method= “partialling out”不同
rlassoEffects是函数rlassoEffect的一个包装函数,它对单个目标回归变量进行推断。在Belloni、Chernozhukov和Hansen(2014)中给出了理论基础,在Belloni、Chernozhukov和Kato(2014)中给出了更一般的一类模型。函数rlassoEffects可以以rlassoEffects(x, y,index)的形 式使用,其中x是一个矩阵,y表示结果变量,index指定进行推理的x的变量。这可以通过一个整数向量(变量的位置)、一个逻辑向量或变量的名称来实现。
另一种用法是rlassoEffects(formula, data,I),其中I是一个单边公式,它指定为其进行推断的变量。有关进一步的详细信息,请 参阅函数的帮助页面和下面的示例,其中显示了这两种方法的用法.
set.seed(1)
n = 100 #sample size
p = 100 # number of variables
s = 3 # nubmer of non-zero variables
X = matrix(rnorm(n * p), ncol = p)
colnames(X) <- paste("X", 1:p, sep = "")
colnames(X)
## [1] "X1" "X2" "X3" "X4" "X5" "X6" "X7" "X8" "X9" "X10"
## [11] "X11" "X12" "X13" "X14" "X15" "X16" "X17" "X18" "X19" "X20"
## [21] "X21" "X22" "X23" "X24" "X25" "X26" "X27" "X28" "X29" "X30"
## [31] "X31" "X32" "X33" "X34" "X35" "X36" "X37" "X38" "X39" "X40"
## [41] "X41" "X42" "X43" "X44" "X45" "X46" "X47" "X48" "X49" "X50"
## [51] "X51" "X52" "X53" "X54" "X55" "X56" "X57" "X58" "X59" "X60"
## [61] "X61" "X62" "X63" "X64" "X65" "X66" "X67" "X68" "X69" "X70"
## [71] "X71" "X72" "X73" "X74" "X75" "X76" "X77" "X78" "X79" "X80"
## [81] "X81" "X82" "X83" "X84" "X85" "X86" "X87" "X88" "X89" "X90"
## [91] "X91" "X92" "X93" "X94" "X95" "X96" "X97" "X98" "X99" "X100"
beta = c(rep(3, s), rep(0, p - s))
y = 1 + X %*% beta + rnorm(n) # 1可以看作截距项
data = data.frame(cbind(y, X))
colnames(data)[1] <- "y"
fm = paste("y ~", paste(colnames(X), collapse = "+"))
fm = as.formula(fm)
fm
## y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 + X11 +
## X12 + X13 + X14 + X15 + X16 + X17 + X18 + X19 + X20 + X21 +
## X22 + X23 + X24 + X25 + X26 + X27 + X28 + X29 + X30 + X31 +
## X32 + X33 + X34 + X35 + X36 + X37 + X38 + X39 + X40 + X41 +
## X42 + X43 + X44 + X45 + X46 + X47 + X48 + X49 + X50 + X51 +
## X52 + X53 + X54 + X55 + X56 + X57 + X58 + X59 + X60 + X61 +
## X62 + X63 + X64 + X65 + X66 + X67 + X68 + X69 + X70 + X71 +
## X72 + X73 + X74 + X75 + X76 + X77 + X78 + X79 + X80 + X81 +
## X82 + X83 + X84 + X85 + X86 + X87 + X88 + X89 + X90 + X91 +
## X92 + X93 + X94 + X95 + X96 + X97 + X98 + X99 + X100
我们可以对一组感兴趣的变量进行推理,例如第一、第二、第三和第五十:
lasso.effect = rlassoEffects(fm, I = ~X1 + X2 + X3 + X50, data = data)
print(lasso.effect)
##
## Call:
## rlassoEffects.formula(formula = fm, data = data, I = ~X1 + X2 +
## X3 + X50)
##
## Coefficients:
## X1 X2 X3 X50
## 2.94448 3.04127 2.97540 0.07196
summary(lasso.effect)
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## X1 2.94448 0.08815 33.404 <2e-16 ***
## X2 3.04127 0.08389 36.253 <2e-16 ***
## X3 2.97540 0.07804 38.127 <2e-16 ***
## X50 0.07196 0.07765 0.927 0.354
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(lasso.effect)
## 2.5 % 97.5 %
## X1 2.77171308 3.1172421
## X2 2.87685121 3.2056979
## X3 2.82244962 3.1283583
## X50 -0.08022708 0.2241377
若不指示推断变量I,观察结果如何,会报错!!!
联合置信区间
confint(lasso.effect, level = 0.95, joint = TRUE)
## 2.5 % 97.5 %
## X1 2.7279477 3.1610075
## X2 2.8371214 3.2454278
## X3 2.7833176 3.1674903
## X50 -0.1154509 0.2593615
plot(lasso.effect, main = "Confidence Intervals")
## Warning: Ignoring unknown aesthetics: width, h
在劳动经济学中,一个重要的问题是工资如何与雇员的性别相关。我们利用2012年美国人口普查数据,联合分析了性别和其他变量与性别的交互作用对工资的影响。因变量为工资的对数,目标变量为女性(结合其他变量)。所有其他变量表示其他一些社会经济特征,如婚姻状况、教育和经历。有关变量的详细描述,请参阅帮助页面。首先,我们加载并准备数据:
library(hdm)
data(cps2012)
dim(cps2012)
## [1] 29217 23
cps2012 %>% datatable()
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
X <- model.matrix(~-1 + female + female:(widowed + divorced + separated + nevermarried +
hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3) + +(widowed +
divorced + separated + nevermarried + hsd08 + hsd911 + hsg + cg + ad + mw + so +
we + exp1 + exp2 + exp3)^2, data = cps2012)
dim(X)
## [1] 29217 136
X <- X[, which(apply(X, 2, var) != 0)] # exclude all constant variables
dim(X)
## [1] 29217 116
index.gender <- grep("female", colnames(X))
y <- cps2012$lnw
目标参数,即所有与性别相关的系数(即与其他变量的交互作用)的参数估计由以下命令计算和汇总
effects.female <- rlassoEffects(x = X, y = y, index = index.gender)
summary(effects.female)
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## female -0.154923 0.050162 -3.088 0.002012 **
## female:widowed 0.136095 0.090663 1.501 0.133325
## female:divorced 0.136939 0.022182 6.174 6.68e-10 ***
## female:separated 0.023303 0.053212 0.438 0.661441
## female:nevermarried 0.186853 0.019942 9.370 < 2e-16 ***
## female:hsd08 0.027810 0.120914 0.230 0.818092
## female:hsd911 -0.119335 0.051880 -2.300 0.021435 *
## female:hsg -0.012890 0.019223 -0.671 0.502518
## female:cg 0.010139 0.018327 0.553 0.580114
## female:ad -0.030464 0.021806 -1.397 0.162405
## female:mw -0.001063 0.019192 -0.055 0.955811
## female:so -0.008183 0.019357 -0.423 0.672468
## female:we -0.004226 0.021168 -0.200 0.841760
## female:exp1 0.004935 0.007804 0.632 0.527139
## female:exp2 -0.159519 0.045300 -3.521 0.000429 ***
## female:exp3 0.038451 0.007861 4.891 1.00e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
最后,我们估计和绘制置信区间,首先“点态化”,然后联合置信区间。
joint.CI <- confint(effects.female, level = 0.95, joint = TRUE)
joint.CI
## 2.5 % 97.5 %
## female -0.29422452 -0.01562204
## female:widowed -0.13367117 0.40586213
## female:divorced 0.07479695 0.19908182
## female:separated -0.11671664 0.16332216
## female:nevermarried 0.12925782 0.24444915
## female:hsd08 -0.37450228 0.43012291
## female:hsd911 -0.26902488 0.03035480
## female:hsg -0.06513949 0.03935993
## female:cg -0.04168175 0.06195886
## female:ad -0.09583693 0.03490944
## female:mw -0.05470794 0.05258106
## female:so -0.06272074 0.04635406
## female:we -0.06562874 0.05717648
## female:exp1 -0.01649540 0.02636592
## female:exp2 -0.28414464 -0.03489402
## female:exp3 0.01684777 0.06005338
plot(effects.female, joint=TRUE, level=0.95) # plot of the effects
## Warning: Ignoring unknown aesthetics: width, h
这一分析使我们更深入地了解性别歧视与其他社会经济变量之间的关系。
部分实证增长文献侧重于估计初始(滞后)人均GDP(国内生产总值)水平对人均GDP增长率的影响。
特别是,经典的索洛-斯旺-拉姆齐增长模型的一个关键预测是趋同假说,该假说认为,在一系列制度和社会特征的条件下,较贫穷国家通常增长更快,因此往往会赶上较富裕国家。描述这些特征的协变量包括衡量教育和科学政策、市场机构的实力、贸易开放程度、储蓄率等的变量。
library(knitr)
knitr::include_graphics("hdm.png")
data(GrowthData)
dim(GrowthData)
## [1] 90 63
colnames(GrowthData)
## [1] "Outcome" "intercept" "gdpsh465" "bmp1l" "freeop"
## [6] "freetar" "h65" "hm65" "hf65" "p65"
## [11] "pm65" "pf65" "s65" "sm65" "sf65"
## [16] "fert65" "mort65" "lifee065" "gpop1" "fert1"
## [21] "mort1" "invsh41" "geetot1" "geerec1" "gde1"
## [26] "govwb1" "govsh41" "gvxdxe41" "high65" "highm65"
## [31] "highf65" "highc65" "highcm65" "highcf65" "human65"
## [36] "humanm65" "humanf65" "hyr65" "hyrm65" "hyrf65"
## [41] "no65" "nom65" "nof65" "pinstab1" "pop65"
## [46] "worker65" "pop1565" "pop6565" "sec65" "secm65"
## [51] "secf65" "secc65" "seccm65" "seccf65" "syr65"
## [56] "syrm65" "syrf65" "teapri65" "teasec65" "ex1"
## [61] "im1" "xr65" "tot1"
y = GrowthData[, 1, drop = F]
d = GrowthData[, 3, drop = F]
X = as.matrix(GrowthData)[, -c(1, 2, 3)]
varnames = colnames(GrowthData)
varnames %>% class()
## [1] "character"
首先使用线性模型估计效应
xnames = varnames[-c(1, 2, 3)] # names of X variables
dandxnames = varnames[-c(1, 2)] # names of D and X variables
# create formulas by pasting names (this saves typing times)
fmla = as.formula(paste("Outcome ~ ", paste(dandxnames, collapse = "+")))
ls.effect = lm(fmla, data = GrowthData)
summary(ls.effect)
##
## Call:
## lm(formula = fmla, data = GrowthData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.040338 -0.011298 -0.000863 0.011813 0.043247
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.472e-01 7.845e-01 0.315 0.75506
## gdpsh465 -9.378e-03 2.989e-02 -0.314 0.75602
## bmp1l -6.886e-02 3.253e-02 -2.117 0.04329 *
## freeop 8.007e-02 2.079e-01 0.385 0.70300
## freetar -4.890e-01 4.182e-01 -1.169 0.25214
## h65 -2.362e+00 8.573e-01 -2.755 0.01019 *
## hm65 7.071e-01 5.231e-01 1.352 0.18729
## hf65 1.693e+00 5.032e-01 3.365 0.00223 **
## p65 2.655e-01 1.643e-01 1.616 0.11727
## pm65 1.370e-01 1.512e-01 0.906 0.37284
## pf65 -3.313e-01 1.651e-01 -2.006 0.05458 .
## s65 3.908e-02 1.855e-01 0.211 0.83469
## sm65 -3.067e-02 1.168e-01 -0.263 0.79479
## sf65 -1.799e-01 1.181e-01 -1.523 0.13886
## fert65 6.881e-03 2.705e-02 0.254 0.80108
## mort65 -2.335e-01 8.174e-01 -0.286 0.77729
## lifee065 -1.491e-02 1.933e-01 -0.077 0.93906
## gpop1 9.702e-01 1.812e+00 0.535 0.59663
## fert1 8.838e-03 3.504e-02 0.252 0.80271
## mort1 6.656e-02 6.848e-01 0.097 0.92326
## invsh41 7.446e-02 1.084e-01 0.687 0.49797
## geetot1 -7.151e-01 1.680e+00 -0.426 0.67364
## geerec1 6.300e-01 2.447e+00 0.257 0.79874
## gde1 -4.436e-01 1.671e+00 -0.265 0.79263
## govwb1 3.375e-01 4.380e-01 0.770 0.44748
## govsh41 4.632e-01 1.925e+00 0.241 0.81165
## gvxdxe41 -7.934e-01 2.059e+00 -0.385 0.70296
## high65 -7.525e-01 9.057e-01 -0.831 0.41311
## highm65 -3.903e-01 6.812e-01 -0.573 0.57131
## highf65 -4.177e-01 5.615e-01 -0.744 0.46308
## highc65 -2.216e+00 1.481e+00 -1.496 0.14575
## highcm65 2.797e-01 6.582e-01 0.425 0.67412
## highcf65 3.921e-01 7.660e-01 0.512 0.61278
## human65 2.337e+00 3.307e+00 0.707 0.48559
## humanm65 -1.209e+00 1.619e+00 -0.747 0.46121
## humanf65 -1.104e+00 1.685e+00 -0.655 0.51763
## hyr65 5.491e+01 2.389e+01 2.299 0.02918 *
## hyrm65 1.294e+01 2.317e+01 0.558 0.58112
## hyrf65 9.093e+00 1.767e+01 0.515 0.61088
## no65 3.721e-02 1.320e-01 0.282 0.78006
## nom65 -2.120e-02 6.496e-02 -0.326 0.74661
## nof65 -1.686e-02 6.700e-02 -0.252 0.80319
## pinstab1 -4.997e-02 3.092e-02 -1.616 0.11729
## pop65 1.032e-07 1.318e-07 0.783 0.44027
## worker65 3.408e-02 1.562e-01 0.218 0.82887
## pop1565 -4.655e-01 4.713e-01 -0.988 0.33176
## pop6565 -1.357e+00 6.349e-01 -2.138 0.04139 *
## sec65 -1.089e-02 3.077e-01 -0.035 0.97201
## secm65 3.344e-03 1.512e-01 0.022 0.98251
## secf65 -2.304e-03 1.580e-01 -0.015 0.98847
## secc65 -4.915e-01 7.290e-01 -0.674 0.50570
## seccm65 2.596e-01 3.557e-01 0.730 0.47150
## seccf65 2.207e-01 3.733e-01 0.591 0.55924
## syr65 -7.556e-01 7.977e+00 -0.095 0.92521
## syrm65 3.109e-01 3.897e+00 0.080 0.93698
## syrf65 7.593e-01 4.111e+00 0.185 0.85479
## teapri65 3.955e-05 7.700e-04 0.051 0.95941
## teasec65 2.497e-04 1.171e-03 0.213 0.83274
## ex1 -5.804e-01 2.418e-01 -2.400 0.02329 *
## im1 5.914e-01 2.503e-01 2.363 0.02531 *
## xr65 -1.038e-04 5.417e-05 -1.916 0.06565 .
## tot1 -1.279e-01 1.126e-01 -1.136 0.26561
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03074 on 28 degrees of freedom
## Multiple R-squared: 0.8871, Adjusted R-squared: 0.6411
## F-statistic: 3.607 on 61 and 28 DF, p-value: 0.0002003
其次,我们通过post-lasso的partialling out来估计效果:
dX = as.matrix(cbind(d, X))
lasso.effect = rlassoEffect(x = X, y = y, d = d, method = "partialling out")
summary(lasso.effect)
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## [1,] -0.04981 0.01394 -3.574 0.000351 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
第三,我们用双重选择的方法来估计效果:
dX = as.matrix(cbind(d, X))
doublesel.effect = rlassoEffect(x = X, y = y, d = d, method = "double selection")
summary(doublesel.effect)
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## gdpsh465 -0.05001 0.01579 -3.167 0.00154 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(xtable)
table = rbind(summary(ls.effect)$coef["gdpsh465", 1:2],
summary(lasso.effect)$coef[,1:2],
summary(doublesel.effect)$coef[, 1:2])
colnames(table) = c("Estimate", "Std. Error") #names(summary(full.fit)$coef)[1:2]
rownames(table) = c("full reg via ols", "partial regvia post-lasso ", "partial reg via double selection")
tab = xtable(table, digits = c(2, 2, 5))
tab
## % latex table generated in R 3.6.1 by xtable 1.8-4 package
## % Sat Nov 02 16:06:03 2019
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrr}
## \hline
## & Estimate & Std. Error \\
## \hline
## full reg via ols & -0.01 & 0.02989 \\
## partial regvia post-lasso & -0.05 & 0.01394 \\
## partial reg via double selection & -0.05 & 0.01579 \\
## \hline
## \end{tabular}
## \end{table}
我们看到OLS估计是有噪声的,这并不奇怪,因为p与n是可比较的。偏回归方法,基于两个投影方程中协变量的Lasso选择,相比之下,得到更精确的估计,这支持了条件收敛的假设。我们注意到偏回归方法是正交估计方程方法的一个特例。
在许多应用环境中,研究者对估计变量的(结构)效应感兴趣(处理变量),但该变量是内生性的,即与误差项相关。在这种情况下,工具变量(IV)方法被用来识别因果参数。
与formula一起使用的一般模式是function(formula, data,…),其中formula由两部分组成。这些公式是模式y d + x | x + z,其中y为结果变量, x为外生变量,d为内生变量(如果允许多个变量取决于具体函数),z为工具变量。函数的一种更原始的用法是将所需的一组变量简单地转换为矩阵:function(x=x, d=d, y=y, z=z)。在下面的一些例子中,显示了这两种选择。
由于制度与产出之间的同时性,估计制度对产出的因果效应就变得复杂了:具体来说,更好的制度可能导致更高的收入,但更高的收入也可能导致更好的制度的发展。为了帮助克服这种同时性,Acemoglu, Johnson和Robinson(2001)使用早期欧洲定居者的死亡率作为制度质量的工具。
定居者在他们更可能建立长期定居点的地方建立更好的制度,他们更可能长期定居的地方与定居者在最初殖民化时的死亡率有关。 工具变量的外生性的动机是,GDP虽然持续存在,但不太可能受到上个世纪或更早时期死亡率的强烈影响,除非通过制度。
data(AJR)
AJR %>% dim
## [1] 64 11
AJR %>% datatable()
y = AJR$GDP # 结果变量
d = AJR$Exprop # 内生变量
z = AJR$logMort # 工具变量
x = model.matrix(~-1 + (Latitude + Latitude2 + Africa + Asia + Namer + Samer)^2,data = AJR) # 控制变量
# AJR.Xselect = rlassoIV(x=x, d=d, y=y, z=z, select.X=TRUE, select.Z=FALSE)
AJR.Xselect = rlassoIV(GDP ~ Exprop + (Latitude + Latitude2 + Africa + Asia + Namer +Samer)^2 | logMort + (Latitude + Latitude2 + Africa + Asia + Namer + Samer)^2,data = AJR, select.X = TRUE, select.Z = FALSE)
summary(AJR.Xselect)
## [1] "Estimation and significance testing of the effect of target variables in the IV regression model"
## coeff. se. t-value p-value
## Exprop 0.8450 0.2699 3.131 0.00174 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(AJR.Xselect)
## 2.5 % 97.5 %
## Exprop 0.3159812 1.374072
理解上面的过程是很有趣的。本质上,它使用post-Lasso将xi从yi、di和zi中分离出来,并将2SLS应用于剩余量。
让我们在这个例子中更详细地研究一下分区。我们可以先试着用OLS进行分拆:
# parialling out by linear model
fmla.y = GDP ~ (Latitude + Latitude2 + Africa + Asia + Namer + Samer)^2
fmla.d = Exprop ~ (Latitude + Latitude2 + Africa + Asia + Namer + Samer)^2
fmla.z = logMort ~ (Latitude + Latitude2 + Africa + Asia + Namer + Samer)^2
rY = lm(fmla.y, data = AJR)$res
rD = lm(fmla.d, data = AJR)$res
rZ = lm(fmla.z, data = AJR)$res
# ivfit.lm = tsls(y=rY,d=rD, x=NULL, z=rZ, intercept=FALSE)
ivfit.lm = tsls(rY ~ rD | rZ, intercept = FALSE)
print(cbind(ivfit.lm$coef, ivfit.lm$se), digits = 7)
## [,1] [,2]
## rD 1.267209 1.730537
看到估计值有很大的标准误差。不精确是可以预料的。接下来,我们将OLS操作符替换为post-Lasso来进行划分
# parialling out by lasso
rY = rlasso(fmla.y, data = AJR)$res
rD = rlasso(fmla.d, data = AJR)$res
rZ = rlasso(fmla.z, data = AJR)$res
# ivfit.lasso = tsls(y=rY,d=rD, x=NULL, z=rZ, intercept=FALSE)
ivfit.lasso = tsls(rY ~ rD | rZ, intercept = FALSE)
print(cbind(ivfit.lasso$coef, ivfit.lasso$se), digits = 5)
## [,1] [,2]
## rD 0.84503 0.26993
在这里,我们调查了在国家征用权(政府征用私人财产)案件中支持原告的决定对经济结果的影响。司法判决与潜在经济后果之间可能存在的内生性,使此类判决的效果分析变得复杂。为了解决潜在的内生性问题,我们采用了一种工具变量策略,该策略基于随机分配法官到做出决定的联邦上诉委员会。
因为法官是随机分配给三个法官小组的,所以法官的确切身份和他们的人口统计数据是随机分配的,取决于联邦巡回法院法官在一个特定的巡回年的特征分布。在这种随机分配下,联邦上诉委员会法官的特征只能通过法官的决定与房地产价格相关;因此,法官的特征很可能满足工具变量排除限制。
data(EminentDomain)
z <- as.matrix(EminentDomain$logGDP$z)
x <- as.matrix(EminentDomain$logGDP$x)
y <- EminentDomain$logGDP$y
d <- EminentDomain$logGDP$d
x <- x[, apply(x, 2, mean, na.rm = TRUE) > 0.05]
z <- z[, apply(z, 2, mean, na.rm = TRUE) > 0.05]
首先,我们使用简单的OLS和2SLS两种工具来估计处理变量的效果:
ED.ols = lm(y ~ cbind(d, x))
ED.2sls = tsls(y = y, d = d, x = x, z = z[, 1:2], intercept = FALSE)
接下来,we estimate the model with selection on the instruments
lasso.IV.Z = rlassoIV(x = x, d = d, y = y, z = z, select.X = FALSE, select.Z = TRUE)
# or lasso.IV.Z = rlassoIVselectZt(x=X, d=d, y=y, z=z)
summary(lasso.IV.Z)
## [1] "Estimates and significance testing of the effect of target variables in the IV regression model"
## coeff. se. t-value p-value
## d1 0.4146 0.2902 1.428 0.153
confint(lasso.IV.Z)
## 2.5 % 97.5 %
## d1 -0.1542764 0.9834796
最后,我们对x和z变量进行选择
lasso.IV.XZ = rlassoIV(x = x, d = d, y = y, z = z, select.X = TRUE, select.Z = TRUE)
summary(lasso.IV.XZ)
## Estimates and Significance Testing of the effect of target variables in the IV regression model
## coeff. se. t-value p-value
## d1 -0.02383 0.12851 -0.185 0.853
confint(lasso.IV.XZ)
## 2.5 % 97.5 %
## d1 -0.2757029 0.2280335
通过比较我们看到的结果,OLS估计表明,联邦巡回法院支持原告上诉的判决的影响是显著正的,但是考虑潜在内生性的2SLS估计使结果不显著。对仪器进行选择也会得到类似的结果。对x和z变量进行选择会导致非常不精确的估计。最后,我们比较了所有的结果。
library(xtable)
table = matrix(0, 4, 2)
table[1, ] = summary(ED.ols)$coef[2, 1:2]
table[2, ] = cbind(ED.2sls$coef[1], ED.2sls$se[1])
table[3, ] = summary(lasso.IV.Z)[, 1:2]
## [1] "Estimates and significance testing of the effect of target variables in the IV regression model"
## coeff. se. t-value p-value
## d1 0.4146 0.2902 1.428 0.153
table[4, ] = summary(lasso.IV.XZ)[, 1:2]
## Estimates and Significance Testing of the effect of target variables in the IV regression model
## coeff. se. t-value p-value
## d1 -0.02383 0.12851 -0.185 0.853
colnames(table) = c("Estimate", "Std. Error")
rownames(table) = c("ols regression", "IV estimation ", "selection on Z", "selection on X and Z")
tab = xtable(table, digits = c(2, 2, 7))
tab
## % latex table generated in R 3.6.1 by xtable 1.8-4 package
## % Sat Nov 02 16:06:06 2019
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrr}
## \hline
## & Estimate & Std. Error \\
## \hline
## ols regression & 0.01 & 0.0098659 \\
## IV estimation & -0.01 & 0.0337664 \\
## selection on Z & 0.41 & 0.2902492 \\
## selection on X and Z & -0.02 & 0.1285065 \\
## \hline
## \end{tabular}
## \end{table}
knitr::include_graphics("hdm2.png")
虽然401(k)计划被广泛用作退休储蓄的工具是显而易见的,但其对资产的影响却不那么明显。确定参加401(k)计划对积累资产影响的关键问题是储蓄者的异质性和对参与状态的非随机选择。
data(pension)
y = pension$tw
d = pension$p401
z = pension$e401
X = pension[, c("i2", "i3", "i4", "i5", "i6", "i7", "a2", "a3", "a4", "a5", "fsize",
"hs", "smcol", "col", "marr", "twoearn", "db", "pira", "hown")] # simple model
xvar = c("i2", "i3", "i4", "i5", "i6", "i7", "a2", "a3", "a4", "a5", "fsize", "hs",
"smcol", "col", "marr", "twoearn", "db", "pira", "hown")
xpart = paste(xvar, collapse = "+")
pension.ate = rlassoATE(X,d,y)
summary(pension.ate)
## Estimation and significance testing of the treatment effect
## Type: ATE
## Bootstrap: not applicable
## coeff. se. t-value p-value
## TE 10180 1931 5.273 1.34e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pension.atet = rlassoATET(X,d,y)
summary(pension.atet)
## Estimation and significance testing of the treatment effect
## Type: ATET
## Bootstrap: not applicable
## coeff. se. t-value p-value
## TE 12628 2944 4.289 1.8e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For LATE and LATET we estimate the effect of 401(k) participation (d) with plan eligibility (z) as instrument.
pension.late = rlassoLATE(X, d, y, z, always_takers = FALSE)
summary(pension.late)
## Estimation and significance testing of the treatment effect
## Type: LATE
## Bootstrap: not applicable
## coeff. se. t-value p-value
## TE 12250 2745 4.463 8.1e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pension.latet = rlassoLATET(X, d, y, z, always_takers = FALSE)
summary(pension.latet)
## Estimation and significance testing of the treatment effect
## Type: LATET
## Bootstrap: not applicable
## coeff. se. t-value p-value
## TE 15323 3645 4.204 2.63e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
最后,我们估计一个包含所有交互效应的模型:
# generate all interactions of X's
xvar2 = paste("(", paste(xvar, collapse = "+"), ")^2", sep = "")
forminteract = formula(paste("tw ~", xvar2, " + p401", "|", xvar2, sep = ""))
forminteract
## tw ~ (i2 + i3 + i4 + i5 + i6 + i7 + a2 + a3 + a4 + a5 + fsize +
## hs + smcol + col + marr + twoearn + db + pira + hown)^2 +
## p401 | (i2 + i3 + i4 + i5 + i6 + i7 + a2 + a3 + a4 + a5 +
## fsize + hs + smcol + col + marr + twoearn + db + pira + hown)^2
formZinteract = formula(paste("tw ~", xvar2, " + p401", "|", xvar2, " + e401", sep = ""))
pension.ate= rlassoATE(forminteract, data = pension); summary(pension.ate)
## Estimation and significance testing of the treatment effect
## Type: ATE
## Bootstrap: not applicable
## coeff. se. t-value p-value
## TE 10420 1901 5.482 4.2e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pension.atet= rlassoATET(forminteract, data = pension); summary(pension.atet)
## Estimation and significance testing of the treatment effect
## Type: ATET
## Bootstrap: not applicable
## coeff. se. t-value p-value
## TE 12515 2901 4.313 1.61e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pension.late= rlassoLATE(formZinteract, data = pension, always_takers = FALSE);summary(pension.late)
## Estimation and significance testing of the treatment effect
## Type: LATE
## Bootstrap: not applicable
## coeff. se. t-value p-value
## TE 11844 2726 4.344 1.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pension.latet= rlassoLATET(formZinteract, data = pension,always_takers = FALSE); summary(pension.latet)
## Estimation and significance testing of the treatment effect
## Type: LATET
## Bootstrap: not applicable
## coeff. se. t-value p-value
## TE 15313 3789 4.041 5.32e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
table= matrix(0, 4, 2)
table[1,]=summary(pension.ate)[,1:2]
## Estimation and significance testing of the treatment effect
## Type: ATE
## Bootstrap: not applicable
## coeff. se. t-value p-value
## TE 10420 1901 5.482 4.2e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
table[2,]= summary(pension.atet)[,1:2]
## Estimation and significance testing of the treatment effect
## Type: ATET
## Bootstrap: not applicable
## coeff. se. t-value p-value
## TE 12515 2901 4.313 1.61e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
table[3,]=summary(pension.late)[,1:2]
## Estimation and significance testing of the treatment effect
## Type: LATE
## Bootstrap: not applicable
## coeff. se. t-value p-value
## TE 11844 2726 4.344 1.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
table[4,]= summary(pension.latet)[,1:2]
## Estimation and significance testing of the treatment effect
## Type: LATET
## Bootstrap: not applicable
## coeff. se. t-value p-value
## TE 15313 3789 4.041 5.32e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
colnames(table)= c('Estimate', 'Std. Error')
rownames(table)= c('ATE', 'ATET ','LATE', 'LATET')
tab= xtable(table, digits=c(2, 2,2))