library(AER)
library(tidyverse)
library(estimatr)
data("CigarettesSW")
<- CigarettesSW |>
d as_tibble() |>
mutate(rprice = price / cpi, # 价格
salestax = (taxs - tax) / cpi, # 消费税
cigtax = tax/cpi, # 香烟税
rincome = income / population / cpi) |> # 收入
filter(year == "1995") |>
select(packs, rprice, salestax, rincome, cigtax)
工具变量
参考资料:https://www.econometrics-with-r.org/12.1-TIVEWASRAASI.html
案例背景
香烟的价格\(P\)与销量之间\(Q\)存在怎样的关系?通过执行式1的回归,我们期望得到香烟的价格弹性,但是由于供需之间的双向因果,这一目标是无法实现的
\[ \log(Q) = \beta_0 + \beta_1\log(P) + u \tag{1} \]
因此,我们使用每包烟的消费税\(SaleTax\)作为工具变量。它需要满足两个条件:
- 会影响税后每包烟的价格\(P\)。这一条显然满足。
- 是外生的,因为销售税并不直接影响销售量,而是通过价格间接与销售量相关。这一条可能不满足。例如,像州平均收入\(Income\)这样的变量,既会影响销量\(Q\),又和税费水平\(SaleTax\)相关。所以实操时,该变量需要被控制。
通过下面的代码加载数据:
Two-Stage Least Squares
两步法进行回归。
第一步
用内生解释变量对所有工具变量(这里是rprice
)及所有外生变量(控制变量,即rincome
)做回归,并提取拟合值。
<- lm(log(rprice) ~ salestax + log(rincome), data = d)
fit1 <- fit1$fitted.values lprice_pred
在做第一步时,需要关注两点,一是工具变量的系数必须显著,否则就不满足作为工具变量的条件;二是模型解释能力不能太差,否则会出现弱工具变量的问题。
summary(fit1)
Call:
lm(formula = log(rprice) ~ salestax + log(rincome), data = d)
Residuals:
Min 1Q Median 3Q Max
-0.163799 -0.033049 0.001907 0.049322 0.185542
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.590811 0.225558 15.920 < 2e-16 ***
salestax 0.027395 0.004077 6.720 2.65e-08 ***
log(rincome) 0.389283 0.085104 4.574 3.74e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.07848 on 45 degrees of freedom
Multiple R-squared: 0.6389, Adjusted R-squared: 0.6228
F-statistic: 39.81 on 2 and 45 DF, p-value: 1.114e-10
第二步
用因变量对所有拟合值(这里是上一步得到的lprice_pred
)及所有外生变量(控制变量,即rincome
)做回归。
<- lm(log(packs) ~ lprice_pred + log(rincome), data = d)
fit2 summary(fit2)
Call:
lm(formula = log(packs) ~ lprice_pred + log(rincome), data = d)
Residuals:
Min 1Q Median 3Q Max
-0.67048 -0.13306 0.00598 0.13361 0.58044
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.4307 1.6247 5.805 6.09e-07 ***
lprice_pred -1.1434 0.4300 -2.659 0.0108 *
log(rincome) 0.2145 0.3212 0.668 0.5077
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2267 on 45 degrees of freedom
Multiple R-squared: 0.1687, Adjusted R-squared: 0.1318
F-statistic: 4.567 on 2 and 45 DF, p-value: 0.01564
也就是说:
\[ \log(Q) = 9.43 - 1.14 \times \log(P) + 0.21 \times \log(rincome) \]
使用集成命令
在实操中,不要使用Two-Stage Least Squares (TSLS)来汇报最终结果,因为标准误的估计会有问题。这里可以使用estimatr
包中的集成命令:
<- iv_robust(log(packs) ~ log(rprice) + log(rincome) | log(rincome) + salestax,
fit3 data = d,
se_type = "HC1")
fit3
Estimate Std. Error t value Pr(>|t|) CI Lower
(Intercept) 9.4306583 1.2593926 7.4882595 1.934709e-09 6.8941115
log(rprice) -1.1433751 0.3723027 -3.0710902 3.611302e-03 -1.8932312
log(rincome) 0.2145153 0.3117469 0.6881071 4.949173e-01 -0.4133752
CI Upper DF
(Intercept) 11.9672051 45
log(rprice) -0.3935190 45
log(rincome) 0.8424058 45
过度识别
过度识别指工具变量的数目大于内生变量数目。例如,在此案例中,我们加入香烟税cigtax
作为第二个工具变量。
TSLS估计
用二阶段回归得到点估计
<- lm(log(rprice) ~ salestax + cigtax + log(rincome), data = d)
fit1 <- fit1$fitted.values
lprice_pred <- lm(log(packs) ~ lprice_pred + log(rincome), data = d)
fit2 summary(fit2)
Call:
lm(formula = log(packs) ~ lprice_pred + log(rincome), data = d)
Residuals:
Min 1Q Median 3Q Max
-0.63899 -0.09544 0.00126 0.11769 0.42479
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.8950 1.1413 8.670 3.72e-11 ***
lprice_pred -1.2774 0.2838 -4.502 4.73e-05 ***
log(rincome) 0.2804 0.2572 1.090 0.281
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2025 on 45 degrees of freedom
Multiple R-squared: 0.3368, Adjusted R-squared: 0.3073
F-statistic: 11.43 on 2 and 45 DF, p-value: 9.709e-05
使用集成命令
<- iv_robust(log(packs) ~ log(rprice) + log(rincome) | log(rincome) + salestax + cigtax,
fit3 data = d,
se_type = "HC1")
fit3
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper
(Intercept) 9.8949555 0.9592169 10.315660 1.946702e-13 7.9629934 11.8269176
log(rprice) -1.2774241 0.2496100 -5.117680 6.210718e-06 -1.7801645 -0.7746838
log(rincome) 0.2804048 0.2538897 1.104436 2.752748e-01 -0.2309552 0.7917648
DF
(Intercept) 45
log(rprice) 45
log(rincome) 45
工具变量的有效性检验
弱工具变量
一个经验法则是,对TSLS的first-stage中所有工具变量打包,进行嵌套模型的F检验,如果\(F\leq 10\),那么就是一个弱工具变量。
例如,对于只有一个工具变量salestax
的情况:
<- lm(log(rprice) ~ salestax + log(rincome), data = d)
mod2 <- lm(log(rprice) ~ log(rincome), data = d)
mod1
anova(mod1, mod2)
Analysis of Variance Table
Model 1: log(rprice) ~ log(rincome)
Model 2: log(rprice) ~ salestax + log(rincome)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 46 0.55522
2 45 0.27713 1 0.2781 45.158 2.655e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
再例如,对于有两个工具变量salestax
和cigtax
的情况:
<- lm(log(rprice) ~ salestax + cigtax + log(rincome), data = d)
mod2 <- lm(log(rprice) ~ log(rincome), data = d)
mod1
anova(mod1, mod2)
Analysis of Variance Table
Model 1: log(rprice) ~ log(rincome)
Model 2: log(rprice) ~ salestax + cigtax + log(rincome)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 46 0.55522
2 44 0.04579 2 0.50943 244.73 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
可以看到,对应的F值分别为45.158和244.73,都通过了经验法则的检验,说明不是弱工具变量。此外,我们还可以通过linearHypothesis模型来简化流程,例如对于两个工具变量的情况(如果数据存在异方差性,即误差项的方差不是常数,这会影响F值的计算。在代码中,使用了vcovHC来计算稳健标准误(robust standard errors),这有助于纠正异方差性带来的影响,因此结果也与前面的不同):
linearHypothesis(mod2,
c("salestax = 0", "cigtax = 0"),
vcov = vcovHC, type = "HC1")
Linear hypothesis test
Hypothesis:
salestax = 0
cigtax = 0
Model 1: restricted model
Model 2: log(rprice) ~ salestax + cigtax + log(rincome)
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 46
2 44 2 167.99 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
过度识别检验
当工具变量数量大于内生自变量时,可以使用过度识别的J检验。
第一步,用iv回归的因变量残差对工具变量和外生变量做回归。
# compute the J-statistic
<- iv_robust(log(packs) ~ log(rprice) + log(rincome) | log(rincome) + salestax + cigtax,
fit3 data = d,
se_type = "HC1")
<- log(d$packs) - fit3$fitted.values # 计算残差
residuals
<- lm(residuals ~ log(rincome) + salestax + cigtax, data = d) cig_iv_OR
第二步,对两个工具变量进行联合卡方检验
<- linearHypothesis(cig_iv_OR,
cig_OR_test c("salestax = 0", "cigtax = 0"),
test = "Chisq")
cig_OR_test
Linear hypothesis test
Hypothesis:
salestax = 0
cigtax = 0
Model 1: restricted model
Model 2: residuals ~ log(rincome) + salestax + cigtax
Res.Df RSS Df Sum of Sq Chisq Pr(>Chisq)
1 46 1.588
2 44 1.577 2 0.011005 0.307 0.8577
第三步,提取卡方值,做自由度为(工具变量数目-自变量数目)的卡方检验
pchisq(cig_OR_test$Chisq[2], df = 1, lower.tail = FALSE)
[1] 0.5795077
因此,我们不能拒绝两个工具变量都是外生变量的原假设。