工具变量

作者

plumber

发布日期

2023年3月20日

参考资料：https://www.econometrics-with-r.org/12.1-TIVEWASRAASI.html

案例背景

香烟的价格\(P\)与销量之间\(Q\)存在怎样的关系？通过执行式1的回归，我们期望得到香烟的价格弹性，但是由于供需之间的双向因果，这一目标是无法实现的

\[ \log(Q) = \beta_0 + \beta_1\log(P) + u \tag{1} \]

因此，我们使用每包烟的消费税\(SaleTax\)作为工具变量。它需要满足两个条件：

会影响税后每包烟的价格\(P\)。这一条显然满足。
是外生的，因为销售税并不直接影响销售量，而是通过价格间接与销售量相关。这一条可能不满足。例如，像州平均收入\(Income\)这样的变量，既会影响销量\(Q\)，又和税费水平\(SaleTax\)相关。所以实操时，该变量需要被控制。

通过下面的代码加载数据：

library(AER)
library(tidyverse)
library(estimatr)
data("CigarettesSW")

d <- CigarettesSW |>
  as_tibble() |>
  mutate(rprice = price / cpi, # 价格
         salestax = (taxs - tax) / cpi,  # 消费税
         cigtax = tax/cpi, # 香烟税
         rincome = income / population / cpi) |> # 收入
  filter(year == "1995") |>
  select(packs, rprice, salestax, rincome, cigtax)

Two-Stage Least Squares

两步法进行回归。

第一步

用内生解释变量对所有工具变量（这里是rprice）及所有外生变量（控制变量，即rincome）做回归，并提取拟合值。

fit1 <- lm(log(rprice) ~ salestax + log(rincome), data = d)
lprice_pred <- fit1$fitted.values

在做第一步时，需要关注两点，一是工具变量的系数必须显著，否则就不满足作为工具变量的条件；二是模型解释能力不能太差，否则会出现弱工具变量的问题。

summary(fit1)


Call:
lm(formula = log(rprice) ~ salestax + log(rincome), data = d)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.163799 -0.033049  0.001907  0.049322  0.185542 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.590811   0.225558  15.920  < 2e-16 ***
salestax     0.027395   0.004077   6.720 2.65e-08 ***
log(rincome) 0.389283   0.085104   4.574 3.74e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.07848 on 45 degrees of freedom
Multiple R-squared:  0.6389,    Adjusted R-squared:  0.6228 
F-statistic: 39.81 on 2 and 45 DF,  p-value: 1.114e-10

第二步

用因变量对所有拟合值（这里是上一步得到的lprice_pred）及所有外生变量（控制变量，即rincome）做回归。

fit2 <- lm(log(packs) ~ lprice_pred + log(rincome), data = d)
summary(fit2)


Call:
lm(formula = log(packs) ~ lprice_pred + log(rincome), data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.67048 -0.13306  0.00598  0.13361  0.58044 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    9.4307     1.6247   5.805 6.09e-07 ***
lprice_pred   -1.1434     0.4300  -2.659   0.0108 *  
log(rincome)   0.2145     0.3212   0.668   0.5077    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2267 on 45 degrees of freedom
Multiple R-squared:  0.1687,    Adjusted R-squared:  0.1318 
F-statistic: 4.567 on 2 and 45 DF,  p-value: 0.01564

也就是说：

\[ \log(Q) = 9.43 - 1.14 \times \log(P) + 0.21 \times \log(rincome) \]

使用集成命令

在实操中，不要使用Two-Stage Least Squares (TSLS)来汇报最终结果，因为标准误的估计会有问题。这里可以使用estimatr包中的集成命令：

fit3 <- iv_robust(log(packs) ~ log(rprice) + log(rincome) | log(rincome) + salestax, 
                  data = d,
                  se_type = "HC1")
fit3

               Estimate Std. Error    t value     Pr(>|t|)   CI Lower
(Intercept)   9.4306583  1.2593926  7.4882595 1.934709e-09  6.8941115
log(rprice)  -1.1433751  0.3723027 -3.0710902 3.611302e-03 -1.8932312
log(rincome)  0.2145153  0.3117469  0.6881071 4.949173e-01 -0.4133752
               CI Upper DF
(Intercept)  11.9672051 45
log(rprice)  -0.3935190 45
log(rincome)  0.8424058 45

过度识别

过度识别指工具变量的数目大于内生变量数目。例如，在此案例中，我们加入香烟税cigtax作为第二个工具变量。

TSLS估计

用二阶段回归得到点估计

fit1 <- lm(log(rprice) ~ salestax + cigtax + log(rincome), data = d)
lprice_pred <- fit1$fitted.values
fit2 <- lm(log(packs) ~ lprice_pred + log(rincome), data = d)
summary(fit2)


Call:
lm(formula = log(packs) ~ lprice_pred + log(rincome), data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.63899 -0.09544  0.00126  0.11769  0.42479 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    9.8950     1.1413   8.670 3.72e-11 ***
lprice_pred   -1.2774     0.2838  -4.502 4.73e-05 ***
log(rincome)   0.2804     0.2572   1.090    0.281    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2025 on 45 degrees of freedom
Multiple R-squared:  0.3368,    Adjusted R-squared:  0.3073 
F-statistic: 11.43 on 2 and 45 DF,  p-value: 9.709e-05

使用集成命令

fit3 <- iv_robust(log(packs) ~ log(rprice) + log(rincome) | log(rincome) + salestax + cigtax, 
                  data = d,
                  se_type = "HC1")
fit3

               Estimate Std. Error   t value     Pr(>|t|)   CI Lower   CI Upper
(Intercept)   9.8949555  0.9592169 10.315660 1.946702e-13  7.9629934 11.8269176
log(rprice)  -1.2774241  0.2496100 -5.117680 6.210718e-06 -1.7801645 -0.7746838
log(rincome)  0.2804048  0.2538897  1.104436 2.752748e-01 -0.2309552  0.7917648
             DF
(Intercept)  45
log(rprice)  45
log(rincome) 45

工具变量的有效性检验

弱工具变量

一个经验法则是，对TSLS的first-stage中所有工具变量打包，进行嵌套模型的F检验，如果\(F\leq 10\)，那么就是一个弱工具变量。

例如，对于只有一个工具变量salestax的情况:

mod2 <- lm(log(rprice) ~ salestax + log(rincome), data = d)
mod1 <- lm(log(rprice) ~ log(rincome), data = d)

anova(mod1, mod2)

Analysis of Variance Table

Model 1: log(rprice) ~ log(rincome)
Model 2: log(rprice) ~ salestax + log(rincome)
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1     46 0.55522                                  
2     45 0.27713  1    0.2781 45.158 2.655e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

再例如，对于有两个工具变量salestax和cigtax的情况:

mod2 <- lm(log(rprice) ~ salestax + cigtax + log(rincome), data = d)
mod1 <- lm(log(rprice) ~ log(rincome), data = d)

anova(mod1, mod2)

Analysis of Variance Table

Model 1: log(rprice) ~ log(rincome)
Model 2: log(rprice) ~ salestax + cigtax + log(rincome)
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1     46 0.55522                                  
2     44 0.04579  2   0.50943 244.73 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

可以看到，对应的F值分别为45.158和244.73，都通过了经验法则的检验，说明不是弱工具变量。此外，我们还可以通过linearHypothesis模型来简化流程，例如对于两个工具变量的情况（如果数据存在异方差性，即误差项的方差不是常数，这会影响F值的计算。在代码中，使用了vcovHC来计算稳健标准误（robust standard errors），这有助于纠正异方差性带来的影响，因此结果也与前面的不同）：

linearHypothesis(mod2, 
                 c("salestax = 0", "cigtax = 0"),
                 vcov = vcovHC, type = "HC1")

Linear hypothesis test

Hypothesis:
salestax = 0
cigtax = 0

Model 1: restricted model
Model 2: log(rprice) ~ salestax + cigtax + log(rincome)

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F    Pr(>F)    
1     46                        
2     44  2 167.99 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

过度识别检验

当工具变量数量大于内生自变量时，可以使用过度识别的J检验。

第一步，用iv回归的因变量残差对工具变量和外生变量做回归。

# compute the J-statistic
fit3 <- iv_robust(log(packs) ~ log(rprice) + log(rincome) | log(rincome) + salestax + cigtax, 
                  data = d,
                  se_type = "HC1")
residuals <- log(d$packs) - fit3$fitted.values # 计算残差

cig_iv_OR <- lm(residuals ~ log(rincome) + salestax + cigtax, data = d)

第二步，对两个工具变量进行联合卡方检验

cig_OR_test <- linearHypothesis(cig_iv_OR, 
                               c("salestax = 0", "cigtax = 0"), 
                               test = "Chisq")
cig_OR_test

Linear hypothesis test

Hypothesis:
salestax = 0
cigtax = 0

Model 1: restricted model
Model 2: residuals ~ log(rincome) + salestax + cigtax

  Res.Df   RSS Df Sum of Sq Chisq Pr(>Chisq)
1     46 1.588                              
2     44 1.577  2  0.011005 0.307     0.8577

第三步，提取卡方值，做自由度为（工具变量数目-自变量数目）的卡方检验

pchisq(cig_OR_test$Chisq[2], df = 1, lower.tail = FALSE)

[1] 0.5795077

因此，我们不能拒绝两个工具变量都是外生变量的原假设。