2022-09-13

IV 2SLS Equation

\(Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \varepsilon\)

where \(X_1\) is an endogenous variable.

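  • First-stage: \(X_1 = \gamma_0 + \gamma_1Z_1 + \gamma_2X_2 + \gamma_3X_3 + v\), with fitted values \(\hat{X_1} = \hat{\gamma}_0 + \hat{\gamma}_1Z_1 + \hat{\gamma}_2X_2 + \hat{\gamma}_3X_3\)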

where \(Z_1\) is the instrumental variable.

  • Second-stage: \(Y = \beta_0 + \beta_1\hat{X_1} + \beta_2X_2 + \beta_3X_3 + e\)

where \(e\) is a composite error term that is uncorrelated with \(\hat{X_1}\), \(X_2\), and \(X_3\).
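
To make the two stages concrete, the sketch below simulates data in which \(X_1\) is endogenous by construction and compares OLS with a manual two-stage fit. Everything in it (variable names, coefficients, sample size, seed) is an assumption chosen for illustration and is not taken from the MEPS example.

```r
# Simulated illustration of 2SLS; all names and numbers are hypothetical.
set.seed(123)
n  <- 5000
z1 <- rnorm(n)   # instrument: affects y only through x1
x2 <- rnorm(n)   # exogenous regressor
x3 <- rnorm(n)   # exogenous regressor
u  <- rnorm(n)   # unobserved confounder that makes x1 endogenous

x1 <- 0.8 * z1 + 0.5 * x2 - 0.3 * x3 + 0.7 * u + rnorm(n)
y  <- 1 + 2 * x1 - 1 * x2 + 0.5 * x3 + u + rnorm(n)   # true beta_1 = 2

# OLS ignores the correlation between x1 and the error term (through u)
b_ols <- coef(lm(y ~ x1 + x2 + x3))["x1"]

# First stage: regress x1 on the instrument and the exogenous regressors
x1_hat <- fitted(lm(x1 ~ z1 + x2 + x3))

# Second stage: replace x1 with its first-stage fitted values
b_2sls <- coef(lm(y ~ x1_hat + x2 + x3))["x1_hat"]

round(c(ols = unname(b_ols), tsls = unname(b_2sls)), 3)
```

In this setup OLS is pulled away from the true value of 2 because x1 and the error share the component u, while the two-stage estimate should land close to 2.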

Instrumental Variables Example

  • We want to study the factors influencing medical expenses (\(Y\)), given the endogenous regressor of having health insurance (\(X_1\)) and the exogenous regressors illnesses, age, and income (\(X_2\)). The instrument is the SS (Social Security) income ratio (\(Z_1\)).

  • Data are from the Medical Expenditure Panel Survey (MEPS).

\(Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \varepsilon\)

where \(X_1\) is an endogenous variable.

  • First-stage: \(X_1 = \gamma_0 + \gamma_1Z_1 + \gamma_2X_2 + v\), with fitted values \(\hat{X_1} = \hat{\gamma}_0 + \hat{\gamma}_1Z_1 + \hat{\gamma}_2X_2\)

where \(Z_1\) is the instrumental variable.

  • Second-stage: \(Y = \beta_0 + \beta_1\hat{X_1} + \beta_2X_2 + e\)

where \(e\) is a composite error term that is uncorrelated with \(\hat{X_1}\) and \(X_2\).

Code

Variables

\(Y\): log of medical expenses (logmedexpense).

\(X_1\): health insurance indicator, 0/1 (healthinsu).

\(X_2\): illnesses, age, and log of income (illnesses, age, logincome).

\(Z_1\): SS (Social Security) income ratio (ssiratio).

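# Assumes the MEPS data frame is already loaded and attached,
# so its columns can be referenced directly by name.
Y <- cbind(logmedexpense)                    # outcome
X_1 <- cbind(healthinsu)                     # endogenous regressor
X_2 <- cbind(illnesses, age, logincome)      # exogenous regressors
Z_1 <- cbind(ssiratio)                       # instrument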
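
Because the regressors are bundled with cbind(), lm() labels each coefficient with the matrix name followed by the column name, which is why the outputs show names such as X_2illnesses. A small sketch on made-up data (the names a, b, resp, and X are hypothetical) shows that the estimates are the same as with a plain formula and only the labels differ:

```r
# Hypothetical toy data, used only to show how lm() names matrix terms.
set.seed(1)
a    <- rnorm(20)
b    <- rnorm(20)
resp <- 1 + 2 * a - b + rnorm(20)

X <- cbind(a, b)                # matrix with columns "a" and "b"
fit_matrix <- lm(resp ~ X)      # coefficients are labelled Xa and Xb
fit_plain  <- lm(resp ~ a + b)  # coefficients are labelled a and b

names(coef(fit_matrix))  # "(Intercept)" "Xa" "Xb"
```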

Descriptive Statistics

summary(cbind(Y, X_1, X_2, Z_1))
##  logmedexpense      healthinsu       illnesses          age       
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.000   Min.   :65.00  
##  1st Qu.: 5.740   1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:70.00  
##  Median : 6.678   Median :0.0000   Median :2.000   Median :74.00  
##  Mean   : 6.481   Mean   :0.3822   Mean   :1.861   Mean   :75.05  
##  3rd Qu.: 7.430   3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:80.00  
##  Max.   :10.180   Max.   :1.0000   Max.   :9.000   Max.   :91.00  
##    logincome         ssiratio     
##  Min.   :-6.908   Min.   :0.0000  
##  1st Qu.: 2.233   1st Qu.:0.2381  
##  Median : 2.743   Median :0.5045  
##  Mean   : 2.743   Mean   :0.5365  
##  3rd Qu.: 3.315   3rd Qu.:0.9091  
##  Max.   : 5.744   Max.   :9.2506

OLS Regression

olsreg <- lm(Y ~ X_1 + X_2)
summary(olsreg)
## 
## Call:
## lm(formula = Y ~ X_1 + X_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2793 -0.6768  0.1472  0.8517  3.7803 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.780127   0.150891  38.307  < 2e-16 ***
## X_1           0.074960   0.026012   2.882  0.00396 ** 
## X_2illnesses  0.440653   0.009572  46.035  < 2e-16 ***
## X_2age       -0.002595   0.001879  -1.381  0.16735    
## X_2logincome  0.017236   0.013787   1.250  0.21124    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 10084 degrees of freedom
## Multiple R-squared:  0.1749, Adjusted R-squared:  0.1746 
## F-statistic: 534.4 on 4 and 10084 DF,  p-value: < 2.2e-16

2SLS estimation: First-Step

olsreg1 <- lm(X_1 ~ Z_1 + X_2)
summary(olsreg1)
## 
## Call:
## lm(formula = X_1 ~ Z_1 + X_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6817 -0.3882 -0.2413  0.5167  2.5921 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.9591576  0.0568776  16.864  < 2e-16 ***
## Z_1          -0.1997539  0.0141579 -14.109  < 2e-16 ***
## X_2illnesses  0.0113510  0.0036336   3.124  0.00179 ** 
## X_2age       -0.0085302  0.0007125 -11.973  < 2e-16 ***
## X_2logincome  0.0544246  0.0056429   9.645  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4691 on 10084 degrees of freedom
## Multiple R-squared:  0.06839,    Adjusted R-squared:  0.06803 
## F-statistic: 185.1 on 4 and 10084 DF,  p-value: < 2.2e-16
X_1_hat <- fitted(olsreg1)
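
The first stage also indicates whether the instrument is relevant. With a single excluded instrument, the F-statistic for that instrument equals the square of its t-statistic, and a common rule of thumb treats values above 10 as evidence against a weak instrument. The arithmetic below uses the t value for Z_1 reported in the output above; the threshold of 10 is a convention, not a formal test. Note that the overall F-statistic of 185.1 in the output tests all regressors jointly, not the instrument alone.

```r
# t value for Z_1 copied from the first-stage output above
t_z1 <- -14.109

# With one excluded instrument, F for the instrument is t squared
f_excluded <- t_z1^2
f_excluded

# Rule-of-thumb check for instrument strength (F > 10)
strong_instrument <- f_excluded > 10
strong_instrument
```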

2SLS estimation: Second-Step

olsreg2 <- lm(Y ~ X_1_hat + X_2)
summary(olsreg2)
## 
## Call:
## lm(formula = Y ~ X_1_hat + X_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2923 -0.6683  0.1525  0.8507  3.6881 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.589839   0.221021  29.815  < 2e-16 ***
## X_1_hat      -0.852201   0.186843  -4.561 5.15e-06 ***
## X_2illnesses  0.448512   0.009694  46.267  < 2e-16 ***
## X_2age       -0.011797   0.002627  -4.492 7.15e-06 ***
## X_2logincome  0.097693   0.021157   4.617 3.93e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 10084 degrees of freedom
## Multiple R-squared:  0.1759, Adjusted R-squared:  0.1756 
## F-statistic: 538.1 on 4 and 10084 DF,  p-value: < 2.2e-16

2SLS estimation (ivreg function)

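# 2SLS estimation with ivreg(); assumes the AER package provides it
# (the library call is not shown in the source)
library(AER)
ivreg_fit <- ivreg(Y ~ X_1 + X_2 | Z_1 + X_2)  # renamed so the object does not shadow ivreg()
summary(ivreg_fit)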
## 
## Call:
## ivreg(formula = Y ~ X_1 + X_2 | Z_1 + X_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7141 -0.7468  0.1288  0.8907  4.0895 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.589839   0.234676  28.081  < 2e-16 ***
## X_1          -0.852201   0.198386  -4.296 1.76e-05 ***
## X_2illnesses  0.448512   0.010293  43.575  < 2e-16 ***
## X_2age       -0.011797   0.002789  -4.230 2.36e-05 ***
## X_2logincome  0.097693   0.022464   4.349 1.38e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.313 on 10084 degrees of freedom
## Multiple R-Squared: 0.07094, Adjusted R-squared: 0.07058 
## Wald test: 477.3 on 4 and 10084 DF,  p-value: < 2.2e-16
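
The point estimates from ivreg() match the manual second stage, but the standard errors do not (0.198 versus 0.187 for X_1). The manual second stage computes its residual variance from the fitted values \(\hat{X_1}\) instead of the observed \(X_1\). In the outputs above, the ratio of the two residual standard errors (1.313 / 1.237) and the ratio of the standard errors on X_1 are both about 1.06. The sketch below reproduces that correction on simulated data; all names and coefficients are hypothetical.

```r
# Simulated data; names and numbers are hypothetical.
set.seed(1)
n  <- 2000
z  <- rnorm(n)
x2 <- rnorm(n)
u  <- rnorm(n)
x1 <- 0.8 * z + 0.5 * x2 + 0.7 * u + rnorm(n)
y  <- 1 + 2 * x1 - 1 * x2 + u + rnorm(n)

# Manual two-stage fit
x1_hat <- fitted(lm(x1 ~ z + x2))
stage2 <- lm(y ~ x1_hat + x2)
b      <- coef(stage2)

# Naive standard errors use residuals built from x1_hat
se_naive    <- coef(summary(stage2))[, "Std. Error"]
sigma_naive <- summary(stage2)$sigma

# Correct residuals use the observed x1 with the 2SLS coefficients
res_iv   <- as.vector(y - cbind(1, x1, x2) %*% b)
sigma_iv <- sqrt(sum(res_iv^2) / (n - length(b)))

# Rescale the naive standard errors by the ratio of the two sigmas
se_iv <- se_naive * sigma_iv / sigma_naive
round(cbind(se_naive, se_iv), 4)
```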

2SLS estimation (2 IV)

Z_1alt <- cbind(ssiratio, firmlocation) # two instruments: SS income ratio and firm has multiple locations

ivreg_o <- ivreg(Y ~ X_1 + X_2 | X_2 + Z_1alt)
summary(ivreg_o)
## 
## Call:
## ivreg(formula = Y ~ X_1 + X_2 | X_2 + Z_1alt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7692 -0.7664  0.1183  0.9073  4.1775 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.692387   0.228705  29.262  < 2e-16 ***
## X_1          -0.969624   0.186385  -5.202 2.01e-07 ***
## X_2illnesses  0.449508   0.010427  43.111  < 2e-16 ***
## X_2age       -0.012963   0.002728  -4.752 2.04e-06 ***
## X_2logincome  0.107882   0.021821   4.944 7.78e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.333 on 10084 degrees of freedom
## Multiple R-Squared: 0.04295, Adjusted R-squared: 0.04257 
## Wald test: 465.7 on 4 and 10084 DF,  p-value: < 2.2e-16
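
With two instruments and one endogenous regressor, the model is overidentified, so the joint validity of the instruments can be checked. One common approach is a Sargan-style test: regress the 2SLS residuals on all instruments and exogenous regressors, and compare n times the R-squared with a chi-squared distribution whose degrees of freedom equal the number of surplus instruments. The sketch below does this by hand on simulated data; every name and coefficient is hypothetical. If the ivreg() used here is the one from AER, summary(ivreg_o, diagnostics = TRUE) should report a comparable Sargan statistic directly.

```r
# Simulated data with two valid instruments; names and numbers are hypothetical.
set.seed(42)
n  <- 5000
z1 <- rnorm(n)
z2 <- rnorm(n)
x2 <- rnorm(n)
u  <- rnorm(n)
x1 <- 0.6 * z1 + 0.4 * z2 + 0.5 * x2 + 0.7 * u + rnorm(n)
y  <- 1 - 0.9 * x1 + 0.4 * x2 + u + rnorm(n)

# Manual 2SLS with both instruments in the first stage
x1_hat <- fitted(lm(x1 ~ z1 + z2 + x2))
b      <- coef(lm(y ~ x1_hat + x2))

# 2SLS residuals use the observed x1
res_iv <- as.vector(y - cbind(1, x1, x2) %*% b)

# Auxiliary regression of the residuals on all instruments and exogenous regressors
aux       <- lm(res_iv ~ z1 + z2 + x2)
sargan    <- n * summary(aux)$r.squared
df_sargan <- 2 - 1   # instruments minus endogenous regressors
p_value   <- pchisq(sargan, df = df_sargan, lower.tail = FALSE)

c(statistic = sargan, df = df_sargan, p_value = p_value)
```

A small p-value would cast doubt on at least one instrument; a large one does not prove validity, since the test has no power when all instruments are invalid in the same direction.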

Summary (one instrument)

| | OLS regression for \(Y\) (log of med expenses) | 2SLS: first stage for \(X_1\) (health insurance) | 2SLS: second stage for \(Y\) (log med expenses) |
|---|---|---|---|
| Have health insurance (endogenous variable \(X_1\)) | 0.075* | - | -0.852* |
| Illnesses (\(X_2\)) | 0.441* | 0.011* | 0.449* |
| Age (\(X_2\)) | -0.003 | -0.009* | -0.012* |
| Log income (\(X_2\)) | 0.017 | 0.054* | 0.098* |
| SS income ratio (\(Z_1\)) | - | -0.200* | - |
| Constant | 5.780* | 0.959* | 6.590* |

* statistically significant (p < 0.01 in the regression outputs above).

Summary (two instruments)

| | OLS regression for \(Y\) (log of med expenses) | 2SLS: first stage for \(X_1\) (health insurance) | 2SLS: second stage for \(Y\) (log med expenses) |
|---|---|---|---|
| Have health insurance (endogenous variable \(X_1\)) | 0.075* | - | -0.970* |
| Illnesses (\(X_2\)) | 0.441* | 0.012* | 0.450* |
| Age (\(X_2\)) | -0.003 | -0.009* | -0.013* |
| Log income (\(X_2\)) | 0.017 | 0.051* | 0.108* |
| SS income ratio (\(Z_1\)) | - | -0.191* | - |
| Firm location (\(Z_1\)) | - | 0.116* | - |
| Constant | 5.780* | 0.912* | 6.692* |

With two instruments instead of one, the coefficient on having health insurance changed only modestly, from -0.852 to -0.970.

Conclusions

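  • Interpretation of the coefficient on the endogenous variable in the OLS model: holding illnesses, age, and income fixed, individuals with health insurance have medical expenses about 7.5% higher than individuals without it (the coefficient is 0.075 on log expenses; the exact figure is exp(0.075) - 1 ≈ 7.8%).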

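  • Interpretation of the coefficient on the endogenous variable in 2SLS: after instrumenting, individuals with health insurance have log medical expenses 0.852 lower than individuals without it. Because the coefficient is large, it cannot be read directly as a percentage; the implied change is exp(-0.852) - 1 ≈ -0.57, or roughly 57% lower expenses.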

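  • Note that the 2SLS coefficient differs sharply from the OLS coefficient: the sign flips from positive (0.075) to negative (-0.852), which is consistent with health insurance being endogenous in the OLS model.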
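
When the outcome is in logs and the regressor is a 0/1 indicator, a coefficient b corresponds to a percentage change of 100 * (exp(b) - 1); reading 100 * b directly as a percentage is only a good approximation when b is small. A minimal sketch using the three health-insurance coefficients reported in the outputs above (the vector names are hypothetical labels):

```r
# Health-insurance coefficients copied from the regression outputs above
b <- c(ols = 0.074960, tsls_one_iv = -0.852201, tsls_two_iv = -0.969624)

# Naive reading: coefficient times 100
pct_naive <- 100 * b

# Exact percentage change implied by a log-outcome model
pct_exact <- 100 * (exp(b) - 1)

round(cbind(pct_naive, pct_exact), 1)
```

The two readings nearly agree for the OLS coefficient but diverge widely for the 2SLS coefficients.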

Details: summary of the instrument ssiratio (\(Z_1\))

##        Min. 1st Qu. Median   Mean 3rd Qu.   Max.
## [1,] 0.0000  0.2381 0.5045 0.5365  0.9091 9.2506

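Details: summary of a 0/1 variable (not named in the source; it appears to be the second instrument, firmlocation)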

##         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## [1,] 0.00000 0.00000 0.00000 0.06205 0.00000 1.00000