Karim Naguib (Boston University)
11/17/2013
Let our population regression model be
\[ Y_i = \beta_0 + \beta_1 X_i + u_i,~~i=1,\dots,n \]
and let the variable \( Z_i \) be an instrumental variable that isolates the part of \( X_i \) that is uncorrelated with \( u_i \).
If a valid instrument \( Z \) is available, we can estimate the coefficient \( \beta_1 \) using two-stage least squares (TSLS).
Suppose we wanted to estimate the price elasticity in the log-log model
\[ \ln(Q_i^{butter}) = \beta_0 + \beta_1 \ln(P_i^{butter}) + u_i \]
If we had a sample of \( n \) observations of quantity demanded and the equilibrium price, we could run an OLS regression to estimate the elasticity coefficient \( \beta_1 \).
To see how a valid instrument identifies \( \beta_1 \), note that
\[ \begin{align*} Cov(Z_i, Y_i) &= Cov[Z_i, (\beta_0 + \beta_1 X_i + u_i)]\\ &= \beta_1 Cov(Z_i, X_i) + Cov(Z_i, u_i) \end{align*} \]
Since \( Cov(Z_i, u_i) = 0 \) (instrument exogeneity) and \( Cov(Z_i, X_i) \ne 0 \) (instrument relevance),
\[ \beta_1 = \frac{Cov(Z_i, Y_i)}{Cov(Z_i, X_i)} \]
Since the sample covariances converge in probability to their population counterparts,
\[ \begin{align*} s_{ZY} &\overset{p}{\longrightarrow} Cov(Z_i, Y_i) \\ s_{ZX} &\overset{p}{\longrightarrow} Cov(Z_i, X_i) \end{align*} \]
it follows that
\[ \hat{\beta}_1^{TSLS} = \frac{s_{ZY}}{s_{ZX}} \overset{p}{\longrightarrow} \frac{Cov(Z_i, Y_i)}{Cov(Z_i, X_i)} = \beta_1 \]
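To make this concrete, here is a small simulation (not from the original slides; all names are made up for the example) showing that the ratio of sample covariances recovers \( \beta_1 \) while OLS does not:

set.seed(42)
n <- 1e5
z <- rnorm(n)                  # instrument: correlated with x, not with u
u <- rnorm(n)                  # structural error
x <- 0.8*z + 0.6*u + rnorm(n)  # endogenous regressor (depends on u)
y <- 2 + 1.5*x + u             # true beta1 = 1.5
cov(z, y)/cov(z, x)            # IV estimate: approximately 1.5
coef(lm(y ~ x))[["x"]]         # OLS estimate: biased away from 1.5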
We have data on 48 US states for 1985 and 1995 (the cross-sectional analysis below uses the 1995 observations), with the variables shown in the str() output below.
Before we can carry out TSLS estimation we must investigate the relevance of our instrument.
Statistical software conceals the various steps needed for IV regression, but it is useful to demonstrate the steps here.
The first stage regression (replicated in R below) is
\[ \widehat{\ln(P_i^{cigarettes})} = 4.62 + 0.031 \, SalesTax_i \]
with \( R^2 = 0.47 \).
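Before replicating these steps, the data must be loaded. The original file path is not shown; a plausible loading step (hypothetical file name), consistent with the Stata attributes in the str() output below, would be:

library(foreign)                      # read.dta() reads Stata .dta files
cig.data <- read.dta("cig_data.dta")  # hypothetical file name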
str(cig.data)
'data.frame': 96 obs. of 9 variables:
$ state : chr "AL" "AR" "AZ" "CA" ...
$ year : num 1985 1985 1985 1985 1985 ...
$ cpi : num 1.08 1.08 1.08 1.08 1.08 ...
$ pop : num 3973000 2327000 3184000 26444000 3209000 ...
$ packpc: num 116 129 105 100 113 ...
$ income: num 4.60e+07 2.62e+07 4.40e+07 4.47e+08 4.95e+07 ...
$ tax : num 32.5 37 31 26 31 ...
$ avgprs: num 102.2 101.5 108.6 107.8 94.3 ...
$ taxs : num 33.3 37 36.2 32.1 31 ...
- attr(*, "datalabel")= chr ""
- attr(*, "time.stamp")= chr "29 Dec 2010 20:03"
- attr(*, "formats")= chr "%9s" "%9.0g" "%9.0g" "%9.0g" ...
- attr(*, "types")= int 9 254 254 254 254 254 254 254 254
- attr(*, "val.labels")= chr "" "" "" "" ...
- attr(*, "var.labels")= chr "" "" "" "population from RAs before 1990; from web for 1990-2000" ...
- attr(*, "expansion.fields")=List of 2
..$ : chr "pop" "note1" "data from 2000 extrapolated from 1999 assuming constant population growth rate from 98-99 and 99-2000 of .76%"
..$ : chr "pop" "note0" "1"
- attr(*, "version")= int 12
cig.data$ravgprs <- cig.data$avgprs/cig.data$cpi # real average price
cig.data$rtax <- cig.data$tax/cig.data$cpi # real average cig tax
cig.data$rtaxs <- cig.data$taxs/cig.data$cpi # real average total tax
cig.data$rtaxso <- cig.data$rtaxs - cig.data$rtax # instrument
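coeftest() and the robust covariance estimators used below come from the lmtest and sandwich packages; library(AER) (loaded further down) attaches both, but when running this chunk on its own they can be loaded directly:

library(lmtest)    # coeftest()
library(sandwich)  # vcovHAC()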
The first stage regression:
first.stage.res <- lm(log(ravgprs) ~ rtaxso, data=cig.data, subset=(year == 1995))
coeftest(first.stage.res, vcov.=vcovHAC(first.stage.res))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.61655 0.02854 161.73 < 2e-16 ***
rtaxso 0.03073 0.00486 6.32 9.6e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
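To finish the "by hand" demonstration, here is a sketch of the manual second stage (cig.1995 and lravgprs.hat are names introduced for this example): regress log quantity on the first-stage fitted values. The naive OLS standard errors from this shortcut are not valid; ivreg() below computes the correct ones.

cig.1995 <- subset(cig.data, year == 1995)
cig.1995$lravgprs.hat <- fitted(first.stage.res)  # first-stage fitted values
second.stage.res <- lm(log(packpc) ~ lravgprs.hat, data=cig.1995)
coef(second.stage.res)  # point estimates match the TSLS estimates below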
library(AER)  # provides ivreg() for IV/TSLS estimation (and loads lmtest, sandwich, car)
iv.res <- ivreg(log(packpc) ~ log(ravgprs) | rtaxso, data=cig.data, subset=(year == 1995))
coeftest(iv.res, vcov.=vcovHAC(iv.res))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.720 1.527 6.36 8.2e-08 ***
log(ravgprs) -1.084 0.319 -3.40 0.0014 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
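Note the ivreg() formula syntax: the instruments, together with any included exogenous regressors, are listed after the |. As an aside (not in the original slides), newer versions of AER can also report IV diagnostics:

summary(iv.res, diagnostics=TRUE)  # weak-instrument F test, Wu-Hausman test,
                                   # and a Sargan test when overidentified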
\[ Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \beta_{k+1} W_{1i} + \cdots + \beta_{k+r}W_{ri} + u_i, \quad i = 1,\dots,n \]
Let \( m \) be the number of instruments and \( k \) be the number of endogenous regressors. The coefficients of this model are said to be exactly identified if \( m = k \), overidentified if \( m > k \), and underidentified if \( m < k \).
For an IV regression to be possible, we must have exact identification or overidentification.
The variables in \( W \) can either be exogenous themselves (uncorrelated with \( u_i \)) or controls included so that, conditional on them, the instrument is uncorrelated with the error term.
For example, in the model
\[ \ln(Q_i^{cigarettes}) = \beta_0 + \beta_1\ln(P_i^{cigarettes}) + u_i \]
we raised the possibility that income is an omitted factor that would be correlated with the sales tax. If we add income directly, or a control for it, we can remove this bias.
Let our model be
\[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_{1i} + \cdots + \beta_{1+r} W_{ri} + u_i \]
In the first stage we regress the endogenous regressor on all instruments and all included exogenous variables,
\[ X_i = \pi_0 + \pi_1 Z_{1i} + \cdots + \pi_m Z_{mi} + \pi_{m+1} W_{1i} + \cdots + \pi_{m+r} W_{ri} + v_i \]
and collect the fitted values \( \hat{X}_i \). The second stage would then be to estimate, by OLS,
\[ Y_i = \beta_0 + \beta_1 \hat{X}_i + \beta_2 W_{1i} + \cdots + \beta_{1+r} W_{ri} + u_i \]
As stated before, when we regress quantity demanded on price using the sales tax as an instrument, the instrument may be correlated with the error term, since state income is likely correlated with sales taxes. So now let's add an exogenous variable for income:
\[ \begin{alignat*}{5} \widehat{\ln(Q_i^{cigarettes})} = &9.43 - &&1.14 \widehat{\ln(P_i^{cigarettes})} + &&0.21\ln(Inc_i) \\ &(1.26) &&(0.37) && (0.31) \end{alignat*} \]
cig.data$perinc <- cig.data$income/(cig.data$pop * cig.data$cpi) # real per capita income
iv.res.2 <- ivreg(log(packpc) ~ log(ravgprs) + log(perinc) | rtaxso + log(perinc), data=cig.data, subset=(year == 1995))
coeftest(iv.res.2, vcov.=vcovHAC(iv.res.2))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.431 1.255 7.52 1.8e-09 ***
log(ravgprs) -1.143 0.371 -3.08 0.0035 **
log(perinc) 0.215 0.311 0.69 0.4945
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now instead of using only one instrument we can use two: \( SalesTax_i \) and \( CigTax_i \). With \( m = 2 \) instruments and \( k = 1 \) endogenous regressor, this model is overidentified.
\[ \begin{alignat*}{5} \widehat{\ln(Q_i^{cigarettes})} = &9.89 - &&1.28 \widehat{\ln(P_i^{cigarettes})} + &&0.28\ln(Inc_i) \\ &(0.96) &&(0.25) &&(0.25) \end{alignat*} \]
iv.res.3 <- ivreg(log(packpc) ~ log(ravgprs) + log(perinc) | rtaxso + rtax + log(perinc), data=cig.data, subset=(year == 1995))
coeftest(iv.res.3, vcov.=vcovHAC(iv.res.3))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.895 0.966 10.24 2.4e-13 ***
log(ravgprs) -1.277 0.253 -5.05 7.8e-06 ***
log(perinc) 0.280 0.255 1.10 0.28
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If the instruments are not exogenous, then the TSLS estimator will be inconsistent.
We can test instrument exogeneity using the \( J \)-statistic (a test of overidentifying restrictions). We do this by estimating the regression
\[ \hat{u}_i^{TSLS} = \delta_0 + \delta_1 Z_{1i} + \cdots + \delta_m Z_{mi} + \delta_{m+1} W_{1i} + \cdots + \delta_{m+r} W_{ri} + e_i \]
and using an \( F \)-test of \( \delta_1 = \cdots = \delta_m = 0 \). The \( J \)-statistic is \( J = mF \), which is distributed \( \chi^2_{m-k} \) under the null hypothesis that all instruments are exogenous (so the test requires overidentification, \( m > k \)).
What we are actually going to do is run a state fixed effects IV regression using all the data (both years).
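These panel regressions use the plm package, which has not been loaded yet:

library(plm)  # panel data models; also provides vcovHC() methods for plm fits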
panel.iv.res.1 <- plm(log(packpc) ~ log(ravgprs) + log(perinc) | rtaxso + log(perinc), data=cig.data, model="within", effect="individual", index=c("state", "year"))
coeftest(panel.iv.res.1, vcov.=vcovHC(panel.iv.res.1))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
log(ravgprs) -1.073 0.145 -7.39 2.4e-09 ***
log(perinc) -0.079 0.227 -0.35 0.73
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
First stage \( F \)-test (rule of thumb: an \( F \)-statistic above 10 suggests the instrument is not weak):
panel.1st.stage.res.1 <- lm(log(ravgprs) ~ rtaxso, data=cig.data)
lht(panel.1st.stage.res.1, "rtaxso = 0", vcov=vcovHAC)
Linear hypothesis test
Hypothesis:
rtaxso = 0
Model 1: restricted model
Model 2: log(ravgprs) ~ rtaxso
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 95
2 94 1 113 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
panel.iv.res.2 <- plm(log(packpc) ~ log(ravgprs) + log(perinc) | rtax + log(perinc), data=cig.data, model="within", effect="individual", index=c("state", "year"))
coeftest(panel.iv.res.2, vcov.=vcovHC(panel.iv.res.2))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
log(ravgprs) -1.363 0.143 -9.53 1.8e-12 ***
log(perinc) 0.342 0.210 1.63 0.11
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
First stage \( F \)-test:
panel.1st.stage.res.2 <- lm(log(ravgprs) ~ rtax, data=cig.data)
lht(panel.1st.stage.res.2, "rtax = 0", vcov=vcovHAC)
Linear hypothesis test
Hypothesis:
rtax = 0
Model 1: restricted model
Model 2: log(ravgprs) ~ rtax
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 95
2 94 1 302 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
panel.iv.res.3 <- plm(log(packpc) ~ log(ravgprs) + log(perinc) | rtaxso + rtax + log(perinc), data=cig.data, model="within", effect="individual", index=c("state", "year"))
coeftest(panel.iv.res.3, vcov.=vcovHC(panel.iv.res.3))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
log(ravgprs) -1.268 0.140 -9.07 8.2e-12 ***
log(perinc) 0.204 0.211 0.97 0.34
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
First stage \( F \)-test:
panel.1st.stage.res.3 <- lm(log(ravgprs) ~ rtaxso + rtax, data=cig.data)
lht(panel.1st.stage.res.3, c("rtaxso = 0", "rtax = 0"), vcov=vcovHAC)
Linear hypothesis test
Hypothesis:
rtaxso = 0
rtax = 0
Model 1: restricted model
Model 2: log(ravgprs) ~ rtaxso + rtax
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 95
2 93 2 165 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\( J \)-test:
j.test.reg <- lm(panel.iv.res.3$residuals ~ rtaxso + rtax + log(perinc), data=cig.data) # regress TSLS residuals on instruments and W
lht(j.test.reg, c("rtaxso = 0", "rtax = 0"), vcov.=vcovHAC)
Linear hypothesis test
Hypothesis:
rtaxso = 0
rtax = 0
Model 1: restricted model
Model 2: panel.iv.res.3$residuals ~ rtaxso + rtax + log(perinc)
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 94
2 92 2 0 1
The \( F \)-statistic is essentially zero, so we cannot reject the null hypothesis that both instruments are exogenous.
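To convert this \( F \)-statistic into the \( J \)-statistic explicitly (j.test.f is a name introduced here), a sketch:

j.test.f <- lht(j.test.reg, c("rtaxso = 0", "rtax = 0"), vcov.=vcovHAC)
J <- 2 * j.test.f$F[2]             # J = m*F with m = 2 instruments
pchisq(J, df=1, lower.tail=FALSE)  # chi-squared(m - k = 1) p-value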