L12 (Lab) - Instrumental Variable Regression

In this lab exercise, you will learn how to obtain the TSLS estimator for the instrumental variable model.

Exercise: Demand for Cigarettes

The relation between the demand for and the price of commodities is a simple yet widespread problem in economics. Health economics is concerned with the study of how health-affecting behavior of individuals is influenced by the health-care system and regulation policy. Probably the most prominent example in public policy debates is smoking as it is related to many illnesses and negative externalities.

It is plausible that cigarette consumption can be reduced by taxing cigarettes more heavily. The question is by how much taxes must be increased to reach a certain reduction in cigarette consumption. Economists use elasticities to answer this kind of question. Since the price elasticity for the demand of cigarettes is unknown, it must be estimated. As discussed in the box Who Invented Instrumental Variables Regression presented in Chapter 12.1 of the book, an OLS regression of log quantity on log price cannot be used to estimate the effect of interest since there is simultaneous causality between demand and supply. Instead, IV regression can be used.

We use the data set CigarettesSW which comes with the package AER. It is a panel data set that contains observations on cigarette consumption and several economic indicators for all 48 continental federal states of the U.S. from 1985 to 1995. For this exercise, let’s consider data for the cross section of states in 1995 only.

We start by loading the package, attaching the data set and getting an overview.

Clear the Workspace

rm(list=ls())

Install and Load Needed Packages

Let’s load all the packages needed for this exercise (this assumes you’ve already installed them).

library(AER)             # load package; to run IV regression; also contains data
library(stargazer)        # load package; to put regression results into a single stargazer table

Import Data: CigarettesSW

# load the data set and get an overview
data("CigarettesSW")
summary(CigarettesSW)

##      state      year         cpi          population           packs       
##  AL     : 2   1985:48   Min.   :1.076   Min.   :  478447   Min.   : 49.27  
##  AR     : 2   1995:48   1st Qu.:1.076   1st Qu.: 1622606   1st Qu.: 92.45  
##  AZ     : 2             Median :1.300   Median : 3697472   Median :110.16  
##  CA     : 2             Mean   :1.300   Mean   : 5168866   Mean   :109.18  
##  CO     : 2             3rd Qu.:1.524   3rd Qu.: 5901500   3rd Qu.:123.52  
##  CT     : 2             Max.   :1.524   Max.   :31493524   Max.   :197.99  
##  (Other):84                                                                
##      income               tax            price             taxs       
##  Min.   :  6887097   Min.   :18.00   Min.   : 84.97   Min.   : 21.27  
##  1st Qu.: 25520384   1st Qu.:31.00   1st Qu.:102.71   1st Qu.: 34.77  
##  Median : 61661644   Median :37.00   Median :137.72   Median : 41.05  
##  Mean   : 99878736   Mean   :42.68   Mean   :143.45   Mean   : 48.33  
##  3rd Qu.:127313964   3rd Qu.:50.88   3rd Qu.:176.15   3rd Qu.: 59.48  
##  Max.   :771470144   Max.   :99.00   Max.   :240.85   Max.   :112.63  
##

Use ?CigarettesSW for a detailed description of the variables.

We are interested in estimating \(\beta_1\) in \[\log(Q_i^{cigarettes}) = \beta_0 + \beta_1 \log(P_i^{cigarettes}) + u_i,\] where \(Q_i^{cigarettes}\) is the number of cigarette packs per capita sold and \(P_i^{cigarettes}\) is the after-tax average real price per pack of cigarettes in state \(i\).

1. Instrumental Variable and TSLS

The instrumental variable we are going to use for instrumenting the endogenous regressor \(\log(P_i^{cigarettes})\) is SalesTax, the portion of taxes on cigarettes arising from the general sales tax. SalesTax is measured in dollars per pack. The idea is that SalesTax is a relevant instrument as it is included in the after-tax average price per pack. Also, it is plausible that SalesTax is exogenous since the sales tax does not influence quantity sold directly but indirectly through the price.

1.1 Data Preparation

We perform some transformations in order to obtain deflated cross section data for the year 1995.

We also compute the sample correlation between the sales tax and price per pack. The sample correlation is a consistent estimator of the population correlation. The estimate of approximately 0.614 indicates that SalesTax and \(P_i^{cigarettes}\) exhibit positive correlation which meets our expectations: higher sales taxes lead to higher prices. However, a correlation analysis like this is not sufficient for checking whether the instrument is relevant. (Why? The weak IV issue.) We will later come back to the issue of checking whether an instrument is relevant and exogenous.

# compute real per capita prices
CigarettesSW$rprice <- with(CigarettesSW, price / cpi)

#  compute the sales tax
CigarettesSW$salestax <- with(CigarettesSW, (taxs - tax) / cpi)

# check the correlation between sales tax and price
cor(CigarettesSW$salestax, CigarettesSW$price)

## [1] 0.6141228

# generate a subset for the year 1995
c1995 <- subset(CigarettesSW, year == "1995")

1.2 2SLS: The First Stage

The first stage regression is: \[\log(P_i^{cigarettes}) = \pi_0 + \pi_1 SalesTax_i + v_i.\]

We estimate this model in R using lm( ).

cig_s1 <- lm(log(rprice) ~ salestax, data = c1995)
coeftest(cig_s1, vcov = vcovHC, type = "HC1")

## 
## t test of coefficients:
## 
##              Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept) 4.6165463  0.0289177 159.6444 < 2.2e-16 ***
## salestax    0.0307289  0.0048354   6.3549 8.489e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The first stage regression indicates that the relation between sales tax and price of cigarettes to be positive and statistically significant.

We next store \(\log(\widehat{P_i^{cigarettes}})\), the fitted values obtained by the first stage regression cig_s1, in the variable lcigp_pred.

# store the predicted values
lcigp_pred <- cig_s1$fitted.values

1.3 2SLS: The Second Stage

In the second stage, we run a regression of \(\log(Q_i^{cigarettes})\) on \(\log(\widehat{P_i^{cigarettes}})\) to obtain \(\hat\beta_0^{TSLS}\) and \(\hat\beta_1^{TSLS}\).

# run the stage 2 regression
cig_s2 <- lm(log(c1995$packs) ~ lcigp_pred)
coeftest(cig_s2, vcov = vcovHC, type = "HC1")

## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   9.7199     1.5971  6.0859 2.153e-07 ***
## lcigp_pred   -1.0836     0.3337 -3.2472  0.002178 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

1.4 Perform TSLS using ivreg( )

In R, TSLS can be performed automatically using ivreg( ) from the package AER. It is used similarly as lm( ). Instruments can be added to the usual specification of the regression formula using a vertical bar separating the model equation from the instruments. Thus, for the regression at hand the correct formula is \[log(packs) \sim log(rprice) | salestax.\]

# perform TSLS using 'ivreg()'
cig_ivreg <- ivreg(log(packs) ~ log(rprice) | salestax, data = c1995)

coeftest(cig_ivreg, vcov = vcovHC, type = "HC1")

## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  9.71988    1.52832  6.3598 8.346e-08 ***
## log(rprice) -1.08359    0.31892 -3.3977  0.001411 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We find that the coefficient estimates coincide for both approaches.

1.5 Notes on the Computation of TSLS Standard Errors

We have demonstrated that running the individual regressions for each stage of TSLS using lm( ) leads to the same coefficient estimates as when using ivreg( ). However, the standard errors reported for the second-stage regression, e.g., by coeftest( ) or summary( ), are invalid: neither adjusts for using predictions from the first-stage regression as regressors in the second-stage regression. Fortunately, ivreg( ) performs the necessary adjustment automatically. This is another advantage over manual step-by-step estimation which we have done above for demonstrating the mechanics of the procedure.
Just like in multiple regression it is important to compute heteroskedasticity-robust standard errors as we have done above using vcovHC.

Discussion: The TSLS estimate for \(\beta_1\) suggests that an increase in cigarette prices by one percent reduces cigarette consumption by roughly percentage points, which is fairly elastic. However, we should keep in mind that this estimate might not be trustworthy even though we used IV estimation: there still might be a bias due to omitted variables. Thus a multiple IV regression approach is needed.

2. The General IV Regression Model

2.1 A Model with Multiple Variables

The estimated elasticity of the demand for cigarettes in the simple linear model is 1.08. Although the simple model was estimated using IV regression it is plausible that this IV estimate is biased: in this model, the TSLS estimator is inconsistent for the true \(\beta_1\) if the instrument (the real sales tax per pack) correlates with the error term. This is likely to be the case since there are economic factors, like state income, which impact the demand for cigarettes and correlate with the sales tax. States with high personal income tend to generate tax revenues by income taxes and less by sales taxes. Consequently, state income should be included in the regression model: \[\log(Q_i^{cigarettes}) = \beta_0 + \beta_1 \log(P_i^{cigarettes}) + \beta_2 \log(income_i) + u_i.\] Before estimating this model, we define income as real per capita income rincome and append it to the data set CigarettesSW.

# add rincome to the dataset
CigarettesSW$rincome <- with(CigarettesSW, income / population / cpi)

c1995 <- subset(CigarettesSW, year == "1995")

# estimate the model
cig_ivreg2 <- ivreg(log(packs) ~ log(rprice) + log(rincome) |  
                    salestax + log(rincome), data = c1995)

coeftest(cig_ivreg2, vcov = vcovHC, type = "HC1")

## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   9.43066    1.25939  7.4883 1.935e-09 ***
## log(rprice)  -1.14338    0.37230 -3.0711  0.003611 ** 
## log(rincome)  0.21452    0.31175  0.6881  0.494917    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Estimating regression models with TSLS using multiple instruments by means of ivreg( ) is straightforward. There are, however, some subtleties in correctly specifying the regression formula.

In ivreg( ), it is necessary to list all exogenous variables as instruments too, that is joining them by +’s on the right of the vertical bar: \[y \sim x_1 + w_1 | z_1 + w_1,\] where \(w_1\) is the exogenous variable and it is “instrumenting itself”.
If there is a large number of exogenous variables it may be convenient to provide an update formula with a . (this includes all variables except for the dependent variable) right after the \(|\) and to exclude all endogenous variables using a \(-\). For example, if there is one exogenous regressor \(w_1\) and one endogenous regressor \(x_1\) with instrument \(z_1\), the appropriate formula would be \(y \sim x_1 + w_1 | z_1 + w_1\), which is equivalent to \(y \sim x_1 + w_1 | . - x_1 + z_1\).

2.2 A Model with Multiple Instruments

Following the textbook, we add the cigarette-specific taxes (\(cigtax_i\)) as a further instrument variable and estimate again using TSLS.

# add cigtax to the data set
CigarettesSW$cigtax <- with(CigarettesSW, tax/cpi)

c1995 <- subset(CigarettesSW, year == "1995")

# estimate the model
cig_ivreg3 <- ivreg(log(packs) ~ log(rprice) + log(rincome) | 
                    log(rincome) + salestax + cigtax, data = c1995)

coeftest(cig_ivreg3, vcov = vcovHC, type = "HC1")

## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   9.89496    0.95922 10.3157 1.947e-13 ***
## log(rprice)  -1.27742    0.24961 -5.1177 6.211e-06 ***
## log(rincome)  0.28040    0.25389  1.1044    0.2753    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using the two instruments \(salestax_i\) and \(cigtax_i\), we have \(m=2\) and \(k=1\) so the coefficient on the endogenous regressor \(\log(P_i^{cigarettes})\) is overidentified.

## Summary of the regression results ##
cig_ols <- lm(log(packs) ~ log(rprice), data=c1995)

rob_se <- list(sqrt(diag(vcovHC(cig_ols, type = "HC1"))),
               sqrt(diag(vcovHC(cig_ivreg, type = "HC1"))),
               sqrt(diag(vcovHC(cig_ivreg2, type = "HC1"))),
               sqrt(diag(vcovHC(cig_ivreg3, type = "HC1"))))

stargazer(cig_ols, cig_ivreg, cig_ivreg2, cig_ivreg3, title="Regression Results: Demand for Cigarettes", column.labels=c("ols","ivreg", "ivreg2", "ivreg3"), align=TRUE, type="text", se = rob_se, df=FALSE, digits=2)

## 
## Regression Results: Demand for Cigarettes
## =======================================================
##                             Dependent variable:        
##                     -----------------------------------
##                                 log(packs)             
##                       OLS           instrumental       
##                                       variable         
##                       ols     ivreg    ivreg2   ivreg3 
##                       (1)      (2)      (3)      (4)   
## -------------------------------------------------------
## log(rprice)         -1.21*** -1.08*** -1.14*** -1.28***
##                      (0.19)   (0.32)   (0.37)   (0.25) 
##                                                        
## log(rincome)                            0.21     0.28  
##                                        (0.31)   (0.25) 
##                                                        
## Constant            10.34*** 9.72***  9.43***  9.89*** 
##                      (0.93)   (1.53)   (1.26)   (0.96) 
##                                                        
## -------------------------------------------------------
## Observations           48       48       48       48   
## R2                    0.41     0.40     0.42     0.43  
## Adjusted R2           0.39     0.39     0.39     0.40  
## Residual Std. Error   0.19     0.19     0.19     0.19  
## F Statistic         31.41***                           
## =======================================================
## Note:                       *p<0.1; **p<0.05; ***p<0.01

Discussion: Should we trust the estimates in cig_ivreg2 or rather rely on cig_ivreg3 where two instruments are used? The estimates obtained using both instruments are more precise since in cig_ivreg3 all standard errors reported are smaller than in cig_ivreg2. In fact, the standard error for the estimate of the demand elasticity is only two thirds of the standard error when the sales tax is the only instrument used. This is due to more information being used in estimation cig_ivreg3. If the instruments are valid, cig_ivreg3 can be considered more reliable.

However, without insights regarding the validity of the instruments it is not sensible to make such a statement. This stresses why checking instrument validity is essential. Chapter 12.3 briefly discusses guidelines in checking instrument validity and presents approaches that allow to test for instrument relevance and exogeneity under certain conditions.

3. Test for Validity of The Instruments

3.1 Instrument Relevance

Rule-of-thumb: In all first-stage regressions, compute (heteroskedasticity-robust) F statistic to test hypothesis that all instruments have coefficients of 0.

For example, if there are two exogenous regressors \(w_1\), \(w_2\), and one endogenous regressor \(x_1\) with instruments \(z_1\) and \(z_2\), the appropriate F statistic would test hypothesis that instruments \(z_1\) and \(z_2\) jointly have coefficients of 0: \[ H_0: x_1 = \pi_0 + \pi_3 w_1 + \pi_4 w_2\] \[ H_a: x_1 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + \pi_3 w_1 + \pi_4 w_2\]

Compute the appropriate first-stage \(F\) statistics for cig_ivreg, cig_ivreg2, and cig_ivreg3:

First, compute for cig_ivreg.

# compute the first-stage F-statistic for cig_ivreg
cig_s1 <- lm(log(rprice) ~ salestax, data = c1995)
linearHypothesis(cig_s1, c("salestax=0"), white.adjust=c("hc1"))

## Linear hypothesis test
## 
## Hypothesis:
## salestax = 0
## 
## Model 1: restricted model
## Model 2: log(rprice) ~ salestax
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F    Pr(>F)    
## 1     47                        
## 2     46  1 40.385 8.489e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When there is only one instrument in the first stage, the F-statistic is identical to the corresponding t-statistic squared. Following the rule-of-thumb, it implies that the instrument \(salestax\) is not weak since \(t=6.35>\sqrt{10}\approx 3.16\).

# equivalently, we can compute the first-stage t-statistic for cig_ivreg
coeftest(cig_s1, vcov = vcovHC, type = "HC1")

## 
## t test of coefficients:
## 
##              Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept) 4.6165463  0.0289177 159.6444 < 2.2e-16 ***
## salestax    0.0307289  0.0048354   6.3549 8.489e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Next, compute for cig_ivreg2 and cig_ivreg3.

# compute the first-stage F-statistic for cig_ivreg2
cig_s1 <- lm(log(rprice) ~ salestax + log(rincome), data = c1995)
linearHypothesis(cig_s1, c("salestax=0"), white.adjust=c("hc1"))

## Linear hypothesis test
## 
## Hypothesis:
## salestax = 0
## 
## Model 1: restricted model
## Model 2: log(rprice) ~ salestax + log(rincome)
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df     F   Pr(>F)    
## 1     46                      
## 2     45  1 44.73 2.96e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# compute the first-stage F-statistic for cig_ivreg3
cig_s1 <- lm(log(rprice) ~ salestax + log(rincome) + cigtax, data = c1995)
linearHypothesis(cig_s1, c("salestax=0", "cigtax=0"), white.adjust=c("hc1"))

## Linear hypothesis test
## 
## Hypothesis:
## salestax = 0
## cigtax = 0
## 
## Model 1: restricted model
## Model 2: log(rprice) ~ salestax + log(rincome) + cigtax
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F    Pr(>F)    
## 1     46                        
## 2     44  2 209.68 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary: All F-statistics suggest that instruments \(salestax\) and \(cigtax\) satisfy the relevance condition.

3.2 Instrument Exogeneity

If the IV model is exactly identified, i.e., the number of instruments equals the number of endogenous variables, there is no way to check the exogeneity condition of the instruments. In this exercise, \(cig\_ivreg\) and \(cig\_ivreg2\) are exactly identified. Thus, we much rely on good sense as to whether an instrument is exogenous.

If the IV model is overidentified, i.e., the number of instruments is greater than the number of endogenous variables, we can test the exogeneity of additional instruments using the \(J\) test. The IV regression \(cig\_ivreg3\) has one endogenous variable but two IVs, therefore it is overidentified.

# compute the J-statistic for cig_ivreg3
cig_ivreg3_J <- lm(residuals(cig_ivreg3) ~ log(rincome) + salestax + cigtax, data=c1995)

cig_J_test <- linearHypothesis(cig_ivreg3_J, c("salestax = 0", "cigtax = 0"), test = "Chisq")

cig_J_test

## Linear hypothesis test
## 
## Hypothesis:
## salestax = 0
## cigtax = 0
## 
## Model 1: restricted model
## Model 2: residuals(cig_ivreg3) ~ log(rincome) + salestax + cigtax
## 
##   Res.Df   RSS Df Sum of Sq Chisq Pr(>Chisq)
## 1     46 1.588                              
## 2     44 1.577  2  0.011005 0.307     0.8577

Caution: In this case the \(p\)-value reported by linearHypothesis( ) is wrong because the degrees of freedom are set to \(2\). This differs from the degree of overidentification (\(m-k=2-1=1\)) so under the null hypothesis the \(J\)-statistic is \(\chi^2_1\) distributed instead of following a \(\chi^2_2\) distribution as assumed defaultly linearHypothesis( ). We may compute the correct \(p\)-value using 1-pchisq( ).

# compute correct p-value for J-statistic
1-pchisq(cig_J_test[2, 5], df = 1)

## [1] 0.5795077

Since the value is greater than 0.05, we do not reject the null hypothesis that both instruments are exogenous.

References

Kleiber, Christian, and Achim Zeileis. 2020. AER: Applied Econometrics with R (version 1.2-9). https://CRAN.R-project.org/package=AER.

Hlavac, Marek. 2018. stargazer: Well-Formatted Regression and Summary Statistics Tables (version 5.2.2). https://CRAN.R-project.org/package=stargazer.

Hanck, Arnold, Gerber, and Schmelzer. 2021. Introduction to Econometrics with R. https://www.econometrics-with-r.org/index.html.