An example of two-stage least squares (2SLS) method with R

The model we are going to estimate is

\(y=a+bx+cd+e\),

where \(y\) is the explained variable, \(a\), \(b\), and \(c\) are the coefficients we want to estimate. \(x\) is the control variableand \(d\) is the treatment variable. We are perticularly interested in the impact of our treatment \(d\) on \(y\).

generate data

First, let’s generate the data.

Assume the the correlation matrix between \(x\), \(d\), \(z\) (the instrumental variable for \(d\)), and \(e\) is as follows:

R<-matrix(cbind(1,0.001,0.002,0.001,
                0.001,1,0.7,0.3,
                0.002,0.7,1,0.001,
                0.001,0.3,0.001,1),nrow=4)
rownames(R)<-colnames(R)<-c("x","d","z","e")
R

##       x     d     z     e
## x 1.000 0.001 0.002 0.001
## d 0.001 1.000 0.700 0.300
## z 0.002 0.700 1.000 0.001
## e 0.001 0.300 0.001 1.000

Specifically, the correlation shows that

cor(d,e)=0.3, this means that \(d\) is endogeneous;
cor(d,z)=0.7, this means that \(z\) is a strong instrumental varialbe for \(d\);
cor(z,e)=0.001, this means that instrumental variable \(z\) satisfy the exclusion restriction in that it only affects \(y\) through \(d\).

Now, let’s generate data for \(x\), \(d\), \(z\), and \(e\) with the specified correlation.

U = t(chol(R))
nvars = dim(U)[1]
numobs = 1000
set.seed(1)
random.normal = matrix(rnorm(nvars*numobs,0,1), nrow=nvars, ncol=numobs);
X = U %*% random.normal
newX = t(X)
data = as.data.frame(newX)
attach(data)

the data looks like this:

head(data)

##             x          d          z          e
## 1 -0.62645381  0.1830168 -0.4694601  1.7474361
## 2  0.32950777 -0.8201385 -0.2255741  0.2818908
## 3  0.57578135 -0.3048125  0.8670061 -0.1795257
## 4 -0.62124058 -2.2153200 -0.7481687 -1.0350488
## 5 -0.01619026  0.9438195  1.2471197  0.5820200
## 6  0.91897737  0.7830549  0.6025820 -1.5924689

and the correlation between the data are

cor(data)

##             x          d            z           e
## x  1.00000000 0.00668391 -0.012319595 0.016239235
## d  0.00668391 1.00000000  0.680741763 0.312192680
## z -0.01231960 0.68074176  1.000000000 0.006322354
## e  0.01623923 0.31219268  0.006322354 1.000000000

just as what we previously specified.

Now let’s specify the true data generating process and generate explained variable \(y\)

y<-10+1*x+1*d+e

If we pretend we don’t know the true relationship and use \(x\) and \(d\) to explain \(y\), our correct coefficients for \(x\) and \(d\) should be close to \(1\). Otherwise, we are making mistakes.

OLS

If we just use OLS to estimate the coefficients:

ols<-lm(formula = y~x+d)
summary(ols)

## 
## Call:
## lm(formula = y ~ x + d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2395 -0.5952 -0.0308  0.6617  2.7592 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.99495    0.03105  321.89   <2e-16 ***
## x            1.01408    0.02992   33.89   <2e-16 ***
## d            1.31356    0.03023   43.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9817 on 997 degrees of freedom
## Multiple R-squared:  0.7541, Adjusted R-squared:  0.7536 
## F-statistic:  1528 on 2 and 997 DF,  p-value: < 2.2e-16

the estimated coefficient of b is 1.31 instread of 1. ## 2SLS ## Now we use 2SLS to estimate the relationship. We use z as the instrumental variable for d

stage 1: regress \(d\) on \(x\) and \(z\), and save the fitted value for d as d.hat

tsls1<-lm(d~x+z)
summary(tsls1)

## 
## Call:
## lm(formula = d ~ x + z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.59344 -0.52572  0.04978  0.53115  2.01555 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01048    0.02383   -0.44    0.660    
## x            0.01492    0.02296    0.65    0.516    
## z            0.68594    0.02337   29.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7534 on 997 degrees of freedom
## Multiple R-squared:  0.4636, Adjusted R-squared:  0.4626 
## F-statistic: 430.9 on 2 and 997 DF,  p-value: < 2.2e-16

d.hat<-fitted.values(tsls1)

stage 2: regress \(y\) on \(x\) and \(d.hat\)

tsls2<-lm(y~x+d.hat)
summary(tsls2)

## 
## Call:
## lm(formula = y ~ x + d.hat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4531 -1.0333  0.0228  1.0657  4.0104 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.99507    0.04786  208.85   <2e-16 ***
## x            1.01609    0.04612   22.03   <2e-16 ***
## d.hat        1.00963    0.06842   14.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.513 on 997 degrees of freedom
## Multiple R-squared:  0.4158, Adjusted R-squared:  0.4146 
## F-statistic: 354.8 on 2 and 997 DF,  p-value: < 2.2e-16

Results

True value of b: 1 OLS estiamte of b: .00963 2SLS estiamte of b: 1.31356

If the treatment varialbe is endogenous, we should find an instrumental varialbe for the treatmet varialbe (not easy), and use 2SLS.

An example of two-stage least squares (2SLS) method with R

Max Shang

November 20, 2017

generate data

OLS

Results