The model we are going to estimate is
\(y=a+bx+cd+e\),
where \(y\) is the explained variable, \(a\), \(b\), and \(c\) are the coefficients we want to estimate. \(x\) is the control variableand \(d\) is the treatment variable. We are perticularly interested in the impact of our treatment \(d\) on \(y\).
First, let’s generate the data.
Assume the the correlation matrix between \(x\), \(d\), \(z\) (the instrumental variable for \(d\)), and \(e\) is as follows:
R<-matrix(cbind(1,0.001,0.002,0.001,
0.001,1,0.7,0.3,
0.002,0.7,1,0.001,
0.001,0.3,0.001,1),nrow=4)
rownames(R)<-colnames(R)<-c("x","d","z","e")
R
## x d z e
## x 1.000 0.001 0.002 0.001
## d 0.001 1.000 0.700 0.300
## z 0.002 0.700 1.000 0.001
## e 0.001 0.300 0.001 1.000
Specifically, the correlation shows that
Now, let’s generate data for \(x\), \(d\), \(z\), and \(e\) with the specified correlation.
U = t(chol(R))
nvars = dim(U)[1]
numobs = 1000
set.seed(1)
random.normal = matrix(rnorm(nvars*numobs,0,1), nrow=nvars, ncol=numobs);
X = U %*% random.normal
newX = t(X)
data = as.data.frame(newX)
attach(data)
the data looks like this:
head(data)
## x d z e
## 1 -0.62645381 0.1830168 -0.4694601 1.7474361
## 2 0.32950777 -0.8201385 -0.2255741 0.2818908
## 3 0.57578135 -0.3048125 0.8670061 -0.1795257
## 4 -0.62124058 -2.2153200 -0.7481687 -1.0350488
## 5 -0.01619026 0.9438195 1.2471197 0.5820200
## 6 0.91897737 0.7830549 0.6025820 -1.5924689
and the correlation between the data are
cor(data)
## x d z e
## x 1.00000000 0.00668391 -0.012319595 0.016239235
## d 0.00668391 1.00000000 0.680741763 0.312192680
## z -0.01231960 0.68074176 1.000000000 0.006322354
## e 0.01623923 0.31219268 0.006322354 1.000000000
just as what we previously specified.
Now let’s specify the true data generating process and generate explained variable \(y\)
y<-10+1*x+1*d+e
If we pretend we don’t know the true relationship and use \(x\) and \(d\) to explain \(y\), our correct coefficients for \(x\) and \(d\) should be close to \(1\). Otherwise, we are making mistakes.
If we just use OLS to estimate the coefficients:
ols<-lm(formula = y~x+d)
summary(ols)
##
## Call:
## lm(formula = y ~ x + d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2395 -0.5952 -0.0308 0.6617 2.7592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.99495 0.03105 321.89 <2e-16 ***
## x 1.01408 0.02992 33.89 <2e-16 ***
## d 1.31356 0.03023 43.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9817 on 997 degrees of freedom
## Multiple R-squared: 0.7541, Adjusted R-squared: 0.7536
## F-statistic: 1528 on 2 and 997 DF, p-value: < 2.2e-16
the estimated coefficient of b is 1.31 instread of 1. ## 2SLS ## Now we use 2SLS to estimate the relationship. We use z as the instrumental variable for d
stage 1: regress \(d\) on \(x\) and \(z\), and save the fitted value for d as d.hat
tsls1<-lm(d~x+z)
summary(tsls1)
##
## Call:
## lm(formula = d ~ x + z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.59344 -0.52572 0.04978 0.53115 2.01555
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01048 0.02383 -0.44 0.660
## x 0.01492 0.02296 0.65 0.516
## z 0.68594 0.02337 29.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7534 on 997 degrees of freedom
## Multiple R-squared: 0.4636, Adjusted R-squared: 0.4626
## F-statistic: 430.9 on 2 and 997 DF, p-value: < 2.2e-16
d.hat<-fitted.values(tsls1)
stage 2: regress \(y\) on \(x\) and \(d.hat\)
tsls2<-lm(y~x+d.hat)
summary(tsls2)
##
## Call:
## lm(formula = y ~ x + d.hat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4531 -1.0333 0.0228 1.0657 4.0104
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.99507 0.04786 208.85 <2e-16 ***
## x 1.01609 0.04612 22.03 <2e-16 ***
## d.hat 1.00963 0.06842 14.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.513 on 997 degrees of freedom
## Multiple R-squared: 0.4158, Adjusted R-squared: 0.4146
## F-statistic: 354.8 on 2 and 997 DF, p-value: < 2.2e-16
True value of b: 1 OLS estiamte of b: .00963 2SLS estiamte of b: 1.31356
If the treatment varialbe is endogenous, we should find an instrumental varialbe for the treatmet varialbe (not easy), and use 2SLS.