From: Wooldridge J. M. (2020). Introductory econometrics: A modern approach. 7th Ed. Cengage learning. Chapter 13.
1 Pooled regression
Example 13.3
Study the effect that a new garbage incinerator had on housing values in North Andover, Massachusetts. The rumor that a new incinerator would be built in North Andover began after 1978, and construction began in 1981. The incinerator was expected to be in operation soon after the start of construction; the incinerator actually began operating in 1985.
We will use data on prices of houses that sold in 1978 and another sample on those that sold in 1981. The hypothesis is that the price of houses located near the incinerator (within three miles) would fall relative to the price of more distant houses.
1.1 Post-policy
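library(dplyr)  # provides %>% and filter()
data$rprice <- data$rprice / 1000  # express prices in thousands of dollars
data1981 <- data %>% filter(year == 1981)
m81 <- lm(rprice ~ nearinc, data = data1981)
summary(m81)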
Call:
lm(formula = rprice ~ nearinc, data = data1981)
Residuals:
Min 1Q Median 3Q Max
-60.678 -19.832 -2.997 21.139 136.754
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 101.308 3.093 32.754 < 2e-16 ***
nearinc -30.688 5.828 -5.266 5.14e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 31.24 on 140 degrees of freedom
Multiple R-squared: 0.1653, Adjusted R-squared: 0.1594
F-statistic: 27.73 on 1 and 140 DF, p-value: 5.139e-07
1.2 Pre-policy
data1978 <- data %>% filter(year == 1978)
m78 <- lm(rprice ~ nearinc, data = data1978)
summary(m78)
Call:
lm(formula = rprice ~ nearinc, data = data1978)
Residuals:
Min 1Q Median 3Q Max
-56.517 -16.605 -3.193 8.683 236.307
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 82.517 2.654 31.094 < 2e-16 ***
nearinc -18.824 4.745 -3.968 0.000105 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 29.43 on 177 degrees of freedom
Multiple R-squared: 0.08167, Adjusted R-squared: 0.07648
F-statistic: 15.74 on 1 and 177 DF, p-value: 0.0001054
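The effect can be obtained as: \[
\hat{\delta}_1 = (\overline{rprice}_{81,nr}-\overline{rprice}_{81,fr})-(\overline{rprice}_{78,nr}-\overline{rprice}_{78,fr})
\] where \(nr\) denotes houses near the incinerator and \(fr\) houses farther away. Equivalently, it is the coefficient on the interaction term y81:nearinc in the pooled regression: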
DiD <- lm(rprice ~ y81 + nearinc + y81*nearinc, data = data)
summary(DiD)
Call:
lm(formula = rprice ~ y81 + nearinc + y81 * nearinc, data = data)
Residuals:
Min 1Q Median 3Q Max
-60.678 -17.693 -3.031 12.483 236.307
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 82.517 2.727 30.260 < 2e-16 ***
y81 18.790 4.050 4.640 5.12e-06 ***
nearinc -18.824 4.875 -3.861 0.000137 ***
y81:nearinc -11.864 7.457 -1.591 0.112595
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 30.24 on 317 degrees of freedom
Multiple R-squared: 0.1739, Adjusted R-squared: 0.1661
F-statistic: 22.25 on 3 and 317 DF, p-value: 4.224e-13
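The estimator of interest is the difference over time in the average difference in housing prices between the two locations, known as the difference-in-differences estimator. Here it is the coefficient on y81:nearinc, -11.864, which is not statistically significant at conventional levels (p-value 0.113).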
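As a quick check, the difference-in-differences coefficient can be recovered from the two single-year regressions, and its equivalence with the four group means can be verified on any data set. The sketch below first uses the slope estimates reported above, then confirms the identity on simulated data. The data frame `sim` and all of its parameters are hypothetical and are not the housing data.

```r
# Part 1: DiD from the reported single-year slopes on nearinc
b81 <- -30.688   # 1981 slope (post-policy regression)
b78 <- -18.824   # 1978 slope (pre-policy regression)
did_reported <- b81 - b78
print(did_reported)

# Part 2: the same identity on simulated (hypothetical) data
set.seed(1)
n <- 400
sim <- data.frame(
  y81     = rep(c(0, 1), each = n / 2),   # period dummy
  nearinc = rep(c(0, 1), times = n / 2)   # location dummy
)
sim$rprice <- 80 + 20 * sim$y81 - 15 * sim$nearinc -
  10 * sim$y81 * sim$nearinc + rnorm(n, sd = 5)

# Four group means, indexed by [period, location]
m <- tapply(sim$rprice, list(sim$y81, sim$nearinc), mean)
did_means <- unname((m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"]))

# Interaction coefficient from the pooled regression
fit <- lm(rprice ~ y81 * nearinc, data = sim)
did_coef <- unname(coef(fit)["y81:nearinc"])

# TRUE when the two agree up to floating-point error
print(isTRUE(all.equal(did_means, did_coef)))
```

Because the model with both dummies and their interaction is saturated, the regression reproduces the four group means exactly, so the interaction coefficient and the difference of differences in means coincide.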
1.4 Adding controls
DiD2 <- lm(rprice ~ y81 + nearinc + y81 * nearinc + age + agesq + intst + land + area + rooms + baths, data = data)
summary(DiD2)
Call:
lm(formula = rprice ~ y81 + nearinc + y81 * nearinc + age + agesq +
intst + land + area + rooms + baths, data = data)
Residuals:
Min 1Q Median 3Q Max
-76.721 -8.885 -0.252 8.433 136.649
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.381e+01 1.117e+01 1.237 0.21720
y81 1.393e+01 2.799e+00 4.977 1.07e-06 ***
nearinc 3.780e+00 4.453e+00 0.849 0.39661
age -7.395e-01 1.311e-01 -5.639 3.85e-08 ***
agesq 3.453e-03 8.128e-04 4.248 2.86e-05 ***
intst -5.386e-04 1.963e-04 -2.743 0.00643 **
land 1.414e-04 3.108e-05 4.551 7.69e-06 ***
area 1.809e-02 2.306e-03 7.843 7.16e-14 ***
rooms 3.304e+00 1.661e+00 1.989 0.04758 *
baths 6.977e+00 2.581e+00 2.703 0.00725 **
y81:nearinc -1.418e+01 4.987e+00 -2.843 0.00477 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.62 on 310 degrees of freedom
Multiple R-squared: 0.66, Adjusted R-squared: 0.6491
F-statistic: 60.19 on 10 and 310 DF, p-value: < 2.2e-16
Call:
lm(formula = crmrte ~ d87 + unem, data = data)
Residuals:
Min 1Q Median 3Q Max
-53.474 -21.794 -6.266 18.297 75.113
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 93.4202 12.7395 7.333 9.92e-11 ***
d87 7.9404 7.9753 0.996 0.322
unem 0.4265 1.1883 0.359 0.720
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 29.99 on 89 degrees of freedom
Multiple R-squared: 0.01221, Adjusted R-squared: -0.009986
F-statistic: 0.5501 on 2 and 89 DF, p-value: 0.5788
The coefficient on unem is positive but not statistically significant at standard significance levels, so this regression finds no link between crime and unemployment rates. In addition, the overall F test (p-value 0.5788) cannot reject the null hypothesis that both slope coefficients are zero.
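The p-values discussed above can be reproduced from the reported statistics. A minimal sketch using base R distribution functions, with the numbers copied from the regression output (the variable names are only illustrative):

```r
# Overall F test: H0 is that the coefficients on d87 and unem are both zero
f_stat <- 0.5501
p_f <- pf(f_stat, df1 = 2, df2 = 89, lower.tail = FALSE)

# t test on unem: estimate divided by its standard error
t_unem <- 0.4265 / 1.1883
p_t <- 2 * pt(abs(t_unem), df = 89, lower.tail = FALSE)

print(round(c(F_pvalue = p_f, t_unem = t_unem, t_pvalue = p_t), 3))
```

Both p-values are far above 0.05, which is the basis for not rejecting either null hypothesis.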
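This regression likely suffers from omitted variables. One remedy is to control for more factors, such as the number of police officers, income, and law enforcement efforts, in a multiple regression analysis. Another is to work with changes between the two years, which removes unobserved factors that stay constant over time. For example, regressing the change in the crime rate (ccrmrte) on the change in the unemployment rate (cunem) gives: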
Call:
lm(formula = ccrmrte ~ cunem, data = data)
Residuals:
Min 1Q Median 3Q Max
-36.912 -13.369 -5.507 12.446 52.915
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.4022 4.7021 3.276 0.00206 **
cunem 2.2180 0.8779 2.527 0.01519 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.05 on 44 degrees of freedom
(46 observations deleted due to missingness)
Multiple R-squared: 0.1267, Adjusted R-squared: 0.1069
F-statistic: 6.384 on 1 and 44 DF, p-value: 0.01519
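The variable names ccrmrte and cunem in the output above indicate changes between the two years. The sketch below shows, on simulated data, how such first differences can be built and why differencing removes a time-invariant unobserved effect. Every name and parameter here (`panel`, `a_i`, `d2`, the true slope of 2) is hypothetical and unrelated to the crime data.

```r
set.seed(42)
n <- 200
a_i <- rnorm(n, sd = 10)   # time-invariant unobserved effect, one per unit

# Two rows per unit: period 1 and period 2
panel <- data.frame(
  id     = rep(seq_len(n), each = 2),
  period = rep(c(1, 2), times = n)
)
panel$d2     <- as.numeric(panel$period == 2)
panel$unem   <- 8 + 0.3 * rep(a_i, each = 2) + rnorm(2 * n, sd = 2)
panel$crmrte <- 90 + 10 * panel$d2 + 2 * panel$unem +
  3 * rep(a_i, each = 2) + rnorm(2 * n, sd = 5)

# Pooled OLS ignores a_i, which is correlated with unem, so the slope is biased
pooled <- lm(crmrte ~ d2 + unem, data = panel)

# First differences within each unit: period 2 minus period 1
panel  <- panel[order(panel$id, panel$period), ]
first  <- panel[panel$period == 1, ]
second <- panel[panel$period == 2, ]
fd <- data.frame(
  ccrmrte = second$crmrte - first$crmrte,
  cunem   = second$unem - first$unem
)
fd_fit <- lm(ccrmrte ~ cunem, data = fd)

pooled_slope <- unname(coef(pooled)["unem"])
fd_slope     <- unname(coef(fd_fit)["cunem"])
print(c(pooled = pooled_slope, first_difference = fd_slope))
```

In this simulation the first-difference slope lands near the true value of 2, while the pooled slope is pushed upward by the omitted effect. Differencing also leaves one observation per unit, which would be consistent with the 46 observations dropped for missingness in the output above if the change variables are undefined in the first year.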
3 Policy analysis with two-period panel data
A random sample of units (individuals, firms, cities, etc.) is obtained in the first time period.
In the second period, some of these units, the treatment group, take part in a particular program; those that do not form the control group.
The same cross-sectional units appear in both time periods.
Example: the effect of a job training program on the worker productivity of manufacturing firms in Michigan.
The outcome of interest is the scrap rate, \(scrap\), the percentage of items scrapped due to defects. The binary variable \(grant_{it}\) takes the value 1 if firm \(i\) received a job training grant in year \(t\). We use data for 1987 and 1988.
Load the data:
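jtrain <- load("jtrain.RData")  # load() returns the names of the restored objects; the code below assumes one of them is called data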
Pooled OLS will produce biased and inconsistent estimators when unobserved time-invariant factors that affect the scrap rate also influenced firms' participation in the program. For example, grants could have been given to firms with more productive workers, or to employers with better managerial skills.
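One common way to formalize this concern is an unobserved effects model. The notation below (\(a_i\), \(d88_t\), \(\delta_0\), \(u_{it}\)) is assumed here for illustration and does not appear in these notes: \[
\log(scrap_{it}) = \beta_0 + \delta_0\, d88_t + \beta_1\, grant_{it} + a_i + u_{it}, \qquad t = 1987, 1988
\] where \(a_i\) collects time-invariant firm characteristics such as managerial skill and \(d88_t\) equals 1 in 1988. Subtracting the 1987 equation from the 1988 equation removes \(a_i\): \[
\Delta \log(scrap_{i}) = \delta_0 + \beta_1\, \Delta grant_{i} + \Delta u_{i}
\] This has the same form as the regression of clscrap on cgrant that follows, with \(\delta_0\) as the intercept and \(\beta_1\) as the slope.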
data <- data %>% filter(year %in% c('1987', '1988'))
dm1 <- lm(clscrap ~ cgrant, data = data)
summary(dm1)
Call:
lm(formula = clscrap ~ cgrant, data = data)
Residuals:
Min 1Q Median 3Q Max
-2.1277 -0.2372 0.0237 0.2142 2.4553
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.05744 0.09721 -0.591 0.5572
cgrant -0.31706 0.16388 -1.935 0.0585 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5751 on 52 degrees of freedom
(260 observations deleted due to missingness)
Multiple R-squared: 0.06715, Adjusted R-squared: 0.04921
F-statistic: 3.743 on 1 and 52 DF, p-value: 0.05847
where we used the log of the scrap rate so that the coefficient can be read as a proportional change. The estimate \(\hat{\beta}_1 = -0.317\) means that the job training grant lowered the scrap rate by about 27.2% [\(\exp(\hat{\beta}_1)-1 \approx -0.272\)].
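Because the dependent variable is in logs, a coefficient converts to a percentage change through \(\exp(\hat{\beta}_1)-1\). A minimal sketch of that arithmetic with the reported estimate (the variable names are only illustrative):

```r
beta1 <- -0.317                        # reported coefficient on cgrant

approx_pct <- 100 * beta1              # log-point approximation
exact_pct  <- 100 * (exp(beta1) - 1)   # exact proportional change

print(round(c(approx = approx_pct, exact = exact_pct), 1))
```

The gap between the two figures grows with the magnitude of the coefficient, which is why the exact formula is preferable for an estimate of this size.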