Simple Panel Data Methods

Isai Guizar

From: Wooldridge J. M. (2020). Introductory econometrics: A modern approach. 7th Ed. Cengage learning. Chapter 13.

1 Pooled regression

Example 13.3

Study the effect that a new garbage incinerator had on housing values in North Andover, Massachusetts. The rumor that a new incinerator would be built in North Andover began after 1978, and construction began in 1981. The incinerator was expected to be in operation soon after the start of construction; the incinerator actually began operating in 1985.

We use data on the prices of houses sold in 1978 and a separate sample of houses sold in 1981. The hypothesis is that the prices of houses located near the incinerator (within three miles) fell relative to the prices of more distant houses.

1.1 Post-policy

library(dplyr)  # provides %>% and filter()
data$rprice <- data$rprice/1000  # express real prices in thousands of dollars
data1981 <- data %>% filter(year==1981)

m81 <- lm(rprice ~ nearinc, data1981)
summary(m81)


Call:
lm(formula = rprice ~ nearinc, data = data1981)

Residuals:
    Min      1Q  Median      3Q     Max 
-60.678 -19.832  -2.997  21.139 136.754 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  101.308      3.093  32.754  < 2e-16 ***
nearinc      -30.688      5.828  -5.266 5.14e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 31.24 on 140 degrees of freedom
Multiple R-squared:  0.1653,    Adjusted R-squared:  0.1594 
F-statistic: 27.73 on 1 and 140 DF,  p-value: 5.139e-07

1.2 Pre-policy

data1978 <- data %>% filter(year==1978)

m78 <- lm(rprice ~ nearinc, data=data1978)
summary(m78)


Call:
lm(formula = rprice ~ nearinc, data = data1978)

Residuals:
    Min      1Q  Median      3Q     Max 
-56.517 -16.605  -3.193   8.683 236.307 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   82.517      2.654  31.094  < 2e-16 ***
nearinc      -18.824      4.745  -3.968 0.000105 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 29.43 on 177 degrees of freedom
Multiple R-squared:  0.08167,   Adjusted R-squared:  0.07648 
F-statistic: 15.74 on 1 and 177 DF,  p-value: 0.0001054

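The effect can be obtained as: \[ \hat{\delta}_1 = (\overline{rprice}_{81,nr}-\overline{rprice}_{81,fr})-(\overline{rprice}_{78,nr}-\overline{rprice}_{78,fr}) \] where \(nr\) and \(fr\) denote houses near and far from the incinerator site. Because the slope on nearinc in each single-year regression equals that year's difference in mean prices, this is:

\(\hat{\delta}_1 = \hat{\beta}_{post} - \hat{\beta}_{pre} = -30.688 - (-18.824) = -11.864\) thousand dollars, that is, a relative drop of about 11,864 dollars.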
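As a quick arithmetic check, the four group means implied by the two regressions above can be combined directly. This is only a sketch: the numbers are copied from the printed coefficients (in thousands of dollars), and the object names `means` and `did` are introduced here for illustration.

```r
# Mean rprice by year and location, in thousands of dollars, read off the
# regressions above: intercept = mean for far houses,
# intercept + nearinc coefficient = mean for near houses.
means <- c(
  far78  = 82.517,
  near78 = 82.517 - 18.824,
  far81  = 101.308,
  near81 = 101.308 - 30.688
)

# Difference over time of the near-minus-far difference
did <- unname((means["near81"] - means["far81"]) -
              (means["near78"] - means["far78"]))
print(round(did, 3))
```

The result agrees with the difference between the two nearinc slopes, since each slope is itself a near-minus-far difference in means.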

1.3 Using regression

\[ rprice_i = \beta_0 + \delta_0 y81_i + \beta_1 nearinc_i + \delta_1 y81_i \cdot nearinc_i + u_i \]

DiD <- lm(rprice ~ y81 + nearinc + y81*nearinc, data=data)
summary(DiD)


Call:
lm(formula = rprice ~ y81 + nearinc + y81 * nearinc, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-60.678 -17.693  -3.031  12.483 236.307 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   82.517      2.727  30.260  < 2e-16 ***
y81           18.790      4.050   4.640 5.12e-06 ***
nearinc      -18.824      4.875  -3.861 0.000137 ***
y81:nearinc  -11.864      7.457  -1.591 0.112595    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.24 on 317 degrees of freedom
Multiple R-squared:  0.1739,    Adjusted R-squared:  0.1661 
F-statistic: 22.25 on 3 and 317 DF,  p-value: 4.224e-13

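The estimator of interest is the coefficient on the interaction term: the difference over time in the average difference of housing prices between the two locations, known as the difference-in-differences estimator. The estimate, -11.864, matches the value obtained from the two separate regressions, but with a t statistic of -1.59 (p = 0.113) it is not statistically different from zero at the 10% level.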
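The link between the interaction coefficient and the four group means can be illustrated on simulated data. The sketch below does not use the housing data: the data frame `sim`, its sample size, and all parameter values are hypothetical and chosen only for illustration.

```r
set.seed(1)
n <- 400  # hypothetical sample size

# Simulated pooled cross section with a year dummy and a location dummy
sim <- data.frame(
  y81     = rbinom(n, 1, 0.5),
  nearinc = rbinom(n, 1, 0.3)
)
sim$rprice <- 80 + 19 * sim$y81 - 19 * sim$nearinc -
  12 * sim$y81 * sim$nearinc + rnorm(n, sd = 30)

# Difference-in-differences via the interaction coefficient
fit <- lm(rprice ~ y81 * nearinc, data = sim)
did_reg <- coef(fit)[["y81:nearinc"]]

# The same quantity from the four cell means
cell <- tapply(sim$rprice, list(y81 = sim$y81, nearinc = sim$nearinc), mean)
did_means <- (cell["1", "1"] - cell["1", "0"]) - (cell["0", "1"] - cell["0", "0"])

print(all.equal(did_reg, unname(did_means)))
```

Because the model with both dummies and their interaction is saturated, the two numbers coincide up to floating-point error. The regression form is still the more convenient one, since it delivers a standard error and allows controls to be added.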

1.4 Adding controls

DiD2 <- lm(rprice ~ y81 + nearinc + y81 * nearinc + age + agesq + intst + land + area + rooms + baths, data=data)

summary(DiD2)


Call:
lm(formula = rprice ~ y81 + nearinc + y81 * nearinc + age + agesq + 
    intst + land + area + rooms + baths, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-76.721  -8.885  -0.252   8.433 136.649 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.381e+01  1.117e+01   1.237  0.21720    
y81          1.393e+01  2.799e+00   4.977 1.07e-06 ***
nearinc      3.780e+00  4.453e+00   0.849  0.39661    
age         -7.395e-01  1.311e-01  -5.639 3.85e-08 ***
agesq        3.453e-03  8.128e-04   4.248 2.86e-05 ***
intst       -5.386e-04  1.963e-04  -2.743  0.00643 ** 
land         1.414e-04  3.108e-05   4.551 7.69e-06 ***
area         1.809e-02  2.306e-03   7.843 7.16e-14 ***
rooms        3.304e+00  1.661e+00   1.989  0.04758 *  
baths        6.977e+00  2.581e+00   2.703  0.00725 ** 
y81:nearinc -1.418e+01  4.987e+00  -2.843  0.00477 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.62 on 310 degrees of freedom
Multiple R-squared:   0.66, Adjusted R-squared:  0.6491 
F-statistic: 60.19 on 10 and 310 DF,  p-value: < 2.2e-16

1.5 In logs

DiD3 <- lm(lrprice ~ y81 + nearinc + y81 * nearinc + age + agesq + lintst + lland + larea + rooms + baths, data=data)

summary(DiD3)


Call:
lm(formula = lrprice ~ y81 + nearinc + y81 * nearinc + age + 
    agesq + lintst + lland + larea + rooms + baths, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.18441 -0.09947  0.01478  0.10985  0.74872 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.652e+00  4.159e-01  18.399  < 2e-16 ***
y81          1.621e-01  2.850e-02   5.687 2.99e-08 ***
nearinc      3.223e-02  4.749e-02   0.679 0.497805    
age         -8.359e-03  1.411e-03  -5.924 8.36e-09 ***
agesq        3.764e-05  8.668e-06   4.342 1.92e-05 ***
lintst      -6.145e-02  3.151e-02  -1.950 0.052045 .  
lland        9.985e-02  2.449e-02   4.077 5.81e-05 ***
larea        3.508e-01  5.149e-02   6.813 4.98e-11 ***
rooms        4.733e-02  1.733e-02   2.732 0.006661 ** 
baths        9.428e-02  2.773e-02   3.400 0.000761 ***
y81:nearinc -1.315e-01  5.197e-02  -2.531 0.011885 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2038 on 310 degrees of freedom
Multiple R-squared:  0.7326,    Adjusted R-squared:  0.7239 
F-statistic: 84.92 on 10 and 310 DF,  p-value: < 2.2e-16

Now \(\hat{\delta}_1 = -0.1315\), so houses near the incinerator lost approximately 13.15% of their value due to the incinerator.
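Reading a coefficient in a log-level model as a percentage change is an approximation. A minimal sketch of the exact calculation, using the interaction coefficient printed above (the object names are introduced here for illustration):

```r
delta_hat <- -0.1315  # coefficient on y81:nearinc from the log model above

approx_pct <- 100 * delta_hat              # usual approximation
exact_pct  <- 100 * (exp(delta_hat) - 1)   # exact percentage change

print(round(c(approximate = approx_pct, exact = exact_pct), 2))
```

The exact figure is somewhat smaller in magnitude than the approximation, and the gap widens as the coefficient moves away from zero.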

2 Panel with two periods

Example

Use data on unemployment rates in 46 U.S. cities to explain crime rates. The data cover two years, 1982 and 1987.

2.1 Pooled OLS

Estimate a simple pooled regression model:

\[ crmrte_{it} = \beta_0 + \delta_0 d87_t + \beta_1 unem_{it} + v_{it} \]

m1 <- lm(crmrte ~ d87 + unem, data=data)
summary(m1)


Call:
lm(formula = crmrte ~ d87 + unem, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-53.474 -21.794  -6.266  18.297  75.113 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  93.4202    12.7395   7.333 9.92e-11 ***
d87           7.9404     7.9753   0.996    0.322    
unem          0.4265     1.1883   0.359    0.720    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 29.99 on 89 degrees of freedom
Multiple R-squared:  0.01221,   Adjusted R-squared:  -0.009986 
F-statistic: 0.5501 on 2 and 89 DF,  p-value: 0.5788

The coefficient on unem is positive but not statistically significant at standard significance levels (p = 0.72), so the pooled regression shows no evidence of a link between crime and unemployment rates. The overall F test (p = 0.58) also fails to reject the null hypothesis that the coefficients on d87 and unem are jointly zero.

This regression likely suffers from omitted variable bias. One response is to control for more factors, such as the number of police officers, income, and law enforcement effort, in a multiple regression analysis. For example,

m2 <- lm(crmrte ~ d87 + unem + officers + pcinc, data=data)
summary(m2)


Call:
lm(formula = crmrte ~ d87 + unem + officers + pcinc, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.563 -21.613  -7.428  15.754  77.690 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 117.350688  25.797894   4.549 1.74e-05 ***
d87          12.492543  10.239359   1.220    0.226    
unem         -0.499700   1.363726  -0.366    0.715    
officers      0.003955   0.004363   0.906    0.367    
pcinc        -0.002520   0.002406  -1.047    0.298    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30 on 87 degrees of freedom
Multiple R-squared:  0.03356,   Adjusted R-squared:  -0.01088 
F-statistic: 0.7552 on 4 and 87 DF,  p-value: 0.5573

This is not much better: none of the slope coefficients is statistically significant, and many unobserved factors, such as law enforcement effort, are difficult to control for.

A solution is to explicitly capture the unobserved time-constant factors that may affect crime rates. That is,

\[ crmrte_{it} = \beta_0 + \delta_0 d87_t + \beta_1 unem_{it} + \alpha_i + u_{it} \]

2.2 First-differenced estimator

Because \(\alpha_i\) is constant over time, we can difference the data across the two years to get an equation that can be estimated by OLS:

\[ \Delta crmrte_{i} = \delta_0 + \beta_1 \Delta unem_{i} + \Delta u_{i} \]

m3 <- lm(ccrmrte ~ cunem, data=data)
summary(m3)


Call:
lm(formula = ccrmrte ~ cunem, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.912 -13.369  -5.507  12.446  52.915 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  15.4022     4.7021   3.276  0.00206 **
cunem         2.2180     0.8779   2.527  0.01519 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.05 on 44 degrees of freedom
  (46 observations deleted due to missingness)
Multiple R-squared:  0.1267,    Adjusted R-squared:  0.1069 
F-statistic: 6.384 on 1 and 44 DF,  p-value: 0.01519
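Why differencing helps can be illustrated with simulated data. The sketch below does not use the crime data: the sample size, the parameter values, and the variables `alpha`, `unem82`, `unem87`, `crm82`, and `crm87` are all hypothetical. The true \(\beta_1\) is set to 2, and the unobserved city effect is built to be correlated with unemployment.

```r
set.seed(42)
n <- 500                                   # hypothetical number of cities

alpha  <- rnorm(n, sd = 10)                # unobserved, time-constant city effect
unem82 <- 8 - 0.3 * alpha + rnorm(n)       # unemployment is correlated with alpha
unem87 <- unem82 - 2 + rnorm(n, sd = 2)
crm82  <- 90 + 2 * unem82 + alpha + rnorm(n, sd = 5)        # true beta_1 = 2
crm87  <- 90 + 15 + 2 * unem87 + alpha + rnorm(n, sd = 5)

# Pooled OLS ignores alpha, so the unem coefficient absorbs part of it
pooled <- data.frame(
  crmrte = c(crm82, crm87),
  unem   = c(unem82, unem87),
  d87    = rep(c(0, 1), each = n)
)
b_pooled <- coef(lm(crmrte ~ d87 + unem, data = pooled))[["unem"]]

# First differencing removes alpha before estimation
fd <- data.frame(ccrmrte = crm87 - crm82, cunem = unem87 - unem82)
b_fd <- coef(lm(ccrmrte ~ cunem, data = fd))[["cunem"]]

print(round(c(pooled = b_pooled, first_diff = b_fd), 2))
```

In this setup the pooled estimate is pulled well away from 2 because unem is correlated with the omitted city effect, while the first-differenced estimate should land close to the true value.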

3 Policy analysis with two-period panel data

A random sample of observations (individuals, firms, cities, etc.) is obtained in the first time period.
In the second period, some of these units, the treatment group, take part in a particular program; those that do not form the control group.
The same cross-sectional units appear in both time periods.

The model is:

\[ y_{it} = \beta_0 +\delta_0 d2_t +\beta_1 program_{it}+ \alpha_i +u_{it} \]

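If program participation occurred only in the second period, the OLS estimator of \(\beta_1\) in the differenced equation is the difference between the average change in \(y\) for the treatment group (\(T\)) and for the control group (\(C\)):

\[ \hat{\beta}_1 = \overline{\Delta y}_T - \overline{\Delta y}_C \]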
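A short derivation shows why the OLS slope reduces to a difference in means. It is sketched under the assumption that \(\Delta program_i\) equals 1 for the \(n_T\) treated units and 0 for the \(n_C\) control units. The symbols \(n = n_T + n_C\) and \(p = n_T/n\) are introduced only for this sketch, so that \(p\) is the sample mean of \(\Delta program_i\), \(n_T = np\), and \(n_C = n(1-p)\):

\[ \hat{\beta}_1 = \frac{\sum_{i}(\Delta program_i - p)\,\Delta y_i}{\sum_{i}(\Delta program_i - p)^2} = \frac{n_T(1-p)\,\overline{\Delta y}_T - n_C\,p\,\overline{\Delta y}_C}{n_T(1-p)^2 + n_C\,p^2} = \frac{n\,p(1-p)\,(\overline{\Delta y}_T - \overline{\Delta y}_C)}{n\,p(1-p)} = \overline{\Delta y}_T - \overline{\Delta y}_C \]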

Example

The effect of a job training program on worker productivity in manufacturing firms in Michigan.

The outcome of interest is the scrap rate, \(scrap\), the percentage of items scrapped due to defects. The binary variable \(grant_{it}\) takes the value 1 if the firm received a job training grant in year \(t\). We use data for 1987 and 1988.

Load the data:

# load() restores the saved objects (assumed here to include `data`) and returns only their names
load("jtrain.RData")

Pooled OLS will produce biased and inconsistent estimators because time-invariant factors may have influenced a firm's participation in the program. For example, grants could have been given to firms with more productive workers, or to employers with better managerial skills.

The model of interest is:

\[ scrap_{it} = \beta_0 + \delta_0 y88_t + \beta_1 grant_{it} + \alpha_i + u_{it} \]

Taking differences across the two years to remove \(\alpha_i\), the model of interest becomes:

\[ \Delta scrap_{i} = \delta_0 + \beta_1 \Delta grant_{i} + \Delta u_{i} \]
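Differenced variables such as clscrap and cgrant can be built from a long-format panel. The source does not show how they were constructed, so the following is only a sketch in base R on a toy panel: the data frame `panel`, the firm labels, the column `lscrap` (log scrap rate), and all values are hypothetical.

```r
# Toy two-period panel in long format (all values are made up)
panel <- data.frame(
  firm   = rep(c("A", "B", "C"), each = 2),
  year   = rep(c(1987, 1988), times = 3),
  lscrap = c(1.0, 0.6, 0.5, 0.45, 2.0, 1.5),
  grant  = c(0, 1, 0, 0, 0, 1)
)

# Sort so that each firm's rows are in time order
panel <- panel[order(panel$firm, panel$year), ]

# First difference within firm; the first year has no predecessor, so it is NA
first_diff <- function(x) c(NA, diff(x))
panel$clscrap <- ave(panel$lscrap, panel$firm, FUN = first_diff)
panel$cgrant  <- ave(panel$grant,  panel$firm, FUN = first_diff)

print(panel)
```

Each firm's first-year row carries NA in the differenced columns, which is one reason lm() reports observations deleted due to missingness when it is run on such data.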

data <- data %>% filter(year %in% c('1987','1988'))

dm1 <- lm(clscrap ~ cgrant, data=data)
summary(dm1)


Call:
lm(formula = clscrap ~ cgrant, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1277 -0.2372  0.0237  0.2142  2.4553 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -0.05744    0.09721  -0.591   0.5572  
cgrant      -0.31706    0.16388  -1.935   0.0585 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5751 on 52 degrees of freedom
  (260 observations deleted due to missingness)
Multiple R-squared:  0.06715,   Adjusted R-squared:  0.04921 
F-statistic: 3.743 on 1 and 52 DF,  p-value: 0.05847

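Here we used the log of the scrap rate as the dependent variable, so the coefficient can be read as a proportional change. The value \(\hat{\beta}_1 = -0.317\) means that the job training grant lowered the scrap rate by approximately 31.7%; the exact percentage change is \(100 \cdot [\exp(-0.317) - 1] \approx -27.2\%\). The estimate is statistically significant at the 10% level but not at the 5% level (p = 0.0585).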