Fixed Effects Panel Data

Author

Teddy Kelly

1. Load in the Panel Data

I have decided to use the LaborSupply dataset from the plm package to study how to perform fixed-effects regression in R. This is a balanced panel dataset that contains wage data on 532 different individuals (entities) over the 10-year period from 1979 to 1988 (time component), resulting in 5,320 total observations.

rm(list=ls())
library(plm)
data(LaborSupply)
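Before modeling, it can help to confirm the panel structure described above. The following is a minimal sketch, assuming the plm package is installed; pdim() and is.pbalanced() are plm helpers, while the object names (obs_per_id, panel_dims, is_balanced) are only illustrative.

```r
library(plm)
data(LaborSupply)

# Rows per individual: a balanced panel has the same count for every id
obs_per_id    <- table(LaborSupply$id)
n_individuals <- length(obs_per_id)
n_years       <- length(unique(LaborSupply$year))

# plm's own view of the panel dimensions and balance
panel_dims <- pdim(LaborSupply, index = c("id", "year"))
print(panel_dims)
is_balanced <- is.pbalanced(LaborSupply, index = c("id", "year"))
```

If the panel is balanced, n_individuals times n_years should equal nrow(LaborSupply).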

Below are summary statistics for the data, followed by a description of each variable.

Summary Statistics

library(stargazer)
labels = c(
  "Log of annual hours worked",
  "Log of hourly wage",
  "Number of kids",
  "Age in years",
  "Has a disability",
  "ID of individual",
  "Year"
)
stargazer(LaborSupply, type='text', omit.summary.stat = "N", covariate.labels = labels, 
          title = "Summary Statistics for Labor Wages", digits = 2)

Summary Statistics for Labor Wages
========================================================
Statistic                    Mean   St. Dev.  Min   Max 
--------------------------------------------------------
Log of annual hours worked   7.66     0.29   2.77  8.56 
Log of hourly wage           2.61     0.43   -0.26 4.69 
Number of kids               1.56     1.20     0     6  
Age in years                38.92     8.45    22    60  
Has a disability             0.06     0.24     0     1  
ID of individual            266.50   153.59    1    532 
Year                       1,983.50   2.87   1,979 1,988
--------------------------------------------------------

Variable Descriptions

  • lnhr - Numeric. Log of annual hours worked.

  • lnwg - Numeric. Log of hourly wage.

  • kids - Numeric. Number of children the individual has.

  • age - Numeric. Age of the individual in years.

  • disab - Numeric (0/1). Indicates whether the individual has a disability.

  • id - Numeric. Identifier of each individual (1 to 532).

  • year - Numeric. Year of the observation, from 1979 to 1988.

Note that the variables that change from year to year are mainly lnhr, lnwg, and age. The remaining characteristics, kids and disab, are largely constant over time for a given individual. Not controlling for such time-invariant differences between individuals may result in biased coefficient estimates.
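How constant each variable really is within an individual can be checked directly. This is a rough sketch, assuming the plm package is installed; varies_within() and share_changing are hypothetical names introduced here for illustration.

```r
library(plm)
data(LaborSupply)

# TRUE for an individual if the variable takes more than one value over the years
varies_within <- function(x, id) {
  tapply(x, id, function(v) length(unique(v)) > 1)
}

# Share of individuals for whom each variable changes at least once
share_changing <- sapply(
  LaborSupply[c("lnhr", "lnwg", "age", "kids", "disab")],
  function(col) mean(varies_within(col, LaborSupply$id))
)
print(round(share_changing, 3))
```

A share close to 0 means the variable is nearly time-invariant and will be almost entirely absorbed by individual fixed effects.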

2. OLS Regression without Dummies

Estimating equation of the OLS regression without using fixed effects:

\[ lnwg_{it} = \beta_0 + \beta_1lnhr_{it} + \beta_2age_{it} + \beta_3kids_{it} + u_{it} \]

Below are the regression results using the estimating equation from above.

# Normal regression without fixed effects
reg1 <- lm(data = LaborSupply,
           formula = lnwg ~ lnhr + age + kids)
summary(reg1)

Call:
lm(formula = lnwg ~ lnhr + age + kids, data = LaborSupply)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.59426 -0.24444  0.03385  0.24760  2.22926 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.8726628  0.1571077   5.555 2.92e-08 ***
lnhr         0.1905607  0.0200695   9.495  < 2e-16 ***
age          0.0073860  0.0007198  10.262  < 2e-16 ***
kids        -0.0063526  0.0050879  -1.249    0.212    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4177 on 5316 degrees of freedom
Multiple R-squared:  0.03876,   Adjusted R-squared:  0.03822 
F-statistic: 71.45 on 3 and 5316 DF,  p-value: < 2.2e-16

2.1 Interpretations:

  • Direction (+)

    • The coefficient estimates for both lnhr and age are positive which makes sense because it’s expected that people who are older and work more hours will have higher wages.
  • Magnitude

    • lnhr - For every 1% increase in the number of hours someone works, their hourly wage is expected to increase by about 0.19%, holding all else constant.

    • age - For every additional year of age, hourly wage is expected to increase by about 0.74% (100 × 0.0074, because the dependent variable is in logs), holding all else constant.

    • It’s most likely that these coefficient estimates are biased because of the unobserved heterogeneity from the variables that vary from person to person but do not change over time. However, it’s difficult to determine if these coefficients are overestimates or underestimates. We will find out after performing the fixed-effects regressions.

  • Statistical Significance

    • The estimates for lnhr and age are both statistically significant at the \(\alpha = 0.001\) level, which is strong evidence that the true coefficients differ from zero. The kids coefficient is not statistically significant (p = 0.212).
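The magnitude interpretations can be checked with a little arithmetic. The sketch below copies the two coefficients from the reg1 output and converts them into exact percentage changes; the variable names are illustrative only.

```r
# Coefficients copied from the reg1 output above
b_lnhr <- 0.1905607   # log-log: an elasticity
b_age  <- 0.0073860   # log-level: a semi-elasticity

# Exact % change in hourly wage for a 1% increase in annual hours
pct_wage_per_1pct_hours <- 100 * (1.01^b_lnhr - 1)

# Exact % change in hourly wage for one additional year of age
pct_wage_per_year <- 100 * (exp(b_age) - 1)

round(c(hours = pct_wage_per_1pct_hours, age = pct_wage_per_year), 3)
```

This gives roughly 0.19% for hours and 0.74% for age, which is why a log-level coefficient is multiplied by 100 before being read as a percentage.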

2.2 Omitted Variables

There is a possible risk of bias from the omitted variables that differ across individuals in the dataset but do not vary across time.

cor1 <- cor.test(LaborSupply$lnhr, LaborSupply$disab)
cor2 <- cor.test(LaborSupply$lnwg, LaborSupply$disab)

cor3 <- cor.test(LaborSupply$lnhr, LaborSupply$kids)
cor4 <- cor.test(LaborSupply$lnwg, LaborSupply$kids)
# Correlation between lnhr and disab
cor1$estimate
        cor 
-0.09360225 
# correlation between lnwg and disab
cor2$estimate
        cor 
-0.08268162 
# correlation between lnhr and kids
cor3$estimate
       cor 
0.03838767 
#correlation between lnwg and kids
cor4$estimate
        cor 
-0.06225285 
  • kids - The number of children each person has is generally the same over time. Some individuals did have additional children during the sample period, but this was not the norm. Note that reg1 above already includes kids, so the reasoning below describes what leaving it out would do.

    • In theory, people tend to work fewer hours when they have children, and higher fertility has been linked with lower wages. Together these would suggest a positive bias in the lnhr coefficient when kids is left out of the regression.

    • However, the correlations above show that the number of kids is positively correlated with working hours (0.038), which is surprising. Since kids is negatively correlated with wages (-0.062), this would instead imply a negative bias from leaving kids out of the regression.

  • disab - Whether or not a person has a disability usually stays the same over time unless someone sustains a life-changing injury during the time period. This was rarely the case in the data.

    • The correlation tests above show that disab is negatively correlated with both working hours and wages. Therefore, we would expect that omitting this variable results in positive bias.

    We will now use fixed effects to control for these time-invariant variables and evaluate the effect on the coefficient estimates.
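The sign argument for omitted-variable bias can be illustrated on simulated data. This is only a sketch: the data are made up, with z standing in for a trait like disab that lowers both the regressor x and the outcome y, and all names and parameter values are assumptions chosen for illustration.

```r
set.seed(42)
n <- 5000

# z plays the role of an omitted trait: it lowers both x and y
z <- rbinom(n, size = 1, prob = 0.1)
x <- 1 - 0.5 * z + rnorm(n)
y <- 2 + 0.10 * x - 0.30 * z + rnorm(n, sd = 0.5)

long_reg  <- lm(y ~ x + z)   # controls for z
short_reg <- lm(y ~ x)       # omits z

b_long  <- unname(coef(long_reg)["x"])
b_short <- unname(coef(short_reg)["x"])

# Omitted-variable bias identity:
# short slope = long slope + (effect of z on y) * (slope from regressing z on x)
delta     <- unname(coef(lm(z ~ x))["x"])
b_implied <- b_long + unname(coef(long_reg)["z"]) * delta

c(long = b_long, short = b_short, implied = b_implied)
```

Because z is negatively related to both x and y, the product of the two negative terms is positive, so the regression that omits z overstates the slope on x.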

3. Regression Using Fixed-Effects

3.1 OLS Regression with Dummy Variables for ID

Estimating Equation:

\[ lnwg_{it}=\beta_0+\beta_1lnhr_{it}+\beta_2age_{it}+\delta_1id1_{i}+ ...+\delta_{n-1}id(n-1)_i+u_{it} \]

Now, I will run a regression with a dummy variable for each id to control for the time-invariant characteristics of each individual. These characteristics include the number of children each person has and whether or not someone has a disability.

reg2 <- lm(data = LaborSupply,
           formula = lnwg ~ lnhr + age + factor(id))

3.2 Demeaned Regression

Estimating Equation:

\[ \tilde{lnwg}_{it}=\beta_1\tilde{lnhr}_{it}+\beta_2\tilde{age}_{it}+\tilde{u}_{it} \]

Next, we estimate the same model as a demeaned regression, subtracting each individual’s mean from every variable before running OLS without an intercept.

# Demeaned regression
demeaned_labor <- with(data = LaborSupply,
                       expr = data.frame(lnwg = lnwg - ave(x = lnwg, factor(id)),
                                         lnhr = lnhr - ave(x = lnhr, factor(id)),
                                         age = age - ave(x = age, factor(id))
                                         )
                       )

demeaned_reg <- lm(data = demeaned_labor,
                   formula = lnwg ~ lnhr + age - 1)
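Why the dummy-variable and demeaned regressions should agree is easiest to see on a small example. The sketch below uses simulated data (30 hypothetical individuals over 5 periods, with made-up parameter values) and base R only; it is an illustration of the equivalence, not a re-analysis of LaborSupply.

```r
set.seed(1)
n_id <- 30
n_t  <- 5
toy <- data.frame(id = rep(seq_len(n_id), each = n_t))

alpha <- rnorm(n_id)[toy$id]            # individual effect, fixed over time
toy$x <- alpha + rnorm(nrow(toy))       # regressor correlated with the effect
toy$y <- 1 + 0.5 * toy$x + alpha + rnorm(nrow(toy), sd = 0.3)

# (a) Dummy-variable regression: one dummy per individual
lsdv <- lm(y ~ x + factor(id), data = toy)

# (b) Subtract each individual's mean, then run OLS without an intercept
toy$x_dm <- toy$x - ave(toy$x, toy$id)
toy$y_dm <- toy$y - ave(toy$y, toy$id)
within_fit <- lm(y_dm ~ x_dm - 1, data = toy)

b_lsdv   <- unname(coef(lsdv)["x"])
b_within <- unname(coef(within_fit)["x_dm"])
all.equal(b_lsdv, b_within)

# R-squared differs: the dummies also explain variation between individuals
r2_lsdv   <- summary(lsdv)$r.squared
r2_within <- summary(within_fit)$r.squared
c(lsdv = r2_lsdv, within = r2_within)
```

The slope is the same in both fits, while the R-squared of the dummy-variable regression is at least as large because it also credits the dummies with the between-individual variation.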

3.3 Fixed-Effects Regression with the plm command

Estimating Equation:

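\[ lnwg_{it}=\alpha_i+\beta_1lnhr_{it}+\beta_2age_{it}+u_{it} \]

Here \(\alpha_i\) is the individual fixed effect, which absorbs time-invariant characteristics such as disab and kids.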

Fixed-effects regression:

library(plm)
reg_fe <- plm(data = LaborSupply,
              formula = lnwg ~ lnhr + age,
              index = c("id", "year"),
              model = "within")
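The within estimator does not report a constant, but the individual intercepts can still be recovered. The sketch below assumes the plm package is installed and refits the same model so that it runs on its own; fixef() is plm's extractor for the estimated fixed effects, and alpha_hat is an illustrative name.

```r
library(plm)
data(LaborSupply)

reg_fe <- plm(lnwg ~ lnhr + age,
              data  = LaborSupply,
              index = c("id", "year"),
              model = "within")

# One estimated intercept per individual (these replace the single constant)
alpha_hat <- fixef(reg_fe)
length(alpha_hat)
summary(as.numeric(alpha_hat))
```

The spread of these intercepts gives a sense of how much wages differ across individuals for reasons that do not change over time.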

Now, let’s observe the coefficient estimates of these three regressions side by side.

reg_labels <- c(
  "Log annual hours worked",
  "Age",
  "constant"
)
stargazer(reg2, demeaned_reg, reg_fe, type = "text", digits = 5, 
          title = "Comparison of the Regression", omit = "id", 
          notes = "Dropping the id dummy coefficients", 
          column.labels = c("OLS with ID dummy", "Demeaned Regression", "Fixed Effects"),
          covariate.labels = reg_labels, 
          dep.var.labels = "Log of hourly wage")

Comparison of the Regression
==========================================================================================================
                                                       Dependent variable:                                
                        ----------------------------------------------------------------------------------
                                                        Log of hourly wage                                
                                                  OLS                                     panel           
                                                                                          linear          
                             OLS with ID dummy           Demeaned Regression          Fixed Effects       
                                    (1)                         (2)                        (3)            
----------------------------------------------------------------------------------------------------------
Log annual hours worked          0.09645***                  0.09645***                 0.09645***        
                                 (0.01089)                   (0.01033)                  (0.01089)         
                                                                                                          
Age                              0.00164**                   0.00164**                  0.00164**         
                                 (0.00084)                   (0.00079)                  (0.00084)         
                                                                                                          
constant                         1.16085***                                                               
                                 (0.10336)                                                                
                                                                                                          
----------------------------------------------------------------------------------------------------------
Observations                       5,320                       5,320                      5,320           
R2                                0.84495                     0.01702                    0.01702          
Adjusted R2                       0.82768                     0.01665                    -0.09245         
Residual Std. Error         0.17679 (df = 4786)         0.16772 (df = 5318)                               
F Statistic             48.93285*** (df = 533; 4786) 46.03609*** (df = 2; 5318) 41.43075*** (df = 2; 4786)
==========================================================================================================
Note:                                                                          *p<0.1; **p<0.05; ***p<0.01
                                                                        Dropping the id dummy coefficients
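The demeaned regression in column (2) reports slightly smaller standard errors than the other two columns. A likely explanation is that lm() on demeaned data does not count the 532 estimated individual means, so it uses 5,318 residual degrees of freedom instead of 4,786. The sketch below applies the usual degrees-of-freedom rescaling to the lnhr standard error, using only numbers read from the table; the variable names are illustrative.

```r
# Values read from the comparison table above
n_obs    <- 5320
n_ids    <- 532
n_slopes <- 2
se_demeaned_lnhr <- 0.01033   # column (2)

df_naive   <- n_obs - n_slopes            # 5318, what lm() used on demeaned data
df_correct <- n_obs - n_ids - n_slopes    # 4786, as in columns (1) and (3)

# Rescale the standard error to the correct degrees of freedom
se_corrected_lnhr <- se_demeaned_lnhr * sqrt(df_naive / df_correct)
round(se_corrected_lnhr, 5)
```

The rescaled value is about 0.01089, matching the standard error reported in columns (1) and (3).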

3.4 Interpretations

  • Sign (+)

    • The coefficients are still positive, as expected: people tend to earn more as they get older, gain experience, and work more hours.
  • Magnitude

    • lnhr - The coefficients for all three regressions are exactly the same, with a value of 0.09645. This means that for every 1% increase in the number of hours worked, a person’s hourly wage is expected to increase by about 0.096%, all else equal.

    • Note that this is roughly half of the 0.19 estimate from the regression without fixed effects. Adding fixed effects therefore reduces the overestimate in the original regression, which did not control for time-invariant differences between individuals.

    • The positive bias in the original regression was likely coming from omitting disab since we determined it to have positive bias while kids appeared to have negative bias.

    • age - The coefficients for age are also exactly the same for the three fixed-effects regressions, with a value of 0.00164. This means that for every additional year of age, expected hourly wage increases by about 0.16% (100 × 0.00164), all else equal.

    • This value is also less than for the non-fixed effects regression, indicating that not controlling for the time-invariant variables resulted in overestimating the relationship between age and log hourly wage.

  • Significance

    • Both estimates are significant at the \(\alpha=0.05\) level across all three regressions, and lnhr is also significant at the 0.01 level, which is evidence that both coefficients differ from zero.
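To put the change in magnitudes side by side, the sketch below compares the pooled OLS coefficients with the fixed-effects ones. All numbers are copied from the two sets of output above; the data frame and column names are illustrative.

```r
# Coefficients copied from the reg1 output and the comparison table
coefs <- data.frame(
  term   = c("lnhr", "age"),
  pooled = c(0.1905607, 0.0073860),
  fixed  = c(0.09645, 0.00164)
)

# Fixed-effects estimate as a share of the pooled OLS estimate
coefs$ratio <- coefs$fixed / coefs$pooled

# Implied % change in hourly wage for one more year of age (log-level model)
pct_age <- 100 * (exp(coefs$fixed[coefs$term == "age"]) - 1)

print(coefs)
round(pct_age, 3)
```

The lnhr coefficient falls to about half of its pooled value and the age coefficient to under a quarter, and the fixed-effects age coefficient corresponds to roughly a 0.16% wage increase per year.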

Conclusion

  • We ran an initial linear regression without individual fixed effects, leaving time-invariant characteristics such as disab uncontrolled, which resulted in overestimating the regression coefficients.

  • We then controlled for these omitted variables by using three different fixed-effects regression methods:

    • OLS with dummy variables for each id

    • Demeaned Regression

    • Using the plm command to get the fixed-effects estimator directly

  • All of the coefficient estimates for the three regressions were identical. The standard errors and F-statistics were similar but not identical: the demeaned regression reports 5,318 residual degrees of freedom instead of 4,786 because lm does not count the 532 estimated individual means, so its standard errors are slightly too small.

  • The identical coefficient estimates illustrate that several equivalent techniques can be used to apply the fixed-effects method to panel datasets.

  • The OLS regression with the dummy variables had an adjusted R-squared of about 0.83, by far the highest of the three, because its R-squared also counts the variation in wages across individuals that the id dummies explain.