graphics.off()# Clear all graphs# Prepare needed librariespackages<-c("AER", # load dataset"plm", # for FE"stargazer", # summary stats for sharing"ggplot2", #for graphing,"psych","naniar", # for visualisation of missing data,"visdat", # for visualisation of missing data,"VIM", # for visualisation of missing data,"DataExplorer", # for visualisation of missing data,"dplyr","magrittr")for(iin1:length(packages)){if(!packages[i]%in%rownames(installed.packages())){install.packages(packages[i] , repos ="http://cran.rstudio.com/" , dependencies =TRUE)}library(packages[i], character.only =TRUE)}rm(packages)
Data
Create dependent variable.
Load data and generate fatality rate dependent variable.
## data from Stock and Watson (2007)data("Fatalities", package ="AER")Fatalities_original<-Fatalities## add fatality rate (number of traffic deaths per 10,000 people living in that state in that year)Fatalities$frate<-with(data =Fatalities, expr =fatal/pop*10000)
Summary Statistics
## ALL VARIABLES IN THE ORIGINAL DATAFRAMEstargazer(Fatalities, type ="text", # html, latex# out =# summary.stat = # covariate.labels = labels, digits =1)
dry - numeric. Percent residing in “dry” countries.
youngdrivers - numeric. Percent of drivers aged 15–24.
miles - numeric. Average miles per driver. Department of Transportation.
fatal - numeric. Number of vehicle fatalities. Department of Transportation Fatal Accident Reporting System.
nfatal - numeric. Number of night-time vehicle fatalities.
sfatal - numeric. Number of single vehicle fatalities.
fatal1517 - numeric. Number of vehicle fatalities, 15–17 year olds.
nfatal1517 - numeric. Number of night-time vehicle fatalities, 15–17 year olds.
fatal1820 - numeric. Number of vehicle fatalities, 18–20 year olds.
nfatal1820 - numeric. Number of night-time vehicle fatalities, 18–20 year olds.
fatal2124 - numeric. Number of vehicle fatalities, 21–24 year olds.
nfatal2124 - numeric. Number of night-time vehicle fatalities, 21–24 year olds.
afatal - numeric. Number of alcohol-involved vehicle fatalities.
pop - numeric. Population.
pop1517 - numeric. Population, 15–17 year olds.
pop1820 - numeric. Population, 18–20 year olds.
pop2124 - numeric. Population, 21–24 year olds.
milestot - numeric. Total vehicle miles (millions).
unempus - numeric. US unemployment rate. US Bureau of Labor Statistics.
emppopus - numeric. US employment/population ratio.
gsp - numeric. GSP rate of change.
OLS
We would expect more beer drinking linked with more fatal car crashes.
States mainly levy excise taxes on beer, wine, and spirits, although some states levy taxes on additional beverages (such as sparkling wine and cider).
Thus, we would expect more beer taxes (reduces beer sales) linked with less fatal car crashes.
Without FE
Unfortunately, we find that more state beer tax leads to more fatal DUI (driving under influence) related deaths, not less !!!
fatal_lm_mod<-lm(data =Fatalities, formula =frate~beertax , )summary(fatal_lm_mod)
Call:
lm(formula = frate ~ beertax, data = Fatalities)
Residuals:
Min 1Q Median 3Q Max
-1.09060 -0.37768 -0.09436 0.28548 2.27643
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.85331 0.04357 42.539 < 2e-16 ***
beertax 0.36461 0.06217 5.865 1.08e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5437 on 334 degrees of freedom
Multiple R-squared: 0.09336, Adjusted R-squared: 0.09065
F-statistic: 34.39 on 1 and 334 DF, p-value: 1.082e-08
If the tax on a case of beer increases by $1, then the number of vehicle fatalities increases 0.36 deaths per 10,000 people. This is statistically significant, seems a bit high and in the unexpected direction.
Omitted Variables
What could cause this result??? Omitted Variable Bias due to unobserved heterogeneity ???
Risk of a bias due to omitted factors that vary across states but not over time. Use State Fixed Effects. Start by thinking about factors that determine traffic fatality rate, and could be correlated with beer taxes
Density of cars on the road. Specifically, “high taxes” could reflect “high traffic density” (so the OLS coefficient would be biased positively – high taxes, more deaths). Panel data lets us eliminate omitted variable bias when the omitted variables are constant over time within a given state.
Quality (age) of automobiles / Vintage of autos on the road. Older vehicles may be less safe and more likely to cause traffic death. If the vintage of auto happens to be negatively correlated with beer taxes, then there will be a omitted variable bias.
Culture around drinking and driving. Potentially are correlated with the beer tax, so beer taxes could be picking up cultural differences (omitted variable bias). Specifically, “high taxes” could reflect “cultural attitudes towards drinking” (so the OLS coefficient would be biased). OVB table
Driving conditions in states/Quality of roads. Roads may be worse in some states than others, and thus correlated with more fatalities. If driving infrastructure is also (negatively) correlated to alcohol taxes, then OVB.
Weather conditions in states. Some states may have colder weather, causing more skids and thus deaths. If colder weather is also correlated with alcohol taxes, then OVB.
Other omitted variables that vary over time and thus cause a bias. Use Time Fixed Effects. These affect all states in a particular year !!!
Improvements in auto safety over time
Changing national attitudes towards drunk driving
Model: OLS with Dummy Variables i.e. With FE
We have 48 binary regressors here — one for each federal state.
\(Fatality \ Rate_{it} = \beta_0 + \beta_1 Beer \ Tax _{it} + \delta_2 State \ 2_i + \delta_3 State \ 3_i + ... + \delta_n State \ n_i + u_{it}\)
fatal_fe_lm_mod<-lm( data =Fatalities, formula =frate~beertax+state, )# summary(fatal_fe_lm_mod)stargazer(fatal_fe_lm_mod, type="text")
?stargazerstargazer(fatal_lm_mod, fatal_fe_lm_mod, type ='text', column.labels =c("OLS with no State Dummy","OLS with State Dummy"), omit =c("state"), notes =c("Dropping state coefficients"))
====================================================================
Dependent variable:
------------------------------------------------
frate
OLS with no State Dummy OLS with State Dummy
(1) (2)
--------------------------------------------------------------------
beertax 0.365*** -0.656***
(0.062) (0.188)
Constant 1.853*** 3.478***
(0.044) (0.313)
--------------------------------------------------------------------
Observations 336 336
R2 0.093 0.905
Adjusted R2 0.091 0.889
Residual Std. Error 0.544 (df = 334) 0.190 (df = 287)
F Statistic 34.394*** (df = 1; 334) 56.969*** (df = 48; 287)
====================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Dropping state coefficients
The function ave is convenient for computing group averages. We use it to obtain state specific averages of the fatality rate and the beer tax.
# obtain demeaned data?aveFatalities_demeaned<-with(data =Fatalities, expr =data.frame(frate =frate-ave(x =frate, state), # frate grouped by state beertax =beertax-ave(x =beertax, state)# beertax grouped by state))# estimate the regressionfatal_demeaned_lm_mod<-lm(data =Fatalities_demeaned, formula =frate~beertax-1 , # indicating that you are regressing frate on beertax without an intercept)summary(fatal_demeaned_lm_mod)
Call:
lm(formula = frate ~ beertax - 1, data = Fatalities_demeaned)
Residuals:
Min 1Q Median 3Q Max
-0.58696 -0.08284 -0.00127 0.07955 0.89780
Coefficients:
Estimate Std. Error t value Pr(>|t|)
beertax -0.6559 0.1739 -3.772 0.000191 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1757 on 335 degrees of freedom
Multiple R-squared: 0.04074, Adjusted R-squared: 0.03788
F-statistic: 14.23 on 1 and 335 DF, p-value: 0.0001913
Model: Fixed Effects
The ID variable for entities is named state and the time id variable is year.
Since the fixed effects estimator is also called the within estimator, we set model = "within". Do not set model="random" as that would run a random effects model.
# install and load the 'plm' package## install.packages("plm")library(plm)# estimate the fixed effects regression with plm()fatal_fe_mod<-plm(formula =frate~beertax, data =Fatalities, index =c("state", "year"), # declaring data to be panel model ="within"# fixed effects mdodel)summary(fatal_fe_mod)
Oneway (individual) effect Within Model
Call:
plm(formula = frate ~ beertax, data = Fatalities, model = "within",
index = c("state", "year"))
Balanced Panel: n = 48, T = 7, N = 336
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.5869619 -0.0828376 -0.0012701 0.0795454 0.8977960
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
beertax -0.65587 0.18785 -3.4915 0.000556 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 10.785
Residual Sum of Squares: 10.345
R-Squared: 0.040745
Adj. R-Squared: -0.11969
F-statistic: 12.1904 on 1 and 287 DF, p-value: 0.00055597
# print summary using robust standard errorscoeftest(fatal_fe_mod, vcov. =vcovHC, type ="HC1")
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
beertax -0.65587 0.28880 -2.271 0.02388 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note we had to declare our data to be panel in order to use plm command.
Declaring data as panel in R can have several benefits, including:
Panel data commands: When data is declared as panel, it allows the use of specialized panel data commands in R, such as the plm package, which can be used to estimate panel data models, such as fixed effects and random effects models.
Easy subsetting: Panel data can be easily subsetted by individual or time, which allows for easier analysis and visualization of data.
Easy aggregation: Panel data can also be easily aggregated by individual or time, which can be useful for summarizing data and creating summary statistics.
Panel-specific visualizations: Panel data can be used to create panel-specific visualizations, such as panel plots, which allow for easy comparison of variables across time and individuals.
Panel data modeling: Panel data models can provide more accurate estimates of coefficients, as they control for individual-level and time-specific heterogeneity that can bias estimates in traditional cross-sectional or time-series models.
stargazer(#fatal_lm_mod, # OLSfatal_fe_lm_mod, fatal_demeaned_lm_mod, fatal_fe_mod, # fixed effect type ="text", column.labels =c(#"OLS - no Dummy","OLS - State Dummy","Demeaned Reg","FE"), omit =c("state"))
You should use heteroskedasticity- and autocorrelation consistent (clustered) standard errors if you think \(u_{it}\) could be correlated over time
Two Way FE Implementation
You were doing one way FE only above (state).
Sometimes you want to do two way FE (state and time).
Here, it seems like time FE are not as important as state FE.
fatal_fe_lm_mod2<-lm(data =Fatalities, formula =frate~beertax+state+year, )# summary(fatal_fe_lm_mod2)# estimate the fixed effects regression with plm()fatal_fe_mod2<-plm(formula =frate~beertax+as.factor(year), data =Fatalities, index =c("state", "year"), # declaring data to be panel model ="within"# fixed effects mdodel)# summary(fatal_fe_mod2)stargazer(fatal_fe_lm_mod, # State - One Way OLS (FE)fatal_fe_lm_mod2, # State - Time (Two Way) OLSfatal_fe_mod2, # State - Time (Two Way) FE type ="text", column.labels =c("FE - One way (state)","FE - Two way (state-time) OLS","FE - Two way (state-time) PLM"), omit =c("state"))
Lets hide the state coefficients and the time coefficients. with keep or omit commands.
?stargazerstargazer(fatal_fe_lm_mod, # State - One Way OLS (FE)fatal_fe_lm_mod2, # State - Time (Two Way) OLSfatal_fe_mod2, # State - Time (Two Way) FE type ="text", column.labels =c("FE - One way (state)","FE - Two way (state-time) OLS","FE - Two way (state-time) PLM"), add.lines=list(c('Two Way Fixed effects', "No","Yes", "Yes")), keep =c("beertax"))
==========================================================================================================
Dependent variable:
------------------------------------------------------------------------------------
frate
OLS panel
linear
FE - One way (state) FE - Two way (state-time) OLS FE - Two way (state-time) PLM
(1) (2) (3)
----------------------------------------------------------------------------------------------------------
beertax -0.656*** -0.640*** -0.640***
(0.188) (0.197) (0.197)
----------------------------------------------------------------------------------------------------------
Two Way Fixed effects No Yes Yes
Observations 336 336 336
R2 0.905 0.909 0.080
Adjusted R2 0.889 0.891 -0.096
Residual Std. Error 0.190 (df = 287) 0.188 (df = 281)
F Statistic 56.969*** (df = 48; 287) 51.934*** (df = 54; 281) 3.503*** (df = 7; 281)
==========================================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Unbalanced Panel
An unbalanced data (compared to balanced data) will have much larger confidence interval (as SE increase due to smaller n).
# original modelfatal_fe_lm_mod<-lm(data =Fatalities, formula =frate~beertax+state, )# data with NAfatal_fe_lm_mod_NA<-lm(data =Fatalities_unbalanced, formula =frate~beertax+state, )# summary(fatal_fe_mod2)stargazer(fatal_fe_lm_mod, # State - One Way OLS (FE)fatal_fe_lm_mod_NA, # State - One Way OLS (FE) with NA data type ="text", column.labels =c("State - One Way OLS (FE)","State - One Way OLS (FE) with NA data"), keep =c("beertax"))
==================================================================================
Dependent variable:
--------------------------------------------------------------
frate
State - One Way OLS (FE) State - One Way OLS (FE) with NA data
(1) (2)
----------------------------------------------------------------------------------
beertax -0.656*** -1.157***
(0.188) (0.356)
----------------------------------------------------------------------------------
Observations 336 181
R2 0.905 0.919
Adjusted R2 0.889 0.890
Residual Std. Error 0.190 (df = 287) 0.190 (df = 132)
F Statistic 56.969*** (df = 48; 287) 31.360*** (df = 48; 132)
==================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
SE is indeed larger for the unbalanced data, as expected.
Random Effect
In Economics, we mostly have fixed effects (and not random effects or mixed effects {not to be confused with mixed models} ).
The rationale behind random effects model is that, unlike the fixed effects model, the variation across entities is assumed to be random and uncorrelated with the predictor or independent variables included in the model:
“…the crucial distinction between fixed and random effects is whether the unobserved individual effect embodies elements that are correlated with the regressors in the model, not whether these effects are stochastic or not” [Green, 2008, p.183]
If you have reason to believe that differences across entities have some influence on your dependent variable but are not correlated with the predictors then you should use random effects. An advantage of random effects is that you can include time invariant variables (i.e. gender). In the fixed effects model these variables are absorbed by the intercept.
Random effects assume that the entity’s error term is not correlated with the predictors which allows for timeinvariant variables to play a role as explanatory variables.
In random-effects you need to specify those individual characteristics that may or may not influence the predictor variables. The problem with this is that some variables may not be available therefore leading to omitted variable bias in the model.
RE allows to generalize the inferences beyond the sample used in the model.