Lecture 5

Administrative Miscellanea

  • (Slides to be reformatted)
  • Exam 1 Graded
  • Quiz 3 Wednesday
  • Homework 5 Friday
  • Problem Set 3 Friday, October 20th
    • Syllabus updated

Multivariate OLS

  • Let’s return to our bivariate regression world.
  • We run \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\).
  • We obtain unbiased estimates if our error term is uncorrelated with x, i.e. if there is no third variable that is correlated with both x and y

Multivariate OLS

  • Consider the effect of attendance on student grades.
  • Higher attendance is associated with higher grades, but we also know that better students tend to show up to class more often
  • Can we correct for this?

Multivariate OLS - formula plus motivation

  • For the previous question we can write \(grade_i = \beta_0 + \beta_1 attendance_i + \beta_2 baseline_i + \varepsilon_i\)
  • What is the interpretation of \(\beta_1\)?
    • The change in grade resulting from a 1 unit increase in attendance holding baseline ability constant
    • e.g. for two individuals with the same baseline ability, what is the effect of increasing attendance by 1
  • This now controls for the previous omitted variable

Multivariate OLS - Intuition

  • We can make a valid comparison by only comparing students with the same baseline test scores.
    • e.g. for all students who received a baseline score of 90%, we regress grade on attendance
  • This approach is really inefficient: instead we can just “net out” the effect of baseline ability. How to do this?

Multivariate OLS - FWL

  • Regress grade on baseline ability and keep the residuals: the part of grade not predicted by baseline
  • Then, regress attendance on baseline ability and keep those residuals: the part of attendance not predicted by baseline
  • Finally, regress the grade residuals on the attendance residuals.
  • The resulting slope is identical to \(\hat\beta_1\) in the multivariate regression \(y_i=\beta_0+\beta_1x_{1i}+\beta_2x_{2i}+\varepsilon_i\)
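  • The three steps can be sketched in R. The data below are hypothetical: 11 students generated from \(grade = 60 + 10\,attendance + 20\,ability\) with no noise, where attendance is more common among high-ability students. All column and object names are illustrative assumptions.

```r
# Hypothetical data: grade = 60 + 10*attendance + 20*ability, no noise.
# High-ability students attend more often, so attendance and ability are correlated.
dt <- data.frame(
  attendance = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  ability    = c(0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1)
)
dt$grade <- 60 + 10 * dt$attendance + 20 * dt$ability

# Steps 1 and 2: strip out the part of each variable predicted by ability
dt$residgrade      <- resid(lm(grade ~ ability, data = dt))
dt$residattendance <- resid(lm(attendance ~ ability, data = dt))

# Step 3: regress residuals on residuals
fwl <- coef(lm(residgrade ~ residattendance, data = dt))[["residattendance"]]

# Comparison: the multivariate slope and the naive bivariate slope
full  <- coef(lm(grade ~ attendance + ability, data = dt))[["attendance"]]
naive <- coef(lm(grade ~ attendance, data = dt))[["attendance"]]

print(c(fwl = fwl, full = full, naive = naive))
```

  • In this sketch `fwl` and `full` agree, while `naive` is larger because it also picks up the ability gap between attenders and non-attenders.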

Multivariate OLS: net effect


Call:
lm(formula = grade ~ baseline + attendance, data = dt)

Coefficients:
(Intercept)     baseline   attendance  
      41.37        15.30        37.08  

Example: Simpson’s paradox


Call:
lm(formula = y ~ x, data = dt)

Coefficients:
(Intercept)            x  
    -2.0199      -0.3231  

Call:
lm(formula = y ~ x + group, data = dt)

Coefficients:
(Intercept)            x       group2       group3       group4       group5  
         -2            1           -2           -4           -6           -8  
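
  • A minimal sketch of how such a sign flip can arise. The data are hypothetical, built so that the slope within every group is +1 while group averages move in the opposite direction; they reproduce the grouped coefficients shown above but not the exact pooled slope. Names are illustrative.

```r
# Hypothetical data: 5 groups, 3 observations each.
# Within a group, y rises one-for-one with x; across groups, y falls as x rises.
dt <- expand.grid(d = c(-0.5, 0, 0.5), g = 1:5)
dt$x <- dt$g + dt$d
dt$y <- -dt$g + dt$d          # equivalently y = -2 + x - 2 * (g - 1)
dt$group <- factor(dt$g)

# Pooled regression ignores the groups and mixes the two sources of variation
pooled <- coef(lm(y ~ x, data = dt))[["x"]]

# Controlling for group leaves only the within-group variation
within <- coef(lm(y ~ x + group, data = dt))

print(pooled)   # negative
print(within)   # slope on x is positive
```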

Multivariate OLS - Graph

Multivariate OLS - Diagram

Multivariate OLS Diagram: confirmation Bivariate


Call:
lm(formula = grade ~ attendance, data = dt)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.333  -4.000  -4.000   6.667  16.000 

Coefficients:
            Estimate Std. Error t value    Pr(>|t|)    
(Intercept)   64.000      4.355  14.697 0.000000135 ***
attendance    19.333      5.896   3.279     0.00955 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.737 on 9 degrees of freedom
Multiple R-squared:  0.5443,    Adjusted R-squared:  0.4937 
F-statistic: 10.75 on 1 and 9 DF,  p-value: 0.009545

Multivariate OLS Diagram: confirmation Multivariate


Call:
lm(formula = grade ~ attendance + ability, data = dt)

Residuals:
                   Min                     1Q                 Median 
-0.0000000000000260252 -0.0000000000000007185  0.0000000000000014370 
                    3Q                    Max 
 0.0000000000000023828  0.0000000000000144539 

Coefficients:
                         Estimate            Std. Error           t value
(Intercept) 59.999999999999992895  0.000000000000005098 11768699472570328
attendance  10.000000000000007105  0.000000000000007463  1339925378608921
ability     19.999999999999996447  0.000000000000007463  2679850757217842
                       Pr(>|t|)    
(Intercept) <0.0000000000000002 ***
attendance  <0.0000000000000002 ***
ability     <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0000000000000109 on 8 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 7.88e+30 on 2 and 8 DF,  p-value: < 0.00000000000000022

Multivariate OLS Diagram: confirmation FWL

   attendance ability grade residgrade residattendance
1:          0       0    60  -3.333333      -0.3333333
2:          0       1    80  -8.000000      -0.8000000
3:          1       0    70   6.666667       0.6666667
4:          1       1    90   2.000000       0.2000000

Multivariate OLS Diagram: Confirmation FWL


Call:
lm(formula = residgrade ~ residattendance, data = dt)

Residuals:
                  Min                    1Q                Median 
-0.000000000000022662 -0.000000000000001297  0.000000000000002301 
                   3Q                   Max 
 0.000000000000004978  0.000000000000006645 

Coefficients:
                              Estimate             Std. Error
(Intercept)     -0.0000000000000003013  0.0000000000000026033
residattendance 10.0000000000000035527  0.0000000000000059114
                             t value            Pr(>|t|)    
(Intercept)                   -0.116                0.91    
residattendance 1691660616295972.000 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.000000000000008634 on 9 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 2.862e+30 on 1 and 9 DF,  p-value: < 0.00000000000000022

What if we have multiple omitted variables

  • If we run \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon\) we could still have other variables that are correlated with both \(x_1\) and y, conditional on \(x_2\)
  • We can just add every omitted variable in our regression: \(y = \beta_0 + \beta_1x_1 + \beta_2 x_2 + ... + \beta_nx_n +\varepsilon\)
  • Are there issues with this?

What if we have multiple omitted variables

  • If we don’t observe a variable, we can’t control for it.
  • Example: ability. This is called selection on observables vs selection on unobservables
  • If we control for an irrelevant variable, does it bias \(\hat\beta_1\)? e.g. if we add control for the day of the month the student was born?
  • No, it just changes our standard errors

What if we have multiple omitted variables

  • Are there controls that can bias our estimate?
  • Yes. It’s complicated.
  • The short answer is that we don’t want to control for any intermediate pathways (e.g. if x causes z, and z causes y, we don’t control for z)
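  • A sketch of why, assuming a hypothetical linear chain in which x affects y only through z (the symbols \(\delta\), \(\gamma\), \(u_i\), and \(e_i\) are illustrative, not from the earlier slides):
\[ z_i = \delta x_i + u_i, \qquad y_i = \gamma z_i + e_i \;\implies\; y_i = \gamma\delta\, x_i + (\gamma u_i + e_i) \]
  • Regressing y on x alone recovers the total effect \(\gamma\delta\) (provided x is unrelated to \(u_i\) and \(e_i\)). Regressing y on both x and z holds z fixed, so the coefficient on x is 0 even though x does affect y.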

Different Controls

An Example: Twitter Opinion Predicts Corporate Earnings?

Worked Example: Alcohol and Mortality

  • You are measuring the effects of alcohol consumption on life expectancy. Your naive regression is \(life_i = \beta_0 + \beta_1 drinks_i + \varepsilon_i\). When you run this regression you obtain \(\hat\beta_1=1\), but you know there are omitted variables, like social networks, that can bias this result. To fix this, you now use a multivariate regression controlling for social status (measured as number of close friends), marital status, age, BMI, and self-reported health status
    1. Write the estimating equation for this multivariate OLS model

Worked Example: Alcohol and Mortality

    1. \(life_i=\beta_0+\beta_1 drinks_i + \beta_2 social_i + \beta_3 marital_i + \beta_4 age_i + \beta_5 BMI_i + \beta_6 health_i +\varepsilon_i\)

Worked Example: Alcohol and Mortality

  • \(life_i = \beta_0 + \beta_1 drinks_i + \varepsilon_i\), \(\hat\beta_1=1\). You now use a multivariate regression controlling for social status (measured as number of close friends), marital status, age, BMI, and self-reported health status
    1. You obtain a value of \(\hat\beta_1=-0.5\) in the multivariate model. Interpret \(\hat\beta_1\) in the context of the research question

Worked Example: Alcohol and Mortality

    1. For each additional alcoholic drink consumed per day, life expectancy is expected to decrease by half a year, holding our observables constant (number of friends, marital status, age, BMI, and health status)

Worked Example: Alcohol and Mortality

  • \(life_i = \beta_0 + \beta_1 drinks_i + \varepsilon_i\), \(\hat\beta_1=1\). You now use a multivariate regression controlling for social status (measured as number of close friends), marital status, age, BMI, and self-reported health status
    1. Which estimate of \(\hat\beta_1\) is a better estimate of the causal effect of alcohol consumption on life expectancy? Is \(\hat\beta_1=-0.5\) unbiased? Why or why not?

Worked Example: Alcohol and Mortality

    1. We expect our multivariate estimate of \(-0.5\) to be less biased since we control for several factors that we suspect are sources of endogeneity. Our estimate is still unlikely to be unbiased, though: if there are any unobservable factors that are correlated with both alcohol consumption and life expectancy, conditional on our control variables, then we will still obtain a biased estimate. Income may be an example in this case.

Worked Example: Alcohol and Mortality

  • \(life_i = \beta_0 + \beta_1 drinks_i + \varepsilon_i\), \(\hat\beta_1=1\). You now use a multivariate regression controlling for social status (measured as number of close friends), marital status, age, BMI, and self-reported health status
    1. Which regression has a higher value of \(R^2\)? Why?

Worked Example: Alcohol and Mortality

    1. The multivariate model will always have an \(R^2\) at least as high as the bivariate model. Adding a variable cannot make the fit any worse, since OLS can always set the new \(\beta\) coefficient to 0, which takes us back to the bivariate fit.
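  • One way to see this, writing SSR for the sum of squared residuals and SST for the total sum of squares of \(life_i\) (notation assumed here), with both models estimated on the same observations:
\[ R^2 = 1 - \frac{SSR}{SST}, \qquad SSR_{multi} = \min_{b_0,\dots,b_6}\sum_i \left(life_i - b_0 - b_1 drinks_i - \dots - b_6 health_i\right)^2 \;\le\; \min_{b_0,b_1}\sum_i \left(life_i - b_0 - b_1 drinks_i\right)^2 = SSR_{biv} \]
  • The inequality holds because the bivariate fit is the special case \(b_2=\dots=b_6=0\). SST is the same in both models, so \(R^2_{multi}\ge R^2_{biv}\).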

Multiple hypothesis testing

  • In a multivariate setting nothing changes when testing the hypothesis \(\beta_1=0\)
  • We can also do a joint test of whether every slope coefficient is equal to 0: \(H_0: \beta_1=\beta_2=...=\beta_n=0\)

Multiple hypothesis testing

  • The test statistic is called an F-statistic. The actual calculations and tables are burdensome, but R will calculate this for us
  • The p-value is also obtained from R. Once we have the p-value, the hypothesis test proceeds exactly as before
  • We obtain test results both for the individual \(\hat\beta\) estimates and for our full model
  • In econometrics we usually care about the former, while predictive analytics cares about the latter

Dummy variables

  • We now have almost everything we need before getting into experimental designs, but so far we have only dealt with numeric data
  • Recall that we can also have categorical (non-numeric) data, e.g. gender.
  • To study these we can convert them to a numeric value using a dummy variable.
  • These are the building blocks of how we handle any categorical data

Gender as a dummy variable

  • Suppose we want to know the effect of gender on earnings. For now set aside causality and run the bivariate OLS regression \(wage_i=\beta_0+\beta_1 gender_i + \varepsilon_i\)
  • What values should we put for gender?
  • We can encode (arbitrarily) male=0, female=1
  • What is a “1 unit increase” in gender?

Gender as a dummy variable

  • \(wage_i=\beta_0+\beta_1 gender_i + \varepsilon_i\)
  • Suppose we obtain \(\hat\beta_0=18\), \(\hat\beta_1=-3\). Interpret \(\hat\beta_0,\hat\beta_1\)
    • \(\hat\beta_0\) is the average value of wage when \(gender=0\). But \(gender=0\) means for a male, ie the average male makes $18/hour
    • \(\hat\beta_1=-3\) means that when \(gender=1\), \(wage=18-3*1=15\), ie the average wage for a female is $15/hour. We can also directly interpret this as women earn $3/hour less, on average
    • Note that we don’t use causal terms

Gender as a dummy variable

  • What if instead of encoding male as 0, we did male=1,female=0?
    • We now obtain \(\hat\beta_1=3\): males earn $3/hour more
    • Similarly, \(\hat\beta_0=15\) now. The model is identical
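  • The algebra behind the recoding, as a sketch: define \(male_i = 1 - gender_i\) (a variable name introduced here for illustration) and substitute \(gender_i = 1 - male_i\):
\[ wage_i = \beta_0 + \beta_1(1 - male_i) + \varepsilon_i = (\beta_0+\beta_1) - \beta_1\, male_i + \varepsilon_i \]
  • With \(\hat\beta_0=18\) and \(\hat\beta_1=-3\), the new intercept is \(18-3=15\) and the new slope is \(-(-3)=3\), matching the values above.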

Fitting Dummy Variables Graphically

Dummy variable calculation

  • \(wage_i = \beta_0 + \beta_1 gender_i + \varepsilon_i\)
  • Calculate \(\hat\beta_1\)
   gender college_degree wage   N
1:      0              0   10 100
2:      0              1   30 100
3:      1              0   10 100
4:      1              1   20 100

Dummy variable calculation

  • \(wage_i = \beta_0 + \beta_1 gender_i + \beta_2 degree_i + \varepsilon_i\)
  • Calculate \(\hat\beta_1\)
   gender college_degree wage   N
1:      0              0   10 100
2:      0              1   30 100
3:      1              0   10 100
4:      1              1   20 100
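
  • Both calculations can be checked in R by expanding the cell table into one row per person. This is a sketch: the object names are illustrative, and it assumes every person in a cell earns exactly the cell's wage.

```r
# The four cells from the table above, 100 people each
cells <- data.frame(
  gender         = c(0, 0, 1, 1),
  college_degree = c(0, 1, 0, 1),
  wage           = c(10, 30, 10, 20),
  N              = 100
)

# One row per person
dt <- cells[rep(seq_len(nrow(cells)), times = cells$N), ]

# Bivariate: the slope is mean(wage | gender = 1) - mean(wage | gender = 0) = 15 - 20
b_biv <- coef(lm(wage ~ gender, data = dt))

# Multivariate: gender and degree are uncorrelated here (equal cell sizes),
# so adding the degree control leaves the gender coefficient unchanged
b_multi <- coef(lm(wage ~ gender + college_degree, data = dt))

print(b_biv)
print(b_multi)
```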

A note on dummy variables

  • Note that when we fit a dummy variable, we are just comparing the average value of y for our two groups.
    • In some sense this makes them much easier than numeric variables

Dummies with controls

  • Suppose we run the same regression of wage on gender, but now we control for age, i.e. \(wage_i=\beta_0+\beta_1 gender_i + \beta_2 age_i + \varepsilon_i\)
  • What does \(\hat\beta_1\) represent?
    • Still the mean wage of females minus the mean wage of males, but now conditional on age
  • Under what conditions do we expect our age control to change \(\hat\beta_1\)?
    • The average age of men and women in the workforce needs to differ, and age needs to be related to wage
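  • This condition can be read off the omitted variable bias formula, a standard OLS identity. The symbols \(\tilde\beta_1\) and \(\hat\delta_1\) are notation introduced here: \(\tilde\beta_1\) is the slope from the bivariate regression of wage on gender, and \(\hat\delta_1\) is the slope from regressing age on gender.
\[ \tilde\beta_1 = \hat\beta_1 + \hat\beta_2\,\hat\delta_1 \]
  • Since gender is a dummy, \(\hat\delta_1\) is the difference in average age between the two groups. Adding the age control moves the gender coefficient only if \(\hat\delta_1\neq 0\) (average ages differ) and \(\hat\beta_2\neq 0\) (age is related to wage).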

Controls as dummies

  • Suppose instead we were interested in the effect of age on wage, but we’re now controlling for gender: \(wage_i=\beta_0+\beta_1 age_i + \beta_2 gender_i + \varepsilon_i\)
  • Nothing has changed from our multivariate OLS, but now we’re literally just subtracting out the average wage by gender before doing our regression
    • This is called a fixed effect: we’re removing all variation from gender. Before we were only taking out the linear component

Ordinal variables as dummies

  • Suppose instead of gender we have education: we have high school dropouts, high school graduates, and college graduates. How can we model this?
  • Use multiple dummies: 0/1 for HS dropout, 0/1 for HS graduate, and 0/1 for college graduate. Are there any issues with this?

Ordinal variables as dummies

  • Once we know the value of the first two, we know the value of the third: the three dummies are perfectly multicollinear with the intercept
  • We always drop 1 dummy variable. We can do this arbitrarily
  • Note that for gender we can have a dummy for male or a dummy for female. The only difference is interpretation.
  • The level left out is the reference level

Ordinal Variable Example

  • We wish to regress wage on education
  • Education = 0 if HS dropout, 1 if HS graduate, 2 if some college, 3 if college degree
  • How do we write this regression equation?

Ordinal Variable Example


Call:
lm(formula = wage ~ educ, data = dt)

Coefficients:
(Intercept)         educ  
      7.500        5.929  

Ordinal Variable Example


Call:
lm(formula = wage ~ label, data = dt)

Coefficients:
      (Intercept)     labelbachelors     labeldoctorate            labelHS  
               10                 15                 25                  2  
     labelmasters  labelprofessional  labelsome college  
               20                 40                  5  

Ordinal Variable Example


Call:
lm(formula = wage ~ label, data = dt)

Coefficients:
      (Intercept)           label<HS     labelbachelors     labeldoctorate  
               30                -20                 -5                  5  
          labelHS  labelprofessional  labelsome college  
              -18                 20                -15  
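
  • The two outputs above differ only in the reference level. A sketch of how that can be reproduced with `relevel()`: the data are hypothetical (two people per level, each earning the level's average wage implied by the first output), and the object names are assumptions.

```r
# Hypothetical data: average wage by education level, implied by the first output
levels_order <- c("<HS", "HS", "some college", "bachelors",
                  "masters", "professional", "doctorate")
mean_wage <- c(10, 12, 15, 25, 30, 50, 35)

dt <- data.frame(
  label = factor(rep(levels_order, each = 2), levels = levels_order),
  wage  = rep(mean_wage, each = 2)
)

# Reference level "<HS": the intercept is its mean, coefficients are gaps relative to it
m1 <- coef(lm(wage ~ label, data = dt))

# Same model with "masters" as the reference level
dt$label2 <- relevel(dt$label, ref = "masters")
m2 <- coef(lm(wage ~ label2, data = dt))

print(m1)
print(m2)
```

  • The fitted group means are the same in both versions. Only the baseline that each coefficient is measured against changes.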

Dummies as the dependent variable

  • We can have a dummy variable as our y variable instead.
  • We are interested in knowing the effect of education on employment and run \(employment_i=\beta_0+\beta_1 educ_i+\varepsilon_i\), where \(educ_i\) is the number of years of education obtained
  • We obtain \(\hat\beta_1=.05\). How do we interpret this?

Dummies as the dependent variable

  • Linear probability model: every additional year of education is associated with a 5 percentage point increase in the probability of being employed

Interaction variables

  • Suppose we have both gender and education (for simplicity: an indicator for having a bachelor’s degree) and are interested in wages.
  • We can run \(wage_i=\beta_0 + \beta_1 gender_i + \beta_2 education_i\)
  • What if we think the effect of education is different for males and females?

Interaction variables

  • We can interact these variables by multiplying them: \(wage_i=\beta_0+\beta_1 gender_i + \beta_2 education_i + \beta_3 gender_i*education_i+\varepsilon_i\)
  • We obtain \(\hat\beta_1=1, \hat\beta_2=2,\hat\beta_3=-1\). How do we interpret these?
  • You need to think through all combinations (drop the error term for now and interpret these as averages for notation simplicity):

Interaction variables

  • \(gender=0,education=0\implies wage=\hat\beta_0\)
    • \(\hat\beta_0\) is the average wage for uneducated women
  • \(gender=1, education=0\implies wage=\hat\beta_0 + \hat\beta_1\)
    • \(\hat\beta_0+\hat\beta_1\) is the average wage for uneducated men
    • \(\hat\beta_1\) is the wage differential for uneducated men vs uneducated women

Interaction variables

  • \(gender=0,education=1\implies wage=\hat\beta_0+\hat\beta_2\)
    • \(\hat\beta_0+\hat\beta_2\) is the average wage for educated women
    • \(\hat\beta_2\) is the wage differential for educated women vs uneducated women
  • \(gender=1,education=1\implies wage=\hat\beta_0+\hat\beta_1+\hat\beta_2+\hat\beta_3\)
    • \(\hat\beta_0+\hat\beta_1+\hat\beta_2+\hat\beta_3\) is the average wage for educated men
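    • \(\hat\beta_2+\hat\beta_3\) is the wage differential for educated men vs uneducated men
    • \(\hat\beta_3\) is how much the education differential for men differs from the education differential for women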

Interaction variables

  • In other words, we obtain two different effects: the effect of education on earnings for both men and women separately
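  • A sketch of the four cases in R, using the estimates \(\hat\beta_1=1,\hat\beta_2=2,\hat\beta_3=-1\) from above. The baseline wage of 10 for \(\hat\beta_0\) is a hypothetical value chosen for illustration, as are the object names.

```r
# One row per (gender, education) cell; wage is the cell average.
# The value 10 for the baseline cell is an assumption.
cells <- data.frame(
  gender    = c(0, 1, 0, 1),
  education = c(0, 0, 1, 1),
  wage      = c(10, 11, 12, 12)
)

# gender * education expands to gender + education + gender:education
fit <- coef(lm(wage ~ gender * education, data = cells))
print(fit)

# Education differential within each gender
educ_effect_gender0 <- fit[["education"]]
educ_effect_gender1 <- fit[["education"]] + fit[["gender:education"]]
print(c(gender0 = educ_effect_gender0, gender1 = educ_effect_gender1))
```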

Interaction variables: Some algebra

  • Before we had \(wage_i=\beta_0+\beta_1 gender_i + \beta_2 education_i + \beta_3 gender_i education_i\)
  • If we want the overall effect of gender we can factor this out
  • \(wage_i=\beta_0 + (\beta_1 + \beta_3 education_i)gender_i + \beta_2 education_i\)
  • ie the males earn \(\beta_1 + \beta_3 education_i\) more per hour.
  • This means uneducated males earn \(\beta_1\) more, while educated males make \(\beta_1+\beta_3\) more, on average

Interaction variables: Some algebra

  • If we want the overall effect of education we can factor it differently
  • \(wage_i = \beta_0 + \beta_1 gender_i + (\beta_2 + \beta_3gender_i)education_i\)
  • ie educated individuals earn \(\beta_2 + \beta_3 gender_i\) more, on average
  • This means educated women earn \(\beta_2\) more while educated men earn \(\beta_2+\beta_3\) more

Another interaction: difference-in-differences

  • Recreational cannabis was legalized in Illinois in 2020. We can create a dummy variable for whether it was 2020 or later. This variable is often called \(post\) in difference-in-differences model
  • Indiana borders Illinois, but did not have recreational cannabis legalized. We can have another variable called \(treat\) that equals 1 if the state is Illinois, and 0 if the state is Indiana (assuming we only use these two states)

Another interaction: difference-in-differences

  • Suppose we wish to analyze the effect of cannabis legalization on emergency room visits, measured as emergency room visits per 1000 population per year. We filter our data to Illinois and use \(ER_i = \beta_0 + \beta_1 post_i + \varepsilon_i\). We obtain \(\hat\beta_1 = 1\)
  • Interpret \(\hat\beta_1\)
    • After legalization of cannabis, ER visits increased by 1 visit per 1000 population per year

Another interaction: difference-in-differences

  • Is this likely to capture the causal effect of cannabis legalization on emergency room visits?
    • No, ER visits could have been trending over time naturally for unrelated reasons

Another interaction: difference-in-differences

  • Suppose instead you use both Illinois and Indiana and run \(ER_i = \beta_0 + \beta_1 treat_i + \beta_2 post_i + \beta_3 treat_i*post_i + \varepsilon_i\)
  • What do \(\hat\beta_0, \hat\beta_1, \hat\beta_2, \hat\beta_3\) represent?

Another interaction: difference-in-differences

  • \(\beta_0\) is the average for Indiana in the pre-period
  • \(\beta_1\) is the difference between Illinois and Indiana in the pre-period
  • \(\beta_2\) is the difference between Indiana in the pre-period and post-period

Another interaction: difference-in-differences

  • \(\beta_3\) is the difference between Illinois in the pre-period and post-period, minus the difference in Indiana in the pre and post period
  • Illinois in the post period is \(\beta_0+\beta_1+\beta_2+\beta_3\). Illinois pre-period is \(\beta_0+\beta_1\)
  • Indiana in the post period is \(\beta_0 + \beta_2\), and in the pre period is \(\beta_0\)
  • So (Illinois_post-Illinois_pre) - (indiana_post-indiana_pre) = \((\beta_0+\beta_1+\beta_2+\beta_3-(\beta_0+\beta_1))-(\beta_0+\beta_2-\beta_0)=\beta_3\)
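  • The same algebra as an R sketch. The four cell averages are made-up numbers for illustration only, not real ER data; object names are assumptions.

```r
# Hypothetical average ER visits per 1000 population in each state-period cell
cells <- data.frame(
  treat = c(0, 0, 1, 1),    # 0 = Indiana, 1 = Illinois
  post  = c(0, 1, 0, 1),    # 0 = before 2020, 1 = 2020 or later
  ER    = c(10, 11, 12, 14)
)

# treat * post expands to treat + post + treat:post
fit <- coef(lm(ER ~ treat * post, data = cells))
print(fit)

# Difference-in-differences computed by hand from the cell averages
did_by_hand <- (14 - 12) - (11 - 10)
print(did_by_hand)
```

  • The coefficient on `treat:post` matches the by-hand difference of differences.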

Another interaction: difference-in-differences

  • Is this a reasonable causal estimate? Under what conditions?
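  • Yes, so long as ER visits in Illinois and Indiana would have trended at the same rate in the absence of legalization (the parallel trends assumption).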

Example: School fixed effects

  • Suppose we are interested in the effect of teacher effectiveness on students. We measure value-added as the increase in a student’s test scores over their prior year, in student-standard-deviation units
  • We can run \(grade_{ij} = \beta_0 + \beta_1 teacherVA_{ij} + \beta_2 baseline_{ij} + \varepsilon_{ij}\) for student i in school j
  • Concern: Students are sorted into teachers, e.g. because parents move to locations with better teachers.

Example: School fixed effects

  • Solution: School fixed effects. \(grade_{ij} = \beta_0 + \beta_1 teacherVA_{ij} + \beta_2 baseline_{ij} + \lambda_j + \varepsilon_{ij}\)
  • Now we only use variation within a school
  • As long as students aren’t sorted to teachers within a school (e.g. from honors classes or special education) then we are likely to capture a causal effect
  • This is more likely to occur in elementary school, so long as we filter out special education classes

Example: School fixed effects

  • What if you have multiple years of data and there is grade inflation?
  • School and year fixed effects
  • \(grade_{ijt} = \beta_0 + \beta_1 teacherVA_{ijt} + \beta_2 baseline_{ijt} + \lambda_j + \mu_t + \varepsilon_{ijt}\)
  • What if grade inflation differs by school?

Example: School fixed effects

  • School-by-year fixed effects
  • \(grade_{ijt} = \beta_0 + \beta_1 teacherVA_{ijt} + \beta_2 baseline_{ijt} + \theta_{jt} + \varepsilon_{ijt}\)
  • What variation is left?

Example: School fixed effects

  • Do we lose anything by using school fixed effects?
  • What if better schools have better teachers, but students are randomly assigned to schools?
  • Tradeoff between internal validity and external validity

Model Transformations

  • When we turned education into two levels for whether an individual had a bachelors degree we already did an example of a variable transformation
  • More generally, we can transform variables to provide a better fit to the data
  • Let’s consider how wages evolve with age as our example

Wage vs age

Wage vs age Quadratic Fit

Interpreting Quadratic


Call:
lm(formula = wage ~ age, data = dt)

Coefficients:
(Intercept)          age  
    11.0506       0.3004  

Call:
lm(formula = wage ~ age + I(age^2), data = dt)

Coefficients:
(Intercept)          age     I(age^2)  
    3.38417      0.69165     -0.00482  
  • \(wage_i = \beta_0 + \beta_1 age_i + \beta_2 age_i^2\)
  • How do we interpret \(\beta_1, \beta_2\)?

Interpreting Quadratic

  • Derivatives: slope is \(\beta_1 + 2\beta_2 age\). On average wage increases by \(\beta_1 + 2\beta_2 age\) for every additional year of age
  • Note that the slope now depends on the age (it’s not constant)
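  • Plugging in the coefficients from the quadratic output above gives a concrete picture. This is a small sketch; the function and object names are illustrative.

```r
# Coefficients from lm(wage ~ age + I(age^2)) shown above
b1 <- 0.69165
b2 <- -0.00482

# Slope of the fitted curve at a given age: b1 + 2 * b2 * age
marginal_effect <- function(age) b1 + 2 * b2 * age

slope_30 <- marginal_effect(30)   # wage gain from one more year at age 30
slope_60 <- marginal_effect(60)   # smaller gain at age 60

# Age at which the fitted wage curve peaks (where the slope is zero)
peak_age <- -b1 / (2 * b2)

print(c(slope_30 = slope_30, slope_60 = slope_60, peak_age = peak_age))
```

  • The slope is about 0.40 at age 30 and about 0.11 at age 60, and the fitted curve peaks at roughly age 72.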

Linear-Log Transformations

  • Logs are the inverses of exponential functions: \(\log_k(k^x)=x\). They have the property that \(log(xy)=log(x)+log(y)\)
  • Suppose we have \(y = \beta_0 + \beta_1 log(x) + \varepsilon\). How do we interpret a 1 unit change in log(x)?

Linear-Log Transformations

  • If \(log(x)\) increases by 1 (in base k), then \(log(x)+1=log(x)+log(k)=log(kx)\)
    • i.e. x is multiplied by a factor of k
  • The base of our logarithm is arbitrary: a 1 unit increase in log(x) always corresponds to a multiplicative change in x
    • With natural logs, a 1% increase in x is associated with a \(\beta_1/100\) unit change in y
    • (approximately: \(log(1+r) \approx r\) for small r)

Linear-Log Transformations

  • Example: \(y = \beta_0 + 10log(x)+\varepsilon\)
  • a 1% increase in x is associated with a 0.1 unit increase in y
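  • Checking this with natural logs: raising x by 1% multiplies it by 1.01, so
\[ \Delta y = 10\,[\,log(1.01x) - log(x)\,] = 10\,log(1.01) \approx 10 \times 0.00995 \approx 0.1 \]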

Log-linear Transformations

  • We can also have our y variable as a log: \(log(y) = \beta_0+\beta_1 x + \varepsilon\)
  • This means that a 1 unit increase in x leads to a \(\beta_1\) unit increase in log(y), or approximately a \(100\beta_1\%\) increase in y
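  • The percent interpretation comes from exponentiating both sides (natural logs assumed; \(y_{old}\) and \(y_{new}\) denote y before and after x rises by 1):
\[ \frac{y_{new}}{y_{old}} = e^{\beta_1} \;\implies\; \%\Delta y = 100\,(e^{\beta_1}-1) \approx 100\,\beta_1 \text{ for small } \beta_1 \]
  • For example, \(\beta_1=0.05\) gives \(100(e^{0.05}-1)\approx 5.13\%\), close to 5%. For large coefficients the shortcut is poor: \(\beta_1=1\) gives about 172%, not 100%.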

Log-Log Transformations

  • We can also have logs on both sides : \(log(y) = \beta_0 + \beta_1 log(x) + \varepsilon\). What does \(\beta_1\) represent?
  • A 1% increase in x is associated with a \(\beta_1\%\) increase in y
  • This is the elasticity of y with respect to x, if you covered it in microeconomics

Example: Income

Example: Income


Call:
lm(formula = Income ~ Percentile, data = dt)

Coefficients:
(Intercept)   Percentile  
     -40367         2751  

Call:
lm(formula = log(Income + 1) ~ Percentile, data = dt)

Coefficients:
(Intercept)   Percentile  
     8.9293       0.0404  

Call:
lm(formula = log(Income + 1) ~ log(Percentile), data = dt)

Coefficients:
    (Intercept)  log(Percentile)  
          5.588            1.478  
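
  • A sketch of how the three slopes above translate into statements about income, assuming the logs are natural logs (R's default for `log()`). Object names are illustrative, and the outcome in the last two models is Income + 1 rather than Income.

```r
# Slopes copied from the three outputs above
b_linear <- 2751     # Income ~ Percentile
b_loglin <- 0.0404   # log(Income + 1) ~ Percentile
b_loglog <- 1.478    # log(Income + 1) ~ log(Percentile)

# Linear: one more percentile is associated with b_linear more dollars of income

# Log-linear: exact percent change in (Income + 1) per additional percentile
pct_loglin <- 100 * (exp(b_loglin) - 1)

# Log-log: exact percent change in (Income + 1) for a 1% higher percentile
pct_loglog <- 100 * (1.01^b_loglog - 1)

print(c(pct_loglin = pct_loglin, pct_loglog = pct_loglog))
```

  • Both exact values are close to the shortcut readings of roughly 4% per percentile and roughly 1.5% per 1% change in percentile.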

Different Types of interpretations

  • Interpret \(\hat\beta_1\) in each of the following:
  • \(y_i=\beta_0 + \beta_1 x_i + \varepsilon_i\)
    • 1 unit increase in x associated with an average \(\beta_1\) unit increase in y
  • \(y_i=\beta_0 + \beta_1 x_i + \beta_2 z_i + \varepsilon_i\)
    • “” holding z constant
  • \(y_i=\beta_0 + \beta_1 d_i + \varepsilon_i\)
    • The average difference between the group characterized by \(d_i=1\) compared to \(d_i=0\)

Different Types of interpretations

  • \(y_i=\beta_0 + \beta_1 d_i + \beta_2 z_i + \varepsilon_i\)
    • “” holding z constant
  • \(d_i=\beta_0 + \beta_1 x_i + \varepsilon_i\)
    • 1 unit increase in x associated with average of \(100\beta_1\) percentage point increase in probability of attaining \(d_i=1\)
  • \(d_i=\beta_0 + \beta_1 d2_i + \varepsilon_i\)
    • Average difference in probability of attaining \(d_i=1\) for group \(d2=1\) relative to group \(d2=0\)

Different Types of interpretations

  • \(log(y_i)=\beta_0 + \beta_1 x_i + \varepsilon_i\)
    • 1 unit increase in x associated with \(100\beta_1\%\) increase in y
  • \(y_i=\beta_0 + \beta_1 log(x_i) + \varepsilon_i\)
    • \(1\%\) increase in x associated with \(\beta_1/100\) unit increase in y
  • \(log(y_i)=\beta_0 + \beta_1 log(x_i) + \varepsilon_i\)
    • \(1\%\) increase in x associated with \(\beta_1\) percent increase in y

Different Types of interpretations

  • \(y_i = \beta_0 + \beta_1 d_{1i} + \beta_2 d_{2i} + \beta_3 d_{1i}*d_{2i} + \varepsilon_i\)
    • average difference for group \(d_1=1\) vs \(d_1=0\) when \(d_2=0\)
  • \(y_{it} = \beta_0 + \beta_1 x_{it} + \lambda_t + \varepsilon_{it}\)
    • … holding time fixed (“only using variation within years, not across”)

Fixed Effect models: A formalization

  • In the past we’ve run regressions where observations vary at the individual level, but sometimes it is important to think about observations varying at multiple levels.
  • For instance individual, time, and regional level variation may exist in some data but not others (e.g. a panel dataset vs a cross sectional one).
  • More formally, in a cross sectional dataset we observe \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\), while in a different dataset we may observe individuals at the regional level: \(y_{ij}=\beta_0+\beta_1 x_{ij} + \varepsilon_{ij}\)

Fixed Effect models: A formalization

  • This matters for fixed effects: We may observe the region but decide to only use variation within a region rather than across regions.
  • Notationally we do this by adding a regional fixed effect: \(y_{ij} = \beta_0 + \beta_1 x_{ij} + \lambda_j + \varepsilon_{ij}\)
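  • Mechanically we fit this using dummy variables, one for each region (except 1), each with its own coefficient: \(y_{ij} = \beta_0 + \beta_1x_{ij} + \gamma_2 D_{2j} + \gamma_3 D_{3j} + ... + \gamma_k D_{kj} + \varepsilon_{ij}\), where \(D_{mj}=1\) if region j is region m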

Fixed Effect models: A formalization

  • This is identical to the demeaned version: \(Y_{ij} - \bar Y_j=\beta_1(X_{ij}-\bar X_j) + (\varepsilon_{ij}-\bar\varepsilon_j)\)
  • Identical to running a regression of y on x within each region, then averaging the results (using a weighted average)
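  • A sketch comparing the dummy-variable and demeaned versions in R. The data are hypothetical (three regions with a within-region slope near 2 and region effects that fall as x rises), and all names are illustrative assumptions.

```r
# Hypothetical data: 3 regions, 4 observations each
dt <- data.frame(
  region = factor(rep(c("A", "B", "C"), each = 4)),
  x      = c(1, 2, 3, 4,  3, 4, 5, 6,  5, 6, 7, 8)
)
region_effect <- c(A = 10, B = 5, C = 0)
noise <- c(0.2, -0.1, -0.3, 0.2,  -0.2, 0.3, 0.1, -0.2,  0.1, -0.2, 0.3, -0.2)
dt$y <- 2 * dt$x + unname(region_effect[as.character(dt$region)]) + noise

# Version 1: region dummies (R drops one level as the reference)
b_dummy <- coef(lm(y ~ x + region, data = dt))[["x"]]

# Version 2: subtract region means from y and x, then regress
dt$y_dm <- dt$y - ave(dt$y, dt$region)
dt$x_dm <- dt$x - ave(dt$x, dt$region)
b_demean <- coef(lm(y_dm ~ x_dm, data = dt))[["x_dm"]]

# Pooled regression without fixed effects, for comparison
b_pooled <- coef(lm(y ~ x, data = dt))[["x"]]

print(c(dummy = b_dummy, demeaned = b_demean, pooled = b_pooled))
```

  • The two fixed effect versions return the same slope. The pooled slope is much smaller here because it also uses the across-region variation.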

Two Way Fixed Effects (TWFE)

  • Suppose we’re looking at the average income of individuals in different cities over time and are interested in how the crime rate affects that: \(log(income_{it}) = \beta_0 + \beta_1 crime_{it} + \varepsilon_{it}\) where i indexes the city and t the year.
  • Both crime rates and incomes vary over cities and over time, so these are obvious sources of endogeneity. We can include both fixed effects in our model:
  • \(log(income_{it}) = \beta_0 +\beta_1 crime_{it} + \lambda_t + \mu_i + \varepsilon_{it}\)

Two Way Fixed Effects (TWFE)

  • \(log(income_{it}) = \beta_0 +\beta_1 crime_{it} + \lambda_t + \mu_i + \varepsilon_{it}\)
  • This means that we have removed variation across cities and across time. What variation is left?
  • We still have the interaction between city and time, i.e. the differential trend in cities
  • This is the exact same variation we use in a difference-in-difference model: the difference in trend between our treatment and control group

Two Way Fixed Effects (TWFE)

  • \(log(income_{it}) = \beta_0 +\beta_1 crime_{it} + \lambda_t + \mu_i + \varepsilon_{it}\)
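  • Suppose we run this and obtain \(\hat\beta_1=-1\). How do we interpret this? (Assume that crime is measured in violent crimes per 1000 people per year.)
  • Each additional violent crime per 1000 people is associated with a \(100(e^{-1}-1)\approx -63\%\) change in income, holding city and year fixed. The \(100\hat\beta_1\%\) shortcut is only accurate for small coefficients: \(\hat\beta_1=-0.01\) would correspond to roughly a 1% decrease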

Endogeneity in Two Way Fixed Effects Estimators

  • By including fixed effects for city and year we’ve removed all sources of variation associated with these
  • Are there still potential endogeneity issues?
  • Yes, if there are any endogenous factors (that impact both income and crime) that trend differentially across city
  • i.e. what we haven’t removed with our fixed effects.

Endogeneity in Two Way Fixed Effects Estimators

  • Example: suppose high income individuals move to areas of lower crime.
  • This will lead to a differential trend across cities that is not controlled for with our fixed effects
  • Question: Can we solve for this endogeneity by also controlling for the interaction between city and year?

Endogeneity in Two Way Fixed Effects Estimators

  • This is called a city-by-year fixed effect. In general we can, but once we’ve controlled for this we no longer have variation in crime rate or income if we only observe data at the city level.
  • If we observe crime and income at a more granular level we can still use these fixed effects

Endogeneity in Two Way Fixed Effects Estimators

  • In our two way fixed effects estimation, can we measure the effect of whether a city is near a lake on income levels?
  • No, because whether a city is near a lake does not vary over time within a city, and therefore has no variation

Endogeneity in Two Way Fixed Effects Estimators

  • We are concerned that city size might be an omitted variable (urban areas have higher crime rates and higher incomes). Should we control for city size?
  • No need, as long as city size is roughly constant over the sample period: it is then absorbed into the city fixed effects, because within a city there is no variation in population
  • What about race, e.g. the percent of the population that is Black?
  • This only matters if the racial composition of a city is changing over time. It’s a weak control.