Take Home Final

Author

Andrew Grimoldby

Published

June 14, 2023

library(pacman)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

p_load("ggplot2")
p_load("tidyverse")
p_load('fixest')
p_load("lmtest")
p_load("stargazer")
p_load("data.table")
p_load('haven')
p_load('broom')
p_load('modelsummary')
p_load('patchwork')
p_load("AER")

Reading data

aea <- read.csv("aeapnp_bedard_lee_royer.csv")

GenderBinary <- ifelse(aea$sex != "Female", 0, 1)
aea <- cbind(aea, GenderBinary)

1.01

legendtitle1 <- "Gender"
ggplot(aea, aes(x = salary/1000, fill=factor(GenderBinary)))+
  geom_histogram(position="identity", alpha=0.5,bins=30)+
  labs(x='Salary (in Thousands)', y='Count', fill = legendtitle1)+ 
  scale_fill_manual(values=c("blue","red"),labels = c("Non-Female","Female")) +
  scale_x_continuous(labels = scales::dollar)+
  ggtitle("Histogram of Salaries by Gender")+
  facet_grid(~ factor(GenderBinary))+
  theme_bw()

Warning: Removed 2 rows containing non-finite values (`stat_bin()`).

1.02

reg1.02.1 <- lm(salary ~ GenderBinary, data=aea)
summary(reg1.02.1)


Call:
lm(formula = salary ~ GenderBinary, data = aea)

Residuals:
    Min      1Q  Median      3Q     Max 
-155515  -51217  -18358   42485  323755 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    180515       2235  80.777  < 2e-16 ***
GenderBinary   -17781       5233  -3.398 0.000699 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 74220 on 1347 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.008498,  Adjusted R-squared:  0.007762 
F-statistic: 11.54 on 1 and 1347 DF,  p-value: 0.0006993

reg1.02.2 <- lm(log(salary) ~ GenderBinary, data=aea)
summary(reg1.02.2)


Call:
lm(formula = log(salary) ~ GenderBinary, data = aea)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.89278 -0.25878 -0.02406  0.29630  1.17091 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  12.01941    0.01229 978.257  < 2e-16 ***
GenderBinary -0.09536    0.02877  -3.314 0.000943 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4081 on 1347 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.008089,  Adjusted R-squared:  0.007352 
F-statistic: 10.98 on 1 and 1347 DF,  p-value: 0.0009433

When talking about a wage gap, we typically like to use percentages. For that reason I’d argue that the log(salary) regression makes more sense. We have two results here. Our first regression suggests that being female is associated with a salary that is $17,781 lower on average, which is seemingly a large amount, but I’m not sure how much it really tells us without a baseline. Alternatively, our log(salary) regression suggests salaries that are roughly 9.5% lower for females, which is easier to compare across years and salary levels. The other reason I suggest using the log regression is that the salary distribution is skewed to the right for both males and females, and taking the log reduces that skew. Overall, when talking about wage gaps, the log specification seems to be the best fit.
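Reading a log-salary coefficient as a percentage is an approximation. As a quick check, the sketch below converts the GenderBinary coefficient reported above into an exact percentage change; the variable names are illustrative only.

```r
# Coefficient on GenderBinary from the log(salary) regression above
b_female <- -0.09536

# Exact percentage change implied by a log-point coefficient
pct_exact <- 100 * (exp(b_female) - 1)

# Common approximation: read the coefficient directly as a percentage
pct_approx <- 100 * b_female

round(c(exact = pct_exact, approx = pct_approx), 2)
```

For coefficients this small the two readings are close, so describing the gap as "roughly 9 to 10 percent" is reasonable either way.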

1.03

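In this case, we are estimating the raw difference in average log salary between females and non-females, since we have no control variables. This equals the ATE, or average treatment effect, only if gender is as good as randomly assigned with respect to the other determinants of salary. Realistically, we would like to control for other things. It is clear that gender is not the only thing that affects salary; we likely need to control for education, field, and tenure, among other things.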
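To see what a regression on a single binary variable measures, here is a minimal sketch on simulated data (not the AEA data; all names and parameter values are made up for illustration). The slope on the indicator is exactly the difference in group means.

```r
set.seed(42)
n <- 500

# Simulated stand-ins for GenderBinary and log(salary)
female <- rbinom(n, 1, 0.2)
log_salary <- 12 - 0.1 * female + rnorm(n, sd = 0.4)

# Regression with no controls
fit <- lm(log_salary ~ female)

# Raw difference in group means
gap_means <- mean(log_salary[female == 1]) - mean(log_salary[female == 0])

# The two numbers agree up to floating-point error
all.equal(unname(coef(fit)["female"]), gap_means)
```

Whether that difference in means can be read as a causal effect depends entirely on the assignment assumption, not on the regression itself.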

1.04

reg1.04.01 <- lm(log(salary) ~ year, data = aea)
summary(reg1.04.01)


Call:
lm(formula = log(salary) ~ year, data = aea)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.46452 -0.26954 -0.07358  0.25670  1.33729 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -80.617821   4.304081  -18.73   <2e-16 ***
year          0.045967   0.002136   21.52   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3534 on 1347 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.2558,    Adjusted R-squared:  0.2553 
F-statistic: 463.1 on 1 and 1347 DF,  p-value: < 2.2e-16

reg1.04.02 <- lm(GenderBinary ~ year, data = aea)
summary(reg1.04.02)


Call:
lm(formula = GenderBinary ~ year, data = aea)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.2003 -0.1958 -0.1869 -0.1422  0.8936 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -8.830200   4.685641  -1.885   0.0597 .
year         0.004473   0.002325   1.923   0.0546 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3857 on 1349 degrees of freedom
Multiple R-squared:  0.002735,  Adjusted R-squared:  0.001996 
F-statistic: 3.699 on 1 and 1349 DF,  p-value: 0.05464

I would argue that omitting year would cause bias for a couple of reasons (though this bias may be able to be addressed in a different way than including year). The first is that the cost of living is likely higher toward the present day than in earlier years, and salaries have likely risen to adjust for that. The second is that in earlier years, like 1997, the minimum year of our data set, there likely weren’t as many women in the field of economics, and they may not have been treated nearly as equally as today. Our results indicate that a 1 year increase in year is associated with about a 4.6% increase in salary. When we regressed gender on year, we saw that a 1 year increase raised the predicted probability of being female by about 0.004. Note that in the latter regression, neither the intercept nor the coefficient on year is statistically significant at the 5% level (p = 0.055 for year), which indicates that my idea of more women in the field of economics in later years is only weakly supported. Overall, year has a positive relationship with log salary and a positive relationship with gender, leading us to believe that omitting year produces a positive bias: the gender coefficient estimated without year is less negative than it should be.
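The sign argument above follows the omitted variable bias formula: the short-regression coefficient equals the long-regression coefficient plus (effect of year on log salary) times (slope from regressing year on gender). Note that the auxiliary regression in the formula puts year on the left-hand side, the reverse of the regression run above, although the sign is the same either way. The sketch below checks the identity on simulated data; the data-generating values are assumptions chosen only to resemble the setting.

```r
set.seed(1)
n <- 1000

# Simulated data: women appear slightly later in the sample on average
female <- rbinom(n, 1, 0.2)
year <- 1997 + rpois(n, 8) + 2 * female
log_salary <- -80 + 0.045 * year - 0.12 * female + rnorm(n, sd = 0.35)

short <- lm(log_salary ~ female)          # omits year
long  <- lm(log_salary ~ female + year)   # includes year
aux   <- lm(year ~ female)                # omitted variable on the left-hand side

# Omitted variable bias term
bias <- unname(coef(long)["year"] * coef(aux)["female"])

c(short          = unname(coef(short)["female"]),
  long_plus_bias = unname(coef(long)["female"]) + bias,
  bias           = bias)
```

Because both components of the bias term are positive here, the short regression understates the size of the (negative) gender gap.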

1.05

reg1.05.01 <- lm(log(salary) ~ GenderBinary + year, data=aea)
summary(reg1.05.01)


Call:
lm(formula = log(salary) ~ GenderBinary + year, data = aea)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.48215 -0.26257 -0.06128  0.25035  1.31751 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -81.68408    4.27196 -19.121  < 2e-16 ***
GenderBinary  -0.12313    0.02474  -4.978 7.26e-07 ***
year           0.04651    0.00212  21.935  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3504 on 1346 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.2693,    Adjusted R-squared:  0.2682 
F-statistic:   248 on 2 and 1346 DF,  p-value: < 2.2e-16

reg1.05.02 <- feols(log(salary) ~ GenderBinary | year, data = aea)

NOTE: 2 observations removed because of NA values (LHS: 2).

summary(reg1.05.02)

OLS estimation, Dep. Var.: log(salary)
Observations: 1,349 
Fixed-effects: year: 21
Standard-errors: Clustered (year) 
              Estimate Std. Error  t value   Pr(>|t|)    
GenderBinary -0.122202   0.021145 -5.77927 1.1793e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.344536     Adj. R2: 0.280601
                 Within R2: 0.018155

Controlling for year does matter, and I would argue that year is a good control. In our initial log regression in problem 1.02, we saw a standard error on the GenderBinary variable of about 0.029. With year fixed effects, that decreases to about 0.021 (note that these standard errors are clustered by year, so the comparison is not like for like), and using year as a numeric variable, the standard error on the same variable is 0.025. Our intercept term in the initial regression had a standard error of 0.012 and in our new regression with numeric year a standard error of 4.272. We also now see that being female is associated with wages that are about 12% lower on average instead of about 10% as seen before. While I think year is important, I believe that there may be proxies that are better suited. I believe that including it as a fixed effect may be a good idea, but I wouldn’t argue that including it as a numeric variable is a good idea. By including it as a fixed effect, we are essentially controlling for anything that is common to everyone in a given year, such as inflation or the state of the job market, without assuming that salaries grow linearly over time.
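To make the fixed-effect idea concrete, here is a small sketch on simulated data (base R only; names and values are illustrative assumptions, not the AEA data). Including one dummy per year gives the same gender coefficient as demeaning both variables within year, which is what a year fixed effect does to the point estimate.

```r
set.seed(7)
n <- 600

# Simulated panel-like data over 21 calendar years
year <- sample(1997:2017, n, replace = TRUE)
female <- rbinom(n, 1, 0.2)
log_salary <- 11 + 0.045 * (year - 1997) - 0.12 * female + rnorm(n, sd = 0.35)

# (a) One dummy per year
fit_dummies <- lm(log_salary ~ female + factor(year))

# (b) Subtract year-specific means, then regress without an intercept
y_within <- log_salary - ave(log_salary, year)
f_within <- female - ave(female, year)
fit_within <- lm(y_within ~ 0 + f_within)

c(dummies = unname(coef(fit_dummies)["female"]),
  within  = unname(coef(fit_within)))
```

Only the point estimates are guaranteed to match; the default standard errors from the demeaned regression do not account for the estimated year means.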

1.06

# Create a years-since-PhD column
yearssincephd <- aea$year - aea$yearofphd
aea <- cbind(aea, yearssincephd)

reg1.06.01 <- feols(log(salary) ~ GenderBinary + yearssincephd | year, data = aea)

NOTE: 2 observations removed because of NA values (LHS: 2).

reg1.06.02 <- feols(log(salary) ~ GenderBinary + I(yearssincephd >=0 & yearssincephd <= 9) | year, data = aea)

NOTE: 2 observations removed because of NA values (LHS: 2).

# Note: the yearssincephd >= 20 group is the omitted reference category, which avoids perfect multicollinearity
reg1.06.03 <- feols(log(salary) ~  GenderBinary +I(yearssincephd >= 0 & yearssincephd <= 9) + 
                    I(yearssincephd >= 10 & yearssincephd <= 19) +
                    I(yearssincephd >= 0 & yearssincephd <= 9)*GenderBinary +
                    I(yearssincephd >= 10 & yearssincephd <= 19)*GenderBinary | year, data = aea)

NOTE: 2 observations removed because of NA values (LHS: 2).

etable(reg1.06.01, reg1.06.02, reg1.06.03)

                                                              reg1.06.01
Dependent Var.:                                              log(salary)
                                                                        
GenderBinary                                           -0.0572* (0.0203)
yearssincephd                                         0.0083*** (0.0015)
I(yearssincephd>=0&yearssincephd<=9)                                    
I(yearssincephd>=10&yearssincephd<=19)                                  
GenderBinary x I(yearssincephd>=0&yearssincephd<=9)                     
GenderBinary x I(yearssincephd>=10&yearssincephd<=19)                   
Fixed-Effects:                                        ------------------
year                                                                 Yes
________________________________________              __________________
S.E.: Clustered                                                 by: year
Observations                                                       1,349
R2                                                               0.36888
Within R2                                                        0.12501

                                                               reg1.06.02
Dependent Var.:                                               log(salary)
                                                                         
GenderBinary                                            -0.0417* (0.0162)
yearssincephd                                                            
I(yearssincephd>=0&yearssincephd<=9)                  -0.3596*** (0.0176)
I(yearssincephd>=10&yearssincephd<=19)                                   
GenderBinary x I(yearssincephd>=0&yearssincephd<=9)                      
GenderBinary x I(yearssincephd>=10&yearssincephd<=19)                    
Fixed-Effects:                                        -------------------
year                                                                  Yes
________________________________________              ___________________
S.E.: Clustered                                                  by: year
Observations                                                        1,349
R2                                                                0.45945
Within R2                                                         0.25058

                                                               reg1.06.03
Dependent Var.:                                               log(salary)
                                                                         
GenderBinary                                             -0.0248 (0.0255)
yearssincephd                                                            
I(yearssincephd>=0&yearssincephd<=9)                  -0.3706*** (0.0225)
I(yearssincephd>=10&yearssincephd<=19)                   -0.0313 (0.0333)
GenderBinary x I(yearssincephd>=0&yearssincephd<=9)      -0.0119 (0.0272)
GenderBinary x I(yearssincephd>=10&yearssincephd<=19)    -0.0326 (0.0535)
Fixed-Effects:                                        -------------------
year                                                                  Yes
________________________________________              ___________________
S.E.: Clustered                                                  by: year
Observations                                                        1,349
R2                                                                0.46062
Within R2                                                         0.25219
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
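One way to read the interaction model (reg1.06.03) is to add the main GenderBinary coefficient to each interaction term, which gives the implied gender gap in log points for each years-since-PhD group. The sketch below only does that arithmetic with the coefficients printed in the table; it says nothing about the standard errors of the sums, which would need the coefficient covariance matrix.

```r
# Coefficients copied from the reg1.06.03 column above
b_female    <- -0.0248   # GenderBinary (gap in the omitted group)
b_int_0_9   <- -0.0119   # GenderBinary x (0 to 9 years since PhD)
b_int_10_19 <- -0.0326   # GenderBinary x (10 to 19 years since PhD)

# Implied gender gap in log salary by group
gaps <- c("0-9 years"     = b_female + b_int_0_9,
          "10-19 years"   = b_female + b_int_10_19,
          "omitted group" = b_female)

round(gaps, 4)
```

The omitted group is everyone not covered by the two indicators, which is mainly those 20 or more years past the PhD.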

1.07

reg1.07.01 <- feols(log(salary) ~  GenderBinary +I(yearssincephd >= 0 & yearssincephd <= 9) + 
                    I(yearssincephd >= 10 & yearssincephd <= 19) +
                    I(yearssincephd >= 0 & yearssincephd <= 9)*GenderBinary +
                    I(yearssincephd >= 10 & yearssincephd <= 19)*GenderBinary +
                    i_tenure | year, data = aea)

NOTE: 2 observations removed because of NA values (LHS: 2).

reg1.07.02 <- feols(log(salary) ~  GenderBinary +I(yearssincephd >= 0 & yearssincephd <= 9) + 
                    I(yearssincephd >= 10 & yearssincephd <= 19) +
                    I(yearssincephd >= 0 & yearssincephd <= 9)*GenderBinary +
                    I(yearssincephd >= 10 & yearssincephd <= 19)*GenderBinary +
                    i_tenure + business | year, data = aea)

NOTE: 2 observations removed because of NA values (LHS: 2).

reg1.07.03 <- feols(log(salary) ~  GenderBinary +I(yearssincephd >= 0 & yearssincephd <= 9) + 
                    I(yearssincephd >= 10 & yearssincephd <= 19) +
                    I(yearssincephd >= 0 & yearssincephd <= 9)*GenderBinary +
                    I(yearssincephd >= 10 & yearssincephd <= 19)*GenderBinary +
                    i_tenure + business + school_id | year, data = aea)

NOTE: 2 observations removed because of NA values (LHS: 2).

reg1.07.04 <- feols(log(salary) ~  GenderBinary +I(yearssincephd >= 0 & yearssincephd <= 9) + 
                    I(yearssincephd >= 10 & yearssincephd <= 19) +
                    I(yearssincephd >= 0 & yearssincephd <= 9)*GenderBinary +
                    I(yearssincephd >= 10 & yearssincephd <= 19)*GenderBinary +
                    i_tenure + business + school_id + phd_id | year, data = aea)

NOTE: 2 observations removed because of NA values (LHS: 2).

reg1.07.05 <- feols(log(salary) ~  GenderBinary +I(yearssincephd >= 0 & yearssincephd <= 9) + 
                    I(yearssincephd >= 10 & yearssincephd <= 19) +
                    I(yearssincephd >= 0 & yearssincephd <= 9)*GenderBinary +
                    I(yearssincephd >= 10 & yearssincephd <= 19)*GenderBinary +
                    i_tenure + business + school_id + phd_id +
                    micro + macromoney + metrics + labor + io + public + enviroenergy + develop + intl + finance + health + expbehave + political + urban + econhist + educ | year, data = aea)

NOTE: 2 observations removed because of NA values (LHS: 2).

etable(reg1.07.01, reg1.07.02, reg1.07.03, reg1.07.04, reg1.07.05)

                                                               reg1.07.01
Dependent Var.:                                               log(salary)
                                                                         
GenderBinary                                             -0.0221 (0.0258)
I(yearssincephd>=0&yearssincephd<=9)                  -0.1699*** (0.0318)
I(yearssincephd>=10&yearssincephd<=19)                   -0.0176 (0.0366)
i_tenure                                               0.2430*** (0.0299)
GenderBinary x I(yearssincephd>=0&yearssincephd<=9)      -0.0165 (0.0263)
GenderBinary x I(yearssincephd>=10&yearssincephd<=19)    -0.0188 (0.0432)
business                                                                 
school_id                                                                
phd_id                                                                   
micro                                                                    
macromoney                                                               
metrics                                                                  
labor                                                                    
io                                                                       
public                                                                   
enviroenergy                                                             
develop                                                                  
intl                                                                     
finance                                                                  
health                                                                   
expbehave                                                                
political                                                                
urban                                                                    
econhist                                                                 
educ                                                                     
Fixed-Effects:                                        -------------------
year                                                                  Yes
________________________________________              ___________________
S.E.: Clustered                                                  by: year
Observations                                                        1,349
R2                                                                0.48201
Within R2                                                         0.28186

                                                               reg1.07.02
Dependent Var.:                                               log(salary)
                                                                         
GenderBinary                                             -0.0188 (0.0268)
I(yearssincephd>=0&yearssincephd<=9)                  -0.1722*** (0.0337)
I(yearssincephd>=10&yearssincephd<=19)                   -0.0160 (0.0367)
i_tenure                                               0.2386*** (0.0333)
GenderBinary x I(yearssincephd>=0&yearssincephd<=9)      -0.0213 (0.0284)
GenderBinary x I(yearssincephd>=10&yearssincephd<=19)    -0.0223 (0.0444)
business                                                 0.0512. (0.0264)
school_id                                                                
phd_id                                                                   
micro                                                                    
macromoney                                                               
metrics                                                                  
labor                                                                    
io                                                                       
public                                                                   
enviroenergy                                                             
develop                                                                  
intl                                                                     
finance                                                                  
health                                                                   
expbehave                                                                
political                                                                
urban                                                                    
econhist                                                                 
educ                                                                     
Fixed-Effects:                                        -------------------
year                                                                  Yes
________________________________________              ___________________
S.E.: Clustered                                                  by: year
Observations                                                        1,349
R2                                                                0.48385
Within R2                                                         0.28441

                                                               reg1.07.03
Dependent Var.:                                               log(salary)
                                                                         
GenderBinary                                             -0.0242 (0.0271)
I(yearssincephd>=0&yearssincephd<=9)                  -0.1773*** (0.0368)
I(yearssincephd>=10&yearssincephd<=19)                   -0.0193 (0.0383)
i_tenure                                               0.2372*** (0.0335)
GenderBinary x I(yearssincephd>=0&yearssincephd<=9)      -0.0157 (0.0278)
GenderBinary x I(yearssincephd>=10&yearssincephd<=19)    -0.0100 (0.0490)
business                                               0.1114*** (0.0266)
school_id                                                0.0006* (0.0003)
phd_id                                                                   
micro                                                                    
macromoney                                                               
metrics                                                                  
labor                                                                    
io                                                                       
public                                                                   
enviroenergy                                                             
develop                                                                  
intl                                                                     
finance                                                                  
health                                                                   
expbehave                                                                
political                                                                
urban                                                                    
econhist                                                                 
educ                                                                     
Fixed-Effects:                                        -------------------
year                                                                  Yes
________________________________________              ___________________
S.E.: Clustered                                                  by: year
Observations                                                        1,349
R2                                                                0.49258
Within R2                                                         0.29651

                                                               reg1.07.04
Dependent Var.:                                               log(salary)
                                                                         
GenderBinary                                             -0.0247 (0.0273)
I(yearssincephd>=0&yearssincephd<=9)                  -0.1783*** (0.0364)
I(yearssincephd>=10&yearssincephd<=19)                   -0.0202 (0.0387)
i_tenure                                               0.2364*** (0.0335)
GenderBinary x I(yearssincephd>=0&yearssincephd<=9)      -0.0154 (0.0281)
GenderBinary x I(yearssincephd>=10&yearssincephd<=19)    -0.0087 (0.0487)
business                                               0.1115*** (0.0270)
school_id                                                0.0007* (0.0003)
phd_id                                                 -0.0002. (8.76e-5)
micro                                                                    
macromoney                                                               
metrics                                                                  
labor                                                                    
io                                                                       
public                                                                   
enviroenergy                                                             
develop                                                                  
intl                                                                     
finance                                                                  
health                                                                   
expbehave                                                                
political                                                                
urban                                                                    
econhist                                                                 
educ                                                                     
Fixed-Effects:                                        -------------------
year                                                                  Yes
________________________________________              ___________________
S.E.: Clustered                                                  by: year
Observations                                                        1,349
R2                                                                0.49335
Within R2                                                         0.29757

                                                               reg1.07.05
Dependent Var.:                                               log(salary)
                                                                         
GenderBinary                                             -0.0245 (0.0280)
I(yearssincephd>=0&yearssincephd<=9)                  -0.1896*** (0.0356)
I(yearssincephd>=10&yearssincephd<=19)                   -0.0289 (0.0340)
i_tenure                                               0.2338*** (0.0345)
GenderBinary x I(yearssincephd>=0&yearssincephd<=9)      -0.0045 (0.0313)
GenderBinary x I(yearssincephd>=10&yearssincephd<=19)    -0.0053 (0.0495)
business                                                0.1060** (0.0301)
school_id                                                0.0006. (0.0003)
phd_id                                                  -0.0001 (9.19e-5)
micro                                                    -0.0087 (0.0130)
macromoney                                                0.0263 (0.0202)
metrics                                                   0.0420 (0.0280)
labor                                                     0.0129 (0.0189)
io                                                        0.0226 (0.0291)
public                                                    0.0132 (0.0317)
enviroenergy                                              0.0131 (0.0196)
develop                                                 -0.0678* (0.0322)
intl                                                      0.0324 (0.0215)
finance                                                   0.0171 (0.0336)
health                                                   0.0522* (0.0212)
expbehave                                               0.0746** (0.0255)
political                                                -0.0066 (0.0183)
urban                                                    0.1038. (0.0532)
econhist                                                 -0.0741 (0.0693)
educ                                                     0.0515. (0.0250)
Fixed-Effects:                                        -------------------
year                                                                  Yes
________________________________________              ___________________
S.E.: Clustered                                                  by: year
Observations                                                        1,349
R2                                                                0.50380
Within R2                                                         0.31206
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

1.08

In some of our regressions (especially the last) I am concerned about multicollinearity and over-controlling, both of which could introduce selection bias. Since we are looking for gender’s effect on log salary, things such as tenure, years since PhD, the institution they are at, and the institution they got their PhD from are certainly relevant. However, I am not convinced that field and the two institutions are random. Field isn’t random because it depends on where they went and what they wanted to do in grad school. Furthermore, the institution they got their PhD from can’t be random because it depends on so many factors, and people choose a school for a variety of reasons. Finally, their current institution is likely correlated with where they attended grad school. It is unlikely that someone like myself would go on to teach at a top-tier institution like Harvard or Yale after completing graduate school. Thus, we certainly can’t consider that random either. In sum, I am concerned about randomization issues within some of our controls. Even if they seem relevant, we are likely introducing selection bias.

1.09

Up to this point, our fixed-effects regressions have used standard errors clustered by year, and I would recommend sticking with cluster-robust standard errors because of the nature of our data. With variables like years since PhD and institution (both types), we likely have correlation within our clusters. If two people got their Econ PhD from Oregon in the same year, we may expect their starting salaries to be somewhat similar. We also must consider the job market in a given year: it may be more or less competitive, potentially leading to higher or lower salaries for everyone that year. For these reasons, cluster-robust standard errors are the way to go.
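To illustrate why within-year correlation matters, the sketch below simulates a common salary shock for each year and compares ordinary OLS standard errors with a hand-computed cluster-robust (sandwich) estimate. Everything here is an assumption for illustration: the data are simulated, and the small-sample adjustment is one common choice that may differ slightly from what fixest applies by default.

```r
set.seed(1)
n_years  <- 21
per_year <- 50
year <- rep(seq_len(n_years), each = per_year)

# Common shock shared by everyone observed in the same year
year_shock <- rnorm(n_years, sd = 0.2)[year]
female <- rbinom(n_years * per_year, 1, 0.2)
log_salary <- 12 - 0.1 * female + year_shock +
  rnorm(n_years * per_year, sd = 0.3)

fit <- lm(log_salary ~ female)
X <- model.matrix(fit)
u <- resid(fit)

# Sandwich estimator: bread = (X'X)^-1, meat = sum over clusters of s_g s_g'
bread  <- solve(crossprod(X))
scores <- rowsum(X * u, group = year)   # one row of summed scores per year
meat   <- crossprod(scores)

# Small-sample adjustment (an assumed, commonly used choice)
G <- n_years; N <- nrow(X); K <- ncol(X)
adj <- (G / (G - 1)) * ((N - 1) / (N - K))
vcov_cluster <- adj * bread %*% meat %*% bread

se_cluster <- sqrt(diag(vcov_cluster))
se_iid     <- sqrt(diag(vcov(fit)))
round(rbind(iid = se_iid, clustered = se_cluster), 4)
```

In this setup the intercept's clustered standard error is much larger than the ordinary one, because the year shocks are ignored by the usual formula.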

1.10

reg1.10 <- feols(log(salary) ~ GenderBinary + i_tenure + school_id +
                    I(yearssincephd >= 0 & yearssincephd <= 9) + 
                    I(yearssincephd >= 10 & yearssincephd <= 19) +
                    I(yearssincephd >= 0 & yearssincephd <= 9)*GenderBinary +
                    I(yearssincephd >= 10 & yearssincephd <= 19)*GenderBinary | year, data = aea)

NOTE: 2 observations removed because of NA values (LHS: 2).

etable(reg1.10)

                                                                  reg1.10
Dependent Var.:                                               log(salary)
                                                                         
GenderBinary                                             -0.0276 (0.0266)
i_tenure                                               0.2453*** (0.0293)
school_id                                                 0.0004 (0.0003)
I(yearssincephd>=0&yearssincephd<=9)                  -0.1712*** (0.0326)
I(yearssincephd>=10&yearssincephd<=19)                   -0.0206 (0.0393)
GenderBinary x I(yearssincephd>=0&yearssincephd<=9)      -0.0101 (0.0270)
GenderBinary x I(yearssincephd>=10&yearssincephd<=19)    -0.0092 (0.0500)
Fixed-Effects:                                        -------------------
year                                                                  Yes
________________________________________              ___________________
S.E.: Clustered                                                  by: year
Observations                                                        1,349
R2                                                                0.48584
Within R2                                                         0.28716
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The specification that I chose is above. For the CIA (conditional independence assumption) to be valid, I need gender to be as good as random conditional on my covariates. My covariates are tenure, the school they are currently at, and the three groups of years since PhD, allowing for heterogeneous treatment effects. I think this CIA is the best one because it gets rid of some of the extras we had before, such as all of the possible fields, as well as the school they got their PhD from. As described earlier, I think that where they got their PhD and where they currently are may be too highly correlated, and not random at all if we include both. My inference approach is as follows: our goal is to estimate the effect of gender on salary, and to do that we clearly need to control for things other than gender. If someone is tenured, it will certainly increase wages. If they are at a more prestigious school, it will increase wages. We also need to allow for heterogeneous treatment effects in order to account for differences among individuals. I also feel it is necessary to justify not including the specific field. In academia, I don’t think field is as good as random, because gender may influence why someone chooses a certain field, based on societal norms, among other things. Furthermore, I would argue that field won’t cause too big of a difference in salary.

1.11

reg1.11 <- feols(salary ~ GenderBinary + i_tenure + school_id +
                    I(yearssincephd >= 0 & yearssincephd <= 9) + 
                    I(yearssincephd >= 10 & yearssincephd <= 19) +
                    I(yearssincephd >= 0 & yearssincephd <= 9)*GenderBinary +
                    I(yearssincephd >= 10 & yearssincephd <= 19)*GenderBinary | year, data = aea)

NOTE: 2 observations removed because of NA values (LHS: 2).

etable(reg1.11)

                                                                     reg1.11
Dependent Var.:                                                       salary
                                                                            
GenderBinary                                              -5,522.0 (5,212.9)
i_tenure                                               39,049.2*** (6,374.8)
school_id                                                      100.1 (65.89)
I(yearssincephd>=0&yearssincephd<=9)                  -38,343.0*** (6,812.6)
I(yearssincephd>=10&yearssincephd<=19)                    -9,583.3 (6,897.0)
GenderBinary x I(yearssincephd>=0&yearssincephd<=9)         -208.8 (4,999.7)
GenderBinary x I(yearssincephd>=10&yearssincephd<=19)       -46.56 (8,801.0)
Fixed-Effects:                                        ----------------------
year                                                                     Yes
________________________________________              ______________________
S.E.: Clustered                                                     by: year
Observations                                                           1,349
R2                                                                   0.42152
Within R2                                                            0.26715
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The magnitudes differ because the two regressions measure different things: with salary in levels the coefficients are in dollars, while with log(salary) they are approximately proportional changes. All coefficients keep the same sign, so our general conclusions remain the same. It is also worth noting that every coefficient that was statistically significant with log(salary) is still statistically significant without the log.
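
To make the two scales comparable, a log-point coefficient can be converted to an approximate percent change. The sketch below assumes the -0.0092 interaction coefficient in the preceding table comes from the log(salary) specification; the variable names are hypothetical.

```r
# Coefficient copied from the earlier table
# (GenderBinary x 10-19 years since PhD); assumed to be in log points.
log_coef <- -0.0092

# Exact percent change implied by a log-point coefficient.
pct_change <- 100 * (exp(log_coef) - 1)
pct_change
```

For coefficients this close to zero the exact conversion and the raw coefficient times 100 are nearly identical; the gap grows as the coefficient gets larger in absolute value.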

1.12

If I’m correct about my CIA, I think a matching estimator would work, and I think it would yield an interesting result. If we matched treated (Female) to untreated (Non-Female) observations, the propensity score would reduce the matching problem to a single dimension. This would ultimately lead to a better estimate. One potential problem is our data set: I would like to control for other things, such as cost of living, performance at the job (which tenure may partly capture), and potentially race and age.
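
A minimal sketch of propensity-score matching in base R follows. It uses made-up toy data, not the aea data set, and every variable name and parameter value in it is hypothetical. It shows the three steps only: estimate the score, match each treated unit to its nearest control, and average the matched differences.

```r
# Hypothetical toy data: none of these values come from the aea data set.
set.seed(1)
n       <- 500
tenure  <- rbinom(n, 1, 0.5)
yrs     <- runif(n, 0, 30)
treated <- rbinom(n, 1, plogis(-0.5 + 0.3 * tenure - 0.02 * yrs))
outcome <- 100 + 20 * tenure + 1.5 * yrs - 5 * treated + rnorm(n, sd = 10)
toy     <- data.frame(treated, tenure, yrs, outcome)

# Step 1: estimate propensity scores with a logit model.
ps_model   <- glm(treated ~ tenure + yrs, family = binomial(), data = toy)
toy$pscore <- fitted(ps_model)

# Step 2: for each treated unit, find the control with the closest score
# (nearest neighbor, with replacement).
treated_rows <- which(toy$treated == 1)
control_rows <- which(toy$treated == 0)
match_idx <- sapply(treated_rows, function(i) {
  control_rows[which.min(abs(toy$pscore[control_rows] - toy$pscore[i]))]
})

# Step 3: average the matched differences (an ATT-style estimate).
att_hat <- mean(toy$outcome[treated_rows] - toy$outcome[match_idx])
att_hat
```

Matching with replacement keeps the code short; a fuller analysis would also check covariate balance and overlap of the scores before trusting the estimate.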

Section 2 - Simulation

#Loading necessary packages for simulation
p_load("tibble")
p_load("purrr")
p_load("furrr")

Please note: I don’t know of a way to answer question 2.01 other than running the simulation first to generate the data. I was originally doing it a different (far less sensible) way and ran two simulations, one for the variables to answer question 2.01 and one that did the regressions for 2.03. I took an educated guess (perhaps foolishly) that this couldn’t possibly be what you’re asking us to do. Finally, I (with the help of GPT) figured out how to return two separate data frames, one for the variables and one for the regression estimates.

fun_iter = function(iter, n = 1000){
  # Draw one simulated data set of size n
  dgp_df = tibble(
    z_1 = runif(n, min = -5, max = 5),
    z_2 = runif(n, min = -5, max = 5),
    v = rnorm(n, sd = 1),
    w = rnorm(n, sd = 2),
    p = rbinom(n, size = 1, prob = 0.2),
    u = rnorm(n, sd = 1),
    x = 1 + z_1 + z_2 - 2*(z_2*p) + v + w,
    y = 2 + 3*x + (x*p) + 2*u + w
  )

  # Estimate the slope on x with each strategy
  OLS <- lm(y ~ x, data = dgp_df)
  IV1 <- ivreg(y ~ x | z_1, data = dgp_df)
  IV2 <- ivreg(y ~ x | z_2, data = dgp_df)
  TwoSLS1 <- ivreg(y ~ x | z_1 + z_2, data = dgp_df)
  # Note: in formula syntax z_2*p expands to z_2 + p + z_2:p,
  # so p itself is also used as an instrument here
  TwoSLS2 <- ivreg(y ~ x | z_1 + z_2 + (z_2*p), data = dgp_df)

  # Keep the raw draws so they can be used for question 2.01
  variables <- dgp_df

  # Collect the coefficient on x from each model
  estimates <- tibble(
    Strategy = c("OLS", "IV1", "IV2", "TwoSLS1", "TwoSLS2"),
    Estimates = c(OLS$coefficients[2], IV1$coefficients[2], IV2$coefficients[2],
                  TwoSLS1$coefficients[2], TwoSLS2$coefficients[2])
  )

  # Return both pieces in one list
  return(list(variables = variables, estimates = estimates))
}

set.seed(98372)
plan(multisession)
sim_list <- future_map(1:1e4, fun_iter,
                       .options = furrr_options(seed = TRUE)
                       )

variables_df <- bind_rows(map(sim_list, "variables"))
estimates_df <- bind_rows(map(sim_list, "estimates"))

2.01

The ATE is the difference between the mean of our outcome variable ($y_i$) in the treated group ($p_i=1$) and the mean of our outcome variable in the control group ($p_i=0$). Since $p_i$ is randomly assigned in our simulation with probability 0.2, this difference in means identifies the average effect. In mathematical terms: $E[Y_{i}|p_i=1]-E[Y_{i}|p_i=0] = E[\tau_{i}]$

ATE_DGP <- mean(variables_df$y[variables_df$p == 1]) - mean(variables_df$y[variables_df$p ==0])
ATE_DGP

[1] 1.00489
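
As a rough check on the simulated value, the same difference in means can be worked out from the DGP in fun_iter, assuming all draws are mutually independent as coded. Every random term other than $p_i$ has mean zero, so:

$$
E[x_i \mid p_i] = 1 + E[z_{1i}] + (1 - 2p_i)\,E[z_{2i}] + E[v_i] + E[w_i] = 1
$$

$$
E[y_i \mid p_i] = 2 + 3\,E[x_i \mid p_i] + p_i\,E[x_i \mid p_i] + 2\,E[u_i] + E[w_i] = 5 + p_i
$$

$$
E[y_i \mid p_i = 1] - E[y_i \mid p_i = 0] = 6 - 5 = 1
$$

The simulated difference of 1.00489 is close to this value.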

2.02

A simple regression of $y$ on $x$ will fail to recover the ATE because $x$ is endogenous. The term $w$ enters both the equation for $x$ and the equation for $y$, so when we regress $y$ on $x$ the regressor is correlated with the error term. Because $x$ is not exogenous, OLS gives a biased and inconsistent estimate of $\beta_1$.
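
The endogeneity of $x$ can be illustrated with one large draw from the same DGP as fun_iter. This is a sketch in base R; the seed and sample size are arbitrary choices, not values used elsewhere in this document.

```r
# One large draw from the DGP in fun_iter (base R only; n is arbitrary).
set.seed(1)
n   <- 1e5
z_1 <- runif(n, min = -5, max = 5)
z_2 <- runif(n, min = -5, max = 5)
v   <- rnorm(n, sd = 1)
w   <- rnorm(n, sd = 2)
p   <- rbinom(n, size = 1, prob = 0.2)
x   <- 1 + z_1 + z_2 - 2 * (z_2 * p) + v + w

# w is part of x and is also left in the error when y is regressed on x,
# so this covariance should be near Var(w) = 4 rather than near 0.
cov_xw <- cov(x, w)
cov_xw
```

A covariance this far from zero between the regressor and an unobserved determinant of $y$ is what makes the OLS slope inconsistent.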

2.03

ggplot(data = estimates_df, aes(x= Estimates, fill=Strategy))+
  geom_density(alpha=0.5)+
  theme_minimal()
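
As a complement to the density plot, a single large draw gives a rough sense of where the OLS and single-instrument IV estimators center. The sketch below uses the ratio-of-covariances form of the IV slope, which coincides with the just-identified estimate from ivreg when there is one instrument and an intercept. The seed and sample size are arbitrary assumptions.

```r
# One large draw from the DGP in fun_iter (base R only; n is arbitrary).
set.seed(2)
n   <- 2e5
z_1 <- runif(n, min = -5, max = 5)
z_2 <- runif(n, min = -5, max = 5)
v   <- rnorm(n, sd = 1)
w   <- rnorm(n, sd = 2)
p   <- rbinom(n, size = 1, prob = 0.2)
u   <- rnorm(n, sd = 1)
x   <- 1 + z_1 + z_2 - 2 * (z_2 * p) + v + w
y   <- 2 + 3 * x + (x * p) + 2 * u + w

# OLS slope and the two single-instrument IV slopes.
ols_slope <- cov(x, y) / var(x)
iv1_slope <- cov(z_1, y) / cov(z_1, x)
iv2_slope <- cov(z_2, y) / cov(z_2, x)

c(OLS = ols_slope, IV1 = iv1_slope, IV2 = iv2_slope)
```

Comparing these three numbers with the centers of the corresponding densities is a quick sanity check that the plotted distributions are located where a large sample says they should be.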

2.04

cov(variables_df$z_1, variables_df$p)

[1] 0.000605615

cov(variables_df$z_2, variables_df$p)

[1] -0.0003871567

The sample covariances between each instrument $z_i$ and $p_i$ are not exactly zero, although they are very small (on the order of $10^{-4}$). A nonzero covariance is one of the assumptions we need. We also need the covariance between $z_i$ and the other determinants of $y_i$ to be zero.
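
The relevance condition for an instrument is usually stated in terms of the endogenous regressor, here $x$. A first-stage regression of $x$ on both instruments is one way to look at it. This is a sketch on a fresh draw from the same DGP; the seed and sample size are arbitrary assumptions.

```r
# One large draw from the DGP in fun_iter (base R only; n is arbitrary).
set.seed(3)
n   <- 1e5
z_1 <- runif(n, min = -5, max = 5)
z_2 <- runif(n, min = -5, max = 5)
v   <- rnorm(n, sd = 1)
w   <- rnorm(n, sd = 2)
p   <- rbinom(n, size = 1, prob = 0.2)
x   <- 1 + z_1 + z_2 - 2 * (z_2 * p) + v + w

# First stage: how strongly do the instruments move x?
first_stage <- lm(x ~ z_1 + z_2)
fs <- coef(first_stage)
fs
```

Under the DGP the coefficient on z_1 is 1, and the average coefficient on z_2 is 1 - 2 * 0.2 = 0.6, because the sign of the z_2 term flips for the 20% of observations with p = 1.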

cov(variables_df$z_1, variables_df$w)

[1] -0.001942507

cov(variables_df$z_1, variables_df$u)

[1] -0.001057883

cov(variables_df$z_2, variables_df$w)

[1] 0.0008928576

cov(variables_df$z_2, variables_df$u)

[1] 0.0008750567

As we can see, the sample covariances between $z_i$ and the other determinants of $y_i$ are not exactly zero, though they are small (on the order of $10^{-3}$ or less), which means we likely have an invalid instrument and, furthermore, an inconsistent estimate. The reason some estimates are more biased than others is likely that some instruments are more “off-base”; for example, the covariances between $z_2$ and the other determinants of $y_i$ are smaller than those for $z_1$. So an estimate like IV2 will be more biased than IV1, and TwoSLS2 more biased than TwoSLS1. (I am guessing on this next part.) I think it may be because we are compounding the issue of the invalid instruments by using more of them.
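
One way to put covariances of this size in context is to compare them with the sampling noise expected when two variables are independent. The sketch below assumes the draws are independent as coded in fun_iter and uses the large-sample approximation that the standard error of a sample covariance between independent variables is the product of their standard deviations divided by the square root of the sample size. The helper se_cov is hypothetical.

```r
# Pooled sample size: 1e4 simulation draws x 1000 observations each.
n_total <- 1e4 * 1000

# Standard deviations implied by the DGP in fun_iter.
sd_z <- 10 / sqrt(12)     # Uniform(-5, 5)
sd_w <- 2
sd_u <- 1
sd_p <- sqrt(0.2 * 0.8)   # Bernoulli(0.2)

# Approximate standard error of a sample covariance between
# two independent variables (hypothetical helper).
se_cov <- function(sd_a, sd_b, n) sd_a * sd_b / sqrt(n)

noise_scale <- c(z_w = se_cov(sd_z, sd_w, n_total),
                 z_u = se_cov(sd_z, sd_u, n_total),
                 z_p = se_cov(sd_z, sd_p, n_total))
noise_scale
```

Sample covariances within about two of these standard errors of zero are hard to distinguish from sampling noise, so the printed values are a useful yardstick for the covariances reported in this section.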

2.05

I would choose IV2, where we regress $y$ on $x$ and use $z_2$ as an instrument. While we don’t have any perfect instruments, this is our “strongest” instrument in the sense that it likely provides the least biased estimate. Here, I am defining best as the least biased.