Call:
lm(formula = log(salary) ~ GenderBinary, data = aea)
Residuals:
Min 1Q Median 3Q Max
-1.89278 -0.25878 -0.02406 0.29630 1.17091
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.01941 0.01229 978.257 < 2e-16 ***
GenderBinary -0.09536 0.02877 -3.314 0.000943 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4081 on 1347 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.008089, Adjusted R-squared: 0.007352
F-statistic: 10.98 on 1 and 1347 DF, p-value: 0.0009433
When talking about a wage gap, we typically like to use percentages. For that reason, I’d argue that the log(salary) regression makes more sense. We have two results here. Our first regression suggests that being female decreases salary by $19,346 on average, which is seemingly a large amount, but I’m not sure how much it really tells us without a baseline to compare it to. Alternatively, our log(salary) regression suggests that salaries are roughly 10% lower for females (a coefficient of -0.095 log points), which is easier to compare across settings. The other reason I suggest using the log regression is that our salary data are skewed to the right for both males and females, and taking logs reduces that skew. Overall, when talking about wage gaps, the log specification seems to be the best fit.
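As a small worked example of reading the log coefficient as a percentage, the sketch below converts the GenderBinary estimate from the output above into an exact percentage difference. The variable names are illustrative only; the only input taken from the write-up is the coefficient itself.

```r
# Coefficient on GenderBinary from the log(salary) regression output above
b_gender <- -0.09536

# Exact percentage difference implied by a coefficient measured in log points
pct_exact <- 100 * (exp(b_gender) - 1)

# First-order approximation that reads the coefficient directly as a percent
pct_approx <- 100 * b_gender

round(c(exact = pct_exact, approx = pct_approx), 2)
```

The exact figure comes out slightly smaller in magnitude than the approximation (a bit over 9% rather than 9.5%), which is why "roughly 10%" is best read as a rounded number.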
1.03
In this case, we are estimating the ATE, or average treatment effect, since we have no control variables. Realistically, we would like to control for other things: it is clear that gender is not the only thing that affects salary, and we likely need to control for education, field, and tenure, among other things.
1.04
reg1.04.01 <- lm(log(salary) ~ year, data = aea)
summary(reg1.04.01)
Call:
lm(formula = log(salary) ~ year, data = aea)
Residuals:
Min 1Q Median 3Q Max
-1.46452 -0.26954 -0.07358 0.25670 1.33729
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -80.617821 4.304081 -18.73 <2e-16 ***
year 0.045967 0.002136 21.52 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3534 on 1347 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.2558, Adjusted R-squared: 0.2553
F-statistic: 463.1 on 1 and 1347 DF, p-value: < 2.2e-16
reg1.04.02 <- lm(GenderBinary ~ year, data = aea)
summary(reg1.04.02)
Call:
lm(formula = GenderBinary ~ year, data = aea)
Residuals:
Min 1Q Median 3Q Max
-0.2003 -0.1958 -0.1869 -0.1422 0.8936
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.830200 4.685641 -1.885 0.0597 .
year 0.004473 0.002325 1.923 0.0546 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3857 on 1349 degrees of freedom
Multiple R-squared: 0.002735, Adjusted R-squared: 0.001996
F-statistic: 3.699 on 1 and 1349 DF, p-value: 0.05464
I would argue that omitting year would cause bias for several reasons (though this bias might be addressed in some way other than including year). First, the cost of living is likely higher in recent years than in earlier ones, and salaries have likely risen to adjust for that. Second, in earlier years, such as 1997, the minimum year in our data set, there likely weren’t as many women in the field of economics, and they may not have been treated as equally as they are today. Our results indicate that a one-year increase in year raises salary by about 4.6%. When we regress GenderBinary on year, a one-year increase raises the predicted probability of being female by about 0.004. Note that in the latter regression, neither the intercept nor the coefficient on year is statistically significant at the 5% level (both are significant only at the 10% level), which suggests that my idea of more women entering economics in later years may be incorrect. Overall, year has a positive relationship with log salary and a positive relationship with GenderBinary, so omitting year biases the coefficient on GenderBinary upward (toward zero, given that the coefficient is negative).
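The sign argument can be written out with the standard omitted-variable-bias decomposition. This is a sketch in notation that is not used elsewhere in the write-up: \(\hat{\gamma}_{\text{year}}\) is the coefficient on year in the regression that includes both regressors, and \(\hat{\delta}\) is the slope from regressing year on GenderBinary.

\[
\hat{\beta}^{\text{short}}_{\text{Gender}} \;=\; \hat{\beta}^{\text{long}}_{\text{Gender}} \;+\; \hat{\gamma}_{\text{year}}\,\hat{\delta}
\]

The slope \(\hat{\delta}\) has the same sign as the coefficient on year in reg1.04.02, because both inherit the sign of the sample covariance between year and GenderBinary. Treating the positive year coefficient in reg1.04.01 as a guide to the sign of \(\hat{\gamma}_{\text{year}}\) (an assumption, since that regression omits GenderBinary), the product \(\hat{\gamma}_{\text{year}}\,\hat{\delta}\) is positive, so the short-regression coefficient on GenderBinary sits above the long-regression one.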
Controlling for year does matter, and I would argue that year is a good control. In our initial regression in problem 1.02, the standard error on GenderBinary was about 0.029. With year fixed effects, it decreases to about 0.023, and with year as a numeric variable it is 0.025. Our intercept term had a standard error of 0.012 in the initial regression and 4.266 in our new regression. We also now see that being female decreases wages by about 12% on average, instead of the 10% seen before. While I think year is important, I believe there may be proxies that are better suited. Including it as a fixed effect may be a good idea, but I wouldn’t argue for including it as a numeric variable. By including year as a fixed effect, we are essentially controlling for anything that is common to everyone in a given year and that changes over time.
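To make the difference between the two ways of including year concrete, here is a minimal sketch on synthetic data (not the aea data). It uses base R `lm` with `factor(year)` as a stand-in for the `| year` fixed-effects syntax in `feols`; the data frame `toy`, its year range, and the coefficients used to generate it are all made up for illustration.

```r
set.seed(1)
n <- 500

# Synthetic stand-in for the aea data; every value here is hypothetical
toy <- data.frame(
  year = sample(1997:2006, n, replace = TRUE),
  GenderBinary = rbinom(n, size = 1, prob = 0.2)
)
toy$log_salary <- 11 + 0.05 * (toy$year - 1997) -
  0.1 * toy$GenderBinary + rnorm(n, sd = 0.3)

# Numeric year: a single slope, so log salary is forced to move linearly in year
fit_numeric <- lm(log_salary ~ GenderBinary + year, data = toy)

# Year fixed effects: one dummy per year (minus a baseline), so each year
# gets its own intercept shift
fit_fe <- lm(log_salary ~ GenderBinary + factor(year), data = toy)

length(coef(fit_numeric))
length(coef(fit_fe))
```

The numeric version spends one parameter on year, while the fixed-effects version spends one per year beyond the baseline, which is what lets it absorb year-specific shocks that do not follow a straight line.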
1.06
# Creating a years-since-PhD column
yearssincephd <- aea$year - aea$yearofphd
aea <- cbind(aea, yearssincephd)
reg1.06.01 <- feols(log(salary) ~ GenderBinary + yearssincephd | year, data = aea)
NOTE: 2 observations removed because of NA values (LHS: 2).
reg1.07.01
Dependent Var.: log(salary)
GenderBinary -0.0221 (0.0258)
I(yearssincephd>=0&yearssincephd<=9) -0.1699*** (0.0318)
I(yearssincephd>=10&yearssincephd<=19) -0.0176 (0.0366)
i_tenure 0.2430*** (0.0299)
GenderBinary x I(yearssincephd>=0&yearssincephd<=9) -0.0165 (0.0263)
GenderBinary x I(yearssincephd>=10&yearssincephd<=19) -0.0188 (0.0432)
business
school_id
phd_id
micro
macromoney
metrics
labor
io
public
enviroenergy
develop
intl
finance
health
expbehave
political
urban
econhist
educ
Fixed-Effects: -------------------
year Yes
________________________________________ ___________________
S.E.: Clustered by: year
Observations 1,349
R2 0.48201
Within R2 0.28186
reg1.07.02
Dependent Var.: log(salary)
GenderBinary -0.0188 (0.0268)
I(yearssincephd>=0&yearssincephd<=9) -0.1722*** (0.0337)
I(yearssincephd>=10&yearssincephd<=19) -0.0160 (0.0367)
i_tenure 0.2386*** (0.0333)
GenderBinary x I(yearssincephd>=0&yearssincephd<=9) -0.0213 (0.0284)
GenderBinary x I(yearssincephd>=10&yearssincephd<=19) -0.0223 (0.0444)
business 0.0512. (0.0264)
school_id
phd_id
micro
macromoney
metrics
labor
io
public
enviroenergy
develop
intl
finance
health
expbehave
political
urban
econhist
educ
Fixed-Effects: -------------------
year Yes
________________________________________ ___________________
S.E.: Clustered by: year
Observations 1,349
R2 0.48385
Within R2 0.28441
reg1.07.03
Dependent Var.: log(salary)
GenderBinary -0.0242 (0.0271)
I(yearssincephd>=0&yearssincephd<=9) -0.1773*** (0.0368)
I(yearssincephd>=10&yearssincephd<=19) -0.0193 (0.0383)
i_tenure 0.2372*** (0.0335)
GenderBinary x I(yearssincephd>=0&yearssincephd<=9) -0.0157 (0.0278)
GenderBinary x I(yearssincephd>=10&yearssincephd<=19) -0.0100 (0.0490)
business 0.1114*** (0.0266)
school_id 0.0006* (0.0003)
phd_id
micro
macromoney
metrics
labor
io
public
enviroenergy
develop
intl
finance
health
expbehave
political
urban
econhist
educ
Fixed-Effects: -------------------
year Yes
________________________________________ ___________________
S.E.: Clustered by: year
Observations 1,349
R2 0.49258
Within R2 0.29651
reg1.07.04
Dependent Var.: log(salary)
GenderBinary -0.0247 (0.0273)
I(yearssincephd>=0&yearssincephd<=9) -0.1783*** (0.0364)
I(yearssincephd>=10&yearssincephd<=19) -0.0202 (0.0387)
i_tenure 0.2364*** (0.0335)
GenderBinary x I(yearssincephd>=0&yearssincephd<=9) -0.0154 (0.0281)
GenderBinary x I(yearssincephd>=10&yearssincephd<=19) -0.0087 (0.0487)
business 0.1115*** (0.0270)
school_id 0.0007* (0.0003)
phd_id -0.0002. (8.76e-5)
micro
macromoney
metrics
labor
io
public
enviroenergy
develop
intl
finance
health
expbehave
political
urban
econhist
educ
Fixed-Effects: -------------------
year Yes
________________________________________ ___________________
S.E.: Clustered by: year
Observations 1,349
R2 0.49335
Within R2 0.29757
reg1.07.05
Dependent Var.: log(salary)
GenderBinary -0.0245 (0.0280)
I(yearssincephd>=0&yearssincephd<=9) -0.1896*** (0.0356)
I(yearssincephd>=10&yearssincephd<=19) -0.0289 (0.0340)
i_tenure 0.2338*** (0.0345)
GenderBinary x I(yearssincephd>=0&yearssincephd<=9) -0.0045 (0.0313)
GenderBinary x I(yearssincephd>=10&yearssincephd<=19) -0.0053 (0.0495)
business 0.1060** (0.0301)
school_id 0.0006. (0.0003)
phd_id -0.0001 (9.19e-5)
micro -0.0087 (0.0130)
macromoney 0.0263 (0.0202)
metrics 0.0420 (0.0280)
labor 0.0129 (0.0189)
io 0.0226 (0.0291)
public 0.0132 (0.0317)
enviroenergy 0.0131 (0.0196)
develop -0.0678* (0.0322)
intl 0.0324 (0.0215)
finance 0.0171 (0.0336)
health 0.0522* (0.0212)
expbehave 0.0746** (0.0255)
political -0.0066 (0.0183)
urban 0.1038. (0.0532)
econhist -0.0741 (0.0693)
educ 0.0515. (0.0250)
Fixed-Effects: -------------------
year Yes
________________________________________ ___________________
S.E.: Clustered by: year
Observations 1,349
R2 0.50380
Within R2 0.31206
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1.08
In some of our regressions (especially the last), I am concerned about multicollinearity and over-controlling, both of which could introduce selection bias. Since we are looking for gender’s effect on log salary, things such as tenure, years since PhD, and both the current institution and the PhD institution are certainly relevant. However, I am not convinced that field and the two institutions are as good as random. Field isn’t random because it depends on where they went and what they wanted to do in grad school. Furthermore, the institution they got their PhD from can’t be random because it depends on so many factors; people choose a school for a variety of reasons. Finally, their current institution is likely correlated with where they attended grad school. It is unlikely that someone like myself would go on to teach at a top-tier institution like Harvard or Yale after completing graduate school, so we certainly can’t consider that random either. In sum, I am concerned about randomization issues within some of our controls: even if they seem relevant, we are likely introducing selection bias.
1.09
Up to this point we have been using clustered standard errors, and I would continue to recommend cluster-robust standard errors because of the nature of our data. With variables like years since PhD and institution (both types), we likely have correlation within clusters. If two people got their Econ PhD from Oregon in the same year, we may expect their starting salaries to be somewhat similar. We must also consider the job market in a given year: it may be more or less competitive, potentially leading to higher or lower salaries. For these reasons, cluster-robust standard errors are the way to go.
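To illustrate why within-cluster correlation matters, the sketch below builds a cluster-robust sandwich variance by hand in base R on synthetic data with a shared shock per cluster. Everything here is hypothetical (the data, the cluster count, the variable names), and it omits the finite-sample corrections that packages typically apply, so its numbers may not match what a package such as fixest would report.

```r
set.seed(42)
G <- 20        # number of clusters (hypothetical)
n_per <- 25    # observations per cluster (hypothetical)
cl <- rep(seq_len(G), each = n_per)

# Each cluster shares a common shock, which induces within-cluster correlation
cluster_shock <- rnorm(G)[cl]
x <- rnorm(G * n_per)
y <- 1 + 0.5 * x + cluster_shock + rnorm(G * n_per)

fit <- lm(y ~ x)
X <- model.matrix(fit)
u <- resid(fit)

# Sandwich: (X'X)^{-1} [ sum_g X_g' u_g u_g' X_g ] (X'X)^{-1}
bread <- solve(crossprod(X))
scores <- rowsum(X * u, group = cl)  # row g holds X_g' u_g
meat <- crossprod(scores)
vcov_cluster <- bread %*% meat %*% bread

se_cluster <- sqrt(diag(vcov_cluster))
se_classical <- sqrt(diag(vcov(fit)))
rbind(classical = se_classical, cluster = se_cluster)
```

In this setup the classical formula treats all observations as independent, so it understates the uncertainty in the intercept, which is exactly the term that absorbs the shared cluster shocks.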
The specification that I chose is above. For the CIA to be valid, I need gender to be as good as random conditional on my covariates. My covariates are tenure, the school they are currently at, and the three groups of years since PhD, which allow for heterogeneous treatment effects. I think this CIA is the best one because it gets rid of some of the extras we had before, such as all of the possible fields and the school they got their PhD from. As described earlier, I think that where they got their PhD and where they currently are may be too highly correlated, and not random at all if we include both. My inference approach is as follows: our goal is to estimate the effect of gender on salary, and to do that we clearly need to control for things other than gender. If someone is tenured, that will certainly increase wages. If they are at a more prestigious school, that will increase wages. We also need to allow for heterogeneous treatment effects in order to account for differences among individuals. I also feel it is necessary to justify not including the specific field. In academia, I don’t think fields will be as good as random, because gender may influence why someone would choose a certain field, based on societal norms, among other things. Furthermore, I would argue that field won’t make too big of a difference in salary.
We see differences because the two specifications differ in nature, but all coefficients have the same sign, so our general conclusions remain the same. It is also worth noting that every coefficient that was statistically significant with log(salary) is still statistically significant without the log.
1.12
If I’m correct about my CIA, I do think a matching estimator would work, and I also think it would yield an interesting result. If we were to match treated (female) to untreated (non-female) observations, we could reduce dimensionality with propensity scores, which would ultimately lead to a better comparison. That said, one potential problem I see is our data set: I would like to control for other things such as cost of living, job performance (which tenure could partly capture), and potentially race and age.
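A minimal sketch of what propensity-score matching could look like, using synthetic data rather than the aea data. The covariates chosen, the coefficients that generate the data, and the one-nearest-neighbour matching with replacement are all assumptions made for illustration; this estimates the effect for the treated group only and is not a full matching analysis.

```r
set.seed(7)
n <- 400

# Synthetic stand-in for the aea data; every value here is hypothetical
toy <- data.frame(
  i_tenure = rbinom(n, size = 1, prob = 0.5),
  yearssincephd = sample(0:30, n, replace = TRUE)
)
toy$GenderBinary <- rbinom(n, size = 1,
                           prob = plogis(-1 - 0.03 * toy$yearssincephd))
toy$log_salary <- 11.5 + 0.24 * toy$i_tenure + 0.01 * toy$yearssincephd -
  0.05 * toy$GenderBinary + rnorm(n, sd = 0.3)

# Step 1: collapse the covariates into a single propensity score via a logit
ps_fit <- glm(GenderBinary ~ i_tenure + yearssincephd,
              family = binomial, data = toy)
toy$pscore <- fitted(ps_fit)

# Step 2: match each treated unit to the control with the closest score
treated <- toy[toy$GenderBinary == 1, ]
control <- toy[toy$GenderBinary == 0, ]
nearest <- vapply(treated$pscore,
                  function(p) which.min(abs(control$pscore - p)),
                  integer(1))

# Step 3: average the matched differences in log salary
att_hat <- mean(treated$log_salary - control$log_salary[nearest])
att_hat
```

The dimensionality reduction is in step 1: however many covariates enter the logit, matching in step 2 happens on one number per observation.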
Section 2 - Simulation
# Loading necessary packages for simulation
p_load("tibble")
p_load("purrr")
p_load("furrr")
Please note: I don’t know of any way to do this other than running the simulation first to generate the data needed to answer question 2.01. Further, I was originally doing it a different (far more stupid) way and ran two simulations: one for the variables to answer question 2.01, and one that did the regressions for 2.03. I took an educated guess (perhaps foolishly) that that couldn’t possibly be what you’re asking us to do. Finally, I (with the help of GPT) figured out how to make two separate data frames, one for the variables and one for the regressions.
# One simulation draw: generate the data, then fit the five estimators
fun_iter = function(iter, n = 1000) {
  dgp_df = tibble(
    z_1 = runif(n, min = -5, max = 5),
    z_2 = runif(n, min = -5, max = 5),
    v = rnorm(n, sd = 1),
    w = rnorm(n, sd = 2),
    p = rbinom(n, size = 1, prob = 0.2),
    u = rnorm(n, sd = 1),
    x = 1 + z_1 + z_2 - 2 * (z_2 * p) + v + w,
    y = 2 + 3 * x + (x * p) + 2 * u + w
  )
  OLS <- lm(y ~ x, data = dgp_df)
  IV1 <- ivreg(y ~ x | z_1, data = dgp_df)
  IV2 <- ivreg(y ~ x | z_2, data = dgp_df)
  TwoSLS1 <- ivreg(y ~ x | z_1 + z_2, data = dgp_df)
  TwoSLS2 <- ivreg(y ~ x | z_1 + z_2 + (z_2 * p), data = dgp_df)
  variables <- dgp_df
  estimates <- tibble(
    Strategy = c("OLS", "IV1", "IV2", "TwoSLS1", "TwoSLS2"),
    Estimates = c(
      OLS$coefficients[2], IV1$coefficients[2], IV2$coefficients[2],
      TwoSLS1$coefficients[2], TwoSLS2$coefficients[2]
    )
  )
  # Return both the raw draws and the slope estimates for this iteration
  return(list(variables = variables, estimates = estimates))
}
set.seed(98372)
plan(multisession)
sim_list <- future_map(1:1e4, fun_iter, .options = furrr_options(seed = TRUE))
variables_df <- bind_rows(map(sim_list, "variables"))
estimates_df <- bind_rows(map(sim_list, "estimates"))
2.01
The ATE will look at the difference between the mean of our outcome variable (\(y_i\)) conditional on our treated group (\(p=1\)) and the mean of our outcome variable conditional on our control group (\(p=0\)). Since \(p_i\) is randomly assigned (thanks to our simulation!) with probability 0.2, in mathematical terms: \(E[Y_{i}|p_i=1]-E[Y_{i}|p_i=0] = E[\tau_{i}]\)
A simple regression of \(y\) on \(x\) will fail to recover the ATE because \(x\) is endogenous. In our simulation, \(w\) appears in the equations for both \(x\) and \(y\), so \(x\) is correlated with the error term when we regress \(y\) on \(x\) alone. Because \(x\) is not exogenous, OLS gives a biased and inconsistent estimate of \(\beta_1\).
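As a cross-check against the data-generating process in fun_iter, the following short derivation may help. It is a sketch that treats \(x\) as the treatment of interest, which is an assumption about what the question intends. Collecting terms in the equation for \(y\):

\[
y_i = 2 + (3 + p_i)\,x_i + \underbrace{2u_i + w_i}_{\text{error}},
\qquad
\frac{\partial y_i}{\partial x_i} = 3 + p_i,
\qquad
E[3 + p_i] = 3 + 0.2 = 3.2
\]

Since \(x_i\) contains \(w_i\) with coefficient 1 and every other component of \(x_i\) is drawn independently of \(u_i\) and \(w_i\):

\[
\operatorname{Cov}(x_i,\; 2u_i + w_i) = \operatorname{Var}(w_i) = 2^2 = 4 \neq 0
\]

Under that reading, the average effect of a one-unit change in \(x\) is 3.2, and the nonzero covariance between \(x\) and the error term is what pushes the OLS slope away from it.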
We have just established that instrument \(z_i\) and treatment \(p_i\) have covariance \(\neq\) 0. This is good, and it is one of the assumptions we need. However, we also need the covariance between \(z_i\) and the other determinants of \(y_i\) to be zero.
cov(variables_df$z_1, variables_df$w)
[1] -0.001942507
cov(variables_df$z_1, variables_df$u)
[1] -0.001057883
cov(variables_df$z_2, variables_df$w)
[1] 0.0008928576
cov(variables_df$z_2, variables_df$u)
[1] 0.0008750567
As we can see, the sample covariances between \(z_i\) and the other determinants of \(y_i\) are not exactly zero, although they are all very small (on the order of \(10^{-3}\) or less). If they reflect a truly nonzero covariance, we likely have an invalid instrument and, furthermore, an inconsistent estimate. The reason some estimates are more biased than others is likely that they are more “off-base”; for example, the covariances between \(z_2\) and the other determinants of \(y_i\) are smaller than those for \(z_1\). So an estimate like IV2 will be more biased than IV1, and TwoSLS2 more biased than TwoSLS1. (I am guessing on this next part.) I think it may be because we are compounding the issue of the invalid instruments by using more of them.
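One way to gauge whether covariances this small are distinguishable from zero is to compare them with the sampling noise expected if the variables were truly independent. The sketch below uses the approximation that the standard error of a sample covariance between independent variables is about \(\sigma_z \sigma_e / \sqrt{N}\). The standard deviations come from the distributions in fun_iter; the variable names are illustrative only.

```r
# Total rows in variables_df: 1e4 iterations of n = 1000 draws each
n_total <- 1e4 * 1000

# Standard deviations implied by the DGP: z ~ U(-5, 5), w ~ N(0, 2^2), u ~ N(0, 1)
sd_z <- 10 / sqrt(12)
sd_w <- 2
sd_u <- 1

# Approximate standard error of a sample covariance under independence
se_cov_zw <- sd_z * sd_w / sqrt(n_total)
se_cov_zu <- sd_z * sd_u / sqrt(n_total)

# Sample covariances reported above, each divided by its standard error
observed <- c(z1_w = -0.001942507, z1_u = -0.001057883,
              z2_w = 0.0008928576, z2_u = 0.0008750567)
ratios <- observed / c(se_cov_zw, se_cov_zu, se_cov_zw, se_cov_zu)
round(ratios, 2)
```

Ratios well inside plus or minus 2 would be consistent with sampling noise around a true covariance of zero, so it may be worth double-checking whether these covariances are really evidence against the instruments.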
2.05
I would choose IV2, where we regress \(y\) on \(x\) and use \(z_2\) as an instrument. While we don’t have any perfect instruments, this is our “strongest” instrument in the sense that it likely provides the least biased estimate. Here, I am defining best as least biased.