Undergrad Metrics - PS8

library(dplyr)  # for %>%, filter(), mutate() (also used in later chunks)
insert_data_path <- "~/Dropbox/teaching/undergrad_econometrics_spring_2023/data"
df = read.csv(glue::glue("{insert_data_path}/income_survey_2011.csv"))[-1]
age_list <- c("25-34", "35-44", "45-54")
workstat_list <- c("salaried worker", "self-employed", "self-employed and salaried worker")
df <- df %>%
  filter(age %in% age_list & workstat %in% workstat_list & weekwhrs > 0) %>%
  mutate(sal_hour = (incsal / mwweeks) / weekwhrs,
         ln_sal_hour = log(sal_hour)) %>%  # natural log (base = exp(1) is the default)
  mutate(schooly = ifelse(schooly > 50, NA, schooly),  # code implausible values as missing
         diploma = ifelse(diploma > 7, NA, diploma))
df$f_diploma <- factor(df$diploma, levels = c(1,2,3,4,5,6,7))
levels(df$f_diploma) <- c('PRIMARY','SECONDARY','BAGRUT','POST','BA','MA','PHD')
Question 1
Q1.A
Denote completion of the program by \(finished_i\), which equals 1 if \(i\) finished the program and 0 if \(i\) dropped out. Consider the model \[log(wage_i) = \alpha + \beta finished_i + u_i\]
What is \(\beta\) here? What is \(\alpha\)?
- \(\alpha = E[log(wage)|finished_i = 0]\)
- \(\beta = E[log(wage)|finished_i = 1] - E[log(wage)|finished_i = 0]\)
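A quick simulated check (hypothetical data, not the problem-set file) that OLS on a single dummy recovers exactly these conditional means:

# simulate log wages with a group shift of 0.3
set.seed(1)
n <- 10000
finished <- rbinom(n, 1, 0.5)
log_wage <- 2 + 0.3 * finished + rnorm(n)
coef(lm(log_wage ~ finished))    # intercept ~ 2, slope ~ 0.3
tapply(log_wage, finished, mean) # group means: ~2 and ~2.3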
Q1.B
- What if we also had years of education (educ)?
- Assume that drop-outs have less education, i.e. \(cor(educ, finished_i) > 0\).
- It also seems reasonable that \(cor(log(wage), educ) > 0\), even taking \(finished\) into account.
- Can we sign the omitted variable bias in the above model?
- Yes! \(\beta\) is upward-biased.
For a more in-depth discussion on signing omitted variables bias, see previous TA sessions.
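As a reminder, the standard omitted-variable-bias algebra (with \(\pi_1\) the slope from an auxiliary regression of \(educ\) on \(finished\)):
\[
log(wage_i) = \alpha + \beta finished_i + \gamma educ_i + u_i, \qquad educ_i = \pi_0 + \pi_1 finished_i + v_i
\]
\[
plim\ \hat{\beta}_{short} = \beta + \gamma \pi_1 > \beta, \quad \text{since } \gamma > 0 \text{ and } \pi_1 > 0
\]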
Question 2
We estimate a model \[ln(wage) = \beta_0 + \beta_1D_1 + \beta_2D_2 + \beta_3D_3 + \gamma_1exp + \gamma_2exp^2+u\]
Where
- \(D_1\) dummy variable assigned 1 if women and high education
- \(D_2\) dummy variable assigned 1 if women and low education
- \(D_3\) dummy variable assigned 1 if men and high education
What's the base (omitted) category?
- Men with low education…
Say we want to test whether returns to higher education vary with gender.
- What is the formal statement we wish to test here?
- Let's go variable by variable, taking expectations (for the dummies, holding \(exp\) fixed).
- Start with \(\beta_0\): \[\beta_0 = E[ln(wage)|\text{Men with low education}, exp = 0]\]
- Continue with \(\beta_3\): \[\beta_3 = E[ln(wage)|\text{Men with high education}] - E[ln(wage)|\text{Men with low education}]\]
- Now \(\beta_2\): \[ \beta_2 = E[ln(wage)|\text{Women with low education}] - E[ln(wage)|\text{Men with low education}] \]
- Lastly, \(\beta_1\): \[ \beta_1 = E[ln(wage)|\text{Women with high education}] - E[ln(wage)|\text{Men with low education}] \]
What are the returns to higher education for men?
- Easy: \(\beta_3\).
What are the returns to higher education for women?
- Start with writing them down: \[ E[ln(wage)|\text{Women with high education}] - E[ln(wage)|\text{Women with low education}] \]
- We can see this equals \(\beta_1 - \beta_2\).
So, if we want to test that returns to higher education do not vary with gender, formally we test the hypothesis that \[H_0: \beta_1 - \beta_2 = \beta_3\]
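As a sketch of how this restriction could be tested in R with the car package (hypothetical object and variable names, since this model is not estimated in the problem set):

# hypothetical fit: mod <- lm(ln_wage ~ D1 + D2 + D3 + exp + I(exp^2), data = df)
# H0: beta_1 - beta_2 = beta_3, rewritten as beta_1 - beta_2 - beta_3 = 0
car::linearHypothesis(mod, "D1 - D2 - D3 = 0")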
Question 3
Load the data and clean the data frame (see the setup chunk above)
Fit models
mod1 <- lm(ln_sal_hour ~ schooly, data = df)
mod2 <- lm(ln_sal_hour ~ diploma, data = df)
mod3 <- lm(ln_sal_hour ~ f_diploma, data = df)
modelsummary::modelsummary(models = list(mod1, mod2, mod3), gof_map = NA)

| | (1) | (2) | (3) |
|---|---|---|---|
| (Intercept) | 2.639 | 3.076 | 3.332 |
| | (0.022) | (0.013) | (0.017) |
| schooly | 0.077 | | |
| | (0.001) | | |
| diploma | | 0.178 | |
| | | (0.003) | |
| f_diplomaSECONDARY | | | 0.137 |
| | | | (0.021) |
| f_diplomaBAGRUT | | | 0.243 |
| | | | (0.020) |
| f_diplomaPOST | | | 0.341 |
| | | | (0.021) |
| f_diplomaBA | | | 0.643 |
| | | | (0.020) |
| f_diplomaMA | | | 0.884 |
| | | | (0.022) |
| f_diplomaPHD | | | 1.141 |
| | | | (0.049) |
3.A
Calculate the difference in log hourly wage between high-school (HS) and an undergraduate degree (BA).
# Using model 1
## assume HS is 12 years, and BA is 15, then
mod1$coefficients
(Intercept)     schooly 
 2.63934657  0.07705718 
return_BA_mod1 = 3 * mod1$coefficients[2]
return_BA_mod1
  schooly 
0.2311715 
# Using model 3
## diploma = 3 is HS + bagrut, and
## diploma = 5 is BA
mod3$coefficients
       (Intercept) f_diplomaSECONDARY    f_diplomaBAGRUT      f_diplomaPOST 
         3.3323500          0.1370899          0.2430836          0.3410482 
       f_diplomaBA        f_diplomaMA       f_diplomaPHD 
         0.6434942          0.8839584          1.1407033 
return_BA_mod3 = mod3$coefficients[5] - mod3$coefficients[3]
return_BA_mod3
f_diplomaBA 
  0.4004106 
3.B
Calculate the difference in log hourly wage between BA and PHD (it's worth it…).
# Using model 1
## assume BA is 15, and PHD is 22 years (2 years MA and 5 years PHD)
mod1$coefficients
(Intercept)     schooly 
 2.63934657  0.07705718 
return_PHD_mod1 = 7 * mod1$coefficients[2]
return_PHD_mod1
  schooly 
0.5394003 
# Using model 3
## diploma = 5 is BA
## diploma = 7 is PHD
mod3$coefficients
       (Intercept) f_diplomaSECONDARY    f_diplomaBAGRUT      f_diplomaPOST 
         3.3323500          0.1370899          0.2430836          0.3410482 
       f_diplomaBA        f_diplomaMA       f_diplomaPHD 
         0.6434942          0.8839584          1.1407033 
return_PHD_mod3 = mod3$coefficients[7] - mod3$coefficients[5]
return_PHD_mod3
f_diplomaPHD 
   0.4972091 
3.C and 3.D
What are the different assumptions in these models?
- Model 1 assumes constant returns to years of schooling, regardless of which years they are.
- E.g., the same return to one more year in high school as to one more year in university.
- Model 2 assumes constant returns across diplomas, regardless of which diploma.
- E.g., moving from BA to MA pays the same as moving from MA to PHD.
- Model 3 does not impose any structure across diplomas, but does assume that years of schooling within a diploma do not affect wages.
- E.g., a HS diploma alone pays the same as a HS diploma plus 2 years in university (without a BA diploma). A test of model 2 against model 3 is sketched below.
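Since the linear-in-diploma model (2) is nested in the diploma-dummies model (3), an F-test can compare them. A minimal sketch, refitting both on a common complete-case sample (lm() drops NAs per model, so the estimation samples could otherwise differ):

# H0: returns are linear in diploma (model 2) vs. unrestricted dummies (model 3)
df_cc <- df[!is.na(df$diploma) & !is.na(df$ln_sal_hour), ]
mod2_cc <- lm(ln_sal_hour ~ diploma, data = df_cc)
mod3_cc <- lm(ln_sal_hour ~ f_diploma, data = df_cc)
anova(mod2_cc, mod3_cc)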
Question 4
The data are the same as in Q3.
4.A
How much higher is men's hourly wage compared to women's?
# create female dummy variable
df$female = 1*(df$sex == "female")
# estimate model
mod1 <- lm(ln_sal_hour ~ female, data = df)
modelsummary::modelsummary(models = list(mod1), gof_map = NA)

| | (1) |
|---|---|
| (Intercept) | 3.801 |
| | (0.008) |
| female | −0.135 |
| | (0.011) |
That is, women's hourly wages are roughly 13% lower than men's, on average.
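Strictly, a log-point difference is only approximately a percentage difference; the exact figure implied by the estimate is
\[
e^{-0.135} - 1 \approx -12.6\%
\]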
4.B
What if we control for age?
# create age variable
df$age = df$year - df$birthy
# estimate model
## note: inside an R formula, age^2 is interpreted as just age, so only a
## linear age term enters here; use I(age^2) or poly(age, 2) (from stats)
## for a true quadratic (see the sketch below the table)
mod2 <- lm(ln_sal_hour ~ female + age + age^2, data = df)
modelsummary::modelsummary(models = list(mod1, mod2), gof_map = NA)

| | (1) | (2) |
|---|---|---|
| (Intercept) | 3.801 | 3.330 |
| | (0.008) | (0.025) |
| female | −0.135 | −0.140 |
| | (0.011) | (0.011) |
| age | | 0.012 |
| | | (0.001) |
Pretty similar.
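If a true quadratic in age is wanted, a minimal sketch (not run here, so its estimates do not appear in the tables above):

# a genuine quadratic requires I() or poly(); a bare age^2 collapses to age
mod2_quad <- lm(ln_sal_hour ~ female + age + I(age^2), data = df)
## stats::poly(age, 2) is an equivalent alternative, but errors if age has NAs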
4.C
What if we control for education?
# estimate model
## age^2 again enters only linearly (see the note in 4.B)
mod3 <- lm(ln_sal_hour ~ female + age + age^2 + schooly, data = df)
modelsummary::modelsummary(models = list(mod1, mod2, mod3), gof_map = NA)

| | (1) | (2) | (3) |
|---|---|---|---|
| (Intercept) | 3.801 | 3.330 | 2.178 |
| | (0.008) | (0.025) | (0.031) |
| female | −0.135 | −0.140 | −0.165 |
| | (0.011) | (0.011) | (0.010) |
| age | | 0.012 | 0.013 |
| | | (0.001) | (0.001) |
| schooly | | | 0.079 |
| | | | (0.001) |
4.D
What if we interact female with marital status (single)?
# create single dummy variable
df$single = 1*(df$marital == 1)
# estimate model
## age^2 again enters only linearly (see the note in 4.B)
mod4 <- lm(ln_sal_hour ~ female*single + age + age^2 + schooly, data = df)
modelsummary::modelsummary(models = list(mod1, mod2, mod3, mod4), gof_map = NA)

| | (1) | (2) | (3) | (4) |
|---|---|---|---|---|
| (Intercept) | 3.801 | 3.330 | 2.178 | 2.153 |
| | (0.008) | (0.025) | (0.031) | (0.031) |
| female | −0.135 | −0.140 | −0.165 | −0.129 |
| | (0.011) | (0.011) | (0.010) | (0.018) |
| age | | 0.012 | 0.013 | 0.011 |
| | | (0.001) | (0.001) | (0.001) |
| schooly | | | 0.079 | 0.078 |
| | | | (0.001) | (0.001) |
| single | | | | 0.197 |
| | | | | (0.016) |
| female × single | | | | −0.036 |
| | | | | (0.021) |
So the wage gap between single men and women is \[\beta_{female} + \beta_{female \times single} \approx -12.9\% - 3.6\% \approx -16.5\%\] while between married (non-single) men and women it is \[\beta_{female} \approx -12.9\%\]
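A minimal sketch of reading these gaps straight off mod4 (coefficient names as lm() creates them for female*single):

# gap for singles: main effect plus interaction (~ -0.165)
coef(mod4)["female"] + coef(mod4)["female:single"]
# gap for non-singles: main effect only (~ -0.129)
coef(mod4)["female"]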
Question 5
Load the data
df = read.csv(glue::glue("{insert_data_path}/census_83_95.csv"))[-1] %>%
  mutate(year_95 = 1*(year == 95))
# keep only males
df = df %>% filter(female == 0)

Estimate the model

summary(lm(ln_wage_hr ~ year_95*feram, data=df))$coefficients[,1:3]
                 Estimate  Std. Error   t value
(Intercept)    3.21011785 0.005369162 597.88066
year_95        0.19205673 0.007975861  24.07975
feram          0.34219960 0.007816615  43.77849
year_95:feram -0.03071876 0.011836026  -2.59536
5.A
Write out the model \[ln(wage) = \alpha + \beta_1 year\_95 + \beta_2 feram + \beta_3 year\_95\times feram + u\] Let's go over the coefficients one by one:
- \(\alpha\): the base category, Mizrahi in year 83
- \(\beta_1\): the addition to log wages in year 95 (for Mizrahi)
- \(\beta_2\): the addition to log wages if Ashkenazi (in year 83)
- \(\beta_3\): the additional change in log wages if both Ashkenazi and year 95
5.B
How can we summarize this in a table?
library(kableExtra)  # for kable_styling() and add_header_above()
table_df = data.frame(ethnic = c("Mizrahi", "Ashkenazi", "Diff-Ethnic:"),
                      year_83 = c("$\\alpha$", "$\\alpha + \\beta_2$", "$\\beta_2$"),
                      year_95 = c("$\\alpha + \\beta_1$", "$\\alpha + \\beta_1 + \\beta_2 + \\beta_3$", "$\\beta_2 + \\beta_3$"),
                      year_diff = c("$\\beta_1$", "$\\beta_1 + \\beta_3$", "$\\beta_3$")
)
knitr::kable(table_df, align = c("l", rep("c", times = 3)), col.names = c("Ethnic", "Year 83", "Year 95", "Diff-Year"), escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  add_header_above(c(" " = 1, "Year" = 3))

| Ethnic | Year 83 | Year 95 | Diff-Year |
|---|---|---|---|
| Mizrahi | $\alpha$ | $\alpha + \beta_1$ | $\beta_1$ |
| Ashkenazi | $\alpha + \beta_2$ | $\alpha + \beta_1 + \beta_2 + \beta_3$ | $\beta_1 + \beta_3$ |
| Diff-Ethnic: | $\beta_2$ | $\beta_2 + \beta_3$ | $\beta_3$ |
So if we want to test that the wage-gap between ethnic groups did not change across years, we need to test that \[H_0: \beta_3 = 0\]
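This is exactly the t-test on the interaction term already printed above. A minimal sketch of pulling out that row, reusing the model from the start of this question:

did_mod <- lm(ln_wage_hr ~ year_95 * feram, data = df)
# H0: beta_3 = 0; from the output above t = -2.595, so H0 is rejected at the 5% level
summary(did_mod)$coefficients["year_95:feram", ]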
Question 6
Three models
Long, both men and women \[\begin{align} log(wages) &= \beta_0 + \beta_1 male + \beta_2 exper + \beta_4 educ \\ & + \beta_3 male\times exper + \beta_5 male\times educ + \beta_6 exper\times educ + u \end{align}\]
Short, only women (writing \(W := log(wages)\)) \[\begin{equation} W = \delta_0 + \delta_1 exper + \delta_2 educ + \delta_3 educ \times exper + u \mid sex = female \end{equation}\]
Short, only men \[\begin{equation} W = \gamma_0 + \gamma_1 exper + \gamma_2 educ + \gamma_3 educ \times exper + u \mid sex = male \end{equation}\]
Can we compare estimates between model (1) and models (2)-(3)?
- No.
- Why?
- What's missing in model (1)?
Let's go over the coefficients:
- Since \(male\) is the dummy in the long model, \(female\) is the omitted category.
- So \(\beta_0\) is comparable to \(\delta_0\), and \(\beta_0 + \beta_1\) to \(\gamma_0\).
- \(\beta_2\) is comparable to \(\delta_1\), and \(\beta_0 + \beta_1 + \beta_2 + \beta_3\) to \(\gamma_0 + \gamma_1\).
- Finally, \(\beta_6\) is comparable to \(\delta_3\), but
- to what is \(\gamma_3\) comparable? Also \(\beta_6\)! (See the expansion below.)
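To see the mapping explicitly, set \(male = 1\) in the long model and collect terms:
\[
E[log(wages) \mid male = 1] = (\beta_0 + \beta_1) + (\beta_2 + \beta_3)exper + (\beta_4 + \beta_5)educ + \beta_6\, exper \times educ
\]
so \(\gamma_0 \leftrightarrow \beta_0 + \beta_1\), \(\gamma_1 \leftrightarrow \beta_2 + \beta_3\), \(\gamma_2 \leftrightarrow \beta_4 + \beta_5\), and \(\gamma_3 \leftrightarrow \beta_6\): the long model forces the men's and women's \(educ \times exper\) effects to coincide.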
That is,
- Model (1) assumes the effect of the \(exper \times educ\) interaction does not vary by gender.
- Estimating (2) and (3) separately allows \(\gamma_3 \neq \delta_3\), i.e. a different effect of \(exper\times educ\) on \(log(wage)\) by gender.
Hence:
- Different model assumptions \(\rightarrow\) different coefficients.
Let's see this in action.
df = read.csv(glue::glue("{insert_data_path}/wage1.csv"))[-1] %>%
mutate(male = 1-female)
# begin with long and short models that are similar
## long model (fully interacted, including the triple interaction)
summary(lm(lwage ~ male*exper*educ, data = df))$coefficients[,1]
    (Intercept)            male           exper            educ      male:exper 
    0.008177848     0.482688153     0.020875378     0.109035469    -0.016116390 
      male:educ      exper:educ male:exper:educ 
   -0.026105056    -0.001426622     0.002278224 
## two short models
summary(lm(lwage ~ exper*educ, data = df[df$male == 1,]))$coefficients[,1]
 (Intercept)        exper         educ   exper:educ 
0.4908660007 0.0047589882 0.0829304135 0.0008516023 
summary(lm(lwage ~ exper*educ, data = df[df$male == 0,]))$coefficients[,1]
 (Intercept)        exper         educ   exper:educ 
 0.008177848  0.020875378  0.109035469 -0.001426622 
# let's verify that the male short-model coefficients are recovered from the long model
long <- summary(lm(lwage ~ male*exper*educ, data = df))$coefficients[,1]
short_male <- summary(lm(lwage ~ exper*educ, data = df[df$male == 1,]))$coefficients[,1]
# first example: the male intercept
beta_0 = long[1]
beta_1 = long[2]
gamma_0 = short_male[1]
beta_0 + beta_1
(Intercept) 
   0.490866 
gamma_0
(Intercept) 
   0.490866 
# second example: intercept plus experience slope for men
beta_2 = long[3]
beta_3 = long[5]
gamma_1 = short_male[2]
beta_0 + beta_1 + beta_2 + beta_3
(Intercept) 
   0.495625 
gamma_0 + gamma_1
(Intercept) 
   0.495625 
# continue with long and short models that are slightly different
## long model (no triple interaction, as in model (1) above)
summary(lm(lwage ~ male + exper + educ +
             male:exper + male:educ + exper:educ, data = df))$coefficients[,1]
 (Intercept)          male         exper          educ    male:exper 
0.3605931725 -0.0789554069  0.0027943050  0.0802070683  0.0107731623 
   male:educ    exper:educ 
0.0191114997  0.0001124889 
## two short models
summary(lm(lwage ~ exper*educ, data = df[df$male == 1,]))$coefficients[,1]
 (Intercept)        exper         educ   exper:educ 
0.4908660007 0.0047589882 0.0829304135 0.0008516023 
summary(lm(lwage ~ exper*educ, data = df[df$male == 0,]))$coefficients[,1]
 (Intercept)        exper         educ   exper:educ 
 0.008177848  0.020875378  0.109035469 -0.001426622 
# this time the male short-model coefficients are NOT recovered from the long model
long <- summary(lm(lwage ~ male + exper + educ +
                     male:exper + male:educ + exper:educ, data = df))$coefficients[,1]
short_male <- summary(lm(lwage ~ exper*educ, data = df[df$male == 1,]))$coefficients[,1]
# first example
beta_0 = long[1]
beta_1 = long[2]
gamma_0 = short_male[1]
beta_0 + beta_1
(Intercept) 
  0.2816378 
gamma_0
(Intercept) 
   0.490866 
# second example
beta_2 = long[3]
beta_3 = long[5]
gamma_1 = short_male[2]
beta_0 + beta_1 + beta_2 + beta_3
(Intercept) 
  0.2952052 
gamma_0 + gamma_1
(Intercept) 
   0.495625 
Question 7
7.A
True. The base category is local firms, so \(\delta_1\) is the effect of capital on production for local firms.
7.B
False. ESS is sum of squared estimated errors (\(ESS:=\sum\hat{u}^2\)). This can be true, but only if the full model is fully interacted version of short models.
Whats missing in this example? That the intercept can be different between local and foreign firms (i.e., there is no D in the full model).
More precisely, when will this be true? If, and only if, the coefficients from running same model on sub-samples are equivalent to coefficients in total sample. Then the estimated errors are equivalent, and so sum of total sample is sum of sums within sub-samples.
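In symbols: if the full model reproduces every sub-sample fit, the residuals coincide observation by observation, so
\[
\hat{u}_i^{full} = \hat{u}_i^{sub} \ \ \forall i \quad \Longrightarrow \quad \sum_i (\hat{u}_i^{full})^2 = \sum_{i: D_i = 0} (\hat{u}_i^{sub})^2 + \sum_{i: D_i = 1} (\hat{u}_i^{sub})^2
\]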
Let's show an example in R.
N = 1000
X1 = 5*runif(N)
X2 = 5*runif(N)
D = 1*(runif(N)>0.5) # indicator if foreign firm
beta1_local = 0.4
beta1_foreign = 0.45
beta2_local = 0.6
beta2_foreign = 0.55
df <- tibble(X1, X2, D) |>
mutate(
log_Y = D * log(X1) * beta1_foreign + (1-D) * log(X1) * beta1_local +
D * log(X2) * beta2_foreign + (1-D) * log(X2) * beta2_local + rnorm(N),
log_X1 = log(X1), log_X2 = log(X2)
)
mod_full <- lm(log_Y ~ log_X1 + log_X1:D + D + log_X2 + log_X2:D, data = df)
mod_full_wo_D <- lm(log_Y ~ log_X1 + log_X1:D + log_X2 + log_X2:D, data = df)
mod_short_local <- lm(log_Y ~ log_X1 + log_X2, data = df |> filter(D == 0))
mod_short_foreign <- lm(log_Y ~ log_X1 + log_X2, data = df |> filter(D == 1))
ess_long = sum(residuals(mod_full)^2)
ess_long_wo_D = sum(residuals(mod_full_wo_D)^2)
ess_short_local = sum(residuals(mod_short_local)^2)
ess_short_foreign = sum(residuals(mod_short_foreign)^2)
ess_long
[1] 1038.606
ess_long_wo_D
[1] 1038.931
ess_short_foreign + ess_short_local
[1] 1038.606
7.C
False. For a foreign firm, capital productivity is \(\delta_1 + \delta_3\); a quick check follows below.
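A minimal sketch, reusing mod_full from the simulation above (coefficient names as lm() creates them): the foreign-firm capital elasticity adds the interaction to the main effect, and should land near the simulated 0.45.

# delta_1 + delta_3: capital elasticity for foreign firms
coef(mod_full)["log_X1"] + coef(mod_full)["log_X1:D"]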