Undergrad Metrics - PS8

library(dplyr)  # for %>%, filter(), mutate() (also used in later chunks)
insert_data_path <- "~/Dropbox/teaching/undergrad_econometrics_spring_2023/data"
df = read.csv(glue::glue("{insert_data_path}/income_survey_2011.csv"))[-1]
age_list <- c("25-34", "35-44", "45-54")
workstat_list <- c("salaried worker", "self-employed", "self-employed and salaried worker")
df <- df %>%
  filter(age %in% age_list & workstat %in% workstat_list & weekwhrs > 0) %>%
  mutate(sal_hour = (incsal / mwweeks) / weekwhrs,
         ln_sal_hour = log(sal_hour)) %>%  # natural log (base = exp(1) is the default)
  mutate(schooly = ifelse(schooly > 50, NA, schooly),  # code implausible values as missing
         diploma = ifelse(diploma > 7, NA, diploma))
df$f_diploma <- factor(df$diploma, levels = c(1,2,3,4,5,6,7))
levels(df$f_diploma) <- c('PRIMARY','SECONDARY','BAGRUT','POST','BA','MA','PHD')
Question 1
Q1.A
Denote completion of the program by \(finished_i\), which equals 1 if \(i\) finished the program and 0 if \(i\) dropped out. Consider the model \[log(wage_i) = \alpha + \beta finished_i + u_i\]
What is \(\beta\) here? What is \(\alpha\)?
- \(\alpha = E[log(wage)|finished_i = 0]\)
- \(\beta = E[log(wage)|finished_i = 1] - E[log(wage)|finished_i = 0]\)
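A quick simulated check (hypothetical data, not the problem-set file) that OLS on a single dummy recovers exactly these conditional means:

# simulate log wages with a group shift of 0.3
set.seed(1)
n <- 10000
finished <- rbinom(n, 1, 0.5)
log_wage <- 2 + 0.3 * finished + rnorm(n)
coef(lm(log_wage ~ finished))    # intercept ~ 2, slope ~ 0.3
tapply(log_wage, finished, mean) # group means: ~2 and ~2.3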
Q1.B
- What if we also had years of education (educ)?
- Assume that drop-outs have less education, i.e. \(cor(educ, finished_i) > 0\).
- It also seems reasonable that \(cor(log(wage), educ) > 0\), even taking \(finished\) into account.
- Can we sign the omitted variable bias in the above model?
- Yes! \(\beta\) is upward-biased.
For a more in-depth discussion on signing omitted variables bias, see previous TA sessions.
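As a reminder, the standard omitted-variable-bias algebra (with \(\pi_1\) the slope from an auxiliary regression of \(educ\) on \(finished\)):
\[
log(wage_i) = \alpha + \beta finished_i + \gamma educ_i + u_i, \qquad educ_i = \pi_0 + \pi_1 finished_i + v_i
\]
\[
plim\ \hat{\beta}_{short} = \beta + \gamma \pi_1 > \beta, \quad \text{since } \gamma > 0 \text{ and } \pi_1 > 0
\]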
Question 2
We estimate a model \[ln(wage) = \beta_0 + \beta_1D_1 + \beta_2D_2 + \beta_3D_3 + \gamma_1exp + \gamma_2exp^2+u\]
Where
- \(D_1\) dummy variable assigned 1 if women and high education
- \(D_2\) dummy variable assigned 1 if women and low education
- \(D_3\) dummy variable assigned 1 if men and high education
What's the base (omitted) category?
- Men with low education…
Say we want to test whether returns to higher education vary with gender.
- What is the formal statement we wish to test here?
- Let's go variable by variable, taking expectations (for the dummies, holding \(exp\) fixed).
- Start with \(\beta_0\): \[\beta_0 = E[ln(wage)|\text{Men with low education}, exp = 0]\]
- Continue with \(\beta_3\): \[\beta_3 = E[ln(wage)|\text{Men with high education}] - E[ln(wage)|\text{Men with low education}]\]
- Now \(\beta_2\): \[ \beta_2 = E[ln(wage)|\text{Women with low education}] - E[ln(wage)|\text{Men with low education}] \]
- Lastly, \(\beta_1\): \[ \beta_1 = E[ln(wage)|\text{Women with high education}] - E[ln(wage)|\text{Men with low education}] \]
What are the returns to higher education for men?
- Easy: \(\beta_3\).
What are the returns to higher education for women?
- Start with writing them down: \[ E[ln(wage)|\text{Women with high education}] - E[ln(wage)|\text{Women with low education}] \]
- We can see this equals \(\beta_1 - \beta_2\).
So, if we want to test that returns to higher education do not vary with gender, formally we test the hypothesis that \[H_0: \beta_1 - \beta_2 = \beta_3\]
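As a sketch of how this restriction could be tested in R with the car package (hypothetical object and variable names, since this model is not estimated in the problem set):

# hypothetical fit: mod <- lm(ln_wage ~ D1 + D2 + D3 + exp + I(exp^2), data = df)
# H0: beta_1 - beta_2 = beta_3, rewritten as beta_1 - beta_2 - beta_3 = 0
car::linearHypothesis(mod, "D1 - D2 - D3 = 0")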
Question 3
Load the data and clean the data frame (see the setup chunk above)
Fit models
mod1 <- lm(ln_sal_hour ~ schooly, data = df)
mod2 <- lm(ln_sal_hour ~ diploma, data = df)
mod3 <- lm(ln_sal_hour ~ f_diploma, data = df)
modelsummary::modelsummary(models = list(mod1, mod2, mod3), gof_map = NA)

| | (1) | (2) | (3) |
|---|---|---|---|
| (Intercept) | 2.639 | 3.076 | 3.332 |
| | (0.022) | (0.013) | (0.017) |
| schooly | 0.077 | | |
| | (0.001) | | |
| diploma | | 0.178 | |
| | | (0.003) | |
| f_diplomaSECONDARY | | | 0.137 |
| | | | (0.021) |
| f_diplomaBAGRUT | | | 0.243 |
| | | | (0.020) |
| f_diplomaPOST | | | 0.341 |
| | | | (0.021) |
| f_diplomaBA | | | 0.643 |
| | | | (0.020) |
| f_diplomaMA | | | 0.884 |
| | | | (0.022) |
| f_diplomaPHD | | | 1.141 |
| | | | (0.049) |
3.A
Calculate the difference in log hourly wage between high-school (HS) and an undergraduate degree (BA).
# Using model 1
## assume HS is 12 years, and BA is 15, then
mod1$coefficients
(Intercept)     schooly 
 2.63934657  0.07705718 
return_BA_mod1 = 3 * mod1$coefficients[2]
return_BA_mod1
  schooly 
0.2311715 
# Using model 3
## diploma = 3 is HS + bagrut, and
## diploma = 5 is BA
mod3$coefficients
       (Intercept) f_diplomaSECONDARY    f_diplomaBAGRUT      f_diplomaPOST 
         3.3323500          0.1370899          0.2430836          0.3410482 
       f_diplomaBA        f_diplomaMA       f_diplomaPHD 
         0.6434942          0.8839584          1.1407033 
return_BA_mod3 = mod3$coefficients[5] - mod3$coefficients[3]
return_BA_mod3
f_diplomaBA 
  0.4004106 
3.B
Calculate the difference in log hourly wage between BA and PHD (it's worth it…).
# Using model 1
## assume BA is 15, and PHD is 22 years (2 years MA and 5 years PHD)
mod1$coefficients
(Intercept)     schooly 
 2.63934657  0.07705718 
return_PHD_mod1 = 7 * mod1$coefficients[2]
return_PHD_mod1
  schooly 
0.5394003 
# Using model 3
## diploma = 5 is BA
## diploma = 7 is PHD
mod3$coefficients
       (Intercept) f_diplomaSECONDARY    f_diplomaBAGRUT      f_diplomaPOST 
         3.3323500          0.1370899          0.2430836          0.3410482 
       f_diplomaBA        f_diplomaMA       f_diplomaPHD 
         0.6434942          0.8839584          1.1407033 
return_PHD_mod3 = mod3$coefficients[7] - mod3$coefficients[5]
return_PHD_mod3
f_diplomaPHD 
   0.4972091 
3.C and 3.D
What are the different assumptions in these models?
- Model 1 assumes constant returns to years of schooling, regardless of which years they are.
- E.g., the same return to one more year in high school as to one more year in university.
- Model 2 assumes constant returns across diplomas, regardless of which diploma.
- E.g., moving from BA to MA pays the same as moving from MA to PHD.
- Model 3 does not impose any structure across diplomas, but does assume that years of schooling within a diploma do not affect wages.
- E.g., a HS diploma alone pays the same as a HS diploma plus 2 years in university (without a BA diploma). A test of model 2 against model 3 is sketched below.
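Since the linear-in-diploma model (2) is nested in the diploma-dummies model (3), an F-test can compare them. A minimal sketch, refitting both on a common complete-case sample (lm() drops NAs per model, so the estimation samples could otherwise differ):

# H0: returns are linear in diploma (model 2) vs. unrestricted dummies (model 3)
df_cc <- df[!is.na(df$diploma) & !is.na(df$ln_sal_hour), ]
mod2_cc <- lm(ln_sal_hour ~ diploma, data = df_cc)
mod3_cc <- lm(ln_sal_hour ~ f_diploma, data = df_cc)
anova(mod2_cc, mod3_cc)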
Question 4
The data are the same as in Q3.
4.A
How much higher is men's hourly wage compared to women's?
# create female dummy variable
df$female = 1*(df$sex == "female")
# estimate model
mod1 <- lm(ln_sal_hour ~ female, data = df)
modelsummary::modelsummary(models = list(mod1), gof_map = NA)

| | (1) |
|---|---|
| (Intercept) | 3.801 |
| | (0.008) |
| female | −0.135 |
| | (0.011) |
That is, women's hourly wages are roughly 13% lower than men's, on average.
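Strictly, a log-point difference is only approximately a percentage difference; the exact figure implied by the estimate is
\[
e^{-0.135} - 1 \approx -12.6\%
\]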
4.B
What if we control for age?
# create age variable
df$age = df$year - df$birthy
# estimate model
## note: inside an R formula, age^2 is interpreted as just age, so only a
## linear age term enters here; use I(age^2) or poly(age, 2) (from stats)
## for a true quadratic (see the sketch below the table)
mod2 <- lm(ln_sal_hour ~ female + age + age^2, data = df)
modelsummary::modelsummary(models = list(mod1, mod2), gof_map = NA)

| | (1) | (2) |
|---|---|---|
| (Intercept) | 3.801 | 3.330 |
| | (0.008) | (0.025) |
| female | −0.135 | −0.140 |
| | (0.011) | (0.011) |
| age | | 0.012 |
| | | (0.001) |
Pretty similar.
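If a true quadratic in age is wanted, a minimal sketch (not run here, so its estimates do not appear in the tables above):

# a genuine quadratic requires I() or poly(); a bare age^2 collapses to age
mod2_quad <- lm(ln_sal_hour ~ female + age + I(age^2), data = df)
## stats::poly(age, 2) is an equivalent alternative, but errors if age has NAs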
4.C
What if we control for education?
# estimate model
## age^2 again enters only linearly (see the note in 4.B)
mod3 <- lm(ln_sal_hour ~ female + age + age^2 + schooly, data = df)
modelsummary::modelsummary(models = list(mod1, mod2, mod3), gof_map = NA)

| | (1) | (2) | (3) |
|---|---|---|---|
| (Intercept) | 3.801 | 3.330 | 2.178 |
| | (0.008) | (0.025) | (0.031) |
| female | −0.135 | −0.140 | −0.165 |
| | (0.011) | (0.011) | (0.010) |
| age | | 0.012 | 0.013 |
| | | (0.001) | (0.001) |
| schooly | | | 0.079 |
| | | | (0.001) |
4.D
What if we interact female with marital status (single)?
# create single dummy variable
df$single = 1*(df$marital == 1)
# estimate model
## age^2 again enters only linearly (see the note in 4.B)
mod4 <- lm(ln_sal_hour ~ female*single + age + age^2 + schooly, data = df)
modelsummary::modelsummary(models = list(mod1, mod2, mod3, mod4), gof_map = NA)

| | (1) | (2) | (3) | (4) |
|---|---|---|---|---|
| (Intercept) | 3.801 | 3.330 | 2.178 | 2.153 |
| | (0.008) | (0.025) | (0.031) | (0.031) |
| female | −0.135 | −0.140 | −0.165 | −0.129 |
| | (0.011) | (0.011) | (0.010) | (0.018) |
| age | | 0.012 | 0.013 | 0.011 |
| | | (0.001) | (0.001) | (0.001) |
| schooly | | | 0.079 | 0.078 |
| | | | (0.001) | (0.001) |
| single | | | | 0.197 |
| | | | | (0.016) |
| female × single | | | | −0.036 |
| | | | | (0.021) |
So the wage gap between single men and women is \[\beta_{female} + \beta_{female \times single} \approx -12.9\% - 3.6\% \approx -16.5\%\] while between married (non-single) men and women it is \[\beta_{female} \approx -12.9\%\]
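A minimal sketch of reading these gaps straight off mod4 (coefficient names as lm() creates them for female*single):

# gap for singles: main effect plus interaction (~ -0.165)
coef(mod4)["female"] + coef(mod4)["female:single"]
# gap for non-singles: main effect only (~ -0.129)
coef(mod4)["female"]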
Question 5
Load the data
df = read.csv(glue::glue("{insert_data_path}/census_83_95.csv"))[-1] %>%
  mutate(year_95 = 1*(year == 95))
# keep only males
df = df %>% filter(female == 0)

Estimate the model

summary(lm(ln_wage_hr ~ year_95*feram, data=df))$coefficients[,1:3]
                 Estimate  Std. Error   t value
(Intercept)    3.21011785 0.005369162 597.88066
year_95        0.19205673 0.007975861  24.07975
feram          0.34219960 0.007816615  43.77849
year_95:feram -0.03071876 0.011836026  -2.59536
5.A
Write out the model \[ln(wage) = \alpha + \beta_1 year\_95 + \beta_2 feram + \beta_3 year\_95\times feram + u\] Let's go over the coefficients one by one:
- \(\alpha\): the base category, Mizrahi in year 83
- \(\beta_1\): the addition to log wages in year 95 (for Mizrahi)
- \(\beta_2\): the addition to log wages if Ashkenazi (in year 83)
- \(\beta_3\): the additional change in log wages if both Ashkenazi and year 95
5.B
How can we summarize this in a table?
library(kableExtra)  # for kable_styling() and add_header_above()
table_df = data.frame(ethnic = c("Mizrahi", "Ashkenazi", "Diff-Ethnic:"),
                      year_83 = c("$\\alpha$", "$\\alpha + \\beta_2$", "$\\beta_2$"),
                      year_95 = c("$\\alpha + \\beta_1$", "$\\alpha + \\beta_1 + \\beta_2 + \\beta_3$", "$\\beta_2 + \\beta_3$"),
                      year_diff = c("$\\beta_1$", "$\\beta_1 + \\beta_3$", "$\\beta_3$")
)
knitr::kable(table_df, align = c("l", rep("c", times = 3)), col.names = c("Ethnic", "Year 83", "Year 95", "Diff-Year"), escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  add_header_above(c(" " = 1, "Year" = 3))

| Ethnic | Year 83 | Year 95 | Diff-Year |
|---|---|---|---|
| Mizrahi | $\alpha$ | $\alpha + \beta_1$ | $\beta_1$ |
| Ashkenazi | $\alpha + \beta_2$ | $\alpha + \beta_1 + \beta_2 + \beta_3$ | $\beta_1 + \beta_3$ |
| Diff-Ethnic: | $\beta_2$ | $\beta_2 + \beta_3$ | $\beta_3$ |
So if we want to test that the wage-gap between ethnic groups did not change across years, we need to test that \[H_0: \beta_3 = 0\]
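This is exactly the t-test on the interaction term already printed above. A minimal sketch of pulling out that row, reusing the model from the start of this question:

did_mod <- lm(ln_wage_hr ~ year_95 * feram, data = df)
# H0: beta_3 = 0; from the output above t = -2.595, so H0 is rejected at the 5% level
summary(did_mod)$coefficients["year_95:feram", ]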
Question 6
Three models
Long, both men and women \[\begin{align} log(wages) &= \beta_0 + \beta_1 male + \beta_2 exper + \beta_4 educ \\ & + \beta_3 male\times exper + \beta_5 male\times educ + \beta_6 exper\times educ + u \end{align}\]
Short, only women (writing \(W := log(wages)\)) \[\begin{equation} W = \delta_0 + \delta_1 exper + \delta_2 educ + \delta_3 educ \times exper + u \mid sex = female \end{equation}\]
Short, only men \[\begin{equation} W = \gamma_0 + \gamma_1 exper + \gamma_2 educ + \gamma_3 educ \times exper + u \mid sex = male \end{equation}\]
Can we compare estimates between model (1) and models (2)-(3)?
- No.
- Why?
- What's missing in model (1)?
Let's go over the coefficients:
- Since \(male\) is the dummy in the long model, \(female\) is the omitted category.
- So \(\beta_0\) is comparable to \(\delta_0\), and \(\beta_0 + \beta_1\) to \(\gamma_0\).
- \(\beta_2\) is comparable to \(\delta_1\), and \(\beta_0 + \beta_1 + \beta_2 + \beta_3\) to \(\gamma_0 + \gamma_1\).
- Finally, \(\beta_6\) is comparable to \(\delta_3\), but
- to what is \(\gamma_3\) comparable? Also \(\beta_6\)! (See the expansion below.)
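To see the mapping explicitly, set \(male = 1\) in the long model and collect terms:
\[
E[log(wages) \mid male = 1] = (\beta_0 + \beta_1) + (\beta_2 + \beta_3)exper + (\beta_4 + \beta_5)educ + \beta_6\, exper \times educ
\]
so \(\gamma_0 \leftrightarrow \beta_0 + \beta_1\), \(\gamma_1 \leftrightarrow \beta_2 + \beta_3\), \(\gamma_2 \leftrightarrow \beta_4 + \beta_5\), and \(\gamma_3 \leftrightarrow \beta_6\): the long model forces the men's and women's \(educ \times exper\) effects to coincide.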
That is,
- Model (1) assumes the effect of the \(exper \times educ\) interaction does not vary by gender.
- Estimating (2) and (3) separately allows \(\gamma_3 \neq \delta_3\), i.e. a different effect of \(exper\times educ\) on \(log(wage)\) by gender.
Hence:
- Different model assumptions \(\rightarrow\) different coefficients.
Let's see this in action.
df = read.csv(glue::glue("{insert_data_path}/wage1.csv"))[-1] %>%
mutate(male = 1-female)
# begin with long and short models that are similar
## long model (fully interacted, including the triple interaction)
summary(lm(lwage ~ male*exper*educ, data = df))$coefficients[,1]
    (Intercept)            male           exper            educ      male:exper 
    0.008177848     0.482688153     0.020875378     0.109035469    -0.016116390 
      male:educ      exper:educ male:exper:educ 
   -0.026105056    -0.001426622     0.002278224 
## two short models
summary(lm(lwage ~ exper*educ, data = df[df$male == 1,]))$coefficients[,1]
 (Intercept)        exper         educ   exper:educ 
0.4908660007 0.0047589882 0.0829304135 0.0008516023 
summary(lm(lwage ~ exper*educ, data = df[df$male == 0,]))$coefficients[,1]
 (Intercept)        exper         educ   exper:educ 
 0.008177848  0.020875378  0.109035469 -0.001426622 
# let's verify that the male short-model coefficients are recovered from the long model
long <- summary(lm(lwage ~ male*exper*educ, data = df))$coefficients[,1]
short_male <- summary(lm(lwage ~ exper*educ, data = df[df$male == 1,]))$coefficients[,1]
# first example: the male intercept
beta_0 = long[1]
beta_1 = long[2]
gamma_0 = short_male[1]
beta_0 + beta_1
(Intercept) 
   0.490866 
gamma_0
(Intercept) 
   0.490866 
# second example: intercept plus experience slope for men
beta_2 = long[3]
beta_3 = long[5]
gamma_1 = short_male[2]
beta_0 + beta_1 + beta_2 + beta_3
(Intercept) 
   0.495625 
gamma_0 + gamma_1
(Intercept) 
   0.495625 
# continue with long and short models that are slightly different
## long model (no triple interaction, as in model (1) above)
summary(lm(lwage ~ male + exper + educ +
             male:exper + male:educ + exper:educ, data = df))$coefficients[,1]
 (Intercept)          male         exper          educ    male:exper 
0.3605931725 -0.0789554069  0.0027943050  0.0802070683  0.0107731623 
   male:educ    exper:educ 
0.0191114997  0.0001124889 
## two short models
summary(lm(lwage ~ exper*educ, data = df[df$male == 1,]))$coefficients[,1]
 (Intercept)        exper         educ   exper:educ 
0.4908660007 0.0047589882 0.0829304135 0.0008516023 
summary(lm(lwage ~ exper*educ, data = df[df$male == 0,]))$coefficients[,1]
 (Intercept)        exper         educ   exper:educ 
 0.008177848  0.020875378  0.109035469 -0.001426622 
# this time the male short-model coefficients are NOT recovered from the long model
long <- summary(lm(lwage ~ male + exper + educ +
                     male:exper + male:educ + exper:educ, data = df))$coefficients[,1]
short_male <- summary(lm(lwage ~ exper*educ, data = df[df$male == 1,]))$coefficients[,1]
# first example
beta_0 = long[1]
beta_1 = long[2]
gamma_0 = short_male[1]
beta_0 + beta_1
(Intercept) 
  0.2816378 
gamma_0
(Intercept) 
   0.490866 
# second example
beta_2 = long[3]
beta_3 = long[5]
gamma_1 = short_male[2]
beta_0 + beta_1 + beta_2 + beta_3
(Intercept) 
  0.2952052 
gamma_0 + gamma_1
(Intercept) 
   0.495625 
Question 7
7.A
True. The base category is local firms, so \(\delta_1\) is the effect of capital on production for local firms.
7.B
False. ESS is sum of squared estimated errors (\(ESS:=\sum\hat{u}^2\)). This can be true, but only if the full model is fully interacted version of short models.
Whats missing in this example? That the intercept can be different between local and foreign firms (i.e., there is no D in the full model).
More precisely, when will this be true? If, and only if, the coefficients from running same model on sub-samples are equivalent to coefficients in total sample. Then the estimated errors are equivalent, and so sum of total sample is sum of sums within sub-samples.
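In symbols: if the full model reproduces every sub-sample fit, the residuals coincide observation by observation, so
\[
\hat{u}_i^{full} = \hat{u}_i^{sub} \ \ \forall i \quad \Longrightarrow \quad \sum_i (\hat{u}_i^{full})^2 = \sum_{i: D_i = 0} (\hat{u}_i^{sub})^2 + \sum_{i: D_i = 1} (\hat{u}_i^{sub})^2
\]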
Let's show an example in R.
N = 1000
X1 = 5*runif(N)
X2 = 5*runif(N)
D = 1*(runif(N)>0.5) # indicator if foreign firm
beta1_local = 0.4
beta1_foreign = 0.45
beta2_local = 0.6
beta2_foreign = 0.55
df <- tibble(X1, X2, D) |>
mutate(
log_Y = D * log(X1) * beta1_foreign + (1-D) * log(X1) * beta1_local +
D * log(X2) * beta2_foreign + (1-D) * log(X2) * beta2_local + rnorm(N),
log_X1 = log(X1), log_X2 = log(X2)
)
mod_full <- lm(log_Y ~ log_X1 + log_X1:D + D + log_X2 + log_X2:D, data = df)
mod_full_wo_D <- lm(log_Y ~ log_X1 + log_X1:D + log_X2 + log_X2:D, data = df)
mod_short_local <- lm(log_Y ~ log_X1 + log_X2, data = df |> filter(D == 0))
mod_short_foreign <- lm(log_Y ~ log_X1 + log_X2, data = df |> filter(D == 1))
ess_long = sum(residuals(mod_full)^2)
ess_long_wo_D = sum(residuals(mod_full_wo_D)^2)
ess_short_local = sum(residuals(mod_short_local)^2)
ess_short_foreign = sum(residuals(mod_short_foreign)^2)
ess_long
[1] 1038.606
ess_long_wo_D
[1] 1038.931
ess_short_foreign + ess_short_local
[1] 1038.606
7.C
False. For a foreign firm, capital productivity is \(\delta_1 + \delta_3\); a quick check follows below.
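A minimal sketch, reusing mod_full from the simulation above (coefficient names as lm() creates them): the foreign-firm capital elasticity adds the interaction to the main effect, and should land near the simulated 0.45.

# delta_1 + delta_3: capital elasticity for foreign firms
coef(mod_full)["log_X1"] + coef(mod_full)["log_X1:D"]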