/l — title: Methods of Policy Analysis - The Future of Work subtitle: Quantitative HW Assignment - Blinder-Oaxaca author: Professor Lee Branstetter date: Spring 2026 output: pdf_document —

Using Blinder-Oaxaca to quantify the effects of discrimination

This quantitative HW assignment will require you to use statistical analysis software to estimate the Blinder-Oaxaca decomposition.

You will (once again) be using data from the Current Population Survey (CPS) “outgoing rotation files”, which are surveys undertaken in the final months in which a U.S. household is part of the CPS sample. These surveys record data on schooling, hourly wages, age, location, and respondent demographics. These data therefore tell us how wages vary according to education, experience, race, gender, and geography.

Two versions of this CPS data extract can be found on the class website. The first version is HW3_Data.dta, saved in STATA format for those who are familiar with the STATA statistical software package and wish to use it to complete the homework assignment. The second version is HW3_Data.csv, saved as a text file in CSV format. You should be able to read this version into R or any other statistical software package.

This dataset has already been “preprocessed” from the original raw data files to make this assignment more straightforward. Here are the data it contains:

VARIABLE NAME	DEFINITION
IDENTIFIER	Anonymous ID code that identifies the individual respondent
WAGE	Estimated hourly wage, based on survey responses
ED	Estimated years of education based on survey responses
EXP	Estimated years of work experience: \(Age-Schooling-6\)
EXP2	The square of EXP
FE	Respondent is female
BLACK	Respondent is Black
HISP	Respondent is Hispanic
WHITE	Respondent is White
OTHER	Resopndent is not White, Black, or Hispanic
NOHS	Respondent did not complete high school
POSTHS	Respondent received some post high school training
BACHELOR	Respondent completed a 4-year college degree
POSTGRAD	Respondent completed a degree beyond the 4-year bachelors’

Recall that one important feature of the wage data reported in the publicly disclosed CPS files is that it is “topcoded” - that is, the wages of the highest income earners are not publicly disclosed. This is done in order to protect the privacy of high income individuals whose identity might otherwise be deduced. Instead of recording the actual income of these respondents, the publicly disclosed data files record the average income of all high earning individuals.

# packages
library(tidyverse)
library(car) # for test of statistical equivalence
library(gap) # for chow test (may need to install)

## Warning: package 'gap' was built under R version 4.4.3

# suppress scientific notation
options(scipen = 4) # a handy bit of code to reduce scientific notation usage

# data
cps <- read_csv("HW3_Data.csv")

Exercise 1: [60 pts + 20 pts extra credit] Measuring the impact of racial discrimination on the earnings of Black workers

[5 pts] In the previous assignment, we used dummy variables to take a first look at the impact of racial and gender discrimination on worker earnings. Following your earlier work, (re)estimate the Mincer regression described below. Report your regression coefficients, standard errors, and t-statistics. \[LNWAGE_i = \alpha_0 + \beta_1 ED_i + \beta_2 EXP_i + \beta_3 EXP2_i + \beta_4 FE_i + \beta_5 BLACK_i + \beta_6 HISP_i + \beta_7 OTHER_i + \epsilon_i\] [5 pts extra credit] For extra credit, test and interpret the null hypothesis that the coefficients on BLACK and HISP are equal to one another. ## 1a

# 1.a) Mincer regression 
mincer <- lm(LNWAGE ~ ED + EXP + EXP2 + FE + BLACK + HISP + OTHER , data=cps)

summary(mincer)

## 
## Call:
## lm(formula = LNWAGE ~ ED + EXP + EXP2 + FE + BLACK + HISP + OTHER, 
##     data = cps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5037 -0.3226 -0.0204  0.3139  2.9885 
## 
## Coefficients:
##                Estimate  Std. Error t value Pr(>|t|)    
## (Intercept)  1.43840516  0.01498885  95.965  < 2e-16 ***
## ED           0.10161808  0.00093751 108.391  < 2e-16 ***
## EXP          0.02987870  0.00055036  54.289  < 2e-16 ***
## EXP2        -0.00047332  0.00001082 -43.727  < 2e-16 ***
## FE          -0.22817271  0.00467635 -48.793  < 2e-16 ***
## BLACK       -0.13966156  0.00780491 -17.894  < 2e-16 ***
## HISP        -0.05685394  0.00704083  -8.075 6.90e-16 ***
## OTHER        0.03450051  0.00871777   3.957 7.58e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5201 on 49992 degrees of freedom
## Multiple R-squared:  0.2846, Adjusted R-squared:  0.2845 
## F-statistic:  2841 on 7 and 49992 DF,  p-value: < 2.2e-16

# 1.a) Extra credit

##1b b. [10 pts] Separate your data into two subsamples - one for Black workers and one for White workers. For each subsample answer the following: i. How many observations are there? ii. What are the sample means for variables \(LNWAGE, ED, EXP, EXP2\) and \(FE\). iii. What are the differences in sample means for variables \(LNWAGE, ED, EXP, EXP2\) and \(FE\). iv. What do these differences tell us about different earnings, education, and experience levels between Black and White workers? v. What is the difference in average earnings between Black males and Black females? vi. What is the difference in average earnings between White males and White females?

Black <- filter(cps, BLACK == 1)
White <- filter(cps, WHITE == 1)

[1] “Number of Black workers: 5167” [1] “Number of White workers: 33683”

# 1.b.ii) sample means

meanblack <- lapply(Black, mean)
meanwhite <- lapply(White, mean)

print("Sample means for Black observations:")

## [1] "Sample means for Black observations:"

print(meanblack[c("LNWAGE", "ED", "EXP", "EXP2", "FE")])

## $LNWAGE
## [1] 2.934341
## 
## $ED
## [1] 14.07219
## 
## $EXP
## [1] 22.03155
## 
## $EXP2
## [1] 685.6116
## 
## $FE
## [1] 0.5616412

print("Sample means for White observations")

## [1] "Sample means for White observations"

print(meanwhite[c("LNWAGE", "ED", "EXP", "EXP2", "FE")])

## $LNWAGE
## [1] 3.154221
## 
## $ED
## [1] 14.65107
## 
## $EXP
## [1] 23.03776
## 
## $EXP2
## [1] 741.682
## 
## $FE
## [1] 0.4833299

[1] “Differences are presented as White mean - Black mean” [1] “Difference in LN Wages: 0.219880036088417” [1] “Difference in Education: 0.578881381798643” [1] “Difference in Experience: 1.00621750529035” [1] “Difference in Experience ^2: 56.0704323127076” [1] “Difference in % of females: -0.0783113147725224”

# 1.b.iv) exposition (no additional code required)
print("These differences show us that on average, Black observations earn less in wages, have about about a half year less in education, about a year less in experience")

## [1] "These differences show us that on average, Black observations earn less in wages, have about about a half year less in education, about a year less in experience"

# 1.b.v) difference for Black males and females
black_fem_wage <- mean(Black$LNWAGE[Black$FE==1])
black_male_wage <- mean(Black$LNWAGE[Black$FE==0])

print(paste("Difference in average Black male and Black female ln wages:",round(black_male_wage - black_fem_wage, 3)))

## [1] "Difference in average Black male and Black female ln wages: 0.087"

# 1.b.vi) difference for White males and females
white_fem_wage <- mean(White$LNWAGE[White$FE==1])
white_male_wage <- mean(White$LNWAGE[White$FE==0])

print(paste("Difference in average white male and white female ln wages:",round(white_male_wage - white_fem_wage, 3)))

## [1] "Difference in average white male and white female ln wages: 0.211"

1c

[20 pts] Using the subsamples from 1.b, estimate the following regression separately for Black and White workers: \[LNWAGE_i = \alpha_0 + \beta_1 ED_i + \beta_2 EXP_i + \beta_3 EXP2_i + \beta_4 FE_i + \epsilon_i\] Report the regression coefficients, standard errors, and R-squared values for both regressions. Compare the magnitudes of the regression coefficients for the two groups.
1. How does the effect of schooling (\(ED\)) differ across the two groups (if at all)? How do you interpret those differences (if any exist)?
2. How does the gender-related wage differential (\(FE\)) vary by race? How do you interpret those differences?
3. How does the age-earnings profile differ by race? How do you interpret those differences?
4. [5 pts extra credit] For extra credit, briefly describe how “topcoding” of high incomes might impact the results you obtain in this part of the HW.
5. [10 pts extra credit] For additional extra credit, estimate a Chow test of the null hypothesis that the intercept and slope coefficients are equal across the two groups.

# Black subset

Black_reg <- lm(LNWAGE ~ ED + EXP + EXP2 + FE, data=Black)
summary(Black_reg)

## 
## Call:
## lm(formula = LNWAGE ~ ED + EXP + EXP2 + FE, data = Black)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.08910 -0.30166 -0.03152  0.27593  2.82619 
## 
## Coefficients:
##                Estimate  Std. Error t value Pr(>|t|)    
## (Intercept)  1.25363126  0.04413973  28.401   <2e-16 ***
## ED           0.10352524  0.00295448  35.040   <2e-16 ***
## EXP          0.02523410  0.00161637  15.612   <2e-16 ***
## EXP2        -0.00037566  0.00003161 -11.886   <2e-16 ***
## FE          -0.13266073  0.01370640  -9.679   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.487 on 5162 degrees of freedom
## Multiple R-squared:  0.2486, Adjusted R-squared:  0.248 
## F-statistic:   427 on 4 and 5162 DF,  p-value: < 2.2e-16

# White subset
white_reg <- lm(LNWAGE ~ ED + EXP + EXP2 + FE, data=White)
summary(white_reg)

## 
## Call:
## lm(formula = LNWAGE ~ ED + EXP + EXP2 + FE, data = White)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5418 -0.3254 -0.0153  0.3208  2.6429 
## 
## Coefficients:
##                Estimate  Std. Error t value Pr(>|t|)    
## (Intercept)  1.33026285  0.01901841   69.95   <2e-16 ***
## ED           0.10908132  0.00123203   88.54   <2e-16 ***
## EXP          0.03116012  0.00068103   45.75   <2e-16 ***
## EXP2        -0.00049832  0.00001328  -37.54   <2e-16 ***
## FE          -0.25338327  0.00574205  -44.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5246 on 33678 degrees of freedom
## Multiple R-squared:  0.2737, Adjusted R-squared:  0.2736 
## F-statistic:  3173 on 4 and 33678 DF,  p-value: < 2.2e-16

# 1.c.i) difference on education (no additional code required)
print("Holding all else constant, on average, a 1 year increase in education increases log wage by .103 log points for Black workers and .109 log points for white workers. This can be interpreted as when holding all else constant, the financial return of an additional year of schooling for white workers is larger than for Black workers")

## [1] "Holding all else constant, on average, a 1 year increase in education increases log wage by .103 log points for Black workers and .109 log points for white workers. This can be interpreted as when holding all else constant, the financial return of an additional year of schooling for white workers is larger than for Black workers"

# 1.c.ii) difference on gender by race (no additional code required)
print("The gender wage differential is larger for white observations than for Black observations. Therefore, holding all other factors constant, the gender wage differential is larger for white women compared to white men, than for Black women compared to Black men")

## [1] "The gender wage differential is larger for white observations than for Black observations. Therefore, holding all other factors constant, the gender wage differential is larger for white women compared to white men, than for Black women compared to Black men"

# 1.c.iii) difference on age profile by race (no additional code required)
print("The coefficient on experience is larger for white observations than for Black observations. Therefore, holding all else constant, aging an additional year has higher returns for white observations than Black observations")

## [1] "The coefficient on experience is larger for white observations than for Black observations. Therefore, holding all else constant, aging an additional year has higher returns for white observations than Black observations"

# 1.c.iv) extra credit: topcoding (no additional code required)
cat("Top coding obscures the returns for high income individuals. If the advantaged group was disproportionately within the observations that were top-coded. The coefficient estimations for the advantaged group would be an underestimate and therefore bias the differences in wages towards zero")

## Top coding obscures the returns for high income individuals. If the advantaged group was disproportionately within the observations that were top-coded. The coefficient estimations for the advantaged group would be an underestimate and therefore bias the differences in wages towards zero

# 1.c.v) chow test

[15 pts] Implement the Blinder-Oaxaca procedure to measure wage discrimination against Black workers. That is, using the sample mean data from 1.b and the regression coefficients estimated in 1.c, decompose the difference in mean \(LNWAGE\) between Black and White workers into the component that is due to differences in worker endowments/characteristics and the component that is due to discrimination. When you implement this procedure, follow equation 5.30 in the Berndt textbook chapter (excerpt on Canvas) and weight the differences in endowments by the estimated coefficients for the advantaged group. Then answer the following:
1. How much of the Black-White wage differential is due to differences in endowments?
2. How much of the Black-White wage differential is due to discrimination?
3. How do these values compare to the coefficient on the dummy variable for Black workers from 1.a? What might explain any differences?

# 1.d.i) Black-White endowment difference (weighted on White values)


white_weighted_endowment_term = diffBnWED*white_reg$coefficients["ED"] + diffBnWEXP*white_reg$coefficients["EXP"] + diffBnWEXP2*white_reg$coefficients["EXP2"]+diffBnWFE * white_reg$coefficients["FE"]
white_weighted_endowment_term

##       ED 
## 0.086401

# 1.d.ii) Discrimination component is the sum product of the diff in the coeffs and the avg for black endowments


coeff_diff_ed <- white_reg$coefficients["ED"]- Black_reg$coefficients["ED"]
coeff_diff_exp <- white_reg$coefficients["EXP"]- Black_reg$coefficients["EXP"]
coeff_diff_exp2 <- white_reg$coefficients["EXP2"]- Black_reg$coefficients["EXP2"]
coeff_diff_fe <- white_reg$coefficients["FE"] - Black_reg$coefficients["FE"]


white_weighted_discrimination_term <- coeff_diff_ed * meanblack$ED + coeff_diff_exp * meanblack$EXP + coeff_diff_exp2 * meanblack$EXP2 + coeff_diff_fe * meanblack$FE

cat("White weighted discrimination term:", round(white_weighted_discrimination_term,3), "\n")

## White weighted discrimination term: 0.057

cat("White weighted endowment term:", round(white_weighted_endowment_term,3), "\n")

## White weighted endowment term: 0.086

cat("Sum of White weighted endowment and discrimination:", round(sum(white_weighted_endowment_term,white_weighted_discrimination_term ),3), "\n")

## Sum of White weighted endowment and discrimination: 0.143

cat("Fraction of difference in average log wages due to discrimination:", round(abs(white_weighted_discrimination_term/diffBnWLNWAGE*100), 3), "%")

## Fraction of difference in average log wages due to discrimination: 25.854 %

# 1.d.iii) exposition (no additional code needed)

print("Through the implementation of the Blinder-Oaxaca test, differences in wages for Black and white workers are shown to be caused by discrimination and differences in endowments. The number from the BO-test is slightly larger than the differential shown in 1a because it accounts for differences in average endowments and discrimination whereas the coefficient found in 1a holds all other factors constant and assumes the same returns for the same endowments")

## [1] "Through the implementation of the Blinder-Oaxaca test, differences in wages for Black and white workers are shown to be caused by discrimination and differences in endowments. The number from the BO-test is slightly larger than the differential shown in 1a because it accounts for differences in average endowments and discrimination whereas the coefficient found in 1a holds all other factors constant and assumes the same returns for the same endowments"

[10 pts] Reimplement the Blinder-Oaxaca procedure using equation 5.31 from the Berndt textbook excerpt. That is, weight the differences in sample means by the coefficients estimated for the disadvantaged group.
1. What is the endowment difference and discrimination difference that you obtain from this version of the Blinder-Oaxaca decomposition?
2. How do these estimates compare to the estimates from 1.d?
3. Why are the estimates different? Does the direction of the difference make sense? Why or why not?

# 1.e.i) endowment and discrimination (Black weighting)

black_weighted_endowment_term <- Black_reg$coefficients["ED"] * diffBnWED + Black_reg$coefficients["EXP"] * diffBnWEXP + Black_reg$coefficients["EXP2"] * diffBnWEXP2 + Black_reg$coefficients["FE"] * diffBnWFE

cat(paste("Black weighted endowment term:", round(black_weighted_endowment_term,3), "\n"))

## Black weighted endowment term: 0.075

black_weighted_discrimination_term <- meanwhite$ED * coeff_diff_ed + meanwhite$EXP * coeff_diff_exp + meanwhite$EXP2 * coeff_diff_exp2 + meanwhite$FE * coeff_diff_fe

cat(paste("Black weighted discrimination term:", round(black_weighted_discrimination_term,3), "\n"))

## Black weighted discrimination term: 0.069

cat(paste("Total log wage differential:", round(sum(black_weighted_discrimination_term,black_weighted_endowment_term),3)))

## Total log wage differential: 0.143

# 1.e.ii and 1.e.iii) exposition (no additional coding required)

cat("i.e.ii) When using equation 5.31 from the Berndt textbook, the endowment and discrimination terms result in the same total wage differential. However, the wage differential attributed to earnings is smaller and discrimination is larger.\n\n")

## i.e.ii) When using equation 5.31 from the Berndt textbook, the endowment and discrimination terms result in the same total wage differential. However, the wage differential attributed to earnings is smaller and discrimination is larger.

cat ("i.e.iii) The estimates are different because the coefficients and mean values for the advantaged group are typically larger than those for the disadvantaged group. Thus, when the discrimination term is estimated by multiplyingthe average endowments of the disadvantaged group to the differences in coefficients, we would expect to see a lower estimated value")

## i.e.iii) The estimates are different because the coefficients and mean values for the advantaged group are typically larger than those for the disadvantaged group. Thus, when the discrimination term is estimated by multiplyingthe average endowments of the disadvantaged group to the differences in coefficients, we would expect to see a lower estimated value

Exercise 2: [60 pts + 10 pts extra credit] Measuring the impact of gender discrimination on the earnings of female workers

[10 pts] Separate your data into two subsamples - one for female workers and one for male workers. When doing this, you may assume that all workers for which \(FE_i = 0\) are male (for CPS data (a) gender assignment is often imputed by the interviewer and not reported by the interviewee and (b) the CPS has only recently begun testing survey/interview techniques to allow for non-binary gender categorization). For each subsample, answer the following:
1. How many observations are there?
2. What are the sample means for variable \(LNWAGE, ED, EXP, EXP2, BLACK,\) and \(HISP\).
3. What are the differences in sample means for variables \(LNWAGE, ED, EXP, EXP2, BLACK,\) and \(HISP\).
4. What do these differences tell us about different earnings, education, and experience levels between female and male workers?
5. Are there average differences in the proportions of female and male workers across Black and Hispanic workers?

# 2.a.i) separate data 


female <- filter(cps, FE == 1)
male <- filter(cps, FE == 0)

cat(paste("Number of female workers:", {nrow(female)}, "\n"))

## Number of female workers: 24453

cat(paste("Number of male workers:", {nrow(male)}))

## Number of male workers: 25547

# 2.a.ii) sample means
meanfemale <- lapply(female, mean)
meanmale <- lapply(male, mean)

print("Sample means for female observations:")

## [1] "Sample means for female observations:"

print(meanfemale[c("LNWAGE", "ED", "EXP", "EXP2", "FE")])

## $LNWAGE
## [1] 3.001255
## 
## $ED
## [1] 14.55613
## 
## $EXP
## [1] 22.35869
## 
## $EXP2
## [1] 707.3425
## 
## $FE
## [1] 1

print("Sample means for Male observations")

## [1] "Sample means for Male observations"

print(meanmale[c("LNWAGE", "ED", "EXP", "EXP2", "FE")])

## $LNWAGE
## [1] 3.189151
## 
## $ED
## [1] 14.12088
## 
## $EXP
## [1] 22.40987
## 
## $EXP2
## [1] 708.7394
## 
## $FE
## [1] 0

# 2.a.iii differences in male minus female (nonnegative values)
diffMmFLNWAGE <- mean(male$LNWAGE) - mean(female$LNWAGE)
diffMmFED <- mean(male$ED) - mean(female$ED)
diffMmFEXP <- mean(male$EXP) - mean(female$EXP)
diffMmFEXP2 <- mean(male$EXP2) - mean(female$EXP2)
diffMmFBLACK <- mean(male$BLACK) - mean(female$BLACK)
diffMmFHISP <- mean(male$HISP) - mean(female$HISP)

cat(paste("Diff in LN Wage:",diffMmFLNWAGE, "\n"))

## Diff in LN Wage: 0.187895950251396

cat(paste("Diff in ED:", diffMmFED, "\n"))

## Diff in ED: -0.435252832903808

cat(paste("Diff in EXP:", diffMmFEXP, "\n"))

## Diff in EXP: 0.0511839050960958

cat(paste("Diff in EXP2:", diffMmFEXP2, "\n"))

## Diff in EXP2: 1.39692944618025

cat(paste("Diff in BLACK:", diffMmFBLACK, "\n"))

## Diff in BLACK: -0.0300165283446872

cat(paste ("Diff in HISP:", diffMmFHISP))

## Diff in HISP: 0.0176246311172031

# 2.a.iv) interpretation (no additional code required)
cat("2.a.iv) On average, men in this sample are compensated more and have more experience, but they tend to be less educated.")

## 2.a.iv) On average, men in this sample are compensated more and have more experience, but they tend to be less educated.

# 2.a.v) in numbers of workers by race
cat("2.a.v) On average, compared to the females in the sample, the men in the sample are more likely to be Hispanic and less likely to be Black")

## 2.a.v) On average, compared to the females in the sample, the men in the sample are more likely to be Hispanic and less likely to be Black

[20 pts] Using the subsamples from 2.a, estimate the following regression separately for female and male workers: \[ LNWAGE_i = \alpha_0 + \beta_1 ED_i + \beta_2 EXP_i + \beta_3 EXP2_i + \beta_4 BLACK_i + \beta_5 HISP_i + \epsilon_i\] Report the regression coefficients, standard errors, and R-squared values for both regressions. Compare the magnitudes of the regession coefficients for the two groups.
1. How does the effect of schooling (\(ED\)) differ across the two groups (if at all)? How do you interpret those differences (if any exist)?
2. How does the race-related wage differential (\(BLACK\) and \(HISP\)) vary by gender? How do you interpret those differences?
3. How does the age-earnings profile differ by gender? How do you interpret those differences?
4. [10 pts extra credit] For additional extra credit, estimate a Chow test of the null hypothesis that the intercept and slope coefficients are equal across the two groups.

# female subset
lmFemale <- lm(LNWAGE ~ ED + EXP + EXP2 + FE + BLACK + HISP, data=female)
summary(lmFemale)

## 
## Call:
## lm(formula = LNWAGE ~ ED + EXP + EXP2 + FE + BLACK + HISP, data = female)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2894 -0.3074 -0.0239  0.3002  3.0199 
## 
## Coefficients: (1 not defined because of singularities)
##                Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept)  1.19625708  0.02155332  55.502   < 2e-16 ***
## ED           0.10765793  0.00133934  80.382   < 2e-16 ***
## EXP          0.02310092  0.00076097  30.357   < 2e-16 ***
## EXP2        -0.00037147  0.00001507 -24.643   < 2e-16 ***
## FE                   NA          NA      NA        NA    
## BLACK       -0.08978093  0.01011788  -8.873   < 2e-16 ***
## HISP        -0.03860858  0.00990139  -3.899 0.0000967 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5043 on 24447 degrees of freedom
## Multiple R-squared:  0.264,  Adjusted R-squared:  0.2638 
## F-statistic:  1753 on 5 and 24447 DF,  p-value: < 2.2e-16

# male subset
lmMale <- lm(LNWAGE ~ ED + EXP + EXP2 + FE + BLACK + HISP, data=male)
summary(lmMale)

## 
## Call:
## lm(formula = LNWAGE ~ ED + EXP + EXP2 + FE + BLACK + HISP, data = male)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5244 -0.3334 -0.0180  0.3232  2.7701 
## 
## Coefficients: (1 not defined because of singularities)
##                Estimate  Std. Error t value Pr(>|t|)    
## (Intercept)  1.44925686  0.02072869  69.915   <2e-16 ***
## ED           0.09620419  0.00130669  73.624   <2e-16 ***
## EXP          0.03649746  0.00079002  46.198   <2e-16 ***
## EXP2        -0.00057247  0.00001544 -37.082   <2e-16 ***
## FE                   NA          NA      NA       NA    
## BLACK       -0.20660708  0.01184781 -17.438   <2e-16 ***
## HISP        -0.08201122  0.00979587  -8.372   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.532 on 25541 degrees of freedom
## Multiple R-squared:  0.2782, Adjusted R-squared:  0.278 
## F-statistic:  1968 on 5 and 25541 DF,  p-value: < 2.2e-16

# 2.b.i) difference on education (no additional code required)
cat("Female ed coefficient:", lmFemale$coefficients["ED"], "\n")

## Female ed coefficient: 0.1076579

cat("Male ed coefficient:", lmMale$coefficients["ED"], "\n")

## Male ed coefficient: 0.09620419

cat("On average, holding all else constant, an additional year of schooling is associated with higher returns for women than for men")

## On average, holding all else constant, an additional year of schooling is associated with higher returns for women than for men

# 2.b.ii) difference on race by gender (no additional code required)

cat("Female BLACK coefficient:", lmFemale$coefficients["BLACK"], "\n")

## Female BLACK coefficient: -0.08978093

cat("Male BLACK coefficient:", lmMale$coefficients["BLACK"], "\n")

## Male BLACK coefficient: -0.2066071

cat("On average, holding all else constant, Black men have a greater wage differential compared to their white counterparts, than Black women do to their white counterparts\n\n")

## On average, holding all else constant, Black men have a greater wage differential compared to their white counterparts, than Black women do to their white counterparts

cat("Female HISP coefficient:", lmFemale$coefficients["HISP"], "\n")

## Female HISP coefficient: -0.03860858

cat("Male HISP coefficient:", lmMale$coefficients["HISP"], "\n")

## Male HISP coefficient: -0.08201122

cat("On average, holding all else constant, Hispanic men have a greater wage differential compared to their white counterparts, than Hispanic women do to their white counterparts")

## On average, holding all else constant, Hispanic men have a greater wage differential compared to their white counterparts, than Hispanic women do to their white counterparts

# 2.b.iii) difference on age profile by race (no additional code required)

cat("Female EXP coefficient:", lmFemale$coefficients["EXP"], "\n")

## Female EXP coefficient: 0.02310092

cat("Male EXP coefficient:", lmMale$coefficients["EXP"], "\n")

## Male EXP coefficient: 0.03649746

cat("On average, holding all else constant, Hispanic men have a greater wage differential compared to their white counterparts, than Hispanic women do to their white counterparts")

## On average, holding all else constant, Hispanic men have a greater wage differential compared to their white counterparts, than Hispanic women do to their white counterparts

#options(download.file.method = "wininet")
#tinytex::install_tinytex()

# 2.b.iv) chow test

y_male <- male$LNWAGE
y_female <- female$LNWAGE

x_male <- cbind(male$ED, male$EXP, male$EXP2, male$BLACK, male$HISP)
x_female <- cbind(female$ED, female$EXP, female$EXP2, female$BLACK, female$HISP)

chow.test(y_male, x_male, y_female, x_female)

##    F value      d.f.1      d.f.2    P value 
##   449.3554     6.0000 49988.0000     0.0000

cat("\nTherefore we can reject the null hypothesis")

## 
## Therefore we can reject the null hypothesis

[15 pts] Implement the Blinder-Oaxaca procedure to measure wage discrimination against female workers. That is, using the sample mean data from part 2.a and the regresion coefficients estimated in 2.b, decompose the difference in mean \(LNWAGE\) between female and male workers into the component that is due to differences in worker endowments/characteristics and the component that is due to discrimination. When you implement this procedure, follow equation 5.30 in the Berndt textbook excerpt and weight the differences in endowments by the estimated coefficients for the advantaged group. Then answer the following;
1. How much of the female-male wage differential is due to differences in endowments?
2. How much of the female-male wage differential is due to discrimination?
3. How do these compare to the coefficient on the dummy variable for female workers from 1.a? What might explain any differences?

# 2.c.i) female-male endowment difference (weighted on male)


# vector needed: coefficients from regression on male subset (already created)
male_coefficients = lmMale$coefficients[-1] #remove the intercept
male_coefficients = male_coefficients[!is.na(male_coefficients)] #remove the NA value for female


# vector needed: differences of averages (male minus female)
mean_differences <- c(diffMmFED, diffMmFEXP, diffMmFEXP2, diffMmFBLACK, diffMmFHISP)

endowment_term_male_weighted <- round(sum(mean_differences * male_coefficients),3)


# calculating endowment amount
cat("Endowment term (male weighted):", endowment_term_male_weighted)

## Endowment term (male weighted): -0.036

# 2.c.ii) female-male discrimination difference (weighted on male)

#vector of female average values
meanfemale_clean <- as.numeric(meanfemale[c("ED", "EXP", "EXP2", "BLACK", "HISP")])

#vector of diff in coefficient values (M-F)

coeff_diff_MmF <- lmMale$coefficients - lmFemale$coefficients
coeff_diff_MmF <- coeff_diff_MmF[-1] 
coeff_diff_MmF = as.numeric(coeff_diff_MmF[!is.na(coeff_diff_MmF)])


intercept_diff_MmF <- lmMale$coefficients[1]- lmFemale$coefficients[1]
discrimination_term_male_weighted <- sum(round(sum(meanfemale_clean * coeff_diff_MmF), 3), intercept_diff_MmF)


#calculating discrimination amount
cat("Discrimination term (male weighted):", discrimination_term_male_weighted, "\n")

## Discrimination term (male weighted): 0.2239998

#Calculating fraction of differential due to discrimination
cat("Fraction of wage differential due to discrimination:", round(discrimination_term_male_weighted / diffMmFLNWAGE, 3)*100, "%\n")

## Fraction of wage differential due to discrimination: 119.2 %

#Exposition
cat("This number is smaller than the coefficient for the dummy variable female")

## This number is smaller than the coefficient for the dummy variable female

# 2.c.iii) exposition (no additional code needed)
cat("This slightly smaller estimate is likely due to the Mincer regression controlling for differences in endowments.")

## This slightly smaller estimate is likely due to the Mincer regression controlling for differences in endowments.

[15 pts] Reimplement the Blinder-Oaxaca procedure using equation 5.31 from the Berndt textbook excerpt. That is, weight the differences in sample means by the coefficients estimated for the disadvantaged group.
1. What is the endowment difference and discrimination difference that you obtain from this version of the Blinder-Oaxaca decomposition?
2. How do these estimates compare to the estimates from 2.c?
3. Why are the estimates different? Does the direction of the difference make sense? Why or why not?

# 2.d.i) female-male endowment and discrimination components (weighted on female)

cat("Female weighted estimations\n-------------------------------\n")

## Female weighted estimations
## -------------------------------

female_coefficients = lmFemale$coefficients[-1] #remove the intercept
female_coefficients = female_coefficients[!is.na(female_coefficients)] #remove the NA value for female

# calculating endowment amount


MF_endowment_female_weight <- sum(mean_differences * female_coefficients)
cat("Endowment term:", MF_endowment_female_weight, "\n")

## Endowment term: -0.04418049

#vector of male average values
meanmale_clean <- as.numeric(meanmale[c("ED", "EXP", "EXP2", "BLACK", "HISP")])



#calculating discrimination amount
discrimination_term_female_weighted = round(sum(meanmale_clean * coeff_diff_MmF)+ intercept_diff_MmF, 3)
cat("Discrimination term:", discrimination_term_female_weighted, "\n")

## Discrimination term: 0.232

cat("Total difference:", sum(discrimination_term_female_weighted, MF_endowment_female_weight))

## Total difference: 0.1878195

# 2.d.ii and 2.d.iii) exposition (no additional code needed)

cat("2.d.ii) This estimate for the differences in ln wages to women due to discrimination is slightly larger than the previous estimation. The estimate for differences due to endowments has a greater negative effect than the previous estimation.\n\n")

## 2.d.ii) This estimate for the differences in ln wages to women due to discrimination is slightly larger than the previous estimation. The estimate for differences due to endowments has a greater negative effect than the previous estimation.

cat("2.d.iii) These findings are different. Specifically in terms of discrimination, I was surprised the female weighted discrimination term was larger, because females are slightly more endowed than males in certain categories such as education, so I would expect their average values multiplied by the difference in coefficients to produce a higher level of discrimination")

## 2.d.iii) These findings are different. Specifically in terms of discrimination, I was surprised the female weighted discrimination term was larger, because females are slightly more endowed than males in certain categories such as education, so I would expect their average values multiplied by the difference in coefficients to produce a higher level of discrimination