Abstract

This project reproduces and extends Bertrand and Mullainathan’s resume field experiment on labor market discrimination. The original study sent fictitious resumes to real job advertisements and randomly assigned each resume either a White-sounding or African-American-sounding name. Because the racialized name signal was randomly assigned, differences in callback rates can be interpreted as evidence that employers responded differently to otherwise comparable resumes at the initial interview-callback stage.

Our analysis has two parts. First, we reproduce the core result using a 2x2 contingency table, a difference in callback proportions, a chi-square test with a two-sample test of proportions, and a binary probit regression. Second, we extend the analysis using methods from STAT 424: Generalized Linear Models, including additional two-way contingency tables, three-way contingency tables, logistic regression for binary outcomes, and an interaction logistic regression model. Across these analyses, resumes with White-sounding names receive substantially higher callback rates than resumes with African-American-sounding names. We also find that observed callback gains from high-quality resumes appear larger for White-sounding names, although the adjusted race-by-quality interaction is not statistically significant in our logistic model.

1. Introduction and Research Question

Bertrand and Mullainathan’s study asks whether racial signals affect employer behavior at the resume-screening stage. The key experimental feature is that resumes are fictitious and racialized names are randomly assigned. This allows the study to separate the effect of perceived race from differences in actual applicant qualifications.

Our research question is:

Do resumes with White-sounding names receive higher callback rates than otherwise comparable resumes with African-American-sounding names, and does this callback gap vary by resume quality, city, gender, or occupation?

This question preserves the original paper’s main causal comparison while adding categorical-data and generalized-linear-model methods from STAT 424. The project focuses on the first stage of hiring: whether an applicant receives an interview callback. It does not measure final job offers, wages, or long-term employment outcomes.

2. Data and Variables

The dataset contains one row per submitted resume. The main outcome is whether that resume received a callback. The most important explanatory variable is whether the assigned name sounded White or African-American. The key variables used in our analysis are listed below.

Key variables used in the analysis

| Variable | Meaning | Role |
| --- | --- | --- |
| call | Whether the resume received an interview callback: 1 = callback, 0 = no callback. | Main binary outcome |
| race / black | Racialized name condition: black = 1 for African-American-sounding names, 0 for White-sounding names. | Main treatment/explanatory variable |
| h / high_qual | Resume quality condition: high_qual = 1 for high-quality resumes, 0 for low-quality resumes. | Key moderator and control variable |
| city / chicago | City of the job advertisement: chicago = 1 for Chicago, 0 for Boston. | Control and subgroup variable |
| sex / female | Gender associated with the applicant name: female = 1 for female names, 0 for male names. | Control and subgroup variable |
| occ | Occupation category of the job advertisement. | Control and subgroup variable |
| callback_status | Readable factor version of call: Callback vs. No callback. | Used for contingency tables and figures |
| race_group | Readable factor version of the racialized name condition. | Used for contingency tables and figures |
| resume_quality | Readable factor version of resume quality. | Used for contingency tables and interaction analysis |

3. Statistical Methods

3.1 Reproduction methods

The reproduction analysis follows the logic of the original paper. Since the outcome is binary, the simplest evidence comes from comparing callback rates across the two randomized name groups.

First, we use a 2x2 contingency table. This table crosses racialized name group with callback status:

\[ \text{Racialized name group} \times \text{Callback status}. \]

Because both variables are binary, this two-way table is 2x2. It shows how White-name and African-American-name resumes split between the callback and no-callback outcomes.

Second, we compute the difference in callback proportions:

\[ \widehat{p}_{White} - \widehat{p}_{African\text{-}American}, \]

where \(\widehat{p}_{White}\) is the sample callback rate for White-sounding names and \(\widehat{p}_{African\text{-}American}\) is the sample callback rate for African-American-sounding names. Because the name condition was randomly assigned, this raw difference is the central experimental estimate of the callback gap.
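The arithmetic behind this estimate can be sketched in a few lines. The group counts used here (2,435 resumes per name group, with 235 and 157 callbacks) are an assumption: they are not reported in this document, and are chosen because they reproduce the callback rates given in Section 4.2. The function name `callback_gap` is hypothetical.

```python
# Assumed counts (not reported in this document): 2,435 resumes per name group,
# with 235 callbacks for White-sounding and 157 for African-American-sounding names.
N_WHITE, CALLS_WHITE = 2435, 235
N_BLACK, CALLS_BLACK = 2435, 157


def callback_gap(calls_a, n_a, calls_b, n_b):
    """Return (rate_a, rate_b, difference, ratio) for two groups."""
    rate_a = calls_a / n_a
    rate_b = calls_b / n_b
    return rate_a, rate_b, rate_a - rate_b, rate_a / rate_b


p_white, p_black, diff, ratio = callback_gap(
    CALLS_WHITE, N_WHITE, CALLS_BLACK, N_BLACK
)

print(f"White callback rate:            {p_white:.2%}")
print(f"African-American callback rate: {p_black:.2%}")
print(f"Difference (percentage points): {100 * diff:.2f}")
print(f"Ratio (White / African-American): {ratio:.3f}")
```

The difference is reported in percentage points (an absolute gap), while the ratio expresses the same gap in relative terms.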

Third, we use a chi-square test and a two-sample test of proportions. These tests evaluate whether callback status is independent of racialized name group. In plain language, they test whether the observed callback-rate difference is larger than chance variation alone would plausibly produce. For a 2x2 table the two tests are equivalent when they use the same continuity correction, which is why they report the same p-value in Section 4.2.
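A minimal sketch of the chi-square test for a 2x2 table follows, using only the Python standard library. It assumes the same unreported group counts as above (235 of 2,435 and 157 of 2,435 callbacks) and assumes a Yates continuity correction, which is the choice that appears consistent with the p-value of about 5e-05 reported in Section 4.2. The function name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d, yates=True):
    """Pearson chi-square test for the table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    cells = ((a, row1, col1), (b, row1, col2), (c, row2, col1), (d, row2, col2))

    stat = 0.0
    for observed, row_total, col_total in cells:
        expected = row_total * col_total / n
        deviation = abs(observed - expected)
        if yates:
            # Continuity correction: shrink each deviation by 0.5 (not below 0).
            deviation = max(deviation - 0.5, 0.0)
        stat += deviation ** 2 / expected

    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Assumed counts (not reported in this document).
# Rows: White-sounding, African-American-sounding; columns: callback, no callback.
stat, p_value = chi_square_2x2(235, 2200, 157, 2278)
print(f"chi-square = {stat:.3f}, p-value = {p_value:.1e}")
```

Calling the function with `yates=False` gives the uncorrected Pearson statistic, which equals the square of the usual pooled two-sample z statistic for a difference in proportions.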

Fourth, we estimate a binary probit regression, matching the style of the original paper. Let \(Y_i = 1\) if resume \(i\) received a callback and \(Y_i = 0\) otherwise. The basic probit model is:

\[ \Phi^{-1}\left(P(Y_i = 1)\right) = \beta_0 + \beta_1 Black_i, \]

where \(\Phi^{-1}\) is the inverse standard normal cumulative distribution function. The coefficient \(\beta_1\) captures how the African-American-sounding name condition is associated with callback probability on the probit scale. A negative estimate means African-American-sounding names are associated with a lower probability of callback.
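With a single binary regressor the probit model is saturated, so its maximum-likelihood fit reproduces the two group callback rates exactly and the coefficients have a closed form: \(\beta_0 = \Phi^{-1}(\widehat{p}_{White})\) and \(\beta_1 = \Phi^{-1}(\widehat{p}_{African\text{-}American}) - \Phi^{-1}(\widehat{p}_{White})\). The sketch below illustrates this, again assuming the unreported counts of 235 of 2,435 and 157 of 2,435 callbacks. The standard errors use a delta-method approximation, and the function name `probit_two_groups` is hypothetical; this is not the report's regression code.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def probit_two_groups(calls_ref, n_ref, calls_trt, n_trt):
    """Closed-form probit fit for one binary regressor.

    Returns (beta0, se_beta0, beta1, se_beta1), where the reference group
    defines the intercept and the treated group defines the slope.
    """
    p_ref, p_trt = calls_ref / n_ref, calls_trt / n_trt
    z_ref = STD_NORMAL.inv_cdf(p_ref)
    z_trt = STD_NORMAL.inv_cdf(p_trt)

    # Delta method: se(Phi^{-1}(p_hat)) = sqrt(p (1 - p) / n) / phi(z).
    se_ref = sqrt(p_ref * (1 - p_ref) / n_ref) / STD_NORMAL.pdf(z_ref)
    se_trt = sqrt(p_trt * (1 - p_trt) / n_trt) / STD_NORMAL.pdf(z_trt)

    beta0, beta1 = z_ref, z_trt - z_ref
    # The two groups are independent, so the variances add.
    se_beta1 = sqrt(se_ref ** 2 + se_trt ** 2)
    return beta0, se_ref, beta1, se_beta1


# Assumed counts (not reported in this document): White-sounding names as reference.
beta0, se_beta0, beta1, se_beta1 = probit_two_groups(235, 2435, 157, 2435)
print(f"Intercept: {beta0:.4f} (se {se_beta0:.4f})")
print(f"African-American-sounding name: {beta1:.4f} (se {se_beta1:.4f})")
print(f"z statistic for beta1: {beta1 / se_beta1:.3f}")
```

Under these assumptions the values agree with the probit table in Section 4.3 to the reported precision, which shows that the basic probit carries the same information as the 2x2 table.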

3.2 Course-based extension methods

Our extension applies additional methods from the course to the same dataset.

First, we use additional two-way contingency tables to examine callback status across city, gender, resume quality, and occupation. These tables help describe whether callback rates vary across applicant or job characteristics beyond race.

Second, we use three-way contingency tables to examine whether the relationship between racialized name group and callback status changes across a third variable. For example, the table

\[ \text{Racialized name group} \times \text{Callback status} \times \text{Resume quality} \]

allows us to compare the racial callback gap separately for low-quality and high-quality resumes.
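The mechanics of a three-way table can be sketched as follows. The rows in this example are hypothetical toy records, not the study data, and the function names are illustrative only. The idea is to count resumes in each race-by-quality cell and then compare callback rates within each quality level.

```python
from collections import Counter

# Hypothetical toy records (race_group, resume_quality, call); NOT the study data.
TOY_ROWS = [
    ("White", "Low", 1), ("White", "Low", 0),
    ("White", "High", 1), ("White", "High", 1),
    ("African-American", "Low", 0), ("African-American", "Low", 0),
    ("African-American", "High", 1), ("African-American", "High", 0),
]


def callback_rates(rows):
    """Callback rate for each (race_group, resume_quality) cell."""
    totals, calls = Counter(), Counter()
    for race, quality, call in rows:
        totals[(race, quality)] += 1
        calls[(race, quality)] += call
    return {cell: calls[cell] / totals[cell] for cell in totals}


def gap_by_quality(rates):
    """White minus African-American callback rate within each quality level."""
    qualities = sorted({quality for _, quality in rates})
    return {
        q: rates[("White", q)] - rates[("African-American", q)]
        for q in qualities
    }


rates = callback_rates(TOY_ROWS)
gaps = gap_by_quality(rates)
for quality, gap in gaps.items():
    print(f"{quality}-quality resumes: gap = {100 * gap:.1f} percentage points")
```

Comparing the per-stratum gaps is the descriptive counterpart of the interaction term in a regression model.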

Third, we use logistic regression for binary outcomes. The logistic model is appropriate because the outcome is binary. The model is:

\[ \log\left(\frac{P(Y_i = 1)}{1 - P(Y_i = 1)}\right) = \beta_0 + \beta_1 Black_i + \beta_2 HighQuality_i + \beta_3 Female_i + \beta_4 Chicago_i + \gamma_{occ}. \]

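This model estimates the log-odds of receiving a callback, where \(\gamma_{occ}\) denotes the effect of the occupation category of the job advertisement. We report odds ratios, obtained by exponentiating the coefficients; an odds ratio below 1 indicates lower odds of receiving a callback relative to the reference group.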

Fourth, we estimate an interaction logistic regression to test whether resume quality changes the relationship between racialized name group and callback probability:

\[ \log\left(\frac{P(Y_i = 1)}{1 - P(Y_i = 1)}\right) = \beta_0 + \beta_1 Black_i + \beta_2 HighQuality_i + \beta_3(Black_i \times HighQuality_i) + \mathbf{X}_i'\gamma. \]

Here \(\mathbf{X}_i\) includes gender, city, and occupation controls. The interaction term \(\beta_3\) tests whether the association between resume quality and callbacks differs between African-American-sounding and White-sounding names.
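To make the role of \(\beta_3\) explicit, it may help to write out the odds ratio that the interaction model implies for African-American-sounding versus White-sounding names at each quality level. The following is an algebraic consequence of the model above rather than an additional result; the symbol \(q\) is introduced here only for illustration. Holding \(\mathbf{X}_i\) fixed and setting \(HighQuality_i = q\),

\[ \frac{\text{odds}(Y_i = 1 \mid Black_i = 1,\ HighQuality_i = q)}{\text{odds}(Y_i = 1 \mid Black_i = 0,\ HighQuality_i = q)} = \exp(\beta_1 + \beta_3 q), \qquad q \in \{0, 1\}. \]

The odds ratio is therefore \(\exp(\beta_1)\) for low-quality resumes and \(\exp(\beta_1 + \beta_3)\) for high-quality resumes, and the two coincide exactly when \(\beta_3 = 0\).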

4. Reproduction Results

4.1 2x2 contingency table

2x2 contingency table: racialized name group by callback status (row percentages)

| race_group | No callback | Callback |
| --- | --- | --- |
| White-sounding name | 90.3% | 9.7% |
| African-American-sounding name | 93.6% | 6.4% |
| Total | 92.0% | 8.0% |

The 2x2 table shows a clear descriptive difference: resumes with White-sounding names receive callbacks more often than resumes with African-American-sounding names.

4.2 Difference in callback proportions

Main reproduction results

| Result | Value |
| --- | --- |
| White-name callback rate | 9.65% |
| African-American-name callback rate | 6.45% |
| Difference: White - African-American | 3.20 percentage points |
| Ratio: White / African-American | 1.497 |
| Chi-square p-value | 5e-05 |
| Two-sample proportion test p-value | 5e-05 |

The callback rate for White-sounding names is approximately 9.65%, while the callback rate for African-American-sounding names is approximately 6.45%. The difference is about 3.20 percentage points. In relative terms, White-sounding names receive about 1.50 times as many callbacks as African-American-sounding names.

Because the assigned name condition was randomized, this difference provides direct evidence that racialized name signals affected interview-callback decisions in this experimental setting.

Observed callback rates by racialized name group. Error bars show approximate 95% confidence intervals for the callback proportions.
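One common way to obtain approximate 95% intervals like those in the figure is the Wald (normal-approximation) interval for a proportion. The sketch below assumes that method and the same unreported counts as before (235 of 2,435 and 157 of 2,435 callbacks); the figure itself may have been produced with a different interval method, and the function name `wald_interval` is hypothetical.

```python
from math import sqrt


def wald_interval(successes, n, z=1.96):
    """Approximate 95% Wald interval for a proportion: p +/- z * sqrt(p(1-p)/n)."""
    p = successes / n
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Assumed counts (not reported in this document).
white_ci = wald_interval(235, 2435)
black_ci = wald_interval(157, 2435)

print(f"White-sounding names:            {white_ci[0]:.2%} to {white_ci[1]:.2%}")
print(f"African-American-sounding names: {black_ci[0]:.2%} to {black_ci[1]:.2%}")
```

Under these assumptions the two intervals do not overlap, which matches the visual impression of a clear gap between the groups.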


4.3 Binary probit regression

Basic probit regression predicting callback status

| term | estimate | std.error | statistic | p.value |
| --- | --- | --- | --- | --- |
| Intercept | -1.3017 | 0.0350 | -37.195 | < 0.0001 |
| African-American-sounding name | -0.2165 | 0.0528 | -4.103 | 4e-05 |

The coefficient on the African-American-sounding name indicator is negative, meaning that resumes with African-American-sounding names have lower estimated callback probability than otherwise comparable resumes with White-sounding names. This is consistent with the contingency-table result.

5. Course-Based Extension Results

5.1 Additional two-way contingency tables

Additional two-way contingency-table tests

| Table | Chi-square p-value |
| --- | --- |
| Callback × City | 0.000195 |
| Callback × Gender | 0.383 |
| Callback × Resume quality | 0.0801 |
| Callback × Occupation | NA |

These two-way tables are descriptive extensions. They do not replace the randomized race comparison, but they help show whether callback status is also associated with other observed characteristics such as city, gender, resume quality, and occupation.

5.2 Three-way contingency tables

The three-way tables ask whether the race-callback relationship changes across a third variable. The most substantively important third variable is resume quality. This allows us to ask whether stronger resumes reduce the racial callback gap.

Observed callback rates by racialized name group and resume quality. Error bars show approximate 95% confidence intervals.


The observed callback rate increases more from low-quality to high-quality resumes for White-sounding names than for African-American-sounding names. This pattern mirrors one of the original paper’s main substantive findings. However, observed subgroup differences should be interpreted carefully, so we also evaluate this pattern using an interaction regression model below.

Callback rates by racialized name group across city and gender. Error bars show approximate 95% confidence intervals.


These descriptive subgroup results show that the overall callback gap is not driven by only one city or one gender category. In both Boston and Chicago, and for both male- and female-associated names, White-sounding names continue to receive higher callback rates. Because these are descriptive subgroup comparisons rather than direct tests of a specific interaction effect, they should be interpreted as supportive context rather than stand-alone causal claims.

5.3 Logistic regression for binary outcomes

Logistic regression is the binary-outcome model emphasized in STAT 424. Unlike probit coefficients, logistic coefficients can be exponentiated and read as odds ratios. An odds ratio below 1 means lower odds of receiving a callback relative to the reference category.

Key odds ratios from logistic interaction model

| term | Odds ratio | Lower 95% CI | Upper 95% CI | p-value |
| --- | --- | --- | --- | --- |
| African-American-sounding name | 0.708 | 0.519 | 0.965 | 0.0288 |
| High-quality resume | 1.308 | 0.997 | 1.717 | 0.0530 |
| African-American name × high-quality resume | 0.835 | 0.547 | 1.276 | 0.4050 |

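Because this model includes a race-by-quality interaction, the odds ratio of 0.708 for the African-American-sounding name indicator applies to low-quality resumes: among low-quality resumes, African-American-sounding names have about 29% lower odds of a callback than White-sounding names, holding gender, city, and occupation fixed. For high-quality resumes, the implied odds ratio is 0.708 × 0.835 ≈ 0.59. Likewise, the odds ratio of 1.308 for high-quality resumes applies to White-sounding names. The interaction term tests whether the effect of resume quality differs by racialized name group.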

Model comparison for logistic regression models

| Model | AIC | Likelihood-ratio test p-value |
| --- | --- | --- |
| Logistic model with controls | 2691.495 | |
| Logistic model with race × quality interaction | 2692.802 | 0.405 |

The interaction model is useful for interpretation, but the race-by-resume-quality interaction is not statistically significant in the adjusted logistic model. Therefore, the observed pattern in the three-way table should be described as suggestive rather than definitive regression evidence of different returns to resume quality.
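The two columns of the model-comparison table are linked, which offers a quick consistency check. Since AIC equals minus twice the log-likelihood plus twice the number of parameters, the likelihood-ratio statistic can be recovered from the AIC difference. The sketch below assumes that both models were fit to the same observations and that the interaction model adds exactly one parameter; the variable names are illustrative.

```python
import math

# AIC values from the model-comparison table.
AIC_MAIN = 2691.495
AIC_INTERACTION = 2692.802
EXTRA_PARAMS = 1  # assumption: the interaction adds a single coefficient

# AIC = -2 * logLik + 2 * k, so
# LR = 2 * (logLik_big - logLik_small) = (AIC_small - AIC_big) + 2 * extra_params.
lr_stat = (AIC_MAIN - AIC_INTERACTION) + 2 * EXTRA_PARAMS

# Chi-square survival function with 1 degree of freedom: erfc(sqrt(x / 2)).
lr_p_value = math.erfc(math.sqrt(lr_stat / 2))

print(f"Likelihood-ratio statistic: {lr_stat:.3f}")
print(f"p-value (1 df):             {lr_p_value:.3f}")
```

The interaction improves the log-likelihood by less than the AIC penalty for one extra parameter, so AIC favors the model without the interaction.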

5.4 Model-based predicted probabilities

Predicted probabilities translate the logistic model back into the probability scale. This makes the model easier to interpret than log-odds. The predictions below hold city fixed at Boston, gender fixed at female, and occupation fixed at secretary, while varying racialized name group and resume quality.

Predicted callback probabilities from logistic interaction model

| Racialized name group | Resume quality | Predicted probability | Lower 95% CI | Upper 95% CI |
| --- | --- | --- | --- | --- |
| White-sounding name | Low quality | 12.4% | 9.5% | 16.0% |
| African-American-sounding name | Low quality | 9.1% | 6.8% | 12.1% |
| White-sounding name | High quality | 15.6% | 12.3% | 19.7% |
| African-American-sounding name | High quality | 9.9% | 7.4% | 13.0% |

Model-based predicted callback probabilities from the logistic interaction model. Error bars show approximate 95% confidence intervals on the probability scale.


The model-based predicted probabilities tell the same substantive story as the descriptive plots: holding the other covariates fixed, White-sounding names have higher predicted callback probabilities at both quality levels. At the same time, the wider gap at the high-quality level should not be interpreted as evidence of a statistically significant differential return to resume quality, because the race-by-quality interaction term is not statistically significant in the fitted logistic model.
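The link between the odds-ratio table and the predicted probabilities can be sketched directly. Starting from the baseline predicted probability for a White-sounding name with a low-quality resume (12.4%), the other three cells follow by converting to odds, multiplying by the relevant odds ratios from Section 5.3, and converting back. The helper names are hypothetical, and because the inputs are rounded the results agree with the table only to rounding error.

```python
def to_odds(probability):
    """Convert a probability to odds."""
    return probability / (1 - probability)


def to_probability(odds):
    """Convert odds back to a probability (the inverse logit on the odds scale)."""
    return odds / (1 + odds)


# Inputs taken from the tables in Sections 5.3 and 5.4 (rounded values).
BASELINE = 0.124        # White-sounding name, low-quality resume
OR_BLACK = 0.708        # African-American-sounding name (at low quality)
OR_HIGH = 1.308         # high-quality resume (for White-sounding names)
OR_INTERACTION = 0.835  # race-by-quality interaction

base_odds = to_odds(BASELINE)
predicted = {
    ("White", "Low"): BASELINE,
    ("African-American", "Low"): to_probability(base_odds * OR_BLACK),
    ("White", "High"): to_probability(base_odds * OR_HIGH),
    ("African-American", "High"): to_probability(
        base_odds * OR_BLACK * OR_HIGH * OR_INTERACTION
    ),
}

for (race, quality), prob in predicted.items():
    print(f"{race:<17} {quality:<4} quality: {prob:.1%}")
```

Multiplying odds ratios corresponds to adding coefficients on the log-odds scale, which is why the high-quality African-American cell uses all three factors.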

6. Robustness and Limitations

The results should be interpreted in relation to the original experimental design. The strongest causal interpretation applies to the randomized racialized-name manipulation and the callback outcome. Since names were randomly assigned, the main callback gap cannot be attributed to systematic differences in the underlying resumes.

However, there are also important limitations. First, the outcome is an interview callback, not a final job offer or wage. Second, the experiment manipulates racialized name signals rather than directly observing race. Third, our subgroup and regression-extension analyses are useful for describing patterns, but they should not be overinterpreted as separate causal mechanisms unless the relevant subgroup comparisons are also supported by the randomized design and sufficient statistical evidence.

As a robustness check, the regression code also computes cluster-robust standard errors by job advertisement when a job-ad identifier is available. This follows the spirit of the original paper’s regression practice, since multiple resumes can be sent to the same job advertisement.

7. Appendix Figure: Occupation Gaps

The following figure is exploratory and should be treated as an appendix result rather than the main finding.

Exploratory callback-rate gaps by occupation.


8. Conclusion

The reproduction analysis confirms the original study’s central result: resumes with White-sounding names receive more interview callbacks than resumes with African-American-sounding names. In the replication data, the callback rate is approximately 9.65% for White-sounding names and 6.45% for African-American-sounding names, a gap of about 3.20 percentage points.

The course-based extension adds contingency-table and regression analyses. The three-way contingency tables show that the racial callback gap appears across resume quality, city, and gender. Observed callback gains from high-quality resumes appear larger for White-sounding names than for African-American-sounding names. However, the adjusted logistic interaction model does not find a statistically significant race-by-quality interaction, so this pattern should be interpreted cautiously.

Overall, the evidence supports the conclusion that racialized name signals affected employer callback decisions at the resume-screening stage in this experiment. The analysis also shows how categorical-data methods and binary outcome regression can reproduce and extend a classic field experiment in labor economics.

References

Bertrand, Marianne, and Sendhil Mullainathan. 2004. “Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination.” American Economic Review 94(4): 991–1013.