In a 2007 study, Correll et al. investigate the effect of parenthood on job applicant success through a classic resume study. The original study states that, according to prior literature, “mothers experience disadvantages in the workplace in addition to those commonly associated with gender.” It seeks to quantify more concretely the effects of motherhood on a candidate’s desirability by performing a laboratory experiment that evaluates status-based discrimination faced by mothers. The original study then conducts an audit study by sending resumes in response to actual hiring opportunities.
This replication focuses on the survey study portion of Correll et al. (2007), and in particular on the effects on women, since the original study finds a small benefit to fatherhood in the survey study and no effect of fatherhood in the audit study. I used Correll et al.’s original female resumes, presented in an online Qualtrics survey. The study design is within-subjects for parental status: each participant sees two resumes, one from a mother and one from a nonmother. The replication survey produced in Qualtrics is available here.
Correll et al. report effect sizes for several measures, including:

* Perceived competence and commitment;
* Ability standard measures (such as the score needed on a hypothetical management profile exam in order to consider the candidate for employment, and the number of days the applicant could be late to work or leave early before the participant would no longer recommend her for hire); and
* Evaluation metrics (including proposed salary, estimated likelihood that the applicant would subsequently be promoted if selected for the job, and recommendation for hire).

Effect sizes varied, with the largest reported for likelihood of promotion (Cohen’s d=1.13), competence (d=0.849), and commitment (d=0.707).
The sample sizes needed to detect these strongest effects, computed with G*Power, are recorded in the table below:
| Effect | Cohen’s d | N (power=0.8, two-tailed) | N (power=0.9, two-tailed) | N (power=0.95, two-tailed) |
|---|---|---|---|---|
| Competence | 0.849 | 46 | 62 | 76 |
| Commitment | 0.707 | 66 | 88 | 108 |
| Promotion | 1.138 | 28 | 36 | 44 |
A more detailed table of power estimates is available here.
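The tabulated sample sizes can be approximated in base R as a cross-check. The sketch below assumes the G*Power calculation was a two-sided, independent-samples t-test at alpha = 0.05; that choice is an assumption (the source does not state the test family), and `total_n` is a helper name introduced here for illustration. Under that assumption the totals should land on or near the values in the table.

```r
# Total sample size for a two-sided, independent-samples t-test.
# Assumption: this is the test family behind the table above; the
# within-subjects replication design could instead use type = "paired".
total_n <- function(d, power, alpha = 0.05) {
  res <- power.t.test(delta = d, sd = 1, sig.level = alpha, power = power,
                      type = "two.sample", alternative = "two.sided")
  2 * ceiling(res$n)  # round each group up, then count both groups
}

effects <- c(Competence = 0.849, Commitment = 0.707, Promotion = 1.138)
powers  <- c(0.80, 0.90, 0.95)

# Rows are effects, columns are target power levels.
n_table <- sapply(powers, function(p) sapply(effects, total_n, power = p))
colnames(n_table) <- paste0("power=", powers)
n_table
```

Because each participant rates both a mother and a nonmother, a paired calculation would generally require fewer participants for the same standardized effect, so the independent-samples figures are a conservative planning target.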
I planned one wave of 100 participants in order to have high power for observing the strongest effects, followed by a second wave of 100 participants if no effects were observed. The second wave compensates for publication bias, which suggests that the first published study of an effect may report a stronger effect than is generally the case.
Participants were recruited through Amazon Mechanical Turk with three criteria:

* Only participants in the United States were selected;
* HIT approval rate for each participant was over 85%, in order to filter out spammers;
* Number of HITs approved was over 100, in order to filter out brand-new Turkers.
The resumes used in this replication came directly from the Correll 2007 study authors:
The resumes listed the applicant’s career goals, educational history, past work experience, and other relevant activities. The resumes indicated that the applicants had bachelor’s degrees from one of two large midwestern universities and had approximately seven years of work experience. Both applicants were presented as highly productive by including ‘results’ on the resumes, such as ‘increased division sales by 10% between 2000 and 2002’.
All dates on both resumes were shifted by 10 years so as to appear recent; all other content remained the same. The original authors developed these resumes, which had to be different enough not to appear suspiciously similar, by pre-testing them (with the manipulated content removed):
Prior to the actual experiment, we pretested the two versions of the materials to assess whether they were of equivalent quality… A different sample (N=60) drawn from the same population as in the actual experiment rated these two ‘template’ resumes, one at a time, using seven-point scale ranging from ‘not at all’ to ‘extremely’ capable, efficient, skilled, intelligent, independent, self-confident, warm, and sincere. No significant differences were found between participants’ ratings of the two resumes on any of these eight traits.
Additionally, because the two resumes were different in content and format, “parental status was counterbalanced in the actual experiment across the two versions of the resumes for each condition”.
This replication used the same procedure as the original article, except that participants were recruited to an online survey via a Mechanical Turk HIT rather than to an in-person lab session:
Participants came to the lab individually, read a description of a company that was purportedly hiring for a midlevel marketing position, and examined application materials for two applicants for the position who differed on parental status but were otherwise similar. They examined the applicant files one at a time, and we counterbalanced which file, the parent or nonparent, they viewed first. After reviewing an applicant’s file, participants immediately completed an ‘initial impressions’ survey for that applicant… On the same instrument, participants were asked to provide a list of pros and cons for each applicant, a task intended to entice them to look more closely at the applicants’ materials before proceeding to the next stage of evaluation. Participants were next instructed to look at the application materials more closely and complete an ‘applicant evaluation sheet’ for each candidate. This instrument contained our ability standard and evaluation measures.
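The two counterbalancing factors described above (which resume template carries the parent cues, and which applicant is viewed first) cross into four cells. The sketch below shows one way to assign participants evenly across those cells. It is illustrative only: the labels "A" and "B" for the two templates and the function `assign_conditions` are hypothetical, and the source does not describe how the Qualtrics survey actually assigned conditions.

```r
# Hypothetical labels: "A" and "B" stand for the two resume templates.
conditions <- expand.grid(
  mother_resume = c("A", "B"),              # template that carries the parent cues
  shown_first   = c("parent", "nonparent"), # which applicant is viewed first
  stringsAsFactors = FALSE
)

# Cycle through the four cells so that each is used equally often.
assign_conditions <- function(n_participants) {
  idx <- (seq_len(n_participants) - 1) %% nrow(conditions) + 1
  data.frame(participant = seq_len(n_participants),
             conditions[idx, ],
             row.names = NULL)
}

design <- assign_conditions(200)
table(design$mother_resume, design$shown_first)
```

A rotating assignment keeps the cells exactly balanced when the sample size is a multiple of four; random assignment within blocks of four would achieve the same balance while hiding the pattern.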
The planned analysis investigates the same measures reported in the original study (competence, commitment, days allowed late, score required on exam, salary recommended, proportion recommended for management, likelihood of promotion, and proportion recommended for hire), comparing mothers and nonmothers with t-tests or proportion tests as applicable, as recorded in the results table of the paper. Additionally, the analysis will include regression models (linear for continuous variables, logistic for binary variables) to estimate the effects of applicant parental status on each of the dependent variables, as reported in the paper.
Data is cleaned by removing any participants who refused to make an effort at brainstorming pros and cons of each applicant (such as by writing gibberish) or those who reported noticing the manipulation.
The main differences between the original study and this replication are the online format and, as a result, the participant population. Instead of bringing undergraduate students into a lab and handing them physical copies of the materials, online crowdworkers are recruited from Mechanical Turk and follow the experimental procedure through a Qualtrics survey. This could potentially have a serious effect on the results; undergraduates are incentivized by course credit, and have the goal of simply finishing the study to receive that credit. Online crowdworkers, in contrast, know that their pay as well as their future access to work is affected by the quality of the work they submit, and so are incentivized to complete the work carefully and thoroughly. Since bias is unconscious, a more conscientious and diligent focus on applicant qualifications may minimize the unconscious bias a participant displays.
Additionally, the original study also studied fathers compared to nonfathers using the same procedure; this replication only focuses on the stronger results which occurred between mothers and nonmothers, so no male resumes are used. Finally, the original study included stereotypically white-sounding and black-sounding names, but in the analysis races were pooled together since this race manipulation did not turn up significant results; this study only uses the white-sounding female names. These differences are not anticipated to make a difference in obtaining the original results.
Data was collected from 200 participants, using the above exclusion criteria. Due to an oversight on my part, demographics were only collected from the latter half of the participants. According to those demographics, 54% of participants were male and 45% were female. Additionally, 37% reported being parents themselves, and 63% reported not having any children.
The exclusion criteria described above were established before any data was collected. After data was collected, one additional data-cleaning step was taken: any salaries entered in the wrong format (i.e. as 150 rather than 150000) were corrected.
Data preparation followed the analysis plan (this non-crucial code is hidden).
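Since the preparation code is hidden, the sketch below illustrates what the salary correction could look like. It is not the author's code: the function name `fix_salary` and the cutoff of 1000 are assumptions chosen for illustration, on the reasoning that no plausible annual salary for this position falls below that value.

```r
# Hypothetical helper: treat salaries below `threshold` as entered in
# thousands (e.g. 150 meaning 150000) and rescale them.
# The threshold of 1000 is an assumption, not a value from the source.
fix_salary <- function(salary, threshold = 1000) {
  needs_fix <- !is.na(salary) & salary < threshold
  ifelse(needs_fix, salary * 1000, salary)
}

raw_salaries <- c(150, 150000, 95, NA)
fix_salary(raw_salaries)
```

Missing values are left as missing rather than rescaled, so any later listwise deletion behaves the same as on the raw data.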
The analyses as specified in the analysis plan:
# Run all the paired t-tests at once, inspired by: https://sebastiansauer.github.io/multiple-t-tests-with-dplyr/
# Assumes dplyr and tidyr were loaded in the hidden preparation code.
# paired = TRUE matches values by position, so rows of `d` must be in the
# same participant order within each level of `manipulation`.
d %>%
  select(competence_composite, capable, efficient, skilled, intelligent, independent, confident, aggressive, organized, motivated, committed_relative_others, exam_percentile, late_days, salary, likelihood_promoted, manipulation) %>%
  # Reshape to long form: one row per (manipulation, variable, value)
  gather(key = variable, value = value, -manipulation) %>%
  group_by(manipulation, variable) %>%
  # Collapse each cell into a list-column of ratings
  summarise(value = list(value)) %>%
  spread(manipulation, value) %>%
  group_by(variable) %>%
  mutate(p_value = t.test(unlist(parent), unlist(nonparent), paired = TRUE)$p.value,
         t_value = t.test(unlist(parent), unlist(nonparent), paired = TRUE)$statistic,
         avg_parent = mean(unlist(parent)),
         avg_nonparent = mean(unlist(nonparent)))
## Source: local data frame [15 x 7]
## Groups: variable [15]
##
## variable nonparent parent p_value t_value
## <chr> <list> <list> <dbl> <dbl>
## 1 aggressive <dbl [198]> <dbl [198]> 0.4938425 -0.68548700
## 2 capable <dbl [198]> <dbl [198]> 0.2849017 -1.07228982
## 3 committed_relative_others <dbl [198]> <dbl [198]> 0.1174358 -1.57251498
## 4 competence_composite <dbl [198]> <dbl [198]> 0.3194920 -0.99802398
## 5 confident <dbl [198]> <dbl [198]> 0.1382060 -1.48854612
## 6 efficient <dbl [198]> <dbl [198]> 0.4868877 -0.69657855
## 7 exam_percentile <dbl [198]> <dbl [198]> 0.8020106 0.25108437
## 8 independent <dbl [198]> <dbl [198]> 0.6450356 -0.46137912
## 9 intelligent <dbl [198]> <dbl [198]> 0.9527780 0.05929406
## 10 late_days <dbl [198]> <dbl [198]> 0.1847503 -1.33093026
## 11 likelihood_promoted <dbl [198]> <dbl [198]> 0.2137207 -1.24743022
## 12 motivated <dbl [198]> <dbl [198]> 0.4091376 -0.82717696
## 13 organized <dbl [198]> <dbl [198]> 0.5260433 -0.63518741
## 14 salary <dbl [198]> <dbl [198]> 0.7204414 -0.35838180
## 15 skilled <dbl [198]> <dbl [198]> 0.3251602 -0.98637262
## # ... with 2 more variables: avg_parent <dbl>, avg_nonparent <dbl>
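To put these t statistics on a scale comparable to an effect size, a paired t value can be converted to the standardized mean difference of the paired scores, d_z = t / sqrt(n), where n is the number of pairs. The sketch below applies this to three t values copied from the output above, with n = 198. Note that d_z is standardized by the standard deviation of the within-participant differences, so it is only a rough comparison with the Cohen’s d values reported in the original study.

```r
# t statistics copied from the paired t-test output above (parent minus nonparent).
t_values <- c(competence_composite      = -0.99802398,
              committed_relative_others = -1.57251498,
              likelihood_promoted       = -1.24743022)
n_pairs <- 198  # participants remaining after exclusions

# Standardized mean difference for paired data: d_z = t / sqrt(n).
d_z <- t_values / sqrt(n_pairs)
round(d_z, 3)
```

All three values are below 0.12 in absolute size, far smaller than the effects used in the power analysis.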
# Proportion tests
d %>%
  group_by(manipulation) %>%
  summarize(hire_true = sum(hire_rec), total = n())
## # A tibble: 2 × 3
## manipulation hire_true total
## <chr> <int> <int>
## 1 nonparent 172 198
## 2 parent 158 198
prop.test(c(172, 158), c(198, 198), correct=FALSE, alternative = "greater")
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: c(172, 158) out of c(198, 198)
## X-squared = 3.5636, df = 1, p-value = 0.02953
## alternative hypothesis: greater
## 95 percent confidence interval:
## 0.009376015 1.000000000
## sample estimates:
## prop 1 prop 2
## 0.8686869 0.7979798
d %>%
  group_by(manipulation) %>%
  summarize(train_true = sum(training_rec), total = n())
## # A tibble: 2 × 3
## manipulation train_true total
## <chr> <int> <int>
## 1 nonparent 164 198
## 2 parent 152 198
prop.test(c(164, 152), c(198, 198), correct=FALSE, alternative = "greater")
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: c(164, 152) out of c(198, 198)
## X-squared = 2.2557, df = 1, p-value = 0.06656
## alternative hypothesis: greater
## 95 percent confidence interval:
## -0.005579394 1.000000000
## sample estimates:
## prop 1 prop 2
## 0.8282828 0.7676768
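For readers who want to see what `prop.test(..., correct = FALSE)` computes, the sketch below reproduces the pooled two-proportion z-test by hand from the counts above. The helper name `two_prop_z` is introduced here for illustration. Like `prop.test`, it treats the two groups as independent samples, which is an approximation in this within-subjects design, where the same participants rated both applicants.

```r
# Pooled two-proportion z-test; the squared z equals the X-squared
# statistic that prop.test() reports without continuity correction.
two_prop_z <- function(x1, x2, n1, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  pooled <- (x1 + x2) / (n1 + n2)
  se <- sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
  z <- (p1 - p2) / se
  c(z = z,
    chi_sq = z^2,
    p_one_sided = pnorm(z, lower.tail = FALSE))  # H1: p1 > p2
}

hire       <- two_prop_z(172, 158, 198, 198)  # recommended for hire
management <- two_prop_z(164, 152, 198, 198)  # recommended for management
round(rbind(hire, management), 4)
```

The `chi_sq` and `p_one_sided` columns should agree with the X-squared and p-values printed by `prop.test` above.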
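# Linear regressions
# Each model overwrites `fit`, so summary() below reports only the last one (hire_rec).
# glm() called without a `family` argument fits a Gaussian model, so the two
# glm() calls here are linear probability models and the output below reflects that;
# a logistic fit would need family = binomial.
fit <- lm(competence_composite ~ manipulation, data=d)
fit <- lm(committed_relative_others ~ manipulation, data=d)
fit <- lm(late_days ~ manipulation, data=d)
fit <- lm(exam_percentile ~ manipulation, data=d)
fit <- lm(salary ~ manipulation, data=d)
fit <- glm(training_rec ~ manipulation, data=d)
fit <- lm(likelihood_promoted ~ manipulation, data=d)
fit <- glm(hire_rec ~ manipulation, data=d)
summary(fit)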
##
## Call:
## glm(formula = hire_rec ~ manipulation, data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8687 0.1313 0.1313 0.2020 0.2020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.86869 0.02643 32.864 <2e-16 ***
## manipulationparent -0.07071 0.03738 -1.892 0.0593 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1383377)
##
## Null deviance: 55.000 on 395 degrees of freedom
## Residual deviance: 54.505 on 394 degrees of freedom
## AIC: 344.48
##
## Number of Fisher Scoring iterations: 2
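The analysis plan calls for logistic regression on the binary outcomes, whereas the `glm()` call above has no `family` argument and therefore fits a Gaussian model (note "gaussian family" in the output). The sketch below shows what a logistic fit for the hire recommendation could look like, rebuilt from the aggregated counts reported earlier (172 of 198 nonmothers and 158 of 198 mothers recommended). It is a sketch rather than a reanalysis: it uses only those counts, not the raw data, and it ignores the within-participant pairing.

```r
# Aggregated hire recommendations, taken from the proportion-test counts above.
hire_counts <- data.frame(
  manipulation = factor(c("nonparent", "parent")),
  hired        = c(172, 158),
  total        = c(198, 198)
)

# Logistic regression on grouped data: successes and failures per condition.
fit_logit <- glm(cbind(hired, total - hired) ~ manipulation,
                 family = binomial, data = hire_counts)

# Odds ratio for parents relative to nonparents.
odds_ratio <- exp(coef(fit_logit))[["manipulationparent"]]
odds_ratio
```

An odds ratio below 1 means that mothers had lower odds of being recommended for hire than nonmothers, matching the direction of the linear probability coefficient above.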
The table below reproduces the one on page 1316 of Correll et al. (2007), with the two rightmost columns added from the replication.
| Measure | Mothers (Original) | Nonmothers (Original) | Mothers (Replication) | Nonmothers (Replication) |
|---|---|---|---|---|
| Competence | 5.19** | 5.75 | 5.47 | 5.54 |
| Commitment | 67.0** | 79.2 | 5.44 | 5.58 |
| Days allowed late | 3.16** | 3.73 | 3.19 | 3.98 |
| % score required on exam | 72.4** | 67.9 | 79.25 | 79.07 |
| Salary recommended ($) | 137,000** | 148,000 | 144,313 | 144,651 |
| Proportion recommend for management | 0.691^^ | 0.862 | 0.768^ | 0.828 |
| Likelihood of promotion | 2.74** | 3.42 | 3.02 | 3.10 |
| Proportion recommend for hire | 0.468^^ | 0.840 | 0.798^^ | 0.869 |
^ p < 0.10, z-test for difference in proportions between mothers and nonmothers; ^^ p < 0.05. * p < 0.10, t-test for difference in means between mothers and nonmothers; ** p < 0.05.
# Linear regressions with more factors added
# As above, `fit` is overwritten each time, so only the hire_rec model is summarized,
# and glm() without `family` fits a Gaussian (linear probability) model.
fit <- lm(competence_composite ~ manipulation + candidate_name + participant_sex + participant_parent, data=d)
fit <- lm(committed_relative_others ~ manipulation + candidate_name + participant_sex + participant_parent, data=d)
fit <- lm(late_days ~ manipulation + candidate_name + participant_sex + participant_parent, data=d)
fit <- lm(exam_percentile ~ manipulation + candidate_name + participant_sex + participant_parent, data=d)
fit <- lm(salary ~ manipulation + candidate_name + participant_sex + participant_parent, data=d)
fit <- glm(training_rec ~ manipulation + candidate_name + participant_sex + participant_parent, data=d)
fit <- lm(likelihood_promoted ~ manipulation + candidate_name + participant_sex + participant_parent, data=d)
fit <- glm(hire_rec ~ manipulation + candidate_name + participant_sex + participant_parent, data=d)
summary(fit)
##
## Call:
## glm(formula = hire_rec ~ manipulation + candidate_name + participant_sex +
## participant_parent, data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9532 0.0468 0.1384 0.2037 0.2894
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.96866 0.04851 19.968 < 2e-16 ***
## manipulationparent -0.07025 0.04289 -1.638 0.10256
## candidate_namesarah -0.01546 0.04289 -0.360 0.71880
## participant_sexMale -0.13560 0.04358 -3.112 0.00205 **
## participant_parent -0.03677 0.04498 -0.817 0.41435
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1314875)
##
## Null deviance: 38.601 on 285 degrees of freedom
## Residual deviance: 36.948 on 281 degrees of freedom
## (110 observations deleted due to missingness)
## AIC: 238.34
##
## Number of Fisher Scoring iterations: 2
As the regressions above indicate, all measures are in the right direction, but only the binary “Recommendation for hire” variable comes close to being significantly predicted by the manipulation (p = 0.059 in the unadjusted model, and p = 0.030 in the one-sided proportion test). Two effects appear more salient than parenthood: the resumes themselves and the sex of the participant. I discuss these in greater detail below.
As described in the Methods, two different resumes were used, after having been pre-tested as equivalent. In order to identify whether this effect was due to the difference in presentation of the two resumes (e.g. line spacing on the resume judged as better was a bit larger, making it appear longer), I presented both resumes in a common format and ran another 50 participants, which produced the same result. The data from this exploration is available as anonymized-followup.csv, and the analyses above can be run on that data.
Additionally, several of the models above suggest that the sex of the participant had a significant effect on the rating. Men were harsher judges in general: candidates rated by men were significantly less likely to be recommended for hire, were required to achieve a higher exam score, and were rated as less competent.
This replication attempt was partially successful. The proportion recommended for hire differs significantly between mothers and nonmothers (one-sided p = 0.030), and the proportion recommended for management differs marginally (one-sided p = 0.067), with mothers less likely to be recommended in both cases; the rest of the original study’s findings do not replicate. It is notable, however, that all measures are in the right direction (that is, mothers are rated more harshly, as concluded by the original study).
The stronger effect seems to be that of the different resumes. While the resumes were pre-tested in the original study as equivalent on the study’s measures, this does not appear to be the case in my replication. Rather, one of the two resumes appears to be stronger overall. A followup exploratory analysis confirmed this by putting both resumes in a uniform format, to control for the effect of the different presentation.
Another effect is the sex of the participant reviewing the resume. While not a key finding, the original paper does report that its regression models find a significant effect of participant sex: female participants were more likely to recommend an applicant for hire or promotion, regardless of parental status.
A primary discrepancy between this replication and the original study is the relatively strong effect of the resume used on likelihood of promotion and other outcomes. This may suggest that online crowdworkers evaluate resumes differently than either undergraduates in a lab or actual hiring managers. It is possible that online workers, who know that their pay and future work opportunities are dependent on their performance, are more attentive to the content of the resume, mitigating their unconscious bias.
Additionally, further analyzing the data by sex of participant indicates that men are harsher in their judgements of the female applicants’ competence (a composite variable made of attributes such as “skilled,” “aggressive,” “motivated,” “capable,” “efficient,” etc.). The original study included 84 male and 108 female participants, whereas participants in this replication were majority male, which perhaps accounts for much of the lack of replication. This has important consequences for real-world hiring situations, in which some companies (for instance, Silicon Valley tech companies, the scenario described to participants in this study) may have an overwhelming majority of men, who are likely to judge female (and perhaps also male) resumes more harshly.