Introduction

   I am interested in studying the career processes of entrepreneurs, especially those who move in and out of traditional, hierarchical firms. Kacperczyk and Younkin (2021) found that entrepreneurial experience negatively influences workers’ prospects at the hiring stage and showed how this founding penalty varies by gender. Their Study 1 is a resume-based audit study that tested the effect of entrepreneurial experience on the probability of interview callbacks, and their Study 2 is an experimental survey that evaluated explanations for why ex-founders returning to wage employment face a penalty. I replicate Study 2 because it is feasible in scope and time and because it engages with the literature on why founding penalties exist.
   Participants are given one of four resumes differing in founding experience and gender and asked to evaluate the job candidate’s qualities, especially fit and commitment. In other words, this survey experiment is a two-by-two between-subjects design with participants randomly assigned to one of four resume conditions: female founder, male founder, female employee (non-founder), and male employee (non-founder). The dependent variable is the respondents’ willingness to recommend the job candidate to a future employer, and the mediating variables are the extent to which the candidate is seen as a good fit for a hierarchical organization and as likely to quit their next job. At the end of the experiment, respondents’ age, gender, and years of work experience are collected so that respondent characteristics can be controlled for.
    The primary challenge will be to find a sample as similar as possible to that of the original study: marketing managers with more than five years of work experience and at least a bachelor’s degree who come from a diverse range of industries. Because the paper does not report the industry composition of the sample and only states that the sample is from “across a range of industries, such as manufacturing, advertising, healthcare, software development, and consulting”, I am worried about obtaining a sample whose industry composition differs vastly from that of the original. Another challenge will be retaining respondents’ attention, which my additional questions are likely to make harder: the more questions and tasks respondents are asked to complete, the less carefully they tend to answer.
    The link to the repository is as follows: https://github.com/psych251/Kacperczyk2021.git

Methods

Power Analysis

    The original effect size is a Cohen’s d of approximately 0.286, computed from the t statistic of 2.05 (p = 0.04) reported in the original study. We estimated the degrees of freedom to be 205 by dividing the total sample size of 413 by 2 and subtracting 1. The estimated total sample sizes at 80%, 90%, and 95% power are 386, 516, and 636 individuals, respectively.
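
    The calculation above can be sketched in R as follows. This is a minimal sketch, not necessarily the exact procedure used: it assumes the conversion d = 2t / sqrt(df) for two independent groups (which reproduces the 0.286 reported above) and a two-sided test at alpha = 0.05. The variable and function names are illustrative.

```r
# Effect size and required sample size from the reported test statistic.
# Assumption: d = 2t / sqrt(df) for two independent groups.
t_stat <- 2.05   # t statistic reported in the original study
df     <- 205    # degrees of freedom as estimated in the text

d_orig <- 2 * t_stat / sqrt(df)   # approximately 0.286

# Total N (both groups) for a two-sided, two-sample t-test at alpha = 0.05.
total_n <- function(power) {
  res <- power.t.test(delta = d_orig, sd = 1, sig.level = 0.05,
                      power = power, type = "two.sample",
                      alternative = "two.sided")
  2 * ceiling(res$n)   # power.t.test() returns n per group
}

n_by_power <- sapply(c(0.80, 0.90, 0.95), total_n)
print(round(d_orig, 3))
print(n_by_power)
```

    Rounding the per-group n up before doubling keeps the two conditions balanced, which is why the totals are even numbers.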

Planned Sample

    The planned sample of the present replication study is 386 individuals who have hiring experience. If the budget allows for further selection of the sample, it would be helpful to further restrict the sample to those who have at least 2 years of work experience.

Materials and Procedure

    Here is the link for the materials: https://stanforduniversity.qualtrics.com/jfe/form/SV_6sRpi4Fvh05zJPg. And here is the link to the preregistration: http://osf.io/dh5kg.
    The resumes of job candidates differing in founding experience and gender used in the present study were downloaded from the original study's materials. The procedure is also exactly the same, except for one additional set of questions in the replication study about the respondents’ current job position and industry. I add these because the original study sample consisted entirely of hiring managers with extensive experience hiring for marketing positions, whereas the replication sample is not restricted in this way.

Analysis Plan

    “We begin by analyzing differences in means between our treatment and control conditions. Consistent with the audit results, participants gave a significantly stronger (t = 2.05, p = 0.04) interview recommendation to nonfounders (mean = 5.73) than to ex-founders (mean = 5.40), and female ex-founders received a significantly higher recommendation (t = 4.33, p = 0.01, mean = 5.89) than male ex-founders (mean = 4.86).”
    The key analysis of interest is the unpaired, two-tailed t-test comparing the experimental (ex-founder) and control (non-founder) groups. Moreover, because the replication sample differs from the original, we ask additional questions about respondents' current employment status and employer, which tell us what sector or industry they work in and, if they are self-employed, what industry their ventures are in.

Differences from Original Study

    I expect there to be some differences in sample. First, the present replication study does not recruit hiring managers but rather anyone who has experience making hiring decisions. Moreover, the original study recruited those who have experience hiring for marketing positions, whereas the sample collected in the present study is expected to vary in job position. Because perceptions of ex-entrepreneurs, which are the focus of Study 2, might differ across job positions and industries, I anticipate that these sample differences might produce results that differ from the claims in the original article. Though this might be seen as ‘failing’ to replicate, the exercise could still yield new insights, for example that job positions and industries are significant drivers of the results. Otherwise, the setting, procedure, and analysis plan strictly follow the original study.

Actual Sample

    A total of 239 individuals successfully responded to the survey. These individuals satisfied two selection criteria. The first criterion relates to hiring experience: all study participants answered “Yes” to the question “Do you have any experience in making hiring decisions (i.e., have you been responsible for hiring job candidates)?” The second criterion relates to country of residence: all participants answered “United States” to the question “In what country do you currently reside?”
    In addition to including only participants who currently live in the US and have hiring experience, the present study excludes participants who did not pass the attention check. The attention check verifies that participants properly read the resume by asking them whether the given job candidate was an ex-founder. There were 51 individuals who answered this attention check incorrectly and were therefore excluded from the sample (and analyses), leaving 188 participants.

Differences from pre-data collection methods plan

    For the actual implementation, there was an additional screening question that excluded participants who do not currently reside in the United States. This was not considered in the pre-data collection methods plan, but after piloting the survey on several Prolific participants, I realized that unfamiliarity with the US context would seriously impair participants’ ability to correctly understand and evaluate resumes that described educational and work experiences in the U.S. Furthermore, the original study also recruited individuals (hiring managers) in the U.S., so the decision to add another prescreening question is in line with the original study.

Results

Data preparation

First, I clean the data by excluding cases that lack responses on the key variables.
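
The cleaning code itself is not shown here, so the following is only a hedged sketch of what such a step might look like. It uses base R on a made-up toy data frame; the two column names are taken from the code later in this section, but treating exactly these two as the "key variables" is an assumption.

```r
# Toy data standing in for the raw Qualtrics export (values are made up).
raw <- data.frame(
  `recommend-scale`     = c("Probably recommend", NA, "Very probably recommend"),
  `attention-binary-qs` = c("Yes, an ex-founder", "No, not an ex-founder", NA),
  check.names = FALSE   # keep the hyphenated Qualtrics column names as-is
)

# Assumption: these two columns are the key variables for the analysis.
key_vars <- c("recommend-scale", "attention-binary-qs")

# Keep only rows with a response on every key variable.
cleaned <- raw[complete.cases(raw[, key_vars]), ]
print(nrow(cleaned))
```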

I then recode the recommendation scale to a numerical scale.

d$recommend.scale <- recode(d$`recommend-scale`,
                            'Definitely\nwill not recommend' = 1,
                            'Not very probably recommend' = 2,
                            'Probably not recommend' = 3,
                            'Might or\nmight not recommend' = 4,
                            'Probably recommend' = 5,
                            'Very probably recommend' = 6,
                            'Definitely will recommend' = 7)

I then create one variable for the founding condition (ex-employee vs. ex-founder resume) and one for the candidate's gender, because the survey is a 2 x 2 study.

# Each resumes_randomization_DO_* column is non-missing only for the resume
# that the respondent was shown, so non-missingness identifies the condition.
d <- d %>%
  select(recommend.scale, starts_with('resumes_randomization'), `attention-binary-qs`) %>%
  mutate(
    employee_condition = ifelse(!is.na(`resumes_randomization_DO_male-employee`) |
                                  !is.na(`resumes_randomization_DO_female-employee`),
                                'employee', 'founder'),
    gender = ifelse(!is.na(`resumes_randomization_DO_female-employee`) |
                      !is.na(`resumes_randomization_DO_female-founder`),
                    'female', 'male')
  )

I exclude 51 cases that do not pass the attention check.

d$attention.check <- recode(d$`attention-binary-qs`, 'Yes, an ex-founder'= "founder",'No, not an ex-founder' = 'employee')

d = d %>% 
  filter(attention.check == employee_condition)

Let’s plot this out! The bar chart suggests that ex-employees on average seem to receive more positive recommendations than ex-founders. For example, a greater number of respondents said that they would very probably recommend (“6” on the recommend scale) or definitely recommend (“7” on the recommend scale) the candidate when given an ex-employee resume than when given an ex-founder resume.

ggplot(d, aes(x = recommend.scale, fill=gender)) +
  geom_bar(position = position_dodge(width = 1)) +
  facet_grid(~employee_condition)

Confirmatory analysis

    When we run a simple unpaired two-sample t-test (Welch's test, the default in R's t.test), we obtain a t-statistic of -0.80 with a non-significant p-value of 0.42.
t.test(d$recommend.scale[d$employee_condition=='founder'], 
       d$recommend.scale[d$employee_condition=='employee'])
## 
##  Welch Two Sample t-test
## 
## data:  d$recommend.scale[d$employee_condition == "founder"] and d$recommend.scale[d$employee_condition == "employee"]
## t = -0.80127, df = 155.59, p-value = 0.4242
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4466889  0.1888792
## sample estimates:
## mean of x mean of y 
##  5.531646  5.660550
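
To put this result on the same scale as the original effect size, the group means above can be converted into a rough standardized difference. This is only a sketch: it assumes equal variances and uses the residual standard error of the founder-only regression reported in the replication table (1.068) as the pooled standard deviation. The variable names are illustrative.

```r
# Group means from the Welch t-test output above.
mean_founder  <- 5.531646
mean_employee <- 5.660550

# Assumption: the residual standard error of the founder-only regression
# (1.068) serves as the pooled standard deviation.
pooled_sd <- 1.068

# Standardized mean difference (founder minus employee).
d_rep <- (mean_founder - mean_employee) / pooled_sd
print(round(d_rep, 3))
```

The sign is negative because the difference is taken as founder minus employee. In the direction used for the original study (non-founders rated higher), it is roughly 0.12, compared with the original estimate of about 0.286.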
    However, when we consider gender and the interaction between founding experience and gender, the results are consistent with those of the original study. In the “Replication of Kacperczyk & Younkin (2021)” table below, once gender and the founder-by-gender interaction are included, the negative coefficient on having founded a company becomes larger and significant (-0.425, p < 0.05); in this model it represents the penalty for male candidates. This is in line with the original study, as seen in the table titled “Table 5”, positioned below the “Replication of Kacperczyk & Younkin (2021)” table.
d$female <- ifelse(d$gender=='female', 1,0)
d$founder <- ifelse(d$employee_condition=='founder', 1,0)
model1 <- lm(recommend.scale ~ founder, data = d)
model2 <- lm(recommend.scale ~ founder + female + founder*female, data = d)

stargazer(model1,model2, type = "html", title = "Replication of Kacperczyk & Younkin (2021)")
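Replication of Kacperczyk & Younkin (2021)
======================================================================
                        Dependent variable: recommend.scale
                        (1)                    (2)
----------------------------------------------------------------------
founder                 -0.129                 -0.425**
                        (0.158)                (0.215)
female                                         -0.147
                                               (0.203)
founder:female                                 0.659**
                                               (0.315)
Constant                5.661***               5.736***
                        (0.102)                (0.145)
----------------------------------------------------------------------
Observations            188                    188
R2                      0.004                  0.030
Adjusted R2             -0.002                 0.014
Residual Std. Error     1.068 (df = 186)       1.059 (df = 184)
F Statistic             0.667 (df = 1; 186)    1.911 (df = 3; 184)
======================================================================
Note: *p<0.1; **p<0.05; ***p<0.01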
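
To make the interaction easier to read, the coefficients of model (2) can be combined into the predicted recommendation for each of the four resume conditions. The sketch below uses only the point estimates reported in the table; the object names are illustrative, and standard errors for the combined quantities are not computed.

```r
# Point estimates from model (2) in the table above.
b <- c(constant = 5.736, founder = -0.425, female = -0.147, interaction = 0.659)

# Predicted mean recommendation in each cell of the 2 x 2 design.
cell_means <- c(
  male_employee   = b[["constant"]],
  male_founder    = b[["constant"]] + b[["founder"]],
  female_employee = b[["constant"]] + b[["female"]],
  female_founder  = sum(b)
)

# Founding effect within each gender (founder minus employee).
effect_male   <- b[["founder"]]
effect_female <- b[["founder"]] + b[["interaction"]]

print(round(cell_means, 3))
print(round(c(male = effect_male, female = effect_female), 3))
```

Under these estimates the founding effect is negative for male candidates and slightly positive for female candidates, so the two partly offset each other when the groups are pooled.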


Discussion

Summary of Replication Attempt

    At first glance, the t-test result suggests that I have failed to replicate the original result. However, the OLS model that includes gender and the interaction between gender and founding status shows that the founding penalty reappears once gender is taken into account. The negative coefficient on founding experience in that model is sizable and significant at the 5% level, and the non-significant t-test makes sense if the ex-founder effects for men and women offset each other (i.e., the penalty is present for men but not for women). This also accounts for why the unpaired two-sample t-test gives a very high p-value while the regression gives a lower one: the regression on founding status alone (Model 1) is just as non-significant as the t-test, and the lower p-value appears only in Model 2, where the founder coefficient captures the penalty for male candidates. All in all, I am more inclined to say that the primary result has replicated than that it has not.

Commentary

    The results from the replication study provide evidence that gender is a critical component of the story about ex-founders returning to paid employment. Though the present replication study cannot identify what exactly drives this difference, the result supports the view that the founding penalty differs by gender, possibly because women and men entrepreneurs returning to paid employment are perceived differently.
    Looking back, I realize that the replication effort has been a laborious but very meaningful process. It was laborious in that replication requires attention to detail in order to make the study as similar as possible to the original. But the process of replicating Kacperczyk & Younkin (2021) has been extremely meaningful, as it allowed me to look closely into the experimental decisions carefully made by the original authors and gave me an opportunity to reflect on how such decisions could push responses in different directions. Along similar lines, I also realized the importance of running pilot studies, which reveal a lot about what the design is doing well and poorly and hence help us make necessary changes (e.g., adding a prescreening question to limit the population to those living in the US) before running the study on the entire sample.