Replication of Study 1 by Krauss & Wang (2003, Journal of Experimental Psychology)

Author

Bendix Kemmann (kemmann@stanford.edu)

Published

March 25, 2024

Introduction

This is an attempt to replicate Experiment 1 of Krauss and Wang (2003). In this experiment, Krauss and Wang (2003) investigated how different representations of the famous Monty Hall problem affect the rate of correct answers as well as the rate of correct justifications given for those answers. They predicted that a presentation of the problem involving four features intended to elicit a correct intuitive solution would be more likely to lead to correct responses than a control version that lacked these features. The four features, which the authors call ‘psychological elements’, were: (1) prompting participants to adopt the fully informed perspective of Monty Hall rather than the incomplete perspective of the contestant, (2) prompting participants to ignore the specifics of which particular door is opened by specifying merely that some door was opened and revealed a goat, (3) prompting participants to construct ‘mental models’ of the scenario and thereby to consider all possible arrangements of where the prize might be, and (4) prompting participants to construe the problem in terms of natural frequencies rather than probabilities. In the original study, there were 67 participants in the control group and 34 participants in the experimental condition. Krauss and Wang (2003) reported that 38% of the experimental group provided correct justifications for their solutions to the Monty Hall problem. In contrast, only 3% of participants in the control group provided correct justifications. Krauss and Wang (2003) conclude that

[t]he rate of correct justifications in the guided intuition group was reliably greater than that in the control condition (\(p=.0001\), one-tailed; \(h=.98\)) […] Thus, the guided intuition manipulation significantly improved understanding of the rationale for switching yielding large effect sizes. (Krauss and Wang 2003, 14)

In this replication attempt, each participant will be presented with a questionnaire that contains one of two versions of the Monty Hall problem: a) a control version (Figure 3 in the original paper), and b) a ‘Guided intuition’ version in which all four features are added to the control version (Figure 4 in the original paper).1 That is, participants in the control condition complete questions in response to a standard version of the Monty Hall problem, whereas participants in the experimental condition complete questions about the ‘Guided intuition’ version. As in the original study, after excluding participants who have encountered the Monty Hall problem previously, participants’ responses will be classified into ‘stayers’ (incorrect answer) and ‘switchers’ (correct answer), and the switchers’ justifications for their answer will be classified into one of three groups, depending on whether the experimenter believes the participant exhibited full insight into the mathematical structure of the problem, had the right intuition but couldn’t provide a correct proof, or switched randomly.

Summary of prior replication attempt by Wilcox (2018)

Wilcox (2018)’s replication attempt was conducted via an online Qualtrics questionnaire, rather than in person as the original study had been. Participants were recruited via Amazon Mechanical Turk. While the original study featured participants from various German universities, Wilcox (2018) included participants who had at least graduated from high school. Wilcox (2018) also left out the ‘One-Door’ experimental condition, in which only one feature, the ‘less-is-more’ manipulation, was added (Appendix C in the original paper). As some participants may have prior familiarity with the Monty Hall problem, Wilcox (2018) omitted such participants from the study using pre-test and post-test approaches. In the pre-test, participants were asked to indicate whether they were familiar with the Monty Hall problem. The original study didn’t include such a pre-test, but it included the same post-test. In the post-test, participants were asked about their prior familiarity with the Monty Hall problem and whether they already knew the answer to the problem (see the appendix for the materials). Wilcox (2018) also modified the instructions in two minor ways. First, he omitted the statement ‘You may use sketches, etc., to explain your answer’, as the online testing environment would not easily provide for that capacity. Second, he included a simple attention check in which participants were asked how many doors were mentioned on a previous page in the questionnaire. His study involved 42 participants, 19 of whom both passed the attention check and indicated that they were not already familiar with the problem. The resulting data were small in number and unevenly distributed between conditions: the control condition had 8 responses and the experimental condition had 11. In Wilcox (2018)’s words, ‘The replication sample was small and underpowered.’ Furthermore, in Wilcox (2018)’s study, none of the respondents in either the experimental or the control condition provided correct justifications.

Methods

A priori power analysis

For the original effect size

Using Fisher’s exact test, for two groups with proportions of 3% and 38%…

  • …38 respondents are needed to achieve 80% power, with respondents evenly allocated between groups
  • …46 respondents are needed to achieve 90% power, with respondents evenly allocated between groups
  • …58 respondents are needed to achieve 95% power, with respondents evenly allocated between groups

For the attenuated effect size

Using Fisher’s exact test, for two groups with proportions of 3% and 19%…

  • …106 respondents are needed to achieve 80% power, with respondents evenly allocated between groups
  • …134 respondents are needed to achieve 90% power, with respondents evenly allocated between groups
  • …168 respondents are needed to achieve 95% power, with respondents evenly allocated between groups
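
These figures can be sanity-checked in R with a simulation-based power calculation for Fisher’s exact test. The following is a minimal sketch, not the original power-analysis tool: statmod::power.fisher.test estimates power by simulation (two-sided by default), so results will vary slightly from run to run and may differ a little from the numbers above.

### sketch: simulation-based power check (assumes the statmod package)
library("statmod")

set.seed(251)

# original effect size (3% vs. 38%), 19 respondents per group = 38 total
power.fisher.test(p1 = 0.03, p2 = 0.38, n1 = 19, n2 = 19, nsim = 10000)

# attenuated effect size (3% vs. 19%), 53 respondents per group = 106 total
power.fisher.test(p1 = 0.03, p2 = 0.19, n1 = 53, n2 = 53, nsim = 10000)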

Planned Sample

Participants will be recruited via Prolific.

Original sample

The original study featured three groups of participants recruited from various German universities. There were 67 participants in the control group and 34 participants in the experimental condition.

Eligibility criteria

Educational background. In this replication study, for simplicity, participants will not be pre-selected based on their educational credentials. (While Wilcox (2018) initially selected only participants who had at least graduated from high school, he later abandoned his pre-screening survey due to the low number of participants who had completed it.) Participants’ education level does not appear to be an important characteristic of the sample, and Krauss and Wang (2003) appear to have selected university students simply for convenience. We’re not trying to establish correlations between participants’ answers and their education levels.

Pre-screening. Because the hypothesis depends solely on the switch choices and justifications given by participants who aren’t already familiar with the Monty Hall problem, and to minimize costs, we will evaluate participants’ prior familiarity with the problem using a short pre-screening survey (https://stanforduniversity.qualtrics.com/jfe/form/SV_8B15MeMeEMk1zHE). The survey asks about prior familiarity with five games and reasoning problems, one of which is the Monty Hall problem. Every participant who answers the ‘Are you familiar with the Monty Hall problem?’ question with ‘no’ will be invited to the experiment; answers to the remaining questions do not affect eligibility.

Planned sample size

Proposed sample size: 100 participants for the main experiment, which approximately matches both the original sample size (101) and the number needed for 80% power at the attenuated effect size (106). Because some potential participants will already be familiar with the Monty Hall problem, more participants will be invited to the pre-screening. Given that the pre-screening takes less than a minute and is inexpensive compared to the main experiment, we propose initially inviting up to 150 participants to the pre-screening.

Stopping criterion: 100 participants have completed the main experiment. To that end, we will initially recruit 150 participants for the pre-screen, and if fewer than 100 pass, we will pre-screen more until 100 participants have completed the main experiment.

Materials

The original and modified instructions for the control group are reproduced in Appendix A. The original and modified instructions for the experimental condition are reproduced in Appendix B.

The following minor modifications were made to the materials:

  • The name ‘Monty Hall’ was changed to a made-up name, ‘Maeve’; the reward was changed from a car to a suitcase of money; and the doors with goats behind them were changed to doors with nothing behind them. This doesn’t change the structure of the problem, but might make it slightly harder for participants to find other presentations of the problem online.
  • The illustration used by the original study was modified stylistically in minor ways for clarity and to improve the participants’ user experience. However, the structural aspects of the illustration (namely, the presence of 3 doors, the highlighting of the 1st door, and the relative positions of the contestant and the host) have all remained unchanged. We hope that the revised illustration will be slightly easier to interpret, but given that it doesn’t alter the structure of the reasoning problem, we don’t expect it to have a significant effect on task performance.
  • The (translation of the) original instructions, as well as those used by Wilcox (2018), included the statement ‘Please also tell us in writing what went on in your head when you made your decision.’ This wording is imprecise: Are participants meant to justify their decision or rather to introspectively report on ‘what went on in their heads’ as they thought through the problem? As a result, the wording was changed to ‘Please also tell us in writing the reasoning you used to arrive at your decision. Please be as detailed as possible.’ We don’t know the precise wording used in the original, German study, as this was not reported. However, given that the experimenters expected participants to explain their reasoning, we believe the revised wording is more in line with their intentions.
  • Wilcox (2018) omitted the statement ‘You may use sketches, etc., to explain your answer’, as the online testing environment does not easily provide for that capacity. However, because the ability to make sketches and write formulas may be important for some participants to arrive at, and justify, their answers, an instruction will be added that draws participants’ attention to the fact that they may create notes, sketches, and formulas on paper. Participants may then upload a photograph of their handwritten work if they choose to. This brings the conditions of the experiment closer to those of the original study, where participants produced handwritten answers. These instructions will also be included in advance in the description of the study on Prolific. Requiring the uploading of handwritten justifications may deter some potential participants from participating due to the extra steps and materials involved, and may also inadvertently increase the representation of particularly conscientious participants in the sample. However, many participants are likely to have access to paper and pencil, as well as a smartphone or laptop camera. Moreover, completing the task successfully does require some conscientiousness on the part of the participants and, considering the low quality of responses obtained in the previous replication study, we argue that it is worth trying a setup that more closely resembles the original experiment. Apart from that, given that we are not comparing the proportion of participants who successfully complete the task versus those who don’t, the self-selection of more conscientious participants should not be a major distorting effect as long as the conscientiousness of participants is comparable across the experimental and control conditions.

Procedure

Participants will be randomly assigned to either the control or the experimental condition. They will be asked to follow the instructions as outlined in Appendices A and B.

Controls

While Wilcox (2018) used a very simple attention check, this study does not use any attention checks. Instead, a passive measure of compliance will be used. As a very conservative measure, we expect that participants will require at least one minute to read the instructions and respond to them. Consequently, we will exclude participants who complete the questionnaire in less than that time.

Analysis Plan

Respondents who indicate prior familiarity with the problem or its answer, who indicate that they did not arrive at the answer on their own, or who complete the questionnaire in less than one minute will be excluded from the analysis.

Then, we will proceed as in the original study:

To classify our participants’ justifications, we first divided participants into “stayers” and “switchers.” We then further classified the switchers into the following three groups according to their justifications for switching: (a) participants who gave correct justifications and exhibited full insight into the mathematical structure; (b) participants who had the right intuition but could not provide a mathematically correct proof; and (c) participants who switched randomly, meaning that they regarded switching and staying as equally good alternatives. (Krauss and Wang 2003, 11)

The coding criteria for determining whether to assign participants to group (a) or group (b) are outlined below.

The original study reports various statistical measures and doesn’t indicate which of these is intended to be the one key test. The main statistical test used by both Wilcox (2018) and this rescue study is the pairwise Fisher exact probability test (the second item in the list below):

  • A Pearson’s chi-square test was used to analyze the rate of switch choices (i.e., correct answers).
  • A pairwise Fisher exact probability test was used to analyze the rate of correct justifications for switch choices (i.e., correct answers with correct justifications). This was done due to the small number (3) of correct responses in the control condition.
  • Cohen’s effect size h was used to quantify the difference of proportions.
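
As a check on the reported effect size, Cohen’s h for two proportions \(p_1\) and \(p_2\) is defined as

\[ h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2} \]

With the original proportions, \(h = 2\arcsin\sqrt{.38} - 2\arcsin\sqrt{.03} \approx 1.329 - .348 \approx .98\), which matches the value reported by Krauss and Wang (2003).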

For Krauss and Wang (2003), more important than simply getting participants to switch rather than stay is getting them to provide correct justifications for their decision to switch:

Data on the rate of correct justifications for switch choices were analyzed with pairwise Fisher exact probability tests (due to the small number of correct responses in the control and one-door conditions), accompanied by Cohen’s effect size h for the difference of proportions […]. (Krauss and Wang 2003, 12)

Coding criteria

A participant gives a correct justification for their answer to the Monty Hall problem when they both report the correct probability of winning by switching doors (2/3) and this probability is, in the words of the original authors, ‘comprehensibly derived’ (Krauss and Wang 2003, 11):

The criteria for correct justification were strict: We counted a response as a correct justification only if the 2/3 probability of winning was both reported and comprehensibly derived. This could, for instance, be fulfilled by applying one of the algorithms reported above (Equation 1, Table 1, or Figure 1) [reproduced here below], or by any other procedure indicating that the participant had fully understood the underlying mathematical structure of the problem. The right intuition category contained all the switchers who believed that switching was superior to staying but failed to provide proof for this. Finally, a participant was assigned to the “random switch” category if she thought it made no difference if she switched or stayed. (Krauss and Wang 2003, 11)

Equation 1 in the original study

Table 1 in the original study

Figure 1 in the original study
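
For reference (Equation 1, Table 1, and Figure 1 are reproduced as images above), the fact they encode can be stated directly: the contestant’s initial pick is correct with probability 1/3, staying wins exactly when the initial pick was correct, and switching wins exactly when it was not. Hence

\[ P(\text{win} \mid \text{stay}) = \tfrac{1}{3}, \qquad P(\text{win} \mid \text{switch}) = 1 - \tfrac{1}{3} = \tfrac{2}{3}. \]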

As in Wilcox (2018)’s replication study, the justifications that participants provide will be extracted to a separate CSV file (with links to photographs of the handwritten responses, if applicable), where each justification will be coded as correct or incorrect with the coder blind to whether it came from the experimental or the control condition. The coding will then be transferred back to the main analysis dataframe.
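
The round trip could look as follows. This is a minimal sketch, assuming a hypothetical ‘justification’ column (the real column names come from the Qualtrics export); the condition column is dropped before export so that coding is blind, and the codes are merged back by ResponseId afterwards.

### sketch: blind-coding round trip (column names are illustrative)
library("dplyr")

responses <- tibble::tibble(
  ResponseId = c("R_1", "R_2"),
  control = c(TRUE, FALSE),
  justification = c("It makes no difference.",
                    "Switching wins in 2 of 3 cases."))

# export without the condition column so the coder is blind to condition
to_code <- responses |>
  select(ResponseId, justification) |>
  mutate(correct_justification = "")
write.csv(to_code, "../data/experiment_to_be_coded.csv", row.names = FALSE)

# ... manual coding happens here ...

# merge the manually assigned codes back in by ResponseId
coded <- read.csv("../data/experiment_coded.csv") |>
  select(ResponseId, correct_justification)
responses_coded <- responses |>
  left_join(coded, by = "ResponseId")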

Following the original study’s protocol, the difference between the proportions of correct justifications in the two conditions will be measured using Cohen’s h, and statistical significance will be assessed using Fisher’s exact test. Cohen’s h is thus the key statistic of interest for this replication.
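
For reference, the original effect size can be recomputed from the published proportions with ES.h from the pwr package (the same function used in the confirmatory analysis below):

### sketch: recover the original study's effect size from its proportions
library("pwr")

# 38% correct justifications in the experimental condition vs. 3% in the
# control condition; this yields approximately h = .98
ES.h(0.38, 0.03)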

Differences from Original Study and 1st replication

Score of replication closeness: Very close. As outlined below, some procedural details (experimental setting, task instructions, presentation) were different in the rescue study compared to the original study and the first replication. All other design facets remained the same. Consequently, this rescue study should be characterized as a ‘very close replication’ using the framework in LeBel et al. (2018).

  • Unlike the original study, and like Wilcox (2018)’s replication attempt, the present study will present the instructions in the form of an online questionnaire rather than in person. This change in experimental setting may mean that respondents pay less attention to the reasoning task than they would in an in-person environment in which an experimenter is present. Ultimately, we don’t know how much of a difference this makes. However, by keeping track of response times and evaluating participants’ written justifications, we might at least be able to estimate whether a response was thorough or rushed. Another possible benefit of administering the survey online is that participants may take it in a less stress-inducing environment, which may improve their responses.
  • While Wilcox (2018) included an explicit attention check in the form of a simple question (‘In the earlier scenario, how many doors were there in total (including either opened or unopened doors)?’), this study will use a passive compliance check instead.
  • Unlike the original study, the present study will use a pre-test to screen out participants who are already familiar with the Monty Hall problem. This is done to stay within the budget. One downside is that the pre-test may dissuade participants from working on the task (as Wilcox (2018) himself acknowledged in the Methods Addendum in his report).
  • An extra question (‘Did you arrive at the answer on your own?’) was added to the post-test in an attempt to screen out participants who used AI or some other resource to arrive at the answer.
  • The questionnaire that Wilcox (2018) used divided the task over two separate pages. In particular, the text box recording participants’ justifications for their answer was presented on a separate page from the scenario. By contrast, the present study will present all questions on one page (except for the post-test questions that inquire into familiarity with the Monty Hall problem), so that participants don’t have to keep the scenario/illustration in memory or switch back and forth. This experience is closer to the original study, where participants were given a one-page worksheet.
  • In the original study, participants were allowed to visually represent their justifications for their responses in various formats, such as drawings of doors. For technical reasons, Wilcox (2018)’s study didn’t enable participants to draw such pictures. However, the ability to make sketches may be an important part of participants’ reasoning. For example, considering the various possibilities (as in Table 1 or Figure 1 above) may be an important and fruitful step in solving the problem (and, moreover, the authors of the original study intended the experimental condition to encourage such solution attempts). However, it is very demanding, if not impossible, to keep all possibilities in short-term memory. Similarly, some participants may choose to write formulas in solving the problem, which is difficult to do in a text box. In other words, it is possible that the Monty Hall problem can’t easily be solved without the use of external tools such as pen and paper. By limiting participants’ ability to take notes, Wilcox (2018)’s replication study may inadvertently have interfered with participants’ reasoning in a non-trivial way that affects the results. Therefore, participants will be allowed to create drawings on paper and to upload a photograph of their work.
  • A different coder (myself) will code the justifications as correct or incorrect according to the criteria specified in the original paper. The results could be influenced by imperfect inter-coder reliability, i.e., the original coders and Wilcox (2018) may have applied slightly different coding criteria than I do.

Methods Addendum (Post Data Collection)

Actual Sample

300 potential participants were pre-screened on Prolific, 100 of whom participated in the experiment. The average age of those 100 participants was 30.4 years. 45 participants identified as female, 54 as male, and 1 preferred not to disclose their gender identity. The participants resided in 78 different nations. Out of those 100 participants, 75 remained after applying the exclusion criteria spelled out above.

Differences from pre-data collection methods plan

None.

Results

Data preparation

### load packages
# plyr is loaded before the tidyverse so that its functions don't mask the
# dplyr verbs used below (dplyr::rename is also namespaced explicitly)
library("plyr")
library("tidyverse")
library("qualtRics")
library("dplyr")

# NOTE: As answers have to be manually coded, the analysis pipeline is split
# into two parts. First, the anonymized data are loaded, tidied, and the
# exclusion criteria are applied. Then, the data frame is exported as a CSV file
# for manual coding. Once the coding is complete, the analysis continues below.

### load anonymized data
# check.names = FALSE preserves Qualtrics column names such as
# "Duration (in seconds)", which are referenced with backticks below
data <- read.csv("../data/experiment.csv", check.names = FALSE)

### tidy data
# Q10 and Q20 hold the stay/switch choice in the two conditions; the columns
# renamed below come from the comprehension and post-test questions
tidy_data <- data |>
  unite(switch, c(Q10, Q20), na.rm = TRUE) |>
  # switch is TRUE iff the participant chose to switch (the correct answer)
  mutate(switch = case_when(switch == 'switch to door 2' |
                            switch == 'switch to the last remaining door' ~ TRUE,
                            TRUE ~ FALSE)) |>
  dplyr::rename(wins_cases_if_stays = Q7,
                wins_cases_if_switches = Q9,
                already_familiar = Q14,
                already_knew_answer = Q15,
                own_answer = Q43) |>
  mutate(already_familiar =
           case_when(already_familiar ==
                       'I was familiar with this game' ~ TRUE,
                     TRUE ~ FALSE)) |>
  mutate(already_knew_answer =
           case_when(already_knew_answer ==
                       'I already knew what the correct answer should be' ~ TRUE,
                     TRUE ~ FALSE)) |>
  mutate(own_answer =
           case_when(own_answer ==
                       'I arrived at the answer on my own' ~ TRUE,
                     TRUE ~ FALSE)) |>
  # wins_cases_if_stays (Q7) is only asked in the experimental condition, so
  # an NA there identifies a control-condition response
  mutate(control =
           case_when(is.na(wins_cases_if_stays) ~ TRUE,
                     TRUE ~ FALSE))

# keep only those rows corresponding to participants who finished the survey,
# took at least 60 seconds (the passive compliance check), weren't familiar
# with the scenario, and arrived at their answer on their own
relevant_tidy_data <- tidy_data |>
  filter(`Duration (in seconds)` >= 60) |>
  filter(Finished == TRUE) |>
  filter(already_familiar == FALSE) |>
  filter(already_knew_answer == FALSE) |>
  filter(own_answer == TRUE) |>
  select(ResponseId, control, switch)

# create a new column to record which participants provided correct justifications
relevant_tidy_data <- relevant_tidy_data |>
  mutate(correct_justification = "")

# sort by the switch column to simplify coding
relevant_tidy_data <- relevant_tidy_data |>
  arrange(switch)

# export relevant_tidy_data to csv for manual coding
# write.csv(relevant_tidy_data, "../data/experiment_to_be_coded.csv")

Results of control measures

75 out of 100 participants were retained for further analysis. The remaining participants were excluded according to the exclusion criteria spelled out above.

Confirmatory analysis

# after coding answers, load file

relevant_tidy_data <- read.csv("../data/experiment_coded.csv")

# generate a contingency table of how many participants in the control and
# experimental conditions provided a correct justification (named
# justification_table to avoid shadowing base::table)

justification_table <- with(
  relevant_tidy_data,
  table(correct_justification, control))

# Column 2 (control == TRUE) holds the control condition and column 1
# (control == FALSE) the experimental condition; row 2 counts the correct
# justifications (the second level of the coded column).

proportions <- data.frame(
  Condition = c("Control", "Experimental"),
  Proportion = c(justification_table[2, 2] /
                   (justification_table[2, 2] + justification_table[1, 2]),
                 justification_table[2, 1] /
                   (justification_table[2, 1] + justification_table[1, 1])))

# compute Cohen's h for the difference of proportions (ES.h is an effect-size
# calculation, not a test)

library("pwr")

# The first argument is the proportion of correct justifications in the
# experimental condition; the second is that in the control condition.

h <- ES.h(
  proportions$Proportion[2],
  proportions$Proportion[1]
)

# perform Fisher's exact test
# NOTE: the original study reports a one-tailed p-value; reproducing it with
# fisher.test requires setting `alternative` ("greater" or "less", depending
# on the row/column order of the contingency table)

fisher_exact_test <- fisher.test(as.matrix(justification_table))

# generate contingency table that displays how many participants in the control
# and experimental condition indicated that the contestant should switch
# (i.e., gave a correct answer)

table_switch <- with(
  relevant_tidy_data,
  table(switch, control))

# Pearson's chi-squared test for the rate of switch choices

chisq_test <- chisq.test(table_switch)

# Plots of results

library(ggplot2)

# Plot from rescue replication study

plot <- ggplot(data=proportions, 
               aes(x=Condition, y=(Proportion)*100)) +
  geom_bar(stat="identity") +
  scale_x_discrete(breaks=c("Control", "Experimental"),
                   labels=c("Control", "Experimental")) + ylim(0, 100) +
  labs(title = "Second replication",
       x = "Condition",
       y = "Percentage correct (%)")

# Plot from Wilcox study

proportions_Wilcox_study = data.frame(Condition=c("Control","Experimental"),
                       Proportion=c(0, 0))

plot_Wilcox_study <- ggplot(data=proportions_Wilcox_study,
                              aes(x=Condition, y=(Proportion)*100)) +
  geom_bar(stat="identity") +
  scale_x_discrete(breaks=c("Control","Experimental"),
                   labels=c("Control", "Experimental")) + ylim(0, 100) +
  labs(title = "First replication",
       x = "Condition",
       y = "Percentage correct (%)")

# Plot from the original study

proportions_original_study = data.frame(Condition=c("Control","Experimental"),
                       Proportion=c(.03, .38))

plot_original_study <- ggplot(data=proportions_original_study,
                              aes(x=Condition, y=(Proportion)*100)) +
  geom_bar(stat="identity") +
  scale_x_discrete(breaks=c("Control","Experimental"),
                   labels=c("Control", "Experimental")) + ylim(0, 100) +
  labs(title = "Original study",
       x = "Condition",
       y = "Percentage correct (%)")

library(gridExtra)

grid.arrange(plot_original_study, plot_Wilcox_study, plot, ncol = 3)

Discussion

For the rate of correct answers (i.e., the participant chose to switch), Krauss and Wang (2003) reported a chi-squared value of 14.529, p < .01. For the rate of correct justifications (i.e., the participant chose to switch and provided a correct justification for this decision based on the criteria above), they reported a p-value of .0001 (Fisher’s exact test, one-tailed) and a Cohen’s h of .98, indicating a large effect. In his replication study, Wilcox (2018) failed to replicate these findings, as none of the participants in that study provided correct justifications. In the present replication study, for the rate of correct answers, we report a chi-squared value of 4.756, p = .0292. For the rate of correct justifications, we report a p-value of .04549 (Fisher’s exact test, one-tailed) and a Cohen’s h of .536, indicating a moderate effect. Thus, while we detected a smaller effect than Krauss and Wang (2003), we conclude that the finding in Study 1 replicated.

Numeric score of replication success = 1. On a simple scale [0, .25, .5, .75, 1], with 0 = didn’t replicate, 1 = replicated, and intermediate values for mixed results, we select 1, as the key finding replicated.

Commentary

One of the main challenges faced by Wilcox (2018)’s initial replication study was the low number of participants as well as the poor quality of responses. In the present replication study, we recruited more participants and the quality of responses was higher. This was possibly the result of recruiting Prolific workers as opposed to Mechanical Turk workers, as well as of clarifying the instructions and requiring participants to upload handwritten answers.

Other replication attempts

I am not aware of any other replication attempts of Study 1 in Krauss and Wang (2003) apart from Wilcox (2018) and the present rescue study.

Appendix A: Control Group Instructions

Original Instructions

Control Group Instructions

Modified Instructions

The following instructions were used in the replication attempt:

Before you continue screen

Control group instructions

Upload explanation and post-test

Appendix B: Experimental Group Instructions

Original Instructions

Experimental Group Instructions

Modified Instructions

The following instructions were used in the replication attempt:

Before you continue screen

Experimental group instructions 1

Experimental group instructions 2

Upload explanation and post-test

References

Krauss, Stefan, and X. T. Wang. 2003. “The Psychology of the Monty Hall Problem: Discovering Psychological Mechanisms for Solving a Tenacious Brain Teaser.” Journal of Experimental Psychology: General 132 (1): 3–22. https://doi.org/10.1037/0096-3445.132.1.3.
LeBel, Etienne P., Randy J. McCarthy, Brian D. Earp, Malte Elson, and Wolf Vanpaemel. 2018. “A Unified Framework to Quantify the Credibility of Scientific Findings.” Advances in Methods and Practices in Psychological Science 1 (3): 389–402. https://doi.org/10.1177/2515245918787489.
Wilcox, John. 2018. “Replication of The Psychology of the Monty Hall Problem by Krauss and Wang (2003, Journal of Experimental Psychology).” https://rpubs.com/JohnEpisteme/433652.

Footnotes

  1. For simplicity, a second experimental condition involving the ‘One Door’ version of the problem will be left out in this replication study. See Appendix C of Krauss and Wang (2003) for details.↩︎