Link to the GitHub repository: https://github.com/psych251/krauss2003
Link to the original paper: https://github.com/psych251/krauss2003/blob/master/original_paper/krauss2003.pdf
Link to the Qualtrics survey: https://stanforduniversity.qualtrics.com/jfe/form/SV_beA2tfPyXz83Otv
Link to preregistration: https://osf.io/n3abr/
My PhD research focuses on the question of how humans should assign probabilities in various contexts, with the hope of informing practical interventions to improve human reasoning. Studies show that the so-called “Monty Hall problem” is one context where humans frequently assign incorrect probabilities. This is an important context because it tests one’s capacity to reason properly about likelihoods, a capacity central to reasoning in scientific, medical and everyday settings. Consequently, this project aims to partially replicate a promising procedure (Experiment 1 of Krauss & Wang, 2003) for inculcating correct reasoning about the Monty Hall problem.
The procedure administers questionnaires to participants in an experimental and a control condition. Participants in the control condition complete a questionnaire which asks for participants’ responses to a standard version of the Monty Hall problem. Participants in the experimental condition complete a questionnaire about the “Guided intuition version” of the Monty Hall problem. Here, participants are guided to the correct solution to the Monty Hall problem through questions which encourage them: 1) to consider the problem from the fully informed perspective of Monty Hall, 2) to consider all possible arrangements of where the prize might be, 3) to count the frequencies with which a response strategy yields the optimal outcome and 4) to ignore the specifics of which particular door is opened by specifying merely that some door was opened and revealed a goat.
The original study reported that 38% of the experimental group provided correct justifications for their solutions to the Monty Hall problem. In contrast, only 3% of participants in the control group provided correct justifications.
The aim of this study is to replicate their procedure to see whether the control and experimental groups show a similar difference.
Participants will be recruited via Amazon Mechanical Turk (MTurk).
For two groups with proportions of 3% and 38%:
38 respondents are needed to achieve 80% power (with respondents evenly allocated between groups)
46 respondents are needed to achieve 90% power (with respondents evenly allocated between groups)
58 respondents are needed to achieve 95% power (with respondents evenly allocated between groups)
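As a sanity check on these figures, power for a two-sided Fisher’s exact test at α = .05 can be approximated by simulation. The sketch below is my own, not the calculation behind the estimates above, and `simulate_power` is a hypothetical helper named for illustration:

```r
# Minimal power simulation for a two-sided Fisher's exact test, assuming
# true proportions of 3% (control) and 38% (experimental) and equal
# allocation between groups; simulate_power() is a hypothetical helper.
simulate_power <- function(n_per_group, p_control = 0.03, p_experimental = 0.38,
                           n_sims = 5000, alpha = 0.05) {
  rejections <- replicate(n_sims, {
    correct <- c(rbinom(1, n_per_group, p_control),
                 rbinom(1, n_per_group, p_experimental))
    # 2 x 2 table: rows = correct / incorrect, columns = control / experimental
    fisher.test(rbind(correct, n_per_group - correct))$p.value < alpha
  })
  mean(rejections)
}
# e.g. simulate_power(19) should land near .80, consistent with the
# 38-respondent figure above (19 per group)
```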
The original study featured three groups of participants recruited from various German universities: 67 participants in the control group and 34 in the experimental condition (a further 34 participants formed an alternative experimental group that is not the focus of this replication).
To keep the replication close to the original study, participants will be selected so as to include only people who have graduated from high school.
Some participants may have prior familiarity with the Monty Hall problem. Such participants will be screened out using pre-test and post-test approaches.
The pre-test approach asks participants to indicate whether they are familiar with the Monty Hall problem before receiving the (modified) instructions in appendix A or B. The limitation of this approach is that it introduces a potential source of error: participants may claim to be unfamiliar with the problem merely to receive the monetary reward for completing the MTurk task, and their prior familiarity may then bias the results. To mitigate this, participants will be asked a list of questions so that they cannot easily tell which question determines their eligibility for the task.
The post-test approach asks participants to indicate whether they are familiar with the Monty Hall problem through the instructions in appendix A or B. In particular, the last item in the instructions asks participants to “Please also tell us if you were already familiar with this game… and knew what the correct answer should be”.
The post-test approach avoids the earlier source of bias.
Materials used for this study include two versions of the instructions: the control group instructions and the so-called “Guided Intuition” instructions. These are reproduced in full in appendices A and B respectively. The guided intuition version incorporates the four features mentioned in the introduction, which Krauss and Wang believe facilitate good reasoning.
Two minor modifications will be made to the materials:
First, the request for justification will omit the statement that “You may use sketches, etc., to explain your answer”, as the online testing environment does not support sketching.
Second, an attention check will be included, asking participants how many doors are in the problem.
Participants will be randomly assigned to the control or the experimental condition, where they will complete the tasks outlined in the respective appendices via MTurk.
Participants will be paid $0.97 for completion of the task, and they will have up to 3 hours to complete the task (although I expect responses to take approximately 8 minutes on average).
Respondents who indicate prior familiarity or who fail the attention check will be excluded from the analysis.
The original study reports various statistical measures, but the one of primary importance concerns the difference in correct justifications between the groups. To clarify, a participant gives a correct justification for their answer to the Monty Hall problem when they both provide the correct probability that switching wins the prize (2/3) and this probability assignment is, in the words of the original authors, “comprehensibly derived” (p. 11).
Such assignments could be comprehensibly derived in two ways (a worked sketch of the first follows this list):
1. Specifying the probabilities via Bayes’s theorem (p. 7)
2. Counting the frequency with which the prize is won among the various possible arrangements of the Monty Hall problem (two examples of which are provided on page 5)
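For illustration, here is a sketch of the Bayesian route (my own rendering of the standard derivation, not the paper’s exact presentation). Suppose the contestant picks door 1 and Monty opens door 3 to reveal a goat. Write $H_i$ for “the prize is behind door $i$” and $D_3$ for “Monty opens door 3”; Monty opens door 3 with probability 1/2 if the prize is behind door 1, probability 1 if it is behind door 2, and probability 0 if it is behind door 3. Then:

$$P(H_2 \mid D_3) = \frac{P(D_3 \mid H_2)\,P(H_2)}{\sum_{i=1}^{3} P(D_3 \mid H_i)\,P(H_i)} = \frac{1 \cdot \frac{1}{3}}{\frac{1}{2} \cdot \frac{1}{3} + 1 \cdot \frac{1}{3} + 0 \cdot \frac{1}{3}} = \frac{2}{3},$$

so switching wins with probability 2/3.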
The justifications will be extracted to a separate Excel file and coded as correct or incorrect while blind to whether each justification came from the experimental or the control condition. The coding will then be transferred back into the main analysis dataframe, yielding the proportions of correct justifications for each condition.
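A minimal sketch of this blinding workflow (using CSV in place of Excel for simplicity; `survey_data`, `id` and `justification` are hypothetical stand-ins for the actual dataframe and column names):

```r
# Export justifications in shuffled order and without the condition column,
# so that coding proceeds blind; names here are hypothetical stand-ins.
to_code <- survey_data[sample(nrow(survey_data)), c("id", "justification")]
write.csv(to_code, "justifications_to_code.csv", row.names = FALSE)

# After offline coding adds a "correctjustification" column, merge it back
# into the main dataframe by participant id.
coded <- read.csv("justifications_coded.csv")
survey_data <- merge(survey_data,
                     coded[c("id", "correctjustification")],
                     by = "id")
```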
Following the original study’s protocol, the difference between such proportions will be measured using Cohen’s h and statistical significance will be calculated using Fisher’s exact test. Cohen’s h is then the key statistic of interest for this replication.
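For reference, Cohen’s h for two proportions $p_1$ and $p_2$ is the difference of their arcsine transforms:

$$h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2},$$

which for the original study’s proportions of .38 and .03 gives $|h| \approx 0.98$, matching the value reported in the original paper.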
There are several salient differences between this replication study and the original experiment:
Pictorial justifications: In the original study, participants were allowed to visually represent their justifications for their responses in various formats, such as drawings of doors. For technical reasons, this study will not enable participants to draw such pictures, but it is expected that this should have a negligible impact on the proportion of correct justifications given or how they are coded.
Experimental setting: The original study brought students into the laboratory to complete their tasks, unlike the current study, in which participants respond through the MTurk worker website. Consequently, the original respondents may have given greater attention to the reasoning task, since they were in an environment created for that sole purpose that also included the physical presence of an experimenter. In contrast, since MTurk workers are left to their own devices, it is possible that they may rush through the task, thereby reducing the proportion of correct justifications or the clarity with which justifications are articulated. There is no definitive evidence about how great a risk this poses, but I suspect it is negligible, at least for the purpose of detecting an effect as large as the original.
Coding of correct justifications: Unlike in the original study, a different coder (the present author) will code the justifications as correct or incorrect according to the criteria specified in the original paper. The results could be influenced by imperfect inter-rater reliability. However, the criteria for what qualifies as a correct justification seem fairly straightforward, comprehensible and uncontroversial, so I suspect that the coders in the original and replication studies would have high or perfect agreement.
The actual methodology of this study departed from the initial design plan in several ways concerning data collection.
Only three individuals responded to the online survey during the first three days under the initial plan. This was too slow, likely because a custom qualification test ruled out participants with prior familiarity with the problem (as per the pre-screening approach above). This is unlikely to be an error in the creation or display of the qualification test, since both I and another person (Erin) ran the test without issues, and it worked for other workers too. Instead, the qualification test may have dissuaded participants from taking on the task, perhaps because participants saw that: i) they needed to put in the effort to complete three qualifying questions first, ii) it was not certain that they would qualify for the task after making that effort, and iii) they would gain only $0.97 even if they did qualify and complete the task.
For these reasons, participants may have concluded that the task and the qualifying test were not worth their time and focused elsewhere. That is one possible explanation for why the qualification test did not work well, but there may be others.
Regardless, in an attempt to increase the number of respondents, the task qualifications were removed so that the task was visible to all workers, not just US high school graduates who indicated no prior familiarity with the problem.
Instead, participants indicated their familiarity with the Monty Hall problem on the survey form itself, and those who reported prior familiarity were excluded from the analysis.
Due to resource limitations, the study then sampled only 42 participants, and only 19 of these both passed the attention check and indicated that they were not already familiar with the problem.
The resulting data was then unevenly distributed between the control and experimental conditions: the control condition had 8 valid responses and the experimental condition had 11.
Consequently, the actual data collection featured a small sample size and an uneven distribution of responses between conditions.
### Data Preparation
# Loading the anonymized data (with coding about correct justifications added in the original datafile)
dataedit <- read.csv("C:/Users/john-/krauss2003/data/anonymizedfinaldata.csv", comment.char="#")
#Installing and loading relevant packages
# install.packages("tidyverse")
library("tidyverse")
## -- Attaching packages -----------------
## v ggplot2 3.1.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.7
## v tidyr 0.8.2 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts --------------------------
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Renaming columns in datafile to shorter variable titles
data_renamed <- dataedit %>% dplyr::rename(
  experimental_switch = What.should.the.contestant.therefore.do.,
  control_switch = After.Monty.Hall.has.opened.a..goat.door...what.should.you.do.,
  familiarity = Please.also.tell.us.if.you.were.already.familiar.with.this.game.,
  attentioncheck = In.the.earlier.scenario..how.many.doors.were.there.in.total..including.either.opened.or.unopened.doors..)
#Removing missing values and excluding people who failed the attention check or were familiar with the game
data_gathered <- data_renamed %>%
  gather(condition, switch, c(experimental_switch, control_switch)) %>%
  filter(switch != "") %>%
  filter(!is.na(switch)) %>%
  filter(attentioncheck == "3 doors in total") %>%
  filter(familiarity == "I was not familiar with this game")
#Recoding variables so the meaning of the values is clearer
data_recoded <- data_gathered %>%
  mutate(condition = recode(condition,
                            "experimental_switch" = "experimental",
                            "control_switch" = "control"))
### Analysis
#Calculating the proportions of correct justifications given in the control and experimental conditions
table <- with(
data_recoded,
table(correctjustification, condition))
#Calculating Cohen's h test
# install.packages("pwr")
library(pwr)
# h <- ES.h(
# (table["correct", "control"]/
# (table["correct", "control"] + table["incorrect", "control"])),
# (table["correct", "experimental"]/
# (table["correct", "experimental"] + table["incorrect", "experimental"]))
# )
## Note: the commented-out code fails here because no participant in either condition gave a correct justification, so the table has no "correct" row to index. Since both proportions are 0%, h is 0 by definition (there is no difference), but this value was not computed via the above code.
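# One possible fix (my suggestion, not part of the original analysis): declaring
# both coding levels up front guarantees a "correct" row in the table even when
# its count is zero, which makes the indexing above safe; ES.h(0, 0) then
# returns 0 directly.
# data_recoded$correctjustification <- factor(data_recoded$correctjustification,
#                                             levels = c("correct", "incorrect"))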
#Calculating the p value with Fisher's exact test
test <- fisher.test(as.matrix(table))
# Displaying results from the above calculations
table
##                     condition
## correctjustification control experimental
##                            0            0
##   incorrect                8           11
# h -- commented out since h cannot be calculated in this case
test
##
## Fisher's Exact Test for Count Data
##
## data: as.matrix(table)
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0 Inf
## sample estimates:
## odds ratio
## 0
# Plots of results
library(ggplot2)
#Plot of the percentage of correct justifications between the conditions -- from the replication study
ggplot(data = data_recoded,
       aes(x = condition, y = 100 * (correctjustification == "correct"))) +
  geom_bar(stat = "summary", fun.y = mean) +
  labs(title = "Replication study results", x = "Condition",
       y = "Percentage correct (%)") +
  ylim(0, 100) +
  scale_x_discrete(breaks = c("control", "experimental"),
                   labels = c("Control", "Experimental"))
#The corresponding plot from the original study
original <- data.frame(Condition = c("A", "B"),
                       Percentage = c(3, 38))
ggplot(data=original, aes(x=Condition, y=Percentage)) +
geom_bar(stat="identity") +
scale_x_discrete(breaks=c("A", "B"),
labels=c("Control", "Experimental")) + ylim(0, 100) +
labs(title = "Original study results", x = "Condition", y = "Percentage correct (%)")
Exploratory analyses were conducted to see whether the experimental intervention had a significant effect on the number of correct decisions to switch.
### Exploratory Analysis
#Plot of correct decisions to switch from the original study
original <- data.frame(Condition = c("A", "B"),
                       Percentage = c(21, 59))
ggplot(data=original, aes(x=Condition, y=Percentage)) +
geom_bar(stat="identity") +
scale_x_discrete(breaks=c("A", "B"),
labels=c("Control", "Experimental")) + ylim(0, 100) +
labs(title = "Original study results", x = "Condition", y = "Percentage switch (%)")
#Data analysis for switching in the replication study
table2 <- with(
data_recoded,
table(switch, condition))
h <- ES.h(
(table2["switch", "control"]/
(table2["switch", "control"] + table2["stay", "control"])),
(table2["switch", "experimental"]/
(table2["switch", "experimental"] + table2["stay", "experimental"]))
)
test2 <- fisher.test(as.matrix(table2))
# Table displaying counts of switching vs. staying among the conditions in the replication
table2
##         condition
## switch   control experimental
##   stay         5            4
##   switch       3            7
# Proportions displaying differences in switching between the groups
(table2["switch", "control"]/
(table2["switch", "control"] + table2["stay", "control"]))
## [1] 0.375
(table2["switch", "experimental"]/
(table2["switch", "experimental"] + table2["stay", "experimental"]))
## [1] 0.6363636
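# For reference, ES.h on these two proportions evaluates to roughly -0.53
# (|h| of about 0.53, a medium-sized difference by Cohen's benchmarks), though
# the Fisher test below shows this difference is not statistically significant.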
ggplot(data = data_recoded, aes(x = condition, y = 100 * (switch == "switch"))) +
  geom_bar(stat = "summary", fun.y = mean) +
  labs(title = "Replication study results", x = "Condition",
       y = "Percentage switch (%)") +
  ylim(0, 100) +
  scale_x_discrete(breaks = c("control", "experimental"),
                   labels = c("Control", "Experimental"))
# Test of statistical significance between the groups
test2
##
## Fisher's Exact Test for Count Data
##
## data: as.matrix(table2)
## p-value = 0.3698
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.3196802 28.8861681
## sample estimates:
## odds ratio
## 2.74899
# Data on whether participants who correctly counted the outcomes in the intervention were more likely to switch:
table4 <- with(
data_recoded,
table(data_recoded$....if.she.he.switches.to.the.last.remaining.door...In.____.out.of.3.cases..Fill.in.the.blank.below., data_recoded$switch))
table4
##
##             stay switch
##                5      3
##   1            3      3
##   1.5          0      0
##   2            1      4
##   GOAT DOOR    0      0
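(The unlabeled first row most likely corresponds to control-condition participants, who never received this fill-in-the-blank item: its eight responses match the eight valid control responses.)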
This study did not replicate the key result of Experiment 1 reported by Krauss and Wang (2003). In the original study, 38% of the participants in the experimental condition reportedly provided the correct justification for why one should switch: that is, they both specified the probability that switching would win the prize and they “comprehensibly derived” this probability from counting outcomes or applying Bayes’s theorem (Krauss & Wang, 2003, p. 11). In contrast, only 3% of the control condition provided correct justifications. The difference was statistically significant (p < .001) and Cohen’s h was .98.
However, in this study, none of the respondents in the experimental condition nor the control condition provided the correct justifications.
In this section, I will reflect on the results of the exploratory analysis, as well as what may be the implications of the failure to replicate the previous results. (Note that no objections or challenges were raised from the author of the original study, so none are discussed here.)
The exploratory analysis showed that more participants in the experimental condition indicated that they would make the right decision and switch doors, especially those participants who correctly specified the number of outcomes in which they would win by switching. This might appear somewhat promising for the experimental intervention. But despite appearances, none of the participants correctly specified or derived the probability that switching would win the prize. It is therefore not clear that the intervention beneficially affected the decision about whether to switch. Consequently, this study failed to furnish evidence for the efficacy of the intervention with respect to switching behavior.
However, the question remains as to what the failure to replicate previous results means. There are two sets of possible implications.
One set concerns limitations of the replication study. Perhaps, for example, the study failed to replicate the original result because of these aspects:
The replication sample was small and underpowered
The original instructions were in a different language (German) which could have guided participants in a way that differs from the English translation
Authors of the original and replication studies could have coded justifications as correct or incorrect in different ways, with some ways being more or less lenient
MTurk workers may have been less attentive or competent than participants in the original study, especially since workers may be aiming to quickly complete the task to receive a small pay-off, unlike students in a laboratory with an experimenter
Ideally, an examination of the content of the justifications in the data would shed light on the role of this last aspect. At first, it may seem that the MTurk workers were less competent or attentive, since their “justifications” are often terse and fail to support their decision to switch. For example, one person’s “justification” for switching was merely: “I was thinking of all of the different aspects of the game”. This is a far cry from specifying what the probability of winning by switching is or why that probability is what it is. However, it may not have been clear to participants that they needed to fully explain and justify their reasoning, especially since the task merely asks them to explain “what went on” in their head when making the decision, with no particular level of detail required. Consequently, it is unclear that the MTurk workers were actually aiming to articulate justifications, or that they were less competent or attentive than the original study’s participants.
The other set of possible implications concerns the original study. It is possible, albeit not necessarily probable, that the study reported an effect which does not exist in the population of interest. This could happen for various reasons. One is publication bias and the file-drawer problem, which inflate the rate of false positives in the published literature; this might be one such false positive. Additionally, the original study may have had methodological features that hinder replication (examples of which have been discussed informally in blog posts). For example, the original study utilized a between-subjects design. As a result, the outcomes of interest may to some extent be attributable to irrelevant variance or noise among participants across the conditions. In this sense, between-subjects designs are less likely to replicate than within-subjects designs.
Ultimately, however, we cannot confidently conclude whether this failure to replicate suggests more about the limitations of this replication study than it does about the veracity or generalizability of the originally reported results.
### Appendix A: Control group instructions
### Appendix B: Experimental group instructions