Replication of Experiment 1 in Stable Causal Relationships Are Better Causal Relationships by Vasilyeva, Blanchard, & Lombrozo (2018, Cognitive Science)

Introduction

Vasilyeva, Blanchard, & Lombrozo (2018, Cognitive Science) investigated how people evaluate causal relationships based on stability, that is, whether a causal relationship holds across possible moderating variables. Across three experiments, Vasilyeva et al. demonstrated that adults were more likely to endorse stable causal claims than unstable causal claims (i.e. claims that don’t hold across a potential moderator). In particular, in Experiment 1 (the target of this replication), adults endorsed stable causal claims over unstable causal claims when presented with covariation data from a hypothetical scientific study.

One key interest of mine is children’s developing causal reasoning: how do children reason about causal systems and explanations? A key part of this research program will also be characterizing the adult state and asking how adults reason about the same phenomena and questions. Replicating an experiment from this paper would help me better understand the design and analyses involved in asking adults for judgments about important aspects of a causal relationship. This work could extend into further replications (i.e. of the other experiments in this paper) and/or further questions (developmental or otherwise) that would be very relevant to my research interests.

I plan on replicating the result from Experiment 1: that adults were more likely to endorse a causal relationship when the relationship was non-moderated vs. moderated. I chose this result because the authors manipulated many possible factors that might play into adults’ intuitions about causal relationships (i.e. varying the DV questions), and it was the first result to lay out the phenomenon that later experiments build on. The analysis was a 2 x 2 x 2 mixed ANOVA, plus a separate 2 x 2 mixed ANOVA for a second DV (counterfactual ratings).

Procedure and Anticipated Challenges

The study will be run on Prolific with adults in the United States. The task is simple: participants will read blocks of text describing a scientist’s investigation of a novel alien population, then be given covariation data from the scientist’s study. There are no materials other than the text of the questions and scenario. I will obtain the exact stimuli (scenario and question wording) from the researchers, build a Qualtrics survey for the task, and analyze the data using R. For this, I will need funds to run participants on Prolific. One key concern about this experiment in particular is the exclusion rate. I will keep an eye on the exclusion rate and discuss it with the TAs and Mike to see whether it constitutes a bigger issue for carrying out the project, and I will also conduct exploratory analyses with additional check questions to see whether different exclusion criteria change the results in a meaningful way.

There appears to be an OSF link in the paper, but it is broken. The data file for this experiment on OSF appears to be empty, and there are no open materials from this experiment online. I will have to contact the researchers to request more information about administration of the task, clarification on the exact study design, the stimuli and question wording and coding (beyond what is written in the paper, which spells out most of the wording already), and clarification on their analyses.

Links

Project repository (on GitHub): https://github.com/psych251/vasilyeva2018.git. The data file from the replication is also in the repo: while the primary code here uses the qualtRics package to import the data directly from Qualtrics, the data in the repo is also usable.

Original paper (as hosted in your repo): https://github.com/psych251/vasilyeva2018/blob/main/original_paper/Vasilyeva_Blanchard_Lombrozo_2018.pdf or https://onlinelibrary.wiley.com/doi/10.1111/cogs.12605

Link to the Experiment: https://stanforduniversity.qualtrics.com/jfe/form/SV_bQ6I1kuS75Z5Jvo

Link to the Preregistration: https://osf.io/efdz9

Methods

Power Analysis

With a power of .80 and replicating the original effect size of a partial eta squared of .478, we require 12 participants total across conditions, and will run 24 to account for exclusions. The power analysis was conducted in G*Power with the information given in the original paper.
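G*Power expresses the effect size for F tests as Cohen's f, and the standard conversion from partial eta squared is f = sqrt(η²p / (1 − η²p)). The following is a minimal sketch of that conversion for the original effect size. It only illustrates the effect size input and does not reproduce the G*Power sample size calculation itself; the variable names are illustrative.

```r
# Convert the original partial eta squared to Cohen's f
eta_sq_partial <- 0.478   # effect size reported in the original paper
cohens_f <- sqrt(eta_sq_partial / (1 - eta_sq_partial))
round(cohens_f, 3)
```

An f of this size is very large by conventional benchmarks, which is why the required sample is so small.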

Planned Sample

Participants will be limited to Prolific users in the United States with an approval rate of 98% or higher who have not participated in any piloting of this experiment. Failure on any of the original comprehension questions in the survey will result in exclusion from analysis (i.e. not the two comprehension questions added at the end for exploratory purposes). There are no other preselection rules or restrictions on the sample. We will proceed with 24 participants as per the power analysis.

Materials, Design, and Procedure

From the article:

"Participants first completed a short training to ensure that they could interpret covariation tables and were then placed in the role of a scientist (zoologist, botanist, geologist, or ornithologist) studying several natural kinds on a fictional planet. Table 1 shows the four kinds — zelmos, drols, grimonds, and yuyus — each associated with a triad of variables (putative cause, effect, and moderator). We illustrate the procedure with zelmos, but the structure was matched across cases.

The scientist was described as investigating the hypothesis that eating yona plants is causally related to developing sore antennas. Participants were told that to test the hypothesis, the scientist performed an experiment, selecting a random sample of 200 zelmos and randomly assigning them to two equal groups that ate a diet either containing or not containing yonas. Participants saw the results of the experiment in the form of a 2 x 2 covariation table cross-classifying zelmos based on whether they ate yonas or not, and whether they developed sore antennas or not (see Fig. 1a). The numbers in the table were selected to provide support for a relationship with causal strength equal to a \(\Delta\)P of about .4 (range 0.39–0.42).

The scientist then decided to conduct a second experiment with a new, larger sample of 400 zelmos, again randomly assigning zelmos to one of the two diets. But this time the scientist discovered after the experiment that due to a miscommunication between research assistants, half of the zelmos were given salty water, and the other half were given fresh water. The two values of this potentially moderating variable were always said to occur normally on the planet; for example, in the wild, zelmos drink either fresh or salty water, depending on what’s available. (This moderating variable played the role of a “background circumstance” relative to which the cause-effect relationship (e.g., eating yonas –> sore antennas) was stable or unstable.) Luckily for the scientist, the moderator and cause variables varied orthogonally. Participants were told that “to see whether drinking salty water made a difference to the effects of yonas on sore antennas, you decide to look at the results of the experiment within each of these two groups.” This time participants were presented with the data split into two tables, one for the salty water subgroup, and one for the fresh water subgroup, each table cross-classifying zelmos in terms of diet and antenna soreness (see Fig. 1).

Depending on condition, the split tables indicated a relationship that was either moderated or not moderated. In the moderated cases (illustrated in Fig. 1c), in one subgroup (salty water) the relationship between eating yonas and sore antennas was very strong (\(\Delta\)P = .81–0.86), while in the other subgroup (fresh water), the relationship disappeared (\(\Delta\)P = .00–0.01). In the non-moderated cases (Fig. 1b), each of the split tables corresponded to relationships with a \(\Delta\)P comparable to the ~0.40 from the original, unsplit table. Importantly, the average strength of the relationship across the two split tables was the same in the moderated and non-moderated conditions (or differed by no more than 0.05 \(\Delta\)P units, always in the direction working against our hypothesis), and equaled the strength of relationship in the first table that participants saw for each item (within .03 \(\Delta\)P units). The split tables were accompanied by a note for moderated [non-moderated] conditions: “The tables reveal that the data pattern looks very different [similar] for zelmos who drank salty water during the experiment and for zelmos who drank fresh water during the experiment. Please compare the two tables to see how different [similar] the patterns are.”

Once all three covariation tables had been presented, participants evaluated either claims about causal relationships or explanations (Table 2). Each claim was presented either at the type or token level. All claims were unqualified; that is, they stated a relationship between eating yonas and sore antennas without mentioning the kind of water the zelmo(s) in question drank. In addition, participants evaluated one counterfactual statement for each scenario; for example, after learning about a group of zelmos who were fed yonas, drank salty water, and developed sore antennas, participants rated their agreement with the statement that “had these zelmos eaten yonas but not drunk salty water, their antennas would still have become sore.” This statement was included to verify that participants differed across the moderated and non-moderated conditions in the role they attributed to the moderator.

At the end of the experiment, participants answered two multiple-choice comprehension check questions about each scenario they had read (e.g., “According to what you read, as a scientist on planet Zorg you were interested in evaluating the following hypothesis about zelmos: a. eating yona plants produces antenna soreness; b. eating drol mushrooms produces antenna soreness; c. eating mushrooms with stem bumps produces spotted antennas; d. antenna soreness makes zelmos eat yonas”). Participants who answered either question incorrectly were excluded from further analyses.

Across items, each participant saw two moderated cases and two non-moderated cases, presented in random order. Thus, Experiment 1 had a 2 moderator (moderated vs. nonmoderated relationship) x 2 judgment (causal vs. explanatory) x 2 target (type vs. token) mixed design, with moderator manipulated within-subjects. The dependent variables were agreement with causal or explanatory claims, and agreement with counterfactual claims, measured on a 1 (strongly disagree) to 7 (strongly agree) scale."

Table 1 (above): Materials used in Experiments 1 (all four items) and 2 (zelmo and drol items only)

Fig. 1 (above): Sample covariation matrices from Experiment 1: (a) original unsplit table, common across the moderated and non-moderated conditions; (b) split tables in the non-moderated condition, \(\Delta\)P’s = .36 and .38 (M = .37); (c) split tables in the moderated condition, \(\Delta\)P’s = .83 and 0.01 (M = 0.42)

Table 2 (above): Sample causal and explanation judgments in Experiment 1, as a function of judgment type (causal vs. explanatory) and target (token vs. type)
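For reference, \(\Delta\)P is the probability of the effect when the cause is present minus the probability of the effect when the cause is absent. The following is a minimal sketch of that computation from the cells of a 2 x 2 covariation table. The counts are hypothetical, chosen only to give a \(\Delta\)P of about .4 with 100 zelmos per diet group; they are not the values from the original stimuli, and `delta_p` is an illustrative helper rather than part of the analysis code.

```r
# Delta-P from a 2 x 2 covariation table:
# P(effect | cause present) - P(effect | cause absent)
delta_p <- function(effect_with_cause, n_with_cause,
                    effect_without_cause, n_without_cause) {
  effect_with_cause / n_with_cause - effect_without_cause / n_without_cause
}

# Hypothetical counts (NOT the original stimuli): 100 zelmos per diet group
example_delta_p <- delta_p(effect_with_cause = 60, n_with_cause = 100,
                           effect_without_cause = 20, n_without_cause = 100)
round(example_delta_p, 2)
```

Applying the same helper to each split table separately gives the per-subgroup \(\Delta\)P values that distinguish the moderated from the non-moderated condition.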

Analysis Plan

The key analysis of interest is the 2 (moderated vs. non-moderated) x 2 (causal vs. explanatory) x 2 (type vs. token) mixed ANOVA reported in section 2.2.1 of the paper, Experiment 1. It is a mixed ANOVA on the causal and explanatory ratings, in which the authors found a main effect of moderator on participants’ judgments, with no other main effects or interactions. The original reported statistic (and the one I will be replicating) was F(1, 178) = 163.22, p < .001, \(\eta^{2}_{p}\) = .478 for this 2x2x2 ANOVA. A successful replication will reproduce the main effect of moderator (ideally with an \(\eta^{2}_{p}\) of at least .4, similar to their effect size of .478, but this will be examined upon analysis of the results), with no other main effects or interactions. This is the key analysis of interest as it is the very first piece of evidence for the phenomenon that adults consider causal stability when judging the strength of causal claims.

I will first import the data and relevant libraries. Then I will clean the data, omitting columns irrelevant to the analysis and excluding participants based on the exclusion criteria (in this case, participants who answered at least one check question incorrectly). Then I will prepare the data for analysis (e.g. changing from wide to long format, and other necessary steps that will be determined upon retrieval of the data), perform the ANOVA using the “aov” function in R, calculate effect sizes, and plot the data. I will then determine whether the result constitutes a successful replication.

Differences from Original Study

I hope to perform as faithful a replication as possible, which includes using the same design, stimuli, wording, and analysis pipeline. One critical change to the experiment is that I will add two check questions at the very end of the experiment (after all of the content of the original experiment) to test participants’ understanding of the covariation tables for exploratory purposes. I will examine these in my exploratory analyses. Additionally, following best survey practices, I have added labels to all scales. Finally, I have changed the instructions of the experiment, the ending, the method of feedback, and the gender question, which constitute very minor changes from the original experiment.

Text added after pre-registration: Below are images of the covariation table comprehension questions. In hindsight, I would have changed these questions to use made-up causal variables (similar to the ones the original authors used), so that participants do not enter the question with priors about the nature of the causal relationship.

Fig. 2 (above): Covariation Check Question 1

Fig. 3 (above): Covariation Check Question 2

If the researchers are unable to provide any of the materials, I may have to deviate from the exact stimuli/wording used in the original experiment. I may also have to compute the statistical tests in a different computing environment to what the researchers used in their study. Additionally, the population tested in the original experiment were participants on Amazon Mechanical Turk, whereas I will be testing participants on Prolific with high approval rates. I do not expect these differences to influence my ability to replicate the original results.

I also plan on producing additional plots (e.g. pirate plots) to investigate the full distribution of the data across the various independent variables (i.e. not only the presence of moderator, but also the cause/explanatory variable, and type/token).

Methods Addendum (Post Data Collection)

The final sample was n = 24, the same as the planned sample, but 11 participants were excluded from analysis for failing at least one check question. This leaves 13 for the confirmatory analysis. The exclusions followed the pre-specified exclusion criteria (missing any of the 8 original comprehension questions), and this high exclusion rate is further explored in the exploratory analysis.

Next, let’s look at demographics. Gender was collected as a free response, which is not ideal for this purpose, so if you re-ran this experiment with this code, it might not capture all of the demographic information cleanly:

library(qualtRics) #to load in the survey

data <- fetch_survey(surveyID = "SV_bQ6I1kuS75Z5Jvo", force_request = TRUE, include_display_order = FALSE, label = FALSE, convert = FALSE)

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   StartDate = col_datetime(format = ""),
##   EndDate = col_datetime(format = ""),
##   IPAddress = col_logical(),
##   RecordedDate = col_datetime(format = ""),
##   ResponseId = col_character(),
##   RecipientLastName = col_logical(),
##   RecipientFirstName = col_logical(),
##   RecipientEmail = col_logical(),
##   ExternalReference = col_logical(),
##   LocationLatitude = col_logical(),
##   LocationLongitude = col_logical(),
##   DistributionChannel = col_character(),
##   UserLanguage = col_character(),
##   CommOwn = col_character(),
##   Comments = col_character(),
##   Gender = col_character(),
##   `prolific-id` = col_character(),
##   PROLIFIC_PID = col_character(),
##   task = col_character(),
##   moderator = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

mage <- mean(as.numeric(data$Age), na.rm = TRUE) #calculate mean age

mage

## [1] 32.375

sdage <- sd(as.numeric(data$Age), na.rm = TRUE) #calculate sd age

sdage

## [1] 12.28931

#count gender from free response (exact matches against the spellings participants entered)

sum(data$Gender %in% c("male", "Male", "man", "Man", "M"))

## [1] 9

sum(data$Gender %in% c("female", "Female", "Woman", "woman", "F", "cis woman"))

## [1] 12

sum(data$Gender %in% c("Nonbinary", "gender noncomforming", "Gender Queer"))

## [1] 3

The mean age of the full sample of 24 was 32.38 (SD = 12.29), with 9 males, 12 females, and 3 gender nonconforming/nonbinary/queer participants.
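Because free-response gender entries vary in case and spacing, one way to make the counts above less brittle is to normalize the text before matching. The sketch below is only an illustration on made-up responses: the `responses` vector, the `normalize_gender` helper, and its category lists are assumptions, not part of the original analysis.

```r
# Hypothetical helper: collapse free-text gender responses into categories
normalize_gender <- function(x) {
  cleaned <- tolower(trimws(x))                  # ignore case and stray spaces
  out <- rep("other/unclassified", length(cleaned))
  out[cleaned %in% c("male", "man", "m")] <- "male"
  out[cleaned %in% c("female", "woman", "f", "cis woman")] <- "female"
  out[is.na(x)] <- NA                            # keep missing responses missing
  out
}

# Made-up responses for illustration only (not the study data)
responses <- c("Male", " female", "WOMAN", "m", "Nonbinary", NA)
normalized <- normalize_gender(responses)
table(normalized, useNA = "ifany")
```

Anything that lands in the catch-all category would still need to be read and coded by hand.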

Results

Data preparation

This data preparation process follows the analysis plan. We will first load the data from Qualtrics using the package “qualtRics”, then clean the data by removing irrelevant rows/columns and excluding participants who answered at least one check question incorrectly. Then we will transform the dataframe into a format suitable for statistical tests (e.g. changing from wide to long format):

Load Relevant Libraries and Functions

library(tidyverse) #for data cleaning and visualization

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(plotrix) #for std.error() (standard error of the mean)
library(yarrr) #for pirate plots

## Loading required package: jpeg

## Loading required package: BayesFactor

## Loading required package: coda

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## yarrr v0.1.6. Citation info at citation('yarrr'). Package guide at yarrr.guide()

## 
## Attaching package: 'yarrr'

## The following object is masked from 'package:ggplot2':
## 
##     diamonds

library(DescTools) #for effect size calculation

Import data

data <- fetch_survey(surveyID = "SV_bQ6I1kuS75Z5Jvo", force_request = TRUE, include_display_order = FALSE, label = FALSE, convert = FALSE) #import survey

Data exclusion / filtering

data <- data %>% 
  mutate(id = row_number()) %>% #create id row and bring it to the front
  select(id, everything()) %>% 
  filter(Consent == 2) %>% #consenting
  filter(Mch.z1 == 1,
         Mch.z2 == 1,
         Mch.d1 == 1,
         Mch.d2 == 1,
         Mch.g1 == 1,
         Mch.g2 == 1,
         Mch.y1 == 1,
         Mch.y2 == 1) %>% #check questions
  subset(select = -c(2:19)) %>% 
  select(-starts_with(c("Mch", #getting rid of irrelevant columns, including check questions
                        "Comm",
                        "Age", 
                        "Gender",
                        "prolific-id",
                        "PROLIFIC_PID",
                        "tableorder",
                        "cov"))) %>% 
  pivot_longer(!c(id, task, moderator), names_to = "question_type", values_to = "rating") %>% #turn table into longer
  mutate(presence_of_moderator = case_when(moderator == "drolyuyu" & grepl("[dy][34]", question_type) ~ "present", #create column indicating whether the item was moderated or not
         moderator == "zelmogrimond" & grepl("[zZg][34]", question_type) ~ "present",
         TRUE ~ "absent")) %>% 
  mutate("causalexplanatoryCF" = case_when(grepl("CaTy", question_type) ~ "causal", #create column for judgment type (causal, explanatory, or counterfactual)
                                     grepl("ExTy", question_type) ~ "explanatory",
                                     grepl("CaTo", question_type) ~ "causal",
                                     grepl("ExTo", question_type) ~ "explanatory",
                                     grepl("Ty.CF", question_type) ~ "counterfactual",
                                     grepl("To.CF", question_type) ~ "counterfactual")) %>% 
  mutate("typetoken" = case_when(grepl("CaTy", question_type) ~ "type", #create column for target (type vs. token)
                                     grepl("ExTy", question_type) ~ "type",
                                     grepl("CaTo", question_type) ~ "token",
                                     grepl("ExTo", question_type) ~ "token",
                                     grepl("Ty.CF", question_type) ~ "type",
                                     grepl("To.CF", question_type) ~ "token")) %>% 
  filter(!is.na(rating)) %>% 
  select(-c(moderator, task, question_type)) #get rid of last remaining unnecessary columns

#turn them all into factors - forgot to add this in the prereg
data$id <- factor(data$id)
data$presence_of_moderator <- factor(data$presence_of_moderator)
data$causalexplanatoryCF <- factor(data$causalexplanatoryCF)
data$typetoken <- factor(data$typetoken)

Confirmatory analysis

The first goal is to perform the main 2x2x2 mixed ANOVA, which is laid out in the analysis plan and outlined as the analysis of interest. We will simply perform the ANOVA, and plot it using ggplot (using the same method as the paper just for the sake of comparison, but further plots are shown below):

First, we will get rid of the counterfactual data:

dataconfirmatory <- data %>%
  filter(grepl("counterfactual", causalexplanatoryCF) == FALSE) #get rid of counterfactual data

Now we perform the ANOVA:

#perform ANOVA on causal/explanatory ratings

aov1 <- aov(rating ~ (presence_of_moderator*causalexplanatoryCF*typetoken) + Error(id/(presence_of_moderator)) + (causalexplanatoryCF*typetoken), data = dataconfirmatory)

summary(aov1)

## 
## Error: id
##                               Df Sum Sq Mean Sq F value Pr(>F)
## causalexplanatoryCF            1   0.15   0.154   0.042  0.842
## typetoken                      1   5.94   5.940   1.612  0.236
## causalexplanatoryCF:typetoken  1   5.47   5.470   1.484  0.254
## Residuals                      9  33.17   3.685               
## 
## Error: id:presence_of_moderator
##                                                     Df Sum Sq Mean Sq F value
## presence_of_moderator                                1  44.31   44.31  24.168
## presence_of_moderator:causalexplanatoryCF            1   0.45    0.45   0.245
## presence_of_moderator:typetoken                      1   0.86    0.86   0.471
## presence_of_moderator:causalexplanatoryCF:typetoken  1   0.38    0.38   0.207
## Residuals                                            9  16.50    1.83        
##                                                       Pr(>F)    
## presence_of_moderator                               0.000829 ***
## presence_of_moderator:causalexplanatoryCF           0.632445    
## presence_of_moderator:typetoken                     0.509637    
## presence_of_moderator:causalexplanatoryCF:typetoken 0.660201    
## Residuals                                                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 26     26       1

and the effect size:

etasq1 <- EtaSq(aov1, type = 1, anova = TRUE) #eta squared

etasq1 #full table

##                                                          eta.sq eta.sq.part
## causalexplanatoryCF                                 0.001158744 0.004633118
## typetoken                                           0.044584485 0.151892808
## causalexplanatoryCF:typetoken                       0.041054308 0.141568627
## presence_of_moderator                               0.332563510 0.728652751
## presence_of_moderator:causalexplanatoryCF           0.003371985 0.026505720
## presence_of_moderator:typetoken                     0.006486998 0.049772769
## presence_of_moderator:causalexplanatoryCF:typetoken 0.002843096 0.022441652
##                                                      eta.sq.gen         SS df
## causalexplanatoryCF                                 0.002036115  0.1543803  1
## typetoken                                           0.072788458  5.9400253  1
## causalexplanatoryCF:typetoken                       0.067413632  5.4696970  1
## presence_of_moderator                               0.369309682 44.3076923  1
## presence_of_moderator:causalexplanatoryCF           0.005902210  0.4492521  1
## presence_of_moderator:typetoken                     0.011293050  0.8642677  1
## presence_of_moderator:causalexplanatoryCF:typetoken 0.004981072  0.3787879  1
##                                                             MS      SSE dfE
## causalexplanatoryCF                                  0.1543803 33.16667   9
## typetoken                                            5.9400253 33.16667   9
## causalexplanatoryCF:typetoken                        5.4696970 33.16667   9
## presence_of_moderator                               44.3076923 16.50000   9
## presence_of_moderator:causalexplanatoryCF            0.4492521 16.50000   9
## presence_of_moderator:typetoken                      0.8642677 16.50000   9
## presence_of_moderator:causalexplanatoryCF:typetoken  0.3787879 16.50000   9
##                                                               F            p
## causalexplanatoryCF                                  0.04189215 0.8423793717
## typetoken                                            1.61186615 0.2360737888
## causalexplanatoryCF:typetoken                        1.48423938 0.2540845825
## presence_of_moderator                               24.16783217 0.0008290704
## presence_of_moderator:causalexplanatoryCF            0.24504662 0.6324446104
## presence_of_moderator:typetoken                      0.47141873 0.5096366032
## presence_of_moderator:causalexplanatoryCF:typetoken  0.20661157 0.6602009190

etasq1[4,2] #effect size for the presence of moderator #because of errors with the previous ANOVA, had to change numbers here

## [1] 0.7286528
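As a check on what this number means, partial eta squared is the effect's sum of squares divided by the sum of the effect and error sums of squares, and F is the ratio of the corresponding mean squares. The sketch below recomputes both by hand from the SS, SSE, and df values in the table above; the variable names are illustrative.

```r
ss_effect <- 44.3076923   # SS for presence_of_moderator (from the table above)
ss_error  <- 16.5         # SSE for the id:presence_of_moderator stratum
df_effect <- 1
df_error  <- 9

# Partial eta squared: SS_effect / (SS_effect + SS_error)
eta_sq_partial <- ss_effect / (ss_effect + ss_error)
round(eta_sq_partial, 4)

# F: effect mean square divided by error mean square
f_value <- (ss_effect / df_effect) / (ss_error / df_error)
round(f_value, 2)
```

Both values agree with the ANOVA output above (partial eta squared of about .729, F of about 24.17).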

Now we plot it similarly to the paper, using a bar plot and only showing the effect for the presence of the moderator:

#plot using ggplot, using same method as paper just for comparison

confirmatoryplot <- dataconfirmatory %>% #summarize data and take means
  group_by(presence_of_moderator) %>% 
  summarize(Mean = mean(as.numeric(rating), na.rm = TRUE), SEM = std.error(rating, na.rm = TRUE))

ggplot(confirmatoryplot, aes(x = presence_of_moderator, y = Mean, fill = presence_of_moderator)) +
  geom_bar(position = "dodge", stat = "identity", width = .6) +
  geom_errorbar(aes(ymin = Mean - SEM, ymax = Mean + SEM), position = position_dodge(.9), width = .25) +
  theme_classic() +
  ylab("Cause/Explanation Ratings") +
  xlab("Presence of Moderator")+
  scale_x_discrete(labels = c("Non-Moderated", "Moderated")) +
  scale_fill_manual(guide = NULL, name = "Presence of Moderator", labels = c("Non-moderated", "Moderated"), values = c("gray", "grey20")) +
  theme(legend.position = "top") #create plot with specified colors and axes labels

We can now visually compare the plot generated from the replication to the plot in the original paper:

Fig. 4: (left) From Vasilyeva et al.: The effect of moderator on ratings of causal and explanatory relationships in Experiment 1. (right) Results of the replication (a)

For a more comprehensive look at the data, we can create pirate plots for each independent variable and view the distribution of ratings at each level. Consistent with the original paper, we should see differences by presence of moderator, but not between the causal/explanatory judgments or the type/token judgments:

#as per suggestion: pirate plot of all condition differences

pirateplot(formula = rating ~ presence_of_moderator,
           data = dataconfirmatory,
           theme = 3,
           xlab = "Presence of Moderator", #below parameters all aesthetic
           ylab = "Cause/Explanation Ratings",
           point.o = 1,
           gl.lty = 1,
           width.min = 1,
           point.pch = 16,
           inf.f.o = .4,
           yaxt = "n",
           xaxt = "n")
axis(2, at = seq(from = 0, to = 7, by = 1), las = 2)
axis(1, at = c(1,2), labels = c("Non-Moderated", "Moderated"), lwd = 0)

pirateplot(formula = rating ~ causalexplanatoryCF,
           data = dataconfirmatory,
           theme = 3,
           xlab = "Causal vs. Explanatory Questions",
           ylab = "Endorsement Ratings",
           point.o = 1,
           gl.lty = 1,
           width.min = 1,
           point.pch = 16,
           inf.f.o = .4,
           yaxt = "n",
           xaxt = "n")
axis(2, at = seq(from = 0, to = 7, by = 1), las = 2)
axis(1, at = c(1,2), labels = c("Causal", "Explanatory"), lwd = 0)

pirateplot(formula = rating ~ typetoken,
           data = dataconfirmatory,
           theme = 3,
           xlab = "Type vs. Token Questions",
           ylab = "Cause/Explanation Ratings",
           point.o = 1,
           gl.lty = 1,
           width.min = 1,
           point.pch = 16,
           inf.f.o = .4,
           yaxt = "n",
           xaxt = "n")
axis(2, at = seq(from = 0, to = 7, by = 1), las = 2)
axis(1, at = c(1,2), labels = c("Token", "Type"), lwd = 0)

Now we will perform the analysis performed in section 2.2.2 of the paper - not the key analysis of interest but relevant to the experiment. This is a 2 (moderator vs. non-moderator) x 2 (type, token) mixed ANOVA on participants’ counterfactual ratings. They found a main effect of moderator but no effect of target and no interaction. We will now perform that analysis and plot it using ggplot:

First, get rid of the non-counterfactual data:

datacounterfactual <- data %>%
  filter(grepl("counterfactual", causalexplanatoryCF))

Now we perform the ANOVA:

#perform ANOVA on counterfactual ratings

aov2 <- aov(rating ~ (presence_of_moderator*typetoken) + Error(id/(presence_of_moderator)) + (typetoken), data = datacounterfactual)

summary(aov2)

## 
## Error: id
##           Df Sum Sq Mean Sq F value Pr(>F)
## typetoken  1   0.09  0.0897   0.029  0.868
## Residuals 11  34.33  3.1212               
## 
## Error: id:presence_of_moderator
##                                 Df Sum Sq Mean Sq F value Pr(>F)    
## presence_of_moderator            1  83.77   83.77  58.461  1e-05 ***
## presence_of_moderator:typetoken  1   0.97    0.97   0.676  0.428    
## Residuals                       11  15.76    1.43                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 26     16  0.6154

and compute the effect size:

etasq2 <- EtaSq(aov2, type = 1, anova = TRUE)

etasq2

##                                       eta.sq eta.sq.part  eta.sq.gen
## typetoken                       0.0005946313 0.002607076 0.001355951
## presence_of_moderator           0.5550458716 0.841638451 0.558966588
## presence_of_moderator:typetoken 0.0064195913 0.057909141 0.014446842
##                                          SS df          MS      SSE dfE
## typetoken                        0.08974359  1  0.08974359 34.33333  11
## presence_of_moderator           83.76923077  1 83.76923077 15.76190  11
## presence_of_moderator:typetoken  0.96886447  1  0.96886447 15.76190  11
##                                          F            p
## typetoken                        0.0287528 8.684292e-01
## presence_of_moderator           58.4613061 1.001828e-05
## presence_of_moderator:typetoken  0.6761562 4.283781e-01

etasq2[2,2] #had to change this due to error in previous ANOVA

## [1] 0.8416385
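
As a sanity check, the partial eta squared above can be recomputed by hand from the sums of squares printed in the EtaSq table (SS and SSE for `presence_of_moderator`). This is only a sketch using the rounded values from the output, so the last digits may differ slightly; the variable names are my own.

```r
# Values copied from the EtaSq output above (rounded as printed)
ss_effect <- 83.76923077  # SS for presence_of_moderator
ss_error  <- 15.76190     # SSE of its error stratum (id:presence_of_moderator)
df_effect <- 1
df_error  <- 11

# Partial eta squared = SS_effect / (SS_effect + SS_error)
eta_sq_part <- ss_effect / (ss_effect + ss_error)

# F = MS_effect / MS_error, with the p-value from the upper tail of the F distribution
f_value <- (ss_effect / df_effect) / (ss_error / df_error)
p_value <- pf(f_value, df_effect, df_error, lower.tail = FALSE)

c(eta_sq_part = eta_sq_part, F = f_value, p = p_value)
```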

Now we can plot it using the aesthetics from the paper (though they did not actually report this plot, putting it here for the sake of symmetry with the previous section):

counterfactualplot <- datacounterfactual %>% 
  group_by(presence_of_moderator) %>% 
  summarize(Mean = mean(as.numeric(rating), na.rm = TRUE), SEM = std.error(rating, na.rm = TRUE))

ggplot(counterfactualplot, aes(x = presence_of_moderator, y = Mean, fill = presence_of_moderator)) +
  geom_bar(position = "dodge", stat = "identity", width = .6) +
  geom_errorbar(aes(ymin = Mean - SEM, ymax = Mean + SEM), position = position_dodge(.9), width = .25) +
  theme_classic() + 
  ylab("Counterfactual Ratings") +
  xlab("Presence of Moderator")+
  scale_x_discrete(labels = c("Non-Moderated", "Moderated")) +
  scale_fill_manual(guide = NULL, name = "Presence of Moderator", labels = c("Non-moderated", "Moderated"), values = c("gray", "grey20")) +
  theme(legend.position = "top")

As in the previous section, we can generate pirate plots for the presence of moderator and for type/token. Consistent with the paper, a successful replication would show a difference for the presence of moderator, but not for the type/token judgments.

pirateplot(formula = rating ~ presence_of_moderator,
           data = datacounterfactual,
           theme = 3,
           xlab = "Presence of Moderator",
           ylab = "Counterfactual Ratings",
           point.o = 1,
           gl.lty = 1,
           width.min = 1,
           point.pch = 16,
           inf.f.o = .4,
           yaxt = "n",
           xaxt = "n")
axis(2, at = seq(from = 0, to = 7, by = 1), las = 2)
axis(1, at = c(1,2), labels = c("Non-Moderated", "Moderated"), lwd = 0)

pirateplot(formula = rating ~ typetoken,
           data = datacounterfactual,
           theme = 3,
           xlab = "Type vs. Token Questions",
           ylab = "Counterfactual Ratings",
           point.o = 1,
           gl.lty = 1,
           width.min = 1,
           point.pch = 16,
           inf.f.o = .4,
           yaxt = "n",
           xaxt = "n")
axis(2, at = seq(from = 0, to = 7, by = 1), las = 2)
axis(1, at = c(1,2), labels = c("Token", "Type"), lwd = 0)

Exploratory Analyses

Given the stringent exclusion rate used by the authors in the original study, and the fact that the effect is strong even with 13 participants, we can look to see if including all participants makes any difference to the effect. We’ll start with the data filtering code from above, but not excluding any participants on the basis of the check questions:

dataexp <- fetch_survey(surveyID = "SV_bQ6I1kuS75Z5Jvo", force_request = T, include_display_order = FALSE, label = FALSE, convert = FALSE)

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   StartDate = col_datetime(format = ""),
##   EndDate = col_datetime(format = ""),
##   IPAddress = col_logical(),
##   RecordedDate = col_datetime(format = ""),
##   ResponseId = col_character(),
##   RecipientLastName = col_logical(),
##   RecipientFirstName = col_logical(),
##   RecipientEmail = col_logical(),
##   ExternalReference = col_logical(),
##   LocationLatitude = col_logical(),
##   LocationLongitude = col_logical(),
##   DistributionChannel = col_character(),
##   UserLanguage = col_character(),
##   CommOwn = col_character(),
##   Comments = col_character(),
##   Gender = col_character(),
##   `prolific-id` = col_character(),
##   PROLIFIC_PID = col_character(),
##   task = col_character(),
##   moderator = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

#### Data exclusion / filtering

dataexp1 <- dataexp %>% 
  mutate(id = row_number()) %>% 
  select(id, everything()) %>% 
  filter(Consent == 2) %>% #consenting
  subset(select = -c(2:19)) %>% 
  select(-starts_with(c("Mch", #getting rid of irrelevant columns, including check questions
                        "Comm",
                        "Age", 
                        "Gender",
                        "prolific-id",
                        "PROLIFIC_PID",
                        "tableorder",
                        "cov"))) %>% 
  pivot_longer(!c(id, task, moderator), names_to = "question_type", values_to = "rating") %>% #turn table into longer
  mutate(presence_of_moderator = case_when(moderator == "drolyuyu" & grepl("d3", question_type) ~ "present", #create column indicating whether it was moderated or not
         moderator == "drolyuyu" & grepl("d4", question_type) ~ "present",
         moderator == "drolyuyu" & grepl("y3", question_type) ~ "present",
         moderator == "drolyuyu" & grepl("y4", question_type) ~ "present",
         moderator == "zelmogrimond" & grepl("z3", question_type) ~ "present",
         moderator == "zelmogrimond" & grepl("z4", question_type) ~ "present",
         moderator == "zelmogrimond" & grepl("Z3", question_type) ~ "present",
         moderator == "zelmogrimond" & grepl("Z4", question_type) ~ "present",
         moderator == "zelmogrimond" & grepl("g3", question_type) ~ "present",
         moderator == "zelmogrimond" & grepl("g4", question_type) ~ "present",
         TRUE ~ "absent")) %>% 
  mutate("causalexplanatoryCF" = case_when(grepl("CaTy", question_type) ~ "causal", #change column names
                                     grepl("ExTy", question_type) ~ "explanatory",
                                     grepl("CaTo", question_type) ~ "causal",
                                     grepl("ExTo", question_type) ~ "explanatory",
                                     grepl("Ty.CF", question_type) ~ "counterfactual",
                                     grepl("To.CF", question_type) ~ "counterfactual")) %>% 
  mutate("typetoken" = case_when(grepl("CaTy", question_type) ~ "type", #change column names
                                     grepl("ExTy", question_type) ~ "type",
                                     grepl("CaTo", question_type) ~ "token",
                                     grepl("ExTo", question_type) ~ "token",
                                     grepl("Ty.CF", question_type) ~ "type",
                                     grepl("To.CF", question_type) ~ "token")) %>% 
  filter(!is.na(rating)) %>% 
  select(-c(moderator, task, question_type)) #get rid of last remaining unnecessary columns

#turn them all into factors 
dataexp1$id <- factor(dataexp1$id)
dataexp1$presence_of_moderator <- factor(dataexp1$presence_of_moderator)
dataexp1$causalexplanatoryCF <- factor(dataexp1$causalexplanatoryCF)
dataexp1$typetoken <- factor(dataexp1$typetoken)

#perform ANOVA and calculate partial eta squared, like the confirmatory analysis:

dataexp1test <- dataexp1 %>%
  filter(causalexplanatoryCF != "counterfactual")

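#note: this model is fit on dataexp1, which still contains the counterfactual rows (hence the 2-df causalexplanatoryCF terms in the Within stratum below); dataexp1test is the filtered version
aov3 <- aov(rating ~ (presence_of_moderator*causalexplanatoryCF*typetoken) + Error(id/(presence_of_moderator)) + (causalexplanatoryCF*typetoken), data = dataexp1)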

summary(aov3)

## 
## Error: id
##                               Df Sum Sq Mean Sq F value Pr(>F)
## causalexplanatoryCF            1   1.17   1.167   0.216  0.647
## typetoken                      1  12.21  12.212   2.258  0.149
## causalexplanatoryCF:typetoken  1   2.82   2.819   0.521  0.479
## Residuals                     20 108.17   5.409               
## 
## Error: id:presence_of_moderator
##                                                     Df Sum Sq Mean Sq F value
## presence_of_moderator                                1 135.01  135.01  40.128
## presence_of_moderator:causalexplanatoryCF            1  13.96   13.96   4.148
## presence_of_moderator:typetoken                      1   1.23    1.23   0.366
## presence_of_moderator:causalexplanatoryCF:typetoken  1   0.15    0.15   0.043
## Residuals                                           20  67.29    3.36        
##                                                      Pr(>F)    
## presence_of_moderator                               3.5e-06 ***
## presence_of_moderator:causalexplanatoryCF            0.0551 .  
## presence_of_moderator:typetoken                      0.5518    
## presence_of_moderator:causalexplanatoryCF:typetoken  0.8376    
## Residuals                                                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##                                                      Df Sum Sq Mean Sq F value
## causalexplanatoryCF                                   2   1.67  0.8352   0.850
## presence_of_moderator:causalexplanatoryCF             2   2.96  1.4803   1.506
## causalexplanatoryCF:typetoken                         2   3.60  1.7981   1.830
## presence_of_moderator:causalexplanatoryCF:typetoken   2   1.38  0.6887   0.701
## Residuals                                           136 133.65  0.9827        
##                                                     Pr(>F)
## causalexplanatoryCF                                  0.430
## presence_of_moderator:causalexplanatoryCF            0.225
## causalexplanatoryCF:typetoken                        0.164
## presence_of_moderator:causalexplanatoryCF:typetoken  0.498
## Residuals

etasq3 <- EtaSq(aov3, type = 1)

etasq3 #full table

##                                                           eta.sq eta.sq.part
## causalexplanatoryCF                                 0.0024049603 0.010673117
## typetoken                                           0.0251658549 0.101438545
## causalexplanatoryCF:typetoken                       0.0058085696 0.025394610
## presence_of_moderator                               0.2782208293 0.667377518
## presence_of_moderator:causalexplanatoryCF           0.0287596204 0.171775609
## presence_of_moderator:typetoken                     0.0025397713 0.017986327
## presence_of_moderator:causalexplanatoryCF:typetoken 0.0002988597 0.002150615
## causalexplanatoryCF                                 0.0034424987 0.012344869
## presence_of_moderator:causalexplanatoryCF           0.0061013830 0.021673034
## causalexplanatoryCF:typetoken                       0.0074111298 0.026203542
## presence_of_moderator:causalexplanatoryCF:typetoken 0.0028386310 0.010201479
##                                                       eta.sq.gen
## causalexplanatoryCF                                 0.0037612011
## typetoken                                           0.0380049119
## causalexplanatoryCF:typetoken                       0.0090361245
## presence_of_moderator                               0.3039904920
## presence_of_moderator:causalexplanatoryCF           0.0431976927
## presence_of_moderator:typetoken                     0.0039711994
## presence_of_moderator:causalexplanatoryCF:typetoken 0.0004689417
## causalexplanatoryCF                                 0.0053751215
## presence_of_moderator:causalexplanatoryCF           0.0094873193
## causalexplanatoryCF:typetoken                       0.0115004826
## presence_of_moderator:causalexplanatoryCF:typetoken 0.0044364248

Now let’s plot the data using pirate plots. This time I will use participant means (rather than trial level data), as per suggestion after my final presentation:

#we can plot the data and visually observe

#first calculate participant means and summarise data

dataexp1plot <- dataexp1test %>% 
  group_by(id, presence_of_moderator) %>% 
  summarize(Mean = mean(as.numeric(rating), na.rm = TRUE), SEM = std.error(rating, na.rm = TRUE))

## `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.

pirateplot(formula = Mean ~ presence_of_moderator,
           data = dataexp1plot,
           theme = 3,
           xlab = "Presence of Moderator",
           ylab = "Cause/Explanation Ratings",
           point.o = 1,
           gl.lty = 1,
           width.min = 1,
           point.pch = 16,
           inf.f.o = .4,
           yaxt = "n",
           xaxt = "n")
axis(2, at = seq(from = 0, to = 7, by = 1), las = 2)
axis(1, at = c(1,2), labels = c("Non-Moderated", "Moderated"), lwd = 0)
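
To make the switch from trial-level data to participant means concrete, here is a small base-R sketch of the same aggregation on toy data. The data frame `toy` and its values are hypothetical; only the column names follow the ones used above.

```r
# Toy data: 3 participants, 2 ratings per moderator condition (values are made up)
toy <- data.frame(
  id = factor(rep(1:3, each = 4)),
  presence_of_moderator = factor(rep(c("absent", "absent", "present", "present"), times = 3)),
  rating = c(6, 7, 2, 3,
             5, 6, 3, 3,
             7, 7, 1, 2)
)

# One mean per participant per condition, analogous to group_by(id, presence_of_moderator)
participant_means <- aggregate(rating ~ id + presence_of_moderator, data = toy, FUN = mean)
participant_means
```

Each participant then contributes exactly one point per condition to the plot, instead of one point per rating.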

Because I plotted individual trials in the original confirmatory analysis and not participant means, we can also redo the graph from earlier (only the one that shows the distribution between moderator conditions, as that was the only one that had significant differences). This plot is in the exploratory analyses because it was not part of the original confirmatory preregistration:

dataconfirmatoryplotv2 <- dataconfirmatory %>% 
  group_by(id, presence_of_moderator) %>% 
  summarize(Mean = mean(as.numeric(rating), na.rm = TRUE), SEM = std.error(rating, na.rm = TRUE))

## `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.

pirateplot(formula = Mean ~ presence_of_moderator,
           data = dataconfirmatoryplotv2,
           theme = 3,
           xlab = "Presence of Moderator",
           ylab = "Cause/Explanation Ratings",
           point.o = 1,
           gl.lty = 1,
           width.min = 1,
           point.pch = 16,
           inf.f.o = .4,
           yaxt = "n",
           xaxt = "n")
axis(2, at = seq(from = 0, to = 7, by = 1), las = 2)
axis(1, at = c(1,2), labels = c("Non-Moderated", "Moderated"), lwd = 0)

Now we can investigate the effect of the covariation check questions. What happens to the effect if you exclude participants who failed the covariation questions (in tandem with the original exclusion criteria)?

#### Data exclusion / filtering

dataexp2 <- dataexp %>% 
  mutate(id = row_number()) %>% 
  select(id, everything()) %>% 
  filter(Consent == 2) %>% #consenting
  filter(Mch.z1 == 1,
         Mch.z2 == 1,
         Mch.d1 == 1,
         Mch.d2 == 1,
         Mch.g1 == 1,
         Mch.g2 == 1,
         Mch.y1 == 1,
         Mch.y2 == 1) %>% #check questions
  filter(cov.1 == 1,
         cov.2 == 1) %>% #covariation comprehension questions
  subset(select = -c(2:19)) %>% 
  select(-starts_with(c("Mch", #getting rid of irrelevant columns, including check questions
                        "Comm",
                        "Age", 
                        "Gender",
                        "prolific-id",
                        "PROLIFIC_PID",
                        "tableorder",
                        "cov"))) %>% 
  pivot_longer(!c(id, task, moderator), names_to = "question_type", values_to = "rating") %>% #turn table into longer
  mutate(presence_of_moderator = case_when(moderator == "drolyuyu" & grepl("d3", question_type) ~ "present", #create column indicating whether it was moderated or not
         moderator == "drolyuyu" & grepl("d4", question_type) ~ "present",
         moderator == "drolyuyu" & grepl("y3", question_type) ~ "present",
         moderator == "drolyuyu" & grepl("y4", question_type) ~ "present",
         moderator == "zelmogrimond" & grepl("z3", question_type) ~ "present",
         moderator == "zelmogrimond" & grepl("z4", question_type) ~ "present",
         moderator == "zelmogrimond" & grepl("Z3", question_type) ~ "present",
         moderator == "zelmogrimond" & grepl("Z4", question_type) ~ "present",
         moderator == "zelmogrimond" & grepl("g3", question_type) ~ "present",
         moderator == "zelmogrimond" & grepl("g4", question_type) ~ "present",
         TRUE ~ "absent")) %>% 
  mutate("causalexplanatoryCF" = case_when(grepl("CaTy", question_type) ~ "causal", #change column names
                                     grepl("ExTy", question_type) ~ "explanatory",
                                     grepl("CaTo", question_type) ~ "causal",
                                     grepl("ExTo", question_type) ~ "explanatory",
                                     grepl("Ty.CF", question_type) ~ "counterfactual",
                                     grepl("To.CF", question_type) ~ "counterfactual")) %>% 
  mutate("typetoken" = case_when(grepl("CaTy", question_type) ~ "type", #change column names
                                     grepl("ExTy", question_type) ~ "type",
                                     grepl("CaTo", question_type) ~ "token",
                                     grepl("ExTo", question_type) ~ "token",
                                     grepl("Ty.CF", question_type) ~ "type",
                                     grepl("To.CF", question_type) ~ "token")) %>% 
  filter(!is.na(rating)) %>% 
  select(-c(moderator, task, question_type)) #get rid of last remaining unnecessary columns

#turn them all into factors - forgot to add this in the prereg
dataexp2$id <- factor(dataexp2$id)
dataexp2$presence_of_moderator <- factor(dataexp2$presence_of_moderator)
dataexp2$causalexplanatoryCF <- factor(dataexp2$causalexplanatoryCF)
dataexp2$typetoken <- factor(dataexp2$typetoken)

#now conduct statistical tests:

dataexp2test <- dataexp2 %>%
  filter(causalexplanatoryCF != "counterfactual")

aov4 <- aov(rating ~ (presence_of_moderator*causalexplanatoryCF*typetoken) + Error(id/(presence_of_moderator)) + (causalexplanatoryCF*typetoken), data = dataexp2test)

summary(aov4)

## 
## Error: id
##                               Df Sum Sq Mean Sq F value Pr(>F)
## causalexplanatoryCF            1  0.430   0.430   0.097  0.766
## typetoken                      1  1.482   1.482   0.335  0.584
## causalexplanatoryCF:typetoken  1  6.564   6.564   1.483  0.269
## Residuals                      6 26.550   4.425               
## 
## Error: id:presence_of_moderator
##                                                     Df Sum Sq Mean Sq F value
## presence_of_moderator                                1  55.23   55.23  55.689
## presence_of_moderator:causalexplanatoryCF            1   1.00    1.00   1.010
## presence_of_moderator:typetoken                      1   0.00    0.00   0.001
## presence_of_moderator:causalexplanatoryCF:typetoken  1   0.07    0.07   0.073
## Residuals                                            6   5.95    0.99        
##                                                       Pr(>F)    
## presence_of_moderator                               0.000299 ***
## presence_of_moderator:causalexplanatoryCF           0.353785    
## presence_of_moderator:typetoken                     0.974718    
## presence_of_moderator:causalexplanatoryCF:typetoken 0.795612    
## Residuals                                                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 20   24.5   1.225

etasq4 <- EtaSq(aov4, type = 1)

etasq4 #full table

##                                                           eta.sq  eta.sq.part
## causalexplanatoryCF                                 3.529147e-03 0.0159290473
## typetoken                                           1.216672e-02 0.0528546940
## causalexplanatoryCF:typetoken                       5.389970e-02 0.1982155113
## presence_of_moderator                               4.535003e-01 0.9027380466
## presence_of_moderator:causalexplanatoryCF           8.221642e-03 0.1440315122
## presence_of_moderator:typetoken                     8.887301e-06 0.0001818579
## presence_of_moderator:causalexplanatoryCF:typetoken 5.972266e-04 0.0120754717
##                                                       eta.sq.gen
## causalexplanatoryCF                                 0.0074832611
## typetoken                                           0.0253344930
## causalexplanatoryCF:typetoken                       0.1032608696
## presence_of_moderator                               0.4920917799
## presence_of_moderator:causalexplanatoryCF           0.0172615505
## presence_of_moderator:typetoken                     0.0000189865
## presence_of_moderator:causalexplanatoryCF:typetoken 0.0012742912

And now we can plot it:

dataexp2plot <- dataexp2test %>% 
  group_by(id, presence_of_moderator) %>% 
  summarize(Mean = mean(as.numeric(rating), na.rm = TRUE), SEM = std.error(rating, na.rm = TRUE))

## `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.

pirateplot(formula = Mean ~ presence_of_moderator,
           data = dataexp2plot,
           theme = 3,
           xlab = "Presence of Moderator",
           ylab = "Cause/Explanation Ratings",
           point.o = 1,
           gl.lty = 1,
           width.min = 1,
           point.pch = 16,
           inf.f.o = .4,
           yaxt = "n",
           xaxt = "n")
axis(2, at = seq(from = 0, to = 7, by = 1), las = 2)
axis(1, at = c(1,2), labels = c("Non-Moderated", "Moderated"), lwd = 0)

What about if we used only the covariation questions as the exclusion criterion?

#### Data exclusion / filtering

dataexp3 <- dataexp %>% 
  mutate(id = row_number()) %>% 
  select(id, everything()) %>% 
  filter(Consent == 2) %>% #consenting
  filter(cov.1 == 1,
         cov.2 == 1) %>% #covariation comprehension questions
  subset(select = -c(2:19)) %>% 
  select(-starts_with(c("Mch", #getting rid of irrelevant columns, including check questions
                        "Comm",
                        "Age", 
                        "Gender",
                        "prolific-id",
                        "PROLIFIC_PID",
                        "tableorder",
                        "cov"))) %>% 
  pivot_longer(!c(id, task, moderator), names_to = "question_type", values_to = "rating") %>% #turn table into longer
  mutate(presence_of_moderator = case_when(moderator == "drolyuyu" & grepl("d3", question_type) ~ "present", #create column indicating whether it was moderated or not
         moderator == "drolyuyu" & grepl("d4", question_type) ~ "present",
         moderator == "drolyuyu" & grepl("y3", question_type) ~ "present",
         moderator == "drolyuyu" & grepl("y4", question_type) ~ "present",
         moderator == "zelmogrimond" & grepl("z3", question_type) ~ "present",
         moderator == "zelmogrimond" & grepl("z4", question_type) ~ "present",
         moderator == "zelmogrimond" & grepl("Z3", question_type) ~ "present",
         moderator == "zelmogrimond" & grepl("Z4", question_type) ~ "present",
         moderator == "zelmogrimond" & grepl("g3", question_type) ~ "present",
         moderator == "zelmogrimond" & grepl("g4", question_type) ~ "present",
         TRUE ~ "absent")) %>% 
  mutate("causalexplanatoryCF" = case_when(grepl("CaTy", question_type) ~ "causal", #change column names
                                     grepl("ExTy", question_type) ~ "explanatory",
                                     grepl("CaTo", question_type) ~ "causal",
                                     grepl("ExTo", question_type) ~ "explanatory",
                                     grepl("Ty.CF", question_type) ~ "counterfactual",
                                     grepl("To.CF", question_type) ~ "counterfactual")) %>% 
  mutate("typetoken" = case_when(grepl("CaTy", question_type) ~ "type", #change column names
                                     grepl("ExTy", question_type) ~ "type",
                                     grepl("CaTo", question_type) ~ "token",
                                     grepl("ExTo", question_type) ~ "token",
                                     grepl("Ty.CF", question_type) ~ "type",
                                     grepl("To.CF", question_type) ~ "token")) %>% 
  filter(!is.na(rating)) %>% 
  select(-c(moderator, task, question_type)) #get rid of last remaining unnecessary columns

#turn them all into factors - forgot to add this in the prereg
dataexp3$id <- factor(dataexp3$id)
dataexp3$presence_of_moderator <- factor(dataexp3$presence_of_moderator)
dataexp3$causalexplanatoryCF <- factor(dataexp3$causalexplanatoryCF)
dataexp3$typetoken <- factor(dataexp3$typetoken)

#now conduct statistical tests:

dataexp3test <- dataexp3 %>%
  filter(causalexplanatoryCF != "counterfactual")

aov5 <- aov(rating ~ (presence_of_moderator*causalexplanatoryCF*typetoken) + Error(id/(presence_of_moderator)) + (causalexplanatoryCF*typetoken), data = dataexp3test)

summary(aov5)

## 
## Error: id
##                               Df Sum Sq Mean Sq F value Pr(>F)
## causalexplanatoryCF            1   6.02   6.017   1.567  0.234
## typetoken                      1   3.16   3.157   0.823  0.382
## causalexplanatoryCF:typetoken  1   0.01   0.013   0.003  0.954
## Residuals                     12  46.06   3.839               
## 
## Error: id:presence_of_moderator
##                                                     Df Sum Sq Mean Sq F value
## presence_of_moderator                                1  68.06   68.06  37.732
## presence_of_moderator:causalexplanatoryCF            1   0.20    0.20   0.113
## presence_of_moderator:typetoken                      1   3.41    3.41   1.889
## presence_of_moderator:causalexplanatoryCF:typetoken  1   0.18    0.18   0.100
## Residuals                                           12  21.65    1.80        
##                                                     Pr(>F)    
## presence_of_moderator                                5e-05 ***
## presence_of_moderator:causalexplanatoryCF            0.742    
## presence_of_moderator:typetoken                      0.194    
## presence_of_moderator:causalexplanatoryCF:typetoken  0.757    
## Residuals                                                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 32     27  0.8437

etasq5 <- EtaSq(aov5, type = 1)

etasq5 #full table

##                                                           eta.sq  eta.sq.part
## causalexplanatoryCF                                 3.423423e-02 0.1155292423
## typetoken                                           1.796552e-02 0.0641496200
## causalexplanatoryCF:typetoken                       7.620402e-05 0.0002906695
## presence_of_moderator                               3.872688e-01 0.7587087784
## presence_of_moderator:causalexplanatoryCF           1.161688e-03 0.0093440122
## presence_of_moderator:typetoken                     1.938800e-02 0.1360077929
## presence_of_moderator:causalexplanatoryCF:typetoken 1.024521e-03 0.0082498125
##                                                       eta.sq.gen
## causalexplanatoryCF                                 0.0597335981
## typetoken                                           0.0322629695
## causalexplanatoryCF:typetoken                       0.0001413916
## presence_of_moderator                               0.4181492384
## presence_of_moderator:causalexplanatoryCF           0.0021511041
## presence_of_moderator:typetoken                     0.0347287734
## presence_of_moderator:causalexplanatoryCF:typetoken 0.0018975927
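
The three exclusion pipelines above repeat the same ten-branch `case_when()` to label each rating as moderated or not. As a sketch only, the same labelling could be written once as a helper with two regular expressions. The function `label_moderator()` and the toy question names are hypothetical (the real Qualtrics column names are not shown here); the moderator values and the d/y/z/Z/g codes are taken from the code above.

```r
# Label a rating as "present" (moderated) or "absent" (non-moderated).
# "[dy][34]" matches d3, d4, y3, y4; "[zZg][34]" matches z3, z4, Z3, Z4, g3, g4.
label_moderator <- function(moderator, question_type) {
  moderated <-
    (moderator == "drolyuyu"     & grepl("[dy][34]",  question_type)) |
    (moderator == "zelmogrimond" & grepl("[zZg][34]", question_type))
  # %in% TRUE treats NA like the TRUE ~ "absent" fallback in case_when()
  ifelse(moderated %in% TRUE, "present", "absent")
}

toy_moderator <- c("drolyuyu", "drolyuyu", "zelmogrimond", "zelmogrimond")
toy_question  <- c("d3.CaTy", "z1.ExTo", "Z4.Ty.CF", "d3.CaTo")  # hypothetical column names
label_moderator(toy_moderator, toy_question)
```

Wrapping the whole filtering step in one function with arguments for the two exclusion rules would avoid copying the pipeline for each exploratory analysis.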

Discussion

Summary of Replication Attempt

I successfully replicated the effect from the original paper (F(1, 9) = 24.167, p = .0008, \(\eta^{2}_{p}\) = .729). There were no other main effects or interactions. The effect size in the original paper was .478, so the effect size that I found was considerably larger than that (this is further discussed in the commentary below).

I also replicated the effect of the presence of moderator found with the counterfactual DV (F(1, 11) = 58.461, p = .00001, \(\eta^{2}_{p}\) = .842).

Overall, the replication was successful. Further commentary on the exploratory analyses follows below.

Commentary

Exploratory Analyses and Concluding Thoughts

Given the high exclusion rate of participants in my study, I decided to see if including participants who missed the comprehension questions would have any effect on the results. In my confirmatory analyses, I had to exclude 11 of the 24 participants I ran, an exclusion rate of roughly 46%. In the original paper, they excluded 49 out of 231, a roughly 21% exclusion rate. Granted, the criterion for exclusion in the paper is slightly ambiguous - see below:

At the end of the experiment, participants answered two multiple-choice comprehension check questions about each scenario they had read (e.g., “According to what you read, as a scientist on planet Zorg you were interested in evaluating the following hypothesis about zelmos: a. eating yona plants produces antenna soreness; b. eating drol mushrooms produces antenna soreness; c. eating mushrooms with stem bumps produces spotted antennas; d. antenna soreness makes zelmos eat yonas”). Participants who answered either question incorrectly were excluded from further analyses.

Ultimately, the ambiguity lies in whether the original authors discarded a participant's entire data set or only the data from the particular scenario/trial. In either case, my exploratory analyses found that the effect does not go away even when every participant is included (F(1, 20) = 40.12, p = .0000035, \(\eta^{2}_{p}\) = .667). My sense is that my interpretation of the exclusion criterion (excluding a participant's entire data set based on one incorrect comprehension question out of eight) was very stringent. It is possible that the original criterion was actually less stringent, which would be consistent with the mismatch in our exclusion rates.
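
The exclusion-rate comparison above is simple arithmetic, sketched here in Python purely as an illustration (the analyses in this report were not run this way, and the function name `exclusion_rate` is my own). The counts are the ones stated in the text: 11 of 24 in this replication and 49 of 231 in the original paper.

```python
def exclusion_rate(excluded: int, total: int) -> float:
    """Return the proportion of participants excluded."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= excluded <= total:
        raise ValueError("excluded must be between 0 and total")
    return excluded / total


# Counts taken from the text of this report.
replication_rate = exclusion_rate(11, 24)   # this replication
original_rate = exclusion_rate(49, 231)     # Vasilyeva et al. (2018), Exp. 1
remaining = 24 - 11                         # participants kept in the confirmatory analysis

print(f"replication: {replication_rate:.1%}")
print(f"original:    {original_rate:.1%}")
print(f"participants retained: {remaining}")
```

Under the strict reading of the criterion, the replication's exclusion rate comes out at a little more than twice the original's.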

I also checked whether excluding participants based on the covariation comprehension questions would change the effect. In fact, 3 of the participants who were included in the analysis missed at least one of these covariation questions, and 6 of the participants who were excluded answered both of them correctly. The point of this check was not whether the effect remained significant, but how its size changed. I found the effect when the covariation questions were the only exclusion criterion (F(1, 9) = 37.73, p = .00004, \(\eta^{2}_{p}\) = .758) and when they were used in tandem with the original exclusion criteria (F(1, 9) = 55.68, p = .000299, \(\eta^{2}_{p}\) = .902). The effect is largest when both sets of exclusion criteria are used. While I would not draw any strong conclusions from this, it raises the question: What makes a good comprehension question?

From the perspective of the researcher, building adequate and succinct comprehension questions is valuable for two reasons: to reduce cost, and to avoid adversarial relationships with workers. If researchers can find the line between comprehension questions that accurately exclude participants who were not paying attention AND are not too stringent in doing so, then they can reduce the cost of data collection (by not having to exclude as many participants). In the case of this experiment, excluding participants for missing any one of the eight comprehension questions might be too strict a criterion. And while the answer differs by domain and by the effect you are trying to find, it is an important question for researchers on crowd-sourcing sites to think about.

Ultimately, I replicated the targeted effect with only 13 participants, and it held even when I included participants who missed various comprehension questions. This is an example of a strong effect - a case where if you don’t replicate, there might be an issue with your implementation rather than with the effect itself. This report does not look at the free response data that we collected or at the effect of table order, given that these were not mentioned in the paper and were not the key analysis of interest. These data may be insightful. Future work may build upon this and further investigate the theoretical question of what aspects influence people’s endorsements of causal relationships, as well as the methodological question of what makes good comprehension questions in crowd-sourced experiments.