Replication of Study 1 by Franke & Degen (2016, PLoS One)

Introduction

Why replicate this: Franke & Degen (2016, henceforth FD) is a paper in the Rational Speech Acts (RSA) framework that tests an assumption underlying much RSA work: whether speakers and listeners are close to uniform in their level of pragmatic reasoning, or whether the population is diverse, with some people employing higher levels of reasoning than others. I’m interested in starting to work on RSA models of pragmatic language use, and replicating this study would provide a good introduction to the required analysis. Study 1 and Study 2 in the paper are complementary: Study 1 tests comprehension and Study 2 tests production, with parallel materials and analyses. The original paper found stronger evidence of heterogeneity in comprehension, so I’ll replicate Study 1. (I’d also strongly consider replicating Study 2 and the supplemental norming study beyond the confines of the class.)

Stimuli & procedures: In the study universe, there are 3 types of creatures (purple monster, green monster, robot) and 3 accessories (red hat, blue hat, scarf). There are “words” (represented as images) for purple monster, green monster, red hat, and blue hat. On each trial of the study (original available), participants see one of the 4 words and an array of three images (creatures with accessories), and choose which image is being referred to. The experiment is written in JavaScript and all the source code is available. After checking with the original authors, I plan to use their source code and rehost the experiment using the original images, text, and experimental items.

repo and original paper

Methods

Power Analysis

FD recruited 60 participants and analysed data from the 51 who remained after exclusions. I’ll use the same sample size, unless the power analysis indicates otherwise.

Planned Sample

(Sample size subject to change pending power analysis results.)

Following FD, we recruited 60 participants. We excluded the 9 participants (15%) with the highest error rates (i.e. selection of distractors) on non-ambiguous trials. We also excluded XX participants (including N who had high error rates as well) who did not report English as their native language. Participants were recruited on DATE.

Materials

Materials were exactly the same as in the original experiment. The materials description from FD is quoted below.

" Materials. Participants saw 66 experimental trials, which were composed of 24 critical and 42 filler trials. Of the 24 critical trials, 12 constituted a simple implicature situation and 12 a complex one (as shown in Section 1). Stimuli were created by randomly sampling a message and then generating a grid of three objects—a target, a competitor, and a distractor— following different constraints in different conditions.

On simple implicature trials, the target was generated by combining the feature denoted by the sampled message with the inexpressible feature along the other feature dimension. For example, if the sampled message was one of “red hat” or “blue hat”, the target was a robot with the respective hat. If instead the sampled message was “purple monster” or “green monster”, the target was the respective monster with a scarf. The competitor was generated by combining the feature denoted by the sampled message with a randomly sampled expressible feature along the other feature dimension. For example, if the sampled message was “red hat”, the competitor could be a green monster with a red hat or a purple monster with a red hat. The distractor was generated by combining two features that were randomly sampled from the set of features that did not contain those features already present in target and competitor. For example, if the target was a robot with a red hat and the competitor was a green monster with a red hat, the distractor could be a purple monster with either a scarf or a blue hat.

On complex implicature trials, the target was generated by combining the feature denoted by the sampled message with an expressible feature along the other feature dimension. For example, if the sampled message was “green monster” the target could be a green monster with a red hat. The competitor was generated by combining the feature denoted by the sampled message with the remaining expressible feature along the other feature dimension. Continuing our example, the competitor would then be a green monster with a blue hat. The distractor was generated by combining the target feature that was not denoted by the sampled message (red hat) with the inexpressible feature along the other feature dimension (robot).

Of the 42 filler trials, 24 used the displays from the implicature conditions but the target was a) the competitor from the simple condition (six trials), b) the distractor from the simple condition (six trials), or c) the competitor from the complex condition (12 trials), as identified unambiguously by the trigger message. This was also intended to prevent learning associations of display type with the target. On the other 18 filler trials, the target was either entirely unambiguous or entirely ambiguous given the message. That is, there was either only one object with the feature denoted by the trigger message, or there were two identical objects that were equally viable target candidates. Unambiguous and ambiguous fillers were included as baselines to compare behavior on implicature trials to. Ambiguous fillers establish how often the target could be chosen by chance, while unambiguous fillers establish the upper bound on target choices. We did not include filler items where the target was the distractor from the complex condition, because this would have required participants to draw a one-step inference to identify the target. Trial order as well as target, competitor, and distractor order were randomized. "
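The constraints FD describe for the critical trials can be sketched in a few lines of R. This is only an illustration of the quoted description, not the original stimulus-generation code (which is JavaScript); the names `make_trial` and `pick_one` are hypothetical helpers introduced here.

```r
# Feature space as described in the quoted materials section.
kinds <- c("robot", "green monster", "purple monster")
accessories <- c("scarf", "blue hat", "red hat")
features <- list(kind = kinds, accessory = accessories)
inexpressible <- c(kind = "robot", accessory = "scarf")

# Sample one element (safe for length-1 vectors). Hypothetical helper.
pick_one <- function(x) x[sample.int(length(x), 1)]

# Build target, competitor and distractor for one critical trial.
make_trial <- function(message, type = c("simple", "complex")) {
  type <- match.arg(type)
  msg_dim <- if (message %in% kinds) "kind" else "accessory"
  other_dim <- setdiff(names(features), msg_dim)
  expressible_other <- setdiff(features[[other_dim]], inexpressible[[other_dim]])

  empty <- c(kind = NA_character_, accessory = NA_character_)
  target <- empty
  competitor <- empty
  distractor <- empty
  # Target and competitor both carry the feature named by the message.
  target[msg_dim] <- message
  competitor[msg_dim] <- message

  if (type == "simple") {
    # Target: message feature + inexpressible feature on the other dimension.
    target[other_dim] <- inexpressible[[other_dim]]
    competitor[other_dim] <- pick_one(expressible_other)
    # Distractor: features not already used by target or competitor.
    for (d in names(features)) {
      unused <- setdiff(features[[d]], c(target[[d]], competitor[[d]]))
      distractor[d] <- pick_one(unused)
    }
  } else {
    # Target and competitor split the two expressible features.
    target[other_dim] <- pick_one(expressible_other)
    competitor[other_dim] <- setdiff(expressible_other, target[[other_dim]])
    # Distractor: target's unmentioned feature + inexpressible message-dimension feature.
    distractor[other_dim] <- target[[other_dim]]
    distractor[msg_dim] <- inexpressible[[msg_dim]]
  }
  rbind(target = target, competitor = competitor, distractor = distractor)
}

set.seed(1)
simple_trial <- make_trial("red hat", "simple")
complex_trial <- make_trial("green monster", "complex")
print(simple_trial)
print(complex_trial)
```

For the message “red hat”, the simple trial always has a robot with a red hat as target, matching FD’s worked example; which monster serves as competitor depends on the random draw.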

Procedure

The procedure was exactly the same as in FD. To quote their procedure,

" Procedure. Participants engaged in a referential comprehension task. On each trial they saw three objects on a display. Each object differed systematically along two dimensions: its ontological kind (robot, green monster, purple monster) and accessory (scarf, blue hat, red hat). In addition to these three objects, participants saw a pictorial message that they were told was sent to them by a previous participant whose goal was to get them to pick out one of these three objects. They were told that the previous participant was allowed to send a message expressing only one feature of a given object, and that the messages the participant could send were furthermore restricted to monsters and hats (i.e., there were no messages for referring to the robot or scarf feature; we refer to these features as inexpressible features). The four expressible features were visible to participants at the bottom of the display on every trial[…]

Participants initially completed four speaker trials. They saw three objects, one of which was highlighted with a yellow rectangle. Participants were asked to click on one of four pictorial messages to send to another Mechanical Turk worker to get them to pick out the highlighted object. They were told that the other worker did not know which object was highlighted but knew which messages could be sent. The four speaker trials contained three unambiguous and one ambiguous trial which could function as fillers in the main experiment."

Analysis Plan

Data exclusions: The only data exclusions performed were at the participant level, as described above.

Confirmatory analyses: At a group level, the question of interest is how rates of picking the target, competitor, and distractor vary between the trial types (ambiguous filler, unambiguous filler, simple implicature, complex implicature). We visualize this difference and fit a logistic regression. FD included random effects, but only those with which the model would converge. We use the same model structure they settled on, fit with glmer from lme4. As a secondary check, we fit the full mixed-effects structure in brms.

Logistic regression (glmer): selected_target ~ condition_h1 + condition_h2 + trial_num + condition_h1*trial_num + condition_h2*trial_num + message_type + target_position + (trial_num + message_type | subject)

Logistic regression (brms): selected_target ~ condition_h1 + condition_h2 + trial_num + condition_h1*trial_num + condition_h2*trial_num + message_type + target_position + (condition_h1 + condition_h2 + trial_num + condition_h1*trial_num + condition_h2*trial_num + message_type + target_position | subject)

selected_target is 1 if the target was selected and 0 if the competitor was selected (trials with distractor selections are excluded).
The three conditions (ambiguous, complex, simple) are Helmert coded as condition_h1 and condition_h2: h1 contrasts simple with not-simple (complex and ambiguous together), and h2 contrasts ambiguous with complex.
trial_num is trial display order, centered and scaled.
message_type is contrast coded: -.5 if the message names an accessory, +.5 if it names a species.
target_position is dummy coded, with the left position as the reference level (0).
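To make the contrast coding concrete, here is a minimal base-R sketch using the numeric codes from the data preparation step (2/3 vs -1/3 for h1, +.5 vs -.5 for h2). The cell means are made-up numbers chosen only to show what each coefficient picks up; they are not data from either study.

```r
# Contrast codes for the three conditions, as used in the data preparation code.
contr <- cbind(
  h1 = c(ambig = -1/3, complex = -1/3, simple = 2/3),  # simple vs. not-simple
  h2 = c(ambig = 1/2, complex = -1/2, simple = 0)      # ambiguous vs. complex
)

# Both columns are centered, and they are orthogonal to each other.
col_sums <- colSums(contr)
cross_product <- sum(contr[, "h1"] * contr[, "h2"])

# Hypothetical cell means (illustration only), e.g. on the log-odds scale.
cell_means <- c(ambig = 1, complex = 2, simple = 4)

# Solve intercept + h1 * b1 + h2 * b2 = cell mean for each condition.
X <- cbind(intercept = 1, contr)
coefs <- setNames(as.numeric(solve(X, cell_means)), colnames(X))
print(coefs)
# intercept = grand mean of the three cells
# h1        = simple minus the mean of ambiguous and complex
# h2        = ambiguous minus complex
```

With this scaling, each coefficient is a full difference between (groups of) conditions, so a positive h2 means more target selections on ambiguous than on complex trials.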

Key analysis of interest: The primary measures of interest are the coefficients of the two Helmert contrasts. We are interested in whether simple, complex, and ambiguous trials show noticeably different rates of target selection.

Differences from Original Study

While the subject pool will also be MTurk, the population has changed over the past 5 years. This could affect the proportions of participants who use each type of reasoning.

The main analysis is the same, although we additionally test the same model in brms. We switch what we label as the primary measure of interest to reduce the scope of the confirmatory analysis.

Methods Addendum (Post Data Collection)


Actual Sample

Following FD, we recruited 60 participants. We excluded the 9 participants (15%) with the highest error rates (i.e. selection of distractors) on non-ambiguous trials. We also excluded 4 participants (including 1 who had high error rates as well) who did not report English as their native language. Participants were recruited on 4 December 2020.

Differences from pre-data collection methods plan

None.

Results

Data preparation

Data preparation following the analysis plan.

library(tidyverse)

## ── Attaching packages ────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ───────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(here)

## here() starts at /home/vboyce/Research/franke2016

library(lme4)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

library(brms)

## Loading required package: Rcpp

## Loading 'brms' package (version 2.13.5). Useful instructions
## can be found by typing help('brms'). A more detailed introduction
## to the package is available through vignette('brms_overview').

## 
## Attaching package: 'brms'

## The following object is masked from 'package:lme4':
## 
##     ngrps

## The following object is masked from 'package:stats':
## 
##     ar

library(viridis)

## Loading required package: viridisLite

root <- here()
data_location <- paste0(root,"/data")
data_file <- "final.rds"
df <- read_rds(paste0(data_location,"/",data_file))

# Flag the 15% of participants with the highest distractor rate on unambiguous trials
errorful <- df %>% 
  filter(imptype=="unambig") %>% 
  mutate(is.error=ifelse(response=="distractor",1,0)) %>% 
  group_by(participant) %>% 
  summarize(errors=mean(is.error)) %>% 
  arrange(desc(errors)) %>% 
  ungroup() %>% 
  slice_head(prop=.15)

## `summarise()` ungrouping output (override with `.groups` argument)

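# Participants who did not report English as their native language
not_eng <- df %>% 
  select(participant, lang) %>% 
  unique() %>% 
  filter(!lang %in% c("english", "English", "eng", "ENG", "Eng", "ENGLISH"))

# Participants who meet both exclusion criteria (used for reporting only)
both <- errorful %>% inner_join(not_eng, by=c("participant"))

# Drop every participant who meets either criterion
good_df <- df %>% 
  anti_join(errorful, by=c("participant")) %>% 
  anti_join(not_eng, by=c("participant"))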

We excluded the 9 participants (15%) with the highest error rates (i.e. selection of distractors) on non-ambiguous trials. We also excluded 4 participants (including 1 who had high error rates as well) who did not report English as their native language.
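As a quick consistency check on these counts, the arithmetic can be written out directly. The numbers are taken from the text above; the variable names are only for illustration.

```r
n_recruited   <- 60
n_high_error  <- 9   # top 15% by error rate
n_not_english <- 4   # did not report English as native language
n_both        <- 1   # counted in both groups above

# Participants in both groups must only be subtracted once.
n_excluded <- n_high_error + n_not_english - n_both
n_analysed <- n_recruited - n_excluded
print(c(excluded = n_excluded, analysed = n_analysed))
```

This gives 12 exclusions and 48 remaining participants, which matches the 48 subject levels reported in the model output.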

# Keep target/competitor choices from the three conditions that enter the model
prep_model <- good_df %>% 
  filter(response %in% c("target", "competitor")) %>% 
  filter(imptype %in% c("simple", "complex", "ambig")) %>% 
  mutate(selected.target=ifelse(response=="target", 1,0),
         # Helmert contrast 1: simple vs. complex and ambiguous together
         h1.issimple=case_when(
           imptype=="simple"~2/3,
           TRUE~-1/3),
         # Helmert contrast 2: ambiguous vs. complex
         h2.isambig=case_when(
           imptype=="simple"~0,
           imptype=="ambig"~.5,
           imptype=="complex"~-.5
         ),
         # centered and scaled; as.numeric drops the matrix attributes scale() adds
         trial.num=as.numeric(scale(trial)),
         message.type=case_when(
           (str_sub(message,1,1)=="v")~.5,
           TRUE~-.5), #monsters are v1,v2, hats are s1,s2
         target.position=as.factor(pos_target),
         subject=as.factor(participant)) %>% 
  select(selected.target, h1.issimple,h2.isambig,
         trial.num,message.type,target.position,subject)

Confirmatory analysis

Original graph from FD

# Proportion of each choice type within each trial type
for_plot <- good_df %>% 
  filter(imptype %in% c("simple", "complex", "ambig","unambig")) %>% 
  group_by(imptype, response) %>% 
  tally() %>% 
  # values_fill keeps a choice type that never occurs in a condition at 0 rather than NA
  pivot_wider(names_from=response, values_from=n, values_fill=0) %>% 
  mutate(total=competitor+distractor+target,
         competitor=competitor/total,
         distractor=distractor/total,
         target=target/total) %>% 
  select(-total) %>% 
  pivot_longer(cols=c("target","competitor","distractor"),
               names_to="Choice", values_to="proportion")

ggplot(for_plot, aes(x=imptype,y=proportion,fill=factor(Choice, levels=c("target", "competitor", "distractor"))))+
  geom_col(position="dodge")+
  scale_fill_viridis(discrete=T)+
  labs(fill="Choice", y="Proportion of choices")+
  theme_bw()

ggsave("results.png")

## Saving 7 x 5 in image

The graphs look qualitatively similar – the biggest difference is that the replication data is noisier with more clicks on distractors.

The analyses below follow the analysis plan.

freq_model <- glmer(selected.target~h1.issimple+h2.isambig+
                     trial.num+
                     h1.issimple*trial.num+
                     h2.isambig*trial.num+
                     message.type+target.position+
                     (trial.num+message.type|subject),
                   family=binomial,
                   data=prep_model)

## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.0263642 (tol = 0.002, component 1)

summary(freq_model)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: 
## selected.target ~ h1.issimple + h2.isambig + trial.num + h1.issimple *  
##     trial.num + h2.isambig * trial.num + message.type + target.position +  
##     (trial.num + message.type | subject)
##    Data: prep_model
## 
##      AIC      BIC   logLik deviance df.resid 
##   1948.2   2027.7   -959.1   1918.2     1461 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.1695 -0.9535  0.5220  0.8158  1.8431 
## 
## Random effects:
##  Groups  Name         Variance Std.Dev. Corr       
##  subject (Intercept)  0.23147  0.4811              
##          trial.num    0.02921  0.1709    0.43      
##          message.type 0.26633  0.5161   -0.23 -0.91
## Number of obs: 1476, groups:  subject, 48
## 
## Fixed effects:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -0.16610    0.11836  -1.403   0.1605    
## h1.issimple            0.53003    0.11921   4.446 8.75e-06 ***
## h2.isambig            -0.29585    0.14047  -2.106   0.0352 *  
## trial.num             -0.02327    0.06218  -0.374   0.7082    
## message.type           0.23765    0.13600   1.747   0.0806 .  
## target.position2       0.93275    0.14129   6.602 4.06e-11 ***
## target.position3       0.33286    0.13708   2.428   0.0152 *  
## h1.issimple:trial.num -0.01932    0.11979  -0.161   0.8719    
## h2.isambig:trial.num   0.11663    0.14270   0.817   0.4138    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) h1.ssm h2.smb trl.nm mssg.t trgt.2 trgt.3 h1.s:.
## h1.issimple -0.022                                                 
## h2.isambig   0.095 -0.090                                          
## trial.num    0.138  0.025  0.011                                   
## message.typ -0.082  0.012 -0.022 -0.194                            
## targt.pstn2 -0.555  0.034 -0.063 -0.060  0.017                     
## targt.pstn3 -0.564 -0.008 -0.039 -0.036 -0.017  0.474              
## h1.ssmpl:t.  0.003  0.027 -0.006 -0.046  0.059  0.014  0.015       
## h2.smbg:tr. -0.011 -0.001 -0.015  0.082  0.001  0.046 -0.002 -0.074
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0263642 (tol = 0.002, component 1)

priors <- c(prior(normal(0,5), class=b),
            prior(normal(0,5), class=sd),
            prior(lkj(1), class=cor))

br_model <- brm(selected.target~h1.issimple+h2.isambig+
                     trial.num+
                     h1.issimple*trial.num+
                     h2.isambig*trial.num+
                     message.type+target.position+
                     (h1.issimple+h2.isambig+
                     trial.num+
                     h1.issimple*trial.num+
                     h2.isambig*trial.num+
                     message.type+target.position|subject),
                   family=bernoulli,
                   data=prep_model,
                  prior=priors,
                  file="finalmod.Rds")

summary(br_model)

##  Family: bernoulli 
##   Links: mu = logit 
## Formula: selected.target ~ h1.issimple + h2.isambig + trial.num + h1.issimple * trial.num + h2.isambig * trial.num + message.type + target.position + (h1.issimple + h2.isambig + trial.num + h1.issimple * trial.num + h2.isambig * trial.num + message.type + target.position | subject) 
##    Data: prep_model (Number of observations: 1476) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Group-Level Effects: 
## ~subject (Number of levels: 48) 
##                                                 Estimate Est.Error l-95% CI
## sd(Intercept)                                       0.96      0.17     0.64
## sd(h1.issimple)                                     1.23      0.25     0.79
## sd(h2.isambig)                                      0.88      0.27     0.32
## sd(trial.num)                                       0.17      0.11     0.01
## sd(message.type)                                    0.56      0.24     0.08
## sd(target.position2)                                1.19      0.27     0.70
## sd(target.position3)                                1.04      0.25     0.55
## sd(h1.issimple:trial.num)                           0.31      0.20     0.02
## sd(h2.isambig:trial.num)                            0.22      0.17     0.01
## cor(Intercept,h1.issimple)                          0.52      0.16     0.17
## cor(Intercept,h2.isambig)                          -0.24      0.21    -0.63
## cor(h1.issimple,h2.isambig)                         0.03      0.23    -0.41
## cor(Intercept,trial.num)                            0.06      0.28    -0.51
## cor(h1.issimple,trial.num)                         -0.08      0.28    -0.60
## cor(h2.isambig,trial.num)                          -0.11      0.30    -0.66
## cor(Intercept,message.type)                         0.00      0.25    -0.46
## cor(h1.issimple,message.type)                       0.10      0.25    -0.38
## cor(h2.isambig,message.type)                        0.07      0.27    -0.46
## cor(trial.num,message.type)                        -0.11      0.30    -0.66
## cor(Intercept,target.position2)                    -0.48      0.18    -0.76
## cor(h1.issimple,target.position2)                   0.03      0.21    -0.38
## cor(h2.isambig,target.position2)                   -0.10      0.23    -0.53
## cor(trial.num,target.position2)                     0.03      0.29    -0.54
## cor(message.type,target.position2)                 -0.11      0.25    -0.59
## cor(Intercept,target.position3)                    -0.63      0.15    -0.85
## cor(h1.issimple,target.position3)                  -0.14      0.21    -0.56
## cor(h2.isambig,target.position3)                    0.17      0.23    -0.30
## cor(trial.num,target.position3)                    -0.03      0.29    -0.58
## cor(message.type,target.position3)                 -0.13      0.26    -0.62
## cor(target.position2,target.position3)              0.48      0.19     0.04
## cor(Intercept,h1.issimple:trial.num)                0.12      0.29    -0.48
## cor(h1.issimple,h1.issimple:trial.num)              0.04      0.30    -0.54
## cor(h2.isambig,h1.issimple:trial.num)              -0.11      0.30    -0.66
## cor(trial.num,h1.issimple:trial.num)                0.08      0.31    -0.54
## cor(message.type,h1.issimple:trial.num)            -0.12      0.30    -0.67
## cor(target.position2,h1.issimple:trial.num)        -0.02      0.29    -0.58
## cor(target.position3,h1.issimple:trial.num)        -0.10      0.30    -0.64
## cor(Intercept,h2.isambig:trial.num)                 0.03      0.31    -0.54
## cor(h1.issimple,h2.isambig:trial.num)               0.01      0.31    -0.56
## cor(h2.isambig,h2.isambig:trial.num)               -0.12      0.32    -0.68
## cor(trial.num,h2.isambig:trial.num)                 0.01      0.31    -0.57
## cor(message.type,h2.isambig:trial.num)              0.06      0.32    -0.57
## cor(target.position2,h2.isambig:trial.num)          0.04      0.31    -0.55
## cor(target.position3,h2.isambig:trial.num)         -0.01      0.31    -0.59
## cor(h1.issimple:trial.num,h2.isambig:trial.num)     0.00      0.31    -0.59
##                                                 u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)                                       1.30 1.00     1418     1940
## sd(h1.issimple)                                     1.75 1.00     2220     3134
## sd(h2.isambig)                                      1.41 1.00     1333     1315
## sd(trial.num)                                       0.40 1.01     1159     2109
## sd(message.type)                                    1.01 1.00     1090     1192
## sd(target.position2)                                1.74 1.00     1063     1119
## sd(target.position3)                                1.54 1.00     1233     1277
## sd(h1.issimple:trial.num)                           0.75 1.00     1858     2291
## sd(h2.isambig:trial.num)                            0.62 1.00     2224     1824
## cor(Intercept,h1.issimple)                          0.80 1.00     1849     2788
## cor(Intercept,h2.isambig)                           0.17 1.00     2676     2810
## cor(h1.issimple,h2.isambig)                         0.48 1.00     2889     3128
## cor(Intercept,trial.num)                            0.59 1.00     6245     2849
## cor(h1.issimple,trial.num)                          0.48 1.00     5359     3431
## cor(h2.isambig,trial.num)                           0.49 1.00     4388     2904
## cor(Intercept,message.type)                         0.48 1.00     4237     2644
## cor(h1.issimple,message.type)                       0.56 1.00     3775     3249
## cor(h2.isambig,message.type)                        0.60 1.00     2896     2858
## cor(trial.num,message.type)                         0.50 1.00     1866     2861
## cor(Intercept,target.position2)                    -0.08 1.00     1558     1687
## cor(h1.issimple,target.position2)                   0.45 1.00     2324     2775
## cor(h2.isambig,target.position2)                    0.37 1.00     1716     2387
## cor(trial.num,target.position2)                     0.57 1.00     2074     2436
## cor(message.type,target.position2)                  0.39 1.00     2289     3045
## cor(Intercept,target.position3)                    -0.27 1.00     2079     2787
## cor(h1.issimple,target.position3)                   0.28 1.00     2784     3140
## cor(h2.isambig,target.position3)                    0.59 1.00     2158     2527
## cor(trial.num,target.position3)                     0.54 1.00     2011     2813
## cor(message.type,target.position3)                  0.40 1.00     2377     2588
## cor(target.position2,target.position3)              0.79 1.00     1890     2138
## cor(Intercept,h1.issimple:trial.num)                0.65 1.00     6178     3253
## cor(h1.issimple,h1.issimple:trial.num)              0.60 1.00     6720     3235
## cor(h2.isambig,h1.issimple:trial.num)               0.49 1.00     5065     3432
## cor(trial.num,h1.issimple:trial.num)                0.65 1.00     3748     3305
## cor(message.type,h1.issimple:trial.num)             0.48 1.00     3752     3337
## cor(target.position2,h1.issimple:trial.num)         0.55 1.00     4554     3595
## cor(target.position3,h1.issimple:trial.num)         0.51 1.00     4905     3176
## cor(Intercept,h2.isambig:trial.num)                 0.61 1.00     6816     2919
## cor(h1.issimple,h2.isambig:trial.num)               0.59 1.00     7481     3101
## cor(h2.isambig,h2.isambig:trial.num)                0.52 1.00     5187     3068
## cor(trial.num,h2.isambig:trial.num)                 0.59 1.00     4014     3246
## cor(message.type,h2.isambig:trial.num)              0.63 1.00     5091     3012
## cor(target.position2,h2.isambig:trial.num)          0.62 1.00     6013     3438
## cor(target.position3,h2.isambig:trial.num)          0.59 1.00     5543     3276
## cor(h1.issimple:trial.num,h2.isambig:trial.num)     0.61 1.00     2813     3309
## 
## Population-Level Effects: 
##                       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## Intercept                -0.16      0.17    -0.51     0.18 1.00     2374
## h1.issimple               0.70      0.23     0.26     1.16 1.00     2688
## h2.isambig               -0.34      0.21    -0.75     0.06 1.00     4095
## trial.num                -0.03      0.07    -0.17     0.11 1.00     5553
## message.type              0.24      0.16    -0.07     0.55 1.00     4767
## target.position2          1.10      0.24     0.63     1.59 1.00     3319
## target.position3          0.42      0.22    -0.01     0.84 1.00     2946
## h1.issimple:trial.num    -0.01      0.15    -0.31     0.29 1.00     5428
## h2.isambig:trial.num      0.11      0.16    -0.20     0.42 1.00     7329
##                       Tail_ESS
## Intercept                 2739
## h1.issimple               3193
## h2.isambig                2979
## trial.num                 2820
## message.type              3198
## target.position2          2839
## target.position3          2706
## h1.issimple:trial.num     2941
## h2.isambig:trial.num      3003
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

The two critical analyses are the difference between simple implicature and either complex or ambiguous, and the difference between complex and ambiguous.

I fit two models: one following FD’s formulation, fit with glmer from lme4, which failed to converge, and one with the maximal random-effects structure, fit in brms, which converged. I report results from both.

For simple versus harder (complex or ambiguous), FD found \(\beta=1.28\) (SE=.12, p <.0001). In the replication, I found \(\beta=.53\) (SE=.12, p<.00001) with FD’s model structure, and with maximal effects, \(\beta=.70\) (95% CI=[.26,1.16]).

For ambiguous versus complex, FD found \(\beta=.44\) (SE=.13, p < .001). In the replication, this variable was coded in the opposite direction (positive means more target selections on ambiguous trials), and I found \(\beta=-.29\) (SE=.14, p=.035), and with maximal effects, \(\beta=-.34\) (95% CI = [-.75, .06]). Accounting for the reversed coding, the results are in the same direction.
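The sign flip can be made explicit with a short sketch that re-expresses the replication estimates in FD’s direction and converts them to odds ratios. The coefficients are copied from the text and model output above. Reading exp(β) as a complex-versus-ambiguous odds ratio assumes a one-unit difference in the contrast spans the two conditions; that holds for the replication’s ±.5 coding, but I have not confirmed FD’s scaling.

```r
# Coefficients on the log-odds scale, copied from the text above.
fd_h2        <- 0.44      # FD: positive = more target choices on complex trials
rep_glmer_h2 <- -0.29585  # replication: positive = more target choices on ambiguous trials
rep_brms_h2  <- -0.34
rep_brms_ci  <- c(lower = -0.75, upper = 0.06)

# Flip the sign to express the replication estimates in FD's direction.
rep_glmer_h2_fd <- -rep_glmer_h2
rep_brms_h2_fd  <- -rep_brms_h2
# Negating an interval also swaps its endpoints.
rep_brms_ci_fd <- c(lower = -rep_brms_ci[["upper"]],
                    upper = -rep_brms_ci[["lower"]])

# Odds ratios for complex vs. ambiguous (scaling assumption noted above).
odds_ratios <- exp(c(FD = fd_h2, glmer = rep_glmer_h2_fd, brms = rep_brms_h2_fd))
print(round(odds_ratios, 2))
print(rep_brms_ci_fd)
```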

Discussion

Summary of Replication Attempt

The primary result of FD was that participants are more likely to select the target in simple implicature trials than in harder (complex or ambiguous) trials, and more likely to select the target in complex than in ambiguous trials. This continuum of simple > complex > ambiguous is modelled as two Helmert coded variables. FD found significant effects for each difference, with a stronger difference for simple versus harder.

I replicate both results, although the effect sizes are smaller. For simple versus harder, the difference is clearly greater than 0 in the replication (the CI excludes 0, p < .01). For ambiguous versus complex, the evidence is weaker: the CI includes 0, though only narrowly, and .01 < p < .05.
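For reference, the z values and p values in the glmer table follow from the estimates and standard errors, and approximate Wald intervals can be computed the same way. This is only a sketch: the Wald interval is a normal approximation that was not part of the analysis plan, and the glmer fit it is based on did not converge.

```r
# Estimates and standard errors copied from the glmer summary above.
est <- c(h1.issimple = 0.53003, h2.isambig = -0.29585)
se  <- c(h1.issimple = 0.11921, h2.isambig = 0.14047)

z <- est / se              # Wald z statistics
p <- 2 * pnorm(-abs(z))    # two-sided p values

# Approximate 95% Wald intervals: estimate +/- 1.96 standard errors.
crit <- qnorm(0.975)
wald_ci <- cbind(lower = est - crit * se, upper = est + crit * se)
print(round(cbind(z = z, p = p, wald_ci), 4))
```

Under this approximation the glmer interval for h2.isambig lies just below 0, while the brms credible interval from the maximal model narrowly includes 0, which is the sense in which the evidence for the second contrast is weaker.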

Commentary

Overall, I think the conclusion is that the results hold, but that the participants in the replication were less careful. The design is gameable in that selecting referents more sloppily (and thus faster) is advantageous. The original study excluded the 15% of participants with the highest error rates (>5%). I also excluded the 15% with the highest error rates, but these all had >20% errors. Between this and the difference in distractor selections (see graphs), it seems that the replication participants were less diligent and thus gave lower-quality data. This noise could be the reason for the smaller effect sizes. Despite this, the overall pattern of results is still clear.