Why replicate this: Franke & Degen (2016, henceforth FD) is a paper in the Rational Speech Acts (RSA) literature that tests an assumption underlying much RSA work: whether speakers and listeners are close to uniform in their level of pragmatic reasoning, or whether the population is diverse, with some people employing higher levels of reasoning than others. I’m interested in starting to work on RSA models of pragmatic language use, and replicating this paper would provide a good introduction to the required analysis. Study 1 and Study 2 in the paper are complementary: Study 1 tests comprehension and Study 2 tests production, with parallel materials and analyses. The original paper found stronger evidence of heterogeneity in comprehension, so I’ll replicate Study 1. (I’d also strongly consider replicating Study 2 and the supplemental norming study beyond the confines of the class.)
Stimuli & procedures: In the study universe, there are 3 types of creatures (purple monster, green monster, robot) and 3 accessories (red hat, blue hat, scarf). There are “words” (represented as images) for purple monster, green monster, red hat, and blue hat. On each trial of the study (original available), participants see one of the 4 words and an array of three images (creatures with accessories), and choose which image the word refers to. The experiment is written in JavaScript and all the source code is available. After checking with the original authors, I plan to use their source code and rehost the experiment using the original images, text, and experimental items.
repo and original paper
FD recruited 60 participants, and analysed data from 51 post-exclusions. I’ll use the same sample size, depending on how power analysis goes.
(Sample size subject to change pending power analysis results.)
Following FD, we recruited 60 participants. We excluded the 9 participants (15%) with the highest error rates (i.e. selection of distractors) on non-ambiguous trials. We also excluded XX participants (including N who had high error rates as well) who did not report English as their native language. Participants were recruited on DATE.
Materials were exactly the same as in the original experiment. The materials description from FD is quoted below.
" Materials. Participants saw 66 experimental trials, which were composed of 24 critical and 42 filler trials. Of the 24 critical trials, 12 constituted a simple implicature situation and 12 a complex one (as shown in Section 1). Stimuli were created by randomly sampling a message and then generating a grid of three objects—a target, a competitor, and a distractor— following different constraints in different conditions.
On simple implicature trials, the target was generated by combining the feature denoted by the sampled message with the inexpressible feature along the other feature dimension. For example, if the sampled message was one of “red hat” or “blue hat”, the target was a robot with the respective hat. If instead the sampled message was “purple monster” or “green monster”, the target was the respective monster with a scarf. The competitor was generated by combining the feature denoted by the sampled message with a randomly sampled expressible feature along the other feature dimension. For example, if the sampled message was “red hat”, the competitor could be a green monster with a red hat or a purple monster with a red hat. The distractor was generated by combining two features that were randomly sampled from the set of features that did not contain those features already present in target and competitor. For example, if the target was a robot with a red hat and the competitor was a green monster with a red hat, the distractor could be a purple monster with either a scarf or a blue hat.
On complex implicature trials, the target was generated by combining the feature denoted by the sampled message with an expressible feature along the other feature dimension. For example, if the sampled message was “green monster” the target could be a green monster with a red hat. The competitor was generated by combining the feature denoted by the sampled message with the remaining expressible feature along the other feature dimension. Continuing our example, the competitor would then be a green monster with a blue hat. The distractor was generated by combining the target feature that was not denoted by the sampled message (red hat) with the inexpressible feature along the other feature dimension (robot).
Of the 42 filler trials, 24 used the displays from the implicature conditions but the target was a) the competitor from the simple condition (six trials), b) the distractor from the simple condition (six trials), or c) the competitor from the complex condition (12 trials), as identified unambiguously by the trigger message. This was also intended to prevent learning associations of display type with the target. On the other 18 filler trials, the target was either entirely unambiguous or entirely ambiguous given the message. That is, there was either only one object with the feature denoted by the trigger message, or there were two identical objects that were equally viable target candidates. Unambiguous and ambiguous fillers were included as baselines to compare behavior on implicature trials to. Ambiguous fillers establish how often the target could be chosen by chance, while unambiguous fillers establish the upper bound on target choices. We did not include filler items where the target was the distractor from the complex condition, because this would have required participants to draw a one-step inference to identify the target. Trial order as well as target, competitor, and distractor order were randomized. "
The procedure was exactly the same as in FD. To quote their procedure,
" Procedure. Participants engaged in a referential comprehension task. On each trial they saw three objects on a display. Each object differed systematically along two dimensions: its ontological kind (robot, green monster, purple monster) and accessory (scarf, blue hat, red hat). In addition to these three objects, participants saw a pictorial message that they were told was sent to them by a previous participant whose goal was to get them to pick out one of these three objects. They were told that the previous participant was allowed to send a message expressing only one feature of a given object, and that the messages the participant could send were furthermore restricted to monsters and hats (i.e., there were no messages for referring to the robot or scarf feature; we refer to these features as inexpressible features). The four expressible features were visible to participants at the bottom of the display on every trial[…]
Participants initially completed four speaker trials. They saw three objects, one of which was highlighted with a yellow rectangle. Participants were asked to click on one of four pictorial messages to send to another Mechanical Turk worker to get them to pick out the highlighted object. They were told that the other worker did not know which object was highlighted but knew which messages could be sent. The four speaker trials contained three unambiguous and one ambiguous trial which could function as fillers in the main experiment."
Data exclusions: The only data exclusions performed were at the participant level, as described above.
Confirmatory analyses: At the group level, the question of interest is how rates of picking the target, competitor, and distractor vary between the trial types (ambiguous filler, unambiguous filler, simple implicature, complex implicature). We visualize this difference and fit a logistic regression. FD included only as much mixed-effects structure as would converge. We use the model structure they settled on, fit with glmer from lme4. As a secondary check, we fit the full mixed-effects structure in brms.
Logistic regression (glmer): selected_target ~ condition_h1 + condition_h2 + trial_num + condition_h1*trial_num + condition_h2*trial_num + message_type + target_position + (trial_num + message_type | subject)
Logistic regression (brms): selected_target ~ condition_h1 + condition_h2 + trial_num + condition_h1*trial_num + condition_h2*trial_num + message_type + target_position + (condition_h1 + condition_h2 + trial_num + condition_h1*trial_num + condition_h2*trial_num + message_type + target_position | subject)
The primary measures of interest are the coefficients of the two Helmert contrasts. We are interested in whether the simple, complex, and ambiguous cases show noticeably different rates of target selection.
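To make the two contrast coefficients concrete, here is a minimal sketch of what they estimate. The contrast codes are the ones used in this report's data preparation. The three log-odds values are made up purely for illustration and are not estimates from FD or from the replication.

```r
# Contrast codes as used in this report's data preparation.
codes <- data.frame(
  imptype = c("simple", "complex", "ambig"),
  h1 = c(2/3, -1/3, -1/3),  # simple vs. harder (complex or ambiguous)
  h2 = c(0, -0.5, 0.5)      # ambiguous vs. complex (positive = ambiguous)
)

# Hypothetical log-odds of choosing the target in each condition
# (illustrative values only, not real data).
codes$logodds <- c(1.0, 0.2, -0.1)

# With three conditions and three parameters the fit is exact,
# so the coefficients can be read off directly.
fit <- lm(logodds ~ h1 + h2, data = codes)
coef(fit)

# The same quantities computed by hand:
simple_minus_harder <- 1.0 - mean(c(0.2, -0.1))  # matches the h1 coefficient
ambig_minus_complex <- -0.1 - 0.2                # matches the h2 coefficient
```

Under this coding, the h1 coefficient is the simple condition minus the average of the two harder conditions, the h2 coefficient is ambiguous minus complex, and the intercept is the unweighted mean of the three conditions.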
While the subject pool will again be MTurk, the population has changed over the past 5 years. This could affect the proportions of participants who use each type of reasoning.
The main analysis is the same, although we additionally test the same model in brms. We switch what we label as the primary measure of interest to reduce the scope of the confirmatory analysis.
Following FD, we recruited 60 participants. We excluded the 9 participants (15%) with the highest error rates (i.e. selection of distractors) on non-ambiguous trials. We also excluded 4 participants (including 1 who had high error rates as well) who did not report English as their native language. Participants were recruited on 4 December 2020.
None.
Data preparation following the analysis plan.
library(tidyverse)
## ── Attaching packages ────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(here)
## here() starts at /home/vboyce/Research/franke2016
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
library(brms)
## Loading required package: Rcpp
## Loading 'brms' package (version 2.13.5). Useful instructions
## can be found by typing help('brms'). A more detailed introduction
## to the package is available through vignette('brms_overview').
##
## Attaching package: 'brms'
## The following object is masked from 'package:lme4':
##
## ngrps
## The following object is masked from 'package:stats':
##
## ar
library(viridis)
## Loading required package: viridisLite
root <- here()
data_location <- paste0(root,"/data")
data_file <- "final.rds"
df <- read_rds(paste0(data_location,"/",data_file))
# Flag the 15% of participants with the highest distractor rate on unambiguous trials.
errorful <- df %>% filter(imptype=="unambig") %>%
  mutate(is.error=ifelse(response=="distractor",1,0)) %>%
  group_by(participant) %>%
  summarize(errors=mean(is.error)) %>%
  arrange(desc(errors)) %>%
  ungroup() %>%
  slice_head(prop=.15)
## `summarise()` ungrouping output (override with `.groups` argument)
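# Participants whose reported language is not one of the accepted spellings of English.
not_eng <- df %>% select(participant, lang) %>%
  unique() %>%
  filter(!lang %in% c("english", "English", "eng", "ENG", "Eng", "ENGLISH"))
# Participants flagged by both criteria.
both <- errorful %>% inner_join(not_eng, by=c("participant"))
# Drop anyone flagged by either criterion.
good_df <- df %>% anti_join(errorful, by=c("participant")) %>%
  anti_join(not_eng, by=c("participant"))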
We excluded the 9 participants (15%) with the highest error rates (i.e. selection of distractors) on non-ambiguous trials. We also excluded 4 participants (including 1 who had high error rates as well) who did not report English as their native language.
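Because one participant met both exclusion criteria, the two counts cannot simply be added. The sketch below walks through the arithmetic with made-up participant IDs (the real IDs are not shown in this report). Only the counts 60, 9, 4, and 1 come from the text above.

```r
# Hypothetical participant IDs 1..60 (illustrative; not the real IDs).
all_ids <- 1:60
# Assumed: the 9 participants flagged for the highest error rates.
errorful_ids <- 1:9
# Assumed: 4 participants who did not report English, 1 of whom (ID 9)
# is also in errorful_ids.
not_eng_ids <- c(9, 20, 35, 50)

both_ids <- intersect(errorful_ids, not_eng_ids)  # flagged by both criteria
excluded_ids <- union(errorful_ids, not_eng_ids)  # flagged by either criterion
kept_ids <- setdiff(all_ids, excluded_ids)        # retained for analysis

length(both_ids)      # 1
length(excluded_ids)  # 12 = 9 + 4 - 1
length(kept_ids)      # 48 = 60 - 12
```

The 48 retained participants match the number of subject groups reported in the model summaries.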
# Keep target/competitor responses on the three trial types entering the model,
# then build the contrast-coded predictors.
prep_model <- good_df %>% filter(response %in% c("target", "competitor")) %>%
  filter(imptype %in% c("simple", "complex", "ambig")) %>%
  mutate(selected.target=ifelse(response=="target", 1,0),
         # Helmert contrast 1: simple vs. harder (complex or ambiguous)
         h1.issimple=case_when(
           imptype=="simple"~2/3,
           TRUE~-1/3),
         # Helmert contrast 2: ambiguous vs. complex (positive = ambiguous)
         h2.isambig=case_when(
           imptype=="simple"~0,
           imptype=="ambig"~.5,
           imptype=="complex"~-.5
         ),
         trial.num=scale(trial), # centered and scaled trial number
         message.type=case_when(
           (str_sub(message,1,1)=="v")~.5,
           TRUE~-.5), #monsters are v1,v2, hats are s1,s2
         target.position=as.factor(pos_target),
         subject=as.factor(participant)) %>%
  select(selected.target, h1.issimple,h2.isambig,
         trial.num,message.type,target.position,subject)
Original graph from FD
for_plot <- good_df %>% filter(imptype %in% c("simple", "complex", "ambig","unambig")) %>%
group_by(imptype, response) %>%
tally() %>%
pivot_wider(names_from=response, values_from=n) %>%
mutate(total=competitor+distractor+target,
competitor=competitor/total,
distractor=distractor/total,
target=target/total) %>%
select(-total) %>%
pivot_longer(cols=c("target","competitor","distractor"),names_to="Choice",values_to="count")
ggplot(for_plot, aes(x=imptype,y=count,fill=factor(Choice, levels=c("target", "competitor", "distractor"))))+
geom_col(position="dodge")+
scale_fill_viridis(discrete=T)+
labs(fill="Choice")+
theme_bw()
ggsave("results.png")
## Saving 7 x 5 in image
The graphs look qualitatively similar – the biggest difference is that the replication data is noisier with more clicks on distractors.
The analyses as specified in the analysis plan.
freq_model <- glmer(selected.target~h1.issimple+h2.isambig+
trial.num+
h1.issimple*trial.num+
h2.isambig*trial.num+
message.type+target.position+
(trial.num+message.type|subject),
family=binomial,
data=prep_model)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.0263642 (tol = 0.002, component 1)
summary(freq_model)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula:
## selected.target ~ h1.issimple + h2.isambig + trial.num + h1.issimple *
## trial.num + h2.isambig * trial.num + message.type + target.position +
## (trial.num + message.type | subject)
## Data: prep_model
##
## AIC BIC logLik deviance df.resid
## 1948.2 2027.7 -959.1 1918.2 1461
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.1695 -0.9535 0.5220 0.8158 1.8431
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 0.23147 0.4811
## trial.num 0.02921 0.1709 0.43
## message.type 0.26633 0.5161 -0.23 -0.91
## Number of obs: 1476, groups: subject, 48
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.16610 0.11836 -1.403 0.1605
## h1.issimple 0.53003 0.11921 4.446 8.75e-06 ***
## h2.isambig -0.29585 0.14047 -2.106 0.0352 *
## trial.num -0.02327 0.06218 -0.374 0.7082
## message.type 0.23765 0.13600 1.747 0.0806 .
## target.position2 0.93275 0.14129 6.602 4.06e-11 ***
## target.position3 0.33286 0.13708 2.428 0.0152 *
## h1.issimple:trial.num -0.01932 0.11979 -0.161 0.8719
## h2.isambig:trial.num 0.11663 0.14270 0.817 0.4138
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) h1.ssm h2.smb trl.nm mssg.t trgt.2 trgt.3 h1.s:.
## h1.issimple -0.022
## h2.isambig 0.095 -0.090
## trial.num 0.138 0.025 0.011
## message.typ -0.082 0.012 -0.022 -0.194
## targt.pstn2 -0.555 0.034 -0.063 -0.060 0.017
## targt.pstn3 -0.564 -0.008 -0.039 -0.036 -0.017 0.474
## h1.ssmpl:t. 0.003 0.027 -0.006 -0.046 0.059 0.014 0.015
## h2.smbg:tr. -0.011 -0.001 -0.015 0.082 0.001 0.046 -0.002 -0.074
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0263642 (tol = 0.002, component 1)
priors <- c(prior(normal(0,5), class=b),
prior(normal(0,5), class=sd),
prior(lkj(1), class=cor))
br_model <- brm(selected.target~h1.issimple+h2.isambig+
trial.num+
h1.issimple*trial.num+
h2.isambig*trial.num+
message.type+target.position+
(h1.issimple+h2.isambig+
trial.num+
h1.issimple*trial.num+
h2.isambig*trial.num+
message.type+target.position|subject),
family=bernoulli,
data=prep_model,
prior=priors,
file="finalmod.Rds")
summary(br_model)
## Family: bernoulli
## Links: mu = logit
## Formula: selected.target ~ h1.issimple + h2.isambig + trial.num + h1.issimple * trial.num + h2.isambig * trial.num + message.type + target.position + (h1.issimple + h2.isambig + trial.num + h1.issimple * trial.num + h2.isambig * trial.num + message.type + target.position | subject)
## Data: prep_model (Number of observations: 1476)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Group-Level Effects:
## ~subject (Number of levels: 48)
## Estimate Est.Error l-95% CI
## sd(Intercept) 0.96 0.17 0.64
## sd(h1.issimple) 1.23 0.25 0.79
## sd(h2.isambig) 0.88 0.27 0.32
## sd(trial.num) 0.17 0.11 0.01
## sd(message.type) 0.56 0.24 0.08
## sd(target.position2) 1.19 0.27 0.70
## sd(target.position3) 1.04 0.25 0.55
## sd(h1.issimple:trial.num) 0.31 0.20 0.02
## sd(h2.isambig:trial.num) 0.22 0.17 0.01
## cor(Intercept,h1.issimple) 0.52 0.16 0.17
## cor(Intercept,h2.isambig) -0.24 0.21 -0.63
## cor(h1.issimple,h2.isambig) 0.03 0.23 -0.41
## cor(Intercept,trial.num) 0.06 0.28 -0.51
## cor(h1.issimple,trial.num) -0.08 0.28 -0.60
## cor(h2.isambig,trial.num) -0.11 0.30 -0.66
## cor(Intercept,message.type) 0.00 0.25 -0.46
## cor(h1.issimple,message.type) 0.10 0.25 -0.38
## cor(h2.isambig,message.type) 0.07 0.27 -0.46
## cor(trial.num,message.type) -0.11 0.30 -0.66
## cor(Intercept,target.position2) -0.48 0.18 -0.76
## cor(h1.issimple,target.position2) 0.03 0.21 -0.38
## cor(h2.isambig,target.position2) -0.10 0.23 -0.53
## cor(trial.num,target.position2) 0.03 0.29 -0.54
## cor(message.type,target.position2) -0.11 0.25 -0.59
## cor(Intercept,target.position3) -0.63 0.15 -0.85
## cor(h1.issimple,target.position3) -0.14 0.21 -0.56
## cor(h2.isambig,target.position3) 0.17 0.23 -0.30
## cor(trial.num,target.position3) -0.03 0.29 -0.58
## cor(message.type,target.position3) -0.13 0.26 -0.62
## cor(target.position2,target.position3) 0.48 0.19 0.04
## cor(Intercept,h1.issimple:trial.num) 0.12 0.29 -0.48
## cor(h1.issimple,h1.issimple:trial.num) 0.04 0.30 -0.54
## cor(h2.isambig,h1.issimple:trial.num) -0.11 0.30 -0.66
## cor(trial.num,h1.issimple:trial.num) 0.08 0.31 -0.54
## cor(message.type,h1.issimple:trial.num) -0.12 0.30 -0.67
## cor(target.position2,h1.issimple:trial.num) -0.02 0.29 -0.58
## cor(target.position3,h1.issimple:trial.num) -0.10 0.30 -0.64
## cor(Intercept,h2.isambig:trial.num) 0.03 0.31 -0.54
## cor(h1.issimple,h2.isambig:trial.num) 0.01 0.31 -0.56
## cor(h2.isambig,h2.isambig:trial.num) -0.12 0.32 -0.68
## cor(trial.num,h2.isambig:trial.num) 0.01 0.31 -0.57
## cor(message.type,h2.isambig:trial.num) 0.06 0.32 -0.57
## cor(target.position2,h2.isambig:trial.num) 0.04 0.31 -0.55
## cor(target.position3,h2.isambig:trial.num) -0.01 0.31 -0.59
## cor(h1.issimple:trial.num,h2.isambig:trial.num) 0.00 0.31 -0.59
## u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 1.30 1.00 1418 1940
## sd(h1.issimple) 1.75 1.00 2220 3134
## sd(h2.isambig) 1.41 1.00 1333 1315
## sd(trial.num) 0.40 1.01 1159 2109
## sd(message.type) 1.01 1.00 1090 1192
## sd(target.position2) 1.74 1.00 1063 1119
## sd(target.position3) 1.54 1.00 1233 1277
## sd(h1.issimple:trial.num) 0.75 1.00 1858 2291
## sd(h2.isambig:trial.num) 0.62 1.00 2224 1824
## cor(Intercept,h1.issimple) 0.80 1.00 1849 2788
## cor(Intercept,h2.isambig) 0.17 1.00 2676 2810
## cor(h1.issimple,h2.isambig) 0.48 1.00 2889 3128
## cor(Intercept,trial.num) 0.59 1.00 6245 2849
## cor(h1.issimple,trial.num) 0.48 1.00 5359 3431
## cor(h2.isambig,trial.num) 0.49 1.00 4388 2904
## cor(Intercept,message.type) 0.48 1.00 4237 2644
## cor(h1.issimple,message.type) 0.56 1.00 3775 3249
## cor(h2.isambig,message.type) 0.60 1.00 2896 2858
## cor(trial.num,message.type) 0.50 1.00 1866 2861
## cor(Intercept,target.position2) -0.08 1.00 1558 1687
## cor(h1.issimple,target.position2) 0.45 1.00 2324 2775
## cor(h2.isambig,target.position2) 0.37 1.00 1716 2387
## cor(trial.num,target.position2) 0.57 1.00 2074 2436
## cor(message.type,target.position2) 0.39 1.00 2289 3045
## cor(Intercept,target.position3) -0.27 1.00 2079 2787
## cor(h1.issimple,target.position3) 0.28 1.00 2784 3140
## cor(h2.isambig,target.position3) 0.59 1.00 2158 2527
## cor(trial.num,target.position3) 0.54 1.00 2011 2813
## cor(message.type,target.position3) 0.40 1.00 2377 2588
## cor(target.position2,target.position3) 0.79 1.00 1890 2138
## cor(Intercept,h1.issimple:trial.num) 0.65 1.00 6178 3253
## cor(h1.issimple,h1.issimple:trial.num) 0.60 1.00 6720 3235
## cor(h2.isambig,h1.issimple:trial.num) 0.49 1.00 5065 3432
## cor(trial.num,h1.issimple:trial.num) 0.65 1.00 3748 3305
## cor(message.type,h1.issimple:trial.num) 0.48 1.00 3752 3337
## cor(target.position2,h1.issimple:trial.num) 0.55 1.00 4554 3595
## cor(target.position3,h1.issimple:trial.num) 0.51 1.00 4905 3176
## cor(Intercept,h2.isambig:trial.num) 0.61 1.00 6816 2919
## cor(h1.issimple,h2.isambig:trial.num) 0.59 1.00 7481 3101
## cor(h2.isambig,h2.isambig:trial.num) 0.52 1.00 5187 3068
## cor(trial.num,h2.isambig:trial.num) 0.59 1.00 4014 3246
## cor(message.type,h2.isambig:trial.num) 0.63 1.00 5091 3012
## cor(target.position2,h2.isambig:trial.num) 0.62 1.00 6013 3438
## cor(target.position3,h2.isambig:trial.num) 0.59 1.00 5543 3276
## cor(h1.issimple:trial.num,h2.isambig:trial.num) 0.61 1.00 2813 3309
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## Intercept -0.16 0.17 -0.51 0.18 1.00 2374
## h1.issimple 0.70 0.23 0.26 1.16 1.00 2688
## h2.isambig -0.34 0.21 -0.75 0.06 1.00 4095
## trial.num -0.03 0.07 -0.17 0.11 1.00 5553
## message.type 0.24 0.16 -0.07 0.55 1.00 4767
## target.position2 1.10 0.24 0.63 1.59 1.00 3319
## target.position3 0.42 0.22 -0.01 0.84 1.00 2946
## h1.issimple:trial.num -0.01 0.15 -0.31 0.29 1.00 5428
## h2.isambig:trial.num 0.11 0.16 -0.20 0.42 1.00 7329
## Tail_ESS
## Intercept 2739
## h1.issimple 3193
## h2.isambig 2979
## trial.num 2820
## message.type 3198
## target.position2 2839
## target.position3 2706
## h1.issimple:trial.num 2941
## h2.isambig:trial.num 3003
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
The two critical analyses are the difference between simple implicature and either complex or ambiguous, and the difference between complex and ambiguous.
I fit two models: one following FD’s specification using glmer (lme4), which failed to converge, and one with the maximal random-effects structure in brms, which converged. I report results from each.
For simple versus harder, FD found \(\beta=1.28\) (SE=.12, p<.0001). In the replication, I found \(\beta=.53\) (SE=.12, p<.00001) with FD’s model structure, and \(\beta=.70\) (95% CI=[.26, 1.16]) with maximal random effects.
For ambiguous versus complex, FD found \(\beta=.44\) (SE=.13, p<.001). In the replication, I coded this variable in the opposite direction (positive means more target selections on ambiguous trials) and found \(\beta=-.29\) (SE=.14, p=.035) with FD’s model structure, and \(\beta=-.34\) (95% CI=[-.75, .06]) with maximal random effects. Once the reversed coding is taken into account, the results are in the same direction.
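As a reading aid (not part of the planned analysis), the coefficients above can be put on the odds-ratio scale, with the sign of the replication's second contrast flipped to match FD's direction. This sketch assumes FD's contrasts were scaled the same way as the replication's apart from that sign, which this report does not verify.

```r
# Coefficients (log-odds) as reported in the text above.
simple_vs_harder <- c(FD = 1.28, glmer = 0.53, brms = 0.70)

# The replication coded the second contrast with the opposite sign,
# so negate it to express it as complex minus ambiguous, as in FD.
complex_vs_ambig <- c(FD = 0.44, glmer = -(-0.29), brms = -(-0.34))

# exp() turns a difference in log-odds into an odds ratio.
odds_ratios <- rbind(
  simple_vs_harder = exp(simple_vs_harder),
  complex_vs_ambig = exp(complex_vs_ambig)
)
round(odds_ratios, 2)
```

An odds ratio above 1 in every cell indicates effects in the same direction as FD. A smaller ratio in the replication columns corresponds to the smaller effect sizes described here.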
The primary result of FD was that participants were more likely to select the target in simple implicature trials than in harder (complex or ambiguous) trials, and more likely to select the target in complex than in ambiguous trials. This continuum of simple > complex > ambiguous is modelled as two Helmert-coded variables. They found significant effects for each contrast, with a larger effect for simple versus harder.
I replicate both results, although the effect sizes are smaller. For simple versus harder, the difference is clearly greater than 0 in the replication (the brms 95% CI excludes 0, and p < .01 in the glmer model). For ambiguous versus complex, the evidence is weaker: the brms 95% CI includes 0, though only narrowly, and .01 < p < .05 in the glmer model.
Overall, I think the conclusion is that the effects are real, but the participants in the replication were less careful. The design is gameable in that selecting referents more sloppily (and thus faster) is advantageous. The original study excluded the 15% of participants with high error rates (>5%). I also excluded the 15% with the highest error rates, but these all had >20% errors. Between this and the difference in distractor selections (see graphs), it seems that the replication participants were less diligent and thus gave lower-quality data. This noise could be the reason for the smaller effect sizes. Despite this, the overall pattern of results is still clear.