Replication of Online Experiment by Mirault, Snell, and Grainger (2018, Psychological Science)

Introduction

In their paper, Mirault et al. (2018) demonstrate a novel transposed word effect in speeded sentence grammaticality judgments. They constructed base sentences that were either grammatical or ungrammatical, and then derived test sentences from them by transposing two words, finding that ungrammaticality judgments for transposed word sentences required longer reaction times and had higher error rates when the base sentences were grammatical. The authors suggested that these results demonstrate some uncertainty in word order encoding as well as parallel processing of words.

This transposed word effect has important implications in psycholinguistics, particularly regarding models of sentence processing and grammar. Furthermore, since the effect is a relatively novel phenomenon, it should be verified through replications and reproductions in order to demonstrate its robustness. In particular, the original study was conducted in French, which has agreement (in person, number, and gender) among verbs, articles, adjectives, and nouns. This contrasts with English, which has much more limited agreement (only between subjects and verbs). Additionally, French has certain word orders which are much less common or not possible in English (e.g. adjectives after nouns, direct object before verb). Replicating the effect in English would thus demonstrate that the effect is not solely due to the particular agreement or word order characteristics of French, but can generalise across languages.

The materials needed to replicate this study include the test sentences (which serve as stimuli for the speeded grammaticality judgment task), as well as the specific instructions given to participants. The former are available on the paper’s OSF site; however, the latter will have to be obtained by communicating with the authors. The original experiment was conducted online using Java, and a similar implementation (using JavaScript) on a crowdsourcing platform will be adopted for the replication study.

Since the materials, procedures, and analyses have been described in Mirault et al.’s paper to quite a good degree of detail, a reimplementation of the experimental and analytical designs is likely to be relatively straightforward as long as the specific instructions can be obtained from the authors. The most important challenge is the development of English test stimuli that are sufficiently close in construction to the French test stimuli in order to avoid the effects of improper stimuli.

The repository for this project is hosted on GitHub.

Methods

Power Analysis

I reproduced the data processing and analysis methods using the original data so that the mixed-effects model could be directly used to estimate power by simulation.

### Power analysis

#### Import data
orig_dir <- "../original_paper/original_data/"
orig_files <- grep(".txt$", list.files(path = orig_dir), value = TRUE)

orig_data_df <- data.frame()
orig_data_list <- list()

for (file_no in seq_along(orig_files)) {
  new_orig_data <- scan(file.path(orig_dir, orig_files[file_no]), sep = "@", what = "character", quiet = TRUE) 
  new_orig_data_df <- as.data.frame(do.call(rbind, strsplit(new_orig_data, split = ";")))
  names(new_orig_data_df) <- c("trial", "item_no", "start_time", "end_time", "response", "condition", "group")
  new_orig_data_df$participant <- file_no
  orig_data_list[[file_no]] <- new_orig_data_df[-161,]
}
orig_data_df <- bind_rows(orig_data_list) 

#### Pre-process data
substring(orig_data_df$start_time, 9, 9) <- "." # handle time format
substring(orig_data_df$end_time, 9, 9) <- "."

orig_data_df %<>% mutate(rt = 1000 * difftime(paste("2026-01-01", .$end_time), 
                                              paste("2026-01-01", .$start_time)))
attributes(orig_data_df$rt)$units = "milliseconds"
orig_data_df$rt %<>% round(., 0)

orig_data_df %<>% mutate(correct = ((.$condition > 0 & .$response == 0) | 
                                      (.$condition == 0 & .$response == 1)))

#### Convert data into correct format
orig_data_df$rt %<>% as.numeric()
orig_data_df$correct %<>% as.logical()
orig_data_df$condition <- as_factor(recode(orig_data_df$condition, 
                                           "0" = "gram", "1" = "transword", "2" = "control"))
orig_data_df$item_no %<>% as_factor()
orig_data_df$participant %<>% as_factor()

#### Data exclusion
orig_summary_tbl <- orig_data_df %>% 
  group_by(participant) %>%
  summarise(avg_rt = mean(rt, na.rm = TRUE), accuracy = mean(correct, na.rm = TRUE), 
            .groups = 'drop')

orig_summary_tbl %<>%
  mutate(acc_exclude = (accuracy < 0.5)) %>%
  mutate(rt_z = scale(avg_rt)) %>%
  mutate(rt_exclude = (abs(rt_z) > 2.5))

orig_summary_tbl_exclude <- orig_summary_tbl %>%
  filter(acc_exclude == TRUE | rt_exclude == TRUE)

orig_data_df %<>% filter(rt != 0)

orig_data_df_filtered <- orig_data_df %>%
  ##  note: original authors did not exclude any ppts, but according to their exclusion criteria they should 
  ##  have excluded one ppt (mean RT > 2.5 SDs from mean)
  filter(!(participant %in% orig_summary_tbl_exclude$participant))

After pre-processing the data, I noticed that the authors should have excluded one participant based on their exclusion criterion (mean RT > 2.5 standard deviations from mean); however, it was unclear whether this participant was or was not excluded. I thus attempted to reproduce the original findings and run a power analysis on both the excluded and unexcluded data.

keyEffects <- function(data_df, caption_text) {
  ## note: the implementation of this function was modified slightly from the pre-registered version to
  ##       rectify an error in the code
  
  #### LME model preparation
  data_df$condition %<>% relevel(ref = "control")
  
  data_df_ungram <- data_df %>%
    filter(condition != "gram")
  
  data_df_correct <- data_df_ungram %>%
    filter(correct) 
  
  rt_model <- lmer((-1000 / rt) ~ condition + (1|item_no) + (1|participant),
                   data = data_df_correct, na.action = na.exclude,
                   control = lmerControl(check.conv.singular = .makeCC(action = "ignore", tol = 1e-4)))
  
  print(kable(summary(rt_model)[["coefficients"]], 
              caption = paste("Reaction times model, ", caption_text, sep = "")) %>% 
          kable_styling())
  cat("\n")
  
  #### Accuracy GLME
  accuracy_model <- glmer((1 - correct) ~ condition + (1|item_no) + (1|participant),
                          data = data_df_ungram, family = 'binomial',
                          control = glmerControl(check.conv.singular = 
                                                   .makeCC(action = "ignore", tol = 1e-4)))
  
  print(kable(summary(accuracy_model)[["coefficients"]], 
              caption = paste("Error rates model, ", caption_text, sep = "")) %>% 
          kable_styling())
  
  return(list(rt_model, accuracy_model))
}

models_included <- keyEffects(orig_data_df, "original data (no exclusion)")

Reaction times model, original data (no exclusion)
	Estimate	Std. Error	t value
(Intercept)	-0.6577697	0.0168549	-39.025438
conditiontransword	0.0354053	0.0038261	9.253683

Error rates model, original data (no exclusion)
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-4.381941	0.1899201	-23.07254	0
conditiontransword	1.773705	0.1159017	15.30353	0

models_excluded <- keyEffects(orig_data_df_filtered, "original data (with exclusion)")

Reaction times model, original data (with exclusion)
	Estimate	Std. Error	t value
(Intercept)	-0.6615511	0.0165610	-39.946245
conditiontransword	0.0351070	0.0038522	9.113568

Error rates model, original data (with exclusion)
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-4.374100	0.1903021	-22.98503	0
conditiontransword	1.765465	0.1161164	15.20426	0

powerAnalysis <- function(rt_model, caption_text) {
  #### Post-hoc power analysis
  power <- powerSim(rt_model, nsim = 200, progress = FALSE)
  print(power)
  
  #### Power curve plotting
  pwr_crv <- powerCurve(rt_model, along = "participant", nsim = 50, progress = FALSE)
  pwr_crv_summary <- summary(pwr_crv)
  pwr_levels = c(0.8, 0.9, 0.95) * summary(power)$mean
  
  pwr_crv_plot <- ggplot(pwr_crv_summary, aes(x = nlevels, y = mean * 100)) +
    geom_errorbar(aes(ymin = lower * 100, ymax = upper * 100), width = 1.2, colour = "#0097a7") +
    geom_line(colour = "#0097a7") +
    coord_cartesian(xlim = c(0, 100), ylim = c(0, 100)) +
    geom_hline(yintercept = pwr_levels * 100, linetype = "dashed", colour = "#555555") +
    annotate("text", x = 5, y = pwr_levels[1:2] * 100 - 2.5, label = c("80% original", "90% original"), 
             colour = "#555555") +
    annotate("text", x = 5, y = pwr_levels[3] * 100 + 2.7, label = "95% original", colour = "#555555") + 
    labs(x = "Number of participants", y = "Estimated power by simulation (%)") + 
    ggtitle(paste("Power curve of ", caption_text, sep = ""))
  
  plot(pwr_crv_plot)
  
  return(list(power, pwr_crv, pwr_crv_plot))
}

power_analysis_included <- powerAnalysis(models_included[[1]], "original data (no exclusion)")

## Power for predictor 'condition', (95% confidence interval):
##       100.0% (98.17, 100.0)
## 
## Test: Likelihood ratio
## 
## Based on 200 simulations, (0 warnings, 0 errors)
## alpha = 0.05, nrow = 6970
## 
## Time elapsed: 0 h 2 m 16 s
## 
## nb: result might be an observed power calculation

power_analysis_excluded <- powerAnalysis(models_excluded[[1]], "original data (with exclusion)")

## Power for predictor 'condition', (95% confidence interval):
##       100.0% (98.17, 100.0)
## 
## Test: Likelihood ratio
## 
## Based on 200 simulations, (0 warnings, 0 errors)
## alpha = 0.05, nrow = 6896
## 
## Time elapsed: 0 h 2 m 16 s
## 
## nb: result might be an observed power calculation

# use these to display relevant info if you do not wish to rerun the simulations
# print(power_analysis_included[[1]])
# print(power_analysis_excluded[[1]])
# plot(power_analysis_included[[3]])
# plot(power_analysis_excluded[[3]])

Based on the power analysis, the power of the original study is 100%. As such, 23 participants are needed to achieve the same power as the original study (assuming exclusion).

Planned Sample

The planned sample size is 25, which can achieve 100% of the power of the original study while allowing for some buffer (for exclusions). The participants are US residents recruited on Amazon MTurk.

Materials

As the present replication involves a different language from the original study, new stimuli were created in English according to the specifications of the original study. Pairs of five-word grammatical base sentences were constructed, largely using translated lexical items from the original French, modifying where necessary to avoid semantic oddity, to minimize repetitions of lexical items, and to produce sentences that were amenable to the subsequent manipulations. The last words of each pair were then swapped to produce corresponding pairs of ungrammatical base sentences. Finally, the third and fourth words of each base sentence were transposed to form the test sequences, such that the transposed words were of different grammatical categories. Following the original authors, I will refer to the two conditions as the transposed-word condition (derived from a grammatical base sentence) and the control condition (derived from an ungrammatical base sentence). These are illustrated as follows:

Examples of base sentences and test sequences
Sequence	Example
Base
Grammatical	You should really leave now.
	Her handsome neighbor is moving.
Ungrammatical	You should really leave moving.
	Her handsome neighbor is now.
Test
Transposed-word	You should leave really now.
	Her handsome is neighbor moving.
Control	You should leave really moving.
	Her handsome is neighbor now.

The remainder of the stimulus construction followed the original study: 160 ungrammatical test sequences were constructed from 80 grammatical and 80 ungrammatical base sentences. These were distributed into two lists, such that participants only saw half of each type of test sequence, and did not experience “repetition of sequences containing the same words”. Additionally, “each participant also saw 80 grammatically correct sentences that were not related in any way to the base sequences used to generate the ungrammatical test sequences.”

The full set of English stimuli can be found here. In this file, items 1–40 have transposed-word sequences in List 1 and control sequences in List 2, items 41–80 have control sequences in List 1 and transposed-word sequences in List 2, and items 81–160 are grammatically correct sentences identical in both lists.

Procedure

The procedure from the original paper was followed closely: “Participants were instructed to decide as rapidly and as accurately as possible whether the sequence of words was grammatically correct. On each trial, a fixation cross was displayed on the center of the screen during a random time ranging between 500 and 700 ms, followed by the stimulus (a five-word sequence) centered on the screen. The distance between the central fixation cross and the first letter of the sequence varied between 8° and 18° of visual angle as a function of the length of the five-word sequence. The word sequence remained on screen until response. After this, a feedback dot was presented for 700 ms, in green if the response was correct or in red if the response was incorrect.”

Additional detail regarding online presentation was as follows: “Stimuli were presented online using Java protocol on the personal computer of the participant. Sentences were presented in 30-point mono-spaced font (Droid Sans Mono) in black on a white background. Participants were asked to sit about 60 cm from the monitor, such that 1 cm equaled approximately 1° of visual angle. Participants responded using their index fingers with two arrows on the computer keyboard: right for grammatical decisions and left for ungrammatical decisions.”

A few minor deviations from the abovementioned procedure were adopted. Firstly, the experiment was coded using JavaScript (relying on the jsPsych library) instead of Java. Secondly, two practice questions were given to ensure that participants were familiar with the paradigm before commencing. Thirdly, the experiment was divided into four blocks comprising 40 pseudorandomised questions each to alleviate the load on sustained attention. Finally, the keypresses were changed to ‘F’ and ‘J’ instead of the arrow keys to ensure that participants were using both index fingers, rather than two fingers on their right hand. These are unlikely to result in a substantial difference in results as they do not affect the main experimental manipulations.

The experiment can be found here, and the code for the experiment can be found here. The methods and analysis plan were preregisted on OSF.

Analysis Plan

As with the original paper, “response times (RT; the time between onset of stimulus presentation and participant’s response) for correct responses and response accuracy” were analyzed. Similarly, prior to analysis, RTs were also inverse-transformed (-1000/RT) to normalize the distribution.

Exclusion criteria

The exclusion criteria used in the original paper were:

Low overall accuracy (no specific cutoff), and
RTs beyond 2.5 standard deviations from the grand mean.

These were adopted for the present replication, with an accuracy cutoff specified at 50% (i.e. at chance), such that participants who performed worse than chance were excluded. A third exclusion criterion of incomplete responses was also added to account for the potential technical difficulties of working on a crowdsourcing platform.

Key analyses of interest

The key analyses of the original paper were:

Linear mixed-effects (LME) model to analyze RTs, and
Generalized (logistic) linear mixed-effects (GLME) model to analyze accuracy.

These effects were considered to be reliable if |t| (for LME) or |z| (for GLME) were greater than 1.96. Furthermore, the original authors “used the maximal random structure model that converged …, and this included by-participant and by-item random intercepts in all analyses”. These analyses were also adopted for the present replication.

Differences from Original Study

In summary, two key methodological differences exist between the original study and the present replication. These are the recruitment of participants on MTurk (instead of volunteers), and the use of novel English stimuli (instead of the original French). The former may influence the results to a small extent due to motivation differences, but it is unlikely to result in a significant change in the result, considering that the study involves a response time task. The latter, however, is much more likely to affect the result. Whether the present replication succeeds or fails would thus depend strongly on whether the transposed-word effect is generalizable to English.

Actual Sample

A total of 25 US residents were recruited from Amazon MTurk. One participant was excluded due to unusually high RTs (> 3.22 standard deviations from the grand mean). The remaining 24 participants were retained for subsequent analyses. The age and gender distributions are as follows:

#### Import data
data_dir <- "../data"
files <- grep(".csv$", list.files(data_dir), value = TRUE)

repl_data_df <- data.frame()
repl_data_list <- list()

for (file_no in seq_along(files)) {
  new_data_df <- read.csv(file.path(data_dir, files[file_no]))
  repl_data_list[[file_no]] <- new_data_df
}
repl_data_df <- bind_rows(repl_data_list) 

#### Get demographic data
repl_data_dem <- repl_data_df %>%
  filter(trial_type == "survey-html-form") %>%
  select(c(uid, list, responses)) %>%
  mutate(age = as.numeric(sub("({\"age\":\")(\\d\\d)(\",.*)", "\\2", responses, perl = TRUE)),
         gender = sub("(.*\"gender\":\")(.*)(\"})", "\\2", responses, perl = TRUE))

age_tbl <- unclass(summary(repl_data_dem$age)) %>% 
  kable(col.names = c("Age")) %>% 
  kable_styling()
gen_tbl <- table(repl_data_dem$gender) %>% 
  kable(col.names = c("Gender", "Count")) %>% 
  kable_styling()
kables(list(age_tbl, gen_tbl), caption = "Demographic data of replication experiment") %>% 
  kable_styling()

Demographic data of replication experiment

	Age
Min.	25.00
1st Qu.	31.00
Median	36.00
Mean	39.72
3rd Qu.	43.00
Max.	69.00

Gender	Count
female	12
male	11
other	1
pnts	1

Differences from pre-data collection methods plan

None.

Results

Data preparation

The data were first prepared by extracting the relevant trials and implementing the exclusion criteria.

### Data Preparation

#### Filter trial data
repl_data_df %<>% 
  filter(grepl("<p class='stimulus'>", .$stimulus, perl = TRUE)) %>%
  select(-participant) %>%
  rename(participant = uid)

repl_data_df$stimulus %<>%
  sub("<p class='stimulus'>", "", ., perl = TRUE) %>%
  sub("</p>", "", ., perl = TRUE)

repl_data_df %<>% 
  filter(trial_index > 6) # remove practice trials

#### Convert data into correct format
repl_data_df$participant %<>% as_factor()
repl_data_df$rt %<>% as.numeric()
repl_data_df$correct %<>% as.logical()
repl_data_df$condition %<>% as_factor()

#### Data exclusion
repl_summary_tbl <- repl_data_df %>% 
  group_by(participant) %>%
  summarise(avg_rt = mean(rt, na.rm = TRUE), accuracy = mean(correct, na.rm = TRUE), 
            .groups = 'drop')

repl_summary_tbl %<>%
  mutate(acc_exclude = (accuracy < 0.5)) %>%
  mutate(rt_z = scale(avg_rt)) %>%
  mutate(rt_exclude = (abs(rt_z) > 2.5))

repl_summary_tbl_exclude <- repl_summary_tbl %>%
  filter(acc_exclude == TRUE | rt_exclude == TRUE)

repl_data_df_filtered <- repl_data_df %>%
  filter(!(participant %in% repl_summary_tbl_exclude$participant))

Confirmatory analysis

An LME was then used to analyse the RT data, and a GLME was used to analyse the accuracy data.

### Confirmatory analysis
models <- keyEffects(repl_data_df_filtered, "replication data")

Reaction times model, replication data
	Estimate	Std. Error	t value
(Intercept)	-0.7737624	0.0328020	-23.588910
conditiontransword	0.0548801	0.0136369	4.024379

Error rates model, replication data
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-4.568485	0.4185565	-10.914858	0
conditiontransword	1.981450	0.3070603	6.452967	0

The original paper found that “participants were significantly slower at classifying transposed-word sequences as being ungrammatical compared with the control sequences, b = 0.03, SE = 0.00, t = 8.57, and they made significantly more errors in the transposed-word condition than in the control condition, b = 1.77, SE = 0.11, z = 15.30.” These results demonstrate a similar effect, albeit with slightly smaller t and z values, which is to be expected due to the smaller number of participants.

#### Generate summary tables
se <- function(x) sd(x) / sqrt(length(x))

color_scale <- c("#0097a7", "#e06666", "#93c47d")

make_summary <- function(data_df, is_acc) {
  if (is_acc) {
    summary_df <- data_df %>%
      group_by(condition, .add = TRUE) %>%
      summarise(mean_err = 100 * (1 - mean(correct, na.rm = TRUE)),
                err_sem = se(100 * (1 - na.omit(correct))), .groups = "drop")
  } else {
    summary_df = data_df %>% 
      group_by(condition, .add = TRUE) %>%
      summarise(mean_rt = mean(rt, na.rm = TRUE), rt_sem = se(na.omit(rt)), .groups = "drop")
  }
  return(summary_df)
}

orig_rt_summary <- make_summary(orig_data_df, 0)
orig_acc_summary <- make_summary(orig_data_df, 1)

repl_rt_summary <- make_summary(repl_data_df_filtered, 0)
repl_acc_summary <- make_summary(repl_data_df_filtered, 1)

#### Generate summary plots
plot_summary <- function(summary_df, is_acc) {
  summary_df %<>%
    mutate(condition = fct_relevel(condition, "transword", "control", "gram"))
  
  mean_var <- ifelse(is_acc, as.name("mean_err"), as.name("mean_rt"))
  sem_var <- ifelse(is_acc, as.name("err_sem"), as.name("rt_sem"))
  y_lab <- ifelse(is_acc, "Error rate (%)", "Response time (ms)")
  
  summary_plot <- ggplot(summary_df, aes(x = condition, y = !!mean_var, fill = condition)) + 
    geom_bar(position = "dodge", stat = "identity", width = .6) +
    labs(x = "Condition", y = y_lab) +
    scale_fill_manual(values = color_scale) + 
    theme(legend.position = "none") +
    scale_x_discrete(labels = c("TW", "Control", "Grammatical")) +
    geom_errorbar(aes(ymin = !!mean_var - 2 * !!sem_var, ymax = !!mean_var + 2 * !!sem_var), 
                  width = .2, position = position_dodge(.9))
}

orig_rt_plot <- plot_summary(orig_rt_summary, 0) + 
  ggtitle("Original results") +
  ylim(c(0, 2100))
orig_acc_plot <- plot_summary(orig_acc_summary, 1) + 
  ggtitle("Original results") +
  ylim(c(0, 17.5))

repl_rt_plot <- plot_summary(repl_rt_summary, 0) + 
  ggtitle("Replication results") +
  ylim(c(0, 2100))
repl_acc_plot <- plot_summary(repl_acc_summary, 1) + 
  ggtitle("Replication results") +
  ylim(c(0, 17.5))

#### RT
plot_grid(orig_rt_plot, repl_rt_plot)

#### Accuracy
plot_grid(orig_acc_plot, repl_acc_plot)

Exploratory analyses

The authors also conducted post-hoc analyses on the effects of word length and word class (closed vs open), finding that there were significant interaction effects between condition and each of those variables, such that the transposed-word effect was larger for shorter and closed-class words. They also noted that the transposed-word effect remained significant for longer and open-class words. The authors’ material contained insufficient information regarding their exact analytic workflow for these analyses; however, their R script suggested that they may have used a median split for word length, and they also noted that they coded word class as “closed” as long as either of the transposed words were closed-class. Their results are as follows:

orig_class <- data.frame("Estimate" = c(0.72, 0.08, 0.07, 0.06),
                         "Std. Error" = c(0.01, 0.00, 0.01, 0.00),
                         "t value" = c(38.06, 9.23, 6.23, 6.19),
                         "Estimate" = c(4.09, 2.76, 0.18, 1.50),
                         "Std. Error" = c(0.32, 0.22, 0.35, 0.26),
                         "z value" = c(12.44, 12.23, 0.51, 5.74))
rownames(orig_class) <- c("(Intercept)", "Condition", "Closed class", "Interaction")
colnames(orig_class) <- c("Estimate", "Std. Error", "t value", "Estimate", "Std. Error", "z value")
kable(orig_class, caption = "Mixed-effects model on original data with effect of word class") %>%
  kable_styling() %>%
  add_header_above(c(" " = 1, "RT" = 3, "Error rate" = 3))

Mixed-effects model on original data with effect of word class
	RT			Error rate
	Estimate	Std. Error	t value	Estimate	Std. Error	z value
(Intercept)	0.72	0.01	38.06	4.09	0.32	12.44
Condition	0.08	0.00	9.23	2.76	0.22	12.23
Closed class	0.07	0.01	6.23	0.18	0.35	0.51
Interaction	0.06	0.00	6.19	1.50	0.26	5.74

orig_len <- data.frame("Estimate" = c(0.76, 0.11, 0.02, 0.01),
                       "Std. Error" = c(0.02, 0.01, 0.00, 0.00),
                       "t value" = c(30.87, 7.80, 5.70, 5.82),
                       "Estimate" = c(3.72, 3.36, 0.00, 0.38),
                       "Std. Error" = c(0.38, 0.41, 0.07, 0.08),
                       "z value" = c(9.78, 8.12, 0.01, 4.59))
rownames(orig_len) <- c("(Intercept)", "Condition", "Avg length", "Interaction")
colnames(orig_len) <- c("Estimate", "Std. Error", "t value", "Estimate", "Std. Error", "z value")
kable(orig_len, caption = "Mixed-effects model on original data with effect of word length") %>%
  kable_styling() %>%
  add_header_above(c(" " = 1, "RT" = 3, "Error rate" = 3))

Mixed-effects model on original data with effect of word length
	RT			Error rate
	Estimate	Std. Error	t value	Estimate	Std. Error	z value
(Intercept)	0.76	0.02	30.87	3.72	0.38	9.78
Condition	0.11	0.01	7.80	3.36	0.41	8.12
Avg length	0.02	0.00	5.70	0.00	0.07	0.01
Interaction	0.01	0.00	5.82	0.38	0.08	4.59

To improve precision, I implemented these variables using a continuous length variable (mean length of transposed words), and coded the number of closed-class words (out of the two transposed words; defined as determiners, prepositions, pronouns, modals, auxiliaries, and conjunctions) as opposed to whether there was at least one closed-class word or not. Furthermore, both of these variables are correlated with word frequency, and I thus included an additional model using the log mean frequency of transposed words as a predictor. In all of these cases, the word-related variables were centered to avoid effects of multicollinearity among predictors. The covariate data can be found here.

stimuli_data <- read.csv("stimuli_data.csv") %>%
  select(c(item_no, sequence, condition, list, group, 
           freq_avg, closed_sum, len_avg)) %>%
  mutate(log_freq = log(freq_avg)) %>%
  mutate(log_freq_ctr = log_freq - mean(log_freq),
         closed_sum_ctr = closed_sum - mean(closed_sum),
         len_avg_ctr = len_avg - mean(len_avg))

stimuli_filtered <- stimuli_data %>%
  select(c(log_freq_ctr, closed_sum_ctr, len_avg_ctr))

ea_names <- c("Log freq (ctr)", "Closed class (ctr)", "Avg length (ctr)")
ea_correl <- as.data.frame(cor(stimuli_filtered))
rownames(ea_correl) <- ea_names
colnames(ea_correl) <- ea_names
kable(ea_correl, caption = "Covariate correlation matrix") %>% kable_styling()

Covariate correlation matrix
	Log freq (ctr)	Closed class (ctr)	Avg length (ctr)
Log freq (ctr)	1.0000000	0.7577199	-0.5895974
Closed class (ctr)	0.7577199	1.0000000	-0.5680547
Avg length (ctr)	-0.5895974	-0.5680547	1.0000000

ea_data_df <- left_join(repl_data_df_filtered, stimuli_data, by = c("item_no", "condition")) %>%
  filter(group != 0)

ea_data_df_ungram <- ea_data_df %>%
  filter(condition != "gram")

ea_data_df_correct <- ea_data_df_ungram %>%
  filter(correct)

ea_vars <- c(as.name("log_freq_ctr"), as.name("closed_sum_ctr"), as.name("len_avg_ctr"))
ea_rt_model <- list()
ea_acc_model <- list()

for (i in 1:3) {
  this_var <- ea_vars[[i]]
  ea_rt_model[[i]] <- lmer((-1000 / rt) ~ (condition * eval(this_var)) + (1|item_no) + (1|participant),
                           data = ea_data_df_correct, na.action = na.exclude,
                           control = lmerControl(check.conv.singular = .makeCC(action = "ignore", tol = 1e-4)))
  ea_acc_model[[i]] <- glmer((1 - correct) ~ (condition * eval(this_var)) + (1|item_no) + (1|participant),
                             data = ea_data_df_ungram, family = 'binomial',
                             control = glmerControl(check.conv.singular = 
                                                      .makeCC(action = "ignore", tol = 1e-4)))
  ea_rt_summary_tbl <- as.data.frame(summary(ea_rt_model[[i]])[["coefficients"]]) %>% select(c(1:3))
  ea_acc_summary_tbl <- as.data.frame(summary(ea_acc_model[[i]])[["coefficients"]]) %>% select(c(1:3))
  ea_this_summary_tbl <- cbind(ea_rt_summary_tbl, ea_acc_summary_tbl)
  row.names(ea_this_summary_tbl) <- c("(Intercept)", "Condition", ea_names[i], "Interaction")
  print(kable(ea_this_summary_tbl, caption = 
                paste("Mixed-effects model on replication data with effect of ", 
                      tolower(ea_names[i]), sep = "")) %>%
          kable_styling() %>%
          add_header_above(c(" " = 1, "RT" = 3, "Error rate" = 3)))
  cat("\n")
}

Mixed-effects model on replication data with effect of log freq (ctr)
	RT			Error rate
	Estimate	Std. Error	t value	Estimate	Std. Error	z value
(Intercept)	-0.7836729	0.0330785	-23.691302	-4.4210723	0.4130928	-10.7023711
Condition	0.0566637	0.0147306	3.846657	2.0580809	0.3125533	6.5847357
Log freq (ctr)	-0.0115479	0.0043103	-2.679161	0.0631848	0.1014094	0.6230671
Interaction	0.0040583	0.0065844	0.616355	0.1366729	0.1338243	1.0212861

Mixed-effects model on replication data with effect of closed class (ctr)
	RT			Error rate
	Estimate	Std. Error	t value	Estimate	Std. Error	z value
(Intercept)	-0.7863604	0.0331863	-23.695341	-4.3750687	0.4140346	-10.5669159
Condition	0.0579149	0.0153672	3.768727	2.0768457	0.3169703	6.5521776
Closed class (ctr)	-0.0477551	0.0170104	-2.807406	0.3532343	0.4103104	0.8608954
Interaction	0.0179503	0.0264678	0.678195	0.5112644	0.5284995	0.9673886

Mixed-effects model on replication data with effect of avg length (ctr)
	RT			Error rate
	Estimate	Std. Error	t value	Estimate	Std. Error	z value
(Intercept)	-0.7778076	0.0327606	-23.7421319	-4.3667610	0.4026928	-10.843903
Condition	0.0506628	0.0138384	3.6610318	1.9784485	0.2851311	6.938733
Avg length (ctr)	0.0140793	0.0081879	1.7195344	-0.2680015	0.1991952	-1.345422
Interaction	0.0085404	0.0120802	0.7069751	-0.2871821	0.2466773	-1.164201

There are three key observations to note. Firstly, none of the interaction effects were significant, unlike the original study. Secondly, the main effect of word length did not reach significance for reaction times. Finally, the main effect of word class was in the opposite direction in comparison with the original paper. There does not seem to be a straightforward explanation for why this is the case, although perhaps language plays some role; nonetheless, one key replicated result is the fact that condition still has a significant main effect even when the effects of these other covariates are partialled out.

Discussion

Summary of Replication Attempt

As found in the original paper, participants were slower at deciding that transposed-word sequences are ungrammatical in comparison with control sequences, and also made more errors for transposed-word sequences in comparison with control sequences. While the t and z values are not as large in magnitude in comparison with the original, this is somewhat expected due to the reduced sample size. Given that the effects are in the same direction and of a similar magnitude with the original study, the results did replicate successfully.

Summary of Exploratory Analyses

Similarly, condition was found to be a significant predictor even when word length and word class were accounted for; this was also the case for word frequency. Additionally, for RTs, word frequency and word class contributed a significant main effect, while word length did not. In contrast, none of these variables were significant predictors for accuracy. Interestingly, unlike the original study, none of these variables were involved in a significant interaction effect either. More research that systematically investigates the role of these other variables across various languages may be insightful in determining the relationship among these variables and the transposed-word effect.

Commentary

Overall, the replication of this study using English materials demonstrates that the transposed-word effect is robust and applicable cross-linguistically, as it does not depend solely upon the agreement and word order characteristics of French. To my knowledge, this is the first demonstration of the transposed-word effect in English, and the data from this replication can contribute to the discussion around word and sentence processing. The senior author’s comments were also helpful in directing the implementation of the replication, although as he noted, the details were unlikely to matter as the effect seems to be quite large; this was borne out by the successful replication despite the change in language and the smaller sample size. This replication also demonstrated the value of good open science practices, and the original authors’ conscientiousness were a major contributor to the success of the present replication.