Replication of - Why do children make mirror errors in reading? Neural correlates of mirror invariance in the visual word form area- by Dehaene et al. (2009, NeuroImage)

Author

Neha Rajagopalan (Email ID: neharaj@stanford.edu)

Published

June 21, 2024

Introduction

Justification

As a PhD student in Developmental and Psychological Sciences at the Graduate School of Education, I am constantly thinking about methods to study underlying neural and behavioral changes that occur during learning in children and adults. Given my interests, I believe that the current chosen topic aligns perfectly with my goal to expand knowledge in education psychology and learn appropriate scientific reporting of results. The paper by Dehaene et al. discusses the disappearance of the “mirror effect” (i.e. writing words from right to left while children learn to read and write) in adults and further investigates a higher presence of mirror generalization for pictures than words through a behavioral task conducted in an fMRI scanner. In the past, I have explored literature in auditory processing and narratives for learning. Through this rescue project, however, I intend to be introduced to experiment design, priming effects and analysis metrics used in visual mechanism studies.

Stimuli and Procedure

The behavioral task (called the “same-different” task) involves presenting 14 french words, 14 Japanese characters, 14 pictures of tools, 14 pictures of faces, and 14 unknown scripts (all black and white) as visual stimuli along with their left-right reversed mirror images. Each trial displays two images from the same category at a fixation position with 200ms of presentation and 300ms of inter-stimulus interval. The participant responds with “same” to the pair of images if they are physically identical (eg. same word both in normal orientation or mirrored) and “different” if they are either unrelated (eg. two different words from the same category) or differently oriented (eg. word followed by mirrored orientation).

Challenges

The target population is adults with a mean age of 23 years. Originally, the study is restricted to 26 participants. I believe that the analysis may require much more data to make generalizable conclusions on mirror invariance in adults and this might be difficult given the time constraint for data collection. Some other challenges include deployment of the experiment paradigm and adding breaks between trials (block structure) to reduce cognitive load for participants.

Links to repository and original paper

Link to github repository
Link to original paper in repository

Links to replication project repository and experiment paradigm:

Link to replication project repository
Link to experiment paradigm

Link to current replication study experiment paradigm and Pilot A data analysis:

Link to experiment paradigm
Link to pilot data analysis script

Comments and feedback on pilot A data and analysis:

Pilot A data was collected on 2 non-naive participants.
The data analysis pipeline coded below works on the collected pilot data.
Feedback for pilot A:
1. Request to increase the time of presentation of stimuli to be able to process the image, especially for words.
2. The target buttons ‘s’ and ‘d’ are next to each other. This was helpful to quickly respond but also was the reason for error. If one button for ‘same’ target was to the right of the keyboard and the other target button for ‘different’ was on the left side of the keyboard, the usage of right and left hand to touch the target buttons would be helpful to differentiate between the two targets and reduce error in response.
I was unable to implement the 10 block structure mentioned in the Control section of the document. This was an addition to the original replication and may or may not be helpful for the current replication. It will be implemented in further iterations.

Summary of prior replication attempt

Based on a preliminary comparison between the original report and the 1st replication, there are no differences in sample size, method and analysis. Since the original study included an fMRI task, the study was conducted in-person as opposed to the replication study that collected behavioral data online. Due to this difference, participants did not have any exposure to same-different task in the replication study while they did during the original study. The original study had a final mean age of 23 years across participants but the replication study had a mean age of 36 years. The replication attempt was only partially successful since the author obtained results that were both consistent and inconsistent with the original report for different analysis tests.

Methods

Power Analysis

Original effect size, power analysis for samples to achieve 80%, 90%, 95% power to detect that effect size. Considerations of feasibility for selecting planned sample size.

How much power does your planned sample have for original effect? For an attenuated effect that is half the size of the original?

(If power analysis is not possible or precise, discuss more fully how you determined a sample size that would be sufficient for rescue.)

Planned Sample

The rescue project will increase the sample size to 40 participants. The mean age of participants will also be modified to 23 years similar to the original study.

Materials

“All stimuli were presented in black-and-white, and occupied similar locations on screen (approximate width and height : 2° × 2° for Japanese characters and faces; 1.5–4° × 1.5–4° for tools, depending on their compactness and vertical or horizontal main axis; 0.8° × 2.3° for French words). Several precautions were taken to ensure that the task required view-point invariant recognition and could not be performed using simple short-cuts. All stimuli were selected so that they were clearly asymmetrical and maximally distinct from their mirror images. In particular, the faces were not front views, but were viewed and lit from an angle intermediate between profile and front view. Likewise, the Japanese characters were presented in a curvy font (“HG Sei-Kaisho-Tai”) so that they did not contain any vertical or horizontal bars that would be identical after left–right inversion. The French words had an even number of non-repeated letters, so that no letter was repeated at the same location in a word and its mirror-image. Finally, the French words were made of lower-case letters b, d, i, l, m, n, o, p, q, u, v, x, and were presented in an 20-point Arial font, slightly modified so that the above letters were exactly symmetrical on screen. As a result, even in mirror-image the words appeared as alphabetical strings made of essentially normal letters (non-French readers could not easily tell that they were not French words). A similar manipulation was not possible with Japanese characters, but we selected characters made of strokes that did not seem artificial once reversed (non-Japanese readers could not easily tell that these were not Japanese characters). French and Japanese words were matched on frequency (mean Log10 frequency = 1.14 versus 0.90, n.s.).” - Pg 1846

Procedure

“The stimuli for the behavioral same-different task, performed after fMRI, were 14 French words, 14 Japanese characters, 14 pictures of tools, 14 pictures of faces, 14 unknown script stimuli, and their corresponding left–right reversed mirror images. On each trial, two stimuli from the same category were successively presented at the fixation (200 ms presentation of each image, 300 ms inter-stimulus interval with fixation cross). The participant’s task was to decide whether the two stimuli depicted the same object, possibly in mirror- form. Thus, the participants had to respond “same” both to physically identical stimuli (1/4 of trials) and to mirror images (1/4 of trials). They had to respond “different” whenever the stimuli were unrelated, whether they were in the same orientation (e.g. two normally oriented words; two faces in the same orientation; 1/4 of trials), or whether they were in different orientations (e.g. one word followed by a mirror image of a word). The first stimulus, drawn from one of the five categories, was always in standard orientation, and the second stimulus was defined by a 2 × 2 factorial design with factors of identity (same or different object) and orientation (same or different left– right orientation). This design defined a list of 14 × 5 × 2 × 2 = 280 trials, which were run once in random order.” - Pg 1846

Controls

The trials will be split into 10 blocks with a break between each block to avoid fatigue and loss of attention. If the participant takes more than 2 seconds to respond there could either be an exclusion criteria or a message requesting participants to pay attention and displaying a substitute trial.

Analysis Plan

Similar to the original study and the replication report, the following analysis plan will be conducted:

“The following analysis plan will be followed from the study: Median correct response times were analyzed using an ANOVA with group as a between-subjects factor and stimulus category, repetition (same or different image) and orientation of the first image (normal or mirror; the second image was always in normal orientation) as within-subject factors. I will perform an ANOVA test to examine means of correct median response times between the five categories (French and Japanese words, tools, faces, and false fonts).”

Differences from Original Study and 1st replication

The original study does not mention any methods of exclusion criteria or insertion of attention checks, whereas the replication study added the exclusion criteria since it was an online study (as opposed to the original study that was conducted in-person). Participants in the replication study did not have any exposure to same-different task while they did during the original study. The original study had a final mean age of 23 years across participants but the replication study had a mean age of 36 years. Apart from these, there were no significant differences in the original study and the replication study. The results were however not completely consistent. One ANOVA analysis was replicable whereas an additional second analysis was not.

Methods Addendum (Post Data Collection)

You can comment this section out prior to final report with data collection.

Actual Sample

Sample size, demographics, data exclusions based on rules spelled out in analysis plan

Differences from pre-data collection methods plan

Any differences from what was described as the original plan, or “none”.

Results

Data preparation

The replication report cleaned the raw data (added a column) by clearly indicating whether the stimuli pairs were mirror imaged or normally oriented.Preprocessing of data was completed prior to the anlaysis to obtain required rows and columns for the mirror invariance task.

### Data Preparation


#### Load Relevant Libraries and Functions
library("tidyverse")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library("dplyr")

#### Import data
df.participant1 = read_csv("PilotA_task/PilotA_Data/pilotA_data.csv") %>%
  slice(18:n())

Rows: 1140 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): failed_images, failed_audio, failed_video, trial_type, internal_no...
dbl  (2): trial_index, time_elapsed
lgl  (2): success, timeout

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

df.participant2 = read_csv("PilotA_task/PilotA_Data/pilotA_data1.csv") %>%
  slice(18:n())

Rows: 1140 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): failed_images, failed_audio, failed_video, trial_type, internal_no...
dbl  (2): trial_index, time_elapsed
lgl  (2): success, timeout

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

df.participant = rbind(df.participant1, df.participant2)
df.participant

# A tibble: 2,246 × 16
   success timeout failed_images failed_audio failed_video trial_type           
   <lgl>   <lgl>   <chr>         <chr>        <chr>        <chr>                
 1 NA      NA      <NA>          <NA>         <NA>         image-keyboard-respo…
 2 NA      NA      <NA>          <NA>         <NA>         html-keyboard-respon…
 3 NA      NA      <NA>          <NA>         <NA>         image-keyboard-respo…
 4 NA      NA      <NA>          <NA>         <NA>         html-keyboard-respon…
 5 NA      NA      <NA>          <NA>         <NA>         image-keyboard-respo…
 6 NA      NA      <NA>          <NA>         <NA>         html-keyboard-respon…
 7 NA      NA      <NA>          <NA>         <NA>         image-keyboard-respo…
 8 NA      NA      <NA>          <NA>         <NA>         html-keyboard-respon…
 9 NA      NA      <NA>          <NA>         <NA>         image-keyboard-respo…
10 NA      NA      <NA>          <NA>         <NA>         html-keyboard-respon…
# ℹ 2,236 more rows
# ℹ 10 more variables: trial_index <dbl>, time_elapsed <dbl>,
#   internal_node_id <chr>, view_history <chr>, rt <chr>, stimulus <chr>,
#   response <chr>, task <chr>, correct_response <chr>, category <chr>

#### Data exclusion / filtering

df.response = df.participant %>%
  select(rt, stimulus, response, task, correct_response, category) %>%
  filter(!is.na(correct_response)) %>%
  select(-c(stimulus))

df.response

# A tibble: 560 × 5
   rt    response task     correct_response category
   <chr> <chr>    <chr>    <chr>            <chr>   
 1 570   d        response d                DNF     
 2 722   s        response d                DMT     
 3 686   d        response d                DMT     
 4 575   s        response s                SMT     
 5 363   d        response d                DMM     
 6 600   d        response d                DMJ     
 7 425   d        response d                DNT     
 8 406   d        response s                SMP     
 9 1097  d        response d                DNM     
10 530   d        response s                SMP     
# ℹ 550 more rows

# Here we filter for responses that have a response time between 100ms and 2000ms
df.correct = df.response %>%
  filter(as.numeric(rt) >= 100) %>%
  filter(as.numeric(rt) <= 2000) %>%
# Here we filter correct responses
  filter(response == correct_response) 

df.correct

# A tibble: 485 × 5
   rt    response task     correct_response category
   <chr> <chr>    <chr>    <chr>            <chr>   
 1 570   d        response d                DNF     
 2 686   d        response d                DMT     
 3 575   s        response s                SMT     
 4 363   d        response d                DMM     
 5 600   d        response d                DMJ     
 6 425   d        response d                DNT     
 7 1097  d        response d                DNM     
 8 410   s        response s                SMM     
 9 373   s        response s                SNF     
10 1033  s        response s                SMM     
# ℹ 475 more rows

#### Prepare data for analysis - create columns etc.
df.correct$SDNM = substr(df.correct$category, 1, 2)
df.correct$group = substr(df.correct$category, 3, 3)

df.correct = df.correct %>%
  select(-c(category))

df.correct

# A tibble: 485 × 6
   rt    response task     correct_response SDNM  group
   <chr> <chr>    <chr>    <chr>            <chr> <chr>
 1 570   d        response d                DN    F    
 2 686   d        response d                DM    T    
 3 575   s        response s                SM    T    
 4 363   d        response d                DM    M    
 5 600   d        response d                DM    J    
 6 425   d        response d                DN    T    
 7 1097  d        response d                DN    M    
 8 410   s        response s                SM    M    
 9 373   s        response s                SN    F    
10 1033  s        response s                SM    M    
# ℹ 475 more rows

#### Data analysis
##Median RT##
median_rt = df.correct %>%
  group_by(group, SDNM)%>%
  summarise(median_rt = median(as.numeric(rt), na.rm =FALSE))

`summarise()` has grouped output by 'group'. You can override using the
`.groups` argument.

median_rt

# A tibble: 20 × 3
# Groups:   group [5]
   group SDNM  median_rt
   <chr> <chr>     <dbl>
 1 F     DM         799 
 2 F     DN         607 
 3 F     SM         585 
 4 F     SN         474 
 5 J     DM         598 
 6 J     DN         692 
 7 J     SM         656.
 8 J     SN         538 
 9 M     DM         652.
10 M     DN         652.
11 M     SM         418.
12 M     SN         488.
13 P     DM         662.
14 P     DN         785 
15 P     SM         730.
16 P     SN         503 
17 T     DM         701 
18 T     DN         534 
19 T     SM         494.
20 T     SN         404.

#### Anova analysis
anova_results <- aov(median_rt$median_rt ~ median_rt$group)
summary(anova_results)

                Df Sum Sq Mean Sq F value Pr(>F)
median_rt$group  4  49251   12313   0.916   0.48
Residuals       15 201720   13448

Results of control measures

As mentioned above, there were no quality control checks or positive and negative controls in the replication study. The study excluded trials that had response times lesser than 100ms and greater than 2000ms to remove attention deficit trials, and trials with accidental button presses. Additionally subjects with less than 80 percent accuracy were removed from the analysis to ensure attention of participants.

Confirmatory analysis

The replication study conducted ANOVA and found that category the effect on mean correct response times was significant (p=0.04708) and consistent with the original study. The author also ran an additional ANOVA to compare means of categories between each other but did not observe significant results (inconsistent with original study).

Results from Replication study

Exploratory analyses

Any follow-up analyses desired (not required).

Discussion

Mini meta analysis

Combining across the original paper, 1st replication, and 2nd replication, what is the aggregate effect size?

Summary of Replication Attempt

Open the discussion section with a paragraph summarizing the primary result from the confirmatory analysis and the assessment of whether it replicated, partially replicated, or failed to replicate the original result.

Commentary

Add open-ended commentary (if any) reflecting (a) insights from follow-up exploratory analysis, (b) assessment of the meaning of the replication (or not) - e.g., for a failure to replicate, are the differences between original and present study ones that definitely, plausibly, or are unlikely to have been moderators of the result, and (c) discussion of any objections or challenges raised by the current and original authors about the replication attempt. None of these need to be long.