Replication of Exploring second language learners’ grammaticality judgment performance in relation to task design features by Shiu, Yalçın, & Spada (2018, System)
This study replicated “Exploring second language learners’ grammaticality judgment performance in relation to task design features” (Shiu, Yalçın, and Spada 2018a, 2018b). The original study investigated whether two task design dimensions of a grammaticality judgement task (GJT), time constraint (timed/untimed) and modality (aural/written), affected the performance of adult English language learners on two grammatical features of English (passive voice and past progressive tense). The study recruited 120 adult English-as-a-foreign-language (EFL) learners from one university in Taiwan. Participants were asked to judge items as grammatical or ungrammatical on four computer-based GJTs (two differed on the timed/untimed dimension and two differed on the aural/written dimension). Each GJT consisted of 60 items (30 grammatical and 30 ungrammatical). The study was conducted in two sessions one week apart. At each session, participants took two GJTs with a 30-minute break between them. The target items either were written in the passive voice or used the past progressive tense, features which were hypothesized to differ in learning difficulty. The results showed significant differences in performance with respect to all three variables: time constraint, modality, and grammatical feature. Although learners performed better on past progressive items, GJT performance on both grammatical features showed similar patterns in relation to task design features.
I chose this study because it relates to my research that uses a digital adaptation of the Test for Reception of Grammar (Bishop 1992), an assessment of implicit English syntax knowledge. Preliminary results suggest that performance on this measure improves as L2 students in grades 2-5 gain proficiency in English. While the current task uses aural prompts with picture answers, I was interested in comparing aural and written modalities on the task. I was also interested in investigating additional English features (such as tense) that are better suited to written stimuli instead of pictorial stimuli, and in extending the measure to adolescent and adult learners.
The key features needed to implement a GJT were available on the Rapid Online Assessment of Reading (ROAR) platform (Yeatman et al. 2021): jsPsych infrastructure for playing audio clips, displaying written stimuli, recording keyboard responses, and storing the responses in a database. Because the original author did not respond to a request for the original stimuli, the biggest challenges were creating the item stimuli and recruiting participants who are L2 English speakers. A large language model was used to assist with item creation. To reduce the time required for the replication, only two GJTs (contrasting the aural and written conditions) were included, with a minimal break between them.
G*Power was used to perform a power analysis. The correlation reported in the original study between the untimed auditory (AGJT) and untimed written (WGJT) conditions was r = 0.86. Table 1 of the original paper gave the mean and standard deviation of the score in the untimed auditory condition (untimed AGJT: mean = 32.27, sd = 5.79) and the untimed written condition (untimed WGJT: mean = 39.02, sd = 6.09). From these numbers, G*Power computed an effect size of 2.13.
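As a check on the arithmetic, the effect size can be reproduced by hand. The sketch below assumes the paired-samples (dependent means) formula, d_z = |M1 - M2| / sqrt(SD1^2 + SD2^2 - 2 * r * SD1 * SD2); the variable names are illustrative and are not taken from the original analysis.

```r
# Descriptive statistics reported in Table 1 of the original paper
m_agjt <- 32.27; sd_agjt <- 5.79   # untimed AGJT
m_wgjt <- 39.02; sd_wgjt <- 6.09   # untimed WGJT
r_aw   <- 0.86                     # correlation between the two conditions

# Standard deviation of the paired differences (assumed paired-samples formula)
sd_diff <- sqrt(sd_agjt^2 + sd_wgjt^2 - 2 * r_aw * sd_agjt * sd_wgjt)

# Effect size for two dependent means (d_z)
d_z <- abs(m_wgjt - m_agjt) / sd_diff
print(round(d_z, 3))
```

Under these assumptions the value comes out at roughly 2.14, which agrees with the reported 2.13 up to rounding.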
Using a two-tailed t-test, the sample size required for 95% power with alpha = 0.01 was computed to be 8 (actual power = 0.973). A planned sample of 20 participants was therefore deemed conservative.
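The same calculation can be approximated in R with `power.t.test()` from the base `stats` package. This is a sketch rather than the procedure actually used (the study used G*Power): it assumes a paired design and passes the reported effect size as `delta` with `sd = 1`, so the two tools may differ slightly in the fractional sample size.

```r
# Reported effect size (d_z) from the G*Power calculation
d_z <- 2.13

# Solve for the number of pairs needed for 95% power at alpha = 0.01
res <- power.t.test(delta = d_z, sd = 1, sig.level = 0.01, power = 0.95,
                    type = "paired", alternative = "two.sided")
n_required <- ceiling(res$n)   # round up to whole participants

# Power achieved with 8 participants, for comparison with the G*Power output
achieved <- power.t.test(n = 8, delta = d_z, sd = 1, sig.level = 0.01,
                         type = "paired", alternative = "two.sided")$power

print(n_required)
print(round(achieved, 3))
```

With an effect this large, the required sample is small and the planned sample of 20 leaves a wide margin.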
Planned Sample
Twenty participants were recruited from Prolific. The inclusion criteria specified that participants should speak Mandarin as a first language and English as a second language.
Materials
The original study specified the following stimuli design, “The timed aural GJT (AGJT) consists of 60 items, with 24 targeting the passive construction, 24 targeting the past progressive, and 12 distractors targeting other grammatical features. The passive items vary in terms of length (10-14 syllables, with an average of 11.96 syllables), accuracy (12 grammatical and 12 ungrammatical), and tense (8 present, 8 past, 8 present perfect). The passive items are all simple sentences. The passive items include 12 regular verbs and 12 irregular verbs. The ungrammatical items focus on two types of errors: omitting auxiliary verb be (e.g., Every year, many children reported missing.), and using the bare form of the verb instead of past participle (e.g., The taxi has been park at the airport for three months.). With reference to verb types (regular vs. irregular), the error types of the passive items can be divided into four categories abbreviated as: (a) regular be, (b) regular participle, (c) irregular be, and (d) irregular participle. The 24 past progressive items are also evenly divided between grammatical and ungrammatical sentences. In order to address differences in lexical aspect (Vendler, 1967), 12 items included verbs of accomplishment and 12 included verbs of activity. The length of the past progressive items ranged between 12 and 16 syllables, with an average length of 13.46 syllables. Twelve items are grammatical, while the other 12 are ungrammatical items, targeting two error types: (1) missing auxiliary (e.g. While the girl sitting outside, it started raining), and (2) present auxiliary (e.g., She is reading a book at 4 yesterday afternoon). Sixteen of the past progressive items consist of subordinate clauses that indicate the action taking place at a certain time in the past (e.g., When I met my husband, I was traveling in France.), whereas the rest 8 sentences are simple sentences. 
The differences between the two target features are taken into consideration in the analysis of the data discussed below.” For the written portion of the test the authors note, “The timed written GJT (WGJT) is virtually identical to the timed aural GJT except it was delivered in the written mode.”
Item Creation
For the replication study, the ChatGPT web interface (GPT-5.1; OpenAI, 2025) was used to create sentences similar to those described in the original paper. Example prompts are shown below. A separate prompt was used for each group of sentences. From the 20 sentences generated for each group, the replication author selected 8, ensuring that both regular and irregular verbs were represented. The author then made 4 of these 8 sentences ungrammatical by applying the error categories described in the original paper. This procedure was repeated for each tense in the passive voice (past, present, and future) and then for simple and complex sentences in the past progressive tense. After generating, selecting, and editing the target sentences, the author reviewed the stimulus list as a whole to ensure that verbs and subjects were not repeated.
Sample ChatGPT prompts:
> Write 20 simple sentences in passive voice. Use present tense and constrain the length to 10-14 syllables. Use a mix of regular and irregular verbs.
> Make a past progressive version. Half of items should be verbs of accomplishment, and half verbs of activity.
> Make 20 new sentences in past progressive tense. Include a subordinate clause that indicates action in the past in the middle of the sentence. The main clause should be active. Syllable count of 12-16, split between verbs of accomplishment and activity, split between regular and irregular verbs. No introductory subordinate clauses. Example: Tom heard about the crash while he was listening to the news last night. Do not use any form of the following verbs: [list from previously chosen items].
> Make 20 sentences with an introductory subordinate clause. The main clause should be active voice and in the past, present or future tense (no progressive tense). Constrain the sentences to 12-16 syllables. Use common words. Do not use any nouns or verbs from the sentences in the forbidden list.
> Create a list of 20 sentences in active voice, using the present progressive tense, that have one main clause and no subordinate clauses. Constrain the sentences to 12-16 words. Do not use any topics or main verbs from the forbidden list.
The original paper did not describe the 12 distractor sentences, so the replication author used ChatGPT prompts to make 6 active simple sentences (2 each for past, present, and future tense), 2 complex sentences with an introductory subordinate clause and an active past tense main clause, 2 complex sentences with an active past tense main clause and an embedded relative clause, and 2 simple sentences in the present progressive tense. The original paper did not say whether the distractors included ungrammatical sentences. The replication author chose to make one sentence in each pair of distractors ungrammatical.
Item Design
| Group | Voice | Tense of main clause | Type of subordinate clause | Number |
|---|---|---|---|---|
| passive-past | passive | past | n/a | 8 |
| passive-present | passive | present | n/a | 8 |
| passive-future | passive | future | n/a | 8 |
| progressive-simple | active | past progressive | n/a | 8 |
| progressive-complex-intro | active | past progressive | introductory | 8 |
| progressive-complex-middle | active | past progressive | embedded relative | 8 |
| distractor | active | past | n/a | 2 |
| distractor | active | present | n/a | 2 |
| distractor | active | future | n/a | 2 |
| distractor | active | present progressive | n/a | 2 |
| distractor | active | past | introductory | 2 |
| distractor | active | past | embedded relative | 2 |
Procedure
Original Procedure
This is the procedure from the original paper:
“The timed AGJT was administered first followed by the timed WGJT. There was a 30-min interval between the administrations of the two tests. One week after the participants completed the timed GJTs, they completed the untimed AGJT followed by the untimed WGJT. There was also a 30-min interval between the administrations of the two tests. The AGJT was administered before the WGJT because it was assumed that the aural stimuli were more transitory than the written stimuli. Therefore, administering the AGJT before the WGJT would decrease the possibility of memory effect. All tests were administered during regular class hours.
“The untimed aural GJT was the same as the timed aural GJT except that there were no time constraints for learners’ responses. The participants could take their time to respond and to listen to the item repeatedly if they felt necessary before responding. Because in the untimed written GJT, the participants were able to read a sentence more than once, to make the task demands of both untimed GJTs more parallel, repetitive listening was also allowed in the untimed aural GJT. The frequency of repeatedly listening to the sentence was recorded. The directions for the untimed AGJT were “After you hear the sentence, please choose ‘Correct,’ ‘Incorrect, or ‘Not Sure.’ If you would like to hear the sentence again, press ‘Listen Again.’ You can take as much time as you need to make your decision.” After the learner responded, the next question automatically appeared.
“The untimed written GJT is the same as the timed WGJT except that there are no time constraints for learners’ responses. The directions for the untimed WGJT were “You can take as much time as you need to make your choice.”
Replication Procedure
An app from an online assessment platform (ROAR) (Yeatman et al. 2021) was modified to present auditory and written prompts for the replication study. The Prolific survey included a link to the app.
The first screen displayed instructions unique to the replication study, “This is a test of grammar knowledge. Use the arrow keys to enter your answers. Please answer using your own knowledge, do not consult the web or any references.”
The next screen contained written instructions modeled on the language in the original paper: “Listen to each sentence. Your task is to decide whether the grammar of the sentence is correct or incorrect. After you hear the sentence, please choose Correct, Incorrect, or Not Sure. If you would like to hear the sentence again, click on the Listen Again button. You can take as much time as you need to make your decision.”
The auditory task began with two practice sentences intended to familiarize the participant with the response choices. An audio clip played “The grammar of this sentence is good,” in the first practice trial and “The grammar of this sentence are bad.” in the second practice trial. A button labelled “Listen Again” was at the center of the screen, with buttons labelled “Not sure”, “Incorrect”, and “Correct” below it. Each of these three buttons was also marked with an arrow (pointing up, left, and right, respectively; Figure 1). Although participants were instructed to use the arrow keys, limitations of the implementation meant that answers could also be selected with a mouse.
In the practice trials, if the correct answer was chosen it was highlighted in green and then the next trial appeared. If “Not Sure” or the incorrect answer was chosen, it was highlighted in red, and the trial remained on the screen until the correct answer was chosen.
After the practice sentences, the participant was presented with 60 auditory items in a fixed order. Next the instructions for the written task (modeled on the original instructions) were displayed, “Read each sentence. Your task is to decide whether the grammar of the sentence is correct or incorrect. After you read the sentence, please choose Correct, Incorrect, or Not Sure. You can take as much time as you need to make your choice.”
The written task began with the same practice sentences. These were displayed on the screen just above the choices (Figure 2). The main part of the task presented the same 60 sentences, in written format, in a different fixed order. While the Listen Again button was visible, pressing it did not play any audio.
Figure 1: Screenshot of the auditory grammatical judgement task.
Figure 2: Screenshot of the written grammatical judgement task.
Analysis Plan
Original Analysis
The original paper conducted the following analysis: “The four GJTs were scored in terms of accuracy, with 1 point for a correct response and 0 point for incorrect and no response. The maximum score for each GJT was 48. The option “Not sure” was considered to be incorrect. “No response” items accounted for 13% and 18% of all the responses to the timed AGJT and timed WGJT respectively. The reliability of the four GJTs was calculated based on the 120 EFL students’ data, using Cronbach’s alpha. The reliability coefficients of the timed AGJT, timed WGJT, untimed AGJT, and untimed WGJT were 0.80, 0.87, 0.81, and 0.86, respectively. Descriptive statistics of the EFL participants were calculated for the four GJTs. […] Bivariate correlations were also computed to examine the relationships among the grammatical and ungrammatical items of the four GJTs. Repeated-measures ANOVA tests were performed on the 120 EFL learner data. Given that the items of the two target features are not identical in terms of their length, error types and sentence pattern (i.e., simple versus complex), the bivariate correlations and the repeated-measures ANOVA tests were conducted separately for the passive structure and the past progressive structure. The participants’ GJT performance was also examined in relation to the different error types included in the ungrammatical items of the two target features.”
Replication Analysis
The auditory and written GJTs were scored for accuracy, with 1 point for a correct response and 0 points for an incorrect or “Not Sure” response. The maximum score for each GJT is 48. The mean and standard deviation across participants were computed for each combination of modality (auditory/written), grammaticality (grammatical/ungrammatical), and feature (passive/past progressive). Pearson correlations were computed between modalities for all items, for passive items only, and for past progressive items only.
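The scoring rule can be sketched in a few lines of base R. The data frame, column names, and responses below are hypothetical and serve only to illustrate that a “Not Sure” response earns no credit.

```r
# Hypothetical trial-level data: one row per response (not study data)
trials <- data.frame(
  pid         = c("p1", "p1", "p1", "p2", "p2", "p2"),
  grammatical = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  response    = c("Correct", "Incorrect", "Not Sure",
                  "Incorrect", "Incorrect", "Not Sure"),
  stringsAsFactors = FALSE
)

# The expected answer depends on whether the item is grammatical
expected <- ifelse(trials$grammatical, "Correct", "Incorrect")

# 1 point for a matching judgement; "Not Sure" never matches, so it scores 0
trials$score <- as.integer(trials$response == expected)

# Total score per participant
totals <- aggregate(score ~ pid, data = trials, FUN = sum)
print(totals)
```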
Repeated-measures ANOVA tests examined the effects of modality, grammaticality, and their interaction (modality × grammaticality) on all items, on passive items only, and on past progressive items only.
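One way to set up such a test is a two-way within-subject ANOVA using base R's `aov()` with an `Error()` term. The sketch below runs on simulated placeholder scores, not study data, and the report does not state which R function was actually used, so the model specification is an assumption.

```r
set.seed(1)
n_participants <- 18

# One score per participant for each modality x grammaticality cell
design <- expand.grid(
  pid            = factor(seq_len(n_participants)),
  modality       = factor(c("auditory", "written")),
  grammaticality = factor(c("grammatical", "ungrammatical"))
)

# Simulated cell scores out of 24 (placeholder values only)
design$score <- pmin(24, pmax(0, round(rnorm(nrow(design), mean = 18, sd = 3))))

# Both factors vary within participants, so each gets its own error stratum
fit <- aov(score ~ modality * grammaticality +
             Error(pid / (modality * grammaticality)),
           data = design)
print(summary(fit))
```

Running the same model on the passive-only and past-progressive-only subsets would mirror the three planned analyses.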
(Note: In the event of a significant discrepancy in findings, the replication author will drop the future tense passive items, compute a scaled adjusted score using just the past and present tense passive items, and repeat the analyses.)
Differences from Original Study
Where the original study included 4 tasks for a 2x2 contrast of timed/untimed and auditory/written conditions, the replication included only the 2 untimed tasks for an auditory/written contrast.
The original study was conducted in university classrooms and included a 30-minute break between the auditory and written conditions. The replication study was conducted on Prolific and did not have a break between conditions. Two beta testers in the replication study reported noticing that sentences were repeated between the conditions, which may have made their responses more similar than they would have been with a longer break.
The app used in the original study only accepted keyboard responses, while the replication app allowed both keyboard and mouse responses. Because the scoring is only computed on accuracy, not on response time, the additional response method is expected to have little effect on the results.
The stimuli for the replication study were created by the replication author based on descriptions in the original paper. The original study included 8 passive sentences in the present perfect tense, while the replication study instead included 8 passive sentences in the future tense. This difference in stimuli was unintentional and was discovered while analyzing pilot test results. Due to time constraints, the author chose not to revise these sentences. The difference in difficulty between these tenses is not known and may affect the score on the passive sentences.
Methods Addendum (Post Data Collection)
The Prolific survey was configured to drop participants if they did not complete the activity or if their overall completion time was much less than the estimated time provided by the researcher.
After data collection, two participants were noted to have total scores near chance, raising the possibility that they were not engaged in the task and instead were answering randomly. The following analyses were added to the plan to detect disengaged participants: checking whether the number correct for each type of sentence (distractor, passive, past progressive) was statistically better than chance, checking whether the proportion of a particular response (Not Sure, Correct, and Incorrect) was greater than 75%, and checking whether the median response time for each modality was greater than 1 second.
Actual Sample
Three participants who began the study did not complete the test. Two did not attempt any items and one attempted 51 items. They were dropped from the study and replaced with newly recruited participants. Two of the 20 participants who completed all 120 items were excluded based on the disengagement analysis.
The final sample consisted of 18 Mandarin-speaking individuals (M age = 29.5 years, SD = 8.9). Table 1 shows demographics for the final sample. The majority of participants identified as female (72.2%), with the remaining participants identifying as male (27.8%). Most participants were born in China (72.2%), followed by Taiwan (16.7%), Malaysia (5.6%), and the United States (5.6%). At the time of participation, participants most commonly resided in Canada (38.9%) or the United Kingdom (22.2%), with others residing in Australia (16.7%), New Zealand (11.1%), and the United States (11.1%). More than half of the sample reported not being students (55.6%), 27.8% reported current student status, and student status was unknown for 16.7% of participants.
Differences from pre-data collection methods plan
A disengagement analysis was added to the plan after data collection.
Results
Data preparation
Trial-level data were downloaded from the ROAR database. The number of trials attempted was checked, and participants who completed all 120 trials were approved for payment in Prolific. Participants whose number of correct answers was near chance were flagged for further analysis.
Practice trials were removed, then a disengagement analysis was conducted. For two participants, a binomial test indicated that the participant’s number of correct responses was not significantly greater than chance (p > .05, range = [.33, .67]) on all three sentence types (distractor, passive, and past progressive). The same participants were also flagged for responding Correct on greater than 75% of responses (82.5% and 100%). One of these participants also had a median response time of less than one second on the written GJT. Both participants were excluded from the study.
Two participants were flagged by the binomial test only on the distractor sentences (p=0.076), but had acceptable response proportions and median response time. These participants were included in the study.
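To make the chance criterion concrete, the one-sided binomial test can be run on a single count. The example below is an illustration, not the participants' actual data: it assumes the 24 distractor trials (12 per modality) are pooled and uses a hypothetical count of 16 correct, which happens to yield a p-value that rounds to 0.076.

```r
n_trials <- 24   # assumed: 12 distractors in each of the two modalities

# One-sided test: is 16 of 24 correct better than guessing (p = 0.5)?
p_16 <- binom.test(x = 16, n = n_trials, p = 0.5,
                   alternative = "greater")$p.value
print(round(p_16, 3))

# Smallest number correct that would reach p <= .05 on 24 trials
p_all <- sapply(0:n_trials, function(x) {
  binom.test(x = x, n = n_trials, p = 0.5, alternative = "greater")$p.value
})
min_correct <- min(which(p_all <= 0.05)) - 1   # indices start at x = 0
print(min_correct)
```

With so few distractor trials, a participant needs about 71% correct to clear the threshold, which is why a flag on distractors alone was not treated as grounds for exclusion.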
The format of the item_id field encodes the characteristics of each item (modality, grammaticality, and feature). These codes were parsed and distractor sentences were filtered out. Each response was scored 1 point if correct and 0 points if incorrect; “Not sure” responses were counted as incorrect. For each participant, scores were computed for each combination of modality, grammaticality, and feature (maximum 12), and subtotals were computed for each combination of modality and grammaticality (maximum 24). The maximum total score for each modality is 48. Means and standard deviations were computed for each subscore and subtotal.
Bivariate correlations were computed to examine inter-correlations between the auditory and written tests and the relationship between grammatical and ungrammatical items. A repeated-measures ANOVA was performed to examine modality vs grammaticality and their interaction.
#### Data exclusion / filtering
library(dplyr)

# Disengagement Analysis
df_trials_all <- df_trials_raw %>%
  filter(assessment_stage != "practice_response") %>%
  mutate(
    modality = ifelse(grepl("^l-", item_id), "auditory", "written"),
    # NOTE: IDs ending in "u" are labelled "grammatical" here; if the "-u" suffix
    # marks ungrammatical items, these two labels should be swapped
    grammar = ifelse(grepl("u$", item_id), "grammatical", "ungrammatical"),
    feature = case_when(
      grepl("dist", item_id) ~ "distractor",
      grepl("pass", item_id) ~ "passive",
      grepl("prog", item_id) ~ "progressive",
      TRUE ~ NA_character_ # fallback if none match
    ),
    tense = case_when(
      grepl("practice", item_id) ~ "practice",
      grepl("pass-pres", item_id) ~ "present",
      grepl("pass-past", item_id) ~ "past",
      grepl("pass-fut", item_id) ~ "future",
      grepl("prog-intro", item_id) ~ "past-progressive-intro",
      grepl("prog-mid", item_id) ~ "past-progressive-middle",
      grepl("prog-simp", item_id) ~ "past-progressive-simple",
      grepl("distprog", item_id) ~ "distractor-progressive",
      grepl("distsimple", item_id) ~ "distractor-simple",
      grepl("distsub", item_id) ~ "distractor-complex",
      TRUE ~ NA_character_ # fallback if none match
    )
  )

# one-sided binomial test against chance for each participant and sentence type
chance_analysis <- df_trials_all %>%
  group_by(assessment_pid, feature) %>%
  summarize(
    n = n(),
    num_correct = sum(correct),
    .groups = "drop"
  ) %>%
  rowwise() %>%
  mutate(
    p_value = round(
      binom.test(x = num_correct, n = n, p = 0.5, alternative = "greater")$p.value,
      3
    )
  ) %>%
  ungroup()

# flag if not statistically better than chance
chance_flag <- chance_analysis %>%
  filter(p_value > 0.05)

response_count <- df_trials_all %>%
  group_by(assessment_pid, response) %>%
  summarize(n = n(), .groups = "drop") %>%
  group_by(assessment_pid) %>%
  mutate(percent = round(100 * n / sum(n), 1)) %>% # percent of responses per participant
  ungroup()

# flag if one type of response is > 75%
response_flag <- response_count %>%
  filter(percent > 75)

# median response time (ms) per participant and modality
rt_analysis <- df_trials_all %>%
  group_by(assessment_pid, modality) %>%
  summarise(median_rt = median(rt), .groups = "drop")

# flag if median is less than 1 second
rt_flag <- rt_analysis %>%
  filter(median_rt < 1000)

# exclude participants flagged by both the chance and the response-proportion checks
exclude_pid <- df_trials_all %>%
  select(assessment_pid) %>%
  unique() %>%
  filter(assessment_pid %in% chance_flag$assessment_pid) %>%
  filter(assessment_pid %in% response_flag$assessment_pid)

#### Age
# exclude_id is from "trog_gjt pid tracker.xlsx" based on values in exclude_pid
exclude_id <- c("69338558d54158666e7749ac", "6420df282f852290686ab8d7")

# Filter accepted participants
df_accepted <- df_demo_raw %>%
  filter(Status == "APPROVED") %>%
  filter(!(Participant.id %in% exclude_id)) %>%
  mutate(Age = as.integer(Age)) %>%
  mutate(Student.status = ifelse(Student.status == "DATA_EXPIRED", "Unknown", Student.status))

# Calculate mean and SD of Age
age_mean <- mean(df_accepted$Age, na.rm = TRUE)
age_sd <- sd(df_accepted$Age, na.rm = TRUE)

# Create inline string
age_inline <- sprintf("%.1f (%.1f)", age_mean, age_sd)
# # Display
# age_inline
Warning in cor.smooth(r): Matrix was not positive definite, smoothing was done
Warning in alpha(df_reliability_auditory): Some items were negatively correlated with the first principal component and probably
should be reversed.
To do this, run the function again with the 'check.keys=TRUE' option
Some items ( l-pass-past-7 l-prog-simp-8-u l-pass-past-4-u l-prog-intro-1 l-pass-pres-2-u l-prog-intro-4 l-prog-simp-4 l-prog-simp-1 l-prog-mid-2 l-prog-simp-3 ) were negatively correlated with the first principal component and
probably should be reversed.
To do this, run the function again with the 'check.keys=TRUE' option
In smc, smcs < 0 were set to .0
Warning in cor.smooth(r): Matrix was not positive definite, smoothing was done
Warning in alpha(df_reliability_written): Some items were negatively correlated with the first principal component and probably
should be reversed.
To do this, run the function again with the 'check.keys=TRUE' option
Some items ( prog-simp-6-u prog-intro-4 pass-past-5 pass-pres-8 prog-mid-4 pass-fut-7 prog-intro-1 pass-pres-6 prog-mid-8-u pass-fut-3-u prog-simp-7-u pass-fut-4-u pass-past-7 prog-mid-2 ) were negatively correlated with the first principal component and
probably should be reversed.
To do this, run the function again with the 'check.keys=TRUE' option
In smc, smcs < 0 were set to .0
A graph of participant scores and means for each category is shown in Figure 3.
Figure 3: Participant scores and mean of each group.
The reliability of the auditory and written GJTs (n=18) was calculated using Cronbach’s alpha. Internal consistency was acceptable for auditory items (α = 0.81) and written items (α = 0.79). Compared with the original results (n=120; α = 0.81 for auditory items and 0.86 for written items), reliability was comparable for auditory items and slightly lower for written items.
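As a reminder of what the reliability coefficient measures, Cronbach’s alpha for k items is α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The following is a minimal sketch of that formula, not the analysis code used for this report; the function name and the demo responses are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a participants-by-items score matrix.

    scores: list of rows, one per participant; each row holds that
    participant's item scores (e.g. 1 = correct judgment, 0 = incorrect).
    """
    k = len(scores[0])  # number of items
    # Sample variance of each item across participants.
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the participants' total scores.
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses (4 participants x 3 items), NOT data from the study.
demo = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(round(cronbach_alpha(demo), 2))  # 0.75
```

With real data, each row would hold one participant's 48 item scores for a single modality.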
Table 2 presents mean scores by modality, grammar status, and feature (verb form), along with subtotals. Overall, participants performed better on the untimed WGJT (M = 41.33, SD = 4.83) than on the untimed AGJT (M = 34.61, SD = 5.95), which was consistent with the original paper (untimed WGJT: M = 39.02, SD = 6.09; untimed AGJT: M = 32.27, SD = 5.79). Mean modality scores in the replication were within one standard deviation of the corresponding means in the original paper. For replication participants, mean scores were lowest for auditory grammatical items, while mean scores for auditory ungrammatical, written grammatical, and written ungrammatical items were very similar. Replication participants generally performed better on ungrammatical than grammatical items. This is the opposite of the result in the original study, where participants generally performed better on grammatical items than ungrammatical items.
For passive items, replication participants had the lowest performance on grammatical passive auditory items. The best performance was on ungrammatical passive items, where auditory and written scores were similar to each other. This is the opposite result from the original study, where participants performed better on grammatical passive items than ungrammatical passive items in both modalities.
For past progressive items, replication participants had the lowest performance on auditory grammatical items and the best performance on written grammatical items. Scores on ungrammatical past progressive items were nearly equal across modalities. The original participants had the lowest performance on the auditory ungrammatical items, and higher and equal performance across auditory ungrammatical items and grammatical items of both modalities.
A graph comparing mean scores between the original paper and the replication is shown in Figure 4.
Mean (SD) scores by modality, grammar status, and feature

| Row | Total | Passive | Past Progressive |
|---|---|---|---|
| Auditory Total | 34.61 (5.95) | 17.56 (2.79) | 17.06 (3.59) |
| Auditory Grammatical | 14.00 (5.27) | 6.33 (2.22) | 7.67 (3.46) |
| Auditory Ungrammatical | 20.61 (2.45) | 11.22 (1.22) | 9.39 (1.65) |
| Written Total | 41.33 (4.83) | 21.56 (2.50) | 19.78 (3.17) |
| Written Grammatical | 20.22 (3.57) | 9.94 (2.01) | 10.28 (2.08) |
| Written Ungrammatical | 21.11 (2.54) | 11.61 (0.98) | 9.50 (2.09) |
Table 2: Descriptive statistics. The maximum total score for each modality was 48. The maximum was 24 for each grammaticality subtotal and each feature total within a modality, and 12 for each feature (passive/past progressive) within a grammaticality subtotal.
Figure 4: Comparison of mean scores between original study and replication.
Correlations between modalities are shown in Table 3. The correlation between the untimed AGJT and WGJT was moderate and significant (0.49, p < 0.05). This is lower than the strong significant correlation in the original paper (.86, p < 0.01).
Item Correlation by Modality

| | Auditory | Written |
|---|---|---|
| Auditory | | |
| Written | 0.490* | |
Table 3: Correlation of all items between auditory and written modality. *p < .05
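The significance of a Pearson correlation can be checked by converting r to a t statistic with n − 2 degrees of freedom, t = r√(n − 2) / √(1 − r²). The sketch below does this for the reported r = 0.49, assuming n = 18 (the sample size given for the reliability analysis); the function names are illustrative, and the critical values are the standard two-tailed table values for df = 16 rather than output from the original analysis.

```python
import math

# Two-tailed critical values of t for df = 16, taken from standard t tables.
T_CRIT_DF16 = {0.05: 2.120, 0.01: 2.921}


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_from_r(r, n):
    """t statistic (df = n - 2) for testing r against zero; requires |r| < 1."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Reported correlation between modalities, with n = 18 assumed.
t = t_from_r(0.49, 18)
print(round(t, 2))                 # 2.25
print(abs(t) > T_CRIT_DF16[0.05])  # True: significant at the .05 level
print(abs(t) > T_CRIT_DF16[0.01])  # False: not significant at the .01 level
```

Under these assumptions the result agrees with the single asterisk in Table 3.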
Correlations for passive items are shown in Table 4. In the replication, the only moderate significant correlation was between auditory ungrammatical and written grammatical items (0.58, p < 0.05). In the original paper, moderate significant correlations were found between the modalities for grammatical (.58, p < 0.01) and ungrammatical (.52, p < 0.01) passive items.
Passive Item Correlations

| | Auditory Grammatical | Auditory Ungrammatical | Written Grammatical | Written Ungrammatical |
|---|---|---|---|---|
| Auditory Grammatical | | | | |
| Auditory Ungrammatical | 0.254 | | | |
| Written Grammatical | 0.438 | 0.582* | | |
| Written Ungrammatical | 0.225 | 0.275 | 0.317 | |
Table 4: Correlation of modality and grammaticality for passive items. *p < .05
Correlations for past progressive items are shown in Table 5. In the replication, a moderate significant correlation was found only between auditory ungrammatical and written ungrammatical items (0.50, p < 0.05). In the original paper, moderate significant correlations were found between the modalities for grammatical (.65, p < 0.01) and ungrammatical (.57, p < 0.01) past progressive items. There was also a low significant correlation between the ungrammatical auditory and grammatical written (.21, p < 0.01) past progressive items.
Past Progressive Item Correlations

| | Auditory Grammatical | Auditory Ungrammatical | Written Grammatical | Written Ungrammatical |
|---|---|---|---|---|
| Auditory Grammatical | | | | |
| Auditory Ungrammatical | -0.161 | | | |
| Written Grammatical | 0.209 | 0.207 | | |
| Written Ungrammatical | -0.243 | 0.502* | 0.155 | |
Table 5: Correlation of modality and grammaticality for past progressive items. *p < .05
Across all items, results of the repeated-measures ANOVA of modality by grammaticality (and their interaction) are shown in Table 6. In the replication, large significant effects were found for modality (F(1, 17) = 26.61, p < .001, η²p = 0.18), grammaticality (F(1, 17) = 15.47, p < .01, η²p = 0.22), and the interaction between them (F(1, 17) = 20.22, p < .001, η²p = 0.14). The original study found very large significant main effects of modality (F(1, 119) = 156.64, p < .05, η²p = 0.79) and grammaticality (F(1, 119) = 575.04, p < .001, η²p = 0.83), and a large significant effect for the interaction between them (F(1, 119) = 18.90, p < 0.001, η²p = 0.14).
Repeated Measures ANOVA for All Items

| Effect | F | η²p | p | Significance |
|---|---|---|---|---|
| modality | 26.61 | 0.18 | < .001 | *** |
| grammaticality | 15.47 | 0.22 | < .01 | ** |
| modality:grammaticality | 20.22 | 0.14 | < .001 | *** |
Table 6: ANOVA between modality and grammaticality for all items. **p < .01, ***p < .001
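Partial eta squared can be recovered from a reported F ratio and its degrees of freedom through the identity η²p = (F × df1) / (F × df1 + df2). The sketch below applies that identity as a consistency check on reported statistics. The function name is illustrative, and a derived value that differs from a reported one may simply mean that the analysis software reported another effect-size measure (generalized eta squared is one possibility; that is an assumption here, not something this report states).

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Statistics reported for the original study's passive items (N = 120).
print(round(partial_eta_squared(458.13, 1, 119), 2))  # 0.79 (modality)
print(round(partial_eta_squared(16.52, 1, 119), 2))   # 0.12 (interaction)
```

The same function can be applied to the replication F values in Tables 6 to 8, with df_error = 17, to compare the implied effect sizes with those reported.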
For passive items, results of the repeated-measures ANOVA of modality by grammaticality (and their interaction) are shown in Table 7. In the replication, large significant effects were found for modality (F(1, 17) = 51.00, p < .001, η²p = 0.27), grammaticality (F(1, 17) = 80.95, p < .001, η²p = 0.50), and the interaction between them (F(1, 17) = 23.17, p < .001, η²p = 0.19). The original study found very large significant main effects of modality (F(1, 119) = 458.13, p < .05, η²p = 0.79) and grammaticality (F(1, 119) = 569.85, p < .05, η²p = 0.83), and a medium significant effect for the interaction between them (F(1, 119) = 16.52, p < 0.001, η²p = 0.12).
Repeated Measures ANOVA for Passive Items

| Effect | F | η²p | p | Significance |
|---|---|---|---|---|
| modality | 51.00 | 0.27 | < .001 | *** |
| grammaticality | 80.95 | 0.50 | < .001 | *** |
| modality:grammaticality | 23.17 | 0.19 | < .001 | *** |
Table 7: ANOVA between modality and grammaticality for passive items. ***p < .001
For past progressive items, results of the repeated-measures ANOVA of modality by grammaticality (and their interaction) are shown in Table 8. In the replication, medium significant effects were found for modality (F(1, 17) = 7.18, p = 0.02, η²p = 0.08) and for the interaction between modality and grammaticality (F(1, 17) = 7.34, p = 0.01, η²p = 0.07); the main effect of grammaticality was not significant (F(1, 17) = 0.49, p = 0.49, η²p = 0.01). The original study found very large significant main effects of modality (F(1, 119) = 112.12, p < .05, η²p = 0.49) and grammaticality (F(1, 119) = 244.18, p < .05, η²p = 0.67), and a medium significant effect for the interaction between them (F(1, 119) = 6.15, p = 0.015, η²p = 0.05).
Repeated Measures ANOVA for Past Progressive Items

| Effect | F | η²p | p | Significance |
|---|---|---|---|---|
| modality | 7.18 | 0.08 | 0.02 | * |
| grammaticality | 0.49 | 0.01 | 0.49 | |
| modality:grammaticality | 7.34 | 0.07 | 0.01 | * |
Table 8: ANOVA between modality and grammaticality for past progressive items. *p < .05
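In a fully within-subjects 2 × 2 design such as modality by grammaticality, each effect has one degree of freedom, so its F(1, n − 1) equals the square of a one-sample t statistic computed on a per-participant contrast score. The sketch below illustrates that equivalence. It is not the code used to produce Tables 6 to 8; the function names and the demo scores are hypothetical.

```python
import math
from statistics import mean, stdev


def contrast_f(contrast_scores):
    """F(1, n - 1) for one effect: squared one-sample t on contrast scores.

    Assumes the contrast scores are not all identical (nonzero variance).
    """
    n = len(contrast_scores)
    t = mean(contrast_scores) / (stdev(contrast_scores) / math.sqrt(n))
    return t * t, (1, n - 1)


def rm_anova_2x2(data):
    """data: one tuple per participant, ordered as
    (auditory grammatical, auditory ungrammatical,
     written grammatical, written ungrammatical)."""
    modality = [(ag + au) - (wg + wu) for ag, au, wg, wu in data]
    grammaticality = [(ag + wg) - (au + wu) for ag, au, wg, wu in data]
    interaction = [(ag - au) - (wg - wu) for ag, au, wg, wu in data]
    return {
        "modality": contrast_f(modality),
        "grammaticality": contrast_f(grammaticality),
        "modality:grammaticality": contrast_f(interaction),
    }


# Hypothetical cell scores for three participants, NOT data from the study.
demo = [(1, 2, 3, 4), (2, 2, 4, 5), (3, 2, 5, 9)]
for effect, (f_value, df) in rm_anova_2x2(demo).items():
    print(effect, round(f_value, 2), df)
```

Because each factor has only two levels, no sphericity correction is involved, which is why the contrast approach is sufficient for this design.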
Exploratory analyses
No exploratory analyses were performed.
Discussion
Summary of Replication Attempt
The current study was able to partially replicate “Exploring second language learners’ grammaticality judgment performance in relation to task design features” (Shiu, Yalçın, and Spada 2018a, 2018b) for untimed grammaticality judgement tasks. Both studies found that task modality and item grammaticality played a significant role in GJT performance, although the effect sizes were much larger in the original study. Both studies also found that participants performed significantly better on the written than the auditory tasks.
There was one discrepancy between the studies: participants in the original study performed better on grammatical items than ungrammatical items, while participants in the replication study performed better on ungrammatical items than grammatical items in the auditory modality and had similar scores on the two item types in the written modality.
Commentary
In their discussion section, Shiu, Yalçın, and Spada (2018a) noted that Bley-Vroman, Felix, and Ioup (1988) found that participants judged ungrammatical items more accurately than grammatical items, which partially agrees with the replication findings. As noted by Shiu et al., it is plausible that a difference in participants’ English proficiency accounts for these findings. Participants in the replication study were similar to those of Bley-Vroman, Felix, and Ioup (1988) in that they were advanced English learners living abroad, while participants in Shiu, Yalçın, and Spada (2018a) were intermediate English learners living in Taiwan.
Alternatively, the difference in difficulty could be due to differences between the stimuli used in the original and replication studies. If it is easier to judge the grammaticality of present perfect sentences than future tense sentences, that difference might explain the higher scores for grammatical sentences in the original study.
References
Bishop, Dorothy. 1992. T.r.o.g. Test for Reception of Grammar. Chapel Press.
Bley-Vroman, Robert W., Sascha W. Felix, and Georgette L. Ioup. 1988. “The Accessibility of Universal Grammar in Adult Language Learning.” Interlanguage Studies Bulletin (Utrecht) 4 (1): 1–32. https://doi.org/10.1177/026765838800400101.
Shiu, Li-Ju, Şebnem Yalçın, and Nina Spada. 2018a. “Exploring Second Language Learners’ Grammaticality Judgment Performance in Relation to Task Design Features.” System 72 (February): 215–25. https://doi.org/10.1016/j.system.2017.12.004.
———. 2018b. “Exploring Second Language Learners’ Grammaticality Judgment Performance in Relation to Task Design Features.” System 72 (February): 215–25. https://doi.org/10.1016/j.system.2017.12.004.
Yeatman, Jason D., Kenny An Tang, Patrick M. Donnelly, Maya Yablonski, Mahalakshmi Ramamurthy, Iliana I. Karipidis, Sendy Caffarra, et al. 2021. “Rapid Online Assessment of Reading Ability.” Scientific Reports 11 (1): 6396. https://doi.org/10.1038/s41598-021-85907-x.