1 Introduction

This paper outlines the design of an acceptability judgment experiment with a focus on power calculation. The experiment investigates whether there is a significant difference in the acceptability of a specific type of Japanese sentence when it is spoken in two different types of contexts. Informal data collection by the author suggested that context influences sentence acceptability, but individual variability was observed (Harada 2018). Therefore, a formal experiment with a larger sample is necessary to draw reliable conclusions.

A key question in designing the experiment is determining the sample size required to detect any difference with sufficient power, set at 80% following statistical convention. Because there is little literature on this specific Japanese phenomenon and few relevant statistics on acceptability judgments, accurately estimating the sample size is challenging. This paper addresses the challenge by running simulations on a constructed data frame with a linear mixed-effects model. The simulations are based on acceptability judgment data from Linzen and Oseki’s (2018) experiment, the most relevant available study. This approach helps avoid underestimating the sample size and reduces the risk of Type II errors as well as Type M and Type S errors.

The paper is organized as follows: Section 2 briefly explains the phenomenon of interest. Section 3 overviews the experimental design. Section 4 presents the power analysis, which informs the participant number estimated in Section 3. Finally, Section 5 concludes the paper.

2 Phenomenon to be examined

The experiment investigates a phenomenon in Japanese copular constructions where the predicate can sometimes take an accusative case and sometimes cannot. Sentence (1) exemplifies the phenomenon.

(1) kyoo-wa onigiri-o mit-tu dayo
    • kyoo-wa: today-TOP
    • onigiri-o: rice.ball-Acc
    • mit-tu: three-Classifier
    • dayo: copula
    • Meaning: ‘Today is three rice balls.’

In (1), the predicate onigiri-o mit-tu ‘three rice balls’ can optionally take the accusative case -o. The availability of the accusative case depends on the utterance context where the sentence occurs. For example, sentence (1) is natural in the context given in (2a) but not in the context provided in (2b).

(2) a. Ken is Ai’s father and always cooks lunch for her. It’s now 6am, and Ai has just come into the kitchen. Ken says (1) to Ai.

    b. Ken and Ai have long been monitoring when different kinds of food in their showcase go bad. Ken always checks at 4pm which food items and how many of them have spoiled. It’s now 4pm, and Ai has just come to the showcase. Looking at the food, Ken says (1) to Ai.

The contextual variability in the acceptability of the accusative case raises various descriptive and theoretical questions. One pertinent question is: what kinds of contexts allow the accusative case? In response to this, Harada (2018) offers a descriptive generalization similar to (3).

(3) The predicate accusative case in Japanese copular sentences is available only when the context supports the accommodation of a question which:
    a. if expressed linguistically, contains an accusative case-marked wh-item, and
    b. is answered by the copular sentence.

I call wh-questions satisfying the conditions in (3) whAcc-questions.

I demonstrate that whereas the context in (2a) accommodates a whAcc-question, the one in (2b) does not, and thus only the context in (2a) allows the predicate accusative case. First, consider (4), a whAcc-question accommodated in (2a).

(4) Ken-wa nani-o tukutta-no
    • Ken-wa: Ken-TOP
    • nani-o: what-Acc
    • tukutta-no: made-Question.marker
    • Meaning: ‘What did Ken make?’

The question with an accusative case-marked wh-item in (4) is contextually salient because Ken cooks lunch for Ai every morning and (1) is uttered in the morning. Also, the copular sentence in (1) answers the question in (4). Thus, the question in (4) is a whAcc-question accommodated in (2a).

In contrast to (2a), it is difficult to envision a whAcc-question in context (2b); the most natural wh-question to accommodate in (2b) that is answered by (1) would be (5). But this question does not contain an accusative case-marked wh-item. Thus, (5) is not a whAcc-question.

(5) kyoo-wa nani-ga/*o kusatta?
    • kyoo-wa: today-TOP
    • nani-ga/o: what-Nom/Acc
    • kusatta: went.bad
    • Meaning: ‘What has gone bad today?’

To sum up, the availability of the predicate accusative case depends on the utterance context, and seems to be governed by the conditions in (3). However, as mentioned in Section 1, there are some differences in acceptability among individuals. Therefore, the proposed experiment examines whether (3) is a correct generalization of when the predicate accusative case is available.

3 Overview of the experiment

3.1 Participants

I will recruit 88 native Japanese speakers for the experiment through the crowdsourcing platform Crowdworks. Following Linzen and Oseki (2018), I will only select participants who meet two conditions: (i) they have lived in Japan from birth until at least age 13, and (ii) their parents spoke Japanese to them at home.

Although speakers of different Japanese dialects might show varying patterns in acceptability judgments, it is not clear if this is indeed the case. Therefore, I will not exclude speakers of any dialects from participating. However, I will ask participants to indicate their dialects in the questionnaire so I can analyze potential differences in judgment among dialect groups.

Similarly, I will include both linguists and non-linguists in the experiment. While there is some evidence suggesting that these two groups may provide different judgments (e.g., Spencer 1973; Gordon and Hendrick 1997; Dabrowska 2010), it remains unclear whether linguists should avoid collecting data from fellow linguists (e.g., Schütze and Sprouse 2014). On one hand, linguists’ judgments may be influenced by their theoretical perspectives (e.g., Edelman and Christiansen 2003; Wasow and Arnold 2005; Gibson and Fedorenko 2013), but on the other hand, they might also be more attuned to subtle differences in acceptability that non-linguists might miss (e.g., Newmeyer 1983; Grewendorf 2007). I will ask participants in the questionnaire whether they have studied linguistics.

3.2 Materials

The experiment will involve 10 main items, each consisting of the same sequence of Japanese copular sentences uttered in two different contexts, resulting in 20 main sentences. For example, one item includes sentence (1) in context (2a) and sentence (1) in context (2b).

The number of main items was determined considering the need to minimize the effect of particular lexical items. Schütze and Sprouse (2014) suggest that ideally, an experiment should include 8 or more items to reduce this effect. While more items increase the experiment’s power, we will use only 10 items to prevent participant fatigue or boredom during a longer experiment. Using a Latin square design, each participant will see only one condition per item, resulting in 10 main sentences per participant.

The experiment will also include twice as many filler sentences as main sentences, following Cowart (1997). Thus, each participant will see 30 sentences in total. The filler items are included for three reasons: (i) to encourage every response option on the 7-point Likert scale (1 being least natural and 7 being most natural) to be used with roughly equal frequency, (ii) to include a variety of constructions so that participants are not affected by any salient feature(s) of the main item sentences, and (iii) to prevent participants from discerning the experiment’s intent.

In addition to main items and fillers, the experiment will include 3 sentences that serve as anchors for the Likert scale. These anchor sentences will also encourage participants to use the full range of the scale, preventing scale biases such as skew and compression.

The table below outlines the dataset structure for this experiment. The items acc.ok.01 and acc.bad.01 correspond to sentences like (1), presented in contexts such as (2a) and (2b).

anchor items   main items   filler items
A1             acc.ok.01    F01
A2             acc.bad.01   F02
A3             acc.ok.02    F03
               acc.bad.02   F04
               acc.ok.03    F05
               acc.bad.03   F06
               acc.ok.04    F07
               acc.bad.04   F08
               acc.ok.05    F09
               acc.bad.05   F10
               acc.ok.06    F11
               acc.bad.06   F12
               acc.ok.07    F13
               acc.bad.07   F14
               acc.ok.08    F15
               acc.bad.08   F16
               acc.ok.09    F17
               acc.bad.09   F18
               acc.ok.10    F19
               acc.bad.10   F20

As mentioned earlier, each participant is exposed to only one condition per item. For instance, if participant 1 sees acc.ok.01, they will not see acc.bad.01, and vice versa. Therefore, the experiment requires two types of datasets—referred to as lists—as follows:

List 1:

anchor items   main items   filler items
A1             acc.ok.01    F01
A2             acc.bad.02   F02
A3             acc.ok.03    F03
               acc.bad.04   F04
               acc.ok.05    F05
               acc.bad.06   F06
               acc.ok.07    F07
               acc.bad.08   F08
               acc.ok.09    F09
               acc.bad.10   F10
                            F11
                            F12
                            F13
                            F14
                            F15
                            F16
                            F17
                            F18
                            F19
                            F20

List 2:

anchor items   main items   filler items
A1             acc.bad.01   F01
A2             acc.ok.02    F02
A3             acc.bad.03   F03
               acc.ok.04    F04
               acc.bad.05   F05
               acc.ok.06    F06
               acc.bad.07   F07
               acc.ok.08    F08
               acc.bad.09   F09
               acc.ok.10    F10
                            F11
                            F12
                            F13
                            F14
                            F15
                            F16
                            F17
                            F18
                            F19
                            F20
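
These two lists can also be generated mechanically. A minimal sketch in R (not code from the experiment itself; the vector names list1 and list2 are mine):

# Latin-square assignment: list 1 pairs odd-numbered items with the acc.ok
# condition and even-numbered items with acc.bad; list 2 is the mirror image.
item_ids <- sprintf("%02d", 1:10)
odd <- seq_along(item_ids) %% 2 == 1
list1 <- ifelse(odd, paste0("acc.ok.", item_ids), paste0("acc.bad.", item_ids))
list2 <- ifelse(odd, paste0("acc.bad.", item_ids), paste0("acc.ok.", item_ids))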

The three anchor sentences will be presented in the same order at the beginning of the experiment. However, the main and filler items will be presented in a pseudorandomized order. Following Sprouse (2018), the items are pseudorandomized such that (i) there is at least one filler item between any two main items, and (ii) two acc.ok items do not appear consecutively (even with filler items in between), nor do two acc.bad items. In other words, acc.ok and acc.bad items alternate, with filler items occurring between them.

First, for each list, we separately randomize the order of the acc.ok, acc.bad, and filler items. This can be done, for instance, with the =RAND() function in Excel. Second, we combine the acc.ok and acc.bad items in an alternating pattern. As a result, the two lists would look something like the following.

List 1 (randomized):

anchor items   main items   filler items
A1             acc.ok.03    F03
A2             acc.bad.04   F16
A3             acc.ok.07    F09
               acc.bad.08   F01
               acc.ok.09    F17
               acc.bad.06   F15
               acc.ok.05    F19
               acc.bad.02   F08
               acc.ok.01    F14
               acc.bad.10   F18
                            F12
                            F11
                            F05
                            F20
                            F04
                            F07
                            F02
                            F06
                            F10
                            F13

List 2 (randomized):

anchor items   main items   filler items
A1             acc.bad.03   F07
A2             acc.ok.04    F08
A3             acc.bad.07   F17
               acc.ok.10    F19
               acc.bad.05   F15
               acc.ok.06    F11
               acc.bad.01   F20
               acc.ok.08    F06
               acc.bad.09   F13
               acc.ok.02    F01
                            F10
                            F16
                            F05
                            F04
                            F18
                            F02
                            F09
                            F03
                            F12
                            F14
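
The randomization and interleaving just illustrated can also be done directly in R instead of Excel. A minimal sketch, reusing the hypothetical list1 vector from the earlier sketch:

# Randomize acc.ok, acc.bad, and filler items separately, then interleave
# the acc.ok and acc.bad items so that the two condition types alternate.
set.seed(1)  # only so the example is reproducible
acc_ok  <- sample(grep("acc\\.ok",  list1, value = TRUE))
acc_bad <- sample(grep("acc\\.bad", list1, value = TRUE))
filler  <- sample(sprintf("F%02d", 1:20))
acc_item <- as.vector(rbind(acc_ok, acc_bad))  # alternating ok/bad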

Next, we combine the main and filler items pseudorandomly, ensuring that for any two main items there is at least one filler item between them. Additionally, the pseudorandomization involves a constraint that the sequence ‘main.item-filler-main.item’ does not occur more than five times. Since there are 10 main items and nine gaps between them (main1-gap1-main2-gap2-main3-gap3-main4-gap4-main5-gap5-main6-gap6-main7-gap7-main8-gap8-main9-gap9-main10), this constraint ensures that such sequences occur in at most five of the nine gaps. This condition is crucial for two reasons. First, if this pattern happens too frequently, participants might easily deduce the experiment’s intent. Second, without this condition, the pseudorandomization calculations in R, as demonstrated below, would require numerous trials.

# Load necessary libraries.
install.packages("dplyr")
install.packages("openxlsx")
library(dplyr)
library(openxlsx)

# Define acc_item and fillers as input data.
acc_item <- c("acc.ok.03", "acc.bad.04", "acc.ok.07", "acc.bad.08", "acc.ok.09", 
              "acc.bad.06", "acc.ok.05", "acc.bad.02", "acc.ok.01", "acc.bad.10")
fillers <- c("F03", "F16", "F09", "F01", "F17.high", "F15", "F19.high", "F08", "F14", 
             "F18.low", "F12", "F11", "F05", "F20.low", "F04", "F07", "F02", "F06", "F10", "F13")

# Create a list to store the output.
final_sequence <- list()

# Manage the index for fillers.
filler_index <- 1

# Loop through each item in acc.item.
for (i in 1:length(acc_item)) {
  # Add the current acc.item.
  final_sequence <- append(final_sequence, acc_item[i])
  
  # Only add fillers if there are more acc.items left and fillers remain
  # (the second check avoids indexing past the end of fillers, which would
  # otherwise introduce NA values).
  if (i < length(acc_item) && filler_index <= length(fillers)) {
  # Pseudorandomly add 1 to 8 fillers. 
  # Reason 1 for setting the upper limit to 8: If 9 fillers are added consecutively even once, it would create at least 5 acc.item-filler-acc.item sequences, making it easier for participants to guess the experiment's intent.
  # Reason 2 for setting the upper limit to 8: If the upper limit is too large, errors are more likely to occur as a result of the following process: Fillers are inserted between acc.items starting from the beginning, but if all 20 fillers are used up too early, NA values will occur. A higher upper limit increases the likelihood of running out of fillers, making it difficult to avoid errors.
    num_fillers <- sample(1:8, 1)
    fillers_to_add <- fillers[filler_index:min(filler_index + num_fillers - 1, length(fillers))]
    final_sequence <- append(final_sequence, fillers_to_add)
    
  # Update the index.
    filler_index <- filler_index + length(fillers_to_add)
  }
}

# Add remaining fillers at the end if any.
if (filler_index <= length(fillers)) {
  final_sequence <- append(final_sequence, fillers[filler_index:length(fillers)])
}

# Convert the final sequence to a data frame. unlist() collapses the list
# into a character vector; passing the list directly to data.frame() would
# create one column per element.
final_df <- data.frame(Item = unlist(final_sequence))

# Display the result.
print(final_df)
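
Since sample(1:8, 1) only makes the five-occurrence cap likely rather than certain, a check like the following (my addition, not part of the original procedure) can verify both constraints after each run:

# Verify that no two main items are adjacent and that the
# main.item-filler-main.item pattern occurs at most five times.
seq_vec <- unlist(final_sequence)
is_main <- grepl("^acc\\.", seq_vec)
n <- length(is_main)
adjacent_mains <- any(is_main[-n] & is_main[-1])
mfm_count <- sum(is_main[1:(n - 2)] & !is_main[2:(n - 1)] & is_main[3:n])
stopifnot(!adjacent_mains, mfm_count <= 5)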

After pseudorandomization, the lists will look something like this:

List 1 (pseudorandomized):
A1
A2
A3
acc.ok.03
F03
F16
F09
acc.bad.04
F01
F17
acc.ok.07
F15
acc.bad.08
F19
acc.ok.09
F08
F14
acc.bad.06
F18
F12
F11
F05
F20
F04
acc.ok.05
F07
F02
F06
acc.bad.02
F10
acc.ok.01
F13
acc.bad.10
List 2 (pseudorandomized):
A1
A2
A3
acc.bad.03
F07
acc.ok.04
F08
F17
F19
F15
F11
F20
acc.bad.07
F06
acc.ok.10
F13
acc.bad.05
F01
F10
F16
F05
acc.ok.06
F04
F18
acc.bad.01
F02
acc.ok.08
F09
F03
acc.bad.09
F12
F14
acc.ok.02

Next, we counterbalance the main + filler items in each list by creating four different orders: original, reversed, split, and split-reversed, as outlined below (e.g., Sprouse and Almeida 2012; Sprouse 2018). The reversed order is simply the reverse of the original order. In the split order, the first and second halves of the items from the original order are swapped. The split-reversed order is the reverse of the split order. Counterbalancing helps neutralize the influence of item order (e.g., tiredness, boredom, and loss of intuitions) on participants’ acceptability judgments.

original:        1 2 3 4
reversed:        4 3 2 1
split:           3 4 1 2
split-reversed:  2 1 4 3
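
The four orders can be derived mechanically rather than by hand. A minimal sketch, where items stands for the main + filler sequence of a list:

# Given a vector of items in the original order, return all four
# counterbalanced orders.
counterbalance <- function(items) {
  n <- length(items)
  half <- n %/% 2
  split <- c(items[(half + 1):n], items[1:half])  # swap the two halves
  list(original = items,
       reversed = rev(items),
       split = split,
       split.reversed = rev(split))
}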

After counterbalancing the two lists as described above, we add the three anchor sentences at the beginning in the same order for all versions, resulting in eight different sets of items. Note that when creating a split order by swapping the first and second halves of the original sequence, you might end up with two consecutive acc.items without any filler items in between. This issue occurred in our split data, and I manually adjusted the placement of the affected acc.items: the sequences acc.ok.03-F03-F16 and F14-acc.ok.02 were changed to F03-F16-acc.ok.03 and acc.ok.02-F14, respectively.

List1.original
A1
A2
A3
acc.ok.03
F03
F16
F09
acc.bad.04
F01
F17
acc.ok.07
F15
acc.bad.08
F19
acc.ok.09
F08
F14
acc.bad.06
F18
F12
F11
F05
F20
F04
acc.ok.05
F07
F02
F06
acc.bad.02
F10
acc.ok.01
F13
acc.bad.10
List1.reversed
A1
A2
A3
acc.bad.10
F13
acc.ok.01
F10
acc.bad.02
F06
F02
F07
acc.ok.05
F04
F20
F05
F11
F12
F18
acc.bad.06
F14
F08
acc.ok.09
F19
acc.bad.08
F15
acc.ok.07
F17
F01
acc.bad.04
F09
F16
F03
acc.ok.03
List1.split
A1
A2
A3
F18
F12
F11
F05
F20
F04
acc.ok.05
F07
F02
F06
acc.bad.02
F10
acc.ok.01
F13
acc.bad.10
F03
F16
acc.ok.03
F09
acc.bad.04
F01
F17
acc.ok.07
F15
acc.bad.08
F19
acc.ok.09
F08
F14
acc.bad.06
List1.split-reversed
A1
A2
A3
acc.bad.06
F14
F08
acc.ok.09
F19
acc.bad.08
F15
acc.ok.07
F17
F01
acc.bad.04
F09
acc.ok.03
F16
F03
acc.bad.10
F13
acc.ok.01
F10
acc.bad.02
F06
F02
F07
acc.ok.05
F04
F20
F05
F11
F12
F18
List2.original
A1
A2
A3
acc.bad.03
F07
acc.ok.04
F08
F17
F19
F15
F11
F20
acc.bad.07
F06
acc.ok.10
F13
acc.bad.05
F01
F10
F16
F05
acc.ok.06
F04
F18
acc.bad.01
F02
acc.ok.08
F09
F03
acc.bad.09
F12
F14
acc.ok.02
List2.reversed
A1
A2
A3
acc.ok.02
F14
F12
acc.bad.09
F03
F09
acc.ok.08
F02
acc.bad.01
F18
F04
acc.ok.06
F05
F16
F10
F01
acc.bad.05
F13
acc.ok.10
F06
acc.bad.07
F20
F11
F15
F19
F17
F08
acc.ok.04
F07
acc.bad.03
List2.split
A1
A2
A3
F10
F16
F05
acc.ok.06
F04
F18
acc.bad.01
F02
acc.ok.08
F09
F03
acc.bad.09
F12
acc.ok.02
F14
acc.bad.03
F07
acc.ok.04
F08
F17
F19
F15
F11
F20
acc.bad.07
F06
acc.ok.10
F13
acc.bad.05
F01
List2.split-reversed
A1
A2
A3
F01
acc.bad.05
F13
acc.ok.10
F06
acc.bad.07
F20
F11
F15
F19
F17
F08
acc.ok.04
F07
acc.bad.03
F14
acc.ok.02
F12
acc.bad.09
F03
F09
acc.ok.08
F02
acc.bad.01
F18
F04
acc.ok.06
F05
F16
F10

Since there are 8 versions of item sets, the required number of participants in the power calculations below will be addressed in multiples of 8.

3.3 Procedures

The experiment will take place online. Participants will first answer a questionnaire about their language background and familiarity with linguistics. Then, they will be introduced to the concept of sentence naturalness as used in this experiment. The instructions will emphasize that participants should disregard prescriptive grammar rules, the likelihood of the sentence being spoken in real life, and the plausibility of the sentence content. Following Schütze and Sprouse (2014), the experiment will convey these points by providing the Japanese counterparts of the following passage.

(6) In determining whether a sentence sounds natural or unnatural, you can imagine a conversation with a friend and consider whether the sentence would make your friend sound like a native Japanese speaker. The experiment is not concerned with whether the sentence is “good Japanese” for writing, the best way to convey the speaker’s idea, how often it is used in daily speech, or the likelihood of the provided context occurring in real life. You should ignore the Japanese grammar you learned in school and any prescriptive rules you have heard of (e.g., ra-nuki speech). The focus is on whether the sentence could be naturally spoken by native speakers, assuming no production errors.

After the explanation, participants will be presented with three anchor sentences: one that is clearly unnatural (acceptability = 1 or 2), one with controversial acceptability (acceptability = 4), and one that is clearly natural (acceptability = 6 or 7). Following this, participants will proceed to rate the main items.

The experiment is expected to take around 30 minutes, including instructions and brief questionnaires.

4 Power analysis

This section details how I estimated the number of participants required for this experiment. Since no existing dataset seems available for conducting simulations, Section 4.1 begins by creating a data frame. This data frame will be used to fit a model for simulations. To determine the values of variables in the model, I will reference the acceptability judgment data from Linzen and Oseki’s (2018) experiment on a Japanese linguistic phenomenon, which is closely related to the one being examined in this study. Therefore, Section 4.2 analyzes Linzen and Oseki’s data and fits a linear mixed-effects model. Based on this analysis, Section 4.3 fits a model to be used in the simulations. Finally, Section 4.4 conducts simulations using the simr package in R and discusses the simulation results.

4.1 Design of a data frame

To fit a mixed-effects model for simulations when no data related to the experiment are available, the first step is to determine the necessary covariates and create a data frame on which the model will later be fitted. In this study, I decided to include four covariates: item, subject, condition, and order (which refers to the sequence in which sentences are presented to participants).

First, since the experiment examines the difference in acceptability between sentences presented in two different contexts, the condition covariate has two levels. Following treatment coding, I assign the values 0 and 1 to these levels. These correspond to copular sentences that can and cannot involve the predicate accusative case, respectively. For example, sentence (1) in context (2a) is assigned 0, while in context (2b) it is assigned 1.

As for the other covariates, I set them to contain eight levels. This decision is based on the general guideline that acceptability judgment experiments should involve at least eight items (see Section 3.2). The number of levels for each covariate can be increased after the initial “basic” data frame is created. In other words, I initially set the number of levels for item to match the minimum required number of items and then set subject and order to also have eight levels. This allows for a gradual increase in the number of levels for each covariate based on the results of power analysis simulations.

For the item and subject covariates, each level is assigned a number from 1 to 8. For order, each level is assigned a value n such that n ∈ {0, …, 7}, where n represents the (n + 1)-th position in the presentation sequence. For instance, when order is 0, the sentence was presented to the subject first. Setting the minimum value of order to 0 simplifies the interpretation of the effect size of condition in the context where order equals 0.

After determining the number of levels for each covariate, we use the data.frame function to organize the covariates so that participants see only one condition per item (i.e., a Latin square design) and the sentences are presented in different orders. The resulting data frame is printed below.

# Setting up minimal 8 items.
item <- factor(1:8)
# Setting up condition. I use the treatment coding, so I need to use c() instead of factor() to make a numeric vector even if condition is a categorical predictor; the use of factor causes an issue when fitting a model with makeLmer() below. 
condition <-c(0,1)
# Setting up 8 subjects. The following (as opposed to factor(1:8)) enables each participant to see both condition 0 and condition 1 sentences.
subj <- c(factor(1:8), factor(8:1))
# Repeating each item 8 times so that each subject sees each item.
item_full <- rep(item, each=8)
# Repeating the condition in the way that for each item, half of the subjects see condition 0 and the other half see condition 1. 
condition_full <- rep(rep(condition, each=4), 8)
# Repeating subj so that each sentence is seen by a subject.
subj_full <- rep(subj, 4)
# n ∈ {0, ..., 7}. This way, the interpretation of the intercept will make more sense; "the value in the case of order 0" can essentially mean "the value in the case of order 1".
# The order of these numbers is such that "subject # = order # +1" for item1, "subject # = order #" for item2, "subject # = order # -1" for item3, etc.
order_full <- c(0,1,2,3,4,5,6,7,0,7,6,5,4,3,2,1,2,3,4,5,6,7,0,1,2,0,1,7,6,5,4,3,4,5,6,7,0,1,2,3,4,3,2,1,0,7,6,5,6,7,0,1,2,3,4,5,6,5,4,3,2,1,0,7)
# Building the data frame based on the above.
df <- data.frame(item=item_full, condition=condition_full, subject=subj_full, order = order_full)
# Printing the data frame.
print(df)
##    item condition subject order
## 1     1         0       1     0
## 2     1         0       2     1
## 3     1         0       3     2
## 4     1         0       4     3
## 5     1         1       5     4
## 6     1         1       6     5
## 7     1         1       7     6
## 8     1         1       8     7
## 9     2         0       8     0
## 10    2         0       7     7
## 11    2         0       6     6
## 12    2         0       5     5
## 13    2         1       4     4
## 14    2         1       3     3
## 15    2         1       2     2
## 16    2         1       1     1
## 17    3         0       1     2
## 18    3         0       2     3
## 19    3         0       3     4
## 20    3         0       4     5
## 21    3         1       5     6
## 22    3         1       6     7
## 23    3         1       7     0
## 24    3         1       8     1
## 25    4         0       8     2
## 26    4         0       7     0
## 27    4         0       6     1
## 28    4         0       5     7
## 29    4         1       4     6
## 30    4         1       3     5
## 31    4         1       2     4
## 32    4         1       1     3
## 33    5         0       1     4
## 34    5         0       2     5
## 35    5         0       3     6
## 36    5         0       4     7
## 37    5         1       5     0
## 38    5         1       6     1
## 39    5         1       7     2
## 40    5         1       8     3
## 41    6         0       8     4
## 42    6         0       7     3
## 43    6         0       6     2
## 44    6         0       5     1
## 45    6         1       4     0
## 46    6         1       3     7
## 47    6         1       2     6
## 48    6         1       1     5
## 49    7         0       1     6
## 50    7         0       2     7
## 51    7         0       3     0
## 52    7         0       4     1
## 53    7         1       5     2
## 54    7         1       6     3
## 55    7         1       7     4
## 56    7         1       8     5
## 57    8         0       8     6
## 58    8         0       7     5
## 59    8         0       6     4
## 60    8         0       5     3
## 61    8         1       4     2
## 62    8         1       3     1
## 63    8         1       2     0
## 64    8         1       1     7

4.2 Linzen and Oseki (2018)

Linzen and Oseki (2018) examine the acceptability judgments of several Japanese linguistic phenomena discussed in the literature, which the authors find questionable. One such phenomenon involves the relationship between morphological cases and sentence meanings in Japanese, as exemplified in (7).

(7) a. Taro-wa migime-dake-o tumur-e-ru.
       • Taro-wa: Taro-Top
       • migime-dake-o: right.eye-only-Acc
       • tumur-e-ru: close-can-Pres
       • Meaning: ‘Taro can wink his right eye.’

    b. Taro-wa migime-dake-ga tumur-e-ru.
       • Taro-wa: Taro-Top
       • migime-dake-ga: right.eye-only-Nom
       • tumur-e-ru: close-can-Pres
       • Meaning (not available): ‘Taro can wink his right eye.’
       • Meaning: ‘It is only the right eye that Taro can close.’

(Linzen and Oseki 2018, 8)

Tada (1992) observes that only the sentence with accusative case in (7a) is grammatical with the intended meaning. Linzen and Oseki (2018) question the existence of the contrast between (7a) and (7b), but the results of their experiment show a significant difference in acceptability between the two sentences. They report a difference in mean acceptability ratings of 1.19 on a 7-point scale.

It’s worth noting that the sequence of words in (7b) itself is acceptable; the sentence is not acceptable with the “wink reading,” but it is acceptable when interpreted as meaning that it is only the right eye that Taro can close. The phenomenon exemplified in (7) is similar to the one being examined in the current experiment in this respect, as well as in the fact that both phenomena involve the relationship between case and semantics.

In the remainder of this section, I will analyze the acceptability judgments in more detail. To do this, I first transform the data collected by Linzen and Oseki (2018) into z-scores. This helps minimize potential scale biases, such as scale compression and skew (e.g., Schütze and Sprouse 2014), and allows for the assessment of the relative impacts of multiple predictors.
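
Concretely, ave() with FUN = scale standardizes each rating against its own subject’s mean and standard deviation. In the notation below (mine, not Linzen and Oseki’s), \(x_{ij}\) is subject \(j\)’s raw rating of sentence \(i\):

\[
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}
\]

where \(\bar{x}_j\) and \(s_j\) are the mean and standard deviation of subject \(j\)’s ratings.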

# Loading the dataset kindly provided by Linzen and Oseki (2018) after setting a working directory.
library(tidyverse)
LandO <- read_csv('japanese_paired.csv')

# Z-scores are computed separately within each subject: grouping the responses by LandO$subject means each participant's ratings are standardized against that participant's own mean and standard deviation.
LandO$response_z <- ave(LandO$response, LandO$subject, FUN=scale)

Since the original dataframe includes data on various types of sentences in addition to (7a) and (7b), I then filter the dataframe to include only the data related to example (7).

# Filtering the dataframe to include only the data related to example (7).  
LandO.case <-LandO %>% filter(group == '5')

We are now ready to examine whether there is a significant difference between sentences (7a) and (7b). First, consider the scatterplot with a trend line shown in Figure 4.1.

# Building a scatterplot with a trend line to see if there is a significant difference between (7a) and (7b).
ggplot(data = LandO.case, mapping = aes(x= member, y = response_z)) + 
  geom_point() + 
  geom_smooth(method = 'lm') +
  theme_minimal()

Figure 4.1: Scatterplot with a trend line showing the relationship between the z-transformed acceptability scores and the sentence members. The y-axis represents the z-scores for sentences (7a) and (7b), while the x-axis indicates the sentence member, with 0 corresponding to (7a) and 1 to (7b).

The scatterplot suggests that sentence (7a) is generally more acceptable than sentence (7b), with a correlation coefficient of approximately -0.36 (see below).

# Calculating the correlation coefficient between "member" and "response_z".
# "Member" is a binary covariate where "1" refers to sentence (7a) that is predicted to sound natural with the intended 'wink' reading, and "0" refers to sentence (7b) that is predicted to sound unnatural with the intended reading.
with(LandO.case, cor(member, response_z))
## [1] -0.3574485

However, there is some individual variability, as shown in the data.

# Building a scatterplot with a trend line for each subject. 
ggplot(data = LandO.case, mapping = aes(x= member, y = response_z)) + 
  facet_wrap(~ subject) +
  geom_point() + 
  geom_smooth(method = 'lm') +
  theme_minimal()

Figure 4.2: Scatterplot with a trend line for each subject. The y and x axes indicate the z-transformed acceptability score of (7a)/(7b) and member (0 and 1 refer to (7a) and (7b) respectively), respectively.

To examine whether the observed correlation indicates a significant difference between sentences (7a) and (7b), we will fit a linear mixed-effects model with the z-transformed acceptability score (response_z) as a function of member.

Given that the acceptability score for (7a) and the difference in acceptability between (7a) and (7b) may vary across participants, the model will include both a by-participant varying intercept and a by-participant varying slope.

# Installing packages to fit models. 'lmerTest' is used to calculate p-values.
install.packages('Matrix')
install.packages('lme4')
install.packages('lmerTest')
library(Matrix)
library(lme4)
library(lmerTest)
# Modeling response_z as a function of member with a by-participant varying intercept and a by-participant varying slope.
mod_LandO <- lmer(response_z ~ member + (1 + member || subject), data = LandO.case)

# Printing a summary of mod_LandO
summary(mod_LandO)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: response_z ~ member + (1 + member || subject)
##    Data: LandO.case
## 
## REML criterion at convergence: 403.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.6393 -0.6079  0.1017  0.7118  1.7452 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  subject   (Intercept) 0.11594  0.3405  
##  subject.1 member      0.07209  0.2685  
##  Residual              0.41204  0.6419  
## Number of obs: 178, groups:  subject, 89
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  0.71821    0.07702 87.99971   9.325 8.80e-15 ***
## member      -0.57164    0.10035 88.00005  -5.697 1.59e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##        (Intr)
## member -0.599

Assuming an α level of 0.05, the model summary indicates that member (i.e., the case manipulation) has a significant effect on the z-transformed acceptability judgments (p < 0.001). However, any violations of the assumptions of homoscedasticity and normality regarding the residuals could impact the accuracy and precision of the estimated coefficients, confidence intervals, and hypothesis tests. Therefore, I will examine whether the model mod_LandO meets these assumptions.

res <- residuals(mod_LandO)

# Setting graphical parameters to generate plot matrix:
par(mfrow = c(1, 3))

# Plot 1, histogram:
hist(res)

# Plot 2, Q-Q plot:
qqnorm(res)
qqline(res)

# Plot 3, residual plot: 
plot(fitted(mod_LandO), res)

Based on the plots, it seems that mod_LandO is not grossly wrong, which is the main thing we wanted to confirm with these diagnostics (e.g., Faraway 2016).

Using the values from the table above, the next section will fit a model for use in simulations.

4.3 Fitting a model

This section will fit the linear mixed-effects model described in (8), incorporating both by-subject and by-item random effects to address potential violations of the independence assumption, which can significantly inflate Type I error rates. Including these random effects also allows for the investigation of variations in acceptability judgments across different participants and items.

(8) y ~ condition + order + (1 + condition + order || subject) + (1 + condition + order || item) + ε

In what follows, I will estimate the fixed effects, random effects, and residuals in the model in (8) in turn.

4.3.1 Fixed effects

This section discusses the three fixed effects: intercept, condition, and order. I explain why the model in (8) should include these fixed effects but exclude the interaction between condition and order, as well as how I determined the effect size of each variable. First, I set the fixed intercept at 0.6, which is slightly smaller than the fixed intercept in the model for Linzen and Oseki’s data. This adjustment means that when order = 0, Acc.yes sentences (e.g., sentence (1) uttered in context (2a)) are predicted to be slightly less grammatical than sentence (7a) on average. However, it is important to note that the value of the fixed intercept does not significantly impact the statistical power that this paper is concerned with, as illustrated in Section 4.4.

Next, I turn to the condition variable. This variable captures the contextual effect on the availability of the accusative case, which is the primary focus of the current experiment. Therefore, it is essential to include this variable in the model to allow for the calculation of the power to detect its effect. Regarding its fixed effect size, I adopt -0.57, the estimate for member in mod_LandO. The difference in acceptability between Acc.yes and Acc.no sentences may well be less pronounced than the difference between (7a) and (7b), so this value will serve only as a baseline in the simulations, where I also explore the power to detect smaller effect sizes (e.g., -0.47 in Section 4.4).

Next, I turn to the variable order, which pertains to the order in which sentences are presented to participants. Based on my consultation experience, the more examples people consider, the higher ratings they tend to give to sentences uttered in both conditions. Two factors that may contribute to this tendency are:

(9) a. Japanese copular sentences with a predicate accusative case are not used frequently in writing.
    b. While people do use these sentences in spoken language, their counterparts without a predicate accusative case are much more common.

These factors suggest that as participants encounter more examples, they may become more lenient in their judgments, potentially leading to higher ratings across the board. This is why I assume a positive slope for order, with an effect size of 0.03.

The effect of order on sentence acceptability can vary in size (and even direction) across different experiments, making it challenging to estimate its precise effect size. However, the rationale behind the assumed effect size of order is my hypothesis that, on average, acceptability would increase by 0.2 points as the order changes from 0 to 7. Given that the effect size of order reflects the predicted change in acceptability when the order increases by one level, and that there are seven one-level increases between order 0 and order 7, the per-level change is 0.2/7 ≈ 0.03. Therefore, I assume the effect size for order to be 0.03. It is important to note that the fixed effect of order is not expected to significantly impact the power to detect the effect of condition.
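
The arithmetic behind this choice, checked in R:

# Hypothesized total increase of 0.2 points spread over the seven
# one-step increases from order 0 to order 7:
0.2 / 7  # ~0.029, rounded to 0.03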

Lastly, I explain why the model in (8) does not include the interaction between condition and order. An interaction should be included if the influence of one predictor on the response depends on another predictor. In this context, the question is whether the influence of the order effect —or more specifically, the two factors outlined in (9)— depends on the two levels of condition. After considering this question, I believe that while an interaction could potentially exist, there is no solid reason to assume it. For example, if many participants assign a score of 7 to Acc.yes sentences early in the experiment, the factors in (9) could only contribute to an increase in the acceptability of Acc.no sentences. However, it is unlikely that even Acc.yes sentences will frequently receive a score of 7 due to the presence of anchor and filler sentences, some of which will be uncontroversially natural. Therefore, the model in (8) does not include the interaction between condition and order.

The discussion of the fixed effects will be summarized in a table at the end of the next subsection.

4.3.2 Random effects

This section covers the six random effects included in the model in (8): by-subject and by-item random intercepts, random slopes for condition, and random slopes for order. I explain why incorporating these random effects makes sense and how I determined their values. Additionally, this section addresses the residual standard deviation of the model in (8).

First, including random intercepts for both subjects and items is logical because the acceptability of Acc.yes sentences presented at order = 0 is likely to vary among subjects and items. Although the presence of filler and anchor items might reduce such variability, I expect that the acceptability of Acc.yes sentences will vary more than that of sentence (7a). Given that the by-subject intercept in mod_LandO has a standard deviation of 0.34, I set the standard deviation of the by-subject intercept in the model in (8) to 0.45. This means that 95% of the by-subject intercepts are expected to fall within the interval from -0.3 (0.6 - 2 × 0.45) to 1.5 (0.6 + 2 × 0.45).
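
A quick check of this interval in R, approximating the 1.96 multiplier of a 95% normal interval by 2:

# Interval implied by a fixed intercept of 0.6 and an intercept SD of 0.45.
0.6 + c(-2, 2) * 0.45  # -0.3 and 1.5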

I also set the standard deviation of the by-item random intercept to 0.45, for similar reasons. Different Acc.yes sentences likely have varying frequencies of use in everyday language, and more frequently used sentences are predicted to receive higher ratings. For example, consider sentence (1) again, which is repeated below as (10).

(10) kyoo-wa onigiri-o mit-tu dayo
     • kyoo-wa: today-TOP
     • onigiri-o: rice.ball-Acc
     • mit-tu: three-Classifier
     • dayo: copula
     • Meaning: ‘Today is three rice balls.’

It is assumed that the Acc.yes sentence in (10) (or, more generally, the structure Today is <food name>) is more common than other Acc.yes sentences, such as the one in (11).

(11) kongetu-wa A-gumi-no seeto-o san-nin desu
     • kongetu-wa: this.month-TOP
     • A-gumi-no: A-class-Gen
     • seeto-o: student-Acc
     • san-nin: three-Classifier
     • desu: copula
     • Meaning: ‘This month is three students in class A.’

Sentence (11) is predicted to receive a lower rating than sentence (10) due to its lower frequency of use in daily life. Therefore, including the by-item random intercept is reasonable, and I assume its standard deviation to be identical to that of the by-subject random intercept.

Next, I address the random slopes for each variable. Both condition and order are observation-level variables with respect to subject and item, since they vary within subjects and items. Therefore, it is important to consider whether by-subject and by-item random slopes should be included in the model.

First, the by-subject random slope for condition should be included. The current experiment is motivated by the potential for individual variability in the acceptability of the main items, making the by-subject random slope for condition crucial to examine. I will set its standard deviation to 0.3, based on the standard deviation of the random slope for member in mod_LandO, which is 0.27.

The by-item random slope for condition should also be included. It is reasonable to assume that different items will show varying differences in acceptability between Acc.yes and Acc.no sentences. For example, items with a low frequency of word sequences may exhibit a larger difference in acceptability between Acc.yes and Acc.no sentences than items with more common word sequences. Estimating the standard deviation for this random slope is challenging, so I will assume it to be the same as that of the by-subject random slope for condition, i.e., 0.3.

It makes sense to include both by-subject and by-item random slopes for order, as the effect of order on the acceptability of Acc.yes sentences is likely to vary among subjects and items. Different individuals may be influenced by the factors in (9) to varying degrees, with some assigning higher ratings to Acc.yes sentences presented later. Similarly, certain sentences may be more affected by order than others. For example, sentences with more common word sequences may be less impacted by order since they are less influenced by the factors in (9) to begin with.

Regarding the standard deviations of the by-subject and by-item random slopes for order, I assume a value of 0.03. This means that when comparing the acceptability of an Acc.yes sentence at order = n to one at order = n + 1, in 68% of cases the latter will fall between x (x + 0.03 (the fixed effect of order) - 0.03 (one SD of the random slope for order)) and x + 0.06 (x + 0.03 + 0.03), where x is the acceptability at order = n. For instance, consider comparing the acceptability of an Acc.yes sentence at order = 0 to one at order = 7. If the acceptability at order = 0 is the fixed intercept of 0.6, in 68% of cases the acceptability at order = 7 would fall between 0.6 (0.6 + (0.03 - 0.03) × 7) and 1.02 (0.6 + (0.03 + 0.03) × 7). The difference between 0.6 and 1.02 is 0.42, which is a substantial order effect given that the fixed effect of condition is -0.57. Thus, a random slope value of 0.03 for order is conservative enough to help avoid a Type II error. Additionally, the random slopes for order do not significantly impact the power of the analysis this paper focuses on (see Section 4.4), making 0.03 a reasonable estimate.
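
The endpoint arithmetic for the order = 7 comparison can be checked directly:

# Acceptability at order = 7 when the per-step slope is the fixed effect
# (0.03) minus or plus one SD of the random slope (0.03):
0.6 + (0.03 - 0.03) * 7  # 0.60
0.6 + (0.03 + 0.03) * 7  # 1.02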

Finally, I will assume the residual standard deviation to be 0.7, which is slightly larger than the residual standard deviation of 0.64 in mod_LandO. I’ll use 0.7 as a baseline and conduct simulations with a range of residual standard deviations in the next section.

With all the values for the fixed and random effects, as well as the residuals, now determined, the lmer model in (8) can be fitted using the makeLmer function from the simr package.

install.packages('simr')
library(simr)
# Making a vector of fixed effect values.
fixed <- c(0.6,-0.57,0.03)

# Assigning variance values for the by-subject and by-item random effects based on the standard deviations estimated above (variance = SD squared).
SubVCa <- .2025   # by-subject intercept: 0.45^2
SubVCb <- .09     # by-subject condition slope: 0.3^2
SubVCc <- 9e-04   # by-subject order slope: 0.03^2

ItemVCa <- .2025  # by-item intercept: 0.45^2
ItemVCb <- .09    # by-item condition slope: 0.3^2
ItemVCc <- 9e-04  # by-item order slope: 0.03^2

# Assigning a number to residual standard deviation. 
res <- 0.7

# Fitting a model based on the above values. 
model <- makeLmer(acceptability_z ~ condition + order + (1 + condition + order ||subject) + (1 + condition + order ||item), fixef = fixed, VarCorr = list(SubVCa,SubVCb,SubVCc,ItemVCa,ItemVCb,ItemVCc), sigma = res, data = df)

# Printing a summary of model
summary(model)
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## acceptability_z ~ condition + order + ((1 | subject) + (0 + condition |  
##     subject) + (0 + order | subject)) + ((1 | item) + (0 + condition |  
##     item) + (0 + order | item))
##    Data: df
## 
## REML criterion at convergence: 151
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.98726 -0.46553  0.09258  0.43269  1.38831 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  subject   (Intercept) 0.2025   0.45    
##  subject.1 condition   0.0900   0.30    
##  subject.2 order       0.0009   0.03    
##  item      (Intercept) 0.2025   0.45    
##  item.1    condition   0.0900   0.30    
##  item.2    order       0.0009   0.03    
##  Residual              0.4900   0.70    
## Number of obs: 64, groups:  subject, 8; item, 8
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  0.60000    0.29253   2.051
## condition   -0.57000    0.23203  -2.457
## order        0.03000    0.04264   0.704
## 
## Correlation of Fixed Effects:
##           (Intr) condtn
## condition -0.230       
## order     -0.446 -0.002

The values I have set for the mixed-effects model are summarized below, with the numbers for the random effects presented as variances rather than standard deviations.

# Installing a package for tab_model().
install.packages('sjPlot')
library(sjPlot)
tab_model(model, 
          show.p = FALSE, 
          show.ci = FALSE,  
          show.icc = FALSE,
          show.r2 = FALSE,
          digits.re = 4)
                          acceptability_z
Predictors                Estimates
(Intercept)               0.60
condition                 -0.57
order                     0.03
Random Effects
σ2                        0.4900
τ00 subject               0.2025
τ00 item                  0.2025
τ11 subject.condition     0.0900
τ11 subject.order         0.0009
τ11 item.condition        0.0900
τ11 item.order            0.0009
ρ01                       (none; correlations suppressed by ||)
N subject                 8
N item                    8
Observations              64

4.4 Simulation

Based on the model created in the previous section, this section calculates the power to detect the effect of condition using Monte Carlo simulation methods. To do this, I apply the powerSim function from the simr package to the model in (8), using the variable values discussed earlier. Running the simulation 3,000 times indicates that the power to detect the effect with α = 0.05 is 0.62, which is below the conventional target power of 0.8.

pow.model <- powerSim(model, nsim = 3000, test = fixed("condition", "lr"))
print(pow.model)
## Power for predictor 'condition', (95% confidence interval):
##       62.27% (60.50, 64.01)
## 
## Test: Likelihood ratio
##       Effect size for condition is -0.57
## 
## Based on 3000 simulations, (83 warnings, 0 errors)
## alpha = 0.05, nrow = 64
## 
## Time elapsed: 0 h 8 m 43 s

To increase the power, we can increase the sample size by adding more participants or items. First, I increase the number of items from 8 to 10 using the extend function.

# Extending the number of items to 10. 
model_item10 <- extend(model, along = "item", n = 10)

# Creating a dataframe based on model_item10.
df.model_item10 <- getData(model_item10)

#Printing df.model_item10
print(df.model_item10)
##      item condition subject order acceptability_z
## a.1     a         0       1     0    0.8411696686
## a.2     a         0       2     1    2.1945772739
## a.3     a         0       3     2    0.1521702973
## a.4     a         0       4     3    0.6036524535
## a.5     a         1       5     4    0.2379118519
## a.6     a         1       6     5    0.0004401482
## a.7     a         1       7     6    1.6152831852
## a.8     a         1       8     7    0.7102391715
## b.9     b         0       8     0    1.4294646518
## b.10    b         0       7     7    1.1458549076
## b.11    b         0       6     6    2.6509027468
## b.12    b         0       5     5    1.3586626534
## b.13    b         1       4     4    1.2837535940
## b.14    b         1       3     3    0.8427479904
## b.15    b         1       2     2    2.5428461171
## b.16    b         1       1     1    0.2692393700
## c.17    c         0       1     2   -0.4599190375
## c.18    c         0       2     3    1.4181797256
## c.19    c         0       3     4    0.0155597706
## c.20    c         0       4     5   -0.0175332375
## c.21    c         1       5     6   -0.0781214843
## c.22    c         1       6     7    1.2101113415
## c.23    c         1       7     0   -0.0982914100
## c.24    c         1       8     1    0.6683055972
## d.25    d         0       8     2    0.4128037990
## d.26    d         0       7     0   -0.1749995397
## d.27    d         0       6     1    0.1933522798
## d.28    d         0       5     7    0.8831991108
## d.29    d         1       4     6    0.4561931748
## d.30    d         1       3     5   -2.1754123582
## d.31    d         1       2     4    0.1660533624
## d.32    d         1       1     3   -0.9657410779
## e.33    e         0       1     4    1.9908139676
## e.34    e         0       2     5    1.5863014873
## e.35    e         0       3     6    1.4496929156
## e.36    e         0       4     7    1.8051324329
## e.37    e         1       5     0    1.0490757029
## e.38    e         1       6     1    1.1506056145
## e.39    e         1       7     2    1.1986182908
## e.40    e         1       8     3    1.5123596929
## f.41    f         0       8     4   -0.3539050029
## f.42    f         0       7     3    0.3946702490
## f.43    f         0       6     2    0.1799934773
## f.44    f         0       5     1    0.4307214508
## f.45    f         1       4     0   -0.6832497207
## f.46    f         1       3     7   -0.5006510057
## f.47    f         1       2     6    0.9405462334
## f.48    f         1       1     5    0.0216274193
## g.49    g         0       1     6    1.3495880650
## g.50    g         0       2     7    1.8886739375
## g.51    g         0       3     0    0.5241236116
## g.52    g         0       4     1    0.2508943762
## g.53    g         1       5     2    0.0874173499
## g.54    g         1       6     3   -0.9584359302
## g.55    g         1       7     4    0.6616782175
## g.56    g         1       8     5    0.4076005059
## h.57    h         0       8     6    1.0145768009
## h.58    h         0       7     5    0.0454057683
## h.59    h         0       6     4    1.9871391639
## h.60    h         0       5     3    1.5245846698
## h.61    h         1       4     2    1.0496255563
## h.62    h         1       3     1    0.2877175623
## h.63    h         1       2     0   -0.1182632921
## h.64    h         1       1     7   -0.3370471260
## i.1     i         0       1     0    0.8411696686
## i.2     i         0       2     1    2.1945772739
## i.3     i         0       3     2    0.1521702973
## i.4     i         0       4     3    0.6036524535
## i.5     i         1       5     4    0.2379118519
## i.6     i         1       6     5    0.0004401482
## i.7     i         1       7     6    1.6152831852
## i.8     i         1       8     7    0.7102391715
## j.9     j         0       8     0    1.4294646518
## j.10    j         0       7     7    1.1458549076
## j.11    j         0       6     6    2.6509027468
## j.12    j         0       5     5    1.3586626534
## j.13    j         1       4     4    1.2837535940
## j.14    j         1       3     3    0.8427479904
## j.15    j         1       2     2    2.5428461171
## j.16    j         1       1     1    0.2692393700

Running the simulation 3,000 times indicates that the power to detect the effect with α = 0.05 is 0.70, which is still slightly below the target of 0.8.

pow.model_item10 <- powerSim(model_item10, nsim = 3000, test = fixed("condition", "lr") )
pow.model_item10
## Power for predictor 'condition', (95% confidence interval):
##       70.07% (68.39, 71.70)
## 
## Test: Likelihood ratio
##       Effect size for condition is -0.57
## 
## Based on 3000 simulations, (103 warnings, 0 errors)
## alpha = 0.05, nrow = 80
## 
## Time elapsed: 0 h 9 m 6 s

To further increase the power, we could add more items. However, as noted in Section 3.2, this is not ideal because we want to keep the experiment short. Instead, I will increase the number of participants. After increasing the participant count from 8 to 24, I run simulations with a range of different participant numbers using the powerCurve function.

# Extending the number of participants to 24. 
model_item10_subject24 <- extend(model_item10, along = "subject", n = 24)
pc_model_item10_subject24_subj8.16.24 <- powerCurve(model_item10_subject24, fixed("condition", "lr"),
                                                along = "subject",
                                                breaks = c(8,16,24),
                                                nsim=3000)

Figure 4.3 shows the results of calculating the power of the model with 8, 16, and 24 participants.

plot(pc_model_item10_subject24_subj8.16.24)

Figure 4.3: Power curve based on 3,000 simulations of model_item10_subject24 for a range of participant numbers: 8, 16, and 24. The y-axis represents power, and the x-axis represents the number of participants. The dashed line indicates a power of 0.8.

print(pc_model_item10_subject24_subj8.16.24)
## Power for predictor 'condition', (95% confidence interval),
## by number of levels in subject:
##       8: 69.07% (67.38, 70.72) - 80 rows
##      16: 89.40% (88.24, 90.48) - 160 rows
##      24: 95.67% (94.88, 96.37) - 240 rows
## 
## Time elapsed: 0 h 32 m 41 s

The figure shows that when the experiment recruits 16 participants, the original model has a power of 0.89 (95% confidence interval (CI): 0.88, 0.9) for detecting the effect of condition. Therefore, the experiment should recruit at least 16 participants.

The power illustrated in Figure 4.3 is an estimate that is accurate only if all the parameters in model_item10_subject24 (i.e., fixed effects, random effects, and residuals) are accurately or conservatively estimated. However, it is possible that the simulation based on the model overestimates the power. Therefore, the remainder of this section explores how many participants would be required if some variables were assigned less conservative values.

As mentioned earlier, some variables have a significant impact on power, while others do not. For example, changing the fixed intercept value from 0.6 to 0.3 has little effect on the power. Recall that the power of model_item10 is 0.7 (95% CI: 0.68, 0.72).

# Creating model_fe.intercept.0.3_item10 based on model_item10. 
model_fe.intercept.0.3_item10 <- model_item10

# Assigning 0.3 to Intercept of the model created above. 
fixef(model_fe.intercept.0.3_item10)['(Intercept)'] <- 0.3
pow.model_fe.intercept.0.3_item10 <- powerSim(model_fe.intercept.0.3_item10, nsim = 3000, test = fixed("condition", "lr") )
pow.model_fe.intercept.0.3_item10 
## Power for predictor 'condition', (95% confidence interval):
##       68.80% (67.11, 70.46)
## 
## Test: Likelihood ratio
##       Effect size for condition is -0.57
## 
## Based on 3000 simulations, (94 warnings, 0 errors)
## alpha = 0.05, nrow = 80
## 
## Time elapsed: 0 h 9 m 2 s

Changing the fixed order value from 0.03 to 0.06 has little effect on the power.

# Creating model_fe.order.0.06_item10 based on model_item10.
model_fe.order.0.06_item10 <- model_item10

# Assigning 0.06 to order of the model created above. 
fixef(model_fe.order.0.06_item10)['order'] <- 0.06
pow.model_fe.order.0.06_item10 <- powerSim(model_fe.order.0.06_item10, nsim = 3000, fixed("condition", "lr") )
pow.model_fe.order.0.06_item10
## Power for predictor 'condition', (95% confidence interval):
##       69.87% (68.19, 71.51)
## 
## Test: Likelihood ratio
##       Effect size for condition is -0.57
## 
## Based on 3000 simulations, (96 warnings, 0 errors)
## alpha = 0.05, nrow = 80
## 
## Time elapsed: 0 h 9 m 3 s

Changing the random intercepts and the random slopes for order also does not affect the power very much. In what follows, I examine the power of a model where both the by-subject and by-item random intercepts are raised from 0.45 to 0.9, as well as a model where both the by-subject and by-item random slopes for order are raised from 0.03 to 0.06; the simulations below yield powers of 0.70 (95% CI: 0.68, 0.71) and 0.69 (95% CI: 0.67, 0.71), respectively.
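
Note that the VarCorr() assignments below take variances, whereas the values just discussed are standard deviations, so each value must be squared first. A one-line check of the numbers used in the next two chunks:

# SDs must be squared before being assigned as variances via VarCorr().
c(0.9^2, 0.06^2)  # 0.81 and 0.0036, the values assigned below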

# Creating model_srintercept0.9_irintercept0.9_item10 based on model_item10.
model_srintercept0.9_irintercept0.9_item10 <- model_item10

# Assigning a variance of 0.81 (i.e., an SD of 0.9) to the by-subject and by-item random intercepts of the model created above.
VarCorr(model_srintercept0.9_irintercept0.9_item10) <- list(0.81,SubVCb,SubVCc,0.81,ItemVCb,ItemVCc)
pow.model_srintercept0.9_irintercept0.9_item10 <- powerSim(model_srintercept0.9_irintercept0.9_item10, nsim = 3000, fixed("condition", "lr") )
pow.model_srintercept0.9_irintercept0.9_item10
## Power for predictor 'condition', (95% confidence interval):
##       69.53% (67.85, 71.18)
## 
## Test: Likelihood ratio
##       Effect size for condition is -0.57
## 
## Based on 3000 simulations, (74 warnings, 0 errors)
## alpha = 0.05, nrow = 80
## 
## Time elapsed: 0 h 8 m 57 s

# Creating model_srorder0.06_irorder0.06_item10 based on model_item10.
model_srorder0.06_irorder0.06_item10 <- model_item10

# Assigning a variance of 0.0036 (i.e., an SD of 0.06) to the by-subject and by-item random slopes for order of the model created above.
VarCorr(model_srorder0.06_irorder0.06_item10) <- list(SubVCa,SubVCb,0.0036,ItemVCa,ItemVCb,0.0036)
pow.model_srorder0.06_irorder0.06_item10 <- powerSim(model_srorder0.06_irorder0.06_item10, nsim = 3000, fixed("condition", "lr") )
pow.model_srorder0.06_irorder0.06_item10
## Power for predictor 'condition', (95% confidence interval):
##       69.00% (67.31, 70.65)
## 
## Test: Likelihood ratio
##       Effect size for condition is -0.57
## 
## Based on 3000 simulations, (110 warnings, 0 errors)
## alpha = 0.05, nrow = 80
## 
## Time elapsed: 0 h 9 m 15 s

Unlike the fixed and random intercepts and the slopes for order, other variables have a substantial impact on power. For example, if the standard deviation of the random slope for condition is increased from 0.3 to 0.4, the power of the original model drops from 0.70 (95% CI: 0.68, 0.72) to 0.60 (95% CI: 0.58, 0.62).

# Creating model_srcondition0.4_ircondition0.4_item10 based on model_item10.
model_srcondition0.4_ircondition0.4_item10 <- model_item10

# Assigning a variance of 0.16 (i.e., an SD of 0.4) to the by-subject and by-item random slopes for condition of the model created above.
VarCorr(model_srcondition0.4_ircondition0.4_item10) <- list(SubVCa,0.16,SubVCc,ItemVCa,0.16,ItemVCc)
pow.model_srcondition0.4_ircondition0.4_item10 <- powerSim(model_srcondition0.4_ircondition0.4_item10, nsim = 3000, test = fixed("condition", "lr") )
pow.model_srcondition0.4_ircondition0.4_item10
## Power for predictor 'condition', (95% confidence interval):
##       60.17% (58.39, 61.92)
## 
## Test: Likelihood ratio
##       Effect size for condition is -0.57
## 
## Based on 3000 simulations, (104 warnings, 0 errors)
## alpha = 0.05, nrow = 80
## 
## Time elapsed: 0 h 9 m 21 s

Changing the value of the fixed condition from -0.57 to -0.47 also decreases the power, to 0.55 (95% CI: 0.53, 0.56). Since only one value is changed here, whereas the previous model adjusted both the by-subject and by-item random slopes, this simulation suggests that the fixed condition has a much larger effect on power than the random slope for condition.

# Creating model_fe.cond.n0.47_item10 based on model_item10. 
model_fe.cond.n0.47_item10 <- model_item10

# Assigning -0.47 to the fixed condition of the model created above. 
fixef(model_fe.cond.n0.47_item10)['condition'] <- -0.47
pow.model_fe.cond.n0.47_item10 <- powerSim(model_fe.cond.n0.47_item10, nsim = 3000, test = fixed("condition", "lr") )
pow.model_fe.cond.n0.47_item10
## Power for predictor 'condition', (95% confidence interval):
##       54.67% (52.86, 56.46)
## 
## Test: Likelihood ratio
##       Effect size for condition is -0.47
## 
## Based on 3000 simulations, (90 warnings, 0 errors)
## alpha = 0.05, nrow = 80
## 
## Time elapsed: 0 h 9 m 3 s

Next, we consider the effect of the residual on power. In Section 4.3.2, the residual value was set at 0.7. Here, I examine the power of the model when the residual is increased to 1.13, for the following reason. Lane, Hennes, and West (2016) claim that the residual standard deviation is usually larger than the standard deviations of the random effects. In the original model, the random variable with the largest value is the by-subject/by-item random intercept (0.45). While the residual in the original model (0.7) is already larger than this intercept, the question arises as to how much larger the residual can be relative to the largest random effect in general. This is important to consider because larger residual values lead to lower power, so it is necessary to examine power under a higher residual value.

To address this, I reviewed the linear mixed models reported in Sonderegger, Wagner, and Torreira (2018) and Winter (2019) and found that, at most, the residual was 2.5 times larger than the largest random effect within the same model. Although this observation is based on a small sample and the difference could be larger elsewhere, I set the residual value at 1.13 (≈ 0.45 × 2.5). As a result, the power decreases to 0.44 (95% CI: 0.42, 0.46).
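
For readers who want to run the same check on other models, the sketch below is purely illustrative; fit stands for an arbitrary fitted lmer model, not an object defined in this paper. It computes the ratio of the residual SD to the largest random-effect SD:

# Sketch: how much larger is the residual SD than the largest random-effect SD?
# `fit` is a placeholder for an arbitrary lmer model.
vc <- as.data.frame(VarCorr(fit))                       # one row per variance component
sds <- vc$sdcor[vc$grp != "Residual" & is.na(vc$var2)]  # keep SDs only; drop residual and correlation rows
sigma(fit) / max(sds)                                   # this ratio was at most 2.5 in the models I reviewed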

# Creating model_res1.13_item10 based on model_item10. 
model_res1.13_item10 <- model_item10

# Assigning 1.13 to the residual of the model created above. 
sigma(model_res1.13_item10) <- 1.13
pow.model_res1.13_item10 <- powerSim(model_res1.13_item10, nsim = 3000, test = fixed("condition", "lr") )
pow.model_res1.13_item10
## Power for predictor 'condition', (95% confidence interval):
##       44.27% (42.48, 46.07)
## 
## Test: Likelihood ratio
##       Effect size for condition is -0.57
## 
## Based on 3000 simulations, (62 warnings, 0 errors)
## alpha = 0.05, nrow = 80
## 
## Time elapsed: 0 h 9 m 5 s

So far, I have revised the value of one variable in model_item10 at a time and examined its impact on power. However, it is possible that the model contains multiple variables with nonconservative values. Therefore, I now examine the power of a model where the fixed condition is set to -0.47, the by-subject and by-item random slopes for condition are set to 0.4, and the residual is set to 1.13. Refer to Figure 4.4.

# Creating model_fe.cond.n0.47_item10 based on model_item10. 
model_fe.cond.n0.47_item10 <- model_item10

# Assigning -0.47 to the fixed condition of the model created above. 
fixef(model_fe.cond.n0.47_item10)['condition'] <- -0.47

# Assigning 1.13 to the residual of the model created above. 
model_fe.cond.n0.47_res1.13_item10 <- model_fe.cond.n0.47_item10
sigma(model_fe.cond.n0.47_res1.13_item10) <- 1.13

# Assigning a variance of 0.16 (i.e., an SD of 0.4) to the by-subject and by-item random slopes for condition of the model created above.
model_fe.cond.n0.47_res1.13_srcondition0.4_ircondition0.4_item10 <- model_fe.cond.n0.47_res1.13_item10
VarCorr(model_fe.cond.n0.47_res1.13_srcondition0.4_ircondition0.4_item10) <- list(SubVCa,0.16,SubVCc,ItemVCa,0.16,ItemVCc)

# Extending the number of subjects to 88. 
model_fe.cond.n0.47_res1.13_srcondition0.4_ircondition0.4_item10_subject88 <- extend(model_fe.cond.n0.47_res1.13_srcondition0.4_ircondition0.4_item10, along = "subject", n = 88)
pc.model_fe.cond.n0.47_res1.13_srcondition0.4_ircondition0.4_item10_subject88_subj48.56.64.72.80.88 <- powerCurve(model_fe.cond.n0.47_res1.13_srcondition0.4_ircondition0.4_item10_subject88, 
                                                                    fixed("condition", "lr"),
                                                                    along = "subject",
                                                                    breaks = c(48,56,64,72,80,88),
                                                                    nsim=3000)
plot(pc.model_fe.cond.n0.47_res1.13_srcondition0.4_ircondition0.4_item10_subject88_subj48.56.64.72.80.88)

Figure 4.4: Power curve based on 3,000 simulations of the model in which the fixed condition = -0.47, the by-subject and by-item random slopes for condition = 0.4, and the residual = 1.13.

print(pc.model_fe.cond.n0.47_res1.13_srcondition0.4_ircondition0.4_item10_subject88_subj48.56.64.72.80.88)
## Power for predictor 'condition', (95% confidence interval),
## by number of levels in subject:
##      48: 74.53% (72.93, 76.08) - 480 rows
##      56: 76.83% (75.28, 78.33) - 560 rows
##      64: 78.73% (77.22, 80.19) - 640 rows
##      72: 80.73% (79.28, 82.13) - 720 rows
##      80: 81.40% (79.96, 82.78) - 800 rows
##      88: 82.97% (81.57, 84.30) - 880 rows
## 
## Time elapsed: 3 h 2 m 49 s

Figure 4.4 shows that with 88 participants, the experiment can detect the effect with a power of 0.83 (95% CI: 0.82, 0.84).

These power calculations illustrate that the number of participants needed to detect the effect of condition with 80% power varies considerably depending on the values of certain variables, particularly the fixed condition and the residual. While model_item10 requires only 16 participants, model_fe.cond.n0.47_res1.13_srcondition0.4_ircondition0.4_item10 requires 88, the smallest sample size tested whose entire 95% CI lies above 80% (at 72 and 80 participants, the point estimates exceed 80% but the lower bounds do not). It is possible that the experiment might need even more than 88 participants. However, given that the values in the original model were estimated from the analysis of model_LandO, it is reasonable to trust the accuracy of that model to some extent. Therefore, assuming 88 participants as the required number to detect the effect of condition seems reasonable.
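
For completeness, the 88-participant figure can also be read off the power curve programmatically. A minimal sketch, assuming that summary() of a powerCurve object returns one row per break with columns nlevels and lower (as in current versions of simr; check the column names with str() if they differ):

# Smallest tested sample size whose entire 95% CI lies above 80% power.
pc_sum <- summary(pc.model_fe.cond.n0.47_res1.13_srcondition0.4_ircondition0.4_item10_subject88_subj48.56.64.72.80.88)
min(pc_sum$nlevels[pc_sum$lower >= 0.80])  # 88 for the run reported above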

5 Conclusion

This paper discussed the design of an acceptability judgment experiment. Section 2 provided a brief overview of the phenomenon under investigation: whether the acceptability of the predicate accusative case in Japanese copular constructions depends on the context in which the sentence occurs. Section 3 then outlined the experiment’s participants, materials, and procedures. A key consideration in designing the experiment was determining the appropriate number of participants to recruit. To address this, Section 4.4 analyzed data from Linzen and Oseki’s (2018) experiment on a similar Japanese phenomenon, conducted simulations, and concluded that the experiment should aim to recruit around 88 participants. Although this estimation involves several “informed guesses,” the simulation is still valuable in avoiding an underestimation of the required sample size.

Cowart, Wayne. 1997. Experimental Syntax. Sage.
Dąbrowska, Ewa. 2010. “Naive v. Expert Intuitions: An Empirical Study of Acceptability Judgments.” The Linguistic Review 27 (1): 1–23.
Edelman, Shimon, and Morten H Christiansen. 2003. “How Seriously Should We Take Minimalist Syntax? A Comment on Lasnik.” Trends in Cognitive Science 7 (2): 60–61.
Faraway, Julian J. 2016. Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. Chapman & Hall/CRC.
Gibson, Edward, and Evelina Fedorenko. 2013. “The Need for Quantitative Methods in Syntax and Semantics Research.” Language and Cognitive Processes 28 (1-2): 88–124.
Gordon, Peter C, and Randall Hendrick. 1997. “Intuitive Knowledge of Linguistic Co-Reference.” Cognition 62 (3): 325–70.
Grewendorf, Günther. 2007. “Empirical Evidence and Theoretical Reasoning in Generative Grammar.” Theoretical Linguistics 33 (3): 369–80.
Harada, Masashi. 2018. “Contextual Effects on Case in Japanese Copular Constructions.” In Proceedings of the 12th Generative Linguistics in the Old World and the 21st Seoul International Conference on Generative Grammar, 447–56.
———. 2020. “Experimental Design with a Focus on Power Calculation.” https://masashiharada.netlify.app/resources/pdf/powerAnalysis.pdf.
Lane, Sean, Erin Hennes, and Tessa West. 2016. I’ve Got the Power: How Anyone Can Do a Power Analysis of Any Type of Study Using Simulation.
Linzen, Tal, and Yohei Oseki. 2018. “The Reliability of Acceptability Judgments Across Languages.” Glossa: A Journal of General Linguistics 3 (1).
Newmeyer, Frederick J. 1983. Grammatical Theory: Its Limits and Its Possibilities. University of Chicago Press.
Schütze, Carson T, and Jon Sprouse. 2014. “Judgment Data.” In Research Methods in Linguistics, edited by Robert J. Podesva and Devyani Sharma, 27–50. Cambridge University Press.
Sonderegger, Morgan, Michael Wagner, and Francisco Torreira. 2018. “Quantitative Methods for Linguistic Data. Version 1.0.”
Spencer, Nancy Jane. 1973. “Differences Between Linguists and Nonlinguists in Intuitions of Grammaticality-Acceptability.” Journal of Psycholinguistic Research 2 (2): 83–98.
Sprouse, Jon. 2018. “Experimental Syntax: Design, Analysis, and Application.” Lecture Slides. University of Connecticut. https://www.jonsprouse.com/courses/experimental-syntax/slides/full.slides.pdf.
Sprouse, Jon, and Diogo Almeida. 2012. “Assessing the Reliability of Textbook Data in Syntax: Adger’s Core Syntax.” Journal of Linguistics 48 (3): 609–52.
———. 2017. “Design Sensitivity and Statistical Power in Acceptability Judgment Experiments.” Glossa: A Journal of General Linguistics 2 (1): 1.
Tada, Hiroaki. 1992. “Nominative Objects in Japanese.” Journal of Japanese Linguistics 14 (1): 91–108.
Wasow, Thomas, and Jennifer Arnold. 2005. “Intuitions in Linguistic Argumentation.” Lingua 115 (11): 1481–96.
Winter, Bodo. 2019. Statistics for Linguists: An Introduction Using R. Routledge.

  1. The number of participants is discussed in more detail in Section 4.4.↩︎

  2. In fact, pseudorandomization with a maximum of 8 fillers between two main items already requires many calculation trials.↩︎

  3. The data frame does not account for the filler and anchor items, and the order of the data differs slightly from what was mentioned in Section 1. However, the power analysis for this experiment requires estimating various values, as discussed below, so these differences in design should not significantly impact the power estimation.↩︎

  4. The model does not include any by-item random effects because Linzen and Oseki (2018) only used the set of sentences in (7) to examine the effect of morphological case on the meaning in question.↩︎

  5. Additionally, it is reasonable to assume there might be a correlation between the two random effects. For example, if a participant initially finds sentence (7a) less natural (resulting in a lower intercept), the effect of the case manipulation might be smaller; in other words, only participants who find (7a) very natural to start with might show a strong case effect. While the scatterplot on individual variability does not clearly indicate whether such a ceiling effect exists, exploring the correlation could be useful. However, including this correlation in the mixed-effects model leads to over-parameterization, where the number of random effects equals the number of observations, and the model cannot be fit in R. Therefore, the model does not include this random-effect correlation.↩︎

  6. The model does not include random-effect correlations, as they could not be incorporated into model_LandO in the previous section, and thus we cannot estimate their values. However, it would be reasonable to include them, specifically the correlations between (i) the by-subject intercept and condition, (ii) the by-subject intercept and order, (iii) the by-subject condition and order, (iv) the by-item intercept and condition, (v) the by-item intercept and order, and (vi) the by-item condition and order. These correlations will be added when analyzing the data after the experiment.↩︎

  7. Fatigue is closely related to the order effect and could potentially influence the slope, possibly making it more positive. However, it is not clear whether fatigue would actually result in a more positive or more negative slope, and the direction of this effect may vary among participants. For this reason, I do not factor fatigue into the current analysis.↩︎

  8. This observation is based on fitting some models for the English acceptability judgment data provided by Sprouse and Almeida (2017).↩︎

  9. The table was created after setting up the values for the random effects and residuals, which will be discussed in the following subsection.↩︎