Our goal is to compare IRT parameters across a diverse set of languages and identify a subset of uni-lemmas whose difficulty is similar across languages. We start with 2PL fits to CDI Words & Gestures (WG) data, with comprehension and production fit separately, for 22 languages: Kigiriama, Kiswahili, British Sign Language, Croatian, Danish, English (American), Italian, Mandarin (Taiwanese), French (French), Korean, Latvian, Hebrew, Norwegian, French (Quebecois), Slovak, Spanish (European), Spanish (Mexican), Russian, Turkish, Portuguese (European), Dutch, Persian.
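As a reminder of what the difficulty parameter means, here is a minimal Python sketch of a 2PL item response function. The discrimination/difficulty parameterization and all names and values are assumptions for illustration; the fitting software may use an equivalent slope-intercept form.

```python
import math


def p_known_2pl(theta, discrimination, difficulty):
    """Probability that a child with ability `theta` knows an item (2PL).

    Assumed parameterization: logistic(discrimination * (theta - difficulty)).
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Hypothetical parameters: same discrimination, one easy and one hard item.
easy = p_known_2pl(theta=0.0, discrimination=1.5, difficulty=-1.0)
hard = p_known_2pl(theta=0.0, discrimination=1.5, difficulty=2.0)
print(f"easy item: {easy:.3f}, hard item: {hard:.3f}")
```

Under this form, an item's difficulty is the ability level at which the probability of knowing the word is exactly 0.5.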

WG Comprehension

The difficulty distributions mostly overlap across languages (these fits use an N(0,3) prior on difficulty).

Cross-linguistic similarities

We compute the Spearman correlation between the item difficulties of each pair of languages. We might expect these correlations to recapitulate the historical relationships between languages, with more closely related languages having more similar item difficulties (e.g., Quebecois and European French).
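The pairwise comparison can be sketched as follows in Python. This is an illustration, not the code behind the report: it assumes difficulties are stored per language and keyed by uni-lemma, and that each pair of languages is compared only on the uni-lemmas they share. The data and names are hypothetical.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def pairwise_spearman(difficulty, min_shared=3):
    """difficulty: {language: {uni_lemma: difficulty}} -> {(lang_a, lang_b): rho}."""
    result = {}
    for lang_a, lang_b in combinations(sorted(difficulty), 2):
        shared = sorted(set(difficulty[lang_a]) & set(difficulty[lang_b]))
        if len(shared) < min_shared:
            continue  # too few shared uni-lemmas for a meaningful correlation
        xs = [difficulty[lang_a][u] for u in shared]
        ys = [difficulty[lang_b][u] for u in shared]
        result[(lang_a, lang_b)] = spearman(xs, ys)
    return result


# Hypothetical toy difficulties (not values from the fits).
toy = {
    "Language A": {"dog": -1.2, "ball": -0.8, "eat": 0.1, "think": 1.9},
    "Language B": {"dog": -1.0, "ball": -0.9, "eat": 0.4, "think": 2.2, "sky": 0.7},
    "Language C": {"dog": 1.5, "ball": 0.2, "eat": -0.3, "think": -1.1},
}
correlations = pairwise_spearman(toy)
for pair, rho in correlations.items():
    print(pair, round(rho, 2))
```

Restricting each pair to shared uni-lemmas means different pairs are compared on different item sets, which is worth keeping in mind when reading the correlation matrix.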

Difficulty by CDI Category

Candidate Items

Shown below are the total number of items per language that also have uni-lemmas.

Language Items
British Sign Language 545
Croatian 396
Danish 410
Dutch 439
English (American) 396
French (French) 619
French (Quebecois) 408
Hebrew 442
Italian 408
Kigiriama 292
Kiswahili 287
Korean 284
Latvian 402
Mandarin (Taiwanese) 354
Norwegian 395
Persian 400
Portuguese (European) 317
Russian 427
Slovak 308
Spanish (European) 303
Spanish (Mexican) 428
Turkish 418

Below we examine the standard deviation of each uni-lemma’s cross-linguistic difficulty as a function of the number of languages from which that uni-lemma is missing. How should we choose thresholds for inclusion, both on the variability of an item’s difficulty and on the number of languages in which it appears? For now we consider items with below-median SD that are missing from no more than 6 languages (lower-left region of the plot).

Of the uni-lemmas, 819 are included in more than one language, but only 484 are included in 6 or more languages. We will start with this more restricted list; if it does not yield enough good candidates, we may instead make pairwise comparisons between each possible language pair (more flexible, but more complicated). (Only 21 uni-lemmas are used in all 22 languages.)
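Coverage counts of this kind can be computed with a short tally. The sketch below is in Python and assumes the item table reduces to (language, uni-lemma) pairs, with `None` marking items that have no uni-lemma; the data and names are hypothetical.

```python
from collections import Counter


def languages_per_unilemma(items):
    """items: (language, uni_lemma) pairs -> Counter of languages per uni-lemma."""
    # De-duplicate so a uni-lemma mapped to several items in one language counts once.
    seen = {(lang, u) for lang, u in items if u is not None}
    return Counter(u for _, u in seen)


def coverage_summary(items, min_languages):
    items = list(items)
    counts = languages_per_unilemma(items)
    n_languages = len({lang for lang, _ in items})
    return {
        "more_than_one": sum(1 for c in counts.values() if c > 1),
        "at_least_min": sum(1 for c in counts.values() if c >= min_languages),
        "in_all": sum(1 for c in counts.values() if c == n_languages),
    }


# Hypothetical toy data: three languages, one item without a uni-lemma.
toy_items = [
    ("A", "dog"), ("B", "dog"), ("C", "dog"),
    ("A", "ball"), ("B", "ball"),
    ("A", "sky"),
    ("C", None),
]
summary = coverage_summary(toy_items, min_languages=3)
print(summary)
```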

To evaluate how variable items are in their cross-linguistic difficulty, we calculate the standard deviation (SD) of each uni-lemma’s difficulty across languages. The median SD is 1.29 (SD = 0.5), so we consider the 31 items with SD < 1.04, a cutoff that matches the median minus half an SD (1.29 − 0.25). These items are shown below, sorted by the number of languages from which the uni-lemma is missing.
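The selection step can be sketched in Python as follows. This is an illustration under stated assumptions: difficulties are keyed by language and uni-lemma, the SD is the sample SD across the languages that include the uni-lemma, and the cutoff defaults to the median SD unless one is passed explicitly. The data and names are hypothetical.

```python
import statistics


def difficulty_sd_table(difficulty, languages):
    """Return {uni_lemma: (sd_of_difficulty, n_languages_missing)}."""
    unilemmas = set().union(*(difficulty[lang].keys() for lang in languages))
    rows = {}
    for u in unilemmas:
        values = [difficulty[lang][u] for lang in languages if u in difficulty[lang]]
        if len(values) < 2:
            continue  # SD is undefined for a uni-lemma found in a single language
        rows[u] = (statistics.stdev(values), len(languages) - len(values))
    return rows


def select_candidates(rows, max_missing, sd_cutoff=None):
    """Keep uni-lemmas with SD below the cutoff and few missing languages."""
    if sd_cutoff is None:
        sd_cutoff = statistics.median(sd for sd, _ in rows.values())
    keep = [
        (u, sd, missing)
        for u, (sd, missing) in rows.items()
        if sd < sd_cutoff and missing <= max_missing
    ]
    # Sort by number of missing languages, then by SD.
    return sorted(keep, key=lambda row: (row[2], row[1]))


# Hypothetical toy difficulties (not values from the fits).
toy = {
    "A": {"dog": -1.0, "ball": -0.5, "think": 2.0},
    "B": {"dog": -1.1, "ball": 0.5, "think": 0.0},
    "C": {"dog": -0.9, "think": -2.0},
}
rows = difficulty_sd_table(toy, ["A", "B", "C"])
for u, sd, missing in select_candidates(rows, max_missing=1, sd_cutoff=1.0):
    print(u, round(sd, 2), missing)
```

Passing `sd_cutoff` explicitly makes it easy to see how many candidates survive as the criterion is relaxed or tightened.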

WG Production

Now we’ll turn to production data.

Cross-linguistic similarities

The difficulty distributions again mostly overlap across languages (these fits use an N(0,3) prior on difficulty).

Difficulty by CDI Category

Candidate Items

Shown below are the total number of items per language that also have uni-lemmas.

Language Items
British Sign Language 532
Croatian 380
Danish 410
Dutch 160
English (American) 396
French (French) 212
French (Quebecois) 366
Hebrew 439
Italian 408
Kigiriama 260
Kiswahili 216
Korean 284
Latvian 402
Mandarin (Taiwanese) 354
Norwegian 395
Persian 367
Portuguese (European) 293
Russian 424
Slovak 307
Spanish (Chilean) 455
Spanish (European) 277
Spanish (Mexican) 428
Swedish 341
Turkish 418

Below we examine the standard deviation of each uni-lemma’s cross-linguistic difficulty as a function of the number of languages from which that uni-lemma is missing. What thresholds should we use for inclusion? For now we consider items with below-median SD that are missing from no more than 6 languages (lower-left region of the plot).

Of the uni-lemmas, 770 are included in more than one language, but only 468 are included in 6 or more languages. We will start with this more restricted list; if it does not yield enough good candidates, we may instead make pairwise comparisons between each possible language pair (more flexible, but more complicated). (Only 11 uni-lemmas are used in all 20 languages.)

To evaluate how variable items are in their cross-linguistic difficulty, we calculate the standard deviation (SD) of each uni-lemma’s difficulty across languages. The median SD is 2.4 (SD = 0.88), so we consider the 48 items with SD < 1.96, a cutoff that matches the median minus half an SD (2.4 − 0.44). These items are shown below, sorted by the number of languages from which the uni-lemma is missing.

Overview of Cross-linguistic Comprehension and Production Difficulties by Category

Next steps

Fifteen uni-lemmas appear on both the comprehension and production lists of good cross-linguistic candidates. Is this enough items, or do we want to relax the criteria for inclusion? How does the difficulty of these “good” items compare to the overall item difficulties? If they are systematically easier, the list may not work well for a short-form CDI, since it would overestimate vocabulary size; plotting sd(difficulty) against mean(difficulty) would show this. The next step is to run real-data simulations for each language using these items and see how well they recover full CDI scores and ability estimates. If this does not work well, we may consider constructing pairwise language lists.
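Two of these checks are simple to sketch: the overlap between the two candidate lists, and whether the selected items are easier than average. The Python below is an illustration only. It assumes each list is a collection of uni-lemma names and that a mean cross-linguistic difficulty is available per uni-lemma; the data and names are hypothetical.

```python
import statistics


def overlap(comprehension_items, production_items):
    """Uni-lemmas that appear on both candidate lists."""
    return sorted(set(comprehension_items) & set(production_items))


def mean_difficulty_gap(mean_difficulty, selected):
    """Mean difficulty of selected items minus mean difficulty of all items.

    A negative gap means the selected items are easier than average.
    """
    all_mean = statistics.fmean(mean_difficulty.values())
    selected_mean = statistics.fmean(mean_difficulty[u] for u in selected)
    return selected_mean - all_mean


# Hypothetical toy values (not results from the fits).
comprehension_list = ["dog", "ball", "sky"]
production_list = ["ball", "dog", "think"]
mean_difficulty = {"dog": -1.0, "ball": -0.5, "think": 2.0, "sky": 0.5}

both = overlap(comprehension_list, production_list)
gap = mean_difficulty_gap(mean_difficulty, both)
print(both, round(gap, 2))
```

A clearly negative gap on the real data would support the concern that a short form built from these items overestimates vocabulary size.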