Our goal is to compare IRT parameters across a diverse set of languages and find a subset of uni-lemmas whose difficulty is similar across languages. We start with 2PL fits to WG data (comprehension and production fit separately) for 18 languages: British Sign Language, Croatian, Danish, English (American), Korean, Spanish (Mexican), Italian, Mandarin (Taiwanese), French (French), Latvian, Hebrew, Norwegian, French (Quebecois), Slovak, Spanish (European), Russian, Turkish, Portuguese (European).
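For reference, a minimal sketch of the item response function that a 2PL model assumes, written in the standard textbook form with a discrimination `a` and a difficulty `b` per item. The function name is hypothetical, and the parameterization used by the fitting software behind this report is not stated here (some packages use a slope-intercept form instead), so treat this as an illustration of what "difficulty" means rather than the fitted model itself.

```python
import math

def p_correct_2pl(theta, a, b):
    """Probability that a child with ability `theta` knows an item with
    discrimination `a` and difficulty `b` (standard 2PL logistic form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A child whose ability equals the item's difficulty has a 50% chance.
print(p_correct_2pl(theta=0.0, a=1.0, b=0.0))
# The same child is less likely to know a harder item (larger b).
print(p_correct_2pl(theta=0.0, a=1.0, b=2.0))
```

Under this form, a larger `b` shifts the curve to the right, so comparing `b` for the same uni-lemma across languages is what the sections below do.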

WG Comprehension


The difficulty distributions mostly overlap across languages (fit with an N(0,3) prior on difficulty).

Cross-linguistic similarities

We compute the Spearman correlation of item difficulties between each pair of languages. We might expect these correlations to recapitulate the historical relationships between languages, with more closely related languages having more similar item difficulties (e.g., Quebecois and European French).
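A minimal sketch of the pairwise comparison, assuming each language's difficulties are stored as a dictionary keyed by uni-lemma and that the correlation is computed only over uni-lemmas present in both languages. The function names and the toy values are hypothetical; the report's own pipeline is not shown.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0  # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_shared(diff_a, diff_b):
    """Spearman correlation over the uni-lemmas two languages share.
    Returns None when fewer than three uni-lemmas are shared."""
    shared = sorted(set(diff_a) & set(diff_b))
    if len(shared) < 3:
        return None
    x = [diff_a[u] for u in shared]
    y = [diff_b[u] for u in shared]
    return pearson(average_ranks(x), average_ranks(y))

# Toy difficulties (made up for illustration only).
lang_a = {"dog": 0.1, "ball": 0.5, "cup": 1.2, "only_in_a": 3.0}
lang_b = {"dog": -1.0, "ball": 0.0, "cup": 2.0, "only_in_b": 9.0}
print(spearman_shared(lang_a, lang_b))
```

Because each pair of languages shares a different set of uni-lemmas, the correlations in the full matrix rest on different numbers of items, which is worth keeping in mind when comparing them.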

Difficulty by CDI Category

Candidate Items

Shown below is the total number of items per language that also have uni-lemmas.

Language Items
British Sign Language 418
Croatian 396
Danish 410
English (American) 396
French (French) 488
French (Quebecois) 405
Hebrew 442
Italian 408
Korean 254
Latvian 363
Mandarin (Taiwanese) 350
Norwegian 395
Portuguese (European) 292
Russian 427
Slovak 221
Spanish (European) 292
Spanish (Mexican) 428
Turkish 418

Below we examine the standard deviation of each uni-lemma’s cross-linguistic difficulty as a function of how many languages that uni-lemma is missing from. How should we choose thresholds for inclusion, both on the variability of an item’s difficulty and on the number of languages in which it is included? For now we consider items with below-median SD that are missing from no more than 6 languages (lower left region of the plot).


Of these, a total of 723 uni-lemmas are included in more than one language, and only 444 uni-lemmas are included in 6 or more of the languages. We will start by considering this more restricted list, but if there are not enough good candidates then we may consider making pairwise comparisons between each possible language pair (more flexible, but more complicated). (There are only 26 uni-lemmas used in all 18 languages.)
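The coverage counts above can be sketched as follows, assuming the items are available as a mapping from language to the set of uni-lemmas on that language's form. The names and toy data are hypothetical.

```python
from collections import Counter

def language_coverage(items_by_language):
    """Map each uni-lemma to the number of languages that include it."""
    counts = Counter()
    for lemmas in items_by_language.values():
        counts.update(set(lemmas))  # set() guards against duplicate entries
    return counts

def count_at_least(coverage, k):
    """Number of uni-lemmas included in at least k languages."""
    return sum(1 for n in coverage.values() if n >= k)

# Toy data (made up for illustration only).
items = {
    "Language A": {"dog", "ball"},
    "Language B": {"dog"},
    "Language C": {"dog", "ball", "cup"},
}
coverage = language_coverage(items)
print(count_at_least(coverage, 2))           # in more than one language
print(count_at_least(coverage, len(items)))  # in every language
```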

To evaluate how variable items are in their cross-linguistic difficulty, we calculate the standard deviation (SD) of each uni-lemma’s difficulty across languages. The median SD is 1.26 (SD=0.46), so we consider the 50 items with SD < 1.03 (the median minus half an SD). These items are shown below, sorted by the number of languages from which the uni-lemma is missing.
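A minimal sketch of this selection step, assuming difficulties are stored per uni-lemma as a dictionary of language to difficulty, and that the SD is the sample SD (as R's `sd()` computes). The function names, the explicit `sd_cutoff` and `max_missing` arguments, and the toy values are all hypothetical; the cutoff is left as a parameter because the threshold rule is still an open question in this report.

```python
import statistics

def difficulty_summary(difficulties_by_lemma, n_languages):
    """Return {uni-lemma: (sd, n_missing)} for uni-lemmas with difficulties
    in at least two languages (a sample SD needs two values)."""
    summary = {}
    for lemma, by_language in difficulties_by_lemma.items():
        values = list(by_language.values())
        if len(values) < 2:
            continue
        summary[lemma] = (statistics.stdev(values), n_languages - len(values))
    return summary

def select_candidates(summary, sd_cutoff, max_missing):
    """Keep uni-lemmas with SD below the cutoff that are missing from at
    most max_missing languages, sorted by number of missing languages."""
    keep = [
        (lemma, sd, missing)
        for lemma, (sd, missing) in summary.items()
        if sd < sd_cutoff and missing <= max_missing
    ]
    return sorted(keep, key=lambda row: (row[2], row[1]))

# Toy difficulties (made up for illustration only).
toy = {
    "dog": {"A": 0.0, "B": 0.2, "C": 0.1},
    "ball": {"A": -1.0, "B": 2.0},
    "cup": {"A": 0.5},
}
summary = difficulty_summary(toy, n_languages=3)
for row in select_candidates(summary, sd_cutoff=1.0, max_missing=1):
    print(row)
```

Sorting by the number of missing languages first mirrors the ordering described above; ties are broken by SD so the most consistent items come first.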

WG Production

Now we’ll turn to production data.


The difficulty distributions again mostly overlap across languages (fit with an N(0,3) prior on difficulty).

Cross-linguistic similarities

Difficulty by CDI Category

Candidate Items

Shown below is the total number of items per language that also have uni-lemmas.

Language Items
British Sign Language 414
Croatian 380
Danish 410
English (American) 396
English (British) 405
French (French) 174
French (Quebecois) 322
Hebrew 439
Italian 354
Kiswahili 165
Korean 254
Latvian 363
Mandarin (Beijing) 220
Mandarin (Taiwanese) 350
Norwegian 395
Portuguese (European) 268
Russian 350
Slovak 220
Spanish (European) 267
Spanish (Mexican) 428
Turkish 380

Below we examine the standard deviation of each uni-lemma’s cross-linguistic difficulty as a function of how many languages that uni-lemma is missing from. Which thresholds should we use for inclusion? For now we consider items with below-median SD that are missing from no more than 6 languages (lower left region of the plot).


Of these, a total of 701 uni-lemmas are included in more than one language, and only 428 uni-lemmas are included in 6 or more of the languages. We will start by considering this more restricted list, but if there are not enough good candidates then we may consider making pairwise comparisons between each possible language pair (more flexible, but more complicated). (There are only 14 uni-lemmas used in all 17 languages.)

To evaluate how variable items are in their cross-linguistic difficulty, we calculate the standard deviation (SD) of each uni-lemma’s difficulty across languages. The median SD is 2.25 (SD=0.81), so we consider the 43 items with SD < 1.85 (the median minus half an SD). These items are shown below, sorted by the number of languages from which the uni-lemma is missing.

Overview of Cross-linguistic Comprehension and Production Difficulties by Category

Next steps

Fifteen of the uni-lemmas appear on both the comprehension and the production lists of good cross-linguistic candidates. Is this enough items, or do we want to relax the inclusion criteria? How does the difficulty of these “good” items compare to the overall item difficulties? (Are they systematically easier? If so, the list may not work well for a short-form CDI, since it will overestimate vocabulary size. We should plot sd(difficulty) vs. mean(difficulty) to check.) The next step is to run real-data simulations for each language using these items and see how well we recover full CDI scores / ability. If this doesn’t work well, we may consider constructing pairwise language lists.
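Two of the checks raised here are simple enough to sketch: intersecting the comprehension and production candidate lists, and comparing the mean difficulty of the selected items against all items. The function names and toy values below are hypothetical, and the comparison of means is only one rough way to ask whether the selected items are systematically easier.

```python
import statistics

def shared_candidates(comprehension_list, production_list):
    """Uni-lemmas that appear on both candidate lists."""
    return sorted(set(comprehension_list) & set(production_list))

def mean_difficulty_gap(mean_difficulty, selected):
    """Mean difficulty of the selected uni-lemmas minus the mean over all
    uni-lemmas. A negative gap means the selected items are easier."""
    chosen = [mean_difficulty[u] for u in selected if u in mean_difficulty]
    return statistics.mean(chosen) - statistics.mean(list(mean_difficulty.values()))

# Toy lists and cross-linguistic mean difficulties (made up for illustration).
both = shared_candidates(["dog", "ball", "cup"], ["cup", "dog", "shoe"])
means = {"dog": -1.0, "cup": -0.5, "shoe": 1.5, "ball": 2.0}
print(both)
print(mean_difficulty_gap(means, both))
```

A clearly negative gap on the real data would support the concern that a short form built from these items overestimates vocabulary size.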