Our goal is to examine IRT parameters across a diverse set of languages and find a subset of uni-lemmas whose difficulty is similar across languages. We start with 2PL fits to CDI Words & Gestures (WG) data, fit separately for comprehension and production, for 18 languages: British Sign Language, Croatian, Danish, English (American), Korean, Spanish (Mexican), Italian, Mandarin (Taiwanese), French (French), Latvian, Hebrew, Norwegian, French (Quebecois), Slovak, Spanish (European), Russian, Turkish, Portuguese (European). We look at comprehension first.
[Figure: histograms of item difficulty by language. 85 rows with non-finite values were removed before plotting.]
The difficulty distributions mostly overlap across languages (these fits use a N(0,3) prior on difficulty).
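For reference, here is a minimal sketch of the 2PL item response function, assuming the common difficulty parameterization in which the response probability depends on a × (θ − b). The function and argument names are illustrative, and the software used to fit the models in this report may parameterize items differently (for example, with an intercept rather than a difficulty).

```python
import math


def p_2pl(theta: float, discrimination: float, difficulty: float) -> float:
    """Probability that a child with ability `theta` knows an item under a 2PL model.

    Assumes the difficulty parameterization: logit(p) = a * (theta - b).
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# When ability equals difficulty, the probability is exactly one half,
# regardless of the discrimination.
p_at_difficulty = p_2pl(theta=0.0, discrimination=1.5, difficulty=0.0)

# A harder item (larger difficulty) is less likely to be known at the same ability.
p_easy = p_2pl(theta=1.0, discrimination=1.5, difficulty=0.0)
p_hard = p_2pl(theta=1.0, discrimination=1.5, difficulty=2.0)
```

Under this parameterization, comparing difficulties across languages amounts to comparing the ability level at which each item is known with probability 0.5.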
We compute the Spearman correlation between the item difficulties of each pair of languages. We might expect these correlations to recapitulate the historical relationships between languages, with more closely related languages having more similar item difficulties (e.g., Quebecois and European French).
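The pairwise comparison can be sketched as follows, assuming difficulties are stored as a nested dictionary of language to uni-lemma to difficulty and that each pair of languages is compared only on the uni-lemmas they share. The data layout, the function names, and the toy values are all hypothetical; this is not the code used to produce the report.

```python
from itertools import combinations


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation (assumes neither input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def pairwise_spearman(difficulty, min_shared=3):
    """difficulty: {language: {uni_lemma: difficulty}} -> {(lang_a, lang_b): rho}."""
    result = {}
    for a, b in combinations(sorted(difficulty), 2):
        shared = sorted(set(difficulty[a]) & set(difficulty[b]))
        if len(shared) < min_shared:
            continue  # too few shared uni-lemmas to correlate
        result[(a, b)] = spearman(
            [difficulty[a][u] for u in shared],
            [difficulty[b][u] for u in shared],
        )
    return result


# Toy difficulties (made up for illustration only).
toy = {
    "lang_A": {"dog": -1.0, "ball": -0.5, "run": 0.8, "idea": 2.0},
    "lang_B": {"dog": -0.8, "ball": -0.9, "run": 1.1, "idea": 1.7},
    "lang_C": {"dog": 0.5, "ball": -1.2, "run": 0.1},
}
rho = pairwise_spearman(toy)
```

Because each pair is restricted to its shared uni-lemmas, correlations for different pairs rest on different item sets and different sample sizes, which is worth keeping in mind when comparing them.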
The table below shows the total number of items per language that have uni-lemmas.
Language | Items |
---|---|
British Sign Language | 418 |
Croatian | 396 |
Danish | 410 |
English (American) | 396 |
French (French) | 488 |
French (Quebecois) | 405 |
Hebrew | 442 |
Italian | 408 |
Korean | 254 |
Latvian | 363 |
Mandarin (Taiwanese) | 350 |
Norwegian | 395 |
Portuguese (European) | 292 |
Russian | 427 |
Slovak | 221 |
Spanish (European) | 292 |
Spanish (Mexican) | 428 |
Turkish | 418 |
Below we examine the standard deviation of each uni-lemma’s cross-linguistic difficulty as a function of how many languages that uni-lemma is missing from. How should we choose thresholds for inclusion, both on the variability of an item’s difficulty and on the number of languages in which it is included? For now we consider items with below-median SD that are missing from no more than 6 languages (lower left region of the plot).
[Figure: scatter plot of each uni-lemma’s difficulty SD against the number of languages it is missing from. 139 rows with missing values were removed before plotting.]
Of these, a total of 723 uni-lemmas are included in more than one language, and only 444 uni-lemmas are included in 6 or more of the languages. We will start by considering this more restricted list, but if there are not enough good candidates then we may consider making pairwise comparisons between each possible language pair (more flexible, but more complicated). (There are only 26 uni-lemmas used in all 18 languages.)
To evaluate how variable items are in their cross-linguistic difficulty, we calculate the standard deviation (SD) of each uni-lemma’s difficulty across languages. The median of these SDs is 1.26 (the SD of the SDs is 0.46), so we consider the 50 items with SD < 1.03 (a cutoff equal to the median minus half an SD: 1.26 − 0.46/2 = 1.03). These items are shown below, sorted by the number of languages from which the uni-lemma is missing.
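The selection step described above can be sketched as follows. The data layout, the function names, and the toy values are hypothetical. The cutoff rule (median minus half an SD of the item SDs) is an assumption: it reproduces the reported 1.03 from 1.26 and 0.46, but the report does not state the rule explicitly.

```python
from statistics import median, stdev


def summarize(difficulty, n_languages):
    """difficulty: {uni_lemma: {language: difficulty}}.

    Returns {uni_lemma: {"sd": ..., "n_missing": ...}} for uni-lemmas present
    in at least two languages (the SD is undefined for a single value).
    """
    rows = {}
    for item, by_lang in difficulty.items():
        if len(by_lang) < 2:
            continue
        rows[item] = {
            "sd": stdev(list(by_lang.values())),
            "n_missing": n_languages - len(by_lang),
        }
    return rows


def cutoff_below_median(sds, k=0.5):
    """Assumed rule: median of the item SDs minus k times their SD."""
    return median(sds) - k * stdev(sds)


def select(rows, max_missing, sd_cutoff):
    """Items with SD below the cutoff and at most `max_missing` missing languages,
    sorted by number of missing languages, then alphabetically."""
    keep = [
        item
        for item, r in rows.items()
        if r["n_missing"] <= max_missing and r["sd"] < sd_cutoff
    ]
    return sorted(keep, key=lambda item: (rows[item]["n_missing"], item))


# Toy difficulties for 4 hypothetical languages (made up for illustration only).
toy = {
    "dog": {"A": -1.0, "B": -0.8, "C": -1.2, "D": -1.0},  # low SD, in all languages
    "idea": {"A": 2.0, "B": 0.0, "C": 1.0},               # high SD, missing from 1
    "ball": {"A": -0.5, "B": -0.7},                       # low SD, missing from 2
    "moon": {"A": 0.3},                                   # only one language: skipped
}
rows = summarize(toy, n_languages=4)
strict = select(rows, max_missing=1, sd_cutoff=0.5)
relaxed = select(rows, max_missing=2, sd_cutoff=0.5)
```

Keeping the two thresholds as explicit arguments makes it easy to see how the candidate list grows or shrinks as either criterion is relaxed.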
Now we’ll turn to production data.
[Figure: histograms of item difficulty by language. 940 rows with non-finite values were removed before plotting.]
Again, the difficulty distributions mostly overlap across languages (with a N(0,3) prior on difficulty).
The table below shows the total number of items per language that have uni-lemmas.
Language | Items |
---|---|
British Sign Language | 414 |
Croatian | 380 |
Danish | 410 |
English (American) | 396 |
English (British) | 405 |
French (French) | 174 |
French (Quebecois) | 322 |
Hebrew | 439 |
Italian | 354 |
Kiswahili | 165 |
Korean | 254 |
Latvian | 363 |
Mandarin (Beijing) | 220 |
Mandarin (Taiwanese) | 350 |
Norwegian | 395 |
Portuguese (European) | 268 |
Russian | 350 |
Slovak | 220 |
Spanish (European) | 267 |
Spanish (Mexican) | 428 |
Turkish | 380 |
Below we examine the standard deviation of each uni-lemma’s cross-linguistic difficulty as a function of how many languages that uni-lemma is missing from. Which thresholds should we use for inclusion? For now we consider items with below-median SD that are missing from no more than 6 languages (lower left region of the plot).
[Figure: scatter plot of each uni-lemma’s difficulty SD against the number of languages it is missing from. 164 rows with missing values were removed before plotting.]
Of these, a total of 701 uni-lemmas are included in more than one language, and only 428 uni-lemmas are included in 6 or more of the languages. We will start by considering this more restricted list, but if there are not enough good candidates then we may consider making pairwise comparisons between each possible language pair (more flexible, but more complicated). (There are only 14 uni-lemmas used in all 17 languages.)
To evaluate how variable items are in their cross-linguistic difficulty, we calculate the standard deviation (SD) of each uni-lemma’s difficulty across languages. The median of these SDs is 2.25 (the SD of the SDs is 0.81), so we consider the 43 items with SD less than 1.85 (a cutoff equal to the median minus half an SD: 2.25 − 0.81/2 ≈ 1.85). These items are shown below, sorted by the number of languages from which the uni-lemma is missing.
Fifteen of the uni-lemmas appear on both the comprehension and the production lists of good cross-linguistic items. Is this enough items, or do we want to relax the criteria for inclusion? How does the difficulty of these “good” items compare to the overall item difficulties? (Are they systematically easier? If so, the list may not work well for a short-form CDI, since it will overestimate vocabulary size. We should plot sd(difficulty) against mean(difficulty).) The next step is to run real-data simulations for each language using these items and see how well we recover full CDI scores and ability estimates. If this doesn’t work well, we may consider constructing pairwise language lists.
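Two of the checks raised above, the overlap between the two lists and whether the selected items are systematically easier, can be sketched as follows. The function names, the data layout, and the toy values are hypothetical, and the mean-difficulty gap is just one simple way to quantify "systematically easier".

```python
from statistics import mean


def overlap(comprehension_items, production_items):
    """Uni-lemmas that appear on both lists, sorted alphabetically."""
    return sorted(set(comprehension_items) & set(production_items))


def mean_difficulty_gap(mean_difficulty, selected):
    """Mean difficulty of the selected items minus the mean over all items.

    mean_difficulty: {uni_lemma: mean cross-linguistic difficulty}.
    A negative gap means the selected items are easier on average.
    """
    overall = mean(list(mean_difficulty.values()))
    chosen = mean([mean_difficulty[item] for item in selected])
    return chosen - overall


# Toy lists and difficulties (made up for illustration only).
good_comprehension = ["dog", "ball", "run"]
good_production = ["ball", "dog", "idea"]
both = overlap(good_comprehension, good_production)

toy_mean_difficulty = {"dog": -1.0, "ball": -0.6, "idea": 1.0, "run": 0.6}
gap = mean_difficulty_gap(toy_mean_difficulty, both)
```

In this toy example the gap is negative, which is the pattern that would signal a risk of overestimating vocabulary size with a short form built from the selected items.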