Our goal is to examine item response theory (IRT) parameters across a diverse set of languages and find a subset of uni-lemmas whose difficulty is similar across languages. We’ll start with two-parameter logistic (2PL) fits to Words & Gestures (WG) data, fit separately for comprehension and production, for 22 languages: Kigiriama, Kiswahili, British Sign Language, Croatian, Danish, English (American), Italian, Mandarin (Taiwanese), French (French), Korean, Latvian, Hebrew, Norwegian, French (Quebecois), Slovak, Spanish (European), Spanish (Mexican), Russian, Turkish, Portuguese (European), Dutch, Persian.
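For readers less familiar with the 2PL model, here is a minimal sketch of its item response function in the common discrimination/difficulty parameterization. This is an assumption for illustration: the fits reported here may use a different but equivalent parameterization (e.g., slope/intercept), and the function name and parameter values are hypothetical.

```python
import math


def p_2pl(theta, discrimination, difficulty):
    """Probability that a child with ability `theta` understands/produces an item.

    Assumes the standard 2PL form: logistic(discrimination * (theta - difficulty)).
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Hypothetical items with equal discrimination but different difficulty,
# evaluated for a child of average ability (theta = 0).
easy = p_2pl(theta=0.0, discrimination=1.5, difficulty=-1.0)
hard = p_2pl(theta=0.0, discrimination=1.5, difficulty=2.0)
```

Under this form, an item’s difficulty is the ability level at which the response probability is exactly 0.5, which is why difficulties can be compared across languages on a common scale.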
The difficulty distributions mostly overlap across languages (this uses a N(0,3) prior on difficulty).
We compute the Spearman correlation of item difficulties between each pair of languages. We might expect this to recapitulate the historical relationships between languages, with more closely related languages having more similar item difficulties (e.g., Quebecois and European French).
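As a rough illustration of the pairwise comparison, the sketch below computes a Spearman correlation over the uni-lemmas that two languages share. The helper names and the toy difficulty values are hypothetical, and this is not the code used to produce the results here; it only shows the technique (rank each language’s difficulties, then take the Pearson correlation of the ranks).

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_shared(diff_a, diff_b):
    """Spearman correlation over uni-lemmas present in both languages."""
    shared = sorted(set(diff_a) & set(diff_b))
    x = [diff_a[u] for u in shared]
    y = [diff_b[u] for u in shared]
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical difficulties keyed by uni-lemma (not real estimates).
french_eu = {"dog": -1.2, "ball": -0.8, "water": 0.1, "jump": 1.4}
french_qc = {"dog": -1.0, "ball": -0.5, "water": 0.3, "jump": 1.1, "moon": 2.0}
rho = spearman_shared(french_eu, french_qc)
```

Restricting each comparison to shared uni-lemmas means different language pairs are compared on different item sets, which is worth keeping in mind when reading the correlation matrix.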
The table below shows the total number of items per language that also have uni-lemmas.
Language | Items |
---|---|
British Sign Language | 545 |
Croatian | 396 |
Danish | 410 |
Dutch | 439 |
English (American) | 396 |
French (French) | 619 |
French (Quebecois) | 408 |
Hebrew | 442 |
Italian | 408 |
Kigiriama | 292 |
Kiswahili | 287 |
Korean | 284 |
Latvian | 402 |
Mandarin (Taiwanese) | 354 |
Norwegian | 395 |
Persian | 400 |
Portuguese (European) | 317 |
Russian | 427 |
Slovak | 308 |
Spanish (European) | 303 |
Spanish (Mexican) | 428 |
Turkish | 418 |
Below we examine the standard deviation of each uni-lemma’s cross-linguistic difficulty as a function of the number of languages from which that uni-lemma is missing. How should we choose thresholds for inclusion, both on the variability of an item’s difficulty and on the number of languages in which it is included? For now we consider items with below-median SD that are missing from no more than 6 languages (lower left region of the plot).
Of these, a total of 819 uni-lemmas are included in more than one language, and only 484 uni-lemmas are included in 6 or more of the languages. We will start by considering this more restricted list, but if it does not yield enough good candidates, we may instead make pairwise comparisons between each possible language pair (more flexible, but more complicated). (Only 21 uni-lemmas are used in all 22 languages.)
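The coverage counts above can be obtained by tallying, for each uni-lemma, how many languages include it. A minimal sketch follows, using a hypothetical `items_by_language` mapping and toy data rather than the real item lists:

```python
from collections import Counter


def language_coverage(items_by_language):
    """Map each uni-lemma to the number of languages whose form includes it."""
    counts = Counter()
    for unilemmas in items_by_language.values():
        counts.update(set(unilemmas))  # count a uni-lemma once per language
    return counts


# Hypothetical toy data: language -> uni-lemmas on that language's form.
items_by_language = {
    "Danish": ["dog", "ball", "water"],
    "Korean": ["dog", "water", "rice"],
    "Turkish": ["dog", "ball"],
}

coverage = language_coverage(items_by_language)
n_languages = len(items_by_language)
in_multiple = sorted(u for u, n in coverage.items() if n > 1)
in_all = sorted(u for u, n in coverage.items() if n == n_languages)
```

The number of languages a uni-lemma is missing from is then `n_languages - coverage[u]`.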
To evaluate how variable items are in their cross-linguistic difficulty, we calculate the standard deviation (SD) of each uni-lemma’s difficulty across languages. The median SD is 1.29 (SD=0.5), so we consider the 31 items with SD < 1.04 (the median minus half an SD). These items are shown below, sorted by the number of languages from which the uni-lemma is missing.
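The selection step can be sketched as follows: compute each uni-lemma’s SD of difficulty across the languages that include it, count the languages it is missing from, and keep items under both thresholds. The function names, data layout, and toy values are assumptions for illustration; the thresholds are left as parameters (the text uses SD < 1.04 and at most 6 missing languages for comprehension).

```python
import statistics


def summarize_items(difficulty, n_languages):
    """Summarize {unilemma: {language: difficulty}} into one row per uni-lemma."""
    rows = []
    for unilemma, by_language in difficulty.items():
        values = list(by_language.values())
        if len(values) < 2:
            continue  # SD is undefined for an item found in a single language
        rows.append({
            "unilemma": unilemma,
            "mean": statistics.mean(values),
            "sd": statistics.stdev(values),  # sample SD across languages
            "n_missing": n_languages - len(values),
        })
    return rows


def select_stable(rows, sd_threshold, max_missing):
    """Keep low-variability, well-covered items, sorted by languages missing."""
    keep = [r for r in rows
            if r["sd"] < sd_threshold and r["n_missing"] <= max_missing]
    return sorted(keep, key=lambda r: (r["n_missing"], r["sd"]))


# Hypothetical difficulties for four languages A-D (not real estimates).
difficulty = {
    "dog": {"A": -1.0, "B": -1.2, "C": -0.8, "D": -1.0},
    "ball": {"A": -0.5, "B": -0.3, "C": -0.7},
    "moon": {"A": 0.5, "B": 3.5},
    "sky": {"A": 1.0},
}

rows = summarize_items(difficulty, n_languages=4)
stable = select_stable(rows, sd_threshold=1.0, max_missing=1)
stable_names = [r["unilemma"] for r in stable]
```

Keeping the per-item mean alongside the SD makes it easy to check later whether the selected items are systematically easier or harder than the rest.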
Now we’ll turn to production data.
The difficulty distributions again mostly overlap across languages (with a difficulty prior ~ N(0,3)).
The table below shows the total number of production items per language that also have uni-lemmas.
Language | Items |
---|---|
British Sign Language | 532 |
Croatian | 380 |
Danish | 410 |
Dutch | 160 |
English (American) | 396 |
French (French) | 212 |
French (Quebecois) | 366 |
Hebrew | 439 |
Italian | 408 |
Kigiriama | 260 |
Kiswahili | 216 |
Korean | 284 |
Latvian | 402 |
Mandarin (Taiwanese) | 354 |
Norwegian | 395 |
Persian | 367 |
Portuguese (European) | 293 |
Russian | 424 |
Slovak | 307 |
Spanish (Chilean) | 455 |
Spanish (European) | 277 |
Spanish (Mexican) | 428 |
Swedish | 341 |
Turkish | 418 |
Below we examine the standard deviation of each uni-lemma’s cross-linguistic difficulty as a function of the number of languages from which that uni-lemma is missing. What thresholds should we use for inclusion? For now we consider items with below-median SD that are missing from no more than 6 languages (lower left region of the plot).
Of these, a total of 770 uni-lemmas are included in more than one language, and only 468 uni-lemmas are included in 6 or more of the languages. We will start by considering this more restricted list, but if it does not yield enough good candidates, we may instead make pairwise comparisons between each possible language pair (more flexible, but more complicated). (Only 11 uni-lemmas are used in all 20 languages.)
To evaluate how variable items are in their cross-linguistic difficulty, we calculate the standard deviation (SD) of each uni-lemma’s difficulty across languages. The median SD is 2.4 (SD=0.88), so we consider the 48 items with SD < 1.96 (the median minus half an SD). These items are shown below, sorted by the number of languages from which the uni-lemma is missing.
Fifteen of the uni-lemmas appear on both the comprehension and production lists of good cross-linguistic items. Is this enough items, or do we want to relax the criteria for inclusion? How does the difficulty of these “good” items compare to the overall item difficulties? If they are systematically easier, the list may not work well for a short-form CDI, since it will overestimate vocabulary size; plotting sd(difficulty) vs. mean(difficulty) would help check this. The next step is to run real-data simulations for each language using these items and see how well we recover full CDI scores and ability. If this doesn’t work well, we may consider constructing pairwise language lists.
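One way to start on the “systematically easier?” question is to intersect the two lists and compare the mean difficulty of the shared items with the mean over all items. The sketch below assumes a hypothetical `mean_difficulty` mapping (uni-lemma to its mean difficulty across languages) and uses toy values; it is a starting point, not a substitute for the proposed SD-vs-mean plot.

```python
import statistics


def compare_selected(mean_difficulty, comprehension_list, production_list):
    """Return the shared items, their mean difficulty, and the overall mean.

    Raises statistics.StatisticsError if the two lists share no items.
    """
    both = sorted(set(comprehension_list) & set(production_list))
    selected_mean = statistics.mean(mean_difficulty[u] for u in both)
    overall_mean = statistics.mean(mean_difficulty.values())
    return both, selected_mean, overall_mean


# Hypothetical mean difficulties and candidate lists (not real results).
mean_difficulty = {"dog": -1.0, "ball": -0.5, "water": 0.0, "moon": 1.5}
comprehension_list = ["dog", "ball", "water"]
production_list = ["ball", "dog", "moon"]

both, selected_mean, overall_mean = compare_selected(
    mean_difficulty, comprehension_list, production_list
)
# A selected mean well below the overall mean would suggest the stable
# items skew easy, which could bias a short form toward higher scores.
skews_easy = selected_mean < overall_mean
```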