Finding Equal Difficulty Cross-linguistic Items

Our goal is to compare IRT item parameters across a diverse set of languages and find a subset of uni-lemmas whose difficulty is similar from one language to the next. We start with 2PL models fit to Words & Gestures (WG) data, with comprehension and production fit separately, for 22 languages: Kigiriama, Kiswahili, British Sign Language, Croatian, Danish, English (American), Italian, Mandarin (Taiwanese), French (French), Korean, Latvian, Hebrew, Norwegian, French (Quebecois), Slovak, Spanish (European), Spanish (Mexican), Russian, Turkish, Portuguese (European), Dutch, Persian.
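As a reminder of what the difficulty parameter means in a 2PL model, here is a minimal sketch of the item response function. It assumes the common difficulty parameterization, a(theta - b); the fitting software behind this report may use a slope-intercept form instead, and the function name and item values are hypothetical.

```python
import math

def p_2pl(theta, a, b):
    """Probability that a child with ability `theta` knows an item with
    discrimination `a` and difficulty `b` (assumed difficulty form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Two hypothetical items with the same discrimination but different difficulty,
# evaluated for a child of average ability (theta = 0).
easy = p_2pl(theta=0.0, a=1.5, b=-1.0)  # low difficulty: probability above 0.5
hard = p_2pl(theta=0.0, a=1.5, b=1.0)   # high difficulty: probability below 0.5
```

Under this form, an item's difficulty is the ability level at which a child has a 50% chance of knowing the word, which is why comparable difficulties across languages are the property of interest here.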

WG Comprehension


The difficulty distributions mostly overlap across languages (these fits use a N(0,3) prior on difficulty).

Cross-linguistic similarities

We compute the Spearman correlation between item difficulties for every pair of languages. We might expect these correlations to recapitulate the historical relationships between languages, with more closely related languages having more similar item difficulties (e.g., Quebecois and European French).
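A minimal sketch of one such pairwise comparison, using only the standard library. It assumes each language's difficulties are stored in a dict keyed by uni-lemma and that the correlation is computed over the uni-lemmas the two languages share; the data and function names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman_shared(diff_a, diff_b):
    """Spearman correlation over the uni-lemmas present in both languages."""
    shared = sorted(set(diff_a) & set(diff_b))
    xs = [diff_a[u] for u in shared]
    ys = [diff_b[u] for u in shared]
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical item difficulties for two languages, keyed by uni-lemma.
lang_a = {"dog": -1.2, "ball": -0.8, "run": 0.4, "think": 1.9, "moon": 0.1}
lang_b = {"dog": -0.9, "ball": -1.1, "run": 0.7, "think": 2.3, "sky": 0.0}
rho = spearman_shared(lang_a, lang_b)
```

Restricting each pair to its shared uni-lemmas means different pairs are compared on different item sets, so correlations based on few shared items should be read with caution.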

Difficulty by CDI Category


Candidate Items

Shown below is the total number of items per language that have uni-lemmas.

Language	Items
British Sign Language	545
Croatian	396
Danish	410
Dutch	439
English (American)	396
French (French)	619
French (Quebecois)	408
Hebrew	442
Italian	408
Kigiriama	292
Kiswahili	287
Korean	284
Latvian	402
Mandarin (Taiwanese)	354
Norwegian	395
Persian	400
Portuguese (European)	317
Russian	427
Slovak	308
Spanish (European)	303
Spanish (Mexican)	428
Turkish	418

Below we examine the standard deviation of each uni-lemma’s cross-linguistic difficulty as a function of the number of languages from which that uni-lemma is missing. How should we choose thresholds for inclusion, both on the variability of an item’s difficulty and on the number of languages in which it appears? For now we consider items with below-median SD that are missing from no more than 6 languages (lower left region of the plot).


Of these, a total of 819 uni-lemmas are included in more than one language, and only 484 uni-lemmas are included in 6 or more of the languages. We start with this more restricted list; if it yields too few good candidates, we may instead compare each possible language pair separately (more flexible, but more complicated). (Only 21 uni-lemmas are used in all 22 languages.)
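Counts like these can be obtained by tallying, for each uni-lemma, the number of languages whose form includes it. The sketch below assumes a mapping from language to its set of uni-lemmas; the three toy languages and their items are hypothetical.

```python
from collections import Counter

# Hypothetical mapping: language -> set of uni-lemmas on that language's form.
forms = {
    "lang_a": {"dog", "ball", "run"},
    "lang_b": {"dog", "ball", "moon"},
    "lang_c": {"dog", "run"},
}

# Number of languages in which each uni-lemma appears.
coverage = Counter(u for items in forms.values() for u in items)

def n_with_at_least(coverage, k):
    """Count the uni-lemmas that appear in at least k languages."""
    return sum(1 for n in coverage.values() if n >= k)

in_multiple = n_with_at_least(coverage, 2)      # in more than one language
in_all = n_with_at_least(coverage, len(forms))  # in every language

# Number of languages from which each uni-lemma is missing.
n_missing = {u: len(forms) - n for u, n in coverage.items()}
```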

To evaluate how variable items are in their cross-linguistic difficulty, we calculate the standard deviation (SD) of each uni-lemma’s difficulty across languages. The median of these SDs is 1.29 (SD = 0.5), and we consider the 31 items with SD < 1.04, a cutoff that sits half an SD below the median. These items are shown below, sorted by the number of languages from which the uni-lemma is missing.
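The selection step can be sketched as follows. The reported cutoff of 1.04 is consistent with the median minus half an SD (1.29 - 0.5 × 0.5), so the sketch assumes that rule through the hypothetical `sd_offset` parameter; the function name, the `max_missing` argument, and the toy data are also assumptions rather than the analysis code behind this report.

```python
import statistics

def select_stable_items(difficulty, n_languages, max_missing=6, sd_offset=0.5):
    """Return uni-lemmas with low cross-linguistic difficulty SD.

    difficulty: uni-lemma -> list of difficulties, one per language that has it.
    """
    # SD needs at least two languages per uni-lemma.
    sds = {u: statistics.stdev(d) for u, d in difficulty.items() if len(d) >= 2}
    sd_values = list(sds.values())
    # Assumed rule: median of the SDs minus `sd_offset` times their SD.
    cutoff = statistics.median(sd_values) - sd_offset * statistics.stdev(sd_values)
    keep = [
        u for u, sd in sds.items()
        if sd < cutoff and n_languages - len(difficulty[u]) <= max_missing
    ]
    # Sort by number of languages missing, then alphabetically.
    return sorted(keep, key=lambda u: (n_languages - len(difficulty[u]), u))

# Hypothetical difficulties for four uni-lemmas across up to three languages.
toy = {
    "dog": [-1.0, -1.1, -0.9],
    "ball": [-0.5, 0.5, 0.0],
    "run": [0.0, 2.0, 1.0],
    "think": [1.0, 3.0],
}
selected = select_stable_items(toy, n_languages=3, max_missing=1)
```

Because the cutoff depends on the distribution of SDs, adding or dropping languages changes which items qualify, so the thresholds are worth revisiting whenever the language set changes.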

WG Production

Now we’ll turn to production data.


Cross-linguistic similarities

The difficulty distributions again mostly overlap across languages (with a N(0,3) prior on difficulty).

Difficulty by CDI Category


Candidate Items

Shown below is the total number of items per language that have uni-lemmas.

Language	Items
British Sign Language	532
Croatian	380
Danish	410
Dutch	160
English (American)	396
French (French)	212
French (Quebecois)	366
Hebrew	439
Italian	408
Kigiriama	260
Kiswahili	216
Korean	284
Latvian	402
Mandarin (Taiwanese)	354
Norwegian	395
Persian	367
Portuguese (European)	293
Russian	424
Slovak	307
Spanish (Chilean)	455
Spanish (European)	277
Spanish (Mexican)	428
Swedish	341
Turkish	418

Below we examine the standard deviation of each uni-lemma’s cross-linguistic difficulty as a function of the number of languages from which that uni-lemma is missing. Which thresholds should we use for inclusion? For now we consider items with below-median SD that are missing from no more than 6 languages (lower left region of the plot).


Of these, a total of 770 uni-lemmas are included in more than one language, and only 468 uni-lemmas are included in 6 or more of the languages. We start with this more restricted list; if it yields too few good candidates, we may instead compare each possible language pair separately (more flexible, but more complicated). (Only 11 uni-lemmas are used in all 20 languages.)

To evaluate how variable items are in their cross-linguistic difficulty, we calculate the standard deviation (SD) of each uni-lemma’s difficulty across languages. The median of these SDs is 2.4 (SD = 0.88), and we consider the 48 items with SD < 1.96, a cutoff that sits half an SD below the median. These items are shown below, sorted by the number of languages from which the uni-lemma is missing.

Overview of Cross-linguistic Comprehension and Production Difficulties by Category

Next steps

Fifteen of the uni-lemmas are on both the good cross-linguistic comprehension list and the good production list. Is this enough items, or do we want to relax the criteria for inclusion? How does the difficulty of these “good” items compare to the overall item difficulties? Are they systematically easier? If so, the list may not work well for a short-form CDI, since it will overestimate vocabulary size; we should plot sd(difficulty) against mean(difficulty) to check. The next step is to run real-data simulations for each language using these items and see how well we recover full CDI scores and ability estimates. If this doesn’t work well, we may consider constructing pairwise language lists.
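A minimal sketch of the first two checks: intersecting the two candidate lists, and comparing the mean difficulty of the overlapping items with the mean over all items. The item sets and difficulty values are hypothetical placeholders, not results from this analysis.

```python
import statistics

# Hypothetical candidate lists from the comprehension and production analyses.
comp_good = {"dog", "ball", "moon"}
prod_good = {"dog", "moon", "run"}
both = sorted(comp_good & prod_good)

# Hypothetical mean cross-linguistic difficulty for every uni-lemma.
mean_difficulty = {"dog": -1.0, "ball": -0.6, "moon": 0.2, "run": 0.5, "think": 1.8}

selected_mean = statistics.mean(mean_difficulty[u] for u in both)
overall_mean = statistics.mean(mean_difficulty.values())

# A negative value means the selected items are easier than average.
bias = selected_mean - overall_mean
```

A clearly negative `bias` on the real data would suggest widening the criteria or balancing the selected items by difficulty before running the simulations.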

George

2022-10-10
