Uni-lemma Overlap on CDI Forms

Number of overlapping concepts (uni-lemmas) shared by each pair of CDI forms. American English has the highest average overlap (458 uni-lemmas), which may be driven in large part by its being the oldest form, from which many others were adapted. Chilean Spanish has the lowest average overlap, sharing only 314 uni-lemmas with other forms on average. The median of the per-language average overlap is 393 uni-lemmas.
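As an illustration of how such overlap statistics can be derived, the following minimal sketch counts shared uni-lemmas for every pair of forms and then averages per form. The `forms` dictionary and its contents are hypothetical toy data, not the actual CDI forms, and the function names are ours.

```python
from itertools import combinations
from statistics import mean, median

# Hypothetical toy data: each form maps to the set of uni-lemmas it contains.
forms = {
    "English (American)": {"dog", "cat", "ball", "eat", "water"},
    "Spanish (Chilean)": {"dog", "ball", "water"},
    "Danish": {"dog", "cat", "eat", "water", "shoe"},
}

def pairwise_overlap(forms):
    """Count shared uni-lemmas for every unordered pair of forms."""
    return {
        (a, b): len(forms[a] & forms[b])
        for a, b in combinations(sorted(forms), 2)
    }

def average_overlap(forms):
    """Average, for each form, its overlap with every other form."""
    pairs = pairwise_overlap(forms)
    return {
        name: mean(n for pair, n in pairs.items() if name in pair)
        for name in forms
    }

overlaps = pairwise_overlap(forms)
averages = average_overlap(forms)
print(overlaps)
print(averages)
print("median of per-form averages:", median(averages.values()))
```

Keying the pairs on sorted form names keeps each unordered pair in the dictionary exactly once.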

Word Difficulty by Semantic Category and Language

Mean difficulty of CDI words by semantic category and language. Bars represent bootstrapped 95% confidence intervals.
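For readers unfamiliar with bootstrapped intervals, here is a minimal sketch of one common variant, the percentile bootstrap for a mean. The choice of the percentile method, the number of resamples, and the `difficulties` values are all assumptions made for illustration; the source does not state which bootstrap procedure was used.

```python
import random
from statistics import mean

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement n_boot times and record each resample's mean.
    boot_means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical item difficulties for one semantic category in one language.
difficulties = [-1.2, -0.4, 0.1, 0.3, 0.8, 1.5, -0.9, 0.0]
low, high = bootstrap_ci(difficulties)
print(f"mean = {mean(difficulties):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```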

Cross-linguistic similarities

We compute the Spearman correlation between the item difficulties of each pair of languages. We might expect these correlations to recapitulate the historical relationships between languages, with more closely related languages having more similar item difficulties (e.g., Quebecois and European French).
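The pairwise comparison described above can be sketched as follows. This is a minimal illustration that assumes the correlation for each pair is computed over the uni-lemmas the two forms share; the `difficulty` values are hypothetical, and the helper functions are ours rather than the authors' code.

```python
from itertools import combinations

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def similarity_matrix(difficulty):
    """Spearman correlation over shared uni-lemmas for each language pair."""
    result = {}
    for a, b in combinations(sorted(difficulty), 2):
        shared = sorted(difficulty[a].keys() & difficulty[b].keys())
        result[(a, b)] = spearman(
            [difficulty[a][u] for u in shared],
            [difficulty[b][u] for u in shared],
        )
    return result

# Hypothetical item difficulties (uni-lemma -> difficulty) per language.
difficulty = {
    "French (French)": {"dog": -1.0, "ball": -0.5, "eat": 0.2, "because": 1.8},
    "French (Quebecois)": {"dog": -0.8, "ball": -0.6, "eat": 0.4, "because": 1.5},
    "Korean": {"dog": 0.3, "ball": -0.9, "eat": -0.2, "water": -1.0},
}
sim = similarity_matrix(difficulty)
for pair, rho in sim.items():
    print(pair, round(rho, 3))
```

Because Spearman correlation depends only on rank order, it is insensitive to differences in the scale of difficulty estimates across separately fitted models.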

Cross-linguistic similarity (Spearman correlation) of IRT item difficulty from the CDI.

Cross-validation Results

Difficulty correlation and overlap by k for the different sublist selection methods.

Swadesh CDI vs. Full CDI:WS

The table below shows, for each of the 32 languages on which the IRT models were trained, the correlation between full CDI scores and Swadesh-CDI scores. The Swadesh-CDI here was selected with the syntactic category stratification method from items that appear on at least \(k=27\) CDIs.

Swadesh CDI vs. Full CDI scores for 32 training languages.

| k | Language | Overlap | Full vs. S-CDI r |
|---|---|---|---|
| 27 | Arabic (Saudi) | 95 | 0.966 |
| 27 | Cantonese | 90 | 0.986 |
| 27 | Catalan | 93 | 0.989 |
| 27 | Croatian | 97 | 0.992 |
| 27 | Czech | 94 | 0.988 |
| 27 | Danish | 99 | 0.992 |
| 27 | Dutch | 98 | 0.965 |
| 27 | English (American) | 98 | 0.989 |
| 27 | English (Australian) | 76 | 0.990 |
| 27 | English (British) | 90 | 0.929 |
| 27 | Estonian | 93 | 0.990 |
| 27 | Finnish | 95 | 0.944 |
| 27 | French (French) | 97 | 0.973 |
| 27 | French (Quebecois) | 98 | 0.985 |
| 27 | German | 90 | 0.991 |
| 27 | Hebrew | 87 | 0.980 |
| 27 | Hungarian | 96 | 0.989 |
| 27 | Italian | 97 | 0.991 |
| 27 | Japanese | 86 | 0.986 |
| 27 | Korean | 87 | 0.986 |
| 27 | Latvian | 96 | 0.989 |
| 27 | Mandarin (Beijing) | 92 | 0.988 |
| 27 | Mandarin (Taiwanese) | 88 | 0.991 |
| 27 | Norwegian | 97 | 0.992 |
| 27 | Portuguese (European) | 91 | 0.988 |
| 27 | Russian | 89 | 0.987 |
| 27 | Slovak | 84 | 0.988 |
| 27 | Spanish (Argentinian) | 96 | 0.984 |
| 27 | Spanish (European) | 84 | 0.986 |
| 27 | Spanish (Mexican) | 94 | 0.988 |
| 27 | Swedish | 96 | 0.985 |
| 27 | Turkish | 87 | 0.990 |
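To make the comparison concrete, the sketch below shows one way a full-form score and a short-list score could be correlated across children. It assumes that both scores are simple counts of words produced, that r is a Pearson correlation, and that "overlap" means the number of Swadesh-CDI items present on a given form; the source does not spell out these details, and all data and names here are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Hypothetical form and short list; "moon" is not on this form.
form_items = {"dog", "cat", "ball", "eat", "water", "shoe", "because", "under"}
swadesh_items = ["dog", "eat", "water", "because", "moon"]

# Hypothetical data: the set of words each child is reported to produce.
children = [
    {"dog"},
    {"dog", "cat", "ball"},
    {"dog", "cat", "ball", "eat", "water"},
    {"dog", "cat", "ball", "eat", "water", "shoe", "because"},
    set(form_items),
]

# Only short-list items that appear on the form can be scored.
on_form = {item for item in swadesh_items if item in form_items}
overlap = len(on_form)

full_scores = [len(child & form_items) for child in children]
short_scores = [len(child & on_form) for child in children]
r = pearson(full_scores, short_scores)
print(f"overlap = {overlap}, r = {r:.3f}")
```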

Generalization Test

We compare Swadesh-CDI scores to full CDI:WS scores for the 10 low-data languages [ToDo: with the 10 additional difficult words added]. Note that many of the proposed Swadesh-CDI items do not appear on the CDI:WS forms available for testing generalization.

Swadesh CDI vs. Full CDI:WS scores for 10 generalization languages.

| Language | 2 strata | Random | category | syntactic | unstratified |
|---|---|---|---|---|---|
| American Sign Language | 0.975 | 0.983 | 0.982 | 0.976 | 0.974 |
| British Sign Language | 0.989 | 0.991 | 0.989 | 0.990 | 0.989 |
| English (Irish) | 0.974 | 0.975 | 0.973 | 0.964 | 0.967 |
| Greek (Cypriot) | 0.979 | 0.988 | 0.991 | 0.985 | 0.979 |
| Irish | 0.977 | 0.984 | 0.988 | 0.980 | 0.974 |
| Kigiriama | 0.965 | 0.939 | 0.934 | 0.964 | 0.962 |
| Kiswahili | 0.979 | 0.980 | 0.988 | 0.985 | 0.976 |
| Persian | 0.954 | 0.946 | 0.955 | 0.957 | 0.958 |
| Spanish (Chilean) | 0.966 | 0.903 | 0.900 | 0.968 | 0.969 |
| Spanish (Peruvian) | 0.979 | 0.967 | 0.956 | 0.980 | 0.957 |

Swadesh-CDI Items

Below we show the full list of 100 Swadesh CDI uni-lemmas, along with their average cross-linguistic difficulty (d_m), variability in difficulty (d_sd), number of CDI:WS forms they appear on (n, out of 32), semantic category, and lexical category. The semantic and lexical categories are taken from the American English form, because a uni-lemma sometimes appears in different categories on different forms.

Guide to Developing a Swadesh-based CDI

Developing a CDI for a new language is a substantial task that requires specific linguistic and cultural knowledge, even with the new Swadesh-CDI recommendations in hand. The CDI Advisory Board has enumerated many recommendations and caveats (https://mb-cdi.stanford.edu/adaptations.html). Rather than recapitulate the process and caveats already itemized there, we suggest how the Swadesh CDI can be used to jump-start the process.

  1. As a starting point, take the 100 Swadesh-CDI uni-lemmas, and consider their linguistic and cultural appropriateness for the target language. Include as many as are relevant.
  2. Consider adding uni-lemmas that are frequently included on other CDIs, and especially focus on semantic categories that are less well-represented on the S-CDI, including helping verbs, quantifiers, locatives, and toys. We recommend adding 10-20 items across these categories, including the 10-item extension we tested, which are present on most existing CDI forms (“try”, “under”, “in”, “out”, “a lot”, “all”, “many”, “not”, “chalk”, “toy”).
  3. Run a pilot study that includes several children at both the upper and lower ends of the intended age range, and revise the form if many children score at floor or ceiling.
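The steps above can be sketched in code. This is only an illustration: the 10 extension items are the ones listed in step 2, but the appropriateness check, the stand-in item list, the 5% margin used to define floor and ceiling, and the pilot scores are all hypothetical choices that a form developer would need to replace with their own judgment and data.

```python
# The 10-item extension named in step 2.
EXTENSION = ["try", "under", "in", "out", "a lot", "all", "many", "not", "chalk", "toy"]

def draft_item_list(swadesh_items, is_appropriate, extra_items=EXTENSION):
    """Steps 1-2: keep appropriate Swadesh-CDI items, then append extras."""
    kept = [u for u in swadesh_items if is_appropriate(u)]
    return kept + [u for u in extra_items if u not in kept]

def floor_ceiling_rates(scores, n_items, margin=0.05):
    """Step 3: share of children within `margin` of the minimum or maximum score."""
    low_cut = margin * n_items
    high_cut = (1 - margin) * n_items
    at_floor = sum(s <= low_cut for s in scores) / len(scores)
    at_ceiling = sum(s >= high_cut for s in scores) / len(scores)
    return at_floor, at_ceiling

# Hypothetical stand-in for the 100 Swadesh-CDI uni-lemmas; here "snow" is
# judged (for illustration only) to be inappropriate for the target language.
draft = draft_item_list(["dog", "water", "snow"], lambda u: u != "snow")
print(len(draft), draft)

# Hypothetical pilot scores on a 110-item draft form.
n_items = 110
youngest = [0, 2, 3, 0, 8, 15, 1, 4]
oldest = [98, 107, 110, 85, 109, 92, 106, 74]
young_floor, young_ceiling = floor_ceiling_rates(youngest, n_items)
old_floor, old_ceiling = floor_ceiling_rates(oldest, n_items)
print("youngest: floor", young_floor, "ceiling", young_ceiling)
print("oldest: floor", old_floor, "ceiling", old_ceiling)
```

In this toy pilot, a high floor rate in the youngest group would suggest adding easier items, and a high ceiling rate in the oldest group would suggest adding harder ones.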

Comparison of Swadesh CDI Concepts to Other Lists

28 of the Swadesh CDI candidates are on the 100-item CDI:WS short form A, and 26 are on the CDI:WS short form B. 22 Swadesh CDI concepts are also on the original Swadesh-100 list, and 7 overlap with the ASJP list (a subset of the Swadesh-100 that performs as well as the Swadesh-100 for purposes of glottochronology).