Uni-lemma Overlap on CDI Forms

Number of overlapping concepts on each pair of CDI forms. American English has the highest average degree of overlap (458 uni-lemmas), which may be driven in large part by it being the oldest form, from which many others were adapted. The form with the lowest average degree of overlap is Chilean Spanish, with only 314 uni-lemmas shared with other forms, on average. The median of the average overlap per language is 393 uni-lemmas.
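The overlap counts described above can be illustrated with a minimal sketch. The form contents here are toy data, and the function names `pairwise_overlap` and `mean_overlap` are hypothetical helpers introduced only for illustration; the real forms contain hundreds of uni-lemmas each.

```python
from itertools import combinations

# Toy data (assumption): each form maps to the set of uni-lemmas it contains.
forms = {
    "English (American)": {"dog", "cat", "ball", "milk", "eat"},
    "Spanish (Chilean)": {"dog", "ball", "water"},
    "Danish": {"dog", "cat", "milk", "water", "eat"},
}

def pairwise_overlap(forms):
    """Return {(form_a, form_b): number of shared uni-lemmas} for every pair."""
    return {
        (a, b): len(forms[a] & forms[b])
        for a, b in combinations(sorted(forms), 2)
    }

def mean_overlap(forms):
    """For each form, average its overlap with every other form."""
    pairs = pairwise_overlap(forms)
    means = {}
    for name in forms:
        counts = [n for pair, n in pairs.items() if name in pair]
        means[name] = sum(counts) / len(counts)
    return means

overlap = pairwise_overlap(forms)
averages = mean_overlap(forms)
for name, value in sorted(averages.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.1f}")
```

Taking the median of the values in `averages` would give the per-language summary statistic reported in the caption.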

Word Difficulty by Semantic Category and Language

Mean difficulty of CDI words by semantic category and language. Bars represent bootstrapped 95% confidence intervals.
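For readers unfamiliar with bootstrapped intervals, the following is a minimal sketch of one common variant, the percentile bootstrap of a mean. The source does not say which bootstrap variant or how many resamples were used, so the method, `n_boot`, and the toy difficulty values are all assumptions.

```python
import random
import statistics

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy item difficulties for one semantic category in one language (assumption).
difficulties = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5, 2.0, 0.7]
lo, hi = bootstrap_ci(difficulties)
print(f"mean = {statistics.fmean(difficulties):.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```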

Cross-linguistic similarities

We compute the Spearman correlation between item difficulties for every pair of languages. We might expect these correlations to recapitulate the historical relationships between languages, with more closely related languages having more similar item difficulties (e.g., Quebecois and European French).

Cross-linguistic similarity (Spearman correlation) of IRT item difficulty from the CDI.
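A minimal sketch of the pairwise comparison, assuming each language is represented as a mapping from uni-lemma to IRT difficulty and that the correlation is computed over the uni-lemmas the two forms share. The difficulty values and helper names are hypothetical; ties receive average ranks.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_shared(diff_a, diff_b):
    """Spearman correlation over the uni-lemmas present in both mappings."""
    shared = sorted(set(diff_a) & set(diff_b))
    x = [diff_a[u] for u in shared]
    y = [diff_b[u] for u in shared]
    return pearson(average_ranks(x), average_ranks(y))

# Toy difficulties (assumption), not values from the fitted models.
french_fr = {"dog": -1.0, "ball": -0.5, "eat": 0.2, "because": 1.8, "cheese": 0.4}
french_qc = {"dog": -0.8, "ball": -0.6, "eat": 0.5, "because": 2.1, "snow": 0.0}
rho = spearman_shared(french_fr, french_qc)
print(f"Spearman rho on shared items: {rho:.3f}")
```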

Cross-validation Results

Difficulty correlation and overlap by k for the different sublist selection methods.

Swadesh CDI vs. Full CDI:WS

The table below shows the correlation between the full CDI and the Swadesh-CDI under the syntactic-category stratification method, using items that appear on at least $k=27$ CDIs, for each of the 32 languages on which the IRT models were trained.

Swadesh CDI vs. Full CDI scores for 32 training languages.
k	Language	Overlap	Full vs. S-CDI r
27	Arabic (Saudi)	95	0.966
27	Cantonese	90	0.986
27	Catalan	93	0.989
27	Croatian	97	0.992
27	Czech	94	0.988
27	Danish	99	0.992
27	Dutch	98	0.965
27	English (American)	98	0.989
27	English (Australian)	76	0.990
27	English (British)	90	0.929
27	Estonian	93	0.990
27	Finnish	95	0.944
27	French (French)	97	0.973
27	French (Quebecois)	98	0.985
27	German	90	0.991
27	Hebrew	87	0.980
27	Hungarian	96	0.989
27	Italian	97	0.991
27	Japanese	86	0.986
27	Korean	87	0.986
27	Latvian	96	0.989
27	Mandarin (Beijing)	92	0.988
27	Mandarin (Taiwanese)	88	0.991
27	Norwegian	97	0.992
27	Portuguese (European)	91	0.988
27	Russian	89	0.987
27	Slovak	84	0.988
27	Spanish (Argentinian)	96	0.984
27	Spanish (European)	84	0.986
27	Spanish (Mexican)	94	0.988
27	Swedish	96	0.985
27	Turkish	87	0.990
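
The kind of comparison summarized in this table can be sketched as follows. This is an illustration under stated assumptions, not the analysis pipeline: it scores each child by a simple count of produced items (the actual scores may instead be model-based), uses toy data, and the helper names are hypothetical.

```python
def items_on_at_least_k(form_counts, k):
    """Keep uni-lemmas that appear on at least k forms."""
    return {u for u, n in form_counts.items() if n >= k}

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Toy data (assumption): number of forms each uni-lemma appears on.
form_counts = {"dog": 32, "ball": 30, "eat": 28, "cheese": 20, "because": 27}
candidates = items_on_at_least_k(form_counts, k=27)

# Toy data (assumption): the set of items each child is reported to produce.
children = [
    {"dog"},
    {"dog", "ball"},
    {"dog", "ball", "eat", "cheese"},
    {"dog", "ball", "eat", "cheese", "because"},
]
full_scores = [len(c) for c in children]
short_scores = [len(c & candidates) for c in children]
r = pearson_r(full_scores, short_scores)
print(f"full vs. short-list r = {r:.3f}")
```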

Generalization Test

Comparison to the 10 low-data languages [ToDo: with the 10 additional difficult words added]. Note that many proposed Swadesh items are not actually on the CDI:WS forms available to test generalization.

Swadesh CDI vs. Full CDI:WS scores for 10 generalization languages.
Language	2 strata	Random	Category	Syntactic	Unstratified
American Sign Language	0.975	0.983	0.982	0.976	0.974
British Sign Language	0.989	0.991	0.989	0.990	0.989
English (Irish)	0.974	0.975	0.973	0.964	0.967
Greek (Cypriot)	0.979	0.988	0.991	0.985	0.979
Irish	0.977	0.984	0.988	0.980	0.974
Kigiriama	0.965	0.939	0.934	0.964	0.962
Kiswahili	0.979	0.980	0.988	0.985	0.976
Persian	0.954	0.946	0.955	0.957	0.958
Spanish (Chilean)	0.966	0.903	0.900	0.968	0.969
Spanish (Peruvian)	0.979	0.967	0.956	0.980	0.957

Swadesh-CDI Items

Below we show the full list of 100 Swadesh CDI uni-lemmas, along with their average cross-linguistic difficulty (d_m), variability in difficulty (d_sd), number of CDI:WS forms they appear on (n, out of 32), semantic category, and lexical category. The semantic and lexical categories are based on American English, as uni-lemmas sometimes appear in different categories on different forms, as appropriate.

Guide to Developing a Swadesh-based CDI

Developing a CDI for a new language is a substantial task requiring specific linguistic and cultural knowledge, even with the new Swadesh-CDI recommendations in hand. The CDI Advisory Board has enumerated many recommendations and caveats at https://mb-cdi.stanford.edu/adaptations.html. Rather than recapitulate the process and caveats already itemized there, we suggest how the Swadesh CDI can be used to jump-start the process.

As a starting point, take the 100 Swadesh-CDI uni-lemmas, and consider their linguistic and cultural appropriateness for the target language. Include as many as are relevant.
Consider adding uni-lemmas that are frequently included on other CDIs, and especially focus on semantic categories that are less well-represented on the S-CDI, including helping verbs, quantifiers, locatives, and toys. We recommend adding 10-20 items across these categories, including the 10-item extension we tested, which are present on most existing CDI forms (“try”, “under”, “in”, “out”, “a lot”, “all”, “many”, “not”, “chalk”, “toy”).
Do a pilot study, including several children at the upper and lower intended ages, and revise if there are many children at floor or ceiling.
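
The floor and ceiling check in the pilot step can be sketched as below. The 5% margin, the pilot scores, and the 110-item form length (100 Swadesh-CDI items plus the 10-item extension) are illustrative assumptions; what counts as "many children" at floor or ceiling is left to the form developer.

```python
def floor_ceiling_rates(scores, n_items, margin=0.05):
    """Return the share of children scoring near 0 and near n_items.

    A child counts as at floor if the score is within `margin` of zero
    (as a fraction of n_items), and at ceiling if within `margin` of n_items.
    """
    floor_cut = margin * n_items
    ceiling_cut = (1 - margin) * n_items
    n = len(scores)
    at_floor = sum(s <= floor_cut for s in scores) / n
    at_ceiling = sum(s >= ceiling_cut for s in scores) / n
    return at_floor, at_ceiling

# Hypothetical pilot scores on a 110-item draft form.
pilot_scores = [0, 3, 12, 40, 77, 98, 110, 110]
at_floor, at_ceiling = floor_ceiling_rates(pilot_scores, n_items=110)
print(f"at floor: {at_floor:.0%}, at ceiling: {at_ceiling:.0%}")
```

High rates at either end would suggest adding harder or easier items, or adjusting the intended age range, before wider data collection.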

Comparison of Swadesh CDI Concepts to Other Lists

28 of the Swadesh CDI candidates are on the 100-item CDI:WS short form A, and 26 are on CDI:WS short form B. 22 Swadesh CDI concepts are also on the original Swadesh-100 list, and 7 overlap with the ASJP list (a subset of the Swadesh-100 that performs as well as the full Swadesh-100 for purposes of glottochronology).

Appendix to Measuring Children’s Early Vocabulary in Low-Resource Languages Using a Swadesh-style Word List

[redacted for anonymous review]

May 7, 2025