Number of overlapping concepts for each pair of CDI forms. American English has the highest average overlap (458 uni-lemmas), which may be driven in large part by its being the oldest form, from which many others were adapted. Chilean Spanish has the lowest average overlap, sharing only 314 uni-lemmas with other forms on average. The median of the average overlap per language is 393 uni-lemmas.
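As an illustration of how such overlap counts can be derived, the following is a minimal sketch that assumes each form is represented as a set of uni-lemmas. The form names and word lists are hypothetical toy data, not the actual CDI forms, and the function names are ours.

```python
from itertools import combinations
import statistics

# Hypothetical toy data: form name -> set of uni-lemmas on that form.
forms = {
    "form_a": {"dog", "cat", "ball", "eat"},
    "form_b": {"dog", "cat", "milk", "eat"},
    "form_c": {"dog", "ball", "milk", "sleep"},
}

def pairwise_overlap(forms):
    """Return {(form1, form2): number of shared uni-lemmas} per unordered pair."""
    return {
        (a, b): len(forms[a] & forms[b])
        for a, b in combinations(sorted(forms), 2)
    }

def mean_overlap_per_form(forms):
    """For each form, average its overlap with every other form."""
    overlaps = pairwise_overlap(forms)
    result = {}
    for name in forms:
        shared = [n for pair, n in overlaps.items() if name in pair]
        result[name] = sum(shared) / len(shared)
    return result

overlaps = pairwise_overlap(forms)
averages = mean_overlap_per_form(forms)
median_of_averages = statistics.median(averages.values())
```

Here `averages` plays the role of the per-form average overlap, and `median_of_averages` the role of the median reported in the caption.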
Mean difficulty of CDI words by semantic category and language. Bars represent bootstrapped 95% confidence intervals.
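For readers unfamiliar with bootstrapped intervals, the sketch below shows one common variant, the percentile bootstrap for a mean. It is an assumption that this variant matches the one used for the figure; the difficulty values, the number of resamples, and the seed are hypothetical.

```python
import random
import statistics

def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement n_boot times and record each resample's mean.
    boot_means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    # Take the alpha/2 and 1 - alpha/2 quantiles of the resampled means.
    lo_idx = round((alpha / 2) * (n_boot - 1))
    hi_idx = round((1 - alpha / 2) * (n_boot - 1))
    return boot_means[lo_idx], boot_means[hi_idx]

# Hypothetical item difficulties for one semantic category in one language.
difficulties = [-1.2, -0.4, 0.1, 0.3, 0.8, 1.5, -0.7, 0.0]
lower, upper = bootstrap_mean_ci(difficulties)
```

The pair `(lower, upper)` would correspond to the ends of one error bar.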
We compute the Spearman correlation between item difficulties for each pair of languages. We might expect these correlations to recapitulate the historical relationships between languages, with more similar languages having more similar item difficulties (e.g., Quebecois and European French).
Cross-linguistic similarity (Spearman correlation) of IRT item difficulty from the CDI.
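A minimal sketch of the pairwise computation follows. It assumes that each correlation is taken over the uni-lemmas shared by the two languages, which is our assumption rather than a detail stated here; the language names and difficulty values are hypothetical.

```python
from itertools import combinations

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

def pairwise_spearman(difficulty):
    """difficulty: {language: {uni_lemma: difficulty}} -> {(lang1, lang2): rho}."""
    out = {}
    for a, b in combinations(sorted(difficulty), 2):
        shared = sorted(set(difficulty[a]) & set(difficulty[b]))
        out[(a, b)] = spearman(
            [difficulty[a][u] for u in shared],
            [difficulty[b][u] for u in shared],
        )
    return out

# Hypothetical IRT item difficulties for three languages.
difficulty = {
    "lang_a": {"dog": -1.0, "ball": -0.5, "eat": 0.2, "sleep": 0.9},
    "lang_b": {"dog": -0.8, "ball": -0.6, "eat": 0.5, "sleep": 0.4, "milk": 0.0},
    "lang_c": {"dog": 0.7, "ball": 0.1, "eat": -0.3, "sleep": -1.1},
}
rho = pairwise_spearman(difficulty)
```

In this toy example `lang_a` and `lang_b` order the shared items almost identically, whereas `lang_c` reverses the order, so its correlations with the other two are negative.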
Difficulty correlation and overlap by k for the different sublist selection methods.
The table below shows the correlation between the full CDI and the Swadesh-CDI under the syntactic category stratification method, using items that appear on at least \(k=27\) CDIs, for each of the 32 languages on which the IRT models were trained.
k | Language | Overlap | Full vs. S-CDI r |
---|---|---|---|
27 | Arabic (Saudi) | 95 | 0.966 |
27 | Cantonese | 90 | 0.986 |
27 | Catalan | 93 | 0.989 |
27 | Croatian | 97 | 0.992 |
27 | Czech | 94 | 0.988 |
27 | Danish | 99 | 0.992 |
27 | Dutch | 98 | 0.965 |
27 | English (American) | 98 | 0.989 |
27 | English (Australian) | 76 | 0.990 |
27 | English (British) | 90 | 0.929 |
27 | Estonian | 93 | 0.990 |
27 | Finnish | 95 | 0.944 |
27 | French (French) | 97 | 0.973 |
27 | French (Quebecois) | 98 | 0.985 |
27 | German | 90 | 0.991 |
27 | Hebrew | 87 | 0.980 |
27 | Hungarian | 96 | 0.989 |
27 | Italian | 97 | 0.991 |
27 | Japanese | 86 | 0.986 |
27 | Korean | 87 | 0.986 |
27 | Latvian | 96 | 0.989 |
27 | Mandarin (Beijing) | 92 | 0.988 |
27 | Mandarin (Taiwanese) | 88 | 0.991 |
27 | Norwegian | 97 | 0.992 |
27 | Portuguese (European) | 91 | 0.988 |
27 | Russian | 89 | 0.987 |
27 | Slovak | 84 | 0.988 |
27 | Spanish (Argentinian) | 96 | 0.984 |
27 | Spanish (European) | 84 | 0.986 |
27 | Spanish (Mexican) | 94 | 0.988 |
27 | Swedish | 96 | 0.985 |
27 | Turkish | 87 | 0.990 |
Comparison to the 10 low-data languages [ToDo: with the 10 additional difficult words added]. Note that many of the proposed Swadesh items do not actually appear on the CDI:WS forms available for testing generalization.
Language | 2 strata | Random | Category | Syntactic | Unstratified |
---|---|---|---|---|---|
American Sign Language | 0.975 | 0.983 | 0.982 | 0.976 | 0.974 |
British Sign Language | 0.989 | 0.991 | 0.989 | 0.990 | 0.989 |
English (Irish) | 0.974 | 0.975 | 0.973 | 0.964 | 0.967 |
Greek (Cypriot) | 0.979 | 0.988 | 0.991 | 0.985 | 0.979 |
Irish | 0.977 | 0.984 | 0.988 | 0.980 | 0.974 |
Kigiriama | 0.965 | 0.939 | 0.934 | 0.964 | 0.962 |
Kiswahili | 0.979 | 0.980 | 0.988 | 0.985 | 0.976 |
Persian | 0.954 | 0.946 | 0.955 | 0.957 | 0.958 |
Spanish (Chilean) | 0.966 | 0.903 | 0.900 | 0.968 | 0.969 |
Spanish (Peruvian) | 0.979 | 0.967 | 0.956 | 0.980 | 0.957 |
Below we show the full list of 100 Swadesh CDI uni-lemmas, along with their average cross-linguistic difficulty (d_m), variability in difficulty (d_sd), the number of CDI:WS forms they appear on (n, out of 32), semantic category, and lexical category. The semantic and lexical categories are based on American English, because uni-lemmas sometimes appear in different categories on different forms.
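The per-item summaries d_m, d_sd, and n can be obtained along the following lines. This is a sketch under stated assumptions: difficulties are stored per language and uni-lemma, d_sd is taken to be the sample standard deviation (the exact estimator is not specified here), and all names and values are hypothetical.

```python
import statistics

# Hypothetical data: {language: {uni_lemma: IRT difficulty}}.
difficulty = {
    "lang_a": {"dog": -1.0, "ball": -0.5, "eat": 0.2},
    "lang_b": {"dog": -0.8, "ball": -0.7, "milk": 0.0},
    "lang_c": {"dog": -1.2, "eat": 0.4, "milk": 0.1},
}

def summarize(difficulty):
    """Return {uni_lemma: {"d_m": mean, "d_sd": sample SD, "n": form count}}."""
    per_item = {}
    for items in difficulty.values():
        for uni_lemma, d in items.items():
            per_item.setdefault(uni_lemma, []).append(d)
    return {
        u: {
            "d_m": statistics.fmean(ds),
            # Assumption: sample SD (n - 1 denominator); undefined for n = 1.
            "d_sd": statistics.stdev(ds) if len(ds) > 1 else float("nan"),
            "n": len(ds),
        }
        for u, ds in per_item.items()
    }

def on_at_least_k(summary, k):
    """Uni-lemmas that appear on at least k forms, in alphabetical order."""
    return sorted(u for u, s in summary.items() if s["n"] >= k)

summary = summarize(difficulty)
candidates = on_at_least_k(summary, k=3)
```

The same `on_at_least_k` filter corresponds to the threshold \(k\) used when restricting candidate items to those appearing on at least \(k\) forms.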
Developing a CDI for a new language is a substantial task requiring specific linguistic and cultural knowledge, even with the new Swadesh-CDI recommendations in hand. The CDI Advisory Board has enumerated many recommendations and caveats (https://mb-cdi.stanford.edu/adaptations.html). Rather than recapitulate the process and caveats already itemized there, we suggest how the Swadesh CDI can be used to jump-start the process.
Of the Swadesh CDI candidates, 28 are on the 100-item CDI:WS short form (form A), and 26 are on the WS short form (form B). In addition, 22 Swadesh CDI concepts are on the original Swadesh-100 list, and 7 overlap with the ASJP list (a subset of the Swadesh-100 that performs as well as the full Swadesh-100 for purposes of glottochronology).