Swadesh CDI Comparisons

Variability of Difficulty by CDI Category

Below we show the standard deviation of cross-linguistic item difficulties by CDI category (for 437 uni-lemmas that are defined in at least 5 languages).

The CDI categories with the least variability in difficulty differ somewhat between comprehension (left) and production (right), but in both cases include sounds, vehicles, games/routines, animals, and toys. Among the categories with the greatest variation in item difficulty are time words and question words, though the estimate for question words is itself highly variable.
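The summary described above (per-item SD of difficulty across languages, restricted to uni-lemmas defined in at least 5 languages, then averaged within category) could be computed roughly as follows. This is a minimal base-R sketch on made-up data, not the code used for the report: the `items` data frame, its values, and the `min_langs` name are hypothetical, and the original analysis appears to use dplyr rather than base R.

```r
# Hypothetical toy data: one row per (uni_lemma, language) with a difficulty d.
items <- data.frame(
  uni_lemma = rep(c("dog", "ball", "when", "rare"), times = c(5, 5, 5, 2)),
  category  = rep(c("animals", "toys", "time_words", "toys"), times = c(5, 5, 5, 2)),
  language  = c(rep(paste0("lang", 1:5), 3), "lang1", "lang2"),
  d = c(-0.9, -1.0, -1.1, -1.0, -1.0,   # dog: similar difficulty everywhere
        -0.5, -0.7, -0.6, -0.4, -0.8,   # ball
        -3.0, -0.5, -2.0, -4.0, -1.0,   # when: difficulty varies a lot
        -1.0, -2.0),                    # rare: defined in only 2 languages
  stringsAsFactors = FALSE
)

# Number of languages in which each uni-lemma is defined.
n_lang <- tapply(items$language, items$uni_lemma, function(x) length(unique(x)))

# Keep uni-lemmas defined in at least `min_langs` languages (assumed threshold name).
min_langs <- 5
kept <- items[items$uni_lemma %in% names(n_lang)[n_lang >= min_langs], ]

# Per uni-lemma: mean difficulty (d) and cross-linguistic SD of difficulty (sd_d).
agg <- aggregate(d ~ uni_lemma + category, data = kept,
                 FUN = function(x) c(mean = mean(x), sd = sd(x)))
agg <- data.frame(uni_lemma = agg$uni_lemma, category = agg$category,
                  d = agg$d[, "mean"], sd_d = agg$d[, "sd"])

# Average the per-item SDs within each category, least variable first.
by_category <- aggregate(sd_d ~ category, data = agg, FUN = mean)
by_category <- by_category[order(by_category$sd_d), ]
by_category
```

In this toy example the uni-lemma defined in only two languages is dropped before any SD is computed, so a category's average is not distorted by items with too few languages to estimate variability.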

Below we show the standard deviation of cross-linguistic item difficulties by lexical class.

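# Bootstrapped mean and confidence interval of the cross-linguistic SD of item
# difficulty (sd_d), per lexical class, for comprehension.
# `posish` is a position adjustment assumed to be defined earlier in the
# analysis (it is not shown in this section).
p1 <- comp_agg %>% group_by(lexical_class) %>%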
  tidyboot_mean(sd_d) %>%
  ggplot(aes(x=reorder(lexical_class, mean), y=mean)) + 
  geom_point(alpha=.7, position=posish) +
  geom_linerange(aes(ymin=ci_lower, ymax=ci_upper), position=posish, alpha=.7) +
  coord_flip() + 
  theme_classic() + ylab("Item difficulty SD") + xlab("Lexical Class") +
  ggtitle("Comprehension")

# Same summary for production.
p2 <- prod_agg %>% group_by(lexical_class) %>%
  tidyboot_mean(sd_d) %>%
  ggplot(aes(x=reorder(lexical_class, mean), y=mean)) + 
  geom_point(alpha=.7, position=posish) +
  geom_linerange(aes(ymin=ci_lower, ymax=ci_upper), position=posish, alpha=.7) +
  coord_flip() + 
  theme_classic() + ylab("Item difficulty SD") + xlab("Lexical Class") +
  ggtitle("Production")

ggarrange(p1, p2)

Adjectives have the most variable cross-linguistic difficulties, on average, and “other” words are the least variable in their difficulty.

Swadesh / ASJP Comparisons

The ASJP list is a 40-item subset of the Swadesh list that performs as well as the full list in glottochronology applications. Of these 40 words, 31 appear on the CDI:WG:

##  [1] "drink (action)"       "see"                  "dog"                 
##  [4] "fish (animal)"        "ear"                  "eye"                 
##  [7] "knee"                 "nose"                 "tongue"              
## [10] "tooth"                "star"                 "sun"                 
## [13] "tree"                 "water (not beverage)" "you"                 
## [16] "night"                "hand"                 "person"              
## [19] "breast"               "blood"                "stone"               
## [22] "fire"                 "I"                    "come"                
## [25] "full"                 "new"                  "mountain"            
## [28] "hear"                 "we"                   "die"                 
## [31] "leaf"

Are ASJP words less variable in difficulty?

comp_agg %>% 
  mutate(ASJP = ifelse(is.element(uni_lemma, asjp), 1, 0)) %>%
  group_by(ASJP) %>%
  summarise(d=mean(d), sd_d=mean(sd_d))

## # A tibble: 2 x 3
##    ASJP     d  sd_d
## * <dbl> <dbl> <dbl>
## 1     0 -1.38 1.14 
## 2     1 -1.09 0.976

prod_agg %>% 
  mutate(ASJP = ifelse(is.element(uni_lemma, asjp), 1, 0)) %>%
  group_by(ASJP) %>%
  summarise(d=mean(d), sd_d=mean(sd_d))

## # A tibble: 2 x 3
##    ASJP     d  sd_d
## * <dbl> <dbl> <dbl>
## 1     0 -1.99  2.27
## 2     1 -1.50  2.15

  #ggplot(aes(x=d, y=sd_d, color=ASJP)) + 
  #geom_point() + theme_bw()

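For both comprehension and production, words on the ASJP list are on average easier than items not on the list (higher mean d: -1.09 vs. -1.38 for comprehension, -1.50 vs. -1.99 for production) and somewhat less variable across languages (mean SD of 0.976 vs. 1.14 for comprehension, 2.15 vs. 2.27 for production).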

Generalized Partial Credit Model (GPCM) Fits

We have now fitted generalized partial credit models (GPCMs) to the combined comprehension and production WG data (17 languages so far). In these models each item has two difficulty parameters (d1 and d2) but still a single discrimination parameter (a1). Below we compare these parameters to those from the separate 2PL fits for comprehension and production.
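To make the roles of the two difficulty parameters and the single discrimination parameter concrete, here is a minimal base-R sketch of GPCM category probabilities for one item. It assumes a slope-intercept parameterization with the intercept of the lowest category fixed at 0, and it assumes the three response categories are scored 0 (neither), 1 (understands), and 2 (understands and says); neither assumption is confirmed by this report. The function name `gpcm_probs` and the parameter values are hypothetical.

```r
# Category probabilities for one item under a GPCM in slope-intercept form
# (assumed parameterization; the intercept for category 0 is fixed at 0).
#   theta: latent ability
#   a1:    discrimination (one per item)
#   d:     c(d1, d2), intercepts for categories 1 and 2
gpcm_probs <- function(theta, a1, d) {
  k <- 0:length(d)                 # category scores 0, 1, 2
  z <- a1 * k * theta + c(0, d)    # unnormalised log-probabilities
  z <- z - max(z)                  # subtract the max for numerical stability
  exp(z) / sum(exp(z))             # normalise so probabilities sum to 1
}

# Hypothetical parameter values, for illustration only.
p <- gpcm_probs(theta = 0, a1 = 1.5, d = c(-0.5, -2.0))
names(p) <- c("neither", "understands", "understands_and_says")
round(p, 3)

# A child with higher ability is more likely to be in the top category.
p_high <- gpcm_probs(theta = 2, a1 = 1.5, d = c(-0.5, -2.0))
```

Under this form, a1 scales how quickly the category probabilities shift with ability, while d1 and d2 set how easily the "understands" and "understands and says" categories are reached.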

## # A tibble: 1 x 6
##   a1_vs_comp a1_vs_prod d1_vs_comp d2_vs_comp d1_vs_prod d2_vs_prod
##        <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
## 1      0.657      0.250      0.904      0.748      0.563      0.633

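The GPCM's d1 correlates strongly with comprehension difficulty (0.904), and GPCM discrimination (a1) correlates fairly strongly with comprehension discrimination (0.657). The remaining difficulty correlations are moderate (0.563 to 0.748), and a1 correlates only weakly with production discrimination (0.250).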
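A comparison table of this kind can be produced by joining the per-item parameter tables on their shared keys and correlating the parameter pairs. The sketch below uses base R and simulated values; the data frames `comp` and `gpcm`, the column names `a1_comp` and `d_comp`, and the reduced key set (`item_id`, `language`) are all hypothetical stand-ins rather than the objects used in the report.

```r
# Hypothetical per-item parameter tables (all values are simulated).
set.seed(42)
n <- 200
key <- data.frame(item_id  = rep(seq_len(n / 2), 2),
                  language = rep(c("langA", "langB"), each = n / 2),
                  stringsAsFactors = FALSE)

# Stand-in for the separate 2PL comprehension fit.
comp <- cbind(key, a1_comp = rlnorm(n, 0, 0.3), d_comp = rnorm(n))

# Stand-in for the GPCM fit: parameters related to the 2PL ones plus noise.
gpcm <- cbind(key,
              a1 = comp$a1_comp + rnorm(n, sd = 0.2),
              d1 = comp$d_comp + rnorm(n, sd = 0.4),
              d2 = comp$d_comp - 1 + rnorm(n, sd = 0.8))

# Join on the item keys, then correlate each GPCM parameter with its 2PL counterpart.
joined <- merge(gpcm, comp, by = c("item_id", "language"))
cors <- c(a1_vs_comp = cor(joined$a1, joined$a1_comp),
          d1_vs_comp = cor(joined$d1, joined$d_comp),
          d2_vs_comp = cor(joined$d2, joined$d_comp))
round(cors, 3)
```

Joining on explicit keys before correlating guards against silently pairing parameters from different items when the two fits do not list items in the same order.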

Difficulties

Swadesh CDI Comparisons

George

2021-04-07