Just pick N items, either from the target language’s CDI:WS, or from English.

load(here("data/apophenia.Rdata"))

# mean correlation with the full CDI as a function of list length N,
# per language, faceted by selection method (Random)
dat %>% ggplot(aes(x=N, y=mean_cor, color=language)) + 
  facet_wrap(. ~ Random) + 
  geom_line() + theme_bw() 

# average across languages
dat %>% group_by(N, Random) %>%
  summarise(mean_cor = mean(mean_cor), .groups = "drop") %>%
  ggplot(aes(x=N, y=mean_cor, color=Random)) + 
  geom_line() + theme_bw() + ylim(.97, 1)

Monte Carlo Construction of Swadesh Lists

Want to build a Swadesh list (SL) of a given length (N)?

  1. construct SL from N randomly-selected uni-lemmas (ToDo: try biasing this selection: weight by uni-lemma frequency, 1/variance of difficulty, discrimination…)
  2. evaluate the correlation between SL and full CDI scores (per language, then average those correlations, so as not to bias toward languages with more CDI administrations)
  3. select a random item i to consider removing from SL, and a random replacement uni-lemma
  4. evaluate the correlation for SL vs. SL with the swapped item: if the swap is better, accept it with 95% probability; repeat steps 3 and 4 for many iterations (implemented in make_swadesh_list() below)

Fun fact: there are \(\sim10^{100}\) potential lists of length 50 (2000 choose 50).
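A quick sanity check of that count in R, on the log scale to avoid overflow:

lchoose(2000, 50) / log(10) # log10 of choose(2000, 50): ~100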

make_swadesh_list <- function(unis, n_items=100, iters=1000, 
                              save_every = 50, prev_run=NULL) {
  cors <- rep(NA, iters/save_every)
  accepted_rate <- rep(NA, iters/save_every)
  swad_inds <- matrix(0, nrow=iters/save_every, ncol=n_items)
  if(is.null(prev_run)) {
    cur_swad_inds = sample(1:nrow(unis), n_items, replace = FALSE) # could weight by n, 1/d_sd...
  } else {
    if(ncol(prev_run$swad_inds) != n_items) stop("mismatch in n_items of passed previous result")
    cur_swad_inds = prev_run$swad_inds[nrow(prev_run$swad_inds),] # start with last saved inds
  }
  xx_g <- run_swadesh_comparisons(xldf, languages, unis[cur_swad_inds,]$uni_lemma, form='WS')
  accepted_count = 0 # track number of accepted swaps
  for (i in 1:iters) {
    # swap in 1 new item
    drop_i = sample(1:n_items, 1)
    new_i = sample(setdiff(1:nrow(unis), cur_swad_inds), 1) 
    new_swad_inds = c(cur_swad_inds[-drop_i], new_i)
    xx_g2 <- run_swadesh_comparisons(xldf, languages, unis[new_swad_inds,]$uni_lemma, form='WS') 
    l1r = mean(xx_g$`Swadesh r`) 
    l2r = mean(xx_g2$`Swadesh r`)
    # if the new list has a better mean r, accept it with 95% probability
    # (5% chance of keeping the current list even though it is worse)
    if(l1r < l2r && runif(1)<=.95) {
      accepted_count = accepted_count + 1
      cur_swad_inds = new_swad_inds
      xx_g = xx_g2
    }
    if(i%%save_every==0) {
      cors[i/save_every] = mean(xx_g$`Swadesh r`) # r of the currently-held list
      swad_inds[(i/save_every),] = cur_swad_inds
      accepted_rate[i/save_every] = accepted_count / save_every
      accepted_count = 0
    }
  }
  
  if(is.null(prev_run)) {
    return(list(swad_inds = swad_inds, 
                cors = cors,
                acceptance_rate = accepted_rate))
  } else { # append this run's saved states to the previous run's
    return(list(swad_inds = rbind(prev_run$swad_inds, swad_inds),
                cors = c(prev_run$cors, cors),
                acceptance_rate = c(prev_run$acceptance_rate, accepted_rate)))
  }
}
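For reference, a chain can be run and then extended via prev_run, e.g. (assuming unis, xldf, languages, and run_swadesh_comparisons() are already loaded):

chain1 <- make_swadesh_list(unis, n_items=100, iters=10000)
chain1 <- make_swadesh_list(unis, n_items=100, iters=20000, prev_run=chain1) # resume for 20k more
tail(chain1$cors, 1) # mean correlation of the last saved list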

Correlations for the four chains went from an initial (random) .986, .985, .984, and .981 to .996 for all four chains after 10k iterations (and .9962 after 30k swaps). On average, only 8 of the initial 100 (random) items remain after 10k swaps (7.25 after 30k). Surprisingly, the chains started to converge: after 10k swaps, 11 uni-lemmas were on the final lists of all 4 chains; 49 uni-lemmas were on the final lists of 3 or 4 chains, and 107 uni-lemmas were on the lists of 2+ chains. After 30k iterations, 21 items were on all four chains, 52 were on 3+ chains, and 110 were on 2+ chains.

Now we have 8 chains of 50k iterations each.

Acceptance rate

The swap acceptance rate starts at ~0.4, but drops quickly and remains low. This suggests we are likely not exploring enough of the space, and should modify the acceptance rule to keep the rate higher.
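One standard fix (a sketch only, not the rule used above) is a simulated-annealing-style acceptance: always accept improvements, and accept worse swaps with a probability that shrinks with how much worse they are and with a temperature that cools over iterations. Here delta is the change in mean correlation and temp0 is a hypothetical starting temperature:

accept_swap <- function(l1r, l2r, i, iters, temp0 = 0.01) {
  delta <- l2r - l1r                # change in mean correlation
  temp <- temp0 * (1 - i/iters)     # hypothetical linear cooling schedule
  delta > 0 || runif(1) < exp(delta / max(temp, 1e-6))
}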

Correlation of S-CDI with Full CDI scores

(for the 26 training languages)

Initial items still remaining

On average, 17 of the 150 original items remain after 50,000 iterations.
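This can be read off the saved snapshots, e.g. (approximating each chain's initial list by its first saved row, and assuming chains is a list of make_swadesh_list() results):

# per-chain count of first-snapshot items that survive to the last snapshot
sapply(chains, function(res)
  length(intersect(res$swad_inds[1,], res$swad_inds[nrow(res$swad_inds),])))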

Comparison of final uni-lemmas across chains

Are the chains converging? Here are the numbers of items appearing on each number of chains. That is, 19 items appear on all 8 chains; 20 appear on exactly 7 chains, …, and 99 items appear on only one chain.

## xc_freq
##  1  2  3  4  5  6  7  8 
## 99 75 52 47 27 30 20 19
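A cross-chain frequency table like this can be computed by pooling each chain's final list (final_lists here is an assumed name for a list of character vectors of uni-lemmas, one per chain):

per_item_chains <- table(unlist(lapply(final_lists, unique))) # chains per uni-lemma
xc_freq <- table(as.vector(per_item_chains)) # items appearing on 1..8 chains
xc_freq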

Items on all 8 chains

##  [1] "1PL"            "all"            "brush (object)" "clown"         
##  [5] "cup"            "cute"           "garage"         "hammer"        
##  [9] "jam"            "nut"            "outside"        "put"           
## [13] "radio"          "sink"           "soft"           "teacher"       
## [17] "today"          "trash"          "zoo"

Items on 7 chains

##  [1] "bad"             "beans"           "break"           "but"            
##  [5] "catch"           "chocolate"       "cover (action)"  "cut"            
##  [9] "fall"            "friend"          "garden"          "grass"          
## [13] "how"             "like"            "melon"           "shoe"           
## [17] "touch"           "tricycle"        "wait"            "washing machine"

Items on 6 chains

##  [1] "1SG"            "before"         "blocks"         "boy"           
##  [5] "bread"          "cockadoodledoo" "country"        "deer"          
##  [9] "door"           "first"          "good night"     "goose"         
## [13] "helicopter"     "knee"           "later"          "living room"   
## [17] "monkey"         "pasta"          "potty"          "pull"          
## [21] "snow"           "sweater"        "swim"           "there"         
## [25] "think"          "tickle"         "want"           "wash"          
## [29] "wolf"           "yucky"

Properties of items on 7+ chains

Comparing uni-lemmas that appear on 7+ chains (Swadesh = TRUE) to the rest on mean item difficulty (d_m), SD of difficulty (d_sd), n, and mean discrimination (a1_m):

## # A tibble: 2 × 5
##   Swadesh   d_m  d_sd     n  a1_m
##   <lgl>   <dbl> <dbl> <dbl> <dbl>
## 1 FALSE   0.785  1.23  14.0  3.78
## 2 TRUE    0.767  1.34  32.7  3.94
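A sketch of how such a comparison could be made with dplyr, assuming a per-uni-lemma table uni_props with columns d (difficulty), a1 (discrimination), n, and a logical Swadesh flag marking items on 7+ chains (all names here are hypothetical):

library(dplyr)
uni_props %>% group_by(Swadesh) %>%
  summarise(d_m = mean(d), d_sd = sd(d), # across-item SD of difficulty
            n = mean(n), a1_m = mean(a1), .groups = "drop")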