Just pick N items, either from the target language’s CDI:WS, or from English.
load(here("data/apophenia.Rdata"))
dat %>% ggplot(aes(x=N, y=mean_cor, color=language)) +
facet_wrap(. ~ Random) +
geom_line() + theme_bw()
dat %>% group_by(N, Random) %>%
summarise(mean_cor = mean(mean_cor)) %>%
ggplot(aes(x=N, y=mean_cor, color=Random)) +
geom_line() + theme_bw() + ylim(.97, 1)
## `summarise()` has grouped output by 'N'. You can override using the `.groups`
## argument.
Want to build a Swadesh list (SL) of a given length (N)?
Fun fact: there are \(\sim10^{100}\) potential lists of length 50 (2000 choose 50).
make_swadesh_list <- function(unis, n_items=100, iters=1000,
save_every = 50, prev_run=NULL) {
cors <- rep(NA, iters/save_every)
accepted_rate <- rep(NA, iters/save_every)
swad_inds <- matrix(0, nrow=iters/save_every, ncol=n_items)
if(is.null(prev_run)) {
cur_swad_inds = sample(1:nrow(unis), n_items, replace = F) # could weight by n, 1/d_sd...
} else {
cur_swad_inds = prev_run$swad_inds[nrow(prev_run$swad_inds),] # start with last saved inds
if(ncol(prev_run$swad_inds)!=n_items) return("Error: mismatch in n_items of passed previous result")
}
xx_g <- run_swadesh_comparisons(xldf, languages, unis[cur_swad_inds,]$uni_lemma, form='WS')
accepted_count = 0 # track number of accepted swaps
for (i in 1:iters) {
# swap in 1 new item
drop_i = sample(1:n_items, 1)
new_i = sample(setdiff(1:nrow(unis), cur_swad_inds), 1)
new_swad_inds = c(cur_swad_inds[-drop_i], new_i)
xx_g2 <- run_swadesh_comparisons(xldf, languages, unis[new_swad_inds,]$uni_lemma, form='WS')
l1r = mean(xx_g$`Swadesh r`)
l2r = mean(xx_g2$`Swadesh r`)
# if l2 has better r than l1, use it (5% chance of keeping l1 even if worse)
if(l1r < l2r && runif(1)<=.95) {
accepted_count = accepted_count + 1
cur_swad_inds = new_swad_inds
xx_g = xx_g2
}
if(i%%save_every==0) {
cors[i/save_every] = max(l1r, l2r)
swad_inds[(i/save_every),] = cur_swad_inds
accepted_rate[i/save_every] = accepted_count / save_every
accepted_count = 0
}
}
if(is.null(prev_run)) {
return(list(swad_inds=swad_inds,
cors=cors,
acceptance_rate = accepted_rate))
} else {
return(list(swad_inds = rbind(prev_run$swad_inds, swad_inds),
cors = c(prev_run$cors, cors),
acceptance_rate = c(prev_run$acceptance_rate, accepted_rate)))
}
}
Correlations on the four chains went from an initial (random) .986, .985, .984, and .981 to .996 for all four chains after 10k iterations (and still .9962 after 30k swaps). On average, only 8 of the initial 100 (random) items still remain after 10k swaps (7.25 after 30k). Surprisingly, the chains started to converge: after 10k swaps, 11 uni-lemmas were on the final list of all 4 chains; 49 uni-lemmas were on the final lists of 3 or 4 chains, and 107 uni-lemmas were on the list of 2+ chains. After 30k iterations, 21 items were on all four chains, 52 were on 3+ chains, and 110 were on 2+ chains.
Now we have 8 chains with 50k iterations.
The swap acceptance rate start ~0.4, but drops quickly and remains low. This means we likely are not exploring enough of the space, and should make it stay higher.
(for the 26 training languages)
On average, 17 of the 150 original items remain after 50,000 iterations.
Are the chains converging? Here are the number of items appearing at each frequency. That is, 19 items appear on all 8 chains; 20 appear on 7 chains, …, 99 items appear on only one chain.
## xc_freq
## 1 2 3 4 5 6 7 8
## 99 75 52 47 27 30 20 19
## [1] "1PL" "all" "brush (object)" "clown"
## [5] "cup" "cute" "garage" "hammer"
## [9] "jam" "nut" "outside" "put"
## [13] "radio" "sink" "soft" "teacher"
## [17] "today" "trash" "zoo"
## [1] "bad" "beans" "break" "but"
## [5] "catch" "chocolate" "cover (action)" "cut"
## [9] "fall" "friend" "garden" "grass"
## [13] "how" "like" "melon" "shoe"
## [17] "touch" "tricycle" "wait" "washing machine"
## [1] "1SG" "before" "blocks" "boy"
## [5] "bread" "cockadoodledoo" "country" "deer"
## [9] "door" "first" "good night" "goose"
## [13] "helicopter" "knee" "later" "living room"
## [17] "monkey" "pasta" "potty" "pull"
## [21] "snow" "sweater" "swim" "there"
## [25] "think" "tickle" "want" "wash"
## [29] "wolf" "yucky"
## # A tibble: 2 × 5
## Swadesh d_m d_sd n a1_m
## <lgl> <dbl> <dbl> <dbl> <dbl>
## 1 FALSE 0.785 1.23 14.0 3.78
## 2 TRUE 0.767 1.34 32.7 3.94