Given the set of words common to all models in the ETS data, cluster them separately for each model. Then compare each clustering solution with the clustering solution for the translations of those words from the Wikipedia models in all other languages. V-measure quantifies the alignment of the two clustering solutions: a higher V-measure means more alignment. The effects also seem to depend on the number of clusters (e.g. there are effects when nclust = 20, but not when nclust = 100).
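
For reference, here is a minimal sketch of how a V-measure can be computed from two cluster labelings of the same words, using the standard definition: the harmonic mean of homogeneity and completeness, each based on conditional entropy. The scores analysed here were computed elsewhere and that code is not shown, so the function name `v_measure_sketch` and the toy labels are illustrative assumptions, not the pipeline actually used.

```r
# Shannon entropy (natural log) of a probability vector; zero cells are skipped
entropy <- function(p) {
  p <- as.vector(p)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Hypothetical helper: homogeneity, completeness and V-measure for two labelings
# of the same items (`reference` and `candidate` are equal-length vectors)
v_measure_sketch <- function(reference, candidate) {
  stopifnot(length(reference) == length(candidate))
  joint   <- table(reference, candidate) / length(reference)  # joint distribution
  h_ref   <- entropy(rowSums(joint))
  h_cand  <- entropy(colSums(joint))
  h_joint <- entropy(joint)
  # Conditional entropies: H(ref | cand) = H(ref, cand) - H(cand), and vice versa
  homogeneity  <- if (h_ref == 0) 1 else 1 - (h_joint - h_cand) / h_ref
  completeness <- if (h_cand == 0) 1 else 1 - (h_joint - h_ref) / h_cand
  total <- homogeneity + completeness
  v <- if (total == 0) 0 else 2 * homogeneity * completeness / total
  c(homogeneity = homogeneity, completeness = completeness, v_measure = v)
}

# Same partition under different label names vs. two unrelated partitions
same      <- v_measure_sketch(c(1, 1, 2, 2), c("x", "x", "y", "y"))
unrelated <- v_measure_sketch(c(1, 1, 2, 2), c(1, 2, 1, 2))
```

Two labelings that induce the same partition score 1 regardless of how the clusters are named, while statistically unrelated labelings score 0 (up to floating-point error).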

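library(tidyverse)  # read_csv, dplyr verbs, spread, ggplot2
library(lme4)       # lmer
vs <- read_csv("data/v_measures_20.csv") %>%
  mutate(group = ifelse(lang1 == lang2, "within", "across"))  # same-language vs. cross-language pairs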

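There's a lot of intercept variability: overall V-measure levels differ considerably from language to language, as the heatmaps below show.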

vs %>%
  ggplot(aes(x = lang1, y = lang2, 
             fill = v_measure)) +
  scale_fill_continuous(low = "white", high = "red") +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Normalize each V-measure by the mean for its lang2 to factor out per-language baselines
vs %>%
  group_by(lang2) %>%
  mutate(norm_vmeasure = v_measure / mean(v_measure)) %>%
  ungroup() %>%
  ggplot(aes(x = lang1, y = lang2, 
             fill = norm_vmeasure)) +
  scale_fill_continuous(low = "white", high = "red") +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
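
To make the normalization step concrete, here is a small sketch on made-up numbers. The data frame `toy` and its values are purely illustrative, and base R's `ave()` stands in for the `group_by()` + `mutate()` pattern so that the example runs without extra packages.

```r
# Made-up scores: lang2 "B" has a much higher baseline than lang2 "A"
toy <- data.frame(
  lang1     = c("A", "B", "A", "B"),
  lang2     = c("A", "A", "B", "B"),
  v_measure = c(0.12, 0.08, 0.33, 0.27)
)

# Divide each score by the mean score of its lang2
toy$norm_vmeasure <- toy$v_measure / ave(toy$v_measure, toy$lang2)

# After normalization every lang2 averages 1, so only the relative pattern remains
lang2_means <- tapply(toy$norm_vmeasure, toy$lang2, mean)
```

Dividing by the group mean removes the baseline difference between the two hypothetical languages while keeping the within-language contrast between `lang1` values.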

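Mixed-effects models

But once you control for that with random intercepts for lang1 and lang2, the predicted effect appears for all three measures: within-language comparisons score higher than across-language ones (t ≈ 2.5 in each model).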

V-measure
lmer(log(v_measure) ~ group + (1|lang1) + (1|lang2), vs) %>%
  summary()
## Linear mixed model fit by REML ['lmerMod']
## Formula: log(v_measure) ~ group + (1 | lang1) + (1 | lang2)
##    Data: vs
## 
## REML criterion at convergence: -3905.3
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.0940 -0.6486  0.0187  0.6620  3.1768 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  lang1    (Intercept) 0.002026 0.04501 
##  lang2    (Intercept) 0.020117 0.14183 
##  Residual             0.001822 0.04268 
## Number of obs: 1225, groups:  lang1, 35; lang2, 35
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) -1.80055    0.02518 -71.500
## groupwithin  0.01854    0.00732   2.533
## 
## Correlation of Fixed Effects:
##             (Intr)
## groupwithin -0.008
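
Because the outcome is log-transformed, the `groupwithin` coefficient is a difference on the log scale. A quick way to read it is to exponentiate it, as sketched below with the numbers copied from the summary above. The 95% interval uses a normal (Wald) approximation, which is an assumption added here and not something `lmer` reported.

```r
# Values copied from the V-measure model summary
estimate  <- 0.01854   # groupwithin, on the log scale
std_error <- 0.00732

# Back-transform: multiplicative change in V-measure for within vs. across
ratio <- exp(estimate)

# Approximate 95% Wald interval (normal approximation)
ci_log   <- estimate + c(-1, 1) * qnorm(0.975) * std_error
ci_ratio <- exp(ci_log)

round(c(ratio = ratio, lower = ci_ratio[1], upper = ci_ratio[2]), 3)
```

Under these assumptions, within-language V-measures are roughly 1.9% higher than across-language ones, with an approximate interval of about 0.4% to 3.3%.
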
Homogeneity
lmer(log(homogenity) ~ group + (1|lang1) + (1|lang2), vs) %>%
  summary()
## Linear mixed model fit by REML ['lmerMod']
## Formula: log(homogenity) ~ group + (1 | lang1) + (1 | lang2)
##    Data: vs
## 
## REML criterion at convergence: -3925.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.0102 -0.6255  0.0139  0.6596  3.1980 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  lang1    (Intercept) 0.001547 0.03933 
##  lang2    (Intercept) 0.021934 0.14810 
##  Residual             0.001799 0.04242 
## Number of obs: 1225, groups:  lang1, 35; lang2, 35
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept) -1.752592   0.025931 -67.588
## groupwithin  0.018238   0.007274   2.507
## 
## Correlation of Fixed Effects:
##             (Intr)
## groupwithin -0.008
Completeness
lmer(log(completeness) ~ group + (1|lang1) + (1|lang2), vs) %>%
  summary()
## Linear mixed model fit by REML ['lmerMod']
## Formula: log(completeness) ~ group + (1 | lang1) + (1 | lang2)
##    Data: vs
## 
## REML criterion at convergence: -3871.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.1257 -0.6554  0.0297  0.6541  3.1761 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  lang1    (Intercept) 0.002894 0.05380 
##  lang2    (Intercept) 0.019506 0.13966 
##  Residual             0.001858 0.04311 
## Number of obs: 1225, groups:  lang1, 35; lang2, 35
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept) -1.845580   0.025329 -72.864
## groupwithin  0.018773   0.007393   2.539
## 
## Correlation of Fixed Effects:
##             (Intr)
## groupwithin -0.008
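
The intercept variability noted earlier can be read directly off the random-effects tables. The sketch below copies the reported variance components and computes the share of total variance attributed to each one. The object names are illustrative, and the row label `homogenity` follows the column spelling used in the data.

```r
# Variance components copied from the three model summaries above
variances <- rbind(
  v_measure    = c(lang1 = 0.002026, lang2 = 0.020117, residual = 0.001822),
  homogenity   = c(lang1 = 0.001547, lang2 = 0.021934, residual = 0.001799),
  completeness = c(lang1 = 0.002894, lang2 = 0.019506, residual = 0.001858)
)

# Share of the total (random + residual) variance attributed to each component
shares <- variances / rowSums(variances)
round(shares, 2)
```

In all three models the `lang2` intercept accounts for more than 80% of the total variance, which is why the `group` effect only becomes visible once the language intercepts are modeled.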

Dendrograms

The three measures produce dendrograms that make moderate sense.

V-measure
vs_wide <- vs %>%
  select(lang1, lang2, v_measure) %>%
  spread(lang1, v_measure) 

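# Rows are lang2, columns are lang1; move the lang2 column into the row names
vs_mat <- as.matrix(vs_wide[,-1])
rownames(vs_mat) <- unlist(vs_wide[,1])

# Euclidean distance between rows (each lang2's profile of scores across lang1)
dist_matrix <- dist(vs_mat)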

ggdendro::ggdendrogram(hclust(dist_matrix)) +
  ggtitle("v_measure")
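
The dendrogram code above passes the wide matrix to `dist()`, which by default computes Euclidean distances between rows. Languages are therefore grouped by how similar their whole profiles of scores are, not by the pairwise score itself. The sketch below, on a made-up symmetric similarity matrix for four hypothetical languages, contrasts this with the alternative of treating `1 - similarity` as the distance. Note that `as.dist()` reads only the lower triangle, so the alternative presumes a symmetric matrix, which may not hold for the real data.

```r
# Hypothetical similarity matrix: "a"/"b" are similar, as are "c"/"d"
langs <- c("a", "b", "c", "d")
sim <- matrix(c(1.0, 0.8, 0.2, 0.1,
                0.8, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.7,
                0.1, 0.2, 0.7, 1.0),
              nrow = 4, byrow = TRUE, dimnames = list(langs, langs))

# Approach used in this document: Euclidean distance between row profiles
profile_tree <- hclust(dist(sim))

# Alternative: use 1 - similarity directly as the distance
direct_tree <- hclust(as.dist(1 - sim))

# Cut both trees into two groups to compare the resulting partitions
profile_groups <- cutree(profile_tree, k = 2)
direct_groups  <- cutree(direct_tree, k = 2)
```

On this toy matrix both approaches recover the same two groups, but on real data they can disagree, so it is worth keeping in mind which notion of distance the dendrograms reflect.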

Homogeneity
vs_wide <- vs %>%
  select(lang1, lang2, homogenity) %>%
  spread(lang1, homogenity) 

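# Rows are lang2, columns are lang1; move the lang2 column into the row names
vs_mat <- as.matrix(vs_wide[,-1])
rownames(vs_mat) <- unlist(vs_wide[,1])

# Euclidean distance between rows (each lang2's profile of scores across lang1)
dist_matrix <- dist(vs_mat)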

ggdendro::ggdendrogram(hclust(dist_matrix)) +
  ggtitle("homogenity")

Completeness
vs_wide <- vs %>%
  select(lang1, lang2, completeness) %>%
  spread(lang1, completeness) 

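# Rows are lang2, columns are lang1; move the lang2 column into the row names
vs_mat <- as.matrix(vs_wide[,-1])
rownames(vs_mat) <- unlist(vs_wide[,1])

# Euclidean distance between rows (each lang2's profile of scores across lang1)
dist_matrix <- dist(vs_mat)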

ggdendro::ggdendrogram(hclust(dist_matrix)) +
  ggtitle("completeness")