Given the set of words common to all models in the ETS data, cluster them separately for each model. Then compare each clustering solution to the clustering solution of the translations of those words from the Wikipedia models in all other languages. V-measure quantifies the alignment of these two clustering solutions: higher V-measure = more alignment. The effects also seem to depend on the number of clusters (e.g. there are effects when nclust = 20, but not when nclust = 100).
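For reference, here is a minimal base-R sketch of how homogeneity, completeness, and V-measure can be computed from two labelings of the same words. It follows the standard entropy-based definition (V-measure as the harmonic mean of homogeneity and completeness when beta = 1). The function and argument names (`v_measure_sketch`, `reference`, `candidate`) are hypothetical, and this is not necessarily the code that produced `v_measures_20.csv`.

```r
# Shannon entropy (natural log) of a vector of counts; zero counts are dropped
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

# Homogeneity, completeness and V-measure for two labelings of the same items.
# `reference` and `candidate` are hypothetical names for the two clustering
# solutions being compared (e.g. ETS model vs. translated Wikipedia model).
v_measure_sketch <- function(reference, candidate, beta = 1) {
  tab <- table(reference, candidate)
  h_ref   <- entropy(rowSums(tab))
  h_cand  <- entropy(colSums(tab))
  h_joint <- entropy(as.vector(tab))
  # Conditional entropies via H(A | B) = H(A, B) - H(B)
  h_ref_given_cand <- h_joint - h_cand
  h_cand_given_ref <- h_joint - h_ref
  homogeneity  <- if (h_ref == 0) 1 else 1 - h_ref_given_cand / h_ref
  completeness <- if (h_cand == 0) 1 else 1 - h_cand_given_ref / h_cand
  if (homogeneity + completeness == 0) {
    v <- 0
  } else {
    v <- (1 + beta) * homogeneity * completeness /
      (beta * homogeneity + completeness)
  }
  c(homogeneity = homogeneity, completeness = completeness, v_measure = v)
}

# Same partition under different label names: all three scores are 1
print(v_measure_sketch(c(1, 1, 2, 2, 3, 3), c("x", "x", "y", "y", "z", "z")))
# Statistically independent partitions: all three scores are 0
print(v_measure_sketch(c(1, 1, 2, 2), c(1, 2, 1, 2)))
```

Because the scores depend only on the contingency table, cluster labels do not need to match between the two solutions.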
library(tidyverse) # read_csv, dplyr verbs, spread, ggplot2
library(lme4) # lmer
vs <- read_csv("data/v_measures_20.csv") %>%
mutate(group = ifelse(lang1 == lang2, "within", "across"))
There’s a lot of intercept variability: baseline V-measure differs considerably from language to language, especially across lang2.
vs %>%
ggplot(aes(x = lang1, y = lang2,
fill = v_measure)) +
scale_fill_continuous(low = "white", high = "red") +
geom_tile() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# Divide each V-measure by the mean of its lang2 row to remove intercept differences
vs %>%
group_by(lang2) %>%
mutate(norm_vmeasure = v_measure / mean(v_measure)) %>%
ungroup() %>%
ggplot(aes(x = lang1, y = lang2,
fill = norm_vmeasure)) +
scale_fill_continuous(low = "white", high = "red") +
geom_tile() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
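But if you control for that with random intercepts for lang1 and lang2, there’s the predicted effect for all three measures (V-measure, homogeneity, and completeness): within-language pairs score higher than across-language pairs.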
lmer(log(v_measure) ~ group + (1|lang1) + (1|lang2), vs) %>%
summary()
## Linear mixed model fit by REML ['lmerMod']
## Formula: log(v_measure) ~ group + (1 | lang1) + (1 | lang2)
## Data: vs
##
## REML criterion at convergence: -3905.3
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.0940 -0.6486 0.0187 0.6620 3.1768
##
## Random effects:
## Groups Name Variance Std.Dev.
## lang1 (Intercept) 0.002026 0.04501
## lang2 (Intercept) 0.020117 0.14183
## Residual 0.001822 0.04268
## Number of obs: 1225, groups: lang1, 35; lang2, 35
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -1.80055 0.02518 -71.500
## groupwithin 0.01854 0.00732 2.533
##
## Correlation of Fixed Effects:
## (Intr)
## groupwithin -0.008
lmer(log(homogenity) ~ group + (1|lang1) + (1|lang2), vs) %>%
summary()
## Linear mixed model fit by REML ['lmerMod']
## Formula: log(homogenity) ~ group + (1 | lang1) + (1 | lang2)
## Data: vs
##
## REML criterion at convergence: -3925.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.0102 -0.6255 0.0139 0.6596 3.1980
##
## Random effects:
## Groups Name Variance Std.Dev.
## lang1 (Intercept) 0.001547 0.03933
## lang2 (Intercept) 0.021934 0.14810
## Residual 0.001799 0.04242
## Number of obs: 1225, groups: lang1, 35; lang2, 35
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -1.752592 0.025931 -67.588
## groupwithin 0.018238 0.007274 2.507
##
## Correlation of Fixed Effects:
## (Intr)
## groupwithin -0.008
lmer(log(completeness) ~ group + (1|lang1) + (1|lang2), vs) %>%
summary()
## Linear mixed model fit by REML ['lmerMod']
## Formula: log(completeness) ~ group + (1 | lang1) + (1 | lang2)
## Data: vs
##
## REML criterion at convergence: -3871.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.1257 -0.6554 0.0297 0.6541 3.1761
##
## Random effects:
## Groups Name Variance Std.Dev.
## lang1 (Intercept) 0.002894 0.05380
## lang2 (Intercept) 0.019506 0.13966
## Residual 0.001858 0.04311
## Number of obs: 1225, groups: lang1, 35; lang2, 35
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -1.845580 0.025329 -72.864
## groupwithin 0.018773 0.007393 2.539
##
## Correlation of Fixed Effects:
## (Intr)
## groupwithin -0.008
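Since the response in all three models is log-transformed, the `groupwithin` coefficient is a difference in log units, and exponentiating it gives a multiplicative ratio. The sketch below back-transforms the estimates copied from the summaries above. The variable names are illustrative only.

```r
# Fixed-effect estimates for `groupwithin`, copied from the three summaries above
within_effect <- c(v_measure = 0.01854,
                   homogenity = 0.018238,
                   completeness = 0.018773)

# exp() turns a difference on the log scale into a ratio on the original scale
ratio <- exp(within_effect)

# Percent by which within-language pairs exceed across-language pairs
percent_increase <- 100 * (ratio - 1)
print(round(percent_increase, 2))
```

Under these models, within-language pairs score roughly 1.8% to 1.9% higher than across-language pairs on each measure, a small effect relative to the between-language intercept variance.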
The three measures also produce dendrograms that make moderate sense.
vs_wide <- vs %>%
select(lang1, lang2, v_measure) %>%
spread(lang1, v_measure)
vs_mat <- as.matrix(vs_wide[,-1])
rownames(vs_mat) = unlist(vs_wide[,1])
dist_matrix <- dist(vs_mat)
ggdendro::ggdendrogram(hclust(dist_matrix)) +
ggtitle("v_measure")
vs_wide <- vs %>%
select(lang1, lang2, homogenity) %>%
spread(lang1, homogenity)
vs_mat <- as.matrix(vs_wide[,-1])
rownames(vs_mat) = unlist(vs_wide[,1])
dist_matrix <- dist(vs_mat)
ggdendro::ggdendrogram(hclust(dist_matrix)) +
ggtitle("homogenity")
vs_wide <- vs %>%
select(lang1, lang2, completeness) %>%
spread(lang1, completeness)
vs_mat <- as.matrix(vs_wide[,-1])
rownames(vs_mat) = unlist(vs_wide[,1])
dist_matrix <- dist(vs_mat)
ggdendro::ggdendrogram(hclust(dist_matrix)) +
ggtitle("completeness")
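The three blocks above differ only in the measure column, so the shared steps can be illustrated on toy data. The sketch below uses base R only. The helper name `measure_hclust`, the language codes, and the scores are all made up for illustration. Note that `dist()` computes Euclidean distances between rows, so languages are grouped by how similar their whole profile of scores is, not by their direct pairwise score.

```r
# Hypothetical helper: build a lang2-by-lang1 matrix for one measure,
# then cluster its rows hierarchically (complete linkage, hclust's default)
measure_hclust <- function(df, measure) {
  # Each (lang2, lang1) cell holds exactly one value, so mean() just returns it
  mat <- tapply(df[[measure]], list(df$lang2, df$lang1), mean)
  hclust(dist(mat))
}

# Toy data: made-up scores for three made-up language codes
toy <- expand.grid(lang1 = c("aa", "bb", "cc"),
                   lang2 = c("aa", "bb", "cc"),
                   stringsAsFactors = FALSE)
toy$v_measure <- c(0.20, 0.18, 0.10,  # lang2 = "aa"
                   0.19, 0.21, 0.11,  # lang2 = "bb"
                   0.10, 0.11, 0.25)  # lang2 = "cc"

hc <- measure_hclust(toy, "v_measure")
print(hc$labels)
print(hc$merge)  # negative entries index hc$labels; row 1 is the first merge
```

In this toy example "aa" and "bb" have similar rows, so they are merged first and "cc" joins last.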