This document was created from an R markdown file. The repository for the project can be found here: https://github.com/mllewis/keb_2019_reanalysis.
We estimated the similarity between animals based on distributional statistics using a word embedding model trained on English Wikipedia (Bojanowski et al., 2016). As in the human task from Kim et al. (2019), we estimated the similarity between animals along different conceptual dimensions (color, shape, and texture). For each dimension, we identified all unique words participants generated as labels for their piles during the card-sorting task (we removed a small set of items that were dimension-irrelevant, e.g. “farm”). We then calculated the pairwise cosine distance between each animal and each dimension label. For example, for the color dimension, participants produced “brown”, “black”, and “pink” as descriptors for their piles (among many other labels). So, for each animal, we calculated the cosine distance between the animal’s word vector and the vectors for “brown”, “black”, and “pink”. These distances form a “color” vector for each animal (in this case of length 3). We then calculated the Euclidean distance between each pair of animals based on their color vectors. These distances provide an estimate of the overall similarity between two animals in terms of color based on language statistics alone. We repeated the same procedure for the shape and texture dimensions.
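As a minimal sketch of this procedure (assuming a matrix `vectors` of word embeddings with words as row names, and hypothetical character vectors `animals` and `color_labels`; these names are illustrative, not taken from the analysis code):

```r
# Cosine distance between two embedding vectors
cosine_dist <- function(a, b) 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# For each animal, compute its distance to every color label -> a "color" vector
color_profiles <- t(sapply(animals, function(animal) {
  sapply(color_labels, function(label) {
    cosine_dist(vectors[animal, ], vectors[label, ])
  })
}))

# Euclidean distance between each pair of animals based on their color profiles
color_dist <- dist(color_profiles, method = "euclidean")
```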
The table below presents the descriptor labels used for each of the three dimensions (color, shape, and texture).
In the Main Text, we report correlations between the pairwise animal distances for each of the three dimensions (language vs. human/ground truth). These are presented more fully below.
For consistency, taxonomic distances are reported here in terms of similarity (1 - evolutionary distance).
participant_type | similarity_type | n | estimate | p.value |
---|---|---|---|---|
Ground Truth | taxonomy | 435 | 0.27041 | 0.00000 |
Blind | shape | 435 | 0.31586 | 0.00000 |
Sighted | shape | 435 | 0.32658 | 0.00000 |
Blind | texture | 435 | 0.11988 | 0.01235 |
Sighted | texture | 435 | 0.09546 | 0.04661 |
Blind | color | 435 | 0.14880 | 0.00186 |
Sighted | color | 435 | 0.08555 | 0.07468 |
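As a minimal sketch, assuming hypothetical numeric vectors `language_sim` and `human_sim` of length 435 holding the pairwise similarities for one dimension and one participant group (`cor.test()` defaults to a Pearson correlation; the original analysis may have used a different method):

```r
# Correlate language-based and human (or ground-truth) pairwise similarities
cor_result <- cor.test(language_sim, human_sim)
cor_result$estimate  # correlation coefficient
cor_result$p.value
```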
Notably, in regressions predicting human similarity from both taxonomic and linguistic similarity, both measures predict independent variance in human judgements for all three dimensions. The three tables below report these models for the color, texture, and shape dimensions, respectively.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.18756 | 0.04659 | 4.02550 | 0.00006 |
language_similarity_simple_dist_color | 0.10725 | 0.03297 | 3.25332 | 0.00118 |
taxonomic_sim | 0.11095 | 0.03297 | 3.36553 | 0.00080 |
participant_typesighted | -0.37512 | 0.06589 | -5.69292 | 0.00000 |
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.18756 | 0.04669 | 4.01708 | 0.00006 |
language_similarity_simple_dist_texture | 0.08692 | 0.03304 | 2.63049 | 0.00868 |
taxonomic_sim | 0.10859 | 0.03304 | 3.28649 | 0.00106 |
participant_typesighted | -0.37512 | 0.06603 | -5.68101 | 0.00000 |
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.18756 | 0.04641 | 4.04108 | 0.00006 |
language_similarity_simple_dist_shape | 0.13896 | 0.03333 | 4.16911 | 0.00003 |
taxonomic_sim | 0.08660 | 0.03333 | 2.59832 | 0.00953 |
participant_typesighted | -0.37512 | 0.06564 | -5.71495 | 0.00000 |
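A minimal sketch of the regression specification, assuming a hypothetical long-format data frame `pair_df` with one row per animal pair and participant type, an outcome column `human_similarity` (hypothetical name), and predictor columns named as in the tables above:

```r
# Predict human pairwise similarity from language-based similarity,
# taxonomic similarity, and participant type (blind vs. sighted)
color_model <- lm(
  human_similarity ~ language_similarity_simple_dist_color + taxonomic_sim + participant_type,
  data = pair_df
)
summary(color_model)
```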
We clustered the resulting pairwise animal similarities for each of the three dimensions. The dendrograms below present hierarchical cluster analyses of human judgements of similarity and of language-based estimates of similarity for each dimension. Dendrograms were produced using the ggdendro (de Vries & Ripley, 2016) and dendextend (Galili, 2015) packages in R.
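A minimal sketch of how such a dendrogram can be produced, assuming a pairwise distance object `animal_dist` (e.g., `color_dist` from the sketch above); we do not know which agglomeration method the original analysis used, so the hclust default is shown:

```r
library(ggdendro)

# Hierarchical clustering of the pairwise animal distances
# ("complete" is the hclust default; the original method may differ)
animal_clust <- hclust(animal_dist, method = "complete")

# Plot the dendrogram with ggdendro
ggdendrogram(animal_clust, rotate = TRUE)
```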
Entanglement is a measure of how well the labels of two dendrograms are aligned. Entanglement values range from 0 (fully aligned labels) to 1 (fully mismatched labels). Entanglement is computed by numbering the labels (1 to the total number of labels) of each tree, and then computing the L-norm distance between these two vectors.
Below, we show pairwise comparisons of the human judgement-based (blind and sighted participants) and language-based dendrograms in so-called tanglegrams, after using the untangle() method from the R package dendextend to minimize the amount of entanglement, i.e. to optimize the alignment of the labels from the two dendrograms without altering the underlying cluster structure. We also plot the minimum entanglement value found for each pairwise comparison (lower values indicate better alignment of the labels).
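A minimal sketch using dendextend, assuming two hclust objects `human_clust` and `language_clust` (hypothetical names) fit to the human- and language-based pairwise distances over the same 30 animals:

```r
library(dendextend)

# Pair the two trees as dendrograms
dends <- dendlist(as.dendrogram(human_clust), as.dendrogram(language_clust))

# Rotate branches to minimize entanglement without changing cluster structure
dends_untangled <- untangle(dends, method = "step2side")

# Entanglement of the aligned trees (0 = fully aligned, 1 = fully mismatched)
entanglement(dends_untangled[[1]], dends_untangled[[2]])

# Draw the tanglegram
tanglegram(dends_untangled)
```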
We also computed two indices of the similarity between the clusterings derived from human (blind and sighted participant data) and language-based similarity ratings for color, shape and texture: the Fowlkes-Mallows Index and the adjusted Rand index.
The Fowlkes-Mallows Index (FM-Index) is computed by comparing the two hierarchical clustering trees cut at a specific level k (i.e. split into k different clusters based on the hierarchical clustering). It ranges from 0 to 1, with higher values indicating greater similarity. Intuitively, the FM-Index captures the degree to which pairs of labels tend to fall in the same cluster in both tree 1 and tree 2. It is computed as the geometric mean of two ratios: the number of label pairs assigned to the same cluster in both trees divided by the number of label pairs assigned to the same cluster in tree 1, and the same count divided by the number of label pairs assigned to the same cluster in tree 2. The FM-Index of two given hierarchical clusterings is then compared to its expected value under the hypothesis of no relation between the two clusterings.
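In standard notation (the symbols \(T_k\), \(P_k\), and \(Q_k\) are introduced here for exposition and do not appear in the original analysis): for trees cut into \(k\) clusters, \(B_k = \frac{T_k}{\sqrt{P_k \, Q_k}}\), where \(T_k\) is the number of label pairs placed in the same cluster in both trees, and \(P_k\) and \(Q_k\) are the numbers of label pairs placed in the same cluster in tree 1 and in tree 2, respectively.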
The plots depict the z-scored FM-Index for each pairwise comparison of the hierarchical cluster trees derived from the human judgement and language similarity data, after cutting the trees into k = 5, 10, 15, and 20 clusters. The dashed line shows the critical value at \(\alpha = .05\) assuming a one-sided hypothesis test (\(z = 1.645\), i.e. the z-score with a tail area of .05).
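A minimal sketch of the z-scored FM-Index, using dendextend's FM_index() on cluster assignments from cutree(); as we understand the dendextend API, FM_index() attaches the expectation and variance under the null as attributes E_FM and V_FM:

```r
library(dendextend)

# Cut both trees into k clusters (over the same 30 animals), compute the
# Fowlkes-Mallows index, and z-score it against its null expectation/variance
fm_z <- function(clust1, clust2, k) {
  fm <- FM_index(cutree(clust1, k), cutree(clust2, k))
  (as.numeric(fm) - attr(fm, "E_FM")) / sqrt(attr(fm, "V_FM"))
}

sapply(c(5, 10, 15, 20), function(k) fm_z(human_clust, language_clust, k))
```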
The Rand index is the ratio of the number of pairs of labels on which two clusterings agree (i.e. pairs of labels in the same cluster in both trees plus pairs of labels in different clusters in both trees) to the total number of label pairs. The adjusted Rand index (here using Hubert and Arabie’s method) corrects the Rand index for the level of agreement expected by chance alone. An adjusted Rand index of 0 indicates that the two clusterings agree no more than expected for random groupings, with values above or below 0 indicating higher- or lower-than-chance similarity between the two clusterings.
The plots depict the adjusted Rand index for each pairwise comparison of the hierarchical cluster trees derived from the human judgement and language similarity data, after cutting the trees into k = 5, 10, 15, and 20 clusters.
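A minimal sketch, here using adjustedRandIndex() from the mclust package (the original analysis may have used a different implementation):

```r
library(mclust)

# Adjusted Rand index between the two clusterings at each value of k
sapply(c(5, 10, 15, 20), function(k) {
  adjustedRandIndex(cutree(human_clust, k), cutree(language_clust, k))
})
```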
In this analysis (corresponding to Fig. 1B in the Main Text), we used language data to predict human responses from a task in which participants indicated whether each of the 30 animals had a texture of fur, feathers, scales, or skin.
Figure 6b from Kim, Elli, and Bedny (2019; KEB) is reproduced on the left below. To predict these data from distributional language statistics, we estimated the cosine distance between each animal and each texture using a model trained on English Wikipedia (Bojanowski et al., 2016). On the right below, the cosine distance between each animal and each texture is shown. The most frequently selected texture (human data) or most similar texture (language estimates) is indicated in red.
We calculated an accuracy score by identifying the most similar texture for each animal in the language data and the most frequently selected texture in the human data, and then calculating the proportion of animals for which the language estimate matched the human response (separately for blind and sighted participants) and the objectively “correct” (ground-truth) response. These accuracy scores are presented below.
group | estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative | se |
---|---|---|---|---|---|---|---|---|---|
Blind | 0.43333 | 13 | 0.03219 | 30 | 0.25461 | 0.62573 | Exact binomial test | two.sided | 0.09047 |
Sighted | 0.56667 | 17 | 0.00022 | 30 | 0.37427 | 0.74539 | Exact binomial test | two.sided | 0.09047 |
Ground Truth | 0.53333 | 16 | 0.00100 | 30 | 0.34326 | 0.71658 | Exact binomial test | two.sided | 0.09108 |
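A minimal sketch of the accuracy calculation and the exact binomial test against chance (assumed here to be 1/4, given the four texture options), using hypothetical character vectors `language_choice` and `human_choice` of length 30:

```r
# Proportion of animals for which the language-based prediction
# matches the (modal) human response
matches <- language_choice == human_choice
accuracy <- mean(matches)

# Exact binomial test against a chance level of 1/4
binom.test(sum(matches), length(matches), p = 1/4)
```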
We replicated the pattern of results reported in the Main Text using a second corpus: a model trained on Google News (Mikolov et al., 2013). Each of the key analyses is presented below using these embeddings.
Below is the analog to Fig. 1A in the Main Text shown for a model trained on Google News (rather than the Wikipedia Corpus).
Below is the analog to Fig. 1B in the Main Text shown for a model trained on Google News (rather than the Wikipedia Corpus).
We analyzed the 15 “light emission” verbs used by Bedny et al. (2019). We selected this set of items because they refer to a domain to which blind participants have the most limited direct perceptual access. The 15 items are: blaze, blink, flare, flash, flicker, gleam, glimmer, glint, glisten, glitter, glow, shimmer, shine, sparkle, and twinkle. We estimated the pairwise similarity of these words from distributional semantics (cosine distance) using a model trained on English Wikipedia (Bojanowski et al., 2016), and then compared these distances to human similarity judgements.
Presented below are the correlations between human judgements (blind, sighted, and a sample from Mechanical Turk; from Bedny et al., 2019) and estimates from the language model. The red points indicate pairs that include the item “blink,” which appears to be an outlier, presumably because “blink” also has a frequent sense that does not refer to light emission (i.e., a blinking eye). In the Main Text, we report correlation values with this item excluded; below are all correlation values with and without this item.
Correlations with all items:
participant_group | estimate | statistic | p.value | method | alternative |
---|---|---|---|---|---|
Blind | 0.53 | 90588.2 | 1.00e-08 | Spearman’s rank correlation rho | two.sided |
Sighted | 0.47 | 102339.7 | 4.40e-07 | Spearman’s rank correlation rho | two.sided |
Turk | 0.45 | 105997.2 | 1.41e-06 | Spearman’s rank correlation rho | two.sided |
Correlations with “blink” excluded:
participant_group | estimate | statistic | p.value | method | alternative |
---|---|---|---|---|---|
Blind | 0.63 | 46758.1 | 0.0e+00 | Spearman’s rank correlation rho | two.sided |
Sighted | 0.59 | 51935.5 | 1.0e-09 | Spearman’s rank correlation rho | two.sided |
Turk | 0.53 | 59120.5 | 6.9e-08 | Spearman’s rank correlation rho | two.sided |
Excluding “blink”, the correlation between blind and sighted participants is \(\rho\) = 0.9 (p < .01).
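A minimal sketch of these correlations, assuming hypothetical numeric vectors `human_sim` and `language_sim` over the 105 verb pairs and a logical vector `includes_blink` flagging pairs that contain “blink”:

```r
# Spearman correlation between human similarity judgements and
# language-based similarity, with and without pairs containing "blink"
cor.test(human_sim, language_sim, method = "spearman")
cor.test(human_sim[!includes_blink], language_sim[!includes_blink],
         method = "spearman")
```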
References
Bedny, M., Koster-Hale, J., Elli, G., Yazzolino, L., & Saxe, R. (2019). There’s more to “sparkle” than meets the eye: Knowledge of vision and light verbs among congenitally blind and sighted individuals. Cognition, 189, 105–115.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606. https://arxiv.org/abs/1607.04606
de Vries, A., & Ripley, B. D. (2016). ggdendro: Create Dendrograms and Tree Diagrams Using ‘ggplot2’. R package version 0.1-20. https://CRAN.R-project.org/package=ggdendro
Galili, T. (2015). dendextend: An R package for visualizing, adjusting, and comparing trees of hierarchical clustering. Bioinformatics. https://doi.org/10.1093/bioinformatics/btv428
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.