This document was created from an R markdown file. The repository for the project can be found here: https://github.com/mllewis/keb_2019_reanalysis.

1 Card Sorting Task

We estimated the similarity between animals based on distributional statistics using a word embedding model trained on English Wikipedia (Bojanowski et al., 2016). As in the human task from Kim et al. (2019), we estimated the similarity between animals along different conceptual dimensions (color, shape, and texture). For each dimension, we identified all unique words participants generated as labels for their piles during the card sorting task (we removed a small set of items that were dimension-irrelevant, e.g. “farm”). We then calculated the pairwise cosine distance between each animal and each dimension label. For example, for the color dimension, participants produced “brown”, “black”, and “pink” as descriptors for their piles (among many other labels). So, for each animal, we calculated the distance between that animal’s name and each of “brown”, “black”, and “pink” using the word embedding vectors. These distances formed a “color” vector for each animal (in this case of length 3). We then calculated the Euclidean distance between each pair of animals based on their color vectors. These distances provide an estimate of the overall similarity between two animals in terms of color based on language statistics alone. We repeated the same procedure for the shape and texture dimensions.
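
A minimal sketch of this procedure in R is shown below. The objects `embeddings` (a matrix of word vectors with words as row names), `animals`, and `color_labels` are placeholders, not the objects used in the actual analysis code, which is available in the linked repository.

```r
# Placeholder inputs (assumed): `embeddings` is a matrix of word vectors with words
# as row names; `animals` and `color_labels` are character vectors
cosine_dist <- function(a, b) 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# One "color" vector per animal: its cosine distance to every color label
color_vectors <- t(sapply(animals, function(animal) {
  sapply(color_labels, function(label) {
    cosine_dist(embeddings[animal, ], embeddings[label, ])
  })
}))

# Pairwise Euclidean distances between the animals' color vectors give the
# language-based estimate of color similarity between animals
color_dist <- dist(color_vectors, method = "euclidean")
```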

1.1 Dimension labels

The table below presents the descriptor labels participants generated for each of the three dimensions (color, shape, and texture).

1.2 Pairwise Correlations

In the Main Text, we report correlations between the pairwise animal distances for each of the three dimensions (language vs. human/ground truth). These are presented more fully below.

For consistency, taxonomic distances are reported here in terms of similarity (1 - evolutionary distance).

participant_type similarity_type n estimate p.value
Ground Truth taxonomy 435 0.27041 0.00000
Blind shape 435 0.31586 0.00000
Sighted shape 435 0.32658 0.00000
Blind texture 435 0.11988 0.01235
Sighted texture 435 0.09546 0.04661
Blind color 435 0.14880 0.00186
Sighted color 435 0.08555 0.07468

Notably, in a regression predicting human similarity with both taxonomic and linguistic similarity, both measures predict independent variance in human judgements for all three dimensions.
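
As a sketch of the form of these regressions (the data-frame and outcome-variable names are assumptions; the predictor names follow the coefficient tables below):

```r
# One row per animal pair per participant group (data-frame and outcome names assumed)
color_model <- lm(
  human_similarity ~ language_similarity_simple_dist_color + taxonomic_sim + participant_type,
  data = pairwise_color_df
)
summary(color_model)
```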

1.2.1 Color

term estimate std.error statistic p.value
(Intercept) 0.18756 0.04659 4.02550 0.00006
language_similarity_simple_dist_color 0.10725 0.03297 3.25332 0.00118
taxonomic_sim 0.11095 0.03297 3.36553 0.00080
participant_typesighted -0.37512 0.06589 -5.69292 0.00000

1.2.2 Texture

term estimate std.error statistic p.value
(Intercept) 0.18756 0.04669 4.01708 0.00006
language_similarity_simple_dist_texture 0.08692 0.03304 2.63049 0.00868
taxonomic_sim 0.10859 0.03304 3.28649 0.00106
participant_typesighted -0.37512 0.06603 -5.68101 0.00000

1.2.3 Shape

term estimate std.error statistic p.value
(Intercept) 0.18756 0.04641 4.04108 0.00006
language_similarity_simple_dist_shape 0.13896 0.03333 4.16911 0.00003
taxonomic_sim 0.08660 0.03333 2.59832 0.00953
participant_typesighted -0.37512 0.06564 -5.71495 0.00000

1.3 Basic Clusterings

We clustered the resulting pairwise animal similarities for each of the three dimensions. The dendrograms below present hierarchical cluster analyses of human judgements of similarity and of language-based estimates of similarity for each dimension. Dendrograms were produced using the ggdendro (de Vries & Ripley, 2016) and dendextend (Galili, 2015) packages in R.
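
For illustration, a dendrogram of this kind can be produced from a distance object (e.g., `color_dist` from the sketch above) roughly as follows:

```r
library(ggdendro)

# Hierarchical clustering of the pairwise animal distances
hc_color <- hclust(color_dist)

# Dendrogram plotted with ggdendro
ggdendrogram(hc_color, rotate = TRUE)
```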

1.3.1 Shape

1.3.2 Texture

1.3.3 Color

1.4 Entanglement Comparisons

Entanglement is a measure of how well the labels of two dendrograms are aligned. Entanglement values range from 0 (fully aligned labels) to 1 (fully mismatched labels). Entanglement is computed by numbering the labels of each tree (from 1 to the total number of labels) and then computing the L-norm distance between the two resulting vectors.

Below, we show pairwise comparisons of the human judgement-based (blind and sighted participants) and language-based dendrograms as tanglegrams, after using the untangle() method from the R package dendextend to minimize entanglement, i.e. to optimize the alignment of the labels of the two dendrograms without altering the underlying cluster structure. We also plot the minimum entanglement value found for each pairwise comparison (lower values indicate better alignment of the labels).
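
A minimal sketch of this comparison with dendextend (object names such as `hc_human_color` and `hc_color` are assumptions):

```r
library(dendextend)

# Convert the two hclust solutions (e.g., blind-participant vs. language-based
# color clusterings; object names are assumptions) to dendrograms
dend_human    <- as.dendrogram(hc_human_color)
dend_language <- as.dendrogram(hc_color)

# Search for the leaf ordering that minimizes entanglement, without changing
# the underlying tree structure
dends <- untangle(dendlist(dend_human, dend_language), method = "step2side")

# Tanglegram of the aligned trees and the minimized entanglement value
tanglegram(dends[[1]], dends[[2]])
entanglement(dends[[1]], dends[[2]])
```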

1.4.1 Shape

1.4.2 Texture

1.4.3 Color

1.4.4 Entanglement Values

1.5 Indices of Similarity between Clusterings

We also computed two indices of the similarity between the clusterings derived from human (blind and sighted participant data) and language-based similarity ratings for color, shape and texture: the Fowlkes-Mallows Index and the adjusted Rand index.

1.5.1 FM-Index (z-scored)

The Fowlkes-Mallows Index (FM-Index) is computed by comparing the two hierarchical clustering trees cut at a specific level k (i.e. split into k different clusters based on the hierarchical clustering). It varies from 0 to 1, with higher values indicating greater similarity. Intuitively, the FM-Index captures the degree to which two labels that fall in the same cluster in tree 1 also tend to fall in the same cluster in tree 2. It is computed as the geometric mean of two ratios: the number of pairs of labels assigned to the same cluster in both trees divided by the number of pairs assigned to the same cluster in tree 1, and the number of pairs of labels assigned to the same cluster in both trees divided by the number of pairs assigned to the same cluster in tree 2. The FM-Index of two given hierarchical clusterings is then compared to its expected value under the hypothesis of no relation between the two clusterings.
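
In symbols, letting \(n_{12}\) be the number of pairs of labels assigned to the same cluster in both trees, \(n_{1}\) the number of pairs assigned to the same cluster in tree 1, and \(n_{2}\) the number assigned to the same cluster in tree 2, the index is

\[ FM = \sqrt{\frac{n_{12}}{n_{1}} \cdot \frac{n_{12}}{n_{2}}} \]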

The plots depict the z-scored FM-Index for each pairwise comparison of the hierarchical cluster trees derived from the human judgement and language similarity data, after cutting the trees into k = 5, 10, 15, and 20 clusters. The dashed line shows the critical value at \(\alpha = .05\) assuming a one-sided hypothesis test (\(z = 1.645\), i.e. the z-score with a tail area of .05).
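
A sketch of the z-scored FM-Index computation for one value of k, using dendextend's FM_index() (object names are assumptions):

```r
library(dendextend)

# Cut both trees (e.g., blind-participant vs. language-based color clusterings;
# object names are assumptions) into k clusters and compute the FM-Index
k <- 5
fm <- FM_index(cutree(hc_human_color, k = k), cutree(hc_color, k = k))

# FM_index() also reports the expected value and variance under the hypothesis
# of no relation, which give the z-scored index
fm_z <- (as.numeric(fm) - attr(fm, "E_FM")) / sqrt(attr(fm, "V_FM"))
```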

1.5.2 Adjusted Rand Index

The Rand index is the ratio of the number of pairs of labels on which two clusterings agree (i.e. pairs of labels in the same cluster in both trees plus pairs of labels in different clusters in both trees) to the total number of label pairs. The adjusted Rand index (here using Hubert and Arabie’s method) corrects the Rand index for the agreement one would expect by chance alone. An adjusted Rand index of 0 indicates that two clusterings have a Rand index equal to the expected value for random groupings, with higher (or lower) values indicating greater (or less) than chance-level similarity between the two clusterings.

The plots depict the adjusted Rand index for each pairwise comparison of the hierarchical cluster trees derived from the human judgement and language similarity data, after cutting the trees into k = 5, 10, 15, and 20 clusters.
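
One way to compute this (a sketch; the original analysis may have used a different package) is with mclust's adjustedRandIndex(), which implements the Hubert and Arabie correction:

```r
library(mclust)

# Adjusted Rand index between the two clusterings after cutting each tree
# into k clusters (object names are assumptions)
k <- 5
ari <- adjustedRandIndex(cutree(hc_human_color, k = k), cutree(hc_color, k = k))
```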

2 Feature Choice Task (Texture)

In this analysis (corresponding to Fig. 1B in the Main Text), we used language data to predict human responses in a task where participants indicated whether each of the 30 animals had a texture of fur, feathers, scales, or skin.

Figure 6b from Kim, Elli, and Bedny (2019; KEB) is reproduced on the left below. To predict these data from distributional language statistics, we estimated the cosine distance between each animal and each texture using a model trained on English Wikipedia (Bojanowski et al., 2016). On the right below, the cosine distance between each animal and each texture is shown. The most frequently selected (human) or most similar (language) texture is indicated in red.

We calculated an accuracy score by identifying the most similar texture for each animal in the language data and the most frequently selected texture in the human data, and then calculating the proportion of animals for which the language estimates correctly predict the human responses, as well as the objectively “correct” responses. These accuracy scores are presented below.

group estimate statistic p.value parameter conf.low conf.high method alternative se
Blind 0.43333 13 0.03219 30 0.25461 0.62573 Exact binomial test two.sided 0.09047
Sighted 0.56667 17 0.00022 30 0.37427 0.74539 Exact binomial test two.sided 0.09047
Ground Truth 0.53333 16 0.00100 30 0.34326 0.71658 Exact binomial test two.sided 0.09108
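
For reference, a minimal sketch of the accuracy calculation and exact binomial test reported above (`texture_similarity`, an animals-by-textures matrix of language-based similarities, and `human_choice`, the modal human response per animal, are assumed objects; the chance level of 1/4 reflects the four texture options):

```r
# Most similar texture per animal according to the language model
language_choice <- colnames(texture_similarity)[apply(texture_similarity, 1, which.max)]

# Agreement with the human (or ground-truth) choices, tested with an exact
# binomial test against an assumed chance level of 1/4
n_match <- sum(language_choice == human_choice)
binom.test(n_match, length(human_choice), p = 0.25)
```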

3 Replication on Second Corpus

We replicated the pattern of results reported in the Main Text on a second corpus: a model trained on Google News (Mikolov et al., 2013). Presented below are each of the key analyses using these embeddings.

3.1 Card Sorting Task

Below is the analog to Fig. 1A in the Main Text shown for a model trained on Google News (rather than the Wikipedia Corpus).

3.2 Feature Choice Task (Texture)

Below is the analog to Fig. 1B in the Main Text shown for a model trained on Google News (rather than the Wikipedia Corpus).

4 Bedny et al. (2019) Data

We analyzed the 15 “light emission” verbs used by Bedny et al. (2019). We selected this set of items because they refer to a domain to which blind participants have the most limited direct perceptual access. The 15 items are: blaze, blink, flare, flash, flicker, gleam, glimmer, glint, glisten, glitter, glow, shimmer, shine, sparkle, twinkle. We estimated the pairwise similarity of these words from distributional semantics (cosine distance) using a model trained on English Wikipedia (Bojanowski et al., 2016), and then compared these distances to human similarity judgements.
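
A minimal sketch of this computation (`embeddings`, `light_verbs`, and `human_sim` are assumed objects; `human_sim` is taken to hold the human judgements for the same word pairs in the same order):

```r
# Pairwise cosine similarities among the 15 light-emission verbs
verb_vectors <- embeddings[light_verbs, ]
norms        <- sqrt(rowSums(verb_vectors^2))
cos_sim_mat  <- (verb_vectors %*% t(verb_vectors)) / (norms %o% norms)
language_sim <- as.vector(as.dist(cos_sim_mat))

# Spearman rank correlation between language-based and human similarity
cor.test(language_sim, human_sim, method = "spearman")
```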

Presented below are the correlations between human judgements (blind, sighted, and a sample from Mechanical Turk; from Bedny et al., 2019) and estimates from the language model. The red points indicate pairs that include the item “blink,” which appears to be an outlier, presumably because “blink” also has a frequent sense that does not refer to light emission (i.e., a blinking eye). In the Main Text, we report correlation values with this item excluded; below are correlation values both with and without this item.

Correlations with all items:

participant_group estimate statistic p.value method alternative
Blind 0.53 90588.2 1.00e-08 Spearman’s rank correlation rho two.sided
Sighted 0.47 102339.7 4.40e-07 Spearman’s rank correlation rho two.sided
Turk 0.45 105997.2 1.41e-06 Spearman’s rank correlation rho two.sided

Correlations with “blink” excluded:

participant_group estimate statistic p.value method alternative
Blind 0.63 46758.1 0.0e+00 Spearman’s rank correlation rho two.sided
Sighted 0.59 51935.5 1.0e-09 Spearman’s rank correlation rho two.sided
Turk 0.53 59120.5 6.9e-08 Spearman’s rank correlation rho two.sided

Excluding “blink”, the correlation between blind and sighted participants is \(\rho\) = 0.9 (p < .01).

References

Bedny, M., Koster-Hale, J., Elli, G., Yazzolino, L., & Saxe, R. (2019). There’s more to “sparkle” than meets the eye: Knowledge of vision and light verbs among congenitally blind and sighted individuals. Cognition, 189, 105–115.

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word vectors with subword information. https://arxiv.org/abs/1607.04606

de Vries, A. & Ripley, B. D. (2016). ggdendro: Create Dendrograms and Tree Diagrams Using ‘ggplot2’. R package version 0.1-20. https://CRAN.R-project.org/package=ggdendro

Galili, T. (2015). dendextend: an R package for visualizing, adjusting, and comparing trees of hierarchical clustering. Bioinformatics. DOI: 10.1093/bioinformatics/btv428

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.