OVERVIEW

The 3k corpus was put together by Chris Cox and Chen Lang around 2015. It is a set of monosyllabic words. This repository gives you access to those words, their characteristics, and their binary feature representations for orthography and phonology. See the README in the corresponding GitHub directory and in ~/data/ for more information.

This notebook describes various item-level attributes that are included in the database, along with their distributions, for reference purposes. Each section below is headed by the name (or approximate name) of the variable it covers and a description of what that variable represents.

NOTE: the word “corpus” here is used only to denote that this resource is a table of words that can be used for various NLP purposes. The texts from which these words were selected and their frequencies derived are not contained in this repository.

library(tidyverse)  # provides ggplot2, dplyr (%>%, filter), and stringr (str_length)
load(file = "./data/data_clean/by_item_data.rda")

The words in the corpus (word)

This is a character vector of the words that comprise the corpus. The words vary in frequency: most are common monosyllabic English words, but many low-frequency, exceptional words are included to keep the list ecologically valid. A few characteristics are shown below.

ggplot(by_item_data, aes(str_length(word))) +
  geom_histogram() +
  labs(x = "orthographic length", title = "Distribution of words by their length in 3k corpus") +
  theme(plot.title = element_text(hjust = .5, size = 15))

Here is a random sample of 200 words from the corpus. A new random sample is drawn every time the notebook is rendered.

sample(by_item_data$word, 200)

##   [1] "urge"    "yield"   "tee"     "half"    "hoop"    "stove"   "shard"  
##   [8] "pee"     "curd"    "die"     "frail"   "lurch"   "bus"     "spoke"  
##  [15] "proud"   "fund"    "souse"   "long"    "brig"    "shot"    "breadth"
##  [22] "brawn"   "snow"    "prick"   "nice"    "plan"    "bead"    "thin"   
##  [29] "though"  "hash"    "flail"   "dry"     "crab"    "chain"   "stamp"  
##  [36] "leaf"    "czar"    "lithe"   "brass"   "hike"    "sop"     "sub"    
##  [43] "gem"     "wield"   "stench"  "weak"    "put"     "wrote"   "bell"   
##  [50] "been"    "sly"     "gleam"   "shift"   "glove"   "grow"    "pack"   
##  [57] "hood"    "stair"   "isle"    "glaze"   "rut"     "swore"   "whence" 
##  [64] "prowl"   "crux"    "dung"    "brook"   "palm"    "budge"   "zeal"   
##  [71] "snake"   "gene"    "heed"    "saint"   "gyp"     "peace"   "terms"  
##  [78] "helm"    "foil"    "throat"  "rice"    "dice"    "stir"    "thaw"   
##  [85] "last"    "lilt"    "down"    "suck"    "boost"   "aft"     "jig"    
##  [92] "dream"   "clown"   "stance"  "drop"    "net"     "broad"   "dead"   
##  [99] "from"    "elm"     "bleep"   "lass"    "lear"    "catch"   "chunk"  
## [106] "script"  "suave"   "broach"  "one"     "ape"     "jape"    "move"   
## [113] "brain"   "null"    "bound"   "chaise"  "ire"     "lag"     "ease"   
## [120] "switch"  "hide"    "wrack"   "ounce"   "purse"   "clutch"  "false"  
## [127] "kid"     "pie"     "waist"   "this"    "deck"    "rime"    "hers"   
## [134] "creche"  "toy"     "pit"     "sunk"    "blob"    "bright"  "brute"  
## [141] "whoosh"  "dance"   "tripe"   "haunt"   "size"    "jet"     "thump"  
## [148] "huck"    "pain"    "flu"     "guile"   "whose"   "read"    "midst"  
## [155] "cove"    "fresh"   "hath"    "rum"     "heap"    "course"  "stow"   
## [162] "whack"   "slot"    "jell"    "pore"    "say"     "drew"    "build"  
## [169] "fawn"    "din"     "ship"    "guess"   "free"    "soap"    "cleft"  
## [176] "hut"     "verb"    "weld"    "more"    "stool"   "call"    "buck"   
## [183] "slang"   "wand"    "flew"    "grant"   "eight"   "sob"     "tongue" 
## [190] "smooth"  "spoof"   "leave"   "pin"     "ton"     "good"    "stunk"  
## [197] "serf"    "horn"    "show"    "broom"

Age of acquisition (aoa_mean)

This is the mean age of acquisition rating for each word in the corpus. These ratings were obtained from the Kuperman et al. (2012) publication on age of acquisition of 30,000 English words:

Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4), 978-990.

NOTE: there are missing values in this variable because the Kuperman et al. (2012) resource did not have values for all words in the 3k corpus.

ggplot(by_item_data, aes(aoa_mean)) +
  geom_histogram() +
  labs(x = "age of acquisition ratings", title = "Distribution of words by their aoa in 3k corpus") +
  theme(plot.title = element_text(hjust = .5, size = 15))

Word frequency (freq)

This is a word frequency measure from the SUBTLEX-US corpus (Brysbaert & New, 2009). The value used is the log of the raw frequency (the variable named FREQcount in the file from the Ghent download source), out of 51 million words in the full corpus. See the original source for more information and access to the data. The citation for the source is:

Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977-990.

Words that had no frequency rating in the SUBTLEX data were recoded as having a frequency of 1 (272 observations, 9% of the data). The visual below filters out these values. We should update these values with better frequency estimates in the near future.

Log of the value is displayed.

by_item_data %>% filter(freq > 1) %>%
  ggplot(aes(log(freq))) +
    geom_histogram() +
    labs(x = "log of frequency", title = "Distribution of words by their log(frequency) in 3k corpus") +
    theme(plot.title = element_text(hjust = .5, size = 15))
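
The recoding and filtering described above can be sketched on a toy table. This is only an illustration: the three rows and their counts are made up, and it assumes that freq holds the raw count with missing SUBTLEX entries set to 1, which is how the plot code above treats it.

```r
# Toy stand-in for the SUBTLEX lookup; words and counts are hypothetical.
toy <- data.frame(
  word      = c("word_a", "word_b", "word_c"),
  FREQcount = c(1000, 50, NA)
)

# Words with no SUBTLEX entry are recoded to a frequency of 1.
toy$freq <- ifelse(is.na(toy$FREQcount), 1, toy$FREQcount)

# Share of items that were recoded.
prop_recoded <- mean(toy$freq == 1)

# Drop the recoded items, then take the log for display.
shown <- toy[toy$freq > 1, ]
shown$log_freq <- log(shown$freq)
```

Running the same prop_recoded calculation on by_item_data$freq should reproduce the share of recoded items reported above, provided the assumption about freq holds.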

Orthographic metrics

The following set of metrics pertains to the orthography of the words in the 3k corpus. These are computed from the orthographic featural representations for each word in the corpus.

Orthographic length (orth_length)

As shown at the top of this notebook, the orthographic lengths of the items in the corpus are approximately normally distributed. Here is that visual again.

ggplot(by_item_data, aes(orth_length)) +
  geom_histogram() +
  labs(x = "orthographic length", title = "Distribution of words by their length in 3k corpus") +
  theme(plot.title = element_text(hjust = .5, size = 15))

Itemwise average orthographic distance from every other word (orth_dist)

Orth dist is computed by a generalized method for computing pairwise distance in multidimensional space (see .data/helper_functions.R for code). This metric is derived by the following steps:

1. Compute a pairwise distance matrix for the entire set of words based on their binary orthographic features (Manhattan distance).
2. For each word, compute the average distance between that word and every other word in the corpus.
3. This mean value is the orth_dist metric.

So you can think of this variable as a measure of how far, on average, a word is from all other words in the set.
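
The steps above can be sketched in a few lines of base R. This is a minimal illustration and not the code in the helper file: the feature matrix is a made-up toy, and it assumes that a word's zero distance to itself is left out of the average.

```r
# Toy binary feature matrix: one row per word, one column per feature.
# The values are hypothetical; the real features live in the repository data.
features <- rbind(
  cat = c(1, 0, 1, 0, 0, 1),
  cot = c(1, 0, 0, 1, 0, 1),
  dog = c(0, 1, 0, 1, 1, 0)
)

# Step 1: pairwise Manhattan distance matrix over all words.
d <- as.matrix(dist(features, method = "manhattan"))

# Step 2: for each word, average its distance to every other word.
# The diagonal (distance to itself) is 0, so dividing the row sum by
# n - 1 gives the mean over the other words only.
n <- nrow(d)
orth_dist_sketch <- rowSums(d) / (n - 1)
```

In this toy set, "dog" shares the fewest features with the other two words, so it ends up with the largest average distance.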

Note that the feature representations for orthography treat each letter as a one-hot encoded vector, with that letter’s unit as the hot node. The letter vectors are then concatenated to make the longer vector that comprises the orthographic wordform. The orthographic patterns are centered on the first orthographic vowel, which means that many patterns will be padded with empty (zero) space at the left and right extremes of the featural representation.
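
To make the encoding concrete, here is a rough sketch of a vowel-centered one-hot encoder. The function name encode_word, the number of letter slots on each side of the vowel (left, right), and the choice to count "y" as a vowel are all assumptions made for illustration; the repository's actual representations may differ on each of these points.

```r
# Hypothetical helper: one-hot encode a lowercase word, aligned on its first vowel.
# `left` and `right` are the numbers of letter slots before and after the vowel slot.
encode_word <- function(word, left = 3, right = 4) {
  chars <- strsplit(word, "")[[1]]
  # Assumption: a, e, i, o, u, and y all count as orthographic vowels.
  vowel_pos <- which(chars %in% c("a", "e", "i", "o", "u", "y"))[1]
  # Slot index for each letter, so that the first vowel lands in slot left + 1.
  slots <- seq_along(chars) - vowel_pos + left + 1
  stopifnot(!is.na(vowel_pos), all(slots >= 1), all(slots <= left + 1 + right))
  # One row per slot, one column per letter; unused slots stay all zero (padding).
  m <- matrix(0L, nrow = left + 1 + right, ncol = 26, dimnames = list(NULL, letters))
  m[cbind(slots, match(chars, letters))] <- 1L
  # Concatenate the slot vectors, slot by slot, into one long vector.
  as.vector(t(m))
}

cat_vec <- encode_word("cat")
cot_vec <- encode_word("cot")
manhattan_cat_cot <- sum(abs(cat_vec - cot_vec))
```

Because both words are aligned on their vowel slot, "cat" and "cot" differ only in that slot, which is why alignment matters for the distance metrics described in this section.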

ggplot(by_item_data, aes(orth_dist)) +
  geom_histogram() +
  labs(x = "itemwise average orthographic distance from every other word", title = "Distribution of words by orth_dist in 3k corpus") +
  theme(plot.title = element_text(hjust = .5, size = 15))

Itemwise variation in average orthographic distance from every other word (orth_spread)

The orth_spread variable is calculated in a similar way to orth_dist. Where orth_dist is the average distance between a given word and all other words in the set, orth_spread is the standard deviation of those distances. The calculation consists of the following steps:

1. Compute a pairwise distance matrix for the entire set of words based on their binary orthographic features (Manhattan distance).
2. For each word, compute the standard deviation of the distances between that word and every other word in the corpus.
3. This standard deviation value is the orth_spread metric.

You can think of this as a measure of how widely the pairwise distances between a word and every other word in the set are dispersed, rather than their mean. This value will be greatest for central words (easy ones) and lowest for outlier words (hard ones).
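
A minimal sketch of this calculation follows, using a made-up toy feature matrix. It assumes that self-distances are excluded and that the sample standard deviation (R's sd) is used; neither detail is confirmed by the text above.

```r
# Toy binary feature matrix (hypothetical values), one row per word.
features <- rbind(
  cat = c(1, 0, 1, 0, 0, 1),
  cot = c(1, 0, 0, 1, 0, 1),
  dog = c(0, 1, 0, 1, 1, 0)
)

# Pairwise Manhattan distances between all words.
d <- as.matrix(dist(features, method = "manhattan"))

# Mask each word's distance to itself so it does not enter the calculation.
diag(d) <- NA

# For each word, the standard deviation of its distances to the other words.
orth_spread_sketch <- apply(d, 1, sd, na.rm = TRUE)
```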

ggplot(by_item_data, aes(orth_spread)) +
  geom_histogram() +
  labs(x = "itemwise variation in average orthographic distance", title = "Distribution of words by orth_spread in 3k corpus") +
  theme(plot.title = element_text(hjust = .5, size = 15))

Orthographic clustering (orth_cluster)

You can think of each word as belonging to an orthographic cluster whose members show a high degree of orthographic similarity to that word. Defining the clusters is not trivial given the quasiregular nature of the orthotactics of the English spelling system. In fact, the clusters themselves could be rigorously evaluated empirically; this is not an attempt at such an evaluation (though more attention should be paid to this at some point). For our purposes, we find a suitable number of clusters for the orthographic space by calculating the ratio of the between-cluster sum of squares to the total sum of squares over a range of possible numbers of clusters for the set. Plotting this ratio against the number of clusters, we select a number of clusters around the point at which the rate of change starts to slow (i.e., at the elbow of the curve). See the graph below for the elbow plot for orthography, and the Wikipedia discussion of the elbow method for more on this approach.
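
The ratio described above can be computed as in the sketch below. It assumes the clustering is done with k-means (stats::kmeans), which is the method named in the phonological clustering section; the random toy features, the range of cluster counts, and the nstart setting are illustrative choices only.

```r
set.seed(1)  # k-means starts from random centers, so fix the seed

# Toy binary feature matrix: 60 hypothetical "words" by 12 features.
features <- matrix(rbinom(60 * 12, size = 1, prob = 0.3), nrow = 60)

# Fit k-means for a range of cluster counts and record, for each one,
# the between-cluster sum of squares divided by the total sum of squares.
n_clusters <- 2:10
elbow <- data.frame(
  n_clusters = n_clusters,
  ratio = sapply(n_clusters, function(k) {
    fit <- kmeans(features, centers = k, nstart = 10)
    fit$betweenss / fit$totss
  })
)
```

Plotting ratio against n_clusters gives the elbow curve. With random toy data the curve has no clear elbow, whereas structured data such as the orthographic features would be expected to flatten more visibly.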

Here is a look at the clusters based on their size.

ggplot(by_item_data, aes(orth_cluster)) +
  geom_bar() +
  labs(x = "orthographic cluster index (just the number assigned to each cluster)", title = "Size of cluster by cluster index") +
  theme(plot.title = element_text(hjust = .5, size = 15))

And a look at the cluster membership of the first few clusters.

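by_item_data %>% filter(orth_cluster < 3) %>%
  select(word, orth_cluster) %>%
  arrange(desc(orth_cluster)) %>%
  # top_n() ranks by the last column (orth_cluster) and keeps ties,
  # so it can return more than 25 rows
  top_n(25)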

## # A tibble: 33 x 2
##    word   orth_cluster
##    <chr>         <int>
##  1 shout             2
##  2 should            2
##  3 shown             2
##  4 shop              2
##  5 shot              2
##  6 show              2
##  7 shod              2
##  8 shoe              2
##  9 short             2
## 10 shore             2
## # ... with 23 more rows

Orthographic neighborhood size (orth_neighb_size)

The orthographic neighborhood size is calculated as the number of words that are selected for a given cluster, where cluster assignment is based on the process outlined in orth_cluster above. As with the cluster assignments themselves, this neighborhood size is of empirical interest given that it concerns what constitutes a “cluster”.
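
As a rough sketch, the neighborhood size can be derived from the cluster assignments by counting the members of each word's cluster. The toy words and cluster labels below are made up, and counting the word itself as part of its own neighborhood is an assumption.

```r
# Toy cluster assignments (hypothetical labels, not the real clustering).
toy <- data.frame(
  word         = c("shop", "shot", "show", "cat", "cot"),
  orth_cluster = c(2L, 2L, 2L, 1L, 1L)
)

# For each word, count how many words share its cluster (the word included).
toy$orth_neighb_size <- as.integer(
  ave(toy$orth_cluster, toy$orth_cluster, FUN = length)
)
```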

ggplot(by_item_data, aes(orth_neighb_size)) +
  geom_histogram() +
  labs(x = "size of orthographic neighborhood", title = "Distribution of the size of the orthographic clusters in the 3k corpus") +
  theme(plot.title = element_text(hjust = .5, size = 15))

Phonological metrics

The following word-level metrics describe the phonological properties of the words in the 3k corpus. They are computed directly from the featural representations for phonology, where each orthographic wordform has a corresponding phonological wordform.

Itemwise average phonological distance from every other word (phon_dist)

This metric is calculated the same way as orth_dist (see above) but in phonological space.

ggplot(by_item_data, aes(phon_dist)) +
  geom_histogram() +
  labs(x = "Itemwise average phonological distance from every other word", title = "Distribution of words by phon_dist in 3k corpus") +
  theme(plot.title = element_text(hjust = .5, size = 15))

Itemwise variation in average phonological distance from every other word (phon_spread)

This metric is calculated the same way as orth_spread (see above) but in phonological space.

ggplot(by_item_data, aes(phon_spread)) +
  geom_histogram() +
  labs(x = "Itemwise variation in average phonological distance", title = "Distribution of words by their phon_spread in 3k corpus") +
  theme(plot.title = element_text(hjust = .5, size = 15))

Phonological clustering (phon_cluster)

This variable is generated with the same k-means method as the other cluster metrics. See the elbow plot for phonology below.

ggplot(by_item_data, aes(phon_cluster)) +
  geom_bar() +
  labs(x = "phonological cluster index (just the number assigned to each cluster)", title = "Size of phonological cluster by cluster index") +
  theme(plot.title = element_text(hjust = .5, size = 15))

Phonological neighborhood size (phon_neighb_size)

Calculated in the same fashion as the other measures of neighborhood size, but for phonological clusters.

ggplot(by_item_data, aes(phon_neighb_size)) +
  geom_histogram() +
  labs(x = "size of phonological neighborhood", title = "Distribution of the size of the phonological clusters in the 3k corpus") +
  theme(plot.title = element_text(hjust = .5, size = 15))

Hidden layer metrics

This set of word-level metrics is derived from the hidden layer activations in a model that learns all 2881 words in the 3k corpus. See the documentation for this model on GitHub. These metrics are derived directly from the featural representations in the latent space of a model that learns to map orthography to phonology, and thus represent the mapping from orthography to phonology for each word.

Itemwise average hidden distance from every other word (hidden_dist)

This metric is calculated the same way as orth_dist (see above) but in hidden space.

ggplot(by_item_data, aes(hidden_dist)) +
  geom_histogram() +
  labs(x = "Itemwise average hidden distance from every other word", title = "Distribution of words by hidden_dist in 3k corpus") +
  theme(plot.title = element_text(hjust = .5, size = 15))

Itemwise variation in average hidden distance from every other word (hidden_spread)