Potential research questions:
1. “What makes a word recognizable?”
2. “What are the features that make a word recognizable?”
3. “What are the features that make a word hard to recognize?”
The dataset we are using is from the British Lexicon Project. The data were gathered in an experiment testing whether people can distinguish real words from fake ones. In the experiment, British subjects were presented with a word and asked to decide, as quickly and as accurately as possible, whether it was a real or a fake English word. The dataset used in our project contains the aggregate results for all words presented to the subjects, including the response time and accuracy for each individual word. In addition, we have a corresponding dataset that contains linguistic characteristics for each word, such as the number of syllables, the word length, morphological breakdowns, etc.
In our project, we will focus specifically on the relationship between response time and response accuracy, as well as their relationship to the following linguistic features: nletters (word length), coltheart.N (the number of words that differ from a given word by a single letter), part of speech, and morphological features.
We treat response time as a proxy for “confidence” in one’s answer (a shorter time indicating greater confidence) and accuracy as a measure of the sense of “familiarity” a word inspires. We hypothesize that words that prompt low response times and high accuracy scores are those which are most “word-like”, in that the word is familiar enough to speakers that they answer quickly and confidently.
Table 1: Quadrants of confidence (response time, RT) and familiarity (accuracy)
##           High Accuracy            Low Accuracy
## Low RT    Confident & Familiar     Confident & Unfamiliar
## High RT   Unconfident & Familiar   Unconfident & Unfamiliar
Preliminary analysis reveals clear distributional differences in the reaction times and accuracy scores of participants. As seen below, participants identified non-words as non-words more accurately than they identified real words as real words.
Participants also exhibited shorter reaction times in response to shorter words than to longer words, possibly because longer words require more mental processing.
We can look at the distributions of the response time and the accuracy for both real words and non-real words. What we want to see here is where the quartiles fall and whether the distribution is normal, left-skewed, or right-skewed.
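As a rough numerical companion to reading skew off a plot, the sketch below computes the quartiles and Bowley's quartile skewness for a list of values. This is an illustrative technique, not necessarily the one used in our analysis, and the sample values are made up rather than taken from the dataset.

```python
from statistics import quantiles


def describe(values):
    """Return the quartiles and Bowley's quartile skewness of a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Bowley skewness: negative means left-skewed, positive means right-skewed.
    bowley = ((q3 - q2) - (q2 - q1)) / iqr if iqr else 0.0
    return {"q1": q1, "median": q2, "q3": q3, "bowley_skew": bowley}


# Hypothetical values for illustration only (not from the British Lexicon Project).
accuracy = [0.35, 0.80, 0.90, 0.95, 0.97, 0.98, 1.00, 1.00, 1.00]
rt_ms = [480, 500, 510, 520, 540, 560, 600, 700, 950]

print(describe(accuracy))  # bowley_skew is negative: left-skewed
print(describe(rt_ms))     # bowley_skew is positive: right-skewed
```

A quartile-based measure is used here because it relies on the same quartiles that the plots display and is not dominated by a few extreme values.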
Evidently, the distribution for accuracy in both real words and non-real words is heavily left-skewed or “top heavy.” That being said, the distribution for non-real words is far more concentrated than that of real words. Furthermore, the response time for both datasets is right-skewed.
Given this information, we can infer that both datasets contain a large number of words with high accuracy and low response times; participants performed quickly and accurately on most words, both real and non-real.
Further examination into the data reveals that participants were able to identify non-words with an average accuracy of 94%. By contrast, they were able to achieve only a 76% accuracy rate in identifying real words.
When we apply a word attribute, for example “nletters”, which is a count of the number of letters in the word, a clear relationship emerges between attribute, accuracy, and response time. In the case of nletters, we can see that smaller letter counts tend to go with lower response times. This indicates that words with smaller letter counts prompt higher confidence in decision making. This suggests that letter count is one feature that contributes to whether or not a word is “word-like.”
Let’s continue by computing the quartiles and filtering out the middle 50% of data points. This allows us to look at the extremes on both the high and low ends of accuracy and response time. We can then plot these extremes according to Table 1 and apply more features to see how they trend. In this way, we will take a closer look at the relationships between features, accuracy, and response time.
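The filtering step can be sketched as follows. This is a minimal illustration under two assumptions: a word is kept only if it falls in the top or bottom quartile on both response time and accuracy, and each word is a record with the hypothetical field names `rt` and `accuracy`. The sample records are invented.

```python
from statistics import quantiles


def extreme_quadrant(rt, acc, rt_cuts, acc_cuts):
    """Return a Table 1 quadrant label, or None if the word lies in the
    middle 50% on either measure."""
    rt_q1, rt_q3 = rt_cuts
    acc_q1, acc_q3 = acc_cuts
    if rt <= rt_q1:
        rt_label = "Lo RT"
    elif rt >= rt_q3:
        rt_label = "Hi RT"
    else:
        return None
    if acc <= acc_q1:
        acc_label = "Lo ACC"
    elif acc >= acc_q3:
        acc_label = "Hi ACC"
    else:
        return None
    return f"{acc_label}, {rt_label}"


def label_extremes(words):
    """Keep only words in the outer quartiles and attach their quadrant."""
    q1_rt, _, q3_rt = quantiles([w["rt"] for w in words], n=4, method="inclusive")
    q1_acc, _, q3_acc = quantiles([w["accuracy"] for w in words], n=4, method="inclusive")
    kept = []
    for w in words:
        quadrant = extreme_quadrant(w["rt"], w["accuracy"], (q1_rt, q3_rt), (q1_acc, q3_acc))
        if quadrant is not None:
            kept.append({**w, "quadrant": quadrant})
    return kept


# Hypothetical records for illustration only.
words = [
    {"word": "w1", "rt": 450, "accuracy": 0.99},
    {"word": "w2", "rt": 500, "accuracy": 0.95},
    {"word": "w3", "rt": 560, "accuracy": 0.90},
    {"word": "w4", "rt": 640, "accuracy": 0.70},
    {"word": "w5", "rt": 800, "accuracy": 0.40},
]

for row in label_extremes(words):
    print(row["word"], row["quadrant"])
```

Values exactly on a quartile boundary are treated as extreme here, so slightly more than 25% of words can land on each end when there are ties.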
This chart is the same as the one above; however, in these graphs, the middle 50% of the data has been removed, leaving only the top 25% and bottom 25%. We are left with four quadrants giving the extremes of the data in Table 1. We can examine these quadrants individually for patterns and insights.
For nletters in each quadrant, faceted by lexicality (real vs. non-real words), we can see a clear positive relationship between the number of letters in a word and the reaction time needed to classify the word: the more letters a word has, the longer participants take.
Within the real words dataset, there is additional metadata that we can examine to try to find patterns of confidence and familiarity. The first thing we can examine is parts of speech.
By plotting layered density distributions, we can examine the “rankings” of the parts of speech that are most commonly found among words with low response times and high accuracy rates, that is, among the most “word-like” words.
Immediately, the rankings seem to be correlated. There is a bit of interchangeability among them, but generally the rankings for low response time and high accuracy are similar. In order, they are: multiple/pronoun, conjunction/preposition/verb, adverb, adjective, noun, interjection, undefined. Here, the category of “multiple” refers to words which can function in multiple categories (e.g. “about” can be an adjective, adverb, or preposition). From this we can see a general pattern among the parts of speech of words that have high accuracy and low response times, which are the words we identify as word-like.
Coltheart’s N is a measure of the orthographic similarity of a word to other words: it counts the words that differ from a given word by only one letter (an edit distance of 1). The distribution below allows us to examine words which are similar to other words. This measure applies to both real and non-real words.
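To make the measure concrete, here is a minimal sketch of how Coltheart’s N could be computed. It assumes the standard definition, in which a neighbor is formed by substituting exactly one letter while keeping the word length fixed; the dataset’s coltheart.N column is precomputed, so this is not how our values were obtained. The tiny lexicon is a made-up example.

```python
import string


def coltheart_n(word, lexicon):
    """Count lexicon entries formed by substituting exactly one letter of `word`."""
    neighbors = set()
    for i, original in enumerate(word):
        for letter in string.ascii_lowercase:
            if letter == original:
                continue
            candidate = word[:i] + letter + word[i + 1:]
            if candidate in lexicon:
                neighbors.add(candidate)
    return len(neighbors)


# Hypothetical toy lexicon for illustration only.
lexicon = {"cat", "cot", "cut", "bat", "hat", "can", "cart", "at"}

print(coltheart_n("cat", lexicon))  # 5: cot, cut, bat, hat, can
print(coltheart_n("cet", lexicon))  # 3: cat, cot, cut (works for non-words too)
```

Note that “cart” and “at” are not counted for “cat”: they are one insertion or deletion away, which a general edit distance of 1 would include but the substitution-only definition does not.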
The mean Coltheart’s N is highest for words with the lowest accuracy and lowest response time (4.761), and lowest for words with the highest accuracy and highest response time (0.885). Additionally, Coltheart’s N is about equal for words with high accuracy/low RT and low accuracy/high RT. This indicates that Coltheart’s N is linked to both accuracy and reaction time. The more neighbors a word has (higher Coltheart’s N), the more likely an error is to be made (lower accuracy), and the quicker the response (greater confidence in the choice). The reverse pattern also appears to hold based on this distribution.
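The per-quadrant means reported above amount to a group-by-and-average step, sketched below. The field names `quadrant` and `coltheart_n` are hypothetical, and the rows are invented values that do not reproduce the figures above.

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(rows, group_key, value_key):
    """Average `value_key` within each distinct value of `group_key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: mean(values) for group, values in groups.items()}


# Hypothetical rows for illustration only.
rows = [
    {"quadrant": "Lo ACC, Lo RT", "coltheart_n": 5},
    {"quadrant": "Lo ACC, Lo RT", "coltheart_n": 4},
    {"quadrant": "Hi ACC, Hi RT", "coltheart_n": 1},
    {"quadrant": "Hi ACC, Hi RT", "coltheart_n": 0},
]

print(mean_by_group(rows, "quadrant", "coltheart_n"))
```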
Finally, we want to see if the morphology of a word has any impact on its perceived word-likeness. To do so, we filter for real words (this information is not available for non-real words) and look at the morphological metadata attributed to each word.
We can now break down the quadrants by morphology and look for patterns.
From this chart, we can see that morphological frequency varies across all quadrants, particularly in the two opposite quadrants, “Hi ACC, Lo RT” and “Lo ACC, Hi RT.” The morphological data in these two quadrants mirror each other and are thus contradictory. Therefore, we cannot determine whether morphology has an impact on either response time or accuracy.
Based on the analyses conducted above, we conclude that several of the features we examined influence native English speakers’ perceptions of a word as being word-like. The number of letters is related to the subject’s response time and thus to their confidence in their answer: shorter words elicit a lower response time.
Additionally, the more orthographic neighbors a word has, or the more common-looking a word is, the faster participants are likely to identify it as real, but inaccurately so. Conversely, the fewer orthographic neighbors a word has, or the more unique-looking a word is, the longer participants will likely spend classifying the word, and usually with high accuracy.
Finally, a word’s part of speech appears to play a role in how accurately and quickly participants were able to classify real words. Pronouns, conjunctions, pure verbs, and words with multiple parts of speech inspired the fastest and most accurate reactions, while pure nouns, interjections, and adjectives elicited the slowest and least accurate reactions. We could not establish a relationship between morphological breakdowns and response time or accuracy.