Zipf Results

Zipf Plot

Zipf Law Test

Zipf’s law is an empirical generalization based on studies of natural language text corpora. It has repeatedly been shown to have merit and good predictive power even when applied to very different collections of documents (text corpora) in all studied natural languages, including extinct languages. It examines the relationship between term frequency (how often a term is used in the text corpus) and term rank.

By definition, the most frequently used term has rank 1 (the lowest rank); the second most frequent term, rank 2 (the second lowest rank); and the least frequent term, the highest rank. Beyond this simple ordering pattern, Zipf’s law states that the “frequency of any word is inversely proportional to its rank in the frequency table” {1}. In other words, as the Wikipedia entry {1} explains, for a given collection of documents (a text corpus), Zipf’s law predicts that “the most frequent word [rank 1] will occur approximately twice as often as the second most frequent word [rank 2], three times as often as the third most frequent word [rank 3], etc.”

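Since frequency is inversely proportional to rank, the relationship follows a power law. We can therefore test for conformity with Zipf’s law by plotting term frequency (Y axis) against term rank (X axis) on a logarithmic (log-log) scale. On that scale a power law appears as an approximately straight line, and for Zipf’s law the slope of that line is close to -1.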
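The log-log test described above can also be checked numerically. The following is a minimal sketch, not the method used for the Zipf Plot in this project: it fits a least-squares line to log(frequency) against log(rank) and reports the slope. The function name `zipf_slope` and the toy corpus are hypothetical and only serve as illustration.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    # Sort counts from most to least frequent; rank 1 is the most frequent term.
    counts = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Hypothetical toy corpus: the term of rank r occurs 1200 // r times.
toy_tokens = []
for rank in range(1, 51):
    toy_tokens.extend([f"term{rank}"] * (1200 // rank))

slope = zipf_slope(toy_tokens)
print(f"log-log slope: {slope:.2f}")
```

A slope near -1 indicates conformance with Zipf’s law. Real texts usually deviate somewhat, especially among the most frequent terms.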

Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.

Researcher & Project Info

Researcher:

This project, Analysis on Edgar Allan Poe’s Works, was submitted on 5 January 2021 by Chen Xiaojuan, ID: 004, in partial fulfillment of the requirements for CLA 3206A: Text Mining for Liberal Arts Majors, Shantou University, Fall Semester 2020.

Project Info:

This project considers four works by Edgar Allan Poe. They are:

  • The Fall of the House of Usher (1839), by Edgar Allan Poe
  • The Cask of Amontillado (1846), by Edgar Allan Poe
  • The Masque of the Red Death (1842), by Edgar Allan Poe
  • The Bells, and Other Poems (2016), by Edgar Allan Poe

Some questions this study will explore – directly or indirectly:

  • Key commonalities between the works.
  • Key differences.
  • Can any key differences be explained by gender or nationality?
  • Gender roles as portrayed in the works.
  • What are the major word associations? Word pairs?

Bigram Count

Interpretation

Zipf law results: The four works show strong general conformance with Zipf’s law. We do not expect exact conformance, and we do see a deviation from the prediction among the high-ranking (most commonly used) words. So we need to explore the author’s writing style through further analysis.

The Zipf test results overall are good news. They mean that our basic assumptions hold, and we can apply the usual text mining tools and concepts from corpus linguistics.

Bigram count: The bigram counts for Poe’s fiction and poetry are very similar. The graph shows that in both Poe’s fiction and poetry, more than half of the bigrams occur only once. Also, in both genres, there is a bigram that appears more than 100 times.

TF-IDF

Cask of Amontillado

Fall of the House of Usher

Masque of the Red Death

Interpretation

cask_tfidf: From the graph, we can see that the words “amontillado”, “ugh”, and “fortunato” have the highest tf-idf scores, which tells us that the story is about “amontillado” and “fortunato”; “ugh” may reflect the pervasive negative emotion in the story.

fall_tfidf: “Usher” ranks highest, suggesting that the story is set in Usher’s mansion and concerns the Usher family. The words “house”, “door”, and “dragon” also suggest that the story contains many descriptions of the house.

masq_tfidf: “clock” ranks highest, which suggests that the clock plays an important part in the story, while “prince” and “prospero” may refer to the main character.

About TF-IDF

To cite the main points from Wikipedia, term frequency–inverse document frequency:

  • a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus
  • used as a weighting factor in information retrieval, text mining, and user modeling
  • tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus

We use it as a standard measure to find the information value of a term in a text corpus.
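To make the weighting concrete, here is a minimal sketch of one common tf-idf formulation. It assumes term frequency is the term count divided by the document length, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the term. Whether the charts in this project use exactly this variant is an assumption. The function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_name: {term: score}} for a dict of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


# Hypothetical toy documents, not the project data.
toy_docs = {
    "cask": "the amontillado the cask the fortunato".split(),
    "masque": "the clock the prince the clock".split(),
}
scores = tf_idf(toy_docs)
top_masque = max(scores["masque"], key=scores["masque"].get)
print(top_masque, round(scores["masque"][top_masque], 3))
```

A term such as “the” that appears in every document gets a score of zero, however often it occurs. This is why tf-idf highlights terms that are distinctive to one document.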

(See also Silge & Robinson, “Analyzing word and document frequency: tf-idf”).

Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.

Sentiment

Poe’s Fiction

Poe’s Poetry

Interpretation

Poe’s fiction: In Poe’s fiction, the negative sentiments (anger, disgust, fear, and sadness) are more prevalent than trust, surprise, anticipation, and joy.

Poe’s poetry: Compared to Poe’s fiction, Poe’s poetry presents more positive sentiments, and the sentiments change greatly throughout the poems.

About Sentiment Analysis

To cite a standard definition:

Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. {3}

In simpler terms, we are mapping the emotional vocabulary of a document or corpus. To identify that emotional vocabulary, we use SA lexicons. For CLA 3206A, these are the lexicons nrc and bing. The SA analysis is relative to the lexicon, and some of the sentiment identifications for any given lexicon might be questionable. But so long as that lexicon is consistent and consistently applied, the results are comparable and the vocabulary mappings identify evidence-based patterns of language usage.
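As a rough illustration of lexicon matching, the sketch below counts how many tokens of a text fall into each emotion category. The mini-lexicon `TOY_LEXICON` is invented for this example and does not reproduce the actual nrc or bing assignments. The function name `emotion_counts` is also hypothetical.

```python
from collections import Counter

# Hypothetical mini-lexicon for illustration only; it does not reproduce
# the real nrc term-to-emotion assignments.
TOY_LEXICON = {
    "gloom": ["fear", "sadness"],
    "terror": ["fear"],
    "grave": ["sadness"],
    "friend": ["joy", "trust"],
    "wine": ["joy"],
}


def emotion_counts(tokens, lexicon):
    """Count how many tokens map to each emotion category in the lexicon."""
    counts = Counter()
    for token in tokens:
        # Tokens missing from the lexicon contribute nothing.
        for emotion in lexicon.get(token.lower(), []):
            counts[emotion] += 1
    return counts


toy_text = "My friend drank the wine in gloom and terror near the grave"
totals = emotion_counts(toy_text.split(), TOY_LEXICON)
print(totals.most_common())
```

Because every result depends on which terms the lexicon contains, two texts are only comparable when the same lexicon is applied to both.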

We use SA in CLA 3206A less to precisely identify affective states and more to map out lexicon term usage. These mappings unquestionably provide insight into document style and content.

(See also Silge & Robinson, “Sentiment analysis with tidy data”).

Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.

Plutchik SA

The Sentiment Analysis lexicon nrc categorizes terms – the identified vocabulary – according to eight primary emotions as defined by the psychologist Robert Plutchik {7}. The eight basic emotions according to Plutchik are: Anger, Anticipation, Joy, Trust, Fear, Surprise, Sadness, and Disgust.

Radar Chart
The Plutchik radar chart matches the frequency of text terms to the lexicon, and maps out the text’s emotive vocabulary (as identified by nrc) along the eight Plutchik emotive axes. It does not indicate where the vocabulary appears in the document, only the overall concentration or “shape”.

SA Lines Plot
The Plutchik SA lines graph maps a document section by section in a linear fashion: similar to how one typically reads a text. Like the radar chart, it also matches the frequency of text terms to the lexicon. But it provides a strong indication of where the emotive vocabulary appears in the document and shows the pattern of usage.

In the case of a literary work such as a novel, the SA Lines graph provides valuable insight into the document’s literary style. Likewise, the emotive vocabulary patterns revealed also have some relationship to the story arc (plot) and character development.
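The section-by-section mapping behind a lines plot can be sketched as follows. This is a minimal sketch under stated assumptions: the text is split into roughly equal chunks of tokens, whereas the project may define sections differently (for example by line count). The lexicon and the function name `emotion_by_section` are hypothetical.

```python
from collections import Counter


def emotion_by_section(tokens, lexicon, n_sections):
    """Split tokens into at most n_sections chunks and count emotions in each."""
    size = -(-len(tokens) // n_sections)  # ceiling division
    sections = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    rows = []
    for section in sections:
        counts = Counter()
        for token in section:
            for emotion in lexicon.get(token, []):
                counts[emotion] += 1
        rows.append(counts)
    return rows


# Hypothetical lexicon and text, for illustration only.
toy_lexicon = {"friend": ["trust"], "vault": ["fear"], "bones": ["fear"]}
toy_tokens = "friend friend smile walk vault bones dark wall".split()

rows = emotion_by_section(toy_tokens, toy_lexicon, n_sections=2)
for number, counts in enumerate(rows, start=1):
    print(number, dict(counts))
```

Each row corresponds to one x position on the lines plot, so a sharp change between rows shows up as a sharp rise or drop in that emotion’s line.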

Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.

Plutchik Analysis: SA lines

Plutchik Analysis: CA_SA

Interpretation

Poe_all_lines: From these two graphs, we can see that “joy” ranks higher in poetry than fiction; while in both genres, “sadness” ranks high.

CA_SA_lines: Throughout the story, the sentiment “trust” changes sharply. In the first two sections and the fifth section, “trust” ranks highest, while it drops to the lowest at “x = 3”. So, the author may want to express that people long for “trust”, but it is hard to get and easy to lose.

Bigrams

Top 40 Bigrams by Count

By TF-IDF Score

fiction pairs

Poetry

Interpretation

Top_40: The pairs “of the” and “to the” rank very high in these four works. These pairs consist of very common words, which echoes the earlier finding that the works show strong conformance with Zipf’s law. They may also show continuity in the author’s writing style, for these pairs appear in more than one text.

TF_IDF score: We can see that “amontillado”, “red death” and “bell” appear many times. These words are included in titles of the works, so from here we can learn about the topic.

fiction pairs: The word pairs above include characters in the stories, such as “Usher Roderick”, “Madeline Lady”, and “Launcelot Sir”, so we can identify some of the main characters through these word pairs. We can also see that the author makes use of colors, for there are many color-related words, such as “blue”, “red”, and “ebony”.

poetry pairs: Looking at the words “knells”, “tells”, “rolls”; “tolling”, “lolling”, “flowing”, “robbing”, “living”; and “sonnet”, we may predict that the document is poetry, for these words rhyme. The pairs “thine eyes”, “thou art”, and “ye feel” contain archaic English words, showing the diction of the author.

About Bigram Analysis

To cite a standard definition:

A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2. The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, speech recognition, and so on. {4}

The CLA 3206A bigram analysis methods focused on adjacent words, also known as word pairs (when separated). In some cases, as is standard Tidytext TM practice, stop-words were used to produce cleaned data sets of bigrams and word pairs {5}.

Bigrams and word pairs typically provide valuable insight into the major topics of a document or corpus. Word pairs also typically provide insight about the document style, and about word-associations. These in turn may reveal important linkages between content and ideas.
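A minimal sketch of bigram counting with optional stop-word cleaning follows. It assumes the text is already tokenized into words, and it drops any pair in which either word is a stop word. The short stop-word list and the function name `bigram_counts` are hypothetical. Real stop-word lists are much longer.

```python
from collections import Counter

# Hypothetical, very small stop-word list for illustration.
STOP_WORDS = {"the", "of", "to", "a", "and", "in"}


def bigram_counts(tokens, stop_words=frozenset()):
    """Count adjacent word pairs, dropping any pair that contains a stop word."""
    pairs = zip(tokens, tokens[1:])
    return Counter(
        (first, second)
        for first, second in pairs
        if first not in stop_words and second not in stop_words
    )


toy_tokens = "the red death held sway and the red death came".split()
raw = bigram_counts(toy_tokens)
cleaned = bigram_counts(toy_tokens, STOP_WORDS)
print(cleaned.most_common(3))
```

Without cleaning, pairs built from very common words tend to dominate the counts. With cleaning, content-bearing pairs rise to the top.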

Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.

Word Correlations

Fiction Cors

Poetry Cors

Interpretation

fiction correlations: In his short stories, Poe gives elaborate descriptions of the environment, especially the Gothic buildings and the decorations inside. He also makes good use of colors in these descriptions.

poetry correlations: From the words “eidolon”, “plutonian”, “prophet”, “god’s”, and “devil”, we can see that the poems have much to do with religion and create a mysterious atmosphere, so the author may use many religious allusions. There are also many names of people, especially women, such as “Louise” and “Frances”, so the author may like to convey his ideas from a third-person perspective, especially a woman’s. Hence, we may conclude that in terms of linguistic style, Poe uses allusions and a third-person perspective.

About Word Correlations

Word Pairs vs. Word Corrs:
Word Pairs consider adjacent words: term1, term2. This has a linear order: the way we naturally read a text. For some examples from Nathaniel Hawthorne’s The Scarlet Letter, designating term2 as “child”: “elf child”, “strange child”, “naughty child”, and “poor child”. In contrast, Word Correlations consider document sections {6}. If termX appears in a given section, what other terms are likely to appear in that same section? This can be anywhere in the section: any place before or after termX, not just adjacent.

Word Corr Findings:
So a Word Correlation analysis tells us what words are associated with termX, and the strength of those associations {6}. These word clusters, centered on termX, reveal linkages of language and thought that might otherwise escape our attention since we typically read in a linear fashion. In the case of an individual author, they help show that author’s linguistic habits – the unconscious as well as conscious mind at work. For a corpus of different authors, they can help reveal underlying assumptions – assumptions perhaps even unknown to the authors being studied!
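One way to measure the strength of such an association is the phi coefficient, a correlation for binary data computed from whether each term is present or absent in each section. The sketch below assumes this measure. Whether the charts in this project use exactly it is an assumption, and the function name `phi_coefficient` and the toy sections are hypothetical.

```python
import math


def phi_coefficient(sections, term_x, term_y):
    """Phi coefficient of two terms' presence or absence across sections."""
    n11 = n10 = n01 = n00 = 0
    for section in sections:
        words = set(section)
        has_x, has_y = term_x in words, term_y in words
        if has_x and has_y:
            n11 += 1  # both terms present
        elif has_x:
            n10 += 1  # only term_x present
        elif has_y:
            n01 += 1  # only term_y present
        else:
            n00 += 1  # neither term present
    denominator = math.sqrt(
        (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    )
    if denominator == 0:
        # Convention chosen for this sketch: a term found in every section
        # or in none gives no usable signal.
        return 0.0
    return (n11 * n00 - n10 * n01) / denominator


# Hypothetical toy sections, for illustration only.
toy_sections = [
    ["red", "death", "mask"],
    ["red", "death", "clock"],
    ["clock", "prince"],
    ["prince", "hall"],
]
print(phi_coefficient(toy_sections, "red", "death"))
print(phi_coefficient(toy_sections, "red", "clock"))
```

Values near 1 mean the two terms tend to appear in the same sections, values near -1 mean they tend to avoid each other, and values near 0 mean no clear association. Note that word order and adjacency play no role here, unlike in the word pairs analysis.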

We can also use this analysis to track changing word associations over time, and so see stylistic differences on the micro level or historical changes in sensibility on the macro level.

Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.

Data Tables

Key Term Count Stats

Words following “house”

Words preceding “wild”

Top 20 scores of Bell & Other poems

Interpretation

Key Term Count Stats: In the top 10 for Fall of the House of Usher, the word “usher” ranks very high, which points to the main character. In both Fall of the House of Usher and Cask of Amontillado, personal pronouns such as “my”, “you”, and “her” appear many times; these words are stop words and should be filtered out. In Masque of the Red Death, content words such as “clock”, “prince”, and “assembly” appear more frequently; these words may help readers guess the main idea of the story. In Bell & Other poems, there are archaic English words, such as “thy” and “thee”, which show the writing style; there are also contracted forms, such as “o’er”, which suggest that the work is a poem even to readers who do not know the genre in advance.

Words following “house”: By looking at the words that follow “house”, we can learn some characteristics of Usher’s house and its surroundings, both the outside layout and the inside setting.

Words preceding “wild”: The words that precede “wild” are mainly nouns, showing what things are wild. For Usher’s house, “wild” is an important characteristic, and the words preceding “wild” emphasize this feature.

bell_tfidf: Among the words that matter most, there are many personal pronouns. The words “thy”, “her”, “thee”, and “thou” suggest that in his poems Poe uses the second and third persons more than the first person, while “thy”, “thee”, and “thou” are the terms used to address the main character. So, here again, we gain insight into the author’s linguistic style and the diction of his poems.