Zipf Results

Column

Zipf Law Test

Zipf Results

Zipf’s Law Definition

Zipf’s law is an empirical generalization based on studies of natural language text corpora. It has repeatedly been shown to have merit and good predictive power even when applied to very different collections of documents (text corpora) in all studied natural languages, including extinct languages. It examines the relationship between term frequency (how often a term is used in the text corpus) and term rank.

By definition, the most frequently used term has rank 1 (the lowest rank); the second most frequent term, rank 2 (the second lowest rank); and the least frequent term, the highest rank. Beyond this simple ordering pattern, Zipf’s law states that the “frequency of any word is inversely proportional to its rank in the frequency table” {1}. In other words, as the Wikipedia entry {1} explains, for a given collection of documents (a text corpus), Zipf’s law predicts that “the most frequent word [rank 1] will occur approximately twice as often as the second most frequent word [rank 2], three times as often as the third most frequent word [rank 3], etc.”

Since the relationship is inversely proportional, and hence a power law distribution, we can test for Zipf law conformity by plotting on the logarithmic scale (log-log) the term frequency (Y axis) as a response to term rank (X axis).

Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.
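The log-log test described in the Zipf's law definition can be sketched in a few lines. This is a minimal Python illustration, not the course's R workflow: the function names `rank_frequency` and `loglog_slope` and the toy corpus are assumptions made for the example. Under ideal Zipf conformance, the fitted slope of log(frequency) against log(rank) is close to -1.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent term first (rank 1)."""
    counts = Counter(tokens)
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))

def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical toy corpus built so that frequency = 60 / rank exactly.
toy_tokens = ["the"] * 60 + ["of"] * 30 + ["and"] * 20 + ["to"] * 15 + ["a"] * 12
pairs = rank_frequency(toy_tokens)
slope = loglog_slope(pairs)
print(pairs)
print(round(slope, 3))
```

On a real corpus the slope would be fitted over a chosen rank range, since the very highest and lowest ranks often deviate from the ideal line.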

Term Count Dist

Relative Distinctness

Cloud Results

Comparison Cloud Definition

From the R wordcloud package, the comparison cloud reveals the key differences in term frequency (or, tf-idf value if chosen instead) between two or more documents. This comparison is hence relative not absolute. It does not show the absolute differences between documents in term usage. Instead, it shows the strongest differences in term frequency (or tf-idf). The same term could appear in all of the documents selected for plotting, but it would only show for one document (if at all) if the term appears significantly more in that document than the others. Likewise, a term that appears often in one document, but does not appear in the other documents, would likely also be displayed for that document. The relative display size indicates the importance.

The comparison cloud maps out key changes (differences) between documents in terms of their vocabulary.

Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.
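The relative comparison described in the comparison cloud definition can be illustrated with a small Python sketch. It assumes one common way of measuring "strongest difference": convert counts to per-document usage rates, subtract each term's average rate across documents, and assign the term to the document with the largest positive deviation. The function name `comparison_terms` and the toy documents are hypothetical, and the sketch is not a reimplementation of the R wordcloud package.

```python
from collections import Counter

def comparison_terms(docs):
    """For each term, find the document where its usage rate most exceeds
    the term's average rate across all documents.

    docs: dict mapping document name -> list of tokens (each non-empty).
    Returns: dict mapping term -> (document name, deviation above average).
    """
    rates = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        rates[name] = {term: n / total for term, n in counts.items()}

    vocabulary = set().union(*(r.keys() for r in rates.values()))
    result = {}
    for term in vocabulary:
        per_doc = {name: rates[name].get(term, 0.0) for name in docs}
        mean_rate = sum(per_doc.values()) / len(per_doc)
        best = max(per_doc, key=per_doc.get)
        result[term] = (best, per_doc[best] - mean_rate)
    return result

# Hypothetical toy documents.
docs = {
    "doc_a": "the river the boat the river pilot".split(),
    "doc_b": "the island the bird the island finch".split(),
}
cloud = comparison_terms(docs)
print(cloud["river"], cloud["the"])
```

A term such as "the", used equally often in both documents, gets a deviation of zero and would not be displayed, which matches the point that the cloud shows relative rather than absolute usage.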

Analysis

First, all four books show strong general conformance with Zipf’s law: there is little difference between the best-fit line for ranks 10 to 5000 and the ideal conformance line for all ranks. Darwin’s book is written for a general audience, while Twain’s books are records of and reflections on life. Even though the books have different aims, the Zipf test shows that they share common patterns of language usage. Moreover, although Darwin was English and Twain American, their word usage does not differ greatly. It is also worth noting that Darwin wrote The Voyage of the Beagle when Twain was just 4 years old, yet the distribution of word usage is similar. One possible explanation is that famous, educated writers such as Darwin and Twain acquired similar patterns of word usage through their education.

Among the four books, “The Voyage of the Beagle” has the widest bigram count distribution and “Life on the Mississippi” has the narrowest.

The Comparison Cloud shows the strongest relative differences in term usage, so we can see the dominant words in each book. Because The Voyage of the Beagle is a science book, its dominant words are the names of places and species, with few emotional words. Although the other three books, written by Mark Twain, are also travel books, they contain more emotional words. This may be because Twain’s books are narratives about life.

Column

Research Project & Researcher Info

Research Project

This project considers four famous books from the 19th century, all of which focus on the topic of travel. Three were written by Mark Twain and one by Charles Darwin. By nationality, Charles Darwin is English and Mark Twain is American. Both are male. The books are:

The Innocents Abroad (1869), by Mark Twain
Roughing It (1872), by Mark Twain
Life on the Mississippi (1883), by Mark Twain
The Voyage of the Beagle (1839), by Charles Darwin

Some questions this study will explore – directly or indirectly:

Do these travel books have some key commonalities?
Does Mark Twain’s emotional vocabulary and/or writing style change distinctly over time?
What are the key differences among the four books?
Can any key differences be explained by nationality?

Researcher Info:

This project, Text Mining of Four Travel Books, was submitted on 6 January 2021 by 林嘉怡, ID: 018, in partial fulfillment of the requirements for CLA 3206A: Text Mining for Liberal Arts Majors, Shantou University, Fall Semester 2020.

Data Tables

Row

Key Term Count Stats

Top 20 Words following “she” or “he”

About Data Tables

According to the Key Term Count Stats, the books written by Mark Twain have similar unique-term ratios. This suggests that Twain is a distinctive writer with a consistent, balanced style. The largest book, “The Voyage of the Beagle”, has the lowest unique ratio. This makes sense: Darwin wrote for a general audience and had some responsibility to entertain and persuade, so he chose common words that his readers would understand. Thus, even though all four books are about travel, the authors’ vocabulary differs with their intended readers and purposes. The chart also shows that, although all of the books belong to a similar genre, Twain’s books grew shorter over time. Among them, the largest is “Roughing It” at 165.76 and the smallest is “Life on the Mississippi” at 133.48.

TF-IDF

Column

The Innocents Abroad

Roughing It

Life on the Mississippi

Column

The Voyage of the Beagle

About TF-IDF

Definition of TF-IDF

We have a text corpus composed of documents. (In CLA 3206A, typically a collection of novels. So the corpus is the collection; the documents, the individual novels). TF-IDF, term frequency–inverse document frequency, measures “how important a word is to a document in a collection” {2}.

We know that the frequently used terms in English, words such as “the”, “to”, “and”, and “of”, provide us with little insight about the document’s topics or distinct content. So low information value. But we also sense that if a term occurs often in one document, but not nearly as much in the other documents, that word likely does both relate to the content and help distinguish the document. So higher information value. TF-IDF, a widely used statistical measure for text-mining and information retrieval, provides a formal mathematical expression of that intuition. It balances the document TF score against how often the term occurs in the rest of the corpus, the IDF. If a term is used often in the document and corpus, it has a low to (effectively) zero TF-IDF score. If the term appears often in the document, but rarely in the corpus: high TF-IDF score. So by this method, TF-IDF indicates the information value of a term.

(See also Silge & Robinson, “Analyzing word and document frequency: tf-idf”).

Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.
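The balance between TF and IDF described in the definition can be made concrete with a short Python sketch. It assumes one common formulation (term frequency as a share of the document's terms, and IDF as the natural log of the number of documents divided by the number of documents containing the term); the function name `tf_idf` and the toy corpus are hypothetical and are not taken from the course materials.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict name -> list of tokens. Returns dict name -> {term: score}.

    tf  = term count / total terms in the document
    idf = ln(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    counts = {name: Counter(tokens) for name, tokens in docs.items()}

    # Number of documents in which each term appears at least once.
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())

    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores[name] = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in c.items()
        }
    return scores

# Hypothetical two-document corpus.
corpus = {
    "twain": "the river the pilot".split(),
    "darwin": "the lagoon the finch".split(),
}
scores = tf_idf(corpus)
print(scores["twain"])
```

In this toy corpus "the" appears in every document, so its IDF is ln(1) = 0 and its score is zero, while "river" appears in only one document and receives a positive score.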

Analysis

Some of the TF-IDF results look eccentric, and it is hard to read emotion or writing style from them. Still, TF-IDF at least shows the highest-scoring vocabulary of each book for reference. In “The Innocents Abroad”, many of the words relate to travel, such as the city names “smyrna”, “damascus”, “naples”, and “ephesus”, all of which are cities in Eurasia. These dominate the word usage, and the word “pilgrims” is also prominent. We can therefore assume that The Innocents Abroad is a book about traveling in Eurasia. The TF-IDF results for “Roughing It” are less useful: most of the words are function words or names, which makes it hard to infer the theme of the book. The third book, Life on the Mississippi, resembles “Roughing It” in having many function words and personal names; the only difference is that the key word “Mississippi” appears. The last book, The Voyage of the Beagle, has terms such as “cordillera”, “degs”, “tierra”, “lagoon”, and “pampas”, which describe terrain or geography. We can therefore assume that it is a scientific book of geographical description written while Darwin was traveling.

Sentiment

Column

The Innocents Abroad

Roughing It

Life on the Mississippi

The Voyage of the Beagle

Column

Plutchik Analysis: Corpus

About Sentiment Analysis

Definition of Plutchik SA

The Sentiment Analysis lexicon nrc categorizes terms – the identified vocabulary – according to eight primary emotions as defined by the psychologist Robert Plutchik {7}. The eight basic emotions according to Plutchik are: Anger, Anticipation, Joy, Trust, Fear, Surprise, Sadness, and Disgust.

Radar Chart
The Plutchik radar chart matches the frequency of text terms to the lexicon, and maps out the text’s emotive vocabulary (as identified by nrc) along the eight Plutchik emotive axes. It does not indicate where the vocabulary appears in the document, only the overall concentration or “shape”.

SA Lines Plot
The Plutchik SA lines graph maps a document section by section in a linear fashion: similar to how one typically reads a text. Like the radar chart, it also matches the frequency of text terms to the lexicon. But it provides a strong indication of where the emotive vocabulary appears in the document and shows the pattern of usage.

In the case of a literary work such as a novel, the SA Lines graph provides valuable insight into the document’s literary style. Likewise, the emotive vocabulary patterns revealed also have some relationship to the story arc (plot) and character development.

Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.
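The difference between the radar chart (overall totals) and the lines plot (section-by-section counts) can be sketched in Python. The mini-lexicon below is invented purely for illustration and is NOT the real nrc lexicon; the function names `emotion_counts` and `emotion_by_section` are likewise hypothetical.

```python
from collections import Counter

# Hypothetical mini-lexicon for illustration only; NOT the real nrc lexicon.
TOY_LEXICON = {
    "friend": ["trust", "joy"],
    "storm": ["fear"],
    "hope": ["anticipation", "joy", "trust"],
    "wreck": ["fear", "sadness"],
}

def emotion_counts(tokens, lexicon=TOY_LEXICON):
    """Total count per emotion: the values a radar chart would plot."""
    totals = Counter()
    for token in tokens:
        for emotion in lexicon.get(token, []):
            totals[emotion] += 1
    return totals

def emotion_by_section(tokens, section_size, lexicon=TOY_LEXICON):
    """Per-emotion counts for consecutive sections: the series a lines plot would draw."""
    return [
        emotion_counts(tokens[i:i + section_size], lexicon)
        for i in range(0, len(tokens), section_size)
    ]

tokens = "a friend gave hope before the storm and the wreck".split()
overall = emotion_counts(tokens)
sections = emotion_by_section(tokens, section_size=5)
print(overall)
print(sections)
```

The overall totals hide where the vocabulary occurs: here Trust and Fear tie overall, but the section view shows Trust concentrated in the first half and Fear in the second.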

Analysis

The eight basic emotions are Anger, Anticipation, Joy, Trust, Fear, Surprise, Sadness, and Disgust. In all four books, Trust is the dominant emotion and Disgust is the lowest. The charts show that these four travel books are written in a positive style, with positive emotions dominating throughout. Mark Twain uses strongly emotional description in “The Innocents Abroad” and “Life on the Mississippi”, while “Roughing It” is at a lower level. This may be because of events in Twain’s life in middle age, which could explain the emotional fluctuation across his books. Another possible reason is that “The Innocents Abroad” describes the emotions of other travelers, so its emotional expression is abundant. Specifically, in Twain’s three travel books the dominant emotions are Trust and Anticipation, and sometimes Joy. In Darwin’s book, the dominant emotions are Trust, Anticipation, and Fear.

From the radar chart, we can clearly see that even though the four books were not all written by one author, their emotional structures are similar. Most interestingly, Roughing It and The Voyage of the Beagle overlap heavily in some emotions. One possible reason is that both are travel books. This relates to the earlier finding that “Roughing It” is at a lower emotional level than Twain’s other books. One possible explanation for its different emotional tendency is therefore that Roughing It, like The Voyage of the Beagle, is also a science book.

Also, Mark Twain’s emotion radar chart shows that he has a well-defined emotional vocabulary, which tells us something about both his literary style and his philosophical world-view. As a matter of fact, Mark Twain’s writing style is definitely unique.

Bigrams

Column

By TF-IDF Score

Life on the Mississippi

Before “life”

After “short”

After “natural”

Before “river”

About Bigram & Word Pairs Analysis

Definition of Bigram & Word Pairs

To cite a standard definition:

A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2. The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, speech recognition, and so on. {4}

The CLA 3206A bigram analysis methods focused on adjacent words, also known as word pairs (when separated). In some cases, as is standard Tidytext TM practice, stop-words were used to produce cleaned data sets of bigrams and word pairs {5}.

Bigrams and word pairs typically provide valuable insight into the major topics of a document or corpus. Word pairs also typically provide insight about the document style, and about word-associations. These in turn may reveal important linkages between content and ideas.

Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.
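The bigram and word-pair steps described above can be sketched in Python. This is an illustrative outline rather than the Tidytext workflow used in the course: the stop-word list is a tiny hypothetical stand-in, and the helper names `bigrams`, `cleaned_bigram_counts`, and `words_before` are assumptions made for the example.

```python
from collections import Counter

# Hypothetical stop-word list for illustration; a real analysis would use a fuller list.
STOP_WORDS = {"the", "a", "of", "on", "and"}

def bigrams(tokens):
    """Return adjacent word pairs (term1, term2) in reading order."""
    return list(zip(tokens, tokens[1:]))

def cleaned_bigram_counts(tokens, stop_words=STOP_WORDS):
    """Count bigrams, dropping any pair that contains a stop word."""
    return Counter(
        pair for pair in bigrams(tokens)
        if pair[0] not in stop_words and pair[1] not in stop_words
    )

def words_before(counts, term2):
    """Counts of the first terms in pairs whose second term is term2."""
    return {pair[0]: n for pair, n in counts.items() if pair[1] == term2}

tokens = "a humble life and a blameless life on the river".split()
counts = cleaned_bigram_counts(tokens)
print(words_before(counts, "life"))
```

Fixing the second term and listing the first terms gives the kind of view shown in a panel such as Before “life”; fixing the first term instead gives the After “short” style of view.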

Definition of Word Pairs Network

A network graph can reveal the structure of relationships among the word pairs, and not just a simple count of frequency or even tf-idf value. It potentially reveals connections and linkages that we might otherwise miss if the word pairs were plotted in a bar graph, or listed in a table, or if the text were read in a natural linear fashion.

Although relational, the Word Pairs network graph is also directional: it proceeds from term 1 to term 2, as indicated by the line and arrow. Indeed, as Silge & Robinson point out: the Word Pairs network graph is also a “visualization of a Markov chain, a common model in text processing. In a Markov chain, each choice of word depends only on the previous word” {8}.

In brief, the Word Pairs network graph provides us with an overview of key word-pairs that also indicates their relationships within the document (or documents).

(See also Silge & Robinson, “Relationships between words: n-grams and correlations”).

Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.
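The directional, Markov-chain reading of the Word Pairs network can be illustrated with a small Python sketch. It stores the graph as nested dictionaries (term 1 pointing to term 2 with a count) and derives next-word probabilities that depend only on the previous word. The function names `word_pair_graph` and `transition_probabilities`, the `min_count` threshold, and the toy sentence are assumptions for the example.

```python
from collections import Counter, defaultdict

def word_pair_graph(tokens, min_count=1):
    """Directed graph: graph[term1][term2] = number of times term2 follows term1."""
    edge_counts = Counter(zip(tokens, tokens[1:]))
    graph = defaultdict(dict)
    for (term1, term2), n in edge_counts.items():
        if n >= min_count:
            graph[term1][term2] = n
    return dict(graph)

def transition_probabilities(graph, term1):
    """Markov-chain view: probability of each next word given term1 only."""
    followers = graph.get(term1, {})
    total = sum(followers.values())
    return {term2: n / total for term2, n in followers.items()}

tokens = "natural dignity and natural history and natural dignity".split()
graph = word_pair_graph(tokens)
probs = transition_probabilities(graph, "natural")
print(graph["natural"])
print(probs)
```

Raising `min_count` keeps only the more frequent edges, which is one simple way to make a plotted network readable.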

Analysis

Some of the TF-IDF results look eccentric, and it is hard to read emotion or writing style from them. Still, TF-IDF at least shows the highest-scoring vocabulary for reference. All four books contain some city names, which suggests that their themes relate to travel. Moreover, The Voyage of the Beagle contains some terrain and plant names, which suggests that it may be a scientific book. Therefore, even though all four books share the topic of travel, they focus on different aspects: some are about personal experience, while others are scientific investigation.

Also, the Key Word-Pair Networks include some coastline names, which suggests that the book describes things related to the sea. People’s names and descriptions of objects are abundant in the networks. I filtered some adjectives and their related nouns from the Key Word-Pair Networks to allow a deeper emotional analysis of the book.

In Life on the Mississippi, positive emotion occurs more frequently than negative emotion. Examples of the word pairs include “blameless life”, “blighted life”, “humble life”, and “natural dignity”.

Word Correlations

Column

Mark Twain Travel books: 4 Key Terms

Network Graph Mark Twain

Network Graph Charles Darwin

In Mark Twain’s books, “city”

In Charles Darwin’s book, “animal”

About Word Correlations

Definition

Word Pairs vs. Word Corrs:
Word Pairs consider adjacent words: term1, term2. This has a linear order: the way we naturally read a text. For some examples from Nathaniel Hawthorne’s The Scarlet Letter, designating term2 as “child”: “elf child”, “strange child”, “naughty child”, and “poor child”. In contrast, Word Correlations consider document sections {6}. If termX appears in a given section, what other terms are likely to appear in that same section? This can be anywhere in the section: any place before or after termX, not just adjacent.

Word Corr Findings:
So a Word Correlation analysis tells us what words are associated with termX, and the strength of those associations {6}. These word clusters, centered on termX, reveal linkages of language and thought that might otherwise escape our attention since we typically read in a linear fashion. In the case of an individual author, they help show that author’s linguistic habits – the unconscious as well as conscious mind at work. For a corpus of different authors, they can help reveal underlying assumptions – assumptions perhaps even unknown to the authors being studied!

Word Correlation analyses discover which terms co-occur with each other, as per a specified section of a document, and the importance (statistical significance) of that co-occurrence as measured by the correlation value {9}.

A network graph of word correlations displays linkages and associations that are not as easily captured in a bar plot or data table, and are typically missed when reading the text in a natural linear fashion. Unlike a Word Pairs network graph, a Word Correlations network graph is not directional: rather, it is cluster-centered depending on the strength of correlation.

In brief, the Word Correlations network graph reveals valuable term clusters (empirically determined word associations) which provide information about the deeper layers of language-usage and thought in the document (or documents).

(See also Silge & Robinson, “Relationships between words: n-grams and correlations”).

Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.
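The section-based co-occurrence idea can be made concrete with the phi coefficient, a standard correlation measure for two yes/no variables (here: does the term appear in a section or not). The Python sketch below assumes phi as the correlation value; the function name `phi_coefficient` and the toy sections are hypothetical, and the code is not the course's R implementation.

```python
import math

def phi_coefficient(sections, term_x, term_y):
    """Phi coefficient for co-occurrence of two terms across document sections.

    sections: list of token lists, one per section.
    Returns a value from -1 to 1, or 0.0 when phi is undefined
    (a term appears in every section or in none).
    """
    n11 = n10 = n01 = n00 = 0
    for tokens in sections:
        present = set(tokens)
        has_x = term_x in present
        has_y = term_y in present
        if has_x and has_y:
            n11 += 1
        elif has_x:
            n10 += 1
        elif has_y:
            n01 += 1
        else:
            n00 += 1
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0
    return (n11 * n00 - n10 * n01) / denom

# Hypothetical toy sections.
sections = [
    "gentle steamer drifted".split(),
    "gentle breeze steamer".split(),
    "dark night".split(),
    "dark bloody storm".split(),
]
print(phi_coefficient(sections, "gentle", "steamer"))
print(phi_coefficient(sections, "gentle", "dark"))
print(round(phi_coefficient(sections, "dark", "bloody"), 3))
```

Because only presence within a section matters, word order and adjacency play no role, which is the key contrast with word pairs.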

Analysis

Mark Twain’s writing style is characterized by humor and sarcasm, so the emotional vocabulary in his books is very extensive. For example, “anxiety” is negative, but it correlates strongly with the positive word “gentle”. Combining “dark” with “bloody” heightens the sense of fear. Likewise, “steamer” is an object with no emotional tendency of its own, but the writer puts “gentle” and “steamer” together, which shows his feeling toward the steamer and supports the hypothesis that he enjoys life at sea. Meanwhile, the graphs show that the four words “break”, “dark”, “gentle”, and “visit” are not used frequently. The word correlations average about 0.10.

The top 120 word correlations in Mark Twain’s travel books describe daily life, such as duck blonde lawyer, kid gloves, and so on. In contrast, the word correlations in Charles Darwin’s travel book tend to describe science and biology, and the words are more objective than Twain’s.

From the word correlations, we can see that Mark Twain’s books contain a large number of descriptions associated with “city”; similarly, in Charles Darwin’s book, descriptions associated with “animal” occur with high frequency. I chose these two words because the TF-IDF graphs show many city names in Twain’s books and scientific descriptions in Darwin’s book. I therefore wanted to check whether the two authors pay much attention to these words and which words are related to them.

Gender

Column

The Innocents Abroad, after “she / he”

Roughing It, after “she / he”

Life on the Mississippi, after “she / he”

Column

Mark Twain Gender Word Distribution

About Gender

Definition

The Gender Analyses in CLA 3206A follow the lead of Julia Silge’s study, for which Silge herself credits the break-through academic study by Professors Matthew Jockers and Gabi Kirilloff, “Understanding Gender and Character Agency in the 19th Century Novel”, Journal of Cultural Analytics (2017).

Approaching the topic of “character identity [as realized through] character action”, Jockers and Kirilloff (2017) examined “character agency in the context of character gender” by examining “trends in behavior associated with male and female characters”. They did so by studying patterns of gender pronoun usage: she what? he what? Similar to Silge, CLA 3206A follows their general method, though with differences on the technical (coding) level.

She | He
By seeing what words – and particularly verbs – follow “she” or “he”, we can gain insight into how these texts portray gender and hence gender roles: “character agency in the context of character gender”, as Jockers and Kirilloff (2017) express it.

Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.
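The "she what? he what?" pattern can be sketched as a simple count of the word that immediately follows each pronoun. This Python example is only an illustration of the idea; the function name `words_after_pronoun` and the toy sentence are assumptions, and the course's own coding differs.

```python
from collections import Counter

def words_after_pronoun(tokens, pronouns=("she", "he")):
    """Count the word that immediately follows each pronoun."""
    following = {p: Counter() for p in pronouns}
    for first, second in zip(tokens, tokens[1:]):
        if first in following:
            following[first][second] += 1
    return following

tokens = "he told me she looked away and he told them".split()
after = words_after_pronoun(tokens)
print(after["he"].most_common(3))
print(after["she"].most_common(3))
```

Comparing the two counters side by side, ideally after normalizing for how often each pronoun occurs, gives the kind of contrast the gender panels display.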

Analysis

The graphs of the words that follow “he” and “she” in Mark Twain’s three travel books show a clear preference for descriptions of males, indicating male dominance. Even though the books were written in different years, the preference in gender descriptions is similar in extent. The graphs also show that Roughing It has less description of females.

Also, Mark Twain’s writing style did not change distinctly. Common words following “he”, such as “told” and “looked”, remain key words in each book. Even though each book has different key words following “she”, the descriptions are similar. Therefore, Twain’s treatment of gender did not change distinctly over time.

From the Mark Twain Gender Word Distribution, we can see that although the three books are similar in general, each also has some unique gender vocabulary collocations: some words are used in only one book. For Roughing It, there is no key term following “she”. We can therefore assume that Roughing It has less description of females.