Zipf Results

Zipf Law Test

Research Project

Author Information

This project, Text Mining of Four 19th Century Novels, was submitted on 4 January 2023 by [陈惠川,Truna], [ID: 001], in partial fulfillment of the requirements for CLA 3206A: Text Mining for Liberal Arts Majors, Shantou University, Fall Semester 2022.

Research Project

This project considers four famous novels from the 19th century. Three are written by women and one by a man. By nationality, the authors are English and Irish. The novels are:

* Pride and Prejudice (1813), by Jane Austen
* Middlemarch (1871), by Mary Anne Evans
* Cranford (1853), by Elizabeth Gaskell
* Dracula (1897), by Bram Stoker

Some questions this study will explore, directly or indirectly:

* The key commonalities between the novels.
* The main sentiments of the novels.
* Gender roles as portrayed in the novels.
* The major word associations.

Text Corpus

The text corpus consists of the four novels mentioned above, as downloaded from Project Gutenberg.

### Analysis

The four novels show strong general conformance with Zipf’s law. We do not expect exact conformance, and we do see a deviation from the predicted frequencies among the highest-ranked (most commonly used) words. Professional writers likely choose their words more carefully and selectively than typical writers, so we might expect less filler and repetition. We can also see that the relationship between rank and frequency has a negative slope.

As shown in the graph, a linear model was fitted to the words ranked 25 to 1000 in the novels. Based on the adjusted R-squared values, this model explains more than 99% of the variation in the data, which indicates strong general consistency with Zipf’s law.
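The rank-frequency fit described above can be sketched in a few lines. This is only an illustration in plain Python, not the project's own code: the function names `rank_frequency` and `fit_log_log` are hypothetical, the fit is an ordinary least-squares line on log10(rank) versus log10(frequency), and it reports plain rather than adjusted R-squared.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, relative frequency) pairs, most frequent term first."""
    counts = Counter(tokens)
    total = sum(counts.values())
    ordered = sorted(counts.values(), reverse=True)
    return [(rank, n / total) for rank, n in enumerate(ordered, start=1)]


def fit_log_log(pairs, min_rank=25, max_rank=1000):
    """Least-squares fit of log10(frequency) on log10(rank) within a rank window."""
    pts = [(math.log10(r), math.log10(f)) for r, f in pairs if min_rank <= r <= max_rank]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)  # plain R-squared, not adjusted
    return slope, intercept, r_squared


# Synthetic "ideal Zipf" data (frequency proportional to 1 / rank), for illustration only.
ideal = [(rank, 0.1 / rank) for rank in range(1, 2001)]
slope, intercept, r_squared = fit_log_log(ideal)
print(round(slope, 3), round(r_squared, 3))
```

On real novels the slope would be expected to sit near, but not exactly at, -1, and the words ranked above the fitted window are where the deviation mentioned above would show up.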

The results also show that writers who differ in gender, nationality, and style still manifest certain commonalities and shared patterns of behavior in their language use.

TF-IDF

The Four Books

<div class="knitr-options" data-fig-width="576" data-fig-height="460"></div>
<img src="Sample_Dash_files/figure-html/unnamed-chunk-2-1.png" width="576" data-figure-id=fig2 />

Terms Used in the Four Books

<div class="knitr-options" data-fig-width="576" data-fig-height="460"></div>
<img src="Sample_Dash_files/figure-html/unnamed-chunk-3-1.png" width="576" data-figure-id=fig3 />

### Analysis
Top 20 terms results

As shown in the chart, the highest-ranked terms in these four novels are character names: “Matty”, the main character in Cranford; “Helsing”, a main character in Dracula; “Lydgate”, one of the main characters in Middlemarch; and “Darcy”, the male protagonist of Pride and Prejudice. This suggests that the stories of the novels revolve around these characters.

Moreover, since the names of the main characters are distinctive, this may suggest that the four writers were careful and selective in choosing their characters’ names.

Terms about four books

From the bar chart, it can be seen that across these four novels the word “miss” is the most frequent, while the word “manner” is less frequent and appears only in Pride and Prejudice. This suggests that the four authors have similar but individually distinctive writing styles.
  


### About TF-IDF
To cite the main points from Wikipedia, [term frequency–inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf):
* a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus
* used as a weighting factor in information retrieval, text mining, and user modeling
* tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus 

We use it as a standard measure to find the information value of a term in a text corpus.
We have a text corpus composed of documents. (In CLA 3206A, this is typically a collection of novels, so the corpus is the collection and the documents are the individual novels.) TF-IDF, term frequency–inverse document frequency, measures “how important a word is to a document in a collection” {2}.

We know that the frequently used terms in English, words such as “the”, “to”, “and”, and “of”, provide us with little insight about the document’s topics or distinct content. So they have low information value. But we also sense that if a term occurs often in one document, but not nearly as much in the other documents, that word likely does both relate to the content and help distinguish the document. So it has higher information value. TF-IDF, a widely used statistical measure for text mining and information retrieval, provides a formal mathematical expression of that intuition. It balances the document TF score against how often the term occurs in the rest of the corpus, the IDF. If a term is used often in both the document and the corpus, it has a low to (effectively) zero TF-IDF score. If the term appears often in the document, but rarely in the corpus, it has a high TF-IDF score. So by this method, TF-IDF indicates the information value of a term.
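To make the intuition above concrete, here is a minimal sketch in plain Python. It assumes one common variant of the formula, tf = (term count / document length) and idf = ln(number of documents / number of documents containing the term); other weightings exist, and this is not necessarily the exact computation behind the charts. The function name `tf_idf` and the two tiny "documents" are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps a title to its token list; returns {title: {term: tf-idf score}}."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur at least once?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for title, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[title] = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        }
    return scores


# Made-up toy corpus, not text from the novels.
toy = {
    "A": "the miss matty the tea".split(),
    "B": "the count helsing the night".split(),
}
scores = tf_idf(toy)
print(scores["A"]["the"], scores["A"]["matty"])
```

In this toy corpus “the” occurs in both documents, so its idf is ln(2/2) = 0 and its score is zero, while “matty” occurs in only one document and gets a positive score.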

(See also Silge & Robinson, “Analyzing word and document frequency: tf-idf”).

Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.


Sentiment

Radar Chart

<div class="knitr-options" data-fig-width="576" data-fig-height="460"></div>

```{=html}
<canvas id="htmlwidget-34645856cb91418ae58b" class="chartJSRadar html-widget" width="576" height="460.8"></canvas>
<script type="application/json" data-for="htmlwidget-34645856cb91418ae58b">{"x":{"data":{"labels":["anger","anticipation","disgust","fear","joy","sadness","surprise","trust"],"datasets":[{"label":"Cranford","data":[733,1824,525,1052,1485,1103,806,2080],"backgroundColor":"rgba(255,0,0,0.2)","borderColor":"rgba(255,0,0,0.8)","pointBackgroundColor":"rgba(255,0,0,0.8)","pointBorderColor":"#fff","pointHoverBackgroundColor":"#fff","pointHoverBorderColor":"rgba(255,0,0,0.8)"},{"label":"Dracula","data":[2285,4185,1701,3644,3239,2964,1976,4426],"backgroundColor":"rgba(0,255,0,0.2)","borderColor":"rgba(0,255,0,0.8)","pointBackgroundColor":"rgba(0,255,0,0.8)","pointBorderColor":"#fff","pointHoverBackgroundColor":"#fff","pointHoverBorderColor":"rgba(0,255,0,0.8)"},{"label":"Middlemarch","data":[4863,8803,3352,5527,7616,5483,4412,10705],"backgroundColor":"rgba(0,0,255,0.2)","borderColor":"rgba(0,0,255,0.8)","pointBackgroundColor":"rgba(0,0,255,0.8)","pointBorderColor":"#fff","pointHoverBackgroundColor":"#fff","pointHoverBorderColor":"rgba(0,0,255,0.8)"},{"label":"Pride and Prejudice","data":[1295,3589,973,1768,3340,1755,1673,4186],"backgroundColor":"rgba(255,255,0,0.2)","borderColor":"rgba(255,255,0,0.8)","pointBackgroundColor":"rgba(255,255,0,0.8)","pointBorderColor":"#fff","pointHoverBackgroundColor":"#fff","pointHoverBorderColor":"rgba(255,255,0,0.8)"}]},"options":{"responsive":true,"title":{"display":false,"text":null},"scale":{"ticks":{"min":0},"pointLabels":{"fontSize":18}},"tooltips":{"enabled":true,"mode":"label"},"legend":{"display":true}}},"evals":[],"jsHooks":[]}</script>
```

SA Lines Plot

Analysis

As the radar chart shows, Middlemarch has the highest word counts for every emotion, whether negative or positive; since these are raw counts, this likely reflects the length of the novel. Dracula follows a pattern similar to Middlemarch, but on a smaller scale. Pride and Prejudice is similar to Dracula for anticipation, joy, and trust, but its negative emotions (anger, disgust, fear, and sadness) appear less frequently. Cranford has the fewest emotion words overall. Trust is the emotion that appears the most in Cranford, as it is in Middlemarch and the other two novels.

About Sentiment Analysis

Sentiment Analysis attempts to determine the emotional content of text. To cite a more formal definition from [Wikipedia](https://en.wikipedia.org/wiki/Sentiment_analysis):

Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.

These graphs primarily use the sentiment lexicon nrc, which has some interesting features. It categorizes terms according to eight primary emotions as defined by Robert Plutchik in his Wheel of Emotions system. Those eight basic emotions are: Anger, Anticipation, Joy, Trust, Fear, Surprise, Sadness, and Disgust.
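Lexicon-based emotion counting of this kind can be sketched as follows. The mini-lexicon here is hypothetical and is not taken from the nrc lexicon; the function names `emotion_counts` and `emotion_shares` are also just illustrative. The sketch counts, for each token, every emotion the lexicon assigns to it, and then optionally converts the raw counts to shares.

```python
from collections import Counter

EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust"]

# Hypothetical mini-lexicon for illustration only; NOT real nrc entries.
TOY_LEXICON = {
    "friend": ["joy", "trust"],
    "dread": ["fear", "sadness"],
    "hope": ["anticipation", "joy"],
}


def emotion_counts(tokens, lexicon):
    """Count emotion-tagged tokens; a token adds one to each emotion it carries."""
    counts = Counter({emotion: 0 for emotion in EMOTIONS})
    for token in tokens:
        for emotion in lexicon.get(token, []):
            counts[emotion] += 1
    return counts


def emotion_shares(counts):
    """Convert raw counts to proportions so texts of different length are comparable."""
    total = sum(counts.values())
    return {e: (counts[e] / total if total else 0.0) for e in EMOTIONS}


tokens = "my friend felt dread and hope my friend".split()
counts = emotion_counts(tokens, TOY_LEXICON)
print(counts["joy"], emotion_shares(counts)["joy"])
```

Raw counts, as plotted in a radar chart like the one above, tend to favor longer texts; shares are one possible way to compare emotional profiles across novels of different lengths.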

Bigrams

By TF-IDF Scores

# A tibble: 677,436 × 2
   title               bigram             
   <chr>               <chr>              
 1 Pride and Prejudice there is           
 2 Pride and Prejudice is an              
 3 Pride and Prejudice an illustrated     
 4 Pride and Prejudice illustrated edition
 5 Pride and Prejudice edition of         
 6 Pride and Prejudice of this            
 7 Pride and Prejudice this title         
 8 Pride and Prejudice title which        
 9 Pride and Prejudice which may          
10 Pride and Prejudice may viewed         
# … with 677,426 more rows

Before “love”

# A tibble: 167 × 4
   title               word1 word2     n
   <chr>               <chr> <chr> <int>
 1 Middlemarch         in    love     42
 2 Pride and Prejudice in    love     34
 3 Dracula             i     love     14
 4 Middlemarch         of    love     11
 5 Middlemarch         my    love      9
 6 Pride and Prejudice my    love      8
 7 Middlemarch         his   love      7
 8 Pride and Prejudice of    love      7
 9 Dracula             you   love      6
10 Middlemarch         to    love      6
# … with 157 more rows

After “love”

# A tibble: 173 × 4
   title               word1 word2     n
   <chr>               <chr> <chr> <int>
 1 Middlemarch         love  with     28
 2 Pride and Prejudice love  with     17
 3 Middlemarch         love  for      13
 4 Middlemarch         love  and      12
 5 Dracula             love  and      10
 6 Dracula             love  him       9
 7 Dracula             love  you       9
 8 Middlemarch         love  of        8
 9 Pride and Prejudice love  and       8
10 Middlemarch         love  me        7
# … with 163 more rows

Before “wedding”

# A tibble: 21 × 4
   title               word1    word2       n
   <chr>               <chr>    <chr>   <int>
 1 Pride and Prejudice the      wedding     6
 2 Middlemarch         a        wedding     5
 3 Middlemarch         her      wedding     4
 4 Middlemarch         the      wedding     4
 5 Cranford            her      wedding     2
 6 Dracula             my       wedding     2
 7 Dracula             our      wedding     2
 8 Pride and Prejudice her      wedding     2
 9 Pride and Prejudice sister's wedding     2
10 Cranford            mother's wedding     1
# … with 11 more rows

After “wedding”

# A tibble: 25 × 4
   title               word1   word2       n
   <chr>               <chr>   <chr>   <int>
 1 Middlemarch         wedding journey     9
 2 Pride and Prejudice wedding clothes     4
 3 Middlemarch         wedding clothes     3
 4 Middlemarch         wedding ring        2
 5 Pride and Prejudice wedding day         2
 6 Pride and Prejudice wedding took        2
 7 Cranford            wedding day         1
 8 Cranford            wedding ring        1
 9 Cranford            wedding with        1
10 Dracula             wedding be          1
# … with 15 more rows

Analysis

According to the bigram count distribution, the distribution of bigrams in these four books is relatively even. Their range distribution is also consistent with the previous radar chart to a very high degree. This suggests that all four authors write carefully and that their choice of words is in harmony with the overall text.

About word pairs Analysis

As you can see from the graph, the word “love” is most often preceded or followed by a preposition, and next most often by a pronoun. This indicates that the writers have similar styles when expressing the emotion of love. The word “wedding” is mostly preceded by a determiner, such as “the”. After the word “wedding”, the most common words are “journey” and “clothes”. This shows that the authors tend to describe the arrangements surrounding the wedding.

### About Bigram Analysis

To cite Wikipedia:

A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2. The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, speech recognition, and so on.

Bigram analyses focus on word pairs, including pair-associations and order of precedence. This can provide us with valuable information about the patterns and even thematic concerns of a text corpus.

### Word pairs about “love”
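As a rough illustration of how "before" and "after" word-pair counts for a target such as “love” can be produced, here is a small plain-Python sketch. The helper names `bigrams` and `neighbours` are hypothetical, the sample sentence is made up, and real use would need proper tokenization of the novels.

```python
from collections import Counter


def bigrams(tokens):
    """Return adjacent word pairs in reading order."""
    return list(zip(tokens, tokens[1:]))


def neighbours(tokens, target):
    """Count the words that come directly before and directly after the target."""
    pairs = bigrams(tokens)
    before = Counter(w1 for w1, w2 in pairs if w2 == target)
    after = Counter(w2 for w1, w2 in pairs if w1 == target)
    return before, after


# Made-up sentence, not a quotation from the novels.
tokens = "she was in love with him and in love with life".split()
before, after = neighbours(tokens, "love")
print(before.most_common(3))
print(after.most_common(3))
```

Grouping such counts by novel title would give tables shaped like the “Before” and “After” listings above.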

Word Correlations

The Four Books CORS

# A tibble: 7,281,902 × 3
   item1   item2   correlation
   <chr>   <chr>         <dbl>
 1 helsing van           0.976
 2 van     helsing       0.976
 3 adam    fitz          0.946
 4 fitz    adam          0.946
 5 brunoni signor        0.894
 6 signor  brunoni       0.894
 7 bourgh  de            0.796
 8 de      bourgh        0.796
 9 court   stone         0.763
10 stone   court         0.763
# … with 7,281,892 more rows

Pride and Prejudice Cors

# A tibble: 726,756 × 3
   item1     item2     correlation
   <chr>     <chr>           <dbl>
 1 bourgh    de              0.938
 2 de        bourgh          0.938
 3 william   sir             0.736
 4 sir       william         0.736
 5 pounds    thousand        0.700
 6 thousand  pounds          0.700
 7 catherine lady            0.678
 8 lady      catherine       0.678
 9 forster   colonel         0.591
10 colonel   forster         0.591
# … with 726,746 more rows

### Top 150 Word Correlations in Pride and Prejudice

Analysis

As shown in the figure, the most strongly correlated pairs in these four books are the names or titles of people. This indicates that the authors centered their creations around the characters. In Pride and Prejudice, in addition to the names of people, the pair “thousand” and “pounds” is among the most strongly correlated (0.700). This fits well with the theme of the book, a story about two young people falling in love across class lines, where differences in wealth cannot be ignored.

About Word Correlations

Word Pairs vs. Word Corrs: Word Pairs consider adjacent words: term1, term2. This has a linear order: the way we naturally read a text. For some examples from Nathaniel Hawthorne’s The Scarlet Letter, designating term2 as “child”: “elf child”, “strange child”, “naughty child”, and “poor child”. In contrast, Word Correlations consider document sections {6}. If termX appears in a given section, what other terms are likely to appear in that same section? This can be anywhere in the section: any place before or after termX, not just adjacent.

Word Corr Findings: So a Word Correlation analysis tells us what words are associated with termX, and the strength of those associations {6}. These word clusters, centered on termX, reveal linkages of language and thought that might otherwise escape our attention since we typically read in a linear fashion. In the case of an individual author, they help show that author’s linguistic habits – the unconscious as well as conscious mind at work. For a corpus of different authors, they can help reveal underlying assumptions – assumptions perhaps even unknown to the authors being studied!

Word Correlation analysis provides insight into the likelihood of one term occurring shortly before or after another term. This shows us word associations, which provide insight into the themes and concerns of the text corpus author or authors. We can also use this to track changing word associations over time, and so see stylistic differences on the micro level or historical changes in sensibility on the macro level.
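One way to turn "do these two terms tend to appear in the same section?" into a number is the phi coefficient, a correlation for binary presence/absence data. The sketch below assumes that measure; the source does not state which correlation was actually computed, and the function name `phi` and the toy sections are illustrative only.

```python
import math


def phi(sections, word_x, word_y):
    """Phi coefficient of two words over sections (each section is a set of words)."""
    n11 = n10 = n01 = n00 = 0
    for section in sections:
        has_x = word_x in section
        has_y = word_y in section
        if has_x and has_y:
            n11 += 1  # both words present
        elif has_x:
            n10 += 1  # only word_x present
        elif has_y:
            n01 += 1  # only word_y present
        else:
            n00 += 1  # neither present
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # a word that is always or never present has no defined correlation
    return (n11 * n00 - n10 * n01) / denom


# Made-up sections, not taken from the novels.
sections = [
    {"van", "helsing", "said"},
    {"van", "helsing"},
    {"count", "said"},
    {"night"},
]
print(phi(sections, "van", "helsing"))
print(phi(sections, "van", "said"))
```

Two words that always appear together, and never apart, score 1.0, which is the pattern a two-part name such as “van” and “helsing” approaches in the table above. Unlike a bigram count, the measure ignores word order and adjacency within a section.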

WP Network Graph

A network graph can reveal the structure of relationships among the word pairs, and not just a simple count of frequency or even tf-idf value. It potentially reveals connections and linkages that we might otherwise miss if the word pairs were plotted in a bar graph, or listed in a table, or if the text were read in a natural linear fashion.

Although relational, the Word Pairs network graph is also directional: it proceeds from term 1 to term 2, as indicated by the line and arrow. Indeed, as Silge & Robinson point out: the Word Pairs network graph is also a “visualization of a Markov chain, a common model in text processing. In a Markov chain, each choice of word depends only on the previous word” {8}.

In brief, the Word Pairs network graph provides us with an overview of key word-pairs that also indicates their relationships within the document (or documents).
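The Markov-chain reading of a word-pair graph can be illustrated with a small transition table: each directed edge from term 1 to term 2 carries the share of times term 2 follows term 1. This is a sketch under that interpretation, with a hypothetical function name `transition_table` and a made-up token sequence; it does not draw the graph itself.

```python
from collections import Counter, defaultdict


def transition_table(tokens):
    """Map each word to the relative frequencies of the words that directly follow it."""
    counts = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        counts[w1][w2] += 1  # one directed edge w1 -> w2
    return {
        w1: {w2: n / sum(nxt.values()) for w2, n in nxt.items()}
        for w1, nxt in counts.items()
    }


# Made-up token sequence for illustration.
tokens = "wedding journey wedding clothes wedding journey".split()
table = transition_table(tokens)
print(table["wedding"])
```

Because each row depends only on the current word, the table is the "each choice of word depends only on the previous word" model from the quotation, and its non-zero entries are the arrows of the directed graph.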

(See also Silge & Robinson, “Relationships between words: n-grams and correlations”).

WC Network Graph

Word Correlation analyses discover which terms co-occur with each other, as per a specified section of a document, and the strength of that co-occurrence as measured by the correlation value {9}.

A network graph of word correlations displays linkages and associations that are not as easily captured in a bar plot or data table, and are typically missed when reading the text in a natural linear fashion. Unlike a Word Pairs network graph, a Word Correlations network graph is not directional: rather, it is cluster-centered depending on the strength of correlation.

In brief, the Word Correlations network graph reveals valuable term clusters (empirically determined word associations) which provide information about the deeper layers of language-usage and thought in the document (or documents).

(See also Silge & Robinson, “Relationships between words: n-grams and correlations”).

Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.

Gender

He-She terms in four Books

after_he_she_plot

Analysis

The portrayal of the two genders in these four books is very different. After “he”, the word the writers use most is “interested”; after “she”, it is “recovered”. This may reveal the writers’ perceptions of gender. It also reveals differences in how the two genders are represented.

About gender analysis

This approach to gender analysis was inspired by a blog post and study by Julia Silge, textbook co-author and one of the creators of tidytext. We freely use the code she provided – thank you, Julia!

Silge in turn credits the break-through academic study by Professors Matthew Jockers and Gabi Kirilloff, “Understanding Gender and Character Agency in the 19th Century Novel”, Journal of Cultural Analytics (2017).

Approaching the topic of “character identity [as realized through] character action”, Jockers and Kirilloff (2017) examined “character agency in the context of character gender” by examining “trends in behavior associated with male and female characters” in their macro-analytic study.

We borrow their insights also.

By seeing what words – and particularly verbs – follow “he” or “she”, we can gain insight into how these texts portray gender and hence gender roles: “character agency in the context of character gender”, as Jockers and Kirilloff (2017) put it.
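A minimal sketch of the idea, in plain Python rather than the code the project borrowed: collect the word that directly follows “he” or “she”, then compare how characteristic a word is of each pronoun with a smoothed log2 ratio. The add-one smoothing and the function names `words_after_pronouns` and `log_ratio` are assumptions made for this illustration, and the sample sentence is invented.

```python
import math
from collections import Counter


def words_after_pronouns(tokens):
    """Count the word that directly follows each occurrence of 'he' and 'she'."""
    after = {"he": Counter(), "she": Counter()}
    for w1, w2 in zip(tokens, tokens[1:]):
        if w1 in after:
            after[w1][w2] += 1
    return after


def log_ratio(after, word):
    """log2 of the 'she' rate over the 'he' rate, with add-one smoothing (an assumption)."""
    she_rate = (after["she"][word] + 1) / (sum(after["she"].values()) + 1)
    he_rate = (after["he"][word] + 1) / (sum(after["he"].values()) + 1)
    return math.log2(she_rate / he_rate)


# Invented sample text for illustration.
tokens = "he said she said she cried he ran she cried".split()
after = words_after_pronouns(tokens)
print(after["she"].most_common(2))
print(round(log_ratio(after, "cried"), 2), round(log_ratio(after, "ran"), 2))
```

Positive values mark words that follow “she” relatively more often, negative values words that follow “he” relatively more often; a plot of the most extreme values on each side is one way to summarize the contrast.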

Data Tables