Zipf Results

[Plots: Zipf Law Test; Zipf Law Test2; Term Count Distribution]

Research Project & Text Corpus

Research Project

This project examines four lesser-known books. All of the authors are men. The books are about animals and humans.

  • Birds and Man (1927), by William Henry Hudson
  • Buffalo Land (2012), by W.E.Webb
  • Deadfalls and Snares (1907), by Arthur Robert Harding
  • The Bird Book (2022), by Steve Jenkins and Robin Page

Some questions this study will explore – directly or indirectly:

  • What topics do these books mainly discuss?
  • Which sentiment words dominate in these books?
  • Which bigrams do some interesting words form?

Text Corpus

The text corpus consists of the four novels mentioned above, as downloaded from Project Gutenberg.

Zipf Results

The plots show that the four books follow very similar curves. In each of them, term frequency falls as rank increases, so rank and frequency are negatively related. There are some deviations from this pattern at the lowest and highest ranks.

A linear model was fitted for ranks 25 to 1000, as indicated on the graph. According to the adjusted R-squared value, this model explains over 99% of the variance in the data, which again indicates strong general conformance to Zipf's law.
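The fit described above can be sketched roughly as follows. This is a minimal illustration, not the code behind the plots: the function name `zipf_fit`, the use of base-10 logarithms, and the use of relative frequencies are all assumptions. It regresses log frequency on log rank over a rank window and reports the adjusted R-squared value.

```python
import math
from collections import Counter


def zipf_fit(tokens, min_rank=25, max_rank=1000):
    """Least-squares fit of log10(frequency) on log10(rank) in a rank window.

    Hypothetical helper for illustration; returns (slope, intercept, adj_r2).
    """
    counts = sorted(Counter(tokens).values(), reverse=True)
    total = sum(counts)
    # Ranks are 1-based; keep only the ranks inside the fitted window.
    points = [
        (math.log10(rank), math.log10(count / total))
        for rank, count in enumerate(counts, start=1)
        if min_rank <= rank <= max_rank
    ]
    n = len(points)
    if n < 3:
        raise ValueError("need at least three ranks in the window")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in points)
    r2 = 1 - ss_res / syy
    # Adjusted R-squared for a model with a single predictor.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    return slope, intercept, adj_r2
```

Under Zipf's law the slope should be close to -1, so a slope near that value together with a high adjusted R-squared is what "strong conformance" means here.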

Overall, the Zipf test results are good news. They mean that our basic assumptions hold, and that we can apply the usual text mining tools and concepts from corpus linguistics.

They also suggest that writers who differ in nationality and style still share certain patterns of behavior when it comes to language usage.

TF-IDF

[Plots: Birds and Man; The Bird Book; Buffalo Land; Deadfalls and Snares]

Analysis

These four plots show the words that stand out most in each book, which gives us a sense of each book's main idea and theme. In The Bird Book, “color” and “draws” are the top two terms, so we may guess that the book is about drawing. In Birds and Man, “bird” and “eggs” appear frequently, so the book mainly discusses birds and how they live. In Buffalo Land, “buffalo” ranks first, so the buffalo is the main subject. In Deadfalls and Snares, words related to trapping, such as “trap”, “traps” and “trapper”, consistently rank at the top.

About TF-IDF

To cite the main points from Wikipedia on term frequency–inverse document frequency:

  • a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus
  • used as a weighting factor in information retrieval, text mining, and user modeling
  • tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus

We use it as a standard measure to find the information value of a term in a text corpus.
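As a rough sketch of how such a weight can be computed, the example below uses one common formulation: term frequency as the share of a document's tokens, and inverse document frequency as the natural log of the number of documents divided by the number of documents containing the term. The formulation and the function name `tf_idf` are assumptions for illustration; the plots in this report may have been produced with a different variant.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf-idf weights for a mapping of document name -> token list.

    Hypothetical helper; returns {name: {term: weight}} with
    tf = count / document length and idf = ln(n_docs / docs containing term).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    weights = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        weights[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights
```

With this formulation, a word that occurs in every document gets a weight of zero, which is why common function words drop out and book-specific words such as “buffalo” or “trap” rise to the top.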

Sentiment

[Plots: The Bird Book; Birds and Man; Buffalo Land; Deadfalls and Snares]

Analysis

These four plots show the sentiment words in the four books. In The Bird Book, “darker” and “dark” are among the negative words, while the positive words include “abundant”, “beautiful” and “handsome”. In Birds and Man, “dark” is again a negative word, along with “wild”, “dead” and “cry”; the positive words include “charm”, “pretty”, “beauty” and “beautiful”. Buffalo Land also has “beautiful” and “pretty” among its positive words, and its negative words include “wild”, “poor” and “dead”. In Deadfalls and Snares, the negative words include “trap”, “bait” and “snare”, and the positive words include “easy” and “strong”.

About Sentiment Analysis

Sentiment analysis attempts to determine the emotional content of text. To cite a more formal definition from Wikipedia (https://en.wikipedia.org/wiki/Sentiment_analysis):

Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.

These graphs primarily use the sentiment lexicon nrc, which has some interesting features. It categorizes terms according to eight primary emotions as defined by Robert Plutchik in his Wheel of Emotions system. Those eight basic emotions are: Anger, Anticipation, Joy, Trust, Fear, Surprise, Sadness, and Disgust.
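Lexicon-based sentiment counting can be sketched as below. The lexicon here is a tiny hypothetical stand-in, not the actual nrc data, and its category assignments are made up for illustration; the function name `sentiment_counts` is likewise an assumption. The idea is simply to look each token up in the lexicon and tally the categories it belongs to.

```python
from collections import Counter

# Hypothetical toy lexicon for illustration only. The real lexicon is far
# larger, and these category assignments are not taken from it.
TOY_LEXICON = {
    "beautiful": {"positive", "joy"},
    "pretty": {"positive"},
    "dark": {"negative", "sadness"},
    "dead": {"negative"},
}


def sentiment_counts(tokens, lexicon=TOY_LEXICON):
    """Count how many tokens fall under each lexicon category."""
    totals = Counter()
    for token in tokens:
        # Words missing from the lexicon contribute nothing.
        for category in lexicon.get(token.lower(), ()):
            totals[category] += 1
    return totals
```

Because one word can belong to several categories, the category totals can add up to more than the number of matched words.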

Bigrams

[Plots: The Word after Bird; The Word after Animal]

Analysis

Since two of these books have “bird” in the title, I wanted to know which words follow “bird”. The results show that “bird” is followed by words such as “is”, “life” and “and”. All four books also seem to be related to animals, so I ran the same query for “animal”. The words that follow “animal” include “is”, “to” and “life”.

About Bigram Analysis

To cite Wikipedia:

A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2. The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, speech recognition, and so on.

Bigram analyses focus on word pairs, including pair-associations and order of precedence. This can provide us with valuable information about the patterns and even thematic concerns of a text corpus.
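A minimal sketch of the "word after" query used in this section might look like the following. The tokenizer (lowercased runs of letters and apostrophes) and the function name `words_after` are assumptions for illustration, and this simple version also counts pairs that span a sentence boundary.

```python
import re
from collections import Counter


def words_after(text, target):
    """Count the words that immediately follow `target` in `text`.

    Hypothetical helper; matching is case-insensitive.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    target = target.lower()
    followers = Counter()
    # Each adjacent pair of tokens is one bigram.
    for first, second in zip(tokens, tokens[1:]):
        if first == target:
            followers[second] += 1
    return followers
```

Calling `words_after(book_text, "bird").most_common(10)` on a book's full text (where `book_text` is a placeholder for that text) would give the ten most frequent followers of “bird”.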