This project, Text Mining: Patterns and Differences in 6 Novels by Charles Dickens, was submitted on 7 January 2021 by [高翠盈], [ID: 015], in partial fulfillment of the requirements for CLA 3206A: Text Mining for Liberal Arts Majors, Shantou University, Fall Semester 2020.
Research Project
This project centers on six novels by Charles Dickens. The novels are:
Here are the research questions this project will explore:
Text Corpus
The text corpus includes the six novels mentioned above, which were downloaded from Project Gutenberg.
Analysis
The text corpus (six novels by Charles Dickens) is generally consistent with Zipf’s law, although the highest-ranked (most commonly used) words deviate from it. This deviation indicates that Charles Dickens used the most common English words, as expected, but not as frequently as Zipf’s law would predict. Charles Dickens may therefore have chosen his words more carefully and selectively than most writers. In addition, Charles Dickens appears to have a consistent term count usage pattern, because the result “lines” of the 6 novels deviate only slightly from each other. We will explore this in more detail in the next chart of term count summary stats.
The adjusted R-squared value of the best-fit linear model is 0.9932606. That is to say, over 99% of the data variance is accounted for by the model based on Zipf’s law.
Thus, the Zipf test result meets the basic assumptions, and the typical text mining tools and concepts of corpus linguistics are applicable to this text mining study.
About Zipf’s Law
Zipf’s law is an empirical generalization based on studies of natural language text corpora. It has repeatedly been shown to have merit and good predictive power even when applied to very different collections of documents (text corpora) in all studied natural languages, including extinct languages. It examines the relationship between term frequency (how often a term is used in the text corpus) and term rank.
By definition, the most frequently used term has rank 1 (the lowest rank); the second most frequent term, rank 2 (the second lowest rank); and the least frequent term, the highest rank. Beyond this simple ordering pattern, Zipf’s law states that the “frequency of any word is inversely proportional to its rank in the frequency table” {1}. In other words, as the Wikipedia entry {1} explains, for a given collection of documents (a text corpus), Zipf’s law predicts that “the most frequent word [rank 1] will occur approximately twice as often as the second most frequent word [rank 2], three times as often as the third most frequent word [rank 3], etc.”
Since the relationship is inversely proportional, and hence a power law distribution, we can test for conformity with Zipf’s law by plotting on the logarithmic scale (log-log) the term frequency (Y axis) as a response to term rank (X axis). On such a plot, a corpus that follows Zipf’s law falls close to a straight line with a slope near -1.
Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.
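The log-log test described above can be sketched in a few lines of Python. This is only an illustration, not the code used in this project (which relied on the course's R tooling): the function name `zipf_fit` and the toy token list are hypothetical, and the sketch reports the plain R-squared rather than the adjusted value quoted in the analysis.

```python
import math
from collections import Counter

def zipf_fit(tokens):
    """Least-squares fit of log10(frequency) on log10(rank).

    Returns (slope, intercept, r_squared). A slope near -1 together with a
    high r_squared is what Zipf's law predicts.
    """
    # Term counts sorted from most to least frequent; position + 1 is the rank.
    counts = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log10(count) for count in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical toy corpus built to follow Zipf's law exactly:
# the term at rank r occurs 60 / r times (60, 30, 20, 15, 12).
toy_tokens = []
for rank, term in enumerate(["the", "and", "of", "to", "a"], start=1):
    toy_tokens.extend([term] * (60 // rank))

slope, intercept, r_squared = zipf_fit(toy_tokens)
```

On real novels the fit is never this clean; the deviation among the top-ranked words noted in the analysis would show up as points lying below the fitted line at the left edge of the plot.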
Analysis
According to the term count summary stats table, the median count is the same for all 6 novels. For novels of similar size, such as David Copperfield and Dombey and Son, the key stats are close and the unique ratios are the same. The unique ratio shows the overall degree of term variety. Across the six novels, the unique ratios of similar-sized novels are close, and the ratio decreases as novel size increases. This shows that Charles Dickens employed a similar number of distinct words in similar-sized novels, and that the bigger the novel, the lower its overall degree of term variety. It is reasonable for bigger novels to have a lower unique ratio, because character names and place names are repeated more often in bigger novels.
Therefore, together with the Zipf result, the summary stats show that there is a term count usage pattern in Charles Dickens’s novels. We will explore which informative words are used in the novels in general through the following TF-IDF analysis.
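As a rough illustration of the summary stats discussed above, the following Python sketch computes them for a list of tokens. It assumes that the unique ratio is the number of distinct terms divided by the total number of terms; the report does not spell out the formula, so that definition, the function name `term_count_summary`, and the toy texts are all assumptions.

```python
from collections import Counter
from statistics import median

def term_count_summary(tokens):
    """Summary statistics for one document, given its list of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    distinct = len(counts)
    return {
        "total_terms": total,
        "distinct_terms": distinct,
        # Assumed definition: share of the text made up of distinct terms.
        "unique_ratio": distinct / total,
        "median_count": median(counts.values()),
    }

# Hypothetical toy texts: the longer one repeats the same vocabulary.
short_text = "pip met joe and pip met biddy".split()
long_text = short_text * 3

short_stats = term_count_summary(short_text)
long_stats = term_count_summary(long_text)
```

The longer toy text reuses the same five terms, so its unique ratio is lower, which mirrors the observation that the ratio falls as the novels get bigger.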
Analysis
From the TF-IDF chart, we can see that character names dominate, which shows that these names are repeated within a novel but are not commonly used in the other novels of the corpus.
If we look at the next data table, which lists the terms with the highest IDF scores (the most uncommon words) in the 6 novels, we find most of the character names there. It seems that the character names are unique and were carefully chosen or created by Charles Dickens. These names may be informative, telling much about the characters or their fates, because they seem to be rich in meaning. For instance, the name “murdstone” seems to combine “murder” and “stone”.
In the following Word Cloud section, we will explore the commonality and relative distinctness of the words used in the 6 novels by term count.
About TF-IDF
We have a text corpus composed of documents. (In CLA 3206A, typically a collection of novels. So the corpus is the collection; the documents, the individual novels). TF-IDF, term frequency–inverse document frequency, measures “how important a word is to a document in a collection” {2}.
We know that the frequently used terms in English, words such as “the”, “to”, “and”, and “of”, provide us with little insight about the document’s topics or distinct content. So low information value. But we also sense that if a term occurs often in one document, but not nearly as much in the other documents, that word likely both relates to the content and helps distinguish the document. So higher information value. TF-IDF, a widely used statistical measure for text mining and information retrieval, provides a formal mathematical expression of that intuition. It balances the document TF score against how often the term occurs in the rest of the corpus, the IDF. If a term is used often in the document and corpus, it has a low to (effectively) zero TF-IDF score. If the term appears often in the document, but rarely in the corpus: high TF-IDF score. So by this method, TF-IDF indicates the information value of a term.
(See also Silge & Robinson, “Analyzing word and document frequency: tf-idf”).
Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.
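To make the intuition above concrete, here is a minimal Python sketch of one common TF-IDF formulation: term frequency as a share of the document, multiplied by the natural log of (number of documents / number of documents containing the term). The exact variant used in this project is not stated, so this formula, the function name `tf_idf`, and the three toy "novels" are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps a document name to its list of tokens.

    Returns a dict: document name -> {term: tf-idf score}.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

# Hypothetical toy corpus of three tiny "novels".
docs = {
    "novel_a": "the murdstone the house murdstone".split(),
    "novel_b": "the forge the fire the house".split(),
    "novel_c": "the coketown the town".split(),
}
scores = tf_idf(docs)
top_term_a = max(scores["novel_a"], key=scores["novel_a"].get)
```

In this toy corpus “the” appears in every document and scores zero, while a name that occurs in only one document rises to the top, which is the same effect that puts character names at the top of the TF-IDF chart.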
Analysis
The words shown in the commonality word cloud can be divided into several groups:
All of these words are related to people’s everyday life, which shows that the 6 novels are about people’s daily life to some extent. This is consistent with Charles Dickens’s realistic style.
About Commonality Word cloud
From the R wordcloud package, the commonality word cloud reveals the terms shared across the documents in the corpus selected for plotting. It shows only the absolute intersection: the terms shared in common by all the documents selected. For example, if a commonality cloud of six novels is plotted, and a term is present in only five of the six novels, it will NOT be shown. The relative size of the term shown indicates its total count across all the documents selected.
The commonality cloud maps out continuity between documents in terms of their vocabulary.
Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.
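The selection rule described above (keep only terms present in every document, sized by their total count) can be sketched in Python as follows. This illustrates the rule as stated in the text rather than the R package's actual code; the function name `commonality_counts` and the toy documents are hypothetical.

```python
from collections import Counter

def commonality_counts(docs):
    """Return {term: total count} for terms that occur in every document."""
    per_doc = [Counter(tokens) for tokens in docs.values()]
    shared = set(per_doc[0])
    for counts in per_doc[1:]:
        shared &= set(counts)  # keep only terms also present in this document
    return {term: sum(counts[term] for counts in per_doc) for term in shared}

# Hypothetical toy documents: "door" is missing from doc_b, "night" from two.
docs = {
    "doc_a": "time hand time door".split(),
    "doc_b": "time hand night".split(),
    "doc_c": "hand time door".split(),
}
common = commonality_counts(docs)
```

Here “door” is dropped even though it appears in two of the three documents, matching the five-of-six example in the description.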
The comparison word cloud shows the relative distinctness between novels. The character names are removed here. The result shows the differences in terms of setting, features of the characters, key terms for action and mood between the 6 novels:
In terms of setting, there are elements of industrialization in Dombey and Son (“toots”), Hard Times (“coketown”), and Great Expectations (“forge”,“fire”) while the setting in David Copperfield seems to be mostly at “house”/“home”.
Words about setting
In terms of features of the characters, the different statuses of the characters may indicate different plots in the novels. For instance, there may be plots about litigation in The Pickwick Papers, given the appearance of “serjeant”, and the characters in David Copperfield may suffer from sickness, given the appearance of “doctor”. In addition, Hard Times seems to pay more attention to less-educated lower-class people, as seen in accented colloquial words like “wi” (with), “tis” (it is) and “yo” (you).
Words about characters’ features
In terms of key words for action, it seems that The Pickwick Papers is more about communication, Oliver Twist is more about reunion, Dombey and Son may have plots about robbery, David Copperfield is more about remembering, Hard Times is more about movement between places, and Great Expectations is more about searching, solving problems and communication.
Words about actions
In terms of the key words for mood, sadness may be more prominent in Oliver Twist, as suggested by the action “cried”, while David Copperfield may pay more attention to “love” and “happy”.
Since there are not many words about mood in this comparison cloud, why don’t we explore the use of sentimental terms through sentiment analysis in the following two sections?
From the R wordcloud package, the comparison cloud reveals the key differences in term frequency (or, tf-idf value if chosen instead) between two or more documents. This comparison is hence relative not absolute. It does not show the absolute differences between documents in term usage. Instead, it shows the strongest differences in term frequency (or tf-idf). The same term could appear in all of the documents selected for plotting, but it would only show for one document (if at all) if the term appears significantly more in that document than the others. Likewise, a term that appears often in one document, but does not appear in the other documents, would likely also be displayed for that document. The relative display size indicates the importance.
The comparison cloud maps out key changes (differences) between documents in terms of their vocabulary.
Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.
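One plausible way to score the "relative, not absolute" differences described above is to convert each document's counts to rates, subtract the mean rate across documents, and assign each term to the document where its rate is highest. The Python sketch below follows that assumed rule; it is not a claim about the internals of the R wordcloud package, and the function name `comparison_terms` and the toy documents are hypothetical.

```python
from collections import Counter

def comparison_terms(docs):
    """Assign each term to the document where its rate is highest.

    Returns {term: (document name, rate in that document minus mean rate)}.
    The deviation plays the role of the display size.
    """
    names = list(docs)
    rates = {}
    for name in names:
        counts = Counter(docs[name])
        total = len(docs[name])
        rates[name] = {term: count / total for term, count in counts.items()}
    vocab = set().union(*(r.keys() for r in rates.values()))
    result = {}
    for term in vocab:
        per_doc = [rates[name].get(term, 0.0) for name in names]
        mean_rate = sum(per_doc) / len(names)
        best = max(range(len(names)), key=lambda i: per_doc[i])
        result[term] = (names[best], per_doc[best] - mean_rate)
    return result

# Hypothetical toy documents: "time" occurs in both, but more in the second.
docs = {
    "hard_times": "coketown fact fact time".split(),
    "great_expectations": "forge fire time time".split(),
}
result = comparison_terms(docs)
```

Under this rule “time” is shown only for the document that uses it relatively more, even though both documents contain it, which is the behaviour the description attributes to the comparison cloud.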
Analysis
As shown in the Plutchik SA Lines plot for each novel, Trust usually dominates and tops the chart, although Sadness is sometimes temporarily more prominent. Moreover, Disgust always lies beneath Sadness, and changes in Disgust are generally followed by similar changes in Sadness. It seems that Charles Dickens has a pattern in his use of sentimental terms.
If we look at the radar chart, which shows the overall use of sentimental terms in all 6 novels, we can see that the six novels have a similar “shape” regardless of their different sizes. This result indicates that Charles Dickens has a pattern in the use of sentimental terms in his novels. In the following sentiment analysis word cloud, we will explore this pattern in detail to find out whether particular sentimental words are repeated in all 6 novels.
The Sentiment Analysis lexicon nrc categorizes terms – the identified vocabulary – according to eight primary emotions as defined by the psychologist Robert Plutchik {7}. The eight basic emotions according to Plutchik are: Anger, Anticipation, Joy, Trust, Fear, Surprise, Sadness, and Disgust.
SA Lines Plot
The Plutchik SA lines graph maps a document section by section in a linear fashion: similar to how one typically reads a text. Like the radar chart, it also matches the frequency of text terms to the lexicon. But it provides a strong indication of where the emotive vocabulary appears in the document and shows the pattern of usage.
In the case of a literary work such as a novel, the SA Lines graph provides valuable insight into the document’s literary style. Likewise, the emotive vocabulary patterns revealed also have some relationship to the story arc (plot) and character development.
Radar Chart
The Plutchik radar chart matches the frequency of text terms to the lexicon, and maps out the text’s emotive vocabulary (as identified by nrc) along the eight Plutchik emotive axes. It does not indicate where the vocabulary appears in the document, only the overall concentration or “shape”.
Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.
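The difference between the two charts described above comes down to how the lexicon matches are aggregated: once for the whole text (radar chart) or once per section (lines plot). The Python sketch below shows both, using a tiny hypothetical lexicon; `TOY_LEXICON`, its word-to-emotion assignments, and the function names are illustrative assumptions and do not reproduce the real nrc lexicon.

```python
from collections import Counter

# Hypothetical miniature lexicon: term -> Plutchik emotions (NOT the real nrc data).
TOY_LEXICON = {
    "friend": ["trust", "joy"],
    "honest": ["trust"],
    "death": ["sadness", "fear"],
    "afraid": ["fear"],
}

def emotion_counts(tokens, lexicon=TOY_LEXICON):
    """Total matches per emotion: the data behind a radar chart."""
    counts = Counter()
    for token in tokens:
        counts.update(lexicon.get(token, []))
    return counts

def emotion_by_section(tokens, section_size, lexicon=TOY_LEXICON):
    """Matches per emotion for consecutive sections: the data behind a lines plot."""
    return [
        emotion_counts(tokens[start:start + section_size], lexicon)
        for start in range(0, len(tokens), section_size)
    ]

tokens = "my honest friend was afraid of death and death came".split()
overall = emotion_counts(tokens)
sections = emotion_by_section(tokens, 5)
```

The overall totals give the "shape" only, while the per-section list shows that the trust vocabulary sits in the first half of the toy text and the sadness vocabulary in the second.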
Analysis
We can see that Charles Dickens did repeat certain positive and negative words in all 6 novels: for instance, negative words like “death/dead”, “afraid/fear”, “poor” and “dark”, and positive words like “pretty”, “smile”, “strong”, “love” and “happy”. Some words also recur frequently in 4 or 5 of the 6 novels, such as “strange/stranger”, “lost”, “cold” and “wrong”. This indicates that Charles Dickens had habits of sentimental word choice. It also indicates that the environment in his novels is often “dark” and “cold”, and that some topics are commonly discussed in his novels, such as “death”, “fear”, “poor (people)” and “love”.
Apart from these commonalities, there are some changes in sentimental word choice across Charles Dickens’s novels. “Love” is not prominent in his first novel, The Pickwick Papers. It becomes more prominent in his second novel, Oliver Twist, and then dominates in the later 4 novels. In addition, the word “confidence” becomes more prominent in the two latest novels, Hard Times and Great Expectations. This may indicate changes in the themes of his novels. Moreover, some words appear prominently in only 1 novel, which can point to plots or topics distinctive to that novel. For example, the words “punch” and “prison” in The Pickwick Papers suggest that there may be plots about fighting and crime. In Oliver Twist, words like “noise”, “quiet” and “silent” indicate that environmental sound may play an important part in the novel, and the word “safe”, read together with the prominent negative words “poor”, “dead”, “dark” and “fear”, may indicate a desire for safety as a topic.
About Sentiment Analysis Word cloud
The sentiment analysis word cloud is a specialized type of comparison cloud. Rather than two or more documents being passed to the function comparison.cloud(), the data is first transformed by mapping it to a SA lexicon. This creates two (or more, depending on the lexicon) data categories. For example, using the bing lexicon will map out the terms in a document (or documents) to “positive” and “negative” {10}. The data is then passed to the function to be contrasted by category, “positive” and “negative” in our example. The “positive” and “negative” terms are displayed as contrasting sides in the comparison cloud, with the size indicating the relative importance in term frequency difference.
The SA word cloud provides an overview of and insight into emotive language and hence in part the style of a document (or set of documents).
Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.
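The data transformation described above, mapping terms to lexicon categories before they are contrasted, can be sketched in Python as a table of term counts per category. The polarity lexicon here is a hypothetical four-word stand-in for a lexicon such as bing, and the function name `polarity_term_counts` is likewise an assumption.

```python
from collections import Counter

# Hypothetical stand-in for a positive/negative lexicon such as bing.
TOY_POLARITY = {
    "love": "positive",
    "happy": "positive",
    "poor": "negative",
    "dark": "negative",
}

def polarity_term_counts(tokens, lexicon=TOY_POLARITY):
    """Count each lexicon term under its category; other tokens are ignored."""
    table = {"positive": Counter(), "negative": Counter()}
    for token in tokens:
        label = lexicon.get(token)
        if label is not None:
            table[label][token] += 1
    return table

tokens = "poor oliver was poor and the night was dark but love made him happy".split()
table = polarity_term_counts(tokens)
```

Each category's counter would then be one "side" of the cloud, with the counts driving the relative size of the terms.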
Analysis
Since “poor” and “strange” are two words that recur frequently in the 6 Charles Dickens novels, let’s explore which words come after “poor” and “strange”.
The words that follow “poor” are largely people (“fellow”, “mother”, “relations”), which indicates that the characters in Dickens’s novels often underwent something unfortunate.
The words after “strange” are about people (“man”, “boy”, “gentlemen”, “lady”, “faces”), their “feeling” and the environment (“place”, “news”, “sound”, “room”, “house”). Given that Charles Dickens is a writer of realism, this may indicate that people in the 19th century felt “strange” about the world and themselves.
A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2. The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, speech recognition, and so on. {4}
The CLA 3206A bigram analysis methods focused on adjacent words, also known as word pairs (when separated). In some cases, as is standard Tidytext TM practice, stop-words were used to produce cleaned data sets of bigrams and word pairs {5}.
Bigrams and word pairs typically provide valuable insight into the major topics of a document or corpus. Word pairs also typically provide insight about the document style, and about word-associations. These in turn may reveal important linkages between content and ideas.
Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.
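As a small illustration of the word-pair approach described above, the Python sketch below builds bigrams from adjacent tokens and counts the words that follow a chosen target, optionally dropping stop-words. The function names, the stop-word set, and the toy sentence are hypothetical; the project itself used the course's R (Tidytext) tooling.

```python
from collections import Counter

def bigrams(tokens):
    """All pairs of adjacent tokens, in reading order."""
    return list(zip(tokens, tokens[1:]))

def words_after(tokens, target, stop_words=frozenset()):
    """Count the words that directly follow `target`, skipping stop-words."""
    return Counter(
        second
        for first, second in bigrams(tokens)
        if first == target and second not in stop_words
    )

# Hypothetical toy sentence and stop-word list.
tokens = "the poor fellow met a poor mother and the poor fellow wept".split()
followers = words_after(tokens, "poor", stop_words={"the", "a", "and"})
```

Calling `followers.most_common()` would then list the followers of “poor” from most to least frequent, the kind of ranking behind the word-pair charts.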
Analysis
In order to figure out what kind of gender roles are portrayed in Charles Dickens’s works, we can use the gender analysis tool to “see what words, particularly verbs, follow ‘she’ or ‘he’”.
It is surprising to find that, in the 6 novels, the word most strongly associated with “he” is “paused” (relatively inactive), while the word most strongly associated with “she” is “arose” (active). Apart from that, however, most words after “she” are inactive or reactive, such as “pleaded”, “reclined”, “keeled”, “clung”, “prayed” and “forgave”. In contrast, the words after “he” are more active, such as “handed”, “mounted”, “grasped”, “seized” and “bade”. There are also typical verbs like she “sobbed” and he “drank”. In addition, it seems that the male characters went through more difficulties and struggles in Charles Dickens’s novels, as suggested by he “struggled”.
The Gender Analyses in CLA 3206A follow the lead of Julia Silge’s study, for which Silge herself credits the break-through academic study by Professors Matthew Jockers and Gabi Kirilloff, “Understanding Gender and Character Agency in the 19th Century Novel”, Journal of Cultural Analytics (2017).
Approaching the topic of “character identity [as realized through] character action”, Jockers and Kirilloff (2017) examined “character agency in the context of character gender” by examining “trends in behavior associated with male and female characters”. They did so by studying patterns of gender pronoun usage: she what? he what? Similar to Silge, CLA 3206A follows their general method, though with differences on the technical (coding) level.
She | He
By seeing what words – and particularly verbs – follow “she” or “he”, we can gain insight into how these texts portray gender and hence gender roles: “character agency in the context of character gender”, as Jockers and Kirilloff (2017) express it.
Source: CLA 3206A TM Guide. TJ Haslam, CC-BY-4.0, 2020. In-text citations by URL link.
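A simple way to rank the words that follow “she” against those that follow “he” is a log ratio of their smoothed relative frequencies. The Python sketch below assumes add-one smoothing and a base-2 logarithm; these choices, the function name `pronoun_log_ratios`, and the toy sentence are assumptions for illustration and may differ from the scoring used in CLA 3206A or in the studies cited above.

```python
import math
from collections import Counter

def pronoun_log_ratios(tokens):
    """log2 of (smoothed rate after 'she') / (smoothed rate after 'he').

    Positive values lean towards 'she', negative values towards 'he'.
    """
    she, he = Counter(), Counter()
    for first, second in zip(tokens, tokens[1:]):
        if first == "she":
            she[second] += 1
        elif first == "he":
            he[second] += 1
    vocab = set(she) | set(he)
    # Add-one smoothing (an assumption) so unseen pairs do not divide by zero.
    she_total = sum(she.values()) + len(vocab)
    he_total = sum(he.values()) + len(vocab)
    return {
        word: math.log2(((she[word] + 1) / she_total) / ((he[word] + 1) / he_total))
        for word in vocab
    }

# Hypothetical toy text.
tokens = (
    "she sobbed and he seized it then she sobbed "
    "while he paused and she said so he said"
).split()
ratios = pronoun_log_ratios(tokens)
```

In the toy text “said” follows both pronouns equally and scores zero, while “sobbed” leans towards “she” and “seized” towards “he”; the words at the extremes of such a ranking are the ones discussed in the analysis.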