Introduction and Methods

A Study in Scarlet

Purpose Statement

Every author, genre, and novel has a particular style that captures the essence of the writing. “A Study in Scarlet”, written by Arthur Conan Doyle, is the first novel to feature the famous detective Sherlock Holmes. As a result, it marks the start of a style that has been repeated and emulated ever since. The purpose of this text mining research is to understand and analyze Conan Doyle’s word choice in his first Baker Street novel, the word choice that helped make this and the subsequent Holmes works a success. Through (1) an analysis of the most used words and bigrams in the novel, (2) a cross-comparison of word use in select chapters, (3) a study of the frequency and use of verbs and adjectives, and (4) a sentiment analysis comparing “positive” and “negative” words, this research begins to examine the writing style that immortalized Conan Doyle’s work. The final deliverables include an analysis of which words are most used throughout the novel, a discussion of how word use changes across the sample chapters, a graphic comparing action words with descriptive words, and a sentiment analysis of emotion words across the entire novel.

Citations

Doyle, Arthur Conan. A Study in Scarlet. Leicester: Thorpe, 2011.

Mohammad, Saif M. NRC Emotion Lexicon. Accessed April 19, 2021. http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm.

Arthur Conan Doyle

Methods

Using RStudio and the text of “A Study in Scarlet” by Arthur Conan Doyle, pulled from the Project Gutenberg library (distributed under the Project Gutenberg License), I carried out this research and produced the results that follow. I started by importing the data set, then wrangled and tidied it by tokenizing the text to one word per row and removing stop words, missing values, and extra characters. The stop words came from the stop-word dictionary included in the R package tidytext. Removing stop words let me see the most important words in each section of the novel without common words such as prepositions and pronouns dominating the counts. I also cut off the first section of the raw data, which contained the license information from Project Gutenberg. This language is not part of the original Conan Doyle manuscript and hence not relevant to my study.
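The tidying steps described above can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not the exact script used for this report: the two input sentences and the name `raw_text` are hypothetical stand-ins for the imported novel, while `unnest_tokens()` and the `stop_words` data set come from tidytext.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Toy stand-in for the imported novel (hypothetical lines, not the real data).
raw_text <- tibble(
  line = 1:2,
  text = c("Sherlock Holmes answered the door.",
           "Holmes found the letter on the table.")
)

word_counts <- raw_text %>%
  unnest_tokens(word, text) %>%            # one lowercase word per row
  filter(!is.na(word)) %>%                 # drop missing values
  anti_join(stop_words, by = "word") %>%   # remove tidytext's stop words
  count(word, sort = TRUE)                 # frequency of each remaining word

print(word_counts)
```

Because `unnest_tokens()` lowercases the text and strips punctuation by default, "Holmes" and "holmes" are counted as the same word.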

With my data in tidy text form, I was able to analyze the most used words through an individual word count and a bigram count. Moreover, given my aim of analyzing how word choice differs between chapters, I restructured the data by chapter and pulled chapters 1, 8, and 14 as samples for this study. I then created a frequency study of individual word counts and bigram counts for these chapters.
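One way to divide the text into chapters and count bigrams per chapter is sketched below. The toy lines, the heading pattern `^chapter [ivxlc]+`, and the object names are assumptions made for illustration; the column names `firstword` and `secondword` follow the bigram tables in the Results section.

```r
library(dplyr)
library(stringr)
library(tibble)
library(tidyr)
library(tidytext)

# Toy stand-in for the novel, one row per line of text (hypothetical content).
novel_lines <- tibble(text = c(
  "CHAPTER I",
  "Sherlock Holmes met Stamford.",
  "Stamford knew Sherlock Holmes.",
  "CHAPTER II",
  "Jefferson Hope left Salt Lake City."
))

# Assumed heading format: the word "chapter" followed by a Roman numeral.
heading <- regex("^chapter [ivxlc]+", ignore_case = TRUE)

bigram_counts <- novel_lines %>%
  mutate(chapter = cumsum(str_detect(text, heading))) %>%  # running chapter number
  filter(!str_detect(text, heading)) %>%                   # drop the heading rows
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% # word pairs within each row
  separate(bigram, c("firstword", "secondword"), sep = " ") %>%
  filter(!firstword %in% stop_words$word,                  # drop pairs containing
         !secondword %in% stop_words$word) %>%             # a stop word
  count(chapter, firstword, secondword, sort = TRUE)

# Pull sample chapters, as was done for chapters 1, 8, and 14 in the study.
sample_chapters <- bigram_counts %>% filter(chapter %in% c(1, 2))
print(sample_chapters)
```

In this sketch, bigrams are formed within each row, so a word pair that straddles a line break is not counted; collapsing each chapter into a single string before tokenizing would avoid that.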

As part of my third objective, I analyzed the parts of speech used in the novel to compare action-word usage (verbs) with description-word usage (adjectives) as stylistic choices. Using the cleanNLP package, I tagged each word in the novel with a part of speech and then used that information to produce a word count and a word cloud of the most used verbs and adjectives.
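The counting step that follows the tagging can be sketched as below. The tagging itself is not run here: the `tagged` table is a hand-built, hypothetical slice, and its `token` and `upos` columns are an assumption about the layout of a cleanNLP token table rather than output copied from this study.

```r
library(dplyr)
library(tibble)

# Hypothetical slice of a tagged token table (assumed token/upos layout).
tagged <- tibble(
  token = c("Holmes", "answered", "the", "terrible", "question",
            "and", "found", "a", "dark", "room"),
  upos  = c("PROPN", "VERB", "DET", "ADJ", "NOUN",
            "CCONJ", "VERB", "DET", "ADJ", "NOUN")
)

# Total number of verbs and adjectives.
pos_totals <- tagged %>%
  filter(upos %in% c("VERB", "ADJ")) %>%
  count(upos, name = "total")

# Most used individual verbs and adjectives (input for a table or word cloud).
top_words_by_pos <- tagged %>%
  filter(upos %in% c("VERB", "ADJ")) %>%
  mutate(word = tolower(token)) %>%
  count(upos, word, sort = TRUE)

print(pos_totals)
print(top_words_by_pos)
```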

Finally, I created a sentiment analysis using the NRC Emotion Lexicon as the analysis dictionary. This dictionary tags words with a sentiment (positive or negative) and with several different emotions, including “fear”, “surprise”, and “anger”. My sentiment analysis included a direct comparison of positive and negative word use in the novel, as well as the use of words associated with “fear” and “surprise”. As part of the fourth deliverable, I wanted to find out whether, given that this is a mystery and thriller, Conan Doyle would consistently use more words related to adrenaline, which can be subtagged under fear and surprise. I also wanted to see whether negative words would dominate overall, given the dark nature of this genre.
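The lexicon lookup amounts to joining the word counts against a table of word and sentiment pairs. The sketch below uses a hand-built `mini_lexicon` so that it runs without downloading anything; it is a hypothetical stand-in for the NRC Emotion Lexicon, limited to the words and tags this report itself mentions, and the counts are the ones reported in the Results section.

```r
library(dplyr)
library(tibble)

# Word counts as reported in the Results section of this report.
word_counts <- tribble(
  ~word,       ~n,
  "hope",      55,
  "found",     55,
  "companion", 38,
  "john",      36,
  "death",     29,
  "spoke",     29,
  "terrible",  19,
  "doubt",     19,
  "mystery",   18
)

# Hypothetical mini-lexicon in word/sentiment form (NOT the full NRC lexicon).
mini_lexicon <- tribble(
  ~word,       ~sentiment,
  "hope",      "surprise",
  "hope",      "positive",
  "found",     "positive",
  "companion", "positive",
  "john",      "negative",
  "spoke",     "negative",
  "death",     "negative",
  "death",     "fear",
  "death",     "surprise",
  "terrible",  "fear",
  "doubt",     "fear",
  "mystery",   "fear",
  "mystery",   "surprise"
)

# A word carrying several tags appears once per tag after the join.
tagged_counts <- inner_join(word_counts, mini_lexicon, by = "word")

fear_words     <- tagged_counts %>% filter(sentiment == "fear") %>% arrange(desc(n))
surprise_words <- tagged_counts %>% filter(sentiment == "surprise") %>% arrange(desc(n))

print(fear_words)
print(surprise_words)
```

Because the join keeps one row per word and tag, a word such as “death” contributes to several emotion categories at once, which is worth keeping in mind when comparing categories.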

Using the ggplot2 and wordcloud packages in RStudio, I produced the graphics that feature in the Results section of this report.
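As an illustration of the plotting step, a frequency bar chart can be built along the following lines. The layout, labels, and object names are assumptions and do not reproduce the report's actual figures; the counts are the full-novel word counts reported in the Results section.

```r
library(ggplot2)

# Full-novel word counts as reported in the Results section.
top_words <- data.frame(
  word = c("holmes", "time", "answered", "eyes", "hand", "ferrier", "found", "hope"),
  n    = c(95, 77, 59, 59, 57, 55, 55, 55)
)

# Horizontal bars, ordered so the most frequent word sits at the top.
p <- ggplot(top_words, aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Count", y = NULL, title = "Most used words in the novel")

# Call print(p) to draw the chart in an interactive session.
```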

The First Sherlock Holmes

Original Manuscript Cover

Original Drawing of Sherlock Holmes by Sidney Paget

Results

I. Full Novel Frequency Word Count and Bigrams

Looking at the full-novel word frequencies, the most used words are “holmes” (n=95), “time” (n=77), and “answered” (n=59). This does not provide much analysis beyond the fact that the main character’s name is the most used word in the novel. This makes sense considering that Watson, the narrator, describes Holmes from the outside and must refer to the detective by name many times. However, the first name, Sherlock, is also used 50 times and is counted separately from the surname, so it takes up its own slot in the single-word analysis. Here, a bigram analysis is useful for seeing how these words interact with one another and which word pairs come out on top. It would be logical to expect “Sherlock Holmes” to rank among the most frequent bigrams.

word n
holmes 95
time 77
answered 59
eyes 59
hand 57
ferrier 55
found 55
hope 55

Analyzing the bigram data set, it is not surprising to see that the most common bigram is “Sherlock Holmes”, the name of the main character and detective of the story. It is also not surprising that the name of his companion, John Watson, is less common, since Watson is in fact the narrator of this novel. The bigram counts also show that many of the most used word pairs are the names of people and places important to the story. For example, “Baker Street”, the home of Sherlock Holmes and John Watson, appears 6 times. Similarly, the name Jefferson Hope appears 30 times, which suggests that he is also a main character in this novel. In fact, those who have read it know that he is the main villain.

firstword secondword n
sherlock holmes 42
jefferson hope 30
john ferrier 23
brixton road 13
salt lake 10
lake city 9
joseph stangerson 8

II. Analyzing Frequencies per Chapter

Looking at the per-chapter comparison of the three sample chapters (1, 8, and 14), there are clear differences between the most used words. Chapter 1 has “Stamford” (n=10), a name, while Chapter 8 has a common tangible noun, “eyes” (n=10), and Chapter 14 has an abstract noun, “hope” (n=7). From reading the novel, Stamford is a minor character, since he serves only as the introductory link between Watson and Holmes. This word frequency count illustrates the use of his name as a plot device in the first chapter only. For a visual description of this frequency count, see Figure 1.

Looking at the bigrams, there are some clear differences in which pairs of words are most used as the novel unfolds. In the first chapter, the most frequent bigram is the name Sherlock Holmes (n=3). The logic follows that, since “A Study in Scarlet” is the first appearance of the famous detective, Doyle has to refer to the character by name multiple times to give the necessary background information.

firstword secondword n
sherlock holmes 42
jefferson hope 30
john ferrier 23
brixton road 13
salt lake 10
lake city 9
joseph stangerson 8
lucy ferrier 8
baker street 6

In Chapter 8, the most frequent bigram is “Sierra Blanco” (n=3), which is the setting of the death in “A Study in Scarlet”.

firstword secondword n
sierra blanco 3
brother stangerson 2
extreme verge 2
grey shawl 2
john ferrier 2
joseph smith 2

Finally, in Chapter 14 the most used bigrams are “Sherlock Holmes” (n=2) and “dead man’s” (n=2). It is important to note that some bigrams do not make sense, such as “page 23”, a page number that was accidentally left in when the manuscript was uploaded to Project Gutenberg.

firstword secondword n
dead man’s 2
sherlock holmes 2
page 23 1
1 frontispiece 1
2 john 1
absent woman 1

III. Analyzing Parts of Speech

Looking at the counts for the parts of speech, there are nearly twice as many verbs (n=3218) as adjectives (n=1710) in this novel. This suggests that action moves the plot forward more than heavy description does. This follows the general pattern of detective novels, which tend to use active narration to propel the action. Looking at the specific verbs used, “answered” (n=59) and “found” (n=55) are the two most used verbs. Given that mystery and thriller require deduction and analysis of evidence, it is logical that words describing those actions are the most common. This analysis also showcases the past participle as the preferred verb form, since all the most used verbs are conjugated as such. As for adjectives, the most used ones evoke ambiance, with “white” (n=24), “dark” (n=17), and “silent” (n=13), as well as emotion, with “terrible” (n=19). Looking at the word choice, it is reasonable to conclude that the overall writing style of Arthur Conan Doyle aims to create an environment of intrigue for the reader.
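The relationship between the two reported totals can be checked with a short calculation. The variable names below are illustrative; the two counts are the ones reported in the paragraph above.

```r
# Part-of-speech totals reported above.
verbs      <- 3218
adjectives <- 1710

# How many verbs there are for every adjective.
ratio <- verbs / adjectives
round(ratio, 2)   # 1.88

# Share of verbs among verbs and adjectives combined.
verb_share <- verbs / (verbs + adjectives)
round(verb_share, 2)   # 0.65
```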

(See Word Cloud Tab for Data Visualizations)

IV. Sentiment Analysis

Looking at the emotion analysis of fear words, the most used word associated with fear is “death” (n=29), followed by “terrible” (n=19) and “doubt” (n=19) (see Figure 2). These emotions are consistent with the mystery and thriller genre, which uses death as the most impactful ending. Looking at the emotion analysis for surprise words, the most used word associated with surprise is “hope” (n=55), followed by “death” (n=29) and “mystery” (n=18) (see Figure 3). Both “death” and “mystery” were tagged in both of these analyses, showing that fear and surprise are closely linked in the NRC Emotion Lexicon.

Looking at the positive vs. negative words, there are some clear differences between the sets. Naturally, this division is clear-cut: if a word is tagged as positive, it cannot be tagged as negative. The most used positive words in the novel are “hope” (n=55), “found” (n=55), and “companion” (n=38). The most used negative words are “john” (n=36), “spoke” (n=29), and “death” (n=29) (see Figure 4). Comparing these figures, it is clear that some words tagged as surprise in the emotion lexicon are also tagged as positive in the sentiment lexicon, such as the word “hope”. This helps create an understanding of the nuanced meaning of hope in the text. At the same time, some tagged words, such as “john”, do not in fact have a negative connotation in the text (or in real life).

Figure 1: Word Counts per Chapter

Figure 2: Sentiment Analysis, Fear Words

Figure 3: Sentiment Analysis, Surprise Words

Figure 4: Sentiment Analysis, Positive vs Negative Words

Word Clouds

Adjective Usage

Verb Usage

Discussion

Putting all this information into context, some main findings can be discerned. First, by analyzing the word counts and bigrams of the entire novel, it is clear that the most used wording is the name of the main character, Sherlock Holmes. Since Watson, as narrator, describes the detective from the outside, it follows that the author would need to refer to him constantly by name.

Secondly, looking at the chapter frequencies, it is clear that some words are more necessary than others as the plot progresses. For example, the name Sherlock Holmes is used repeatedly in the first chapter, while it has no repeated mentions in the eighth chapter. This is because the first chapter needs to introduce the main character. At the same time, chapter one repeatedly mentions Stamford, John Watson’s friend who introduces him to Sherlock Holmes. Clearly, he is important as a plot device in the first chapter but no longer necessary in the subsequent ones, given that his name is not consistently used.

Thirdly, looking at the parts of speech analysis, it is clear that Conan Doyle moves his novel forward through action as opposed to description. There are nearly twice as many verbs as adjectives, suggesting that action indeed takes precedence. This is an important stylistic choice, since the novel reads at a faster pace the more verbs are included in the narration. This analysis also helped elucidate the verb conjugation style preferred by Conan Doyle, with most verbs being conjugated in the past participle.

Finally, the sentiment analysis points to an overall ambiance of fear, suspense, and intrigue. There are multiple instances of words associated with both fear and surprise (taking into account that some words are tagged with both emotions). Words that feature in the fear-based lexicon, such as “death”, are also tagged as negative. This suggests that Conan Doyle’s style is infused with fear- and surprise-based rhetoric meant to keep the reader’s adrenaline up.

Out of the entire analysis, the parts of speech results surprised me the most. I was expecting a balance between verb and adjective use in the novel. I remember reading “A Study in Scarlet”, and I did not feel that the action moved as fast as the description of people, places, and evidence did. However, this analysis suggests that Conan Doyle in fact intended his work to be more action-driven than description-driven.

I still have questions that remain unanswered in regard to the text. Firstly, I would like to compare this novel with his short stories to see the difference in action and fast-paced writing style, which can be done through the parts of speech analysis. For this, I would need to download, wrangle, and tidy some of his short stories. Secondly, I would love to compare this novel to one of his later novels and see how his patterns in parts of speech and sentiment words change over the years. From what I remember, the later novels tend to be more gruesome and explicit, so a sentiment analysis would be interesting. With that said, I believe that I would need to devise my own dictionary of sentiment words specific to “A Study in Scarlet”, since the dictionary used here tags words like “john” as negative.

Original Drawing of Sherlock Holmes by Sidney Paget

John Watson by Sidney Paget

630: Quantitative Methods