For this assignment, I have chosen one of the most iconic detective novels, “A Study in Scarlet” by Arthur Conan Doyle. First, I loaded all the packages needed for this assignment, as shown below:

knitr::opts_chunk$set(echo = TRUE)
library(gutenbergr)
library(tidyverse)
library(wordcloud)
library(stringr)
library(topicmodels)
library(data.table)
library(cleanNLP)
library(tidytext)
library(textdata)
library(ggpubr) #Needed later for ggarrange and annotate_figure
cnlp_init_udpipe()
#I chose all of these packages based on the R Tutorial.

Download and Tidy

Now that I have all my packages ready, I can use the gutenberg_download function to download “A Study in Scarlet” into RStudio and start cleaning it up.

(doyle<- gutenberg_download(244))
str(doyle) 
tibble [4,759 × 2] (S3: tbl_df/tbl/data.frame)
 $ gutenberg_id: int [1:4759] 244 244 244 244 244 244 244 244 244 244 ...
 $ text        : chr [1:4759] "A STUDY IN SCARLET." "" "By A. Conan Doyle" "" ...

Looking at the internal structure with the str function, I can see that Doyle’s novel is already in a tibble, which is the first step in organizing text data with these packages. However, there is a lot of unnecessary information at the beginning (such as a Gutenberg disclaimer) that is not relevant for my text analysis. In fact, this information, which is not part of the original Doyle manuscript, could skew my data analysis. Therefore, I have kept only the rows that contain the actual text, which start on row 50.

(doyle.df<-doyle[50:4759,])

My next step is to tokenize the text so that every row holds one word. I can do this with the unnest_tokens function.

(doyle.word <- doyle.df%>%
  unnest_tokens(word, text)) #word is the name of the new column I want to create, and text is the column from which the data came from within the doyle data frame

Just looking at the tibble, it is obvious that there are stop words that can affect my analysis of the text, so I will remove those using the tidytext stop_words dictionary. I will also check for any missing values in order to remove those as well.

(tidy.doyle<- doyle.word %>%
  anti_join(stop_words)) #Used anti_join function so all the words in the data frame stop_words would be removed from the data frame doyle.word
Joining, by = "word"
sum(is.na(tidy.doyle$word)) #Using the sum function, I counted how many values in the word column are NA.
[1] 0
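
The tokenizing and stop-word steps above can be sketched on a tiny example. This is only an illustration: the two-line `toy` tibble and the names `toy.words` and `toy.tidy` are made up and are not part of the novel's data.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-line text, used only for illustration
toy <- tibble(line = 1:2,
              text = c("The detective examined the room.",
                       "Holmes found a ring on the floor."))

# One row per word; unnest_tokens lowercases the words and strips punctuation
toy.words <- toy %>%
  unnest_tokens(word, text)

# Drop every word that also appears in the stop_words data frame
toy.tidy <- toy.words %>%
  anti_join(stop_words, by = "word")

toy.tidy
sum(is.na(toy.tidy$word)) # number of missing values left in the word column
```

Passing by = "word" explicitly also silences the "Joining, by" message that appears in the output above.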

I finally have my data in a tidy format, with no stop words, missing values, or extra characters. I can go ahead and produce a table of the top 50 words.

Analyzing Top 50 Words

tidy.doyle.wordcounts <-tidy.doyle %>% 
  count(word, sort=T)
head(tidy.doyle.wordcounts, 50)

Analysis: Looking at the top 50 words, it makes sense that Holmes, the main character’s name, is the most used word in the data. However, I can also see that Sherlock, his first name, is used 50 times and takes up its own spot in the word counts. Here, a bigram would be useful to see how these words interact with one another and which bigrams come out on top. My guess is that Sherlock Holmes as a bigram is going to come up a lot.

Making a Bigram

(bigrams.doyle <- doyle.df %>% unnest_tokens(output = bigrams, input = text, token = "ngrams", 
    n = 2)) #Same formula to unnest tokens except instead of words I am now tokenizing for ngrams with an n of two 

Taking a quick look at the bigrams here, I can see that the stop words create the same issues that they did before. As recommended by Dr. Brown, the easiest option is to remove every bigram in which the first or second word is a stop word. First, I will separate the words in each bigram, and then I will filter out the stop words.

separated.words <- bigrams.doyle %>% separate(bigrams, c("firstword", "secondword"), sep = " ") #I copied this code from Dr. Brown's tutorial. Using the bigrams.doyle created before, I used the separate function to divide the bigram column into two columns named first word and second word and separated by a space. 
(doyle.bigram.final <-separated.words %>%
  filter(!firstword %in% stop_words$word) %>% 
  filter(!secondword %in% stop_words$word)) #Using the new data set called separated.words, I kept only the rows where the first word does not match any word in the word column of stop_words. Then, I did the same thing for the secondword column
(doyle.diagram.wordcount <-doyle.bigram.final %>% count(firstword, secondword, sort=T)) #Doing the same word counting as I did before for the words, except this one is for the bigram counts. 
write_csv(doyle.diagram.wordcount,"/Users/silvanamontanola/Desktop/Quantitative/Code Files/bigram.csv")

After creating the bigrams, I can see that the counts include a row of 991 NAs. These most likely come from blank lines or one-word lines in the text, which are too short to form a bigram. I will remove those from the data set so I can continue analyzing the bigrams.

(doyle.diagram.wordcount <- na.omit(doyle.diagram.wordcount)) #na.omit drops every row that contains an NA; I assign the result back so the cleaned table is kept
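
To see one way NA bigrams can arise, here is a minimal sketch on made-up text. The `toy` tibble and the names `toy.bigrams` and `toy.clean` are hypothetical, and whether short lines show up as NA rows or are dropped may depend on the installed tidytext version, so the sketch filters them explicitly either way.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical text: one normal line, one blank line, one single-word line
toy <- tibble(text = c("Sherlock Holmes smiled", "", "Watson"))

# Tokenize into bigrams, as done for the novel
toy.bigrams <- toy %>%
  unnest_tokens(bigrams, text, token = "ngrams", n = 2)

toy.bigrams # inspect: lines too short for a bigram may appear as NA

# Remove the NA rows before separating, so no NA pairs reach the counts
toy.clean <- toy.bigrams %>%
  filter(!is.na(bigrams)) %>%
  separate(bigrams, c("firstword", "secondword"), sep = " ")

toy.clean
```

Filtering before the separate step keeps the NA rows out of every later table, instead of removing them only from the final counts.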

Analysis: Analyzing this bigram data set, it is not surprising to see that the most common bigram is Sherlock Holmes, the name of the main character and detective of the story. It is also not surprising that the name of his companion, John Watson, is less common, since Watson is in fact the narrator of this novel. Other information I would have missed without the bigrams is the fact that a lot of the most used words are the names of people and places important to the story. For example, the words baker street, the home of Sherlock Holmes and John Watson, are used 6 times. Similarly, the name Jefferson Hope is used 30 times. This would suggest he is also a main character in this novel. In fact, those who have read it know that he is the main villain.

Analyzing Frequencies per Chapter

Another thing I am interested in looking at is how the frequencies of words differ according to chapter. For example, I imagine that the villain’s name probably does not figure as often in the first chapters as it does in the last. “A Study in Scarlet” is divided into fourteen chapters, so I will subdivide the data by chapter to analyze word frequencies across the subsets.

First, I added an ID number to each word so I could filter for the word “chapter” within the data and figure out the cutoff points of each chapter.

(id.doyle <-doyle.word %>% mutate(ID= rownames(doyle.word)))
id.doyle %>% filter(word == "chapter") #filtered the data and realized that the word chapter actually shows up 15 times instead of 14.

Since the word chapter shows up more times than intended, I had to manually check the wording of the cutoff points to see which sections are not in fact chapters.

id.doyle[2796:2800,]
id.doyle[6374:6380,]
id.doyle[41524:41530,]
id.doyle[43700:43706,]

As seen above, the last two instances of the word chapter are filler words from Project Gutenberg at the end of the book. As such, I have removed those and will subset the data based on the remaining 13 chapter headers (the first chapter does not have a chapter number, which is why it did not show up in the filtered data).

ch1 <- id.doyle[1:2795,]
ch2 <- id.doyle[2796:6373,]
ch3 <- id.doyle[6374:10243,]
ch4 <- id.doyle[10244:12829,]
ch5 <- id.doyle[12830:15370,]
ch6 <- id.doyle[15371:18594,]
ch7 <- id.doyle[18595:21942,]
ch8 <- id.doyle[21943:25590,]
ch9 <- id.doyle[25591:28196,]
ch10 <- id.doyle[28197:30041,]
ch11 <- id.doyle[30042:33444,]
ch12 <- id.doyle[33445:36996,]
ch13 <- id.doyle[36997:41523,]
ch14 <- id.doyle[41524:43699,] #Created each chapter individually in order to analyze the difference in word frequencies. ch14 starts at row 41524 so it does not overlap with ch13, and stops before the filler word "chapter" at row 43700
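
As an alternative to typing the row ranges by hand, the chapter number can be derived from the position of each “chapter” header. The sketch below uses a made-up word list (`toy.words`) and assumes that every remaining occurrence of “chapter” is a real header, so filler occurrences like the Project Gutenberg ones would have to be removed first.

```r
library(dplyr)

# Hypothetical tokenized text: an unnumbered opening chapter, then two headers
toy.words <- tibble(word = c("intro", "words",
                             "chapter", "ii", "more", "words",
                             "chapter", "iii", "final", "words"))

# cumsum counts how many headers have been seen so far;
# +1 because the opening chapter has no "chapter" header of its own
toy.chapters <- toy.words %>%
  mutate(chapter = cumsum(word == "chapter") + 1)

# Number of words assigned to each chapter
toy.chapters %>% count(chapter)
```

With a chapter column in place, a single filter(chapter == 8) replaces each hand-written subset, and the chapter boundaries cannot overlap.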

Individual Word Frequencies per Chapter

Now I will repeat the same procedure that I did with the entire text in terms of frequency and bigram counts. Instead of doing it with all the chapters, I have chosen ch1, ch8, and ch14 as samples.

(tidy.doyle.ch1<- ch1 %>%
  anti_join(stop_words))
Joining, by = "word"
(tidy.doyle.ch8<- ch8 %>%
  anti_join(stop_words))
Joining, by = "word"
(tidy.doyle.ch14<- ch14 %>%
  anti_join(stop_words)) #Removed the stop words from all three of the chapters 
Joining, by = "word"
(wordcounts.ch1 <-tidy.doyle.ch1 %>% 
  count(word, sort=T))
(wordcounts.ch8 <-tidy.doyle.ch8 %>% 
  count(word, sort=T))
(wordcounts.ch14 <-tidy.doyle.ch14 %>% 
  count(word, sort=T)) #Counted the most used words in each chapter

Analysis: Looking at the per chapter comparison from the three samples, there are clear differences between the most used words, with ch1 having “Stamford”, a name, most commonly used, while chapter 8 has a common tangible noun, “eyes”, and chapter 14 has an abstract noun “hope”.

Bigrams for Chapter Frequencies

Now, I will create bigrams for the chapter samples and see how these differ from the bigrams for the entire novel. First, I need to find where the chapters divide in the bigram table, since the IDs will be different from those in the word table. Then, I can create the bigram chapters, remove the stop words, and look at the counts for the most used bigrams in chapters 1, 8, and 14, respectively.

(separated.words.id <-separated.words %>% mutate(ID= rownames(separated.words))) #added an ID number for each row in the bigram
separated.words.id %>% filter(firstword == "chapter") #Looked for the chapter starts
ch1.bigram <-separated.words.id[1:2624,]
ch8.bigram <-separated.words.id[20580:23984,]
ch14.bigram <-separated.words.id[38860:40890,] #Created the bigrams for each chapter 
(doyle.bigram.ch1 <-ch1.bigram %>%
  filter(!firstword %in% stop_words$word) %>% 
  filter(!secondword %in% stop_words$word))
(doyle.bigram.ch8 <-ch8.bigram %>%
  filter(!firstword %in% stop_words$word) %>% 
  filter(!secondword %in% stop_words$word))
(doyle.bigram.ch14 <-ch14.bigram %>%
  filter(!firstword %in% stop_words$word) %>% 
  filter(!secondword %in% stop_words$word)) #Removed stop words from each of the three bigrams 
(doyle.diagram.ch1.wordcount <-doyle.bigram.ch1 %>% count(firstword, secondword, sort=T))
(doyle.diagram.ch8.wordcount <-doyle.bigram.ch8 %>% count(firstword, secondword, sort=T))
(doyle.diagram.ch14.wordcount <-doyle.bigram.ch14 %>% count(firstword, secondword, sort=T)) #Sorted the most used words for each of the three bigrams in order to compare them.

Analysis: Looking at the bigrams, there are some clear differences in which pairs of words are most used as the novel unfolds. The first chapter has the most instances of the name Sherlock Holmes. This makes sense since “A Study in Scarlet” is the first appearance of the famous detective, so Doyle has to give the necessary background information. Chapter 8 has instances of Sierra Blanco, which is the setting of the death in “A Study in Scarlet”. Finally, in Chapter 14, Sherlock Holmes is again the most used bigram, along with “dead man’s”. I can also see why inspecting the output is important, since there are bigrams that do not make sense, such as “page 23”, which I am assuming is a page number that was accidentally not deleted when the manuscript was uploaded to Project Gutenberg.

Analyzing Parts of Speech

Another type of analysis that can be done on this novel is the study of the parts of speech. I was curious to see, given that this is a detective and adventure narration, if the novel would be action packed. The easiest way to do so is to compare the usage of verbs (action) against adjectives (description). First, I annotated the entire novel with the cnlp_annotate function, which takes a little bit of time for R to run. Then, I extracted from the result only the table that has the actual words and their parts of speech.

tidy.doyle.annotated <-cnlp_annotate(tidy.doyle$word) #This step is slow, so it only needs to be run once per session
tidy.doyle.annotated.full <- data.frame(tidy.doyle.annotated$token)

Here, I checked to see which parts of speech are included in this book.

unique(tidy.doyle.annotated.full$upos)
 [1] "NUM"   "NOUN"  "VERB"  "INTJ"  "ADV"   "ADJ"   "PART"  "PRON"  "X"     "PROPN" "SYM"   "AUX"   "CCONJ" "ADP"   "DET"  

Then, I counted the most used verbs as well as the total number of verbs used. I did the same thing for the adjectives.

(doyle.verbs<-tidy.doyle.annotated.full %>% filter(upos == "VERB") %>% count(token, sort = T) %>% 
    top_n(100))
Selecting by n
tidy.doyle.annotated.full %>% filter(upos == "VERB") %>% count()
(doyle.adj<- tidy.doyle.annotated.full %>% filter(upos == "ADJ") %>% count(token, sort = T) %>% 
    top_n(100))
Selecting by n
tidy.doyle.annotated.full %>% filter(upos == "ADJ") %>% count()

Analysis: Looking at the counts for the parts of speech, there are more than twice as many verbs as there are adjectives in this novel. This would suggest that action moves the plot forward more than heavy description does. This follows the general conventions of detective novels, which tend to be active in narration to propel the action. Looking at the specific verbs used, “answered” and “found” are the two most used verbs. Given that mysteries and thrillers require deduction and analysis of evidence, it is logical that words describing those actions are the most common. This analysis also made it easy for me to see that most of the novel is written in the past tense, since the most used verbs appear in that form. In regard to adjectives, the most used ones allude to ambiance (“dark”, “white”, “silent”) as well as emotion (“terrible”). Just looking at the word choice, it is easy to see that the overall writing style of Arthur Conan Doyle aims to create an environment of intrigue for the reader.
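
The verb-to-adjective comparison can be reduced to a single ratio. The sketch below runs on a made-up annotated table (`toy.tokens`) with the same token and upos columns as the cleanNLP output; the words and the resulting ratio are illustrative only and are not the novel's actual counts.

```r
library(dplyr)

# Hypothetical annotated tokens with the same columns as the cleanNLP token table
toy.tokens <- tibble(
  token = c("answered", "found", "dark", "walked", "silent", "asked"),
  upos  = c("VERB", "VERB", "ADJ", "VERB", "ADJ", "VERB"))

# Count how many tokens fall under each of the two parts of speech
pos.counts <- toy.tokens %>%
  filter(upos %in% c("VERB", "ADJ")) %>%
  count(upos)

# Ratio of verbs to adjectives; values above 1 mean more verbs than adjectives
verb.adj.ratio <- pos.counts$n[pos.counts$upos == "VERB"] /
  pos.counts$n[pos.counts$upos == "ADJ"]

verb.adj.ratio
```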

Word Frequency Figures

Now is the time to present this analysis graphically. I have chosen word clouds for the most used verbs and adjectives in order to show them in a digestible manner. For the comparison between chapters, I have chosen a bar graph. First, I needed to load the wordcloud2 package to create a colorful version of the word cloud.

library(wordcloud2)
wordcloud.adj<-wordcloud2(doyle.adj)
wordcloud.verbs<-wordcloud2(doyle.verbs)
wordcloud.adj
wordcloud.verbs

These word clouds visually show what the analysis above stated. Now I will make the bar graphs for the word frequencies in the chapters.

ch1plot <- wordcounts.ch1 %>% top_n(20) %>%
    mutate(word = reorder(word, n)) %>% ggplot(aes(x = word, y = n)) + geom_col(fill="#FF2400") + 
    xlab(NULL) + coord_flip() + labs(y = "Count", x = "Most Used Words")
Selecting by n
ch8plot <- wordcounts.ch8 %>% top_n(20) %>%
    mutate(word = reorder(word, n)) %>% ggplot(aes(x = word, y = n)) + geom_col(fill="#660e00") + 
    xlab(NULL) + coord_flip() + labs(y = "Count", x = "Most Used Words")
Selecting by n
ch14plot <- wordcounts.ch14 %>% top_n(20) %>%
    mutate(word = reorder(word, n)) %>% ggplot(aes(x = word, y = n)) + geom_col(fill="#330700") + 
    xlab(NULL) + coord_flip() + labs(y = "Count", x = "Most Used Words")
Selecting by n
combinedplots<- ggarrange(ch1plot, ch8plot, ch14plot, ncol=3, nrow=1)
(finalplots<-annotate_figure(combinedplots, bottom = "Figure 1: Most used words in Chapter 1, 8, and 14, respectively", top= "A Study in Scarlet"))
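
The top-word selection for several chapters can also be done in one step when the counts sit in a single table with a chapter column. This is a sketch under assumptions: the `toy.counts` table and its numbers are made up for illustration, and slice_max from dplyr is used in place of top_n.

```r
library(dplyr)

# Hypothetical per-chapter word counts (the numbers are invented)
toy.counts <- bind_rows(
  tibble(chapter = "ch1", word = c("stamford", "holmes", "hospital"), n = c(9, 7, 4)),
  tibble(chapter = "ch8", word = c("eyes", "girl", "road"), n = c(12, 8, 5)))

# Keep the two most frequent words within each chapter
top.words <- toy.counts %>%
  group_by(chapter) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%
  ungroup()

top.words
```

Because the ordering column is named explicitly, this avoids the "Selecting by n" message, and with_ties = FALSE guarantees exactly two rows per chapter.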

Sentiment Analysis

The final part of this problem set focuses on sentiment analysis. For this, I decided to use the NRC Emotion Lexicon because I was curious to see how the vocabulary used by Conan Doyle relates to “negative” emotions (fear, sadness, anger).

#get_sentiments("nrc")
(sentiment.doyle<- tidy.doyle %>% inner_join(get_sentiments("nrc")))
Joining, by = "word"
(fear.doyle <-sentiment.doyle %>% filter(sentiment == "fear") %>% count(word, sort = T))
(negative.doyle <-sentiment.doyle  %>% filter(sentiment == "negative") %>% count(word, sort = T))
(surprise.doyle<-sentiment.doyle %>% filter(sentiment == "surprise") %>% count(word, sort = T))
(positive.doyle <- sentiment.doyle %>% filter(sentiment == "positive") %>% count(word, sort = T))
(fearplot <- fear.doyle %>% top_n(20) %>%
    mutate(word = reorder(word, n)) %>% ggplot(aes(x = word, y = n)) + geom_col(fill="#03AC13") + 
    xlab(NULL) + coord_flip() + labs(y = "Count", x = "Words Associated to Fear", title = "A Study In Scarlet: Sentiment Analysis of Fear"))
Selecting by n

(surpriseplot <- surprise.doyle %>% top_n(20) %>%
    mutate(word = reorder(word, n)) %>% ggplot(aes(x = word, y = n)) + geom_col(fill="#32612D") + 
    xlab(NULL) + coord_flip() + labs(y = "Count", x = "Words Associated to Surprise", title = "A Study In Scarlet: Sentiment Analysis of Surprise"))
Selecting by n

positiveplot<-positive.doyle %>% top_n(20) %>%
    mutate(word = reorder(word, n)) %>% ggplot(aes(x = word, y = n)) + geom_col(fill="#1338BE") + 
    xlab(NULL) + coord_flip() + labs(y = "Count", x = "Positive Words")
Selecting by n
negativeplot<-negative.doyle %>% top_n(20) %>%
    mutate(word = reorder(word, n)) %>% ggplot(aes(x = word, y = n)) + geom_col(fill="#541E1B") + 
    xlab(NULL) + coord_flip() + labs(y = "Count", x = "Negative Words")
Selecting by n
combinedplots2<- ggarrange(positiveplot, negativeplot, ncol=2, nrow=1)
(finalplots2<-annotate_figure(combinedplots2, bottom = "Figure 4: Comparison of positive and negative words across A Study In Scarlet", top= "Sentiment Analysis: A Study in Scarlet"))
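
The lexicon join behind these figures can be sketched without downloading the NRC data. The tiny `toy.lexicon` below is made up and only mimics the word and sentiment columns of a lexicon; it shows that a word tagged with several sentiments is counted once under each of them.

```r
library(dplyr)

# Hypothetical mini-lexicon: "murder" carries two sentiments
toy.lexicon <- tibble(
  word      = c("murder", "fear", "friend", "hope", "murder"),
  sentiment = c("negative", "fear", "positive", "positive", "fear"))

# Hypothetical tidy words from a text
toy.words <- tibble(word = c("murder", "friend", "murder", "street", "hope"))

# Count each word first, then attach its sentiment(s); words missing from
# the lexicon (here "street") are dropped by the inner join
toy.joined <- toy.words %>%
  count(word) %>%
  inner_join(toy.lexicon, by = "word")

# Total number of word occurrences per sentiment
sentiment.totals <- toy.joined %>%
  group_by(sentiment) %>%
  summarise(total = sum(n))

sentiment.totals
```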

