For my January ISP (Independent Study Project), I analyzed the draft of a novel I had written.
I wanted to use R programming techniques, such as text mining and sentiment analysis, to gain insight into my novel and writing process. For this analysis, I looked at the individual words in the novel and disregarded their context. Since I was unfamiliar with R before this ISP, I also had to learn basic syntax and data manipulation strategies.
library(dplyr)
library(tidytext)
library(ggplot2)
library(wordcloud)
library(textclean)
library(tidyr)
I used a format known as “TidyText.” This meant formatting my data frame so that each word was in its own row. This format made the text easier to analyze.
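# Read the novel one line at a time and clean up non-ASCII characters
nov_base_text1 <- readLines("TMC.txt")
Encoding(nov_base_text1) <- "latin1"
nov_base_text2 <- textclean::replace_non_ascii(nov_base_text1)
# Number the lines from the data itself instead of hard-coding the line count (1949)
nov_df <- dplyr::tibble(line = seq_along(nov_base_text2), text = nov_base_text2)
# Split each line into lowercase words, one word per row
nov_word <- tidytext::unnest_tokens(nov_df, output = "word", input = text, token = "words")
# Remove stop words, but keep "face", which the body-part word cloud uses later
my_stop_words <- subset(tidytext::stop_words, word != "face")
nov_word_clean <- dplyr::anti_join(nov_word, my_stop_words, by = "word")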
To get the novel text into TidyText format, I read in the file line by line, cleaned up non-ASCII characters, split each line into words, and removed “stop words,” such as “and,” that were unnecessary for analysis.
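To make the one-word-per-row idea concrete, here is a minimal base-R sketch of roughly what the tokenizing and stop-word steps do. The two sample lines and the three-word stop list are made up for illustration; the real analysis uses tidytext::unnest_tokens() and the much larger tidytext::stop_words table, which treat punctuation and apostrophes more carefully than this simple split.

```r
# Hypothetical sample text standing in for two lines of the novel
sample_lines <- c("The doll sat in the shop.", "The Toymaker smiled and left.")

# Split each line into lowercase words (anything that is not a letter is a separator)
words_per_line <- strsplit(tolower(sample_lines), "[^a-z]+")

# One row per word, remembering which line each word came from
tidy_words <- data.frame(
  line = rep(seq_along(words_per_line), lengths(words_per_line)),
  word = unlist(words_per_line),
  stringsAsFactors = FALSE
)

# A tiny, made-up stop-word list (the real one is far longer)
toy_stop_words <- c("the", "in", "and")
tidy_clean <- tidy_words[!tidy_words$word %in% toy_stop_words, ]

print(tidy_clean)
```

Each remaining row holds one word plus the line it came from, which is the shape the counting and joining steps rely on.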
I made a graph of the top twenty most commonly used terms. For clarity, I organized this graph so that the most frequently used terms appeared first.
# Count each word, most frequent first
common_words <- dplyr::count(nov_word_clean, word, sort = TRUE)
top_20_all <- head(common_words, 20)
# reorder() sorts the bars by frequency; after coord_flip() the most frequent word is at the top
word_graph <- ggplot(data = top_20_all, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "tan3")
word_graph + labs(title = "Most Frequently Used Terms", x = "Word", y = "Frequency") + coord_flip()
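The ordering of the bars comes from reorder(), which is easy to try on its own. The sketch below uses made-up counts (toy_counts is a hypothetical table, not data from the novel) to show how the words are re-sorted by frequency.

```r
# Hypothetical word counts, deliberately not in frequency order
toy_counts <- data.frame(
  word = c("doll", "toymaker", "shop"),
  n = c(40, 120, 75),
  stringsAsFactors = FALSE
)

# reorder() returns a factor whose levels are sorted by n, smallest first
ordered_word <- reorder(toy_counts$word, toy_counts$n)
print(levels(ordered_word))
```

The levels come out as "doll", "shop", "toymaker". Because the flipped axis draws the first level at the bottom, the word with the largest count ends up as the top bar.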
Unsurprisingly, the bar graph showed that many of the most commonly used words were character names. This graph highlighted to what extent the “Toymaker” was a central character, as his name was by far the most commonly used. Since the story mainly takes place in a toy shop, it also makes sense that “shop” and “doll” were commonly used.
I was surprised to see that “time” and “moment” were among the top 20 most commonly used words, since I hadn’t considered the passage of time to be a major theme of the novel. However, upon reflection, the passage of time does play an important part in the arc of the main characters.
I made several word clouds, each looking at a group of related words. This let me compare closely associated words with one another, rather than comparing all of the most frequently used terms.
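# Each row of common_words holds a single token, so a multi-word entry such as "clementine's mother" could never match and is left out
char_names <- dplyr::filter(common_words, word %in% c("toymaker", "marie", "clementine", "kenton", "joseph", "stephen", "addy", "rosalind", "eve", "benjamin", "gregory", "clemence"))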
char_names %>%
with(wordcloud(word, n, max.words = 100, min.freq = 1, colors = "pink3", random.order = FALSE))
This analysis gave insight as to which characters were the most important. However, this method of analysis wasn’t perfect. For instance, “Kenton” is smaller than “Marie,” which implies he is a less central character than she is. In reality, Kenton is the narrator, which makes him an important character.
Because he narrates in the first person, his name appears less frequently than it would if he weren’t narrating: the text says “I said” rather than “Kenton said,” and “I” is removed as a stop word.
body_parts <- dplyr::filter(common_words, word %in% c("hand", "neck", "throat", "finger", "fingers", "face", "hair", "teeth", "leg", "legs", "arm", "arms", "back", "spine", "eye", "eyes", "voice", "head", "lips", "bones", "bone", "hands", "palm", "palms", "shoulder", "shoulders", "tongue", "skin"))
body_parts %>%
with(wordcloud(word, n, max.words = 100, min.freq = 1, colors = alpha("red3", seq(0.4, 1, 0.05)), random.order = FALSE))
Initially, it surprised me to see that “eyes” was such a frequently used term. However, I realized this is partly because there are no good synonyms or other descriptors for eyes. Taken together, the words for characters’ hands (including “hand,” “hands,” “palm,” and “finger”) are by far the most commonly used descriptors.
color_list_gray <- c("gray18", "gray73", "gray18", "gray73", "gray73", "gray73", "gray18", "gray73", "gray18", "gray18")
light <- dplyr::filter(common_words, word %in% c("light", "dark", "dim", "pale", "bright", "night", "day", "dawn", "dusk", "nightfall"))
light %>%
with(wordcloud(word, n, max.words = 100, min.freq = 1, colors = color_list_gray, ordered.colors = TRUE, random.order = FALSE))
Much of the story takes place at night or at dusk or dawn, which this word cloud confirms. “Light” was used more frequently than I expected. However, rather than describing outright brightness, such as “a light sky,” it usually appeared in phrases like “dim light” or referred to light from flickering oil lamps.
color_list <- c("red3", "black", "gray90", "gray30", "blue3", "burlywood4", "chartreuse4", "gray58", "goldenrod1", "goldenrod4", "darkorchid1", "chocolate4", "darkorange2", "yellow3")
colors <- dplyr::filter(common_words, word %in% c("black", "white", "yellow", "gold", "silver", "bronze", "orange", "red", "blue", "green", "gray", "grey", "brown", "tan", "pink", "purple", "mahogany"))
colors %>%
with(wordcloud(word, n, max.words = 100, min.freq = 1, colors = color_list, ordered.colors = TRUE, random.order = FALSE))
While writing, I imagined this novel having an emotional color palette of dark browns, reds, and oranges. This was partially, but not entirely, reflected in the color word cloud. Sometimes I had to describe specific objects, such as green leaves or black ribbon, that didn’t match this color palette. This word cloud also doesn’t include words associated with color, such as “wood” or “sunset,” which imply certain colors without directly stating them.
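One fragile point in the word clouds that use ordered.colors = TRUE: wordcloud() then assigns colors to words in order, so the hand-written color vectors only line up if there is exactly one color per word that actually occurs and the order of the rows is known in advance. A possible way around this, sketched below with made-up counts, is to look each color up by name. The names word_colors and found_words are hypothetical and not part of the original analysis.

```r
# Hypothetical lookup table: each color word mapped to a display color
word_colors <- c(red = "red3", black = "black", white = "gray90",
                 gold = "goldenrod1", brown = "chocolate4")

# Made-up stand-in for the filtered common_words table; "white" and "brown" never occur
found_words <- data.frame(
  word = c("black", "red", "gold"),
  n = c(40, 25, 9),
  stringsAsFactors = FALSE
)

# Index the named vector by word so each color follows its word, whatever the row order
cloud_colors <- unname(word_colors[found_words$word])
print(cloud_colors)

# The result could then be passed on, for example:
# wordcloud(found_words$word, found_words$n, colors = cloud_colors, ordered.colors = TRUE)
```

With this approach a word that never appears in the novel simply drops out, and the remaining colors still match their words.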
The sentiments for this section came from the “bing” lexicon, which labels each word as either positive or negative. Because it looks at individual words, it doesn’t take context into consideration.
# Keep only the words found in the bing lexicon, then count the positive ones
top_positive_words <- nov_word_clean %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  filter(sentiment == "positive") %>%
  count(word, sort = TRUE)
graph_positive <- ggplot(data = head(top_positive_words, 15), aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "lightskyblue")
graph_positive + labs(title = "Most Frequent Positive Words", x = "Word", y = "Frequency") + ylim(NA, 50) + coord_flip()
“Love” and “beautiful,” while marked positive by the bing lexicon, aren’t always used positively in the novel. Additionally, I’m not sure why “sharp” was marked as a positive word.
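The context problem is easy to reproduce with a toy example. The sketch below uses a made-up three-word lexicon (not the real bing data) and base R's merge() as a stand-in for inner_join(); the sentence is hypothetical as well.

```r
# Made-up miniature lexicon; the real bing lexicon has thousands of entries
toy_lexicon <- data.frame(
  word = c("love", "beautiful", "cold"),
  sentiment = c("positive", "positive", "negative"),
  stringsAsFactors = FALSE
)

# Words from a hypothetical sentence: "She felt no love for the beautiful, cold doll"
sentence_words <- data.frame(
  word = c("she", "felt", "no", "love", "for", "the", "beautiful", "cold", "doll"),
  stringsAsFactors = FALSE
)

# Keep only the words found in the lexicon, much like inner_join(..., by = "word")
scored <- merge(sentence_words, toy_lexicon, by = "word")
print(scored)

# Tally the labels: "love" still counts as positive even though the sentence negates it
print(table(scored$sentiment))
```

The sentence is bleak, yet it scores two positive words against one negative word, because the lookup never sees the word "no" in front of "love."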
# Same steps as above, this time keeping the negative words
top_negative_words <- nov_word_clean %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  filter(sentiment == "negative") %>%
  count(word, sort = TRUE)
graph_negative <- ggplot(data = head(top_negative_words, 15), aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "yellowgreen")
graph_negative + labs(title = "Most Frequent Negative Words", x = "Word", y = "Frequency") + ylim(NA, 50) + coord_flip()
The bing lexicon doesn’t mark “gun” as a negative word, even though it is used negatively in the novel, while the nrc lexicon does. Missing certain words is one of the limitations of relying on any single lexicon.
# Group the text into blocks of 100 lines and compute net sentiment (positive minus negative) per block
nov_sentiment <- nov_word_clean %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(index = line %/% 100, sentiment) %>%
  ungroup() %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
(graph_sentiment <- ggplot(nov_sentiment, aes(x = index, y = sentiment)) +
  geom_col(fill = "thistle4") +
  labs(title = "Sentiment Change Over Time", x = "Index", y = "Sentiment"))
This graph shows that the tone of the novel is overwhelmingly negative. Although the most commonly used positive word, “smile,” appears more frequently than the most commonly used negative word, “cold,” the novel uses more negative words than positive words overall.
The dip around indices 12 and 13 (each index covers 100 lines of the text file) roughly marks where one of the major characters in the novel dies, which makes sense. However, the less negative section at index 4 covers a depressing memory the narrator recollects, so I would have expected it to be more negative.
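The index on the x-axis of the sentiment graph comes from integer division of the line number by 100, and each bar is the number of positive words minus the number of negative words in that block. Here is a small base-R sketch of the same arithmetic on made-up data; toy_hits is a hypothetical table, and table() stands in for the count() and spread() steps.

```r
# Hypothetical matches: the line each sentiment word came from and its label
toy_hits <- data.frame(
  line = c(5, 40, 120, 130, 180, 260),
  sentiment = c("positive", "negative", "negative", "negative", "positive", "positive"),
  stringsAsFactors = FALSE
)

# Integer division: lines 1-99 fall in index 0, lines 100-199 in index 1, and so on
toy_hits$index <- toy_hits$line %/% 100

# Count positive and negative words per index, then take the difference
counts <- table(toy_hits$index, toy_hits$sentiment)
net_sentiment <- counts[, "positive"] - counts[, "negative"]
print(net_sentiment)
```

Here index 0 nets to 0, index 1 to -1, and index 2 to 1. By the same arithmetic, indices 12 and 13 cover lines 1200 through 1399 of the text file.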
Thank you for reading my ISP Report! I enjoyed getting the chance to look at my novel draft in a new way. I hope you found it interesting as well.