Hermann Hesse was a German-born novelist who wrote many novels during his lifetime. Among his best-known works are Demian and Siddhartha, the two novels analyzed in this project.
Three visualizations and analyses are used in this project: a word cloud, TF-IDF, and a bigram network. A word cloud simply presents the words that appear most often in a novel; it says nothing about how those words relate to the story, but it lets the reader see at a glance how frequently each word occurs. TF-IDF makes it easy to compare the two works and see which expressions are specific to each one, and these characteristics can be inspected in a graph. A bigram analysis shows which words occur together, which makes it easier to grasp the context and themes of a work. These three techniques were chosen to identify the distinctive character of each novel by looking at its themes and word frequencies. Sentiment analysis seemed to add little for this purpose, so it was excluded.
To summarize the overall results of the analysis, it was found that
The data used in this project were taken from Project Gutenberg. In the case of Demian, an HTML copy of the English translation was saved locally and loaded from that file, as shown below.
#Data Loading
library(tidyverse)    # dplyr, tidyr, stringr, ggplot2
library(rvest)        # HTML parsing
library(tidytext)     # tokenization, stop words, TF-IDF
library(gutenbergr)   # Project Gutenberg downloads
library(wordcloud)
library(RColorBrewer)
library(igraph)
library(ggraph)
# Read the locally saved HTML copy of Demian and keep only the body text
html_path <- "D:/TEXT ANALYSIS/textdata/ATA Final Project/demian_data.html"
html <- read_html(html_path) %>%
  html_elements("body") %>%
  html_text2()
# Split the body text into individual lines
html <- strsplit(html, "\n")[[1]]
# Store the lines in a data frame, labelled with the book title
demian <- data.frame(text = html) %>%
  mutate(title = "Demian (English)")
After importing Demian, the text of Siddhartha was downloaded directly from Project Gutenberg with the gutenbergr package (ebook no. 2500).
siddhartha <- gutenberg_download(2500) %>%
  mutate(title = "Siddhartha")
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
The two books were then combined into a single data frame:
hesse_books <- bind_rows(demian, siddhartha)
Combining the two books into one dataset makes it possible to analyze and compare them at the same time.
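As a quick sanity check, the combined data frame can be counted by title to confirm that both books are present; this is only a minimal sketch, assuming the hesse_books object was created as shown above.
# Sanity check (sketch): number of raw text lines contributed by each book
hesse_books %>%
  count(title)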
#Data cleaning and preprocessing
data("stop_words")
hesse_data <- hesse_books %>%
mutate(text = str_to_lower(text)) %>%
mutate(text = str_replace_all(text, "[^a-z\\s]", " ")) %>%
mutate(text = str_squish(text)) %>%
filter(text != "") %>%
unnest_tokens(word, text) %>%
anti_join(stop_words, by = "word") %>%
filter(str_length(word) > 1)
custom_stop <- tibble(word = c("gutenberg", "project", "pg", "ebook", "chapter"))
stopwords <- bind_rows(stop_words, custom_stop)
After that, the combined dataset was preprocessed: the text was lowercased, non-alphabetic characters were removed, and the lines were tokenized into words. In addition, the stop word list was extended with words that are unrelated to the content of the novels, such as Project Gutenberg boilerplate terms.
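To verify that the cleaning worked as intended, the most frequent remaining tokens in each book can be inspected. The snippet below is only a sketch, assuming the hesse_data and stopwords objects created above.
# Sketch: top remaining tokens per book after cleaning and stop word removal
hesse_data %>%
  anti_join(stopwords, by = "word") %>%
  count(title, word, sort = TRUE) %>%
  group_by(title) %>%
  slice_max(n, n = 5)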
#Text data analysis
#Create Demian wordcloud data
wordcloud_dem <- hesse_books %>%
  filter(title == "Demian (English)") %>%
  mutate(text = str_to_lower(text),
         text = str_replace_all(text, "[^a-z\\s]", " "),
         text = str_squish(text)) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stopwords$word,
         str_detect(word, "^[a-z']+$")) %>%
  count(word, sort = TRUE)
#Visualizing
set.seed(123)
wordcloud(words = wordcloud_dem$word,
          freq = wordcloud_dem$n,
          scale = c(3, 0.3),
          min.freq = 1,
          max.words = 150,
          random.order = FALSE,
          rot.per = 0.25,
          colors = brewer.pal(8, "Reds"))
#Create Siddhartha wordcloud data
wordcloud_sid <- hesse_books %>%
  filter(title == "Siddhartha") %>%
  mutate(text = str_to_lower(text),
         text = str_replace_all(text, "[^a-z\\s]", " "),
         text = str_squish(text)) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stopwords$word,
         str_detect(word, "^[a-z']+$")) %>%
  count(word, sort = TRUE)
#Visualizing
set.seed(123)
wordcloud(words = wordcloud_sid$word,
          freq = wordcloud_sid$n,
          scale = c(3, 0.4),
          min.freq = 1,
          max.words = 150,
          random.order = FALSE,
          rot.per = 0.25,
          colors = brewer.pal(8, "Blues"))
Word clouds were used for the first analysis because they show word frequencies in the text and, at the same time, make it easy to see visually which words are central to each novel. Before building the word clouds, preprocessing such as tokenization and stop-word removal was carried out, and the visualization was produced with the wordcloud package. So that the clouds would show words from the body text rather than boilerplate terms, the base word size was reduced, and a single-color palette was used so that high-frequency words stand out. Among the available analysis methods, a word cloud is the visualization that lets a reader grasp the most frequent words in a novel at a glance; it was chosen because it seemed appropriate for presenting the character of each work to general readers who do not need a detailed analysis.
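As a complement to the word clouds, the same frequency counts can also be read off directly as tables; the sketch below assumes the wordcloud_dem and wordcloud_sid data frames created above.
# Sketch: the ten most frequent words behind each word cloud
wordcloud_dem %>% slice_max(n, n = 10)
wordcloud_sid %>% slice_max(n, n = 10)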
#Create TF-IDF data
hesse_tfidf <- hesse_data %>%
  anti_join(stopwords, by = "word") %>%
  count(title, word, sort = TRUE) %>%
  bind_tf_idf(word, title, n) %>%
  arrange(desc(tf_idf))
#Visualizing TF-IDF data
hesse_tfidf %>%
  group_by(title) %>%
  slice_max(tf_idf, n = 10) %>%
  ungroup() %>%
  ggplot(aes(x = reorder_within(word, tf_idf, title), y = tf_idf, fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, scales = "free") +
  scale_x_reordered() +
  coord_flip() +
  labs(title = "TF-IDF for each book", x = "Word", y = "TF-IDF")
The TF-IDF scores were computed on the dataset in which the two books are combined, so that the two books can be compared directly. The analysis was carried out with the tidytext package and visualized with ggplot2. Because the goal was to compare the two books side by side, the default faceted bar chart was used without further customization. The core of TF-IDF is that it highlights words that stand out in one of the two documents, which makes it possible to identify the themes of a work and see how it differs from the other. Since the importance of words in each document is presented intuitively, a reader can compare the character of the two works at a glance, which is why this was chosen as the second analysis.
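To make the TF-IDF idea concrete, the following toy sketch uses two made-up mini-documents (not text from the novels): a word that occurs in only one document receives a non-zero IDF and therefore a positive TF-IDF, while a word shared by both documents is weighted down to zero.
# Toy sketch of bind_tf_idf(): "bird" occurs only in doc A and "river" only in
# doc B, so both get a positive idf; "boy" appears in both docs and gets idf = 0
toy <- tibble(
  doc  = c("A", "A", "A", "B", "B", "B"),
  word = c("boy", "bird", "boy", "boy", "river", "river")
) %>%
  count(doc, word) %>%
  bind_tf_idf(word, doc, n)
toy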
#Before creating bigram network
hesse_bigrams <- hesse_books %>%
  mutate(text = str_to_lower(text)) %>%
  mutate(text = str_replace_all(text, "[^a-z\\s]", " ")) %>%
  mutate(text = str_squish(text)) %>%
  filter(text != "") %>%
  unnest_tokens(output = bigram, text, token = "ngrams", n = 2)
hesse_bigram <- hesse_bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         word1 != "", word2 != "")
#Create Demian bigram data
demian_bigram <- hesse_books %>%
  filter(title == "Demian (English)") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         word1 != "", word2 != "") %>%
  count(word1, word2, sort = TRUE) %>%
  filter(n >= 3)
#Visualizing
demian1 <- graph_from_data_frame(demian_bigram)
set.seed(123)
ggraph(demian1, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
  geom_node_point(color = "skyblue", size = 4) +
  geom_node_text(aes(label = name), vjust = 1.5, hjust = 1, size = 3) +
  theme_void() +
  labs(title = "Demian - Bigram Network")
#Create Siddhartha bigram data
siddhartha_bigram <- hesse_books %>%
  filter(title == "Siddhartha") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         word1 != "", word2 != "") %>%
  count(word1, word2, sort = TRUE) %>%
  filter(n >= 3)
#Visualizing
siddhartha1 <- graph_from_data_frame(siddhartha_bigram)
set.seed(123)
ggraph(siddhartha1, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
  geom_node_point(color = "tomato", size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void() +
  labs(title = "Siddhartha - Bigram Network")
A new dataset was created by tokenizing the text into bigrams (2-grams), and this data was then split by book to visualize a bigram network for each novel. Because there were not many word pairs, the node size was left largely at its defaults, and contrasting colors were applied to the two graphs to make clear that they show different data. The bigram network was chosen as the third analysis in order to capture the context of each work through the connections between words. Because it counts word pairs rather than single words, it makes the context of a work easier to grasp than single-word frequencies, and it also shows visually how a word relates not only to the word that follows it but to the other words around it, which makes it an effective method for text analysis.
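The combined hesse_bigram data frame built before the per-book networks can also be summarized as a table, which gives a quick numerical view of the strongest word pairs in each novel; this is only a sketch, assuming that object exists as created above.
# Sketch: most frequent word pairs per book, using the combined bigram data
# (hesse_bigram keeps the title column)
hesse_bigram %>%
  count(title, word1, word2, sort = TRUE) %>%
  group_by(title) %>%
  slice_max(n, n = 5)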