Hermann Hesse was a German-born novelist who wrote many novels during his lifetime. Among his best-known works are Demian and Siddhartha, the two novels analyzed in this project.
Three visualizations and analyses are used in this project: a word cloud, TF-IDF, and a bigram network. A word cloud simply presents the words that appear most often in a novel; it says nothing about how those words relate to the story, but it lets the reader see at a glance how frequently each word occurs. TF-IDF makes it easy to compare the two works and see which expressions are specific to each one, and these characteristics can be inspected in a graph. A bigram analysis shows which words occur together, which makes it easier to grasp the context and themes of a work. These three techniques were chosen to identify the distinctive character of each novel by looking at its themes and word frequencies. Sentiment analysis seemed to add little for this purpose, so it was excluded.
To summarize the overall results of the analysis, it was found that
The data used in this project were taken from Project Gutenberg. In the case of Demian, an HTML copy of the English translation was saved locally and loaded from that file, as shown below.
#Data Loading
library(tidyverse)    # dplyr, tidyr, stringr, ggplot2
library(rvest)        # HTML parsing
library(tidytext)     # tokenization, stop words, TF-IDF
library(gutenbergr)   # Project Gutenberg downloads
library(wordcloud)
library(RColorBrewer)
library(igraph)
library(ggraph)
# Read the locally saved HTML copy of Demian and keep only the body text
html_path <- "D:/TEXT ANALYSIS/textdata/ATA Final Project/demian_data.html"
html <- read_html(html_path) %>%
  html_elements("body") %>%
  html_text2()
# Split the body text into individual lines
html <- strsplit(html, "\n")[[1]]
# Store the lines in a data frame, labelled with the book title
demian <- data.frame(text = html) %>%
  mutate(title = "Demian (English)")
After importing Demian, the text of Siddhartha was downloaded directly from Project Gutenberg with the gutenbergr package (ebook no. 2500).
siddhartha <- gutenberg_download(2500) %>%
  mutate(title = "Siddhartha")
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
The two books were then combined into a single data frame:
hesse_books <- bind_rows(demian, siddhartha)
Combining the two books into one dataset makes it possible to analyze and compare them at the same time.
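As a quick sanity check, the combined data frame can be counted by title to confirm that both books are present; this is only a minimal sketch, assuming the hesse_books object was created as shown above.
# Sanity check (sketch): number of raw text lines contributed by each book
hesse_books %>%
  count(title)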
#Data cleaning and preprocessing
data("stop_words")
hesse_data <- hesse_books %>%
mutate(text = str_to_lower(text)) %>%
mutate(text = str_replace_all(text, "[^a-z\\s]", " ")) %>%
mutate(text = str_squish(text)) %>%
filter(text != "") %>%
unnest_tokens(word, text) %>%
anti_join(stop_words, by = "word") %>%
filter(str_length(word) > 1)
custom_stop <- tibble(word = c("gutenberg", "project", "pg", "ebook", "chapter"))
stopwords <- bind_rows(stop_words, custom_stop)
After that, the combined dataset was preprocessed: the text was lowercased, non-alphabetic characters were removed, and the lines were tokenized into words. In addition, the stop word list was extended with words that are unrelated to the content of the novels, such as Project Gutenberg boilerplate terms.
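To verify that the cleaning worked as intended, the most frequent remaining tokens in each book can be inspected. The snippet below is only a sketch, assuming the hesse_data and stopwords objects created above.
# Sketch: top remaining tokens per book after cleaning and stop word removal
hesse_data %>%
  anti_join(stopwords, by = "word") %>%
  count(title, word, sort = TRUE) %>%
  group_by(title) %>%
  slice_max(n, n = 5)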
#Text data analysis
#Create Demian wordcloud data
wordcloud_dem <- hesse_books %>%
  filter(title == "Demian (English)") %>%
  mutate(text = str_to_lower(text),
         text = str_replace_all(text, "[^a-z\\s]", " "),
         text = str_squish(text)) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stopwords$word,
         str_detect(word, "^[a-z']+$")) %>%
  count(word, sort = TRUE)
#Visualizing
set.seed(123)
wordcloud(words = wordcloud_dem$word,
          freq = wordcloud_dem$n,
          scale = c(3, 0.3),
          min.freq = 1,
          max.words = 150,
          random.order = FALSE,
          rot.per = 0.25,
          colors = brewer.pal(8, "Reds"))
#Create Siddhartha wordcloud data
wordcloud_sid <- hesse_books %>%
  filter(title == "Siddhartha") %>%
  mutate(text = str_to_lower(text),
         text = str_replace_all(text, "[^a-z\\s]", " "),
         text = str_squish(text)) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stopwords$word,
         str_detect(word, "^[a-z']+$")) %>%
  count(word, sort = TRUE)
#Visualizing
set.seed(123)
wordcloud(words = wordcloud_sid$word,
          freq = wordcloud_sid$n,
          scale = c(3, 0.4),
          min.freq = 1,
          max.words = 150,
          random.order = FALSE,
          rot.per = 0.25,
          colors = brewer.pal(8, "Blues"))
Word clouds were used for the first analysis because they show word frequencies in the text and, at the same time, make it easy to see visually which words are central to each novel. Before building the word clouds, preprocessing such as tokenization and stop-word removal was carried out, and the visualization was produced with the wordcloud package. So that the clouds would show words from the body text rather than boilerplate terms, the base word size was reduced, and a single-color palette was used so that high-frequency words stand out. Among the available analysis methods, a word cloud is the visualization that lets a reader grasp the most frequent words in a novel at a glance; it was chosen because it seemed appropriate for presenting the character of each work to general readers who do not need a detailed analysis.
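As a complement to the word clouds, the same frequency counts can also be read off directly as tables; the sketch below assumes the wordcloud_dem and wordcloud_sid data frames created above.
# Sketch: the ten most frequent words behind each word cloud
wordcloud_dem %>% slice_max(n, n = 10)
wordcloud_sid %>% slice_max(n, n = 10)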
#Create TF-IDF data
hesse_tfidf <- hesse_data %>%
  anti_join(stopwords, by = "word") %>%
  count(title, word, sort = TRUE) %>%
  bind_tf_idf(word, title, n) %>%
  arrange(desc(tf_idf))
#Visualizing TF-IDF data
hesse_tfidf %>%
  group_by(title) %>%
  slice_max(tf_idf, n = 10) %>%
  ungroup() %>%
  ggplot(aes(x = reorder_within(word, tf_idf, title), y = tf_idf, fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, scales = "free") +
  scale_x_reordered() +
  coord_flip() +
  labs(title = "TF-IDF for each book", x = "Word", y = "TF-IDF")
The TF-IDF scores were computed on the dataset in which the two books are combined, so that the two books can be compared directly. The analysis was carried out with the tidytext package and visualized with ggplot2. Because the goal was to compare the two books side by side, the default faceted bar chart was used without further customization. The core of TF-IDF is that it highlights words that stand out in one of the two documents, which makes it possible to identify the themes of a work and see how it differs from the other. Since the importance of words in each document is presented intuitively, a reader can compare the character of the two works at a glance, which is why this was chosen as the second analysis.
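To make the TF-IDF idea concrete, the following toy sketch uses two made-up mini-documents (not text from the novels): a word that occurs in only one document receives a non-zero IDF and therefore a positive TF-IDF, while a word shared by both documents is weighted down to zero.
# Toy sketch of bind_tf_idf(): "bird" occurs only in doc A and "river" only in
# doc B, so both get a positive idf; "boy" appears in both docs and gets idf = 0
toy <- tibble(
  doc  = c("A", "A", "A", "B", "B", "B"),
  word = c("boy", "bird", "boy", "boy", "river", "river")
) %>%
  count(doc, word) %>%
  bind_tf_idf(word, doc, n)
toy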
#Before creating bigram network
hesse_bigrams <- hesse_books %>%
  mutate(text = str_to_lower(text)) %>%
  mutate(text = str_replace_all(text, "[^a-z\\s]", " ")) %>%
  mutate(text = str_squish(text)) %>%
  filter(text != "") %>%
  unnest_tokens(output = bigram, text, token = "ngrams", n = 2)
hesse_bigram <- hesse_bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         word1 != "", word2 != "")
#Create Demian bigram data
demian_bigram <- hesse_books %>%
  filter(title == "Demian (English)") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         word1 != "", word2 != "") %>%
  count(word1, word2, sort = TRUE) %>%
  filter(n >= 3)
#Visualizing
demian1 <- graph_from_data_frame(demian_bigram)
set.seed(123)
ggraph(demian1, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
  geom_node_point(color = "skyblue", size = 4) +
  geom_node_text(aes(label = name), vjust = 1.5, hjust = 1, size = 3) +
  theme_void() +
  labs(title = "Demian - Bigram Network")
#Create Siddhartha bigram data
siddhartha_bigram <- hesse_books %>%
  filter(title == "Siddhartha") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         word1 != "", word2 != "") %>%
  count(word1, word2, sort = TRUE) %>%
  filter(n >= 3)
#Visualizing
siddhartha1 <- graph_from_data_frame(siddhartha_bigram)
set.seed(123)
ggraph(siddhartha1, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
  geom_node_point(color = "tomato", size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void() +
  labs(title = "Siddhartha - Bigram Network")
A new dataset was created by tokenizing the text into bigrams (2-grams), and this data was then split by book to visualize a bigram network for each novel. Because there were not many word pairs, the node size was left largely at its defaults, and contrasting colors were applied to the two graphs to make clear that they show different data. The bigram network was chosen as the third analysis in order to capture the context of each work through the connections between words. Because it counts word pairs rather than single words, it makes the context of a work easier to grasp than single-word frequencies, and it also shows visually how a word relates not only to the word that follows it but to the other words around it, which makes it an effective method for text analysis.
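The combined hesse_bigram data frame built before the per-book networks can also be summarized as a table, which gives a quick numerical view of the strongest word pairs in each novel; this is only a sketch, assuming that object exists as created above.
# Sketch: most frequent word pairs per book, using the combined bigram data
# (hesse_bigram keeps the title column)
hesse_bigram %>%
  count(title, word1, word2, sort = TRUE) %>%
  group_by(title) %>%
  slice_max(n, n = 5)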