library(gutenbergr)
library(stringr)
library(dplyr)
library(tidyr)
library(memnet)
library(jsonlite)
library(tidytext)
library(widyr)
library(ggplot2)
library(igraph)
library(ggraph)
# Download the text from Project Gutenberg, keeping the title as a column
titles <- c('Crime and Punishment')
books <- gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title") %>%
  mutate(document = row_number())
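# Number the chapters by counting lines that contain the word "chapter",
# drop the front matter (chapter 0), and label each line as "<title>_<chapter>"
create_chapters <- books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(
    text, regex("\\bchapter\\b", ignore_case = TRUE)
  ))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter)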
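The chapter-numbering step above relies on taking a running sum over a logical vector: every line that matches the pattern bumps the counter by one. The base R sketch below shows the idea on a few made-up lines (they are illustrative, not taken from the downloaded text). It also compares the loose pattern used in the pipeline with a stricter, anchored one. The stricter pattern assumes that headings look like `CHAPTER` followed by a Roman numeral, which should be checked against the actual Gutenberg file before relying on it.

```r
# Toy lines standing in for the downloaded text (hypothetical, not the real book).
lines <- c(
  "PART I",
  "CHAPTER I",
  "On an exceptionally hot evening",
  "CHAPTER II",
  "as the last chapter showed, he",
  "went out"
)

# Loose pattern, as in the pipeline: any line containing the word "chapter".
loose <- cumsum(grepl("\\bchapter\\b", lines, ignore.case = TRUE, perl = TRUE))

# Stricter pattern (assumption: a heading is "CHAPTER" plus a Roman numeral).
strict <- cumsum(grepl("^chapter\\s+[ivxlc]+\\.?$", lines,
                       ignore.case = TRUE, perl = TRUE))

# Side-by-side view: the loose pattern starts a new "chapter" at line 5.
data.frame(line = lines, loose = loose, strict = strict)
```

With the loose pattern, a sentence that merely mentions the word "chapter" is counted as a new heading, so it may be worth comparing the number of distinct `document` values with the book's table of contents.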
# Collapse each chapter into one string, then tokenize and drop stop words
by_chapter <- create_chapters %>%
  group_by(document) %>%
  summarise(text = paste(text, collapse = ' '))
by_chapter <- by_chapter %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
by_chapter
## # A tibble: 62,063 x 2
## document word
## <chr> <chr>
## 1 Crime and Punishment_1 chapter
## 2 Crime and Punishment_1 exceptionally
## 3 Crime and Punishment_1 hot
## 4 Crime and Punishment_1 evening
## 5 Crime and Punishment_1 july
## 6 Crime and Punishment_1 garret
## 7 Crime and Punishment_1 lodged
## 8 Crime and Punishment_1 walked
## 9 Crime and Punishment_1 slowly
## 10 Crime and Punishment_1 hesitation
## # ... with 62,053 more rows
by_chapter %>% count(word, sort = TRUE)
## # A tibble: 9,090 x 2
## word n
## <chr> <int>
## 1 raskolnikov 725
## 2 time 385
## 3 sonia 370
## 4 razumihin 324
## 5 dounia 302
## 6 looked 293
## 7 suddenly 292
## 8 day 285
## 9 petrovitch 272
## 10 ivanovna 269
## # ... with 9,080 more rows
# For each pair of words, count the chapters in which both words appear
pair_words_list <- by_chapter %>%
  pairwise_count(word, document, sort = TRUE, upper = FALSE)
head(pair_words_list)
## # A tibble: 6 x 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 chapter time 40
## 2 chapter day 40
## 3 time day 40
## 4 chapter moment 40
## 5 time moment 40
## 6 day moment 40
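It helps to be explicit about what `n` means in this table: `pairwise_count()` counts the number of documents (here, chapters) in which both words occur, not how often the words sit next to each other. The sketch below illustrates this on a tiny made-up data frame. The chapter labels and words are hypothetical and only serve to make the counts easy to verify by hand.

```r
library(widyr)

# Hypothetical toy data: one row per (chapter, word), no repeats within a chapter.
toy <- data.frame(
  document = c("ch1", "ch1", "ch1", "ch2", "ch2", "ch3"),
  word     = c("time", "day", "eyes", "time", "day", "time"),
  stringsAsFactors = FALSE
)

# Same call shape as in the analysis above.
pairs <- pairwise_count(toy, word, document, sort = TRUE, upper = FALSE)

# "time" and "day" share two chapters (ch1, ch2); every other pair shares one.
pairs
```

Read this way, the value 40 in the real output means that both words of the pair appear in 40 of the chapter documents, which is what one would expect for very common words.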
set.seed(42)
# Keep only pairs that co-occur in at least 39 chapters, then draw the network
pair_words_list %>%
  filter(n >= 39) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "blue") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_void()
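The threshold in `filter(n >= 39)` decides which words appear in the plot at all, since a word is only drawn if at least one of its pairs survives the filter. The sketch below shows that effect on a small invented edge list. The words and counts are placeholders rather than values from the book, and `directed = FALSE` is used here for simplicity, whereas the pipeline above keeps the default.

```r
library(igraph)

# Hypothetical edge list in the same shape as pair_words_list.
edges <- data.frame(
  item1 = c("time", "time", "day", "eyes"),
  item2 = c("day", "moment", "moment", "garret"),
  n     = c(40, 40, 40, 3),
  stringsAsFactors = FALSE
)

# Apply the same threshold as the plot: only strong pairs remain.
strong <- edges[edges$n >= 39, ]

# Extra columns (here n) are carried along as edge attributes.
g <- graph_from_data_frame(strong, directed = FALSE)

vcount(g)  # number of words that survive the filter
ecount(g)  # number of pairs that survive the filter
E(g)$n     # co-occurrence counts stored on the edges
```

Lowering the threshold would bring in words such as "eyes" and "garret" in this toy case, so the choice of cutoff shapes which words the plot can say anything about.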
What do the simple statistics and network plots tell you about the book you selected? Interpret your output in a few sentences summarizing your visualizations.
The word counts show that the book's primary character is Raskolnikov, whose name appears 725 times. The word "suddenly" appears 292 times and "time" 385 times, which suggests that the book is a thriller. The network plot tells the same story: it shows strong connections among "time", "suddenly", and "raskolnikov", suggesting that the book is a crime thriller about a person named Raskolnikov. Since the book is about a crime, someone had to witness it, and words such as "eyes", "looked", and "moment" appear to describe that witness. "Day" is perhaps used in recollection. Because an edge in the plot means that two words appear together in at least 39 chapters, nearly every chapter uses "day", "moment", "raskolnikov", "eyes", and "looked", which suggests that the book centers on a crime investigation and on a witness's recollection of the crime scene on a particular day.