Network Models

Load the libraries + functions

library(gutenbergr)
library(stringr)
library(dplyr)
library(tidyr)
library(memnet)
library(jsonlite)
library(tidytext)
library(widyr)
library(ggplot2)
library(igraph)
library(ggraph)

The Data

Book Chosen:
- Crime and Punishment

titles = c('Crime and Punishment')

books = gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title") %>%
  mutate(document = row_number())

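# Number the chapters: every line containing the word "chapter" (any case)
# increments a running counter, and the lines that follow share that number
create_chapters = books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(
    text, regex("\\bchapter\\b", ignore_case = TRUE)
  ))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%          # drop front matter before the first chapter
  unite(document, title, chapter)  # document id, e.g. "Crime and Punishment_1"

# Collapse each chapter into a single string, one row per chapter
by_chapter = create_chapters %>%
  group_by(document) %>%
  summarise(text = paste(text, collapse = ' '))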
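
The running-counter idea can be checked on a few made-up lines. This is only a minimal sketch: the toy text and the anchored pattern `^chapter\\b` are assumptions for illustration and are not part of the analysis above. It suggests that the unanchored pattern also counts a sentence that merely mentions the word "chapter".

```r
library(dplyr)
library(stringr)

# Hypothetical toy lines, not taken from the book
toy <- tibble(text = c("Preface", "CHAPTER I", "It was hot.", "He walked.",
                       "CHAPTER II", "The next chapter of his life began."))

toy_chapters <- toy %>%
  mutate(
    # same pattern as above: any line containing the word "chapter"
    chapter = cumsum(str_detect(
      text, regex("\\bchapter\\b", ignore_case = TRUE)
    )),
    # assumed alternative: only lines that start with "chapter"
    chapter_anchored = cumsum(str_detect(
      text, regex("^chapter\\b", ignore_case = TRUE)
    ))
  )

toy_chapters
```

Here `chapter` is 0, 1, 1, 1, 2, 3 while `chapter_anchored` is 0, 1, 1, 1, 2, 2, so the last line opens a spurious third chapter only under the unanchored pattern. Whether this matters for the downloaded text depends on how its headings are formatted.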

Clean up the data

# Split each chapter into one lowercase word per row, then drop stop words
by_chapter <- by_chapter %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
by_chapter

## # A tibble: 62,063 x 2
##    document               word         
##    <chr>                  <chr>        
##  1 Crime and Punishment_1 chapter      
##  2 Crime and Punishment_1 exceptionally
##  3 Crime and Punishment_1 hot          
##  4 Crime and Punishment_1 evening      
##  5 Crime and Punishment_1 july         
##  6 Crime and Punishment_1 garret       
##  7 Crime and Punishment_1 lodged       
##  8 Crime and Punishment_1 walked       
##  9 Crime and Punishment_1 slowly       
## 10 Crime and Punishment_1 hesitation   
## # ... with 62,053 more rows
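
The first row of the output shows that the heading word "chapter" survives stop-word removal. A possible extra cleaning step, sketched here on made-up text, is to filter that token out as well. The toy data and the added `filter()` call are assumptions and were not applied to the results in this report.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy documents, not taken from the book
toy <- tibble(
  document = c("d1", "d2"),
  text = c("CHAPTER I The evening was hot",
           "He walked slowly in the evening")
)

toy_tokens <- toy %>%
  unnest_tokens(word, text) %>%           # one lowercase word per row
  anti_join(stop_words, by = "word") %>%  # drop common English stop words
  filter(word != "chapter")               # assumed extra step: drop the heading word

toy_tokens
```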

Simple statistics

by_chapter %>% count(word, sort = TRUE)

## # A tibble: 9,090 x 2
##    word            n
##    <chr>       <int>
##  1 raskolnikov   725
##  2 time          385
##  3 sonia         370
##  4 razumihin     324
##  5 dounia        302
##  6 looked        293
##  7 suddenly      292
##  8 day           285
##  9 petrovitch    272
## 10 ivanovna      269
## # ... with 9,080 more rows

Collocates clean up

# n = number of chapters (documents) in which both words appear
pair_words_list <- by_chapter %>%
  pairwise_count(word, document, sort = TRUE, upper = FALSE)
head(pair_words_list)

## # A tibble: 6 x 3
##   item1   item2      n
##   <chr>   <chr>  <dbl>
## 1 chapter time      40
## 2 chapter day       40
## 3 time    day       40
## 4 chapter moment    40
## 5 time    moment    40
## 6 day     moment    40
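
To make the meaning of `n` concrete, here is a small sketch on made-up data (the toy documents are hypothetical). With `document` as the grouping feature, `pairwise_count()` counts the number of documents that contain both words; it does not count how often the words sit next to each other.

```r
library(dplyr)
library(widyr)

# Hypothetical toy tokens: three documents, one word per row
toy <- tibble(
  document = c("d1", "d1", "d1", "d2", "d2", "d3"),
  word     = c("time", "day", "eyes", "time", "day", "time")
)

toy_pairs <- toy %>%
  pairwise_count(word, document, sort = TRUE, upper = FALSE)

toy_pairs
```

The pair "time" and "day" appears in two documents (d1 and d2), so its count is 2; the two pairs involving "eyes" each get a count of 1, and d3 contributes no pair because it holds a single word. By the same logic, a count of 40 in the table above means the two words share 40 chapters.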

Create a network plot

set.seed(42)  # the "fr" layout is randomised; a fixed seed keeps the plot reproducible
pair_words_list %>%
  filter(n >= 39) %>%  # keep only pairs that share at least 39 chapters
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "blue") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_void()
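
Before plotting, the graph object itself can be inspected. The sketch below rebuilds a tiny graph from three of the pairs printed earlier (time, day, moment, each with n = 40). Passing `directed = FALSE` is an assumption made here because co-occurrence has no direction; the plot code above uses the default.

```r
library(igraph)

# Three pairs copied from the head() output above
edges <- data.frame(
  item1 = c("time", "time", "day"),
  item2 = c("day", "moment", "moment"),
  n     = c(40, 40, 40)
)

# Extra columns such as n are stored as edge attributes
g <- graph_from_data_frame(edges, directed = FALSE)

vcount(g)  # number of distinct words (nodes)
ecount(g)  # number of word pairs (edges)
degree(g)  # how many other words each word is linked to
```

Words with a high degree are linked to many other words above the threshold, which is one way to read the most connected nodes in the plot.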

Interpretation

What do the simple statistics and network plots tell you about the book you selected? Interpret your output in a few sentences summarizing your visualizations.
The simple statistics show that the book's primary character is Raskolnikov, whose name is by far the most frequent word (725 occurrences). The word "suddenly" appears 292 times and "time" 385 times, which suggests the book is a thriller. The network plot points the same way. It keeps only word pairs that appear together in at least 39 chapters, and it shows strong connections between "time", "suddenly", and "raskolnikov", suggesting a crime thriller about a person named Raskolnikov. Since the book is about a crime, someone had to witness it, and words such as "eyes", "looked", and "moment" may be used to describe that witness. "Day" is perhaps used in recollection. Nearly every chapter uses the words "day", "moment", "raskolnikov", "eyes", and "looked", suggesting that the book is about a crime investigation and a witness's recollection of the crime scene on a particular day.