Network Models

Load the libraries + functions

Load all the libraries and functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

##r chunk
library(gutenbergr)
library(stringr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(memnet)

## Loading required package: Rcpp

library(tidytext)
library(widyr)
library(ggplot2)
library(igraph)

## 
## Attaching package: 'igraph'

## The following object is masked from 'package:tidyr':
## 
##     crossing

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

library(ggraph)

The Data

Choose one of the books below. The code to download and structure the books has been provided for you, so all you would need to do is change out the title.

Book Titles:
- Crime and Punishment
- Pride and Prejudice
- A Christmas Carol
- The Iliad
- The Art of War
- An Inquiry into the Nature and Causes of the Wealth of Nations
- Democracy in America — Volume 1
- Dream Psychology: Psychoanalysis for Beginners
- Talks To Teachers On Psychology; And To Students On Some Of Life's Ideals

##r chunk
##pick one book from the list above
titles = c("The Art of War")

books = gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title", mirror = "http://mirrors.xmission.com/gutenberg/") %>%
  mutate(document = row_number())

create_chapters = books %>% 
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("\\bchapter\\b", ignore_case = TRUE)))) %>% 
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter) 

by_chapter = create_chapters %>% 
  group_by(document) %>% 
  summarise(text=paste(text,collapse=' '))

## `summarise()` ungrouping output (override with `.groups` argument)

by_chapter$document

##  [1] "The Art of War_1"  "The Art of War_10" "The Art of War_11"
##  [4] "The Art of War_12" "The Art of War_13" "The Art of War_14"
##  [7] "The Art of War_15" "The Art of War_16" "The Art of War_17"
## [10] "The Art of War_18" "The Art of War_19" "The Art of War_2" 
## [13] "The Art of War_20" "The Art of War_21" "The Art of War_22"
## [16] "The Art of War_23" "The Art of War_24" "The Art of War_25"
## [19] "The Art of War_26" "The Art of War_27" "The Art of War_28"
## [22] "The Art of War_29" "The Art of War_3"  "The Art of War_30"
## [25] "The Art of War_31" "The Art of War_32" "The Art of War_33"
## [28] "The Art of War_34" "The Art of War_35" "The Art of War_36"
## [31] "The Art of War_37" "The Art of War_38" "The Art of War_39"
## [34] "The Art of War_4"  "The Art of War_40" "The Art of War_41"
## [37] "The Art of War_42" "The Art of War_43" "The Art of War_44"
## [40] "The Art of War_45" "The Art of War_46" "The Art of War_47"
## [43] "The Art of War_48" "The Art of War_49" "The Art of War_5" 
## [46] "The Art of War_50" "The Art of War_51" "The Art of War_52"
## [49] "The Art of War_53" "The Art of War_54" "The Art of War_55"
## [52] "The Art of War_56" "The Art of War_57" "The Art of War_58"
## [55] "The Art of War_59" "The Art of War_6"  "The Art of War_60"
## [58] "The Art of War_61" "The Art of War_62" "The Art of War_63"
## [61] "The Art of War_64" "The Art of War_65" "The Art of War_66"
## [64] "The Art of War_67" "The Art of War_68" "The Art of War_7" 
## [67] "The Art of War_8"  "The Art of War_9"
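
The chapter counter above increments on every line that contains the word "chapter", including in-text mentions, so the number of documents can exceed the number of real chapter headings. The sketch below compares that pattern with a stricter one on a few made-up lines. The toy lines and the heading layout ("CHAPTER" followed by a numeral at the start of a line) are assumptions for illustration, not taken from the downloaded text.

```r
library(stringr)

# Toy lines (hypothetical, not from the downloaded book)
lines <- c(
  "Preface",
  "CHAPTER I. Laying Plans",
  "The art of war is of vital importance to the state.",
  "This point is taken up again in a later chapter.",
  "CHAPTER II. Waging War",
  "In the operations of war there are many costs."
)

# Pattern used in the report: any line containing the word "chapter"
loose <- cumsum(str_detect(lines, regex("\\bchapter\\b", ignore_case = TRUE)))

# Stricter alternative (assumed heading layout): the line starts with
# "chapter" followed by a Roman or Arabic numeral
strict <- cumsum(str_detect(
  lines,
  regex("^\\s*chapter\\s+[ivxlc0-9]+", ignore_case = TRUE)
))

data.frame(line = seq_along(lines), loose, strict)
```

On these toy lines the loose pattern starts a new chapter at the in-text mention on line 4, while the stricter pattern does not. Whether the stricter pattern suits a given book depends on how its headings are actually written.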

Clean up the data

In this section, create a tibble/dataframe of the individual words from your book (use by_chapter$text). Try using unnest_tokens (arguments should be word, text) and anti_join to create a unigram list of words with the stopwords removed.

book_words <- by_chapter %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

## Joining, by = "word"

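# drop tokens containing "chapter"; unlike negative indexing with grep(),
# this keeps every row when nothing matches
book_words <- book_words %>%
  filter(!str_detect(word, "chapter"))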

head(book_words)

## # A tibble: 6 x 2
##   document         word      
##   <chr>            <chr>     
## 1 The Art of War_1 proceeds  
## 2 The Art of War_1 biography 
## 3 The Art of War_1 descendant
## 4 The Art of War_1 sun       
## 5 The Art of War_1 pin       
## 6 The Art of War_1 born
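
To make the two steps concrete, here is a minimal sketch on a made-up two-document tibble. The data and the name toy_words are hypothetical; unnest_tokens, anti_join and stop_words are the same functions and data set used above.

```r
library(dplyr)
library(tidytext)

# Toy input (hypothetical): one row per document
toy <- tibble(
  document = c("doc_1", "doc_2"),
  text     = c("The army and the enemy", "The enemy of the army in battle")
)

toy_words <- toy %>%
  unnest_tokens(word, text) %>%        # one lowercased token per row
  anti_join(stop_words, by = "word")   # drop rows whose word is a stopword

toy_words
```

Passing by = "word" names the join column explicitly, which also suppresses the "Joining, by" message.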

Simple statistics

In this section, use the count function to determine the most frequent words used in the book that are not stopwords.

book_words %>% count(word, sort=TRUE)

## # A tibble: 10,581 x 2
##    word           n
##    <chr>      <int>
##  1 army        1038
##  2 enemy        667
##  3 operations   592
##  4 line         542
##  5 war          541
##  6 battle       444
##  7 lines        416
##  8 ch           388
##  9 thousand     342
## 10 attack       340
## # … with 10,571 more rows
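
As a small illustration of what count does, the sketch below runs it on a made-up word column and compares it with the longer group_by / summarise / arrange form. The toy data is hypothetical.

```r
library(dplyr)

# Toy word list (hypothetical)
toy_words <- tibble(
  word = c("army", "enemy", "army", "war", "army", "enemy")
)

# Shorthand used in the report
counts <- toy_words %>%
  count(word, sort = TRUE)

# Longer equivalent: group, tally, then sort by descending frequency
counts_long <- toy_words %>%
  group_by(word) %>%
  summarise(n = n(), .groups = "drop") %>%
  arrange(desc(n))

counts
```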

Collocates clean up

Create a tibble/dataframe that includes the collocate pairs in the book you picked using pairwise_count. The document column is equivalent to id in the lecture example.

book_word_pairs <- book_words %>% 
  pairwise_count(word, document, sort = TRUE, upper = FALSE)

## Warning: `distinct_()` is deprecated as of dplyr 0.7.0.
## Please use `distinct()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

head(book_word_pairs)

## # A tibble: 6 x 3
##   item1    item2     n
##   <chr>    <chr> <dbl>
## 1 2        1        41
## 2 war      army     41
## 3 military war      40
## 4 army     enemy    40
## 5 war      1        39
## 6 1        army     39
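
In this table, n is the number of documents (chapters) in which both words appear, not the number of times they occur next to each other. A minimal sketch on made-up data, where toy_words and its contents are hypothetical:

```r
library(dplyr)
library(widyr)

# Toy tokens (hypothetical): "army" and "enemy" share two chapters,
# "war" appears only in ch_1, and ch_3 holds a single word
toy_words <- tibble(
  document = c("ch_1", "ch_1", "ch_1", "ch_2", "ch_2", "ch_3"),
  word     = c("army", "enemy", "war", "army", "enemy", "army")
)

toy_pairs <- toy_words %>%
  pairwise_count(word, document, sort = TRUE, upper = FALSE)

toy_pairs
```

With upper = FALSE each unordered pair is listed once, so the three words give three rows, and the army/enemy pair has the largest n because both words occur in ch_1 and ch_2. A chapter with only one word, like ch_3, contributes no pairs.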

Create a network plot

Create a network plot of the collocates - remember you can change the n > XX to a number that keeps a lot of the data, but filters out a lot of the smaller combinations. Set the n value in the filter function to be equal to or less than the highest n value in the word_pairs table.

book_word_pairs %>%
  filter(n >= 35) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") + #use ?ggraph to see all the options
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "purple") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void()
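
The plot can be backed up with numbers. The sketch below, on a made-up edge list, shows how the n threshold changes the number of edges kept and how igraph's degree function reports how many edges touch each node. The edge list, its counts and the threshold of 38 are hypothetical.

```r
library(igraph)

# Toy edge list (hypothetical values, same column layout as book_word_pairs)
toy_pairs <- data.frame(
  item1 = c("army", "war", "army", "military", "strength"),
  item2 = c("enemy", "army", "troops", "war", "army"),
  n     = c(40, 41, 36, 40, 35)
)

# Keep only the stronger pairs; a lower threshold keeps more edges
kept <- toy_pairs[toy_pairs$n >= 38, ]

# Build an undirected graph; the extra column n becomes an edge attribute
g <- graph_from_data_frame(kept, directed = FALSE)

# Degree = number of edges attached to each node, largest first
deg <- sort(degree(g), decreasing = TRUE)
deg
```

Nodes with the highest degree are the ones that look central in the plot, so printing deg for the real graph is one way to check a visual impression.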

Interpretation

What do the simple statistics and network plots tell you about the book you selected? Interpret your output in a paragraph summarizing your visualizations.
ANSWER: In the network plot, words such as strength, battle, attack, and troops are connected to army and enemy, which shows that the book is largely about war strategy. The nodes army, enemy, war, and military have the highest degree, that is, the largest number of edges connected to them. These four nodes are central, all relate to warfare, and are strongly connected with each other.
Describe a set of texts and research question that interests you that could be explored using this method. Basically, what is a potential application of this method to another area of research? (At least a full paragraph including a definition of the problem or question, the text data that could be used, what the analysis might mean, and why the problem is important.)
ANSWER: Many Hindu mythological texts were written thousands of years ago. One example is the Mahabharata, a huge work of more than 200,000 individual verse lines in 19 volumes, where each volume teaches a distinct lesson that can be applied in day-to-day life. Reading it takes many months, and translating it into other languages took almost 50 years. Suppose I want to study the topic of war, which may be discussed across several volumes. Building a language network model for each volume would help me identify the relationships between the word war and the other words related to it in that volume. The whole book may discuss five or six wars, or perhaps more, but a network model shows which nodes (words) are most important in a given volume and whether war is discussed heavily there or not. This technique may be most useful for books that have not yet been translated and have no preface, where it can help a reader understand their context.