Load the libraries + functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

##r chunk
library(gutenbergr)
library(stringr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(memnet)
## Loading required package: Rcpp
library(tidytext)
library(widyr)
library(ggplot2)
library(igraph)
## 
## Attaching package: 'igraph'
## The following object is masked from 'package:tidyr':
## 
##     crossing
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library(ggraph)

The Data

Choose one of the books below. The code to download and structure the books has been provided for you, so you only need to change the title.

##r chunk
##pick one book from the list above
titles = c("The Art of War")

books = gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title", mirror = "http://mirrors.xmission.com/gutenberg/") %>%
  mutate(document = row_number())

create_chapters = books %>% 
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("\\bchapter\\b", ignore_case = TRUE)))) %>% 
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter) 

by_chapter = create_chapters %>% 
  group_by(document) %>% 
  summarise(text=paste(text,collapse=' '))
## `summarise()` ungrouping output (override with `.groups` argument)
by_chapter$document
##  [1] "The Art of War_1"  "The Art of War_10" "The Art of War_11"
##  [4] "The Art of War_12" "The Art of War_13" "The Art of War_14"
##  [7] "The Art of War_15" "The Art of War_16" "The Art of War_17"
## [10] "The Art of War_18" "The Art of War_19" "The Art of War_2" 
## [13] "The Art of War_20" "The Art of War_21" "The Art of War_22"
## [16] "The Art of War_23" "The Art of War_24" "The Art of War_25"
## [19] "The Art of War_26" "The Art of War_27" "The Art of War_28"
## [22] "The Art of War_29" "The Art of War_3"  "The Art of War_30"
## [25] "The Art of War_31" "The Art of War_32" "The Art of War_33"
## [28] "The Art of War_34" "The Art of War_35" "The Art of War_36"
## [31] "The Art of War_37" "The Art of War_38" "The Art of War_39"
## [34] "The Art of War_4"  "The Art of War_40" "The Art of War_41"
## [37] "The Art of War_42" "The Art of War_43" "The Art of War_44"
## [40] "The Art of War_45" "The Art of War_46" "The Art of War_47"
## [43] "The Art of War_48" "The Art of War_49" "The Art of War_5" 
## [46] "The Art of War_50" "The Art of War_51" "The Art of War_52"
## [49] "The Art of War_53" "The Art of War_54" "The Art of War_55"
## [52] "The Art of War_56" "The Art of War_57" "The Art of War_58"
## [55] "The Art of War_59" "The Art of War_6"  "The Art of War_60"
## [58] "The Art of War_61" "The Art of War_62" "The Art of War_63"
## [61] "The Art of War_64" "The Art of War_65" "The Art of War_66"
## [64] "The Art of War_67" "The Art of War_68" "The Art of War_7" 
## [67] "The Art of War_8"  "The Art of War_9"
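
The chapter numbering in the provided code rests on a running count of lines that mention the word "chapter". The sketch below illustrates that idea on a few made-up lines; `toy_text` is hypothetical data, and base R's `grepl()` stands in for `str_detect()` so the example needs no packages.

```r
# Toy lines standing in for the downloaded text (hypothetical data)
toy_text <- c("Preface", "CHAPTER I", "first line", "second line",
              "Chapter II", "third line")

# TRUE on every line that contains the whole word "chapter", in any case
is_heading <- grepl("\\bchapter\\b", toy_text, ignore.case = TRUE, perl = TRUE)

# The running total of matches is the chapter number assigned to each line
chapter <- cumsum(is_heading)

data.frame(text = toy_text, chapter = chapter)
```

Lines before the first match get chapter 0, which is what `filter(chapter > 0)` removes. Note that any line containing the word, not only a true heading, increments the counter.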

Clean up the data

In this section, you want to create a tibble/dataframe of the individual words from your book (use by_chapter$text). Try using unnest_tokens (arguments should be word, text) and anti_join to create a unigram list of words without stopwords included.

book_words <- by_chapter %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
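# Drop tokens containing "chapter"; unlike negative indexing with grep(),
# this keeps every row when nothing matches
book_words <- book_words %>% filter(!str_detect(word, "chapter"))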

head(book_words)
## # A tibble: 6 x 2
##   document         word      
##   <chr>            <chr>     
## 1 The Art of War_1 proceeds  
## 2 The Art of War_1 biography 
## 3 The Art of War_1 descendant
## 4 The Art of War_1 sun       
## 5 The Art of War_1 pin       
## 6 The Art of War_1 born

Simple statistics

In this section, use the count function to determine the most frequent words used in the book that are not stopwords.

book_words %>% count(word, sort=TRUE)
## # A tibble: 10,581 x 2
##    word           n
##    <chr>      <int>
##  1 army        1038
##  2 enemy        667
##  3 operations   592
##  4 line         542
##  5 war          541
##  6 battle       444
##  7 lines        416
##  8 ch           388
##  9 thousand     342
## 10 attack       340
## # … with 10,571 more rows

Collocates clean up

Create a tibble/dataframe that includes the collocate pairs in the book you picked using pairwise_count. The document column is equivalent to id in the lecture example.

book_word_pairs <- book_words %>% 
  pairwise_count(word, document, sort = TRUE, upper = FALSE)
## Warning: `distinct_()` is deprecated as of dplyr 0.7.0.
## Please use `distinct()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
head(book_word_pairs)
## # A tibble: 6 x 3
##   item1    item2     n
##   <chr>    <chr> <dbl>
## 1 2        1        41
## 2 war      army     41
## 3 military war      40
## 4 army     enemy    40
## 5 war      1        39
## 6 1        army     39
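
To make the meaning of n concrete: for each pair of words, it is the number of documents (here, chapters) in which both words appear. The following is a minimal base R sketch of that idea on hypothetical toy data. It is not how widyr implements `pairwise_count`, only an equivalent calculation for illustration.

```r
# Hypothetical toy data: one row per (document, word) occurrence
toy <- data.frame(
  document = c("d1", "d1", "d1", "d2", "d2", "d3", "d3"),
  word     = c("army", "enemy", "army", "army", "enemy", "army", "war"),
  stringsAsFactors = FALSE
)

# Document-by-word incidence matrix: 1 if the word occurs in the document,
# no matter how many times
incidence <- 1 * (unclass(table(toy$document, toy$word)) > 0)

# crossprod(incidence) is t(incidence) %*% incidence: entry [i, j] is the
# number of documents that contain both word i and word j
cooccur <- crossprod(incidence)

cooccur["army", "enemy"]  # 2: both words occur in d1 and d2
```

The diagonal of `cooccur` holds each word's own document count, and the matrix lists every pair twice. `pairwise_count` leaves out the diagonal by default, and `upper = FALSE` keeps only one copy of each pair.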

Create a network plot

Create a network plot of the collocates. Adjust the threshold in filter(n >= XX) so that the plot keeps the stronger pairs but drops the many low-count combinations. The threshold must be equal to or less than the highest n value in the book_word_pairs table; otherwise no pairs remain to plot.

book_word_pairs %>%
  filter(n >= 35) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") + #use ?ggraph to see all the options
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "purple") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void()
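
One way to pick the cutoff is to check how many pairs survive at a few candidate values before plotting. The sketch below does this for a made-up vector of counts; `n` and `cutoffs` are hypothetical stand-ins, not values taken from the book.

```r
# Hypothetical pair counts standing in for book_word_pairs$n
n <- c(41, 41, 40, 40, 39, 39, 35, 30, 12, 5, 2, 1, 1)

# Candidate cutoffs for filter(n >= cutoff)
cutoffs <- c(1, 10, 35, 40, 42)

# Number of pairs (edges in the plot) that each cutoff keeps
kept <- sapply(cutoffs, function(k) sum(n >= k))

data.frame(cutoff = cutoffs, pairs_kept = kept)
```

A cutoff above the largest count (42 here) keeps zero pairs, which would leave an empty graph, while a very low cutoff keeps nearly everything and tends to produce an unreadable plot.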

Interpretation