Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
##r chunk
library(gutenbergr)
library(stringr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(memnet)
## Loading required package: Rcpp
library(tidytext)
library(widyr)
library(ggplot2)
library(igraph)
##
## Attaching package: 'igraph'
## The following object is masked from 'package:tidyr':
##
## crossing
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(ggraph)
Choose one of the books below. The code to download and structure the books has been provided for you, so you only need to change the title.
##r chunk
##pick one book from the list above
titles = c("The Art of War")
# Look up the title on Project Gutenberg and download the text
books = gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title", mirror = "http://mirrors.xmission.com/gutenberg/") %>%
  mutate(document = row_number())
# Number the chapters: every line that mentions "chapter" starts a new one
create_chapters = books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("\\bchapter\\b", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter)
# Collapse each chapter into a single row of text
by_chapter = create_chapters %>%
  group_by(document) %>%
  summarise(text=paste(text,collapse=' '))
## `summarise()` ungrouping output (override with `.groups` argument)
by_chapter$document
## [1] "The Art of War_1" "The Art of War_10" "The Art of War_11"
## [4] "The Art of War_12" "The Art of War_13" "The Art of War_14"
## [7] "The Art of War_15" "The Art of War_16" "The Art of War_17"
## [10] "The Art of War_18" "The Art of War_19" "The Art of War_2"
## [13] "The Art of War_20" "The Art of War_21" "The Art of War_22"
## [16] "The Art of War_23" "The Art of War_24" "The Art of War_25"
## [19] "The Art of War_26" "The Art of War_27" "The Art of War_28"
## [22] "The Art of War_29" "The Art of War_3" "The Art of War_30"
## [25] "The Art of War_31" "The Art of War_32" "The Art of War_33"
## [28] "The Art of War_34" "The Art of War_35" "The Art of War_36"
## [31] "The Art of War_37" "The Art of War_38" "The Art of War_39"
## [34] "The Art of War_4" "The Art of War_40" "The Art of War_41"
## [37] "The Art of War_42" "The Art of War_43" "The Art of War_44"
## [40] "The Art of War_45" "The Art of War_46" "The Art of War_47"
## [43] "The Art of War_48" "The Art of War_49" "The Art of War_5"
## [46] "The Art of War_50" "The Art of War_51" "The Art of War_52"
## [49] "The Art of War_53" "The Art of War_54" "The Art of War_55"
## [52] "The Art of War_56" "The Art of War_57" "The Art of War_58"
## [55] "The Art of War_59" "The Art of War_6" "The Art of War_60"
## [58] "The Art of War_61" "The Art of War_62" "The Art of War_63"
## [61] "The Art of War_64" "The Art of War_65" "The Art of War_66"
## [64] "The Art of War_67" "The Art of War_68" "The Art of War_7"
## [67] "The Art of War_8" "The Art of War_9"
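The chapter numbering above relies on a running total of heading matches. The following is a minimal sketch of that idea on a hypothetical toy vector (`toy_text` is made up for illustration and is not part of the book). Note that the pattern matches any line containing the word "chapter", so a cross-reference in running text would also start a new "chapter".

```r
library(stringr)

# Hypothetical toy lines standing in for the downloaded book text
toy_text <- c("Preface", "CHAPTER I", "first line", "second line",
              "Chapter II", "third line")

# TRUE wherever a line contains the whole word "chapter" (case-insensitive)
is_heading <- str_detect(toy_text, regex("\\bchapter\\b", ignore_case = TRUE))

# The running total of headings seen so far is the chapter number;
# lines before the first heading get 0, which filter(chapter > 0) would drop
chapter <- cumsum(is_heading)
chapter
```

Here the preface line gets chapter 0, and each heading line and the lines after it share the same chapter number.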
In this section, you want to create a tibble/dataframe of the individual words from your book (use by_chapter$text). Try using unnest_tokens (arguments should be word, text) and anti_join to create a unigram list of words without stopwords included.
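book_words <- by_chapter %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
# Drop the chapter headings; unlike negative grep() indexing, filter() is safe when nothing matches
book_words <- book_words %>%
  filter(!str_detect(word, "chapter"))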
head(book_words)
## # A tibble: 6 x 2
## document word
## <chr> <chr>
## 1 The Art of War_1 proceeds
## 2 The Art of War_1 biography
## 3 The Art of War_1 descendant
## 4 The Art of War_1 sun
## 5 The Art of War_1 pin
## 6 The Art of War_1 born
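To make the tokenizing step easier to follow, here is a small sketch on two hypothetical sentences (the `toy` tibble is made up for illustration). It assumes the same `stop_words` table from tidytext used above and passes `by = "word"` explicitly, which should also silence the "Joining, by" message.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-document corpus, one row per document
toy <- tibble(
  document = c("doc_1", "doc_2"),
  text     = c("The army is in the field", "An enemy of the army")
)

toy_words <- toy %>%
  unnest_tokens(word, text) %>%        # one lowercase word per row
  anti_join(stop_words, by = "word")   # drop rows whose word is a stopword

toy_words
```

Each remaining row pairs a document with one content word, which is the shape that count and pairwise_count expect.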
In this section, use the count function to determine the most frequent words used in the book that are not stopwords.
book_words %>% count(word, sort=TRUE)
## # A tibble: 10,581 x 2
## word n
## <chr> <int>
## 1 army 1038
## 2 enemy 667
## 3 operations 592
## 4 line 542
## 5 war 541
## 6 battle 444
## 7 lines 416
## 8 ch 388
## 9 thousand 342
## 10 attack 340
## # … with 10,571 more rows
Create a tibble/dataframe that includes the collocate pairs in the book you picked using pairwise_count. The document column is equivalent to id in the lecture example.
book_word_pairs <- book_words %>%
  pairwise_count(word, document, sort = TRUE, upper = FALSE)
## Warning: `distinct_()` is deprecated as of dplyr 0.7.0.
## Please use `distinct()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
head(book_word_pairs)
## # A tibble: 6 x 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 2 1 41
## 2 war army 41
## 3 military war 40
## 4 army enemy 40
## 5 war 1 39
## 6 1 army 39
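The table above includes the tokens "1" and "2", which are numbers rather than words. The sketch below uses hypothetical toy data (`toy_words` is made up for illustration) to show what pairwise_count measures and one possible way to drop digit-only tokens first. Whether numeric tokens should be removed for this assignment is a judgment call, not something the instructions require.

```r
library(dplyr)
library(stringr)
library(widyr)

# Hypothetical toy data: one row per (document, word), shaped like book_words
toy_words <- tibble(
  document = c("d1", "d1", "d1", "d2", "d2", "d2", "d3", "d3"),
  word     = c("army", "enemy", "1", "army", "enemy", "1", "army", "war")
)

toy_pairs <- toy_words %>%
  filter(!str_detect(word, "^[0-9]+$")) %>%                   # drop digit-only tokens
  pairwise_count(word, document, sort = TRUE, upper = FALSE)  # n = documents shared by the pair

toy_pairs
```

In this toy data, army and enemy appear together in d1 and d2, so that pair gets n = 2, while army and war share only d3.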
Create a network plot of the collocates - remember you can change the n > XX to a number that keeps a lot of the data, but filters out a lot of the smaller combinations. Set the n value in the filter function to be equal to or less than the highest n value in the word_pairs table.
book_word_pairs %>%
  filter(n >= 35) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") + #use ?ggraph to see all the options
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "purple") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_void()
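One way to pick the cutoff is to check how many pairs survive at a few candidate values before plotting. This is only a sketch on a hypothetical toy vector: the first six values echo the head of the pairs table, and the last two are made up to stand in for the long tail of rare pairs.

```r
# Hypothetical pair counts: six high values plus two invented low ones
toy_n <- c(41, 41, 40, 40, 39, 39, 12, 5)

# Candidate cutoffs for filter(n >= cutoff)
cutoffs <- c(10, 35, 40)

# Number of pairs (edges) that would remain in the plot at each cutoff
edges_kept <- sapply(cutoffs, function(k) sum(toy_n >= k))
names(edges_kept) <- cutoffs
edges_kept
```

A cutoff above the largest n would leave no edges at all, which is why the value must not exceed the highest n in the pairs table.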
What do the simple statistics and network plots tell you about the book you selected? Interpret your output in a paragraph summarizing your visualizations.
ANSWER: In the plot, words such as strength, battle, attack, and troops are connected to army and enemy. The network plot clearly shows that the book is largely about war strategy. The word counts agree: army (1,038 occurrences) and enemy (667) are the two most frequent non-stopword terms. The degree (the number of edges connected to a node) is highest for army, enemy, war, and military. These four nodes are central, all relate to warfare, and are strongly connected with each other.
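The degree claim can be checked numerically rather than read off the plot. The sketch below assumes a small hypothetical edge list (`toy_pairs`): the first three rows mirror pairs from the table above, while the troops and battle rows and their counts are invented for illustration. The graph is built as undirected here because co-occurrence has no direction.

```r
library(igraph)

# Hypothetical edge list shaped like the filtered book_word_pairs
toy_pairs <- data.frame(
  item1 = c("war", "military", "army", "army", "army"),
  item2 = c("army", "war", "enemy", "troops", "battle"),
  n     = c(41, 40, 40, 36, 35)  # last two counts are made up
)

g <- graph_from_data_frame(toy_pairs, directed = FALSE)

# Degree = number of edges attached to each node, largest first
node_degree <- sort(degree(g), decreasing = TRUE)
node_degree
```

Running the same two lines on the filtered book_word_pairs would list the most connected words in the actual plot.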
Describe a set of texts and research question that interests you that could be explored using this method. Basically, what is a potential application of this method to another area of research? (At least a full paragraph including a definition of the problem or question, the text data that could be used, what the analysis might mean, and why the problem is important.)
war which may be discussed in multiple volumes. Building a language network model helps me identify the relationship between the word war and the other words related to war in a particular volume. There might be five or six wars discussed in the entire book, or maybe more, but a language network model helps me identify the most important nodes (or words) in each volume and also tells me whether war is discussed heavily in that volume or not. This technique may be especially useful for books that are not yet translated and without any prefix, since it helps in understanding their context.