In the present report, we conduct a descriptive content analysis of 4 popular works (the top 4 most downloaded books) on war/military strategy available in the Gutenberg Project’s archives, applying text mining with the tidy approach in R (see Silge & Robinson, 2017, for the guiding code; see Grimmer & Stewart, 2013; 269-271, for the principles of automated text analysis). We take a nomothetic approach in this report by summarizing content across a selection of books, while acknowledging that our text corpus is small (Neuendorf, 2017; 23-24). The aim is two-fold: (1) to quantitatively explore and describe the military strategy literature offered in this free repository, and (2) to map the role of strategy (in an n-gram network) according to the selected books and authors, in order to draw insights that can analytically inform domains other than warfare (e.g., public policy, commerce).
Our work builds on the framework proposed by Kornberger & Vaara (2022) for strategy research (Ibid.; 1-2). In their article titled Strategy as engagement: What organization strategy can learn from military strategy (2022), they point to an "intersectionality between the two domains" which has not been fully integrated (Ibid.; 2). To begin with, the authors offer a conceptual development of the role of strategy: moving the sociological eye from previous research on internal strategy practices (a focus on processes and strategy-making within an organization) onto external engagement practices with the ecosystem(s) beyond (an interactionist framing) (Ibid.; 1-3). This means reorienting current strategy research toward the nature of the practices that aim to (and can) exert influence on external actors to favor one's interests and agenda(s), and toward a better understanding of what changes the other's "trajectory" through competition, collaboration, or co-option (Ibid.; 2-3).
They stress the importance of drawing methodological-analytic lessons from the military strategy literature in order to have strategy clearly defined (Ibid.; 8). In this sense, strategy is conceived here as a "bridge between two shores": policy (as big guiding principles, purposes, or Grand Strategy) and tactics (as means, power, or material prowess) (Ibid.; 2; Ibid. on Clausewitz, Gray and Admiral Wylie; 4). According to the authors, strategy is not something to be implemented but a "living movement" between these two "sides", and its function is to translate "purposes into conducts on a battlefield and vice versa" (Ibid. on Clausewitz; 8). Strategy ultimately refers to an effect (that the two "ends" of the bridge have on one another in constant flux), not to a concrete action or model (Ibid.; 9). Hence, constant change, adaptation, and evolution are its salient features, aimed at achieving victory (effectively exercising power) in the long term through policy and never through "operational issues of warfare" alone (Ibid.; 4, 6). Kornberger & Vaara's work gains importance in the current context of hybrid wars, emergent AI markets, and ambitious, transition-based climate change policies, among others. Following the authors' avowal, we consider it useful to address core principles of strategy (Ibid.; 10) drawn from popular military strategy literature in order to navigate uncertain scenarios (as traversed by the "fog of war") practically, that is, to gain awareness of a given situation and to train the good judgment that might enlighten action (Ibid. on Clausewitz; 3, 10).
library(stringr)
library(forcats)
library(gutenbergr)
library(tidyverse)
library(tidytext)
library(tm)
library(textdata)
library(psych)
library(skimr)
library(wordcloud2)
library(tidyr)
library(lifecycle)
library(scales)
library(igraph)
library(ggraph)
gut_works <- data.frame(gutenberg_works())
str(gut_works)
## 'data.frame': 44042 obs. of 8 variables:
## $ gutenberg_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ title : chr "The Declaration of Independence of the United States of America" "The United States Bill of Rights\r\nThe Ten Original Amendments to the Constitution of the United States" "John F. Kennedy's Inaugural Address" "Lincoln's Gettysburg Address\r\nGiven November 19, 1863 on the battlefield near Gettysburg, Pennsylvania, USA" ...
## $ author : chr "Jefferson, Thomas" "United States" "Kennedy, John F. (John Fitzgerald)" "Lincoln, Abraham" ...
## $ gutenberg_author_id: int 1638 1 1666 3 1 4 NA 3 3 NA ...
## $ language : chr "en" "en" "en" "en" ...
## $ gutenberg_bookshelf: chr "Politics/American Revolutionary War/United States Law" "Politics/American Revolutionary War/United States Law" "" "US Civil War" ...
## $ rights : chr "Public domain in the USA." "Public domain in the USA." "Public domain in the USA." "Public domain in the USA." ...
## $ has_text : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
# We have a data frame with 44,042 rows (one per gutenberg_id) and 8 variables ("gutenberg_id", "title", "author", "gutenberg_author_id", "language", "gutenberg_bookshelf", "rights", "has_text")
gut_meta <- gutenberg_metadata
str(gut_meta) # We have a tibble with 72,569 rows (one per gutenberg_id) and the same 8 variables
## tibble [72,569 × 8] (S3: tbl_df/tbl/data.frame)
## $ gutenberg_id : int [1:72569] 1 2 3 4 5 6 7 8 9 10 ...
## $ title : chr [1:72569] "The Declaration of Independence of the United States of America" "The United States Bill of Rights\r\nThe Ten Original Amendments to the Constitution of the United States" "John F. Kennedy's Inaugural Address" "Lincoln's Gettysburg Address\r\nGiven November 19, 1863 on the battlefield near Gettysburg, Pennsylvania, USA" ...
## $ author : chr [1:72569] "Jefferson, Thomas" "United States" "Kennedy, John F. (John Fitzgerald)" "Lincoln, Abraham" ...
## $ gutenberg_author_id: int [1:72569] 1638 1 1666 3 1 4 NA 3 3 NA ...
## $ language : chr [1:72569] "en" "en" "en" "en" ...
## $ gutenberg_bookshelf: chr [1:72569] "Politics/American Revolutionary War/United States Law" "Politics/American Revolutionary War/United States Law" "" "US Civil War" ...
## $ rights : chr [1:72569] "Public domain in the USA." "Public domain in the USA." "Public domain in the USA." "Public domain in the USA." ...
## $ has_text : logi [1:72569] TRUE TRUE TRUE TRUE TRUE TRUE ...
## - attr(*, "date_updated")= Date[1:1], format: "2022-12-19"
gut_sub<- gutenberg_subjects
str(gut_sub) # We have a tibble with 231,741 rows and 3 variables ("gutenberg_id", "subject_type", "subject"); one gutenberg_id can appear in several rows
## tibble [231,741 × 3] (S3: tbl_df/tbl/data.frame)
## $ gutenberg_id: int [1:231741] 1 1 1 1 2 2 2 2 3 3 ...
## $ subject_type: chr [1:231741] "lcsh" "lcsh" "lcc" "lcc" ...
## $ subject : chr [1:231741] "United States -- History -- Revolution, 1775-1783 -- Sources" "United States. Declaration of Independence" "E201" "JK" ...
## - attr(*, "date_updated")= Date[1:1], format: "2022-12-19"
length(unique(gut_sub$subject)) # There are 38,229 unique subjects
## [1] 38229
unique(gut_sub$subject_type) # There are 2 subject types: "lcsh" (Library of Congress Subject Headings) and "lcc" (Library of Congress Classification).
## [1] "lcsh" "lcc"
Of course, one book can be associated with several subjects and subject types. As a comment, we note that subjects are frequently ‘sui generis’ or very broad. However, since our objective is to analyse popular works on war/military strategy, the existing label “Military art and science” in the Gutenberg Project can be used to select books.
sub_sub <- gut_sub %>% filter(subject == "Military art and science") # There are 19 works in this filtered-by-subject tibble.
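The sub_books object used in the following steps is not defined in this excerpt; below is a minimal sketch of how it could be built, assuming it simply holds the downloaded text of the filtered works with author and title kept as metadata (judging from the outputs further down, only 8 of the 19 works were retrieved with full text).
# Hypothetical reconstruction (not in the original chunk): download the text of the
# works filtered by subject, keeping author and title as metadata columns.
sub_books <- gutenberg_download(sub_sub$gutenberg_id,
                                meta_fields = c("author", "title"))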
# Asking a series of simple questions and performing simple steps to familiarize ourselves with the text corpus.
# a. The longest and shortest books (in terms of lines of content).
sub_books %>% count(gutenberg_id, author, title) %>% arrange(desc(n)) %>% slice_head() # Longest (by lines): "Elements of Military Art and Science" by Halleck, n = 14,969
## # A tibble: 1 × 4
## gutenberg_id author title n
## <int> <chr> <chr> <int>
## 1 16170 Halleck, H. W. (Henry Wager) "Elements of Military Art and… 14969
sub_books %>% count(gutenberg_id, author, title) %>% arrange(desc(n)) %>% slice_tail() # Shortest (by lines): "The Defence of Duffer's Drift" by Swinton, n = 1,735
## # A tibble: 1 × 4
## gutenberg_id author title n
## <int> <chr> <chr> <int>
## 1 24842 Swinton, E. D. (Ernest Dunlop) The Defence of Duffer's Dri… 1735
# As Grimmer & Stewart (2013; 272) point out, lengthier texts are better suited for automated content analysis (more words, more data).
From here, we tokenize with one word as the unit, using the unnest_tokens() function from the tidytext package. That is, starting from our library of books on Military art and science (of the 19 works filtered by subject, 8 were retrieved with full text), we unnest the words from the “text” column in order to obtain a tidy data frame in which each row of that column now represents a single token (one word).
# We create the object "w_books" to save this tokenization for further operations.
w_books<- sub_books %>% unnest_tokens(word, text)
# b. How many words in total are in the library?
length(w_books$word) # There are 553,695 words in total
## [1] 553695
# c. List of books in the library sorted by descending number of total words (grouping by gutenberg_id, author and title).
w_books %>% group_by(gutenberg_id, author, title) %>% summarize(total = n()) %>% arrange(desc(total))
## `summarise()` has grouped output by 'gutenberg_id', 'author'. You can override
## using the `.groups` argument.
## # A tibble: 8 × 4
## # Groups: gutenberg_id, author [8]
## gutenberg_id author title total
## <int> <chr> <chr> <int>
## 1 16170 Halleck, H. W. (Henry Wager) "Elements of … 141324
## 2 1946 Clausewitz, Carl von "On War" 107826
## 3 34459 Corbin, Thomas W. "The Romance … 88763
## 4 7294 Ardant du Picq, Charles Jean Jacques Joseph "Battle Studi… 81468
## 5 23473 Anonymous "Lectures on … 62725
## 6 59804 Radiguet, René-Louis-Jules "The Making o… 31595
## 7 36693 Clausewitz, Carl von "Grundgedanke… 22968
## 8 24842 Swinton, E. D. (Ernest Dunlop) "The Defence … 17026
# d. Which are the longest and shortest books (in terms of total word counts)?
w_books %>% count(gutenberg_id, author, title) %>% arrange(desc(n)) %>% slice_head() # Longest (by words): "Elements of Military Art and Science" by Halleck, n = 141,324
## # A tibble: 1 × 4
## gutenberg_id author title n
## <int> <chr> <chr> <int>
## 1 16170 Halleck, H. W. (Henry Wager) "Elements of Military Art an… 141324
w_books %>% count(gutenberg_id, author, title) %>% arrange(desc(n)) %>% slice_tail() # Shortest (by words): "The Defence of Duffer's Drift" by Swinton, n = 17,026
## # A tibble: 1 × 4
## gutenberg_id author title n
## <int> <chr> <chr> <int>
## 1 24842 Swinton, E. D. (Ernest Dunlop) The Defence of Duffer's Dri… 17026
# e. How many unique words are in the library?
w_books %>% count(word) %>% summarize(total = n()) %>% pull(total) # 24,968 unique words in total
## [1] 24968
# f. List of books in the library sorted by descending number of unique words (grouping by gutenberg_id, author and title).
w_books %>% group_by(gutenberg_id, author, title) %>% summarise( total = n_distinct(word)) %>% arrange(desc(total))
## `summarise()` has grouped output by 'gutenberg_id', 'author'. You can override
## using the `.groups` argument.
## # A tibble: 8 × 4
## # Groups: gutenberg_id, author [8]
## gutenberg_id author title total
## <int> <chr> <chr> <int>
## 1 16170 Halleck, H. W. (Henry Wager) "Elements of M… 10816
## 2 34459 Corbin, Thomas W. "The Romance o… 7386
## 3 1946 Clausewitz, Carl von "On War" 6939
## 4 7294 Ardant du Picq, Charles Jean Jacques Joseph "Battle Studie… 6915
## 5 23473 Anonymous "Lectures on L… 5695
## 6 59804 Radiguet, René-Louis-Jules "The Making of… 4166
## 7 36693 Clausewitz, Carl von "Grundgedanken… 4116
## 8 24842 Swinton, E. D. (Ernest Dunlop) "The Defence o… 2713
# g. Books with the most and the fewest unique words.
w_books %>% group_by(title) %>% summarise(total = n_distinct(word)) %>% arrange(desc(total)) %>% filter(row_number()==1) # Most unique words: "Elements of Military Art and Science" n = 10,816
## # A tibble: 1 × 2
## title total
## <chr> <int>
## 1 "Elements of Military Art and Science\r\nOr, Course Of Instruction In S… 10816
w_books %>% group_by(title) %>% summarise(total = n_distinct(word)) %>% arrange(desc(total)) %>% filter(row_number()==n()) # Fewest unique words: "The Defence of Duffer's Drift" n = 2,713
From here, we are only interested in the top 4 most popular works. Importantly, metadata on popularity (number of downloads) is not available through the functions of the gutenbergr package, so it has to be accessed and noted manually from https://www.gutenberg.org/ebooks/subject/89, where we see ‘Books about Military art and science (sorted by popularity)’.
In this sense, the 4 most downloaded works (as of “Tue Jan 31 14:11:18 2023”) are the ones whose Gutenberg IDs are collected in the top4books object below.
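If one wanted to avoid noting the ranking by hand, a rough sketch of scraping the popularity-sorted listing is shown below. This is only an assumption-laden illustration: it presumes the page exposes its results as /ebooks/<id> links in popularity order (which is not guaranteed and may change) and it adds rvest as an extra dependency.
# Hypothetical sketch, assuming the current layout of the subject listing page.
library(rvest)
popular_ids <- read_html("https://www.gutenberg.org/ebooks/subject/89") %>%
  html_elements("a") %>%
  html_attr("href") %>%
  str_subset("^/ebooks/[0-9]+$") %>%  # keep only links of the form /ebooks/<id>
  str_remove("/ebooks/") %>%
  unique()
head(popular_ids, 4) # candidate IDs for the four most downloaded works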
# Hence, we create our final library of the 4 most downloaded books.
top4books <- c("1946", "13549", "50750", "7294") # List with selected gutenberg_id´s
final_lib <- gutenberg_download(top4books, meta_fields = c("gutenberg_id", "author", "title"))
## Warning: ! Could not download a book at
## http://aleph.gutenberg.org/1/3/5/4/13549/13549.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
## Warning: ! Could not download a book at
## http://aleph.gutenberg.org/5/0/7/5/50750/50750.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
For this part, we follow the steps, concepts and (customized) code of Silge & Robinson (2017; Chapter 3) and Sebastian (2020) in order to address quantitatively what each book is about. We analyze the tf (term frequency: the occurrence of each word in a book of our final library), the idf (inverse document frequency: an approach that lessens the weight, or score, of the most frequent terms in favor of “rare” or less common words), and the tf-idf index (the product of the two previous measures, which detects the salient/relevant/distinctive words of a text). Note that, as the download warnings above show, only two of the four selected works (“On War” and “Battle Studies; Ancient and Modern Battle”) could be retrieved, so the following analysis covers those two books.
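The objects l_b (per-word counts by book) and total_wordsperbook used below are not created in this excerpt; a minimal sketch of how they could be obtained from final_lib with the same tidytext workflow:
# Hypothetical reconstruction: one row per (book, word) with its count...
l_b <- final_lib %>%
  unnest_tokens(word, text) %>%
  count(gutenberg_id, title, author, word, sort = TRUE)
# ...and the total number of words per book.
total_wordsperbook <- l_b %>%
  group_by(gutenberg_id, title, author) %>%
  summarize(total = sum(n))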
# Total words per book
total_wordsperbook %>% arrange(desc(total)) # "On War" by Clausewitz has the most words, n = 107,826.
## # A tibble: 2 × 4
## # Groups: gutenberg_id, title [2]
## gutenberg_id title author total
## <int> <chr> <chr> <int>
## 1 1946 On War Clausewitz, Car… 107826
## 2 7294 Battle Studies; Ancient and Modern Battle Ardant du Picq,… 81468
# Example of filtering the word count for a single title
filter(total_wordsperbook, title == "Battle Studies; Ancient and Modern Battle")
## # A tibble: 1 × 4
## # Groups: gutenberg_id, title [1]
## gutenberg_id title author total
## <int> <chr> <chr> <int>
## 1 7294 Battle Studies; Ancient and Modern Battle Ardant du Picq, … 81468
# Joining the per-word counts (by id, title and author) with the total words per book.
l_b_1<- left_join(l_b, total_wordsperbook)
## Joining with `by = join_by(gutenberg_id, title, author)`
l_b_1 %>% arrange(desc(n)) %>% head() # "the" is the most frequent word in the whole library; it appears most often in "On War" by Clausewitz, n = 8,439.
## # A tibble: 6 × 6
## gutenberg_id title author word n total
## <int> <chr> <chr> <chr> <int> <int>
## 1 1946 On War Claus… the 8439 107826
## 2 7294 Battle Studies; Ancient and Modern Bat… Ardan… the 6611 81468
## 3 1946 On War Claus… of 5295 107826
## 4 7294 Battle Studies; Ancient and Modern Bat… Ardan… of 3289 81468
## 5 1946 On War Claus… in 3046 107826
## 6 1946 On War Claus… to 2969 107826
l_b_1[order(l_b_1$n),] %>% head() # On the contrary, for example, "107" is one of the least common terms n=1 found in "On War" by Clausewitz.
## # A tibble: 6 × 6
## gutenberg_id title author word n total
## <int> <chr> <chr> <chr> <int> <int>
## 1 1946 On War Clausewitz, Carl von 107 1 107826
## 2 1946 On War Clausewitz, Carl von 10th 1 107826
## 3 1946 On War Clausewitz, Carl von 119 1 107826
## 4 1946 On War Clausewitz, Carl von 12,000 1 107826
## 5 1946 On War Clausewitz, Carl von 122v 1 107826
## 6 1946 On War Clausewitz, Carl von 130 1 107826
# Term frequency distribution: occurrences of a word (in each of our books) divided by the total number of words of the respective work (Ibid.; 3.1).
ggplot(l_b_1, aes(n/total, fill = title)) +
geom_histogram(show.legend = FALSE) +
xlim(NA, 0.0009) +
facet_wrap(~title, ncol = 2, scales = "free_y")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 285 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
# We are actually interested in the long tails, which show the number of rare words in each work (those that make a book distinguishable).
# This pattern is described by Zipf's law, which states that the frequency of a word is inversely proportional to its rank (Ibid.; 3.2).
# Example as seen in (Ibid.)
f_by_rank <- l_b_1 %>%
group_by(title, author) %>%
mutate(rank = row_number(),
`term frequency` = n/total) %>%
ungroup()
f_by_rank
## # A tibble: 13,854 × 8
## gutenberg_id title author word n total rank `term frequency`
## <int> <chr> <chr> <chr> <int> <int> <int> <dbl>
## 1 1946 On War Claus… the 8439 107826 1 0.0783
## 2 7294 Battle Studies… Ardan… the 6611 81468 1 0.0811
## 3 1946 On War Claus… of 5295 107826 2 0.0491
## 4 7294 Battle Studies… Ardan… of 3289 81468 2 0.0404
## 5 1946 On War Claus… in 3046 107826 3 0.0282
## 6 1946 On War Claus… to 2969 107826 4 0.0275
## 7 1946 On War Claus… and 2673 107826 5 0.0248
## 8 1946 On War Claus… a 2491 107826 6 0.0231
## 9 7294 Battle Studies… Ardan… to 2249 81468 3 0.0276
## 10 1946 On War Claus… is 2198 107826 7 0.0204
## # ℹ 13,844 more rows
f_by_rank %>%
  ggplot(aes(rank, `term frequency`, color = title)) +
  geom_line(linewidth = 0.9, alpha = 0.3, show.legend = TRUE) +
  scale_x_log10() +
  scale_y_log10()
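Following the same chapter of Silge & Robinson (2017; 3.2), the exponent of Zipf's law could also be estimated with a simple linear fit on the log-log scale. This is a sketch not present in the original analysis, and the rank cut-offs are arbitrary choices:
# Fit log10(term frequency) ~ log10(rank) over a middle section of the rank range,
# where the power-law behaviour is usually cleanest.
rank_subset <- f_by_rank %>% filter(rank < 1000, rank > 10)
lm(log10(`term frequency`) ~ log10(rank), data = rank_subset)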
# Computing tf, idf and tf-idf for each word, treating each title as a document (Ibid.; 3.3).
lib_tf_idf <- l_b_1 %>% bind_tf_idf(word, title, n)
lib_tf_idf
## # A tibble: 13,854 × 9
## gutenberg_id title author word n total tf idf tf_idf
## <int> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 1946 On War Claus… the 8439 107826 0.0783 0 0
## 2 7294 Battle Studies; A… Ardan… the 6611 81468 0.0811 0 0
## 3 1946 On War Claus… of 5295 107826 0.0491 0 0
## 4 7294 Battle Studies; A… Ardan… of 3289 81468 0.0404 0 0
## 5 1946 On War Claus… in 3046 107826 0.0282 0 0
## 6 1946 On War Claus… to 2969 107826 0.0275 0 0
## 7 1946 On War Claus… and 2673 107826 0.0248 0 0
## 8 1946 On War Claus… a 2491 107826 0.0231 0 0
## 9 7294 Battle Studies; A… Ardan… to 2249 81468 0.0276 0 0
## 10 1946 On War Claus… is 2198 107826 0.0204 0 0
## # ℹ 13,844 more rows
lib_tf_idf %>% arrange(desc(tf_idf))
## # A tibble: 13,854 × 9
## gutenberg_id title author word n total tf idf tf_idf
## <int> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 7294 Battle Studies;… Ardan… mora… 78 81468 9.57e-4 0.693 6.64e-4
## 2 7294 Battle Studies;… Ardan… picq 68 81468 8.35e-4 0.693 5.79e-4
## 3 7294 Battle Studies;… Ardan… orga… 61 81468 7.49e-4 0.693 5.19e-4
## 4 7294 Battle Studies;… Ardan… arda… 54 81468 6.63e-4 0.693 4.59e-4
## 5 7294 Battle Studies;… Ardan… foot… 51 81468 6.26e-4 0.693 4.34e-4
## 6 1946 On War Claus… obje… 65 107826 6.03e-4 0.693 4.18e-4
## 7 7294 Battle Studies;… Ardan… etc 46 81468 5.65e-4 0.693 3.91e-4
## 8 7294 Battle Studies;… Ardan… caes… 44 81468 5.40e-4 0.693 3.74e-4
## 9 7294 Battle Studies;… Ardan… rifle 44 81468 5.40e-4 0.693 3.74e-4
## 10 1946 On War Claus… buon… 58 107826 5.38e-4 0.693 3.73e-4
## # ℹ 13,844 more rows
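As a sanity check on the idf column: with only two books successfully downloaded, a word that appears in just one of them gets idf = ln(2/1) ≈ 0.693, while a word shared by both books gets idf = ln(2/2) = 0 (and therefore a tf-idf of 0, as seen for "the", "of" and "in" above).
log(2) # = 0.6931472, matching the idf of the book-specific terms above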
# We see some important words (nouns, verbs, adjectives, etc.) for each book, yet we also observe terms that do not carry much meaning for our purposes (fig, footnote, etc.)
# First, we customize a list of stopwords and then apply the anti_join() function with "stop_words" {tidytext package} as an argument (to further remove the 1,149 standard stop words from our library).
customstopwords <- tibble(word = c("1", "2", "3", "4","eq", "co", "rc", "ac", "ak", "bn",
"fig", "figs", "file", "cg", "cb", "cm","ab", "_k", "_k_", "_x","fig", "footnote",
"http", "of", "_of", "_ab_", "0", "deg","sidenote", "_a_","_b_", "_c_", "_o_",
"_s_", "_e_", "lu", "thou", "thy", "thee", "hast", "_abcd_", "nay", "consider'd",
"call'd", "hath", "gallery.euroweb.hu", "_an", "dost", "sayest", "seest", "thyself",
"wilt", "cf", "m.t.h.s", "_an", "shew", "shewn", "allow'd", "_c",
"transcribers", "diagram", "_photo", "_mn_", "_g_", "_p_", "_v_",
"_ac_", "_f_", "_d_", "_ad_", "_ef_", "tho", "mention'd",
"turn'd", "shewing", "form'd", "design'd", "etc", "chapter"))
tidy_words <- final_lib %>% unnest_tokens(word, text) %>%
count(gutenberg_id, author, title, word, sort = TRUE) %>% anti_join(stop_words)%>%
anti_join(customstopwords)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
# Visualizing tf-idf following (Ibid.): the top 20 words by tf-idf score for each book.
tidy_words_1 <- tidy_words %>% bind_tf_idf(word, title, n)
tidy_words_1 %>%
group_by(title) %>%
slice_max(tf_idf, n = 20) %>%
ungroup() %>%
ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = title)) +
geom_col(show.legend = FALSE) +
facet_wrap(~title, ncol = 4, scales = "free") +
labs(x = "tf-idf", y = NULL)
# Wordcloud2 tf_idf
w_cloud <- wordcloud2(tidy_words_1 %>% count(word, tf_idf, wt= tf_idf, sort = TRUE),
minSize = 0, gridSize = 0, fontFamily = "mono",
fontWeight = "normal", color = "random-light", backgroundColor = "grey",
minRotation = -pi/4, maxRotation = pi/4, shuffle = TRUE, rotateRatio = 0.4,
shape = "diamond", ellipticity = 0.65, widgetsize = NULL, figPath = NULL,
hoverFunction = NULL
)
w_cloud + WCtheme(2) + WCtheme(3)
# Tokenizing by bigram, according to (Ibid.; 4.1)
strategy_bigrams <-final_lib %>% unnest_tokens(bigram, text, token = "ngrams", n= 2) %>% filter(!is.na(bigram))
strategy_bigrams # one token (a bigram) per row
## # A tibble: 172,298 × 4
## gutenberg_id author title bigram
## <int> <chr> <chr> <chr>
## 1 1946 Clausewitz, Carl von On War on war
## 2 1946 Clausewitz, Carl von On War by general
## 3 1946 Clausewitz, Carl von On War general carl
## 4 1946 Clausewitz, Carl von On War carl von
## 5 1946 Clausewitz, Carl von On War von clausewitz
## 6 1946 Clausewitz, Carl von On War on war
## 7 1946 Clausewitz, Carl von On War war general
## 8 1946 Clausewitz, Carl von On War general carl
## 9 1946 Clausewitz, Carl von On War carl von
## 10 1946 Clausewitz, Carl von On War von clausewitz
## # ℹ 172,288 more rows
strategy_bigrams %>% count(bigram, sort= TRUE)
## # A tibble: 76,661 × 2
## bigram n
## <chr> <int>
## 1 of the 2227
## 2 in the 1124
## 3 to the 746
## 4 it is 699
## 5 on the 516
## 6 of a 401
## 7 to be 382
## 8 and the 360
## 9 at the 345
## 10 by the 334
## # ℹ 76,651 more rows
# A significant share of these bigrams contain words without much meaning for our report, such as "the", "of", "an", etc. Hence, we remove them (Ibid.; 4.1.1):
tidy_bigrams <-strategy_bigrams %>% separate(bigram,c("word1","word2"), sep = " ")
tidy_bigrams_1 <- tidy_bigrams %>% filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% filter(!word1 %in% customstopwords$word) %>% filter(!word2 %in% customstopwords$word)
# New count after filtering the stopwords for the two words composing our token unit (Ibid.):
final_bigrams <- tidy_bigrams_1 %>%
count(word1, word2, sort = TRUE)
final_bigrams # The bigram is separated in 2 columns.
## # A tibble: 11,419 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 du picq 66
## 2 ardant du 51
## 3 moral effect 50
## 4 political object 26
## 5 colonel ardant 24
## 6 enemy's force 24
## 7 military virtue 22
## 8 modern battle 22
## 9 enemy's army 21
## 10 moral forces 21
## # ℹ 11,409 more rows
# Gathering/unifying bigrams
final_bigrams_together <- final_bigrams %>% unite(bigram, word1,word2, sep = " ")
final_bigrams_together
## # A tibble: 11,419 × 2
## bigram n
## <chr> <int>
## 1 du picq 66
## 2 ardant du 51
## 3 moral effect 50
## 4 political object 26
## 5 colonel ardant 24
## 6 enemy's force 24
## 7 military virtue 22
## 8 modern battle 22
## 9 enemy's army 21
## 10 moral forces 21
## # ℹ 11,409 more rows
# Example of a filter to see the most common words preceding "strategy" (i.e., with "strategy" as the second word of the bigram) in each book.
tidy_bigrams_1 %>%
filter(word2 == "strategy") %>%
count(title, word1, sort = TRUE)
## # A tibble: 14 × 3
## title word1 n
## <chr> <chr> <int>
## 1 On War combat 2
## 2 On War constituting 2
## 3 Battle Studies; Ancient and Modern Battle moltke's 1
## 4 On War 37 1
## 5 On War book 1
## 6 On War compose 1
## 7 On War defeat 1
## 8 On War effects 1
## 9 On War forces 1
## 10 On War keeping 1
## 11 On War unfortunate 1
## 12 On War victory 1
## 13 On War war 1
## 14 On War words 1
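A complementary filter, not part of the original analysis, would look at the words that follow "strategy", i.e., bigrams where it occupies the first position:
# Hypothetical extension: most common words coming right after "strategy" in each book.
tidy_bigrams_1 %>%
  filter(word1 == "strategy") %>%
  count(title, word2, sort = TRUE)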
# Using the graph_from_data_frame function (Ibid.; 4.1.4)
final_bigrams_graph <- final_bigrams %>% filter(n > 10) %>% graph_from_data_frame()
final_bigrams_graph
## IGRAPH 18e6faa DN-- 39 28 --
## + attr: name (v/c), n (e/n)
## + edges from 18e6faa (vertex names):
## [1] du ->picq ardant ->du moral ->effect
## [4] political ->object colonel ->ardant enemy's ->force
## [7] military ->virtue modern ->battle enemy's ->army
## [10] moral ->forces ancient ->battle ancient ->combat
## [13] battle ->field hundred ->meters armed ->force
## [16] military ->force military ->history military ->spirit
## [19] reciprocal->action fire ->arms historical->documents
## [22] positive ->object editor's ->note left ->wing
## + ... omitted several edges
set.seed(2023) # setting seed
arrow <- grid::arrow(type = "closed", length = unit(.15, "inches"))
# Using ggraph
library(ggraph)
ggraph(final_bigrams_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = arrow, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "red", size = 2) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
Grimmer, J., & Stewart, B. M. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3), 267–297. https://doi.org/10.1093/pan/mps028
Kornberger, M., & Vaara, E. (2022). Strategy as engagement: What organization strategy can learn from military strategy. Long Range Planning, 55(4), 102125. https://doi.org/10.1016/j.lrp.2021.102125
Neuendorf, K. A. (2017). The Content Analysis Guidebook. SAGE Publications, Inc. https://doi.org/10.4135/9781071802878
Sebastian, A. (2020, July 16). A Gentle Introduction To Calculating The TF-IDF Values. Medium. https://towardsdatascience.com/a-gentle-introduction-to-calculating-the-tf-idf-values-9e391f8a13e5
Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach (1st edition). O’Reilly Media. https://www.tidytextmining.com/index.html