nlp_strategy by Luis Felipe Villota

Insights from military strategy literature: Exploring the role of strategy in 4 popular works available in the Gutenberg Project.

In this report, we conduct a descriptive content analysis of 4 popular works (the top 4 most downloaded books) on war/military strategy available in the Gutenberg Project’s archives, applying text mining with the tidy approach in R (see Silge & Robinson, 2017, for the guiding code; see Grimmer & Stewart, 2013; 269-271, for the principles of automated text analysis). We take a nomothetic approach by summarizing content across a selection of books, although we acknowledge that the text corpus is a small sample (Neuendorf, 2017; 23-24). The aim is two-fold: 1.) to quantitatively explore and describe the content that the mentioned repository (a free source) offers in terms of military strategy literature, and 2.) to map the role of strategy (in an n-gram network) according to the selection of books and authors, in order to draw insights that can analytically inform domains other than warfare (e.g., public policy and commerce).

Our work is based on the framework proposed by Kornberger & Vaara (2022) on strategy research (Ibid; 1-2). In their article, Strategy as engagement: What organization strategy can learn from military strategy (2022), they point to an ‘intersectionality between the two domains’ which has not been fully integrated (Ibid; 2). To begin with, the authors offer a conceptual development for the role of strategy: moving the sociological eye from previous research on internal strategy practices (a focus on processes and strategy-making within an organization) to external engagement practices with the ecosystem(s) beyond (an interactionist framing) (Ibid.; 1-3). This means reorienting current strategy research toward the nature of the practices that aim to, and can, exert clout on external actors to favor one’s interests and agenda(s), and gaining a better understanding of what changes the other’s “trajectory” through competition, collaboration, or co-option (Ibid; 2-3).

They stress the importance of drawing methodological-analytic lessons from military strategy literature in order to have strategy clearly defined (Ibid.; 8). In this sense, strategy is conceived here as a ‘bridge between two shores’: policy (as big guiding principles, purposes, or Grand Strategy) and tactics (as a means, power, or material prowess) (Ibid; 2, Ibid. on Clausewitz, Gray and Admiral Wylie; 4). According to our authors, strategy is not something to be implemented, but a “living movement” between these two “sides”, and its function is to translate ‘purposes into conducts on a battlefield and vice versa’ (Ibid. on Clausewitz; 8). Strategy ultimately refers to an effect (the one that the two “ends” of the bridge have on one another, in constant flux) and not to a concrete action or model (Ibid.; 9). Hence, its salient features are constant change, adaptation, and evolution, in order to achieve victory (effectively exercising power) in the long term through policy and never solely through ‘operational issues of warfare’ (Ibid; 4, 6). Kornberger & Vaara’s work gains importance in the current context of hybrid wars, emergent AI markets, and ambitious, transition-based climate change policies, among others. Continuing with the authors’ avowal, we consider it useful to address core principles of strategy (Ibid; 10) from popular military strategy literature in order to navigate uncertain scenarios (as traversed by the ‘fog of war’) practically, that is, to have awareness of a given situation and to train good judgment that might enlighten action (Ibid. on Clausewitz; 3, 10).

1. Downloading and loading basic packages

library(stringr)
library(forcats)
library(gutenbergr)
library(tidyverse)
library(tidytext)
library(tm)
library(textdata)
library(psych)
library(skimr)
library(wordcloud2)
library(tidyr)
library(lifecycle)
library(scales)
library(igraph)
library(ggraph)

2. Gathering and selecting data from the Gutenberg Project

2.1 Gathering metadata and creating objects to look up the gutenberg_id values that identify all books in the Gutenberg Project archive.

gut_works <-data.frame(gutenberg_works())

str(gut_works)

## 'data.frame':    44042 obs. of  8 variables:
##  $ gutenberg_id       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ title              : chr  "The Declaration of Independence of the United States of America" "The United States Bill of Rights\r\nThe Ten Original Amendments to the Constitution of the United States" "John F. Kennedy's Inaugural Address" "Lincoln's Gettysburg Address\r\nGiven November 19, 1863 on the battlefield near Gettysburg, Pennsylvania, USA" ...
##  $ author             : chr  "Jefferson, Thomas" "United States" "Kennedy, John F. (John Fitzgerald)" "Lincoln, Abraham" ...
##  $ gutenberg_author_id: int  1638 1 1666 3 1 4 NA 3 3 NA ...
##  $ language           : chr  "en" "en" "en" "en" ...
##  $ gutenberg_bookshelf: chr  "Politics/American Revolutionary War/United States Law" "Politics/American Revolutionary War/United States Law" "" "US Civil War" ...
##  $ rights             : chr  "Public domain in the USA." "Public domain in the USA." "Public domain in the USA." "Public domain in the USA." ...
##  $ has_text           : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...

# We have a data frame with 44,042 different gutenberg_id's (total number of rows). There are 8 variables ("gutenberg_id", "title", "author", "gutenberg_author_id", "language", "gutenberg_bookshelf", "rights", "has_text")

gut_meta <- gutenberg_metadata

str(gut_meta) # We have a tibble with 72,569 different gutenberg_id's (total number of rows). We have the same 8 variables

## tibble [72,569 × 8] (S3: tbl_df/tbl/data.frame)
##  $ gutenberg_id       : int [1:72569] 1 2 3 4 5 6 7 8 9 10 ...
##  $ title              : chr [1:72569] "The Declaration of Independence of the United States of America" "The United States Bill of Rights\r\nThe Ten Original Amendments to the Constitution of the United States" "John F. Kennedy's Inaugural Address" "Lincoln's Gettysburg Address\r\nGiven November 19, 1863 on the battlefield near Gettysburg, Pennsylvania, USA" ...
##  $ author             : chr [1:72569] "Jefferson, Thomas" "United States" "Kennedy, John F. (John Fitzgerald)" "Lincoln, Abraham" ...
##  $ gutenberg_author_id: int [1:72569] 1638 1 1666 3 1 4 NA 3 3 NA ...
##  $ language           : chr [1:72569] "en" "en" "en" "en" ...
##  $ gutenberg_bookshelf: chr [1:72569] "Politics/American Revolutionary War/United States Law" "Politics/American Revolutionary War/United States Law" "" "US Civil War" ...
##  $ rights             : chr [1:72569] "Public domain in the USA." "Public domain in the USA." "Public domain in the USA." "Public domain in the USA." ...
##  $ has_text           : logi [1:72569] TRUE TRUE TRUE TRUE TRUE TRUE ...
##  - attr(*, "date_updated")= Date[1:1], format: "2022-12-19"

2.2 Selecting and importing books by subject

gut_sub<- gutenberg_subjects

str(gut_sub) # We have a tibble with 231,741 rows (one per combination of gutenberg_id and subject). There are 3 variables ("gutenberg_id", "subject_type", "subject")

## tibble [231,741 × 3] (S3: tbl_df/tbl/data.frame)
##  $ gutenberg_id: int [1:231741] 1 1 1 1 2 2 2 2 3 3 ...
##  $ subject_type: chr [1:231741] "lcsh" "lcsh" "lcc" "lcc" ...
##  $ subject     : chr [1:231741] "United States -- History -- Revolution, 1775-1783 -- Sources" "United States. Declaration of Independence" "E201" "JK" ...
##  - attr(*, "date_updated")= Date[1:1], format: "2022-12-19"

length(unique(gut_sub$subject)) # There are 38,229 unique subjects

## [1] 38229

unique(gut_sub$subject_type) # There are 2 subject types: "lcsh" (Library of Congress Subject Headings) and "lcc" (Library of Congress Classification).

## [1] "lcsh" "lcc"

Of course, one book can be associated with several subjects and subject types. We note that subjects are frequently ‘sui generis’ or very broad. However, since our objective is to analyse popular works on war/military strategy, the existing “Military art and science” label in the Gutenberg Project is useful for selecting books.

sub_sub <- gut_sub %>% filter(subject == "Military art and science") # There are 19 works in this filtered-by-subject tibble.

2.3 Downloading the works by subject (Military art and science), including the title and author metadata, into an object (tibble) that is our initial library for this report.

sub_books <- gutenberg_download(sub_sub, meta_fields = c("title", "author", "gutenberg_id"))

## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest

## Using mirror http://aleph.gutenberg.org

## Warning: ! Could not download a book at
##   http://aleph.gutenberg.org/1/3/5/4/13549/13549.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.

## Warning: ! Could not download a book at
##   http://aleph.gutenberg.org/1/4/6/2/14625/14625.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.

## Warning: ! Could not download a book at
##   http://aleph.gutenberg.org/4/4/2/0/44200/44200.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.

## Warning: ! Could not download a book at
##   http://aleph.gutenberg.org/4/4/6/3/44635/44635.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.

## Warning: ! Could not download a book at
##   http://aleph.gutenberg.org/4/8/3/6/48366/48366.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.

## Warning: ! Could not download a book at
##   http://aleph.gutenberg.org/4/8/5/1/48512/48512.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.

## Warning: ! Could not download a book at
##   http://aleph.gutenberg.org/5/0/7/5/50750/50750.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.

## Warning: ! Could not download a book at
##   http://aleph.gutenberg.org/5/5/1/0/55109/55109.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.

## Warning: ! Could not download a book at
##   http://aleph.gutenberg.org/5/5/1/8/55185/55185.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.

## Warning: ! Could not download a book at
##   http://aleph.gutenberg.org/5/6/1/4/56146/56146.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.

## Warning: ! Could not download a book at
##   http://aleph.gutenberg.org/6/4/9/2/64927/64927.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.

str(sub_books) # 61,447 rows (total lines of books' content) X 4 columns ("gutenberg_id", "text", "title", "author"). Only the books that were downloaded successfully are included.

## tibble [61,447 × 4] (S3: tbl_df/tbl/data.frame)
##  $ gutenberg_id: int [1:61447] 1946 1946 1946 1946 1946 1946 1946 1946 1946 1946 ...
##  $ text        : chr [1:61447] "ON WAR" "" "by General Carl von Clausewitz" "" ...
##  $ title       : chr [1:61447] "On War" "On War" "On War" "On War" ...
##  $ author      : chr [1:61447] "Clausewitz, Carl von" "Clausewitz, Carl von" "Clausewitz, Carl von" "Clausewitz, Carl von" ...

# List of books in the library (grouping the lines of content by gutenberg_id, author and title)

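sub_books %>% count(gutenberg_id, author, title) # Of the 19 works selected by subject, only 8 were downloaded (the other 11 produced the warnings above). They come from 7 different authors, since Clausewitz appears twice.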

## # A tibble: 8 × 4
##   gutenberg_id author                                      title               n
##          <int> <chr>                                       <chr>           <int>
## 1         1946 Clausewitz, Carl von                        "On War"        10922
## 2         7294 Ardant du Picq, Charles Jean Jacques Joseph "Battle Studie…  9081
## 3        16170 Halleck, H. W. (Henry Wager)                "Elements of M… 14969
## 4        23473 Anonymous                                   "Lectures on L…  7079
## 5        24842 Swinton, E. D. (Ernest Dunlop)              "The Defence o…  1735
## 6        34459 Corbin, Thomas W.                           "The Romance o… 10454
## 7        36693 Clausewitz, Carl von                        "Grundgedanken…  2850
## 8        59804 Radiguet, René-Louis-Jules                  "The Making of…  4357
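The gap between the 19 selected works and the 8 downloaded ones can be checked directly. The following is a minimal base R sketch, not part of the original analysis: the two ID vectors are copied by hand from the table and the warnings above, and the variable names are illustrative. In the live session, the same comparison would presumably be setdiff(sub_sub$gutenberg_id, unique(sub_books$gutenberg_id)).

```r
# IDs that appear in the count table above (successfully downloaded)
downloaded_ids <- c(1946, 7294, 16170, 23473, 24842, 34459, 36693, 59804)

# IDs reported in the "Could not download a book" warnings above
failed_ids_from_warnings <- c(13549, 14625, 44200, 44635, 48366, 48512,
                              50750, 55109, 55185, 56146, 64927)

# All IDs that were requested through the subject filter
requested_ids <- sort(c(downloaded_ids, failed_ids_from_warnings))

# setdiff() keeps the requested IDs that are absent from the download
missing_ids <- setdiff(requested_ids, downloaded_ids)

length(requested_ids) # number of works requested
length(missing_ids)   # number of works that did not arrive
```

Listing the missing IDs this way makes it easier to retry them with a different mirror, as the warning text suggests.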

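# Note: In the metadata, the author of "Military Instructors Manual" (gutenberg_id = 14625) appears as NA, but the real author is Captain James P. Cole. This book is among those that could not be downloaded in this run (see the warnings above).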

## [1] "Sun Nov 17 00:32:34 2024"

3. Exploratory Data Analysis

# Asking a series of simple questions and taking simple steps to become familiar with the text corpus.

# a. The longest and shortest books (in terms of lines of content).

sub_books %>% count(gutenberg_id, author, title) %>% arrange(desc(n)) %>% slice_head() # Longest (by lines): "Elements of Military Art and Science" n=14,969

## # A tibble: 1 × 4
##   gutenberg_id author                       title                              n
##          <int> <chr>                        <chr>                          <int>
## 1        16170 Halleck, H. W. (Henry Wager) "Elements of Military Art and… 14969

sub_books %>% count(gutenberg_id, author, title) %>% arrange(desc(n)) %>% slice_tail() # Shortest (by lines): "The Defence of Duffer's Dri…" by Swinton n=1,735

## # A tibble: 1 × 4
##   gutenberg_id author                         title                            n
##          <int> <chr>                          <chr>                        <int>
## 1        24842 Swinton, E. D. (Ernest Dunlop) The Defence of Duffer's Dri…  1735

# As Grimmer & Stewart (2013; 272) point out, lengthier texts are better suited for automated content analysis (more words, more data).

From here, we tokenize by one-word-as-a-unit using the unnest_tokens() function from the tidytext package. That is, starting from our library of books on Military art and science (the 8 of 19 that were downloaded), we unnest the words from the “text” column to obtain a tidy data frame in which each row represents a single token (one word).

# We create the object "w_books" to save this tokenization for further operations. 

w_books<- sub_books %>% unnest_tokens(word, text)
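To make the one-token-per-row idea concrete, here is a minimal base R sketch on a toy data frame. It only approximates what unnest_tokens() does (lowercasing, dropping punctuation, splitting on spaces); the helper tokenize_words() and the toy sentences are hypothetical and are not part of tidytext or of the books.

```r
# Toy "library": one row per line of text, as in sub_books
toy <- data.frame(
  title = c("Book A", "Book A", "Book B"),
  text  = c("War is the continuation of policy.", "", "Morale decides the battle!"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lowercase, replace punctuation by spaces, split on spaces
tokenize_words <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9' ]+", " ", x)
  tokens <- strsplit(trimws(x), " +")[[1]]
  tokens[nzchar(tokens)] # drop empty strings (e.g. from blank lines)
}

tokens_per_line <- lapply(toy$text, tokenize_words)

# One row per token, with the title repeated for every word of its line
toy_tokens <- data.frame(
  title = rep(toy$title, lengths(tokens_per_line)),
  word  = unlist(tokens_per_line),
  stringsAsFactors = FALSE
)

toy_tokens
```

Blank lines contribute no rows, which is why the number of tokens is not a simple multiple of the number of lines.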

# b. How many words in total are in the library?

length(w_books$word) # There are 553,695 words in total

## [1] 553695

# c. List of books in the library sorted by descending number of total words (grouping by gutenberg_id, author and title).

w_books %>% group_by(gutenberg_id, author, title) %>% summarize(total = n()) %>% arrange(desc(total))

## `summarise()` has grouped output by 'gutenberg_id', 'author'. You can override
## using the `.groups` argument.

## # A tibble: 8 × 4
## # Groups:   gutenberg_id, author [8]
##   gutenberg_id author                                      title           total
##          <int> <chr>                                       <chr>           <int>
## 1        16170 Halleck, H. W. (Henry Wager)                "Elements of … 141324
## 2         1946 Clausewitz, Carl von                        "On War"       107826
## 3        34459 Corbin, Thomas W.                           "The Romance …  88763
## 4         7294 Ardant du Picq, Charles Jean Jacques Joseph "Battle Studi…  81468
## 5        23473 Anonymous                                   "Lectures on …  62725
## 6        59804 Radiguet, René-Louis-Jules                  "The Making o…  31595
## 7        36693 Clausewitz, Carl von                        "Grundgedanke…  22968
## 8        24842 Swinton, E. D. (Ernest Dunlop)              "The Defence …  17026

# d. Which is the longest and shortest book (in terms of total word counts)?

w_books %>% count(gutenberg_id, author, title) %>% arrange(desc(n)) %>% slice_head() # Longest (by words): "Elements of Military Art and Science" n=141,324

## # A tibble: 1 × 4
##   gutenberg_id author                       title                              n
##          <int> <chr>                        <chr>                          <int>
## 1        16170 Halleck, H. W. (Henry Wager) "Elements of Military Art an… 141324

w_books %>% count(gutenberg_id, author, title) %>% arrange(desc(n)) %>% slice_tail() # Shortest (by words): "The Defence of Duffer's Dri…" by Swinton n=17,026

## # A tibble: 1 × 4
##   gutenberg_id author                         title                            n
##          <int> <chr>                          <chr>                        <int>
## 1        24842 Swinton, E. D. (Ernest Dunlop) The Defence of Duffer's Dri… 17026

# e. How many unique words are in the library?

w_books %>% count(word) %>% summarize(total = n()) %>% pull(total) # 24,968 unique words in total

## [1] 24968

# f. List of books in the library sorted by descending number of unique words (grouping by gutenberg_id, author and title).

w_books %>% group_by(gutenberg_id, author, title) %>% summarise( total = n_distinct(word)) %>% arrange(desc(total))

## `summarise()` has grouped output by 'gutenberg_id', 'author'. You can override
## using the `.groups` argument.

## # A tibble: 8 × 4
## # Groups:   gutenberg_id, author [8]
##   gutenberg_id author                                      title           total
##          <int> <chr>                                       <chr>           <int>
## 1        16170 Halleck, H. W. (Henry Wager)                "Elements of M… 10816
## 2        34459 Corbin, Thomas W.                           "The Romance o…  7386
## 3         1946 Clausewitz, Carl von                        "On War"         6939
## 4         7294 Ardant du Picq, Charles Jean Jacques Joseph "Battle Studie…  6915
## 5        23473 Anonymous                                   "Lectures on L…  5695
## 6        59804 Radiguet, René-Louis-Jules                  "The Making of…  4166
## 7        36693 Clausewitz, Carl von                        "Grundgedanken…  4116
## 8        24842 Swinton, E. D. (Ernest Dunlop)              "The Defence o…  2713

# g. Books with the most and least amount of unique words.

w_books %>% group_by(title) %>% summarise( total = n_distinct(word)) %>% arrange(desc(total)) %>% filter(row_number()==1) # Most unique words: "Elements of Military Art and Science" n=10,816

## # A tibble: 1 × 2
##   title                                                                    total
##   <chr>                                                                    <int>
## 1 "Elements of Military Art and Science\r\nOr, Course Of Instruction In S… 10816

w_books %>% group_by(title) %>% summarise( total = n_distinct(word)) %>% arrange(desc(total)) %>% slice_tail(n = 1) # Least unique words: Swinton's book (gutenberg_id 24842) n=2,713, as shown in list f above. Note: filter(row_number()==19) returns an empty tibble here, because only 8 books were downloaded.

From here, we are only interested in the top 4 most popular works. Importantly, metadata on popularity (number of downloads) is not available through the functions of the gutenbergr package, so it has to be accessed and noted manually from https://www.gutenberg.org/ebooks/subject/89, which lists ‘Books about Military art and science (sorted by popularity)’.

In this sense, the 4 most downloaded works are:

“On War” by Carl von Clausewitz (gutenberg_id: 1946) = 2,438 downloads
“The Art of War” by Antoine Henri baron de Jomini (gutenberg_id: 13549) = 543 downloads
“The Officer’s Manual: Napoleon’s Maxims of War” by Napoleon I, Emperor of the French (gutenberg_id: 50750) = 378 downloads
“Battle Studies; Ancient and Modern Battle” by Charles Jean Jacques Joseph Ardant du Picq (gutenberg_id: 7294) = 181 downloads

These download counts are as of: “Tue Jan 31 14:11:18 2023”

# Hence, we create our final library of the 4 most downloaded books. 

top4books <- c("1946", "13549", "50750", "7294") # Vector with the selected gutenberg_id values

final_lib <- gutenberg_download(top4books, meta_fields = c("gutenberg_id", "author", "title"))

## Warning: ! Could not download a book at
##   http://aleph.gutenberg.org/1/3/5/4/13549/13549.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.

## Warning: ! Could not download a book at
##   http://aleph.gutenberg.org/5/0/7/5/50750/50750.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.

4. Term frequency (tf), inverse document frequency (idf) and the tf-idf statistic

For this part, we follow the steps, concepts and (customized) code of Silge & Robinson (2017; Chapter 3; see also Sebastian, 2020) in order to address quantitatively what each book is about. We analyze the tf (the occurrence of each word in a book from our final library), the idf (an approach that lessens the “weight” (score) of the most frequent terms in favor of “rare” or less common words) and the tf-idf index (the product of the two previous measures, used to detect the salient/relevant/particular words in a text).
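The three measures can be computed by hand on a toy corpus. The sketch below uses base R only and assumes the usual definitions that bind_tf_idf() also follows: tf = n / total words in the document, idf = ln(number of documents / number of documents containing the word), and tf-idf = tf × idf. The counts are invented for illustration. With two documents, idf can only be ln(2/2) = 0 or ln(2/1) ≈ 0.693.

```r
# Toy word counts for two documents (invented numbers)
counts <- data.frame(
  doc  = c("A", "A", "A", "B", "B"),
  word = c("the", "war", "policy", "the", "morale"),
  n    = c(8, 3, 1, 6, 2),
  stringsAsFactors = FALSE
)

# tf: share of a document's tokens taken up by each word
doc_totals <- tapply(counts$n, counts$doc, sum)
counts$tf <- counts$n / as.numeric(doc_totals[counts$doc])

# idf: natural log of (number of documents / documents containing the word)
n_docs <- length(unique(counts$doc))
docs_with_word <- tapply(counts$doc, counts$word, function(d) length(unique(d)))
counts$idf <- log(n_docs / as.numeric(docs_with_word[counts$word]))

# tf-idf: product of the two measures
counts$tf_idf <- counts$tf * counts$idf

counts[order(-counts$tf_idf), ]
```

A word such as "the", which occurs in every document, gets idf = 0 and therefore tf-idf = 0 regardless of how frequent it is, while words found in a single document keep a positive score.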

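# Total words per book

# Note: the code that defines "l_b" and "total_wordsperbook" is not shown in this report. The two definitions below are an assumed reconstruction that follows Silge & Robinson (2017; Chapter 3) and is consistent with the columns and groups printed in the outputs.

l_b <- final_lib %>% unnest_tokens(word, text) %>% count(gutenberg_id, title, author, word, sort = TRUE) # Word counts per book (assumed definition)

total_wordsperbook <- l_b %>% group_by(gutenberg_id, title, author) %>% summarize(total = sum(n)) # Total words per book (assumed definition)

total_wordsperbook %>% arrange(desc(total)) # Only 2 of the 4 selected books were downloaded in this run. Of these, "On War" by Clausewitz has the most words n = 107,826.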

## # A tibble: 2 × 4
## # Groups:   gutenberg_id, title [2]
##   gutenberg_id title                                     author            total
##          <int> <chr>                                     <chr>             <int>
## 1         1946 On War                                    Clausewitz, Car… 107826
## 2         7294 Battle Studies; Ancient and Modern Battle Ardant du Picq,…  81468

# Example of filtering the word count by title

filter(total_wordsperbook, title == "Battle Studies; Ancient and Modern Battle")

## # A tibble: 1 × 4
## # Groups:   gutenberg_id, title [1]
##   gutenberg_id title                                     author            total
##          <int> <chr>                                     <chr>             <int>
## 1         7294 Battle Studies; Ancient and Modern Battle Ardant du Picq, … 81468

# Gathering tf by id, title and author with total words per title and author together.

l_b_1<- left_join(l_b, total_wordsperbook)

## Joining with `by = join_by(gutenberg_id, title, author)`

l_b_1 %>% arrange(desc(n)) %>% head() # "the" is the most frequent word in the library and appears most in "On War" by Clausewitz n = 8,439.

## # A tibble: 6 × 6
##   gutenberg_id title                                   author word      n  total
##          <int> <chr>                                   <chr>  <chr> <int>  <int>
## 1         1946 On War                                  Claus… the    8439 107826
## 2         7294 Battle Studies; Ancient and Modern Bat… Ardan… the    6611  81468
## 3         1946 On War                                  Claus… of     5295 107826
## 4         7294 Battle Studies; Ancient and Modern Bat… Ardan… of     3289  81468
## 5         1946 On War                                  Claus… in     3046 107826
## 6         1946 On War                                  Claus… to     2969 107826

l_b_1[order(l_b_1$n),] %>% head() # Conversely, "107" is one of the least common terms (n = 1), found in "On War" by Clausewitz.

## # A tibble: 6 × 6
##   gutenberg_id title  author               word       n  total
##          <int> <chr>  <chr>                <chr>  <int>  <int>
## 1         1946 On War Clausewitz, Carl von 107        1 107826
## 2         1946 On War Clausewitz, Carl von 10th       1 107826
## 3         1946 On War Clausewitz, Carl von 119        1 107826
## 4         1946 On War Clausewitz, Carl von 12,000     1 107826
## 5         1946 On War Clausewitz, Carl von 122v       1 107826
## 6         1946 On War Clausewitz, Carl von 130        1 107826

4.1 First visualizations: tf distribution and Zipf’s Law

# Term frequency distribution: occurrences of a word (in each of our books) divided by the total number of words (of the respective work) (Ibid.; 3.1).

ggplot(l_b_1, aes(n/total, fill = title)) +
  geom_histogram(show.legend = FALSE) +
  xlim(NA, 0.0009) +
  facet_wrap(~title, ncol = 2, scales = "free_y")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 285 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

# We are actually interested in the long tails, which help us see the number of rare words in each of the works (those which make a book distinguishable).

# This pattern is known as Zipf's law, which states that the frequency of a word is inversely proportional to its rank (Ibid.; 3.2).

# Example as seen in (Ibid.). The rank relies on the rows of l_b_1 being sorted by descending n within each book.

f_by_rank <- l_b_1 %>% 
  group_by(title, author) %>% 
  mutate(rank = row_number(), 
         `term frequency` = n/total) %>%
  ungroup()

f_by_rank

## # A tibble: 13,854 × 8
##    gutenberg_id title           author word      n  total  rank `term frequency`
##           <int> <chr>           <chr>  <chr> <int>  <int> <int>            <dbl>
##  1         1946 On War          Claus… the    8439 107826     1           0.0783
##  2         7294 Battle Studies… Ardan… the    6611  81468     1           0.0811
##  3         1946 On War          Claus… of     5295 107826     2           0.0491
##  4         7294 Battle Studies… Ardan… of     3289  81468     2           0.0404
##  5         1946 On War          Claus… in     3046 107826     3           0.0282
##  6         1946 On War          Claus… to     2969 107826     4           0.0275
##  7         1946 On War          Claus… and    2673 107826     5           0.0248
##  8         1946 On War          Claus… a      2491 107826     6           0.0231
##  9         7294 Battle Studies… Ardan… to     2249  81468     3           0.0276
## 10         1946 On War          Claus… is     2198 107826     7           0.0204
## # ℹ 13,844 more rows

f_by_rank %>% 
  ggplot(aes(rank, `term frequency`, color = title)) + 
  geom_line(linewidth = 0.9, alpha = 0.3, show.legend = TRUE) + # linewidth replaces size, which is deprecated for lines since ggplot2 3.4.0
  scale_x_log10() +
  scale_y_log10()
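Zipf's law implies a straight line with a slope close to -1 on log-log axes. A minimal base R sketch of that check on an idealized corpus (synthetic frequencies proportional to 1/rank, not the books' data) is shown below. In the live session, a comparable fit would presumably be lm(log10(`term frequency`) ~ log10(rank), data = f_by_rank), in the spirit of Silge & Robinson (2017; 3.2).

```r
# Idealized Zipfian corpus: frequency proportional to 1 / rank
ranks <- 1:1000
freq <- (1 / ranks) / sum(1 / ranks) # normalized so that frequencies sum to 1

# Linear fit on log-log axes; the slope estimates the Zipf exponent
fit <- lm(log10(freq) ~ log10(ranks))
slope <- unname(coef(fit)[2])

slope # close to -1 for this synthetic data
```

Real texts usually deviate from this ideal at the very high and very low ranks, so the slope estimated from actual books should be read as a rough summary rather than an exact law.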

4.2 The tf-idf statistic

# Gathering tf_idf by gutenberg_id, title and author with total words per title and author together (Ibid.; 3.3).

lib_tf_idf <- l_b_1 %>% bind_tf_idf(word, title, n)

lib_tf_idf

## # A tibble: 13,854 × 9
##    gutenberg_id title              author word      n  total     tf   idf tf_idf
##           <int> <chr>              <chr>  <chr> <int>  <int>  <dbl> <dbl>  <dbl>
##  1         1946 On War             Claus… the    8439 107826 0.0783     0      0
##  2         7294 Battle Studies; A… Ardan… the    6611  81468 0.0811     0      0
##  3         1946 On War             Claus… of     5295 107826 0.0491     0      0
##  4         7294 Battle Studies; A… Ardan… of     3289  81468 0.0404     0      0
##  5         1946 On War             Claus… in     3046 107826 0.0282     0      0
##  6         1946 On War             Claus… to     2969 107826 0.0275     0      0
##  7         1946 On War             Claus… and    2673 107826 0.0248     0      0
##  8         1946 On War             Claus… a      2491 107826 0.0231     0      0
##  9         7294 Battle Studies; A… Ardan… to     2249  81468 0.0276     0      0
## 10         1946 On War             Claus… is     2198 107826 0.0204     0      0
## # ℹ 13,844 more rows

lib_tf_idf %>% arrange(desc(tf_idf))

## # A tibble: 13,854 × 9
##    gutenberg_id title            author word      n  total      tf   idf  tf_idf
##           <int> <chr>            <chr>  <chr> <int>  <int>   <dbl> <dbl>   <dbl>
##  1         7294 Battle Studies;… Ardan… mora…    78  81468 9.57e-4 0.693 6.64e-4
##  2         7294 Battle Studies;… Ardan… picq     68  81468 8.35e-4 0.693 5.79e-4
##  3         7294 Battle Studies;… Ardan… orga…    61  81468 7.49e-4 0.693 5.19e-4
##  4         7294 Battle Studies;… Ardan… arda…    54  81468 6.63e-4 0.693 4.59e-4
##  5         7294 Battle Studies;… Ardan… foot…    51  81468 6.26e-4 0.693 4.34e-4
##  6         1946 On War           Claus… obje…    65 107826 6.03e-4 0.693 4.18e-4
##  7         7294 Battle Studies;… Ardan… etc      46  81468 5.65e-4 0.693 3.91e-4
##  8         7294 Battle Studies;… Ardan… caes…    44  81468 5.40e-4 0.693 3.74e-4
##  9         7294 Battle Studies;… Ardan… rifle    44  81468 5.40e-4 0.693 3.74e-4
## 10         1946 On War           Claus… buon…    58 107826 5.38e-4 0.693 3.73e-4
## # ℹ 13,844 more rows

# We see some important words (nouns, verbs, adjectives, etc.) for each book, yet we also observe some terms that apparently do not carry much meaning (fig, footnote, etc.).

4.3 Removing stopwords

# First, we build a custom list of stopwords. Then we apply the anti_join() function twice: once with "stop_words" {tidytext package}, which removes its 1,149 stop-word entries from our library, and once with our custom list.

customstopwords <- tibble(word =   c("1", "2", "3", "4","eq", "co", "rc", "ac", "ak", "bn", 
"fig", "figs", "file", "cg", "cb", "cm","ab", "_k", "_k_", "_x","fig", "footnote", 
"http", "of", "_of", "_ab_", "0", "deg","sidenote", "_a_","_b_", "_c_", "_o_",
"_s_", "_e_", "lu", "thou", "thy", "thee", "hast", "_abcd_", "nay", "consider'd",
"call'd", "hath", "gallery.euroweb.hu", "_an", "dost", "sayest", "seest", "thyself",
"wilt", "cf", "m.t.h.s", "_an", "shew", "shewn", "allow'd", "_c", 
"transcribers", "diagram", "_photo", "_mn_", "_g_", "_p_", "_v_", 
"_ac_", "_f_", "_d_", "_ad_", "_ef_", "tho", "mention'd", 
"turn'd", "shewing", "form'd", "design'd", "etc", "chapter"))

tidy_words <- final_lib %>%
  unnest_tokens(word, text) %>%
  count(gutenberg_id, author, title, word, sort = TRUE) %>%
  anti_join(stop_words) %>%
  anti_join(customstopwords)

## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`

4.4 Second visualization: tf-idf by title (without stopwords)

# Visualizing tf_idf according to (Ibid.). Here we show the 20 words with the highest tf-idf in each book.

tidy_words_1 <- tidy_words  %>% bind_tf_idf(word, title, n) 

tidy_words_1 %>%
  group_by(title) %>%
  slice_max(tf_idf, n = 20) %>%
  ungroup() %>%
  ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, ncol = 4, scales = "free") +
  labs(x = "tf-idf", y = NULL)

4.5 Wordcloud

# Wordcloud2 tf_idf 

w_cloud <- wordcloud2(tidy_words_1 %>% count(word, tf_idf, wt= tf_idf, sort = TRUE), 
        minSize = 0, gridSize = 0, fontFamily = "mono", 
        fontWeight = "normal", color = "random-light", backgroundColor = "grey", 
        minRotation = -pi/4, maxRotation = pi/4, shuffle = TRUE, rotateRatio = 0.4, 
        shape = "diamond", ellipticity = 0.65, widgetsize = NULL, figPath = NULL, 
        hoverFunction = NULL
)

w_cloud + WCtheme(2) + WCtheme(3)

5. Relationships between words

5.1 Tokenizing by n-grams

# Token = bigram, following (Ibid.; 4.1)

strategy_bigrams <- final_lib %>% unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% filter(!is.na(bigram))

strategy_bigrams # one token (a bigram) per row

## # A tibble: 172,298 × 4
##    gutenberg_id author               title  bigram        
##           <int> <chr>                <chr>  <chr>         
##  1         1946 Clausewitz, Carl von On War on war        
##  2         1946 Clausewitz, Carl von On War by general    
##  3         1946 Clausewitz, Carl von On War general carl  
##  4         1946 Clausewitz, Carl von On War carl von      
##  5         1946 Clausewitz, Carl von On War von clausewitz
##  6         1946 Clausewitz, Carl von On War on war        
##  7         1946 Clausewitz, Carl von On War war general   
##  8         1946 Clausewitz, Carl von On War general carl  
##  9         1946 Clausewitz, Carl von On War carl von      
## 10         1946 Clausewitz, Carl von On War von clausewitz
## # ℹ 172,288 more rows

5.2 Counting initial bigrams, removing stop words and final count

strategy_bigrams %>% count(bigram, sort= TRUE)

## # A tibble: 76,661 × 2
##    bigram      n
##    <chr>   <int>
##  1 of the   2227
##  2 in the   1124
##  3 to the    746
##  4 it is     699
##  5 on the    516
##  6 of a      401
##  7 to be     382
##  8 and the   360
##  9 at the    345
## 10 by the    334
## # ℹ 76,651 more rows

# Many of the most frequent bigrams consist of words that carry little meaning for our report, such as "the", "of", "an", etc. Hence, we remove them (Ibid.; 4.1.1): 

tidy_bigrams <- strategy_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# customstopwords is a tibble, so we filter against its word column
tidy_bigrams_1 <- tidy_bigrams %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word1 %in% customstopwords$word) %>%
  filter(!word2 %in% customstopwords$word)

# New count after removing every bigram in which either word is a stop word (Ibid.): 

final_bigrams <- tidy_bigrams_1 %>% 
  count(word1, word2, sort = TRUE)

final_bigrams # The bigram is separated in 2 columns.

## # A tibble: 11,419 × 3
##    word1     word2      n
##    <chr>     <chr>  <int>
##  1 du        picq      66
##  2 ardant    du        51
##  3 moral     effect    50
##  4 political object    26
##  5 colonel   ardant    24
##  6 enemy's   force     24
##  7 military  virtue    22
##  8 modern    battle    22
##  9 enemy's   army      21
## 10 moral     forces    21
## # ℹ 11,409 more rows

# Reuniting the two words into a single bigram column

final_bigrams_together <- final_bigrams %>% unite(bigram, word1, word2, sep = " ")

final_bigrams_together

## # A tibble: 11,419 × 2
##    bigram               n
##    <chr>            <int>
##  1 du picq             66
##  2 ardant du           51
##  3 moral effect        50
##  4 political object    26
##  5 colonel ardant      24
##  6 enemy's force       24
##  7 military virtue     22
##  8 modern battle       22
##  9 enemy's army        21
## 10 moral forces        21
## # ℹ 11,409 more rows
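
The steps of this subsection (tokenize, separate, filter both words, count, reunite) can be sketched end to end on a single made-up sentence. The objects toy_text and toy_stops are hypothetical and exist only for this illustration; the stop list is deliberately tiny so the result is easy to check by hand.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical one-sentence corpus and a two-word stop list (illustration only)
toy_text  <- tibble(title = "Toy", text = "the moral effect of modern battle")
toy_stops <- tibble(word = c("the", "of"))

toy_bigrams <- toy_text %>%
  # six words yield five overlapping bigrams
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # drop a bigram if either of its words is a stop word;
  # note the comparison against the word column, not the tibble itself
  filter(!word1 %in% toy_stops$word, !word2 %in% toy_stops$word) %>%
  count(word1, word2, sort = TRUE) %>%
  unite(bigram, word1, word2, sep = " ")

toy_bigrams
```

Of the five bigrams in the toy sentence, only "moral effect" and "modern battle" survive, because the other three each contain "the" or "of".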

5.3 Exploring bigrams: filtering for strategy

# Example filter: the words that most often precede "strategy" (i.e., bigrams whose second word is "strategy") in each book. 

tidy_bigrams_1 %>%
  filter(word2 == "strategy") %>%
  count(title, word1, sort = TRUE)

## # A tibble: 14 × 3
##    title                                     word1            n
##    <chr>                                     <chr>        <int>
##  1 On War                                    combat           2
##  2 On War                                    constituting     2
##  3 Battle Studies; Ancient and Modern Battle moltke's         1
##  4 On War                                    37               1
##  5 On War                                    book             1
##  6 On War                                    compose          1
##  7 On War                                    defeat           1
##  8 On War                                    effects          1
##  9 On War                                    forces           1
## 10 On War                                    keeping          1
## 11 On War                                    unfortunate      1
## 12 On War                                    victory          1
## 13 On War                                    war              1
## 14 On War                                    words            1

5.4 Visualizing the network

# Building a directed graph from the bigrams that occur more than 10 times, using the graph_from_data_frame function (Ibid.; 4.1.4)

final_bigrams_graph <- final_bigrams %>%
  filter(n > 10) %>%
  graph_from_data_frame()

final_bigrams_graph

## IGRAPH 18e6faa DN-- 39 28 -- 
## + attr: name (v/c), n (e/n)
## + edges from 18e6faa (vertex names):
##  [1] du        ->picq      ardant    ->du        moral     ->effect   
##  [4] political ->object    colonel   ->ardant    enemy's   ->force    
##  [7] military  ->virtue    modern    ->battle    enemy's   ->army     
## [10] moral     ->forces    ancient   ->battle    ancient   ->combat   
## [13] battle    ->field     hundred   ->meters    armed     ->force    
## [16] military  ->force     military  ->history   military  ->spirit   
## [19] reciprocal->action    fire      ->arms      historical->documents
## [22] positive  ->object    editor's  ->note      left      ->wing     
## + ... omitted several edges

set.seed(2023) # fixing the seed so the "fr" layout is reproducible

arrow <- grid::arrow(type = "closed", length = unit(.15, "inches"))

# Using ggraph
library(ggraph)
ggraph(final_bigrams_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = arrow, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "red", size = 2) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
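
As a small illustration of how graph_from_data_frame reads its input, the sketch below builds a graph from three of the bigram counts printed earlier. The object names toy_edges and toy_graph are made up for this example. The first two columns are treated as the "from" and "to" vertices of each directed edge, and any further column (here n) is stored as an edge attribute, which is what edge_alpha = n draws on in the plot.

```r
library(igraph)

# Three rows copied from the bigram counts shown above
toy_edges <- data.frame(
  word1 = c("moral", "moral", "modern"),
  word2 = c("effect", "forces", "battle"),
  n     = c(50, 21, 22)
)

toy_graph <- graph_from_data_frame(toy_edges)

vcount(toy_graph)      # number of distinct words (vertices)
ecount(toy_graph)      # number of bigrams (edges)
E(toy_graph)$n         # bigram counts carried along as an edge attribute
is_directed(toy_graph) # edges point from word1 to word2
```

Because "moral" is the first word of two bigrams, it becomes a single vertex with two outgoing edges, which is how hubs emerge in the full network.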

References

Grimmer, J., & Stewart, B. M. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3), 267–297. https://doi.org/10.1093/pan/mps028

Kornberger, M., & Vaara, E. (2022). Strategy as engagement: What organization strategy can learn from military strategy. Long Range Planning, 55(4), 102125. https://doi.org/10.1016/j.lrp.2021.102125

Neuendorf, K. A. (2017). The Content Analysis Guidebook. SAGE Publications, Inc. https://doi.org/10.4135/9781071802878

Sebastian, A. (2020, July 16). A Gentle Introduction To Calculating The TF-IDF Values. Medium. https://towardsdatascience.com/a-gentle-introduction-to-calculating-the-tf-idf-values-9e391f8a13e5

Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach (1st edition). O’Reilly Media. https://www.tidytextmining.com/index.html