Visualizations Using the Ussher Package

Below are a number of dataframes and analytical visualizations rendered using data from the “ussher” package in R.

More information on the ussher package can be viewed in R using “?ussher”.

To start, the ussher data set is called and its indexed paragraphs are tokenized by word for tidy NLP analysis. The resulting tables are various abbreviated views of the tokenized data.

ussher
## # A tibble: 6,998 × 8
## # Rowwise: 
##    Index EventTxt                     YearB…¹ Epoch BibBk1 AnnoM…² Season JulPer
##    <dbl> <chr>                          <dbl> <chr> <chr>    <dbl> <chr>   <dbl>
##  1     1 In the beginning God create…   -4004 1st … <NA>         1 Autumn    710
##  2     2 On the first day   of the w…   -4004 1st … Ge           1 Autumn    710
##  3     3 On the second day    Monday…   -4004 1st … Ge           1 Autumn    710
##  4     4 On the third day     Tuesda…   -4004 1st … Ge           1 Autumn    710
##  5     5 On the fourth day  Wednesda…   -4004 1st … <NA>         1 Autumn    710
##  6     6 On the fifth day  Thursday …   -4004 1st … <NA>         1 Autumn    710
##  7     7 On the sixth day  Friday  O…   -4004 1st … <NA>         1 Autumn    710
##  8     8 Now on the seventh day   Sa…   -4004 1st … <NA>         1 Autumn    710
##  9     9 After the first week of the…   -4004 1st … <NA>         1 Autumn    710
## 10    10 The Devil envied God s hono…   -4004 1st … <NA>         1 Autumn    710
## # … with 6,988 more rows, and abbreviated variable names ¹​YearBCAD, ²​AnnoMund
## # ℹ Use `print(n = ...)` to see more rows
ussh.ind <- ussher

tidy_annals <- ussh.ind %>%
  unnest_tokens(word, EventTxt)
# keep only non-missing, non-whitespace tokens
head(tidy_annals[!is.na(tidy_annals$word) & trimws(tidy_annals$word) != "", ])
## # A tibble: 6 × 8
## # Rowwise: 
##   Index YearBCAD Epoch   BibBk1 AnnoMund Season JulPer word     
##   <dbl>    <dbl> <chr>   <chr>     <dbl> <chr>   <dbl> <chr>    
## 1     1    -4004 1st Age <NA>          1 Autumn    710 in       
## 2     1    -4004 1st Age <NA>          1 Autumn    710 the      
## 3     1    -4004 1st Age <NA>          1 Autumn    710 beginning
## 4     1    -4004 1st Age <NA>          1 Autumn    710 god      
## 5     1    -4004 1st Age <NA>          1 Autumn    710 created  
## 6     1    -4004 1st Age <NA>          1 Autumn    710 the

Initial Bigrams

Bigrams (two-word combinations) can also be tokenized. This first attempt at bigrams retains stop words (such as “in” or “of”) to show the bigrams in simple succession (“in the”, “the beginning”, etc.).

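The bigrams themselves are produced with the same unnest_tokens() call used for trigrams later in this report, only with n = 2 (the chunk is reconstructed here for completeness):

ussher_bigrams <- ussh.ind %>%
  unnest_tokens(bigram, EventTxt, token = "ngrams", n = 2)
head(ussher_bigrams)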
## # A tibble: 6 × 8
## # Rowwise: 
##   Index YearBCAD Epoch   BibBk1 AnnoMund Season JulPer bigram       
##   <dbl>    <dbl> <chr>   <chr>     <dbl> <chr>   <dbl> <chr>        
## 1     1    -4004 1st Age <NA>          1 Autumn    710 in the       
## 2     1    -4004 1st Age <NA>          1 Autumn    710 the beginning
## 3     1    -4004 1st Age <NA>          1 Autumn    710 beginning god
## 4     1    -4004 1st Age <NA>          1 Autumn    710 god created  
## 5     1    -4004 1st Age <NA>          1 Autumn    710 created the  
## 6     1    -4004 1st Age <NA>          1 Autumn    710 the heaven

Bigrams are then separated, filtered, and united to develop variables that can be used in various visualizations and correlation analyses. One of the simplest tables to develop is a count of unique bigrams in the entire text, which becomes very useful in deeper analysis and visualization.

For example, many of the bigrams that occur more than 200 times happen to be scholarly or source references. Thus, high-frequency bigrams may be useful in isolating James Ussher’s primary sources, contributing authors, and indexing- or appendix-related citations, or even in a study of Enlightenment-era scholarship and research conventions. Moreover, this efficient “superindex” can not only cross-reference the location in the original corpus, but can also be compared against dates, Epochs, and other features unique to the chronology (a sketch follows the united-bigram table below).

bigrams_separated <- ussher_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

head(bigram_counts)
## # A tibble: 6 × 3
##   word1    word2          n
##   <chr>    <chr>      <int>
## 1 diod     sic          525
## 2 foot     soldiers     408
## 3 josephus antiq        375
## 4 tacitus  annals       169
## 5 velleius paterculus   166
## 6 polyb    legat        163

It is now necessary to join the various slices of bigram information back together for better analysis of the data within the text. The resulting tables are presented here for continuity.

bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")

count_united <- bigrams_united %>% 
  add_count(bigram)
head(count_united)
## # A tibble: 6 × 9
##   Index YearBCAD Epoch   BibBk1 AnnoMund Season JulPer bigram                  n
##   <dbl>    <dbl> <chr>   <chr>     <dbl> <chr>   <dbl> <chr>               <int>
## 1     1    -4004 1st Age <NA>          1 Autumn    710 beginning god           1
## 2     1    -4004 1st Age <NA>          1 Autumn    710 god created             3
## 3     1    -4004 1st Age <NA>          1 Autumn    710 earth ge                1
## 4     1    -4004 1st Age <NA>          1 Autumn    710 chronology happened     1
## 5     1    -4004 1st Age <NA>          1 Autumn    710 evening preceding       1
## 6     1    -4004 1st Age <NA>          1 Autumn    710 julian calendar        20
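
As a first sketch of that “superindex,” the united counts can be filtered to surface candidate scholarly references alongside their chronological metadata (the n > 200 threshold is illustrative and easily adjusted):

count_united %>%
  filter(n > 200) %>%
  distinct(bigram, n, Epoch, YearBCAD, Season) %>%
  arrange(desc(n))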

A Brief Diversion into Multiple ngrams

The sample visualizations below do not rely on more complex ngrams, but here trigrams are generated as an example of possible further exploration beyond the scope of this report:

ussher_trigrams <- ussh.ind %>%
  unnest_tokens(trigram, EventTxt, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>%
  count(word1, word2, word3, sort = TRUE)
head(ussher_trigrams)
## # A tibble: 6 × 4
##   word1   word2 word3          n
##   <chr>   <chr> <chr>      <int>
## 1 ad      attic epist         93
## 2 cicero  ad    attic         79
## 3 appian  civil war           59
## 4 caesar  civil war           45
## 5 hirtius de    bell          39
## 6 de      bell  alexandrin    32

As can be seen in some of the terms above, interesting references appear. “ad attic epist” is a scholarly reference to Cicero’s letters to Atticus (the Epistulae ad Atticum). That this trigram appears in the corpus 93 times indicates the source’s prominence and would make for an interesting review of appearances by date.
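
A quick pass at such a review might look like the following sketch, which re-tokenizes the trigrams (rather than reusing the counted table) so that dates and Epochs are retained:

ussh.ind %>%
  unnest_tokens(trigram, EventTxt, token = "ngrams", n = 3) %>%
  filter(trigram == "ad attic epist") %>%
  count(Epoch, YearBCAD, sort = TRUE)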

Developing Bigram and Term Visualizations for Distant Reading of Corpus

Once a bigram dataframe is established, it can be filtered. In the example below, the filtered table counts bigrams whose second word is “son,” broken out by first word. 96 different bigram combinations are identified by the Epoch in which they appear.

bigrams_filtered %>%
  filter(word2 == "son") %>%
  count(Epoch, word1, sort = FALSE)
## # A tibble: 96 × 3
##    Epoch   word1         n
##    <chr>   <chr>     <int>
##  1 1st Age st            1
##  2 3rd Age begotten      1
##  3 3rd Age promised      1
##  4 4th Age jair          1
##  5 4th Age manasseh      1
##  6 4th Age naphtali      1
##  7 4th Age semiramis     1
##  8 5th Age baruch        1
##  9 5th Age cambyses      1
## 10 5th Age firstborn     1
## # … with 86 more rows
## # ℹ Use `print(n = ...)` to see more rows

In NLP, term frequency (tf) and inverse document frequency (idf) can be combined into a unified measure known as tf-idf. This enables a statistical method for estimating the impact of certain bigrams on the corpus containing them. Different segments of the chronology can then be compared (in this case, the segments are based on the variable “Epoch,” one of Ussher’s designated “Seven Ages of the World”). In this way, an analyst can take a quick look at the most impactful bigrams of each Age and begin to approximate an initial “distant reading” of those segments.

Very basically, tf measures how frequently a bigram occurs within a given Epoch, while idf puts more weight on bigrams that appear in only a few Epochs and are therefore distinctive. tf-idf is the product of these counterbalancing elements, and it helps determine thematic elements unique to each distinct Epoch.
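
As a sanity check on what bind_tf_idf() computes below, the same statistic can be assembled by hand (a sketch; manual_tf_idf and n_epochs are illustrative names):

# tf = a bigram's share of all bigrams in its Epoch
# idf = ln(total Epochs / Epochs containing the bigram)
n_epochs <- n_distinct(bigrams_united$Epoch)

manual_tf_idf <- bigrams_united %>%
  count(Epoch, bigram) %>%
  group_by(Epoch) %>%
  mutate(tf = n / sum(n)) %>%
  ungroup() %>%
  group_by(bigram) %>%
  mutate(idf = log(n_epochs / n()),  # n() = Epochs in which the bigram occurs
         tf_idf = tf * idf) %>%
  ungroup() %>%
  arrange(desc(tf_idf))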

bigram_tf_idf <- bigrams_united %>%
  count(Epoch, bigram) %>%
  bind_tf_idf(bigram, Epoch, n) %>%
  arrange(desc(tf_idf))
head(bigram_tf_idf)
## # A tibble: 6 × 6
##   Epoch   bigram               n     tf   idf tf_idf
##   <chr>   <chr>            <int>  <dbl> <dbl>  <dbl>
## 1 1st Age adam died            7 0.0407  1.95 0.0792
## 2 2nd Age noah died            5 0.0333  1.25 0.0418
## 3 1st Age friday september     3 0.0174  1.95 0.0339
## 4 1st Age god created          3 0.0174  1.95 0.0339
## 5 1st Age living creatures     4 0.0233  1.25 0.0291
## 6 7th Age tacitus annals     158 0.0210  1.25 0.0263

Visualizing tf-idf in Ussher

Each Epoch in Ussher’s chronology is distinguished by different tf-idf factors. A combination of these factors can function as a sort of fingerprint of that particular section of the chronology.

The 5 strongest tf-idf bigrams by Epoch are visualized here.

library(forcats)

bigram_tf_idf %>%
  group_by(Epoch) %>%
  slice_max(tf_idf, n = 5) %>%
  ungroup() %>%
  ggplot(aes(tf_idf, fct_reorder(bigram, tf_idf), fill = Epoch)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Epoch, ncol = 2, scales = "free") +
  labs(x = "tf-idf", y = NULL)

As a preview of the sentiment analysis in the next section, bigrams beginning with the negation “not” can be counted directly:

bigrams_separated %>%
  filter(word1 == "not") %>%
  count(word1, word2, sort = TRUE)
## # A tibble: 792 × 3
##    word1 word2     n
##    <chr> <chr> <int>
##  1 not   to      169
##  2 not   be      109
##  3 not   know     72
##  4 not   only     60
##  5 not   so       49
##  6 not   go       47
##  7 not   far      46
##  8 not   have     46
##  9 not   yet      44
## 10 not   allow    43
## # … with 782 more rows
## # ℹ Use `print(n = ...)` to see more rows

Sentiment in History

In social media and many other text analysis settings, “sentiment analysis” is often thought of as the primary objective of NLP models. Sentiment is typically measured statistically by assigning negative or positive numeric weights to lists of words. For example, if Amazon wants to detect “troll” 1-star reviews, it could run sentiment analysis on all 1-star reviews and flag the ones that are disproportionately neutral or even positive, indicating that a substantively negative review was never actually written, despite the strong negative response implied by the single star.

Other works, including medical, technical, and historical works, tend to rely less on sentiment analysis for the purposes of determining contextual meaning or detecting truth in their respective texts. There are other uses for sentiment analysis, however. One overlooked use is examining governance and process in medical, legal, technical, and historical documents by analyzing negation bigrams: bigrams whose first term is a negation word (in this example, the word “not”).

In the sentiment visualization below, a cursory examination reveals a few points of note: The strongest negative sentiment “not” pair is “not allow”, while the strongest positive sentiment “not” pair is “not kill.” This very limited examination rings true for a chronology that heavily covers the records and affairs of lawmakers, kings and social conflict.

library(textdata)
AFINN <- get_sentiments("afinn")
not_words <- bigrams_separated %>%
  filter(word1 == "not") %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word2, value, sort = TRUE)

head(not_words)
## # A tibble: 6 × 3
##   word2 value     n
##   <chr> <dbl> <int>
## 1 allow     1    43
## 2 want      1    39
## 3 like      2    20
## 4 agree     1    18
## 5 fight    -1    16
## 6 stop     -1    16
library(ggplot2)

not_words %>%
  mutate(contribution = n * value) %>%
  arrange(desc(abs(contribution))) %>%
  head(20) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  # flip the sign: "not <word>" carries the opposite of the word's own sentiment
  ggplot(aes(-contribution, word2, fill = contribution < 0)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Sentiment value * number of occurrences",
       y = "Words preceded by \"not\"")

Term Correlation Exercises

Here, frequently occurring single words contained in the 1st Age portion of the chronology are correlated. When performing correlations on words, it is important to slice the data so as to get a significant number of occurrences without creating too large a correlation matrix. In this example, words had to occur at least ten times within the 1st Age before being positively or negatively correlated (pairwise_cor(), from the widyr package, computes the phi coefficient between words based on co-occurrence within the same Index).

ussher_index_words <- ussh.ind %>%
  filter(Epoch == "1st Age") %>%
  filter(Index > 0) %>%
  unnest_tokens(word, EventTxt) %>%
  filter(!word %in% stop_words$word)
word_pairs <- ussher_index_words %>%
  pairwise_count(word, Index, sort = TRUE)
## Warning: `distinct_()` was deprecated in dplyr 0.7.0.
## Please use `distinct()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
head(word_pairs)
## # A tibble: 6 × 3
##   item1 item2     n
##   <chr> <chr> <dbl>
## 1 adam  ge       14
## 2 ge    adam     14
## 3 ge    god      11
## 4 god   ge       11
## 5 day   ge       11
## 6 born  ge       11
word_cors <- ussher_index_words %>%
  group_by(word) %>%
  filter(n() >= 10) %>%
  pairwise_cor(word, Index, sort = TRUE) %>% 
  ungroup()

head(word_cors)
## # A tibble: 6 × 3
##   item1 item2 correlation
##   <chr> <chr>       <dbl>
## 1 world god         0.625
## 2 god   world       0.625
## 3 day   earth       0.480
## 4 earth day         0.480
## 5 earth god         0.414
## 6 god   earth       0.414
word_cors %>%
  filter(item1 == "god")
## # A tibble: 6 × 3
##   item1 item2 correlation
##   <chr> <chr>       <dbl>
## 1 god   world      0.625 
## 2 god   earth      0.414 
## 3 god   day        0.354 
## 4 god   adam      -0.0423
## 5 god   ge        -0.125 
## 6 god   born      -0.350
library(ggcorrplot)
  
# Computing correlation matrix
correlation_matrix <- xtabs(correlation ~ ., word_cors)
correlation_matrix
##        item2
## item1          adam        born         day       earth          ge         god
##   adam   0.00000000 -0.14414999 -0.33793249 -0.14056338  0.19102329 -0.04225771
##   born  -0.14414999  0.00000000 -0.52277330 -0.17729434  0.26726124 -0.35025832
##   day   -0.33793249 -0.52277330  0.00000000  0.47997460 -0.36675724  0.35387166
##   earth -0.14056338 -0.17729434  0.47997460  0.00000000 -0.31137996  0.41409498
##   ge     0.19102329  0.26726124 -0.36675724 -0.31137996  0.00000000 -0.12535663
##   god   -0.04225771 -0.35025832  0.35387166  0.41409498 -0.12535663  0.00000000
##   world -0.09507985 -0.29315098  0.28908807  0.09334108 -0.17094086  0.62500000
##        item2
## item1         world
##   adam  -0.09507985
##   born  -0.29315098
##   day    0.28908807
##   earth  0.09334108
##   ge    -0.17094086
##   god    0.62500000
##   world  0.00000000
# Visualizing the correlation matrix using 
# square and circle methods
ggcorrplot(correlation_matrix, method ="square")

ggcorrplot(correlation_matrix, method ="circle")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

Thematic Development

Multi-dimensional referencing of terms in ussher relies on correlations, dates and date categories (“Epochs”), and ngram relationships. This referencing opens up a host of opportunities for data-oriented “distant reading” and analysis of a text.

Here are a few methods:

annals.count <- tidy_annals %>%
  anti_join(stop_words) %>%
  count(Epoch, word, sort = TRUE)
## Joining, by = "word"
epochs_dtm <- annals.count %>%
  cast_dtm(Epoch, word, n)
head(annals.count)
## # A tibble: 6 × 3
## # Rowwise: 
##   Epoch   word          n
##   <chr>   <chr>     <int>
## 1 6th Age king       2135
## 2 6th Age alexander  1492
## 3 6th Age army       1486
## 4 6th Age city       1240
## 5 6th Age soldiers   1173
## 6 6th Age war        1132
epochs_lda <- LDA(epochs_dtm, k = 10, control = list(seed = 1234))  # LDA() is from the topicmodels package
epochs_lda
## A LDA_VEM topic model with 10 topics.
epochs_topics <- tidy(epochs_lda, matrix = "beta")
head(epochs_topics)
## # A tibble: 6 × 3
##   topic term      beta
##   <int> <chr>    <dbl>
## 1     1 king  0.0193  
## 2     2 king  0.0180  
## 3     3 king  0.00857 
## 4     4 king  0.00183 
## 5     5 king  0.000931
## 6     6 king  0.0170

Topics can be explored by grouping top terms by epoch and visualizing them. Here terms are clustered into 10 different topics (a number somewhat arbitrarily selected for variety; preliminary cluster analysis could be performed to more specifically narrow or expand the number of clusters selected).

The position of a term within the top five makes a difference in subjectively evaluating the respective topic profiles.

top_terms <- epochs_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 5) %>% 
  ungroup() %>%
  arrange(topic, -beta)
head(top_terms)
## # A tibble: 6 × 3
##   topic term       beta
##   <int> <chr>     <dbl>
## 1     1 king     0.0193
## 2     1 josephus 0.0169
## 3     1 time     0.0140
## 4     1 city     0.0128
## 5     1 son      0.0112
## 6     2 king     0.0180
top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

Latent Dirichlet Allocation (LDA) is a useful dimensionality-reduction algorithm for modeling topics. It is an unsupervised learning mechanism that represents each document as a mixture of topics and each topic as a mixture of terms.

This ultimately means that the gamma matrix (the per-document topic proportions) can be used to illustrate how the different topics are distributed within different documents, or, in this case, between different Epochs.

A number of observations can be made from visualizing the gamma matrix in this case:

  1. Low-level cross-topic sharing is to be expected in a single unified corpus. If this method were used to compare different corpora, such as Ussher’s chronology and Sun Tzu’s Art of War, one would expect significantly lower cross-topic sharing. Moreover, the number of topic divisions selected at the start will make a difference, which is why, in full practice, an elbow, silhouette, or other cluster-validation method is critical (see the sketch after this list).

  2. The first 3 Epochs share a single, identical topic: Topic 9 (genesis-god-egypt-day-jacob). The 4th and 5th Ages share a topic (king-god-son-time-people), while the 6th and 7th Ages each have more distinct topic profiles, but also balance multiple topics. Unique topic elements can provide leads for further distinguishing these profiles.

  3. It makes sense, subjectively, that a topic that might emerge out of such a chronology would include the figure of “jesus”, but that such a topic would not appear until the 7th Age of the Earth. Likewise, a topic including Plutarch would only be expected to be mentioned during the 6th Age.

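A minimal sketch of such a diagnostic uses perplexity() from the topicmodels package to trace an elbow over candidate topic counts (the k values are illustrative, and refitting the model repeatedly is slow):

# lower perplexity = better fit; look for the "elbow" where gains level off
ks <- c(2, 4, 6, 8, 10, 12)
perp <- sapply(ks, function(k) {
  perplexity(LDA(epochs_dtm, k = k, control = list(seed = 1234)), epochs_dtm)
})
plot(ks, perp, type = "b", xlab = "Number of topics (k)", ylab = "Perplexity")
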
epoch_gamma <- tidy(epochs_lda, matrix = "gamma")
head(epoch_gamma)
## # A tibble: 6 × 3
##   document topic      gamma
##   <chr>    <int>      <dbl>
## 1 6th Age      1 0.000162  
## 2 7th Age      1 0.353     
## 3 5th Age      1 0.00000302
## 4 4th Age      1 0.00000453
## 5 3rd Age      1 0.0000107 
## 6 1st Age      1 0.0000521
epoch_gamma %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_boxplot() +
  facet_wrap(~ document) +
  labs(x = "topic", y = expression(gamma)) +
  ggtitle("Association of Topics by Epoch")

epoch_classifications <- epoch_gamma %>%
  group_by(document) %>%
  slice_max(gamma) %>%
  ungroup()
epoch_classifications
## # A tibble: 7 × 3
##   document topic gamma
##   <chr>    <int> <dbl>
## 1 1st Age      9 1.00 
## 2 2nd Age      9 0.999
## 3 3rd Age      9 1.00 
## 4 4th Age      6 1.00 
## 5 5th Age      6 1.00 
## 6 6th Age     10 0.277
## 7 7th Age      5 0.647
epoch_topics <- epoch_classifications %>%
  count(document, topic) %>%
  group_by(document) %>%
  slice_max(n, n = 1) %>% 
  ungroup() %>%
  transmute(consensus = document, topic)

epoch_classifications %>%
  inner_join(epoch_topics, by = "topic") %>%
  filter(document != consensus)
## # A tibble: 8 × 4
##   document topic gamma consensus
##   <chr>    <int> <dbl> <chr>    
## 1 1st Age      9 1.00  2nd Age  
## 2 1st Age      9 1.00  3rd Age  
## 3 2nd Age      9 0.999 1st Age  
## 4 2nd Age      9 0.999 3rd Age  
## 5 3rd Age      9 1.00  1st Age  
## 6 3rd Age      9 1.00  2nd Age  
## 7 4th Age      6 1.00  5th Age  
## 8 5th Age      6 1.00  4th Age
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:viridis':
## 
##     viridis_pal
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
assignments <- augment(epochs_lda, data = epochs_dtm)
assignments <- assignments %>%
  inner_join(epoch_topics, by = c(".topic" = "topic"))
head(assignments)
## # A tibble: 6 × 5
##   document term  count .topic consensus
##   <chr>    <chr> <dbl>  <dbl> <chr>    
## 1 5th Age  king    258      6 4th Age  
## 2 5th Age  king    258      6 5th Age  
## 3 4th Age  king     68      6 4th Age  
## 4 4th Age  king     68      6 5th Age  
## 5 3rd Age  king     21      9 1st Age  
## 6 3rd Age  king     21      9 2nd Age
assignments %>%
  count(document, consensus, wt = count) %>%
  mutate(across(c(document, consensus), ~str_wrap(., 20))) %>%
  group_by(document) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(consensus, document, fill = percent)) +
  geom_tile() +
  scale_fill_gradient2(high = "darkred", label = percent_format()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.grid = element_blank()) +
  labs(x = "Epoch words were assigned to",
       y = "Epoch words came from",
       fill = "% of assignments")

Big Picture Topic Analysis

Using a different kind of matrix visualization, the above observations can be seen more efficiently. Here it can be seen that the first three Epochs share the same topic, as do Epochs 4 and 5, while the 6th and 7th each have their own unique fingerprints, sharing nothing in common with the other Ages.

Interactives and Shiny App Development

Two novel interactive visualizations have been developed to indicate possible new frontiers in NLP-based distant reading analytics using ngrams.

Interactive One: Bigram Distribution by Season

Both examine the counts of frequently occurring bigrams and the years, seasons, and Ages in which those bigrams occur. For the purposes of these visualizations, bigrams with no associated season are excluded, but they could easily be included as a fifth season category if necessary.

The first examines the seasonal distribution of various bigrams. One quickly notices that, in the First Age, Summer is by far the season most frequently associated with bigrams. Hovering over the specific “x” marks for unique bigrams shows that births and deaths in the First Age are associated with Summer, while “yearly fast” shows up in Autumn. No seasonal bigrams appear in Winter or Spring in the 1st Age.

NOTE: Lower and upper limits were placed on bigram counts. By adjusting those counts, more or fewer bigrams would appear on the chart.

Interactive Two: Bigram Distribution By Count

The second visualization lends itself to tracking specific bigram use over time: bigrams are stratified by count, and aside from a few very common counts (such as 2), each bigram has a fairly unique count profile.

What this means is that the analyst can zoom in on specific date ranges or positions in the visualization and review the qualities of a given bigram. For example, a very high-frequency bigram near the top of the visualization is “foot soldiers.” By hovering across its instances, it can be seen that “foot soldiers” is typically associated with the winter months and only begins to appear as a seasonal bigram in the 6th Age, in the Autumn of 538 BC.

Winter Soldier: Further Exploration

This observation might be of “distant reading” use to researchers. Historical hypotheses are beyond the scope and expertise of this exercise, but analytical questions that may inform professional inquiry abound (a starting point is sketched after this list):

  1. Is “foot soldiers” a term specific to 6th and 7th Age sources?
  2. Can the winter seasonality of the term be correlated to geographic locations or other signifiers that might indicate the propensity of soldiers to be referred to so frequently in that season?
  3. Would the NA Season (aka “no season associated”) variable provide noise or add information to the importance of winter “foot soldiers?”
  4. Is there another bigram that refers to soldiers prior to the 6th Age, and does that term reflect a similar seasonal profile to “foot soldiers?”
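
One starting point for questions 2 and 4 is to profile every bigram ending in “soldiers” by Epoch and Season (a sketch; str_detect() is from the stringr package):

library(stringr)
count_united %>%
  filter(str_detect(bigram, "soldiers$")) %>%
  count(bigram, Epoch, Season, sort = TRUE)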

The Ussher data set can be further manipulated and modeled to pursue questions like these, and thousands more, to data-centered analytical conclusions, thus providing tools and techniques for applying NLP model theory to novel pathways of analysis.

Shiny Application

Finally, an app has been developed in Shiny that illustrates how the “distant reading” analytical process using NLP can be made more interactive.

https://datascinet.shinyapps.io/UssherXplore/

count_united <- count_united %>%
  filter(n > 2) %>%
  filter(n < 500) %>%
  filter(!is.na(Season))
# the extra aes() mappings (bigram, Epoch, AnnoMund) exist only to feed the
# plotly tooltips below
usshplot <- ggplot(count_united,
                   aes(YearBCAD, Season, color = Epoch, size = n,
                       bigram = bigram, Epoch = Epoch, AnnoMund = AnnoMund)) +
  geom_point(shape = 4, alpha = 1) +
  xlab("Year BC or AD") + ylab("Season of Bigram Appearance") +
  ggtitle("Bigram Distribution by Season") +
  labs(color = "Epoch")  # color is mapped to Epoch, so label the legend accordingly
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:igraph':
## 
##     groups
## The following object is masked from 'package:sentimentr':
## 
##     highlight
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
ggplotly(usshplot,tooltip= c("bigram","YearBCAD","Epoch","AnnoMund","n"))
linegraph <- ggplot(count_united,
                    aes(YearBCAD, n, group = bigram, color = Season,
                        bigram = bigram, Epoch = Epoch, AnnoMund = AnnoMund)) +
  geom_line() +
  geom_point(shape = 1, alpha = 1) +
  xlab("Year BC or AD") + ylab("Number of Bigram Appearances in Entire Chronology") +
  ggtitle("Bigram Distribution by Count") +
  labs(color = "Season")

ggplotly(linegraph,tooltip= c("bigram","YearBCAD","Epoch","AnnoMund","n","Season"))
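
For reference, a minimal sketch of how a Shiny app can wrap the interactive plot above (the UI here is hypothetical; the deployed UssherXplore app may differ):

library(shiny)
library(plotly)

# Hypothetical skeleton; assumes count_united, dplyr, and ggplot2 from above
ui <- fluidPage(
  titlePanel("Ussher Bigram Explorer"),
  sliderInput("nrange", "Bigram count range:",
              min = 2, max = 500, value = c(3, 499)),
  plotlyOutput("seasonPlot")
)

server <- function(input, output) {
  output$seasonPlot <- renderPlotly({
    dat <- count_united %>%
      filter(n >= input$nrange[1], n <= input$nrange[2], !is.na(Season))
    p <- ggplot(dat, aes(YearBCAD, Season, color = Epoch, size = n,
                         bigram = bigram)) +
      geom_point(shape = 4) +
      labs(x = "Year BC or AD", y = "Season of Bigram Appearance")
    ggplotly(p, tooltip = c("bigram", "YearBCAD", "Epoch", "n"))
  })
}

shinyApp(ui, server)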