Objective

In April I did a beta test of using Natural Language Processing (NLP) to analyze the published communications of Hooker and Meade with higher headquarters in Washington, D.C., during the Gettysburg Campaign. The technique quickly showed that Halleck used Harper’s Ferry as a pressure point to annoy General Hooker and force him to show his hand, ultimately leading to Hooker’s resignation.

That report is available on RPubs: https://rpubs.com/undsioux88/getty_nlp_test

This time around I will apply similar methods to General Lee’s messages during the same campaign.

Robert E. Lee

Data Loading and Preparation

I performed this analysis using the open-source statistical software R, version 3.5.2. The source of the data is the same as before: “The War of the Rebellion: a compilation of the official records of the Union and Confederate armies”. Here we load the data and see its dimensions.
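The loading step is only a couple of lines of R. Here is a minimal sketch, assuming the transcribed messages sit in a CSV file named lee.csv (the file name and location are my assumption):

# Load the transcribed messages; "lee.csv" is an assumed file name.
lee <- read.csv("lee.csv", stringsAsFactors = FALSE)

# Dimensions (messages x columns) and the column names.
dim(lee)
names(lee)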

## [1] 127   5

That output tells us there are 127 messages (rows) and 5 columns in the data, which are as follows:

## [1] "Date"      "Recipient" "Message"   "Sender"    "id"

The column names correspond to the date of the message, for whom the message was intended, the message itself, who sent the message, and a numeric message identifier.

Let’s look at a simple table of the number of messages attributed to each sender in the data.
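That tabulation is a one-liner; a quick sketch, assuming the data frame loaded above is named lee:

# Tally the messages attributed to each sender.
as.data.frame(table(Sender = lee$Sender))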

Sender Freq
Chilton, Asst. AG 2
Cooper 2
Ewell 3
Guild 2
JEB Stuart 4
Jeff Davis 3
Lee 84
Long 1
Longstreet 2
MG Jones 6
Pendleton 1
Pickett 1
Seddon 4
Sorrel 5
Taylor, Asst. AG 6
Venable, ADC 1

I only want those 84 messages sent by General Lee himself. Then, after removing “stopwords” (common filler words such as “the” and “of”), we can look at a chart of the most frequent words Lee used.
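Here is a minimal sketch of that filtering, stopword removal, and word count, assuming a tidytext workflow (the package choice and object names are my assumptions, not necessarily what the original analysis used):

library(dplyr)
library(tidytext)
library(ggplot2)

# Keep only the messages Lee sent himself.
lee_only <- lee %>% filter(Sender == "Lee")

# Tokenize the Message column into single words, drop stopwords, and count.
word_counts <- lee_only %>%
  unnest_tokens(word, Message) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

# Bar chart of the 20 most frequent words.
word_counts %>%
  slice(1:20) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Count")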

As with the Army of the Potomac analysis, we see many references to the enemy, cavalry, and so on. The count of “General” is a bit inflated because the opening of a message would normally reference the recipient’s rank. The focus on the Valley and Winchester is interesting; Winchester was the critical point within the Shenandoah Valley.

If you are curious as to a table of recipients of the 84 messages then here you go…

Recipient Number of Lee’s messages received
A.P. Hill 2
C.O. at Winchester 1
Chairman, Ambulance Cmte. 1
COL Gorgas 1
COL Wharton 1
Davis 3
Elzey 1
Ewell 7
General Cooper 4
General Order 1
Governor Vance 1
Hunter 1
Imboden 11
JEB Stuart 15
Jeff Davis 6
Jenkins 2
Longstreet 4
Major Collins 1
MG Jones 6
Pickett 4
Seddon 5
Special Order 3
Trimble 2
Wharton 1

Notice the most frequent recipients of his messages were Stuart and Imboden, cavalry commanders. One of the things I find interesting is that he sent four messages to Pickett, which was unique for an infantry Division Commander in the Army of Northern Virginia during this campaign.

Keywords in Context

An important task in understanding textual data is to look at words of interest, keywords if you will, in their context. Let’s take Gettysburg, for example, and see what we can derive about its context in the official record. The output below shows the 15 words before and the 15 words after “Gettysburg”.
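This keyword-in-context view comes from a concordance function. Here is a sketch assuming the quanteda package, whose kwic() output has the same columns shown below (docname, pre, keyword, post, pattern); the original analysis may have used a different tool:

library(quanteda)

# Build a corpus from Lee's messages and pull 15 words of context
# around every occurrence of "Gettysburg".
lee_corpus <- corpus(lee_only, text_field = "Message")
kwic(tokens(lee_corpus), pattern = "Gettysburg", window = 15)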

docname from to pre keyword post pattern
lee.csv.70 77 77 ave no good reason against it I desire you to move in the direction of Gettysburg via Heidlersburg where you will have turnpike most of the way and you can thus Gettysburg
lee.csv.70 129 129 side of the mountains When you come to Heidlersburg you can either move directly on Gettysburg or turn down to Cashtown Your trains and heavy artillery you can send if you Gettysburg
lee.csv.72 158 158 Carlisle road You must turn off everybody belonging to the army on the road to Gettysburg The reserve trains of the army are parked between Greenwood and Cashtown on said road Gettysburg
lee.csv.82 33 33 During the night Lieutenant Thomas L Norwood Thirtyseventh North Carolina Regiment who was wounded at Gettysburg and made his escape arrived He reports he passed at Waynesborough what he supposed a Gettysburg
lee.csv.117 140 140 any course not in accordance with their inclinations The day after the last battle at Gettysburg on sending back the train with the wounded it was reported that about 5000 well Gettysburg

Gettysburg occurs in four of the messages, and twice in the first of them. That first message was the one to Ewell directing him to move his corps by way of Heidlersburg and then, at his discretion, move directly on Gettysburg or turn down to Cashtown. A fascinating story that warrants further investigation is this LT Thomas Norwood of the 37th North Carolina who made his escape, returned to the army, and provided intelligence.

Bigrams

A bigram is a sequence of two consecutive words. You can study these word pairs on their own and in conjunction with single words. The results are available to explore in an interactive table.
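A sketch of the bigram tokenization, again assuming tidytext (the object names are placeholders):

library(dplyr)
library(tidytext)

# Break each message into consecutive two-word sequences (bigrams) and count them.
bigram_counts <- lee_only %>%
  unnest_tokens(bigram, Message, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

Something like DT::datatable(bigram_counts) would then give the kind of interactive table linked in the published report.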

Another interesting way to examine the results is to find the bigrams that occur at least 5 times and plot them in a relationship graph, often referred to as network analysis.
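The graph can be drawn by splitting each frequent bigram into its two words and treating the pairs as the edges of a network; a sketch assuming the igraph and ggraph packages:

library(dplyr)
library(tidyr)
library(igraph)
library(ggraph)

# Keep bigrams occurring at least 5 times, split them into word pairs,
# and turn the pairs into a graph object.
bigram_graph <- bigram_counts %>%
  filter(n >= 5) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  graph_from_data_frame()

# Plot the network; edge transparency reflects how often the pair occurs.
set.seed(1863)  # for a reproducible layout
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n)) +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()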

Your Excellency? Yes, General Lee referred to Jeff Davis and Governor Vance as such. Now, who is this Colonel Wharton character? It turns out he was commanding some troops in the Valley as a subordinate to MG Samuel Jones. I think this shows how an analysis like this can uncover such things. Lee only sent him one message, and the fact that his name is linked to “General” in the graph means that he is referred to within messages rather than messaged directly.

One of my favorite methods in NLP is to extract “topics” from text. In the next section we will extract and analyze the main topics of Lee’s messages. There are a number of ways to do this, such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA). I will keep it simple, for me anyway, by doing LDA with a set number of topics. There are several ways to determine the optimal number of topics in a set of text documents, but let’s not go down that rabbit hole. Also note that I’m a trained professional, so please don’t try this at home.

Latent Dirichlet Allocation

Identifying Topics

Previously, I did what is known as “tokenizing,” breaking the text into both individual words and bigrams. I’m going to combine those tokens to extract topics. As for the number of topics, let’s try 12 and see what happens.
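A sketch of the topic-model step, assuming the topicmodels package and a document-term matrix built from the combined word and bigram counts. The object token_counts (with columns id, token, and n) is a stand-in for whatever the combined tokenization produced:

library(dplyr)
library(tidytext)
library(topicmodels)

# Cast the per-message token counts into a document-term matrix.
# token_counts is an assumed object: one row per message id / token / count.
lee_dtm <- token_counts %>%
  cast_dtm(document = id, term = token, value = n)

# Fit LDA with 12 topics; the seed makes the fit reproducible.
lee_lda <- LDA(lee_dtm, k = 12, control = list(seed = 1863))

# Assign each message to its most probable topic and tabulate.
table(topics(lee_lda))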

Topic Number of Messages
1 12
2 7
3 3
4 2
5 4
6 12
7 18
8 8
9 3
10 5
11 5
12 5

The table above gives the number of messages assigned to each of the 12 pre-specified topics. There are a number of things we could do next, such as extracting the 5 messages that make up topic number 10, but let’s find the top 15 (highest probability) words/bigrams associated with each of the topics.
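Those top terms can be pulled directly from the fitted model; a quick sketch, continuing with the assumed lee_lda object from above:

# terms() returns the highest-probability terms for each topic;
# here we ask for the top 15 per topic.
terms(lee_lda, 15)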

Topic 1: move, route, cross, information, General, Gap, movements, send, corps, reach, Longstreet, mountains, division, General_Your, enemys
Topic 2: brigade, General, regiments, service, Virginia, brigades, North_Carolina, cavalry, North, Carolina, officers, Col, Cavalry, South, separate
Topic 3: cavalry, wounded, This, reported, sufficient, President, column, Ewells_corps, captured, Potomac, commanding, desire, From, letter, Pickett
Topic 4: operations, artillery, duty, forward, officers, commanding, sufficient, receive, Hill, collecting, advance, learn, Shenandoah, prepared, heard
Topic 5: time, position, hope, command, forwarded, In, officers, Richmond, service, practicable, west, ranks, deem, A, letter
Topic 6: enemy, front, He, line, river, You, left, Rappahannock, As, rear, Fredericksburg, H, advanced, inform, withdrawn
Topic 7: General, Valley, Winchester, command, Colonel, troops, enemy, General_Ewell, Imboden, Ewell, Jones, march, Should, cavalry, morning
Topic 8: force, received, Potomac, letter, hope, Staunton, guard, send, desire, supplies, prevent, prisoners, Maryland, opportunity, horses
Topic 9: army, road, corps, You, safety, Chambersburg, trains, position, report, join, officer, train, It_will, instant, attention
Topic 10: Your, desire, leaving, Front_Royal, success, transportation, Excellency, subject, peace, people, instructions, bring, leave, means, opinion
Topic 11: army, troops, division, companies, regiment, effect, convalescents, view, engineer, time, divisions, Your_letter, Department, engineer_troops, copy
Topic 12: Carolina, North, CourtHouse, return, Culpeper, Culpeper_CourtHouse, Federal, General_Beauregard, result, benefit, Beauregard, Virginia, troops, condition, note

Each line above gives a topic number followed by its top 15 words/bigrams in descending order of probability. One can qualitatively or even quantitatively summarize a topic by giving it some sort of title. For instance, one could qualitatively label Topic 7 as “Lee’s Shenandoah Valley Coordination Messages”. We could consume ourselves with creating more topics or doing a hierarchical topic modeling strategy, but let’s not get bogged down in details. Let’s plot the topics and key words from what is known as the “beta” matrix. The higher a word or bigram’s beta score within a topic, the more strongly it distinguishes that topic from the others. Let me ask you this: are you curious what topic 12 is about, considering the mentions of General Beauregard? I’ll leave it to you to uncover that if you are not familiar.
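A sketch of that beta plot, assuming tidytext’s tidy() method for LDA models and the lee_lda fit from above:

library(dplyr)
library(ggplot2)
library(tidytext)

# Extract the per-topic word probabilities (the "beta" matrix)
# and plot the top terms for each topic.
lee_beta <- tidy(lee_lda, matrix = "beta")

lee_beta %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  ggplot(aes(x = reorder_within(term, beta, topic), y = beta)) +
  geom_col() +
  coord_flip() +
  scale_x_reordered() +
  facet_wrap(~ topic, scales = "free_y") +
  labs(x = NULL, y = "beta")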

We could do quite a bit more analysis, such as word correlations and collocations, but let’s wrap this up with Named Entity Recognition (NER). NER is a methodology that locates named-entity mentions in unstructured text and classifies them into pre-defined categories such as person names, organizations, and locations. A researcher can use NER to discover entities of interest. As you explore the count of entities below, keep in mind that any NER algorithm will make the occasional mistake, for example classifying a location as a person or vice versa. I’m creating a database to take modern NER extractions and convert them based on Civil War entities. Maybe someday I will make that publicly available.

Named Entities

Here I create a table of the named entities found in the messages from June 3 through June 7. Notice how Longstreet is classified as a location! Hence my endeavor to create a corrective database.
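A sketch of how that extraction might look, assuming the spacyr package as the NER back end (the original write-up does not state which NER tool was used, and the Date filtering below assumes an ISO-formatted date column):

library(spacyr)
spacy_initialize()  # requires a working spaCy installation under Python

# Filter to the early-June messages; the Date format here is an assumption.
june_msgs <- subset(lee_only, Date >= "1863-06-03" & Date <= "1863-06-07")

# Run the parser with entity recognition on, then pull out the entities.
parsed <- spacy_parse(june_msgs$Message, entity = TRUE)
entities <- entity_extract(parsed)

# Tabulate each entity against the type the algorithm assigned it.
table(entities$entity, entities$entity_type)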


Conclusion

I hope this quick analysis stimulates your thinking on the potential of using NLP for historical documents. I shall endeavor to put together something similar for the unit after-action reports. That might be interesting.