In April I ran a beta test using Natural Language Processing (NLP) to analyze the published communications between Hooker and Meade and higher headquarters in Washington, D.C., during the Gettysburg Campaign. The technique quickly showed that Halleck used Harper’s Ferry as a point to annoy General Hooker and force him to show his hand, ultimately leading to his resignation.
That report is available on RPubs at this link: https://rpubs.com/undsioux88/getty_nlp_test
This time around I will apply similar methods to General Lee’s messages during the same campaign.
Robert E. Lee:
I performed this analysis using the open-source statistical software R, version 3.5.2. The source of the data is the same as before: “The War of the Rebellion: a compilation of the official records of the Union and Confederate armies”. Here we load the data and see its dimensions.
## [1] 127 5
What that output means is there are 127 messages and 5 columns in the data, which are as follows:
## [1] "Date" "Recipient" "Message" "Sender" "id"
The column names correspond with the date of the message, for whom the message was intended, the message itself, who sent the message, and a numeric message identifier.
Let’s look at a simple matrix of the number of messages attributed to each sender in the data.
| Sender | Freq |
|---|---|
| Chilton, Asst. AG | 2 |
| Cooper | 2 |
| Ewell | 3 |
| Guild | 2 |
| JEB Stuart | 4 |
| Jeff Davis | 3 |
| Lee | 84 |
| Long | 1 |
| Longstreet | 2 |
| MG Jones | 6 |
| Pendleton | 1 |
| Pickett | 1 |
| Seddon | 4 |
| Sorrel | 5 |
| Taylor, Asst. AG | 6 |
| Venable, ADC | 1 |
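For readers who want to try a tally like the one above outside of R, here is a minimal Python sketch. The rows are hypothetical stand-ins, not the real data; only the column names `Sender` and `Recipient` come from this post, and the `tally` helper is my own illustration.

```python
from collections import Counter

# Hypothetical rows standing in for the real data; only the column
# names ("Sender", "Recipient") come from the post.
messages = [
    {"Sender": "Lee", "Recipient": "JEB Stuart"},
    {"Sender": "Lee", "Recipient": "Imboden"},
    {"Sender": "Ewell", "Recipient": "Lee"},
    {"Sender": "Lee", "Recipient": "Ewell"},
]


def tally(rows, column):
    """Count how many rows carry each distinct value in `column`."""
    return Counter(row[column] for row in rows)


sender_counts = tally(messages, "Sender")
for sender, freq in sorted(sender_counts.items()):
    print(f"{sender}: {freq}")
```

Calling `tally(messages, "Recipient")` would give the matching count by recipient.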
I only want the 84 messages sent by General Lee himself. Then, after getting rid of “stopwords”, we can look at a chart of the most frequent words Lee used.
As with the Army of the Potomac analysis, we see many references to the enemy, cavalry, and so on. The count of “General” is a bit high because the opening of a message would normally reference the recipient’s rank. The focus on the Valley and Winchester, the critical point within the Shenandoah Valley, is interesting.
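The filter-then-count step behind that chart can be sketched in a few lines of Python. This is not the R code I ran: the stopword list is a tiny illustrative one, the messages are invented examples rather than quotations from the Official Records, and `word_counts` is a hypothetical helper.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real lists run to hundreds of words.
STOPWORDS = {"the", "of", "to", "and", "a", "in", "is", "on", "i", "you", "will"}

# Hypothetical messages, not quotations from the Official Records.
messages = [
    {"Sender": "Lee", "Message": "The enemy cavalry is moving in the Valley."},
    {"Sender": "Ewell", "Message": "I will march on Winchester."},
    {"Sender": "Lee", "Message": "Move the cavalry to Winchester and watch the enemy."},
]


def word_counts(rows, sender):
    """Count non-stopword tokens in the messages written by `sender`."""
    counts = Counter()
    for row in rows:
        if row["Sender"] != sender:
            continue  # keep only the chosen sender's messages
        words = re.findall(r"[a-z']+", row["Message"].lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


lee_counts = word_counts(messages, "Lee")
print(lee_counts.most_common(3))
```

Lowercasing before counting is a design choice; it merges “Enemy” and “enemy” but also loses the distinction between “General” the rank and “general” the adjective.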
If you are curious as to a table of recipients of the 84 messages then here you go…
| Recipient | Number of Lee’s messages received |
|---|---|
| A.P. Hill | 2 |
| C.O. at Winchester | 1 |
| Chairman, Ambulance Cmte. | 1 |
| COL Gorgas | 1 |
| COL Wharton | 1 |
| Davis | 3 |
| Elzey | 1 |
| Ewell | 7 |
| General Cooper | 4 |
| General Order | 1 |
| Governor Vance | 1 |
| Hunter | 1 |
| Imboden | 11 |
| JEB Stuart | 15 |
| Jeff Davis | 6 |
| Jenkins | 2 |
| Longstreet | 4 |
| Major Collins | 1 |
| MG Jones | 6 |
| Pickett | 4 |
| Seddon | 5 |
| Special Order | 3 |
| Trimble | 2 |
| Wharton | 1 |
Notice the most frequent recipients of his messages were Stuart and Imboden, cavalry commanders. One of the things I find interesting is that he sent four messages to Pickett, which was unique for an infantry Division Commander in the Army of Northern Virginia during this campaign.
A potentially important task in understanding textual data is to look at words of interest, keywords if you will, in their context. Let’s take Gettysburg, for example, and see what we can derive about its context in the official record. The output below shows the 15 words before and the 15 words after “Gettysburg”.
| docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
| lee.csv.70 | 77 | 77 | ave no good reason against it I desire you to move in the direction of | Gettysburg | via Heidlersburg where you will have turnpike most of the way and you can thus | Gettysburg |
| lee.csv.70 | 129 | 129 | side of the mountains When you come to Heidlersburg you can either move directly on | Gettysburg | or turn down to Cashtown Your trains and heavy artillery you can send if you | Gettysburg |
| lee.csv.72 | 158 | 158 | Carlisle road You must turn off everybody belonging to the army on the road to | Gettysburg | The reserve trains of the army are parked between Greenwood and Cashtown on said road | Gettysburg |
| lee.csv.82 | 33 | 33 | During the night Lieutenant Thomas L Norwood Thirtyseventh North Carolina Regiment who was wounded at | Gettysburg | and made his escape arrived He reports he passed at Waynesborough what he supposed a | Gettysburg |
| lee.csv.117 | 140 | 140 | any course not in accordance with their inclinations The day after the last battle at | Gettysburg | on sending back the train with the wounded it was reported that about 5000 well | Gettysburg |
Gettysburg occurs in four of the messages, twice in the first of them. That first message was the one to Ewell directing him to move his Corps by way of Heidlersburg and then, at his discretion, either move directly on Gettysburg or turn down to Cashtown. A fascinating story that warrants further investigation is that of LT Thomas Norwood of the 37th North Carolina, who made his escape, returned to the army, and provided intelligence.
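The keyword-in-context idea is simple enough to sketch by hand. The Python below is a minimal illustration, not the routine I used in R; the `kwic` function and its `window` parameter are my own naming, and the sample phrase is lifted from the table above.

```python
import re


def kwic(text, keyword, window=15):
    """Return (pre, keyword, post) tuples for each occurrence of `keyword`.

    `window` is the number of words kept on each side of the keyword.
    """
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            pre = " ".join(tokens[max(0, i - window):i])
            post = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((pre, token, post))
    return hits


# A phrase from the table above, shortened for illustration.
sample = "You can either move directly on Gettysburg or turn down to Cashtown"
for pre, key, post in kwic(sample, "Gettysburg", window=3):
    print(f"{pre} | {key} | {post}")
```

A smaller window makes the table easier to scan; a larger one, like the 15 words used here, keeps more of the sentence around each hit.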
With bigrams, the algorithm pairs up each word with the word that follows it. You can study these pairs by themselves and in conjunction with single words. The results are available to explore in an interactive table.
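To make the idea concrete, here is a small Python sketch of bigram tokenizing. It is an illustration under my own assumptions: the sample sentence is invented, `bigrams` is a hypothetical helper, and joining the two words with an underscore is simply a convenient way to treat each pair as one token.

```python
import re
from collections import Counter


def bigrams(text):
    """Pair each word with the word that follows it, joined by an underscore."""
    tokens = re.findall(r"\w+", text)
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


# Hypothetical sentence, not a quotation from the Official Records.
sample = "General Ewell will move to Front Royal and General Ewell will report"
bigram_counts = Counter(bigrams(sample))
print(bigram_counts.most_common(2))
```

A text of n words always yields n - 1 bigrams, so the counts grow quickly; in practice one would usually drop pairs that contain a stopword.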
Another interesting way to examine the results is to find the bigrams that occur at least 5 times and plot them in a relationship graph, an approach often referred to as network analysis.
Your Excellency? Yes, General Lee referred to Jeff Davis and Governor Vance as such. Now, who is this Colonel Wharton character? It turns out he was commanding some troops in the Valley as a subordinate to MG Samuel Jones. I think this shows how an analysis like this can uncover such a detail. Lee sent him only one message, and the fact that his name is linked to “General” means that he is referred to in messages rather than messaged directly.
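The data behind a relationship graph like this is just a filtered list of word pairs. Here is a minimal Python sketch of that filtering step; the counts are hypothetical, `bigram_graph` is my own helper, and the drawing of the graph itself is left out.

```python
from collections import Counter, defaultdict


def bigram_graph(pair_counts, min_count=5):
    """Keep pairs seen at least `min_count` times as weighted, directed edges."""
    graph = defaultdict(dict)
    for (first, second), n in pair_counts.items():
        if n >= min_count:
            graph[first][second] = n
    return dict(graph)


# Hypothetical counts for illustration only.
counts = Counter({
    ("Your", "Excellency"): 6,
    ("General", "Ewell"): 7,
    ("General", "Imboden"): 5,
    ("Colonel", "Wharton"): 2,
})
graph = bigram_graph(counts)
for first, neighbours in graph.items():
    print(first, "->", neighbours)
```

Each key is a node, and each inner entry is an edge to the word that follows it, weighted by how often the pair occurs. Raising `min_count` thins the graph; lowering it brings in rarer pairs at the cost of clutter.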
One of my favorite methods in NLP is to extract “topics” from text. In the next section we will extract and analyze the main topics of Lee’s messages. There are a number of ways to do this, such as Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), etc. I will keep it simple, for me anyway, by doing LDA and finding a set number of topics. There are several ways to determine the optimal number of topics in a set of text documents, but let’s not go down that rabbit hole. Also note that I’m a trained professional, so please don’t try this at home.
Latent Dirichlet Allocation
Previously, I did what is known as “tokenizing” both individual words and bigrams. I’m going to combine those to extract topics. As for the number of topics, let’s try 12 and see what happens.
| Topic | Number of Messages |
|---|---|
| 1 | 12 |
| 2 | 7 |
| 3 | 3 |
| 4 | 2 |
| 5 | 4 |
| 6 | 12 |
| 7 | 18 |
| 8 | 8 |
| 9 | 3 |
| 10 | 5 |
| 11 | 5 |
| 12 | 5 |
The table above gives us the number of messages that are assigned to each of the 12 pre-specified topics. There are a bunch of things we could do next like classify and extract the 5 messages that comprise topic number 10, but let’s find the top 15 (highest probability) words/bigrams associated with each of the topics.
| Topic.1 | Topic.2 | Topic.3 | Topic.4 | Topic.5 | Topic.6 | Topic.7 | Topic.8 | Topic.9 | Topic.10 | Topic.11 | Topic.12 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| move | brigade | cavalry | operations | time | enemy | General | force | army | Your | army | Carolina |
| route | General | wounded | artillery | position | front | Valley | received | road | desire | troops | North |
| cross | regiments | This | duty | hope | He | Winchester | Potomac | corps | leaving | division | CourtHouse |
| information | service | reported | forward | command | line | command | letter | You | Front_Royal | companies | return |
| General | Virginia | sufficient | officers | forwarded | river | Colonel | hope | safety | success | regiment | Culpeper |
| Gap | brigades | President | commanding | In | You | troops | Staunton | Chambersburg | transportation | effect | Culpeper_CourtHouse |
| movements | North_Carolina | column | sufficient | officers | left | enemy | guard | trains | Excellency | convalescents | Federal |
| send | cavalry | Ewells_corps | receive | Richmond | Rappahannock | General_Ewell | send | position | subject | view | General_Beauregard |
| corps | North | captured | Hill | service | As | Imboden | desire | report | peace | engineer | result |
| reach | Carolina | Potomac | collecting | practicable | rear | Ewell | supplies | join | people | time | benefit |
| Longstreet | officers | commanding | advance | west | Fredericksburg | Jones | prevent | officer | instructions | divisions | Beauregard |
| mountains | Col | desire | learn | ranks | H | march | prisoners | train | bring | Your_letter | Virginia |
| division | Cavalry | From | Shenandoah | deem | advanced | Should | Maryland | It_will | leave | Department | troops |
| General_Your | South | letter | prepared | A | inform | cavalry | opportunity | instant | means | engineer_troops | condition |
| enemys | separate | Pickett | heard | letter | withdrawn | morning | horses | attention | opinion | copy | note |
Each column is a topic, and the rows are the top words within that topic. One can qualitatively or even quantitatively summarize a topic by giving it some sort of title. For instance, one could qualitatively label Topic 7 as “Lee’s Shenandoah Valley Coordination Messages”. We could consume ourselves with creating more topics or doing a hierarchical topic modeling strategy, but let’s not get bogged down in details. Let’s plot the topics and key words from what is known as the “beta matrix”. The plot shows the beta scores: the higher the score, the more strongly that word or bigram is associated with its topic. Let me ask you this: are you curious as to what Topic 12 is about, considering the mentions of General Beauregard? I’ll leave it to you to uncover that if you are not familiar.
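Once a topic model is fitted, the two summaries above (messages per topic and top terms per topic) are simple lookups. The Python sketch below assumes the model output is already in hand; the numbers are invented, and the name `gamma` for the per-message topic probabilities is an assumption borrowed from common usage, alongside the “beta matrix” mentioned above.

```python
from collections import Counter

# Hypothetical model output for three messages and three topics.
gamma = {  # message id -> probability of each topic (rows sum to 1)
    "msg1": [0.7, 0.2, 0.1],
    "msg2": [0.1, 0.1, 0.8],
    "msg3": [0.6, 0.3, 0.1],
}
beta = [  # one dict per topic: term -> probability within that topic
    {"move": 0.30, "route": 0.25, "cross": 0.10},
    {"brigade": 0.40, "regiments": 0.20, "service": 0.15},
    {"enemy": 0.35, "front": 0.30, "line": 0.05},
]


def dominant_topic(probs):
    """Return the 1-based number of the highest-probability topic."""
    return max(range(len(probs)), key=probs.__getitem__) + 1


def top_terms(term_probs, n=15):
    """Return the `n` terms with the highest probability in a topic."""
    ranked = sorted(term_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:n]]


messages_per_topic = Counter(dominant_topic(p) for p in gamma.values())
print(dict(messages_per_topic))
print(top_terms(beta[0], n=2))
```

Assigning each message to its single most probable topic is a simplification, since LDA treats every message as a mixture of topics.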
We could do quite a bit more analysis, like finding word correlations, co-locations and on and on, but let’s wrap this up with Named Entity Recognition (NER). This is a methodology to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, etc. A researcher can use NER to discover entities of interest. As you explore the output of the count of entities below, it is important to keep in mind that any NER algorithm is liable to make a mistake or two, for example classifying a location as a person or vice versa. I’m creating a database to take modern NER extractions and convert them based on Civil War entities. Maybe someday I will make that publicly available.
Here I just create a table of the NER results for the messages from June 3 through June 7. Notice how Longstreet is a location! Thus, my endeavor to create a corrective database.
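A corrective database of this kind can start life as a plain lookup table. The Python sketch below shows the idea only; it is not my actual database, the entries and labels are hypothetical, and `correct_entities` is an illustrative helper that takes the (text, label) pairs a NER tool would produce.

```python
# Hypothetical correction table: entity text -> the label it should carry.
CORRECTIONS = {
    "Longstreet": "PERSON",
    "Culpeper Court-House": "LOCATION",
}


def correct_entities(entities, corrections=CORRECTIONS):
    """Override the label of any entity listed in the correction table."""
    return [(text, corrections.get(text, label)) for text, label in entities]


# Hypothetical NER output as (text, label) pairs.
raw = [("Longstreet", "LOCATION"), ("Winchester", "LOCATION"), ("Ewell", "PERSON")]
fixed = correct_entities(raw)
for text, label in fixed:
    print(f"{text}: {label}")
```

Entities that are not in the table pass through unchanged, so the table only needs to hold the known mistakes.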
I hope this quick analysis stimulates your thinking on the potential of using NLP for historical documents. I shall endeavor to put together something for the unit after-action reports. That might be interesting.