General Lee’s Communications During the Gettysburg Campaign

Objective

In April I ran a beta test using Natural Language Processing (NLP) to analyze the published communications between Hooker and Meade and higher headquarters in Washington, D.C., during the Gettysburg Campaign. The technique quickly showed that Halleck used Harper’s Ferry as a point to annoy General Hooker and force him to show his hand, ultimately leading to his resignation.

That report is available on RPubs at https://rpubs.com/undsioux88/getty_nlp_test

This time around I will apply similar methods to General Lee’s messages during the same campaign.

Robert E. Lee:

Data Loading and Preparation

I performed this analysis using the open-source statistical software R, version 3.5.2. The source of the data is the same as before: “The War of the Rebellion: a compilation of the official records of the Union and Confederate armies”. Here we load the data and see its dimensions.

## [1] 127   5

That output means there are 127 messages (rows) and 5 columns in the data, which are as follows:

## [1] "Date"      "Recipient" "Message"   "Sender"    "id"

The column names correspond with the date of the message, for whom the message was intended, the message itself, who sent the message, and a numeric message identifier.

Let’s look at a simple matrix of the number of messages attributed to each sender in the data.

Sender	Freq
Chilton, Asst. AG	2
Cooper	2
Ewell	3
Guild	2
JEB Stuart	4
Jeff Davis	3
Lee	84
Long	1
Longstreet	2
MG Jones	6
Pendleton	1
Pickett	1
Seddon	4
Sorrel	5
Taylor, Asst. AG	6
Venable, ADC	1

I only want the 84 messages sent by General Lee himself. After getting rid of “stopwords”, we can look at a chart of the most frequent words Lee used.
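The filtering and counting step can be sketched roughly as follows. This is a minimal Python illustration, not the R code behind the original chart: the stopword list is a tiny stand-in I made up, the sample rows are hypothetical, and only the `Sender` and `Message` column names come from the data described above.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "of", "to", "and", "a", "in", "i", "you", "it", "will", "on", "is", "at"}

def word_frequencies(messages, stopwords=STOPWORDS):
    """Count non-stopword tokens across messages, case-insensitively."""
    counts = Counter()
    for text in messages:
        tokens = re.findall(r"[A-Za-z']+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts

# Hypothetical rows shaped like the Sender / Message columns described above.
rows = [
    {"Sender": "Lee", "Message": "The enemy cavalry is in the Valley."},
    {"Sender": "Ewell", "Message": "I will move to Winchester."},
    {"Sender": "Lee", "Message": "Move the cavalry to the Valley at once."},
]

# Keep only the messages Lee sent, then count what remains after stopword removal.
lee_messages = [r["Message"] for r in rows if r["Sender"] == "Lee"]
freq = word_frequencies(lee_messages)
print(freq.most_common(3))
```

Lower-casing before counting merges “Valley” and “valley”; whether that is desirable for proper nouns is a judgment call.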

As with the Army of the Potomac analysis, we see many references to the enemy, cavalry, etc. The count of “General” is a bit high because the opening of a message would normally reference the recipient’s rank. Also interesting is the focus on the Valley and Winchester, which was the critical point within the Shenandoah Valley.

If you are curious about the recipients of the 84 messages, then here you go…

Recipient	Number of Lee’s messages received
A.P. Hill	2
C.O. at Winchester	1
Chairman, Ambulance Cmte.	1
COL Gorgas	1
COL Wharton	1
Davis	3
Elzey	1
Ewell	7
General Cooper	4
General Order	1
Governor Vance	1
Hunter	1
Imboden	11
JEB Stuart	15
Jeff Davis	6
Jenkins	2
Longstreet	4
Major Collins	1
MG Jones	6
Pickett	4
Seddon	5
Special Order	3
Trimble	2
Wharton	1

Notice that the most frequent recipients of his messages were Stuart and Imboden, both cavalry commanders. One thing I find interesting is that he sent four messages to Pickett, which was unique for an infantry division commander in the Army of Northern Virginia during this campaign.

Keywords in Context

A potentially important task in understanding textual data is to look at words of interest, keywords if you will, in their context. Let’s take Gettysburg as an example and see what we can derive about its context in the official record. The output below shows the 15 words before and the 15 words after each occurrence of “Gettysburg”.

docname	from	to	pre	keyword	post	pattern
lee.csv.70	77	77	have no good reason against it I desire you to move in the direction of	Gettysburg	via Heidlersburg where you will have turnpike most of the way and you can thus	Gettysburg
lee.csv.70	129	129	side of the mountains When you come to Heidlersburg you can either move directly on	Gettysburg	or turn down to Cashtown Your trains and heavy artillery you can send if you	Gettysburg
lee.csv.72	158	158	Carlisle road You must turn off everybody belonging to the army on the road to	Gettysburg	The reserve trains of the army are parked between Greenwood and Cashtown on said road	Gettysburg
lee.csv.82	33	33	During the night Lieutenant Thomas L Norwood Thirtyseventh North Carolina Regiment who was wounded at	Gettysburg	and made his escape arrived He reports he passed at Waynesborough what he supposed a	Gettysburg
lee.csv.117	140	140	any course not in accordance with their inclinations The day after the last battle at	Gettysburg	on sending back the train with the wounded it was reported that about 5000 well	Gettysburg

Gettysburg occurs in four of the messages, appearing twice in the first one. That first message was the one to Ewell directing him to move his Corps by way of Heidlersburg and then, at his discretion, either move directly on Gettysburg or turn down to Cashtown. A fascinating story that warrants further investigation is this LT Thomas Norwood from the 37th North Carolina, who made his escape, returned to the army, and provided intelligence.
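For readers who want to see the mechanics, a keyword-in-context lookup can be sketched in a few lines. This is a minimal Python illustration under my own assumptions (a simple word tokenizer and a hypothetical `kwic` helper), not the R routine that produced the table above. The sample sentence is taken from the second row of that table, and the window is shortened to 5 words for readability.

```python
import re

def kwic(text, keyword, window=15):
    """Return (pre, keyword, post) tuples with up to `window` words of context."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            pre = " ".join(tokens[max(0, i - window):i])
            post = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((pre, tok, post))
    return hits

sample = ("When you come to Heidlersburg you can either move directly on "
          "Gettysburg or turn down to Cashtown")

for pre, key, post in kwic(sample, "Gettysburg", window=5):
    print(f"{pre} | {key} | {post}")
```

The match is case-insensitive and exact on whole tokens, so a possessive or hyphenated form would need extra handling.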

Bigrams

A bigram is a sequence of two consecutive words. You can study these sequences by themselves and in conjunction with single words. The results are available to explore in an interactive table.

Another interesting way to examine the results is to find the bigrams that occur at least 5 times and plot them in a relationship graph, an approach often referred to as network analysis.
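The counting and thresholding behind such a graph can be sketched as below. This is a minimal Python illustration with hypothetical helper names and made-up toy messages, not the original R workflow. It skips stopword removal, and the threshold is lowered from 5 to 2 so the toy data produce some edges.

```python
import re
from collections import Counter

def bigram_counts(messages):
    """Count adjacent word pairs within each message (never across messages)."""
    counts = Counter()
    for text in messages:
        tokens = re.findall(r"[A-Za-z]+", text)
        counts.update(zip(tokens, tokens[1:]))
    return counts

def frequent_edges(counts, min_count=5):
    """Keep bigrams seen at least `min_count` times as (first, second, weight) edges."""
    return [(a, b, n) for (a, b), n in counts.items() if n >= min_count]

# Hypothetical toy messages; the threshold is lowered to 2 to suit their size.
messages = [
    "Your Excellency General Ewell is at Winchester",
    "General Ewell reports to Your Excellency",
]

edges = frequent_edges(bigram_counts(messages), min_count=2)
print(edges)
```

Each surviving bigram becomes a directed edge from its first word to its second, weighted by its count, which is the form most graph-plotting tools expect.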

Your Excellency? Yes, General Lee referred to Jeff Davis and Governor Vance as such. Now, who is this Colonel Wharton character? It turns out he was commanding some troops in the Valley as a subordinate to MG Samuel Jones. I think this shows how an analysis like this can uncover such a thing. Lee only sent him one message, and the fact that his name is linked to “General” means that he is referred to in messages rather than messaged directly.

One of my favorite methods in NLP is to extract “topics” from text. In the next section we will extract and analyze the main topics of Lee’s messages. There are a number of ways to do this, such as Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), etc. I will keep it simple, for me anyway, by doing LDA and finding a set number of topics. There are several ways to determine the optimal number of topics in a set of text documents, but let’s not go down that rabbit hole. Also note that I’m a trained professional, so please don’t try this at home.

Latent Dirichlet Allocation

Identifying Topics

Previously, I “tokenized” the text into both individual words and bigrams. I’m going to combine those to extract topics. As for the number of topics, let’s try 12 and see what happens.

Topic	Number of Messages
1	12
2	7
3	3
4	2
5	4
6	12
7	18
8	8
9	3
10	5
11	5
12	5

The table above gives us the number of messages that are assigned to each of the 12 pre-specified topics. There are a bunch of things we could do next like classify and extract the 5 messages that comprise topic number 10, but let’s find the top 15 (highest probability) words/bigrams associated with each of the topics.

Topic.1	Topic.2	Topic.3	Topic.4	Topic.5	Topic.6	Topic.7	Topic.8	Topic.9	Topic.10	Topic.11	Topic.12
move	brigade	cavalry	operations	time	enemy	General	force	army	Your	army	Carolina
route	General	wounded	artillery	position	front	Valley	received	road	desire	troops	North
cross	regiments	This	duty	hope	He	Winchester	Potomac	corps	leaving	division	CourtHouse
information	service	reported	forward	command	line	command	letter	You	Front_Royal	companies	return
General	Virginia	sufficient	officers	forwarded	river	Colonel	hope	safety	success	regiment	Culpeper
Gap	brigades	President	commanding	In	You	troops	Staunton	Chambersburg	transportation	effect	Culpeper_CourtHouse
movements	North_Carolina	column	sufficient	officers	left	enemy	guard	trains	Excellency	convalescents	Federal
send	cavalry	Ewells_corps	receive	Richmond	Rappahannock	General_Ewell	send	position	subject	view	General_Beauregard
corps	North	captured	Hill	service	As	Imboden	desire	report	peace	engineer	result
reach	Carolina	Potomac	collecting	practicable	rear	Ewell	supplies	join	people	time	benefit
Longstreet	officers	commanding	advance	west	Fredericksburg	Jones	prevent	officer	instructions	divisions	Beauregard
mountains	Col	desire	learn	ranks	H	march	prisoners	train	bring	Your_letter	Virginia
division	Cavalry	From	Shenandoah	deem	advanced	Should	Maryland	It_will	leave	Department	troops
General_Your	South	letter	prepared	A	inform	cavalry	opportunity	instant	means	engineer_troops	condition
enemys	separate	Pickett	heard	letter	withdrawn	morning	horses	attention	opinion	copy	note

Each column corresponds to a topic number, and the rows list the top words within that topic. One can qualitatively or even quantitatively summarize a topic by giving it some sort of title. For instance, one could qualitatively label Topic 7 as “Lee’s Shenandoah Valley Coordination Messages”. We could consume ourselves with creating more topics or doing a hierarchical topic modeling strategy, but let’s not get bogged down in details. Let’s plot the topics and key words from what is known as the “beta matrix”. The plot shows the beta scores, that is, the probability of each word or bigram within a topic; the higher the score, the more strongly that word or bigram is associated with the topic. Let me ask you this: are you curious as to what topic 12 is about, considering the mentions of General Beauregard? I’ll leave it to you to uncover that if you are not familiar.
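To make the two topic tables a little more concrete, here is a minimal Python sketch of how such read-outs are commonly derived from a fitted LDA model. It is not the original R code. I am assuming that each message is assigned to its single most probable topic and that top terms are ranked by their per-topic probability. The names `beta`, `gamma`, `top_terms` and `messages_per_topic` are my own, and every number below is invented for illustration.

```python
from collections import Counter

def top_terms(beta, k=15):
    """For each topic, return the k terms with the highest probability (beta)."""
    return {
        topic: [term for term, _ in
                sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]]
        for topic, probs in beta.items()
    }

def messages_per_topic(gamma):
    """Assign each message to its most probable topic and count the assignments."""
    return Counter(max(probs, key=probs.get) for probs in gamma.values())

# Hypothetical numbers for illustration only; they are not fitted values.
beta = {  # topic -> {term: probability of the term within that topic}
    7: {"Valley": 0.05, "Winchester": 0.04, "General": 0.06, "peace": 0.001},
    10: {"Valley": 0.002, "peace": 0.03, "Excellency": 0.02, "General": 0.01},
}
gamma = {  # message -> {topic: probability of the topic within that message}
    "lee.csv.1": {7: 0.8, 10: 0.2},
    "lee.csv.2": {7: 0.3, 10: 0.7},
    "lee.csv.3": {7: 0.9, 10: 0.1},
}

print(top_terms(beta, k=2))
print(messages_per_topic(gamma))
```

A hard assignment like this discards the fact that a message can blend several topics, so the per-topic counts are best read as a rough summary.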

We could do quite a bit more analysis, like finding word correlations, collocations and on and on, but let’s wrap this up with Named Entity Recognition (NER). This is a methodology to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, etc. A researcher can use NER to discover entities of interest. As you explore the output of the count of entities below, it is important to keep in mind that any NER algorithm is liable to make a mistake or two, for example classifying a location as a person or vice versa. I’m creating a database to take modern NER extractions and convert them based on Civil War entities. Maybe someday I will make that publicly available.

Named Entities

Here I just create a table of the named entities for the messages from June 3 through June 7. Notice how Longstreet is a location! Thus, my endeavor to create a corrective database.
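One simple shape such a corrective step could take is a lookup table applied after the tagger runs. The Python sketch below is purely illustrative: the `CORRECTIONS` table, its label names and the sample tagger output are all hypothetical, and it says nothing about how the actual database is designed.

```python
# Hypothetical override table: entity text -> label it should carry in a Civil War corpus.
CORRECTIONS = {
    "Longstreet": "PERSON",
    "Ewell": "PERSON",
    "Imboden": "PERSON",
    "Front Royal": "LOCATION",
}

def correct_entities(entities, corrections=CORRECTIONS):
    """Relabel (text, label) pairs whose text appears in the correction table."""
    return [(text, corrections.get(text, label)) for text, label in entities]

# Made-up tagger output imitating the kind of mistake described above.
raw = [("Longstreet", "LOCATION"), ("Culpeper", "LOCATION"), ("Front Royal", "PERSON")]

print(correct_entities(raw))
```

Entities missing from the table pass through unchanged, so the table only needs to hold the known trouble cases.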

Conclusion

I hope this quick analysis stimulates your thinking on the potential of using NLP for historical documents. I shall endeavor to put together something similar for the unit after action reports. That might be interesting.