Objective

The purpose of this project is to use Natural Language Processing (NLP) to develop insights around the published communications between the commander of The Army of the Potomac and higher headquarters in Washington, D.C., during the Gettysburg Campaign.

The analysis is focused on five individuals and their communication with each other:
1. President Abraham Lincoln
2. Union General-in-Chief, Major General Henry Halleck
3. Secretary of War, Edwin Stanton
4. Commander, Army of the Potomac, Major General Joseph Hooker
5. Commander, Army of the Potomac, Major General George Meade

The time period covered is from June 4th, 1863 through August 3rd, 1863, which runs from the start of Lee’s campaign to Meade sending troops to quell the draft riots. I compiled the data into 229 separate messages. For simplicity of analysis and understanding, I attributed to Hooker a couple of messages sent by his Chief of Staff, Daniel Butterfield. The source for this analysis is “The War of the Rebellion: a compilation of the official records of the Union and Confederate armies”.

The volume I used was Chapter 39, Operations in North Carolina, Virginia, West Virginia, Maryland, Pennsylvania, and Department of the East; June 3rd-August 3rd, 1863. You can access the document via the Hathi Trust Digital Library at the following link.

https://babel.hathitrust.org/cgi/pt?num=1&u=1&seq=75&view=image&size=100&id=coo.31924077699761.com

Major General Hooker:

Data Loading and Preparation

You can download the data from Google Docs: https://docs.google.com/spreadsheets/d/1eEnBJ6tyAcvBLWddSvBhD7HBFqnR20UldqmYb7mMQ_Y/edit?usp=sharing. I performed this analysis using the open-source statistical software R, version 3.5.2.

Here are the column names:

## [1] "date"      "recipient" "message"   "sender"    "id"

The column names correspond with the date of the message, for whom the message was intended, the message itself, who sent the message, and a numeric message identifier.

We start simple with a plot of the words occurring more than 35 times in the messages. Be advised that in this process, I’ve removed meaningless “stopwords” such as “a”, “and”, “the”, etc.

No surprise that the messages had a focus on the enemy. Note a couple of things. First, the high frequency of the word “cavalry”. This reflects the importance of using Pleasonton’s troopers to gather intelligence, the fights at Brandy Station, Aldie, etc., and the anxiety caused by JEB Stuart and his forces. Hooker and Halleck were truly in the dark on Lee’s movements and intentions. The other thing is that in this initial analysis the focus is on each individual word, so Harper’s Ferry becomes “tokenized” as two separate words. I shall address that in the section on bigrams.
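The word counts described above can be sketched in a few lines of base R. This is only a minimal illustration under stated assumptions: the three messages and the short stopword list are invented for the example, and the post does not say which R packages produced the original plot.

```r
# Illustrative messages only; these are NOT rows from the actual data set.
messages <- c(
  "The enemy cavalry is moving toward the river.",
  "Send the cavalry to observe the enemy at the ford.",
  "The enemy has crossed the river."
)

# A tiny stopword list for this sketch; a real analysis would use a fuller one.
stopwords <- c("a", "and", "the", "is", "to", "at", "has")

# Lower-case, strip punctuation, and split on whitespace: one token per word.
clean  <- gsub("[[:punct:]]", "", tolower(messages))
tokens <- unlist(strsplit(clean, "\\s+"))

# Drop stopwords, then count and sort the remaining words.
tokens    <- tokens[!tokens %in% stopwords]
word_freq <- sort(table(tokens), decreasing = TRUE)

# Keep words at or above a threshold (2 here; the post uses "more than 35").
min_count <- 2
frequent  <- word_freq[word_freq >= min_count]
print(frequent)
```

Because each word is its own token, a place name such as Harper’s Ferry would be split into two entries here, which is the limitation noted above.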

We can see how many messages are attributable to each of the leaders, starting when Hooker was in command.

Sender Total Messages
Hooker 59
Halleck 27
Lincoln 13
Stanton 2

We can see that Hooker sent over half of the messages. The various senders and recipients are as follows:

Sender    Dix  Halleck  Hooker  Lincoln  Stanton
Halleck     0        0      27        0        0
Hooker      1       36       0       18        4
Lincoln     0        0      13        0        0
Stanton     0        0       2        0        0

The table has senders on the left and recipients on the top, so General Hooker sent 18 messages to Lincoln and received 13 from him. The data includes the message to General Dix who received only the one message from Hooker.
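A sender-by-recipient table of this kind can be produced with base R’s table() function. The sketch below uses a handful of hypothetical rows that only mimic the “sender” and “recipient” columns; it is not the actual data, and the original tables may have been built differently.

```r
# Hypothetical rows that mimic the sender/recipient columns; not the real data.
msgs <- data.frame(
  sender    = c("Hooker", "Hooker", "Halleck", "Lincoln", "Hooker"),
  recipient = c("Lincoln", "Halleck", "Hooker", "Hooker", "Halleck"),
  stringsAsFactors = FALSE
)

# Messages per sender, largest first.
per_sender <- sort(table(msgs$sender), decreasing = TRUE)
print(per_sender)

# Cross-tabulation: senders as rows, recipients as columns.
cross <- table(sender = msgs$sender, recipient = msgs$recipient)
print(cross)

# Look up a single cell, e.g. messages from Hooker to Halleck.
cross["Hooker", "Halleck"]
```

Reading a cell by row and column name, as in the last line, avoids mixing up who sent and who received.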

With Meade in command, we see a different dynamic in the messages: Lincoln is now a mere bystander, and his two messages are passed to Meade via Halleck. The mutual contempt between Hooker and Halleck led to Hooker’s eventual undoing, and the change of command reestablished a more functional chain of command.

Sender Total Messages
Meade 72
Halleck 53
Lincoln/Halleck 2
French 1

We now turn our attention to the top 10 words of Lincoln, Halleck, Hooker, and Meade.

Halleck was quite interested in Harper’s Ferry it seems, while Lincoln was focused on the towns of Martinsburg and Winchester in the Shenandoah Valley. Let’s drill down into this Harper’s Ferry matter.

Keywords in Context

An important task in understanding text documents is to look at words of interest, keywords if you will, in their context. Let’s take Harper’s Ferry as an example and see what we can derive about its importance. Since Halleck discussed it in a number of his messages, let’s look at those, focusing on messages sent during Hooker’s tenure in command. But first, some comments on the output below. The column “docname” corresponds to the message id number. You can disregard “from” and “to”. The “pre” column holds the words before the keyword, while “keyword” and “post” give the keyword of interest and the ten words or punctuation marks after it. Additionally, I’ve turned Harper’s Ferry into the single word HarpersFerry, which helps to single it out as the entity that interests us.

docname from to pre keyword post
wor.csv.7 56 56 injunction to keep in view the safety of Washington and HarpersFerry In regard to the contingency which you suppose may arise
wor.csv.7 175 175 boards for the defenses of Washington Neither this capital nor HarpersFerry could long hold out against a large force They must
wor.csv.37 10 10 No information of enemy in direction of Winchester and HarpersFerry as late as that from General Pleasonton The forces at
wor.csv.37 25 25 from General Pleasonton The forces at Martinsburg are arriving at HarpersFerry
wor.csv.38 7 7 Garrison of Martinsburg has arrived at HarpersFerry Milroy did not obey orders given on the 11th to
wor.csv.38 27 27 to abandon Winchester and probably has or will be captured HarpersFerry ought to hold out some time Pleasontons telegrams to you
wor.csv.38 89 89 Longstreet and Stuart are crossing the Potomac above and below HarpersFerry They certainly should be pursued The force used for that
wor.csv.48 117 117 dark or on mere conjecture Tyler is in command at HarpersFerry with it is said only 9000 men but according to
wor.csv.49 11 11 There is now no doubt that the enemy is surrounding HarpersFerry but in what force I have no information General Schenck
wor.csv.52 11 11 Information of enemys actual position and force in front of HarpersFerry is as indefinite as that in your front Nearly everything
wor.csv.52 39 39 enemy mentioned is Halltown The bridges across both rivers at HarpersFerry are believed to be intact and most of Tylers troops
wor.csv.55 12 12 have given no directions for our army to move to HarpersFerry I have advised the movement of a force sufficiently strong
wor.csv.55 41 41 the enemy is and then move to the relief of HarpersFerry or elsewhere as circumstances might require With the remainder of
wor.csv.55 115 115 We have no positive information of any large force against HarpersFerry and it cannot be known whether it will be necessary
wor.csv.62 32 32 given to you as received What is meant by abandoning HarpersFerry is merely that General Tyler has concentrated his force in
wor.csv.62 57 57 Heights No enemy in any force has been seen below HarpersFerry north of the river and it is hoped that Tylers
wor.csv.63 17 17 has informed you what is meant by the abandonment of HarpersFerrya mere change of position It changes in no respect the
wor.csv.85 13 13 has been notified that the troops of his department in HarpersFerry and vicinity would obey all orders direct from you and

This is interesting as Halleck demanded that Hooker not abandon Harper’s Ferry. This compels us to examine Hooker’s thoughts on the matter.

Harper’s Ferry, 1863:

docname from to pre keyword post
wor.csv.4 346 346 keep in view always the importance of covering Washington and HarpersFerry either directly or by so operating as to be able
wor.csv.45 164 164 that I may be informed what troops there are at HarpersFerry and who is in command of them and also who
wor.csv.46 110 110 be thrown as speedily as possible across the river at HarpersFerry while another should be thrown over the most direct line
wor.csv.50 12 12 received your telegram Please inform me whether our forces at HarpersFerry are in the town or on the heights and if
wor.csv.50 40 40 or Maryland Heights and which if any what bridges at HarpersFerry and where from what direction is the enemy making his
wor.csv.51 13 13 with your directions I shall march to the relief of HarpersFerry I put my column again in motion at 3 a
wor.csv.59 50 50 successively that Martinsburg and Winchester were invested and surrounded that HarpersFerry was closely invested with urgent calls upon me for relief
wor.csv.59 101 101 from one of his wagon trains that General Tyler at HarpersFerry whose urgent calls as represented to me required under my
wor.csv.59 134 134 in no danger Telegraph operator just reports to me that HarpersFerry is abandoned by our forces Is this true Directions have
wor.csv.59 160 160 to make a reconnaissance in the direction of Winchester and HarpersFerry for the purpose of ascertaining the whereabouts and strength of
wor.csv.61 6 6 Advice of the abandonment of HarpersFerry renders forced marches unnecessary to relieve it This army will
wor.csv.84 104 104 are ready General French is now on his way to HarpersFerry and I have given directions for the force at Poolesville
wor.csv.96 9 9 General Hooker personally has just left here for HarpersFerry where he will be about 11 oclock Point of Rocks
wor.csv.96 37 37 Copies of all dispatches should be sent to Frederick and HarpersFerry up to 11 a m and after that to Frederick
wor.csv.99 9 9 I have received your telegram in regard to HarpersFerry I find 10000 men here in condition to take the
wor.csv.99 40 40 defend a ford of the river and as far as HarpersFerry is concerned there is nothing of it As for the
wor.csv.100 8 8 My original instructions require me to cover HarpersFerry and Washington I have now imposed upon me in addition

The key textual insight into Hooker’s assessment of Harper’s Ferry is discernible from message identifier number 99 (wor.csv.99).

“I have received your telegram in regard to Harper’s Ferry. I find 10,000 men here, in condition to take the field. Here they are of no earthly account. They cannot defend a ford of the river, and, as far as Harper’s Ferry is concerned, there is nothing of it. As for the fortifications, the work of the troops, they remain when the troops are withdrawn. No enemy will ever take possession of them. This is my opinion. All the public property could have been secured tonight, and the troops marched to where they could have been of some service. Now they are but a bait for the rebels, should they return. I beg that this may be presented to the Secretary of War and His Excellency the President.”

As it turns out, the sparring over what to do about Harper’s Ferry compelled Hooker to submit his resignation. Here is the full text of that fateful message to Halleck, identifier 100.

“My original instructions require me to cover Harper’s Ferry and Washington. I have now imposed upon me, in addition, an enemy in my front of more than my number. I beg to be understood, respectfully, but firmly, that I am unable to comply with this condition with the means at my disposal, and earnestly request that I may at once be relieved from the position I occupy.”

Needless to say, Lincoln granted Hooker his request, ordering Major General Meade to take command.

Here’s another example of keywords in context where we see what Lincoln had to say about the Confederate Capital, Richmond. In this case, the window is 15 tokens before and after the keyword.

docname from to pre keyword post
wor.csv.14 29 29 would not go south of Rappahannock upon Lees moving north of it If you had Richmond invested today you would not be able to take it in twenty days meanwhile your
wor.csv.14 60 60 communications and with them your army would be ruined I think Lees army and not Richmond is your sure objective point If he comes toward the Upper Potomac follow on his

Hooker was of the opinion that, since Lee’s movement away from Fredericksburg had left Richmond uncovered, he should advance on the Confederate capital. Lincoln expressed his disapproval of such a move and reiterated that the objective was Lee’s army, not Richmond, and that Hooker should maneuver accordingly. This is a clear indication that there was tension around identifying the operational objective.
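The keyword-in-context tables in this section can be approximated with a small base R helper. This is a sketch under assumptions, not the code behind the tables above: the function name kwic_simple, its column layout, and the example sentence are all made up for illustration.

```r
# Minimal keyword-in-context helper in base R (hypothetical, for illustration).
# `window` is the number of tokens shown on each side of the keyword.
kwic_simple <- function(texts, keyword, window = 10) {
  rows <- list()
  for (i in seq_along(texts)) {
    # Strip punctuation and split on whitespace.
    tokens <- strsplit(gsub("[[:punct:]]", "", texts[i]), "\\s+")[[1]]
    tokens <- tokens[tokens != ""]
    hits <- which(tolower(tokens) == tolower(keyword))
    for (h in hits) {
      # Up to `window` token positions before and after the match.
      pre_idx  <- tail(seq_len(h - 1), window)
      post_idx <- head(seq_len(length(tokens) - h) + h, window)
      rows[[length(rows) + 1]] <- data.frame(
        docname  = i,
        position = h,
        pre      = paste(tokens[pre_idx], collapse = " "),
        keyword  = tokens[h],
        post     = paste(tokens[post_idx], collapse = " "),
        stringsAsFactors = FALSE
      )
    }
  }
  if (length(rows) == 0) return(data.frame())
  do.call(rbind, rows)
}

# Illustrative text, with the place name already joined into one token.
example <- c("The garrison has arrived at HarpersFerry and will hold the heights.")
kwic_simple(example, "HarpersFerry", window = 3)
```

Joining the place name into a single token before searching, as the post does with HarpersFerry, is what lets a one-word match find a two-word entity.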

Shifting gears, we will now look at the data from a bigram perspective. Previously, I processed the data (tokenized is the technical term) into individual words and analyzed it accordingly. With bigrams, the algorithm pairs each word with the word that follows it. For example, the analysis will now include Harper’s Ferry as “Harpers_Ferry”.

Major General Halleck:

Bigrams

After removing stopwords and processing the messages into bigrams, this is an example of what the data looks like. You can explore the table interactively.
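As a rough illustration of the bigram step, the base R sketch below removes stopwords, joins each remaining word to its successor with an underscore, and counts the results. The messages, the stopword list, and the helper name make_bigrams are assumptions for the example; the original processing may differ in details such as case handling.

```r
# Illustrative messages only; not taken from the data set.
messages <- c(
  "The force at Harpers Ferry is on Maryland Heights.",
  "Hold Harpers Ferry and Maryland Heights."
)
stopwords <- c("the", "at", "is", "on", "and")

# Hypothetical helper: tokenize, drop stopwords, pair consecutive tokens.
make_bigrams <- function(text, stopwords) {
  tokens <- strsplit(gsub("[[:punct:]]", "", text), "\\s+")[[1]]
  tokens <- tokens[tokens != "" & !tolower(tokens) %in% stopwords]
  if (length(tokens) < 2) return(character(0))
  # Pair each token with the one that follows it.
  paste(head(tokens, -1), tail(tokens, -1), sep = "_")
}

bigrams     <- unlist(lapply(messages, make_bigrams, stopwords = stopwords))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

# Keep bigrams above a count threshold (1 here; a graph might use a higher one).
bigram_freq[bigram_freq > 1]
```

Note that dropping stopwords before pairing can join words that were not adjacent in the original sentence, such as “Ferry_Maryland” here, so some bigrams will not be meaningful phrases.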

A way to visualize the bigram relationships is with a network graph. In the graph below, I’ve filtered for only those bigrams that occur more than five times.

When I first saw the graph, I had to scratch my head about “New York”, but after some exploration I discovered it was an important topic in the messages related to the draft riots.

So, one can use a network graph to identify areas of interest, and what are called “named entities” such as Chester Gap or General French for further research. We shall explore named entities in depth shortly, but in the next section we will extract themes from the messages.

Identifying Topics

This method of NLP is referred to as Topic Modeling. The various algorithms used in topic models provide a statistical method for uncovering latent “topics” in a body of text documents, which facilitates semantic understanding. One can apply the method to individual words or n-grams, even sentences. In this example, I will initially extract topics from the bigrams produced above in two parts, Hooker’s messages then Meade’s messages.

There are a number of methods to identify the optimal number of topics to incorporate into your algorithm. However, I still find it best to use your judgment in the matter, applying trial and error to select the proper number of topics.

As such, your opinion may differ, but 8 topics for General Hooker seems about right to me. After the algorithm creates topics and assigns one to each message, we can review the results. Here, we see a table of the topic number, “Var1”, and the frequency with which that topic occurs in the messages.

Var1 Freq
1 10
2 8
3 8
4 6
5 9
6 5
7 5
8 5

The first topic (Var1) is associated with 10 of Hooker’s messages. To start understanding what these topics are, we look at the following table of the first four topics, which lists 15 bigrams associated with each topic.

Topic.1 Topic.2 Topic.3 Topic.4
Blue_Ridge Sulphur_Springs Harpers_Ferry North_Carolina
Mr_President Alexandria_Railroad Thoroughfare_Gap MajorGeneral_Hancock
Second_Corps AP_Hill Eleventh_Corps Fairfax_Station
considerable_force Pleasonton_dated fair_presumption cavalry_raid
rebel_forces Colonel_Duffies majorgeneral_commanding Trimbles_division
Brandy_Station Gap_visited Banks_Ford General_Stahel
greatly_reinforced position_strength General_Schencks General_Tyler
General_Dix White_House Generals_Schenck Stuarts_force
James_River conjecture_To direction_shortly night_Nothing
Your_dispatch covering_Washington Nothing_noteworthy reported_enemy
4_Nothing Crossing_So pickets_reported Yesterday_Colonel
Duffies_pickets In_view After_giving movement_similar
force_moved Lees_army army_Under picket_line
Has_General MajorGeneral_Dix Fredericksburg_After Richmond_This
JNO_BUFORD reach_Warrenton To_accomplish Under_instructions

We see that topic 1 contains messages to Lincoln, and topic 3 contains information about Harper’s Ferry. It is outside the scope of this document, but you can assign a topic number to the file that contains the messages, which will facilitate further analysis. Let’s examine the next four topics.

Topic.5 Topic.6 Topic.7 Topic.8
General_Pleasonton Harpers_Ferry correct_information Culpeper_CourtHouse
rebel_cavalry Upper_Potomac General_Halleck His_Excellency
Hills_corps BrigadierGeneral_Pleasonton My_headquarters AP_Hills
Ewells_corps Hanover_Junction enemys_camps Your_telegram
General_Buford night_Shall General_Dixs Sixth_Corps
Alexandria_Railroad Dixs_force visited_yesterday cavalry_force
rebel_camps enemy_crossing Shenandoah_Valley General_Slough
First_Third June_4 Hamiltons_Crossing Buford_June
pm_received General_Jones January_31 Shenandoah_Valley
tomorrow_morning army_dated return_Will Lees_army
General_Couch campsthe_enemy This_morning Yesterday_morning
Following_received determine_satisfactorily utmost_promptitude Charleston_forces
force_moved late_positions camps_All Their_cavalry
heard_JNO additional_brigade Longstreets_command Tredwell_Moore
His_position Heintzelmans_forces Mr_President MajorGeneral_Heintzelmans

We see topic 6 includes further messages about Harper’s Ferry.
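Assigning a topic number to each message and then tallying the assignments can be sketched in base R as follows. The document-topic probabilities here are invented, and a fitted topic model would supply the real ones. Columns named Var1 and Freq are what base R returns from as.data.frame(table(...)) when the argument is not a bare variable name, so the frequency tables in this section may have been built along similar lines; that is an assumption.

```r
# Hypothetical document-topic probabilities: one row per message, one column
# per topic. A fitted topic model would supply these; the numbers are invented.
doc_topics <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.8, 0.1,
    0.6, 0.3, 0.1,
    0.2, 0.2, 0.6),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("msg", 1:4), NULL)
)

# Assign each message its most probable topic.
msgs <- data.frame(id = rownames(doc_topics), stringsAsFactors = FALSE)
msgs$topic <- apply(doc_topics, 1, which.max)

# Tabulate messages per topic; because the argument is not a bare variable
# name, the resulting columns are called Var1 and Freq.
topic_freq <- as.data.frame(table(msgs$topic))
print(topic_freq)
```

With the topic stored alongside the message id, the messages behind any one topic can be pulled out for closer reading, for example with msgs[msgs$topic == 1, ].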

This brings us to the themes once Meade took command. I extracted 10 topics for him.

Var1 Freq
1 17
2 5
3 4
4 7
5 5
6 9
7 8
8 4
9 6
10 7

Topic 1 for Meade is linked to 17 of his messages. As we did with Hooker, let’s have a look at the top 15 bigrams per topic.

Topic.1 Topic.2 Topic.3 Topic.4 Topic.5
New_York General_Couch Second_Corps Harpers_Ferry cavalry_force
General_Schurz Ewells_corps existing_circumstances enemys_cavalry Alexandria_Railroad
move_forward General_Schenck army_tonight Sixth_Corps Lees_army
General_W Twelfth_Corps Their_cavalry First_Corps My_movement
difficulties_encountered positive_information Our_cavalry coup_de F_Smith
information_respecting A_brigade White_Plains Longstreets_corps General_Buford
positive_intelligence Cedar_Mountain losses_sustained Our_cavalry procure_horses
personal_considerations considerable_body exact_condition reliable_intelligence Orange_CourtHouse
keeping_Washington fifty_wagons received_As AP_Hills main_body
strong_force This_movement hola_Maryland General_Barksdales enemy_south
Creek_Have AP_Hill Lees_army Mountain_Pass entire_army
giving_Lee Armistead_Have main_Reliable corps_commanders Lee_Colonel
Relay_Junction Catholic_church New_York General_Spinola Two_brigades
Schenck_increase D_Roman Corps_6 enemys_position Ferry_garrison
York_State H_Protzman move_tomorrow de_main 80000_They

Note in topic 5 the point about procuring horses. After the battle, Meade expressed his concern about the need for horses for the cavalry. Prior to this paper, I’d never heard of General Spinola. Turns out he took command of The Excelsior Brigade on July 11, 1863. He also was the first Italian-American elected to the House of Representatives. Now, the next topics.

Topic.6 Topic.7 Topic.8 Topic.9 Topic.10
Chester_Gap Brigadier_General General_Smith South_Mountain My_cavalry
reliable_intelligence General_Couch General_French General_Naglee Front_Royal
Falling_Waters General_Lockwood Eleventh_Corps Major_General Third_Corps
Maryland_Heights My_army strong_position Manassas_Gap honor_herewith
enemys_movement Warrenton_Junction accounts_agree railroad_crossing battleflags_captured
telegraphic_communication Colonel_Lowell Maryland_Heights Northern_Central Warrenton_Junction
momentous_consequences Your_dispatch dispatch_received directed_General Manassas_Gap
Aquia_Creek AP_Hill Mr_William intelligence_leads sufficient_force
Culpeper_CourtHouse push_forward Waynesborough_A rear_passing Cumberland_Valley
Upper_Rappahannock Totally_unexpected rear_guard Hew_Market reliable_information
de_main Potomac_Please railroad_bridge movement_proposed General_Pleasonton
sufficient_force State_Militia Susquehanna_keeping 700_Dont previously_reported
Poolesville_Do army_passed Dispatch_received Confederate_money Reserve_Artillery
received_Two Lees_entire Harpers_Ferry driving_horses force_So
Your_dispatch Potomac_All passing_yesterday H_Groves Seneca_Creek

Topic 10 is interesting as it contains the messages where Meade enumerates the battleflags captured during the battle.

Major General Meade:

Below I plot the top bigrams per topic arranged by what is known as “beta”, which is just a fancy way of saying the per-topic-per-word probability. Here are the top 6 for each topic.
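To make “beta” concrete, the base R sketch below turns a small, made-up table of topic-by-bigram counts into per-topic probabilities and picks the top terms for each topic. This only illustrates what the quantity means. A fitted topic model estimates beta as part of the model rather than by normalizing raw counts, and the counts and the helper name top_terms are assumptions.

```r
# Hypothetical topic-by-bigram counts; a fitted model would estimate beta itself.
counts <- matrix(
  c(7, 2, 1, 0,
    0, 2, 3, 5),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Topic.1", "Topic.2"),
                  c("Harpers_Ferry", "Lees_army", "New_York", "Front_Royal"))
)

# Beta: probability of each term within a topic, so each row sums to 1.
beta <- counts / rowSums(counts)

# Hypothetical helper: the n terms with the highest beta in each topic.
top_terms <- function(beta, n = 2) {
  lapply(seq_len(nrow(beta)), function(k) {
    head(sort(beta[k, ], decreasing = TRUE), n)
  })
}

top <- top_terms(beta, n = 2)
names(top) <- rownames(beta)
print(top)
```

Ranking by beta within each topic, rather than by raw frequency across all messages, is what lets a term that is rare overall still stand out as characteristic of one topic.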

This raises an interesting point in methodology. Within a message are a number of sub-topics. There are many methods to extract topics that are meaningful given the context of the documents. In my experience, this requires numerous iterations and subject matter expertise. However, it can help even an uninitiated researcher in identifying major themes and focusing their efforts.

The topics are insightful, but I shall endeavor to improve them with further experimentation using other methods and will update the results above if applicable. Our next task is to extract named entities such as people and places.

Named Entity Recognition

Named Entity Recognition (NER) is an NLP methodology for locating named entity mentions in unstructured text and classifying them into pre-defined categories such as person names, organizations, and locations. A researcher can use NER to discover entities of interest. As you explore the entity counts below, keep in mind that any NER algorithm is liable to make a mistake or two, for example classifying a person such as General Pleasonton as a location, or vice versa. I’m of the opinion that the more convoluted or poorly written the text, or the more unusual its punctuation, the more mistakes the algorithm will make. Nonetheless, performing NER allows you to quickly gain insight into the people, places, and organizations discussed in these messages. Additionally, I’m creating a master file of Civil War entities that corrects the algorithm’s most egregious errors. I’ve put the results in an interactive table below.
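The idea of correcting a tagger’s output with a master file can be sketched in base R as a simple lookup. Everything in the example is hypothetical: the mentions table stands in for NER output, the labels are placeholders, and the master vector is a two-entry stand-in for the author’s file, whose actual format is not described in the post.

```r
# Hypothetical NER output: one row per entity mention, with the tagger's label.
mentions <- data.frame(
  entity = c("Pleasonton", "Winchester", "Pleasonton", "Halleck", "Winchester"),
  type   = c("LOCATION", "LOCATION", "PERSON", "PERSON", "PERSON"),
  stringsAsFactors = FALSE
)

# Hypothetical master file: the authoritative type for known entities.
master <- c(Pleasonton = "PERSON", Winchester = "LOCATION")

# Overwrite the tagger's label wherever the master file has an entry.
known <- mentions$entity %in% names(master)
mentions$type[known] <- unname(master[mentions$entity[known]])

# Count mentions per corrected entity and type, dropping empty combinations.
entity_counts <- as.data.frame(
  table(entity = mentions$entity, type = mentions$type),
  stringsAsFactors = FALSE
)
entity_counts <- entity_counts[entity_counts$Freq > 0, ]
entity_counts[order(-entity_counts$Freq), ]
```

Entities missing from the master file, such as Halleck here, keep whatever label the tagger gave them, so the file only needs to list the cases the algorithm tends to get wrong.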

A method of extracting a high-level summary of the content of a text is to extract the direct objects (excluding pronouns) together with their action verbs. A direct object receives the action that the subject of the sentence performs. Take this sentence for example: “General Sedgwick crossed the river.” The direct object answers the question “whom” or “what”. Thus, Sedgwick crossed what? He crossed the river! Therefore our dependency algorithm would return “crossed -> river”. Let’s put this into practice.

Dependencies

To apply this method on our data, I will create a table of dependencies broken out by Hooker, Halleck, and Meade per message id. You can explore this table interactively, which should help you develop insights or create further inquiries.
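Given the output of a dependency parser, pulling out verb and direct-object pairs comes down to a filter and a lookup. The base R sketch below assumes a hand-written parse table for the example sentence “General Sedgwick crossed the river.”; the column names and labels are assumptions modeled on common dependency-parse tables, not the output of any specific parser used in the post.

```r
# Hypothetical parser output for "General Sedgwick crossed the river."
# Column names and labels are assumptions, written by hand for this sketch.
parsed <- data.frame(
  token_id      = 1:5,
  token         = c("General", "Sedgwick", "crossed", "the", "river"),
  pos           = c("PROPN", "PROPN", "VERB", "DET", "NOUN"),
  head_token_id = c(2, 3, 0, 5, 3),
  dep_rel       = c("compound", "nsubj", "ROOT", "det", "dobj"),
  stringsAsFactors = FALSE
)

# Keep direct objects that are not pronouns.
objs <- parsed[parsed$dep_rel == "dobj" & parsed$pos != "PRON", ]

# Look up the governing verb for each direct object via its head id.
verbs <- parsed$token[match(objs$head_token_id, parsed$token_id)]

dependencies <- paste(verbs, objs$token, sep = " -> ")
print(dependencies)
```

With real data, token ids restart in every sentence, so the lookup would need to be done per sentence (and per message id) rather than over one flat table.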

Using the dependencies along with word frequency, topics, and named entities should help you gain an understanding or at a minimum create questions or ideas for further research.

In the next section, we will look at plotting on a map the geographic locations extracted as part of the Named Entity Recognition exercise.

Test maps

In this final section, I have added a java-based map of the geographic locations extracted by the algorithm and cleaned with my master file, along with the number of times each was mentioned in the messages. Given the number of locations in the messages, I’ve plotted only those that occur at least 5 times. I’ve included two different map views. I obtained the latitude and longitude of the points via the Google Maps automatic geocoding interface. If you are interested in how that works, drop me a note.

In future iterations of the map, I will include entities over time, for instance with a radio button to compare entities by commander, message sender, or the like.

Summary

With this paper, I intended to apply common Natural Language Processing (NLP) tools to analyze the messages between the Union’s senior leadership during the Gettysburg Campaign from June 4 through August 3, 1863. The methods demonstrated above are the same used by thousands of organizations worldwide to mine their text data for insights. I think there are some future directions to consider:

I hope you found this interesting. If you have comments or questions, please feel free to contact me by email at datameister66@gmail.com.

Thank You.

The Author, July 4, 2018: