The purpose of this project is to use Natural Language Processing (NLP) to develop insights around the published communications between the commander of The Army of the Potomac and higher headquarters in Washington, D.C., during the Gettysburg Campaign.
The analysis focuses on five individuals and their communication with each other:
1. President Abraham Lincoln
2. Union General-in-Chief, Major General Henry Halleck
3. Secretary of War, Edwin Stanton
4. Commander, Army of the Potomac, Major General Joseph Hooker
5. Commander, Army of the Potomac, Major General George Meade
The time period covered is from June 4th, 1863 through August 3rd, 1863, which runs from the start of Lee’s campaign to Meade sending troops to quell the draft riots. I compiled the data into 229 separate messages. For simplicity of analysis and understanding, I attributed to Hooker a couple of messages sent by his Chief of Staff, Daniel Butterfield. The source for this analysis is “The War of the Rebellion: a compilation of the official records of the Union and Confederate armies”.
The volume I used was Chapter 39, Operations in North Carolina, Virginia, West Virginia, Maryland, Pennsylvania, and Department of the East; June 3rd-August 3rd, 1863. You can access the document via the Hathi Trust Digital Library at the following link.
https://babel.hathitrust.org/cgi/pt?num=1&u=1&seq=75&view=image&size=100&id=coo.31924077699761.com
[Image: Major General Hooker]
You can download the data from Google Sheets at https://docs.google.com/spreadsheets/d/1eEnBJ6tyAcvBLWddSvBhD7HBFqnR20UldqmYb7mMQ_Y/edit?usp=sharing. I performed this analysis using the open-source statistical software R, version 3.5.2.
Here are the column names:
## [1] "date" "recipient" "message" "sender" "id"
The column names correspond with the date of the message, for whom the message was intended, the message itself, who sent the message, and a numeric message identifier.
We start simple with a plot of the words occurring more than 35 times in the messages. Be advised that in this process, I’ve removed meaningless “stopwords” such as “a”, “and”, “the” etc.
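The counting step described above can be sketched in a few lines. The original analysis was done in R, so this Python version is only an illustration and not the author's code. The stopword list, the sample messages, and the `min_count` threshold are all assumptions made for the example.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "at", "on"}

def word_counts(messages, stopwords=STOPWORDS):
    """Count lowercase word tokens across messages, skipping stopwords."""
    counts = Counter()
    for text in messages:
        # Drop apostrophes first so "enemy's" becomes "enemys".
        cleaned = text.lower().replace("'", "").replace("’", "")
        tokens = re.findall(r"[a-z]+", cleaned)
        counts.update(t for t in tokens if t not in stopwords)
    return counts

# Hypothetical sample messages, not taken from the data set.
sample = [
    "The enemy cavalry is at the ford.",
    "Cavalry reports the enemy moving north.",
]
counts = word_counts(sample)

# The article keeps words seen more than 35 times; the toy data uses 1.
min_count = 1
frequent = {w: n for w, n in counts.items() if n > min_count}
print(frequent)
```

With the full set of messages, raising `min_count` to 35 would mirror the cutoff used for the plot.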
No surprise that the messages had a focus on the enemy. Note a couple of things. First, the high frequency of the word “cavalry”. This reflects the importance of using Pleasonton’s troopers to gather intelligence, the fights at Brandy Station, Aldie etc., and the anxiety caused by JEB Stuart and his forces. Hooker and Halleck were truly in the dark on Lee’s movements and intentions. The other thing is that in this initial analysis the focus is on each individual word, so Harper’s Ferry becomes “tokenized” as two separate words. I shall address that in the section on bigrams.
We can see how many messages are attributable to each of the leaders, starting when Hooker was in command.
| Sender | Total Messages |
|---|---|
| Hooker | 59 |
| Halleck | 27 |
| Lincoln | 13 |
| Stanton | 2 |
We can see that Hooker sent over half of the messages. The various senders and recipients are as follows:
| Sender | Dix | Halleck | Hooker | Lincoln | Stanton |
|---|---|---|---|---|---|
| Halleck | 0 | 0 | 27 | 0 | 0 |
| Hooker | 1 | 36 | 0 | 18 | 4 |
| Lincoln | 0 | 0 | 13 | 0 | 0 |
| Stanton | 0 | 0 | 2 | 0 | 0 |
The table has senders on the left and recipients on the top, so General Hooker sent 18 messages to Lincoln and received 13 from him. The data includes the message to General Dix who received only the one message from Hooker.
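A sender-by-recipient table like this one is a cross-tabulation of the `sender` and `recipient` columns. Here is a minimal Python sketch of the idea, assuming each message is reduced to a (sender, recipient) pair. The pairs are hypothetical and far fewer than in the real data.

```python
from collections import Counter

# Hypothetical (sender, recipient) pairs; the real data has one pair per message.
pairs = [
    ("Hooker", "Lincoln"),
    ("Hooker", "Halleck"),
    ("Lincoln", "Hooker"),
    ("Hooker", "Lincoln"),
]

# Each distinct pair becomes one cell of the table.
table = Counter(pairs)
senders = sorted({s for s, _ in pairs})
recipients = sorted({r for _, r in pairs})

# Print senders down the left and recipients across the top.
print("Sender".ljust(10) + "".join(r.ljust(10) for r in recipients))
for s in senders:
    row = "".join(str(table[(s, r)]).ljust(10) for r in recipients)
    print(s.ljust(10) + row)
```

A `Counter` returns 0 for pairs that never occur, which is what fills the empty cells.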
With Meade in command, we see a different dynamic in the messages: Lincoln is now a mere bystander, and his two messages are passed to Meade via Halleck. The mutual contempt between Hooker and Halleck led to Hooker’s eventual undoing, and his removal reestablished a more functional chain of command.
| Sender | Total Messages |
|---|---|
| Meade | 72 |
| Halleck | 53 |
| Lincoln/Halleck | 2 |
| French | 1 |
We now turn our attention to the top 10 words of Lincoln, Halleck, Hooker, and Meade.
Halleck was quite interested in Harper’s Ferry it seems, while Lincoln was focused on the towns of Martinsburg and Winchester in the Shenandoah Valley. Let’s drill down into this Harper’s Ferry matter.
An important task in understanding text documents is to look at words of interest, keywords if you will, in their context. Let’s take Harper’s Ferry for example, and see what we can derive about its importance. Since Halleck discussed it in a number of his messages, let’s look at those, focusing on messages sent during Hooker’s tenure in command. But first, some comments on the output below. The column “docname” corresponds to the message id number. You can disregard “from” and “to”. The “pre” column holds the ten words before the keyword, “keyword” is the keyword itself, and “post” holds the ten words that follow it. Additionally, I’ve turned Harper’s Ferry into one word, HarpersFerry, which helps to single it out as the entity that interests us.
| docname | from | to | pre | keyword | post |
|---|---|---|---|---|---|
| wor.csv.7 | 56 | 56 | injunction to keep in view the safety of Washington and | HarpersFerry | In regard to the contingency which you suppose may arise |
| wor.csv.7 | 175 | 175 | boards for the defenses of Washington Neither this capital nor | HarpersFerry | could long hold out against a large force They must |
| wor.csv.37 | 10 | 10 | No information of enemy in direction of Winchester and | HarpersFerry | as late as that from General Pleasonton The forces at |
| wor.csv.37 | 25 | 25 | from General Pleasonton The forces at Martinsburg are arriving at | HarpersFerry | |
| wor.csv.38 | 7 | 7 | Garrison of Martinsburg has arrived at | HarpersFerry | Milroy did not obey orders given on the 11th to |
| wor.csv.38 | 27 | 27 | to abandon Winchester and probably has or will be captured | HarpersFerry | ought to hold out some time Pleasontons telegrams to you |
| wor.csv.38 | 89 | 89 | Longstreet and Stuart are crossing the Potomac above and below | HarpersFerry | They certainly should be pursued The force used for that |
| wor.csv.48 | 117 | 117 | dark or on mere conjecture Tyler is in command at | HarpersFerry | with it is said only 9000 men but according to |
| wor.csv.49 | 11 | 11 | There is now no doubt that the enemy is surrounding | HarpersFerry | but in what force I have no information General Schenck |
| wor.csv.52 | 11 | 11 | Information of enemys actual position and force in front of | HarpersFerry | is as indefinite as that in your front Nearly everything |
| wor.csv.52 | 39 | 39 | enemy mentioned is Halltown The bridges across both rivers at | HarpersFerry | are believed to be intact and most of Tylers troops |
| wor.csv.55 | 12 | 12 | have given no directions for our army to move to | HarpersFerry | I have advised the movement of a force sufficiently strong |
| wor.csv.55 | 41 | 41 | the enemy is and then move to the relief of | HarpersFerry | or elsewhere as circumstances might require With the remainder of |
| wor.csv.55 | 115 | 115 | We have no positive information of any large force against | HarpersFerry | and it cannot be known whether it will be necessary |
| wor.csv.62 | 32 | 32 | given to you as received What is meant by abandoning | HarpersFerry | is merely that General Tyler has concentrated his force in |
| wor.csv.62 | 57 | 57 | Heights No enemy in any force has been seen below | HarpersFerry | north of the river and it is hoped that Tylers |
| wor.csv.63 | 17 | 17 | has informed you what is meant by the abandonment of | HarpersFerrya | mere change of position It changes in no respect the |
| wor.csv.85 | 13 | 13 | has been notified that the troops of his department in | HarpersFerry | and vicinity would obey all orders direct from you and |
This is interesting as Halleck demanded that Hooker not abandon Harper’s Ferry. This compels us to examine Hooker’s thoughts on the matter.
[Image: Harper’s Ferry, 1863]
| docname | from | to | pre | keyword | post |
|---|---|---|---|---|---|
| wor.csv.4 | 346 | 346 | keep in view always the importance of covering Washington and | HarpersFerry | either directly or by so operating as to be able |
| wor.csv.45 | 164 | 164 | that I may be informed what troops there are at | HarpersFerry | and who is in command of them and also who |
| wor.csv.46 | 110 | 110 | be thrown as speedily as possible across the river at | HarpersFerry | while another should be thrown over the most direct line |
| wor.csv.50 | 12 | 12 | received your telegram Please inform me whether our forces at | HarpersFerry | are in the town or on the heights and if |
| wor.csv.50 | 40 | 40 | or Maryland Heights and which if any what bridges at | HarpersFerry | and where from what direction is the enemy making his |
| wor.csv.51 | 13 | 13 | with your directions I shall march to the relief of | HarpersFerry | I put my column again in motion at 3 a |
| wor.csv.59 | 50 | 50 | successively that Martinsburg and Winchester were invested and surrounded that | HarpersFerry | was closely invested with urgent calls upon me for relief |
| wor.csv.59 | 101 | 101 | from one of his wagon trains that General Tyler at | HarpersFerry | whose urgent calls as represented to me required under my |
| wor.csv.59 | 134 | 134 | in no danger Telegraph operator just reports to me that | HarpersFerry | is abandoned by our forces Is this true Directions have |
| wor.csv.59 | 160 | 160 | to make a reconnaissance in the direction of Winchester and | HarpersFerry | for the purpose of ascertaining the whereabouts and strength of |
| wor.csv.61 | 6 | 6 | Advice of the abandonment of | HarpersFerry | renders forced marches unnecessary to relieve it This army will |
| wor.csv.84 | 104 | 104 | are ready General French is now on his way to | HarpersFerry | and I have given directions for the force at Poolesville |
| wor.csv.96 | 9 | 9 | General Hooker personally has just left here for | HarpersFerry | where he will be about 11 oclock Point of Rocks |
| wor.csv.96 | 37 | 37 | Copies of all dispatches should be sent to Frederick and | HarpersFerry | up to 11 a m and after that to Frederick |
| wor.csv.99 | 9 | 9 | I have received your telegram in regard to | HarpersFerry | I find 10000 men here in condition to take the |
| wor.csv.99 | 40 | 40 | defend a ford of the river and as far as | HarpersFerry | is concerned there is nothing of it As for the |
| wor.csv.100 | 8 | 8 | My original instructions require me to cover | HarpersFerry | and Washington I have now imposed upon me in addition |
The key textual insight into Hooker’s assessment of Harper’s Ferry is discernible from message identifier number 99 (wor.csv.99).
“I have received your telegram in regard to Harper’s Ferry. I find 10,000 men here, in condition to take the field. Here they are of no earthly account. They cannot defend a ford of the river, and, as far as Harper’s Ferry is concerned, there is nothing of it. As for the fortifications, the work of the troops, they remain when the troops are withdrawn. No enemy will ever take possession of them. This is my opinion. All the public property could have been secured tonight, and the troops marched to where they could have been of some service. Now they are but a bait for the rebels, should they return. I beg that this may be presented to the Secretary of War and His Excellency the President.”
As it turns out, the sparring over what to do about Harper’s Ferry compelled Hooker to submit his resignation. Here is the full text of that fateful message to Halleck, identifier 100.
“My original instructions require me to cover Harper’s Ferry and Washington. I have now imposed upon me, in addition, an enemy in my front of more than my number. I beg to be understood, respectfully, but firmly, that I am unable to comply with this condition with the means at my disposal, and earnestly request that I may at once be relieved from the position I occupy.”
Needless to say, Lincoln granted Hooker his request, ordering Major General Meade to take command.
Here’s another example of keywords in context where we see what Lincoln had to say about the Confederate Capital, Richmond. In this case we have the 15 pre and post tokens of the keyword.
| docname | from | to | pre | keyword | post |
|---|---|---|---|---|---|
| wor.csv.14 | 29 | 29 | would not go south of Rappahannock upon Lees moving north of it If you had | Richmond | invested today you would not be able to take it in twenty days meanwhile your |
| wor.csv.14 | 60 | 60 | communications and with them your army would be ruined I think Lees army and not | Richmond | is your sure objective point If he comes toward the Upper Potomac follow on his |
Hooker was of the opinion that, since Lee had left Richmond uncovered by moving away from Fredericksburg, he should advance on the Confederate capital. Lincoln expressed his disapproval of such a move and reiterated that the objective was Lee’s army, not Richmond, and that Hooker should maneuver accordingly. This is a clear indication of tension over identifying the operational objective.
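The keyword-in-context lookups in this section can be approximated with a short function. This is a minimal Python sketch rather than the R code behind the tables. The function name `kwic`, its return shape, and the `window` parameter are assumptions for illustration. The sample sentence is taken from the message 48 row of the Halleck table.

```python
def kwic(tokens, keyword, window=10):
    """Return (position, pre, keyword, post) for each keyword occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            # Up to `window` tokens on each side of the match.
            pre = " ".join(tokens[max(0, i - window):i])
            post = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((i + 1, pre, tok, post))  # 1-based position
    return hits

text = "Tyler is in command at Harper's Ferry with it is said only 9000 men"
# Merge the two-word place name into one token before splitting.
tokens = text.replace("Harper's Ferry", "HarpersFerry").split()
for hit in kwic(tokens, "HarpersFerry", window=3):
    print(hit)
```

Widening `window` to 15 would correspond to the longer context used for the Richmond example.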
Shifting gears, we will now look at the data from a bigram perspective. Previously, I processed the data (tokenized is the technical term) into individual words and analyzed it accordingly. With bigrams, the algorithm instead pairs up consecutive words. For example, the analysis will now include Harper’s Ferry as “Harpers_Ferry”.
[Image: Major General Halleck]
After removing stopwords and processing the messages into bigrams, this is an example of what the data looks like. You can explore the table interactively.
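To make the bigram step concrete, here is a small Python sketch. It assumes stopwords are dropped first and the remaining neighbors are then joined with an underscore. The exact order of operations in the original R workflow is not stated, so treat this as one plausible reading. The stopword list and sample sentence are illustrative.

```python
import re

STOPWORDS = {"a", "and", "the", "of", "to", "at", "is", "in"}  # illustrative only

def bigrams(text, stopwords=STOPWORDS):
    """Remove stopwords, then join each pair of adjacent remaining words."""
    # Strip apostrophes so "Harper's" becomes "Harpers".
    words = re.findall(r"[A-Za-z]+", text.replace("'", "").replace("’", ""))
    kept = [w for w in words if w.lower() not in stopwords]
    return [f"{a}_{b}" for a, b in zip(kept, kept[1:])]

sample = "The garrison at Harper's Ferry is in no danger"
print(bigrams(sample))
```

One consequence of removing stopwords first is that words which were not adjacent in the original sentence can end up paired, such as `garrison_Harpers` here.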
A way to visualize the bigram relationships is with a network graph. In the graph below, I’ve filtered for only those bigrams that occur more than five times.
When I first saw the graph, I had to scratch my head about “New York”, but after some exploration I discovered this was an important topic in the messages related to the draft riots.
So, one can use a network graph to identify areas of interest, and what are called “named entities” such as Chester Gap or General French for further research. We shall explore named entities in depth shortly, but in the next section we will extract themes from the messages.
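The data behind such a network graph is just a filtered edge list: each bigram that passes the frequency cutoff becomes an edge from its first word to its second. The Python sketch below illustrates that idea. The counts are hypothetical, and the function name `bigram_edges` is an assumption, not part of any plotting library.

```python
from collections import Counter, defaultdict

def bigram_edges(bigram_counts, min_count=5):
    """Keep bigrams seen more than min_count times as (first, second, count) edges."""
    edges = []
    for bigram, n in bigram_counts.items():
        if n > min_count:
            first, second = bigram.split("_", 1)
            edges.append((first, second, n))
    return edges

# Hypothetical counts for illustration; not the counts from the messages.
counts = Counter({"New_York": 8, "Harpers_Ferry": 30, "General_French": 6, "fifty_wagons": 1})
edges = bigram_edges(counts)

# Adjacency list: each word points to the words that follow it.
graph = defaultdict(list)
for first, second, n in edges:
    graph[first].append(second)
print(dict(graph))
```

A plotting library would then draw one node per word and one arrow per edge, often with line weight scaled by the count.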
This method of NLP is referred to as Topic Modeling. The various algorithms used in topic models provide a statistical method for uncovering latent “topics” in a body of text documents, which facilitates semantic understanding. One can apply the method to individual words or n-grams, even sentences. In this example, I will initially extract topics from the bigrams produced above in two parts, Hooker’s messages then Meade’s messages.
There are a number of methods to identify the optimal number of topics to incorporate into your algorithm. However, I still find it best to use your judgment in the matter, applying trial and error to select the proper number of topics.
As such, your opinion may differ, but 8 topics for General Hooker seems about right to me. After the algorithm creates topics and assigns one to each message, we can review the results. Here, we see a table of the topic number (“Var1”) and the frequency with which that topic occurs in the messages.
| Var1 | Freq |
|---|---|
| 1 | 10 |
| 2 | 8 |
| 3 | 8 |
| 4 | 6 |
| 5 | 9 |
| 6 | 5 |
| 7 | 5 |
| 8 | 5 |
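A frequency table of this kind can be produced by giving each message its single most probable topic and then counting. The Python sketch below assumes that is how the assignment is made, which the text does not spell out. The probabilities and message names are hypothetical.

```python
from collections import Counter

# Hypothetical per-message topic probabilities (each row sums to 1).
doc_topics = {
    "msg1": [0.70, 0.20, 0.10],
    "msg2": [0.05, 0.15, 0.80],
    "msg3": [0.60, 0.30, 0.10],
}

def dominant_topic(probs):
    """Return the 1-based index of the most probable topic."""
    return max(range(len(probs)), key=lambda k: probs[k]) + 1

# One topic per message, then a count of messages per topic.
assignments = {doc: dominant_topic(p) for doc, p in doc_topics.items()}
freq = Counter(assignments.values())
for topic in sorted(freq):
    print(topic, freq[topic])
```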
The first topic (Var1) is associated with 10 of Hooker’s messages. To start understanding what these topics are, we look at the following table of the first four topics, which provides a list of 15 bigrams associated with each topic.
| Topic.1 | Topic.2 | Topic.3 | Topic.4 |
|---|---|---|---|
| Blue_Ridge | Sulphur_Springs | Harpers_Ferry | North_Carolina |
| Mr_President | Alexandria_Railroad | Thoroughfare_Gap | MajorGeneral_Hancock |
| Second_Corps | AP_Hill | Eleventh_Corps | Fairfax_Station |
| considerable_force | Pleasonton_dated | fair_presumption | cavalry_raid |
| rebel_forces | Colonel_Duffies | majorgeneral_commanding | Trimbles_division |
| Brandy_Station | Gap_visited | Banks_Ford | General_Stahel |
| greatly_reinforced | position_strength | General_Schencks | General_Tyler |
| General_Dix | White_House | Generals_Schenck | Stuarts_force |
| James_River | conjecture_To | direction_shortly | night_Nothing |
| Your_dispatch | covering_Washington | Nothing_noteworthy | reported_enemy |
| 4_Nothing | Crossing_So | pickets_reported | Yesterday_Colonel |
| Duffies_pickets | In_view | After_giving | movement_similar |
| force_moved | Lees_army | army_Under | picket_line |
| Has_General | MajorGeneral_Dix | Fredericksburg_After | Richmond_This |
| JNO_BUFORD | reach_Warrenton | To_accomplish | Under_instructions |
We see that topic 1 contains messages to Lincoln, and topic 3 contains information about Harper’s Ferry. It is outside the scope of this document, but you can assign a topic number to the file that contains the messages, which will facilitate further analysis. Let’s examine the next four topics.
| Topic.5 | Topic.6 | Topic.7 | Topic.8 |
|---|---|---|---|
| General_Pleasonton | Harpers_Ferry | correct_information | Culpeper_CourtHouse |
| rebel_cavalry | Upper_Potomac | General_Halleck | His_Excellency |
| Hills_corps | BrigadierGeneral_Pleasonton | My_headquarters | AP_Hills |
| Ewells_corps | Hanover_Junction | enemys_camps | Your_telegram |
| General_Buford | night_Shall | General_Dixs | Sixth_Corps |
| Alexandria_Railroad | Dixs_force | visited_yesterday | cavalry_force |
| rebel_camps | enemy_crossing | Shenandoah_Valley | General_Slough |
| First_Third | June_4 | Hamiltons_Crossing | Buford_June |
| pm_received | General_Jones | January_31 | Shenandoah_Valley |
| tomorrow_morning | army_dated | return_Will | Lees_army |
| General_Couch | campsthe_enemy | This_morning | Yesterday_morning |
| Following_received | determine_satisfactorily | utmost_promptitude | Charleston_forces |
| force_moved | late_positions | camps_All | Their_cavalry |
| heard_JNO | additional_brigade | Longstreets_command | Tredwell_Moore |
| His_position | Heintzelmans_forces | Mr_President | MajorGeneral_Heintzelmans |
We see topic 6 includes further messages about Harper’s Ferry.
This brings us to the themes once Meade took command. I extracted 10 topics for him.
| Var1 | Freq |
|---|---|
| 1 | 17 |
| 2 | 5 |
| 3 | 4 |
| 4 | 7 |
| 5 | 5 |
| 6 | 9 |
| 7 | 8 |
| 8 | 4 |
| 9 | 6 |
| 10 | 7 |
Topic 1 for Meade is linked to 17 of his messages. As we did with Hooker, let’s have a look at the top 15 bigrams per topic.
| Topic.1 | Topic.2 | Topic.3 | Topic.4 | Topic.5 |
|---|---|---|---|---|
| New_York | General_Couch | Second_Corps | Harpers_Ferry | cavalry_force |
| General_Schurz | Ewells_corps | existing_circumstances | enemys_cavalry | Alexandria_Railroad |
| move_forward | General_Schenck | army_tonight | Sixth_Corps | Lees_army |
| General_W | Twelfth_Corps | Their_cavalry | First_Corps | My_movement |
| difficulties_encountered | positive_information | Our_cavalry | coup_de | F_Smith |
| information_respecting | A_brigade | White_Plains | Longstreets_corps | General_Buford |
| positive_intelligence | Cedar_Mountain | losses_sustained | Our_cavalry | procure_horses |
| personal_considerations | considerable_body | exact_condition | reliable_intelligence | Orange_CourtHouse |
| keeping_Washington | fifty_wagons | received_As | AP_Hills | main_body |
| strong_force | This_movement | hola_Maryland | General_Barksdales | enemy_south |
| Creek_Have | AP_Hill | Lees_army | Mountain_Pass | entire_army |
| giving_Lee | Armistead_Have | main_Reliable | corps_commanders | Lee_Colonel |
| Relay_Junction | Catholic_church | New_York | General_Spinola | Two_brigades |
| Schenck_increase | D_Roman | Corps_6 | enemys_position | Ferry_garrison |
| York_State | H_Protzman | move_tomorrow | de_main | 80000_They |
Note in topic 5 the point about procuring horses. After the battle, Meade expressed his concern about the need for horses for the cavalry. Prior to this paper, I’d never heard of General Spinola, who appears in topic 4. Turns out he took command of The Excelsior Brigade on July 11, 1863. He also was the first Italian-American elected to the House of Representatives. Now, the next topics.
| Topic.6 | Topic.7 | Topic.8 | Topic.9 | Topic.10 |
|---|---|---|---|---|
| Chester_Gap | Brigadier_General | General_Smith | South_Mountain | My_cavalry |
| reliable_intelligence | General_Couch | General_French | General_Naglee | Front_Royal |
| Falling_Waters | General_Lockwood | Eleventh_Corps | Major_General | Third_Corps |
| Maryland_Heights | My_army | strong_position | Manassas_Gap | honor_herewith |
| enemys_movement | Warrenton_Junction | accounts_agree | railroad_crossing | battleflags_captured |
| telegraphic_communication | Colonel_Lowell | Maryland_Heights | Northern_Central | Warrenton_Junction |
| momentous_consequences | Your_dispatch | dispatch_received | directed_General | Manassas_Gap |
| Aquia_Creek | AP_Hill | Mr_William | intelligence_leads | sufficient_force |
| Culpeper_CourtHouse | push_forward | Waynesborough_A | rear_passing | Cumberland_Valley |
| Upper_Rappahannock | Totally_unexpected | rear_guard | Hew_Market | reliable_information |
| de_main | Potomac_Please | railroad_bridge | movement_proposed | General_Pleasonton |
| sufficient_force | State_Militia | Susquehanna_keeping | 700_Dont | previously_reported |
| Poolesville_Do | army_passed | Dispatch_received | Confederate_money | Reserve_Artillery |
| received_Two | Lees_entire | Harpers_Ferry | driving_horses | force_So |
| Your_dispatch | Potomac_All | passing_yesterday | H_Groves | Seneca_Creek |
Topic 10 is interesting as it contains the messages where Meade enumerates the battleflags captured during the battle.
[Image: Major General Meade]
Below I plot the top bigrams per topic arranged by what is known as “beta”, which is just a fancy way of saying the per-topic-per-word probability. Here are the top 6 for each topic.
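Selecting the top terms by beta amounts to sorting each topic's terms by probability and keeping the first few. Here is a minimal Python sketch of that selection. The beta values are made up for illustration, and only the bigram names come from the tables above.

```python
# Hypothetical beta values: beta[topic][bigram] = P(bigram | topic).
beta = {
    1: {"Blue_Ridge": 0.04, "Mr_President": 0.03, "Second_Corps": 0.01},
    3: {"Harpers_Ferry": 0.06, "Thoroughfare_Gap": 0.02, "Banks_Ford": 0.05},
}

def top_terms(beta, n=6):
    """For each topic, return the n terms with the highest beta."""
    return {
        topic: sorted(terms, key=terms.get, reverse=True)[:n]
        for topic, terms in beta.items()
    }

print(top_terms(beta, n=2))
```

Setting `n=6` would match the number of bigrams shown per topic in the plot.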
This raises an interesting point in methodology. Within a message are a number of sub-topics. There are many methods to extract topics that are meaningful given the context of the documents. In my experience, this requires numerous iterations and subject matter expertise. However, it can help even an uninitiated researcher in identifying major themes and focusing their efforts.
The topics are insightful, but I shall endeavor to improve them with further experimentation using other methods and will update the results above if applicable. Our next task is to extract named entities such as people and places.
Named Entity Recognition (NER) is an NLP methodology to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, and locations. A researcher can use NER to discover entities of interest. As you explore the entity counts below, keep in mind that any NER algorithm will make a mistake or two, for example classifying a person such as General Pleasonton as a location, or vice versa. I’m of the opinion that the more convoluted or poorly written the text, or the more unusual its punctuation, the more mistakes the algorithm will make. Nonetheless, performing NER allows you to quickly gain insight into the people, places, and organizations discussed in these messages. Additionally, I’m creating a master file of Civil War entities that corrects the algorithm’s most egregious errors. I’ve put the results in an interactive table below.
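The master-file correction can be thought of as a lookup that overrides the algorithm's label whenever an entity is already known. The Python sketch below shows that idea only. The entity list, the label names, and the contents of `master` are hypothetical and do not reproduce the author's file or any particular NER tool's output.

```python
from collections import Counter

# Hypothetical NER output as (entity text, predicted label) pairs.
predicted = [
    ("General Pleasonton", "LOCATION"),
    ("Harpers Ferry", "LOCATION"),
    ("Chester Gap", "PERSON"),
    ("Harpers Ferry", "LOCATION"),
]

# Hypothetical master file of known entities and their correct labels.
master = {"General Pleasonton": "PERSON", "Chester Gap": "LOCATION"}

# Use the master label when one exists, otherwise keep the prediction.
corrected = [(text, master.get(text, label)) for text, label in predicted]
entity_counts = Counter(corrected)
print(entity_counts.most_common())
```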
A method of extracting a high-level summary of the content of a text is to extract the direct objects (excluding pronouns) and their action verbs. A direct object receives the action of the verb. Take this sentence for example, “General Sedgwick crossed the river.” The direct object answers the question “whom” or “what”. Thus, Sedgwick crossed what? He crossed the river! Therefore our dependency algorithm would return “crossed -> river”. Let’s put this into practice.
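Once a dependency parser has labeled each word, pulling out the verb and direct object pairs is a simple filter. The Python sketch below starts from hand-written parser output for the example sentence, so the parsing step itself is assumed. The row layout and the relation label `obj` follow Universal Dependencies conventions, and other parsers may label direct objects differently (for example `dobj`).

```python
# Hypothetical parser output: (index, word, part of speech, relation, head index).
# A real pipeline would get these rows from a dependency parser.
parsed = [
    (1, "General", "PROPN", "compound", 2),
    (2, "Sedgwick", "PROPN", "nsubj", 3),
    (3, "crossed", "VERB", "root", 0),
    (4, "the", "DET", "det", 5),
    (5, "river", "NOUN", "obj", 3),
]

def verb_object_pairs(rows):
    """Pair each non-pronoun direct object with the verb that governs it."""
    words = {idx: word for idx, word, _, _, _ in rows}
    return [
        f"{words[head]} -> {word}"
        for _, word, pos, rel, head in rows
        if rel == "obj" and pos != "PRON"
    ]

print(verb_object_pairs(parsed))
```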
To apply this method on our data, I will create a table of dependencies broken out by Hooker, Halleck, and Meade per message id. You can explore this table interactively, which should help you develop insights or create further inquiries.
Using the dependencies along with word frequency, topics, and named entities should help you gain an understanding or at a minimum create questions or ideas for further research.
In the next section, we will look at plotting on a map the geographic locations extracted as part of the Named Entity Recognition exercise.
In this final section, I have added a java-based map of the algorithm-extracted geographic locations, cleaned with my master file, along with the number of times each was mentioned in the messages. Given the number of locations in the messages, I’ve plotted only those that occur at least 5 times. I’ve included two different map views. I obtained the latitude and longitude of the points via the Google Maps automatic geocoding interface. If you are interested in how that works, drop me a note.
In future iterations of the map, I will include entities over time, for instance with a radio button to compare entities by commander, message sender, or the like.
With this paper, I intended to apply common Natural Language Processing (NLP) tools to analyze the messages between the Union’s senior leadership during the Gettysburg Campaign from June 4 through August 3, 1863. The methods demonstrated above are the same used by thousands of organizations worldwide to mine their text data for insights. I think there are some future directions to consider:
I hope you found this interesting. If you have comments or questions, please feel free to contact me by email at datameister66@gmail.com.
Thank You.
The Author, July 4, 2018: