Using Natural Language Processing to Extract Key Attributes from DPAA Legacy Case Files in Cambodia
Rebecca Barbanell, MS, Data Scientist
From Passion to Purpose
Transitioning from mathematics to data science led me to an incredible opportunity I never anticipated - an internship at the Ness of Brodgar in Orkney, Scotland. A professor was seeking someone to organize their XRF data and handle sample collection, and my background made me the right fit.
Ring of Brodgar
collecting floor samples in structure 12
There, I encountered an archaeological data challenge:
Vast amounts of unstructured data
Critical information scattered across reports and field notes
The need to transform raw information into structured, usable formats
This challenge became my driving purpose.
Bridging Past and Present
This passion led me to my role as a Research Analyst at the Center for Digital Antiquity and the Digital Archaeological Record (tDAR).
What is tDAR?
An online repository preserving digital archaeological records
Mission: extend our knowledge of the human past and improve cultural heritage management
Supports discovery, access, and reuse of archaeological data
My role: Apply modern analytical techniques to transform legacy case documents into coherent, quantifiable resources for DPAA.
The Challenge We Face
DPAA legacy case files contain valuable quantifiable data. However:
Hundreds of unstructured reports and documents
Critical information buried in narrative text
Manual analysis of the raw text can be time-consuming and inconsistent
import chardet  # detects file encoding
import re  # regex package (pattern finder)
import nltk  # Natural Language Toolkit for sentence tokenization
from nltk.tokenize import sent_tokenize

# Download the necessary NLTK data
nltk.download('punkt')

# Function to read a text file with automatic encoding detection
def read_text_file(file_path):
    with open(file_path, 'rb') as file:
        content = file.read()
    result = chardet.detect(content)
    detected_encoding = result['encoding']
    with open(file_path, 'r', encoding=detected_encoding) as file:
        return file.read()

# Define Regular Expression (regex) patterns
patterns = {
    "INCIDENT": r'[a-z]{3}-\d{4}-[a-z]|incident \d+',  # e.g. sea-####-r, incident ####
    "REFNO": r'refno\s\d+',  # refno ####
    "CASE": r'\bcase\s\d{4}\b',  # case ####
    "COUNTRY": r'\(?\b(kingdom of cambodia|koc|k\.o\.c)\b\)?',  # hard code Country of Loss
    "RANK": r'\b(pfc|lcpl|pvt|capt|1lt|wo1|cw2|sp4|sp5|cpt|ssg|ltjg|pilot|navigator|weapons officer|sar|door gunner)\b',  # hard code rank of service members
    "TYPE": r'\b(aircraft|capture|airboat|helicopter|ground loss|awol)\b',  # hard code loss incident type
    "AIRCRAFT": r'\b(a1e|uh-1f|uh-1b|uh-1h|f-4d|uh-1|ov-10|f-4e|oh-6a|ah-1g|f-100d|hh-53c|f-4|ch-53a)\b',  # hard code loss incident vehicle type
    "SITE": r'\((cb[-\s]?\d{5}|kh[-\s]?\d{5})\)',  # cb ####, kh ####
    "MISSION": r'\(?\d{2,4}-\d{1}[a-z]{1,2}\)?',  # e.g. ####-#cb, ##-#c, ##-#cb, (##-#c)
    "ACCESSION": r'\bcilhi\s\d{4}-\d{3}\b|\bcil\s\d{4}-\d{3}\b|\bcil-\d{4}-\d{3}\b',
    "REPORT": r'[a-z]{2}\d{2}-\d{4}',  # e.g., cs##-####
    "CORNER": r'\b[news][0-9]{3} [news][0-9]{3}\b',  # pattern for pairs
    "AREA_TERMS": r'\b(square meters|cubic meters|total|approximately \d+\.\d+ square meters|\d+\.\d+ cubic meters)[.,;:!?]?\b',  # Added pattern for area terms
    "YEAR_OF_LOSS": r'\b(0?[1-9]|[12][0-9]|3[01])\s(january|february|march|april|may|june|july|august|september|october|november|december)\s(196[0-9]|197[0-9]|1980)\b',
    "DATE_MISSION": r'(\d{1,2} [a-zA-Z]+) to (\d{1,2} [a-zA-Z]+ \d{4})'
}

# Load and process the data
file_path = "C:/Users/rbarbane/Desktop/DPPA Data/txt files/txt files/CASE_2003_ESR_08-2CB_042138Z_JUN_09.txt"
text_data = read_text_file(file_path)

# Convert text to lowercase
text_data = text_data.lower()

# Tokenize the text into sentences
sentences = sent_tokenize(text_data)

# Extract data based on patterns
extracted_data = {}
for key, pattern in patterns.items():
    if key == "AREA_TERMS":
        # Extract sentences containing area terms
        matching_sentences = [sentence for sentence in sentences if re.search(pattern, sentence, re.IGNORECASE)]
        extracted_data[key] = matching_sentences
    else:
        matches = re.findall(pattern, text_data)
        extracted_data[key] = list(set(matches))  # Remove duplicates by converting to a set and back to a list

# Print the extracted data
for key, matches in extracted_data.items():
    if key == "AREA_TERMS":
        print(f"{key}:")
        for sentence in matches:
            print(f"  {sentence}")
    else:
        print(f"{key}: {matches}")
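One detail worth checking in the script above: when a pattern contains parentheses, re.findall returns the captured groups rather than the whole match, so a pattern such as YEAR_OF_LOSS yields (day, month, year) tuples instead of the date string. The sketch below is only an illustration under stated assumptions: the two patterns are copied from the script, while the sample sentence and the helper name full_matches are hypothetical and not part of the original workflow.

```python
import re

# Two patterns copied from the script above.
patterns = {
    "CASE": r'\bcase\s\d{4}\b',
    "YEAR_OF_LOSS": r'\b(0?[1-9]|[12][0-9]|3[01])\s(january|february|march|april|may|june|july|august|september|october|november|december)\s(196[0-9]|197[0-9]|1980)\b',
}

# Invented sample text (already lowercased), not taken from any case file.
sample = "case 1234 concerns a loss on 5 may 1970; case 1234 was reopened."


def full_matches(pattern, text):
    """Hypothetical helper: unique full-match strings, in order of first appearance."""
    seen = []
    for m in re.finditer(pattern, text):
        if m.group(0) not in seen:
            seen.append(m.group(0))
    return seen


# findall returns one tuple of captured groups per match for this pattern.
groups_only = re.findall(patterns["YEAR_OF_LOSS"], sample)
# finditer with group(0) returns the whole matched date string instead.
whole_dates = full_matches(patterns["YEAR_OF_LOSS"], sample)
cases = full_matches(patterns["CASE"], sample)

print(groups_only)
print(whole_dates)
print(cases)
```

Whether tuples or whole strings are preferable depends on how the values are stored afterwards; the helper also keeps the order in which matches first appear, which a set does not.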
Extraction Methods
Key Attribute Names and Methods of Extraction
Incident Attribute | Extraction Method
Excavated Cubic Meters (Total) | manual/aggregated
Excavated Depth (Range) | manual
Excavated Square Meters Total (from multiple units) | manual
Screening Methods Used (Mesh Size) | manual
Units Used | manual
Accession Number Assigned (Presence/Absence) | aggregated
Accession Numbers Assigned (e.g., Evidence sent to lab) | NLP
Association/Correlation to another Incident | NLP
Count of Days | aggregated
Positive or Negative Identification post-excavation | aggregated
Country of Loss | NLP hard code
Date of Original Loss or Incident | NLP
Incident Number, Refno Number, Case Number | NLP
Incident Type (e.g., airplane, helicopter, ground/capture, n/a) | NLP hard code
Military Conflict Name | NLP hard code
Mission End Date | manual
Mission Start Date | manual
Number of Missions per the Incident | NLP
Osseous Remains Presence/Absence | manual
Rank(s) of Missing Personnel | NLP hard code
Recommended Site Close or remain open | manual
Search and Recovery Number | NLP
Service Branch | aggregated
Site Count | aggregated
Site Name | NLP/manual
Vehicle involved in loss incident | NLP hard code
Number of DPAA Team Members Total | manual
Number of Local Participants (Excavation Team) | manual
Number of witnesses interviewed during the mission (can be same reoccurring witness throughout the investigation) | manual
Electronic Case File document name | NLP
MGRS Coordinates found in Reports | NLP
Number of missing personnel per incident | manual
Material evidence terms Used in Reports | NLP
Land feature terms used in Reports | NLP
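The source does not spell out how the "aggregated" attributes are computed. One plausible reading is that they are derived from values already extracted or entered by hand, and the sketch below illustrates that reading only. The function name aggregate_attributes, the output keys, the inclusive day count, and all example values are assumptions made for illustration, not the method used on the DPAA files.

```python
from datetime import date


def aggregate_attributes(extracted, mission_start, mission_end):
    """Hypothetical helper: derive aggregated attributes from existing values.

    `extracted` maps pattern keys (e.g. "ACCESSION", "SITE") to lists of
    matched strings; the mission dates are assumed to be entered manually.
    """
    accessions = extracted.get("ACCESSION", [])
    sites = extracted.get("SITE", [])
    return {
        # Presence/absence: True if at least one accession number was found.
        "ACCESSION_ASSIGNED": len(accessions) > 0,
        # Number of distinct site identifiers.
        "SITE_COUNT": len(set(sites)),
        # Assumption: both the start and end dates count as mission days.
        "COUNT_OF_DAYS": (mission_end - mission_start).days + 1,
    }


# Invented example values, not taken from any case file.
example = {
    "ACCESSION": ["cil 2009-123"],
    "SITE": ["cb-00123", "cb-00123", "kh-00456"],
}
result = aggregate_attributes(example, date(2009, 5, 1), date(2009, 6, 4))
print(result)
```

Keeping the derivation in one function means every case file would be aggregated by the same rule, which speaks to the consistency concern raised for manual analysis.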
Key Attributes NLP
Incident Attributes
Extraction Method
Accession Numbers Assigned (e.g., Evidence sent to lab)
NLP
Association/Correlation to another Incident
NLP
Country of Loss
NLP hard code
Date of Original Loss or Incident
NLP
Incident Number, Refno Number, Case Number
NLP
Incident Type (e.g., airplane, helicopter, ground/capture, n/a)