From Reports to Insights:

Using Natural Language Processing to Extract Key Attributes from DPAA Legacy Case Files in Cambodia

Rebecca Barbanell, MS, Data Scientist

From Passion to Purpose

Transitioning from mathematics to data science led me to an opportunity I never anticipated: an internship at the Ness of Brodgar in Orkney, Scotland. A professor was seeking someone to organize their XRF data and handle sample collection, and my background made me the right fit.

Collecting floor samples in Structure 12

There, I encountered the archaeological data challenge:

Vast amounts of unstructured data
Critical information scattered across reports and field notes
The need to transform raw information into structured, usable formats

This challenge became my driving purpose.

Bridging Past and Present

This passion led me to my role as a Research Analyst at the Center for Digital Antiquity and the Digital Archaeological Record (tDAR).

What is tDAR?

An online repository preserving digital archaeological records
Mission: extend our knowledge of the human past and improve cultural heritage management
Supports discovery, access, and reuse of archaeological data

tDAR School

My role: Apply modern analytical techniques to transform legacy case documents into coherent, quantifiable resources for DPAA.

The Challenge We Face

DPAA legacy case files contain valuable quantifiable data. However:

Hundreds of unstructured reports and documents
Critical information is buried in narrative text
Manual analysis of the raw text is time-consuming and inconsistent
Key information remains inaccessible, preventing directed, high-level analysis and historical insight

How can we unlock this wealth of historical data?

The Solution: Natural Language Processing

Using NLP to extract structured data from unstructured text

Transform narrative reports into searchable, analyzable data
Identify key patterns and attributes automatically
Scale analysis across thousands of documents
Maintain consistency and accuracy

Let’s see how this works in practice…

Key Terms for Using Regular Expressions (NLP)

What is a Regular Expression (regex)?

Regex is a tool that helps you search for and find specific patterns in text, making it easier to work with large amounts of information.
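As a minimal sketch of the idea, the snippet below uses Python's built-in re module to pull case numbers out of a sentence. The sample sentence is invented for illustration; the pattern is the same CASE pattern used in the script later in this talk.

```python
import re

# Hypothetical sample text, invented for illustration
sample = "the team revisited case 2003 and case 1841 during the mission."

# \b = word boundary, \s = one whitespace character, \d{4} = exactly four digits
case_pattern = r'\bcase\s\d{4}\b'

case_matches = re.findall(case_pattern, sample)
print(case_matches)  # ['case 2003', 'case 1841']
```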

What is Tokenization?

Splits text into smaller units (tokens), such as individual words or sentences.

Combining Regex and Tokenization

Custom rules for splitting text.
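A rough sketch of what a custom splitting rule might look like, using only the standard re module. The sample text and the token pattern are assumptions made for illustration, not the rules used in the project. The point is that a plain whitespace split leaves punctuation glued to identifiers, while a regex rule can keep hyphenated identifiers whole and separate the punctuation.

```python
import re

# Hypothetical sample text, invented for illustration
text = "the uh-1h crashed near site (cb-00123). recovery began in 2009."

# Naive tokenization: split on whitespace; punctuation stays attached to words
naive_tokens = text.split()

# Custom rule (illustrative): keep hyphenated identifiers such as "uh-1h"
# together, and emit each punctuation mark as its own token
token_pattern = r'[a-z0-9]+(?:-[a-z0-9]+)*|[^\sa-z0-9]'
custom_tokens = re.findall(token_pattern, text)

print(naive_tokens)
print(custom_tokens)
```

With the naive split, the site identifier comes out as "(cb-00123)." with brackets and a period attached; with the custom rule it comes out as the clean token "cb-00123".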

Key Terms for Using Regular Expressions (NLP)

What is the difference between hardcoding and coding?

Coding (General Programming):

Writing instructions for a computer to follow
Creating flexible, maintainable solutions
Using variables, functions, and dynamic values

Hardcoding

Embedding fixed values directly into your source code
Making values that should be changeable into permanent, unchangeable parts of your program
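To make the contrast concrete, here is a small sketch. The first pattern is hardcoded in the same way as the TYPE pattern in this talk. The second builds an equivalent pattern from a list, so the vocabulary can change without editing the regex itself. The helper build_term_pattern is hypothetical and is not part of the original script.

```python
import re

# Hardcoded: the allowed values are typed directly into the pattern string
hardcoded_type = r'\b(aircraft|capture|airboat|helicopter|ground loss|awol)\b'

# More flexible: keep the vocabulary in a list and build the pattern from it.
# build_term_pattern is a hypothetical helper, not part of the original script.
def build_term_pattern(terms):
    # Longest terms first, so a longer term is tried before any shorter prefix of it
    ordered = sorted(terms, key=len, reverse=True)
    # re.escape protects characters such as "-" or "." inside a term
    return r'\b(' + '|'.join(re.escape(t) for t in ordered) + r')\b'

loss_types = ["aircraft", "capture", "airboat", "helicopter", "ground loss", "awol"]
type_pattern = build_term_pattern(loss_types)

# Hypothetical sample text, invented for illustration
sample = "the helicopter loss was later reclassified as a ground loss."
hardcoded_matches = re.findall(hardcoded_type, sample)
built_matches = re.findall(type_pattern, sample)
print(hardcoded_matches)
print(built_matches)
```

Both patterns find the same terms in this sample. The difference is maintenance: adding a new loss type means editing one list entry instead of the pattern string.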

Regular Expression (NLP) Hard Code

"TYPE": r'\b(aircraft|capture|airboat|helicopter|ground loss|awol)\b'
- "TYPE" is the key under which all matches are stored.
- The r prefix marks a raw string, so the backslash is treated as a literal character and not as an escape character. For example, without the prefix, "\n" would be treated as a newline character.
- "\b" is a word boundary, ensuring that only whole-word matches are stored.
- Go to Code
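The effect of the r prefix and of the word boundary can be checked with a short sketch. The sample sentence is invented for illustration. Without the r prefix, Python converts \b into a backspace character before the regex engine ever sees it, so the pattern silently stops matching.

```python
import re

# Hypothetical sample text, invented for illustration
sample = "an airboat and an aircraft were lost; aircrafts are not matched."

# Raw string: \b reaches the regex engine as a word boundary
raw_pattern = r'\b(aircraft|airboat)\b'
# Non-raw string: Python turns \b into a backspace character (\x08) first
non_raw_pattern = '\b(aircraft|airboat)\b'

raw_matches = re.findall(raw_pattern, sample)
non_raw_matches = re.findall(non_raw_pattern, sample)

print(raw_matches)      # ['airboat', 'aircraft']
print(non_raw_matches)  # []
```

Note that "aircrafts" is not matched by the raw pattern, because the trailing word boundary requires the word to end right after "aircraft".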

Code Run Through



import chardet  # library that detects file encoding
import re  # regex module (pattern finder)
import nltk  # Natural Language Toolkit for sentence tokenization
from nltk.tokenize import sent_tokenize

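# Download the necessary NLTK data for sentence tokenization
nltk.download('punkt')
# Newer NLTK releases look for the 'punkt_tab' resource instead; uncomment if sent_tokenize reports it missing
# nltk.download('punkt_tab')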

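# Function to read a text file with automatic encoding detection
def read_text_file(file_path):
    # Read the raw bytes so chardet can guess the encoding
    with open(file_path, 'rb') as file:
        content = file.read()
    result = chardet.detect(content)
    # chardet reports None when it cannot guess; fall back to UTF-8 in that case
    detected_encoding = result['encoding'] or 'utf-8'
    # Reopen in text mode using the detected encoding
    with open(file_path, 'r', encoding=detected_encoding) as file:
        return file.read()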

# Define Regular Expression (regex) patterns
patterns = {
    "INCIDENT": r'[a-z]{3}-\d{4}-[a-z]|incident \d+',  # e.g. sea-####-r, incident ####
    "REFNO": r'refno\s\d+',  # refno ####
    "CASE": r'\bcase\s\d{4}\b',  # case ####
    "COUNTRY": r'\(?\b(kingdom of cambodia|koc|k\.o\.c)\b\)?', #hard code Country of Loss
    "RANK": r'\b(pfc|lcpl|pvt|capt|1lt|wo1|cw2|sp4|sp5|cpt|ssg|ltjg|pilot|navigator|weapons officer|sar|door gunner)\b', # hard code rank of service members 
    "TYPE": r'\b(aircraft|capture|airboat|helicopter|ground loss|awol)\b',  # hard code loss incident type 
    "AIRCRAFT": r'\b(a1e|uh-1f|uh-1b|uh-1h|f-4d|uh-1|ov-10|f-4e|oh-6a|ah-1g|f-100d|hh-53c|f-4|ch-53a)\b',  # hard code loss incident vehicle type
    "SITE": r'\((cb[-\s]?\d{5}|kh[-\s]?\d{5})\)', # cb ####, kh ####
    "MISSION": r'\(?\d{2,4}-\d{1}[a-z]{1,2}\)?',  # e.g. ####-#cb, ##-#c, ##-#cb, (##-#c)
    "ACCESSION": r'\bcilhi\s\d{4}-\d{3}\b|\bcil\s\d{4}-\d{3}\b|\bcil-\d{4}-\d{3}\b',
    "REPORT": r'[a-z]{2}\d{2}-\d{4}',  # e.g., cs##-####
    "CORNER": r'\b[news][0-9]{3} [news][0-9]{3}\b',  # pattern for pairs
    "AREA_TERMS": r'\b(square meters|cubic meters|total|approximately \d+\.\d+ square meters|\d+\.\d+ cubic meters)[.,;:!?]?\b',  # Added pattern for area terms
    "YEAR_OF_LOSS": r'\b(0?[1-9]|[12][0-9]|3[01])\s(january|february|march|april|may|june|july|august|september|october|november|december)\s(196[0-9]|197[0-9]|1980)\b',
    "DATE_MISSION": r'(\d{1,2} [a-zA-Z]+) to (\d{1,2} [a-zA-Z]+ \d{4})'
}

# Load and process the data

file_path = "C:/Users/rbarbane/Desktop/DPPA Data/txt files/txt files/CASE_2003_ESR_08-2CB_042138Z_JUN_09.txt"   

text_data = read_text_file(file_path)

# Convert text to lowercase
text_data = text_data.lower()

# Tokenize the text into sentences
sentences = sent_tokenize(text_data)

# Extract data based on patterns
extracted_data = {}
for key, pattern in patterns.items():
    if key == "AREA_TERMS":
        # Extract sentences containing area terms
        matching_sentences = [sentence for sentence in sentences if re.search(pattern, sentence, re.IGNORECASE)]
        extracted_data[key] = matching_sentences
    else:
        # Note: when a pattern has capture groups, findall returns the group(s), not the full match
        matches = re.findall(pattern, text_data)
        extracted_data[key] = sorted(set(matches))  # Remove duplicates and sort for reproducible output

# Print the extracted data
for key, matches in extracted_data.items():
    if key == "AREA_TERMS":
        print(f"{key}:")
        for sentence in matches:
            print(f"  {sentence}")
    else:
        print(f"{key}: {matches}")
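One behavior of re.findall is worth keeping in mind when reading the output of the script above: when a pattern contains capture groups, findall returns the captured groups and not the full matched text. The sketch below illustrates this with a shortened date pattern and an invented sentence; it is an illustration of the re module's behavior, not a change to the original script.

```python
import re

# Hypothetical sample text, invented for illustration
sample = "the loss occurred on 12 may 1970 near the border."

# Shortened, illustrative date pattern with three capture groups
date_pattern = r'\b(\d{1,2})\s(may|june)\s(19\d{2})\b'

# With several capture groups, findall returns one tuple of groups per match
group_tuples = re.findall(date_pattern, sample)

# finditer plus group(0) returns the full matched text instead
full_matches = [m.group(0) for m in re.finditer(date_pattern, sample)]

print(group_tuples)  # [('12', 'may', '1970')]
print(full_matches)  # ['12 may 1970']
```

So a pattern such as YEAR_OF_LOSS yields (day, month, year) tuples, while a pattern such as SITE yields only the text inside its parentheses group.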

Extraction Methods

Key Attributes Names and Methods of Extraction
Incident Attributes	Extraction Method
Excavated Cubic Meters (Total)	manual/aggregated
Excavated Depth (Range)	manual
Excavated Square Meters Total (from multiple units)	manual
Screening Methods Used (Mesh Size)	manual
Units Used	manual
Accession Number Assigned (Presence/Absence)	aggregated
Accession Numbers Assigned (e.g., Evidence sent to lab)	NLP
Association/Correlation to another Incident	NLP
Count of Days	aggregated
Positive or Negative Identification post-excavation	aggregated
Country of Loss	NLP hard code
Date of Original Loss or Incident	NLP
Incident Number, Refno Number, Case Number	NLP
Incident Type (e.g., airplane, helicopter, ground/capture, n/a)	NLP hard code
Military Conflict Name	NLP hard code
Mission End Date	manual
Mission Start Date	manual
Number of Missions per the Incident	NLP
Osseous Remain Presence/Absence	manual
Rank(s) of Missing Personnel	NLP hard code
Recommended Site Close or remain open	manual
Search and Recovery Number	NLP
Service Branch	aggregated
Site Count	aggregated
Site Name	NLP/manual
Vehicle involved in loss incident	NLP hard code
Number of DPAA Team Members Total	manual
Number of Local Participants (Excavation Team)	manual
Number of witnesses interviewed during the mission (can be same reoccurring witness throughout the investigation)	manual
Electronic Case File document name	NLP
MGRS Coordinates found in Reports	NLP
Number of missing personnel per incident	manual
Material evidence terms Used in Reports	NLP
Land feature terms used in Reports	NLP

Key Attributes NLP
Incident Attributes	Extraction Method
Accession Numbers Assigned (e.g., Evidence sent to lab)	NLP
Association/Correlation to another Incident	NLP
Country of Loss	NLP hard code
Date of Original Loss or Incident	NLP
Incident Number, Refno Number, Case Number	NLP
Incident Type (e.g., airplane, helicopter, ground/capture, n/a)	NLP hard code
Military Conflict Name	NLP hard code
Number of Missions per the Incident	NLP
Rank(s) of Missing Personnel	NLP hard code
Search and Recovery Number	NLP
Vehicle involved in loss incident	NLP hard code
Electronic Case File document name	NLP
MGRS Coordinates found in Reports	NLP
Material evidence terms Used in Reports	NLP
Land feature terms used in Reports	NLP