From Reports to Insights:

Using Natural Language Processing to Extract Key Attributes from DPAA Legacy Case Files in Cambodia

Rebecca Barbanell, MS, Data Scientist

From Passion to Purpose

Transitioning from mathematics to data science led me to an incredible opportunity I never anticipated - an internship at the Ness of Brodgar in Orkney, Scotland. A professor was seeking someone to organize their XRF data and handle sample collection, and my background made me the right fit.

Ring of Brodgar

collecting floor samples in structure 12

There, I encountered an archaeological data challenge:

  • Vast amounts of unstructured data

  • Critical information scattered across reports and field notes

  • The need to transform raw information into structured, usable formats

This challenge became my driving purpose.

Bridging Past and Present

This passion led me to my role as a Research Analyst at the Center for Digital Antiquity and the Digital Archaeological Record (tDAR).

What is tDAR?

  • An online repository preserving digital archaeological records

  • Mission: extend our knowledge of the human past and improve cultural heritage management

  • Supports discovery, access, and reuse of archaeological data

tDAR School

My role: Apply modern analytical techniques to transform legacy case documents into coherent, quantifiable resources for DPAA.

The Challenge We Face

DPAA legacy case files contain valuable quantifiable data. However:

  • Hundreds of unstructured reports and documents

  • Critical information buried in narrative text

  • Manual analysis of the raw text can be time-consuming and inconsistent

  • Key insights remain inaccessible, preventing directed, high-level historical analysis

How can we unlock this wealth of historical data?

The Solution: Natural Language Processing

Using NLP to extract structured data from unstructured text

  • Transform narrative reports into searchable, analyzable data

  • Identify key patterns and attributes automatically

  • Scale analysis across thousands of documents

  • Maintain consistency and accuracy

Let’s see how this works in practice…

Key Terms for Using Regular Expressions (NLP)

What is a Regular Expression (regex)?

  • Regex is a tool that helps you search for and find specific patterns in text, making it easier to work with large amounts of information.

What is Tokenization?

  • Tokenization splits text into smaller units (tokens), such as individual words or sentences.

Combining Regex and Tokenization

  • Custom rules for splitting text.
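
The three terms above can be sketched with Python's built-in re module. The sample sentence and the token rule below are made up for illustration and are not taken from a DPAA case file.

```python
import re

# Hypothetical sample sentence, written only for this illustration
text = "the uh-1h helicopter crashed near site cb-00123."

# Regex: search for one specific pattern (here, a site code such as cb-00123)
site = re.search(r"cb-\d{5}", text)
print(site.group())

# Tokenization: split the text into individual words
words = re.findall(r"\w+", text)
print(words)

# Regex + tokenization: a custom rule that keeps hyphenated codes together
tokens = re.findall(r"\w+(?:-\w+)*", text)
print(tokens)
```

With plain word tokenization, "uh-1h" is split into "uh" and "1h". The custom rule keeps it as one token, which matters when aircraft types or site codes need to be matched whole.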

Key Terms for Using Regular Expressions (NLP)

What is the difference between hardcoding and coding?

Coding (General Programming):

  • Writing instructions for a computer to follow

  • Creating flexible, maintainable solutions

  • Using variables, functions, and dynamic values

Hardcoding

  • Embedding fixed values directly into your source code

  • Making values that should be changeable into permanent, unchangeable parts of your program
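
As a rough illustration of the difference, the sketch below puts a hardcoded pattern next to a pattern built from a list that can be changed without touching the pattern logic. The helper name build_pattern and the sample sentence are hypothetical and are not part of the project code.

```python
import re

# Hardcoded: the list of incident types is fixed inside the pattern itself
HARDCODED_TYPE = r"\b(aircraft|capture|airboat|helicopter|ground loss|awol)\b"


def build_pattern(terms):
    """Build the same kind of whole-word pattern from any list of terms (hypothetical helper)."""
    # re.escape protects characters such as "." or "-" that have a special meaning in regex
    alternatives = "|".join(re.escape(term) for term in terms)
    return rf"\b({alternatives})\b"


# Flexible: the terms live in a list that can be edited or loaded from a file
incident_types = ["aircraft", "capture", "airboat", "helicopter", "ground loss", "awol"]
flexible_type = build_pattern(incident_types)

sample = "the helicopter was reported as a ground loss."
hardcoded_matches = re.findall(HARDCODED_TYPE, sample)
flexible_matches = re.findall(flexible_type, sample)
print(hardcoded_matches)
print(flexible_matches)
```

Both approaches find the same terms. Hardcoding is quicker to write for a small, stable vocabulary, while the list-based version is easier to extend when new terms appear in the reports.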

Regular Expression (NLP) Hard Code

  • “TYPE”: r'\b(aircraft|capture|airboat|helicopter|ground loss|awol)\b'
    • “TYPE” is the key under which all the matches are stored
    • The r prefix marks a raw string, so the backslash is kept as a literal character and not read as an escape character. For example, without the r prefix, “\n” would be treated as a newline character
    • “\b” is a word boundary, ensuring that only whole-word matches are stored
    • Go to Code
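
The effect of the raw string prefix and the word boundary can be checked with a short sketch. The sample sentence is made up for illustration.

```python
import re

# Without the r prefix, Python turns "\b" into a single backspace character
plain_length = len("\b")   # 1 character: backspace
raw_length = len(r"\b")    # 2 characters: a backslash and the letter b
print(plain_length, raw_length)

pattern = r"\b(aircraft|capture|airboat|helicopter|ground loss|awol)\b"
sample = "the crew was captured after the aircraft went down."

# Whole words match; "captured" does not, because \b requires the word to end after "capture"
strict = re.findall(pattern, sample)
print(strict)

# Dropping the word boundaries lets the pattern match inside longer words
loose = re.findall(r"(aircraft|capture|airboat|helicopter|ground loss|awol)", sample)
print(loose)
```

The strict pattern returns only "aircraft", while the loose pattern also picks up "capture" from inside "captured".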

Code Run Through



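import chardet  # library that detects a file's character encoding
import re  # regular expression (regex) module for pattern matching
import nltk  # Natural Language Toolkit, used here for sentence tokenization
from nltk.tokenize import sent_tokenize

# Download the sentence tokenizer data that sent_tokenize needs
nltk.download('punkt')
# Recent NLTK releases look for 'punkt_tab' instead of 'punkt'
nltk.download('punkt_tab')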

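# Function to read a text file with automatic encoding detection
def read_text_file(file_path):
    # Read the raw bytes first so chardet can guess the encoding
    with open(file_path, 'rb') as file:
        content = file.read()
    result = chardet.detect(content)
    # chardet reports None when it cannot guess the encoding, so fall back to UTF-8
    detected_encoding = result['encoding'] or 'utf-8'
    # Reopen the file as text using the detected encoding
    with open(file_path, 'r', encoding=detected_encoding) as file:
        return file.read()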

# Define regular expression (regex) patterns; the text is lowercased before matching
patterns = {
    "INCIDENT": r'[a-z]{3}-\d{4}-[a-z]|incident \d+',  # e.g. sea-####-r, incident ####
    "REFNO": r'refno\s\d+',  # refno ####
    "CASE": r'\bcase\s\d{4}\b',  # case ####
    "COUNTRY": r'\(?\b(kingdom of cambodia|koc|k\.o\.c)\b\)?',  # hard-coded country of loss
    "RANK": r'\b(pfc|lcpl|pvt|capt|1lt|wo1|cw2|sp4|sp5|cpt|ssg|ltjg|pilot|navigator|weapons officer|sar|door gunner)\b',  # hard-coded ranks and crew roles of service members
    "TYPE": r'\b(aircraft|capture|airboat|helicopter|ground loss|awol)\b',  # hard-coded loss incident types
    "AIRCRAFT": r'\b(a1e|uh-1f|uh-1b|uh-1h|f-4d|uh-1|ov-10|f-4e|oh-6a|ah-1g|f-100d|hh-53c|f-4|ch-53a)\b',  # hard-coded loss incident vehicle types
    "SITE": r'\((cb[-\s]?\d{5}|kh[-\s]?\d{5})\)',  # (cb #####), (kh #####)
    "MISSION": r'\(?\d{2,4}-\d{1}[a-z]{1,2}\)?',  # e.g. ####-#cb, ##-#c, ##-#cb, (##-#c)
    "ACCESSION": r'\bcilhi\s\d{4}-\d{3}\b|\bcil\s\d{4}-\d{3}\b|\bcil-\d{4}-\d{3}\b',  # cilhi ####-###, cil ####-###, cil-####-###
    "REPORT": r'[a-z]{2}\d{2}-\d{4}',  # e.g. cs##-####
    "CORNER": r'\b[news][0-9]{3} [news][0-9]{3}\b',  # pairs of grid corners, e.g. n### e###
    "AREA_TERMS": r'\b(square meters|cubic meters|total|approximately \d+\.\d+ square meters|\d+\.\d+ cubic meters)[.,;:!?]?\b',  # area and volume terms
    "YEAR_OF_LOSS": r'\b(0?[1-9]|[12][0-9]|3[01])\s(january|february|march|april|may|june|july|august|september|october|november|december)\s(196[0-9]|197[0-9]|1980)\b',  # full date of loss, ## month ####, limited to 1960-1980
    "DATE_MISSION": r'(\d{1,2} [a-zA-Z]+) to (\d{1,2} [a-zA-Z]+ \d{4})'  # mission date range, ## month to ## month ####
}

# Load and process the data

file_path = "C:/Users/rbarbane/Desktop/DPPA Data/txt files/txt files/CASE_2003_ESR_08-2CB_042138Z_JUN_09.txt"   

text_data = read_text_file(file_path)

# Convert text to lowercase
text_data = text_data.lower()

# Tokenize the text into sentences
sentences = sent_tokenize(text_data)

# Extract data based on patterns
extracted_data = {}
for key, pattern in patterns.items():
    if key == "AREA_TERMS":
        # Extract sentences containing area terms
        matching_sentences = [sentence for sentence in sentences if re.search(pattern, sentence, re.IGNORECASE)]
        extracted_data[key] = matching_sentences
    else:
        matches = re.findall(pattern, text_data)
        extracted_data[key] = list(set(matches))  # Remove duplicates by converting to a set and back to a list

# Print the extracted data
for key, matches in extracted_data.items():
    if key == "AREA_TERMS":
        print(f"{key}:")
        for sentence in matches:
            print(f"  {sentence}")
    else:
        print(f"{key}: {matches}")
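
One behaviour of re.findall is worth keeping in mind when reading the printed output: when a pattern contains capturing groups, findall returns the groups and not the whole match. The sketch below shows this with the SITE and YEAR_OF_LOSS patterns from the code above. The sample sentence is made up, and the re.finditer variant is only one possible way to get the full matched text.

```python
import re

site_pattern = r'\((cb[-\s]?\d{5}|kh[-\s]?\d{5})\)'
date_pattern = (
    r'\b(0?[1-9]|[12][0-9]|3[01])\s'
    r'(january|february|march|april|may|june|july|august|september|october|november|december)\s'
    r'(196[0-9]|197[0-9]|1980)\b'
)

# Hypothetical sentence, already lowercased as in the pipeline above
sample = "the loss on 12 march 1970 was investigated at site (cb-00123)."

# One group: findall returns the group only, so the surrounding parentheses are dropped
sites = re.findall(site_pattern, sample)
print(sites)

# Three groups: findall returns one tuple per match (day, month, year)
date_parts = re.findall(date_pattern, sample)
print(date_parts)

# finditer with group(0) gives the whole matched text instead
full_dates = [match.group(0) for match in re.finditer(date_pattern, sample)]
print(full_dates)
```

This is why YEAR_OF_LOSS and DATE_MISSION appear as tuples in the extracted data, while patterns without groups, such as REFNO, appear as plain strings.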

Extraction Methods

Key Attribute Names and Methods of Extraction

| Incident Attribute | Extraction Method |
| --- | --- |
| Excavated Cubic Meters (Total) | manual/aggregated |
| Excavated Depth (Range) | manual |
| Excavated Square Meters Total (from multiple units) | manual |
| Screening Methods Used (Mesh Size) | manual |
| Units Used | manual |
| Accession Number Assigned (Presence/Absence) | aggregated |
| Accession Numbers Assigned (e.g., Evidence sent to lab) | NLP |
| Association/Correlation to another Incident | NLP |
| Count of Days | aggregated |
| Positive or Negative Identification post-excavation | aggregated |
| Country of Loss | NLP hard code |
| Date of Original Loss or Incident | NLP |
| Incident Number, Refno Number, Case Number | NLP |
| Incident Type (e.g., airplane, helicopter, ground/capture, n/a) | NLP hard code |
| Military Conflict Name | NLP hard code |
| Mission End Date | manual |
| Mission Start Date | manual |
| Number of Missions per the Incident | NLP |
| Osseous Remains Presence/Absence | manual |
| Rank(s) of Missing Personnel | NLP hard code |
| Recommended Site Close or Remain Open | manual |
| Search and Recovery Number | NLP |
| Service Branch | aggregated |
| Site Count | aggregated |
| Site Name | NLP/manual |
| Vehicle involved in loss incident | NLP hard code |
| Number of DPAA Team Members Total | manual |
| Number of Local Participants (Excavation Team) | manual |
| Number of witnesses interviewed during the mission (can be the same recurring witness throughout the investigation) | manual |
| Electronic Case File document name | NLP |
| MGRS Coordinates found in Reports | NLP |
| Number of missing personnel per incident | manual |
| Material evidence terms used in Reports | NLP |
| Land feature terms used in Reports | NLP |

Key Attributes NLP

| Incident Attribute | Extraction Method |
| --- | --- |
| Accession Numbers Assigned (e.g., Evidence sent to lab) | NLP |
| Association/Correlation to another Incident | NLP |
| Country of Loss | NLP hard code |
| Date of Original Loss or Incident | NLP |
| Incident Number, Refno Number, Case Number | NLP |
| Incident Type (e.g., airplane, helicopter, ground/capture, n/a) | NLP hard code |
| Military Conflict Name | NLP hard code |
| Number of Missions per the Incident | NLP |
| Rank(s) of Missing Personnel | NLP hard code |
| Search and Recovery Number | NLP |
| Vehicle involved in loss incident | NLP hard code |
| Electronic Case File document name | NLP |
| MGRS Coordinates found in Reports | NLP |
| Material evidence terms used in Reports | NLP |
| Land feature terms used in Reports | NLP |
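
MGRS coordinates are listed as an NLP attribute, but the pattern dictionary in the code run-through does not include an MGRS entry. The following is only a sketch of what such a pattern could look like, assuming the standard MGRS layout (grid zone, 100 km square letters, then an even number of digits) and lowercased text. How coordinates are actually written in the case files is an assumption, and the sample coordinate is made up.

```python
import re

# Hypothetical MGRS pattern, not taken from the project code:
#   \d{1,2}[c-hj-np-x]         grid zone designator (zone number + latitude band, no i or o)
#   [a-hj-np-z][a-hj-np-v]     100 km square identifier (two letters, no i or o)
#   digits                     easting and northing with equal length, optional space between
mgrs_pattern = (
    r'\b\d{1,2}[c-hj-np-x]\s?[a-hj-np-z][a-hj-np-v]\s?'
    r'(?:\d{5}\s?\d{5}|\d{4}\s?\d{4}|\d{3}\s?\d{3}|\d{2}\s?\d{2}|\d\s?\d)\b'
)

# Made-up sentences for illustration, lowercased as in the pipeline
spaced = "wreckage was recorded at 48p vt 12345 67890 near the river."
compact = "the team surveyed grid 48pvt123678 on the second day."
no_match = "the mission ran from 12 march to 30 march 2009."

spaced_hits = re.findall(mgrs_pattern, spaced)
compact_hits = re.findall(mgrs_pattern, compact)
other_hits = re.findall(mgrs_pattern, no_match)
print(spaced_hits, compact_hits, other_hits)
```

The group around the digit alternatives is non-capturing, so re.findall returns the whole coordinate string. A pattern like this would still need checking against real reports, since typists may abbreviate or space coordinates differently.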