This notebook presents a concise exploratory data analysis of three NamUs datasets (Missing Persons, Unclaimed Persons, and Unidentified Persons), with the goal of understanding their structure, distributions, and analytic affordances as they relate to visualization.
Following a bit of data cleaning, the analysis uses a small, reusable explore_dataset() function to perform a consistent, single-variable review of each dataset. For every feature, the explorer summarizes type, missingness, value ranges, and common categories. Each dataset is then accompanied by a short written interpretation that distills these observations and notes implications for visualization design.
The notebook is organized to begin with a geographic overview of all three datasets combined, using county-level prominence to establish the national spatial distribution and relative concentration of cases. From there, each dataset is examined individually, with attention to how differences in temporal coverage, demographic certainty, and geographic resolution suggest different—and potentially complementary—visualization strategies.
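The source of explore_dataset() is not shown in this notebook. As a rough illustration of the single-variable review described above, a minimal base R sketch might look like the following. The name `explore_dataset_sketch`, the `top_n` argument, and the toy data frame are assumptions made for illustration, not the notebook's actual implementation.

```r
# Hypothetical sketch of a single-variable explorer (not the notebook's own code).
explore_dataset_sketch <- function(df, top_n = 5) {
  cat("Rows:", nrow(df), "\n")
  cat("Columns:", ncol(df), "\n")
  for (name in names(df)) {
    x <- df[[name]]
    n_missing <- sum(is.na(x))
    pct_missing <- round(100 * n_missing / length(x), 1)
    cat(strrep("-", 34), "\n", sep = "")
    cat("Variable:", name, "\n")
    cat("Type:", class(x)[1], "\n")
    cat("Missing: ", n_missing, " (", pct_missing, "%)\n", sep = "")
    if (inherits(x, "Date")) {
      # Dates: report the observed range
      cat("Min:", format(min(x, na.rm = TRUE)), "\n")
      cat("Max:", format(max(x, na.rm = TRUE)), "\n")
    } else if (is.numeric(x)) {
      # Numbers: five-number summary plus mean
      print(summary(x))
    } else {
      # Everything else: cardinality and the most common values
      cat("Unique values (non-missing):", length(unique(x[!is.na(x)])), "\n")
      print(head(sort(table(x), decreasing = TRUE), top_n))
    }
  }
  invisible(df)
}

# Toy data frame reusing a few column names from the missing persons dataset
toy <- data.frame(
  case_number = c("MP1", "MP2", "MP3"),
  missing_age = c(21, NA, 46),
  dlc = as.Date(c("1902-01-01", "2020-05-17", "2026-01-28")),
  stringsAsFactors = FALSE
)
explore_dataset_sketch(toy)
```

Note that `is.na()` only catches true `NA` values, so empty strings in character columns would not be counted as missing by a function written this way.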
## let's explore the missing persons dataset
explore_dataset(missing_persons)
## ==============================
## Dataset overview: missing_persons
## ==============================
## Rows: 26182
## Columns: 15
## Column types:
## case_number dlc legal_last_name legal_first_name
## "character" "Date" "character" "character"
## missing_age city county state
## "numeric" "character" "character" "character"
## biological_sex race_ethnicity date_modified intptlat_city
## "character" "character" "Date" "numeric"
## intptlong_city intptlat_county intptlong_county
## "numeric" "numeric" "numeric"
##
## ----------------------------------
## Variable: case_number
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 26182
## Most common values:
## .
## MP1 MP10 MP100 MP10000 MP100005
## 1 1 1 1 1
## ----------------------------------
## Variable: dlc
## Type: Date
## Missing: 0 (0%)
## Date range:
## Min: 1902-01-01
## Max: 2026-01-28
## ----------------------------------
## Variable: legal_last_name
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 14222
## Most common values:
## .
## Smith Johnson Williams Jones Brown
## 217 183 159 138 124
## ----------------------------------
## Variable: legal_first_name
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 6460
## Most common values:
## .
## Michael John Robert James David
## 437 419 416 395 361
## ----------------------------------
## Variable: missing_age
## Type: numeric
## Missing: 0 (0%)
## Numeric summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 21.00 32.00 34.85 46.00 116.00
## ----------------------------------
## Variable: city
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 6059
## Most common values:
## .
## Houston Los Angeles Dallas San Francisco Memphis
## 498 480 444 260 226
## ----------------------------------
## Variable: county
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 1414
## Most common values:
## .
## Los Angeles Harris Dallas Maricopa Orange
## 995 621 545 395 376
## ----------------------------------
## Variable: state
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 55
## Most common values:
## .
## CA TX FL AK AZ
## 3771 2884 2378 1297 1087
## ----------------------------------
## Variable: biological_sex
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 2
## Most common values:
## .
## Male Female
## 16557 9625
## ----------------------------------
## Variable: race_ethnicity
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 78
## Most common values:
## .
## White / Caucasian Black / African American
## 14563 4368
## Hispanic / Latino White / Caucasian, Hispanic / Latino
## 3670 1118
## American Indian / Alaska Native
## 824
## ----------------------------------
## Variable: date_modified
## Type: Date
## Missing: 0 (0%)
## Date range:
## Min: 2014-01-13
## Max: 2026-02-01
## ----------------------------------
## Variable: intptlat_city
## Type: numeric
## Missing: 1787 (6.8%)
## Numeric summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 13.38 32.79 36.17 37.30 40.85 70.65 1787
## ----------------------------------
## Variable: intptlong_city
## Type: numeric
## Missing: 1787 (6.8%)
## Numeric summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -176.60 -116.99 -95.08 -99.09 -82.40 145.78 1787
## ----------------------------------
## Variable: intptlat_county
## Type: numeric
## Missing: 201 (0.8%)
## Numeric summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 13.38 32.77 36.17 37.19 40.92 69.45 201
## ----------------------------------
## Variable: intptlong_county
## Type: numeric
## Missing: 201 (0.8%)
## Numeric summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -164.19 -116.18 -94.42 -98.01 -81.98 179.62 201
This dataset contains 26,182 records representing missing persons cases, with 15 variables spanning identifiers, demographics, temporal fields, and geographic information. The analysis below summarizes key distributional patterns, coverage, and structural characteristics observed during an initial univariate exploration.
The case_number field uniquely identifies each record, with a one-to-one correspondence between rows and case numbers. This confirms its role as a stable identifier suitable for record tracking and joins.
The legal_first_name and legal_last_name fields exhibit very high cardinality, with over 6,400 unique first names and 14,200 unique last names. Common names such as Michael, John, Smith, and Johnson appear frequently, reflecting expected population distributions.
The dlc (date last contact) variable spans a wide historical range, from 1902 through early 2026, indicating the presence of both historical cold cases and very recent records. This breadth supports long-term trend analysis but also suggests the importance of stratifying by era to avoid conflating cases from substantially different reporting contexts.
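One way to stratify by era is to bin the year of dlc into a small number of labeled periods. The sketch below uses toy dates, and the break points are illustrative assumptions rather than boundaries taken from the notebook.

```r
# Toy dates standing in for the dlc column (not real records)
dlc <- as.Date(c("1902-01-01", "1975-06-30", "1999-12-31",
                 "2010-03-15", "2026-01-28"))
dlc_year <- as.integer(format(dlc, "%Y"))

# Assumed era boundaries; cut() uses right-closed intervals by default,
# so 1999 falls in "1980-1999" and 2010 falls in "2010+"
era <- cut(
  dlc_year,
  breaks = c(-Inf, 1979, 1999, 2009, Inf),
  labels = c("pre-1980", "1980-1999", "2000-2009", "2010+")
)

era_counts <- table(era)
print(era_counts)
```

Keeping the era as a factor means empty periods still appear with a count of zero, which is useful when the same bins are reused across all three datasets.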
The date_modified variable is concentrated in the modern period (2014–2026) and likely reflects system update activity rather than event timing. It is best interpreted as metadata useful for assessing data recency or workflow processes.
The missing_age variable ranges from 1 to 116 years, with a median age of 32 and a mean of approximately 35. The interquartile range of 21 to 46 indicates a predominance of adult cases, with about a quarter of records involving people aged 21 or younger. Values at the extreme upper end (the maximum is 116) warrant follow-up validation, particularly before producing age-stratified summaries.
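A simple validation pass can flag implausibly high ages for review without dropping them, and assign age bands for stratified summaries. In this sketch the ages are toy values, and both the review threshold and the band boundaries are assumptions chosen for illustration.

```r
# Toy ages within the range reported for missing_age
ages <- c(1, 21, 32, 46, 104, 116)

# Assumed review threshold: flag, but do not remove, very high ages
max_plausible_age <- 100
needs_review <- ages > max_plausible_age

# Assumed age bands: [0, 17], (17, 64], (64, Inf)
age_band <- cut(
  ages,
  breaks = c(0, 17, 64, Inf),
  labels = c("under 18", "18-64", "65+"),
  include.lowest = TRUE
)

data.frame(ages, age_band, needs_review)
```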
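Geographic information is provided at the city, county, and state levels. The dataset spans 55 distinct state or territory codes, with the highest case counts observed in California (3,771), Texas (2,884), and Florida (2,378), consistent with population size and reporting volume. Alaska ranks fourth (1,297) despite its comparatively small population, an early sign that raw counts do not track population alone.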
The biological_sex variable contains two categories, Male and Female, with males comprising approximately 63% of cases (16,557 of 26,182). With no missing or uncertain values, the field supports straightforward stratified descriptive analysis.
The race_ethnicity field includes 78 distinct values, encompassing both single-category and multi-category identities. While major groups such as White / Caucasian and Black / African American account for most records, the long-tail structure indicates that this variable encodes complex demographic information that would benefit from normalization prior to comparative analysis.
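One possible normalization, assuming multi-category values are comma-separated as in "White / Caucasian, Hispanic / Latino", is to split each value into its component categories. The sketch below derives both a collapsed grouping and a long-format table; the variable names `race_group` and `race_long` are hypothetical.

```r
# Toy values using category labels that appear in the output above
race_ethnicity <- c(
  "White / Caucasian",
  "Black / African American",
  "White / Caucasian, Hispanic / Latino",
  "Hispanic / Latino",
  "American Indian / Alaska Native"
)

# Assumption: a comma separates categories within one value
parts <- strsplit(race_ethnicity, ",\\s*")
n_categories <- lengths(parts)

# Option 1: collapse every multi-category value into a single group
race_group <- ifelse(n_categories > 1, "Multiple categories", race_ethnicity)

# Option 2: long format, one row per (record, category) pair
race_long <- data.frame(
  record = rep(seq_along(parts), n_categories),
  category = trimws(unlist(parts)),
  stringsAsFactors = FALSE
)

table(race_group)
table(race_long$category)
```

The collapsed grouping keeps one row per record, which suits bar charts of shares. The long format counts a record once per category, so its totals exceed the number of records.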
Latitude and longitude values are available at both city and county levels. County-level coordinates are nearly complete, while city-level coordinates exhibit modest missingness (approximately 7%). Coordinate ranges extend beyond the continental United States, consistent with inclusion of U.S. territories and edge-case records. County-level coordinates appear more stable and may serve as a preferred default for spatial aggregation and mapping.
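If county-level coordinates are used as the default, a fallback to city-level coordinates can recover records where only the city was matched. The following sketch uses toy coordinates; the derived columns `map_lat`, `map_long`, and `coord_source` are hypothetical names.

```r
# Toy rows: both levels present, county only, city only, neither
cases <- data.frame(
  intptlat_city    = c(29.8, NA, 61.2, NA),
  intptlong_city   = c(-95.4, NA, -149.9, NA),
  intptlat_county  = c(29.9, 34.2, NA, NA),
  intptlong_county = c(-95.4, -118.3, NA, NA)
)

use_county <- !is.na(cases$intptlat_county) & !is.na(cases$intptlong_county)
use_city <- !use_county &
  !is.na(cases$intptlat_city) & !is.na(cases$intptlong_city)

# Prefer county coordinates; fall back to city; otherwise leave NA
cases$map_lat <- ifelse(use_county, cases$intptlat_county,
                        ifelse(use_city, cases$intptlat_city, NA_real_))
cases$map_long <- ifelse(use_county, cases$intptlong_county,
                         ifelse(use_city, cases$intptlong_city, NA_real_))

# Record which level was used so maps can disclose mixed resolution
cases$coord_source <- ifelse(use_county, "county",
                             ifelse(use_city, "city", NA_character_))
print(cases)
```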
Overall, the dataset demonstrates broad coverage across time, geography, and demographics, with distributions that are consistent with expectations for a national missing persons dataset. Key analytical considerations emerging from this EDA include handling historical records, normalizing high-cardinality categorical variables, and applying thoughtful validation to extreme values. With these considerations addressed, the dataset is well-positioned to support descriptive analysis, trend reporting, and geographic summarization.
Beyond a queryable table, this dataset naturally supports interactive visualizations that emphasize time, geography, and demographic composition. The wide temporal span of the dlc field lends itself to interactive timelines or time-series views, while the availability of city- and county-level coordinates enables map-based exploration at multiple geographic resolutions. Demographic variables such as age, sex, and race/ethnicity are well suited to interactive distributions and comparative summaries, particularly when combined with filtering and brushing. Coordinated views—linking maps, timelines, and summary charts—would allow users to explore patterns dynamically and drill down from aggregate trends to individual cases, making the dataset especially well suited for exploratory and narrative-driven data visualization.
For a data visualization audience, it will likely be useful to present geographic prominence normalized by population (per capita), as large metropolitan areas dominate raw counts and population-adjusted views can better highlight relative concentration across regions.
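A per-capita view requires joining case counts to a population table, which is not part of the datasets explored here. The sketch below assumes such a table exists and uses placeholder county identifiers with made-up counts and populations, purely to show the calculation.

```r
# Placeholder counts per county (made-up values, not NamUs figures)
case_counts <- data.frame(
  county_id = c("A", "B", "C"),
  cases = c(900, 120, 30)
)

# Hypothetical population lookup, e.g. from a census extract
population <- data.frame(
  county_id = c("A", "B", "C"),
  population = c(9000000, 400000, 50000)
)

rates <- merge(case_counts, population, by = "county_id")
rates$per_100k <- 1e5 * rates$cases / rates$population

# Rank by rate: the smallest county leads despite the lowest raw count
rates <- rates[order(-rates$per_100k), ]
print(rates)
```

Rates for very small populations are unstable, so a minimum population or case-count filter may be worth considering before mapping.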
## let's explore the unclaimed persons dataset
explore_dataset(unclaimed_persons)
## ==============================
## Dataset overview: unclaimed_persons
## ==============================
## Rows: 21891
## Columns: 14
## Column types:
## case_number dbf legal_last_name legal_first_name
## "character" "Date" "character" "character"
## biological_sex race_ethnicity city county
## "character" "character" "character" "character"
## state date_modified intptlat_city intptlong_city
## "character" "Date" "numeric" "numeric"
## intptlat_county intptlong_county
## "numeric" "numeric"
##
## ----------------------------------
## Variable: case_number
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 21891
## Most common values:
## .
## UCP100 UCP1000 UCP100000 UCP100001 UCP100002
## 1 1 1 1 1
## ----------------------------------
## Variable: dbf
## Type: Date
## Missing: 509 (2.3%)
## Date range:
## Min: 1939-12-16
## Max: 2026-01-30
## ----------------------------------
## Variable: legal_last_name
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 10957
## Most common values:
## .
## Smith Johnson Williams Brown Jones
## 221 181 177 157 153
## ----------------------------------
## Variable: legal_first_name
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 4263
## Most common values:
## .
## Robert James John Michael William
## 628 598 524 482 449
## ----------------------------------
## Variable: biological_sex
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 4
## Most common values:
## .
## Male Female Unsure
## 16545 4322 1022 2
## ----------------------------------
## Variable: race_ethnicity
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 55
## Most common values:
## .
## White / Caucasian Black / African American Hispanic / Latino
## 9829 4469 2832
## Uncertain
## 1665 1491
## ----------------------------------
## Variable: city
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 1448
## Most common values:
## .
## Bronx New York Brooklyn Queens
## 2425 1603 1525 1185 1104
## ----------------------------------
## Variable: county
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 332
## Most common values:
## .
## Bronx New York San Bernardino Kings Queens
## 2398 2128 1578 1520 1291
## ----------------------------------
## Variable: state
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 44
## Most common values:
## .
## NY CA TX WA TN
## 7444 3629 2327 1962 771
## ----------------------------------
## Variable: date_modified
## Type: Date
## Missing: 0 (0%)
## Date range:
## Min: 2010-04-12
## Max: 2026-02-05
## ----------------------------------
## Variable: intptlat_city
## Type: numeric
## Missing: 537 (2.5%)
## Numeric summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 17.97 34.03 40.64 38.06 40.85 48.92 537
## ----------------------------------
## Variable: intptlong_city
## Type: numeric
## Missing: 537 (2.5%)
## Numeric summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -158.18 -117.43 -86.12 -94.35 -73.94 -65.74 537
## ----------------------------------
## Variable: intptlat_county
## Type: numeric
## Missing: 86 (0.4%)
## Numeric summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.22 34.20 40.64 38.05 40.85 55.25 86
## ----------------------------------
## Variable: intptlong_county
## Type: numeric
## Missing: 86 (0.4%)
## Numeric summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -162.00 -116.18 -84.47 -93.73 -73.95 -66.59 86
This dataset contains 21,891 records representing unclaimed persons, with 14 variables covering identifiers, demographics, temporal fields, and geographic information. The analysis below summarizes key structural characteristics and distributional patterns observed during an initial univariate exploration.
The case_number field uniquely identifies each record, with a one-to-one correspondence between rows and case numbers. As with the missing persons dataset, this confirms its role as a stable identifier suitable for record tracking and joins.
The legal_first_name and legal_last_name fields again exhibit high cardinality, with over 4,200 unique first names and 10,900 unique last names. Common names such as Robert, James, Smith, and Johnson appear most frequently, reflecting expected population patterns. These fields primarily function as natural identifiers and are most useful for reporting, linkage, or manual review rather than analytic modeling.
The dbf (date body found) variable spans a long historical range, from 1939 through early 2026, indicating that the dataset includes both historical and contemporary unclaimed persons cases. A small proportion of records (approximately 2%) are missing this value, but the overall coverage supports temporal analysis across multiple decades.
The date_modified variable ranges from 2010 to 2026 and likely reflects system updates or administrative activity rather than event timing. As such, it is best interpreted as metadata useful for understanding record recency or workflow processes.
The biological_sex field contains four recorded values. Male and Female comprise the vast majority of cases, a smaller number are labeled Unsure, and one value prints as blank in the frequency table, which suggests an empty string that the explorer does not count as missing. This introduces slightly more categorical complexity than observed in the missing persons dataset, but the distribution remains suitable for stratified descriptive analysis.
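The frequency table for biological_sex reports four unique values but prints only three labels, while missingness is reported as 0%. That pattern is consistent with blank strings being treated as ordinary values. Assuming that is the cause, a small cleaning step can convert blank or whitespace-only strings to `NA` so they are counted as missing; the vector below is toy data.

```r
# Toy values, including an empty string and a whitespace-only string
biological_sex <- c("Male", "Female", "Unsure", "", "Male", " ")

# Blank means: not already NA, and empty after trimming whitespace
is_blank <- !is.na(biological_sex) & trimws(biological_sex) == ""

sex_clean <- biological_sex
sex_clean[is_blank] <- NA

missing_before <- sum(is.na(biological_sex))
missing_after <- sum(is.na(sex_clean))
cat("Missing before:", missing_before, "after:", missing_after, "\n")

# useNA makes the missing bucket visible in the frequency table
table(sex_clean, useNA = "ifany")
```

The same step would apply to other character fields where a blank label appears among the most common values, such as city and race_ethnicity.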
The race_ethnicity variable includes 55 distinct values, capturing both single-category identities and broader or uncertain classifications. While categories such as White / Caucasian and Black / African American account for most records, the presence of a long tail of categories suggests that normalization or grouping would be beneficial for comparative or visual analysis.
Geographic information is provided at the city, county, and state levels. The dataset spans 44 distinct state codes, with particularly high concentrations in New York, California, and Texas; New York alone accounts for roughly a third of records (7,444 of 21,891). City- and county-level fields show lower cardinality than in the missing persons dataset (1,448 cities and 332 counties versus 6,059 and 1,414), reflecting more geographically concentrated reporting patterns.
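The degree of concentration can be made concrete with the state counts printed in the explorer output above. This short calculation uses only those five reported counts and the reported row total.

```r
# Top five state counts and total rows, as reported in the output above
top_states <- c(NY = 7444, CA = 3629, TX = 2327, WA = 1962, TN = 771)
total_records <- 21891

# Share of all records per state, and running share down the ranking
share_pct <- round(100 * top_states / total_records, 1)
cumulative_pct <- round(100 * cumsum(top_states) / total_records, 1)

print(rbind(share_pct, cumulative_pct))
```

New York alone holds about 34% of records, and the top three states together hold about 61%.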
Latitude and longitude values are available at both city and county levels, with modest missingness for city-level coordinates (approximately 2.5%) and very low missingness for county-level coordinates (approximately 0.4%). Coordinate ranges are consistent with coverage across the United States and its territories, and county-level coordinates again appear to provide the most stable basis for spatial aggregation and mapping.
Overall, the unclaimed persons dataset exhibits strong structural consistency and broad temporal and geographic coverage. Compared to the missing persons dataset, it shows slightly more categorical complexity in demographic fields and a greater concentration of cases in specific metropolitan regions. Key analytical considerations emerging from this EDA include handling historical records, normalizing demographic categories, and leveraging county-level geography as a stable spatial unit. With these considerations in mind, the dataset is well suited for descriptive analysis, geographic exploration, and comparative visualization.
In addition to a queryable table, this dataset lends itself well to interactive visualizations centered on time and geography. The long temporal span of the dbf field supports timeline and trend-based views, while the geographic concentration of cases makes map-based exploration particularly informative. Interactive distributions for demographic variables and coordinated views linking maps, timelines, and summary charts would allow users to explore spatial and temporal patterns dynamically and drill down from aggregate trends to individual records, aligning well with exploratory data visualization goals.
From a visualization perspective, showing geographic prominence on a per-capita basis would help contextualize raw counts, since highly populated metropolitan areas otherwise dominate the map and can obscure regions with disproportionately high relative prevalence.
## let's explore the unidentified persons dataset
explore_dataset(unidentified_persons)
## ==============================
## Dataset overview: unidentified_persons
## ==============================
## Rows: 15477
## Columns: 15
## Column types:
## case_number me_c_case dbf age_from
## "character" "character" "Date" "integer"
## age_to city county state
## "integer" "character" "character" "character"
## biological_sex race_ethnicity date_modified intptlat_city
## "character" "character" "Date" "numeric"
## intptlong_city intptlat_county intptlong_county
## "numeric" "numeric" "numeric"
##
## ----------------------------------
## Variable: case_number
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 15477
## Most common values:
## .
## UP100 UP10002 UP10004 UP100068 UP10008
## 1 1 1 1 1
## ----------------------------------
## Variable: me_c_case
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 15385
## Most common values:
## .
## ME/C Case N/A ME/C Case 15MB000503
## 55 5 3
## ME/C Case 01-057 ME/C Case 03-4051
## 2 2
## ----------------------------------
## Variable: dbf
## Type: Date
## Missing: 0 (0%)
## Date range:
## Min: 1915-05-13
## Max: 2026-01-23
## ----------------------------------
## Variable: age_from
## Type: integer
## Missing: 3206 (20.7%)
## Numeric summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 20.0 25.0 29.1 39.0 99.0 3206
## ----------------------------------
## Variable: age_to
## Type: integer
## Missing: 3206 (20.7%)
## Numeric summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 35.00 45.00 48.27 60.00 110.00 3206
## ----------------------------------
## Variable: city
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 3348
## Most common values:
## .
## New York Los Angeles Brooklyn Chicago
## 2534 467 373 337 294
## ----------------------------------
## Variable: county
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 929
## Most common values:
## .
## Pima Los Angeles New York Brooks San Diego
## 1319 1062 511 413 407
## ----------------------------------
## Variable: state
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 54
## Most common values:
## .
## CA AZ TX NY FL
## 2987 2139 2072 1475 881
## ----------------------------------
## Variable: biological_sex
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 4
## Most common values:
## .
## Male Female Unsure
## 11485 2854 1135 3
## ----------------------------------
## Variable: race_ethnicity
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 97
## Most common values:
## .
## White / Caucasian Uncertain
## 4417 3736
## Black / African American Hispanic / Latino
## 2245 2239
## White / Caucasian, Hispanic / Latino
## 1447
## ----------------------------------
## Variable: date_modified
## Type: Date
## Missing: 0 (0%)
## Date range:
## Min: 2011-06-06
## Max: 2026-02-05
## ----------------------------------
## Variable: intptlat_city
## Type: numeric
## Missing: 1265 (8.2%)
## Numeric summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 13.38 31.95 34.18 35.41 40.47 67.74 1265
## ----------------------------------
## Variable: intptlong_city
## Type: numeric
## Missing: 1265 (8.2%)
## Numeric summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -176.60 -113.54 -97.07 -96.91 -80.11 144.70 1265
## ----------------------------------
## Variable: intptlat_county
## Type: numeric
## Missing: 37 (0.2%)
## Numeric summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 13.38 32.13 34.20 34.98 40.01 67.01 37
## ----------------------------------
## Variable: intptlong_county
## Type: numeric
## Missing: 37 (0.2%)
## Numeric summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -164.19 -113.91 -97.29 -96.95 -80.28 144.70 37
This dataset contains 15,477 records representing unidentified persons, with 15 variables capturing identifiers, estimated demographics, temporal fields, and geographic information. The analysis below summarizes key structural characteristics and distributional patterns observed during an initial univariate exploration.
The case_number field uniquely identifies each record, with a one-to-one correspondence between rows and case numbers, confirming its suitability as a stable identifier for record tracking and joins.
The me_c_case field provides a medical examiner or coroner case reference and exhibits very high cardinality, with nearly all values unique. A small number of records are labeled generically (e.g., “ME/C Case N/A”), indicating variability in how this identifier is populated. This field functions primarily as an administrative reference rather than an analytic feature.
The dbf (date body found) variable spans a wide historical range from 1915 through early 2026, indicating that the dataset includes both long-standing historical cases and contemporary unidentified persons. This temporal breadth supports longitudinal analysis but suggests the importance of stratifying by era to account for changes in investigative practices and reporting systems.
The date_modified field ranges from 2011 to 2026 and appears to reflect system updates or administrative activity rather than the timing of case events. As such, it is best interpreted as metadata related to record maintenance and recency.
Age information is represented as a range via age_from and age_to. Both fields exhibit approximately 21% missingness, reflecting inherent uncertainty or unavailable information in unidentified persons cases. Where present, the distributions indicate that most estimated ages fall within adult ranges, with median lower and upper bounds of approximately 25 and 45 years, respectively. The presence of wide ranges and upper bounds extending beyond 100 highlights the interpretive uncertainty associated with age estimation in this context.
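For visualization, an age range can be summarized by its midpoint and width, with the width serving as a rough indicator of estimation uncertainty. The sketch below uses toy ranges; the derived names and the 30-year threshold for a "wide" range are assumptions.

```r
# Toy age ranges; NA marks records with no age estimate
age_from <- c(20L, 25L, 0L, NA, 60L)
age_to   <- c(35L, 45L, 1L, NA, 110L)

has_estimate <- !is.na(age_from) & !is.na(age_to)

# Midpoint for plotting position, width for displayed uncertainty
age_mid <- (age_from + age_to) / 2
age_width <- age_to - age_from

# Assumed threshold for flagging very uncertain estimates
wide_range <- has_estimate & age_width >= 30

data.frame(age_from, age_to, age_mid, age_width, has_estimate, wide_range)
```

Plotting each record as a segment from age_from to age_to, rather than as a single point at the midpoint, keeps the uncertainty visible to the reader.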
Geographic information is provided at the city, county, and state levels. The dataset spans 54 distinct state or territory codes, with the largest concentrations in California, Arizona, Texas, and New York. Compared to the missing and unclaimed persons datasets, unidentified persons cases show notable clustering in specific counties and regions: Pima is the most common county (1,319 cases), ahead of Los Angeles (1,062), and Brooks (413) also appears in the top five. This pattern suggests localized investigative or reporting patterns.
Latitude and longitude values are available at both city and county levels. City-level coordinates exhibit moderate missingness (approximately 8%), while county-level coordinates are nearly complete. Coordinate ranges extend beyond the continental United States, consistent with inclusion of U.S. territories and edge-case records. County-level coordinates again appear to provide a more stable basis for spatial aggregation.
The biological_sex variable includes four recorded values, with Male comprising the majority of cases (11,485 of 15,477), followed by Female and Unsure, plus one value that prints as blank in the frequency table. This additional categorical complexity reflects uncertainty inherent in unidentified cases and should be considered in stratified analysis.
The race_ethnicity field contains 97 distinct values, the highest among the three datasets. A substantial proportion of records are labeled as Uncertain (3,736 of 15,477, about 24%), alongside common categories such as White / Caucasian, Black / African American, and Hispanic / Latino. The prevalence of uncertainty and multi-category values underscores the need for careful normalization or grouping prior to comparative demographic analysis.
Overall, the unidentified persons dataset reflects the inherent uncertainty and complexity of cases where individual identity has not been established. Compared to the missing and unclaimed persons datasets, it exhibits higher missingness in age-related fields, greater categorical complexity in demographic variables, and pronounced geographic clustering. These characteristics make the dataset particularly well suited for descriptive, spatial, and exploratory analysis, while also emphasizing the importance of transparent handling of uncertainty in any downstream analysis or visualization.
Beyond a queryable table, this dataset is especially well suited to interactive visualizations that foreground uncertainty, time, and place. Timeline views based on dbf can highlight historical persistence and recent case activity, while map-based visualizations can reveal geographic clustering and regional investigative patterns. Interactive age-range displays and demographic summaries that explicitly communicate uncertainty would be particularly valuable. Coordinated views linking maps, timelines, and summary panels would support exploratory workflows and allow users to move fluidly between aggregate patterns and individual case context.
Given the strong geographic clustering observed, population-normalized (per-capita) visualizations may be particularly informative for highlighting relative prominence beyond major metros, where raw counts alone tend to reflect population size rather than comparative concentration.
```