Introduction

This notebook presents a concise exploratory data analysis of three NamUs datasets (Missing Persons, Unclaimed Persons, and Unidentified Persons) with the goal of understanding their structure, distributions, and analytic affordances as they relate to visualization.

Following a bit of data cleaning, the analysis uses a small, reusable explore_dataset() function to perform a consistent, single-variable review of each dataset. For every feature, the explorer summarizes type, missingness, value ranges, and common categories. Each dataset is then accompanied by a short written interpretation that distills these observations and notes implications for visualization design.
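The source of explore_dataset() does not appear in the rendered output. As a rough illustration of the single-variable review described above, a minimal sketch might look like the following. The name explore_sketch(), its top_n parameter, and the toy data frame are assumptions for illustration, not the notebook's actual implementation.

```r
# Hypothetical stand-in for explore_dataset(); the notebook's real
# implementation is not shown, so this only mirrors the described behavior.
explore_sketch <- function(df, top_n = 5) {
  cat("Rows:", nrow(df), "\n")
  cat("Columns:", ncol(df), "\n")
  for (col in names(df)) {
    x <- df[[col]]
    n_missing <- sum(is.na(x))
    cat("\nVariable:", col, "\n")
    cat("Type:", class(x)[1], "\n")
    cat(sprintf("Missing: %d (%.1f%%)\n", n_missing, 100 * mean(is.na(x))))
    if (inherits(x, "Date")) {
      # Dates: report the observed range
      cat("Min:", format(min(x, na.rm = TRUE)), "\n")
      cat("Max:", format(max(x, na.rm = TRUE)), "\n")
    } else if (is.numeric(x)) {
      # Numbers: five-number summary plus mean
      print(summary(x))
    } else {
      # Everything else: cardinality and the most common values
      cat("Unique values (non-missing):", length(unique(x[!is.na(x)])), "\n")
      print(head(sort(table(x), decreasing = TRUE), top_n))
    }
  }
  invisible(df)
}

# Toy data, not drawn from NamUs
toy <- data.frame(
  case_number = c("MP1", "MP2", "MP3"),
  dlc = as.Date(c("1999-05-01", "2010-07-15", NA)),
  missing_age = c(34, NA, 12),
  stringsAsFactors = FALSE
)
explore_sketch(toy)
```

Note that is.na() does not catch empty strings, so a function of this shape reports 0% missingness for a character column that contains blanks.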

The notebook is organized to begin with a geographic overview of all three datasets combined, using county-level prominence to establish the national spatial distribution and relative concentration of cases. From there, each dataset is examined individually, with attention to how differences in temporal coverage, demographic certainty, and geographic resolution suggest different—and potentially complementary—visualization strategies.
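One way to build combined county-level counts for such an overview is sketched below, assuming the three data frames share the case_number, county, and state columns listed in the explorer output. The toy data frames (mp_toy, ucp_toy, up_toy) and the dataset label column are hypothetical.

```r
# Toy stand-ins for the three cleaned data frames (values are illustrative).
mp_toy  <- data.frame(case_number = c("MP1", "MP2"),
                      county = c("Harris", "Pima"), state = c("TX", "AZ"),
                      stringsAsFactors = FALSE)
ucp_toy <- data.frame(case_number = "UCP1", county = "Bronx", state = "NY",
                      stringsAsFactors = FALSE)
up_toy  <- data.frame(case_number = c("UP1", "UP2"),
                      county = c("Pima", "Pima"), state = c("AZ", "AZ"),
                      stringsAsFactors = FALSE)

# Keep only the columns all three datasets have in common, then stack them.
shared <- c("case_number", "county", "state")
combined <- rbind(
  cbind(dataset = "missing",      mp_toy[shared]),
  cbind(dataset = "unclaimed",    ucp_toy[shared]),
  cbind(dataset = "unidentified", up_toy[shared])
)

# Count by county AND state, since county names repeat across states.
county_counts <- as.data.frame(
  table(county = combined$county, state = combined$state),
  responseName = "cases", stringsAsFactors = FALSE
)
county_counts <- county_counts[county_counts$cases > 0, ]
county_counts <- county_counts[order(-county_counts$cases), ]
county_counts
```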

Geographic Spread of All Data

Missing Persons

Dataset Explorer Output

## Let's explore the missing persons dataset
explore_dataset(missing_persons)
## ==============================
## Dataset overview: missing_persons
## ==============================
## Rows: 26182
## Columns: 15
## Column types:
##      case_number              dlc  legal_last_name legal_first_name 
##      "character"           "Date"      "character"      "character" 
##      missing_age             city           county            state 
##        "numeric"      "character"      "character"      "character" 
##   biological_sex   race_ethnicity    date_modified    intptlat_city 
##      "character"      "character"           "Date"        "numeric" 
##   intptlong_city  intptlat_county intptlong_county 
##        "numeric"        "numeric"        "numeric" 
## 
## ----------------------------------
## Variable: case_number
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 26182
## Most common values:
## .
##      MP1     MP10    MP100  MP10000 MP100005 
##        1        1        1        1        1 
## ----------------------------------
## Variable: dlc
## Type: Date
## Missing: 0 (0%)
## Date range:
## Min: 1902-01-01
## Max: 2026-01-28
## ----------------------------------
## Variable: legal_last_name
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 14222
## Most common values:
## .
##    Smith  Johnson Williams    Jones    Brown 
##      217      183      159      138      124 
## ----------------------------------
## Variable: legal_first_name
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 6460
## Most common values:
## .
## Michael    John  Robert   James   David 
##     437     419     416     395     361 
## ----------------------------------
## Variable: missing_age
## Type: numeric
## Missing: 0 (0%)
## Numeric summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   21.00   32.00   34.85   46.00  116.00 
## ----------------------------------
## Variable: city
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 6059
## Most common values:
## .
##       Houston   Los Angeles        Dallas San Francisco       Memphis 
##           498           480           444           260           226 
## ----------------------------------
## Variable: county
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 1414
## Most common values:
## .
## Los Angeles      Harris      Dallas    Maricopa      Orange 
##         995         621         545         395         376 
## ----------------------------------
## Variable: state
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 55
## Most common values:
## .
##   CA   TX   FL   AK   AZ 
## 3771 2884 2378 1297 1087 
## ----------------------------------
## Variable: biological_sex
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 2
## Most common values:
## .
##   Male Female 
##  16557   9625 
## ----------------------------------
## Variable: race_ethnicity
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 78
## Most common values:
## .
##                    White / Caucasian             Black / African American 
##                                14563                                 4368 
##                    Hispanic / Latino White / Caucasian, Hispanic / Latino 
##                                 3670                                 1118 
##      American Indian / Alaska Native 
##                                  824 
## ----------------------------------
## Variable: date_modified
## Type: Date
## Missing: 0 (0%)
## Date range:
## Min: 2014-01-13
## Max: 2026-02-01
## ----------------------------------
## Variable: intptlat_city
## Type: numeric
## Missing: 1787 (6.8%)
## Numeric summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   13.38   32.79   36.17   37.30   40.85   70.65    1787 
## ----------------------------------
## Variable: intptlong_city
## Type: numeric
## Missing: 1787 (6.8%)
## Numeric summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -176.60 -116.99  -95.08  -99.09  -82.40  145.78    1787 
## ----------------------------------
## Variable: intptlat_county
## Type: numeric
## Missing: 201 (0.8%)
## Numeric summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   13.38   32.77   36.17   37.19   40.92   69.45     201 
## ----------------------------------
## Variable: intptlong_county
## Type: numeric
## Missing: 201 (0.8%)
## Numeric summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -164.19 -116.18  -94.42  -98.01  -81.98  179.62     201

This dataset contains 26,182 records representing missing persons cases, with 15 variables spanning identifiers, demographics, temporal fields, and geographic information. The analysis below summarizes key distributional patterns, coverage, and structural characteristics observed during an initial univariate exploration.

Identifiers and Names

The case_number field uniquely identifies each record, with a one-to-one correspondence between rows and case numbers. This confirms its role as a stable identifier suitable for record tracking and joins.
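A quick check of this one-to-one property can be written in base R. The helper name check_unique_id() and the toy data are assumptions for illustration.

```r
# Hypothetical helper: summarize whether an ID column is usable as a key.
check_unique_id <- function(df, id_col = "case_number") {
  ids <- df[[id_col]]
  list(
    n_rows         = nrow(df),
    n_unique       = length(unique(ids[!is.na(ids)])),
    n_missing      = sum(is.na(ids)),
    duplicated_ids = unique(ids[duplicated(ids) & !is.na(ids)])
  )
}

# Toy data with one deliberate duplicate
toy_ids <- data.frame(case_number = c("MP1", "MP2", "MP2", "MP3"),
                      stringsAsFactors = FALSE)
res <- check_unique_id(toy_ids)
res$n_unique == res$n_rows   # FALSE here: "MP2" appears twice
```

A column is safe to use as a join key when n_unique equals n_rows and n_missing is zero.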

The legal_first_name and legal_last_name fields exhibit very high cardinality, with over 6,400 unique first names and 14,200 unique last names. Common names such as Michael, John, Smith, and Johnson appear frequently, reflecting expected population distributions.

Temporal Fields

The dlc (date last contact) variable spans a wide historical range, from 1902 through early 2026, indicating the presence of both historical cold cases and very recent records. This breadth supports long-term trend analysis but also suggests the importance of stratifying by era to avoid conflating cases from substantially different reporting contexts.
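Stratifying by era could be done with cut() on the year of last contact, as in the sketch below. The break points and labels are illustrative assumptions rather than choices made in this notebook.

```r
# Toy dates spanning the observed range of dlc
dlc <- as.Date(c("1902-01-01", "1975-06-30", "1999-12-31",
                 "2015-03-10", "2026-01-28"))
year <- as.integer(format(dlc, "%Y"))

# Break points are illustrative assumptions, not taken from the notebook.
# cut() uses right-closed intervals by default, so 1999 falls in "1980-1999".
era <- cut(year,
           breaks = c(-Inf, 1979, 1999, 2009, Inf),
           labels = c("pre-1980", "1980-1999", "2000-2009", "2010+"))
era_counts <- table(era)
era_counts
```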

The date_modified variable is concentrated in the modern period (2014–2026) and likely reflects system update activity rather than event timing. It is best interpreted as metadata useful for assessing data recency or workflow processes.

Age at Disappearance

The missing_age variable ranges from 1 to 116 years, with a median age of 32 and a mean of approximately 35. The first quartile is 21, so roughly three quarters of cases involve people aged 21 or older, and minors fall within the remaining quarter. The maximum of 116 is extreme enough to warrant follow-up validation, particularly when producing age-stratified summaries.
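Records with extreme ages could be flagged for manual review rather than dropped. In the sketch below, the function name flag_extreme_age() and the bounds of 0 and 100 are assumptions chosen for illustration.

```r
# Hypothetical helper: TRUE where an age falls outside [lower, upper].
# Missing ages are not flagged; they are a separate data-quality question.
flag_extreme_age <- function(age, lower = 0, upper = 100) {
  !is.na(age) & (age < lower | age > upper)
}

ages <- c(1, 32, 46, 116, NA)   # toy values echoing the summary above
flagged <- flag_extreme_age(ages)
ages[flagged]
```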

Geographic Coverage

Geographic information is provided at the city, county, and state levels. The dataset spans 55 distinct state or territory codes. Raw counts are highest in California (3,771), Texas (2,884), and Florida (2,378), consistent with population size and reporting volume. Alaska ranks fourth (1,297), well above what its population alone would suggest.

Demographic Characteristics

The biological_sex variable contains two categories, Male and Female, with males comprising approximately 63% of cases (16,557 of 26,182). With no missing values, the field supports stratified descriptive analysis.

The race_ethnicity field includes 78 distinct values, encompassing both single-category and multi-category identities (for example, “White / Caucasian, Hispanic / Latino”). White / Caucasian and Black / African American together account for about 72% of records, and the long-tail structure indicates that this variable encodes complex demographic information that would benefit from normalization prior to comparative analysis.
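One possible normalization collapses multi-category values into a single group and treats blank or uncertain entries as one bucket. The sketch below assumes that multiple categories are joined by commas, as in the “White / Caucasian, Hispanic / Latino” value in the explorer output. The function name and group labels are hypothetical.

```r
# Hypothetical normalizer for race_ethnicity-style strings.
# Assumes a comma separates categories and never occurs inside one.
normalize_race <- function(x) {
  x <- trimws(x)
  n_parts <- lengths(strsplit(x, ",\\s*"))
  out <- ifelse(n_parts > 1, "Multiple categories", x)
  out[is.na(x) | x == "" | x == "Uncertain"] <- "Uncertain / not recorded"
  out
}

raw <- c("White / Caucasian",
         "White / Caucasian, Hispanic / Latino",
         "Uncertain", "", NA)
normalized <- normalize_race(raw)
table(normalized)
```

Grouping this way trades detail for comparability. An alternative is to split each value into indicator columns so that multi-category records count toward every group they list.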

Geospatial Coordinates (Census-Derived)

Latitude and longitude values are available at both city and county levels. County-level coordinates are nearly complete (0.8% missing), while city-level coordinates exhibit modest missingness (approximately 7%). Coordinate ranges extend beyond the continental United States, consistent with inclusion of U.S. territories and edge-case records. Because they are more complete, county-level coordinates may serve as a preferred default for spatial aggregation and mapping.

Overall Assessment

Overall, the dataset demonstrates broad coverage across time, geography, and demographics, with distributions that are consistent with expectations for a national missing persons dataset. Key analytical considerations emerging from this EDA include handling historical records, normalizing high-cardinality categorical variables, and applying thoughtful validation to extreme values. With these considerations addressed, the dataset is well-positioned to support descriptive analysis, trend reporting, and geographic summarization.

Thoughts on Data Visualization

Beyond a queryable table, this dataset naturally supports interactive visualizations that emphasize time, geography, and demographic composition. The wide temporal span of the dlc field lends itself to interactive timelines or time-series views, while the availability of city- and county-level coordinates enables map-based exploration at multiple geographic resolutions. Demographic variables such as age, sex, and race/ethnicity are well suited to interactive distributions and comparative summaries, particularly when combined with filtering and brushing. Coordinated views—linking maps, timelines, and summary charts—would allow users to explore patterns dynamically and drill down from aggregate trends to individual cases, making the dataset especially well suited for exploratory and narrative-driven data visualization.

For a data visualization audience, it will likely be useful to present geographic prominence normalized by population (per capita), as large metropolitan areas dominate raw counts and population-adjusted views can better highlight relative concentration across regions.
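A per-capita view requires joining case counts to a population table, which is not part of the datasets explored here. The sketch below assumes a hypothetical county_pop table (for example, one built from Census estimates) keyed by county and state. All values are toy numbers.

```r
# Toy case counts per county (illustrative values only)
county_cases <- data.frame(county = c("A", "B"), state = c("XX", "XX"),
                           cases = c(500, 20), stringsAsFactors = FALSE)

# Hypothetical population lookup; not included in the NamUs data
county_pop <- data.frame(county = c("A", "B"), state = c("XX", "XX"),
                         population = c(5000000, 40000),
                         stringsAsFactors = FALSE)

# Join on county AND state, then express counts per 100,000 residents
rates <- merge(county_cases, county_pop, by = c("county", "state"))
rates$per_100k <- 1e5 * rates$cases / rates$population
rates[order(-rates$per_100k), ]
```

In this toy case the smaller county B has the higher rate (50 per 100,000 versus 10) even though county A has 25 times as many cases.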

Unclaimed Persons

Dataset Explorer Output

## Let's explore the unclaimed persons dataset
explore_dataset(unclaimed_persons)
## ==============================
## Dataset overview: unclaimed_persons
## ==============================
## Rows: 21891
## Columns: 14
## Column types:
##      case_number              dbf  legal_last_name legal_first_name 
##      "character"           "Date"      "character"      "character" 
##   biological_sex   race_ethnicity             city           county 
##      "character"      "character"      "character"      "character" 
##            state    date_modified    intptlat_city   intptlong_city 
##      "character"           "Date"        "numeric"        "numeric" 
##  intptlat_county intptlong_county 
##        "numeric"        "numeric" 
## 
## ----------------------------------
## Variable: case_number
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 21891
## Most common values:
## .
##    UCP100   UCP1000 UCP100000 UCP100001 UCP100002 
##         1         1         1         1         1 
## ----------------------------------
## Variable: dbf
## Type: Date
## Missing: 509 (2.3%)
## Date range:
## Min: 1939-12-16
## Max: 2026-01-30
## ----------------------------------
## Variable: legal_last_name
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 10957
## Most common values:
## .
##    Smith  Johnson Williams    Brown    Jones 
##      221      181      177      157      153 
## ----------------------------------
## Variable: legal_first_name
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 4263
## Most common values:
## .
##  Robert   James    John Michael William 
##     628     598     524     482     449 
## ----------------------------------
## Variable: biological_sex
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 4
## Most common values:
## .
##   Male Female        Unsure 
##  16545   4322   1022      2 
## ----------------------------------
## Variable: race_ethnicity
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 55
## Most common values:
## .
##        White / Caucasian Black / African American        Hispanic / Latino 
##                     9829                     4469                     2832 
##                Uncertain                          
##                     1665                     1491 
## ----------------------------------
## Variable: city
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 1448
## Most common values:
## .
##    Bronx New York Brooklyn            Queens 
##     2425     1603     1525     1185     1104 
## ----------------------------------
## Variable: county
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 332
## Most common values:
## .
##          Bronx       New York San Bernardino          Kings         Queens 
##           2398           2128           1578           1520           1291 
## ----------------------------------
## Variable: state
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 44
## Most common values:
## .
##   NY   CA   TX   WA   TN 
## 7444 3629 2327 1962  771 
## ----------------------------------
## Variable: date_modified
## Type: Date
## Missing: 0 (0%)
## Date range:
## Min: 2010-04-12
## Max: 2026-02-05
## ----------------------------------
## Variable: intptlat_city
## Type: numeric
## Missing: 537 (2.5%)
## Numeric summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   17.97   34.03   40.64   38.06   40.85   48.92     537 
## ----------------------------------
## Variable: intptlong_city
## Type: numeric
## Missing: 537 (2.5%)
## Numeric summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -158.18 -117.43  -86.12  -94.35  -73.94  -65.74     537 
## ----------------------------------
## Variable: intptlat_county
## Type: numeric
## Missing: 86 (0.4%)
## Numeric summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   18.22   34.20   40.64   38.05   40.85   55.25      86 
## ----------------------------------
## Variable: intptlong_county
## Type: numeric
## Missing: 86 (0.4%)
## Numeric summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -162.00 -116.18  -84.47  -93.73  -73.95  -66.59      86

This dataset contains 21,891 records representing unclaimed persons, with 14 variables covering identifiers, demographics, temporal fields, and geographic information. The analysis below summarizes key structural characteristics and distributional patterns observed during an initial univariate exploration.

Identifiers and Names

The case_number field uniquely identifies each record, with a one-to-one correspondence between rows and case numbers. As with the missing persons dataset, this confirms its role as a stable identifier suitable for record tracking and joins.

The legal_first_name and legal_last_name fields again exhibit high cardinality, with over 4,200 unique first names and 10,900 unique last names. Common names such as Robert, James, Smith, and Johnson appear most frequently, reflecting expected population patterns. These fields primarily function as natural identifiers and are most useful for reporting, linkage, or manual review rather than analytic modeling.

Temporal Fields

The dbf (date body found) variable spans a long historical range, from 1939 through early 2026, indicating that the dataset includes both historical and contemporary unclaimed persons cases. A small proportion of records (approximately 2%) are missing this value, but the overall coverage supports temporal analysis across multiple decades.

The date_modified variable ranges from 2010 to 2026 and likely reflects system updates or administrative activity rather than event timing. As such, it is best interpreted as metadata useful for understanding record recency or workflow processes.

Demographic Characteristics

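The biological_sex field contains four recorded values. Male (16,545) and Female (4,322) comprise the vast majority of cases; 1,022 records carry a blank value and only 2 are labeled Unsure. Because the blanks are stored as strings rather than NA, they are not reflected in the reported 0% missingness. This introduces slightly more categorical complexity than observed in the missing persons dataset, but the distribution remains suitable for stratified descriptive analysis once blanks are treated as missing.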
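The explorer output shows an unlabeled category in several character fields, which suggests that blank strings are stored in place of NA. A small recoding step, sketched below with a hypothetical helper name, would let missingness counts reflect those blanks.

```r
# Hypothetical helper: convert empty or whitespace-only strings to NA.
blank_to_na <- function(x) {
  x[!is.na(x) & trimws(x) == ""] <- NA_character_
  x
}

sex <- c("Male", "Female", "", "Unsure", " ")   # toy values
cleaned <- blank_to_na(sex)
sum(is.na(cleaned))   # 2: the empty string and the single space
```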

The race_ethnicity variable includes 55 distinct values, capturing both single-category identities and broader or uncertain classifications. White / Caucasian and Black / African American together account for about 65% of records, while Uncertain (1,665) and blank (1,491) values together make up about 14%. The long tail of remaining categories suggests that normalization or grouping would be beneficial for comparative or visual analysis.

Geographic Coverage

Geographic information is provided at the city, county, and state levels, although the city field is blank for 1,185 records. The dataset spans 44 distinct state codes, with particularly high concentrations in New York (7,444 records, about 34% of the total), California, and Texas. City- and county-level fields show lower cardinality than in the missing persons dataset (1,448 cities and 332 counties versus 6,059 and 1,414), reflecting more geographically concentrated reporting patterns.

Latitude and longitude values are available at both city and county levels, with modest missingness for city-level coordinates (approximately 2.5%) and very low missingness for county-level coordinates (0.4%). Coordinate ranges are consistent with coverage across the United States and its territories, and county-level coordinates again appear to provide the most complete basis for spatial aggregation and mapping.

Overall Assessment

Overall, the unclaimed persons dataset exhibits strong structural consistency and broad temporal and geographic coverage. Compared to the missing persons dataset, it shows slightly more categorical complexity in demographic fields and a greater concentration of cases in specific metropolitan regions. Key analytical considerations emerging from this EDA include handling historical records, normalizing demographic categories, and leveraging county-level geography as a stable spatial unit. With these considerations in mind, the dataset is well suited for descriptive analysis, geographic exploration, and comparative visualization.

Thoughts on Data Visualization

In addition to a queryable table, this dataset lends itself well to interactive visualizations centered on time and geography. The long temporal span of the dbf field supports timeline and trend-based views, while the geographic concentration of cases makes map-based exploration particularly informative. Interactive distributions for demographic variables and coordinated views linking maps, timelines, and summary charts would allow users to explore spatial and temporal patterns dynamically and drill down from aggregate trends to individual records, aligning well with exploratory data visualization goals.

From a visualization perspective, showing geographic prominence on a per-capita basis would help contextualize raw counts, since highly populated metropolitan areas otherwise dominate the map and can obscure regions with disproportionately high relative prevalence.

Unidentified Persons

Dataset Explorer Output

## Let's explore the unidentified persons dataset
explore_dataset(unidentified_persons)
## ==============================
## Dataset overview: unidentified_persons
## ==============================
## Rows: 15477
## Columns: 15
## Column types:
##      case_number        me_c_case              dbf         age_from 
##      "character"      "character"           "Date"        "integer" 
##           age_to             city           county            state 
##        "integer"      "character"      "character"      "character" 
##   biological_sex   race_ethnicity    date_modified    intptlat_city 
##      "character"      "character"           "Date"        "numeric" 
##   intptlong_city  intptlat_county intptlong_county 
##        "numeric"        "numeric"        "numeric" 
## 
## ----------------------------------
## Variable: case_number
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 15477
## Most common values:
## .
##    UP100  UP10002  UP10004 UP100068  UP10008 
##        1        1        1        1        1 
## ----------------------------------
## Variable: me_c_case
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 15385
## Most common values:
## .
##                             ME/C Case N/A ME/C Case 15MB000503 
##                   55                    5                    3 
##     ME/C Case 01-057    ME/C Case 03-4051 
##                    2                    2 
## ----------------------------------
## Variable: dbf
## Type: Date
## Missing: 0 (0%)
## Date range:
## Min: 1915-05-13
## Max: 2026-01-23
## ----------------------------------
## Variable: age_from
## Type: integer
## Missing: 3206 (20.7%)
## Numeric summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    20.0    25.0    29.1    39.0    99.0    3206 
## ----------------------------------
## Variable: age_to
## Type: integer
## Missing: 3206 (20.7%)
## Numeric summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   35.00   45.00   48.27   60.00  110.00    3206 
## ----------------------------------
## Variable: city
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 3348
## Most common values:
## .
##                New York Los Angeles    Brooklyn     Chicago 
##        2534         467         373         337         294 
## ----------------------------------
## Variable: county
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 929
## Most common values:
## .
##        Pima Los Angeles    New York      Brooks   San Diego 
##        1319        1062         511         413         407 
## ----------------------------------
## Variable: state
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 54
## Most common values:
## .
##   CA   AZ   TX   NY   FL 
## 2987 2139 2072 1475  881 
## ----------------------------------
## Variable: biological_sex
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 4
## Most common values:
## .
##   Male Female Unsure        
##  11485   2854   1135      3 
## ----------------------------------
## Variable: race_ethnicity
## Type: character
## Missing: 0 (0%)
## Unique values (non-missing): 97
## Most common values:
## .
##                    White / Caucasian                            Uncertain 
##                                 4417                                 3736 
##             Black / African American                    Hispanic / Latino 
##                                 2245                                 2239 
## White / Caucasian, Hispanic / Latino 
##                                 1447 
## ----------------------------------
## Variable: date_modified
## Type: Date
## Missing: 0 (0%)
## Date range:
## Min: 2011-06-06
## Max: 2026-02-05
## ----------------------------------
## Variable: intptlat_city
## Type: numeric
## Missing: 1265 (8.2%)
## Numeric summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   13.38   31.95   34.18   35.41   40.47   67.74    1265 
## ----------------------------------
## Variable: intptlong_city
## Type: numeric
## Missing: 1265 (8.2%)
## Numeric summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -176.60 -113.54  -97.07  -96.91  -80.11  144.70    1265 
## ----------------------------------
## Variable: intptlat_county
## Type: numeric
## Missing: 37 (0.2%)
## Numeric summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   13.38   32.13   34.20   34.98   40.01   67.01      37 
## ----------------------------------
## Variable: intptlong_county
## Type: numeric
## Missing: 37 (0.2%)
## Numeric summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -164.19 -113.91  -97.29  -96.95  -80.28  144.70      37

This dataset contains 15,477 records representing unidentified persons, with 15 variables capturing identifiers, estimated demographics, temporal fields, and geographic information. The analysis below summarizes key structural characteristics and distributional patterns observed during an initial univariate exploration.

Identifiers and Case References

The case_number field uniquely identifies each record, with a one-to-one correspondence between rows and case numbers, confirming its suitability as a stable identifier for record tracking and joins.

The me_c_case field provides a medical examiner or coroner case reference and exhibits very high cardinality, with nearly all values unique (15,385 distinct values across 15,477 records). A small number of records are blank (55) or labeled generically (e.g., “ME/C Case N/A”, 5 records), and a few references repeat, indicating variability in how this identifier is populated. This field functions primarily as an administrative reference rather than an analytic feature.

Temporal Fields

The dbf (date body found) variable spans a wide historical range from 1915 through early 2026, indicating that the dataset includes both long-standing historical cases and contemporary unidentified persons. This temporal breadth supports longitudinal analysis but suggests the importance of stratifying by era to account for changes in investigative practices and reporting systems.

The date_modified field ranges from 2011 to 2026 and appears to reflect system updates or administrative activity rather than the timing of case events. As such, it is best interpreted as metadata related to record maintenance and recency.

Estimated Age Ranges

Age information is represented as a range via age_from and age_to. Both fields exhibit approximately 21% missingness, reflecting inherent uncertainty or unavailable information in unidentified persons cases. Where present, the distributions indicate that most estimated ages fall within adult ranges, with median lower and upper bounds of approximately 25 and 45 years, respectively. The presence of wide ranges and upper bounds extending beyond 100 highlights the interpretive uncertainty associated with age estimation in this context.
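For visualization, each estimated range can be summarized by a midpoint and a width, so that uncertainty is carried alongside the point estimate. The sketch below uses toy values, and the 40-year threshold for a "wide" range is an arbitrary assumption.

```r
# Toy age ranges; NA marks cases with no estimate
age_from <- c(20L, 25L, NA, 0L)
age_to   <- c(35L, 45L, NA, 110L)

ranges <- data.frame(
  age_from = age_from,
  age_to   = age_to,
  midpoint = (age_from + age_to) / 2,   # central estimate
  width    = age_to - age_from          # size of the uncertainty interval
)

# Flag very wide ranges (threshold is an arbitrary illustration)
ranges$wide <- !is.na(ranges$width) & ranges$width > 40
ranges
```

A range such as 0 to 110 carries essentially no age information, so plotting its midpoint alone would overstate certainty.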

Geographic Coverage

Geographic information is provided at the city, county, and state levels, although the city field is blank for 2,534 records (about 16%). The dataset spans 54 distinct state or territory codes, with the largest concentrations in California, Arizona, Texas, and New York. Compared to the missing and unclaimed persons datasets, unidentified persons cases show notable clustering in specific counties: Pima (1,319) and Brooks (413) rank among the top five alongside much more populous counties such as Los Angeles. This suggests localized investigative or reporting patterns.

Latitude and longitude values are available at both city and county levels. City-level coordinates exhibit moderate missingness (approximately 8%), while county-level coordinates are nearly complete. Coordinate ranges extend beyond the continental United States, consistent with inclusion of U.S. territories and edge-case records. County-level coordinates again appear to provide a more stable basis for spatial aggregation.

Demographic Characteristics

The biological_sex variable includes four recorded values. Male comprises the majority of cases (11,485, about 74%), followed by Female (2,854) and Unsure (1,135); 3 records are blank. This additional categorical complexity reflects uncertainty inherent in unidentified cases and should be considered in stratified analysis.

The race_ethnicity field contains 97 distinct values, the highest among the three datasets. About 24% of records (3,736) are labeled as Uncertain, alongside common categories such as White / Caucasian, Black / African American, and Hispanic / Latino. The prevalence of uncertainty and multi-category values underscores the need for careful normalization or grouping prior to comparative demographic analysis.

Overall Assessment

Overall, the unidentified persons dataset reflects the inherent uncertainty and complexity of cases where individual identity has not been established. Compared to the missing and unclaimed persons datasets, it exhibits higher missingness in age-related fields, greater categorical complexity in demographic variables, and pronounced geographic clustering. These characteristics make the dataset particularly well suited for descriptive, spatial, and exploratory analysis, while also emphasizing the importance of transparent handling of uncertainty in any downstream analysis or visualization.

Thoughts on Data Visualization

Beyond a queryable table, this dataset is especially well suited to interactive visualizations that foreground uncertainty, time, and place. Timeline views based on dbf can highlight historical persistence and recent case activity, while map-based visualizations can reveal geographic clustering and regional investigative patterns. Interactive age-range displays and demographic summaries that explicitly communicate uncertainty would be particularly valuable. Coordinated views linking maps, timelines, and summary panels would support exploratory workflows and allow users to move fluidly between aggregate patterns and individual case context.

Given the strong geographic clustering observed, population-normalized (per-capita) visualizations may be particularly informative for highlighting relative prominence beyond major metros, where raw counts alone tend to reflect population size rather than comparative concentration.
