Fishermen — Data Curation & De-duplication

0.1 Methods

For the capture data, we enforce exactly one row per fisherman per day. We standardize names and IDs, and rebuild the date when it arrives split into year, month, and day columns. Catch fields are parsed as numbers, and a total_kg is computed for each row. Duplicate (id, date) rows are collapsed into one by summing the numeric fields and keeping the most common value of each text field. The run writes audit CSVs and a run summary, and stops with an error if (id, date) is still not unique after these fixes.
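The collapse step can be illustrated with a minimal sketch. This is not the pipeline's own code: the column names (`fisher_id`, `date`, `fish_kg`, `lobster_kg`) are hypothetical, and the sketch assumes every row carries the same columns.

```python
from collections import Counter, defaultdict


def collapse_duplicates(rows, key_fields=("fisher_id", "date"),
                        numeric_fields=("fish_kg", "lobster_kg")):
    """Collapse rows sharing the same key into a single row.

    Numeric fields are summed; every other field keeps its most common value.
    Column names are illustrative assumptions, not the real schema.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in key_fields)].append(row)

    collapsed = []
    for key, members in groups.items():
        merged = dict(zip(key_fields, key))
        for field in members[0]:
            if field in key_fields:
                continue
            values = [m[field] for m in members if m.get(field) is not None]
            if field in numeric_fields:
                # Parse catch fields as numbers before summing.
                merged[field] = sum(float(v) for v in values)
            else:
                # Most common text value; ties resolve to the first one seen.
                merged[field] = Counter(values).most_common(1)[0][0] if values else None
        merged["total_kg"] = sum(merged[f] for f in numeric_fields)
        collapsed.append(merged)

    # Mirror the report's final guard: fail loudly if the key is not unique.
    keys = [tuple(r[k] for k in key_fields) for r in collapsed]
    if len(keys) != len(set(keys)):
        raise ValueError("(id, date) is still not unique after collapsing")
    return collapsed
```

Grouping on the key makes the output unique by construction, so the final check can only fail if the grouping logic is changed later; it is kept here as a safeguard in the same spirit as the stop-on-error described above.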

For the DCS matching, we first normalize text so that it is comparable: names are lowercased, trimmed, and de-accented; MRNs are reduced to the characters A–Z and 0–9; villages are uppercased. We then create links in three ways: (1) exact MRN matches, which serve as perfect anchors; (2) fuzzy MRN matches that tolerate small typos (up to two edits, or very high string similarity), accepted only if the names are also very similar and the ages are within about 3 years; and (3) name-only matches for rows without an MRN, scored by a weighted combination of name similarity (the largest weight), age proximity, and village agreement. For name-only matches we keep only mutual-best pairs, with an override for extremely similar names. Finally, we cluster connected records and assign each cluster a person_id, numbering MRN-anchored clusters first. Each row receives a match_pct: the score of its strongest link, scaled to a percentage.
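A minimal sketch of the normalization and the fuzzy-MRN gate follows. Several details are assumptions rather than the report's actual choices: the name-similarity measure (`difflib.SequenceMatcher` stands in for whatever the pipeline uses), the `name_min` cut-off of 0.95, the record keys `mrn`, `name`, and `age`, and the treatment of a missing age as a failed check. Only the "up to two edits" branch is shown; the "very high string similarity" alternative is omitted.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name):
    """Lowercase, strip accents, and collapse surrounding/internal whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def normalize_mrn(mrn):
    """Keep only the characters A-Z and 0-9."""
    return re.sub(r"[^A-Z0-9]", "", mrn.upper())


def normalize_village(village):
    """Uppercase and collapse whitespace."""
    return " ".join(village.upper().split())


def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def is_fuzzy_mrn_link(a, b, max_edits=2, name_min=0.95, max_age_gap=3):
    """Accept a near-miss MRN pair only if name and age also agree.

    name_min is a hypothetical threshold; the report only says "very similar".
    """
    mrn_a, mrn_b = normalize_mrn(a["mrn"]), normalize_mrn(b["mrn"])
    if not mrn_a or not mrn_b or mrn_a == mrn_b:
        return False  # missing MRNs and exact matches are handled elsewhere
    if edit_distance(mrn_a, mrn_b) > max_edits:
        return False
    name_sim = SequenceMatcher(None, normalize_name(a["name"]),
                               normalize_name(b["name"])).ratio()
    if name_sim < name_min:
        return False
    if a["age"] is None or b["age"] is None:
        return False  # assumption: an unknown age cannot confirm the link
    return abs(a["age"] - b["age"]) <= max_age_gap
```

For example, `normalize_name("  José  Pérez ")` gives `"jose perez"` and `normalize_mrn("ab-12 3")` gives `"AB123"`. Note that plain Levenshtein distance counts a swap of two adjacent characters as two edits, so a single transposition typo already uses the whole two-edit budget.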

0.2 Key metrics

Metric	Value
Rows evaluated	5172
Unique people (clusters)	1778
MRN-anchored clusters	1622
Fuzzy-only clusters	156
Singleton people	980
Rows in matched clusters	4192
Edges source	Strict
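The counts in this table are the kind of summary that falls out of a connected-components pass over the link edges. The sketch below uses a small union-find and is illustrative only: the function name, the `has_mrn` flag list, and the reading of "MRN-anchored" as "the cluster contains at least one row with an MRN" are assumptions.

```python
from collections import Counter


def cluster_and_summarise(has_mrn, edges):
    """Cluster rows connected by edges and number MRN-anchored clusters first.

    has_mrn: list of booleans, one per row (hypothetical input).
    edges:   iterable of (row_index, row_index) links.
    """
    n = len(has_mrn)
    parent = list(range(n))

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    roots = [find(i) for i in range(n)]
    sizes = Counter(roots)
    anchored = {root for i, root in enumerate(roots) if has_mrn[i]}

    # MRN-anchored clusters receive the lowest person_id values.
    ordered = sorted(sizes, key=lambda root: (root not in anchored, root))
    person_id = {root: k for k, root in enumerate(ordered, start=1)}

    singletons = sum(1 for size in sizes.values() if size == 1)
    summary = {
        "rows": n,
        "people": len(sizes),
        "mrn_anchored": len(anchored),
        "fuzzy_only": len(sizes) - len(anchored),
        "singletons": singletons,
        "rows_in_matched_clusters": n - singletons,
    }
    return [person_id[r] for r in roots], summary
```

Under this reading the table is internally consistent: 1622 + 156 = 1778 clusters, and 5172 - 980 = 4192 rows in clusters of size two or more.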

0.3 Parameters actually used

source	name_weight	age_weight	village_weight	name_score_threshold	name_override_threshold	precision_vs_MRN	recall_vs_MRN	f1_vs_MRN
Strict defaults	0.85	0.1	0.05	0.965	0.975	NA	NA	NA
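To make the weights and thresholds concrete, here is a hedged sketch of how a name-only score could be computed with the strict defaults. The weights and the two thresholds come from the table; everything else is an assumption: `SequenceMatcher` as the name-similarity measure, a linear age-proximity term that reaches zero at a 3-year gap, the record keys, and the reading that 0.965 applies to the weighted score while 0.975 applies to the raw name similarity.

```python
from difflib import SequenceMatcher
from itertools import combinations

NAME_W, AGE_W, VILLAGE_W = 0.85, 0.10, 0.05  # strict defaults from the table
SCORE_THRESHOLD = 0.965     # assumed to apply to the weighted score
OVERRIDE_THRESHOLD = 0.975  # assumed to apply to name similarity alone


def age_proximity(age_a, age_b, max_gap=3.0):
    """1.0 for equal ages, falling linearly to 0.0 at max_gap (an assumption)."""
    if age_a is None or age_b is None:
        return 0.0
    return max(0.0, 1.0 - abs(age_a - age_b) / max_gap)


def pair_score(a, b):
    """Return (name_similarity, weighted_score) for two normalized records."""
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    village = 1.0 if a["village"] and a["village"] == b["village"] else 0.0
    score = (NAME_W * name_sim
             + AGE_W * age_proximity(a["age"], b["age"])
             + VILLAGE_W * village)
    return name_sim, score


def strict_name_links(records):
    """Keep mutual-best pairs above the threshold, plus near-identical names."""
    scores, best = {}, {}
    for i, j in combinations(range(len(records)), 2):
        name_sim, score = pair_score(records[i], records[j])
        scores[(i, j)] = (name_sim, score)
        for x, y in ((i, j), (j, i)):
            if x not in best or score > best[x][1]:
                best[x] = (y, score)
    links = []
    for (i, j), (name_sim, score) in scores.items():
        mutual = best[i][0] == j and best[j][0] == i
        if (mutual and score >= SCORE_THRESHOLD) or name_sim >= OVERRIDE_THRESHOLD:
            links.append((i, j))
    return links
```

With a name weight of 0.85, age and village together contribute at most 0.15, so a pair needs a name similarity of at least (0.965 - 0.15) / 0.85, roughly 0.959, to clear the score threshold even when age and village agree perfectly. The all-pairs loop is quadratic and is only meant for illustration; a real run over thousands of rows would normally restrict comparisons to candidate blocks.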

0.4 Performance vs pseudo-gold (exact MRN pairs)

predicted	gold	tp	precision	recall	f1
24165	19621	19621	0.812	1	0.896
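The three ratios can be reproduced from the pair counts in the row above. The helper name below is hypothetical; the formulas are the standard definitions of precision, recall, and F1.

```python
def pair_metrics(predicted, gold, tp):
    """Precision, recall, and F1 from pair counts."""
    precision = tp / predicted if predicted else 0.0
    recall = tp / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


precision, recall, f1 = pair_metrics(predicted=24165, gold=19621, tp=19621)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.812 1.0 0.896
```

Recall is 1 because every gold pair is also a predicted pair (tp equals gold). Precision below 1 therefore reflects the 24165 - 19621 = 4544 predicted pairs that do not share an exact MRN, which is not the same as saying those pairs are wrong, since the gold standard here is only a pseudo-gold.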

0.5 QC snapshots

rows	% missing name	% missing age	% missing village	% missing MRN
5172	0	0.8	0	3.5

edge_type	edges
mrn_exact	19621
mrn_fuzzy	2417
name_fuzzy_strict	25

Walter Chin, Oswaldo Huchim

2025-08-29

0.6 Largest cluster network

0.7 Row-level match% buckets

0.8 Flowchart

0.9 Appendix: session info