0.1 Methods

For the capture data, we enforce one row per fisherman per day. We standardize names and IDs, and rebuild the date when it arrives split into year, month, and day columns. Catch fields are parsed as numbers, and we compute total_kg for each row. Duplicate (id, date) rows are collapsed into one by summing the numeric fields and keeping the most common value of each text field. We write out audit CSVs and a run summary, and the run stops with an error if (id, date) is still not unique after these fixes.
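The duplicate-collapsing step can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the function name `collapse_duplicates`, the example field names (`landing_site`, `fish_kg`, `crab_kg`), and the tie-breaking rule for equally common text values are all assumptions.

```python
from collections import Counter, defaultdict


def collapse_duplicates(rows, key_fields=("id", "date")):
    """Collapse rows that share the same key into a single row.

    Numeric fields are summed; text fields keep their most common value.
    (Hypothetical helper; field names in the example are illustrative.)
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in key_fields)].append(row)

    collapsed = []
    for key, members in groups.items():
        merged = dict(zip(key_fields, key))
        for field in members[0]:
            if field in key_fields:
                continue
            values = [m[field] for m in members if m[field] is not None]
            if not values:
                merged[field] = None
            elif all(isinstance(v, (int, float)) for v in values):
                merged[field] = sum(values)
            else:
                # Assumption: ties go to the value seen first.
                merged[field] = Counter(values).most_common(1)[0][0]
        collapsed.append(merged)

    # Final uniqueness check: fail loudly rather than pass bad data on.
    keys = [tuple(r[k] for k in key_fields) for r in collapsed]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate (id, date) rows remain after collapsing")
    return collapsed


example_rows = [
    {"id": "F01", "date": "2024-03-01", "landing_site": "North",
     "fish_kg": 2.5, "crab_kg": 1.0},
    {"id": "F01", "date": "2024-03-01", "landing_site": "North",
     "fish_kg": 1.5, "crab_kg": 0.0},
    {"id": "F02", "date": "2024-03-01", "landing_site": "South",
     "fish_kg": 3.0, "crab_kg": 0.5},
]
collapsed_rows = collapse_duplicates(example_rows)
```

Here the two F01 rows merge into one with `fish_kg` 4.0 and `crab_kg` 1.0, while the F02 row passes through unchanged.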

For the DCS matching, we first normalize text so that values are comparable: names are lowercased, trimmed, and de-accented; MRNs are reduced to the characters A–Z and 0–9; villages are uppercased. We then create links in three ways: (1) exact MRN matches, which serve as perfect anchors; (2) fuzzy MRN matches that tolerate small typos (up to two edits, or very high string similarity), accepted only if the names are also very similar and the ages are within about 3 years; and (3) name-only matches for rows without MRNs, scored with a weighted combination of name similarity (largest weight), age proximity, and village agreement. For name-only matches we keep only mutual-best pairs, with an override for extremely similar names. Finally, we cluster connected records and assign each cluster a person_id, numbering MRN-anchored clusters first. Each row gets a match_pct: the score of its strongest link, scaled to a percentage.
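The normalization and clustering steps might look like the sketch below. It rests on several assumptions: the function names are hypothetical, internal whitespace in names is also collapsed, MRNs are uppercased before filtering, clustering is done with a simple union-find over the link list, and a cluster counts as MRN-anchored when it contains at least one row that takes part in an exact-MRN link.

```python
import re
import unicodedata


def normalize_name(name):
    """Lowercase, trim, and strip accents (also collapses inner spaces)."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.lower().split())


def normalize_mrn(mrn):
    """Keep only the characters A-Z and 0-9 (after uppercasing)."""
    return re.sub(r"[^A-Z0-9]", "", mrn.upper())


def normalize_village(village):
    return village.strip().upper()


def assign_person_ids(n_rows, edges, anchored_rows):
    """Cluster linked rows and number MRN-anchored clusters first.

    edges: iterable of (row_a, row_b) index pairs from any link type.
    anchored_rows: set of row indices that appear in an exact-MRN link
    (assumed definition of "MRN-anchored").
    """
    parent = list(range(n_rows))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for i in range(n_rows):
        clusters.setdefault(find(i), []).append(i)

    # Anchored clusters sort first; ties are broken by lowest row index.
    ordered = sorted(
        clusters.values(),
        key=lambda rows: (not any(r in anchored_rows for r in rows), rows[0]),
    )
    person_id = [None] * n_rows
    for pid, rows in enumerate(ordered, start=1):
        for r in rows:
            person_id[r] = pid
    return person_id
```

For example, with five rows where rows 3 and 4 share an exact MRN and rows 0 and 1 are linked by name only, `assign_person_ids(5, [(3, 4), (0, 1)], {3, 4})` gives the MRN-anchored pair person_id 1, the name-linked pair person_id 2, and the singleton row person_id 3.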

0.2 Key metrics

Metric                     Value
Rows evaluated             5172
Unique people (clusters)   1778
MRN-anchored clusters      1622
Fuzzy-only clusters        156
Singleton people           980
Rows in matched clusters   4192
Edges source               Strict

0.3 Parameters actually used

Parameter                 Value
source                    Strict defaults
name_weight               0.85
age_weight                0.1
village_weight            0.05
name_score_threshold      0.965
name_override_threshold   0.975
precision_vs_MRN          NA
recall_vs_MRN             NA
f1_vs_MRN                 NA
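To make the role of these parameters concrete, here is a hedged sketch of how a name-only pair could be scored and filtered. Only the weights and the two thresholds come from the table. Everything else is an assumption: the function names, the linear age-proximity curve over a 3-year window, the treatment of a missing age as zero credit, and the reading that name_score_threshold applies to the combined score while name_override_threshold applies to the raw name similarity.

```python
# Weights and thresholds as listed in the parameters table.
NAME_WEIGHT = 0.85
AGE_WEIGHT = 0.10
VILLAGE_WEIGHT = 0.05
NAME_SCORE_THRESHOLD = 0.965
NAME_OVERRIDE_THRESHOLD = 0.975

# Assumption: age credit falls linearly from 1 to 0 over 3 years.
AGE_WINDOW_YEARS = 3


def age_proximity(age_a, age_b, window=AGE_WINDOW_YEARS):
    """Return 1.0 for equal ages, decreasing linearly to 0.0 at `window`."""
    if age_a is None or age_b is None:
        return 0.0  # assumption: a missing age earns no credit
    return max(0.0, 1.0 - abs(age_a - age_b) / window)


def weighted_score(name_sim, age_a, age_b, village_a, village_b):
    """Combine name similarity (0..1), age proximity, and village match."""
    village_match = 1.0 if village_a == village_b else 0.0
    return (NAME_WEIGHT * name_sim
            + AGE_WEIGHT * age_proximity(age_a, age_b)
            + VILLAGE_WEIGHT * village_match)


def keep_pair(score, name_sim, is_mutual_best):
    """Apply the threshold, then the mutual-best rule with its override.

    Assumed reading: the score must clear NAME_SCORE_THRESHOLD; the pair
    must then be mutual-best unless the raw name similarity reaches
    NAME_OVERRIDE_THRESHOLD.
    """
    if score < NAME_SCORE_THRESHOLD:
        return False
    return is_mutual_best or name_sim >= NAME_OVERRIDE_THRESHOLD
```

Because the three weights sum to 1, a pair with identical names, equal ages, and the same village scores 1.0 under this sketch, which would correspond to a match_pct of 100.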

0.4 Performance vs pseudo-gold (exact MRN pairs)

predicted   gold    tp      precision   recall   f1
24165       19621   19621   0.812       1        0.896
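The three rates follow directly from the pair counts. The short sketch below recomputes them; the function name `pair_metrics` is hypothetical, and the formulas are the standard definitions of precision, recall, and F1. Recall of 1 is expected here if, as the Methods text suggests, every exact-MRN pair is itself a link, so every gold pair is also a predicted pair.

```python
def pair_metrics(predicted, gold, tp):
    """Precision, recall, and F1 from pair counts."""
    precision = tp / predicted if predicted else 0.0
    recall = tp / gold if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


precision, recall, f1 = pair_metrics(predicted=24165, gold=19621, tp=19621)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.812 1.0 0.896
```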

0.5 QC snapshots

rows   % missing name   % missing age   % missing village   % missing MRN
5172   0                0.8             0                   3.5

edge_type           edges
mrn_exact           19621
mrn_fuzzy           2417
name_fuzzy_strict   25

0.6 Largest cluster network

0.7 Row-level match% buckets

0.8 Flowchart

0.9 Appendix: session info