Methods
For the capture data, we enforce exactly one row per fisherman per day.
We standardize names and IDs, and rebuild the date when it arrives split
into year, month, and day columns. Catch fields are parsed as numbers,
and we compute a total_kg for each row. When duplicate (id, date) rows
exist, we collapse them into one by summing the numeric fields and
keeping the most common value of each text field. We write audit CSVs
and a run summary, and the run stops with an error if (id, date) is
still not unique after these fixes.
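
The collapse step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the pipeline's actual code: the function name `collapse_duplicates`, the catch columns `catch_a_kg` and `catch_b_kg`, and the tie-breaking rule (first value seen wins) are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def collapse_duplicates(rows, key_fields=("id", "date"),
                        numeric_fields=("catch_a_kg", "catch_b_kg")):
    """Collapse rows that share (id, date) into a single row.

    Numeric fields are summed; text fields keep their most common
    non-empty value. Column names are hypothetical.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in key_fields)].append(row)

    collapsed = []
    for key, members in groups.items():
        merged = dict(zip(key_fields, key))
        for field in members[0]:
            if field in key_fields:
                continue
            values = [m[field] for m in members]
            if field in numeric_fields:
                # Empty or missing catch values count as 0 (an assumption).
                merged[field] = sum(float(v or 0) for v in values)
            else:
                non_empty = [v for v in values if v]
                # Counter breaks ties in favour of the value seen first.
                merged[field] = (Counter(non_empty).most_common(1)[0][0]
                                 if non_empty else "")
        merged["total_kg"] = sum(merged[f] for f in numeric_fields)
        collapsed.append(merged)

    # Mirror the pipeline's hard stop: fail if the key is still not unique.
    keys = [tuple(r[k] for k in key_fields) for r in collapsed]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate (id, date) rows remain after collapsing")
    return collapsed

sample_rows = [
    {"id": "F001", "date": "2024-03-01", "village": "A", "catch_a_kg": "2.5", "catch_b_kg": "1"},
    {"id": "F001", "date": "2024-03-01", "village": "A", "catch_a_kg": "1.5", "catch_b_kg": ""},
    {"id": "F002", "date": "2024-03-01", "village": "B", "catch_a_kg": "4", "catch_b_kg": "0"},
]
collapsed_rows = collapse_duplicates(sample_rows)
```

In this toy input the two F001 rows merge into one row with a total_kg of 5.0, and F002 passes through unchanged.
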
For the DCS matching, we first normalize text so values are comparable:
names are lowercased, trimmed, and de-accented; MRNs are reduced to the
characters A–Z and 0–9; villages are uppercased. We then create links in
three ways: exact MRN matches (treated as perfect anchors); fuzzy MRN
matches that tolerate small typos (up to two edits, or very high string
similarity), accepted only if the names are also very similar and the
ages are within about 3 years; and name-only matches for rows without
MRNs, using a weighted score that combines name similarity (largest
weight), age proximity, and whether the village matches. For name-only
links we keep only mutual-best pairs, with an override for extremely
similar names. Finally, we cluster connected records and assign each
cluster a person_id (MRN-anchored clusters first). Each row gets a
match_pct based on its strongest link, scaled to a percentage.
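
To make the normalization and the two fuzzy rules concrete, here is a rough sketch in standard-library Python. It is an illustration under stated assumptions rather than the report's implementation: all function names are hypothetical, `difflib.SequenceMatcher` stands in for whatever string-similarity measure was actually used, the score is assumed to be a plain weighted sum, and the thresholds `name_min` and `mrn_sim_min` are placeholders. The weights 0.85, 0.10, and 0.05 echo the first three values in the parameters table, but mapping them to name, age, and village is a guess. Uppercasing MRNs before filtering is also an assumption.

```python
import re
import unicodedata
from difflib import SequenceMatcher

def normalize_name(name):
    """Lowercase, trim, collapse whitespace, and strip accents."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())

def normalize_mrn(mrn):
    """Keep only A-Z and 0-9 (uppercasing first is an assumption)."""
    return re.sub(r"[^A-Z0-9]", "", mrn.upper())

def normalize_village(village):
    return village.strip().upper()

def edit_distance(a, b):
    """Levenshtein distance, computed one DP row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def name_similarity(a, b):
    """Similarity in [0, 1]; SequenceMatcher is a stand-in measure."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()

def mrn_fuzzy_link(rec_a, rec_b, max_edits=2, mrn_sim_min=0.95,
                   name_min=0.90, max_age_gap=3):
    """Fuzzy MRN rule: near-identical MRNs, gated by name and age."""
    mrn_a, mrn_b = normalize_mrn(rec_a["mrn"]), normalize_mrn(rec_b["mrn"])
    if not mrn_a or not mrn_b or mrn_a == mrn_b:
        return False  # exact matches are handled by the exact-MRN rule
    close_mrn = (edit_distance(mrn_a, mrn_b) <= max_edits
                 or SequenceMatcher(None, mrn_a, mrn_b).ratio() >= mrn_sim_min)
    return (close_mrn
            and name_similarity(rec_a["name"], rec_b["name"]) >= name_min
            and abs(rec_a["age"] - rec_b["age"]) <= max_age_gap)

def name_only_score(rec_a, rec_b, w_name=0.85, w_age=0.10, w_village=0.05,
                    max_age_gap=3):
    """Weighted sum of name similarity, age proximity, and village match."""
    name_part = name_similarity(rec_a["name"], rec_b["name"])
    age_part = max(0.0, 1.0 - abs(rec_a["age"] - rec_b["age"]) / max_age_gap)
    same_village = normalize_village(rec_a["village"]) == normalize_village(rec_b["village"])
    return w_name * name_part + w_age * age_part + w_village * (1.0 if same_village else 0.0)

rec_a = {"name": "  José  Álvarez ", "mrn": "ab-1234", "age": 40, "village": "kisumu "}
rec_b = {"name": "Jose Alvarez", "mrn": "AB1235", "age": 42, "village": "KISUMU"}
linked = mrn_fuzzy_link(rec_a, rec_b)
score = name_only_score(rec_a, rec_b)
```

The two sample records differ by one MRN character and two years of age, so the fuzzy MRN rule links them. The mutual-best filter and the override for extremely similar names are left out of this sketch.
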
Key metrics
| Rows evaluated | 5172 |
| Unique people (clusters) | 1778 |
| MRN-anchored clusters | 1622 |
| Fuzzy-only clusters | 156 |
| Singleton people | 980 |
| Rows in matched clusters | 4192 |
| Edges source | Strict |
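
The cluster-based metrics can be related to each other with a small sketch. It assumes that clusters are connected components of the link graph and that "rows in matched clusters" means rows in clusters of size two or more. Both are inferences from the figures above, not statements from the pipeline. The union-find helper and the toy edge list are hypothetical.

```python
from collections import Counter

def cluster_records(n_rows, edges):
    """Label each row with its connected component using union-find."""
    parent = list(range(n_rows))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
    return [find(i) for i in range(n_rows)]

def summarize(labels):
    """Derive the headline counts from per-row cluster labels."""
    sizes = Counter(labels)
    singletons = sum(1 for size in sizes.values() if size == 1)
    return {
        "rows": len(labels),
        "clusters": len(sizes),
        "singletons": singletons,
        "rows_in_matched_clusters": len(labels) - singletons,
    }

# Toy example: 6 rows, links 0-1, 1-2, and 3-4; row 5 has no links.
labels = cluster_records(6, [(0, 1), (1, 2), (3, 4)])
summary = summarize(labels)

# The reported figures are consistent with this reading:
assert 5172 - 980 == 4192   # rows evaluated minus singletons
assert 1622 + 156 == 1778   # MRN-anchored plus fuzzy-only clusters
```

Under this reading, the toy data yields three clusters, one singleton, and five rows in matched clusters.
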
Parameters actually used
| Strict defaults | 0.85 | 0.1 | 0.05 | 0.965 | 0.975 | NA | NA | NA |
QC snapshots
| mrn_exact | 19621 |
| mrn_fuzzy | 2417 |
| name_fuzzy_strict | 25 |
Largest cluster network

Row-level match% buckets

Appendix: session info