goldenmatch.core.quality that reuses a GoldenCheck public
API and returns nothing when there’s no signal.
The four levers
Results — quality-weighted survivorship
When building a cluster’s golden record, prefer the higher-quality cell: the canonical spelling over a typo, a real date over a future-dated one. Driven by GoldenRulesConfig.quality_weighting (on by default; a no-op on clean data, so it costs nothing until there is an actual quality issue).
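As a rough illustration of the idea, the sketch below picks a field's golden value by a quality-weighted vote. It is not GoldenMatch's implementation: the function name pick_survivor and the per-cell quality scores in [0, 1] are assumptions made for this example. When no scores are supplied, every cell weighs the same and the result is a plain majority vote, which mirrors the "no-op on clean data" behaviour described above.

```python
def pick_survivor(values, quality=None):
    """Choose one golden value for a single field of a single cluster.

    values:  candidate cell values, one per record in the cluster
    quality: hypothetical per-cell quality scores in [0, 1];
             None means there is no quality signal
    """
    if quality is None:
        # No signal: every cell weighs 1.0, so this is a plain majority vote.
        quality = [1.0] * len(values)

    weights = {}
    for value, score in zip(values, quality):
        if value is None:
            continue  # missing cells never win
        weights[value] = weights.get(value, 0.0) + score

    if not weights:
        return None
    # max() keeps the first maximal key, so ties go to the first-seen value.
    return max(weights, key=weights.get)


cells = ["Californa", "Californa", "California"]
unweighted = pick_survivor(cells)                  # majority: the typo wins
weighted = pick_survivor(cells, [0.2, 0.2, 1.0])   # flagged typos are down-weighted
```

Here the two misspelled cells outvote the correct one until they are down-weighted, at which point the canonical spelling survives.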
Recall — quality-aware blocking
Edit-distance variants that survive normalization (Californa vs California)
otherwise shard true duplicates into different blocks and are lost before
scoring runs. With this on, GoldenMatch adds a fuzzy-tolerant blocking pass for
flagged columns so the variants co-block. Purely additive — recall can only rise.
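One way to make single-edit variants co-block is a deletion-neighbourhood key: each value is indexed under itself and under every string obtained by deleting one character, so two values within one edit of each other share at least one key. The sketch below assumes that technique purely for illustration; the source does not say which fuzzy scheme GoldenMatch uses, and deletion_keys and candidate_pairs are hypothetical names. Because the exact key is always kept, the fuzzy pass can only add candidate pairs.

```python
from collections import defaultdict
from itertools import combinations


def deletion_keys(value):
    """The value itself plus every string with exactly one character removed."""
    keys = {value}
    for i in range(len(value)):
        keys.add(value[:i] + value[i + 1:])
    return keys


def candidate_pairs(records, column, fuzzy=False):
    """Return the set of record-id pairs that share at least one blocking key.

    records: dict mapping record id -> dict of column values
    fuzzy:   when True, also index each value under its deletion keys
    """
    blocks = defaultdict(set)
    for rid, rec in records.items():
        value = rec[column].lower()
        keys = deletion_keys(value) if fuzzy else {value}
        for key in keys:
            blocks[key].add(rid)

    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {
    1: {"state": "California"},
    2: {"state": "Californa"},
    3: {"state": "Texas"},
}
exact_pairs = candidate_pairs(records, "state")              # variants land in separate blocks
fuzzy_pairs = candidate_pairs(records, "state", fuzzy=True)  # variants co-block
```

With exact keys the two spellings never meet; with the fuzzy pass, deleting the second "i" from "california" yields "californa", so records 1 and 2 become a candidate pair.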
Precision — FD-driven negative evidence
A column that functionally determines others (acct → name) is a data-driven
identity anchor, even when its name doesn’t look like an id. Disagreement on
such a column is strong evidence two records are not the same entity, so
GoldenMatch admits it as a negative-evidence field the name heuristic would miss.
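A minimal sketch of the reasoning, under stated assumptions: it checks an exact functional dependency on a handful of rows and then reports disagreement on the anchor column as negative evidence. The helper names determines and negative_evidence are hypothetical, and a real profiler would likely tolerate a small rate of violations rather than require the dependency to hold exactly.

```python
def determines(rows, lhs, rhs):
    """True if every non-null value of `lhs` maps to a single value of `rhs`."""
    mapping = {}
    for row in rows:
        key, value = row[lhs], row[rhs]
        if key is None:
            continue
        # setdefault stores the first value seen for this key and returns it.
        if mapping.setdefault(key, value) != value:
            return False
    return True


def negative_evidence(a, b, anchors):
    """Anchor columns on which both records are populated and disagree."""
    return [
        col for col in anchors
        if a[col] is not None and b[col] is not None and a[col] != b[col]
    ]


rows = [
    {"acct": "A1", "name": "Ann Lee", "city": "Austin"},
    {"acct": "A1", "name": "Ann Lee", "city": "Dallas"},
    {"acct": "B2", "name": "Bob Ray", "city": "Austin"},
    {"acct": "C3", "name": "Ann Lee", "city": "Boston"},
]

# acct determines name, but not the other way round, so acct is the anchor.
anchors = [c for c in ("acct", "name") if determines(rows, c, "name" if c == "acct" else "acct")]

# Same name, different account: the anchor disagrees.
conflict = negative_evidence(rows[0], rows[3], anchors)
```

The first and last rows share a name but carry different acct values, so the pair picks up negative evidence that a name-based heuristic alone would not surface.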
Trust — quality-gated review
A confident match score measures string agreement, not whether that agreement rests on trustworthy data. A high-scoring pair built on a GoldenCheck-flagged cell is held for review instead of auto-merged, with the reason attached so the steward sees why.
Safety & posture
- Opt-in, default OFF (except survivorship, which is a no-op when clean). With a flag off, matching is byte-identical to GoldenCheck-free behaviour.
- Additive — no door ever removes a match, a blocking key, or a decision; the worst case for blocking/review is “more candidates / more review items”.
- Benchmark-gated — defaults flip on only after a measured win on the reference ER datasets (DBLP-ACM / Febrl3 / NCVR), not on the assumption that more quality info must help.
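The opt-in and additive properties can be illustrated with the quality-gated review lever. The sketch below is an assumption-laden toy, not GoldenMatch code: decide, gate_enabled, and both thresholds are hypothetical. With the gate off, the decision depends on the score alone; with it on, the only possible change is an auto-merge being downgraded to a review item with a reason attached.

```python
def decide(score, flagged, gate_enabled=False,
           merge_threshold=0.9, review_threshold=0.7):
    """Return (decision, reason) for one candidate pair.

    score:   match score in [0, 1]
    flagged: set of column names whose cells were flagged as low quality
    Thresholds and the gate flag are illustrative, not library defaults.
    """
    if score >= merge_threshold:
        decision = "merge"
    elif score >= review_threshold:
        decision = "review"
    else:
        decision = "no_match"

    reason = None
    # The gate never removes a match; it can only route a merge to review.
    if gate_enabled and decision == "merge" and flagged:
        decision = "review"
        reason = "match rests on flagged cells: " + ", ".join(sorted(flagged))
    return decision, reason


gate_off = decide(0.95, {"state"})
gate_on = decide(0.95, {"state"}, gate_enabled=True)
gate_on_clean = decide(0.95, set(), gate_enabled=True)
```

The worst case with the gate enabled is one more review item; a pair with no flagged cells is decided exactly as it would be with the gate off.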