Blocking
Comparing every pair of records is quadratic and infeasible at scale. Blocking reduces candidate pairs by grouping records that share a key (for example, the same ZIP or the same soundex of a surname) and only comparing within blocks. The blocking key is the single biggest performance lever. For N rows split into blocks of sizes n1, n2, ..., the number of comparisons drops from N(N-1)/2 to the sum of n_i(n_i-1)/2 over all blocks.
Blocking strategies include static, adaptive, sorted_neighborhood, multi_pass, ann, ann_pairs, canopy, and learned (data-driven predicate selection).
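To make the pair-count reduction concrete, here is a minimal sketch of key-based blocking in plain Python. It does not use the GoldenMatch API; the helper names block_by_key and candidate_pairs and the sample records are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_by_key(records, key_fn):
    """Group records that share the same blocking key (hypothetical helper)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    return blocks


def candidate_pairs(blocks):
    """Yield record pairs, comparing only within each block."""
    for members in blocks.values():
        yield from combinations(members, 2)


# Illustrative data, not taken from any real dataset.
records = [
    {"id": 1, "name": "Ann Smith", "zip": "10001"},
    {"id": 2, "name": "Anne Smith", "zip": "10001"},
    {"id": 3, "name": "Bob Jones", "zip": "94105"},
    {"id": 4, "name": "Robert Jones", "zip": "94105"},
    {"id": 5, "name": "Cara Lee", "zip": "60601"},
]

blocks = block_by_key(records, lambda r: r["zip"])
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]

n = len(records)
all_pairs = n * (n - 1) // 2  # pairs without blocking
print(f"without blocking: {all_pairs} pairs, with blocking: {len(pairs)} pairs")
```

With five records, blocking on ZIP cuts the comparisons from 10 to 2. The trade-off is recall: a true match whose two records carry different ZIPs never becomes a candidate pair under this key.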
Scoring
Within each block, candidate pairs are scored. A matchkey is a named rule that combines one or more fields, each with a scorer and a weight, and a threshold above which a pair is a match. GoldenMatch provides 12+ scoring methods, including:
- String similarity: exact, jaro_winkler, levenshtein, token_sort, soundex_match, dice, jaccard.
- Name-aware: name_freq_weighted_jw (surname IDF-weighted), given_name_aliased_jw (alias-aware).
- Embedding: embedding, record_embedding.
- Probabilistic: Fellegi-Sunter EM-trained m/u probabilities with automatic threshold estimation.
- LLM: GPT-4o-mini or Claude scores borderline pairs, with budget caps and graceful degradation.
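The idea of a matchkey (fields, scorers, weights, and a threshold) can be sketched in a few lines of Python. This is an illustration only, not GoldenMatch's configuration format or implementation: the dictionary layout, the score_pair and is_match helpers, the simplified scorers, the weighted-average combination, and the choice of treating a score equal to the threshold as a match are all assumptions.

```python
def exact(a, b):
    """1.0 when the values are identical, else 0.0."""
    return 1.0 if a == b else 0.0


def jaccard(a, b):
    """Token-set Jaccard similarity on lowercased, whitespace-split text."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical matchkey layout: each field has a scorer and a weight.
MATCHKEY = {
    "name": "name_and_zip",
    "fields": [
        {"field": "name", "scorer": jaccard, "weight": 0.7},
        {"field": "zip", "scorer": exact, "weight": 0.3},
    ],
    "threshold": 0.6,
}


def score_pair(rec_a, rec_b, matchkey):
    """Weighted average of per-field scores (an assumed combination rule)."""
    total_weight = sum(f["weight"] for f in matchkey["fields"])
    weighted = sum(
        f["weight"] * f["scorer"](rec_a[f["field"]], rec_b[f["field"]])
        for f in matchkey["fields"]
    )
    return weighted / total_weight


def is_match(rec_a, rec_b, matchkey):
    """A pair matches when its score reaches the threshold (assumed >=)."""
    return score_pair(rec_a, rec_b, matchkey) >= matchkey["threshold"]


a = {"name": "Ann Marie Smith", "zip": "10001"}
b = {"name": "Ann Smith", "zip": "10001"}
c = {"name": "Bob Jones", "zip": "10001"}

print(round(score_pair(a, b, MATCHKEY), 3), is_match(a, b, MATCHKEY))
print(round(score_pair(a, c, MATCHKEY), 3), is_match(a, c, MATCHKEY))
```

Records a and b share two of three name tokens and the same ZIP, so they clear the threshold. Records a and c share only the ZIP, which on its own contributes 0.3 and is not enough.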
Clustering
Pairwise matches are transitively closed into clusters: if A matches B and B matches C, then A, B, and C form one cluster. Clusters are labeled strong, weak, or split. Oversized clusters are automatically split on their weakest edge using a minimum spanning tree, which prevents a single shared value (a common info@ email, say) from collapsing unrelated records into one giant cluster.
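Transitive closure over pairwise matches is commonly computed with a union-find (disjoint-set) structure. The sketch below shows that general technique under the assumption that matches arrive as pairs of record IDs. The cluster_matches function is a hypothetical name, and the weakest-edge splitting of oversized clusters is not shown.

```python
def cluster_matches(record_ids, matches):
    """Group record IDs into clusters by transitive closure of matched pairs."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(g) for g in groups.values())


ids = ["A", "B", "C", "D", "E", "F"]
matches = [("A", "B"), ("B", "C"), ("D", "E")]
clusters = cluster_matches(ids, matches)
print(clusters)
```

A and C end up in the same cluster even though they were never directly matched, and F, which matched nothing, stays a singleton.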
Survivorship
A golden record is the canonical record synthesized from a cluster. GoldenMatch offers five merge strategies:
- most_complete
- majority_vote
- source_priority
- most_recent
- first_non_null
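As a rough illustration of survivorship, the sketch below merges a cluster field by field using two of the strategy names. The semantics are assumptions inferred from the names alone (majority_vote picks the most frequent non-null value, first_non_null picks the first non-null value in record order), and merge_cluster is a hypothetical helper, not GoldenMatch's API.

```python
from collections import Counter


def first_non_null(values):
    """Return the first value that is not None (assumed semantics)."""
    return next((v for v in values if v is not None), None)


def majority_vote(values):
    """Return the most frequent non-null value (assumed semantics)."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    return Counter(present).most_common(1)[0][0]


def merge_cluster(records, strategy_by_field):
    """Build one golden record by applying a strategy to each field."""
    return {
        field: strategy([r.get(field) for r in records])
        for field, strategy in strategy_by_field.items()
    }


# Illustrative cluster of three records believed to be the same person.
cluster = [
    {"name": "Ann Smith", "email": None, "phone": "555-0100"},
    {"name": "Ann Smith", "email": "ann@example.com", "phone": None},
    {"name": "A. Smith", "email": "ann@example.com", "phone": "555-0100"},
]

golden = merge_cluster(
    cluster,
    {"name": majority_vote, "email": first_non_null, "phone": first_non_null},
)
print(golden)
```

Choosing a strategy per field lets a noisy field such as name be settled by consensus while sparse fields such as email simply take whatever value is present.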
Privacy-preserving linkage
When you cannot share raw data across organizations, PPRL matches records using Bloom-filter encodings instead of plaintext, reaching F1 0.924 on the FEBRL4 benchmark.
Auto-config
GoldenMatch can choose all of the above for you and iterate until it converges.